The Syntax of XML

Author: Jaidev

Introduction to XML Syntax

XML, as we have seen, is a formal specification for markup languages. Every formal language specification has an associated syntax, and in this chapter we will study the syntax of XML. The purpose is to lay the foundation for the next chapter, where we will look at one of the methods of formally specifying the structure of XML-compliant languages: the Document Type Definition (DTD).

Syntax of XML

XML documents, as we have seen, comprise two basic components:

* Data: The actual content

* Markup: Meta-information about data that describes it

Although the "data" portion is the more important part of the document, it is not particularly interesting from a syntax point of view: it is basically an unstructured mass of character data, and almost anything textual can go in it. The "markup" portion is far more interesting because it describes the structure of the document. We will therefore examine the markup in more detail shortly.

But before we go there, let us start with the more basic XML syntax rules.

Syntax Rule 0: XML is "Text"

* XML documents are text only. Binary content cannot go in anywhere.

* XML is case-sensitive text at that: <Article> and <ARTICLE> are two different element names.

* In addition, the use of certain characters is restricted because they have a special meaning in markup. These characters are quite common in ordinary text, so you do not have to be desperate to need them! When they are needed as data, they should be substituted with XML's predefined entity references as follows:

Special Character    Substitute
<                    &lt;
>                    &gt;
&                    &amp;
'                    &apos;
"                    &quot;
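
The substitution of special characters can be tried out with a few lines of Python. This is only a minimal sketch using the standard library; the sample text and the <NOTE> element name are made up for illustration and do not come from this chapter.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical sample text containing characters that are special in XML.
raw_text = 'Tom & "Jerry" <cartoon>'

# escape() always replaces &, < and >; the extra mapping also covers quotes.
escaped = escape(raw_text, {'"': "&quot;", "'": "&apos;"})
print(escaped)

# A parser turns the entity references back into the original characters.
note = ET.fromstring("<NOTE>" + escaped + "</NOTE>")
print(note.text)
```

The escaped form is what is stored in the document, while the application reading the document sees the original characters again.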

Syntax Rule 1: The XML statement itself

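An XML document should begin with an XML declaration, which identifies the document as XML, states the XML version, and may specify some other optional attributes. When present, the declaration must be the very first thing in the document. It is written as follows: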

<?xml version="1.0"?>

The statement above declares the document to be an XML document, which means that it is expected to comply with XML syntax rules, that is, to be well formed (a concept which we have seen before). Sometimes the "character set encoding" is also specified. For instance the statement:

<?xml version="1.0" encoding="UTF-8"?>

specifies that the encoding is "UTF-8". This example also introduces the syntax for specifying "attributes": the term encoding="UTF-8" is an attribute-value pair within the <?xml ... ?> declaration. (Strictly speaking the declaration is not an element, but it uses the same name="value" notation.) This concept of elements and attributes is very important and is described in detail next.
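
To see what the encoding attribute does in practice, here is a small Python sketch using the standard library parser. The <TITLE> content and the choice of ISO-8859-1 are assumptions made for the demonstration; the second half shows that a parser rejects a declaration whose attribute value is not properly quoted.

```python
import xml.etree.ElementTree as ET

# Hypothetical document stored as ISO-8859-1 bytes, where the accented
# letter (written here as \u00e9) occupies a single byte.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><TITLE>Caf\u00e9</TITLE>'
).encode("iso-8859-1")

# The parser reads the declaration and decodes the bytes accordingly.
title = ET.fromstring(latin1_doc)
print(title.text)

# A declaration whose encoding value lacks its opening quote is not well formed.
broken_doc = b'<?xml version="1.0" encoding=UTF-8"?><TITLE>XML</TITLE>'
try:
    ET.fromstring(broken_doc)
    declaration_ok = True
except ET.ParseError as err:
    declaration_ok = False
    print("rejected:", err)
```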

Syntax Rule 2: Elements and Attributes

Markup in XML is made up of what are known as Elements. The start tag of an element consists of:

* Angle brackets < and >, as we have seen before, within which there is

* A mandatory Element Name

* Zero or more optional Attribute="Value" pairs, with each value enclosed in quotes

Drawing a parallel to English grammar, the "Element Name" is like a noun, while an "Attribute=Value Pair" is like an adjective that describes the "Element Name".

For instance in the following two elements:

<ARTICLE TYPE="FULL PAPER">

<ARTICLE TYPE="SHORT PAPER">

The two elements <ARTICLE ...> could represent "articles" in a journal.

Here,

* ARTICLE is the Element Name, while

* TYPE="FULL PAPER" is an Attribute=Value Pair, where the Attribute is "TYPE" and the Value is "FULL PAPER" (or "SHORT PAPER" in the second element). This attribute describes the type of article.

Every element has a start tag and an end tag. The end tag consists of the element name preceded by a forward slash, for example </ARTICLE>; it does not repeat the attributes. We have seen this in detail in an earlier chapter and will not describe it again.
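
As an illustration, the following Python sketch parses one of the <ARTICLE> elements and reads back its name and attribute. It uses only the standard library parser; the empty element body and the variable names are assumptions for the demonstration. The last part shows that, because XML is case sensitive, an end tag spelled in a different case does not close the element.

```python
import xml.etree.ElementTree as ET

# Parse a single element with one attribute-value pair.
article = ET.fromstring('<ARTICLE TYPE="FULL PAPER"></ARTICLE>')
print(article.tag)          # the Element Name
print(article.attrib)       # all Attribute=Value pairs, as a dictionary
print(article.get("TYPE"))  # the Value of one Attribute

# Element names are case sensitive: </article> does not match <ARTICLE>.
try:
    ET.fromstring("<ARTICLE></article>")
    case_mismatch_accepted = True
except ET.ParseError:
    case_mismatch_accepted = False
print(case_mismatch_accepted)
```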

Syntax Rule 3: Have a Root (beer!)

Every XML document has a tree structure, and every tree, as we know, has a root. Thus every XML document must have exactly one "root element" that encloses all the other elements. For instance, in the example below, the root element is <ARTICLES> which, as the name suggests, represents a list of articles.

[Figure: example XML document with <ARTICLES> as the root element]
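
The root rule can be checked with a short Python sketch using the standard library parser. The two tiny documents below are made up for the demonstration: the first has a single <ARTICLES> root, the second has two top-level elements and is rejected.

```python
import xml.etree.ElementTree as ET

# One root element enclosing everything else: well formed.
single_root = "<ARTICLES><ARTICLE/><ARTICLE/></ARTICLES>"
root = ET.fromstring(single_root)
print(root.tag, len(root))

# Two top-level elements and no common root: not well formed.
two_roots = "<ARTICLE/><ARTICLE/>"
try:
    ET.fromstring(two_roots)
    two_roots_accepted = True
except ET.ParseError as err:
    two_roots_accepted = False
    print("rejected:", err)
```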

Syntax Rule 4: Nested Elements

XML allows the nesting of elements. That implies that the "data" of an XML element can itself be XML. Consider the same XML document that we saw above.

There are two "article"s in this document - "XML Demystified" and "XSLT Demystified". The first is a Full Paper while the other is a Short Paper (attribute-value pairs). Notice that in this example, the "data" within the <ARTICLE> and </ARTICLE> tags itself is XML, starting with the <ARTICLEDATA> tag.

This kind of nesting of elements is allowed in XML syntax.
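
Nesting can be explored with the Python sketch below. The document in it is a hypothetical reconstruction based on the description above (two articles, each with an <ARTICLEDATA> element); the <TITLE> child and the exact layout are assumptions, not necessarily the original example.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of the articles document described in the text.
articles_xml = """<?xml version="1.0"?>
<ARTICLES>
  <ARTICLE TYPE="FULL PAPER">
    <ARTICLEDATA>
      <TITLE>XML Demystified</TITLE>
    </ARTICLEDATA>
  </ARTICLE>
  <ARTICLE TYPE="SHORT PAPER">
    <ARTICLEDATA>
      <TITLE>XSLT Demystified</TITLE>
    </ARTICLEDATA>
  </ARTICLE>
</ARTICLES>"""

root = ET.fromstring(articles_xml)

# Walk down the tree: ARTICLES -> ARTICLE -> ARTICLEDATA -> TITLE.
summary = []
for article in root.findall("ARTICLE"):
    title = article.find("ARTICLEDATA/TITLE").text
    summary.append((article.get("TYPE"), title))
print(summary)
```

Each level of nesting in the text becomes one level of the tree that the parser builds, which is why a path such as ARTICLEDATA/TITLE can be used to reach an inner element.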

Syntax Rule 5: Empty Elements

An element can be empty, i.e. contain no data. If so, the element can be represented in two alternative ways. Let us see examples of both types of syntax from a familiar domain: HTML (or rather its XML form, called XHTML):

<br></br> ... ... Syntax 1

<br/> ... ... Syntax 2

In the first syntax, notice that there is no data between the start and the end tags (since it is an empty element) but both tags are present.

In the second syntax notice that there is only one tag but it looks different - it has a forward slash at the end, just before the closing > symbol. This is an alternative representation of the empty element.
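
That the two forms mean the same thing to a parser can be confirmed with a small Python sketch using the standard library; the variable names are chosen only for this demonstration.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<br></br>")   # Syntax 1
short_form = ET.fromstring("<br/>")      # Syntax 2

# Both parse to an element named "br" with no text and no children.
for element in (long_form, short_form):
    print(element.tag, element.text, len(element))

# Writing either element back out gives identical bytes.
same_serialization = ET.tostring(long_form) == ET.tostring(short_form)
print(same_serialization)
```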

Syntax Rule 6: Comments

XML, like many other languages, allows comments. Comments are typically discarded by the parser, so they will not appear in the output of, say, a browser.

The syntax of a comment is shown by the example below:

<!-- This is a Comment -->

The markup starting with the symbols <!-- and ending with the symbols --> is the comment, and the text within forms the content of the comment. Note that comments cannot be nested, since the use of the character string "--" is completely prohibited within a comment.
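
Both points about comments can be observed with the Python sketch below. It relies on the default behaviour of the standard library parser, which drops comments; the sample documents are made up for the demonstration.

```python
import xml.etree.ElementTree as ET

# The default parser discards the comment, so only ARTICLE remains as a child.
with_comment = "<ARTICLES><!-- This is a Comment --><ARTICLE/></ARTICLES>"
root = ET.fromstring(with_comment)
child_tags = [child.tag for child in root]
print(child_tags)

# An attempt to nest comments puts "--" inside a comment, which is rejected.
nested_comment = "<ARTICLES><!-- outer <!-- inner --> outer --></ARTICLES>"
try:
    ET.fromstring(nested_comment)
    nested_accepted = True
except ET.ParseError as err:
    nested_accepted = False
    print("rejected:", err)
```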

Syntax Rule 7: Preventing Parser Interpretation

Sometimes one might want to prevent a parser from interpreting a portion of an XML document. For instance, if one wanted to include a piece of XML text, as is, inside another XML document, one can do it this way.

Text View

<?xml version="1.0" encoding="UTF-8"?>
<ARTICLES>
<ARTICLE>
<ARTICLEDATA>
<![CDATA[
<TITLE>XML Demystified</TITLE>
<AUTHOR>Jaidev</AUTHOR>
]]></ARTICLEDATA>
</ARTICLE>
</ARTICLES>
 

Browser View

[Figure: browser view of the document above; the text inside the CDATA section is not interpreted as markup]

Notice the section starting with "<![CDATA[" and ending with "]]>". Any text within these symbols is passed through as is, without interpretation by a parser. "CDATA" sections are quite useful when sending large pieces of text that contain special characters and need to remain uninterpreted. The one sequence a CDATA section cannot contain is "]]>", since that marks its end.
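
The effect of the CDATA section can be verified by parsing the document from the Text View with Python's standard library parser. This is a minimal sketch; only the variable names are new.

```python
import xml.etree.ElementTree as ET

# The document from the Text View above, as bytes.
cdata_doc = b"""<?xml version="1.0" encoding="UTF-8"?>
<ARTICLES>
<ARTICLE>
<ARTICLEDATA>
<![CDATA[
<TITLE>XML Demystified</TITLE>
<AUTHOR>Jaidev</AUTHOR>
]]></ARTICLEDATA>
</ARTICLE>
</ARTICLES>"""

article_data = ET.fromstring(cdata_doc).find("ARTICLE/ARTICLEDATA")

# No child elements were created: the tags inside CDATA are plain text.
print(len(article_data))
print(article_data.text)
```

To the parser, <ARTICLEDATA> has no <TITLE> or <AUTHOR> children here; it simply holds a string that happens to look like markup.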

 

From Syntax to Structure

We have now understood the basic syntax of XML documents. The authoritative description of XML syntax can be found at http://www.w3.org/TR/REC-xml/ (the W3C XML Recommendation). Armed with this knowledge, we can now move on to study how the structure of XML documents is specified. The structure of XML documents can be declared by two means: the older (but still more commonly used) Document Type Definition (DTD) and the more recent Schema definition. In the next chapter, we will consider the first of the two, the DTD, in detail.

Copyright© 2004-2006 Aleksey Nudelman