XML Tutorial: Markup, Well Formedness and Validity

Author: Jaidev

Markup – Description of Content

Consider the following two scenarios:

Scenario 1

An article needs to be published in conference proceedings or in a journal. The author of the article has all the content stored in his mind. He knows his subject well and has managed to put it very eloquently into text. Unfortunately, he has neither structured the document nor formatted it. If this article were submitted as such, it could well be published, but the publisher would have no idea what the title is, who the authors are or where the summary is. Further, if the article were published somehow, readers would find it tough to navigate a mass of text with no sections, no headings and no formatting. Thus, we see that we require pieces of information that annotate the content but are not part of the content itself. These pieces make clear what a given part of the content is trying to convey (e.g. title, abstract, introduction, references, body) or how it should be displayed (e.g. boldface, italics).

Scenario 2

Information about a patient needs to be transmitted from one hospital to another. Doctors in both hospitals take down exactly the same patient information. However, the databases of the two hospitals use different structures and notation. If the information were sent as-is from the source hospital, the destination hospital would have no idea how the source information was structured or what the pieces meant. This would make it very difficult to process the information correctly and, importantly, without ambiguity. Once again, we see the need for meta-level information that annotates the content to express its structure and meaning.

Annotations of content with meta-level information are called markup. Markup is achieved by the use of tags.

Tags

Providing information about the content in XML is done using what are known as tags. For example, we could have tags for the title, the abstract, the authors or the formatting of a document. So how are these tags represented in XML?

They are quite simply represented by a “less-than” symbol <, followed by the “tag ID”, followed by a “greater-than” symbol >.

If that sounded complex, let us look at a few examples of tags, which will clarify the issue:

<title> is an example of a tag for the “title” of an article
<abstract> is an example of a tag for the “abstract” of an article
<author> is an example of a tag for the “author” of an article
<b> is an example of a tag for “boldface” in an article

Markup using Tags




Now that we know what a tag is, let us see how tags are used to mark up content. Consider our journal article again as an example.

The figure on the left shows the document without markup. The text in red is supposed to be the title, the text in green is the author’s name, and the text in blue is the abstract, which should be italicized. Finally, the text in orange is the body text, some of which (marked in pink) must be boldface.

The XML version of this document is shown on the right. Notice the tags (shown in bold) for the title, author, abstract, italics, body and boldface.
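As a rough illustration of what such a marked-up article might look like, the sketch below parses a small sample with Python's standard xml.etree.ElementTree module and lists the tags it finds. The sample text, the author name and the enclosing <article> element are assumptions made for this illustration; they are not taken from the figure.

```python
import xml.etree.ElementTree as ET

# Hypothetical article text. The enclosing <article> element is an assumption,
# added because an XML document needs a single outermost element.
ARTICLE = """<article>
  <title>XML Demystified</title>
  <author>A. Writer</author>
  <abstract><i>A short summary of the article.</i></abstract>
  <body>Markup describes content. <b>Tags</b> carry that description.</body>
</article>"""

root = ET.fromstring(ARTICLE)

# Walk the tree in document order and collect the tag names.
tag_names = [element.tag for element in root.iter()]
print(tag_names)

# The content marked up as the title can now be picked out by its tag.
print(root.find("title").text)
```

Once the content is marked up, a program can ask for "the title" or "the abstract" directly, which is exactly what the unstructured text did not allow.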

If you look carefully, there are two types of tags. One type has only the tag name between the < and > signs, for example the tag <title>. The other type has the tag name preceded by a forward slash “/”, for example the tag </title>.

The former is called the start tag, while the latter is called the end tag or close tag. The content is placed between the start and the end tags.

The start tag and the end tag are said to mark up the content within.

For instance, in the line <title>XML Demystified</title>, the tag pair <title> and </title> is said to mark up the content “XML Demystified” within as a title.
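The relationship between start tag, content and end tag can be sketched in a few lines of Python. The helper name mark_up is hypothetical, chosen only for this illustration; the parsing step uses the standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def mark_up(tag_name, content):
    """Wrap content in a start tag and a matching end tag."""
    start_tag = f"<{tag_name}>"
    end_tag = f"</{tag_name}>"
    # escape() protects characters such as < and & inside the content,
    # so that they are not mistaken for markup.
    return start_tag + escape(content) + end_tag


line = mark_up("title", "XML Demystified")
print(line)  # <title>XML Demystified</title>

# Parsing the line recovers the tag name and the content it marks up.
element = ET.fromstring(line)
print(element.tag, "->", element.text)  # title -> XML Demystified
```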

Well Formedness/Validity

There are two key points to bear in mind when creating XML documents. They must be well formed, and they should also be valid.

Valid means that an XML document complies with the Document Type Definition (DTD) or Schema that it is supposed to instantiate. We will study validity in a later chapter of this tutorial.

Well formed means that the document complies with the general rules of XML. Since we just discussed tags and markup, let us consider the main rules for well formedness related to these. As we progress through this tutorial, we will encounter many more rules for well formedness.

Finally, for a thorough discussion of all the constraints for well formedness and validity, you are advised to refer to http://www.w3.org/TR/1998/REC-xml-19980210#sec-well-formed

Well Formedness Constraints on Tags/Markup

A start tag must be balanced with an end tag

Correct: <title>XML Demystified</title>

Incorrect: <title>XML Demystified

Correct: <br/>

The first row is correct, since the tags <title> and </title> are balanced. The second row is incorrect because the start tag <title> has no matching end tag. The third row is a special case of a balanced pair of tags <br></br> with no content within. In this case the element can be represented as <br/>. Note that for empty elements the forward slash comes after the tag name. Those familiar with XHTML will recognize this as the line break element.
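The three rows can be tried out with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed is an assumption made for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


rows = [
    "<title>XML Demystified</title>",  # balanced start and end tags
    "<title>XML Demystified",          # end tag missing
    "<br/>",                           # empty element
]
results = [is_well_formed(row) for row in rows]
print(results)  # [True, False, True]

# <br/> and <br></br> parse to the same empty element.
short_form = ET.fromstring("<br/>")
long_form = ET.fromstring("<br></br>")
same_element = (short_form.tag == long_form.tag
                and short_form.text is None
                and long_form.text is None)
print(same_element)  # True
```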

Tags must be placed in a strictly nested order

Correct: <abstract><i>The abstract</i></abstract>

Incorrect: <abstract><i>The abstract</abstract></i>

In both rows above, the start and end tags are balanced. However, in the top row the tags are nested properly, i.e. the <i> and </i> tags are completely nested within the <abstract> and </abstract> tags. This is correct. In the lower row, the inner <i> tag has not been ended when the </abstract> end tag is encountered. This implies that the tags are not nested correctly.

Testing for Well Formedness / Validity

All XML parsers (programs which can read an XML file and construct a programmatic structure out of it) will perform a well formedness check on XML documents. Some parsers, called validating parsers, will also perform a validity check on the document by comparing it against the DTD or schema.

An excellent online tool for checking well formedness can be found at RUWF: http://www.xml.com/pub/a/tools/ruwf/check.html

I would suggest you key in the above correct and incorrect markup and test it against this checker as an exercise.

You can also check the validity of documents at the following site (though you might want to wait till a later chapter to really understand DTDs and Schemas before you try this out): http://www.stg.brown.edu/service/xmlvalid/
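A well formedness check can also be run locally. The sketch below assumes Python's standard xml.etree.ElementTree module, whose parser is non-validating: it reports well formedness errors but does not compare the document against a DTD or schema. The function name check_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def check_well_formed(markup):
    """Return (True, None) if the markup parses, else (False, parser message)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as error:
        # The message names the problem and the line and column where
        # the parser gave up.
        return False, str(error)
    return True, None


samples = [
    "<abstract><i>The abstract</i></abstract>",  # properly nested
    "<abstract><i>The abstract</abstract></i>",  # <i> still open at </abstract>
]
for sample in samples:
    ok, message = check_well_formed(sample)
    print(sample, "->", "well formed" if ok else f"not well formed ({message})")
```

Returning the parser's own message, rather than only True or False, shows where the nesting went wrong, which is useful when checking longer documents.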

You’re marked(up) now !!

Now you know what tags and markup are. You also know that an XML file must be well formed and, optionally, should be valid as well. You have seen some examples of well formed markup pertaining to tags. And to top it all off, you have some real online sites where you can try the whole thing out.

Now let’s move on to the very “legally mine or yours?” part of XML: the concept of namespaces. Namespaces, as we will see, are ways of defining the origin of any markup clearly and without ambiguity. Onward bound!

Copyright© 2004-2006 Aleksey Nudelman