XML: An Introduction

HTML

Link:

http://www.w3schools.com/html/html_intro.asp

* HTML stands for Hyper Text Markup Language

* An HTML file is a text file containing small markup tags

* The markup tags tell the Web browser how to display the page

* An HTML file must have an htm or html file extension

* An HTML file can be created using a simple text editor

from http://www.w3schools.com/html/html_intro.asp

Here's an example :

<html>
        <head>
                <title>Web Page</title>
        </head>
        <body>
                <strong>Hello</strong>, this is a web page!
                <hr/>
                <font color=red>This is some red text</font>
        </body>
</html>

XML

Links:

http://www.w3schools.com/xml/xml_whatis.asp

* XML stands for EXtensible Markup Language

* XML is a markup language much like HTML

* XML was designed to describe data

* XML tags are not predefined. You must define your own tags

* XML uses a Document Type Definition (DTD) or an XML Schema to describe the data

* XML with a DTD or XML Schema is designed to be self-descriptive

* XML is a W3C Recommendation

from http://www.w3schools.com/xml/xml_whatis.asp

If you use an RSS reader to access blogs or other newsfeeds, you use XML. Really Simple Syndication produces XML-based feeds summarizing frequently updated content.

Here's a real example from nytimes.com:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
        <channel>
                <title>NYT > Sunday Book Review</title>
                <link>http://www.nytimes.com/pages/books/review/index.html?partner=rssnyt</link>
                <description></description>
                <language>en-us</language>
                <copyright>Copyright 2007 The New York Times Company</copyright>
                <lastBuildDate>Fri, 24 Aug 2007 20:05:02 GMT</lastBuildDate>
                <image>
                        <title>NYT > Sunday Book Review</title>
                        <url>http://graphics.nytimes.com/images/section/NytSectionHeader.gif</url>
                        <link>http://www.nytimes.com/pages/books/review/index.html</link>
                </image>
                <item>
                        <title>On the Road Again</title>
                        <link>http://www.nytimes.com/2007/08/19/books/review/Sante2-t-1.html?ex=1345176000&amp;en=b8402a9d3d6e4457&amp;ei=5088&amp;partner=rssnyt&amp;emc=rss</link>
                        <description>The novel that &#8220;On the Road&#8221; became was inarguably the book that young people needed in 1957, but the sparse and unassuming scroll is the living version for our time.</description>
                        <author>LUC SANTE</author>
                        <guid isPermaLink="false">http://www.nytimes.com/2007/08/19/books/review/Sante2-t-1.html</guid>
                        <pubDate>Sun, 19 Aug 2007 02:56:43 GMT</pubDate>
                </item>
        </channel>
</rss>

Elements

Elements are surrounded by tags. Tags come in pairs. The open tag identifies the beginning of the element and the closing tag, denoted by the / before the tag name, identifies the end of the element. Tag names are generally fairly intuitive descriptions of the data that will be contained in the element. For example, as you might expect, the text between the author tags is the name of an author. In some cases, you may find empty elements that look as follows: <description/>.

Elements may contain text, other elements, and attributes (discussed below). In the example above, the rss element contains one element, channel. The channel element contains elements title, link, description, language, copyright, lastBuildData, image, and item.

Elements form a tree structure or hierarchy. Following is some relevant tree terminology:

root - The root of a tree is outermost element, in this case rss.
child - The children of an element are the elements it contains. The element channel is a child of rss. The element copyright is a child of channel. The element author is a child of item.
sibling - The siblings of an element are the elements that share its parent. The element image is a sibling of item.

Attributes

Attributes are name, value pairs that provide some information about the characteristics of an element. In the example above, the element rss has an attribute version. The version attribute has a value of 2.0. An element may have multiple attributes.

Prolog

The prolog identifies a document as an XML document and may contain other relevant information. An example document with prolog follows:

<?xml version="1.0" encoding="UTF-8" ?>

<!DOCTYPE greeting SYSTEM "hello.dtd">

<greeting>Hello, world!</greeting>

The first line identifies the document as an XML document and specifies the encoding. The second line is the document type declaration. It indicates that the root element is a greeting. The document type definition for a greeting document is found in "hello.dtd".

Well-Formedness

An XML document must be well formed. A well-formed document has the following properties:

Each start tag has a corresponding end tag.
Tags are nested correctly.
A document has a single root element.

Parsing

There are two models for parsing XML: DOM and SAX.

DOM - Document Object Model

A DOM parser reads an XML document, for example from a file, and builds a tree in memory. The programmer can then access and manipulate the information stored in the document by traversing the tree structure. Essentially, the job of the parser is to identify where elements start and end, and build objects to represent each element.

SAX - Simple API for XML

A SAX parser reads an XML document and generates events when elements are found. The user defines the actions be taken as different types of elements are found.

XML and Java

XMLTester.java - a very simple example

Java provides both DOM and SAX parsers in the javax.xml.parsers package. The DOM parser produces a Document object, where Document is in the org.w3c.dom package. The Document represents the entire XML tree, which is comprised of Node objects. The Node class provides an API to traverse the tree. Node has several subclasses, the most notable of which are Text and Element. All components in the tree are Nodes, but some are Elements and some are Text, and there are a few other subclasses as well. Below are a few of the most relevant APIs. For a full listing, see the Java API.

javax.xml.parsers

DocumentBuilderFactory - Defines a factory API that enables applications to obtain a parser that produces DOM object trees from XML documents.

DocumentBuilderFactory newInstance() - Obtain a new instance of a DocumentBuilderFactory.
DocumentBuilder newDocumentBuilder() - Creates a new instance of a DocumentBuilder using the currently configured parameters.

DocumentBuilder - Defines the API to obtain DOM Document instances from an XML document. Using this class, an application programmer can obtain a Document from XML.

Document parse(File f) - Parse the content of the given file as an XMLdocument and return a new DOM Document object.
abstract Document parse(InputSource is) - Parse the content of the given input source as an XMLdocument and return a new DOM Document object.
Document parse(InputStream is) - Parse the content of the given InputStream as an XML document and return anew DOM Document object.
Document parse(InputStream is, String systemId) - Parse the content of the given InputStream as an XML document and return anew DOM Document object.
Document parse(String uri) - Parse the content of thegiven URI as an XML document and return a new DOM Document object.

org.w3c.dom

Node

NodeList getChildNodes() - A NodeList that contains all children of this node.
Node getFirstChild() - The first child of this node.
Node getLastChild() - The last child of this node.
Node getNextSibling() - The node immediately following this node.
String getNodeName() - The name of this node, depending on its type; see the table above.
String getNodeValue() - The value of this node, depending on its type; see the table above.

Document

NodeList getElementsByTagName(String tagname) - Returns a NodeList of all the Elements in document order with a given tag name and are contained in the document.

Element

String getAttribute(String name) - Retrieves an attribute value by name.
String getTagName() - The name of the element.

NodeList

int getLength() - The number of nodes in the list.
Node item(int index) - Returns the indexth item in the collection.

XPath

The standard API provides minimal support for finding nodes. You can, for example, retrieve all elements with a particular tag name. However, if you wanted to retrieve all elements with a particular value you would have to traverse the entire structure to find the element you wanted. XPath is a mechanism you can use to specify a pattern to represent a node or set of nodes. You apply the XPath expression to a DOM tree and the result is a node or node list.

An XPath expression can select a node or set of nodes. The expression uses a location path, which is a series of location steps. A step specifies the following:

Axis - the direction to travel. The default is child, which is what we'll use, primarily.
Node Test - name of the node you want to select, or * to represent the wildcard.
Predicate (0 or more) - an additional boolean test to help filter nodes.

Example XPath expressions:

/rss/channel/link
/rss/channel/item/title[starts-with(., "Coverville")]
/rss/channel/item/guid[@isPermaLink='false']
/rss/channel/item[starts-with(title, "Coverville 400")]/pubDate

The javax.xml.xpath package provides support for evaluating XPath expressions in Java.

XPathTester.java - a simple example using XPath

XML Schema/DTDs

Suppose you want to write an RSS client application that downloads RSS feeds and displays them for the user. It would certainly be helpful to know what the RSS feeds will look like, for example that each item element will have a subelement title. Document Type Definitions (DTD) and XML Schema are two mechanisms for specifying the structure of a particular class of XML documents.

DTDs

DTDs are an older mechanism for specifying schema. DTDs allow you to specify the set of allowable elements, how they fit together, and the legal values that can be assigned to them. Here is a sample RSS DTD.

XML Schema

XML Schema is a newer and more powerful way to specify the schema for a particular class of XML document. XML Schema uses XML, so a standard XML parser can parse an XML Schema. In addition, XML Schema provides more powerful mechanisms to enable you to specify the order of elements, types of data, etc. Here is a sample RSS XSD.

Schema Datatypes

XML Schema allows the programmer to specify the type for a particular element. Example types include the following:

xs:string - any text
xs:token - token separated by whitespace
xs:integer
xs:decimal
xs:ID - a unique ID
xs:boolean - true/false
xs:dateTime - 2002-10-10T12:00:00-05:00 (yyyy-mm-ddThh:mm:ss-timezone)

Complex Types

Complex types specify compositions of simple types. For example, the schema for the following document must specify that a date element is composed of a month element, a day element, and a year element:

<date>
  <month>1</month>
  <day>27</day>
  <year>1995</year>
</date>

The schema for this document would look as follows:

<xs:element name="date">
<xs:complexType>
  <xs:all>
    <xs:element ref="year"/>
    <xs:element ref="month"/>
    <xs:element ref="day"/>
  </xs:all>
</xs:complexType>
</xs:element>
<xs:element name="year" type="xs:integer"/>
<xs:element name="month" type="xs:integer"/>
<xs:element name="day" type="xs:integer"/>

Value Restrictions

You can also use the simpleType to restrict the values for a particular element. For example, you can restrict the month element to a number between 1 and 12 as follows:

<xs:simpleType name="monthNum">
  <xs:restriction base="xs:integer">
    <xs:minInclusive value="1"/>
    <xs:maxInclusive value="12"/>
  </xs:restriction>
</xs:simpleType>
<xs:element name="month" type="monthNum"/>

Or, you can specify the pattern that an element must follow using a regular expression:

<xs:element name="price" type="priceval" 
<xs:simpleType name="priceval"> 
  <xs:restriction base="xs:token"> 
    <xs:pattern value="[0-9]+[0-9]2"/> 
  </xs:restriction> 
</xs:simpleType>

Or, you can use an enumeration to define the types that an element can take:

<xs:simpleType name="genderType"> 
  <xs:restriction base="xs:token"> 
    <xs:enumeration value="female"/> 
    <xs:enumeration value="male"/> 
  </xs:restriction> 
</xs:simpleType>

Groupings

There are several ways to group elements:

xs:choice - one of a set of child elements can occur
xs:all - each of a set of child elements will occur once
xs:sequence - the child elements must occur in the given order

Referencing a Schema

<purchaseReport
  xmlns="http://www.example.com/Report"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.example.com/Report
  http://www.example.com/Report.xsd"
  period="P3M" periodEnding="1999-12-31">

More Examples

Namespaces

If you take a look at the NPR Story of the Day, you'll notice that the XML looks a bit different than the previous RSS feed.

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet title="XSL_formatting" type="text/xsl" href="/include/xsl/podcast.xsl"?>
<rss version="2.0" xmlns:npr="http://www.npr.org/rss/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:content="http://purl.org/rss/1.0/modules/content/"> 
        <channel>  
                <title>NPR: Story of the Day</title>
                <link>http://www.npr.org/?ft=2&amp;f=1090</link>  
                <description>Funny, moving, exceptional, or just offbeat -- the NPR story people will be talking about tomorrow. The best of Morning Edition, All Things Considered and other award-winning NPR programs.</description>  
                <copyright>Copyright 2007 NPR - For Personal Use Only</copyright>  
                <generator>NPR/RSS Generator 2.0</generator>  
                <lastBuildDate>Thu, 30 Aug 2007 01:06:17 EDT</lastBuildDate>  
                <language>en-us</language>  
                <itunes:summary>Funny, moving, exceptional, or just offbeat -- the NPR story people will be talking about tomorrow. The best of Morning Edition, All Things Considered and other award-winning NPR programs.</itunes:summary>  
                <itunes:subtitle>Editors&apos; Pick. The best of Morning Edition, All Things Considered and other award-winning NPR programs.</itunes:subtitle>  
                <itunes:author>National Public Radio</itunes:author>  
                <itunes:keywords>story,of,the,day,NPR,National Public Radio,Story of the Day,Morning Edition,All Things Considered,Fresh Air</itunes:keywords>  
                <image>   
                        <url>http://media.npr.org/images/podcasts/thumbnail/npr_sotd_image_75.jpg</url>   
                        <title>Story of the Day</title>   
                        <link>http://www.npr.org/?ft=2&amp;f=1090</link> 
                </image>  
                <itunes:category text="Arts"/>  
                <itunes:category text="Society &amp; Culture"/>  
                <itunes:owner>   
                <itunes:email/>   
                <itunes:name/>  
                </itunes:owner>  
                <itunes:image href="http://media.npr.org/images/podcasts/primary/npr_sotd_image_300.jpg"/>  
                <item>   
                        <title>New Orleans Suffers Crisis in Mental Health Care</title>   
                        <description>Two years after Hurricane Katrina, many New Orleans residents need mental health care, but there are few resources and almost no psychiatric beds. With nowhere to turn, people in the city have been forced to take drastic steps.</description>   
                        <pubDate>Thu, 30 Aug 2007 01:06:08 EDT</pubDate>   
                        <link>http://www.npr.org/templates/story/story.php?storyId=14031894&amp;ft=2&amp;f=1090</link>   
                        <guid>http://podcastdownload.npr.org/anon.npr-podcasts/podcast/1090/14042689/npr_14042689.mp3</guid>   
                        <itunes:summary>Two years after Hurricane Katrina, many New Orleans residents need mental health care, but there are few resources and almost no psychiatric beds. With nowhere to turn, people in the city have been forced to take drastic steps.</itunes:summary>   
                        <itunes:duration>0:13:27</itunes:duration>   
                        <itunes:keywords>NPR,National Public Radio,New Orleans Suffers Crisis in Mental Health Care,</itunes:keywords>   
                        <enclosure url="http://podcastdownload.npr.org/anon.npr-podcasts/podcast/1090/14042689/npr_14042689.mp3" length="6456767" type="audio/mpeg"/>  
                </item>
        </channel>
</rss>

Among other things, you see a set of tags that have the prefix itunes. As you might imagine, the elements with tags beginning with itunes provide information that can be used by the iTunes program when it processes the feed. A standard RSS reader can process this same feed, but may ignore any elements with tags in the itunes namespace.

Essentially, namespaces enable you to create a document that conforms to multiple schemas. Each schema has its own namespace and element names are preceeded with the namespace. This avoids conflicts between two namespaces. The web page: http://www.feedforall.com/directory-namespace.htm lists some other common namespaces.

Sami Rollins

Wednesday, 07-Jan-2009 15:13:20 PST