XML: A Very Brief Introduction

If you use an RSS reader to access blogs or other newsfeeds, you use XML. Really Simple Syndication produces XML-based feeds summarizing frequently updated content.

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
	<channel>
		<title>NYT > Sunday Book Review</title>
		<link>http://www.nytimes.com/pages/books/review/index.html?partner=rssnyt</link>
		<description></description>
		<language>en-us</language>
		<copyright>Copyright 2007 The New York Times Company</copyright>
		<lastBuildDate>Fri, 24 Aug 2007 20:05:02 GMT</lastBuildDate>
		<image>
			<title>NYT > Sunday Book Review</title>
			<url>http://graphics.nytimes.com/images/section/NytSectionHeader.gif</url>
			<link>http://www.nytimes.com/pages/books/review/index.html</link>
		</image>
		<item>
			<title>On the Road Again</title>
			<link>http://www.nytimes.com/2007/08/19/books/review/Sante2-t-1.html?ex=1345176000&amp;en=b8402a9d3d6e4457&amp;ei=5088&amp;partner=rssnyt&amp;emc=rss</link>
			<description>The novel that &#8220;On the Road&#8221; became was inarguably the book that young people needed in 1957, but the sparse and unassuming scroll is the living version for our time.</description>
			<author>LUC SANTE</author>
			<guid isPermaLink="false">http://www.nytimes.com/2007/08/19/books/review/Sante2-t-1.html</guid>
			<pubDate>Sun, 19 Aug 2007 02:56:43 GMT</pubDate>
		</item>
	</channel>
</rss>

Elements

Elements are surrounded by tags. Tags come in pairs. The open tag identifies the beginning of the element and the closing tag, denoted by the / before the tag name, identifies the end of the element. Tag names are genreally fairly intuitive descriptions of the data that will be contained in the element. For example, as you might expect, the text between the author tags is the name of an author. In some cases, you may find empty elements that look as follows: <description/>.

Elements may contain text, other elements, and attributes (discussed below). In the example above, the rss element contains one element, channel. The channel element contains elements title, link, description, language, copyright, lastBuildData, image, and item.

Elements form a tree structure or hierarchy. We'll talk about trees toward the end of the semester, but following is some relevant tree terminology:

Attributes

Attributes are name, value pairs that provide some information about the characteristics of an element. In the example above, the element rss has an attribute version. The version attribute has a value of 2.0. An element may have multiple attributes.

Parsing

A DOM parser reads an XML document, for example from a file, and builds a tree in memory. The programmer can then access and manipulate the information stored in the document by traversing the tree structure. Essentially, the job of the parser is to identify where elements start and end, and build objects to represent each element.

A SAX parser reads an XML document and generates events when elements are found. The user defines the actions be taken as different types of elements are found.

Namespaces

If you take a look at the NPR Story of the Day, you'll notice that the XML looks a bit different.

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet title="XSL_formatting" type="text/xsl" href="/include/xsl/podcast.xsl"?>
<rss version="2.0" xmlns:npr="http://www.npr.org/rss/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:content="http://purl.org/rss/1.0/modules/content/"> 
	<channel>  
		<title>NPR: Story of the Day</title>
		<link>http://www.npr.org/?ft=2&amp;f=1090</link>  
		<description>Funny, moving, exceptional, or just offbeat -- the NPR story people will be talking about tomorrow. The best of Morning Edition, All Things Considered and other award-winning NPR programs.</description>  
		<copyright>Copyright 2007 NPR - For Personal Use Only</copyright>  
		<generator>NPR/RSS Generator 2.0</generator>  
		<lastBuildDate>Thu, 30 Aug 2007 01:06:17 EDT</lastBuildDate>  
		<language>en-us</language>  
		<itunes:summary>Funny, moving, exceptional, or just offbeat -- the NPR story people will be talking about tomorrow. The best of Morning Edition, All Things Considered and other award-winning NPR programs.</itunes:summary>  
		<itunes:subtitle>Editors&apos; Pick. The best of Morning Edition, All Things Considered and other award-winning NPR programs.</itunes:subtitle>  
		<itunes:author>National Public Radio</itunes:author>  
		<itunes:keywords>story,of,the,day,NPR,National Public Radio,Story of the Day,Morning Edition,All Things Considered,Fresh Air</itunes:keywords>  
		<image>   
			<url>http://media.npr.org/images/podcasts/thumbnail/npr_sotd_image_75.jpg</url>   
			<title>Story of the Day</title>   
			<link>http://www.npr.org/?ft=2&amp;f=1090</link> 
		</image>  
		<itunes:category text="Arts"/>  
		<itunes:category text="Society &amp; Culture"/>  
		<itunes:owner>   
		<itunes:email/>   
		<itunes:name/>  
		</itunes:owner>  
		<itunes:image href="http://media.npr.org/images/podcasts/primary/npr_sotd_image_300.jpg"/>  
		<item>   
			<title>New Orleans Suffers Crisis in Mental Health Care</title>   
			<description>Two years after Hurricane Katrina, many New Orleans residents need mental health care, but there are few resources and almost no psychiatric beds. With nowhere to turn, people in the city have been forced to take drastic steps.</description>   
			<pubDate>Thu, 30 Aug 2007 01:06:08 EDT</pubDate>   
			<link>http://www.npr.org/templates/story/story.php?storyId=14031894&amp;ft=2&amp;f=1090</link>   
			<guid>http://podcastdownload.npr.org/anon.npr-podcasts/podcast/1090/14042689/npr_14042689.mp3</guid>   
			<itunes:summary>Two years after Hurricane Katrina, many New Orleans residents need mental health care, but there are few resources and almost no psychiatric beds. With nowhere to turn, people in the city have been forced to take drastic steps.</itunes:summary>   
			<itunes:duration>0:13:27</itunes:duration>   
			<itunes:keywords>NPR,National Public Radio,New Orleans Suffers Crisis in Mental Health Care,</itunes:keywords>   
			<enclosure url="http://podcastdownload.npr.org/anon.npr-podcasts/podcast/1090/14042689/npr_14042689.mp3" length="6456767" type="audio/mpeg"/>  
		</item>
	</channel>
</rss>

Among other things, you see a set of tags that have the prefix itunes. As you might imagine, the elements with tags beginning with itunes provide information that can be used by the iTunes program when it processes the feed. A standard RSS reader can process this same feed, but may ignore any elements with tags in the itunes namespace.

The web page: http://www.feedforall.com/directory-namespace.htm lists some other common namespaces. Notice that the same tag suffix may appear in multiple namespaces. For example, two name spaces may support a summary tag. However, using the namespace prefix enables the developer to distinguish between say itunes:summary and summary in another namespace.

XML and Java

Java provides both DOM and SAX parsers in the javax.xml.parsers package. The DOM parser produces a Document object, where Document is in the org.w3c.dom package. The Document represents the entire XML tree, which is comprised of Node objects. The Node class provides an API to traverse the tree. Node has several subclasses, the most notable of which are Text and Element. All components in the tree are Nodes, but some are Elements and some are Text, and there are a few other subclasses as well. Below are a few of the most relevant APIs. For a full listing, see the Java API.

javax.xml.parsers

DocumentBuilderFactory - Defines a factory API that enables applications to obtain a parser that produces DOM object trees from XML documents.

DocumentBuilder - Defines the API to obtain DOM Document instances from an XML document. Using this class, an application programmer can obtain a Document from XML.