Lab 2: XML Structure and Parsing



Due Tuesday 2/13 at start of class.
30 points


This lab serves two purposes: first, it provides an introduction to parsing and creating XML. We will be working with XML extensively in future labs. Second, it gets you started on the domain we will be working with in future labs: digital photos.

What to turn in: Place a copy of all your code, along with the generated XML, in your submit directory. Also, you must provide a README that clearly explains how to run your code. Labs without a README will not be graded.

On Languages: You may use any language you wish for this lab, with the following requirement: I must be able to run your code from your submit directory. I will present examples in Python and encourage you to use it. It's an excellent language, and well-suited for the tasks we're doing; Java, Ruby, C# and Perl are also good choices.

I also strongly encourage you to use good software engineering principles in this lab. You may find it helpful to reuse or extend this code in later labs.
  1. (10 points) Write a program that takes a directory as input, recursively traverses it, finds all images, and generates an XML database.

    The program should have a command-line interface that takes at least one argument: a root directory. It should then find all files in all directories below it that are images. If you are using Python, you will find os.walk() very helpful for this.

    For purposes of this assignment, you can assume that all images have one of the following extensions: jpg, JPG, jpeg, JPEG, gif, GIF, tif, TIF, tiff, TIFF.
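As a rough sketch of the traversal described above, assuming Python 3: the names `IMAGE_EXTENSIONS` and `find_images` are made up for illustration and are not part of the assignment. Comparing the lowercased extension is a simplification; it also accepts mixed-case spellings such as `.Jpg`, which the list above does not mention.

```python
import os

# Extensions listed in the assignment, lowercased; comparing against the
# lowercased suffix covers the upper-case variants as well.
IMAGE_EXTENSIONS = {'.jpg', '.jpeg', '.gif', '.tif', '.tiff'}

def find_images(root):
    """Yield (directory, filename) for every image file below root."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            extension = os.path.splitext(name)[1].lower()
            if extension in IMAGE_EXTENSIONS:
                yield dirpath, name
```

Writing this as a generator keeps the traversal separate from the XML construction, so the same function could be reused in later labs.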

    The program should construct a single XML file, called pictures.xml, that contains information about all of the pictures below the root directory. The file should be in the following format:
    <?xml version="1.0" ?>
    <pictures>
      <picture>
      <filename> DSC01011.jpg </filename>
      <directory> /Users/brooks/pictures </directory>
      <size> 102201 </size>
      <tags>
        <tag> peru </tag>
        <tag> llama </tag>
        <tag> sun </tag>
      </tags>
      <date_modified>
      2007-02-04 15:52:13
      </date_modified>
      <exif>
        <FileDateTime isodate="20060515T011855Z">Mon, 15 May 2006 01:18:55 +0000</FileDateTime>
        <CameraMake>SONY</CameraMake>
        <CameraModel>CYBERSHOT</CameraModel>
        <DateTime isodate="20040111T195735">Sun, 11 Jan 2004 19:57:35</DateTime>
        <Resolution>
          <Width>1552</Width>
          <Height>647</Height>
        </Resolution>
        <IsColor>true</IsColor>
        <FlashUsed>false</FlashUsed>
        <FocalLength units="mm">8.0</FocalLength>
        <ExposureTime units="s" equiv="1/800">0.001</ExposureTime>
        <ApertureFNumber>f/5.6</ApertureFNumber>
        <ISOequivalent>100</ISOequivalent>
        <ExposureBias>0.70</ExposureBias>
        <Whitebalance>cloudy</Whitebalance>
        <MeteringMode>spot</MeteringMode>
        <ExposureProgram>program (auto)</ExposureProgram>
        <JpegProcess>Progressive</JpegProcess>
        <JpegThumbnail>
          '\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x01\x00H\x00H\x00\ ...'
        </JpegThumbnail>
      </exif>
      </picture>
    
      <picture> ...
    
      </picture>
    </pictures>
    
    
    This format stores the following information: the file's name, directory, size on disk, and the date it was last modified. It also stores EXIF data. EXIF is a metadata standard supported by most digital cameras; the information is embedded in the header of the picture file. If you are using Python, I find EXIF.py to be useful for this task. If you're using Java, this looks promising, although I haven't tried it. If you're using Perl or Ruby, search around a little; there are plenty of libraries on the web for extracting EXIF data.
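The non-EXIF fields can be read with the standard library alone. The following is a minimal sketch, assuming Python 3; `file_info` is a hypothetical helper name, and the date layout is chosen only to match the sample `<date_modified>` value above. EXIF extraction is left out because it depends on whichever library you choose.

```python
import os
import time

def file_info(directory, filename):
    """Return (size_in_bytes, date_modified) for one file."""
    path = os.path.join(directory, filename)
    size = os.path.getsize(path)
    # getmtime returns seconds since the epoch; format it as local time,
    # in the same layout as the sample <date_modified> value.
    modified = time.strftime('%Y-%m-%d %H:%M:%S',
                             time.localtime(os.path.getmtime(path)))
    return size, modified
```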

    For this task, you'll probably find it easiest to construct XML through concatenation of strings, as opposed to directly manipulating the DOM tree. This will be much easier and more efficient if you use careful abstraction. For example, I found the following function very useful in building this:
    def constructElement(tag, value) :
        return '<' + tag + '>' + value + '</' + tag + '>'
    
    (Note that I'm not starting with buf = '' and repeatedly concatenating elements to it. That approach is much more wasteful, since every concatenation builds a new copy of the string.)

    This allows me to isolate the messy and easy-to-get-wrong string construction in one place, and build a slightly larger function called XMLForFile:
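     def XMLForFile(name, path, size) :
       # Parenthesized so the expression can span several lines; continue
       # the chain in the same way for tags, date_modified, and exif.
       return (constructElement('filename', name) +
               constructElement('directory', path) +
               constructElement('size', str(size)))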
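One pitfall of building XML from strings is that a value containing `&` or `<` produces a file that no parser will accept. The sketch below is one possible variation on the helper above, not the required design: `construct_element` and `xml_for_pictures` are hypothetical names, and the record layout (name, path, size) is an assumption made for the example. It escapes text with `xml.sax.saxutils.escape` and collects the pieces in a list that is joined once at the end.

```python
from xml.sax.saxutils import escape

def construct_element(tag, value):
    """Wrap value in <tag>...</tag>, escaping &, < and > in the text."""
    return '<' + tag + '>' + escape(str(value)) + '</' + tag + '>'

def xml_for_pictures(records):
    """Build the whole document from (name, path, size) tuples."""
    parts = ['<?xml version="1.0" ?>', '<pictures>']
    for name, path, size in records:
        parts.append('<picture>' +
                     construct_element('filename', name) +
                     construct_element('directory', path) +
                     construct_element('size', size) +
                     '</picture>')
    parts.append('</pictures>')
    # One join at the end instead of growing a string inside the loop.
    return '\n'.join(parts)
```

For example, `construct_element('filename', 'a&b.jpg')` returns `<filename>a&amp;b.jpg</filename>`.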
    
  2. (5 points) Write a CSS stylesheet to display your photo database. It should show one picture per row, and include name, directory, file size, date modified, and tags.
  3. (10 points) Write a program that uses a DOM parser to load the XML into memory and allow the user to search it. The user should be able to specify an XML tag (such as 'filename' or 'date_modified') and get all photos that match that criterion. The user should also be able to type '?' and get a list of all possible tags.
  4. (5 points) Write a program that uses a SAX parser to process your XML database. It should take two command-line arguments: an XML file to read in and an XML tag to look for. It should print out the values for all tags that match this argument.
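
To illustrate the difference between the two parsing styles in parts 3 and 4, here is a small sketch using Python's standard `xml.dom.minidom` and `xml.sax` modules. It is not a complete solution: `SAMPLE`, `dom_values`, `dom_tag_names`, `TagCollector`, and `sax_values` are all names invented for this example, the input is a tiny inline document rather than pictures.xml, and nested elements with the same name are not handled.

```python
import xml.dom.minidom
import xml.sax

SAMPLE = ('<pictures>'
          '<picture><filename>a.jpg</filename><size>10</size></picture>'
          '<picture><filename>b.gif</filename><size>20</size></picture>'
          '</pictures>')

def dom_values(xml_text, tag):
    """DOM: parse the whole document into a tree, then query it."""
    document = xml.dom.minidom.parseString(xml_text)
    values = []
    for element in document.getElementsByTagName(tag):
        # Collect the text nodes directly inside the element.
        text = ''.join(node.data for node in element.childNodes
                       if node.nodeType == node.TEXT_NODE)
        values.append(text.strip())
    return values

def dom_tag_names(xml_text):
    """DOM: list every distinct element name in the document."""
    document = xml.dom.minidom.parseString(xml_text)
    return sorted({e.tagName for e in document.getElementsByTagName('*')})

class TagCollector(xml.sax.ContentHandler):
    """SAX: receive events as the parser reads; keep only what we need."""
    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.values = []
        self.buffer = None  # a list while inside a matching element

    def startElement(self, name, attrs):
        if name == self.tag:
            self.buffer = []

    def characters(self, content):
        # May be called several times for one run of text.
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if name == self.tag and self.buffer is not None:
            self.values.append(''.join(self.buffer).strip())
            self.buffer = None

def sax_values(xml_text, tag):
    handler = TagCollector(tag)
    xml.sax.parseString(xml_text.encode('utf-8'), handler)
    return handler.values
```

Both `dom_values(SAMPLE, 'filename')` and `sax_values(SAMPLE, 'filename')` return `['a.jpg', 'b.gif']`. The DOM version holds the entire tree in memory, which makes repeated interactive searches convenient; the SAX version keeps only the matching text, which suits a single pass over a large file.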