Parsing XML

<person>
  <name>Terence</name>
  <phone>422 5707</phone>
</person>
<person>
  <name>Rosa Maria</name>
  <phone>422 6530</phone>
</person>

First step: break up the input stream into vocabulary symbols. In this case, the symbols are the tags and the text in between the tags.

Second step: apply grammatical structure to the sequence of tokens. This means ensuring that the tags are nested properly and that each person record has the appropriate subtags.

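Talk about why a DFA is not powerful enough to handle nesting: it has a fixed number of states, so it cannot keep track of arbitrarily deep nesting. That takes a stack.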

Start out showing well-formed input (nested properly). Just do OPEN tags, CLOSE tags, and TEXT. I.e., not:

<person>
  <name>Terence</person>
</name>

When you encounter a start tag, push its name on a stack. Upon encountering an end tag, check that the top of the stack is the matching start tag, then pop that tag off the stack and continue.

tags = new stack
scanner = new TagScanner
t = scanner.nextToken()
while t is not EOF:
    if t is begin tag:
        tags.push(t)
    else if t is end tag:
        if tags stack is empty:
            throw non-wellformed exception   // end tag with no start tag
        top = tags.pop()
        if t name != top name:
            throw non-wellformed exception
    t = scanner.nextToken()
if tags stack still has elements:
    throw non-wellformed exception
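
The stack check can be tried out with a rough Python sketch. The regular expression here is only a stand-in for TagScanner, and the names NotWellFormed and check_well_formed are made up for illustration:

```python
import re

class NotWellFormed(Exception):
    """Raised when tags are not properly nested."""

# Stand-in for TagScanner (an assumption): matches <name> and </name>.
TAG = re.compile(r"<(/?)([^<>/]+)>")

def check_well_formed(text):
    tags = []  # stack of the names of currently open tags
    for m in TAG.finditer(text):
        is_end, name = m.group(1) == "/", m.group(2)
        if not is_end:
            tags.append(name)
            continue
        if not tags:
            raise NotWellFormed(f"</{name}> has no matching start tag")
        top = tags.pop()
        if top != name:
            raise NotWellFormed(f"expected </{top}> but found </{name}>")
    if tags:
        raise NotWellFormed(f"<{tags[-1]}> was never closed")
    return True
```

Calling check_well_formed on the badly nested example above raises NotWellFormed at </person>, because name is still on top of the stack.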

This doesn't guarantee anything about the relationship between tags, however. It doesn't say that there can be more than one person, or that a person has to have nested name and phone tags. For that, the input needs to follow this grammar:

input : person+ ;
person: '<person>' name phone '</person>' ;
name  : '<name>' TEXT '</name>' ;
phone : '<phone>' TEXT '</phone>' ;
WS    : (' ' | '\t' | '\n')+ {skip();} ; // assume we throw out whitespace
TEXT  : (~'<')+ ;

Explain the relationship to a DTD.

Run ANTLRWorks

So, we need a lexical analyzer (a scanner, or lexer) that returns a sequence of vocabulary symbols: open person '<person>', close person '</person>', open name '<name>', close name '</name>', open phone '<phone>', close phone '</phone>', and TEXT. Ignore whitespace.

Show rules for converting grammar to parser.
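
One common way to do the conversion by hand is recursive descent: each rule becomes a function, a token reference becomes a call to match(), a rule reference becomes a call to that rule's function, and ( )+ becomes "do it once, then loop while the lookahead fits". The Python below is only a sketch of that idea, not what ANTLR generates. It assumes person includes its enclosing '<person>' and '</person>' tags, as in the sample input, and the tokenize helper and Parser class are made-up names:

```python
import re

def tokenize(text):
    # Stand-in lexer (an assumption): tags come back verbatim,
    # non-blank text between tags comes back as the symbol "TEXT".
    tokens = []
    for piece in re.findall(r"<[^<>]*>|[^<]+", text):
        if piece.startswith("<"):
            tokens.append(piece)
        elif piece.strip():
            tokens.append("TEXT")
    tokens.append("EOF")
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def lookahead(self):
        return self.tokens[self.pos]

    def match(self, expected):
        # Token reference in the grammar: check it, then consume it.
        if self.lookahead() != expected:
            raise SyntaxError(f"expected {expected}, found {self.lookahead()}")
        self.pos += 1

    def input(self):
        # input : person+ ;  (plus an explicit end-of-input check)
        self.person()
        count = 1
        while self.lookahead() == "<person>":
            self.person()
            count += 1
        self.match("EOF")
        return count  # number of person records seen

    def person(self):
        # person : '<person>' name phone '</person>' ;
        self.match("<person>")
        self.name()
        self.phone()
        self.match("</person>")

    def name(self):
        # name : '<name>' TEXT '</name>' ;
        self.match("<name>")
        self.match("TEXT")
        self.match("</name>")

    def phone(self):
        # phone : '<phone>' TEXT '</phone>' ;
        self.match("<phone>")
        self.match("TEXT")
        self.match("</phone>")
```

Parser(tokenize(text)).input() returns the number of person records, or raises SyntaxError at the first token that does not fit the grammar, for example a phone that comes before the name.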

What to do when it's recursive? Imagine we can have a single person nested within a person, as the child of the outer person.

<person>
  <name>Terence</name>
  <phone>422 5707</phone>
  <person>
    <name>Imaginary Child</name>
    <phone>422 5707</phone>
  </person>
</person>

Altered grammar (the nested person is optional, so the recursion can stop):

input : person+ ;
person: '<person>' name phone person? '</person>' ;
name  : '<name>' TEXT '</name>' ;
phone : '<phone>' TEXT '</phone>' ;
WS    : (' ' | '\t' | '\n')+ {skip();} ; // assume we throw out whitespace
TEXT  : (~'<')+ ;
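
In a hand-written parser, the recursive rule just turns into a function that calls itself. Here is a sketch in Python, assuming the nested person is optional and sits inside the outer '<person>' ... '</person>' pair as in the example input. All function names and the dictionary layout are made up for illustration:

```python
import re

def tokenize(text):
    # Stand-in lexer (an assumption): tags verbatim, blank text dropped.
    pieces = re.findall(r"<[^<>]*>|[^<]+", text)
    return [p.strip() for p in pieces if p.strip()]

def expect(tokens, pos, wanted):
    # Consume one specific token or fail.
    if pos >= len(tokens) or tokens[pos] != wanted:
        found = tokens[pos] if pos < len(tokens) else "end of input"
        raise SyntaxError(f"expected {wanted}, found {found}")
    return pos + 1

def parse_field(tokens, pos, tag):
    # name / phone : '<tag>' TEXT '</tag>' ;
    pos = expect(tokens, pos, f"<{tag}>")
    if pos >= len(tokens) or tokens[pos].startswith("<"):
        raise SyntaxError(f"expected TEXT inside <{tag}>")
    value = tokens[pos]
    return value, expect(tokens, pos + 1, f"</{tag}>")

def parse_person(tokens, pos):
    # person : '<person>' name phone person? '</person>' ;
    pos = expect(tokens, pos, "<person>")
    name, pos = parse_field(tokens, pos, "name")
    phone, pos = parse_field(tokens, pos, "phone")
    child = None
    # One token of lookahead decides whether to recurse.
    if pos < len(tokens) and tokens[pos] == "<person>":
        child, pos = parse_person(tokens, pos)
    pos = expect(tokens, pos, "</person>")
    return {"name": name, "phone": phone, "child": child}, pos

def parse_input(text):
    # input : person+ ;
    tokens, pos, people = tokenize(text), 0, []
    while True:
        person, pos = parse_person(tokens, pos)
        people.append(person)
        if pos >= len(tokens):
            return people
```

The call stack of parse_person plays the same role as the explicit tag stack in the well-formedness check: each active call remembers one open person.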

What does the lexical analyzer look like?

while not eof:
  if whitespace: consume and continue
  if input char is '<':
    consume char
    if input char is '/':
      consume until '>'
      return as the matching end tag
    else:
      consume until '>'
      return as the matching open tag
  else:
    consume until '<', return as TEXT token
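
The same loop as a runnable Python sketch. It yields (type, value) pairs rather than one token constant per tag, and it trims whitespace at the end of TEXT; both are choices made for this example, as are the names tokens, OPEN, and CLOSE:

```python
def tokens(text):
    """Yield (type, value) pairs, where type is OPEN, CLOSE, or TEXT."""
    i, n = 0, len(text)
    while i < n:
        if text[i].isspace():
            i += 1  # whitespace between tokens: consume and continue
            continue
        if text[i] == "<":
            end = text.find(">", i)  # consume until '>'
            if end == -1:
                raise ValueError(f"unterminated tag at offset {i}")
            if text.startswith("</", i):
                yield ("CLOSE", text[i + 2:end])
            else:
                yield ("OPEN", text[i + 1:end])
            i = end + 1
        else:
            end = text.find("<", i)  # consume until '<'
            if end == -1:
                end = n  # text runs to the end of the input
            # Assumption: whitespace before the next tag is not part of TEXT.
            yield ("TEXT", text[i:end].rstrip())
            i = end
```

For example, list(tokens("<name>Rosa Maria</name>")) gives an OPEN token for name, a TEXT token holding "Rosa Maria", and a CLOSE token for name.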