<person>
  <name>Terence</name>
  <phone>422 5707</phone>
</person>
<person>
  <name>Rosa Maria</name>
  <phone>422 6530</phone>
</person>
First step: break up the input stream into vocabulary symbols. In this case, the symbols are the tags and the text in between the tags.
Second step: apply grammatical structure to the sequence of tokens. This means ensuring that the tags are nested properly and that each person record has the appropriate subtags.
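Talk about why a DFA is not powerful enough to handle nested structure: with a fixed, finite number of states it cannot keep track of arbitrarily deep nesting.
Start out by showing well-formedness (proper nesting). Just do OPEN vs. CLOSE tags and TEXT. I.e., not: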
<person> <name>Terence</person> </name>
When you see a start tag, push its name on a stack. Upon encountering an end tag, check that the top of the stack is the matching start tag, then pop the tag off the stack and continue.
    tags = new stack
    scanner = new TagScanner
    t = scanner.nextToken()
    while ( t is not EOF ) {
        if t is begin tag:
            tags.push(t)
        else if t is end tag:
            if tags is empty: throw non-wellformed exception   // end tag with no start tag
            top = tags.pop()
            if t name != top name: throw non-wellformed exception
        t = scanner.nextToken()
    }
    if tags stack still has elements: throw non-wellformed exception
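The stack check above can be sketched in Python. This is a minimal sketch, not a full XML checker: the `TAG` regular expression, the `NotWellFormed` exception, and the `check_well_formed` function are names made up for illustration, and tags are assumed to have no attributes.

```python
import re

# Assumed tag syntax: <name> or </name>, with no attributes.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")


class NotWellFormed(Exception):
    """Raised when tags are not properly nested."""


def check_well_formed(text):
    """Raise NotWellFormed unless every start tag is closed in the right order."""
    stack = []  # names of start tags that are still open
    for match in TAG.finditer(text):
        is_end = match.group(1) == "/"
        name = match.group(2)
        if not is_end:
            stack.append(name)
        elif not stack:
            raise NotWellFormed(f"</{name}> has no matching start tag")
        else:
            top = stack.pop()
            if top != name:
                raise NotWellFormed(f"expected </{top}> but found </{name}>")
    if stack:
        raise NotWellFormed(f"<{stack[-1]}> was never closed")
```

Calling `check_well_formed("<person> <name>Terence</person> </name>")` raises `NotWellFormed`, because `name` is on top of the stack when `</person>` arrives.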
This doesn't guarantee anything about the relationship between tags, however. It doesn't say that there can be more than one person or that a person has to contain name and phone tags. We need the input to follow this grammar:
    input  : person+ ;
    person : '<person>' name phone '</person>' ;
    name   : '<name>' TEXT '</name>' ;
    phone  : '<phone>' TEXT '</phone>' ;
    WS     : (' ' | '\t' | '\n')+ {skip();} ; // assume we throw out whitespace
    TEXT   : (~'<')+ ;
Explain the relationship with a DTD.
Run ANTLRWorks
So, we need a lexical analyzer (also called a scanner or lexer) that returns a sequence of vocabulary symbols: open person '<person>', close person '</person>', open name '<name>', close name '</name>', open phone '<phone>', close phone '</phone>', and TEXT. Ignore whitespace.
Show rules for converting grammar to parser.
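One common way to convert a grammar like this into a parser is recursive descent: each rule becomes a method, each token reference becomes a `match` call, each rule reference becomes a method call, and `+` becomes a loop. The sketch below is a hand-written illustration of that mapping, not what ANTLR generates. It assumes each person record is wrapped in `<person>` ... `</person>` as in the sample input and that tokens arrive as `(kind, text)` pairs; the `Parser` and `ParseError` names are made up.

```python
class ParseError(Exception):
    """Raised when the token sequence does not follow the grammar."""


class Parser:
    def __init__(self, tokens):
        # Append an explicit EOF marker so peek() never runs off the end.
        self.tokens = list(tokens) + [("EOF", "")]
        self.pos = 0

    def peek(self):
        """Return the kind of the current token without consuming it."""
        return self.tokens[self.pos][0]

    def match(self, kind):
        """Consume a token of the given kind and return its text."""
        tok_kind, text = self.tokens[self.pos]
        if tok_kind != kind:
            raise ParseError(f"expected {kind} but found {tok_kind}")
        self.pos += 1
        return text

    # input : person+ ;
    def input(self):
        people = [self.person()]           # '+' means at least one
        while self.peek() == "<person>":   # then loop while another one starts
            people.append(self.person())
        self.match("EOF")
        return people

    # person : '<person>' name phone '</person>' ;
    def person(self):
        self.match("<person>")
        name = self.name()
        phone = self.phone()
        self.match("</person>")
        return (name, phone)

    # name : '<name>' TEXT '</name>' ;
    def name(self):
        self.match("<name>")
        text = self.match("TEXT")
        self.match("</name>")
        return text

    # phone : '<phone>' TEXT '</phone>' ;
    def phone(self):
        self.match("<phone>")
        text = self.match("TEXT")
        self.match("</phone>")
        return text
```

Given the tokens for a single Terence record, `Parser(tokens).input()` returns a list with one `(name, phone)` pair; a missing `<phone>` element raises `ParseError`.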
What do we do when the structure is recursive? Imagine a person can have a single person nested within it, which is the child of the outer person.
<person>
  <name>Terence</name>
  <phone>422 5707</phone>
  <person>
    <name>Imaginary Child</name>
    <phone>422 5707</phone>
  </person>
</person>
Altered grammar (the nested person is optional; otherwise the rule would never terminate):
    input  : person+ ;
    person : '<person>' name phone person? '</person>' ;
    name   : '<name>' TEXT '</name>' ;
    phone  : '<phone>' TEXT '</phone>' ;
    WS     : (' ' | '\t' | '\n')+ {skip();} ; // assume we throw out whitespace
    TEXT   : (~'<')+ ;
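In a recursive-descent parser, a recursive rule simply becomes a recursive function call. The sketch below handles only the person rule, assuming the nested person is optional and that each person is wrapped in `<person>` ... `</person>` tags. The function names, the `(kind, text)` token pairs, and the dictionary result are all assumptions made for illustration.

```python
class ParseError(Exception):
    """Raised when the token sequence does not follow the grammar."""


def expect(tokens, pos, kind):
    """Check that tokens[pos] has the given kind; return the next position."""
    if pos >= len(tokens) or tokens[pos][0] != kind:
        found = tokens[pos][0] if pos < len(tokens) else "EOF"
        raise ParseError(f"expected {kind} but found {found}")
    return pos + 1


def parse_text_element(tokens, pos, tag):
    """Parse '<tag>' TEXT '</tag>' and return (text, next_pos)."""
    pos = expect(tokens, pos, f"<{tag}>")
    text_pos = pos
    pos = expect(tokens, pos, "TEXT")
    pos = expect(tokens, pos, f"</{tag}>")
    return tokens[text_pos][1], pos


def parse_person(tokens, pos=0):
    """Parse one person, possibly with a nested child; return (record, next_pos)."""
    pos = expect(tokens, pos, "<person>")
    name, pos = parse_text_element(tokens, pos, "name")
    phone, pos = parse_text_element(tokens, pos, "phone")
    child = None
    # One token of lookahead decides whether the optional nested person is present.
    if pos < len(tokens) and tokens[pos][0] == "<person>":
        child, pos = parse_person(tokens, pos)  # recursion mirrors the recursive rule
    pos = expect(tokens, pos, "</person>")
    return {"name": name, "phone": phone, "child": child}, pos
```

The call stack of `parse_person` plays the same role as the explicit tag stack in the well-formedness check: each pending call remembers a `</person>` that still has to be matched.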
What does the lexical analyzer look like?
    while not eof:
        if whitespace: consume and continue
        if input char is '<':
            consume char
            if input char is '/':
                consume until '>'
                return as correct end tag
            else
                consume until '>'
                return as correct open tag
        else:
            consume until '<', return as TEXT token
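The loop above might look like this in Python. It is a rough sketch under a few assumptions: the `tokenize` name is made up, tokens are `(kind, text)` pairs, and the kind of a tag token is the tag itself (so `'</name>'` marks an end tag by its slash) rather than a separate OPEN or CLOSE code.

```python
def tokenize(text):
    """Yield (kind, text) pairs; kind is the tag itself (e.g. '<name>') or 'TEXT'."""
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch in " \t\n":
            i += 1            # skip whitespace between tokens
            continue
        if ch == "<":
            end = text.find(">", i)
            if end == -1:
                raise ValueError(f"unterminated tag at offset {i}")
            tag = text[i:end + 1]
            yield (tag, tag)  # start or end tag, e.g. '<name>' or '</name>'
            i = end + 1
        else:
            end = text.find("<", i)
            if end == -1:
                end = n       # text runs to the end of the input
            yield ("TEXT", text[i:end])
            i = end
```

Whitespace is skipped only between tokens, so the space inside `422 5707` stays part of the TEXT token.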