CS 662: AI Programming
Assignment 4: NLP and Information Extraction

Assigned: September 25
Due: October 4
30 points total.

To turn in: Hard copies of your source code. Also, please check your code into your Subversion repository. Create a subdirectory called assignment4 for it. Everything necessary to run your code should be in this directory. If anything out of the ordinary is needed to run your code, please provide a README.

In this assignment, you will get some basic exposure to natural language processing (NLP) using NLTK, a Python-based NLP toolkit. NLTK is quite extensive; we will work with only a small subset of it in this assignment.

Resources: the NLTK book (chapters referenced below). Note: one objective of this assignment is to give you some experience in learning to work with an existing tool. This means that I'll expect you to read about it and try things out - I won't necessarily explain every detail.

  1. (5 points) Warmup and Intro to NLTK. Read Chapter 3 of the NLTK book. Build a tool that can:
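
    By way of illustration, here is a minimal sketch of the kind of raw-text
    processing Chapter 3 covers (tokenization and frequency counting). It
    assumes a modern NLTK install with the 'punkt' tokenizer data downloaded,
    and the function name is purely illustrative:

    import nltk

    def word_frequencies(text, n=10):
        """Tokenize raw text and return the n most common words."""
        tokens = nltk.word_tokenize(text.lower())
        # Keep only alphabetic tokens (drops punctuation and numbers).
        words = [t for t in tokens if t.isalpha()]
        return nltk.FreqDist(words).most_common(n)
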
  2. (5 points) Tagging parts of speech (Chapter 4 of the NLTK book). A tagger assigns tags representing parts of speech to the words in a sentence. Chapter 4 describes several different approaches to tagging, including the default tagger, the regular expression tagger, the unigram tagger, the affix tagger, and the bigram tagger.

    Construct a tagger that achieves better than 90% accuracy on the Brown 'a' corpus, measured as shown below. You will probably want to use other taggers as backoff taggers; you may use whatever taggers or combinations of taggers you like.
    >>> nltk.tag.accuracy(mytagger, nltk.corpus.brown.tagged_sents('a'))
    0.95726674224794639
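
    One possible construction is a backoff chain, sketched below. This assumes
    a recent NLTK, where the Brown sections are accessed by category name
    ('news' corresponds to the old 'a' section) and a tagger is scored with
    its evaluate method (renamed accuracy in the newest releases); adjust to
    match your version's API.

    import nltk
    from nltk.corpus import brown

    # Train on the 'news' category (the old Brown 'a' section).
    train_sents = brown.tagged_sents(categories='news')

    # Backoff chain: bigram -> unigram -> default tag of 'NN'.
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)

    # Score against the training sentences, as in the example above.
    print(t2.evaluate(train_sents))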
    
  3. (5 points) Tagging wikipages. Modify your wikipage class so that it uses the tagger you created in the previous section to tag each of the words with a part of speech. Store the resulting list in an instance variable called POS. (Depending on the tagger you used, you might not want to remove stopwords.)
    >>> w.POS
    [('main', 'NN'), ('page', 'NN'), ('free', 'NN'), ('encyclopedia',
       'NP'), ('var', 'NN'), ('skin', 'NN'), ('stylepath', 'NN'), ('your',
       'NN'), ('continued', 'VBD'), ('donations', 'NNS'), ('keep', 'NN'),
       ('wikipedia', 'NP'), ('main', 'NN'), ('page', 'NN'), ('from',
       'IN'), ('free', 'NN'), ('encyclopedia', 'NP'), ('jump', 'NN'),
       ('navigation', 'NN'), ('search', 'WDT'), ('welcome', 'NN'),
       ('wikipedia', 'NP'), ('free', 'NN'), ('encyclopedia', 'NP'),
       ('can', 'MD'), ('edit', 'NN'), ...
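
    A minimal sketch of that modification, assuming your class stores its
    cleaned-up token list in an attribute such as self.words (a placeholder
    name; use whatever your class actually calls it):

    class WikiPage:
        # ... existing fetching and cleanup code ...

        def tag_words(self, tagger):
            """Tag this page's words and store the result in self.POS."""
            # tagger.tag takes a list of tokens, returns (word, tag) pairs.
            self.POS = tagger.tag(self.words)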
    
  4. (10 points) Chunking. Chunking is a form of shallow parsing; it assigns tags such as NP to multi-word spans of a sentence. Read Sections 7.1-7.4.2 of the NLTK book and follow the instructions for building a noun phrase chunker. Test its performance on the CoNLL-2000 corpus. Then integrate that chunker into your wikipage class.

    Run it on at least five different Wikipedia pages and highlight any misclassified noun chunks.
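
    As a starting point, here is a sketch of the kind of regular-expression NP
    chunker Chapter 7 builds, scored on the CoNLL-2000 test set. The grammar
    is deliberately simple; yours should do better.

    import nltk
    from nltk.corpus import conll2000

    # A simple NP grammar: optional determiner, any adjectives, then nouns.
    grammar = r"NP: {<DT>?<JJ.*>*<NN.*>+}"
    chunker = nltk.RegexpParser(grammar)

    test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
    print(chunker.evaluate(test_sents))  # prints precision, recall, f-measure
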
  5. (5 points) Named Entity Recognition. Once we have a program that can extract chunks (specifically noun phrases), we can do all sorts of interesting things. Noun phrases are arguably the most useful terms for classifying text by topic, and they can help us discover relationships in our data.

    A type of noun phrase that's of particular interest is a Named Entity. This might be a person, such as Homer Simpson, or a place, such as Springfield, or a business, such as Moe's Tavern. If we can identify named entities, we can learn relationships between them.

    In general, this is a hard problem. Words can have multiple uses, and there's an unbounded number of possible names. Within a domain, though, we can have better luck.

    Let's consider the Simpsons domain. In this domain, we can assume that the following Named Entities exist: Homer, Marge, Bart, Lisa, Maggie, Milhouse, Ned, Krusty, Springfield, Moe's Tavern. (There are lots of others, of course.) Springfield and Moe's Tavern are of type "PLACE", and the others are of type "PERSON".

    Extend your chunker to attach types to known noun phrases: it should relabel a chunk's node as "NP-PLACE" for known places and "NP-PERSON" for known persons. Use the following pages as input: Print out all the Named Entities you find.
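
    A sketch of one way to do the relabeling, assuming your chunker produces
    nltk.Tree objects with NP subtrees (set_label is the NLTK 3 method name;
    older versions assign to the .node attribute instead):

    KNOWN_ENTITIES = {
        'homer': 'NP-PERSON', 'marge': 'NP-PERSON', 'bart': 'NP-PERSON',
        'lisa': 'NP-PERSON', 'maggie': 'NP-PERSON', 'milhouse': 'NP-PERSON',
        'ned': 'NP-PERSON', 'krusty': 'NP-PERSON',
        'springfield': 'NP-PLACE', "moe's tavern": 'NP-PLACE',
    }

    def label_named_entities(tree):
        """Relabel known NP chunks as NP-PERSON or NP-PLACE."""
        for subtree in tree.subtrees(lambda t: t.label() == 'NP'):
            phrase = ' '.join(word for word, tag in subtree.leaves()).lower()
            if phrase in KNOWN_ENTITIES:
                subtree.set_label(KNOWN_ENTITIES[phrase])
        return tree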

    Later in the semester, we'll discuss how to learn Named Entities, and then how to use this information to extract relational information about them.