CS 662: AI Programming

Project #3: Naive Bayes Text Classification

Assigned: November 17

Due: Friday, Dec 9, 11:59 pm

Total points: 100

Introduction

In this project, you will build a Naive Bayes classifier for the purpose of classifying web pages. You'll begin by building and evaluating a basic Naive Bayes classifier, and then implementing some extensions and evaluating the effect of these extensions on your classifier's ability to learn. This will give you the chance to become familiar with the Naive Bayes algorithm and actually see it applied to a real (~8000 pages) data set. It will also give you experience performing and presenting experiments.

The project has two required components: the basic classifier and the report. These comprise 70 points out of 100. There are also a number of optional components, each of which has a point value attached. You may do up to 35 points of optional components. (Note that the assignment is worth 100 points, so it's possible to get 5 points of extra credit.)

What to turn in

Please note that there are parts of this project that are specified more vaguely than, for example, project 1. In those cases, you are encouraged to use your discretion as to how to approach the problem. For example, in this project, the classifier is described in algorithmic terms, rather than as a specific set of classes. You are free to choose how to implement these algorithms in Python.

Building a Bayesian classifier

The basic Naive Bayes classifier is another example of an algorithm that is fairly straightforward to code, once you understand it, but requires a little bit of thinking to understand. The idea is this:

We want to be able to predict the likelihood that a Web page belongs to one of a set of predefined classes. We'll define a Web page (for the moment) as a collection of words a1, a2, ..., an. Therefore, if there are m classes of web pages, labeled c1, c2, ..., cm, we want to know P(c | a1,a2,...,an) for each class c. We can use Bayes' rule to do this. The Naive Bayes assumption allows us to assume that each word is conditionally independent of the others, so

P(c | a1,a2,...,an) = x * P(a1 | c) * P(a2 | c) * ... * P(an | c) * P(c)

where x is a normalizing factor.

In order to do this sort of computation, we will need:

  1. Estimated priors for each class of web page.
  2. Frequencies for each word in each class of web page. (this is P(aj | c))

About the data

In /home/public/cs662/webkb you'll find a set of web pages from five different universities (Cornell, Texas, Wisconsin, Washington, and "other"). These are divided into seven different categories, corresponding to the type of web page: course, department, faculty, student, staff, project, and other. Our task will be to build a Naive Bayes classifier that can classify these pages.

This data originally comes from the Web->KB project at Carnegie Mellon University. There is a gzipped copy of the data at this link.

READ THIS

This link is provided in case you want to do your development on your home machine. The data set is somewhat large (11 Meg tarred and zipped, ~60 Megs unpacked), especially if everyone makes a local copy of it. Therefore, you should NOT make a copy of this data set in your home directory - instead, just create a symbolic link to the copy in /home/public/cs662/webkb, or read from these files directly. Copying it to your home directory may result in you going over your quota and being locked out until you find a sysadmin.

The Basic classifier

In this section, I'll describe how to build the basic Bayesian classifier. This section is required.

Preliminaries

Before classifying pages, we'll need to "massage" the data a bit. You should have all this code already from project 1; this step will mostly be a matter of finding it and remembering how it works.

Removing HTML

Recall that Naive Bayes uses a "bag of words" approach. It simply builds up a list of all the words in a document, and their frequency. In order to do that, we'll first need to get rid of all the HTML tags. You should already have code that does this.
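If you need to rebuild this from scratch, a minimal regex-based sketch is below. (A real HTML parser would be more robust to malformed markup; this is just one simple approach, and the function name is illustrative.)

```python
import re

def strip_html(page_text):
    """Remove HTML comments and tags from a page, leaving only the text."""
    page_text = re.sub(r"<!--.*?-->", " ", page_text, flags=re.DOTALL)  # drop comments
    page_text = re.sub(r"<[^>]+>", " ", page_text)                      # drop tags
    return re.sub(r"\s+", " ", page_text).strip()                      # collapse whitespace
```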

Stop Words

One challenge in classifying text according to probability is the presence of words that carry no meaning, such as 'a', 'an', 'the', etc. These are often called stop words. A useful list of stop words, taken from the WordNet project, can be found here.

You should also already have code that does this.
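A sketch of stop-word handling, assuming the list is stored one word per line (the file path and function names are illustrative):

```python
def load_stopwords(path):
    """Read a stop-word list, one word per line, into a set."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def remove_stopwords(words, stopwords):
    """Filter stop words (case-insensitively) out of a token list."""
    return [w for w in words if w.lower() not in stopwords]
```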

Computing Conditional Probabilities of Words

You'll need a tool that can count words and store them in a dictionary (I bet you've already got this). For each category, you'll want to be able to take all the pages in that category and count the frequency of each word (excluding stop words) occurring on those pages. This will serve as an estimate of P(w | category) for each category. For example, to estimate P('cat' | faculty), divide the number of times 'cat' occurs in all faculty pages by the total number of non-stop words on those pages.
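One way to sketch the counting step (names are illustrative; each page is assumed to already be a list of tokens with HTML removed):

```python
def count_words(pages, stopwords):
    """Count word frequencies across a category's pages, skipping stop words.

    pages: list of token lists, one per page.
    Returns (counts dict, total number of non-stop word occurrences).
    """
    counts = {}
    total = 0
    for words in pages:
        for w in words:
            w = w.lower()
            if w in stopwords:
                continue
            counts[w] = counts.get(w, 0) + 1
            total += 1
    return counts, total
```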

Computing Priors on categories

You'll also need to be able to calculate the prior probability of a category occurring. Build a Python tool which, given a set of input files, can tell you the fraction of files from each of a set of categories, such as page types, or schools. (Keep in mind that we'll want to train on different subsets of the data, so don't just calculate the fractions for the entire dataset).
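A minimal sketch of the prior computation, assuming you can produce one category label per training file (the function name is illustrative):

```python
def category_priors(labels):
    """Estimate P(category) from a list of category labels,
    one label per file in the training subset."""
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    n = len(labels)
    return {c: k / n for c, k in counts.items()}
```

Because this takes the labels as an argument rather than scanning the whole dataset, it works unchanged on any training subset.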

Conditional Probabilities.

Here's where the rubber hits the road. You'll want to use your previous tools to build a program that, for a given category (for example, faculty), can compute the conditional probability for each word given that it's in a faculty page.

To begin, we'll need to decide how we want to classify pages. For this project, we'll focus on the type of page. You'll want to write a simple Python function that can randomly select a fraction of the data to act as your training set, and a fraction as a test set. (random.shuffle is useful for this)
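One way to write the splitting function (the 80/20 default matches the evaluation setup later in this assignment; the seed parameter is optional but makes runs reproducible):

```python
import random

def split_data(items, train_fraction=0.8, seed=None):
    """Randomly split a list of (filename, category) pairs into
    training and test sets."""
    items = list(items)            # copy so the caller's list isn't reordered
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]
```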

Once you have this, you'll build Vocabulary , a list of all unique words occurring in the training set.

Then, compute the priors for the category as described above.

We'll call the subset of documents in the training set that belong to a given category the Text for that category. So the faculty Text is all the faculty web pages in the training set. Let n be the total number of words in Text (including duplicates).

Then, for each word in Vocabulary, the conditional probability of it occurring given the category is: (timesWordOccurs + 1) / (n + |Vocabulary|)

The result should be a dictionary for each category giving the conditional probability of each word given that category. You may find it helpful to write these dictionaries to files as objects with pickle, since you'll want them later.
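The smoothed estimate above can be sketched directly (names are illustrative; word_counts and n are the per-category outputs of your counting tool):

```python
def conditional_probs(word_counts, n, vocabulary):
    """Laplace-smoothed P(word | category):
    (timesWordOccurs + 1) / (n + |Vocabulary|),
    where n is the total word count in the category's Text."""
    denom = n + len(vocabulary)
    return {w: (word_counts.get(w, 0) + 1) / denom for w in vocabulary}
```

Note that words from Vocabulary that never appear in this category still get a small nonzero probability, which is the point of the +1 smoothing.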

Classifying unseen instances.

Now you're ready to use the classifier to categorize unseen web pages. It should take as input an unknown webpage and strip out the HTML tags and stop words. Then, compute the likelihood for each category given the stream of words. The formula for Naive Bayes is:

P(category | set of words) = P(category) * Product P(word | category)

(where Product means multiply the prob. for each word)

You may find that, when computing Naive Bayes, you run into underflow problems. Luckily, there's a slick way to get around that. Remember that we're not really interested in the probability of each classification, but the MAP hypothesis: which classification is most likely. (This is why we're not computing the denominator of Bayes' rule.)

Since all we're doing is comparing likelihoods and finding the largest one, we can apply any monotonic transformation to the likelihood value, since it doesn't change the ordering. Specifically, we'll use log.

If P(c1) > P(c2), then log(P(c1)) > log(P(c2))

So why log? Log also has the following very useful property:

log(ab) = log(a) + log(b)

In Naive Bayes terms, this means that: log(P(c | w1,w2,...,wn)) = log(P(w1 | c)) + log(P(w2 | c)) + ... + log(P(wn | c)) + log(P(c))
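Putting the pieces together, a log-space classifier can be sketched as follows. (Names are illustrative; this version simply skips test-page words that aren't in Vocabulary, which is one reasonable choice but not the only one.)

```python
import math

def classify(words, priors, cond_probs):
    """Return the most likely category for a bag of words.

    priors: {category: P(category)}
    cond_probs: {category: {word: P(word | category)}}
    Scores are summed in log space to avoid underflow.
    """
    best_category, best_score = None, float("-inf")
    for category, prior in priors.items():
        score = math.log(prior)
        for w in words:
            p = cond_probs[category].get(w)
            if p is not None:          # skip words outside Vocabulary
                score += math.log(p)
        if score > best_score:
            best_category, best_score = category, score
    return best_category
```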

Now let's test your classifier.

To begin, let's build a boolean classifier. That is, it will predict whether a page is more likely to belong to the 'faculty' category or not.

  1. We will perform N-fold cross-validation, with N = 5. We will test the performance of the classifier as the size of the training set increases. Begin with data sets of sizes (100, 250, 500, 750, 1000, 1500, 2000). In each case, split the data into 80% training / 20% test, and perform 5 tests. (I would strongly recommend writing a script to automate all of this.) Measure the precision and accuracy of your classifier.
  2. Next, modify your classifier to be a multinomial classifier. That is, rather than just predicting whether a page is faculty or not, it should output the most probable class (faculty, staff, student, course, department, project, other). Use the same N-fold cross-validation as above. Measure the fraction of pages correctly classified for each of the data set sizes above.
  3. Prepare a "happy graph" that compares the performance of the boolean classifier to the multinomial classifier. Which one performs better? Why do you think this is?

Extensions

The above portion of the project is worth 70 points. (50 for the code, and 20 for the report.) Below are a number of possible extensions to the basic classifier, along with a point value. You may do up to 35 points worth of extensions. In each case, you should describe the improvement in your report and prepare a happy graph comparing the performance of the classifier with the extension to the standard classifier.