Project #3: Naive Bayes Text Classification
Due: Friday, Dec 9, 11:59 pm
Total points: 100
In this project, you will build a Naive Bayes classifier for classifying web pages. You'll begin by building and evaluating a basic Naive Bayes classifier, then implement some extensions and evaluate their effect on your classifier's ability to learn. This will give you the chance to become familiar with the Naive Bayes algorithm and see it applied to a real data set (~8000 pages). It will also give you experience performing and presenting experiments.
The project has two required components: the basic classifier and the report. These comprise 70 points out of 100. There are also a number of optional components, each of which has a point value attached. You may do up to 35 points of optional components. (Note that the assignment is worth 100 points, so it's possible to get 5 points of extra credit.)
Please note that parts of this project are specified more vaguely than, for example, project 1. In those cases, you are encouraged to use your discretion as to how to approach the problem. For example, in this project, the classifier is described in algorithmic terms, rather than as a specific set of classes. You are free to choose how to implement these algorithms in Python.
The basic Naive Bayes classifier is another example of an algorithm that is fairly straightforward to code, once you understand it, but requires a little bit of thinking to understand. The idea is this:
We want to be able to predict the likelihood that a Web page belongs to one of a set of predefined classes. We'll define a Web page (for the moment) as a collection of words a1, a2, ..., an. Therefore, if there are m classes of web pages, labeled c1, c2, ..., cm, we want to know P(c | a1,a2,...,an) for each class c. We can use Bayes' rule to do this. The Naive Bayes assumption is that each word is conditionally independent of the others given the class, so
P(c | a1,a2,...,an) = x * P(a1 | c) * P(a2 | c) * ... * P(an | c) * P(c)
where x is a normalizing factor.
In order to do this sort of computation, we will need:
In /home/public/cs662/webkb you'll find a set of web pages grouped by university: Cornell, Texas, Wisconsin, Washington, and "other". These are divided into seven different categories, corresponding to the type of web page: course, department, faculty, student, staff, project, and other. Our task will be to build a Naive Bayes classifier that can classify these pages.
This data originally comes from the Web->KB project at Carnegie Mellon University. There is a gzipped copy of the data at this link.
This link is provided in case you want to do your development on your home machine. The data set is somewhat large (11 Meg tarred and zipped, ~60 Megs unpacked), especially if everyone makes a local copy of it. Therefore, you should NOT make a copy of this data set in your home directory - instead, just create a symbolic link to the copy in /home/public/cs662/webkb, or read from these files directly. Copying it to your home directory may result in you going over your quota and being locked out until you find a sysadmin.
Before classifying pages, we'll need to "massage" the data a bit. You should have all this code already from project1; this step will mostly be a matter of finding it and remembering how it works.
One challenge in classifying text according to probability is the presence of words that carry no meaning, such as 'a', 'an', 'the', etc. These are often called stop words. A useful list of stop words taken from the WordNet project can be found here.
You should also already have code that does this.
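As a reminder of what this cleanup step involves, here is a minimal sketch. The tiny `STOP_WORDS` set, the regular expressions, and the `tokenize` name are all illustrative assumptions; your project 1 code and the real stop-word list should take their place.

```python
import re

# Hypothetical, tiny stop-word set for illustration only; the real list
# comes from the stop-word file mentioned above.
STOP_WORDS = {"a", "an", "the", "of", "and", "is"}

TAG_RE = re.compile(r"<[^>]*>")   # crude HTML tag matcher (an assumption)
WORD_RE = re.compile(r"[a-z]+")   # runs of lowercase letters


def tokenize(html, stop_words=STOP_WORDS):
    """Strip HTML tags, lowercase the text, and drop stop words."""
    text = TAG_RE.sub(" ", html).lower()
    return [w for w in WORD_RE.findall(text) if w not in stop_words]


words = tokenize("<html><b>The</b> Department of Computer Science</html>")
print(words)
```

A regular expression is only a rough way to remove tags; it is used here to keep the sketch short.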
You'll need a tool that can count words and store them in a dictionary (I bet you've already got this). For each category, you'll want to be able to take all the pages in that category and count the frequency of each word (less stop words) occurring on those pages. This will serve as an estimate of P(w | category) for each category. For example, to estimate P('cat' | faculty), count the number of times 'cat' occurs in all faculty pages and divide by the total number of non-stop words on those pages.
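One possible shape for the counting tool is sketched below, assuming each page has already been reduced to a list of non-stop words. The `count_words` name and the toy pages are made up for illustration.

```python
from collections import Counter


def count_words(pages):
    """Count word occurrences over a list of tokenized pages."""
    counts = Counter()
    for words in pages:
        counts.update(words)
    return counts


# Two toy "faculty" pages, already stripped of HTML and stop words.
faculty_pages = [["professor", "research", "cat"], ["research", "teaching"]]
counts = count_words(faculty_pages)
total = sum(counts.values())       # total non-stop words in the category
estimate = counts["cat"] / total   # unsmoothed estimate of P('cat' | faculty)
print(counts["research"], total, estimate)
```

A `Counter` returns 0 for words it has never seen, which is convenient when looking up words that do not occur in a category.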
You'll also need to be able to calculate the prior probability of a category occurring. Build a Python tool which, given a set of input files, can tell you the fraction of files from each of a set of categories, such as page types, or schools. (Keep in mind that we'll want to train on different subsets of the data, so don't just calculate the fractions for the entire dataset).
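A minimal sketch of the prior calculation follows. It assumes you have already paired each file with its category; how you recover the category from the file's location is up to you, and the file names below are invented.

```python
from collections import Counter


def priors(labeled_files):
    """Return {category: fraction of files in that category}.

    labeled_files is a list of (path, category) pairs for whichever
    subset of the data you are training on.
    """
    counts = Counter(category for _, category in labeled_files)
    total = len(labeled_files)
    return {category: n / total for category, n in counts.items()}


# Hypothetical training subset.
training = [
    ("f1.html", "faculty"),
    ("f2.html", "faculty"),
    ("s1.html", "student"),
    ("c1.html", "course"),
]
print(priors(training))
```

Because the function takes the file list as an argument, the same code works for any training subset and for either kind of category (page type or school).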
Here's where the rubber hits the road. You'll want to use your previous tools to build a program that, for a given category (for example, faculty), can compute the conditional probability for each word given that it's in a faculty page.
To begin, we'll need to decide how we want to classify pages. For this project, we'll focus on the type of page. You'll want to write a simple Python function that can randomly select a fraction of the data to act as your training set, and a fraction as a test set. (random.shuffle is useful for this)
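The split could look something like the sketch below. The `split_data` name, the 80/20 default, and the optional `seed` parameter (handy for repeatable experiments) are assumptions, not requirements.

```python
import random


def split_data(items, train_fraction=0.8, seed=None):
    """Shuffle a copy of items and split it into (training, test) lists."""
    items = list(items)
    rng = random.Random(seed)   # private generator; pass a seed to reproduce a split
    rng.shuffle(items)
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


train, test = split_data(range(10), train_fraction=0.8, seed=0)
print(len(train), len(test))
```

Every item ends up in exactly one of the two lists, so no test page is ever seen during training.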
Once you have this, you'll build Vocabulary, a list of all unique words occurring in the training set.
Then, compute the priors for the category as described above.
We'll call the subset of documents in the training set that belong to a given category the Text for that category. So the faculty Text is all the faculty web pages in the training set. Let n be the total number of words in Text (including duplicates).
Then, for each word in Vocabulary, the conditional probability of it occurring given the category is: (timesWordOccurs + 1) / (n + |Vocabulary|)
The result should be a dictionary for each category giving the conditional probability of each word, given that category. You may find it helpful to write these dictionaries to files as objects with pickle, since you'll want them later.
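Here is a small sketch of the smoothed estimate, where the count is incremented by one and the denominator is n plus the vocabulary size. The function name and toy data are illustrative, and the pickle round trip is done in memory only to keep the example self-contained.

```python
import pickle
from collections import Counter


def conditional_probs(text_words, vocabulary):
    """P(word | category) with add-one smoothing for every vocabulary word."""
    counts = Counter(text_words)
    n = len(text_words)            # words in Text, duplicates included
    denom = n + len(vocabulary)
    return {w: (counts[w] + 1) / denom for w in vocabulary}


vocabulary = {"cat", "dog", "research"}
faculty_text = ["research", "research", "cat"]
probs = conditional_probs(faculty_text, vocabulary)
print(probs["research"], probs["dog"])

# pickle.dump(probs, f) writes to an open binary file; here we round-trip
# through bytes instead.
restored = pickle.loads(pickle.dumps(probs))
```

Note that 'dog' never appears in the faculty Text but still gets a nonzero probability (1/6 here), which is the point of adding one to every count.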
Now you're ready to use the classifier to categorize unseen web pages. It should take as input an unknown webpage and strip out the HTML tags and stop words. Then, compute the likelihood for each category given the stream of words. The formula for Naive Bayes is:
P(category | set of words) = P(category) * Product P(word | category)
(where Product means multiply the prob. for each word)
You may find that, when computing Naive Bayes, you run into underflow problems. Luckily, there's a slick way to get around that. Remember that we're not really interested in the probability of each classification, but the MAP hypothesis: which classification is most likely. (This is why we're not computing the denominator of Bayes' rule.)
Since all we're doing is comparing likelihoods and finding the largest one, we can apply any sort of transformation to the likelihood value, as long as it doesn't change the ordering. To be precise we'll use log.
If P(c1) > P(c2), then log(P(c1)) > log(P(c2))
So why log? Log also has the following very useful property:
log(ab) = log(a) + log(b)
In Naive Bayes terms, this means that: log(P(c | w1,w2,...,wn)) = log(P(w1 | c)) + log(P(w2 | c)) + ... + log(P(wn | c)) + log(P(c))
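The log-space comparison might be sketched as follows. The probability tables are made-up numbers, and skipping words that are missing from a category's dictionary is an assumption of this sketch rather than part of the assignment.

```python
import math


def log_score(words, prior, cond_probs):
    """log P(c) plus the sum of log P(w | c) over the words of a page.

    Words missing from cond_probs are skipped (an assumption of this sketch).
    """
    score = math.log(prior)
    for w in words:
        if w in cond_probs:
            score += math.log(cond_probs[w])
    return score


def classify(words, priors, cond_probs_by_category):
    """Return the category with the largest log score."""
    return max(
        priors,
        key=lambda c: log_score(words, priors[c], cond_probs_by_category[c]),
    )


# Made-up numbers for illustration.
priors = {"faculty": 0.5, "student": 0.5}
cond = {
    "faculty": {"research": 0.5, "homework": 0.1},
    "student": {"research": 0.2, "homework": 0.4},
}
print(classify(["research", "research", "homework"], priors, cond))
print(classify(["homework"], priors, cond))
```

Multiplying 200 probabilities of 0.001 directly gives 0.0 in floating point, while the corresponding sum of logs is an ordinary finite number, so the comparison between categories still works.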
Now let's test your classifier.
To begin, let's build a boolean classifier. That is, it will predict whether a page is more likely to belong to the 'faculty' category or not.
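One way to set up the boolean task is to relabel every page as faculty or not faculty and then compare predictions against those labels. The sketch below assumes pages are stored as (words, category) pairs and uses plain accuracy as the measure; both choices, and the made-up predictions, are illustrative only.

```python
def to_boolean(labeled_pages, target="faculty"):
    """Relabel (words, category) pairs as (words, True/False) for one target."""
    return [(words, category == target) for words, category in labeled_pages]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


pages = [
    (["research"], "faculty"),
    (["homework"], "student"),
    (["syllabus"], "course"),
]
boolean_pages = to_boolean(pages)
actual = [label for _, label in boolean_pages]
predicted = [True, False, True]   # made-up classifier output
print(accuracy(predicted, actual))
```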
The above portion of the project is worth 70 points. (50 for the code, and 20 for the report.) Below are a number of possible extensions to the basic classifier, along with a point value. You may do up to 35 points worth of extensions. In each case, you should describe the improvement in your report and prepare a happy graph comparing the performance of the classifier with the extension to the standard classifier.
The most well-known approach to solving this problem is the Porter stemming algorithm. The page linked above contains a link to a Python implementation, which you may use. Add stemming to the removal of HTML tags and stop words and compare the performance of your Naive Bayes classifier.
WordNet is a lexical reference tool that can identify the potential parts of speech that a word belongs to. It's installed on the lab machines in /home/public/cs662/WordNet-2.0 (or you can download it yourself for use on your personal machine). There is also a Python interface to WordNet, called PyWordNet (note: if you run this on your home machine, please be aware that PyWordNet only works with WordNet 2.0, not WordNet 2.1).
To use WordNet on the lab machines, you'll need to set the following environment variable:
export WNHOME=/home/public/cs662/WordNet-2.0
Use WordNet to remove all words that are not nouns or adjectives (in addition to stop words) and use that as input to your Naive Bayes classifier. Compare the performance of this classifier to the basic one.
A bi-gram is a sequence of two words. Modify your classifier to consider all two-word sequences, rather than just single words. (Keep in mind that "cat dog bunny squirrel" has three bi-grams: 'cat dog', 'dog bunny' and 'bunny squirrel'.) You'll need to be careful about estimating priors for bi-grams and counting |Vocabulary| correctly.
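Extracting bi-grams from a word list can be sketched in a couple of lines. Joining each pair with a space, as below, is just one possible representation; tuples would work equally well as dictionary keys.

```python
def bigrams(words):
    """Return all adjacent two-word sequences as space-joined strings."""
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


print(bigrams("cat dog bunny squirrel".split()))
```

A page with k words yields k - 1 bi-grams, so a one-word page yields none.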
Compare the performance of the bi-gram-trained classifier to the standard classifier.
Use boosting to train five separate boolean classifiers. Each classifier should have a weighted vote, based on its precision. Compare the performance of the boosted classifiers to the original boolean classifier.
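The sketch below covers only the voting step, not the boosting procedure that trains the five classifiers. It assumes each classifier's weight is simply its precision on some evaluation set; the function names, votes, and weights are invented for illustration.

```python
def precision(predicted, actual):
    """True positives divided by all positive predictions (0.0 if none)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    return tp / (tp + fp) if (tp + fp) else 0.0


def weighted_vote(votes, weights):
    """Return True if the total weight voting True exceeds that voting False."""
    yes = sum(w for v, w in zip(votes, weights) if v)
    no = sum(w for v, w in zip(votes, weights) if not v)
    return yes > no


# Five hypothetical classifiers: their precisions and their votes on one page.
weights = [0.9, 0.8, 0.5, 0.5, 0.5]
votes = [True, True, False, False, False]
print(weighted_vote(votes, weights))
```

In this example a simple majority would say False, but the two high-precision classifiers outweigh the other three.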