CS 662 Assignment 8: Bayesian Learning
Assigned: Tuesday, November 7.
Due: Tuesday, November 21 at the start of class.
60 points total.
What to turn in: Written answers for questions 1 and 2. For question
3, hard copy of your source code.
Also, please put a copy of your code in the submit directory for this
class: /home/submit/cs662/(yourname)/assignment8. Everything necessary
to run your code should be in this directory. If anything out of the
ordinary is needed to run your code, please provide a README.
Question 1: Probability (10 points)
- (3 points) Do Problem 13.5 a-c on pp 489 of Russell and
  Norvig. (A royal straight flush is the A, K, Q, J, and 10 of a
  single suit.)
- (4 points) Do Problem 13.6 a-d on pp 489 of Russell and Norvig.
- (3 points) Do Problem 13.11 a-c on pp 490 of Russell and Norvig.
Question 2: Bayesian Networks (10 points)
Do Problem 14.3 a-e on pp 534 of Russell and Norvig (2 points each).
Question 3: Naive Bayes Spam Classification (40 points)
In this problem, you will implement a Naive Bayes classifier in Python
that can distinguish between spam and non-spam, or "ham".
Your program should be able to train on a set of spam and a set of
"ham." This training should include counting the frequency of each
token in both spam and ham corpora.
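For instance, a minimal training sketch might count tokens like this (the tokenizer, the error handling, and the directory names are illustrative assumptions, not requirements):

import os
import re
from collections import Counter

def count_tokens(directory):
    # Tally how often each token appears across every email in a directory.
    counts = Counter()
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, errors="ignore") as f:
            # A deliberately simple tokenizer: lowercase runs of letters and digits.
            counts.update(re.findall(r"[a-z0-9]+", f.read().lower()))
    return counts

# Example: spam_counts = count_tokens("dir2"); ham_counts = count_tokens("dir1")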
Your program should then be able to classify an unseen email as either
spam or ham by computing the MAP hypothesis:
P(spam | t1, t2, ..., tn) = alpha * P(t1, t2, ..., tn | spam) P(spam)
which we'll estimate as:
alpha * P(t1 | spam) P(t2 | spam) ... P(tn | spam) P(spam)
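Multiplying many small probabilities underflows floating point, so it is common to compare log-probabilities instead. A minimal sketch, assuming per-class token counts from training and add-one (Laplace) smoothing (both the smoothing and the priors are choices left to you):

import math

def log_score(tokens, class_counts, class_total, vocab_size, prior):
    # Unnormalized log P(class | t1, ..., tn) under the naive independence assumption.
    score = math.log(prior)
    for t in tokens:
        # Add-one smoothing keeps unseen tokens from driving the product to zero.
        p = (class_counts.get(t, 0) + 1) / (class_total + vocab_size)
        score += math.log(p)
    return score

# Classify as spam when the spam score beats the ham score, e.g.:
# is_spam = log_score(tokens, spam_counts, spam_total, vocab, p_spam) > \
#           log_score(tokens, ham_counts, ham_total, vocab, p_ham)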
You may build your program however you like, although we will still
expect you to use the good programming and design practices you have
learned throughout the semester. The only requirement is that we be
able to run your program exactly like this:
python ./nb.py -hamtrain dir1 -spamtrain dir2 -hamtest dir3 -spamtest dir4
Where dir1-4 are directories containing ham and spam emails used for
training and testing.
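One way (certainly not the only way) to accept exactly that invocation is argparse, which handles single-dash long options; a sketch:

import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Naive Bayes spam classifier")
    parser.add_argument("-hamtrain", required=True, help="directory of ham training emails")
    parser.add_argument("-spamtrain", required=True, help="directory of spam training emails")
    parser.add_argument("-hamtest", required=True, help="directory of ham test emails")
    parser.add_argument("-spamtest", required=True, help="directory of spam test emails")
    return parser.parse_args()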
Your program must print out its results in the following format:
Size of ham training set: 500 emails
Size of spam training set: 700 emails
Percentage of ham classified correctly: 98.2
Percentage of spam classified correctly: 97.0
Total accuracy: 97.5
False Positives: 1.8
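The numbers above are only an example. A sketch of producing that layout from your own counts (the variable names, and the assumption that "False Positives" means ham misclassified as spam as a percentage of the ham test set, are mine):

def report(n_ham_train, n_spam_train, ham_correct, n_ham_test, spam_correct, n_spam_test):
    ham_pct = 100.0 * ham_correct / n_ham_test
    spam_pct = 100.0 * spam_correct / n_spam_test
    total = 100.0 * (ham_correct + spam_correct) / (n_ham_test + n_spam_test)
    false_pos = 100.0 * (n_ham_test - ham_correct) / n_ham_test
    print("Size of ham training set: %d emails" % n_ham_train)
    print("Size of spam training set: %d emails" % n_spam_train)
    print("Percentage of ham classified correctly: %.1f" % ham_pct)
    print("Percentage of spam classified correctly: %.1f" % spam_pct)
    print("Total accuracy: %.1f" % total)
    print("False Positives: %.1f" % false_pos)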
There are several public repositories of email that can be used for
training and testing. We will be using the SpamAssassin
public corpus to evaluate your classifier. You are welcome to
download the data yourself; a copy is also in /home/public/cs662.
Note: If you are working on the lab machines, please do not copy all
of the data to your home directory. Instead, just make a link to
/home/public/cs662. This will keep you from exceeding your quota and
having to find a sysadmin to unlock your account for you.
You will find that there are a lot of decisions and tweaks you can
make to influence the performance of your classifier. For example,
should you look at all terms, or just English words? Should you treat
headers differently? When classifying an email, should you use all of
its words, or just the most significant? What are reasonable priors
for spam and non-spam? These decisions are up to you: you are
encouraged to experiment as much as possible.
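As one concrete example of such a tweak (my own sketch, in the spirit of the "most significant words" idea from Graham's article mentioned below), you might rank tokens by how far their estimated spam probability sits from a neutral 0.5 and classify using only the top k:

def most_significant(tokens, p_spam_given_token, k=15):
    # Keep the k tokens whose spam probability is farthest from 0.5,
    # i.e. the tokens that say the most about the message either way.
    known = [t for t in tokens if t in p_spam_given_token]
    known.sort(key=lambda t: abs(p_spam_given_token[t] - 0.5), reverse=True)
    return known[:k]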
Naive Bayes is a popular approach to spam filtering, and you will find
lots of resources on the Web. You are welcome to use the ideas (but NOT
the code) from any outside source, with the following condition:
You MUST give appropriate credit to any ideas you discover
elsewhere. For example, if you read Graham's article and notice that
he only uses the 15 most significant words in classifying an unseen
email and decide to take this approach, you should indicate in a
comment in your code that this idea is from Graham's
article. Include author and URL whenever possible. Students who use
other people's ideas without proper attribution will be penalized.
This part of your assignment will be graded as follows:
- 30 points: correctness, completeness, and style. The usual sorts of
  criteria apply.
- 10 points: performance. To compute your score on this, we will run
  each student's classifier on a dataset of our choosing and compute
  the following metric: accuracy minus the percentage of false
  positives. We will then score your performance as follows:
  - More than 1 standard deviation above the class average: 10 points
    (or above average, if more than 1 standard deviation is not possible)
  - Within one standard deviation of the class average: 8 points
  - More than 1 standard deviation below the class average: 6 points
In addition, the following extra credit is available:
- The top three scores will receive 3 points extra credit.
- Anyone whose classifier outperforms Brooks' will receive 3 points
  extra credit.
Late assignments are not eligible for extra credit.