# CS 662 Assignment 8: Bayesian Learning

Assigned: Thursday, November 1.
Due: Tuesday, November 20 at the start of class.
60 points total.

What to turn in: Written answers for questions 1 and 2, and a hard copy of your source code for question 3.

Question 1: Probability (10 points)
• (3 points) A bag contains k fair coins and one two-headed coin. Without looking, you select a coin at random and toss it n times. All n tosses are heads. What is the probability that the coin is fair?
• (4 points) Do problem 13.6, parts a-d, on p. 489 of Russell and Norvig.
• (3 points) Joe Student comes to his professor and tells him that he forgot to bring his project to hand in, and wants to turn it in tomorrow without penalty. The professor knows that 1 time in 100, a student completes the assignment but forgets to bring it. The professor also knows that 50% of the time, a student who hasn't completed the assignment will claim to have forgotten it. Finally, the professor believes that 90% of the students in the class completed the assignment.

What is the probability that the student actually completed the assignment?

Question 2: Bayesian Networks (10 points)

Do problem 14.2, parts a-e, on p. 534 of Russell and Norvig (2 points each).

Question 3: Naive Bayes Spam Classification (40 points)

In this problem, you will implement a Naive Bayes classifier in Python that can distinguish between spam and non-spam, or "ham".

Your program should be able to train on a set of spam and a set of "ham." This training should include counting the frequency of each token in both spam and ham corpora.
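The counting step can be sketched as follows. This is a minimal illustration, not a required design: it assumes one email per file and naive whitespace tokenization, and you will almost certainly want a smarter tokenizer (see the Details section below).

```python
import os
from collections import Counter

def count_tokens(directory):
    """Count token frequencies across all emails in a directory.

    Assumes one email per file and whitespace tokenization;
    both choices are illustrative, not required.
    """
    counts = Counter()
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        with open(path, errors="ignore") as f:
            counts.update(f.read().lower().split())
    return counts
```

You would run this once over the ham training directory and once over the spam training directory to get the two frequency tables your classifier needs.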

Your program should then be able to classify an unseen email as either spam or ham by computing the MAP hypothesis:

P(spam | t1, t2, ..., tn) = alpha * P(t1, t2, ..., tn | spam) P(spam)

which, using the naive independence assumption, we'll estimate as:

alpha * P(t1 | spam) P(t2 | spam) ... P(tn | spam) P(spam)
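In practice this product of many small probabilities underflows floating point, so it is usually computed as a sum of logs; unseen tokens also need some kind of smoothing. A minimal sketch of one common approach (add-one Laplace smoothing, which is one choice among several, not a requirement of this assignment):

```python
import math
from collections import Counter

def log_score(tokens, class_counts, class_prior, vocab_size):
    """log( P(class) * product of P(t | class) ), with add-one smoothing.

    class_counts: Counter of token frequencies for this class (spam or ham).
    class_prior:  estimated P(spam) or P(ham).
    vocab_size:   number of distinct tokens seen across both training sets.
    """
    total = sum(class_counts.values())
    score = math.log(class_prior)
    for t in tokens:
        # Add-one smoothing keeps unseen tokens from zeroing the product.
        score += math.log((class_counts[t] + 1) / (total + vocab_size))
    return score
```

To classify an email, compute this score once with the spam counts and once with the ham counts, and pick whichever class scores higher; alpha never needs to be computed, since it is the same for both hypotheses.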

You may build your program however you like, although we will still expect you to use the good programming and design practices you have learned throughout the semester. The only requirement is that we be able to run your program exactly like this:
```
./nb.py --hamtrain=dir1 --spamtrain=dir2 --hamtest=dir3 --spamtest=dir4
```
where dir1 through dir4 are directories containing the ham and spam emails used for training and testing.

Your program must print out its results in the following format:
```
Size of ham training set: 500 emails
Size of spam training set: 700 emails
Percentage of ham classified correctly: 98.2
Percentage of spam classified correctly: 97.0
Total accuracy: 97.5
False Positives: 1.8
```
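The reported percentages follow directly from the raw counts on the test sets; the arithmetic can be sketched like this (the function and variable names are illustrative, and "false positives" is read here as ham misclassified as spam, which is consistent with the sample output):

```python
def metrics(ham_total, ham_correct, spam_total, spam_correct):
    """Derive the four reported percentages from test-set counts.

    False positives = ham emails misclassified as spam,
    as a percentage of all ham emails tested.
    """
    ham_pct = 100.0 * ham_correct / ham_total
    spam_pct = 100.0 * spam_correct / spam_total
    total_pct = 100.0 * (ham_correct + spam_correct) / (ham_total + spam_total)
    false_pos = 100.0 * (ham_total - ham_correct) / ham_total
    return ham_pct, spam_pct, total_pct, false_pos
```

For example, 491 of 500 ham and 679 of 700 spam classified correctly yields 98.2, 97.0, 97.5, and 1.8, matching the sample output above.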
There are several public repositories of email that can be used for training and testing. We will be using the SpamAssassin public corpus to evaluate your classifier. You are welcome to download the data yourself; a copy is also in /home/public/cs662.

The SpamAssassin corpus contains spam, easy ham, and hard ham. We'll use the hard ham to test your classifier.

#### Details

You will find that there are a lot of decisions and tweaks you can make to influence the performance of your classifier. For example, should you look at all terms, or just English words? Should you treat headers differently? When classifying an email, should you use all of its words, or just the most significant? What are reasonable priors for spam and non-spam? What about trying to parse the email and only use some chunks? These decisions are up to you: you are encouraged to experiment as much as possible.

Naive Bayes is a popular approach to spam filtering, and you will find many resources on the Web. You are welcome to use the ideas (but NOT the code) from any outside source, with the following caveat:

You MUST give appropriate credit to any ideas you discover elsewhere. For example, if you read Graham's article and notice that he only uses the 15 most significant words in classifying an unseen email and decide to take this approach, you should indicate in your report (see below) that this idea is from Graham's article. Include author and URL whenever possible. Students who use other people's ideas without proper attribution will receive an automatic zero.