#### CS 662: AI Programming Homework #6: Decision Trees

Assigned: October 18

Due: October 25.

What to turn in: For question 1, a handwritten (or typed) description of how your decision tree was constructed. For question 2, turn in a copy of your Python code, and also place a copy of your code in your submit directory.

1. Decision trees (by hand) (10 points):

Complete the PlayTennis example we started in class by hand. For each node, show the entropy/information in the data set and the potential gain for each possible attribute. Also, show the final tree. The data set is included below.

| Day | Outlook  | Temperature | Humidity | Wind   | PlayTennis |
|-----|----------|-------------|----------|--------|------------|
| D1  | Sunny    | Hot         | High     | Weak   | No         |
| D2  | Sunny    | Hot         | High     | Strong | No         |
| D3  | Overcast | Hot         | High     | Weak   | Yes        |
| D4  | Rain     | Mild        | High     | Weak   | Yes        |
| D5  | Rain     | Cool        | Normal   | Weak   | Yes        |
| D6  | Rain     | Cool        | Normal   | Strong | No         |
| D7  | Overcast | Cool        | Normal   | Strong | Yes        |
| D8  | Sunny    | Mild        | High     | Weak   | No         |
| D9  | Sunny    | Cool        | Normal   | Weak   | Yes        |
| D10 | Rain     | Mild        | Normal   | Weak   | Yes        |
| D11 | Sunny    | Mild        | Normal   | Strong | Yes        |
| D12 | Overcast | Mild        | High     | Strong | Yes        |
| D13 | Overcast | Hot         | Normal   | Weak   | Yes        |
| D14 | Rain     | Mild        | High     | Strong | No         |
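If you want to check your arithmetic at the root, the entropy of the full data set (9 Yes, 5 No) can be computed directly. This is just a sanity check, not a substitute for the required hand construction:

```python
import math

# Entropy of the full PlayTennis set: 9 Yes and 5 No out of 14 examples.
p_yes, p_no = 9 / 14, 5 / 14
root_entropy = -(p_yes * math.log2(p_yes) + p_no * math.log2(p_no))
print(round(root_entropy, 3))  # 0.94
```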

2. Decision trees (in Python) (20 points):

In this part, you'll write Python code to construct a decision tree from a data file. (For reference, my solution is about 130 lines, including comments.)

Your code should be able to run from the command line, and either a) read in a training set, or b) read in a test set. For example:

python dt.py -train zoo

should read in the files zoo.csv (containing the data) and zoo.txt (containing the labels for each attribute), construct a decision tree representing the data, and write the decision tree out to a file called zoo.pickle (using pickle).

python dt.py -test zoo

should read in the tree stored in zoo.pickle and use the test data in zoo.test to determine the accuracy of your tree. It should print out a result indicating the fraction of test cases correctly classified.
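A minimal sketch of the command-line wrapper, assuming argparse and pickle; the tree-building and evaluation logic is whatever you write for the rest of the assignment:

```python
import argparse
import pickle

def parse_args(argv=None):
    # Match the "-train"/"-test" flags described above; exactly one is required.
    parser = argparse.ArgumentParser(description="Decision tree trainer/tester")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-train", metavar="STEM",
                       help="read STEM.csv and STEM.txt, write STEM.pickle")
    group.add_argument("-test", metavar="STEM",
                       help="read STEM.pickle, evaluate on STEM.test")
    return parser.parse_args(argv)

def save_tree(tree, stem):
    # Serialize the finished tree with pickle, as the spec requires.
    with open(stem + ".pickle", "wb") as f:
        pickle.dump(tree, f)

def load_tree(stem):
    with open(stem + ".pickle", "rb") as f:
        return pickle.load(f)
```

With this wrapper, `python dt.py -train zoo` yields `args.train == "zoo"`, and your main block can branch on which of the two attributes is set.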

Your decision tree program should be able to work on any dataset (don't hardcode in attributes or values).

In particular, it should be able to run on the following datasets:

• restaurant data:
• zoo data:

Note: The restaurant example has two classes (WillWait and WillNotWait), but many of the attributes have three or more values (for example, restaurant type). Given a set of circumstances, the tree should tell you whether you are willing to wait or not.

The zoo example has one unique trait (animal name), 15 boolean traits (encoded as 0/1), and two integer-valued traits (numberOfLegs and type). For type, 1=mammal, 2=bird, 3=reptile, 4=fish, 5=amphibian, 6=insect, 7=crustacean. You may ignore the 'animalName' trait. The task is to determine the animal's type (fish, mammal, etc.) from its other attributes.

You are welcome to change the formats of the data files if you wish (for example, replacing 0 with 'False' and 1 with 'True'). If you do so, please include copies of the data that correspond to your format in your submit directory, so that we can run your code.

How to write this:

I recommend starting from the bottom and working your way up - you are not required to follow the recipe below, but you may find it useful.

• Start by writing a function called entropy. It should take as input a list of values (for example, a list of class labels) and compute the entropy of that list.
• Now write a function called getRemainder. It should take as input a list of tuples, such as [('Rain', 'Yes'), ('Sunny', 'No'), ('Overcast', 'Yes')], where each tuple pairs an attribute value with the class that the corresponding data instance belonged to. getRemainder should then compute the remainder of that data set.
• Now, you can write a wrapper function called selectAttribute that takes as input a data set (a list of lists is a good representation), computes the remainder for each possible attribute, and selects the attribute with the highest information gain.
• You've got the tough parts now. The rest is building and/or traversing the tree. I'd suggest making a Node class, which contains a string indicating the attribute being tested, a float representing the entropy at that point, and (if it's not a leaf) a dictionary that maps attribute values to Nodes representing children.
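The first three bullets might be sketched as follows. Treating the last element of each row as the class label is an assumption of this sketch, not a requirement:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def getRemainder(pairs):
    """pairs: list of (attribute_value, class_label) tuples.
    Returns the weighted average entropy after splitting on that attribute."""
    groups = {}
    for value, label in pairs:
        groups.setdefault(value, []).append(label)
    total = len(pairs)
    return sum(len(labels) / total * entropy(labels)
               for labels in groups.values())

def selectAttribute(data):
    """data: list of rows, with the class label as the last element.
    Returns the index of the attribute with the highest information gain,
    which is equivalently the one with the lowest remainder."""
    remainders = {
        i: getRemainder([(row[i], row[-1]) for row in data])
        for i in range(len(data[0]) - 1)
    }
    return min(remainders, key=remainders.get)
```

On the PlayTennis data this reproduces the numbers from class: the root entropy is about 0.940, the remainder for Outlook is about 0.694, and Outlook is the selected attribute.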

You'll want to use recursion to build the tree; write a function called makeTree that recursively constructs the tree, following the pseudocode in the slides.

• Finally, you'll want a method called classify that should take as input a list of attributes and return the tree's predicted classification. It should do this by traversing the tree recursively.
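Putting the last two bullets together, a sketch of the Node class and the recursive makeTree/classify pair might look like this. It again assumes rows with the class label last, and repeats a small entropy helper so the sketch stands alone:

```python
import math
from collections import Counter

def _entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

class Node:
    """A decision-tree node: a leaf stores a class label; an internal
    node stores the index of the attribute tested and its children."""
    def __init__(self, attr=None, label=None, entropy=0.0):
        self.attr = attr          # attribute index tested (None for leaves)
        self.label = label        # class label (also majority fallback)
        self.entropy = entropy    # entropy of the data at this node
        self.children = {}        # attribute value -> child Node

def makeTree(data, attrs):
    """data: list of rows, class label last; attrs: indices still usable."""
    labels = [row[-1] for row in data]
    majority = Counter(labels).most_common(1)[0][0]
    # Base cases: the node is pure, or no attributes remain to split on.
    if len(set(labels)) == 1 or not attrs:
        return Node(label=majority, entropy=_entropy(labels))
    # Choose the attribute with the lowest remainder (highest gain).
    def remainder(i):
        groups = {}
        for row in data:
            groups.setdefault(row[i], []).append(row[-1])
        return sum(len(g) / len(data) * _entropy(g) for g in groups.values())
    best = min(attrs, key=remainder)
    node = Node(attr=best, entropy=_entropy(labels))
    node.label = majority  # fallback for attribute values never seen here
    for value in set(row[best] for row in data):
        subset = [row for row in data if row[best] == value]
        node.children[value] = makeTree(subset, [a for a in attrs if a != best])
    return node

def classify(node, instance):
    """Walk the tree recursively until a leaf is reached."""
    if not node.children:
        return node.label
    child = node.children.get(instance[node.attr])
    if child is None:          # unseen value: fall back to the majority label
        return node.label
    return classify(child, instance)
```

On the PlayTennis data, `makeTree(data, list(range(4)))` splits on Outlook at the root and classifies all 14 training rows correctly.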

Constructing the training and test sets

In order to evaluate the effectiveness of your decision tree, you will need to test it on data that was not used to construct the tree. This will require you to construct a training set and a test set.

For this homework, we will train on 80% of the data, and test on 20%. You should build a separate Python program that can randomly separate a data set into training and test sets. (Note - be sure this separation is random; don't just take the first 80% of the lines in the file).
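A random 80/20 split can be done in a few lines with the standard library; `random.shuffle` avoids the first-80%-of-lines pitfall mentioned above:

```python
import random

def split_data(rows, train_frac=0.8, seed=None):
    """Randomly partition rows into a training set and a test set."""
    rng = random.Random(seed)   # seed only to make runs reproducible
    shuffled = rows[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```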

We will use this to perform n-fold cross-validation, where n=5. In other words, repeat this five times and average the results:

1. Create a random training and test set.
2. Use the training set to construct a decision tree.
3. Measure the performance of the tree on the test set: what percentage of the test set was correctly classified? This is the tree's accuracy.
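The three steps above can be sketched as a small driver. The `build` and `classify_fn` parameters are placeholders for whatever tree-construction and classification functions you write:

```python
import random

def cross_validate(rows, build, classify_fn, runs=5, train_frac=0.8, seed=0):
    """Repeat a random train/test split `runs` times and average accuracy.
    build(train) returns a model; classify_fn(model, row) predicts a label.
    Assumes the class label is the last element of each row."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(runs):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_frac)
        train, test = shuffled[:cut], shuffled[cut:]
        model = build(train)
        correct = sum(classify_fn(model, row) == row[-1] for row in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)
```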

What is the average accuracy of your tree on the restaurant and zoo datasets?

Extra credit (5 points):

Both of the data sets above are examples of toy problems. They're nice for understanding how to build decision trees, but they're not real-world problems.

The following data is actual anonymized credit screening data. Your task is to build a decision tree that can accurately classify users into credit-worthy and non-credit-worthy classes. As above, perform five-fold cross-validation, using 80% training and 20% test sets.

There are two wrinkles that make this dataset interesting:

1. Some values are continuous. You must decide on a scheme for discretizing these values.
2. Some attributes have missing values. You must decide how to best deal with this problem. You may NOT edit the dataset; instead, your program must be robust enough to deal with missing data.
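Two simple schemes (among many) are median-threshold discretization and most-common-value imputation. The sketch below assumes a '?' marker for missing values and works on the rows in memory, so the data file itself is never edited:

```python
from collections import Counter

def median_threshold(values):
    """Median of a list of numbers; one simple place to cut a
    continuous attribute into 'low'/'high'."""
    s = sorted(values)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

def discretize_column(rows, col):
    """Replace a continuous column with 'low'/'high' around its median."""
    t = median_threshold([row[col] for row in rows])
    for row in rows:
        row[col] = "low" if row[col] <= t else "high"
    return t

def impute_missing(rows, col, missing="?"):
    """Replace missing markers with the column's most common known value."""
    known = [row[col] for row in rows if row[col] != missing]
    fill = Counter(known).most_common(1)[0][0]
    for row in rows:
        if row[col] == missing:
            row[col] = fill
    return fill
```

Finer-grained discretizations (quartiles, or information-gain-based cut points) and smarter missing-value strategies (per-class imputation, or fractional counts) are reasonable upgrades if you want to experiment.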