Homework #6: Decision Trees
Due: October 25.
What to turn in: For question 1, a handwritten (or typed) description of how your decision tree was constructed. For question 2, turn in a copy of your Python code, and also place a copy of your code in your submit directory.
Complete the PlayTennis example we started in class by hand. For each node, show the entropy/information in the data set and the potential gain for each possible attribute. Also, show the final tree. The data set is included below.
| Day | Outlook | Temperature | Humidity | Wind | PlayTennis |
| D1 | Sunny | Hot | High | Weak | No |
| D2 | Sunny | Hot | High | Strong | No |
| D3 | Overcast | Hot | High | Weak | Yes |
| D4 | Rain | Mild | High | Weak | Yes |
| D5 | Rain | Cool | Normal | Weak | Yes |
| D6 | Rain | Cool | Normal | Strong | No |
| D7 | Overcast | Cool | Normal | Strong | Yes |
| D8 | Sunny | Mild | High | Weak | No |
| D9 | Sunny | Cool | Normal | Weak | Yes |
| D10 | Rain | Mild | Normal | Weak | Yes |
| D11 | Sunny | Mild | Normal | Strong | Yes |
| D12 | Overcast | Mild | High | Strong | Yes |
| D13 | Overcast | Hot | Normal | Weak | Yes |
| D14 | Rain | Mild | High | Strong | No |
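If you want to check your hand calculations, the entropy and gain formulas can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `entropy` and `information_gain` and the (value, label) pair representation are assumptions made here, not something from the slides. The example uses the Wind and PlayTennis columns copied from the table above.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

def information_gain(pairs):
    """Gain from splitting on one attribute.

    pairs is a list of (attribute_value, class_label) tuples.
    """
    labels = [label for _, label in pairs]
    remainder = 0.0
    for value in {v for v, _ in pairs}:
        subset = [label for v, label in pairs if v == value]
        remainder += len(subset) / len(pairs) * entropy(subset)
    return entropy(labels) - remainder

# Wind and PlayTennis columns for D1..D14, copied from the table above.
wind = [("Weak", "No"), ("Strong", "No"), ("Weak", "Yes"), ("Weak", "Yes"),
        ("Weak", "Yes"), ("Strong", "No"), ("Strong", "Yes"), ("Weak", "No"),
        ("Weak", "Yes"), ("Weak", "Yes"), ("Strong", "Yes"), ("Strong", "Yes"),
        ("Weak", "Yes"), ("Strong", "No")]

print(round(entropy([label for _, label in wind]), 3))  # 0.94
print(round(information_gain(wind), 3))                 # 0.048
```

You still need to show the work by hand for every node and every candidate attribute; a script like this is just a way to catch arithmetic slips.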
In this part, you'll write Python code to construct a decision tree from a data file. (My solution is about 130 lines, including comments, btw.)
Your code should be able to run from the command line, and either a) read in a training set, or b) read in a test set. For example:
python dt.py -train zoo
should read in the files zoo.csv (containing the data) and zoo.txt (containing the labels for each attribute), construct a decision tree representing the data, and write the decision tree out to a file called zoo.pickle (using pickle).
python dt.py -test zoo
should read in the tree stored in zoo.pickle and use the test data in zoo.test to determine the accuracy of your tree. It should print out a result indicating the fraction of test cases correctly classified.
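One possible way to handle the two command-line modes and the pickle file is sketched below. It is a sketch under assumptions, not the required design: the helper names `parse_args`, `save_tree`, `load_tree`, and `fraction_correct` are made up for illustration, and the tree can be whatever picklable object your code builds. Only standard-library calls (`argparse`, `pickle`) are used.

```python
import argparse
import pickle

def parse_args(argv=None):
    """Accept exactly one of '-train NAME' or '-test NAME'."""
    parser = argparse.ArgumentParser(description="Build or test a decision tree.")
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("-train", metavar="NAME",
                      help="read NAME.csv and NAME.txt, write NAME.pickle")
    mode.add_argument("-test", metavar="NAME",
                      help="read NAME.pickle and NAME.test, print accuracy")
    return parser.parse_args(argv)

def save_tree(tree, name):
    """Write the tree to NAME.pickle."""
    with open(name + ".pickle", "wb") as f:
        pickle.dump(tree, f)

def load_tree(name):
    """Read the tree back from NAME.pickle."""
    with open(name + ".pickle", "rb") as f:
        return pickle.load(f)

def fraction_correct(predicted, actual):
    """Fraction of test cases whose predicted class matches the true class."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)
```

In dt.py you would call `parse_args()` with no arguments so that it reads `sys.argv`, then branch on whether `args.train` or `args.test` is set.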
Your decision tree program should be able to work on any dataset (don't hardcode in attributes or values).
In particular, it should be able to run on the following datasets:
Note: The restaurant example has two classes (WillWait and WillNotWait), but many of the attributes have three or more values (for example, restaurant type). Given a set of circumstances, the tree should tell you whether you are willing to wait or not.
The zoo example has one unique trait (animal name), 15 boolean traits (encoded as 0/1), and two integer-valued traits (numberOfLegs and type). For type, 1=mammal, 2=bird, 3=reptile, 4=fish, 5=amphibian, 6=insect, 7=crustacean. You may ignore the 'animalName' trait. The task is to determine the animal's type (fish, mammal, etc.) from its other attributes.
You are welcome to change the formats of the data files if you wish. (For example, replacing 0 with 'False' and 1 with 'True') If you do so, please include copies of the data that correspond to your format in your submit directory, so that we can run your code.
How to write this:
I recommend starting from the bottom and working your way up. You are not required to follow the recipe below, but you may find it useful.
You'll want to use recursion to build the tree; write a function called makeTree that recursively constructs the tree, following the pseudocode in the slides.
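To make the shape of the recursion concrete, here is a rough sketch of an ID3-style `makeTree`. It is not the pseudocode from the slides, and several things are assumptions made for illustration: each example is a dict mapping attribute names to values, the tree is a nested dict with the keys "attr", "default", and "branches", and the helpers `entropy`, `gain`, and `classify` are hypothetical names.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

def gain(rows, attr, target):
    """Information gain from splitting rows on attr."""
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder

def makeTree(rows, attrs, target, default=None):
    """Return a class label (leaf) or a nested dict (internal node)."""
    if not rows:
        return default                      # no examples: use parent's majority
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return majority                     # pure node, or nothing left to split on
    best = max(attrs, key=lambda a: gain(rows, a, target))
    remaining = [a for a in attrs if a != best]
    branches = {}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        branches[value] = makeTree(subset, remaining, target, majority)
    return {"attr": best, "default": majority, "branches": branches}

def classify(tree, row):
    """Walk the tree; fall back to the node's majority class for unseen values."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attr"]], tree["default"])
    return tree
```

Nothing here names a specific attribute or value, which is the property the assignment asks for; the attribute list and target name would come from the data and label files.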
Constructing the training and test sets
In order to evaluate the effectiveness of your decision tree, you will need to test it on data that was not used to construct the tree. This will require you to construct a training set and a test set.
For this homework, we will train on 80% of the data, and test on 20%. You should build a separate Python program that can randomly separate a data set into training and test sets. (Note - be sure this separation is random; don't just take the first 80% of the lines in the file).
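A random 80/20 split can be sketched as follows. This is a minimal sketch under assumptions: the names `split_rows` and `split_file` and the path parameters are hypothetical, and it assumes the data file has one example per line with no header row.

```python
import random

def split_rows(rows, train_fraction=0.8, seed=None):
    """Shuffle a copy of rows and cut it into (train, test)."""
    rng = random.Random(seed)          # pass a seed only if you want repeatable splits
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def split_file(data_path, train_path, test_path, train_fraction=0.8, seed=None):
    """Read one example per line and write the two randomly chosen subsets."""
    with open(data_path) as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    train, test = split_rows(lines, train_fraction, seed)
    with open(train_path, "w") as f:
        f.write("\n".join(train) + "\n")
    with open(test_path, "w") as f:
        f.write("\n".join(test) + "\n")
```

Shuffling a copy and slicing it guarantees that every example lands in exactly one of the two sets.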
We will use this to perform n-fold cross-validation, where n=5. In other words, repeat the random split, training, and testing five times and average the resulting accuracies.
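The repeat-and-average loop might look like the sketch below. The function name `average_accuracy` and its two callable parameters are assumptions for illustration: `build_tree(train_rows)` stands for whatever constructs your tree, and `accuracy(tree, test_rows)` for whatever returns the fraction of test cases classified correctly.

```python
import random

def average_accuracy(rows, build_tree, accuracy, n=5, train_fraction=0.8, seed=None):
    """Average test accuracy over n independent random train/test splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n):
        shuffled = list(rows)
        rng.shuffle(shuffled)                       # fresh random split each round
        cut = int(round(len(shuffled) * train_fraction))
        tree = build_tree(shuffled[:cut])           # train on 80%
        scores.append(accuracy(tree, shuffled[cut:]))  # test on the held-out 20%
    return sum(scores) / n
```

Note that this follows the repeated random split described above, so the five test sets may overlap; it does not partition the data into five disjoint folds.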
What is the average accuracy of your tree on the restaurant and zoo datasets?
Extra credit (5 points):
Both of the data sets above are examples of toy problems. They're nice for understanding how to build decision trees, but they're not real-world problems.
The following data is actual anonymized credit screening data. Your task is to build a decision tree that can accurately classify users into credit-worthy and non-credit-worthy classes. As above, perform five-fold cross-validation, using 80% training and 20% test sets.
There are two wrinkles that make this dataset interesting: