CS 662: AI Programming

Homework #6: Decision Trees

Assigned: October 18

Due: October 25

What to turn in: For question 1, turn in a handwritten (or typed) description of how your decision tree was constructed. For question 2, turn in a copy of your Python code, and also place a copy of the code in your submit directory.

  1. Decision trees (by hand) (10 points):

    Complete the PlayTennis example we started in class by hand. For each node, show the entropy/information in the data set and the potential gain for each possible attribute. Also, show the final tree. The data set is included below.

    Day  Outlook   Temperature  Humidity  Wind    PlayTennis
    D1   Sunny     Hot          High      Weak    No
    D2   Sunny     Hot          High      Strong  No
    D3   Overcast  Hot          High      Weak    Yes
    D4   Rain      Mild         High      Weak    Yes
    D5   Rain      Cool         Normal    Weak    Yes
    D6   Rain      Cool         Normal    Strong  No
    D7   Overcast  Cool         Normal    Strong  Yes
    D8   Sunny     Mild         High      Weak    No
    D9   Sunny     Cool         Normal    Weak    Yes
    D10  Rain      Mild         Normal    Weak    Yes
    D11  Sunny     Mild         Normal    Strong  Yes
    D12  Overcast  Mild         High      Strong  Yes
    D13  Overcast  Hot          Normal    Weak    Yes
    D14  Rain      Mild         High      Strong  No
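
    If you want to check your hand calculations, entropy and information gain can be sketched in a few lines of Python. This is only an illustration, not the required approach: the function names, the row layout (one tuple per example, class label last), and the toy data are all assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attr_index, label_index=-1):
    """Expected drop in entropy from splitting rows on one attribute.

    Assumes each row is a sequence of values with the class label
    at label_index (hypothetical layout, chosen for this sketch).
    """
    labels = [row[label_index] for row in rows]
    # Group the class labels by the value of the chosen attribute.
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_index], []).append(row[label_index])
    # Weighted average entropy of the partitions after the split.
    remainder = sum(len(part) / len(rows) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder

# Toy data (not the PlayTennis set): attribute 0 predicts the label
# perfectly, attribute 1 tells us nothing.
toy = [("a", "x", "Yes"), ("a", "y", "Yes"),
       ("b", "x", "No"), ("b", "y", "No")]
print(information_gain(toy, 0))  # 1.0
print(information_gain(toy, 1))  # 0.0
```

    At each node, the attribute with the largest gain is the one to split on.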

  2. Decision trees in Python (20 points):

    In this part, you'll write Python code to construct a decision tree from a data file. (For reference, my solution is about 130 lines, including comments.)

    Your code should run from the command line and either (a) read in a training set or (b) read in a test set. For example:

    python dt.py -train zoo

    should read in the files zoo.csv (containing the data) and zoo.txt (containing the labels for each attribute), construct a decision tree representing the data, and write the decision tree out to a file called zoo.pickle (using pickle).
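
    One possible way to handle the command line and the pickle file is sketched below. The helper names are hypothetical, and the nested-dictionary tree is just a stand-in; any picklable tree representation would work the same way.

```python
import pickle

def parse_args(argv):
    """Return (mode, stem) for arguments such as ['-train', 'zoo']."""
    if len(argv) != 2 or argv[0] not in ("-train", "-test"):
        raise SystemExit("usage: python dt.py (-train | -test) <name>")
    return argv[0][1:], argv[1]

def save_tree(tree, stem):
    """Write the tree to <stem>.pickle (binary mode is required)."""
    with open(stem + ".pickle", "wb") as f:
        pickle.dump(tree, f)

def load_tree(stem):
    """Read a tree back from <stem>.pickle."""
    with open(stem + ".pickle", "rb") as f:
        return pickle.load(f)
```

    In dt.py you would call parse_args(sys.argv[1:]) after importing sys, then branch on the returned mode.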

    Similarly,

    python dt.py -test zoo

    should read in the tree stored in zoo.pickle and use the test data in zoo.test to determine the accuracy of your tree. It should print the fraction of test cases classified correctly.
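
    As a rough sketch of the accuracy calculation, suppose the tree is a nested dictionary whose leaves are plain class labels, and each test row is a list of values with the class label last. Both of these are assumptions made for illustration; the assignment does not prescribe a tree representation, and you should check the actual layout of the test file.

```python
def classify(tree, row, default=None):
    """Follow the tree from the root until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        value = row[tree["attribute"]]
        if value not in tree["children"]:
            return default  # attribute value never seen during training
        tree = tree["children"][value]
    return tree

def accuracy(tree, rows, label_index=-1):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(1 for row in rows
                  if classify(tree, row) == row[label_index])
    return correct / len(rows)

# Hypothetical one-level tree and four test rows, two classified correctly.
example_tree = {"attribute": 0,
                "children": {"Sunny": "No", "Overcast": "Yes"}}
example_rows = [["Sunny", "No"], ["Overcast", "Yes"],
                ["Sunny", "Yes"], ["Rain", "Yes"]]
print(accuracy(example_tree, example_rows))  # 0.5
```

    Note that the sketch counts a row with an unseen attribute value as misclassified; returning the most common training label instead is another reasonable choice.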

    Your decision tree program should work on any dataset (do not hard-code attribute names or values).

    In particular, it should be able to run on the following datasets: