# Experiment 9: Decision Tree

This is the report of Experiment 9: Decision Tree.

# Purpose

In this experiment, we want to classify the given wine dataset with a decision tree.

I chose the ID3 algorithm, which uses information gain both to select the attribute to split on and to decide whether a node should become a leaf.

# Procedure

## Generating the Training and Testing Sets

We use 10-fold cross-validation to evaluate the reliability and robustness of the decision tree: the samples are split into 10 folds, and each fold serves once as the testing set while the remaining nine form the training set.
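As a minimal sketch, the folds can be generated as below (the shuffling seed, the pure-Python index handling, and the function names are illustrative assumptions, not the exact code used in the experiment):

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices once and split them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    fold_size = n_samples // k
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    # Spread any leftover samples over the first folds.
    for j, idx in enumerate(indices[k * fold_size:]):
        folds[j].append(idx)
    return folds

def train_test_splits(samples, k=10):
    """Yield (training set, testing set) pairs, one per fold."""
    folds = k_fold_indices(len(samples), k)
    for i in range(k):
        test = [samples[j] for j in folds[i]]
        train = [samples[j] for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```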

## Calculating Entropy and Information Gain

The Shannon entropy of a sample set $S$ over classes $\mathcal{C}$ is defined as:

$$E(S) = -\sum_{C_i \in \mathcal{C}} p_i \log_2 p_i$$

where $p_i$ is the proportion of samples in $S$ that belong to class $C_i$.

The information gain of splitting $S$ on an attribute $A$ is defined as:

$$\mathrm{Gain}(S,A) = E(S) - I(S,A) = E(S) - \sum_{\cup_i S_i = S}\frac{|S_i|}{|S|}E(S_i)$$

where the $S_i$ are the subsets of $S$ induced by the values of $A$.
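A minimal sketch of both quantities, assuming each sample is stored as a tuple of (already discretized) attribute values with the class label as the last element; this representation is an assumption made here for illustration, not taken from the report:

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Shannon entropy E(S), computed over the class labels (last element of each sample)."""
    counts = Counter(sample[-1] for sample in samples)
    total = len(samples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(samples, attribute):
    """Gain(S, A): E(S) minus the weighted entropy of the subsets S_i
    obtained by splitting S on the values of the attribute at the given index."""
    subsets = {}
    for sample in samples:
        subsets.setdefault(sample[attribute], []).append(sample)
    weighted = sum(len(s) / len(samples) * entropy(s) for s in subsets.values())
    return entropy(samples) - weighted
```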

## Pruning

We use a pre-pruning strategy: a threshold $\varepsilon$ is set on the information gain, and we choose $\varepsilon = 0.1$. When the best achievable information gain at a node is less than $\varepsilon$, the set is not split any further and the node becomes a leaf.
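The threshold check slots into the recursive ID3 construction roughly as follows; this reuses the `entropy` and `information_gain` helpers sketched above, and the dictionary-based node representation is an assumption made here for illustration:

```python
from collections import Counter

def majority_label(samples):
    """Most frequent class label among the samples, used for leaf nodes."""
    return Counter(sample[-1] for sample in samples).most_common(1)[0][0]

def build_tree(samples, attributes, epsilon=0.1):
    """Recursively build an ID3 tree with pre-pruning: stop splitting when
    the best information gain falls below epsilon."""
    labels = {sample[-1] for sample in samples}
    if len(labels) == 1 or not attributes:
        return {"label": majority_label(samples)}

    best_attr = max(attributes, key=lambda a: information_gain(samples, a))
    if information_gain(samples, best_attr) < epsilon:
        # Pre-pruning: the split is not informative enough, so make a leaf.
        return {"label": majority_label(samples)}

    node = {"attribute": best_attr, "splits": {}}
    remaining = [a for a in attributes if a != best_attr]
    for value in {sample[best_attr] for sample in samples}:
        subset = [s for s in samples if s[best_attr] == value]
        node["splits"][value] = build_tree(subset, remaining, epsilon)
    return node
```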

# Results

| Fold | Correct | Wrong | Accuracy |
|------|---------|-------|----------|
| 1    | 394     | 94    | 0.8074   |
| 2    | 359     | 129   | 0.7357   |
| 3    | 366     | 122   | 0.7500   |
| 4    | 401     | 87    | 0.8217   |
| 5    | 424     | 64    | 0.8689   |
| 6    | 367     | 121   | 0.7520   |
| 7    | 350     | 138   | 0.7172   |
| 8    | 358     | 130   | 0.7336   |
| 9    | 398     | 90    | 0.8156   |
| 10   | 405     | 83    | 0.8299   |

The mean accuracy is about 0.79625, which is close to the experimental expectation.

# About Visualizing the Tree

Unfortunately, I do not know how to visualize a tree with more than 100 nodes while also presenting the information stored at each node, so I gave up on visualizing it.