# Experiment 9: Decision Tree
# Purpose
In this experiment, we classify the given wine dataset with a decision tree.
I chose the ID3 algorithm, which selects the attribute with the highest information gain at each split and, combined with pre-pruning, uses that gain to decide whether a node should become a leaf.
# Procedure
## Generate the Training and Testing Sets
We use 10-fold cross-validation to assess the reliability and robustness of the decision tree: the data is split into ten folds, and each fold serves once as the test set while the remaining nine folds are used for training.
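The report does not include the splitting code, so the following is a minimal sketch of how the folds could be produced; the function names and the fixed random seed are illustrative assumptions, not the actual implementation.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validation_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices); each fold is the test set exactly once."""
    folds = k_fold_indices(n_samples, k, seed)
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]
```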
## Calculate Entropy and Information Gain
Shannon entropy is defined as:

$$H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k,$$

where $p_k$ is the proportion of samples in the set $D$ that belong to class $k$.
The information gain of an attribute $A$ is the reduction in entropy obtained by splitting $D$ on $A$:

$$\operatorname{Gain}(D, A) = H(D) - \sum_{v \in \operatorname{Values}(A)} \frac{|D_v|}{|D|}\, H(D_v),$$

where $D_v$ is the subset of $D$ whose value of attribute $A$ equals $v$.
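As a sketch, the two definitions above translate directly into Python (assuming class labels are stored in a list and attribute values have already been discretized to a finite set; the names are mine, not from the original code):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(D) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(samples, labels, attribute):
    """Gain(D, A): entropy reduction from splitting on one attribute index."""
    n = len(labels)
    subsets = {}
    for sample, label in zip(samples, labels):
        subsets.setdefault(sample[attribute], []).append(label)
    remainder = sum(len(ls) / n * entropy(ls) for ls in subsets.values())
    return entropy(labels) - remainder
```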
## Pruning
We use a pre-pruning strategy: a threshold on information gain, denoted $\epsilon$, is fixed before training, and when the best information gain at a node is less than $\epsilon$, the node is not split further and becomes a leaf.
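Below is a minimal sketch of how this pre-pruning check could sit inside the ID3 recursion, reusing `information_gain` from the sketch above. The `EPSILON` value is a placeholder, since the threshold actually used in the experiment is not preserved in this report.

```python
from collections import Counter

EPSILON = 0.1  # hypothetical threshold; the experiment's actual value is not shown above

def majority_class(labels):
    """Label assigned at a leaf: the most common class in the subset."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(samples, labels, attributes):
    """ID3 with pre-pruning (relies on information_gain from the previous sketch)."""
    if len(set(labels)) == 1 or not attributes:
        return majority_class(labels)
    best = max(attributes, key=lambda a: information_gain(samples, labels, a))
    # Pre-pruning: if even the best split gains less than EPSILON, stop here.
    if information_gain(samples, labels, best) < EPSILON:
        return majority_class(labels)
    children = {}
    for value in set(s[best] for s in samples):
        pairs = [(s, l) for s, l in zip(samples, labels) if s[best] == value]
        sub_samples = [p[0] for p in pairs]
        sub_labels = [p[1] for p in pairs]
        children[value] = build_tree(sub_samples, sub_labels,
                                     [a for a in attributes if a != best])
    return (best, children)  # internal node: (attribute index, children by value)
```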
# Results
Per-fold results of the 10-fold cross-validation (each fold's test set contains 488 samples):

| Fold | Correct | Wrong | Accuracy |
|------|---------|-------|----------|
| 1    | 394     | 94    | 0.8074   |
| 2    | 359     | 129   | 0.7357   |
| 3    | 366     | 122   | 0.7500   |
| 4    | 401     | 87    | 0.8217   |
| 5    | 424     | 64    | 0.8689   |
| 6    | 367     | 121   | 0.7520   |
| 7    | 350     | 138   | 0.7172   |
| 8    | 358     | 130   | 0.7336   |
| 9    | 398     | 90    | 0.8156   |
| 10   | 405     | 83    | 0.8299   |
The mean accuracy is about 0.783 (3822 correct predictions out of 4880 in total), which is close to the experimental expectation.
# About Visualizing the Tree
Unfortunately, the trained tree contains more than 100 nodes, each of which carries information that would also need to be displayed, and I did not find a practical way to render a tree of that size, so I omitted the visualization.