
Decision Tree

A Decision Tree is a supervised machine learning algorithm used for classification and regression tasks. It works by recursively splitting the dataset into smaller subsets based on feature values, forming a tree-like structure in which each internal node represents a decision rule.

Common splitting criteria include (a small code sketch follows this list):

  • Entropy / Information Gain (used in classification)

  • Mean Squared Error (MSE) (used in regression)
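
A minimal Python sketch of these two criteria, assuming a plain list of class labels for classification and a list of numeric targets for regression; the helper names entropy and mse are illustrative, not from the original post.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels (classification criterion)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def mse(values):
    """Mean squared error around the subset mean (regression criterion)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

print(round(entropy(["Yes", "Yes", "No"]), 3))  # 0.918 for a 2-vs-1 split
print(mse([10.0, 12.0, 14.0, 16.0]))            # 5.0
```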

Advantages:

✅ Easy to interpret and visualize.
✅ Handles both numerical and categorical data.
✅ Requires little data preprocessing (e.g., no need for feature scaling).

Disadvantages:

❌ Prone to overfitting (if not pruned properly).
❌ Sensitive to noisy data.
❌ Can become biased if one class dominates.

Common Decision Tree Algorithms:

  • ID3 (Iterative Dichotomiser 3) – Uses entropy/information gain.
  • C4.5 – An improvement over ID3, supports missing values.
  • CART (Classification and Regression Trees) – Uses Gini impurity and MSE (see the scikit-learn sketch below).
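
In practice, most libraries implement the CART family. Below is a minimal, hedged scikit-learn sketch; the tiny XOR-style dataset is illustrative only, and the criterion can be switched between "gini" and "entropy".

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny illustrative dataset: two binary features, binary target (an XOR-style pattern).
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

# criterion="entropy" uses information gain; criterion="gini" uses Gini impurity.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)
print(export_text(clf, feature_names=["f0", "f1"]))
```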

1️⃣ Compute Entropy(S) for the entire dataset.
2️⃣ Compute Entropy(A) for each attribute.
3️⃣ Compute Information Gain (IG) for each attribute.
4️⃣ Select the attribute with highest IG and split the dataset.
5️⃣ Recursively repeat Steps 2-4 for each subset.
6️⃣ Stop when all instances belong to one class or no attributes remain.
7️⃣ The resulting decision tree classifies new data points (see the sketch after this list).
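
The following is a compact Python sketch of that loop, assuming the data is a list of dicts and the target column is named "Play Golf"; function names such as id3 and information_gain are illustrative, not from the post.

```python
import math
from collections import Counter

def entropy(rows, target):
    counts = Counter(row[target] for row in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(rows, attr, target):
    weighted = 0.0
    for value in set(row[attr] for row in rows):
        subset = [row for row in rows if row[attr] == value]
        weighted += len(subset) / len(rows) * entropy(subset, target)
    return entropy(rows, target) - weighted

def id3(rows, attributes, target):
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:                      # all instances belong to one class
        return labels[0]
    if not attributes:                             # no attributes remain: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    return {best: {value: id3([r for r in rows if r[best] == value], remaining, target)
                   for value in set(row[best] for row in rows)}}
```

Called with the 14 Play Golf rows as a list of dicts and the attributes ["Outlook", "Temperature", "Humidity", "Windy"], this returns a nested dict with Outlook at the root, matching the walkthrough below.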

Let's go step by step to calculate Entropy and Information Gain for the Play Golf dataset.

Sample Dataset

The running example is the 14-row Play Golf dataset, with four attributes (Outlook, Temperature, Humidity, Windy) and a Yes/No target (Play Golf).



  • Total Instances = 14
  • Play Golf (Yes) = 9, Play Golf (No) = 5

    Step 1: Compute the Entropy of the Entire Dataset

    Entropy is a measure of uncertainty or disorder in the dataset. A higher entropy value means more uncertainty.

    Formula for Entropy:

    Entropy(S) = - \sum p(x) \log_2 p(x)

    where:

    • p(x) is the probability of each class (Yes/No in our case),
    • S is the dataset.

    Plugging in the class counts from above (9 Yes and 5 No out of 14):

    Entropy(S) = - \left(\frac{9}{14} \log_2 \frac{9}{14}\right) - \left(\frac{5}{14} \log_2 \frac{5}{14}\right) = 0.94

    So, the entropy of the dataset is 0.94 (a quick numeric check follows below).
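
    A quick numeric check of that value, using only the class counts 9 and 5 from above:

```python
import math

# Entropy of the full dataset: 9 "Yes" and 5 "No" out of 14 instances.
p_yes, p_no = 9 / 14, 5 / 14
print(round(-(p_yes * math.log2(p_yes) + p_no * math.log2(p_no)), 2))  # 0.94
```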


    Step 2: Compute Entropy for Each Attribute

    Now, we calculate the entropy of each attribute's values to check how well the attribute splits the data. For Outlook, the per-value entropies are Entropy(Sunny) = 0.97, Entropy(Overcast) = 0 (all instances are Yes), and Entropy(Rainy) = 0.97.




    Step 3: Compute the Weighted Entropy of the Attribute

    To compute the overall entropy of Outlook, we take the weighted sum of the entropy values:

    What is Weighted Entropy?

    When we split the dataset based on an attribute (e.g., "Outlook"), we create subsets of the data. Each subset has its own entropy (level of uncertainty).

    Formula for Weighted Entropy

    Entropy(Attribute) = \sum \left(\frac{|S_i|}{|S|} \times Entropy(S_i)\right)


    Entropy(Outlook) = \left(\frac{5}{14} \times 0.97\right) + \left(\frac{4}{14} \times 0\right) + \left(\frac{5}{14} \times 0.97\right)

    = 0.346 + 0 + 0.346 = 0.69
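
    The same weighted sum as a small Python sketch; the subset sizes and per-value entropies are taken from the Outlook split above, and the helper name weighted_entropy is mine:

```python
def weighted_entropy(subsets):
    """subsets: (size, entropy) pairs, one per attribute value."""
    total = sum(size for size, _ in subsets)
    return sum(size / total * e for size, e in subsets)

# Outlook split: Sunny (5 rows, 0.97), Overcast (4 rows, 0.0), Rainy (5 rows, 0.97)
print(round(weighted_entropy([(5, 0.97), (4, 0.0), (5, 0.97)]), 2))  # 0.69
```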

    Step 4: Compute Information Gain for the Attribute

    Information Gain tells us how much entropy is reduced after splitting.

    Formula for Information Gain (IG):

    IG(A) = Entropy(S) - Entropy(A)

    For Outlook:

    IG(Outlook) = 0.94 - 0.69 = 0.25

    We repeat this calculation for all attributes (Humidity, Windy, Temperature).


    Step 5: Select the Attribute with the Highest Information Gain

    We compare the Information Gain (IG) of all attributes and choose the one with the highest IG as the splitting criterion.


    Let's work through the calculation for each attribute in detail.

    Step 1: Recall the Formula for Information Gain (IG)

    The Information Gain (IG) of an attribute is calculated as:

    IG(Attribute) = Entropy(Parent) - Entropy(Attribute)
    • Entropy(Parent): Entropy of the entire dataset before splitting.
    • Entropy(Attribute): Weighted entropy after splitting by that attribute.

    Step 2: Entropy of the Parent (Before Splitting)

    Before splitting, we calculate the overall entropy of the dataset. Suppose we have 14 instances in the dataset:

    Play Golf | Count
    Yes | 9
    No | 5

    The entropy of the entire dataset is:

    Entropy(Play Golf) = - \left(\frac{9}{14} \log_2 \frac{9}{14}\right) - \left(\frac{5}{14} \log_2 \frac{5}{14}\right)

    = - (0.642 \times \log_2 0.642) - (0.357 \times \log_2 0.357)

    = - (0.642 \times -0.64) - (0.357 \times -1.48)

    = 0.94

    So, Entropy(Parent) = 0.94.


    Step 3: Compute Entropy for Each Attribute

    (1) Entropy of "Windy"

    "Windy" has two possible values: TRUE and FALSE. We split the dataset accordingly.

    Windy | Play Golf (Yes) | Play Golf (No) | Total Instances
    TRUE | 3 | 3 | 6
    FALSE | 6 | 2 | 8

    Calculate Entropy for Windy=TRUE

    Entropy(Windy=TRUE) = - \left(\frac{3}{6} \log_2 \frac{3}{6}\right) - \left(\frac{3}{6} \log_2 \frac{3}{6}\right)

    = - (0.5 \times \log_2 0.5) - (0.5 \times \log_2 0.5) = - (0.5 \times -1) - (0.5 \times -1) = 1.0

    Calculate Entropy for Windy=FALSE

    Entropy(Windy=FALSE) = - \left(\frac{6}{8} \log_2 \frac{6}{8}\right) - \left(\frac{2}{8} \log_2 \frac{2}{8}\right)

    = - (0.75 \times \log_2 0.75) - (0.25 \times \log_2 0.25) = - (0.75 \times -0.415) - (0.25 \times -2) = 0.81

    Weighted Entropy of Windy

    Entropy(Windy) = \left(\frac{6}{14} \times 1.0\right) + \left(\frac{8}{14} \times 0.81\right) = 0.428 + 0.463 = 0.89

    Information Gain for Windy

    IG(Windy) = 0.94 - 0.89 = 0.05

    (2) Entropy of "Humidity"

    "Humidity" has two values: High and Normal.

    Humidity | Play Golf (Yes) | Play Golf (No) | Total Instances
    High | 3 | 4 | 7
    Normal | 6 | 1 | 7

    Calculate Entropy for Humidity=High

    Entropy(High) = - \left(\frac{3}{7} \log_2 \frac{3}{7}\right) - \left(\frac{4}{7} \log_2 \frac{4}{7}\right)

    = - (0.428 \times \log_2 0.428) - (0.571 \times \log_2 0.571) = - (0.428 \times -1.23) - (0.571 \times -0.80) = 0.99

    Calculate Entropy for Humidity=Normal

    Entropy(Normal) = - \left(\frac{6}{7} \log_2 \frac{6}{7}\right) - \left(\frac{1}{7} \log_2 \frac{1}{7}\right)

    = - (0.857 \times \log_2 0.857) - (0.142 \times \log_2 0.142)

    = - (0.857 \times -0.22) - (0.142 \times -2.81)

    = 0.59

    Weighted Entropy of Humidity

    Entropy(Humidity) = \left(\frac{7}{14} \times 0.99\right) + \left(\frac{7}{14} \times 0.59\right) = 0.495 + 0.295 = 0.79

    Information Gain for Humidity

    IG(Humidity) = 0.94 - 0.79 = 0.15

    (3) Entropy of "Temperature"

    "Temperature" has three values: Hot, Mild, Cool.

    Temperature | Play Golf (Yes) | Play Golf (No) | Total Instances
    Hot | 2 | 2 | 4
    Mild | 4 | 2 | 6
    Cool | 3 | 1 | 4

    Using the same process as above, we calculate:

    The per-value entropies are Entropy(Hot) = 1.0, Entropy(Mild) = 0.92, and Entropy(Cool) = 0.81, so:

    Entropy(Temperature) = \left(\frac{4}{14} \times 1.0\right) + \left(\frac{6}{14} \times 0.92\right) + \left(\frac{4}{14} \times 0.81\right) = 0.91

    IG(Temperature) = 0.94 - 0.91 = 0.03

    Step 4: Choose the Attribute with Highest Information Gain

    Attribute | Information Gain
    Outlook | 0.25
    Humidity | 0.15
    Windy | 0.05
    Temperature | 0.03

    Since Outlook has the highest IG (0.25), it becomes the root node in our decision tree.
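
    A short, self-contained Python sketch that reproduces this table from the per-value class counts listed above (the Outlook counts of 3 Yes / 2 No for Sunny and Rainy, and 4 / 0 for Overcast, are read off the subset tables further below):

```python
import math

def entropy(counts):
    """Entropy from a list of class counts, e.g. [9, 5]."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, value_counts):
    """IG = Entropy(parent) minus the weighted entropy over the attribute's values."""
    total = sum(parent_counts)
    weighted = sum(sum(v) / total * entropy(v) for v in value_counts)
    return entropy(parent_counts) - weighted

parent = [9, 5]  # Play Golf: Yes, No
attributes = {
    "Outlook":     [[3, 2], [4, 0], [3, 2]],   # Sunny, Overcast, Rainy
    "Humidity":    [[3, 4], [6, 1]],           # High, Normal
    "Windy":       [[3, 3], [6, 2]],           # TRUE, FALSE
    "Temperature": [[2, 2], [4, 2], [3, 1]],   # Hot, Mild, Cool
}
for name, counts in attributes.items():
    print(name, round(information_gain(parent, counts), 2))
# Outlook 0.25, Humidity 0.15, Windy 0.05, Temperature 0.03
```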


    Step 6: Repeat the Process for Each Subset

    Now, we repeat Steps 2-5 for the sub-groups (Sunny, Overcast, Rainy).

    1. For Overcast, since entropy is 0, it's a leaf node (all "Yes").
    2. For Sunny and Rainy, we continue splitting using other attributes (Humidity, Windy, etc.).

    We repeat this process recursively until:

    • All subsets are pure (Entropy = 0).
    • No more attributes are left to split.

    At this stage, we have selected Outlook as the root node because it has the highest Information Gain (IG = 0.25).



    This means the dataset is now divided into three subsets based on the values of Outlook:

    1. Overcast Subset
    2. Sunny Subset
    3. Rainy Subset

    Step 6.1: Overcast Subset

    Outlook | Temperature | Humidity | Windy | Play Golf
    Overcast | Hot | High | FALSE | Yes
    Overcast | Cool | Normal | TRUE | Yes
    Overcast | Mild | High | FALSE | Yes
    Overcast | Hot | Normal | FALSE | Yes

    • In this subset, all instances have Play Golf = Yes.
    • Since there is no variation in the target variable, the entropy is 0.
    • Conclusion: Overcast becomes a leaf node labeled "Yes" (pure subset).

    No further splitting is needed for Overcast.


    Step 6.2: Sunny Subset

    Outlook | Temperature | Humidity | Windy | Play Golf
    Sunny | Mild | High | FALSE | Yes
    Sunny | Cool | Normal | TRUE | No
    Sunny | Cool | Normal | FALSE | Yes
    Sunny | Mild | Normal | FALSE | Yes
    Sunny | Mild | High | TRUE | No

    Since this subset contains both Yes and No values, it is not pure. We need to continue splitting.

    Step 6.2.1: Calculate Entropy for the Sunny Subset

    Play Golf | Count
    Yes | 3
    No | 2

    Entropy(Sunny) = - \left(\frac{3}{5} \log_2 \frac{3}{5}\right) - \left(\frac{2}{5} \log_2 \frac{2}{5}\right) = - (0.6 \times -0.737) - (0.4 \times -1.322) = 0.97

    Since entropy > 0, we must split further. We now compute Information Gain for attributes in this subset (Humidity, Windy, Temperature).
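
    A small sketch that recomputes these gains directly from the five Sunny rows shown above; the row encoding and helper names are mine:

```python
import math
from collections import Counter

# The five Sunny rows from the table above: (Temperature, Humidity, Windy, Play Golf)
rows = [
    ("Mild", "High",   "FALSE", "Yes"),
    ("Cool", "Normal", "TRUE",  "No"),
    ("Cool", "Normal", "FALSE", "Yes"),
    ("Mild", "Normal", "FALSE", "Yes"),
    ("Mild", "High",   "TRUE",  "No"),
]

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr_index):
    parent = entropy([r[-1] for r in rows])
    weighted = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return parent - weighted

for name, idx in [("Temperature", 0), ("Humidity", 1), ("Windy", 2)]:
    print(name, round(information_gain(rows, idx), 2))
# Temperature 0.02, Humidity 0.02, Windy 0.97
```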

    Step 6.2.2: Choose Best Split for Sunny

    From the subset table above, Windy separates the Sunny instances perfectly (Windy = TRUE → both No, Windy = FALSE → all Yes), which gives:

    • IG(Humidity) = 0.02
    • IG(Windy) = 0.97 (highest IG)
    • IG(Temperature) = 0.02

    Since Windy has the highest IG (0.97), we split the Sunny subset based on Windy.

    Windy is the next decision node for Sunny. Now, we split Sunny into two branches:

    • Sunny & Windy = TRUE
    • Sunny & Windy = FALSE

    Subsets:

    1. Sunny & Windy = TRUE → (Entropy = 0) → Leaf Node "No"
    2. Sunny & Windy = FALSE → (Entropy = 0) → Leaf Node "Yes"

    Since these subsets are pure, we stop here.


    Step 6.3: Rainy Subset

    Outlook | Temperature | Humidity | Windy | Play Golf
    Rainy | Mild | High | FALSE | Yes
    Rainy | Cool | Normal | FALSE | Yes
    Rainy | Cool | Normal | TRUE | No
    Rainy | Mild | Normal | FALSE | Yes
    Rainy | Mild | High | TRUE | No

    Since this subset contains both Yes and No, we need to continue splitting.

    Step 6.3.1: Calculate Entropy for Rainy

    Play Golf | Count
    Yes | 3
    No | 2

    Entropy(Rainy) = - \left(\frac{3}{5} \log_2 \frac{3}{5}\right) - \left(\frac{2}{5} \log_2 \frac{2}{5}\right) = 0.97

    Since entropy > 0, we check Humidity, Windy, and Temperature for the highest Information Gain.

    Step 6.3.2: Choose Best Split for Rainy

    As with the Sunny branch, Windy separates the Rainy instances perfectly (Windy = TRUE → both No, Windy = FALSE → all Yes), which gives:

    • IG(Humidity) = 0.02
    • IG(Windy) = 0.97 (highest IG)
    • IG(Temperature) = 0.02

    Since Windy has the highest IG (0.97), we split Rainy on Windy.

    Windy is the next decision node for Rainy. Now, we split Rainy into two branches:

    • Rainy & Windy = TRUE
    • Rainy & Windy = FALSE

    Subsets:

    1. Rainy & Windy = TRUE → (Entropy = 0) → Leaf Node "No"
    2. Rainy & Windy = FALSE → (Entropy = 0) → Leaf Node "Yes"

    Since these subsets are pure, we stop here.


    Step 7: Stop When a Stopping Condition is Met

    A stopping condition could be:

    • All instances belong to the same class (Yes or No).
    • No more attributes left.
    • Tree reaches a maximum depth.

    Once we stop, the final Decision Tree is formed.
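
    In practice, libraries expose these stopping conditions as pre-pruning hyperparameters. A minimal scikit-learn sketch, with illustrative parameter values and the built-in Iris data standing in for Play Golf:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning via stopping conditions: a depth limit and a minimum leaf size.
clf = DecisionTreeClassifier(
    criterion="entropy",   # information-gain-style splitting
    max_depth=3,           # stop once the tree reaches a maximum depth
    min_samples_leaf=5,    # stop before creating leaves with too few samples
    random_state=0,
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())
```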


    Final Decision Tree
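
    Written out as a simple prediction function (leaf labels read off the subset tables above; the function name is illustrative), the tree derived in this walkthrough is:

```python
def predict_play_golf(outlook, windy):
    """Decision tree from this walkthrough: Outlook at the root, then Windy."""
    if outlook == "Overcast":
        return "Yes"                     # pure leaf
    if outlook in ("Sunny", "Rainy"):
        return "No" if windy else "Yes"  # Windy = TRUE -> No, FALSE -> Yes
    raise ValueError(f"unknown outlook: {outlook!r}")

print(predict_play_golf("Sunny", windy=False))  # Yes
```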



