Decision Trees

Understanding Decision Trees

  • Decision Trees are a supervised machine learning algorithm that can be used for both classification and regression problems.

  • The structure looks like an upside-down tree or a flow chart.

  • It breaks down the data by asking a series of questions.

  • It follows a divide-and-conquer (recursive partitioning) strategy.

  • Ross Quinlan built ID3, one of the earliest decision tree algorithms, in the early 1980s. (Handles only categorical data.)

  • Quinlan later released an improved version, C4.5, in 1993. (Handles both categorical and numerical data.)

  • In 1984, L. Breiman, J. Friedman, R. Olshen, and C. Stone developed CART (Classification And Regression Trees).

  • Data preparation has to be done before building the model; you can refer to the Data Pre-Processing section.

  • It uses Entropy, Information Gain, and Gain Ratio calculations to identify the fields that have good predictive power.

Terminologies

  • Root Node: every decision tree has exactly one root node; this is the starting point.

  • Decision Node: an internal node that uses a splitting criterion to split the data; it has a parent node.

  • Terminal Node: a leaf node is a terminal node; there is no further splitting below it.

Advantages:

  • No scaling is required; robust to outliers; handles highly skewed data.

  • Does not rely on the data distribution, i.e., non-parametric.

  • Easy to understand & interpret.

  • Can handle missing values (CART): either treat them as a separate category or use surrogate splits.

Disadvantages:

  • Underfitting can occur with too little data; overfitting can occur with noisy data and an exhaustive search.

  • To balance overfitting and underfitting, we have to set a threshold on the proportion of data points required at each leaf node (e.g., between 0.30 and 1.0).

  • It is hard to select features when they are strongly correlated with each other.

We are going to use the following table to explain the calculations used in the different decision tree building approaches.


Entropy: measures the uncertainty (impurity) of a set of examples with respect to the target feature. For a binary target it ranges between 0 and 1. A feature whose splits leave higher entropy carries more uncertainty, so we prefer splits that yield lower entropy while building a decision tree.



Information Gain Ratio: the shortcoming of Information Gain is fixed by the Information Gain Ratio. Whenever two features have the same Information Gain, the Gain Ratio prefers the feature with fewer distinct values.

Gini Impurity or Gini Index: the Gini Index of a dataset indicates the likelihood of unseen data being misclassified if it were given a random class label according to the class distribution in the dataset. For a binary target it ranges between 0 and 0.5. (A related cost-complexity pruning parameter ranges between 0 and 1; a higher value prunes more nodes.)


Index  Age    Income  Student  Credit_Rating  Buys_Computer
1      <=30   High    No       Fair           No
2      <=30   High    No       Excellent      No
3      31-40  High    No       Fair           Yes
4      >40    Medium  No       Fair           Yes
5      >40    Low     Yes      Fair           Yes
6      >40    Low     Yes      Excellent      No
7      31-40  Low     Yes      Excellent      Yes
8      <=30   Medium  No       Fair           No
9      <=30   Low     Yes      Fair           Yes
10     >40    Medium  Yes      Fair           Yes
11     <=30   Medium  Yes      Excellent      Yes
12     31-40  Medium  No       Excellent      Yes
13     31-40  High    No       Fair           Yes
14     >40    Medium  Yes      Excellent      No

a) ID3 (Iterative Dichotomiser 3)

Use: When all features are categorical and simplicity is preferred

Avoid: With continuous variables or when overfitting is a concern

How trees are built:

1. Uses Information Gain for splitting

2. Only handles categorical attributes

3. Builds the tree top-down, without pruning

Pros:

1. Simple and fast.

2. Good for datasets with categorical features

Cons:

1. Prone to overfitting

2. Cannot handle continuous attributes

Alternative: Use C4.5 or CART for more robust trees

Limitations and Cons:

1. Only works with categorical features.

2. Prone to overfitting, especially with noisy data.

3. Biased towards attributes with many values.

4. Cannot handle missing values.

5. Does not support pruning.

Solutions:

1. Use discretization techniques for continuous features (see the sketch after this list)

2. Implement early stopping or set a minimum number of samples per leaf

3. Use C4.5 or CART which address many of these limitations

4. Ensemble methods like Random Forests can mitigate overfitting
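Solutions 1 and 2 can be sketched in a few lines of Python, assuming pandas and scikit-learn are available; the column names and toy values are invented for illustration, and scikit-learn's tree is CART-based rather than a true ID3.

```python
# A sketch of solutions 1 and 2: discretize a continuous feature, then limit
# leaf size to curb overfitting. Column names and values are illustrative only.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "age_years": [22, 25, 28, 35, 38, 45, 52, 60],
    "buys_computer": ["No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No"],
})

# Solution 1: discretize the continuous feature into the bins used in the table.
df["age_bin"] = pd.cut(df["age_years"], bins=[0, 30, 40, 120],
                       labels=["<=30", "31-40", ">40"])

# Solution 2: early stopping via a minimum number of samples per leaf.
X = pd.get_dummies(df[["age_bin"]])   # one-hot encode the categorical bins
y = df["buys_computer"]
tree = DecisionTreeClassifier(min_samples_leaf=2, random_state=0).fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```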

ID3 uses Information Gain:

1. It calculates the reduction in entropy achieved by a split.

2. It measures how well a feature separates the target feature.

3. We prefer the feature with the highest Information Gain.

4. Information Gain is biased towards features with many distinct values: if two features have the same Information Gain, the one with more distinct values is preferred for splitting, even though the feature with fewer distinct values takes fewer resources and less time. This is the shortcoming of Information Gain that the Gain Ratio addresses.

The following steps are involved in evaluating Information Gain

Step 1: Calculate entropy of the target variable (Buys_Computer)

E(S) = -p(Yes)log₂(p(Yes)) - p(No)log₂(p(No)) 

= -(9/14)log₂(9/14) - (5/14)log₂(5/14) 

≈ 0.940

Step 2: Calculate entropy for each value of Age

Index  Age    Income  Student  Credit_Rating  Buys_Computer
1      <=30   High    No       Fair           No
2      <=30   High    No       Excellent      No
8      <=30   Medium  No       Fair           No
9      <=30   Low     Yes      Fair           Yes
11     <=30   Medium  Yes      Excellent      Yes

E(S|Age<=30) = -(2/5)log₂(2/5) - (3/5)log₂(3/5) ≈ 0.971

----------------------------------------

Index  Age    Income  Student  Credit_Rating  Buys_Computer
3      31-40  High    No       Fair           Yes
7      31-40  Low     Yes      Excellent      Yes
12     31-40  Medium  No       Excellent      Yes
13     31-40  High    No       Fair           Yes

E(S|Age=31-40) = -(4/4)log₂(4/4) - (0/4)log₂(0/4) = 0 (taking 0·log₂0 = 0)

----------------------------------------

Index  Age    Income  Student  Credit_Rating  Buys_Computer
4      >40    Medium  No       Fair           Yes
5      >40    Low     Yes      Fair           Yes
6      >40    Low     Yes      Excellent      No
10     >40    Medium  Yes      Fair           Yes
14     >40    Medium  Yes      Excellent      No

E(S|Age>40) = -(3/5)log₂(3/5) - (2/5)log₂(2/5) ≈ 0.971

Step 3: Calculate Information Gain 

IG(S,Age) = E(S) - Σ(|Sv|/|S| * E(Sv)) 

= 0.940 - (5/14 * 0.971 + 4/14 * 0 + 5/14 * 0.971) 

≈ 0.246
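The numbers above can be reproduced with a short standard-library Python script; the two lists below are simply the Age and Buys_Computer columns of the table.

```python
# Reproduce E(S) and IG(S, Age) from the table above (standard library only).
from collections import Counter
from math import log2

age =  ["<=30", "<=30", "31-40", ">40", ">40", ">40", "31-40",
        "<=30", "<=30", ">40", "<=30", "31-40", "31-40", ">40"]
buys = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

e_s = entropy(buys)                          # ≈ 0.940
weighted = 0.0
for value in ["<=30", "31-40", ">40"]:
    subset = [b for a, b in zip(age, buys) if a == value]
    weighted += len(subset) / len(buys) * entropy(subset)

info_gain = e_s - weighted                   # ≈ 0.247 (quoted as 0.246 above after rounding)
print(round(e_s, 3), round(info_gain, 3))
```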

----------------------------------------

b) C4.5:

1. Uses Gain Ratio for splitting

2. Handles both continuous and categorical attributes

3. Implements pruning

Use:

For classification tasks that mix categorical and continuous features, especially when interpretable multi-way splits are useful

Avoid:

When training time matters on very large datasets, or when strictly binary splits are preferred

Pros:

1. Handles both continuous and categorical data

2. Implements pruning to reduce overfitting

Cons:

1. Can still create biased trees with imbalanced datasets

Alternative: Use ensemble methods like Random Forests

Limitations and Cons:

1. Can still create biased trees if not properly configured

2. Pruning can sometimes be too aggressive

3. Sensitive to small changes in training data

4. Computationally expensive for large datasets

Solutions:

1. Use cross-validation to fine-tune pruning parameters (see the sketch after this list).

2. Implement ensemble methods to improve stability

3. Use sampling techniques for large datasets

4. Consider CART for better handling of numerical attributes
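A hedged sketch of solution 1: cross-validating the hyperparameters that control tree size. Scikit-learn implements CART rather than C4.5, so DecisionTreeClassifier serves only as a stand-in, and the built-in breast-cancer dataset and parameter grid are illustrative choices.

```python
# Cross-validate tree-size parameters as a proxy for tuning pruning strength.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # illustrative dataset

param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```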


C4.5 (Gain Ratio):

Step 1: Calculate Information Gain (same as ID3) 

IG(S,Age) ≈ 0.246

Step 2: Calculate Split Information 

SplitInfo(S,Age) = -Σ((|Sv|/|S|) * log₂(|Sv|/|S|)) 

= -(5/14)log₂(5/14) - (4/14)log₂(4/14) - (5/14)log₂(5/14) 

≈ 1.577

Step 3: Calculate Gain Ratio 

GainRatio(S,Age) = IG(S,Age) / SplitInfo(S,Age) 

= 0.246 / 1.577

 ≈ 0.156
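The Gain Ratio figures can be checked in a couple of lines: SplitInfo depends only on the sizes of the Age partitions (5, 4, and 5 out of 14), and the Information Gain value 0.246 comes from the ID3 calculation earlier.

```python
# Verify SplitInfo(S, Age) and GainRatio(S, Age) from the partition sizes.
from math import log2

info_gain = 0.246                 # IG(S, Age) from the ID3 section
sizes, n = [5, 4, 5], 14          # sizes of the Age <=30, 31-40, >40 partitions

split_info = -sum((s / n) * log2(s / n) for s in sizes)   # ≈ 1.577
gain_ratio = info_gain / split_info                       # ≈ 0.156
print(round(split_info, 3), round(gain_ratio, 3))
```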

----------------------------------------

c) CART (Classification and Regression Trees):

  • Uses Gini Index for splitting

  • Produces binary trees (two branches per split)

  • Supports both classification and regression

 Pros:

  • Supports both classification and regression

  • Produces easy-to-interpret binary trees

Cons:

  • May create deep, complex trees

Alternative: Use tree pruning or set a maximum depth

Limitations and Cons:

  • Prone to creating complex trees that may overfit

  • Can be biased towards attributes with many split points

  • Binary splits may not be optimal for all problems

  • Sensitive to outliers in the training data

Solutions:

  • Use cost-complexity pruning to simplify trees (see the sketch below)

  • Implement regularization techniques

  • Consider multiway splits for certain problems

  • Use robust splitting criteria less sensitive to outliers


Use: For both classification and regression tasks, especially when binary decisions are natural 

Avoid: When multi-way splits are necessary or when interpretability of complex trees is challenging
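As referenced in the Solutions list above, here is a minimal sketch of cost-complexity pruning with scikit-learn's CART implementation (assumed available); the built-in breast-cancer dataset stands in for real data.

```python
# Compute the cost-complexity pruning path, then pick the alpha with the best
# cross-validated accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas[:-1]:        # the last alpha prunes down to the root
    score = cross_val_score(
        DecisionTreeClassifier(ccp_alpha=alpha, random_state=0), X, y, cv=5
    ).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score
print(round(best_alpha, 4), round(best_score, 3))
```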

CART (Gini Index):

Step 1: Calculate Gini Index of the target variable 

Gini(S) = 1 - Σ(p(i)²) 

= 1 - ((9/14)² + (5/14)²) 

≈ 0.459

Step 2: Calculate Gini Index for each value of Age

Gini(S|Age<=30) = 1 - ((2/5)² + (3/5)²) = 0.480 

Gini(S|Age=31-40) = 1 - ((4/4)² + (0/4)²) = 0 

Gini(S|Age>40) = 1 - ((3/5)² + (2/5)²) = 0.480

Step 3: Calculate Gini Gain

GiniGain(S,Age) = Gini(S) - Σ((|Sv|/|S|) * Gini(Sv)) 

= 0.459 - (5/14 * 0.480 + 4/14 * 0 + 5/14 * 0.480) 

≈ 0.116
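These Gini figures can be verified from the class counts alone: 9 Yes / 5 No overall, and Yes/No counts of (2, 3), (4, 0), and (3, 2) within the three Age groups.

```python
# Verify Gini(S) and GiniGain(S, Age) from the class counts above.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent = gini([9, 5])                       # ≈ 0.459
groups = [[2, 3], [4, 0], [3, 2]]           # Age <=30, 31-40, >40
weighted = sum(sum(g) / 14 * gini(g) for g in groups)
gini_gain = parent - weighted               # ≈ 0.116
print(round(parent, 3), round(gini_gain, 3))
```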

d) CHAID (Chi-square Automatic Interaction Detector):

  • Uses Chi-square test for splitting

  • Can create non-binary splits

  • Primarily used for classification

Pros:

  • Can create multi-way splits

  • Good for datasets with categorical predictors

Cons:

  • Less effective with continuous variables

Alternative: Use CART or C4.5 for datasets with many continuous variables

Limitations and Cons:

  • Primarily designed for categorical predictors

  • Can be computationally intensive for large datasets

  • Multiple comparisons problem in statistical testing

  • May create overly complex trees

Solutions:

  • Use other algorithms for datasets with many continuous variables

  • Implement parallel processing for large datasets

  • Apply Bonferroni correction for multiple comparisons

  • Set stricter significance levels for splitting

Use: In market segmentation, when categorical predictors dominate

Avoid: When many continuous variables are present or when binary splits are preferred


CHAID is a decision tree algorithm that uses the chi-square test of independence to determine the best split at each node. It's particularly useful for categorical variables and can create non-binary splits. Here's a step-by-step mathematical explanation:

Chi-square Statistic:

The core of CHAID is the chi-square (χ²) test. For a contingency table with r rows and c columns, the χ² statistic is calculated as:

χ² = Σ [(Oᵢⱼ - Eᵢⱼ)² / Eᵢⱼ]

Where:

Oᵢⱼ = observed frequency in cell (i, j)

Eᵢⱼ = expected frequency in cell (i, j)

Expected Frequency Calculation:

Eᵢⱼ = (row total * column total) / grand total

Degrees of Freedom:

df = (r - 1) * (c - 1)

Where:

r = number of rows

c = number of columns

P-value Calculation:

The p-value is calculated from the χ² statistic and the degrees of freedom using the chi-square distribution.

Merging Categories:

CHAID can merge categories of predictor variables. For ordinal predictors:

a) Calculate χ² for adjacent categories. 

b) Find the pair with the smallest χ² (largest p-value). 

c) If p-value > merge threshold, merge these categories. 

d) Repeat until no pair has p-value > threshold.

For nominal predictors, compare all pairs of categories.

Bonferroni Correction:

To address the multiple comparison problem, CHAID applies the Bonferroni correction:

p' = p * m

Where:

p' = adjusted p-value

p = original p-value

m = number of comparisons

Splitting Decision:

a) For each predictor, find the merged categories with the lowest adjusted p-value. 

b) Choose the predictor with the lowest overall adjusted p-value. 

c) If this p-value < significance threshold, split on this predictor.

Stopping Criteria:

The tree growth stops when:

  • No predictor has p-value < significance threshold

  • Maximum tree depth is reached

  • Minimum node size is reached


Mathematical Example:

Let's consider a simple example with a binary target variable (Yes/No) and a predictor with three categories (A, B, C).

Observed Frequencies:

         Yes   No   Total
A         30   10      40
B         20   20      40
C         10   30      40
Total     60   60     120

Step 1: 

Calculate expected frequencies:

E(A,Yes) = (40 * 60) / 120 = 20 

E(A,No) = (40 * 60) / 120 = 20 

(Similarly for B and C; every expected frequency is 20.)

Step 2: 

Calculate the χ² statistic:

χ² = (30-20)²/20 + (10-20)²/20 + ... = 20

Step 3: 

Degrees of freedom:

df = (3-1) * (2-1) = 2

Step 4: 

Calculate the p-value:

p-value = P(χ² > 20 | df=2) ≈ 4.54e-5

Step 5: 

Check for merging (assuming a merge threshold of 0.05):

χ²(A,B) ≈ 5.33, p-value ≈ 0.021 < 0.05 

χ²(B,C) ≈ 5.33, p-value ≈ 0.021 < 0.05 

χ²(A,C) = 20, p-value ≈ 7.74e-6 < 0.05 

No pair has a p-value above the merge threshold, so no categories are merged.

Step 6:

Bonferroni correction 

p' = 4.54e-5 * 3 = 1.36e-4

Step 7:

Splitting decision: since 1.36e-4 < the significance threshold (e.g., 0.05), we split on this predictor.

This process would be repeated for all predictors at each node until a stopping criterion is met.
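For reference, the χ² statistic, degrees of freedom, p-value, and expected frequencies from the example can be cross-checked with SciPy (assumed to be installed):

```python
# Cross-check the CHAID example with SciPy's chi-square test of independence.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 10],   # category A: Yes, No
                     [20, 20],   # category B
                     [10, 30]])  # category C

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2), dof, p)    # 20.0, 2, ≈ 4.54e-05
print(expected)                  # every expected frequency is 20
```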

CHAID's key advantages are its ability to handle categorical variables naturally, create multi-way splits, and merge categories when appropriate. However, it can be computationally intensive and may struggle with continuous variables or datasets with many predictors due to the multiple comparison problem.
































