Decision Trees
Understanding Decision Trees
Decision Trees are a Supervised Machine Learning algorithm; they can be used for both classification & regression problems.
A Decision Tree looks like an upside-down tree or a flow chart.
It breaks the data down by asking a series of questions.
It follows a divide-and-conquer (recursive partitioning) strategy.
Ross Quinlan built the first version of Decision Trees (ID3) in the early 1980s; it handles only categorical data.
Quinlan later released an improved version, C4.5 (1993), which handles both categorical & numerical data.
In 1984, L. Breiman, J. Friedman, R. Olshen, and C. Stone developed CART (Classification And Regression Trees).
Data Preparation has to be done before building the model; you can refer to the Data Pre Processing section.
Decision Trees use Entropy, Information Gain, and Gain Ratio calculations to identify the fields that have good predictive power.
Terminologies
Root Node: Every Decision Tree has exactly one Root Node; this is the starting point.
Decision Node: It uses a splitting criterion to split the data; it has a parent node.
Terminal Node: Also called a Leaf Node; there is no further splitting below it.
Advantages:
No scaling is required; robust to outliers; handles highly skewed data.
Does not rely on the data distribution, i.e. non-parametric.
Easy to understand & interpret.
Can handle missing values (CART): either treat them as a separate category or build another tree.
Disadvantages:
Underfitting can occur when there is too little data; Overfitting can occur because of noise and the exhaustive search for splits.
To counter both Overfitting & Underfitting, we can set a threshold so that the proportion of data points required at the leaf nodes lies between 0.30 and 1.0.
It is hard to select features when they are strongly correlated with each other.
We are going to use the following table (a small Buys_Computer dataset with 14 records: 9 Yes and 5 No) to explain the calculations used in the different Decision Tree building approaches.
Entropy:
It measures the uncertainty of a set of records with respect to the target Feature.
For a two-class target it ranges between 0 & 1 (with more classes it can exceed 1).
A Feature whose splits have higher Entropy carries more uncertainty,
so we prefer the Feature that produces lower-Entropy splits while building a decision tree.
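Formally, for a set S whose records fall into k target classes with proportions p(i):
E(S) = -Σ p(i) log₂(p(i))
E(S) = 0 when all records belong to a single class, and E(S) = 1 when a two-class node is split 50/50.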
Information Gain Ratio:
The shortcoming of Information Gain (its bias towards Features with many distinct values) is fixed by the Information Gain Ratio.
Whenever two Features have the same Information Gain,
the Gain Ratio prefers the Feature with fewer distinct values.
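It does this by normalising Information Gain with the Split Information of the candidate Feature:
GainRatio(S, A) = IG(S, A) / SplitInfo(S, A), where SplitInfo(S, A) = -Σ (|Sv|/|S|) log₂(|Sv|/|S|)
Split Information grows with the number of partitions a Feature creates, so many-valued Features are penalised (see the C4.5 worked example below).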
Gini Impurity or Gini Index:
The Gini Index of a dataset with a two-class target ranges between 0 & 0.5; it indicates the likelihood of unseen data
being misclassified if it were given a random class label according to the class distribution in the dataset.
With more than two classes it ranges between 0 & (almost) 1; a higher value indicates a more impure (more mixed) node.
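Formally: Gini(S) = 1 - Σ p(i)², where p(i) is the proportion of records in class i. A pure node has Gini = 0, and a two-class node split 50/50 has Gini = 0.5 (see the CART worked example below).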
a) ID3 (Iterative Dichotomiser 3)
Use: When all features are categorical and simplicity is preferred
Avoid: With continuous variables or when overfitting is a concern
How trees are built:
1. Uses Information Gain for splitting
2. Only handles categorical attributes
3. Builds the tree top-down, without pruning
Pros:
1. Simple and fast.
2. Good for datasets with categorical features
Cons:
1. Prone to overfitting
2. Cannot handle continuous attributes
Alternative: Use C4.5 or CART for more robust trees
Limitations and Cons:
1. Only works with categorical features.
2. Prone to overfitting, especially with noisy data.
3. Biased towards attributes with many values.
4. Cannot handle missing values.
5. Does not support pruning.
Solutions:
1. Use discretization techniques for continuous features
2. Implement early stopping or set a minimum number of samples per leaf
3. Use C4.5 or CART which address many of these limitations
4. Ensemble methods like Random Forests can mitigate overfitting
ID3 uses Information Gain:
1. It calculates the reduction in Entropy.
2. It measures how well a Feature separates the Target Feature.
3. We prefer the Feature with the highest Information Gain.
4. If two Features have the same Information Gain, the Feature with more distinct values tends to be preferred for splitting, even though the Feature with fewer distinct values takes less time and fewer resources. This bias towards many-valued Features is the shortcoming of Information Gain.
The following steps are involved in evaluating Information Gain
Step 1: Calculate entropy of the target variable (Buys_Computer)
E(S) = -p(Yes)log₂(p(Yes)) - p(No)log₂(p(No))
= -(9/14)log₂(9/14) - (5/14)log₂(5/14)
≈ 0.940
Step 2: Calculate entropy for each value of Age
E(S|Age<=30) = -(2/5)log₂(2/5) - (3/5)log₂(3/5) ≈ 0.971
E(S|Age31-40) = -(4/4)log₂(4/4) - (0/4)log₂(0/4) = 0
E(S|Age>40) = -(3/5)log₂(3/5) - (2/5)log₂(2/5) ≈ 0.971
Step 3: Calculate Information Gain
IG(S,Age) = E(S) - Σ(|Sv|/|S| * E(Sv))
= 0.940 - (5/14 * 0.971 + 4/14 * 0 + 5/14 * 0.971)
≈ 0.246
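As a sanity check, the entropy and Information Gain figures above can be reproduced with a minimal Python sketch (assuming Python 3; the class counts are taken directly from the worked example):

import math

def entropy(counts):
    # Shannon entropy (in bits) of a list of class counts
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

parent = [9, 5]                        # Buys_Computer: 9 Yes, 5 No
age_groups = [[2, 3], [4, 0], [3, 2]]  # Age <=30, 31-40, >40
n = sum(parent)

e_parent = entropy(parent)
weighted = sum(sum(g) / n * entropy(g) for g in age_groups)
print(round(e_parent, 3), round(e_parent - weighted, 3))
# prints 0.94 0.247 (the hand calculation above gives 0.246 because it rounds the intermediate entropies)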
b) C4.5:
1. Uses Gain Ratio for splitting
2. Handles both continuous and categorical attributes
3. Implements pruning
Use:
For classification tasks that mix categorical and continuous features, when multi-way splits and built-in pruning are useful
Avoid:
For regression tasks, or when training time on very large datasets is a concern
Pros:
1. Handles both continuous and categorical data
2. Implements pruning to reduce overfitting
Cons:
1. Can still create biased trees with imbalanced datasets
Alternative: Use ensemble methods like Random Forests
Limitations and Cons:
1. Can still create biased trees if not properly configured
2. Pruning can sometimes be too aggressive
3. Sensitive to small changes in training data
4. Computationally expensive for large datasets
Solutions:
1. Use cross-validation to fine-tune the pruning parameters.
2. Implement ensemble methods to improve stability
3. Use sampling techniques for large datasets
4. Consider CART for better handling of numerical attributes
C4.5 (Gain Ratio):
Step 1: Calculate Information Gain (same as ID3)
IG(S,Age) ≈ 0.246
Step 2: Calculate Split Information
SplitInfo(S,Age) = -Σ((|Sv|/|S|) * log₂(|Sv|/|S|))
= -(5/14)log₂(5/14) - (4/14)log₂(4/14) - (5/14)log₂(5/14) ≈ 1.577
Step 3: Calculate Gain Ratio
GainRatio(S,Age) = IG(S,Age) / SplitInfo(S,Age)
= 0.246 / 1.577
≈ 0.156
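The same figures can be reproduced with a minimal Python sketch (the subset sizes 5, 4 and 5 come from the Age split above):

import math

sizes = [5, 4, 5]  # records in Age <=30, 31-40, >40 out of 14
n = sum(sizes)
split_info = -sum((s / n) * math.log2(s / n) for s in sizes)
gain_ratio = 0.246 / split_info  # using IG(S,Age) ≈ 0.246 from above
print(round(split_info, 3), round(gain_ratio, 3))  # prints 1.577 0.156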
c) CART (Classification and Regression Trees):
Uses Gini Index for splitting
Produces binary trees (two branches per split)
Supports both classification and regression
Pros:
Supports both classification and regression
Produces easy-to-interpret binary trees
Cons:
May create deep, complex trees
Alternative: Use tree pruning or set a maximum depth
Limitations and Cons:
Prone to creating complex trees that may overfit
Can be biased towards attributes with many split points
Binary splits may not be optimal for all problems
Sensitive to outliers in the training data
Solutions:
Use cost-complexity pruning to simplify trees (see the sketch after this list)
Implement regularization techniques
Consider multiway splits for certain problems
Use robust splitting criteria less sensitive to outliers
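To illustrate the cost-complexity pruning suggested above, here is a minimal scikit-learn sketch (assuming scikit-learn is installed; the Iris data is only a stand-in for your own feature matrix and labels):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; replace with your own features and target
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# CART-style tree with Gini splitting; a larger ccp_alpha prunes more nodes
tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)
print("depth:", tree.get_depth(), "test accuracy:", tree.score(X_test, y_test))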
Use: For both classification and regression tasks, especially when binary decisions are natural
Avoid: When multi-way splits are necessary or when interpretability of complex trees is challenging
CART (Gini Index):
Step 1: Calculate Gini Index of the target variable
Gini(S) = 1 - Σ(p(i)²)
= 1 - ((9/14)² + (5/14)²)
≈ 0.459
Step 2: Calculate Gini Index for each value of Age
Gini(S|Age<=30) = 1 - ((2/5)² + (3/5)²) = 0.480
Gini(S|Age31-40) = 1 - ((4/4)² + (0/4)²) = 0
Gini(S|Age>40) = 1 - ((3/5)² + (2/5)²) = 0.480
Step 3: Calculate Gini Gain
GiniGain(S,Age) = Gini(S) - Σ((|Sv|/|S|) * Gini(Sv))
= 0.459 - (5/14 * 0.480 + 4/14 * 0 + 5/14 * 0.480)
≈ 0.116
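A minimal Python sketch of the same Gini calculation, using the class counts from the worked example:

def gini(counts):
    # Gini impurity from a list of class counts
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

parent = [9, 5]                    # Buys_Computer: 9 Yes, 5 No
groups = [[2, 3], [4, 0], [3, 2]]  # Age <=30, 31-40, >40
n = sum(parent)
weighted = sum(sum(g) / n * gini(g) for g in groups)
print(round(gini(parent), 3), round(gini(parent) - weighted, 3))  # prints 0.459 0.116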
d) CHAID (Chi-square Automatic Interaction Detector):
Uses Chi-square test for splitting
Can create non-binary splits
Primarily used for classification
Pros:
Can create multi-way splits
Good for datasets with categorical predictors
Cons:
Less effective with continuous variables
Alternative: Use CART or C4.5 for datasets with many continuous variables
Limitations and Cons:
Primarily designed for categorical predictors
Can be computationally intensive for large datasets
Multiple comparisons problem in statistical testing
May create overly complex trees
Solutions:
Use other algorithms for datasets with many continuous variables
Implement parallel processing for large datasets
Apply Bonferroni correction for multiple comparisons
Set stricter significance levels for splitting
Use: In market segmentation, when categorical predictors dominate
Avoid: When many continuous variables are present or when binary splits are preferred
CHAID is a decision tree algorithm that uses the chi-square test of independence to determine the best split at each node. It's particularly useful for categorical variables and can create non-binary splits. Here's a step-by-step mathematical explanation:
Chi-square Statistic:
The core of CHAID is the chi-square (χ²) test. For a contingency table with r rows and c columns, the χ² statistic is calculated as:
χ² = Σ [(Oᵢⱼ - Eᵢⱼ)² / Eᵢⱼ]
Where: Oᵢⱼ = observed frequency in cell (i,j); Eᵢⱼ = expected frequency in cell (i,j)
Expected Frequency Calculation:
Eᵢⱼ = (row total * column total) / grand total
Degrees of Freedom:
df = (r - 1) * (c - 1)
Where: r = number of rows; c = number of columns
P-value Calculation:
The p-value is calculated from the χ² statistic and the degrees of freedom using the chi-square distribution.
Merging Categories:
CHAID can merge categories of predictor variables. For ordinal predictors:
a) Calculate χ² for adjacent categories.
b) Find the pair with the smallest χ² (largest p-value).
c) If p-value > merge threshold, merge these categories.
d) Repeat until no pair has p-value > threshold.
For nominal predictors, compare all pairs of categories.
Bonferroni Correction:
To address the multiple comparison problem, CHAID applies the Bonferroni correction:
p' = p * m
Where: p' = adjusted p-value; p = original p-value; m = number of comparisons
Splitting Decision:
a) For each predictor, find the merged categories with the lowest adjusted p-value.
b) Choose the predictor with the lowest overall adjusted p-value.
c) If this p-value < significance threshold, split on this predictor.
Stopping Criteria:
The tree growth stops when:
No predictor has p-value < significance threshold
Maximum tree depth is reached
Minimum node size is reached
Mathematical Example:
Let's consider a simple example with a binary target variable (Yes/No) and a predictor with three categories (A, B, C).
Observed Frequencies:
        Yes   No   Total
A        30   10      40
B        20   20      40
C        10   30      40
Total    60   60     120
Step 1:
Calculate expected frequencies:
E(A,Yes) = (40 * 60) / 120 = 20
E(A,No) = (40 * 60) / 120 = 20
(Similarly for B and C)
Step 2:
Calculate χ² statistic:
χ² = (30-20)²/20 + (10-20)²/20 + ... = 20
Step 3:
Degrees of freedom:
df = (3-1) * (2-1) = 2
Step 4:
Calculate p-value:
p-value = P(χ² > 20 | df=2) ≈ 4.54e-5
Step 5:
Check for merging (assuming a merge threshold of 0.05):
χ²(A,B) ≈ 5.33, p-value ≈ 0.021 < 0.05
χ²(B,C) ≈ 5.33, p-value ≈ 0.021 < 0.05
χ²(A,C) = 20, p-value ≈ 7.74e-6 < 0.05
No pair has a p-value above the merge threshold, so no categories are merged.
Step 6:
Bonferroni correction
p' = 4.54e-5 * 3 = 1.36e-4
Step 7:
Splitting decision: since 1.36e-4 < the significance threshold (e.g., 0.05), we split on this predictor.
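For reference, a minimal Python sketch (assuming SciPy is installed) that reproduces the χ², degrees of freedom and p-value from this example:

from scipy.stats import chi2_contingency

# Observed frequencies from the example: rows A, B, C; columns Yes, No
observed = [[30, 10],
            [20, 20],
            [10, 30]]

chi2, p_value, dof, _ = chi2_contingency(observed)
print(chi2, dof, p_value)      # 20.0, 2, ≈ 4.54e-5

# Bonferroni adjustment for the 3 pairwise comparisons (Step 6)
print(min(p_value * 3, 1.0))   # ≈ 1.36e-4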
This process would be repeated for all predictors at each node until a stopping criterion is met.
CHAID's key advantages are its ability to handle categorical variables naturally, create multi-way splits, and merge categories when appropriate. However, it can be computationally intensive and may struggle with continuous variables or datasets with many predictors due to the multiple comparison problem.