CART Notes: An example of calculating gini gain in CART

This post describes a simple example of calculating gini gain in CART algorithm. Gini gain, which is an impurity-based criterion that measures the divergences between the probability distribution of the target attributes' values.

Definition 1 (gini index): Given a training set $S$, the target attribute takes on $k$ different values (i.e. classes), then the gini index of $S$ is defined as\[Gini(S)=1-\sum_{i=1}^{k} p_i ^2\]
where $p_i$ is the probability of $S$ belonging to class $i$, that is, consider a probability distribution $P=(p_1, p_2, \ldots, p_k)$ and $\displaystyle \sum_{i=1}^{k} p_i =1$.

For instance, if a data set has only one class, its gini index is $1-1^2=0$, which is a purity data set. On the other hand, if a probability distribution is uniform $P=(1/k, 1/k, \ldots, 1/k)$, its gini index achieves maximum.

Definition 2 (gini gain): Gini gain is the evaluation criterion for selecting the attributes $A$ which is defined as
\[GiniGain (A,S)=Gini(S)-Gini(A,S)=Gini(S)-\sum_{i=1}^{n} \frac{|S_i|}{|S|} Gini(S_i)\]
where $S_i$ is the partition of $S$ induced by the value of attribute $A$.

CART can also deal with the case of features with continuous ranges. Assuming that features $A$ has a continuous range. It has values in the training set, say they are, in increasing, $A_1, A_2, \ldots, A_m$. Then for each value $A_i, i=1,2,\ldots, m$, we partition the records into two sets: the first set has the $A$ values up to and include $A_i$; the second set has the $A$ values greater than $A_i$. For each of these partitions we compute the gini gain of $(A_i, S), i=1,2,\ldots, m$, and choose the partition that maximizes the gini gain. If all features are continuous, we will obtain a binary tree.

Let's see an example to show the process of calculating gini gain.
Example: A small training set is shown in Table 1. There are 3 attributes which are $a_1, a_2,$ and $a_3$. $a_3$ is a continuous attribute. The target set $t$ has 2 classes which are $P, N$ (i.e. Positive & Negative). 


Solution: The gini index of $a_1, a_2$ is:
\[Gini(t)=1-[(\frac{4}{9})^2+(\frac{5}{9})^2]=40/81\\
Gini(a_1 =T)=1-[(\frac{3}{4})^2+(\frac{1}{4})^2]=3/8\\
Gini(a_1 =F)=1-[(\frac{1}{5})^2+(\frac{4}{5})^2]=8/25\\
Gini(a_2 =T)=1-[(\frac{2}{5})^2+(\frac{3}{5})^2]=12/25\\
Gini(a_2 =F)=1-[(\frac{2}{4})^2+(\frac{2}{4})^2]=1/2\\
GiniGain(a_1)=Gini(t)-[\frac{4}{9}\cdot Gini(a_1=T)+\frac{5}{9}\cdot Gini(a_1=F)]=0.149\\
GiniGain(a_2)=Gini(t)-[\frac{5}{9}\cdot Gini(a_2=T)+\frac{4}{9}\cdot Gini(a_2=F)]=0.005\]
According to the gini gain, the best split between $a_1$ and $a_2$ is $a_1$ due to it has the higher gini gain.

For $a_3$ which is a continuous attribute, firstly sort all values in increasing order (i.e. 1.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0) and then partition each position (i.e. 0.5, 2.0, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5) into 2 divisions which becomes a binary tree. The following calculation shows the gini gain on No.3 split position, that is, 3.5.
\[GiniGain(SplitNo.3)\\
=Gini(t)-[\frac{2}{9}\cdot Gini(SplitNo.3left)+\frac{7}{9}\cdot Gini(SplitNo.3right)]\\
=\frac{40}{81}-\{\frac{2}{9}\cdot [1-(\frac{1}{2})^2-(\frac{1}{2})^2]+\frac{7}{9}\cdot [1-(\frac{3}{7})^2-(\frac{4}{7})^2]\}=0.002\]
The results of every possible split are shown in Table 2.


According to the results of $a_1, a_2, a_3$, the best split is $a_1$ due to its highest value of gini gain.

Download this post as PDF format


没有评论:

发表评论