
CART Notes: An Example of Calculating Gini Gain in CART

This post describes a simple example of calculating Gini gain in the CART algorithm. Gini gain is an impurity-based criterion that measures the divergence between the probability distributions of the target attribute's values.

Definition 1 (Gini index): Given a training set S whose target attribute takes on k different values (i.e., classes), the Gini index of S is defined as

Gini(S)=1-\sum_{i=1}^{k} p_i^2

where p_i is the probability that a sample in S belongs to class i; that is, P=(p_1, p_2, \ldots, p_k) is a probability distribution with \displaystyle \sum_{i=1}^{k} p_i = 1.
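
To make the definition concrete, here is a minimal sketch in Python that computes the Gini index from a list of labels, and the Gini gain of a binary split as the parent impurity minus the size-weighted child impurities (the standard definition). The sample labels below are made up purely for illustration.

from collections import Counter

def gini(labels):
    # Gini index: 1 - sum of squared class probabilities.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(parent, left, right):
    # Gini gain of splitting `parent` into `left` and `right`:
    # parent impurity minus size-weighted child impurities.
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

# Hypothetical node with 10 samples from two classes.
parent = ["A"] * 4 + ["B"] * 6
left = ["A"] * 3 + ["B"] * 1
right = ["A"] * 1 + ["B"] * 5

print(gini(parent))               # 1 - 0.4^2 - 0.6^2 = 0.48
print(gini_gain(parent, left, right))  # 0.48 - 0.4*0.375 - 0.6*0.2778 ~= 0.163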

CART Notes: How to Compute the Complexity Parameter in CART?

Proposition: The complexity parameter \alpha (i.e., cp in the R "rpart" package) is

\alpha = \frac{R(t)-R(T_t)}{|\tilde{T}_t|-1}


Proof:
Recall the definition R_\alpha (T) = R(T) + \alpha |\tilde{T}|, where \tilde{T} denotes the set of terminal nodes of T, and T_t is the branch rooted at node t. For any single node t \in T, we have
R_\alpha (t) = R(t) + \alpha
since a single node contains exactly one terminal node (itself).
Similarly, for any branch T_t \in T, we have
R_\alpha(T_t)=R(T_t)+\alpha |\tilde{T_t}|
When \alpha=0, R_0(t)=R(t)>R(T_t)=R_0(T_t). This inequality is guaranteed because the first step of pruning removes every branch satisfying R(t)=R(t_L)+R(t_R); that is, each remaining internal node must satisfy R(t)>R(t_L)+R(t_R) (the details can be found in this post). Consequently, the inequality still holds for sufficiently small \alpha>0. As \alpha increases, R_\alpha(t)=R(t)+\alpha grows at rate 1 while R_\alpha(T_t)=R(T_t)+\alpha|\tilde{T}_t| grows at rate |\tilde{T}_t|>1, so the two costs meet at a critical value of \alpha. Setting R_\alpha(t)=R_\alpha(T_t) and solving yields
\alpha = \frac{R(t)-R(T_t)}{|\tilde{T}_t|-1}
which is the claimed complexity parameter.
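
As a quick numeric illustration, the sketch below evaluates the critical \alpha for a hypothetical branch; the risk values R(t), R(T_t) and the terminal-node count are made up for the example and are not taken from any particular rpart fit.

def critical_alpha(R_t, R_Tt, n_terminal):
    # Weakest-link value: the alpha at which pruning branch T_t down
    # to the single node t leaves the cost-complexity unchanged.
    assert n_terminal > 1 and R_t > R_Tt
    return (R_t - R_Tt) / (n_terminal - 1)

# Hypothetical branch: R(t) = 0.30, R(T_t) = 0.18, |T~_t| = 4 terminal nodes.
alpha = critical_alpha(0.30, 0.18, 4)
print(alpha)  # 0.04

# Sanity check: node and branch have equal cost-complexity at this alpha.
print(0.30 + alpha)      # 0.34
print(0.18 + alpha * 4)  # 0.34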

CART Notes: A Basic Principle in CART Algorithm

This article explains why growing a tree can reduce the error rate; in other words, why splitting a node into its sub-nodes is never worse than leaving it unsplit.

Proposition: For any split of a node t into t_L and t_R,
R(t)\ge R(t_L)+R(t_R)
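
Before turning to the notations, here is a small numeric check of the proposition, using the resubstitution misclassification estimate R(t) = (number of samples in t outside its majority class) / N; the class counts below are hypothetical.

def node_risk(class_counts, N):
    # Resubstitution risk R(t): samples in node t not in its majority
    # class, divided by the total training-set size N.
    return (sum(class_counts) - max(class_counts)) / N

N = 100
t = [40, 60]                   # parent node: 40 class-A, 60 class-B
t_L, t_R = [30, 10], [10, 50]  # a candidate split of t

print(node_risk(t, N))                        # R(t) = 0.40
print(node_risk(t_L, N) + node_risk(t_R, N))  # 0.10 + 0.10 = 0.20 <= 0.40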

Notations:
Let's begin with some notations and basic concepts.