
CART Notes: An Example of Calculating Gini Gain in CART

This post describes a simple example of calculating Gini gain in the CART algorithm. Gini gain is an impurity-based criterion that measures the divergence between the probability distributions of the target attribute's values.

Definition 1 (Gini index): Given a training set S whose target attribute takes on k different values (i.e., classes), the Gini index of S is defined as

Gini(S)=1-\sum_{i=1}^{k} p_i^2

where p_i is the probability that a sample in S belongs to class i; that is, P=(p_1, p_2, \ldots, p_k) is a probability distribution with \displaystyle \sum_{i=1}^{k} p_i = 1.
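
To make the definition concrete, here is a minimal sketch in Python that computes the Gini index from a list of labels, and the Gini gain of a binary split as the parent impurity minus the size-weighted child impurities (the standard definition). The sample labels below are made up purely for illustration.

from collections import Counter

def gini(labels):
    # Gini index: 1 - sum of squared class probabilities.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(parent, left, right):
    # Gini gain of splitting `parent` into `left` and `right`:
    # parent impurity minus size-weighted child impurities.
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

# Hypothetical node with 10 samples from two classes.
parent = ["A"] * 4 + ["B"] * 6
left = ["A"] * 3 + ["B"] * 1
right = ["A"] * 1 + ["B"] * 5

print(gini(parent))               # 1 - 0.4^2 - 0.6^2 = 0.48
print(gini_gain(parent, left, right))  # 0.48 - 0.4*0.375 - 0.6*0.2778 ~= 0.163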

CART Notes: How to Compute the Complexity Parameter in CART?

Proposition: The complexity parameter \alpha (i.e., cp in the R "rpart" package) is

\alpha = \frac{R(t)-R(T_t)}{|\tilde{T}_t|-1}


Proof:
Recall the definition R_\alpha (T) = R(T) + \alpha |\tilde{T}|, where \tilde{T} denotes the set of terminal nodes of T, and T_t is the branch rooted at node t. For any single node t \in T, we have
R_\alpha (t) = R(t) + \alpha
since a single node contains exactly one terminal node (itself).
Similarly, for any branch T_t \in T, we have
R_\alpha(T_t)=R(T_t)+\alpha |\tilde{T_t}|
When \alpha=0, R_0(t)=R(t)>R(T_t)=R_0(T_t). This inequality is guaranteed because the first step of pruning removes every branch satisfying R(t)=R(t_L)+R(t_R); that is, each remaining internal node must satisfy R(t)>R(t_L)+R(t_R) (the details can be found in this post). Consequently, the inequality still holds for sufficiently small \alpha>0. As \alpha increases, R_\alpha(t)=R(t)+\alpha grows at rate 1 while R_\alpha(T_t)=R(T_t)+\alpha|\tilde{T}_t| grows at rate |\tilde{T}_t|>1, so the two costs meet at a critical value of \alpha. Setting R_\alpha(t)=R_\alpha(T_t) and solving yields
\alpha = \frac{R(t)-R(T_t)}{|\tilde{T}_t|-1}
which is the claimed complexity parameter.
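
As a quick numeric illustration, the sketch below evaluates the critical \alpha for a hypothetical branch; the risk values R(t), R(T_t) and the terminal-node count are made up for the example and are not taken from any particular rpart fit.

def critical_alpha(R_t, R_Tt, n_terminal):
    # Weakest-link value: the alpha at which pruning branch T_t down
    # to the single node t leaves the cost-complexity unchanged.
    assert n_terminal > 1 and R_t > R_Tt
    return (R_t - R_Tt) / (n_terminal - 1)

# Hypothetical branch: R(t) = 0.30, R(T_t) = 0.18, |T~_t| = 4 terminal nodes.
alpha = critical_alpha(0.30, 0.18, 4)
print(alpha)  # 0.04

# Sanity check: node and branch have equal cost-complexity at this alpha.
print(0.30 + alpha)      # 0.34
print(0.18 + alpha * 4)  # 0.34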

CART Notes: A Basic Principle in CART Algorithm

This article explains why growing a tree can reduce the error rate; in other words, why splitting a node into its sub-nodes is never worse than leaving it unsplit.

Proposition: For any split of a node t into t_L and t_R,
R(t)\ge R(t_L)+R(t_R)
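
Before turning to the notations, here is a small numeric check of the proposition, using the resubstitution misclassification estimate R(t) = (number of samples in t outside its majority class) / N; the class counts below are hypothetical.

def node_risk(class_counts, N):
    # Resubstitution risk R(t): samples in node t not in its majority
    # class, divided by the total training-set size N.
    return (sum(class_counts) - max(class_counts)) / N

N = 100
t = [40, 60]                   # parent node: 40 class-A, 60 class-B
t_L, t_R = [30, 10], [10, 50]  # a candidate split of t

print(node_risk(t, N))                        # R(t) = 0.40
print(node_risk(t_L, N) + node_risk(t_R, N))  # 0.10 + 0.10 = 0.20 <= 0.40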

Notations:
Let's begin with some notations and basic concepts.