CART Notes: How to Compute the Complexity Parameter in CART?

Proposition: The complexity parameter \(\alpha\) (i.e. \(cp\) in R "rpart" package) is

\[\alpha = \frac{R(t)-R(T_t)}{|\tilde{T}|-1}\]

Proof:
Recall that the definition: \(R_\alpha (T) = R(T) + \alpha |\tilde{T}|\), and \(T_t\) is a branch including node \(t\). For any single node \(t \in T\), we have
\[R_\alpha (t) = R(t) + \alpha\]
since there is only one terminal node at a single node.
Similarly, for any branch \(T_t \in T\), we have
\[R_\alpha(T_t)=R(T_t)+\alpha |\tilde{T_t}|\]
When \(\alpha=0\), \(R_0(t)=R(t)>R(T_t)=R_0(T_t)\). This inequality is guaranteed because the first step of pruning is to prune off all of the terminal nodes which satisfy \(R(t)=R(t_L)+R(t_R)\). That is, the remaining nodes must be satisfied \(R(t)>R(t_L)+R(t_R)\) (the details can be found in this post). Furthermore, the inequality holds for sufficient small \(\alpha\).

Then if we gradually increase \(\alpha\), \(R_\alpha (T_t)\) increases faster than \(R_\alpha (t)\) since the coefficients \(|\tilde{T_t}|>1\). In other words, at a certain \(\alpha\) we will have \(R_\alpha(T_t)=R_\alpha(t)\). Solve the above equations, we have
\[R(t)+\alpha = R(T_t)+\alpha |\tilde{T_t}|\]
\[\Rightarrow (\tilde{T_t}-1)\cdot \alpha=R(t)-R(T_t)\]
\[\Rightarrow\alpha=\frac{R(t)-R(T_t)}{|\tilde{T}|-1}\]
as desired.

Example:
This example will simply show how to calculate the complexity parameter \(\alpha\) (see Figure below). The data set has 2 classes say \(A\), \(B\), and 200 samples in all. \(T_1\) is a subtree of the whole tree \(T\), there are 5 terminal nodes in \(T_1\), say \(t_5\), \(t_6\), \(t_7\), \(t_8\), and \(t_9\).


According to the formula, we have

\[\alpha (T_1(t_1))=\frac{R(t_1)-R(T_{t_1})}{5-1}=\frac{100/200-0}{4}=1/8\]
\[\alpha (T_1(t_2))=\frac{R(t_2)-R(T_{t_2})}{3-1}=\frac{10/200-0}{2}=1/40\]
\[\alpha (T_1(t_3))=\frac{R(t_3)-R(T_{t_3})}{2-1}=\frac{60/200-0}{1}=3/10\]
\[\alpha (T_1(t_4))=\frac{R(t_4)-R(T_{t_4})}{2-1}=\frac{2/200-0}{1}=1/100\]
\(\alpha(T_1(t_4))\) is the first value of \(\alpha\) since it obtains the lowest value. That is, we prune the tree below the node \(t_4\). After this a new iteration should be used as before and the tree will be pruned once again. 

Download this post as PDF format

没有评论:

发表评论