CART Notes: A Basic Principle in the CART Algorithm

This article explains why growing a tree can reduce the error rate. In other words, is it better to split a node into its sub-nodes?

Proposition: For any split of a node \(t\) into \(t_L\) and \(t_R\)
\[R(t)\ge R(t_L)+R(t_R)\]
Notation:
Let's begin with some notation and basic concepts.

\(t\): a node in the tree

\(T\): the collection of all the nodes in the tree

\(\tilde{T}\): the collection of all the leaves (i.e. terminal nodes) in the tree

\(t_L\) and \(t_R\): the left child node and the right child node, respectively.

\(N\): the total number of samples. Similarly, \(N_j\) denotes the number of samples in class \(j\), \(N(t)\) the number of samples at node \(t\), and \(N_j(t)\) the number of samples in class \(j\) that go to node \(t\).

\(p(t)\): the probability of a sample falling into the region corresponding to node \(t\), that is, \(N(t)/N\). Likewise, \(p(t_L)\) and \(p(t_R)\) denote \(N(t_L)/N\) and \(N(t_R)/N\), respectively. \(p_L\) and \(p_R\) denote \(p(t_L|t)=p(t_L)/p(t)=N(t_L)/N(t)\) and \(p(t_R|t)=p(t_R)/p(t)=N(t_R)/N(t)\), respectively. Thus we have \(p_L+p_R=1\) and \(p(t_L)+p(t_R)=p(t)\).

\(p(j|t)\): the estimated posterior probability of class \(j\) given that a sample is at node \(t\), that is, \(N_j(t)/N(t)\).

\(p(t|j)\): the estimated probability of a sample in class \(j\) going to node \(t\), that is, \(N_j(t)/N_j\).

\(p(j, t)\): the joint probability of a sample being in class \(j\) and going to node \(t\), that is, \(N_j(t)/N\).

\(j^*\): the class assigned to node \(t\), i.e., the majority class at node \(t\): \(\underset{j}{\arg\max}\ p(j|t)\).

\(r(t)\): the re-substitution estimate of the risk (i.e., the probability of misclassification) for a case falling into node \(t\), that is, \(1-\underset{j}{\max}\ p(j|t)=1-p(j^*|t)\).

\(R(t)\): the re-substitution estimate of the overall misclassification rate contributed by node \(t\), that is, \(r(t)\,p(t)\).
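To make these definitions concrete, here is a minimal Python sketch that computes the node-level quantities from raw counts. The helper name `node_stats` and its inputs are assumptions made for illustration, not part of the original notes or of any CART library.

```python
import numpy as np

def node_stats(class_counts_at_t, N):
    """Compute p(t), p(j|t), r(t), and R(t) from raw counts.

    class_counts_at_t: the per-class counts N_j(t) at node t.
    N: the total number of samples in the training set.
    """
    counts = np.asarray(class_counts_at_t, dtype=float)
    N_t = counts.sum()              # N(t): number of samples at node t
    p_t = N_t / N                   # p(t) = N(t) / N
    p_j_given_t = counts / N_t      # p(j|t) = N_j(t) / N(t)
    r_t = 1.0 - p_j_given_t.max()   # r(t) = 1 - max_j p(j|t)
    R_t = r_t * p_t                 # R(t) = r(t) * p(t)
    return p_t, p_j_given_t, r_t, R_t

# Example: a node holding 60 of N = 100 samples with class counts (40, 20).
p_t, p_j_t, r_t, R_t = node_stats([40, 20], N=100)
print(p_t, r_t, R_t)  # 0.6, 0.333..., 0.2
```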

This proposition says that splitting a node \(t\) into child nodes never increases the re-substitution misclassification rate. Note that the rate need not strictly decrease: it stays the same, for example, when both children have the same majority class as \(t\).
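For a concrete feel of the equality case (the numbers below are made up for illustration): let \(N=100\) and split a node \(t\) with class counts \((40, 20)\) into children with counts \((20, 10)\) and \((20, 10)\). Then
\[r(t)=1-\tfrac{40}{60}=\tfrac{1}{3},\qquad R(t)=\tfrac{1}{3}\cdot\tfrac{60}{100}=0.2,\]
\[R(t_L)=R(t_R)=\Big(1-\tfrac{20}{30}\Big)\cdot\tfrac{30}{100}=0.1,\]
so \(R(t_L)+R(t_R)=R(t)\) and the split leaves the re-substitution rate unchanged.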


Proof:
To prove the inequality
\[R(t)\ge R(t_L)+R(t_R)\]
note that \(R(t_L)=r(t_L)\,p(t_L)\) and \(R(t_R)=r(t_R)\,p(t_R)\), so it suffices to prove
\[R(t)\ge p(t_L)\cdot r(t_L)+p(t_R)\cdot r(t_R)\]
\[\Leftrightarrow R(t)\ge p(t)\cdot p_L\cdot r(t_L)+p(t)\cdot p_R\cdot r(t_R)\]
where the second line uses \(p(t_L)=p(t)\cdot p_L\) and \(p(t_R)=p(t)\cdot p_R\).
Since the left-hand side is \(R(t)=r(t)\,p(t)\), dividing both sides by \(p(t)\) reduces the problem to proving
\[r(t)\ge p_L\cdot r(t_L)+p_R\cdot r(t_R)\]
\[\Leftrightarrow r(t)\ge p_L\cdot (1-\underset{j}{\max}\ p(j|t_L))+p_R\cdot (1-\underset{j}{\max}\ p(j|t_R))\]
\[\Leftrightarrow r(t)\ge (p_L+p_R)-[p_L\cdot \underset{j}{\max}\ p(j|t_L)+p_R\cdot \underset{j}{\max}\ p(j|t_R)]\]
\[\Leftrightarrow r(t)\ge 1-[p_L\cdot \underset{j}{\max}\ p(j|t_L)+p_R\cdot \underset{j}{\max}\ p(j|t_R)]\]
On the other hand, we know that
\[r(t)=1-\underset{j}{\max}\ p(j|t)=1-p(j^*|t)\]
Hence we have to prove
\[p(j^*|t)\le p_L\cdot \underset{j}{\max}\ p(j|t_L)+p_R\cdot \underset{j}{\max}\ p(j|t_R)\]
Expanding the left-hand side by the law of total probability,
\[p(j^*|t)=p(j^*, t_L|t)+p(j^*, t_R|t)\]
\[=p(j^*|t_L)\cdot p(t_L|t)+p(j^*|t_R)\cdot p(t_R|t)\]
\[=p_L\cdot p(j^*|t_L)+p_R\cdot p(j^*|t_R)\]
\[\le p_L\cdot \underset{j}{\max}\ p(j|t_L)+p_R\cdot \underset{j}{\max}\ p(j|t_R)\]
as desired.
The last step holds because \(j^*\) is the majority class at node \(t\), but it may or may not be the majority class at the child nodes \(t_L\) and \(t_R\).
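As a sanity check, the inequality can be verified numerically on a toy split. The counts below are made-up illustrative numbers, not data from these notes; the sketch reuses the definition \(R(t)=r(t)\,p(t)\) directly.

```python
import numpy as np

# Toy split: N = 100 samples, two classes.
# Node t holds 60 samples with class counts (40, 20);
# the split sends (35, 5) to t_L and (5, 15) to t_R.
N = 100
counts_t  = np.array([40.0, 20.0])
counts_tL = np.array([35.0,  5.0])
counts_tR = np.array([ 5.0, 15.0])

def R(counts):
    """R(node) = r(node) * p(node), with r = 1 - max_j p(j|node)."""
    p_node = counts.sum() / N
    r_node = 1.0 - counts.max() / counts.sum()
    return r_node * p_node

print(R(counts_t))                  # R(t) = (1 - 40/60) * 0.6 = 0.2
print(R(counts_tL) + R(counts_tR))  # 0.05 + 0.05 = 0.1 <= R(t)
```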

