CART Notes: A Basic Principle in the CART Algorithm

This article explains why growing a tree can reduce the error rate. In other words, is it better to split a node into its sub-nodes?

Proposition: For any split of a node \(t\) into \(t_L\) and \(t_R\)
\[R(t)\ge R(t_L)+R(t_R)\]
Notation:
Let's begin with some notation and basic concepts.

\(t\): a node in the tree

\(T\): the collection of all the nodes in the tree

\(\tilde{T}\): the collection of all the leaves (i.e. terminal nodes) in the tree

\(t_L\) and \(t_R\): the left child node and the right child node, respectively.

\(N\): the total number of samples. Similarly, \(N_j\) denotes the number of samples in class \(j\), \(N(t)\) the number of samples at node \(t\), and \(N_j(t)\) the number of samples in class \(j\) that go to node \(t\).

\(p(t)\): the probability of a sample falling into the region corresponding to node \(t\), that is, \(N(t)/N\). Likewise, \(p(t_L)\) and \(p(t_R)\) denote \(N(t_L)/N\) and \(N(t_R)/N\), respectively. \(p_L\) and \(p_R\) denote \(p(t_L|t)=p(t_L)/p(t)=N(t_L)/N(t)\) and \(p(t_R|t)=p(t_R)/p(t)=N(t_R)/N(t)\), respectively. Thus we have \(p_L+p_R=1\) and \(p(t_L)+p(t_R)=p(t)\).

\(p(j|t)\): the estimated posterior probability of class \(j\) given that a sample is at node \(t\), that is, \(N_j(t)/N(t)\).

\(p(t|j)\): the estimated probability of a sample in class \(j\) going to node \(t\), that is, \(N_j(t)/N_j\).

\(p(j, t)\): the joint probability of a sample being in class \(j\) and going to node \(t\), that is, \(N_j(t)/N\).

\(j^*\): the class assigned to node \(t\), i.e., the majority class at node \(t\): \(\underset{j}{\arg\max}\ p(j|t)\).

\(r(t)\): the re-substitution estimate of the risk (i.e., the probability of misclassification) for a case falling into node \(t\), that is, \(1-\underset{j}{\max}\ p(j|t)=1-p(j^*|t)\).

\(R(t)\): the re-substitution estimate of the overall misclassification rate contributed by node \(t\), that is, \(r(t)\,p(t)\).
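To make these definitions concrete, here is a minimal Python sketch that computes the node-level quantities from raw counts. The helper name `node_stats` and its inputs are assumptions made for illustration, not part of the original notes or of any CART library.

```python
import numpy as np

def node_stats(class_counts_at_t, N):
    """Compute p(t), p(j|t), r(t), and R(t) from raw counts.

    class_counts_at_t: the per-class counts N_j(t) at node t.
    N: the total number of samples in the training set.
    """
    counts = np.asarray(class_counts_at_t, dtype=float)
    N_t = counts.sum()              # N(t): number of samples at node t
    p_t = N_t / N                   # p(t) = N(t) / N
    p_j_given_t = counts / N_t      # p(j|t) = N_j(t) / N(t)
    r_t = 1.0 - p_j_given_t.max()   # r(t) = 1 - max_j p(j|t)
    R_t = r_t * p_t                 # R(t) = r(t) * p(t)
    return p_t, p_j_given_t, r_t, R_t

# Example: a node holding 60 of N = 100 samples with class counts (40, 20).
p_t, p_j_t, r_t, R_t = node_stats([40, 20], N=100)
print(p_t, r_t, R_t)  # 0.6, 0.333..., 0.2
```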

This proposition says that splitting a node \(t\) into child nodes never increases the re-substitution misclassification rate. Note that the rate need not strictly decrease: it stays the same, for example, when both children have the same majority class as \(t\).
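For a concrete feel of the equality case (the numbers below are made up for illustration): let \(N=100\) and split a node \(t\) with class counts \((40, 20)\) into children with counts \((20, 10)\) and \((20, 10)\). Then
\[r(t)=1-\tfrac{40}{60}=\tfrac{1}{3},\qquad R(t)=\tfrac{1}{3}\cdot\tfrac{60}{100}=0.2,\]
\[R(t_L)=R(t_R)=\Big(1-\tfrac{20}{30}\Big)\cdot\tfrac{30}{100}=0.1,\]
so \(R(t_L)+R(t_R)=R(t)\) and the split leaves the re-substitution rate unchanged.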


Proof:
To prove the inequality
\[R(t)\ge R(t_L)+R(t_R)\]
note that \(R(t_L)=r(t_L)\,p(t_L)\) and \(R(t_R)=r(t_R)\,p(t_R)\), so it suffices to prove
\[R(t)\ge p(t_L)\cdot r(t_L)+p(t_R)\cdot r(t_R)\]
\[\Leftrightarrow R(t)\ge p(t)\cdot p_L\cdot r(t_L)+p(t)\cdot p_R\cdot r(t_R)\]
where the second line uses \(p(t_L)=p(t)\cdot p_L\) and \(p(t_R)=p(t)\cdot p_R\).
Since the left-hand side is \(R(t)=r(t)\,p(t)\), dividing both sides by \(p(t)\) reduces the problem to proving
\[r(t)\ge p_L\cdot r(t_L)+p_R\cdot r(t_R)\]
\[\Leftrightarrow r(t)\ge p_L\cdot (1-\underset{j}{\max}\ p(j|t_L))+p_R\cdot (1-\underset{j}{\max}\ p(j|t_R))\]
\[\Leftrightarrow r(t)\ge (p_L+p_R)-[p_L\cdot \underset{j}{\max}\ p(j|t_L)+p_R\cdot \underset{j}{\max}\ p(j|t_R)]\]
\[\Leftrightarrow r(t)\ge 1-[p_L\cdot \underset{j}{\max}\ p(j|t_L)+p_R\cdot \underset{j}{\max}\ p(j|t_R)]\]
On the other hand, we know that
\[r(t)=1-\underset{j}{\max}\ p(j|t)=1-p(j^*|t)\]
Hence we have to prove
\[p(j^*|t)\le p_L\cdot \underset{j}{\max}\ p(j|t_L)+p_R\cdot \underset{j}{\max}\ p(j|t_R)\]
Expanding the left-hand side by the law of total probability,
\[p(j^*|t)=p(j^*, t_L|t)+p(j^*, t_R|t)\]
\[=p(j^*|t_L)\cdot p(t_L|t)+p(j^*|t_R)\cdot p(t_R|t)\]
\[=p_L\cdot p(j^*|t_L)+p_R\cdot p(j^*|t_R)\]
\[\le p_L\cdot \underset{j}{\max}\ p(j|t_L)+p_R\cdot \underset{j}{\max}\ p(j|t_R)\]
as desired.
The last step holds because \(j^*\) is the majority class at node \(t\), but it may or may not be the majority class at the child nodes \(t_L\) and \(t_R\).
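As a sanity check, the inequality can be verified numerically on a toy split. The counts below are made-up illustrative numbers, not data from these notes; the sketch reuses the definition \(R(t)=r(t)\,p(t)\) directly.

```python
import numpy as np

# Toy split: N = 100 samples, two classes.
# Node t holds 60 samples with class counts (40, 20);
# the split sends (35, 5) to t_L and (5, 15) to t_R.
N = 100
counts_t  = np.array([40.0, 20.0])
counts_tL = np.array([35.0,  5.0])
counts_tR = np.array([ 5.0, 15.0])

def R(counts):
    """R(node) = r(node) * p(node), with r = 1 - max_j p(j|node)."""
    p_node = counts.sum() / N
    r_node = 1.0 - counts.max() / counts.sum()
    return r_node * p_node

print(R(counts_t))                  # R(t) = (1 - 40/60) * 0.6 = 0.2
print(R(counts_tL) + R(counts_tR))  # 0.05 + 0.05 = 0.1 <= R(t)
```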

