A Simpler Proof of the AdaBoost Algorithm

AdaBoost is a powerful ensemble algorithm for building predictive models. A major disadvantage, however, is that AdaBoost may overfit in the presence of noise. Freund, Y. & Schapire, R. E. (1997) proved that the training error of the ensemble is bounded by the following expression:

\begin{equation}\label{ada1}
e_{ensemble}\le \prod_{t}2\cdot\sqrt{\epsilon_t\cdot(1-\epsilon_t)}
\end{equation}
where $\epsilon_t$ is the error rate of the $t$-th base classifier. If the error rate is less than $0.5$, we can write $\epsilon_t=0.5-\gamma_t$, where $\gamma_t$ measures how much better the classifier is than random guessing (on binary problems). The bound on the training error of the ensemble becomes
\begin{equation}\label{ada2}
e_{ensemble}\le \prod_{t}\sqrt{1-4{\gamma_t}^2}\le e^{-2\sum_{t}{\gamma_t}^2}
\end{equation}
Thus if each base classifier is slightly better than random, so that $\gamma_t>\gamma$ for some $\gamma>0$, then the training error drops exponentially fast. Nevertheless, because of its tendency to focus on training examples that are misclassified, the AdaBoost algorithm can be quite susceptible to over-fitting.
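To see how these bounds behave numerically, here is a small Python sketch (an added illustration; the per-round error rates are made up) that evaluates $\prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)}$ and $e^{-2\sum_t\gamma_t^2}$ for a hypothetical run of $100$ rounds with $\epsilon_t=0.4$, i.e. $\gamma_t=0.1$:
\begin{verbatim}
# Illustration only: made-up per-round error rates, each slightly better than 0.5.
import numpy as np

eps = np.full(100, 0.4)                          # epsilon_t = 0.4 for 100 rounds
gamma = 0.5 - eps                                # gamma_t = 0.1

bound1 = np.prod(2 * np.sqrt(eps * (1 - eps)))   # prod_t 2*sqrt(eps_t*(1 - eps_t))
bound2 = np.exp(-2 * np.sum(gamma ** 2))         # exp(-2*sum_t gamma_t^2)
print(bound1, bound2)                            # roughly 0.130 <= 0.135
\end{verbatim}
Even with such weak base classifiers, the bound already guarantees a training error below about $14\%$ after $100$ rounds.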
We will give a new, simple proof of \ref{ada1} and \ref{ada2}; additionally, we explain why the parameter $$\alpha_t=\frac{1}{2}\cdot\log\frac{1-\epsilon_t}{\epsilon_t}$$ takes this particular form in the boosting algorithm.
AdaBoost Algorithm:
Recall the boosting algorithm:
Given $(x_1, y_1), (x_2, y_2), \cdots, (x_m, y_m)$, where $x_i\in X, y_i\in Y=\{-1, +1\}$.
Initialize $$D_1(i)=\frac{1}{m}$$ For $t=1, 2, \ldots, T$: Train a weak learner using distribution $D_t$.
Get a weak hypothesis $h_t: X\rightarrow \{-1, +1\}$ with error \[\epsilon_t=\Pr_{i\sim D_t}[h_t (x_i)\ne y_i]\] If $\epsilon_t >0.5$, then the weights $D_t (i)$ are reverted to their original uniform values $\frac{1}{m}$.
Choose
\begin{equation}\label{boost3}
\alpha_t=\frac{1}{2}\cdot \log\frac{1-\epsilon_t}{\epsilon_t}
\end{equation}
Update:
\begin{equation}\label{boost4}
D_{t+1}(i)=\frac{D_{t}(i)}{Z_t}\times
\left\{\begin{array}{c c}
e^{-\alpha_t} & \quad \textrm{if $h_t(x_i)=y_i$}\\
e^{\alpha_t} & \quad \textrm{if $h_t(x_i)\ne y_i$}
\end{array} \right.
\end{equation}
where $Z_t$ is a normalization factor.
Output the final hypothesis: \[H(x)=\text{sign}\left(\sum_{t=1}^{T}\alpha_t\cdot h_t(x)\right)\]
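To make the steps concrete, here is a minimal runnable sketch of this loop (not an official implementation: the synthetic data set, the depth-1 decision trees from scikit-learn used as weak learners, and all variable names are illustrative assumptions). It also prints the bound $\prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)}$ from \ref{ada1} next to the training error of the final hypothesis:
\begin{verbatim}
# A sketch of the AdaBoost loop above; data set and weak learner are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
m = 200
X = rng.normal(size=(m, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)       # labels in {-1, +1}

T = 20
D = np.full(m, 1.0 / m)                          # D_1(i) = 1/m
stumps, alphas, eps_list = [], [], []

for t in range(T):
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
    pred = h.predict(X)
    eps = D[pred != y].sum()                     # epsilon_t = Pr_{i ~ D_t}[h_t(x_i) != y_i]
    if eps > 0.5:                                # revert to uniform weights, as in the text
        D = np.full(m, 1.0 / m)
        continue
    alpha = 0.5 * np.log((1 - eps) / eps)        # alpha_t = (1/2) log((1 - eps_t)/eps_t)
    D = D * np.exp(-alpha * y * pred)            # the two-case update written as one exponent
    D /= D.sum()                                 # dividing by Z_t, the normalization factor
    stumps.append(h); alphas.append(alpha); eps_list.append(eps)

# Final hypothesis H(x) = sign(sum_t alpha_t h_t(x))
f = sum(a * h.predict(X) for a, h in zip(alphas, stumps))
train_error = np.mean(np.sign(f) != y)
bound = np.prod([2 * np.sqrt(e * (1 - e)) for e in eps_list])
print(train_error, "<=", bound)                  # the bound proved below
\end{verbatim}
On toy data like this, the printed training error sits below the printed bound, matching the inequality proved below.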

Proof:
First, we prove \ref{ada1}. Note that $D_{t+1}(i)$ is a distribution, so $\sum_{i}D_{t+1}(i)=1$; hence
\[Z_t=\sum_{i}D_{t+1}(i)\cdot Z_t=\sum_{i}D_t(i)\times \left\{\begin{array}{c c}
e^{-\alpha_t} & \quad \textrm{if $h_t(x_i)=y_i$}\\
e^{\alpha_t} & \quad \textrm{if $h_t(x_i)\ne y_i$}
\end{array} \right.\]
\[=\sum_{i:\ h_t(x_i)=y_i}D_t(i)\cdot e^{-\alpha_t}+\sum_{i:\ h_t(x_i)\ne y_i}D_t(i)\cdot e^{\alpha_t}\]
\[=e^{-\alpha_t}\cdot \sum_{i:\ h_t(x_i)=y_i}D_t(i)+e^{\alpha_t}\cdot \sum_{i:\ h_t(x_i)\ne y_i}D_t(i)\]
\begin{equation}\label{boost5}
=e^{-\alpha_t}\cdot (1-\epsilon_t)+e^{\alpha_t}\cdot \epsilon_t
\end{equation}
In order to find $\alpha_t$, we minimize $Z_t$ by setting its first-order derivative with respect to $\alpha_t$ equal to 0:
\[{[e^{-\alpha_t}\cdot (1-\epsilon_t)+e^{\alpha_t}\cdot \epsilon_t]}^{'}=-e^{-\alpha_t}\cdot (1-\epsilon_t)+e^{\alpha_t}\cdot \epsilon_t=0\]
\[\Rightarrow \alpha_t=\frac{1}{2}\cdot \log\frac{1-\epsilon_t}{\epsilon_t}\]
which is \ref{boost3} in the boosting algorithm.
Then we substitute $\alpha_t$ back into \ref{boost5}:
\[Z_t=e^{-\alpha_t}\cdot (1-\epsilon_t)+e^{\alpha_t}\cdot \epsilon_t=e^{-\frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}}\cdot (1-\epsilon_t)+e^{\frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}}\cdot\epsilon_t\]
\begin{equation}\label{boost6}
=2\sqrt{\epsilon_t\cdot(1-\epsilon_t)}
\end{equation}
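As a quick sanity check (added here; the value of $\epsilon_t$ is made up), the snippet below minimizes $Z_t(\alpha)=(1-\epsilon_t)e^{-\alpha}+\epsilon_t e^{\alpha}$ on a grid and compares the grid minimizer and minimum value with the analytic expressions above:
\begin{verbatim}
# Numerical check that alpha_t = (1/2) log((1-eps)/eps) minimizes
# Z_t(alpha) = (1 - eps)*exp(-alpha) + eps*exp(alpha), with minimum 2*sqrt(eps*(1-eps)).
import numpy as np

eps = 0.3                                        # an illustrative error rate below 0.5
alpha_grid = np.linspace(-2.0, 2.0, 100001)
Z = (1 - eps) * np.exp(-alpha_grid) + eps * np.exp(alpha_grid)

alpha_star = 0.5 * np.log((1 - eps) / eps)
print(alpha_grid[np.argmin(Z)], alpha_star)      # grid minimizer vs analytic minimizer
print(Z.min(), 2 * np.sqrt(eps * (1 - eps)))     # minimum value vs closed form
\end{verbatim}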
On the other hand, from \ref{boost4} we have
\[D_{t+1}(i)=\frac{D_t(i)\cdot e^{-\alpha_t\cdot y_i\cdot h_t(x_i)}}{Z_t}=\frac{D_t(i)\cdot e^{K_t}}{Z_t}\]
because the product $y_i\cdot h_t(x_i)$ equals $+1$ if $h_t(x_i)=y_i$ and $-1$ if $h_t(x_i)\ne y_i$, so the two cases in \ref{boost4} collapse into the single exponent $K_t=-\alpha_t\cdot y_i\cdot h_t(x_i)$.
Thus we can write down the chain of equations
\[D_1(i)=\frac{1}{m}\]
\[D_2(i)=\frac{D_1(i)\cdot e^{K_1}}{Z_1}\]
\[D_3(i)=\frac{D_2(i)\cdot e^{K_2}}{Z_2}\]
\[\ldots\ldots\ldots\]
\[D_{t+1}(i)=\frac{D_t(i)\cdot e^{K_t}}{Z_t}\]
Multiplying all of the equalities above, we obtain
\[D_{t+1}(i)=\frac{1}{m}\cdot\frac{e^{-y_i\cdot f(x_i)}}{\prod_{t}Z_t}\]
where $f(x_i)=\sum_{t}\alpha_t\cdot h_t(x_i)$.
Thus
\begin{equation}\label{boost7}
\frac{1}{m}\cdot \sum_{i}e^{-y_i\cdot f(x_i)}=\sum_{i}D_{t+1}(i)\cdot\prod_{t}Z_t=\prod_{t}Z_t
\end{equation}
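The identity \ref{boost7} is easy to check numerically. The sketch below is an added illustration: the labels and the base hypotheses are random $\pm1$ stand-ins rather than trained weak learners, and rounds with $\epsilon_t>0.5$ are handled by flipping the stand-in (the algorithm above instead reverts the weights to uniform):
\begin{verbatim}
# Run the weight recursion with random stand-in hypotheses and check
#   (1/m) * sum_i exp(-y_i * f(x_i)) = prod_t Z_t.
import numpy as np

rng = np.random.default_rng(1)
m, T = 50, 10
y = rng.choice([-1, 1], size=m)
preds = rng.choice([-1, 1], size=(T, m))     # stand-ins for h_t(x_i)

D = np.full(m, 1.0 / m)
Zs, alphas = [], []
for t in range(T):
    eps = D[preds[t] != y].sum()
    if eps > 0.5:
        preds[t] *= -1                       # flip so that eps <= 0.5 (demo shortcut)
        eps = 1 - eps
    alpha = 0.5 * np.log((1 - eps) / eps)
    w = D * np.exp(-alpha * y * preds[t])    # unnormalized D_{t+1}
    Z = w.sum()                              # Z_t, the normalization factor
    D = w / Z
    Zs.append(Z)
    alphas.append(alpha)

f = np.sum([a * p for a, p in zip(alphas, preds)], axis=0)
print(np.mean(np.exp(-y * f)), np.prod(Zs))  # the two sides agree up to rounding
\end{verbatim}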
Note that if $\epsilon_t>0.5$, the weights are reverted to uniform and the round is repeated until $\epsilon_t\le0.5$; in other words, the parameter $\alpha_t\ge0$ in every valid iteration. The training error of the ensemble $H(x)=\text{sign}(f(x))$ can be expressed as
\[e_{ensemble}=\frac{1}{m}\cdot\sum_{i}\left\{\begin{array}{c c}
1 & \quad \textrm{if $y_i\ne H(x_i)$}\\
0 & \quad \textrm{if $y_i=H(x_i)$}
\end{array} \right. =\frac{1}{m}\cdot \sum_{i}\left\{\begin{array}{c c}
1 & \quad \textrm{if $y_i\cdot f(x_i)\le0$}\\
0 & \quad \textrm{if $y_i\cdot f(x_i)>0$}
\end{array} \right.\]
\begin{equation}\label{boost8}
\le\frac{1}{m}\cdot\sum_{i}e^{-y_i\cdot f(x_i)}=\prod_{t}Z_t
\end{equation}
The inequality holds because $e^{-y_i\cdot f(x_i)}\ge 1$ whenever $y_i\cdot f(x_i)\le0$ and $e^{-y_i\cdot f(x_i)}\ge0$ otherwise; the last equality is \ref{boost7}.
According to \ref{boost6} and \ref{boost8}, we have proved \ref{ada1}
\begin{equation}\label{boost9}
e_{ensemble}\le \prod_{t}2\cdot\sqrt{\epsilon_t\cdot(1-\epsilon_t)}
\end{equation}
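The same random-stand-in setup can be used to check the whole chain: the $0/1$ training error lies below the exponential loss $\frac{1}{m}\sum_i e^{-y_i\cdot f(x_i)}$, which equals $\prod_t Z_t=\prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)}$ (again, the data here are made up):
\begin{verbatim}
# Check e_ensemble <= (1/m)*sum_i exp(-y_i f(x_i)) = prod_t 2*sqrt(eps_t*(1-eps_t)).
import numpy as np

rng = np.random.default_rng(2)
m, T = 50, 10
y = rng.choice([-1, 1], size=m)
preds = rng.choice([-1, 1], size=(T, m))     # stand-ins for h_t(x_i)

D = np.full(m, 1.0 / m)
eps_list, alphas = [], []
for t in range(T):
    eps = D[preds[t] != y].sum()
    if eps > 0.5:
        preds[t] *= -1                       # flip so that eps <= 0.5 (demo shortcut)
        eps = 1 - eps
    alpha = 0.5 * np.log((1 - eps) / eps)
    D = D * np.exp(-alpha * y * preds[t])
    D /= D.sum()                             # divide by Z_t
    eps_list.append(eps)
    alphas.append(alpha)

f = np.sum([a * p for a, p in zip(alphas, preds)], axis=0)
print(np.mean(np.sign(f) != y))                                # 0/1 training error
print(np.mean(np.exp(-y * f)))                                 # exponential loss
print(np.prod([2 * np.sqrt(e * (1 - e)) for e in eps_list]))   # the proved bound
\end{verbatim}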
In order to prove \ref{ada2}, we first prove the following inequality:
\begin{equation}\label{boost10}
1+x\le e^x
\end{equation}
or, equivalently, $e^x-x-1\ge0$.
Let $g(x)=e^x-x-1$ (we write $g$ to avoid a clash with $f(x_i)$ above); then
\[g^{'}(x)=e^x-1=0\Rightarrow x=0\]
Since $g^{''}(x)=e^x>0$, this critical point is a minimum, so
\[{g(x)}_{min}=g(0)=0\Rightarrow e^x-x-1\ge0\]
which is the desired inequality.
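A short numerical spot check of \ref{boost10} on a grid (illustration only):
\begin{verbatim}
# Spot check of 1 + x <= e^x on a grid of points.
import numpy as np

x = np.linspace(-5.0, 5.0, 1001)
print(np.all(1 + x <= np.exp(x)))            # prints True
\end{verbatim}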
Now we go back to \ref{boost9} and let
\[\epsilon_t=\frac{1}{2}-\gamma_t\]
where $\gamma_t$ measures how much better the classifier is than random guessing (on binary problems). Based on \ref{boost10} we have
\[e_{ensemble}\le\prod_{t}2\cdot\sqrt{\epsilon_t\cdot(1-\epsilon_t)}\]
\[=\prod_{t}\sqrt{1-4\gamma_t^2}\]
\[=\prod_{t}[1+(-4\gamma_t^2)]^{\frac{1}{2}}\]
\[\le\prod_{t}(e^{-4\gamma_t^2})^\frac{1}{2}=\prod_{t}e^{-2\gamma_t^2}\]
\[=e^{-2\cdot\sum_{t}\gamma_t^2}\]
as desired.
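Finally, the per-round inequality used in the chain above, $\sqrt{1-4\gamma_t^2}\le e^{-2\gamma_t^2}$, which is \ref{boost10} applied with $x=-4\gamma_t^2$, can also be spot-checked (the range of $\gamma_t$ is illustrative):
\begin{verbatim}
# Per-round check of sqrt(1 - 4*gamma^2) <= exp(-2*gamma^2),
# i.e. the inequality 1 + x <= e^x applied with x = -4*gamma^2.
import numpy as np

gamma = np.linspace(0.0, 0.5, 501)           # gamma_t ranges over [0, 0.5]
lhs = np.sqrt(1 - 4 * gamma ** 2)
rhs = np.exp(-2 * gamma ** 2)
print(np.all(lhs <= rhs))                    # prints True
\end{verbatim}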
