Rademacher Complexity

In PAC learning, we have shown that uniform convergence is a sufficient condition for learnability. Rademacher complexity measures the rate of uniform convergence, and it can also be used to derive generalization bounds.

Definition

Let us denote:

$F \overset{def}{=} l \circ H \overset{def}{=} \{z \mapsto l (h, z) : h \in H\}$

Given $f \in F$, we also define:

$L_{D} (f) = E_{z \sim D} [f (z)], \quad L_{S} (f) = \frac{1}{m} \sum_{i = 1}^{m} f (z_{i})$

We define the representativeness of $S$ with respect to $F$ as the largest gap between the true error of a function $f \in F$ and its empirical error:

${Rep}_{D} (F, S) \overset{def}{=} \sup_{f \in F} (L_{D} (f) - L_{S} (f))$

Suppose we would like to estimate the representativeness of $S$ using the sample $S$ only. One simple idea is to split $S$ into two disjoint sets, $S = S_{1} \cup S_{2}$, and to refer to $S_{1}$ as the validation set and $S_{2}$ as the training set. We can then estimate the representativeness of $S$ by:

$\sup_{f \in F} (L_{S_{1}} (f) - L_{S_{2}} (f))$

Define $σ = (σ_{1}, \dots, σ_{m}) \in \{\pm 1\}^{m}$ to be a vector such that $S_{1} = \{z_{i} : σ_{i} = 1\}$ and $S_{2} = \{z_{i} : σ_{i} = - 1\}$. If we further assume $| S_{1} | = | S_{2} |$, then the estimate above can be rewritten as:

$\frac{2}{m} \sup_{f \in F} \sum_{i = 1}^{m} σ_{i} f (z_{i})$
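The rewriting step can be checked numerically for a single function $f$. The sketch below compares the validation/training gap with the signed sum for one balanced split. The loss values and the sign vector are hypothetical, chosen only for illustration.

```python
def split_gap(values, sigma):
    """L_{S1}(f) - L_{S2}(f), with S1 = {i : sigma_i = +1} and S2 = {i : sigma_i = -1}."""
    s1 = [v for v, s in zip(values, sigma) if s == 1]
    s2 = [v for v, s in zip(values, sigma) if s == -1]
    return sum(s1) / len(s1) - sum(s2) / len(s2)


def signed_sum(values, sigma):
    """(2/m) * sum_i sigma_i * f(z_i)."""
    m = len(values)
    return 2.0 / m * sum(s * v for s, v in zip(sigma, values))


# Hypothetical values f(z_1), ..., f(z_4) and a balanced split (|S1| = |S2| = 2).
values = [0.2, 0.9, 0.4, 0.7]
sigma = [1, -1, -1, 1]

gap = split_gap(values, sigma)
signed = signed_sum(values, sigma)
print(gap, signed)  # both are approximately -0.2
```

The two quantities agree only because $| S_{1} | = | S_{2} | = m / 2$. For an unbalanced split the two averages carry different weights and the identity no longer holds.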

The Rademacher complexity measure captures this idea by considering the expectation of the supremum $\sup_{f \in F} \sum_{i = 1}^{m} σ_{i} f (z_{i})$ with respect to a random choice of $σ$. Formally, let $F \circ S$ be the set of all possible evaluations a function $f \in F$ can achieve on the sample $S$, namely:

$F \circ S = \{(f (z_{1}), \dots, f (z_{m})) : f \in F\}$

Let the variables in $σ$ be distributed i.i.d. according to $P [σ_{i} = 1] = P [σ_{i} = - 1] = \frac{1}{2}$ . Then the Rademacher complexity of $F$ with respect to $S$ is defined as:

$R (F \circ S) \overset{def}{=} \frac{1}{m} E_{σ \sim \{\pm 1\}^{m}} [\sup_{f \in F} \sum_{i = 1}^{m} σ_{i} f (z_{i})]$

More generally, given a set of vectors $A \subset R^{m}$, we define

$R (A) \overset{def}{=} \frac{1}{m} E_{σ} [\sup_{a \in A} \sum_{i = 1}^{m} σ_{i} a_{i}]$
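For a finite set of vectors and a small $m$, the expectation over $σ$ can be computed exactly by enumerating all $2^{m}$ sign vectors. The following is a minimal sketch of that computation. The function name `rademacher_complexity` and the example sets are illustrative, not part of the original text.

```python
from itertools import product


def rademacher_complexity(A):
    """Exact R(A) = (1/m) * E_sigma[ max_{a in A} sum_i sigma_i * a_i ] for a finite set A."""
    m = len(A[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=m):  # all 2^m equally likely sign vectors
        total += max(sum(s * x for s, x in zip(sigma, a)) for a in A)
    return total / (2 ** m * m)


single = [(1, 1)]                            # one vector: no freedom to follow sigma
pair = [(1, 1), (-1, -1)]                    # can match sigma only when sigma_1 = sigma_2
cube = list(product((-1, 1), repeat=2))      # every sign pattern is available

print(rademacher_complexity(single))  # 0.0
print(rademacher_complexity(pair))    # 0.5
print(rademacher_complexity(cube))    # 1.0
```

The three examples illustrate the interpretation of $R (A)$ as the ability of the set to correlate with random signs: a single vector cannot adapt to $σ$ at all, while the full cube matches every $σ$ perfectly.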

The following lemma bounds the expected value of the representativeness of $S$ by twice the expected Rademacher complexity.

$E_{S \sim D^{m}} [{Rep}_{D} (F, S)] \leq 2 E_{S \sim D^{m}} R (F \circ S)$
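On a small finite domain both sides of this inequality can be computed exactly, which gives a sanity check of the lemma. The sketch below assumes a uniform distribution $D$ over a three-point domain and a class $F$ of three functions. All of these choices are hypothetical.

```python
from itertools import product

Z = [0, 1, 2]                      # hypothetical finite domain; D is uniform on Z
F = [(0.0, 1.0, 0.0),              # each f is stored as (f(0), f(1), f(2))
     (1.0, 0.0, 0.0),
     (0.5, 0.5, 1.0)]
m = 3                              # sample size


def true_loss(f):
    """L_D(f) under the uniform distribution on Z."""
    return sum(f) / len(Z)


def representativeness(S):
    """Rep_D(F, S) = max_f (L_D(f) - L_S(f))."""
    return max(true_loss(f) - sum(f[z] for z in S) / len(S) for f in F)


def rademacher(S):
    """Exact R(F o S) by enumerating all sign vectors."""
    total = 0.0
    for sigma in product((-1, 1), repeat=len(S)):
        total += max(sum(s * f[z] for s, z in zip(sigma, S)) for f in F)
    return total / (2 ** len(S) * len(S))


samples = list(product(Z, repeat=m))  # all 27 samples are equally likely under D^m
expected_rep = sum(representativeness(S) for S in samples) / len(samples)
expected_rad = sum(rademacher(S) for S in samples) / len(samples)
print(expected_rep, 2 * expected_rad)
```

Exhaustive enumeration is feasible only for tiny domains and samples. For realistic sizes both expectations would have to be estimated by sampling.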

This lemma yields that, in expectation, the ERM rule finds a hypothesis which is close to the optimal hypothesis in $H$:

$E_{S \sim D^{m}} [L_{D} (\mathrm{ERM}_{H} (S)) - L_{S} (\mathrm{ERM}_{H} (S))] \leq 2 E_{S \sim D^{m}} R (l \circ H \circ S)$

Furthermore, for any $h^{*} \in H$:

$E_{S \sim D^{m}} [L_{D} (\mathrm{ERM}_{H} (S)) - L_{D} (h^{*})] \leq 2 E_{S \sim D^{m}} R (l \circ H \circ S)$

Furthermore, if $h^{*} = \mathrm{argmin}_{h \in H} L_{D} (h)$, then $L_{D} (\mathrm{ERM}_{H} (S)) - L_{D} (h^{*})$ is a nonnegative random variable. By Markov's inequality, for each $δ \in (0, 1)$, with probability at least $1 - δ$ over the choice of $S$, we have:

$L_{D} (\mathrm{ERM}_{H} (S)) - L_{D} (h^{*}) \leq \frac{2 E_{S' \sim D^{m}} R (l \circ H \circ S')}{δ}$

Using McDiarmid’s Inequality, we can derive bounds with better dependence on the confidence parameter:

Assume that for all $z$ and $h \in H$ we have $| l (h, z) | \leq c$. Then:

With probability at least $1 - δ$ , for all $h \in H$ ,

$L_{D} (h) - L_{S} (h) \leq 2 E_{S^{'} \sim D^{m}} R (l \circ H \circ S^{'}) + c \sqrt{\frac{2 \ln (2 / δ)}{m}}$

In particular, this holds for $h = \mathrm{ERM}_{H} (S)$.

With probability at least $1 - δ$ , for all $h \in H$ ,

$L_{D} (h) - L_{S} (h) \leq 2 R (l \circ H \circ S) + 4 c \sqrt{\frac{2 \ln (4 / δ)}{m}}$

For any $h^{*}$ , with probability at least $1 - δ$ ,

$L_{D} (\mathrm{ERM}_{H} (S)) - L_{D} (h^{*}) \leq 2 R (l \circ H \circ S) + 5 c \sqrt{\frac{2 \ln (8 / δ)}{m}}$
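To see what "better dependence on the confidence parameter" means in practice, the sketch below evaluates the Markov-based bound, which scales as $1 / δ$, next to the last McDiarmid-based bound, which scales as $\sqrt{\ln (1 / δ)}$. The values of the Rademacher complexity, $c$, and $m$ are hypothetical. For illustration only, the same number is plugged in for the expected and the empirical Rademacher complexity, although the two bounds refer to different quantities.

```python
import math


def markov_style_bound(expected_rademacher, delta):
    """2 * E[R] / delta: the confidence parameter enters as 1/delta."""
    return 2.0 * expected_rademacher / delta


def mcdiarmid_style_bound(empirical_rademacher, c, m, delta):
    """2 * R + 5c * sqrt(2 ln(8/delta) / m): delta enters only through sqrt(ln(1/delta))."""
    return 2.0 * empirical_rademacher + 5.0 * c * math.sqrt(2.0 * math.log(8.0 / delta) / m)


# Hypothetical values, chosen only for illustration.
rad, c, m = 0.01, 1.0, 10_000
for delta in (0.1, 0.01, 0.001):
    print(delta, markov_style_bound(rad, delta), mcdiarmid_style_bound(rad, c, m, delta))
```

As $δ$ shrinks, the first bound grows by a factor of ten at each step, while the second grows only slightly.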

Massart’s lemma states that the Rademacher complexity of a finite set grows at most like the square root of the logarithm of the size of the set.

Let $A = \{a_{1}, \dots, a_{N}\}$ be a finite set of vectors in $R^{m}$. Define $\bar{a} = \frac{1}{N} \sum_{i = 1}^{N} a_{i}$. Then,

$R (A) \leq \max_{a \in A} ‖ a - \bar{a} ‖ \frac{\sqrt{2 \log (N)}}{m}$
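The bound can be compared with the exact value on a small example. The sketch below computes $R (A)$ by enumeration and evaluates the right-hand side of Massart's lemma, using the natural logarithm. The set `A` and the function names are hypothetical.

```python
import math
from itertools import product


def rademacher_complexity(A):
    """Exact R(A) for a finite set A, by enumerating all 2^m sign vectors."""
    m = len(A[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=m):
        total += max(sum(s * x for s, x in zip(sigma, a)) for a in A)
    return total / (2 ** m * m)


def massart_bound(A):
    """max_a ||a - mean(A)||_2 * sqrt(2 * log N) / m."""
    m, n_vectors = len(A[0]), len(A)
    mean = [sum(a[i] for a in A) / n_vectors for i in range(m)]
    radius = max(math.sqrt(sum((a[i] - mean[i]) ** 2 for i in range(m))) for a in A)
    return radius * math.sqrt(2.0 * math.log(n_vectors)) / m


# Hypothetical set of N = 4 vectors in R^4.
A = [(1, 0, 1, 0), (0, 1, 1, 0), (1, 1, 0, 1), (0, 0, 0, 1)]
exact = rademacher_complexity(A)
bound = massart_bound(A)
print(exact, bound)
```

For a single vector ($N = 1$) both sides are zero, since $\log (1) = 0$ and the expectation of a fixed signed sum vanishes.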

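The contraction lemma shows that composing $A$ with a Lipschitz function does not blow up the Rademacher complexity.

For each $i \in \{1, \dots, m\}$, let $\phi_{i} : R \to R$ be a $ρ$-Lipschitz function, that is, $| \phi_{i} (α) - \phi_{i} (β) | \leq ρ | α - β |$ for all $α, β \in R$. For $a \in R^{m}$, let $\phi (a) = (\phi_{1} (a_{1}), \dots, \phi_{m} (a_{m}))$ and let $\phi \circ A = \{\phi (a) : a \in A\}$. Then,

$R (\phi \circ A) \leq ρ R (A)$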
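As a numerical illustration, take the usual form of the lemma, $R (\phi \circ A) \leq ρ R (A)$ for a $ρ$-Lipschitz $\phi$ applied coordinate-wise. The sketch below checks it for the 1-Lipschitz function $\phi (t) = \max (0, t)$ on a small hypothetical set `A`.

```python
from itertools import product


def rademacher_complexity(A):
    """Exact R(A) for a finite set A, by enumerating all 2^m sign vectors."""
    m = len(A[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=m):
        total += max(sum(s * x for s, x in zip(sigma, a)) for a in A)
    return total / (2 ** m * m)


def relu(t):
    """phi(t) = max(0, t), which is 1-Lipschitz."""
    return max(0.0, t)


# Hypothetical set of three vectors in R^3.
A = [(0.5, -1.0, 2.0), (-1.5, 0.3, 0.7), (1.0, 1.0, -2.0)]
phi_A = [tuple(relu(t) for t in a) for a in A]  # phi applied to every coordinate

rho = 1.0
before = rademacher_complexity(A)
after = rademacher_complexity(phi_A)
print(after, rho * before)
```

This is the step that lets a bound on the complexity of a hypothesis class carry over to the class composed with a Lipschitz loss.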

Rademacher complexity of linear classes

We define two linear classes:

$H_{1} = \{x \mapsto ⟨ w, x ⟩ : ‖ w ‖_{1} \leq 1\}$

$H_{2} = \{x \mapsto ⟨ w, x ⟩ : ‖ w ‖_{2} \leq 1\}$

The Rademacher complexity of $H_{2}$ is bounded by the following lemma:

Let $S = (x_{1}, \dots, x_{m})$ be vectors in a Hilbert space. Define $H_{2} \circ S = \{(⟨ w, x_{1} ⟩, ⟨ w, x_{2} ⟩, \dots, ⟨ w, x_{m} ⟩) : ‖ w ‖_{2} \leq 1\}$. Then,

$R (H_{2} \circ S) \leq \frac{\max_{i} ‖ x_{i} ‖_{2}}{\sqrt{m}}$
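For this class the supremum has a closed form: by the Cauchy-Schwarz inequality, $\sup_{‖ w ‖_{2} \leq 1} ⟨ w, v ⟩ = ‖ v ‖_{2}$ with $v = \sum_{i} σ_{i} x_{i}$. The sketch below uses this to compute $R (H_{2} \circ S)$ exactly for a small sample in $R^{n}$ and compares it with the bound. The sample and the function names are hypothetical.

```python
import math
from itertools import product


def rademacher_l2_ball(X):
    """Exact R(H_2 o S) = (1/m) * E_sigma || sum_i sigma_i x_i ||_2."""
    m, n = len(X), len(X[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=m):
        v = [sum(s * x[j] for s, x in zip(sigma, X)) for j in range(n)]
        total += math.sqrt(sum(t * t for t in v))
    return total / (2 ** m * m)


def l2_bound(X):
    """max_i ||x_i||_2 / sqrt(m)."""
    return max(math.sqrt(sum(t * t for t in x)) for x in X) / math.sqrt(len(X))


X_orth = [(1.0, 0.0), (0.0, 1.0)]                  # orthonormal sample
X_other = [(1.0, 2.0), (2.0, 1.0), (-1.0, 0.5)]    # arbitrary sample

print(rademacher_l2_ball(X_orth), l2_bound(X_orth))
print(rademacher_l2_ball(X_other), l2_bound(X_other))
```

For the orthonormal sample the two values coincide up to rounding, because $‖ \sum_{i} σ_{i} x_{i} ‖_{2} = \sqrt{m}$ for every $σ$. The bound is therefore attained in that case.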

The following lemma bounds the Rademacher complexity of $H_{1}$:

Let $S = (x_{1}, \dots, x_{m})$ be vectors in $R^{n}$ . Then,

$R (H_{1} \circ S) \leq \max_{i} ‖ x_{i} ‖_{\infty} \sqrt{\frac{2 \log (2 n)}{m}}$
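Here too the supremum has a closed form, since the $\ell_{\infty}$ norm is dual to the $\ell_{1}$ norm: $\sup_{‖ w ‖_{1} \leq 1} ⟨ w, v ⟩ = ‖ v ‖_{\infty}$. The sketch below computes $R (H_{1} \circ S)$ exactly for a small hypothetical sample and compares it with the bound, using the natural logarithm.

```python
import math
from itertools import product


def rademacher_l1_ball(X):
    """Exact R(H_1 o S) = (1/m) * E_sigma || sum_i sigma_i x_i ||_inf."""
    m, n = len(X), len(X[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=m):
        v = [sum(s * x[j] for s, x in zip(sigma, X)) for j in range(n)]
        total += max(abs(t) for t in v)
    return total / (2 ** m * m)


def l1_bound(X):
    """max_i ||x_i||_inf * sqrt(2 * log(2n) / m)."""
    m, n = len(X), len(X[0])
    largest = max(max(abs(t) for t in x) for x in X)
    return largest * math.sqrt(2.0 * math.log(2 * n) / m)


# Hypothetical sample of m = 4 vectors in R^3.
X = [(1.0, -1.0, 0.0), (1.0, 1.0, -1.0), (-1.0, 1.0, 1.0), (0.0, 1.0, 1.0)]
exact = rademacher_l1_ball(X)
bound = l1_bound(X)
print(exact, bound)
```

The dimension $n$ enters the bound only through $\log (2 n)$, which is why the $\ell_{1}$-constrained class remains manageable in high dimensions.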