
In PAC learning, we have shown that uniform convergence is a sufficient condition for learnability. Rademacher complexity measures the rate of uniform convergence, and it can also be used to derive generalization bounds.

## Definition

Let us denote:

$$\mathcal{F} \overset{\mathrm{def}}{=} l \circ \mathcal{H} \overset{\mathrm{def}}{=} \left\{ z \rightarrow l(h,z) : h \in \mathcal{H} \right\}$$

Given $$f \in \mathcal{F}$$ and a sample $$S = (z_1, \dots, z_m)$$, we also define:

$$L_D(f) = \mathbb{E}_{z \sim D} \left[ f(z) \right], L_S(f) = \frac{1}{m} \sum_{i=1}^{m} f(z_i)$$

We define the representativeness of $$S$$ with respect to $$\mathcal{F}$$ as the largest gap between the true error of a function $$f$$ and its empirical error:

$$\mathrm{Rep}_D(\mathcal{F}, S) \overset{\mathrm{def}}{=} \mathrm{sup}_{f \in \mathcal{F}} (L_D(f) - L_S(f))$$

Suppose we would like to estimate the representativeness of $$S$$ using the sample $$S$$ only. One simple idea is to split $$S$$ into two disjoint sets, $$S = S_1 \cup S_2$$; refer to $$S_1$$ as the validation set and $$S_2$$ as the training set. We can then estimate the representativeness of $$S$$ by:

$$\mathrm{sup}_{f \in \mathcal{F}} (L_{S_1}(f) - L_{S_2}(f))$$

Define $$\mathbf{\sigma} = (\sigma_1, \dots, \sigma_m) \in \left\{ \pm 1\right\}^m$$ to be a vector such that $$S_1 = \{ z_i : \sigma_i = 1\}$$ and $$S_2 = \{ z_i : \sigma_i = -1\}$$. If we further assume $$|S_1| = |S_2|$$, then the estimate above can be rewritten as:

$$\frac{2}{m} \mathrm{sup}_{f \in \mathcal{F}} \sum_{i=1}^{m} \sigma_i f(z_i)$$
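To see why the rewriting holds, note that with $$|S_1| = |S_2| = m/2$$ each empirical average carries a factor of $$2/m$$, so for every $$f \in \mathcal{F}$$:

$$L_{S_1}(f) - L_{S_2}(f) = \frac{2}{m} \sum_{i : \sigma_i = 1} f(z_i) - \frac{2}{m} \sum_{i : \sigma_i = -1} f(z_i) = \frac{2}{m} \sum_{i=1}^{m} \sigma_i f(z_i)$$

Taking the supremum over $$f \in \mathcal{F}$$ on both sides gives the expression above.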

The Rademacher complexity measure captures this idea by considering the expectation of the above with respect to a random choice of $$\mathbf{\sigma}$$. Formally, let $$\mathcal{F} \circ S$$ be the set of all possible evaluations a function $$f \in \mathcal{F}$$ can achieve on the sample $$S$$, namely:

$$\mathcal{F} \circ S = \left\{ (f(z_1), \dots, f(z_m)) : f \in \mathcal{F} \right\}$$

Let the variables in $$\mathbf{\sigma}$$ be distributed i.i.d. according to $$\mathbb{P}[\sigma_i = 1] = \mathbb{P}[\sigma_i = -1] = \frac{1}{2}$$. Then the Rademacher complexity of $$\mathcal{F}$$ with respect to $$S$$ is defined as:

$$R(\mathcal{F} \circ S) \overset{\mathrm{def}}{=} \frac{1}{m} \mathbb{E}_{\mathbf{\sigma} \sim \{ \pm 1\}^m} \left[ \mathrm{sup}_{f \in \mathcal{F}} \sum_{i=1}^{m} \sigma_i f(z_i) \right]$$

More generally, given a set of vectors $$A \subset \mathbb{R}^m$$, we define

$$R(A) \overset{\mathrm{def}}{=} \frac{1}{m} \mathbb{E}_{\mathbf{\sigma}} \left[ \mathrm{sup}_{a \in A} \sum_{i=1}^{m} \sigma_i a_i \right]$$
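For a finite set of vectors and a small $$m$$, the expectation over $$\mathbf{\sigma}$$ can be evaluated exactly by enumerating all $$2^m$$ sign vectors. The following is a minimal sketch of that computation; the function name `rademacher_complexity` and the example sets are illustrative choices, not part of the original notes.

```python
from itertools import product


def rademacher_complexity(A):
    """Exact R(A) for a finite list of equal-length vectors A.

    Enumerates all 2^m sign vectors, so it is only practical for small m.
    """
    m = len(A[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=m):
        # Inner supremum: best correlation between sigma and any vector in A.
        total += max(sum(s * a_i for s, a_i in zip(sigma, a)) for a in A)
    # Average over the 2^m equally likely sign vectors, then divide by m.
    return total / (2 ** m) / m


# A single fixed vector: its correlation with random signs averages out.
single = rademacher_complexity([(1.0, -1.0)])
# A vector together with its negation: the sup picks whichever sign fits sigma.
symmetric = rademacher_complexity([(1.0, 1.0), (-1.0, -1.0)])
print(single, symmetric)
```

A richer set can fit random signs better, which is why the second value is larger than the first.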

The following lemma bounds the expected value of the representativeness of $$S$$ by twice the expected Rademacher complexity.

$$\mathbb{E}_{S \sim \mathcal{D}^m} \left[ \mathrm{Rep}_{\mathcal{D}} (\mathcal{F}, S) \right] \le 2 \mathbb{E}_{S \sim \mathcal{D}^m} R(\mathcal{F} \circ S)$$

This lemma implies that, in expectation, the ERM rule finds a hypothesis which is close to the optimal hypothesis in $$\mathcal{H}$$.

$$\mathbb{E}_{S \sim \mathcal{D}^m} \left[ L_D(ERM_{\mathcal{H}}(S)) - L_S(ERM_{\mathcal{H}}(S))\right] \le 2 \mathbb{E}_{S \sim \mathcal{D}^m} R(l \circ \mathcal{H} \circ S)$$

Furthermore, for any $$h^* \in \mathcal{H}$$

$$\mathbb{E}_{S \sim \mathcal{D}^m} \left[ L_D(ERM_{\mathcal{H}}(S)) - L_D(h^*)\right] \le 2 \mathbb{E}_{S \sim \mathcal{D}^m} R(l \circ \mathcal{H} \circ S)$$
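The step from the first inequality to the second can be sketched as follows. It uses two facts: $$h^*$$ does not depend on $$S$$, so $$\mathbb{E}_S \left[ L_S(h^*) \right] = L_D(h^*)$$, and ERM minimizes the empirical error, so $$L_S(ERM_{\mathcal{H}}(S)) \le L_S(h^*)$$. Hence:

$$\mathbb{E}_{S} \left[ L_D(ERM_{\mathcal{H}}(S)) - L_D(h^*) \right] = \mathbb{E}_{S} \left[ L_D(ERM_{\mathcal{H}}(S)) - L_S(h^*) \right] \le \mathbb{E}_{S} \left[ L_D(ERM_{\mathcal{H}}(S)) - L_S(ERM_{\mathcal{H}}(S)) \right]$$

The right-hand side is exactly the quantity bounded by the first inequality.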

Finally, if $$h^* = \mathrm{argmin}_{h \in \mathcal{H}} L_{\mathcal{D}}(h)$$, then for each $$\delta \in (0,1)$$, with probability of at least $$1 - \delta$$ over the choice of $$S$$, we have:

$$L_{\mathcal{D}} (ERM_{\mathcal{H}}(S)) - L_{\mathcal{D}}(h^*) \le \frac{2 \mathbb{E}_{S' \sim \mathcal{D}^m} R(l \circ \mathcal{H} \circ S')}{\delta}$$
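This high-probability statement follows from Markov's inequality. Writing $$\theta(S) = L_{\mathcal{D}}(ERM_{\mathcal{H}}(S)) - L_{\mathcal{D}}(h^*)$$ (a shorthand introduced here for illustration), the choice of $$h^*$$ as a minimizer makes $$\theta(S)$$ a nonnegative random variable, so:

$$\mathbb{P}_{S \sim \mathcal{D}^m} \left[ \theta(S) \ge \frac{\mathbb{E}_{S'} \left[ \theta(S') \right]}{\delta} \right] \le \delta$$

Bounding $$\mathbb{E}_{S'} \left[ \theta(S') \right]$$ with the expectation bound above gives the claim. The $$1/\delta$$ dependence is the price of using only the first moment.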

Using McDiarmid’s Inequality, we can derive bounds with better dependence on the confidence parameter:

Assume that for all $$z$$ and $$h \in \mathcal{H}$$ we have that $$|l(h,z)| \le c$$. Then,

1. With probability at least $$1 - \delta$$, for all $$h \in \mathcal{H}$$,

$$L_{\mathcal{D}} (h) - L_S(h) \le 2 \mathbb{E}_{S' \sim \mathcal{D}^m} R(l \circ \mathcal{H} \circ S') + c \sqrt{\frac{2 \ln(2/\delta)}{m}}$$

In particular, this holds for $$h = ERM_{\mathcal{H}}(S)$$.

2. With probability at least $$1 - \delta$$, for all $$h \in \mathcal{H}$$,

$$L_{\mathcal{D}} (h) - L_S(h) \le 2 R(l \circ \mathcal{H} \circ S) + 4c\sqrt{\frac{2 \ln(4/\delta)}{m}}$$

3. For any $$h^*$$, with probability at least $$1 - \delta$$,

$$L_{\mathcal{D}} (ERM_{\mathcal{H}} (S)) - L_D(h^*) \le 2 R(l \circ \mathcal{H} \circ S) + 5c\sqrt{\frac{2 \ln(8/\delta)}{m}}$$
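To get a feel for the confidence terms in these three bounds, the sketch below evaluates them for illustrative values $$c = 1$$, $$\delta = 0.05$$ and $$m = 10000$$. These values and the helper name `confidence_term` are assumptions chosen for the example only.

```python
import math


def confidence_term(c, delta, m, scale, k):
    """Return scale * c * sqrt(2 * ln(k / delta) / m)."""
    return scale * c * math.sqrt(2 * math.log(k / delta) / m)


c, delta, m = 1.0, 0.05, 10_000

# (scale, k) pairs read off the three bounds: c*sqrt(2 ln(2/d)/m), 4c*sqrt(2 ln(4/d)/m), 5c*sqrt(2 ln(8/d)/m).
terms = [confidence_term(c, delta, m, scale, k) for scale, k in ((1, 2), (4, 4), (5, 8))]
print(terms)

# Quadrupling the sample size halves every confidence term.
terms_4m = [confidence_term(c, delta, 4 * m, scale, k) for scale, k in ((1, 2), (4, 4), (5, 8))]
print(terms_4m)
```

The dependence on $$\delta$$ is only logarithmic here, in contrast to the $$1/\delta$$ factor obtained from Markov's inequality.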

Massart’s lemma states that the Rademacher complexity of a finite set grows logarithmically with the size of the set.

Let $$A = \{a_1, \dots, a_N\}$$ be a finite set of vectors in $$\mathbb{R}^m$$. Define $$\bar{a} = \frac{1}{N} \sum_{i=1}^N a_i$$. Then,

$$R(A) \le \mathrm{max}_{a \in A} \lVert a - \bar{a} \rVert \frac{\sqrt{2 \log(N)}}{m}$$
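A small numeric check of Massart's lemma, computing $$R(A)$$ exactly by enumeration and comparing it with the bound. This is a sketch under stated assumptions: the set `A` and the helper names are hypothetical, and the logarithm is taken to be the natural logarithm.

```python
import math
from itertools import product


def rademacher_complexity(A):
    """Exact R(A) by enumerating all sign vectors (small m only)."""
    m = len(A[0])
    total = sum(
        max(sum(s * a_i for s, a_i in zip(sigma, a)) for a in A)
        for sigma in product((-1, 1), repeat=m)
    )
    return total / (2 ** m) / m


def massart_bound(A):
    """max_a ||a - mean(A)||_2 * sqrt(2 * ln(N)) / m for a finite set A."""
    n_vectors, m = len(A), len(A[0])
    mean = [sum(a[j] for a in A) / n_vectors for j in range(m)]
    radius = max(math.sqrt(sum((a[j] - mean[j]) ** 2 for j in range(m))) for a in A)
    return radius * math.sqrt(2 * math.log(n_vectors)) / m


A = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
exact = rademacher_complexity(A)
bound = massart_bound(A)
print(exact, bound)
```

For this set the exact value is $$1/3$$ while the bound is roughly $$0.48$$, so the lemma holds with some slack.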

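The contraction lemma shows that composing $$A$$ with a Lipschitz function does not blow up the Rademacher complexity. Formally, for each $$i \in [m]$$ let $$\phi_i : \mathbb{R} \rightarrow \mathbb{R}$$ be a $$\rho$$-Lipschitz function. For $$a \in \mathbb{R}^m$$ let $$\mathbf{\phi}(a) = (\phi_1(a_1), \dots, \phi_m(a_m))$$, and let $$\mathbf{\phi} \circ A = \{ \mathbf{\phi}(a) : a \in A \}$$. Then,

$$R(\mathbf{\phi} \circ A) \le \rho R(A)$$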
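The contraction property can be checked numerically on a toy set. The sketch below applies the 1-Lipschitz function $$\max(0, x)$$ coordinate-wise (an illustrative choice, as are the set `A` and the helper names) and compares the exact complexities before and after.

```python
from itertools import product


def rademacher_complexity(A):
    """Exact R(A) by enumerating all sign vectors (small m only)."""
    m = len(A[0])
    total = sum(
        max(sum(s * a_i for s, a_i in zip(sigma, a)) for a in A)
        for sigma in product((-1, 1), repeat=m)
    )
    return total / (2 ** m) / m


def relu(x):
    # max(0, x) is 1-Lipschitz: |relu(x) - relu(y)| <= |x - y|.
    return max(0.0, x)


A = [(1.0, -2.0), (-3.0, 1.0), (2.0, 2.0)]
# Apply the Lipschitz map to every coordinate of every vector in A.
phi_A = [tuple(relu(x) for x in a) for a in A]

r_A = rademacher_complexity(A)
r_phi_A = rademacher_complexity(phi_A)
print(r_A, r_phi_A)
```

Here the complexity drops from 1.625 to 0.625 after the map is applied, consistent with a Lipschitz constant of 1.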

## Rademacher complexity of linear classes

We define two linear classes:

$$\mathcal{H}_1 = \left\{x \rightarrow \langle w,x \rangle : \lVert w \rVert_1 \le 1\right\}$$

$$\mathcal{H}_2 = \left\{x \rightarrow \langle w,x \rangle : \lVert w \rVert_2 \le 1\right\}$$

The Rademacher complexity of $$\mathcal{H}_2$$ is bounded by the following lemma:

Let $$S = (x_1, \dots, x_m)$$ be vectors in a Hilbert space. Define $$\mathcal{H}_2 \circ S = \left\{ (\langle w, x_1 \rangle, \langle w, x_2 \rangle, \dots, \langle w, x_m \rangle) : \lVert w \rVert_2 \le 1 \right\}$$. Then,

$$R(\mathcal{H}_2 \circ S) \le \frac{\mathrm{max}_i \lVert x_i \rVert_2}{\sqrt{m}}$$
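For this class the inner supremum has a closed form: by the Cauchy-Schwarz inequality, $$\mathrm{sup}_{\lVert w \rVert_2 \le 1} \langle w, v \rangle = \lVert v \rVert_2$$ with $$v = \sum_i \sigma_i x_i$$. The sketch below uses that to compute $$R(\mathcal{H}_2 \circ S)$$ exactly for small samples in $$\mathbb{R}^n$$; the sample points and function names are hypothetical.

```python
import math
from itertools import product


def rademacher_l2_ball(xs):
    """Exact R(H_2 o S) for a small sample xs of vectors in R^n."""
    m, n = len(xs), len(xs[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=m):
        # v = sum_i sigma_i * x_i, computed coordinate by coordinate.
        v = [sum(s * x[j] for s, x in zip(sigma, xs)) for j in range(n)]
        # sup over ||w||_2 <= 1 of <w, v> equals ||v||_2.
        total += math.sqrt(sum(c * c for c in v))
    return total / (2 ** m) / m


def l2_bound(xs):
    """max_i ||x_i||_2 / sqrt(m)."""
    return max(math.sqrt(sum(c * c for c in x)) for x in xs) / math.sqrt(len(xs))


S = [(3.0, 4.0), (1.0, 0.0), (0.0, 2.0)]
print(rademacher_l2_ball(S), l2_bound(S))

# Orthonormal sample points: the bound is attained with equality.
S_orth = [(1.0, 0.0), (0.0, 1.0)]
print(rademacher_l2_ball(S_orth), l2_bound(S_orth))
```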

The following lemma bounds the Rademacher complexity of $$\mathcal{H}_1$$:

Let $$S = (x_1, \dots, x_m)$$ be vectors in $$\mathbb{R}^n$$. Then,

$$R(\mathcal{H}_1 \circ S) \le \mathrm{max}_i \lVert x_i \rVert_\infty \sqrt{\frac{2 \log(2n)}{m}}$$
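The same kind of check works for the $$\ell_1$$ class, where the inner supremum is $$\mathrm{sup}_{\lVert w \rVert_1 \le 1} \langle w, v \rangle = \lVert v \rVert_\infty$$ with $$v = \sum_i \sigma_i x_i$$. This is a minimal sketch with a hypothetical sample and hypothetical function names, using the natural logarithm in the bound.

```python
import math
from itertools import product


def rademacher_l1_ball(xs):
    """Exact R(H_1 o S) for a small sample xs of vectors in R^n."""
    m, n = len(xs), len(xs[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=m):
        # v = sum_i sigma_i * x_i, computed coordinate by coordinate.
        v = [sum(s * x[j] for s, x in zip(sigma, xs)) for j in range(n)]
        # sup over ||w||_1 <= 1 of <w, v> equals the largest |v_j|.
        total += max(abs(c) for c in v)
    return total / (2 ** m) / m


def l1_bound(xs):
    """max_i ||x_i||_inf * sqrt(2 * ln(2n) / m)."""
    m, n = len(xs), len(xs[0])
    radius = max(max(abs(c) for c in x) for x in xs)
    return radius * math.sqrt(2 * math.log(2 * n) / m)


S = [(3.0, 4.0), (1.0, 0.0), (0.0, 2.0)]
exact = rademacher_l1_ball(S)
bound = l1_bound(S)
print(exact, bound)
```

Note that the bound depends on the dimension $$n$$ only through $$\log(2n)$$, which is what makes the $$\ell_1$$ constraint attractive in high dimensions.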
