
Rademacher Complexity

In PAC Learning, we showed that uniform convergence is a sufficient condition for learnability. Rademacher complexity measures the rate of uniform convergence, and it can also be used to provide generalization bounds.

Definition

Let us denote:

$$\mathcal{F} \stackrel{\text{def}}{=} \ell \circ \mathcal{H} \stackrel{\text{def}}{=} \{ z \mapsto \ell(h, z) : h \in \mathcal{H} \}$$

Given $f \in \mathcal{F}$, we also define:

$$L_{\mathcal{D}}(f) = \mathbb{E}_{z \sim \mathcal{D}}[f(z)], \qquad L_S(f) = \frac{1}{m} \sum_{i=1}^{m} f(z_i)$$

We define the representativeness of $S$ with respect to $\mathcal{F}$ as the largest gap between the true error of a function $f$ and its empirical error:

$$\operatorname{Rep}_{\mathcal{D}}(\mathcal{F}, S) \stackrel{\text{def}}{=} \sup_{f \in \mathcal{F}} \left( L_{\mathcal{D}}(f) - L_S(f) \right)$$

Suppose we would like to estimate the representativeness of $S$ using the sample $S$ only. One simple idea is to split $S$ into two disjoint sets, $S = S_1 \,\dot{\cup}\, S_2$; refer to $S_1$ as the validation set and to $S_2$ as the training set. We can then estimate the representativeness of $S$ by:

$$\sup_{f \in \mathcal{F}} \left( L_{S_1}(f) - L_{S_2}(f) \right)$$

Define $\sigma = (\sigma_1, \ldots, \sigma_m) \in \{\pm 1\}^m$ to be a vector such that $S_1 = \{ z_i : \sigma_i = 1 \}$ and $S_2 = \{ z_i : \sigma_i = -1 \}$. If we further assume $|S_1| = |S_2|$, then the estimate above can be rewritten as:

$$\frac{2}{m} \sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \sigma_i f(z_i)$$

The Rademacher complexity measure captures this idea by considering the expectation of the above with respect to a random choice of $\sigma$. Formally, let $\mathcal{F} \circ S$ be the set of all possible evaluations a function $f \in \mathcal{F}$ can achieve on a sample $S$, namely:

$$\mathcal{F} \circ S = \{ (f(z_1), \ldots, f(z_m)) : f \in \mathcal{F} \}$$

Let the variables in $\sigma$ be distributed i.i.d. according to $\mathbb{P}[\sigma_i = 1] = \mathbb{P}[\sigma_i = -1] = \frac{1}{2}$. Then the Rademacher complexity of $\mathcal{F}$ with respect to $S$ is defined as:

$$R(\mathcal{F} \circ S) \stackrel{\text{def}}{=} \frac{1}{m} \mathbb{E}_{\sigma \sim \{\pm 1\}^m} \left[ \sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \sigma_i f(z_i) \right]$$

More generally, given a set of vectors $A \subseteq \mathbb{R}^m$, we define

$$R(A) \stackrel{\text{def}}{=} \frac{1}{m} \mathbb{E}_{\sigma} \left[ \sup_{\mathbf{a} \in A} \sum_{i=1}^{m} \sigma_i a_i \right]$$
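Taking $A = \mathcal{F} \circ S$ recovers $R(\mathcal{F} \circ S)$. Since the expectation over $\sigma$ ranges over $2^m$ sign patterns, the easiest way to get a feel for this quantity is a Monte Carlo estimate. Below is a minimal sketch in Python; the class is represented directly by its evaluation vectors (the rows of $\mathcal{F} \circ S$), and the function name `empirical_rademacher` and the toy values are my own, purely for illustration:

```python
import numpy as np

def empirical_rademacher(F_S, n_trials=10_000, seed=0):
    """Monte Carlo estimate of R(F∘S) for a finite class.

    F_S: array of shape (|F|, m); row j holds (f_j(z_1), ..., f_j(z_m)).
    """
    rng = np.random.default_rng(seed)
    _, m = F_S.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_trials, m))  # Rademacher vectors
    sups = np.max(F_S @ sigma.T, axis=0)                 # sup_f sum_i sigma_i f(z_i), per draw
    return sups.mean() / m

# Toy example: two functions evaluated on m = 4 points.
F_S = np.array([[0.0, 1.0, 1.0, 0.0],
                [1.0, 1.0, 0.0, 0.0]])
print(empirical_rademacher(F_S))
```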

The following lemma bounds the expected value of the representativeness of $S$ by twice the expected Rademacher complexity.

$$\mathbb{E}_{S \sim \mathcal{D}^m} [\operatorname{Rep}_{\mathcal{D}}(\mathcal{F}, S)] \leq 2 \, \mathbb{E}_{S \sim \mathcal{D}^m} R(\mathcal{F} \circ S)$$
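As a quick numerical sanity check of this inequality, the sketch below builds a toy finite class on a small discrete domain (all sizes and values are arbitrary assumptions) and compares the two sides empirically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: Z = {0, ..., 9} with the uniform distribution, and a finite
# class F given as a table of values f(z) (rows: functions, columns: z).
F = rng.random((5, 10))                      # 5 functions with values in [0, 1]
m, n_samples, n_sigma = 20, 2_000, 200

rep_vals, rad_vals = [], []
for _ in range(n_samples):
    idx = rng.integers(0, 10, size=m)        # S ~ D^m (uniform over Z)
    F_S = F[:, idx]                          # evaluations of every f on S
    L_D = F.mean(axis=1)                     # true means E_{z~D}[f(z)]
    L_S = F_S.mean(axis=1)                   # empirical means on S
    rep_vals.append(np.max(L_D - L_S))       # Rep_D(F, S)
    sigma = rng.choice([-1.0, 1.0], size=(n_sigma, m))
    rad_vals.append(np.mean(np.max(F_S @ sigma.T, axis=0)) / m)   # R(F∘S)

# Lemma: E[Rep_D(F, S)] <= 2 E[R(F∘S)]
print(np.mean(rep_vals), 2 * np.mean(rad_vals))
```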

The lemma above implies that, in expectation, the ERM rule finds a hypothesis that is close to the optimal hypothesis in $\mathcal{H}$:

$$\mathbb{E}_{S \sim \mathcal{D}^m} [L_{\mathcal{D}}(\operatorname{ERM}_{\mathcal{H}}(S)) - L_S(\operatorname{ERM}_{\mathcal{H}}(S))] \leq 2 \, \mathbb{E}_{S \sim \mathcal{D}^m} R(\ell \circ \mathcal{H} \circ S)$$

Furthermore, for any $h^{\star} \in \mathcal{H}$:

$$\mathbb{E}_{S \sim \mathcal{D}^m} [L_{\mathcal{D}}(\operatorname{ERM}_{\mathcal{H}}(S)) - L_{\mathcal{D}}(h^{\star})] \leq 2 \, \mathbb{E}_{S \sim \mathcal{D}^m} R(\ell \circ \mathcal{H} \circ S)$$

Furthermore, if $h^{\star} = \operatorname{argmin}_{h} L_{\mathcal{D}}(h)$, then for each $\delta \in (0, 1)$, with probability of at least $1 - \delta$ over the choice of $S$, we have:

$$L_{\mathcal{D}}(\operatorname{ERM}_{\mathcal{H}}(S)) - L_{\mathcal{D}}(h^{\star}) \leq \frac{2 \, \mathbb{E}_{S' \sim \mathcal{D}^m} R(\ell \circ \mathcal{H} \circ S')}{\delta}$$

Using McDiarmid’s Inequality, we can derive bounds with better dependence on the confidence parameter:

Assume that for all $z$ and $h \in \mathcal{H}$ we have $|\ell(h, z)| \leq c$. Then:

  1. With probability at least $1 - \delta$, for all $h \in \mathcal{H}$,

$$L_{\mathcal{D}}(h) - L_S(h) \leq 2 \, \mathbb{E}_{S' \sim \mathcal{D}^m} R(\ell \circ \mathcal{H} \circ S') + c \sqrt{\frac{2 \ln(2 / \delta)}{m}}$$

In particular, this holds for $h = \operatorname{ERM}_{\mathcal{H}}(S)$.

  2. With probability at least $1 - \delta$, for all $h \in \mathcal{H}$,

$$L_{\mathcal{D}}(h) - L_S(h) \leq 2 R(\ell \circ \mathcal{H} \circ S) + 4 c \sqrt{\frac{2 \ln(4 / \delta)}{m}}$$

  3. For any $h^{\star}$, with probability at least $1 - \delta$,

$$L_{\mathcal{D}}(\operatorname{ERM}_{\mathcal{H}}(S)) - L_{\mathcal{D}}(h^{\star}) \leq 2 R(\ell \circ \mathcal{H} \circ S) + 5 c \sqrt{\frac{2 \ln(8 / \delta)}{m}}$$
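To get a sense of scale, the following snippet plugs illustrative numbers into the third bound; the values of $R(\ell \circ \mathcal{H} \circ S)$, $c$, $m$, and $\delta$ are assumptions, not taken from any real learning problem:

```python
import math

# Assumed values, purely for illustration.
R_emp = 0.05                 # R(ℓ∘H∘S), e.g. estimated by Monte Carlo as above
c, m, delta = 1.0, 10_000, 0.05

# High-probability bound on the excess risk L_D(ERM_H(S)) - L_D(h*).
excess_risk = 2 * R_emp + 5 * c * math.sqrt(2 * math.log(8 / delta) / m)
print(excess_risk)           # ≈ 0.26
```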

Massart’s lemma states that the Rademacher complexity of a finite set grows logarithmically with the size of the set.

Let $A = \{ \mathbf{a}_1, \ldots, \mathbf{a}_N \}$ be a finite set of vectors in $\mathbb{R}^m$. Define $\bar{\mathbf{a}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{a}_i$. Then,

$$R(A) \leq \max_{\mathbf{a} \in A} \|\mathbf{a} - \bar{\mathbf{a}}\|_2 \, \frac{\sqrt{2 \log N}}{m}$$
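A quick empirical check of Massart's lemma, using random Gaussian vectors as the finite set $A$ (all sizes are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 50, 30
A = rng.normal(size=(N, m))                 # a finite set of N vectors in R^m
a_bar = A.mean(axis=0)

# Monte Carlo estimate of R(A).
sigma = rng.choice([-1.0, 1.0], size=(50_000, m))
R_A = np.mean(np.max(A @ sigma.T, axis=0)) / m

# Massart bound: max_a ||a - a_bar||_2 * sqrt(2 log N) / m.
massart = np.max(np.linalg.norm(A - a_bar, axis=1)) * np.sqrt(2 * np.log(N)) / m
print(R_A, massart)                         # R(A) should not exceed the bound
```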

The contraction lemma shows that composing $A$ with Lipschitz functions does not blow up the Rademacher complexity: if $\phi_1, \ldots, \phi_m$ are $\rho$-Lipschitz and $\boldsymbol{\phi} \circ A$ denotes coordinate-wise composition, then $R(\boldsymbol{\phi} \circ A) \leq \rho \, R(A)$.

Rademacher complexity of linear classes

We define two linear classes:

$$\mathcal{H}_1 = \{ \mathbf{x} \mapsto \langle \mathbf{w}, \mathbf{x} \rangle : \|\mathbf{w}\|_1 \leq 1 \}$$

$$\mathcal{H}_2 = \{ \mathbf{x} \mapsto \langle \mathbf{w}, \mathbf{x} \rangle : \|\mathbf{w}\|_2 \leq 1 \}$$

The Rademacher complexity of $\mathcal{H}_2$ is bounded by the following lemma:

Let $S = (\mathbf{x}_1, \ldots, \mathbf{x}_m)$ be vectors in a Hilbert space. Define $\mathcal{H}_2 \circ S = \{ (\langle \mathbf{w}, \mathbf{x}_1 \rangle, \ldots, \langle \mathbf{w}, \mathbf{x}_m \rangle) : \|\mathbf{w}\|_2 \leq 1 \}$. Then,

$$R(\mathcal{H}_2 \circ S) \leq \frac{\max_i \|\mathbf{x}_i\|_2}{\sqrt{m}}$$
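For this class the empirical Rademacher complexity can be computed without an explicit search over $\mathbf{w}$, since $\sup_{\|\mathbf{w}\|_2 \leq 1} \sum_i \sigma_i \langle \mathbf{w}, \mathbf{x}_i \rangle = \| \sum_i \sigma_i \mathbf{x}_i \|_2$ (a dual-norm identity). A minimal sketch checking the bound on arbitrary Gaussian data:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 100, 5
X = rng.normal(size=(m, d))                 # the sample x_1, ..., x_m

# Dual-norm identity: sup_{||w||_2 <= 1} sum_i sigma_i <w, x_i> = ||sum_i sigma_i x_i||_2.
sigma = rng.choice([-1.0, 1.0], size=(20_000, m))
R_H2 = np.mean(np.linalg.norm(sigma @ X, axis=1)) / m

bound = np.max(np.linalg.norm(X, axis=1)) / np.sqrt(m)
print(R_H2, bound)                          # R(H2∘S) <= max_i ||x_i||_2 / sqrt(m)
```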

The following lemma bounds the Rademacher complexity of $\mathcal{H}_1$:

Let $S = (\mathbf{x}_1, \ldots, \mathbf{x}_m)$ be vectors in $\mathbb{R}^n$. Then,

$$R(\mathcal{H}_1 \circ S) \leq \max_i \|\mathbf{x}_i\|_{\infty} \sqrt{\frac{2 \log(2n)}{m}}$$
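Analogously, for the $\ell_1$ ball the supremum evaluates to the $\ell_\infty$ norm of $\sum_i \sigma_i \mathbf{x}_i$, which gives a direct empirical check of this bound (again on arbitrary Gaussian data):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 50
X = rng.normal(size=(m, n))                 # the sample x_1, ..., x_m in R^n

# Dual-norm identity: sup_{||w||_1 <= 1} sum_i sigma_i <w, x_i> = ||sum_i sigma_i x_i||_inf.
sigma = rng.choice([-1.0, 1.0], size=(20_000, m))
R_H1 = np.mean(np.max(np.abs(sigma @ X), axis=1)) / m

bound = np.max(np.abs(X)) * np.sqrt(2 * np.log(2 * n) / m)
print(R_H1, bound)   # R(H1∘S) <= max_i ||x_i||_inf * sqrt(2 log(2n) / m)
```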