## Bayesian Perspective on Generalization (Sam Smith & Quoc Le, 2018)

Bayesian model comparison was first applied to neural networks by MacKay. Consider a classification model $$M$$ with a single parameter $$w$$, training inputs $$x$$ and training labels $$y$$. We can infer a posterior probability distribution over the parameter by applying Bayes' theorem:

$$P(w|y,x;M) = \frac{P(y|w,x;M)P(w;M)}{P(y|x;M)}$$

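Writing the likelihood as $$P(y|w,x;M) = e^{-H(w;M)}$$, where $$H(w;M)$$ is the cross-entropy on the training set, and assuming a Gaussian prior $$P(w;M) = \sqrt{\lambda/2\pi}\,e^{-\lambda w^2/2}$$, the posterior density of the parameter given the training data is $$P(w|y,x;M) \propto \sqrt{\lambda/2\pi}\,e^{-C(w;M)}$$, where $$C(w;M) = H(w;M) + \lambda w^2 / 2$$ is the L2-regularized cross-entropy.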

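The normalizing constant (the evidence) is $$P(y|x;M) = \sqrt{\frac{\lambda}{2\pi}} \int dw\, e^{-C(w;M)}$$. Assuming that the integral is dominated by the region near the minimum $$w_0$$, we can estimate the evidence by Taylor expanding $$C(w;M) \approx C(w_0) + \frac{1}{2}C''(w_0)(w-w_0)^2$$; the first-order term vanishes because $$w_0$$ is a minimum. The integral is then Gaussian and evaluates to $$e^{-C(w_0)}\sqrt{2\pi/C''(w_0)}$$, which gives: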

$$P(y|x;M) \approx \exp \left\{ -\left( C(w_0) + \frac{1}{2}\ln(C''(w_0)/\lambda) \right) \right\}$$
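As a sanity check on this approximation, the following is a minimal sketch that compares the Laplace estimate $$e^{-C(w_0)}\sqrt{\lambda/C''(w_0)}$$ against a brute-force numerical integral of the evidence. It assumes a one-weight logistic model; the toy data `X`, `Y`, the value of `LAM`, and all function names are hypothetical choices made for illustration and do not come from the paper.

```python
import numpy as np

# Hypothetical toy data (not from the paper): scalar inputs with binary labels.
X = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
Y = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])
LAM = 1.0  # prior precision, i.e. the L2 coefficient lambda


def cost(w, x=X, y=Y, lam=LAM):
    """C(w) = H(w) + lam * w^2 / 2, with H the summed logistic cross-entropy."""
    z = w * x
    # log(1 + e^z) - y*z is the cross-entropy of a sigmoid output for y in {0, 1}.
    return np.sum(np.logaddexp(0.0, z) - y * z) + 0.5 * lam * w**2


def grad(w, x=X, y=Y, lam=LAM):
    """First derivative C'(w)."""
    p = 1.0 / (1.0 + np.exp(-w * x))
    return np.sum((p - y) * x) + lam * w


def curvature(w, x=X, lam=LAM):
    """Second derivative C''(w)."""
    p = 1.0 / (1.0 + np.exp(-w * x))
    return np.sum(x**2 * p * (1.0 - p)) + lam


def find_minimum(n_steps=50):
    """Locate w_0 with plain Newton iterations starting from w = 0."""
    w = 0.0
    for _ in range(n_steps):
        w -= grad(w) / curvature(w)
    return w


def evidence_laplace():
    """exp(-C(w_0)) * sqrt(lam / C''(w_0)), the second-order estimate."""
    w0 = find_minimum()
    return np.exp(-cost(w0)) * np.sqrt(LAM / curvature(w0))


def evidence_numeric(lo=-10.0, hi=10.0, n=20001):
    """sqrt(lam / 2 pi) * integral of exp(-C(w)) dw, via a Riemann sum on a grid."""
    ws = np.linspace(lo, hi, n)
    dw = ws[1] - ws[0]
    vals = np.array([np.exp(-cost(w)) for w in ws])
    return np.sqrt(LAM / (2.0 * np.pi)) * np.sum(vals) * dw


if __name__ == "__main__":
    print("w_0              :", find_minimum())
    print("Laplace evidence :", evidence_laplace())
    print("Numeric evidence :", evidence_numeric())
```

On this toy problem the two estimates should agree closely because the posterior is nearly Gaussian around $$w_0$$; the agreement would be expected to degrade for strongly skewed or multimodal posteriors.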

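In models with $$p$$ parameters, $$P(y|x;M) \approx \frac{\lambda^{p/2}e^{-C(w_0)}} {| \nabla \nabla C(w) |_{w_0}^{1 / 2}}$$, where $$| \nabla \nabla C(w) |_{w_0}$$ is the determinant of the Hessian at the minimum. The ratio $$\lambda^{p/2} / | \nabla \nabla C(w) |_{w_0}^{1/2}$$ can be thought of as an "Occam factor": it penalizes minima with high curvature, so the evidence favors broad minima.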
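To make the preference for broad minima concrete, here is a small sketch of the log of the many-parameter expression, $$-C(w_0) + \frac{p}{2}\ln\lambda - \frac{1}{2}\ln|\nabla\nabla C(w)|_{w_0}$$. The function name `log_evidence_laplace` and the two example Hessians are hypothetical; the Hessian is assumed to be that of the regularized cost, so it is positive definite at a minimum.

```python
import numpy as np


def log_evidence_laplace(c_min, hessian, lam):
    """Log of lam^(p/2) * exp(-C(w_0)) / |Hessian|^(1/2).

    c_min   : value of the regularized cost C at the minimum w_0
    hessian : (p, p) Hessian of C at w_0 (assumed positive definite)
    lam     : prior precision lambda
    """
    p = hessian.shape[0]
    # slogdet avoids overflow/underflow of the determinant when p is large.
    sign, logdet = np.linalg.slogdet(hessian)
    if sign <= 0:
        raise ValueError("Hessian must be positive definite at a minimum")
    return -c_min + 0.5 * p * np.log(lam) - 0.5 * logdet


# Two hypothetical minima with the same cost but different curvature.
broad = log_evidence_laplace(1.0, np.diag([2.0, 2.0]), lam=1.0)
sharp = log_evidence_laplace(1.0, np.diag([100.0, 100.0]), lam=1.0)
print("broad minimum log-evidence:", broad)
print("sharp minimum log-evidence:", sharp)
```

With equal cost at the minimum, the broad minimum receives the higher log-evidence, since only the Occam factor differs between the two cases.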

# Bibliography

Smith, S., & Le, Q. V. (2018). A Bayesian perspective on generalization and stochastic gradient descent.