# Bayesian Deep Learning

## The Case For Bayesian Learning (Wilson, n.d.)

• Vague parameter prior + structured model (e.g. CNN) = structured function prior!
• The empirical success of ensembles supports the Bayesian view, since an ensemble can be read as an approximation to the Bayesian Model Average
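To make the second point concrete, here is a minimal sketch of how an ensemble prediction resembles a Monte Carlo estimate of the Bayesian Model Average, $$P(y|x,D) \approx \frac{1}{K}\sum_k P(y|x,w_k)$$. It assumes each ensemble member is treated as one equally weighted sample of the weights; the random logits stand in for the outputs of trained networks, and the function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

def ensemble_predict(member_logits):
    """Average class probabilities over ensemble members.

    member_logits has shape (K, N, classes): K members, N inputs.
    Averaging probabilities (not logits) mirrors the model-average sum,
    with each member given equal weight 1/K (an assumption of this sketch).
    """
    return softmax(member_logits).mean(axis=0)

# Placeholder logits: 5 hypothetical members, 4 inputs, 3 classes.
rng = np.random.default_rng(0)
member_logits = rng.normal(size=(5, 4, 3))
avg_probs = ensemble_predict(member_logits)
```

Each row of `avg_probs` is still a valid probability distribution, because a uniform average of distributions is itself a distribution.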

## Bayesian Perspective on Generalization (Smith and Le, n.d.)

Bayesian model comparison was first applied to neural networks by MacKay. Consider a classification model $$M$$ with a single parameter $$w$$, training inputs $$x$$ and training labels $$y$$. We can infer a posterior probability distribution over the parameter by applying Bayes' theorem:

\begin{equation} P(w|y,x;M) = \frac{P(y|w,x;M)P(w;M)}{P(y|x;M)} \end{equation}

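Writing the likelihood as $$P(y|w,x;M) = e^{-H(w;M)}$$, where $$H(w;M)$$ is the cross-entropy of the training labels, and assuming a Gaussian prior $$P(w;M) = \sqrt{\lambda/2\pi}e^{-\lambda w^2/2}$$, the posterior density of the parameter given the training data is $$P(w|y,x;M) \propto \sqrt{\lambda/2\pi}e^{-C(w;M)}$$, where $$C(w;M) = H(w;M) + \lambda w^2 / 2$$ is the L2-regularized cross-entropy.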

We can evaluate the normalizing constant, $$P(y|x;M) = \sqrt{\frac{\lambda}{2\pi}} \int dw \, e^{-C(w;M)}$$. Assuming that the integral is dominated by the region near the minimum $$w_0$$, we can estimate the evidence by Taylor expanding $$C(w;M) \approx C(w_0) + \frac{1}{2}C''(w_0)(w-w_0)^2$$; the first-order term vanishes at the minimum. The resulting Gaussian integral evaluates to $$\sqrt{2\pi/C''(w_0)}$$, giving:

\begin{equation} P(y|x;M) \approx \exp \left\{ -\left( C(w_0) + \frac{1}{2}\ln(C''(w_0)/\lambda) \right) \right\} \end{equation}
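As a rough numerical check of this estimate, the sketch below fits a one-parameter logistic model to a small made-up dataset and compares the Laplace estimate of the evidence with a brute-force trapezoid integration of $$\sqrt{\lambda/2\pi}\int dw \, e^{-C(w;M)}$$. The data, the logistic likelihood, the value of $$\lambda$$ and all function names are illustrative assumptions, not taken from Smith and Le.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, x, y, lam):
    """C(w) = H(w) + lam * w^2 / 2, with H the cross-entropy for labels in {0, 1}."""
    z = w * x
    # log(1 + e^z) - y*z is the Bernoulli negative log-likelihood under a logistic link.
    return np.sum(np.logaddexp(0.0, z) - y * z) + 0.5 * lam * w ** 2

def curvature(w, x, lam):
    """Second derivative C''(w)."""
    s = sigmoid(w * x)
    return np.sum(s * (1.0 - s) * x ** 2) + lam

def find_minimum(x, y, lam, steps=50):
    """Locate w_0 with plain Newton iterations (C is strictly convex here)."""
    w = 0.0
    for _ in range(steps):
        grad = np.sum((sigmoid(w * x) - y) * x) + lam * w
        w -= grad / curvature(w, x, lam)
    return w

def laplace_evidence(x, y, lam):
    """exp{-(C(w_0) + 0.5 * ln(C''(w_0) / lam))}."""
    w0 = find_minimum(x, y, lam)
    return np.exp(-(cost(w0, x, y, lam) + 0.5 * np.log(curvature(w0, x, lam) / lam)))

def numerical_evidence(x, y, lam, half_width=10.0, n=20001):
    """sqrt(lam / 2 pi) * integral of exp(-C(w)) dw, by the trapezoid rule."""
    grid = np.linspace(-half_width, half_width, n)
    values = np.array([cost(w, x, y, lam) for w in grid])
    integrand = np.sqrt(lam / (2.0 * np.pi)) * np.exp(-values)
    dw = grid[1] - grid[0]
    return np.sum(0.5 * (integrand[1:] + integrand[:-1])) * dw

# Made-up, non-separable toy data.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
lam = 1.0

approx = laplace_evidence(x, y, lam)
exact = numerical_evidence(x, y, lam)
print(f"Laplace: {approx:.5f}  numerical: {exact:.5f}")
```

The two numbers should agree closely but not exactly, since the quadratic expansion ignores higher-order terms of $$C(w;M)$$ around $$w_0$$.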

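In models with many parameters ($$p$$ of them), $$P(y|x;M) \approx \frac{\lambda^{p/2}e^{-C(w_0)}} {| \nabla \nabla C(w) |_{w_0}^{1 / 2}}$$, where $$| \nabla \nabla C(w) |_{w_0}$$ is the determinant of the Hessian of the cost at the minimum. The ratio $$\lambda^{p/2} / | \nabla \nabla C(w) |_{w_0}^{1/2}$$ can be thought of as an “Occam factor”: it penalizes sharp minima, where the curvature is large, so the evidence favors broad minima.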
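The multi-parameter estimate is easier to work with in log form, $$\ln P(y|x;M) \approx -C(w_0) + \frac{p}{2}\ln\lambda - \frac{1}{2}\ln|\nabla\nabla C(w)|_{w_0}$$. The sketch below evaluates it for two hypothetical minima that reach the same cost but differ in curvature. The Hessians and cost values are invented for illustration, and `laplace_log_evidence` is a name introduced here.

```python
import numpy as np

def laplace_log_evidence(cost_at_min, hessian, lam):
    """Log of lam^(p/2) * exp(-C(w_0)) / |Hessian|^(1/2).

    `hessian` is the Hessian of the regularized cost C at the minimum,
    so it already includes the lam * I contribution from the prior.
    """
    p = hessian.shape[0]
    sign, logdet = np.linalg.slogdet(hessian)  # stable log-determinant
    if sign <= 0:
        raise ValueError("Hessian must be positive definite at a minimum")
    return -cost_at_min + 0.5 * p * np.log(lam) - 0.5 * logdet

lam = 1.0
cost_at_min = 3.0  # same training cost at both hypothetical minima

broad_hessian = np.diag([1.5, 2.0])    # low curvature: broad minimum
sharp_hessian = np.diag([50.0, 80.0])  # high curvature: sharp minimum

log_ev_broad = laplace_log_evidence(cost_at_min, broad_hessian, lam)
log_ev_sharp = laplace_log_evidence(cost_at_min, sharp_hessian, lam)
print(f"broad: {log_ev_broad:.3f}  sharp: {log_ev_sharp:.3f}")
```

With equal cost, the broad minimum receives the higher log-evidence, because its smaller Hessian determinant gives a less severe Occam penalty.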

## Bibliography

Smith, Sam, and Quoc V. Le. n.d. “A Bayesian Perspective on Generalization and Stochastic Gradient Descent.” https://openreview.net/pdf?id=BJij4yg0Z.

Wilson, Andrew Gordon. n.d. “The Case for Bayesian Deep Learning.”