Jethro's Braindump

Bayesian Deep Learning

tags
§deep_learning

The Case For Bayesian Learning (Wilson, 2019)

  • Vague parameter prior + structured model (e.g. CNN) = structured function prior!
  • The success of ensembles encourages Bayesians, since ensembles approximate the Bayesian Model Average
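
The ensemble–BMA connection can be sketched numerically: averaging the predictive distributions of independently trained models is a crude Monte Carlo estimate of the Bayesian model average \(p(y|x) = \int p(y|x,w)p(w|\mathcal{D})dw\). A minimal NumPy sketch, with random linear classifiers standing in for trained networks (all data and shapes here are hypothetical, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Stand-ins for M independently trained networks: random linear
# classifiers on toy 2-D inputs (hypothetical).
n_members, n_classes, dim = 5, 3, 2
weights = [rng.normal(size=(dim, n_classes)) for _ in range(n_members)]

x = rng.normal(size=(4, dim))  # a small batch of test inputs

# Ensemble prediction: average the *probabilities*, not the logits.
# This mirrors the Bayesian model average p(y|x) = E_w[p(y|x,w)],
# with ensemble members playing the role of posterior samples.
member_probs = np.stack([softmax(x @ w) for w in weights])  # (M, 4, C)
bma_probs = member_probs.mean(axis=0)
```

Averaging probabilities (rather than logits) is what makes the ensemble output a valid predictive distribution: each row still sums to one.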

Bayesian Perspective on Generalization (Sam Smith & Quoc Le, 2018)

Bayesian model comparisons were first applied to neural networks by MacKay. Consider a classification model \(M\) with a single parameter \(w\), training inputs \(x\) and training labels \(y\). We can infer a posterior probability distribution over the parameter by applying Bayes' theorem:

\begin{equation} P(w|y,x;M) = \frac{P(y|w,x;M)P(w;M)}{P(y|x;M)} \end{equation}

Assuming a Gaussian prior for \(P(w;M)\) with precision \(\lambda\) leads to a posterior density over the parameter given the training data of \(P(w|y,x;M) \propto \sqrt{\lambda/2\pi}\, e^{-C(w;M)}\), where \(C(w;M) = H(w;M) + \lambda w^2 / 2\) is the L2-regularized cross-entropy, \(H(w;M)\) being the cross-entropy of the training labels under the model.
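
For a single parameter, this posterior can be evaluated exactly on a grid. A minimal NumPy sketch with hypothetical toy data, showing that the unnormalized posterior is \(\sqrt{\lambda/2\pi}\,e^{-C(w)}\) and its normalizing constant is the evidence \(P(y|x;M)\):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-parameter logistic classifier (hypothetical data).
x = rng.normal(size=20)
y = (rng.random(20) < 1 / (1 + np.exp(-1.5 * x))).astype(float)

lam = 1.0  # Gaussian prior precision: p(w) = N(0, 1/lam)

def cross_entropy(w):
    # H(w): cross-entropy of the labels under a logistic model.
    p = 1 / (1 + np.exp(-w * x))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

def C(w):
    # L2-regularised cross-entropy: C(w) = H(w) + lam * w^2 / 2
    return cross_entropy(w) + lam * w**2 / 2

# Posterior on a grid: p(w|y,x) ∝ sqrt(lam/2π) exp(-C(w)).
ws = np.linspace(-5, 5, 2001)
dw = ws[1] - ws[0]
unnorm = np.sqrt(lam / (2 * np.pi)) * np.exp(-np.array([C(w) for w in ws]))
evidence = unnorm.sum() * dw        # P(y|x;M), the normalising constant
posterior = unnorm / evidence
```

Normalising by the grid estimate of the integral makes the posterior sum to one over the grid; the same quantity `evidence` is what the Laplace approximation below estimates without exhaustive integration.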

We can evaluate the normalizing constant, \(P(y|x;M) = \sqrt{\frac{\lambda}{2\pi}} \int dw\, e^{-C(w;M)}\). Assuming that the integral is dominated by the region near the minimum \(w_0\), we can estimate the evidence by Taylor expanding \(C(w;M) \approx C(w_0) + \frac{1}{2}C''(w_0)(w-w_0)^2\).

\begin{equation} P(y|x;M) \approx \exp \left\{ -\left( C(w_0) + \frac{1}{2}\ln\left(C''(w_0)/\lambda\right) \right) \right\} \end{equation}
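
In one dimension this Laplace estimate is easy to check against direct numerical integration. A sketch using a simple quadratic stand-in for \(C(w)\) (hypothetical cost; for a quadratic the Taylor expansion is exact, so the two values agree):

```python
import numpy as np

lam = 2.0  # prior precision

def C(w):
    # Hypothetical regularised cost with a single minimum.
    return (w - 1.0) ** 2 + lam * w**2 / 2

def C2(w, h=1e-4):
    # Second derivative C''(w) by central differences.
    return (C(w + h) - 2 * C(w) + C(w - h)) / h**2

ws = np.linspace(-5, 5, 200001)
w0 = ws[np.argmin(C(ws))]  # minimiser of C on the grid

# Direct evidence: P(y|x;M) = sqrt(lam/2π) ∫ e^{-C(w)} dw.
exact = np.sqrt(lam / (2 * np.pi)) * np.exp(-C(ws)).sum() * (ws[1] - ws[0])

# Laplace approximation: exp{-(C(w0) + (1/2) ln(C''(w0)/lam))}.
laplace = np.exp(-(C(w0) + 0.5 * np.log(C2(w0) / lam)))
```

For a non-quadratic cost the two numbers would differ, with the Laplace value capturing only the curvature at \(w_0\).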

In models with \(p\) parameters, \(P(y|x;M) \approx \frac{\lambda^{p/2}e^{-C(w_0)}} {| \nabla \nabla C(w) |_{w_0}^{1 / 2}}\), where the denominator can be thought of as an "Occam factor", causing the network to prefer broad minima.
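
The Occam factor can be made concrete with a small sketch: two hypothetical minima with the same training cost \(C(w_0)\) but different Hessians. The sharper minimum (larger curvature, larger Hessian determinant) receives lower evidence:

```python
import numpy as np

lam, p = 1.0, 2  # prior precision and number of parameters (hypothetical)

def log_evidence(c0, hessian):
    # log P(y|x;M) ≈ (p/2) log(lam) - C(w0) - (1/2) log|∇∇C|_{w0}
    sign, logdet = np.linalg.slogdet(hessian)
    assert sign > 0  # the Hessian is positive definite at a minimum
    return 0.5 * p * np.log(lam) - c0 - 0.5 * logdet

broad = np.diag([1.0, 1.0])      # flat minimum: small curvature
sharp = np.diag([100.0, 100.0])  # sharp minimum: large curvature

# Same training loss at both minima, but the broad one has higher evidence.
gap = log_evidence(0.0, broad) - log_evidence(0.0, sharp)
```

The gap depends only on the ratio of Hessian determinants, which is exactly the "Occam factor" penalty on sharp minima.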

Bibliography

Wilson, A. G., The case for Bayesian deep learning, NYU Courant Technical Report, (2019).

Smith, S. L., & Le, Q. V., A Bayesian perspective on generalization and stochastic gradient descent, In International Conference on Learning Representations (ICLR) (2018).
