Jethro's Braindump

Two Levels Of Inference

tags
Occam’s Razor

There are 2 levels of inference: model fitting and model comparison. In model fitting, assuming a model is true (say \(\mathcal{H}_i\)), fit the model to the data by inferring what values its free parameters should possibly take.

\begin{equation} P\left(\mathbf{w} | D, \mathcal{H}_{i}\right)=\frac{P\left(D | \mathbf{w}, \mathcal{H}_{i}\right) P\left(\mathbf{w} | \mathcal{H}_{i}\right)}{P\left(D | \mathcal{H}_{i}\right)} \end{equation}

The normalizing constant is irrelevant to the first level of inference. It is common to use gradient-based methods to find the maximum of the posterior \(\mathbf{w}_{\text{MP}}\). Error bars for these parameters can be obtained by evaluating the Hessian at \(\mathbf{w}_{\text{MP}}\), \(\mathbf{A}=-\nabla \nabla \ln P\left(\mathbf{w} | D, \mathcal{H}_{i}\right)\), and Taylor-expanding the log posterior probability with \(\Delta \mathbf{w}=\mathbf{w}-\mathbf{w}_{\mathrm{MP}}\):

\begin{equation} P\left(\mathbf{w} | D, \mathcal{H}_{i}\right) \simeq P\left(\mathbf{w}_{\mathrm{MP}} | D, \mathcal{H}_{i}\right) \exp \left(-1 / 2 \Delta \mathbf{w}^{\mathrm{T}} \mathbf{A} \Delta \mathbf{w}\right) \end{equation}

locally approximating the posterior as a Gaussian with covariance matrix \(\mathbf{A}^{-1}\).

In model comparison, we compare models in light of the data, assign some sort of preference.

Bayesian methods can consistently and quantitatively solve both types of inferences, although adopting the Bayesian method for the first type leads to similar results from orthodox statistical methods. Orthodox statistical methods will find it difficult to perform model comparisons, because it is not possible simply to choose the model that fits the data itself. For example, maximum likelihood can fail by choosing implausible, over-parameterized models that overfit the data.

How do Bayesian methods perform model comparison? The posterior probability for each model is:

\begin{equation} P\left(\mathcal{H}_{i} | D\right) \propto P\left(D | \mathcal{H}_{i}\right) P\left(\mathcal{H}_{i}\right) \end{equation}

Hence, if we assign equal priors to the alternative models, models \(\mathcal{H}_i\) are ranked by evaluating the evidence. If the posterior is well approximated by a Gaussian, Bayesian model comparison is a simple extension of maximum likelihood model selection: the evidence is obtained by multiplying the best-fit likelihood by the Occam factor, obtained from the determinant of the covariance matrix \(\mathbf{A}^{-1}\) (the inverse Hessian).

Links to this note