Bayesian Inference
Setup
We have some unknown quantity $\theta$ and observed data $y$.
- A sampling model, often expressed as a probability density function $p(y \mid \theta)$, which (viewed as a function of $\theta$) we call the likelihood function
- A prior distribution $\pi(\theta)$, which expresses any prior knowledge or beliefs that we have about the values of $\theta$ before observing the data
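Combining these two ingredients via Bayes' theorem gives the posterior distribution; the denominator (the marginal likelihood) is exactly what the sampling algorithms discussed below avoid computing explicitly:

$$
\pi(\theta \mid y) = \frac{p(y \mid \theta)\,\pi(\theta)}{\int p(y \mid \theta')\,\pi(\theta')\,d\theta'} \;\propto\; p(y \mid \theta)\,\pi(\theta)
$$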
Scalable Bayesian Inference
There is an increasingly immense literature focused on big data. Most of the focus has been on optimization methods, which rapidly obtain a point estimate even when the sample size $n$ is enormous.
The end goal is general probabilistic inference algorithms for complex data that can handle arbitrarily complex probability models. We want algorithms that scale to huge data while still accurately quantifying uncertainty.
Classical Posterior Approximations
- In conjugate models, one can express the posterior in simple form - e.g. as a multivariate Gaussian (see Exponential Family)
- In more complex settings, one can approximate the posterior using some tractable class of distributions
- Large sample Gaussian approximations: approximate the posterior $\pi(\theta \mid y)$ by a multivariate Gaussian. This is also known as the Bayesian central limit theorem (Bernstein-von Mises). This relies on:
  - the sample size $n$ being large relative to the number of parameters
  - the likelihood being smooth and differentiable
  - the true value $\theta_0$ lying in the interior of the parameter space
- A related class of approximations includes the Laplace approximation (a minimal sketch follows this list)
- These approximations do well at approximating the first and second moments, whereas variational Bayes may have trouble with the 2nd moment
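As a concrete illustration, here is a minimal sketch of a Laplace approximation for a Bayesian logistic regression posterior; the simulated data, the $N(0, 10I)$ prior, and all names are illustrative assumptions rather than anything specified above.

```python
import numpy as np
from scipy import optimize

# Simulated data for an illustrative Bayesian logistic regression.
rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

def neg_log_posterior(beta):
    logits = X @ beta
    loglik = np.sum(y * logits - np.logaddexp(0.0, logits))  # Bernoulli log-likelihood
    logprior = -0.5 * beta @ beta / 10.0                      # N(0, 10 I) prior
    return -(loglik + logprior)

# Posterior mode (MAP estimate) by numerical optimization.
beta_map = optimize.minimize(neg_log_posterior, np.zeros(p), method="BFGS").x

# Hessian of the negative log posterior at the mode (analytic for this model):
# X' W X from the likelihood plus the prior precision (1/10) I.
probs = 1.0 / (1.0 + np.exp(-X @ beta_map))
w = probs * (1.0 - probs)
H = X.T @ (w[:, None] * X) + np.eye(p) / 10.0

# Laplace approximation: pi(beta | y) is approximated by N(beta_map, H^{-1}).
cov_laplace = np.linalg.inv(H)
print("approx posterior mean:", beta_map)
print("approx posterior sd:  ", np.sqrt(np.diag(cov_laplace)))
```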
Alternative Analytic approximations
- We can define some approximating class $\mathcal{Q}$, which may be something like a product of exponential family distributions.
- We then define some discrepancy between $q \in \mathcal{Q}$ and the posterior $\pi(\theta \mid y)$, e.g. the KL divergence.
- This forms the basis of variational Bayes, expectation propagation, and related methods.
- Optimize the parameters of $q$ to minimize the discrepancy (a minimal sketch follows below)
See the ICML 2018 tutorial by Tamara Broderick for an overview. In general, we have no clue how accurate the approximation is. See also Pati, Bhattacharya, and Yang, n.d.
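As a minimal sketch of the variational idea, here is a mean-field (CAVI) approximation for the textbook normal model $y_i \sim N(\mu, \tau^{-1})$ with conjugate priors $\mu \mid \tau \sim N(\mu_0, (\lambda_0\tau)^{-1})$ and $\tau \sim \mathrm{Gamma}(a_0, b_0)$; the data and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=0.5, size=1000)   # simulated data (illustrative)
n, ybar, ysq = len(y), y.mean(), np.sum(y ** 2)

# Hyperparameters: mu | tau ~ N(mu0, 1/(lambda0*tau)), tau ~ Gamma(a0, b0).
mu0, lambda0, a0, b0 = 0.0, 1.0, 1.0, 1.0

# Variational factors: q(mu) = N(mu_n, 1/lambda_n), q(tau) = Gamma(a_n, b_n).
mu_n, lambda_n = ybar, 1.0
a_n, b_n = a0, b0

for _ in range(50):                       # CAVI: cycle through coordinate updates
    e_tau = a_n / b_n                     # E_q[tau]
    # Update q(mu)
    mu_n = (lambda0 * mu0 + n * ybar) / (lambda0 + n)
    lambda_n = (lambda0 + n) * e_tau
    # Update q(tau), using E_q[mu] and E_q[mu^2]
    e_mu, e_mu2 = mu_n, mu_n ** 2 + 1.0 / lambda_n
    a_n = a0 + (n + 1) / 2.0
    b_n = b0 + 0.5 * (lambda0 * (e_mu2 - 2 * mu0 * e_mu + mu0 ** 2)
                      + ysq - 2 * n * ybar * e_mu + n * e_mu2)

print("q(mu):  mean %.3f, sd %.3f" % (mu_n, lambda_n ** -0.5))
print("q(tau): mean %.3f (true precision %.3f)" % (a_n / b_n, 1 / 0.25))
```

Consistent with the caveat above, the factorized $q(\mu)q(\tau)$ tends to understate posterior spread even when the means are accurate.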
Markov Chain Monte Carlo
Accurate analytic approximations to the posterior have proven elusive outside of narrow settings. MCMC and other posterior sampling algorithms provide an alternative.
MCMC: a sequential algorithm to obtain correlated draws from the posterior, bypassing the need to approximate the marginal likelihood.
Often, samples are more useful than an analytic form anyway.
We can get MCMC-based summaries of the posterior for any functional $f(\theta)$.

Metropolis-Hastings (MH):
- Sample a proposal $\theta^* \sim q(\theta^* \mid \theta^{(t)})$, where $\theta^{(t)}$ is the sample at step $t$
- Accept the proposal by letting $\theta^{(t+1)} = \theta^*$ with probability
$$
\min\left\{1,\; \frac{\pi(\theta^*)\, p(y \mid \theta^*)\, q(\theta^{(t)} \mid \theta^*)}{\pi(\theta^{(t)})\, p(y \mid \theta^{(t)})\, q(\theta^* \mid \theta^{(t)})}\right\},
$$
  and otherwise setting $\theta^{(t+1)} = \theta^{(t)}$
We want to design efficient MH algorithms by choosing good proposals $q$.
For example, in Gibbs sampling, we let $q$ be the full conditional posterior distribution of a block of parameters given the rest, so proposals are always accepted.
In random walk MH, $\theta^* = \theta^{(t)} + \epsilon$ for a random perturbation, e.g. $\epsilon \sim N(0, \Sigma)$.
In HMC/Langevin, we exploit gradient information to generate samples far from $\theta^{(t)}$.
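To make the recipe concrete, here is a minimal random-walk Metropolis sketch; the target density, step size, and all names are illustrative assumptions (with a symmetric proposal, the $q$ terms cancel in the acceptance ratio).

```python
import numpy as np

def random_walk_metropolis(log_post, theta0, n_iter=5000, step=0.5, seed=0):
    """Random-walk Metropolis: propose theta* = theta + eps and accept with
    probability min(1, pi(theta*)/pi(theta)); symmetric proposal cancels."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    logp = log_post(theta)
    draws = np.empty((n_iter, theta.size))
    for t in range(n_iter):
        proposal = theta + step * rng.normal(size=theta.size)
        logp_prop = log_post(proposal)
        if np.log(rng.uniform()) < logp_prop - logp:   # MH accept/reject step
            theta, logp = proposal, logp_prop
        draws[t] = theta                               # repeat old value if rejected
    return draws

# Example target: an (unnormalized) correlated bivariate Gaussian log-density.
prec = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 1.0]]))
log_post = lambda th: -0.5 * th @ prec @ th
draws = random_walk_metropolis(log_post, theta0=[0.0, 0.0])
print("posterior mean estimate:", draws[1000:].mean(axis=0))
```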
MCMC & Computational Bottlenecks
Time per iteration of MCMC increases with the number of parameters and unknowns, and it also increases with the sample size $n$ (each likelihood evaluation typically touches all $n$ observations).
MCMC does not produce independent samples from the posterior distribution
A well designed MCMC algorithm with a good proposal should ideally exhibit rapid convergence and mixing.
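Mixing is often summarized by an effective sample size computed from the chain's autocorrelations; below is a deliberately crude sketch of that idea (real implementations, e.g. in Stan, use more careful truncation rules).

```python
import numpy as np

def effective_sample_size(chain, max_lag=200):
    """Crude ESS estimate: n / (1 + 2 * sum of positive autocorrelations)."""
    chain = np.asarray(chain, dtype=float)
    n = len(chain)
    centered = chain - chain.mean()
    var = centered.var()
    rho_sum = 0.0
    for lag in range(1, min(max_lag, n - 1)):
        rho = np.dot(centered[:-lag], centered[lag:]) / ((n - lag) * var)
        if rho < 0.05:   # stop once autocorrelation is negligible (simplistic rule)
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)
```

Applied to one coordinate of the random-walk draws above, this typically gives far fewer effective draws than iterations.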
Embarrassingly Parallel MCMC (e.g. Srivastava et al., n.d.; @li16): partition the data into subsets, run MCMC independently on each subset in parallel, and then combine the resulting subposterior draws.
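A hedged sketch of one such combination rule, in the spirit of consensus Monte Carlo: each worker samples its shard's subposterior (with the prior downweighted to $\pi(\theta)^{1/S}$ so it is not counted $S$ times), and the draws are merged by precision-weighted averaging, which is exact only when the subposteriors are Gaussian.

```python
import numpy as np

def consensus_combine(subposterior_draws):
    """Merge a list of S arrays, each of shape (n_draws, p), holding draws from
    the S subposteriors, by precision-weighted averaging of the i-th draw from
    every shard (consensus-Monte-Carlo-style combination)."""
    weights = [np.linalg.inv(np.atleast_2d(np.cov(d, rowvar=False)))
               for d in subposterior_draws]
    total = np.linalg.inv(sum(weights))
    weighted_sum = sum(w @ d.T for w, d in zip(weights, subposterior_draws))
    return (total @ weighted_sum).T

# Toy check: combining two "shards" whose subposteriors are both N(0, I)
# should give draws concentrated around 0 with sd roughly 1/sqrt(2).
rng = np.random.default_rng(4)
shards = [rng.normal(size=(4000, 2)), rng.normal(size=(4000, 2))]
print(consensus_combine(shards).std(axis=0))
```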
We can replace expensive transition kernels with approximations (Johndrow et al., n.d.). For example, we can approximate a conditional distribution in a Gibbs sampler with a Gaussian, or evaluate it using a subsample of the data, vastly speeding up MCMC sampling in high-dimensional settings.
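As one hedged illustration of the subsampling idea (not the specific construction analyzed in Johndrow et al.), an expensive log-likelihood inside an MH or Gibbs step could be replaced by an up-weighted random subsample; the $N(\theta, 1)$ model and all numbers here are purely illustrative.

```python
import numpy as np

def subsampled_loglik(theta, y, m, rng):
    """Approximate sum_i log p(y_i | theta) by (n/m) times the sum over a random
    subsample of size m; here p(y_i | theta) is an illustrative N(theta, 1)."""
    n = len(y)
    idx = rng.choice(n, size=m, replace=False)
    return (n / m) * np.sum(-0.5 * (y[idx] - theta) ** 2)

# With a flat prior, this can stand in for the exact log posterior in the
# random-walk Metropolis sketch above, cutting the cost per iteration from
# O(n) to O(m) at the price of an approximate (noisy) transition kernel.
rng = np.random.default_rng(5)
y = rng.normal(1.5, 1.0, size=100_000)
log_post = lambda th: subsampled_loglik(th[0], y, m=1_000, rng=rng)
```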
Robustness in Big Data
In standard Bayesian inference, it is assumed that the model is correct. Small violations of this assumption can have large impacts, particularly in large datasets. The ability to carefully check modelling assumptions decreases for big/complex data. This motivates tweaking the Bayesian paradigm to be inherently more robust.
High-p problems
- There is a huge literature proposing different penalties: adaptive lasso, fused lasso, elastic net, etc.
- In general, these methods only produce a sparse point estimate, which is dangerous scientifically, and there are many errors in interpreting the zero vs. non-zero elements
- Parallel Bayesian literature on shrinkage priors - horseshoe, generalized double Pareto, Dirichlet-Laplace, etc.
What's an appropriate prior?
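To make the shrinkage-prior idea concrete, here is a small sketch drawing from the horseshoe prior via its standard local-global scale-mixture form, $\beta_j \mid \lambda_j, \tau \sim N(0, \lambda_j^2\tau^2)$ with half-Cauchy scales $\lambda_j, \tau \sim C^{+}(0, 1)$; the Gaussian comparison and all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n_draws = 100_000

# Horseshoe: beta_j | lambda_j, tau ~ N(0, lambda_j^2 tau^2),
# with local scales lambda_j ~ C+(0, 1) and global scale tau ~ C+(0, 1).
tau = np.abs(rng.standard_cauchy(n_draws))
lam = np.abs(rng.standard_cauchy(n_draws))
beta_hs = rng.normal(0.0, lam * tau)

# A Gaussian prior matched on the median absolute deviation, for comparison.
beta_gauss = rng.normal(0.0, np.median(np.abs(beta_hs)) / 0.6745, size=n_draws)

# The horseshoe puts extra mass both very near zero (strong shrinkage of noise
# coefficients) and far out in the tails (signals escape the shrinkage).
for name, b in [("horseshoe", beta_hs), ("gaussian ", beta_gauss)]:
    print(name,
          "P(|beta| < 0.01) = %.4f" % np.mean(np.abs(b) < 0.01),
          "P(|beta| > 10) = %.4f" % np.mean(np.abs(b) > 10))
```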