Scalable Bayesian Inference

There is an increasingly immense literature focused on big data. Most of the focus has been on optimization methods. Rapidly obtaining a point estimate even when sample size \(n\) & overall “size” of data is immense. There has been a huge focus on specific settings - e.g. linear regression, labelling images, etc. This leads to many people working on similar problems, while critical open problems remain untouched.

The end goal is having general probabilistic inference algorithms for complex data, and being able to handle arbitrarily probability models. We want algorithms to be scalable to huge data, and also be able to accurately quantify uncertainty.

Classical Posterior Approximations

\begin{equation} \pi_n(\theta|Y^{(n)}) \approx N(\hat{\mu}_s, \Sigma_n) \end{equation}

This is also known as the Bayesian central limit theorem (Bernstein von Mises). This relies on:

Alternative Analytic approximations

See ICML 2018 Tutorial by Tamara Broderick for an overview. In general, we have no clue how accurate the approximation is. See also (Pati {\it et al.}, 2017).

Markov Chain Monte Carlo

Accurate analytic approximations to the posterior have proven elusive outside of narrow settings. MCMC and other posterior sampling algorithms provide an alternative.

MCMC: sequential algorithm to obtain correlated draws from the posterior, and bypasses the need to approximate the marginal likelihood.

Often, samples are more useful than an analytic form anyway.

We can get MCMC-based summaries of the posterior for any functional \(f(\theta)\). As the number of samples \(T\) increases, these summaries become more accurate. MCMC constructs a Markov chain with stationary distribution \(\pi_n(\theta|Y^{(n)})\). A transition kernel is carefully chosen and iterative sampling proceeds. Most MCMC algorithms types of Metropolis-Hastings (MH):

  1. \(\theta^* \sim g(\theta^{(t-1)})\) sample a proposal (\(\theta^{(t)}\) is a sample at step t)
  2. Accept a proposal by letting \(\theta^{(t)} = \theta^*\) with probability

\begin{equation} \mathrm{min} \left(1, \frac{\pi(\theta^*)L(Y^{(n)}|\theta)}{\pi(\theta^{(t-1)})L(Y^{(n)}|\theta^{(t-1)})} \frac{g(\theta^{(t-1)})}{g(\theta^*)} \right) \end{equation}

We want to design efficient MH algorithms by choosing good proposals \(g\). \(g(\cdot)\) can depend on the previous value of \(\theta\) and on the data but not on further back samples - except in adaptive MH.

For example, in Gibbs sampling, let \(\theta = (\theta_1, \dots, \theta_p)’\) we draw subsets of \(\theta\) from their exact conditional posterior distributions fixing the other.

In random walk, \(g(\theta^{(t-1)})\) is a distribution centered on \(\theta^{(t-1)}\) with a tunable covariance.

In HMC/Langevin, we exploit gradient information to generate samples far from \(\theta^{(t-1)}\) having high posterior density.

MCMC & Computational Bottlenecks

Time per iteration of MCMC increases with the number of parameters and unknowns, and also the increase with sample size \(n\). This is due to the cost of sampling proposal & calculating acceptance probability. This is similar to costs that occur in most optimization algorithms.

MCMC does not produce independent samples from the posterior distribution \(\pi_n(\theta)\). These draws are auto-correlated, and as the level of correlation increases, the information provided by each sample decreases. “Slow mixing” Markov chains have highly autocorrelated draws.

A well designed MCMC algorithm with a good proposal should ideally exhibit rapid convergence and mixing.

Embarrassingly Parallel MCMC e.g. (Sanvesh Srivastava {\it et al.}, 2015),(Li {\it et al.}, 2016)

We can replace expensive transition kernels with approximations (Johndrow {\it et al.}, 2015). for example, we approximate a conditional distribution in Gibbs sampler with a Gaussian or using a subsample of data, vastly speeding up MCMC sampling in high-dimensional settings.

Robustness in Big Data

In standard Bayesian inference, it is assumed that the model is correct. Small violations of this assumption sometimes have a large impacts, particularly in large datasets. The ability to carefully modelling assumptions decreases for big/complex data. This appeals to tweaking the Bayesian paradigm to be inherently more robust.

High-p problems

What’s an appropriate \(\pi(\beta)\) for the high dimensional vector of coefficients? Most commonly used is a local-global scale mixture of Gaussians. (Johndrow {\it et al.}, 2017)

Bibliography

Pati, D., Bhattacharya, A., & Yang, Y., On statistical optimality of variational bayes, CoRR, (), (2017).

Srivastava, S., Cevher, V., Dinh, Q., & Dunson, D., WASP: Scalable Bayes via barycenters of subset posteriors, In G. Lebanon, & S. V. N. Vishwanathan, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (pp. 912–920) (2015). San Diego, California, USA: PMLR.

Li, C., Srivastava, S., & Dunson, D. B., Simple, scalable and accurate posterior interval estimation, CoRR, (), (2016).

Johndrow, J. E., Mattingly, J. C., Mukherjee, S., & Dunson, D., Optimal approximating markov chains for bayesian inference, CoRR, (), (2015).

Johndrow, J. E., Orenstein, P., & Bhattacharya, A., Bayes shrinkage at gwas scale: convergence and approximation theory of a scalable mcmc algorithm for the horseshoe prior, CoRR, (), (2017).