Topic Modeling
- tags
- Machine Learning
- http://www.cs.columbia.edu/~blei/topicmodeling.html
- LDA survey - Github
LDA
- The Little Book on LDA
- https://www.youtube.com/watch?v=FkckgwMHP2s
- http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf
Dirichlet Distribution
https://www2.ee.washington.edu/techsite/papers/documents/UWEETR-2010-0006.pdf
The Dirichlet distribution is a family of continuous multivariate probability distributions parameterized by a vector \(\alpha\) of positive reals.
\begin{equation} \theta \sim Dir(\alpha) \end{equation}
\begin{equation} p(\theta) = \frac{1}{\mathrm{B}(\alpha)} \prod_{i=1}^n \theta_i^{\alpha_i-1} I(\theta \in S) \end{equation}
Where \(\theta = (\theta_1, \theta_2, \dots, \theta_n), \alpha = (\alpha_1, \alpha_2, \dots, \alpha_n), \alpha_i > 0\) and
\begin{equation} S = \left\{x \in \mathbb{R}^n : x_i \ge 0, \sum_{i=1}^{n} x_i = 1 \right\} \end{equation}
and \(\frac{1}{\mathrm{B}(\alpha)} = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\cdots\Gamma(\alpha_n)}\), where \(\alpha_0 = \sum_{i=1}^{n} \alpha_i\).
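As a quick check on this definition, here is a minimal sketch (assuming NumPy and SciPy are available): draws from a Dirichlet lie on the simplex \(S\), and `scipy.stats.dirichlet` evaluates the density \(p(\theta)\) for a given \(\alpha\).
```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([2.0, 3.0, 4.0])        # parameter vector, all entries > 0 (illustrative values)
samples = np.random.dirichlet(alpha, 5)  # 5 draws, each a point theta on the simplex S

print(samples.sum(axis=1))               # every row sums to 1
print(dirichlet.pdf(samples[0], alpha))  # density p(theta) at the first draw
print(dirichlet.mean(alpha))             # E[theta_i] = alpha_i / alpha_0
```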
The infinite-dimensional generalization of the Dirichlet distribution is the Dirichlet process.
The Dirichlet distribution is the conjugate prior distribution of the categorical distribution (a generic discrete probability distribution with a given number of possible outcomes) and multinomial distribution (the distribution over observed counts of each possible category in a set of categorically distributed observations). This means that if a data point has either a categorical or multinomial distribution, and the prior distribution of the distribution’s parameter (the vector of probabilities that generates the data point) is distributed as a Dirichlet, then the posterior distribution of the parameter is also a Dirichlet.
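To make the conjugacy concrete, a small sketch (the prior and the observed counts below are made up for illustration): with a Dirichlet prior on the category probabilities and multinomially distributed counts, the posterior is again a Dirichlet whose parameter is the prior parameter plus the counts.
```python
import numpy as np

alpha_prior = np.array([1.0, 1.0, 1.0])   # Dir(alpha) prior over a 3-outcome categorical (illustrative)
counts = np.array([10, 2, 5])             # observed counts per category (illustrative)

alpha_posterior = alpha_prior + counts    # conjugacy: posterior is Dir(alpha + counts)
print(alpha_posterior)
print(alpha_posterior / alpha_posterior.sum())   # posterior mean of the category probabilities
```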
Exploring a Corpus with the Posterior Distribution
The quantities needed for exploring a corpus are the posterior expectations of the hidden variables; each of these quantities is conditioned on the observed corpus.
A topic is visualized through its posterior per-topic word probabilities \(\hat{\beta}_k\), typically by listing its most probable words.
A document is visualized using its posterior topic proportions \(\hat{\theta}_{d,k}\) and its posterior topic assignments \(\hat{z}_{d,n}\).
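A minimal sketch of both visualizations, assuming `beta_hat` is a K x V array of per-topic word probabilities, `theta_hat` is a D x K array of per-document topic proportions, and `vocab` is a list of words (all three are hypothetical placeholders, e.g. posterior estimates taken from a fitted model):
```python
import numpy as np

def top_words(beta_hat, vocab, topic, n=10):
    # visualize a topic by listing its highest-probability words
    idx = np.argsort(beta_hat[topic])[::-1][:n]
    return [vocab[i] for i in idx]

def top_topics(theta_hat, doc, n=3):
    # visualize a document by its most probable topics and their proportions
    idx = np.argsort(theta_hat[doc])[::-1][:n]
    return [(int(k), float(theta_hat[doc, k])) for k in idx]
```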
Finding similar documents can be done through the Hellinger distance:
\begin{equation} D_{d,f} = \sum_{k=1}^K \left( \sqrt{\hat{\theta}_{d,k}} - \sqrt{\hat{\theta}_{f,k}}\right)^2 \end{equation}
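A sketch of the similarity computation, reusing the hypothetical `theta_hat` array from above:
```python
import numpy as np

def doc_distance(theta_hat, d, f):
    # distance between the topic proportions of documents d and f (formula above)
    return np.sum((np.sqrt(theta_hat[d]) - np.sqrt(theta_hat[f])) ** 2)

def most_similar(theta_hat, d, n=5):
    dists = [doc_distance(theta_hat, d, f) for f in range(theta_hat.shape[0])]
    order = np.argsort(dists)
    return order[1:n + 1]   # order[0] is d itself (distance 0)
```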
Posterior Inference
Mean Field Variational Inference
Approximate the intractable posterior distribution with a simpler distribution containing free variational parameters. These parameters are fit so that the variational distribution approximates the true posterior.
In contrast to the true posterior, the mean field variational distribution for LDA is one where the hidden variables are independent of each other, each governed by its own variational parameter.
We fit the variational parameters to minimize the KL divergence from the variational distribution to the true posterior.
The general approach of mean-field variational methods - iteratively update each variational parameter to the expectation, under the variational distribution, of the natural parameter of that variable's complete conditional - is applicable when the conditional distribution of each variable is in the exponential family.
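A minimal sketch of these coordinate updates for a single document under the standard LDA mean-field factorization (here `log_beta` is a K x V array of log word probabilities per topic, `word_ids`/`word_counts` describe the document, and `alpha` is a scalar symmetric Dirichlet prior; all of these names are assumed placeholders):
```python
import numpy as np
from scipy.special import digamma

def lda_doc_mean_field(word_ids, word_counts, log_beta, alpha, n_iter=50):
    """Per-document mean-field updates: gamma is the variational Dirichlet over topic
    proportions, phi holds per-word variational topic assignment probabilities."""
    n_topics = log_beta.shape[0]
    counts = np.asarray(word_counts, dtype=float)
    gamma = np.full(n_topics, alpha + counts.sum() / n_topics)   # initialization
    for _ in range(n_iter):
        # phi update: phi_{n,k} proportional to beta_{k, w_n} * exp(E_q[log theta_k]);
        # the -digamma(gamma.sum()) part of E_q[log theta_k] is constant in k and cancels below
        log_phi = log_beta[:, word_ids].T + digamma(gamma)
        phi = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma update: gamma_k = alpha + sum_n count_n * phi_{n,k}
        gamma = alpha + counts @ phi
    return gamma, phi
```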
Markov Chains
http://setosa.io/ev/markov-chains/
Shortcomings
- strong, potentially invalid statistical assumptions:
  - topics have no correlation to one another (the Dirichlet assumes they are nearly independent)
    - solution: CTM (correlated topic model): use a logistic normal distribution instead (see the sketch after this list)
  - the order of documents is assumed not to matter
    - solution: DTM (dynamic topic model): use a logistic normal distribution to model topics evolving over time
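To illustrate the CTM fix mentioned above, a small sketch (the covariance values are made up): a logistic normal draws a Gaussian vector with an arbitrary covariance and maps it onto the simplex with a softmax, so the resulting topic proportions can be positively correlated, which a Dirichlet cannot express.
```python
import numpy as np

def logistic_normal_sample(mu, cov, size, seed=0):
    # draw eta ~ N(mu, cov), then map to the simplex with a softmax
    rng = np.random.default_rng(seed)
    eta = rng.multivariate_normal(mu, cov, size)
    expd = np.exp(eta - eta.max(axis=1, keepdims=True))
    return expd / expd.sum(axis=1, keepdims=True)

mu = np.zeros(3)
cov = np.array([[1.0, 0.8, 0.0],   # topics 0 and 1 positively correlated (illustrative values)
                [0.8, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
theta = logistic_normal_sample(mu, cov, size=1000)
print(np.corrcoef(theta[:, 0], theta[:, 1])[0, 1])   # empirical correlation between theta_0 and theta_1
```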
TopicRNN
http://www.columbia.edu/~jwp2128/Papers/DiengWangetal2017.pdf
In TopicRNN, a latent topic model captures global semantic dependencies so that the RNN can focus its modeling capacity on the local dynamics of the sequence.
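A toy sketch of that division of labor (this is not the paper's exact architecture; the module and all sizes below are made up for illustration): the RNN models local word-to-word dynamics while a fixed per-document topic vector contributes a global additive bias to the output logits.
```python
import torch
import torch.nn as nn

class ToyTopicLM(nn.Module):
    """Illustrative only: an RNN language model whose next-word logits are biased by a
    per-document topic vector, combining global semantics with local sequence dynamics."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, n_topics=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.word_out = nn.Linear(hidden_dim, vocab_size)   # local dynamics
        self.topic_out = nn.Linear(n_topics, vocab_size)    # global semantics

    def forward(self, tokens, theta):
        # tokens: (batch, seq_len) word ids; theta: (batch, n_topics) topic proportions
        hidden, _ = self.rnn(self.embed(tokens))
        return self.word_out(hidden) + self.topic_out(theta).unsqueeze(1)

# usage sketch with random inputs
model = ToyTopicLM(vocab_size=5000)
tokens = torch.randint(0, 5000, (2, 10))
theta = torch.softmax(torch.randn(2, 20), dim=-1)
logits = model(tokens, theta)   # shape (2, 10, 5000)
```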