
RIKEN AIP Workshop 2019

Weakly Supervised Classification

Motivation

  • Machine learning from big data is already successful
  • In some cases, massive labelled data is not available
  • Classification from limited information

Supervised Classification

A large number of labelled samples yields better classification performance; the optimal convergence rate is \(O(n^{-\frac{1}{2}})\) in the number of labelled samples \(n\).

Unsupervised Classification

Since collecting labelled samples is costly, we may want to learn a classifier from unlabelled data alone. This is equivalent to clustering.

Semi-supervised Classification

  • Use a large number of unlabelled samples and a small number of labelled samples.
  • Find a decision boundary along cluster structure induced by unlabelled samples.

Positive Unlabelled Classification

Given positive and unlabelled samples:

\begin{equation} \{x_i^P\}_{i=1}^{n_P} \sim p(x|y=+1) \end{equation}

\begin{equation} \{x_i^U\}_{i=1}^{n_U} \sim p(x) \end{equation}

The risk of a classifier can be decomposed into two terms:

  1. Risk for positive data
  2. Risk for negative data

Since negative data is unavailable in the PU setting, the risk for negative data cannot be estimated directly.

The U-density, however, is a mixture of the positive and negative densities, \(p(x) = \pi p(x|y=+1) + (1-\pi) p(x|y=-1)\), where \(\pi = p(y=+1)\) is the class prior. The risk decomposes accordingly:

\begin{equation} R(f) = \pi E_{p(x|y=+1)} \left[ l(f(x)) \right] + (1-\pi) E_{p(x|y=-1)}\left[ l(-f(x)) \right] \end{equation}

Substituting this mixture into the negative-risk term yields an unbiased risk estimator that uses only positive and unlabelled data.
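
Concretely, since \((1-\pi) p(x|y=-1) = p(x) - \pi p(x|y=+1)\), the negative-risk term can be rewritten with expectations over positive and unlabelled data only (this is the standard PU rewriting; the algebra is filled in here rather than taken from the slides):

\begin{equation} R(f) = \pi E_{p(x|y=+1)}\left[ l(f(x)) \right] + E_{p(x)}\left[ l(-f(x)) \right] - \pi E_{p(x|y=+1)}\left[ l(-f(x)) \right] \end{equation}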

By estimating error bounds, one can show that PU learning can outperform PN learning when a large amount of PU data is available.
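
As a toy illustration of the unbiased PU risk above, here is a minimal Python sketch. The one-dimensional Gaussian data, logistic loss, known class prior \(\pi\), and grid search over linear classifiers are all illustrative assumptions, not details from the talk.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical setup: positives ~ N(+2, 1), negatives ~ N(-2, 1), known class prior pi.
    pi, n_p, n_u = 0.4, 100, 1000
    x_p = rng.normal(2.0, 1.0, n_p)                # positive samples
    is_pos = rng.random(n_u) < pi                  # latent labels of the unlabelled pool
    x_u = np.where(is_pos, rng.normal(2.0, 1.0, n_u), rng.normal(-2.0, 1.0, n_u))

    def logistic_loss(margin):
        return np.log1p(np.exp(-margin))

    def pu_risk(w, b):
        # Unbiased PU risk: pi*E_P[l(f(x))] + E_U[l(-f(x))] - pi*E_P[l(-f(x))]
        f_p, f_u = w * x_p + b, w * x_u + b
        return (pi * logistic_loss(f_p).mean()
                + logistic_loss(-f_u).mean()
                - pi * logistic_loss(-f_p).mean())

    # Pick the linear classifier f(x) = w*x + b with the lowest estimated risk on a small grid.
    candidates = [(w, b) for w in np.linspace(0.1, 3.0, 30) for b in np.linspace(-2.0, 2.0, 30)]
    w_best, b_best = min(candidates, key=lambda p: pu_risk(*p))
    print("selected classifier: f(x) = %.2f * x + %.2f" % (w_best, b_best))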

PNU Classification

  • Train PU, PN, and NU classifiers and combine them (one possible combination is sketched after this list).
  • Unlabelled data always helps, without cluster assumptions.
  • Use unlabelled data for loss evaluation (reducing the bias), not for regularisation.
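
One natural combination (the exact weighting here is my assumption, not a formula quoted from the talk) interpolates between the PN risk and a PU (or NU) risk with a weight \(\gamma \in [0, 1]\):

\begin{equation} R_{PNU}^{\gamma}(f) = (1 - \gamma) R_{PN}(f) + \gamma R_{PU}(f) \end{equation}

The unlabelled data then enters through the PU (or NU) risk term, i.e., through loss evaluation rather than regularisation.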

Pconf Classification

Only positive data is available, for example:

  1. Data from rival companies cannot be obtained
  2. Only successful examples are available

If the positive data comes with confidence scores, we can still train a classifier.
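
If the confidence is \(r(x) = p(y=+1|x)\), then \(p(x) = \pi p(x|y=+1)/r(x)\), and the risk can be rewritten using positive data alone. The form below is my reconstruction of this rewriting rather than a formula quoted from the talk:

\begin{equation} R(f) = \pi E_{p(x|y=+1)}\left[ l(f(x)) + \frac{1 - r(x)}{r(x)} l(-f(x)) \right] \end{equation}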

Other settings: similar-unlabelled classification, etc.

Fast Computation of Uncertainty in Deep Learning

Author: Emtiyaz Khan
Links: https://emtiyaz.github.io/

Uncertainty quantifies the confidence in a model's prediction, i.e., how much the model does not know.

Uncertainty in Deep Learning

\begin{equation} p(D|\theta) = \prod_{i=1}^{N} p(y_i | f_\theta (x_i)) \end{equation}

This is the likelihood of the data given the parameters: each output \(y_i\) is modelled given the neural network's prediction \(f_\theta(x_i)\) on the corresponding input.

  1. Place a prior distribution on the parameters: \(\theta \sim p(\theta)\)
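
Combining the prior with the likelihood above gives the posterior over the network weights via Bayes' rule (standard, and generally intractable for deep networks, which motivates the approximation below):

\begin{equation} p(\theta | D) = \frac{p(D | \theta) p(\theta)}{\int p(D | \theta) p(\theta) \, d\theta} \end{equation}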

Approximating Inference with Gradients

\begin{equation} p(\theta | D) \approx q(\theta) = N(\theta | \mu, \sigma^2) \end{equation}

Find the \(\mu\) and \(\sigma^2\) such that \(q\) is close to the posterior distribution.

\begin{equation} \max_{\mu, \sigma^2} L(\mu, \sigma^2) = E_q\left[ \log \frac{p(\theta)}{q(\theta)} \right] + \sum_{i=1}^N E_q \left[ \log p(D_i|\theta) \right] \end{equation}
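
To make the objective concrete, here is a minimal sketch of maximising this lower bound with the reparameterisation trick on a toy conjugate model; the model, step size, and sample count are illustrative choices, not from the talk.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy conjugate model (illustrative): y_i ~ N(theta, 1) with prior theta ~ N(0, 1).
    y = rng.normal(1.5, 1.0, size=50)

    def grad_log_joint(theta):
        # d/dtheta [log p(theta) + sum_i log p(y_i | theta)] for the toy model.
        return -theta + np.sum(y - theta)

    # Mean-field Gaussian q(theta) = N(mu, sigma^2). Maximise the lower bound
    # E_q[log p(theta) + sum_i log p(y_i | theta)] + entropy(q) by stochastic
    # gradient ascent with the reparameterisation theta = mu + sigma * eps.
    mu, log_sigma = 0.0, 0.0
    lr, n_mc = 0.01, 16
    for step in range(2000):
        sigma = np.exp(log_sigma)
        eps = rng.normal(size=n_mc)
        g = np.array([grad_log_joint(mu + sigma * e) for e in eps])
        mu += lr * g.mean()
        log_sigma += lr * ((g * eps).mean() * sigma + 1.0)  # +1 from d(entropy)/d(log sigma)

    print("q(theta) ~= N(%.3f, %.3f^2)" % (mu, np.exp(log_sigma)))
    # Exact posterior for comparison: N(sum(y) / (n + 1), 1 / (n + 1)).
    print("exact:     N(%.3f, %.3f^2)" % (y.sum() / (len(y) + 1), (1 / (len(y) + 1)) ** 0.5))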

Using natural gradients leads to faster and simpler algorithms than standard gradient methods.
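
For reference (a standard fact, not specific to this talk), the natural gradient preconditions the ordinary gradient by the inverse Fisher information of \(q\); for \(q(\theta) = N(\theta | \mu, \sigma^2)\) parameterised by \((\mu, \sigma^2)\),

\begin{equation} \tilde{\nabla} L = F^{-1} \nabla L, \qquad F(\mu, \sigma^2) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 1/(2\sigma^4) \end{pmatrix} \end{equation}

so steps in \(\mu\) are naturally rescaled by \(\sigma^2\).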

Data-efficient Probabilistic Machine Learning

Author: Bryan Low

Gaussian Process (GP) Models for Big Data.

Gaussian Process

  • A rich class of Bayesian, non-parametric models
  • A GP is a collection of random variables, any finite subset of which follows a multivariate Gaussian distribution (see the regression sketch below)
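
A minimal GP regression sketch using the standard posterior equations; the RBF kernel, synthetic data, and noise level are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)

    def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
        # Squared-exponential kernel between two sets of 1-D inputs.
        return variance * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

    # Illustrative training data: noisy observations of sin(x).
    x_train = rng.uniform(-3, 3, 20)
    y_train = np.sin(x_train) + 0.1 * rng.normal(size=20)
    x_test = np.linspace(-3, 3, 100)

    noise_var = 0.1 ** 2
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    K_ss = rbf_kernel(x_test, x_test)

    # Standard GP posterior: mean = K_s^T K^{-1} y, cov = K_ss - K_s^T K^{-1} K_s.
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    print("posterior mean at first test points:", mean[:3])
    print("posterior std  at first test points:", std[:3])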

Task Setting

  • Agent explores unknown environment modelled by GP
  • Every location has a reward

Lipschitz Continuous Reward Functions

\begin{equation} R(z_t, s_t) \overset{\Delta}{=} R_1(z_t) + R_2(z_t) + R_3(s_t) \end{equation}

  • \(R_1\): Lipschitz continuous (depends on the current measurement)
  • \(R_2\): Lipschitz continuous after convolution with a Gaussian kernel (depends on the current measurement)
  • \(R_3\): depends only on the location history, independent of the current measurement (a hypothetical sketch of this decomposition follows)
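
Below is a purely hypothetical illustration of such a decomposition; only the additive structure \(R = R_1 + R_2 + R_3\) is taken from the talk, and every concrete function is an invented stand-in.

    import numpy as np

    def r1(z):
        # Hypothetical R_1: Lipschitz continuous in the current measurement z.
        return -abs(z - 1.0)

    def r2(z):
        # Hypothetical R_2: a step function, not Lipschitz itself, but its
        # convolution with a Gaussian kernel (an erf-shaped curve) is Lipschitz.
        return float(np.sign(z))

    def r3(location, visited):
        # Hypothetical R_3: depends only on the location history, not on z.
        return -1.0 if location in visited else 0.0

    def reward(z, location, visited):
        return r1(z) + r2(z) + r3(location, visited)

    print(reward(z=0.3, location=(2, 1), visited={(0, 0), (1, 1)}))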