# Riken AIP Workshop 2019

## Weakly Supervised Classification

### Motivation

- Machine learning from big data is already successful
- In some cases, massive labelled data is not available
- Classification from limited information

### Supervised Classification

More labelled samples yield better classification performance. The optimal convergence rate is \(O(n^{-\frac{1}{2}})\), where \(n\) is the number of labelled samples.

### Unsupervised Classification

Since collecting labelled samples is costly, we can instead learn a classifier from unlabelled data alone. This is equivalent to clustering.

### Semi-supervised Classification

- Use a large number of unlabelled samples and a small number of labelled samples.
- Find a decision boundary along cluster structure induced by unlabelled samples.

### Positive Unlabelled Classification

Given positive and unlabelled samples:

\begin{equation} \{x_i^P\}_{i=1}^{n_P} \sim p(x | y = +1) \end{equation}

\begin{equation} \{x_i^U\}_{i=1}^{n_U} \sim p(x) \end{equation}

Risk of classifier can be decomposed into two terms:

- Risk for positive data
- Risk for negative data

Since there is no negative data in the PU setting, the risk for negative data cannot be estimated directly.

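With class prior \(\pi = p(y=+1)\) and loss \(l\), the risk is:

\begin{equation} R(f) = \pi E_{p(x|y=+1)} \left[ l(f(x)) \right] + (1-\pi) E_{p(x|y=-1)}\left[ l(-f(x)) \right] \end{equation}

The U-density is a mixture of positive and negative densities:

\begin{equation} p(x) = \pi p(x|y=+1) + (1-\pi) p(x|y=-1) \end{equation}

Through this, the risk for negative data can be rewritten using positive and unlabelled data only, which gives an unbiased risk estimator.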
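
The mixture identity implies \((1-\pi) E_{p(x|y=-1)}[l(-f(x))] = E_{p(x)}[l(-f(x))] - \pi E_{p(x|y=+1)}[l(-f(x))]\), so the negative term can be estimated without negative samples. A minimal sketch of the resulting estimator follows. The function names, the sigmoid loss, and the toy data are illustrative assumptions, and the class prior \(\pi\) is assumed to be known.

```python
import math


def sigmoid_loss(margin):
    """Sigmoid loss l(m) = 1 / (1 + exp(m)); an assumed choice of loss."""
    return 1.0 / (1.0 + math.exp(margin))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pn_risk(f, x_pos, x_neg, prior, loss=sigmoid_loss):
    """Ordinary supervised (PN) risk estimate, shown for comparison."""
    pos_term = mean(loss(f(x)) for x in x_pos)
    neg_term = mean(loss(-f(x)) for x in x_neg)
    return prior * pos_term + (1.0 - prior) * neg_term


def pu_risk(f, x_pos, x_unl, prior, loss=sigmoid_loss):
    """Unbiased PU risk estimate; uses no negative samples.

    The negative term (1 - pi) E_N[l(-f)] is replaced by
    E_U[l(-f)] - pi E_P[l(-f)].
    """
    pos_as_pos = mean(loss(f(x)) for x in x_pos)
    pos_as_neg = mean(loss(-f(x)) for x in x_pos)
    unl_as_neg = mean(loss(-f(x)) for x in x_unl)
    return prior * pos_as_pos + unl_as_neg - prior * pos_as_neg


# Toy check: the unlabelled set is an exact 50/50 mixture of P and N samples.
x_pos = [1.0, 2.0]
x_neg = [-1.0, -3.0]
x_unl = x_pos + x_neg
classifier = lambda x: x  # hypothetical linear classifier f(x) = x

print(pu_risk(classifier, x_pos, x_unl, prior=0.5))
print(pn_risk(classifier, x_pos, x_neg, prior=0.5))
```

In this toy case the unlabelled sample is an exact mixture, so the two printed values agree up to floating-point rounding. With real data they agree only in expectation.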

By deriving estimation error bounds, we can show that PU learning can outperform PN learning, provided a large amount of PU data is available.

### PNU Classification

- Train PU, PN, and NU classifiers and combine them.
- Unlabelled data always helps, without cluster assumptions.
- Use unlabelled data for loss evaluation (reducing the bias), not for regularisation.

### Pconf Classification

Only positive data is available:

- Data from rival companies cannot be obtained.
- Only successful examples are available.

If each positive sample comes with a confidence value, i.e. the probability \(r(x) = p(y=+1|x)\) that it is positive, we can still train a classifier.
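
The notes do not spell out the training objective. One way to see why confidences suffice is the importance-weighted form of the risk, \(R(f) = \pi E_{p(x|y=+1)}\left[ l(f(x)) + \frac{1 - r(x)}{r(x)} l(-f(x)) \right]\) with \(r(x) = p(y=+1|x)\), which follows from \(p(x) = \pi p(x|y=+1) / r(x)\). The sketch below evaluates its empirical version up to the constant factor \(\pi\). The function names and the sigmoid loss are illustrative assumptions.

```python
import math


def sigmoid_loss(margin):
    """Sigmoid loss l(m) = 1 / (1 + exp(m)); an assumed choice of loss."""
    return 1.0 / (1.0 + math.exp(margin))


def pconf_objective(f, x_pos, confidences, loss=sigmoid_loss):
    """Empirical positive-confidence objective (risk divided by pi).

    Each positive sample x with confidence r = p(y=+1 | x) contributes
    l(f(x)) + (1 - r) / r * l(-f(x)).
    """
    if len(x_pos) != len(confidences):
        raise ValueError("need one confidence per positive sample")
    total = 0.0
    for x, r in zip(x_pos, confidences):
        if not 0.0 < r <= 1.0:
            raise ValueError("confidence must lie in (0, 1]")
        total += loss(f(x)) + (1.0 - r) / r * loss(-f(x))
    return total / len(x_pos)


# Hypothetical data: three positive samples with their confidences.
x_pos = [0.5, 1.5, 3.0]
confidences = [0.6, 0.8, 0.99]
classifier = lambda x: x - 0.2  # hypothetical classifier

print(pconf_objective(classifier, x_pos, confidences))
```

Samples with confidence close to 1 act almost purely as positives, while low-confidence samples contribute mostly through the negative-loss term.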

Other settings: similar-unlabelled classification, etc.

## Fast Computation of Uncertainty in Deep Learning

- Author: Emtiyaz Khan
- Links: https://emtiyaz.github.io/

Uncertainty quantifies the confidence in the prediction of a model, i.e., how much it does not know.

### Uncertainty in Deep Learning

\begin{equation} p(D|\theta) = \prod_{i=1}^{N} p(y_i | f_\theta (x_i)) \end{equation}

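This is the likelihood of the data \(D\) given the parameters \(\theta\): each output \(y_i\) is modelled given the neural network output \(f_\theta(x_i)\) for input \(x_i\).

- Place a prior distribution on the parameters: \(\theta \sim p(\theta)\)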

### Approximating Inference with Gradients

\begin{equation} p(\theta | D) \approx q(\theta) = N(\theta | \mu, \sigma^2) \end{equation}

Find the \(\mu\) and \(\sigma^2\) such that \(q\) is close to the posterior distribution.

\begin{equation} \max_{\mu, \sigma^2} L(\mu, \sigma^2) = E_q\left[ \log \frac{p(\theta)}{q(\theta)} \right] + \sum_{i=1}^N E_q \left[ \log p(D_i|\theta) \right] \end{equation}
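
The objective can be estimated by sampling \(\theta = \mu + \sigma \epsilon\) with \(\epsilon \sim N(0, 1)\) and averaging the integrand. The sketch below does this for a toy scalar model that is an assumption for illustration, not the neural network from the talk: prior \(\theta \sim N(0, 1)\) and likelihood \(y_i \sim N(\theta, 1)\). For this model the objective also has a closed form, which serves as a check.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)


def log_normal(x, mean, var):
    """Log density of N(x | mean, var)."""
    return -0.5 * (LOG_2PI + math.log(var) + (x - mean) ** 2 / var)


def elbo_mc(mu, sigma, data, n_samples=20000, seed=0):
    """Monte Carlo estimate of L(mu, sigma^2) for the toy model."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        theta = mu + sigma * rng.gauss(0.0, 1.0)  # sample from q
        log_prior = log_normal(theta, 0.0, 1.0)
        log_q = log_normal(theta, mu, sigma ** 2)
        log_lik = sum(log_normal(y, theta, 1.0) for y in data)
        total += log_prior - log_q + log_lik
    return total / n_samples


def elbo_exact(mu, sigma, data):
    """Closed-form L(mu, sigma^2), available only for this toy model."""
    var = sigma ** 2
    prior_term = -0.5 * (LOG_2PI + mu ** 2 + var)         # E_q[log p(theta)]
    entropy = 0.5 * (LOG_2PI + 1.0 + math.log(var))       # -E_q[log q(theta)]
    lik_term = sum(-0.5 * (LOG_2PI + (y - mu) ** 2 + var) for y in data)
    return prior_term + entropy + lik_term


data = [0.8, 1.2, 1.0]  # hypothetical observations
print(elbo_mc(0.0, 1.0, data), elbo_exact(0.0, 1.0, data))
print(elbo_exact(0.75, 0.5, data))  # q equal to the exact posterior
```

For this conjugate model the exact posterior is \(N(0.75, 0.25)\), so the last line evaluates the objective at its maximiser. Any other choice of \(\mu\) and \(\sigma^2\) gives a smaller value.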

Using natural gradients leads to a faster and simpler algorithm than standard gradient methods.
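
To give a feel for why natural gradients can be simpler, the sketch below uses a toy conjugate model that is an assumption for illustration, not the deep-learning algorithm from the talk: prior \(\theta \sim N(0, 1)\), likelihood \(y_i \sim N(\theta, 1)\), and a Gaussian \(q\). In the natural parameters \((\mu / \sigma^2, -1 / (2\sigma^2))\) of \(q\), the natural gradient of the objective for such a model is the difference between the posterior's natural parameters and the current ones, so each step is a simple interpolation. All function names are hypothetical.

```python
def to_natural(mean, var):
    """Natural parameters (mean / var, -1 / (2 var)) of a Gaussian."""
    return (mean / var, -0.5 / var)


def to_mean_var(natural):
    """Inverse of to_natural."""
    eta1, eta2 = natural
    var = -0.5 / eta2
    return (eta1 * var, var)


def natural_gradient_step(natural, data, step_size):
    """One natural-gradient ascent step for the toy conjugate model.

    The prior N(0, 1) contributes (0, -0.5) and each likelihood term
    N(y_i | theta, 1) contributes (y_i, -0.5) in natural parameters.
    """
    target = (sum(data), -0.5 * (len(data) + 1))
    return tuple(cur + step_size * (tgt - cur)
                 for cur, tgt in zip(natural, target))


data = [0.8, 1.2, 1.0]        # hypothetical observations
q = to_natural(0.0, 1.0)      # initial q = N(0, 1)
q = natural_gradient_step(q, data, step_size=1.0)
print(to_mean_var(q))
```

With step size 1 a single step lands on the exact posterior \(N(0.75, 0.25)\) of this toy model. For non-conjugate models such as neural networks the update is no longer exact and needs approximations that this sketch does not cover.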

## Data-efficient Probabilistic Machine Learning

Bryan Low

Gaussian Process (GP) Models for Big Data.

### Gaussian Process

- A rich class of Bayesian, non-parametric models
- A GP is a collection of random variables, any finite subset of which has a multivariate Gaussian distribution
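
The finite-subset view of a GP can be sketched directly: pick a finite set of inputs, build their covariance matrix from a kernel, and draw one joint Gaussian sample. The zero mean, the squared-exponential kernel, its hyperparameters, and the function names are illustrative assumptions. Only the standard library is used, so the Cholesky factorisation is written out by hand.

```python
import math
import random


def rbf_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return signal_var * math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def gram_matrix(xs, jitter=1e-6):
    """Covariance matrix of the GP at inputs xs (jitter aids stability)."""
    return [[rbf_kernel(a, b) + (jitter if i == j else 0.0)
             for j, b in enumerate(xs)]
            for i, a in enumerate(xs)]


def cholesky(matrix):
    """Lower-triangular factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def sample_gp_prior(xs, seed=0):
    """One joint draw of the zero-mean GP at the finite set of inputs xs."""
    rng = random.Random(seed)
    lower = cholesky(gram_matrix(xs))
    noise = [rng.gauss(0.0, 1.0) for _ in xs]
    # f = L @ noise has covariance L L^T, i.e. the Gram matrix.
    return [sum(lower[i][k] * noise[k] for k in range(i + 1))
            for i in range(len(xs))]


inputs = [0.0, 0.5, 1.0, 1.5, 2.0]  # hypothetical input locations
print(sample_gp_prior(inputs))
```

Nearby inputs receive strongly correlated values because the kernel assigns them high covariance, which is what makes the sampled function look smooth.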

### Task Setting

- Agent explores unknown environment modelled by GP
- Every location has a reward

### Lipschitz Continuous Reward Functions

\begin{equation} R(z_t, s_t) \overset{\Delta}{=} R_1(z_t) + R_2(z_t) + R_3(s_t) \end{equation}

- \(R_1\): Lipschitz continuous (current measurement)
- \(R_2\): Lipschitz continuous after convolution with a Gaussian kernel (current measurement)
- \(R_3\): location history, independent of the current measurement