A class of machine learning algorithm that uses a cascade of multiple layeprs of non-linear processing units for feature extraction and transformation


Sensory perception is plagued with large amounts of nuisance variation. Nuisance variables are task-dependent and can be implicit. How do we deal with nuisance variation in the input?

The holy grail of machine learning is to learn a disentangled representation that factors out variation in the sensory input into meaningful intrinsic degrees of freedom.

sensitive to task-relevant (target) features
robust to task-irrelevant (nuisance) features
useful for many different tasks

Deep learning solves this central problem in representation learning by introducing representations that are expressed in terms of other, simpler representations. Deep learning enables the computer to build complex concepts out of simpler concepts.

The key inspiration from neuroscience is to build up feature selectivity and tolerance over multiple layers in a hierarchy.

Model/task categorisation

supervised learning
input -> otuput, learning the pattern from labeled training data
unsupervised learning
learning to differentiate/cluster unlabeled data
reinforcement learning
learning from trial-and-error through rewards

Bayesian probability The quantification of the expression of uncertainty. We can make precise revisions of uncertainty in light of new evidence.

In the example of polynomial curve fitting, it seems reasonable to apply the frequentist notion of probability to the random values of the observed variables. However, we can use probability theory to describe the uncertainty in model parameters, and the choice of model itself.

We can evaluate the uncertainty in model parameters w after we have observed some data D. \(p(D|w)\) is called the likelihood function.

The posterior probability is proportional to likelihood x prior. Model parameters w are chosen to maximise the likelihood function \(p(D|w)\). Because the negative log-likelihood is a monotonically decreasing function, minimising it is equivalent to maximising likelihood.

Universal Approximation Theorem A feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of the real space.

Measure Theory

Kullback-Leiber (KL) divergence

The KL divergence is 0 iff P and Q are the same distribution in discrete variables, and equal almost everywhere in continuous variables. Because the KL divergence is non-negative and measures the difference between the two distributions, it is soften conceptualised as some sort of distance, but it is not a true measure, because it is not symmetric.

Cross Entropy

Minimising the cross entropy with respect to Q is equivalent to minimising the KL divergence.

Object Detection

Detection Without Proposals: YOLO / SSD

Within each grid cell:

Output: \(7 * 7 *(5*B + C)\)

Learning Rates

Learning Rate Annealing

Decay learning rate after several iterations…

Also: SGDR with cyclic learning rate, restores high learning rate after several iterations to try to find local minima that is largely flat, and doesn’t change so much in any direction. (Generalises better)

Reinforcement Learning

How Companies use Deep Learning


Visenze’s primary product is their Visual Search (reverse image search).

Initially, they started out with a similar model to our approach: Train a CNN, read encodings before the FC layer, use it to perform NN search.

Now their pipeline is as follows: Query time -> object detection -> Extract Features -> Nearest Neighbours -> Ranked Results Offline training -> Detection Model -> Embedding models -> Nearest Neighbours Index time -> Objects -> Extract Features -> Search Index (Compression/Hash Model)

Object detection followed the trends in research papers:

  1. R-CNN
  2. Fast-CNN
  3. Faster-CNN
  5. DSOD (Current)

Model performance is based of standard IR metrics: They are using DCG score for evaluating their reverse image search. This requires a lot of manual annotation.

Models are trained offline using physical purchased GPUs.

Deployment: Kubernetes on AWS, with generally small CPU servers. Overall latency is less than 200ms.

Things they focused on as a company

  1. Tooling:
    1. Annotation System
      1. Complete annotation is crucial to detection training
    2. Training System
      1. Make it easy to change hyperparameters and retrain models
      2. Abstract away need for knowing deep learning
      3. Platform for tracking metrics
    3. Querylog pipeline
      1. Take user input as training data
    4. Evaluation System
      1. Both automatic evaluation via metrics and manual (AB testing) is done before release
      2. Visualisations via T-SNE to see if clusters remain the same/make sense, when new learnings are added: learning without forgetting
  2. Business:
    1. Attend to customer requirements: e.g. if a company wants to sell hats, model needs to be trained to detect hats, and these learnings need to be added to the existing model without affecting data earlier

Visual Embeddings Used

At Visenze, they use multiple embeddings in different feature spaces to measure similarity. The results are then combined before returned. The 4 main embeddings are:

  1. Exact matches (same item)
    1. Trained with siamese network with batched triplet loss (typically used in face recognition, but seems to work well with product classification)
  2. Same Category
    1. Domain specific labels have been most helpful in achieving state-of-the-art accuracy
      1. e.g for fashion, sleeve length, jeans length etc.
  3. Similar Categories

Lessons Learnt

  1. Taxonomy Coverage
  2. Training Data Coverage
    1. Obtaining training data from one source only can lead to severe bias (e.g. detecting watermarks and using it as feature)
  3. Overfitting
  4. Continuous Improvement (Learning Without Forgetting)
  5. Bad-case driven development
    1. Be quick to identify hard negatives, and add in similar negative samples into training data to improve accuracy
  6. Image quality, rotation
  7. Re-ranking based on customer

Key Open Questions about Deep Learning Systems

  1. How and why do they work? Can we derive their structure from first principles? Can we compress/explain the myriad empirical observations/best practices about deep nets?
  2. Can we shed new light on the hidden representations of objects? Can we generate new theories and testable predictions for both artificial/real neuroscience?
  3. Why do they fail? How to improve them? How to alleviate their intrinsic limitations?
  4. Can we guide the search for better architectures/algorithms/performance in applied DL?

Concrete Theoretical Questions

  1. What are the implicit modelling assumptions?
  2. What is the inference task and algorithm?
  3. What is the learning algorithm?
  4. Can we generate new testable predictions for artificial/real nets?
  5. What modeling assumptions are being violated in failures? How can we improve the models, tasks and algorithms?

ConvNets from First Principles

There are many architectures, but just a few key operations and objectives:

We focus on the properties conserved across all species of Convnets.

Finding a generative model

We reverse engineer a Convnet by building a generative model

  1. Define a generative model that captures nuisance variation
  2. Recast feed-forward propogation in a DCN as MAP inference of the full latent configuration -> generative classifier
  3. Apply a discriminative relaxation -> discriminative classifier
  4. Learn the parameters via Batch Hard EG Algorithm -> SGD-Backprop Training of a DCN
  5. Use new generative model to address limitations of DCNs: top-down inference, learning from unlabeled data, hyperparameter optimization

Variational Inference and Deep Learning

Optimizers work well with sums, not products. We want to be able to calculate gradients on a subset of examples and have reasonable confidence for convergence.

Latent variables give our model structure and can make training more tractable. With latent variables \(\bar{z}\), we have:

\begin{equation} p(x) = \sum_z p(x,z) = \sum_z p(x | z)p(z) \end{equation}

Putting log-likelihood and latent variables together, we get:

\begin{equation} L = \sum_i \log \left( \sum_z p(x_i, z)\right) \end{equation}

Usually, we want to sample one example and one value for the latent variable at a time, for tractable optimization. This only works for a sum.

We compare the above structure with this equation:

\begin{equation} L = \sum_i \left( \sum_z \log p(x_i, z)\right) \end{equation}

Intuitively the new equation says that every latent variable value needs to do a good job of explaining the data

Derivation of the Variational Bound

\begin{equation} p(x) = \sum_i \log (\sum_z p(x_i, z)) \end{equation}

\begin{equation} p(x) = \sum_i \log(\sum_z \frac{q(z|x_i) p(x_i, z)}{q(z|x_i)}) \end{equation}

By Jensen’s Inequality (concavity of the log function),

\begin{equation} p(x) \ge \sum_i \sum_z q(z|x_i) \log \frac{p(x_i, z)}{q(z|x_i)} \end{equation}

\begin{equation} p(x) \ge \sum_i q(z|x_i) \sum_z \log (p(x_i | z)) + \log \frac{p(z)}{q(z|x_i)} \end{equation}

\begin{equation} p(x) \ge \mathbb{E}_{z \sim q(z|x_i)} \log (p(x_i | z)) - KL(q(z|x_i) || p(z)) \end{equation}

The first term is the conditional likelihood of our observation if we used a sampled z value from the approximated posterior \(q(z | x)\). The second term is a divergence between the approximate posterior and the prior. This is often well defined.

Variational Autoencoder

We can show that a special type of autoencoder corresponds to this lower bound, where \(q(z | x)\) is the encoder and \(p(x|z)\) is the decoder. The first term is the usual reconstruction objective for autoencoders. The second term is a special KL term which pulls the bottleneck closer to a prior.

In the simplest case, we make both \(q(z|x)\) and \(p(z)\) independent Gaussians:

\begin{equation} KL(q(z|x_i) || p(z)) = \sum_j \mu_j^2 + \sigma_j^2 - \log(\sigma_j^2) \end{equation}

\(q(z|x)\) has parameters \(\mu\) and \(\sigma\) which are just outputs from our encoder. \(p(z)\) is a prior distribution with \(\mu = 0\) and \(\sigma = 1\).

The bottleneck can’t remember spatial information, and optimizing for \(p(x|z)\) puts heavy emphasis on exact recovery of spatial information. Solutions involve removing the bottleneck, or replacing the prior \(p(z)\) with one that has more structure.