How to Represent Words

How do we represent words as input to any of our models? There are an estimated 13 million tokens in the English language, and many of them have similar meanings.

The simplest representation is a one-hot vector: we represent every word as an \(\mathbb{R}^{|V| \times 1}\) vector with all 0s and a single 1 at the index of that word in the sorted English vocabulary.
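A minimal sketch of this representation over a toy vocabulary (the vocabulary and words below are illustrative, not from the text):

```python
def one_hot(word, vocab):
    """Return a |V|-dimensional one-hot list for `word`."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

# Toy sorted vocabulary, as in the text's "sorted" ordering.
vocab = sorted(["king", "man", "queen", "woman"])
v_king = one_hot("king", vocab)
```

Note that the dot product of any two distinct one-hot vectors is 0, so this representation encodes no notion of word similarity.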

To reduce the size of this space, we can use SVD-based methods. For instance, we can accumulate co-occurrence counts into a matrix \(X\) and perform SVD on \(X\) to get a \(USV^T\) decomposition. We can then use the rows of \(U\) as the word embeddings for all words in the dictionary.
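A sketch of this step, assuming a co-occurrence matrix \(X\) has already been accumulated (the matrix values and embedding dimension \(k\) below are illustrative):

```python
import numpy as np

# Toy |V| x |V| co-occurrence matrix for a 3-word vocabulary.
X = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])

U, S, Vt = np.linalg.svd(X)

# Keep the first k columns of U: one k-dimensional embedding per word.
k = 2
embeddings = U[:, :k]
```

In practice one keeps only the top-\(k\) singular directions, which captures most of the variance in \(X\) at a fraction of the dimensionality.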

Window-based Co-occurrence Matrix

Count the number of times each word appears inside a window of a particular size around the word of interest. We calculate this count for all words in the corpus.
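A minimal sketch of this counting procedure over a toy one-sentence corpus (the corpus and window size are illustrative):

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count co-occurrences within a symmetric window around each word."""
    counts = defaultdict(int)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

tokens = "i like deep learning".split()
counts = cooccurrence_counts(tokens, window=1)
```

Running this for every sentence in the corpus and summing the counts yields the matrix \(X\) used by the SVD-based methods above.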

Issues with SVD-based methods

  1. Dimensions of the matrix can change very often (new words added to corpus)
  2. Matrix is extremely sparse as most words don’t co-occur
  3. Matrix is very high-dimensional
  4. Quadratic cost to perform SVD

Some solutions include:

  1. Ignore function words (blacklist)
  2. Apply a ramp window (distance between words)
  3. Use Pearson correlation and set negative counts to 0 instead of raw count.

Iteration-based Methods (e.g. word2vec)

Design a model whose parameters are the word vectors, and train the model on a certain objective. Discard the model, and use the vectors learnt.

Hierarchical softmax uses an efficient tree structure to compute probabilities over the entire vocabulary.

Language Models

The model will need to assign a probability to a sequence of tokens. We want a high probability for a sentence that makes sense, and a low probability for a sentence that doesn’t.

Unigram models completely ignore this sequential context, so a nonsensical sentence may actually be assigned a high probability. Bigram models, which condition each word's probability on the previous word, are a still-naive way of evaluating a whole sentence.
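A minimal sketch of a bigram language model, estimating \(P(w_t \mid w_{t-1})\) from counts over a toy corpus (the corpus and sentences are illustrative):

```python
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]

unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))

def bigram_prob(sentence):
    """P(sentence) approximated as the product of bigram probabilities."""
    p = 1.0
    for prev, cur in zip(sentence, sentence[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p
```

Here a sentence seen in training gets a reasonable probability, while a scrambled one containing an unseen bigram gets probability 0; in practice smoothing is used to avoid such zeros.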

Training Methods

In practice, negative sampling works better for frequently occurring words and lower-dimensional vectors, and hierarchical softmax works better for infrequent words.

Global Vectors for Word Representation (GloVe)

Count-based methods of generating word embeddings rely on global statistical information, but do poorly on tasks such as word analogy, indicating a sub-optimal vector space structure.

word2vec presents a window-based method of generating word-embeddings by making predictions in context-based windows, demonstrating the capacity to capture complex linguistic patterns beyond word similarity.

GloVe consists of a weighted least-squares model that combines the benefits of the word2vec skip-gram model when it comes to word analogy tasks, but also trains on global word-word co-occurrence counts, and produces a word vector space with meaningful sub-structure.

The appropriate starting point for word-vector learning should be with ratios of co-occurrence probabilities rather than the probabilities themselves. Since vector spaces are inherently linear structures, the most natural way to encode the information present in a ratio in the word vector space is with vector differences.

The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words’ probability of co-occurrence. Since the logarithm of a ratio equals the difference of logarithms, this objective associates (the logarithm of) ratios of co-occurrence probabilities with vector differences in the word vector space.
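Concretely, this objective is realized in the GloVe paper as a weighted least-squares loss:

\begin{equation*} J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 \end{equation*}

where \(w_i\) and \(\tilde{w}_j\) are the word and context vectors, \(b_i\) and \(\tilde{b}_j\) are bias terms, and \(f\) is a weighting function that caps the influence of very frequent co-occurrences.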

Co-occurrence Matrix

Let \(X\) denote the word-word co-occurrence matrix, where \(X_{ij}\) indicates the number of times word \(j\) occurs in the context of word \(i\). Let \(X_i = \sum_k X_{ik}\) be the number of times any word \(k\) appears in the context of word \(i\). Let \(P_{ij} = P(w_j|w_i) = \frac{X_{ij}}{X_i}\) be the probability of word \(j\) appearing in the context of word \(i\).
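These definitions can be sketched directly in code; the matrix values below are illustrative:

```python
import numpy as np

# Toy word-word co-occurrence matrix X (X_ij = count of j in i's context).
X = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])

X_i = X.sum(axis=1, keepdims=True)  # X_i = sum_k X_ik (row totals)
P = X / X_i                         # P[i, j] = P(w_j | w_i) = X_ij / X_i
```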

Topic Modeling

Topic summaries are NOT perfect. UTOPIAN allows user interactions for improving them.

Topic Lens UTOPIAN

Dimension Reduction

Multidimensional Scaling

Word Vector Evaluation



Visualizing Word Embeddings

Neural Networks

Most data are not linearly separable, hence the need for non-linear classifiers. Neural networks are a family of classifiers with non-linear decision boundaries.

A neuron

A neuron is a generic computational unit that takes \(n\) inputs and produces a single output. The sigmoid unit takes an \(n\)-dimensional vector \(x\) and produces a scalar activation (output) \(a\). This neuron is also associated with an \(n\)-dimensional weight vector, \(w\), and a bias scalar, \(b\). The output of this neuron is then:

\begin{equation*} a = \frac{1}{1 + exp(-(w^Tx + b))} \end{equation*}

This is also frequently formulated as:

\begin{equation*} a = \frac{1}{1 + exp(-([w^T\text{ }b] \cdot [x\text{ }1]))} \end{equation*}
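A minimal sketch of a single sigmoid neuron computing \(a = \sigma(w^T x + b)\) (the weights, bias, and input values below are illustrative):

```python
import math

def neuron(w, x, b):
    """Sigmoid neuron: a = 1 / (1 + exp(-(w^T x + b)))."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# With these toy values, w^T x + b = 0, so the activation is exactly 0.5.
a = neuron(w=[1.0, -2.0], x=[0.5, 0.25], b=0.0)
```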

A single layer of neurons

This idea can be extended to multiple neurons by considering the case where the input \(x\) is fed as input to multiple such neurons.

\begin{align*} \sigma\left( z \right) = \begin{bmatrix} \frac{1}{1 + exp(-z_1)} \\
\vdots \\
\frac{1}{1 + exp(-z_m)} \end{bmatrix} \end{align*}

\begin{align*} b = \begin{bmatrix} b_1 \\
\vdots \\
b_m \end{bmatrix} \in \mathbb{R}^m \end{align*}

\begin{align*} W = \begin{bmatrix} - && w^{(1)T} && - \\
&& \dots && \\
- && w^{(m)T} && - \end{bmatrix} \in \mathbb{R}^{m\times n} \end{align*}

\begin{equation*} z = Wx + b \end{equation*}

The activations of the sigmoid function can be written as:

\begin{align*} \begin{bmatrix} a^{(1)} \\
\vdots \\
a^{(m)} \end{bmatrix} = \sigma\left( Wx + b \right) \end{align*}

Activations indicate the presence of some weighted combination of features. We can use these combinations to perform classification tasks.

Feed-forward computation

Non-linear decisions cannot be made by feeding inputs directly into a softmax function. Instead, we use another matrix \(U \in \mathbb{R}^{m\times 1}\) to generate an unnormalized score for a classification task from the activations.

\begin{equation*} s = U^Ta = U^Tf\left( Wx + b \right) \end{equation*}

where \(f\) is the activation function.
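A sketch of this feed-forward computation with \(f\) taken to be the sigmoid; the shapes follow the text (\(W \in \mathbb{R}^{m \times n}\), \(b \in \mathbb{R}^m\), \(U \in \mathbb{R}^{m \times 1}\)), while the dimensions and random values are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, m = 3, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(m, n))  # layer weights
b = rng.normal(size=m)       # layer biases
U = rng.normal(size=m)       # scoring vector
x = rng.normal(size=n)       # input

a = sigmoid(W @ x + b)  # layer activations, shape (m,)
s = U @ a               # unnormalized scalar score s = U^T f(Wx + b)
```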

Maximum Margin Objective Function

Neural networks also need an optimisation objective. The maximum margin objective ensures that the score computed for “true” labeled data points is higher than the score computed for “false” labeled data points, i.e. \(\text{minimize } J = \max(s_c - s, 0)\), where \(s_c\) is the score of the corrupted (false) window and \(s\) the score of the true window. This optimization can be achieved through backpropagation and gradient descent.
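The objective \(J = \max(s_c - s, 0)\) can be sketched directly; the scores below are illustrative:

```python
def max_margin(s, s_c):
    """J = max(s_c - s, 0): zero loss when the true window outscores the corrupt one."""
    return max(s_c - s, 0.0)

J_good = max_margin(s=3.0, s_c=1.0)  # true window already scores higher -> no loss
J_bad = max_margin(s=1.0, s_c=3.0)   # corrupt window scores higher -> positive loss
```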

Gradient Checks

One can numerically approximate these gradients, allowing us to precisely estimate the derivative with respect to any parameter, serving as a sanity check on the correctness of our analytic derivatives.

\begin{equation*} f'(\theta) \approx \frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2\epsilon} \end{equation*}

where \(\theta^{(i+)}\) and \(\theta^{(i-)}\) denote \(\theta\) with its \(i\)-th component increased or decreased by \(\epsilon\), respectively.
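A minimal sketch of this centered-difference check, using the toy function \(J(\theta) = \theta^2\) whose analytic derivative is \(2\theta\) (the function and epsilon are illustrative):

```python
def numerical_grad(J, theta, eps=1e-5):
    """Centered-difference approximation of dJ/dtheta at a scalar theta."""
    return (J(theta + eps) - J(theta - eps)) / (2 * eps)

J = lambda t: t ** 2
approx = numerical_grad(J, 3.0)  # numerical estimate
exact = 2 * 3.0                  # analytic derivative 2*theta
```

For a parameter vector, the same check is run once per component, perturbing only that component each time.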

Dependency Parsing

Dependency Grammar and Dependency Structure

Parse trees in NLP are used to analyse the syntactic structure of sentences.

Constituency Grammar uses phrase structure grammar to organize words into nested constituents.

Dependency structure of sentences shows which words depend on (modify or are arguments of) other words. These binary asymmetric relations between words are called dependencies and are depicted as arrows going from the head to the dependent. Usually these dependencies form a tree structure.

Dependency Parsing

Dependency parsing is the task of analysing the syntactic dependency structure of a given input sentence S. The output of a dependency parser is a dependency tree where the words of the input sentence are connected by typed dependency relations.

Formally, the dependency parsing problem asks to create a mapping from the input sentence with words \(S = w_0w_1\dots w_n\) (where \(w_0\) is the ROOT) to its dependency graph \(G\).

The two subproblems are:

  1. Learning: given a training set \(D\) of sentences annotated with dependency graphs, induce a parsing model \(M\) that can be used to parse new sentences.
  2. Parsing: given a parsing model \(M\) and a sentence \(S\), derive the optimal dependency graph \(D\) for \(S\) according to \(M\).

Transition-based Parsing

This relies on a state machine which defines the possible transitions to create the mapping from the input sentence to the dependency tree. The learning problem involves predicting the next transition in the state machine. The parsing problem constructs the optimal sequence of transitions for the input sentence, given the previously induced model.

Greedy Deterministic Transition-based Parsing

The transition system is a state machine, which consists of states and transitions between those states. The model induces a sequence of transitions from some initial state to one of several terminal states. For any sentence \(S = w_0w_1\dots w_n\) a state can be described with a triple \(c = \left(\sigma, \beta, A\right)\):

  1. a stack \(\sigma\) of words \(w_i\) from S,
  2. a buffer \(\beta\) of words \(w_i\) from S,
  3. a set of dependency arcs \(A\) of the form \(\left(w_i, r, w_j\right)\)

For each sentence, there is an initial state, and a terminal state.

There are three types of transitions:

SHIFT: remove the first word in the buffer and push it on top of the stack.
LEFT-ARC: add a dependency arc \((w_j, r, w_i)\) to the arc set \(A\), where \(w_i\) is the word second to the top of the stack and \(w_j\) is the word at the top of the stack; remove \(w_i\) from the stack.
RIGHT-ARC: add a dependency arc \((w_i, r, w_j)\) to the arc set \(A\), where \(w_i\) is the word second to the top of the stack and \(w_j\) is the word at the top of the stack; remove \(w_j\) from the stack.
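These three transitions can be sketched as operations on the \((\sigma, \beta, A)\) triple; the sentence, relation labels, and transition sequence below are illustrative:

```python
def shift(stack, buffer, arcs):
    """Move the first buffer word onto the top of the stack."""
    stack.append(buffer.pop(0))

def left_arc(stack, buffer, arcs, r):
    """Head = top of stack, dependent = second-from-top; remove the dependent."""
    dep = stack.pop(-2)
    arcs.add((stack[-1], r, dep))

def right_arc(stack, buffer, arcs, r):
    """Head = second-from-top, dependent = top of stack; remove the dependent."""
    dep = stack.pop()
    arcs.add((stack[-1], r, dep))

# Parse the toy sentence "I ate".
stack, buffer, arcs = ["ROOT"], ["I", "ate"], set()
shift(stack, buffer, arcs)              # stack: ROOT, I
shift(stack, buffer, arcs)              # stack: ROOT, I, ate
left_arc(stack, buffer, arcs, "nsubj")  # arc: ate -> I
right_arc(stack, buffer, arcs, "root")  # arc: ROOT -> ate
```

The terminal state is reached when the buffer is empty and only ROOT remains on the stack.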

Neural Dependency Parsing

Basic Visualization Techniques of Text Data

Word Cloud

In a word cloud, it is difficult to determine the optimal placement of words. In addition, word clouds do not show relations between words.

Word Tree


Time-series representation: view which keywords occur more frequently over time. This type of visualization is known as a stacked line graph.

TIARA Visualization

Phrase Nets

Phrase nets are useful for exploring how words are linked in a text and, like word clouds and word trees, can be informative for early data analysis.

For more…

Deep Visualization Techniques


Dimensionality Reduction (aka Manifold Learning)

Linear tSNE optimization for the Web