A multi-modal representation represents data using information from multiple modalities. Representing data in a meaningful way is crucial to multimodal problems, and raises several challenges:
- how to combine data from heterogeneous sources
- how to deal with differing levels of noise, and noise topology
- how to deal with missing data
Categories of multi-modal representations
Baltrušaitis, Ahuja, and Morency (n.d.) propose two categories of multimodal representation:
- Joint: combine unimodal signals into the same representation space.
- Coordinated: process unimodal signals separately, but enforce similarity constraints on them to bring them into a coordinated space.
Joint representations are mostly used in tasks where multimodal data is present during both training and inference.
A common approach is to use deep neural networks to create abstract representations of the unimodal data, before combining them in the last layer (for example, via concatenation). The hidden layers project the modalities into a joint space. When using neural networks, multimodal representation learning is closely related to Multi-modal Fusion.
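The concatenation approach can be sketched in a few lines of NumPy. This is a minimal, untrained illustration rather than a reference implementation: the feature sizes (128-d image, 300-d text), the layer widths, and all function names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def dense(in_dim, out_dim):
    """Random weights and zero bias for one layer (untrained, illustrative)."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim)

# Hypothetical feature sizes: 128-d image features, 300-d text features.
W_img, b_img = dense(128, 64)
W_txt, b_txt = dense(300, 64)
W_joint, b_joint = dense(64 + 64, 32)

def joint_representation(x_img, x_txt):
    h_img = relu(x_img @ W_img + b_img)          # unimodal image encoder
    h_txt = relu(x_txt @ W_txt + b_txt)          # unimodal text encoder
    h = np.concatenate([h_img, h_txt], axis=-1)  # fuse by concatenation
    return relu(h @ W_joint + b_joint)           # project into the joint space

batch_img = rng.normal(size=(4, 128))
batch_txt = rng.normal(size=(4, 300))
z = joint_representation(batch_img, batch_txt)
print(z.shape)  # (4, 32)
```

Note that both inputs are required to compute `z`, which is why this kind of model needs every modality at inference time.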
It is common to pretrain these representations with an autoencoder on unlabelled data, because neural networks otherwise require a lot of labelled data. The neural network approach often yields superior performance, but cannot handle missing data naturally.
Probabilistic Graphical Models
Probabilistic graphical models construct representations through the use of latent random variables.
A popular approach is the Deep Boltzmann Machine (DBM), which uses Restricted Boltzmann Machines (RBMs) as building blocks. Each successive layer of a DBM is expected to represent the data at a higher level of abstraction. DBMs do not need labelled data for training. A DBM can also be converted into a deterministic neural network, but this loses the generative aspect of the model.
One advantage of using multimodal DBMs is that they are generative models, allowing them to:
- Naturally deal with missing data.
- Generate samples of one modality given at least one other modality.
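The idea of filling in a missing modality can be illustrated with a single bimodal RBM, the building block mentioned above, rather than a full DBM. The sketch below clamps the observed "image" units and Gibbs-samples the missing "text" units through a shared hidden layer. The unit counts, the random weights (standing in for trained parameters), and the function names are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes; random weights stand in for trained parameters.
n_img, n_txt, n_hid = 6, 4, 5
W_img = rng.normal(scale=0.1, size=(n_img, n_hid))
W_txt = rng.normal(scale=0.1, size=(n_txt, n_hid))
b_hid = np.zeros(n_hid)
b_txt = np.zeros(n_txt)

def sample_missing_text(v_img, n_steps=50):
    """Gibbs-sample the text units while the observed image units stay clamped."""
    v_txt = rng.integers(0, 2, size=n_txt).astype(float)  # random initial guess
    for _ in range(n_steps):
        # Sample the shared hidden units given both visible groups.
        p_h = sigmoid(v_img @ W_img + v_txt @ W_txt + b_hid)
        h = (rng.random(n_hid) < p_h).astype(float)
        # Resample only the missing modality given the hidden units.
        p_txt = sigmoid(h @ W_txt.T + b_txt)
        v_txt = (rng.random(n_txt) < p_txt).astype(float)
    return v_txt

v_img = rng.integers(0, 2, size=n_img).astype(float)
v_txt = sample_missing_text(v_img)
```

With trained weights, the same loop would draw text samples conditioned on the image; here it only demonstrates the mechanics.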
In coordinated representations, a separate representation is learnt for each modality, and the representations are later coordinated through a constraint.
Similarity models minimize the distance between modalities in the coordinated space. Neural network approaches such as DeViSE use an inner product as the similarity measure, together with a ranking loss function.
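An inner-product similarity combined with a hinge-style ranking loss can be sketched as follows. This is written in the spirit of such models, not as the exact DeViSE objective: the function name, the batch layout (row i of each matrix forms a matching pair), and the margin value are assumptions.

```python
import numpy as np

def hinge_rank_loss(img_emb, txt_emb, margin=0.1):
    """Pairwise ranking loss; row i of img_emb matches row i of txt_emb."""
    scores = img_emb @ txt_emb.T             # inner-product similarity, shape (n, n)
    positive = np.diag(scores)[:, None]      # similarity of each matching pair
    # Penalise every non-matching pair that scores within `margin` of the match.
    violations = np.maximum(0.0, margin - positive + scores)
    np.fill_diagonal(violations, 0.0)        # a pair does not compete with itself
    return violations.sum(axis=1).mean()

aligned = hinge_rank_loss(np.eye(3), np.eye(3))
shuffled = hinge_rank_loss(np.eye(3), np.roll(np.eye(3), 1, axis=0))
```

The loss is zero when every matching pair beats all mismatched pairs by at least the margin, and grows as mismatched pairs score higher than the true pair.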
Some models use a structured coordinated space, enforcing additional constraints between the modality representations. These structural constraints differ across applications, which include Cross-modal Hashing, cross-modal retrieval, and image captioning.
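Other examples of structured coordinated spaces enforce a partial order in the multimodal space. Another special case of a structured coordinated space is Canonical Correlation Analysis (CCA), which finds linear projections of two modalities that are maximally correlated.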
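As a rough illustration of CCA, the sketch below computes linear projections for two views by whitening each view and taking an SVD of the whitened cross-covariance. The function name, the small regularisation term `reg`, and the synthetic data are assumptions for the example; a library implementation would normally be preferred in practice.

```python
import numpy as np

def cca(X, Y, n_components=1, reg=1e-8):
    """Linear CCA: returns projections A, B and the canonical correlations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    def inv_sqrt(C):
        # Inverse matrix square root via the eigendecomposition of a symmetric matrix.
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)  # singular values = canonical correlations
    A = Kx @ U[:, :n_components]
    B = Ky @ Vt.T[:, :n_components]
    return A, B, s[:n_components]

# Synthetic two-view data sharing one latent signal z (illustrative only).
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 2))])
Y = np.hstack([rng.normal(size=(500, 1)), z + 0.1 * rng.normal(size=(500, 1))])
A, B, corr = cca(X, Y)
```

Here `X @ A` and `Y @ B` are the coordinated representations of the two views, and `corr` holds the correlation between them.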