Multi-modal fusion is the integration of information from multiple modalities with the goal of predicting an outcome measure: either a class label (classification) or a continuous value (regression).
Multi-modal fusion is motivated by 3 main reasons:
- Access to multiple modalities helps make predictions more robust (e.g. in audio-visual speech recognition, AVSR)
- Multiple modalities may capture complementary information
- Multi-modal systems can still operate when some modalities are missing
Multi-modal fusion approaches can be classified as (Baltrušaitis, Ahuja, and Morency 2017):
- early fusion
- hybrid fusion
- late fusion
- graphical models
- neural networks
Early fusion approaches integrate features immediately after they are extracted (for example, via concatenation). Late fusion performs integration after each modality has produced its own decision (for example, by averaging the per-modality predictions). Hybrid fusion combines the outputs of early fusion with those of individual unimodal predictors.
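The three model-agnostic schemes can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the synthetic "audio" and "video" features, the nearest-centroid classifier standing in for an arbitrary unimodal predictor, and the equal weighting of predictions. It is not a reference implementation of any particular system.

```python
import numpy as np

def fit_centroids(X, y):
    """Toy unimodal 'model': one mean feature vector per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_proba(model, X):
    """Softmax over negative squared distances to each class centroid."""
    _, centroids = model
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# Synthetic two-class data with two hypothetical modalities.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
audio = rng.normal(size=(n, 4)) + 1.5 * y[:, None]
video = rng.normal(size=(n, 6)) + 1.0 * y[:, None]

# Early fusion: concatenate features, then train a single predictor.
early_X = np.concatenate([audio, video], axis=1)
p_early = predict_proba(fit_centroids(early_X, y), early_X)

# Late fusion: one predictor per modality, then average their decisions.
p_audio = predict_proba(fit_centroids(audio, y), audio)
p_video = predict_proba(fit_centroids(video, y), video)
p_late = (p_audio + p_video) / 2

# Hybrid fusion: combine the early-fusion output with the unimodal outputs.
p_hybrid = (p_early + p_audio + p_video) / 3
```

Because late fusion keeps one predictor per modality, a missing modality can be handled by averaging over whichever predictors remain, which is harder to do once features have been concatenated.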
Model-agnostic approaches (early, late, and hybrid fusion) are easy to implement because they can reuse uni-modal machine learning methods.
Model-based approaches, in contrast, are designed specifically to handle multi-modal data.
Multiple Kernel Learning (MKL) methods extend kernel Support Vector Machines to use different kernels for different modalities. Modality-specific kernels allow for better fusion of heterogeneous data. One advantage of MKL is that the loss function is convex, which leads to easy optimization and globally optimal solutions. However, MKL approaches rely on support vectors (training data) at test time, which means inference can be slow.
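The idea of modality-specific kernels can be sketched as follows. This is a deliberate simplification, not a full MKL solver: the kernel weights are set with a kernel-target alignment heuristic instead of being learned jointly with the classifier, and kernel ridge regression stands in for the SVM. The data, the kernel choices, and the names `rbf_kernel`, `linear_kernel`, and `alignment` are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def linear_kernel(A, B):
    return A @ B.T

def alignment(K, y):
    """Kernel-target alignment: cosine similarity between K and y y^T.
    Non-negative whenever K is positive semi-definite."""
    Y = np.outer(y, y)
    return (K * Y).sum() / (np.linalg.norm(K) * np.linalg.norm(Y))

# Synthetic data: labels in {-1, +1}, two hypothetical modalities.
rng = np.random.default_rng(0)
n_train, n_test = 80, 20
y = rng.choice([-1.0, 1.0], size=n_train + n_test)
audio = rng.normal(size=(n_train + n_test, 4)) + y[:, None]
video = rng.normal(size=(n_train + n_test, 6)) + 0.5 * y[:, None]
tr, te = slice(0, n_train), slice(n_train, None)

# One kernel per modality, computed on the training set.
K_audio = rbf_kernel(audio[tr], audio[tr], gamma=0.1)
K_video = linear_kernel(video[tr], video[tr])

# Heuristic weights, normalised to a convex combination.
w = np.array([alignment(K_audio, y[tr]), alignment(K_video, y[tr])])
w = w / w.sum()
K_train = w[0] * K_audio + w[1] * K_video

# Kernel ridge fit (stand-in for the SVM): solve (K + lam * I) alpha = y.
lam = 1.0
alpha = np.linalg.solve(K_train + lam * np.eye(n_train), y[tr])

# Prediction needs kernels between test points and the stored training points.
K_test = (w[0] * rbf_kernel(audio[te], audio[tr], gamma=0.1)
          + w[1] * linear_kernel(video[te], video[tr]))
pred = np.sign(K_test @ alpha)
accuracy = (pred == y[te]).mean()
```

The last step shows why inference can be slow: every prediction evaluates each modality's kernel against retained training examples, so the cost grows with the number of points kept (all of them here, only the support vectors in an SVM).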
Generative graphical models, such as factorial hidden Markov models and multi-stream hidden Markov models, have been used for fusion. Discriminative models such as Conditional Random Fields sacrifice the modeling of the joint probability for predictive power.
Lastly, neural networks can fuse temporal multi-modal information through recurrent architectures such as RNNs and LSTMs. These approaches have also been adapted for various image captioning tasks. Neural networks have the capacity to learn from large amounts of data and can learn complex decision boundaries. However, they lack interpretability and require large amounts of training data.
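One way such recurrent fusion could be structured is sketched below: each modality's sequence is encoded by its own simple (Elman) RNN, and the final hidden states are concatenated and passed to a joint output layer. This shows the forward pass only, with random untrained weights. The dimensions, the choice of a plain RNN over an LSTM, and fusion at the final hidden state are all assumptions made for illustration.

```python
import numpy as np

def rnn_encode(x, W_in, W_h):
    """Run a simple (Elman) RNN over x of shape (T, d); return the last hidden state."""
    h = np.zeros(W_h.shape[0])
    for x_t in x:
        h = np.tanh(W_in @ x_t + W_h @ h)
    return h

rng = np.random.default_rng(0)
T, d_audio, d_video, d_hidden, n_classes = 10, 4, 6, 8, 3

# Hypothetical time-aligned input sequences, one per modality.
audio_seq = rng.normal(size=(T, d_audio))
video_seq = rng.normal(size=(T, d_video))

# Random (untrained) parameters: one encoder per modality plus a joint layer.
Wa_in = rng.normal(scale=0.1, size=(d_hidden, d_audio))
Wa_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
Wv_in = rng.normal(scale=0.1, size=(d_hidden, d_video))
Wv_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
W_out = rng.normal(scale=0.1, size=(n_classes, 2 * d_hidden))

# Encode each modality separately, then fuse by concatenation.
h_audio = rnn_encode(audio_seq, Wa_in, Wa_h)
h_video = rnn_encode(video_seq, Wv_in, Wv_h)
joint = np.concatenate([h_audio, h_video])

# The joint layer maps the fused representation to class probabilities.
logits = W_out @ joint
z = logits - logits.max()
probs = np.exp(z) / np.exp(z).sum()
```

In a trained model, the encoder and joint-layer weights would be learned end to end, so the fusion itself is learned from data. This is the source of both the capacity and the data requirements noted above.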