Paper:《Multimodal Machine Learning: A Survey and Taxonomy,多模态机器学习:综述与分类》翻译与解读
目录
《Multimodal Machine Learning: A Survey and Taxonomy》翻译与解读
Abstract
1 INTRODUCTION
2 Applications: a historical perspective应用:历史视角
3 Multimodal Representations多模态表示
3.1 Joint Representations联合表示
3.2 Coordinated Representations协调表示
3.3 Discussion讨论
4 Translation翻译
4.1 Example-based基于实例
4.2 Generative approaches生成方法
4.3 Model evaluation and discussion模型评价与讨论
5 Alignment对齐
5.1 Explicit alignment显式对齐
5.2 Implicit alignment隐式对齐
5.3 Discussion讨论
6 Fusion融合
6.1 Model-agnostic approaches与模型无关的方法
6.2 Model-based approaches基于模型的方法
6.3 Discussion讨论
7 Co-learning共同学习
7.1 Parallel data并行数据
7.2 Non-parallel data非并行数据
7.3 Hybrid data混合数据
7.4 Discussion讨论
8 Conclusion结论
《Multimodal Machine Learning: A Survey and Taxonomy》翻譯與解讀
作者:Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency
時間:2017年5月26日
地址:https://arxiv.org/abs/1705.09406
Abstract
| Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research. | 我们对世界的体验是多模态的(五大感官)——我们看到物体(视觉),听到声音(听觉),感觉到质地(触觉),闻到气味(嗅觉),品尝到味道(味觉),其实还包括第六感(心觉)。模态是指事物发生或被体验的方式,当一个研究问题包含多种这样的模态时,它就被称为多模态问题。为了让人工智能在理解我们周围世界方面取得进展,它需要能够同时解读这些多模态信号。多模态机器学习旨在构建能够处理并关联来自多种模态的信息的模型。这是一个充满活力的多学科领域,其重要性和潜力都在不断增加。本文不关注具体的多模态应用,而是对多模态机器学习本身的最新进展进行综述,并将其置于一个统一的分类体系中呈现。我们超越了典型的早期融合与晚期融合的划分,确定了多模态机器学习所面临的更广泛的挑战,即:表示、翻译、对齐、融合和共同学习。这一新的分类体系将帮助研究人员更好地了解该领域的现状,并确定未来的研究方向。 |
| Index Terms—Multimodal, machine learning, introductory, survey. | 索引术语:多模态、机器学习、入门、综述。 |
1 INTRODUCTION
| THE world surrounding us involves multiple modalities— we see objects, hear sounds, feel texture, smell odors, and so on. In general terms, a modality refers to the way in which something happens or is experienced. Most people associate the word modality with the sensory modalities which represent our primary channels of communication and sensation, such as vision or touch. A research problem or dataset is therefore characterized as multimodal when it includes multiple such modalities. In this paper we focus primarily, but not exclusively, on three modalities: natural language which can be both written or spoken; visual signals which are often represented with images or videos; and vocal signals which encode sounds and para-verbal information such as prosody and vocal expressions. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret and reason about multimodal messages. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. From early research on audio-visual speech recognition to the recent explosion of interest in language and vision models, multimodal machine learning is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. | 我们周围的世界涉及多种模态——我们看到物体,听到声音,感觉到质地,闻到气味,等等。一般来说,模态是指某件事发生或被体验的方式。大多数人把“模态”一词与感官模态联系在一起,它们代表我们沟通和感知的主要渠道,如视觉或触觉。因此,当一个研究问题或数据集包含多个这样的模态时,它就被称为多模态的。在本文中,我们主要(但不限于)关注三种模态:可以是书面或口头形式的自然语言;通常以图像或视频表示的视觉信号;以及编码声音和副语言信息(如韵律和嗓音表达)的语音信号。 为了让人工智能在理解我们周围世界方面取得进展,它需要能够对多模态信息进行解释和推理。多模态机器学习旨在构建能够处理并关联来自多种模态的信息的模型。从早期的视听语音识别研究,到最近对语言与视觉模型的兴趣激增,多模态机器学习是一个充满活力的多学科领域,其重要性日益增加,并具有非凡的潜力。 |
| The research field of Multimodal Machine Learning brings some unique challenges for computational researchers given the heterogeneity of the data. Learning from multimodal sources offers the possibility of capturing correspondences between modalities and gaining an in-depth understanding of natural phenomena. In this paper we identify and explore five core technical challenges (and related sub-challenges) surrounding multimodal machine learning. They are central to the multimodal setting and need to be tackled in order to progress the field. Our taxonomy goes beyond the typical early and late fusion split, and consists of the five following challenges: 1) Representation: A first fundamental challenge is learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy of multiple modalities. The heterogeneity of multimodal data makes it challenging to construct such representations. For example, language is often symbolic while audio and visual modalities will be represented as signals. 2) Translation: A second challenge addresses how to translate (map) data from one modality to another. Not only is the data heterogeneous, but the relationship between modalities is often open-ended or subjective. For example, there exist a number of correct ways to describe an image and one perfect translation may not exist. 3) Alignment: A third challenge is to identify the direct relations between (sub)elements from two or more different modalities. For example, we may want to align the steps in a recipe to a video showing the dish being made. To tackle this challenge we need to measure similarity between different modalities and deal with possible long-range dependencies and ambiguities. 4) Fusion: A fourth challenge is to join information from two or more modalities to perform a prediction. For example, for audio-visual speech recognition, the visual description of the lip motion is fused with the speech signal to predict spoken words. The information coming from different modalities may have varying predictive power and noise topology, with possibly missing data in at least one of the modalities. 5) Co-learning: A fifth challenge is to transfer knowledge between modalities, their representation, and their predictive models. This is exemplified by algorithms of co-training, conceptual grounding, and zero shot learning. Co-learning explores how knowledge learned from one modality can help a computational model trained on a different modality. This challenge is particularly relevant when one of the modalities has limited resources (e.g., annotated data).
| 考虑到数据的异质性,多模态机器学习这一研究领域给计算领域的研究人员带来了一些独特的挑战。从多模态来源中学习提供了捕捉模态之间对应关系、深入理解自然现象的可能性。在本文中,我们确定并探讨了围绕多模态机器学习的五个核心技术挑战(以及相关的子挑战)。它们是多模态场景的核心,需要加以解决以推动该领域的发展。我们的分类超越了典型的早期融合与晚期融合的划分,包括以下五个挑战: 1)、表示:第一个基本挑战是学习如何以一种利用多模态互补性和冗余性的方式来表示和总结多模态数据。多模态数据的异质性使得构造这样的表示具有挑战性。例如,语言通常是符号化的,而音频和视觉模态将被表示为信号。 2)、翻译:第二个挑战是如何将数据从一种模态转换(映射)到另一种模态。不仅数据是异质的,而且模态之间的关系往往是开放的或主观的。例如,描述一幅图像存在许多正确的方式,而且可能不存在一种完美的转换。 3)、对齐:第三个挑战是识别来自两个或更多不同模态的(子)元素之间的直接关系。例如,我们可能想要将菜谱中的步骤与展示菜肴制作过程的视频对齐。为了应对这一挑战,我们需要衡量不同模态之间的相似性,并处理可能的长程依赖和歧义。 4)、融合:第四个挑战是将来自两个或更多模态的信息结合起来进行预测。例如,在视听语音识别中,将嘴唇运动的视觉描述与语音信号融合来预测所说的单词。来自不同模态的信息可能具有不同的预测能力和噪声拓扑,并且至少一种模态中可能存在数据缺失。 5)、共同学习:第五个挑战是在模态、它们的表示及其预测模型之间传递知识。协同训练、概念接地和零样本学习等算法就是例证。共同学习探索从一种模态中学到的知识如何帮助在另一种模态上训练的计算模型。当其中一种模态的资源有限(例如,标注数据)时,这一挑战尤其重要。 |
| Table 1: A summary of applications enabled by multimodal machine learning. For each application area we identify the core technical challenges that need to be addressed in order to tackle it. APPLICATIONS (challenges: REPRESENTATION, TRANSLATION, ALIGNMENT, FUSION, CO-LEARNING): 1. Speech recognition and synthesis: Audio-visual speech recognition, (Visual) speech synthesis 2. Event detection: Action classification, Multimedia event detection 3. Emotion and affect: Recognition, Synthesis 4. Media description: Image description, Video description, Visual question-answering, Media summarization 5. Multimedia retrieval: Cross modal retrieval, Cross modal hashing | 表1:多模态机器学习所支持的应用总结。对于每个应用领域,我们指出了为解决该问题需要应对的核心技术挑战。 应用(对应挑战:表示、翻译、对齐、融合、共同学习) 1、语音识别与合成:视听语音识别、(视觉)语音合成 2、事件检测:动作分类、多媒体事件检测 3、情绪与情感:识别、合成 4、媒体描述:图像描述、视频描述、视觉问答、媒体摘要 5、多媒体检索:跨模态检索、跨模态哈希 |
| For each of these five challenges, we define taxonomic classes and sub-classes to help structure the recent work in this emerging research field of multimodal machine learning. We start with a discussion of the main applications of multimodal machine learning (Section 2) followed by a discussion on the recent developments on all of the five core technical challenges facing multimodal machine learning: representation (Section 3), translation (Section 4), alignment (Section 5), fusion (Section 6), and co-learning (Section 7). We conclude with a discussion in Section 8. | 对于这五个挑战中的每一个,我们都定义了分类类别和子类别,以帮助梳理多模态机器学习这一新兴研究领域的最新工作。我们首先讨论多模态机器学习的主要应用(第2节),随后讨论多模态机器学习面临的全部五个核心技术挑战的最新进展:表示(第3节)、翻译(第4节)、对齐(第5节)、融合(第6节)和共同学习(第7节)。最后,我们在第8节中进行总结性讨论。 |
2 Applications: a historical perspective应用:历史视角
| Multimodal machine learning enables a wide range of applications: from audio-visual speech recognition to image captioning. In this section we present a brief history of multimodal applications, from its beginnings in audio-visual speech recognition to a recently renewed interest in language and vision applications. One of the earliest examples of multimodal research is audio-visual speech recognition (AVSR) [243]. It was motivated by the McGurk effect [138] — an interaction between hearing and vision during speech perception. When human subjects heard the syllable /ba-ba/ while watching the lips of a person saying /ga-ga/, they perceived a third sound: /da-da/. These results motivated many researchers from the speech community to extend their approaches with visual information. Given the prominence of hidden Markov models (HMMs) in the speech community at the time [95], it is without surprise that many of the early models for AVSR were based on various HMM extensions [24], [25]. While research into AVSR is not as common these days, it has seen renewed interest from the deep learning community [151]. While the original vision of AVSR was to improve speech recognition performance (e.g., word error rate) in all contexts, the experimental results showed that the main advantage of visual information was when the speech signal was noisy (i.e., low signal-to-noise ratio) [75], [151], [243]. In other words, the captured interactions between modalities were supplementary rather than complementary. The same information was captured in both, improving the robustness of the multimodal models but not improving the speech recognition performance in noiseless scenarios. | 多模态机器学习支持广泛的应用:从视听语音识别到图像描述。在本节中,我们将简要介绍多模态应用的历史,从它在视听语音识别方面的起步,到最近在语言与视觉应用方面重新燃起的兴趣。 多模态研究最早的例子之一是视听语音识别(AVSR)[243]。它的动机来自McGurk效应[138]——言语感知过程中听觉与视觉之间的交互作用。当受试者一边观察一个人说/ga-ga/时的嘴唇、一边听到/ba-ba/音节时,他们会感知到第三个音:/da-da/。这些结果促使语音领域的许多研究人员用视觉信息来扩展他们的方法。考虑到隐马尔可夫模型(HMM)在当时语音领域中的主导地位[95],许多早期的AVSR模型都基于各种HMM扩展[24]、[25],这一点并不令人惊讶。虽然如今对AVSR的研究已不那么普遍,但深度学习社区对它重新燃起了兴趣[151]。 虽然AVSR最初的愿景是在所有场景下提高语音识别性能(例如,降低单词错误率),但实验结果表明,视觉信息的主要优势体现在语音信号有噪声(即低信噪比)时[75]、[151]、[243]。换句话说,所捕捉到的模态之间的交互是补充性的(supplementary)而非互补性的(complementary):两种模态捕获的是相同的信息,这提高了多模态模型的鲁棒性,但并没有提高无噪声场景下的语音识别性能。 |
| A second important category of multimodal applications comes from the field of multimedia content indexing and retrieval [11], [188]. With the advance of personal computers and the internet, the quantity of digitized multimedia content has increased dramatically [2]. While earlier approaches for indexing and searching these multimedia videos were keyword-based [188], new research problems emerged when trying to search the visual and multimodal content directly. This led to new research topics in multimedia content analysis such as automatic shot-boundary detection [123] and video summarization [53]. These research projects were supported by the TrecVid initiative from the National Institute of Standards and Technologies which introduced many high-quality datasets, including the multimedia event detection (MED) tasks started in 2011 [1]. | 第二个重要的多模态应用类别来自多媒体内容索引与检索领域[11]、[188]。随着个人电脑和互联网的发展,数字化多媒体内容的数量急剧增加[2]。虽然早期对这些多媒体视频进行索引和搜索的方法是基于关键词的[188],但在尝试直接搜索视觉和多模态内容时,出现了新的研究问题。这催生了多媒体内容分析的新研究课题,如自动镜头边界检测[123]和视频摘要[53]。这些研究项目得到了美国国家标准与技术研究院TrecVid计划的支持,该计划引入了许多高质量的数据集,包括2011年启动的多媒体事件检测(MED)任务[1]。 |
| A third category of applications was established in the early 2000s around the emerging field of multimodal interaction with the goal of understanding human multimodal behaviors during social interactions. One of the first landmark datasets collected in this field is the AMI Meeting Corpus which contains more than 100 hours of video recordings of meetings, all fully transcribed and annotated [33]. Another important dataset is the SEMAINE corpus which allowed to study interpersonal dynamics between speakers and listeners [139]. This dataset formed the basis of the first audio-visual emotion challenge (AVEC) organized in 2011 [179]. The fields of emotion recognition and affective computing bloomed in the early 2010s thanks to strong technical advances in automatic face detection, facial landmark detection, and facial expression recognition [46]. The AVEC challenge continued annually afterward with the later instantiation including healthcare applications such as automatic assessment of depression and anxiety [208]. A great summary of recent progress in multimodal affect recognition was published by D'Mello et al. [50]. Their meta-analysis revealed that a majority of recent work on multimodal affect recognition show improvement when using more than one modality, but this improvement is reduced when recognizing naturally-occurring emotions. | 第三类应用建立于21世纪初,围绕新兴的多模态交互领域展开,其目标是理解社交互动中人类的多模态行为。该领域最早收集的里程碑式数据集之一是AMI会议语料库(AMI Meeting Corpus),它包含100多个小时的会议视频录像,并且全部经过完整的转录和标注[33]。另一个重要的数据集是SEMAINE语料库,它使研究说话者与倾听者之间的人际互动动态成为可能[139]。该数据集构成了2011年组织的第一届视听情感挑战赛(AVEC)的基础[179]。得益于自动人脸检测、面部关键点检测和面部表情识别[46]技术的巨大进步,情绪识别和情感计算领域在2010年代早期蓬勃发展。此后,AVEC挑战赛每年持续举办,后来的赛事还包括抑郁和焦虑的自动评估等医疗保健应用[208]。D'Mello等人[50]对多模态情感识别的最新进展做了很好的总结。他们的元分析表明,最近大多数多模态情感识别工作在使用多种模态时都有所提升,但在识别自然产生的情绪时,这种提升会减小。 |
| Most recently, a new category of multimodal applications emerged with an emphasis on language and vision: media description. One of the most representative applications is image captioning where the task is to generate a text description of the input image [83]. This is motivated by the ability of such systems to help the visually impaired in their daily tasks [20]. The main challenge facing media description is evaluation: how to evaluate the quality of the predicted descriptions. The task of visual question-answering (VQA) was recently proposed to address some of the evaluation challenges [9], where the goal is to answer a specific question about the image. In order to bring some of the mentioned applications to the real world we need to address a number of technical challenges facing multimodal machine learning. We summarize the relevant technical challenges for the above mentioned application areas in Table 1. One of the most important challenges is multimodal representation, the focus of our next section. | 最近,出现了一类强调语言与视觉的新型多模态应用:媒体描述。最具代表性的应用之一是图像描述(image captioning),其任务是为输入图像生成文本描述[83]。其动机在于此类系统能够帮助视障人士完成日常任务[20]。媒体描述面临的主要挑战是评估:如何评价所生成描述的质量。最近提出的视觉问答(VQA)任务就是为了解决其中一些评估难题[9],其目标是回答关于图像的特定问题。 为了将上述部分应用带入现实世界,我们需要解决多模态机器学习面临的一系列技术挑战。我们在表1中总结了上述应用领域的相关技术挑战。其中最重要的挑战之一是多模态表示,这是下一节的重点。 |
3 Multimodal Representations多模態(tài)表示
| Representing raw data in a format that a computational model can work with has always been a big challenge in machine learning. Following the work of Bengio et al. [18] we use the terms feature and representation interchangeably, with each referring to a vector or tensor representation of an entity, be it an image, audio sample, individual word, or a sentence. A multimodal representation is a representation of data using information from multiple such entities. Representing multiple modalities poses many difficulties: how to combine the data from heterogeneous sources; how to deal with different levels of noise; and how to deal with missing data. The ability to represent data in a meaningful way is crucial to multimodal problems, and forms the backbone of any model. Good representations are important for the performance of machine learning models, as evidenced by the recent leaps in performance of speech recognition [79] and visual object classification [109] systems. Bengio et al. [18] identify a number of properties for good representations: smoothness, temporal and spatial coherence, sparsity, and natural clustering amongst others. Srivastava and Salakhutdinov [198] identify additional desirable properties for multimodal representations: similarity in the representation space should reflect the similarity of the corresponding concepts, the representation should be easy to obtain even in the absence of some modalities, and finally, it should be possible to fill-in missing modalities given the observed ones. | 以计算模型能够处理的格式表示原始数据一直是机器学习中的一大挑战。沿用Bengio等人[18]的做法,我们交替使用“特征”和“表示”这两个术语,二者都指某个实体(无论是图像、音频样本、单个词还是句子)的向量或张量表示。多模态表示则是利用来自多个此类实体的信息对数据进行的表示。表示多种模态带来了许多困难:如何组合来自异构来源的数据;如何处理不同程度的噪声;以及如何处理缺失数据。以有意义的方式表示数据的能力对多模态问题至关重要,并构成任何模型的支柱。 良好的表示对机器学习模型的性能非常重要,语音识别[79]和视觉物体分类[109]系统最近的性能飞跃就证明了这一点。Bengio等人[18]指出了良好表示应具备的若干特性:平滑性、时间与空间一致性、稀疏性以及自然聚类等。Srivastava和Salakhutdinov[198]进一步指出了多模态表示的其他理想特性:表示空间中的相似性应反映相应概念的相似性;即使缺少某些模态,表示也应易于获得;最后,应能够在给定已观测模态的情况下填补缺失的模态。 |
| The development of unimodal representations has been extensively studied [5], [18], [122]. In the past decade there has been a shift from hand-designed for specific applications to data-driven. For example, one of the most famous image descriptors in the early 2000s, the scale invariant feature transform (SIFT) was hand designed [127], but currently most visual descriptions are learned from data using neural architectures such as convolutional neural networks (CNN) [109]. Similarly, in the audio domain, acoustic features such as Mel-frequency cepstral coefficients (MFCC) have been superseded by data-driven deep neural networks in speech recognition [79] and recurrent neural networks for para-linguistic analysis [207]. In natural language processing, the textual features initially relied on counting word occurrences in documents, but have been replaced by data-driven word embeddings that exploit the word context [141]. While there has been a huge amount of work on unimodal representation, up until recently most multimodal representations involved simple concatenation of unimodal ones [50], but this has been rapidly changing. | 单模态表示的发展已被广泛研究[5]、[18]、[122]。在过去十年里,出现了从面向特定应用的手工设计到数据驱动的转变。例如,21世纪初最著名的图像描述符之一、尺度不变特征变换(SIFT)是手工设计的[127],但目前大多数视觉描述都是使用卷积神经网络(CNN)等神经网络结构从数据中学习得到的[109]。类似地,在音频领域,梅尔频率倒谱系数(MFCC)等声学特征已被语音识别中数据驱动的深度神经网络[79]以及副语言分析中的循环神经网络[207]所取代。在自然语言处理中,文本特征最初依赖于统计文档中的单词出现次数,但如今已被利用词语上下文的数据驱动词嵌入[141]所取代。尽管在单模态表示方面已有大量工作,但直到最近,大多数多模态表示仍只是单模态表示的简单拼接[50],不过这种情况正在迅速改变。 |
| To help understand the breadth of work, we propose two categories of multimodal representation: joint and coordinated. Joint representations combine the unimodal signals into the same representation space, while coordinated representations process unimodal signals separately, but enforce certain similarity constraints on them to bring them to what we term a coordinated space. An illustration of different multimodal representation types can be seen in Figure 1. Mathematically, the joint representation is expressed as $x_m = f(x_1, \ldots, x_n)$ (Equation 1), where the multimodal representation $x_m$ is computed using function $f$ (e.g., a deep neural network, restricted Boltzmann machine, or a recurrent neural network) that relies on unimodal representations $x_1, \ldots, x_n$. The coordinated representation, in turn, is expressed as $f(x_1) \sim g(x_2)$ (Equation 2), where each modality has a corresponding projection function ($f$ and $g$ above) that maps it into a coordinated multimodal space. The projection into the multimodal space is independent for each modality, but the resulting space is coordinated between them (indicated as $\sim$). Examples of such coordination include minimizing cosine distance [61], maximizing correlation [7], and enforcing a partial order [212] between the resulting spaces. | 为了帮助理解相关工作的广度,我们提出两类多模态表示:联合表示与协调表示。联合表示将单模态信号组合到同一个表示空间中,而协调表示分别处理各个单模态信号,但对它们施加一定的相似性约束,使其进入我们所说的协调空间。不同多模态表示类型的示意见图1。 在数学上,联合表示可表示为 $x_m = f(x_1, \ldots, x_n)$(公式1),其中多模态表示 $x_m$ 由依赖于单模态表示 $x_1, \ldots, x_n$ 的函数 $f$(例如深度神经网络、受限玻尔兹曼机或循环神经网络)计算得到。而协调表示形如 $f(x_1) \sim g(x_2)$(公式2),其中每个模态都有一个相应的投影函数(上式中的 $f$ 和 $g$),将其映射到一个协调的多模态空间中。每个模态到多模态空间的投影是相互独立的,但得到的空间在模态之间是协调的(用 $\sim$ 表示)。这种协调的例子包括最小化余弦距离[61]、最大化相关性[7]以及在结果空间之间施加偏序关系[212]。 |
3.1 Joint Representations联合表示
| We start our discussion with joint representations that project unimodal representations together into a multimodal space (Equation 1). Joint representations are mostly (but not exclusively) used in tasks where multimodal data is present both during training and inference steps. The sim-plest example of a joint representation is a concatenation of individual modality features (also referred to as early fusion [50]). In this section we discuss more advanced methods for creating joint representations starting with neural net-works, followed by graphical models and recurrent neural networks (representative works can be seen in Table 2). Neural networks have become a very popular method for unimodal data representation [18]. They are used to repre-sent visual, acoustic, and textual data, and are increasingly used in the multimodal domain [151], [156], [217]. In this section we describe how neural networks can be used to construct a joint multimodal representation, how to train them, and what advantages they offer. | 我們從聯(lián)合表示開始討論,聯(lián)合表示將單模態(tài)表示一起投射到多模態(tài)空間中(方程1)。聯(lián)合表示通常(但不是唯一)用于在訓(xùn)練和推理步驟中都存在多模態(tài)數(shù)據(jù)的任務(wù)中。聯(lián)合表示的最簡單的例子是單個形態(tài)特征的串聯(lián)(也稱為早期融合[50])。在本節(jié)中,我們將討論創(chuàng)建聯(lián)合表示的更高級方法,首先是神經(jīng)網(wǎng)絡(luò),然后是圖形模型和循環(huán)神經(jīng)網(wǎng)絡(luò)(代表性作品見表2)。神經(jīng)網(wǎng)絡(luò)已經(jīng)成為單模態(tài)數(shù)據(jù)表示[18]的一種非常流行的方法。它們被用來表示視覺、聽覺和文本數(shù)據(jù),并在多模態(tài)領(lǐng)域中越來越多地使用[151]、[156]、[217]。在本節(jié)中,我們將描述如何使用神經(jīng)網(wǎng)絡(luò)來構(gòu)建聯(lián)合多模態(tài)表示,如何訓(xùn)練它們,以及它們提供了什么優(yōu)勢。 |
| In general, neural networks are made up of successive building blocks of inner products followed by non-linear activation functions. In order to use a neural network as a way to represent data, it is first trained to perform a specific task (e.g., recognizing objects in images). Due to the multilayer nature of deep neural networks each successive layer is hypothesized to represent the data in a more abstract way [18], hence it is common to use the final or penultimate neural layers as a form of data representation. To construct a multimodal representation using neural networks each modality starts with several individual neural layers followed by a hidden layer that projects the modalities into a joint space [9], [145], [156], [227]. The joint multimodal representation is then passed through multiple hidden layers itself or used directly for prediction. Such models can be trained end-to-end — learning both to represent the data and to perform a particular task. This results in a close relationship between multimodal representation learning and multimodal fusion when using neural networks. | 一般来说,神经网络由内积运算加非线性激活函数的连续构建块组成。为了将神经网络用作表示数据的一种方式,首先要训练它执行某个特定任务(例如,识别图像中的物体)。由于深度神经网络的多层性质,假设每一后续层都以更抽象的方式表示数据[18],因此通常使用最后一层或倒数第二层神经层作为数据表示的一种形式。使用神经网络构建多模态表示时,每个模态先经过若干独立的神经层,随后由一个隐藏层将各模态投射到联合空间中[9]、[145]、[156]、[227]。联合多模态表示本身随后可以再经过多个隐藏层,或直接用于预测。这样的模型可以端到端地训练——同时学习表示数据和执行特定任务。这使得在使用神经网络时,多模态表示学习与多模态融合之间关系密切。 |
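To make the construction above concrete, here is a minimal sketch (in PyTorch, with made-up dimensions and a hypothetical visual + acoustic pairing, not taken from any cited system) of a joint representation: each modality passes through its own layers, a hidden layer projects their concatenation into a joint space, and the whole network is trained end-to-end on a prediction task.

```python
import torch
import torch.nn as nn

class JointRepresentation(nn.Module):
    """Sketch: modality-specific encoders followed by a joint projection layer."""
    def __init__(self, visual_dim=2048, acoustic_dim=128,
                 hidden_dim=256, joint_dim=128, num_classes=5):
        super().__init__()
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, hidden_dim), nn.ReLU())
        self.acoustic_enc = nn.Sequential(nn.Linear(acoustic_dim, hidden_dim), nn.ReLU())
        # hidden layer that projects the concatenated modalities into a joint space
        self.joint = nn.Sequential(nn.Linear(2 * hidden_dim, joint_dim), nn.ReLU())
        self.classifier = nn.Linear(joint_dim, num_classes)

    def forward(self, visual, acoustic):
        h_v = self.visual_enc(visual)
        h_a = self.acoustic_enc(acoustic)
        x_m = self.joint(torch.cat([h_v, h_a], dim=-1))   # joint representation x_m
        return self.classifier(x_m), x_m

# end-to-end training step on dummy data
model = JointRepresentation()
logits, x_m = model(torch.randn(4, 2048), torch.randn(4, 128))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 5, (4,)))
loss.backward()
```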
| Figure 1: Structure of joint and coordinated representations. Joint representations are projected to the same space using all of the modalities as input. Coordinated representations, on the other hand, exist in their own space, but are coordinated through a similarity (e.g. Euclidean distance) or structure constraint (e.g. partial order). | 图1:联合表示与协调表示的结构。联合表示以所有模态为输入,投射到同一个空间中。协调表示则各自存在于自己的空间中,但通过相似性约束(如欧氏距离)或结构约束(如偏序)相互协调。 |
| As neural networks require a lot of labeled training data, it is common to pre-train such representations using an autoencoder on unsupervised data [80]. The model proposed by Ngiam et al. [151] extended the idea of using autoencoders to the multimodal domain. They used stacked denoising autoencoders to represent each modality individually and then fused them into a multimodal representation using another autoencoder layer. Similarly, Silberer and Lapata [184] proposed to use a multimodal autoencoder for the task of semantic concept grounding (see Section 7.2). In addition to using a reconstruction loss to train the representation they introduce a term into the loss function that uses the representation to predict object labels. It is also common to fine-tune the resulting representation on a particular task at hand as the representation constructed using an autoencoder is generic and not necessarily optimal for a specific task [217]. The major advantage of neural network based joint representations comes from their often superior performance and the ability to pre-train the representations in an unsupervised manner. The performance gain is, however, dependent on the amount of data available for training. One of the disadvantages comes from the model not being able to handle missing data naturally — although there are ways to alleviate this issue [151], [217]. Finally, deep networks are often difficult to train [69], but the field is making progress in better training techniques [196]. Probabilistic graphical models are another popular way to construct representations through the use of latent random variables [18]. In this section we describe how probabilistic graphical models are used to represent unimodal and multimodal data. | 由于神经网络需要大量带标注的训练数据,通常会在无监督数据上使用自动编码器对此类表示进行预训练[80]。Ngiam等人[151]提出的模型将自动编码器的思想扩展到了多模态领域:他们使用堆叠去噪自动编码器分别表示每个模态,然后用另一个自动编码器层将它们融合成多模态表示。类似地,Silberer和Lapata[184]提出使用多模态自动编码器来完成语义概念接地任务(见7.2节)。除了使用重构损失来训练表示外,他们还在损失函数中引入了一项,利用该表示来预测物体标签。由于使用自动编码器构造的表示是通用的,对特定任务未必最优,因此在当前具体任务上对所得表示进行微调也很常见[217]。 基于神经网络的联合表示的主要优势在于其通常更优的性能,以及能够以无监督方式对表示进行预训练。然而,性能增益取决于可用于训练的数据量。其缺点之一是模型无法自然地处理缺失数据——尽管有一些方法可以缓解这个问题[151]、[217]。最后,深度网络通常很难训练[69],但该领域在更好的训练技术方面正在取得进展[196]。 概率图模型是另一种流行的表示构建方法,它通过使用潜在随机变量来构造表示[18]。在本节中,我们将描述如何使用概率图模型来表示单模态和多模态数据。 |
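As a rough illustration of this autoencoder-style pre-training (a simplified sketch without the stacking and denoising used in the cited works; all dimensions are placeholders), the fragment below encodes two modalities separately, fuses them into a shared code, and trains by reconstructing both inputs:

```python
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    """Two unimodal encoders, a shared multimodal code, two decoders."""
    def __init__(self, dim_a=512, dim_b=128, code_dim=64):
        super().__init__()
        self.enc_a = nn.Linear(dim_a, code_dim)
        self.enc_b = nn.Linear(dim_b, code_dim)
        self.fuse = nn.Linear(2 * code_dim, code_dim)   # shared multimodal code
        self.dec_a = nn.Linear(code_dim, dim_a)
        self.dec_b = nn.Linear(code_dim, dim_b)

    def forward(self, a, b):
        code = torch.relu(self.fuse(torch.cat(
            [torch.relu(self.enc_a(a)), torch.relu(self.enc_b(b))], dim=-1)))
        return self.dec_a(code), self.dec_b(code), code

model = BimodalAutoencoder()
a, b = torch.randn(8, 512), torch.randn(8, 128)
rec_a, rec_b, code = model(a, b)
# unsupervised pre-training signal: reconstruct both modalities from the shared code
loss = nn.functional.mse_loss(rec_a, a) + nn.functional.mse_loss(rec_b, b)
loss.backward()
```

After such pre-training, the shared code would typically be fine-tuned on the supervised task at hand, as discussed above.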
| The most popular approaches for graphical-model based representation are deep Boltzmann machines (DBM) [176], that stack restricted Boltzmann machines (RBM) [81] as building blocks. Similar to neural networks, each successive layer of a DBM is expected to represent the data at a higher level of abstraction. The appeal of DBMs comes from the fact that they do not need supervised data for training [176]. As they are graphical models the representation of data is probabilistic, however it is possible to convert them to a deterministic neural network — but this loses the generative aspect of the model [176]. Work by Srivastava and Salakhutdinov [197] introduced multimodal deep belief networks as a multimodal represen-tation. Kim et al. [104] used a deep belief network for each modality and then combined them into joint representation for audiovisual emotion recognition. Huang and Kingsbury [86] used a similar model for AVSR, and Wu et al. [225] for audio and skeleton joint based gesture recognition. Multimodal deep belief networks have been extended to multimodal DBMs by Srivastava and Salakhutdinov [198]. Multimodal DBMs are capable of learning joint represen-tations from multiple modalities by merging two or more undirected graphs using a binary layer of hidden units on top of them. They allow for the low level representations of each modality to influence each other after the joint training due to the undirected nature of the model. Ouyang et al. [156] explore the use of multimodal DBMs for the task of human pose estimation from multi-view data. They demonstrate that integrating the data at a later stage —after unimodal data underwent nonlinear transformations— was beneficial for the model. Similarly, Suk et al. [199] use multimodal DBM representation to perform Alzheimer’s disease classification from positron emission tomography and magnetic resonance imaging data. | 最流行的基于圖形模型的表示方法是深度玻爾茲曼機(jī)(DBM)[176],它將限制玻爾茲曼機(jī)(RBM)[81]堆疊為構(gòu)建塊。與神經(jīng)網(wǎng)絡(luò)類似,DBM的每一個后續(xù)層都被期望在更高的抽象級別上表示數(shù)據(jù)。DBMs的吸引力來自于這樣一個事實,即它們不需要監(jiān)督數(shù)據(jù)進(jìn)行訓(xùn)練[176]。由于它們是圖形模型,數(shù)據(jù)的表示是概率的,但是可以將它們轉(zhuǎn)換為確定性神經(jīng)網(wǎng)絡(luò)——但這失去了模型的生成方面[176]。 Srivastava和Salakhutdinov[197]的研究引入了多模態(tài)深度信念網(wǎng)絡(luò)作為多模態(tài)表征。Kim等人[104]對每個模態(tài)使用深度信念網(wǎng)絡(luò),然后將它們組合成聯(lián)合表征,用于視聽情感識別。Huang和Kingsbury[86]在AVSR中使用了類似的模型,Wu等[225]在基于音頻和骨骼關(guān)節(jié)的手勢識別中使用了類似的模型。 Srivastava和Salakhutdinov將多模態(tài)深度信念網(wǎng)絡(luò)擴(kuò)展到多模態(tài)DBMs[198]。多模態(tài)DBMs能夠通過在兩個或多個無向圖上使用隱藏單元的二進(jìn)制層來合并它們,從而從多個模態(tài)中學(xué)習(xí)聯(lián)合表示。由于模型的無方向性,它們允許每個模態(tài)的低層次表示在聯(lián)合訓(xùn)練后相互影響。 歐陽等人[156]探討了使用多模態(tài)DBMs完成從多視圖數(shù)據(jù)中估計人體姿態(tài)的任務(wù)。他們證明,在單模態(tài)數(shù)據(jù)經(jīng)過非線性轉(zhuǎn)換后的后期階段對數(shù)據(jù)進(jìn)行集成對模型是有益的。類似地,Suk等人[199]利用多模態(tài)DBM表示法,從正電子發(fā)射斷層掃描和磁共振成像數(shù)據(jù)中進(jìn)行阿爾茨海默病分類。 |
| One of the big advantages of using multimodal DBMs for learning multimodal representations is their generative nature, which allows for an easy way to deal with missing data — even if a whole modality is missing, the model has a natural way to cope. It can also be used to generate samples of one modality in the presence of the other one, or?both modalities from the representation. Similar to autoen-coders the representation can be trained in an unsupervised manner enabling the use of unlabeled data. The major disadvantage of DBMs is the difficulty of training them —high computational cost, and the need to use approximate variational training methods [198]. Sequential Representation. So far we have discussed mod-els that can represent fixed length data, however, we often need to represent varying length sequences such as sen-tences, videos, or audio streams. In this section we describe models that can be used to represent such sequences. | 使用多模態(tài)DBMs學(xué)習(xí)多模態(tài)表示的一大優(yōu)點(diǎn)是它們的生成特性,這允許使用一種簡單的方法來處理缺失的數(shù)據(jù)——即使整個模態(tài)都缺失了,模型也有一種自然的方法來處理。它還可以用于在存在另一種模態(tài)的情況下產(chǎn)生一種模態(tài)的樣本,或者從表示中產(chǎn)生兩種模態(tài)的樣本。與自動編碼器類似,表示可以以無監(jiān)督的方式進(jìn)行訓(xùn)練,以便使用未標(biāo)記的數(shù)據(jù)。DBMs的主要缺點(diǎn)是很難訓(xùn)練它們——計算成本高,而且需要使用近似變分訓(xùn)練方法[198]。 順序表示。到目前為止,我們已經(jīng)討論了可以表示固定長度數(shù)據(jù)的模型,但是,我們經(jīng)常需要表示不同長度的序列,例如句子、視頻或音頻流。在本節(jié)中,我們將描述可以用來表示這種序列的模型。 |
| Table 2: A summary of multimodal representation techniques. We identify three subtypes of joint representations (Section 3.1) and two subtypes of coordinated ones (Section 3.2). For modalities + indicates the modalities combined. | 表2:多模态表示技术的概述。我们确定了联合表示的三种子类型(3.1节)和协调表示的两种子类型(3.2节)。在“模态”一栏中,+表示所组合的模态。 |
| Recurrent neural networks (RNNs), and their variants such as long-short term memory (LSTM) networks [82], have recently gained popularity due to their success in sequence modeling across various tasks [12], [213]. So far RNNs have mostly been used to represent unimodal sequences of words, audio, or images, with most success in the language domain. Similar to traditional neural networks, the hidden state of an RNN can be seen as a representation of the data, i.e., the hidden state of the RNN at timestep t can be seen as the summarization of the sequence up to that timestep. This is especially apparent in RNN encoder-decoder frameworks where the task of an encoder is to represent a sequence in the hidden state of an RNN in such a way that a decoder could reconstruct it [12]. The use of RNN representations has not been limited to the unimodal domain. An early use of constructing a multimodal representation using RNNs comes from work by Cosi et al. [43] on AVSR. They have also been used for representing audio-visual data for affect recognition [37], [152] and to represent multi-view data such as different visual cues for human behavior analysis [166]. | 循环神经网络(RNN)及其变体,如长短期记忆(LSTM)网络[82],由于在各种任务的序列建模中取得成功[12]、[213],近年来越来越受欢迎。到目前为止,RNN主要用于表示单模态的单词、音频或图像序列,其中在语言领域最为成功。与传统神经网络类似,RNN的隐藏状态可以被看作数据的一种表示,即RNN在时间步t处的隐藏状态可以看作对截至该时间步的序列的总结。这在RNN编码器-解码器框架中尤为明显:编码器的任务是将序列表示在RNN的隐藏状态中,使得解码器能够据此重构该序列[12]。 RNN表示的使用并不局限于单模态领域。利用RNN构造多模态表示的早期工作来自Cosi等人[43]在AVSR上的研究。RNN还被用于表示视听数据以进行情感识别[37]、[152],以及表示多视图数据,例如用于人类行为分析的不同视觉线索[166]。 |
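A small sketch of the idea that an RNN hidden state summarizes a sequence (dimensions are arbitrary; this is not any specific cited model):

```python
import torch
import torch.nn as nn

# The final hidden state of an LSTM serves as a fixed-length representation
# of a variable-length sequence of frame-level features.
lstm = nn.LSTM(input_size=40, hidden_size=128, batch_first=True)

frames = torch.randn(2, 75, 40)        # 2 sequences, 75 timesteps, 40-dim features
outputs, (h_n, c_n) = lstm(frames)     # outputs: hidden state at every timestep
sequence_repr = h_n[-1]                # shape (2, 128): summary of each sequence

# in an encoder-decoder setup, this vector is what a decoder would reconstruct from
print(sequence_repr.shape)
```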
3.2 Coordinated Representations協(xié)調(diào)表示
| An alternative to a joint multimodal representation is a coor-dinated representation. Instead of projecting the modalities together into a joint space, we learn separate representations for each modality but coordinate them through a constraint. We start our discussion with coordinated representations that enforce similarity between representations, moving on to coordinated representations that enforce more structure on the resulting space (representative works of different coordinated representations can be seen in Table 2). Similarity models minimize the distance between modal-ities in the coordinated space. For example such models encourage the representation of the word dog and an image of a dog to have a smaller distance between them than distance between the word dog and an image of a car [61]. One of the earliest examples of such a representation comes from the work by Weston et al. [221], [222] on the WSABIE (web scale annotation by image embedding) model, where a coordinated space was constructed for images and their annotations. WSABIE constructs a simple linear map from image and textual features such that corresponding anno-tation and image representation would have a higher inner product (smaller cosine distance) between them than non-corresponding ones. | 聯(lián)合多模態(tài)表示的另一種選擇是協(xié)調(diào)表示。我們學(xué)習(xí)每個模態(tài)的單獨(dú)表示,但通過一個約束來協(xié)調(diào)它們,而不是將這些模態(tài)一起投影到關(guān)節(jié)空間中。我們從協(xié)調(diào)表示開始討論,協(xié)調(diào)表示強(qiáng)制表示之間的相似性,然后繼續(xù)討論在結(jié)果空間上強(qiáng)制更多結(jié)構(gòu)的協(xié)調(diào)表示(不同協(xié)調(diào)表示的代表作品見表2)。 相似模型最小化協(xié)調(diào)空間中各模態(tài)之間的距離。例如,這樣的模型鼓勵單詞dog和一只狗的圖像之間的距離比單詞dog和一輛汽車的圖像之間的距離更小[61]。這種表達(dá)最早的例子之一來自Weston等人[221],[222]在WSABIE(圖像嵌入的web尺度注釋)模型上的工作,其中為圖像及其注釋構(gòu)建了一個協(xié)調(diào)的空間。WSABIE從圖像和文本特征構(gòu)造了一個簡單的線性映射,這樣對應(yīng)的標(biāo)注和圖像表示就會比不對應(yīng)的標(biāo)注和圖像之間有更高的內(nèi)積(更小的余弦距離)。 |
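The WSABIE-style coordination constraint can be sketched as a hinge ranking loss over cosine similarities: a matched image-text pair should score higher, by a margin, than mismatched pairs within the batch. This is a generic illustration of the constraint (one ranking direction only), not the exact objective of any of the cited models.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Hinge ranking loss over cosine similarity (images ranked against all captions)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t()                          # (N, N) cosine similarities
    pos = sim.diag().unsqueeze(1)                # similarity of the matched pairs
    cost = (margin + sim - pos).clamp(min=0)     # margin violations by mismatched pairs
    cost = cost * (1 - torch.eye(sim.size(0)))   # ignore the diagonal (the positives)
    return cost.sum() / (sim.size(0) * (sim.size(0) - 1))

img_emb = torch.randn(16, 128, requires_grad=True)   # output of an image encoder
txt_emb = torch.randn(16, 128, requires_grad=True)   # output of a text encoder
loss = pairwise_ranking_loss(img_emb, txt_emb)
loss.backward()
```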
| More recently, neural networks have become a popular way to construct coordinated representations, due to their ability to learn representations. Their advantage lies in the fact that they can jointly learn coordinated representations in an end-to-end manner. An example of such coordinated representation is DeViSE — a deep visual-semantic embedding [61]. DeViSE uses a similar inner product and ranking loss function to WSABIE but uses more complex image and word embeddings. Kiros et al. [105] extended this to sentence and image coordinated representation by using an LSTM model and a pairwise ranking loss to coordinate the feature space. Socher et al. [191] tackle the same task, but extend the language model to a dependency tree RNN to incorporate compositional semantics. A similar model was also proposed by Pan et al. [159], but using videos instead of images. Xu et al. [231] also constructed a coordinated space between videos and sentences using a subject, verb, object compositional language model and a deep video model. This representation was then used for the task of cross-modal retrieval and video description. While the above models enforced similarity between representations, structured coordinated space models go beyond that and enforce additional constraints between the modality representations. The type of structure enforced is often based on the application, with different constraints for hashing, cross-modal retrieval, and image captioning. | 最近,由于神经网络具有学习表征的能力,它已成为构建协调表示的一种流行方式。其优势在于能够以端到端的方式联合学习协调表示。这种协调表示的一个例子是DeViSE——一种深度视觉-语义嵌入[61]。DeViSE使用与WSABIE类似的内积和排序损失函数,但使用了更复杂的图像和词嵌入。Kiros等人[105]使用LSTM模型和成对排序损失来协调特征空间,将其扩展到句子与图像的协调表示。Socher等人[191]处理相同的任务,但将语言模型扩展为依存树RNN,以纳入组合语义。Pan等人[159]也提出了类似的模型,但使用的是视频而不是图像。Xu等人[231]同样利用“主语、动词、宾语”组合语言模型和深度视频模型构建了视频与句子之间的协调空间,该表示随后被用于跨模态检索和视频描述任务。 虽然上述模型强制表示之间的相似性,但结构化协调空间模型更进一步,在各模态表示之间施加了额外的约束。所施加的结构类型通常取决于应用,哈希、跨模态检索和图像描述各有不同的约束。 |
| Structured coordinated spaces are commonly used in cross-modal hashing — compression of high dimensional data into compact binary codes with similar binary codes for similar objects [218]. The idea of cross-modal hashing is to create such codes for cross-modal retrieval [27], [93], [113]. Hashing enforces certain constraints on the resulting multimodal space: 1) it has to be an N-dimensional Hamming space — a binary representation with controllable number of bits; 2) the same object from different modalities has to have a similar hash code; 3) the space has to be similarity-preserving. Learning how to represent the data as a hash function attempts to enforce all of these three requirements [27], [113]. For example, Jiang and Li [92] introduced a method to learn such common binary space between sentence descriptions and corresponding images using end-to-end trainable deep learning techniques. While Cao et al. [32] extended the approach with a more complex LSTM sentence representation and introduced an outlier insensitive bit-wise margin loss and a relevance feedback based semantic similarity constraint. Similarly, Wang et al. [219] constructed a coordinated space in which images (and sentences) with similar meanings are closer to each other. | 结构化协调空间常用于跨模态哈希——将高维数据压缩为紧凑的二进制码,并使相似对象具有相似的二进制码[218]。跨模态哈希的思想是为跨模态检索创建这样的编码[27]、[93]、[113]。哈希对所得的多模态空间施加了一定的约束:1)它必须是一个N维汉明空间——一种位数可控的二进制表示;2)来自不同模态的同一对象必须具有相似的哈希码;3)该空间必须保持相似性。学习如何将数据表示为哈希函数,就是试图同时满足这三个要求[27]、[113]。例如,Jiang和Li[92]提出了一种方法,利用端到端可训练的深度学习技术,学习句子描述与相应图像之间的这种公共二值空间。Cao等人[32]则用更复杂的LSTM句子表示扩展了该方法,并引入了对离群值不敏感的按位间隔损失和基于相关反馈的语义相似性约束。类似地,Wang等人[219]构建了一个协调空间,使含义相似的图像(和句子)彼此更接近。 |
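A toy illustration of the Hamming-space constraints described above (with random projections standing in for the learned hash functions of the cited methods):

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_txt, n_bits = 512, 300, 32

# Placeholder projections; a real method would learn W_img and W_txt so that
# paired image/sentence items receive similar binary codes.
W_img = rng.normal(size=(d_img, n_bits))
W_txt = rng.normal(size=(d_txt, n_bits))

def hash_code(x, W):
    return (x @ W > 0).astype(np.uint8)        # code in the N-dimensional Hamming space

img_codes = hash_code(rng.normal(size=(1000, d_img)), W_img)   # database of images
query = hash_code(rng.normal(size=(1, d_txt)), W_txt)          # a text query

hamming = (img_codes != query).sum(axis=1)     # Hamming distance to every image code
print(hamming.argsort()[:5])                   # indices of the 5 closest images
```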
| Another example of a structured coordinated represen-tation comes from order-embeddings of images and lan-guage [212], [249]. The model proposed by Vendrov et al.[212] enforces a dissimilarity metric that is asymmetric and implements the notion of partial order in the multimodal space. The idea is to capture a partial order of the language and image representations — enforcing a hierarchy on the space; for example image of “a woman walking her dog“ → text “woman walking her dog” → text “woman walking”. A similar model using denotation graphs was also proposed by Young et al. [238] where denotation graphs are used to induce a partial ordering. Lastly, Zhang et al. present how exploiting structured representations of text and images can create concept taxonomies in an unsupervised manner [249]. A special case of a structured coordinated space is one based on canonical correlation analysis (CCA) [84]. CCA computes a linear projection which maximizes the correla-tion between two random variables (in our case modalities) and enforces orthogonality of the new space. CCA models have been used extensively for cross-modal retrieval [76],[106], [169] and audiovisual signal analysis [177], [187]. Extensions to CCA attempt to construct a correlation max-imizing nonlinear projection [7], [116]. Kernel canonical correlation analysis (KCCA) [116] uses reproducing kernel Hilbert spaces for projection. However, as the approach is nonparametric it scales poorly with the size of the training set and has issues with very large real-world datasets. Deep canonical correlation analysis (DCCA) [7] was introduced as an alternative to KCCA and addresses the scalability issue, it was also shown to lead to better correlated representation space. Similar correspondence autoencoder [58] and deep correspondence RBMs [57] have also been proposed for cross-modal retrieval. CCA, KCCA, and DCCA are unsupervised techniques and only optimize the correlation over the representations, thus mostly capturing what is shared across the modal-ities. Deep canonically correlated autoencoders [220] also include an autoencoder based data reconstruction term. This encourages the representation to also capture modal-ity specific information. Semantic correlation maximization method [248] also encourages semantic relevance, while retaining correlation maximization and orthogonality of the resulting space — this leads to a combination of CCA and cross-modal hashing techniques. 
| 另一个结构化协调表示的例子来自图像与语言的顺序嵌入(order-embeddings)[212]、[249]。Vendrov等人提出的模型[212]采用了一种非对称的不相似性度量,并在多模态空间中实现了偏序的概念,其思想是捕捉语言和图像表示的偏序关系——在该空间上强制构建层级结构;例如,“一个女人在遛狗”的图像→文本“女人在遛狗”→文本“女人在走路”。Young等人[238]也提出了一个类似的使用指称图(denotation graph)的模型,其中指称图被用来诱导偏序。最后,Zhang等人展示了如何利用文本和图像的结构化表示,以无监督的方式创建概念分类体系[249]。 结构化协调空间的一种特殊情形是基于典型相关分析(CCA)的空间[84]。CCA计算一种线性投影,使两个随机变量(在我们的场景中即两个模态)之间的相关性最大化,并保证新空间的正交性。CCA模型已被广泛用于跨模态检索[76]、[106]、[169]和视听信号分析[177]、[187]。CCA的扩展尝试构造最大化相关性的非线性投影[7]、[116]。核典型相关分析(KCCA)[116]使用再生核希尔伯特空间进行投影。然而,由于该方法是非参数的,它随训练集规模的扩展性很差,难以处理非常大的真实世界数据集。深度典型相关分析(DCCA)[7]作为KCCA的替代方案被提出,解决了可扩展性问题,并且被证明能得到相关性更好的表示空间。类似的对应自动编码器[58]和深度对应RBM[57]也被提出用于跨模态检索。 CCA、KCCA和DCCA都是无监督技术,只优化表示之间的相关性,因此主要捕获跨模态共享的信息。深度典型相关自动编码器[220]还包含基于自动编码器的数据重构项,这促使表示同时捕获各模态特有的信息。语义相关性最大化方法[248]在保留相关性最大化和结果空间正交性的同时,还鼓励语义相关性——这相当于CCA与跨模态哈希技术的结合。 |
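For reference, linear CCA itself is available off the shelf; the sketch below (with synthetic data sharing a latent factor, all sizes made up) shows the correlation-maximizing projection that KCCA and DCCA generalize:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 4))                  # shared factor between the "modalities"
X = latent @ rng.normal(size=(4, 50)) + 0.5 * rng.normal(size=(500, 50))  # e.g. visual features
Y = latent @ rng.normal(size=(4, 30)) + 0.5 * rng.normal(size=(500, 30))  # e.g. textual features

cca = CCA(n_components=4)
cca.fit(X, Y)
X_c, Y_c = cca.transform(X, Y)                      # coordinated (maximally correlated) projections

# correlation of each pair of canonical components
for i in range(4):
    print(round(float(np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1]), 3))
```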
3.3 Discussion討論
| In this section we identified two major types of multimodal representations — joint and coordinated. Joint representations project multimodal data into a common space and are best suited for situations when all of the modalities are present during inference. They have been extensively used for AVSR, affect, and multimodal gesture recognition. Coordinated representations, on the other hand, project each modality into a separate but coordinated space, making them suitable for applications where only one modality is present at test time, such as: multimodal retrieval and translation (Section 4), grounding (Section 7.2), and zero shot learning (Section 7.2). Finally, while joint representations have been used in situations to construct representations of more than two modalities, coordinated spaces have, so far, been mostly limited to two modalities. | 在本节中,我们确定了两种主要的多模态表示类型——联合表示和协调表示。联合表示将多模态数据投射到公共空间中,最适合推理时所有模态都存在的情形,已被广泛用于AVSR、情感识别和多模态手势识别。协调表示则将每个模态投射到各自独立但相互协调的空间中,因此适合测试时只有一个模态存在的应用,例如:多模态检索与翻译(第4节)、概念接地(7.2节)和零样本学习(7.2节)。最后,虽然联合表示已被用于构建两种以上模态的表示,但到目前为止,协调空间大多仍局限于两种模态。 |
| Table 3: Taxonomy of multimodal translation research. For each class and sub-class, we include example tasks with references. Our taxonomy also includes the directionality of the translation: unidirectional (⇒) and bidirectional (⇔). | 表3:多模态翻译研究的分类。对于每个类别及其子类别,我们都给出了带参考文献的示例任务。我们的分类还包括翻译的方向性:单向(⇒)和双向(⇔)。 |
4 Translation翻譯
| A big part of multimodal machine learning is concerned with translating (mapping) from one modality to another. Given an entity in one modality the task is to generate the same entity in a different modality. For example given an image we might want to generate a sentence describing it or given a textual description generate an image matching it. Multimodal translation is a long studied problem, with early work in speech synthesis [88], visual speech generation [136] video description [107], and cross-modal retrieval [169]. More recently, multimodal translation has seen renewed interest due to combined efforts of the computer vision and natural language processing (NLP) communities [19] and recent availability of large multimodal datasets [38], [205]. A particularly popular problem is visual scene description, also known as image [214] and video captioning [213], which acts as a great test bed for a number of computer vision and NLP problems. To solve it, we not only need to fully understand the visual scene and to identify its salient parts, but also to produce grammatically correct and comprehensive yet concise sentences describing it. | 多模態(tài)機(jī)器學(xué)習(xí)的很大一部分是關(guān)于從一種模態(tài)到另一種模態(tài)的翻譯(映射)。給定一個以一種形態(tài)存在的實體,任務(wù)是在不同形態(tài)中生成相同的實體。例如,給定一幅圖像,我們可能想要生成一個描述它的句子,或者給定一個文本描述生成與之匹配的圖像。多模態(tài)翻譯是一個長期研究的問題,早期的工作包括語音合成[88]、視覺語音生成[136]、視頻描述[107]和跨模態(tài)檢索[169]。 最近,由于計算機(jī)視覺和自然語言處理(NLP)社區(qū)[19]和最近可用的大型多模態(tài)數(shù)據(jù)集[38]的共同努力,多模態(tài)翻譯又引起了人們的興趣[205]。一個特別流行的問題是視覺場景描述,也被稱為圖像[214]和視頻字幕[213],它是許多計算機(jī)視覺和NLP問題的一個很好的測試平臺。要解決這一問題,我們不僅需要充分理解視覺場景,識別視覺場景的突出部分,還需要生成語法正確、全面而簡潔的描述視覺場景的句子。 |
| While the approaches to multimodal translation are very broad and are often modality specific, they share a number of unifying factors. We categorize them into two types —example-based, and generative. Example-based models use a dictionary when translating between the modalities. Genera-tive models, on the other hand, construct a model that is able to produce a translation. This distinction is similar to the one between non-parametric and parametric machine learning approaches and is illustrated in Figure 2, with representative examples summarized in Table 3. Generative models are arguably more challenging to build as they require the ability to generate signals or sequences of symbols (e.g., sentences). This is difficult for any modality — visual, acoustic, or verbal, especially when temporally and structurally consistent sequences need to be generated. This led to many of the early multimodal transla-tion systems relying on example-based translation. However, this has been changing with the advent of deep learning models that are capable of generating images [171], [210], sounds [157], [209], and text [12]. | 盡管多模態(tài)翻譯的方法非常廣泛,而且往往是針對特定的模態(tài),但它們有許多共同的因素。我們將它們分為兩種類型——基于實例的和生成的。在模態(tài)之間轉(zhuǎn)換時,基于實例的模型使用字典。另一方面,生成模型構(gòu)建的是能夠生成翻譯的模型。這種區(qū)別類似于非參數(shù)機(jī)器學(xué)習(xí)方法和參數(shù)機(jī)器學(xué)習(xí)方法之間的區(qū)別,如圖2所示,表3總結(jié)了具有代表性的例子。 生成模型的構(gòu)建更具挑戰(zhàn)性,因為它們需要生成信號或符號序列(如句子)的能力。這對于任何形式(視覺的、聽覺的或口頭的)都是困難的,特別是當(dāng)需要生成時間和結(jié)構(gòu)一致的序列時。這導(dǎo)致了許多早期的多模態(tài)翻譯系統(tǒng)依賴于實例翻譯。然而, 隨著能夠生成圖像[171]、[210]、聲音[157]、[209]和文本[12]的深度學(xué)習(xí)模型的出現(xiàn),這種情況已經(jīng)有所改變。 |
| Figure 2: Overview of example-based and generative multimodal translation. The former retrieves the best translation from a dictionary, while the latter first trains a translation model on the dictionary and then uses that model for translation. | 图2:基于实例和生成式多模态翻译的概述。前者从字典中检索最佳的翻译,而后者首先在字典上训练翻译模型,然后使用该模型进行翻译。 |
4.1 Example-based基于实例
| Example-based algorithms are restricted by their training data — dictionary (see Figure 2a). We identify two types of such algorithms: retrieval based, and combination based. Retrieval-based models directly use the retrieved translation without modifying it, while combination-based models rely on more complex rules to create translations based on a number of retrieved instances. Retrieval-based models are arguably the simplest form of multimodal translation. They rely on finding the closest sample in the dictionary and using that as the translated result. The retrieval can be done in unimodal space or intermediate semantic space. Given a source modality instance to be translated, unimodal retrieval finds the closest instances in the dictionary in the space of the source — for example, visual feature space for images. Such approaches have been used for visual speech synthesis, by retrieving the closest matching visual example of the desired phoneme [26]. They have also been used in concatenative text-to-speech systems [88]. More recently, Ordonez et al. [155] used unimodal retrieval to generate image descriptions by using global image features to retrieve caption candidates [155]. Yagcioglu et al. [232] used a CNN-based image representation to retrieve visually similar images using adaptive neighborhood selection. Devlin et al. [49] demonstrated that a simple k-nearest neighbor retrieval with consensus caption selection achieves competitive translation results when compared to more complex generative approaches. The advantage of such unimodal retrieval approaches is that they only require the representation of a single modality through which we are performing retrieval. However, they often require an extra processing step such as re-ranking of retrieved translations [135], [155], [232]. This indicates a major problem with this approach — similarity in unimodal space does not always imply a good translation. | 基于实例的算法受限于其训练数据,即字典(见图2a)。我们将这类算法分为两种:基于检索的和基于组合的。基于检索的模型直接使用检索到的翻译而不加修改,而基于组合的模型依赖更复杂的规则,基于多个检索到的实例来构建翻译。 基于检索的模型可以说是最简单的多模态翻译形式。它们依赖于在字典中找到最接近的样本,并将其作为翻译结果。检索可以在单模态空间或中间语义空间中进行。 给定要翻译的源模态实例,单模态检索在源空间(例如图像的视觉特征空间)中寻找字典里最接近的实例。这类方法已被用于视觉语音合成:检索与目标音素最匹配的视觉示例[26]。它们也被用于拼接式文本转语音系统[88]。最近,Ordonez等人[155]使用单模态检索,通过全局图像特征检索候选描述来生成图像描述[155]。Yagcioglu等人[232]使用基于CNN的图像表示,结合自适应邻域选择来检索视觉上相似的图像。Devlin等人[49]证明,与更复杂的生成方法相比,带有一致性描述选择的简单k近邻检索可以获得有竞争力的翻译结果。这种单模态检索方法的优点是,它们只需要我们进行检索所用的那个单一模态的表示。然而,它们通常需要额外的处理步骤,例如对检索到的翻译重新排序[135]、[155]、[232]。这也反映了这种方法的一个主要问题——单模态空间中的相似并不总是意味着好的翻译。 |
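A minimal sketch of unimodal retrieval-based translation (image-to-caption), with random vectors standing in for real CNN features and a toy caption dictionary:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# The "dictionary": paired training examples of image features and captions.
image_feats = rng.normal(size=(1000, 512))                      # stand-in CNN features
captions = [f"caption of training image {i}" for i in range(1000)]

index = NearestNeighbors(n_neighbors=3).fit(image_feats)

query_feat = rng.normal(size=(1, 512))          # features of the image to describe
dist, idx = index.kneighbors(query_feat)

# retrieval-based translation: reuse (or later re-rank) the captions of the
# nearest neighbours found in the unimodal feature space
print([captions[i] for i in idx[0]])
```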
| An alternative is to use an intermediate semantic space for similarity comparison during retrieval. An early ex-ample of a hand crafted semantic space is one used by?Farhadi et al. [56]. They map both sentences and images to a space of object, action, scene, retrieval of relevant caption to an image is then performed in that space. In contrast to hand-crafting a representation, Socher et al. [191] learn a coordinated representation of sentences and CNN visual features (see Section 3.2 for description of coordinated spaces). They use the model for both translating from text to images and from images to text. Similarly, Xu et al. [231] used a coordinated space of videos and their descriptions for cross-modal retrieval. Jiang and Li [93] and Cao et al. [32] use cross-modal hashing to perform multimodal translation from images to sentences and back, while Ho-dosh et al. [83] use a multimodal KCCA space for image-sentence retrieval. Instead of aligning images and sentences globally in a common space, Karpathy et al. [99] propose a multimodal similarity metric that internally aligns image fragments (visual objects) together with sentence fragments (dependency tree relations). | 另一種方法是在檢索過程中使用中間語義空間進(jìn)行相似度比較。Farhadi等人使用的手工語義空間是一個早期的例子。它們將句子和圖像映射到對象、動作、場景的空間中,然后在該空間中檢索圖像的相關(guān)標(biāo)題。與手工制作表征不同,Socher等人[191]學(xué)習(xí)句子和CNN視覺特征的協(xié)調(diào)表征(關(guān)于協(xié)調(diào)空間的描述,請參見章節(jié)3.2)。他們將該模型用于從文本到圖像和從圖像到文本的轉(zhuǎn)換。類似地,Xu等人[231]使用視頻及其描述的協(xié)調(diào)空間進(jìn)行跨模態(tài)檢索。Jiang和Li[93]、Cao等使用跨模態(tài)哈希進(jìn)行圖像到句子的多模態(tài)轉(zhuǎn)換,Ho-dosh等[83]使用多模態(tài)KCCA空間進(jìn)行圖像-句子檢索。Karpathy等人[99]提出了一種多模態(tài)相似性度量方法,該方法將圖像片段(視覺對象)與句子片段(依賴樹關(guān)系)內(nèi)部對齊,而不是將圖像和句子整體對齊到一個公共空間中。 |
| Retrieval approaches in semantic space tend to perform better than their unimodal counterparts as they are retrieving examples in a more meaningful space that reflects both modalities and that is often optimized for retrieval. Furthermore, they allow for bi-directional translation, which is not straightforward with unimodal methods. However, they require manual construction or learning of such a semantic space, which often relies on the existence of large training dictionaries (datasets of paired samples). Combination-based models take the retrieval based approaches one step further. Instead of just retrieving examples from the dictionary, they combine them in a meaningful way to construct a better translation. Combination based media description approaches are motivated by the fact that sentence descriptions of images share a common and simple structure that could be exploited. Most often the rules for combinations are hand crafted or based on heuristics. Kuznetsova et al. [114] first retrieve phrases that describe visually similar images and then combine them to generate novel descriptions of the query image by using Integer Linear Programming with a number of hand crafted rules. Gupta et al. [74] first find k images most similar to the source image, and then use the phrases extracted from their captions to generate a target sentence. Lebret et al. [119] use a CNN-based image representation to infer phrases that describe it. The predicted phrases are then combined using a trigram constrained language model. A big problem facing example-based approaches for translation is that the model is the entire dictionary — making the model large and inference slow (although, optimizations such as hashing alleviate this problem). Another issue facing example-based translation is that it is unrealistic to expect that a single comprehensive and accurate translation relevant to the source example will always exist in the dictionary — unless the task is simple or the dictionary is very large. This is partly addressed by combination models that are able to construct more complex structures. However, they are only able to perform translation in one direction, while semantic space retrieval-based models are able to perform it both ways. | 语义空间中的检索方法往往比单模态检索方法表现更好,因为它们是在一个更有意义的空间中检索示例,该空间同时反映两种模态,并且通常针对检索进行了优化。此外,它们允许双向翻译,而这对于单模态方法来说并不容易实现。然而,它们需要手工构建或学习这样的语义空间,而这通常依赖于大型训练字典(成对样本的数据集)的存在。 基于组合的模型将基于检索的方法又向前推进了一步。它们不只是从字典中检索示例,而是以一种有意义的方式将它们组合起来,从而构建出更好的翻译。基于组合的媒体描述方法的出发点是:图像的句子描述具有可被利用的共同而简单的结构。组合规则通常是手工制定的或基于启发式的。 Kuznetsova等人[114]首先检索描述视觉上相似图像的短语,然后利用带有若干手工规则的整数线性规划将它们组合起来,生成查询图像的新描述。Gupta等人[74]首先找到与源图像最相似的k张图像,然后使用从这些图像的描述中提取的短语来生成目标句子。Lebret等人[119]使用基于CNN的图像表示来推断描述该图像的短语,然后使用三元文法(trigram)约束的语言模型将预测出的短语组合起来。 基于实例的翻译方法面临的一个大问题是,模型就是整个字典——这使得模型庞大且推理缓慢(尽管哈希等优化可以缓解这一问题)。基于实例的翻译面临的另一个问题是,除非任务很简单或字典非常大,否则期望字典中总是存在与源示例相关的单个全面而准确的翻译是不现实的。这可以由能够构建更复杂结构的组合模型部分解决。然而,组合模型只能沿一个方向进行翻译,而基于语义空间检索的模型则可以双向进行。 |
4.2 Generative approaches生成方法
| Generative approaches to multimodal translation construct models that can perform multimodal translation given a unimodal source instance. It is a challenging problem as it requires the ability to both understand the source modality and to generate the target sequence or signal. As discussed in the following section, this also makes such methods much more difficult to evaluate, due to large space of possible correct answers. In this survey we focus on the generation of three modal-ities: language, vision, and sound. Language generation has been explored for a long time [170], with a lot of recent attention for tasks such as image and video description [19]. Speech and sound generation has also seen a lot of work with a number of historical [88] and modern approaches [157], [209]. Photo-realistic image generation has been less explored, and is still in early stages [132], [171], however, there have been a number of attempts at generating abstract scenes [253], computer graphics [45], and talking heads [6]. | 多模態(tài)翻譯的生成方法可以在給定單模態(tài)源實例的情況下構(gòu)建能夠執(zhí)行多模態(tài)翻譯的模型。這是一個具有挑戰(zhàn)性的問題,因為它要求既能理解源模態(tài),又能生成目標(biāo)序列或信號。正如下一節(jié)所討論的,這也使得這些方法更難評估,因為可能的正確答案空間很大。 在這個調(diào)查中,我們關(guān)注三種模態(tài)的生成:語言、視覺和聲音。語言生成已經(jīng)被探索了很長一段時間[170],最近很多人關(guān)注的是圖像和視頻描述[19]等任務(wù)。語音和聲音生成也見證了許多歷史[88]和現(xiàn)代方法[157]、[209]的大量工作。真實感圖像生成的研究較少,仍處于早期階段[132],[171],然而,在生成抽象場景[253]、計算機(jī)圖形學(xué)[45]和會說話的頭[6]方面已經(jīng)有了一些嘗試。 |
| We identify three broad categories of generative models: grammar-based, encoder-decoder, and continuous generation models. Grammar based models simplify the task by restricting the target domain by using a grammar, e.g., by generating restricted sentences based on a ⟨subject, object, verb⟩ template. Encoder-decoder models first encode the source modality to a latent representation which is then used by a decoder to generate the target modality. Continuous generation models generate the target modality continuously based on a stream of source modality inputs and are most suited for translating between temporal sequences — such as text-to-speech. Grammar-based models rely on a pre-defined grammar for generating a particular modality. They start by detecting high level concepts from the source modality, such as objects in images and actions from videos. These detections are then incorporated together with a generation procedure based on a pre-defined grammar to result in a target modality. Kojima et al. [107] proposed a system to describe human behavior in a video using the detected position of the person’s head and hands and rule based natural language generation that incorporates a hierarchy of concepts and actions. Barbu et al. [14] proposed a video description model that generates sentences of the form: who did what to whom and where and how they did it. The system was based on handcrafted object and event classifiers and used a restricted grammar suitable for the task. Guadarrama et al. [73] predict ⟨subject, verb, object⟩ triplets describing a video using semantic hierarchies that use more general words in case of uncertainty. Together with a language model their approach allows for translation of verbs and nouns not seen in the dictionary. | 我們確定了生成模型的三大類:基于語法的、編碼器-解碼器和連續(xù)生成模型。基于語法的模型通過使用語法限制目標(biāo)領(lǐng)域來簡化任務(wù),例如,通過基于⟨主語、賓語、動詞⟩模板生成受限的句子。編碼器-解碼器模型首先將源模態(tài)編碼為一個潛在的表示,然后由解碼器使用它來生成目標(biāo)模態(tài)。連續(xù)生成模型基于源模態(tài)輸入流連續(xù)地生成目標(biāo)模態(tài),最適合于時間序列之間的轉(zhuǎn)換——比如文本到語音。 基于語法的模型依賴于預(yù)定義的語法來生成特定的模態(tài)。他們首先從源模態(tài)檢測高級概念,如圖像中的對象和視頻中的動作。然后將這些檢測與基于預(yù)定義語法的生成過程合并在一起,以產(chǎn)生目標(biāo)模態(tài)。 Kojima等人[107]提出了一種系統(tǒng),利用檢測到的人的頭和手的位置,以及基于規(guī)則的自然語言生成(包含概念和行為的層次),來描述視頻中的人類行為。Barbu et al.[14]提出了一個視頻描述模型,該模型生成如下形式的句子:誰對誰做了什么,在哪里以及他們是如何做的。該系統(tǒng)基于手工制作的對象和事件分類器,并使用了適合該任務(wù)的限制性語法。Guadarrama等人[73]使用語義層次結(jié)構(gòu)預(yù)測描述視頻的⟨主語、動詞、賓語⟩三元組,在不確定時使用更一般的詞匯。與語言模型一起,他們的方法允許翻譯字典中沒有的動詞和名詞。 |
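As an illustration of the grammar-based idea (not a reimplementation of any cited system), the toy sketch below slots hypothetical detector outputs into a restricted ⟨subject, verb, object⟩ template:

```python
# Hedged sketch of grammar/template-based generation: scored concept
# detections from a (hypothetical) visual pipeline fill a fixed template.
def generate_from_template(detections):
    """detections: dict mapping each slot to a list of (concept, score) pairs."""
    subject = max(detections["subject"], key=lambda d: d[1])[0]
    verb = max(detections["verb"], key=lambda d: d[1])[0]
    obj = max(detections["object"], key=lambda d: d[1])[0]
    return f"A {subject} is {verb} a {obj}."   # restricted <subject, verb, object> grammar

detections = {
    "subject": [("person", 0.92), ("dog", 0.41)],
    "verb":    [("riding", 0.77), ("holding", 0.30)],
    "object":  [("bicycle", 0.85), ("ball", 0.22)],
}
print(generate_from_template(detections))   # -> "A person is riding a bicycle."
```

Real grammar-based systems replace the single hard-coded template with richer grammars, language models, or graphical models, as the following paragraphs describe.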
| To describe images, Yao et al. [235] propose to use an and-or graph-based model together with domain-specific lexicalized grammar rules, a targeted visual representation scheme, and a hierarchical knowledge ontology. Li et al. [121] first detect objects, visual attributes, and spatial relationships between objects. They then use an n-gram language model on the visually extracted phrases to generate ⟨subject, preposition, object⟩ style sentences. Mitchell et al. [142] use a more sophisticated tree-based language model to generate syntactic trees instead of filling in templates, leading to more diverse descriptions. A majority of approaches represent the whole image jointly as a bag of visual objects without capturing their spatial and semantic relationships. To address this, Elliott et al. [51] propose to explicitly model proximity relationships of objects for image description generation. Some grammar-based approaches rely on graphical models to generate the target modality. An example includes BabyTalk [112], which given an image generates ⟨object, preposition, object⟩ triplets that are used together with a conditional random field to construct the sentences. Yang et al. [233] predict a set of ⟨noun, verb, scene, preposition⟩ candidates using visual features extracted from an image and combine them into a sentence using a statistical language model and hidden Markov model style inference. A similar approach has been proposed by Thomason et al. [204], where a factor graph model is used for video description of the form ⟨subject, verb, object, place⟩. The factor model exploits language statistics to deal with noisy visual representations. Going the other way, Zitnick et al. [253] propose to use conditional random fields to generate abstract visual scenes based on language triplets extracted from sentences. | 為了描述圖像,Yao等人[235]提出使用基于與或圖(and-or graph)的模型,以及特定領(lǐng)域的詞匯化語法規(guī)則、有針對性的視覺表示方案和層次知識本體。Li等人[121]首先檢測對象、視覺屬性和對象之間的空間關(guān)系。然后,他們在視覺提取的短語上使用一個n-gram語言模型,生成⟨主語、介詞、賓語⟩式的句子。Mitchell等人[142]使用更復(fù)雜的基于樹的語言模型來生成語法樹,而不是填充模板,從而產(chǎn)生更多樣化的描述。大多數(shù)方法將整個圖像共同表示為一袋視覺對象,而沒有捕捉它們的空間和語義關(guān)系。為了解決這個問題,Elliott et al.[51]提出明確地建模物體的接近關(guān)系,以生成圖像描述。 一些基于語法的方法依賴于圖形模型來生成目標(biāo)模態(tài)。一個例子是BabyTalk[112],它在給定圖像的情況下生成⟨對象、介詞、對象⟩三元組,這些三元組與條件隨機(jī)場一起用來構(gòu)造句子。Yang等人[233]利用從圖像中提取的視覺特征預(yù)測一組⟨名詞、動詞、場景、介詞⟩候選,并使用統(tǒng)計語言模型和隱馬爾可夫模型風(fēng)格推理將它們組合成一個句子。Thomason等人也提出了類似的方法[204],其中一個因子圖模型用于⟨subject, verb, object, place⟩形式的視頻描述。因子模型利用語言統(tǒng)計來處理嘈雜的視覺表示。Zitnick等人[253]則提出利用條件隨機(jī)場從句子中提取語言三元組,生成抽象視覺場景。 |
| An advantage of grammar-based methods is that they are more likely to generate syntactically (in case of language) or logically correct target instances as they use predefined templates and restricted grammars. However, this limits them to producing formulaic rather than creative translations. Furthermore, grammar-based methods rely on complex pipelines for concept detection, with each concept requiring a separate model and a separate training dataset. Encoder-decoder models based on end-to-end trained neural networks are currently some of the most popular techniques for multimodal translation. The main idea behind the model is to first encode a source modality into a vectorial representation and then to use a decoder module to generate the target modality, all this in a single pass pipeline. Although first used for machine translation [97], such models have been successfully used for image captioning [134], [214], and video description [174], [213]. So far, encoder-decoder models have been mostly used to generate text, but they can also be used to generate images [132], [171], and for continuous generation of speech and sound [157], [209]. The first step of the encoder-decoder model is to encode the source object; this is done in a modality-specific way. Popular models to encode acoustic signals include RNNs [35] and DBNs [79]. Most of the work on encoding words and sentences uses distributional semantics [141] and variants of RNNs [12]. Images are most often encoded using convolutional neural networks (CNN) [109], [185]. While learned CNN representations are common for encoding images, this is not the case for videos where hand-crafted features are still commonly used [174], [204]. While it is possible to use unimodal representations to encode the source modality, it has been shown that using a coordinated space (see Section 3.2) leads to better results [105], [159], [231]. | 基于語法的方法的一個優(yōu)點(diǎn)是,當(dāng)它們使用預(yù)定義模板和受限制的語法時,它們更有可能生成語法上(對于語言)或邏輯上正確的目標(biāo)實例。然而,這限制了他們只能寫出公式化的翻譯,而不是創(chuàng)造性的翻譯。此外,基于語法的方法依賴于復(fù)雜的管道進(jìn)行概念檢測,每個概念都需要一個單獨(dú)的模型和一個單獨(dú)的訓(xùn)練數(shù)據(jù)集。基于端到端訓(xùn)練神經(jīng)網(wǎng)絡(luò)的編解碼模型是目前最流行的多模態(tài)翻譯技術(shù)之一。該模型背后的主要思想是,首先將源模態(tài)編碼為矢量表示,然后使用解碼器模塊生成目標(biāo)模態(tài),所有這些都在一個單遍(single pass)流水線中完成。雖然該模型最初用于機(jī)器翻譯[97],但已成功應(yīng)用于圖像字幕[134]、[214]和視頻描述[174]、[213]。到目前為止,編碼器-解碼器模型大多用于生成文本,但它們也可以用于生成圖像[132]、[171],以及語音和聲音的連續(xù)生成[157]、[209]。 編碼器-解碼器模型的第一步是對源對象進(jìn)行編碼,這是以特定于模態(tài)的方式完成的。常用的聲學(xué)信號編碼模型包括RNNs [35]和DBNs [79]。大多數(shù)關(guān)于單詞和句子編碼的研究使用了分布語義[141]和RNNs的變體[12]。圖像通常使用卷積神經(jīng)網(wǎng)絡(luò)(CNN)進(jìn)行編碼[109],[185]。雖然學(xué)習(xí)得到的CNN表示通常用于編碼圖像,但視頻并非如此,手工制作的特征在視頻編碼中仍然很常用[174],[204]。雖然可以使用單模態(tài)表示對源模態(tài)進(jìn)行編碼,但已經(jīng)證明使用協(xié)調(diào)空間(見3.2節(jié))可以得到更好的結(jié)果[105]、[159]、[231]。 |
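A minimal encoder-decoder sketch in PyTorch may help make this pipeline concrete. It assumes pooled CNN features are already available; the dimensions, vocabulary size, and single-layer LSTM decoder are illustrative choices rather than the architecture of any cited model.

```python
# Hedged sketch: encode an image feature vector, decode a caption with an LSTM.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)   # map image feature to initial hidden state
        self.init_c = nn.Linear(feat_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, captions):
        # image_feat: (B, feat_dim) pooled features from any visual encoder (e.g., a CNN)
        # captions:   (B, T) token ids of the target sentence
        h0 = self.init_h(image_feat).unsqueeze(0)       # (1, B, hidden)
        c0 = self.init_c(image_feat).unsqueeze(0)
        emb = self.embed(captions)                      # (B, T, embed)
        out, _ = self.lstm(emb, (h0, c0))
        return self.out(out)                            # (B, T, vocab) word logits

# toy usage: train with cross-entropy between logits and the shifted caption tokens
model = CaptionDecoder()
feats = torch.randn(4, 2048)
caps = torch.randint(0, 10000, (4, 12))
logits = model(feats, caps)
```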
| Decoding is most often performed by an RNN or an LSTM using the encoded representation as the initial hidden state [54], [132], [214], [215]. A number of extensions have been proposed to traditional LSTM models to aid in the task of translation. A guide vector could be used to tightly couple the solutions in the image input [91]. Venugopalan et al.[213] demonstrate that it is beneficial to pre-train a decoder LSTM for image captioning before fine-tuning it to video description. Rohrbach et al. [174] explore the use of various LSTM architectures (single layer, multilayer, factored) and a number of training and regularization techniques for the task of video description. A problem facing translation generation using an RNN is that the model has to generate a description from a single vectorial representation of the image, sentence, or video. This becomes especially difficult when generating long sequences as these models tend to forget the initial input. This has been partly addressed by neural attention models (see Section 5.2) that allow the network to focus on certain parts of an image [230], sentence [12], or video [236] during generation. Generative attention-based RNNs have also been used for the task of generating images from sentences [132], while the results are still far from photo-realistic they show a lot of promise. More recently, a large amount of progress has been made in generating images using generative adversarial networks [71], which have been used as an alternative to RNNs for image generation from text [171]. | 解碼通常由RNN或LSTM執(zhí)行,使用編碼表示作為初始隱藏狀態(tài)[54],[132],[214],[215]。人們對傳統(tǒng)的LSTM模型進(jìn)行了大量的擴(kuò)展,以幫助完成翻譯任務(wù)。一個引導(dǎo)向量可以用來緊耦合圖像輸入中的解[91]。Venugopalan等人[213]證明,在將解碼器LSTM微調(diào)為視頻描述之前,對圖像字幕進(jìn)行預(yù)訓(xùn)練是有益的。Rohrbach等人[174]探討了在視頻描述任務(wù)中使用各種LSTM架構(gòu)(單層、多層、因子)和多種訓(xùn)練和正則化技術(shù)。 使用RNN進(jìn)行翻譯生成面臨的一個問題是,模型必須從圖像、句子或視頻的單個矢量表示生成描述。這在生成長序列時變得特別困難,因為這些模型往往會忘記最初的輸入。神經(jīng)注意力模型已經(jīng)部分解決了這一問題(見5.2節(jié)),神經(jīng)注意力模型允許網(wǎng)絡(luò)在生成時聚焦于圖像[230]、句子[12]或視頻[236]的某些部分。 基于生成注意力的神經(jīng)網(wǎng)絡(luò)也被用于從句子中生成圖像的任務(wù)[132],盡管其結(jié)果還遠(yuǎn)遠(yuǎn)不夠逼真,但它們顯示出了很大的希望。最近,在使用生成對抗網(wǎng)絡(luò)生成圖像方面取得了大量進(jìn)展[71],生成對抗網(wǎng)絡(luò)已被用于替代rnn從文本生成圖像[171]。 |
| While neural network based encoder-decoder systems have been very successful they still face a number of issues. Devlin et al. [49] suggest that it is possible that the network is memorizing the training data rather than learning how to understand the visual scene and generate it. This is based on the observation that k-nearest neighbor models perform very similarly to those based on generation. Furthermore, such models often require large quantities of data for train-ing. Continuous generation models are intended for sequence translation and produce outputs at every timestep in an online manner. These models are useful when translating from a sequence to a sequence such as text to speech, speech to text, and video to text. A number of different techniques have been proposed for such modeling — graphical models, continuous encoder-decoder approaches, and various other regression or classification techniques. The extra difficulty that needs to be tackled by these models is the requirement of temporal consistency between modalities. A lot of early work on sequence to sequence transla-tion used graphical or latent variable models. Deena and Galata [47] proposed to use a shared Gaussian process latent?variable model for audio-based visual speech synthesis. The model creates a shared latent space between audio and vi-sual features that can be used to generate one space from the other, while enforcing temporal consistency of visual speech at different timesteps. Hidden Markov models (HMM) have also been used for visual speech generation [203] and text-to-speech [245] tasks. They have also been extended to use cluster adaptive training to allow for training on multiple speakers, languages, and emotions allowing for more con-trol when generating speech signal [244] or visual speech parameters [6]. | 雖然基于神經(jīng)網(wǎng)絡(luò)的編碼器-解碼器系統(tǒng)已經(jīng)非常成功,但它們?nèi)匀幻媾R一些問題。Devlin et al.[49]認(rèn)為,網(wǎng)絡(luò)可能是在記憶訓(xùn)練數(shù)據(jù),而不是學(xué)習(xí)如何理解視覺場景并生成它。這是基于k近鄰模型與基于生成的模型非常相似的觀察得出的。此外,這種模型通常需要大量的數(shù)據(jù)進(jìn)行訓(xùn)練。 連續(xù)生成模型用于序列轉(zhuǎn)換,并以在線方式在每個時間步中產(chǎn)生輸出。這些模型在將一個序列轉(zhuǎn)換為另一個序列時非常有用,比如文本到語音、語音到文本和視頻到文本。為這種建模提出了許多不同的技術(shù)——圖形模型、連續(xù)編碼器-解碼器方法,以及各種其他回歸或分類技術(shù)。這些模型需要解決的額外困難是對模態(tài)之間時間一致性的要求。 許多早期的序列到序列轉(zhuǎn)換的工作使用圖形或潛在變量模型。Deena和Galata[47]提出了一種共享高斯過程潛變量模型用于基于音頻的可視語音合成。該模型在音頻和視覺特征之間創(chuàng)建了一個共享的潛在空間,可用于從另一個空間生成一個空間,同時在不同的時間步長強(qiáng)制實現(xiàn)視覺語音的時間一致性。隱馬爾可夫模型(HMM)也被用于視覺語音生成[203]和文本-語音轉(zhuǎn)換[245]任務(wù)。它們還被擴(kuò)展到使用聚類自適應(yīng)訓(xùn)練,以允許對多種說話人、語言和情緒進(jìn)行訓(xùn)練,從而在產(chǎn)生語音信號[244]或視覺語音參數(shù)[6]時進(jìn)行更多的控制。 |
| Encoder-decoder models have recently become popular for sequence to sequence modeling. Owens et al. [157] used an LSTM to generate sounds resulting from drumsticks based on video. While their model is capable of generat-ing sounds by predicting a cochleogram from CNN visual features, they found that retrieving a closest audio sample based on the predicted cochleogram led to best results. Di-rectly modeling the raw audio signal for speech and music generation has been proposed by van den Oord et al. [209]. The authors propose using hierarchical fully convolutional neural networks, which show a large improvement over previous state-of-the-art for the task of speech synthesis. RNNs have also been used for speech to text translation (speech recognition) [72]. More recently encoder-decoder based continuous approach was shown to be good at pre-dicting letters from a speech signal represented as a filter bank spectra [35] — allowing for more accurate recognition of rare and out of vocabulary words. Collobert et al. [42] demonstrate how to use a raw audio signal directly for speech recognition, eliminating the need for audio features. A lot of earlier work used graphical models for mul-timodal translation between continuous signals. However, these methods are being replaced by neural network encoder-decoder based techniques. Especially as they have recently been shown to be able to represent and generate complex visual and acoustic signals. | 編碼器-解碼器模型是近年來序列對序列建模的流行方法。Owens等人[157]使用LSTM來產(chǎn)生基于視頻的鼓槌的聲音。雖然他們的模型能夠通過預(yù)測CNN視覺特征的耳蝸圖來產(chǎn)生聲音,但他們發(fā)現(xiàn),根據(jù)預(yù)測的耳蝸圖檢索最近的音頻樣本會帶來最好的結(jié)果。van den Oord等人提出直接對原始音頻信號建模以生成語音和音樂[209]。作者建議使用分層全卷積神經(jīng)網(wǎng)絡(luò),這表明在語音合成的任務(wù)中,比以前的最先進(jìn)技術(shù)有了很大的改進(jìn)。rnn也被用于語音到文本的翻譯(語音識別)[72]。最近,基于編碼器-解碼器的連續(xù)方法被證明能夠很好地從表示為濾波器組光譜[35]的語音信號中預(yù)測字母,從而能夠更準(zhǔn)確地識別罕見的和詞匯之外的單詞。Collobert等人的[42]演示了如何直接使用原始音頻信號進(jìn)行語音識別,消除了對音頻特征的需求。 許多早期的工作使用圖形模型來實現(xiàn)連續(xù)信號之間的多模態(tài)轉(zhuǎn)換。然而,這些方法正在被基于神經(jīng)網(wǎng)絡(luò)的編碼器-解碼器技術(shù)所取代。特別是它們最近被證明能夠表示和產(chǎn)生復(fù)雜的視覺和聽覺信號。 |
4.3 Model evaluation and discussion模型評價與討論
| A major challenge facing multimodal translation methods is that they are very difficult to evaluate. While some tasks such as speech recognition have a single correct translation, tasks such as speech synthesis and media description do not. Sometimes, as in language translation, multiple answers are correct and deciding which translation is better is often subjective. Fortunately, there are a number of approximate automatic metrics that aid in model evaluation. Often the ideal way to evaluate a subjective task is through human judgment. That is by having a group of people evaluating each translation. This can be done on a Likert scale where each translation is evaluated on a certain dimension: naturalness and mean opinion score for speech synthesis [209], [244], realism for visual speech synthesis [6],[203], and grammatical and semantic correctness, relevance, order, and detail for media description [38], [112], [142],[213]. Another option is to perform preference studies where two (or more) translations are presented to the participant for preference comparison [203], [244]. However, while user studies will result in evaluation closest to human judgments they are time consuming and costly. Furthermore, they require care when constructing and conducting them to avoid fluency, age, gender and culture biases. | 多模態(tài)翻譯方法面臨的一個主要挑戰(zhàn)是它們很難評估。語音識別等任務(wù)只有一個正確的翻譯,而語音合成和媒體描述等任務(wù)則沒有。有時,就像在語言翻譯中,多重答案是正確的,決定哪個翻譯更好往往是主觀的。幸運(yùn)的是,有許多有助于模型評估的近似自動指標(biāo)。 評估主觀任務(wù)的理想方法通常是通過人的判斷。那就是讓一群人評估每一個翻譯。這可以通過李克特量表來完成,其中每一篇翻譯都在一個特定的維度上進(jìn)行評估:語音合成的自然度和平均意見得分[209],[244],視覺語音合成的真實感[6],[203],以及媒體描述[38],[112],[142],[213]的語法和語義正確性、相關(guān)性、順序和細(xì)節(jié)。另一種選擇是進(jìn)行偏好研究,將兩種(或更多)翻譯呈現(xiàn)給參與者進(jìn)行偏好比較[203],[244]。然而,雖然用戶研究將導(dǎo)致最接近人類判斷的評估,但它們既耗時又昂貴。此外,在構(gòu)建和指導(dǎo)這些活動時,需要小心謹(jǐn)慎,以避免流利性、年齡、性別和文化偏見。 |
| While human studies are a gold standard for evaluation, a number of automatic alternatives have been proposed for the task of media description: BLEU [160], ROUGE [124], Meteor [48], and CIDEr [211]. These metrics are directly taken from (or are based on) work in machine translation and compute a score that measures the similarity between the generated and ground truth text. However, the use of them has faced a lot of criticism. Elliott and Keller [52] showed that sentence-level unigram BLEU is only weakly correlated with human judgments. Huang et al. [87] demonstrated that the correlation between human judgments and BLEU and Meteor is very low for the visual story telling task. Furthermore, the ordering of approaches based on human judgments did not match that of the ordering using automatic metrics on the MS COCO challenge [38] — with a large number of algorithms outperforming humans on all the metrics. Finally, the metrics only work well when the number of reference translations is high [211], which is often unavailable, especially for current video description datasets [205]. | 雖然人類研究是評估的黃金標(biāo)準(zhǔn),但人們提出了許多媒體描述任務(wù)的自動替代方案:BLEU[160]、ROUGE[124]、Meteor[48]和CIDEr[211]。這些指標(biāo)直接取自(或基于)機(jī)器翻譯領(lǐng)域的工作,并計算出一個分?jǐn)?shù),以衡量生成文本和真實(ground truth)文本之間的相似性。然而,它們的使用面臨著許多批評。Elliott和Keller[52]表明句子層面的unigram BLEU與人類判斷只有弱相關(guān)。Huang等[87]研究表明,在視覺講故事任務(wù)中,人類判斷與BLEU和Meteor之間的相關(guān)性非常低。此外,基于人類判斷的方法排序與在MS COCO挑戰(zhàn)[38]上使用自動度量的排序并不匹配——大量算法在所有度量上都優(yōu)于人類。最后,只有在參考譯文數(shù)量較多的情況下,這些指標(biāo)才能很好地工作[211],而這通常是不可用的,特別是對于當(dāng)前的視頻描述數(shù)據(jù)集[205]。 |
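For illustration, sentence-level BLEU (one of the criticized metrics) can be computed with NLTK; the reference and hypothesis tokens below are invented.

```python
# Hedged sketch: sentence-level BLEU with NLTK on made-up tokens.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a man is riding a bicycle down the street".split(),
    "a person rides a bike on the road".split(),
]
hypothesis = "a man rides a bicycle on the street".split()

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
score = sentence_bleu(references, hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")
```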
| These criticisms have led to Hodosh et al. [83] proposing to use retrieval as a proxy for image captioning evaluation, which they argue better reflects human judgments. Instead of generating captions, a retrieval based system ranks the available captions based on their fit to the image, and is then evaluated by assessing if the correct captions are given a high rank. As a number of caption generation models are generative they can be used directly to assess the likelihood of a caption given an image and are being adapted by im-age captioning community [99], [105]. Such retrieval based evaluation metrics have also been adopted by the video captioning community [175]. Visual question-answering (VQA) [130] task was pro-posed partly due to the issues facing evaluation of image captioning. VQA is a task where given an image and a ques-tion about its content the system has to answer it. Evaluating such systems is easier due to the presence of a correct answer. However, it still faces issues such as ambiguity of certain questions and answers and question bias. We believe that addressing the evaluation issue will be crucial for further success of multimodal translation systems. This will allow not only for better comparison be-tween approaches, but also for better objectives to optimize. | 這些批評導(dǎo)致Hodosh等人[83]提出使用檢索作為圖像標(biāo)題評價的代理,他們認(rèn)為檢索可以更好地反映人類的判斷。基于檢索的系統(tǒng)不是生成標(biāo)題,而是根據(jù)它們與圖像的契合度對可用的標(biāo)題進(jìn)行排序,然后通過評估正確的標(biāo)題是否被給予較高的級別來進(jìn)行評估。由于許多字幕生成模型是可生成的,它們可以直接用于評估給定圖像的字幕的可能性,并被圖像字幕社區(qū)改編[99],[105]。這種基于檢索的評價指標(biāo)也被視頻字幕社區(qū)采用[175]。 視覺問答(Visual question-answer, VQA)[130]任務(wù)的提出,部分是由于圖像字幕評價面臨的問題。VQA是一個任務(wù),在這個任務(wù)中,給定一個圖像和一個關(guān)于其內(nèi)容的問題,系統(tǒng)必須回答它。由于存在正確的答案,評估這些系統(tǒng)更容易。然而,它仍然面臨一些問題,如某些問題和答案的模糊性和問題的偏見。 我們認(rèn)為,解決評價問題將是多模態(tài)翻譯系統(tǒng)進(jìn)一步成功的關(guān)鍵。這不僅可以更好地比較不同的方法,而且還可以優(yōu)化更好的目標(biāo)。 |
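A retrieval-based evaluation of the kind Hodosh et al. [83] argue for is typically summarized with Recall@K and median rank. The sketch below assumes a precomputed image-to-caption similarity matrix in which the matching caption for image i sits at column i; this is a toy convention for the example, not a fixed standard.

```python
# Hedged sketch: Recall@K and median rank for caption retrieval, assuming
# sim[i, j] is the model's score for image i and candidate caption j.
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    ranks = []
    for i in range(sim.shape[0]):
        order = np.argsort(-sim[i])                      # best caption first
        ranks.append(int(np.where(order == i)[0][0]) + 1)  # rank of the true caption
    ranks = np.asarray(ranks)
    recall = {k: float(np.mean(ranks <= k)) for k in ks}
    return recall, float(np.median(ranks))

sim = np.random.default_rng(1).normal(size=(50, 50))     # placeholder scores
recall, med_rank = retrieval_metrics(sim)
print(recall, med_rank)
```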
5 Alignment對齊
| We define multimodal alignment as finding relationships and correspondences between sub-components of instances from two or more modalities. For example, given an image and a caption we want to find the areas of the image corresponding to the caption’s words or phrases [98]. Another example is, given a movie, aligning it to the script or the book chapters it was based on [252]. We categorize multimodal alignment into two types – implicit and explicit. In explicit alignment, we are explicitly interested in aligning sub-components between modalities, e.g., aligning recipe steps with the corresponding instructional video [131]. Implicit alignment is used as an intermediate (often latent) step for another task, e.g., image retrieval based on text description can include an alignment step between words and image regions [99]. An overview of such approaches can be seen in Table 4 and is presented in more detail in the following sections. | 我們將多模態(tài)對齊定義為尋找來自兩個或多個模態(tài)實例的子組件之間的關(guān)系和對應(yīng)關(guān)系。例如,給定一張圖片和一個標(biāo)題,我們想要找到圖片中與標(biāo)題中的單詞或短語相對應(yīng)的區(qū)域[98]。另一個例子是,給定一部電影,將其與劇本或書中的章節(jié)對齊[252]。 我們將多模態(tài)對齊分為兩種類型——隱式和顯式。在顯式對齊中,我們明確地對模態(tài)之間的子組件對齊感興趣,例如,將菜譜步驟與相應(yīng)的教學(xué)視頻對齊[131]。隱式對齊是另一個任務(wù)的中間(通常是潛在的)步驟,例如,基于文本描述的圖像檢索可以包括單詞和圖像區(qū)域之間的對齊步驟[99]。這些方法的概述見表4,并在下面幾節(jié)中給出更詳細(xì)的介紹。 |
| Table 4: Summary of our taxonomy for the multimodal alignment challenge. For each sub-class of our taxonomy, we include reference citations and the modalities aligned. | 表4:我們對多模態(tài)對齊挑戰(zhàn)的分類總結(jié)。對于分類法中的每一個子類,我們列出參考文獻(xiàn)及其所對齊的模態(tài)。 |
5.1 Explicit alignment顯式對齊
| We categorize papers as performing explicit alignment if their main modeling objective is alignment between sub-components of instances from two or more modalities. A very important part of explicit alignment is the similarity metric. Most approaches rely on measuring similarity between sub-components in different modalities as a basic building block. These similarities can be defined manually or learned from data. We identify two types of algorithms that tackle explicit alignment — unsupervised and (weakly) supervised. The first type operates with no direct alignment labels (i.e., labeled correspondences) between instances from the different modalities. The second type has access to such (sometimes weak) labels. Unsupervised multimodal alignment tackles modality alignment without requiring any direct alignment labels. Most of the approaches are inspired from early work on alignment for statistical machine translation [28] and genome sequences [3], [111]. To make the task easier the approaches assume certain constraints on alignment, such as temporal ordering of sequences or the existence of a similarity metric between the modalities. | 如果論文的主要建模目標(biāo)是對齊來自兩個或更多模態(tài)的實例的子組件,那么我們將其分類為執(zhí)行顯式對齊。顯式對齊的一個非常重要的部分是相似性度量。大多數(shù)方法都依賴于度量不同模態(tài)的子組件之間的相似性作為基本構(gòu)建塊。這些相似點(diǎn)可以手工定義,也可以從數(shù)據(jù)中學(xué)習(xí)。 我們確定了兩種處理顯式對齊的算法——無監(jiān)督和(弱)監(jiān)督。第一種類型在不同模態(tài)的實例之間沒有直接對齊標(biāo)簽(即標(biāo)注的對應(yīng)關(guān)系)。第二種類型可以訪問這樣的標(biāo)簽(有時是弱標(biāo)簽)。 無監(jiān)督多模態(tài)對齊處理模態(tài)對齊,而不需要任何直接對齊標(biāo)簽。大多數(shù)方法的靈感來自于早期對統(tǒng)計機(jī)器翻譯[28]和基因組序列[3],[111]的比對工作。為了使任務(wù)更容易,這些方法在對齊上假定了一定的約束,例如序列的時間順序或模態(tài)之間存在相似性度量。 |
| Dynamic time warping (DTW) [3], [111] is a dynamic programming approach that has been extensively used to align multi-view time series. DTW measures the similarity between two sequences and finds an optimal match between them by time warping (inserting frames). It requires the timesteps in the two sequences to be comparable and requires a similarity measure between them. DTW can be used directly for multimodal alignment by hand-crafting similarity metrics between modalities; for example Anguera et al. [8] use a manually defined similarity between graphemes and phonemes; and Tapaswi et al. [201] define a similarity between visual scenes and sentences based on appearance of same characters [201] to align TV shows and plot synopses. DTW-like dynamic programming approaches have also been used for multimodal alignment of text to speech [77] and video [202]. As the original DTW formulation requires a pre-defined similarity metric between modalities, it was extended using canonical correlation analysis (CCA) to map the modalities to a coordinated space. This allows for both aligning (through DTW) and learning the mapping (through CCA) between different modality streams jointly and in an unsupervised manner [180], [250], [251]. While CCA based DTW models are able to find multimodal data alignment under a linear transformation, they are not able to model non-linear relationships. This has been addressed by the deep canonical time warping approach [206], which can be seen as a generalization of deep CCA and DTW. | 動態(tài)時間規(guī)整(DTW)[3],[111]是一種動態(tài)規(guī)劃方法,被廣泛用于對齊多視圖時間序列。DTW測量兩個序列之間的相似性,并通過時間規(guī)整(插入幀)找到它們之間的最優(yōu)匹配。它要求兩個序列中的時間步具有可比性,并要求它們之間的相似性度量。DTW可以通過手工制作模態(tài)之間的相似性度量直接用于多模態(tài)對齊;例如,Anguera等人使用手工定義的字素和音素之間的相似度;和Tapaswi等[201]根據(jù)相同角色的出現(xiàn)定義視覺場景和句子之間的相似性[201],以對齊電視節(jié)目和情節(jié)摘要。類似DTW的動態(tài)規(guī)劃方法也被用于文本到語音[77]和視頻[202]的多模態(tài)對齊。 由于原始的DTW公式需要預(yù)定義的模態(tài)之間的相似性度量,因此使用典型相關(guān)分析(CCA)對其進(jìn)行了擴(kuò)展,以將模態(tài)映射到協(xié)調(diào)空間。這既允許(通過DTW)對齊,也允許(通過CCA)以無監(jiān)督的方式共同學(xué)習(xí)不同模態(tài)流之間的映射[180]、[250]、[251]。雖然基于CCA的DTW模型能夠在線性變換下找到多模態(tài)數(shù)據(jù)對齊,但它們不能建模非線性關(guān)系。深度典型時間規(guī)整(deep canonical time warping)方法已經(jīng)解決了這一問題[206],該方法可以看作是深度CCA和DTW的推廣。 |
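A minimal DTW sketch between two sequences of feature vectors is shown below; the Euclidean local cost stands in for whichever hand-crafted or learned cross-modal similarity is actually used.

```python
# Hedged sketch of dynamic time warping (DTW) between two feature sequences.
import numpy as np

def dtw(x, y):
    """x: (n, d), y: (m, d). Returns the accumulated alignment cost."""
    n, m = len(x), len(y)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])   # placeholder local cost
            acc[i, j] = cost + min(acc[i - 1, j],        # insertion
                                   acc[i, j - 1],        # deletion
                                   acc[i - 1, j - 1])    # match
    return acc[n, m]

rng = np.random.default_rng(2)
print(dtw(rng.normal(size=(20, 8)), rng.normal(size=(31, 8))))
```

The CCA-based extensions mentioned above effectively learn the mapping that makes such a local cost meaningful across modalities, instead of defining it by hand.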
| Various graphical models have also been popular for multimodal sequence alignment in an unsupervised man-ner. Early work by Yu and Ballard [239] used a generative graphical model to align visual objects in images with spoken words. A similar approach was taken by Cour et al.[44] to align movie shots and scenes to the corresponding screenplay. Malmaud et al. [131] used a factored HMM to align recipes to cooking videos, while Noulas et al. [154] used a dynamic Bayesian network to align speakers to videos. Naim et al. [147] matched sentences with corre-sponding video frames using a hierarchical HMM model to align sentences with frames and a modified IBM [28] algorithm for word and object alignment [15]. This model was then extended to use latent conditional random fields for alignments [146] and to incorporate verb alignment to actions in addition to nouns and objects [195]. Both DTW and graphical model approaches for align-ment allow for restrictions on alignment, e.g. temporal consistency, no large jumps in time, and monotonicity. While DTW extensions allow for learning both the similarity met-ric and alignment jointly, graphical model based approaches require expert knowledge for construction [44], [239]. Supervised alignment methods rely on labeled aligned in-stances. They are used to train similarity measures that are used for aligning modalities. | 各種圖形模型也流行于無監(jiān)督方式的多模態(tài)序列比對。Yu和Ballard的早期工作[239]使用生成圖形模型,將圖像中的視覺對象與口語對齊。Cour et al.[44]采用了類似的方法,將電影鏡頭和場景與相應(yīng)的劇本對齊。Malmaud等人[131]使用一種經(jīng)過分解的HMM將食譜與烹飪視頻進(jìn)行對齊,而Noulas等人[154]使用動態(tài)貝葉斯網(wǎng)絡(luò)將說話者與視頻進(jìn)行對齊。Naim等人[147]使用分層HMM模型對句子和幀進(jìn)行對齊,并使用改進(jìn)的IBM[28]算法對單詞和對象進(jìn)行對齊[15],將句子與相應(yīng)的視頻幀進(jìn)行匹配。隨后,該模型被擴(kuò)展到使用潛在條件隨機(jī)場進(jìn)行對齊[146],并將動詞對齊合并到動作中,除了名詞和對象之外[195]。 DTW和圖形模型的對齊方法都允許對對齊的限制,例如時間一致性、時間上沒有大的跳躍和單調(diào)性。雖然DTW擴(kuò)展可以同時學(xué)習(xí)相似性度量和對齊,但基于圖形模型的方法需要專家知識來構(gòu)建[44],[239]。監(jiān)督對齊方法依賴于標(biāo)記對齊的實例。它們被用來訓(xùn)練用于對齊模態(tài)的相似性度量。 |
| A number of supervised sequence alignment techniques take inspiration from unsupervised ones. Bojanowski et al. [22], [23] proposed a method similar to canonical time warping, but have also extended it to take advantage of existing (weak) supervisory alignment data for model training. Plummer et al. [161] used CCA to find a coordinated space between image regions and phrases for alignment. Gebru et al. [65] trained a Gaussian mixture model and performed semi-supervised clustering together with an unsupervised latent-variable graphical model to align speakers in an audio channel with their locations in a video. Kong et al. [108] trained a Markov random field to align objects in 3D scenes to nouns and pronouns in text descriptions. Deep learning based approaches are becoming popular for explicit alignment (specifically for measuring similarity) due to very recent availability of aligned datasets in the language and vision communities [133], [161]. Zhu et al. [252] aligned books with their corresponding movies/scripts by training a CNN to measure similarities between scenes and text. Mao et al. [133] used an LSTM language model and a CNN visual one to evaluate the quality of a match between a referring expression and an object in an image. Yu et al. [242] extended this model to include relative appearance and context information that allows to better disambiguate between objects of the same type. Finally, Hu et al. [85] used an LSTM based scoring function to find similarities between image regions and their descriptions. | 許多監(jiān)督序列比對技術(shù)的靈感來自于非監(jiān)督序列比對技術(shù)。Bojanowski et al.[22],[23]提出了一種類似于典型時間規(guī)整(canonical time warping)的方法,但也對其進(jìn)行了擴(kuò)展,以利用現(xiàn)有的(弱)監(jiān)督對齊數(shù)據(jù)進(jìn)行模型訓(xùn)練。Plummer等[161]利用CCA在圖像區(qū)域和短語之間找到一個協(xié)調(diào)的空間進(jìn)行對齊。Gebru等人[65]訓(xùn)練了一種高斯混合模型,并將半監(jiān)督聚類與一種無監(jiān)督的潛在變量圖形模型結(jié)合在一起,以將音頻通道中的說話者與其在視頻中的位置對齊。Kong等人[108]訓(xùn)練了馬爾可夫隨機(jī)場來將3D場景中的物體與文本描述中的名詞和代詞對齊。 基于深度學(xué)習(xí)的方法在顯式對齊(特別是度量相似性)方面正變得流行起來,這是由于最近在語言和視覺社區(qū)中對齊數(shù)據(jù)集的可用性[133],[161]。Zhu等人[252]通過訓(xùn)練CNN來衡量場景和文本之間的相似性,將書籍與相應(yīng)的電影/腳本對齊。Mao等人[133]使用LSTM語言模型和CNN視覺模型來評估指代表達(dá)(referring expression)和圖像中物體匹配的質(zhì)量。Yu等人[242]將該模型擴(kuò)展到包含相對外觀和上下文信息,從而可以更好地消除同一類型對象之間的歧義。最后,Hu等[85]使用基于LSTM的評分函數(shù)來尋找圖像區(qū)域與其描述之間的相似點(diǎn)。 |
5.2 Implicit alignment隱式對齊
| In contrast to explicit alignment, implicit alignment is used as an intermediate (often latent) step for another task. This allows for better performance in a number of tasks including speech recognition, machine translation, media description, and visual question-answering. Such models do not explicitly align data and do not rely on supervised alignment examples, but learn how to latently align the data during model training. We identify two types of implicit alignment models: earlier work based on graphical models, and more modern neural network methods. Graphical models have seen some early work used to better align words between languages for machine translation [216] and alignment of speech phonemes with their transcriptions [186]. However, they require manual construction of a mapping between the modalities, for example a generative phone model that maps phonemes to acoustic features [186]. Constructing such models requires training data or human expertise to define them manually. Neural networks: Translation (Section 4) is an example of a modeling task that can often be improved if alignment is performed as a latent intermediate step. As we mentioned before, neural networks are popular ways to address this translation problem, using either an encoder-decoder model or through cross-modal retrieval. When translation is performed without implicit alignment, it ends up putting a lot of weight on the encoder module to be able to properly summarize the whole image, sentence or a video with a single vectorial representation. | 與顯式對齊相反,隱式對齊用作另一個任務(wù)的中間(通常是潛在的)步驟。這允許在許多任務(wù)中有更好的表現(xiàn),包括語音識別、機(jī)器翻譯、媒體描述和視覺問題回答。這些模型不顯式地對齊數(shù)據(jù),也不依賴于監(jiān)督對齊示例,而是學(xué)習(xí)如何在模型訓(xùn)練期間潛在地對齊數(shù)據(jù)。我們確定了兩種類型的隱式對齊模型:基于圖形模型的早期工作,以及更現(xiàn)代的神經(jīng)網(wǎng)絡(luò)方法。 圖形模型的一些早期工作已被用于機(jī)器翻譯中更好地對齊語言之間的單詞[216],以及對齊語音音素與其轉(zhuǎn)錄文本[186]。然而,它們需要人工構(gòu)建模態(tài)之間的映射,例如將音素映射到聲學(xué)特征的生成式音素(phone)模型[186]。構(gòu)建這樣的模型需要訓(xùn)練數(shù)據(jù)或人類專業(yè)知識來手動定義它們。 神經(jīng)網(wǎng)絡(luò):翻譯(第4節(jié))就是一個建模任務(wù)的例子,如果將對齊作為潛在的中間步驟執(zhí)行,通常可以改進(jìn)該任務(wù)。正如我們前面提到的,神經(jīng)網(wǎng)絡(luò)是解決這個翻譯問題的常用方法,可以使用編碼器-解碼器模型,也可以通過跨模態(tài)檢索。當(dāng)在沒有隱式對齊的情況下進(jìn)行翻譯時,編碼器模塊會承受很大負(fù)擔(dān),因為它必須用單個矢量表示正確地總結(jié)整個圖像、句子或視頻。 |
| A very popular way to address this is through attention [12], which allows the decoder to focus on sub-components of the source instance. This is in contrast with encoding all source sub-components together, as is performed in a conventional encoder-decoder model. An attention module will tell the decoder to look more at targeted sub-components of the source to be translated — areas of an image [230], words of a sentence [12], segments of an audio sequence [35], [39], frames and regions in a video [236], [241], and even parts of an instruction [140]. For example, in image captioning, instead of encoding an entire image using a CNN, an attention mechanism will allow the decoder (typically an RNN) to focus on particular parts of the image when generating each successive word [230]. The attention module which learns what part of the image to focus on is typically a shallow neural network and is trained end-to-end together with a target task (e.g., translation). Attention models have also been successfully applied to question answering tasks, as they allow for aligning the words in a question with sub-components of an information source such as a piece of text [228], an image [62], or a video sequence [246]. This both allows for better performance in question answering and leads to better model interpretability [4]. In particular, different types of attention models have been proposed to address this problem, including hierarchical [128], stacked [234], and episodic memory attention [228]. | 解決這個問題的一種非常流行的方法是注意力機(jī)制[12],它允許解碼器關(guān)注源實例的子組件。這與傳統(tǒng)編碼器-解碼器模型中將所有源子組件一起編碼的做法形成對比。注意力模塊會告訴解碼器更多地關(guān)注待翻譯源中有針對性的子組件——圖像的區(qū)域[230]、句子中的單詞[12]、音頻序列的片段[35],[39]、視頻中的幀和區(qū)域[236],[241],甚至指令的某些部分[140]。例如,在圖像描述中,注意力機(jī)制允許解碼器(通常是RNN)在生成每個連續(xù)單詞時聚焦于圖像的特定部分,而不是使用CNN對整個圖像一次性編碼[230]。學(xué)習(xí)應(yīng)關(guān)注圖像哪一部分的注意力模塊通常是一個淺層神經(jīng)網(wǎng)絡(luò),并與目標(biāo)任務(wù)(如翻譯)一起進(jìn)行端到端訓(xùn)練。 注意力模型也已成功應(yīng)用于問答任務(wù),因為它們允許將問題中的單詞與信息源的子組件(如一段文本[228]、一幅圖像[62]或一段視頻序列[246])對齊。這既允許更好的問題回答性能,也導(dǎo)致更好的模型可解釋性[4]。特別是,人們提出了不同類型的注意模型來解決這個問題,包括層次結(jié)構(gòu)注意[128]、堆疊注意[234]和情景記憶注意[228]。 |
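A small dot-product attention sketch over image-region features illustrates the mechanism; the shapes and the scaled dot-product scoring are illustrative rather than taken from any specific cited model.

```python
# Hedged sketch: soft attention of a decoder state over image regions.
import torch
import torch.nn.functional as F

def attend(decoder_state, region_feats):
    """decoder_state: (B, h); region_feats: (B, R, h) for R image regions.
    Returns a context vector (B, h) and the attention weights (B, R)."""
    scores = torch.bmm(region_feats, decoder_state.unsqueeze(2)).squeeze(2)  # (B, R)
    weights = F.softmax(scores / region_feats.size(-1) ** 0.5, dim=1)
    context = torch.bmm(weights.unsqueeze(1), region_feats).squeeze(1)       # (B, h)
    return context, weights

state = torch.randn(4, 512)          # current decoder hidden state
regions = torch.randn(4, 49, 512)    # e.g., a 7x7 grid of visual region features
context, weights = attend(state, regions)   # context is fed back into the decoder
```

The attention weights themselves form a soft, latent alignment between generated words and source regions, which is exactly the sense in which these models perform implicit alignment.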
| Another neural alternative for aligning images with cap-tions for cross-modal retrieval was proposed by Karpathy?et al. [98], [99]. Their proposed model aligns sentence frag-ments to image regions by using a dot product similarity measure between image region and word representations. While it does not use attention, it extracts a latent alignment between modalities through a similarity measure that is learned indirectly by training a retrieval model. | Karpathy等人提出了另一種神經(jīng)方法,可用于對帶有標(biāo)題的圖像進(jìn)行交叉模態(tài)檢索[98],[99]。他們提出的模型通過使用圖像區(qū)域和單詞表示之間的點(diǎn)積相似度度量來將句子片段與圖像區(qū)域?qū)R。雖然它不使用注意力,但它通過通過訓(xùn)練檢索模型間接學(xué)習(xí)的相似度度量來提取模態(tài)之間的潛在對齊。 |
5.3 Discussion討論
| Multimodal alignment faces a number of difficulties: 1) there are few datasets with explicitly annotated alignments; 2) it is difficult to design similarity metrics between modalities; 3) there may exist multiple possible alignments and not all elements in one modality have correspondences in another. Earlier work on multimodal alignment focused on aligning multimodal sequences in an unsupervised manner using graphical models and dynamic programming techniques. It relied on hand-defined measures of similarity between the modalities or learnt them in an unsupervised manner. With the recent availability of labeled training data, supervised learning of similarities between modalities has become possible. However, unsupervised techniques of learning to jointly align and translate or fuse data have also become popular. | 多模態(tài)對齊面臨許多困難: 1)具有明確注釋對齊的數(shù)據(jù)集很少; 2) 難以設(shè)計模態(tài)之間的相似性度量; 3) 可能存在多種可能的對齊方式,并且并非一種模態(tài)中的所有元素在另一種模態(tài)中都有對應(yīng)關(guān)系。 早期關(guān)于多模態(tài)對齊的工作側(cè)重于使用圖形模型和動態(tài)規(guī)劃技術(shù)以無監(jiān)督方式對齊多模態(tài)序列。 它依靠手動定義的模態(tài)之間的相似性度量或以無監(jiān)督的方式學(xué)習(xí)它們。 隨著最近標(biāo)記訓(xùn)練數(shù)據(jù)的可用性,對模態(tài)之間相似性的監(jiān)督學(xué)習(xí)成為可能。 然而,學(xué)習(xí)聯(lián)合對齊和翻譯或融合數(shù)據(jù)的無監(jiān)督技術(shù)也變得流行起來。 |
6 Fusion融合
| Multimodal fusion is one of the original topics in mul-timodal machine learning, with previous surveys empha-sizing early, late and hybrid fusion approaches [50], [247]. In technical terms, multimodal fusion is the concept of integrating information from multiple modalities with the goal of predicting an outcome measure: a class (e.g., happy vs. sad) through classification, or a continuous value (e.g., positivity of sentiment) through regression. It is one of the most researched aspects of multimodal machine learning with work dating to 25 years ago [243]. The interest in multimodal fusion arises from three main benefits it can provide. First, having access to multiple modalities that observe the same phenomenon may allow for more robust predictions. This has been especially ex-plored and exploited by the AVSR community [163]. Second, having access to multiple modalities might allow us to capture complementary information — something that is not visible in individual modalities on their own. Third, a multimodal system can still operate when one of the modalities is missing, for example recognizing emotions from the visual signal when the person is not speaking [50]. | 多模態(tài)融合是多模態(tài)機(jī)器學(xué)習(xí)中最原始的主題之一,以往的研究強(qiáng)調(diào)早期、晚期和混合融合方法[50],[247]。用技術(shù)術(shù)語來說,多模態(tài)融合是將來自多種模態(tài)的信息整合在一起的概念,目的是預(yù)測一個結(jié)果度量:通過分類得到一個類別(例如,快樂vs.悲傷),或者通過回歸得到一個連續(xù)值(例如,情緒的積極性)。這是多模態(tài)機(jī)器學(xué)習(xí)研究最多的方面之一,可追溯到25年前的工作[243]。 人們對多模態(tài)融合的興趣源于它能提供的三個主要好處。首先,使用觀察同一現(xiàn)象的多種模態(tài)可能會使預(yù)測更加準(zhǔn)確。AVSR社區(qū)對此進(jìn)行了特別的探索和利用[163]。其次,接觸多種模態(tài)可能會讓我們獲得互補(bǔ)信息——在單獨(dú)的模態(tài)中是看不到的信息。第三,當(dāng)其中一種模態(tài)缺失時,多模態(tài)系統(tǒng)仍然可以運(yùn)行,例如,當(dāng)一個人不說話時,從視覺信號中識別情緒。 |
| Multimodal fusion has a very broad range of applications, including audio-visual speech recognition (AVSR) [163], multimodal emotion recognition [192], medical image analysis [89], and multimedia event detection [117]. There are a number of reviews on the subject [11], [163], [188], [247]. Most of them concentrate on multimodal fusion for a particular task, such as multimedia analysis, information retrieval or emotion recognition. In contrast, we concentrate on the machine learning approaches themselves and the technical challenges associated with these approaches. While some prior work used the term multimodal fusion to include all multimodal algorithms, in this survey paper we classify approaches as fusion when the multimodal integration is performed at the later prediction stages, with the goal of predicting outcome measures. In recent work, the line between multimodal representation and fusion has been blurred for models such as deep neural networks where representation learning is interlaced with classification or regression objectives. As we will describe in this section, this line is clearer for other approaches such as graphical models and kernel-based methods. We classify multimodal fusion into two main categories: model-agnostic approaches (Section 6.1) that are not directly dependent on a specific machine learning method; and model-based (Section 6.2) approaches that explicitly address fusion in their construction — such as kernel-based approaches, graphical models, and neural networks. An overview of such approaches can be seen in Table 5. | 多模態(tài)融合有非常廣泛的應(yīng)用,包括視聽語音識別(AVSR)[163]、多模態(tài)情感識別[192]、醫(yī)學(xué)圖像分析[89]、多媒體事件檢測[117]。關(guān)于這一主題有許多綜述[11],[163],[188],[247]。它們大多集中于針對特定任務(wù)的多模態(tài)融合,如多媒體分析、信息檢索或情感識別。相比之下,我們專注于機(jī)器學(xué)習(xí)方法本身以及與這些方法相關(guān)的技術(shù)挑戰(zhàn)。 雖然之前的一些工作使用術(shù)語多模態(tài)融合來包括所有的多模態(tài)算法,但在本調(diào)查論文中,當(dāng)多模態(tài)集成在后期預(yù)測階段進(jìn)行、且目標(biāo)是預(yù)測結(jié)果度量時,我們將方法歸類為融合類別。在最近的工作中,多模態(tài)表示和融合之間的界限已經(jīng)模糊,例如在深度神經(jīng)網(wǎng)絡(luò)中,表示學(xué)習(xí)與分類或回歸目標(biāo)交織在一起。正如我們將在本節(jié)中描述的那樣,這條界限對于其他方法(如圖形模型和基于核的方法)來說更為清晰。 我們將多模態(tài)融合分為兩大類:不直接依賴于特定機(jī)器學(xué)習(xí)方法的模型無關(guān)方法(章節(jié)6.1);和基于模型(第6.2節(jié))的方法,這些方法在其構(gòu)造中明確地處理融合——例如基于核的方法、圖形模型和神經(jīng)網(wǎng)絡(luò)。這些方法的概述見表5。 |
| Table 5: A summary of our taxonomy of multimodal fusion approaches. OUT — output type (class — classification or reg — regression), TEMP — whether temporal modeling is possible. | 表5:我們對多模態(tài)融合方法的分類總結(jié)。OUT:輸出類型(class:分類,reg:回歸);TEMP:是否支持時間建模。 |
6.1 Model-agnostic approaches與模型無關(guān)的方法
| Historically, the vast majority of multimodal fusion has been done using model-agnostic approaches [50]. Such ap-proaches can be split into early (i.e., feature-based), late (i.e., decision-based) and hybrid fusion [11]. Early fusion inte-grates features immediately after they are extracted (often by simply concatenating their representations). Late fusion on the other hand performs integration after each of the modalities has made a decision (e.g., classification or regres-sion). Finally, hybrid fusion combines outputs from early fusion and individual unimodal predictors. An advantage of model agnostic approaches is that they can be implemented using almost any unimodal classifiers or regressors. Early fusion could be seen as an initial attempt by mul-timodal researchers to perform multimodal representation learning — as it can learn to exploit the correlation and interactions between low level features of each modality. Furthermore it only requires the training of a single model, making the training pipeline easier compared to late and hybrid fusion. | 歷史上,絕大多數(shù)的多模態(tài)融合都是使用模型無關(guān)的方法[50]完成的。這樣的方法可以分為早期(即基于特征的)、后期(即基于決策的)和混合融合[11]。早期融合會在特征被提取后立即進(jìn)行整合(通常是簡單地將它們的表示連接起來)。另一方面,晚期融合在每種模態(tài)做出決定(如分類或回歸)后進(jìn)行整合。最后,混合融合結(jié)合早期融合和單個單模態(tài)預(yù)測的結(jié)果。模型無關(guān)方法的一個優(yōu)點(diǎn)是,它們可以使用幾乎任何單模態(tài)分類器或回歸器來實現(xiàn)。 早期的融合可以被看作是多模態(tài)研究人員進(jìn)行多模態(tài)表征學(xué)習(xí)的初步嘗試,因為它可以學(xué)習(xí)利用每個模態(tài)的低水平特征之間的相關(guān)性和相互作用。而且,它只需要對單個模型進(jìn)行訓(xùn)練,相比后期的混合融合更容易實現(xiàn)。 |
| In contrast, late fusion uses unimodal decision values and fuses them using a fusion mechanism such as averaging [181], voting schemes [144], weighting based on channel noise [163] and signal variance [53], or a learned model [68], [168]. It allows for the use of different models for each modality as different predictors can model each individual modality better, allowing for more flexibility. Furthermore, it makes it easier to make predictions when one or more of the modalities is missing and even allows for training when no parallel data is available. However, late fusion ignores the low level interaction between the modalities. Hybrid fusion attempts to exploit the advantages of both of the above described methods in a common framework. It has been used successfully for multimodal speaker identification [226] and multimedia event detection (MED) [117]. | 相反,后期融合使用單模態(tài)決策值,并使用一種融合機(jī)制來融合它們,如平均[181]、投票方案[144]、基于信道噪聲[163]和信號方差[53]的加權(quán)或?qū)W習(xí)模型[68]、[168]。它允許為每個模態(tài)使用不同的模型,因為不同的預(yù)測器可以更好地為每個模態(tài)建模,從而具有更大的靈活性。此外,當(dāng)一個或多個模態(tài)缺失時,它可以更容易地進(jìn)行預(yù)測,甚至可以在沒有并行數(shù)據(jù)可用時進(jìn)行訓(xùn)練。然而,晚期融合忽略了模態(tài)之間低水平的相互作用。 混合融合嘗試在一個公共框架中利用上述兩種方法的優(yōu)點(diǎn)。它已成功地用于多模態(tài)說話人識別[226]和多媒體事件檢測(MED)[117]。 |
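The model-agnostic distinction can be illustrated with placeholder features and off-the-shelf classifiers: early fusion concatenates features before a single classifier, while late fusion averages per-modality decision scores. The data, dimensions, and logistic-regression choice below are assumptions made only for the sketch.

```python
# Hedged sketch: model-agnostic early vs. late fusion with placeholder data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X_audio = rng.normal(size=(200, 40))    # hypothetical audio features
X_video = rng.normal(size=(200, 60))    # hypothetical visual features
y = rng.integers(0, 2, size=200)

# Early fusion: concatenate features, train a single classifier.
early_clf = LogisticRegression(max_iter=1000).fit(np.hstack([X_audio, X_video]), y)

# Late fusion: one classifier per modality, average their decision scores.
clf_a = LogisticRegression(max_iter=1000).fit(X_audio, y)
clf_v = LogisticRegression(max_iter=1000).fit(X_video, y)
late_scores = (clf_a.predict_proba(X_audio)[:, 1] + clf_v.predict_proba(X_video)[:, 1]) / 2
late_pred = (late_scores > 0.5).astype(int)
```

A hybrid scheme would additionally combine the early-fusion classifier's scores with the unimodal ones, in line with the description above.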
6.2 Model-based approaches基于模型的方法
| While model-agnostic approaches are easy to implement using unimodal machine learning methods, they end up using techniques that are not designed to cope with mul-timodal data. In this section we describe three categories of approaches that are designed to perform multimodal fusion: kernel-based methods, graphical models, and neural networks. Multiple kernel learning (MKL) methods are an extension to kernel support vector machines (SVM) that allow for the use of different kernels for different modalities/views of the data [70]. As kernels can be seen as similarity functions be-tween data points, modality-specific kernels in MKL allows for better fusion of heterogeneous data. MKL approaches have been an especially popular method for fusing visual descriptors for object detection [31], [66] and only recently have been overtaken by deep learning methods for the task [109]. They have also seen use for multimodal affect recognition [36], [90], [182], mul-timodal sentiment analysis [162], and multimedia event detection (MED) [237]. Furthermore, McFee and Lanckriet [137] proposed to use MKL to perform musical artist simi-larity ranking from acoustic, semantic and social view data. Finally, Liu et al. [125] used MKL for multimodal fusion in Alzheimer’s disease classification. Their broad applicability demonstrates the strength of such approaches in various domains and across different modalities. | 雖然使用單模態(tài)機(jī)器學(xué)習(xí)方法很容易實現(xiàn)模型無關(guān)的方法,但它們最終使用的技術(shù)不是用來處理多模態(tài)數(shù)據(jù)的。在本節(jié)中,我們將描述用于執(zhí)行多模態(tài)融合的三類方法:基于核的方法、圖形模型和神經(jīng)網(wǎng)絡(luò)。 多核學(xué)習(xí)(MKL)方法是對核支持向量機(jī)(SVM)的一種擴(kuò)展,它允許對數(shù)據(jù)的不同模態(tài)/視圖使用不同的核[70]。由于內(nèi)核可以被視為數(shù)據(jù)點(diǎn)之間的相似函數(shù),因此MKL中的特定于模態(tài)的內(nèi)核可以更好地融合異構(gòu)數(shù)據(jù)。 MKL方法是融合視覺描述符用于目標(biāo)檢測[31]的一種特別流行的方法[66],直到最近才被用于任務(wù)的深度學(xué)習(xí)方法所取代[109]。它們也被用于多模態(tài)情感識別[36][90],[182],多模態(tài)情感分析[162],以及多媒體事件檢測(MED)[237]。此外,McFee和Lanckriet[137]提出使用MKL從聲學(xué)、語義和社會視圖數(shù)據(jù)中進(jìn)行音樂藝術(shù)家相似度排序。最后,Liu等[125]將MKL用于阿爾茨海默病的多模態(tài)融合分類。它們廣泛的適用性表明了這些方法在不同領(lǐng)域和不同模態(tài)中的優(yōu)勢。 |
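The core MKL idea of combining modality-specific kernels can be approximated by fixing the kernel weights and passing the combined kernel to an SVM with a precomputed kernel; a real MKL solver would learn those weights jointly, which the sketch below does not do.

```python
# Hedged sketch: fixed-weight combination of modality-specific kernels fed to an SVM.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X_text = rng.normal(size=(120, 50))    # hypothetical textual features
X_image = rng.normal(size=(120, 80))   # hypothetical visual features
y = rng.integers(0, 2, size=120)

K_train = 0.6 * rbf_kernel(X_text) + 0.4 * rbf_kernel(X_image)   # assumed weights
clf = SVC(kernel="precomputed").fit(K_train, y)

# At test time the same weighted kernels between test and training points are needed.
K_test = 0.6 * rbf_kernel(X_text[:10], X_text) + 0.4 * rbf_kernel(X_image[:10], X_image)
pred = clf.predict(K_test)
```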
| Besides flexibility in kernel selection, an advantage of MKL is the fact that the loss function is convex, allowing for model training using standard optimization packages and global optimum solutions [70]. Furthermore, MKL can be used to perform both regression and classification. One of the main disadvantages of MKL is the reliance on training data (support vectors) during test time, leading to slow inference and a large memory footprint. Graphical models are another family of popular methods for multimodal fusion. In this section we overview work done on multimodal fusion using shallow graphical models. A description of deep graphical models such as deep belief networks can be found in Section 3.1. The majority of graphical models can be classified into two main categories: generative — modeling joint probability; or discriminative — modeling conditional probability [200]. Some of the earliest approaches to use graphical models for multimodal fusion include generative models such as coupled [149] and factorial hidden Markov models [67] alongside dynamic Bayesian networks [64]. A more recently-proposed multi-stream HMM method proposes dynamic weighting of modalities for AVSR [75]. Arguably, generative models lost popularity to discriminative ones such as conditional random fields (CRF) [115] which sacrifice the modeling of joint probability for predictive power. A CRF model was used to better segment images by combining visual and textual information of image description [60]. CRF models have been extended to model latent states using hidden conditional random fields [165] and have been applied to multimodal meeting segmentation [173]. Other multimodal uses of latent variable discriminative graphical models include multi-view hidden CRF [194] and latent variable models [193]. More recently Jiang et al. [93] have shown the benefits of multimodal hidden conditional random fields for the task of multimedia classification. While most graphical models are aimed at classification, CRF models have been extended to a continuous version for regression [164] and applied in multimodal settings [13] for audio visual emotion recognition. | 除了在核選擇上的靈活性外,MKL的一個優(yōu)點(diǎn)是損失函數(shù)是凸的,允許使用標(biāo)準(zhǔn)優(yōu)化包和全局最優(yōu)解進(jìn)行模型訓(xùn)練[70]。此外,MKL可用于回歸和分類。MKL的主要缺點(diǎn)之一是在測試期間依賴于訓(xùn)練數(shù)據(jù)(支持向量),導(dǎo)致推理緩慢和占用大量內(nèi)存。 圖形模型是另一類流行的多模態(tài)融合方法。在本節(jié)中,我們將概述使用淺層圖形模型進(jìn)行多模態(tài)融合的工作。深度圖形模型(如深度信念網(wǎng)絡(luò))的描述可以在3.1節(jié)中找到。 大多數(shù)圖形模型可分為兩大類:生成式(建模聯(lián)合概率)或判別式(建模條件概率)[200]。最早將圖形模型用于多模態(tài)融合的一些方法包括生成模型,如耦合隱馬爾可夫模型[149]、階乘隱馬爾可夫模型[67]以及動態(tài)貝葉斯網(wǎng)絡(luò)[64]。最近提出的一種多流HMM方法提出了AVSR模態(tài)的動態(tài)加權(quán)[75]。 可以說,生成式模型的流行程度已被判別式模型所超越,例如條件隨機(jī)場(CRF)[115],后者犧牲了對聯(lián)合概率的建模以換取預(yù)測能力。結(jié)合圖像描述[60]的視覺和文本信息,采用CRF模型對圖像進(jìn)行更好的分割。CRF模型已被擴(kuò)展到使用隱藏條件隨機(jī)場來模擬潛在狀態(tài)[165],并已被應(yīng)用于多模態(tài)會議分割[173]。潛變量判別圖形模型的其他多模態(tài)應(yīng)用包括多視圖隱CRF[194]和潛變量模型[193]。最近,Jiang等人[93]展示了多模態(tài)隱藏條件隨機(jī)場對多媒體分類任務(wù)的好處。雖然大多數(shù)圖形模型的目的是分類,但CRF模型已擴(kuò)展到連續(xù)版本用于回歸[164],并應(yīng)用于多模態(tài)設(shè)置[13]用于視聽情感識別。 |
| The benefit of graphical models is their ability to easily exploit spatial and temporal structure of the data, making them especially popular for temporal modeling tasks, such as AVSR and multimodal affect recognition. They also allow human expert knowledge to be built into the models and often lead to interpretable models. Neural networks have been used extensively for the task of multimodal fusion [151]. The earliest examples of using neural networks for multi-modal fusion come from work on AVSR [163]. Nowadays they are being used to fuse information for visual and media question answering [63], [130], [229], gesture recognition [150], affect analysis [96], [153], and video description generation [94]. While the modalities used, architectures, and optimization techniques might differ, the general idea of fusing information in a joint hidden layer of a neural network remains the same. Neural networks have also been used for fusing temporal multimodal information through the use of RNNs and LSTMs. One of the earlier such applications used a bidirectional LSTM to perform audio-visual emotion classification [224]. More recently, Wöllmer et al. [223] used LSTM models for continuous multimodal emotion recognition, demonstrating its advantage over graphical models and SVMs. Similarly, Nicolaou et al. [152] used LSTMs for continuous emotion prediction. Their proposed method used an LSTM to fuse the results from modality-specific (audio and facial expression) LSTMs. | 圖形模型的優(yōu)點(diǎn)是能夠輕松利用數(shù)據(jù)的空間和時間結(jié)構(gòu),這使得它們在時間建模任務(wù)(如AVSR和多模態(tài)情感識別)中特別受歡迎。它們還允許在模型中加入人類專家知識,并且通常會產(chǎn)生可解釋的模型。 神經(jīng)網(wǎng)絡(luò)已被廣泛用于多模態(tài)融合的任務(wù)[151]。使用神經(jīng)網(wǎng)絡(luò)進(jìn)行多模態(tài)融合的最早例子來自于AVSR的研究[163]。如今,它們被用于融合信息,用于視覺和媒體問答[63]、[130]、[229]、手勢識別[150]、情感分析[96]、[153]和視頻描述生成[94]。雖然所使用的模態(tài)、架構(gòu)和優(yōu)化技術(shù)可能不同,但在神經(jīng)網(wǎng)絡(luò)的聯(lián)合隱層中融合信息的一般思想是相同的。 神經(jīng)網(wǎng)絡(luò)也通過RNN和LSTM來融合時序多模態(tài)信息。較早的此類應(yīng)用之一使用雙向LSTM進(jìn)行視聽情緒分類[224]。最近,Wöllmer等人[223]使用LSTM模型進(jìn)行連續(xù)多模態(tài)情緒識別,證明了其優(yōu)于圖形模型和支持向量機(jī)。同樣,Nicolaou等[152]使用LSTM進(jìn)行連續(xù)情緒預(yù)測。他們提出的方法使用一個LSTM來融合來自各模態(tài)特定(音頻和面部表情)LSTM的結(jié)果。 |
| Approaching modality fusion through recurrent neural networks has been used in various image captioning tasks, example models include: neural image captioning [214] where a CNN image representation is decoded using an LSTM language model, gLSTM [91] which incorporates the image data together with sentence decoding at every time step fusing the visual and sentence data in a joint repre-sentation. A more recent example is the multi-view LSTM (MV-LSTM) model proposed by Rajagopalan et al. [166]. MV-LSTM model allows for flexible fusion of modalities in the LSTM framework by explicitly modeling the modality-specific and cross-modality interactions over time. A big advantage of deep neural network approaches in data fusion is their capacity to learn from large amount of data. Secondly, recent neural architectures allow for end-to-end training of both the multimodal representation compo-nent and the fusion component. Finally, they show good performance when compared to non neural network based system and are able to learn complex decision boundaries that other approaches struggle with. The major disadvantage of neural network approaches?is their lack of interpretability. It is difficult to tell what the prediction relies on, and which modalities or features play an important role. Furthermore, neural networks require large training datasets to be successful. | 通過遞歸神經(jīng)網(wǎng)絡(luò)實現(xiàn)模態(tài)融合已被用于各種圖像字幕任務(wù),示例模型包括:神經(jīng)圖像字幕[214],其中CNN圖像表示使用LSTM語言模型進(jìn)行解碼,gLSTM[91]將圖像數(shù)據(jù)和每一步的句子解碼結(jié)合在一起,將視覺數(shù)據(jù)和句子數(shù)據(jù)融合在一個聯(lián)合表示中。最近的一個例子是Rajagopalan等人提出的多視圖LSTM (MV-LSTM)模型[166]。MV-LSTM模型通過顯式地建模隨時間變化的特定模態(tài)和跨模態(tài)交互,允許LSTM框架中模態(tài)的靈活融合。 深度神經(jīng)網(wǎng)絡(luò)方法在數(shù)據(jù)融合中的一大優(yōu)勢是能夠從大量數(shù)據(jù)中學(xué)習(xí)。其次,最近的神經(jīng)體系結(jié)構(gòu)允許端到端訓(xùn)練多模態(tài)表示組件和融合組件。最后,與基于非神經(jīng)網(wǎng)絡(luò)的系統(tǒng)相比,它們表現(xiàn)出了良好的性能,并且能夠?qū)W習(xí)其他方法難以處理的復(fù)雜決策邊界。 神經(jīng)網(wǎng)絡(luò)方法的主要缺點(diǎn)是缺乏可解釋性。很難判斷預(yù)測的依據(jù)是什么,以及哪種模態(tài)或特征發(fā)揮了重要作用。此外,神經(jīng)網(wǎng)絡(luò)需要大量的訓(xùn)練數(shù)據(jù)集才能成功。 |
6.3 Discussion討論
| Multimodal fusion has been a widely researched topic with a large number of approaches proposed to tackle it, including model agnostic methods, graphical models, multiple kernel learning, and various types of neural networks. Each approach has its own strengths and weaknesses, with some more suited for smaller datasets and others performing better in noisy environments. Most recently, neural networks have become a very popular way to tackle multimodal fusion; however, graphical models and multiple kernel learning are still being used, especially in tasks with limited training data or where model interpretability is important. | 多模態(tài)融合是一個被廣泛研究的課題,有大量的方法被提出來解決它,包括與模型無關(guān)的方法、圖形模型、多核學(xué)習(xí)和各種類型的神經(jīng)網(wǎng)絡(luò)。每種方法都有自己的優(yōu)點(diǎn)和缺點(diǎn),一些方法更適合于較小的數(shù)據(jù)集,而另一些方法在嘈雜的環(huán)境中表現(xiàn)得更好。最近,神經(jīng)網(wǎng)絡(luò)已經(jīng)成為處理多模態(tài)融合的一種非常流行的方法,但圖形模型和多核學(xué)習(xí)仍在使用,特別是在訓(xùn)練數(shù)據(jù)有限的任務(wù)或模型可解釋性很重要的地方。 |
| Despite these advances, multimodal fusion still faces the following challenges: 1) signals might not be temporally aligned (possibly a dense continuous signal and a sparse event); 2) it is difficult to build models that exploit supplementary and not only complementary information; 3) each modality might exhibit different types and different levels of noise at different points in time. | 盡管有這些進(jìn)展,但多模態(tài)融合仍面臨以下挑戰(zhàn): 1)信號可能沒有時間對齊(可能是密集的連續(xù)信號和稀疏事件); 2)很難建立不僅利用互補(bǔ)信息、還能利用補(bǔ)充信息的模型; 3)各模態(tài)在不同時間點(diǎn)可能表現(xiàn)出不同類型和不同水平的噪聲。 |
7 Co-learning共同學(xué)習(xí)
| The final multimodal challenge in our taxonomy is co-learning — aiding the modeling of a (resource poor) modal-ity by exploiting knowledge from another (resource rich) modality. It is particularly relevant when one of the modali-ties has limited resources — lack of annotated data, noisy input, and unreliable labels. We call this challenge co-learning as most often the helper modality is used only during model training and is not used during test time. We identify three types of co-learning approaches based on their training resources: parallel, non-parallel, and hybrid. Parallel-data approaches require training datasets where the observations from one modality are directly linked to the ob-servations from other modalities. In other words, when the multimodal observations are from the same instances, such as in an audio-visual speech dataset where the video and speech samples are from the same speaker. In contrast, non-parallel data approaches do not require direct links between observations from different modalities. These approaches usually achieve co-learning by using overlap in terms of categories. For example, in zero shot learning when the con-ventional visual object recognition dataset is expanded with a second text-only dataset from Wikipedia to improve the generalization of visual object recognition. In the hybrid data setting the modalities are bridged through a shared modality or a dataset. An overview of methods in co-learning can be seen in Table 6 and summary of data parallelism in Figure 3. | 我們分類法中的最后一個多模態(tài)挑戰(zhàn)是共同學(xué)習(xí)——通過從另一個(資源豐富的)模態(tài)中獲取知識來幫助(資源貧乏的)模態(tài)建模。當(dāng)其中一種模態(tài)的資源有限時——缺乏注釋的數(shù)據(jù)、嘈雜的輸入和不可靠的標(biāo)簽——這一點(diǎn)尤其重要。我們稱這種挑戰(zhàn)為共同學(xué)習(xí),因為大多數(shù)情況下,助手模態(tài)只在模型訓(xùn)練中使用,而在測試期間不使用。我們根據(jù)他們的培訓(xùn)資源確定了三種類型的共同學(xué)習(xí)方法:并行、非并行和混合。平行數(shù)據(jù)方法需要訓(xùn)練數(shù)據(jù)集,其中一個模態(tài)的觀察結(jié)果與其他模態(tài)的觀察結(jié)果直接相連。換句話說,當(dāng)多模態(tài)觀察來自相同的實例時,例如在一個視聽語音數(shù)據(jù)集中,視頻和語音樣本來自同一個說話者。相反,非平行數(shù)據(jù)方法不需要不同模態(tài)的觀察結(jié)果之間的直接聯(lián)系。這些方法通常通過使用類別上的重疊來實現(xiàn)共同學(xué)習(xí)。例如,在零鏡頭學(xué)習(xí)時,將傳統(tǒng)的視覺對象識別數(shù)據(jù)集擴(kuò)展為維基百科的第二個純文本數(shù)據(jù)集,以提高視覺對象識別的泛化。在混合數(shù)據(jù)設(shè)置中,模態(tài)通過共享的模態(tài)或數(shù)據(jù)集進(jìn)行連接。在表6中可以看到共同學(xué)習(xí)方法的概述,在圖3中可以看到數(shù)據(jù)并行性的總結(jié)。 |
7.1 Parallel data并行數(shù)據(jù)
| In parallel data co-learning both modalities share a set of instances — audio recordings with the corresponding videos, images and their sentence descriptions. This allows for two types of algorithms to exploit that data to better model the modalities: co-training and representation learning. Co-training is the process of creating more labeled training samples when we have few labeled samples in a multimodal problem [21]. The basic algorithm builds weak classifiers in each modality to bootstrap each other with labels for the unlabeled data. In the seminal work of Blum and Mitchell [21] it has been shown to discover more training samples for web-page classification based on the web-page itself and the hyper-links leading to it. By definition this task requires parallel data as it relies on the overlap of multimodal samples. | 在并行數(shù)據(jù)共同學(xué)習(xí)中,兩種模態(tài)共享一組實例——音頻錄音與對應(yīng)的視頻、圖像及其句子描述。這允許兩類算法利用這些數(shù)據(jù)更好地對模態(tài)建模:協(xié)同訓(xùn)練和表示學(xué)習(xí)。 協(xié)同訓(xùn)練是在多模態(tài)問題中只有少量標(biāo)記樣本時創(chuàng)建更多標(biāo)記訓(xùn)練樣本的過程[21]。基本算法在每個模態(tài)中構(gòu)建弱分類器,讓它們相互為未標(biāo)記數(shù)據(jù)提供標(biāo)簽,從而實現(xiàn)自舉。在Blum和Mitchell[21]的開創(chuàng)性工作中,已經(jīng)證明這種方法可以基于網(wǎng)頁本身和指向網(wǎng)頁的超鏈接,為網(wǎng)頁分類發(fā)現(xiàn)更多的訓(xùn)練樣本。根據(jù)定義,該任務(wù)需要并行數(shù)據(jù),因為它依賴于多模態(tài)樣本的重疊。 |
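The following is a minimal sketch of the co-training loop described above, in the spirit of Blum and Mitchell [21]: weak classifiers trained on each view pseudo-label the unlabeled pool, and their most confident guesses grow the shared labeled set. The GaussianNB classifiers, confidence heuristic, and round sizes are illustrative assumptions that simplify the original algorithm.

```python
# Simplified co-training sketch over two views (modalities) of the same instances.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X1, X2, y, labeled_idx, unlabeled_idx, rounds=5, per_round=10):
    labeled, unlabeled = list(labeled_idx), list(unlabeled_idx)
    clf1, clf2 = GaussianNB(), GaussianNB()
    for _ in range(rounds):
        if not unlabeled:
            break
        clf1.fit(X1[labeled], y[labeled])      # weak classifier on view 1
        clf2.fit(X2[labeled], y[labeled])      # weak classifier on view 2
        conf1 = clf1.predict_proba(X1[unlabeled]).max(axis=1)
        conf2 = clf2.predict_proba(X2[unlabeled]).max(axis=1)
        # The most confident pseudo-labels (from either view) become new training labels.
        confident = np.argsort(-np.maximum(conf1, conf2))[:per_round]
        for j in sorted(confident, reverse=True):
            i = unlabeled.pop(j)
            y[i] = (clf1.predict(X1[i:i + 1])[0]
                    if conf1[j] >= conf2[j] else clf2.predict(X2[i:i + 1])[0])
            labeled.append(i)
    return clf1, clf2

# Toy usage: two random "views" of 100 instances, only 10 of them labeled.
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(100, 5)), rng.normal(size=(100, 8))
y = rng.integers(0, 2, size=100)
co_train(X1, X2, y.copy(), labeled_idx=range(10), unlabeled_idx=range(10, 100))
```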
| Figure 3: Types of data parallelism used in co-learning: parallel — modalities are from the same dataset and there is a direct correspondence between instances; non-parallel — modalities are from different datasets and do not have overlapping instances, but overlap in general categories or concepts; hybrid — the instances or concepts are bridged by a third modality or a dataset. | 圖3:共同學(xué)習(xí)中使用的數(shù)據(jù)并行類型:并行——模態(tài)來自同一數(shù)據(jù)集,實例之間有直接對應(yīng)關(guān)系;非并行——模態(tài)來自不同的數(shù)據(jù)集,沒有重疊的實例,但在一般類別或概念上有重疊;混合——實例或概念由第三種模態(tài)或數(shù)據(jù)集橋接。 |
| Co-training has been used for statistical parsing [178], to build better visual detectors [120], and for audio-visual speech recognition [40]. It has also been extended to deal with disagreement between modalities, by filtering out unreliable samples [41]. While co-training is a powerful method for generating more labeled data, it can also lead to biased training samples resulting in overfitting. Transfer learning is another way to exploit co-learning with parallel data. Multimodal representation learning (Section 3.1) approaches such as multimodal deep Boltzmann machines [198] and multimodal autoencoders [151] transfer information from the representation of one modality to that of another. This not only leads to multimodal representations, but also to better unimodal ones, with only one modality being used during test time [151]. Moon et al. [143] show how to transfer information from a speech recognition neural network (based on audio) to a lip-reading one (based on images), leading to a better visual representation, and a model that can be used for lip-reading without need for audio information during test time. Similarly, Arora and Livescu [10] build better acoustic features using CCA on acoustic and articulatory (location of lips, tongue and jaw) data. They use articulatory data only during CCA construction and use only the resulting acoustic (unimodal) representation during test time. | 協(xié)同訓(xùn)練已被用于統(tǒng)計解析[178]、構(gòu)建更好的視覺檢測器[120]以及視聽語音識別[40]。通過過濾掉不可靠的樣本[41],它還被擴(kuò)展到處理模態(tài)之間的分歧。雖然協(xié)同訓(xùn)練是一種生成更多標(biāo)記數(shù)據(jù)的強(qiáng)大方法,但它也可能產(chǎn)生有偏差的訓(xùn)練樣本,從而導(dǎo)致過擬合。遷移學(xué)習(xí)是利用并行數(shù)據(jù)進(jìn)行共同學(xué)習(xí)的另一種方式。多模態(tài)表示學(xué)習(xí)(3.1節(jié))方法,如多模態(tài)深度玻爾茲曼機(jī)[198]和多模態(tài)自編碼器[151],將信息從一個模態(tài)的表示遷移到另一個模態(tài)的表示。這不僅能得到多模態(tài)表示,還能得到更好的單模態(tài)表示,并且在測試期間只需使用一個模態(tài)[151]。 Moon等人[143]展示了如何將信息從語音識別神經(jīng)網(wǎng)絡(luò)(基于音頻)遷移到唇讀神經(jīng)網(wǎng)絡(luò)(基于圖像),從而獲得更好的視覺表示,并得到一個在測試期間無需音頻信息即可用于唇讀的模型。類似地,Arora和Livescu[10]在聲學(xué)和發(fā)音(嘴唇、舌頭和下巴的位置)數(shù)據(jù)上使用CCA構(gòu)建了更好的聲學(xué)特征。他們僅在構(gòu)建CCA時使用發(fā)音數(shù)據(jù),在測試期間僅使用得到的聲學(xué)(單模態(tài))表示。 |
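The Arora and Livescu [10] recipe can be sketched with scikit-learn's CCA: the articulatory view is needed only while fitting, and at test time the acoustic features alone are projected into the learned correlated subspace. Feature dimensions and data below are placeholder assumptions.

```python
# Sketch of CCA-based transfer with parallel data: fit on both views, test on one.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
acoustic_train = rng.normal(size=(500, 39))      # e.g. MFCC features (assumed)
articulatory_train = rng.normal(size=(500, 12))  # lips/tongue/jaw positions (assumed)

cca = CCA(n_components=10)
cca.fit(acoustic_train, articulatory_train)      # parallel data needed only here

# At test time the articulatory modality is unavailable: project acoustic features alone.
acoustic_test = rng.normal(size=(100, 39))
acoustic_embedding = cca.transform(acoustic_test)
```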
7.2 Non-parallel data非并行數(shù)據(jù)
| Methods that rely on non-parallel data do not require the modalities to have shared instances, but only shared categories or concepts. Non-parallel co-learning approaches can help when learning representations, allow for better semantic concept understanding, and even perform unseen object recognition. Table 6: A summary of co-learning taxonomy, based on data parallelism. Parallel data — multiple modalities can see the same instance. Non-parallel data — unimodal instances are independent of each other. Hybrid data — the modalities are pivoted through a shared modality or dataset. | 依賴于非并行數(shù)據(jù)的方法不要求模態(tài)之間擁有共享的實例,而只需要共享的類別或概念。非并行的共同學(xué)習(xí)方法可以幫助學(xué)習(xí)表示,帶來更好的語義概念理解,甚至識別未見過的物體。 表6:基于數(shù)據(jù)并行性的共同學(xué)習(xí)分類概述。并行數(shù)據(jù)——多種模態(tài)可以看到相同的實例。非并行數(shù)據(jù)——單模態(tài)實例彼此獨(dú)立。混合數(shù)據(jù)——模態(tài)以共享的模態(tài)或數(shù)據(jù)集為樞軸進(jìn)行橋接。 |
| Transfer learning is also possible on non-parallel data and allows learning better representations by transferring information from a representation built using a data rich or clean modality to a data scarce or noisy modality. This type of transfer learning is often achieved by using coordinated multimodal representations (see Section 3.2). For example, Frome et al. [61] used text to improve visual representations for image classification by coordinating CNN visual features with word2vec textual ones [141] trained on separate large datasets. Visual representations trained in such a way result in more meaningful errors — mistaking objects for ones of similar category [61]. Mahasseni and Todorovic [129] demonstrated how to regularize a color video based LSTM using an autoencoder LSTM trained on 3D skeleton data by enforcing similarities between their hidden states. Such an approach is able to improve the original LSTM and lead to state-of-the-art performance in action recognition. Conceptual grounding refers to learning semantic meanings or concepts not purely based on language but also on additional modalities such as vision, sound, or even smell [16]. While the majority of concept learning approaches are purely language-based, representations of meaning in humans are not merely a product of our linguistic exposure, but are also grounded through our sensorimotor experience and perceptual system [17], [126]. Human semantic knowledge relies heavily on perceptual information [126] and many concepts are grounded in the perceptual system and are not purely symbolic [17]. This implies that learning semantic meaning purely from textual information might not be optimal, and motivates the use of visual or acoustic cues to ground our linguistic representations. | 在非并行數(shù)據(jù)上也可以進(jìn)行遷移學(xué)習(xí),通過將信息從使用數(shù)據(jù)豐富或干凈模態(tài)構(gòu)建的表示遷移到數(shù)據(jù)稀缺或有噪聲的模態(tài),來學(xué)習(xí)更好的表示。這種類型的遷移學(xué)習(xí)通常通過協(xié)調(diào)多模態(tài)表示來實現(xiàn)(見第3.2節(jié))。例如,F(xiàn)rome等人[61]通過將CNN視覺特征與在單獨(dú)的大數(shù)據(jù)集上訓(xùn)練的word2vec文本特征[141]相協(xié)調(diào),使用文本來改進(jìn)圖像分類的視覺表示。以這種方式訓(xùn)練的視覺表示會產(chǎn)生更有意義的錯誤——把物體誤認(rèn)為相似類別的物體[61]。Mahasseni和Todorovic[129]演示了如何使用在3D骨架數(shù)據(jù)上訓(xùn)練的自編碼器LSTM,通過約束兩者隱藏狀態(tài)之間的相似性,來正則化基于彩色視頻的LSTM。這種方法能夠改進(jìn)原有的LSTM,在動作識別上達(dá)到最先進(jìn)的性能。概念基礎(chǔ)(conceptual grounding)是指不單純基于語言,而是同時基于視覺、聲音甚至嗅覺等其他模態(tài)來學(xué)習(xí)語義或概念[16]。雖然大多數(shù)概念學(xué)習(xí)方法純粹基于語言,但人類對意義的表示并不僅僅是語言接觸的產(chǎn)物,還通過我們的感覺運(yùn)動經(jīng)驗和感知系統(tǒng)建立基礎(chǔ)[17],[126]。人類的語義知識嚴(yán)重依賴于感知信息[126],許多概念根植于感知系統(tǒng)而非純粹的符號[17]。這意味著單純從文本信息中學(xué)習(xí)語義可能不是最優(yōu)的,這促使我們使用視覺或聲學(xué)線索來為語言表示建立基礎(chǔ)。 |
| Starting from work by Feng and Lapata [59], grounding is usually performed by finding a common latent space between the representations [59], [183] (in case of parallel datasets) or by learning unimodal representations separately and then concatenating them to lead to a multimodal one [29], [101], [172], [181] (in case of non-parallel data). Once a multimodal representation is constructed it can be used on purely linguistic tasks. Shutova et al. [181] and Bruni et al. [29] used grounded representations for better classification of metaphors and literal language. Such representations have also been useful for measuring conceptual similarity and relatedness — identifying how semantically or conceptually related two words [30], [101], [183] or actions [172] are. Furthermore, concepts can be grounded not only using visual signals, but also acoustic ones, leading to better performance especially on words with auditory associations [103], or even olfactory signals [102] for words with smell associations. Finally, there is a lot of overlap between multimodal alignment and conceptual grounding, as aligning visual scenes to their descriptions leads to better textual or visual representations [108], [161], [172], [240]. Conceptual grounding has been found to be an effective way to improve performance on a number of tasks. It also shows that language and vision (or audio) are complementary sources of information and combining them in multimodal models often improves performance. However, one has to be careful as grounding does not always lead to better performance [102], [103], and only makes sense when grounding has relevance for the task — such as grounding using images for visually-related concepts. | 從Feng和Lapata[59]的工作開始,概念基礎(chǔ)通常通過在表示之間尋找一個共同的潛在空間來實現(xiàn)[59],[183](適用于并行數(shù)據(jù)集),或者通過分別學(xué)習(xí)單模態(tài)表示、再將它們拼接成多模態(tài)表示來實現(xiàn)[29],[101],[172],[181](適用于非并行數(shù)據(jù))。一旦構(gòu)建了多模態(tài)表示,它就可以用于純語言任務(wù)。Shutova等人[181]和Bruni等人[29]使用有基礎(chǔ)的表示來更好地區(qū)分隱喻和字面語言。這樣的表示在度量概念相似度和相關(guān)性方面也很有用——識別兩個詞[30],[101],[183]或動作[172]在語義或概念上的關(guān)聯(lián)程度。此外,概念不僅可以用視覺信號來建立基礎(chǔ),也可以用聲學(xué)信號,這尤其能提升與聽覺相關(guān)的詞的表現(xiàn)[103];對于與氣味相關(guān)的詞,甚至可以使用嗅覺信號[102]。最后,多模態(tài)對齊和概念基礎(chǔ)之間有很多重疊,因為將視覺場景與其描述對齊可以得到更好的文本或視覺表示[108],[161],[172],[240]。 概念基礎(chǔ)已被證明是提高許多任務(wù)性能的有效方法。它還表明,語言和視覺(或音頻)是互補(bǔ)的信息來源,在多模態(tài)模型中將它們結(jié)合通常可以提高性能。然而必須小心,因為概念基礎(chǔ)并不總是帶來更好的性能[102],[103],只有當(dāng)基礎(chǔ)與任務(wù)相關(guān)時才有意義——例如對視覺相關(guān)的概念使用圖像來建立基礎(chǔ)。 |
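As a toy illustration of the simplest grounding recipe above for non-parallel data (learning unimodal representations separately and concatenating them), the sketch below builds grounded word vectors and compares them with cosine similarity. The embedding tables are random placeholders standing in for, e.g., word2vec vectors and CNN-derived visual prototypes; nothing here reproduces a specific published model.

```python
# Sketch of grounding by concatenating independently trained unimodal embeddings.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
text_emb = {w: rng.normal(size=300) for w in ["dog", "cat", "piano"]}    # stand-in for word2vec
visual_emb = {w: rng.normal(size=128) for w in ["dog", "cat", "piano"]}  # stand-in for CNN prototypes

def grounded(word):
    # Concatenation of the two unimodal representations of the same concept.
    return np.concatenate([text_emb[word], visual_emb[word]])

# The grounded vectors can then be used for purely linguistic tasks such as
# similarity or relatedness estimation.
print(cosine(grounded("dog"), grounded("cat")))
print(cosine(grounded("dog"), grounded("piano")))
```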
| Zero shot learning (ZSL) refers to recognizing a concept without having explicitly seen any examples of it, for example classifying a cat in an image without ever having seen (labeled) images of cats. This is an important problem to address as in a number of tasks, such as visual object classification, it is prohibitively expensive to provide training examples for every imaginable object of interest. There are two main types of ZSL — unimodal and multimodal. The unimodal ZSL looks at component parts or attributes of the object, such as phonemes to recognize an unheard word or visual attributes such as color, size, and shape to predict an unseen visual class [55]. The multimodal ZSL recognizes the objects in the primary modality through the help of the secondary one — in which the object has been seen. The multimodal version of ZSL is a problem facing non-parallel data by definition as the overlap of seen classes is different between the modalities. Socher et al. [190] map image features to a conceptual word space and are able to classify between seen and unseen concepts. The unseen concepts can then be assigned to a word that is close to the visual representation — this is enabled by the semantic space being trained on a separate dataset that has seen more concepts. Instead of learning a mapping from visual to concept space, Frome et al. [61] learn a coordinated multimodal representation between concepts and images that allows for ZSL. Palatucci et al. [158] perform prediction of words people are thinking of based on functional magnetic resonance images; they show how it is possible to predict unseen words through the use of an intermediate semantic space. Lazaridou et al. [118] present a fast mapping method for ZSL by mapping extracted visual feature vectors to text-based vectors through a neural network. | 零樣本學(xué)習(xí)(ZSL)指的是在沒有明確見過任何示例的情況下識別一個概念,例如在從未見過(有標(biāo)記的)貓的圖像的情況下識別圖像中的貓。這是一個需要解決的重要問題,因為在視覺物體分類等許多任務(wù)中,為每一個可以想象到的感興趣的物體都提供訓(xùn)練樣本代價過高。 ZSL主要有兩種類型——單模態(tài)和多模態(tài)。單模態(tài)ZSL著眼于物體的組成部分或?qū)傩?#xff0c;例如用音素識別未聽過的單詞,或用顏色、大小和形狀等視覺屬性預(yù)測未見過的視覺類別[55]。多模態(tài)ZSL借助輔助模態(tài)(該物體在其中已被看到過)來識別主模態(tài)中的物體。根據(jù)定義,多模態(tài)版本的ZSL面臨的是非并行數(shù)據(jù)問題,因為兩種模態(tài)所見類別的重疊是不同的。 Socher等人[190]將圖像特征映射到概念詞空間,能夠區(qū)分見過和未見過的概念。未見過的概念隨后可以被分配給一個與視覺表示接近的單詞——這得益于語義空間是在一個見過更多概念的單獨(dú)數(shù)據(jù)集上訓(xùn)練的。Frome等人[61]沒有學(xué)習(xí)從視覺空間到概念空間的映射,而是學(xué)習(xí)概念和圖像之間的協(xié)調(diào)多模態(tài)表示,從而實現(xiàn)ZSL。Palatucci等人[158]基于功能性磁共振圖像預(yù)測人們正在思考的單詞,他們展示了如何通過使用中間語義空間來預(yù)測未見過的單詞。Lazaridou等人[118]提出了一種用于ZSL的快速映射方法,通過神經(jīng)網(wǎng)絡(luò)將提取的視覺特征向量映射到基于文本的向量。 |
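A rough sketch of the cross-modal mapping flavour of multimodal ZSL discussed above (in the spirit of [118], [190]): a map from image features into a word-embedding space is fit on seen classes only, and an unseen class is predicted by the nearest word vector. The ridge regressor standing in for the mapping network, the dimensions, and the random data are assumptions used purely for illustration.

```python
# Sketch of zero-shot recognition via a learned visual-to-semantic mapping.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
word_vec = {c: rng.normal(size=300) for c in ["dog", "horse", "cat"]}  # stand-in for word2vec

# Training images come only from the *seen* classes "dog" and "horse".
X_seen = rng.normal(size=(200, 512))                    # stand-in CNN image features
y_seen = rng.choice(["dog", "horse"], size=200)
targets = np.stack([word_vec[c] for c in y_seen])

mapper = Ridge(alpha=1.0).fit(X_seen, targets)          # visual -> semantic space

def predict(x, candidate_classes):
    z = mapper.predict(x[None, :])[0]
    sims = {c: z @ word_vec[c] / (np.linalg.norm(z) * np.linalg.norm(word_vec[c]))
            for c in candidate_classes}
    return max(sims, key=sims.get)

# An image can be assigned to the unseen class "cat" because its word vector
# lives in the same semantic space as the seen classes.
print(predict(rng.normal(size=512), ["dog", "horse", "cat"]))
```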
7.3 Hybrid data混合數(shù)據(jù)
| In the hybrid data setting two non-parallel modalities are bridged by a shared modality or a dataset (see Figure 3c). The most notable example is the Bridge Correlational Neural Network [167], which uses a pivot modality to learn coordinated multimodal representations in presence of non-parallel data. For example, in the case of multilingual image captioning, the image modality would always be paired with at least one caption in any language. Such methods have also been used to bridge languages that might not have parallel corpora but have access to a shared pivot language, such as for machine translation [148], [167] and document transliteration [100]. | 在混合數(shù)據(jù)設(shè)置中,兩個非并行模態(tài)由一個共享的模態(tài)或數(shù)據(jù)集橋接起來(見圖3c)。最著名的例子是橋接相關(guān)神經(jīng)網(wǎng)絡(luò)(Bridge Correlational Neural Network)[167],它使用一個樞軸模態(tài),在存在非并行數(shù)據(jù)的情況下學(xué)習(xí)協(xié)調(diào)的多模態(tài)表示。例如,在多語言圖像描述的情況下,圖像模態(tài)總是與至少一種語言的描述配對。這些方法也被用于橋接那些可能沒有并行語料庫、但可以使用共享樞軸語言的語言,例如機(jī)器翻譯[148],[167]和文檔音譯[100]。 |
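The pivot idea can be sketched as follows: modality A is paired only with the pivot, modality B is paired only with the pivot, yet after both are mapped into the pivot space they become directly comparable. Plain least-squares maps stand in for the coordinated networks of [167]; all shapes and data below are illustrative assumptions.

```python
# Simplified sketch of bridging two non-parallel modalities through a pivot.
import numpy as np

rng = np.random.default_rng(0)
P_a = rng.normal(size=(300, 100))   # pivot (e.g. image) embeddings paired with English captions
A = rng.normal(size=(300, 80))      # English caption features
P_b = rng.normal(size=(400, 100))   # pivot embeddings paired with French captions
B = rng.normal(size=(400, 120))     # French caption features (never paired with A)

# Learn linear maps A -> pivot space and B -> pivot space from the two
# disjoint paired datasets.
W_a, _, _, _ = np.linalg.lstsq(A, P_a, rcond=None)
W_b, _, _, _ = np.linalg.lstsq(B, P_b, rcond=None)

# Any A instance and any B instance now live in the shared pivot space and can
# be matched directly, e.g. by cosine similarity.
a_vec, b_vec = rng.normal(size=80) @ W_a, rng.normal(size=120) @ W_b
score = a_vec @ b_vec / (np.linalg.norm(a_vec) * np.linalg.norm(b_vec))
```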
| Instead of using a separate modality for bridging, some methods rely on the existence of large datasets from a similar or related task to lead to better performance in a task that only contains limited annotated data. Socher and Fei-Fei [189] use the existence of large text corpora in order to guide image segmentation. Hendricks et al. [78] use a separately trained visual model and a language model to build a better image and video description system, for which only limited data is available. | 有些方法不使用單獨(dú)的模態(tài)進(jìn)行橋接,而是依賴來自類似或相關(guān)任務(wù)的大型數(shù)據(jù)集,在只包含有限標(biāo)注數(shù)據(jù)的任務(wù)中獲得更好的性能。Socher和Fei-Fei[189]利用已有的大型文本語料庫來指導(dǎo)圖像分割。Hendricks等人[78]則使用分別訓(xùn)練的視覺模型和語言模型,為只有有限數(shù)據(jù)可用的任務(wù)構(gòu)建了更好的圖像和視頻描述系統(tǒng)。 |
7.4 Discussion討論
| Multimodal co-learning allows for one modality to influence the training of another, exploiting the complementary information across modalities. It is important to note that co-learning is task independent and could be used to create better fusion, translation, and alignment models. This challenge is exemplified by algorithms such as co-training, multimodal representation learning, conceptual grounding, and zero shot learning (ZSL) and has found many applications in visual classification, action recognition, audio-visual speech recognition, and semantic similarity estimation. | 多模態(tài)共同學(xué)習(xí)允許一種模態(tài)影響另一種模態(tài)的訓(xùn)練,利用不同模態(tài)之間的互補(bǔ)信息。值得注意的是,共同學(xué)習(xí)是獨(dú)立于任務(wù)的,可以用來創(chuàng)建更好的融合、翻譯和對齊模型。協(xié)同訓(xùn)練、多模態(tài)表示學(xué)習(xí)、概念基礎(chǔ)和零樣本學(xué)習(xí)(ZSL)等算法都體現(xiàn)了這一挑戰(zhàn),并在視覺分類、動作識別、視聽語音識別和語義相似度估計等方面得到了許多應(yīng)用。 |
8 Conclusion結(jié)論
| As part of this survey, we introduced a taxonomy of multimodal machine learning: representation, translation, fusion, alignment, and co-learning. Some of them, such as fusion, have been studied for a long time, but more recent interest in representation and translation has led to a large number of new multimodal algorithms and exciting multimodal applications. We believe that our taxonomy will help to catalog future research papers and also better understand the remaining unresolved problems facing multimodal machine learning. | 作為本綜述的一部分,我們介紹了多模態(tài)機(jī)器學(xué)習(xí)的一種分類:表示、翻譯、融合、對齊和共同學(xué)習(xí)。 其中一些方向(如融合)已經(jīng)被研究了很長時間,但最近對表示和翻譯的興趣催生了大量新的多模態(tài)算法和令人興奮的多模態(tài)應(yīng)用。 我們相信我們的分類法將有助于對未來的研究論文進(jìn)行歸類,也有助于更好地理解多模態(tài)機(jī)器學(xué)習(xí)尚待解決的問題。 |