
Distilled Dual-Encoder Model for Vision-Language Understanding



Zekun Wang¹, Wenhui Wang², Haichao Zhu¹, Ming Liu¹, Bing Qin¹, Furu Wei²
¹Harbin Institute of Technology, Harbin, China
²Microsoft Research, Beijing, China
{zkwang,hczhu,mliu,qinb}@ir.hit.edu.cn, {wenwan,fuwei}@microsoft.com

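Abstract

We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering. Dual-encoder models have a faster inference speed than fusion-encoder models and allow image and text representations to be pre-computed for inference. However, the shallow interaction module used in dual-encoder models is insufficient for complex vision-language understanding tasks. To learn deep interactions between images and text, we introduce cross-modal attention distillation, which uses the image-to-text and text-to-image attention distributions of a fusion-encoder model to guide the training of our dual-encoder model. In addition, we show that applying cross-modal attention distillation in both the pre-training and fine-tuning stages yields further improvements. Experimental results demonstrate that the distilled dual-encoder model achieves competitive performance on visual reasoning, visual entailment and visual question answering tasks while being much faster at inference than fusion-encoder models. Our code and models will be available at https://github.com/kugwzk/Distilled-DualEncoder.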

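1 Introduction

Vision-language (VL) pre-trained models (Li et al., 2019; Lu et al., 2019; Tan and Bansal, 2019; Su et al., 2020; Chen et al., 2020; Li et al., 2020, 2021b; Zhang et al., 2021; Kim et al., 2021; Radford et al., 2021; Li et al., 2021a) learn cross-modal representations from large-scale image-text pairs and can be directly fine-tuned for various downstream VL tasks, such as vision-language understanding/classification (visual reasoning (Suhr et al., 2019), visual question answering (Goyal et al., 2017)) and image-text retrieval (Young et al., 2014). Depending on how they model cross-modal interactions, these models fall into two categories.

The first category is fusion-encoder models (Lu et al., 2019; Chen et al., 2020; Li et al., 2020; Kim et al., 2021; Li et al., 2021a), which use an effective but less efficient Transformer (Vaswani et al., 2017) encoder to capture image-text interactions with cross-modal attention. Most models in this category (Li et al., 2019; Lu et al., 2019; Chen et al., 2020; Li et al., 2020; Zhang et al., 2021) rely on an off-the-shelf object detector to extract image region features, which further impedes their efficiency. Recently, ViLT (Kim et al., 2021) discards the detector and directly encodes image patches with a Vision Transformer (Dosovitskiy et al., 2021), in the same way as text tokens. It achieves competitive performance on VL understanding and retrieval tasks while improving efficiency. However, because images and text must be encoded jointly, the Transformer-based cross-modal interaction remains an efficiency bottleneck, which limits its use in tasks with massive numbers of image or text candidates.

Works in the second category, including CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021), adopt a dual-encoder architecture that encodes images and text separately. Cross-modal interactions are modeled by a shallow fusion module, usually a multi-layer perceptron (MLP) network or a dot product, which is extremely light compared with the Transformer encoder in fusion-encoder models. In addition, the disentangled encoding supports offline computation and caching of image and text candidates, which scales well to massive candidate sets. These changes reduce inference time in both understanding and retrieval tasks, making the models practical in real-life scenarios. Dual-encoder models achieve promising performance on image-text retrieval tasks. However, they fall far behind fusion-encoder models on vision-language understanding tasks that require complex cross-modal reasoning, such as NLVR2.

In this work, we propose a cross-modal attention distillation framework to train a dual-encoder vision-language model. The distilled dual-encoder model achieves competitive performance on vision-language understanding tasks with a much faster inference speed than fusion-encoder models. In addition to soft label distillation (Hinton et al., 2015), we introduce cross-modal attention distillation as fine-grained supervision that helps the dual-encoder model (student) learn cross-modal reasoning. Specifically, we use both the image-to-text and text-to-image attention distributions of a fusion-encoder model (teacher) for distillation.

Our distillation framework can be applied in both the pre-training and fine-tuning stages. During pre-training, we apply the distillation objectives to the image-text contrastive learning and image-text matching tasks. During fine-tuning, the task-specific knowledge of the fine-tuned teacher model is transferred to the student model.

We evaluate our model on vision-language understanding tasks and image-text retrieval tasks. Experimental results show that our distilled dual-encoder model is competitive on visual entailment, visual reasoning and visual question answering, retaining 99.9%, 97.8% and 95.5% of the teacher's performance respectively, while being more than 3× faster at inference than the fusion-encoder teacher model.

Moreover, the proposed cross-modal attention distillation also improves performance on retrieval tasks and even outperforms the teacher model on image retrieval. Compared with other latent features, cross-modal attention helps the dual-encoder model learn better cross-modal reasoning ability, yielding significant gains on VL understanding tasks. Finally, the model obtained with two-stage distillation performs better than the one obtained with single-stage distillation.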

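2 Related Work

2.1 Vision-Language Pre-Training

Language and vision pre-training have advanced the state of the art on downstream natural language processing tasks (Radford et al., 2018; Devlin et al., 2019; Dong et al., 2019; Liu et al., 2019; Bao et al., 2020; Lewis et al., 2020; Raffel et al., 2020; Conneau and Lample, 2019; Chi et al., 2020; Conneau et al., 2020; Chi et al., 2021a,b; Ma et al., 2021) and computer vision tasks (Dosovitskiy et al., 2021; Touvron et al., 2021; Bao et al., 2021). Vision-language pre-training (Lu et al., 2019; Tan and Bansal, 2019; Su et al., 2020; Gan et al., 2020; Li et al., 2020, 2021b; Wang et al., 2021a,c) has also proven effective for learning cross-modal representations. The architectures of these VL models fall into two lines.

The first line of work (Li et al., 2019; Lu et al., 2019; Tan and Bansal, 2019; Chen et al., 2020; Zhou et al., 2020; Zhang et al., 2021; Kim et al., 2021; Li et al., 2021a) uses a fusion encoder to learn cross-modal interactions. These models first encode the image-text pair into vectors and then use a multi-layer Transformer (Vaswani et al., 2017) network to fuse the visual and textual representations. Most previous models extract visual features with an object detector (e.g., Faster R-CNN (Ren et al., 2015)), which must be pre-trained on expensive annotated datasets with a fixed set of object classes, such as Visual Genome (Krishna et al., 2017). Moreover, object detectors require high-resolution input images and add considerable computational cost.

Recently, Huang et al. (2020) and Xu et al. (2021) take image pixels directly as input and feed them into convolutional neural networks to obtain visual grid features instead of the earlier region features. Li et al. (2021a) use a Vision Transformer (Dosovitskiy et al., 2021) to extract image features for the multimodal fusion encoder.

ViLT (Kim et al., 2021) encodes image patches directly through a simple embedding layer. A multimodal Transformer then jointly encodes the visual and textual embeddings. It achieves competitive performance on VL tasks with less overhead. Fusion-encoder models exhibit strong cross-modal modeling ability and achieve superior results on VL understanding tasks that require complex cross-modal reasoning, such as NLVR2 (Suhr et al., 2019).

However, fusion-encoder models still rely on a cross-modal Transformer that encodes and fuses visual and textual representations simultaneously across layers, which requires a heavy computation budget and leads to slow inference.

The other line of work (Radford et al., 2021; Jia et al., 2021; Sun et al., 2021) adopts a dual-encoder architecture that encodes images and text separately and models their interaction with a dot product or an MLP network. Compared with fusion-encoder models, dual-encoder models are computationally more efficient. The multi-head attention mechanism is applied only to tokens of the same modality, which reduces the attention complexity from $O((N+M)^2)$ for fusion-encoder models to $O(N^2 + M^2)$, where $N$ and $M$ are the lengths of the visual and textual features, respectively. In addition, because the encoders are independent, visual or textual representations can be pre-computed and cached in real applications. Dual-encoder models achieve promising performance on image-text retrieval. However, the shallow interaction module is insufficient for complex VL understanding tasks that require deeper cross-modal interactions, which leads to a significant performance drop. To improve dual-encoder models on complex VL understanding tasks, we introduce a cross-modal attention distillation framework that helps the model learn deeper interactions.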
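As a rough worked example of the complexity argument above, the following sketch counts the pairwise attention scores per head and layer under joint and separate encoding. The lengths N = 240 and M = 40 are illustrative assumptions (roughly a 384 × 640 image cut into 32 × 32 patches and a 40-token text), special tokens are ignored, and the function names are hypothetical.

```python
def fusion_attention_scores(num_visual: int, num_text: int) -> int:
    """Pairwise scores when all tokens attend jointly: (N + M)^2."""
    return (num_visual + num_text) ** 2


def dual_attention_scores(num_visual: int, num_text: int) -> int:
    """Pairwise scores when each modality attends only to itself: N^2 + M^2."""
    return num_visual ** 2 + num_text ** 2


# Illustrative lengths (assumptions, not measurements from the paper).
N, M = 240, 40
fusion = fusion_attention_scores(N, M)
dual = dual_attention_scores(N, M)
# The difference is exactly the 2 * N * M cross-modal scores that a
# dual encoder never computes inside its Transformer layers.
print(fusion, dual, fusion - dual)
```

The saving grows with the product of the two sequence lengths, so it matters most when both the image and the text are long.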

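2.2 Knowledge Distillation

Knowledge distillation (KD) aims to transfer the knowledge learned by a strong teacher model to a student model so that the student performs competitively. Hinton et al. (2015) use the soft label distribution of the teacher model to train the student model. More recently, student models have been improved further by mimicking the teacher's intermediate representations, such as hidden states (Romero et al., 2015) and attention distributions (Zagoruyko and Komodakis, 2017).

Knowledge distillation has also been widely used to compress and improve Transformer-based models across fields (Sun et al., 2019; Jiao et al., 2020; Wang et al., 2020a,b, 2021b; Touvron et al., 2021). In this work, we use the cross-modal attention knowledge of a fusion-encoder teacher model to guide the training of a dual-encoder model. Our distillation framework improves dual-encoder models on complex VL understanding tasks.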

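3 Method

In this section, we describe the cross-modal attention distillation framework used to train the dual-encoder model. Figure 1 gives an overview of our method. We take a fusion-encoder model as the teacher and use its cross-modal attention knowledge and soft labels to train the dual-encoder student model. The distillation objectives are applied in both the pre-training and fine-tuning stages and help the dual-encoder model learn interactions between the modalities.

3.1 Model Overview

Our distillation framework can use different fusion-encoder models as the teacher. In this work, we use ViLT (Kim et al., 2021) as the teacher model in our experiments because it is simple and effective.

Input representations. Given an image-text pair $(v, t)$ as input, we slice the image $v \in \mathbb{R}^{H \times W \times C}$ into patches $v^p \in \mathbb{R}^{N \times (P^2 C)}$, where $N = HW/P^2$ is the number of patches, $(H, W)$ is the input image resolution, $(P, P)$ is the resolution of each patch, and $C$ is the number of channels.

The input text $t$ is tokenized by WordPiece (Wu et al., 2016) into a sequence of $M$ subword tokens, as in BERT (Devlin et al., 2019). We then prepend the special tokens [I_CLS] and [T_CLS] to the image patch sequence and the text subword token sequence, respectively.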
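The patch slicing described above can be sketched in a few lines. This is a minimal illustration on nested Python lists rather than the authors' implementation; the function name `patchify` and the row-major ordering of patches and pixels are assumptions.

```python
def patchify(image, patch):
    """Split an H x W x C nested-list image into N = HW / P^2 flattened patches.

    Each returned patch is a flat list of length P * P * C.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("H and W must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    flat.extend(image[row][col])  # append the C channel values
            patches.append(flat)
    return patches


# Toy 4 x 6 image with 3 channels, cut into 2 x 2 patches.
toy_image = [[[r, c, r + c] for c in range(6)] for r in range(4)]
toy_patches = patchify(toy_image, 2)
print(len(toy_patches), len(toy_patches[0]))
```

For the toy input this gives N = (4 · 6) / 2² = 6 patches, each of dimension P²C = 12.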
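We linearly project the image patches $v^p$ to obtain patch embeddings, and the final visual input embeddings $H_0^v \in \mathbb{R}^{(N+1) \times D}$ are computed as:

$$H_0^v = [v_{[\mathrm{I\_CLS}]}, V v_1^p, \ldots, V v_N^p] + V_{pos} + V_{type} \tag{1}$$

where $V \in \mathbb{R}^{(P^2 C) \times D}$ is the linear projection, $V_{pos} \in \mathbb{R}^{(N+1) \times D}$ is a learnable 1D position embedding, and $V_{type} \in \mathbb{R}^{D}$ is the visual type embedding.

The textual input embeddings $H_0^t \in \mathbb{R}^{(M+1) \times D}$ are obtained by summing the word embeddings, the text position embedding and the text type embedding:

$$H_0^t = [w_{[\mathrm{T\_CLS}]}, w_1, \ldots, w_M] + T_{pos} + T_{type} \tag{2}$$

where $w_i$ are the word embeddings, $T_{pos}$ is the text position embedding and $T_{type}$ is the text type embedding. We take $H_0^v$ and $H_0^t$ as the visual and textual inputs of both the teacher and the student models.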
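Teacher: fusion-encoder model. The input representations $H_0^v$ and $H_0^t$ are concatenated as $H_0^{vl} = [H_0^v; H_0^t]$, and the vectors are fed into an $L$-layer cross-modal Transformer encoder to obtain contextual representations:

$$H_l^{vl} = \mathrm{Transformer}_{vl}(H_{l-1}^{vl}) \tag{3}$$

where $l \in [1, L]$. The cross-modal Transformer encoder fuses the representations of the different modalities through the multi-head attention mechanism. Specifically, for each head $a \in [1, A_h]$ in layer $l$, the attention distribution $A_a^{vl}$ is computed as:

$$A_a^{vl} = \mathrm{softmax}\left(\frac{Q_a^{vl} {K_a^{vl}}^{\top}}{\sqrt{d_k}}\right) \tag{4}$$

where the queries $Q_a^{vl}$ and keys $K_a^{vl}$ are obtained by linearly projecting the hidden states of the previous layer with separate learned parameters, and $d_k$ is the attention head size. The output vectors of the [I_CLS] and [T_CLS] tokens of the last layer are fed into a task-specific layer to obtain predictions.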
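The scaled dot-product attention distribution for a single head can be sketched as follows. This is a plain-Python illustration of the standard formula softmax(QKᵀ/√d_k) rather than the authors' code; the helper names are hypothetical and the projection that produces the queries and keys is assumed to have been applied already.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_distribution(queries, keys):
    """Row-wise softmax(Q K^T / sqrt(d_k)) for one attention head.

    queries and keys are lists of d_k-dimensional vectors. The result has one
    row per query token, and each row sums to 1 over the key tokens.
    """
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]
```

In the fusion encoder the queries and keys cover all N + M tokens, so every row mixes attention over both image and text positions.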
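Student: dual-encoder model. The dual-encoder model encodes the visual embeddings ($H_0^v$) and the textual embeddings ($H_0^t$) separately with vision and text Transformer-based encoders:

$$H_l^v = \mathrm{Transformer}_v(H_{l-1}^v), \qquad H_l^t = \mathrm{Transformer}_t(H_{l-1}^t) \tag{5}$$

The output vectors of the [I_CLS] and [T_CLS] tokens of the last layer are used as the final representations of the image and the text. We adopt a shallow module $f$ to fuse the two representations. For vision-language understanding tasks such as VQA, the module $f$ is an MLP network. For image-text retrieval, we use the dot product to obtain the similarity scores of image-text pairs.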

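Figure 1: Overview of our cross-modal attention distillation framework. In addition to soft labels, we use the cross-modal attention knowledge of the fusion-encoder model (teacher), including the image-to-text and text-to-image attention distributions, to guide the training of the dual-encoder model (student).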

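3.2 Distillation Objectives

Cross-modal attention distillation. To help the dual-encoder model capture deep interactions between images and text, we use the cross-modal attention knowledge of a fusion-encoder model to guide its training. Specifically, we use the image-to-text and text-to-image attention distributions to train the dual-encoder model.

The fusion-encoder teacher model captures cross-modal interactions through the multi-head attention mechanism in Equation 4. The whole attention distribution $A^{vl} \in \mathbb{R}^{A_h \times (N+M) \times (N+M)}$ can be split into two parts, where $N$ and $M$ denote the lengths of the image and text inputs. The first part is the uni-modal attention ($A^{v2v} \in \mathbb{R}^{A_h \times N \times N}$ and $A^{t2t} \in \mathbb{R}^{A_h \times M \times M}$), which models interactions among tokens of the same modality. The second part is the cross-modal attention, comprising the image-to-text attention distribution ($A^{v2t} \in \mathbb{R}^{A_h \times N \times M}$) and the text-to-image attention distribution ($A^{t2v} \in \mathbb{R}^{A_h \times M \times N}$). The cross-modal attention distributions capture the interactions between visual and textual feature vectors. Because the separate encoding of a dual encoder only models interactions among tokens of the same modality, we introduce cross-modal attention distillation to encourage the dual-encoder model to mimic the image-text alignment of the fusion-encoder model. The cross-modal (image-to-text and text-to-image) attention distributions of the dual-encoder model are computed as:

$$A_a^{v2t} = \mathrm{softmax}\left(\frac{Q_a^{v} {K_a^{t}}^{\top}}{\sqrt{d_k}}\right), \qquad A_a^{t2v} = \mathrm{softmax}\left(\frac{Q_a^{t} {K_a^{v}}^{\top}}{\sqrt{d_k}}\right) \tag{6}$$

where $Q_a^v$ and $K_a^v$ are the visual queries and keys of the self-attention module, and $Q_a^t$ and $K_a^t$ are the queries and keys of the textual input. We recompute the teacher's cross-modal attention distributions in the same way instead of directly splitting the original attention distribution $A^{vl}$. The cross-modal attention distillation loss is computed as:

$$\mathcal{L}_{CA} = D_{KL}\left(A_S^{v2t} \,\|\, A_T^{v2t}\right) + D_{KL}\left(A_S^{t2v} \,\|\, A_T^{t2v}\right) \tag{7}$$

where $D_{KL}$ is the Kullback-Leibler divergence and the subscripts $S$ and $T$ denote the student and the teacher. Inspired by Wang et al. (2020b), we transfer only the cross-modal attention knowledge of the last layer of the teacher model.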
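The cross-modal attention distillation loss can be illustrated with a small, self-contained sketch. It is not the authors' implementation: it handles one head of the last layer, averages the KL terms over query rows, and takes the divergence from the student distribution to the teacher distribution. The row averaging, the argument order of the divergence, and the dictionary keys `q_v`, `k_v`, `q_t`, `k_t` are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys):
    """softmax(Q K^T / sqrt(d_k)) with queries from one modality, keys from the other."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def cross_modal_attention_loss(student, teacher):
    """Image-to-text plus text-to-image KL terms, each averaged over query rows.

    `student` and `teacher` map the hypothetical keys "q_v", "k_v", "q_t",
    "k_t" to per-head query/key matrices (lists of vectors).
    """
    loss = 0.0
    for q_name, k_name in (("q_v", "k_t"), ("q_t", "k_v")):
        s_attn = cross_attention(student[q_name], student[k_name])
        t_attn = cross_attention(teacher[q_name], teacher[k_name])
        rows = [kl_divergence(s, t) for s, t in zip(s_attn, t_attn)]
        loss += sum(rows) / len(rows)
    return loss
```

Note that the teacher's cross-modal attention is recomputed from its queries and keys with a softmax over the other modality only, mirroring the recomputation described in the text, rather than sliced out of the joint attention matrix.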
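Soft label distillation. In addition to mimicking the cross-modal attention distributions, we use the predictions of the teacher model as soft labels to improve the student. The soft label loss is computed as:

$$\mathcal{L}_{SL} = D_{KL}\left(z_S \,\|\, z_T\right) \tag{8}$$

where $z_S$ and $z_T$ are the predicted logits of the student and the teacher, respectively.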
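A soft label loss of this kind can be sketched as below. Because a KL divergence is defined on probability distributions, the sketch first applies a softmax to both logit vectors; that normalisation step, the absence of a temperature, and the student-to-teacher argument order are assumptions of this illustration rather than details confirmed by the text.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(student_logits, teacher_logits):
    """KL divergence between the class distributions induced by the two logit vectors."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    return sum(ps * math.log(ps / pt) for ps, pt in zip(p_student, p_teacher))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two predictions diverge, which is what lets the teacher's relative confidence over wrong answers act as a training signal.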

Table 1: Training objectives used for different tasks during pre-training and fine-tuning.

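3.3 Two-Stage Distillation Framework

We train the dual-encoder student model with the proposed knowledge distillation objectives in a two-stage framework consisting of pre-training distillation and fine-tuning distillation. In both stages, the fusion-encoder model helps the dual-encoder model learn cross-modal interactions.

As shown in Table 1, we train the model with different objectives according to the characteristics of each task.

3.3.1 Pre-Training Distillation

During pre-training, the dual-encoder student model is trained on large-scale image-text pairs to learn generic cross-modal representations, using image-text matching, image-text contrastive learning and masked language modeling tasks. The pre-trained fusion-encoder model ViLT (Kim et al., 2021) is used as the teacher model.

Image-text matching (ITM). The goal of image-text matching is to predict whether the input image and text are matched. Following ViLT (Kim et al., 2021), we replace the matched image with probability 0.5 to construct negative pairs. We train the dual-encoder model with the cross-modal attention distillation loss and the soft label loss on the ITM input pairs.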
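The negative-pair construction for ITM can be sketched as follows. The function name is hypothetical, and drawing the replacement image from another pair in the same batch is an assumption made to keep the example self-contained; it also assumes that images from different pairs do not match each other's text.

```python
import random


def build_itm_pairs(pairs, replace_prob=0.5, rng=None):
    """Return (image, text, label) triples; label 1 = matched, 0 = mismatched.

    With probability `replace_prob`, the matched image is swapped for the image
    of a different pair in the batch, which yields a negative example.
    """
    rng = rng or random.Random()
    examples = []
    for index, (image, text) in enumerate(pairs):
        if len(pairs) > 1 and rng.random() < replace_prob:
            other = rng.choice([i for i in range(len(pairs)) if i != index])
            examples.append((pairs[other][0], text, 0))
        else:
            examples.append((image, text, 1))
    return examples
```

Passing a seeded `random.Random` instance makes the sampling reproducible, which is convenient when comparing teacher and student on identical batches.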
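Image-text contrastive learning (ITC). We use a contrastive loss with in-batch negative sampling to optimize the shared space of visual and textual representations. Given a batch of $N$ image-text pairs, we obtain $N$ matched pairs and $N^2 - N$ negative pairs. Image-text contrastive learning aims to identify the matched pairs among all possible pairs. A fusion-encoder model would have to jointly encode every pair to produce soft labels, which leads to quadratic time complexity. Therefore, for training efficiency, we apply only cross-modal attention distillation together with the ground-truth labels. Specifically, we consider only the cross-modal attention distributions computed on the $N$ matched pairs.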
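An in-batch contrastive objective of this kind can be sketched as below. This is a generic symmetric formulation and not necessarily the exact loss used by the authors: the temperature value of 0.07, the averaging of the image-to-text and text-to-image directions, and the function names are assumptions.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(scores)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_total for s in scores]


def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric in-batch contrastive loss over N image-text pairs.

    Pair i is the positive for row and column i of the similarity matrix;
    the remaining N^2 - N entries serve as negatives.
    """
    n = len(image_vecs)
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_vecs]
        for img in image_vecs
    ]
    image_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    text_to_image = -sum(
        log_softmax([sim[j][i] for j in range(n)])[i] for i in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2
```

Only dot products between separately encoded vectors are needed here, so the full N × N similarity matrix costs N image encodings and N text encodings rather than N² joint encodings.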
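Masked language modeling (MLM). The goal of masked language modeling is to recover the masked tokens from all the other unmasked tokens. We use a masking probability of 15%, as in BERT (Devlin et al., 2019). To improve training speed, we train the model on the MLM task with ground-truth labels only.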
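The masking step can be sketched as follows. This is a simplified illustration: every selected token is replaced by a `[MASK]` placeholder, whereas BERT additionally keeps or randomly replaces a fraction of the selected tokens; the function name and return format are assumptions.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """Mask each token independently with probability `mask_prob`.

    Returns (masked_tokens, labels). labels holds the original token at masked
    positions and None elsewhere, so a loss is computed only where a token
    was actually hidden.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels
```

Because the labels are the original tokens themselves, this objective needs no teacher forward pass, which is consistent with training MLM on ground-truth labels for speed.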

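3.3.2 Fine-Tuning Distillation

During fine-tuning, we use the fine-tuned ViLT as the teacher model and perform cross-modal attention distillation on the downstream task data.

Vision-language understanding. For vision-language understanding tasks such as visual reasoning and VQA, we fine-tune the student model with the cross-modal attention distillation loss and the soft label loss.

Image-text retrieval. For retrieval tasks, we train the student under the supervision of the teacher's cross-modal attention distributions and the ground-truth labels, which keeps training efficient.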

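4 Experiments

4.1 Datasets

Following previous work (Chen et al., 2020; Kim et al., 2021), we use four datasets for pre-training: COCO (Lin et al., 2014), Conceptual Captions (Sharma et al., 2018), SBU Captions (Ordonez et al., 2011) and Visual Genome (Krishna et al., 2017).

We evaluate our dual-encoder model on three vision-language understanding/classification datasets and one image-text retrieval dataset. Table 2 shows the statistics of the four datasets.

Visual reasoning. The NLVR2 (Suhr et al., 2019) dataset is a visual reasoning task that asks whether a textual statement describes a pair of images. Following previous work (Li et al., 2020; Kim et al., 2021), we construct two image-text pairs as input, each consisting of one image and the textual statement. The final representations of the two pairs are fed into a classifier layer to obtain the prediction.

Visual entailment. The SNLI-VE (Xie et al., 2019) dataset aims to predict the relationship between an image and a text description. As in previous work, we treat SNLI-VE as a three-way classification task (Chen et al., 2020; Li et al., 2021a).

Visual question answering. This task requires the model to answer questions about an image. We evaluate on the widely used VQAv2 (Goyal et al., 2017) dataset. Following Anderson et al. (2018), we formulate the problem as a classification task with 3,129 answer candidates.

Image-text retrieval. This task consists of two sub-tasks: image retrieval and text retrieval. We evaluate on the Flickr30K (Plummer et al., 2015) dataset and follow the split of Karpathy and Fei-Fei (2015).

Table 2: Statistics of the downstream vision-language datasets.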

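4.2 Implementation Details

The Transformer architecture of our dual-encoder model is the same as that of ViLT (Kim et al., 2021). The vision and text Transformers each consist of 12 layers with a hidden size of 768 and 12 attention heads. The intermediate size of the feed-forward networks is 3072. Following Kim et al. (2021), images are resized to 384 × 640 and the patch size is 32 × 32. The maximum length of the text sequence is set to 40.

For pre-training, we train the model for 200k steps with a batch size of 1024. We use the pre-trained weights of ViLT (Kim et al., 2021) to initialize the vision and text encoders of the dual-encoder model. For fine-tuning, we train the model for 10 epochs with a batch size of 256 on VQA and SNLI-VE. For NLVR2, we train the model for 20 epochs with a batch size of 128. For Flickr30K, the model is trained for 20 epochs with a batch size of 512. We apply RandAugment (Cubuk et al., 2020) without color inversion and cutout. For both stages, we use Adam (Kingma and Ba, 2015) with β1 = 0.9 and β2 = 0.999 for optimization.

The learning rate is set to 1e-4, with a warmup ratio of 0.1 and linear decay. The weight decay is set to 0.01.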
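The learning-rate schedule described above can be sketched as a simple function of the training step. The peak rate of 1e-4 and the warmup ratio of 0.1 come from the text; the linear shape of the warmup, decaying all the way to zero, and the function name are assumptions.

```python
def learning_rate(step, total_steps, peak_lr=1e-4, warmup_ratio=0.1):
    """Linear warmup to `peak_lr`, followed by linear decay to zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)
```

With the 200k pre-training steps mentioned above, this schedule would warm up over the first 20k steps and then decay linearly over the remaining 180k.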

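4.3 Results

Vision-language understanding results. We evaluate our model on the vision-language understanding tasks NLVR2, SNLI-VE and VQA.

Table 3 presents the fine-tuning results on the three tasks. Compared with previous dual-encoder models such as CLIP (Radford et al., 2021), our model achieves better performance on all three vision-language understanding tasks, improving the average score from 57.83 to 73.85. Our dual-encoder model is also competitive with fusion-encoder models. It retains 99.9% of the teacher's accuracy on SNLI-VE, 97.8% of its accuracy on NLVR2 and 95.5% of its performance on VQA, while being more than 3× faster than the teacher model (ViLT). Our model even outperforms PixelBERT-R50 (Huang et al., 2020) on the NLVR2 task. The dual-encoder architecture requires less computation than fusion-encoder models and achieves faster inference. Moreover, separate encoding allows image or text representations to be pre-computed and cached, which is more efficient when the number of images and texts is large.

Ablation studies. Table 3 also shows ablation results for our method. Distillation in both the pre-training and fine-tuning stages contributes positively to our dual-encoder model. Compared with directly fine-tuning a dual-encoder model initialized from ViLT, using cross-modal attention distillation during fine-tuning brings a significant improvement. Adding pre-training distillation improves the model further.

Image-text retrieval results. Besides vision-language understanding tasks, we also evaluate our method on image-text retrieval. Our dual-encoder student model is trained with cross-modal attention distillation and the contrastive loss.

Table 4 reports the results of models fine-tuned on Flickr30K. Our dual-encoder model achieves competitive performance with a much faster inference speed. It even outperforms the fusion-encoder teacher model (ViLT) on image retrieval. The results also show that cross-modal attention distillation improves the model on retrieval tasks.

Inference speed. We evaluate the inference speed of our dual-encoder model and ViLT on the vision-language understanding tasks. Both models are evaluated on a single P100 GPU with the same hyperparameters. Thanks to the dual-encoder architecture, our model can cache image representations to reduce redundant computation.

The average inference time and cache time on the different tasks are shown in Table 5. Our dual-encoder model achieves faster inference on all three tasks. Pre-computing image representations speeds up inference further, which is useful for the large numbers of images and texts found in real-world applications.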
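The caching benefit of separate encoding can be illustrated with a small wrapper. This is a sketch under stated assumptions: `encode_image`, `encode_text` and `fuse` are hypothetical callables standing in for the vision encoder, the text encoder and the shallow fusion module, and the class is not part of any released code.

```python
class CachedDualEncoder:
    """Dual-encoder inference wrapper that encodes each image only once."""

    def __init__(self, encode_image, encode_text, fuse):
        self._encode_image = encode_image
        self._encode_text = encode_text
        self._fuse = fuse
        self._image_cache = {}

    def predict(self, image_id, image, text):
        # Reuse the stored image representation when the same image recurs,
        # for example when many questions are asked about one picture.
        if image_id not in self._image_cache:
            self._image_cache[image_id] = self._encode_image(image)
        return self._fuse(self._image_cache[image_id], self._encode_text(text))
```

A fusion encoder cannot be wrapped this way, because its image representation depends on the paired text and has to be recomputed for every new image-text combination.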

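Table 3: Results on vision-language understanding tasks. "Std" denotes training with the original ground-truth labels. "KD" denotes models trained with our distillation objectives. We report accuracy on the NLVR2 development and public test (test-P) sets and on the SNLI-VE validation and test splits, and the VQA score on the VQA test-dev split. Marked results are our own fine-tuning results for CLIP (Radford et al., 2021). The results for each task are averaged over 3 runs. Inference speed is evaluated on the NLVR2 dataset. We evaluate our model and ViLT on a single P100 GPU with the same hyperparameters. The inference speedups of the other models are taken from Kim et al. (2021).

Table 4: Retrieval results on Flickr30K. ViLT is the fusion-encoder teacher model. Our model is fine-tuned with cross-modal attention distillation and the contrastive objective. "− Cross-modal attention" denotes the model trained without the cross-modal attention distillation objective. The inference speed of our model and ViLT is evaluated on a single P100 GPU under the same settings.

Table 5: Average inference time of ViLT and our model on the three vision-language understanding tasks. Inference time and cache time are evaluated on a single P100 GPU.

Table 6: Effects of using different distilled knowledge. "Attn" is short for attention distributions. "Whole Attn" is the combination of "Uni-modal Attn" and "Cross-modal Attn". The results for each task are averaged over 3 runs.

Table 7: Effects of different layer mapping strategies for our distillation method. Results are averaged over 3 runs.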

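4.4 Discussion

Effects of different distilled knowledge. We investigate the effects of the different kinds of knowledge used in distillation. We run experiments on the vision-language understanding tasks with different distillation losses during fine-tuning. The dual-encoder student model is initialized directly from ViLT. Table 6 shows the results across tasks. First, we find that soft label distillation performs better than training with ground-truth labels. However, the model trained with soft labels still has relatively low accuracy on the NLVR2 task. We therefore incorporate intermediate representations of the fusion-encoder model to improve the dual-encoder model, comparing hidden states with different attention distributions. On all three tasks, using attention distributions brings more improvement than using hidden states. We further explore which part of the attention distribution is more critical: cross-modal attention or uni-modal attention. As shown in Table 6, mimicking the teacher's cross-modal attention distributions yields more improvement than mimicking the uni-modal part, which confirms that cross-modal interactions matter more for vision-language understanding tasks. We also find that using only the cross-modal attention distributions performs better than using the whole attention distributions (cross-modal + uni-modal).

Effects of different layer mapping strategies. Inspired by Wang et al. (2020b), we apply the proposed knowledge distillation method to the last layer of the teacher and the student. To validate distilling only the last layer, we compare it with a layer-wise strategy. The results are shown in Table 7.

The last-layer distillation strategy achieves better performance on the NLVR2 and SNLI-VE tasks. Moreover, using only the attention knowledge of the last layer requires less computation. Using only the last layer is therefore the more practical way to perform cross-modal attention distillation.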

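5 Conclusion

In this work, we introduce a cross-modal attention distillation framework to improve the performance of dual-encoder models on vision-language understanding tasks. We use the cross-modal attention knowledge of a fusion-encoder model, including the image-to-text and text-to-image attention distributions, to guide the training of the dual-encoder model. Experimental results show that the distilled dual-encoder model achieves competitive performance on NLVR2, SNLI-VE and VQA while being much faster at inference than fusion-encoder models.

References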

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 6077–6086. Computer Vision Foundation / IEEE Computer Society.

Hangbo Bao, Li Dong, and Furu Wei. 2021. BEiT: BERT pre-training of image transformers. CoRR, abs/2106.08254.

Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Jianfeng Gao, Songhao Piao, Ming Zhou, and Hsiao-Wuen Hon. 2020. UniLMv2: Pseudo-masked language models for unified language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 642–652. PMLR.

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: universal image-text representation learning. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXX, volume 12375 of Lecture Notes in Computer Science, pages 104–120. Springer.

Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, and Heyan Huang. 2020. Cross-lingual natural language generation via pre-training. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7570–7577. AAAI Press.

Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, and Ming Zhou. 2021a. InfoXLM: An information-theoretic framework for cross-lingual language model pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3576–3588, Online. Association for Computational Linguistics.

Zewen Chi, Shaohan Huang, Li Dong, Shuming Ma, Saksham Singhal, Payal Bajaj, Xia Song, and Furu Wei. 2021b. XLM-E: cross-lingual language model pre-training via ELECTRA. CoRR, abs/2106.16138.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 7057–7067.

Ekin Dogus Cubuk, Barret Zoph, Jon Shlens, and Quoc Le. 2020. Randaugment: Practical automated data augmentation with a reduced search space. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 13042–13054.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. 2020. Large-scale adversarial training for vision-and-language representation learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6325–6334. IEEE Computer Society.

Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. CoRR, abs/1503.02531.

Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. 2020. Pixel-bert: Aligning image pixels with text by deep multi-modal transformers. CoRR, abs/2004.00849.

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 4904–4916. PMLR.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. Tinybert: Distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 4163–4174. Association for Computational Linguistics.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 3128–3137. IEEE Computer Society.

Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. Vilt: Vision-and-language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 5583–5594. PMLR.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis., 123(1):32–73.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7871–7880. Association for Computational Linguistics.

Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq R. Joty, Caiming Xiong, and Steven C. H. Hoi. 2021a. Align before fuse: Vision and language representation learning with momentum distillation. CoRR, abs/2107.07651.

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. Visualbert: A simple and performant baseline for vision and language. CoRR, abs/1908.03557.

Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, and Haifeng Wang. 2021b. UNIMO: towards unified-modal understanding and generation via cross-modal contrastive learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 2592–2607. Association for Computational Linguistics.

Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXX, volume 12375 of Lecture Notes in Computer Science, pages 121–137. Springer.

Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: common objects in context. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, volume 8693 of Lecture Notes in Computer Science, pages 740–755. Springer.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 13–23.

Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Alexandre Muzio, Saksham Singhal, Hany Hassan Awadalla, Xia Song, and Furu Wei. 2021. Deltalm: Encoder-decoder pre-training for language generation and translation by augmenting pretrained multilingual encoders. CoRR, abs/2106.13736.

Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2text: Describing images using 1 million captioned photographs. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain, pages 1143–1151.

Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 2641–2649. IEEE Computer Society.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.

Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 91–99.

Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2015. Fitnets: Hints for thin deep nets. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 2556–2565. Association for Computational Linguistics.

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: pre-training of generic visual-linguistic representations. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. 2019. A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers, pages 6418–6428. Association for Computational Linguistics.

Siqi Sun, Yen-Chun Chen, Linjie Li, Shuohang Wang, Yuwei Fang, and Jingjing Liu. 2021. Lightningdot: Pre-training visual-semantic embeddings for real-time image-text retrieval. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 982–997. Association for Computational Linguistics.

Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for BERT model compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 4322–4331. Association for Computational Linguistics.

Hao Tan and Mohit Bansal. 2019. LXMERT: learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 5099–5110. Association for Computational Linguistics.

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. 2021. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 10347–10357. PMLR.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.

Jianfeng Wang, Xiaowei Hu, Pengchuan Zhang, Xiujun Li, Lijuan Wang, Lei Zhang, Jianfeng Gao, and Zicheng Liu. 2020a. Minivlm: A smaller and faster vision-language model. CoRR, abs/2012.06946.

Wenhui Wang, Hangbo Bao, Li Dong, and Furu Wei. 2021a. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. CoRR, abs/2111.02358.

Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, and Furu Wei. 2021b. Minilmv2: Multi-head self-attention relation distillation for compressing pre-trained transformers. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, volume ACL/IJCNLP 2021 of Findings of ACL, pages 2140–2151. Association for Computational Linguistics.

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020b. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. 2021c. Simvlm: Simple visual language model pretraining with weak supervision. CoRR, abs/2108.10904.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. 2019. Visual entailment: A novel task for fine-grained image understanding. CoRR, abs/1901.06706.

Haiyang Xu, Ming Yan, Chenliang Li, Bin Bi, Songfang Huang, Wenming Xiao, and Fei Huang. 2021. E2E-VLP: end-to-end vision-language pre-training enhanced by visual learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 503–513. Association for Computational Linguistics.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguistics, 2:67–78.

Sergey Zagoruyko and Nikos Komodakis. 2017. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. Vinvl: Revisiting visual representations in vision-language models. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 5579–5588. Computer Vision Foundation / IEEE.

Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. 2020. Unified vision-language pre-training for image captioning and VQA. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 13041–13049. AAAI Press.
