Paper Notes: A Comprehensive Study of Deep Video Action Recognition
Paper link: A Comprehensive Study of Deep Video Action Recognition
Contents
- A Comprehensive Study of Deep Video Action Recognition
- Abstract
- 1. Introduction
- 2. Datasets and Challenges
- 2.1. Datasets
- 2.2. Challenges
- 3. An Odyssey of Using Deep Learning for Video Action Recognition
- 3.1. From hand-crafted features to CNNs
- 3.2. Two-stream networks
- 3.2.1 Using deeper network architectures
- 3.2.2 Two-stream fusion
- 3.2.3 Recurrent neural networks
- 3.2.4 Segment-based methods
- 3.2.5 Multi-stream networks
- 3.3. The rise of 3D CNNs
- 3.3.1 Mapping from 2D to 3D CNNs
- 3.3.2 Unifying 2D and 3D CNNs
- 3.3.3 Long-range temporal modeling
- 3.3.4 Enhancing 3D efficiency
- 3.4. Efficient Video Modeling
- 3.4.1 Flow-mimic approaches
- 3.4.2 Temporal modeling without 3D convolution
- 3.5. Miscellaneous
- 3.5.1 Trajectory-based methods
- 3.5.2 Rank pooling
- 3.5.3 Compressed video action recognition
- 3.5.4 Frame/Clip sampling
- 3.5.5 Visual tempo
- 4. Evaluation and Benchmarking
- 4.1. Evaluation scheme
- 4.2. Scene-focused datasets
- 4.3. Motion-focused datasets
- 4.4. Multi-label datasets
- 4.5. Speed comparison
- 5. Discussion and Future Work
- 5.1. Analysis and insights
- 5.2. Data augmentation
- 5.3. Video domain adaptation
- 5.4. Neural architecture search
- 5.5. Efficient model development
- 5.6. New datasets
- 5.7. Video adversarial attack
- 5.8. Zero-shot action recognition
- 5.9. Weakly-supervised video action recognition
- 5.10. Fine-grained video action recognition
- 5.11. Egocentric action recognition
- 5.12. Multi-modality
- 5.13. Self-supervised video representation learning
- 6. Conclusion
- References
A Comprehensive Study of Deep Video Action Recognition
Yi Zhu, Xinyu Li, Chunhui Liu, Mohammadreza Zolfaghari, Yuanjun Xiong,
Chongruo Wu, Zhi Zhang, Joseph Tighe, R. Manmatha, Mu Li
Amazon Web Services
{yzaws,xxnl,chunhliu,mozolf,yuanjx,chongrwu,zhiz,tighej,manmatha,mli}@amazon.com
Abstract
Video action recognition is one of the representative tasks for video understanding. Over the last decade, we have witnessed great advancements in video action recognition thanks to the emergence of deep learning. But we also encountered new challenges, including modeling long-range temporal information in videos, high computation costs, and incomparable results due to datasets and evaluation protocol variances. In this paper, we provide a comprehensive survey of over 200 existing papers on deep learning for video action recognition. We first introduce the 17 video action recognition datasets that influenced the design of models. Then we present video action recognition models in chronological order: starting with early attempts at adapting deep learning, then to the two-stream networks, followed by the adoption of 3D convolutional kernels, and finally to the recent compute-efficient models. In addition, we benchmark popular methods on several representative datasets and release code for reproducibility. In the end, we discuss open problems and shed light on opportunities for video action recognition to facilitate new research ideas.
1. Introduction
One of the most important tasks in video understanding is to understand human actions. It has many real-world applications, including behavior analysis, video retrieval, human-robot interaction, gaming, and entertainment. Human action understanding involves recognizing, localizing, and predicting human behaviors. The task to recognize human actions in a video is called video action recognition. In Figure 1, we visualize several video frames with the associated action labels, which are typical human daily activities such as shaking hands and riding a bike.
Over the last decade, there has been growing research interest in video action recognition with the emergence of high-quality large-scale action recognition datasets. We summarize the statistics of popular action recognition datasets in Figure 2. We see that both the number of videos and the number of classes increase rapidly, e.g., from 7K videos over 51 classes in HMDB51 [109] to 8M videos over 3,862 classes in YouTube8M [1]. Also, the rate at which new datasets are released is increasing: 3 datasets were released from 2011 to 2015 compared to 13 released from 2016 to 2020.
Thanks to both the availability of large-scale datasets and the rapid progress in deep learning, there has also been rapid growth in deep learning based models to recognize video actions. In Figure 3, we present a chronological overview of recent representative work. DeepVideo [99] is one of the earliest attempts to apply convolutional neural networks to videos. We observe three trends here. The first trend, started by the seminal paper on Two-Stream Networks [187], adds a second path that learns the temporal information in a video by training a convolutional neural network on the optical flow stream. Its great success inspired a large number of follow-up papers, such as TDD [214], LRCN [37], Fusion [50], TSN [218], etc. The second trend was the use of 3D convolutional kernels to model video temporal information, such as I3D [14], R3D [74], S3D [239], Non-local [219], SlowFast [45], etc. Finally, the third trend focused on computational efficiency to scale to even larger datasets so that the models could be adopted in real applications. Examples include Hidden TSN [278], TSM [128], X3D [44], TVN [161], etc.
Despite the large number of deep learning based models for video action recognition, there is no comprehensive survey dedicated to these models. Previous survey papers either put more efforts into hand-crafted features [77, 173] or focus on broader topics such as video captioning [236], video prediction [104], video action detection [261] and zero-shot video action recognition [96]. In this paper:
We comprehensively review over 200 papers on deep learning for video action recognition. We walk the readers through the recent advancements chronologically and systematically, with popular papers explained in detail.
We benchmark widely adopted methods on the same set of datasets in terms of both accuracy and efficiency. We also release our implementations for full reproducibility.
We elaborate on challenges, open problems, and opportunities in this field to facilitate future research.
The rest of the survey is organized as follows. We first describe popular datasets used for benchmarking and existing challenges in section 2. Then we present recent advancements using deep learning for video action recognition in section 3, which is the major contribution of this survey. In section 4, we evaluate widely adopted approaches on standard benchmark datasets, and we provide discussions and future research opportunities in section 5.
2. Datasets and Challenges
2.1. Datasets
Deep learning methods usually improve in accuracy when the volume of the training data grows. In the case of video action recognition, this means we need large-scale annotated datasets to learn effective models.
For the task of video action recognition, datasets are often built by the following process: (1) Define an action list, by combining labels from previous action recognition datasets and adding new categories depending on the use case. (2) Obtain videos from various sources, such as YouTube and movies, by matching the video title/subtitle to the action list. (3) Provide temporal annotations manually to indicate the start and end position of the action, and (4) finally clean up the dataset by de-duplication and filtering out noisy classes/samples. Below we review the most popular large-scale video action recognition datasets in Table 1 and Figure 2.
HMDB51 [109] was introduced in 2011. It was collected mainly from movies, and a small proportion from public databases such as the Prelinger archive, YouTube and Google videos. The dataset contains 6,849 clips divided into 51 action categories, each containing a minimum of 101 clips. The dataset has three official splits. Most previous papers either report the top-1 classification accuracy on split 1 or the average accuracy over the three splits.
UCF101 [190] was introduced in 2012 and is an extension of the previous UCF50 dataset. It contains 13,320 videos from YouTube spanning 101 categories of human actions. The dataset has three official splits similar to HMDB51, and is also evaluated in the same manner.
Sports1M [99] was introduced in 2014 as the first large-scale video action dataset, consisting of more than 1 million YouTube videos annotated with 487 sports classes. The categories are fine-grained, which leads to low inter-class variations. It has an official 10-fold cross-validation split for evaluation.
ActivityNet [40] was originally introduced in 2015 and the ActivityNet family has had several versions since its initial launch. The most recent, ActivityNet 200 (V1.3), contains 200 human daily living actions. It has 10,024 training, 4,926 validation, and 5,044 testing videos. On average there are 137 untrimmed videos per class and 1.41 activity instances per video.
YouTube8M [1] was introduced in 2016 and is by far the largest-scale video dataset, containing 8 million YouTube videos (500K hours of video in total) annotated with 3,862 action classes. Each video is annotated with one or multiple labels by a YouTube video annotation system. The dataset is split into training, validation and test sets in the ratio 70:20:10. The validation set is also extended with human-verified segment annotations to provide temporal localization information.
Charades [186] was introduced in 2016 as a dataset for real-life concurrent action understanding. It contains 9,848 videos with an average length of 30 seconds. The dataset includes 157 multi-label daily indoor activities, performed by 267 different people. It has an official train-validation split with 7,985 videos for training and the remaining 1,863 for validation.
Kinetics Family is now the most widely adopted benchmark. Kinetics400 [100] was introduced in 2017 and consists of approximately 240K training and 20K validation videos trimmed to 10 seconds from 400 human action categories. The Kinetics family continues to expand, with Kinetics-600 [12] released in 2018 with 480K videos and Kinetics-700 [13] in 2019 with 650K videos.
20BN-Something-Something [69] V1 was introduced in 2017 and V2 was introduced in 2018. This family is another popular benchmark that consists of 174 action classes describing humans performing basic actions with everyday objects. There are 108,499 videos in V1 and 220,847 videos in V2. Note that the Something-Something dataset requires strong temporal modeling because most activities cannot be inferred from spatial features alone (e.g., opening something, covering something with something).
AVA [70] was introduced in 2017 as the first large-scale spatio-temporal action detection dataset. It contains 430 15-minute video clips with 80 atomic action labels (only 60 labels were used for evaluation). The annotations are provided at each key-frame, leading to 214,622 training, 57,472 validation and 120,322 testing samples. The AVA dataset was recently expanded to AVA-Kinetics with 352,091 training, 89,882 validation and 182,457 testing samples [117].
Moments in Time [142] was introduced in 2018 and it is a large-scale dataset designed for event understanding. It contains one million 3 second video clips, annotated with a dictionary of 339 classes. Different from other datasets designed for human action understanding, Moments in Time dataset involves people, animals, objects and natural phenomena. The dataset was extended to Multi-Moments in Time (M-MiT) [143] in 2019 by increasing the number of videos to 1.02 million, pruning vague classes, and increasing the number of labels per video.
HACS [267] was introduced in 2019 as a new large-scale dataset for recognition and localization of human actions collected from Web videos. It consists of two kinds of manual annotations. HACS Clips contains 1.55M 2-second clip annotations on 504K videos, and HACS Segments has 140K complete action segments (from action start to end) on 50K videos. The videos are annotated with the same 200 human action classes used in ActivityNet (V1.3) [40].
HVU [34] dataset was released in 2020 for multi-label multi-task video understanding. The dataset has 572K videos and 3,142 labels. The official split has 481K, 31K and 65K videos for train, validation, and test respectively. The dataset covers six task categories: scene, object, action, event, attribute, and concept. On average, there are about 2,112 samples for each label. The duration of the videos varies, with a maximum length of 10 seconds.
AViD [165] was introduced in 2020 as a dataset for anonymized action recognition. It contains 410K videos for training and 40K videos for testing. Each video clip duration is between 3-15 seconds and in total it has 887 action classes. During data collection, the authors tried to collect data from various countries to deal with data bias. They also remove face identities to protect privacy of video makers. Therefore, AViD dataset might not be a proper choice for recognizing face-related actions.
Figure 4. Visual examples from popular video action datasets.
Top: individual video frames from action classes in UCF101 and Kinetics400. A single frame from these scene-focused datasets often contains enough information to correctly guess the category. Middle: consecutive video frames from classes in Something-Something. The 2nd and 3rd frames are made transparent to indicate the importance of temporal reasoning: we cannot tell these two actions apart by looking at the 1st frame alone. Bottom: individual video frames from classes in Moments in Time. The same action could have different actors in different environments.
Before we dive into the chronological review of methods, we present several visual examples from the above datasets in Figure 4 to show their different characteristics. In the top two rows, we pick action classes from UCF101 [190] and Kinetics400 [100] datasets. Interestingly, we find that these actions can sometimes be determined by the context or scene alone. For example, the model can predict the action riding a bike as long as it recognizes a bike in the video frame. The model may also predict the action cricket bowling if it recognizes the cricket pitch. Hence for these classes, video action recognition may become an object/scene classification problem without the need of reasoning motion/temporal information. In the middle two rows, we pick action classes from Something-Something dataset [69]. This dataset focuses on human-object interaction, thus it is more fine-grained and requires strong temporal modeling. For example, if we only look at the first frame of dropping something and picking something up without looking at other video frames, it is impossible to tell these two actions apart. In the bottom row, we pick action classes from Moments in Time dataset [142]. This dataset is different from most video action recognition datasets, and is designed to have large inter-class and intra-class variation that represent dynamical events at different levels of abstraction. For example, the action climbing can have different actors (person or animal) in different environments (stairs or tree).
2.2. Challenges
There are several major challenges in developing effective video action recognition algorithms.
In terms of datasets, first, defining the label space for training action recognition models is non-trivial because human actions are usually composite concepts and the hierarchy of these concepts is not well-defined. Second, annotating videos for action recognition is laborious (e.g., annotators need to watch all the video frames) and ambiguous (e.g., it is hard to determine the exact start and end of an action). Third, some popular benchmark datasets (e.g., the Kinetics family) only release video links for users to download rather than the actual videos, which leads to methods being evaluated on different data. This makes it impossible to do fair comparisons between methods and gain insights.
In terms of modeling, first, videos capturing human actions have both strong intra- and inter-class variations. People can perform the same action in different speeds under various viewpoints. Besides, some actions share similar movement patterns that are hard to distinguish. Second, recognizing human actions requires simultaneous understanding of both short-term action-specific motion information and long-range temporal information. We might need a sophisticated model to handle different perspectives rather than using a single convolutional neural network. Third, the computational cost is high for both training and inference, hindering both the development and deployment of action recognition models. In the next section, we will demonstrate how video action recognition methods developed over the last decade to address the aforementioned challenges.
3. An Odyssey of Using Deep Learning for Video Action Recognition
In this section, we review deep learning based methods for video action recognition from 2014 to the present and introduce related earlier work in context.
3.1. From hand-crafted features to CNNs
Despite some papers using Convolutional Neural Networks (CNNs) for video action recognition [200, 5, 91], hand-crafted features [209, 210, 158, 112], particularly Improved Dense Trajectories (IDT) [210], dominated the video understanding literature before 2015 due to their high accuracy and good robustness. However, hand-crafted features have a heavy computational cost [244] and are hard to scale and deploy.
With the rise of deep learning [107], researchers started to adapt CNNs for video problems. The seminal work DeepVideo [99] proposed to use a single 2D CNN model on each video frame independently and investigated several temporal connectivity patterns to learn spatio-temporal features for video action recognition, such as late fusion, early fusion and slow fusion. Though this model made early progress with ideas that would prove to be useful later, such as a multi-resolution network, its transfer learning performance on UCF101 [190] was 20% lower than hand-crafted IDT features (65.4% vs 87.9%). Furthermore, DeepVideo [99] found that a network fed individual video frames performs equally well when the input is changed to a stack of frames. This observation might indicate that the learnt spatio-temporal features did not capture the motion well. It also encouraged people to think about why CNN models did not outperform traditional hand-crafted features in the video domain, unlike in other computer vision tasks [107, 171].
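The temporal connectivity patterns explored by DeepVideo can be illustrated purely in terms of input shapes. The toy sketch below (NumPy; the tensor sizes and the stand-in "CNN" are hypothetical, not from the paper) contrasts the single-frame baseline, early fusion of frames along the channel axis, and late fusion of per-frame scores:

```python
import numpy as np

# Hypothetical clip: T frames of C-channel H x W images.
T, C, H, W = 4, 3, 8, 8
clip = np.random.rand(T, C, H, W)

# Single-frame baseline: each frame is processed independently by a 2D CNN.
single_frame_input = clip[0]                      # (C, H, W)

# Early fusion: stack all T frames along the channel axis at the first layer,
# so the first convolution sees the whole clip at once.
early_fusion_input = clip.reshape(T * C, H, W)    # (T*C, H, W)

# Late fusion: run two temporally distant frames through the same 2D CNN and
# merge their class scores at the end (here the "CNN" is just a per-channel
# mean over pixels, a placeholder for real pseudo-logits).
def toy_cnn_score(frame):
    return frame.mean(axis=(1, 2))                # (C,) stand-in scores

late_fusion_score = (toy_cnn_score(clip[0]) + toy_cnn_score(clip[-1])) / 2
```

Slow fusion sits between these two extremes, gradually merging temporal information across several convolutional layers rather than all at once.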
3.2. Two-stream networks
Since video understanding intuitively needs motion information, finding an appropriate way to describe the temporal relationship between frames is essential to improving the performance of CNN-based video action recognition.
Optical flow [79] is an effective motion representation to describe object/scene movement. To be precise, it is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene. We show several visualizations of optical flow in Figure 5. As we can see, optical flow is able to describe the motion pattern of each action accurately. The advantage of using optical flow is that it provides information orthogonal to the RGB image. For example, the two images on the bottom of Figure 5 have cluttered backgrounds. Optical flow can effectively remove the non-moving background and results in a simpler learning problem compared to using the original RGB images as input.
Figure 5. Visualizations of optical flow. We show four image-flow pairs, left is original RGB image and right is the estimated optical flow by FlowNet2 [85]. Color of optical flow indicates the directions of motion, and we follow the color coding scheme of FlowNet2 [85] as shown in top right.
In addition, optical flow has been shown to work well on video problems. Traditional hand-crafted features such as IDT [210] also contain optical-flow-like features, such as Histogram of Optical Flow (HOF) and Motion Boundary Histogram (MBH).
Hence, Simonyan et al. [187] proposed two-stream networks, which include a spatial stream and a temporal stream as shown in Figure 6. This method is related to the two-streams hypothesis [65], according to which the human visual cortex contains two pathways: the ventral stream (which performs object recognition) and the dorsal stream (which recognizes motion). The spatial stream takes raw video frame(s) as input to capture visual appearance information. The temporal stream takes a stack of optical flow images as input to capture motion information between video frames. To be specific, [187] linearly rescaled the horizontal and vertical components of the estimated flow (i.e., motion in the x-direction and y-direction) to a [0, 255] range and compressed them using JPEG. The output corresponds to the two optical flow images shown in Figure 6. The compressed optical flow images are then concatenated as the input to the temporal stream with a dimension of H×W×2L, where H and W indicate the height and width of the video frames and L the number of stacked flow frames. In the end, the final prediction is obtained by averaging the prediction scores from both streams.
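The construction of this temporal-stream input can be sketched as follows (NumPy; the flow values are synthetic, and the linear rescaling to [0, 255] is a simplified stand-in for the JPEG-compressed flow storage used in [187]):

```python
import numpy as np

# L flow fields of size H x W, each with a horizontal (u) and vertical (v)
# component per pixel. Values here are synthetic random displacements.
H, W, L = 224, 224, 10
rng = np.random.default_rng(0)
flow_fields = rng.uniform(-20, 20, size=(L, H, W, 2))

def rescale_to_uint8(f):
    """Linearly map flow values to [0, 255], as done before JPEG compression."""
    fmin, fmax = f.min(), f.max()
    return ((f - fmin) / (fmax - fmin) * 255.0).astype(np.uint8)

# Interleave the x- and y-components of the L flow fields into 2L channels.
channels = []
for l in range(L):
    channels.append(rescale_to_uint8(flow_fields[l, :, :, 0]))  # horizontal
    channels.append(rescale_to_uint8(flow_fields[l, :, :, 1]))  # vertical
temporal_input = np.stack(channels, axis=-1)  # (H, W, 2L)
```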
By adding the extra temporal stream, a CNN-based approach for the first time achieved performance similar to the previous best hand-crafted feature IDT on UCF101 (88.0% vs 87.9%) and on HMDB51 [109] (59.4% vs 61.1%). [187] makes two important observations. First, motion information is important for video action recognition. Second, it is still challenging for CNNs to learn temporal information directly from raw video frames. Pre-computing optical flow as the motion representation is an effective way for deep learning to reveal its power. Since [187] managed to close the gap between deep learning approaches and traditional hand-crafted features, many follow-up papers on two-stream networks emerged and greatly advanced the development of video action recognition. Here, we divide them into several categories and review them individually.
Figure 6. Workflow of five important papers: two-stream networks [187], temporal segment networks [218], I3D [14], Nonlocal [219] and SlowFast [45]. Best viewed in color.
3.2.1 Using deeper network architectures
Two-stream networks [187] used a relatively shallow network architecture [107]. Thus a natural extension to the two-stream networks involves using deeper networks. However, Wang et al. [215] found that simply using deeper networks does not yield better results, possibly due to overfitting on the small-sized video datasets [190, 109]. Recall from section 2.1 that the UCF101 and HMDB51 datasets only have thousands of training videos. Hence, Wang et al. [217] introduced a series of good practices, including cross-modality initialization, synchronized batch normalization, corner cropping and multi-scale cropping data augmentation, a large dropout ratio, etc., to prevent deeper networks from overfitting. With these good practices, [217] was able to train a two-stream network with the VGG16 model [188] that outperforms [187] by a large margin on UCF101. These good practices have been widely adopted and are still being used. Later, Temporal Segment Networks (TSN) [218] performed a thorough investigation of network architectures, such as VGG16, ResNet [76] and Inception [198], and demonstrated that deeper networks usually achieve higher recognition accuracy for video action recognition. We will describe TSN in more detail in section 3.2.4.
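As a concrete illustration of one of these good practices, the sketch below implements corner cropping with multi-scale crop sizes (NumPy; the frame size and the crop-size list are illustrative assumptions, and the final resize to the network input size is omitted):

```python
import numpy as np

def corner_crops(frame, crop_size):
    """Return the 4 corner crops plus the center crop of an (H, W, C) frame."""
    H, W = frame.shape[:2]
    s = crop_size
    offsets = [
        (0, 0), (0, W - s), (H - s, 0), (H - s, W - s),  # four corners
        ((H - s) // 2, (W - s) // 2),                    # center
    ]
    return [frame[y:y + s, x:x + s] for (y, x) in offsets]

# A placeholder frame; 256 x 340 is a common pre-crop resolution assumption.
frame = np.zeros((256, 340, 3), dtype=np.uint8)

crops = []
for size in (256, 224, 192, 168):   # illustrative multi-scale crop sizes
    crops.extend(corner_crops(frame, size))
```

Restricting crops to corners and center, instead of fully random locations, limits the augmentation to regions that typically contain the action and was reported to reduce overfitting on small video datasets.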
3.2.2 Two-stream fusion
Since there are two streams in a two-stream network, there will be a stage that needs to merge the results from both networks to obtain the final prediction. This stage is usually referred to as the spatial-temporal fusion step.
The easiest and most straightforward way is late fusion, which performs a weighted average of predictions from both streams. Despite late fusion being widely adopted [187, 217], many researchers claim that this may not be the optimal way to fuse the information between the spatial appearance stream and temporal motion stream. They believe that earlier interactions between the two networks could benefit both streams during model learning and this is termed as early fusion.
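A minimal sketch of late fusion (NumPy): each stream produces class scores and the final prediction is their weighted average. The scores and the 1:1.5 stream weighting below are illustrative assumptions; exact weights vary across papers.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Made-up logits for a 3-class problem from the two streams.
spatial_scores  = softmax(np.array([2.0, 0.5, 0.1]))   # RGB stream
temporal_scores = softmax(np.array([1.2, 1.5, 0.0]))   # optical-flow stream

# Weighted average of the per-stream predictions (illustrative 1:1.5 weights).
w_spatial, w_temporal = 1.0, 1.5
fused = (w_spatial * spatial_scores + w_temporal * temporal_scores)
fused /= (w_spatial + w_temporal)
prediction = int(np.argmax(fused))
```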
Fusion [50] is one of the first of several papers investigating the early fusion paradigm, including how to perform spatial fusion (e.g., using operators such as sum, max, bilinear, convolution and concatenation), where to fuse the network (e.g., the network layer where early interactions happen), and how to perform temporal fusion (e.g., using 2D or 3D convolutional fusion in later stages of the network). [50] shows that early fusion is beneficial for both streams to learn richer features and leads to improved performance over late fusion. Following this line of research, Feichtenhofer et al. [46] generalizes ResNet [76] to the spatiotemporal domain by introducing residual connections between the two streams. Based on [46], Feichtenhofer et al. [47] further propose a multiplicative gating function for residual networks to learn better spatio-temporal features. Concurrently, [225] adopts a spatio-temporal pyramid to perform hierarchical early fusion between the two streams.
3.2.3 Recurrent neural networks
Since a video is essentially a temporal sequence, researchers have explored Recurrent Neural Networks (RNNs) for temporal modeling inside a video, particularly the usage of Long Short-Term Memory (LSTM) [78].
LRCN [37] and Beyond-Short-Snippets [253] are the first of several papers that use LSTM for video action recognition under the two-stream networks setting. They take the feature maps from CNNs as an input to a deep LSTM network, and aggregate frame-level CNN features into video-level predictions. Note that they use LSTM on two streams separately, and the final results are still obtained by late fusion. However, there is no clear empirical improvement from LSTM models [253] over the two-stream baseline [187]. Following the CNN-LSTM framework, several variants have been proposed, such as bi-directional LSTM [205], CNN-LSTM fusion [56] and hierarchical multi-granularity LSTM network [118]. [125] described VideoLSTM, which includes a correlation-based spatial attention mechanism and a lightweight motion-based attention mechanism. VideoLSTM not only shows improved results on action recognition, but also demonstrates how the learned attention can be used for action localization by relying on just the action class label. Lattice-LSTM [196] extends LSTM by learning independent hidden state transitions of memory cells for individual spatial locations, so that it can accurately model long-term and complex motions. ShuttleNet [183] is a concurrent work that considers both feedforward and feedback connections in an RNN to learn long-term dependencies. FASTER [272] designed a FAST-GRU to aggregate clip-level features from an expensive backbone and a cheap backbone. This strategy reduces the processing cost of redundant clips and hence accelerates the inference speed.
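A minimal sketch of the CNN-LSTM aggregation idea, assuming per-frame CNN features are already extracted; the gate packing and random weights below are hypothetical, not taken from [37] or [253]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_aggregate(frame_feats, wx, wh, b):
    # frame_feats: (T, D) per-frame CNN features; gates packed as [i, f, g, o].
    # Returns the last hidden state as the clip-level feature.
    hdim = wh.shape[0]
    h = np.zeros(hdim)
    c = np.zeros(hdim)
    for x in frame_feats:                      # one step per video frame
        gates = x @ wx + h @ wh + b            # shape (4 * hdim,)
        i, f, g, o = np.split(gates, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

rng = np.random.default_rng(0)
T, D, H = 16, 32, 8                            # 16 frames, 32-d features
feats = rng.standard_normal((T, D))
wx = rng.standard_normal((D, 4 * H)) * 0.1
wh = rng.standard_normal((H, 4 * H)) * 0.1
b = np.zeros(4 * H)
clip_feature = lstm_aggregate(feats, wx, wh, b)
```

In the two-stream setting this aggregation is run on each stream separately, and the two clip-level predictions are still late-fused.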
However, the works mentioned above [37, 253, 125, 196, 183] use different two-stream networks/backbones. The differences between various methods using RNNs are thus unclear. Ma et al. [135] build a strong baseline for fair comparison and thoroughly study the effect of learning spatiotemporal features by using RNNs. They find that it requires proper care to achieve improved performance, e.g., LSTMs require pre-segmented data to fully exploit the temporal information. RNNs are also intensively studied in video action localization [189] and video question answering [274], but these are beyond the scope of this survey.
3.2.4 Segment-based methods
Thanks to optical flow, two-stream networks are able to reason about short-term motion information between frames. However, they still cannot capture long-range temporal information. Motivated by this weakness of two-stream networks, Wang et al. [218] proposed a Temporal Segment Network (TSN) to perform video-level action recognition. Though initially proposed to be used with 2D CNNs, it is simple and generic. Thus recent work, using either 2D or 3D CNNs, is still built upon this framework.
To be specific, as shown in Figure 6, TSN first divides a whole video into several segments, where the segments distribute uniformly along the temporal dimension. Then TSN randomly selects a single video frame within each segment and forwards them through the network. Here, the network shares weights for input frames from all the segments. In the end, a segmental consensus is performed to aggregate information from the sampled video frames. The segmental consensus could be operators like average pooling, max pooling, bilinear encoding, etc. In this sense, TSN is capable of modeling long-range temporal structure because the model sees the content from the entire video. In addition, this sparse sampling strategy lowers the training cost over long video sequences but preserves relevant information.
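The sparse sampling strategy can be sketched as follows (a common approximation of TSN's scheme, not the exact official implementation):

```python
import random

def tsn_sample(num_frames, num_segments=3, train=True):
    # Uniformly partition the frame index range into num_segments segments,
    # then pick one frame per segment: a random frame during training, the
    # segment center at test time.
    indices = []
    for k in range(num_segments):
        start = k * num_frames // num_segments
        end = (k + 1) * num_frames // num_segments
        end = max(end, start + 1)
        if train:
            indices.append(random.randrange(start, min(end, num_frames)))
        else:
            indices.append(min((start + end) // 2, num_frames - 1))
    return indices

test_indices = tsn_sample(300, num_segments=3, train=False)
```

A 300-frame video is thus covered by only 3 forward passes, yet every part of the video has a chance to contribute during training.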
Given TSN’s good performance and simplicity, most two-stream methods afterwards became segment-based two-stream networks. Since the segmental consensus is simply doing a max or average pooling operation, a feature encoding step might generate a global video feature and lead to improved performance as suggested in traditional approaches [179, 97, 157]. Deep Local Video Feature (DVOF) [114] proposed to treat the deep networks that trained on local inputs as feature extractors and train another encoding function to map the global features into global labels. Temporal Linear Encoding (TLE) network [36] appeared concurrently with DVOF, but the encoding layer was embedded in the network so that the whole pipeline could be trained end-to-end. VLAD3 and ActionVLAD [123, 63] also appeared concurrently. They extended the NetVLAD layer [4] to the video domain to perform video-level encoding, instead of using compact bilinear encoding as in [36]. To improve the temporal reasoning ability of TSN, the Temporal Relation Network (TRN) [269] was proposed to learn and reason about temporal dependencies between video frames at multiple time scales. The recent state-of-the-art efficient model TSM [128] is also segment-based. We will discuss it in more detail in section 3.4.2.
3.2.5 Multi-stream networks
Two-stream networks are successful because appearance and motion information are two of the most important properties of a video. However, there are other factors that can help video action recognition as well, such as pose, objects, audio and depth.
Pose information is closely related to human action. We can recognize most actions by just looking at a pose (skeleton) image without scene context. Although there is previous work on using pose for action recognition [150, 246], P-CNN [23] was one of the first deep learning methods that successfully used pose to improve video action recognition. P-CNN proposed to aggregate motion and appearance information along tracks of human body parts, in a similar spirit to trajectory pooling [214]. [282] extended this pipeline to a chained multi-stream framework, which computed and integrated appearance, motion and pose. They introduced a Markov chain model that added these cues successively and obtained promising results on both action recognition and action localization. PoTion [25] was a follow-up work to P-CNN, but introduced a more powerful feature representation that encoded the movement of human semantic keypoints. They first ran a decent human pose estimator and extracted heatmaps for the human joints in each frame. They then obtained the PoTion representation by temporally aggregating these probability maps. PoTion is lightweight and outperforms previous pose representations [23, 282]. In addition, it was shown to be complementary to standard appearance and motion streams, e.g., combining PoTion with I3D [14] achieved the state-of-the-art result on UCF101 (98.2%).
Object information is another important cue because most human actions involve human-object interaction. Wu [232] proposed to leverage both object features and scene features to help video action recognition. The object and scene features were extracted from state-of-the-art pretrained object and scene detectors. Wang et al. [252] took a step further to make the network end-to-end trainable. They introduced a two-stream semantic region based method, by replacing a standard spatial stream with a Faster RCNN network [171], to extract semantic information about the object, person and scene.
Audio signals usually come with video, and are complementary to the visual information. Wu et al. [233] introduced a multi-stream framework that integrates spatial, short-term motion, long-term temporal and audio information in videos to digest complementary clues. Recently, Xiao et al. [237] introduced AudioSlowFast following [45], by adding another audio pathway to model vision and sound in a unified representation.
In the RGB-D video action recognition field, using depth information is standard practice [59]. However, for vision-based video action recognition (e.g., only given monocular videos), we do not have access to ground truth depth information as in the RGB-D domain. An early attempt, Depth2Action [280], uses off-the-shelf depth estimators to extract depth information from videos and use it for action recognition.
Essentially, multi-stream networks are a form of multi-modality learning, using different cues as input signals to help video action recognition. We will discuss more on multi-modality learning in section 5.12.
3.3. The rise of 3D CNNs
Pre-computing optical flow is computationally intensive and storage demanding, which is not friendly for large-scale training or real-time deployment. A conceptually easy way to understand a video is as a 3D tensor with two spatial and one time dimension. Hence, this leads to the usage of 3D CNNs as a processing unit to model the temporal information in a video.
The seminal work for using 3D CNNs for action recognition is [91]. While inspiring, the network was not deep enough to show its potential. Tran et al. [202] extended [91] to a deeper 3D network, termed C3D. C3D follows the modular design of [188], which could be thought of as a 3D version of the VGG16 network. Its performance on standard benchmarks is not satisfactory, but the model shows strong generalization capability and can be used as a generic feature extractor for various video tasks [250].
However, 3D networks are hard to optimize. In order to train a 3D convolutional filter well, people need a large-scale dataset with diverse video content and action categories. Fortunately, there exists a dataset, Sports1M [99], which is large enough to support the training of a deep 3D network. However, the training of C3D takes weeks to converge. Despite the popularity of C3D, most users just adopt it as a feature extractor for different use cases instead of modifying/fine-tuning the network. This is partially the reason why two-stream networks based on 2D CNNs dominated the video action recognition domain from year 2014 to 2017.
The situation changed when Carreira et al. [14] proposed I3D in year 2017. As shown in Figure 6, I3D takes a video clip as input, and forwards it through stacked 3D convolutional layers. A video clip is a sequence of video frames, usually 16 or 32 frames are used.
The major contributions of I3D are:
1) it adapts mature image classification architectures to 3D CNNs;
2) for model weights, it adopts a method developed for initializing optical flow networks in [217] to inflate the ImageNet pre-trained 2D model weights to their counterparts in the 3D model.
Hence, I3D bypasses the dilemma that 3D CNNs have to be trained from scratch. With pre-training on the new large-scale dataset Kinetics400 [100], I3D achieved 95.6% on UCF101 and 74.8% on HMDB51. I3D ended the era where different methods reported numbers on small-sized datasets such as UCF101 and HMDB51. Publications following I3D needed to report their performance on Kinetics400, or other large-scale benchmark datasets, which pushed video action recognition to the next level. In the next few years, 3D CNNs advanced quickly and became top performers on almost every benchmark dataset. We will review the 3D CNNs based literature in several categories below.
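The inflation trick in 2) can be sketched in a few lines (a simplified view; the actual I3D initialization is applied layer by layer throughout the Inception architecture):

```python
import numpy as np

def inflate_2d_to_3d(w2d, t):
    # w2d: (C_out, C_in, k, k) ImageNet-pretrained 2D kernel. Repeat it t
    # times along a new temporal axis and divide by t, so that the 3D filter
    # gives the same response as the 2D one on a "boring" video (the same
    # frame repeated over time).
    return np.repeat(w2d[:, :, np.newaxis, :, :], t, axis=2) / t

w2d = np.random.randn(64, 3, 7, 7).astype(np.float32)   # e.g., a first conv layer
w3d = inflate_2d_to_3d(w2d, t=7)
```

The rescaling by 1/t is what makes the inflated network reproduce the 2D network's activations at initialization, so training starts from a meaningful point instead of from scratch.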
We want to point out that 3D CNNs are not replacing two-stream networks, and they are not mutually exclusive. They just use different ways to model the temporal relationship in a video. Furthermore, the two-stream approach is a generic framework for video understanding, instead of a specific method. As long as there are two networks, one for spatial appearance modeling using RGB frames, the other for temporal motion modeling using optical flow, the method may be categorized into the family of two-stream networks. In [14], they also built a temporal stream with the I3D architecture and achieved even higher performance, 98.0% on UCF101 and 80.9% on HMDB51. Hence, the final I3D model is a combination of 3D CNNs and two-stream networks. However, the contribution of I3D does not lie in the usage of optical flow.
3.3.1 Mapping from 2D to 3D CNNs
2D CNNs enjoy the benefit of pre-training brought by large-scale image datasets such as ImageNet [30] and Places205 [270], which cannot be matched even by the largest video datasets available today. On these datasets numerous efforts have been devoted to the search for 2D CNN architectures that are more accurate and generalize better. Below we describe the efforts to capitalize on these advances for 3D CNNs.
ResNet3D [74] directly took 2D ResNet [76] and replaced all the 2D convolutional filters with 3D kernels. They believed that by using deep 3D CNNs together with large-scale datasets one can exploit the success of 2D CNNs on ImageNet. Motivated by ResNeXt [238], Chen et al. [20] presented a multi-fiber architecture that slices a complex neural network into an ensemble of lightweight networks (fibers), facilitating information flow between the fibers while reducing the computational cost. Inspired by SENet [81], STCNet [33] proposes to integrate channel-wise information inside a 3D block to capture both spatial-channels and temporal-channels correlation information throughout the network.
3.3.2 Unifying 2D and 3D CNNs
To reduce the complexity of 3D network training, P3D [169] and R2+1D [204] explore the idea of 3D factorization. To be specific, a 3D kernel (e.g., 3×3×3) can be factorized to two separate operations, a 2D spatial convolution (e.g., 1 × 3 × 3) and a 1D temporal convolution (e.g., 3 × 1 × 1). The differences between P3D and R2+1D are how they arrange the two factorized operations and how they formulate each residual block. Trajectory convolution [268] follows this idea but uses deformable convolution for the temporal component to better cope with motion.
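A quick parameter count illustrates the saving of factorization when the channel widths are held fixed (note that R2+1D actually chooses an intermediate channel width so that the factorized block matches the parameter count of the full 3D convolution; the fixed-width comparison here is only illustrative):

```python
def conv3d_params(c_in, c_out, t, k):
    # Parameter count of a t x k x k 3D convolution (bias terms ignored).
    return c_in * c_out * t * k * k

c = 256
full = conv3d_params(c, c, 3, 3)                                  # one 3x3x3 kernel
factored = conv3d_params(c, c, 1, 3) + conv3d_params(c, c, 3, 1)  # 1x3x3 + 3x1x1
```

With 256 input and output channels, the full 3D kernel costs 27 weights per channel pair while the factorized pair costs 9 + 3 = 12, and the extra nonlinearity between the two factorized convolutions is an additional benefit noted by [204].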
Another way of simplifying 3D CNNs is to mix 2D and 3D convolutions in a single network. MiCTNet [271] integrates 2D and 3D CNNs to generate deeper and more informative feature maps, while reducing training complexity in each round of spatio-temporal fusion. ARTNet [213] introduces an appearance-and-relation network by using a new building block. The building block consists of a spatial branch using 2D CNNs and a relation branch using 3D CNNs. S3D [239] combines the merits of the approaches mentioned above. It first replaces the 3D convolutions at the bottom of the network with 2D kernels, and finds that this kind of top-heavy network has higher recognition accuracy. Then S3D factorizes the remaining 3D kernels as P3D and R2+1D do, to further reduce the model size and training complexity. A concurrent work named ECO [283] also adopts such a top-heavy network to achieve online video understanding.
3.3.3 Long-range temporal modeling
In 3D CNNs, long-range temporal connection may be achieved by stacking multiple short temporal convolutions, e.g., 3 × 3 × 3 filters. However, useful temporal information may be lost in the later stages of a deep network, especially for frames far apart.
In order to perform long-range temporal modeling, LTC [206] introduces and evaluates long-term temporal convolutions over a large number of video frames. However, limited by GPU memory, they have to sacrifice input resolution to use more frames. After that, T3D [32] adopted a densely connected structure [83] to keep the original temporal information as complete as possible to make the final prediction. Later, Wang et al. [219] introduced a new building block, termed non-local. Non-local is a generic operation similar to self-attention [207], which can be used for many computer vision tasks in a plug-and-play manner. As shown in Figure 6, they used a spacetime non-local module after later residual blocks to capture the long-range dependence in both space and temporal domain, and achieved improved performance over baselines without bells and whistles. Wu et al. [229] proposed a feature bank representation, which embeds information of the entire video into a memory cell, to make context-aware prediction. Recently, V4D [264] proposed video-level 4D CNNs, to model the evolution of long-range spatio-temporal representation with 4D convolutions.
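The non-local operation can be sketched as self-attention over flattened spacetime positions (an embedded-Gaussian variant; the batch norm and initialization details of [219] are omitted, and the weight shapes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_block(x, w_theta, w_phi, w_g, w_out):
    # x: (N, C) features, where N = T*H*W flattened spacetime positions.
    # Each position attends to all others, so dependencies are captured
    # regardless of spatial or temporal distance.
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g
    attn = softmax(theta @ phi.T, axis=-1)    # (N, N) pairwise affinities
    y = attn @ g                              # aggregate features globally
    return x + y @ w_out                      # residual connection

rng = np.random.default_rng(0)
x = rng.standard_normal((8 * 7 * 7, 32))      # T=8, H=W=7, C=32
w_theta = rng.standard_normal((32, 16)) * 0.1
w_phi = rng.standard_normal((32, 16)) * 0.1
w_g = rng.standard_normal((32, 16)) * 0.1
w_out = rng.standard_normal((16, 32)) * 0.1
out = nonlocal_block(x, w_theta, w_phi, w_g, w_out)
```

The residual form is what makes the block plug-and-play: initializing `w_out` near zero leaves a pretrained backbone's behavior unchanged at the start of training.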
3.3.4 Enhancing 3D efficiency
In order to further improve the efficiency of 3D CNNs (i.e., in terms of GFLOPs, model parameters and latency), many variants of 3D CNNs begin to emerge.
Motivated by the development in efficient 2D networks, researchers started to adopt channel-wise separable convolution and extend it for video classification [111, 203]. CSN [203] reveals that it is a good practice to factorize 3D convolutions by separating channel interactions and spatiotemporal interactions, and is able to obtain state-of-the-art performance while being 2 to 3 times faster than the previous best approaches. These methods are also related to multi-fiber networks [20] as they are all inspired by group convolution.
Recently, Feichtenhofer et al. [45] proposed SlowFast, an efficient network with a slow pathway and a fast pathway. The network design is partially inspired by the biological Parvo- and Magnocellular cells in the primate visual systems. As shown in Figure 6, the slow pathway operates at low frame rates to capture detailed semantic information, while the fast pathway operates at high temporal resolution to capture rapidly changing motion. In order to incorporate motion information such as in two-stream networks, SlowFast adopts a lateral connection to fuse the representation learned by each pathway. Since the fast pathway can be made very lightweight by reducing its channel capacity, the overall efficiency of SlowFast is largely improved. Although SlowFast has two pathways, it is different from the two-stream networks [187], because the two pathways are designed to model different temporal speeds, not spatial and temporal modeling. There are several concurrent papers using multiple pathways to balance the accuracy and efficiency [43].
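The dual-rate sampling can be sketched as follows (the strides and the ratio alpha below are illustrative, not the exact configuration of [45]):

```python
def slowfast_indices(num_frames, alpha=8, fast_stride=2):
    # The fast pathway samples frames densely (every fast_stride frames),
    # while the slow pathway samples alpha times more sparsely. The fast
    # pathway stays cheap by using far fewer channels per frame.
    fast = list(range(0, num_frames, fast_stride))
    slow = list(range(0, num_frames, fast_stride * alpha))
    return slow, fast

slow, fast = slowfast_indices(64)
```

For a 64-frame clip this yields 32 fast-pathway frames but only 4 slow-pathway frames, which is why most of the semantic-heavy computation can live in the slow pathway.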
Following this line, Feichtenhofer [44] introduced X3D, which progressively expands a 2D image classification architecture along multiple network axes, such as temporal duration, frame rate, spatial resolution, width, bottleneck width, and depth. X3D pushes 3D model modification/factorization to an extreme, and is a family of efficient video networks to meet different requirements of target complexity. In a similar spirit, A3D [276] also leverages multiple network configurations. However, A3D trains these configurations jointly and deploys only one model during inference, which makes the final model more efficient. In the next section, we will continue to talk about efficient video modeling, but not based on 3D convolutions.
3.4. Efficient Video Modeling
With the increase of dataset size and the need for deployment, efficiency becomes an important concern.
If we use methods based on two-stream networks, we need to pre-compute optical flow and store them on local disk. Taking Kinetics400 dataset as an illustrative example, storing all the optical flow images requires 4.5TB disk space. Such a huge amount of data would make I/O become the tightest bottleneck during training, leading to a waste of GPU resources and longer experiment cycle. In addition, pre-computing optical flow is not cheap, which means all the two-stream networks methods are not real-time.
If we use methods based on 3D CNNs, people still find that 3D CNNs are hard to train and challenging to deploy. In terms of training, a standard SlowFast network trained on the Kinetics400 dataset using a high-end 8-GPU machine takes 10 days to complete. Such a long experimental cycle and huge computing cost makes video understanding research only accessible to big companies/labs with abundant computing resources. There are several recent attempts to speed up the training of deep video models [230], but these are still expensive compared to most image-based computer vision tasks. In terms of deployment, 3D convolution is not as well supported as 2D convolution on different platforms. Furthermore, 3D CNNs require more video frames as input, which adds extra I/O cost.
Hence, starting from year 2018, researchers started to investigate other alternatives to see how they could improve accuracy and efficiency at the same time for video action recognition. We will review recent efficient video modeling methods in several categories below.
3.4.1 Flow-mimic approaches
One of the major drawbacks of two-stream networks is their need for optical flow. Pre-computing optical flow is computationally expensive, storage demanding, and not end-to-end trainable for video action recognition. It is appealing if we can find a way to encode motion information without using optical flow, at least during inference time.
[146] and [35] are early attempts for learning to estimate optical flow inside a network for video action recognition. Although these two approaches do not need optical flow during inference, they require optical flow during training in order to train the flow estimation network. Hidden two-stream networks [278] proposed MotionNet to replace the traditional optical flow computation. MotionNet is a lightweight network to learn motion information in an unsupervised manner, and when concatenated with the temporal stream, is end-to-end trainable. Thus, hidden two-stream CNNs [278] only take raw video frames as input and directly predict action classes without explicitly computing optical flow, whether in the training or inference stage. PAN [257] mimics the optical flow features by computing the difference between consecutive feature maps. Following this direction, [197, 42, 116, 164] continue to investigate end-to-end trainable CNNs to learn optical-flow-like features from data. They derive such features directly from the definition of optical flow [255]. MARS [26] and D3D [191] used knowledge distillation to combine two-stream networks into a single stream, e.g., by tuning the spatial stream to predict the outputs of the temporal stream. Recently, Kwon et al. [110] introduced the MotionSqueeze module to estimate the motion features. The proposed module is end-to-end trainable and can be plugged into any network, similar to [278].
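The feature-difference idea behind PAN-like motion mimicking can be sketched in one line (a strong simplification; PAN operates on low-level feature maps with additional processing around the difference):

```python
import numpy as np

def feature_difference_motion(feats):
    # Approximate motion cues as differences between consecutive frame-level
    # feature maps; feats: (T, C, H, W) -> (T-1, C, H, W).
    return feats[1:] - feats[:-1]

feats = np.random.randn(8, 16, 7, 7)   # hypothetical per-frame feature maps
motion = feature_difference_motion(feats)
```

Because the difference is computed on feature maps already inside the network, no separate optical flow computation or storage is needed at either training or inference time.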
3.4.2 Temporal modeling without 3D convolution
A simple and natural choice to model temporal relationship between frames is using 3D convolution. However, there are many alternatives to achieve this goal. Here, we will review some recent work that performs temporal modeling without 3D convolution.
Lin et al. [128] introduce a new method, termed the temporal shift module (TSM). TSM extends the shift operation [228] to video understanding. It shifts part of the channels along the temporal dimension, thus facilitating information exchange among neighboring frames. In order to keep the spatial feature learning capacity, they place the temporal shift module inside the residual branch of a residual block. Thus all the information in the original activation is still accessible after the temporal shift through identity mapping. The biggest advantage of TSM is that it can be inserted into a 2D CNN to achieve temporal modeling at zero computation and zero parameters. Similar to TSM, TIN [182] introduces a temporal interlacing module to model the temporal convolution.
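The channel shift itself can be sketched directly (a zero-padding variant; the residual placement discussed above is omitted here):

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    # x: (T, C, H, W) features of one clip. 1/fold_div of the channels shift
    # one step forward in time, another 1/fold_div shift backward, and the
    # rest stay in place; clip boundaries are zero-padded. The operation has
    # no parameters and costs almost no computation.
    t, c = x.shape[:2]
    fold = c // fold_div
    out = np.zeros_like(x)
    out[1:, :fold] = x[:-1, :fold]                   # shift forward in time
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]   # shift backward in time
    out[:, 2 * fold:] = x[:, 2 * fold:]              # untouched channels
    return out

x = np.arange(4 * 8).astype(float).reshape(4, 8, 1, 1)  # T=4, C=8 toy features
y = temporal_shift(x, fold_div=4)
```

After the shift, a plain 2D convolution at frame t sees channels originating from frames t-1, t and t+1, which is how temporal mixing emerges without any 3D convolution.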
There are several recent 2D CNN approaches using attention to perform long-term temporal modeling [92, 122, 132, 133]. STM [92] proposes a channel-wise spatio-temporal module to present the spatio-temporal features and a channel-wise motion module to efficiently encode motion features. TEA [122] is similar to STM, but inspired by SENet [81], TEA uses motion features to recalibrate the spatio-temporal features to enhance the motion pattern. Specifically, TEA has two components: motion excitation and multiple temporal aggregation, where the first handles short-range motion modeling and the second efficiently enlarges the temporal receptive field for long-range temporal modeling. They are complementary and both light-weight, thus TEA is able to achieve competitive results with previous best approaches while keeping FLOPs as low as many 2D CNNs. Recently, TEINet [132] also adopts attention to enhance temporal modeling. Note that the above attention-based methods are different from non-local [219], because they use channel attention while non-local uses spatial attention.
There are several recent 2D CNNs approaches using attention to perform long-term temporal modeling [92, 122, 132, 133]. STM [92] proposes a channel-wise spatio- temporal module to present the spatio-temporal features and a channel-wise motion module to efficiently encode mo- tion features. TEA [122] is similar to STM, but inspired by SENet [81], TEA uses motion features to recalibrate the spatio-temporal features to enhance the motion pattern. Specifically, TEA has two components: motion excitation and multiple temporal aggregation, while the first one han- dles short-range motion modeling and the second one effi- ciently enlarge the temporal receptive field for long-range temporal modeling. They are complementary and both light-weight, thus TEA is able to achieve competitive re- sults with previous best approaches while keeping FLOPs as low as many 2D CNNs. Recently, TEINet [132] also adopts attention to enhance temporal modeling. Note that, the above attention-based methods are different from non- local [219], because they use channel attention while non- local uses spatial attention.
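The channel attention these methods build on can be reduced to a few lines. Below is a minimal, framework-free sketch of SE-style recalibration (the mechanism TEA borrows from SENet): each channel is squeezed to a scalar, gated through a sigmoid, and used to rescale the channel. A real SE block inserts two fully-connected layers before the gate; the function name and list-of-lists layout here are purely illustrative.

```python
import math

def channel_attention(features):
    """SE-style channel recalibration (simplified): squeeze each channel to a
    scalar by global average pooling, gate it with a sigmoid, rescale the
    channel. `features` is a list of channels, each a flat list of values."""
    gates = []
    for channel in features:
        squeezed = sum(channel) / len(channel)           # global average pool
        gates.append(1.0 / (1.0 + math.exp(-squeezed)))  # sigmoid excitation
    return [[v * g for v in ch] for ch, g in zip(features, gates)]
```

A channel with a large average activation keeps most of its magnitude, while a weak channel is suppressed toward half scale; TEA applies the same gating idea, but computes the gate from motion features instead of the channel itself.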
3.5. Miscellaneous
In this section, we review several other directions that have been popular for video action recognition over the last decade.
3.5.1 Trajectory-based methods
While CNN-based approaches have demonstrated their superiority and gradually replaced traditional hand-crafted methods, the traditional local-feature pipeline still has merits that should not be ignored, such as the use of trajectories.
Inspired by the good performance of trajectory-based methods [210], Wang et al. [214] proposed trajectory-constrained pooling to aggregate deep convolutional features into effective descriptors, which they term TDD. Here, a trajectory is defined as a path tracking a point across the temporal dimension. This new video representation shares the merits of both hand-crafted and deep-learned features, and became one of the top performers on both the UCF101 and HMDB51 datasets in 2015. Concurrently, Lan et al. [113] incorporated both Independent Subspace Analysis (ISA) and dense trajectories into the standard two-stream networks, and showed the complementarity between data-independent and data-driven approaches. Instead of treating CNNs as a fixed feature extractor, Zhao et al. [268] proposed trajectory convolution to learn features along the temporal dimension with the help of trajectories.
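To make the idea of trajectory-constrained pooling concrete, here is a toy sketch, not TDD's actual implementation: given per-frame feature maps and a tracked path of (frame, y, x) points, the deep features are simply aggregated along that path.

```python
def trajectory_pool(feature_maps, trajectory):
    """Toy trajectory-constrained pooling: feature_maps[t][y][x] is the deep
    feature value at frame t, position (y, x); trajectory is a list of
    (t, y, x) points along a tracked path. Returns the average along it."""
    values = [feature_maps[t][y][x] for t, y, x in trajectory]
    return sum(values) / len(values)
```

The point of pooling along the track rather than at a fixed location is that the descriptor follows the moving object, so the aggregated feature stays motion-aware.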
3.5.2 Rank pooling
There is another way to model temporal information inside a video, termed rank pooling (a.k.a. learning-to-rank). The seminal work in this line is VideoDarwin [53], which uses a ranking machine to learn the evolution of appearance over time and returns a ranking function. The ranking function should be able to order the frames of a video temporally, so the parameters of this ranking function are used as a new video representation. VideoDarwin [53] is not a deep learning based method, but achieves comparable performance and efficiency.
To adapt rank pooling to deep learning, Fernando [54] introduces a differentiable rank pooling layer to achieve end-to-end feature learning. Following this direction, Bilen et al. [9] apply rank pooling to the raw image pixels of a video, producing a single RGB image per video, termed a dynamic image. Another concurrent work by Fernando [51] extends rank pooling to hierarchical rank pooling by stacking multiple levels of temporal encoding. Finally, [22] proposes a generalization of the original ranking formulation [53] using subspace representations and shows that it leads to a significantly better representation of the dynamic evolution of actions, while being computationally cheap.
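Dynamic images admit a well-known closed-form approximation of rank pooling: the video collapses into a single weighted sum of its frames. The sketch below uses the simplified weights alpha_t = 2t - T - 1 and represents a frame as a flat list of pixel values; the full formulation additionally works on time-averaged feature vectors, which is omitted here.

```python
def dynamic_image(frames):
    """Approximate rank pooling: collapse a video into one image via a
    weighted sum of frames, with weights alpha_t = 2t - T - 1 (t = 1..T).
    Later frames get positive weight and earlier ones negative, so the
    result encodes the direction of temporal evolution."""
    T = len(frames)
    out = [0.0] * len(frames[0])
    for t, frame in enumerate(frames, start=1):
        alpha = 2 * t - T - 1
        for i, pixel in enumerate(frame):
            out[i] += alpha * pixel
    return out
```

Note that a perfectly static video cancels out: the weights sum to zero, so only temporal change survives, which is exactly why a dynamic image is a useful motion summary.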
3.5.3 Compressed video action recognition
Most video action recognition approaches use raw videos (or decoded video frames) as input. However, there are several drawbacks to using raw videos, such as the huge amount of data and high temporal redundancy. Because adjacent frames are similar, video compression methods usually store a full frame (the I-frame) and represent the following frames only by their differences from it (P-frames and B-frames). Here, the I-frame is the original RGB video frame, while P-frames and B-frames contain the motion vectors and residuals that encode the difference. Motivated by these developments in the video compression domain, researchers started to adopt compressed video representations as input to train effective video models.
Since the motion vector has coarse structure and may contain inaccurate movements, Zhang et al. [256] adopted knowledge distillation to help the motion-vector-based temporal stream mimic the optical-flow-based temporal stream.
Table 2. Results of widely adopted methods on three scene-focused datasets. Pre-train indicates which dataset the model is pre-trained on. I: ImageNet, S: Sports1M and K: Kinetics400. NL represents non-local.
However, their approach required extracting and processing each frame. They obtained recognition accuracy comparable to standard two-stream networks, but were 27 times faster. Wu et al. [231] used a heavyweight CNN for the I-frames and lightweight CNNs for the P-frames. This required that the motion vectors and residuals of each P-frame be referred back to the I-frame by accumulation. DMC-Net [185] is a follow-up to [231] using an adversarial loss. It adopts a lightweight generator network to help the motion vectors capture fine motion details, instead of knowledge distillation as in [256]. A recent paper, SCSampler [106], also adopts compressed video representations for sampling salient clips; we will discuss it in section 3.5.4. As yet, none of the compressed approaches can deal with B-frames due to the added complexity.
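The accumulation step can be illustrated with a toy example. The sketch below simply chains per-step displacements so that each P-frame's motion is expressed relative to the I-frame rather than its predecessor; the real procedure operates per block, follows the referenced positions spatially, and also accumulates residuals, all of which are omitted here.

```python
def accumulate_motion(per_step_vectors):
    """Toy version of referring P-frame motion back to the I-frame: chain
    the per-step motion vectors of consecutive P-frames so each entry is a
    displacement relative to the I-frame instead of the previous frame."""
    accumulated, dx, dy = [], 0, 0
    for sx, sy in per_step_vectors:
        dx += sx
        dy += sy
        accumulated.append((dx, dy))
    return accumulated
```

This is what lets a lightweight per-P-frame network see motion in a consistent reference frame, instead of only frame-to-frame deltas.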
3.5.4 Frame/Clip sampling
Most of the aforementioned deep learning methods treat every video frame/clip equally in the final prediction. However, discriminative actions only happen in a few moments, and most of the other video content is irrelevant or weakly related to the labeled action category. This paradigm has several drawbacks. First, training with a large proportion of irrelevant video frames may hurt performance. Second, such uniform sampling is not efficient during inference.
Partially inspired by how humans understand a video using just a few glimpses of the entire video [251], many methods were proposed to sample the most informative video frames/clips, both to improve performance and to make the model more efficient during inference.
KVM [277] is one of the first attempts at an end-to-end framework that simultaneously identifies key volumes and does action classification. Later, [98] introduce AdaScan, which predicts the importance score of each video frame in an online fashion, which they term adaptive temporal pooling. Both of these methods achieve improved performance, but they still adopt the standard evaluation scheme, which does not show efficiency gains during inference. Recent approaches focus more on efficiency [41, 234, 8, 106]. AdaFrame [234] follows [251, 98] but uses a reinforcement learning based approach to search for more informative video clips. Concurrently, [8] uses a teacher-student framework, i.e., a see-it-all teacher can be used to train a compute-efficient see-very-little student. They demonstrate that the efficient student network can reduce the inference time by 30% and the number of FLOPs by approximately 90% with a negligible performance drop. Recently, SCSampler [106] trains a lightweight network to sample the most salient video clips based on compressed video representations, and achieves state-of-the-art performance on both the Kinetics400 and Sports1M datasets. They also show empirically that such saliency-based sampling is not only efficient, but also enjoys higher accuracy than using all the video frames.
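The selection step of such saliency-based samplers boils down to a top-k over clip scores. The sketch below is schematic rather than SCSampler's actual code: it assumes a hypothetical lightweight scorer has already produced one saliency score per clip, and only the selected clips would be passed to the heavy recognition network.

```python
def select_salient_clips(clip_scores, k):
    """Keep the indices of the k highest-scoring clips, returned in
    temporal order; `clip_scores` holds one saliency score per clip
    from a (hypothetical) lightweight scorer."""
    ranked = sorted(range(len(clip_scores)),
                    key=lambda i: clip_scores[i], reverse=True)
    return sorted(ranked[:k])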
3.5.5 Visual tempo
Visual tempo is a concept describing how fast an action goes, and many action classes have different visual tempos. Several papers explore different temporal rates (tempos) for improved temporal modeling [273, 147, 82, 281, 45, 248]. Initial attempts usually capture the video tempo by sampling raw videos at multiple rates and constructing an input-level frame pyramid [273, 147, 281]. Recently, SlowFast [45], as discussed in section 3.3.4, utilizes the characteristics of visual tempo to design a two-pathway network for a better accuracy and efficiency tradeoff. CIDC [121] proposed directional temporal modeling along with a local backbone for video temporal modeling. TPN [248] extends tempo modeling to the feature level and shows consistent improvement over previous approaches.
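An input-level frame pyramid, the starting point of these tempo methods, is just the same video sampled at several temporal strides. A minimal index-only sketch (the function name and clamping at the video's end are illustrative choices, not taken from any specific paper):

```python
def frame_pyramid(num_frames, rates, clip_len):
    """Sample the same video at several temporal strides: a small stride
    captures a slow tempo, a large stride a fast one. Returns one list of
    frame indices per rate, clamped to the last frame."""
    pyramid = []
    for rate in rates:
        indices = [min(i * rate, num_frames - 1) for i in range(clip_len)]
        pyramid.append(indices)
    return pyramid
```

SlowFast can be seen as fixing two such rates and giving each its own pathway, with the heavy pathway on the slow (sparser) stream.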
We would like to point out that visual tempo is also widely used in self-supervised video representation learning [6, 247, 16] since it can naturally provide supervision signals to train a deep network. We will discuss more details on self-supervised video representation learning in section 5.13.
4. Evaluation and Benchmarking
In this section, we compare popular approaches on benchmark datasets. To be specific, we first introduce standard evaluation schemes in section 4.1. Then we divide common benchmarks into three categories: scene-focused (UCF101, HMDB51 and Kinetics400 in section 4.2), motion-focused (Sth-Sth V1 and V2 in section 4.3) and multi-label (Charades in section 4.4). In the end, we present a fair comparison among popular methods in terms of both recognition accuracy and efficiency in section 4.5.
4.1. Evaluation scheme
During model training, we usually randomly pick a video frame/clip to form mini-batch samples. However, for evaluation, we need a standardized pipeline in order to perform fair comparisons.
For 2D CNNs, a widely adopted evaluation scheme is to evenly sample 25 frames from each video, following [187, 217]. For each frame, we perform ten-crop data augmentation: cropping the 4 corners and the center, flipping them horizontally, and averaging the prediction scores (before the softmax operation) over all crops; this means we use 250 frames per video for inference.
For 3D CNNs, a widely adopted evaluation scheme termed 30-view strategy is to evenly sample 10 clips from each video following [219]. For each video clip, we perform three-crop data augmentation. To be specific, we scale the shorter spatial side to 256 pixels and take three crops of 256 × 256 to cover the spatial dimensions and average the prediction scores.
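Both the 25-frame ten-crop scheme and the 30-view strategy reduce to enumerating a fixed grid of (temporal position, spatial crop) views and averaging the predictions over them. A small sketch that reproduces the view counts (the exact temporal positions are an illustrative choice, centered within evenly spaced segments):

```python
def evaluation_views(num_frames, num_samples, num_crops):
    """Enumerate the deterministic test-time views: `num_samples` temporal
    positions evenly spaced over the video, each paired with `num_crops`
    spatial crops. 25 x 10 gives the 250-view 2D scheme; 10 x 3 gives the
    30-view 3D scheme."""
    step = num_frames / num_samples
    positions = [int(step / 2 + i * step) for i in range(num_samples)]
    return [(p, c) for p in positions for c in range(num_crops)]
```

The final prediction is the average of the (pre-softmax) scores over every view in this list.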
However, the evaluation schemes are not fixed; they keep evolving to adapt to new network architectures and different datasets. For example, TSM [128] only uses two clips per video for the small-sized datasets [190, 109], and performs three-crop data augmentation for each clip despite being a 2D CNN. We will mention any deviations from the standard evaluation pipeline.
In terms of evaluation metric, we report accuracy for single-label action recognition, and mAP (mean average precision) for multi-label action recognition.
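For reference, a minimal implementation of mAP for the multi-label case: per-class average precision computed over a score ranking, then averaged over classes. This is the textbook definition; dataset-specific scripts (e.g., the official Charades evaluation) may differ in details.

```python
def average_precision(scores, labels):
    """AP for one class: rank samples by score and average the precision
    measured at each positive hit. `labels` holds 0/1 ground truth."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(score_matrix, label_matrix):
    """mAP: the AP averaged over classes (one column per class)."""
    num_classes = len(score_matrix[0])
    aps = [average_precision([row[c] for row in score_matrix],
                             [row[c] for row in label_matrix])
           for c in range(num_classes)]
    return sum(aps) / num_classes
```

Unlike plain accuracy, mAP rewards ranking every relevant action of a video above the irrelevant ones, which is why it is the standard metric for multi-label datasets such as Charades.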
4.2. Scene-focused datasets
Here, we compare recent state-of-the-art approaches on scene-focused datasets: UCF101, HMDB51 and Kinetics400. We call them scene-focused because most action videos in these datasets are short, and can be recognized by static scene appearance alone as shown in Figure 4.

Following the chronology, we first present results for early attempts at using deep learning and for the two-stream networks at the top of Table 2. We make several observations. First, without motion/temporal modeling, the performance of DeepVideo [99] is inferior to all other approaches. Second, it is helpful to transfer knowledge from traditional (non-CNN-based) methods to deep learning. For example, TDD [214] uses trajectory pooling to extract motion-aware CNN features. TLE [36] embeds global feature encoding, an important step in the traditional video action recognition pipeline, into a deep network.
Table 3. Results of widely adopted methods on the Something-Something V1 and V2 datasets. We only report numbers obtained without using optical flow. Pre-train indicates which dataset the model is pre-trained on. I: ImageNet and K: Kinetics400. Views means the number of temporal clips multiplied by the number of spatial crops, e.g., 30 means 10 temporal clips with 3 spatial crops each.
We then compare 3D CNN based approaches in the middle of Table 2. Despite training on a large corpus of videos, C3D [202] performs worse than concurrent work [187, 214, 217], possibly due to difficulties in optimizing 3D kernels. Motivated by this, several papers, namely I3D [14], P3D [169], R2+1D [204] and S3D [239], factorize 3D convolution filters into 2D spatial kernels and 1D temporal kernels to ease training. In addition, I3D introduces an inflation strategy that avoids training from scratch by bootstrapping the 3D model weights from well-trained 2D networks. Using these techniques, they achieve performance comparable to the best two-stream network methods [36] without the need for optical flow. Furthermore, recent 3D models obtain even higher accuracy by using more training samples [203], additional pathways [45], or architecture search [44].
Finally, we show recent efficient models at the bottom of Table 2. We can see that these methods achieve higher recognition accuracy than two-stream networks (top), and comparable performance to 3D CNNs (middle). Since they are 2D CNNs and do not use optical flow, these methods are efficient in terms of both training and inference. Most of them are real-time approaches, and some can do online video action recognition [128]. We believe 2D CNNs plus temporal modeling is a promising direction given the need for efficiency. Here, the temporal modeling could be attention based, flow based or 3D kernel based.
4.3. Motion-focused datasets
In this section, we compare recent state-of-the-art approaches on the 20BN-Something-Something (Sth-Sth) datasets. We report top-1 accuracy on both V1 and V2. The Sth-Sth datasets focus on humans performing basic actions with daily objects. Different from scene-focused datasets, the background scene in Sth-Sth datasets contributes little to the final action class prediction. In addition, there are classes such as “Pushing something from left to right” and “Pushing something from right to left”, which require strong motion reasoning.
By comparing the previous work in Table 3, we observe that using longer input (e.g., 16 frames) is generally better. Moreover, methods that focus on temporal modeling [128, 122, 92] work better than stacked 3D kernels [14]. For example, TSM [128], TEA [122] and MSNet [110] insert an explicit temporal reasoning module into 2D ResNet backbones and achieve state-of-the-art results. This implies that the Sth-Sth datasets need strong temporal motion reasoning as well as spatial semantic information.
4.4. Multi-label datasets
In this section, we first compare recent state-of-the-art approaches on the Charades dataset [186], and then we list some recent work on Charades that uses assembled models or additional object information.
Comparing the previous work in Table 4, we make the following observations. First, 3D models [229, 45] generally perform better than 2D models [186, 231] and 2D models with optical flow input. This indicates that spatio-temporal reasoning is critical for long-term, complex, concurrent action understanding. Second, longer input helps with recognition [229], probably because some actions require long-term features to recognize. Third, models with strong backbones that are pre-trained on larger datasets generally perform better [45]. This is because Charades is a medium-scale dataset that does not contain enough diversity to train a deep model.
Table 4. Charades evaluation using mAP, calculated using the officially provided script. NL: non-local network. Pre-train indicates which dataset the model is pre-trained on. I: ImageNet, K400: Kinetics400 and K600: Kinetics600.
Recently, researchers explored an alternative direction for complex concurrent action recognition by assembling models [177] or providing additional human-object interaction information [90]. These papers significantly outperformed previous literature that only finetunes a single model on Charades. This demonstrates that exploring spatio-temporal human-object interactions and finding a way to avoid overfitting are the keys to concurrent action understanding.
4.5. Speed comparison
To deploy a model in real-life applications, we usually need to know whether it meets the speed requirement before we can proceed. In this section, we evaluate the approaches mentioned above to perform a thorough comparison in terms of (1) number of parameters, (2) FLOPS, (3) latency and (4) frames per second.
We present the results in Table 5. Here, we use the models in the GluonCV video action recognition model zoo, since all these models are trained using the same data, the same data augmentation strategy and the same 30-view evaluation scheme, making the comparison fair. All timings are done on a single Tesla V100 GPU with 105 repeated runs, where the first 5 runs are ignored as warm-up. We always use a batch size of 1. In terms of model input, we use the settings suggested in the original papers.
As we can see in Table 5, if we compare latency, 2D models are much faster than all the 3D variants. This is probably why most real-world video applications still adopt frame-wise methods. Second, as mentioned in [170, 259], FLOPS is not strongly correlated with the actual inference time (i.e., latency). Third, comparing performance, most 3D models give similar accuracy around 75%, but pre-training with a larger dataset can significantly boost performance. This indicates the importance of training data and partially suggests that self-supervised pre-training might be a promising way to further improve existing methods.
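The timing protocol above (many repeated runs, the first few discarded as warm-up) can be sketched as follows. `run_inference` stands in for one forward pass of the model being measured; note that accurate GPU timing additionally requires device synchronization before reading the clock, which this CPU-side sketch omits.

```python
import statistics
import time

def measure_latency(run_inference, repeats=105, warmup=5):
    """Benchmark a callable: repeat it `repeats` times, discard the first
    `warmup` runs (caches, JIT, clocks settling), return the mean latency
    in seconds over the remaining runs."""
    timings = []
    for i in range(repeats):
        start = time.perf_counter()
        run_inference()
        elapsed = time.perf_counter() - start
        if i >= warmup:  # ignore warm-up runs
            timings.append(elapsed)
    return statistics.mean(timings)
```

Measuring wall-clock latency this way, rather than reporting FLOPS, is exactly what reveals the discrepancy noted above between theoretical compute and actual inference time.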
5. Discussion and Future Work
We have surveyed more than 200 deep learning based methods for video action recognition since year 2014. Despite the performance on benchmark datasets plateauing, there are many active and promising directions in this task worth exploring.
5.1. Analysis and insights
More and more methods have been developed to improve video action recognition; at the same time, some papers summarize these methods and provide analysis and insights. Huang et al. [82] perform an explicit analysis of the effect of temporal information on video understanding. They try to answer the question “how important is the motion in the video for recognizing the action”. Feichtenhofer et al. [48, 49] provide an illuminating visualization of what two-stream models have learned, in order to understand how these deep representations work and what they are capturing. Li et al. [124] introduce the concept of the representation bias of a dataset, and find that current datasets are biased towards static representations. Experiments on such biased datasets may lead to erroneous conclusions, which is indeed a big problem that limits the development of video action recognition. Recently, Piergiovanni et al. introduced the AViD [165] dataset to cope with data bias by collecting data from diverse groups of people. These papers provide great insights that help fellow researchers understand the challenges, the open problems and where the next breakthrough might reside.
5.2. Data augmentation
Numerous data augmentation methods have been proposed in image recognition domain, such as mixup [258], cutout [31], CutMix [254], AutoAugment [27], FastAutoAug [126], etc. However, video action recognition still adopts basic data augmentation techniques introduced before year 2015 [217, 188], including random resizing, random cropping and random horizontal flipping. Recently, SimCLR [17] and other papers have shown that color jittering and random rotation greatly help representation learning. Hence, an investigation of using different data augmentation techniques for video action recognition is particularly useful. This may change the data pre-processing pipeline for all existing methods.
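As an example of what porting image-domain augmentation to video could look like, here is a minimal mixup sketch on flat feature and one-hot label lists; for a video clip, the same blend would simply be applied to every frame. The function name and shapes are illustrative, not from any specific implementation.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Mixup: blend two samples and their one-hot labels with a weight
    lam drawn from Beta(alpha, alpha). Returns the mixed sample, the
    mixed label, and lam."""
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```

Whether such label-mixing interacts well with temporal modeling (e.g., blending two clips with different tempos) is exactly the kind of open question such an investigation would need to answer.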
5.3. Video domain adaptation
Domain adaptation (DA) has been studied extensively in recent years to address the domain shift problem. Despite ever-higher accuracy on standard datasets, the generalization capability of current video models across datasets or domains is less explored.
There is early work on video domain adaptation [193, 241, 89, 159]. However, this literature focuses on small-scale video DA with only a few overlapping categories, which may not reflect the actual domain discrepancy and may lead to biased conclusions. Chen et al. [15] introduce two larger-scale datasets to investigate video DA and find that aligning temporal dynamics is particularly useful. Pan et al. [152] adopt co-attention to solve the temporal misalignment problem. Very recently, Munro et al. [145] explore a multi-modal self-supervision method for fine-grained video action recognition and show the effectiveness of multi-modal learning in video DA. Shuffle and Attend [95] argues that aligning the features of all sampled clips results in a sub-optimal solution, because not all clips include relevant semantics. Therefore, they propose to use an attention mechanism to focus more on the informative clips and discard the non-informative ones. In conclusion, video DA is a promising direction, especially for researchers with fewer computing resources.
5.4. Neural architecture search
Neural architecture search (NAS) has attracted great interest in recent years and is a promising research direction. However, given its enormous demand for computing resources, only a few papers have been published in this area [156, 163, 161, 178]. The TVN family [161], which jointly optimizes parameters and runtime, achieves accuracy competitive with contemporary human-designed models while running much faster (within 37 to 100 ms on a CPU and 10 ms on a GPU per 1-second video clip). AssembleNet [178] and AssembleNet++ [177] provide a generic approach to learn the connectivity among feature representations across input modalities, and show surprisingly good performance on Charades and other benchmarks. AttentionNAS [222] proposed a solution for spatio-temporal attention cell search. The discovered cell can be plugged into any network to improve the spatio-temporal features. All of these papers show the potential of NAS for video understanding.
Recently, some efficient ways of searching architectures have been proposed in the image recognition domain, such as DARTS [130], ProxylessNAS [11], ENAS [160], one-shot NAS [7], etc. It would be interesting to combine efficient 2D CNNs and efficient search algorithms to perform video NAS at a reasonable cost.
5.5. Efficient model development
Despite their accuracy, deep learning based methods for video understanding remain difficult to deploy in real-world applications. There are several major challenges: (1) most methods are developed in an offline setting, which means the input is a short video clip rather than a video stream as in an online setting; (2) most methods do not meet the real-time requirement; (3) 3D convolutions and other non-standard operators are poorly supported on non-GPU devices (e.g., edge devices).
Hence, the development of efficient network architecture based on 2D convolutions is a promising direction. The approaches proposed in the image classification domain can be easily adapted to video action recognition, e.g. model compression, model quantization, model pruning, distributed training [68, 127], mobile networks [80, 265], mixed precision training, etc. However, more effort is needed for the online setting since the input to most action recognition applications is a video stream, such as surveillance monitoring. We may need a new and more comprehensive dataset for benchmarking online video action recognition methods. Lastly, using compressed videos might be desirable because most videos are already compressed, and we have free access to motion information.
5.6. New datasets
Data is at least as important as model development for machine learning. For video action recognition, most datasets are biased towards spatial representations [124], i.e., most actions can be recognized from a single frame inside the video without considering the temporal movement. Hence, a new dataset aimed at long-term temporal modeling is required to advance video understanding. Furthermore, most current datasets are collected from YouTube. Due to copyright/privacy issues, the dataset organizer often only releases the YouTube ids or video links for users to download, not the actual videos. The first problem is that downloading the large-scale datasets might be slow in some regions. In particular, YouTube recently started to block massive downloading from a single IP. Thus, many researchers may not even get the dataset to start working in this field. The second problem is that, due to region limitations and privacy issues, some videos are not accessible anymore. For example, the original Kinetics400 dataset has over 300K videos, but at this moment, we can only crawl about 280K videos. On average, we lose 5% of the videos every year. It is impossible to do fair comparisons between methods when they are trained and evaluated on different data.
5.7. Video adversarial attack
Adversarial examples have been well studied on image models. [199] first showed that an adversarial sample, computed by adding a small amount of noise to the original image, may lead to a wrong prediction. However, limited work has been done on attacking video models.
This task usually considers two settings: a white-box attack [86, 119, 66, 21], where the adversary always has full access to the model, including exact gradients for a given input, or a black-box one [93, 245, 226], in which the structure and parameters of the model are hidden so that the attacker can only access (input, output) pairs through queries. The recent ME-Sampler [260] leverages motion information directly when generating adversarial videos, and is shown to successfully attack a number of video classification models using far fewer queries. In summary, this direction is useful since many companies provide APIs for services such as video classification, anomaly detection, shot detection, face detection, etc. In addition, this topic is also related to detecting DeepFake videos. Hence, investigating both attacking and defending methods is crucial to keeping these video services safe.
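To make the white-box setting concrete, the classic fast gradient sign method (FGSM) shows how exact input gradients are exploited. Below is a minimal sketch on a toy linear classifier over flattened video frames; the model and all names are illustrative, not a method from the papers cited above:

```python
import numpy as np

def fgsm_attack(video, weights, label, eps=0.03):
    """White-box FGSM on a toy linear softmax classifier.

    video:   (T, H, W) clip with values in [0, 1]
    weights: (num_classes, T*H*W) linear classifier weights
    label:   true class index
    """
    x = video.reshape(-1)                      # flatten the clip
    logits = weights @ x
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax
    # gradient of cross-entropy loss w.r.t. the input: W^T (p - y)
    onehot = np.eye(weights.shape[0])[label]
    grad = weights.T @ (probs - onehot)
    # push every pixel by eps in the direction that increases the loss
    adv = np.clip(x + eps * np.sign(grad), 0.0, 1.0)
    return adv.reshape(video.shape)

rng = np.random.default_rng(0)
video = rng.random((4, 8, 8))                  # 4-frame toy clip
W = rng.normal(size=(5, video.size))           # 5-class linear model
adv = fgsm_attack(video, W, label=2)
print(np.abs(adv - video).max() <= 0.03 + 1e-9)  # perturbation stays bounded
```

A black-box attacker lacks `grad` and must estimate a useful direction purely from query results, which is why query efficiency (as in ME-Sampler) is the key metric in that setting.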
5.8. Zero-shot action recognition
Zero-shot learning (ZSL) has been trending in the image understanding domain, and has recently been adapted to video action recognition. Its goal is to transfer the learned knowledge to classify previously unseen categories. Given (1) the expense of data sourcing and annotation and (2) the huge set of possible human actions, zero-shot action recognition is a very useful task for real-world applications.
There are many early attempts [242, 88, 243, 137, 168, 57] in this direction. Most of them follow a standard framework: first extract visual features from videos using a pretrained network, then train a joint model that maps the visual embedding to a semantic embedding space. In this manner, the model can be applied to new classes by finding the test class whose embedding is the nearest neighbor of the model's output. A recent work, URL [279], proposes to learn a universal representation that generalizes across datasets. Following URL [279], the authors of [10] present the first end-to-end ZSL action recognition model. They also establish a new ZSL training and evaluation protocol, and provide an in-depth analysis to further advance this field. Inspired by the success of pre-training followed by zero-shot inference in the NLP domain, we believe ZSL action recognition is a promising research topic.
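The standard framework above can be sketched end to end, assuming a simple least-squares visual-to-semantic mapping and cosine nearest-neighbor classification over unseen-class embeddings; the toy random features stand in for a pretrained network's outputs and word embeddings:

```python
import numpy as np

def train_visual_to_semantic(X, S):
    """Least-squares linear map from visual features X (N, d_v) to the
    semantic embeddings S (N, d_s) of each sample's class label."""
    W, *_ = np.linalg.lstsq(X, S, rcond=None)
    return W  # (d_v, d_s)

def zsl_predict(x, W, unseen_embeds):
    """Classify a visual feature among *unseen* classes by cosine
    nearest neighbor in the semantic space."""
    s = x @ W
    s = s / np.linalg.norm(s)
    E = unseen_embeds / np.linalg.norm(unseen_embeds, axis=1, keepdims=True)
    return int(np.argmax(E @ s))

# toy setup: two seen classes with fixed semantic embeddings
rng = np.random.default_rng(0)
seen_embeds = np.array([[1.0, 0.0], [0.0, 1.0]])
X = rng.normal(size=(100, 8))                 # stand-in visual features
y = rng.integers(0, 2, size=100)              # seen-class labels
W = train_visual_to_semantic(X, seen_embeds[y])

# unseen classes are described only by their semantic embeddings
unseen_embeds = np.array([[1.0, 1.0], [-1.0, 0.5]])
pred = zsl_predict(X[0], W, unseen_embeds)
```

The crucial property is that `zsl_predict` never saw training data for the unseen classes; knowledge transfers entirely through the shared semantic space.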
5.9. Weakly-supervised video action recognition
Building a high-quality video action recognition dataset [190, 100] usually requires multiple laborious steps: 1) sourcing a large amount of raw video, typically from the internet; 2) removing videos irrelevant to the categories in the dataset; 3) manually trimming the video segments that contain the actions of interest; 4) refining the categorical labels. Weakly-supervised action recognition explores how to reduce the cost of curating training data.
The first direction of research [19, 60, 58] aims to reduce the cost of sourcing videos and accurate categorical labeling. These works design training methods that use data such as action-related images or partially annotated videos, gathered from publicly available sources such as the Internet. This paradigm is thus also referred to as webly-supervised learning [19, 58]. Recent work on omni-supervised learning [60, 64, 38] also follows this paradigm but features bootstrapping on unlabeled videos by distilling the models' own inference results.
The second direction aims at removing trimming, the most time-consuming part of annotation. UntrimmedNet [216] proposed a method to learn an action recognition model on untrimmed videos with only categorical labels [149, 172]. This task is also related to weakly-supervised temporal action localization, which aims to automatically generate the temporal span of the actions. Several papers propose to learn models for these two tasks simultaneously [155] or iteratively [184].
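A common ingredient when learning from untrimmed videos with only video-level labels is aggregating per-clip class scores into a single video-level prediction, so the loss can be computed without knowing where the action occurs. The sketch below uses top-k average pooling, one typical choice (a simplification for illustration, not UntrimmedNet's exact formulation):

```python
import numpy as np

def video_level_scores(clip_scores, k=3):
    """Aggregate per-clip class scores (num_clips, num_classes) into one
    video-level score per class via top-k average pooling, so that only
    a video-level label is needed to supervise training."""
    topk = np.sort(clip_scores, axis=0)[-k:]       # k best clips per class
    return topk.mean(axis=0)                       # (num_classes,)

rng = np.random.default_rng(0)
scores = rng.random((20, 5))                       # 20 clips, 5 classes
scores[4:7, 2] = 5.0                               # a short burst of class 2
video_pred = video_level_scores(scores)
print(int(np.argmax(video_pred)))                  # → 2
```

Because the pooling ignores low-scoring clips, the background portions of the untrimmed video do not drown out the brief action, and the same per-clip scores can later be thresholded for weakly-supervised temporal localization.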
5.10. Fine-grained video action recognition
Popular action recognition datasets, such as UCF101 [190] or Kinetics400 [100], mostly comprise actions happening in various scenes. However, models learned on these datasets can overfit to contextual information irrelevant to the actions [224, 227, 24]. Several datasets have been proposed to study fine-grained action recognition, which examines a model's capacity to capture action-specific information. These datasets comprise fine-grained actions in human activities such as cooking [28, 108, 174], working [103] and sports [181, 124]. For example, FineGym [181] is a recent large dataset annotated with different moves and sub-actions in gymnastics videos.
5.11. Egocentric action recognition
Recently, large-scale egocentric action recognition [29, 28] has attracted increasing interest with the emergence of wearable camera devices. Egocentric action recognition requires a fine-grained understanding of hand motion and of the interacting objects in complex environments. A few papers leverage object detection features to offer fine object context to improve egocentric video recognition [136, 223, 229, 180]. Others incorporate spatio-temporal attention [192] or gaze annotations [131] to localize the interacting object and facilitate action recognition. As in third-person action recognition, multi-modal inputs (e.g., optical flow and audio) have been demonstrated to be effective for egocentric action recognition [101].
5.12. Multi-modality
Multi-modal video understanding has attracted increasing attention in recent years [55, 3, 129, 167, 154, 2, 105]. There are two main categories of multi-modal video understanding. The first group of approaches uses multiple modalities such as scene, object, motion, and audio to enrich the video representations. In the second group, the goal is to design a model that utilizes modality information as a supervision signal for pre-training [195, 138, 249, 62, 2].
Multi-modality for comprehensive video understanding. Learning a robust and comprehensive representation of video is extremely challenging due to the complexity of semantics in videos. Video data often includes variations in different forms, including appearance, motion, audio, text or scene [55, 129, 166]. Therefore, utilizing these multi-modal representations is a critical step towards understanding video content more effectively. The multi-modal representation of a video can be approximated by gathering representations of various modalities such as scene, object, audio, motion, appearance and text. Ngiam et al. [148] made an early attempt at using multiple modalities to obtain better features: they utilized videos of lips and the corresponding speech for multi-modal representation learning. Miech et al. [139] proposed a mixture-of-embedding-experts model to combine multiple modalities, including motion, appearance, audio, and face features, and to learn a shared embedding space between these modalities and text. Roig et al. [175] combine multiple modalities such as action, scene, object and acoustic event features in a pyramidal structure for action recognition, showing that adding each modality improves the final action recognition accuracy. Both CE [129] and MMT [55] follow a research line similar to [139], where the goal is to combine multiple modalities to obtain a comprehensive video representation for joint video-text representation learning. Piergiovanni et al. [166] utilized textual data together with video data to learn a joint embedding space; using this learned joint embedding space, the method is capable of zero-shot action recognition. This line of research is promising thanks to the availability of strong semantic extraction models and the success of transformers on both vision and language tasks.
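The mixture-of-embedding-experts idea can be sketched roughly as follows: per-modality similarities between a text query and the video's modality embeddings, combined with weights predicted from the text itself. This is a toy illustration of the concept behind [139] only; it assumes all modalities share one embedding dimension, whereas the real model uses learned per-modality projections:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moee_similarity(text_emb, modality_embs, gate_W):
    """Text-video score: cosine similarity per modality (e.g. appearance,
    motion, audio), mixed with text-conditioned gating weights."""
    weights = softmax(gate_W @ text_emb)       # one weight per modality
    sims = np.array([
        text_emb @ m / (np.linalg.norm(text_emb) * np.linalg.norm(m))
        for m in modality_embs
    ])
    return float(weights @ sims)               # convex combination of cosines

rng = np.random.default_rng(1)
text = rng.normal(size=16)                     # stand-in text embedding
mods = [rng.normal(size=16) for _ in range(3)] # 3 modality embeddings
gate_W = rng.normal(size=(3, 16))              # gating parameters
score = moee_similarity(text, mods, gate_W)
```

The gating lets the model rely on, say, audio for "playing violin" queries but motion for "cartwheel", rather than weighting every modality equally.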
Table 6. Comparison of self-supervised video representation learning methods. The top section shows the multi-modal video representation learning approaches and the bottom section shows the video-only representation learning methods. From left to right, we show the self-supervised training setting, e.g. dataset, modalities, resolution, and architecture. The last two columns show the action recognition results on two datasets, UCF101 and HMDB51, to measure the quality of the self-supervised pre-trained model. HTM: HowTo100M, YT8M: YouTube8M, AS: AudioSet, IG-K: IG-Kinetics, K400: Kinetics400 and K600: Kinetics600.
Multi-modality for self-supervised video representation learning. Most videos contain multiple modalities such as audio or text/captions. These modalities are a great source of supervision for learning video representations [3, 144, 154, 2, 162]. Korbar et al. [105] incorporated the natural synchronization between audio and video as a supervision signal in their contrastive learning objective for self-supervised representation learning. In multi-modal self-supervised representation learning, the dataset plays an important role. VideoBERT [195] collected 310K cooking videos from YouTube; however, this dataset is not publicly available. Similar to BERT, VideoBERT used a "masked language model" training objective and quantized the visual representations into "visual words". Miech et al. [140] introduced the HowTo100M dataset in 2019. This dataset includes 136M clips from 1.22M videos with their corresponding text. They collected the dataset from YouTube with the aim of obtaining instructional videos (how to perform an activity); in total, it covers 23.6K instructional tasks. MIL-NCE [138] used this dataset for self-supervised cross-modal representation learning, tackling the problem of visually misaligned narrations by considering multiple positive pairs in the contrastive learning objective. ActBERT [275] utilized the HowTo100M dataset for pre-training the model in a self-supervised way, incorporating visual, action, text and object features for cross-modal representation learning. Recently, AVLnet [176] and MMV [2] considered three modalities (visual, audio and language) for self-supervised representation learning. This research direction is getting increasing attention due to the success of contrastive learning on many vision and language tasks and the access to abundant unlabeled multi-modal video data on platforms such as YouTube, Instagram or Flickr.
The top section of Table 6 compares multi-modal self-supervised representation learning methods. We will discuss more work on video-only representation learning in the next section.
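The audio-video synchronization objective used by these methods can be sketched as a symmetric InfoNCE loss, where the i-th video clip and i-th audio track in a batch form the positive pair and all other pairings are negatives. This is a simplified stand-in for the exact losses in papers such as [105, 138], operating on toy embeddings:

```python
import numpy as np

def info_nce(video_embs, audio_embs, tau=0.1):
    """Symmetric InfoNCE contrastive loss over a batch: matching
    (video_i, audio_i) pairs are positives, everything else negatives."""
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    logits = v @ a.T / tau                      # (B, B) similarity matrix
    idx = np.arange(len(v))
    # cross-entropy with the diagonal entries as targets, both directions
    lp_v = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    lp_a = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return (-lp_v[idx, idx].mean() - lp_a[idx, idx].mean()) / 2

rng = np.random.default_rng(0)
B, d = 8, 32
v = rng.normal(size=(B, d))
aligned = info_nce(v, v + 0.01 * rng.normal(size=(B, d)))  # synced pairs
random_pairs = info_nce(v, rng.normal(size=(B, d)))        # broken sync
print(aligned < random_pairs)  # synchronized audio gives a lower loss
```

Minimizing this loss pulls each clip's video and audio embeddings together while pushing apart mismatched pairs, which is exactly the free supervision that natural audio-video synchronization provides.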
5.13. Self-supervised video representation learning
Self-supervised learning has attracted more attention recently as it can leverage a large amount of unlabeled data by designing a pretext task to obtain free supervisory signals from the data itself. It first emerged in image representation learning. On images, the first stream of papers aimed at designing pretext tasks that complete missing information, such as image colorization [262] and image reordering [153, 61, 263]. The second stream of papers uses instance discrimination [235] as the pretext task and contrastive losses [235, 151] for supervision; they learn visual representations by modeling the visual similarity of object instances without class labels [235, 75, 201, 18, 17].
Self-supervised learning is also viable for videos. Compared with images, videos have an additional axis, the temporal dimension, which we can use to craft pretext tasks. Information completion tasks for this purpose include predicting the correct order of shuffled frames [141, 52] and video clips [240]. Jing et al. [94] focus on the spatial dimension only, predicting the rotation angles of rotated video clips. Combining temporal and spatial information, several tasks have been introduced to solve a space-time cubic puzzle, anticipate future frames [208], forecast long-term motions [134], and predict motion and appearance statistics [211]. RSPNet [16] and visual tempo [247] exploit the relative speed between video clips as a supervision signal.
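As an example of how the temporal axis yields free labels, the sketch below builds training pairs for a playback-speed pretext task in the spirit of RSPNet / visual tempo: the same video is subsampled at different temporal strides, and the label is simply which speed produced the clip. All details (speed set, clip length) are illustrative:

```python
import numpy as np

def speed_pretext_pairs(video, speeds=(1, 2, 4), clip_len=8, rng=None):
    """Create (clip, label) pairs for a playback-speed pretext task:
    sample the video at each stride in `speeds`; the label is the speed
    index, so no human annotation is needed."""
    if rng is None:
        rng = np.random.default_rng()
    pairs = []
    for label, s in enumerate(speeds):
        max_start = len(video) - s * clip_len
        start = rng.integers(0, max_start + 1)
        clip = video[start : start + s * clip_len : s]  # stride-s subsample
        pairs.append((clip, label))
    return pairs

video = np.arange(64)[:, None]   # toy "video": 64 frames, 1 feature each
pairs = speed_pretext_pairs(video, rng=np.random.default_rng(0))
print([(clip.shape[0], label) for clip, label in pairs])  # [(8, 0), (8, 1), (8, 2)]
```

A network trained to predict the label from the clip must pick up on motion dynamics rather than static appearance, which is why relative speed works as a supervision signal.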
The added temporal axis also provides flexibility for designing instance discrimination pretexts [67, 167]. Inspired by the decoupling of 3D convolution into spatially and temporally separable convolutions [239], Zhang et al. [266] proposed to decouple video representation learning into two sub-tasks: spatial contrast and temporal contrast. Recently, Han et al. [72] proposed memory-augmented dense predictive coding for self-supervised video representation learning: each video is split into several blocks, and the embedding of a future block is predicted by combining condensed representations in memory.
The temporal continuity in videos inspires researchers to design other pretext tasks around correspondence. Wang et al. [221] proposed to learn representations by performing cycle-consistency tracking. Specifically, they track the same object backward and then forward in consecutive video frames, and use the inconsistency between the start and end points as the loss function. TCC [39] is a concurrent paper; instead of tracking local objects, [39] used cycle-consistency to perform frame-wise temporal alignment as a supervision signal. [120] is a follow-up to [221] that utilizes both object-level and pixel-level correspondence across video frames. Recently, long-range temporal correspondence has been modeled as a random walk graph to help learn video representations [87].
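The forward-backward tracking idea of [221] can be sketched with nearest-neighbor matching in feature space. The cycle "loss" below is a non-differentiable index distance, a toy stand-in for the differentiable trackers used in practice; the feature shapes and noise level are illustrative:

```python
import numpy as np

def nn_match(query, candidates):
    """Index of the candidate feature closest to the query."""
    return int(np.argmin(np.linalg.norm(candidates - query, axis=1)))

def cycle_inconsistency(frames_feats, start_idx):
    """Track a patch feature forward through the clip by nearest-neighbor
    matching, then backward to the first frame; the distance between the
    start index and where the backward pass lands plays the role of the
    cycle-consistency loss."""
    idx = start_idx
    for t in range(1, len(frames_feats)):            # forward pass
        idx = nn_match(frames_feats[t - 1][idx], frames_feats[t])
    for t in range(len(frames_feats) - 2, -1, -1):   # backward pass
        idx = nn_match(frames_feats[t + 1][idx], frames_feats[t])
    return abs(idx - start_idx)

# toy clip: 4 frames, each with 5 patch features that barely move
rng = np.random.default_rng(0)
base = rng.normal(size=(5, 16))
clip = [base + 0.01 * rng.normal(size=(5, 16)) for _ in range(4)]
print(cycle_inconsistency(clip, start_idx=2))  # → 0 for a consistent track
```

When the learned features are good, the backward pass returns to the starting patch and the inconsistency is zero; during training, the non-zero inconsistency is what drives the feature space to respect temporal correspondence.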
We compare video self-supervised representation learning methods in the bottom section of Table 6. A clear trend can be observed: recent papers have achieved much better linear evaluation accuracy, and fine-tuning accuracy comparable to supervised pre-training. This shows that self-supervised learning could be a promising direction towards learning better video representations.
6. Conclusion
In this survey, we present a comprehensive review of over 200 recent deep-learning-based approaches to video action recognition. Although this is not an exhaustive list, we hope the survey serves as an easy-to-follow tutorial for those seeking to enter the field, and an inspiring discussion for those seeking new research directions.