

Paper Notes: A Comprehensive Study of Deep Video Action Recognition

Published: 2024/3/7

Paper link: A Comprehensive Study of Deep Video Action Recognition

Contents

  • A Comprehensive Study of Deep Video Action Recognition
  • Abstract
  • 1. Introduction
  • 2. Datasets and Challenges
    • 2.1. Datasets
    • 2.2. Challenges
  • 3. An Odyssey of Using Deep Learning for Video Action Recognition
    • 3.1. From hand-crafted features to CNNs
    • 3.2. Two-stream networks
      • 3.2.1 Using deeper network architectures
      • 3.2.2 Two-stream fusion
      • 3.2.3 Recurrent neural networks
      • 3.2.4 Segment-based methods
      • 3.2.5 Multi-stream networks
    • 3.3. The rise of 3D CNNs
      • 3.3.1 Mapping from 2D to 3D CNNs
      • 3.3.2 Unifying 2D and 3D CNNs
      • 3.3.3 Long-range temporal modeling
      • 3.3.4 Enhancing 3D efficiency
    • 3.4. Efficient Video Modeling
      • 3.4.1 Flow-mimic approaches
      • 3.4.2 Temporal modeling without 3D convolution
    • 3.5. Miscellaneous
      • 3.5.1 Trajectory-based methods
      • 3.5.2 Rank pooling
      • 3.5.3 Compressed video action recognition
      • 3.5.4 Frame/Clip sampling
      • 3.5.5 Visual tempo
  • 4. Evaluation and Benchmarking
    • 4.1. Evaluation scheme
    • 4.2. Scene-focused datasets
    • 4.3. Motion-focused datasets
    • 4.4. Multi-label datasets
    • 4.5. Speed comparison
  • 5. Discussion and Future Work
    • 5.1. Analysis and insights
    • 5.2. Data augmentation
    • 5.3. Video domain adaptation
    • 5.4. Neural architecture search
    • 5.5. Efficient model development
    • 5.6. New datasets
    • 5.7. Video adversarial attack
    • 5.8. Zero-shot action recognition
    • 5.9. Weakly-supervised video action recognition
    • 5.10. Fine-grained video action recognition
    • 5.11. Egocentric action recognition
    • 5.12. Multi-modality
    • 5.13. Self-supervised video representation learning
  • 6. Conclusion
  • References

A Comprehensive Study of Deep Video Action Recognition

Yi Zhu, Xinyu Li, Chunhui Liu, Mohammadreza Zolfaghari, Yuanjun Xiong,
Chongruo Wu, Zhi Zhang, Joseph Tighe, R. Manmatha, Mu Li
Amazon Web Services
{yzaws,xxnl,chunhliu,mozolf,yuanjx,chongrwu,zhiz,tighej,manmatha,mli}@amazon.com

Abstract

Video action recognition is one of the representative tasks for video understanding. Over the last decade, we have witnessed great advancements in video action recognition thanks to the emergence of deep learning. But we also encountered new challenges, including modeling long-range temporal information in videos, high computation costs, and incomparable results due to datasets and evaluation protocol variances. In this paper, we provide a comprehensive survey of over 200 existing papers on deep learning for video action recognition. We first introduce the 17 video action recognition datasets that influenced the design of models. Then we present video action recognition models in chronological order: starting with early attempts at adapting deep learning, then to the two-stream networks, followed by the adoption of 3D convolutional kernels, and finally to the recent compute-efficient models. In addition, we benchmark popular methods on several representative datasets and release code for reproducibility. In the end, we discuss open problems and shed light on opportunities for video action recognition to facilitate new research ideas.


1. Introduction

One of the most important tasks in video understanding is to understand human actions. It has many real-world applications, including behavior analysis, video retrieval, human-robot interaction, gaming, and entertainment. Human action understanding involves recognizing, localizing, and predicting human behaviors. The task to recognize human actions in a video is called video action recognition. In Figure 1, we visualize several video frames with the associated action labels, which are typical human daily activities such as shaking hands and riding a bike.


Over the last decade, there has been growing research interest in video action recognition with the emergence of high-quality large-scale action recognition datasets. We summarize the statistics of popular action recognition datasets in Figure 2. We see that both the number of videos and the number of classes increase rapidly, e.g., from 7K videos over 51 classes in HMDB51 [109] to 8M videos over 3,862 classes in YouTube8M [1]. Also, the rate at which new datasets are released is increasing: 3 datasets were released from 2011 to 2015 compared to 13 released from 2016 to 2020.



Thanks to both the availability of large-scale datasets and the rapid progress in deep learning, there is also a rapid growth in deep learning based models to recognize video actions. In Figure 3, we present a chronological overview of recent representative work. DeepVideo [99] is one of the earliest attempts to apply convolutional neural networks to videos. We observed three trends here. The first trend, started by the seminal paper on Two-Stream Networks [187], adds a second path that learns the temporal information in a video by training a convolutional neural network on the optical flow stream. Its great success inspired a large number of follow-up papers, such as TDD [214], LRCN [37], Fusion [50], TSN [218], etc. The second trend was the use of 3D convolutional kernels to model video temporal information, such as I3D [14], R3D [74], S3D [239], Non-local [219], SlowFast [45], etc. Finally, the third trend focused on computational efficiency to scale to even larger datasets so that the models could be adopted in real applications. Examples include Hidden TSN [278], TSM [128], X3D [44], TVN [161], etc.


Despite the large number of deep learning based models for video action recognition, there is no comprehensive survey dedicated to these models. Previous survey papers either put more efforts into hand-crafted features [77, 173] or focus on broader topics such as video captioning [236], video prediction [104], video action detection [261] and zero-shot video action recognition [96]. In this paper:


We comprehensively review over 200 papers on deep learning for video action recognition. We walk the readers through the recent advancements chronologically and systematically, with popular papers explained in detail.


We benchmark widely adopted methods on the same set of datasets in terms of both accuracy and efficiency. We also release our implementations for full reproducibility.


We elaborate on challenges, open problems, and opportunities in this field to facilitate future research.


The rest of the survey is organized as follows. We first describe popular datasets used for benchmarking and existing challenges in section 2. Then we present recent advancements using deep learning for video action recognition in section 3, which is the major contribution of this survey. In section 4, we evaluate widely adopted approaches on standard benchmark datasets, and provide discussions and future research opportunities in section 5.


2. Datasets and Challenges

2.1. Datasets

Deep learning methods usually improve in accuracy when the volume of the training data grows. In the case of video action recognition, this means we need large-scale annotated datasets to learn effective models.


For the task of video action recognition, datasets are often built by the following process: (1) Define an action list, by combining labels from previous action recognition datasets and adding new categories depending on the use case. (2) Obtain videos from various sources, such as YouTube and movies, by matching the video title/subtitle to the action list. (3) Provide temporal annotations manually to indicate the start and end position of the action, and (4) finally clean up the dataset by de-duplication and filtering out noisy classes/samples. Below we review the most popular large-scale video action recognition datasets in Table 1 and Figure 2.


HMDB51 [109] was introduced in 2011. It was collected mainly from movies, and a small proportion from public databases such as the Prelinger archive, YouTube and Google videos. The dataset contains 6,849 clips divided into 51 action categories, each containing a minimum of 101 clips. The dataset has three official splits. Most previous papers either report the top-1 classification accuracy on split 1 or the average accuracy over three splits.


UCF101 [190] was introduced in 2012 and is an extension of the previous UCF50 dataset. It contains 13,320 videos from YouTube spanning 101 categories of human actions. The dataset has three official splits similar to HMDB51, and is also evaluated in the same manner.


Sports1M [99] was introduced in 2014 as the first large-scale video action dataset; it consists of more than 1 million YouTube videos annotated with 487 sports classes. The categories are fine-grained, which leads to low inter-class variation. It has an official 10-fold cross-validation split for evaluation.


ActivityNet [40] was originally introduced in 2015 and the ActivityNet family has had several versions since its initial launch. The most recent ActivityNet 200 (V1.3) contains 200 human daily living actions. It has 10,024 training, 4,926 validation, and 5,044 testing videos. On average there are 137 untrimmed videos per class and 1.41 activity instances per video.


YouTube8M [1] was introduced in 2016 and is by far the largest-scale video dataset: it contains 8 million YouTube videos (500K hours of video in total) annotated with 3,862 action classes. Each video is annotated with one or multiple labels by a YouTube video annotation system. The dataset is split into training, validation and test sets in the ratio 70:20:10. The validation set is also extended with human-verified segment annotations to provide temporal localization information.

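As a concrete illustration of such a ratio-based split (a generic sketch only, not the official YouTube8M partition, whose assignment is fixed by the dataset release), a 70:20:10 split over a list of video ids might look like:

```python
import random

def split_dataset(video_ids, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle and split a list of video ids into train/val/test subsets."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)     # fixed seed for a reproducible split
    n = len(ids)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]         # remainder goes to the test set
    return train, val, test

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))   # 700 200 100
```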

Charades [186] was introduced in 2016 as a dataset for real-life concurrent action understanding. It contains 9,848 videos with an average length of 30 seconds. The dataset includes 157 multi-label daily indoor activities, performed by 267 different people. It has an official train-validation split with 7,985 videos for training and the remaining 1,863 for validation.


The Kinetics family is now the most widely adopted benchmark. Kinetics400 [100] was introduced in 2017 and consists of approximately 240K training and 20K validation videos, trimmed to 10 seconds each, covering 400 human action categories. The Kinetics family continues to expand, with Kinetics-600 [12] released in 2018 with 480K videos and Kinetics-700 [13] in 2019 with 650K videos.


20BN-Something-Something [69] V1 was introduced in 2017 and V2 in 2018. This family is another popular benchmark that consists of 174 action classes describing humans performing basic actions with everyday objects. There are 108,499 videos in V1 and 220,847 videos in V2. Note that the Something-Something datasets require strong temporal modeling because most activities cannot be inferred from spatial features alone (e.g., opening something, covering something with something).


AVA [70] was introduced in 2017 as the first large-scale spatio-temporal action detection dataset. It contains 430 15-minute video clips with 80 atomic action labels (only 60 labels were used for evaluation). The annotations are provided at each key-frame, which leads to 214,622 training, 57,472 validation and 120,322 testing samples. The AVA dataset was recently expanded to AVA-Kinetics with 352,091 training, 89,882 validation and 182,457 testing samples [117].


Moments in Time [142] was introduced in 2018 and is a large-scale dataset designed for event understanding. It contains one million 3-second video clips, annotated with a dictionary of 339 classes. Different from other datasets designed for human action understanding, the Moments in Time dataset involves people, animals, objects and natural phenomena. The dataset was extended to Multi-Moments in Time (M-MiT) [143] in 2019 by increasing the number of videos to 1.02 million, pruning vague classes, and increasing the number of labels per video.


HACS [267] was introduced in 2019 as a new large-scale dataset for recognition and localization of human actions collected from Web videos. It consists of two kinds of manual annotations. HACS Clips contains 1.55M 2-second clip annotations on 504K videos, and HACS Segments has 140K complete action segments (from action start to end) on 50K videos. The videos are annotated with the same 200 human action classes used in ActivityNet (V1.3) [40].


The HVU [34] dataset was released in 2020 for multi-label multi-task video understanding. This dataset has 572K videos and 3,142 labels. The official split has 481K, 31K and 65K videos for train, validation, and test respectively. The dataset covers six task categories: scene, object, action, event, attribute, and concept. On average, there are about 2,112 samples for each label. The duration of the videos varies, with a maximum length of 10 seconds.


AViD [165] was introduced in 2020 as a dataset for anonymized action recognition. It contains 410K videos for training and 40K videos for testing. Each video clip is between 3 and 15 seconds long, and in total there are 887 action classes. During data collection, the authors tried to collect data from various countries to deal with data bias. They also removed face identities to protect the privacy of video makers. Therefore, the AViD dataset might not be a proper choice for recognizing face-related actions.


Figure 4. Visual examples from popular video action datasets.
Top: individual video frames from action classes in UCF101 and Kinetics400. A single frame from these scene-focused datasets often contains enough information to correctly guess the category. Middle: consecutive video frames from classes in Something-Something. The 2nd and 3rd frames are made transparent to indicate the importance of temporal reasoning: we cannot tell these two actions apart by looking at the 1st frame alone. Bottom: individual video frames from classes in Moments in Time. The same action can have different actors in different environments.


Before we dive into the chronological review of methods, we present several visual examples from the above datasets in Figure 4 to show their different characteristics. In the top two rows, we pick action classes from UCF101 [190] and Kinetics400 [100] datasets. Interestingly, we find that these actions can sometimes be determined by the context or scene alone. For example, the model can predict the action riding a bike as long as it recognizes a bike in the video frame. The model may also predict the action cricket bowling if it recognizes the cricket pitch. Hence for these classes, video action recognition may become an object/scene classification problem without the need of reasoning motion/temporal information. In the middle two rows, we pick action classes from Something-Something dataset [69]. This dataset focuses on human-object interaction, thus it is more fine-grained and requires strong temporal modeling. For example, if we only look at the first frame of dropping something and picking something up without looking at other video frames, it is impossible to tell these two actions apart. In the bottom row, we pick action classes from Moments in Time dataset [142]. This dataset is different from most video action recognition datasets, and is designed to have large inter-class and intra-class variation that represent dynamical events at different levels of abstraction. For example, the action climbing can have different actors (person or animal) in different environments (stairs or tree).


2.2. Challenges

There are several major challenges in developing effective video action recognition algorithms.

In terms of datasets, first, defining the label space for training action recognition models is non-trivial, because human actions are usually composite concepts and the hierarchy of these concepts is not well-defined. Second, annotating videos for action recognition is laborious (e.g., one needs to watch all the video frames) and ambiguous (e.g., it is hard to determine the exact start and end of an action). Third, some popular benchmark datasets (e.g., the Kinetics family) only release video links for users to download rather than the actual videos, which means methods end up being evaluated on different data, making fair comparisons between methods impossible.


In terms of modeling, first, videos capturing human actions have both strong intra- and inter-class variations. People can perform the same action at different speeds under various viewpoints. Besides, some actions share similar movement patterns that are hard to distinguish. Second, recognizing human actions requires simultaneous understanding of both short-term action-specific motion information and long-range temporal information. We might need a sophisticated model to handle different perspectives rather than a single convolutional neural network. Third, the computational cost is high for both training and inference, hindering both the development and deployment of action recognition models. In the next section, we will demonstrate how video action recognition methods have developed over the last decade to address the aforementioned challenges.


3. An Odyssey of Using Deep Learning for Video Action Recognition

In this section, we review deep learning based methods for video action recognition from 2014 to the present and introduce the related earlier work in context.


3.1. From hand-crafted features to CNNs

Despite there being some papers using Convolutional Neural Networks (CNNs) for video action recognition [200, 5, 91], hand-crafted features [209, 210, 158, 112], particularly Improved Dense Trajectories (IDT) [210], dominated the video understanding literature before 2015 due to their high accuracy and good robustness. However, hand-crafted features have a heavy computational cost [244], and are hard to scale and deploy.


With the rise of deep learning [107], researchers started to adapt CNNs for video problems. The seminal work DeepVideo [99] proposed to use a single 2D CNN model on each video frame independently and investigated several temporal connectivity patterns to learn spatio-temporal features for video action recognition, such as late fusion, early fusion and slow fusion. Though this model made early progress with ideas that would prove useful later, such as a multi-resolution network, its transfer learning performance on UCF101 [190] was 20% lower than hand-crafted IDT features (65.4% vs 87.9%). Furthermore, DeepVideo [99] found that a network fed individual video frames performs equally well when the input is changed to a stack of frames. This observation might indicate that the learnt spatio-temporal features did not capture the motion well. It also encouraged people to think about why CNN models did not outperform traditional hand-crafted features in the video domain, unlike in other computer vision tasks [107, 171].


3.2. Two-stream networks

Since video understanding intuitively needs motion information, finding an appropriate way to describe the temporal relationship between frames is essential to improving the performance of CNN-based video action recognition.


Optical flow [79] is an effective motion representation to describe object/scene movement. To be precise, it is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene. We show several visualizations of optical flow in Figure 5. As we can see, optical flow is able to describe the motion pattern of each action accurately. The advantage of using optical flow is that it provides information orthogonal to the RGB image. For example, the two images at the bottom of Figure 5 have cluttered backgrounds. Optical flow can effectively remove the non-moving background and results in a simpler learning problem compared to using the original RGB images as input.

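To make the notion of a flow field concrete, the toy sketch below estimates one motion vector per block by exhaustive block matching — a crude stand-in for real flow estimators such as FlowNet2; all names and parameters here are illustrative:

```python
def block_match(prev, curr, block=8, radius=3):
    """Estimate one (dy, dx) motion vector per block by minimising the
    sum of absolute differences (SAD) over a small search window."""
    h, w = len(prev), len(prev[0])
    flow = []
    for by in range(h // block):
        row = []
        for bx in range(w // block):
            y, x = by * block, bx * block
            best_sad, best = float("inf"), (0, 0)
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + block > h or xx + block > w:
                        continue  # candidate block falls outside the frame
                    sad = sum(abs(prev[y + i][x + j] - curr[yy + i][xx + j])
                              for i in range(block) for j in range(block))
                    if sad < best_sad:
                        best_sad, best = sad, (dy, dx)
            row.append(best)
        flow.append(row)
    return flow

# A bright 8x8 square that moves 2 pixels down and 1 pixel right.
prev = [[255 if 8 <= r < 16 and 8 <= c < 16 else 0 for c in range(32)] for r in range(32)]
curr = [[255 if 10 <= r < 18 and 9 <= c < 17 else 0 for c in range(32)] for r in range(32)]
print(block_match(prev, curr)[1][1])   # motion of the block holding the square: (2, 1)
```

Real estimators solve this densely, per pixel, and with sub-pixel accuracy, but the output has the same structure: a displacement vector field, which the color coding in Figure 5 visualizes.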

Figure 5. Visualizations of optical flow. We show four image-flow pairs, left is the original RGB image and right is the optical flow estimated by FlowNet2 [85]. The color of the optical flow indicates the direction of motion, and we follow the color coding scheme of FlowNet2 [85] as shown in the top right.


In addition, optical flow has been shown to work well on video problems. Traditional hand-crafted features such as IDT [210] also contain optical-flow-like features, such as Histogram of Optical Flow (HOF) and Motion Boundary Histogram (MBH).


Hence, Simonyan et al. [187] proposed two-stream networks, which include a spatial stream and a temporal stream as shown in Figure 6. This method is related to the two-streams hypothesis [65], according to which the human visual cortex contains two pathways: the ventral stream (which performs object recognition) and the dorsal stream (which recognizes motion). The spatial stream takes raw video frame(s) as input to capture visual appearance information. The temporal stream takes a stack of optical flow images as input to capture motion information between video frames. To be specific, [187] linearly rescaled the horizontal and vertical components of the estimated flow (i.e., motion in the x-direction and y-direction) to a [0, 255] range and compressed them using JPEG. The output corresponds to the two optical flow images shown in Figure 6. The compressed optical flow images are then concatenated as the input to the temporal stream with a dimension of H×W×2L, where H and W indicate the height and width of the video frames and L is the number of stacked flows. In the end, the final prediction is obtained by averaging the prediction scores from both streams.

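The temporal-stream input construction and the late-fusion averaging described above can be sketched at a shape level (tiny sizes and random numbers stand in for real flow estimates and network scores):

```python
import random

H, W, L = 8, 8, 10   # tiny frame size and number of stacked flow fields

# Each flow field has a horizontal and a vertical component per pixel;
# the temporal-stream input stacks them into an H x W x 2L volume.
flows = [[[[random.random(), random.random()] for _ in range(W)] for _ in range(H)]
         for _ in range(L)]
temporal_input = [[[comp for f in range(L) for comp in flows[f][y][x]]
                   for x in range(W)] for y in range(H)]
assert len(temporal_input[0][0]) == 2 * L   # 2L channels per spatial location

# Late fusion: average the class scores predicted by the two streams.
num_classes = 101                            # e.g. UCF101
spatial_scores = [random.random() for _ in range(num_classes)]
temporal_scores = [random.random() for _ in range(num_classes)]
final_scores = [(s + t) / 2 for s, t in zip(spatial_scores, temporal_scores)]
prediction = max(range(num_classes), key=final_scores.__getitem__)
```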

By adding the extra temporal stream, a CNN-based approach for the first time achieved performance similar to the previous best hand-crafted feature IDT on UCF101 (88.0% vs 87.9%) and on HMDB51 [109] (59.4% vs 61.1%). [187] makes two important observations. First, motion information is important for video action recognition. Second, it is still challenging for CNNs to learn temporal information directly from raw video frames. Pre-computing optical flow as the motion representation is an effective way for deep learning to reveal its power. Since [187] managed to close the gap between deep learning approaches and traditional hand-crafted features, many follow-up papers on two-stream networks emerged and greatly advanced the development of video action recognition. Here, we divide them into several categories and review them individually.


Figure 6. Workflow of five important papers: two-stream networks [187], temporal segment networks [218], I3D [14], Non-local [219] and SlowFast [45]. Best viewed in color.


3.2.1 Using deeper network architectures

Two-stream networks [187] used a relatively shallow network architecture [107]. Thus a natural extension to the two-stream networks involves using deeper networks. However, Wang et al. [215] find that simply using deeper networks does not yield better results, possibly due to overfitting on the small-sized video datasets [190, 109]. Recall from section 2.1 that the UCF101 and HMDB51 datasets only have thousands of training videos. Hence, Wang et al. [217] introduce a series of good practices, including cross-modality initialization, synchronized batch normalization, corner cropping and multi-scale cropping data augmentation, a large dropout ratio, etc. to prevent deeper networks from overfitting. With these good practices, [217] was able to train a two-stream network with the VGG16 model [188] that outperforms [187] by a large margin on UCF101. These good practices have been widely adopted and are still being used. Later, Temporal Segment Networks (TSN) [218] performed a thorough investigation of network architectures, such as VGG16, ResNet [76] and Inception [198], and demonstrated that deeper networks usually achieve higher recognition accuracy for video action recognition. We will describe more details about TSN in section 3.2.4.


3.2.2 Two-stream fusion

Since there are two streams in a two-stream network, there will be a stage that needs to merge the results from both networks to obtain the final prediction. This stage is usually referred to as the spatial-temporal fusion step.


The easiest and most straightforward way is late fusion, which performs a weighted average of predictions from both streams. Despite late fusion being widely adopted [187, 217], many researchers claim that this may not be the optimal way to fuse the information between the spatial appearance stream and temporal motion stream. They believe that earlier interactions between the two networks could benefit both streams during model learning and this is termed as early fusion.

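As a concrete sketch, late fusion is just a weighted average of the per-class scores from the two streams. The 1:1.5 spatial-to-temporal weighting below is a commonly used choice in the two-stream literature, shown here only as an illustrative assumption:

```python
import numpy as np

def late_fusion(spatial_scores, temporal_scores, w_spatial=1.0, w_temporal=1.5):
    """Weighted average of per-class scores from the two streams.

    The weights are illustrative; in practice they are tuned on a
    validation set, with the temporal stream often weighted higher.
    """
    fused = w_spatial * spatial_scores + w_temporal * temporal_scores
    return fused / (w_spatial + w_temporal)

# Toy example with 4 action classes and softmax scores per stream.
spatial = np.array([0.6, 0.2, 0.1, 0.1])    # appearance stream
temporal = np.array([0.2, 0.5, 0.2, 0.1])   # motion stream
fused = late_fusion(spatial, temporal)
pred = int(np.argmax(fused))                 # class 1 wins after fusion
```

Since both inputs are normalized distributions, the fused scores still sum to one.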

Feichtenhofer et al. [50] is one of the first of several papers investigating the early fusion paradigm, including how to perform spatial fusion (e.g., using operators such as sum, max, bilinear, convolution and concatenation), where to fuse the network (e.g., the network layer where early interactions happen), and how to perform temporal fusion (e.g., using 2D or 3D convolutional fusion in later stages of the network). [50] shows that early fusion is beneficial for both streams to learn richer features and leads to improved performance over late fusion. Following this line of research, Feichtenhofer et al. [46] generalize ResNet [76] to the spatiotemporal domain by introducing residual connections between the two streams. Based on [46], Feichtenhofer et al. [47] further propose a multiplicative gating function for residual networks to learn better spatio-temporal features. Concurrently, [225] adopts a spatio-temporal pyramid to perform hierarchical early fusion between the two streams.

3.2.3 Recurrent neural networks

Since a video is essentially a temporal sequence, researchers have explored Recurrent Neural Networks (RNNs) for temporal modeling inside a video, particularly the usage of Long Short-Term Memory (LSTM) [78].

LRCN [37] and Beyond-Short-Snippets [253] are the first of several papers that use LSTM for video action recognition under the two-stream networks setting. They take the feature maps from CNNs as an input to a deep LSTM network, and aggregate frame-level CNN features into video-level predictions. Note that they use LSTM on the two streams separately, and the final results are still obtained by late fusion. However, there is no clear empirical improvement from LSTM models [253] over the two-stream baseline [187]. Following the CNN-LSTM framework, several variants are proposed, such as bi-directional LSTM [205], CNN-LSTM fusion [56] and the hierarchical multi-granularity LSTM network [118]. [125] described VideoLSTM, which includes a correlation-based spatial attention mechanism and a lightweight motion-based attention mechanism. VideoLSTM not only shows improved results on action recognition, but also demonstrates how the learned attention can be used for action localization by relying on just the action class label. Lattice-LSTM [196] extends LSTM by learning independent hidden state transitions of memory cells for individual spatial locations, so that it can accurately model long-term and complex motions. ShuttleNet [183] is a concurrent work that considers both feedforward and feedback connections in a RNN to learn long-term dependencies. FASTER [272] designed a FAST-GRU to aggregate clip-level features from an expensive backbone and a cheap backbone. This strategy reduces the processing cost of redundant clips and hence accelerates the inference speed.

However, the work mentioned above [37, 253, 125, 196, 183] use different two-stream networks/backbones. The differences between various methods using RNNs are thus unclear. Ma et al. [135] build a strong baseline for fair comparison and thoroughly study the effect of learning spatiotemporal features by using RNNs. They find that it requires proper care to achieve improved performance, e.g., LSTMs require pre-segmented data to fully exploit the temporal information. RNNs are also intensively studied in video action localization [189] and video question answering [274], but these are beyond the scope of this survey.

3.2.4 Segment-based methods

Thanks to optical flow, two-stream networks are able to reason about short-term motion information between frames. However, they still cannot capture long-range temporal information. Motivated by this weakness of two-stream networks, Wang et al. [218] proposed a Temporal Segment Network (TSN) to perform video-level action recognition. Though initially proposed for use with 2D CNNs, the framework is simple and generic, and recent work using either 2D or 3D CNNs is still built upon it.

To be specific, as shown in Figure 6, TSN first divides a whole video into several segments, where the segments distribute uniformly along the temporal dimension. Then TSN randomly selects a single video frame within each segment and forwards them through the network. Here, the network shares weights for input frames from all the segments. In the end, a segmental consensus is performed to aggregate information from the sampled video frames. The segmental consensus could be operators like average pooling, max pooling, bilinear encoding, etc. In this sense, TSN is capable of modeling long-range temporal structure because the model sees the content from the entire video. In addition, this sparse sampling strategy lowers the training cost over long video sequences but preserves relevant information.

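The sampling and consensus steps described above can be sketched in a few lines; the function names and uniform-segment arithmetic below are a simplification of TSN for illustration, not the reference implementation:

```python
import numpy as np

def tsn_sample_indices(num_frames, num_segments, rng):
    """Uniformly divide the video into segments along time and pick
    one random frame index inside each segment."""
    seg_len = num_frames / num_segments
    starts = (np.arange(num_segments) * seg_len).astype(int)
    ends = ((np.arange(num_segments) + 1) * seg_len).astype(int)
    return np.array([rng.integers(s, e) for s, e in zip(starts, ends)])

def segmental_consensus(frame_scores):
    """Average-pooling consensus over per-frame class scores; max
    pooling or bilinear encoding are drop-in alternatives."""
    return frame_scores.mean(axis=0)

rng = np.random.default_rng(0)
idx = tsn_sample_indices(num_frames=300, num_segments=3, rng=rng)
# Each sampled index falls inside its own third of the video,
# so the model sees content from the entire sequence.
scores = rng.random((3, 101))            # e.g. 101 classes as in UCF101
video_pred = segmental_consensus(scores)  # one score vector per video
```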

Given TSN’s good performance and simplicity, most two-stream methods afterwards became segment-based two-stream networks. Since the segmental consensus is simply doing a max or average pooling operation, a feature encoding step might generate a global video feature and lead to improved performance as suggested in traditional approaches [179, 97, 157]. Deep Local Video Feature (DVOF) [114] proposed to treat the deep networks that trained on local inputs as feature extractors and train another encoding function to map the global features into global labels. The Temporal Linear Encoding (TLE) network [36] appeared concurrently with DVOF, but the encoding layer was embedded in the network so that the whole pipeline could be trained end-to-end. VLAD3 and ActionVLAD [123, 63] also appeared concurrently. They extended the NetVLAD layer [4] to the video domain to perform video-level encoding, instead of using compact bilinear encoding as in [36]. To improve the temporal reasoning ability of TSN, the Temporal Relation Network (TRN) [269] was proposed to learn and reason about temporal dependencies between video frames at multiple time scales. The recent state-of-the-art efficient model TSM [128] is also segment-based. We will discuss it in more detail in section 3.4.2.

3.2.5 Multi-stream networks

Two-stream networks are successful because appearance and motion information are two of the most important properties of a video. However, there are other factors that can help video action recognition as well, such as pose, object, audio and depth, etc.

Pose information is closely related to human action. We can recognize most actions by just looking at a pose (skeleton) image without scene context. Although there is previous work on using pose for action recognition [150, 246], P-CNN [23] was one of the first deep learning methods that successfully used pose to improve video action recognition. P-CNN proposed to aggregate motion and appearance information along tracks of human body parts, in a similar spirit to trajectory pooling [214]. [282] extended this pipeline to a chained multi-stream framework, which computed and integrated appearance, motion and pose. They introduced a Markov chain model that added these cues successively and obtained promising results on both action recognition and action localization. PoTion [25] was a follow-up work to P-CNN, but introduced a more powerful feature representation that encoded the movement of human semantic keypoints. They first ran a decent human pose estimator and extracted heatmaps for the human joints in each frame. They then obtained the PoTion representation by temporally aggregating these probability maps. PoTion is lightweight and outperforms previous pose representations [23, 282]. In addition, it was shown to be complementary to standard appearance and motion streams, e.g. combining PoTion with I3D [14] achieved a state-of-the-art result on UCF101 (98.2%).

Object information is another important cue because most human actions involve human-object interaction. Wu [232] proposed to leverage both object features and scene features to help video action recognition. The object and scene features were extracted from state-of-the-art pretrained object and scene detectors. Wang et al. [252] took a step further to make the network end-to-end trainable. They introduced a two-stream semantic region based method, by replacing a standard spatial stream with a Faster RCNN network [171], to extract semantic information about the object, person and scene.

Audio signals usually come with video, and are complementary to the visual information. Wu et al. [233] introduced a multi-stream framework that integrates spatial, short-term motion, long-term temporal and audio in videos to digest complementary clues. Recently, Xiao et al. [237] introduced AudioSlowFast following [45], by adding another audio pathway to model vision and sound in an unified representation.

In the RGB-D video action recognition field, using depth information is standard practice [59]. However, for vision-based video action recognition (e.g., given only monocular videos), we do not have access to ground-truth depth information as in the RGB-D domain. An early attempt, Depth2Action [280], uses off-the-shelf depth estimators to extract depth information from videos and use it for action recognition.

Essentially, multi-stream networks are a form of multi-modality learning, using different cues as input signals to help video action recognition. We will discuss multi-modality learning further in section 5.12.

3.3. The rise of 3D CNNs

Pre-computing optical flow is computationally intensive and storage demanding, which is not friendly for large-scale training or real-time deployment. A conceptually easy way to understand a video is as a 3D tensor with two spatial and one time dimension. Hence, this leads to the usage of 3D CNNs as a processing unit to model the temporal information in a video.

The seminal work for using 3D CNNs for action recognition is [91]. While inspiring, the network was not deep enough to show its potential. Tran et al. [202] extended [91] to a deeper 3D network, termed C3D. C3D follows the modular design of [188], which could be thought of as a 3D version of VGG16 network. Its performance on standard benchmarks is not satisfactory, but shows strong generalization capability and can be used as a generic feature extractor for various video tasks [250].

However, 3D networks are hard to optimize. In order to train a 3D convolutional filter well, people need a large-scale dataset with diverse video content and action categories. Fortunately, there exists such a dataset, Sports1M [99], which is large enough to support the training of a deep 3D network. However, the training of C3D takes weeks to converge. Despite the popularity of C3D, most users just adopt it as a feature extractor for different use cases instead of modifying/fine-tuning the network. This is partially the reason why two-stream networks based on 2D CNNs dominated the video action recognition domain from 2014 to 2017.

The situation changed when Carreira et al. [14] proposed I3D in 2017. As shown in Figure 6, I3D takes a video clip as input, and forwards it through stacked 3D convolutional layers. A video clip is a sequence of video frames; usually 16 or 32 frames are used.
The major contributions of I3D are:
1) it adapts mature image classification architectures for 3D CNNs;
2) for model weights, it adopts a method developed for initializing optical flow networks in [217] to inflate the ImageNet pre-trained 2D model weights to their counterparts in the 3D model.
Hence, I3D bypasses the dilemma that 3D CNNs have to be trained from scratch. With pre-training on the new large-scale dataset Kinetics400 [100], I3D achieved 95.6% on UCF101 and 74.8% on HMDB51. I3D ended the era where different methods reported numbers on small-sized datasets such as UCF101 and HMDB51. Publications following I3D needed to report their performance on Kinetics400, or other large-scale benchmark datasets, which pushed video action recognition to the next level. In the next few years, 3D CNNs advanced quickly and became top performers on almost every benchmark dataset. We will review the 3D CNN based literature in several categories below.


We want to point out that 3D CNNs are not replacing two-stream networks, and they are not mutually exclusive. They just use different ways to model the temporal relationship in a video. Furthermore, the two-stream approach is a generic framework for video understanding, instead of a specific method. As long as there are two networks, one for spatial appearance modeling using RGB frames, the other for temporal motion modeling using optical flow, the method may be categorized into the family of two-stream networks. In [14], they also build a temporal stream with the I3D architecture and achieved even higher performance, 98.0% on UCF101 and 80.9% on HMDB51. Hence, the final I3D model is a combination of 3D CNNs and two-stream networks. However, the contribution of I3D does not lie in the usage of optical flow.

3.3.1 Mapping from 2D to 3D CNNs

2D CNNs enjoy the benefit of pre-training brought by the large-scale of image datasets such as ImageNet [30] and Places205 [270], which cannot be matched even with the largest video datasets available today. On these datasets numerous efforts have been devoted to the search for 2D CNN architectures that are more accurate and generalize better. Below we describe the efforts to capitalize on these advances for 3D CNNs.

ResNet3D [74] directly took 2D ResNet [76] and replaced all the 2D convolutional filters with 3D kernels. They believed that by using deep 3D CNNs together with large-scale datasets one can exploit the success of 2D CNNs on ImageNet. Motivated by ResNeXt [238], Chen et al. [20] presented a multi-fiber architecture that slices a complex neural network into an ensemble of lightweight networks (fibers) that facilitate information flow between fibers, reducing the computational cost at the same time. Inspired by SENet [81], STCNet [33] proposes to integrate channel-wise information inside a 3D block to capture both spatial-channel and temporal-channel correlation information throughout the network.

3.3.2 Unifying 2D and 3D CNNs

To reduce the complexity of 3D network training, P3D [169] and R2+1D [204] explore the idea of 3D factorization. To be specific, a 3D kernel (e.g., 3×3×3) can be factorized to two separate operations, a 2D spatial convolution (e.g., 1 × 3 × 3) and a 1D temporal convolution (e.g., 3 × 1 × 1). The differences between P3D and R2+1D are how they arrange the two factorized operations and how they formulate each residual block. Trajectory convolution [268] follows this idea but uses deformable convolution for the temporal component to better cope with motion.

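To see why this factorization is attractive, one can count parameters. R(2+1)D chooses the width of the intermediate channels so that the factorized block matches the parameter count of the full 3D kernel, doubling the number of nonlinearities at roughly equal cost. The arithmetic below follows that design principle, with illustrative channel sizes:

```python
def full_3d_params(c_in, c_out, t, k):
    """Parameters of a full t x k x k 3D convolution."""
    return c_in * c_out * t * k * k

def factorized_params(c_in, c_out, t, k, c_mid):
    """Parameters of a (1 x k x k) spatial conv followed by a
    (t x 1 x 1) temporal conv, with c_mid intermediate channels."""
    return c_in * c_mid * k * k + c_mid * c_out * t

# Typical block sizes (illustrative): 64 -> 64 channels, 3x3x3 kernel.
c_in, c_out, t, k = 64, 64, 3, 3
full = full_3d_params(c_in, c_out, t, k)
# R(2+1)D's choice of intermediate width to match parameter counts:
c_mid = (t * k * k * c_in * c_out) // (k * k * c_in + t * c_out)
fact = factorized_params(c_in, c_out, t, k, c_mid)
# Here c_mid = 144, and both variants use 110,592 parameters.
```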

Another way of simplifying 3D CNNs is to mix 2D and 3D convolutions in a single network. MiCTNet [271] integrates 2D and 3D CNNs to generate deeper and more informative feature maps, while reducing training complexity in each round of spatio-temporal fusion. ARTNet [213] introduces an appearance-and-relation network by using a new building block. The building block consists of a spatial branch using 2D CNNs and a relation branch using 3D CNNs. S3D [239] combines the merits of the approaches mentioned above. It first replaces the 3D convolutions at the bottom of the network with 2D kernels, and finds that this kind of top-heavy network has higher recognition accuracy. Then S3D factorizes the remaining 3D kernels as P3D and R2+1D do, to further reduce the model size and training complexity. A concurrent work named ECO [283] also adopts such a top-heavy network to achieve online video understanding.

3.3.3 Long-range temporal modeling

In 3D CNNs, long-range temporal connection may be achieved by stacking multiple short temporal convolutions, e.g., 3 × 3 × 3 filters. However, useful temporal information may be lost in the later stages of a deep network, especially for frames far apart.

In order to perform long-range temporal modeling, LTC [206] introduces and evaluates long-term temporal convolutions over a large number of video frames. However, limited by GPU memory, they have to sacrifice input resolution to use more frames. After that, T3D [32] adopted a densely connected structure [83] to keep the original temporal information as complete as possible for the final prediction. Later, Wang et al. [219] introduced a new building block, termed non-local. Non-local is a generic operation similar to self-attention [207], which can be used for many computer vision tasks in a plug-and-play manner. As shown in Figure 6, they used a spacetime non-local module after later residual blocks to capture the long-range dependence in both the spatial and temporal domain, and achieved improved performance over baselines without bells and whistles. Wu et al. [229] proposed a feature bank representation, which embeds information of the entire video into a memory cell, to make context-aware predictions. Recently, V4D [264] proposed video-level 4D CNNs, to model the evolution of long-range spatio-temporal representations with 4D convolutions.

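The non-local operation itself is compact: an embedded-Gaussian attention over all spacetime positions, followed by a residual connection. The sketch below flattens the T x H x W positions into a single axis and uses random toy weights; all sizes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_block(x, w_theta, w_phi, w_g, w_out):
    """Embedded-Gaussian non-local operation on a flattened spacetime
    feature map x of shape (N, C), where N = T*H*W.  Every position
    attends to every other position, so information is mixed across
    the whole clip in a single step, then added back residually."""
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g
    attn = softmax(theta @ phi.T)   # (N, N) pairwise affinities
    y = attn @ g                    # aggregate over all positions
    return x + y @ w_out            # residual connection

rng = np.random.default_rng(0)
n, c, c_half = 8, 16, 8             # toy sizes; real blocks use a C/2 bottleneck
x = rng.standard_normal((n, c))
w_theta, w_phi, w_g = (rng.standard_normal((c, c_half)) * 0.1 for _ in range(3))
w_out = rng.standard_normal((c_half, c)) * 0.1
out = nonlocal_block(x, w_theta, w_phi, w_g, w_out)
```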

3.3.4 Enhancing 3D efficiency

In order to further improve the efficiency of 3D CNNs (i.e., in terms of GFLOPs, model parameters and latency), many variants of 3D CNNs begin to emerge.

Motivated by the development in efficient 2D networks, researchers started to adopt channel-wise separable convolution and extend it for video classification [111, 203]. CSN [203] reveals that it is a good practice to factorize 3D convolutions by separating channel interactions and spatiotemporal interactions, and is able to obtain state-of-the-art performance while being 2 to 3 times faster than the previous best approaches. These methods are also related to multi-fiber networks [20] as they are all inspired by group convolution.

Recently, Feichtenhofer et al. [45] proposed SlowFast, an efficient network with a slow pathway and a fast pathway. The network design is partially inspired by the biological Parvo- and Magnocellular cells in the primate visual system. As shown in Figure 6, the slow pathway operates at a low frame rate to capture detailed semantic information, while the fast pathway operates at a high temporal resolution to capture rapidly changing motion. In order to incorporate motion information, as in two-stream networks, SlowFast adopts lateral connections to fuse the representation learned by each pathway. Since the fast pathway can be made very lightweight by reducing its channel capacity, the overall efficiency of SlowFast is largely improved. Although SlowFast has two pathways, it is different from two-stream networks [187], because the two pathways are designed to model different temporal speeds, not spatial and temporal modeling. There are several concurrent papers using multiple pathways to balance accuracy and efficiency [43].

Following this line, Feichtenhofer [44] introduced X3D, which progressively expands a 2D image classification architecture along multiple network axes, such as temporal duration, frame rate, spatial resolution, width, bottleneck width, and depth. X3D pushes 3D model modification/factorization to an extreme, and is a family of efficient video networks to meet different requirements of target complexity. In a similar spirit, A3D [276] also leverages multiple network configurations. However, A3D trains these configurations jointly and during inference deploys only one model. This makes the final model more efficient. In the next section, we will continue to talk about efficient video modeling, but not based on 3D convolutions.

3.4. Efficient Video Modeling

With the increase of dataset size and the need for deployment, efficiency becomes an important concern.

If we use methods based on two-stream networks, we need to pre-compute optical flow and store it on local disk. Taking the Kinetics400 dataset as an illustrative example, storing all the optical flow images requires 4.5TB of disk space. Such a huge amount of data makes I/O the tightest bottleneck during training, leading to wasted GPU resources and longer experiment cycles. In addition, pre-computing optical flow is not cheap, which means none of the two-stream network methods are real-time.

If we use methods based on 3D CNNs, people still find that 3D CNNs are hard to train and challenging to deploy. In terms of training, a standard SlowFast network trained on Kinetics400 dataset using a high-end 8-GPU machine takes 10 days to complete. Such a long experimental cycle and huge computing cost makes video understanding research only accessible to big companies/labs with abundant computing resources. There are several recent attempts to speed up the training of deep video models [230], but these are still expensive compared to most image-based computer vision tasks. In terms of deployment, 3D convolution is not as well supported as 2D convolution for different platforms. Furthermore, 3D CNNs require more video frames as input which adds additional IO cost.

Hence, starting from year 2018, researchers started to investigate other alternatives to see how they could improve accuracy and efficiency at the same time for video action recognition. We will review recent efficient video modeling methods in several categories below.

3.4.1 Flow-mimic approaches

One of the major drawbacks of two-stream networks is the need for optical flow. Pre-computing optical flow is computationally expensive, storage demanding, and not end-to-end trainable for video action recognition. It is appealing if we can find a way to encode motion information without using optical flow, at least during inference time.

[146] and [35] are early attempts at learning to estimate optical flow inside a network for video action recognition. Although these two approaches do not need optical flow during inference, they require optical flow during training in order to train the flow estimation network. Hidden two-stream networks [278] proposed MotionNet to replace the traditional optical flow computation. MotionNet is a lightweight network that learns motion information in an unsupervised manner, and when concatenated with the temporal stream, is end-to-end trainable. Thus, hidden two-stream CNNs [278] only take raw video frames as input and directly predict action classes without explicitly computing optical flow, in both the training and inference stages. PAN [257] mimics the optical flow features by computing the difference between consecutive feature maps. Following this direction, [197, 42, 116, 164] continue to investigate end-to-end trainable CNNs to learn optical-flow-like features from data. They derive such features directly from the definition of optical flow [255]. MARS [26] and D3D [191] used knowledge distillation to combine two-stream networks into a single stream, e.g., by tuning the spatial stream to predict the outputs of the temporal stream. Recently, Kwon et al. [110] introduced the MotionSqueeze module to estimate motion features. The proposed module is end-to-end trainable and can be plugged into any network, similar to [278].

3.4.2 Temporal modeling without 3D convolution

A simple and natural choice to model temporal relationship between frames is using 3D convolution. However, there are many alternatives to achieve this goal. Here, we will review some recent work that performs temporal modeling without 3D convolution.

Lin et al. [128] introduce a new method, termed the temporal shift module (TSM). TSM extends the shift operation [228] to video understanding. It shifts part of the channels along the temporal dimension, thus facilitating information exchange among neighboring frames. In order to keep the spatial feature learning capacity, they put the temporal shift module inside the residual branch of a residual block, so that all the information in the original activation is still accessible after the temporal shift through the identity mapping. The biggest advantage of TSM is that it can be inserted into a 2D CNN to achieve temporal modeling at zero extra computation and zero extra parameters. Similar to TSM, TIN [182] introduces a temporal interlacing module to model the temporal convolution.

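The shift operation is simple enough to write out directly. The sketch below operates on a (T, C, H, W) clip with the 1/8 shift proportion reported in the TSM paper; the exact indexing and zero-padding convention here is an assumption for illustration:

```python
import numpy as np

def temporal_shift(x, shift_div=8):
    """Shift a fraction of channels along time: the first 1/shift_div of
    channels move one step toward earlier frames, the next 1/shift_div
    move one step toward later frames, and the rest stay put.  This is
    a zero-parameter, zero-FLOP way to exchange information between
    neighboring frames, in the spirit of TSM."""
    t, c, h, w = x.shape
    fold = c // shift_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                   # shift backward in time
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]   # shift forward in time
    out[:, 2 * fold:] = x[:, 2 * fold:]              # untouched channels
    return out

# Toy clip: 4 frames, 8 channels, 1x1 spatial map.
x = np.arange(4 * 8 * 1 * 1, dtype=float).reshape(4, 8, 1, 1)
y = temporal_shift(x, shift_div=8)
```

After the shift, frame 0's first channel holds frame 1's values, frame 1's second channel holds frame 0's values, and the boundary positions are zero-padded.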

There are several recent 2D CNN approaches using attention to perform long-term temporal modeling [92, 122, 132, 133]. STM [92] proposes a channel-wise spatio-temporal module to present the spatio-temporal features and a channel-wise motion module to efficiently encode motion features. TEA [122] is similar to STM, but inspired by SENet [81], TEA uses motion features to recalibrate the spatio-temporal features to enhance the motion pattern. Specifically, TEA has two components: motion excitation and multiple temporal aggregation; the first handles short-range motion modeling, while the second efficiently enlarges the temporal receptive field for long-range temporal modeling. They are complementary and both lightweight, so TEA is able to achieve competitive results with previous best approaches while keeping FLOPs as low as many 2D CNNs. Recently, TEINet [132] also adopts attention to enhance temporal modeling. Note that the above attention-based methods are different from non-local [219], because they use channel attention while non-local uses spatial attention.

最近有幾種2D CNN方法利用注意力進行長期時間建模[92,122,132,133]。STM[92]提出了通道級的時空模塊來表示時空特征,以及通道級的運動模塊來有效地編碼運動特征。TEA[122]類似于STM,但受到SENet[81]的啟發,TEA使用運動特征來重新校準時空特征,以增強運動模式。具體而言,TEA包含兩個部分:運動激勵和多重時間聚合,其中前者負責短程運動建模,後者有效地擴大時間感受野以進行長程時間建模。它們互為補充,而且都很輕量,因此TEA能夠取得與以前最好的方法相比具有競爭力的結果,同時將FLOPs保持在與許多2D CNN相當的低水平。最近,TEINet[132]也采用了注意力來增強時間建模。注意,上述基于注意力的方法與非局部方法[219]不同,因為它們使用通道注意力,而非局部方法使用空間注意力。

3.5. Miscellaneous

In this section, we are going to show several other directions that are popular for video action recognition in the last decade.

在本節中,我們將展示在過去十年中流行的視頻動作識別的其他幾個方向。

3.5.1 Trajectory-based methods

While CNN-based approaches have demonstrated their superiority and gradually replaced the traditional hand-crafted methods, the traditional local feature pipeline still has its merits which should not be ignored, such as the usage of trajectory.

雖然基于cnn的方法已經證明了其優越性,并逐漸取代了傳統的手工方法,但傳統的局部特征管道仍有不可忽視的優點,如軌跡的使用。

Inspired by the good performance of trajectory-based methods [210], Wang et al. [214] proposed to conduct trajectory-constrained pooling to aggregate deep convolutional features into effective descriptors, which they term as TDD. Here, a trajectory is defined as a path tracking down pixels in the temporal dimension. This new video representation shares the merits of both hand-crafted features and deep-learned features, and became one of the top performers on both UCF101 and HMDB51 datasets in the year 2015. Concurrently, Lan et al. [113] incorporated both Independent Subspace Analysis (ISA) and dense trajectories into the standard two-stream networks, and show the complementarity between data-independent and data-driven approaches. Instead of treating CNNs as a fixed feature extractor, Zhao et al. [268] proposed trajectory convolution to learn features along the temporal dimension with the help of trajectories.

受基于軌跡的方法的良好性能[210]的啟發,Wang等人[214]提出進行軌跡約束池化,將深度卷積特征聚合為有效的描述符,他們將其稱為TDD。在這里,軌跡被定義為在時間維度上跟蹤像素的路徑。這種新的視頻表示兼具手工特征和深度學習特征的優點,并在2015年成為UCF101和HMDB51數據集上表現最好的方法之一。同時,Lan等人[113]將獨立子空間分析(ISA)和密集軌跡納入標準的雙流網絡,并展示了數據獨立方法和數據驅動方法之間的互補性。Zhao等人[268]沒有將CNN作為固定的特征提取器,而是提出了軌跡卷積,在軌跡的幫助下沿時間維度學習特征。
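To make the idea concrete, here is a toy sketch of trajectory-constrained pooling; the nested-list feature maps, integer trajectory coordinates, and plain averaging are illustrative assumptions, not the exact TDD formulation:

```python
def trajectory_pool(feature_maps, trajectory):
    """Average convolutional features sampled along a tracked trajectory,
    yielding one descriptor per trajectory (in the spirit of TDD [214]).

    `feature_maps` is a list of T maps, each indexed as [y][x] -> list of
    C channel values; `trajectory` is a list of (t, y, x) points obtained
    from a tracker.
    """
    C = len(feature_maps[0][0][0])
    desc = [0.0] * C
    for (t, y, x) in trajectory:
        for c in range(C):
            desc[c] += feature_maps[t][y][x][c]
    n = len(trajectory)
    return [v / n for v in desc]
```

The pooling follows the motion path instead of a fixed grid cell, which is what makes the resulting descriptor motion-aware.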

3.5.2 Rank pooling

There is another way to model temporal information inside a video, termed rank pooling (a.k.a. learning-to-rank). The seminal work in this line starts from VideoDarwin [53], which uses a ranking machine to learn the evolution of the appearance over time and returns a ranking function. The ranking function should be able to order the frames of a video temporally, so the parameters of this ranking function are used as a new video representation. VideoDarwin [53] is not a deep learning based method, but achieves comparable performance and efficiency.

還有另一種方法可以對視頻中的時間信息建模,稱為排名池(也稱為學習排名)。這方面的開創性工作始于VideoDarwin[53],它使用一個排名機器來了解外觀隨時間的演變,并返回一個排名函數。該排序函數應該能夠對視頻的幀進行時間排序,因此他們使用該排序函數的參數作為一種新的視頻表示。VideoDarwin[53]不是一種基于深度學習的方法,但達到了相當的性能和效率。

To adapt rank pooling to deep learning, Fernando [54] introduces a differentiable rank pooling layer to achieve end-to-end feature learning. Following this direction, Bilen et al. [9] apply rank pooling on the raw image pixels of a video producing a single RGB image per video, termed dynamic images. Another concurrent work by Fernando [51] extends rank pooling to hierarchical rank pooling by stacking multiple levels of temporal encoding. Finally, [22] propose a generalization of the original ranking formulation [53] using subspace representations and show that it leads to significantly better representation of the dynamic evolution of actions, while being computationally cheap.

為了使秩池化適應深度學習,Fernando[54]引入了可微分的秩池化層來實現端到端特征學習。按照這個方向,Bilen等人[9]對視頻的原始圖像像素應用秩池化,每個視頻產生一幅RGB圖像,稱為動態圖像。Fernando[51]同時進行的另一項工作通過堆疊多層時間編碼將秩池化擴展為分層秩池化。最后,[22]提出了使用子空間表示對原始排序公式[53]的一般化,并表明它可以顯著更好地表示動作的動態演化,同時計算成本低。
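As a rough illustration, one cheap proxy for such a ranking function is the per-dimension least-squares slope of the feature trajectory against the frame index; this is a simplification of the actual learning-to-rank objective in VideoDarwin, used here only to show the idea of turning temporal evolution into a vector:

```python
def rank_pool(frames):
    """Crude rank-pooling proxy: least-squares slope of each feature
    dimension against the frame index.

    `frames` is a list of T feature vectors. The returned vector w makes
    w . v_t roughly increasing in t, i.e. it encodes how appearance
    evolves over time and can serve as a video-level descriptor.
    """
    T = len(frames)
    D = len(frames[0])
    t_mean = (T - 1) / 2.0
    denom = sum((t - t_mean) ** 2 for t in range(T))
    w = []
    for d in range(D):
        x_mean = sum(f[d] for f in frames) / T
        num = sum((t - t_mean) * (frames[t][d] - x_mean) for t in range(T))
        w.append(num / denom)
    return w
```

A dimension whose value grows over the video gets a positive weight, one that decays gets a negative weight, so the vector summarizes the direction of temporal change.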

3.5.3 Compressed video action recognition

Most video action recognition approaches use raw videos (or decoded video frames) as input. However, there are several drawbacks of using raw videos, such as the huge amount of data and high temporal redundancy. Video compression methods usually store one frame by reusing contents from another frame (i.e., I-frame) and only store the difference (i.e., P-frames and B-frames) due to the fact that adjacent frames are similar. Here, the I-frame is the original RGB video frame, and P-frames and B-frames include the motion vector and residual, which are used to store the difference. Motivated by the developments in the video compression domain, researchers started to adopt compressed video representations as input to train effective video models.

大多數視頻動作識別方法使用原始視頻(或解碼后的視頻幀)作為輸入。然而,使用原始視頻有一些缺點,如數據量大、時間冗余高。由于相鄰幀相似,視頻壓縮方法通常通過重用另一幀(即I幀)的內容來存儲一幀,而只存儲差異(即P幀和B幀)。其中I幀為原始RGB視頻幀,P幀和B幀包括用于存儲差異的運動矢量和殘差。受視頻壓縮領域發展的推動,研究人員開始采用壓縮視頻表示作為輸入來訓練有效的視頻模型。

Since the motion vector has coarse structure and may contain inaccurate movements, Zhang et al. [256] adopted knowledge distillation to help the motion-vector-based temporal stream mimic the optical-flow-based temporal stream.

由于運動矢量結構粗糙,可能包含不準確的運動,Zhang等[256]采用知識蒸餾的方法幫助基于運動矢量的時間流模擬基于光流的時間流。

Table 2. Results of widely adopted methods on three scene-focused datasets. Pre-train indicates which dataset the model is pre-trained on. I: ImageNet, S: Sports1M and K: Kinetics400. NL represents non local.

表2。廣泛采用的方法在三個場景聚焦數據集上的結果。預訓練指示模型在哪個數據集上進行預訓練。I: ImageNet, S: Sports1M和K: Kinetics400。NL表示非本地的。

However, their approach required extracting and processing each frame. They obtained comparable recognition accuracy with standard two-stream networks, but were 27 times faster. Wu et al. [231] used a heavyweight CNN for the I frame and lightweight CNN’s for the P frames. This required that the motion vectors and residuals for each P frame be referred back to the I frame by accumulation. DMC-Net [185] is a follow-up work to [231] using adversarial loss. It adopts a lightweight generator network to help the motion vector capturing fine motion details, instead of knowledge distillation as in [256]. A recent paper SCSampler [106], also adopts compressed video representation for sampling salient clips and we will discuss it in the next section 3.5.4. As yet none of the compressed approaches can deal with B-frames due to the added complexity.

然而,他們的方法需要提取和處理每一幀。它們獲得了與標準雙流網絡相當的識別精度,但速度快了27倍。Wu等人[231]對I幀使用重量級CNN,對P幀使用輕量級CNN。這要求通過累加將每個P幀的運動矢量和殘差回溯到I幀。DMC-Net[185]是[231]使用對抗損失的后續工作。它采用了一個輕量級的生成器網絡來幫助運動矢量捕捉精細的運動細節,而不是像[256]那樣進行知識蒸餾。最近的一篇論文SCSampler[106]也采用了壓縮視頻表示來采樣顯著片段,我們將在下一節3.5.4中討論它。由于B幀帶來的額外復雜性,目前還沒有一種壓縮方法能夠處理B幀。
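The accumulation trick can be sketched as follows; the block-grid representation, integer displacements, and boundary clamping are illustrative simplifications of real codec motion fields:

```python
def accumulate_motion(mvs, h, w):
    """Accumulate block-level motion vectors so that each P-frame's
    displacement refers back to the I-frame rather than to the previous
    frame (a heavily simplified version of the accumulation in [231]).

    `mvs` is a list over P-frames; each element is an h x w grid of
    (dx, dy) integer displacements relative to the PREVIOUS frame. The
    accumulated vector for frame t at block (y, x) is that frame's vector
    plus the accumulated vector at the source block it points to in t-1.
    """
    acc = [[(0, 0)] * w for _ in range(h)]  # I-frame: zero displacement
    out = []
    for grid in mvs:
        new = [[(0, 0)] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                dx, dy = grid[y][x]
                sx = min(max(x + dx, 0), w - 1)   # clamp source block to the grid
                sy = min(max(y + dy, 0), h - 1)
                px, py = acc[sy][sx]
                new[y][x] = (dx + px, dy + py)
        acc = new
        out.append([row[:] for row in acc])
    return out
```

After accumulation, every P-frame's motion and residual are expressed relative to the single decoded I-frame, so each frame can be processed independently.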

3.5.4 Frame/Clip sampling

Most of the aforementioned deep learning methods treat every video frame/clip equally for the final prediction. However, discriminative actions only happen in a few moments, and most of the other video content is irrelevant or weakly related to the labeled action category. There are several drawbacks of this paradigm. First, training with a large proportion of irrelevant video frames may hurt performance. Second, such uniform sampling is not efficient during inference.

前面提到的大多數深度學習方法在最終預測中對每個視頻幀/剪輯一視同仁。然而,具有判別性的動作只發生在少數時刻,其他大部分視頻內容與所標記的動作類別無關或關系不大。這種范式有幾個缺點。首先,使用大量無關視頻幀進行訓練可能會損害性能。其次,這種均勻采樣在推理過程中效率不高。

Partially inspired by how human understand a video using just a few glimpses over the entire video [251], many methods were proposed to sample the most informative video frames/clips for both improving the performance and making the model more efficient during inference.

在一定程度上,受人類如何僅使用整個視頻中的一小部分來理解視頻[251]的啟發,提出了許多方法來抽樣信息最豐富的視頻幀/剪輯,以提高性能并使模型在推理過程中更有效。

KVM [277] is one of the first attempts to propose an end-to-end framework to simultaneously identify key volumes and do action classification. Later, [98] introduce AdaScan that predicts the importance score of each video frame in an online fashion, which they term as adaptive temporal pooling. Both of these methods achieve improved performance, but they still adopt the standard evaluation scheme which does not show efficiency during inference. Recent approaches focus more on the efficiency [41, 234, 8, 106]. AdaFrame [234] follows [251, 98] but uses a reinforcement learning based approach to search more informative video clips. Concurrently, [8] uses a teacher-student framework, i.e., a see-it-all teacher can be used to train a compute efficient see-very-little student. They demonstrate that the efficient student network can reduce the inference time by 30% and the number of FLOPs by approximately 90% with negligible performance drop. Recently, SCSampler [106] trains a lightweight network to sample the most salient video clips based on compressed video representations, and achieve state-of-the-art performance on both Kinetics400 and Sports1M dataset. They also empirically show that such saliency-based sampling is not only efficient, but also enjoys higher accuracy than using all the video frames.

KVM[277]是最早提出端到端框架以同時識別關鍵片段(key volumes)并進行動作分類的嘗試之一。后來,[98]引入了AdaScan,以在線方式預測每個視頻幀的重要性得分,他們稱之為自適應時間池化。這兩種方法都提高了性能,但仍然采用標準的評價方案,該方案無法體現推理效率。最近的方法更關注效率[41,234,8,106]。AdaFrame[234]遵循了[251,98],但使用基于強化學習的方法來搜索更有信息量的視頻剪輯。同時,[8]使用了一個師生框架,即用一個"看全部"的教師網絡來訓練一個計算高效、"只看很少"的學生網絡。他們證明,高效的學生網絡可以減少30%的推理時間和大約90%的FLOPs,而性能下降可以忽略不計。最近,SCSampler[106]訓練一個輕量級網絡,基于壓縮視頻表示對最顯著的視頻剪輯進行采樣,并在Kinetics400和Sports1M數據集上實現了最先進的性能。他們還通過實驗表明,這種基于顯著性的采樣不僅高效,而且比使用所有視頻幀具有更高的準確性。
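The common skeleton of these samplers, reduced to its simplest form: score each clip with a cheap function and keep only the top-k for the heavy recognizer. The scoring callable below is a placeholder for a learned lightweight network, not any specific method's scorer:

```python
def select_salient_clips(clips, score_fn, k):
    """Toy saliency-based clip sampling (in the spirit of SCSampler [106]):
    rank fixed-length clips with a cheap scoring function and keep only
    the top-k, in temporal order, for the expensive recognizer.
    """
    ranked = sorted(range(len(clips)), key=lambda i: score_fn(clips[i]),
                    reverse=True)
    keep = sorted(ranked[:k])          # preserve temporal order of kept clips
    return [clips[i] for i in keep]
```

In a real system the expensive network then runs only on the returned clips and their predictions are averaged, which is where the inference-time savings come from.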

3.5.5 Visual tempo

Visual tempo is a concept to describe how fast an action goes. Many action classes have different visual tempos. There are several papers exploring different temporal rates (tempos) for improved temporal modeling [273, 147, 82, 281, 45, 248]. Initial attempts usually capture the video tempo through sampling raw videos at multiple rates and constructing an input-level frame pyramid [273, 147, 281]. Recently, SlowFast [45], as we discussed in section 3.3.4, utilizes the characteristics of visual tempo to design a two-pathway network for better accuracy and efficiency tradeoff. CIDC [121] proposed directional temporal modeling along with a local backbone for video temporal modeling. TPN [248] extends the tempo modeling to the feature-level and shows consistent improvement over previous approaches.
視覺節奏是一個描述動作進行速度的概念。許多動作類都有不同的視覺節奏。有幾篇論文探討了改進時態建模的不同時間速率(速度)[273,147,82,281,45,248]。最初的嘗試通常是通過以多個速率采樣原始視頻并構建輸入級幀金字塔來捕獲視頻節奏[273,147,281]。最近,SlowFast[45],正如我們在3.3.4節中討論的,利用視覺節奏的特性設計了一個雙路徑網絡,以獲得更好的準確性和效率權衡。CIDC[121]提出了定向時間建模以及視頻時間建模的局部主干。TPN[248]將節奏建模擴展到特征級,并顯示出與之前方法的一致改進。
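An input-level frame pyramid, as used by the early multi-rate approaches, can be sketched in a few lines; the stride values are illustrative:

```python
def frame_pyramid(frames, strides=(1, 2, 4)):
    """Build an input-level frame pyramid by sampling the same video at
    multiple temporal strides, exposing different visual tempos of the
    same action to the network.
    """
    return {s: frames[::s] for s in strides}
```

Feature-level approaches such as TPN move this multi-rate idea from the raw input into the backbone's intermediate features.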

We would like to point out that visual tempo is also widely used in self-supervised video representation learning [6, 247, 16] since it can naturally provide supervision signals to train a deep network. We will discuss more details on self-supervised video representation learning in section 5.13.

我們想指出的是,視覺節奏也被廣泛應用于自監督視頻表示學習[6,247,16],因為它可以自然地提供監督信號來訓練深度網絡。我們將在第5.13節討論關于自監督視頻表示學習的更多細節。

4. Evaluation and Benchmarking

In this section, we compare popular approaches on benchmark datasets. To be specific, we first introduce standard evaluation schemes in section 4.1. Then we divide common benchmarks into three categories, scene-focused (UCF101, HMDB51 and Kinetics400 in section 4.2), motion-focused (Sth-Sth V1 and V2 in section 4.3) and multi-label (Charades in section 4.4). In the end, we present a fair comparison among popular methods in terms of both recognition accuracy and efficiency in section 4.5.

在本節中,我們將比較基準數據集上的流行方法。具體而言,我們首先在4.1節介紹了標準評價方案。然后我們將通用基準分為三類,以場景為中心(第4.2節中的UCF101、HMDB51和Kinetics400),以運動為中心(第4.3節中的Sth-Sth V1和V2)和多標簽(第4.4節中的Charades)。最后,我們在4.5節中對流行方法在識別精度和效率方面進行了公平比較。

4.1. Evaluation scheme

During model training, we usually randomly pick a video frame/clip to form mini-batch samples. However, for evaluation, we need a standardized pipeline in order to perform fair comparisons.

在模型訓練過程中,我們通常隨機選取一個視頻幀/剪輯,形成小批量樣本。然而,對于評估,我們需要一個標準化的管道來執行公平的比較。

For 2D CNNs, a widely adopted evaluation scheme is to evenly sample 25 frames from each video following [187, 217]. For each frame, we perform ten-crop data augmentation by cropping the 4 corners and 1 center, flipping them horizontally and averaging the prediction scores (before softmax operation) over all crops of the samples, i.e., this means we use 250 frames per video for inference.

對于2D CNN,廣泛采用的評估方案是按照[187,217]從每個視頻中均勻采樣25幀。對于每一幀,我們通過裁剪4個角和1個中心、再將它們水平翻轉來執行十裁剪(ten-crop)數據增強,并對所有裁剪的預測得分(softmax操作之前)取平均,也就是說,我們每個視頻使用250幀進行推理。
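A sketch of this protocol's sampling arithmetic, assuming frame indices and crop top-left corners as simple integers (the exact rounding used by individual codebases may differ):

```python
def uniform_frame_indices(num_frames, num_samples=25):
    """Evenly sample frame indices across a video, as in the common
    2D-CNN evaluation protocol (25 frames per video)."""
    step = num_frames / num_samples
    return [min(int(step * i + step / 2), num_frames - 1)
            for i in range(num_samples)]

def ten_crop_boxes(h, w, size):
    """Top-left (y, x) corners of size x size crops at the 4 corners and
    the center; the other 5 views are horizontal flips of these crops,
    giving 10 views per frame and hence 25 * 10 = 250 views per video."""
    ys, xs = h - size, w - size
    return [(0, 0), (0, xs), (ys, 0), (ys, xs), (ys // 2, xs // 2)]
```

The final prediction averages the pre-softmax scores over all 250 views.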

For 3D CNNs, a widely adopted evaluation scheme termed 30-view strategy is to evenly sample 10 clips from each video following [219]. For each video clip, we perform three-crop data augmentation. To be specific, we scale the shorter spatial side to 256 pixels and take three crops of 256 × 256 to cover the spatial dimensions and average the prediction scores.

對于3D CNN,一種被廣泛采用的評估方案稱為30-view策略,即按照[219]從每個視頻中均勻采樣10個片段。對于每個視頻片段,我們執行三裁剪(three-crop)數據增強。具體來說,我們將較短的空間邊縮放到256像素,取三個256 × 256的裁剪來覆蓋空間維度,并對預測得分取平均。
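The spatial half of the 30-view protocol (10 clips x 3 crops) can be sketched similarly; the placement below assumes the three square crops sit at the two ends and the center of the longer spatial side, after the shorter side has been scaled to 256:

```python
def three_crop_boxes(h, w, size=256):
    """Top-left (y, x) corners of three size x size crops spread along
    the longer spatial side, covering the full spatial extent."""
    if w >= h:
        xs = [0, (w - size) // 2, w - size]
        return [(0, x) for x in xs]
    ys = [0, (h - size) // 2, h - size]
    return [(y, 0) for y in ys]
```

Combined with 10 uniformly sampled clips, this yields the 30 views whose scores are averaged for the video-level prediction.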

However, the evaluation schemes are not fixed. They are evolving and adapting to new network architectures and different datasets. For example, TSM [128] only uses two clips per video for small-sized datasets [190, 109], and perform three-crop data augmentation for each clip despite its being a 2D CNN. We will mention any deviations from the standard evaluation pipeline.

然而,評價方案并不是固定的。它們在不斷演進,以適應新的網絡架構和不同的數據集。例如,TSM[128]對于小型數據集[190,109],每個視頻只使用兩個剪輯,并對每個剪輯執行三裁剪數據增強,盡管它是一個2D CNN。我們將指出與標準評估流程的任何偏差。

In terms of evaluation metric, we report accuracy for single-label action recognition, and mAP (mean average precision) for multi-label action recognition.

在評價指標方面,我們報告單標簽動作識別的準確率,以及多標簽動作識別的平均精度均值(mAP)。
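For concreteness, a minimal implementation of per-class average precision and mAP as used for multi-label evaluation; this follows the standard ranking definition of AP, while the official Charades script may differ in details such as tie handling:

```python
def average_precision(scores, labels):
    """AP for one class: `labels` are 0/1 ground truth, `scores` are
    predicted confidences; AP averages the precision measured at each
    positive hit when examples are ranked by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total, ap = 0, sum(labels), 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            ap += hits / rank          # precision at this recall point
    return ap / total if total else 0.0

def mean_average_precision(all_scores, all_labels):
    """mAP: the mean of per-class APs, the metric reported on
    multi-label datasets such as Charades."""
    aps = [average_precision(s, l) for s, l in zip(all_scores, all_labels)]
    return sum(aps) / len(aps)
```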

4.2. Scene-focused datasets

Here, we compare recent state-of-the-art approaches on scene-focused datasets: UCF101, HMDB51 and Kinetics400. The reason we call them scene-focused is because most action videos in these datasets are short, and can be recognized by static scene appearance alone as shown in Figure 4.

Following the chronology, we first present results for early attempts of using deep learning and the two-stream networks at the top of Table 2. We make several observations. First, without motion/temporal modeling, the performance of DeepVideo [99] is inferior to all other approaches. Second, it is helpful to transfer knowledge from traditional methods (non-CNN-based) to deep learning. For example, TDD [214] uses trajectory pooling to extract motion-aware CNN features. TLE [36] embeds global feature encoding, which is an important step in traditional video action recognition pipeline, into a deep network.

按照時間順序,我們首先展示了使用深度學習和表2頂部的雙流網絡的早期嘗試的結果。我們做了一些觀察。首先,沒有運動/時間建模,DeepVideo[99]的性能低于所有其他方法。第二,有助于將傳統方法(非基于cnn的)的知識轉移到深度學習。例如,TDD[214]使用軌跡池來提取運動感知的CNN特征。TLE[36]將全局特征編碼(傳統視頻動作識別管道中的重要一步)嵌入到深度網絡中。

Table 3. Results of widely adopted methods on Something-Something V1 and V2 datasets. We only report numbers without using optical flow. Pre-train indicates which dataset the model is pre-trained on. I: ImageNet and K: Kinetics400. View means number of temporal clip multiples spatial crop, e.g., 30 means 10 temporal clips with 3 spatial crops each clip.

表3。在Something-Something V1和V2數據集上廣泛采用的方法的結果。我們只報告不使用光流的數字。預訓練指示模型在哪個數據集上進行預訓練。I: ImageNet, K: Kinetics400。視圖(view)表示時間剪輯數乘以空間裁剪數,例如,30表示10個時間剪輯,每個剪輯有3個空間裁剪。

We then compare 3D CNNs based approaches in the middle of Table 2. Despite training on a large corpus of videos, C3D [202] performs inferior to concurrent work [187, 214, 217], possibly due to difficulties in optimization of 3D kernels. Motivated by this, several papers, I3D [14], P3D [169], R2+1D [204] and S3D [239], factorize 3D convolution filters into 2D spatial kernels and 1D temporal kernels to ease the training. In addition, I3D introduces an inflation strategy to avoid training from scratch by bootstrapping the 3D model weights from well-trained 2D networks. By using these techniques, they achieve comparable performance to the best two-stream network methods [36] without the need for optical flow. Furthermore, recent 3D models obtain even higher accuracy, by using more training samples [203], additional pathways [45], or architecture search [44].

然后,我們在表2中間比較了基于3D cnn的方法。盡管在大量的視頻語料庫上進行訓練,C3D[202]的性能不如并發工作[187,214,217],這可能是由于3D內核的優化困難。基于此,I3D[14]、P3D[169]、R2+1D[204]和S3D[239]幾篇論文將3D卷積濾波器分解為2D空間核和1D時間核,以簡化訓練。此外,I3D引入了膨脹策略,通過從訓練良好的2D網絡引導3D模型權重來避免從頭開始訓練。通過使用這些技術,他們可以在不需要光流的情況下實現與最好的雙流網絡方法[36]相當的性能。此外,最近的3D模型通過使用更多的訓練樣本[203]、額外的路徑[45]或架構搜索[44]獲得了更高的精度。

Finally, we show recent efficient models in the bottom of Table 2. We can see that these methods are able to achieve higher recognition accuracy than two-stream networks (top), and comparable performance to 3D CNNs (middle). Since they are 2D CNNs and do not use optical flow, these methods are efficient in terms of both training and inference. Most of them are real-time approaches, and some can do online video action recognition [128]. We believe 2D CNN plus temporal modeling is a promising direction due to the need of efficiency. Here, temporal modeling could be attention based, flow based or 3D kernel based.

最后,我們在表2的底部展示了最近的高效模型。我們可以看到,這些方法能夠實現比雙流網絡更高的識別精度(上),性能與3D cnn相當(中)。由于它們是二維cnn,不使用光流,這些方法在訓練和推理方面都是有效的。其中大部分是實時方法,有些可以進行在線視頻動作識別[128]。由于對效率的需求,我們認為二維CNN加時間建模是一個很有前途的方向。在這里,時間建模可以是基于注意力的,基于流的或基于3D內核的。

4.3. Motion-focused datasets

In this section, we compare the recent state-of-the-art approaches on the 20BN-Something-Something (Sth-Sth) dataset. We report top-1 accuracy on both V1 and V2. Sth-Sth datasets focus on humans performing basic actions with daily objects. Different from scene-focused datasets, the background scene in Sth-Sth datasets contributes little to the final action class prediction. In addition, there are classes such as "Pushing something from left to right" and "Pushing something from right to left", which require strong motion reasoning.

在本節中,我們比較了在20BN-Something-Something (Sth-Sth)數據集上最近最先進的方法。我們報告V1和V2上的top-1準確率。Sth-Sth數據集關注的是人類對日常物體執行的基本動作。與以場景為中心的數據集不同,Sth-Sth數據集中的背景場景對最終動作類別預測的貢獻很小。此外,還有"從左向右推東西"和"從右向左推東西"等類別,它們需要強大的運動推理。

By comparing the previous work in Table 3, we observe that using longer input (e.g., 16 frames) is generally better. Moreover, methods that focus on temporal modeling [128, 122, 92] work better than stacked 3D kernels [14]. For example, TSM [128], TEA [122] and MSNet [110] insert an explicit temporal reasoning module into 2D ResNet backbones and achieve state-of-the-art results. This implies that the Sth-Sth dataset needs strong temporal motion reasoning as well as spatial semantics information.

通過比較表3中之前的工作,我們觀察到使用更長的輸入(例如,16幀)通常更好。此外,專注于時間建模的方法[128,122,92]比堆疊的3D內核[14]更好。例如,TSM[128]、TEA[122]和MSNet[110]在2D ResNet骨干中插入顯式時間推理模塊,并獲得了最先進的結果。這意味著Sth-Sth數據集需要強大的時間運動推理和空間語義信息。

4.4. Multi-label datasets

In this section, we first compare the recent state-of-the-art approaches on the Charades dataset [186] and then we list some recent work that uses assembled models or additional object information on Charades.

在本節中,我們首先比較了Charades數據集[186]上最近最先進的方法,然后列出了一些在Charades上使用組合模型或附加物體信息的最新工作。

Comparing the previous work in Table 4, we make the following observations. First, 3D models [229, 45] generally perform better than 2D models [186, 231] and 2D models with optical flow input. This indicates that the spatio-temporal reasoning is critical for long-term complex concurrent action understanding. Second, longer input helps with the recognition [229] probably because some actions require long-term feature to recognize. Third, models with strong backbones that are pre-trained on larger datasets generally have better performance [45]. This is because Charades is a medium-scaled dataset which doesn't contain enough diversity to train a deep model.

對比表4中之前的工作,我們得出以下觀察結果。首先,3D模型[229,45]的性能通常優于2D模型[186,231]和具有光流輸入的2D模型。這表明,時空推理對于長期復雜的并發動作理解是至關重要的。其次,較長的輸入有助于識別[229],這可能是因為一些動作需要長時間的特征才能識別。第三,在更大的數據集上預先訓練的具有強大主干的模型通常具有更好的性能[45]。這是因為Charades是一個中等規模的數據集,不包含足夠的多樣性來訓練一個深度模型。

Table 4. Charades evaluation using mAP, calculated using the officially provided script. NL: non-local network. Pre-train indicates which dataset the model is pre-trained on. I: ImageNet, K400: Kinetics400 and K600: Kinetics600.

表4。使用官方提供的腳本計算mAP,對Charades進行評估。NL:非局部網絡。預訓練指示模型在哪個數據集上進行預訓練。I: ImageNet, K400: Kinetics400, K600: Kinetics600。

Recently, researchers explored the alternative direction for complex concurrent action recognition by assembling models [177] or providing additional human-object interaction information [90]. These papers significantly outperformed previous literature that only finetune a single model on Charades. It demonstrates that exploring spatio-temporal human-object interactions and finding a way to avoid overfitting are the keys for concurrent action understanding.

最近,研究人員通過組合模型[177]或提供額外的人-物交互信息[90],探索了復雜并發動作識別的替代方向。這些論文顯著優于以往僅在Charades上微調單個模型的文獻。這表明,探索時空上的人-物交互并找到避免過擬合的方法,是并發動作理解的關鍵。

4.5. Speed comparison

To deploy a model in real-life applications, we usually need to know whether it meets the speed requirement before we can proceed. In this section, we evaluate the approaches mentioned above to perform a thorough comparison in terms of (1) number of parameters, (2) FLOPs, (3) latency and (4) frames per second.

要在實際應用程序中部署模型,我們通常需要知道它是否滿足速度要求,然后才能繼續。在本節中,我們評估上述方法,根據(1)參數數量、(2)FLOPS、(3)延遲和(4)每秒幀數來執行徹底的比較。

We present the results in Table 5. Here, we use the models in the GluonCV video action recognition model zoo, since all these models are trained using the same data, the same data augmentation strategy and under the same 30-view evaluation scheme, thus allowing a fair comparison. All the timings are done on a single Tesla V100 GPU with 105 repeated runs, while the first 5 runs are ignored for warming up. We always use a batch size of 1. In terms of model input, we use the suggested settings in the original paper.

我們在表5中展示了結果。在這里,我們使用GluonCV視頻動作識別模型庫中的模型,因為所有這些模型都使用相同的數據、相同的數據增強策略和相同的30-view評估方案進行訓練,因此可以進行公平的比較。所有的計時都是在單個Tesla V100 GPU上進行105次重復運行,其中前5次運行作為預熱被忽略。我們始終使用1的批量大小。在模型輸入方面,我們使用原論文中建議的設置。

As we can see in Table 5, if we compare latency, 2D models are much faster than all the 3D variants. This is probably why most real-world video applications still adopt frame-wise methods. Secondly, as mentioned in [170, 259], FLOPs is not strongly correlated with the actual inference time (i.e., latency). Third, if comparing performance, most 3D models give similar accuracy around 75%, but pre-training with a larger dataset can significantly boost the performance. This indicates the importance of training data and partially suggests that self-supervised pre-training might be a promising way to further improve existing methods.

從表5中我們可以看到,如果比較延遲,2D模型比所有3D變體都要快得多。這可能就是為什么大多數現實世界的視頻應用仍然采用逐幀方法。其次,如[170,259]所述,FLOPs與實際推理時間(即延遲)沒有很強的相關性。第三,如果比較性能,大多數3D模型的準確率都在75%左右,但使用更大的數據集進行預訓練可以顯著提升性能。這表明了訓練數據的重要性,并部分表明自監督預訓練可能是進一步改進現有方法的一種有前途的方式。

5. Discussion and Future Work

We have surveyed more than 200 deep learning based methods for video action recognition since year 2014. Despite the performance on benchmark datasets plateauing, there are many active and promising directions in this task worth exploring.

自2014年以來,我們已經調查了超過200種基于深度學習的視頻動作識別方法。盡管在基準數據集上的性能趨于穩定,但在這一任務中仍有許多積極和有前途的方向值得探索。

5.1. Analysis and insights

More and more methods have been developed to improve video action recognition; at the same time, there are some papers summarizing these methods and providing analysis and insights. Huang et al. [82] perform an explicit analysis of the effect of temporal information for video understanding. They try to answer the question "how important is the motion in the video for recognizing the action". Feichtenhofer et al. [48, 49] provide an amazing visualization of what two-stream models have learned in order to understand how these deep representations work and what they are capturing. Li et al. [124] introduce a concept, the representation bias of a dataset, and find that current datasets are biased towards static representations. Experiments on such biased datasets may lead to erroneous conclusions, which is indeed a big problem that limits the development of video action recognition. Recently, Piergiovanni et al. introduced the AViD [165] dataset to cope with data bias by collecting data from diverse groups of people. These papers provide great insights to help fellow researchers to understand the challenges, open problems and where the next breakthrough might reside.

越來越多的方法被開發出來以改進視頻動作識別,同時也有一些論文對這些方法進行總結,并提供分析和見解。Huang等人[82]對時間信息對視頻理解的影響進行了顯式分析。他們試圖回答"視頻中的運動對于識別動作有多重要"這個問題。Feichtenhofer等人[48,49]對雙流模型所學到的內容提供了出色的可視化,以理解這些深度表示如何工作以及它們捕獲了什么。Li等人[124]引入了一個概念,即數據集的表示偏差,并發現當前的數據集偏向于靜態表示。在這種有偏差的數據集上進行實驗可能會得出錯誤的結論,這確實是限制視頻動作識別發展的一大問題。最近,Piergiovanni等人引入了AViD[165]數據集,通過收集來自不同人群的數據來應對數據偏差。這些論文提供了很好的見解,幫助同行研究人員理解挑戰、開放問題以及下一個突破可能在哪里。

5.2. Data augmentation

Numerous data augmentation methods have been proposed in image recognition domain, such as mixup [258], cutout [31], CutMix [254], AutoAugment [27], FastAutoAug [126], etc. However, video action recognition still adopts basic data augmentation techniques introduced before year 2015 [217, 188], including random resizing, random cropping and random horizontal flipping. Recently, SimCLR [17] and other papers have shown that color jittering and random rotation greatly help representation learning. Hence, an investigation of using different data augmentation techniques for video action recognition is particularly useful. This may change the data pre-processing pipeline for all existing methods.

在圖像識別領域已經提出了許多數據增強方法,如mixup[258]、cutout[31]、CutMix[254]、AutoAugment[27]、FastAutoAug[126]等。然而,視頻動作識別仍然采用2015年之前引入的基礎數據增強技術[217,188],包括隨機調整大小、隨機裁剪和隨機水平翻轉。最近,SimCLR[17]等論文表明顏色抖動和隨機旋轉對表示學習有很大幫助。因此,研究使用不同的數據增強技術進行視頻動作識別是非常有用的。這可能會改變所有現有方法的數據預處理管道。

5.3. Video domain adaptation

Domain adaptation (DA) has been studied extensively in recent years to address the domain shift problem. Despite the accuracy on standard datasets getting higher and higher, the generalization capability of current video models across datasets or domains is less explored.

近年來,為解決域偏移問題,域適應(DA)得到了廣泛研究。盡管在標準數據集上的精度越來越高,但當前視頻模型跨數據集或跨域的泛化能力仍較少被探索。

There is early work on video domain adaptation [193, 241, 89, 159]. However, these works focus on small-scale video DA with only a few overlapping categories, which may not reflect the actual domain discrepancy and may lead to biased conclusions. Chen et al. [15] introduce two larger-scale datasets to investigate video DA and find that aligning temporal dynamics is particularly useful. Pan et al. [152] adopt co-attention to solve the temporal misalignment problem. Very recently, Munro et al. [145] explore a multi-modal self-supervision method for fine-grained video action recognition and show the effectiveness of multi-modality learning in video DA. Shuffle and Attend [95] argues that aligning the features of all sampled clips results in a sub-optimal solution due to the fact that not all clips include relevant semantics. Therefore, they propose to use an attention mechanism to focus more on informative clips and discard the non-informative ones. In conclusion, video DA is a promising direction, especially for researchers with less computing resources.

在視頻域適應方面有一些早期工作[193,241,89,159]。然而,這些文獻關注的是只有少數重疊類別的小規模視頻DA,可能無法反映實際的域差異,并可能導致有偏的結論。Chen等人[15]引入了兩個更大規模的數據集來研究視頻DA,并發現對齊時間動態特別有用。Pan等人[152]采用共同注意的方法解決時間錯位問題。最近,Munro等人[145]探索了一種用于細粒度視頻動作識別的多模態自監督方法,并展示了視頻DA中多模態學習的有效性。Shuffle and Attend[95]認為,對齊所有采樣剪輯的特征會導致次優解,因為并非所有剪輯都包含相關語義。因此,他們建議使用注意力機制,更多地關注有信息量的片段,丟棄無信息量的片段。總之,視頻DA是一個很有前途的方向,特別是對于計算資源較少的研究人員。

5.4. Neural architecture search

Neural architecture search (NAS) has attracted great interest in recent years and is a promising research direction. However, given its greedy need for computing resources, only a few papers have been published in this area [156, 163, 161, 178]. The TVN family [161], which jointly optimize parameters and runtime, can achieve competitive accuracy with human-designed contemporary models, and run much faster (within 37 to 100 ms on a CPU and 10 ms on a GPU per 1 second video clip). AssembleNet [178] and AssembleNet++ [177] provide a generic approach to learn the connectivity among feature representations across input modalities, and show surprisingly good performance on Charades and other benchmarks. AttentionNAS [222] proposed a solution for spatio-temporal attention cell search. The found cell can be plugged into any network to improve the spatio-temporal features. All previous papers do show their potential for video understanding.

神經架構搜索(NAS)近年來引起了人們的廣泛關注,是一個很有前途的研究方向。然而,由于其對計算資源的巨大需求,該領域發表的論文較少[156,163,161,178]。TVN系列[161]聯合優化參數和運行時間,可以達到與人工設計的同時代模型具有競爭力的精度,并且運行速度快得多(每1秒視頻剪輯在CPU上為37至100毫秒,在GPU上為10毫秒)。AssembleNet[178]和AssembleNet++[177]提供了一種通用方法來學習跨輸入模態的特征表示之間的連通性,并在Charades和其他基準上顯示出驚人的良好性能。AttentionNAS[222]提出了一種時空注意力單元搜索的解決方案。搜索到的單元可以插入任何網絡,以改善時空特征。以上論文都顯示了其在視頻理解方面的潛力。

Recently, some efficient ways of searching architectures have been proposed in the image recognition domain, such as DARTS [130], Proxyless NAS [11], ENAS [160], one-shot NAS [7], etc. It would be interesting to combine efficient 2D CNNs and efficient searching algorithms to perform video NAS for a reasonable cost.

近年來,在圖像識別領域提出了一些高效的搜索架構方法,如DARTS[130]、Proxyless NAS[11]、ENAS[160]、one-shot NAS[7]等。將高效的2D cnn和高效的搜索算法結合起來以合理的成本執行視頻NAS將是一件有趣的事情。

5.5. Efficient model development

Despite their accuracy, it is difficult to deploy deep learning based video understanding methods in real-world applications. There are several major challenges: (1) most methods are developed in offline settings, which means the input is a short video clip, not a video stream as in an online setting; (2) most methods do not meet the real-time requirement; (3) 3D convolutions or other non-standard operators are incompatible with non-GPU devices (e.g., edge devices).

Hence, developing efficient network architectures based on 2D convolutions is a promising direction. Approaches proposed in the image classification domain can be readily adapted to video action recognition, e.g., model compression, model quantization, model pruning, distributed training [68, 127], mobile networks [80, 265], mixed-precision training, etc. However, more effort is needed for the online setting, since the input to most action recognition applications, such as surveillance monitoring, is a video stream. We may need a new and more comprehensive dataset for benchmarking online video action recognition methods. Lastly, using compressed videos might be desirable because most videos are already compressed, which gives free access to motion information.
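
As one concrete illustration of the quantization technique mentioned above, here is a minimal NumPy sketch of symmetric post-training 8-bit weight quantization; the function names are our own and this is not the scheme of any particular paper:

```python
import numpy as np

def quantize_weights(w, num_bits=8):
    """Symmetric per-tensor quantization: map float weights to signed ints."""
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 127 for 8 bits
    scale = np.abs(w).max() / qmax          # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_weights(q, scale):
    """Recover approximate float weights for inference."""
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_weights(w)
w_hat = dequantize_weights(q, scale)
# int8 storage is 4x smaller than float32; rounding error stays within one step
print(np.abs(w - w_hat).max() <= scale)
```

In practice, per-channel scales and activation quantization matter as much as weight quantization, but the storage saving and the bounded rounding error shown here are the core idea.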

5.6. New datasets

Data is at least as important as model development for machine learning. For video action recognition, most datasets are biased towards spatial representations [124], i.e., most actions can be recognized from a single frame of the video without considering the temporal movement. Hence, a new dataset that requires long-term temporal modeling is needed to advance video understanding. Furthermore, most current datasets are collected from YouTube. Due to copyright/privacy issues, the dataset organizer often only releases the YouTube IDs or video links for users to download, not the actual videos. The first problem is that downloading the large-scale datasets may be slow in some regions; in particular, YouTube recently started to block massive downloading from a single IP. Thus, many researchers may not even obtain the dataset needed to start working in this field. The second problem is that, due to region limitations and privacy issues, some videos are no longer accessible. For example, the original Kinetics400 dataset has over 300K videos, but at this moment we can only crawl about 280K of them. On average, 5% of the videos are lost every year. It is impossible to make fair comparisons between methods when they are trained and evaluated on different data.

5.7. Video adversarial attack

Adversarial examples have been well studied on image models. [199] first showed that an adversarial sample, computed by adding a small amount of noise to the original image, can lead to a wrong prediction. However, limited work has been done on attacking video models.
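
To make the noise-insertion idea concrete, here is a fast-gradient-sign (FGSM-style) sketch on a toy logistic-regression "model"; the weights and input are made up for illustration, not taken from any video network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A toy "model": logistic regression with fixed weights.
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, -0.1, 0.2])       # clean input, predicted positive
y = 1.0                              # true label

# Gradient of the cross-entropy loss w.r.t. the input is (p - y) * w.
p = sigmoid(w @ x)
grad_x = (p - y) * w

# FGSM: take a small step in the direction that increases the loss.
eps = 0.3
x_adv = x + eps * np.sign(grad_x)

# The prediction flips from positive to negative on the perturbed input.
print(sigmoid(w @ x) > 0.5, sigmoid(w @ x_adv) > 0.5)
```

A white-box attacker computes `grad_x` exactly; a black-box attacker must estimate it (or the decision boundary) from query results, which is why query efficiency is the central metric in that setting.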

This task usually considers two settings: a white-box attack [86, 119, 66, 21], where the adversary has full access to the model, including exact gradients for a given input; or a black-box one [93, 245, 226], in which the structure and parameters of the model are hidden so that the attacker can only access (input, output) pairs through queries. The recent ME-Sampler [260] leverages motion information directly in generating adversarial videos, and is shown to successfully attack a number of video classification models using many fewer queries. In summary, this direction is useful because many companies provide APIs for services such as video classification, anomaly detection, shot detection, face detection, etc. In addition, this topic is also related to detecting DeepFake videos. Hence, investigating both attack and defense methods is crucial to keeping these video services safe.

5.8. Zero-shot action recognition

Zero-shot learning (ZSL) has been trending in the image understanding domain, and has recently been adapted to video action recognition. Its goal is to transfer learned knowledge to classify previously unseen categories. Because (1) data sourcing and annotation are expensive and (2) the set of possible human actions is huge, zero-shot action recognition is a very useful task for real-world applications.

There are many early attempts [242, 88, 243, 137, 168, 57] in this direction. Most of them follow a standard framework: first extract visual features from videos using a pretrained network, then train a joint model that maps the visual embedding to a semantic embedding space. In this manner, the model can be applied to new classes by finding the test class whose embedding is the nearest neighbor of the model's output. A recent work, URL [279], proposes to learn a universal representation that generalizes across datasets. Following URL [279], [10] presents the first end-to-end ZSL action recognition model; it also establishes a new ZSL training and evaluation protocol and provides an in-depth analysis to further advance this field. Inspired by the success of pre-training followed by zero-shot transfer in the NLP domain, we believe ZSL action recognition is a promising research topic.
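
The nearest-neighbor step of the standard framework can be sketched as follows; the class names and random embeddings are illustrative stand-ins for real word vectors and a trained visual-to-semantic mapping:

```python
import numpy as np

rng = np.random.default_rng(0)

# Semantic embeddings (e.g. word vectors) for unseen classes.
class_names = ["archery", "juggling", "surfing"]
class_emb = rng.normal(size=(3, 16))
class_emb /= np.linalg.norm(class_emb, axis=1, keepdims=True)

def zero_shot_classify(visual_emb, class_emb, class_names):
    """Assign the class whose semantic embedding is nearest (by cosine
    similarity) to the mapped visual embedding."""
    v = visual_emb / np.linalg.norm(visual_emb)
    sims = class_emb @ v
    return class_names[int(np.argmax(sims))]

# Pretend the joint model mapped a test video near the "surfing" embedding.
video_emb = class_emb[2] + 0.05 * rng.normal(size=16)
print(zero_shot_classify(video_emb, class_emb, class_names))
```

Since the classifier is just a nearest-neighbor lookup in the semantic space, adding a new action class only requires its semantic embedding, with no retraining.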

5.9. Weakly-supervised video action recognition

Building a high-quality video action recognition dataset [190, 100] usually requires multiple laborious steps: 1) sourcing a large number of raw videos, typically from the internet; 2) removing videos irrelevant to the categories in the dataset; 3) manually trimming the video segments that contain actions of interest; 4) refining the categorical labels. Weakly-supervised action recognition explores how to reduce the cost of curating training data.

The first direction of research [19, 60, 58] aims to reduce the cost of sourcing videos and of accurate categorical labeling. These works design training methods that use training data, such as action-related images or partially annotated videos, gathered from publicly available sources such as the Internet. This paradigm is therefore also referred to as webly-supervised learning [19, 58]. Recent work on omni-supervised learning [60, 64, 38] also follows this paradigm but features bootstrapping on unlabelled videos by distilling the models' own inference results.
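
The bootstrapping step of omni-supervised learning (distilling a model's own inference results on unlabelled videos) can be sketched as confidence-thresholded pseudo-labelling; the threshold and the softmax outputs below are made up for illustration:

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.8):
    """Keep only predictions the model is confident about; return the
    indices of kept samples and their pseudo-labels."""
    conf = probs.max(axis=1)
    keep = np.where(conf >= threshold)[0]
    return keep, probs[keep].argmax(axis=1)

# Softmax outputs of a pretrained model on 4 unlabelled clips.
probs = np.array([
    [0.95, 0.03, 0.02],   # confident -> pseudo-label class 0
    [0.40, 0.35, 0.25],   # ambiguous -> dropped
    [0.10, 0.85, 0.05],   # confident -> pseudo-label class 1
    [0.33, 0.33, 0.34],   # ambiguous -> dropped
])
keep, labels = select_pseudo_labels(probs)
print(keep.tolist(), labels.tolist())   # [0, 2] [0, 1]
```

The kept (clip, pseudo-label) pairs are then mixed into the labelled set for the next round of training; averaging predictions over multiple augmented views of each clip makes the pseudo-labels more reliable.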

The second direction aims at removing trimming, the most time-consuming part of annotation. UntrimmedNet [216] proposed a method to learn an action recognition model on untrimmed videos with only categorical labels [149, 172]. This task is also related to weakly-supervised temporal action localization, which aims to automatically generate the temporal span of the actions. Several papers propose to learn models for these two tasks simultaneously [155] or iteratively [184].
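
A minimal sketch of the attention-based aggregation idea behind UntrimmedNet (our own simplification, not the paper's exact formulation): per-segment class scores are weighted by learned attention and summed, so only a video-level categorical label is needed for supervision:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def video_level_scores(segment_logits, attention_logits):
    """Aggregate per-segment class probabilities into one video-level
    prediction, weighting each segment by its (softmaxed) attention."""
    attn = softmax(attention_logits, axis=0)      # (S,)  segment weights
    probs = softmax(segment_logits, axis=1)       # (S, C) per-segment classes
    return (attn[:, None] * probs).sum(axis=0)    # (C,)  video prediction

# 4 segments, 3 classes: segment 2 contains the action, the rest is background.
seg = np.array([[0.1, 0.0, 0.0],
                [0.2, 0.1, 0.0],
                [0.0, 3.0, 0.0],
                [0.1, 0.0, 0.2]])
att = np.array([-1.0, -1.0, 2.0, -1.0])   # attention focuses on segment 2
video = video_level_scores(seg, att)
print(int(video.argmax()))                 # 1
```

Because the attention weights indicate which segments drive the video-level prediction, the same weights give a rough temporal localization of the action for free.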

5.10. Fine-grained video action recognition

Popular action recognition datasets, such as UCF101 [190] or Kinetics400 [100], mostly comprise actions happening in various scenes. However, models learned on these datasets could overfit to contextual information irrelevant to the actions [224, 227, 24]. Several datasets have been proposed to study the problem of fine-grained action recognition, which examines a model's capacity to capture action-specific information. These datasets comprise fine-grained actions in human activities such as cooking [28, 108, 174], working [103] and sports [181, 124]. For example, FineGym [181] is a recent large dataset annotated with different moves and sub-actions in gymnastics videos.

5.11. Egocentric action recognition

Recently, large-scale egocentric action recognition [29, 28] has attracted increasing interest with the emergence of wearable camera devices. Egocentric action recognition requires a fine understanding of hand motion and the interacting objects in a complex environment. A few papers leverage object detection features to offer fine object context and improve egocentric video recognition [136, 223, 229, 180]. Others incorporate spatio-temporal attention [192] or gaze annotations [131] to localize the interacting object and facilitate action recognition. As in third-person action recognition, multi-modal inputs (e.g., optical flow and audio) have been demonstrated to be effective for egocentric action recognition [101].

5.12. Multi-modality

Multi-modal video understanding has attracted increasing attention in recent years [55, 3, 129, 167, 154, 2, 105]. There are two main categories. The first group of approaches uses multiple modalities such as scene, object, motion, and audio to enrich the video representations. In the second group, the goal is to design a model that utilizes modality information as a supervision signal for pre-training [195, 138, 249, 62, 2].

Multi-modality for comprehensive video understanding. Learning a robust and comprehensive representation of video is extremely challenging due to the complexity of semantics in videos. Video data often includes variations in different forms, including appearance, motion, audio, text or scene [55, 129, 166]. Therefore, utilizing these multi-modal representations is a critical step towards understanding video content more efficiently. The multi-modal representation of a video can be approximated by gathering representations of various modalities such as scene, object, audio, motion, appearance and text. Ngiam et al. [148] made an early attempt to use multiple modalities to obtain better features; they utilized videos of lips and the corresponding speech for multi-modal representation learning. Miech et al. [139] proposed a mixture-of-embedding-experts model to combine multiple modalities, including motion, appearance, audio, and face features, and learn the shared embedding space between these modalities and text. Roig et al. [175] combine multiple modalities such as action, scene, object and acoustic-event features in a pyramidal structure for action recognition, showing that each added modality improves the final action recognition accuracy. Both CE [129] and MMT [55] follow a research line similar to [139], where the goal is to combine multiple modalities to obtain a comprehensive representation of video for joint video-text representation learning. Piergiovanni et al. [166] utilized textual data together with video data to learn a joint embedding space; using this learned joint embedding space, the method is capable of zero-shot action recognition. This line of research is promising due to the availability of strong semantic-extraction models and the success of transformers on both vision and language tasks.
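
Joint video-text embedding models of this kind are commonly trained with a bidirectional max-margin ranking loss over matched pairs; below is a simplified NumPy sketch (single margin, no hard-negative mining, random embeddings as stand-ins for encoder outputs):

```python
import numpy as np

def ranking_loss(video_emb, text_emb, margin=0.2):
    """Bidirectional max-margin loss: each video should be closer to its own
    caption than to any other caption, and vice versa."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = v @ t.T                       # (N, N); diagonal = matched pairs
    pos = np.diag(sims)
    cost_v2t = np.maximum(0, margin + sims - pos[:, None])  # rank texts per video
    cost_t2v = np.maximum(0, margin + sims - pos[None, :])  # rank videos per text
    np.fill_diagonal(cost_v2t, 0)
    np.fill_diagonal(cost_t2v, 0)
    return cost_v2t.sum() + cost_t2v.sum()

rng = np.random.default_rng(0)
t = rng.normal(size=(4, 8))
aligned = ranking_loss(t + 0.01 * rng.normal(size=(4, 8)), t)  # near-perfect pairs
random_ = ranking_loss(rng.normal(size=(4, 8)), t)             # unrelated pairs
print(aligned < random_)
```

Minimizing this loss pulls matched video and text embeddings together while pushing mismatched pairs at least `margin` apart, which is what enables retrieval and zero-shot classification in the shared space.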

Table 6. Comparison of self-supervised video representation learning methods. The top section shows the multi-modal video representation learning approaches and the bottom section shows the video-only representation learning methods. From left to right, we show the self-supervised training setting, e.g. dataset, modalities, resolution, and architecture. The last two columns show the action recognition results on the UCF101 and HMDB51 datasets, which measure the quality of the self-supervised pre-trained model. HTM: HowTo100M, YT8M: YouTube8M, AS: AudioSet, IG-K: IG-Kinetics, K400: Kinetics400, K600: Kinetics600.

Multi-modality for self-supervised video representation learning. Most videos contain multiple modalities such as audio or text/captions. These modalities are a great source of supervision for learning video representations [3, 144, 154, 2, 162]. Korbar et al. [105] incorporated the natural synchronization between audio and video as a supervision signal in their contrastive learning objective for self-supervised representation learning. In multi-modal self-supervised representation learning, the dataset plays an important role. VideoBERT [195] collected 310K cooking videos from YouTube; however, this dataset is not publicly available. Similar to BERT, VideoBERT used a "masked language model" training objective and quantized the visual representations into "visual words". Miech et al. [140] introduced the HowTo100M dataset in 2019. This dataset includes 136M clips from 1.22M videos with their corresponding text. They collected the dataset from YouTube with the aim of obtaining instructional videos (how to perform an activity); in total, it covers 23.6K instructional tasks. MIL-NCE [138] used this dataset for self-supervised cross-modal representation learning, tackling the problem of visually misaligned narrations by considering multiple positive pairs in the contrastive learning objective. ActBERT [275] utilized the HowTo100M dataset for pre-training the model in a self-supervised way, incorporating visual, action, text and object features for cross-modal representation learning. Recently, AVLnet [176] and MMV [2] considered three modalities (visual, audio and language) for self-supervised representation learning. This research direction is also getting increasing attention due to the success of contrastive learning on many vision and language tasks and the access to abundant unlabeled multi-modal video data on platforms such as YouTube, Instagram or Flickr.
The top section of Table 6 compares multi-modal self-supervised representation learning methods. We will discuss more work on video-only representation learning in the next section.

5.13. Self-supervised video representation learning

Self-supervised learning has attracted more attention recently because it can leverage a large amount of unlabeled data by designing a pretext task that obtains free supervisory signals from the data itself. It first emerged in image representation learning. On images, the first stream of papers aimed at designing pretext tasks for completing missing information, such as image colorization [262] and image reordering [153, 61, 263]. The second stream of papers uses instance discrimination [235] as the pretext task and contrastive losses [235, 151] for supervision; these methods learn visual representations by modeling the visual similarity of object instances without class labels [235, 75, 201, 18, 17].
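
The contrastive objective behind instance discrimination (an InfoNCE-style loss [151]) can be sketched as follows; the random vectors stand in for embeddings of two augmented views of the same samples:

```python
import numpy as np

def info_nce(query, keys, temperature=0.07):
    """query[i] and keys[i] are embeddings of two views of sample i; every
    other key acts as a negative. Returns the mean cross-entropy loss."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature          # (N, N); diagonal = positives
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 32))
good = info_nce(x + 0.01 * rng.normal(size=(8, 32)), x)   # matched views
bad = info_nce(rng.normal(size=(8, 32)), x)               # unrelated views
print(good < bad)
```

Minimizing the loss pulls the two views of each instance together and pushes apart views of different instances, which is exactly the instance-discrimination objective described above.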

Self-supervised learning is also viable for videos. Compared with images, videos have an additional axis, the temporal dimension, which can be used to craft pretext tasks. Information completion tasks for this purpose include predicting the correct order of shuffled frames [141, 52] and video clips [240]. Jing et al. [94] focus on the spatial dimension only, predicting the rotation angles of rotated video clips. Combining temporal and spatial information, several tasks have been introduced to solve a space-time cubic puzzle, anticipate future frames [208], forecast long-term motions [134] and predict motion and appearance statistics [211]. RSPNet [16] and visual tempo [247] exploit the relative speed between video clips as a supervision signal.
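
Pretext tasks such as clip-order prediction [240] turn raw video into free (input, label) pairs; here is a sketch of the label-generation step, with the clip tensors replaced by strings for illustration:

```python
import itertools
import random

def make_order_sample(clips, rng):
    """Shuffle a tuple of clips; the index of the permutation used is the
    free supervisory label the network must predict."""
    perms = list(itertools.permutations(range(len(clips))))
    label = rng.randrange(len(perms))
    shuffled = [clips[i] for i in perms[label]]
    return shuffled, label

rng = random.Random(0)
clips = ["clip0", "clip1", "clip2"]     # stand-ins for short video tensors
shuffled, label = make_order_sample(clips, rng)

perms = list(itertools.permutations(range(3)))
# Recovering the permutation index from the shuffled order gives the label back.
print(perms.index(tuple(clips.index(c) for c in shuffled)) == label)
```

For 3 clips there are 3! = 6 possible orders, so the pretext task becomes a 6-way classification problem; no human annotation is involved, since the label is generated by the shuffling itself.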

The added temporal axis also provides flexibility in designing instance discrimination pretexts [67, 167]. Inspired by the decoupling of 3D convolution into spatially and temporally separable convolutions [239], Zhang et al. [266] proposed to decouple video representation learning into two sub-tasks: spatial contrast and temporal contrast. Recently, Han et al. [72] proposed memory-augmented dense predictive coding for self-supervised video representation learning: each video is split into several blocks, and the embedding of a future block is predicted by combining condensed representations in memory.

The temporal continuity in videos inspires researchers to design other pretext tasks around correspondence. Wang et al. [221] proposed to learn representations by performing cycle-consistency tracking: they track the same object backward and then forward across consecutive video frames, and use the inconsistency between the start and end points as the loss function. TCC [39] is a concurrent paper; instead of tracking local objects, it uses cycle-consistency to perform frame-wise temporal alignment as a supervision signal. [120] is a follow-up to [221] that utilizes both object-level and pixel-level correspondence across video frames. Recently, long-range temporal correspondence has been modelled as a random walk on a graph to help learn video representations [87].

We compare video self-supervised representation learning methods in the bottom section of Table 6. A clear trend can be observed: recent papers achieve much better linear evaluation accuracy, and fine-tuning accuracy comparable to supervised pre-training. This shows that self-supervised learning is a promising direction towards learning better video representations.

6. Conclusion

In this survey, we present a comprehensive review of more than 200 recent deep learning based approaches to video action recognition. Although this is not an exhaustive list, we hope the survey serves as an easy-to-follow tutorial for those seeking to enter the field, and an inspiring discussion for those seeking new research directions.

鲁鲁鲁爽爽爽在线视频观看 | 亚洲人成无码网www | 国产肉丝袜在线观看 | 欧美人与禽zoz0性伦交 | 亚洲热妇无码av在线播放 | 国模大胆一区二区三区 | 国产无套粉嫩白浆在线 | 国产又爽又猛又粗的视频a片 | 精品国产av色一区二区深夜久久 | 女人被爽到呻吟gif动态图视看 | 97精品人妻一区二区三区香蕉 | 高潮毛片无遮挡高清免费 | 在线精品国产一区二区三区 | 久久精品国产一区二区三区 | 婷婷五月综合激情中文字幕 | 婷婷丁香六月激情综合啪 | 美女毛片一区二区三区四区 | 国产亚洲欧美日韩亚洲中文色 | 国语自产偷拍精品视频偷 | 88国产精品欧美一区二区三区 | 内射白嫩少妇超碰 | 国产成人一区二区三区在线观看 | 亚洲精品中文字幕乱码 | 亚洲成a人片在线观看日本 | 天天拍夜夜添久久精品大 | 老熟女乱子伦 | 六十路熟妇乱子伦 | 成人综合网亚洲伊人 | 国产手机在线αⅴ片无码观看 | 亚洲无人区午夜福利码高清完整版 | 国产激情综合五月久久 | 东北女人啪啪对白 | 亚洲国产午夜精品理论片 | 少妇性l交大片 | 欧美老妇交乱视频在线观看 | 久久久久久久女国产乱让韩 | 日日天日日夜日日摸 | 国产免费久久精品国产传媒 | 乌克兰少妇xxxx做受 | 精品久久久无码人妻字幂 | 天堂无码人妻精品一区二区三区 | 欧美人与善在线com | 亚洲自偷自偷在线制服 | 99国产精品白浆在线观看免费 | 无码精品国产va在线观看dvd | 亚洲狠狠色丁香婷婷综合 | 强辱丰满人妻hd中文字幕 | 欧美自拍另类欧美综合图片区 | 帮老师解开蕾丝奶罩吸乳网站 | 国产精品久久久久久久影院 | 伊人色综合久久天天小片 | 国产电影无码午夜在线播放 | 国产无av码在线观看 | 久久精品人妻少妇一区二区三区 | 成人精品天堂一区二区三区 | 欧美日韩一区二区三区自拍 | 国产又爽又猛又粗的视频a片 | 国产精品怡红院永久免费 | 欧美人与禽猛交狂配 | 在线观看国产午夜福利片 | 成人一区二区免费视频 | 久久午夜无码鲁丝片 | 日产国产精品亚洲系列 | 丰腴饱满的极品熟妇 | 国产精品国产自线拍免费软件 | 熟妇人妻中文av无码 | 亚洲中文字幕乱码av波多ji | 动漫av网站免费观看 | 四虎国产精品免费久久 | 丰满人妻一区二区三区免费视频 | 熟女少妇在线视频播放 | 国产午夜精品一区二区三区嫩草 | 初尝人妻少妇中文字幕 | 亚欧洲精品在线视频免费观看 | 少妇人妻偷人精品无码视频 | 荫蒂被男人添的好舒服爽免费视频 | 日本护士xxxxhd少妇 | 日本爽爽爽爽爽爽在线观看免 | 久久精品视频在线看15 | 亚洲中文字幕久久无码 | 999久久久国产精品消防器材 | 国产99久久精品一区二区 | 亚洲综合在线一区二区三区 | 亚洲午夜无码久久 | 中文字幕+乱码+中文字幕一区 | 国产9 9在线 | 中文 | 美女毛片一区二区三区四区 | 乌克兰少妇xxxx做受 | 成 人 网 站国产免费观看 | 伊人久久大香线蕉av一区二区 | 国产口爆吞精在线视频 | 免费人成在线观看网站 | 欧美肥老太牲交大战 | 欧美国产日韩久久mv | 在线欧美精品一区二区三区 | 无码乱肉视频免费大全合集 | 国产手机在线αⅴ片无码观看 | 在线а√天堂中文官网 | 久久久国产精品无码免费专区 | 麻豆md0077饥渴少妇 | 国产艳妇av在线观看果冻传媒 | 国产精品99爱免费视频 | 欧美性猛交内射兽交老熟妇 | 国产乱人伦app精品久久 国产在线无码精品电影网 国产国产精品人在线视 | 夜夜高潮次次欢爽av女 | 日韩视频 中文字幕 视频一区 | 青青久在线视频免费观看 | 久久精品一区二区三区四区 | аⅴ资源天堂资源库在线 | 色欲人妻aaaaaaa无码 | 亚洲码国产精品高潮在线 | 免费无码肉片在线观看 | 强辱丰满人妻hd中文字幕 | 国产猛烈高潮尖叫视频免费 | 成人动漫在线观看 | 亚拍精品一区二区三区探花 | 男女猛烈xx00免费视频试看 | 牲欲强的熟妇农村老妇女视频 | 爆乳一区二区三区无码 | 樱花草在线播放免费中文 | 成在人线av无码免费 | 国产精品国产自线拍免费软件 | 日本乱偷人妻中文字幕 | 性做久久久久久久久 | 国产极品视觉盛宴 | 在线播放无码字幕亚洲 | 少妇高潮一区二区三区99 | 国产亚洲欧美日韩亚洲中文色 | 久久久久av无码免费网 | 亚洲精品中文字幕久久久久 | 欧美 亚洲 国产 另类 | 久久精品国产99久久6动漫 | 无码人妻精品一区二区三区不卡 | 亚洲七七久久桃花影院 | 精品少妇爆乳无码av无码专区 | 国产精品香蕉在线观看 | 
久久国产精品萌白酱免费 | 天天摸天天碰天天添 | 4hu四虎永久在线观看 | 久久国产精品偷任你爽任你 | 99久久精品国产一区二区蜜芽 | 免费中文字幕日韩欧美 | 国产亚洲人成a在线v网站 | 亚洲色无码一区二区三区 | aa片在线观看视频在线播放 | 国产精品亚洲综合色区韩国 | 1000部夫妻午夜免费 | 最近中文2019字幕第二页 | 少妇被粗大的猛进出69影院 | 国内揄拍国内精品人妻 | 强伦人妻一区二区三区视频18 | 日韩精品无码一区二区中文字幕 | 蜜桃视频插满18在线观看 | 亚洲精品国产精品乱码视色 | 男女爱爱好爽视频免费看 | 精品久久久久香蕉网 | 国产成人精品无码播放 | 日韩在线不卡免费视频一区 | 国产精品国产三级国产专播 | 欧美老熟妇乱xxxxx | 久久99国产综合精品 | 熟女体下毛毛黑森林 | 清纯唯美经典一区二区 | 成人精品一区二区三区中文字幕 | 国产真实夫妇视频 | 97久久国产亚洲精品超碰热 | 无码国产色欲xxxxx视频 | 国产精品久久国产精品99 | аⅴ资源天堂资源库在线 | 中文字幕精品av一区二区五区 | 色综合视频一区二区三区 | 日本成熟视频免费视频 | 少妇久久久久久人妻无码 | 一个人免费观看的www视频 | 欧美日韩人成综合在线播放 | 18精品久久久无码午夜福利 | 中文字幕人妻无码一区二区三区 | 国产亚洲日韩欧美另类第八页 | 又粗又大又硬毛片免费看 | 久久精品丝袜高跟鞋 | av小次郎收藏 | 日本丰满熟妇videos | 国产激情无码一区二区app | 久久久精品成人免费观看 | 久久久久久亚洲精品a片成人 | 免费视频欧美无人区码 | 成人亚洲精品久久久久 | 亚无码乱人伦一区二区 | √8天堂资源地址中文在线 | 国产亚洲精品久久久久久国模美 | 中文字幕乱妇无码av在线 | 成人性做爰aaa片免费看 | 98国产精品综合一区二区三区 | 天天爽夜夜爽夜夜爽 | 久久精品中文字幕一区 | 国产精品人人爽人人做我的可爱 | 人人妻人人藻人人爽欧美一区 | 久久久久久亚洲精品a片成人 | 亚洲色大成网站www | 亚洲国产午夜精品理论片 | 国产性生大片免费观看性 | 88国产精品欧美一区二区三区 | 国产卡一卡二卡三 | 帮老师解开蕾丝奶罩吸乳网站 | 少妇被黑人到高潮喷出白浆 | 国产一区二区三区精品视频 | 亚洲人亚洲人成电影网站色 | 亚洲一区二区三区含羞草 | 国产猛烈高潮尖叫视频免费 | 亚洲综合精品香蕉久久网 | 日本丰满护士爆乳xxxx | 一本久久伊人热热精品中文字幕 | 欧美日本免费一区二区三区 | 欧美老熟妇乱xxxxx | 亚洲精品无码人妻无码 | 精品欧美一区二区三区久久久 | 又色又爽又黄的美女裸体网站 | 久久综合香蕉国产蜜臀av | 2019午夜福利不卡片在线 | 特黄特色大片免费播放器图片 | 蜜臀av无码人妻精品 | 亚洲国精产品一二二线 | 欧美 丝袜 自拍 制服 另类 | 国产亚洲精品久久久ai换 | 免费网站看v片在线18禁无码 | 亚洲一区二区三区四区 | 色欲人妻aaaaaaa无码 | 亚洲 高清 成人 动漫 | 无码国产乱人伦偷精品视频 | 亚洲大尺度无码无码专区 | 精品久久综合1区2区3区激情 | 国产欧美亚洲精品a | 国内精品人妻无码久久久影院 | 国产精品亚洲а∨无码播放麻豆 | 国产欧美精品一区二区三区 | 国内精品人妻无码久久久影院蜜桃 | 久久国产精品_国产精品 | 香蕉久久久久久av成人 | 国产精品久久福利网站 | 十八禁真人啪啪免费网站 | 日韩人妻无码一区二区三区久久99 | 久久综合久久自在自线精品自 | 欧美第一黄网免费网站 | 蜜桃臀无码内射一区二区三区 | 男女超爽视频免费播放 | 精品人人妻人人澡人人爽人人 | 国产精品理论片在线观看 | 亚洲无人区一区二区三区 | 亚洲熟悉妇女xxx妇女av | 国产性生大片免费观看性 | 欧洲欧美人成视频在线 | 妺妺窝人体色www在线小说 | 成人欧美一区二区三区黑人 | aⅴ在线视频男人的天堂 | 一本一道久久综合久久 | 日本www一道久久久免费榴莲 | аⅴ资源天堂资源库在线 | 亚洲中文字幕在线无码一区二区 | 黄网在线观看免费网站 | 日本免费一区二区三区最新 | 一个人看的视频www在线 | 国精产品一品二品国精品69xx | 欧美三级a做爰在线观看 | 国产成人无码区免费内射一片色欲 | 又大又硬又黄的免费视频 | 国产精品va在线播放 | 精品国产一区二区三区四区 | 未满成年国产在线观看 | 日韩视频 中文字幕 视频一区 | 亚洲欧美日韩国产精品一区二区 | 日韩av无码中文无码电影 | 
亚洲欧美精品伊人久久 | 毛片内射-百度 | 日欧一片内射va在线影院 | 久久精品中文字幕一区 | 亚洲精品成人福利网站 | 国产va免费精品观看 | 国产超碰人人爽人人做人人添 | 国产成人人人97超碰超爽8 | 水蜜桃亚洲一二三四在线 | 天堂久久天堂av色综合 | 国产激情精品一区二区三区 | 中文无码精品a∨在线观看不卡 | 成人性做爰aaa片免费看不忠 | 青青草原综合久久大伊人精品 | 无码一区二区三区在线 | 国产无av码在线观看 | 欧美老妇交乱视频在线观看 | 日本精品久久久久中文字幕 | 无码国产乱人伦偷精品视频 | 亚洲日韩精品欧美一区二区 | 初尝人妻少妇中文字幕 | 18禁止看的免费污网站 | 97久久超碰中文字幕 | 国产办公室秘书无码精品99 | 鲁鲁鲁爽爽爽在线视频观看 | 免费人成在线观看网站 | 中文字幕无码日韩专区 | 欧美精品免费观看二区 | 又大又紧又粉嫩18p少妇 | 97资源共享在线视频 | 久久无码专区国产精品s | 久久国产精品精品国产色婷婷 | 午夜成人1000部免费视频 | 久久综合狠狠综合久久综合88 | 精品久久久久久亚洲精品 | 日韩av无码一区二区三区 | 内射欧美老妇wbb | 国产亚洲精品久久久久久久久动漫 | 国模大胆一区二区三区 | 无码人妻精品一区二区三区下载 | 乱中年女人伦av三区 | 成人三级无码视频在线观看 | 亚洲爆乳无码专区 | 婷婷丁香五月天综合东京热 | 亚洲色在线无码国产精品不卡 | 一本大道久久东京热无码av | 国产精品18久久久久久麻辣 | 好爽又高潮了毛片免费下载 | 国内少妇偷人精品视频免费 | 亚洲自偷自拍另类第1页 | 少妇太爽了在线观看 | 老司机亚洲精品影院无码 | 色欲久久久天天天综合网精品 | 色 综合 欧美 亚洲 国产 | 日韩av无码一区二区三区 | 亚洲自偷自拍另类第1页 | 日韩欧美成人免费观看 | 男人扒开女人内裤强吻桶进去 | 亚洲精品一区二区三区大桥未久 | 亚洲成熟女人毛毛耸耸多 | аⅴ资源天堂资源库在线 | 亚欧洲精品在线视频免费观看 | 国产乱子伦视频在线播放 | 日日鲁鲁鲁夜夜爽爽狠狠 | 久久久久久九九精品久 | 综合激情五月综合激情五月激情1 | 亚洲小说春色综合另类 | 内射爽无广熟女亚洲 | 在线天堂新版最新版在线8 | 青草视频在线播放 | 一本一道久久综合久久 | 久久久久人妻一区精品色欧美 | 久久午夜无码鲁丝片午夜精品 | 亚洲中文字幕在线观看 | 精品国产福利一区二区 | 亚洲一区二区三区偷拍女厕 | 久久99精品久久久久婷婷 | 中国女人内谢69xxxxxa片 | 亚洲国产精品一区二区美利坚 | 欧美日本精品一区二区三区 | 狠狠色噜噜狠狠狠狠7777米奇 | 久久久久久国产精品无码下载 | 熟女少妇在线视频播放 | 乌克兰少妇xxxx做受 | 国产精品美女久久久久av爽李琼 | 中文字幕久久久久人妻 | 中文字幕日韩精品一区二区三区 | 国产9 9在线 | 中文 | 在线视频网站www色 | 欧美成人高清在线播放 | 强开小婷嫩苞又嫩又紧视频 | 亚洲 日韩 欧美 成人 在线观看 | 久久综合网欧美色妞网 | 精品无人国产偷自产在线 | 亚洲国产精品无码久久久久高潮 | 亚洲精品综合五月久久小说 | 蜜臀aⅴ国产精品久久久国产老师 | 大地资源中文第3页 | 国产片av国语在线观看 | 欧美成人高清在线播放 | 成熟女人特级毛片www免费 | 久久久中文字幕日本无吗 | 一本一道久久综合久久 | 亚洲精品国产精品乱码视色 | 亚洲中文字幕在线无码一区二区 | 永久免费观看美女裸体的网站 | 中文字幕无码日韩欧毛 | 香蕉久久久久久av成人 | 帮老师解开蕾丝奶罩吸乳网站 | 亚洲日韩一区二区三区 | 久久综合给久久狠狠97色 | 国产精品久久久久久久影院 | 国产精品久久久 | 午夜嘿嘿嘿影院 | 国产在线一区二区三区四区五区 | 久久精品女人天堂av免费观看 | 日本大香伊一区二区三区 | 亚洲精品美女久久久久久久 | 亚洲精品午夜国产va久久成人 | 中文字幕无码日韩专区 | 成人性做爰aaa片免费看 | 中文字幕 人妻熟女 | 伦伦影院午夜理论片 | 国产精品久久国产三级国 | yw尤物av无码国产在线观看 | 欧美日韩人成综合在线播放 | 99精品视频在线观看免费 | 秋霞成人午夜鲁丝一区二区三区 | 国产特级毛片aaaaaa高潮流水 | 国产在线aaa片一区二区99 | 国产精品福利视频导航 | 亚洲精品久久久久久一区二区 | 99国产欧美久久久精品 | 亚洲精品国偷拍自产在线麻豆 | 日日碰狠狠躁久久躁蜜桃 | 欧美日韩综合一区二区三区 | 
熟女少妇在线视频播放 | 亚洲综合色区中文字幕 | 成 人影片 免费观看 | 宝宝好涨水快流出来免费视频 | 奇米影视888欧美在线观看 | 特大黑人娇小亚洲女 | 久久这里只有精品视频9 | 乱人伦人妻中文字幕无码久久网 | 性欧美疯狂xxxxbbbb | 国产成人亚洲综合无码 | 四虎国产精品一区二区 | 欧洲精品码一区二区三区免费看 | 日本精品少妇一区二区三区 | 国产在线精品一区二区三区直播 | 搡女人真爽免费视频大全 | 18无码粉嫩小泬无套在线观看 | 无码人妻出轨黑人中文字幕 | 西西人体www44rt大胆高清 | 免费国产成人高清在线观看网站 | 久久国产36精品色熟妇 | 欧美成人家庭影院 | 国产亚洲精品久久久久久大师 | 久久午夜夜伦鲁鲁片无码免费 | 亚洲色欲色欲天天天www | 日韩av无码一区二区三区 | 欧美精品无码一区二区三区 | 丰满人妻翻云覆雨呻吟视频 | 四十如虎的丰满熟妇啪啪 | 三上悠亚人妻中文字幕在线 | 无码乱肉视频免费大全合集 | 国产精品久久久av久久久 | 人妻中文无码久热丝袜 | 国产免费久久精品国产传媒 | 麻豆果冻传媒2021精品传媒一区下载 | 精品无人区无码乱码毛片国产 | 亚洲精品www久久久 | 美女张开腿让人桶 | 伊人久久大香线蕉午夜 | 性做久久久久久久久 | 亚洲人成网站在线播放942 | 午夜精品久久久久久久久 | 国产亚洲精品久久久久久久久动漫 | 国产另类ts人妖一区二区 | 欧洲vodafone精品性 | 国产高清不卡无码视频 | 国产精品毛片一区二区 | 国语自产偷拍精品视频偷 | 久久久久se色偷偷亚洲精品av | 国内精品久久毛片一区二区 | 天干天干啦夜天干天2017 | 中文字幕人妻丝袜二区 | 国产成人精品久久亚洲高清不卡 | 性啪啪chinese东北女人 | 国产特级毛片aaaaaaa高清 | 草草网站影院白丝内射 | 亚洲人成影院在线无码按摩店 | 丁香花在线影院观看在线播放 | 久9re热视频这里只有精品 | 曰韩无码二三区中文字幕 | 亚洲а∨天堂久久精品2021 | 亚洲天堂2017无码中文 | 亚洲精品美女久久久久久久 | 国产精品久久国产精品99 | 中文字幕乱码人妻二区三区 | 99麻豆久久久国产精品免费 | 人人妻人人藻人人爽欧美一区 | 欧美人妻一区二区三区 | 欧美肥老太牲交大战 | 国产又粗又硬又大爽黄老大爷视 | 亲嘴扒胸摸屁股激烈网站 | 久精品国产欧美亚洲色aⅴ大片 | 自拍偷自拍亚洲精品被多人伦好爽 | 成人一区二区免费视频 | 秋霞特色aa大片 | 乱中年女人伦av三区 | 日日天干夜夜狠狠爱 | 亚洲熟妇色xxxxx欧美老妇y | 欧美xxxx黑人又粗又长 | 成人性做爰aaa片免费看不忠 | 国产农村乱对白刺激视频 | 高清无码午夜福利视频 | 日本免费一区二区三区最新 | 国产人妖乱国产精品人妖 | 日本肉体xxxx裸交 | 国产亚洲精品久久久久久大师 | 99久久99久久免费精品蜜桃 | 日韩av无码一区二区三区不卡 | 久久午夜无码鲁丝片秋霞 | 国产午夜无码视频在线观看 | 欧美黑人巨大xxxxx | 国产片av国语在线观看 | 国产艳妇av在线观看果冻传媒 | 无码精品国产va在线观看dvd | 在线 国产 欧美 亚洲 天堂 | 色一情一乱一伦 | 麻豆果冻传媒2021精品传媒一区下载 | 亚洲成在人网站无码天堂 | 少妇性荡欲午夜性开放视频剧场 | 荫蒂被男人添的好舒服爽免费视频 | 妺妺窝人体色www婷婷 | 国产精品二区一区二区aⅴ污介绍 | 久久天天躁夜夜躁狠狠 | 色偷偷人人澡人人爽人人模 | 亚洲 a v无 码免 费 成 人 a v | 午夜精品一区二区三区的区别 | 亚洲国产精品一区二区第一页 | 老熟女重囗味hdxx69 | 女人被爽到呻吟gif动态图视看 | 欧美老妇交乱视频在线观看 | 国产精品第一区揄拍无码 | 特级做a爰片毛片免费69 | 亚洲中文字幕久久无码 | 日本大香伊一区二区三区 | 中文精品久久久久人妻不卡 | 在线视频网站www色 | 九九久久精品国产免费看小说 | 婷婷综合久久中文字幕蜜桃三电影 | 两性色午夜免费视频 | 一个人免费观看的www视频 | 免费人成网站视频在线观看 | 日日夜夜撸啊撸 | 沈阳熟女露脸对白视频 | 亚洲日韩一区二区三区 | 国产在线无码精品电影网 | 青青草原综合久久大伊人精品 | 成人欧美一区二区三区黑人免费 | 免费无码av一区二区 | 久久精品人人做人人综合试看 | 一二三四社区在线中文视频 | 亚洲精品无码国产 | 牲欲强的熟妇农村老妇女 | 熟妇人妻激情偷爽文 | 黑森林福利视频导航 | 
中文字幕人妻无码一夲道 | 宝宝好涨水快流出来免费视频 | 无遮挡啪啪摇乳动态图 | 欧美亚洲国产一区二区三区 | 国产深夜福利视频在线 | 在线亚洲高清揄拍自拍一品区 | 国产成人无码专区 | 日本护士毛茸茸高潮 | www国产亚洲精品久久久日本 | 99精品视频在线观看免费 | 亚洲精品一区二区三区大桥未久 | 欧美日韩在线亚洲综合国产人 | 亚洲人成影院在线无码按摩店 | 四十如虎的丰满熟妇啪啪 | 亚洲中文字幕无码中文字在线 | 亚洲成av人影院在线观看 | 牲交欧美兽交欧美 | 中文字幕 亚洲精品 第1页 | 激情五月综合色婷婷一区二区 | 噜噜噜亚洲色成人网站 | 久久国产精品偷任你爽任你 | 午夜丰满少妇性开放视频 | a在线观看免费网站大全 | 欧美精品一区二区精品久久 | 久久国产自偷自偷免费一区调 | 亚洲色大成网站www | 国产99久久精品一区二区 | 老司机亚洲精品影院 | 国产激情艳情在线看视频 | 少妇性俱乐部纵欲狂欢电影 | 秋霞成人午夜鲁丝一区二区三区 | 国产亚洲精品久久久久久久 | 高中生自慰www网站 | 无码一区二区三区在线观看 | 精品日本一区二区三区在线观看 | 亚洲熟熟妇xxxx | 在线а√天堂中文官网 | 久久成人a毛片免费观看网站 | 欧美老熟妇乱xxxxx | 丝袜 中出 制服 人妻 美腿 | 美女黄网站人色视频免费国产 | 乌克兰少妇xxxx做受 | 亚洲s码欧洲m码国产av | 亚洲一区av无码专区在线观看 | 一本大道久久东京热无码av | 国产精品久久久 | 国产精品毛多多水多 | 国产成人久久精品流白浆 | 中文无码精品a∨在线观看不卡 | 久久久国产精品无码免费专区 | а√资源新版在线天堂 | 波多野结衣乳巨码无在线观看 | 夜夜躁日日躁狠狠久久av | 日韩av无码一区二区三区不卡 | 国产精品高潮呻吟av久久4虎 | 国产国产精品人在线视 | 久久精品一区二区三区四区 | 精品国产青草久久久久福利 | 熟女俱乐部五十路六十路av | 午夜福利不卡在线视频 | 国产综合久久久久鬼色 | 国产精品高潮呻吟av久久4虎 | 青青久在线视频免费观看 | ass日本丰满熟妇pics | 2020久久超碰国产精品最新 | 无码国模国产在线观看 | 中国女人内谢69xxxx |