Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives

  Thong Nguyen¹,  Yi Bin¹,  Junbin Xiao¹,  Leigang Qu¹,  Yicong Li¹,
  Jay Zhangjie Wu¹,  Cong-Duy Nguyen²,  See-Kiong Ng¹,  Luu Anh Tuan²*
¹National University of Singapore, Singapore
²Nanyang Technological University, Singapore
e0998147@u.nus.edu, anhtuan.luu@ntu.edu.sg
* Corresponding author

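Abstract

Humans use multiple senses to comprehend their environment. Vision and language are two of the most vital, since they let us easily communicate our thoughts and perceive the world around us. Because a video-language pair can mimic both our linguistic medium and a visual environment with temporal dynamics, there has been strong interest in building video-language understanding systems with human-like senses. In this survey, we review the key tasks of these systems and highlight the associated challenges. Based on these challenges, we summarize their methods from the model architecture, model training, and data perspectives. We also compare the performance of these methods and discuss promising directions for future research.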

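1 Introduction

Vision and language constitute fundamental components of our perception: vision allows us to perceive the physical world, while language enables us to describe and talk about it. However, the world is not merely a static image; it is dynamic, with objects moving and interacting over time. With the addition of the temporal dimension, videos can capture the temporal dynamics that characterize the physical world. Therefore, to endow artificial intelligence with human-like perceptual abilities, researchers have been developing video-language understanding models that can interpret both the spatio-temporal dynamics of videos and the semantics of language, an effort that dates back to the 1970s (Lazarus, 1973; McGurk and MacDonald, 1976). These models differ from image-language understanding models in that they exhibit the additional ability to interpret temporal dynamics (Li et al., 2020).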

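These models have demonstrated impressive performance on a variety of video-language understanding tasks, which evaluate video-language models from coarse-grained to fine-grained understanding. For coarse-grained understanding, the text-video retrieval task assesses a model's ability to associate a language query holistically with an entire video (Han et al., 2023). For finer-grained understanding, a video captioning model needs to comprehend both the overall content and the details of a video, and then describe that content in concise language (Abdar et al., 2023). Fine-grained understanding in video question answering remains a difficult task, in which a model needs to recognize subtle visual objects or actions and infer their semantic, spatial, temporal, and causal relations (Xiao et al., 2021).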

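To perform such tasks effectively, video-language understanding works must address three challenges. The first lies in designing an appropriate neural architecture to model the interaction between the video and language modalities. The second is designing an effective strategy to train video-language understanding models so that they adapt well to multiple target tasks and domains. The third is preparing the high-quality video-language data that fuel the training of these models.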

[Figure 1: taxonomy tree, reproduced as an outline]
Video-Language Understanding
• Video-language understanding tasks
    Text-video retrieval: e.g. (Jiang et al., 2022a; Jin et al., 2023; Dong et al., 2022; Pei et al., 2023; Lin et al., 2022; Zhang et al., 2023a)
    Video captioning: e.g. (Seo et al., 2022; Wu et al., 2021; Zhang et al., 2020; Pan et al., 2020; Xu et al., 2020; Lin et al., 2020)
    Video question answering: e.g. (Xiao et al., 2023b, 2022b; Park et al., 2021; Li et al., 2023e; Guo et al., 2021; Peng et al., 2021; Zhao et al., 2017b)
    Other tasks: e.g. (Liu et al., 2022; Zeng et al., 2022; Yang et al., 2021; Li et al., 2023c; Lin et al., 2023; Hwang et al., 2023)
• Video-language understanding methods
    Model architecture
        Pre-Transformer: e.g. (Ye et al., 2017; Feichtenhofer et al., 2016; Yang et al., 2017; Zhao et al., 2017a)
        Transformer-based: e.g. (Akbari et al., 2021; Tang et al., 2021; Li et al., 2023b; Luo et al., 2022; Xue et al., 2022b)
        LLM-augmented: e.g. (Zhang et al., 2023b; Li et al., 2023a; Chen et al., 2023; Li et al., 2023d; Pan et al., 2023)
    Model training
        Pre-training: e.g. (Cheng et al., 2023; Lei et al., 2021c; Fu et al., 2023; Gao et al., 2021; Bain et al., 2021)
        Fine-tuning: e.g. (Xu et al., 2019; Anne Hendricks et al., 2017; Pan et al., 2022; Yang et al., 2022a)
    Data perspective
        Data curation
            Manual collection: e.g. (Xue et al., 2022a; Zellers et al., 2021; Castro et al., 2022b)
            Data augmentation: e.g. (Xing et al., 2023; Jiang et al., 2022c; Wang et al., 2021b)
        Label annotation
            Manual annotation: e.g. (Li et al., 2022a; Xiao et al., 2021; Castro et al., 2022a)
            Automatic generation: e.g. (Zhao et al., 2023; Yang et al., 2023a; Ventura et al., 2023)

Figure 1: Taxonomy of video-language understanding.

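Although several recent works have reviewed video-language understanding, they mainly focus on a single challenge, such as Transformer-based (Ruan and Jin, 2022) and LLM-augmented architectures (Tang et al., 2023b) (the first challenge), self-supervised learning (Schiappa et al., 2023) and pre-training (Cheng et al., 2023) (the second challenge), or data augmentation (Zhou et al., 2024) (the third challenge). Other works concentrate on a single video-language understanding task, such as video question answering (Zhong et al., 2022), text-video retrieval (Zhu et al., 2023), or video captioning (Abdar et al., 2023). Such a narrow focus conflicts with the growing consensus that advocates developing artificial general intelligence able to adapt to diverse tasks and domains. Consider a human-computer interaction scenario in which a person repeatedly asks questions about a video, searches for relevant moments, and requests a summary. Such a use case demands a broad ability to understand video and language content that is not bound to a specific task. Moreover, developing a video-language understanding system is usually a multi-step process that involves designing a model architecture, formulating a training method, and preparing data, rather than a single-step effort. This paper therefore aims to provide a comprehensive and meaningful survey that connects the various aspects of video-language understanding. Our contributions are as follows: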

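• We summarize the key tasks of video-language understanding and discuss their common challenges: intra-modal and cross-modal interaction, cross-domain adaptation, and data preparation.

• We provide a clear taxonomy of video-language understanding works from three perspectives, corresponding to the three challenges above: (1) Model architecture perspective: we divide existing works into pre-Transformer, Transformer-based, and LLM-augmented architectures for modeling video-language relations. For the last category, we discuss recent efforts that leverage the strengths of LLMs to enhance video-language understanding. (2) Model training perspective: we divide training methods into pre-training and fine-tuning, which adapt video-language representations to target downstream tasks. (3) Data perspective: we summarize existing methods for curating and annotating video-language data to facilitate the training of video-language understanding models.

• Finally, we offer our outlook and propose potential directions for future research.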

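2 Video-Language Understanding Tasks

Text-video retrieval. Text-video retrieval is the task of searching for the video that corresponds to a language query (text-to-video) or, conversely, for the language description that corresponds to a video (video-to-text). In practical applications, returning an entire video may not be desirable. Video moment retrieval (VMR) has therefore emerged, with the goal of precisely localizing the moments in a video that are relevant to a user query. VMR demands a more nuanced understanding that captures the distinct concepts and events in a video in order to localize specific moments, rather than capturing the overall theme as in standard text-video retrieval.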

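Video captioning. Video captioning is the task of generating a concise language description for a video. A video captioning model takes a video as input and, optionally, the language text transcribed from the video's audio. Typically, the model produces a sentence-level caption for the whole video, or it may generate a paragraph as a more detailed summary.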

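Video question answering (video QA). Video question answering is the task of predicting the correct answer given a question $q$ and a video $v$. Video QA falls into two main types: multi-choice video QA and open-ended video QA. In multi-choice video QA, the model is given a fixed number of candidate answers and selects the correct one. Open-ended video QA can be formulated as a classification problem, a generation problem, or a regression problem. Classification-based video QA associates a video-question pair with an answer from a pre-defined vocabulary. Generation-based video QA is not restricted to a vocabulary; the model generates a sequence of tokens that represents the answer. Regression-based video QA is typically used for counting questions, such as counting the repetitions of an action or the number of instances of an object in a video.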
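The four formulations above can be written compactly as follows. This is only an illustrative summary: the scoring function $f_\theta$, the distribution $p_\theta$, the regressor $g_\theta$, the candidate set $\mathcal{C}(q)$, and the answer vocabulary $\mathcal{V}$ are notation introduced here, and the left-to-right factorization in the generation case is an assumption about how the token sequence is produced.

```latex
\begin{align*}
\text{multi-choice:} \quad
  & \hat{a} = \arg\max_{a \in \mathcal{C}(q)} f_\theta(a \mid q, v) \\
\text{open-ended, classification:} \quad
  & \hat{a} = \arg\max_{a \in \mathcal{V}} p_\theta(a \mid q, v) \\
\text{open-ended, generation:} \quad
  & p_\theta(a_{1:T} \mid q, v) = \prod_{t=1}^{T} p_\theta(a_t \mid a_{<t}, q, v) \\
\text{open-ended, regression:} \quad
  & \hat{a} = g_\theta(q, v) \in \mathbb{R}
\end{align*}
```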

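Connections among video-language understanding tasks. These tasks constitute three fundamental testbeds for video-language understanding ability (see Appendix A for examples). In Figure 2, we provide a hierarchy that describes their escalating degree of video-language understanding. At the basic level, text-video retrieval globally associates an entire video with text content. At the intermediate level, video captioning is harder than retrieval because it requires selectively mapping the entities and events in a video to the language modality. At the highest level, video question answering explores the relations between video and language content to produce an appropriate output. The task at each level is associated with a counterpart that demands more inferential or finer-grained understanding: inference video QA (Xiao et al., 2021; Li et al., 2022a) versus video QA, dense video captioning (Zhou et al., 2018b) or video chapter generation (Yang et al., 2023b) versus video captioning, and video moment retrieval (temporal grounding) versus text-video retrieval. These more inferential or finer-grained tasks pose additional challenges and play an increasingly important role in current research as it moves toward the core of human intelligence (Fei-Fei and Krishna, 2022).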

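3 Challenges of Video-Language Understanding

Compared with image-language understanding, the video-language understanding tasks discussed above pose unique challenges, because a video carries an additional temporal channel. We summarize the major challenges as follows: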

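Intra-modal and cross-modal interaction. While the modeling of intra-modal interaction for language can be borrowed directly from image-language understanding, intra-modal interaction for video is different because it comprises both spatial and temporal interaction. Spatial interaction examines the relationships among pixels, patches, regions, or objects within a single frame, whereas temporal interaction captures sequential dependencies among video frames or video clips. Longer video durations exacerbate the complexity of temporal modeling, since more objects and events must be recognized across more frames (Yu et al., 2020; Lin et al., 2022) and their long-term dependencies must be reasoned about (Zhao et al., 2018). Particular video domains, such as egocentric videos, also complicate temporal interaction modeling, because objects undergo drastic appearance and disappearance dynamics over time, which makes their relations hard to capture (Bansal et al., 2022; Tang et al., 2023a).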

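Because the semantic gap between video and language is larger than that between image and language, cross-modal interaction plays a crucial role in video-language understanding. Interaction between visual and language features is essential for aligning the semantics of a video and a text query so that they can be associated for text-video retrieval, and for identifying the relevant parts of a video in order to answer a question in video QA or to write a caption in video captioning. In addition, interaction between motion and language features can mitigate the extraction of noisy information from videos (Ding et al., 2022). Lin et al. (2022) also find that interaction between audio and language features can compactly capture information about objects, actions, and complex events, compensating for sparsely sampled video frames.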

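Cross-domain adaptation. Given the infinitude of online videos, it is unrealistic to assume that a video-language understanding model will encounter test scenarios drawn from the same distribution as its training data. Moreover, with the emergence of LLM-augmented models that can handle a variety of video-language understanding tasks (Li et al., 2023a, d), it is now more advisable to train one model that adapts effectively to multiple tasks and domains than to obtain a model specialized for a particular understanding task. Furthermore, since a video can be viewed as a sequence of images, training a model on video-text data is computationally more expensive than training on image-text data. Combined with the large scale of recent video-language understanding models (Jiang et al., 2022a; Yang et al., 2022a), this calls for efficient fine-tuning strategies that reduce the computational cost of fine-tuning these models.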

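Data preparation. Although Lei et al. (2021c) use only image-text data to train models for video-language understanding tasks, video-text data are in essence crucial to the effectiveness of these models. In particular, compared with a static image, a video offers richer information, with diverse spatial semantics accompanied by consistent temporal dynamics (Zhuang et al., 2023). Accordingly, Cheng et al. (2023) find that training on videos outperforms training on images, while training jointly on both kinds of data achieves the best performance. As further evidence, Yuan et al. (2023) show that video-pretrained models outperform image-pretrained models when classifying motion-rich videos. However, video-text data incur a higher storage cost than image-text data, since a video comprises multiple images as its frames. Moreover, annotating a video is more time-consuming and labor-intensive than annotating an image (Xing et al., 2023). As a result, video-language understanding models have been limited by the small size of clean paired video-text corpora, whereas image-text datasets reach billion scale (Zhao et al., 2023). Various efforts (Zhao et al., 2023; Xing et al., 2023) have been devoted to designing effective and economical methods to curate and label video-text data.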

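Addressing the challenges. The challenges identified above span three key perspectives of video-language understanding: model architecture, model training, and data preparation. In general, these components should work in synergy. Specifically, the model architecture should be designed to capture video-language interactions effectively. Model training, in turn, should be tailored so that the architecture, together with the video-language interactions it captures, adapts to the target domain. Finally, data preparation plays a crucial role in shaping model training, which in turn strongly influences the development of effective model architectures.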

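4 Model Architectures for Video-Language Understanding

Addressing the challenge of intra-modal and cross-modal interaction is the key goal when designing model architectures for video-language understanding, which can be divided into pre-Transformer and Transformer-based architectures. Large language models (LLMs) have shown remarkable zero-shot ability across many tasks, which has led to the design of LLM-augmented architectures that can adapt across domains to a variety of video-language understanding tasks.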

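4.1 Pre-Transformer Architectures

Pre-Transformer architectures typically comprise unimodal video and language encoders that implement intra-modal interaction, and a cross-modal encoder that implements cross-modal interaction.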

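Unimodal encoders. A video encoder typically encodes a raw video by extracting frame appearance features and clip motion features as the spatial and temporal representations, respectively. Since each video frame can be treated as a single image, various works have used CNNs to extract spatial representations (Simonyan and Zisserman, 2014; Feichtenhofer et al., 2016; Zhao et al., 2017b). For temporal representations, the sequential nature of RNNs has made them a popular choice in pre-Transformer architectures (Yang et al., 2017; Zhao et al., 2017a; Venugopalan et al., 2015; Wang et al., 2019a). In addition, 3D CNNs, which insert an additional temporal channel into 2D CNNs, have also proven effective at extracting spatio-temporal representations (Tran et al., 2017; Carreira and Zisserman, 2017). Beyond CNNs and RNNs, Chen et al. (2018), Gay et al. (2019), and Wei et al. (2017) construct graphs to integrate intra-modal relations among video entities, such as video clips or visual objects. These graph-structured works emphasize the reasoning ability of the model architecture.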

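A common framework for the language encoder is to extract pre-trained word embeddings, such as word2vec (Kaufman et al., 2016; Yu et al., 2017) or GloVe (Torabi et al., 2016; Kiros et al., 2014), and then pass them through an RNN-based module such as an LSTM or GRU. This framework stems from the language model architectures of the pre-Transformer era.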

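Cross-modal encoders. Gao et al. (2017) and Zeng et al. (2017) apply element-wise multiplication to fuse global video and question representations for video question answering, which shows the merit of simple operations for video-language fusion. Attention has also been used to model video-language relations, either to identify the salient parts of the video and the language sentence (Yuan et al., 2019) or to refine the video representation according to the language question (Xu et al., 2017). Before the advent of the Transformer, video-language research also combined attention with a range of techniques, including hierarchical learning (Baraldi et al., 2017), memory networks (Fan et al., 2019), and graph networks (Xiao et al., 2022a; Wei et al., 2023).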

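4.2 Transformer-Based Architectures

Built on the self-attention mechanism, which exhaustively correlates every pair of input tokens, Transformer-based architectures can capture long-term dependencies and learn from web-scale data. They have shown remarkable performance on many video-language tasks. As with pre-Transformer architectures, Transformer-based frameworks comprise unimodal encoders and a cross-modal encoder, which model intra-modal and cross-modal interaction, respectively. For the unimodal encoders, several studies find that a vision Transformer for video encoding and a BERT encoder for language encoding outperform RNN-based and CNN-based encoders (Fu et al., 2021; Bain et al., 2021; Seo et al., 2022). We next summarize the basic types of Transformer-based architectures, which are illustrated in Figure 4.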

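Shared Transformer. Inspired by the success of Transformers in language modeling (Devlin et al., 2018), Akbari et al. (2021) and Wang et al. (2023a) build shared Transformer encoders for video-language understanding. Their encoder architectures take the concatenation of visual patches and language tokens as input and jointly compute the interactions among them in a BERT-style manner. Akbari et al. (2021) additionally introduce a modality embedding, which takes one of three values to indicate the three input modalities (video, audio, and text).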

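Stacked Transformer. Li et al. (2020) point out that a shared Transformer encoder is weak at modeling the temporal relations between video and text. To address this, they introduce a stacked Transformer architecture, a hierarchical stack in which unimodal encoders first encode the video and language inputs separately and a cross-modal Transformer then computes the video-language interactions. A large body of video-language understanding work follows this design, stacking a cross-modal Transformer-based encoder on top of unimodal encoders (Fu et al., 2023; Li et al., 2023b; Lei et al., 2021c; Wei et al., 2022; Luo et al., 2022; Nie et al., 2022; Wei et al., 2024). For video captioning, Seo et al. (2022) and Luo et al. (2020) further insert a causal Transformer-based decoder that generates language tokens from the encoded cross-modal representations.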

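Dual Transformer. Dual Transformer architectures have been a popular choice for text-video retrieval (Luo et al., 2022; Bain et al., 2021, 2022; Lin et al., 2022; Xue et al., 2022b). These architectures use two Transformer encoders to encode the video and the language input separately, producing one global representation per modality, and then apply a simple operation such as cosine similarity to compute the cross-modal interaction. Because the two modalities are encoded separately, these models reduce the computational cost of computing pairwise interactions between every video and language input. They are therefore not only efficient but also effective for text-video retrieval.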
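As a minimal sketch of the retrieval step described above, the following assumes the two encoders have already produced one global embedding per video and per query, and ranks videos by cosine similarity. The function names and the toy three-dimensional embeddings are hypothetical and serve only as illustration; they are not taken from any cited model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for degenerate zero vectors
    return dot / (norm_u * norm_v)


def rank_videos(text_embedding, video_embeddings):
    """Return video indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(text_embedding, v) for v in video_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical global embeddings, assumed to be precomputed by the video
# encoder (one vector per video) and the language encoder (one per query).
video_bank = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [0.9, 0.1, 0.0]
ranking = rank_videos(query, video_bank)
```

Because the video embeddings do not depend on the query, they can be computed once and cached, which is where the efficiency of the separate encoding scheme comes from.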

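4.3 LLM-Augmented Architectures

Large language models (LLMs) have achieved remarkable results in handling multiple NLP tasks simultaneously. Recent efforts apply LLMs to video-language understanding in order to extend their cross-domain adaptation ability to the video-language setting (Chen et al., 2023; Li et al., 2023a). These efforts fall into two approaches. The first uses the LLM as a controller and video-language understanding models as auxiliary tools; the controller invokes a specific tool according to the input language instruction. The second uses the LLM as the output generator and seeks to align a pre-trained video model with the LLM. Since the second approach dominates recent work on video-language understanding (Chen et al., 2023; Li et al., 2023a, d; Zhang et al., 2023b; Maaz et al., 2023), we review it below: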

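LLM as output generator. This framework consists of a visual encoder, a semantic translator, and an LLM that serves as the output generator. For the visual encoder, LLM-augmented architectures typically reuse the vision Transformers and CNN models of pre-Transformer and Transformer-based architectures (Chen et al., 2023). Since the LLM has never seen videos during training, a semantic translator is needed to translate the visual semantics of a video into the semantics of the LLM. For the translator, Video-LLaMA (Zhang et al., 2023b) and VideoChat (Li et al., 2023a) implement a Q-Former, a Transformer-based module in which a sequence of query embeddings interacts with the visual features of the video to extract informative video content. In contrast to the Q-Former, VideoLLM (Chen et al., 2023), Video-ChatGPT (Maaz et al., 2023), and LLaMA-Vid (Li et al., 2023d) find that a simple linear projection, which maps the visual features to the input dimension of the LLM, can achieve effective performance. The vision-based query embeddings or projected visual features are then combined with the language instruction and fed into the LLM to generate the final output.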
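The linear-projection variant of the semantic translator can be sketched as follows. This is a simplified illustration under stated assumptions: the dimensions, the random initialization, and the choice to prepend the projected visual tokens to the instruction embeddings are hypothetical, since the text above only states that the two are combined before being fed to the LLM.

```python
import random


def linear_project(features, weight, bias):
    """Map each visual feature (length d_vis) to the LLM input width (d_llm).

    `weight` has shape (d_llm, d_vis) and `bias` has length d_llm.
    """
    projected = []
    for feature in features:
        row = [sum(w * x for w, x in zip(w_row, feature)) + b
               for w_row, b in zip(weight, bias)]
        projected.append(row)
    return projected


def build_llm_input(visual_tokens, instruction_embeddings):
    """Combine visual tokens and instruction embeddings into one sequence.

    Prepending the visual tokens is an assumption made for this sketch.
    """
    return visual_tokens + instruction_embeddings


rng = random.Random(0)
d_vis, d_llm, num_frames, num_text = 4, 6, 3, 5  # hypothetical sizes

# Stand-ins for visual-encoder outputs and embedded instruction tokens.
frame_features = [[rng.gauss(0, 1) for _ in range(d_vis)] for _ in range(num_frames)]
instruction = [[rng.gauss(0, 1) for _ in range(d_llm)] for _ in range(num_text)]

# The projection is the only component trained in this sketch.
weight = [[rng.gauss(0, 0.02) for _ in range(d_vis)] for _ in range(d_llm)]
bias = [0.0] * d_llm

visual_tokens = linear_project(frame_features, weight, bias)
llm_input = build_llm_input(visual_tokens, instruction)
```

After projection every element of `llm_input` has width `d_llm`, so the LLM can consume visual and language tokens as a single sequence.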

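4.4 Analysis of Architectures

In Figure 3, we show a timeline of video-language understanding methods, categorized by our architecture taxonomy and their associated downstream tasks. The evolution of pre-Transformer models is consistent with our hierarchy of video-language understanding levels, i.e. models for video captioning typically appeared after models for text-video retrieval, followed by the development of video question answering models. Owing to their strong capacity, Transformer-based models that handle multiple tasks were introduced concurrently with task-specific Transformer frameworks. More recently, large language models (LLMs) have attracted attention for their remarkable in-context learning ability, which lets them handle various tasks without fine-tuning. New LLM-augmented architectures have consequently emerged to exploit this ability for multiple understanding tasks.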

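Among Transformer-based architectures, the dual Transformer stands out as the most effective approach for text-video retrieval, since it adeptly associates the global semantics of the video and language modalities. The stacked Transformer architecture, on the other hand, excels at facilitating intra-modal and cross-modal interaction through its dedicated unimodal and cross-modal encoders. These encoders are particularly effective at relating video content to the question in video question answering. For video captioning, moreover, the cross-modal encoder plays a crucial role in translating video content into a textual description. Recently, LLM-augmented models have started to surpass Transformer-based architectures in video question answering, which suggests their potential to become the next frontier of video-language understanding research. Full performance details for text-video retrieval, video captioning, and video question answering are given in Tables 1, 2, and 3, respectively.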

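5 Model Training for Video-Language Understanding

Model training aims to address the cross-domain adaptation of video-language understanding models. To this end, pre-training strategies are designed to acquire world knowledge that generalizes across multiple scenarios, and are followed by task-specific fine-tuning that improves performance on a particular downstream task.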

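5.1 Pre-training for Video-Language Understanding

This section summarizes the pre-training strategies for video-language understanding models in three categories: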

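Language-based pre-training. The most popular language-based pre-training task is masked language modeling (MLM) (Lei et al., 2021c; Sun et al., 2019; Cheng et al., 2023), which randomly masks a portion of the words in the language input and trains the model to predict the masked words from the unmasked words and the video entities. UniVL (Luo et al., 2020) and VICTOR (Lei et al., 2021a) find that masking the entire language modality, rather than a portion of the words, benefits the video captioning task. MLM can be combined with other language-based pre-training tasks, such as masked sentence order modeling, which aims to classify the original order of randomly shuffled sentences (Lei et al., 2021a).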
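The input corruption step of MLM can be sketched as below. This covers only the masking, not the model or its loss. The `[MASK]` symbol, the default masking probability of 0.15, and the function name are assumptions borrowed from common BERT-style practice rather than details stated above; setting the probability to 1.0 gives the "mask the entire language modality" variant.

```python
import random

MASK = "[MASK]"  # placeholder symbol, assumed for illustration


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace tokens with MASK.

    Returns the corrupted sequence and a dict mapping each masked position
    to the original token, which serves as the prediction target.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets


caption = ["a", "person", "is", "slicing", "an", "onion"]

# Partial masking: the model would predict targets from the remaining
# words together with the video entities.
partial, partial_targets = mask_tokens(caption, mask_prob=0.5, seed=0)

# Whole-modality masking: every word becomes a prediction target.
full, full_targets = mask_tokens(caption, mask_prob=1.0)
```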

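Video-based pre-training. Video-based pre-training tasks help video-language models capture contextual information in the video modality. As the counterpart of MLM, masked video modeling (MVM) trains the model to predict the masked portion of video entities from the unmasked entities and the language words. The continuous nature of videos leads to different choices of video entity, such as frame patches (Li et al., 2020) or video frames (Fu et al., 2021). In terms of training objective, Li et al. (2020) use an L2 regression loss to train the model to predict the pre-trained features of masked video frames extracted by ResNet and SlowFast models, whereas Fu et al. (2021) use a cross-entropy loss to train the model to predict masked visual tokens, which are quantized from visual frame patches by a variational autoencoder.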

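Video-text pre-training. Video-text pre-training is essential for a model to capture video-language relations. Xue et al. (2022b), Gao et al. (2021), and Bain et al. (2021) use a video-text contrastive learning framework to produce close representations for semantically similar video and language inputs. These works focus on creating a joint semantic space that aligns the separate representations of video and language. Tang et al. (2021), Fu et al. (2021), and Li et al. (2023b) instead let the video and text representations interact and use a single token to represent the cross-modal input, which is then forwarded to predict whether the video-text pair matches. In both pre-training frameworks, image-text data are used in addition to video-text data, with each image treated as a single-frame video.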
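One common way to instantiate the video-text contrastive objective is a symmetric InfoNCE loss over a batch, sketched below. This is an illustrative assumption and not necessarily the exact loss used by the cited works. The input is a square similarity matrix in which entry (i, j) scores video i against text j and matched pairs lie on the diagonal; the temperature of 0.07 is likewise an assumed default.

```python
import math


def _mean_diagonal_nll(sim, temperature):
    """Mean of -log softmax(row_i / temperature)[i] over all rows i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(sim)


def video_text_contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE: average of video-to-text and text-to-video terms."""
    transposed = [list(column) for column in zip(*sim)]
    video_to_text = _mean_diagonal_nll(sim, temperature)
    text_to_video = _mean_diagonal_nll(transposed, temperature)
    return 0.5 * (video_to_text + text_to_video)


# Toy similarity matrices for a batch of two video-text pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]     # matched pairs score highest
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # mismatched pairs score highest
loss_aligned = video_text_contrastive_loss(aligned)
loss_misaligned = video_text_contrastive_loss(misaligned)
```

The loss is low when matched pairs dominate their row and column, which is what pulls semantically similar video and language representations together in the joint space.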

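Contrastive learning has shown encouraging results (Lin et al., 2022; Gao et al., 2021; Xue et al., 2022b; Nguyen et al., 2022; Nguyen and Luu, 2021; Nguyen et al., 2024c, a, 2023a; Wu et al., 2023a, 2024, 2022). MLM helps enhance video QA because the task resembles MLM: it predicts language words from a video-language pair, where the question is the language input. Compared with these pre-training strategies, MVM does bring a performance gain for video-language understanding, but the gain is not significant. For more details on pre-training, see Cheng et al. (2023).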

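5.2 Fine-tuning for Video-Language Understanding

Pre-Transformer architectures are usually trained from scratch with task-specific fine-tuning, since these models lack the parameter capacity to learn generalizable features through pre-training. Task-specific fine-tuning is also widely adopted by Transformer-based architectures to improve performance on specific downstream tasks. In addition, LLM-augmented architectures use instruction tuning, a variant of fine-tuning, to adapt from the visual and audio spaces to the language space of the LLM.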

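Fine-tuning strategies. Typically, all model parameters are updated during fine-tuning (Gao et al., 2017; Xu et al., 2019; Anne Hendricks et al., 2017; Nguyen et al., 2023b; Wu et al., 2023b). However, when computational resources or training data are limited, only adaptation layers are fine-tuned, such as low-rank adapters (Pan et al., 2022; Yang et al., 2022a; Nguyen et al., 2024b) or learnable prompt vectors (Ju et al., 2022), in order to reduce training cost or prevent overfitting. These concerns also apply to the LLM-augmented architectures discussed in Section 4.3: because LLMs have billions of parameters, full fine-tuning would be prohibitively expensive. For such models, Zhang et al. (2023b) and Li et al. (2023d) design a two-stage instruction tuning strategy that fine-tunes only the semantic translator. The first stage trains the model to generate a text description from the combination of a video and a language instruction, in order to align the visual representations extracted by the visual encoder with the language space of the LLM. The second stage is usually carried out on a small set of video-text pairs manually collected by the authors, to further adapt the output features of the translator to the target domain.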
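To illustrate why fine-tuning only adaptation layers reduces cost, the sketch below wraps a frozen linear layer with a trainable low-rank update in the style of LoRA. It is a generic illustration under stated assumptions, not the specific adapter design of the cited works: the class name, the zero initialization of the up-projection, and the `scale` factor are hypothetical choices.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


class LowRankAdapterLinear:
    """Frozen weight W (d_out x d_in) plus a trainable update B @ A of rank r."""

    def __init__(self, frozen_weight, rank, scale=1.0):
        d_out, d_in = len(frozen_weight), len(frozen_weight[0])
        self.frozen_weight = frozen_weight               # never updated
        self.A = [[0.01] * d_in for _ in range(rank)]    # trainable down-projection
        self.B = [[0.0] * rank for _ in range(d_out)]    # trainable up-projection, zero init
        self.scale = scale

    def forward(self, x):
        base = matvec(self.frozen_weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)

    def num_frozen(self):
        return sum(len(r) for r in self.frozen_weight)


# Hypothetical 4x4 pre-trained weight (identity, for readability).
pretrained = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LowRankAdapterLinear(pretrained, rank=1)
output = layer.forward([1.0, 2.0, 3.0, 4.0])
```

With the up-projection initialized to zero, the adapted layer initially reproduces the pre-trained layer exactly, and only `2 * rank * d` parameters are trained instead of `d * d`.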

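6 Data Perspective of Video-Language Understanding

This section analyzes data preparation methods for video-language understanding models; the datasets are described in detail in Appendix B.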

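6.1 Data Curation

Manual collection. To curate video-language data, several studies search for publicly available online videos, which exhibit highly diverse content. Video-language datasets built from online videos are mainly used to pre-train models to learn generalizable knowledge, e.g. HowTo100M (Miech et al., 2019) and YT-Temporal-180M (Zellers et al., 2021), although they can also be used for fine-tuning, e.g. MSRVTT (Xu et al., 2016) and YouCook2 (Zhou et al., 2018a). To satisfy particular requirements, videos other than online videos can be inherited from existing datasets: Xiao et al. (2021) use 6,000 videos from the VidOR dataset and Li et al. (2022a) inherit 546,882 videos from Kinetics-700, because these videos depict daily life and real-world scenes, respectively. Beyond existing datasets and online videos, videos can also be recorded by human annotators to allow quality control (Goyal et al., 2017; Damen et al., 2022).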

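Data augmentation. Instead of manually collecting videos from external sources, Xing et al. (2023) and Jiang et al. (2022c) explore data augmentation techniques designed specifically for videos. In particular, their TubeTokenMix mixes two videos with a mixing coefficient defined over the temporal dimension, while their temporal shift randomly shifts video frame features forward or backward along the temporal dimension. These techniques outperform standard augmentation methods for image data, such as CutMix (Yun et al., 2019), Mixup (Zhang et al., 2017), and PixMix (Hendrycks et al., 2022).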
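The two ideas above, shifting frame features in time and mixing two videos with a coefficient that varies over time, can be illustrated with the simplified sketch below. It is one possible reading and not a reimplementation of TubeTokenMix or of the cited temporal shift: zero-padding the vacated time steps and blending whole frame features with one coefficient per step are assumptions made here.

```python
def temporal_shift(frames, offset):
    """Shift per-frame features by `offset` steps along time.

    A positive offset moves features forward (later in time); a negative
    offset moves them backward. Vacated steps are zero-padded (assumption).
    """
    num_frames = len(frames)
    zero = [0.0] * len(frames[0])
    shifted = []
    for t in range(num_frames):
        source = t - offset
        if 0 <= source < num_frames:
            shifted.append(list(frames[source]))
        else:
            shifted.append(list(zero))
    return shifted


def temporal_mix(frames_a, frames_b, coefficient_per_step):
    """Blend two equal-length videos with one mixing coefficient per time step."""
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(frame_a, frame_b)]
        for frame_a, frame_b, lam in zip(frames_a, frames_b, coefficient_per_step)
    ]


# Toy one-dimensional frame features for two three-frame videos.
video_a = [[1.0], [2.0], [3.0]]
video_b = [[10.0], [20.0], [30.0]]

forward = temporal_shift(video_a, 1)
backward = temporal_shift(video_a, -1)
# Take the first frame entirely from video_a and the rest from video_b.
mixed = temporal_mix(video_a, video_b, [1.0, 0.0, 0.0])
```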

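6.2 Label Annotation

Manual annotation. Because human annotators can provide high-quality labels, several works (Li et al., 2022a; Lei et al., 2021b; Xiao et al., 2021) employ them. This approach is expensive, however, especially for video data. For example, annotating the QVHighlights dataset (Lei et al., 2021b) costs approximately $16,000 for 10,000 videos and takes 3 months to complete. Similarly, NExT-QA (Xiao et al., 2021) requires 100 undergraduate students and 1 year to annotate only 5,000 videos.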

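Automatic generation. Directly taking the language transcripts of YouTube videos as text labels reduces the annotation cost (Miech et al., 2019; Xue et al., 2022a; Zellers et al., 2021). However, these labels have been shown to be grammatically incorrect and temporally misaligned with the video content (Tang et al., 2021). Inspired by the success of large language models, Zhao et al. (2023) train a system consisting of a TimeSformer-L visual encoder and a GPT-2 XL decoder to write dense captions for videos. In addition, Li et al. (2023a) use GPT-4 to generate summaries for movie synopses.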

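7 Future Directions

Fine-grained understanding. Existing methods excel at coarse-grained video-language understanding: they can effectively answer "what is" questions or recognize global events without apparent difficulty (Xiao et al., 2021). However, confining understanding to this coarse level may hinder the practical use of existing systems. In real-world scenarios, a user may require the precise timestamp and location of an object in a video (Jiang et al., 2022b), or ask an AI agent to predict plausible alternative events, which is a common need in predictive analytics (Xiao et al., 2021; Li et al., 2022a). These tasks demand advanced understanding of, and reasoning about, the causal and temporal relations present in a video. At present, models show limited visual-language ability for temporal reasoning, which classifies them as image-sequence-language models rather than video-language models (Kesen et al., 2023). Future research in this direction therefore deserves more attention and exploration.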

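Long-form video-language understanding. Current understanding systems have shown remarkable performance on short video clips that last a few seconds. They tend to struggle, however, when moved to long-form videos that last several minutes or hours. To improve the applicability of these systems, it is essential to strengthen their ability to understand long-form videos. Current approaches mainly either reduce the computational cost with architectures that are more efficient than Transformer-based ones, such as state space models (Yang et al., 2024; Li et al., 2024), which can be viewed as linear RNNs with specially designed fixed weights, or compensate for sparsely extracted video frames with additional information (Lin et al., 2022). Overall, how to model long-form videos effectively and adapt them to the joint context with language deserves more attention.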

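Trustworthiness of video-language understanding models. Although modern video-language understanding systems have shown remarkable performance, their black-box nature undermines our trust in deploying them. Specifically, we still do not fully understand which part of a video a video QA model looks at to answer a question (Li et al., 2022b), or how video and language semantic information flows into the common representation space of a video retrieval model (Jia et al., 2022). Furthermore, the sensitivity of video-language understanding models to adversarial noise, as well as their hallucination, remain open problems. Future trustworthiness benchmarks geared toward practical systems, such as those of Xiao et al. (2023a) and Wang et al. (2021a), are important for video-language understanding.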

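8 Conclusion

In this paper, we survey the broad research field of video-language understanding. In particular, we categorize the relevant video-language understanding tasks and discuss meaningful insights from the model architecture, model training, and data perspectives. We analyze each perspective in depth and conclude with promising future directions. We hope our survey will foster further research toward effective AI systems that can comprehensively understand the dynamic visual world and interact meaningfully with humans.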

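9 Limitations

Although we have made our best effort to analyze the video-language understanding literature comprehensively, we may not have covered all tasks, model architectures, model training methods, and data perspectives. We therefore complement this survey with a repository at https://github.com/nguyentthong/video-language-understanding. The repository contains the latest papers, datasets, and their open-source implementations. We will update it regularly to track the progress of the latest research.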

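10 Acknowledgement

This research/project is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG2-TC-2022-005).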

References

Abdar et al. (2023) Moloud Abdar, Meenakshi Kollati, Swaraja Kuraparthi, Farhad Pourpanah, Daniel McDuff, Mohammad Ghavamzadeh, Shuicheng Yan, Abduallah Mohamed, Abbas Khosravi, Erik Cambria, et al. 2023. A review of deep learning for video captioning. arXiv preprint arXiv:2304.11431.
Abu-El-Haija et al. (2016) Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. YouTube-8M: A Large-Scale Video Classification Benchmark. arXiv preprint arXiv:1609.08675.
Akbari et al. (2021) Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. 2021. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems, 34:24206–24221.
Anne Hendricks et al. (2017) Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, pages 5803–5812.
Arnab et al. (2021) Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. 2021. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6836–6846.
Bain et al. (2021) Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738.
Bain et al. (2022) Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. 2022. A clip-hitchhiker’s guide to long video retrieval. arXiv preprint arXiv:2205.08508.
Bansal et al. (2022) Siddhant Bansal, Chetan Arora, and CV Jawahar. 2022. My view is the best view: Procedure learning from egocentric videos. In European Conference on Computer Vision, pages 657–675. Springer.
Baraldi et al. (2017) Lorenzo Baraldi, Costantino Grana, and Rita Cucchiara. 2017. Hierarchical boundary-aware neural encoder for video captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1657–1666.
Carreira et al. (2018) Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. 2018. A short note about kinetics-600. arXiv preprint arXiv:1808.01340.
Carreira et al. (2019) Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. 2019. A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987.
Carreira and Zisserman (2017) Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308.
Castro et al. (2022a) Santiago Castro, Naihao Deng, Pingxuan Huang, Mihai Burzo, and Rada Mihalcea. 2022a. In-the-wild video question answering. In Proceedings of the 29th International Conference on Computational Linguistics, pages 5613–5635.
Castro et al. (2022b) Santiago Castro, Ruoyao Wang, Pingxuan Huang, Ian Stewart, Oana Ignat, Nan Liu, Jonathan Stroud, and Rada Mihalcea. 2022b. Fiber: Fill-in-the-blanks as a challenging video understanding evaluation framework. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2925–2940.
Chen and Dolan (2011) David Chen and William B Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pages 190–200.
Chen et al. (2023) Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, et al. 2023. Videollm: Modeling video sequence with large language models. arXiv preprint arXiv:2305.13292.
Chen et al. (2018) Yuting Chen, Joseph Wang, Yannan Bai, Gregory Castañón, and Venkatesh Saligrama. 2018. Probabilistic semantic retrieval for surveillance videos with activity graphs. IEEE Transactions on Multimedia, 21(3):704–716.
Cheng et al. (2023) Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, and Gedas Bertasius. 2023. Vindlu: A recipe for effective video-and-language pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10739–10750.
Damen et al. (2022) Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. 2022. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision, pages 1–23.
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Ding et al. (2022) Zihan Ding, Tianrui Hui, Junshi Huang, Xiaoming Wei, Jizhong Han, and Si Liu. 2022. Language-bridged spatial-temporal interaction for referring video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4964–4973.
Dong et al. (2022) Jianfeng Dong, Yabing Wang, Xianke Chen, Xiaoye Qu, Xirong Li, Yuan He, and Xun Wang. 2022. Reading-strategy inspired visual representation learning for text-to-video retrieval. IEEE transactions on circuits and systems for video technology, 32(8):5680–5694.
Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Fan et al. (2019) Chenyou Fan, Xiaofan Zhang, Shu Zhang, et al. 2019. Heterogeneous memory enhanced multimodal attention model for video question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1999–2007.
Fang et al. (2023) Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. 2023. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369.
Fei-Fei and Krishna (2022) Li Fei-Fei and Ranjay Krishna. 2022. Searching for computer vision north stars. Journal of the American Academy of Arts & Sciences, page 85.
Feichtenhofer et al. (2016) Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. 2016. Convolutional Two-Stream Network Fusion for Video Action Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Fu et al. (2021) Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. 2021. Violet: End-to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681.
Fu et al. (2023) Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. 2023. An empirical study of end-to-end video-language transformers with masked visual modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22898–22909.
Gao et al. (2017) Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. 2017. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on computer vision, pages 5267–5275.
Gao et al. (2021) Zijian Gao, Jingyu Liu, Weiqi Sun, Sheng Chen, Dedan Chang, and Lili Zhao. 2021. Clip2tv: Align, match and distill for video-text retrieval. arXiv preprint arXiv:2111.05610.
Gay et al. (2019) Paul Gay, James Stuart, and Alessio Del Bue. 2019. Visual graphs from motion (vgfm): Scene understanding with object geometry reasoning. In Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part III 14, pages 330–346. Springer.
Goyal et al. (2017) Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. 2017. The "Something Something" Video Database for Learning and Evaluating Visual Common Sense. In The IEEE International Conference on Computer Vision (ICCV).
Guo et al. (2021) Zhicheng Guo, Jiaxuan Zhao, Licheng Jiao, Xu Liu, and Lingling Li. 2021. Multi-scale progressive attention network for video question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 973–978.
Han et al. (2023) Ning Han, Yawen Zeng, Chuhao Shi, Guangyi Xiao, Hao Chen, and Jingjing Chen. 2023. Bic-net: Learning efficient spatio-temporal relation for text-video retrieval. ACM Transactions on Multimedia Computing, Communications and Applications, 20(3):1–21.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
He et al. (2023) Xingjian He, Sihan Chen, Fan Ma, Zhicheng Huang, Xiaojie Jin, Zikang Liu, Dongmei Fu, Yi Yang, Jing Liu, and Jiashi Feng. 2023. Vlab: Enhancing video language pre-training by feature adapting and blending. arXiv preprint arXiv:2305.13167.
Hendrycks et al. (2022) Dan Hendrycks, Andy Zou, Mantas Mazeika, Leonard Tang, Bo Li, Dawn Song, and Jacob Steinhardt. 2022. Pixmix: Dreamlike pictures comprehensively improve safety measures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16783–16792.
Huang et al. (2020) Gabriel Huang, Bo Pang, Zhenhai Zhu, Clara Rivera, and Radu Soricut. 2020. Multimodal pretraining for dense video captioning. arXiv preprint arXiv:2011.11760.
Hwang et al. (2023) Minyoung Hwang, Jaeyeon Jeong, Minsoo Kim, Yoonseon Oh, and Songhwai Oh. 2023. Meta-explore: Exploratory hierarchical vision-and-language navigation using scene object spectrum grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6683–6693.
Jang et al. (2019) Yunseok Jang, Yale Song, Chris Dongjoo Kim, Youngjae Yu, Youngjin Kim, and Gunhee Kim. 2019. Video question answering with spatio-temporal reasoning. International Journal of Computer Vision, 127(10):1385–1412.
Jang et al. (2017) Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. 2017. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2758–2766.
Jia et al. (2022) Mohan Jia, Zhongjian Dai, Yaping Dai, and Zhiyang Jia. 2022. An adversarial video moment retrieval algorithm. In 2022 41st Chinese Control Conference (CCC), pages 6689–6694. IEEE.
Jiang et al. (2022a) Haojun Jiang, Jianke Zhang, Rui Huang, Chunjiang Ge, Zanlin Ni, Jiwen Lu, Jie Zhou, Shiji Song, and Gao Huang. 2022a. Cross-modal adapter for text-video retrieval. arXiv preprint arXiv:2211.09623.
Jiang et al. (2022b) Ji Jiang, Meng Cao, Tengtao Song, and Yuexian Zou. 2022b. Video referring expression comprehension via transformer with content-aware query. arXiv preprint arXiv:2210.02953.
Jiang et al. (2020) Jianwen Jiang, Ziqiang Chen, Haojie Lin, Xibin Zhao, and Yue Gao. 2020. Divide and conquer: Question-guided spatio-temporal contextual attention for video question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11101–11108.
Jiang and Han (2020) Pin Jiang and Yahong Han. 2020. Reasoning with heterogeneous graph alignment for video question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11109–11116.
Jiang et al. (2022c) Xun Jiang, Xing Xu, Jingran Zhang, Fumin Shen, Zuo Cao, and Heng Tao Shen. 2022c. Semi-supervised video paragraph grounding with contrastive encoder. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2466–2475.
Jin et al. (2023) Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Xiangyang Ji, Chang Liu, Li Yuan, and Jie Chen. 2023. Diffusionret: Generative text-video retrieval with diffusion model. arXiv preprint arXiv:2303.09867.
Ju et al. (2022) Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. 2022. Prompting visual-language models for efficient video understanding. In European Conference on Computer Vision, pages 105–124. Springer.
Kaufman et al. (2016) Dotan Kaufman, Gil Levi, Tal Hassner, and Lior Wolf. 2016. Temporal tessellation for video annotation and summarization. arXiv preprint arXiv:1612.06950.
Kay et al. (2017) Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. 2017. The Kinetics Human Action Video Dataset. arXiv preprint arXiv:1705.06950.
Kesen et al. (2023) Ilker Kesen, Andrea Pedrotti, Mustafa Dogan, Michele Cafagna, Emre Can Acikgoz, Letitia Parcalabescu, Iacer Calixto, Anette Frank, Albert Gatt, Aykut Erdem, et al. 2023. Vilma: A zero-shot benchmark for linguistic and temporal grounding in video-language models. arXiv preprint arXiv:2311.07022.
Kiros et al. (2014) Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539.
Kuehne et al. (2011) Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. 2011. HMDB: A Large Video Database for Human Motion Recognition. In The IEEE International Conference on Computer Vision (ICCV).
Lazarus (1973) Arnold A Lazarus. 1973. Multimodal behavior therapy: Treating the “basic id”. The Journal of nervous and mental disease, 156(6):404–411.
Le et al. (2020) Thao Minh Le, Vuong Le, Svetha Venkatesh, and Truyen Tran. 2020. Hierarchical conditional relation networks for video question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9972–9981.
Lei et al. (2021a) Chenyi Lei, Shixian Luo, Yong Liu, Wanggui He, Jiamang Wang, Guoxin Wang, Haihong Tang, Chunyan Miao, and Houqiang Li. 2021a. Understanding chinese video and language via contrastive multimodal pre-training. In Proceedings of the 29th ACM International Conference on Multimedia, pages 2567–2576.
Lei et al. (2021b) Jie Lei, Tamara L Berg, and Mohit Bansal. 2021b. Detecting moments and highlights in videos via natural language queries. Advances in Neural Information Processing Systems, 34:11846–11858.
Lei et al. (2021c) Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L Berg, Mohit Bansal, and Jingjing Liu. 2021c. Less is more: Clipbert for video-and-language learning via sparse sampling. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7331–7341.
Lei et al. (2018) Jie Lei, Licheng Yu, Mohit Bansal, and Tamara Berg. 2018. Tvqa: Localized, compositional video question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1369–1379.
Lei et al. (2020) Jie Lei, Licheng Yu, Tamara L Berg, and Mohit Bansal. 2020. Tvr: A large-scale dataset for video-subtitle moment retrieval. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16, pages 447–463. Springer.
Li et al. (2022a) Jiangtong Li, Li Niu, and Liqing Zhang. 2022a. From representation to reasoning: Towards both evidence and commonsense reasoning for video question-answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21273–21282.
Li et al. (2023a) KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. 2023a. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355.
Li et al. (2024) Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. 2024. Videomamba: State space model for efficient video understanding. arXiv preprint arXiv:2403.06977.
Li et al. (2020) Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. 2020. Hero: Hierarchical encoder for video+ language omni-representation pre-training. arXiv preprint arXiv:2005.00200.
Li et al. (2023b) Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, and Lijuan Wang. 2023b. Lavender: Unifying video-language understanding as masked language modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23119–23129.
Li et al. (2023c) Pandeng Li, Chen-Wei Xie, Hongtao Xie, Liming Zhao, Lei Zhang, Yun Zheng, Deli Zhao, and Yongdong Zhang. 2023c. Momentdiff: Generative video moment retrieval from random to real. arXiv preprint arXiv:2307.02869.
Li et al. (2019) Xiangpeng Li, Zhilong Zhou, Lijiang Chen, and Lianli Gao. 2019. Residual attention-based lstm for video captioning. World Wide Web, 22:621–636.
Li et al. (2023d) Yanwei Li, Chengyao Wang, and Jiaya Jia. 2023d. Llama-vid: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043.
Li et al. (2022b) Yicong Li, Xiang Wang, Junbin Xiao, and Tat-Seng Chua. 2022b. Equivariant and invariant grounding for video question answering. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4714–4722.
Li et al. (2023e) Yicong Li, Junbin Xiao, Chun Feng, Xiang Wang, and Tat-Seng Chua. 2023e. Discovering spatio-temporal rationales for video question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13869–13878.
Lin et al. (2020) Ke Lin, Zhuoxin Gan, and Liwei Wang. 2020. Semi-supervised learning for video captioning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1096–1106.
Lin et al. (2023) Kunyang Lin, Peihao Chen, Diwei Huang, Thomas H Li, Mingkui Tan, and Chuang Gan. 2023. Learning vision-and-language navigation from youtube videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8317–8326.
Lin et al. (2022) Yan-Bo Lin, Jie Lei, Mohit Bansal, and Gedas Bertasius. 2022. Eclipse: Efficient long-range video retrieval using sight and sound. In European Conference on Computer Vision, pages 413–430. Springer.
Liu et al. (2022) Ye Liu, Siyuan Li, Yang Wu, Chang-Wen Chen, Ying Shan, and Xiaohu Qie. 2022. Umt: Unified multi-modal transformers for joint video moment retrieval and highlight detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3042–3051.
Liu et al. (2021) Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2021. Video swin transformer. arXiv preprint arXiv:2106.13230.
Long et al. (2018) Xiang Long, Chuang Gan, and Gerard De Melo. 2018. Video captioning with multi-faceted attention. Transactions of the Association for Computational Linguistics, 6:173–184.
Luo et al. (2020) Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, and Ming Zhou. 2020. Univl: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353.
Luo et al. (2022) Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. 2022. Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing, 508:293–304.
Maaz et al. (2023) Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. 2023. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424.
Mangalam et al. (2023) Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. 2023. Egoschema: A diagnostic benchmark for very long-form video language understanding. arXiv preprint arXiv:2308.09126.
McGurk and MacDonald (1976) Harry McGurk and John MacDonald. 1976. Hearing lips and seeing voices. Nature, 264(5588):746–748.
Miech et al. (2019) Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640.
Monfort et al. (2019) Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, et al. 2019. Moments in time dataset: one million videos for event understanding. IEEE transactions on pattern analysis and machine intelligence, 42(2):502–508.
Nagrani et al. (2022) Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, and Cordelia Schmid. 2022. Learning audio-video modalities from image captions. In European Conference on Computer Vision, pages 407–426. Springer.
Nguyen et al. (2023a) Cong-Duy Nguyen, Thong Nguyen, Duc Vu, and Anh Luu. 2023a. Improving multimodal sentiment analysis: Supervised angular margin-based contrastive learning for enhanced fusion representation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14714–14724.
Nguyen et al. (2024a) Cong-Duy Nguyen, Thong Nguyen, Xiaobao Wu, and Anh Tuan Luu. 2024a. Kdmcse: Knowledge distillation multimodal sentence embeddings with adaptive angular margin contrastive learning. arXiv preprint arXiv:2403.17486.
Nguyen and Luu (2021) Thong Nguyen and Anh Tuan Luu. 2021. Contrastive learning for neural topic model. Advances in neural information processing systems, 34:11974–11986.
Nguyen et al. (2024b) Thong Nguyen, Xiaobao Wu, Xinshuai Dong, Khoi M Le, Zhiyuan Hu, Cong-Duy Nguyen, See-Kiong Ng, and Anh Tuan Luu. 2024b. Read-pvla: Recurrent adapter with partial video-language alignment for parameter-efficient transfer learning in low-resource video-language modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18824–18832.
Nguyen et al. (2023b) Thong Nguyen, Xiaobao Wu, Xinshuai Dong, Cong-Duy Nguyen, See Kiong Ng, and Anh Luu. 2023b. Demaformer: Damped exponential moving average transformer with energy-based modeling for temporal language grounding. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3635–3649.
Nguyen et al. (2024c) Thong Nguyen, Xiaobao Wu, Xinshuai Dong, Cong-Duy T Nguyen, See-Kiong Ng, and Anh Tuan Luu. 2024c. Topic modeling as multi-objective contrastive optimization. arXiv preprint arXiv:2402.07577.
Nguyen et al. (2022) Thong Nguyen, Xiaobao Wu, Anh-Tuan Luu, Cong-Duy Nguyen, Zhen Hai, and Lidong Bing. 2022. Adaptive contrastive learning on multimodal transformer for review helpfulness predictions. arXiv preprint arXiv:2211.03524.
Nie et al. (2022) Liqiang Nie, Leigang Qu, Dai Meng, Min Zhang, Qi Tian, and Alberto Del Bimbo. 2022. Search-oriented micro-video captioning. In Proceedings of the 30th ACM International Conference on Multimedia, pages 3234–3243.
Pan et al. (2020) Boxiao Pan, Haoye Cai, De-An Huang, Kuan-Hui Lee, Adrien Gaidon, Ehsan Adeli, and Juan Carlos Niebles. 2020. Spatio-temporal graph for video captioning with knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10870–10879.
Pan et al. (2023) Junting Pan, Ziyi Lin, Yuying Ge, Xiatian Zhu, Renrui Zhang, Yi Wang, Yu Qiao, and Hongsheng Li. 2023. Retrieving-to-answer: Zero-shot video question answering with frozen large language models. arXiv preprint arXiv:2306.11732.
Pan et al. (2022) Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, and Hongsheng Li. 2022. St-adapter: Parameter-efficient image-to-video transfer learning. Advances in Neural Information Processing Systems, 35:26462–26477.
Park et al. (2021) Jungin Park, Jiyoung Lee, and Kwanghoon Sohn. 2021. Bridge to answer: Structure-aware graph interaction network for video question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15526–15535.
Pei et al. (2023) Renjing Pei, Jianzhuang Liu, Weimian Li, Bin Shao, Songcen Xu, Peng Dai, Juwei Lu, and Youliang Yan. 2023. Clipping: Distilling clip-based models with a student base for video-language retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18983–18992.
Pei et al. (2019) Wenjie Pei, Jiyuan Zhang, Xiangrong Wang, Lei Ke, Xiaoyong Shen, and Yu-Wing Tai. 2019. Memory-attended recurrent network for video captioning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8347–8356.
Peng et al. (2021) Liang Peng, Shuangji Yang, Yi Bin, and Guoqing Wang. 2021. Progressive graph attention network for video question answering. In Proceedings of the 29th ACM International Conference on Multimedia, pages 2871–2879.
Regneri et al. (2013) Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. 2013. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1:25–36.
Rohrbach et al. (2015) Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. 2015. A dataset for movie description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3202–3212.
Rohrbach et al. (2012) Marcus Rohrbach, Sikandar Amin, Mykhaylo Andriluka, and Bernt Schiele. 2012. A database for fine grained activity detection of cooking activities. In 2012 IEEE conference on computer vision and pattern recognition, pages 1194–1201. IEEE.
Ruan and Jin (2022) Ludan Ruan and Qin Jin. 2022. Survey: Transformer based video-language pre-training. AI Open, 3:1–13.
Sanabria et al. (2018) Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. 2018. How2: a large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347.
Schiappa et al. (2023) Madeline C Schiappa, Yogesh S Rawat, and Mubarak Shah. 2023. Self-supervised learning for videos: A survey. ACM Computing Surveys, 55(13s):1–37.
Seo et al. (2022) Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, and Cordelia Schmid. 2022. End-to-end generative pretraining for multimodal video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17959–17968.
Shang et al. (2019) Xindi Shang, Donglin Di, Junbin Xiao, Yu Cao, Xun Yang, and Tat-Seng Chua. 2019. Annotating objects and relations in user-generated videos. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, pages 279–287.
Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems.
Song et al. (2015) Yale Song, Jordi Vallmitjana, Amanda Stent, and Alejandro Jaimes. 2015. Tvsum: Summarizing web videos using titles. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5179–5187.
Stroud et al. (2020) Jonathan C Stroud, Zhichao Lu, Chen Sun, Jia Deng, Rahul Sukthankar, Cordelia Schmid, and David A Ross. 2020. Learning video representations from textual web supervision. arXiv preprint arXiv:2007.14937.
Sun et al. (2019) Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. Videobert: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7464–7473.
Sun et al. (2014) Min Sun, Ali Farhadi, and Steve Seitz. 2014. Ranking domain-specific highlights by analyzing edited videos. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13, pages 787–802. Springer.
Tang et al. (2023a) Hao Tang, Kevin Liang, Kristen Grauman, Matt Feiszli, and Weiyao Wang. 2023a. Egotracks: A long-term egocentric visual object tracking dataset. arXiv preprint arXiv:2301.03213.
Tang et al. (2019) Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. 2019. Coin: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1207–1216.
Tang et al. (2023b) Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, et al. 2023b. Video understanding with large language models: A survey. arXiv preprint arXiv:2312.17432.
Tang et al. (2021) Zineng Tang, Jie Lei, and Mohit Bansal. 2021. Decembert: Learning from noisy instructional videos via dense captions and entropy minimization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2415–2426.
Thomee et al. (2016) Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73.
Torabi et al. (2016) Atousa Torabi, Niket Tandon, and Leonid Sigal. 2016. Learning language-visual embedding for movie understanding with natural-language. arXiv preprint arXiv:1609.08124.
Tran et al. (2017) Du Tran, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. 2017. Convnet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038.
Ventura et al. (2023) Lucas Ventura, Antoine Yang, Cordelia Schmid, and Gül Varol. 2023. Covr: Learning composed video retrieval from web video captions. arXiv preprint arXiv:2308.14746.
Venugopalan et al. (2015) Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2015. Sequence to sequence-video to text. In Proceedings of the IEEE international conference on computer vision, pages 4534–4542.
Wang et al. (2019a) Anran Wang, Anh Tuan Luu, Chuan-Sheng Foo, Hongyuan Zhu, Yi Tay, and Vijay Chandrasekhar. 2019a. Holistic multi-modal memory network for movie question answering. IEEE Transactions on Image Processing, 29:489–499.
Wang et al. (2023a) Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Kevin Qinghong Lin, Satoshi Tsutsui, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, et al. 2023a. All in one: Exploring unified video-language pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6598–6608.
Wang et al. (2021a) Lijie Wang, Hao Liu, Shuyuan Peng, Hongxuan Tang, Xinyan Xiao, Ying Chen, Hua Wu, and Haifeng Wang. 2021a. Dutrust: A sentiment analysis dataset for trustworthiness evaluation. arXiv preprint arXiv:2108.13140.
Wang et al. (2021b) Xiang Wang, Shiwei Zhang, Zhiwu Qing, Yuanjie Shao, Changxin Gao, and Nong Sang. 2021b. Self-supervised learning for semi-supervised temporal action proposal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1905–1914.
Wang et al. (2019b) Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. 2019b. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4581–4591.
Wang et al. (2023b) Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, et al. 2023b. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942.
Wei et al. (2023) Jie Wei, Guanyu Hu, Luu Anh Tuan, Xinyu Yang, and Wenjing Zhu. 2023. Multi-scale receptive field graph model for emotion recognition in conversations. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
Wei et al. (2022) Jie Wei, Guanyu Hu, Xinyu Yang, Anh Tuan Luu, and Yizhuo Dong. 2022. Audio-visual domain adaptation feature fusion for speech emotion recognition. In INTERSPEECH, pages 1988–1992.
Wei et al. (2024) Jie Wei, Guanyu Hu, Xinyu Yang, Anh Tuan Luu, and Yizhuo Dong. 2024. Learning facial expression and body gesture visual information for video emotion recognition. Expert Systems with Applications, 237:121419.
Wei et al. (2017) Lina Wei, Fangfang Wang, Xi Li, Fei Wu, and Jun Xiao. 2017. Graph-theoretic spatiotemporal context modeling for video saliency detection. In 2017 IEEE International Conference on Image Processing (ICIP), pages 4197–4201. IEEE.
Wu et al. (2021) Bofeng Wu, Guocheng Niu, Jun Yu, Xinyan Xiao, Jian Zhang, and Hua Wu. 2021. Weakly supervised dense video captioning via jointly usage of knowledge distillation and cross-modal matching. arXiv preprint arXiv:2105.08252.
Wu et al. (2023a) Xiaobao Wu, Xinshuai Dong, Thong Nguyen, Chaoqun Liu, Liang-Ming Pan, and Anh Tuan Luu. 2023a. Infoctm: A mutual information maximization perspective of cross-lingual topic modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 13763–13771.
Wu et al. (2023b) Xiaobao Wu, Xinshuai Dong, Thong Nguyen, and Anh Tuan Luu. 2023b. Effective neural topic modeling with embedding clustering regularization. In International Conference on Machine Learning. PMLR.
Wu et al. (2024) Xiaobao Wu, Xinshuai Dong, Liangming Pan, Thong Nguyen, and Anh Tuan Luu. 2024. Modeling dynamic topics in chain-free fashion by evolution-tracking contrastive learning and unassociated word exclusion. arXiv preprint arXiv:2405.17957.
Wu et al. (2022) Xiaobao Wu, Anh Tuan Luu, and Xinshuai Dong. 2022. Mitigating data sparsity for short text topic modeling by topic-semantic contrastive learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2748–2760, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Xiao et al. (2021) Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. 2021. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786.
Xiao et al. (2023a) Junbin Xiao, Angela Yao, Yicong Li, and Tat Seng Chua. 2023a. Can i trust your answer? visually grounded video question answering. arXiv preprint arXiv:2309.01327.
Xiao et al. (2022a) Junbin Xiao, Angela Yao, Zhiyuan Liu, Yicong Li, Wei Ji, and Tat-Seng Chua. 2022a. Video as conditional graph hierarchy for multi-granular question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2804–2812.
Xiao et al. (2022b) Junbin Xiao, Pan Zhou, Tat-Seng Chua, and Shuicheng Yan. 2022b. Video graph transformer for video question answering. In European Conference on Computer Vision, pages 39–58. Springer.
Xiao et al. (2023b) Junbin Xiao, Pan Zhou, Angela Yao, Yicong Li, Richang Hong, Shuicheng Yan, and Tat-Seng Chua. 2023b. Contrastive video question answering via video graph transformer. arXiv preprint arXiv:2302.13668.
Xie et al. (2017) Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Xing et al. (2023) Zhen Xing, Qi Dai, Han Hu, Jingjing Chen, Zuxuan Wu, and Yu-Gang Jiang. 2023. Svformer: Semi-supervised video transformer for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18816–18826.
Xu et al. (2017) Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. 2017. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM international conference on Multimedia, pages 1645–1653.
Xu et al. (2023) Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, et al. 2023. mplug-2: A modularized multi-modal foundation model across text, image and video. arXiv preprint arXiv:2302.00402.
Xu et al. (2021) Hu Xu, Gargi Ghosh, Po-Yao Huang, Prahal Arora, Masoumeh Aminzadeh, Christoph Feichtenhofer, Florian Metze, and Luke Zettlemoyer. 2021. Vlm: Task-agnostic video-language model pre-training for video understanding. arXiv preprint arXiv:2105.09996.
Xu et al. (2019) Huijuan Xu, Kun He, Bryan A Plummer, Leonid Sigal, Stan Sclaroff, and Kate Saenko. 2019. Multilevel language and vision integration for text-to-clip retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9062–9069.
Xu et al. (2016) Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296.
Xu et al. (2020) Wanru Xu, Jian Yu, Zhenjiang Miao, Lili Wan, Yi Tian, and Qiang Ji. 2020. Deep reinforcement polishing network for video captioning. IEEE Transactions on Multimedia, 23:1772–1784.
Xue et al. (2022a) Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. 2022a. Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5036–5045.
Xue et al. (2022b) Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, and Jiebo Luo. 2022b. Clip-vip: Adapting pre-trained image-text model to video-language representation alignment. arXiv preprint arXiv:2209.06430.
Yang et al. (2022a) Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. 2022a. Zero-shot video question answering via frozen bidirectional language models. Advances in Neural Information Processing Systems, 35:124–141.
Yang et al. (2023a) Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, and Cordelia Schmid. 2023a. Vidchapters-7m: Video chapters at scale. arXiv preprint arXiv:2309.13952.
Yang et al. (2023b) Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, and Cordelia Schmid. 2023b. Vid2seq: Large-scale pretraining of a visual language model for dense video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10714–10726.
Yang et al. (2022b) Bang Yang, Tong Zhang, and Yuexian Zou. 2022b. Clip meets video captioning: Concept-aware representation learning does matter. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 368–381. Springer.
Yang et al. (2021) Xun Yang, Fuli Feng, Wei Ji, Meng Wang, and Tat-Seng Chua. 2021. Deconfounded video moment retrieval with causal intervention. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1–10.
Yang et al. (2024) Yijun Yang, Zhaohu Xing, and Lei Zhu. 2024. Vivim: a video vision mamba for medical video object segmentation. arXiv preprint arXiv:2401.14168.
Yang et al. (2017) Yinchong Yang, Denis Krompass, and Volker Tresp. 2017. Tensor-train recurrent neural networks for video classification. In International Conference on Machine Learning, pages 3891–3900. PMLR.
Yao et al. (2015) Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. 2015. Describing videos by exploiting temporal structure. In Proceedings of the IEEE international conference on computer vision, pages 4507–4515.
Ye et al. (2017) Yunan Ye, Zhou Zhao, Yimeng Li, Long Chen, Jun Xiao, and Yueting Zhuang. 2017. Video question answering via attribute-augmented attention network learning. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 829–832.
Yu et al. (2016) Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. 2016. Video paragraph captioning using hierarchical recurrent neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4584–4593.
Yu et al. (2020) Ting Yu, Jun Yu, Zhou Yu, Qingming Huang, and Qi Tian. 2020. Long-term video question answering via multimodal hierarchical memory attentive networks. IEEE Transactions on Circuits and Systems for Video Technology, 31(3):931–944.
Yu et al. (2018) Youngjae Yu, Jongseok Kim, and Gunhee Kim. 2018. A joint sequence fusion model for video question answering and retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), pages 471–487.
Yu et al. (2017) Youngjae Yu, Hyungjin Ko, Jongwook Choi, and Gunhee Kim. 2017. End-to-end concept word detection for video captioning, retrieval, and question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3165–3173.
Yu et al. (2019) Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. 2019. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9127–9134.
Yuan et al. (2023) Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, et al. 2023. Videoglue: Video general understanding evaluation of foundation models. arXiv preprint arXiv:2307.03166.
Yuan et al. (2019) Yitian Yuan, Tao Mei, and Wenwu Zhu. 2019. To find where you talk: Temporal sentence localization in video with attention based location regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9159–9166.
Yun et al. (2019) Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. 2019. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6023–6032.
Zellers et al. (2021) Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. 2021. Merlot: Multimodal neural script knowledge models. Advances in Neural Information Processing Systems, 34:23634–23651.
Zeng et al. (2017) Kuo-Hao Zeng, Tseng-Hung Chen, Ching-Yao Chuang, Yuan-Hong Liao, Juan Carlos Niebles, and Min Sun. 2017. Leveraging video descriptions to learn video question answering. In Thirty-First AAAI Conference on Artificial Intelligence.
Zeng et al. (2022) Yawen Zeng, Da Cao, Shaofei Lu, Hanling Zhang, Jiao Xu, and Zheng Qin. 2022. Moment is important: Language-based video moment retrieval via adversarial learning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 18(2):1–21.
Zhang et al. (2023a) Bowen Zhang, Xiaojie Jin, Weibo Gong, Kai Xu, Zhao Zhang, Peng Wang, Xiaohui Shen, and Jiashi Feng. 2023a. Multimodal video adapter for parameter efficient video text retrieval. arXiv preprint arXiv:2301.07868.
Zhang et al. (2023b) Hang Zhang, Xin Li, and Lidong Bing. 2023b. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858.
Zhang et al. (2017) Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. 2017. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.
Zhang et al. (2020) Ziqi Zhang, Yaya Shi, Chunfeng Yuan, Bing Li, Peijin Wang, Weiming Hu, and Zheng-Jun Zha. 2020. Object relational graph with teacher-recommended learning for video captioning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13278–13288.
Zhao et al. (2017a) Rui Zhao, Haider Ali, and Patrick Van der Smagt. 2017a. Two-stream rnn/cnn for action recognition in 3d videos. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4260–4267. IEEE.
Zhao et al. (2023) Yue Zhao, Ishan Misra, Philipp Krähenbühl, and Rohit Girdhar. 2023. Learning video representations from large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6586–6597.
Zhao et al. (2017b) Zhou Zhao, Jinghao Lin, Xinghua Jiang, Deng Cai, Xiaofei He, and Yueting Zhuang. 2017b. Video question answering via hierarchical dual-level attention network learning. In Proceedings of the 25th ACM international conference on Multimedia, pages 1050–1058.
Zhao et al. (2018) Zhou Zhao, Zhu Zhang, Shuwen Xiao, Zhou Yu, Jun Yu, Deng Cai, Fei Wu, and Yueting Zhuang. 2018. Open-ended long-form video question answering via adaptive hierarchical reinforced networks. In IJCAI, volume 2, page 8.
Zhong et al. (2022) Yaoyao Zhong, Junbin Xiao, Wei Ji, Yicong Li, Weihong Deng, and Tat-Seng Chua. 2022. Video question answering: Datasets, algorithms and challenges. arXiv preprint arXiv:2203.01225.
Zhou et al. (2018a) Luowei Zhou, Chenliang Xu, and Jason Corso. 2018a. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
Zhou et al. (2018b) Luowei Zhou, Yingbo Zhou, Jason J Corso, Richard Socher, and Caiming Xiong. 2018b. End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8739–8748.
Zhou et al. (2024) Yue Zhou, Chenlu Guo, Xu Wang, Yi Chang, and Yuan Wu. 2024. A survey on data augmentation in large model era. arXiv preprint arXiv:2401.15422.
Zhu et al. (2023) Cunjuan Zhu, Qi Jia, Wei Chen, Yanming Guo, and Yu Liu. 2023. Deep learning for video-text retrieval: a review. International Journal of Multimedia Information Retrieval, 12(1):3.
Zhu and Yang (2020) Linchao Zhu and Yi Yang. 2020. Actbert: Learning global-local video-text representations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8746–8755.
Zhu and Jiang (2019) Yongqing Zhu and Shuqiang Jiang. 2019. Attention-based densely connected lstm for video captioning. In Proceedings of the 27th ACM international conference on multimedia, pages 802–810.
Zhuang et al. (2023) Jiafan Zhuang, Zilei Wang, and Junjie Li. 2023. Video semantic segmentation with inter-frame feature fusion and inner-frame feature refinement. arXiv preprint arXiv:2301.03832.
Zhukov et al. (2019) Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, and Josef Sivic. 2019. Cross-task weakly supervised learning from instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3537–3545.

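Appendix

Appendix A Examples of Video-Language Understanding Tasks

In this appendix, we provide examples of video-language understanding tasks in Figures 5 and 6.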

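Appendix B Analysis of Video-Language Understanding Datasets

Due to space limitations, the details of datasets for video-language understanding tasks are listed in Table 4. We categorize the datasets by the tasks they support. While datasets for downstream tasks and fine-tuning have evolved continuously, datasets for pre-training emerged only after the advent of the Transformer architecture. Although pre-training and downstream video-language understanding datasets pursue different goals, both are sourced mainly from the Internet. Among downstream datasets, some recent ones aim to pose new technical challenges, such as assessing reasoning and inference capabilities (Xiao et al., 2021; Li et al., 2022a) or examining the long-form modeling capability of video-language understanding models (Mangalam et al., 2023).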

Methods	Model architecture	Video	Text	R@1	R@5	R@10
VSE-LSTM (Kiros et al., 2014)	Pre-TF	ConvNet/OxfordNet	GloVe-LSTM	3.8	12.7	17.1
C+LSTM+SA-FC7 (Torabi et al., 2016)	Pre-TF	VGG	GloVe-LSTM	4.2	12.9	19.9
EITanque (Kaufman et al., 2016)	Pre-TF	VGG	word2vec-LSTM	4.7	16.6	24.1
SA-G+SA-FC7 (Torabi et al., 2016)	Pre-TF	VGG	GloVe	3.1	9.0	13.4
CT-SAN (Yu et al., 2017)	Pre-TF	RN	word2vec-LSTM	4.4	16.6	22.3
JSFusion (Yu et al., 2018)	Pre-TF	RN	GloVe-LSTM	10.2	31.2	43.2
All-in-one (Wang et al., 2023a)	Shared TF	Linear	BT	37.9	68.1	77.1
VLM (Xu et al., 2021)	Shared TF	S3D	BT	28.1	55.5	67.4
DeCEMBERT (Tang et al., 2021)	Shared TF	RN	BT	17.5	44.3	58.6
ActBERT (Zhu and Yang, 2020)	Stacked TF	Faster-RCNN	BT	16.3	42.8	56.9
VIOLET (Fu et al., 2023)	Stacked TF	VS-TF	BT	37.2	64.8	75.8
VindLU (Cheng et al., 2023)	Stacked TF	ViT	BT	48.8	72.4	82.2
HERO (Li et al., 2020)	Stacked TF	RN+SlowFast	BT	16.8	43.4	57.7
MV-GPT (Seo et al., 2022)	Stacked TF	ViViT	BT	37.3	65.5	75.1
CLIP2TV (Gao et al., 2021)	Dual TF	ViT	CLIP-text	32.4	58.2	68.6
CLIP-ViP (Xue et al., 2022b)	Dual TF	ViT	CLIP-text	49.6	74.5	84.8
CLIP4Clip (Luo et al., 2022)	Dual TF	ViT	CLIP-text	44.5	71.4	81.6

Table 1: Text-video retrieval performance. (Pre-TF: Pre-Transformer, Shared TF: Shared Transformer, Stacked TF: Stacked Transformer, Dual TF: Dual Transformer, RN: ResNet/ResNeXt (He et al., 2016; Xie et al., 2017), ViT: Vision Transformer (Dosovitskiy et al., 2020), BT: BERT (Devlin et al., 2018), ViViT: Video Vision Transformer (Arnab et al., 2021)). We report recall at rank 1 (R@1), 5 (R@5), and 10 (R@10). We choose MSRVTT because it is one of the most popular text-video retrieval datasets.
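As an illustration of the R@K metric reported in Table 1, the following is a minimal sketch and not the evaluation code of any listed method. It assumes a square text-to-video similarity matrix in which the ground-truth video for query i sits at index i; the function name `recall_at_k` and this layout are assumptions made for the example.

```python
from typing import Dict, Sequence


def recall_at_k(
    similarity: Sequence[Sequence[float]],
    ks: Sequence[int] = (1, 5, 10),
) -> Dict[int, float]:
    """Compute text-to-video recall at each cutoff in `ks`, as a percentage.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    num_queries = len(similarity)
    hits = {k: 0 for k in ks}
    for i, row in enumerate(similarity):
        # Rank of the ground-truth video: one plus the number of
        # videos that score strictly higher than it.
        rank = 1 + sum(1 for score in row if score > row[i])
        for k in ks:
            if rank <= k:
                hits[k] += 1
    return {k: 100.0 * hits[k] / num_queries for k in ks}


# Toy example with three queries and three videos.
toy_similarity = [
    [0.9, 0.1, 0.2],  # ground truth ranked 1st
    [0.8, 0.3, 0.1],  # ground truth ranked 2nd
    [0.2, 0.7, 0.4],  # ground truth ranked 2nd
]
toy_result = recall_at_k(toy_similarity, ks=(1, 2))
```

On this toy matrix only the first query retrieves its video at rank 1, so R@1 is one third of the queries, while every ground-truth video falls within the top two.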

Methods	Model architecture	Video	BLEU-4	METEOR	CIDEr
TA (Yao et al., 2015)	Pre-TF	3D-CNN	36.5	25.7	-
h-RNN (Yu et al., 2016)	Pre-TF	VGG	36.8	25.9	-
MFATT (Long et al., 2018)	Pre-TF	RN+C3D	39.1	26.7	-
CAT-TM (Long et al., 2018)	Pre-TF	RN+C3D	36.6	25.6	-
NFS-TM (Long et al., 2018)	Pre-TF	RN+C3D	37.0	25.9	-
Fuse-TM (Long et al., 2018)	Pre-TF	RN+C3D	37.5	25.9	-
MARN (Pei et al., 2019)	Pre-TF	RN	-	-	46.8
Res-ATT (Li et al., 2019)	Pre-TF	RN	37.0	26.9	40.7
DenseLSTM (Zhu and Jiang, 2019)	Pre-TF	VGG	38.1	27.2	42.8
VIOLET (Fu et al., 2023)	Stacked TF	VS-TF	-	-	58.0
LAVENDER (Li et al., 2023b)	Stacked TF	VS-TF	-	-	57.4
VLAB (He et al., 2023)	Stacked TF	EVA-G	54.6	33.4	74.9
UniVL (Luo et al., 2020)	Stacked TF	S3D	41.8	28.9	50.0
MV-GPT (Seo et al., 2022)	Stacked TF	ViViT	48.9	38.7	60.0
CLIP-DCD (Yang et al., 2022b)	Stacked TF	ViT	48.2	30.9	64.8
DeCEMBERT (Tang et al., 2021)	Stacked TF	RN	45.2	29.7	52.3
mPLUG-2 (Xu et al., 2023)	Stacked TF	ViT	57.8	34.9	80.3

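Table 2: Video captioning performance. (Pre-TF: Pre-Transformer, Stacked TF: Stacked Transformer, RN: ResNet/ResNeXt (He et al., 2016; Xie et al., 2017), ViViT: Video Vision Transformer (Arnab et al., 2021), EVA-G: Fang et al. (2023)). We report BLEU-4, METEOR, and CIDEr, which are popular metrics for language generation. We choose MSRVTT as one of the most popular datasets for video captioning.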

Methods	Architecture	Video	Text	MSRVTT	MSVD
E-MN (Xu et al., 2017)	Pre-TF	VGG + C3D	GloVe-LSTM	30.4	26.7
QueST (Jiang et al., 2020)	Pre-TF	RN + C3D	GloVe-LSTM	40.0	-
HME (Fan et al., 2019)	Pre-TF	RN/VGG + C3D	GloVe-GRU	34.6	36.1
HGA (Jiang and Han, 2020)	Pre-TF	RN/VGG + C3D	GloVe-GRU	33.0	33.7
ST-VQA (Jang et al., 2019)	Pre-TF	RN + C3D	GloVe-LSTM	35.5	34.7
PGAT (Peng et al., 2021)	Pre-TF	Faster-RCNN	GloVe-LSTM	38.1	39.0
HCRN (Le et al., 2020)	Pre-TF	RN	GloVe-LSTM	35.6	36.1
HQGA (Xiao et al., 2022a)	Pre-TF	Faster-RCNN	BERT-LSTM	38.6	41.2
All-in-one (Wang et al., 2023a)	Shared TF	Linear	BT	44.3	47.9
LAVENDER (Li et al., 2023b)	Stacked TF	VS-TF	BT	45.0	56.6
DeCEMBERT (Tang et al., 2021)	Stacked TF	RN	BT	37.4	-
VindLU (Cheng et al., 2023)	Stacked TF	ViT	BT	44.6	-
VIOLET (Fu et al., 2023)	Stacked TF	VS-TF	BT	44.5	54.7
ClipBERT (Lei et al., 2021c)	Stacked TF	CLIP-text	BT	37.4	-
VGT (Xiao et al., 2022b)	Dual TF	Faster-RCNN	BT	39.7	-
CoVGT (Xiao et al., 2023b)	Dual TF	Faster-RCNN	BT	40.0	-
LLaMA-Vid (Li et al., 2023d)	LLM-Augmented	EVA-G	Vicuna	58.9	70.0

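Table 3: Video question answering performance. (Pre-TF: Pre-Transformer, Dual TF: Dual Transformer, RN: ResNet/ResNeXt (He et al., 2016; Xie et al., 2017), BT: BERT (Devlin et al., 2018), VS-TF: Video Swin Transformer (Liu et al., 2021), EVA-G: Fang et al. (2023)). We report the accuracy of these methods. We choose MSRVTT and MSVD as two of the most popular datasets for video question answering.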

Dataset	Video source	Annotation	Tasks	#Videos/#Routes
MSVD (Chen and Dolan, 2011)	YouTube videos	Manual	TVR, VC, VideoQA	1.9K
MSRVTT (Xu et al., 2016)	Web videos	Manual	TVR, VC, VideoQA	7.2K
ActivityNet (Yu et al., 2019)	YouTube videos	Manual	AL, TVR, VC, VMR	5.8K
FIBER (Castro et al., 2022b)	VaTeX (Wang et al., 2019b)	Manual	VC, VideoQA	28K
WildQA (Castro et al., 2022a)	YouTube videos	Manual	VideoQA	0.4K
NExT-QA (Xiao et al., 2021)	VidOR (Shang et al., 2019)	Manual	VideoQA	5.4K
CausalVid-QA (Li et al., 2022a)	Kinetics-700 (Carreira et al., 2019)	Manual	VideoQA	26K
HowTo100M (Miech et al., 2019)	YouTube videos	Auto	PT	1.2M
HD-VILA-100M (Xue et al., 2022a)	YouTube videos	Auto	PT	3.3M
YT-Temporal-180M (Zellers et al., 2021)	YouTube videos	Auto	PT	6M
TGIF-QA (Jang et al., 2017)	Animated GIFs	Manual	VideoQA	71K
TGIF-QA-R (Peng et al., 2021)	TGIF-QA (Jang et al., 2017)	Manual, Auto	VideoQA	71K
DiDeMo (Anne Hendricks et al., 2017)	YFCC100M (Thomee et al., 2016)	Manual	TVR	11K
YouCook2 (Zhou et al., 2018a)	YouTube videos	Manual	TVR, VC	2K
HMDB-51 (Kuehne et al., 2011)	Web videos	Manual	TVR, AR	6.8K
Kinetics-400 (Kay et al., 2017)	YouTube videos	Manual	AR	306K
Kinetics-600 (Carreira et al., 2018)	Kinetics-400 (Kay et al., 2017)	Manual	AR, VG	480K
Kinetics-700 (Carreira et al., 2019)	Kinetics-600 (Carreira et al., 2018)	Manual	AR	650K
VaTeX (Wang et al., 2019b)	Kinetics-600 (Carreira et al., 2018)	Manual	TVR, VC	41K
TVR (Lei et al., 2020)	TVQA (Lei et al., 2018)	Manual	VMR	22K
How2R (Li et al., 2020)	HowTo100M (Miech et al., 2019)	Manual	VMR	22K
How2QA (Li et al., 2020)	HowTo100M (Miech et al., 2019)	Manual	VideoQA	22K
YouTube Highlights (Sun et al., 2014)	YouTube videos	Manual	VMR	0.6K
TACoS (Regneri et al., 2013)	MPII Composites (Rohrbach et al., 2012)	Manual	VMR	0.1K
QVHighlights (Lei et al., 2021b)	YouTube vlogs	Manual	VMR	10K
TVSum (Song et al., 2015)	YouTube videos	Manual	VMR	50
ViTT (Huang et al., 2020)	YouTube-8M (Abu-El-Haija et al., 2016)	Manual	VMR	5.8K
VidChapters-7M (Yang et al., 2023a)	YT-Temporal-180M (Zellers et al., 2021)	Auto	VC, VMR	817K
VideoCC3M (Nagrani et al., 2022)	Web videos	Auto	PT	6.3M
WebVid-10M (Bain et al., 2021)	Web videos	Auto	PT	10.7M
COIN (Tang et al., 2019)	YouTube videos	Manual	AS	12K
CrossTask (Zhukov et al., 2019)	YouTube videos	Manual	AR	4.7K
Alivol-10M (Lei et al., 2021a)	E-commerce videos	Auto	PT	10M
LSMDC (Rohrbach et al., 2015)	British movies	Manual	TVR	72
EK-100 (Damen et al., 2022)	Manual	Manual	AR, AL	7K
SSV1 (Goyal et al., 2017)	Manual	Manual	AR	108K
SSV2 (Goyal et al., 2017)	Manual	Manual	AR	221K
Moments in Time (Monfort et al., 2019)	Web videos	Manual	AR	1M
InternVid (Wang et al., 2023b)	YouTube videos	Auto	PT	7.1M
How2 (Sanabria et al., 2018)	YouTube videos	Auto	VC	13.2K
WTS70M (Stroud et al., 2020)	YouTube videos	Auto	PT	70M
Charades (Gao et al., 2017)	Manual	Manual	AR, VMR, VideoQA	10K

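Table 4: Video understanding datasets in the literature. (VMR: video moment retrieval, TVR: text-video retrieval, VC: video captioning, AL: action localization, AR: action recognition, AS: action segmentation, VG: video generation, PT: pretraining).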