Time and location

18th June 2024, 8.30AM - 1PM, Summit 335-336, 900 Pine Street, Seattle.


Video understanding has seen tremendous progress over the past decades, with models attaining excellent results on classic benchmarks and widespread applications in industry. However, many challenges remain open and the path forward is not clear. Image-text models have pushed progress on current video benchmarks, but we still lack video foundation models. Recent works have shown that several popular benchmarks can be solved using single video frames. New benchmark tasks are frequently proposed, but their adoption is often limited and researchers default to saturated tasks such as classification and localisation. The ever-growing computational requirements are also particularly demanding for this field, which curbs the accessibility of video understanding. This raises the questions of how much deeper our understanding of video needs to be, whether the current benchmarks are enough, how to work in this field with limited resources, and what comes next beyond the task of action recognition.

The purpose of this workshop is to create a forum for discussion on what is next in video understanding, i.e. what are the paths forward and the fundamental problems that still need to be addressed in this research area. We will engage in a discussion with our community on the various facets of video understanding, including, but not limited to, tasks, model design, video length, multi-modality, generalisation properties, efficiency, ethics, fairness, and scale.


Jitendra Malik University of California at Berkeley, Facebook AI Research
Michael Ryoo Salesforce AI, Stony Brook University
Makarand Tapaswi Wadhwani AI, IIIT Hyderabad
Gül Varol École des Ponts ParisTech
Angela Yao National University of Singapore


Time What
08:30 Welcome and opening remarks
08:45 Angela Yao’s talk
09:10 Makarand Tapaswi’s talk
09:35 Pitches on the future of video understanding (8 presentations)
10:15 Break
10:35 Jitendra Malik’s talk
11:00 Michael Ryoo’s talk
11:25 Gül Varol’s talk
11:50 Panel discussion
12:35 Conclusions and closing remarks

Pitches on the future of video understanding

This is the list of position papers that were accepted at our workshop. These will be presented by the authors during the workshop.

Name Paper
Wufei Ma Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data
Xiaohan Wang Building Agentic System for Video Understanding
Sujay Kumar Extreme Video Compression: Key To Large Scale Video Understanding
Orr Zohar Rethinking Video Instruction Dataset Generation: The Case for Bootstrapping with Large Vision-Language Models
Sathyanarayanan Aakur VOWL: Towards Video Understanding in an Open World
Deeksha Arun Enhancing Video Analysis: Selecting Informative Frames with Active Learning for a Reduced Video Gallery
Sophia Abraham Beyond Frame-by-Frame: Utilizing Homotopy Theory and Domain Knowledge in Foundation Models for Video Understanding
Sujoy Roy Chowdhury Installation Video validation from Instruction Manuals - The Next Frontier

Share your views

We are collecting opinions on the best and worst practices in the field, as well as insights on past and future trends in video understanding.

You can share your views by filling in this form.


Davide Moltisanti University of Bath
Hazel Doughty Leiden University
Michael Wray University of Bristol
Bernard Ghanem King Abdullah University of Science and Technology
Lorenzo Torresani Facebook AI Research