

Video understanding has seen tremendous progress over the past decades, with models attaining excellent results on classic benchmarks and widespread applications in industry. However, many challenges remain open and the path forward is not clear. Image-text models have pushed progress on current video benchmarks, but we still lack video foundation models. Recent work has shown that several popular benchmarks can be solved using single video frames. New benchmark tasks are frequently proposed, but their adoption is often limited, and researchers default to saturated tasks such as classification and localisation. The ever-growing computational requirements are also particularly demanding for this field, which curbs the accessibility of video understanding. This raises the questions of how much deeper our understanding of video needs to be, whether the current benchmarks are enough, how to work in this field with limited resources, and what comes next beyond the task of action recognition.

The purpose of this workshop is to create a forum for discussion on what is next in video understanding, i.e. what are the paths forward and the fundamental problems that still need to be addressed in this research area. We will engage in a discussion with our community on the various facets of video understanding, including, but not limited to, tasks, model design, video length, multi-modality, generalisation properties, efficiency, ethics, fairness, and scale.

Call for Position Papers

We invite submissions of 1-2 page position papers on the future of video understanding. Papers will not be part of the CVPR proceedings.

We do not expect submissions to be based on accepted papers or even to include experiments. Rather, they should call the community’s attention to an overlooked, under-explored, or new direction for video understanding, e.g. “We should not forget optical flow” or “The future lies in language-video models, but how can small groups compete?”. Papers can focus on both the short-term and long-term future of video understanding.

Selected position papers will be “pitched” in a short oral talk during the workshop, followed by an open discussion.

Submissions should be formatted as per the CVPR 2024 guidelines, but please submit in single-blind format, i.e. there is no need to anonymise. Please use the template made available at CVPR2024AuthorGuidelines.

Submission Deadline

11th April 2024, 23:59 Pacific Time.

Notification of Acceptance

25th April 2024.

Submission Website

Please submit your paper on CMT.


Jitendra Malik (University of California at Berkeley, Facebook AI Research)
Michael Ryoo (Salesforce AI, Stony Brook University)
Makarand Tapaswi (Wadhwani AI, IIIT Hyderabad)
Gül Varol (École des Ponts ParisTech)
Angela Yao (National University of Singapore)


Davide Moltisanti (University of Bath)
Hazel Doughty (Leiden University)
Michael Wray (University of Bristol)
Bernard Ghanem (King Abdullah University of Science and Technology)
Lorenzo Torresani (Facebook AI Research)