Time and location

18th June 2024, 8.30AM - 1PM, Summit 335-336, 900 Pine Street, Seattle.

Abstract

Video understanding has seen tremendous progress over the past decades, with models attaining excellent results on classic benchmarks and widespread applications in industry. However, many challenges remain open and the path forward is not clear. Image-text models have pushed progress on current video benchmarks, but we still lack video foundation models. Recent work has shown that several popular benchmarks can be solved using single video frames. New benchmark tasks are frequently proposed, but their adoption is often limited and researchers default to saturated tasks such as classification and localisation. The ever-growing computational requirements are also particularly demanding in this field, which curbs the accessibility of video understanding. This raises the questions of how much deeper our understanding of video needs to be, whether the current benchmarks are enough, how to work in this field with limited resources, and what comes next beyond the task of action recognition.

The purpose of this workshop is to create a forum for discussion on what is next in video understanding, i.e. the paths forward and the fundamental problems that still need to be addressed in this research area. We will engage in a discussion with our community on the various facets of video understanding, including, but not limited to, tasks, model design, video length, multi-modality, generalisation properties, efficiency, ethics, fairness, and scale.

Speakers

	Jitendra Malik	University of California at Berkeley, Facebook AI Research
	Michael Ryoo	Salesforce AI, Stony Brook University
	Makarand Tapaswi	Wadhwani AI, IIIT Hyderabad
	Gül Varol	École des Ponts ParisTech
	Angela Yao	National University of Singapore

Programme

Time	What
08:30	Welcome and opening remarks
08:45	Angela Yao’s talk
09:10	Makarand Tapaswi’s talk
09:35	Pitches on the future of video understanding (8 presentations)
10:15	Break
10:35	Jitendra Malik’s talk
11:00	Michael Ryoo’s talk
11:25	Gül Varol’s talk
11:50	Panel discussion
12:35	Conclusions and closing remarks

Pitches on the future of video understanding

The following position papers were accepted at our workshop and will be presented by their authors during the pitches session.

Name	Paper
Wufei Ma	Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data
Xiaohan Wang	Building Agentic System for Video Understanding
Sujay Kumar	Extreme Video Compression: Key To Large Scale Video Understanding
Orr Zohar	Rethinking Video Instruction Dataset Generation: The Case for Bootstrapping with Large Vision-Language Models
Sathyanarayanan Aakur	VOWL: Towards Video Understanding in an Open World
Deeksha Arun	Enhancing Video Analysis: Selecting Informative Frames with Active Learning for a Reduced Video Gallery
Sophia Abraham	Beyond Frame-by-Frame: Utilizing Homotopy Theory and Domain Knowledge in Foundation Models for Video Understanding
Sujoy Roy Chowdhury	Installation Video validation from Instruction Manuals - The Next Frontier

We are collecting opinions on the best and worst practices in the field, as well as insights on past and future trends in video understanding.

You can share your views by filling in this form.

Organisers

	Davide Moltisanti	University of Bath
	Hazel Doughty	Leiden University
	Michael Wray	University of Bristol
	Bernard Ghanem	King Abdullah University of Science and Technology
	Lorenzo Torresani	Facebook AI Research