Time and location
18th June 2024, 8:30 AM – 1:00 PM, Summit 335-336, 900 Pine Street, Seattle.
Abstract
Video understanding has seen tremendous progress over the past decades, with models attaining excellent results on classic benchmarks and finding widespread application in industry. However, many open challenges remain and the path forward is not clear. Image-text models have pushed progress on current video benchmarks, but we are still lacking true video foundation models. Recent works have shown that several popular benchmarks can be solved using a single video frame. New benchmark tasks are frequently proposed; however, their adoption is often limited, and researchers default to saturated tasks such as classification and localisation. The ever-growing computational requirements are also particularly demanding in this field, which curbs the accessibility of video understanding research. This raises the questions of how much deeper our understanding of video needs to be, whether our current benchmarks are enough, how to work in this field with limited resources, and what comes next beyond the task of action recognition.
The purpose of this workshop is to create a forum for discussion on what is next in video understanding, i.e. what the paths forward are and which fundamental problems still need to be addressed in this research area. We will engage in a discussion with our community on the various facets of video understanding, including, but not limited to, tasks, model design, video length, multi-modality, generalisation, efficiency, ethics, fairness, and scale.
Speakers
Name | Affiliation |
---|---|
Jitendra Malik | University of California at Berkeley, Facebook AI Research |
Michael Ryoo | Salesforce AI, Stony Brook University |
Makarand Tapaswi | Wadhwani AI, IIIT Hyderabad |
Gül Varol | École des Ponts ParisTech |
Angela Yao | National University of Singapore |
Programme
Time | What |
---|---|
08:30 | Welcome and opening remarks |
08:45 | Angela Yao’s talk |
09:10 | Makarand Tapaswi’s talk |
09:35 | Pitches on the future of video understanding (8 presentations) |
10:15 | Break |
10:35 | Jitendra Malik’s talk |
11:00 | Michael Ryoo’s talk |
11:25 | Gül Varol’s talk |
11:50 | Panel discussion |
12:35 | Conclusions and closing remarks |
Pitches on the future of video understanding
These are the position papers accepted at our workshop; each will be presented by its authors during the pitch session.
Name | Paper |
---|---|
Wufei Ma | Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data |
Xiaohan Wang | Building Agentic System for Video Understanding |
Sujay Kumar | Extreme Video Compression: Key To Large Scale Video Understanding |
Orr Zohar | Rethinking Video Instruction Dataset Generation: The Case for Bootstrapping with Large Vision-Language Models |
Sathyanarayanan Aakur | VOWL: Towards Video Understanding in an Open World |
Deeksha Arun | Enhancing Video Analysis: Selecting Informative Frames with Active Learning for a Reduced Video Gallery |
Sophia Abraham | Beyond Frame-by-Frame: Utilizing Homotopy Theory and Domain Knowledge in Foundation Models for Video Understanding |
Sujoy Roy Chowdhury | Installation Video validation from Instruction Manuals - The Next Frontier |
Share your views
We are collecting opinions on the best and worst practices in the field, as well as insights on past and future trends in video understanding.
You can share your views by filling in this form.
Organisers
Name | Affiliation |
---|---|
Davide Moltisanti | University of Bath |
Hazel Doughty | Leiden University |
Michael Wray | University of Bristol |
Bernard Ghanem | King Abdullah University of Science and Technology |
Lorenzo Torresani | Facebook AI Research |