Video models are zero-shot learners and reasoners

Perception, modeling, and manipulation all integrate to tackle visual reasoning. While language models manipulate human-invented symbols, video models can apply changes across the dimensions of the real world: time and space. Since these changes are applied frame-by-frame in a generated video, this parallels chain-of-thought in LLMs and could therefore be called chain-of-frames, or CoF for short. In the language domain, chain-of-thought enabled models to tackle reasoning problems. Similarly, chain-of-frames (a.k.a. video generation) might enable video models to solve challenging visual problems that require step-by-step reasoning across time and space.
— Read on simonwillison.net/2025/Sep/27/video-models-are-zero-shot-learners-and-reasoners/
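
To make the chain-of-frames analogy concrete, here is a minimal sketch in Python. It is not code from the paper: the `step` and `is_solved` callables are hypothetical stand-ins for "generate the next frame" and "check whether the latest frame solves the task", and the toy example below stubs them out. The point is only the control flow: a video model tackles a visual task by emitting intermediate frames one at a time, much as an LLM emits intermediate tokens in chain-of-thought.

```python
# Illustrative sketch only: this is not an API from the paper. It shows how
# frame-by-frame generation can act like chain-of-thought, with each frame
# serving as one intermediate reasoning step.
from dataclasses import dataclass, field
from typing import Callable, List

Frame = List[List[int]]  # toy stand-in for an image (2-D grid of pixels)


@dataclass
class ChainOfFrames:
    """Accumulates the intermediate frames produced for a prompt."""
    prompt: str
    frames: List[Frame] = field(default_factory=list)


def solve_visual_task(
    prompt: str,
    first_frame: Frame,
    step: Callable[[str, Frame], Frame],   # hypothetical "generate next frame"
    is_solved: Callable[[Frame], bool],    # task-specific check on a frame
    max_frames: int = 16,
) -> ChainOfFrames:
    """Generate frames one at a time until the task looks solved (or we give up).

    Each generated frame plays the role of one "thought": the model manipulates
    the scene step by step across time and space.
    """
    chain = ChainOfFrames(prompt=prompt, frames=[first_frame])
    for _ in range(max_frames):
        next_frame = step(prompt, chain.frames[-1])
        chain.frames.append(next_frame)
        if is_solved(next_frame):
            break
    return chain


if __name__ == "__main__":
    # Toy task: "move the marker to the right edge" on a 1x8 grid.
    start = [[1, 0, 0, 0, 0, 0, 0, 0]]

    def shift_right(_prompt: str, frame: Frame) -> Frame:
        # Stand-in for a video model's next-frame prediction.
        row = frame[0]
        i = row.index(1)
        new_row = [0] * len(row)
        new_row[min(i + 1, len(row) - 1)] = 1
        return [new_row]

    result = solve_visual_task(
        "move the marker to the right edge",
        start,
        step=shift_right,
        is_solved=lambda f: f[0][-1] == 1,
    )
    print(f"solved in {len(result.frames) - 1} generated frames")
```

The design choice worth noting is that the "reasoning" lives entirely in the sequence of generated frames, not in any symbolic scratchpad: inspecting `result.frames` is the visual analogue of reading an LLM's chain-of-thought transcript.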

