Video models are zero-shot learners and reasoners

Video models like Veo 3 are on a path to become vision foundation models.

Veo 3 shows emergent zero-shot abilities across many visual tasks, indicating that video models are on a path to becoming vision foundation models, just as LLMs became foundation models for language. That shift in language modeling emerged from simple primitives: large generative models trained on web-scale data. We demonstrate that Veo 3 can solve, zero-shot, a broad variety of tasks it was not explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and much more.
