Apple trained a large language model to efficiently understand long-form video


Apple researchers have developed a version of the SlowFast-LLaVA model that beats larger models at long-form video understanding.

Of course, there are more efficient ways to train video LLMs (NVIDIA recently published an interesting paper on this), but this is the general idea to keep in mind for Apple's study.

What's more, the model also overcomes one of the three shortcomings noted by the researchers and performs well on image tasks too, including benchmarks for knowledge, math reasoning, OCR, and text-rich scenarios. The team even tested several video compression strategies, but found that their setup struck the best balance among speed, accuracy, and token count.
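
To make that speed/accuracy/token-count trade-off concrete, here is a minimal sketch of the two-pathway sampling idea behind the SlowFast-LLaVA family: a slow stream keeps a few frames at full spatial detail, while a fast stream covers many frames but pools their tokens aggressively. The frame counts, grid size, and pooling factor below are illustrative assumptions, not the settings from Apple's paper.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of SlowFast-style two-pathway frame sampling.
# All hyperparameters here are assumptions for demonstration, not the
# values used in Apple's SlowFast-LLaVA variant.

def slowfast_tokens(video, slow_frames=8, fast_frames=32, grid=24,
                    dim=512, fast_pool=4):
    """video: (T, C, H, W) tensor of decoded frames."""
    T = video.shape[0]

    # Uniformly sample frame indices for each pathway.
    slow_idx = torch.linspace(0, T - 1, slow_frames).long()
    fast_idx = torch.linspace(0, T - 1, fast_frames).long()

    # Stand-in for a vision encoder: each frame becomes a
    # grid x grid map of patch tokens.
    def encode(frames):
        return torch.randn(frames.shape[0], grid, grid, dim)

    # Slow pathway: few frames, full spatial resolution.
    slow_tok = encode(video[slow_idx])
    # Fast pathway: many frames, spatially pooled to shrink token count.
    fast_tok = encode(video[fast_idx])
    fast_tok = F.avg_pool2d(fast_tok.permute(0, 3, 1, 2),
                            kernel_size=fast_pool).permute(0, 2, 3, 1)

    n_slow = slow_tok.shape[0] * slow_tok.shape[1] * slow_tok.shape[2]
    n_fast = fast_tok.shape[0] * fast_tok.shape[1] * fast_tok.shape[2]
    return n_slow, n_fast

# A long "video" of 240 dummy frames.
n_slow, n_fast = slowfast_tokens(torch.randn(240, 3, 112, 112))
print(f"slow tokens: {n_slow}, fast tokens: {n_fast}")  # 4608 and 1152
```

Run as-is, the slow stream contributes 4,608 tokens and the fast stream only 1,152, which is the kind of budget that lets a model cover long videos without the context window exploding.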

Read more on:

Apple

large language model

long-form video

Related news:

What's Next for Apple's iPad Lineup

Apple in talks to use Google's Gemini AI to power revamped Siri

LLM Siri: The Complete Guide to Apple's AI Assistant Overhaul Coming in 2026