Apple trained a large language model to efficiently understand long-form video
Apple researchers have developed a version of the SlowFast-LLaVA model that beats larger models at long-form video understanding.
Of course, there are more efficient ways to train video LLMs (NVIDIA recently published an interesting paper on this), but this is the general idea behind Apple's study. What's more, the model overcomes one of the three shortcomings the researchers noted, and it performs well on image tasks too, including benchmarks for knowledge, math reasoning, OCR, and text-rich scenarios. The team also tested several video compression strategies, but found that their setup struck the best balance between speed, accuracy, and token count.