Packing Input Frame Context in Next-Frame Prediction Models for Video Generation


A next-frame (or next-frame-section) prediction model works like this: we have many input frames and want to diffuse some new frames. The idea is that we can encode the input frames into a GPU memory layout where the "more important" frames are given more GPU resources (context length). The accompanying chart shows the logical GPU memory layout - the frame images themselves are not stitched together. In this example, F0 is the most important frame because it is the one nearest to the "next-frame prediction" target.
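The allocation idea above can be sketched in a few lines. This is a minimal illustration, assuming a simple geometric schedule where each step further into the past halves the token budget; the function name, the halving rule, and the specific numbers are assumptions for illustration, not the paper's exact scheme.

```python
# Illustrative sketch (assumed schedule, not the paper's exact method):
# give the newest input frame (F0) the largest context-length budget and
# compress older frames geometrically, so total context stays bounded
# no matter how long the input history grows.

def pack_frame_context(num_frames: int, base_tokens: int = 1024, min_tokens: int = 16):
    """Return a per-frame token budget, newest frame (F0) first."""
    budgets = []
    for i in range(num_frames):
        # Halve the budget for each step further into the past,
        # but never go below a small floor.
        budgets.append(max(base_tokens // (2 ** i), min_tokens))
    return budgets

budgets = pack_frame_context(6)
print(budgets)       # F0 gets the largest share: [1024, 512, 256, 128, 64, 32]
print(sum(budgets))  # geometric decay keeps the total under 2 * base_tokens
```

Because the per-frame budgets form a (roughly) geometric series, the total context length approaches a constant rather than growing linearly with the number of input frames, which is what makes a long history affordable on a fixed GPU memory budget.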
