Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
A next-frame (or next-frame-section) prediction model works like this: we have many input frames and want to diffuse some new frames. The idea is that we can encode the input frames into a GPU memory layout in which each frame occupies a different amount of space. The chart shows the logical GPU memory layout; the frame images themselves are not stitched together. The "more important" frames are given more GPU resources (context length): in this example, F0 is the most important, as it is the frame nearest to the next-frame-prediction target.
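To make the idea concrete, here is a minimal PyTorch sketch of such a packing scheme. The function name `pack_frame_context` and the `base_tokens`/`decay` parameters are hypothetical, and average pooling stands in for whatever per-frame compression the model actually uses; the point is only that each successively older frame gets a geometrically smaller token budget before all frames are concatenated into one context sequence.

```python
import torch
import torch.nn.functional as F

def pack_frame_context(frame_latents, base_tokens=1024, decay=2):
    """Pack past frames into one context sequence, giving nearer
    frames more tokens.  A sketch, not the exact mechanism:
    frame_latents[0] is the frame closest to the prediction target
    and keeps the full token budget; each older frame is pooled so
    its token count shrinks roughly as base_tokens / decay**i.
    """
    packed = []
    for i, lat in enumerate(frame_latents):  # lat: (C, H, W) latent
        budget = max(1, base_tokens // decay**i)
        c, h, w = lat.shape
        # Pick a pooled grid whose h2*w2 is close to the budget,
        # preserving the frame's aspect ratio.
        scale = (budget / (h * w)) ** 0.5
        h2, w2 = max(1, round(h * scale)), max(1, round(w * scale))
        pooled = F.adaptive_avg_pool2d(lat.unsqueeze(0), (h2, w2))[0]
        packed.append(pooled.flatten(1).T)  # (h2*w2, C) tokens
    return torch.cat(packed, dim=0)  # one packed context sequence

# Example: four past frames, each a 16-channel 32x32 latent.
frames = [torch.randn(16, 32, 32) for _ in range(4)]
ctx = pack_frame_context(frames)
print(ctx.shape)  # about (1930, 16): 1024 + 529 + 256 + 121 tokens
```

With four 32x32 latents, the packed context comes to roughly 1930 tokens instead of the 4096 a uniform layout would need, which is exactly the effect of giving F0 the largest share of the context length.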