Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
A next-frame (or next-frame-section) prediction model works like this: we have many input frames and want to diffuse some new frames. The idea is that we can encode the input frames into a GPU memory layout in which each frame occupies a different amount of space. The chart shows the logical GPU memory layout; the frame images themselves are not stitched together. The "more important" frames are given more GPU resources (context length): in this example, F0 is the most important, as it is the frame nearest to the next-frame-prediction target.
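To make the idea concrete, here is a minimal PyTorch sketch of such a packing scheme. The function name `pack_frame_context` and the `base_tokens`/`decay` parameters are hypothetical, and average pooling stands in for whatever per-frame compression the model actually uses; the point is only that each successively older frame gets a geometrically smaller token budget before all frames are concatenated into one context sequence.

```python
import torch
import torch.nn.functional as F

def pack_frame_context(frame_latents, base_tokens=1024, decay=2):
    """Pack past frames into one context sequence, giving nearer
    frames more tokens.  A sketch, not the exact mechanism:
    frame_latents[0] is the frame closest to the prediction target
    and keeps the full token budget; each older frame is pooled so
    its token count shrinks roughly as base_tokens / decay**i.
    """
    packed = []
    for i, lat in enumerate(frame_latents):  # lat: (C, H, W) latent
        budget = max(1, base_tokens // decay**i)
        c, h, w = lat.shape
        # Pick a pooled grid whose h2*w2 is close to the budget,
        # preserving the frame's aspect ratio.
        scale = (budget / (h * w)) ** 0.5
        h2, w2 = max(1, round(h * scale)), max(1, round(w * scale))
        pooled = F.adaptive_avg_pool2d(lat.unsqueeze(0), (h2, w2))[0]
        packed.append(pooled.flatten(1).T)  # (h2*w2, C) tokens
    return torch.cat(packed, dim=0)  # one packed context sequence

# Example: four past frames, each a 16-channel 32x32 latent.
frames = [torch.randn(16, 32, 32) for _ in range(4)]
ctx = pack_frame_context(frames)
print(ctx.shape)  # about (1930, 16): 1024 + 529 + 256 + 121 tokens
```

With four 32x32 latents, the packed context comes to roughly 1930 tokens instead of the 4096 a uniform layout would need, which is exactly the effect of giving F0 the largest share of the context length.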