How Does GPT-4o Encode Images?
Here’s a fact: GPT-4o charges 170 tokens to process each 512x512 tile used in high-res mode. At ~0.75 tokens/word, this suggests a picture is worth about 227 words, only a factor of four off from the traditional saying. (There’s also an 85-token charge for a low-res ‘master thumbnail’ of each picture, and higher-resolution images are broken into many such 512x512 tiles, but let’s focus on a single high-res tile.)
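The arithmetic above can be sketched directly. This is a simplified cost model built only from the two charges mentioned here (85 base tokens plus 170 per tile); the real API also rescales images before tiling, which this ignores:

```python
import math

BASE_TOKENS = 85        # low-res 'master thumbnail' charge
TILE_TOKENS = 170       # per 512x512 high-res tile
TOKENS_PER_WORD = 0.75  # rough English average

def image_token_cost(width: int, height: int) -> int:
    """Simplified high-res token cost: base charge plus one
    charge per 512x512 tile needed to cover the image."""
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return BASE_TOKENS + TILE_TOKENS * tiles

# A single high-res tile is "worth" about this many words:
print(round(TILE_TOKENS / TOKENS_PER_WORD))  # 227
print(image_token_cost(512, 512))            # 255 (one tile + thumbnail)
```

So a single 512x512 image costs 85 + 170 = 255 tokens under this model, and the 170-token tile charge divided by 0.75 tokens/word gives the ~227-word figure.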
Releases like GPT-4 Turbo were actually faster and cheaper than earlier versions, and a reduction in embedding dimension may have been part of that, if the developers had benchmarks showing that the smaller size was just as good in terms of quality. The goal is to suggest a workable CNN architecture that connects the known input size (512x512 images with 3 RGB color channels) to the assumed output shape (13x13 embedding vectors with 12,288 dimensions each). The pyramid strategy has a lot of intuitive appeal, as it feels like an almost “obvious” way to encode spatial information at different zoom levels, and may explain why it does so well with the 5x5 grid and below and so poorly on 6x6 and above.
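To see that the shapes can line up at all, here is one purely illustrative stack of convolutions connecting 512x512 input to a 13x13 grid. The kernel sizes and strides are hypothetical, not OpenAI's actual architecture; the point is only the output-size arithmetic:

```python
def conv_out(n: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Standard convolution output-size formula for one spatial dimension."""
    return (n + 2 * padding - kernel) // stride + 1

size = 512
for _ in range(5):
    # five stride-2, 3x3 convs with padding 1: 512 -> 256 -> 128 -> 64 -> 32 -> 16
    size = conv_out(size, kernel=3, stride=2, padding=1)

# one final 4x4 conv with stride 1 and no padding: 16 -> 13
size = conv_out(size, kernel=4, stride=1)
print(size)  # 13
```

A final 1x1 projection could then map whatever channel count the conv stack produces up to the assumed 12,288-dimensional embedding at each of the 13x13 = 169 grid positions, which (plus the thumbnail) matches the 170-token charge.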