The Speed of ViTs and CNNs
However, I believe this criticism is a misguided knee-jerk reaction. In practice, ViTs scale perfectly fine up to at least 1024x1024 px, which is enough for the vast majority of usage scenarios for image encoders. Higher resolutions exist purely for human consumption: for the aesthetic beauty of very crisp lines, and to avoid eye strain. Besides all this, there is a very simple and elegant mechanism to make high-resolution ViTs even faster and more memory-efficient: local attention.
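To get a rough feel for why local attention helps, the sketch below counts query-key pairs per attention layer for global attention versus non-overlapping windowed attention. The patch size of 16 and the 16x16-token window are illustrative assumptions, not values taken from the text above, and the function names are hypothetical. The count ignores heads, channel width, and the MLP blocks, so it only approximates the attention part of the cost.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image (class token ignored)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def global_attention_pairs(image_size: int, patch_size: int) -> int:
    """Query-key pairs when every token attends to every token."""
    n = num_tokens(image_size, patch_size)
    return n * n


def local_attention_pairs(image_size: int, patch_size: int, window: int) -> int:
    """Query-key pairs when tokens attend only within non-overlapping
    window x window blocks of the token grid (assumed layout)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    if side % window != 0:
        raise ValueError("token grid side must be divisible by window")
    num_windows = (side // window) ** 2
    tokens_per_window = window * window
    return num_windows * tokens_per_window ** 2


if __name__ == "__main__":
    patch, window = 16, 16  # assumed values, for illustration only
    for size in (256, 512, 1024):
        g = global_attention_pairs(size, patch)
        l = local_attention_pairs(size, patch, window)
        print(f"{size}x{size}: tokens={num_tokens(size, patch)}, "
              f"global={g}, local={l}, ratio={g / l:.0f}x")
```

Under these assumptions, global attention cost grows with the square of the token count, while windowed attention grows linearly with it, since each added window contributes a fixed number of pairs. Purely windowed attention does not exchange information across windows, so designs that use it typically keep some global layers or another mixing mechanism.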