
Grok-1.5 Vision Preview


Connecting the digital and physical worlds with our first multimodal model.

In addition to its strong text capabilities, Grok can now process a wide variety of visual information, including documents, diagrams, charts, screenshots, and photographs. Grok-1.5V is competitive with existing frontier multimodal models in a number of domains, ranging from multi-disciplinary reasoning to understanding documents, science diagrams, charts, screenshots, and photographs.

Benchmark                               Grok-1.5V  GPT-4V  Claude 3 Sonnet  Claude 3 Opus  Gemini Pro 1.5
MMMU (Multi-discipline)                 53.6%      56.8%   53.1%            59.4%          58.5%
Mathvista (Math)                        52.8%      49.9%   47.9%            50.5%          52.1%
AI2D* (Diagrams)                        88.3%      78.2%   88.7%            88.1%          80.3%
TextVQA (Text reading)                  78.1%      78.0%   -                -              73.5%
ChartQA (Charts)                        76.1%      78.5%   81.1%            80.8%          81.3%
DocVQA (Documents)                      85.6%      88.4%   89.5%            89.3%          86.5%
RealWorldQA (Real-world understanding)  68.7%      61.4%   51.9%            49.8%          67.5%

* For samples missing annotations such as A, B, C, etc., we render bounding boxes and corresponding letters at the relevant areas in the image.
