
Kagi LLM Benchmarking Project



Introducing the Kagi LLM Benchmarking Project, which evaluates major large language models (LLMs) on their reasoning, coding, and instruction-following capabilities.

| Model | Accuracy (%) | Tokens | Total Cost ($) | Median Latency (s) | Speed (tokens/sec) |
|---|---|---|---|---|---|
| OpenAI gpt-4o | 52.00 | 7482 | 0.14310 | 1.60 | 48.00 |
| Together meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo | 50.00 | 7767 | 0.07136 | 2.00 | 46.49 |
| Anthropic claude-3.5-sonnet-20240620 | 46.00 | 6595 | 0.12018 | 2.54 | 48.90 |
| Mistral large-latest | 44.00 | 5097 | 0.06787 | 3.08 | 18.03 |
| Groq llama-3.1-70b-versatile | 40.00 | 5190 | 0.00781 | 0.71 | 81.62 |
| Reka reka-core | 36.00 | 6966 | 0.12401 | 6.21 | 17.56 |
| OpenAI gpt-4o-mini | 34.00 | 6029 | 0.00451 | 1.64 | 36.92 |
| DeepSeek deepseek-chat | 32.00 | 7310 | 0.00304 | 4.81 | 17.20 |
| Anthropic claude-3-haiku-20240307 | 28.00 | 5642 | 0.00881 | 1.33 | 55.46 |
| Groq llama-3.1-8b-instant | 28.00 | 6628 | 0.00085 | 2.26 | 82.02 |
| DeepSeek deepseek-coder | 28.00 | 8079 | 0.00327 | 4.13 | 16.72 |
| OpenAI gpt-4 | 26.00 | 2477 | 0.33408 | 1.32 | 16.68 |
| Mistral open-mistral-nemo | 22.00 | 4135 | 0.00323 | 0.65 | 82.65 |
| Groq gemma2-9b-it | 22.00 | 4889 | 0.00249 | 1.69 | 54.39 |
| OpenAI gpt-3.5-turbo | 22.00 | 1569 | 0.01552 | 0.51 | 45.03 |
| Reka reka-edge | 20.00 | 5377 | 0.00798 | 2.02 | 46.87 |
| Reka reka-flash | 16.00 | 5738 | 0.01668 | 3.28 | 28.75 |
| GoogleGenAI gemini-1.5-flash | 14.00 | 5287 | 0.02777 | 3.02 | 21.16 |
| GoogleGenAI gemini-1.5-pro | 12.00 | 5284 | 0.27762 | 3.32 | 16.49 |

The table includes metrics such as overall model quality (measured as the percentage of correct responses), total tokens output (some models are less verbose by default, which affects both cost and speed), total cost to run the test, median response latency, and average speed in tokens per second at the time of testing. The table is updated to the best of our abilities; feel free to submit changes by editing this page.
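For illustration, here is a minimal sketch of how per-model metrics like these could be aggregated from individual benchmark runs. The `PromptResult` record, field names, and aggregation choices below are assumptions made for this example, not the project's actual harness.

```python
# Hypothetical aggregation of per-prompt results into the reported metrics.
# This is an illustrative sketch, not Kagi's benchmarking code.
from dataclasses import dataclass
from statistics import median


@dataclass
class PromptResult:
    correct: bool        # did the model answer this task correctly?
    output_tokens: int   # tokens generated for this prompt
    cost_usd: float      # provider cost charged for this request
    latency_s: float     # wall-clock time to complete the response


def aggregate(results: list[PromptResult]) -> dict:
    total_tokens = sum(r.output_tokens for r in results)
    total_time = sum(r.latency_s for r in results)
    return {
        "accuracy_pct": 100.0 * sum(r.correct for r in results) / len(results),
        "tokens": total_tokens,
        "total_cost_usd": sum(r.cost_usd for r in results),
        "median_latency_s": median(r.latency_s for r in results),
        "speed_tok_per_s": total_tokens / total_time,  # average tokens/sec across the run
    }


# Example: two hypothetical prompt results for one model.
print(aggregate([
    PromptResult(True, 120, 0.0021, 1.4),
    PromptResult(False, 95, 0.0017, 1.1),
]))
```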

