Kagi LLM Benchmarking Project
Introducing the Kagi LLM Benchmarking Project, which evaluates major large language models (LLMs) on their reasoning, coding, and instruction-following capabilities.

| Model | Accuracy (%) | Tokens | Total Cost ($) | Median Latency (s) | Speed (tokens/sec) |
|---|---|---|---|---|---|
| OpenAI gpt-4o | 52.00 | 7482 | 0.14310 | 1.60 | 48.00 |
| Together meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo | 50.00 | 7767 | 0.07136 | 2.00 | 46.49 |
| Anthropic claude-3.5-sonnet-20240620 | 46.00 | 6595 | 0.12018 | 2.54 | 48.90 |
| Mistral large-latest | 44.00 | 5097 | 0.06787 | 3.08 | 18.03 |
| Groq llama-3.1-70b-versatile | 40.00 | 5190 | 0.00781 | 0.71 | 81.62 |
| Reka reka-core | 36.00 | 6966 | 0.12401 | 6.21 | 17.56 |
| OpenAI gpt-4o-mini | 34.00 | 6029 | 0.00451 | 1.64 | 36.92 |
| DeepSeek deepseek-chat | 32.00 | 7310 | 0.00304 | 4.81 | 17.20 |
| Anthropic claude-3-haiku-20240307 | 28.00 | 5642 | 0.00881 | 1.33 | 55.46 |
| Groq llama-3.1-8b-instant | 28.00 | 6628 | 0.00085 | 2.26 | 82.02 |
| DeepSeek deepseek-coder | 28.00 | 8079 | 0.00327 | 4.13 | 16.72 |
| OpenAI gpt-4 | 26.00 | 2477 | 0.33408 | 1.32 | 16.68 |
| Mistral open-mistral-nemo | 22.00 | 4135 | 0.00323 | 0.65 | 82.65 |
| Groq gemma2-9b-it | 22.00 | 4889 | 0.00249 | 1.69 | 54.39 |
| OpenAI gpt-3.5-turbo | 22.00 | 1569 | 0.01552 | 0.51 | 45.03 |
| Reka reka-edge | 20.00 | 5377 | 0.00798 | 2.02 | 46.87 |
| Reka reka-flash | 16.00 | 5738 | 0.01668 | 3.28 | 28.75 |
| GoogleGenAI gemini-1.5-flash | 14.00 | 5287 | 0.02777 | 3.02 | 21.16 |
| GoogleGenAI gemini-1.5-pro | 12.00 | 5284 | 0.27762 | 3.32 | 16.49 |

The table includes metrics such as overall model quality (measured as the percentage of correct responses), total tokens output (some models are less verbose by default, affecting both cost and speed), total cost to run the test, median response latency, and average speed in tokens per second at the time of testing. The table above is updated to the best of our abilities; feel free to submit changes by editing this page.
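As a rough illustration of how the columns above can be aggregated from per-prompt results, here is a minimal Python sketch. The `BenchResult` and `summarize` names are hypothetical and not part of the actual benchmark harness; real scoring, token counting, and provider pricing details may differ.

```python
# Minimal sketch: aggregate per-prompt results into the table's columns.
# BenchResult and summarize are illustrative names, not the Kagi harness.
from dataclasses import dataclass
from statistics import median


@dataclass
class BenchResult:
    correct: bool        # did the model answer this prompt correctly?
    output_tokens: int   # tokens generated for this prompt
    cost_usd: float      # provider-billed cost for this request
    latency_s: float     # wall-clock time for the full response


def summarize(results: list[BenchResult]) -> dict[str, float]:
    """Compute accuracy, token, cost, latency, and speed aggregates."""
    total_tokens = sum(r.output_tokens for r in results)
    total_time = sum(r.latency_s for r in results)
    return {
        "accuracy_pct": 100.0 * sum(r.correct for r in results) / len(results),
        "tokens": total_tokens,
        "total_cost_usd": sum(r.cost_usd for r in results),
        "median_latency_s": median(r.latency_s for r in results),
        "speed_tok_per_s": total_tokens / total_time,
    }
```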