MCP-Universe benchmark shows GPT-5 fails more than half of real-world orchestration tasks
A new benchmark from Salesforce researchers evaluates model and agentic performance on real-world enterprise tasks.
“Existing benchmarks predominantly focus on isolated aspects of LLM performance, such as instruction following, math reasoning, or function calling, without providing a comprehensive assessment of how models interact with real-world MCP servers across diverse scenarios,” Salesforce said in the paper. “This is why it’s crucial not to take a DIY approach with a single model to power agents alone, but instead to rely on a platform that combines data context, enhanced reasoning, and trust guardrails to truly meet the needs of enterprise AI.”

The repository management domain covers codebase operations: it connects to the GitHub MCP server to expose version-control tools such as repository search, issue tracking, and code editing.
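For readers unfamiliar with how an agent attaches to such a server, here is a minimal sketch using the official MCP Python SDK: it launches a GitHub MCP server over stdio, lists the tools it exposes, and calls one. The Docker image, environment variable, and `search_repositories` tool name are assumptions based on GitHub's public MCP server, not details taken from the Salesforce paper.

```python
# Minimal sketch: connect to a GitHub MCP server over stdio and list its tools.
# Assumes the MCP Python SDK (`pip install mcp`) and GitHub's public MCP server
# Docker image; the image name, env var, and tool name are assumptions.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Launch the GitHub MCP server as a subprocess speaking MCP over stdio.
server = StdioServerParameters(
    command="docker",
    args=[
        "run", "-i", "--rm",
        "-e", "GITHUB_PERSONAL_ACCESS_TOKEN",  # forwarded from the host env
        "ghcr.io/github/github-mcp-server",
    ],
)

async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()  # MCP handshake

            # Enumerate the version-control tools the server exposes
            # (repo search, issue tracking, code editing, etc.).
            tools = await session.list_tools()
            for tool in tools.tools:
                print(f"{tool.name}: {tool.description}")

            # Example call, assuming the server exposes `search_repositories`.
            result = await session.call_tool(
                "search_repositories", {"query": "language:python stars:>1000"}
            )
            print(result.content)

asyncio.run(main())
```

Benchmarks like MCP-Universe exercise exactly this loop: the model must discover the server's tools at runtime and decide which to invoke, rather than relying on a hard-coded function list.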
Or read this on VentureBeat