LLM Benchmark for 'Longform Creative Writing'


An LLM-judged benchmark for longform creative writing (v3), part of the Emotional Intelligence Benchmarks for LLMs (EQ-Bench) suite. The benchmark evaluates several abilities.

Models are typically evaluated via OpenRouter, using temp=0.7 and min_p=0.1 as the generation settings. Outputs are evaluated against a scoring rubric by Claude 3.7 Sonnet. The reported Score (0-100) is the overall final rating assigned by the judge LLM, scaled to 0–100.
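
As a concrete illustration of those generation settings, here is a minimal sketch of one request through OpenRouter's OpenAI-compatible chat completions endpoint. The model slug and prompt are placeholders, not the benchmark's actual harness:

    import os
    import requests

    # One generation request via OpenRouter with the benchmark's stated
    # sampling settings. Model slug and prompt are placeholders.
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "some-provider/some-model",  # placeholder slug
            "messages": [{"role": "user",
                          "content": "Write the opening scene of a short story."}],
            "temperature": 0.7,  # temp=0.7 per the benchmark settings
            "min_p": 0.1,        # min-p sampling; forwarded to providers that support it
        },
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])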

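The judging step could then look something like the hypothetical sketch below. Only the judge model (Claude 3.7 Sonnet) and the 0–100 scale come from the description above; the rubric text, criteria, and score parsing are invented for illustration:

    import os
    import re
    import requests

    JUDGE_MODEL = "anthropic/claude-3.7-sonnet"  # judge named above; exact slug is an assumption

    # Hypothetical rubric; the benchmark's real rubric is not reproduced here.
    RUBRIC = ("Score this story from 0 to 10 on each criterion: prose quality, "
              "coherence, emotional depth. Reply one per line, e.g. 'coherence: 7'.")

    def judge(story: str) -> float:
        resp = requests.post(
            "https://openrouter.ai/api/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
            json={
                "model": JUDGE_MODEL,
                "messages": [{"role": "user", "content": f"{RUBRIC}\n\n{story}"}],
                "temperature": 0.0,  # keep judging as deterministic as possible
            },
            timeout=120,
        )
        resp.raise_for_status()
        reply = resp.json()["choices"][0]["message"]["content"]
        scores = [float(m) for m in re.findall(r":\s*(\d+(?:\.\d+)?)", reply)]
        if not scores:
            raise ValueError("no parseable scores in judge reply")
        # Mean rubric score (0-10) scaled to the reported 0-100 range.
        return 10.0 * sum(scores) / len(scores)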

Related news:

a16z- and Benchmark-backed 11x has been claiming customers it doesn’t have

EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges

SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork