Task-specific LLM evals that do and don't work


Evals for classification, summarization, translation, copyright regurgitation, and toxicity.

One task here is abstractive summarization of opinions: generating a summary that captures the key aspects, and the associated sentiments, from a set of opinions such as customer feedback, social media posts, or product reviews. A natural way to grade such summaries is an LLM-based evaluator like G-Eval, which prompts a strong LLM, with chain-of-thought reasoning, to score each summary against the source. However, while G-Eval's reported Spearman correlation with human judgements surpasses previous SOTA evaluators, empirically it's unreliable (low recall), costly (at least double the token count), and has poor sensitivity to nuanced inconsistencies.
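
To make those two failure axes concrete, here is a minimal sketch (in Python, using scipy and scikit-learn, with made-up scores rather than data from the article) of how an evaluator's agreement with human judgements might be measured: Spearman correlation for overall rank alignment, and recall for how often it flags summaries that humans judged inconsistent.

```python
# Minimal sketch: comparing an LLM evaluator against human judgements.
# All scores below are invented for illustration, not from the article.
from scipy.stats import spearmanr
from sklearn.metrics import recall_score

# Hypothetical 1-5 consistency scores for ten summaries.
human_scores = [5, 4, 2, 5, 1, 3, 4, 2, 5, 3]
llm_scores   = [4, 4, 3, 5, 2, 3, 5, 4, 5, 3]

# Rank agreement between the evaluator and human raters.
rho, p_value = spearmanr(human_scores, llm_scores)
print(f"Spearman correlation: {rho:.2f} (p={p_value:.3f})")

# Treat scores <= 2 as "inconsistent" and check how many of the
# truly inconsistent summaries the evaluator catches (recall).
human_flags = [int(s <= 2) for s in human_scores]
llm_flags   = [int(s <= 2) for s in llm_scores]
print(f"Recall on inconsistent summaries: "
      f"{recall_score(human_flags, llm_flags):.2f}")
```

Note that the two numbers can diverge: an evaluator can rank summaries roughly in the right order (high correlation) while still missing most of the genuinely inconsistent ones (low recall), which is exactly the unreliability described above.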
