Task-specific LLM evals that do and don't work
Evals for classification, summarization, translation, copyright regurgitation, and toxicity.
This is where we generate a summary that captures the key aspects, and their associated sentiments, from a set of opinions such as customer feedback, social media posts, or product reviews. However, while G-Eval's reported Spearman correlation with human judgments surpasses previous SOTA evaluators, empirically it is unreliable (low recall), costly (at least double the token count), and poorly sensitive to nuanced inconsistencies.
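To make the cost and reliability concerns concrete, here is a minimal sketch of a G-Eval-style consistency check: prompting an LLM to rate a summary against its source on a 1-5 scale and parsing out the score. The prompt wording, the `gpt-4o-mini` model name, and the `geval_consistency` helper are illustrative assumptions, not the original G-Eval implementation.

```python
# A minimal G-Eval-style sketch (assumes the OpenAI Python client and an
# OPENAI_API_KEY in the environment; prompt wording is illustrative).
import re

from openai import OpenAI

client = OpenAI()

EVAL_PROMPT = """\
You will be given a source document and a summary of it.
Rate the summary's consistency with the source on a scale of 1 to 5,
where 1 means the summary contains unsupported or contradictory claims
and 5 means every claim is grounded in the source.

Source:
{source}

Summary:
{summary}

Respond with a single integer from 1 to 5."""


def geval_consistency(source: str, summary: str) -> int | None:
    """Score a summary's consistency with its source; None if unparseable."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": EVAL_PROMPT.format(source=source, summary=summary),
            }
        ],
        temperature=0,  # reduce variance across repeated scoring calls
    )
    # The model may not comply with "respond with a single integer",
    # so fall back to scanning the reply for the first digit in range.
    match = re.search(r"[1-5]", response.choices[0].message.content or "")
    return int(match.group()) if match else None
```

Note that each score requires sending the full source plus the summary to the evaluator, which is where the "at least double the token count" overhead comes from, and the regex fallback hints at the parsing fragility that contributes to low recall.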