Task-specific LLM evals that do and don't work
Evals for classification, summarization, translation, copyright regurgitation, and toxicity.
This is where we generate a summary that captures the key aspects, and their associated sentiments, from a set of opinions such as customer feedback, social media posts, or product reviews. However, while G-Eval's reported Spearman correlation with human judgments surpasses previous SOTA evaluators, empirically it is unreliable (low recall), costly (at least double the token count), and poorly sensitive to nuanced inconsistencies.
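To make the cost and reliability concerns concrete, here is a minimal sketch of a G-Eval-style consistency check: prompting an LLM to rate a summary against its source on a 1-5 scale and parsing out the score. The prompt wording, the `gpt-4o-mini` model name, and the `geval_consistency` helper are illustrative assumptions, not the original G-Eval implementation.

```python
# A minimal G-Eval-style sketch (assumes the OpenAI Python client and an
# OPENAI_API_KEY in the environment; prompt wording is illustrative).
import re

from openai import OpenAI

client = OpenAI()

EVAL_PROMPT = """\
You will be given a source document and a summary of it.
Rate the summary's consistency with the source on a scale of 1 to 5,
where 1 means the summary contains unsupported or contradictory claims
and 5 means every claim is grounded in the source.

Source:
{source}

Summary:
{summary}

Respond with a single integer from 1 to 5."""


def geval_consistency(source: str, summary: str) -> int | None:
    """Score a summary's consistency with its source; None if unparseable."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": EVAL_PROMPT.format(source=source, summary=summary),
            }
        ],
        temperature=0,  # reduce variance across repeated scoring calls
    )
    # The model may not comply with "respond with a single integer",
    # so fall back to scanning the reply for the first digit in range.
    match = re.search(r"[1-5]", response.choices[0].message.content or "")
    return int(match.group()) if match else None
```

Note that each score requires sending the full source plus the summary to the evaluator, which is where the "at least double the token count" overhead comes from, and the regex fallback hints at the parsing fragility that contributes to low recall.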