Evaluation Quirks, Metric Pitfalls and Some Recommendations
3) Optimistic results because of double-counting: Evaluating retrieved instances improperly in an IR setting can drive the F1 score up to 200%, exploding a scale that is supposed to end at 100%. Now feed such a function a list in which the same element accidentally occurs multiple times (something that can easily happen with generative AI output), and go figure! A sketch of the effect follows below. For deeper reading, and to help practitioners and researchers with this, I’ve written a paper that explores how to select the right metrics and make more sense of their behavior:
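To make the pitfall concrete, here is a minimal sketch of such a naive evaluator. The article's actual function isn't reproduced here, so `naive_f1` and its arguments are illustrative names, not the original code. It credits every occurrence of a relevant item as a hit, so duplicates in the retrieved list push recall past 1.0; with precision pinned at 1.0, the F1 score approaches 2.0, i.e. the 200% mentioned above.

```python
# Hypothetical sketch of the double-counting pitfall: an evaluator that
# scores each retrieved item independently, without deduplication.

def naive_f1(retrieved: list[str], relevant: set[str]) -> float:
    """F1 score that double-counts duplicate hits in `retrieved`."""
    # Every occurrence of a relevant item counts as a true positive,
    # so duplicates can inflate the hit count past len(relevant).
    hits = sum(1 for item in retrieved if item in relevant)
    precision = hits / len(retrieved)
    recall = hits / len(relevant)  # can exceed 1.0 with duplicates!
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

relevant = {"doc1", "doc2"}

# Clean retrieval: F1 = 1.0, as expected.
print(naive_f1(["doc1", "doc2"], relevant))  # 1.0

# A generative system emits each hit three times: precision stays 1.0,
# but recall = 6/2 = 3.0, so F1 = 1.5 -- already 150%. As the number of
# duplicates grows, F1 = 2PR/(P+R) approaches 2.0, i.e. 200%.
print(naive_f1(["doc1"] * 3 + ["doc2"] * 3, relevant))  # 1.5
```

The fix is to deduplicate before counting, e.g. `hits = len(set(retrieved) & relevant)`, which caps both precision and recall at 1.0 again.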