
Experimentation Matters: Why Nuenki isn't using pairwise evaluations

Pairwise evaluations require a huge amount of data - and money - to build a good model. Perhaps it's better to sacrifice their benefits in order to experiment more easily?

I've spent the last week or so working on a new benchmark using pairwise evaluations and a Bradley-Terry model. I then spent 100 USD attempting to run enough comparisons to bring the p-values down and let the model find some actual signal in the noise. There are, of course, various controls (e.g. consolidating duplicate translations into one, randomising the order, and blinding the test).
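
For readers who haven't met it before, a Bradley-Terry model gives each competitor a positive strength and assumes that i beats j with probability p_i / (p_i + p_j). The following is a minimal sketch of fitting one from win counts, not the benchmark's actual code: the function names, the simulated strengths, and the choice of a simple iterative (minorisation-maximisation) update are all assumptions made for illustration.

```python
import random


def fit_bradley_terry(wins, n_items, iterations=200):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[(i, j)] is the number of times item i beat item j.
    Returns a list of strengths normalised to sum to 1.
    (Hypothetical helper; uses the standard iterative MM update.)
    """
    strengths = [1.0 / n_items] * n_items
    for _ in range(iterations):
        updated = []
        for i in range(n_items):
            total_wins = 0
            denom = 0.0
            for j in range(n_items):
                if j == i:
                    continue
                total_wins += wins.get((i, j), 0)
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (strengths[i] + strengths[j])
            # Items that were never compared keep their current strength.
            updated.append(total_wins / denom if denom else strengths[i])
        norm = sum(updated)
        strengths = [s / norm for s in updated]
    return strengths


def simulate_comparisons(true_strengths, n_per_pair, rng):
    """Simulate a judge choosing winners under Bradley-Terry (made-up data)."""
    wins = {}
    n = len(true_strengths)
    for i in range(n):
        for j in range(i + 1, n):
            p_i_wins = true_strengths[i] / (true_strengths[i] + true_strengths[j])
            for _ in range(n_per_pair):
                winner, loser = (i, j) if rng.random() < p_i_wins else (j, i)
                wins[(winner, loser)] = wins.get((winner, loser), 0) + 1
    return wins


# Three hypothetical translators with invented "true" strengths.
rng = random.Random(0)
simulated = simulate_comparisons([0.6, 0.3, 0.1], n_per_pair=500, rng=rng)
estimated = fit_bradley_terry(simulated, n_items=3)
print([round(s, 3) for s in estimated])
```

Lowering `n_per_pair` in the simulation makes the estimates noisier, which is the cost problem in miniature: every extra comparison that tightens the estimates is another paid model call.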

