Experimentation Matters: Why Nuenki isn't using pairwise evaluations
Pairwise evaluations require a huge amount of data - and money - to build a good model. Perhaps it's better to sacrifice their benefits in exchange for easier experimentation?
I've spent the last week or so building a new benchmark around pairwise evaluations and a Bradley-Terry model. I then spent 100 USD running enough comparisons to drive the p-values down and let the model extract some actual signal from the noise. There are, of course, various controls: consolidating duplicate translations into one, randomising the order of each pair, and blinding the test.
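For context, a Bradley-Terry model assigns each system a latent score and treats the probability of one beating another in a pairwise judgement as a logistic function of their score difference; the scores are then fit by maximum likelihood over all recorded comparisons. Below is a minimal sketch of that fitting procedure - the model names and win/loss data are hypothetical stand-ins for blinded, order-randomised judgements, not Nuenki's actual benchmark code.

```python
# Minimal Bradley-Terry fit via maximum likelihood.
# The "models" and "wins" below are illustrative placeholders.
import numpy as np
from scipy.optimize import minimize

models = ["model_a", "model_b", "model_c"]
# Each entry is (winner_index, loser_index) for one pairwise judgement.
wins = [(0, 1), (0, 1), (1, 2), (0, 2), (2, 1), (0, 2)]

def neg_log_likelihood(scores):
    # P(i beats j) = sigmoid(scores[i] - scores[j])
    ll = 0.0
    for winner, loser in wins:
        diff = scores[winner] - scores[loser]
        ll += -np.log1p(np.exp(-diff))  # log of sigmoid(diff)
    return -ll

# Pin the first score to 0 for identifiability; optimise only the rest.
def objective(free_scores):
    return neg_log_likelihood(np.concatenate(([0.0], free_scores)))

result = minimize(objective, x0=np.zeros(len(models) - 1), method="BFGS")
scores = np.concatenate(([0.0], result.x))
for name, score in sorted(zip(models, scores), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```

With only a handful of judgements per pair, the fitted score differences have wide confidence intervals, which is why getting statistically meaningful separations requires so many comparisons - and so much spend.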