LLMs know more than what they say
... and how that provides winning evals
More structured approaches to evaluation have also yielded mixed results in practice: while metric-, tool-, and model-based evals are fast and inexpensive, most mission-critical applications still rely on slow and costly human review as the gold standard for accuracy. Lynx [11] is a recently published model created by fine-tuning Llama-3-Instruct on 3,200 samples from the HaluBench data sources, i.e. similarly structured passage-question-answer hallucination detection examples (but not including HaluEval).

Figure: Comparison of zero-shot prompting (base-model completion), latent space readout (LSR), and fine-tuning on similarly structured data (Lynx as an example).
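To make the LSR approach concrete, here is a minimal sketch of the core idea: a linear probe trained on hidden-state vectors to read out a hallucination label. The hidden states below are synthetic stand-ins (random vectors with a planted linear signal); in a real setup they would be extracted from an intermediate layer of the LLM, and all shapes and hyperparameters here are illustrative assumptions, not the published method.

```python
# Sketch of latent space readout (LSR): fit a linear probe on hidden-state
# vectors to predict a binary hallucination label.
# NOTE: the "hidden states" are synthetic; a real pipeline would pull them
# from a transformer layer for each passage-question-answer example.
import numpy as np

rng = np.random.default_rng(0)
d = 64    # hidden-state dimension (assumption)
n = 200   # number of labeled examples (assumption)

# Synthetic hidden states with a planted, linearly readable signal:
# labels correlate with the projection onto a fixed direction, plus noise.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X = rng.normal(size=(n, d))
y = (X @ direction + 0.5 * rng.normal(size=n) > 0).astype(int)

# Logistic-regression probe trained by plain gradient descent (the readout).
w = np.zeros(d)
b = 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(hallucination)
    w -= lr * (X.T @ (p - y)) / n            # gradient of mean log-loss
    b -= lr * np.mean(p - y)

acc = np.mean(((X @ w + b) > 0).astype(int) == y)
print(f"probe train accuracy: {acc:.2f}")
```

The point of the sketch is the comparison in the figure above: no prompting or fine-tuning of the base model is needed, only a cheap linear classifier over activations that the model already computes.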