LLMs know more than what they say
... and how that provides winning evals
More structured approaches to evaluation have also yielded mixed results in practice: while metric-, tool-, and model-based evals are fast and inexpensive, most mission-critical applications still rely on slow and costly human review as the gold standard for accuracy. Lynx [11] is a recently published model created by fine-tuning Llama-3-Instruct on 3,200 samples from the HaluBench data sources, i.e. similarly structured passage-question-answer hallucination detection examples (but not including HaluEval).

Figure: Comparison of zero-shot prompting (base-model completion), latent space readout (LSR), and fine-tuning on similarly structured data (Lynx as an example).
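To make the LSR approach concrete, here is a minimal sketch of the core idea: a linear probe trained on hidden-state vectors to read out a hallucination label. The hidden states below are synthetic stand-ins (random vectors with a planted linear signal); in a real setup they would be extracted from an intermediate layer of the LLM, and all shapes and hyperparameters here are illustrative assumptions, not the published method.

```python
# Sketch of latent space readout (LSR): fit a linear probe on hidden-state
# vectors to predict a binary hallucination label.
# NOTE: the "hidden states" are synthetic; a real pipeline would pull them
# from a transformer layer for each passage-question-answer example.
import numpy as np

rng = np.random.default_rng(0)
d = 64    # hidden-state dimension (assumption)
n = 200   # number of labeled examples (assumption)

# Synthetic hidden states with a planted, linearly readable signal:
# labels correlate with the projection onto a fixed direction, plus noise.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X = rng.normal(size=(n, d))
y = (X @ direction + 0.5 * rng.normal(size=n) > 0).astype(int)

# Logistic-regression probe trained by plain gradient descent (the readout).
w = np.zeros(d)
b = 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(hallucination)
    w -= lr * (X.T @ (p - y)) / n            # gradient of mean log-loss
    b -= lr * np.mean(p - y)

acc = np.mean(((X @ w + b) > 0).astype(int) == y)
print(f"probe train accuracy: {acc:.2f}")
```

The point of the sketch is the comparison in the figure above: no prompting or fine-tuning of the base model is needed, only a cheap linear classifier over activations that the model already computes.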