Get the latest tech news

Extract-0: A specialized language model for document information extraction


This paper presents Extract-0, a 7-billion parameter language model specifically optimized for document information extraction that achieves performance exceeding models with parameter counts several orders of magnitude larger. Through a novel combination of synthetic data generation, supervised fine-tuning with Low-Rank Adaptation (LoRA), and reinforcement learning via Group Relative Policy Optimization (GRPO), Extract-0 achieves a mean reward of 0.573 on a benchmark of 1,000 diverse document extraction tasks, outperforming GPT-4.1 (0.457), o3 (0.464), and GPT-4.1-2025 (0.459). The training methodology employs a memory-preserving synthetic data generation pipeline that produces 280,128 training examples from diverse document sources, followed by parameterefficient fine-tuning that modifies only 0.53% of model weights (40.4M out of 7.66B parameters). The reinforcement learning phase introduces a novel semantic similarity-based reward function that handles the inherent ambiguity in information extraction tasks. This research demonstrates that task-specific optimization can yield models that surpass general-purpose systems while requiring substantially fewer computational resource.

None

Get the Android app

Or read this on Hacker News

Read more on:

Photo of OpenAI

OpenAI

Photo of OpenAI O3

OpenAI O3

Photo of document extraction

document extraction

Related news:

News photo

OpenAI responds to furious ChatGPT subscribers who accuse it of secretly switching to inferior models

News photo

OpenAI will let you buy things from Etsy within ChatGPT

News photo

Top A.I. Researchers Leave OpenAI, Google and Meta for New Start-Up