Reader-LM: Small Language Models for Cleaning and Converting HTML to Markdown


Reader-LM-0.5B and Reader-LM-1.5B are two novel small language models inspired by Jina Reader, designed to convert raw, noisy HTML from the open web into clean markdown.

Main content extraction: the models were evaluated on their ability to accurately convert body text, preserving paragraphs, formatting lists, and keeping the presentation consistent.

When we realized how much time and effort went into preparing the training data (using dynamic programming and heuristics to create perfect token-level labeling sequences), we decided to discontinue this approach.

There is still much room for improvement in both efficiency and quality: expanding the context length, speeding up decoding, and adding support for instructions in the input, which would allow Reader-LM to extract specific parts of a webpage into markdown.
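To make the token-level labeling mentioned above more concrete, here is a minimal sketch of how dynamic programming could mark which HTML tokens survive into the markdown. The summary does not describe the actual procedure, so everything here is an assumption for illustration: the longest-common-subsequence alignment, the hypothetical `label_tokens` function, and the toy token lists.

```python
def label_tokens(source, target):
    """Label each source token 1 (kept in target) or 0 (dropped).

    Assumption: "kept" means the token is part of a longest common
    subsequence between the HTML tokens and the markdown tokens.
    """
    n, m = len(source), len(target)
    # lcs[i][j] = LCS length of source[i:] and target[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if source[i] == target[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    # Walk the table from the start to recover one optimal alignment.
    labels = [0] * n
    i = j = 0
    while i < n and j < m:
        if source[i] == target[j]:
            labels[i] = 1
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1  # skipping this source token loses nothing
        else:
            j += 1  # this target token has no match from here
    return labels


# Toy example (hypothetical tokens, not real training data).
html_tokens = ["<div>", "<p>", "Hello", "world", "</p>",
               "<span>", "Ad", "</span>", "</div>"]
markdown_tokens = ["Hello", "world"]
labels = label_tokens(html_tokens, markdown_tokens)
print(labels)
```

The table alone costs time and memory proportional to the product of the two sequence lengths, which gives a rough sense of why preparing such labels for long web pages can become expensive.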

