Get the latest tech news
Reader-LM: Small Language Models for Cleaning and Converting HTML to Markdown
Reader-LM-0.5B and Reader-LM-1.5B are two novel small language models inspired by Jina Reader, designed to convert raw, noisy HTML from the open web into clean markdown.
Main Content Extraction: Evaluated the models' ability to accurately convert body text, preserving paragraphs, formatting lists, and maintaining consistency in presentation. When we realized that the time and effort spent preparing the training data—using dynamic programming and heuristics to create perfect token-level labeling sequences—was significant, we decided to discontinue this approach. There's still much room for improvement in terms of both efficiency and quality: expanding the context length, speeding up decoding, and adding support for instructions in the input, which would allow Reader-LM to extract specific parts of a webpage into markdown.
Or read this on Hacker News