MM1.5: Methods, Analysis and Insights from Multimodal LLM Fine-Tuning


We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture, MM1.5 adopts a data-centric approach to model training, systematically exploring the impact of diverse data mixtures across the entire model training lifecycle. This includes high-quality OCR data and synthetic captions for continual pre-training, as well as an optimized visual instruction-tuning data mixture for supervised fine-tuning. Our models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants, and demonstrate that careful data curation and training strategies can yield strong performance even at small scales (1B and 3B). Additionally, we introduce two specialized variants: MM1.5-Video, designed for video understanding, and MM1.5-UI, tailored for mobile UI understanding. Through extensive empirical studies and ablations, we provide detailed insights into the training processes and decisions that inform our final designs, offering valuable guidance for future research in MLLM development.
