Invalid SMILES are beneficial rather than detrimental to chemical language models


Generative models for chemical structures are often trained to produce output in the common SMILES notation. Michael Skinnider shows that training models to avoid generating invalid SMILES strings is detrimental to learning other chemical properties, and that allowing models to generate invalid molecules, which can easily be removed post hoc, leads to better-performing models.

This finding suggests that removing invalid SMILES has the effect of filtering low-quality samples from the model output, which in turn would be expected to improve performance on distribution-learning metrics such as the Fréchet ChemNet distance.

[Figure panel d: saturation curve showing the proportion of the full GDB-13 database reproduced after sampling a given number of valid molecules from chemical language models trained on SMILES versus SELFIES.]

Finally, I show that language models can contribute to the structure elucidation of a range of complex small molecules [7–9], including natural products, environmental pollutants and food-derived compounds, and that the ability to generate invalid outputs improves performance on these tasks.
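The post hoc filtering described above can be sketched in a few lines: let the model sample freely, then discard any string that fails to parse. In a real pipeline one would use a full SMILES parser (for example RDKit's `Chem.MolFromSmiles`, which returns `None` for invalid input); the toy check below is only a hypothetical stand-in that screens for balanced brackets and paired ring-closure digits, not the paper's actual method.

```python
# Post hoc removal of invalid SMILES from model output.
# A production pipeline would replace looks_valid() with a real parser,
# e.g. RDKit: Chem.MolFromSmiles(s) is not None.

def looks_valid(smiles: str) -> bool:
    """Cheap syntactic screen standing in for a real SMILES parser:
    brackets must nest correctly and each ring-closure digit must
    appear an even number of times (open + close)."""
    pairs = {")": "(", "]": "["}
    stack = []
    ring_digits = {}
    for ch in smiles:
        if ch in "([":
            stack.append(ch)
        elif ch in ")]":
            if not stack or stack.pop() != pairs[ch]:
                return False
        elif ch.isdigit():
            ring_digits[ch] = ring_digits.get(ch, 0) + 1
    return not stack and all(n % 2 == 0 for n in ring_digits.values())

# Toy "model output": two parseable strings, two truncated ones.
samples = ["c1ccccc1", "CC(=O)O", "C1CC", "CC(C"]
kept = [s for s in samples if looks_valid(s)]
print(kept)  # → ['c1ccccc1', 'CC(=O)O']
```

The point of the paper is that this filter is cheap to apply after sampling, so there is no need to constrain the model's output space (as SELFIES does) during training.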
