Invalid SMILES are beneficial rather than detrimental to chemical language models


Generative models for chemical structures are often trained to produce output in the common SMILES notation. Michael Skinnider shows that training models to avoid generating invalid SMILES strings is detrimental to learning other chemical properties, and that allowing models to generate invalid molecules, which can easily be removed post hoc, leads to better-performing models.

This finding suggests that removing invalid SMILES has the effect of filtering low-quality samples from the model output, which in turn would be expected to improve performance on distribution-learning metrics such as the Fréchet ChemNet distance.

[Figure panel d: saturation curve showing the proportion of the full GDB-13 database reproduced after sampling a given number of valid molecules from chemical language models trained on SMILES versus SELFIES.]

Finally, I show that language models can contribute to the structure elucidation of a range of complex small molecules [7–9], including natural products, environmental pollutants and food-derived compounds, and that the ability to generate invalid outputs improves performance on these tasks.
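The post hoc filtering described above can be sketched in a few lines: let the model sample freely, then discard any string that fails to parse. In a real pipeline one would use a full SMILES parser (for example RDKit's `Chem.MolFromSmiles`, which returns `None` for invalid input); the toy check below is only a hypothetical stand-in that screens for balanced brackets and paired ring-closure digits, not the paper's actual method.

```python
# Post hoc removal of invalid SMILES from model output.
# A production pipeline would replace looks_valid() with a real parser,
# e.g. RDKit: Chem.MolFromSmiles(s) is not None.

def looks_valid(smiles: str) -> bool:
    """Cheap syntactic screen standing in for a real SMILES parser:
    brackets must nest correctly and each ring-closure digit must
    appear an even number of times (open + close)."""
    pairs = {")": "(", "]": "["}
    stack = []
    ring_digits = {}
    for ch in smiles:
        if ch in "([":
            stack.append(ch)
        elif ch in ")]":
            if not stack or stack.pop() != pairs[ch]:
                return False
        elif ch.isdigit():
            ring_digits[ch] = ring_digits.get(ch, 0) + 1
    return not stack and all(n % 2 == 0 for n in ring_digits.values())

# Toy "model output": two parseable strings, two truncated ones.
samples = ["c1ccccc1", "CC(=O)O", "C1CC", "CC(C"]
kept = [s for s in samples if looks_valid(s)]
print(kept)  # → ['c1ccccc1', 'CC(=O)O']
```

The point of the paper is that this filter is cheap to apply after sampling, so there is no need to constrain the model's output space (as SELFIES does) during training.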
