Chemical knowledge and reasoning of large language models vs. chemist expertise
Large language models are increasingly used for diverse tasks, yet we have limited insight into their understanding of chemistry. Now ChemBench—a benchmarking framework containing more than 2,700 question–answer pairs—has been developed to assess their chemical knowledge and reasoning, revealing that the best models surpass human chemists on average but struggle with some basic tasks.
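Conceptually, evaluating a model on such a corpus reduces to looping over question–answer pairs, prompting the model, and scoring its replies. Here is a minimal sketch, assuming multiple-choice items stored as JSONL with hypothetical `question`, `choices` and `answer` fields and a placeholder `ask_model` call; none of this is the actual ChemBench interface.

```python
# Minimal sketch of a QA-pair benchmark loop. The JSONL schema and the
# `ask_model` placeholder are assumptions for illustration; they are not
# the real ChemBench data format or API.
import json
import re


def ask_model(prompt: str) -> str:
    """Placeholder for a call to an LLM; swap in a real client here."""
    raise NotImplementedError


def extract_choice(completion: str) -> str | None:
    """Pull a single answer letter (A-D) out of a free-form completion."""
    match = re.search(r"\b([A-D])\b", completion)
    return match.group(1) if match else None


def evaluate(path: str) -> float:
    """Score a JSONL file of {"question", "choices", "answer"} records."""
    correct = total = 0
    with open(path) as f:
        for line in f:
            item = json.loads(line)
            options = "\n".join(
                f"{letter}. {text}"
                for letter, text in zip("ABCD", item["choices"])
            )
            prompt = f"{item['question']}\n{options}\nAnswer with one letter."
            if extract_choice(ask_model(prompt)) == item["answer"]:
                correct += 1
            total += 1
    return correct / total if total else 0.0
```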
Adrian Mirza, Nawaf Alampara, Sreekanth Kunchapu, Martiño Ríos-García, Benedict Emoekabu, Mara Schilling-Wilhelmi, Macjonathan Okereke, Anagha Aneesh, Maximilian Greiner, Caroline T. Holick, Tim Hoffmann, Abdelrahman Ibrahim, Lea C. Klepsch, Yannik Köster, Jakob Meyer, Jan Matthias Peschel, Michael Ringleb, Nicole C. Roesner, Johanna Schreiber, Ulrich S. Schubert, Leanne M. Stafast & Kevin Maik Jablonka

Owing to the lack of widely accepted standard benchmarks, the developers of chemical language models [16,44–47] frequently use language-interfaced [48] tabular datasets such as those reported in MoleculeNet [49,50], Therapeutic Data Commons [51], safety databases [52] or MatBench [53]. Although our findings indicate many areas for further improvement of LLM-based systems, such as agents (discussed further in Supplementary Note 11), it is also important to recognize that clearly defined metrics have been key to the progress of many fields of ML, such as computer vision.
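The idea behind "language-interfacing" a tabular dataset is simply to serialize each record into a natural-language question. A minimal sketch, using an invented solubility-style record; the column names and prompt template are illustrative assumptions, not the schema of any of the benchmarks cited above.

```python
# Minimal sketch of "language-interfacing" a tabular dataset: each
# record is serialized into a natural-language question so a text-only
# model can be evaluated on it. The column names and prompt template
# here are illustrative assumptions, not the schema of any benchmark.

def row_to_prompt(row: dict) -> tuple[str, str]:
    """Turn one tabular record into a (question, expected_answer) pair."""
    question = (
        f"Is the molecule with SMILES {row['smiles']} soluble in water? "
        "Answer yes or no."
    )
    answer = "yes" if row["soluble"] else "no"
    return question, answer


# Example with an invented record (ethanol, which is water miscible).
prompt, expected = row_to_prompt({"smiles": "CCO", "soluble": True})
print(prompt)
print(expected)  # reference label used for scoring the model's reply
```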