Generative AI for Drug Discovery (RIPS - JMM Oral '25, NCUWM Oral '25)

June 2024 | IPAM, UCLA | Relay Therapeutics
Generative AI for Drug Discovery

This project was completed through the RIPS (Research in Industrial Projects for Students) summer program at IPAM (Institute for Pure and Applied Mathematics), one of the NSF's Mathematical Sciences Institutes. I was lucky to be selected for RIPS 2024's 36-student cohort (from a pool of 4000+ applicants), and even luckier to find such incredible mentors in Fang Sun (UCLA CS PhD), Luca Ponzoni (Relay Therapeutics), Pat Walters (Relay Therapeutics), Susana Serna (RIPS Director), and Dima Shlyakhtenko (IPAM Director). I am grateful to have had the opportunity to work with, learn from, and grow alongside a cohort of 35 other math-loving, coffee-ingesting, bad-pun-cracking students, especially my teammates Ellen Li, David Baron, and Walter Virany.

Our project was for Relay Therapeutics, a precision medicine company based in Cambridge, MA, known for pioneering the use of powerful computational techniques to accelerate drug discovery. In addition to exposing how much my chemistry knowledge had atrophied since high school, this project taught me how to dig into the metaphorical guts of industry-standard software; in a few weeks I went from describing molecules as "more or less wiggly" to modeling protein-ligand interactions and imagining new ways to optimize hit-to-lead molecular mutations.

As a math and CS student, I've often felt that many of the canonical "interesting problems" were interesting chiefly because of some quirk or symmetry or elegance in the underlying mathematical skeleton, while the applications were tacked on as an afterthought - an arbitrary wrinkle buried under a foot of mathematical memory foam. Working through this project - fiddling with the molecule renderings, spending hours manually parsing the molecular embeddings - gave me a new appreciation for the dynamic interplay between the mathematical and the applied. It sounds trite, but this project drove home the idea that important applications are not just peculiar corollaries of beautiful mathematics; rather, beautiful mathematics is often born to fit the unique topographical contours of important applications.

Abstract

We characterize the differences in performance and behavior across four state-of-the-art generative AI models for drug discovery: REINVENT4, CReM, SAFE-GPT, and COATI. To do so, we develop a random forest pipeline for classifying molecules and producing meaningful low-dimensional visualizations, conduct a comparative analysis of the models' performance in a hit-to-lead optimization setting, and carry out a large-scale case study of the distributions of molecules generated by each model.

In particular, we develop a new sensitivity metric to capture the variance in these distributions, helping to highlight the limitations of existing models and inform future model development.
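For readers unfamiliar with the random forest step mentioned in the abstract, here is a minimal sketch of the general idea. It is not the project's pipeline, which is not public. It assumes scikit-learn and NumPy are available, uses synthetic binary vectors as a stand-in for molecular fingerprints, and every name in it (`fake_fingerprints`, `probs_a`, `probs_b`, the two "generator" classes) is hypothetical. The forest learns to tell the two groups apart, and the features it finds most important are then projected to 2D for plotting.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

N_PER_CLASS = 200  # synthetic "molecules" per hypothetical generator
N_BITS = 64        # length of the stand-in fingerprint


def fake_fingerprints(bit_probs, n):
    """Draw n binary vectors; bit j is on with probability bit_probs[j]."""
    return (rng.random((n, bit_probs.size)) < bit_probs).astype(np.uint8)


# Assumption: two hypothetical generators whose outputs differ only in how
# often the first 8 fingerprint bits are set.
probs_a = np.full(N_BITS, 0.3)
probs_b = probs_a.copy()
probs_b[:8] = 0.8

X = np.vstack([
    fake_fingerprints(probs_a, N_PER_CLASS),
    fake_fingerprints(probs_b, N_PER_CLASS),
])
y = np.array([0] * N_PER_CLASS + [1] * N_PER_CLASS)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Train the forest to predict which generator produced each vector.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
accuracy = forest.score(X_test, y_test)

# Keep the 8 bits the forest relies on most, then project them to 2D.
top_bits = np.argsort(forest.feature_importances_)[::-1][:8]
coords = PCA(n_components=2).fit_transform(X[:, top_bits].astype(float))

print(f"held-out accuracy: {accuracy:.2f}")
print("most informative bits:", sorted(top_bits.tolist()))
print("2D coordinates shape:", coords.shape)
```

The resulting `coords` array can be handed to any scatter-plot routine and colored by `y`. With real molecules, the synthetic vectors would be replaced by fingerprints or learned embeddings, and the choice of projection (PCA here, purely for simplicity) is a design decision rather than something the abstract specifies.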

Resources

Note: This work was conducted for Relay Therapeutics, a private company, through the RIPS program, so I will not be publicly posting the detailed slides and technical report.