By: Cormac Rea
While data science is driving breakthroughs in countless areas, the lack of availability of experimental training data has limited its impact on drug discovery. In particular, there is a need to help data scientists understand experimental drug discovery data, ask the right questions, and decide for themselves on the best answers.
The Data Sciences Institute (DSI) has selected the Galvanizing Data Science Applications in Early Stage Drug Discovery proposal as an Emergent Data Science Program, an award that funds researchers to energize, support, and advance data science.
The Early Stage Drug Discovery Program will build bridges between data scientists and drug discovery experimentalists – two communities that typically do not speak the same language – by providing training to expose data science trainees to the next frontiers in drug discovery and galvanize a new generation of scientists into a space poised for machine learning-driven transformation.
The initiative is led by University of Toronto professors Matthieu Schapira (Department of Pharmacology and Toxicology, Temerty Faculty of Medicine, and the Structural Genomics Consortium); Rachel Harding (Leslie Dan Faculty of Pharmacy and the Structural Genomics Consortium); Mohamed Moosavi (Department of Chemical Engineering & Applied Science, Faculty of Engineering & Applied Science); Chris Maddison (Department of Computer Science and Department of Statistical Sciences, Faculty of Arts & Science, and Vector Institute); and Hui Peng (Department of Chemistry, Faculty of Arts & Science).
“Recent advances in machine learning (ML) are poised to have a transformative impact along the drug discovery and development trajectory, including finding the best protein target for a given disease, discovering and optimizing drugs and selecting patients most likely to respond to a given treatment,” says lead researcher Matthieu Schapira.
The program launches on January 31, 2025 with the CrossTALK Bootcamp and will offer quarterly workshops on data science for hit-finding. The workshops include interactive sessions and lab visits where data scientists will learn about data generation and experimentalists will learn about data analysis.
The bootcamp includes workshops that explain the chemical library screening process, along with data challenges in which participants will use their ML models to retrospectively retrieve blinded hits.
“Supporting emergent areas of data science is a core activity of the Data Sciences Institute that helps to fulfil its mission of bringing people together for collaborative generation and application of new ideas in the data sciences,” says David Lie, DSI Associate Director, Thematic Programming.
DSI met with Prof. Schapira to learn more about this Emergent Data Science Program:
From a personal or professional perspective, could you explain what led you and your collaborators to propose this as an emerging data science program to the Data Sciences Institute?
MS: A challenge for machine learning (ML) in early-stage drug discovery is the lack of publicly accessible, large and consistent data sets to train ML models, but efforts are underway to fill this gap, which will lead to new opportunities for data-science-driven drug discovery. A new initiative at the Structural Genomics Consortium (SGC) aims to screen up to 2000 proteins against billions of molecules using two experimental platforms well-established in the pharmaceutical industry: DNA-encoded libraries (DEL) and Affinity Selection Mass Spectrometry (ASMS). A network of AI experts around the world committed to exploiting these data for early-stage drug discovery is rapidly growing at https://aircheck.ai/mainframe. As the SGC, together with our industry partners, is poised to become a leading generator of open-science protein-ligand data, our goal is to ensure that the data science and drug discovery breakthroughs made from our U of T-generated data are not all made elsewhere. Our goal is to position Canada at the forefront of this breakthrough. This grant will enable a pilot project to train the next generation of data scientists at U of T. If successful, we will then expand this program to universities across Canada.
Our experience with the ML divisions of pharmaceutical companies has revealed that understanding the genesis of the data is both critical to devising efficient machine learning strategies and a challenge. Conversely, we believe that it is critical for bench scientists to share a common language with data scientists to better provide guidelines for the reliable interpretation of experimental data.
Our solution is to galvanize Canadian data science trainees around open science data for drug discovery, and pair them with experimentalists. We will organize four bootcamps each year where experimentalists and data scientists team up and learn together how experimental training datasets are generated, how ML models are built and used to predict bioactive molecules, and how predicted molecules are tested experimentally.
What are some of the main challenges to bringing together researchers, trainees and students interested in this computational work?
MS: Most participants will be graduate students and post-docs, though staff are welcome as well… and many PIs say they are keen to attend, though each bootcamp is ~20 hours, which is a real time commitment! I believe pairing experimentalists and data scientists will have a positive impact on the learning curve. Our first bootcamp starts in February, so we’ll see how things go.
What would you like to see coming out of the CrossTALK bootcamp?
MS: There is no question that ML will transform the way life sciences are conducted and the speed at which discoveries are made. Canada cannot afford to miss this departing train. U of T is privileged to have a pool of exceptionally talented ML trainees.
I hope this program will provide some tools for data scientists and experimentalists at U of T and beyond to harness the waves of chemical data that are bound to accelerate early-stage drug discovery. The 2024 Nobel Prize in Chemistry highlighted the first steps in this direction.