SUDS Student Call 

May-August 2025

Call for student researchers!

The Data Sciences Institute (DSI) welcomes carefully selected undergraduate students from across Canada for a rich data sciences research experience. Through the SUDS Research Program, undergraduate students who are interested in exploring data science as a career path have an exciting opportunity to engage in hands-on research supervised by DSI member researchers across the three UofT campuses.

The DSI is strongly committed to diversity within its community and especially welcomes applications from racialized persons/persons of colour, women, Indigenous/Aboriginal People of North America, persons with disabilities, LGBTQ2S+ persons, and others who may contribute to the further diversification of ideas.

Below are the SUDS research opportunities for May-August 2025. You can apply and rank your top three choices.

See here for information on eligibility, award value and duration, and SUDS programming.

Research Opportunities

Research description:

Type Ia Supernovae are calibrated standard light beacons that enable us to measure distances across cosmic time. These distances encode the expansion history of the Universe; however, one of the biggest challenges is finding a “pure” sample of these supernovae, given that many things explode in the night sky, and only some of those are useful cosmological probes. The Vera C. Rubin Observatory is a telescope that takes images of the sky and will find hundreds of thousands of these objects, contaminated by other light sources. Our group is working on a fully Bayesian supernova cosmology analysis pipeline to process the incoming Rubin data.
 
There are many aspects to this analysis, including parametrizing supernova rates over time, modelling supernova spectra, and more practical considerations such as optimizing the analytic and numerical runtime, and performing coverage tests. Depending on the SUDS Scholar's interests and strengths, tasks could include developing statistical tests to determine the accuracy of the Bayesian model, using conformal prediction or similar methods to improve quantified uncertainties, performing an independent analysis on an alternate supernova dataset, or optimizing the code for accuracy or performance.
 

Researcher: Renee Hlozek, David A. Dunlap Department of Astronomy and Astrophysics, University of Toronto 

 

Skills required:

  • Python programming and a keen interest in rigorous analysis of real data.  
  • Previous experience with JAX, high-performance computing, or Bayesian analysis is helpful, but not required.
  • Knowledge of astronomy and supernovae is also helpful, but not necessary.

Primary research location:

  • Hybrid

Research description:

Concussion affects over 400,000 Canadians annually, with up to 30% experiencing prolonged post-concussion symptoms that disrupt recovery and quality of life. Early follow-up is critical but is frequently delayed by months due to clinician shortages across Canada and limited access to specialized urban centers, resulting in symptom exacerbation, prolonged disability, and greater strain on healthcare systems. AI-driven platforms have the potential to automate triage, summarize clinical information to inform clinicians, and support clinical decision-making, yet current systems lack multimodal sensing, clinical validation, and workflow integration. This project enhances the validated Acute Concussion Triage Agent (ACT-A), a multilingual, privacy-preserving web platform that conducts adaptive interviews, analyzes affective and behavioral cues, and generates structured summaries and recommendations for clinician review. ACT-A integrates retrieval-augmented generation (RAG)-based recommendation agents built on secure Microsoft Azure-hosted large language models to produce evidence-based next-step decisions. These structured summaries and recommendations are designed to reduce clinician workload, enabling more focused, efficient, and higher-quality patient interactions, while allowing clinicians to allocate more time to complex or high-priority cases. Through multimodal data fusion, prompt-engineered summarization, and clinician-in-the-loop validation, ACT-A will reduce triage delays and establish a scalable, agentic-AI framework for equitable, intelligent concussion care.

 

Researcher: Shehroz Khan, Toronto Rehabilitation Institute (KITE), University Health Network

 

Skills required:

  • React JS web application development, Large Language Models, Prompt Engineering, Agentic AI, Retrieval-Augmented Generation, Machine Learning, Deep Learning

Primary research location:

  • Hybrid

Research description:

Canadians spend nearly 90% of their time indoors, where they are exposed to various airborne contaminants. Indoor air quality (IAQ) has a significant impact on health and overall quality of life. However, analyzing and understanding IAQ in diverse indoor environments remains challenging due to missing information about key factors such as contaminant generation rates, air mixing, and airflow patterns between spaces. Building on last year’s successful DSI SUDS project, this research continues the development of physics-informed machine learning (ML) methods to better understand IAQ dynamics. This year’s project will extend the previous work by refining and validating probabilistic ML models using data collected from a controlled experiment in the Twin Suites Rooftop Lab, where ground-truth information about the key factors affecting IAQ dynamics is measured. The focus will be on improving the models’ ability to estimate these factors under uncertainty. Probabilistic programming will serve as the overarching framework to integrate data-driven inference with domain knowledge.

The SUDS Scholar will work with Professor Jeffrey Siegel (CIVMIN, IAQ expert) and Professor Seungjae Lee (CIVMIN, ML expert in building science). While the project primarily focuses on the analysis of IAQ data, the SUDS Scholar will also have the opportunity to participate in the IAQ data collection.

 

Researcher: Seungjae Lee, Department of Civil and Mineral Engineering, University of Toronto

 

Skills required:

  • Proficiency in Python, with experience using essential data science libraries (e.g., scikit-learn).
  • Preferred:
    • Experience with PyTorch/TensorFlow.
    • Experience with handling time series data.
    • Foundational understanding of machine learning and probability theory.
    • Experience with high-performance computing.
    • Experience with Git for version control.
    • Interest in building science and IAQ applications.

Primary research location:

  • Hybrid

Research description:

This project explores the use of Causal Prior-Fitted Networks (CausalPFNs) and Large Language Models (LLMs) to better understand treatment heterogeneity in clinical trial datasets. CausalPFNs are transformer-based models trained on diverse simulated data-generating processes that can estimate causal effects directly from observational or experimental data without additional tuning. Applying CausalPFNs to clinical trial data enables automatic estimation of conditional average treatment effects (CATEs), revealing patient subgroups that respond differently to interventions. Meanwhile, LLMs can process and interpret unstructured clinical documents, such as trial protocols and patient narratives, to extract relevant covariates and contextualize causal findings. By combining CausalPFNs’ quantitative inference with LLMs’ interpretive capabilities, the project aims to build a unified framework for automated causal analysis and clinical insight generation. The outcomes will include validated pipelines for identifying heterogeneous treatment responses, interpretable summaries of causal results, and guidelines for integrating language-based reasoning with causal machine learning—advancing personalized medicine and evidence synthesis.

 

Researcher: Rahul Krishnan, Faculty of Arts and Science, University of Toronto

 

Skills required:

  • Strong skills in causal inference, deep learning (PyTorch/JAX), and transformer architectures, with familiarity with clinical data analysis and Bayesian reasoning.
  • Experience with LLMs, text extraction, and data preprocessing is essential, alongside statistical literacy, scientific writing, and the ability to interpret model outputs in biomedical contexts.

Primary research location:

  • University of Toronto, St. George Campus

Research description:

The nervous system is essential for generating and coordinating complex motor behaviors that are critical for animal survival and reproduction across species. We are using C. elegans and mice to study how components of the nervous system, from the molecular to the circuit level, determine its properties and generate complex behaviors. We have developed strategies to monitor and control the components of the nervous system in real time, both in living, behaving animals and in isolated neuronal tissues. These approaches combine genetic mutants, calcium imaging, electrophysiology, optogenetics, and immunohistochemistry to investigate the structure and function of the nervous system. With these tools, we are able to examine how molecular and cellular components of the nervous system affect animal development and behavior. One challenge we face is implementing automated tracking, segmentation, and quantification of specific behaviors of interest. We have developed imaging setups for the behaviors of interest.
 
The SUDS Scholar will work on developing an automated pipeline for characterizing and quantifying animal behavior based on our imaging setups. The student will collaborate with our team and partners who are currently building machine learning algorithms to address these challenges.
 

Researcher: Mei Zhen, Lunenfeld-Tanenbaum Research Institute

 

Skills required:

  • Proficiency in image processing, algorithm development, or statistical analysis.
  • Programming knowledge is essential.
  • Students interested in applied math and physics are strongly encouraged to apply, but the key ingredient is a strong drive to learn and apply all the above to real biological problems.

Primary research location: 

  • Hybrid

Research description:

This project offers an opportunity for a motivated undergraduate student to explore global biodiversity change using established long-term ecological datasets.
 
The SUDS Scholar will work with large-scale biodiversity and environmental data collected from international monitoring programs and open repositories. Using data science tools and methods—including, but not limited to, data cleaning, visualization, and statistical or time series analyses—the student will investigate spatial and temporal trends in species diversity, abundance, and distribution. The student will learn best practices for handling complex ecological data, explore reproducible workflows, and contribute to the development of analytical pipelines that help quantify global biodiversity loss or recovery. The project will encourage critical thinking about data quality, scale, and uncertainty, as well as the broader implications of biodiversity change for ecosystem health and sustainability. This opportunity is ideal for students interested in combining computational skills with environmental science to address urgent global challenges through data-driven research.
 

Researcher: Tianna Peller, Department of Ecology and Evolutionary Biology, University of Toronto

 

Skills required:

  • Strong analytical, coding, and organizational skills, with an interest in applying data-driven methods to ecological and environmental questions.
  • Familiarity with time-series or spatial data and statistical analysis, and experience integrating multiple datasets to assess potential drivers of observed patterns, are considered valuable assets.

Primary research location:

  • Hybrid

Research description:

Cities worldwide are investing in green infrastructure to enhance resilience, improve livability, and reduce carbon emissions. However, the financial value of urban greenness (that is, how it translates into tangible economic benefits) remains underexplored. This project will quantify the economic and financial impacts of urban vegetation by integrating satellite-based greenness indices with housing market data and financial modeling techniques across the Greater Toronto Area.
 
Using multi-temporal 10-m Sentinel-2 imagery, the SUDS Scholar will calculate a vegetation index representing vegetation density and distribution. These spatial greenness metrics will be merged with housing price datasets (from CREA, MLS, or municipal open data) to evaluate the relationship between environmental quality and property value. Using spatial regression and hedonic pricing models, the project will estimate the “green premium”, the monetary contribution of vegetation to housing prices after controlling for confounding factors (e.g., proximity to transit, schools, and employment centers). Building on these relationships, the student will apply financial modeling to translate environmental benefits into investment value. Scenarios will simulate how future urban greening initiatives or carbon pricing policies might influence neighbourhood-level property and ecosystem service value. The analysis will culminate in spatial visualizations and financial summaries quantifying how urban greenness contributes to climate resilience and economic prosperity.
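As a rough illustration of the first step described above, the sketch below computes a greenness index from red and near-infrared reflectance. The description does not name a specific index; the Normalized Difference Vegetation Index (NDVI) is assumed here because it is a common choice and can be computed from the 10-m Sentinel-2 red (B4) and near-infrared (B8) bands. The function name `ndvi` and the toy reflectance values are hypothetical.

```python
import numpy as np

def ndvi(red, nir):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red).

    Assumption: `red` and `nir` are reflectance arrays of the same shape,
    e.g. Sentinel-2 bands B4 and B8. Pixels where NIR + Red is zero are
    returned as NaN instead of raising a division warning.
    """
    red = np.asarray(red, dtype=float)
    nir = np.asarray(nir, dtype=float)
    total = nir + red
    out = np.full(total.shape, np.nan)
    np.divide(nir - red, total, out=out, where=total != 0)
    return out

# Toy 2x2 scene (hypothetical values): dense vegetation reflects strongly
# in the near-infrared and weakly in the red, giving values closer to 1.
red_band = np.array([[0.05, 0.20], [0.10, 0.00]])
nir_band = np.array([[0.45, 0.25], [0.30, 0.00]])
greenness = ndvi(red_band, nir_band)
```

Per-pixel values like these would then be aggregated to the neighbourhood or property level before being merged with housing price data.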
 
Researcher: Yuhong He, Department of Geography, Geomatics, and Environment, University of Toronto
 
Skills required:
  • Proficiency in GIS and remote sensing.
  • Experience with satellite imagery analysis.
  • Basic knowledge of statistical modeling.
  • Data integration from open housing datasets and census sources.
  • Ability to conduct spatial data cleaning, visualization, and mapping.
  • Interest in urban sustainability, environmental economics, and climate policy.
  • Strong written and analytical skills.

Primary research location:

  • Hybrid

Research description:

This project seeks to develop an automated, data-driven workflow for interpreting X-ray photoelectron spectroscopy (XPS) datasets of the solid-electrolyte interphase (SEI) in high-energy-density batteries, often dubbed the “Mona Lisa” of battery interfaces due to its chemical complexity and analytical opacity. Despite XPS being a cornerstone technique for SEI characterization, its interpretation is plagued by overlapping spectral features, mixed oxidation states, and subjective, non-reproducible analysis methods that hinder scientific consensus. By embedding advanced data science techniques (such as automated signal processing, dimensionality reduction, and probabilistic modeling) into the core of the XPS workflow, this project will produce an open-source, Python-based software toolkit that enables interpretable and reproducible spectral analysis. The toolkit will detect anomalous features, suggest candidate species with quantified uncertainty, and facilitate transparent, modular exploration of SEI chemistry. Aligned with the Data Sciences Institute’s mission to promote fair, ethical, and reproducible data practices, this project fosters interdisciplinary collaboration between electrochemistry and data science. By openly disseminating tools and annotated datasets, it democratizes access to advanced analytical capabilities, accelerating innovation in sustainable energy storage and advancing the development of safer, more efficient batteries.
 
The SUDS Scholar will lead the development of an automated, data-driven workflow for interpreting XPS spectra of battery solid–electrolyte interphases (SEIs). Responsibilities include designing signal-processing and machine-learning pipelines for peak deconvolution, feature extraction, anomaly detection, and uncertainty quantification; developing a modular, open-source Python toolkit with robust documentation and testing; curating and standardizing XPS datasets using FAIR data principles; validating models against reference standards and expert interpretations; and working closely with electrochemists to ensure chemical and physical relevance. The Scholar will also implement reproducible research practices, maintain transparent version-controlled workflows, and support knowledge dissemination through documentation, tutorials, and publications.
 

Researcher: Weilai Yu, Department of Chemical Engineering and Applied Chemistry, University of Toronto

 

Skills required:

  • Skilled in Python programming, data analysis, and signal processing, with familiarity in machine learning, dimensionality reduction, and probabilistic modeling.
  • Experience with Git, reproducible workflows, and scientific visualization is valued.
  • Interest in spectroscopy, materials science, or electrochemistry is an asset, alongside strong documentation and interdisciplinary communication abilities.

Primary research location:

  • Hybrid

Research description:

In-situ synchrotron X‑ray instruments perform material characterization to determine properties such as phase nucleation and transformation under controlled heating. However, the complexity and amount of data from synchrotron X-ray diffraction (XRD) make the analysis challenging. This project develops a high-throughput computational workflow for automated extraction of key structural features from XRD data, including crystallinity, peak parameters, and phase-transition temperatures. 
 
The SUDS Scholar will apply data science approaches such as distribution modeling and signal processing, as well as supervised/unsupervised machine learning methods, to evaluate physics-based candidate features and indicators. After identifying a workflow, students will work to automate the analysis for compatibility with high-throughput experimentation, identifying the phase evolution processes and corresponding structure information. Students will also practice software engineering skills necessary to document the workflow in an open-science framework. Final outcomes include open, reproducible analysis that accelerates materials discovery and demonstrates core data science competencies: algorithm design, scalable computing, and automated knowledge extraction.
 

Researcher: Jason Hattrick-Simpers, Department of Materials Science and Engineering, University of Toronto

 

Skills required:

  • Experience with Python, machine learning, GitHub, and data visualization.

Primary research location:

  • On campus

Research description:

The landscape of student help-seeking behaviour is undergoing a significant transformation with the rise of generative AI tools like Large Language Models (LLMs). Building on prior research that explores help-seeking tendencies among university students, this project aims to investigate and analyse large-scale student data on the effects of integrating LLM-powered assistants in programming courses, focusing on their influence on student behaviour, engagement, and learning outcomes, and ideally generating an approach for improved (predictive and prescriptive) decision making. The research will involve a comprehensive analysis of how the introduction of LLM-based conversational agents (e.g., ChatGPT) and other LLM-based educational tools, such as CodeAid and QuickTA, both developed at the University of Toronto, influence student approaches to seeking help. This will involve data mapping and analysis, as well as identifying patterns in large conversational data. Traditional help-seeking behaviours have shown a reliance on informal support (e.g., peers) rather than formal educational resources (e.g., instructors), often due to perceived barriers like stigma or accessibility. We hypothesise that the availability of LLM tools may shift these dynamics, increasing students’ reliance on automated, real-time assistance and providing data-rich insights into evolving help-seeking patterns that could enhance predictive and prescriptive modelling for educational support strategies.

 

Researcher: Michael Liut, Department of Mathematical and Computational Sciences, University of Toronto Mississauga, University of Toronto

 

Skills required:

  • Excellent interpersonal skills
  • HCI: familiarity with designing user studies, conducting thematic analysis, and ability to analyse qualitative feedback from participants.
  • Strong programming skills: Python, R, visualisation libraries (e.g., matplotlib, pandas, plotly), bash, full-stack frameworks (e.g., Django, React), LLM architectures.
  • Experience handling interaction log data and EdTech development is a bonus.

Primary research location:

  • University of Toronto St. George Campus and/or Remote

Research description:

Recording from the peripheral nervous system can be used to decode control signals exchanged throughout the body, with applications in creating assistive technologies and treating chronic diseases. Our laboratory has collected unique datasets from multi-channel nerve cuff electrodes, which record data from the surface of nerves. We have developed neural networks to decode these recordings by classifying the source of each detected neural event. Using existing data, this project will involve refining neural network architectures and training strategies to optimize performance. Creating neural networks that can generalize well over time and across subjects with minimal re-calibration is of particular interest. The student will have the opportunity to gain a better understanding of real-world data science challenges in neurotechnology, and of strategies to manage these obstacles when developing deep learning systems.

 

Researcher: José Zariffa, Toronto Rehabilitation Institute (KITE), University Health Network

 

Skills required:

  • Experience designing and evaluating deep neural networks.
  • Processing of physiological signals.

Primary research location:

  • University Health Network in-person

Research description:

This project aims to develop a general method for defining clusters of cell types from single-cell RNA sequencing data. This problem is widely considered one of the most important and fundamental problems in single-cell data analysis, but suffers from a paucity of methods to define whether two cell-type clusters are actually distinct from each other. We will use hierarchical clustering via the ultra-fast HGC method to define an initial hierarchy of cell-type clusters. We will then recurse through this hierarchy and apply a significance test at each split, to determine whether the two clusters at the split are significantly different from each other. If they are not, recursion will stop. The most creative aspect of the project will be defining the significance test. HGC is based on the shared nearest-neighbor (SNN) graph, so it seems natural to use that for significance testing as well. However, naively testing whether the number of between-cluster connections is less than expected will not be sufficient, since this criterion was already used to define the clusters themselves, an example of a “double-dipping” problem. Possible solutions may involve some combination of permutation testing, the recently developed "count splitting" method, and graph-theoretic properties.
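The recursion described above can be sketched as follows. This is only a toy under stated assumptions: SciPy's average-linkage `linkage` stands in for HGC, and `split_is_significant` is a hypothetical placeholder (a naive label-permutation test on the distance between cluster means), not the significance test the project will define. Because the clusters are built from the same data, this placeholder is itself subject to the double-dipping problem; the sketch shows the control flow only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

def split_is_significant(x, left_ids, right_ids, rng, n_perm=200, alpha=0.05):
    """Hypothetical placeholder test: permute cluster labels and compare
    the distance between the two cluster means with the observed one."""
    ids = np.array(left_ids + right_ids)
    n_left = len(left_ids)
    observed = np.linalg.norm(x[left_ids].mean(axis=0) - x[right_ids].mean(axis=0))
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(ids)
        stat = np.linalg.norm(
            x[perm[:n_left]].mean(axis=0) - x[perm[n_left:]].mean(axis=0)
        )
        exceed += int(stat >= observed)
    p_value = (exceed + 1) / (n_perm + 1)
    return p_value < alpha

def significant_clusters(x, min_size=5, seed=0):
    """Walk the hierarchy top-down; stop at a node when its split is not
    significant (or a child is too small) and report it as one cluster."""
    rng = np.random.default_rng(seed)
    root = to_tree(linkage(x, method="average"))  # stand-in for HGC
    clusters = []

    def recurse(node):
        if node.is_leaf():
            clusters.append(node.pre_order())
            return
        left, right = node.get_left(), node.get_right()
        left_ids, right_ids = left.pre_order(), right.pre_order()
        too_small = min(len(left_ids), len(right_ids)) < min_size
        if too_small or not split_is_significant(x, left_ids, right_ids, rng):
            clusters.append(node.pre_order())  # recursion stops here
            return
        recurse(left)
        recurse(right)

    recurse(root)
    return clusters

# Toy data (hypothetical): two well-separated groups of 20 points each.
data_rng = np.random.default_rng(1)
points = np.vstack([
    data_rng.normal(0.0, 0.5, size=(20, 2)),
    data_rng.normal(10.0, 0.5, size=(20, 2)),
])
found = significant_clusters(points)
```

Each entry of `found` is a list of row indices forming one reported cluster. The placeholder test may keep splitting inside a homogeneous group, which is the over-splitting behaviour that a properly calibrated test is meant to prevent.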

 

Researcher: Michael Wainberg, Lunenfeld-Tanenbaum Research Institute

 

Skills required:

  • Experience in Python is a must-have, for instance through introductory computer science courses.
  • Familiarity with genetics, statistics, data science packages like NumPy and polars, and graph theory is a major asset.

Primary research location:

  • Lunenfeld-Tanenbaum Research Institute in-person and remote

Research description:

We are offering a unique research opportunity for students passionate about the intersection of statistics, psychometrics, psychology, and artificial intelligence. This project aims to revolutionize psychological assessment by leveraging AI to design more reliable and valid psychological scales. By employing machine learning algorithms and natural language processing, we will analyze existing scales to identify limitations and develop enhanced tools that more accurately measure psychological constructs. As a participant, you will engage in a case study exploring how AI can help refine scale items to be culturally sensitive and reduce unwanted bias. You’ll collaborate with a multidisciplinary team of psychometricians, data scientists, statisticians and AI experts, gaining hands-on experience in both qualitative and quantitative research methods. This immersive experience will not only deepen your understanding of psychometrics but also equip you with cutting-edge skills in AI applications within psychology. This project offers the chance to contribute to pioneering research with the potential to make a significant impact on psychological assessment practices. You’ll develop valuable skills in data analysis and AI, preparing you for advanced studies or careers in psychology, statistics, education, data science, or related fields.

 

Researcher: Feng Ji, Department of Applied Psychology and Human Development, Ontario Institute for Studies in Education, University of Toronto

 

Skills required:

  • Familiarity with machine learning concepts and programming languages such as Python or R is highly desirable.

  • Familiarity with APIs (such as OpenAI API) is preferred (but not required).

  • Essential skills include excellent analytical abilities, attention to detail, and the capacity to work effectively in a collaborative team environment.

  • Coursework in psychology, statistics, and data science (generally defined) is preferred (but not required).

 

Primary research location:

  • University of Toronto St. George Campus and/or Remote

Research description:
This project leverages the transformative potential of machine learning to advance mental health diagnostics through the development of a secure and reliable digital psychiatry platform. By employing multitask learning (MTL) methodologies, the project aims to uncover intricate patterns across physiological, psychological, behavioral, and contextual data derived from wearables and digital diaries. These insights will enhance the detection and prediction of overlapping symptoms and risk factors in comorbid mental health disorders.
 
The project involves the integration of multimodal datasets, including:

  • Wearable Device Data: Physiological signals such as heart rate variability and sleep patterns.
  • Digital Diaries: Self-reported psychological states and circumstantial factors.
  • Contextual and Social Activity Data: Behavioral and interactional cues for enhanced contextual understanding.
 
Researcher: Deepa Kundur, Edward S. Rogers Sr. Department of Electrical & Computer Engineering, Faculty of Applied Science and Engineering, University of Toronto
 
Skills required:
 
  • Machine Learning: Basic understanding of supervised learning and model evaluation.
  • Programming: Proficiency in Python and familiarity with ML libraries (e.g. TensorFlow, PyTorch, or scikit-learn).
  • Data Handling: Experience with data preprocessing and feature extraction.
  • Cybersecurity Awareness: General understanding of adversarial attacks and model robustness.
  • Problem-Solving: Strong analytical skills and creativity in tackling challenges.
 
Primary research location:

  • University of Toronto St. George Campus and/or Remote
 

Research description:

Tertiary lymphoid structures (TLS) have recently been shown to be predictive of survival in pancreatic ductal adenocarcinoma (PDAC). This project aims to quantify and subtype TLS in three PDAC cohorts spanning over 600 patients. These findings will then be associated with clinical metadata, genomic mutations and transcriptional subtypes. The successful candidate will benchmark existing TLS identification methods and compare these to recently developed foundation models. Upon identification, we will attempt to stratify TLS into distinct subtypes based on the embeddings produced by foundation models. We will then attempt to identify whether these subtypes are driven by TLS-specific aspects such as lymphocyte morphology or by the surrounding environment, such as the composition of the stroma or distance to the closest tumor. Finally, we will benchmark the extent to which these subtypes recapitulate transcriptional TLS subtypes we have already identified using spatial sequencing technologies. Upon creation of a robust TLS subtyping method, we will run it over slides from over 600 deeply phenotyped patients and associate TLS presence and subtype with patient survival, genomic mutations and copy number aberrations as well as known transcriptional subtypes. Overall, this will be the most in-depth characterization of TLS in PDAC to date.

 

Researcher: Kieran Campbell, Lunenfeld-Tanenbaum Research Institute

 

Skills required:

  • Proficient in R and Python
  • Experience with machine learning libraries including scikit-learn/PyTorch, workflow managers (e.g. Snakemake) and the Unix command line
  • Experience with medical imaging data (e.g. histopathology, X-rays or CT scans)
  • Familiarity with analyzing genotyping data (e.g. point mutations or tandem repeats) and transcriptomic data

Primary research location:

  • Lunenfeld-Tanenbaum Research Institute in-person and remote

Research description:

The majority of ovarian cancers are diagnosed at an advanced stage, and consequently, the case-fatality rate is high. To some extent, this is because there is no effective screening program and because of delays in diagnosis. It is of interest to explore innovative means of accelerating the date of diagnosis. One possibility is CA125 testing at the first point of care for symptomatic women who seek consultation with front-line physicians. The goal of this project is to leverage a robust database of ~600 ovarian cancers diagnosed in Ontario to conduct a detailed evaluation of the distribution of CA125 levels at the time of diagnosis by various patient and clinical factors (e.g., stage, histology), and to explore whether increasing the threshold for CA125 levels may accelerate the diagnostic process and lead to earlier identification of affected individuals. Finally, predictors of survival are also of interest and can be analyzed in this dataset.

 

Researcher: Joanne Kotsopoulos, Women's College Hospital

 

Skills required:

  • Dependable
  • Hardworking
  • Detail-oriented
  • Team player
  • Independent
  • Strong communication skills
  • Analytic skills
  • Strong organization skills
  • Prior experience in SAS or R is an asset but not required.

Primary research location:

  • Women's College Hospital in-person and remote

Research description:

Variation in gene expression underpins variation in organismal traits and diversity. Therefore, understanding how gene expression evolves will allow us to better understand the mechanisms of evolutionary change. The strength and form of selection on gene expression and its role in evolution are difficult to estimate, however, because of the high-dimensional and highly correlated nature of gene expression data. In this project, the SUDS Scholar will estimate selection on gene expression traits and compare the results from different methods that are commonly used to study selection on gene expression.

 

Researcher: Jacqueline Sztepanacz, Department of Ecology and Evolutionary Biology, Faculty of Arts and Science, University of Toronto

 

Skills required:

  • Proficiency in R
  • Knowledge of basic statistical/machine learning models.
  • High attention to detail
  • Excellent oral and written communication skills.
  • Background in genetics would be an asset

Primary research location:

  • University of Toronto St. George Campus and/or Remote

Research description:

Large language models (LLMs) have opened up new frontiers for reducing administrative burdens in health systems. Healthcare institutions around the world have already begun piloting the use of automated scribes and other tools aimed at summarizing patients’ clinical records. Aside from these institutional endeavors, there is also evidence that independent care providers are increasingly utilizing large language models to support care delivery, despite the lack of guidelines and oversight mechanisms. In light of these recent trends, there is a critical need to better understand the prevalence, types, and impacts of bias that risk being perpetuated by LLMs. Social biases such as racial and gender stereotypes, as well as systematic discrepancies in clinical LLM summaries, pose a risk of exacerbating health disparities. Relatedly, biases may also stem from sycophancy, a phenomenon where LLMs generate outputs that reflect the user’s anticipated preferences or assumptions. The goal of this project is to evaluate the risk of social bias and the effects of sycophancy on several publicly available LLMs, and summarise findings in a whitepaper or research report. To support these evaluations, we will use anonymized clinical notes from the MIMIC-IV dataset, which have already been annotated for patients’ language, race, and ethnicity.

 

Researcher: Zahra Shaker, Institute of Health Policy, Management, and Evaluation, Dalla Lana School of Public Health, University of Toronto

 

Skills required:

  • We welcome students with an interest in machine learning and/or natural language processing, as evidenced by previous coursework, research projects, or self-study.
  • Intermediate knowledge of Python and familiarity with APIs are assets.
  • Previous research or volunteer experience in a healthcare setting is preferred, but not required.

Primary research location:

  • University of Toronto St. George Campus and/or Remote

Research description:

This project seeks to characterize the relationship between components of the built environment and breast cancer risk among BRCA mutation carriers. The built environment touches all aspects of our lives, encompassing the buildings we live in, distribution systems that provide us with water and electricity, and the roads, bridges, and transportation systems we use to get from place to place. It can be described as the manufactured or modified structures that provide people with living, working, and recreational spaces. As a result, these environments can have a lasting impact on human health. Previous literature has established relationships between built environment factors, including proximity to roadways, neighbourhood greenspace, and indoor environment, and breast cancer risk. However, to our knowledge, no studies have specifically examined this risk among BRCA mutation carriers. This study aims to leverage our existing database of BRCA mutation carriers from across Canada, alongside detailed environmental data available through the Canadian Urban Environmental Health Research Consortium (CANUE), to assess and quantify these risks. Findings from this study will provide novel insights into how various built environment factors may influence breast cancer risk in high-risk populations, allowing us to better understand potential risk reduction interventions and urban planning efforts.

 

Researcher: Joanne Kotsopoulous, Women's College Hospital

 

Skills required:

  • Data Management and Entry
  • Research and Literature Review
  • Statistical Analysis: foundational knowledge of statistics and familiarity with software like SAS or R is an asset, but not required
  • Attention to Detail
  • Strong Organization Skills
  • Clear Communication Skills
  • Collaboration
  • Independent Work
  • Critical Thinking

Primary research location:

  • Women's College Hospital in-person and remote

Research description:

Animal species exhibit circadian activity patterns in response to the rotation and light cycle on Earth. However, we do not understand the evolutionary causes or consequences of this variation; for example, why are moths nocturnal, while butterflies are diurnal? Research in our lab has suggested that nocturnality may confer an evolutionary advantage during mass extinction events (Shafer et al., 2023), and transitions between activity patterns might drive speciation (Nichols & Shafer et al., 2024). However, we only have information on the activity patterns of ~12% of vertebrate species, and no systematic information is available on the activity patterns of invertebrates, which represent >97% of all animal species. Given the scale of missing information, we aim to leverage citizen science to fill in the gap. iNaturalist is a popular application that allows users to post observations of organisms along with metadata for their location/timing, spawning a new generation of digital naturalists, and generating huge databases of scientific-grade observations of Earth’s biodiversity. We propose to mine >200 million observations of ~500,000 species by >8 million users from around the world. The SUDS scholar will determine the activity pattern for millions of species by identifying patterns in this data using data science techniques.
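
As a rough illustration of the kind of analysis involved, the sketch below labels a species from the local hour of each of its observations. The records, the 06:00 to 18:00 daylight window, and the 80% threshold are all illustrative assumptions, not the lab's method. A real analysis would also need to account for the fact that human observers are themselves mostly active during the day.

```python
from collections import defaultdict


def classify_activity(hours, day_start=6, day_end=18, threshold=0.8):
    """Label a species from the local hours (0-23) of its observations."""
    if not hours:
        return "unknown"
    # Fraction of observations made inside the assumed daylight window.
    day_fraction = sum(day_start <= h < day_end for h in hours) / len(hours)
    if day_fraction >= threshold:
        return "diurnal"
    if day_fraction <= 1 - threshold:
        return "nocturnal"
    return "cathemeral"  # active in both day and night


# Hypothetical (species, local hour) records, not real iNaturalist data.
observations = [
    ("butterfly_sp", 10), ("butterfly_sp", 13), ("butterfly_sp", 15),
    ("moth_sp", 22), ("moth_sp", 23), ("moth_sp", 1),
    ("mixed_sp", 9), ("mixed_sp", 21),
]

# Group observation hours by species, then label each species.
hours_by_species = defaultdict(list)
for species, hour in observations:
    hours_by_species[species].append(hour)

labels = {sp: classify_activity(h) for sp, h in hours_by_species.items()}
```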

 

Researcher: Maxwell Shafer, Department of Cell and Systems Biology, Faculty of Arts and Science, University of Toronto

 

Skills required:

  • Experience with bioinformatics, data mining, statistics, or programming languages (R, Python) is beneficial.
  • Coursework in evolution or evolutionary modelling is preferred (but not required).

Primary research location:

  • University of Toronto St. George Campus and/or Remote

Research description:

This project aims to develop innovative geometric deep learning methods to identify and characterize stellar streams in the Milky Way. Stellar streams are elongated groups of stars that once belonged to smaller galaxies or star clusters that were disrupted by our galaxy’s gravitational forces. These celestial structures serve as crucial forensic evidence of our galaxy’s formation history and provide unique probes of dark matter’s distribution and properties. We will apply graph neural networks and other geometric deep learning techniques to analyze stellar data from the Gaia satellite, which has mapped the positions and velocities of tens of millions of stars with unprecedented precision. These methods are particularly well-suited for this astronomical challenge as they can naturally capture the spatial and kinematic relationships between stars while handling irregular data structures. The project will also incorporate complementary data from the Dark Energy Spectroscopic Instrument (DESI) survey to enhance our understanding of stellar properties. By developing this novel approach to stellar stream detection, we aim to uncover previously unknown structures and gain deeper insights into the Milky Way’s evolutionary history and dark matter distribution.

 

Researcher: Ting Li, David A. Dunlap Department of Astronomy and Astrophysics, Faculty of Arts and Science, University of Toronto

 

Skills required:

  • Strong Python programming skills.
  • Familiarity with data analysis and visualization
  • Basic familiarity with machine learning techniques.
  • Optional: prior exposure to deep learning or graph neural networks

Primary research location:

  • University of Toronto St. George Campus in-person

Research description:

AI debate has been proposed as an adversarial scalable oversight method, with encouraging recent progress (see refs below). However, debate elicits a wide range of capabilities, in particular a mix of knowledge and persuasion. In this pilot project, a new debate protocol focused on disentangling persuasive tendencies from knowledge elicitation will be implemented, validated and explored. Additionally supported by OpenAI funds, this research theme broadly aims to develop scalable oversight methods for super-alignment, using physics as a ground truth. The objective of super-alignment is to ensure that AI systems remain aligned with human values and intentions, even in the limit where they become more capable than humans. Reference document1

 

Researcher: Kristen Menou, Department of Physical and Environmental Sciences, University of Toronto Scarborough

 

Skills required:

  • LLM inference
  • Alignment & Scalable Oversight
  • Extras: Top-down Representations, Reinforcement Learning

Primary research location:

  • University of Toronto Scarborough Campus and/or Remote

Research description:

This project aims to explore the potential of large language models (LLMs) to address critical challenges in smart grids, such as cyberattack detection and energy forecasting. LLMs excel not only in accuracy but also in generating explainable insights, making them valuable for complex decision-making in energy systems. The project will focus on developing LLM-based frameworks tailored to smart grid applications, emphasizing explainability to enhance trust and transparency in model predictions. Key tasks include designing models for detecting cyber threats and forecasting energy demand, as well as evaluating their ability to provide clear, actionable explanations for their outputs.
Interns will gain hands-on experience in deploying and fine-tuning LLMs, applying cutting-edge AI solutions to real-world energy challenges, and enhancing cybersecurity and operational efficiency in smart grids.
 
Researcher: Deepa Kundar, Edward S. Rogers Sr. Department of Electrical & Computer Engineering, Faculty of Applied Science and Engineering, University of Toronto
 
Skills required: 
  • Machine Learning: Basic understanding of large language models and fine-tuning techniques.
  • Programming: Proficiency in Python and experience with libraries like Hugging Face Transformers or OpenAI APIs.
  • Smart Grid Fundamentals: General knowledge of smart grid operations and challenges (cybersecurity, energy forecasting).
  • Cybersecurity Awareness: Familiarity with cyberattack detection concepts.
  • Analytical Thinking: Ability to interpret model outputs and focus on explainability

Primary research location:

  • University of Toronto St. George Campus and/or Remote

Research description:

Horizontal Gene Transfer (HGT) is a process in which organisms acquire foreign genes from different species. HGT contributes to organismal evolution and has been an important source of genetic diversity. HGT was commonly identified in prokaryotes but rarely reported in eukaryotes. However, our understanding of HGT in eukaryotes is quickly expanding with the production of genomic resources and the development of detection tools. The Kingdom Fungi represents a striking example, especially the obligate symbionts, which interact intimately with various host organisms. Our research group has been dedicated to detecting fungus-related HGT elements and has discovered several such cases, including in mosquito gut-dwelling fungi (doi:10.1093/molbev/msw126), herbivorous mammal rumen fungi (doi:10.1128/mSystems.00247-19), amphibian gastrointestinal fungi (doi:10.1534/g3.120.401516), and photobiont-associated fungi (doi:10.1016/j.cub.2021.01.058). This project aims to identify novel HGT events using the lab’s newly assembled fungal genomes representing underexplored lineages on the Tree of Life. The student working on this project will help refine the lab’s existing pipelines and analyze the fungal genomes, as well as related host data, to reconstruct the evolutionary history of identified genes by conducting comparative genomics. A high-impact research report will be prepared at the end of the project, with the aim of publication.

 

Researcher: Yan Wang, Department of Biological Sciences, University of Toronto Scarborough, University of Toronto

 

Skills required:

  • Basic programming skills in Linux, Python, and/or R; effective communication skills
  • Preferred qualifications: strong interest in comparative genomics and host-microbe interactions, and competence in writing and public speaking.

Primary research location:

  • University of Toronto Scarborough Campus and/or Remote

Research description:

Why don’t more households invest in the stock market? Is it too difficult to open a brokerage account? While this may have been true in the past, advancements in FinTech have made the process simple and accessible, often requiring just a few taps on a smartphone. Instead, could the real issue be that households are simply misinformed about the risks and returns of stock market investing? Using large-scale survey data, this project aims to explore whether limited stock market participation can be attributed to misperceptions about expected returns. We will study patterns of misperceptions across household types along observable characteristics like income, age, and occupation. We also seek to study which interventions can alleviate misinformation and help increase stock market participation. This project will entail collecting, analyzing, and visualizing data. Strong and pragmatic programming experience is required to download and assemble large data sets. An understanding of financial concepts is required for analysis, and visualization entails displaying results in a concise yet appealing way. This project is ideal for an undergraduate student who has some research experience and is considering graduate school in economics or finance.

 

Researcher: Michael Boutros, Department of Economics, University of Toronto Mississauga, University of Toronto

 

Skills required:

  • Background or interest in finance/economics.
  • Knowledgeable in at least one of Stata, R, Python, or similar.
  • Strong written communicator.

Primary research location:

  • University of Toronto St. George Campus and/or Remote

Research description:

Measuring and predicting ocean currents is crucial to understanding our climate system, marine ecosystems, and societal maritime activities. Satellites are key tools to do so, but cannot provide more than surface information. In this project, we seek to infer sub-surface properties by leveraging three-dimensional realistic numerical forecasts and machine learning techniques. Of prime interest is the mixed layer, which is the uppermost layer of the ocean. It is the buffer between the atmosphere and the deep ocean, and hosts rich ecosystems. To reconstruct its depth is key to predicting the state of the upper ocean, and to do so from satellite data would provide . You will use output from a Fisheries and Oceans Canada operational numerical model as your dataset. The data is three-dimensional and therefore contains the answer to the question of how deep the mixed layer is. The model solves equations that are constrained by observations and finely tuned to reproduce realistic conditions. Using this data set, you will train a deep-learning algorithm (most likely a U-Net, but we are open to exploring different avenues) to predict this depth when only surface information (e.g. sea surface temperature, height, or salinity) is provided.

 

Researcher: Nicolas Grisouard, Department of Physics, Faculty of Arts and Science, University of Toronto

 

Skills required:

  • Programming experience in, or willingness to learn, Python and deep-learning tools such as TensorFlow or PyTorch.
  • We do not require notions of fluid dynamics or oceanography.

Primary research location:

  • University of Toronto St. George Campus in-person

Research description:

The development of an equity dashboard in hospitals has been proposed as a solution to facilitate the identification of variations in outcomes, encourage accountability, and support ongoing monitoring. Our research sought to develop an equity dashboard using data collected from the maternal care wards of a hospital in the US in 2019 and 2020. The data obtained were cleaned, and patient delivery data were linked to their demographic data using Microsoft Excel and Python. The data were then disaggregated by race/ethnicity and statistical analysis was performed to assess differences in the outcomes using R. Tableau Desktop was used to develop 18 visualizations of the measures. We are currently conducting usability testing. We could not complete the planned predictive modeling; however, we are working with our collaborators to obtain five years of data to incorporate predictive analytics in the next iteration. Once we validate its efficacy through user testing, we will disseminate our dashboard for implementation.

Project Goals:
  • Develop predictive models of adverse events and outcomes based on patient characteristics and social vulnerability. Analyze feature importance for these predictions.
  • Develop an Excel macro and content pack in Power BI that can generate comparable visualizations.
  • Make the dashboard publicly accessible through Tableau Public.

 

Researcher: Myrtede Alfred, Department of Mechanical and Industrial Engineering, Faculty of Applied Science and Engineering, University of Toronto

 

Skills required:

  • Knowledge of statistical analysis techniques
  • Knowledge of ML techniques (regression, random forests, SVM, GBTs)
  • Ability to conduct statistical analysis in R or Python
  • Experience using Python libraries for ML and explainable artificial intelligence tools
  • Experience developing macros in Microsoft Excel
  • Experience developing data visualizations (Python, R, Tableau, and Power BI)

Primary research location:

  • University of Toronto St. George Campus in-person

Research description:

Supervisor Fralick has developed a framework of six domains of study design that can affect the internal validity of randomized controlled trials (RCTs), encapsulated by the acronym PHOBIA: Placebo controlled? How was it funded? Outcome clinically valid? Blinded? Intention-to-Treat? A lot of centres and patients included? When evaluating an RCT, these 6 elements are crucial considerations. The current paradigm leaves reviewers to parse these details from the manuscript, which is inefficient, time-consuming, risks bias, and lacks quality control. All RCTs require registration on a publicly available clinical trial registry, meaning key aspects of their design are readily available. This project will apply supervised machine learning (ML) and two large language models (LLMs) for automating part of the peer review process. The data from the RCT will be parsed. Then, LLM 1 (Summarizer) will extract key information related to the PHOBIA framework. LLM 2 (Validator) will validate the summary by checking it against the original study content. Performance of the dual-LLM system will be evaluated according to the following metrics: hallucination detection, consistency, speed, and helpfulness. A detailed comparison of the system’s reviews with traditional human reviews will assess whether the LLMs can reliably augment the peer review process for RCTs.

 

Researcher: Michael Fralick, Lunenfeld-Tanenbaum Research Institute

 

Skills required:

  • Self-motivated
  • Strong critical thinking skills
  • Strong writing and communication skills
  • Not required but an asset:
    • Familiarity with natural language processing and/or machine learning

Primary research location:

  • Lunenfeld-Tanenbaum Research Institute in-person and remote

Research description:

We have each inherited our genomes from a vast set of ancestors who were scattered across geographic space. The locations of these ancestors influence the patterns of genetic diversity we see today. Given the genetic relationships among a set of individuals we can therefore hope to reconstruct the spatial history of our shared ancestors. Our lab has recently developed a method to locate genetic ancestors by modeling movement down the many trees that relate recombining genomes (Osmond & Coop 2024) and we are applying this to a variety of species. One limit of our current approach is that the uncertainty in the location of ancestors increases as we move back in time, away from the known locations of the samples. This limit can now be relaxed with the increasing availability of ancient genomes that will effectively anchor the trees in space further back in time. The goal of this project is to extend our method to include ancient genomes and apply the method to publicly available human genetic data. There are two key questions: 1) How well does our existing method locate the ancient genomes? and 2) How much do the ancient genomes change the inferred locations of other genetic ancestors?

 

Researcher: Matthew Osmond, Department of Ecology and Evolutionary Biology, Faculty of Arts and Science, University of Toronto

 

Skills required:

  • We will extend the method in Python, use a computer cluster to implement it on human data, and share our new method with others on GitHub.
  • Some coding experience, especially in Python and bash/Unix.
  • Advanced math and stats would also be useful.
  • Familiarity with evolution, genetics, and probability/statistics is a major asset.

Primary research location:

  • University of Toronto St. George Campus and/or Remote

Research description:

This project aims to develop an integrative machine learning approach to identify causal genes underlying genome-wide association study (GWAS) risk loci. We will build on the state of the art in three major ways: 1) By dramatically increasing the diversity of input biological networks. We will incorporate curated pathway databases, co-essentiality networks, protein-protein interaction networks, genetic interaction maps, and co-expression networks. 2) By improving inference and featurization of these networks. We will use BIONIC, a deep learning approach developed by collaborators in Toronto that performs network fusion via graph convolutional neural networks, to combine the information gleaned from our biological networks into a single low-dimensional feature vector per gene. 3) By improving the machine learning modelling itself. We will predict gene-level GWAS p-values from our network-based feature vectors via leave-one-chromosome-out cross-validation. We will use gradient boosting, a popular machine learning approach that flexibly captures non-linear relationships between features while avoiding overfitting. Naively applying gradient boosting is incorrect because it ignores that gene-level GWAS p-values may not be i.i.d. due to linkage disequilibrium. We will preprocess with Cholesky whitening to decorrelate the gene-level p-values and features. Thus, we will develop better methods for inferring both biological networks and GWAS causal genes.
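
For applicants unfamiliar with Cholesky whitening, here is a minimal sketch of the idea, assuming NumPy is available. The 4-gene correlation matrix, the feature matrix, and the gene-level scores are made-up stand-ins for illustration; how the project actually estimates the correlation structure or transforms p-values before whitening is not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlation matrix among 4 neighbouring genes
# (a stand-in for correlation induced by linkage disequilibrium).
sigma = np.array([
    [1.0, 0.6, 0.2, 0.0],
    [0.6, 1.0, 0.5, 0.1],
    [0.2, 0.5, 1.0, 0.4],
    [0.0, 0.1, 0.4, 1.0],
])

# Cholesky factor: sigma = chol @ chol.T, with chol lower-triangular.
chol = np.linalg.cholesky(sigma)


def whiten(a, chol_factor):
    """Return inverse(chol_factor) @ a, computed with a linear solve."""
    return np.linalg.solve(chol_factor, a)


features = rng.normal(size=(4, 3))    # hypothetical per-gene feature vectors
scores = chol @ rng.normal(size=4)    # hypothetical correlated gene-level scores

# Apply the same transformation to the targets and to the features.
white_features = whiten(features, chol)
white_scores = whiten(scores, chol)

# Sanity check: the covariance implied for the whitened scores,
# inverse(chol) @ sigma @ inverse(chol).T, is the identity matrix.
white_cov = whiten(whiten(sigma, chol).T, chol)
```

After whitening, the transformed scores have identity covariance under the assumed model, so a downstream learner can treat them as uncorrelated.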

 

Researcher: Michael Wainberg, Lunenfeld-Tanenbaum Research Institute

 

Skills required:

  • Experience in Python is a must-have, for instance through introductory computer science courses.
  • Familiarity with genetics, statistics, and data science packages like NumPy and polars is a major asset.

Primary research location:

  • Lunenfeld-Tanenbaum Research Institute in-person and remote

Research description:

The Simons Observatory (SO) is a new, multi-telescope experiment to study the origin and evolution of the cosmos by measuring the cosmic microwave background (CMB), the oldest light in the Universe. Raw data consist of TBs of timestreams of measured sky brightness recorded each day—adding up to several PB over several years—that need to be reconstructed into 2D maps. However, before this can happen, the timestreams need to be automatically processed to remove noise contaminants and foreground galaxies/stars that block the main signal. The successful candidate will join the research groups of Profs. Adam Hincks and Renée Hložek that are actively researching machine learning methods to identify and classify these objects, using existing data from the Atacama Cosmology Telescope (ACT), a precursor to SO. Possible projects include characterising and improving deep learning techniques (including combining multi-modal data and using attention mapping) for detecting and classifying events in the telescope’s raw timestreams and contributing to the data processing pipeline of SO. An exciting aspect of this project is that our classification will help enable the search for astrophysical transients, such as flaring stars and gamma ray bursts.

 

Researcher: Adam Hincks, David A. Dunlap Department of Astronomy and Astrophysics, Faculty of Arts and Science, University of Toronto

 

Skills required:

  • Python coding
  • Highly desirable:
    • understanding of machine learning concepts (e.g., active learning)
    • experience with scikit-learn/sklearn
    • familiarity with collaborative coding workflows with GitHub
  • Helpful assets:
    • web development (e.g., CSS, JS, Vue, React)
    • database development (e.g., SQL)
    • an interest in cosmology and astrophysics

Primary research location:

  • University of Toronto St. George Campus and/or Remote

Research description:

Our Milky Way galaxy is surrounded by numerous small galaxies and star clusters, each influenced by the powerful gravitational forces of our galaxy. These forces create stellar streams, celestial “rivers” of stars that gracefully orbit around the Milky Way. These streams are not just beautiful; they hold the keys to unraveling the mysteries of galaxy formation and the hidden nature of dark matter. (Curious? Check out this fascinating feature in The Globe & Mail: Star Streams Reveal Milky Way’s Ravenous History.) Thanks to revolutionary cosmic surveys, we now have detailed data on millions of stars, including their positions and velocities in full 6D! As a SUDS Scholar, you’ll be at the cutting edge of this exciting field, developing a Bayesian framework to determine the probability that a star belongs to a particular stream and to characterize the properties of these stellar streams. You will work with massive astronomical datasets, totaling several gigabytes, from one of the most extensive spectroscopic surveys: the Dark Energy Spectroscopic Instrument (DESI). This project will give you the opportunity to develop and apply innovative statistical and computational techniques that are not only crucial for revealing the secrets of stellar streams but also for shaping the future of astronomical surveys.
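
To give a flavour of what a membership probability means, the sketch below applies Bayes' theorem to a deliberately simplified two-component model: stream and background stars each follow a Gaussian in line-of-sight velocity. The velocities, widths, and the 10% stream fraction are invented for illustration, and the project's actual framework would likely use more dimensions and fit these parameters rather than fix them.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def membership_probability(v, stream_frac, stream, background):
    """Posterior probability that a star with velocity v is a stream member.

    stream and background are (mean, sd) pairs; stream_frac is the prior
    fraction of stars that belong to the stream.
    """
    p_stream = stream_frac * normal_pdf(v, *stream)
    p_background = (1.0 - stream_frac) * normal_pdf(v, *background)
    return p_stream / (p_stream + p_background)


# Hypothetical parameters in km/s: a kinematically cold stream
# sitting on top of a broad background population.
stream = (-50.0, 5.0)
background = (0.0, 100.0)
stream_frac = 0.1

p_near = membership_probability(-50.0, stream_frac, stream, background)
p_far = membership_probability(150.0, stream_frac, stream, background)
```

A star moving at the stream's mean velocity gets a high membership probability even though stream stars are rare overall, while a star far from that velocity gets a probability close to zero.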

 

Researcher: Ting Li, David A. Dunlap Department of Astronomy and Astrophysics, Faculty of Arts and Science, University of Toronto

 

Skills required:

  • Python Programming: Strong interest in developing and troubleshooting code using Python.
  • Bayesian Statistics: Enthusiasm for Bayesian statistics, including sampling and model comparison.
  • Communication: Proficiency in literature reading, scientific writing, and presenting scientific findings.
  • Teamwork: Ability to work well in teams and contribute to collaborative research.

Primary research location:

  • University of Toronto St. George Campus and/or Remote

Research description:

We are constantly exposed to various sensory stimuli such as sight, sound, and smell. Although sensory systems detect and interpret these stimuli, our perception is influenced by internal states such as hunger, stress, and inflammation. However, it is still unclear how the signals that signify these states, such as hormones, peptides, and cytokines, are encoded in gene expression of individual neurons and modulate the patterns of neural responses to stimuli. To address this question, our lab uses the mouse olfactory system as a model. Olfaction plays fundamental roles in many aspects of our life including learning and memory and detection of food and danger. In addition, it has been shown that odor processing is influenced by internal states even at the first step where sensory neurons in the nose detect odors. However, mechanisms through which individual neurons encode the internal states and modify responses to stimuli are still unknown. This SUDS project will primarily aim to quantitatively characterize the state-dependent changes in gene expression in the olfactory system by analyzing single-cell and bulk genomics datasets (RNA, epigenome, and protein) obtained from mice subjected to changes in internal states such as hunger and inflammation.

 

Researcher: Tatsuya Tsukahara, Lunenfeld-Tanenbaum Research Institute

 

Skills required:

  • Proficiency in Python (or R) for analysing datasets including single-cell and bulk transcriptomics, epigenome profiling data (chromatin accessibility and DNA/histone modifications), and proteomics data.

Primary research location:

  • Lunenfeld-Tanenbaum Research Institute in-person

Research description:

To develop safe nanoparticles for use during pregnancy, we first need to understand the cross-talk (communication) between cells of the placenta (barrier between the mother and the baby) and other cells from the mother under different pathological conditions, e.g. cancer. We developed an organ-on-a-chip model to mimic this environment in the lab and investigate the cross-talk between cells. We used this model to generate proteomic and transcriptomic data. A data science student will work with a graduate student to help analyze this large dataset and develop different approaches to visualizing it. This is a great opportunity for the student to work in an interdisciplinary team at the intersection of nanotechnology and microfluidics, learn new wet-lab techniques, and apply their knowledge in data science to solve real-world problems.

 

Researcher: Hagar Labouta, Unity Health Toronto

 

Skills required:

  • A motivated data science student with expertise in R, Python and/or other data packages.
  • Prior experience on omics projects is advantageous.
  • No prior knowledge in nanomedicine or organ-on-a-chip technology is required; this will be a learning opportunity for the student as well.

Primary research location:

  • Unity Health Toronto in-person

Research description:

This project is an expansion of `piccard`, a Python library to perform longitudinal analyses on data tabulated on unharmonized spatial units. The final library will have three modules: (1) temporal path creation, (2) visualization, and (3) classification. The first module is available on [PyPI]. This module introduces `piccard`’s graph-based solution to a frequent problem in spatial data science: identifying temporal trends across noncongruent spatial units of aggregation (e.g., census tracts, dissemination areas, and postal codes from different years). We conceptualize spatial units as nodes, and the edges connecting them as their overlapping geographical areas. Our method creates paths that preserve the original spatial units and their attributes. Thus, `piccard` overcomes some of the limitations of traditional harmonization methods involving labour-intensive apportioning (e.g., defining ad-hoc target units). The selected student will work with the PI and Professor Daniel Silver (UTSC Sociology) in developing the second and third modules of the library. The visualization module will allow users to subset and inspect network paths. Meanwhile, the classification module will facilitate the classification of paths according to the distribution of the shared attributes across the original geographic units. For example, a user could classify census tracts according to patterns of variation over time.
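
The nodes-and-edges idea can be sketched in a few lines of plain Python. This is only a conceptual illustration and does not use or reproduce the `piccard` API; the tract identifiers, census years, and overlap list are hypothetical.

```python
from collections import defaultdict

# Hypothetical overlaps between census tracts in consecutive census years.
# Each edge links two (year, tract_id) nodes whose areas intersect.
overlaps = [
    (("2011", "A"), ("2016", "A1")),
    (("2011", "A"), ("2016", "A2")),   # tract A was split in 2016
    (("2011", "B"), ("2016", "B")),
    (("2016", "A1"), ("2021", "A1")),
    (("2016", "A2"), ("2021", "A2")),
    (("2016", "B"), ("2021", "B")),
]

# Adjacency list pointing from each unit to the later units it overlaps.
successors = defaultdict(list)
for earlier, later in overlaps:
    successors[earlier].append(later)


def temporal_paths(node):
    """List every path from node forward to a unit with no successor."""
    if not successors.get(node):
        return [[node]]
    return [
        [node] + rest
        for nxt in successors[node]
        for rest in temporal_paths(nxt)
    ]


# Build all paths that start from a 2011 unit.
start_nodes = [n for n in successors if n[0] == "2011"]
paths = [p for n in start_nodes for p in temporal_paths(n)]
```

Each path keeps the original units intact, so the split of tract A appears as two separate paths rather than being apportioned into a single target unit.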

 

Researcher: Fernando Calderón Figueroa, Department of Human Geography, University of Toronto Scarborough, University of Toronto

 

Skills required:

  • Competence in Python and some experience with version control (GitHub).
  • Familiarity with network analysis, and spatial data science concepts and tools (including `networkX`, `matplotlib` and `geopandas`) is an asset.
  • Additional (but _not_ required) skills include experience with R and familiarity with quantitative urban studies.

Primary research location:

  • University of Toronto Scarborough Campus - Remote

Research description:

Magnetic Resonance Imaging (MRI) has revolutionized the study of brain aging. It provides non-invasive, detailed images of brain structure and function, allowing researchers to observe changes associated with normal aging and neurodegenerative diseases. MRI results have shown promise in predicting longitudinal brain functions in aging through the following: volumetry, which measures changes in brain volume, particularly in regions like the hippocampus and prefrontal cortex, which are vulnerable to age-related decline; cortical thickness, which assesses the thickness of the cerebral cortex, which can thin with age; and white matter integrity, for which diffusion-tensor MRI (DTI) measures the diffusion of water molecules in white matter tracts, revealing changes in microstructure and connectivity. In this project, we will focus on the use of the MRI and cognitive data from the Baltimore Longitudinal Study of Aging (BLSA), and aim to determine a predictive modeling approach for estimating longitudinal changes in cognitive function in older adults. Methods include but are not limited to linear mixed-effects models, support-vector machines, neural networks and deep learning. The outcome of this project will enable more effective use of MRI in early diagnosis.

 

Researcher: Jean Chen, Baycrest

 

Skills required:

  • Usage of the Linux operating system
  • Programming in Python and/or MATLAB
  • Basic data-science concepts, e.g. correlation, regression
  • Basic statistical concepts, e.g. t-tests, F-tests, outlier identification (optional)
  • Experience with advanced data-science and machine-learning methods (optional)
  • Medical imaging analysis experience

Primary research location:

  • Baycrest in-person and remote

Research description:

Children with medical complexity are those who have multiple significant chronic health problems, functional limitations, and high health care and resource needs/utilization. Interventional radiologists use minimally invasive techniques to place a tube, called a gastrostomy tube (g-tube), through the abdomen and into the stomach of these children. There is limited evidence to guide the clinical management of g-tube feeding in children, substantial variation in practice, and an opportunity to improve care and outcomes through research, data analytics, and clinical innovation.
Dr. Sanjay Mahant (SickKids) and Dr. Nathan Taback (Statistical Sciences) will co-supervise this project. The student can participate in training programs and lecture series at the SickKids Research Institute.
The data source will be SEDAR (hospital EMR data), accessed through the HPC4Health high-performance computing environment at the SickKids Research Institute.
 
Project Goals:
  • Conduct descriptive phenotyping of children undergoing primary g-tube insertion at SickKids.
  • Perform statistical analysis of outcomes and patient trajectories after primary g-tube insertion.
  • Build predictive models to identify children who will develop feeding intolerance and have high healthcare utilization after primary g-tube insertion.

Researcher: Nathan Taback, Department of Statistical Sciences, Faculty of Arts and Science, University of Toronto

 

Skills required:

  • Intermediate/advanced courses in statistics/predictive models.
  • Data analysis/modelling (including machine learning models) experience.
  • Intermediate level programming with data using R or Python.
  • Basic experience with database queries (e.g., SQL).
  • Experience using and working in a command line environment.
  • Excellent oral and written communication.

Primary research location:

  • University of Toronto St. George Campus and/or Remote

Research description:

Assisted reproductive technology (ART) refers to any fertility treatment in which oocytes (eggs) or embryos (fertilized eggs) are manipulated in a laboratory. The first step of ART is controlled ovarian stimulation (COS), which involves daily injections that stimulate the growth of multiple ovarian follicles. Eggs are then retrieved from these follicles. Predicting a patient’s response to COS is challenging. Ovarian reserve markers such as antral follicle count (AFC) and serum antimüllerian hormone (AMH) are good, but not perfect, predictors of oocyte yield. For example, a patient with a high AFC or AMH may have a poor COS response, whereas a different patient with the same AFC or AMH might have a robust COS response. Follicular Output RaTe (FORT) has been proposed as a solution to this problem. The FORT score is calculated by dividing the preovulatory follicular count (follicles measuring 16-22 mm) by AFC and multiplying by 100. Previous studies have demonstrated an association between FORT and mature oocyte yield; however, these data have been generated from young egg donors or patients undergoing in vitro fertilization. The current project aims to investigate the utility of FORT scores in women undergoing COS for urgent fertility preservation due to cancer or other medical indications.
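As a small illustration of the FORT calculation described above, the sketch below computes the score in Python. The function name, argument names and the two example patients are hypothetical and chosen only for illustration; they are not taken from the project's data or code.

```python
def fort_score(preovulatory_follicle_count: int, antral_follicle_count: int) -> float:
    """Follicular Output RaTe: preovulatory follicle count / AFC * 100.

    preovulatory_follicle_count: follicles measuring 16-22 mm after stimulation.
    antral_follicle_count: baseline antral follicle count (AFC).
    """
    if antral_follicle_count <= 0:
        raise ValueError("AFC must be a positive count")
    if preovulatory_follicle_count < 0:
        raise ValueError("Preovulatory follicle count cannot be negative")
    return 100.0 * preovulatory_follicle_count / antral_follicle_count


# Two hypothetical patients with the same AFC but different responses to COS.
poor_response = fort_score(preovulatory_follicle_count=6, antral_follicle_count=20)
robust_response = fort_score(preovulatory_follicle_count=15, antral_follicle_count=20)
print(poor_response)    # 30.0
print(robust_response)  # 75.0
```

The two example patients share an AFC of 20 yet receive different scores, which is the situation the description says AFC or AMH alone cannot distinguish.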

 

Researcher: Nigel Pereira, Lunenfeld-Tanenbaum Research Institute

 

Skills required:

  • Creation and maintenance of a master data list using Excel
  • Exporting baseline and COS parameters from patient charts
  • Basic descriptive statistics, tabulation, graphing and calculation of FORT
  • Motivated and eager to learn
  • The selected student is welcome to observe ART procedures to better understand the project

Primary research location:

  • Lunenfeld-Tanenbaum Research Institute in-person and remote

Research description:

Reproducibility and replicability provide an important foundation for increasing the openness and transparency of research findings. As such, efforts to ensure that research is both reproducible (i.e., the same findings can be reproduced using the same data and analyses) and replicable (i.e., being able to produce the same results in new datasets) have increased over recent years. In fields such as neuroscience, this can be challenging given the array of data types, tools, libraries, frameworks, programming languages and operating systems used in the analysis of any given study. In this project, the student will have the opportunity to work on pipelines we are establishing at the Rotman Research Institute (RRI) to process different kinds of MRI and MEG (magnetoencephalography) data. In particular, they will test the robustness of the pipelines in terms of reproducibility and replicability using existing datasets collected at the RRI as well as open datasets from online repositories (e.g., openneuro.org). The student will also contribute to documentation related to these pipelines. They will have the opportunity to work with experts in biomedical imaging, MRI/MEG data analysis and neuroinformatics, and become familiar with initiatives such as the Brain Imaging Data Standard (BIDS).
 
Researcher: Bradley Buchsbaum
 
Skills required:
  • Advanced computer programming skills (e.g., Python, shell scripting, Matlab)
  • Data analysis skills including machine learning
  • Usage of Linux operating system
  • Effective oral and written communication skills
  • Ability to work independently and within a team
  • Beneficial to have:
    • Neuroimaging analysis experience
Primary research location:
  • Baycrest in-person and remote

Research description:

Increased investment in early childhood education and care (ECEC) is being seen globally. Canada has already made significant strides by implementing universal ECEC, ensuring all children have access to early learning opportunities. However, as countries roll out such large-scale systems, it is critical that these changes are made thoughtfully and correctly from the outset, as altering a system once it is entrenched becomes difficult. A central consideration during implementation is equity and inclusion, ensuring that all children, regardless of background or ability, benefit from these services. In Canada, the rollout of the Canada-Wide Early Learning and Child Care (CWELCC) initiative prioritizes equity and inclusion as founding principles. Yet, one area of concern remains how children with disabilities are being included within early years curriculum frameworks. These frameworks, while comprehensive, are often lengthy and dense, making it difficult to evaluate their effectiveness for children with disabilities using traditional qualitative methods. To address this gap, I plan to leverage large language models (LLMs) to analyze the content of these frameworks to determine how children with disabilities are discussed and integrated, offering valuable insights into the current state of inclusivity in early childhood education.

 

Researcher: Elizabeth Dhuey, Department of Management, University of Toronto Scarborough, University of Toronto

 

Skills required:

  • Skills needed are Python and R.

Primary research location:

  • University of Toronto St. George Campus and/or Remote

Research description:

Recent advances in foundation models and self-supervised learning have opened new possibilities for learning robust state representations and world models for robotics. While traditional approaches often rely on hand-crafted state representations or require large amounts of task-specific data, modern approaches leveraging pre-trained models and self-supervised learning promise to create more generalizable and data-efficient solutions. We are exploring novel approaches to learn and utilize state representations and world models that can effectively capture both the physical dynamics of robotic systems and the semantic understanding needed for complex tasks. This includes investigating several promising directions:

  • Leveraging large language models (LLMs) and vision-language models (VLMs) as knowledge priors for robotics tasks
  • Developing self-supervised learning techniques that can efficiently learn from unlabeled robot interaction data
  • Creating hybrid architectures that combine learned world models with imitation learning for improved learning
  • Investigating methods for abstracting and transferring learned representations across different tasks and domains

The ultimate goal is to develop algorithms that can learn more efficiently from demonstrations while maintaining robustness and generalization capabilities. Success in this area could significantly reduce the task-specific data needed for robot learning while improving the ability to handle novel situations.

 

Researcher: Igor Gilitschenski, Department of Mathematical and Computational Sciences, University of Toronto Mississauga, University of Toronto

 

Skills required:

  • Strong background in deep learning and familiarity with modern architectures (Transformers, diffusion models, etc).
  • Experience with robot learning frameworks (PyBullet, MuJoCo, etc) and real robots is highly beneficial.
  • Previous exposure to imitation learning or reinforcement learning is a plus.

Primary research location:

  • University of Toronto Mississauga campus in-person

Research description:

Understanding censoring, which occurs when the event of interest is not observed for some individuals within the study period, is critical for modeling time-to-event data. This is particularly important for applications such as risk prediction in cancer studies, electronic health records, and clinical trials. Ignoring censoring can lead to biased and inaccurate predictive performance. While numerous statistical approaches in survival analysis, such as Cox regression, have been developed to handle censoring, it remains an open challenge to effectively integrate these methods with modern statistical learning techniques for classification. This SUDS project aims to extend the use of Inverse Probability of Censoring Weighting (IPCW) in conjunction with statistical learning to improve risk prediction for right-censored data. Although IPCW has shown promise when integrated with statistical learning methods (e.g., Vock et al., 2016), its predictive performance can suffer when a significant proportion of subjects are censored before the time of interest due to a huge reduction in effective sample sizes. This project will explore new methodological advancements to address these limitations and validate these approaches through simulation studies and real-world applications in cancer genomics. Students working on this SUDS project will meet weekly with the supervisor to discuss progress and address challenges.
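To make the IPCW idea above concrete, the following is a minimal sketch in Python with NumPy of how such weights are commonly computed for a fixed time of interest. It is a generic textbook-style illustration, not the project's method: the function names, the toy data and the horizon are hypothetical, and the sketch assumes no ties between event and censoring times.

```python
import numpy as np


def censoring_survival(times, events):
    """Kaplan-Meier estimate of G(t) = P(C > t), treating censoring as the 'event'.

    times: observed follow-up times; events: 1 = event observed, 0 = censored.
    Assumes no ties between event times and censoring times.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    censor_times = np.unique(times[events == 0])
    factors = []
    for u in censor_times:
        at_risk = np.sum(times >= u)
        n_censored = np.sum((times == u) & (events == 0))
        factors.append(1.0 - n_censored / at_risk)
    surv = np.cumprod(factors)

    def G(t, left_limit=False):
        # Count censoring times <= t (or < t when the left limit G(t-) is requested).
        side = "left" if left_limit else "right"
        k = np.searchsorted(censor_times, t, side=side)
        return 1.0 if k == 0 else float(surv[k - 1])

    return G


def ipcw_weights(times, events, horizon):
    """IPCW weights for the binary outcome 'event occurred by `horizon`'."""
    G = censoring_survival(times, events)
    weights = np.zeros(len(times))
    for i, (t, d) in enumerate(zip(times, events)):
        if t <= horizon and d == 1:
            weights[i] = 1.0 / G(t, left_limit=True)  # event seen before the horizon
        elif t > horizon:
            weights[i] = 1.0 / G(horizon)             # still event-free at the horizon
        # Subjects censored before the horizon keep weight 0: their outcome is unknown.
    return weights


# Hypothetical toy data: six subjects, two of them censored before the horizon.
times = [2, 3, 4, 5, 7, 8]
events = [1, 0, 1, 0, 1, 1]
weights = ipcw_weights(times, events, horizon=6)
print(weights)
print("subjects with non-zero weight:", int(np.count_nonzero(weights)))
```

In this toy example the two subjects censored before the horizon receive zero weight and the remaining four are up-weighted to compensate, which illustrates the reduction in effective sample size mentioned above. The weighted sample could then be passed to a classifier that accepts per-sample weights.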

 

Researcher: Jun Young Park, Department of Statistical Sciences, Faculty of Arts and Science, University of Toronto

 

Skills required:

  • Completion of one-semester undergraduate courses in (i) calculus-based probability, (ii) statistical learning (or machine learning), and (iii) generalized linear models is required.
  • Strong programming skills in R or Python, demonstrated through relevant coursework or prior experience, are highly desirable.

Primary research location:

  • University of Toronto St. George Campus and/or Remote

Research description:

Social determinants of health are increasingly acknowledged as key factors in achieving equitable, efficient, and patient-centered care. It is now well recognized that factors such as inadequate transportation, economic hardship, language barriers, employment security, and health literacy play a critical role in patients' care experiences and health outcomes. Understanding the prevalence of these determinants and their impact on patient care is therefore essential to shaping health services and programs that are inclusive and responsive to community needs. This project will employ Natural Language Processing (NLP) approaches to uncover and help study references to social determinants of health arising from patient experience data. In collaboration with the Investigative Journalism Bureau at the University of Toronto, our lab has acquired 120,000 anonymous patient feedback comments from 45 Ontario hospitals, spanning 2015 to 2020. With support from the Institute for Pandemics (IfP), our lab has previously developed approaches to help mitigate selection biases in comments and analyze trends in patient experiences over time. This research project will build upon our previous work and support our efforts to develop analytic approaches that can help uncover barriers to health equity. Outputs will consist of a departmental presentation and research report.

 

Researcher: Zahra Shakeri, Institute of Health Policy, Management, and Evaluation, Dalla Lana School of Public Health, University of Toronto

 

Skills required:

  • We welcome students with an interest in machine learning and/or natural language processing, as evidenced by previous coursework, research projects, or self-study.
  • Students are encouraged to articulate their interest in health equity and how it aligns with their experience and/or career goals.
  • Experience working in multidisciplinary teams is an asset.

Primary research location:

  • University of Toronto St. George Campus and/or Remote

Research description:

Perturbations of genetic interactions at the genomic and transcriptomic levels play a critical role in promoting tumorigenesis. Therefore, a systematic understanding of these perturbations will likely provide novel insight into cancer biology and open new therapeutic avenues. As part of this project, we will leverage recent advances in graph-based machine learning methods and large-scale genomics and transcriptomics data to systematically characterize genetic perturbations in various cancer types.

 

Researcher: Sushant Kumar, Princess Margaret Cancer Centre, University Health Network

 

Skills required:

  • Computer programming (Python/R preferable)
  • Prior machine learning experience
  • Background in computational biology/bioinformatics preferable but not required

Primary research location:

  • Princess Margaret Cancer Centre in-person and remote

Research description:

This project focuses on developing new quantization methods that represent the weights and activations of large language models (LLMs) as lower-precision numbers, to achieve faster training and inference while minimizing the reconstruction error. In 2024, several new methods, including EasyQuant and SqueezeLLM, were proposed for quantizing LLMs to reduce training and inference time with acceptably low reconstruction errors. While the existing methods provide remarkable performance, it is expected that a quantization algorithm that relies on mathematical optimization can exceed the performance of existing methods. In this project, the SUDS scholar will be supervised by a faculty member from the MIE department to complete a series of weekly assignments. These tasks will encompass activities such as data analysis, computational experiments, and the implementation and testing of new algorithm enhancements in a git environment. This project leverages cutting-edge techniques in mathematical optimization to advance the quantization of LLMs by reducing reconstruction error. The results of this summer research initiative will contribute to the development of a new algorithm for weight and activation quantization of large language models, thereby enhancing a widely used AI technology through data science.
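For readers new to the topic, the sketch below shows what quantization and reconstruction error mean, using plain round-to-nearest symmetric quantization in Python with NumPy. This is a simple baseline for illustration only: it is not EasyQuant, SqueezeLLM, or the optimization-based algorithm the project aims to develop, and the function names and random weight matrix are hypothetical.

```python
import numpy as np


def quantize_symmetric(weights, num_bits=4):
    """Round-to-nearest symmetric quantization with one scale per tensor.

    Returns integer codes in [-qmax, qmax] and the scale needed to decode them.
    Supports num_bits <= 8 because codes are stored as int8.
    """
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to floating-point approximations of the weights."""
    return codes.astype(np.float32) * scale


# Hypothetical weight matrix standing in for one layer of a model.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)

errors = {}
for bits in (8, 4, 2):
    codes, scale = quantize_symmetric(w, num_bits=bits)
    # Reconstruction error: mean squared difference between original and decoded weights.
    errors[bits] = float(np.mean((w - dequantize(codes, scale)) ** 2))
    print(f"{bits}-bit reconstruction MSE: {errors[bits]:.6f}")
```

Fewer bits mean a coarser grid and a larger reconstruction error, which is the trade-off a more careful quantization algorithm tries to improve.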

 

Researcher: Samin Aref, Department of Mechanical and Industrial Engineering, Faculty of Applied Science and Engineering, University of Toronto

 

Skills required:

  • Python (demonstrable experience with PyTorch)
  • Asymptotic analysis of algorithms, and memory management
  • Operations research (heuristic optimization, integer programming)
  • Version control (Git)
  • CPU and GPU parallelism is an asset.
  • LLM inference optimization and heuristic algorithms
  • Skills for reading and comprehending technical articles
  • Undergrad program in Computer Science, Industrial Engineering, or equivalent

Primary research location:

  • University of Toronto St. George Campus and/or Remote

Research description:

Our research aims to address the physical health challenges faced by skilled trades workers, particularly electricians, who are prone to repetitive strain injuries. This project will employ a mixed-method approach, gathering quantitative and qualitative data through a survey of 100 participants and semi-structured interviews with 30 participants. Data analysis will play a critical role in uncovering key insights. Survey responses will be analyzed statistically to assess the prevalence of physical injuries, mental health issues, and the effectiveness of workplace safety practices, while interviews will be analyzed qualitatively to identify patterns and themes regarding workplace conditions, ergonomic stressors, and the use of personal protective equipment (PPE). We plan to onboard a DSI student to support the data analysis, integrating advanced statistical tools and qualitative software to handle the complex nature of our dataset. The student will assist in synthesizing the results, contributing to a comprehensive understanding of both the physical and psychological health of apprentices, contractors, and employers in the skilled trades. This interdisciplinary approach will allow us to develop practical toolkits for injury prevention and mental health support, ultimately improving worker well-being and workplace productivity.

 

Researcher: Behdin Nowrouzi-Kia, Department of Occupational Science and Occupational Therapy, Temerty Faculty of Medicine, University of Toronto

 

Skills required:

  • Quantitative and qualitative data analysis
  • Survey and interview design
  • Statistical software proficiency (e.g., SPSS, R)
  • Thematic analysis for qualitative data
  • Understanding of workplace health and safety in skilled trades
  • Collaboration with multidisciplinary teams
  • Experience with mixed-methods research

Primary research location:

  • University of Toronto St. George Campus and/or Remote

For more information

SUDS.dsi@utoronto.ca

News

DSI Celebrates SUDS Cohort of 2024 with Annual Showcase

Read the full story.

Students may also be interested in the Urban Data Science Corps Summer Internships offered by the School of Cities.
