SUDS Student Call
May-August 2025

Call for student researchers!

The Data Sciences Institute (DSI) welcomes carefully selected undergraduate students from across Canada for a rich data sciences research experience. Through the SUDS Research Program, undergraduate students, who are interested in exploring data science as a career path, have an exciting opportunity to engage in hands-on research supervised by DSI member researchers across the three UofT campuses.

The DSI is strongly committed to diversity within its community and especially welcomes applications from racialized persons/persons of colour, women, Indigenous/Aboriginal People of North America, persons with disabilities, LGBTQ2S+ persons, and others who may contribute to the further diversification of ideas.

Below are the SUDS research opportunities for May-August 2025. You can apply and rank your top three choices.

See here for information on eligibility, award value and duration, and SUDS programming.

Research Opportunities

3D Generative Modeling

Research description:

The rapid advancement of 3D generative models is completely transforming how we create and manipulate creative content. We’re on the verge of a technological revolution, and there’s no denying that we’ve made significant progress in this field. However, the complexity of 3D content generation demands a stronger emphasis on making it user-friendly and precise. Hence, we’re introducing a research project to push the boundaries of 3D generative modeling. Our ultimate objective is to empower artists, designers, and developers to fully exploit this technology with ease and precision. This can be regarded as a significant step in the direction of realizing a 3D equivalent to Photoshop. Working on the project, the student will gain valuable experience in training neural networks and implementing novel computer vision pipelines and would eventually write and submit their work at a top-tier AI/Robotics conference or workshop (CVPR, NeurIPS, ICRA, etc).

Researcher: Igor Gilitschenski, Department of Mathematical and Computational Sciences, University of Toronto Mississauga, University of Toronto

Skills required:

A good understanding of popular neural architectures (CNNs, Transformers, etc) and being fluent in programming with pytorch.
Awareness of the recent literature in 3DV, neural rendering, graphics, self and unsupervised learning is a big plus

Primary research location:

University of Toronto Mississauga campus in-person

A secondary analysis of the NISCI Trial – GRASSP/ISNCSCI/SCIM/CMAP

Research description:

This study is a post-hoc analysis of the GRASSP, ISNCSCI, SCIM and CMAP data from the NISCI (Nogo-A Inhibition in acute Spinal Cord Injury) Study. This trial was a multicenter, multinational, placebo controlled phase-II study for the safety and preliminary efficacy of intrathecal anti-Nogo-A [NG101] in patients with acute cervical spinal cord injury. The purpose of the NISCI trial was to test if an antibody therapy can improve motor function and quality of life of tetraplegic patients. The purpose of this specific post-hoc analysis is to explore the differences between groups when measured with the GRASSP. Furthermore, to look at the relationships between the GRASSP scores and ISNCSCI, SCIM and CMAP scores The study aims to answer the following questions: To determine if Nogo-A therapy improves upper limb impairment and function, 6-month time point after SCI I in comparison to control group. -Determine the relationships between completeness of SCI and recovery of the upper limb -Determine the relationships between function and recovery of the upper limb. -Determine the relationships between CMAP and recovery of the upper limb. -Understand the recovery profiles of the upper limb, with both the control and treatment data.

Researcher: Sukhvinder Kalsi-Ryan, Toronto Rehabilitation Institute (KITE), University Health Network

Skills required:

Ability to perform and conduct statistical analysis on clinical data, with MATLAB, or SAS, or SPSS or R.
Any program is accepted.
The skill for data analysis is primarily what we are looking for.

Primary research location:

Toronto Rehabilitation Institute in-person and remote

Advanced Climate Downscaling for Engineering and Climate Impacts Analysis

Research description:

The built environment is sensitive to global warming and climate change, which are leading to increased cooling loads, dangerous heat waves, damaging flooding in cities, permafrost degradation, and other impacts. To model, quantify, and predict these impacts for engineering analysis requires “downscaling”, which maps available climate information (e.g. weather station data, model output) to the requirements of engineering (site specific information, design requirements), while accounting for incompatible sampling, errors in observations, and uncertainty. Downscaling is a workflow of data processing and model calibration against observations that is increasingly informed by machine learning and modern data science. Over the last few years we have developed the UofT Climate Downscaling Workflow (UTCDW), a set of guides, software, and visuals to bring downscaling into engineering research and design, and piloted these tools during a Climate Impacts Hackathon in March, 2024. The SUDS scholars will improve the usability of and extend the data-science approaches in the UTCDW, with an emphasis on how to best use connections between climate fields like temperature and precipitation in our workflow, in particular to implement multvariate downscaling methods for climate extremes. The project will demonstrate how the UTCDW can effectively translate climate science knowledge and data into actionable information.

Researcher: Paul Kushner, Department of Physics,Faculty of Arts and Science,University of Toronto

Skills required:

We seek grounding in data science (programming in R/python, data QC and organization, modelling and machine learning, etc.) as well as classical statistics (multivariate statistics an asset).
An interest in either or both of climate/atmospheric science or civil/environmental engineering is desirable, but no prior experience in these areas is required.

Primary research location:

University of Toronto St. George Campus and/or Remote

AI-guided curation of electronic medical records using large language models

Research description:

Predicting patient outcomes like risk of readmission is crucial for improving healthcare quality and efficiency. However, the unstructured, unstandardized nature of electronic medical record (EMR) data makes it challenging to develop robust supervised learning models. Large language models (LLMs) offer a promising approach to automating the curation of EMR data into machine-readable formats needed for predictive modeling. However, concerns remain around the reliability, stability, and tendency of LLMs to ”hallucinate” - generating plausible-sounding but factually incorrect outputs. In this project, the student will leverage open-source LLMs and explore prompt engineering and fine-tuning strategies to curate EMRs, mitigating the above issues and maximizing the effectiveness of LLMs for EMR data curation and predictive model development. The student will assess the performance of the LLM-powered approach against traditional manual data curation methods in terms of accuracy, scalability, and cost-effectiveness. The insights gained could enable more widespread adoption of LLM techniques to unlock the predictive power of EMR data at Sinai Health, leading to improved patient outcomes and healthcare system efficiency.

Researcher:

Kieran Campbell, Lunenfeld-Tanenbaum Research Institute

Skills required:

Essential: Python, Linux/command line, data processing, strong collaborative/team skills, communication, and ability to work independently
Bonus: any experience with LLMs/prompt engineering and/or EMR curation

Primary research location:

Lunenfeld-Tanenbaum Research Institute in-person and remote

Analysis of indoor air quality data using physics-informed machine learning

Research description:

People spend nearly 90% of their time indoors, where they are exposed to various airborne contaminants. Indoor air quality (IAQ) has a substantial impact on human health and comfort. However, understanding and analyzing IAQ in diverse indoor environments remains challenging despite the well-established principles of mass transfer and fluid dynamics and various low-cost sensing technologies. This is due to the difficulty of collecting key information, such as contaminant generation rate, degree of air mixing, airflow patterns between spaces, etc. This project aims to develop a method for analyzing time series IAQ data using physics-informed machine learning (ML). The method will incorporate mass balance equations, represented by ordinary differential equations, as physical knowledge. A set of probabilistic ML models, regulated by domain knowledge, will address the imperfection of the mass balance equations and the impact of missing key information. Probabilistic programming will serve as the overarching framework to integrate all the model components. The student will work with Professor Jeffrey Siegel (CIVMIN, IAQ expert) and Professor Seungjae Lee (CIVMIN, ML expert in building science). Indoor air quality data collected from multiple homes and other indoor environments will be used to test the developed method.

Researcher: Seungjae Lee, Department of Civil and Mineral Engineering, Faculty of Applied Science and Engineering, University of Toronto

Skills required:

Proficiency in Python, with experience using essential data science libraries (e.g., scikit-learn, pandas, etc.).
Preferred qualifications:
- Background knowledge in indoor air quality and building science.
- Foundational understanding of probability theory.
- Experience with remote servers or clusters.
- Experience with PyTorch for data-driven modelling.
- Experience with Git for version control.

Primary research location:

University of Toronto St. George Campus and/or Remote

Analysis of Internet Spaces to Understand Societal Discourses Linking Hearing Loss and Dementia

Research description:

Public health and media messages increasingly emphasize the link between hearing loss and cognitive decline in older adults, promoting the idea that hearing loss may causally contribute to dementia. This discourse has profound psychosocial implications, raising concerns within hearing-oriented community organizations about how it may amplify stigma, anxiety, and self-doubt among older adults with hearing loss. The discourse also opens the door to older adults’ exploitation by the multi-billion-dollar hearing services industry, which can capitalize on narratives equating aging with decline and trends like the medicalization of aging by marketing hearing services as preventative solutions for cognitive decline. Websites are crucial for attracting clients and website-marketing recommendations often focus on client recruitment because the 70% of people with hearing loss who do not use hearing aids are considered a vast market. Websites from hearing-service providers can offer education about the link between hearing loss and cognition, but might also aim to capitalize on the narrative and common cognitive concerns in older people to increase service and product sales. The student will use sophisticated website scraping and analysis tools jointly with large language models (Python) to provide a systematic analysis of website education and marketing of audiological clinics.

Researcher: Björn Herrmann, Baycrest

Skills required:

Advanced computer programming skills (Python or MATLAB)
Effective oral and written communication skills
Inter-cultural competence
Ability to work independently and within a team

Beneficial:

Background in artificial intelligence
Experience with natural language processing
Knowledge in internet/website analysis
Interest in auditory research

Primary research location:

Baycrest in-person

Analysis of kinematic timeseries waveforms to identfy optimal return-to-sport in individuals following an ACL reconstruction

Research description:

There are more than 250,000 anterior cruciate ligament injuries in North America annually, and upwards of 30% of personsexperience re-injury following return-to-play (RTP). A limitation of current RTP assessments is that they do not assess howthe patient performed a test by analyzing the actual joint motions or the relationship/coordination between joints.Therefore, the purpose of this project is to quantify the motion patterns in the lower extremity and determine theeffectiveness in assessing RTP compared to traditional metrics.

We have collected jumping and running data on 25 healthy controls and 25 patients that have undergone surgical reconstruction for a torn ACL. All participants have full time-series lower extremity kinematic data sets (joint angles at the hip, knee, and ankle) and we have RTP data up to 1 year post reconstruction.

The data analysis for this project has two focuses: i) to apply statistical and data analytics techniques to classify the kinematic waveforms as healthy controls and those from ACL reconstructed patients and to determine if there are differences within

the ACL reconstructed group; and ii) to determine of the kinematic waveforms can be used to predict successful RTP better than the performance metrics.

Researcher: Timothy Burkhart, Faculty of Kinesiology and Physical Education, University of Toronto

Skills required:

Comfortable speaking with a multidisciplinary group (we consist of kinesiologists, orthopaedic surgeons, and engineers)

Primary research location:

University of Toronto St. George Campus and/or Remote

Bridging Administrative Decisions and Caseworker Narratives: A Computational Exploration of Child Welfare Practices

Research description:

The Children’s Aid Society of Toronto (CAST) is North America’s largest not-for-profit child welfare agency, with a legal mandate to protect children and youth from abuse and neglect. CAST provides essential services such as investigating protection needs, offering guidance and counselling to families, and facilitating permanency through adoption. CAST operates across the Greater Toronto Area and ensures that services are delivered through an equity lens, addressing the unique needs of children, youth, and families based on their race, culture, religion, gender, and sexual orientation. Through this project, CAST aims to address key operational challenges, such as understanding why some cases remain open for extended periods and why re-referrals occur after cases are closed. By analyzing the narrative data alongside administrative outcomes, the project will help CAST gain insights into decision-making processes at various stages of a child’s involvement with the system. The anticipated social and economic benefits of the project for CAST include more efficient case management and improved decision-making frameworks, reducing the backlog of long-term cases and enhancing service delivery. This will lead to better outcomes for children and families by ensuring that decisions made during child protection investigations are well-informed and supported by comprehensive data analysis.

Researcher: Shion Guha, Faculty of Information, University of Toronto

Skills required:

Experience working with structured and unstructured data for data analysis.
Familiarity with basic NLP to identify themes in text data.
Understanding of basic statistics like correlations and trends analysis.
Ability to apply simple models (e.g., decision trees) to predict outcomes.
Ability to create clear and informative visualizations to communicate findings.

Primary research location:

University of Toronto St. George Campus and/or Remote

Customized bibliometric tools to track open science activities at a neuroscience institute

Research description:

Scholar Metrics Scraper is a Python script recently developed at UBC that enables automated retrieval of citation and author data. In this project, we propose to customize this tool in a number of ways to support open science activities and reporting at a Canadian neuroscience institute: the Rotman Research Institute (RRI). This will involve modifying or writing new code to automatically (on a scheduled basis and on demand) retrieve and clean publication data of RRI scientists, develop ways to automatically determine the open access status of publications, as well as to retrieve data on study preregistrations and open datasets shared in online repositories (e.g., osf.io). Code will also be developed for plotting key variables (e.g., publication and citation counts; chord diagrams visualizing collaborations) for each scientist and the institute; generating reports; and automatically updating scientist webpages with lists of publications and datasets that include open access status and details. The student will have the opportunity to learn about open science best practices and tools, and to work with a number of scientists as well as Research IT. They will also have the chance to make important contribution to establishing and normalizing open science at the RRI.

Researcher: Donna Rose Addis, Baycrest

Skills required:

Advanced computer programming skills (e.g., Python, shell scripting, HTML, SQL)
Experience using Linux operating system
Effective oral and written communication skills
Ability to work independently and within a team

Beneficial to have:

Familiarity with open science
Data visualization skills

Primary research location:

Baycrest in-person and remote

Data-Driven Analysis of Help-Seeking Student Behaviour in Programming Education with LLM Integration

Research description:

The landscape of student help-seeking behaviour is undergoing a significant transformation with the rise of generative AI tools like Large Language Models (LLMs). Building on prior research that explores help-seeking tendencies among university students, this project aims to investigate and analyse large-scale student data on the effects of integrating LLM-powered assistants in programming courses, focusing on their influence on student behaviour, engagement, and learning outcomes. Ideally generating an approach for improved (predictive and prescriptive) decision making. The research will involve a comprehensive analysis of how the introduction of LLM-based conversational agents (e.g., ChatGPT) and other LLM-based educational tools, such as CodeAid and QuickTA, both developed at the University of Toronto, influence student approaches to seeking help. This will involve data mapping and analysis, but also the need to identify patterns in large conversational data. Traditional help-seeking behaviours have shown a reliance on informal support (e.g., peers) rather than formal educational resources (e.g., instructors), often due to perceived barriers like stigma or accessibility. We hypothesise that the availability of LLM tools may shift these dynamics, increasing students’ reliance on automated, real-time assistance and providing data-rich insights into evolving help-seeking patterns that could enhance predictive and prescriptive modelling for educational support strategies.

Researcher: Michael Liut, Department of Mathematical and Computational Sciences, University of Toronto Mississauga, University of Toronto

Skills required:

Excellent interpersonal skills
HCI: familiarity with designing user studies, conducting thematic analysis, and ability to analyse qualitative feedback from participants.
Strong programming skills: Python, R, visualisation libraries (e.g., matplotlib, pandas, plotly), bash, full-stack frameworks (e.g., Django, React), LLM-architectures.
Experience handling interaction log data and EdTech development is a bonus.

Primary research location:

University of Toronto St. George Campus and/or Remote

Deep learning methods for decoding peripheral nervous system activity

Research description:

Recording from the peripheral nervous system can be used to decode control signals exchanged throughout the body, with applications in creating assistive technologies and treating chronic diseases. Our laboratory has collected unique datasets from multi-channel nerve cuff electrodes, which record data from the surface of nerves. We have developed neural networks to decode these recordings by classifying the source of each detected neural event. Using existing data, this project will involve refining neural network architectures and training strategies to optimize performance. Creating neural networks that can generalize well over time and across subjects with minimal re-calibration is of particular interest. The student will have the opportunity to gain a better understanding of real-world data science challenges in neurotechnology, and of strategies to manage these obstacles when developing deep learning systems.

Researcher: José Zariffa, Toronto Rehabilitation Institute (KITE), University Health Network

Skills required:

Experience designing and evaluating deep neural networks.
Processing of physiological signals.

Primary research location:

University Health Network in-person

Defining cell types from single-cell data with hierarchical clustering and graph-based significance testing

Research description:

This project aims to develop a general method for defining clusters of cell types from single-cell RNA sequencing data. This problem is widely considered one of the most important and fundamental problems in single-cell data analysis, but suffers from a paucity of methods to define whether two cell-type clusters are actually distinct from each other. We will use hierarchical clustering via the ultra-fast HGC method to define an initial hierarchy of cell-type clusters. We will then recurse through this hierarchy and apply a significance test at each split, to determine whether the two clusters at the split are significantly different from each other. If they are not, recursion will stop. The most creative aspect of the project will be defining the significance test. HGC is based on the shared nearest-neighbor (SNN) graph, so it seems natural to use that for significance testing as well. However, naively testing whether the number of between-cluster connections is less than expected will not be sufficient, since this criterion was already used to define the clusters themselves - an example of a ”double-dipping” problem. Possibly solutions may involve some combination of permutation testing, the recently-developed "count splitting" method, and graph theoretic properties.

Researcher: Michael Weinberg, Lunenfeld-Tanenbaum Research Institute

Skills required:

Experience in Python is a must-have, for instance through introductory computer science courses.
Familiarity with genetics, statistics, data science packages like NumPy and polars, and graph theory are major assets.

Primary research location:

Lunenfeld-Tanenbaum Research Institute in-person and remote

Designing Better Psychological Scales Using AI: A Case Study

Research description:

We are offering a unique research opportunity for students passionate about the intersection of statistics, psychometrics, psychology, and artificial intelligence. This project aims to revolutionize psychological assessment by leveraging AI to design more reliable and valid psychological scales. By employing machine learning algorithms and natural language processing, we will analyze existing scales to identify limitations and develop enhanced tools that more accurately measure psychological constructs. As a participant, you will engage in a case study exploring how AI can be of help refining scale items to be culturally sensitive and reducing unwanted bias. You’ll collaborate with a multidisciplinary team of psychometricians, data scientists, statisticians and AI experts, gaining hands-on experience in both qualitative and quantitative research methods. This immersive experience will not only deepen your understanding of psychometrics but also equip you with cutting-edge skills in AI applications within psychology. This project offers the chance to contribute to pioneering research with the potential to make a significant impact on psychological assessment practices. You’ll develop valuable skills in data analysis and AI, preparing you for advanced studies or careers in psychology, statistics, education, data science, or related fields.

Researcher: Feng Ji, Department of Applied Psychology and Human Development, Ontario Institute for Studies in Education, University of Toronto

Skills required:

Familiarity with machine learning concepts and programming languages such as Python or R is highly desirable.
Familiarity with APIs (such as OpenAI API) is preferred (but not required).
Essential skills include excellent analytical abilities, attention to detail, and the capacity to work effectively in a collaborative team environment.
Coursework in psychology, statistics, and data science (generally defined) is preferred (but not required).

Primary research location:

University of Toronto St. George Campus and/or Remote

Developing a Secure Multitask Learning Framework for Mental Health Diagnostics

Research description:

This project leverages the transformative potential of machine learning to advance mental health diagnostics through the development of a secure and reliable digital psychiatry platform. By employing multitask learning (MTL) methodologies, the project aims to uncover intricate patterns across physiological, psychological, behavioral, and contextual data derived from wearables and digital diaries. These insights will enhance the detection and prediction of overlapping symptoms and risk factors in comorbid mental health disorders.

The project involves the integration of multimodal datasets, including: Wearable Device Data: Physiological signals such as heart rate variability and sleep patterns; Digital Diaries: Self-reported psychological states and circumstantial factors; and Contextual and Social Activity Data: Behavioral and interactional cues for enhanced contextual understanding.

Researcher: Deepa Kundar, Edward S. Rogers Sr. Department of Electrical & Computer Engineering, Faculty of Applied Science and Engineering, University of Toronto

Skills required:

Machine Learning: Basic understanding of supervised learning and model evaluation.
Programming: Proficiency in Python and familiarity with ML libraries (e.g. TensorFlow, PyTorch, or scikit-learn).
Data Handling: Experience with data preprocessing and feature extraction.
Cybersecurity Awareness: General understanding of adversarial attacks and model robustness.
Problem-Solving: Strong analytical skills and creativity in tackling challenges.

Primary research location:

University of Toronto St. George Campus and/or Remote

Detection of immune aggregates from histopathology imaging using foundation models

Research description:

Tertiary lymphoid structures (TLS) have recently been shown to be predictive of survival in pancreatic adenocarcinoma (PDAC). This project aims to quantify and subtype TLS in three PDAC cohorts spanning over 600 patients. These findings will then be associated with clinical metadata, genomic mutations and transcriptional subtypes. The successful candidate will benchmark existing TLS identification methods and compare these to recently developed foundation models. Upon identification, we will attempt to stratify TLS into distinct subtypes based on the embeddings produced by foundation models. We will then attempt to identify whether these subtypes are driven by TLS specific aspects such as lymphocyte morphology or the surrounding environment such as the composition of the stroma or distance to the closest tumor. Finally, we will benchmark the extent to which these subtypes recapitulate transcriptional TLS subtypes we have already identified using spatial sequencing technologies. Upon creation of a robust TLS subtyping method, we will run it over slides from over 600 deeply phenotyped patients and associate the presence and TLS subtype with patient survival, genomic mutations and copy number aberrations as well as known transcriptional subtypes. Overall, this will be the most in-depth characterization of TLS’ in PDAC to date.

Researcher: Kieran Campbell, Lunenfeld-Tanenbaum Research Institute

Skills required:

Proficient in R and python Experience with machine learning libraries including sklearn/pytorch, workflow managers (e.g. snakemake) and the unix command line
Experience with medical imaging data (e.g. histopathology, X-rays or CT scans)
Familiarity with analyzing genotyping data (e.g. point mutations or tandem repeats) and transcriptomic data

Primary research location:

Lunenfeld-Tanenbaum Research Institute in-person and remote

Distribution of CA125 levels of epithelial ovarian cancer patients diagnosed in Ontario, Canada in 2016

Research description:

The majority of ovarian cancers are diagnosed at an advanced stage, and consequently, the case-fatality rate is high. To some extent, this is because there is no effective screening program and because of delay in diagnosis. It is of interest to explore innovative means of accelerating the date of diagnosis. One possibility is CA125 testing at the first point of care for symptomatic women that seek consultation with front-line physicians. The goal of this project is to leverage a robust database of ~600 ovarian cancers diagnosed in Ontario and to conduct a detailed evaluation of the distribution of CA125 levels at the time of diagnosis by various patient and clinical factors (i.e., stage, histology) and to explore whether by increasing the threshold for CA125 levels may accelerate the diagnostic process lead to earlier identification of affected individuals. Finally, analysis of predictors of survival are also of interest and available to analyze in this dataset.

Researcher: Joanne Kotsopoulos, Women's College Hospital

Skills required:

Dependable
Hardworking
Detail-oriented
Team player
Independent
Strong communication skills
Analytic skills
Strong organization skills
Prior experience in SAS or R is an asset but not required.

Primary research location:

Women's College Hospital in-person and remote

Estimating selection on gene expression

Research description:

Variation in gene expression underpins variation in organismal traits and diversity. Therefore, understanding how gene expression evolves will allow us to better understand the mechanisms of evolutionary change. The strength and form of selection on gene expression and its role in evolution is difficult to estimate, however, because of the high dimensional and highly correlated nature of gene expression data. In this project the SUDS scholar will estimate selection on gene expression traits and compare the results from different methods that are commonly used to study selection on gene expression.

Researcher: Jacqueline Sztepanacz, Department of Ecology and Evolutionary Biology, Faculty of Arts and Science, University of Toronto

Skills required:

Proficiency in R
Knowledge of basic statistical/machine learning models.
High attention to detail
Excellent oral and written communication skills.
Background in genetics would be an asset

Primary research location:

University of Toronto St. George Campus and/or Remote

Evaluating Bias and Sycophancy in Clinical Large Language Model Summaries

Research description:

Large language models (LLMs) have opened up new frontiers for reducing administrative burdens in health systems. Healthcare institutions around the world have already begun piloting the use of automated scribes and other tools aimed at summarizing patients’ clinical records. Aside from these institutional endeavors, there is also evidence that independent care providers are increasingly utilizing large-language models to support care delivery, despite the lack of guidelines and oversight mechanisms. In light of these recent trends, there is a critical need to better understand the prevalence, types, and impacts of bias that risk being perpetuated by LLMs. Social biases such as racial and gender stereotypes, as well as systematic discrepancies in clinical LLM summaries, pose a risk of exacerbating health disparities. Relatedly, biases may also stem from sycophancy, a phenomenon where LLMs generate outputs that reflect the user’s anticipated preferences or assumptions. The goal of this project is to evaluate the risk of social bias and the effects of sycophancy on several publicly-available LLMs, and summarise findings in a whitepaper or research report. To support these evaluations, we will use anonymized clinical notes from the MIMIC-IV dataset, which have already been annotated for patients’ language, race, and ethnicity.

Researcher: Zahra Shaker, Institute of Health Policy, Management, and Evaluation, Dalla Lana School of Public Health, University of Toronto

Skills required:

We welcome students with an interest in machine learning and/or natural language processing, as evidenced by previous coursework, research projects, or self-study.
An intermediary knowledge of Python and familiarity with APIs are an asset.
Previous research or volunteer experience in a healthcare setting is preferred, but not required.

Primary research location:

University of Toronto St. George Campus and/or Remote

Examining the Impact of Built Environment Factors on Breast Cancer Risk among BRCA Mutation Carriers in Canada

Research description:

This project seeks to characterize the relationship between components of the built environment and breast cancer risk among BRCA mutation carriers. The built environment touches all aspects of our lives, encompassing the buildings we live in, distribution systems that provide us with water and electricity, and the roads, bridges, and transportation systems we use to get from place to place. It can be described as the manufactured or modified structures that provide people with living, working, and recreational spaces. As a result, these environments can have a lasting impact on human health. Previous literature has established relationships between built environment factors, including proximity to roadways, neighbourhood greenspace, and indoor environment, and breast cancer risk. However, to our knowledge, no studies have specifically examined this risk among BRCA mutation carriers. This study aims to leverage our existing database of BRCA mutation carriers from across Canada, alongside detailed environmental data available through the Canadian Urban Environmental Health Research Consortium (CANUE), to assess and quantify these risks. Findings from this study will provide novel insights into how various built environment factors may influence breast cancer risk in high-risk populations, allowing us to better understand potential risk reduction interventions and urban planning efforts.

Researcher: Joanne Kotsopoulous, Women's College Hospital

Skills required:

Data Management and Entry
Research and Literature Review
- Statistical Analysis
- Foundational knowledge of statistics and familiarity with software like SAS or R, but not required
Attention to Detail
- Strong Organization Skills
- Clear Communication Skills
Collaboration
- Independent Work
- Critical Thinking

Primary research location:

Women's College Hospital in-person and remote

Fantastic beasts and when to find them: leveraging citizen science to understand the activity patterns of animals

Research description:

Animal species exhibit circadian activity patterns in response to the rotation and light cycle on Earth. However, we do not understand the evolutionary causes or consequences of this variation; for example, why are moths nocturnal, while butterflies are diurnal? Research in our lab has suggested that nocturnality may confer an evolutionary advantage during mass extinction events (Shafer, et. al., 2023), and transitions between activity patterns might drive speciation (Nichols & Shafer, et. al., 2024). However, we only have information on the activity patterns of ~12% of vertebrate species, and no systematic information is available on the activity patterns of invertebrates, which represent >97% of all animal species. Given the scale of missing information, we aim to leverage citizen science to fill in the gap. iNaturalist is a popular application that allows users to post observations of organisms along with metadata for their location/timing, spawning a new generation of digital naturalists, and generating huge databases of scientific-grade observations of Earth’s biodiversity. We propose to mine >200 million observations of ~500,000 species by >8 million users from around the world. The SUDS scholar will dereminte the activity pattern for millions of species by identifying patterns in this data using data science techniques.

Researcher: Maxwell Shafer, Department of Cell and Systems Biology, Faculty of Arts and Science, University of Toronto

Skills required:

Experience with bioinformatics, data mining, statistics, or programming languages (R, Python) are beneficial.
Coursework in evolution or evolutionary modelling is preferred (but not required).

Primary research location:

University of Toronto St. George Campus and/or Remote

Finding Substructures within the Milky Way with Geometric Deep Learning

Research description:

This project aims to develop innovative geometric deep learning methods to identify and characterize stellar streams in the Milky Way. Stellar streams are elongated groups of stars that once belonged to smaller galaxies or star clusters that were disrupted by our galaxy’s gravitational forces. These celestial structures serve as crucial forensic evidence of our galaxy’s formation history and provide unique probes of dark matter’s distribution and properties. We will apply graph neural networks and other geometric deep learning techniques to analyze stellar data from the Gaia satellite, which has mapped the positions and velocities of tens of millions of stars with unprecedented precision. These methods are particularly well-suited for this astronomical challenge as they can naturally capture the spatial and kinematic relationships between stars while handling irregular data structures. The project will also incorporate complementary data from the Dark Energy Spectroscopic Instrument (DESI) survey to enhance our understanding of stellar properties. By developing this novel approach to stellar stream detection, we aim to uncover previously unknown structures and gain deeper insights into the Milky Way’s evolutionary history and dark matter distribution.

Researcher: Ting Li, David A. Dunlap Department of Astronomy and Astrophysics, Faculty of Arts and Science, University of Toronto

Skills required:

Strong Python programming skills.
Familiarity with data analysis and visualization
Basic familiarity with machine learning techniques.
Optional: prior exposure to deep learning or graph neural networks

Primary research location:

University of Toronto St. George Campus in-person

Grounded Debate for AI Safety

Research description:

AI debate has been proposed as an adversarial scalable oversight method, with encouraging recent progress (see refs below). Debate elicits a wide range of capabilities, however, in particular a mix of knowledge and persuasion. In this pilot project, a new debate protocol focused on disentangling persuasive tendencies from knowledge elicitation will be implemented, validated and explored. Additionally supported by OpenAI funds, this research theme broadly aims to develop scalable oversight methods for super-alignment, using physics as a ground truth. The objective of super-alignment is to ensure that AI systems remain aligned with human values and intentions, even in the limit where they become more capable than humans. Reference document1

Researcher: Kristen Menou, Department of Physical and Environmental Sciences, University of Toronto Scarborough,

Skills required:

LLM inference
Alignment & Scalable Oversight
Extras: Top-down Representations, Reinforcement Learning

Primary research location:

University of Toronto Scarborough Campus and/or Remote

Harnessing Large Language Models for Smart Grid Security and Optimization

Research description:

This project aims to explore the potential of large language models (LLMs) to address critical challenges in smart grids, such as cyberattack detection and energy forecasting. LLMs excel not only in accuracy but also in generating explainable insights, making them valuable for complex decision-making in energy systems. The project will focus on developing LLM-based frameworks tailored to smart grid applications, emphasizing explainability to enhance trust and transparency in model predictions. Key tasks include designing models for detecting cyber threats and forecasting energy demand, as well as evaluating their ability to provide clear, actionable explanations for their outputs.

Interns will gain hands-on experience in deploying and fine-tuning LLMs, applying cuttingedge AI solutions to real-world energy challenges, and enhancing cybersecurity and operational efficiency in smart grids.

Researcher: Deepa Kundar, Edward S. Rogers Sr. Department of Electrical & Computer Engineering, Faculty of Applied Science and Engineering, University of Toronto

Skills required:

Machine Learning: Basic understanding of large language models and fine-tuning techniques.
Programming: Proficiency in Python and experience with libraries like Hugging Face Transformers or OpenAI APIs.
Smart Grid Fundamentals: General knowledge of smart grid operations and challenges (cybersecurity, energy forecasting).
Cybersecurity Awareness: Familiarity with cyberattack detection concepts.
Analytical Thinking: Ability to interpret model outputs and focus on explainability

Primary research location:

University of Toronto St. George Campus and/or Remote

Horizontal Gene Transfers across Kingdom of organisms

Research description:

Horizontal Gene Transfer (HGT) is a process in which organisms acquire foreign genes from different species. HGT contributes to organismal evolution and has been an important source of genetic diversity. HGT was commonly identified in prokaryotes but rarely reported in eukaryotes. However, our understanding of HGT in eukaryotes is quickly expanding with the production of genomic resources and the development of Detection tools. The Kingdom Fungi represent a striking example, especially the ones known as obligate symbionts which interact with various host organisms intimately. Our research group has been dedicated to detecting fungus-related HGT elements and has discovered several such cases including the mosquito gut-dwelling fungi (doi:10.1093/molbev/msw126), herbivorous mammal rumen fungi (doi:10.1128/mSystems.00247-19), amphibian gastrointestinal fungi (doi:10.1534/g3.120.401516), and photobionts associated fungi (doi:10.1016/j.cub.2021.01.058). This project aims to identify novel HGT using lab newly assembled fungal genomes representing underexplored lineages on the Tree of Life. The student working on this project will help refine lab existing pipelines and analyze the fungal genomes as well as related host data to reconstruct the evolutionary history of identified genes by conducting comparative genomics. A high-impact research report will be accomplished and aimed for publication at the end of the project.

Researcher: Yan Wang, Department of Biological Sciences, University of Toronto Scarborough, University of Toronto

Skills required:

Basic programming skills in Linux, Python, and/or R; effective communication skills
Preferred qualification: strong interests in comparative genomics, host-microbe interactions, and competencies in writing and public speaking.

Primary research location:

University of Toronto Scarborough Campus and/or Remote

How Well Do Households Understand Stock Market Returns?

Research description:

Why don’t more households invest in the stock market? Is it too difficult to open a brokerage account? While this may have been true in the past, advancements in FinTech have made the process simple and accessible, often requiring just a few taps on a smartphone. Instead, could the real issue be that households are simply misinformed about the risks and returns of stock market investing? Using large-scale survey data, this project aims to explore whether limited stock market participation can be attributed to misperceptions about expected returns. We will study patterns of misperceptions across household types along observable characteristics like income, age, and occupation. We also seek to study which interventions can alleviate misinformation and help increase stock market participation. This project will entail collecting, analyzing, and visualizing data. Strong and pragmatic programming experience are required to download and assemble large data sets. An understanding of financial conepts is required for analysis, and visualization entails displaying in a concise yet appealing way. This project is ideal for an undergraduate student with some research experience and who is considering graduate school in economics or finance.

Researcher: Michael Boutros, Department of Economics, University of Toronto Mississauga, University of Toronto

Skills required:

Background or interest in finance/economics.
Knowledgable in at least one of Stata, R, Python, or similar.
Strong written communicator.

Primary research location:

University of Toronto St. George Campus and/or Remote

Inferring interior ocean properties from satellite images of its surface

Research description:

Measuring and predicting ocean currents is crucial to understanding our climate system, marine ecosystems, and societal maritime activities. Satellites are key tools to do so, but cannot provide more than surface information. In this project, we seek to infer sub-surface properties by leveraging three-dimensional realistic numerical forecasts and machine learning techniques. Of prime interest is the mixed layer, which is the uppermost layer of the ocean. It is the buffer between the atmosphere and the deep ocean, and hosts rich ecosystems. To reconstruct its depth is key to predicting the state of the upper ocean, and to do so from satellite data would provide . You will use output from a Fisheries and Oceans Canada operational numerical model as your dataset. The data is three-dimensional and therefore contains the answer to the question of how deep it is. I solves equations that are constrained by observations and finely tuned to reproduce realistic conditions. Using this data set, you will train a deep-learning algorithm (most likely a U-Net, but we are open to exploring different avenues) to predict this depth when only surface information (e.g. sea surface temperature, height, or salinity) is provided.

Researcher: Nicolas Grisouard, Department of Physics, Faculty of Arts and Science, University of Toronto

Skills required:

Programming experience in, or willingness to learn, Python and deep-learning tools such as TensorFlow or PyTorch.
We do not require notions of fluid dynamics or oceanography.

Primary research location:

University of Toronto St. George Campus in-person

Integrating predictive analytics into an equity dashboard

Research description:

The development an equity dashboard in hospitals has been proposed as a solution to facilitate the identification of variations in outcomes, encourage accountability, and support ongoing monitoring. Our research sought to develop an equity dashboard using data collected from the maternal care wards of a hospital in the US in 2019 and 2020. The data obtained were cleaned, and patient delivery data were linked to their demographic data using Microsoft Excel and Python. The data were then disaggregated by race/ethnicity and statistical analysis was performed to assess differences in the outcomes using R. Tableau Desktop was used to develop 18 visualizations of the measures. We are currently conducting usability testing. We could not complete the planned predictive modeling; however, we are working with our collaborators to obtain five years of data to incorporate predictive analytics in the next iteration. Once we validate its efficacy through user testing, we will disseminate our dashboard for implementation. 1) Develop predictive models of adverse events and outcomes based on patient characteristics and social vulnerability. Analyze feature importance for these predictions. 2) Develop an Excel Macro and content pack in Power BI that can generate comparable visualizations 3) Make dashboard publicly accessible through Tableau Public

Researcher: Myrtede Alfred, Department of Mechanical and Industrial Engineering, Faculty of Applied Science and Engineering, University of Toronto

Skills required:

Knowledge of statistical analysis techniques
Knowledge of ML techniques (regression, random forests, SVM, GBTs)
Ability to conduct statistical analysis in R or Python
Experience using Python libraries for ML and explainable artificial intelligence tools
Experience developing macros in Microsoft Excel
Experience developing data visualizations (Python, R, Tableau, and Power BI)

Primary research location:

University of Toronto St. George Campus in-person

Leveraging LLMs for Automated Peer Review of Randomized Controlled Trials (RCTs)

Research description:

Supervisor Fralick has developed a framework of six domains of study design that can affect the internal validity of randomized controlled trials (RCTs), encapsulated by the acronym PHOBIA: Placebo controlled? How was it funded? Outcome clinically valid? Blinded? Intention-to-Treat? A lot of centres and patients included? When evaluating an RCT, these 6 elements are crucial considerations. The current paradigm leaves reviewers to parse these details from the manuscript, which is inefficient, time-consuming, risks bias, and lacks quality control. All RCTs require registration on a publicly available clinical trial registry, meaning key aspects of their design are readily available. This project will apply supervised machine learning (ML) and two large language models (LLMs) for automating part of the peer review process. The data from the RCT will be parsed. Then, LLM 1 (Summarizer) will extract key information related to the PHOBIA framework. LLM 2 (Validator) will validate the summary by checking it against the original study content. Performance of the dual-LLM system will be evaluated according to the following metrics: hallucination detection, consistency, speed, and helpfulness. A detailed comparison of the system’s reviews with traditional human reviews will assess whether the LLMs can reliably augment the peer review process for RCTs.

Researcher: Michael Fralick, Lunenfeld-Tanenbaum Research Institute

Skills required:

Self-motivated
Strong critical thinking skills
Strong writing and communication skills
Not required but an asset:
- familiar with natural language processing and/or machine learning

Primary research location:

Lunenfeld-Tanenbaum Research Institute in-person and remote

Locating genetic ancestors with ancient DNA

Research description:

We have each inherited our genomes from a vast set of ancestors who were scattered across geographic space. The locations of these ancestors influence the patterns of genetic diversity we see today. Given the genetic relationships among a set of individuals we can therefore hope to reconstruct the spatial history of our shared ancestors. Our lab has recently developed a method to locate genetic ancestors by modeling movement down the many trees that relate recombining genomes (Osmond & Coop 2024) and we are applying this to a variety of species. One limit of our current approach is that the uncertainty in the location of ancestors increases as we move back in time, away from the known locations of the samples. This limit can now be relaxed with the increasing availability of ancient genomes that will effectively anchor the trees in space further back in time. The goal of this project is to extend our method to include ancient genomes and apply the method to publically available human genetic data. There are two key questions: 1) How well does our existing method locate the ancient genomes? and 2) How much do the ancient genomes change the inferred locations of other genetic ancestors?

Researcher: Matthew Osmond, Department of Ecology and Evolutionary Biology, Faculty of Arts and Science, University of Toronto

Skills required:

We will extend the method in Python, use a computer cluster to implement it on human data, and share our new method with others on GitHub.
Some coding experience, especially in Python and bash/Unix.
Advanced math and stats would also be useful.
Familiarity with evolution, genetics, and probability/statistics are major assets.

Primary research location:

University of Toronto St. George Campus and/or Remote

Machine learning to infer causal genes from genome-wide association studies

Research description:

This project aims to develop an integrative machine learning approach to identify causal genes underlying genome-wide association study (GWAS) risk loci. We will build on the state of the art in three major ways: 1) By dramatically increasing the diversity of input biological networks. We will incorporate curated pathway databases, co-essentiality networks, protein-protein interaction networks, genetic interaction maps, and co-expression networks. 2) By improving inference and featurization of these networks. We will use BIONIC, a deep learning approach developed by collaborators in Toronto that performs network fusion via graph convolutional neural networks, to combine the information gleaned from our biological networks into a single low-dimensional feature vector per gene. 3) By improving the machine learning modelling itself. We will predict gene-level GWAS p-values from our network-based feature vectors via leave-one-chromosome-out cross-validation. We will use gradient boosting, a popular machine learning approach that flexibly capturse non-linear relationships between features while avoiding overfitting. Naively applying gradient boosting is incorrect because it ignores that gene-level GWAS p-values may not be i.i.d. due to linkage disequilibrium. We will preprocess with Cholesky whitening to decorrelate the gene-level p-values and features. Thus, we will develop better methods for inferring both biological networks and GWAS causal genes.

Researcher: Michael Wainberg, Lunenfeld-Tanenbaum Research Institute

Skills required:

Experience in Python is a must-have, for instance through introductory computer science courses.
Familiarity with genetics, statistics, and data science packages like NumPy and polars are major assets.

Primary research location:

Lunenfeld-Tanenbaum Research Institute in-person and remote

Machine Learning to Uncover the Oldest Light in the Universe

Research description:

The Simons Observatory (SO) is a new, multi-telescope experiment to study the origin and evolution of the cosmos by measuring the cosmic microwave background (CMB), the oldest light in the Universe. Raw data consist of TBs of timestreams of measured sky brightness recorded each day—adding up to several PB over several years—that need to be reconstructed into 2D maps. However, before this can happen, the timestreams need to be automatically processed to remove noise contaminants and foreground galaxies/stars that block the main signal. The successful candidate will join the research groups of Profs. Adam Hincks and Renée Hložek that are actively researching machine learning methods to identify and classify these objects, using existing data from the Atacama Cosmology Telescope (ACT), a precursor to SO. Possible projects include characterising and improving deep learning techniques (including combining multi-modal data and using attention mapping) for detecting and classifying events in the telescope’s raw timestreams and contributing to the data processing pipeline of SO. An exciting aspect of this project is that our classification will help enable the search for astrophysical transients, such as flaring stars and gamma ray bursts.

Researcher: Adam Hincks, David A. Dunlap Department of Astronomy and Astrophysics, Faculty of Arts and Science, University of Toronto

Skills required:

Python coding
Highly desirable:
- understanding of machine learning concepts (e.g., active learning)
- experience with scikit-learn/sklearn
- familiarity with collaborative coding workflows with Github
Helpful assets:
- web development (e.g., CSS, JS, Vue, React)
- database development (e.g., SQL)
- an interest in cosmology and astrophysics

Primary research location:

University of Toronto St. George Campus and/or Remote

Mapping Celestial Rivers: A Novel Exploration Using Intensive Astronomical Datasets

Research description:

Our Milky Way galaxy is surrounded by numerous small galaxies and star clusters, each influenced by the powerful gravitational forces of our galaxy. These forces create stellar streams—celestial ”rivers” of stars that gracefully orbit around the Milky Way. These streams are not just beautiful; they hold the keys to unraveling the mysteries of galaxy formation and the hidden nature of dark matter. (Curious? Check out this fascinating feature in The Globe & Mail: Star Streams Reveal Milky Way’s Ravenous History. Thanks to revolutionary cosmic surveys, we now have detailed data on millions of stars, including their positions and velocities in full 6D! As a SUDS Scholar, you’ll be at the cutting edge of this exciting field, developing a Bayesian framework to determine the probability that a star belongs to a particular stream and to characterize the properties of these stellar streams. You will work with massive astronomical datasets, totaling several gigabytes, from one of the most extensive spectroscopic surveys—the Dark Energy Spectroscopic Instrument (DESI). This project will give you the opportunity to develop and apply innovative statistical and computational techniques that are not only crucial for revealing the secrets of stellar streams but also for shaping the future of astronomical surveys.

Researcher: Ting Li, David A. Dunlap Department of Astronomy and Astrophysics, Faculty of Arts and Science, University of Toronto

Skills required:

Python Programming: Strong interest in developing and troubleshooting code using Python.
Bayesian Statistics: Enthusiasm for Bayesian statistics, including sampling and model comparison.
Communication: Proficiency in literature reading, scientific writing, and presenting scientific findings.
Teamwork: Ability to work well in teams and contribute to collaborative research.

Primary research location:

University of Toronto St. George Campus and/or Remote

Molecular encoding of sensory and internal signals in individual neurons

Research description:

We are constantly exposed to various sensory stimuli such as sight, sound, and smell. Although sensory systems detect and interpret these stimuli, our perception is influenced by internal states such as hunger, stress, and inflammation. However, it is still unclear how the signals that signify these states, such as hormones, peptide, and cytokines, are encoded in gene expression of individual neurons and modulate the patterns of neural responses to stimuli. To address this question, our lab uses the mouse olfactory system as a model. Olfaction plays fundamental roles in many aspects of our life including learning and memory and detection of food and danger. In addition, it has been shown that odor processing is influenced by internal states even at the first step where sensory neurons in the nose detect odors. However, mechanisms through which individual neurons encode the internal states and modify responses to stimuli are still unknown. This SUDS project will primarily aim to quantitatively characterize the state-dependent changes in gene expression in the olfactory system by analyzing single-cell and bulk genomics datasets (RNA, epigenome, and protein) obtained from mice that are imposed changes in internal states such as hunger and inflammation.

Researcher: Tatsuya Tsukahara, Lunenfeld-Tanenbaum Research Institute

Skills required:

Proficiency in python (or R) for analysing datasets including single-cell and bulk transcriptomics, epigenome profiling data (chromatin accessibility and DNA/histone modifications), and proteomics data.

Primary research location:

Lunenfeld-Tanenbaum Research Institute in-person

Multiomics data analysis and visualization to investigate cross-talk between cells in integrated organ-on-a-chip models

Research description:

To develop safe nanoparticles for use during pregnancy, we first need to understand the cross-talk (communication) between cells of the placenta (barrier between the mother and the baby) and other cells from the mother at different pathological conditions, e.g. cancer. We developed an organ-on-a-chip model to mimic this environment in the lab and investigate the cross-talk between cells. We used this model to generate protemic and transcriptomic data. A data science student will work with a graduate student and help analyze this big data and enable different visualization approaches of the data. This a great opportunity for the student to work in an interdisciplinary team that works at the intersection between nanotechnology and microfluidics, and learn new wet-lab techniques, and apply their knowledge in data science to solve real-case problems.

Researcher: Hagar Labouta, Unity Health Toronto

Skills required:

A motivated data science student with expertise in R, Python and/or other data packages.
Prior experience on omics projects is advantageous.
No prior knowledge in nanomedicine or organ-on-a-chip technology is required; this will be a learning opportunity for the student as well.

Primary research location:

Unity Health Toronto in-person

Piccard: An Open-Source Tool to Analyze Longitudinal Data without Geographic Harmonization

Research description:

This project is an expansion of `piccard`, a Python library to perform longitudinal analyses on data tabulated on unharmonized spatial units. The final library will have three modules: (1) temporal path creation, (2) visualization, and (3) classification. The first module is available on [PyPI]. This module introduces one of `piccard`’s graph-based solution to a frequent problem in spatial data science: identifying temporal trends across noncongruent spatial units of aggregation—e.g., census tracts, dissemination areas, and postal codes from different years. We conceptualize spatial units as nodes, and the edges connecting them as their overlapping geographical areas. Our method creates paths that preserve the original spatial units and their attributes. Thus, `piccard` overcomes some of the limitations of traditional harmonization methods involving labour-intensive apportioning—e.g., defining ad-hoc target units. The selected student will work with the PI and Profesor Daniel Silver (UTSC Sociology) in developing the second and third modules of the library. The visualization module will allow users to subset and inspect network paths. Meanwhile, the classification module will facilitate the classification of paths according to the distribution of the shared attributes across the original geographic units. For example, a user could classify census tracts according to patterns of variation over time.

Researcher: Fernando Calderón Figueroa, Department of Human Geography, University of Toronto Scarborough, University of Toronto

Skills required:

Competent in Python and have some experience with version control (GitHub).
Familiarity with network analysis, and spatial data science concepts and tools (including `networkX`, `matplotlib` and `geopandas`) is an asset.
Additional (but _not_ required) skills include experience with R and familiarity with quantitative urban studies.

Primary research location:

University of Toronto Scarborough Campus - Remote

Research description:

Magnetic Resonance Imaging (MRI) has revolutionized the study of brain aging. It provides non-invasive, detailed images of brain structure and function, allowing researchers to observe changes associated with normal aging and neurodegenerative diseases. MRI results have shown promise in predicting longitudinal brain functions in aging through the following: Volumetry - Measures changes in brain volume, particularly in regions like the hippocampus and prefrontal cortex, which are vulnerable to age-related decline; cortical thickness - assesses the thickness of the cerebral cortex, which can thin with age; white matter integrity - diffusion-tensor MRI (DTI) measures the diffusion of water molecules in white matter tracts, revealing changes in microstructure and connectivity. In this project, we will focus on the use of the MRI and cognitive data from the Baltimore Longitudinal Study of Aging (BLSA), and aim to determine a predictive modeling approach for estimating longitudinal changes in cognitive function in older adults. Methods include but are not limited to linear mixed-effects model, support-vector machines, neural networks and deep learning. The outcome of this project will enable more effective use of MRI in early diagnosis.

Researcher: Jean Chen, Baycrest

Skills required:

Usage of the Linux operating system Programming in Python and/or Matlab Basic data-science concepts, e.g. correlation, regression
Basic statistical concepts, e.g. t-tests, F-tests, outlier identification (optional)
Experience with advanced data-science and machine-learning methods (optional)
Medical imaging analysis experience

Primary research location:

Baycrest in-person and remote

Predicting Feeding Intolerance and Healthcare Utilization among Children With Gastrostomy Tubes

Research description:

Children with medical complexity are those who have multiple significant chronic health problems, functional limitations, high health care and resource needs/utilization. Interventional Radiologists use minimally invasive techniques to place a tube through the abdomen and into the stomach, called a gastrostomy tube (g-tube) in these children. There is limited evidence that guides the clinical management of g-tube feeding in children, substantial variation in practice, and opportunities to improve care and outcomes through research, data analytics, and clinical innovation.

Dr. Sanjay Mahant (SickKids) and Dr. Nathan Taback (Statistical Sciences) will co-supervise this project. The student can participate in training programs and lecture series at the SickKids Research Institute.

SEDAR—hospital EMR data—accessed through the HPC4Health high computing environment at the SickKids Research Institute will be used as the data source.

Project Goals:

Conduct descriptive phenotyping of children undergoing primary G tube insertion at SickKids.
Statistical analysis of outcomes after primary G tube insertion and patient trajectories after primary G tube insertion.
Predictive modelling to identify children who will develop feeding intolerance and have the high healthcare utilization after primary G insertion.

Researcher: Nathan Taback, Department of Statistical Sciences, Faculty of Arts and Science, University of Toronto

Skills required:

Intermediate/advanced courses in statistics/predictive models.
Data analysis/modelling (including machine learning models) experience.
Intermediate level programming with data using R or Python.
Basic experience with databases queries (e.g., SQL).
Experience using and working in a command line environment.
Excellent oral and written communication.

Primary research location:

University of Toronto St. George Campus and/or Remote

Predicting mature oocyte yield with follicular output rate (FORT) in assisted reproductive technology

Research description:

Assisted reproductive technology (ART) refers to any fertility treatment in which oocytes (eggs) or embryos (fertilized eggs) are manipulated in a laboratory. The first step of ART is controlled ovarian stimulation (COS), which involves daily injections that stimulate the growth of multiple ovarian follicles. Eggs are then retrieved from these follicles. Predicting a patient’s response to COS is challenging. Ovarian reserve markers such as antral follicle count (AFC) and serum antimüllerian hormone (AMH) are good, but not perfect, predictors of oocyte yield. For example, a patient with a high AFC or AMH may have poor COS response, whereas a different patient with the same AFC or AMH might have a robust COS response. Follicular Output RaTe (FORT) has been proposed as a solution to this problem. The FORT score is calcuated by dividing the preovulatory follicular count (follicles measuring 16-22 mm) by AFC and multiplying by 100. Previous studies have demonstrated an association between FORT and mature oocyte yield; however, these data have been generated from young egg donors or patients undergoing in vitro fertilization. The current project aims to investigate the utility of FORT scores in women undergoing COS for urgent fertility preservation due cancer or other medical indications.

Researcher: Nigel Pereira, Lunenfeld-Tanenbaum Research Institute

Skills required:

Creation and maintenance of a master data list using Excel
Exporting baseline and COS parameters from patient charts
Basic descriptive statistics, tabulation, graphing and calculation of FORT
Motivated and eager to learn
The selected student is welcome to observe ART procedures to better understand the project

Primary research location:

Lunenfeld-Tanenbaum Research Institute in-person and remote

Reproducible neuroimaging: Pipelines for MRI and MEG data

Research Description:

Reproducibility and replicability provide an important foundation for increasing the openness and transparency of research findings. As such, efforts to ensure that research is both reproducible (i.e., the same findings can be reproduced using the

same data and analyses) and replicable (i.e., being able to produce the same results in new datasets) have increased over recent years. In fields such as neuroscience, this can be challenging given the array of datatypes, tools, libraries, frameworks,

programming languages and operating systems used in the analysis of any given study. In this project, the student will have the opportunity to work on pipelines we are establishing at the Rotman Research Institute (RRI) to process different kinds of

MRI and MEG (magnetoencephalography) data. In particular, they will test the robustness of the pipelines in terms of reproducibility and replicability using existing datasets collected at the RRI as well as open datasets from online repositories

(e.g., openneuro.org). The student will also contribute to documentation related to these pipelines. They will have the opportunity to work with experts in biomedical imaging, MRI/MEG data analysis and neuroinformatics, and become familiar with initiatives such as the Brain Imaging Data Standard (BIDS).

Researcher: Bradley Buschbaum

Skills required:

Advanced computer programming skills (e.g., Python, shell scripting, Matlab)
Data analysis skills including machine learning
Usage of Linux operating system
Effective oral and written communication skills
Ability to work independently and within a team
Beneficial to have:
- Neuroimaging analysis experience

Primary research location:

Baycrest in-person and remote

Special Education and Early Years Education in Canada

Research description:

Increased investment in early childhood education and care (ECEC) is being seen globally. Canada has already made significant strides by implementing universal ECEC, ensuring all children have access to early learning opportunities. However, as countries roll out such large-scale systems, it is critical that these changes are made thoughtfully and correctly from the outset, as altering a system once it is entrenched becomes difficult. A central consideration during implementation is equity and inclusion, ensuring that all children, regardless of background or ability, benefit from these services. In Canada, the rollout of the Canada-Wide Early Learning and Child Care (CWELCC) initiative prioritizes equity and inclusion as founding principles. Yet, one area of concern remains how children with disabilities are being included within early years curriculum frameworks. These frameworks, while comprehensive, are often lengthy and dense, making it difficult to evaluate their effectiveness for children with disabilities using traditional qualitative methods. To address this gap, I plan to leverage large language models (LLMs) to analyze the content of these frameworks to determine how children with disabilities are discussed and integrated, offering valuable insights into the current state of inclusivity in early childhood education.

Researcher: Elizabeth Dhuey, Department of Management, University of Toronto Scarborough, University of Toronto

Skills required:

Skills needed are Python and R.

Primary research location:

University of Toronto St. George Campus and/or Remote

State Representation and World Models for Robotics

Research description:

Recent advances in foundation models and self-supervised learning have opened new possibilities for learning robust state representations and world models for robotics. While traditional approaches often rely on hand-crafted state representations or require large amounts of task-specific data, modern approaches leveraging pre-trained models and self-supervised learning promise to create more generalizable and data-efficient solutions. We are exploring novel approaches to learn and utilize state representations and world models that can effectively capture both the physical dynamics of robotic systems and the semantic understanding needed for complex tasks. This includes investigating several promising directions: Leveraging large language models (LLMs) and vision-language models (VLMs) as knowledge priors for robotics tasks Developing self-supervised learning techniques that can efficiently learn from unlabeled robot interaction data Creating hybrid architectures that combine learned world models with imitation learning for improved learning Investigating methods for abstracting and transferring learned representations across different tasks and domains The ultimate goal is to develop algorithms that can learn more efficiently from demonstrations while maintaining robustness and generalization capabilities. Success in this area could significantly reduce the task-specific data needed for robot learning while improving the ability to handle novel situations.

Researcher: Igor Gilitschenski, Department of Mathematical and Computational Sciences, University of Toronto Mississauga, University of Toronto

Skills required:

Strong background in deep learning and familiarity with modern architectures (Transformers, diffusion models, etc).
Experience with robot learning frameworks (PyBullet, MuJoCo, etc) and real robots is highly beneficial.
Previous exposure to imitation learning or reinforcement learning is a plus.

Primary research location:

University of Toronto Mississauga campus in-person

Statistical learning for censored data

Research description:

Understanding censoring, which occurs when the event of interest is not observed for some individuals within the study period, is critical for modeling time-to-event data. This is particularly important for applications such as risk prediction in cancer studies, electronic health records, and clinical trials. Ignoring censoring can lead to biased and inaccurate predictive performance. While numerous statistical approaches in survival analysis, such as Cox regression, have been developed to handle censoring, it remains an open challenge to effectively integrate these methods with modern statistical learning techniques for classification. This SUDS project aims to extend the use of Inverse Probability of Censoring Weighting (IPCW) in conjunction with statistical learning to improve risk prediction for right-censored data. Although IPCW has shown promise when integrated with statistical learning methods (e.g., Vock et al., 2016), its predictive performance can suffer when a significant proportion of subjects are censored before the time of interest due to a huge reduction in effective sample sizes. This project will explore new methodological advancements to address these limitations and validate these approaches through simulation studies and real-world applications in cancer genomics. Students working on this SUDS project will meet weekly with the supervisor to discuss progress and address challenges.

Researcher: Jun Young Park, Department of Statistical Sciences, Faculty of Arts and Science, University of Toronto

Skills required:

Completion of one-semester undergraduate course in (i) calculus-based probability, (ii) statistical learning (or machine learning), (iii) generalized linear model is required.
Strong programming skills in R or Python, demonstrated through relevant coursework or prior experience, are highly desirable.

Primary research location:

University of Toronto St. George Campus and/or Remote

Uncovering Barriers to Health Equity: An NLP Analysis of Patient Experience Data from 45 Ontario Hospitals

Research description:

Social determinants of health are increasingly acknowledged as key factors in achieving equitable, efficient, and patient-centered care. It is now well recognized that factors such as inadequate transportation, economic hardship, language barriers, employment security, and health literacy play a critical role in patient’s care experiences and health outcomes. To this effect, understanding the prevalence of these determinants and their impact on patient care is essential to shaping health services and programs that are inclusive and responsive to community needs. The following project intends to employ Natural Language Processing (NLP) approaches to uncover and help study references to social determinants of health arising from patient experience data. In collaboration with the Investigative Journalism Bureau at the University of Toronto, our lab has acquired 120,000 anonymous patient feedback comments from 45 Ontario hospitals, spanning 2015 to 2020. With support from the Institute for Pandemics (IfP), our lab has previously developed approaches to help mitigate selection biases in comments and analyze trends in patient experiences over time. This research project will build upon our previous work and support our efforts to develop analytic approaches that can help uncover barriers to health equity. Outputs will consist of a departmental presentation and research report.

Researcher: Zahra Shakeri, Institute of Health Policy, Management, and Evaluation, Dalla Lana School of Public Health, University of Toronto

Skills required:

We welcome students with an interest in machine learning and/or natural language processing, as evidenced by previous coursework, research projects, or self-study.
Students are encouraged to articulate their interest in health equity and how it aligns with their experience and/or career goals.
Experience working in multidisciplinary teams is an asset.

Primary research location:

University of Toronto St. George Campus and/or Remote

Using machine learning to uncover genetic perturbations in cancer

Research description:

Pertrubation of genetic interactions on genomic and transcriptomics level play critical role in promoting tumorigenesis. Therefore, as systematic understandin of these perturbations will likely provide novel insight into cancer biology and open new therapeutic avenues. As part of this project, we will levergae recent advances in graph-based machine learning methods and large-scale genomics & transcriptomics data to systematically characterize genetic perturbations in various cancer types.

Researcher: Sushant Kumar, Princess Margaret Cancer Centre, University Health Network

Skills required:

Computer programming (Python/R preferable) prior machine learning experience
Background in computational biology/bioinformatics preferable but not required

Primary research location:

Princess Margaret Cancer Centre in-person and remote

Weight and activation quantization using mathematical optimization for efficient training and inference in large language models

Research description:

This project focuses on developing new quantization methods for representing the weights and activations of large language models as numbers with lower precisions to achieve faster training and inference for large language models while minimizing the reconstruction error. In 2024, several new methods including EasyQuant and SqueezeLLM are proposed for quantizing LLMs to reduce training and inference time under acceptably low reconstruction errors. While the existing methods provide remarkable performance, it is expected that a quantization algorithm that relies on mathematical optimization can exceed the performance of existing methods. In this project, the SUDS scholar will be supervised by a faculty member from the MIE department to complete a series of weekly assignments. These tasks will encompass activities such as data analysis, computational experiments, and the implementation and testing of new algorithm enhancements in a git environment. This project leverages cutting-edge techniques in mathematical optimization to advance the quantization of LLMs by reducing reconstruction error. The results of this summer research initiative contribute to the development of a new algorithm for weight and activation quantization of large language models, thereby enhancing a widely used AI technology in using data science.

Researcher: Samin Aref, Department of Mechanical and Industrial Engineering, Faculty of Applied Science and Engineering, University of Toronto

Skills required:

Python (demonstrable experience with PyTorch)
Asymptotic analysis of algorithms, and memory management
Operations research (heuristic optimization, integer programming)
Version control (Git)
CPU and GPU parallelism is an asset.
LLM inference optimization and heuristic algorithms
Skills for reading and comprehending technical articles
Undergrad program in Computer Science, Industrial Engineering, or equivalent

Primary research location:

University of Toronto St. George Campus and/or Remote

Wired for Wellness: A Focus on Promoting Apprentices, Contractors, and Employers’ Mental & Physical Health and Well-Being

Research description:

Our research aims to address the physical health challenges faced by skilled trades workers, particularly electricians, who are prone to repetitive strain injuries. This project will employ a mixed-method approach, gathering quantitative and qualitative data through a survey of 100 participants and semi-structured interviews with 30 participants. Data analysis will play a critical role in uncovering key insights. Survey responses will be analyzed statistically to assess the prevalence of physical injuries, mental health issues, and the effectiveness of workplace safety practices, while interviews will be analyzed qualitatively to identify patterns and themes regarding workplace conditions, ergonomic stressors, and the use of personal protective equipment (PPE). We plan to onboard a DSI student to support the data analysis, integrating advanced statistical tools and qualitative software to handle the complex nature of our dataset. The student will assist in synthesizing the results, contributing to a comprehensive understanding of both the physical and psychological health of apprentices, contractors, and employers in the skilled trades. This interdisciplinary approach will allow us to develop practical toolkits for injury prevention and mental health support, ultimately improving worker well-being and workplace productivity.

Researcher: Behdin Nowrouzi-Kia, Department of Occupational Science and Occupational Therapy, Temerty Faculty of Medicine, University of Toronto

Skills required: