SUDS Student Call
The Data Sciences Institute (DSI) welcomes carefully selected undergraduate students from across Canada for a rich data sciences research experience. Through the SUDS Research Program, undergraduate students, who are interested in exploring data science as a career path, have an exciting opportunity to engage in hands-on research supervised by DSI member researchers across the three UofT campuses.
The DSI is strongly committed to diversity within its community and especially welcomes applications from racialized persons/persons of colour, women, Indigenous/Aboriginal People of North America, persons with disabilities, LGBTQ2S+ persons, and others who may contribute to the further diversification of ideas.
Below are the SUDS research opportunities for May-August 2025. You can apply and rank your top three choices.
See here for information on eligibility, award value and duration, and SUDS programming.
Research description:
The rapid advancement of 3D generative models is completely transforming how we create and manipulate creative content. We’re on the verge of a technological revolution, and there’s no denying that we’ve made significant progress in this field. However, the complexity of 3D content generation demands a stronger emphasis on making it user-friendly and precise. Hence, we’re introducing a research project to push the boundaries of 3D generative modeling. Our ultimate objective is to empower artists, designers, and developers to fully exploit this technology with ease and precision. This can be regarded as a significant step in the direction of realizing a 3D equivalent to Photoshop. Working on the project, the student will gain valuable experience in training neural networks and implementing novel computer vision pipelines and would eventually write up and submit their work to a top-tier AI/Robotics conference or workshop (CVPR, NeurIPS, ICRA, etc.).
Researcher: Igor Gilitschenski, Department of Mathematical and Computational Sciences, University of Toronto Mississauga, University of Toronto
Skills required:
Primary research location:
Research description:
This study is a post-hoc analysis of the GRASSP, ISNCSCI, SCIM and CMAP data from the NISCI (Nogo-A Inhibition in acute Spinal Cord Injury) Study. This trial was a multicenter, multinational, placebo-controlled phase-II study of the safety and preliminary efficacy of intrathecal anti-Nogo-A [NG101] in patients with acute cervical spinal cord injury. The purpose of the NISCI trial was to test whether an antibody therapy can improve motor function and quality of life of tetraplegic patients. The purpose of this specific post-hoc analysis is to explore the differences between groups when measured with the GRASSP and, furthermore, to look at the relationships between the GRASSP scores and the ISNCSCI, SCIM and CMAP scores. The study aims to:
-Determine whether anti-Nogo-A therapy improves upper limb impairment and function at the 6-month time point after SCI, in comparison to the control group.
-Determine the relationships between completeness of SCI and recovery of the upper limb.
-Determine the relationships between function and recovery of the upper limb.
-Determine the relationships between CMAP and recovery of the upper limb.
-Understand the recovery profiles of the upper limb, with both the control and treatment data.
Researcher: Sukhvinder Kalsi-Ryan, Toronto Rehabilitation Institute (KITE), University Health Network
Skills required:
Primary research location:
Research description:
The built environment is sensitive to global warming and climate change, which are leading to increased cooling loads, dangerous heat waves, damaging flooding in cities, permafrost degradation, and other impacts. To model, quantify, and predict these impacts for engineering analysis requires “downscaling”, which maps available climate information (e.g. weather station data, model output) to the requirements of engineering (site specific information, design requirements), while accounting for incompatible sampling, errors in observations, and uncertainty. Downscaling is a workflow of data processing and model calibration against observations that is increasingly informed by machine learning and modern data science. Over the last few years we have developed the UofT Climate Downscaling Workflow (UTCDW), a set of guides, software, and visuals to bring downscaling into engineering research and design, and piloted these tools during a Climate Impacts Hackathon in March 2024. The SUDS scholars will improve the usability of and extend the data-science approaches in the UTCDW, with an emphasis on how to best use connections between climate fields like temperature and precipitation in our workflow, in particular to implement multivariate downscaling methods for climate extremes. The project will demonstrate how the UTCDW can effectively translate climate science knowledge and data into actionable information.
Researcher: Paul Kushner, Department of Physics, Faculty of Arts and Science, University of Toronto
Skills required:
Primary research location:
Research description:
Predicting patient outcomes like risk of readmission is crucial for improving healthcare quality and efficiency. However, the unstructured, unstandardized nature of electronic medical record (EMR) data makes it challenging to develop robust supervised learning models. Large language models (LLMs) offer a promising approach to automating the curation of EMR data into machine-readable formats needed for predictive modeling. However, concerns remain around the reliability, stability, and tendency of LLMs to “hallucinate”, generating plausible-sounding but factually incorrect outputs. In this project, the student will leverage open-source LLMs and explore prompt engineering and fine-tuning strategies to curate EMRs, mitigating the above issues and maximizing the effectiveness of LLMs for EMR data curation and predictive model development. The student will assess the performance of the LLM-powered approach against traditional manual data curation methods in terms of accuracy, scalability, and cost-effectiveness. The insights gained could enable more widespread adoption of LLM techniques to unlock the predictive power of EMR data at Sinai Health, leading to improved patient outcomes and healthcare system efficiency.
Researcher: Kieran Campbell, Lunenfeld-Tanenbaum Research Institute
Skills required:
Primary research location:
Research description:
People spend nearly 90% of their time indoors, where they are exposed to various airborne contaminants. Indoor air quality (IAQ) has a substantial impact on human health and comfort. However, understanding and analyzing IAQ in diverse indoor environments remains challenging despite the well-established principles of mass transfer and fluid dynamics and various low-cost sensing technologies. This is due to the difficulty of collecting key information, such as contaminant generation rate, degree of air mixing, airflow patterns between spaces, etc. This project aims to develop a method for analyzing time series IAQ data using physics-informed machine learning (ML). The method will incorporate mass balance equations, represented by ordinary differential equations, as physical knowledge. A set of probabilistic ML models, regulated by domain knowledge, will address the imperfection of the mass balance equations and the impact of missing key information. Probabilistic programming will serve as the overarching framework to integrate all the model components. The student will work with Professor Jeffrey Siegel (CIVMIN, IAQ expert) and Professor Seungjae Lee (CIVMIN, ML expert in building science). Indoor air quality data collected from multiple homes and other indoor environments will be used to test the developed method.
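As a rough illustration of the kind of mass balance equation mentioned above, the sketch below integrates a single-zone, well-mixed model, dC/dt = a(C_out - C) + E/V - kC, with an explicit Euler step. The well-mixed form, the parameter names, and the numbers in the usage note are assumptions made for illustration; they are not taken from the project, whose models may differ.

```python
def simulate_concentration(c0, c_out, emission_rate, volume,
                           air_change_rate, loss_rate, dt, steps):
    """Explicit Euler integration of a single-zone, well-mixed mass balance.

    dC/dt = a * (C_out - C) + E / V - k * C

    c0              initial indoor concentration
    c_out           outdoor concentration (assumed constant here)
    emission_rate   indoor source strength E (mass per hour)
    volume          zone volume V
    air_change_rate air exchange rate a (per hour)
    loss_rate       first-order loss rate k, e.g. deposition (per hour)
    """
    c = c0
    series = [c]
    for _ in range(steps):
        dcdt = (air_change_rate * (c_out - c)
                + emission_rate / volume
                - loss_rate * c)
        c += dt * dcdt
        series.append(c)
    return series


def steady_state(c_out, emission_rate, volume, air_change_rate, loss_rate):
    """Concentration at which dC/dt = 0 for the model above."""
    return ((air_change_rate * c_out + emission_rate / volume)
            / (air_change_rate + loss_rate))
```

For example, simulate_concentration(0.0, 10.0, 50.0, 100.0, 0.5, 0.1, 0.01, 5000) should settle close to steady_state(10.0, 50.0, 100.0, 0.5, 0.1). In practice the emission rate and the degree of mixing are exactly the quantities that are hard to observe, which is the gap the probabilistic components described above are meant to address.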
Researcher: Seungjae Lee, Department of Civil and Mineral Engineering, Faculty of Applied Science and Engineering, University of Toronto
Skills required:
Primary research location:
Research description:
Public health and media messages increasingly emphasize the link between hearing loss and cognitive decline in older adults, promoting the idea that hearing loss may causally contribute to dementia. This discourse has profound psychosocial implications, raising concerns within hearing-oriented community organizations about how it may amplify stigma, anxiety, and self-doubt among older adults with hearing loss. The discourse also opens the door to older adults’ exploitation by the multi-billion-dollar hearing services industry, which can capitalize on narratives equating aging with decline and trends like the medicalization of aging by marketing hearing services as preventative solutions for cognitive decline. Websites are crucial for attracting clients and website-marketing recommendations often focus on client recruitment because the 70% of people with hearing loss who do not use hearing aids are considered a vast market. Websites from hearing-service providers can offer education about the link between hearing loss and cognition, but might also aim to capitalize on the narrative and common cognitive concerns in older people to increase service and product sales. The student will use sophisticated website scraping and analysis tools jointly with large language models (Python) to provide a systematic analysis of website education and marketing of audiological clinics.
Researcher: Björn Herrmann, Baycrest
Skills required:
Primary research location:
Research description:
The Children’s Aid Society of Toronto (CAST) is North America’s largest not-for-profit child welfare agency, with a legal mandate to protect children and youth from abuse and neglect. CAST provides essential services such as investigating protection needs, offering guidance and counselling to families, and facilitating permanency through adoption. CAST operates across the Greater Toronto Area and ensures that services are delivered through an equity lens, addressing the unique needs of children, youth, and families based on their race, culture, religion, gender, and sexual orientation. Through this project, CAST aims to address key operational challenges, such as understanding why some cases remain open for extended periods and why re-referrals occur after cases are closed. By analyzing the narrative data alongside administrative outcomes, the project will help CAST gain insights into decision-making processes at various stages of a child’s involvement with the system. The anticipated social and economic benefits of the project for CAST include more efficient case management and improved decision-making frameworks, reducing the backlog of long-term cases and enhancing service delivery. This will lead to better outcomes for children and families by ensuring that decisions made during child protection investigations are well-informed and supported by comprehensive data analysis.
Researcher: Shion Guha, Faculty of Information, University of Toronto
Skills required:
Primary research location:
Research description:
Scholar Metrics Scraper is a Python script recently developed at UBC that enables automated retrieval of citation and author data. In this project, we propose to customize this tool in a number of ways to support open science activities and reporting at a Canadian neuroscience institute: the Rotman Research Institute (RRI). This will involve modifying or writing new code to automatically (on a scheduled basis and on demand) retrieve and clean publication data of RRI scientists, develop ways to automatically determine the open access status of publications, as well as to retrieve data on study preregistrations and open datasets shared in online repositories (e.g., osf.io). Code will also be developed for plotting key variables (e.g., publication and citation counts; chord diagrams visualizing collaborations) for each scientist and the institute; generating reports; and automatically updating scientist webpages with lists of publications and datasets that include open access status and details. The student will have the opportunity to learn about open science best practices and tools, and to work with a number of scientists as well as Research IT. They will also have the chance to make an important contribution to establishing and normalizing open science at the RRI.
Researcher: Donna Rose Addis, Baycrest
Skills required:
Primary research location:
Research description:
The landscape of student help-seeking behaviour is undergoing a significant transformation with the rise of generative AI tools like Large Language Models (LLMs). Building on prior research that explores help-seeking tendencies among university students, this project aims to investigate and analyse large-scale student data on the effects of integrating LLM-powered assistants in programming courses, focusing on their influence on student behaviour, engagement, and learning outcomes, ideally generating an approach for improved (predictive and prescriptive) decision making. The research will involve a comprehensive analysis of how the introduction of LLM-based conversational agents (e.g., ChatGPT) and other LLM-based educational tools, such as CodeAid and QuickTA, both developed at the University of Toronto, influences student approaches to seeking help. This will involve data mapping and analysis, as well as identifying patterns in large conversational data. Traditional help-seeking behaviours have shown a reliance on informal support (e.g., peers) rather than formal educational resources (e.g., instructors), often due to perceived barriers like stigma or accessibility. We hypothesise that the availability of LLM tools may shift these dynamics, increasing students’ reliance on automated, real-time assistance and providing data-rich insights into evolving help-seeking patterns that could enhance predictive and prescriptive modelling for educational support strategies.
Researcher: Michael Liut, Department of Mathematical and Computational Sciences, University of Toronto Mississauga, University of Toronto
Skills required:
Primary research location:
Research description:
Recording from the peripheral nervous system can be used to decode control signals exchanged throughout the body, with applications in creating assistive technologies and treating chronic diseases. Our laboratory has collected unique datasets from multi-channel nerve cuff electrodes, which record data from the surface of nerves. We have developed neural networks to decode these recordings by classifying the source of each detected neural event. Using existing data, this project will involve refining neural network architectures and training strategies to optimize performance. Creating neural networks that can generalize well over time and across subjects with minimal re-calibration is of particular interest. The student will have the opportunity to gain a better understanding of real-world data science challenges in neurotechnology, and of strategies to manage these obstacles when developing deep learning systems.
Researcher: José Zariffa, Toronto Rehabilitation Institute (KITE), University Health Network
Skills required:
Primary research location:
Research description:
This project aims to develop a general method for defining clusters of cell types from single-cell RNA sequencing data. This problem is widely considered one of the most important and fundamental problems in single-cell data analysis, but suffers from a paucity of methods to define whether two cell-type clusters are actually distinct from each other. We will use hierarchical clustering via the ultra-fast HGC method to define an initial hierarchy of cell-type clusters. We will then recurse through this hierarchy and apply a significance test at each split, to determine whether the two clusters at the split are significantly different from each other. If they are not, recursion will stop. The most creative aspect of the project will be defining the significance test. HGC is based on the shared nearest-neighbor (SNN) graph, so it seems natural to use that for significance testing as well. However, naively testing whether the number of between-cluster connections is less than expected will not be sufficient, since this criterion was already used to define the clusters themselves, an example of a “double-dipping” problem. Possible solutions may involve some combination of permutation testing, the recently developed “count splitting” method, and graph-theoretic properties.
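The recursion described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the hierarchy is represented as nested tuples rather than HGC output, the data are one-dimensional values, and the significance test is a plain permutation test on the difference of means used as a stand-in. It does not address the double-dipping problem, and it is not the test the project will develop; all function names are hypothetical.

```python
import random
from statistics import mean


def permutation_pvalue(left, right, n_perm=2000, seed=0):
    """Placeholder test: two-sided permutation test on the difference of means."""
    rng = random.Random(seed)
    observed = abs(mean(left) - mean(right))
    pooled = list(left) + list(right)
    n_left = len(left)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_left]) - mean(pooled[n_left:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


def leaves(node):
    """Collect all values under a node.

    A leaf is a list of values; a split is a (left, right) tuple.
    """
    if isinstance(node, tuple):
        return leaves(node[0]) + leaves(node[1])
    return list(node)


def prune(node, alpha=0.05, test=permutation_pvalue):
    """Return final clusters, descending only through significant splits."""
    if not isinstance(node, tuple):
        return [list(node)]
    left, right = leaves(node[0]), leaves(node[1])
    if test(left, right) >= alpha:
        # The two sides are not distinguishable: stop recursing and merge.
        return [left + right]
    return prune(node[0], alpha, test) + prune(node[1], alpha, test)
```

The test is passed in as an argument so that a different criterion, for instance one built on the SNN graph or on count splitting, could be swapped in without changing the recursion.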
Researcher: Michael Weinberg, Lunenfeld-Tanenbaum Research Institute
Skills required:
Primary research location:
Research description:
We are offering a unique research opportunity for students passionate about the intersection of statistics, psychometrics, psychology, and artificial intelligence. This project aims to revolutionize psychological assessment by leveraging AI to design more reliable and valid psychological scales. By employing machine learning algorithms and natural language processing, we will analyze existing scales to identify limitations and develop enhanced tools that more accurately measure psychological constructs. As a participant, you will engage in a case study exploring how AI can help refine scale items to be culturally sensitive and reduce unwanted bias. You’ll collaborate with a multidisciplinary team of psychometricians, data scientists, statisticians and AI experts, gaining hands-on experience in both qualitative and quantitative research methods. This immersive experience will not only deepen your understanding of psychometrics but also equip you with cutting-edge skills in AI applications within psychology. This project offers the chance to contribute to pioneering research with the potential to make a significant impact on psychological assessment practices. You’ll develop valuable skills in data analysis and AI, preparing you for advanced studies or careers in psychology, statistics, education, data science, or related fields.
Researcher: Feng Ji, Department of Applied Psychology and Human Development, Ontario Institute for Studies in Education, University of Toronto
Skills required:
Familiarity with machine learning concepts and programming languages such as Python or R is highly desirable.
Familiarity with APIs (such as OpenAI API) is preferred (but not required).
Essential skills include excellent analytical abilities, attention to detail, and the capacity to work effectively in a collaborative team environment.
Coursework in psychology, statistics, and data science (generally defined) is preferred (but not required).
Primary research location:
Research description:
Tertiary lymphoid structures (TLS) have recently been shown to be predictive of survival in pancreatic adenocarcinoma (PDAC). This project aims to quantify and subtype TLS in three PDAC cohorts spanning over 600 patients. These findings will then be associated with clinical metadata, genomic mutations and transcriptional subtypes. The successful candidate will benchmark existing TLS identification methods and compare these to recently developed foundation models. Upon identification, we will attempt to stratify TLS into distinct subtypes based on the embeddings produced by foundation models. We will then attempt to identify whether these subtypes are driven by TLS-specific aspects such as lymphocyte morphology or by the surrounding environment, such as the composition of the stroma or distance to the closest tumor. Finally, we will benchmark the extent to which these subtypes recapitulate transcriptional TLS subtypes we have already identified using spatial sequencing technologies. Upon creation of a robust TLS subtyping method, we will run it over slides from over 600 deeply phenotyped patients and associate TLS presence and subtype with patient survival, genomic mutations and copy number aberrations as well as known transcriptional subtypes. Overall, this will be the most in-depth characterization of TLS in PDAC to date.
Researcher: Kieran Campbell, Lunenfeld-Tanenbaum Research Institute
Skills required:
Primary research location:
Research description:
The majority of ovarian cancers are diagnosed at an advanced stage, and consequently, the case-fatality rate is high. To some extent, this is because there is no effective screening program and because of delay in diagnosis. It is of interest to explore innovative means of accelerating the date of diagnosis. One possibility is CA125 testing at the first point of care for symptomatic women who seek consultation with front-line physicians. The goal of this project is to leverage a robust database of ~600 ovarian cancers diagnosed in Ontario, to conduct a detailed evaluation of the distribution of CA125 levels at the time of diagnosis by various patient and clinical factors (e.g., stage, histology), and to explore whether increasing the threshold for CA125 levels may accelerate the diagnostic process and lead to earlier identification of affected individuals. Finally, analysis of predictors of survival is also of interest, and these data are available to analyze in this dataset.
Researcher: Joanne Kotsopoulos, Women's College Hospital
Skills required:
Primary research location:
Research description:
Variation in gene expression underpins variation in organismal traits and diversity. Therefore, understanding how gene expression evolves will allow us to better understand the mechanisms of evolutionary change. The strength and form of selection on gene expression and its role in evolution is difficult to estimate, however, because of the high dimensional and highly correlated nature of gene expression data. In this project the SUDS scholar will estimate selection on gene expression traits and compare the results from different methods that are commonly used to study selection on gene expression.
Researcher: Jacqueline Sztepanacz, Department of Ecology and Evolutionary Biology, Faculty of Arts and Science, University of Toronto
Skills required:
Primary research location:
Research description:
Large language models (LLMs) have opened up new frontiers for reducing administrative burdens in health systems. Healthcare institutions around the world have already begun piloting the use of automated scribes and other tools aimed at summarizing patients’ clinical records. Aside from these institutional endeavors, there is also evidence that independent care providers are increasingly utilizing large-language models to support care delivery, despite the lack of guidelines and oversight mechanisms. In light of these recent trends, there is a critical need to better understand the prevalence, types, and impacts of bias that risk being perpetuated by LLMs. Social biases such as racial and gender stereotypes, as well as systematic discrepancies in clinical LLM summaries, pose a risk of exacerbating health disparities. Relatedly, biases may also stem from sycophancy, a phenomenon where LLMs generate outputs that reflect the user’s anticipated preferences or assumptions. The goal of this project is to evaluate the risk of social bias and the effects of sycophancy on several publicly-available LLMs, and summarise findings in a whitepaper or research report. To support these evaluations, we will use anonymized clinical notes from the MIMIC-IV dataset, which have already been annotated for patients’ language, race, and ethnicity.
Researcher: Zahra Shaker, Institute of Health Policy, Management, and Evaluation, Dalla Lana School of Public Health, University of Toronto
Skills required:
Primary research location:
Research description:
This project seeks to characterize the relationship between components of the built environment and breast cancer risk among BRCA mutation carriers. The built environment touches all aspects of our lives, encompassing the buildings we live in, distribution systems that provide us with water and electricity, and the roads, bridges, and transportation systems we use to get from place to place. It can be described as the manufactured or modified structures that provide people with living, working, and recreational spaces. As a result, these environments can have a lasting impact on human health. Previous literature has established relationships between built environment factors, including proximity to roadways, neighbourhood greenspace, and indoor environment, and breast cancer risk. However, to our knowledge, no studies have specifically examined this risk among BRCA mutation carriers. This study aims to leverage our existing database of BRCA mutation carriers from across Canada, alongside detailed environmental data available through the Canadian Urban Environmental Health Research Consortium (CANUE), to assess and quantify these risks. Findings from this study will provide novel insights into how various built environment factors may influence breast cancer risk in high-risk populations, allowing us to better understand potential risk reduction interventions and urban planning efforts.
Researcher: Joanne Kotsopoulos, Women's College Hospital
Skills required:
Primary research location:
Research description:
Animal species exhibit circadian activity patterns in response to the rotation and light cycle on Earth. However, we do not understand the evolutionary causes or consequences of this variation; for example, why are moths nocturnal, while butterflies are diurnal? Research in our lab has suggested that nocturnality may confer an evolutionary advantage during mass extinction events (Shafer et al., 2023), and transitions between activity patterns might drive speciation (Nichols & Shafer et al., 2024). However, we only have information on the activity patterns of ~12% of vertebrate species, and no systematic information is available on the activity patterns of invertebrates, which represent >97% of all animal species. Given the scale of missing information, we aim to leverage citizen science to fill in the gap. iNaturalist is a popular application that allows users to post observations of organisms along with metadata for their location/timing, spawning a new generation of digital naturalists, and generating huge databases of scientific-grade observations of Earth’s biodiversity. We propose to mine >200 million observations of ~500,000 species by >8 million users from around the world. The SUDS scholar will determine the activity pattern for millions of species by identifying patterns in this data using data science techniques.
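To give a sense of what identifying activity patterns from observation timing could look like at its simplest, the sketch below labels each species by the share of its observations that fall in a fixed daytime window. Everything here is an assumption for illustration: the input format of (species, local hour) pairs, the 06:00 to 18:00 window, and the 0.8 threshold are hypothetical, and this is not the lab's method. A real analysis would also need to account for when observers themselves are active and for day length varying with latitude and season.

```python
from collections import defaultdict


def classify_activity(observations, day_start=6, day_end=18, threshold=0.8):
    """Label species as diurnal, nocturnal, or mixed.

    observations: iterable of (species, local_hour) pairs, local_hour in 0-23.
    A species is "diurnal" if at least `threshold` of its observations fall
    in [day_start, day_end), "nocturnal" if at most 1 - threshold do, and
    "mixed" otherwise.
    """
    counts = defaultdict(lambda: [0, 0])  # species -> [day count, night count]
    for species, hour in observations:
        is_day = day_start <= hour < day_end
        counts[species][0 if is_day else 1] += 1

    labels = {}
    for species, (day, night) in counts.items():
        frac_day = day / (day + night)
        if frac_day >= threshold:
            labels[species] = "diurnal"
        elif frac_day <= 1 - threshold:
            labels[species] = "nocturnal"
        else:
            labels[species] = "mixed"
    return labels
```

For instance, a species observed only at 23:00, 01:00 and 02:00 would be labelled nocturnal, while one observed at both 05:00 and 12:00 would be labelled mixed.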
Researcher: Maxwell Shafer, Department of Cell and Systems Biology, Faculty of Arts and Science, University of Toronto
Skills required:
Primary research location:
Research description:
This project aims to develop innovative geometric deep learning methods to identify and characterize stellar streams in the Milky Way. Stellar streams are elongated groups of stars that once belonged to smaller galaxies or star clusters that were disrupted by our galaxy’s gravitational forces. These celestial structures serve as crucial forensic evidence of our galaxy’s formation history and provide unique probes of dark matter’s distribution and properties. We will apply graph neural networks and other geometric deep learning techniques to analyze stellar data from the Gaia satellite, which has mapped the positions and velocities of tens of millions of stars with unprecedented precision. These methods are particularly well-suited for this astronomical challenge as they can naturally capture the spatial and kinematic relationships between stars while handling irregular data structures. The project will also incorporate complementary data from the Dark Energy Spectroscopic Instrument (DESI) survey to enhance our understanding of stellar properties. By developing this novel approach to stellar stream detection, we aim to uncover previously unknown structures and gain deeper insights into the Milky Way’s evolutionary history and dark matter distribution.
Researcher: Ting Li, David A. Dunlap Department of Astronomy and Astrophysics, Faculty of Arts and Science, University of Toronto
Skills required:
Primary research location:
Research description:
AI debate has been proposed as an adversarial scalable oversight method, with encouraging recent progress (see refs below). Debate elicits a wide range of capabilities, however, in particular a mix of knowledge and persuasion. In this pilot project, a new debate protocol focused on disentangling persuasive tendencies from knowledge elicitation will be implemented, validated and explored. Additionally supported by OpenAI funds, this research theme broadly aims to develop scalable oversight methods for super-alignment, using physics as a ground truth. The objective of super-alignment is to ensure that AI systems remain aligned with human values and intentions, even in the limit where they become more capable than humans. Reference document1
Researcher: Kristen Menou, Department of Physical and Environmental Sciences, University of Toronto Scarborough
Skills required:
Primary research location:
Research description:
Horizontal Gene Transfer (HGT) is a process in which organisms acquire foreign genes from different species. HGT contributes to organismal evolution and has been an important source of genetic diversity. HGT was commonly identified in prokaryotes but rarely reported in eukaryotes. However, our understanding of HGT in eukaryotes is quickly expanding with the production of genomic resources and the development of detection tools. The Kingdom Fungi represents a striking example, especially the obligate symbionts, which interact intimately with various host organisms. Our research group has been dedicated to detecting fungus-related HGT elements and has discovered several such cases, including in mosquito gut-dwelling fungi (doi:10.1093/molbev/msw126), herbivorous mammal rumen fungi (doi:10.1128/mSystems.00247-19), amphibian gastrointestinal fungi (doi:10.1534/g3.120.401516), and photobiont-associated fungi (doi:10.1016/j.cub.2021.01.058). This project aims to identify novel HGT using the lab's newly assembled fungal genomes representing underexplored lineages on the Tree of Life. The student working on this project will help refine the lab's existing pipelines and analyze the fungal genomes as well as related host data to reconstruct the evolutionary history of identified genes by conducting comparative genomics. A research report aimed for publication will be completed by the end of the project.
Researcher: Yan Wang, Department of Biological Sciences, University of Toronto Scarborough, University of Toronto
Skills required:
Primary research location:
Research description:
Why don’t more households invest in the stock market? Is it too difficult to open a brokerage account? While this may have been true in the past, advancements in FinTech have made the process simple and accessible, often requiring just a few taps on a smartphone. Instead, could the real issue be that households are simply misinformed about the risks and returns of stock market investing? Using large-scale survey data, this project aims to explore whether limited stock market participation can be attributed to misperceptions about expected returns. We will study patterns of misperceptions across household types along observable characteristics like income, age, and occupation. We also seek to study which interventions can alleviate misinformation and help increase stock market participation. This project will entail collecting, analyzing, and visualizing data. Strong and pragmatic programming experience is required to download and assemble large data sets. An understanding of financial concepts is required for analysis, and visualization entails displaying results in a concise yet appealing way. This project is ideal for an undergraduate student who has some research experience and is considering graduate school in economics or finance.
Researcher: Michael Boutros, Department of Economics, University of Toronto Mississauga, University of Toronto
Skills required:
Primary research location:
Research description:
Measuring and predicting ocean currents is crucial to understanding our climate system, marine ecosystems, and societal maritime activities. Satellites are key tools for doing so, but cannot provide more than surface information. In this project, we seek to infer sub-surface properties by leveraging realistic three-dimensional numerical forecasts and machine learning techniques. Of prime interest is the mixed layer, which is the uppermost layer of the ocean. It is the buffer between the atmosphere and the deep ocean, and hosts rich ecosystems. Reconstructing its depth is key to predicting the state of the upper ocean, and to do so from satellite data would provide . You will use output from a Fisheries and Oceans Canada operational numerical model as your dataset. The data is three-dimensional and therefore contains the answer to the question of how deep the mixed layer is. The model solves equations that are constrained by observations and finely tuned to reproduce realistic conditions. Using this data set, you will train a deep-learning algorithm (most likely a U-Net, but we are open to exploring different avenues) to predict this depth when only surface information (e.g. sea surface temperature, height, or salinity) is provided.
Researcher: Nicolas Grisouard, Department of Physics, Faculty of Arts and Science, University of Toronto
Skills required:
Primary research location:
Research description:
The development of an equity dashboard in hospitals has been proposed as a solution to facilitate the identification of variations in outcomes, encourage accountability, and support ongoing monitoring. Our research sought to develop an equity dashboard using data collected from the maternal care wards of a hospital in the US in 2019 and 2020. The data obtained were cleaned, and patient delivery data were linked to their demographic data using Microsoft Excel and Python. The data were then disaggregated by race/ethnicity, and statistical analysis was performed in R to assess differences in the outcomes. Tableau Desktop was used to develop 18 visualizations of the measures. We are currently conducting usability testing. We could not complete the planned predictive modeling; however, we are working with our collaborators to obtain five years of data to incorporate predictive analytics in the next iteration. Once we validate its efficacy through user testing, we will disseminate our dashboard for implementation. 1) Develop predictive models of adverse events and outcomes based on patient characteristics and social vulnerability, and analyze feature importance for these predictions. 2) Develop an Excel macro and a content pack in Power BI that can generate comparable visualizations. 3) Make the dashboard publicly accessible through Tableau Public.
Researcher: Myrtede Alfred, Department of Mechanical and Industrial Engineering, Faculty of Applied Science and Engineering, University of Toronto
Skills required:
Primary research location:
Research description:
Supervisor Fralick has developed a framework of six domains of study design that can affect the internal validity of randomized controlled trials (RCTs), encapsulated by the acronym PHOBIA: Placebo controlled? How was it funded? Outcome clinically valid? Blinded? Intention-to-Treat? A lot of centres and patients included? When evaluating an RCT, these 6 elements are crucial considerations. The current paradigm leaves reviewers to parse these details from the manuscript, which is inefficient, time-consuming, risks bias, and lacks quality control. All RCTs require registration on a publicly available clinical trial registry, meaning key aspects of their design are readily available. This project will apply supervised machine learning (ML) and two large language models (LLMs) for automating part of the peer review process. The data from the RCT will be parsed. Then, LLM 1 (Summarizer) will extract key information related to the PHOBIA framework. LLM 2 (Validator) will validate the summary by checking it against the original study content. Performance of the dual-LLM system will be evaluated according to the following metrics: hallucination detection, consistency, speed, and helpfulness. A detailed comparison of the system’s reviews with traditional human reviews will assess whether the LLMs can reliably augment the peer review process for RCTs.
Researcher: Michael Fralick, Lunenfeld-Tanenbaum Research Institute
Skills required:
Primary research location:
Research description:
We have each inherited our genomes from a vast set of ancestors who were scattered across geographic space. The locations of these ancestors influence the patterns of genetic diversity we see today. Given the genetic relationships among a set of individuals, we can therefore hope to reconstruct the spatial history of our shared ancestors. Our lab has recently developed a method to locate genetic ancestors by modeling movement down the many trees that relate recombining genomes (Osmond & Coop 2024), and we are applying this to a variety of species. One limit of our current approach is that the uncertainty in the location of ancestors increases as we move back in time, away from the known locations of the samples. This limit can now be relaxed with the increasing availability of ancient genomes, which will effectively anchor the trees in space further back in time. The goal of this project is to extend our method to include ancient genomes and apply the method to publicly available human genetic data. There are two key questions: 1) How well does our existing method locate the ancient genomes? and 2) How much do the ancient genomes change the inferred locations of other genetic ancestors?
Researcher: Matthew Osmond, Department of Ecology and Evolutionary Biology, Faculty of Arts and Science, University of Toronto
Skills required:
Primary research location:
Research description:
This project aims to develop an integrative machine learning approach to identify causal genes underlying genome-wide association study (GWAS) risk loci. We will build on the state of the art in three major ways: 1) By dramatically increasing the diversity of input biological networks. We will incorporate curated pathway databases, co-essentiality networks, protein-protein interaction networks, genetic interaction maps, and co-expression networks. 2) By improving inference and featurization of these networks. We will use BIONIC, a deep learning approach developed by collaborators in Toronto that performs network fusion via graph convolutional neural networks, to combine the information gleaned from our biological networks into a single low-dimensional feature vector per gene. 3) By improving the machine learning modelling itself. We will predict gene-level GWAS p-values from our network-based feature vectors via leave-one-chromosome-out cross-validation. We will use gradient boosting, a popular machine learning approach that flexibly captures non-linear relationships between features while avoiding overfitting. Naively applying gradient boosting is incorrect because it ignores that gene-level GWAS p-values may not be i.i.d. due to linkage disequilibrium. We will preprocess with Cholesky whitening to decorrelate the gene-level p-values and features. Thus, we will develop better methods for inferring both biological networks and GWAS causal genes.
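As a rough illustration of the Cholesky whitening step mentioned above, the sketch below decorrelates a vector of gene-level statistics given a known correlation matrix. It is a minimal example under stated assumptions, not the project's actual pipeline: the function name `cholesky_whiten`, the toy 2x2 correlation matrix, and the toy statistics are all hypothetical, and in practice the correlation structure would have to be estimated from linkage disequilibrium.

```python
import numpy as np


def cholesky_whiten(cov, values):
    """Decorrelate `values` whose rows have covariance `cov`.

    With cov = L @ L.T (Cholesky factorization), the transformed values
    L^{-1} @ values have identity covariance. `values` may be a vector
    (e.g. transformed gene-level statistics) or a matrix whose rows are
    genes (e.g. a feature matrix), so the same transform can be applied
    to both the targets and the features.
    """
    lower = np.linalg.cholesky(np.asarray(cov, dtype=float))
    # Solve L @ x = values instead of forming the inverse explicitly.
    return np.linalg.solve(lower, np.asarray(values, dtype=float))


# Hypothetical example: two neighbouring genes with correlated statistics.
toy_cov = np.array([[1.0, 0.8],
                    [0.8, 1.0]])
toy_stats = np.array([1.5, 1.2])
whitened_stats = cholesky_whiten(toy_cov, toy_stats)
```

Applying the same transform to the feature matrix keeps targets and features aligned, which is why the function accepts either a vector or a matrix.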
Researcher: Michael Wainberg, Lunenfeld-Tanenbaum Research Institute
Skills required:
Primary research location:
Research description:
The Simons Observatory (SO) is a new, multi-telescope experiment to study the origin and evolution of the cosmos by measuring the cosmic microwave background (CMB), the oldest light in the Universe. Raw data consist of TBs of timestreams of measured sky brightness recorded each day—adding up to several PB over several years—that need to be reconstructed into 2D maps. However, before this can happen, the timestreams need to be automatically processed to remove noise contaminants and foreground galaxies/stars that block the main signal. The successful candidate will join the research groups of Profs. Adam Hincks and Renée Hložek that are actively researching machine learning methods to identify and classify these objects, using existing data from the Atacama Cosmology Telescope (ACT), a precursor to SO. Possible projects include characterising and improving deep learning techniques (including combining multi-modal data and using attention mapping) for detecting and classifying events in the telescope’s raw timestreams and contributing to the data processing pipeline of SO. An exciting aspect of this project is that our classification will help enable the search for astrophysical transients, such as flaring stars and gamma ray bursts.
Researcher: Adam Hincks, David A. Dunlap Department of Astronomy and Astrophysics, Faculty of Arts and Science, University of Toronto
Skills required:
Primary research location:
Research description:
Our Milky Way galaxy is surrounded by numerous small galaxies and star clusters, each influenced by the powerful gravitational forces of our galaxy. These forces create stellar streams: celestial “rivers” of stars that gracefully orbit around the Milky Way. These streams are not just beautiful; they hold the keys to unraveling the mysteries of galaxy formation and the hidden nature of dark matter. (Curious? Check out this fascinating feature in The Globe & Mail: Star Streams Reveal Milky Way’s Ravenous History.) Thanks to revolutionary cosmic surveys, we now have detailed data on millions of stars, including their positions and velocities in full 6D! As a SUDS Scholar, you’ll be at the cutting edge of this exciting field, developing a Bayesian framework to determine the probability that a star belongs to a particular stream and to characterize the properties of these stellar streams. You will work with massive astronomical datasets, totaling several gigabytes, from one of the most extensive spectroscopic surveys, the Dark Energy Spectroscopic Instrument (DESI). This project will give you the opportunity to develop and apply innovative statistical and computational techniques that are crucial not only for revealing the secrets of stellar streams but also for shaping the future of astronomical surveys.
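To give a flavour of what "the probability that a star belongs to a particular stream" can mean, here is a minimal sketch using Bayes' rule with a two-component Gaussian mixture in a single velocity dimension. This is an illustrative assumption, not the framework the project will develop: the function names, the one-dimensional model, and every parameter value are hypothetical.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def membership_probability(velocity, stream_fraction,
                           stream_mean, stream_sd,
                           background_mean, background_sd):
    """Posterior probability that a star with this velocity is a stream member.

    Bayes' rule for a two-component mixture:
    P(stream | v) = f * N(v | stream) / (f * N(v | stream) + (1 - f) * N(v | background)),
    where f is the prior fraction of stars that belong to the stream.
    """
    stream_term = stream_fraction * gaussian_pdf(velocity, stream_mean, stream_sd)
    background_term = (1.0 - stream_fraction) * gaussian_pdf(
        velocity, background_mean, background_sd)
    return stream_term / (stream_term + background_term)


# Hypothetical numbers: a kinematically cold stream against a broad background.
p_member = membership_probability(velocity=-45.0, stream_fraction=0.05,
                                  stream_mean=-45.0, stream_sd=5.0,
                                  background_mean=0.0, background_sd=100.0)
```

A real analysis would use all available dimensions (positions, proper motions, velocities, chemistry) and would infer the stream parameters jointly rather than fixing them.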
Researcher: Ting Li, David A. Dunlap Department of Astronomy and Astrophysics, Faculty of Arts and Science, University of Toronto
Skills required:
Primary research location:
Research description:
We are constantly exposed to various sensory stimuli such as sight, sound, and smell. Although sensory systems detect and interpret these stimuli, our perception is influenced by internal states such as hunger, stress, and inflammation. However, it is still unclear how the signals that signify these states, such as hormones, peptides, and cytokines, are encoded in the gene expression of individual neurons and modulate the patterns of neural responses to stimuli. To address this question, our lab uses the mouse olfactory system as a model. Olfaction plays fundamental roles in many aspects of our life, including learning and memory and the detection of food and danger. In addition, it has been shown that odor processing is influenced by internal states even at the first step, where sensory neurons in the nose detect odors. However, the mechanisms through which individual neurons encode internal states and modify responses to stimuli are still unknown. This SUDS project will primarily aim to quantitatively characterize the state-dependent changes in gene expression in the olfactory system by analyzing single-cell and bulk genomics datasets (RNA, epigenome, and protein) obtained from mice subjected to changes in internal states such as hunger and inflammation.
Researcher: Tatsuya Tsukahara, Lunenfeld-Tanenbaum Research Institute
Skills required:
Primary research location:
Research description:
To develop safe nanoparticles for use during pregnancy, we first need to understand the cross-talk (communication) between cells of the placenta (the barrier between the mother and the baby) and other cells from the mother under different pathological conditions, e.g. cancer. We developed an organ-on-a-chip model to mimic this environment in the lab and investigate the cross-talk between cells. We used this model to generate proteomic and transcriptomic data. A data science student will work with a graduate student to help analyze these large datasets and develop different approaches to visualizing the data. This is a great opportunity for the student to work in an interdisciplinary team at the intersection of nanotechnology and microfluidics, learn new wet-lab techniques, and apply their knowledge of data science to solve real-world problems.
Researcher: Hagar Labouta, Unity Health Toronto
Skills required:
Primary research location:
Research description:
This project is an expansion of `piccard`, a Python library to perform longitudinal analyses on data tabulated on unharmonized spatial units. The final library will have three modules: (1) temporal path creation, (2) visualization, and (3) classification. The first module is available on [PyPI]. This module introduces `piccard`’s graph-based solution to a frequent problem in spatial data science: identifying temporal trends across noncongruent spatial units of aggregation (e.g., census tracts, dissemination areas, and postal codes from different years). We conceptualize spatial units as nodes, and the edges connecting them as their overlapping geographical areas. Our method creates paths that preserve the original spatial units and their attributes. Thus, `piccard` overcomes some of the limitations of traditional harmonization methods involving labour-intensive apportioning (e.g., defining ad-hoc target units). The selected student will work with the PI and Professor Daniel Silver (UTSC Sociology) in developing the second and third modules of the library. The visualization module will allow users to subset and inspect network paths. Meanwhile, the classification module will facilitate the classification of paths according to the distribution of the shared attributes across the original geographic units. For example, a user could classify census tracts according to patterns of variation over time.
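The idea of treating spatial units as nodes and their overlaps as edges can be sketched in a few lines of plain Python. The example below does not use or reproduce `piccard`'s API; the function `temporal_paths`, the unit identifiers, and the overlap list are hypothetical, and the sketch assumes that every edge points from an earlier year to a later one (so the graph has no cycles).

```python
from collections import defaultdict


def temporal_paths(overlaps, start_units):
    """Enumerate paths through overlapping spatial units across years.

    `overlaps` is a list of (earlier_unit, later_unit) pairs, one per pair of
    units from consecutive years whose areas overlap. Each returned path is a
    list of unit identifiers that keeps the original units intact.
    """
    successors = defaultdict(list)
    for earlier, later in overlaps:
        successors[earlier].append(later)

    paths = []

    def extend(path):
        next_units = successors.get(path[-1], [])
        if not next_units:          # no later overlapping unit: the path ends here
            paths.append(path)
            return
        for unit in next_units:
            extend(path + [unit])

    for unit in start_units:
        extend([unit])
    return paths


# Hypothetical example: one 2011 tract is split into two tracts in 2016,
# each of which overlaps a single tract in 2021.
toy_overlaps = [
    ("2011:A", "2016:A1"),
    ("2011:A", "2016:A2"),
    ("2016:A1", "2021:A1"),
    ("2016:A2", "2021:A2"),
]
toy_paths = temporal_paths(toy_overlaps, start_units=["2011:A"])
```

Each path can then carry the attributes of its units year by year, which is the kind of object a visualization or classification module could subset and compare.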
Researcher: Fernando Calderón Figueroa, Department of Human Geography, University of Toronto Scarborough, University of Toronto
Skills required:
Primary research location:
Research description:
Magnetic Resonance Imaging (MRI) has revolutionized the study of brain aging. It provides non-invasive, detailed images of brain structure and function, allowing researchers to observe changes associated with normal aging and neurodegenerative diseases. MRI results have shown promise in predicting longitudinal brain function in aging through the following: volumetry, which measures changes in brain volume, particularly in regions like the hippocampus and prefrontal cortex that are vulnerable to age-related decline; cortical thickness, which assesses the thickness of the cerebral cortex, which can thin with age; and white matter integrity, for which diffusion-tensor MRI (DTI) measures the diffusion of water molecules in white matter tracts, revealing changes in microstructure and connectivity. In this project, we will focus on the MRI and cognitive data from the Baltimore Longitudinal Study of Aging (BLSA), and aim to determine a predictive modeling approach for estimating longitudinal changes in cognitive function in older adults. Methods include, but are not limited to, linear mixed-effects models, support-vector machines, neural networks, and deep learning. The outcome of this project will enable more effective use of MRI in early diagnosis.
Researcher: Jean Chen, Baycrest
Skills required:
Primary research location:
Research description:
Researcher: Nathan Taback, Department of Statistical Sciences, Faculty of Arts and Science, University of Toronto
Skills required:
Primary research location:
Research description:
Assisted reproductive technology (ART) refers to any fertility treatment in which oocytes (eggs) or embryos (fertilized eggs) are manipulated in a laboratory. The first step of ART is controlled ovarian stimulation (COS), which involves daily injections that stimulate the growth of multiple ovarian follicles. Eggs are then retrieved from these follicles. Predicting a patient’s response to COS is challenging. Ovarian reserve markers such as antral follicle count (AFC) and serum antimüllerian hormone (AMH) are good, but not perfect, predictors of oocyte yield. For example, a patient with a high AFC or AMH may have a poor COS response, whereas a different patient with the same AFC or AMH might have a robust COS response. Follicular Output RaTe (FORT) has been proposed as a solution to this problem. The FORT score is calculated by dividing the preovulatory follicular count (follicles measuring 16-22 mm) by the AFC and multiplying by 100. Previous studies have demonstrated an association between FORT and mature oocyte yield; however, these data have been generated from young egg donors or patients undergoing in vitro fertilization. The current project aims to investigate the utility of FORT scores in women undergoing COS for urgent fertility preservation due to cancer or other medical indications.
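The FORT calculation described above is simple enough to write down directly. The following sketch follows the definition in the text (preovulatory follicle count divided by AFC, times 100); the function name and the example counts are hypothetical.

```python
def fort_score(preovulatory_follicle_count, antral_follicle_count):
    """Follicular Output RaTe: preovulatory follicles (16-22 mm) per 100 antral follicles."""
    if antral_follicle_count <= 0:
        raise ValueError("antral follicle count must be positive")
    return preovulatory_follicle_count / antral_follicle_count * 100


# Hypothetical patient: 6 preovulatory follicles from an AFC of 12.
example_fort = fort_score(6, 12)  # 50.0
```

Two patients with the same AFC can therefore have very different FORT scores, which is the variation the project sets out to study.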
Researcher: Nigel Pereira, Lunenfeld-Tanenbaum Research Institute
Skills required:
Primary research location:
Research description:
Increased investment in early childhood education and care (ECEC) is being seen globally. Canada has already made significant strides by implementing universal ECEC, ensuring all children have access to early learning opportunities. However, as countries roll out such large-scale systems, it is critical that these changes are made thoughtfully and correctly from the outset, as altering a system once it is entrenched becomes difficult. A central consideration during implementation is equity and inclusion, ensuring that all children, regardless of background or ability, benefit from these services. In Canada, the rollout of the Canada-Wide Early Learning and Child Care (CWELCC) initiative prioritizes equity and inclusion as founding principles. Yet, one area of concern remains how children with disabilities are being included within early years curriculum frameworks. These frameworks, while comprehensive, are often lengthy and dense, making it difficult to evaluate their effectiveness for children with disabilities using traditional qualitative methods. To address this gap, I plan to leverage large language models (LLMs) to analyze the content of these frameworks to determine how children with disabilities are discussed and integrated, offering valuable insights into the current state of inclusivity in early childhood education.
Researcher: Elizabeth Dhuey, Department of Management, University of Toronto Scarborough, University of Toronto
Skills required:
Primary research location:
Research description:
Recent advances in foundation models and self-supervised learning have opened new possibilities for learning robust state representations and world models for robotics. While traditional approaches often rely on hand-crafted state representations or require large amounts of task-specific data, modern approaches leveraging pre-trained models and self-supervised learning promise to create more generalizable and data-efficient solutions. We are exploring novel approaches to learn and utilize state representations and world models that can effectively capture both the physical dynamics of robotic systems and the semantic understanding needed for complex tasks. This includes investigating several promising directions: leveraging large language models (LLMs) and vision-language models (VLMs) as knowledge priors for robotics tasks; developing self-supervised learning techniques that can efficiently learn from unlabeled robot interaction data; creating hybrid architectures that combine learned world models with imitation learning for improved learning; and investigating methods for abstracting and transferring learned representations across different tasks and domains. The ultimate goal is to develop algorithms that can learn more efficiently from demonstrations while maintaining robustness and generalization capabilities. Success in this area could significantly reduce the task-specific data needed for robot learning while improving the ability to handle novel situations.
Researcher: Igor Gilitschenski, Department of Mathematical and Computational Sciences, University of Toronto Mississauga, University of Toronto
Skills required:
Primary research location:
Research description:
Understanding censoring, which occurs when the event of interest is not observed for some individuals within the study period, is critical for modeling time-to-event data. This is particularly important for applications such as risk prediction in cancer studies, electronic health records, and clinical trials. Ignoring censoring can lead to biased and inaccurate predictive performance. While numerous statistical approaches in survival analysis, such as Cox regression, have been developed to handle censoring, it remains an open challenge to effectively integrate these methods with modern statistical learning techniques for classification. This SUDS project aims to extend the use of Inverse Probability of Censoring Weighting (IPCW) in conjunction with statistical learning to improve risk prediction for right-censored data. Although IPCW has shown promise when integrated with statistical learning methods (e.g., Vock et al., 2016), its predictive performance can suffer when a significant proportion of subjects are censored before the time of interest due to a huge reduction in effective sample sizes. This project will explore new methodological advancements to address these limitations and validate these approaches through simulation studies and real-world applications in cancer genomics. Students working on this SUDS project will meet weekly with the supervisor to discuss progress and address challenges.
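For readers unfamiliar with IPCW, the sketch below shows one common way the weights are built when the goal is to classify whether an event occurs by a time of interest `tau`. It is an illustration under stated assumptions rather than the method this project will develop: the censoring distribution is estimated with a simple Kaplan-Meier estimator that ignores covariates, events are assumed to precede censoring at tied times, and the function names and toy data are hypothetical.

```python
import numpy as np


def censoring_survival(times, events):
    """Kaplan-Meier estimate of G(t) = P(C > t), treating censoring as the 'event'.

    `events` is 1 if the event of interest was observed and 0 if censored.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    censor_times = np.unique(times[events == 0])
    survival = []
    g = 1.0
    for t in censor_times:
        n_censored = np.sum((times == t) & (events == 0))
        # Subjects with an event exactly at t are assumed to have left the risk set.
        at_risk = np.sum(times > t) + n_censored
        g *= 1.0 - n_censored / at_risk
        survival.append(g)
    survival = np.array(survival)

    def G(t, left_limit=False):
        side = "left" if left_limit else "right"
        k = np.searchsorted(censor_times, t, side=side)
        return 1.0 if k == 0 else float(survival[k - 1])

    return G


def ipcw_weights(times, events, tau):
    """Binary labels (event by tau) and IPCW weights for right-censored data."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    G = censoring_survival(times, events)
    labels = ((times <= tau) & (events == 1)).astype(int)
    weights = np.zeros(len(times))
    for i, (t, d) in enumerate(zip(times, events)):
        if t > tau:                      # still under observation at tau
            weights[i] = 1.0 / G(tau)
        elif d == 1:                     # event observed before tau
            weights[i] = 1.0 / G(t, left_limit=True)
        # censored before tau: status at tau unknown, weight stays 0
    return labels, weights


# Hypothetical toy data: five subjects, time of interest tau = 4.5.
toy_labels, toy_weights = ipcw_weights(
    times=[2, 3, 4, 5, 6], events=[1, 0, 1, 0, 1], tau=4.5)
```

The weights can be passed to any learner that accepts per-sample weights. Subjects censored before `tau` receive weight zero, which is the loss of effective sample size the description refers to.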
Researcher: Jun Young Park, Department of Statistical Sciences, Faculty of Arts and Science, University of Toronto
Skills required:
Primary research location:
Research description:
Social determinants of health are increasingly acknowledged as key factors in achieving equitable, efficient, and patient-centered care. It is now well recognized that factors such as inadequate transportation, economic hardship, language barriers, employment security, and health literacy play a critical role in patients’ care experiences and health outcomes. Understanding the prevalence of these determinants and their impact on patient care is therefore essential to shaping health services and programs that are inclusive and responsive to community needs. This project intends to employ Natural Language Processing (NLP) approaches to uncover and help study references to social determinants of health arising from patient experience data. In collaboration with the Investigative Journalism Bureau at the University of Toronto, our lab has acquired 120,000 anonymous patient feedback comments from 45 Ontario hospitals, spanning 2015 to 2020. With support from the Institute for Pandemics (IfP), our lab has previously developed approaches to help mitigate selection biases in comments and analyze trends in patient experiences over time. This research project will build upon our previous work and support our efforts to develop analytic approaches that can help uncover barriers to health equity. Outputs will consist of a departmental presentation and a research report.
Researcher: Zahra Shakeri, Institute of Health Policy, Management, and Evaluation, Dalla Lana School of Public Health, University of Toronto
Skills required:
Primary research location:
Research description:
Perturbation of genetic interactions at the genomic and transcriptomic levels plays a critical role in promoting tumorigenesis. Therefore, a systematic understanding of these perturbations will likely provide novel insight into cancer biology and open new therapeutic avenues. As part of this project, we will leverage recent advances in graph-based machine learning methods and large-scale genomics and transcriptomics data to systematically characterize genetic perturbations in various cancer types.
Researcher: Sushant Kumar, Princess Margaret Cancer Centre, University Health Network
Skills required:
Primary research location:
Research description:
This project focuses on developing new quantization methods that represent the weights and activations of large language models (LLMs) as lower-precision numbers, to achieve faster training and inference while minimizing the reconstruction error. In 2024, several new methods, including EasyQuant and SqueezeLLM, were proposed for quantizing LLMs to reduce training and inference time under acceptably low reconstruction errors. While the existing methods provide remarkable performance, it is expected that a quantization algorithm that relies on mathematical optimization can exceed the performance of existing methods. In this project, the SUDS scholar will be supervised by a faculty member from the MIE department to complete a series of weekly assignments. These tasks will encompass activities such as data analysis, computational experiments, and the implementation and testing of new algorithm enhancements in a git environment. This project leverages cutting-edge techniques in mathematical optimization to advance the quantization of LLMs by reducing reconstruction error. The results of this summer research initiative will contribute to the development of a new algorithm for weight and activation quantization of large language models, thereby using data science to enhance a widely used AI technology.
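To make "quantization" and "reconstruction error" concrete, the sketch below applies plain round-to-nearest symmetric quantization to a weight array and measures the mean squared reconstruction error. This is a simple baseline chosen for illustration only; it is not EasyQuant, SqueezeLLM, or the optimization-based method the project aims to develop, and the function names and random toy weights are hypothetical.

```python
import numpy as np


def quantize_symmetric(weights, num_bits=8):
    """Map float weights to signed integer codes with a single shared scale."""
    weights = np.asarray(weights, dtype=float)
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for 8 bits
    max_abs = np.max(np.abs(weights))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int32)
    return codes, scale


def reconstruction_error(weights, codes, scale):
    """Mean squared error between the original and dequantized weights."""
    weights = np.asarray(weights, dtype=float)
    return float(np.mean((weights - codes * scale) ** 2))


# Hypothetical toy weights: compare 8-bit and 4-bit quantization.
rng = np.random.default_rng(0)
toy_weights = rng.normal(size=1000)
codes8, scale8 = quantize_symmetric(toy_weights, num_bits=8)
codes4, scale4 = quantize_symmetric(toy_weights, num_bits=4)
error8 = reconstruction_error(toy_weights, codes8, scale8)
error4 = reconstruction_error(toy_weights, codes4, scale4)
```

Fewer bits give a coarser grid and a larger error. An optimization-based method would try to choose the grid (or the codes) so that this error is as small as possible for a given bit width.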
Researcher: Samin Aref, Department of Mechanical and Industrial Engineering, Faculty of Applied Science and Engineering, University of Toronto
Skills required:
Primary research location:
Research description:
Our research aims to address the physical health challenges faced by skilled trades workers, particularly electricians, who are prone to repetitive strain injuries. This project will employ a mixed-method approach, gathering quantitative and qualitative data through a survey of 100 participants and semi-structured interviews with 30 participants. Data analysis will play a critical role in uncovering key insights. Survey responses will be analyzed statistically to assess the prevalence of physical injuries, mental health issues, and the effectiveness of workplace safety practices, while interviews will be analyzed qualitatively to identify patterns and themes regarding workplace conditions, ergonomic stressors, and the use of personal protective equipment (PPE). We plan to onboard a DSI student to support the data analysis, integrating advanced statistical tools and qualitative software to handle the complex nature of our dataset. The student will assist in synthesizing the results, contributing to a comprehensive understanding of both the physical and psychological health of apprentices, contractors, and employers in the skilled trades. This interdisciplinary approach will allow us to develop practical toolkits for injury prevention and mental health support, ultimately improving worker well-being and workplace productivity.
Researcher: Behdin Nowrouzi-Kia, Department of Occupational Science and Occupational Therapy, Temerty Faculty of Medicine, University of Toronto
Skills required:
Primary research location:
Students may also be interested in the Urban Data Science Corps Summer Internships offered by the School of Cities.