SUDS Student Call
May-August 2022

The Call for 2023 SUDS Scholars will be announced in mid-December 2022. The information on this page is for reference based on last year’s call.

Call for student researchers!

The Data Sciences Institute (DSI) welcomes carefully selected undergraduate students from across Canada for a rich data sciences research experience. Through the SUDS Research Program, undergraduate students who are interested in exploring data science as a career path have an exciting opportunity to engage in hands-on research supervised by DSI member researchers across the three UofT campuses.

The DSI is strongly committed to diversity within its community and especially welcomes applications from racialized persons/persons of colour, women, Indigenous/Aboriginal People of North America, persons with disabilities, LGBTQ2S+ persons, and others who may contribute to the further diversification of ideas.

Below are the SUDS research opportunities for May-August 2022. You can apply and rank your top 5 choices.

See here for information on eligibility, award value and duration, and SUDS programming.

Researcher Opportunities

A framework for integrating data analytics, chemical process simulation, and process engineering in the Chemical Engineering and Applied Chemistry curriculum.

Research description:

The University of Toronto is a pioneer in Education 4.0, advancing digital and online education. At all levels, our university facilitates skills development and builds students' competencies for Industry 4.0. In the Chemical Engineering and Applied Chemistry Department, basic data analytics concepts are applied at the undergraduate level in some courses, including laboratory practices and core courses. At the graduate level, our Master of Engineering with an emphasis in Analytics includes four core data analytics courses. Despite these efforts, we believe that a process data analytics concentration must be methodically introduced at the undergraduate level by offering a data analytics framework in different courses, along with a dedicated 400-level course, integrating these techniques with simulation and process engineering. The first task for this project includes having a summer student (i) identify data analytics integration opportunities in different undergraduate courses, in combination with steady-state process simulations and engineering heuristics; (ii) prioritize these opportunities by selecting relevant critical courses for our bachelor's program and other related disciplines; (iii) provide a preliminary framework and roadmap for this integration, enriching the current learning objectives of the selected courses; and (iv) design and solve two sample problems in Python/MATLAB for process engineering applications involving simulation and plant data.

Researcher: Daniela Galatro, Chemical Engineering and Applied Chemistry, Faculty of Applied Science & Engineering, U of T

Skills required:

Python, MATLAB, process simulation tools (in ASPEN HYSYS or ASPEN PLUS).

Primary research location:

University of Toronto, St George Campus and/or Remote

Advanced causal inference analysis in R: A software review

Research description:

Estimation of causal effects using observational data continues to grow in popularity in many fields. In recent years, advanced causal estimation methods, such as targeted maximum likelihood estimation, have been developed to handle complex study designs. However, these newer methods have been out of reach for practitioners due to their complexity and the difficulty of implementation. In this project, the student and supervisor will complete a review of three statistical software packages for implementing advanced causal inference methods in the R programming language: tmle, ltmle, and gfoRmula. We will apply each package to a pediatric rheumatology dataset and review the key features of each package. We will focus specifically on (i) supporting references and tutorial papers, (ii) required input data format, (iii) numeric and graphic outputs, (iv) model diagnostic and fitting features, and (v) a list of accommodated statistical models. Working together with the supervisor, the student will draft a report documenting the review results and provide recommendations to guide practitioners in choosing an appropriate package based on the planned causal analysis.

Researcher: Kuan Liu, Dalla Lana School of Public Health, U of T

Skills required:

A good foundation in applied statistics, a level of comfort programming in a language such as R or Python, and familiarity with causal inference.
Previous programming experience with R is highly desirable.
Prior knowledge of causal inference is not strictly necessary provided the student is willing to learn.

Primary research location:

Remote

AI classification of cancer patients into novel YAP-dependent subtypes

Research description:

Cancer complexity makes it difficult to treat. Precision medicine targets driver mutations, but many driver combinations in different cancer clones complicate this approach. An alternative is to deduce overarching rules that cancer cells obey. Taking this route, we simplified all cancers into binary classes based on the opposite expression and function of a single coactivator, YAP. In YAPoff cancers, YAP is epigenetically silenced because it induces growth arrest, in contrast to YAPon cancers, where YAP is essential for cell division. Cancers can jump binary classes to evade therapy, e.g., YAPon prostate or lung adenocarcinoma can switch to untreatable and lethal YAPoff neuroendocrine cancer. We aim to develop a machine learning classifier that can be used clinically to distinguish YAPoff vs YAPon cancers. For this, we will mine multiple transcriptome datasets and develop a variety of supervised machine learning models to distinguish between YAPoff vs YAPon cancers. We will further evaluate the ability of these classifiers to be robust to multiple sources of noise across different cancer classes and empirically test them on clinical samples to refine the ideal binary classification scheme. This work is critical to guide optimal therapies and exploit the unique vulnerabilities of YAPoff and YAPon cancers.

Researcher: Kieran Campbell, Lunenfeld-Tanenbaum Research Institute

Skills required:

Essential:
Ability to program in R or Python
Familiarity with the command line and file manipulation
Familiarity with version control (e.g. Git & GitHub)
Desirable:
Experience analyzing RNA-seq data
Experience with supervised machine learning in an appropriate framework

Primary research location:

Lunenfeld-Tanenbaum Research Institute and/or Remote

AI for Software Engineering & Software Engineering for AI

Research description:

[Option 1 - Vulnerability detection] Due to limited human resources, it is common for software vulnerabilities to be fixed but not reported to security advisories. In particular, some open-source communities intentionally postpone the release of such information to the public. Such undisclosed vulnerabilities pose security threats to users who run the vulnerable version of the software. In this project, we would like to explore this issue and conduct automated analysis of undisclosed software vulnerabilities. Our goal is to extract knowledge to analyze the intention behind code changes and identify hidden vulnerabilities.

[Option 2 - Fairness AI] ML systems have the potential to introduce biases and social inequities. Although there has been a recent surge of academic literature introducing various metrics and methods to assess the fairness of an ML system and to make ML systems fairer, "fairness" remains an abstract concept yet to be fully adopted in the real world. While various open-source fairness toolkits have been developed to help make fairness more accessible to industry practitioners, several gaps exist between the toolkits' capabilities and practitioner needs. The objective of this research is to conduct an in-depth investigation of existing open-source fairness toolkits and the associated challenges in operationalizing the notion of fairness with these toolkits.

Researcher: Shurui Zhou, Electrical and Computer Engineering, Faculty of Applied Science & Engineering, U of T

Skills required:

Self-motivated
Interested in software engineering research, NLP, machine learning, mining software repositories, software prototyping, source code analysis.

Primary research location:

University of Toronto, St George Campus and/or Remote

Algorithmic Reparation: Analyzing and Improving the Bias-Fixing Algorithms in Data Science

Research description:

To address the escalating problem of biases in data science, researchers have come up with a wide range of statistical tools for identifying, predicting, and fixing biases in a dataset. However, such tools and techniques are also often based on biased premises, and hence their impact remains limited. Building on the emerging scholarship in fair AI/ML, this project will develop a set of bias-fixing algorithms, analyze and report their performances in different contexts, and develop tools and techniques to improve their performance in real-life applications of data science.

Researcher: Syed Ishtiaque Ahmed, Computer Science, Faculty of Arts & Science, U of T

Skills required:

Basic understanding of AI/ML
Advanced skills in reading and analyzing critical social science papers
Good writing skills
Basic programming skills are a plus

Primary research location:

Remote

Algorithms and Statistical Tests for Real-World Adaptive Experiments to Enhance and Personalize Technology Interventions for Education, Mental/Physical Health, & Behaviour Change

Research description:

The student will start by replicating the work described in the abstract below and at tiny.cc/iadaptiveexperiments, and will then be involved in modifying/developing such algorithms for adaptive experimentation, and/or generalizing statistical hypothesis tests (or Bayesian analyses) for analyzing such data.

Abstract: "Multi-armed bandit algorithms like Thompson Sampling can be used to conduct adaptive experiments, using data to progressively assign more participants to more effective arms. Such assignment strategies increase the risk of statistical hypothesis tests identifying a difference between arms when there is not one. We explore algorithms that combine the benefits of uniform randomization for statistical analysis, with the benefits of reward maximization achieved by Thompson Sampling (TS). TS PostDiff (Posterior Probability of Difference) takes a Bayesian approach to mixing TS and UR: the probability a participant is assigned using UR allocation is the posterior probability that the difference between two arms is 'small' (below a certain threshold), allowing for more UR exploration when there is little or no reward to be gained. We find that TS PostDiff method performs well across multiple effect sizes, and thus does not require tuning based on a guess for the true effect size."

Recent Nobel prizes in economics were awarded for experimentation and causal inference. Adaptive experimentation is a new field where data from randomized experiments is rapidly used to help future users. Adaptive experimentation makes apps like Instagram incredibly engaging and profitable; how can we use those techniques to enhance and personalize students' education and anyone's mental health?
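For applicants unfamiliar with the idea, the TS PostDiff rule quoted in the abstract can be sketched roughly as follows. This is an illustrative sketch, not the researchers' implementation: it assumes two arms with binary (success/failure) rewards, uniform Beta(1, 1) priors, and a single posterior draw to decide between uniform random (UR) and Thompson Sampling (TS) allocation. The function names `ts_postdiff_choose` and `simulate`, and the example reward rates, are hypothetical.

```python
import random


def ts_postdiff_choose(successes, failures, threshold, rng):
    """Choose arm 0 or 1 for the next participant; also report the mode used."""
    # Assumed model: Beta(1 + successes, 1 + failures) posterior for each arm.
    draw = [rng.betavariate(1 + successes[a], 1 + failures[a]) for a in (0, 1)]
    # A single draw lands below the threshold with probability equal to the
    # posterior probability that the arms differ by less than the threshold.
    if abs(draw[0] - draw[1]) < threshold:
        return rng.randrange(2), "UR"
    # Otherwise do a standard Thompson Sampling step with a fresh draw.
    fresh = [rng.betavariate(1 + successes[a], 1 + failures[a]) for a in (0, 1)]
    return (0 if fresh[0] >= fresh[1] else 1), "TS"


def simulate(true_rates, n_participants, threshold, seed=0):
    """Run one simulated adaptive experiment with made-up reward rates."""
    rng = random.Random(seed)
    successes, failures = [0, 0], [0, 0]
    ur_count = 0
    for _ in range(n_participants):
        arm, mode = ts_postdiff_choose(successes, failures, threshold, rng)
        if mode == "UR":
            ur_count += 1
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures, ur_count


if __name__ == "__main__":
    s, f, ur = simulate(true_rates=[0.5, 0.6], n_participants=500, threshold=0.1)
    print("successes:", s, "failures:", f, "UR assignments:", ur)
```

A threshold of 0 reduces the sketch to plain Thompson Sampling, while a threshold larger than 1 makes every assignment uniform random; values in between trade reward against statistical power.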

Researcher: Joseph Jay Williams, Computer Science, Faculty of Arts & Science, U of T

Skills required:

We can teach most skills to students who excel at communication, documentation, and proactivity.
It is helpful if students have a background in hypothesis testing (e.g. t-test, testing coefficients in a regression model), have analyzed a real dataset, are meticulous in using checklists, triple-checking analyses, and explaining the process to others, and have run simulations before using R/Python.

Primary research location:

Remote

Analysis of air pollution data from sites across Canada

Research description:

The student will help identify associations within the time series of air pollution data collected at sites across Canada. This will include data from SOCAAR's sites, government sites, and inexpensive air quality monitors. One goal will be to resolve changes in emissions due to initiatives to promote decarbonization as part of Canada's climate change plans.

Researcher: Greg Evans, Chemical Engineering and Applied Chemistry, Faculty of Applied Science & Engineering, UofT

Skills required:

Familiarity with analysis of time series data, querying SQL databases, correlation analysis, and application of other statistical techniques.
Familiarity with aspects of air pollution and climate change is desirable.

Primary research location:

University of Toronto, St George Campus and/or Remote

Research description:

This project involves using social network data to evaluate the performance of existing algorithms for a computationally intensive graph optimization task. Community detection is the process of inductively identifying groups within a networked system. There are many alternative methods and algorithms for approaching this computational problem, which appears frequently in exploratory data analysis using networks. One family of algorithms attempts to find communities by maximizing modularity, a network-level quantity indicating how well the inferred communities are internally cohesive and mutually separated. Despite the popularity of these methods, which are used in no fewer than tens of thousands of data science projects and peer-reviewed studies, there is no systematic evaluation of the performance of different modularity maximization algorithms. This project provides a comprehensive evaluation of existing community detection algorithms using publicly available network data and accessible implementations of modularity-based community detection algorithms. The output of this summer research project contributes to the development of a reliable, open-source, and reproducible algorithm for robust and theoretically grounded detection of communities, thereby improving upon a widely used computational tool for data-driven analysis.
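As a rough illustration of the modularity quantity mentioned above, the sketch below scores a given partition of a small undirected, unweighted graph using the standard definition: for each community, the fraction of edges inside it minus the squared fraction of edge endpoints attached to it. It is not the project's code; the function name `modularity`, the toy edge list, and the assumption of no self-loops are all hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, assumed to contain no self-loops.
    community_of: dict mapping each node to a community label.
    """
    m = len(edges)
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Toy graph: two triangles joined by a single edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}

if __name__ == "__main__":
    print(modularity(toy_edges, two_groups))  # 5/14, about 0.357
    print(modularity(toy_edges, one_group))   # 0.0
```

Modularity maximization algorithms search over partitions for the highest such score; the sketch only evaluates a partition that is already given.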

Researcher: Samin Aref, Mechanical & Industrial Engineering, Faculty of Applied Science & Engineering, UofT

Skills required:

Background and experience in Python programming, Python libraries for data science (pandas, matplotlib, NumPy)
Other desired skills (to have or acquire during the project): discrete optimization, graph theory, and network science
Other desired skills (to have or acquire during the project): Python libraries for large scale/network data analysis (networkX, igraph, graph-tool, dask)

Primary research location:

Remote

Applications of Machine Learning in Health Care

Research description:

The student will work as part of a larger project team to assemble data, build ML models, and validate them in a healthcare setting. Ongoing project area descriptions are available on our lab website including extension and validation of a novel image segmentation technology, development of ML models for wearable analysis, and construction and validation of novel healthcare datasets.

Researcher: Chris McIntosh, Peter Munk Cardiac Centre, UHN

Skills required:

Skills are sought in two or more of the following areas: Python (PyTorch, pandas, and scikit-learn), machine learning, medical imaging, and wearable technologies.

Primary research location:

University Health Network and/or Remote

Association of genetic variants of unknown significance with clinical features of pediatric heart disease

Research description:

We have performed whole-genome sequencing (WGS) on over two thousand samples collected from children with heart disease, including cardiomyopathies and congenital heart defects. Additionally, we have access to thousands of WGS samples derived from healthy control populations. Using these data, the student will use software tools and databases to categorize genetic variants according to whether they're likely disease-causing, likely benign, or have an unknown significance (VUS).

By comparing cases against healthy controls, they will then identify genes that are enriched for VUS in individuals with heart disease, as well as how specific genes and types of genetic variants are associated with distinct phenotypes. Moreover, for those cases with heart disease that harbor multiple putatively disease-causing genetic variants, they will identify whether the same sets of genes tend to be simultaneously affected in multiple individuals. This work will be performed on the SickKids high-performance compute cluster, and they will be supervised and mentored by a senior bioinformatician who will guide them throughout their project with additional mentorship by the PI.

Researcher: Seema Mital, Cardiovascular Research, The Hospital for Sick Children

Skills required:

Experience in object-oriented programming, familiarity with Unix environments, and basic knowledge of genetics.
Experience with genomics software and resources is an asset.
The project will require independent work, so critical thinking and comfort in asking others for help will be key to their success.

Primary research location:

The Hospital for Sick Children and/or Remote

Association of repeats in the SLC9A3 region with lung phenotypes in cystic fibrosis

Research description:

The largest genetic association study of lung function in cystic fibrosis (CF) identified five locations in the genome that were associated with altered lung function. One of these locations is on human chromosome 5, near the gene SLC9A3. This region includes many sections of highly repetitive sequence that vary in length across individuals (variable number of tandem repeats, or VNTRs). These VNTRs are difficult to study using standard short-read sequencing, since such data often cannot reveal how many repeat units are in each VNTR.

Using newer long-read sequencing technology (58 PacBio samples), we identified the exact location and length of 49 VNTRs in this region. We have also shown that two VNTRs are associated with altered gene expression in nasal epithelial cells of patients with CF. However, due to the limited statistical power afforded by the sample size, we were not able to investigate the association of the two VNTRs with lung phenotypes (i.e. lung function and Pseudomonas aeruginosa lung infection status). An additional 500 subjects will be sequenced using PacBio high-fidelity (HiFi) long-read technology. The student will use this data to investigate the association of the two VNTRs with lung phenotypes in CF.

Researcher: Lisa Strug, The Hospital for Sick Children

Skills required:

Basic knowledge of human genetics
Basic knowledge of statistics (e.g. linear/logistic regression)
Familiarity with coding or statistical software (e.g. Python, R)
Familiarity with high-performance computing is an asset
Familiarity with bash scripting is an asset
Familiarity with genomic data, especially sequencing data, is an asset

Primary research location:

Peter Gilgan Centre for Research and Learning and/or Remote

Automated pain detection (computer vision)

Research description:

Applying computer vision and machine learning techniques to detect facial expressions of pain.

Researcher: Babaak Taati, KITE Research Institute | Toronto Rehab - UHN

Skills required:

Computer vision, machine learning, deep learning
Prior experience with facial landmark tracking, facial expression analysis, or human pose tracking is a plus, but not required
Python, PyTorch, scikit-learn, (TensorBoard), (W&B)

Primary research location:

KITE Research Institute, University Health Network

Bioinformatics to analyze single cell RNA seq datasets of human fetal lung tissue

Research description:

The student will use bioinformatics and work closely with a Ph.D. student and the Centre for Computational Medicine (SickKids) to analyze single cell RNA seq datasets of human fetal lung tissues. The student will be involved in creating the single cell atlas of the early human fetal lungs.

Researcher: Amy Wong, Developmental & Stem Cell Biology, SKH

Skills required:

Fluent in a computational language and interested in using bioinformatics to identify and inform biological events.

Primary research location:

Hospital for Sick Children and/or Remote

Bioinformatics to identify unique cell surface markers of various progenitor cell types

Research description:

The student will use bioinformatics and work closely with a Senior Project Coordinator in the lab to identify unique cell surface markers of various progenitor cell types from scRNAseq datasets of human pluripotent stem cell-derived lung cultures at different differentiation stages. Hits will be validated by protein expression and FACS isolation of these progenitors.

Researcher: Amy Wong, Developmental & Stem Cell Biology, SKH

Skills required:

Fluent in a computational language and interested in using bioinformatics to identify and inform biological events.

Primary research location:

Hospital for Sick Children and/or Remote

Building an online speech processing toolbox

Research description:

Starting from an existing prototype, the student will aid in the development of an online cloud computing service for speech researchers. The backend will include state-of-the-art neural-network speech processing models developed in the latest industrial research (for example, the revolutionary wav2vec 2.0). The web frontend will allow users to quickly and easily use these models to process audio and get output. These new models are increasingly important for understanding how human speech perception works, as they allow us to measure high-level acoustic properties of speech signals, and to get never-before-seen levels of recognition accuracy. It is critical to unlock the power of these models by making them available to researchers in the speech sciences as an open, accessible toolbox. The student will contribute to building this toolbox, and in so doing learn the basics of speech processing and hone their skills in web development.

Researcher: Ewan Dunbar, French, Faculty of Arts & Science, UofT

Skills required:

Some existing expertise in web development with toolkits such as Django, FastAPI, Flask, or others, preferably in Python.
Familiarity with speech, linguistics, and/or neural networks would be an asset.
Serious enthusiasm to dive into one or all of these subjects is a must.

Primary research location:

University of Toronto, St George Campus and/or Remote

Can we promote empathy across ideological divides?

Research description:

Individuals often avoid contact with those who hold different moral views from their own. A growing moral divide gives rise to political polarization, racial tensions, and religious conflict, making empathizing with someone with different moral values feel impossible. As a result, moral empathy gaps, in which individuals experience a reduced empathy with those who hold different moral values, have emerged. These moral empathy gaps breed animosity and exacerbate tensions between individuals and groups. Yet despite these pernicious consequences, moral empathy gaps have yet to be empirically examined.

To close moral empathy gaps, this project will investigate the self-transcendent emotion of awe. Awe is an emotion that occurs in response to an object or person with such extraordinary qualities as to feel incomprehensible. Awe is particularly well suited for motivating empathy. First, awe promotes humility, which un-anchors individuals from their own perspective, motivating them to better understand others. Second, awe encourages openness and curiosity, which should help individuals consider new and challenging perspectives. Third, awe promotes a sense of connection and common humanity with others, which should render the moral divisions less salient, making empathy less difficult. Therefore, we will test whether feeling awe will promote empathy, closing moral empathy gaps.

Researcher: Jennifer Stellar, Psychology, University of Toronto Mississauga, UofT

Skills required:

Students must have a basic understanding of psychological experimental methods (e.g., what is a control condition?).
Any background in data science is helpful (previous experience in psychological research is preferred).
Ideally, candidates will also be proficient in working with data; this could include using Excel, SPSS, or R.

Primary research location:

University of Toronto Mississauga and/or Remote

Content analysis of over 150 years of news and articles

Research description:

We have acquired textual content from news articles, periodicals, and books published over a span of 150 years that come from different geographical locations across the globe. The original data, together with its metadata, is stored on servers hosted by the University of Toronto libraries with some initial indexing performed on it using Elasticsearch. The purpose of the project is to perform data cleaning and natural language processing of the text to reveal trends. Such trends are related to historical and cultural events and how they have been perceived locally (e.g., in Canada, the US, etc.) as well as globally (e.g., across different continents). We also intend to discover issues related to changes in the coverage (and bias in the coverage) related to gender, technology, economics, and politics over time and investigate how the patterns relate to important milestones that took place in the past century. The successful candidate will acquire useful experience in analyzing large, digitized, text corpora contributing to the area often called "Culturomics" by using techniques such as topic and language modeling. Overall, the deliverables from this project will be used to advance the academic literature in the area and to inform policy.

Researcher: Periklis Andritos, Faculty of Information, UofT

Skills required:

Knowledge of Python.
Knowledge of database systems, especially setting up new databases and querying existing ones.
Basic knowledge of Natural Language Processing.
Knowledge of basic statistics.
Knowledge of Elasticsearch and Kibana will be an asset.
Familiarity with visualizing results using R or Python or related tools.

Primary research location:

Remote

Data Science for White Shark Conservation

Research description:

The ability to remotely collect time series of white shark movement provides an amazing opportunity to understand where these sharks go and what they do. However, it can be difficult to identify what movement patterns in the time series correspond to meaningful shark behaviours, such as foraging/hunting, without witnessing the behaviour first-hand. To that end, researchers at the Atlantic White Shark Conservancy have collected both movement data and video data from multiple sharks, with the video data coming from a camera placed on a white shark's dorsal fin that shows what the shark is doing underwater. In this way, we can now partially label the time series of movement data with specific behavioural patterns observed in the video. In some parts of the video, the shark simply swims along the coastline while in others it can be seen to hunt. Using this novel data, we can build statistical methods to accurately classify time series of white shark movement to better understand their behaviours (such as under what conditions they hunt/rest/swim in deep waters) and use this information to advance both their conservation in the Atlantic and also to keep humans safe.

Researcher: Vianey Leos Barajas, Statistical Sciences, Faculty of Arts & Science, UofT

Skills required:

Some knowledge of the software R
Familiarity with Bayesian statistics

Primary research location:

University of Toronto, St George Campus and/or Remote

Developing driver drowsiness detection algorithms using driving simulator dataset

Research description:

Driver drowsiness can lead to inattention to the roadway and contribute to fatal crashes. Although methods like drinking coffee can delay the feeling of sleepiness in the short term, the only solution to sleepiness is to sleep. To inform drivers and to guide them into taking safer actions such as taking frequent rests, driver drowsiness detection algorithms have been developed. These algorithms use a variety of measures such as driver behavioral data (e.g., yawning), physiological data (e.g., heart rate), and vehicle kinematics (e.g., vehicle speed). However, recent reviews show that it remains unclear which measures and machine learning models perform best, and that more studies are needed in the field. We are currently conducting a driving simulator study and collecting a large variety of driver behavioral and physiological data and vehicle kinematics data. The proposed project aims to analyze this large dataset and to build and test machine learning algorithms that detect driver drowsiness. The student will assist a Ph.D. student in completing the following tasks: reviewing the literature, cleaning and analyzing data, and building driver drowsiness detection algorithms.

Researcher: Birsen Donmez, Mechanical & Industrial Engineering, Faculty of Applied Science & Engineering, UofT

Skills required:

Coding skills (Python and MATLAB preferred)
Knowledge in statistics and machine learning
Good communication skills

Primary research location:

University of Toronto, St George Campus and/or Remote

Developing evidence to help reduce impactful drug shortages

Research description:

Leveraging CIHI national drug data from 2017 to 2021, the student will work to develop a risk estimator for clinically significant drug shortages: We will use descriptive statistics to explore the characteristics of all drugs with announced shortages identified in Canada. We will develop a model that predicts clinically important drug shortages. We will compare the performance of several approaches to determine which model best discriminates between drugs with high and low (i.e., < 50) shortage intensity scores. Specifically, we will compare the performance of multivariable regularized logistic regression, random forest decision trees and stochastic gradient boosting decision trees. The final goal is to develop a risk estimator based on the best performing model that calculates the risk of a clinically relevant drug shortage and classifies the level of risk as low, moderate or high. This model will be used to develop a future national at-risk medicine list.

Researcher: Mina Tadrous, Leslie Dan Faculty of Pharmacy, UofT

Skills required:

Experience with claims data
Statistical modelling
A passion for data and healthcare

Primary research location:

Remote

Disparities in climate-induced health outcomes in the Greater Toronto Area

Research description:

The effects of climate change have been, and will continue to be, concentrated on those who are already disadvantaged. I am interested in understanding how fluctuations in climate and climate events have differentially impacted the health of residents in the Greater Toronto Area. This research combines ICES data on deaths and hospitalizations with data on weather and climate across the GTA to better understand the inequities faced by more disadvantaged geographic areas and demographic groups, especially on the basis of indigeneity, migration status, and age. This project will combine these two datasets to measure the association between temperature and health and mortality outcomes. I will develop statistical models to 1) measure the extent of seasonality in deaths and hospitalizations; 2) estimate excess mortality and hospitalizations related to abnormal climate events (periods of extreme heat or cold); and 3) investigate how seasonality and excess events differ across indigeneity, migration status and age. Models will be developed within a Bayesian hierarchical framework, drawing on cutting-edge methods used in demography and epidemiology. This research will contribute to best practice in data cleaning by establishing a suite of tests we expect the data to pass and making this available to others through an R package.

Researcher: Rohan Alexander, Faculty of Information, UofT

Skills required:

Demonstrated experience with: R, GitHub, written communication
Experience with the following would be advantageous: Developing R packages, Developing Shiny apps, Building Bayesian hierarchical models using Stan
The student will be trained to develop skills they don't have e.g. building packages/models/apps.

Primary research location:

University of Toronto, St George Campus and/or Remote

Diversification and evolution of ubiquitin across the Kingdom Fungi

Research description:

Fungal organisms have been evolving in a separate kingdom since their divergence from animals over one billion years ago. Modern fungal descendants have established diverse relationships with many biological entities and have adapted well to various lifestyles, such as commensals, parasites, and mutualistic symbionts. Cryptic interactions between fungi and eukaryotic hosts are ongoing at both genetic and organismal levels. Horizontal gene transfers represent an extreme example. In 2016, a comparative genomic study identified a poly-ubiquitin coding gene in insect gut-dwelling fungi that was transferred from the mosquito host. Since then, poly-ubiquitin and its roles in symbiotic systems have received considerable research attention. Ubiquitin is universally present in eukaryotes, where it is widely known as a posttranslational tag for the hydrolytic destruction of proteins. Ubiquitin and ubiquitin-like proteins have also been found to play crucial roles in DNA transcription, autophagy, and inflammatory responses during pathogen defence by the host. The student working on this project will help investigate the diversity and evolution of fungal ubiquitin across the entire kingdom using hundreds of whole-genome datasets, collaborating with multiple internationally renowned labs and scientists. A high-impact research report will be prepared and targeted for publication at the end of the project.

Researcher: Yan Wang, Biological Sciences, University of Toronto Scarborough, UofT

Skills required:

Minimum requirements: Basic programming skills in Linux, Python, and/or R; effective communication skills.
Preferred qualifications: strong interest in comparative genomics and host-microbe interactions, and competency in writing and public speaking.

Primary research location:

Remote

Evaluating the contribution of dissolved inorganic carbon and alkalinity from pyrite oxidation to the marine carbon budget

Research description:

The student will work with an existing biogeochemical ocean model to evaluate how sea-level changes affect pyrite formation/dissolution in shallow marine sediments. The student is expected to build a variety of scenarios that broadly resemble the conditions during the last three ice ages. Subsequently, they will use these scenarios as input to the biogeochemical ocean model to evaluate how pyrite oxidation/formation will affect atmospheric pCO2. It is expected that the results of this study will be presented at the American Geophysical Union Fall Meeting (December 2022). If in-person travel is possible, conference travel will be funded.

Researcher: Ulrich Wortmann, Earth Sciences, Faculty of Arts & Science, UofT

Skills required:

Some background in chemical oceanography
Python coding skills
Experience with numerical modeling

Primary research location:

University of Toronto, St George Campus and/or Remote

Exploring transition metal dichalcogenide catalysts for selective nitric oxide reduction to ammonia

Research description:

Transition metal dichalcogenide (TMD) catalysts have shown great potential as an alternative to noble metal catalysts due to their abundant reserves and unique physical/chemical properties, especially their large specific surface area. However, the inert surface of TMDs impedes their application. In this project, through density functional theory (DFT) calculations, we introduce defects to activate the inert surface for catalytic nitric oxide reduction. Different defects on TMD catalysts will be considered to build the database used to train the machine learning (ML) model. Based on the proposed ML model, we analyze the influence of defects on catalytic selectivity and activity and design the optimal TMD catalyst for nitric oxide reduction. The student will be responsible for building this database by DFT calculations.

Researcher: Chandra Veer Singh, Materials Science and Engineering, Faculty of Applied Science & Engineering, UofT

Skills required:

Python language
Basic understanding of materials science

Primary research location:

University of Toronto, St George Campus and/or Remote

Factors Impacting Sense of Belonging in an Undergraduate Computing Program

Research description:

Prior work at the University of Toronto has demonstrated a complex relationship between a sense of belonging in a discipline and attributes such as gender identity, ethnic ancestry, international or domestic status, and prior experience. For example, in one survey conducted at UofT, domestic students of underrepresented groups reported much higher rates of impostor phenomenon experiences than international students of the same groups. Women also reported more impostor experiences, and students with identities at the intersection of these two groups experienced heightened challenges to developing a sense of belonging in the community. We propose to combine multiple datasets – surveys of impostor experiences, experiences in computing, and student perspectives on academic challenges – to build a more complete picture of the challenges to belonging experienced by students in computing programs at UofT and to determine if a framing intervention and community supports deployed in Fall 2021 have had a positive impact on a sense of belonging. The proposed student researcher would be responsible for collecting and tagging data from multiple sources to (a) provide a longitudinal, multi-perspective view of student experiences in the program and (b) build a model of the factors that influence a sense of belonging in the program.

Researcher: Andrew Petersen, Mathematical & Computational Sciences, University of Toronto Mississauga, UofT

Skills required:

Familiarity with multidimensional statistical modeling
Experience performing qualitative coding and thematic analysis (to convert qualitative data to a form amenable for analysis)
Demonstrated experience locating and synthesizing literature in computing education and educational data mining

Primary research location:

Remote

Fighting polarization with diverse, interpretable recommendations

Research description:

Modern recommender systems can perform very well by recommending items that are similar to the items a user has consumed in the past. While this ability is extremely useful, it operates as a black box, and most recommender systems give no feedback on how exactly an item is similar to a user's past items. Is it similar in terms of the age of people who consume the item? Or the geographical location? With current recommender systems, if a user wishes to broaden their taste by, for example, considering items older people are more likely to consume, their only option is to seek new items outside the recommender system. In this project, you will work to provide interpretable controls for recommender systems by taking advantage of social dimensions in behavioural embeddings to quantify several social characteristics.

Researcher: Ashton Anderson, Computer & Mathematical Sciences, University of Toronto Scarborough, UofT

Skills required:

Solid numerical Python programming skills
Data analysis
Applied machine learning

Primary research location:

University of Toronto, Scarborough and/or Remote

Generating Intercensal Estimates of Key Socioeconomic and Labour Force Markers for Small Areas to Support Health Equity Research

Research description:

The goal of this research is to apply and compare different approaches to using Statistics Canada data from the 2016 and 2021 censuses, as well as data from monthly Labour Force Surveys, to provide small area and small sub-population intercensal estimates. The research will involve managing and curating these data sources and then comparing three distinct techniques for producing intercensal estimates: traditional linear extrapolation; simulated annealing, a sophisticated algorithmic technique; and Bayesian methods. The objectives are to provide: (1) comparisons of the different approaches to the intercensal estimation of key equity-related variables; and (2) a well-documented data resource for research containing intercensal estimates of key Statistics Canada measures of material and social deprivation. The impact of the COVID-19 pandemic and the public health containment strategies on measures of SES, employment, and health inequities has highlighted the importance of having accurate intercensal data on key measures that drive health inequities. This project will advance research on health inequities in Canada and advance research on intercensal estimation methods internationally. The research involves important aspects of data management and curation as well as the application of a range of techniques for providing estimates and methods that can be used to compare different estimation techniques.

Researcher: Geoff Anderson, Dalla Lana School of Public Health, UofT

Skills required:

The student would be expected to be involved in developing and applying strategies for data management and curation and would be expected to play a key role in coding and documenting the estimation methods.
Skills in R and an understanding of biostatistics or mathematical statistics are essential.

Primary research location:

University of Toronto, St George Campus and/or Remote

Gene regulatory networks: bridging the gap between biochemical and statistical models

Research description:

Gene regulatory networks are extremely complex. As a result, mechanistic models of interactions between components have to make a large number of assumptions based on guesswork. This makes mechanistic models unreliable tools to test individual hypotheses in complex networks. Existing data science approaches focus on statistical models because they are easy to rigorously analyze for a subset of variables in a sea of unspecified interactions. However, their results are often difficult to translate into mechanistic interactions. Your work will focus on combining these two contrasting approaches by utilizing our recently established universal balance theorems to characterize stochastic fluctuations of gene expression patterns in incompletely specified regulatory networks. Specifying some features of such systems while leaving everything else unspecified has allowed us to translate individual assumptions into rigorous experimental tests despite dealing with very large and complex networks. Your work will help us to establish whether this novel approach has enough discriminatory power to reconstruct entire networks from observed cell-to-cell variability data. You will apply existing data science approaches to our invariant relations that characterize stochastic fluctuations within complex interaction networks. Your research will involve statistical approaches, numerical simulations of stochastic processes, as well as analytical work based on master equations.

Researcher: Andreas Hilfinger, Chemical and Physical Sciences, University of Toronto Mississauga, UofT

Skills required:

Strong interest in the natural sciences and mechanistic modelling.
Programming competency.
Familiar and comfortable with differential equations.
Prior experience with stochastic processes is advantageous but not required.
You will be expected to carry out independent research, keep detailed written records, and present your work within our research group.

Primary research location:

University of Toronto, St George Campus and Mississauga/Remote

Harnessing Data to Visualize and Mitigate Urban Water Inequality in India's Megacities

Research description:

Due to rapid urbanization and inadequate infrastructure, most Indian pipe networks provide water for less than four hours per day, affecting 390 million people. These intermittently operated water utilities divide scarce water between neighbourhoods according to hundreds (or thousands) of different schedules. Water supply schedules are complex and are often reported in inaccessibly dense (text-based) formats, obscuring vast inequality (unevenness) in water access. This project aims to make intermittent water supply schedules easy to understand so that regulators and residents can advocate for better and more equitable access to water. Specifically, the summer student will create bespoke visualizations of water supply schedule data from two or more of India's megacities. Visuals will be developed iteratively in close consultation with water utilities and regulators in India. Data from two megacities has already been gathered and cleaned. The student would also assist in gathering data from at least one additional megacity. The proposed project has the potential to improve the transparency and equality of a mode of water supply that affects one billion people globally.

Researcher: David Meyer, Civil Engineering, Faculty of Applied Science & Engineering, UofT

Skills required:

Enthusiasm, creativity, independence, and excellence.
A passion for working with data to benefit global development.
Competency with either Python or R is preferred.
Experience working with large and messy datasets is a significant bonus.
Experience working with water utilities, in India, and/or in international development would be of great benefit.

Primary research location:

University of Toronto, St George Campus and/or Remote

Identifying Barriers to Student Academic Success through Machine Learning Applications

Research description:

The Student Academic Success (SAS) Initiative is a University of Toronto tri-campus initiative that seeks to identify and address divisional data needs related to undergraduate academic success throughout the subject program lifecycle (e.g., admission, retention and completion of Specialist and Major programs). This initiative is currently co-developing datamarts and tools to leverage these data (e.g., Tableau dashboards). Subject program admissions represent an important milestone for undergraduate students, with strong implications for their long-term academic success. Several important questions have been identified, such as: If a student is not accepted into their intended program(s) of study, how does this impact their subsequent academic pathways/success? Can we leverage existing information to inform students about pathways to other related programs? This represents the first step of our work with these types of data. To contextualize student pathways and common barriers, the SUDS student will assist with the development of an analytic framework using social media data to identify possible student success barriers. Social media platforms provide a unique opportunity to develop and implement an unsupervised learning model to identify possible barriers and/or trends based on social media posts. This information, in turn, may help contextualize some of our current exploratory analysis using the SAS datamart.

Researcher: Susan McCahan, Vice-Provost, Innovations in Undergraduate Education, UofT

Skills required:

Web-scraping
Unsupervised learning models
Natural Language Processing (NLP) techniques
Data visualization techniques

Primary research location:

University of Toronto, St George Campus and/or Remote

Implementation of serological and molecular tools to inform COVID-19 patient management: protocol for the GENCOV prospective cohort study

Research description:

The research will include analysis of genome data from the GENCOV and HostSeq datasets to look at the relationship between genotypes and phenotypes. We intend to examine the frequency of all types of genetic variation, from benign to pathogenic, in populations across Canada in order to establish or refine genetic associations. This includes but is not limited to single nucleotide variants, copy number alterations, large chromosomal rearrangements such as inversions and translocations, repeat expansions, polygenic risk scores, the human leukocyte antigen regions, blood group genotypes, and ancestry markers. We will perform association studies to determine if there is a relationship between genotype and phenotypes, not only as they relate to COVID-19.

Researcher: Jordan Lerner-Ellis, Lunenfeld-Tanenbaum Research Institute

Skills required:

Proficiency in bioinformatics
PLINK
Statistical software
Programming in R
Scripting in Python
Experience with NGS development pipelines and data files e.g. VCF, BAM, FASTQ

Primary research location:

Lunenfeld-Tanenbaum Research Institute and/or Remote

Interactive approaches to automatic source code summarization using deep learning

Research description:

Automatic source code summarization is the task of generating a readable summary that describes the functionality of the code in natural language. In recent years, the use of deep learning-based approaches has led to significant improvement in the performance of automatic code summarization, e.g., using Transformers and Graph Neural Networks. However, the performance is still far from optimal, and developers who are unsatisfied with a given summary are not able to provide feedback or additional information that can be used to refine the output. In this research project, the goal is to investigate ways in which additional input from the developer can further improve the performance of automatic code summarization. Specifically, the main tasks in the project are: 1) investigating existing failures of state-of-the-art source code summarization solutions; 2) developing new computational approaches and interactive schemes for incorporating developer input or feedback in order to improve the performance of deep learning-based solutions for source code summarization; and 3) evaluating the performance of the new approaches using existing large code summarization datasets.

Researcher: Eldan Cohen, Mechanical and Industrial Engineering, Faculty of Applied Science & Engineering, UofT

Skills required:

Knowledge of deep learning (relevant topics: Transformers, RNNs, deep generative models such as VAEs/GANs)
Experience coding in a deep learning framework (e.g., PyTorch, TensorFlow, Keras, MXNet)

Primary research location:

University of Toronto, St George Campus and/or Remote

Investigating Data Features for Reproducibility of Robust Educational Data Models

Research description:

Several metareviews in the area of educational data mining have highlighted challenges to reproducing results in the field. Differences in educational context and student population make it difficult to determine whether or not a particular result is generalizable and transferable. I propose to develop a standard for processing and reporting data from educational discussion/Q&A boards to support the comparison of results between sites and to enable multi-institutional studies of student behaviour on Q&A boards. The proposed student will investigate literature in the area of modelling data from discussion/Q&A boards to identify features of the data that are important to interpreting the data. The goal is to create a robust data pipeline for collecting, cleaning, storing, and packaging data from a single source. In this case, we will build tools to collect and package data from Piazza discussion/Q&A boards, as Piazza is used at numerous institutions internationally. In addition, the student will collect multiple datasets from UofT to produce a baseline of "standard student usage" for comparison.

Researcher: Michael Liut, Mathematical & Computational Sciences, University of Toronto Mississauga, UofT

Skills required:

Experience using, and preferably managing, discussion/Q&A boards (prioritizing Piazza experience)
Strong programmer
Experience using databases.
The ideal candidate will have a strong statistics and research background.

Primary research location:

Remote

Investigating how the brain encodes memory

Research description:

This research lab is focused on understanding how the brain encodes memories. To this end, we study memory formation and recall in mice while imaging the activity of individual neurons using genetically-encoded activity markers. This is a powerful approach that allows us to "see" a memory being made in a behaving mouse. We image the activity of thousands of neurons using miniature microscopes built in-house. Once the data has been collected, we use advanced math and statistics to analyze the activity patterns of neurons to extract general principles about memory. The summer student will be involved in all aspects of this project. The student will help collect the imaging data and then, with the mentoring of senior graduate students and postdocs in the lab, help analyze this interesting dataset using a variety of different techniques.

Researcher: Sheena Josselyn, NMH Research Institute, SKH

Skills required:

A good understanding of programming in Python and of statistics.
A background in ML algorithms is also helpful.

Primary research location:

The Hospital for Sick Children Research Institute/Remote

Learning analytics for improved student learning

Research description:

Learning analytics involves the collection and analysis of student and course data, including interactions with educational technology such as a learning management system (LMS), for the purposes of better understanding and optimizing student learning and learning environments. This research project will involve identifying, visualizing and analyzing LMS data from the University of Toronto to investigate how the data might effectively be used to identify student patterns of activity and their association with student success and how it might inform practices in learning design that can benefit all students. The particular question to be considered in this project will be determined based on available data and the interests and background of the research student. Possible questions that could be considered include: How do students interact with their instructor and each other in online discussion forums? How does student engagement with digital resources differ for courses presented in online and in-person delivery modes? Are there patterns of student activity that appear to be productive and patterns that do not? How can student LMS activity data be used as a proxy for student engagement and how might course design decisions affect the suitability of the data to effectively capture meaningful measures of engagement?

Researcher: Alison Gibbs, Statistical Sciences, Faculty of Arts & Science, UofT

Skills required:

Experience in data preparation and analysis using R or Python, particularly with methods for prediction, classification and/or network analysis.
Experience creating data visualization dashboards (for example, Shiny or Tableau) would be an asset.
Strong communication skills and interest in understanding how data can be used to support student learning.

Primary research location:

University of Toronto, St George Campus and/or Remote

Machine learning application to financial news in Korean and Japanese

Research description:

In this project, we analyze a large corpus of electronic news articles from the leading Korean and Japanese financial newspapers. Our objective is to track the private sector's assessment of Korean and Japanese foreign exchange policy – both countries have a history of actively "managing" their exchange rate to support exporting firms. We apply latent semantic scaling (a type of sentiment analysis) and classifier methods to news reports. A particular challenge of this data is that Korean and Japanese have no spaces between words, so the segmentation of terms itself has to be done probabilistically. Furthermore, Japanese newspapers use three different writing systems, while Korean newspapers use Chinese characters (hanja) differently depending on the publication period. The project therefore also expands the application of specialized text processing software for these two languages.

Researcher: Mark Manger, Political Science, Faculty of Arts & Science, UofT

Skills required:

Excellent reading knowledge of Korean and/or Japanese (preferably near-native fluency)
Some experience coding in Python and/or R
An interest in natural language processing and ideally financial markets or international economics.

Primary research location:

University of Toronto, St George Campus and/or Remote

Machine learning approach to detect structural variations in cancer genome

Research description:

Current approaches for identifying cancer-associated genomic structural variations (SVs) primarily rely on short-read sequencing data. However, there are many challenges associated with accurately identifying SVs using short-read sequencing data. To address these challenges, we will build a machine learning model to assign an SV discovery score by comparing features of an input SV to well-characterized SVs (orthogonally validated by long- and short-read platforms). Subsequently, we will apply the quantified SV discovery score to identify a subset of high-confidence SVs in a given cancer cohort.

Researcher: Sushant Kumar, Princess Margaret Cancer Centre, UHN

Skills required:

Machine learning
Neural networks
Proficiency in Python/R/Julia programming

Primary research location:

Princess Margaret Cancer Research Tower and/or Remote

Machine learning enabled design and development of novel complex concentrated alloys for structural applications

Research description:

Complex Concentrated Alloys (CCAs) are a relatively recently identified class of metallic alloys that, in contrast to their conventional counterparts, do not rely on the concept of a base element. Instead, CCAs usually have four or more elements that are mixed in near-equimolar compositions to form a single-phase alloy. In this project, we will apply machine learning to a database generated by density functional theory (DFT) calculations to design and develop novel CCA materials for structural and energy applications.

Researcher: Chandra Veer Singh, Materials Science and Engineering, Faculty of Applied Science & Engineering, UofT

Skills required:

Basic understanding of materials science, including crystal structures, defects in solids, different classes of materials.
Some programming knowledge, e.g. Python
Basic understanding of statistics & data sciences.

Primary research location:

University of Toronto, St George Campus and/or Remote

Machine Learning for Automated Volumetric Reconstruction of Electron Micrographs

Research description:

We combine cutting-edge computational biology and electron microscopy (EM) to address how a nervous system develops and operates. Specifically, we are pioneering the field of comparative connectomics, where serial ultrathin EM images are used to map the wiring of whole nervous systems down to the level of individual synapses (Mulcahy B, et al. Frontiers in Neural Circuits 12 (2018): 94). Using datasets from animals at different developmental ages, and exposed to different environments, we can glean insights into the general principles of neural network assembly and plasticity (Witvliet D, et al. Nature 596:257-261). In this project, students will work on the development of automated image processing, segmentation, and volumetric reconstruction of cells, circuits, and tissues from high-resolution EM image stacks. Along with our collaborators, we are building a collection of in-house machine learning algorithms to address these challenges. Research students will work on these algorithms with the goal of developing components of this pipeline and making them broadly accessible.

Researcher: Mei Zhen, Lunenfeld-Tanenbaum Research Institute

Skills required:

Proficiency in image processing, algorithm development, or programming; knowledge of machine learning is a plus but not necessary.
Strong drive to work with a strong team to learn and apply all the above.

Primary research location:

Lunenfeld-Tanenbaum Research Institute and/or Remote

Machine learning for chronic disease management

Research description:

Machine learning has enormous potential to automate predictive problems and uncover patterns using data that arise in healthcare. However, one of the fundamental problems that arise in healthcare is the scarcity of labelled data. This scarcity can arise due to diseases being rare in a population and because of privacy issues that limit the creation of large aggregated data stores. This project is designed to leverage auxiliary sources of information to improve the data efficiency of machine learning models. This project will leverage ideas from natural language processing and combine them with predictive modeling. The goal is to leverage language models (e.g. BERT, GPT-2) to extract relational structures among features to guide the design of predictive models. The project will be split up into two main sub-parts. The first part is the use of pre-trained and fine-tuned language models to extract relational structure among features. This structure may be in the form of (directed or undirected) pairwise relationships or graphs. The second sub-part will involve the translation of the inferred relational structure into predictive models that learn from relational information. The models and algorithms will be developed in the context of making predictions for patients undergoing liver transplants.

Researcher: Rahul Krishnan, Computer Science, Faculty of Arts & Science, UofT

Skills required:

A strong foundational knowledge of probability, statistics, and linear algebra.
Coursework in machine learning: CSC311 (CSC412 preferred) or equivalents if the student is not in CS.
Proficiency in Python and one of the following machine learning frameworks: PyTorch, Keras, TensorFlow or JAX.
An interest in studying machine learning for healthcare.

Primary research location:

University of Toronto, St George Campus and/or Remote

Machine learning in computational genomics

Research description:

The research project will involve the creation of new machine learning or other analytical systems to work with genomic and epigenomic data, including data from ChIP-seq, ATAC-seq, CUT&RUN, Hi-C, cfMeDIP-seq, and other genomic assays.

Researcher: Michael Hoffman, Computational Biology and Medicine Program, Princess Margaret Cancer Centre, UHN

Skills required:

Coursework in biology, computer science, electrical engineering, statistics, or physics.
Experience in Python and Unix environments.
Not required, but preferred qualifications: Coursework in computational biology. Experience in R, C, and C++.

Primary research location:

Princess Margaret Cancer Research Tower and/or Remote

Machine Learning Tools for Quantifying Protein Interactions

Research description:

We are looking for a student interested in developing a graphical user interface (GUI) for a series of machine learning-based algorithms developed within our lab. These algorithms have been developed to assess the extent of protein interactions observed through super-resolved imaging datasets. The hope is that an accessible GUI could be released to the scientific community and utilized by other labs, such as biological or medical labs, with more limited computational abilities.

Researcher: Joshua Milstein, Chemical and Physical Sciences, University of Toronto Mississauga, UofT

Skills required:

Strong programming skills, preferably in Python, with an interest in computational biology/bioinformatics.
Experience developing a GUI is a bonus, as it would accelerate this project, but is not a requirement.

Primary research location:

University of Toronto Mississauga and/or Remote

Mapping the Milky Way a Million Light Years Away

Research description:

Blue horizontal branch (BHB) stars are excellent tracers of distant structure: they are bright, so they can probe the Milky Way out to very large distances, and they are a useful tool for measuring the Milky Way halo density profile and for searching for interesting structures (e.g., star clusters, dwarf galaxies) in the Milky Way. However, the color of BHB stars is similar to that of another type of star, blue straggler stars (BSs), which are usually 10-50x closer at the same brightness. Therefore, it is scientifically important to separate these two populations of stars efficiently. Luckily, the high-precision photometry data from the Dark Energy Survey (DES, a wide-field imaging survey that contains 400 million astronomical objects) show separate sequences in multi-color space. The student will determine the probability of a star being a BHB or BS star using a statistical approach, given the brightness and color of the stars and their measurement uncertainties from DES. The student will also explore this project using various machine learning methods. This project will create one of the largest samples of BHB stars to probe the Milky Way out to a distance > 300 kpc (~ 1 million light-years!).
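
As a rough illustration of the statistical step, the sketch below treats a single color index as a two-component Gaussian model and adds each star's measurement uncertainty in quadrature to the intrinsic spread of each population. The function names, class means, widths, and prior are hypothetical placeholders chosen for illustration; they are not DES values and not necessarily the approach the project will take.

```python
import math

def gaussian_pdf(x, mean, sigma):
    """Density of a normal distribution N(mean, sigma^2) at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def bhb_probability(color, color_err, prior_bhb=0.5,
                    bhb=(0.00, 0.03), bs=(0.10, 0.03)):
    """Posterior probability that a star is a BHB rather than a BS.

    bhb and bs are hypothetical (mean, intrinsic sigma) pairs for one color
    index. The measurement error is added in quadrature to the intrinsic
    sigma, which corresponds to convolving two Gaussians.
    """
    like_bhb = gaussian_pdf(color, bhb[0], math.hypot(bhb[1], color_err))
    like_bs = gaussian_pdf(color, bs[0], math.hypot(bs[1], color_err))
    weighted_bhb = prior_bhb * like_bhb
    return weighted_bhb / (weighted_bhb + (1.0 - prior_bhb) * like_bs)
```

With these placeholder values, a star measured close to the assumed BHB color gets a probability near 1, and the same star with a larger measurement uncertainty gets a probability closer to the prior.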

Researcher: Ting Li, Astronomy & Astrophysics, Faculty of Arts & Science, UofT

Skills required:

Basic computer programming skills in Python.
An interest in working on a research project involving Bayesian statistics, modelling, and machine learning methods.

Primary research location:

University of Toronto, St George Campus and/or Remote

Modeling heterogeneous covariance patterns in high-dimensional brain imaging data

Research description:

An important lesson from a first undergraduate statistics course is that a larger sample size leads to higher power. In brain imaging studies, it is common to combine data collected from multiple study sites to recruit more subjects and increase the reproducibility of scientific discoveries. However, each study site uses MRI scanners from different manufacturers and its own processing protocols, which results in data being corrupted by *unwanted* scanner effects (also termed batch effects). For high-quality data, it is necessary to remove these unwanted scanner effects but, at the same time, preserve biological patterns. The student will develop a data science methodology that removes explicit scanner effects from high-dimensional brain imaging data. A particular application of interest is the cortical thickness data obtained from structural magnetic resonance imaging (MRI). Cortical thickness data reveal an explicit spatial autocorrelation structure, and we hypothesize that a major source of the scanner effect is spatial autocorrelation that varies across scanners. The student will first conduct exploratory data analysis to visualize and quantify these effects. We will then develop a batch correction method and compare its performance to existing methods using simulation studies. A report summarizing the work is expected by the end of the summer.

Researcher: Jun Young Park, Statistical Sciences, Faculty of Arts & Science, UofT

Skills required:

Coursework in (mathematical and applied) statistics and multivariate analysis, and programming experience in R or Python.
Prior knowledge of random effects, spatial statistics, or brain imaging is helpful but not required.
Students interested in graduate programs in statistics or biostatistics are encouraged to apply.

Primary research location:

University of Toronto, St George Campus and/or Remote

Neural networks for classifying capnography waveforms

Research description:

A capnography waveform displays the level of expired carbon dioxide (CO2) over time to show changes in concentrations throughout the respiratory cycle. Capnography waveform abnormalities assist in the detection and diagnosis of specific conditions, such as partial airway obstruction and apnea. Distinguishing the capnography waveform abnormalities that deserve intervention from those that do not is an essential step towards the successful implementation of this technology into practice. In this study, capnography waveforms collected as part of an international prospective observational trial of opioid-induced respiratory depression on inpatient wards (the PRODIGY study) will be analyzed. A labeled dataset consisting of ~6000 15-second segments of capnography waveform samples has been created. The research student will assist the investigators in determining the accuracy of a neural network for classifying capnography waveforms.

Researcher: Aaron Conway, Lawrence S. Bloomberg Faculty of Nursing, UofT

Skills required:

Experience with neural networks, specifically convolutional neural networks, as we are interested in classifying 1-dimensional data (CO2 vs. time).

Primary research location:

University of Toronto, St George Campus and/or Remote

Novel representations of surface electromyography data through deep learning

Research description:

Surface electromyography (sEMG) measures the electrical activity of muscles. Its uses in research and clinical application include enabling amputees to control prosthetic limbs and understanding the effects of therapies after damage to the nervous system. sEMG is typically analyzed through a well-established and restricted set of signal features. There is an opportunity to use data science and machine learning to develop a richer characterization of sEMG and, in doing so, maximize the benefits of this non-invasive and very accessible physiological signal. This project will focus on using deep learning architectures to extract novel features of sEMG data that outperform engineered features on sEMG classification and clustering problems.

Researcher: Jose Zariffa, KITE Research Institute - Toronto Rehab, UHN

Skills required:

Previous experience with deep learning, including experience designing or modifying neural network architectures for a new task.
Experience with models that process non-video time series data and with dimensionality reduction methods would be ideal.

Primary research location:

KITE - Toronto Rehab - UHN

Phylo-genomics

Research description:

This project will use bioinformatic and biostatistical approaches to analyze gene family molecular evolution in recent whole-genome sequence data for over 50 species of Caenorhabditis nematode roundworms. In particular, we will focus on a family of genes important in fertility and cell-cell signalling. The student will use existing genomics software tools as well as develop customized scripts to process and manage data analysis.

Researcher: Asher Cutter, Ecology & Evolutionary Biology, Faculty of Arts & Science, UofT

Skills required:

Familiarity with R/Python/Bash
Familiarity with genetic and/or evolutionary principles
Ability to work both collaboratively and independently
Willingness to proactively seek solutions to problems

Primary research location:

University of Toronto, St George Campus and/or Remote

Positive-unlabeled learning with electronic health records data

Research description:

This project will investigate statistical machine learning methods for electronic health records (EHR) phenotyping, the process of inferring patient characteristics (e.g., disease status, treatment response) from the information contained in their health record. Our focus will be on the statistical challenges that arise in developing phenotyping models without gold-standard labeled data, as manually annotating patient records is prohibitively expensive and labor-intensive. The student will contribute to the development of a positive-unlabeled (PU) learning method that leverages existing positive-only EHR data (i.e., records of patients known to have the phenotype without reviews, such as those with a confirmatory lab or procedural finding) and a large volume of unlabeled data to yield accurate phenotyping models without gold-standard labeled data. The responsibilities for the student include (i) reviewing the PU learning literature, (ii) contributing to the development of a PU learning method, (iii) running simulation studies, (iv) applying the method to the UTOPIAN EHR repository containing primary care data on ~500,000 patients in Ontario, and (v) presenting findings to UTOPIAN clinicians and analysts. The student will gain experience in (i) R programming, GitHub, and high-performance computing, (ii) methodology development, (iii) critiquing statistical literature, and (iv) communicating results to data science and clinical audiences.

Researcher: Jessica Gronsbell, Statistical Sciences, Faculty of Arts & Science, UofT

Skills required:

Coursework: Some background in statistical inference or machine learning (e.g., STA 257, STA 261, STA 314).
Skills: R, GitHub. Basic knowledge of natural language processing and experience working on the Niagara cluster are preferred, but not required.

Primary research location:

University of Toronto, St George Campus and/or Remote

Predicting and Reducing Human Error By Analyzing Billions of Chess Moves

Research description:

In this project, you will apply state-of-the-art machine learning methods to billions of human chess moves to understand how, when, and why people make mistakes. You will apply our popular human-like chess model, Maia (the GPT-3 of chess), to develop an understanding of human decisions and errors at various skill levels, and develop interventions to help people get better.

Researcher: Ashton Anderson, Computer & Mathematical Sciences, University of Toronto Scarborough, UofT

Skills required:

Machine learning, data analysis, handling large amounts of data, command line skills, capacity to learn new skills on the go.
Experience with chess is a bonus, but not required.

Primary research location:

University of Toronto, Scarborough and/or Remote

Public health decision support tools to prevent chronic disease

Research description:

The student will support the work of the Collaborative Research Team using data science and human factors engineering to enhance and deploy decision support tools for the prevention of chronic diseases. The tool uses cross-sectional data from Statistics Canada's Canadian Community Health Survey (CCHS). In preliminary work, we conducted focus groups with the target user group of public health practitioners, identifying the need to update the tool with recent data. Since there have been significant changes to the CCHS survey methodology and specific variables in the model, we need to conduct sensitivity testing of the validated predictive model with more recent data. The student will work with a Population Health Analytics Lab to conduct sensitivity analyses of the predictive model using recent CCHS data. The student will be responsible for applying the predictive model to several cross-sectional cycles of the CCHS and modifying the analytic code for data nuances to test the model over several years. The student will also support the human factors methods developing the user interface for public health.

Researcher: Laura Rosella, Dalla Lana School of Public Health, UofT

Skills required:

Experienced statistical coder using SAS or R software.
Critical thinker, problem-solver, and communicator who can work effectively both independently and in a team setting.

Primary research location:

University of Toronto, St George Campus and/or Remote

Real-time visualization of important variables along low-dimensional manifolds

Research description:

Nonlinear low-dimensional embeddings (such as van der Maaten and Hinton's t-SNE) are great for visualizing high-dimensional data, allowing humans to see shapes and clusters in the data. Unfortunately, interpreting those embeddings can be a bit trickier because the axes of the embedding cannot be directly related to the original features of interest. We can see patterns in the embedding, but figuring out what those patterns correspond to in the original data is much harder. We propose to solve this problem by allowing the user to interactively draw a path in a web app directly onto a 2D embedding. Then, by back projecting up to the high dimensional space where each dimension/variable is a potential feature of interest, we can quickly determine which variables are associated with that path in the browser. This project's output will be an interactive web app that allows users to do data analysis in the browser without needing to connect to a central server.

Researcher: Yun William Yu, Computer & Mathematical Sciences, University of Toronto Scarborough, UofT

Skills required:

A background in web design and JavaScript.
Some knowledge of designing and implementing fast algorithms and visualizations.
Notably, unlike most web design projects, this project will run actual mathematical transformations in the browser, so efficiency is key.

Primary research location:

University of Toronto, St George Campus and/or Remote

Search for Member Stars in the Stellar Streams from Astronomical Survey Datasets

Research description:

Stellar systems such as galaxies and globular clusters can be disrupted to form stellar streams in our Milky Way, providing a snapshot of accretion that can be compared directly with theoretical models of the formation and evolution of galaxies. Thanks to various modern space-based and ground-based imaging and spectroscopic surveys, we now have both the kinematic and chemical information of over a dozen stellar streams. The student will develop a statistical model to assess the membership of each stream candidate star with full 6D phase space and metallicity information and assess how the model might affect the underlying kinematic and metallicity properties of the streams. This project will explore developing and applying new statistical and computational techniques that will be widely used in next-generation spectroscopic surveys.

Researcher: Ting Li, Astronomy & Astrophysics, Faculty of Arts & Science, UofT

Skills required:

Basic computer programming skills in Python and C++
An interest in working on a research project involving Bayesian statistics, nested sampling algorithms, and model comparison.

Primary research location:

University of Toronto, St George Campus and/or Remote

Sensitivity Testing of Canadian Insurance Data

Research description:

This research project is based on real historical Canadian insurance claim data stemming from natural catastrophes. Natural catastrophes include different perils, such as winter storms, floods, and wildfires. For this project, we first fit an extreme value distribution to the Canadian insurance dataset, second create a representative insurance portfolio, and third assess its risk via so-called risk measures. Afterward, we conduct sensitivity analysis of the insurance portfolio using the methodology "reverse sensitivity testing" developed in Pesenti, S. M., Millossovich P., and Tsanakas A., (2019). Reverse sensitivity testing: What does it take to break the model? European Journal of Operational Research, 274(2), pp. 654-670. The project will include implementation in the programming language R as well as the usage of the R package SWIM (Scenario Weights for Importance Measurement) developed by my co-authors and me, which is available on CRAN. The student will form part of my research team and will not only work under my supervision but will also have the opportunity to work alongside the members of my research team. The student will actively engage in every aspect of the research project, from data selection and cleaning to implementation of statistical models and visualization of the results.
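
The description does not say which risk measures are used. As an assumed example, the sketch below computes two common ones, Value-at-Risk (VaR) and Expected Shortfall (ES), from a sample of portfolio losses. The project itself is implemented in R with the SWIM package; this Python version is only a language-neutral illustration, and the function names are hypothetical.

```python
import math

def value_at_risk(losses, level=0.95):
    """Empirical VaR: smallest loss x such that at least `level` of the sample is <= x."""
    ordered = sorted(losses)
    index = math.ceil(level * len(ordered)) - 1
    return ordered[max(index, 0)]

def expected_shortfall(losses, level=0.95):
    """Average of the losses at or above the empirical VaR (one common sample estimator)."""
    var = value_at_risk(losses, level)
    tail = [x for x in losses if x >= var]
    return sum(tail) / len(tail)
```

A sensitivity analysis would then ask how such a measure responds when the loss distribution is stressed, which is the question reverse sensitivity testing addresses.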

Researcher: Silvana Pesenti, Statistical Sciences, Faculty of Arts & Science, UofT

Skills required:

The student should have good programming skills using the programming language R.
Students with a major in actuarial science, mathematics, or statistics are preferred.
This research project draws on material from probability theory and statistics and utilises the programming language R for implementation and visualization.

Primary research location:

Remote

ServiceMiner: Automated Data-Driven Simulation Mining in Service Systems

Research description:

Our research group is developing ServiceMiner, a tool for automatically learning simulation models from enriched system event log data. The simulation models can subsequently be used for analyzing system interventions and optimizing the system. The proof of concept of ServiceMiner was tested using hospital and cloud services data; a patent application has been filed. However, software development is still in a nascent stage. Converting the prototype into functional software requires the following tasks: (1) transforming input event log and related system data into context-enriched event logs; (2) developing an effective graphical user interface for model and intervention specification; and (3) designing, integrating, and optimizing the algorithmic components. The student could get involved with any aspect of the work depending on their interests.

Researcher: Dmitry Krass, Joseph L. Rotman School of Management, UofT

Skills required:

The software is based in Python, so strong programming skills in Python are a must.
In addition, any subset of the following would be useful: (a) strong data modeling and programming (SQL or equivalent Python Libraries), (b) GUI design and development, (c) machine learning techniques, (d) stochastic processes/queuing.

Primary research location:

University of Toronto, St George Campus and/or Remote

Sociotemporal embedding of slang

Research description:

Slang is a socially constructed linguistic phenomenon that involves the creation of new words and expressions by specific groups of people. One functional purpose of slang is to enable expressive communication and in-group familiarity among people of shared background and knowledge. Existing work in natural language processing has explored how slang may be automatically detected, generated, and interpreted. However, sparse work has investigated the question regarding how slang evolved in different groups or communities and over time. In this student project, we aim to develop a novel methodology that draws on a large amount of social media data (e.g., Reddit) to quantify how meanings of slang terms evolve, both across many communities and over a stretched period of time. The student will participate in the design, implementation, and evaluation of such a methodology drawing on an interdisciplinary set of areas in data science, machine learning, and natural language processing (NLP), and will work to advance state-of-the-art word meaning representations such as word and contextual embeddings. The projected outcome of this project is a principled methodology that quantifies and visualizes slang evolution in socio-temporal settings, which has implications and applications in the science of informal language and NLP.

Researcher: Yang Xu, Computer Science, Faculty of Arts & Science, UofT

Skills required:

Proficiency in mathematics (calculus, linear algebra, probability theory), statistical inference and testing.
Strong programming skills (e.g., in Python) and prior experience in large-data processing, analysis, and modeling.
Coursework or projects in NLP, machine learning, and optimization methods, and hands-on experience with word embeddings and contextual embeddings.

Primary research location:

University of Toronto, St George Campus and/or Remote

Supporting Canadian Apprentices in the Construction and Industrial Sectors: Genetic and Epigenetic Analyses of Mental Health

Research description:

We are proposing a novel gender-inclusive approach focusing on understanding barriers faced by women, Indigenous people, youth, and other underrepresented groups to increase recruitment, improve retention, and expand and stabilize the construction and industrial workforces across Canada. This proposal builds on our prior research on workplace factors associated with health professions' workplace stressors, injuries and retention, and my former collaborative professional practice with injured miners, employers, and unions on workers' return to work. Our proposed research will develop and implement strategies to increase worker participation and retention in the construction and industrial workforce, based on gender-, age-, and ethnicity-informed systematic analysis of barriers to recruitment and retention: 1) emerging trends and practices; 2) (i) identifying recruitment and retention factors in underrepresented groups in the construction and industrial sectors and (ii) evaluating the lived experiences of apprentices in those sectors; 3) the workplace organization perspective; 4) solutions for future training and education; 5) whole-genome analysis of genetic markers of the participants (apprentices) at study entry; and 6) (i) genome-wide methylation analysis (epigenetic mechanisms) and stress levels of employers and apprentices at study entry (cross-sectional) and (ii) analysis of genome-wide methylation changes and stress levels in the employers and employees over two years (longitudinal).

Researcher: Behdin Nowrouzi-kia, Occupational Science and Occupational Therapy, Temerty Faculty of Medicine, UofT

Skills required:

Excellent interpersonal skills
Strong computer experience including statistical analyses
Outstanding organizational skills
Demonstrated ability to maintain confidentiality
Ability to be a team player
Experience working in a mental health context
Flexible individual with initiative and capacity to handle the complexity of tasks simultaneously

Primary research location:

University of Toronto, St George Campus and/or Remote

Taxonomic classification of metagenomic sequencing reads

Research description:

Large amounts of microbial genome data are being generated from environmental samples that mix together DNA from many different species of bacteria; standard examples include the microbiome of the human gut or that of ocean water samples. Physically separating the DNA from the different species is often difficult and/or expensive, so there is a need for algorithms that are able to classify DNA fragments by their species (or genus, family, etc.) of origin. Luckily, species by definition differ in their genetic content, and thus this task is theoretically feasible in silico. However, the classification task is made difficult by the fact that related species often share significant portions of their DNA. Additionally, the sizes and error rates of the DNA fragments we have access to vary by the sequencing technology used. In this project, we will explore modern algorithmic methods for dealing with both short-read and long-read metagenomic sequencing data, and then build a new software tool for practitioners to use. We expect to make use of probabilistic sketching, minimizers, and de Bruijn graphs, but the exact tools will vary as the project goes on.
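
For readers unfamiliar with minimizers, the following is a minimal sketch of the idea: slide a window over consecutive k-mers and keep only the smallest k-mer in each window, so that similar sequences tend to select the same representatives. The function name and parameters are hypothetical, and plain lexicographic ordering is a simplifying assumption; practical tools typically order k-mers by a hash and handle reverse complements, and the project's own tool may differ.

```python
def minimizers(sequence, k, w):
    """Return sorted (position, k-mer) pairs chosen as window minimizers.

    Each window covers w consecutive k-mers; the lexicographically smallest
    k-mer in the window is kept, with ties broken by leftmost position.
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmers[i], i))
        chosen.add((best, kmers[best]))
    return sorted(chosen)
```

Because adjacent windows often share their minimizer, the selected set is much smaller than the full list of k-mers, which is what makes it useful for indexing and comparing reads.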

Researcher: Yun William Yu, Computer & Mathematical Sciences, University of Toronto Scarborough, UofT

Skills required:

Background in the analysis of algorithms and the software engineering skills to implement those algorithms.
Background in Rust or C++, or other programming experience and the willingness to learn new languages.
Familiarity with biology and genomics.

Primary research location:

University of Toronto, St George Campus and/or Remote

The Data Science of a Satisfying, Purposeful, and Engaging life

Research description:

You will have the opportunity to work with the Gallup World Poll - a global survey with over 2 million participants from over 170 countries - to systematically identify the population-level factors that can best predict improvements in population-level well-being. The project will involve looking at a detailed set of population characteristics, including but not limited to technological (e.g., the use of robotics in the workplace), labor (e.g., policy-mandated amount of work hours), environmental (e.g., pollution), social (e.g., the level of residential segregation by race and ethnicity), and political factors (e.g., freedom of the press). A machine learning approach will then be applied to identify a subset of national indicators that can best predict population well-being.

Researcher: Felix Cheung, Psychology, Faculty of Arts & Science, UofT

Skills required:

Students with a strong multidisciplinary background, experience in quantitative methods, and familiarity with a statistical programming language will be preferred.

Primary research location:

University of Toronto, St George Campus and/or Remote

Transferring Generative Adversarial Network (GAN): Augmentation from Limited Patients' Data

Research description:

Patient-specific treatment is a strategy that optimizes clinical efficacy by tailoring the diagnosis and treatment methods based on the specific conditions of patients. For machine learning (ML)-based diagnosis and treatment, patient-specific treatment typically requires data collection for training the machine learning models. However, long-term data collection in closely monitored environments, such as epilepsy monitoring units (EMUs), often comes at high costs and is inconvenient for the patients. There is a compelling need for data augmentation techniques. A generative adversarial network (GAN) is an emerging technique that generates new data with the same statistics as the training set. A GAN employs two ML models, a generator and a discriminator, that compete with each other until the generator can "fool" the discriminator. Combined with transfer learning from gained knowledge, a GAN can generate data segments from limited recordings, such as epileptic patients' ictal segments, which occur at very low rates. The augmented dataset can then be used for training ML models for detecting epileptic onsets. In this project, the goal is to test this hypothesis using a subset of a prerecorded, expert-labeled large database. A high-performance discriminator trained on the complete database will be used for evaluating the performance of the transferring GAN.

Researcher: Xilin Liu, Electrical and Computer Engineering, Faculty of Applied Science and Engineering, UofT

Skills required:

Background knowledge in AI (i.e., relevant coursework)
Proficient in at least one programming language (e.g., Python or MATLAB)
Familiar with machine learning tools (e.g., PyTorch, TensorFlow)
Note that biomedical background is NOT required

Primary research location:

Remote

Understanding language use and inference through text mining

Research description:

When people write a blog post, review a product on Amazon, or join the conversation on Twitter, they do so in words. Often, they do so without realizing the subtle power that the words they choose have on their audience. This project, at the intersection of psycholinguistics, marketing, and data science, will uncover the meaning behind words and help figure out how to match the right word to the right situation. As users generate more and more content, they do so in the form of more and more words. Though the effects of words can be subtle, the modern marketplace offers the opportunity to identify even subtle effects by harnessing the sheer volume of data, in the form of the countless words available online, that is becoming available. This project will ask the researcher to use and develop novel methodologies for web scraping and data analytics as applied to natural language processing. We will answer questions like whether online reviews are more persuasive when written in the past or the present tense ("I was happy" or "I am happy" with a purchase) and what audiences infer when communicators talk about time ("I'll be with you momentarily").

Researcher: Sam Maglio, Management, University of Toronto Scarborough, UofT

Skills required:

This project is looking for a student who is well-versed in web scraping methodologies, natural language processing resources (e.g., the Linguistic Inquiry and Word Count), and data analytics tools (R, Python, SPSS), and who has a keen interest in how the words people use change what their audiences hear.

Primary research location:

University of Toronto, St George Campus and/or Remote

Unleashing the power of student data analytics

Research description:

Student researchers will apply different approaches to mining student data in order to resolve clusters. Natural language processing techniques will be used to support the processing of qualitative data collected through student surveys.

Researcher: Greg Evans, Chemical Engineering and Applied Chemistry, Faculty of Applied Science & Engineering, UofT

Skills required:

Experience with cluster analysis techniques and/or natural language processing

Primary research location:

University of Toronto, St George Campus and/or Remote

Using machine learning to identify and understand the interplay of factors associated with risk in the pre-disease stage of Crohn's disease

Research description:

Crohn's disease (CD) is thought to be due to dysregulated interactions between environmental, dietary, microbial, immunological, and genomic factors. The human gut microbiota plays a pivotal role in health and disease and endows the host with an array of essential functions, from the fermentation of complex carbohydrates and energy harvesting to intestinal barrier function and metabolite generation. Changes in gut microbiome composition and function are increasingly recognized to be associated with CD pathogenesis, resulting in significant effects on host physiology and homeostasis. Indeed, our group has identified that the pre-disease microbiota of individuals who later develop CD was different from that of individuals who remained healthy. Changes in microbiota composition likely lead to modifications of microbiota-derived metabolites, such as secondary bile acids. As such, we hypothesize that perturbations in bile acid (BA) composition are influenced by diet and gut microbiome structure and are related to physiological factors, which together may play a key role in the development of CD. We propose to investigate the role of BAs as they relate to host homeostasis and CD pathogenesis. We aim to (i) identify the interplay between bile acids, microbiome, and host factors, and (ii) build a model to predict the risk of CD onset.

Researcher: Kenneth Croitoru, Lunenfeld-Tanenbaum Research Institute

Skills required:

Machine learning algorithms
Causal mediation analysis
Statistics (R, Python)
Conditional logistic regression
Interest in computational biology
Data mining
Ability to work in a multidisciplinary team and in translational research

Primary research location:

Lunenfeld Tanenbaum Research Institute/Remote

Visualizing knowledge

Research description:

The past centuries have witnessed an explosion of advances across countries and across disciplines. The new knowledge is, for the most part, encapsulated in printed material. In this project, the research assistant will work with the principal investigator to (1) mine data from a uniquely combined set of sources [a set of libraries' Machine-Readable Cataloging (MARC) records and the HathiTrust corpus of digitized materials related to the Google Books project] and (2) visualize the content from the searches. The various visualizations that will emerge from the queries on text and their associated metadata will be designed to enable knowledge discovery. This will be accomplished through the application of standard tools, such as network graphs and Sankey diagrams, as well as newly designed visualizations that will be programmed to highlight and track a topic's sphere of influence across time, locations, and fields of inquiry. In addition to providing unique insights into the changes and spread of knowledge, we will explore, as a use case, how various visualizations, when used together, may be employed to identify books and materials on intersecting topics held in the library digital collections that would be of interest to library users.

Researcher: Michelle Alexopoulos, Economics, Faculty of Arts & Science, UofT

Skills required:

Knowledge of Python.
Familiarity with visualizing results using R, Python, or related tools.
Knowledge of database systems, especially setting up new databases and querying existing ones.
Knowledge of Elasticsearch and Kibana will be an asset.
Basic knowledge of Natural Language Processing.

Primary research location:

Remote

For more information

SUDS.dsi@utoronto.ca

More opportunities

University of Toronto students may also be interested in the School of Cities Urban Data Science Corps (UDSC) internship program. Click here to find out more and apply.

News

A summer of learning, fun and community for 2022 DSI SUDS Scholars.

Read the full story.
