Data Sciences Institute (DSI)

Student teams investigate just how difficult it is to reproduce research

Reproducibility is essential for research. How can we know if a study is reliable if it is not reproducible or replicable? A researcher should be able to pick up any piece of published research and replicate it, provided they have the right materials.

But just how hard is it to reproduce research?  

Six student teams, from across U of T, including the Faculty of Applied Science and Engineering, the Temerty Faculty of Medicine, the Dalla Lana School of Public Health, and the Faculty of Arts and Science, set out to try and answer this question by attempting to reproduce published analytic research. Each team presented its discoveries during the recent Student-Led Reproducibility Challenge that aimed to raise awareness of reproducibility amongst students and emphasize that robust and reproducible processes are critical to maintaining confidence in research.  

Reproducibility is a key Thematic Program for the Data Sciences Institute, and this challenge was pioneered by the Reproducibility co-leads Rohan Alexander, Faculty of Information and Department of Statistical Sciences, Faculty of Arts & Science; Benjamin Haibe-Kains, University Health Network and Medical Biophysics at the Temerty Faculty of Medicine; and Jason Hattrick-Simpers, Faculty of Applied Science and Engineering.

“We were very excited to see the high level of engagement, and everyone was impressed with the level of work and commitment to the challenge. We hope this is the first of many student-led reproducibility challenges and hackathons. It is so important that future researchers value reproducibility and carry these lessons forward in their work,” says Hattrick-Simpers, team captain for one of the student teams.

“The student teams had such great presentations. The level of commitment to thoroughness and dedication to reproducing the computational methods was impressive. Transparency, reproducibility, and replication in research are more important than ever. We need to be better at communicating the solutions and challenges we face as researchers,” says Prof. Aya Mitani, team captain for one of the student teams and assistant professor at the Dalla Lana School of Public Health.

The challenge of reproducibility 

The teams reported multiple cross-cutting challenges. For example, many teams found that the necessary data were not publicly accessible. This was the case for Alyssa Schleifer and Hudson Yuen, whose chosen paper was about safety, health, and isolation in prisons. They reported that the original datasets were unavailable in many cases, with only pre-processed subsets accessible, which limited their ability to check for robustness or derive new findings.

Kimlin Chin, who replicated a paper concerning falling US birth rates, said that finding the right paper was difficult. “Finding a paper that had sufficient content to make an interesting reproduction paper and also came with a complete reproduction package, i.e., was not missing any code, data, or output files, was a challenge.”

Sanjot Grewal, Steve Jeoung, Walid Maraqa, and Mu Yang attempted to reproduce the results of a paper studying the effect of social distancing on COVID-19 cases in the US during the early days of the pandemic. The team discovered inconsistencies in the reproduced results and realized that the authors of the original paper had incorrectly documented a critical formatting and pre-processing step. Christie Lau, Laurie Lu, Ruiyan Ni, and Emily So worked on a paper concerning predictive models of drug response and found that the lack of proper data resources and identification started the reproduction process off on the wrong foot.

Having the right technology was also a challenge. Daniel Persaud, who worked on a paper describing a general-purpose machine learning framework for predicting properties of inorganic materials, ran into computational constraints because he had to rely on his personal computer.

“It was remarkably interesting to hear from students working in different research fields. Most of the obstacles they faced were quite universal, but some were specific to their field where programming languages and ways to access data differ. We have a lot to learn from each other and more importantly, we need to work together to improve transparency and reproducibility,” says Haibe-Kains, team captain for one of the student teams.

Full group from reproducibility challenge.

The importance of reproducibility in the workplace 

Holly Xie, Senior Applied Scientist – Machine Learning Products at Xero Accounting, and Chris Henry, Senior Economist at the Bank of Canada, spoke about the importance of reproducibility for organizations. Henry discussed the importance of reproducibility at the Bank of Canada, highlighting the necessity of creating reproducible content and maintaining records for when new employees are onboarded.

“Reproducibility is important because things change over time. You need to know how things were done. People also come and go. Reproducibility is essential in helping new team members come up to speed,” he said during his presentation.

The student teams consisted of:

  • Alyssa Schleifer and Hudson Yuen, Challenge: Western, B. (2021). Inside the Box: Safety, Health, and Isolation in Prison
  • Kimlin Chin, Challenge: Kearney, M. S., Levine, P. B., & Pardue, L. (2022). The Puzzle of Falling US Birth Rates since the Great Recession 
  • Swarnadeep Chattopadhyay, Arsh Lakhanpal, and Olaedo Okpareke, Challenge: Kearney, M. S., Levine, P. B., & Pardue, L. (2022). The Puzzle of Falling US Birth Rates since the Great Recession 
  • Christie Lau, Laurie Lu, Ruiyan Ni and Emily So, Challenge: Ma, J., Fong, S. H., Luo, Y., Bakkenist, C. J., Shen, J. P., Mourragui, S., & Ideker, T. (2021). Few-shot learning creates predictive models of drug response that translate from high-throughput screens to individual patients 
  • Sanjot Grewal, Steve Jeoung, Walid Maraqa, and Mu Yang, Challenge: Siedner, M. J., Harling, G., Reynolds, Z., Gilbert, R. F., Haneuse, S., Venkataramani, A. S., & Tsai, A. C. (2020). Social distancing to slow the US COVID-19 epidemic: Longitudinal pretest-posttest comparison group study 
  • Daniel Persaud, Challenge: Ward, L., Agrawal, A., Choudhary, A., & Wolverton, C. (2016). A general-purpose machine learning framework for predicting properties of inorganic materials

DSI welcomes Baycrest as a partner

The Data Sciences Institute (DSI) is excited to announce a new partnership with Baycrest. Baycrest is a leader in cognitive neuroscience and memory research, with the goal of transforming the journey of aging. The Baycrest Rotman Research Institute (RRI) advances the understanding of human brain structure and function in critical areas of clinical, cognitive, and computational neuroscience, including perception, memory, language, attention, and decision making. With a primary focus on aging and brain health, including Alzheimer’s and related dementias, research at the RRI and across the Baycrest campus promotes effective care and improved quality of life for older adults through research into age- and disease-related behavioural and neural changes.

Allison Sekuler from Baycrest.

“This partnership will help Baycrest expand our potential for meaningful impact, catalyzing the transformative nature of data science to make the most of our behavioural, clinical, and neuroimaging data,” says Allison Sekuler, President and Chief Scientist, Baycrest Academy for Research and Education at the Baycrest Centre for Geriatric Care. 

“Canada is aging faster than ever before, and the pandemic shone a light on the vulnerability of older adults and exacerbated the public health crisis of dementia. We urgently need to address this critical societal issue, but that requires new ways of working together. Connecting with the data science community through the DSI will forge new collaborations and research opportunities, helping us create a world where all older adults can live their best possible lives.”

Allison Sekuler, President and Chief Scientist, Baycrest Academy for Research and Education at the Baycrest Centre for Geriatric Care

DSI collaborates with organizations eager to support world-class researchers, educators, and trainees advancing data sciences. We facilitate inclusive research connections, support foundational research in data science, and support the training of a diverse group of highly qualified personnel for their success in interdisciplinary environments. Because Baycrest is one of our external funding partners, its researchers can apply for research grants and for training and networking opportunities at the DSI.

“We are very excited to announce this partnership. Our goal is to create a central hub to elevate data science research, training, and partnerships. By connecting data science researchers, data and computational platforms, and external partners, the DSI will both advance research and nurture the next generation of data- and computationally focused researchers. We are thrilled to have Baycrest researchers join the DSI community.”

Lisa Strug, Director of the DSI

Advancing data science discovery via software development support

Data Sciences Institute (DSI) announces its first software development support

Data science research is becoming increasingly reliant on complex computer programming, but many researchers lack the training or experience in software engineering to develop effective and reliable software. The DSI’s software development program supports faculty and scientists at the University of Toronto and external funding partners to accelerate their research by providing access to highly skilled software developers to refine or enhance existing software and improve usability and robustness, build new tools, and disseminate research software. The DSI hopes to help develop research software that can be used across disciplines and that supports reproducible processes.

Coming out of the first call for this competitive program, six researchers and their teams will be able to work with a DSI software developer to build high-quality and adaptable software. The research projects reflect a wide range of fields, from the humanities to the social sciences and life sciences.

“With over 25 applications, we had a tremendous response for this first competitive call for DSI support. It was exciting to learn about the wide range of research projects needing software support at UofT. We are working to increase capacity for this important program to better support the cutting-edge research, while supporting the collaboration, equitable and open science principles at the DSI,” says Gary Bader, DSI associate director of data management, research software and advanced research computing.

A key part of creating collaborative, reusable software is ensuring that source code is available to the broader research community. To that end, DSI-supported research software will be publicly available and documented on GitHub, and GitHub will also be used to track projects and progress towards milestones.

“There is so much data in the world now. This is a transformational change. Some researchers are very savvy with it, but others are just discovering it, and we are here to support them. I see myself as more of a technician; it’s really about the researchers and their teams and what they want to achieve. It has been exciting to be part of these projects,” says Conor Klamann, DSI senior software developer.

The next call for DSI research software developer support will be announced later this year.

Developing a web interface to help speech researchers

Ewan Dunbar and his team from the Department of French in the Faculty of Arts & Science are working with the DSI to create a web interface that allows speech researchers to upload audio files and download “speech features” useful for speech processing. This software is helpful for many experimental and clinical speech researchers. However, installing it currently requires not only Python but also dependencies that do not work on Windows. Once completed, Speech Features Online (SFO) will let users upload large audio datasets and select among available speech features with ease.

“We are very excited about this project and thrilled to work with the DSI. We want to have a tool, but we also want to make it accessible, by taking research code and bundling it, so researchers know that it’s usable and understand what it’s doing. That takes a lot of work, and it’s really a software development task,” says Ewan Dunbar.

Professor Dunbar’s research focuses on human speech perception, automatic speech processing, and understanding the cognitive processes going on in the human brain. As a speech researcher, Dunbar is also working on a major problem: speech technology is currently limited to a few languages, such as English, for which researchers have access to lots of transcribed audio data.

The full list of projects from the DSI’s Research Software Development Support Program

Alan Moses from the Department of Cell & Systems Biology, Faculty of Arts & Science, and Julie Forman-Kay from The Hospital for Sick Children will work with the DSI to create a software program to help the research community with intrinsically disordered regions, which are protein sequences that do not take on a stable secondary or tertiary structure.

Dorothea Kullmann from the Department of French, Faculty of Arts & Science, will work with the DSI to develop a database that will consist of two interrelated parts: 1) a catalogue of the late medieval manuscripts of this type kept in Canada; and 2) a text corpus of the French texts contained in these and in other manuscripts of the same type kept anywhere in the world.

Eunice Eunhee Jang from the Department of Applied Psychology and Human Development, OISE (Ontario Institute for Studies in Education), is working on curriculum-based learning tools that assess and track children’s emergent literacy and language development. Most standardized assessments are only designed to measure exceptionalities and are often inaccessible to parents and teachers. The BalanceAI Discovery digital assessment tool, on which her team is working with DSI developers, addresses this gap.

Ewan Dunbar from the Department of French, Faculty of Arts & Science is working with DSI software developers to create a web interface that allows speech researchers to upload audio files and download “speech features” useful for speech processing.

Gregory Schwartz, University Health Network, and his team identified rare cancer cells that may contribute to disease progression. He will work with DSI developers to better understand cellular heterogeneity by developing TooManyCells, a suite of tools for clustering and visualizing single-cell data.

Laura C. Rosella, Dalla Lana School of Public Health, and Birsen Donmez, Department of Mechanical and Industrial Engineering, Faculty of Applied Science and Engineering, will be working with DSI developers to apply Human Factors Engineering methods to build a user-friendly decision support tool for the Chronic Disease Population Risk Tool (CDPoRT). CDPoRT was developed and validated using population-level health system data to predict the future burden of chronic diseases.

Applications open for Data Access Grants

Grants of up to $10,000 are available to cover costs associated with accessing and working with large data sources. These DSI grants aim to improve data accessibility for data science researchers and foster research by mitigating the high cost of access to data sets. We believe that equitable access to resources is crucial for creating a diverse and inclusive environment.

Deadline for applications: April 29

Applications open for Seed Funding for Methodologists

This Seed Funding is designed to encourage new collaborations between data science methodologists and theorists and applied researchers. Single applicants working in data sciences methodology or theory can apply. An applicant’s research area should focus on data sciences methodology or theory with the potential to be relevant to applied fields.

Applicants will present their research and methodology/theory, including its potential for applied fields, at a seminar. Successful applicants can use funds of up to $10,000 over 8 months to seed a new Collaborative Research Team with the aim of applying for a DSI Catalyst Grant.

Deadline for applications: April 14