Data Sciences Institute (DSI)

Building data science software to help the fight against cancer

Tumors, much like people, are different from one another. In fact, not only can the same type of tumor vary from person to person, but there can also be variations within the tumor itself, as a single tumor comprises a diverse population of cells. This tumor heterogeneity makes it difficult for researchers to create effective treatment plans. This is where Dr. Gregory Schwartz and his team at the University Health Network and Medical Biophysics at the University of Toronto come in, with the help of the Data Sciences Institute’s (DSI) research software development support program.

Interested in applying for the DSI’s research software development support program? Apply by October 21, 2022, for our next round of applications. 

The DSI’s software development program is designed to support faculty and scientists by providing access to highly skilled software developers who can refine or enhance existing software, improve usability and robustness, build new tools, and disseminate research software. The DSI has supported six projects since its first call. The DSI’s senior software developer, Dr. Conor Klamann, worked with Schwartz and his team.

Helping understand cellular heterogeneity in cancer with TooManyCells

Genetic heterogeneity within a tumor occurs due to imperfect DNA replication. When healthy cells divide to create new cells, mutations can arise. When cancerous cells divide, mutations can also occur, causing tumor heterogeneity. However, these diverse populations of cells can also exhibit non-genetic heterogeneity in response to treatment, changing their behaviour based on their surrounding environment independent of mutation. To measure cell behaviour at the resolution of individual cells, researchers are using new single-cell technologies. These technologies produce massive amounts of detailed data, which in turn require sophisticated computational tools to interpret.

To better understand heterogeneity and drug resistance in cancer, Schwartz and his team developed TooManyCells, a suite of tools designed for clustering and visualizing single-cell data. The visualization component of TooManyCells is custom-made and presents cell relationships as a tree. By using TooManyCells, the team could identify rare cancer cells that were contributing to disease progression.

However, the software had some limitations.  

“The limitation of TooManyCells was that it took time to build a tree. These trees can be quite large, so to visualize major cell populations you would have to prune the tree several different ways and rerun the program repeatedly. You also didn't really know which way was the right way to prune the tree and colour it until you saw the output,” says Schwartz. “So that’s where this opportunity to work with the DSI’s research software development support program came in.”

“It's wonderful to have a fantastic software developer like Conor devoting his time to facilitating these kinds of projects, which are not easy to get off the ground. They are absolutely necessary and required in these fields but have surprisingly few funding opportunities. So, it's fantastic that these kinds of avenues exist,” says Schwartz about the program.

How is the project developing?

 

TooManyCells tree.

The goal of this project was to provide a graphical user interface for the analysis tools that Schwartz and his team developed. The details have evolved with time, but creating an interactive tool to speed up analyses and improve user experience has always been at the heart of the project. Currently, the software development team at the DSI has a prototype in place and is collecting user feedback. The research team is also preparing an article describing the software, and once it has been completed, the source code will be made public on the Schwartz Lab GitHub page so that other researchers may access it.

“It's been a pleasure working on TooManyCells! It's given me the opportunity to combine various programming frameworks in ways I haven't done before while supporting some very interesting research,” says Conor Klamann, DSI senior software developer.

Conquer the world of data science with the DSI Data Science Certificate

The world runs on data — and a new certificate is set to help people develop the skills they need to become leaders in the field.

The Data Sciences Institute (DSI) at the University of Toronto has launched a Data Science Certificate to help professionals gain essential job-ready skills that open doors to advancement and new employment opportunities.

“The University of Toronto is a global leader in data sciences,” says Lisa Strug, director of the DSI, Professor of Statistical Sciences, Computer Science and Biostatistics and senior scientist at The Hospital for Sick Children. “The demand for skilled, fluent and adaptable data science expertise is expanding. To keep pace with the scale of change, the DSI has an opportunity to lead in the shift from a knowledge-based to a learning-based model where upskilling is an ongoing opportunity for learners and no job opportunity is ever out of reach.”

Estimates suggest that 2.5 quintillion bytes of data are generated every day. It’s not surprising that professionals increasingly find that data science skills are in demand. Society is experiencing a transformative shift in the production, collection and use of data. As a result, organizations need skilled professionals capable of analyzing large amounts of data, uncovering valuable insights and defining the story hidden in the numbers.

Previous experience with data science isn’t needed to apply. The only prerequisite for the certificate is a degree in a field outside of computer science or statistics.

Why is the DSI offering this certificate?

The DSI is a central hub and incubator for data science research, training and partnerships at U of T. The DSI is accelerating the impact of data across disciplines to address pressing societal issues and drive positive social change. Training is an integral component of the DSI’s mission, aligned with the University’s aim to support lifelong learning.

Learn from private-sector experts

The DSI Data Science Certificate offers the unique opportunity to learn from private-sector experts during the case studies in each course. The case study component provides learners with important insights into the professional world of data science analytics.

“The DSI Data Science Certificate is built around a series of core courses essential to establishing a strong foundation in data science. These courses are designed to take someone without data science expertise and give them the confidence to excel in any data-driven field. It also includes case studies from leading experts. We are very excited to be launching this certificate and have big plans to expand our offerings in the future,” says Rohan Alexander, assistant professor in the Faculty of Information and Department of Statistical Sciences.

In addition, the certificate offers busy professionals flexibility. The certificate is fully online, and learners can choose a single course to improve their skills in a specific area or earn a full certificate by taking six of the eight courses offered. The courses are designed to ensure learners master the core competencies in foundational data science, including SQL, R and Python, and gain hands-on experience through real-world case studies.

What pilot participants are saying

The DSI ran a successful set of course pilots with over 100 learners over the summer.

“For a beginner, I found that it provided an amazing overview! The flow was well-paced. It was a lot of information at once sometimes, but I was able to manage as I could go back and review items when off class time. The sequence of the course material makes complete sense as you move forward in the course. It all tied in together,” says one participant.

“Instructors were very knowledgeable, helpful and engaging! Good class size; also attracted collaborative and enthusiastic students with a variety of competencies. It was very helpful to be asked to ask questions in the public chat, which encouraged collegiality,” says another participant.

DSI welcomes Unity Health Toronto as a partner

The Data Sciences Institute (DSI) strives to collaborate with organizations that want to engage and support world-class researchers, educators, and trainees working to advance data science. We are excited to announce a new partnership with Unity Health Toronto.

Unity Health Toronto, which comprises Providence Healthcare, St. Joseph’s Health Centre, and St. Michael’s Hospital, works to advance the health of everyone in its urban communities and beyond. The health network serves patients, residents and clients across the full spectrum of care, from primary care, secondary community care, and tertiary and quaternary care services to post-acute care through rehabilitation, palliative care and long-term care, while investing in world-class research and education.

“The pandemic has helped shine an important light on how data science can help us plan, understand and evaluate responses to global health crises and ultimately create the best care experiences for our patients and those beyond our walls. The value of health research has never been clearer than it is now. At Unity Health, we are a leader in the use of data and advanced analytics in healthcare delivery and research. Partnering with the DSI will enable Unity Health to continue to harness the power of data science to improve care. This collaboration will enhance our work with our partners to apply big data to advance the health of our communities locally, nationally and globally,” says Dr. Ori Rotstein, vice-president of Research and Innovation at Unity Health Toronto.

The DSI fuels innovation and fosters the exchange of ideas, connecting a diverse community of researchers and trainees who represent a wide array of disciplines. By connecting data science researchers, data and computational platforms, and external partners, the DSI advances research and nurtures the next generation of data science researchers. Because Unity Health is one of our external funding partners, its researchers can apply for research grants and support, training, and networking opportunities at the DSI.

“The DSI is thrilled to announce this partnership. We are very excited to be expanding our research community. We are committed to building a hub of data science researchers that can accelerate the impact of data across disciplines to address pressing societal issues and forward positive social change. We are ecstatic to have researchers from Unity Health join our data science community,” says Lisa Strug, DSI Director.

Bringing together the hammer and the nails – encouraging collaborations between methodologists and applied researchers

The Data Sciences Institute recently held a competition for Seed Funding for Methodologists. This funding is designed to catalyze new Collaborative Research Teams and encourage new partnerships between data science methodologists or theorists and applied researchers. Data science is inherently interdisciplinary, and building capacity in data science has the potential to advance research frontiers across a broad spectrum of fields.

“This competition was about uniting cutting-edge methodologists with applied researchers to form new collaborations. By presenting and bringing to the fore innovative methodological and theoretical work, our goal is to ensure that new Collaborative Research Teams are forged with new and unexpected connections,” says Michael Brudno, professor at the Department of Computer Science, Faculty of Arts & Science, and chief data scientist at the University Health Network.

“Imagine that you have this amazing new hammer that you spent ages perfecting. But you are missing the nails on which to use your hammer. This seed funding is about finding those nails,” says Eyal de Lara, professor at the Department of Computer Science, Faculty of Arts & Science.

Presenting the three inaugural methodologists 

Aya Mitani, from the Dalla Lana School of Public Health, is developing a methodology that applies multilevel matrix-variate analysis to longitudinally collected dental data while accounting for correlation. The unique correlation structure of teeth provides an excellent application area, and Mitani aims to connect with researchers and oral health practitioners to prevent and manage oral diseases with greater precision, improving oral and general health outcomes across populations by applying these new methods and tools.

Linbo Wang, from the Department of Computer and Mathematical Sciences at the University of Toronto Scarborough, is developing innovative tools to find causal relationships in observational and/or experimental datasets. These new tools will allow researchers to better understand the underlying causal mechanisms and help decision-makers make more informed decisions. The methods have broad potential for application.

Murat Erdogdu, from the Faculty of Arts and Science, Department of Computer Science and Statistical Sciences, is developing theoretical tools to compute the asymptotic generalization error of certain overparameterized estimators and characterize the convergence rate of overparameterized neural networks beyond the kernel regime. These new theoretical tools will enable researchers, across many application areas, to develop machine learning models more carefully, taking each model’s limitations into account.

Showcasing innovative data science methodologies   

One key deliverable for this award is that recipients present their methodology or theory with a focus on building new applied collaborations.

Join us on June 16 for a discussion on potential application areas as Mitani, Wang and Erdogdu present their innovative methodological techniques. We welcome applied researchers from any discipline interested in learning more about how these methodologies might be applicable to their research.  

Register today to learn more. 

Student teams investigate just how difficult it is to reproduce research

Reproducibility is essential for research. How can we know if a study is reliable if it is not reproducible or replicable? A researcher should be able to pick up any piece of published research and replicate it, provided they have the right materials.

But just how hard is it to reproduce research?  

Six student teams from across U of T, including the Faculty of Applied Science and Engineering, the Temerty Faculty of Medicine, the Dalla Lana School of Public Health, and the Faculty of Arts and Science, set out to answer this question by attempting to reproduce published analytic research. Each team presented its discoveries during the recent Student-Led Reproducibility Challenge, which aimed to raise awareness of reproducibility amongst students and emphasize that robust and reproducible processes are critical to maintaining confidence in research.

Reproducibility is a key Thematic Program for the Data Sciences Institute, and this challenge was pioneered by the Reproducibility co-leads Rohan Alexander, Faculty of Information and Statistical Sciences, Faculty of Arts & Science; Benjamin Haibe-Kains, University Health Network and Medical Biophysics at the Temerty Faculty of Medicine; and Jason Hattrick-Simpers, Faculty of Applied Science and Engineering.  

“We were very excited to see the high level of engagement, and everyone was impressed with the level of work and commitment to the challenge. We hope this is the first of many student-led reproducibility challenges and hackathons. It is so important that future researchers value reproducibility and carry these lessons forward in their work,” says Hattrick-Simpers, team captain for one of the student teams.

“The student teams had such great presentations. The level of commitment to thoroughness and dedication to reproducing the computational methods was impressive. Transparency, reproducibility, and replication in research are more important than ever. We need to be better at communicating the solutions and challenges we face as researchers,” says Prof. Aya Mitani, team captain for one of the student teams and assistant professor at the Dalla Lana School of Public Health.

The challenge of reproducibility 

The teams reported multiple cross-cutting challenges. For example, many teams had trouble finding publicly accessible versions of the necessary data. This was the case for Alyssa Schleifer and Hudson Yuen, whose chosen paper was about safety, health, and isolation in prisons. They reported that the original datasets were unavailable in many cases, with only pre-processed subsets accessible, which affected their ability to check for robustness or derive new findings.

Kimlin Chin, who replicated a paper concerning falling US birth rates, said that finding the right paper was difficult. “Finding a paper that had sufficient content to make an interesting reproduction paper and also came with a complete reproduction package, i.e., was not missing any code, data, or output files, was a challenge.”

Sanjot Grewal, Steve Jeoung, Walid Maraqa, and Mu Yang attempted to reproduce the results of a paper studying the effect of social distancing on COVID-19 cases in the US during the early days of the pandemic. The team discovered inconsistencies in the reproduced results and realized that the authors of the original paper had incorrectly documented a critical formatting and pre-processing step. Christie Lau, Laurie Lu, Ruiyan Ni and Emily So worked on a paper concerning predictive models of drug response and found that the lack of proper data resources and identification started the reproducibility process off on the wrong foot.

Having the right technology was also a challenge. Daniel Persaud, who worked on a paper describing a general-purpose machine learning framework for predicting properties of inorganic materials, experienced computational constraints since he had to rely on his personal computer.

“It was remarkably interesting to hear from students working in different research fields. Most of the obstacles they faced were quite universal, but some were specific to their field where programming languages and ways to access data differ. We have a lot to learn from each other and more importantly, we need to work together to improve transparency and reproducibility,” says Haibe-Kains, team captain for one of the student teams.

Full group from reproducibility challenge.

The importance of reproducibility in the workplace 

Holly Xie, Senior Applied Scientist – Machine Learning Products at Xero Accounting, and Chris Henry, Senior Economist at the Bank of Canada, spoke about the importance of reproducibility for organizations. Henry discussed its role at the Bank of Canada, highlighting the necessity of creating reproducible content and maintaining records for when new employees are onboarded.

“Reproducibility is important because things change over time. You need to know how things were done. People also come and go. Reproducibility is essential in helping new team members come up to speed,” he said during his presentation.

The student teams consisted of:

  • Alyssa Schleifer and Hudson Yuen, Challenge: Western, B. (2021). Inside the Box: Safety, Health, and Isolation in Prison
  • Kimlin Chin, Challenge: Kearney, M. S., Levine, P. B., & Pardue, L. (2022). The Puzzle of Falling US Birth Rates since the Great Recession 
  • Swarnadeep Chattopadhyay, Arsh Lakhanpal, and Olaedo Okpareke, Challenge: Kearney, M. S., Levine, P. B., & Pardue, L. (2022). The Puzzle of Falling US Birth Rates since the Great Recession 
  • Christie Lau, Laurie Lu, Ruiyan Ni and Emily So, Challenge: Ma, J., Fong, S. H., Luo, Y., Bakkenist, C. J., Shen, J. P., Mourragui, S., & Ideker, T. (2021). Few-shot learning creates predictive models of drug response that translate from high-throughput screens to individual patients 
  • Sanjot Grewal, Steve Jeoung, Walid Maraqa, and Mu Yang, Challenge: Siedner, M. J., Harling, G., Reynolds, Z., Gilbert, R. F., Haneuse, S., Venkataramani, A. S., & Tsai, A. C. (2020). Social distancing to slow the US COVID-19 epidemic: Longitudinal pretest-posttest comparison group study 
  • Daniel Persaud, Challenge: Ward, L., Agrawal, A., Choudhary, A., & Wolverton, C. (2016). A general-purpose machine learning framework for predicting properties of inorganic materials