Data Sciences Institute

New professional certificates will help learners upskill for careers in data analytics and applied machine learning

New professional certificates in Data Science and Machine Learning Software Foundations, launched by U of T’s Data Sciences Institute and powered by Upskill Canada, will prepare workers for success in these fast-growing fields. Photo: skynesher via Canva, Getty Images.

by Tyler Irving

A new training initiative launched by the University of Toronto’s Data Sciences Institute (DSI) is helping Canada meet its growing need for talent in data science and machine learning.

Applications for the DSI Data Science and Machine Learning Software Foundations Certificates opened in October to strong demand. DSI is now gearing up for a second session, scheduled to commence on January 15.

By 2026, digital literacy is projected to be essential for 90 per cent of jobs in Canada.

The certificates offer affordable, flexible and rigorous upskilling opportunities, designed for learners with a university or college degree or diploma and three or more years of work experience.

Prospective DSI Certificate participants can be employed or actively seeking employment and do not need experience or education in the field of data science. These certificates are accessible to individuals from all backgrounds, and do not require prior affiliation with the University.

The certificates are powered by Upskill Canada, a national initiative run by Palette Skills and funded by Innovation, Science and Economic Development Canada (ISED). Upskill Canada is designed to meet the talent needs of high-growth sectors while building a more inclusive economy.

Supported by funding from Innovation, Science and Economic Development Canada’s (ISED) Upskilling for Industry Initiative, more than 15,000 Canadian workers will benefit from an innovative approach to skills training. Central to the Upskill Canada initiative is the role of community training providers, who work closely with local and national employers to identify precise suites of skills being sought by industry. Equipping workers with these skills will create new career pathways for Canadians and better position Canadian companies to compete both domestically and internationally.

“What we’re hearing from our partners in industry is that targeted training in key areas can greatly increase the available talent pool in this fast-moving sector,” says Lisa Strug, Academic Director of the Data Sciences Institute and Professor in the Departments of Statistical Sciences and Computer Science (Faculty of Arts & Science) and the Division of Biostatistics (Dalla Lana School of Public Health) at U of T. Strug is also a Senior Scientist at The Hospital for Sick Children.

“We’re pleased to be able to leverage U of T’s leadership in machine learning and data sciences to provide new opportunities for workers in the digital economy.”

“Through the industry advisory group, prospective employers like Thomson Reuters are actively engaging with the Data Sciences Institute as they develop learning opportunities that address the evolving data science and machine learning demands across small, medium, and large-sized enterprises,” says Carter Cousineau, Vice President, Data and Model (AI/ML) Governance and Ethics, Thomson Reuters.

“This collaborative approach helps ensure learners gain the necessary skill sets to pursue new roles, or identify opportunities for advancement, in this swiftly changing landscape.”

Both certificates cover foundational concepts in data science and machine learning and provide opportunities for practical application through employer case studies. Each certificate also includes sessions dedicated to career advancement, from support for resume writing to networking and interview skills development.

The technical and job readiness programming will be delivered as online modules with in-person and hybrid opportunities for professional networking. Certificate recipients will be well positioned for roles such as data analysts, data managers or applied machine learning analysts.

The courses and job readiness sessions are offered part-time, allowing learners time to balance existing commitments and still accomplish their career goals. Over the course of the next two years, five cohorts of learners are expected to complete the 16-week certificates. Initially, the training will be offered to learners at a substantially reduced rate of $425 (+HST) per certificate, thanks to the support of Upskill Canada. The DSI has also committed accessibility funding for those with financial need.

“We’re so proud to formally launch Upskill Canada with our inaugural class of workers and training service providers,” says Rhonda Barnet, CEO of Palette Skills, which was chosen by ISED to run the Upskill Canada initiative.

“This is a big first step – but it’s only the beginning. We’re looking forward to working with our supporters in government and industry to upskill many more Canadians so they can transition into high-demand roles in the modern workforce – and help fast-growing companies achieve their full potential.”

Data Sciences Institute Supported Research Reveals How Automating Food Analysis Can Improve Health Policy

by Sara Elhawash

When purchasing foods, many consumers give food labels cursory scans, taking in information such as calorie levels or sodium content. Why is streamlining the analysis of this information crucial from a public and policy perspective?

Creating and maintaining the databases needed by researchers and others to establish food policies and monitor the food supply is a significant task. This involves classifying and analyzing hundreds of thousands of foods, a process that is typically done manually and infrequently. 

Guanlan Hu, Postdoctoral Fellow in the Department of Nutritional Sciences (Temerty Faculty of Medicine, U of T), is on a mission to simplify this complex process. Her research explores the use of pre-trained language models and supervised machine learning to analyze unstructured food label text, thereby streamlining food categorization and other important classification tasks. Among her primary goals is to revolutionize the understanding and categorization of ultra-processed foods (UPFs), particularly for the benefit of the public and policy makers. Her aim is to improve public health and streamline the analysis of food, underscoring the broader impact and significance of her research. 

Supervised by Professor Emerita Mary R. L’Abbé (Temerty Faculty of Medicine, U of T), and co-authored by Postdoctoral Fellow Mavra Ahmed and PhD student Nadia Flexner, Hu’s presentation at the DSI Research Day signals a shift in the landscape of food classification and health policy.  

“Using cutting-edge language models and machine learning, we’ve automated food categorization, nutrition quality scoring and food processing level classification,” says Hu. “This streamlines food analysis and holds promise for swift, scalable monitoring of the global food supply, particularly in identifying ultra-processed foods.” 

Leveraging pre-trained language models and the XGBoost multi-class classification algorithm, Hu’s methodology achieved an impressive accuracy score of 0.98 in predicting both major and sub-category classification of foods, outperforming traditional bag-of-words methods and presenting a powerful tool for efficiently determining food categories and food processing levels.  

“The research holds the potential to expedite the monitoring and regulation of ultra-processed foods in the global food supply, offering a transformative impact on public health and regulatory practices,” says Professor L’Abbé. 

This research is part of a DSI Catalyst Grant project, “Using deep learning and image recognition to develop AI technology to measure child-directed marketing on food and beverage packaging and investigate the relationship between marketing, nutritional quality and price,” awarded to L’Abbé and Professors David Soberman (Joseph L. Rotman School of Management), Laura Rosella (Dalla Lana School of Public Health), and Steve Mann (Edward S. Rogers Sr. Department of Electrical & Computer Engineering, Faculty of Applied Science & Engineering). The Collaborative Research Team includes trainees such as Hu.

By refining food analysis and offering a better method for policymakers to monitor and regulate UPFs, Hu especially hopes to improve public health and dietary understanding in countries where highly processed foods contribute significantly to daily energy intake, such as Canada, the United States and Argentina, where Hu has applied her work. 

Her just-completed research, though, is simply a first step. “Much like the continual evolution of technology,” says Hu, “our work demands continuous development and evolution in this pioneering field.” 

In the meantime, Hu’s work underscores the potential of machine learning and natural language processing in nutrition sciences and the interdisciplinary nature of such breakthroughs, reflecting the importance of Data Sciences Institute grants in fostering collaborative research.

As a collaborative community, the DSI promotes innovation and facilitates the exchange of ideas, connecting diverse groups of researchers and trainees spanning various disciplines. One of the many ways that trainees can get involved is through the DSI’s Postdoctoral Fellowship, designed to support multi- and interdisciplinary training and collaborative research in data sciences.

The Interdisciplinary Work Forging a Path between Causal Inference and Policy

by Kate Baggott

“Causal inference is hard.”  

That’s not a conclusion. It’s an observation Rahul G. Krishnan was brave enough to make at the Forging a Path: Causal Inference and Data Science for Improved Policy Workshop on November 10th to over 100 faculty, students and participants from organizations.  

The difficulty of causal inference is not a matter of methodological rigour or reporting. The difficulty comes from the interdisciplinary nature of the process. The community doing causal inference is not one community, Krishnan reminded those present. Rather, causal inference is a process that different communities (biostatisticians, economists, epidemiologists, computer scientists and data scientists, among others) engage in to make decisions and form policies.

“Among these communities, different language is used to describe the same phenomenon,” Krishnan said.

The workshop was created to bring together practitioners of multiple disciplines who are employing a variety of methodologies. The Data Sciences Institute funds the Causal Inference Emerging Data Science Program and held the workshop in collaboration with the Forward Society (FOS) Lab. The program was initiated by the University of Toronto’s Linbo Wang (Department of Statistical Sciences, University of Toronto Scarborough), Gustavo J. Bobonis (Department of Economics, Faculty of Arts & Science), Ismael Mourifié (Department of Economics, Faculty of Arts & Science), and Raji Jayaraman (Department of Economics, Faculty of Arts & Science). The workshop was the first of three workshops and a seminar series planned over the next two years of the emerging data science program.

The challenge put to participants was not to create a common language, but to create a shared understanding of how to manage the reams of data collected on human activity and explain it to help policymakers improve their decision-making in all areas, from public health to education and from social security to law and justice.

Throughout the presentations from practitioners, there was an emphasis on description, shared definitions, and clear communication when working with decision-makers. 

Econometrician and empirical microeconomist Alberto Abadie (MIT Economics) talked about estimating the value of evidence-based decision-making (EBDM) itself in his keynote presentation.  

“Despite the ubiquity of EBDM, we are unaware of empirical tools that organizations can use to assess the value of their EBDM practices,” he reminded attendees of the workshop. “Part of the challenge in evaluating the value of EBDM is that it requires a description of what organizations will do with and without various amounts of evidence that they can choose to generate at some cost.” 

Professor Elizabeth Halloran (Fred Hutchinson Cancer Center) is a world leader in using mathematical and statistical methods to study infectious diseases and a pioneer in the design and analysis of vaccine studies.

“Important examples of global public health policies where causal inference with interference can make a difference include vaccines and vaccination programs,” she reminded participants.  

Causal estimates demonstrating indirect effects of intervention programs, she said, can make policies in all fields more cost-effective. 

The workshop concluded with a student-led roundtable discussion where Vahid Balazadeh, Sonia Markes, Stephen Tino, Dario Toman, and Atom Vayalinkal outlined next steps in the efforts to bring together causal inference and data sciences communities. 

Data Sciences Institute Nurturing a Future-Ready Workforce 

by Sara Elhawash

What are the key skills and qualities required for successful professionals in today’s rapidly evolving data science landscape and how do they inform training? 

To address this important question and understand the needs of organizations, the Data Sciences Institute (DSI) invited industry and non-profit leaders to the Data Science for an Effective Workforce Panel at our Research Day earlier this fall. The panel featured experts from diverse sectors, including Marc Fiume (Co-Founder & CEO, DNA Stack), Ann Meyer (Director, BioInnovation Scientist Program, adMare BioInnovations), Dana Ohab (Associate Partner, Digital & Emerging Technology, EY), and Yves Jaques (Chief, Frontier Data & Tech Unit, UNICEF).

Engaging with data science leadership is key to our understanding of the essential skills, both soft and hard, that employers are looking for in a data-driven decision-making world. The DSI’s newly launched Data Science and Machine Learning Software Foundations Certificates have been shaped by such input from employers.

During the panel discussions and Q&A from participants, the demands of the industry came to the forefront, with panelists providing valuable insights and a roadmap for data science professionals. It was clear that an understanding of data science and continuous learning are key for a wide range of professional fields.  

“There hasn’t been a more exciting time to be in data and data science. What we are seeing is the expectations of our clients have fundamentally changed, the world we work in today has been moving faster and is more tailored than ever seen before,” stated Dana Ohab. 

Ohab also emphasized that building a community of practice and forming strategic partnerships is a blueprint many use for staying relevant in the industry. Her advice underscored the need for continuous learning and networking to remain at the forefront of data science.

In addition to technical skills, soft skills or job-ready skills are critical. “Data science is a dynamic field that requires more than just technical skills. It’s about effective communication, adaptability, and the ability to bridge the gap between complex technical expertise and real-world business understanding. The Data Science and Machine Learning Certificates at the Data Sciences Institute aim to equip learners with these essential skills, ensuring they are not only data-savvy but also capable of making a meaningful impact in a constantly evolving landscape,” says Ann Meyer. 

Yves Jaques emphasized the value of data science in driving positive change: “We are building capacity globally to identify local solutions and talent. We take a community first response and look at the ethical implications of how data is used globally. We leverage partnerships to bring real time results.” 

Marc Fiume shared the inspirational story of his best friend, Dan, who battled cystic fibrosis due to mutations in his CFTR gene. This story served as the driving force behind DNA Stack’s mission, which aims to “save and improve the lives of people like Dan, by harnessing the collective power of the world’s genomics and health data.” 

He stressed that the future of genomic medicine would be powered by data scientists, signifying the critical role of data science in addressing these healthcare challenges. 

The panel discussion, and continuing input from data science leaders, enable the DSI to serve as a unique and enriching bridge to connect researchers with organizations in order to offer cutting-edge, in-demand training. The certificates offer an exclusive opportunity to learn from industry experts through case study components, providing invaluable insights into the professional world of data science.

To watch the video recording of the panel, click here.   

The DSI Data Science Certificate and Machine Learning Software Foundations Certificate are tailor-made for professionals with no prior technical background who aspire to excel in data science careers. In addition to technical skills courses, participants engage in job-ready skills sessions and networking opportunities to successfully enter, or further their careers in, the data sciences. Both continuing education certificates offer an exclusive opportunity to learn from industry experts through case studies. The cost for each certificate is $425. For information and to apply, click here.

Combining genetics and data science can help us understand why some people react more severely to COVID-19

Researchers from U of T and partner hospitals collaborated with others from across Canada and around the world to identify genetic variants associated with more severe COVID-19 outcomes.

by Tyler Irving

Why do some people have a more severe course of COVID-19 disease than others? A database created by an international collaboration of researchers — including many from the University of Toronto and partner hospitals — may hold the answers to this question, and many more.

In late 2019 and early 2020, reports of a novel form of coronavirus started emerging, first from China, then from many other locations across the globe. Lisa Strug, Senior Scientist at The Hospital for Sick Children (SickKids) and Academic Director of U of T’s Data Sciences Institute, remembers what happened next.

“In my research, I use data science techniques to map the genes responsible for complex traits,” says Strug, who is a Professor in the Departments of Statistical Sciences and Computer Science in the Faculty of Arts & Science at U of T and in the Biostatistics Division of the Dalla Lana School of Public Health. She is also the Associate Director of SickKids’ Centre for Applied Genomics, which is one of three sites across Canada that form CGEn, Canada’s national platform for genome sequencing infrastructure for research.

“We knew that genes were a factor in the severity of previous SARS infections, so it made sense that COVID-19, which is caused by a closely related virus, would have a genetic component too. Very early on, I started getting messages from several scientists who wanted to set up different studies that would help us find those genes.”

Over the next few months, Strug collaborated with nearly 100 researchers from across U of T and partner hospitals and institutions, as well as other researchers from across Canada, to enrol individuals with COVID-19 and sequence their genomes.

Some of the key team members from the Toronto community included:

  • Stephen Scherer, Chief of Research at SickKids Research Institute and a University Professor in the Temerty Faculty of Medicine at U of T, as well as Director of the U of T McLaughlin Centre;
  • Rayjean Hung, Associate Director of Population Health, Lunenfeld-Tanenbaum Research Institute and a Professor in the Dalla Lana School of Public Health at U of T;
  • Angela Cheung, Clinician Scientist at University Health Network, Senior Scientist at Toronto General Hospital Research Institute, and a Professor at Temerty Medicine;
  • Upton Allen, Head of the Division of Infectious Diseases at SickKids and a Professor at Temerty Medicine.

Partner hospitals and institutions included:

  • The Hospital for Sick Children
  • Lunenfeld-Tanenbaum Research Institute
  • Mount Sinai Hospital
  • St Michael’s Hospital, Unity Health Toronto
  • Princess Margaret Cancer Centre
  • Ontario Institute for Cancer Research
  • University Health Network
  • Women’s College Hospital
  • Toronto General Hospital
  • Baycrest Health Sciences

Together with researchers at other universities, hospitals and research institutions across Canada, the team eventually created what came to be known as CGEn HostSeq — Canadian COVID-19 Human Host Genome Sequencing Databank.

In an effort initiated by Dr. Scherer and CGEn’s Naveed Aziz, with Dr. Strug, a $20M grant was secured from Innovation, Science and Economic Development Canada, administered through Genome Canada.

“We had to go right to the top to get this project funded fast, and our labs and teams worked seven days a week on the project right through the pandemic,” Scherer recalls.

Identifying associations between individual genes and complex traits typically requires thousands of genomes, both from those with the trait and those without. Though there was no shortage of cases to choose from, it was critical to gather and sequence DNA and to organize the data in a way that would be ethical, efficient and useful to researchers now and in the future.

“One of our key mandates at the Data Sciences Institute is developing techniques and programs that ensure that data remains as open, accessible and reproducible as it can be,” says Strug.

“That vision was brought to bear as we assembled the data infrastructure for this project: for example, ensuring that consent forms were as broad as possible, so that this data could be linked with other sources, from electronic medical records to other health databases.”

“We wanted to be sure that even after the COVID-19 pandemic was over, this could be a national whole genome sequencing resource to ask all kinds of questions about health and our genes. The development of the database and its open nature also enabled Canada to collaborate effectively with similar projects in other countries.”

In the end, the project gathered more than 11,000 full genome sequences from across Canada, representing patients with a wide range of health outcomes. Those data were then combined with even more sequences from patients in other countries under what came to be called the COVID-19 Host Genetics Initiative.

It didn’t take long for patterns to start to emerge. A paper published in Nature in 2021 identified 13 genome-wide significant loci that are associated with SARS-CoV-2 infection or severe manifestations of COVID-19.

Since then, even more data have been added, and subsequent analysis has confirmed the significance of existing loci while also identifying new ones. The most recent update to the project, published in Nature earlier this year, brings the total number of distinct, genome-wide significant loci to 51.

“Identification of these loci can help one predict who might be more prone to a severe course of COVID-19 disease,” says Strug.

“When you identify a trait-associated locus, you can also unravel the mechanism by which this genetic region contributes to COVID-19 disease. This potentially identifies therapeutic targets and approaches that a future drug could be designed around.” 

While it will take many more years to fully untangle the effects of the different loci that have been identified, Strug says that the database is already showing its worth in other ways.

“It can be difficult to find datasets with whole genome sequence and approved for linkage with other health information that are this large, and we want people to know that it is open and available for all kinds of research, well beyond COVID, through a completely independent data access committee,” she says.

“For example, several investigators from across Canada have been approved to use these data and we’ve even provided funding to trainees to encourage them to develop new data science methodologies or ask novel health questions using the CGEn HostSeq data.”

“This was a humongous effort, where researchers from across Canada came together during the COVID-19 pandemic to recruit, obtain and sequence DNA from more than 11,000 Canadians, in a systematic, cooperative, aligned way to create a made-in-Canada data resource that will hopefully be useful for years to come. I think that was really miraculous.”