Open Access COVID Datasets and Software

“Sharing vital information across scientific and medical communities is key to accelerating our ability to respond to the coronavirus pandemic,” said Dr. Cori Bargmann, Head of Science at the Chan Zuckerberg Initiative, regarding a call to action to develop new text and data mining techniques that can help the science community answer high-priority scientific questions related to COVID-19.

Over the past few weeks, two notable resources have been made available, providing open access to COVID datasets and related software:

COVID-19 Open Research Dataset (CORD-19)

  • Description: Scholarly literature about COVID-19, SARS-CoV-2, and the Coronavirus group. The dataset represents the most extensive machine-readable Coronavirus literature collection available to date with 44,000 articles, including over 29,000 with full text (as of March 24, 2020).
  • Use: For data and text mining. See dataset license.
  • Currency: Will be updated weekly as new research is published in peer-reviewed publications, preprints, etc.
  • Creator: The Allen Institute for AI in partnership with the Chan Zuckerberg Initiative, Georgetown University’s Center for Security and Emerging Technology, Microsoft Research, and the National Library of Medicine at the National Institutes of Health, in coordination with The White House Office of Science and Technology Policy.
  • Related call to action: Artificial intelligence experts are called on to participate in the COVID-19 Open Research Dataset Challenge (CORD-19). This challenge asks researchers to use the dataset to develop new text and data mining techniques to answer high-priority scientific questions (outlined in the tasks section) and submit any tools and insights they develop via the Kaggle platform.

MIDAS Online Portal for COVID-19 Modeling Research

  • Description: Clearinghouse for sharing data-driven discoveries about COVID-19. Includes public-access data collections with documented metadata, published estimates of epidemiological characteristics (both peer-reviewed and not), and software for visualization, dashboards, modeling, and data processing.
  • Use: To develop computational models. As noted in a UPMC Inside Life Changing Medicine article: “Scientists are using this data to calculate important features of the disease, such as how infectious the virus is and how long it takes before an infected person becomes contagious.”
  • Currency: Community members are encouraged to contribute resources to the repository. See the information for contributors section of the GitHub repository for guidance on how to contribute material.
  • Creator: Each dataset includes standard metadata outlining who collected the data, when, where, and how. Many datasets are from MIDAS member researchers (MIDAS stands for Models of Infectious Disease Agent Study). The MIDAS Coordination Center (MCC) is located at the University of Pittsburgh.
  • This resource is included in the Pitt Data Catalog.

If you have questions about locating datasets, send an e-mail to HSLS Data Services.

~Melissa Ratajeski