PubMed Central Article Datasets Are Now Available in the Cloud

The National Library of Medicine (NLM) recently announced that two PubMed Central article datasets are openly available in the cloud. This news is of particular interest to researchers who use text mining or other forms of secondary analysis.

PubMed Central (PMC) is a free full-text archive of biomedical and life sciences journal literature from the U.S. National Institutes of Health’s National Library of Medicine (NIH/NLM). For nearly two decades, NLM has supported the retrieval and download of machine-readable open access journal articles through the PMC Open Archives Initiative (PMC-OAI) service and File Transfer Protocol (FTP). To enhance access, these datasets are now also available on the Amazon Web Services (AWS) Registry of Open Data as part of AWS’s Open Data Sponsorship Program (ODP). Benefits of working with the datasets in the cloud include access to uncompressed individual full-text article files in XML and plain text, as well as faster download and transfer speeds.

In summary, PMC Article Datasets housed on AWS include:

  • The PMC Open Access (OA) Subset: includes all articles and preprints in PMC with a machine-readable Creative Commons license that allows reuse (more than 3.4 million to date).
  • The Author Manuscript Dataset: includes accepted author manuscripts collected in PMC under a funder policy and made available in machine-readable formats for text mining (more than 700,000 to date).

Full details of the datasets are available on the PMC Article Datasets page.

Of note: the datasets are updated daily. In addition to full-text articles, they contain corrections, retractions, and expressions of concern, as well as file lists that include metadata for the articles in each dataset.
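
For readers who want a sense of what working with the individual files and file lists might look like, here is a minimal Python sketch. It is an illustration under stated assumptions, not official NLM or AWS guidance: the bucket name and region are assumptions to verify against the PMC Article Datasets page and the AWS Registry of Open Data, and the helper functions (object_url, parse_file_list, fetch_text) are hypothetical names introduced here. The file-list parser assumes only that the list is a CSV file with a header row; it does not assume any particular column names.

```python
import csv
import io
import urllib.parse
import urllib.request

# Assumptions: bucket name and region for the PMC datasets on AWS.
# Confirm both against the official documentation before use.
BUCKET = "pmc-oa-opendata"
REGION = "us-east-1"


def object_url(key: str, bucket: str = BUCKET, region: str = REGION) -> str:
    """Build the public HTTPS URL for an object key in an S3 bucket."""
    # quote() leaves "/" intact and percent-encodes spaces and similar characters.
    return f"https://{bucket}.s3.{region}.amazonaws.com/{urllib.parse.quote(key)}"


def parse_file_list(csv_text: str) -> list[dict[str, str]]:
    """Parse a CSV file list with a header row into one dict per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Download a publicly readable object and decode it as UTF-8 text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a real object key taken from the dataset documentation, fetch_text(object_url(key)) would retrieve a single article or file list without AWS credentials, provided the bucket allows anonymous reads, and parse_file_list would turn a downloaded file list into rows that can be filtered by whatever metadata columns it actually contains.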

Getting-started documentation for using the datasets is available via AWS. Send questions or concerns about the datasets to pubmedcentral@ncbi.nlm.nih.gov.

~Melissa Ratajeski