NCBI Datasets: Making Genomic Data Download Easy

There are challenges with downloading genomic data. File sizes are large, and it can be time-consuming to retrieve multiple files. Sometimes downloads fail. A custom script may be required. Fortunately, a solution to all of these frustrations is now available: NCBI Datasets.

This experimental resource allows users to easily download eukaryotic genome sequence and annotation data by assembly accession, taxonomic name (scientific and common), or taxonomy ID. The web interface allows for browsing by organism, with the most common experimental species conveniently available from the main page. For example, try selecting the house mouse (Mus musculus), then select all 22 associated assemblies. Options for the type of data for the download include genomic, transcript, and protein sequences as well as annotation features. Continue reading
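For readers who prefer scripting to the web interface, the same kind of request can be sketched in Python. This is a minimal illustration, not the method described in the post: it assumes the NCBI Datasets command-line tool (`datasets`) is installed and that your version accepts the `download genome accession|taxon` subcommands with the `--include` and `--filename` options, which have changed between releases. The helper name `build_datasets_command` is hypothetical, and the function only assembles the command; it does not run it.

```python
from typing import List, Sequence


def build_datasets_command(
    kind: str,
    value: str,
    include: Sequence[str] = ("genome",),
    filename: str = "ncbi_dataset.zip",
) -> List[str]:
    """Assemble an argument list for the NCBI Datasets CLI (hypothetical helper).

    kind    -- "accession" for an assembly accession, or "taxon" for a
               taxonomic name (scientific or common) or a taxonomy ID
    value   -- the accession, name, or ID to look up
    include -- data types to request; the accepted values are an assumption
               about the installed CLI version
    """
    if kind not in ("accession", "taxon"):
        raise ValueError("kind must be 'accession' or 'taxon'")
    if not include:
        raise ValueError("include must list at least one data type")
    return [
        "datasets", "download", "genome", kind, value,
        "--include", ",".join(include),
        "--filename", filename,
    ]


# Example: request genomic, transcript, and protein sequences plus
# annotation features for the house mouse, looked up by taxonomic name.
mouse_command = build_datasets_command(
    "taxon",
    "mus musculus",
    include=("genome", "rna", "protein", "gff3"),
    filename="mouse.zip",
)
print(" ".join(mouse_command))
```

Passing the resulting list to `subprocess.run(mouse_command, check=True)` would start the download. Building the command as a list rather than a single string avoids shell-quoting problems with names that contain spaces.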

Build a Better Research Process with HSLS Data Services

Across the diverse fields served by the Health Sciences Library System, one thing is universal: good science depends on good data. Whether you are embarking on your first research project or have dozens of completed studies under your belt, the HSLS Data Services team is here to help you improve the efficiency and reliability of your data-handling workflows at every step in the research process. We offer consultations, classes, and customized trainings on data topics including:

  • Organizing and describing files and data—always an important practice, but especially critical at a time when many researchers are working in multiple locations, on distributed teams, or on multiple computers and file servers. These workshops are also recommended for new graduate students to set themselves up with good habits from the beginning.
  • Writing a data management plan for funders and publishers, including pre- and post-submission review using DMPTool.

Continue reading

Share with Flair with FAIR-Aware

Whether you’re new to the conversation about open science or a longtime supporter of sharing and reusing research data, the FAIR guidelines for making data findable, accessible, interoperable, and reusable establish a basic set of principles for all practitioners who wish to make their research more reproducible. The “how” of doing so varies greatly among fields, modes of research, and investigators’ goals, however, so figuring out the first actions to take to make your research products more FAIR can pose a challenge. A new online tool from the FAIRsFAIR project aims to help researchers think through each FAIR principle and demystify related jargon with FAIR-Aware, a self-guided questionnaire with extensive explanatory guidance for the concrete steps involved in making data FAIR. Continue reading

COVID-19: HSLS Portals for Data and Molecular Biology Resources

HSLS Data Services and the Molecular Biology Information Service created online portals to help researchers quickly find the information they need to address questions about SARS-CoV-2 and COVID-19.


The Data Management: COVID-19 Research Data guide includes lists of general and clinical repositories. These linked resources are COVID-19-specific portals for sharing, discovering, reusing, and citing COVID-19 data and code.

The HSLS MolBio COVID-19: resources guide includes categories of linked resources: Trending Research Articles, Research Article Collections, Information Hubs, Molecular Data, and Webinars & Videos. Continue reading

Open Access COVID Datasets and Software

“Sharing vital information across scientific and medical communities is key to accelerating our ability to respond to the coronavirus pandemic,” said Dr. Cori Bargmann, Head of Science at the Chan Zuckerberg Initiative, regarding a call to action to develop new text and data mining techniques that can help the science community answer high-priority scientific questions related to COVID-19.

Over the past few weeks, two notable resources have been made available, providing open access to COVID datasets and related software:

COVID-19 Open Research Dataset (CORD-19) Continue reading

Data Journals: Standalone Publication for Replication, Negative, Intermediate, or Simply Noteworthy Data

As the scholarly community continues to recognize the importance of open data sharing for increasing the reproducibility of research, researchers are faced with a growing menu of options through which to make their data available. For example, is it better to deposit data in a digital repository, which often grants depositors a Digital Object Identifier (DOI), or to formally describe a dataset in a data journal article, or to share it through a metadata registry like the Pitt Data Catalog? A recent video call for papers from the journal Data in Brief argues that data journals offer a unique opportunity for standalone publication of genres that are often critically underserved by the scholarly publishing ecosystem: datasets containing replication data, negative results, and intermediate data for research in progress. Continue reading

Better Data Sharing in Six Simple Steps

I recently attended a workshop from the Data Curation Network, a collaboration of institutions that have developed specific guidelines to help their researchers share research data. Though the workshop was aimed at librarians, the DCN’s process is useful to any researcher preparing data for sharing in a repository. If you are interested in making your research more reproducible, I encourage you to consider these simple steps.

Imagine that you have a dataset—a package of data files, documentation such as codebooks or READMEs, and perhaps analysis code—that you wish to (or are required to) deposit in a repository such as Figshare or OpenNeuro. The files you have probably require some cleanup before you share them with the world, but there may be other actions you can take that would have a big usability payoff for minimal investment. The steps below form the Data Curation Network’s “CURATE” model, paraphrased here but available in full online: Data Curation Network: A Cross-Institutional Staffing Model for Curating Research Data. Continue reading
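One low-effort cleanup action of the kind described above is to inventory the package before depositing it. The sketch below is an illustration only and is not part of the CURATE model: it walks a dataset folder, records a SHA-256 checksum for every file, and flags packages that lack a README or codebook. The function names `build_manifest` and `missing_documentation` are hypothetical, and the filename check is a simplifying assumption about how documentation files are named.

```python
import hashlib
from pathlib import Path
from typing import Dict


def build_manifest(package_dir: str) -> Dict[str, str]:
    """Map each file's path (relative to the package root) to its SHA-256 digest."""
    root = Path(package_dir)
    manifest: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads each file fully into memory; fine for a sketch, but very
            # large files would be better hashed in chunks.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def missing_documentation(manifest: Dict[str, str]) -> bool:
    """Return True if no file name starts with 'readme' or 'codebook'.

    Assumption: documentation files follow that naming pattern.
    """
    return not any(
        Path(name).name.lower().startswith(("readme", "codebook"))
        for name in manifest
    )
```

A depositor could write the manifest out alongside the data so that anyone who downloads the package can confirm the files arrived intact.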

Feedback Request: Draft NIH Policy for Data Management and Sharing

In a follow-up to last year’s request for input on updates to its 2003 Data Sharing Policy, the National Institutes of Health (NIH) is soliciting public feedback on a draft policy for data management and sharing activities related to public access and open science. Regarding the necessity of such a policy, the NIH states:

“Validation and progress in biomedical research—the cornerstone of developing new prevention strategies, treatments, and cures—is dependent on access to scientific data. Sharing scientific data helps validate research results, enables researchers to combine data types to strengthen analyses, facilitates reuse of hard to generate data or data from limited sources, and accelerates ideas for future research inquiries. Central to sharing scientific data is the recognized need to make data as available as possible while ensuring that the privacy and autonomy of research participants are respected, and that confidential/proprietary data are appropriately protected.”

The draft policy would apply to all NIH-funded or conducted research resulting in the generation of scientific data and requires: Continue reading

New Data Repository Option for NIH Researchers: NIH Figshare

In July 2019, NIH and Figshare announced the one-year pilot launch of a general data repository for all NIH-funded researchers: NIH Figshare. This repository makes datasets resulting from NIH-funded research accessible by providing a way for NIH researchers to meet data sharing requirements of grants, journals, or institutions when a subject-specific repository is not an option. Continue reading

24/7 Training for Data Analysis and Statistics Software: E-Resources from the Library

If you write scripts or use data analysis software, did you know that the Health Sciences Library System provides access to thousands of reference materials to help support research programming in the health sciences? If you want to test out software or need help interpreting a never-before-seen error message, the library’s streaming videos and e-books are available to anyone with a Pitt ID, on- or off-campus.

LinkedIn Learning (formerly known as Lynda.com) provides video tutorials, transcripts, and exercises for popular data analysis and statistics software. Need an introduction to SPSS? Try the SPSS Statistics Essential Training course to learn the basics, or focus on quantitative tests in the SPSS for Academic Research course. Dive deep into SAS with the multi-part series SAS Essential Training: Descriptive Analysis for Healthcare Research and SAS Essential Training: Regression Analysis for Healthcare Research. Introductions to Stata and MATLAB are also available. Continue reading

Tell Us Your Story: Outcomes from Data Sharing

During Love Data Week, HSLS Data Services gathered stories from health sciences researchers to better understand the “benefits or unforeseen outcomes” experienced from data sharing.

The paraphrased stories below illustrate the importance of data security and thoughtful data management.

Research participants expect that their identity will remain 100% confidential. A breach in data security, identified during a Google search, made one research participant hesitant about sharing any personal data in future studies. Continue reading

New Software and 3D Model Records Now in the Pitt Data Catalog

When HSLS launched the Pitt Data Catalog last spring, we wanted to provide researchers with flexible options for advertising and sharing their data. Now that the catalog has grown to describe more than 20 Pitt-created datasets, that flexibility has led our collection development in surprising and exciting directions. We have recently added our first records describing software code and 3D models, all created by Charles C. Horn, PhD.

Dr. Horn is an associate professor of medicine who studies gut-brain communication, particularly via the vagus nerve. His research makes use of several open-source software packages, which he demonstrates in his paper (with David M. Rosenberg), “Neurophysiological Analytics for All! Free Open-Source Software Tools for Documenting, Analyzing, Visualizing, and Sharing Using Electronic Notebooks.” Electrophysiological data used to demonstrate the software tools are available in the publication’s data supplements and on GitHub, where Dr. Horn has also uploaded scripts and a Docker image containing tools to make neurophysiological data analysis easier. Pitt Data Catalog records linking to those software/data packages include:

Dr. Horn has also designed several printable 3D models for experimental apparatuses in electrophysiology. The files shared through the NIH 3D Print Exchange include printable files in a variety of formats, photos, and assembly instructions. The 3D model records in the Pitt Data Catalog are:

We are pleased to host records describing these software packages and models, which are the first of their kind in the wider Data Catalog Collaboration Project.

If you have data, code, or models (printable or otherwise) that you would like to include in the Pitt Data Catalog, please contact us at or through the “Include your Dataset” button on the Pitt Data Catalog homepage. We are available to talk with you about publicizing your research products through the catalog. The process is quick, free, and tailored to your needs, especially regarding confidentiality and controlling access to your data.

~Helenmary Sheridan