Featured Workshops: Love Data Week

Love Data Week, February 14-18, 2022, is an international event designed to raise awareness about research data management, sharing, and preservation. To celebrate, HSLS will be hosting a variety of featured workshops*:

Continue reading

PubMed Central Article Datasets are Now Available in the Cloud

The National Library of Medicine (NLM) recently announced that two PubMed Central article datasets are openly available in the cloud. This news is of particular interest to researchers who use text mining or other types of secondary analysis.

PubMed Central (PMC) is a free full-text archive of biomedical and life sciences journal literature from the U.S. National Institutes of Health’s National Library of Medicine (NIH/NLM). For nearly two decades NLM has supported the retrieval and download of machine-readable open access journal articles through the PMC Open Archives Initiative (PMC-OAI) and FTP (file transfer protocol). To enhance access, these datasets are now also available on the Amazon Web Services (AWS) Registry of Open Data as part of AWS’s Open Data Sponsorship Program (ODP). Benefits to working with the datasets in the cloud include access to uncompressed individual full-text article files in XML and plain text as well as faster download and transfer speeds.

In summary, PMC Article Datasets housed on AWS include:

  • The PMC Open Access (OA) Subset: includes all articles and preprints in PMC with a machine-readable Creative Commons license that allows reuse (to date more than 3.4 million).
  • The Author Manuscript Dataset: includes accepted author manuscripts collected under a funder policy in PMC and made available in machine-readable formats for text mining (to date more than 700,000).
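As a rough illustration of what working with individual full-text XML files can look like, the sketch below pulls a title and body paragraphs out of an article using only the Python standard library. The sample document is hypothetical and heavily simplified; the element names follow the general style of journal article XML, and real PMC files are much larger and more deeply nested, so treat this as a starting point rather than a complete parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified article used only for illustration;
# real PMC files are far larger and more deeply nested.
SAMPLE_XML = """<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>An Example Article</article-title>
      </title-group>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Introduction</title>
      <p>Text mining needs <italic>machine-readable</italic> text.</p>
      <p>Cloud copies are uncompressed.</p>
    </sec>
  </body>
</article>"""


def extract_text(xml_string):
    """Return (title, list of body paragraphs) from an article XML string."""
    root = ET.fromstring(xml_string)
    title_el = root.find(".//article-title")
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""
    body = root.find("body")
    if body is None:
        return title, []
    # itertext() flattens inline markup such as <italic> into plain text.
    paragraphs = ["".join(p.itertext()).strip() for p in body.iter("p")]
    return title, paragraphs


title, paragraphs = extract_text(SAMPLE_XML)
print(title)
for paragraph in paragraphs:
    print("-", paragraph)
```

The same function could be applied in a loop over downloaded article files to build a plain-text corpus for analysis.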

Continue reading

Two New LabArchives Products for Pitt Researchers

The University of Pittsburgh has licensed LabArchives, a cloud-based electronic research notebook, since 2016. LabArchives notebooks help researchers organize and manage laboratory data safely and conveniently across multiple platforms and devices. Whether you are managing a research lab as a principal investigator or reviewing students’ lab work as an instructor, LabArchives supports effective research data management plans and helps improve student learning. Pitt researchers seeking to make the transition from paper-based to electronic lab notebooks can watch YouTube videos, read our guide, or attend one of our training sessions.

LabArchives has expanded beyond electronic research notebooks for Research and Education to include two products that we are excited to announce are now available to researchers with a Pitt email address: Inventory and Scheduler.

LabArchives Inventory streamlines the organization, tracking, and ordering of lab inventory. Whether you need to order inventory from a vendor or manage your in-lab created materials, LabArchives Inventory provides a simple and customizable solution for your physical inventory management needs. Use Inventory to customize your inventory types and storage locations, add and manage lab inventory items, and then use the ordering options to request and receive materials. Continue reading

Take Your Data Practices from Good to Best with HSLS Data Services

The HSLS Data Services team is thrilled that Pitt has declared 2021-22 to be the Year of Data and Society, because for us, every day is a day for data. Whether you are embarking on your first research project or have dozens of completed studies under your belt, we are here to help you improve the efficiency and reliability of your data-handling workflows at every step in the research process. We offer consultations, classes, and customized trainings in the following areas:

Research data management

Organizing files, writing documentation, and safely storing datasets are key practices for working with data effectively. They are also required discussion items for data management plans, which will be mandated in all NIH grant applications after January 2023. (Read the official NIH notice.) We recommend our Introduction to Research Data Management workshops especially for new graduate students to set themselves up with good habits from the start, but in-depth consultations are available for any lab, research group, or individual. Continue reading

Share Human Variant Data with New NCBI ClinVar API

Do you work with human genetic variants? Have you sought out relevant publications, clinically significant evidence, and/or publicly available data? Are you ready to contribute to the scientific and patient-care community by sharing your own research output?

You likely already know about and use ClinVar, the go-to resource for the clinical genetics community that aggregates information about genomic variation and its relationship to human health. ClinVar recently reached the significant milestone of including 1 million unique variants in its database. Over 1,800 organizations from 82 countries have submitted almost 1.5 million records to ClinVar, including more than 11,000 curated variants from 14 expert panels.

Now it is easier than ever to reciprocate and be a supportive community member by submitting your human genetic variant data using the new ClinVar Submission API. The workflow for submissions is fast and automated, thanks to a RESTful API: an application programming interface (API) built in the REST architectural style, which lets two software programs communicate with each other over the web to access and use data. Continue reading
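To give a sense of what "RESTful" means in practice, here is a minimal sketch of how a program might prepare a JSON submission as an HTTP POST request. The endpoint URL, header name, API key, and payload fields are all hypothetical placeholders, not ClinVar's actual values; the ClinVar Submission API documentation defines the real endpoint, authentication, and schema. The request is built but deliberately not sent.

```python
import json
import urllib.request

# Hypothetical values for illustration only; consult the ClinVar
# Submission API documentation for the real endpoint, headers, and schema.
ENDPOINT = "https://api.example.org/submissions"
API_KEY = "YOUR-API-KEY"


def build_submission_request(payload):
    """Build (but do not send) an HTTP POST request carrying a JSON payload."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-API-Key": API_KEY,  # hypothetical header name
        },
    )


# Hypothetical payload shape, not ClinVar's actual submission schema.
request = build_submission_request({"variant": "example", "significance": "example"})
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

In a REST-style workflow, the server would answer such a request with a status code and a JSON response that the submitting program can check automatically.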

Options for Publishing Research Protocols

Typically if a researcher is asked what they think of when they hear the word “publication,” a “traditional” research journal article likely comes to mind. However, if the entire research workflow is considered, there are many research outputs that could be published including articles, preprints, protocols, datasets, and software. (We are defining “published” simply as “disseminated,” although terms such as “shared” or “posted” may be more appropriate depending on the output.)

The number of venues for publishing these outputs is growing and includes data repositories and preprint servers like DRYAD and medRxiv. New journals such as the Journal of Open Source Software (JOSS) and Scientific Data have been founded specifically to allow these research outputs to be recognized within the scholarly system. In addition, expanded publication types are now offered by established journals like PLOS ONE, which introduced Lab and Study Protocol types in early 2021.

This article will provide options for publishing research protocols; however, the Where Should I Publish? Guide linked on the left of the Scholarly Communication Guide also compares options for other research outputs. Continue reading

Make File Management Simpler with Version Control

Works in progress can become unruly. As a piece of research code grows, it often spawns new files that iterate on the original: this version fixes one bug but introduces another, or that version swaps two similar functions. The same is true for manuscript drafts which pass among co-authors, accumulating new text (and usually new filenames) as they travel. It can be difficult to tell these versions apart from each other, or trace the history of how one version evolved from another. Version control systems make this work easier.

Version control is defined as “a system that records changes to a file or set of files over time so that you can recall specific versions later” in Pro Git (second edition, 2014), an excellent open textbook by Scott Chacon and Ben Straub. Version control allows a user to see all changes made to a file, who made the changes, and when they were made. It can let an author approve or reject edits made to a manuscript, or quickly determine which set of figures is the right one to submit to a journal. Continue reading
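The core idea of seeing exactly what changed between two versions can be sketched with Python's standard difflib module. This is only an illustration of change tracking, not how Git or any other version control system is implemented; the draft text and file names are hypothetical. A real version control system would also record who made each change and when.

```python
import difflib

# Two hypothetical drafts of the same short manuscript passage.
draft_v1 = [
    "We collected 40 samples.",
    "Samples were stored at room temperature.",
    "Results are shown in Figure 1.",
]
draft_v2 = [
    "We collected 42 samples.",
    "Samples were stored at room temperature.",
    "Results are shown in Figure 2.",
]


def describe_changes(old, new, old_name="draft_v1.txt", new_name="draft_v2.txt"):
    """Return unified-diff lines showing what changed between two versions."""
    return list(
        difflib.unified_diff(old, new, fromfile=old_name, tofile=new_name, lineterm="")
    )


# Lines starting with "-" were removed, lines starting with "+" were added.
for line in describe_changes(draft_v1, draft_v2):
    print(line)
```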

Featured Workshop: Exploring and Cleaning Data with OpenRefine

HSLS offers classes in a wide array of subjects—molecular biology, database searching, bibliographic management, and more! You can quickly view all Upcoming Classes and Events or sign up to receive the weekly Upcoming HSLS Classes and Workshops email.

This month’s featured workshop is Exploring and Cleaning Data with OpenRefine. The workshop will take place on Friday, June 11, 2021, from 10-11:30 a.m.

Register for this virtual workshop*

Exploring and Cleaning Data with OpenRefine is a workshop that introduces participants to the basics of working with OpenRefine to clean, organize, and transform messy datasets.

OpenRefine (formerly Google Refine) is a powerful, free, open-source tool for working with unorganized tabular data. Since OpenRefine works offline in a web browser, your private data is not uploaded to the cloud and will stay on your local computer. Note that you are always working on a copy of your data; your raw data files are kept in their original form. Another benefit of OpenRefine is that while the program has a graphical interface, the system documents steps that have been completed to allow for reproducibility in data cleaning. These steps can be saved as JSON scripts and used to automate steps to clean other similar files. Continue reading
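The idea of recording cleaning steps as JSON and replaying them on another file can be sketched in a few lines of Python. The step format below is a hypothetical, simplified one invented for this illustration; it is not OpenRefine's actual operation-history schema, and the operation names and sample rows are assumptions.

```python
import json

# Hypothetical, simplified step format for illustration;
# this is NOT OpenRefine's actual operation-history schema.
RECORDED_STEPS = json.loads("""
[
  {"op": "trim", "column": "city"},
  {"op": "titlecase", "column": "city"},
  {"op": "replace", "column": "state", "find": "Penn.", "with": "PA"}
]
""")


def apply_steps(rows, steps):
    """Return cleaned copies of rows; the input rows are left unmodified."""
    cleaned = [dict(row) for row in rows]  # work on a copy, keep raw data intact
    for step in steps:
        column = step["column"]
        for row in cleaned:
            value = row[column]
            if step["op"] == "trim":
                row[column] = value.strip()
            elif step["op"] == "titlecase":
                row[column] = value.title()
            elif step["op"] == "replace":
                row[column] = value.replace(step["find"], step["with"])
            else:
                raise ValueError(f"unknown step: {step['op']}")
    return cleaned


raw = [{"city": "  pittsburgh ", "state": "Penn."}, {"city": "ERIE", "state": "PA"}]
print(apply_steps(raw, RECORDED_STEPS))
```

Because the steps live in a separate JSON document, the same list can be applied to any other file with the same columns, which is what makes the cleaning reproducible.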

Common Data Elements: Benefits and Feedback Requested

Common Data Elements (CDEs) are definitions that allow data to be consistently captured and recorded across studies. Simply put, they allow researchers to ask the same questions in the same way across studies and receive standardized responses. For example, consider the following two questions about adolescent exercise, used on two different surveys.

Survey 1 Question:

In the past 7 days, how many days did your child exercise so much that he/she breathed hard? (Choose one)

  • No days
  • 1 day
  • 2-3 days
  • 4-5 days
  • 6-7 days

Survey 2 Question:

In the past 7 days, how often did your child exercise or participate in sports activities that made them breathe hard for at least 20 minutes. (Fill in the blank)

  •  __ day/s

The results from the two questions could not be combined: one provides fixed options while the other allows write-in responses, and their definitions of exercise differ (only one explicitly requires at least 20 minutes). Continue reading
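To make the mismatch concrete, the sketch below shows the extra mapping step that would be needed just to align the two response formats, converting a Survey 2 write-in day count into a Survey 1 category. The function name is hypothetical. Note what the code cannot do: it does not reconcile the differing definitions of exercise, which is exactly the gap a shared Common Data Element avoids.

```python
# Response categories copied from Survey 1 in the text.
SURVEY1_CATEGORIES = ["No days", "1 day", "2-3 days", "4-5 days", "6-7 days"]


def bin_days(days):
    """Map a Survey 2 write-in day count (0-7) to a Survey 1 category."""
    if not isinstance(days, int) or not 0 <= days <= 7:
        raise ValueError("days must be an integer from 0 to 7")
    if days == 0:
        return "No days"
    if days == 1:
        return "1 day"
    if days <= 3:
        return "2-3 days"
    if days <= 5:
        return "4-5 days"
    return "6-7 days"


# Aligns the response format only; the 20-minute threshold in Survey 2
# still makes the underlying measurements different.
for days in range(8):
    print(days, "->", bin_days(days))
```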

Explore DOIs and Beyond at PIDapalooza—the Global Festival of Persistent Identifiers for Digital Objects

Take a moment and consider your name. Do you have a name so unique that you are the only person with that name publishing in your field? I do—there are very few Helenmarys in the world to begin with. But if you cut my first name down to “Helen,” suddenly I could be one of a dozen authors working in my area. Reduce it further to “H,” and I’ve vanished among the crowd. Uniqueness is no match for the sheer ubiquity of names in the online scholarly publishing record.

What I need is a PID: a persistent identifier that refers to me and only me, and would still refer to me if I changed my name. For names, that’s easy: I have an ORCID iD, a sixteen-character identifier (fifteen digits plus a final check character) that I can connect to my research output and take with me wherever I go. But what if I were not a person but a dataset, an article, or a piece of software? All of those can get PIDs too, as can far stranger objects, the breadth of which was the focus of January’s all-online, still-available, free PIDapalooza festival. Continue reading
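One practical feature of an ORCID iD is that its final character is a checksum (computed with the ISO 7064 MOD 11-2 algorithm), so software can catch many typos before looking anything up. The sketch below is a minimal validator; the function names are hypothetical, and the iDs used are sample values from ORCID's documentation rather than real researchers. Passing the check only means the iD is well formed, not that it has been registered.

```python
def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the length, characters, and checksum of a hyphenated ORCID iD."""
    compact = orcid.replace("-", "")
    if len(compact) != 16:
        return False
    if not all(ch in "0123456789" for ch in compact[:15]):
        return False
    return orcid_check_character(compact[:15]) == compact[15].upper()


print(is_valid_orcid("0000-0002-1825-0097"))
print(is_valid_orcid("0000-0002-1825-0098"))
```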

Celebrating Love Data Week 2021

The week of February 8-12, 2021, is Love Data Week, an international event designed to raise awareness about research data management, sharing, preservation, and, most importantly, how we can help you. To celebrate, HSLS Data Services will be hosting a variety of workshops and giveaways, and engaging with the community via social media.


The HSLS classes offered during Love Data Week (online synchronous via Zoom) are listed below. Every class attendee will be entered into a raffle for a chance to win a gift card (mailed to winner). The more classes you attend, the more chances you have.

Note: Zoom links will be sent upon registration (also available at the above class links). Continue reading

Recently Released: Final NIH Policy for Data Management and Sharing

In October 2020, the NIH released its Final Policy for Data Management and Sharing, which requires NIH-funded researchers to proactively plan for how scientific data will be preserved and shared through submission of a Data Management and Sharing Plan.

Additional supplementary information released in concert with the policy addresses:

Continue reading

Quickly Share, Gain Feedback, and Improve Your Papers with Research Square

The HSLS Update has published numerous articles about preprints over the years. Here we introduce another iteration of the preprint movement: Research Square, a multidisciplinary platform that helps researchers share their work early, gather feedback, and improve their manuscripts prior to (or in parallel with) journal submission.

So what differentiates Research Square from other preprint servers? The focus is on “added value” features such as:

Continue reading

Forecasting Data Costs for Biomedical Data Preservation

A data management plan is a formal document outlining how you will handle your data both during your research and after the project is completed. While writing this plan, and especially while preparing your grant application, think through the long-term costs of managing and preserving data throughout its life-cycle and the resources (both physical and personnel) needed to do so.

A new consensus study report from the National Academies of Sciences, Engineering, and Medicine titled “Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs” may be useful to researchers trying to accomplish this task. The report provides a framework to “help researchers identify and think through the major decisions in forecasting life-cycle costs for preserving, archiving, and promoting access to biomedical data.”

In addition to the report there are many other valuable tools/guides linked under the “resources” tab on the National Academies Press page. Of particular interest are:

Continue reading