Open access to scientific research made headlines this summer when the White House Office of Science and Technology Policy (OSTP) issued a memo on August 25, 2022, updating requirements for federally funded research to make publications and results freely and immediately available. Learn more about open access for scientific publications, data, and software with classes at the Health Sciences Library System during International Open Access Week, a week of global advocacy for open access to research, happening October 24 through 30.
Whether you’re new to open access or have specific questions, drop-in sessions are a great place to talk with HSLS specialists. Join Stephen Gabrielson, the library’s Scholarly Communication Librarian, for “Open Access Drop-In Session: How Does Open Access Publishing Work?” on Monday, October 24, from 11 a.m. to noon. This session will focus specifically on how to publish open-access articles, and it will cover sources of funding for article processing charges (APCs), how to find reputable no-APC journals, and how to self-archive your manuscript in an open-access repository. HSLS also has a guide to scholarly communication and publishing, including open-access publishing, available anytime: Scholarly Communication and Publishing Guide.
The success of a research project hinges on the quality of its data, and keeping clearly organized, well-documented data and analyses helps to ensure high data quality. The HSLS Data Services team can help you find, manage, publish, and share your data for any type of research project. We offer consultations, classes, and customized trainings in the following areas:
Research data management
Organizing files, writing documentation, and safely storing datasets are essential skills for managing data throughout the research lifecycle. We are available for personal consultations on research data management topics at any time, and offer workshops throughout the semester. In particular, we offer:
- One-hour Introduction to Research Data Management classes that are suitable for everyone, but may be especially helpful for new graduate students and project staff
- In-depth workshops on file-naming best practices, writing a data management plan for grant applications, and responsibly reusing data (or making your data available for reuse)
- One-on-one meetings on writing and implementing a data management and sharing plan (DMSP) for the NIH’s new data management and sharing policy that goes into effect in January 2023
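The file-naming best practices covered in the workshops above can be made concrete with a small sketch. The convention below (ISO 8601 date first, lowercase words joined by underscores, a zero-padded version tag) is one common approach, not the specific standard taught by HSLS:

```python
from datetime import date

def data_filename(project: str, description: str, version: int,
                  ext: str, when: date) -> str:
    """Build a sortable, machine-friendly file name:
    ISO date first, lowercase words joined by underscores,
    and a zero-padded version number at the end."""
    slug = "_".join(description.lower().split())
    return f"{when.isoformat()}_{project}_{slug}_v{version:02d}.{ext}"

name = data_filename("sleepstudy", "raw survey responses", 3,
                     "csv", date(2022, 10, 24))
print(name)  # 2022-10-24_sleepstudy_raw_survey_responses_v03.csv
```

Because the date leads and the version is zero-padded, an ordinary alphabetical file listing doubles as a chronological one.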
In March 2022, the All of Us Research Program announced the release of its initial genomic dataset: nearly 100,000 whole genome sequences and 165,000 genotyping arrays, with nearly 50% coming from people who self-identify with a racial or ethnic minority group.
In an announcement about the release of the genomic data, Kelsey Mayo, Ph.D., scientific portfolio and product manager at the Vanderbilt University Medical Center Data and Research Center, states:
“What’s going to grab researchers’ attention is the diversity of the cohort. Half of our cohort is non-European. More than 90% of participants in genome-wide association studies have been of European descent. There’s just a real absence of genetic data from African, Asian, and Latino people. All of Us participants are providing this important data that’s been missing in health research. So we are going to have that new genetic information that’s been missing.”
Plans for forthcoming releases include data from participants who self-identify as American Indian or Alaska Native, with resources to provide important context for researchers.
After a long development process, the NIH’s new Data Management and Sharing (DMS) Policy will go into effect on January 25, 2023. The key feature of the new policy is that all researchers applying to the NIH for funding will be required to submit a Data Management and Sharing Plan (or DMSP) with their funding proposal; previously, many centers and funding opportunities had required a similar data management plan, but the requirement was not universal. The new policy does not require that researchers share their data (either with other researchers or with the public) but does convey “an expectation that researchers will maximize appropriate data sharing when developing plans.”
With January 2023 fast approaching, many have asked for more specific guidance from the NIH. The NIH has recently launched the Scientific Data Sharing website, which includes helpful notices that expand on the DMS Policy.
Does your research involve identifying correlations between gene sequences and diseases, predicting protein structures from amino acid sequences, transcriptomics, metabolomics, or any of the many other ‘omics? If so, then you have a lot of data that requires analysis: a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making.
The “four Cs” are shorthand for the broad categories of options for analyzing bioinformatics data: paying someone else to do it (Core labs), working with another researcher (Collaboration), and doing it yourself, either by learning to program (Coding) or by using out-of-the-box software (Commercially licensed tools). The University of Pittsburgh provides numerous options in these four categories to help you with your data analysis needs.
Love Data Week, February 14-18, 2022, is an international event designed to raise awareness about research data management, sharing, and preservation. To celebrate, HSLS will be hosting a variety of featured workshops*:
- Preparing for the New NIH Data Management & Sharing Plan: Session 1 — Elements, Costs, & Tools, Monday, February 14, 2022, noon-1 p.m.
- Publicly Available Social Justice Data, Tuesday, February 15, 2022, 8:30-10 a.m.
- Data Organization in Spreadsheets, Thursday, February 17, 2022, noon-1:30 p.m.
The National Library of Medicine (NLM) recently announced that two PubMed Central article datasets are openly available in the cloud. This news will be of special interest to those conducting research using text mining or other secondary analysis methods.
PubMed Central (PMC) is a free full-text archive of biomedical and life sciences journal literature from the U.S. National Institutes of Health’s National Library of Medicine (NIH/NLM). For nearly two decades NLM has supported the retrieval and download of machine-readable open access journal articles through the PMC Open Archives Initiative (PMC-OAI) and FTP (file transfer protocol). To enhance access, these datasets are now also available on the Amazon Web Services (AWS) Registry of Open Data as part of AWS’s Open Data Sponsorship Program (ODP). Benefits to working with the datasets in the cloud include access to uncompressed individual full-text article files in XML and plain text as well as faster download and transfer speeds.
In summary, PMC Article Datasets housed on AWS include:
- The PMC Open Access (OA) Subset: includes all articles and preprints in PMC with a machine-readable Creative Commons license that allows reuse (to date more than 3.4 million).
- The Author Manuscript Dataset: includes accepted author manuscripts collected under a funder policy in PMC and made available in machine-readable formats for text mining (to date more than 700,000).
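One benefit of the uncompressed XML files is that they can be parsed directly with standard tools. PMC full text uses the JATS tag suite; the fragment below is a tiny illustrative sample (real articles are far larger), showing how a title and body paragraphs might be pulled out for text mining:

```python
import xml.etree.ElementTree as ET

# A tiny illustrative fragment in the JATS-style XML used by PMC full text.
# Element names follow the JATS tag suite; the content is invented.
sample = """<article>
  <front>
    <article-meta>
      <title-group><article-title>An Example Study</article-title></title-group>
    </article-meta>
  </front>
  <body>
    <p>Open access enables text mining.</p>
    <p>Methods are described here.</p>
  </body>
</article>"""

root = ET.fromstring(sample)
title = root.findtext(".//article-title")          # locate the article title
paragraphs = [p.text for p in root.iter("p")]      # collect body paragraphs
print(title)       # An Example Study
print(paragraphs)  # ['Open access enables text mining.', 'Methods are described here.']
```

The same pattern scales to batches of articles downloaded from the AWS datasets.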
The University of Pittsburgh has licensed a cloud-based Electronic Research Notebook, LabArchives, since 2016. LabArchives research notebooks assist with the organization and management of laboratory data, safely and conveniently across multiple platforms and devices. Whether managing a research lab as a principal investigator or reviewing students’ lab work as an instructor, LabArchives supports effective research data management plans and helps improve student learning. Pitt researchers seeking to make the transition from paper-based to electronic lab notebooks can watch YouTube videos, read our guide, or attend one of our training sessions.
LabArchives has expanded beyond electronic research notebooks for research and education, and we are excited to announce that two additional products are now available to researchers with a Pitt email address: Inventory and Scheduler.
LabArchives Inventory streamlines the organization, tracking, and ordering of lab inventory. Whether you need to order inventory from a vendor or manage your in-lab created materials, LabArchives Inventory provides a simple and customizable solution for your physical inventory management needs. Use Inventory to customize your inventory types and storage locations, add and manage lab inventory items, and then use the ordering options to request and receive materials. Continue reading
The HSLS Data Services team is thrilled that Pitt has declared 2021-22 to be the Year of Data and Society, because for us, every day is a day for data. Whether you are embarking on your first research project or have dozens of completed studies under your belt, we are here to help you improve the efficiency and reliability of your data-handling workflows at every step in the research process. We offer consultations, classes, and customized trainings in the following areas:
Research data management
Organizing files, writing documentation, and safely storing datasets are key practices for working with data effectively. They are also required discussion items for data management plans, which will be mandated in all NIH grant applications after January 2023. (Read the official NIH notice.) We recommend our Introduction to Research Data Management workshops especially for new graduate students to set themselves up with good habits from the start, but in-depth consultations are available for any lab, research group, or individual. Continue reading
Do you work with human genetic variants? Have you sought out relevant publications, clinically significant evidence, and/or publicly available data? Are you ready to contribute to the scientific and patient-care community by sharing your own research output?
You likely already know about and use ClinVar, the go-to resource for the clinical genetics community that aggregates information about genomic variation and its relationship to human health. ClinVar recently reached the significant milestone of including 1 million unique variants in its database. Over 1,800 organizations from 82 countries have submitted almost 1.5 million records to ClinVar, including more than 11,000 curated variants from 14 expert panels.
Now it is easier than ever to reciprocate and be a supportive community member by submitting your human genetic variant data using the new ClinVar Submission API. The workflow for submissions is fast and automated, thanks to a RESTful API—a particular architectural style for an application program interface (API) allowing two software programs to communicate with each other to access and use data. Continue reading
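Because the submission workflow is a RESTful API, a submission is ultimately just an authenticated HTTP POST carrying a JSON payload. The sketch below illustrates that shape only: the endpoint URL, header name, and payload fields are assumptions, so consult the ClinVar Submission API documentation for the real schema before submitting data:

```python
import json
import urllib.request

# Illustrative sketch only: the endpoint URL, header name, and payload
# fields below are assumptions. Consult the ClinVar Submission API
# documentation for the actual schema and authentication details.
API_URL = "https://submit.ncbi.nlm.nih.gov/api/v1/submissions/"  # hypothetical

def build_submission(api_key: str, payload: dict) -> urllib.request.Request:
    """Package a variant submission as an authenticated HTTP POST."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json",
                 "SP-API-KEY": api_key},
        method="POST",
    )

req = build_submission("MY-API-KEY", {"submission_name": "example-batch-1"})
print(req.get_method(), req.full_url)
# Once the payload matches the real schema, sending is one more line:
# response = urllib.request.urlopen(req)
```

Because two programs exchange structured requests like this one, submissions can be generated and validated automatically rather than entered by hand.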
If a researcher is asked what they think of when they hear the word “publication,” a “traditional” research journal article likely comes to mind. However, considering the entire research workflow, there are many research outputs that could be published, including articles, preprints, protocols, datasets, and software. (We are defining “published” simply as “disseminated,” although terms such as “shared” or “posted” may be more appropriate depending on the output.)
The number of venues for publishing these outputs is growing and includes data repositories and preprint servers like DRYAD and medRxiv. New journals such as the Journal of Open Source Software (JOSS) and Scientific Data have been founded specifically to allow these research outputs to be recognized within the scholarly system. In addition, expanded publication types are now offered by established journals like PLOS ONE, which introduced Lab and Study Protocol types in early 2021.
This article provides options for publishing research protocols; however, the Where Should I Publish? Guide, linked on the left of the Scholarly Communication Guide, also compares options for other research outputs. Continue reading
Works in progress can become unruly. As a piece of research code grows, it often spawns new files that iterate on the original: this version fixes one bug but introduces another, or that version swaps two similar functions. The same is true for manuscript drafts, which pass among co-authors, accumulating new text (and usually new filenames) as they travel. It can be difficult to tell these versions apart, or to trace how one version evolved from another. Version control systems make this work easier.
Pro Git (second edition, 2014), an excellent open textbook by Scott Chacon and Ben Straub, defines version control as “a system that records changes to a file or set of files over time so that you can recall specific versions later.” Version control allows a user to see all changes made to a file, who made the changes, and when they were made. It can let an author approve or reject edits made to a manuscript, or quickly determine which set of figures is the right one to submit to a journal. Continue reading
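The definition above can be illustrated with a toy sketch. This is not how a real system like Git works internally; it only shows the core idea that every saved version is recorded, with its author, and can be recalled later:

```python
from dataclasses import dataclass, field

@dataclass
class Revision:
    """One recorded change: who made it, and the resulting content."""
    version: int
    author: str
    content: str

@dataclass
class VersionedFile:
    """A toy version-control store: every save is kept, so any
    earlier version can be recalled later."""
    name: str
    history: list = field(default_factory=list)

    def save(self, author: str, content: str) -> int:
        version = len(self.history) + 1
        self.history.append(Revision(version, author, content))
        return version

    def recall(self, version: int) -> str:
        return self.history[version - 1].content

draft = VersionedFile("manuscript.txt")
draft.save("Alice", "Introduction, first pass.")
draft.save("Bob", "Introduction, revised by co-author.")
print(draft.recall(1))  # Introduction, first pass.
print(draft.recall(2))  # Introduction, revised by co-author.
```

One file name, two recoverable versions, and a record of who changed what: the problem of drafts accumulating new filenames disappears.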
HSLS offers classes in a wide array of subjects—molecular biology, database searching, bibliographic management, and more! You can quickly view all Upcoming Classes and Events or sign up to receive the weekly Upcoming HSLS Classes and Workshops email.
This month’s featured workshop is Exploring and Cleaning Data with OpenRefine. The workshop will take place on Friday, June 11, 2021, from 10-11:30 a.m.
Register for this virtual workshop*
Exploring and Cleaning Data with OpenRefine introduces participants to the basics of using OpenRefine to clean, organize, and transform messy datasets.
OpenRefine (formerly Google Refine) is a powerful, free, open-source tool for working with messy tabular data. Because OpenRefine runs locally in a web browser, your private data is never uploaded to the cloud and stays on your computer. Note that you always work on a copy of your data; your raw data files are kept in their original form. Another benefit is that, although the program has a graphical interface, it records every step you perform, allowing for reproducibility in data cleaning. These steps can be exported as JSON scripts and reused to automate the cleaning of other, similar files. Continue reading
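The idea of a recorded, replayable cleaning history can be sketched in plain Python. This is a rough analogue of the concept only, not OpenRefine's actual JSON operation format: an ordered list of cleaning steps that can be replayed on any similar dataset:

```python
# A rough scripted analogue of OpenRefine's replayable cleaning steps
# (OpenRefine exports its real history as JSON; this shows the idea).
def trim_whitespace(rows):
    """Strip stray leading/trailing spaces from every cell."""
    return [{k: v.strip() for k, v in row.items()} for row in rows]

def normalize_case(rows, column):
    """Standardize capitalization in one column."""
    return [{**row, column: row[column].title()} for row in rows]

# The "operation history": an ordered, reusable list of steps.
steps = [trim_whitespace, lambda rows: normalize_case(rows, "city")]

raw = [{"name": "  Ada ", "city": "pittsburgh"},
       {"name": "Grace", "city": " PITTSBURGH "}]

cleaned = raw
for step in steps:          # replay the history on a copy of the data
    cleaned = step(cleaned)
print(cleaned)
```

Replaying the same `steps` list on next month's export cleans it identically, which is exactly the reproducibility benefit the workshop highlights.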
Common Data Elements (CDEs) are definitions that allow data to be consistently captured and recorded across studies. Simply put, they allow researchers to ask the same questions in the same way across studies and receive standardized responses. For example, consider the following two questions about adolescent exercise, used on two different surveys.
Survey 1 Question:
In the past 7 days, how many days did your child exercise so much that he/she breathed hard? (Choose one)
- No days
- 1 day
- 2-3 days
- 4-5 days
- 6-7 days
Survey 2 Question:
In the past 7 days, how often did your child exercise or participate in sports activities that made them breathe hard for at least 20 minutes? (Fill in the blank)
The results from the two could not be combined: one question provides fixed options while the other allows write-in responses, and their definitions of exercise may differ (only one explicitly specifies 20 minutes). Continue reading
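A CDE's value is that responses map to shared codes across studies. The sketch below is hypothetical (the numeric codes are illustrative, not drawn from a real CDE registry): it codes Survey 1's fixed options and bins a Survey 2-style write-in day count into the same categories, so the two instruments could be compared:

```python
# Hypothetical coding for the Survey 1 exercise question above;
# the numeric codes are illustrative, not from an actual CDE registry.
CATEGORIES = ["No days", "1 day", "2-3 days", "4-5 days", "6-7 days"]

def code_response(choice: str) -> int:
    """Map a categorical answer to its standardized code (0-4)."""
    return CATEGORIES.index(choice)

def bin_day_count(days: int) -> int:
    """Bin a raw day count (a Survey 2-style write-in) into the
    same standardized categories."""
    if days == 0:
        return 0
    if days == 1:
        return 1
    if days <= 3:
        return 2
    if days <= 5:
        return 3
    return 4

print(code_response("2-3 days"))  # 2
print(bin_day_count(3))           # 2
```

With both instruments mapped to one coding scheme, their results can be pooled, which is exactly what CDEs make possible at scale.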