This information is over 2 years old. Information was current at time of publication.

Whole Genome Sequences from a Diverse Human Population Now Available Through the All of Us Research Hub

In March 2022, the All of Us Research Program announced the release of its initial genomic dataset: nearly 100,000 whole genome sequences and 165,000 genotyping arrays, with nearly 50% coming from people who self-identify with a racial or ethnic minority group.

In an announcement about the release of the genomic data, Kelsey Mayo, Ph.D., scientific portfolio and product manager at the Vanderbilt University Medical Center Data and Research Center, states:

“What’s going to grab researchers’ attention is the diversity of the cohort. Half of our cohort is non-European. More than 90% of participants in genome-wide association studies have been of European descent. There’s just a real absence of genetic data from African, Asian, and Latino people. All of Us participants are providing this important data that’s been missing in health research. So we are going to have that new genetic information that’s been missing.”

Plans for forthcoming releases include data from participants who self-identify as American Indian or Alaska Native, with resources to provide important context for researchers.

New NIH Website for Scientific Data Sharing

After a long development process, the NIH’s new Data Management and Sharing (DMS) Policy will go into effect on January 25, 2023. The key feature of the new policy is that all researchers applying to the NIH for funding will be required to submit a Data Management and Sharing Plan (or DMSP) with their funding proposal; previously, many centers and funding opportunities had required a similar data management plan, but the requirement was not universal. The new policy does not require that researchers share their data (either with other researchers or with the public) but does convey “an expectation that researchers will maximize appropriate data sharing when developing plans.”

With January 2023 fast approaching, many have asked the NIH for more specific guidance. The NIH recently launched the Scientific Data Sharing website, which includes helpful notices that expand on the DMS Policy.

Pitt Resources for Bioinformatics Data Analysis: The Four ‘C’s

Does your research involve identifying correlations between gene sequences and diseases, predicting protein structures from amino acid sequences, transcriptomics, metabolomics, or any of the many other ‘omics? If so, then you have a lot of data that requires analysis: a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making.

The “four ‘C’s” is shorthand to describe broad categories of options for analyzing bioinformatics data, including paying someone else to do it (Core labs), working with another researcher (Collaboration), and doing it yourself either by learning to program (Coding) or using out-of-the-box software (Commercially-licensed tools). The University of Pittsburgh provides numerous options in these four categories to help you with your data analysis needs.

Featured Workshops: Love Data Week

Love Data Week, February 14-18, 2022, is an international event designed to raise awareness about research data management, sharing, and preservation. To celebrate, HSLS will be hosting a variety of featured workshops:

Preparing for the New NIH Data Management & Sharing Plan: Session 1 — Elements, Costs, & Tools, Monday, February 14, 2022, noon-1 p.m.
Publicly Available Social Justice Data, Tuesday, February 15, 2022, 8:30-10 a.m.
Data Organization in Spreadsheets, Thursday, February 17, 2022, noon-1:30 p.m.

PubMed Central Article Datasets are Now Available in the Cloud

The National Library of Medicine (NLM) recently announced that two PubMed Central article datasets are openly available in the cloud. This news is especially of interest for those conducting research utilizing text mining methodology or other types of secondary analysis.

PubMed Central (PMC) is a free full-text archive of biomedical and life sciences journal literature from the U.S. National Institutes of Health’s National Library of Medicine (NIH/NLM). For nearly two decades NLM has supported the retrieval and download of machine-readable open access journal articles through the PMC Open Archives Initiative (PMC-OAI) and FTP (file transfer protocol). To enhance access, these datasets are now also available on the Amazon Web Services (AWS) Registry of Open Data as part of AWS’s Open Data Sponsorship Program (ODP). Benefits to working with the datasets in the cloud include access to uncompressed individual full-text article files in XML and plain text as well as faster download and transfer speeds.

In summary, PMC Article Datasets housed on AWS include:

The PMC Open Access (OA) Subset: includes all articles and preprints in PMC with a machine-readable Creative Commons license that allows reuse (to date more than 3.4 million).
The Author Manuscript Dataset: includes accepted author manuscripts collected under a funder policy in PMC and made available in machine-readable formats for text mining (to date more than 700,000).

Two New LabArchives Products for Pitt Researchers

The University of Pittsburgh has licensed a cloud-based Electronic Research Notebook, LabArchives, since 2016. LabArchives research notebooks assist with the organization and management of laboratory data, safely and conveniently across multiple platforms and devices. Whether managing a research lab as a principal investigator or reviewing students’ lab work as an instructor, LabArchives supports effective research data management plans and helps improve student learning. Pitt researchers seeking to make the transition from paper-based to electronic lab notebooks can watch YouTube videos, read our guide, or attend one of our training sessions.

LabArchives has expanded beyond electronic research notebooks for Research and Education to include two products that we are excited to announce are now available to researchers with a Pitt email address: Inventory and Scheduler.

LabArchives Inventory streamlines the organization, tracking, and ordering of lab inventory. Whether you need to order inventory from a vendor or manage your in-lab created materials, LabArchives Inventory provides a simple and customizable solution for your physical inventory management needs. Use Inventory to customize your inventory types and storage locations, add and manage lab inventory items, and then use the ordering options to request and receive materials.

Take Your Data Practices from Good to Best with HSLS Data Services

The HSLS Data Services team is thrilled that Pitt has declared 2021-22 to be the Year of Data and Society, because for us, every day is a day for data. Whether you are embarking on your first research project or have dozens of completed studies under your belt, we are here to help you improve the efficiency and reliability of your data-handling workflows at every step in the research process. We offer consultations, classes, and customized trainings in the following areas:

Research data management

Organizing files, writing documentation, and safely storing datasets are key practices for working with data effectively. They are also required discussion items for data management plans, which will be mandated in all NIH grant applications starting in January 2023. (Read the official NIH notice.) We especially recommend our Introduction to Research Data Management workshops for new graduate students who want to set themselves up with good habits from the start, but in-depth consultations are available for any lab, research group, or individual.

Share Human Variant Data with New NCBI ClinVar API

Do you work with human genetic variants? Have you sought out relevant publications, clinically significant evidence, and/or publicly available data? Are you ready to contribute to the scientific and patient-care community by sharing your own research output?

You likely already know about and use ClinVar, the go-to resource for the clinical genetics community that aggregates information about genomic variation and its relationship to human health. ClinVar recently reached the significant milestone of including 1 million unique variants in its database. Over 1,800 organizations from 82 countries have submitted almost 1.5 million records in ClinVar, including more than 11,000 curated variants from 14 expert panels.

Now it is easier than ever to reciprocate and be a supportive community member by submitting your human genetic variant data using the new ClinVar Submission API. The workflow for submissions is fast and automated, thanks to a RESTful API: an application programming interface (API) built in a particular architectural style that lets two software programs communicate with each other to access and use data.

Options for Publishing Research Protocols

If a researcher is asked what they think of when they hear the word “publication,” a “traditional” research journal article likely comes to mind. However, if the entire research workflow is considered, there are many research outputs that could be published, including articles, preprints, protocols, datasets, and software. (We are defining “published” simply as “disseminated,” although terms such as “shared” or “posted” may be more appropriate depending on the output.)

The number of venues for publishing these outputs is growing and includes data repositories and preprint servers such as Dryad and medRxiv. New journals such as the Journal of Open Source Software (JOSS) and Scientific Data have been founded specifically to allow these research outputs to be recognized within the scholarly system. In addition, expanded publication types are now offered by established journals like PLOS ONE, which introduced Lab and Study Protocol types in early 2021.

This article provides options for publishing research protocols; the Where Should I Publish? guide in the Scholarly Communication Guide also compares options for other research outputs.

Make File Management Simpler with Version Control

Works in progress can become unruly. As a piece of research code grows, it often spawns new files that iterate on the original: this version fixes one bug but introduces another, or that version swaps two similar functions. The same is true for manuscript drafts, which pass among co-authors, accumulating new text (and usually new filenames) as they travel. It can be difficult to tell these versions apart or to trace how one version evolved from another. Version control systems make this work easier.

Version control is defined as “a system that records changes to a file or set of files over time so that you can recall specific versions later” in Pro Git (second edition, 2014), an excellent open textbook by Scott Chacon and Ben Straub. Version control allows a user to see all changes made to a file, who made the changes, and when they were made. It can let an author approve or reject edits made to a manuscript, or quickly determine which set of figures is the right one to submit to a journal.
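
To make that definition concrete, here is a minimal Python sketch of the idea, not of Git itself: a toy in-memory history that records each saved version with its author and timestamp, recalls any earlier version, and shows what changed between two versions. The names `FileHistory`, `commit`, `recall`, and `diff` are hypothetical and exist only for this illustration, and the manuscript text is made up.

```python
import difflib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Version:
    text: str
    author: str
    saved_at: datetime


@dataclass
class FileHistory:
    """Toy version history for one text file (illustration only, not Git)."""
    versions: list = field(default_factory=list)

    def commit(self, text: str, author: str) -> int:
        """Record a new version and return its number (starting at 1)."""
        self.versions.append(Version(text, author, datetime.now(timezone.utc)))
        return len(self.versions)

    def recall(self, number: int) -> Version:
        """Return the version saved under the given number."""
        return self.versions[number - 1]

    def diff(self, old: int, new: int) -> list:
        """List the lines removed (-) and added (+) between two versions."""
        a = self.recall(old).text.splitlines()
        b = self.recall(new).text.splitlines()
        lines = list(difflib.unified_diff(a, b, f"v{old}", f"v{new}", lineterm=""))
        # Skip the two file-header lines, then keep only removed/added lines.
        return [line for line in lines[2:] if line.startswith(("+", "-"))]


history = FileHistory()
history.commit("Methods\nWe sampled 40 mice.", author="first author")
history.commit("Methods\nWe sampled 42 mice.", author="co-author")

print(history.recall(1).text)    # the original draft is still recoverable
print(history.recall(2).author)  # who made the latest change
print(history.diff(1, 2))        # what changed between the two drafts
```

A real version control system such as Git adds much more (branching, merging, sharing between collaborators), but the core record of what changed, who changed it, and when is the same idea.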

Featured Workshop: Exploring and Cleaning Data with OpenRefine

HSLS offers classes in a wide array of subjects—molecular biology, database searching, bibliographic management, and more! You can quickly view all Upcoming Classes and Events or sign up to receive the weekly Upcoming HSLS Classes and Workshops email.

This month’s featured workshop is Exploring and Cleaning Data with OpenRefine. The workshop will take place on Friday, June 11, 2021, from 10-11:30 a.m.

Exploring and Cleaning Data with OpenRefine is a workshop that introduces participants to the basics of working with OpenRefine to clean, organize, and transform messy datasets.

OpenRefine (formerly Google Refine) is a powerful, free, open-source tool for working with unorganized tabular data. Since OpenRefine runs locally in a web browser, your private data is not uploaded to the cloud and will stay on your local computer. Note that you are always working on a copy of your data; your raw data files are kept in their original form. Another benefit of OpenRefine is that, while the program has a graphical interface, it documents the steps that have been completed to allow for reproducibility in data cleaning. These steps can be saved as JSON scripts and used to automate the cleaning of other, similar files.
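
Because the saved steps are plain JSON, they can be inspected with ordinary tools. The Python sketch below assumes the exported history is a JSON array in which each step carries an `op` and a `description` field. The sample is hand-written for illustration, with hypothetical column names; it is not output from a real project, and actual exports may include additional fields.

```python
import json

# Hand-written sample in the general shape of an OpenRefine operation history.
# The column names are hypothetical, and real exports may carry more fields.
SAMPLE_HISTORY = """
[
  {
    "op": "core/text-transform",
    "engineConfig": {"facets": [], "mode": "row-based"},
    "columnName": "patient_name",
    "expression": "value.trim()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column patient_name using expression value.trim()"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "DOB",
    "newColumnName": "date_of_birth",
    "description": "Rename column DOB to date_of_birth"
  }
]
"""


def summarize_steps(history_json: str) -> list:
    """Return one 'operation: description' line per recorded step."""
    steps = json.loads(history_json)
    return [f"{step['op']}: {step['description']}" for step in steps]


for line in summarize_steps(SAMPLE_HISTORY):
    print(line)
```

A summary like this can serve as a quick written record of how a dataset was cleaned, alongside the JSON file itself.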

Common Data Elements: Benefits and Feedback Requested

Common Data Elements (CDEs) are definitions that allow data to be consistently captured and recorded across studies. Simply put, they allow researchers to ask the same questions in the same way across studies and receive standardized responses. For example, consider the following two questions about adolescent exercise, used on two different surveys.

Survey 1 Question:

In the past 7 days, how many days did your child exercise so much that he/she breathed hard? (Choose one)

No days

1 day

2-3 days

4-5 days

6-7 days

Survey 2 Question:

In the past 7 days, how often did your child exercise or participate in sports activities that made them breathe hard for at least 20 minutes. (Fill in the blank)

__ day/s

The results from the two surveys could not be combined: one question offers fixed options while the other allows write-in responses, and their definitions of exercise differ (only one explicitly specifies 20 minutes).
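
As a rough illustration of what a shared definition buys, the Python sketch below models a single hypothetical data element based on the Survey 1 wording. The class `DataElement`, its fields, and the study responses are invented for this example and are not taken from any CDE repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    """A hypothetical common data element: one question, one fixed answer set."""
    name: str
    question: str
    permissible_values: tuple

    def validate(self, response: str) -> bool:
        """True if the response is one of the agreed answer options."""
        return response in self.permissible_values


# Illustrative element based on the Survey 1 wording above.
HARD_EXERCISE_DAYS = DataElement(
    name="hard_exercise_days_past_week",
    question=("In the past 7 days, how many days did your child exercise "
              "so much that he/she breathed hard?"),
    permissible_values=("No days", "1 day", "2-3 days", "4-5 days", "6-7 days"),
)

# Two made-up studies that both adopted the element can pool their answers.
study_a = ["No days", "2-3 days", "6-7 days"]
study_b = ["1 day", "2-3 days"]
pooled = [r for r in study_a + study_b if HARD_EXERCISE_DAYS.validate(r)]

print(len(pooled))                       # every response fits the shared set
print(HARD_EXERCISE_DAYS.validate("3"))  # a Survey 2 style write-in does not
```

When every study records answers against the same element, pooling is a simple concatenation; a write-in value has no agreed place in the shared answer set.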

Explore DOIs and Beyond at PIDapalooza—the Global Festival of Persistent Identifiers for Digital Objects

Take a moment and consider your name. Do you have a name so unique that you are the only person with that name publishing in your field? I do—there are very few Helenmarys in the world to begin with. But if you cut my first name down to “Helen,” suddenly I could be one of a dozen authors working in my area. Reduce it further to “H,” and I’ve vanished among the crowd. Uniqueness is no match for the sheer ubiquity of names in the online scholarly publishing record.

What I need is a PID: a persistent identifier that refers to me and only me, and would still refer to me if I changed my name. For names, that’s easy: I have an ORCID iD, a sixteen-character identifier (fifteen digits plus a final check character, which can be a digit or X) that I can connect to my research output and take with me wherever I go. But what if I were not a person but a dataset, an article, or a piece of software? All of those can get PIDs too, as can far stranger objects, the breadth of which was the focus of January’s all-online, still-available, free PIDapalooza festival.
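
The structure of an ORCID iD can be checked mechanically. ORCID documents the final character as a check character computed over the first fifteen digits with the ISO 7064 MOD 11-2 algorithm. The Python sketch below follows that description; the function names are chosen for this example, and the iD shown is the sample iD used in ORCID's own documentation.

```python
def orcid_check_character(first_15_digits: str) -> str:
    """Compute the final character of an ORCID iD (ISO 7064 MOD 11-2)."""
    total = 0
    for digit in first_15_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check length, digits, and the check character (hyphens are ignored)."""
    compact = orcid.replace("-", "")
    body = compact[:15]
    if len(compact) != 16 or not (body.isascii() and body.isdigit()):
        return False
    return orcid_check_character(body) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))  # sample iD from ORCID's documentation
print(is_valid_orcid("0000-0002-1825-0098"))  # same iD with the last digit altered
```

The check character only catches typing errors; it does not confirm that an iD is registered or that it belongs to a particular person.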

Celebrating Love Data Week 2021

The week of February 8-12, 2021, is Love Data Week, an international event designed to raise awareness about research data management, sharing, preservation, and, most importantly, how we can help you. To celebrate, HSLS Data Services will be hosting a variety of workshops and giveaways, and engaging with the community via social media (#LoveDataPgh).

Workshops

The HSLS classes offered during Love Data Week (online synchronous via Zoom) are listed below. Every class attendee will be entered into a raffle for a chance to win a gift card (mailed to winner). The more classes you attend, the more chances you have.

Introduction to Research Data Management, February 8, 2–3 p.m.
Data Management in R, February 9, 11 a.m.–12:30 p.m.
Social Justice and Publicly Available Data, February 10, 10–11 a.m.
Increase Your Data’s Discoverability with the Pitt Data Catalog, February 10, 1:30–2:30 p.m.
Command Line Basics: Questions Hour, February 11, noon–1 p.m.
Mapping Geographic Data with Tableau, February 12, 10–11 a.m.

Note: Zoom links will be sent upon registration (also available at the above class links).