Data Catalog Collaboration Project Wins a Distinguished Award

HSLS, along with academic health sciences libraries at NYU Langone Health, Duke University, University of North Carolina at Chapel Hill, Hofstra University, University of Maryland at Baltimore, University of Virginia, and Wayne State University, participates in the Data Catalog Collaboration Project (DCCP). The DCCP recently received an award from the Clinical and Translational Science Awards (CTSA) Great Team Science Contest. “One of the goals of the CTSAs is to promote team science through establishing mechanisms by which biomedical researchers can collaborate, be trained in why team science is important, and develop evaluation measures to assess teamwork in biomedical research contexts.” “One hundred seventy applications were submitted, and the DCCP received the highest score for the Top Importance category.”

As a participant in the DCCP, HSLS developed the Pitt Data Catalog, a tool that provides Pitt researchers with an easy way to make their datasets discoverable as well as to identify other usable data.

Congratulations to the DCCP!

Inclusion of Pitt Data Catalog Datasets in Google’s New Dataset Search

Researchers across disciplines are increasingly sharing their data, whether because of journal or funder mandates or simply because they see openness as a way to increase the discoverability and reuse of their data. This sharing has resulted in millions of datasets described or deposited in various locations across the web, including general or discipline-specific data repositories, publisher sites, data journals, authors’ home pages, and institutional data catalogs such as the Pitt Data Catalog (for more information see the catalog’s about page).

In early September 2018, Google launched a beta dataset search to enable users to find datasets, no matter their location, through a familiar interface and simple keyword search.

Because the Pitt Data Catalog uses structured data to describe the data included, records from the catalog are retrieved in Google’s search (as shown below), increasing the visibility of research and potentially the number of views and citations of associated publications.

(Images: Pitt Data Catalog record for “Eye Movements”; Google Dataset Search results showing “Eye Movements”)
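For readers curious what "structured data" can look like in practice, here is a minimal sketch. The article does not say which vocabulary the Pitt Data Catalog uses; this example assumes the schema.org Dataset type expressed as JSON-LD, one widely used form of dataset markup. The function name, the dataset title, the URL, and the creator are all hypothetical and appear only for illustration.

```python
import json


def dataset_jsonld(name, description, url, creators, keywords):
    """Build a schema.org Dataset description as a JSON-LD string.

    Assumption: schema.org/Dataset markup is used here only as an
    illustration; it is not confirmed as the catalog's own format.
    """
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "creator": [{"@type": "Person", "name": c} for c in creators],
        "keywords": list(keywords),
    }
    return json.dumps(record, indent=2)


# Hypothetical record; none of these values come from the real catalog.
example = dataset_jsonld(
    name="Example eye movement recordings",
    description="Hypothetical dataset used only to illustrate the markup.",
    url="https://example.org/datasets/eye-movements",
    creators=["A. Researcher"],
    keywords=["eye movements", "example"],
)
print(example)
```

A block like this is typically embedded in the web page that describes the dataset, so a crawler can read the title, description, and creators without guessing at the page layout.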

If you have datasets you would like to have described in the Pitt Data Catalog, please contact the HSLS Data Services team at HSLSDATA@pitt.edu or through our dataset inclusion form.

~Melissa Ratajeski

Your Input Needed: Proposed Provisions for a Future Draft NIH Data Management and Sharing Policy

The National Institutes of Health (NIH) is implementing measures to update its 2003 Data Sharing Policy, issuing a Request for Information (RFI) to solicit public input on proposed key provisions that could serve as the foundation for a future NIH policy for data management and sharing.

These provisions include:

  • Definitions related to data management and sharing;
  • A stated purpose to manage, preserve, and make scientific data accessible in a timely manner for appropriate use by the research community and the broader public;
  • The scope and requirements for all intramural and extramural research, funded or supported in whole or in part by NIH, that results in scientific data, regardless of NIH funding level or mechanism;
  • Proposed elements to be addressed in a data management and sharing plan: data types, related tools, software and/or code, data standards, data preservation and access (including timelines), terms for re-use and redistribution, limitations on access, and responsible personnel for data management oversight; and
  • An NIH compliance and enforcement plan that would include review at least annually, with non-compliance taken into account in future funding or support decisions.

Comments on the proposed key provisions will be accepted electronically through December 10, 2018.

New HSLS Program—Spotlight Series: Software Developed @ Pitt

The HSLS MolBio Information Service and Data Services have collaborated to create a new HSLS program, Spotlight Series: Software Developed @ Pitt, that focuses on software developed by Pitt health sciences researchers. Sessions will begin with a 30-minute presentation of tool development and use cases, followed by instruction on software access/installation, discussion of parameters, and hands-on practice.

The first session in this series will be:

FRED: A Versatile Framework for Modeling Infectious Diseases and Other Health Conditions

Thursday, September 20, 2018, 2:00 p.m. to 4:00 p.m.

Instructor: David Sinclair, PhD, Postdoctoral Researcher, Public Health Dynamics Lab

Location: Scaife Hall, Falk Library, Upper Floor Study Area

Please register and bring your own laptop.

If you would like to present your software or suggest software that we should spotlight, please contact HSLSDATA@pitt.edu.

~Melissa Ratajeski

Data Sharing Statement Policy for Clinical Trials Enacted July 2018

The International Committee of Medical Journal Editors (ICMJE) is a working group of medical journal editors that makes recommendations for the conduct, reporting, editing, and publication of scholarly work in medical journals. Journals that state they follow the ICMJE recommendations include: Academic Medicine, American Journal of Epidemiology, Cancer Nursing, Chest, Circulation, Immunology & Cell Biology, Journal of Dental Hygiene, and Radiology. 

These recommendations cover a range of topics, including:

  • defining the roles of authors
  • conflicts of interest
  • corrections, retractions, republications and version control, and copyrights
  • advertising
  • clinical trials

As of July 1, 2018, manuscripts submitted to ICMJE journals reporting on the results of clinical trials must include a data sharing statement. Data sharing statements must indicate the following:

  • if individual de-identified participant data will be shared;
  • details of the data that will be shared (inclusion of data dictionaries, study protocol, statistical analysis plan, etc.);
  • when the data will become available and for how long; and
  • by what access criteria data will be shared (including with whom, for what types of analyses, and by what mechanism).

Examples of such statements are available on the ICMJE website. As noted by Pitt’s Research Conduct and Compliance Office:

“If you have provided information in the Individual Participant Data (IPD) Sharing Statement module of your ClinicalTrials.gov study record, you should ensure that this information matches the data sharing statement submitted with the manuscript. Questions should be directed to the journal to which you are submitting.”

Clinical trials that begin enrolling participants on or after January 1, 2019, must include a data sharing plan in the trial’s registration.

Members of HSLS Data Services are available for consultation when you are writing your data sharing statement.

~Melissa Ratajeski

Tracking Down Datasets Using PubMed and PMC

PubMed and PubMed Central (PMC) now offer filters to limit a search to only those articles or citations that include related data links, supplemental material, data citations, or a data availability or data accessibility statement.

The filters, detailed below, can be combined with any search by simply adding the Boolean operator “AND” and the specific filter into the search box (see the screenshots below for example syntax; the filters are highlighted in yellow).

PubMed

(Screenshot: data[filter] in the PubMed search box)

Use data[filter] to find citations with related data links in either the Secondary Source ID field or the LinkOut – more resources field (both located below the abstract).
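The syntax described above, a search combined with a filter via "AND", can be sketched in a few lines. The helper names and the search term "asthma" are hypothetical. The second function goes beyond the search box and assumes you want to run the same query through NCBI's E-utilities ESearch endpoint, which is an addition for illustration and not something this article covers. No request is sent; the code only builds the strings.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (an assumption beyond the article,
# which only describes typing the filter into the PubMed search box).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def with_filter(search, pubmed_filter="data[filter]"):
    """Combine a search with a filter using the Boolean operator AND."""
    # Parentheses keep the original search grouped before the filter applies.
    return f"({search}) AND {pubmed_filter}"


def esearch_url(search, pubmed_filter="data[filter]"):
    """Build (but do not send) an ESearch URL for the filtered query."""
    params = {"db": "pubmed", "term": with_filter(search, pubmed_filter)}
    return f"{ESEARCH}?{urlencode(params)}"


query = with_filter("asthma")  # hypothetical search term
print(query)
print(esearch_url("asthma"))
```

The string printed first is what you would paste into the PubMed search box; the URL is only relevant if you script your searches.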

Introducing the Pitt Data Catalog for Dataset Sharing and Discovery

Sharing research data can bring many benefits, including greater visibility for data creators, a more transparent research process, and opportunities to identify potential collaborators. But what about datasets that are stored on a lab server instead of in a data repository, or that should only be shared with vetted researchers? The Pitt Data Catalog is a new platform at HSLS designed to help Pitt health sciences researchers share and discover their otherwise hard-to-find datasets, while keeping ultimate control over the data in researchers’ hands.

“The Pitt data catalog has the potential to improve research collaborations and accelerate the impact of research being conducted in the schools of the health sciences. I strongly encourage each researcher to work with HSLS to make your datasets discoverable through the catalog in accordance with the FAIR Data Principles: Making Data Findable, Accessible, Interoperable and Reusable.” Dr. Arthur Levine, Senior Vice Chancellor for the Health Sciences

Unlike data repositories such as Dryad or Zenodo, the Pitt Data Catalog does not host any data files. Instead, each dataset included in the catalog is described in a metadata record that includes information about the dataset’s authors, subject domain, and data creation process, as well as instructions for accessing the dataset itself and links to associated publications. Some data catalog entries describe publicly available datasets, so their records link directly to the data in a repository. Other entries, which describe privately held datasets, may direct a visitor to e-mail the corresponding author or link to a data-access application form. Each record is created in collaboration with the researcher to ensure accurate and comprehensive information.
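As a rough illustration of the kind of record described above, the following sketch models a catalog entry and the three access routes mentioned (repository link, application form, e-mail to the corresponding author). The class name, field names, and example values are hypothetical; they are not the catalog's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CatalogRecord:
    """Hypothetical metadata record; the catalog hosts no data files."""
    title: str
    authors: List[str]
    subject_domain: str
    creation_process: str
    publications: List[str] = field(default_factory=list)
    repository_url: Optional[str] = None        # publicly available data
    application_form_url: Optional[str] = None  # privately held data
    contact_email: Optional[str] = None         # privately held data

    def access_instructions(self) -> str:
        """Return the first access route recorded for this dataset."""
        if self.repository_url:
            return f"Download from {self.repository_url}"
        if self.application_form_url:
            return f"Apply for access at {self.application_form_url}"
        if self.contact_email:
            return f"E-mail the corresponding author at {self.contact_email}"
        return "No access route recorded"


# Two invented entries: one public, one available on request.
public = CatalogRecord(
    title="Example public dataset",
    authors=["A. Researcher"],
    subject_domain="Example domain",
    creation_process="Collected for illustration only",
    repository_url="https://example.org/repository/123",
)
private = CatalogRecord(
    title="Example restricted dataset",
    authors=["B. Researcher"],
    subject_domain="Example domain",
    creation_process="Collected for illustration only",
    contact_email="author@example.org",
)
print(public.access_instructions())
print(private.access_instructions())
```

The point of the design is that the record always says how to reach the data, while the data itself stays wherever the researcher keeps it.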

If you have datasets you would like to have described in the Pitt Data Catalog, please contact the HSLS Data Services team at HSLSDATA@pitt.edu or through our dataset inclusion form. We’ll schedule an in-person or phone consultation to learn more about your datasets and discuss the most appropriate terminology to describe your data. After we create a draft of your dataset’s record, we’ll send it to you for final approval. If you have updates after the record is published, just contact us to make changes; we may also contact you to make sure our information is still current.

HSLS Data Services staff are happy to give demonstrations for individual health sciences researchers, departments, or labs. If you would like to investigate whether the Pitt Data Catalog would be a good match for your datasets, please reach out and we will gladly explore its possibilities with you.

The University of Pittsburgh, Health Sciences Library System, is a member of the Data Catalog Collaboration Project and has customized this data discovery tool in part with Federal funds from the National Library of Medicine, National Institutes of Health, Department of Health and Human Services, under cooperative agreement number UG4LM012342 with the University of Pittsburgh, Health Sciences Library System. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

~Helenmary Sheridan

Expand Your Data Analysis Universe with Galaxy

The life sciences are erupting with data. Thanks to advancements in DNA sequencing technologies and the speed and capacity of computational algorithms, the generation of vast quantities of genomic and proteomic data is now commonplace and expected. However, analysis of this data is not keeping pace with its acquisition (storage space is yet another issue…). One limiting factor is that many biomedical scientists do not yet know how to access, much less use, the available analytical resources. This article describes a platform for multi-omic data analysis that is accessible, reproducible, and transparent, and recommends resources on how to use it.

Galaxy is a community-supported platform that provides access to over 5,500 tools for a multitude of analytical needs, in categories such as variant analysis, imaging, and statistics. Its components include the Galaxy Software Framework and the Public Galaxy Service. The software framework is an open-source, web-based application that functions as an intermediary between researchers without informatics expertise and the computational infrastructure that runs and stores the analyses. The public service includes the main instance, which is an installation of the Galaxy software combined with many tools and data, as well as over 80 public servers. Some of these servers are domain-specific (e.g., ImmPort Galaxy, which focuses on flow cytometry analysis) or tool-publishing (e.g., MBAC Metabiome Server, which simplifies the control, usage, access, and analysis of microbiome, metabolome, and immunome data). Local institutional instances are also possible; the University of Pittsburgh has a Galaxy server hosted by the Center for Research Computing.

The scale of Galaxy is initially a bit daunting. Fortunately, there are numerous resources to help researchers navigate the analytical possibilities. Everything to get you started is at galaxyproject.org, including Galaxy 101, dataset collections, interactive tours, and a growing collection of tutorials developed and maintained by the worldwide Galaxy community and Galaxy Training Network.

The HSLS Molecular Biology Information Service can also assist you with using Galaxy for your research. During the spring 2018 semester we are introducing two hands-on workshops that will teach the basics of Galaxy including (1) interface navigation and interaction and (2) how to create, modify, and extract workflows.

To learn more, read the bioRxiv article on “Community-Driven Data Analysis Training for Biology” or contact the HSLS Molecular Biology Information Service.

~Carrie Iwema

NEW Data Class Offerings

In our continuous effort to support your research needs, HSLS is offering four new classes this spring covering: (1) introduction to mapping, (2) Python through Jupyter, (3) beginning command line for bioinformatics, and (4) options for bioinformatics analysis. Class descriptions and registration links are listed below.

(1) Data 101: Introduction to Mapping 

Thursday, February 15, 2018, 11 a.m. – 1 p.m.; Registration required

Mapping is a great way to visualize and analyze information, and to tell stories. In this introductory workshop, you’ll learn the principles of mapmaking, understand how computers are used to plot addresses on a map, conduct basic spatial analysis, and update records in a database based on location. Along with a deeper appreciation for computers, this class will provide you with a solid foundation of mapping concepts and processes, and get you prepared to take your first computer-based mapping class. No computers will be used in this class.

Updated PubMed Central Policy Statement on Supplementary Data

PubMed Central (PMC) was established in 2000 as the National Library of Medicine’s full-text journal article repository. Since 2005, PMC has also been the designated repository for papers submitted in accordance with the NIH Public Access Policy. Today, PMC serves as the full-text repository for papers across a variety of scientific disciplines that fall under a number of funding agencies’ public access policies.

These public access policies seek to make published full-text papers resulting from publicly and privately funded research available for the public to find and read. As a repository, PMC ensures the permanent preservation of these research findings and makes the results of this research more readily accessible to the public, healthcare providers, educators, and the scientific community.

Recently, PMC updated its policy statement on supplementary data to articulate more clearly the requirement that any supplementary data (images, tables, video, or other documents/files) associated with an article must be deposited in PMC with that article.

This applies to all files made available in the article record, even if the files are also available in a public repository. An exception may be made for data files that require custom software to read and use, or are very large (over 2 GB).

In cases where data cannot be reasonably included with an article, either in a figure, table, or supplementary file, NLM encourages journals and authors to make the data available in a public repository and include the relevant data citation(s) in the paper.

The NIH Manuscript Submission (NIHMS) system, developed to facilitate the submission of peer-reviewed manuscripts for inclusion in PMC, can accept submissions of datasets (2 GB or smaller) in support of any manuscript files deposited in compliance with a participating funder’s public access policy. Because these datasets will be publicly accessible, those related to human subjects research should not include any personally identifiable information and deposit should be consistent with informed consent. For more information on depositing supplementary data and dataset files via NIHMS, see the related NIHMS FAQ.

For questions regarding this revised policy or for guidance with depositing supplementary data, please refer to the HSLS Scholarly Communication: Public Access Policies page or contact HSLS Data Services.

*Parts of this article were derived from PMC documentation: Funders and PMC and PMC Policies

~Melissa Ratajeski

Open Data in Research Trending Up

“Open Data” is defined by SPARC (Scholarly Publishing and Academic Resources Coalition) as “research data that

  1. is freely available on the Internet;
  2. permits any user to download, copy, analyze, re-process, pass to software, or use for any other purpose; and
  3. is without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself.”

The phrase “open data” first appeared in a PubMed article title in 2000, but it took another 13 years for an increase in publications. As we approach 2018, how do researchers now view open data? And most importantly, how does HSLS support health sciences researchers at Pitt?

NCBI Hackathon @ Pitt

As previously reported, HSLS hosted a National Center for Biotechnology Information (NCBI) Hackathon from September 25-27, 2017, in collaboration with numerous campus partners. The event took place in the Digital Scholarship Commons of the University Library System (ULS). HSLS, the Center for Research Computing (CRC), and the Department of Biomedical Informatics (DBMI) generously provided support for breakfasts. Computing Services and Systems Development (CSSD), the School of Computing and Information (SCI), and the CRC provided expert technical support.

An NCBI-style Hackathon is a social event in which highly motivated individuals with expertise in scientific disciplines, computer programming, software development, etc., meet for an intense few days to formulate useful, efficient pipelines supporting biomedical research. All code generated by NCBI-Hackathons is made freely available on GitHub, and manuscripts describing the design/usage of software tools are posted on the F1000Research Hackathons channel.

The Pitt/NCBI-Hackathon was led by Ben Busby, the NCBI Genomics Outreach Coordinator. Participants were primarily from Pittsburgh, but some traveled from Columbus, Ohio; Baltimore, Md.; Charlottesville, Va.; New York, N.Y.; Denver, Colo.; and San Diego, Calif. Initially, the 24 hackers were divided into five teams, but two of the groups, working on virus discovery and identification of past viral exposure, merged to form a super-group, an NCBI-Hackathon first!

The groups worked for three long, collaborative, and productive days, capped with irreverent awards such as “best hair” and “how I learned to relax and love the hackathon” (see picture). Final projects included:

  • HAQmap: a guide containing information and tools to help organizers create their own NCBI-style hackathon (5-member team).
  • (SC)3 Super Concise Single Cell SNP Caller: a project that enables finding expressed SNPs in SRA data associated with a BioProject record (3-member team).
  • SPeW (SeqPipeWrap): a framework for taking a NextGen Seq pipeline (such as RNA-seq, ChIP-seq, or ATAC-seq) in any language, and using NextFlow as a pipeline management system to create a flexible, user-friendly pipeline that can be shared in a container platform (6-member team).
  • ViruSpy: a pipeline designed for virus discovery from metagenomics sequencing data available in NCBI’s SRA database (10-member team).

The success of the Pitt/NCBI-Hackathon bodes well for the possibility of future hackathons. If you are interested in learning more, please contact the HSLS Molecular Biology Information Service.

~Carrie Iwema

search.DataJournals: a Tool to Discover Data Published within Data Journals

Data journals are a means to share datasets and communicate detailed information about the methods and instrumentation used to acquire the data.

However, locating datasets shared via these publications can be challenging, as PubMed includes very few data journals and does not provide full-text searching to easily locate information not found in the title or abstract of an article. To facilitate this discovery, HSLS created a federated search portal named search.DataJournals, which searches the full text of four open access data journals: Data in Brief, Genomics Data, GigaScience, and Scientific Data.

A query will search across all fields of the data article including data description, materials, methods, instrumentation, data source location, and data accessibility. Search results are aggregated and ordered by relevance and can be filtered by clustered topical categories that are created on the fly based on the textual information of the retrieved records.

Contact HSLS Data Services if you have questions about using this tool, locating datasets, or sharing data.

~Melissa Ratajeski