This information is over 2 years old. Information was current at time of publication.

Which Data Repository to Choose?

Many journals and funding agencies are requiring researchers to deposit their data in publicly accessible databases or repositories. This not only helps to ensure the long-term accessibility and preservation of the data but also increases its discoverability and reuse.

The number of sustainable online repositories available to host and archive research data may seem overwhelming. Guidance for repository selection is offered below. Also available is the HSLS Data Management Repository Web site. Note: before selecting a repository, researchers should review the deposit directions and policies for the specific repository.


DMPTool Adds Template for Genomic Data Sharing Policy (GDS)

A Data Management Plan (DMP) is a formal document describing how your data will be managed during your research and after the project is completed, including sharing resulting data with other researchers, and archiving the data for future access and use. With requirements for DMPs increasing among major research funders such as the National Institutes of Health (NIH) and the National Science Foundation (NSF), where can researchers quickly find assistance writing a DMP?

The DMPTool, an online system guiding researchers in completing this key component, can be accessed on the DMP tab of the HSLS Data Management Guide. The DMPTool has been customized for the University of Pittsburgh: simply select the University of Pittsburgh from the drop-down menu on the Institution Log In page. Log in with your Pitt credentials on the Web Authentication page.

The DMPTool offers templates for NIH, NSF, and other funders, but an important new template is “NIH-GDS: Genomic Data Sharing,” written specifically for the new NIH Genomic Data Sharing Policy (GDS). The customized GDS template offers suggested responses and guidance from Pitt’s IRB and Office of Research (OOR) in one location.

To get started using the GDS template, click on Create New DMP, then Select Template. In the list of templates, click on National Institutes of Health to reveal two NIH templates. Select NIH-GDS: Genomic Data Sharing.


After providing some proposal identification information, you will be asked to enter the DMP details. The template outline is on the left. The workspace on the right has three tabs: “Instructions” are from the NIH-GDS; “Links” connect to Pitt and NIH-GDS resources; “Suggested Response” is a customized response, added by HSLS librarians, that researchers may modify as appropriate. It includes pertinent fields populated with selections (in brackets) to be made by the investigator. Do not submit the suggested response without editing it to describe your specific study and data.

While the DMPTool offers excellent guidance and sample documents throughout the DMP writing process, Pitt health sciences researchers preferring to review their DMP with a librarian are welcome to contact the HSLS Data Management Group.

~Andrea Ketchum


DMPTool: Create and Share Data Management Plans

Data management plans (DMPs) are now a standard part of grant proposals for most funding agencies. A DMP should describe what you will do with your data during your research and once your project is completed. The plan may include details of the types of data you will collect, how you will preserve it, and how you will share the data with others.

To help researchers easily create and share DMPs, the University of Pittsburgh has become a partner institution of the DMPTool. The DMPTool offers ready-to-use templates to guide researchers through the process of generating a comprehensive plan tailored to the specific requirements of agencies, such as the National Institutes of Health, National Science Foundation, and Department of Energy. Links to general and institutional resources are available throughout the templates, offering researchers additional support.

By logging into the DMPTool with a University of Pittsburgh Computing Account username and password, researchers are able to create customized DMPs, add co-owners and editors to plans, and share their DMPs either only within the University of Pittsburgh or publicly. A number of publicly shared DMPs are available within the tool and can be reviewed, copied, and/or edited. Upon completion, DMPs can be exported for inclusion in a funding proposal.

For more information on the DMPTool, see the promotional video, The DMPTool: A Brief Overview, or contact a member of the HSLS Data Management Group.

~Melissa Ratajeski


Sharing Genomic Data: An Overview of NIH Policy

Does your research do all of the following?

  • Generate genomic data, either human or non-human.
  • Produce “large-scale” data, i.e., genome-wide association studies (GWAS), single nucleotide polymorphism (SNP) arrays, genome sequence, transcriptomic, epigenomic, and/or gene expression data.
  • Receive funding by the National Institutes of Health (NIH), either intramural, contract, or grant-based.

If so, NIH policy now requires you to share your data. To learn more, keep reading.


How Do You Manage Your Data? Let Us Know!

Do you have a great file-naming system that everyone in the lab uses, or have you “lost” files because lab members moved on and you’re not sure how their files were labeled? Do you have a protocol for sharing your data, or do you still have questions about credit and proper usage? Do you have a data management plan in place, or do you need help coming up with one? We would love to have a conversation with you to discuss all of these issues and more.

Image Credit: DataONE. Retrieved January 20, 2015. https://www.dataone.org/best-practices

HSLS is conducting a research study to collect information on researchers’ workflows and data management practices. Participation in the study will require one interview, conducted in the researcher’s lab space and taking an average of 45–60 minutes.

Our intention is that participation in this interview will benefit you and your research laboratory by bringing to light possible modifications that could be made regarding management of data in your research setting.

The data received will be used for research purposes and library educational efforts. Your responses will remain confidential and data will be saved on a password protected server.

To participate, or if you have questions, please contact the study PI: Melissa Ratajeski.

~Carrie Iwema


Data Journals: A New Way to Share Research Data

How do you share your data? If your answer only includes publishing results in a journal article or presenting results at a conference, think again! Consider that a journal article or conference presentation is composed of two parts: 1) the interpretation of data collected, in the form of the text, and 2) the supporting evidence, i.e., the data. These two parts are increasingly recognized as independently citable. In keeping with the University of Pittsburgh’s Guidelines on Data Management and policies from funding agencies such as the National Institutes of Health (NIH), the underlying data developed with research awards should also be shared.

One response to this dichotomy has been the appearance of a new type of journal: the data journal. Data journals feature standardized descriptions and links to peer-reviewed datasets and supporting tools. Authors use a template to easily create the description during the submission process. This new publication type has been designated “Data Descriptor” by Nature Publishing. The published data descriptor is often, but not necessarily, associated with a separate journal article. Each publication type generates its own citations.

Data journals have the potential to improve dissemination and discoverability over data repositories because these journals 1) may be indexed in MEDLINE, EMBASE, and other important biomedical databases, and 2) once established, could receive an impact factor from Journal Citation Reports.

While data journals promote and facilitate the reuse of datasets by publishing detailed and accurate descriptions, they do not usually host data themselves, but use links to data repositories, eliminating conflicts with funder, institutional, or publisher repository requirements.

Benefits of publishing research data separately include:

  1. Increased data citations
  2. Validation of data
  3. Data preservation services
  4. Reusable data for additional research
  5. Reusable data for teaching
  6. New collaborations

Find out more about each of these current data journals:

For more information about data sharing, see Data Management Planning: Data Sharing in the September 2013 issue of the HSLS Update.

~Andrea Ketchum


Data Citing Guidance

Recent guidelines from federal agencies, institutions, and journal publishers encourage researchers to share their raw data. Shared data can be located in places such as repositories or on departmental Web sites, and their use requires the inclusion of a citation in a manuscript’s reference list, as would be done with a journal article or book.

Why cite data?

Citations create an important linkage between papers and supporting data, allowing for verification, replication, and re-use of the data in new studies or a meta-analysis. Similar to journal articles, the number of times that a dataset is cited could be tracked and used to support a researcher’s tenure and promotion, or to illustrate the impact of a research study.

Data citations should be included in your manuscript even when you are the producer of the data. Data can be cited without making the dataset available through open access.

How to Cite Data?

Unfortunately, most of the major style guides do not provide guidance on how to cite data, and “data” is not an available reference type in some bibliographic management software tools (EndNote X6 does have a reference type “dataset”).

The organization DataCite recommends citing data using one of these formats (fields defined below):

Minimal Citation Requirement:
Creator (Publication Year): Title. Publisher. Identifier
Citation Requirement with Optional Fields:
Creator (Publication Year): Title. Version. Publisher. Resource Type. Identifier.

 

  • Creator:  This can be an individual, group, or an organization.
  • Title: Name of the dataset or name of the study resulting in the data, not the name of the resulting journal article.
  • Version: Each iteration should have a unique number.
  • Publication Year: When the data set was published or when it was posted online; not the data creation date.
  • Publisher: Entity that makes the data available for downloading, when applicable. This might be a repository like Dryad, or an institutional repository at an academic institution.
  • Identifier: The DOI (Digital Object Identifier) or other persistent identifier. This could also be a Web site that points to a description of the data and includes a notation regarding accessibility.
  • Resource Type: A one-word description such as image, dataset, software, audiovisual, etc.
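The two citation formats above can be assembled mechanically from the listed fields. The following is a minimal Python sketch, not an official DataCite tool: the `DatasetCitation` class, the `format_citation` function, and all example values (including the DOI) are hypothetical, and the output omits a trailing period for both formats.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetCitation:
    """Fields from the citation formats above (class name is hypothetical)."""
    creator: str
    publication_year: int
    title: str
    publisher: str
    identifier: str
    version: Optional[str] = None        # optional field
    resource_type: Optional[str] = None  # optional field


def format_citation(c: DatasetCitation) -> str:
    """Build 'Creator (Publication Year): Title. [Version.] Publisher. [Resource Type.] Identifier'."""
    parts = [c.title]
    if c.version:
        parts.append(c.version)
    parts.append(c.publisher)
    if c.resource_type:
        parts.append(c.resource_type)
    parts.append(c.identifier)
    return f"{c.creator} ({c.publication_year}): " + ". ".join(parts)


# Hypothetical example values, for illustration only.
example = DatasetCitation(
    creator="Doe, J.",
    publication_year=2014,
    title="Example Survey Data",
    publisher="Example Repository",
    identifier="doi:10.0000/example",
)
print(format_citation(example))
```

Leaving `version` and `resource_type` unset produces the minimal format; supplying them produces the format with optional fields, in the field order shown above.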

For more information on data sharing and repositories, please refer to these recent HSLS Update articles: “Data Management Planning: Data Sharing,” September 2013, and “Data Repositories: Meeting Your Research Needs,” February 2014.

For questions, contact the Falk Library Main Desk at 412-648-8866 or Ask a Librarian.

~Melissa Ratajeski


Data Repositories: Meeting Your Research Needs

What is a data repository? According to the E-Science Thesaurus, a data repository can be broadly “defined as a place that holds data, makes data available to use, and organizes data in a logical manner.”1 The National Institutes of Health (NIH) further defines repositories by level of security to accommodate sensitive data:2

  • Data archive—a place where machine-readable data are acquired, manipulated, documented, and finally distributed to the scientific community for further analysis.
  • Data enclave—a controlled, secure environment in which eligible researchers can perform analyses using restricted data resources.

In accordance with the NIH and the National Science Foundation policies requiring that research data developed with federal funds be shared with other researchers, data repositories provide the technical platform that enables the sharing, discovery, validation, and reuse of data. They also support greater efficiency throughout the scientific process.

What advantages does a data repository offer a health sciences researcher? Besides convenient storage and facilitated, professional long-term preservation for your research data, a data repository provides:

  • Updates to new data formats
  • Enhanced discoverability
  • Increased citation rates
  • Access to a variety of datasets to explore
  • Ability to reuse validated and unique datasets
  • More efficient workflow

When selecting a data repository, first check for funder, journal, or institutional requirements, and maintain compliance with your research protocols. General data repositories as well as subject-specific repositories are represented in the searchable directories listed below.

General:

Directories:

For previous articles on data management published in the HSLS Update, please see:

1. E-Science Thesaurus: Data Repository. E-Science Portal for New England Librarians. Last updated: Sep 5, 2013. Accessed Jan. 7, 2014.
2. Definitions: NIH Data Sharing Policy and Implementation Guidance. National Institutes of Health (NIH). Bethesda, MD. Last updated: March 5, 2003. Accessed Jan. 7, 2014.

~ Andrea M. Ketchum


Data Management Planning: Privacy and Ethical Issues

If you are a biomedical researcher, then you are well aware that funding agencies and publishers have guidelines for ensuring the privacy and ethical treatment of animal and human subjects. Any research institution that accepts federal funding is legally required to have policies in place to oversee its research programs. These policies include monitoring conflicts of interest, reporting misconduct, ensuring adherence to safety regulations, and maintaining committees that review animal and human research protocols.

The Institutional Animal Care and Use Committee (IACUC) oversees the appropriate care and humane treatment of animals being used for research, testing, and education. The purpose of the Institutional Review Board (IRB) is to protect the rights and welfare of individuals participating as subjects in the research process.

In the context of data management, the IRB has three roles:

  • Reviews data management plans to examine feasibility (cost, infrastructure, staffing).
  • Reviews data collection forms to limit the amount of personal identifiable information being collected.
  • Reviews research protocols to determine how data will be safeguarded.

The rules about safeguarding include consideration of who will have access to the data technically, physically, and administratively, as well as for what purpose. These are occasionally called the privacy or confidentiality rules. However, the University of Pittsburgh IRB makes an important distinction between the two terms:

  • “Privacy” refers to the individual’s right to control access to themselves, including personal information and biological specimens.
  • “Confidentiality” refers to how an individual’s private information will be protected from release by the researcher, which is an important element of the consent process.

At the federal level, health data are protected by the Health Insurance Portability and Accountability Act (HIPAA). Information about the University of Pittsburgh’s HIPAA policies and procedures with regard to research may be found on Pitt’s Institutional Review Board’s Health Insurance Portability and Accountability Act (HIPAA) Web site, including sample protocols and consent forms.

If you are submitting a grant to either the National Institutes of Health or the National Science Foundation, be sure to review their guidelines on human subjects and privacy issues before creating your data management plan. If you have additional questions, refer to the University of Pittsburgh’s IACUC and IRB Web sites.

For previous articles on Data Management published in the HSLS Update, see:

~ Carrie Iwema


Data Management Planning: Data Sharing

Data sharing is an important part of the scientific method. The University of Pittsburgh’s Guidelines on Data Management aligns with the National Institutes of Health (NIH) and the National Science Foundation (NSF) policies stating that data developed with federal funds should be shared on request with other researchers. With federal budgets under increasing pressure, data sharing leverages public investment by:

  • Speeding discovery
  • Making available unique and difficult to replicate data
  • Enabling the exploration of new topics
  • Eliminating redundancy
  • Facilitating validation studies
  • Discouraging fraud
  • Permitting the creation of new data sets by combining data from multiple sources
  • Facilitating meta-analysis
  • Encouraging diversity of analysis and opinion

Additionally, publishers such as Nature, Science, and PLoS require that supporting data be made available as a condition of publication, in turn making data more easily found online via data repositories. Benefits to researchers include increased publication citation1 rates, access to new research data, and convenient long-term storage.

What is research data?

When meeting the requirements of the NIH and NSF, data is not simply what appears in the published article: it is the “recorded factual material…necessary to validate research findings,”2 i.e., the raw data on which summary statistics and tables are based. The University of Pittsburgh further classifies research data3 as intangible (statistics, findings or conclusions) or tangible (notebooks, videos, forms, etc.).

Is there a timeline for sharing data?

NIH mandates that final research data be shared “no later than the acceptance for publication of the main findings from the final data set.”4 Describe plans in the NIH data management plan (DMP),5 required for projects seeking $500,000 or more in direct costs in any year. The DMP is a brief paragraph following the Research Plan of the application, and does not count towards the page limit.

NSF requires sharing final research data for all projects in a “reasonable length of time6 as long as the cost is modest.” The NSF DMP5 is two pages maximum for all full proposals, and does not count towards the 15-page Project Description.

“Data Repositories: Meeting Your Research Needs” will be covered in a future article.

1. H.A. Piwowar, R.S. Day, D.B. Fridsma, “Sharing Detailed Research Data Is Associated with Increased Citation Rate,” PLoS One 2, no. 3 (2007):e308.

2. U.S. Office of Management and Budget, Executive Office of the President, Federal Register Notice re OMB Circular A-110 (Washington, D.C., September 30, 1999), http://www.whitehouse.gov/omb/fedreg_a110-finalnotice.

3. University of Pittsburgh, Guidelines on Research Data Management (Pittsburgh, PA, November 25, 2009), http://www.provost.pitt.edu/documents/RDM_Guidelines.pdf.

4. National Institutes of Health (NIH), Final NIH Statement on Sharing Research Data (Bethesda, MD, February 26, 2003), http://grants.nih.gov/grants/guide/notice-files/NOT-OD-03-032.html.

5. National Institutes of Health (NIH), NIH Data Sharing Policy and Implementation Guidance (Bethesda, MD, February 9, 2012), http://grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm.

6. National Science Foundation, Biological Sciences Directorate, Information about the Data Management Plan Required for all Proposals (2/20/13) (Arlington, VA, February 20, 2013), http://www.nsf.gov/bio/pubs/BIODMP061511.pdf.

~ Andrea Ketchum


Data Management Planning: Data Ownership, Part 4

Who owns your research data—you, the University of Pittsburgh, or the government? Who has the legal rights to your data, and who retains the data after project completion?

Data ownership refers to the rights and control over data as well as data management and use. The rules surrounding ownership depend on who is providing the funding. Grants from philanthropic organizations (e.g., foundations) tend to advance specific causes, and policies on ownership rights will vary. Private funders (e.g., pharmaceutical companies) are interested in profits as well as benefits to society, and typically retain ownership rights for the commercial use of data. Government agencies (e.g., NIH) fund research to improve the general health and welfare of society and provide support in the form of grants and contracts.

With a federally funded grant, researchers are required to conduct the research and submit reports, but control of the data remains with the institution that received the funds, such as the University of Pittsburgh. With a contract, researchers are required to deliver a service or product, which is ultimately controlled by the government. It is important for you to know whether your government-funded research is in the form of a grant or a contract, as this will influence where you can publish and who can use your data.

Your research institution does indeed own your data, but allows you, as the Principal Investigator (PI), to be the data steward, subject to institutional review. The PI controls the research direction, publication, and copyright (unless given to a publisher) and is responsible for data collection, recording, storage, retention, and disposal. Remember that if you have a federally funded grant, your data and lab notebooks belong to the grantee institution—NOT to you, your students, or your fellows. Also, if you leave a grantee institution, you must negotiate to keep both your grants and your data.

So, before undertaking any research, ask yourself the following questions:

  • Who owns the data I’m collecting?
  • What rights do I have to publish the data?
  • Does collecting these data impose any obligations on me?

For authoritative sources of information on data ownership, please see:

For other articles in this series about data management, please see:

~ Carrie Iwema


Data Management Planning: Storage, Backup, and Security, Part 3

Are you taking the appropriate steps to store, back up, and secure your data files? Does your research group have formal policies in place that are detailed in your data management plan? If you answered no, or maybe…read on!

The lifespan of storage media such as servers, hard drives, CDs/DVDs, and flash drives varies depending on the use, location, and maintenance of the media. It is important to know the limitations, lifespan, and maintenance needs of the selected media. Don’t forget: all media, no matter how reliable, must be backed up. Creation of multiple backups and use of off-site storage provide the best protection. Consider whether to back up particular files or the entire computer system, the frequency needed, the data backup location (e.g., an off-site server), and the persons responsible.

If you do not have the staff or expertise to implement a data management plan, consider consulting with one of these University of Pittsburgh departments: Computing Services and Systems Development (CSSD), the Center for Research on Health Care (CRHC) Data Center, Epidemiology Data Center, or the Pittsburgh Supercomputing Center (fees may apply).

Are all of the members in your lab using a consistent file naming convention to increase efficiency? The contents of data files should be described in brief but meaningful ways for quick retrieval. Other tips:

  • Avoid symbols such as “  /  \  :  *  ?  ”  <  >  #  [  ]  &  $ in file names.
  • Do not use spaces to separate words.
  • Follow the date format recommended by ISO 8601: YYYY-MM-DD.
  • When using sequential numbering, make sure to use leading zeros so files stay in order when sorting by file name (e.g., RatajeskiSurvey01).
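The tips above can be combined into a small helper that builds file names consistently. This is a minimal Python sketch under stated assumptions: the function name `make_file_name`, the underscore between date and description, the CamelCase joining of words, and the placement of the date first are illustrative choices, not part of the original guidance.

```python
from datetime import date

# Symbols the tips above say to avoid (straight and curly double quotes included).
FORBIDDEN = set('"/\\:*?<>#[]&$“”')


def make_file_name(description: str, day: date, sequence: int,
                   extension: str, width: int = 2) -> str:
    """Return a name like 2013-04-09_RatajeskiSurvey01.csv (layout is an assumption)."""
    # Drop the symbols to avoid.
    cleaned = "".join(ch for ch in description if ch not in FORBIDDEN)
    # Remove spaces by capitalizing each word and joining the words.
    stem = "".join(word[:1].upper() + word[1:] for word in cleaned.split())
    if not stem:
        raise ValueError("description has no usable characters")
    # ISO 8601 date (YYYY-MM-DD) and a zero-padded sequence number.
    return f"{day.isoformat()}_{stem}{sequence:0{width}d}.{extension.lstrip('.')}"


print(make_file_name("ratajeski survey", date(2013, 4, 9), 1, "csv"))
```

Because the sequence number is zero-padded, a plain alphabetical sort keeps files 01, 02, and 10 in numeric order; without padding, 10 would sort before 2.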

Finally, safeguard the integrity of your data by restricting access to sensitive data. Each computer in your lab should have updated anti-virus protection, firewalls, and intrusion detection in place, especially if your system is connected to the Internet. Do not store confidential data on servers or computers connected to an external network or send personal or confidential data via e-mail. Safeguard your physical space as well. Control access to rooms and computers where data is stored and log the removal of, and access to, media or hardcopy material.

More information about sensitive data, security, and backing up your data, is available at technology.pitt.edu/security.html.

Part 1 of this series appeared in the February 2013 HSLS Update and explored various aspects of data management planning, while Part 2, in the March 2013 issue, examined metadata.

~ Melissa Ratajeski


Data Management Planning: Metadata, Part 2

Metadata: it is so much more than data about data! When a dataset is included in an online collection or database, the standardized structure and vocabulary of metadata makes it “findable” when users query the search interface. Metadata also supports interoperability between databases, providing the semantic power necessary for sharing datasets and enabling collaboration.

At its simplest level, metadata provides a standardized description of the content of any form of data, such as a book, an image, or a dataset. The metadata elements for a book include title, author, and publication year, whereas the elements for a dataset can include contributor/creator, unique identifier, format, file size, type of data (survey, microarray), subject, abstract, version, source, and ownership. Use metadata elements as headings in spreadsheets or databases to facilitate and standardize data collection.
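As one way to picture using metadata elements as spreadsheet headings, the Python sketch below writes dataset descriptions to CSV with one column per element. The column names are adapted from the elements listed above; the exact spellings, the `write_metadata_csv` function, and the rule that every element must be present are assumptions made for illustration, not a published standard.

```python
import csv
import io

# Dataset elements named in the paragraph above, used as column headings.
METADATA_FIELDS = [
    "creator", "identifier", "format", "file_size", "data_type",
    "subject", "abstract", "version", "source", "ownership",
]


def write_metadata_csv(records, out):
    """Write one row per dataset, rejecting records that omit an element."""
    writer = csv.DictWriter(out, fieldnames=METADATA_FIELDS)
    writer.writeheader()
    for record in records:
        missing = [field for field in METADATA_FIELDS if field not in record]
        if missing:
            raise ValueError(f"record is missing fields: {missing}")
        writer.writerow(record)


# Hypothetical record: every element filled with a placeholder value.
record = {field: "example" for field in METADATA_FIELDS}
buffer = io.StringIO()
write_metadata_csv([record], buffer)
print(buffer.getvalue())
```

Fixing the headings up front means every lab member records the same elements in the same order, which is the standardization the paragraph describes.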

Pre-existing sets of elements are readily available, such as Dublin Core, with 15 repeatable core elements, and DataCite, with 17 repeatable elements. Both are designed especially for scientific datasets. DataCite also supports registration of a persistent Digital Object Identifier (DOI), which serves to increase the exposure and citation count of your dataset.

Some biomedical journals require raw data to be deposited in an approved public repository, such as UniProt or ArrayExpress, prior to peer review, and an accession number assigned by the repository must then be submitted along with the manuscript. The metadata standards for those datasets are determined by the repositories. An excellent resource for biomedical metadata standards is “MIBBI: Minimum Information Guidelines from Diverse Bioscience Communities.”

Metadata is a brief but required section of the National Institutes of Health (NIH) and National Science Foundation (NSF) data management plans. The free resource DMPTool provides a framework for describing the metadata used in projects funded by NIH, NSF, and other organizations.

Part 1 of this six-part series appeared in the February 2013 HSLS Update and gave an introduction to data management planning.

~ Andrea Ketchum


Data Management Planning, Part 1

This is the first article in a six-part series that will describe the various aspects of data management planning.

All data has a “lifecycle.” It’s created, processed, analyzed, preserved, shared, and potentially re-used by you or others in the research community.

Data management is the development and execution of policies