Metadata: it is so much more than data about data! When a dataset is included in an online collection or database, the standardized structure and vocabulary of its metadata make it “findable” when users query the search interface. Metadata also supports interoperability between databases, providing the semantic power necessary for sharing datasets and enabling collaboration.
At its simplest level, metadata provides a standardized description of the content of any form of data, such as a book, an image, or a dataset. The metadata elements for a book include title, author, and publication year, whereas the elements for a dataset can include contributor/creator, unique identifier, format, file size, type of data (survey, microarray), subject, abstract, version, source, and ownership. Use metadata elements as headings in spreadsheets or databases to facilitate and standardize data collection.
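As a rough illustration of using metadata elements as spreadsheet headings, the following Python sketch writes a CSV file whose header row is the list of dataset elements named above. The snake_case spellings, the function name write_metadata_sheet, and the sample record are all assumptions made for this example, not part of any standard.

```python
import csv
import io

# Element names taken from the paragraph above; the snake_case
# spellings are an assumption made for this sketch.
DATASET_ELEMENTS = [
    "creator", "identifier", "format", "file_size", "data_type",
    "subject", "abstract", "version", "source", "ownership",
]


def write_metadata_sheet(records, out):
    """Write one row per dataset, using the elements as column headings.

    Elements missing from a record are left blank; keys that are not in
    DATASET_ELEMENTS make csv.DictWriter raise ValueError, which keeps
    the headings standardized.
    """
    writer = csv.DictWriter(out, fieldnames=DATASET_ELEMENTS, restval="")
    writer.writeheader()
    for record in records:
        writer.writerow(record)


if __name__ == "__main__":
    # Hypothetical sample record, for illustration only.
    sample = [{
        "creator": "Example Lab",
        "identifier": "example-0001",
        "format": "CSV",
        "data_type": "survey",
        "version": "1.0",
    }]
    buffer = io.StringIO()
    write_metadata_sheet(sample, buffer)
    print(buffer.getvalue())
```

Fixing the column list in one place means every collaborator records the same elements in the same order, which is the point of standardizing data collection.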
Pre-existing sets of elements are readily available, such as Dublin Core, with 15 repeatable core elements, and DataCite, with 17 elements, many of them repeatable. Dublin Core is a general-purpose standard, while DataCite is designed especially for research datasets. DataCite also supports registration of a persistent Digital Object Identifier (DOI), which can make your dataset easier to find and cite.
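To make "repeatable elements" concrete, here is a minimal Python sketch that builds a simple Dublin Core record as XML with the standard library. The 15 element names and the namespace URI come from the Dublin Core Metadata Element Set; the wrapper element name "metadata", the function name build_dc_record, and the sample values are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 core elements; each one is optional and may be repeated.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def build_dc_record(fields):
    """Build an XML record from a dict of element name -> value(s).

    A value may be a single string or a list of strings; a list
    produces one XML element per entry (a repeated element).
    The "metadata" wrapper element is an assumption of this sketch.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")
    for name, values in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        if isinstance(values, str):
            values = [values]
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return root


if __name__ == "__main__":
    # Hypothetical dataset description, for illustration only.
    record = build_dc_record({
        "title": "Example survey dataset",
        "creator": ["First Author", "Second Author"],  # repeated element
        "type": "Dataset",
        "format": "text/csv",
    })
    print(ET.tostring(record, encoding="unicode"))
```

The two creator entries show how one element is repeated rather than packing several names into a single field.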
Some biomedical journals require raw data to be deposited in an approved public repository, such as UniProt or ArrayExpress, prior to peer review, and an accession number assigned by the repository must then be submitted along with the manuscript. The metadata standards for those datasets are determined by the repositories. An excellent resource for biomedical metadata standards is “MIBBI: Minimum Information Guidelines from Diverse Bioscience Communities.”
Metadata is a brief but required section of the National Institutes of Health (NIH) and National Science Foundation (NSF) data management plans. The free resource DMPTool provides a framework for describing the metadata used in projects funded by NIH, NSF, and other organizations.
Part 1 of this six-part series appeared in the February 2013 HSLS Update and gave an introduction to data management planning.
~ Andrea Ketchum