This information is over 2 years old. Information was current at time of publication.{"id":13166,"date":"2020-09-23T08:39:47","date_gmt":"2020-09-23T12:39:47","guid":{"rendered":"https:\/\/info.hsls.pitt.edu\/updatereport\/?p=13166"},"modified":"2020-10-01T09:30:19","modified_gmt":"2020-10-01T13:30:19","slug":"ncbi-datasets-making-genomic-data-download-easy","status":"publish","type":"post","link":"https:\/\/info.hsls.pitt.edu\/updatereport\/ncbi-datasets-making-genomic-data-download-easy\/","title":{"rendered":"NCBI Datasets: Making Genomic Data Download Easy"},"content":{"rendered":"<p><img loading=\"lazy\" decoding=\"async\" class=\"alignright wp-image-13168 size-medium\" src=\"https:\/\/info.hsls.pitt.edu\/updatereport\/files\/2020\/09\/NCBIdataset_Organisms-1-300x291.jpg\" alt=\"Organisms with datasets: Homo sapiens (human), Mus musculus (house mouse), Arabidopsis thaliana (thale cress), and Rattus norvegicus (Norway rat)\" width=\"300\" height=\"291\" data-emailimage=\"right\" srcset=\"https:\/\/info.hsls.pitt.edu\/updatereport\/files\/2020\/09\/NCBIdataset_Organisms-1-300x291.jpg 300w, https:\/\/info.hsls.pitt.edu\/updatereport\/files\/2020\/09\/NCBIdataset_Organisms-1-515x500.jpg 515w, https:\/\/info.hsls.pitt.edu\/updatereport\/files\/2020\/09\/NCBIdataset_Organisms-1-768x746.jpg 768w, https:\/\/info.hsls.pitt.edu\/updatereport\/files\/2020\/09\/NCBIdataset_Organisms-1.jpg 846w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/>There are challenges with downloading genomic data. File sizes are large, and it can be time-consuming to retrieve multiple files. Sometimes downloads fail. A custom script may be required. 
Fortunately, a solution to all of these frustrations is now available: <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/datasets\/\">NCBI Datasets<\/a>.<\/p>\n<p>This <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/datasets\/docs\/about-ncbi-datasets\/\">experimental resource<\/a> allows users to easily download eukaryotic genome sequence and annotation data by assembly accession, taxonomic name (scientific and common), or taxonomy ID. The web interface allows for browsing by organism, with the most common experimental species conveniently available from the main page. For example, try selecting the house mouse (<em>Mus musculus<\/em>), then select all 22 associated assemblies. Download options include genomic, transcript, and protein sequences as well as annotation features.<!--more--><\/p>\n<p>When a selected dataset is close to or above the limit of 15 GB, the downloaded file will be a \u201cdehydrated bag\u201d: a compressed (zipped) file containing only the data report and links to download the selected dataset(s) from the NCBI servers.<\/p>\n<figure id=\"attachment_13167\" aria-describedby=\"caption-attachment-13167\" style=\"width: 881px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-13167\" style=\"border: 1px solid #555\" src=\"https:\/\/info.hsls.pitt.edu\/updatereport\/files\/2020\/09\/NCBIdataset_DehydratedBag-1.jpg\" alt=\"Genomic sequence (FASTA) of 22 assemblies is selected and is 14.68 GB, so it downloads as a 20 MB dehydrated bag\" width=\"881\" height=\"485\" srcset=\"https:\/\/info.hsls.pitt.edu\/updatereport\/files\/2020\/09\/NCBIdataset_DehydratedBag-1.jpg 881w, https:\/\/info.hsls.pitt.edu\/updatereport\/files\/2020\/09\/NCBIdataset_DehydratedBag-1-300x165.jpg 300w, https:\/\/info.hsls.pitt.edu\/updatereport\/files\/2020\/09\/NCBIdataset_DehydratedBag-1-515x284.jpg 515w, https:\/\/info.hsls.pitt.edu\/updatereport\/files\/2020\/09\/NCBIdataset_DehydratedBag-1-768x423.jpg 768w\" 
sizes=\"auto, (max-width: 881px) 100vw, 881px\" \/><figcaption id=\"caption-attachment-13167\" class=\"wp-caption-text\">In this example, the genomic sequence (FASTA) dataset for the 22 assemblies of the house mouse is downloaded in a &lt; 20 MB dehydrated bag.<\/figcaption><\/figure>\n<p>Instructions are provided to \u201crehydrate\u201d the unzipped files and access the full dataset(s). This <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/datasets\/docs\/rehydrate\/\">dehydrate\/rehydrate strategy<\/a> makes it simple to download, share, and store large genome datasets. For example, sharing this data with a colleague is as easy as e-mailing the dehydrated file, which can then be rehydrated at a convenient time.<\/p>\n<p>Plans for NCBI Datasets include adding other assemblies (bacteria and viruses) and datasets such as genomic patch sequence and alternative loci. The project aims to meet the <a href=\"https:\/\/www.go-fair.org\/fair-principles\/\">FAIR principles<\/a> of scientific data management (Findable, Accessible, Interoperable, and Reusable). An introduction to NCBI Labs initiatives is available on the <a href=\"https:\/\/ncbiinsights.ncbi.nlm.nih.gov\/2015\/07\/29\/introducing-pubmed-labs\/\">NCBI Insights blog<\/a>.<\/p>\n<p>Use NCBI Datasets to gather genomic data to practice using the many <a href=\"https:\/\/hsls.libguides.com\/molbio\/licensedtools\/resources\">bioinformatics tools licensed by HSLS<\/a>. Need help with using these resources? <a href=\"https:\/\/www.hsls.pitt.edu\/ask-a-molbio-specialist\">Contact HSLS MolBio<\/a> or (virtually) attend one of our many <a href=\"http:\/\/files.hsls.pitt.edu\/files\/molbio\/MolbioWorkshops.pdf\">workshops<\/a>.<\/p>\n<p>~Carrie Iwema<\/p>\n","protected":false},"excerpt":{"rendered":"<p>There are challenges with downloading genomic data. File sizes are large, and it can be time-consuming to retrieve multiple files. Sometimes downloads fail. A custom script may be required. 
Fortunately, a solution to all of these frustrations is now available\u2014NCBI Datasets. This experimental resource allows users to easily download eukaryotic genome sequence and annotation [&hellip;]<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"issue-archives","meta":{"footnotes":"","_links_to":"","_links_to_target":""},"categories":[156],"tags":[80,-1],"class_list":["post-13166","post","type-post","status-publish","format-standard","hentry","category-october-2020","tag-data-management","avhec_catgroup-issue-archives"],"_links":{"self":[{"href":"https:\/\/info.hsls.pitt.edu\/updatereport\/wp-json\/wp\/v2\/posts\/13166","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/info.hsls.pitt.edu\/updatereport\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/info.hsls.pitt.edu\/updatereport\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/info.hsls.pitt.edu\/updatereport\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/info.hsls.pitt.edu\/updatereport\/wp-json\/wp\/v2\/comments?post=13166"}],"version-history":[{"count":5,"href":"https:\/\/info.hsls.pitt.edu\/updatereport\/wp-json\/wp\/v2\/posts\/13166\/revisions"}],"predecessor-version":[{"id":13192,"href":"https:\/\/info.hsls.pitt.edu\/updatereport\/wp-json\/wp\/v2\/posts\/13166\/revisions\/13192"}],"wp:attachment":[{"href":"https:\/\/info.hsls.pitt.edu\/updatereport\/wp-json\/wp\/v2\/media?parent=13166"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/info.hsls.pitt.edu\/updatereport\/wp-json\/wp\/v2\/categories?post=13166"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/info.hsls.pitt.edu\/updatereport\/wp-json\/wp\/v2\/tags?post=13166"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}