Download Genomes From NCBI¶
Author: | Brant C. Faircloth |
---|---|
Copyright: | This documentation is available under a Creative Commons (CC-BY) license. |
All of the following assume that you are using the Z shell (zsh). These may or may not work in BASH.
Modification History¶
Purpose¶
Sometimes, we want to download genomes from NCBI and sometimes we need to do that for a lot of genomes. Thing is, it’s not always easy to do this, and NCBI does not make it abundantly clear what the best way to do this is for larger genomes (e.g. not microbes). So, here’s one way to go about it.
Steps¶
You need to find identifier information for the genome(s) you want to download. NCBI indexes genomes in several ways, some of them weird. Probably the best way to find the genomes you want is to make a list of the taxa that you want to download (e.g. genus and species). Then you can feed that list of taxa into the script below to pull down the NCBI Taxonomy ID for that individual taxon. We’ll then use this Taxonomy ID to find the assemblies we’re after. You want your list of taxa to look something like:
Benthosema glaciale Percopsis transmontana Typhlichthys subterraneus Cyttopsis rosea Gadiculus argenteus Trisopterus minutus Brosme brosme Molva molva Phycis phycis Phycis blennoides
Save that list as ‘taxa.txt’. Now, create a new file (
get_tax_id.py
), edit the following to add your email address, and run the following code against this list with Python (the list name is hardcoded into the code below, but it’s easy to edit):#!/usr/bin/env python # -*- coding: utf-8 -*- """ (c) 2016 Brant Faircloth || http://faircloth-lab.org/ All rights reserved. This code is distributed under a 3-clause BSD license. Please see LICENSE.txt for more information. Created on 28 April 2016 08:42 CDT (-0500) """ import os import sys import time import argparse from Bio import Entrez # import pdb def get_tax_id(species): """to get data from ncbi taxomomy, we need to have the taxid. we can get that by passing the species name to esearch, which will return the tax id""" species = species.replace(" ", "+").strip() search = Entrez.esearch(term=species, db="taxonomy", retmode="xml") record = Entrez.read(search) return record['IdList'][0] def get_tax_data(taxid): """once we have the taxid, we can fetch the record""" search = Entrez.efetch(id=taxid, db="taxonomy", retmode="xml") return Entrez.read(search) def main(): Entrez.email = "name@domain.edu" if not Entrez.email: print("You must add your email address") sys.exit(2) with open('taxa.txt') as infile: all_taxa = {} for line in infile: tax_name = line.strip() taxid = get_tax_id(tax_name) print("{},{}".format(tax_name, taxid)) time.sleep(1) if __name__ == '__main__': main()
This will spit out a list of taxa to
stdout
that looks like:Malacocephalus occidentalis,630739 Macrourus berglax,473319 Bathygadus melanobranchus,630650 Laemonema laureysi,1784819 Trachyrincus scabrus,562814 Muraenolepis marmoratus,487677 Melanonus zugmayeri,181410
This list now contains the taxon you searched for, and the NCBI Taxonomy ID for that species. The code will hit an error if you include a species name that does not exist in the NCBI Taxonomy database.
Save the list that’s output to a
csv
file named something likencbi_id.csv
. Once you’ve done that, we need to create a new (potentially temporary) conda environment to hold the NCBI Genome Download code.conda create -n ncbi python=3 pip conda activate ncbi pip install ncbi-genome-download
With that environment installed, either navigate to (or create) a directory to hold the genomes we want to download, and copy the list output from our automated search against NCBI taxonomy.
In this directory, we’ll use a ZSH shell script to parse the list of species and NCBI Taxonomy ID we just created, and use components of those parsed files to download the genome sequences we want. In the example below, we’re telling
ncbi-genome-download
to download only the assembly-report and the genome assembly fasta file for each taxon. There are a number of other parameters ofncbi-genome-download
you can investigate.for line in `cat ncbi_id.csv`; do elements=(${(s:,:)line}); ncbi-genome-download -s genbank -T ${elements[2]} --verbose --format "fasta,assembly-report" --output ${elements[1]} vertebrate_other; done
Because
ncbi-genome-download
will download ALL of the assemblies for a given taxon in your list, you probably want to look at what actually was downloaded and cull/trim as needed. You can easily list all of the downloads tostdout
with a command like:for i in *; do echo $i; ls $i/genbank/vertebrate_other/GCA_*/*.fna.gz; done
If you want to reformat all of these to 2bit and keep only the stuff you need:
for i in `find * -maxdepth 0 -type d`; do echo "working on $i"; cp $i/genbank/vertebrate_other/GCA_*/*_assembly_report.txt $i/; gunzip -c $i/genbank/vertebrate_other/GCA_*/*_genomic.fna.gz | faToTwoBit stdin $i/$i.2bit; twoBitInfo $i/$i.2bit $i/$i.info; done
Now you can go back and delete the intermediate fasta files (leavint the 2bit files and summary assembly reports in place):
for i in `find * -maxdepth 0 -type d`; do rm -rf $i/genbank; done