Assembly With Itero¶
Author: | Brant C. Faircloth |
---|---|
Copyright: | This documentation is available under a Creative Commons (CC-BY) license. |
Modification History¶
Purpose¶
itero is a pipeline to generate gene trees from a large set of loci, using the most appropriate site rate substitution model.
Preliminary Steps¶
Steps¶
Prior to running itero, it’s a good idea to get an idea of the count of reads that you have for a given sample. If this number is > 3M reads per sample and these are UCE data, we probably want to downsample the data to a manageable size - something like 3M total reads (1.5 M R1 and 1.5M R2 reads) per sample. You also probably want to do this after trimming. Once you’ve cleaned the reads, you can compute the count of reads by:
for i in clean-reads/*; do echo $i; gunzip -c $i/split-adapter-quality-trimmed/*-READ1.fastq.gz | wc -l | awk '{print $1/4}'; done
This will output a count of R1 reads by sample to the console, and you can use a regular expression to re-arrange those bits into a CSV file. Load that CSV file in excel (or sort in some manner) so that you can determine which samples have 3M reads (1.5M read each for
R1
andR2
).One you have the list of those samples you’d like to downsample, you need to create a text file to hold their names, call it something like
samples-to-downsample.txt
:alectura-lathami2 anas-platyrhynchos anser-erythropus anseranas-semipalmata biziura-lobata chauna-torquata colinus-cristatus coturnix-coturnix crax-alector gallus-gallus malacorhynchus-membranaceus megapodius-eremita numida-meleagris oxyura-jamaicensis rollulus-rouloul
In your working directory, imagine you have
clean-reads
containing your trimmed read data. Create a new directorydownsampled-reads
, andcd
into that. Make sure the text file from above is in this new directory. Now, assuming you haveseqtk
in your$PATH
somewhere:for sample in ``cat samples-to-downsample.txt``; do rnum=$RANDOM; reads=1500000; echo 'sampling: ' ${sample} ${reads}; echo 'using: ' ${rnum}; mkdir ${sample}; seqtk sample -s $rnum ../clean-reads/${sample}/split-adapter-quality-trimmed/${sample}-READ1.fastq.gz $reads | gzip > ./${sample}/${sample}-READ1.${reads}.fastq.gz; seqtk sample -s $rnum ../clean-reads/${sample}/split-adapter-quality-trimmed/${sample}-READ2.fastq.gz $reads | gzip > ./${sample}/${sample}-READ2.${reads}.fastq.gz; done
This will create files containing 1,500,000 reads that have been randomly sampled from your larger population of reads. If you need to, symlink in
R1
andR2
files from samples having fewer than 3M total reads. And, if you need to, upload all of those reads to wherever you are running your assembly (e.g.supermic
).On
supermic
, you generally want to split your samples up into batches of about 20 taxa, and you want to run ~2 batches of 20 taxa at the same time (itero uses a lot of IO, so we want to be nice and not suck all the available IO up). Based on your read locations, you want to create a configuration file for a batch of assemblies. That looks something like this:[reference] /home/brant/work/eb2/uce-5k-probes.loci.fasta [individuals] alectura-lathami2:/home/brant/work/eb2/batch-1/raw-reads/alectura-lathami2 anas-platyrhynchos:/home/brant/work/eb2/batch-1/raw-reads/anas-platyrhynchos anser-erythropus:/home/brant/work/eb2/batch-1/raw-reads/anser-erythropus anseranas-semipalmata:/home/brant/work/eb2/batch-1/raw-reads/anseranas-semipalmata biziura-lobata:/home/brant/work/eb2/batch-1/raw-reads/biziura-lobata chauna-torquata:/home/brant/work/eb2/batch-1/raw-reads/chauna-torquata colinus-cristatus:/home/brant/work/eb2/batch-1/raw-reads/colinus-cristatus coturnix-coturnix:/home/brant/work/eb2/batch-1/raw-reads/coturnix-coturnix crax-alector:/home/brant/work/eb2/batch-1/raw-reads/crax-alector gallus-gallus:/home/brant/work/eb2/batch-1/raw-reads/gallus-gallus malacorhynchus-membranaceus:/home/brant/work/eb2/batch-1/raw-reads/malacorhynchus-membranaceus megapodius-eremita:/home/brant/work/eb2/batch-1/raw-reads/megapodius-eremita numida-meleagris:/home/brant/work/eb2/batch-1/raw-reads/numida-meleagris oxyura-jamaicensis:/home/brant/work/eb2/batch-1/raw-reads/oxyura-jamaicensis rollulus-rouloul:/home/brant/work/eb2/batch-1/raw-reads/rollulus-rouloul
Then, you need so create a job submission script that looks something like:
#PBS -A <allocation_name> #PBS -l nodes=5:ppn=20 #PBS -l walltime=72:00:00 #PBS -q checkpt #PBS -N itero_batch1 ulimit -n 10000 # move into the directory containing this script cd $PBS_O_WORKDIR echo $PBS_NODEFILE source activate itero mpirun -hostfile $PBS_NODEFILE -n 100 itero assemble mpi --config batch-1.conf --output batch-1-assembly --local-cores 20 --clean source deactivate itero
This will run the important bits using 100 CPUs (20 CPUs for the bwa steps).
If you saved the above submission script as
itero.pbs
, then submit the job:qsub itero.pbs
You can also submit additional batches (beyond 2) and make them dependent on the earlier batches finishing - in this way you can submit lots of (batch) jobs, but only have 2 running at the same time and not hog resources. If your submission script is called
itero.pbs
, then you need to determine thejob_id
you want the new job to start after and run:qsub -W depend=afterok:<job_id> itero.pbs