Assembly With Itero

Author:Brant C. Faircloth
Copyright:This documentation is available under a Creative Commons (CC-BY) license.

Modification History

See Assembly With Itero

Purpose

itero is a pipeline to generate gene trees from a large set of loci, using the most appropriate site rate substitution model.

Preliminary Steps

  1. To install itero, please see the manual. If you plan to use a cluster/HPC, then be sure to install itero there.

Steps

  1. Prior to running itero, it’s a good idea to get an idea of the count of reads that you have for a given sample. If this number is > 3M reads per sample and these are UCE data, we probably want to downsample the data to a manageable size - something like 3M total reads (1.5 M R1 and 1.5M R2 reads) per sample. You also probably want to do this after trimming. Once you’ve cleaned the reads, you can compute the count of reads by:

    for i in clean-reads/*; do echo $i; gunzip -c $i/split-adapter-quality-trimmed/*-READ1.fastq.gz | wc -l | awk '{print $1/4}'; done
    
  2. This will output a count of R1 reads by sample to the console, and you can use a regular expression to re-arrange those bits into a CSV file. Load that CSV file in excel (or sort in some manner) so that you can determine which samples have 3M reads (1.5M read each for R1 and R2).

  3. One you have the list of those samples you’d like to downsample, you need to create a text file to hold their names, call it something like samples-to-downsample.txt:

    alectura-lathami2
    anas-platyrhynchos
    anser-erythropus
    anseranas-semipalmata
    biziura-lobata
    chauna-torquata
    colinus-cristatus
    coturnix-coturnix
    crax-alector
    gallus-gallus
    malacorhynchus-membranaceus
    megapodius-eremita
    numida-meleagris
    oxyura-jamaicensis
    rollulus-rouloul
    
  4. In your working directory, imagine you have clean-reads containing your trimmed read data. Create a new directory downsampled-reads, and cd into that. Make sure the text file from above is in this new directory. Now, assuming you have seqtk in your $PATH somewhere:

    for sample in ``cat samples-to-downsample.txt``;
        do rnum=$RANDOM;
        reads=1500000;
        echo 'sampling: ' ${sample} ${reads};
        echo 'using: ' ${rnum};
        mkdir ${sample};
        seqtk sample -s $rnum ../clean-reads/${sample}/split-adapter-quality-trimmed/${sample}-READ1.fastq.gz $reads | gzip > ./${sample}/${sample}-READ1.${reads}.fastq.gz;
        seqtk sample -s $rnum ../clean-reads/${sample}/split-adapter-quality-trimmed/${sample}-READ2.fastq.gz $reads | gzip > ./${sample}/${sample}-READ2.${reads}.fastq.gz;
    done
    
  5. This will create files containing 1,500,000 reads that have been randomly sampled from your larger population of reads. If you need to, symlink in R1 and R2 files from samples having fewer than 3M total reads. And, if you need to, upload all of those reads to wherever you are running your assembly (e.g. supermic).

  6. On supermic, you generally want to split your samples up into batches of about 20 taxa, and you want to run ~2 batches of 20 taxa at the same time (itero uses a lot of IO, so we want to be nice and not suck all the available IO up). Based on your read locations, you want to create a configuration file for a batch of assemblies. That looks something like this:

    [reference]
    /home/brant/work/eb2/uce-5k-probes.loci.fasta
    
    [individuals]
    alectura-lathami2:/home/brant/work/eb2/batch-1/raw-reads/alectura-lathami2
    anas-platyrhynchos:/home/brant/work/eb2/batch-1/raw-reads/anas-platyrhynchos
    anser-erythropus:/home/brant/work/eb2/batch-1/raw-reads/anser-erythropus
    anseranas-semipalmata:/home/brant/work/eb2/batch-1/raw-reads/anseranas-semipalmata
    biziura-lobata:/home/brant/work/eb2/batch-1/raw-reads/biziura-lobata
    chauna-torquata:/home/brant/work/eb2/batch-1/raw-reads/chauna-torquata
    colinus-cristatus:/home/brant/work/eb2/batch-1/raw-reads/colinus-cristatus
    coturnix-coturnix:/home/brant/work/eb2/batch-1/raw-reads/coturnix-coturnix
    crax-alector:/home/brant/work/eb2/batch-1/raw-reads/crax-alector
    gallus-gallus:/home/brant/work/eb2/batch-1/raw-reads/gallus-gallus
    malacorhynchus-membranaceus:/home/brant/work/eb2/batch-1/raw-reads/malacorhynchus-membranaceus
    megapodius-eremita:/home/brant/work/eb2/batch-1/raw-reads/megapodius-eremita
    numida-meleagris:/home/brant/work/eb2/batch-1/raw-reads/numida-meleagris
    oxyura-jamaicensis:/home/brant/work/eb2/batch-1/raw-reads/oxyura-jamaicensis
    rollulus-rouloul:/home/brant/work/eb2/batch-1/raw-reads/rollulus-rouloul
    
  7. Then, you need so create a job submission script that looks something like:

    #PBS -A <allocation_name>
    #PBS -l nodes=5:ppn=20
    #PBS -l walltime=72:00:00
    #PBS -q checkpt
    #PBS -N itero_batch1
    
    ulimit -n 10000
    
    # move into the directory containing this script
    cd $PBS_O_WORKDIR
    echo $PBS_NODEFILE
    
    source activate itero
    mpirun -hostfile $PBS_NODEFILE -n 100 itero assemble mpi --config batch-1.conf --output batch-1-assembly --local-cores 20 --clean
    source deactivate itero
    
  8. This will run the important bits using 100 CPUs (20 CPUs for the bwa steps).

  9. If you saved the above submission script as itero.pbs, then submit the job:

    qsub itero.pbs
    
  10. You can also submit additional batches (beyond 2) and make them dependent on the earlier batches finishing - in this way you can submit lots of (batch) jobs, but only have 2 running at the same time and not hog resources. If your submission script is called itero.pbs, then you need to determine the job_id you want the new job to start after and run:

    qsub -W depend=afterok:<job_id> itero.pbs