Running Pasta

Author:Carl Oliveros and Brant C. Faircloth
Copyright:This documentation is available under a Creative Commons (CC-BY) license.

Modification History

See Running Pasta

Purpose

Pasta is essentially an updated version of Saté, and Pasta may or may not be incorporated with Treeshrink (we strictly run Pasta here). Pasta is an iterative aligner that tends to align better than simply using mafft on its own.

Preliminary Steps

  1. Installing Pasta is a little tricky. If you have not already, create a conda environment for Pasta

    conda create -n pasta python=3
    
  2. Switch to that environment, make a tmp directory in it, and pull down the pasta code and binaries:

    source activate pasta
    cd ~/anaconda/envs/pasta/
    mkdir -p tmp/pasta-code
    cd tmp/pasta-code
    git clone https://github.com/smirarab/pasta.git
    git clone https://github.com/smirarab/sate-tools-linux.git
    cd pasta
    python setup.py develop
    
  3. To leave the environment, run:

    source deactivate pasta
    

Steps

  1. Pasta enters the equation when we’re trying to align DNA sequences. Here, I’ll discuss these steps in the context of using the phyluce pipeline, although many of the steps in the approach are similar regardless of whether you are using phyluce or not.

  2. You can implement Pasta at several stages of the phyluce pipeline, but the easiest is probably after you have identified the UCE loci and extracted those loci to what we call a “monolithic” fasta file. First thing you want to do it “explode” that monolithic FASTA by locus:

    phyluce_assembly_explode_get_fastas_file \
        --input my-monolithic.fasta \
        --output exploded-loci
    
  3. Now what you should have is a directory of fasta files, one for each locus in your data set. You likely want to filter these loci to remove really short stuff - typically something like those sequences having < 50% of the median length of all sequences for a particular locus:

    phyluce_assembly_filter_seqs_from_fastas \
        --input exploded-loci \
        --output exploded-loci-length-filtered \
        --filtered-sequences-file exploded-loci_fasta.shorts \
        --proportion 0.5 \
        --cores 12 \
        --log-path log
    
  4. And, we want to filter those FASTA files to remove loci that have fewer than 4 taxa. We can do this using some code meant for alignment data:

    phyluce_align_filter_alignments \
        --alignments exploded-loci-length-filtered \
        --output exploded-loci-length-min-4-taxa-filtered \
        --min-taxa 4
    
  5. Once that’s done, we want to package up those alignments and sync them to one of the HPC clusters.

    # package them up
    tar -czvf exploded-loci-length-min-4-taxa-filtered.tar.gz exploded-loci-length-min-4-taxa-filtered
    
    # sync to supermic
    rsync -avLP exploded-loci-length-min-4-taxa-filtered.tar.gz  [email protected]:/home/you/work/
    
  6. On the HPC machine, unarchive those files:

    tar -xzvf exploded-loci-length-min-4-taxa-filtered.tar.gz
    
  7. Let’s rename that directory, to make things simpler

    mv exploded-loci-length-min-4-taxa-filtered exploded-loci
    
  8. Because of the way that we need to parallelize Pasta, we need to create a CSV file that contains the list of files we want to align, along with the output directory information, and the locus name. You can most easily do this with a script like the following:

    # choose a name for the output directory
    OUTDIR="pasta-output"
    # remove any existing list of loci to align
    rm loci-to-align.list
    # loop over loci to build an input file
    for FULLPATH in $PWD/exploded-loci/*; do
        FILE=`basename $FULLPATH`;
        NAME="${FILE%%.*}"
        echo "$FULLPATH,$OUTDIR,$NAME" >> loci-to-align.list
    done
    
  9. This creates a file, loci-to-align.list that looks like the following:

    /home/brant/work/jarvis-align/exploded-loci/uce-1003.unaligned.fasta,pasta-output,uce-1003
    /home/brant/work/jarvis-align/exploded-loci/uce-1004.unaligned.fasta,pasta-output,uce-1004
    /home/brant/work/jarvis-align/exploded-loci/uce-1005.unaligned.fasta,pasta-output,uce-1005
    ...
    
  10. Now, create a bash script named pasta.sh that GNU Parallel will call to run the individual alignments. Note that we are setting the number of cores needed for each alignment to 2 here. You may need to adjust this value if you have very many taxa in each alignment or very many alignments (to increase the amount of RAM per alignment). We are also using the ginsi aligner from mafft, which seems to deal with abberrant sections of sequence pretty well:

    #!/bin/bash
    
    source activate pasta
    
    ## set this manually
    CORES_PER_JOB=2
    
    ## DO NOT EDIT - this comes as input from GNU parallel on STDIN, via the sample.list file
    
    FASTA=$1
    OUTDIR=$2/$3
    mkdir -p $OUTDIR
    
    ## Here are the specific commands we are running
    # run pasta
    python ~/anaconda/envs/pasta/bin/pasta/run_pasta.py --datatype=DNA --num-cpus=$CORES_PER_JOB --input=$FASTA --output-directory=$OUTDIR --aligner=ginsi
    
  11. Make sure to make this script executable: chmod 0755 pasta.sh

  12. Create a qsub script to run the job and specify the number of nodes/cores you will need. Note that we are also declaring the number of CPUs per alignment here:

    #PBS -A <allocation_name>
    #PBS -l nodes=10:ppn=20
    #PBS -l walltime=12:00:00
    #PBS -q checkpt
    #PBS -N multi_pasta
    
    module load gnuparallel/20170122
    
    # Number of Cores per job (needs to be multiple of 2)
    export CORES_PER_JOB=2
    
    # move into the directory containing this script
    cd $PBS_O_WORKDIR
    
    # set the number of Jobs per node based on $CORES_PER_JOB
    export JOBS_PER_NODE=$(($PBS_NUM_PPN / $CORES_PER_JOB))
    
    parallel --colsep '\,' \
            --progress \
            --joblog logfile.$PBS_JOBID \
            -j $JOBS_PER_NODE \
            --slf $PBS_NODEFILE \
            --workdir $PBS_O_WORKDIR \
            -a loci-to-align.list \
            ./pasta.sh {$1} {$2} {$3}
    
  13. Submit the job: qsub pasta.qsub. Be sure to monitor the job with checkjob -j <job_number> to ensure you are using resources approriately.

  14. One the jobs is finished, your alignment data should be in the pasta-output folder. However, the way that Pasta formats things, they data are sort of messy. So, you have a couple of options. The easiest is probably to create a new folder and symlink the alignments for each loci to the new folder. This is easiest to do if you are using zsh:

    mkdir alignments && cd alignments
    
    # BASH version (use if you are doing this on @hpc)
    for i in ../pasta-output/*; do
        LOCUS=`basename $i`
        ln -s $i/*.aln $LOCUS.align.fasta;
    done
    
    # ZSH version (use if you are doing this on our analysis machines and you use ZSH)
    for i in ../pasta-output/*; do
        ln -s $i/*.aln $i:t.align.fasta;
    done
    
  15. Once we have our symlinks, we can make a copy of the alignments folder and follow the symlinks using a special command. This will basically create a new directory containing correctly renamed alignments.

    # must use the -rL option to copy links as real files
    cp -rL alignments alignment-files
    
  16. Copy that back to whatever of our local machines you are using.

    tar -czvf aligment-files.tar.gz alignment-files
    rsync -avLP [email protected]:/home/you/aligment-files.tar.gz ./
    
  17. Finally, you will most likely want to trim the resulting alignments to remove aberrant sequences that ginsi has offset from the “good” parts of the alignment.

    phyluce_align_get_trimal_trimmed_alignments_from_untrimmed \
        --alignments alignment-files \
        --output alignment-files-trim \
        --input-format fasta \
        --output-format nexus \
        --cores 12
    
  18. Don’t forget to strip the locus name information from each alignment:

    mkdir alignment-files-clean && cd alignment-files-clean
    for i in ../alignment-files-trim/*; do cat $i | sed 's/uce\-[0-9]\+_'//g > $i:t; done