Running Pasta¶

Author:	Carl Oliveros and Brant C. Faircloth
Copyright:	This documentation is available under a Creative Commons (CC-BY) license.

Modification History¶

See Running Pasta

Purpose¶

Pasta is essentially an updated version of Saté, and Pasta may or may not be incorporated with Treeshrink (we strictly run Pasta here). Pasta is an iterative aligner that tends to align better than simply using mafft on its own.

Preliminary Steps¶

Installing Pasta is a little tricky. If you have not already, create a conda environment for Pasta
conda create -n pasta python=3

Switch to that environment, make a tmp directory in it, and pull down the pasta code and binaries:

source activate pasta
cd ~/anaconda/envs/pasta/
mkdir -p tmp/pasta-code
cd tmp/pasta-code
git clone https://github.com/smirarab/pasta.git
git clone https://github.com/smirarab/sate-tools-linux.git
cd pasta
python setup.py develop

To leave the environment, run:
source deactivate pasta

Steps¶

Pasta enters the equation when we’re trying to align DNA sequences. Here, I’ll discuss these steps in the context of using the phyluce pipeline, although many of the steps in the approach are similar regardless of whether you are using phyluce or not.
You can implement Pasta at several stages of the phyluce pipeline, but the easiest is probably after you have identified the UCE loci and extracted those loci to what we call a “monolithic” fasta file. First thing you want to do it “explode” that monolithic FASTA by locus:
phyluce_assembly_explode_get_fastas_file \ --input my-monolithic.fasta \ --output exploded-loci
Now what you should have is a directory of fasta files, one for each locus in your data set. You likely want to filter these loci to remove really short stuff - typically something like those sequences having < 50% of the median length of all sequences for a particular locus:
phyluce_assembly_filter_seqs_from_fastas \ --input exploded-loci \ --output exploded-loci-length-filtered \ --filtered-sequences-file exploded-loci_fasta.shorts \ --proportion 0.5 \ --cores 12 \ --log-path log

And, we want to filter those FASTA files to remove loci that have fewer than 4 taxa. We can do this using some code meant for alignment data:

phyluce_align_filter_alignments \
    --alignments exploded-loci-length-filtered \
    --output exploded-loci-length-min-4-taxa-filtered \
    --min-taxa 4

Once that’s done, we want to package up those alignments and sync them to one of the HPC clusters.

# package them up
tar -czvf exploded-loci-length-min-4-taxa-filtered.tar.gz exploded-loci-length-min-4-taxa-filtered

# sync to supermic
rsync -avLP exploded-loci-length-min-4-taxa-filtered.tar.gz  you@smic.hpc.lsu.edu:/home/you/work/

On the HPC machine, unarchive those files:

tar -xzvf exploded-loci-length-min-4-taxa-filtered.tar.gz

Let’s rename that directory, to make things simpler

mv exploded-loci-length-min-4-taxa-filtered exploded-loci

Because of the way that we need to parallelize Pasta, we need to create a CSV file that contains the list of files we want to align, along with the output directory information, and the locus name. You can most easily do this with a script like the following:

# choose a name for the output directory
OUTDIR="pasta-output"
# remove any existing list of loci to align
rm loci-to-align.list
# loop over loci to build an input file
for FULLPATH in $PWD/exploded-loci/*; do
    FILE=`basename $FULLPATH`;
    NAME="${FILE%%.*}"
    echo "$FULLPATH,$OUTDIR,$NAME" >> loci-to-align.list
done

This creates a file, loci-to-align.list that looks like the following:

/home/brant/work/jarvis-align/exploded-loci/uce-1003.unaligned.fasta,pasta-output,uce-1003
/home/brant/work/jarvis-align/exploded-loci/uce-1004.unaligned.fasta,pasta-output,uce-1004
/home/brant/work/jarvis-align/exploded-loci/uce-1005.unaligned.fasta,pasta-output,uce-1005
...

Now, create a bash script named pasta.sh that GNU Parallel will call to run the individual alignments. Note that we are setting the number of cores needed for each alignment to 2 here. You may need to adjust this value if you have very many taxa in each alignment or very many alignments (to increase the amount of RAM per alignment). We are also using the ginsi aligner from mafft, which seems to deal with abberrant sections of sequence pretty well:
#!/bin/bash source activate pasta ## set this manually CORES_PER_JOB=2 ## DO NOT EDIT - this comes as input from GNU parallel on STDIN, via the sample.list file FASTA=$1 OUTDIR=$2/$3 mkdir -p $OUTDIR ## Here are the specific commands we are running # run pasta python ~/anaconda/envs/pasta/bin/pasta/run_pasta.py --datatype=DNA --num-cpus=$CORES_PER_JOB --input=$FASTA --output-directory=$OUTDIR --aligner=ginsi
Make sure to make this script executable: chmod 0755 pasta.sh

Create a qsub script to run the job and specify the number of nodes/cores you will need. Note that we are also declaring the number of CPUs per alignment here:

#PBS -A <allocation_name>
#PBS -l nodes=10:ppn=20
#PBS -l walltime=12:00:00
#PBS -q checkpt
#PBS -N multi_pasta

module load gnuparallel/20170122

# Number of Cores per job (needs to be multiple of 2)
export CORES_PER_JOB=2

# move into the directory containing this script
cd $PBS_O_WORKDIR

# set the number of Jobs per node based on $CORES_PER_JOB
export JOBS_PER_NODE=$(($PBS_NUM_PPN / $CORES_PER_JOB))

parallel --colsep '\,' \
        --progress \
        --joblog logfile.$PBS_JOBID \
        -j $JOBS_PER_NODE \
        --slf $PBS_NODEFILE \
        --workdir $PBS_O_WORKDIR \
        -a loci-to-align.list \
        ./pasta.sh {$1} {$2} {$3}

Submit the job: qsub pasta.qsub. Be sure to monitor the job with checkjob -j <job_number> to ensure you are using resources approriately.

One the jobs is finished, your alignment data should be in the pasta-output folder. However, the way that Pasta formats things, they data are sort of messy. So, you have a couple of options. The easiest is probably to create a new folder and symlink the alignments for each loci to the new folder. This is easiest to do if you are using zsh:

mkdir alignments && cd alignments

# BASH version (use if you are doing this on @hpc)
for i in ../pasta-output/*; do
    LOCUS=`basename $i`
    ln -s $i/*.aln $LOCUS.align.fasta;
done

# ZSH version (use if you are doing this on our analysis machines and you use ZSH)
for i in ../pasta-output/*; do
    ln -s $i/*.aln $i:t.align.fasta;
done

Once we have our symlinks, we can make a copy of the alignments folder and follow the symlinks using a special command. This will basically create a new directory containing correctly renamed alignments.
# must use the -rL option to copy links as real files cp -rL alignments alignment-files

Copy that back to whatever of our local machines you are using.

tar -czvf aligment-files.tar.gz alignment-files
rsync -avLP you@smic.hpc.lsu.edu:/home/you/aligment-files.tar.gz ./

Finally, you will most likely want to trim the resulting alignments to remove aberrant sequences that ginsi has offset from the “good” parts of the alignment.

phyluce_align_get_trimal_trimmed_alignments_from_untrimmed \
    --alignments alignment-files \
    --output alignment-files-trim \
    --input-format fasta \
    --output-format nexus \
    --cores 12

Don’t forget to strip the locus name information from each alignment:

mkdir alignment-files-clean && cd alignment-files-clean
for i in ../alignment-files-trim/*; do cat $i | sed 's/uce\-[0-9]\+_'//g > $i:t; done