Running Pargenes¶

Author:	Brant C. Faircloth
Copyright:	This documentation is available under a Creative Commons (CC-BY) license.

Modification History¶

See Running Pargenes

Purpose¶

Pargenes is a pipeline to generate gene trees from a large set of loci, using the most appropriate site rate substitution model.

Preliminary Steps¶

To compile Pargenes, see Compiling Pargenes

Steps¶

Before running Pargenes, you need to prepare your data by following several steps. The easiest thing to do is to take the directory of loci that you wish to analyze (say, from a 75% matrix… or all loci [then subset]), and convert those loci to FASTA format:
phyluce_align_convert_one_align_to_another \
    --alignments input-alignments \
    --output input-alignments-fasta \
    --cores 12 \
    --log-path ./ \
    --input-format nexus \
    --output-format fasta
After converting the loci to FASTA format, you probably want to reduce them, if needed, so that identical sequences for different taxa are removed. This requires a recent version of phyluce (which is not yet publicly available). Reduce the FASTA alignments by running:
phyluce_align_reduce_alignments_with_raxml \
    --alignments input-alignments-fasta \
    --output input-alignments-fasta-reduced \
    --input-format fasta \
    --cores 12
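Before moving on, it can be worth confirming that the reduction step wrote one output file per input locus. The following is a minimal sketch and is not part of phyluce: the helper names `count_files` and `same_file_count` are hypothetical, and the check assumes each directory holds only alignment files at its top level. A mismatch is not necessarily an error, but it is worth investigating before you upload anything.

```shell
# Hypothetical helper: count the regular files directly inside a directory.
count_files() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Hypothetical helper: report both counts and succeed only if they match.
same_file_count() {
    count_a=$(count_files "$1")
    count_b=$(count_files "$2")
    echo "$1: $count_a files; $2: $count_b files"
    [ "$count_a" -eq "$count_b" ]
}

# Example usage (directory names from the steps above):
# same_file_count input-alignments-fasta input-alignments-fasta-reduced
```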
After reducing your loci, upload them to the HPC cluster. Before uploading, it’s probably best to package them up as a .tar.gz archive and unpack them on Supermike/Supermic:
tar -czf input-alignments-fasta-reduced.tar.gz input-alignments-fasta-reduced
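Because a truncated transfer can leave a corrupt archive on the cluster, you may want to test the tarball before uploading and again after the transfer completes. This is a minimal sketch using only standard `tar` options; the helper name `check_tarball` is hypothetical.

```shell
# Hypothetical helper: confirm a .tar.gz archive is readable and
# report how many entries (files and directories) it contains.
check_tarball() {
    if tar -tzf "$1" > /dev/null 2>&1; then
        echo "$1: OK, $(tar -tzf "$1" | wc -l | tr -d ' ') entries"
    else
        echo "$1: archive is unreadable or truncated" >&2
        return 1
    fi
}

# Example usage, locally and then again on the cluster:
# check_tarball input-alignments-fasta-reduced.tar.gz
# Once the archive checks out on the cluster, unpack it:
# tar -xzf input-alignments-fasta-reduced.tar.gz
```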

After uploading to Supermike/Supermic using something like rsync, in your working directory, create a job submission script that looks like the following (be sure to use your <allocation>). This will run a “test-run” of pargenes and estimate the number of cores that we should use for optimal run-times:

#PBS -A <allocation>
#PBS -l nodes=1:ppn=16
#PBS -l walltime=2:00:00
#PBS -q checkpt
#PBS -N pargenes_dry_run

module purge
module load intel/18.0.0
module load gcc/6.4.0
module load impi/2018.0.128

cd $PBS_O_WORKDIR
CORES=16

python /project/brant/shared/src/pargenes-v1.1.2/pargenes/pargenes.py \
    -a input-alignments-fasta-reduced \
    -o input-alignments-fasta-reduced-pargenes-dry-run \
    -d nt \
    -m \
    -c $CORES \
    --dry-run

This will produce an output folder (input-alignments-fasta-reduced-pargenes-dry-run). In that folder is a log file that will contain an estimate of the number of cores needed to run the job optimally. Remember that number. Based on that number, set up a new qsub file for the “real” run of Pargenes in which you adjust nodes=XX:ppn=YY and CORES. That will look something like the following, which will use 512 CPU cores to: (1) estimate the best site-rate substitution model for each locus, (2) estimate the best ML gene tree for each locus based on the most appropriate model, and (3) generate 200 bootstrap replicates for each locus:
#PBS -A <allocation>
#PBS -l nodes=32:ppn=16
#PBS -l walltime=12:00:00
#PBS -q checkpt
#PBS -N pargenes

module purge
module load intel/18.0.0
module load gcc/6.4.0
module load impi/2018.0.128

cd $PBS_O_WORKDIR
CORES=512

python /project/brant/shared/src/pargenes-v1.1.2/pargenes/pargenes-hpc.py \
    -a input-alignments-fasta-reduced \
    -o input-alignments-fasta-reduced-pargenes-bootstraps \
    -d nt \
    -m \
    -c $CORES \
    --bs-trees 200
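The nodes=XX value in the script above follows from dividing the core count by the cores per node (ppn) and rounding up. As a minimal sketch of that arithmetic (the helper name `nodes_for_cores` is hypothetical, and you supply the core estimate yourself after reading it from the dry-run log):

```shell
# Hypothetical helper: number of nodes needed for a given core count,
# rounding up when the core count is not a multiple of ppn.
nodes_for_cores() {
    cores=$1
    ppn=$2
    echo $(( (cores + ppn - 1) / ppn ))
}

# Example usage: 512 cores on 16-core nodes
# nodes_for_cores 512 16    # prints 32
```

With 16 cores per node, an estimate of 512 cores gives nodes=32:ppn=16, matching the script above.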

Before downloading, you probably want to zip everything up, which you can do by creating a packaging qsub script like:

#PBS -A <allocation>
#PBS -l nodes=1:ppn=16
#PBS -l walltime=6:00:00
#PBS -q checkpt
#PBS -N pargenes_zip

cd $PBS_O_WORKDIR

tar -czf input-alignments-fasta-reduced-pargenes-bootstraps.tar.gz input-alignments-fasta-reduced-pargenes-bootstraps