Running RAxML-NG

Author:Brant C. Faircloth
Copyright:This documentation is available under a Creative Commons (CC-BY) license.

Modification History

See Running RAxML-NG

Purpose

We use RAxML-NG (formerly RAxML or ExaML) to infer maximum likelihood trees from multiple sequence alignment data. RAxML-NG is now one of the preferred options because it can run across multiple nodes and cores and also take, as input, a number of evolutionary models estimated by a program like modeltest-ng

Preliminary Steps

  1. To compile RAxML-NG, see Compiling RAxML-NG

Data Preparation

  1. RAxML accepts data in two format: PHYLIP and FASTA. Setup directory structure on @supermic to contain these data. Generally speaking, I make a project folder within my work directory (where work is symlinked to /work/brant. So, using Anna’s Diglossa as an example:

    mkdir work/anna-diglossa
    cd anna-diglossa
    mkdir alignments
    
  2. On the transfer machine (@tabasco), navgate to the directory holding the alignment files and transfer the alignments files to @supermic:

    rsync -avLP ./ user@smic.hpc.lsu.edu:/home/brant/work/anna-diglossa/alignments
    
  3. Now that that’s finished, setup a file that will (1) convert the alignment to a binary format, and (2) estimate the number of nodes/cores needed for optimal analysis. I do this in a file named something like raxml-parse.qsub:

    #!/bin/bash
    #PBS -q single
    #PBS -A <allocation>
    #PBS -l walltime=02:00:00
    #PBS -l nodes=1:ppn=1
    #PBS -V
    #PBS -N raxmlng-parse
    #PBS -o raxmlng-parse.out
    #PBS -e raxmlng-parse.err
    
    module load gcc/6.4.0
    module load impi/2018.0.128
    
    cd $PBS_O_WORKDIR
    
    /project/brant/shared/bin/raxml-ng-1.0.1 \
    --msa /path/to/alignment/alignment.phylip \
    --model GTR+G \
    --parse
    

    Note

    RAxML-NG is different from earlier versions of RAxML because you now have the ability to specify many, many different models of sequence evolution. The options for evolution models are on the RAxML-NG website and should be perused. It’s also possible to specify multiple models using a partition file (partition.txt) that looks something like:

    JC+G, p1 = 1-100, 252-400
    HKY+F, p2 = 101-180, 251
    GTR+I, p3 = 181-250
    

    And then creating the binary alignment file with a command similar to:

    /project/brant/shared/bin/raxml-ng-1.0.1 \
    --msa /path/to/alignment/alignment.phylip \
    --model partition.txt \
    --parse
    

    When creating the binary alignment file, you specify the model to use for the given data set. This model will be carried over to all subsequent analyses using this binary alignment file - which is why we don’t specify particular models in the sections below.

  4. This will produce binary alignment files within /path/to/alignment/. These files will have an .rba extension (so the file created here was drop2-mafft-nexus-edge-trimmed-clean-75p.phylip.raxml.rba). To look at other information regarding the aligment (particularly how many nodes/cores to use), open up raxmlng-parse.out with something like less. You should see content that looks like:

    Analysis options:
    run mode: Alignment parsing and compression
    start tree(s):
    random seed: 1558381540
    tip-inner: OFF
    pattern compression: ON
    per-rate scalers: OFF
    site repeats: ON
    branch lengths: proportional (ML estimate, algorithm: NR-FAST)
    SIMD kernels: AVX
    parallelization: PTHREADS (8 threads), thread pinning: OFF
    
    [00:00:00] Reading alignment from file: alignments/drop2-mafft-nexus-edge-trimmed-clean-75p.phylip
    [00:00:00] Loaded alignment with 116 taxa and 2094052 sites
    
    WARNING: Fully undetermined columns found: 31736
    
    NOTE: Reduced alignment (with duplicates and gap-only sites/taxa removed)
    NOTE: was saved to: /ddnB/work/brant/anna-diglossa/alignments/drop2-mafft-nexus-edge-trimmed-clean-75p.phylip.raxml.reduced.phy
    
    Alignment comprises 1 partitions and 720183 patterns
    
    Partition 0: noname
    Model: GTR+FO+G4m
    Alignment sites / patterns: 2062316 / 720183
    Gaps: 25.14 %
    Invariant sites: 88.39 %
    
    
    NOTE: Binary MSA file created: alignments/drop2-mafft-nexus-edge-trimmed-clean-75p.phylip.raxml.rba
    
    * Estimated memory requirements                : 20220 MB
    
    * Recommended number of threads / MPI processes: 96
    
    Please note that numbers given above are rough estimates only.
    Actual memory consumption and parallel performance on your system may differ!
    

Inferring the Best ML Tree (with bootstrapping)

After creating the binary alignment file and getting an idea of the number of MPI processes that are needed, you need to infer the tree, ideally with some support values.

You have several ways of doing this, one of which is to use what I call “standard” MPI mode, which just gives RAxML a number of CPUs to spread the data across, and all the CPUs talk to each other over the interconnects using MPI.

Warning

Still testing.

The other way of setting up the run is to use what’s known as “hybrid” mode, which combines parallelization across HPC nodes (using MPI) with parallelization within nodes (using Pthreads).

Note

In all of the following, you can prefix the name of your output files by adding the argument --prefix <some name>. And, when generating consensus trees, you can root those on some outgroup using --outgroup taxon1,taxon2,taxon3,..., taxonQQQ.

Standard MPI Mode

Using Standard MPI Mode To Search for the Best ML Tree + Bootstrapping

Given the core count and RAM usage estimated above, on @supermic, we need to run 6 nodes each with 16 CPUS for a total of 96 CPUs. We will also set this run up to automatically search for both the best ML tree and bootstrap replicates for this best ML tree. That’s accomplished with the --all option. The other option we are passing is the --best-trees autoMRE option, which will generate bootstrap trees until those converge. If you need to set the maximum number of boostrap replicated to generate using autoMRE, specify --bs-trees autoMRE{500}, which will limit the analyses to only 500 trees (default is 1000). The --seed that we’re setting (which we pass as an environment variable $SEED whose value it taken from $RANDOM) let’s us repeat the exact analysis in the future, if needed.

With that information in hand, setup a second submission script raxml-best-tree.qsub that contains a version of the following:

#!/bin/bash
#PBS -q checkpt
#PBS -A <your_allocation>
#PBS -l walltime=72:00:00
#PBS -l nodes=6:ppn=16
#PBS -V
#PBS -N raxmlng-std-mpi
#PBS -o raxmlng-std-mpi.out
#PBS -e raxmlng-std-mpi.err

module load gcc/6.4.0
module load impi/2018.0.128

cd $PBS_O_WORKDIR
SEED=$RANDOM
echo $SEED

mpiexec -np 96 -machinefile $PBS_NODEFILE /project/brant/shared/bin/raxml-ng-mpi-1.0.1 \
    --msa alignments/drop2-mafft-nexus-edge-trimmed-clean-75p.phylip.raxml.rba \
    --seed $SEED \
    --all \
    --bs-trees autoMRE

Using Standard MPI Mode To Search for the Best ML Tree

Sometimes, the tree you are trying to infer is large (due to the # of tips, the amount of data, or both), and you want to separate the inference of the best ML tree from the generation of bootstrap replicates. To infer only the best ML tree, use something like the following. The --search option tells RAxML to do the tree search and --seed is described above.

#!/bin/bash
#PBS -q checkpt
#PBS -A <your_allocation>
#PBS -l walltime=72:00:00
#PBS -l nodes=6:ppn=16
#PBS -V
#PBS -N raxmlng-std-mpi
#PBS -o raxmlng-std-mpi.out
#PBS -e raxmlng-std-mpi.err

module load gcc/6.4.0
module load impi/2018.0.128

cd $PBS_O_WORKDIR
SEED=$RANDOM
echo $SEED

mpiexec -np 96 -machinefile $PBS_NODEFILE /project/brant/shared/bin/raxml-ng-mpi-1.0.1 \
    --msa alignments/drop2-mafft-nexus-edge-trimmed-clean-75p.phylip.raxml.rba \
    --seed $SEED \
    --search

Using Standard MPI Mode To Bootstrap

Along similar lines, if you’ve separated how RAxML runs into two parts, you would run the boostrapping for a particular set of data using the following. The --seed argument is described above, the --bootstrap argument tells RAxML to do bootstrapping, the the --bs-trees argument is described above:

#!/bin/bash
#PBS -q checkpt
#PBS -A <your_allocation>
#PBS -l walltime=72:00:00
#PBS -l nodes=6:ppn=16
#PBS -V
#PBS -N raxmlng-std-mpi
#PBS -o raxmlng-std-mpi.out
#PBS -e raxmlng-std-mpi.err

module load gcc/6.4.0
module load impi/2018.0.128

cd $PBS_O_WORKDIR
SEED=$RANDOM
echo $SEED

mpiexec -np 96 -machinefile $PBS_NODEFILE /project/brant/shared/bin/raxml-ng-mpi-1.0.1 \
    --msa alignments/drop2-mafft-nexus-edge-trimmed-clean-75p.phylip.raxml.rba \
    --seed $SEED \
    --bootstrap \
    --bs-trees autoMRE

Integrating the Best ML Tree with the Bootstraps

And, if you have separate files for the best ML tree and the boostrap replicates, you can integrate those using (Note that I’m using the single queue here with a very short walltime because this runs quickly).

#!/bin/bash
#PBS -q single
#PBS -A <your_allocation>
#PBS -l walltime=00:10:00
#PBS -l nodes=1:ppn=1
#PBS -V
#PBS -N raxmlng-std-mpi
#PBS -o raxmlng-std-mpi.out
#PBS -e raxmlng-std-mpi.err

module load gcc/6.4.0
module load impi/2018.0.128

cd $PBS_O_WORKDIR

/project/brant/shared/bin/raxml-ng-1.0.1 \
--tree /path/to/bestML.tree \
--bs-trees /path/to/bootstraps.tree \
--support

Post-hoc Tree Evaluation

Sometimes after the best ML tree search, you will see ML trees inferred with different likelihood values. It may be important to evaluate the differences among the best ML trees. It may also be important for you to do things like compute concordance factors when comparing results from concatenated trees to something like gene tree topologies.

Examining Likelihoods and RF Distance

You can easily compute the RF distance among the best ML trees inferred using RAxML