Author: Brant C. Faircloth
Copyright: This documentation is available under a Creative Commons (CC-BY) license.
Cactus is a program for aligning genomes together (i.e., genome-genome alignment). More details are available from the Cactus GitHub page. Cactus requires heterogeneous nodes for the different types of computations it runs, and we’ve found that this can sometimes be hard to gin up when working with typical university HPC systems. AWS comes to the rescue in this case - you can set up and pay for the computation that you need on whatever types of nodes you need to join together to make your compute cluster. What follows are instructions on how we do this (built from the current Cactus AWS guide [see the wiki]).
- Do what you need to do to create an account for AWS. We have a somewhat complicated setup, but you basically need an account, and you need to create an IAM user that has permission to run EC2 instances. For that IAM user, you also need their ACCESS_KEYS.
- For the IAM user, go to
IAM > Users (side tab) > Security Credentials. Create an access key and be sure to copy the values of
AWS_SECRET_ACCESS_KEY. You’ll need these later.
- Cactus is built on top of a CoreOS image. Before running any analyses, you’ll need to “subscribe” to use the Container Linux by CoreOS AMI. You will encounter errors if this is not done. You can do this by following this link, logging into your AWS account, and clicking “Continue to Subscribe”: https://aws.amazon.com/marketplace/pp/B01H62FDJM/.
- Finally, it’s very likely you will need to increase your service limits on AWS. In particular, you’ll probably need to request an increase to the maximum number of “Spot” c4.8xlarge instances you can request (default is 20), and you’ll probably also need to request an increase to the maximum number of “On Demand” r3.8xlarge instances you can run (default is 1). You start this process by going to the EC2 console and clicking on “Limits” in the left-hand column.
On whatever local machine you are using (e.g. laptop, desktop, etc.), you need to create an SSH keypair that we’ll use to connect to the machine running the show on AWS EC2. We’ll create a keypair with a specific name that lets us know we use it for AWS:
    # create the key
    ssh-keygen -t rsa -b 4096

    # enter an appropriate name/path
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/me/.ssh/id_rsa): /home/me/.ssh/id_aws
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /home/me/.ssh/id_aws.
    Your public key has been saved in /home/me/.ssh/id_aws.pub.
    The key fingerprint is:
    SHA256:XXXXXXXX [email protected]
Once that’s done, we need to make the pubkey (
*.pub) an “authorized key” on our local machine, enable
ssh-agent to automatically remember the key for us, and set some permissions on our files so everything is happy:
    # create an authorized_keys file (if you don't have one)
    touch ~/.ssh/authorized_keys
    # set correct permissions on that
    chmod 0600 ~/.ssh/authorized_keys
    # put the contents of our id_aws key in authorized_keys
    cat ~/.ssh/id_aws.pub >> ~/.ssh/authorized_keys
    # set the correct permissions on the private key
    chmod 400 ~/.ssh/id_aws
    # add the key to ssh-agent so we don't have to enter our password
    # all the time
    eval `ssh-agent -s`
    ssh-add /home/me/.ssh/id_aws
Return to AWS via the web interface. Go to
EC2 > Key Pairs (side panel) > Import Key Pair (top of page). Paste in the contents of your
id_aws.pub into the box and give the key a name (I also call this
Now, we need to install all the software needed to run Cactus. We’re going to do that in a conda environment, because we use conda all the time and it’s pretty easy to create new/test environments. FYI, this differs a little from the instructions on the cactus website. Go ahead and set up the environment and install some needed stuff:
    # make the conda environment, installing awscli and python 2
    conda create -n cactus python=2 awscli
    # activate the environment
    conda activate cactus
    # install toil
    pip install --upgrade "toil[aws]"
    # install cactus. to do that navigate to the tmp directory in our conda install
    cd ~/conda/envs/cactus/tmp
    git clone https://github.com/comparativegenomicstoolkit/cactus.git
    cd cactus
    pip install --upgrade .
Finally, we need to place our AWS credentials in two places. Ensure you are in the
cactus environment just created.
Run the AWS configuration utility and follow the instructions and enter the
AWS_SECRET_ACCESS_KEY when prompted. Also enter the region in which you want to run your EC2 instances:
Cactus uses toil which uses boto. Per the toil recommendations, add your
~/.boto.conf so that its contents look like (paste in your values for
AWS_SECRET_ACCESS_KEY and not what’s below):
    [Credentials]
    aws_access_key_id = ****************XXX
    aws_secret_access_key = ****************YYY
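If you would rather not paste secrets into an editor, the same file can be written from the shell. The following is only a sketch: `write_boto_conf` is a hypothetical helper name, and it assumes the two key values are already available in the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables. The target path is passed as an argument so you can point it at whatever file you actually use.

```shell
# Hypothetical helper: write a boto-style credentials file.
# $1 = target path (this guide uses ~/.boto.conf). The key values are read
# from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
write_boto_conf() {
    target=$1
    if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
        echo "set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY first" >&2
        return 1
    fi
    # create (or truncate) the file, then make sure only the owner can read it
    ( umask 077; : > "$target" ) || return 1
    chmod 600 "$target" || return 1
    # write the [Credentials] section in the layout shown above
    printf '[Credentials]\naws_access_key_id = %s\naws_secret_access_key = %s\n' \
        "$AWS_ACCESS_KEY_ID" "$AWS_SECRET_ACCESS_KEY" > "$target"
}
```

A call such as `write_boto_conf ~/.boto.conf` overwrites the file, so back up any existing copy first.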
We should basically be good to go now. Go ahead and launch what’s known as the “leader” instance. Be sure to adjust your availability zone to whatever you want to use:
toil launch-cluster -z us-east-1a faircloth-test --keyPairName id_aws --leaderNodeType t2.medium
You need to think about which region to use - in my case, I learned that us-east-2 will NOT work because the region needs to have SimpleDB available. Here, we’re simply using us-east-1 because it has everything.
This will spin up a t2.medium node, which is relatively small, and we’ll start working on AWS through this node. It can take some time, and you might want to monitor progress using the web interface to EC2.
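The launch command takes an availability zone (`us-east-1a`), while other commands in this guide take a region (`us-east-1`). For standard zone names the region is the zone minus its trailing letter, so one way to keep the two in step is to derive one from the other. This is a small sketch only: `zone_to_region` is a hypothetical helper, and it assumes a standard zone name (Local Zone names follow a different pattern and are not handled).

```shell
# Hypothetical helper: derive the region from a standard availability zone
# name by dropping its single trailing letter (us-east-1a -> us-east-1).
zone_to_region() {
    printf '%s\n' "${1%[a-z]}"
}

zone=us-east-1a
region=$(zone_to_region "$zone")
echo "cluster zone: $zone, region: $region"
```

The `zone` and `region` variables could then be passed to the `-z` and `--region` options used elsewhere in this guide.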
While the instance is starting and validating, we need to sync our data for analysis. In my opinion, it’s easiest to do this using S3. Additionally, cactus can read
s3:// URLs. So, put the fastas you want to sync (easiest if unzipped) in a directory on your local machine. Then create an S3 bucket to hold those:
aws s3api create-bucket --bucket faircloth-lab-cactus-bucket --region us-east-1
You may want to put your genomes in an S3 bucket in the same region - this will make things faster. As above, we’re using
Now, sync up the files from your local machine to S3. This may take a little while, but on your local machine, run:
aws s3 sync . s3://faircloth-lab-cactus-bucket/
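Later in this guide, each genome is referenced by name and `s3://` URL. Since the local directory mirrors the bucket after the sync, those "name URL" lines can be generated from the local file names. The sketch below assumes your FASTA files end in `.fna` (as the examples in this guide do) and that the file name without the extension is the name you want to use for the genome. `seqfile_path_lines` is a hypothetical helper name.

```shell
# Hypothetical helper: print one "name s3://bucket/file" line per FASTA file.
# $1 = local directory that was synced, $2 = bucket name (without s3://).
# Assumes files end in .fna and that the basename is the genome name.
seqfile_path_lines() {
    dir=$1
    bucket=$2
    for fasta in "$dir"/*.fna; do
        [ -e "$fasta" ] || continue     # the glob matched nothing
        file=${fasta##*/}               # strip the directory part
        printf '%s s3://%s/%s\n' "${file%.fna}" "$bucket" "$file"
    done
}
```

For example, `seqfile_path_lines . faircloth-lab-cactus-bucket` prints lines you can paste below the tree when you build the sequence file.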
Once our data are uploaded and the instance is spun up, we can log into the instance on EC2:
toil ssh-cluster -z us-east-1a faircloth-test
We need to install cactus and whatnot on the “leader” image:
    # update the packages in the package mgr
    apt update
    apt install -y git tmux vim
    # create a directory to hold our analysis and move into it
    mkdir /opt/analysis
    cd /opt/analysis
    # create a `cactus-env` virtual env in this folder
    virtualenv --system-site-packages cactus-env
    # activate that virtual env
    source cactus-env/bin/activate
    # get the cactus source from github
    git clone https://github.com/comparativegenomicstoolkit/cactus.git
    # install that in the cactus-env virtual env
    cd cactus
    pip install --upgrade .
    # change back to our base analysis directory
    cd /opt/analysis
Now, create a new file in
vim, and paste the required information into it. Be sure to adjust for your particular problem - this example uses the five genomes above and their
    # Sequence data for progressive alignment of 5 genomes
    # all are good assemblies
    (((Anolis_sagrei:0.314740,Salvator_merianae:0.192470):0.122998,(Gallus_gallus:0.166480,Taeniopygia_guttata:0.116981):0.056105):0.133624,Alligator_mississippiensis:0.133624):0.0;
    Anolis_sagrei s3://faircloth-lab-cactus-bucket/Anolis_sagrei.fna
    Salvator_merianae s3://faircloth-lab-cactus-bucket/Salvator_merianae.fna
    Gallus_gallus s3://faircloth-lab-cactus-bucket/Gallus_gallus.fna
    Taeniopygia_guttata s3://faircloth-lab-cactus-bucket/Taeniopygia_guttata.fna
    Alligator_mississippiensis s3://faircloth-lab-cactus-bucket/Alligator_mississippiensis.fna
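A typo in this file is an expensive way to lose a cluster run, so a quick sanity check before launching can help. The sketch below is not part of Cactus: `check_seqfile` is a hypothetical helper that assumes the layout shown above, meaning optional `#` comment lines, one tree line ending in `;`, then one "name path" pair per line. The name check is a plain substring match against the tree, so it catches missing names but not every possible mismatch.

```shell
# Hypothetical helper: rough sanity check of a sequence file laid out as above.
# Exits non-zero if the tree line is missing or unterminated, if a genome
# line does not have exactly two fields, or if a name is absent from the tree.
check_seqfile() {
    awk '
        /^[ \t]*(#|$)/ { next }     # skip comment and blank lines
        tree == "" {                # first remaining line is the tree
            tree = $0
            if (tree !~ /;[ \t]*$/) {
                print "tree line must end with ;" > "/dev/stderr"
                bad = 1
            }
            next
        }
        NF != 2 {
            printf "line %d: expected \"name path\"\n", NR > "/dev/stderr"
            bad = 1
            next
        }
        index(tree, $1) == 0 {
            printf "line %d: %s is not in the tree\n", NR, $1 > "/dev/stderr"
            bad = 1
        }
        END {
            if (tree == "") bad = 1
            exit (bad ? 1 : 0)
        }
    ' "$1"
}
```

Running `check_seqfile seqFile.txt && echo ok` before starting the run gives a cheap first check; it does not confirm that the S3 objects actually exist.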
The cluster will automatically scale up and down, but you'll want to set a maximum number of nodes so the scaler doesn't get overly aggressive and waste money, or go over your AWS limits. We typically use c4.8xlarge on the spot market for most jobs, and r4.8xlarge on-demand for database jobs. Here are some very rough estimates of what we typically use for the maximum of each type (round up):
- N mammal-size genomes (~2-4Gb): (N / 2) * 20 c4.8xlarge on the spot market, (N / 2) r3.8xlarge on-demand
- N bird-size genomes (~1-2Gb): (N / 2) * 10 c4.8xlarge on the spot market, (N / 4) r3.8xlarge on-demand
- N nematode-size genomes (~100-300Mb): (N / 2) c4.8xlarge on the spot market, (N / 10) r3.8xlarge on-demand
- For anything less than 100Mb, the computational requirements are so small that you may be better off running it on a single machine than using an autoscaling cluster.
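The rules of thumb above are simple enough to script. The following sketch turns them into a "spot,on-demand" pair, rounding up as suggested. `ceil_div` and `estimate_max_nodes` are hypothetical helper names, and the size-class labels (`mammal`, `bird`, `nematode`) are just shorthand for the three bullets; treat the output as a starting point, not a guarantee.

```shell
# Integer division rounded up: ceil_div 5 4 -> 2
ceil_div() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Hypothetical helper: print "max_spot,max_ondemand" for N genomes of a
# given size class, following the rough estimates listed above.
estimate_max_nodes() {
    size_class=$1
    n=$2
    case "$size_class" in
        mammal)   spot=$(ceil_div $(( n * 20 )) 2); ondemand=$(ceil_div "$n" 2) ;;
        bird)     spot=$(ceil_div $(( n * 10 )) 2); ondemand=$(ceil_div "$n" 4) ;;
        nematode) spot=$(ceil_div "$n" 2);          ondemand=$(ceil_div "$n" 10) ;;
        *) echo "unknown size class: $size_class" >&2; return 1 ;;
    esac
    echo "$spot,$ondemand"
}

estimate_max_nodes bird 5    # prints 25,2
```

The pair is printed spot-first so it lines up with a node-type list that names the spot instance type first. Check the numbers against your AWS service limits before using them.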
Once the file is created, you are ready to spin up the cactus run:
    # start tmux
    tmux
    # make sure we're in the right place
    cd /opt/analysis
    # start run
    cactus \
        --nodeTypes c4.8xlarge:0.6,r3.8xlarge \
        --minNodes 0,0 \
        --maxNodes 20,1 \
        --provisioner aws \
        --batchSystem mesos \
        --metrics \
        aws:us-east-1:faircloth-10-25-test \
        seqFile.txt output.hal