Polishing Assemblies with Pilon¶
Author: | Brant C. Faircloth |
---|---|
Copyright: | This documentation is available under a Creative Commons (CC-BY) license. |
Modification History¶
Purpose¶
PacBio assemblies are great because they produce contigs with high contiguity. However, the coverage of PacBio reads is something less than we desire, and PacBio reads are more prone error than something like Illumina reads. We can use the Illumina reads to “polish” contigs produce from the “noisier” PacBio chemistry. We can also use 10X genomics reads, if we have them, do to the same - because those are simply large insert Illumina reads.
Note
This protocol assumes you are using 10X reads.
Warning
The following assumes you are running on @qb2, which you should probably be doing because we need the bigmem
queue for vertebrate-sized genomes.
Steps¶
Note
You may have already performed some of these steps such as processing 10X reads to remove the barcode information or mapping the 10X reads to the assembly you’d like to polish. Simply skip those steps if you have already performed them.
If you are using 10X reads, you need to process those to remove the internal barcodes and adapters. That’s most easily accomplished by installing and using longranger. That’s a pretty easy process - you just need to go to the longranger website, grab the link and download that file to a reasonable location (in our case the shared directory
/home/brant/project/shared/bin/
, where it is already installed).wget -O longranger-2.2.2.tar.gz "<long link from website>" tar -xzvf longranger-2.2.2.tar.gz
Setup a working directory:
mkdir pacbio-polish && cd $_
Now, we’re going to run longranger to do some basic processing of the 10X linked read data. Go ahead and make a directory within 10x-diglossa-scaffold to hold the data:
mkdir longranger-ouput && cd $_
Prepare a
qsub
script to run longranger and process the reads. This should take <24 hours for ~40 GB zipped sequence data. The processing basically trims the reads to remove the barcode and adapter information and puts the barcode info in the fastq header:#!/bin/bash #PBS -q checkpt #PBS -A <allocation> #PBS -l walltime=24:00:00 #PBS -l nodes=1:ppn=20 #PBS -V #PBS -N longranger_basic #PBS -o longranger_basic.out #PBS -e longranger_basic.err #PBS -m abe #PBS -M brant@faircloth-lab.org export PATH=/home/brant/project/shared/bin/longranger-2.2.2/:$PATH cd $PBS_O_WORKDIR longranger basic \ --id=<my_name> \ --fastqs=/path/to/my/demuxed/fastq/files \ --localcores 20 1>longranger-basic.stdout 2>longranger-basic.stderr
Once the reads have been processed, we want to map them to our genome assembly using
bwa-mem
andsamtools
. We can get all those installed (along with Pilon) by creating a conda environment:conda create -n polishing pilon bwa samtools
Now, we need to map the reads over. The easiest thing to do is probably to create a new directory in
pacbio-polished
namedbwa-aligned
mkdir bwa-aligned && cd $_
Now, symlink in the *.fastq.gz file that we just created:
ln -s ../longranger-ouput/<my_name>/outs/barcoded.fastq.gz
Upload the assembly to this directory, using a tool like
rsync
. Here, we’ve uploadeddiglossa.contigs.fa
to the same directory (bwa-aligned
) that contains our symlink tobarcoded.fastq.gz
Once that’s uploaded, create a
qsub
script to run thebwa
mapping job:#!/bin/bash #PBS -q checkpt #PBS -A <allocation> #PBS -l walltime=36:00:00 #PBS -l nodes=1:ppn=20 #PBS -V #PBS -N bwa_mem #PBS -o bwa_mem.out #PBS -e bwa_mem.err source activate polishing cd $PBS_O_WORKDIR # index the assembly for bwa bwa index diglossa.contigs.fa # run bwa, use 20 threads for aligning and sorting and set memory for samtools at 3G per thread bwa mem -t 20 diglossa.contigs.fa barcoded.fq.gz | samtools sort -@20 -m 3G -o diglossa.contigs.barcoded.bam - samtools index diglossa.contigs.barcoded.bam
Submit that job and let it run. It will take a fair amount of time (~18 hours to map something like 40 GB data)
Once the job has run, you should have a directory
bwa-aligned
that contains the output bam file, the reads, and the assembly. Moving forward, we only need to care about the BAM file and the assembly.Running Pilon is pretty simple - it just needs to use a lot of RAM (why we need to run it @qb2). We need to setup an appropriate
qsub
script for the run:#!/bin/bash #PBS -q bigmem #PBS -A <allocation> #PBS -l walltime=72:00:00 #PBS -l nodes=1:ppn=48 #PBS -V #PBS -N pilon #PBS -o pilon.out #PBS -e pilon.err source activate polishing cd $PBS_O_WORKDIR # run pilon pilon -Xmx1400G --genome diglossa.contigs.fa \ --bam diglossa.contigs.barcoded.bam \ --changes --vcf --diploid --threads 48 \ --output diglossa.contigs.polished