Demultiplexing a Sequencing Run

Modification History

See Demultiplexing a Sequencing Run


Often, we combine MANY libraries together in a single sequencing run (moreseo even now that NovaSeqs are online). Once sequenced, the data generally need to be demultiplexed by their index sequences into something approximating the sample names that you want to associate the sequence data with. You can generally do this one of two ways: (1) directly demultiplex to named files using bcl2fastq from Illumina or (2) You can receive Undetermined files from the sequencing center, and demultiplex those based on the index calls in the header of the sequence, for example:

@A00484:41:H3G5VDRXX:1:1101:1018:1000 1:N:0:GGCGTTAT+CCTATTGG
                                            ^^^^ indexes ^^^^


  1. Download sequence data from provider. They will usually tell you how to do this - either with wget or with sftp

  2. If sftp and using a private certificate, you need to get the certificate info into a file (e.g. my.key), then:

    chmod 0600 my.key
    sftp -i /path/to/my.key [email protected]
  3. Things will take a long time to download.

  4. Once downloaded, be sure to get/check md5sums of files against what provider gives you (often, these .md5 files are part of the download). These help you make sure that the downloads were not corrupted while downloading (which can happen with big files).

    for i in *.md5; do md5sum -c $i; done
  5. You may want to count the read numbers in the file. You can check this number (times two) against the output below to make sure you’re apples-to-apples on the overall count of reads. This can also take a long time.

    gunzip -c Undetermined_S0_L001_R1_001.fastq.gz | wc -l | awk '{print $1/4}'
  1. Get list of barcodes from the users who shared your run. To make your life easy, they should provide a spreadsheet that’s setup correctly, w/ both i5 and i7 names and sequences, as well as the forward and reverse complement of the i7 sequences.

  2. You’re going to need to create combinations of barcode from the list provided by the users sharing a run. Before you do this, it’s often easiest to peek inside the R1 (or R2) file from the run (typically named Undetermined_S0_L001_R1_001.fastq.gz) to have a look at the index sequences reported and to make sure which of the forward or reverse complement of i7 you need to use for a given platform. To peek inside, use:

    gunzip -c Undetermined_S0_L001_R1_001.fastq.gz | less
  3. Then, enter “search mode” in less by typing /. I then generally search for an i5 index that will be prevalent in the run (if everything is equimolar, just pick one), then look to see what the i7 sequence of the corresponding index is. I compare that to my spreadsheet of forward and reverse indices for the i7 position, and then I know what I need to do across all the i7 indexes.

  4. Once you’re happy with that, create a file of barcodes, e.g. my_barcodes.txt where you have the indexes in the correct order, here reverse_comp(i7)+i5:

  5. You can use that file and from BBMap (here v38.22) to demultiplex paired files of Unknown reads into resulting files that will be labelled with their respective indexes:

    ~/src/BBMap_38.22/ \
        prefixmode=f \
        in=../Undetermined_S0_L001_R1_001.fastq.gz \
        in2=../Undetermined_S0_L001_R2_001.fastq.gz \
        out=%_R1_001.fastq.gz \
        out2=%_R2_001.fastq.gz \
        outu=Undetermined_R1_001.fastq.gz \
        outu2=Undetermined_R2_001.fastq.gz \
  6. For a NovaSeq S1 run, this took about 2.3 hours and produced, as output:

    Input is being processed as paired
    Time:               8250.056 seconds.
    Reads Processed:    2196226240          266.21k reads/sec
    Bases Processed:    331630162240        40.20m bases/sec
    Reads Out:    4080560900
    Bases Out:    616164695900
  7. Once you have these files, you are almost there. The easiest thing to do to get all the files renamed is to create another, tab-delimited list thas has the index combinations in column 1 and the names you want for the file in column 2. Name this file temp-names.txt. This file looks something like:

    TTACCGAG+TCGTCTGA       Molothrus_ater_LSUMZ441_NT
    TTACCGAG+CATGTGTG       Molothrus_ater_Tulane2906_NT
    TTACCGAG+TCTAGTCC       Molothrus_ater_LSUMZ2237_NT
    TTACCGAG+AAGGCTCT       Molothrus_ater_LSUMZ2241_NT
    TTACCGAG+AACCAGAG       Molothrus_ater_LSUMZ5504_NT
    TTACCGAG+ACTATCGC       Molothrus_ater_LSUMZ13890_NT
  8. Now you can rename the files in your set by running:

    while IFS=$'\t' read -r column1 column2; do
        mv ${column1}_R1_001.fastq.gz ${column2}_${column1}_R1_001.fastq.gz;
        mv ${column1}_R2_001.fastq.gz ${column2}_${column1}_R2_001.fastq.gz;
    done < "temp-names.txt"
  9. This keeps the original tag name in the name of the raw data, but prepends the name you want from the 2nd column of the tab-delimited file, above. The files names now look something like: