Contents

1 Introduction to dupRadar

RNA-Seq experiments are a common strategy nowadays to quantify the gene expression levels in cells. Due to its higher sensitivity compared to arrays and the ability to discover novel features, makes them the choice for most modern experiments.

In RNA-Seq - as in all other NGS applications - quality control is essential. Current NGS workflows usually involve PCR steps at some point, which involves the danger of PCR artefacts and over-amplification of reads. For common DNA-based assays PCR duplicates are often simply removed before further analysis and their overall fraction or the read multiplicity taken as quality metrics. In RNA-Seq however, apart from the technical PCR duplicates, there is a strong source for biological read duplicates: for a given gene length and sequencing depth there exists an expression level beyond which it is not possible to add place more reads inside a gene locus without placing a second read exactly on the position of an already existing read. For this reason the overall duplication rate is not a useful measure for RNA-Seq.

As in the NGS/RNA-Seq QC ecosystem there were no suitable tools to address this question, we set out to develop a new tool. The dupRadar package gives an insight into the duplication problem by graphically relating the gene expression level and the duplication rate present on it. Thus, failed experiments can be easily identified at a glance.

Note: by now RNA-Seq protocols have matured so that for bulk RNA protocols data rarely suffer from technical duplicates. With newer low-input or single cell RNA-Seq protocols technical duplicates possible problems are worth to be checked for by default, especially if protocols are pushed to or beyond their boundaries. Paired-end libraries make the distinction between duplicates due to highly expressed genes and PCR duplicates easier, but the problem itself is not completely solved, especially at higher sequencing depths.

2 Getting started using dupRadar

2.1 Preparing your data

Previous to running the duplication rate analysis, the BAM file with your mapped reads has to be duplicate marked with a like Picard, or the faster BamUtil dedup BioBamBam. The dupRadar package only works with duplicate marked BAM files.

If you do not have/want a duplication marking step in your default pipeline, the dupRadar package includes, for your convenience, wrappers to properly call some of these tools from your R session. Note that you still have to supply the path of the dupmarker installation though:

library(dupRadar)

Now, simply call the wrapper:

# call the duplicate marker and analyze the reads
bamDuprm <- markDuplicates(dupremover="bamutil",
               bam="test.bam",
               path="/opt/bamUtil-master/bin",
               rminput=FALSE)

Simply specify which tool to use, the path where this tool is installed, and the input bam file to be analyzed. After marking duplicates, it’s safe to remove to original BAM file in order to save space.

The dupRadar package currently comes with support for:

After the BAM file is marked for duplicates, dupRadar is ready to analyze how the duplication rate is related with the estimated gene expression levels.

2.2 A GTF file

Unless there is any specific reason, dupRadar can use the same GTF file that will be later used to count the reads falling on features.

A valid GTF file can be obtained from UCSC, Ensembl, the iGenomes or other projects.

Note that the resulting duplication rate plots depend on the GTF annotation used. GTF files from the gencode projects result in a less clear picture of duplication rates, as there are many more features and feature types annotated, which overlap heavily as well. In some cases creating the plots only using subsets of gencode annotation files (e.g. just protein coding genes) serve the QC purpose of this tool better.

2.3 AnnotationHub as a source of GTF files

The Bioconductor AnnotationHub package provides an alternative approach to obtain annotation in GTF format from entities such as Ensembl, UCSC, ENCODE, and 1000 Genomes projects.

This is partly outlined in the AnnotationHub ‘HOWTO’ vignette section “Ensembl GTF and FASTA files for TxDb gene models and sequence queries”; for the Takifugu example, the downloaded GTF file is available from the cache.

if(suppressWarnings(require(AnnotationHub))) {
    ah = AnnotationHub()
    query(ah, c("ensembl", "80", "Takifugu", "gtf"))    # discovery
    cache(ah["AH47101"])                                # retrieve / file path
}
##                                    AH47101 : 52579 
## "/tmp/RtmpSZoIKj/BiocFileCache/4bdc38049fbc_52579"

3 dupRate demo data

In the package we include two precomputed duplication matrices for two RNASeq experiments used as examples of a good and a failed (in terms of high redundancy of reads) experiments. The experiments come from the ENCODE project, as a source of a wide variety of protocols, library types and sequencing facilities.

Load the example dataset with:

attach(dupRadar_examples)

4 The duplication rate analysis

The analysis requires some info about the sequencing run and library preparation protocol:

Due to its phenomenal performance, internally we use the featureCounts() function from the Bioconductor Rsubread package, which also supports multiple threads.

# The call parameters: 
bamDuprm <- "test_duprm.bam"    # the duplicate marked bam file
gtf <- "genes.gtf"  # the gene model
stranded <- 2       # '0' (unstranded), '1' (stranded) and '2' (reversely stranded)
paired   <- FALSE   # is the library paired end?
threads  <- 4       # number of threads to be used

# Duplication rate analysis
dm <- analyzeDuprates(bamDuprm,gtf,stranded,paired,threads)

The duplication matrix contains read counts in different scenarios and RPK and RPKM values for every gene.

ID geneLength allCountsMulti filteredCountsMulti dupRateMulti dupsPerIdMulti RPKMulti RPKMMulti allCounts filteredCounts dupRate dupsPerId RPK RPKM
Xkr4 3634 2 2 0.0000000 0 0.5503577 0.0117159 2 2 0.0000000 0 0.5503577 0.0117159
Rp1 9747 0 0 NaN 0 0.0000000 0.0000000 0 0 NaN 0 0.0000000 0.0000000
Sox17 3130 0 0 NaN 0 0.0000000 0.0000000 0 0 NaN 0 0.0000000 0.0000000
Mrpl15 4203 429 261 0.3916084 168 102.0699500 2.1728436 419 258 0.3842482 161 99.6906971 2.1221946
Lypla1 2433 1106 576 0.4792043 530 454.5828196 9.6770634 1069 562 0.4742750 507 439.3752569 9.3533280
Tcea1 2847 2890 1046 0.6380623 1844 1015.1036178 21.6093121 1822 696 0.6180022 1126 639.9719002 13.6235871

5 Plotting and interpretation

The number of reads per base assigned to a gene in an ideal RNA-Seq data set is expected to be proportional to the abundance of its transcripts in the sample. For lowly expressed genes we expect read duplication to happen rarely by chance, while for highly expressed genes - depending on the total sequencing depth - we expect read duplication to happen often.

A good way to learn if a dataset is following this trend is by relating the normalized number of counts per gene (RPK, as a quantification of the gene expression) and the fraction represented by duplicated reads.

The dupRadar offers several functions to assess this relationship. The most prominent may be the 2d density scatter plot:

## make a duprate plot (blue cloud)
par(mfrow=c(1,2))
duprateExpDensPlot(DupMat=dm)       # a good looking plot
title("good experiment")
duprateExpDensPlot(DupMat=dm.bad)   # a dataset with duplication problems
title("duplication problems")