
1 Introduction

The package ngsReports is designed to bolt into data processing pipelines and produce combined plots for multiple FastQC reports generated across an entire set of libraries or samples. The primary functionality of the package is parsing FastQC reports, with import methods also implemented for log files produced by tools such as STAR, Hisat2 and others. In addition to parsing files, default plotting methods are implemented. Plots applied to a single file will replicate the default plots from FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), whilst methods applied to multiple FastQC reports summarise these and produce a series of custom plots.

Plots are produced as standard ggplot2 objects, with an interactive option available using plotly. As well as custom summary plots, tables of read counts and the like can also be easily generated.

2 Basic Usage

2.1 Using the Shiny App

In addition to the usage demonstrated below, a shiny app has been developed for interactive viewing of FastQC reports. This can be installed using:

remotes::install_github("UofABioinformaticsHub/fastqcRshiny")

A vignette for this app will be installed with the fastqcRshiny package.

2.2 The default report

In its simplest form, a default summary report can be generated by specifying a directory containing the output from FastQC and calling the function writeHtmlReport().

library(ngsReports)
fileDir <- file.path("path", "to", "your", "FastQC", "Reports")
writeHtmlReport(fileDir)

This function will transfer the default template to the provided directory and produce a single .html file containing interactive summary plots of any FastQC output found in the directory. FastQC output can be *fastqc.zip files or the same files extracted as individual directories.

The default template is provided as ngsReports_Fastqc.Rmd in the package directory. This template can be easily modified, and the modified file can then be supplied to the above function as an alternate RMarkdown template.

altTemplate <- file.path("path", "to", "your", "new", "template.Rmd")
writeHtmlReport(fileDir, template = altTemplate)
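One way to obtain a starting point for a modified template is to copy the default one out of the installed package. The sketch below searches the package directory for ngsReports_Fastqc.Rmd rather than assuming which sub-directory holds it. The destination file name myTemplate.Rmd is hypothetical, and a temporary directory is used purely for illustration.

```r
# Locate the default template inside the installed package
pkgDir <- system.file(package = "ngsReports")
defaultTemplate <- list.files(
    pkgDir, pattern = "ngsReports_Fastqc\\.Rmd$",
    recursive = TRUE, full.names = TRUE
)

# Copy it to an editable location (hypothetical file name)
altTemplate <- file.path(tempdir(), "myTemplate.Rmd")
if (length(defaultTemplate) > 0) {
    file.copy(defaultTemplate[1], altTemplate, overwrite = TRUE)
}

# After editing altTemplate, pass it to writeHtmlReport() via the
# template argument:
# writeHtmlReport(fileDir, template = altTemplate)
```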

3 Advanced Usage

3.1 Classes Defined in the Package

The package ngsReports introduces two main S4 classes:

  • FastqcData & FastqcDataList

FastqcData objects hold the parsed data from a single report as generated by the stand-alone tool FastQC. These are then extended into lists for more than one file as a FastqcDataList. For most users, the primary class of interest will be the FastqcDataList.

3.2 Loading FastQC Data Into R

To load a set of FastQC reports into R as a FastqcDataList, specify the vector of file paths, then call the function FastqcDataList(). In the rare case you’d like an individual file, this can be performed by calling FastqcData() on an individual file, or subsetting the output from FastqcDataList() using the [[]] operator as with any list object.

fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
fdl <- FastqcDataList(files)
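As a brief illustration of the two routes to a single report described above, the following sketch uses the example files shipped with the package. Only functions already named in this section are used; the object names fd and fdDirect are arbitrary.

```r
library(ngsReports)

fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
fdl <- FastqcDataList(files)

# A FastqcDataList behaves like a list, so length() gives the number of reports
length(fdl)

# Route 1: extract one report from the list by position
fd <- fdl[[1]]

# Route 2: parse a single file directly
fdDirect <- FastqcData(files[1])

# Both should be FastqcData objects
methods::is(fd, "FastqcData")
methods::is(fdDirect, "FastqcData")
```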

From here, all FastQC modules can be obtained as a tibble (i.e. data.frame) using the function getModule() and choosing one of the following modules:

  • Summary (The PASS/WARN/FAIL status for each module)
  • Basic_Statistics
  • Per_base_sequence_quality
  • Per_sequence_quality_scores
  • Per_base_sequence_content
  • Per_sequence_GC_content
  • Per_base_N_content
  • Sequence_Length_Distribution
  • Sequence_Duplication_Levels
  • Overrepresented_sequences
  • Adapter_Content
  • Kmer_Content
  • Per_tile_sequence_quality

getModule(fdl[[1]], "Summary")
## # A tibble: 12 x 3
##    Filename      Status Category                    
##    <chr>         <chr>  <chr>                       
##  1 ATTG_R1.fastq PASS   Basic Statistics            
##  2 ATTG_R1.fastq FAIL   Per base sequence quality   
##  3 ATTG_R1.fastq WARN   Per tile sequence quality   
##  4 ATTG_R1.fastq PASS   Per sequence quality scores 
##  5 ATTG_R1.fastq FAIL   Per base sequence content   
##  6 ATTG_R1.fastq FAIL   Per sequence GC content     
##  7 ATTG_R1.fastq PASS   Per base N content          
##  8 ATTG_R1.fastq PASS   Sequence Length Distribution
##  9 ATTG_R1.fastq FAIL   Sequence Duplication Levels 
## 10 ATTG_R1.fastq FAIL   Overrepresented sequences   
## 11 ATTG_R1.fastq FAIL   Adapter Content             
## 12 ATTG_R1.fastq FAIL   Kmer Content

Capitalisation and spelling of these module names follows the default patterns from FastQC reports, with spaces replaced by underscores. One additional module is available and is taken directly from the text within the supplied reports:

  • Total_Duplicated_Percentage
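Because getModule() returns a tibble, standard dplyr verbs can be applied to the result. The sketch below calls getModule() on the full FastqcDataList and tabulates how many files received each status per module. It assumes the combined tibble has the same Filename, Status and Category columns shown above for a single file.

```r
library(ngsReports)
library(dplyr)

fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
fdl <- FastqcDataList(files)

# Summary module for every report, stacked into one tibble
statusTbl <- getModule(fdl, "Summary")

# Number of files with each status, per FastQC module
statusCounts <- statusTbl %>%
    dplyr::count(Category, Status)

# Modules which failed in at least one file, most frequent first
statusCounts %>%
    dplyr::filter(Status == "FAIL") %>%
    dplyr::arrange(dplyr::desc(n))
```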

In addition, the read totals for each file in the library can be obtained using readTotals(), which can be easily used to make a table of read totals. This essentially just returns the first two columns from getModule(x, "Basic_Statistics").

reads <- readTotals(fdl)
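To make the relationship with the Basic_Statistics module concrete, the following sketch retrieves both objects and selects the matching columns by name. The column names Filename and Total_Sequences are taken from the table shown later in this section; nothing else about the layout of Basic_Statistics is assumed.

```r
library(ngsReports)

fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
fdl <- FastqcDataList(files)

# Read totals via the convenience function
reads <- readTotals(fdl)

# The same information taken from the full Basic_Statistics module
basic <- getModule(fdl, "Basic_Statistics")
basicTotals <- basic[, c("Filename", "Total_Sequences")]

# Inspect both to compare them side by side
reads
basicTotals
```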

The packages dplyr and pander can also be extremely useful for manipulating and displaying imported data. To show only the R1 read totals, you could do the following:

library(dplyr)
library(pander)
reads %>%
    dplyr::filter(grepl("R1", Filename)) %>% 
    pander(
        big.mark = ",",
        caption = "Read totals from R1 libraries", 
        justify = "lr"
    )

Read totals from R1 libraries

Filename         Total_Sequences
ATTG_R1.fastq             24,978
CCGC_R1.fastq             22,269
GACC_R1.fastq             10,287

3.3 Generating Plots For One or More FastQC Files

Plots created from a single FastqcData object will resemble those generated by the FastQC tool, whilst those created from a FastqcDataList will be combined summaries across a library of files. In addition, all plots can be generated as interactive plots using the argument usePlotly = TRUE.

All FastQC modules have been enabled for plotting using default S4 dispatch, with the exception of Per_tile_sequence_quality.
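As a minimal sketch of the usePlotly argument, the same plotting function can be called twice, once for a static plot and once for an interactive one. The class of each result is printed rather than assumed, since the interactive version is not expected to accept further ggplot2 layers.

```r
library(ngsReports)

fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
fdl <- FastqcDataList(files)

# Static version: a ggplot2 object which can be modified further
pStatic <- plotSummary(fdl)

# Interactive version, rendered using plotly
pInteractive <- plotSummary(fdl, usePlotly = TRUE)

# Check what was returned in each case
class(pStatic)
class(pInteractive)
```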

3.3.1 Inspecting the PASS/WARN/FAIL Status of each module

The simplest plot summarises the PASS/WARN/FAIL flags produced by FastQC for each module. It can be generated using plotSummary():

plotSummary(fdl)

Figure 1: Default summary of FastQC flags

3.3.2 Visualising Read Totals

The next most informative plot may be a summary of the total number of reads in each associated Fastq file. By default, the number of duplicated sequences from the Total_Duplicated_Percentage module is shown, but this can be disabled by setting duplicated = FALSE.

plotReadTotals(fdl)

As these are ggplot2 objects, the output can be modified easily using conventional ggplot2 syntax. Here we’ll move the legend to the top right as an example.

plotReadTotals(fdl) +
    theme(
        legend.position = c(1, 1), 
        legend.justification = c(1, 1),
        legend.background = element_rect(colour = "black")
    )
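The same approach works with other ggplot2 functions. The following sketch disables the duplicated sequences using duplicated = FALSE, adds a title and writes the result to a file with ggsave(). The title text, file name and dimensions are arbitrary choices for illustration, and a temporary directory is used so nothing is written to the working directory.

```r
library(ngsReports)
library(ggplot2)

fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
fdl <- FastqcDataList(files)

# Read totals without the duplicated fraction, plus a custom title
p <- plotReadTotals(fdl, duplicated = FALSE) +
    ggtitle("Read totals per file")

# Save to a PDF (hypothetical file name; size given in inches)
outFile <- file.path(tempdir(), "readTotals.pdf")
ggsave(outFile, plot = p, width = 6, height = 4)
```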

3.3.3 Per Base Sequence Qualities

Turning to the Per base sequence quality scores is the next most common step for most researchers. These can be obtained for an individual file by selecting it as an element (i.e. a FastqcData object) of the main FastqcDataList object. This plot replicates the default plot from a FastQC report.

plotBaseQuals(fdl[[1]])

Figure 2: Example showing the Per_base_sequence_quality plot for a single FastqcData object

When working with multiple FastQC reports, these are summarised as a heatmap using the mean quality score at each position.

plotBaseQuals(fdl)

Figure 3: Example showing the Mean Per Base Sequence Qualities for a set of FastQC reports
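The values underlying this heatmap can be inspected by retrieving the corresponding module with getModule(). The sketch below only assumes that a Filename column is present, as in the other modules shown in this section. The remaining column names are printed rather than assumed.

```r
library(ngsReports)

fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
fdl <- FastqcDataList(files)

# Per base quality data for all files as a single tibble
bq <- getModule(fdl, "Per_base_sequence_quality")

# See which columns are available before working with them
colnames(bq)
head(bq)

# Number of rows (i.e. positions or position bins) contributed by each file
table(bq$Filename)
```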

Boxplots of any combination can also be drawn from a FastqcDataList by setting the argument plotType = "boxplot". However, this may not be suitable for datasets with a large number of libraries.

plotBaseQuals(fdl[1:4], plotType = "boxplot")
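For larger datasets, one option is to restrict the boxplots to a subset chosen by file name rather than by position. The sketch below selects the R1 files by matching against the original file paths. It assumes the elements of fdl are in the same order as files, and that a FastqcDataList can be subset with a logical vector in the same way as with the integer vector used above.

```r
library(ngsReports)

fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
fdl <- FastqcDataList(files)

# Logical vector marking the R1 reports, based on the file names
isR1 <- grepl("R1", basename(files))

# Boxplots for the R1 reports only
pR1 <- plotBaseQuals(fdl[isR1], plotType = "boxplot")
pR1
```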