1 Introduction

Users should provide background information here about the design of their RNA-Seq project.

2 Samples and environment settings

2.1 Environment settings and input data

Typically, users record here the sources and versions of the reference genome sequence along with the corresponding annotations. In the provided sample data set, all data inputs are stored in a data subdirectory and all results are written to a separate results directory, while the systemPipeRNAseq.Rmd script and the targets file are expected to be located in the parent directory. The R session is expected to run from this parent directory.
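The layout described above can be checked before running the workflow. The following is a minimal base R sketch, not part of systemPipeR: check_layout is a hypothetical helper, and the targets file name targetsPE.txt is taken from the trimming step later in this template. The example builds a temporary directory that mimics the layout, so it runs without the sample data.

```r
# Hypothetical helper (not a systemPipeR function): reports whether the
# expected subdirectories and files exist below `root`.
check_layout <- function(root = ".",
                         dirs = c("data", "results"),
                         files = c("systemPipeRNAseq.Rmd", "targetsPE.txt")) {
    dir_ok <- dir.exists(file.path(root, dirs))
    file_ok <- file.exists(file.path(root, files))
    c(setNames(dir_ok, dirs), setNames(file_ok, files))
}

# Build a temporary directory that mimics the described layout
root <- tempfile("rnaseq_")
dir.create(file.path(root, "data"), recursive = TRUE)
dir.create(file.path(root, "results"))
invisible(file.create(file.path(root, c("systemPipeRNAseq.Rmd", "targetsPE.txt"))))

# Named logical vector: TRUE for every component that was found
status <- check_layout(root)
print(status)
```

In a real project, calling check_layout() from the parent directory would flag a missing data or results directory before any workflow step is executed.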

The systemPipeRdata package is a helper package that generates a fully populated systemPipeR workflow environment in the current working directory with a single command. All instructions for generating the workflow are provided in the systemPipeRdata vignette.

The mini sample FASTQ files used by this report, as well as the associated reference genome files, can be loaded via the systemPipeRdata package. The chosen data set SRP010938 contains 18 paired-end (PE) read sets from Arabidopsis thaliana (Howard et al. 2013). To minimize processing time during testing, each FASTQ file has been subset to 90,000-100,000 randomly sampled PE reads that map to the first 100,000 nucleotides of each chromosome of the A. thaliana genome. The corresponding reference genome sequence (FASTA) and its GFF annotation files have been truncated accordingly. As a result, the entire test data set requires less than 200 MB of storage space. A PE read set was chosen for this test data set for flexibility, because it can be used to test analysis routines that require either single-end (SE) or PE reads.

systemPipeRdata::genWorkenvir(workflow = "rnaseq", mydirname = "rnaseq")

3 Create the workflow interactively

This template provides some common steps for an RNA-Seq workflow. One can add, remove, or modify workflow steps by operating on the sal object. For full documentation and details of SPR features and functions, please see the main vignette.

sal <- SPRproject()

To import the entire workflow in one step instead, skip this tutorial and run:

systemPipeRdata::genWorkenvir(workflow = "rnaseq")
sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeRNAseq_importWF.Rmd",
    verbose = FALSE)
sal <- runWF(sal)
plotWF(sal, rstudio = TRUE)
sal <- renderLogs(sal)

3.1 Required packages and resources

The systemPipeR package needs to be loaded (H Backman and Girke 2016).

appendStep(sal) <- LineWise({
    # Load the systemPipeR package as the first workflow step
    library(systemPipeR)
}, step_name = "load_SPR")

3.2 Read preprocessing

3.2.1 Read trimming with Trimmomatic

Next, we need to populate the object created in the first step with the steps of the workflow. The following example shows how to do this using parameter template files for trimming FASTQ files with the Trimmomatic software.

targetspath <- "targetsPE.txt"
appendStep(sal) <- SYSargsList(step_name = "trimming", targets = targetspath,
    wf_file = "trimmomatic/trimmomatic-pe.cwl", input_file = "trimmomatic/trimmomatic-pe.yml",
    dir_path = "param/cwl", inputvars = c(FileName1 = "_FASTQ_PATH1_",
        FileName2 = "_FASTQ_PATH2_", SampleName = "_SampleName_"),
    dependency = "load_SPR")
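The inputvars argument above pairs columns of the targets file (FileName1, FileName2, SampleName) with placeholders (_FASTQ_PATH1_, _FASTQ_PATH2_, _SampleName_) in the parameter file. The idea can be illustrated with a small base R sketch. This is only an illustration of placeholder substitution under stated assumptions, not systemPipeR's actual implementation: fill_template, the sample file names, and the template lines are all hypothetical.

```r
# Hypothetical one-row targets table; column names match the inputvars names
targets <- data.frame(
    FileName1 = "./data/sampleA_1.fastq.gz",
    FileName2 = "./data/sampleA_2.fastq.gz",
    SampleName = "sampleA",
    stringsAsFactors = FALSE
)

# Same mapping as in the trimming step: targets column -> placeholder
inputvars <- c(FileName1 = "_FASTQ_PATH1_", FileName2 = "_FASTQ_PATH2_",
    SampleName = "_SampleName_")

# Hypothetical template lines standing in for a parameter file
template <- c("fq1: _FASTQ_PATH1_", "fq2: _FASTQ_PATH2_",
    "SampleName: _SampleName_")

# Hypothetical helper: replace each placeholder with the value from the
# corresponding targets column for one sample
fill_template <- function(template, target_row, inputvars) {
    for (col in names(inputvars)) {
        template <- gsub(inputvars[[col]], target_row[[col]], template,
            fixed = TRUE)
    }
    template
}

filled <- fill_template(template, targets[1, ], inputvars)
print(filled)
```

Applied to every row of a targets table, this kind of substitution would yield one set of concrete parameters per sample, which is why a single template can serve all samples.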

3.2.2 FASTQ quality report

The following seeFastq and seeFastqPlot functions generate and plot a series of useful quality statistics for a set of FASTQ files, including per-cycle quality box plots, base proportions, base-level quality trends, relative k-mer diversity, length and occurrence distribution of reads, number of reads above quality cutoffs, and mean quality distribution. The results are written to a PDF file named fastqReport.pdf in the results directory.

appendStep(sal) <- LineWise({
    fastq <- getColumn(sal, step = "trimming", "targetsWF", column = 1)
    fqlist <- seeFastq(fastq = fastq, batchsize = 10000, klength = 8)
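    # NOTE: the width expression is truncated in the source; scaling by
    # length(fqlist) is an assumption (one plot column per FASTQ file).
    pdf("./results/fastqReport.pdf", height = 18, width = 4 * length(fqlist))
    seeFastqPlot(fqlist)
    dev.off()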
}, step_name = "fastq_report", dependency = "trimming")
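To make the per-cycle quality statistics in the report more concrete, the following base R sketch computes a per-cycle summary from a few quality strings. It does not reproduce what seeFastq does internally. The quality strings are hypothetical and are assumed to use Phred+33 encoding, where each score is the ASCII code of the character minus 33.

```r
# Hypothetical Phred+33 quality strings for three reads of equal length
quals <- c("IIIIHHGG", "IIHHGGFF", "IHHGGFF#")
read_len <- nchar(quals[1])

# Decode each string into integer Phred scores (ASCII code minus 33);
# after transposing, rows are reads and columns are sequencing cycles
phred <- t(vapply(quals, function(q) utf8ToInt(q) - 33L, integer(read_len)))

# Summaries per cycle, the quantities a per-cycle box plot is built from
cycle_mean <- colMeans(phred)
cycle_min <- apply(phred, 2, min)
cycle_median <- apply(phred, 2, median)

print(rbind(mean = cycle_mean, median = cycle_median, min = cycle_min))
```

A drop in the summaries toward the last cycles, as in the final column of this toy example, is the kind of pattern that the quality report helps to spot before and after trimming.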