1 Introduction

Users should provide background information about the design of their ChIP-Seq project here.

1.1 Background and objectives

This report describes the analysis of several ChIP-Seq experiments studying the DNA binding patterns of the transcription factors … from organism ….

1.2 Experimental design

Typically, users should specify here all information relevant to the analysis of their NGS study. This includes detailed descriptions of FASTQ files, experimental design, reference genome, gene annotations, etc.

2 Samples and environment settings

2.1 Environment settings and input data

The systemPipeRdata package is a helper package that generates a fully populated systemPipeR workflow environment in the current working directory with a single command. All instructions for generating the workflow are provided in the systemPipeRdata vignette.

systemPipeRdata::genWorkenvir(workflow = "chipseq", mydirname = "chipseq")
setwd("chipseq")

After building and loading the workflow environment generated by genWorkenvir from systemPipeRdata, all data inputs are stored in a data/ directory and all analysis results are written to a separate results/ directory. The systemPipeChIPseq.Rmd script and the targets file are expected to be located in the parent directory, and the R session is expected to run from this parent directory. Additional parameter files are stored under param/.
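As a quick sanity check of this layout, the sketch below reports which of the expected subdirectories exist under a given root. It uses base R only. The helper name check_workenv and the demo directory are hypothetical and not part of systemPipeR or systemPipeRdata; the demo runs in a temporary directory so that it does not touch a real project.

```r
# Hypothetical helper (not part of systemPipeR): report which of the
# expected workflow subdirectories exist under 'root'.
check_workenv <- function(root = ".", expected = c("data", "results", "param")) {
    present <- dir.exists(file.path(root, expected))
    stats::setNames(present, expected)
}

# Demo in a throwaway directory: create data/ and results/, but not param/.
demo_root <- file.path(tempdir(), "chipseq_demo")
dir.create(file.path(demo_root, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_root, "results"), recursive = TRUE, showWarnings = FALSE)

layout_ok <- check_workenv(demo_root)
print(layout_ok)
```

In a real project, calling check_workenv() from the parent directory should report TRUE for every entry before the workflow is started.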

The data set used in this report, SRP010938, contains 18 paired-end (PE) read sets from Arabidopsis thaliana (Howard et al. 2013). To minimize processing time during testing, each FASTQ file has been subset to 90,000-100,000 randomly sampled PE reads that map to the first 100,000 nucleotides of each chromosome of the A. thaliana genome. The corresponding reference genome sequence (FASTA) and its GFF annotation files have been truncated accordingly. This way, the entire test data set requires less than 200 MB of storage space. A PE read set has been chosen for this test data set for flexibility, because it can be used to test analysis routines that require either single-end (SE) or PE reads.

To work with real data, users should organize their own data similarly and substitute their own data for all test data. To rerun an established workflow on new data, the initial targets file along with the corresponding FASTQ files are usually the only inputs the user needs to provide.

For more details, including a description of the targets file format, please consult the systemPipeR documentation.

2.1.1 Experiment definition provided by targets file

The targets file defines all FASTQ files and sample comparisons of the analysis workflow.

targetspath <- system.file("extdata", "targetsPE_chip.txt", package = "systemPipeR")
targets <- read.delim(targetspath, comment.char = "#")
targets[1:4, -c(5, 6)]
##                     FileName1                   FileName2
## 1 ./data/SRR446027_1.fastq.gz ./data/SRR446027_2.fastq.gz
## 2 ./data/SRR446028_1.fastq.gz ./data/SRR446028_2.fastq.gz
## 3 ./data/SRR446029_1.fastq.gz ./data/SRR446029_2.fastq.gz
## 4 ./data/SRR446030_1.fastq.gz ./data/SRR446030_2.fastq.gz
##   SampleName Factor        Date SampleReference
## 1        M1A     M1 23-Mar-2012                
## 2        M1B     M1 23-Mar-2012                
## 3        A1A     A1 23-Mar-2012             M1A
## 4        A1B     A1 23-Mar-2012             M1B

To work with custom data, users need to generate a targets file containing the paths to their own FASTQ files.
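A minimal sketch of how such a file could be assembled in base R is given below. The sample names, factors, and output file name are hypothetical placeholders. Only the columns visible in the excerpt above are included; the packaged targets file contains additional columns that are hidden in that excerpt, so a real file may need them as well. The SampleReference values follow the pattern of the excerpt, where the A1 samples point to M1 samples.

```r
# Hypothetical sample sheet; replace names and factors with real values.
samples <- data.frame(
    SampleName = c("ctrl_A", "ctrl_B", "chip_A", "chip_B"),
    Factor = c("ctrl", "ctrl", "chip", "chip"),
    # Assumed usage: the sample each ChIP sample is compared against
    SampleReference = c("", "", "ctrl_A", "ctrl_B"),
    stringsAsFactors = FALSE
)

# Derive paired-end FASTQ paths using the ./data/<name>_<mate>.fastq.gz pattern
my_targets <- data.frame(
    FileName1 = file.path(".", "data", paste0(samples$SampleName, "_1.fastq.gz")),
    FileName2 = file.path(".", "data", paste0(samples$SampleName, "_2.fastq.gz")),
    samples,
    stringsAsFactors = FALSE
)

# Write a tab-delimited file (to a temporary location in this demo)
out <- file.path(tempdir(), "targetsPE_custom.txt")
write.table(my_targets, out, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back the same way the workflow reads its targets file
check <- read.delim(out, comment.char = "#")
print(check)
```

In a real project the file would be written to the project's parent directory rather than to tempdir(), and each listed FASTQ path should be checked with file.exists before the workflow is run.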

3 Workflow environment

systemPipeR workflows can be designed and built from start to finish with a single command by importing them from an R Markdown file, or stepwise in interactive mode from the R console.

This tutorial demonstrates how to build the workflow in interactive mode by appending one step at a time. The workflow is constructed by connecting each step via the appendStep method. Each SYSargsList instance contains the instructions needed for processing a set of input files with a specific command-line or R software, along with the paths to the corresponding output files generated by a particular tool or step.

To create a workflow within systemPipeR, we can start by defining an empty container and checking the directory structure:

library(systemPipeR)
sal <- SPRproject()
sal

3.1 Required packages and resources

The systemPipeR package needs to be loaded (H Backman and Girke 2016).

appendStep(sal) <- LineWise(code = {
    library(systemPipeR)
}, step_name = "load_SPR")

3.2 Read preprocessing

3.2.1 FASTQ quality report

The following seeFastq and seeFastqPlot functions generate and plot a series of useful quality statistics for a set of FASTQ files, including per-cycle quality box plots, base proportions, base-level quality trends, relative k-mer diversity, read length and occurrence distributions, the number of reads above quality cutoffs, and the mean quality distribution. The results are written to a PDF file named fastqReport.pdf.

This step produces the pre-trimming FASTQ report. A post-trimming FASTQ report step is not included in the default workflow. It is recommended to run this step first to decide whether trimming is needed.

Please note that the initial targets file is used here. It is added to the first step with the updateColumn function, and the getColumn function is then used to extract a named vector from its first column.

appendStep(sal) <- LineWise(code = {
    targets <- read.delim("targetsPE_chip.txt", comment.char = "#")
    updateColumn(sal, step = "load_SPR", position = "targetsWF") <- targets
    fq_files <- getColumn(sal, "load_SPR", "targetsWF", column = 1)
    fqlist <- seeFastq(fastq = fq_files, batchsize = 10000, klength = 8)
    pdf("./results/fastqReport.pdf", height = 18, width = 4 *
        length(fqlist))
    seeFastqPlot(fqlist)
    dev.off()
}, step_name = "fastq_report", dependency = "load_SPR")
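To illustrate what a named vector of FASTQ paths looks like, the base R sketch below builds one directly from two rows of the targets excerpt shown earlier. It assumes the vector is named by the SampleName column, which is an assumption about the return value of getColumn rather than something stated in this report. The object names targets_demo and fq_demo are hypothetical, and the sketch does not replace the getColumn call in the step above.

```r
# Two rows copied from the targets excerpt; only the columns needed here
targets_demo <- data.frame(
    FileName1 = c("./data/SRR446027_1.fastq.gz", "./data/SRR446028_1.fastq.gz"),
    SampleName = c("M1A", "M1B"),
    stringsAsFactors = FALSE
)

# Assumed shape: file paths as values, sample names as names
fq_demo <- setNames(targets_demo$FileName1, targets_demo$SampleName)
print(fq_demo)
```

Naming the vector this way lets downstream reports label each FASTQ file by its sample rather than by its file path.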