Chapter 1 Welcome

1.1 Introduction

Chromatin immunoprecipitation with sequencing (ChIP-seq) is a widely used technique for identifying the genomic binding sites of a target protein. Conventional analyses of ChIP-seq data aim to detect absolute binding (i.e., the presence or absence of a binding site) based on peaks in the read coverage. An alternative analysis strategy is to detect of changes in the binding profile between conditions (Ross-Innes et al. 2012; Pal et al. 2013). These differential binding (DB) analyses involve counting reads into genomic intervals and testing those counts for significant differences between conditions. This defines a set of putative DB regions for further examination. DB analyses are statistically easier to perform than their conventional counterparts, as the effect of genomic biases is largely mitigated when counts for different libraries are compared at the same genomic region. DB regions may also be more relevant as the change in binding can be associated with the biological difference between conditions.

This book describes the use of the csaw Bioconductor package to detect differential binding (DB) in ChIP-seq experiments with sliding windows (Lun and Smyth 2016). In these analyses, we detect and summarize DB regions between conditions in a de novo manner, i.e., without making any prior assumptions about the location or width of bound regions. We demonstrate on data from a variety of real studies focusing on changes in transcription factor binding and histone mark enrichment. Our aim is to facilitate the practical implementation of window-based DB analyses by providing detailed code and expected output. The code here can be adapted to any dataset with multiple experimental conditions and with multiple biological samples within one or more of the conditions; it is similarly straightforward to accommodate batch effects, covariates and additional experimental factors. Indeed, though the book focuses on ChIP-seq, the same software can be adapted to data from any sequencing technique where reads represent coverage of enriched genomic regions.

1.2 How to read this book

The descriptions in this book explore the theoretical and practical motivations behind each step of a csaw analysis. While all users are welcome to read it from start to finish, new users may prefer to examine the case studies presented in the later sections (Lun and Smyth 2015), which provides the important information in a more concise format. Experienced users (or those looking for some nighttime reading!) are more likely to benefit from the in-depth discussions in this document.

All of the workflows described here start from sorted and indexed BAM files in the chipseqDBData package. For application to user-specified data, the raw read sequences have to be aligned to the appropriate reference genome beforehand. Most aligners can be used for this purpose, but we have used Rsubread (Liao, Smyth, and Shi 2013) due to the convenience of its R interface. It is also recommended to mark duplicate reads using tools like Picard prior to starting the workflow.

The statistical methods described here are based upon those in the edgeR package (Robinson, McCarthy, and Smyth 2010). Knowledge of edgeR is useful but not a prerequesite for reading this guide.

1.3 How to get help

Most questions about csaw should be answered by the documentation. Every function mentioned in this guide has its own help page. For example, a detailed description of the arguments and output of the windowCounts() function can be obtained by typing ?windowCounts or help(windowCounts) at the R prompt. Further detail on the methods or the underlying theory can be found in the references at the bottom of each help page.

The authors of the package always appreciate receiving reports of bugs in the package functions or in the documentation. The same goes for well-considered suggestions for improvements. Other questions about how to use csaw are best sent to the Bioconductor support site. Please send requests for general assistance and advice to the support site, rather than to the individual authors. Users posting to the support site for the first time may find it helpful to read the posting guide.

1.4 How to cite this book

Most users of csaw should cite the following in any publications:

A. T. Lun and G. K. Smyth. csaw: a Bioconductor package for differential binding analysis of ChIP-seq data using sliding windows. Nucleic Acids Res., 44(5):e45, Mar 2016

To cite the workflows specifically, we can use:

A. T. L. Lun and G. K. Smyth. From reads to regions: a Bioconductor workflow to detect differential binding in ChIP-seq data. F1000Research, 4, 2015

For people interested in combined \(p\)-values, their use in DB analyses was proposed in:

A. T. Lun and G. K. Smyth. De novo detection of differentially bound regions for ChIP-seq data using peaks and windows: controlling error rates correctly. Nucleic Acids Res., 42(11):e95, Jul 2014

The DB analyses shown here use methods from the edgeR package, which has its own citation recommendations. See the appropriate section of the edgeR user’s guide for more details.

1.5 Quick start

A typical ChIP-seq analysis in csaw would look something like that described below. This assumes that a vector of file paths to sorted and indexed BAM files is provided in and a design matrix in supplied in . The code is split across several steps:

library(chipseqDBData)
tf.data <- NFYAData()
tf.data <- head(tf.data, -1) # skip the input.
bam.files <- tf.data$Path

cell.type <- sub("NF-YA ([^ ]+) .*", "\\1", tf.data$Description)
design <- model.matrix(~factor(cell.type))
colnames(design) <- c("intercept", "cell.type")

Loading in data from BAM files.

library(csaw)
param <- readParam(minq=20)
data <- windowCounts(bam.files, ext=110, width=10, param=param)

Filtering out uninteresting regions.

binned <- windowCounts(bam.files, bin=TRUE, width=10000, param=param)
keep <- filterWindowsGlobal(data, binned)$filter > log2(5)
data <- data[keep,]

Calculating normalization factors.

data <- normFactors(binned, se.out=data)

Identifying DB windows.

library(edgeR)
y <- asDGEList(data)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design, robust=TRUE)
results <- glmQLFTest(fit)

Correcting for multiple testing.

merged <- mergeResults(data, results$table, tol=1000L)

Session information

R version 4.3.0 RC (2023-04-13 r84269)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] edgeR_3.42.0                limma_3.56.0               
 [3] csaw_1.34.0                 SummarizedExperiment_1.30.0
 [5] Biobase_2.60.0              MatrixGenerics_1.12.0      
 [7] matrixStats_0.63.0          GenomicRanges_1.52.0       
 [9] GenomeInfoDb_1.36.0         IRanges_2.34.0             
[11] S4Vectors_0.38.0            BiocGenerics_0.46.0        
[13] chipseqDBData_1.15.0        BiocStyle_2.28.0           

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0              dplyr_1.1.2                  
 [3] blob_1.2.4                    filelock_1.0.2               
 [5] rebook_1.10.0                 Biostrings_2.68.0            
 [7] bitops_1.0-7                  fastmap_1.1.1                
 [9] RCurl_1.98-1.12               BiocFileCache_2.8.0          
[11] promises_1.2.0.1              XML_3.99-0.14                
[13] digest_0.6.31                 mime_0.12                    
[15] lifecycle_1.0.3               ellipsis_0.3.2               
[17] statmod_1.5.0                 KEGGREST_1.40.0              
[19] interactiveDisplayBase_1.38.0 RSQLite_2.3.1                
[21] magrittr_2.0.3                compiler_4.3.0               
[23] rlang_1.1.0                   sass_0.4.5                   
[25] tools_4.3.0                   utf8_1.2.3                   
[27] yaml_2.3.7                    knitr_1.42                   
[29] bit_4.0.5                     curl_5.0.0                   
[31] DelayedArray_0.26.0           BiocParallel_1.34.0          
[33] withr_2.5.0                   purrr_1.0.1                  
[35] CodeDepends_0.6.5             grid_4.3.0                   
[37] fansi_1.0.4                   ExperimentHub_2.8.0          
[39] xtable_1.8-4                  cli_3.6.1                    
[41] rmarkdown_2.21                crayon_1.5.2                 
[43] generics_0.1.3                metapod_1.8.0                
[45] httr_1.4.5                    DBI_1.1.3                    
[47] cachem_1.0.7                  splines_4.3.0                
[49] zlibbioc_1.46.0               parallel_4.3.0               
[51] AnnotationDbi_1.62.0          BiocManager_1.30.20          
[53] XVector_0.40.0                vctrs_0.6.2                  
[55] Matrix_1.5-4                  jsonlite_1.8.4               
[57] dir.expiry_1.8.0              bookdown_0.33                
[59] bit64_4.0.5                   locfit_1.5-9.7               
[61] jquerylib_0.1.4               glue_1.6.2                   
[63] codetools_0.2-19              BiocVersion_3.17.1           
[65] later_1.3.0                   tibble_3.2.1                 
[67] pillar_1.9.0                  rappdirs_0.3.3               
[69] htmltools_0.5.5               graph_1.78.0                 
[71] GenomeInfoDbData_1.2.10       R6_2.5.1                     
[73] dbplyr_2.3.2                  lattice_0.21-8               
[75] evaluate_0.20                 shiny_1.7.4                  
[77] AnnotationHub_3.8.0           png_0.1-8                    
[79] Rsamtools_2.16.0              memoise_2.0.1                
[81] httpuv_1.6.9                  bslib_0.4.2                  
[83] Rcpp_1.0.10                   xfun_0.39                    
[85] pkgconfig_2.0.3

Bibliography

Liao, Y., G. K. Smyth, and W. Shi. 2013. “The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote.” Nucleic Acids Res. 41 (10): e108.

Lun, A. T. L., and G. K. Smyth. 2015. “From Reads to Regions: A Bioconductor Workflow to Detect Differential Binding in ChIP-Seq Data.” F1000Research 4.

Lun, A. 2016. “csaw: a Bioconductor package for differential binding analysis of ChIP-seq data using sliding windows.” Nucleic Acids Res. 44 (5): e45.

Pal, B., T. Bouras, W. Shi, F. Vaillant, J. M. Sheridan, N. Fu, K. Breslin, et al. 2013. “Global changes in the mammary epigenome are induced by hormonal cues and coordinated by Ezh2.” Cell Rep. 3 (2): 411–26.

Robinson, M. D., D. J. McCarthy, and G. K. Smyth. 2010. “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26 (1): 139–40.

Ross-Innes, C. S., R. Stark, A. E. Teschendorff, K. A. Holmes, H. R. Ali, M. J. Dunning, G. D. Brown, et al. 2012. “Differential oestrogen receptor binding is associated with clinical outcome in breast cancer.” Nature 481 (7381): 389–93.

The csaw Book