derfinder 1.38.0
Please read the basics of using derfinder
in the quick start guide. Thank you.
If you haven’t already, please read the quick start to using derfinder vignette. It explains the basics of using derfinder, how to ask for help, and showcases an example analysis.
The derfinder users guide goes into more depth about what you can do with derfinder. It covers the two implementations of the DER Finder approach (Frazee, Sabunciyan, Hansen et al., 2014). That is, the (A) expressed regions-level and (B) single base-level F-statistics implementations. The vignette also includes advanced material for fine tuning some options, working with non-human data, and example scripts for a high-performance computing cluster.
The expressed regions-level implementation is based on summarizing the coverage information for all your samples and applying a cutoff to that summary. For example, by calculating the mean coverage at every base and then checking if it’s greater than some cutoff value of interest. Contiguous bases passing the cutoff become a candidate Expressed Region (ER). We can then construct a matrix summarizing the base-level coverage for each sample for the set of ERs. This matrix can then be using with packages developed for feature counting (limma, DESeq2, edgeR, etc) to determine which ERs have differential expression signal. That is, to identify the Differentially Expressed Regions (DERs).
regionMatrix()
Commonly, users have aligned their raw RNA-seq data and saved the alignments in BAM files. Some might have chosen to compress this information into BigWig files. BigWig files are much faster to load than BAM files and are the type of file we prefer to work with. However, we can still identify the expressed regions from BAM files.
The function regionMatrix()
will require you to load the data (either from BAM or BigWig files) and store it all in memory. It will then calculate the mean coverage across all samples, apply the cutoff you chose, and determine the expressed regions.
This is the path you will want to follow in most scenarios.
railMatrix()
Our favorite method for identifying expressed regions is based on pre-computed summary coverage files (mean or median) as well as coverage files by sample. Rail is a cloud-enabled aligner that will generate:
Rail does a great job in creating these files for us, saving time and reducing the memory load needed for this type of analysis with R
.
If you have this type of data or can generate it from BAM files with other tools, you will be interested in using the railMatrix()
function. The output is identical to the one from regionMatrix()
. It’s just much faster and memory efficient. The only drawback is that BigWig files are not fully supported in Windows as of rtracklayer version 1.25.16.
We highly recommend this approach. Rail has also other significant features such as: scalability, reduced redundancy, integrative analysis, mode agnosticism, and inexpensive cloud implementation. For more information, visit rail.bio.
The DER Finder approach was originally implemented by calculating t-statistics between two groups and using a hidden markov model to determine the expression states: not expressed, expressed, differentially expressed (Frazee, Sabunciyan, Hansen et al., 2014). The original software works but had areas where we worked to improve it, which lead to the single base-level F-statistics implementation. Also note that the original software is no longer maintained.
This type of analysis first loads the data and preprocess it in a format that saves time and reduces memory later. It then fits two nested models (an alternative and a null model) with the coverage information for every single base-pair of the genome. Using the two fitted models, it calculates an F-statistic. Basically, it generates a vector along the genome with F-statistics.
A cutoff is then applied to the F-statistics and contiguous base-pairs of the genome passing the cutoff are considered a candidate Differentially Expressed Region (DER).
Calculating F-statistics along the genome is computationally intensive, but doable. The major resource drain comes from assigning p-values to the DERs. To do so, we permute the model matrices and re-calculate the F-statistics generating a set of null DERs. Once we have enough null DERs, we compare the observed DERs against the null DERs to calculate p-values for the observed DERs.
This type of analysis at the chromosome level is done by the analyzeChr()
function, which is a high level function using many other pieces of derfinder. Once all chromosomes have been processed, mergeResults()
combines them.
Which implementation of the DER Finder approach you will want to use depends on your specific use case and computational resources available. But in general, we recommend starting with the expressed regions-level implementation.
In this vignette we we will analyze a small subset of the samples from the BrainSpan Atlas of the Human Brain (BrainSpan, 2011) publicly available data.
We first load the required packages.
## Load libraries
library("derfinder")
library("derfinderData")
library("GenomicRanges")
For this example, we created a small table with the relevant phenotype data for 12 samples: 6 from fetal samples and 6 from adult samples. We chose at random a brain region, in this case the amygdaloid complex. For this example we will only look at data from chromosome 21. Other variables include the age in years and the gender. The data is shown below.
library("knitr")
## Get pheno table
pheno <- subset(brainspanPheno, structure_acronym == "AMY")
## Display the main information
p <- pheno[, -which(colnames(pheno) %in% c(
"structure_acronym",
"structure_name", "file"
))]
rownames(p) <- NULL
kable(p, format = "html", row.names = TRUE)
gender | lab | Age | group | |
---|---|---|---|---|
1 | F | HSB97.AMY | -0.4423077 | fetal |
2 | M | HSB92.AMY | -0.3653846 | fetal |
3 | M | HSB178.AMY | -0.4615385 | fetal |
4 | M | HSB159.AMY | -0.3076923 | fetal |
5 | F | HSB153.AMY | -0.5384615 | fetal |
6 | F | HSB113.AMY | -0.5384615 | fetal |
7 | F | HSB130.AMY | 21.0000000 | adult |
8 | M | HSB136.AMY | 23.0000000 | adult |
9 | F | HSB126.AMY | 30.0000000 | adult |
10 | M | HSB145.AMY | 36.0000000 | adult |
11 | M | HSB123.AMY | 37.0000000 | adult |
12 | F | HSB135.AMY | 40.0000000 | adult |
derfinder offers three functions related to loading raw data. The first one, rawFiles()
, is a helper function for identifying the full paths to the input files. Next, loadCoverage()
loads the base-level coverage data from either BAM or BigWig files for a specific chromosome. Finally, fullCoverage()
will load the coverage for a set of chromosomes using loadCoverage()
.
We can load the data from derfinderData (Collado-Torres, Jaffe, and Leek, 2024) by first identifying the paths to the BigWig files with rawFiles()
and then loading the data with fullCoverage()
. Note that the BrainSpan data is already normalized by the total number of mapped reads in each sample. However, that won’t be the case with most data sets in which case you might want to use the totalMapped
and targetSize
arguments. The function getTotalMapped()
will be helpful to get this information.
## Determine the files to use and fix the names
files <- rawFiles(system.file("extdata", "AMY", package = "derfinderData"),
samplepatt = "bw", fileterm = NULL
)
names(files) <- gsub(".bw", "", names(files))
## Load the data from disk
system.time(fullCov <- fullCoverage(
files = files, chrs = "chr21",
totalMapped = rep(1, length(files)), targetSize = 1
))
## 2024-05-01 16:56:58.069763 fullCoverage: processing chromosome chr21
## 2024-05-01 16:56:58.074722 loadCoverage: finding chromosome lengths
## 2024-05-01 16:56:58.084482 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB113.bw
## 2024-05-01 16:56:58.241538 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB123.bw
## 2024-05-01 16:56:58.382442 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB126.bw
## 2024-05-01 16:56:58.49018 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB130.bw
## 2024-05-01 16:56:58.615929 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB135.bw
## 2024-05-01 16:56:58.730004 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB136.bw
## 2024-05-01 16:56:58.89244 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB145.bw
## 2024-05-01 16:56:59.016593 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB153.bw
## 2024-05-01 16:56:59.147914 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB159.bw
## 2024-05-01 16:56:59.269928 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB178.bw
## 2024-05-01 16:56:59.394486 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB92.bw
## 2024-05-01 16:56:59.576863 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB97.bw
## 2024-05-01 16:56:59.729725 loadCoverage: applying the cutoff to the merged data
## 2024-05-01 16:56:59.740353 filterData: normalizing coverage
## 2024-05-01 16:57:00.04865 filterData: done normalizing coverage
## 2024-05-01 16:57:00.063662 filterData: originally there were 48129895 rows, now there are 48129895 rows. Meaning that 0 percent was filtered.
## user system elapsed
## 2.019 0.024 2.043
Alternatively, since the BigWig files are publicly available from BrainSpan (see here), we can extract the relevant coverage data using fullCoverage()
. Note that as of rtracklayer 1.25.16 BigWig files are not supported on Windows: you can find the fullCov
object inside derfinderData to follow the examples.
## Determine the files to use and fix the names
files <- pheno$file
names(files) <- gsub(".AMY", "", pheno$lab)
## Load the data from the web
system.time(fullCov <- fullCoverage(
files = files, chrs = "chr21",
totalMapped = rep(1, length(files)), targetSize = 1
))
Note how loading the coverage for 12 samples from the web was quite fast. Although in this case we only retained the information for chromosome 21.
The result of fullCov
is a list with one element per chromosome. If no filtering was performed, each chromosome has a DataFrame
with the number of rows equaling the number of bases in the chromosome with one column per sample.
## Lets explore it
fullCov
## $chr21
## DataFrame with 48129895 rows and 12 columns
## HSB113 HSB123 HSB126 HSB130 HSB135 HSB136 HSB145 HSB153 HSB159 HSB178 HSB92 HSB97
## <Rle> <Rle> <Rle> <Rle> <Rle> <Rle> <Rle> <Rle> <Rle> <Rle> <Rle> <Rle>
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0
## ... ... ... ... ... ... ... ... ... ... ... ... ...
## 48129891 0 0 0 0 0 0 0 0 0 0 0 0
## 48129892 0 0 0 0 0 0 0 0 0 0 0 0
## 48129893 0 0 0 0 0 0 0 0 0 0 0 0
## 48129894 0 0 0 0 0 0 0 0 0 0 0 0
## 48129895 0 0 0 0 0 0 0 0 0 0 0 0
If filtering was performed, each chromosome also has a logical Rle
indicating which bases of the chromosome passed the filtering. This information is useful later on to map back the results to the genome coordinates.
Depending on the use case, you might want to filter the base-level coverage at the time of reading it, or you might want to keep an unfiltered version. By default both loadCoverage()
and fullCoverage()
will not filter.
If you decide to filter, set the cutoff
argument to a positive value. This will run filterData()
. Note that you might want to standardize the library sizes prior to filtering if you didn’t already do it when creating the fullCov
object. You can do so by by supplying the totalMapped
and targetSize
arguments to filterData()
. Note: don’t use these arguments twice in fullCoverage()
and filterData()
.
In this example, we prefer to keep both an unfiltered and filtered version. For the filtered version, we will retain the bases where at least one sample has coverage greater than 2.
## Filter coverage
filteredCov <- lapply(fullCov, filterData, cutoff = 2)
## 2024-05-01 16:57:00.632167 filterData: originally there were 48129895 rows, now there are 130356 rows. Meaning that 99.73 percent was filtered.
The result is similar to fullCov
but with the genomic position index as shown below.
## Similar to fullCov but with $position
filteredCov
## $chr21
## $chr21$coverage
## DataFrame with 130356 rows and 12 columns
## HSB113 HSB123 HSB126 HSB130 HSB135 HSB136
## <Rle> <Rle> <Rle> <Rle> <Rle> <Rle>
## 1 2.00999999046326 0 0 0.0399999991059303 0.230000004172325 0.129999995231628
## 2 2.17000007629395 0 0 0.0399999991059303 0.230000004172325 0.129999995231628
## 3 2.21000003814697 0 0 0.0399999991059303 0.230000004172325 0.129999995231628
## 4 2.36999988555908 0 0 0.0399999991059303 0.230000004172325 0.129999995231628
## 5 2.36999988555908 0 0.0599999986588955 0.0399999991059303 0.230000004172325 0.129999995231628
## ... ... ... ... ... ... ...
## 130352 1.25 1.27999997138977 2.04999995231628 0.790000021457672 1.62999999523163 1.37000000476837
## 130353 1.21000003814697 1.24000000953674 2.04999995231628 0.790000021457672 1.62999999523163 1.37000000476837
## 130354 1.21000003814697 1.20000004768372 1.92999994754791 0.790000021457672 1.62999999523163 1.27999997138977
## 130355 1.16999995708466 1.20000004768372 1.92999994754791 0.790000021457672 1.62999999523163 1.27999997138977
## 130356 1.16999995708466 1.0900000333786 1.87000000476837 0.790000021457672 1.54999995231628 1.01999998092651
## HSB145 HSB153 HSB159 HSB178 HSB92 HSB97
## <Rle> <Rle> <Rle> <Rle> <Rle> <Rle>
## 1 0 0.589999973773956 0.150000005960464 0.0500000007450581 0.0700000002980232 1.58000004291534
## 2 0 0.629999995231628 0.150000005960464 0.0500000007450581 0.0700000002980232 1.58000004291534
## 3 0 0.670000016689301 0.150000005960464 0.0500000007450581 0.100000001490116 1.61000001430511
## 4 0 0.709999978542328 0.150000005960464 0.0500000007450581 0.129999995231628 1.64999997615814
## 5 0 0.75 0.25 0.100000001490116 0.129999995231628 1.67999994754791
## ... ... ... ... ... ... ...
## 130352 1.02999997138977 2.21000003814697 2.46000003814697 2.0699999332428 2.23000001907349 1.50999999046326
## 130353 1.02999997138977 2.21000003814697 2.46000003814697 2.0699999332428 2.23000001907349 1.50999999046326
## 130354 0.730000019073486 2.00999999046326 2.21000003814697 2.10999989509583 2.13000011444092 1.47000002861023
## 130355 0.600000023841858 2.00999999046326 2.10999989509583 2.10999989509583 2.13000011444092 1.47000002861023
## 130356 0.560000002384186 1.92999994754791 1.96000003814697 1.77999997138977 2.05999994277954 1.47000002861023
##
## $chr21$position
## logical-Rle of length 48129895 with 3235 runs
## Lengths: 9825448 149 2 9 2 2 6 ... 137 1446 172 740 598 45091
## Values : FALSE TRUE FALSE TRUE FALSE TRUE FALSE ... TRUE FALSE TRUE FALSE TRUE FALSE
In terms of memory, the filtered version requires less resources. Although this will depend on how rich the data set is and how aggressive was the filtering step.
## Compare the size in Mb
round(c(
fullCov = object.size(fullCov),
filteredCov = object.size(filteredCov)
) / 1024^2, 1)
## fullCov filteredCov
## 22.7 8.5
Note that with your own data, filtering for bases where at least one sample has coverage greater than 2 might not make sense: maybe you need a higher or lower filter. The amount of bases remaining after filtering will impact how long the analysis will take to complete. We thus recommend exploring this before proceeding.
regionMatrix()
Now that we have the data, we can identify expressed regions (ERs) by using a cutoff of 30 on the base-level mean coverage from these 12 samples. Once the regions have been identified, we can calculate a coverage matrix with one row per ER and one column per sample (12 in this case). For doing this calculation we need to know the length of the sequence reads, which in this study were 76 bp long.
Note that for this type of analysis there is no major benefit of filtering the data. Although it can be done if needed.
## Use regionMatrix()
system.time(regionMat <- regionMatrix(fullCov, cutoff = 30, L = 76))
## By using totalMapped equal to targetSize, regionMatrix() assumes that you have normalized the data already in fullCoverage(), loadCoverage() or filterData().
## 2024-05-01 16:57:01.815245 regionMatrix: processing chr21
## 2024-05-01 16:57:01.815692 filterData: normalizing coverage
## 2024-05-01 16:57:02.10297 filterData: done normalizing coverage
## 2024-05-01 16:57:03.455505 filterData: originally there were 48129895 rows, now there are 2256 rows. Meaning that 100 percent was filtered.
## 2024-05-01 16:57:03.456283 findRegions: identifying potential segments
## 2024-05-01 16:57:03.460777 findRegions: segmenting information
## 2024-05-01 16:57:03.467812 findRegions: identifying candidate regions
## 2024-05-01 16:57:03.520022 findRegions: identifying region clusters
## 2024-05-01 16:57:03.671066 getRegionCoverage: processing chr21
## 2024-05-01 16:57:03.701878 getRegionCoverage: done processing chr21
## 2024-05-01 16:57:03.704189 regionMatrix: calculating coverageMatrix
## 2024-05-01 16:57:03.709512 regionMatrix: adjusting coverageMatrix for 'L'
## user system elapsed
## 1.856 0.048 1.905
## Explore results
class(regionMat)
## [1] "list"
names(regionMat$chr21)
## [1] "regions" "coverageMatrix" "bpCoverage"
regionMatrix()
returns a list of elements of length equal to the number of chromosomes analyzed. For each chromosome, there are three pieces of output. The actual ERs are arranged in a GRanges
object named regions
.
filterData()
and then defining the regions with findRegions()
. Note that the metadata variable value
represents the mean coverage for the given region while area
is the sum of the base-level coverage (before adjusting for read length) from all samples.plotRegionCoverage()
from derfinderPlot.## regions output
regionMat$chr21$regions
## GRanges object with 45 ranges and 6 metadata columns:
## seqnames ranges strand | value area indexStart indexEnd cluster clusterL
## <Rle> <IRanges> <Rle> | <numeric> <numeric> <integer> <integer> <Rle> <Rle>
## 1 chr21 9827018-9827582 * | 313.6717 177224.53 1 565 1 565
## 2 chr21 15457301-15457438 * | 215.0846 29681.68 566 703 2 138
## 3 chr21 20230140-20230192 * | 38.8325 2058.12 704 756 3 366
## 4 chr21 20230445-20230505 * | 41.3245 2520.80 757 817 3 366
## 5 chr21 27253318-27253543 * | 34.9131 7890.37 818 1043 4 765
## .. ... ... ... . ... ... ... ... ... ...
## 41 chr21 33039644-33039688 * | 34.4705 1551.1742 2180 2224 17 45
## 42 chr21 33040784-33040798 * | 32.1342 482.0133 2225 2239 18 118
## 43 chr21 33040890 * | 30.0925 30.0925 2240 2240 18 118
## 44 chr21 33040900-33040901 * | 30.1208 60.2417 2241 2242 18 118
## 45 chr21 48019401-48019414 * | 31.1489 436.0850 2243 2256 19 14
## -------
## seqinfo: 1 sequence from an unspecified genome
## Number of regions
length(regionMat$chr21$regions)
## [1] 45
bpCoverage
is the base-level coverage list which can then be used for plotting.
## Base-level coverage matrices for each of the regions
## Useful for plotting
lapply(regionMat$chr21$bpCoverage[1:2], head, n = 2)
## $`1`
## HSB113 HSB123 HSB126 HSB130 HSB135 HSB136 HSB145 HSB153 HSB159 HSB178 HSB92 HSB97
## 1 93.20 3.32 28.22 5.62 185.17 98.34 5.88 16.71 3.52 15.71 47.40 36.54
## 2 124.76 7.25 63.68 11.32 374.85 199.28 10.39 30.53 5.83 29.35 65.04 51.42
##
## $`2`
## HSB113 HSB123 HSB126 HSB130 HSB135 HSB136 HSB145 HSB153 HSB159 HSB178 HSB92 HSB97
## 566 45.59 7.94 15.92 34.75 141.61 104.21 19.87 38.61 4.97 23.2 13.95 22.21
## 567 45.59 7.94 15.92 35.15 141.64 104.30 19.87 38.65 4.97 23.2 13.95 22.21
## Check dimensions. First region is 565 long, second one is 138 bp long.
## The columns match the number of samples (12 in this case).
lapply(regionMat$chr21$bpCoverage[1:2], dim)
## $`1`
## [1] 565 12
##
## $`2`
## [1] 138 12
The end result of the coverage matrix is shown below. Note that the coverage has been adjusted for read length. Because reads might not fully align inside a given region, the numbers are generally not integers but can be rounded if needed.
## Dimensions of the coverage matrix
dim(regionMat$chr21$coverageMatrix)
## [1] 45 12
## Coverage for each region. This matrix can then be used with limma or other pkgs
head(regionMat$chr21$coverageMatrix)
## HSB113 HSB123 HSB126 HSB130 HSB135 HSB136 HSB145 HSB153 HSB159
## 1 3653.1093346 277.072106 1397.068687 1106.722895 8987.460401 5570.221054 1330.158818 1461.2986829 297.939342
## 2 333.3740816 99.987237 463.909476 267.354342 1198.713552 1162.313418 257.114210 313.8513139 67.940131
## 3 35.3828948 20.153553 30.725394 23.483947 16.786842 17.168947 22.895921 52.8756585 28.145395
## 4 42.3398681 29.931579 41.094474 24.724736 32.634080 19.309606 33.802632 51.6146040 31.244343
## 5 77.7402631 168.939342 115.059342 171.861974 180.638684 93.503158 90.950526 36.3046051 78.069605
## 6 0.7988158 1.770263 1.473421 2.231053 1.697368 1.007895 1.171316 0.4221053 1.000132
## HSB178 HSB92 HSB97
## 1 1407.288552 1168.519079 1325.9622371
## 2 193.695657 127.543553 200.7834228
## 3 33.127368 23.758816 20.4623685
## 4 33.576974 29.546183 28.2011836
## 5 97.151316 100.085790 35.5428946
## 6 1.139079 1.136447 0.3956579
We can then use the coverage matrix and packages such as limma, DESeq2 or edgeR to identify which ERs are differentially expressed.
Here we’ll use DESeq2 to identify the DERs. To use it we need to round the coverage data.
## Required
library("DESeq2")
## Round matrix
counts <- round(regionMat$chr21$coverageMatrix)
## Round matrix and specify design
dse <- DESeqDataSetFromMatrix(counts, pheno, ~ group + gender)
## converting counts to integer mode
## Perform DE analysis
dse <- DESeq(dse, test = "LRT", reduced = ~gender, fitType = "local")
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## Extract results
deseq <- regionMat$chr21$regions
mcols(deseq) <- c(mcols(deseq), results(dse))
## Explore the results
deseq
## GRanges object with 45 ranges and 12 metadata columns:
## seqnames ranges strand | value area indexStart indexEnd cluster clusterL baseMean
## <Rle> <IRanges> <Rle> | <numeric> <numeric> <integer> <integer> <Rle> <Rle> <numeric>
## 1 chr21 9827018-9827582 * | 313.6717 177224.53 1 565 1 565 2846.2872
## 2 chr21 15457301-15457438 * | 215.0846 29681.68 566 703 2 138 451.5196
## 3 chr21 20230140-20230192 * | 38.8325 2058.12 704 756 3 366 29.5781
## 4 chr21 20230445-20230505 * | 41.3245 2520.80 757 817 3 366 36.0603
## 5 chr21 27253318-27253543 * | 34.9131 7890.37 818 1043 4 765 101.6468
## .. ... ... ... . ... ... ... ... ... ... ...
## 41 chr21 33039644-33039688 * | 34.4705 1551.1742 2180 2224 17 45 20.782035
## 42 chr21 33040784-33040798 * | 32.1342 482.0133 2225 2239 18 118 6.410542
## 43 chr21 33040890 * | 30.0925 30.0925 2240 2240 18 118 0.129717
## 44 chr21 33040900-33040901 * | 30.1208 60.2417 2241 2242 18 118 0.702291
## 45 chr21 48019401-48019414 * | 31.1489 436.0850 2243 2256 19 14 5.293293
## log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 -1.6903182 0.831959 0.215262 0.6426743 0.997155
## 2 -1.1640426 0.757490 0.871126 0.3506436 0.997155
## 3 0.0461488 0.458097 3.132082 0.0767657 0.863614
## 4 -0.1866200 0.390920 2.225708 0.1357305 0.997155
## 5 -0.1387377 0.320166 3.957987 0.0466495 0.862040
## .. ... ... ... ... ...
## 41 -0.642056 0.427661 0.6047814 0.4367595 0.997155
## 42 -0.634321 0.512262 0.5454039 0.4602018 0.997155
## 43 -0.859549 3.116540 0.0206273 0.8857989 0.997155
## 44 -0.628285 2.247378 0.5825105 0.4453299 0.997155
## 45 -1.694563 1.252290 5.7895910 0.0161213 0.725460
## -------
## seqinfo: 1 sequence from an unspecified genome
You can get similar results using edgeR using these functions: DGEList()
, calcNormFactors()
, estimateGLMRobustDisp()
, glmFit()
, and glmLRT()
.
Alternatively, we can find DERs using limma. Here we’ll exemplify a type of test closer to what we’ll do later with the F-statistics approach. First of all, we need to define our models.
## Build models
mod <- model.matrix(~ pheno$group + pheno$gender)
mod0 <- model.matrix(~ pheno$gender)
Next, we’ll transform the coverage information using the same default transformation from analyzeChr()
.
## Transform coverage
transformedCov <- log2(regionMat$chr21$coverageMatrix + 32)
We can then fit the models and get the F-statistic p-values and control the FDR.
## Example using limma
library("limma")
##
## Attaching package: 'limma'
## The following object is masked from 'package:DESeq2':
##
## plotMA
## The following object is masked from 'package:BiocGenerics':
##
## plotMA
## Run limma
fit <- lmFit(transformedCov, mod)
fit0 <- lmFit(transformedCov, mod0)
## Determine DE status for the regions
## Also in https://github.com/LieberInstitute/jaffelab with help and examples
getF <- function(fit, fit0, theData) {
rss1 <- rowSums((fitted(fit) - theData)^2)
df1 <- ncol(fit$coef)
rss0 <- rowSums((fitted(fit0) - theData)^2)
df0 <- ncol(fit0$coef)
fstat <- ((rss0 - rss1) / (df1 - df0)) / (rss1 / (ncol(theData) - df1))
f_pval <- pf(fstat, df1 - df0, ncol(theData) - df1, lower.tail = FALSE)
fout <- cbind(fstat, df1 - 1, ncol(theData) - df1, f_pval)
colnames(fout)[2:3] <- c("df1", "df0")
fout <- data.frame(fout)
return(fout)
}
ff <- getF(fit, fit0, transformedCov)
## Get the p-value and assign it to the regions
limma <- regionMat$chr21$regions
limma$fstat <- ff$fstat
limma$pvalue <- ff$f_pval
limma$padj <- p.adjust(ff$f_pval, "BH")
## Explore the results
limma
## GRanges object with 45 ranges and 9 metadata columns:
## seqnames ranges strand | value area indexStart indexEnd cluster clusterL fstat pvalue
## <Rle> <IRanges> <Rle> | <numeric> <numeric> <integer> <integer> <Rle> <Rle> <numeric> <numeric>
## 1 chr21 9827018-9827582 * | 313.6717 177224.53 1 565 1 565 1.638455 0.2325446
## 2 chr21 15457301-15457438 * | 215.0846 29681.68 566 703 2 138 4.307443 0.0677644
## 3 chr21 20230140-20230192 * | 38.8325 2058.12 704 756 3 366 1.323342 0.2796406
## 4 chr21 20230445-20230505 * | 41.3245 2520.80 757 817 3 366 0.380332 0.5527044
## 5 chr21 27253318-27253543 * | 34.9131 7890.37 818 1043 4 765 7.249519 0.0246955
## .. ... ... ... . ... ... ... ... ... ... ... ...
## 41 chr21 33039644-33039688 * | 34.4705 1551.1742 2180 2224 17 45 3.11799 0.1112440
## 42 chr21 33040784-33040798 * | 32.1342 482.0133 2225 2239 18 118 3.66184 0.0879543
## 43 chr21 33040890 * | 30.0925 30.0925 2240 2240 18 118 3.87860 0.0804175
## 44 chr21 33040900-33040901 * | 30.1208 60.2417 2241 2242 18 118 4.39338 0.0655381
## 45 chr21 48019401-48019414 * | 31.1489 436.0850 2243 2256 19 14 6.80915 0.0282970
## padj
## <numeric>
## 1 0.581362
## 2 0.324601
## 3 0.629191
## 4 0.863074
## 5 0.309532
## .. ...
## 41 0.385075
## 42 0.329829
## 43 0.328981
## 44 0.324601
## 45 0.309532
## -------
## seqinfo: 1 sequence from an unspecified genome
In this simple example, none of the ERs have strong differential expression signal when adjusting for an FDR of 5%.
table(limma$padj < 0.05, deseq$padj < 0.05)
##
## FALSE
## FALSE 45
railMatrix()
If you have Rail output, you can get the same results faster than with regionMatrix()
. Rail will create the summarized coverage BigWig file for you, but we are not including it in this package due to its size. So, lets create it.
## Calculate the mean: this step takes a long time with many samples
meanCov <- Reduce("+", fullCov$chr21) / ncol(fullCov$chr21)
## Save it on a bigwig file called meanChr21.bw
createBw(list("chr21" = DataFrame("meanChr21" = meanCov)),
keepGR =
FALSE
)
## 2024-05-01 16:57:07.505301 coerceGR: coercing sample meanChr21
## 2024-05-01 16:57:08.633228 createBwSample: exporting bw for sample meanChr21
Now that we have the files Rail creates for us, we can use railMatrix()
.
## Identify files to use
summaryFile <- "meanChr21.bw"
## We had already found the sample BigWig files and saved it in the object 'files'
## Lets just rename it to sampleFiles for clarity.
sampleFiles <- files
## Get the regions
system.time(
regionMat.rail <- railMatrix(
chrs = "chr21", summaryFiles = summaryFile,
sampleFiles = sampleFiles, L = 76, cutoff = 30, maxClusterGap = 3000L
)
)
## 2024-05-01 16:57:10.745814 loadCoverage: finding chromosome lengths
## 2024-05-01 16:57:10.753361 loadCoverage: loading BigWig file meanChr21.bw
## 2024-05-01 16:57:11.186351 loadCoverage: applying the cutoff to the merged data
## 2024-05-01 16:57:13.3649 filterData: originally there were 48129895 rows, now there are 48129895 rows. Meaning that 0 percent was filtered.
## 2024-05-01 16:57:13.686366 filterData: originally there were 48129895 rows, now there are 2256 rows. Meaning that 100 percent was filtered.
## 2024-05-01 16:57:13.690015 findRegions: identifying potential segments
## 2024-05-01 16:57:13.693245 findRegions: segmenting information
## 2024-05-01 16:57:13.693604 .getSegmentsRle: segmenting with cutoff(s) 30
## 2024-05-01 16:57:13.697226 findRegions: identifying candidate regions
## 2024-05-01 16:57:13.734767 findRegions: identifying region clusters
## 2024-05-01 16:57:13.798744 railMatrix: processing regions 1 to 45
## 2024-05-01 16:57:13.804244 railMatrix: processing file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB113.bw
## 2024-05-01 16:57:13.879025 railMatrix: processing file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB123.bw
## 2024-05-01 16:57:13.94832 railMatrix: processing file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB126.bw
## 2024-05-01 16:57:14.017126 railMatrix: processing file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB130.bw
## 2024-05-01 16:57:14.087963 railMatrix: processing file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB135.bw
## 2024-05-01 16:57:14.158411 railMatrix: processing file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB136.bw
## 2024-05-01 16:57:14.227843 railMatrix: processing file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB145.bw
## 2024-05-01 16:57:14.296553 railMatrix: processing file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB153.bw
## 2024-05-01 16:57:14.365351 railMatrix: processing file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB159.bw
## 2024-05-01 16:57:14.442555 railMatrix: processing file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB178.bw
## 2024-05-01 16:57:14.513484 railMatrix: processing file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB92.bw
## 2024-05-01 16:57:14.586151 railMatrix: processing file /home/biocbuild/bbs-3.19-bioc/R/site-library/derfinderData/extdata/AMY/HSB97.bw
## user system elapsed
## 3.903 0.024 3.928
When you take into account the time needed to load the data (fullCoverage()
) and then creating the matrix (regionMatrix()
), railMatrix()
is faster and less memory intensive.
The objects are not identical due to small rounding errors, but it’s nothing to worry about.
## Overall not identical due to small rounding errors
identical(regionMat, regionMat.rail)
## [1] FALSE
## Actual regions are the same
identical(ranges(regionMat$chr21$regions), ranges(regionMat.rail$chr21$regions))
## [1] TRUE
## When you round, the small differences go away
identical(
round(regionMat$chr21$regions$value, 4),
round(regionMat.rail$chr21$regions$value, 4)
)
## [1] TRUE
identical(
round(regionMat$chr21$regions$area, 4),
round(regionMat.rail$chr21$regions$area, 4)
)
## [1] TRUE
One form of base-level differential expression analysis implemented in derfinder is to calculate F-statistics for every base and use them to define candidate differentially expressed regions. This type of analysis is further explained in this section.
Once we have the base-level coverage data for all 12 samples, we can construct the models. In this case, we want to find differences between fetal and adult samples while adjusting for gender and a proxy of the library size.
We can use sampleDepth()
and it’s helper function collapseFullCoverage()
to get a proxy of the library size. Note that you would normally use the unfiltered data from all the chromosomes in this step and not just one.
## Get some idea of the library sizes
sampleDepths <- sampleDepth(collapseFullCoverage(fullCov), 1)
## 2024-05-01 16:57:14.782822 sampleDepth: Calculating sample quantiles
## 2024-05-01 16:57:15.026875 sampleDepth: Calculating sample adjustments
sampleDepths
## HSB113.100% HSB123.100% HSB126.100% HSB130.100% HSB135.100% HSB136.100% HSB145.100% HSB153.100% HSB159.100% HSB178.100%
## 19.82106 19.40505 19.53045 19.52017 20.33392 19.97758 19.49827 19.41285 19.24186 19.44252
## HSB92.100% HSB97.100%
## 19.55904 19.47733
sampleDepth()
is similar to calcNormFactors()
from metagenomeSeq with some code underneath tailored for the type of data we are using. collapseFullCoverage()
is only needed to deal with the size of the data.
We can then define the nested models we want to use using makeModels()
. This is a helper function that assumes that you will always adjust for the library size. You then need to define the variable to test, in this case we are comparing fetal vs adult samples. Optionally, you can adjust for other sample covariates, such as the gender in this case.
## Define models
models <- makeModels(sampleDepths,
testvars = pheno$group,
adjustvars = pheno[, c("gender")]
)
## Explore the models
lapply(models, head)
## $mod
## (Intercept) testvarsadult sampleDepths adjustVar1M
## 1 1 0 19.82106 0
## 2 1 0 19.40505 1
## 3 1 0 19.53045 1
## 4 1 0 19.52017 1
## 5 1 0 20.33392 0
## 6 1 0 19.97758 0
##
## $mod0
## (Intercept) sampleDepths adjustVar1M
## 1 1 19.82106 0
## 2 1 19.40505 1
## 3 1 19.53045 1
## 4 1 19.52017 1
## 5 1 20.33392 0
## 6 1 19.97758 0
Note how the null model (mod0
) is nested in the alternative model (mod
). Use the same models for all your chromosomes unless you have a specific reason to use chromosome-specific models. Note that derfinder is very flexible and works with any type of nested model.
Next, we can find candidate differentially expressed regions (DERs) using as input the segments of the genome where at least one sample has coverage greater than 2. That is, the filtered coverage version we created previously.
The main function in derfinder for this type of analysis is analyzeChr()
. It works at a chromosome level and runs behinds the scenes several other derfinder functions. To use it, you have to provide the models, the grouping information, how to calculate the F-statistic cutoff and most importantly, the number of permutations.
By default analyzeChr()
will use a theoretical cutoff. In this example, we use the cutoff that would correspond to a p-value of 0.05. To assign p-values to the candidate DERs, derfinder permutes the rows of the model matrices, re-calculates the F-statistics and identifies null regions. Then it compares the area of the observed regions versus the areas from the null regions to assign an empirical p-value.
In this example we will use twenty permutations, although in a real case scenario you might consider a larger number of permutations.
In real scenario, you might consider saving the results from all the chromosomes in a given directory. Here we will use analysisResults. For each chromosome you analyze, a new directory with the chromosome-specific data will be created. So in this case, we will have analysisResults/chr21.
## Create a analysis directory
dir.create("analysisResults")
originalWd <- getwd()
setwd(file.path(originalWd, "analysisResults"))
## Perform differential expression analysis
system.time(results <- analyzeChr(
chr = "chr21", filteredCov$chr21, models,
groupInfo = pheno$group, writeOutput = TRUE, cutoffFstat = 5e-02,
nPermute = 20, seeds = 20140923 + seq_len(20), returnOutput = TRUE
))
## 2024-05-01 16:57:16.276731 analyzeChr: Pre-processing the coverage data
## 2024-05-01 16:57:18.856584 analyzeChr: Calculating statistics
## 2024-05-01 16:57:18.85989 calculateStats: calculating the F-statistics
## 2024-05-01 16:57:19.158568 analyzeChr: Calculating pvalues
## 2024-05-01 16:57:19.158968 analyzeChr: Using the following theoretical cutoff for the F-statistics 5.31765507157871
## 2024-05-01 16:57:19.160242 calculatePvalues: identifying data segments
## 2024-05-01 16:57:19.164063 findRegions: segmenting information
## 2024-05-01 16:57:19.197655 findRegions: identifying candidate regions
## 2024-05-01 16:57:19.238041 findRegions: identifying region clusters
## 2024-05-01 16:57:19.411258 calculatePvalues: calculating F-statistics for permutation 1 and seed 20140924
## 2024-05-01 16:57:19.628587 findRegions: segmenting information
## 2024-05-01 16:57:19.662213 findRegions: identifying candidate regions
## 2024-05-01 16:57:19.707188 calculatePvalues: calculating F-statistics for permutation 2 and seed 20140925
## 2024-05-01 16:57:19.927014 findRegions: segmenting information
## 2024-05-01 16:57:19.960956 findRegions: identifying candidate regions
## 2024-05-01 16:57:20.006287 calculatePvalues: calculating F-statistics for permutation 3 and seed 20140926
## 2024-05-01 16:57:20.230045 findRegions: segmenting information
## 2024-05-01 16:57:20.264309 findRegions: identifying candidate regions
## 2024-05-01 16:57:20.311365 calculatePvalues: calculating F-statistics for permutation 4 and seed 20140927
## 2024-05-01 16:57:20.536042 findRegions: segmenting information
## 2024-05-01 16:57:20.570393 findRegions: identifying candidate regions
## 2024-05-01 16:57:20.618038 calculatePvalues: calculating F-statistics for permutation 5 and seed 20140928
## 2024-05-01 16:57:20.837637 findRegions: segmenting information
## 2024-05-01 16:57:20.871413 findRegions: identifying candidate regions
## 2024-05-01 16:57:20.916892 calculatePvalues: calculating F-statistics for permutation 6 and seed 20140929
## 2024-05-01 16:57:21.130592 findRegions: segmenting information
## 2024-05-01 16:57:21.16429 findRegions: identifying candidate regions
## 2024-05-01 16:57:21.209062 calculatePvalues: calculating F-statistics for permutation 7 and seed 20140930
## 2024-05-01 16:57:21.424117 findRegions: segmenting information
## 2024-05-01 16:57:21.457564 findRegions: identifying candidate regions
## 2024-05-01 16:57:21.50309 calculatePvalues: calculating F-statistics for permutation 8 and seed 20140931
## 2024-05-01 16:57:21.71891 findRegions: segmenting information
## 2024-05-01 16:57:21.752822 findRegions: identifying candidate regions
## 2024-05-01 16:57:21.798878 calculatePvalues: calculating F-statistics for permutation 9 and seed 20140932
## 2024-05-01 16:57:22.013441 findRegions: segmenting information
## 2024-05-01 16:57:22.047205 findRegions: identifying candidate regions
## 2024-05-01 16:57:22.093126 calculatePvalues: calculating F-statistics for permutation 10 and seed 20140933
## 2024-05-01 16:57:22.308704 findRegions: segmenting information
## 2024-05-01 16:57:22.342221 findRegions: identifying candidate regions
## 2024-05-01 16:57:22.3869 calculatePvalues: calculating F-statistics for permutation 11 and seed 20140934
## 2024-05-01 16:57:23.608615 findRegions: segmenting information
## 2024-05-01 16:57:23.642143 findRegions: identifying candidate regions
## 2024-05-01 16:57:23.687532 calculatePvalues: calculating F-statistics for permutation 12 and seed 20140935
## 2024-05-01 16:57:23.901225 findRegions: segmenting information
## 2024-05-01 16:57:23.937488 findRegions: identifying candidate regions
## 2024-05-01 16:57:23.983895 calculatePvalues: calculating F-statistics for permutation 13 and seed 20140936
## 2024-05-01 16:57:24.202515 findRegions: segmenting information
## 2024-05-01 16:57:24.236324 findRegions: identifying candidate regions
## 2024-05-01 16:57:24.282606 calculatePvalues: calculating F-statistics for permutation 14 and seed 20140937
## 2024-05-01 16:57:24.500204 findRegions: segmenting information
## 2024-05-01 16:57:24.534248 findRegions: identifying candidate regions
## 2024-05-01 16:57:24.580566 calculatePvalues: calculating F-statistics for permutation 15 and seed 20140938
## 2024-05-01 16:57:24.801562 findRegions: segmenting information
## 2024-05-01 16:57:24.835259 findRegions: identifying candidate regions
## 2024-05-01 16:57:24.880691 calculatePvalues: calculating F-statistics for permutation 16 and seed 20140939
## 2024-05-01 16:57:25.134828 findRegions: segmenting information
## 2024-05-01 16:57:25.169031 findRegions: identifying candidate regions
## 2024-05-01 16:57:25.217016 calculatePvalues: calculating F-statistics for permutation 17 and seed 20140940
## 2024-05-01 16:57:25.403301 findRegions: segmenting information
## 2024-05-01 16:57:25.437435 findRegions: identifying candidate regions
## 2024-05-01 16:57:25.484983 calculatePvalues: calculating F-statistics for permutation 18 and seed 20140941
## 2024-05-01 16:57:25.70591 findRegions: segmenting information
## 2024-05-01 16:57:25.73962 findRegions: identifying candidate regions
## 2024-05-01 16:57:25.784677 calculatePvalues: calculating F-statistics for permutation 19 and seed 20140942
## 2024-05-01 16:57:26.007187 findRegions: segmenting information
## 2024-05-01 16:57:26.040927 findRegions: identifying candidate regions
## 2024-05-01 16:57:26.08738 calculatePvalues: calculating F-statistics for permutation 20 and seed 20140943
## 2024-05-01 16:57:26.338484 findRegions: segmenting information
## 2024-05-01 16:57:26.372571 findRegions: identifying candidate regions
## 2024-05-01 16:57:26.434565 calculatePvalues: calculating the p-values
## 2024-05-01 16:57:26.519233 analyzeChr: Annotating regions
## No annotationPackage supplied. Trying org.Hs.eg.db.
## Getting TSS and TSE.
## Getting CSS and CSE.
## Getting exons.
## Annotating genes.
## .....
## user system elapsed
## 73.647 0.464 74.114
To speed up analyzeChr()
, you might need to use several cores via the mc.cores
argument. If memory is limiting, you might want to use a smaller chunksize
(default is 5 million). Note that if you use too many cores, you might hit the input/output ceiling of your data network and/or hard drives speed.
Before running with a large number of permutations we recommend exploring how long each permutation cycle takes using a single permutation.
Note that analyzing each chromosome with a large number of permutations and a rich data set can take several hours, so we recommend running one job running analyzeChr()
per chromosome, and then merging the results via mergeResults()
. This process is further described in the advanced derfinder vignette.
When using returnOutput = TRUE
, analyzeChr()
will return a list with the results to explore interactively. However, by default it writes the results to disk (one .Rdata file per result).
The following code explores the results.
## Explore
names(results)
## [1] "timeinfo" "optionsStats" "coveragePrep" "fstats" "regions" "annotation"
optionStats
stores the main options used in the analyzeChr()
call including the models used, the type of cutoff, number of permutations, seeds for the permutations. All this information can be useful to reproduce the analysis.
## Explore optionsStats
names(results$optionsStats)
## [1] "models" "cutoffPre" "scalefac" "chunksize" "cutoffFstat" "cutoffType"
## [7] "nPermute" "seeds" "groupInfo" "lowMemDir" "analyzeCall" "cutoffFstatUsed"
## [13] "smooth" "smoothFunction" "weights" "returnOutput"
## Call used
results$optionsStats$analyzeCall
## analyzeChr(chr = "chr21", coverageInfo = filteredCov$chr21, models = models,
## cutoffFstat = 0.05, nPermute = 20, seeds = 20140923 + seq_len(20),
## groupInfo = pheno$group, writeOutput = TRUE, returnOutput = TRUE)
coveragePrep
has the result from the preprocessCoverage()
step. This includes the genomic position index, the mean coverage (after scaling and the log2 transformation) for all the samples, and the group mean coverages. By default, the chunks are written to disk in optionsStats$lowMemDir
(chr21/chunksDir in this example) to help reduce the required memory resources. Otherwise it is stored in coveragePrep$coverageProcessed
.
## Explore coveragePrep
names(results$coveragePrep)
## [1] "coverageProcessed" "mclapplyIndex" "position" "meanCoverage" "groupMeans"
## Group means
results$coveragePrep$groupMeans
## $fetal
## numeric-Rle of length 130356 with 116452 runs
## Lengths: 1 1 1 1 2 1 ... 2 1 1 1 1
## Values : 0.401667 0.428333 0.435000 0.461667 0.471667 0.478333 ... 1.39500 1.38167 1.34000 1.33333 1.24833
##
## $adult
## numeric-Rle of length 130356 with 119226 runs
## Lengths: 1 1 1 1 1 1 ... 1 2 1 1 1
## Values : 0.406667 0.413333 0.430000 0.448333 0.485000 0.510000 ... 1.97500 1.91833 1.77667 1.73833 1.62667
The F-statistics are then stored in fstats
. These are calculated using calculateStats()
.
## Explore optionsStats
results$fstats
## numeric-Rle of length 130356 with 126807 runs
## Lengths: 1 1 1 1 1 ... 1 1 1 1
## Values : 0.01922610 0.02996937 0.02066332 0.02249996 0.01984328 ... 3.031370 2.653428 2.507611 2.324638
## Note that the length matches the number of bases used
identical(length(results$fstats), sum(results$coveragePrep$position))
## [1] TRUE
The candidate DERs and summary results from the permutations is then stored in regions
. This is the output from calculatePvalues()
which uses several underneath other functions including calculateStats()
and findRegions()
.
## Explore regions
names(results$regions)
## [1] "regions" "nullStats" "nullWidths" "nullPermutation"
For the null regions, the summary information is composed of the mean F-statistic for the null regions (regions$nullStats
), the width of the null regions (regions$nullWidths
), and the permutation number under which they were identified (regions$nullPermutation
).
## Permutation summary information
results$regions[2:4]
## $nullStats
## numeric-Rle of length 13994 with 13994 runs
## Lengths: 1 1 1 1 1 1 ... 1 1 1 1 1
## Values : 5.43461 5.71738 6.37821 6.33171 5.48965 7.05049 ... 6.12148 5.36584 5.35554 5.36614 5.62516
##
## $nullWidths
## integer-Rle of length 13994 with 12365 runs
## Lengths: 2 1 1 1 1 1 3 1 2 1 1 1 1 ... 1 1 1 1 1 1 1 1 2 1 1 4 1
## Values : 1 24 7 1 32 2 1 11 1 7 2 14 1 ... 2 14 1 10 15 1 3 45 4 28 6 1 2
##
## $nullPermutation
## integer-Rle of length 13994 with 20 runs
## Lengths: 246 350 574 554 396 462 482 1548 1522 462 1104 2076 70 320 746 114 1460 428 802 278
## Values : 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
The most important part of the output is the GRanges
object with the candidate DERs shown below.
## Candidate DERs
results$regions$regions
## GRanges object with 591 ranges and 14 metadata columns:
## seqnames ranges strand | value area indexStart indexEnd cluster clusterL meanCoverage
## <Rle> <IRanges> <Rle> | <numeric> <numeric> <integer> <integer> <Rle> <Rle> <numeric>
## up chr21 47610386-47610682 * | 11.10304 3297.60 122158 122454 138 933 1.597952
## up chr21 40196145-40196444 * | 10.06142 3018.43 76110 76409 71 1323 1.303508
## up chr21 27253616-27253948 * | 8.43488 2808.82 22019 22351 28 407 33.657858
## up chr21 22115534-22115894 * | 7.23645 2612.36 12274 12634 9 694 0.964464
## up chr21 22914853-22915064 * | 9.78066 2073.50 17318 17529 21 217 2.838978
## .. ... ... ... . ... ... ... ... ... ... ...
## up chr21 35889784 * | 5.31952 5.31952 60088 60088 51 742 2.75417
## up chr21 47610093 * | 5.31912 5.31912 121865 121865 138 933 1.45583
## up chr21 16333728 * | 5.31881 5.31881 5048 5048 1 9 1.19500
## up chr21 34001896 * | 5.31871 5.31871 32577 32577 38 1428 1.71250
## up chr21 34809571 * | 5.31801 5.31801 43694 43694 46 149 2.95000
## meanfetal meanadult log2FoldChangeadultvsfetal pvalues significant qvalues significantQval
## <numeric> <numeric> <numeric> <numeric> <factor> <numeric> <factor>
## up 0.82289 2.373013 1.527949 0.00278671 TRUE 0.738407 FALSE
## up 2.02532 0.581694 -1.799818 0.00378707 TRUE 0.738407 FALSE
## up 42.46704 24.848674 -0.773175 0.00464452 TRUE 0.738407 FALSE
## up 1.71906 0.209871 -3.034045 0.00535906 TRUE 0.738407 FALSE
## up 4.23593 1.442028 -1.554578 0.00793140 TRUE 0.738407 FALSE
## .. ... ... ... ... ... ... ...
## up 3.36000 2.14833 -0.6452433 0.997856 FALSE 0.974463 FALSE
## up 0.77500 2.13667 1.4630937 0.998285 FALSE 0.974463 FALSE
## up 1.23167 1.15833 -0.0885613 0.998714 FALSE 0.974463 FALSE
## up 2.33333 1.09167 -1.0958600 0.998714 FALSE 0.974463 FALSE
## up 2.87833 3.02167 0.0701108 0.999571 FALSE 0.974463 FALSE
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
The metadata columns are:
Note that for this type of analysis you might want to try a few coverage cutoffs and/or F-statistic cutoffs. One quick way to evaluate the results is to compare the width of the regions.
## Width of potential DERs
summary(width(results$regions$regions))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 4.00 17.98 17.00 361.00
sum(width(results$regions$regions) > 50)
## [1] 68
## Width of candidate DERs
sig <- as.logical(results$regions$regions$significant)
summary(width(results$regions$regions[sig]))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 65.0 81.5 97.0 127.8 122.0 361.0
sum(width(results$regions$regions[sig]) > 50)
## [1] 35
analyzeChr()
will find the nearest annotation feature using matchGenes()
from bumphunter (version >= 1.7.3). This information is useful considering that the candidate DERs were identified without relying on annotation. Yet at the end, we are interested to check if they are inside a known exon, upstream a gene, etc.
## Nearest annotation
head(results$annotation)
## name
## 1 LSS
## 2 ETS2
## 3 APP
## 4 LINC00320
## 5 NCAM2
## 6 LINC00320
## annotation
## 1 NM_001001438 NM_001145436 NM_001145437 NM_002340 NP_001001438 NP_001138908 NP_001138909 NP_002331
## 2 NM_001256295 NM_005239 NP_001243224 NP_005230 XM_005260935 XM_017028290 XM_054324370 XP_005260992 XP_016883779 XP_054180345
## 3 NM_000484 NM_001136016 NM_001136129 NM_001136130 NM_001136131 NM_001204301 NM_001204302 NM_001204303 NM_001385253 NM_201413 NM_201414 NP_000475 NP_001129488 NP_001129601 NP_001129602 NP_001129603 NP_001191230 NP_001191231 NP_001191232 NP_001372182 NP_958816 NP_958817
## 4 NR_024090 NR_109786 NR_109787 NR_109788
## 5 NM_001352591 NM_001352592 NM_001352593 NM_001352594 NM_001352595 NM_001352596 NM_001352597 NM_004540 NP_001339520 NP_001339521 NP_001339522 NP_001339523 NP_001339524 NP_001339525 NP_001339526 NP_004531 XM_011529575 XM_011529580 XM_011529581 XM_011529582 XM_011529585 XM_017028356 XM_024452081 XM_047440784 XM_047440785 XM_054324514 XM_054324515 XM_054324516 XM_054324517 XM_054324518 XM_054324519 XM_054324520 XM_054324521 XM_054324522 XM_054324523 XP_011527877 XP_011527882 XP_011527883 XP_011527884 XP_011527887 XP_016883845 XP_024307849 XP_047296740 XP_047296741 XP_054180489 XP_054180490 XP_054180491 XP_054180492 XP_054180493 XP_054180494 XP_054180495 XP_054180496 XP_054180497 XP_054180498
## 6 NR_024090 NR_109786 NR_109787 NR_109788
## description region distance subregion insideDistance exonnumber nexons UTR strand
## 1 inside exon inside 38056 inside exon 0 22 22 3'UTR -
## 2 inside exon inside 18390 inside exon 0 10 10 3'UTR +
## 3 inside exon inside 289498 inside exon 0 16 16 3'UTR -
## 4 inside exon inside 59532 inside exon 0 7 7 inside transcription region -
## 5 downstream downstream 544220 <NA> NA NA 15 <NA> +
## 6 inside exon inside 59315 inside exon 0 7 7 inside transcription region -
## geneL codingL Geneid subjectHits
## 1 39700 NA 4047 34356
## 2 19123 NA 2114 17340
## 3 290585 NA 351 31161
## 4 60513 NA 387486 32699
## 5 541884 NA 4685 37133
## 6 60513 NA 387486 32699
For more details on the output please check the bumphunter package.
Check the section about non-human data (specifically, using annotation different from hg19) on the advanced vignette.
The final piece is the wallclock time spent during each of the steps in analyzeChr()
. You can then make a plot with this information as shown in Figure 1.
## Time spent
results$timeinfo
## init setup prepData savePrep
## "2024-05-01 16:57:16 EDT" "2024-05-01 16:57:16 EDT" "2024-05-01 16:57:18 EDT" "2024-05-01 16:57:18 EDT"
## calculateStats saveStats saveStatsOpts calculatePvalues
## "2024-05-01 16:57:19 EDT" "2024-05-01 16:57:19 EDT" "2024-05-01 16:57:19 EDT" "2024-05-01 16:57:26 EDT"
## saveRegs annotate saveAnno
## "2024-05-01 16:57:26 EDT" "2024-05-01 16:58:30 EDT" "2024-05-01 16:58:30 EDT"
## Use this information to make a plot
timed <- diff(results$timeinfo)
timed.df <- data.frame(Seconds = as.numeric(timed), Step = factor(names(timed),
levels = rev(names(timed))
))
library("ggplot2")
ggplot(timed.df, aes(y = Step, x = Seconds)) +
geom_point()