This tools was designed to evaluate Copy number variations (CNVs) which have been detected from CNV detection pipelines from NGS data. CNV detection pipelines from WES and (to a lesser extent) WGS, lead to high numbers of false-positives.
The SARC package aims to flag high confidence and low confidence CNVs from such detection pipelines. This would aid in patient diagnostics and clinical work. Uniquely, the SARC package only requires a coverage
/cov
file and an cnv
file. Both files should ideally be loaded as data.frames
.
The cov
file can be created from many BAM files from WES/ WGS, and the read depth coverage of each sample should be normalized by library size of the sample prior to processing. The end of the vignette will point to how to perform read coverage reading and normalisation with examples.
The cnv
file is in many ways similar to a standard CNV BED
file - as it serves as a list of CNVs identified in the samples. Unlike a traditional BED
file, the cnv
file should contain additional columns such as CNV value, type of CNV, genes and batch of data. Additionally, the column names of the cnv
file should ideally include: SAMPLE, CHROM, START, END, TYPE, VALUE. The order of the columns is not necessary. Additional columns such as TOOL, BATCH and GENE can enhance plotting.
The cnv file should contain generic information at the minimum. Sample names (SAMPLE), Chromosome (“CHROM”), Start site (“START”), End sit (“END”), Deletion or Duplication (“TYPE”) and CNV value (“VALUE”) are required for the tool to function. Column names should be in all caps and match the names of the test_cnv example below.
START, END and VALUE should be integers!
data("test_cnv")
#For speed just use a few detected CNVs
test_cnv <- test_cnv[1:3,]
head(test_cnv) %>%
kable %>%
kable_styling("striped", full_width=FALSE) %>%
scroll_box(width = "800px", height = "200px")
SAMPLE | CHROM | START | END | TYPE | VALUE | TOOL |
---|---|---|---|---|---|---|
SampleA | chr1 | 161487763 | 161565521 | DUP | 3 | exomeDepth |
SampleB | chr1 | 17034387 | 17090909 | DEL | 0 | exomeDepth |
SampleB | chr1 | 17034387 | 17090909 | DEL | 0 | clinCNV |
The cov file should comprise of coverage from normalised BAM files. Additionally, it is good practice to separate cov files by the technology used to sequence the FASTQ files.
Importantly, there will be four columns before the samples - ID, Chromosome, Start and End. ID, Start and End should be integers. The rest of the column names will be the names of the samples - and these should match the samples names found in the cnv file.
data("test_cov")
head(test_cov[, 1:10]) %>%
kable %>%
kable_styling("striped", full_width=FALSE) %>%
scroll_box(width = "800px", height = "200px")
ID | CHROM | START | END | SampleA | SampleB | SampleC | SampleD | SampleE | SampleF |
---|---|---|---|---|---|---|---|---|---|
1 | chr1 | 13402 | 13639 | 21.96 | 23.40 | 24.82 | 20.91 | 13.83 | 20.21 |
2 | chr1 | 69088 | 69091 | 3.61 | 3.31 | 3.96 | 2.95 | 2.36 | 2.82 |
3 | chr1 | 70008 | 70010 | 1.93 | 2.69 | 2.42 | 2.58 | 2.36 | 2.31 |
4 | chr1 | 324438 | 325605 | 56.74 | 56.57 | 57.26 | 51.04 | 53.39 | 44.49 |
5 | chr1 | 721404 | 721912 | 5.62 | 10.10 | 9.99 | 6.51 | 7.41 | 5.26 |
6 | chr1 | 861393 | 861395 | 1.56 | 1.71 | 1.73 | 1.77 | 1.66 | 1.84 |
All samples in the cnv file should be found in the cov file. But you can have samples in the cov file that are not found in the cnv file.
The SARC package relies on RaggedExperiments
objects to store the cov
file as its assay and will store lists of dataframes
and Grange
objects within metadata
. The SARC
tool will generate more cnv
dataframes with additional stats for each CNV in the list of CNVs. These dataframes
will be stored within the first list stored in the metadata
.
For cohorts with multiple coverage files (from multiple sequencing platforms or meta-analyses studies), we recommend creating multiple RaggedExperiments
objects, rather than combining them all into one.
#Create a start site and end site for each CNV detected
SARC <- regionSet(cnv = test_cnv, test_cov)
#Create smaller coverage files for each CNV
SARC <- regionSplit(RE = SARC, #the raggedexpression object we made
cnv = metadata(SARC)[['CNVlist']][[1]],
startlist = metadata(SARC)[[2]], #list of start sites, per CNV
endlist = metadata(SARC)[[3]]) #list of end sites, per CNV
First we use mean scores to check if the mean reads at this region matches the CNV values from a CNV detection tool. This will make a new cnv file which will be stored in the RE object.
#Calculate the mean read coverage
SARC <- regionMean(RE = SARC,
cnv = test_cnv,
splitcov = metadata(SARC)[[4]]) #list of cnv specific coverage
We then calculate the quantile distributions. We assume a true duplication will be on the higher end of the distribution (in contrast to the other samples) and true deletions will be on the lower end. Depending on the number of samples in the COV file - the thresholds should be altered.
#Calculate the distribution of the mean reads
SARC <- regionQuantiles(RE = SARC,
cnv = metadata(SARC)[['CNVlist']][[2]], #new cnv file
meancov = metadata(SARC)[[5]], #list of cnv specific coverage + means
q1 = 0.1, #lower threshold
q2 = 0.9) #upper threshold
Anova can then be used to identify if a region with a suspected CNV has a significantly rare read depth at the region - in contrast to all other samples. The more samples, the more powerful this test is.
#Calculate rarity of each suspected CNV - in contrast to other samples
SARC <- prepAnova(RE = SARC,
start = metadata(SARC)[[2]],
end = metadata(SARC)[[3]],
cnv = metadata(SARC)[['CNVlist']][[3]]) #newest cnv dataframe
SARC <- anovaOnCNV(RE = SARC,
cnv = metadata(SARC)[['CNVlist']][[3]],
anovacov = metadata(SARC)[[8]])
head(metadata(SARC)[['CNVlist']][[4]]) %>%
kable %>%
kable_styling("striped", full_width=FALSE) %>%
scroll_box(width = "800px", height = "200px")
SAMPLE | CHROM | START | END | TYPE | VALUE | TOOL | MeanScore | Qlow | Qhigh | ANOVA |
---|---|---|---|---|---|---|---|---|---|---|
SampleA | chr1 | 161487763 | 161565521 | DUP | 3 | exomeDepth | 1.44 | 1 | 0.0927248 | |
SampleB | chr1 | 17034387 | 17090909 | DEL | 0 | exomeDepth | 0.64 | 0 | 0.0842323 | |
SampleB | chr1 | 17034387 | 17090909 | DEL | 0 | clinCNV | 0.64 | 0 | 0.0842323 |
A major complaint of CNV analysis by diagnostic staff is the over-reliance on the Integrative Genome Browser (IGV). While a great tool, it can be a tedious task to search many hundreds of samples manually. The SARC package provides an alternative (but not a complete substitute) to visualize which genes and exons are being (or not being) effected by a CNV. This is also a good way of visualizing the false-positives quickly, without using IGV.
#prepare new objects for the CNV plots to work
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(Homo.sapiens) #load genome specific libraries from BioConductor
TxDb(Homo.sapiens) <- TxDb.Hsapiens.UCSC.hg38.knownGene
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
tx <- transcriptsBy(Homo.sapiens, columns = "SYMBOL")
txgene <- tx@unlistData
SARC <- plotCovPrep(RE = SARC,
cnv = metadata(SARC)[['CNVlist']][[4]], #newest cnv dataframe
startlist = metadata(SARC)[[2]],
endlist = metadata(SARC)[[3]],
n1 = 0, #left-padding
n2 = 0) #right-padding
SARC <- regionGrangeMake(RE = SARC,
covprepped = metadata(SARC)[[9]])
SARC <- addExonsGenes(RE = SARC,
covgranges = metadata(SARC)[[10]], #list of grange objects - one for each detected CNV
txdb = txdb, #Species specific database
txgene = txgene) #genes/ exons from the database
SARC <- setupCNVplot(RE = SARC, namedgranges = metadata(SARC)[[11]], #grange objects which have genes/ exons added
covprepped = metadata(SARC)[[9]])
## [1] "All CNVs are fine for further evaluation"
If CNVs are very small and cannot to attributed to at least one known gene in the TxDB object they will be too difficult to process any further. A message will state which CNVs were not associated with any genes. These CNVs should be removed from the cnv file also as to not cause errors.
Each detected CNV can be plotted. Automated plotting can be done quite simply with a for-loop - and is recommended. The samples with the detected CNV is highlighted with a thin purple line. The Sample with the detected CNV, the type of CNV and the value of the CNV from the detection pipeline are pasted as the subtitle.
Some clinicians would rather the confidence level flagged rather than completely removed - so SARC will not remove any CNVs, and this is left to the user.
#Use statistical analyses to flag CNVs as high or low confidence of being true
SARC <- cnvConfidence(RE = SARC, cnv = metadata(SARC)[['CNVlist']][[4]])
head(metadata(SARC)[['CNVlist']][[5]]) %>%
kable %>%
kable_styling("striped", full_width=FALSE) %>%
scroll_box(width = "800px", height = "200px")
SAMPLE | CHROM | START | END | TYPE | VALUE | TOOL | MeanScore | Qlow | Qhigh | ANOVA | CNV.SCORE | CNV.TIER |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SampleA | chr1 | 161487763 | 161565521 | DUP | 3 | exomeDepth | 1.44 | 1 | 0.0927248 | 2 | CONFIDENT | |
SampleB | chr1 | 17034387 | 17090909 | DEL | 0 | exomeDepth | 0.64 | 0 | 0.0842323 | 0 | VERY UNCONFIDENT | |
SampleB | chr1 | 17034387 | 17090909 | DEL | 0 | clinCNV | 0.64 | 0 | 0.0842323 | 0 | VERY UNCONFIDENT |
If the RaggedExperiment Object is confusing it can be traversed as so.
#This will show all the dataframes (cnv) files made
#Generally it is recommended to use the most recently made cnv file to keep all the additional columns
print(SARC)
## class: RaggedExperiment
## dim: 34987 1
## assays(21): ID SampleA ... SampleS SampleT
## rownames: NULL
## colnames: NULL
## colData names(1): X
#This will show all the list objects. Their names should roughly correlate with the names of the parameters in the functions.
names(metadata(SARC))
## [1] "CNVlist" "REGIONSTART" "REGIONEND" "SPLITCOV"
## [5] "SPLITMEAN" "LOWERQAUNTILE" "UPPERQAUNTILE" "ANOVACOV"
## [9] "COVPREPPED" "COVGRANGES" "NAMEDGRANGES" "CNVPLOTLIST"
Plot distributions to see why some CNVs were false positives. The sample with the suspected CNV will be highlighted in red. In cases where many samples are present, plotly=TRUE may lead to a clearer visual. sample refers to the CNV - so can easily be looped.
SARC <- setQDplot(RE = SARC, meancov = metadata(SARC)[[5]])
seeDist(meanList = metadata(SARC)[[13]],
cnv = metadata(SARC)[['CNVlist']][[5]],
sample = 1,
plotly=FALSE)
Add all genes and exons the CNVs effect. This could be useful to identify if the variant contributes to a patients symptoms.
SARC <- pasteExonsGenes(RE = SARC,
setup = metadata(SARC)[[12]], #list of dataframes from the setupCNVplot function
cnv = metadata(SARC)[['CNVlist']][[5]]) #cnv file to add an extra column to
metadata(SARC)[['CNVlist']][[6]] %>%
kable %>%
kable_styling("striped", full_width=FALSE) %>%
scroll_box(width = "800px", height = "200px")
SAMPLE | CHROM | START | END | TYPE | VALUE | TOOL | MeanScore | Qlow | Qhigh | ANOVA | CNV.SCORE | CNV.TIER | Gene.Exon |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SampleA | chr1 | 161487763 | 161565521 | DUP | 3 | exomeDepth | 1.44 | 1 | 0.0927248 | 2 | CONFIDENT | FCGR2A.1|FCGR2A.6|FCGR2A.7|FCGR3A.1 | |
SampleB | chr1 | 17034387 | 17090909 | DEL | 0 | exomeDepth | 0.64 | 0 | 0.0842323 | 0 | VERY UNCONFIDENT | SDHB.3|SDHB.1|PADI2.9|PADI2.8|PADI2.7|PADI2.6 | |
SampleB | chr1 | 17034387 | 17090909 | DEL | 0 | clinCNV | 0.64 | 0 | 0.0842323 | 0 | VERY UNCONFIDENT | SDHB.3|SDHB.1|PADI2.9|PADI2.8|PADI2.7|PADI2.6 |
A more powerful test is the Dunnet test. This contrasts the read-depths between the control samples (samples with suspected CNVs) and the test samples (samples without suspected CNVs) at the same region of the DNA. However this is very slow - and it is only recommended when there are few samples (<100) to test.
DNA <- phDunnetonCNV(RE = DNA,
cnv = metadata(SARC)[['CNVlist']][[4]],
anovacov = metadata(SARC)[[8]])
SARC <- cnvConfidence(MA = SARC, cnv = experiments(SARC)[[7]], ph = TRUE)
plotCNV can also hone into a single GENE of interest, so long as it matches the genes found via TxDB, can be made specific for a BATCH of WES/ WGS data, or sites where the DNA was sequenced, and be logged. For WES data, logging is rerecorded as across a short region of DNA, the reads can change greatly, based on the chemistry of the sequencer. For deletions, log 10 is helpful, for duplication’s it can actually make them harder to see. Number of samples can also be adjusted for plotting.
The plots can be automated easily in a loop. Trick - make one plot in Rstudio, keep it in the plotting pane, and adjust the height and width. All other plots made in a loop with ggsave will be made to the same height and width. Useful when making plots for papers.
data("test_cnv2")
library(ggplot2)
for (i in seq_len(nrow(test_cnv2))) {
n <- paste0(cnv$SAMPLE[i], "_",
cnv$GENE[i], "_",
cnv$CHROM[i], "_",
cnv$START[i], "_",
cnv$END[i])
savename <- paste0(n, ".jpeg")
print(i)
plotCNV(cnv = cnv, setup = metadata(X)[[9]], FilteredCNV = i,
batch = cnv$BATCH[i], gene = cnv$GENE[i])
ggplot2::ggsave(filename = savename)
}
Though this tool is designed for clinical work, it can be further applied for research projects which focus on other species. Simply change the txdb and species database. For example, for mouse based projects, follow the following steps.
## Loading required package: TxDb.Mmusculus.UCSC.mm10.knownGene
## Loading required package: Mus.musculus
## Loading required package: org.Mm.eg.db
##
#prepare new objects for the CNV plots to work
TxDb(Mus.musculus) <- TxDb.Mmusculus.UCSC.mm10.knownGene
txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
tx <- transcriptsBy(Mus.musculus, columns = "SYMBOL")
## 'select()' returned 1:1 mapping between keys and columns
Using bioconductor package GenomicAlignments
users can read the coverage of bam files. Here we give some examples on how to read bam files into R. Further detail is available in this page.
require(GenomicAlignments)
bam <- system.file("extdata", "test.bam", package="SARC")
coverage <- coverage(bam)
coverage$chr1@values
## [1] 0 1 0 1 2 3 2 1 0 1 2 3 2 1 0 1 2 1 0
It is highly recommended to normalize read depth by total library size of the sequencing file for more accurate analysis. Calculating RPM (reads per million) is a straight forward method for genomic data. Here is an example using raw read depth and total library sizes.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] GenomicAlignments_1.42.0
## [2] Rsamtools_2.22.0
## [3] Biostrings_2.74.0
## [4] XVector_0.46.0
## [5] SummarizedExperiment_1.36.0
## [6] MatrixGenerics_1.18.0
## [7] matrixStats_1.4.1
## [8] Mus.musculus_1.3.1
## [9] org.Mm.eg.db_3.20.0
## [10] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0
## [11] Homo.sapiens_1.3.1
## [12] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [13] org.Hs.eg.db_3.20.0
## [14] GO.db_3.20.0
## [15] OrganismDbi_1.48.0
## [16] TxDb.Hsapiens.UCSC.hg38.knownGene_3.20.0
## [17] GenomicFeatures_1.58.0
## [18] AnnotationDbi_1.68.0
## [19] Biobase_2.66.0
## [20] kableExtra_1.4.0
## [21] knitr_1.48
## [22] SARC_1.4.0
## [23] RaggedExperiment_1.30.0
## [24] GenomicRanges_1.58.0
## [25] GenomeInfoDb_1.42.0
## [26] IRanges_2.40.0
## [27] S4Vectors_0.44.0
## [28] BiocGenerics_0.52.0
##
## loaded via a namespace (and not attached):
## [1] splines_4.4.1 BiocIO_1.16.0 bitops_1.0-9
## [4] filelock_1.0.3 tibble_3.2.1 cellranger_1.1.0
## [7] graph_1.84.0 XML_3.99-0.17 lifecycle_1.0.4
## [10] httr2_1.0.5 Rdpack_2.6.1 lattice_0.22-6
## [13] MASS_7.3-61 magrittr_2.0.3 plotly_4.10.4
## [16] sass_0.4.9 rmarkdown_2.28 jquerylib_0.1.4
## [19] yaml_2.3.10 plotrix_3.8-4 qqconf_1.3.2
## [22] sn_2.1.1 gld_2.6.6 DBI_1.2.3
## [25] RColorBrewer_1.1-3 multcomp_1.4-26 abind_1.4-8
## [28] zlibbioc_1.52.0 expm_1.0-0 purrr_1.0.2
## [31] RCurl_1.98-1.16 TH.data_1.1-2 rappdirs_0.3.3
## [34] sandwich_3.1-1 GenomeInfoDbData_1.2.13 TFisher_0.2.0
## [37] svglite_2.1.3 codetools_0.2-20 DelayedArray_0.32.0
## [40] xml2_1.3.6 tidyselect_1.2.1 farver_2.1.2
## [43] UCSC.utils_1.2.0 BiocFileCache_2.14.0 mathjaxr_1.6-0
## [46] jsonlite_1.8.9 multtest_2.62.0 e1071_1.7-16
## [49] survival_3.7-0 systemfonts_1.1.0 tools_4.4.1
## [52] progress_1.2.3 DescTools_0.99.57 Rcpp_1.0.13
## [55] glue_1.8.0 BiocBaseUtils_1.8.0 mnormt_2.1.1
## [58] gridExtra_2.3 SparseArray_1.6.0 xfun_0.48
## [61] metap_1.11 dplyr_1.1.4 withr_3.0.2
## [64] numDeriv_2016.8-1.1 BiocManager_1.30.25 fastmap_1.2.0
## [67] boot_1.3-31 fansi_1.0.6 digest_0.6.37
## [70] R6_2.5.1 colorspace_2.1-1 biomaRt_2.62.0
## [73] RSQLite_2.3.7 utf8_1.2.4 tidyr_1.3.1
## [76] generics_0.1.3 data.table_1.16.2 rtracklayer_1.66.0
## [79] class_7.3-22 prettyunits_1.2.0 httr_1.4.7
## [82] htmlwidgets_1.6.4 S4Arrays_1.6.0 pkgconfig_2.0.3
## [85] gtable_0.3.6 Exact_3.3 blob_1.2.4
## [88] htmltools_0.5.8.1 RBGL_1.82.0 plyranges_1.26.0
## [91] scales_1.3.0 tidyverse_2.0.0 lmom_3.2
## [94] png_0.1-8 rstudioapi_0.17.1 reshape2_1.4.4
## [97] rjson_0.2.23 curl_5.2.3 proxy_0.4-27
## [100] cachem_1.1.0 zoo_1.8-12 stringr_1.5.1
## [103] rootSolve_1.8.2.4 parallel_4.4.1 restfulr_0.0.15
## [106] pillar_1.9.0 grid_4.4.1 vctrs_0.6.5
## [109] dbplyr_2.5.0 evaluate_1.0.1 mvtnorm_1.3-1
## [112] cli_3.6.3 compiler_4.4.1 rlang_1.1.4
## [115] crayon_1.5.3 mutoss_0.1-13 labeling_0.4.3
## [118] plyr_1.8.9 stringi_1.8.4 viridisLite_0.4.2
## [121] BiocParallel_1.40.0 txdbmaker_1.2.0 munsell_0.5.1
## [124] lazyeval_0.2.2 Matrix_1.7-1 hms_1.1.3
## [127] bit64_4.5.2 ggplot2_3.5.1 KEGGREST_1.46.0
## [130] highr_0.11 rbibutils_2.3 memoise_2.0.1
## [133] bslib_0.8.0 bit_4.5.0 readxl_1.4.3