This tools was designed to evaluate Copy number variations (CNVs) which have been detected from CNV detection pipelines from NGS data. CNV detection pipelines from WES and (to a lesser extent) WGS, lead to high numbers of false-positives.
The SARC package aims to flag high confidence and low confidence CNVs from such detection pipelines. This would aid in patient diagnostics and clinical work. Uniquely, the SARC package only requires a coverage
/cov
file and an cnv
file. Both files should ideally be loaded as data.frames
.
The cov
file can be created from many BAM files from WES/ WGS, and the read depth coverage of each sample should be normalized by library size of the sample prior to processing. The end of the vignette will point to how to perform read coverage reading and normalisation with examples.
The cnv
file is in many ways similar to a standard CNV BED
file - as it serves as a list of CNVs identified in the samples. Unlike a traditional BED
file, the cnv
file should contain additional columns such as CNV value, type of CNV, genes and batch of data. Additionally, the column names of the cnv
file should ideally include: SAMPLE, CHROM, START, END, TYPE, VALUE. The order of the columns is not necessary. Additional columns such as TOOL, BATCH and GENE can enhance plotting.
The cnv file should contain generic information at the minimum. Sample names (SAMPLE), Chromosome (“CHROM”), Start site (“START”), End sit (“END”), Deletion or Duplication (“TYPE”) and CNV value (“VALUE”) are required for the tool to function. Column names should be in all caps and match the names of the test_cnv example below.
START, END and VALUE should be integers!
data("test_cnv")
#For speed just use a few detected CNVs
test_cnv <- test_cnv[1:3,]
head(test_cnv) %>%
kable %>%
kable_styling("striped", full_width=FALSE) %>%
scroll_box(width = "800px", height = "200px")
SAMPLE | CHROM | START | END | TYPE | VALUE | TOOL |
---|---|---|---|---|---|---|
SampleA | chr1 | 161487763 | 161565521 | DUP | 3 | exomeDepth |
SampleB | chr1 | 17034387 | 17090909 | DEL | 0 | exomeDepth |
SampleB | chr1 | 17034387 | 17090909 | DEL | 0 | clinCNV |
The cov file should comprise of coverage from normalised BAM files. Additionally, it is good practice to separate cov files by the technology used to sequence the FASTQ files.
Importantly, there will be four columns before the samples - ID, Chromosome, Start and End. ID, Start and End should be integers. The rest of the column names will be the names of the samples - and these should match the samples names found in the cnv file.
data("test_cov")
head(test_cov[, 1:10]) %>%
kable %>%
kable_styling("striped", full_width=FALSE) %>%
scroll_box(width = "800px", height = "200px")
ID | CHROM | START | END | SampleA | SampleB | SampleC | SampleD | SampleE | SampleF |
---|---|---|---|---|---|---|---|---|---|
1 | chr1 | 13402 | 13639 | 21.96 | 23.40 | 24.82 | 20.91 | 13.83 | 20.21 |
2 | chr1 | 69088 | 69091 | 3.61 | 3.31 | 3.96 | 2.95 | 2.36 | 2.82 |
3 | chr1 | 70008 | 70010 | 1.93 | 2.69 | 2.42 | 2.58 | 2.36 | 2.31 |
4 | chr1 | 324438 | 325605 | 56.74 | 56.57 | 57.26 | 51.04 | 53.39 | 44.49 |
5 | chr1 | 721404 | 721912 | 5.62 | 10.10 | 9.99 | 6.51 | 7.41 | 5.26 |
6 | chr1 | 861393 | 861395 | 1.56 | 1.71 | 1.73 | 1.77 | 1.66 | 1.84 |
All samples in the cnv file should be found in the cov file. But you can have samples in the cov file that are not found in the cnv file.
The SARC package relies on RaggedExperiments
objects to store the cov
file as its assay and will store lists of dataframes
and Grange
objects within metadata
. The SARC
tool will generate more cnv
dataframes with additional stats for each CNV in the list of CNVs. These dataframes
will be stored within the first list stored in the metadata
.
For cohorts with multiple coverage files (from multiple sequencing platforms or meta-analyses studies), we recommend creating multiple RaggedExperiments
objects, rather than combining them all into one.
#Create a start site and end site for each CNV detected
SARC <- regionSet(cnv = test_cnv, test_cov)
#Create smaller coverage files for each CNV
SARC <- regionSplit(RE = SARC, #the raggedexpression object we made
cnv = metadata(SARC)[['CNVlist']][[1]],
startlist = metadata(SARC)[[2]], #list of start sites, per CNV
endlist = metadata(SARC)[[3]]) #list of end sites, per CNV
First we use mean scores to check if the mean reads at this region matches the CNV values from a CNV detection tool. This will make a new cnv file which will be stored in the RE object.
#Calculate the mean read coverage
SARC <- regionMean(RE = SARC,
cnv = test_cnv,
splitcov = metadata(SARC)[[4]]) #list of cnv specific coverage
We then calculate the quantile distributions. We assume a true duplication will be on the higher end of the distribution (in contrast to the other samples) and true deletions will be on the lower end. Depending on the number of samples in the COV file - the thresholds should be altered.
#Calculate the distribution of the mean reads
SARC <- regionQuantiles(RE = SARC,
cnv = metadata(SARC)[['CNVlist']][[2]], #new cnv file
meancov = metadata(SARC)[[5]], #list of cnv specific coverage + means
q1 = 0.1, #lower threshold
q2 = 0.9) #upper threshold
Anova can then be used to identify if a region with a suspected CNV has a significantly rare read depth at the region - in contrast to all other samples. The more samples, the more powerful this test is.
#Calculate rarity of each suspected CNV - in contrast to other samples
SARC <- prepAnova(RE = SARC,
start = metadata(SARC)[[2]],
end = metadata(SARC)[[3]],
cnv = metadata(SARC)[['CNVlist']][[3]]) #newest cnv dataframe
SARC <- anovaOnCNV(RE = SARC,
cnv = metadata(SARC)[['CNVlist']][[3]],
anovacov = metadata(SARC)[[8]])
head(metadata(SARC)[['CNVlist']][[4]]) %>%
kable %>%
kable_styling("striped", full_width=FALSE) %>%
scroll_box(width = "800px", height = "200px")
SAMPLE | CHROM | START | END | TYPE | VALUE | TOOL | MeanScore | Qlow | Qhigh | ANOVA |
---|---|---|---|---|---|---|---|---|---|---|
SampleA | chr1 | 161487763 | 161565521 | DUP | 3 | exomeDepth | 1.44 | 1 | 0.0927248 | |
SampleB | chr1 | 17034387 | 17090909 | DEL | 0 | exomeDepth | 0.64 | 0 | 0.0842323 | |
SampleB | chr1 | 17034387 | 17090909 | DEL | 0 | clinCNV | 0.64 | 0 | 0.0842323 |
A major complaint of CNV analysis by diagnostic staff is the over-reliance on the Integrative Genome Browser (IGV). While a great tool, it can be a tedious task to search many hundreds of samples manually. The SARC package provides an alternative (but not a complete substitute) to visualize which genes and exons are being (or not being) effected by a CNV. This is also a good way of visualizing the false-positives quickly, without using IGV.
#prepare new objects for the CNV plots to work
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(Homo.sapiens) #load genome specific libraries from BioConductor
TxDb(Homo.sapiens) <- TxDb.Hsapiens.UCSC.hg38.knownGene
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
tx <- transcriptsBy(Homo.sapiens, columns = "SYMBOL")
txgene <- tx@unlistData
SARC <- plotCovPrep(RE = SARC,
cnv = metadata(SARC)[['CNVlist']][[4]], #newest cnv dataframe
startlist = metadata(SARC)[[2]],
endlist = metadata(SARC)[[3]],
n1 = 0, #left-padding
n2 = 0) #right-padding
SARC <- regionGrangeMake(RE = SARC,
covprepped = metadata(SARC)[[9]])
SARC <- addExonsGenes(RE = SARC,
covgranges = metadata(SARC)[[10]], #list of grange objects - one for each detected CNV
txdb = txdb, #Species specific database
txgene = txgene) #genes/ exons from the database
SARC <- setupCNVplot(RE = SARC, namedgranges = metadata(SARC)[[11]], #grange objects which have genes/ exons added
covprepped = metadata(SARC)[[9]])
## [1] "All CNVs are fine for further evaluation"
If CNVs are very small and cannot to attributed to at least one known gene in the TxDB object they will be too difficult to process any further. A message will state which CNVs were not associated with any genes. These CNVs should be removed from the cnv file also as to not cause errors.
Each detected CNV can be plotted. Automated plotting can be done quite simply with a for-loop - and is recommended. The samples with the detected CNV is highlighted with a thin purple line. The Sample with the detected CNV, the type of CNV and the value of the CNV from the detection pipeline are pasted as the subtitle.
Some clinicians would rather the confidence level flagged rather than completely removed - so SARC will not remove any CNVs, and this is left to the user.
#Use statistical analyses to flag CNVs as high or low confidence of being true
SARC <- cnvConfidence(RE = SARC, cnv = metadata(SARC)[['CNVlist']][[4]])
head(metadata(SARC)[['CNVlist']][[5]]) %>%
kable %>%
kable_styling("striped", full_width=FALSE) %>%
scroll_box(width = "800px", height = "200px")
SAMPLE | CHROM | START | END | TYPE | VALUE | TOOL | MeanScore | Qlow | Qhigh | ANOVA | CNV.SCORE | CNV.TIER |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SampleA | chr1 | 161487763 | 161565521 | DUP | 3 | exomeDepth | 1.44 | 1 | 0.0927248 | 2 | CONFIDENT | |
SampleB | chr1 | 17034387 | 17090909 | DEL | 0 | exomeDepth | 0.64 | 0 | 0.0842323 | 0 | VERY UNCONFIDENT | |
SampleB | chr1 | 17034387 | 17090909 | DEL | 0 | clinCNV | 0.64 | 0 | 0.0842323 | 0 | VERY UNCONFIDENT |
If the RaggedExperiment Object is confusing it can be traversed as so.
#This will show all the dataframes (cnv) files made
#Generally it is recommended to use the most recently made cnv file to keep all the additional columns
print(SARC)
## class: RaggedExperiment
## dim: 34987 1
## assays(21): ID SampleA ... SampleS SampleT
## rownames: NULL
## colnames: NULL
## colData names(1): X
#This will show all the list objects. Their names should roughly correlate with the names of the parameters in the functions.
names(metadata(SARC))
## [1] "CNVlist" "REGIONSTART" "REGIONEND" "SPLITCOV"
## [5] "SPLITMEAN" "LOWERQAUNTILE" "UPPERQAUNTILE" "ANOVACOV"
## [9] "COVPREPPED" "COVGRANGES" "NAMEDGRANGES" "CNVPLOTLIST"
Plot distributions to see why some CNVs were false positives. The sample with the suspected CNV will be highlighted in red. In cases where many samples are present, plotly=TRUE may lead to a clearer visual. sample refers to the CNV - so can easily be looped.