# BioQC-benchmark: Testing Efficiency, Sensitivity and Specificity of BioQC on simulated and real-world data

#### 2022-04-26

Supplementary Information for “Detect issue heterogenity in gene expression data with BioQC” (Jitao David Zhang, Klas Hatje, Gregor Sturm, Clemens Broger, Martin Ebeling, Martine Burtin, Fabiola Terzi, Silvia Ines Pomposiello and Laura Badi)

In this vignette, we perform simulations with both model-generated and real-world data using BioQC. We show that BioQC is a sensitive method to detect tissue heterogeneity from high-throughput gene expression data. The source code used to produce this document can be found in the github repository BioQC-example.

BioQC is a R/Bioconductor package to detect tissue heterogeneity from high-throughput gene expression profiling data. It implements an efficient Wilcoxon-Mann-Whitney test, and offers tissue-specific gene signatures that are ready to use ‘out of the box’.

## Experiment setup

In this document, we perform two simulation studies with BioQC:

• Sensitivity benchmark tests the sensitivity and specificity of BioQC detecting tissue heterogeneity using model-generated, simulated data;
• Mixing benchmark tests the sensitivity and specificity of BioQC using simulated contamination with real-world data.

All source code that is needed to reproduce the results can be found in the .Rmd file generating this document.

library(BioQC)
library(hgu133plus2.db) ## to simulate an microarray expression dataset
library(lattice)
library(latticeExtra)
library(gridExtra)
library(gplots)

pdf.options(family="ArialMT", useDingbats=FALSE)

set.seed(1887)

## list human genes
humanGenes <- unique(na.omit(unlist(as.list(hgu133plus2SYMBOL))))

gmtFile <- system.file("extdata/exp.tissuemark.affy.roche.symbols.gmt",
package="BioQC")
tissueInds <- sapply(gmt, function(x) match(x\$genes, humanGenes))

## Sensitivity Benchmark

We next asked the question how sensitive BioQC is to expression changes of tissue signature genes. Similar to the previous simulation, we create random expression matrices. While keeping all other genes $$i.i.d$$ normally distributed following $$\mathcal{N}(0,1)$$, we dedicatedly increase the expression of genes in one randomly selected tissue signature (ovary, with 43 genes) by different amplitudes: these genes’ expression levels are randomly drawn from different normal distributions with varying expectation and constant variance between $$\mathcal{N}(0,1)$$ and $$\mathcal{N}(3,1)$$. To test the robustness of the algorithm, 10 samples are generated for each mean expression difference value.

The above figure visualizes the distribution of enrichment scores and their ranks dependent on the mean expression difference between ovary signature genes and background genes. As soon as the expression of signature genes increases by a very moderate ampltiude $$1\sigma$$, BioQC will identify the gene set as the highest-ranking signature. A even stronger difference in expression will lead to higher enrichment scores but no change in the rank.

The results suggest that BioQC is sensitive even to moderate changes in the average expression of a gene set.

## Mixing Benchmark

The sensitivity benchmark above suffers from the limitation that the distributions of gene expression are not physiological. To overcome this, we designed and performed a benchmark by in silico mixing expression profiles with weighted linear combination, thereby mimicking tissue contamination.

Given the expression profile of a sample of tissue A ($$\mathbf{Y}_A$$), and that of a sample of tissue B ($$\mathbf{Y}_B$$), the weighted linear mixing produces a new profile $$\mathbf{Y}=\omega\mathbf{Y_A}+(1-\omega)\mathbf{Y_B}$$, where $$\omega \in [0,1]$$. In essence the profiles of two tissue types are linearly mixed in different proportions, which simulates varying severities of contaminations. We asked whether BioQC could detect such mixings, and if so, how sensitive is the method.

### Dataset selection and quality control

In order to avoid over-fitting of signatures derived from human expression data, we decided to use a normal tissue expression dataset from a non-human mammal species, because it has been shown that tissue-preferential expression patterns tend to be conserved between mammal species. We identified a dataset of Canis lupus familiaris (dog), which is publicly available in Gene Expression Omnibus (GDS4164).

In this study, the authors examined 39 samples from 10 pathologically normal tissues (liver, kidney, heart, lung, brain, lymph node, spleen, jejunum, pancreas, and skeletal muscle) of four dogs (with one pancreas sample missing). We downloaded the data, and performed minimal pre-processing: for multiple probesets that map to same genes, we kept the one with the highest average expression level and removed the rest. The resulting dataset contained expression of 16797 genes. BioQC was applied to the dataset to test whether there are major contamination issues. The results, including tissues reported by the authors, and the BioQC tissue signatures with the highest and second-highest scores, are reported in the following table:

Quality control of the mixing benchmark input data with BioQC. Four columns (f.l.t.r.): sample index; tissue reported by the authors; the tissue signature with the highest enrichment score reported by BioQC; the tissue signature with the second- highest enrichment score.
Label BioQC.best BioQC.second
GSM502573 Brain Spinal_cord Nodose_nucleus
GSM502574 Brain Brain_Cortex_prefrontal Brain_Amygdala
GSM502575 Brain Brain_Cortex_prefrontal Brain_Amygdala
GSM502576 Brain Brain_Cortex_prefrontal Brain_Amygdala
GSM502577 Heart Muscle_cardiac Muscle_skeletal
GSM502578 Heart Muscle_cardiac Muscle_skeletal
GSM502579 Heart Muscle_cardiac Muscle_skeletal
GSM502580 Heart Muscle_cardiac Muscle_skeletal
GSM502581 Jejunum Intestine_small Intestine_Colon
GSM502582 Jejunum Intestine_small Intestine_Colon
GSM502583 Jejunum Intestine_small Intestine_Colon
GSM502584 Jejunum Intestine_small Intestine_Colon
GSM502585 Kidney Kidney Kidney_Renal_Cortex
GSM502586 Kidney Kidney Kidney_Renal_Cortex
GSM502587 Kidney Kidney Kidney_Renal_Cortex
GSM502588 Kidney Kidney Kidney_Renal_Cortex
GSM502589 Liver Liver Liver
GSM502590 Liver Liver Liver
GSM502591 Liver Liver Liver
GSM502592 Liver Liver Liver
GSM502593 Lung Lung Monocytes
GSM502594 Lung Monocytes Lung
GSM502595 Lung Lung Monocytes
GSM502596 Lung Monocytes Lung
GSM502597 LymphNode Lymphocyte_B_FOLL Lymphocytes_T_H
GSM502598 LymphNode Lymphocyte_B_FOLL Lymphocytes_T_H
GSM502599 LymphNode Lymphocyte_B_FOLL Lymphocytes_T_H
GSM502600 LymphNode Lymphocyte_B_FOLL Lymphocytes_T_H
GSM502601 Pancreas Pancreas_Islets Pancreas
GSM502602 Pancreas Pancreas_Islets Pancreas
GSM502603 Pancreas Pancreas_Islets Pancreas
GSM502604 SkeletalMuscle Muscle_skeletal Muscle_cardiac
GSM502605 SkeletalMuscle Muscle_skeletal Muscle_cardiac
GSM502606 SkeletalMuscle Muscle_skeletal Muscle_cardiac
GSM502607 SkeletalMuscle Muscle_skeletal Muscle_cardiac
GSM502608 Spleen Monocytes Lymphocyte_B_FOLL
GSM502609 Spleen Monocytes Lymphocyte_B_FOLL
GSM502610 Spleen Monocytes Erythroid_cells
GSM502611 Spleen Monocytes Myeloblast

By comparing the tissue labels provided by the authors and the predictions of BioQC, we notice that in most cases the two match well (despite of ontological differences). In three cases (sample ID GSM502573, GSM502594, and GSM502596) though, there seem to be intriguing differences, which might be explained by different sampling procedures or immune cell infiltration. We will however in this vignette not further explore them. These three samples are removed from the simulation procedures.

### An example of weighted mixing: heart and jejunum

As an example, we take average expression of heart and jejunum samples, and mix them by different compositions. This allows us comparing enrichment scores and their ranks when the expression profiles of heart and jejunum are mixed in silico:

We observe that with as little as 5% contamination of heart tissue in jejunum samples (rightmost in the right panel), the rank of heart signature jumps from 34 to 9; 10% and 20% contamination will further enhance the rank to 4 and 3 respectively. If we start from the other end, namely assuming jejunum contamination in heart samples, the BioQC algorithms ranks jejunum the 7th only when there are more than 25% contamination. If we set enrichment score equal or over 3 as the threshold of calling a suspected contamination event ($$p<0.001$$ in the one-sided Wilcoxon-Mann-Whitney test), it takes about 10% heart in jejunum tissue or about 30% jejunum tissue in heart to make a call. It means the sensitivity of contamination detection is not symmetric between tissues: contamination by tissues with distinct expression patterns (such as heart) are easier to be detected than contamination by tissues with less distinct expression patterns (such as small intestine).

While it is difficult to quantify the absolute sensitivity of contamination detection, it is apparent that if the enrichment score of a unforeseen tissue is very high (or ranked high), one may suspect potential contamination. Also, if there are replicates of samples from the same tissue, a higher value in one sample compared with the other samples suggests a contamination or infiltration incident.

### Pairwise Mixing

Following the heart-jejunum example, we performed all 45 pairwise mixing experiments, producing weighted linear combinations of gene expression profiles of each pair of tissues (excluding self-mixing). The results are summaried in a heatmap:

The heatmap visualization summarizes the detection limit of contamination of each pair of tissues. Take the cell in row 1 column 2 from top left: its value (0.15) means that if there are 15% or more contamination by heart in the brain sample, BioQC will be able to detect it (with the threshold enrichment score $$\geqslant3$$ or the rank $$\leqslant10$$), because the enrichment score is equal to or larger than 3, or the heart tissue signature ranks in the top 10 of all tissue signatures.

Take another cell in row 2 column 1 from top left: its value (0.5) means that if there are 50% or more contanmination by brain in a heart sample, BioQC will be able to detect it. Here we observe the asymmetry again that we observed before with the heart/jejenum example: while it is realtively easy to identify heart contamination of a brain sample, it is more difficult to identify brain contamination of a heart sample in this dataset.

The average detection limits of tissues as contamination sources are listed in the following table. The values are derived from median values of each column in the heatmap, except for diagonal and missing elements.

Median lower detection limites of tissues as contamination sources.
Tissue MedianDetectionLimit
Brain 52.22%
Heart 10.62%
Jejunum 19.44%
Kidney 23.75%
Liver 11.87%
Lung 31.67%
LymphNode 30.63%
Pancreas 17.22%
SkeletalMuscle 15.00%
Spleen 13.57%

Interestingly, brain samples are a special case: if they contaminate other tissues, it is more difficult to identify (but not other way around). It can be partially explained by the experiment design: Briggs et al. profiled the whole cerebrum, whereas in BioQC there are 22 distinct gene sets assigned to distinct brain regions. Though the prefrontal cortex signature scored highest in the canine brain samples, its score is relative low (7.45), and the genes in the signature are not too far away from the background genes:

Therefore only a strong contamination by brain in this dataset will be detected by the given threshold. We expect that if prefrontal cortex instead of cerebrum sample was profiled, the mixing profile of brain will be similar to other organs. This needs to be tested in other datasets.

Apart from that, most in silico contamination events are detectable in this dataset, with median detection limit around 0.2. This suggests that BioQC is sensitive towards tissue heterogeneity in physiological settings.

# Conclusions

Benchmark studies with similated and real-world data demonstrate that BioQC is an efficient and sensitive method to detect tissue heterogeneity from high-throughput gene expression data.

# Appendix

### Comparing BioQC with Principal Component Analysis (PCA)

In the context of the dog transcriptome dataset, we can compare the results of principal component analysis (PCA) with that of BioQC:

PCA sugggests that samples from each tissue tend to cluster together, in line with the BioQC results. In addition, PCA reveals that tissues with cells of similar origins cluster together, such as skeletal muscle and heart. As expected, one brain sample and two lung samples seem to be different from other samples of the same cluster, which are consistent with the BioQC findings; however, unlike BioQC, PCA does not provide information on what are potential contamination/infiltration casues.

Therefore, we believe BioQC complements existing unsupervised methods to inspect quality of gene expression data.

## R Session Info

sessionInfo()
## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_GB              LC_COLLATE=C
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods
## [8] base
##
## other attached packages:
##  [1] hgu133plus2.db_3.13.0 rbenchmark_1.0.0      gplots_3.1.3
##  [4] gridExtra_2.3         latticeExtra_0.6-29   lattice_0.20-45
##  [7] org.Hs.eg.db_3.15.0   AnnotationDbi_1.58.0  IRanges_2.30.0
## [10] S4Vectors_0.34.0      testthat_3.1.4        limma_3.52.0
## [13] RColorBrewer_1.1-3    BioQC_1.24.0          Biobase_2.56.0
## [16] BiocGenerics_0.42.0   knitr_1.38
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.8.3           locfit_1.5-9.5         png_0.1-7
##  [4] Biostrings_2.64.0      gtools_3.9.2           rprojroot_2.0.3
##  [7] digest_0.6.29          R6_2.5.1               GenomeInfoDb_1.32.0
## [10] RSQLite_2.2.12         evaluate_0.15          httr_1.4.2
## [13] highr_0.9              zlibbioc_1.42.0        rlang_1.0.2
## [16] jquerylib_0.1.4        blob_1.2.3             rmarkdown_2.14
## [19] desc_1.4.1             stringr_1.4.0          RCurl_1.98-1.6
## [22] bit_4.0.4              compiler_4.2.0         xfun_0.30
## [25] pkgconfig_2.0.3        htmltools_0.5.2        KEGGREST_1.36.0
## [28] GenomeInfoDbData_1.2.8 edgeR_3.38.0           crayon_1.5.1
## [31] withr_2.5.0            bitops_1.0-7           brio_1.1.3
## [34] grid_4.2.0             jsonlite_1.8.0         gtable_0.3.0
## [37] DBI_1.1.2              magrittr_2.0.3         KernSmooth_2.23-20
## [40] cli_3.3.0              stringi_1.7.6          cachem_1.0.6
## [43] XVector_0.36.0         bslib_0.3.1            vctrs_0.4.1
## [46] tools_4.2.0            bit64_4.0.5            jpeg_0.1-9
## [52] caTools_1.18.2         memoise_2.0.1          sass_0.4.1