1 Installation

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("standR")

The development version of standR can be installed from GitHub:

devtools::install_github("DavisLaboratory/standR")

2 Quick start

library(standR)
library(SpatialExperiment)
library(limma)
library(ExperimentHub)

2.1 Load data for this guide

This is the background for the data:

NanoString GeoMx DSP dataset of diabetic kidney disease (DKD) vs healthy kidney tissue. Seven slides were analyzed, 4 DKD and 3 healthy. Regions of Interest (ROI) were focused two different parts of a kidney’s structure: tubules or glomeruli. Individual glomeruli were identified by a pathologist as either relatively healthy or diseased regardless if the tissue was DKD or healthy. Tubule ROIs were segmented into distal (PanCK) and proximal (neg) tubules. While both distal and proximal tubules are called tubules, they perform very different functions in the kidney.

eh <- ExperimentHub()

query(eh, "standR")
## ExperimentHub with 3 records
## # snapshotDate(): 2024-03-11
## # $dataprovider: Nanostring
## # $species: NA
## # $rdataclass: data.frame
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["EH7364"]]' 
## 
##            title                   
##   EH7364 | GeomxDKDdata_count      
##   EH7365 | GeomxDKDdata_sampleAnno 
##   EH7366 | GeomxDKDdata_featureAnno
countFile <- eh[["EH7364"]]
sampleAnnoFile <- eh[["EH7365"]]
featureAnnoFile <- eh[["EH7366"]]

spe <- readGeoMx(countFile, sampleAnnoFile, featureAnnoFile = featureAnnoFile, rmNegProbe = TRUE)

2.2 QC

2.2.1 metadata visualization

Based on the description of the data, we know that all glomerulus are classified as abnormal and healthy, and tubule are classified as neg and PanCK.

We therefore merge the region-related annotations to avoid collinearity, which can affect the process of batch correction.

colData(spe)$regions <- paste0(colData(spe)$region,"_",colData(spe)$SegmentLabel) |> 
  (\(.) gsub("_Geometric Segment","",.))() |>
  paste0("_",colData(spe)$pathology) |>
  (\(.) gsub("_NA","_ns",.))()

library(ggalluvial)

plotSampleInfo(spe, column2plot = c("SlideName","disease_status","regions"))

2.2.2 Gene level QC

spe <- addPerROIQC(spe, rm_genes = TRUE)
plotGeneQC(spe, ordannots = "regions", col = regions, point_size = 2)

Using the plotGeneQC function, we can have a look at which were the genes removed and the overall distribution of percentage of non-expressed genes in all ROIs. By default, top 9 genes are plotted here (arranging by mean expression), user can increase the number of plotted genes by changing the top_n parameter.

In this case we don’t see any specific biological pattern for the samples expressing this genes (figure above).

2.2.3 ROI level QC

In the ROI level QC, we first aim to identify (if any) ROI(s) that have relatively low library size and low cell count because they are considered as low quality samples due to insufficient sequencing depth or lack of RNA in the chosen region.

In this case, looking at the distribution plots of library size and nuclei count, we don’t see any particular spike in the low ends, rather the distributions are relatively smooth. Looking at the dot plot, library sizes are mostly positively correlate with the nuclei count, with some data have relatively low library size while the nuclei count is reasonable. We therefore can try to draw an filtering threshold at the low end of the library size, in this case 50000. By coloring the dot with their slide names, we find that the ROIs below the threshold are all from slide disease1B, suggesting the reason for this might be some technical issues of slide disease1B.

plotROIQC(spe, y_threshold = 50000, col = SlideName)

Since library size of 50000 seems to be a reasonable threshold, here we subset the spatial experiment object based on the library size in colData.

spe <- spe[,rownames(colData(spe))[colData(spe)$lib_size > 50000]]

2.3 Inspection of variations on ROI level

2.3.1 RLE

Here we can see obvious variation from slides to slides, and small variations are also observed within each slide.

plotRLExpr(spe, ordannots = "SlideName", assay = 2, col = SlideName)

2.3.2 PCA

Here we color the PCA with slide information, and shape by regions (tissue). We can see that PC1 is mainly spread out by regions, especially glomerulus and tubule. And grouping based on slide within each tissue are observed. The subtypes in tubule are clearly separated, but different subtypes of glomerulus is still grouping together. Moreover, diseased tissues and control tissues are mixed as well (disease slides and normal slides).

drawPCA(spe, assay = 2, col = SlideName, shape = regions)

3 Data normalization

As we observed the technical variations in the data in both RLE and PCA plots. It is necessary to perform normalization in the data.

In the standR package, we offer normalization options including TMM, RPKM, TPM, CPM, upperquartile and sizefactor. Among them, RPKM and TPM required gene length information (add genelength column to the rowData of the object). For TMM, upperquartile and sizefactor, their normalized factor will be stored their metadata.

Here we used TMM to normalize the data.

colData(spe)$biology <- paste0(colData(spe)$disease_status, "_", colData(spe)$regions)

spe_tmm <- geomxNorm(spe, method = "TMM")

4 Batch correction

In the Nanostring’s GeoMX DSP protocol, due to the fact that one slide is only big enough for a handful of tissue segments (ROIs), it is common that we see the DSP data being confounded by the batch effect introduced by different slides. In order to establish fair comparison between ROIs later on, it is necessary to remove this batch effect from the data.

To run RUV4 batch correction, we need to provide a list of “negative control genes (NCGs)”.

The function findNCGs allows identifying the NCGs from the data. In this case, since the batch effect is mostly introduced by slide, we therefore want to identify NCGs across all slides, so here we set the batch_name to “SlideName”, and select the top 500 least variable genes across different slides as NCGs.

spe <- findNCGs(spe, batch_name = "SlideName", top_n = 500)

metadata(spe) |> names()
## [1] "NegProbes"         "lcpm_threshold"    "genes_rm_rawCount"
## [4] "genes_rm_logCPM"   "NCGs"

Here we use k of 5 to perform RUV-4 normalization.

spe_ruv <- geomxBatchCorrection(spe, factors = "biology", 
                   NCGs = metadata(spe)$NCGs, k = 5)

We can then inspect the PCA of the corrected data with annotations, to inspect the removal of batch effects, and the retaining of the biological factors.

plotPairPCA(spe_ruv, assay = 2, color = disease_status, shape = regions, title = "RUV4")

Moreover, we can also have a look at the RLE plots of the normalized count.

plotRLExpr(spe_ruv, assay = 2, color = SlideName) + ggtitle("RUV4")

For more detailed analysis pipeline and usage of the standR package, please see https://github.com/DavisLaboratory/GeoMXAnalysisWorkflow

5 SessionInfo

sessionInfo()
## R Under development (unstable) (2024-03-18 r86148)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ggalluvial_0.12.5           ggplot2_3.5.0              
##  [3] ExperimentHub_2.11.1        AnnotationHub_3.11.3       
##  [5] BiocFileCache_2.11.1        dbplyr_2.5.0               
##  [7] limma_3.59.6                SpatialExperiment_1.13.1   
##  [9] SingleCellExperiment_1.25.0 SummarizedExperiment_1.33.3
## [11] Biobase_2.63.0              GenomicRanges_1.55.4       
## [13] GenomeInfoDb_1.39.9         IRanges_2.37.1             
## [15] S4Vectors_0.41.5            BiocGenerics_0.49.1        
## [17] MatrixGenerics_1.15.0       matrixStats_1.2.0          
## [19] standR_1.7.4               
## 
## loaded via a namespace (and not attached):
##   [1] jsonlite_1.8.8            magrittr_2.0.3           
##   [3] ggbeeswarm_0.7.2          magick_2.8.3             
##   [5] farver_2.1.1              rmarkdown_2.26           
##   [7] zlibbioc_1.49.3           vctrs_0.6.5              
##   [9] memoise_2.0.1             DelayedMatrixStats_1.25.1
##  [11] rstatix_0.7.2             htmltools_0.5.8          
##  [13] S4Arrays_1.3.6            curl_5.2.1               
##  [15] BiocNeighbors_1.21.2      broom_1.0.5              
##  [17] SparseArray_1.3.4         sass_0.4.9               
##  [19] bslib_0.6.2               cachem_1.0.8             
##  [21] mime_0.12                 lifecycle_1.0.4          
##  [23] pkgconfig_2.0.3           rsvd_1.0.5               
##  [25] Matrix_1.7-0              R6_2.5.1                 
##  [27] fastmap_1.1.1             GenomeInfoDbData_1.2.11  
##  [29] digest_0.6.35             colorspace_2.1-0         
##  [31] patchwork_1.2.0           AnnotationDbi_1.65.2     
##  [33] scater_1.31.2             irlba_2.3.5.1            
##  [35] RSQLite_2.3.5             ggpubr_0.6.0             
##  [37] beachmat_2.19.2           filelock_1.0.3           
##  [39] labeling_0.4.3            fansi_1.0.6              
##  [41] httr_1.4.7                abind_1.4-5              
##  [43] mgcv_1.9-1                compiler_4.4.0           
##  [45] bit64_4.0.5               withr_3.0.0              
##  [47] backports_1.4.1           BiocParallel_1.37.1      
##  [49] carData_3.0-5             viridis_0.6.5            
##  [51] DBI_1.2.2                 highr_0.10               
##  [53] ggsignif_0.6.4            rappdirs_0.3.3           
##  [55] DelayedArray_0.29.9       rjson_0.2.21             
##  [57] tools_4.4.0               vipor_0.4.7              
##  [59] beeswarm_0.4.0            glue_1.7.0               
##  [61] nlme_3.1-164              grid_4.4.0               
##  [63] generics_0.1.3            gtable_0.3.4             
##  [65] tzdb_0.4.0                tidyr_1.3.1              
##  [67] hms_1.1.3                 car_3.1-2                
##  [69] BiocSingular_1.19.0       ScaledMatrix_1.11.1      
##  [71] utf8_1.2.4                XVector_0.43.1           
##  [73] ggrepel_0.9.5             BiocVersion_3.19.1       
##  [75] pillar_1.9.0              vroom_1.6.5              
##  [77] splines_4.4.0             dplyr_1.1.4              
##  [79] lattice_0.22-6            bit_4.0.5                
##  [81] ruv_0.9.7.1               tidyselect_1.2.1         
##  [83] locfit_1.5-9.9            Biostrings_2.71.5        
##  [85] scuttle_1.13.1            knitr_1.45               
##  [87] gridExtra_2.3             edgeR_4.1.19             
##  [89] xfun_0.43                 statmod_1.5.0            
##  [91] yaml_2.3.8                evaluate_0.23            
##  [93] codetools_0.2-19          archive_1.1.7            
##  [95] tibble_3.2.1              BiocManager_1.30.22      
##  [97] cli_3.6.2                 munsell_0.5.0            
##  [99] jquerylib_0.1.4           Rcpp_1.0.12              
## [101] png_0.1-8                 parallel_4.4.0           
## [103] readr_2.1.5               blob_1.2.4               
## [105] sparseMatrixStats_1.15.0  viridisLite_0.4.2        
## [107] scales_1.3.0              purrr_1.0.2              
## [109] crayon_1.5.2              rlang_1.1.3              
## [111] cowplot_1.1.3             KEGGREST_1.43.0