Abstract

This package provides access to region-annotated, high-resolution spatial transcriptomics datasets from the 10x Xenium, NanoString CosMx, and BGI STOmics platforms. Regions were annotated independently of the transcriptomic measurements, so the datasets form a valuable resource for benchmarking novel computational methods being developed in the field of spatial bioinformatics.

1 SubcellularSpatialData

The data in this package are described in Bhuva et al., Genome Biology, 2024. Annotations for each dataset were obtained independently of the molecular measurements and are provided at the lowest level of measurement: each DNA nanoball for BGI STOmics, and each individual transcript detection for 10x Xenium and NanoString CosMx. The package provides functions to convert these unit measurements into higher-level summaries such as cells, regions, square bins, or hex bins for analysis. These summaries are stored in SpatialExperiment objects.

The SubcellularSpatialData package can be installed as follows:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("SubcellularSpatialData")

2 Download data from the SubcellularSpatialData R package

To download the data, we first need to list the datasets available in the SubcellularSpatialData package and determine the unique identifier of each dataset. The query() function assists in getting this list.

library(SubcellularSpatialData)
library(ExperimentHub)

eh = ExperimentHub()
query(eh, "SubcellularSpatialData")
#> ExperimentHub with 4 records
#> # snapshotDate(): 2024-04-29
#> # $dataprovider: 10x, NanoString, BGI
#> # $species: Mus musculus, Homo sapiens
#> # $rdataclass: data.frame
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["EH8230"]]' 
#> 
#>            title                 
#>   EH8230 | xenium_mm_brain       
#>   EH8231 | stomics_mm_brain      
#>   EH8232 | cosmx_hs_nsclc        
#>   EH8567 | xenium_hs_breast_addon

Data can then be downloaded using the unique identifier.

eh[["EH8230"]]

As the datasets are large, the code above is not executed here. Instead, we will explore the package using a subsampled dataset that is available within the package.

# load internal sample data
data(tx_small)
# view data
head(tx_small)

The data is in the form of a transcript table where the following columns are present across all datasets:

sample_id - Unique identifier for the sample the transcript belongs to.
cell - Unique identifier for the cell the transcript belongs to (NA if it is not allocated to any cell).
gene - Gene name or identity of the transcript.
genetype - Type of target (gene or control probe).
x - x-coordinate of the transcript.
y - y-coordinate of the transcript.
counts - Number of transcripts detected at this location (always 1 for NanoString CosMx and 10x Xenium as each row is an individual detection).
region - Annotated histological region for the transcript.
technology - Name of the platform the data was generated on.

Additional columns may be present for different technologies.
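
To make this layout concrete, the following is a minimal sketch of a transcript table with the shared columns, built by hand. All values (sample names, gene names, genetype labels, coordinates, and region labels) are invented for illustration and are not taken from any dataset in the package.

```r
# Toy transcript table with the shared columns (all values are invented)
tx_toy <- data.frame(
  sample_id  = c("S1", "S1", "S1", "S2"),
  cell       = c("c1", "c1", NA, "c7"),   # NA: detection not allocated to a cell
  gene       = c("GeneA", "GeneB", "GeneA", "ControlProbe1"),
  genetype   = c("Gene", "Gene", "Gene", "Control"),  # hypothetical labels
  x          = c(10.2, 11.5, 40.0, 5.3),
  y          = c(3.1, 2.8, 17.6, 9.9),
  counts     = c(1, 1, 1, 1),             # one row per individual detection
  region     = c("Tumour", "Tumour", "Stroma", "Stroma"),
  technology = "Xenium"                   # recycled to every row
)

# inspect the column types and count detections without a cell assignment
str(tx_toy)
sum(is.na(tx_toy$cell))
```

Each row corresponds to one unit measurement, so any higher-level summary is obtained by grouping rows on one of the identifier columns.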

We now visualise the data on a spatial plot:

library(ggplot2)
tx_small |>
  ggplot(aes(x, y, colour = region)) +
  geom_point(pch = ".") +
  scale_colour_brewer(palette = "Set2", guide = guide_legend(override.aes = list(shape = 15, size = 5))) +
  facet_wrap(~sample_id, ncol = 2, scales = "free") +
  theme_minimal() +
  theme(legend.position = "bottom")

3 Summarising transcripts into cells, bins, and regions

The transcript table is highly detailed, as it contains the lowest-level form of measurement. Users may be more interested in working with binned counts, cellular counts, or region-specific pseudo-bulked samples. These can be obtained using the helper function tx2spe(). For cellular counts, the original cell binning provided by the vendor is used. Transcript aggregation is performed separately for each sample, so users do not need to split the data by sample beforehand.

Please note that BGI only performs nuclear binning of counts, so the cellular counts obtained represent nuclear counts only. For fair evaluation during benchmarking, we recommend either exploring alternative cell binning algorithms or performing the analysis on square or hex binned counts.

library(SpatialExperiment)

# summarising counts per cell
tx2spe(tx_small, bin = "cell")
#> class: SpatialExperiment 
#> dim: 489 92394 
#> metadata(0):
#> assays(1): counts
#> rownames(489): MLPH PGR ... NegControlCodeword_0524
#>   NegControlProbe_00009
#> rowData names(2): gene genetype
#> colnames(92394): IDC_10 IDC_100005 ... ILC_99993 ILC_99994
#> colData names(8): sample_id cell_id ... region technology
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : x y
#> imgData names(0):

# summarising counts per square bin
tx2spe(tx_small, bin = "square", nbins = 30)
#> class: SpatialExperiment 
#> dim: 489 1458 
#> metadata(0):
#> assays(1): counts
#> rownames(489): MLPH PGR ... BLANK_0406 NegControlCodeword_0524
#> rowData names(2): gene genetype
#> colnames(1458): IDC_4 IDC_5 ... ILC_822 ILC_828
#> colData names(12): sample_id bin_id ... region technology
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : x y
#> imgData names(0):

# summarising counts per hex bin
tx2spe(tx_small, bin = "hex", nbins = 30)
#> class: SpatialExperiment 
#> dim: 489 1654 
#> metadata(0):
#> assays(1): counts
#> rownames(489): MLPH PGR ... BLANK_0406 NegControlCodeword_0524
#> rowData names(2): gene genetype
#> colnames(1654): IDC_6 IDC_7 ... ILC_1077 ILC_1103
#> colData names(14): sample_id bin_id ... region technology
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : x y
#> imgData names(0):

# summarising counts per region
tx2spe(tx_small, bin = "region")
#> class: SpatialExperiment 
#> dim: 489 14 
#> metadata(0):
#> assays(1): counts
#> rownames(489): MLPH PGR ... BLANK_0406 NegControlCodeword_0524
#> rowData names(2): gene genetype
#> colnames(14): IDC_Adipose tissue (fat) IDC_Blood vessels ... ILC_Normal
#>   ducts ILC_Stroma
#> colData names(10): sample_id bin_id ... region technology
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : x y
#> imgData names(0):
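
As a rough illustration of what square binning involves, the sketch below assigns toy transcripts to a regular grid and sums the counts per gene and bin using base R only. It is not the implementation used by tx2spe(): the bin_width parameter is hypothetical (it is not the nbins argument), and the data values are invented.

```r
# Toy transcripts (invented values); one row per detection
tx <- data.frame(
  sample_id = "S1",
  gene      = c("GeneA", "GeneA", "GeneB", "GeneB", "GeneA"),
  x         = c(1, 4, 6, 12, 13),
  y         = c(2, 3, 8, 11, 14),
  counts    = 1
)

bin_width <- 10  # hypothetical bin size, in coordinate units

# integer grid indices along each axis, then a per-sample bin identifier
tx$bin_x  <- floor(tx$x / bin_width)
tx$bin_y  <- floor(tx$y / bin_width)
tx$bin_id <- paste(tx$sample_id, tx$bin_x, tx$bin_y, sep = "_")

# gene-by-bin count table: sum of counts for every (gene, bin) pair
bin_counts <- xtabs(counts ~ gene + bin_id, data = tx)
bin_counts
```

Including the sample identifier in bin_id keeps bins from different samples separate, which mirrors the per-sample aggregation described above.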

When transcript counts are aggregated (using any approach), numeric columns are aggregated using mean(..., na.rm = TRUE), and character/factor columns are aggregated such that the most frequent class becomes the representative class. For example, the region annotation of a hex bin is the most frequent region among the transcripts allocated to that bin. New coordinates for cells, bins, or regions are also computed using the mean and therefore represent the center of mass of the object. For hex and square bins, the average coordinate is computed by default; the x and y indices of the bin are additionally stored in the colData under the bin_x and bin_y columns.
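
The aggregation rule for annotation columns can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the helper most_frequent() is hypothetical, its tie-breaking (first class in table() order) is an assumption of this sketch, and the data values are invented.

```r
# Toy per-transcript annotations for two bins (invented values)
tx <- data.frame(
  bin_id = c("b1", "b1", "b1", "b2", "b2"),
  x      = c(1, 2, 3, 10, NA),
  region = c("Tumour", "Tumour", "Stroma", "Stroma", "Stroma")
)

# hypothetical helper: most frequent class of a character/factor vector
# (ties go to the first class in table() order; an assumption of this sketch)
most_frequent <- function(v) names(which.max(table(v)))

# numeric column: mean with missing values removed
bin_x <- tapply(tx$x, tx$bin_id, mean, na.rm = TRUE)

# character column: the most frequent class represents the bin
bin_region <- tapply(tx$region, tx$bin_id, most_frequent)

data.frame(
  bin_id = names(bin_x),
  x      = as.vector(bin_x),
  region = as.vector(bin_region)
)
```

Here bin b1 is labelled Tumour because two of its three transcripts carry that annotation, and the missing coordinate in b2 is ignored when averaging.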

When aggregation is performed using bins or regions, an additional column, ncells, is computed that indicates how many unique cells are present within the bin/region. Note that if a cell overlaps multiple bins/regions, it is counted in each of them.
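
The effect of counting a cell in every bin it overlaps can be seen in a small base R sketch. The data are invented, and excluding transcripts with no cell assignment (NA) from the count is an assumption of this sketch rather than documented package behaviour.

```r
# Toy transcripts: cell c2 has detections in both bins (invented values)
tx <- data.frame(
  bin_id = c("b1", "b1", "b1", "b2", "b2"),
  cell   = c("c1", "c1", "c2", "c2", NA)
)

# number of distinct cells contributing transcripts to each bin
# (NA cells are dropped here; this is an assumption of the sketch)
ncells <- tapply(tx$cell, tx$bin_id, function(v) length(unique(v[!is.na(v)])))
ncells

# the per-bin counts sum to more than the number of distinct cells overall
sum(ncells)
length(unique(tx$cell[!is.na(tx$cell)]))
```

Because c2 spans both bins, the per-bin values add up to three even though only two distinct cells are present, so ncells should not be summed across bins to estimate a total cell count.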

4 Session Info

sessionInfo()
#> R version 4.4.0 RC (2024-04-16 r86468)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SpatialExperiment_1.15.0     SingleCellExperiment_1.27.0 
#>  [3] SummarizedExperiment_1.35.0  Biobase_2.65.0              
#>  [5] GenomicRanges_1.57.0         GenomeInfoDb_1.41.0         
#>  [7] IRanges_2.39.0               S4Vectors_0.43.0            
#>  [9] MatrixGenerics_1.17.0        matrixStats_1.3.0           
#> [11] ggplot2_3.5.1                ExperimentHub_2.13.0        
#> [13] AnnotationHub_3.13.0         BiocFileCache_2.13.0        
#> [15] dbplyr_2.5.0                 BiocGenerics_0.51.0         
#> [17] SubcellularSpatialData_1.1.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1        dplyr_1.1.4             farver_2.1.1           
#>  [4] blob_1.2.4              filelock_1.0.3          Biostrings_2.73.0      
#>  [7] fastmap_1.1.1           digest_0.6.35           mime_0.12              
#> [10] lifecycle_1.0.4         KEGGREST_1.45.0         RSQLite_2.3.6          
#> [13] magrittr_2.0.3          compiler_4.4.0          rlang_1.1.3            
#> [16] sass_0.4.9              tools_4.4.0             utf8_1.2.4             
#> [19] yaml_2.3.8              knitr_1.46              S4Arrays_1.5.0         
#> [22] labeling_0.4.3          bit_4.0.5               curl_5.2.1             
#> [25] DelayedArray_0.31.0     RColorBrewer_1.1-3      abind_1.4-5            
#> [28] withr_3.0.0             purrr_1.0.2             grid_4.4.0             
#> [31] fansi_1.0.6             colorspace_2.1-0        scales_1.3.0           
#> [34] cli_3.6.2               rmarkdown_2.26          crayon_1.5.2           
#> [37] generics_0.1.3          rjson_0.2.21            httr_1.4.7             
#> [40] DBI_1.2.2               cachem_1.0.8            zlibbioc_1.51.0        
#> [43] AnnotationDbi_1.67.0    BiocManager_1.30.22     XVector_0.45.0         
#> [46] vctrs_0.6.5             Matrix_1.7-0            jsonlite_1.8.8         
#> [49] bit64_4.0.5             magick_2.8.3            hexbin_1.28.3          
#> [52] jquerylib_0.1.4         glue_1.7.0              gtable_0.3.5           
#> [55] BiocVersion_3.20.0      UCSC.utils_1.1.0        munsell_0.5.1          
#> [58] tibble_3.2.1            pillar_1.9.0            rappdirs_0.3.3         
#> [61] htmltools_0.5.8.1       GenomeInfoDbData_1.2.12 R6_2.5.1               
#> [64] evaluate_0.23           lattice_0.22-6          highr_0.10             
#> [67] png_0.1-8               memoise_2.0.1           BiocStyle_2.33.0       
#> [70] bslib_0.7.0             Rcpp_1.0.12             SparseArray_1.5.0      
#> [73] xfun_0.43               prettydoc_0.4.1         pkgconfig_2.0.3

SubcellularSpatialData: Annotated spatial transcriptomics datasets from 10x Xenium, NanoString CosMx and BGI STOmics.

Dharmesh D. Bhuva

2 May 2024
