The R package MoleculeExperiment contains functions to create and work with objects from the new MoleculeExperiment class. We introduce this class for analysing molecule-based spatial transcriptomics data (e.g., Xenium by 10X, CosMx SMI by Nanostring, and Merscope by Vizgen, among others).
The goal of the MoleculeExperiment class is to:
1. Enable analysis of spatial transcriptomics (ST) data at the molecule level,
independent of aggregating to the cell or tissue level.
2. Standardise molecule-based ST data across vendors, to hopefully facilitate
comparison of different data sources and common analytical and visualisation
workflows.
3. Enable aggregation to a SpatialExperiment
object given combinations of
molecules and segmentation boundaries.
The latest release of MoleculeExperiment can be installed using:
if (!require("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("MoleculeExperiment")
library(MoleculeExperiment)
library(ggplot2)
repoDir <- system.file("extdata", package = "MoleculeExperiment")
repoDir <- paste0(repoDir, "/xenium_V1_FF_Mouse_Brain")
me <- readXenium(repoDir,
keepCols = "essential"
)
me
#> class: MoleculeExperiment
#> 2 samples: sample1 sample2
#>
#> @molecules contents:
#> -detected assay:
#> Location range across all samples in assay "detected": [4900,4919.98] x [6400.02,6420]
#>
#> @boundaries contents:
#> -cell:
#> 7 unique segment IDs: 67500 67512 67515 67521 67527 ...
ggplot_me() +
geom_polygon_me(me, assayName = "cell", fill = "grey") +
geom_point_me(me) +
# zoom in to selected patch area
coord_cartesian(
xlim = c(4900, 4919.98),
ylim = c(6400.02, 6420)
)
# transform ME to SPE object
spe <- countMolecules(me)
spe
#> class: SpatialExperiment
#> dim: 178 14
#> metadata(0):
#> assays(1): counts
#> rownames(178): 2010300C02Rik Acsbg1 ... Zfp536 Zfpm2
#> rowData names(0):
#> colnames(14): sample1.67500 sample1.67512 ... sample2.65070
#> sample2.65071
#> colData names(4): sample_id cell_id x_location y_location
#> reducedDimNames(1): spatial
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : x_location y_location
#> imgData names(0):
Here we demonstrate how to work with an ME object from toy data, representing a scenario where both the detected transcripts information and the boundary information have already been read into R. This requires the standardisation of the data with the dataframeToMEList() function.
The flexibility of the arguments in dataframeToMEList() enable the creation of a standard ME object across dataframes comming from different vendors of molecule-based spatial transcriptomics technologies.
moleculesDf <- data.frame(
sample_id = rep(c("sample1", "sample2"), times = c(30, 20)),
features = rep(c("gene1", "gene2"), times = c(20, 30)),
x_coords = runif(50),
y_coords = runif(50)
)
head(moleculesDf)
#> sample_id features x_coords y_coords
#> 1 sample1 gene1 0.24763833 0.1654545
#> 2 sample1 gene1 0.75681544 0.3688864
#> 3 sample1 gene1 0.05414825 0.2576267
#> 4 sample1 gene1 0.87848945 0.7229406
#> 5 sample1 gene1 0.50648818 0.3971102
#> 6 sample1 gene1 0.63143636 0.8520749
boundariesDf <- data.frame(
sample_id = rep(c("sample1", "sample2"), times = c(16, 6)),
cell_id = rep(
c(
"cell1", "cell2", "cell3", "cell4",
"cell1", "cell2"
),
times = c(4, 4, 4, 4, 3, 3)
),
vertex_x = c(
0, 0.5, 0.5, 0,
0.5, 1, 1, 0.5,
0, 0.5, 0.5, 0,
0.5, 1, 1, 0.5,
0, 1, 0, 0, 1, 1
),
vertex_y = c(
0, 0, 0.5, 0.5,
0, 0, 0.5, 0.5,
0.5, 0.5, 1, 1,
0.5, 0.5, 1, 1,
0, 1, 1, 0, 0, 1
)
)
head(boundariesDf)
#> sample_id cell_id vertex_x vertex_y
#> 1 sample1 cell1 0.0 0.0
#> 2 sample1 cell1 0.5 0.0
#> 3 sample1 cell1 0.5 0.5
#> 4 sample1 cell1 0.0 0.5
#> 5 sample1 cell2 0.5 0.0
#> 6 sample1 cell2 1.0 0.0
moleculesMEList <- dataframeToMEList(moleculesDf,
dfType = "molecules",
assayName = "detected",
sampleCol = "sample_id",
factorCol = "features",
xCol = "x_coords",
yCol = "y_coords"
)
str(moleculesMEList, max.level = 3)
#> List of 1
#> $ detected:List of 2
#> ..$ sample1:List of 2
#> .. ..$ gene1: tibble [20 × 2] (S3: tbl_df/tbl/data.frame)
#> .. ..$ gene2: tibble [10 × 2] (S3: tbl_df/tbl/data.frame)
#> ..$ sample2:List of 1
#> .. ..$ gene2: tibble [20 × 2] (S3: tbl_df/tbl/data.frame)
boundariesMEList <- dataframeToMEList(boundariesDf,
dfType = "boundaries",
assayName = "cell",
sampleCol = "sample_id",
factorCol = "cell_id",
xCol = "vertex_x",
yCol = "vertex_y"
)
str(boundariesMEList, 3)
#> List of 1
#> $ cell:List of 2
#> ..$ sample1:List of 4
#> .. ..$ cell1: tibble [4 × 2] (S3: tbl_df/tbl/data.frame)
#> .. ..$ cell2: tibble [4 × 2] (S3: tbl_df/tbl/data.frame)
#> .. ..$ cell3: tibble [4 × 2] (S3: tbl_df/tbl/data.frame)
#> .. ..$ cell4: tibble [4 × 2] (S3: tbl_df/tbl/data.frame)
#> ..$ sample2:List of 2
#> .. ..$ cell1: tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
#> .. ..$ cell2: tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
toyME <- MoleculeExperiment(
molecules = moleculesMEList,
boundaries = boundariesMEList
)
toyME
#> class: MoleculeExperiment
#> 2 samples: sample1 sample2
#>
#> @molecules contents:
#> -detected assay:
#> Location range across all samples in assay "detected": [0.02,0.99] x [0.01,0.99]
#>
#> @boundaries contents:
#> -cell:
#> 3 unique segment IDs: cell1 cell2 cell3 cell4 ...
In this example, we use the extent of the molecules of generated for toyME
to
allign the boundaries with the molecules. In general, the extent of the
segmentation is required for this alignment.
repoDir <- system.file("extdata", package = "MoleculeExperiment")
segMask <- paste0(repoDir, "/BIDcell_segmask.tif")
boundaries(toyME, "BIDcell_segmentation") <- readSegMask(
extent(toyME), # use the molecule extent to define the boundary extent
path = segMask, assayName = "BIDcell_segmentation",
sample_id = "sample1", background_value = 0
)
toyME
#> class: MoleculeExperiment
#> 2 samples: sample1 sample2
#>
#> @molecules contents:
#> -detected assay:
#> Location range across all samples in assay "detected": [0.02,0.99] x [0.01,0.99]
#>
#> @boundaries contents:
#> -cell:
#> 3 unique segment IDs: cell1 cell2 cell3 cell4 ...
#> -BIDcell_segmentation:
#> 63 unique segment IDs: 1 2 3 4 5 6 ...
The MoleculeExperiment package also provides functions to directly work with the directories containing output files of commonly used technologies. This is especially useful to work with data from multiple samples. Here we provide convenience functions to read in data from Xenium (10X Genomics), CosMx (Nanostring) and Merscope (Vizgen).
repoDir <- system.file("extdata", package = "MoleculeExperiment")
repoDir <- paste0(repoDir, "/xenium_V1_FF_Mouse_Brain")
me <- readXenium(repoDir,
keepCols = "essential",
addBoundaries = "cell"
)
me
#> class: MoleculeExperiment
#> 2 samples: sample1 sample2
#>
#> @molecules contents:
#> -detected assay:
#> Location range across all samples in assay "detected": [4900,4919.98] x [6400.02,6420]
#>
#> @boundaries contents:
#> -cell:
#> 7 unique segment IDs: 67500 67512 67515 67521 67527 ...
readXenium() standardises the transcript and boundary information such that the column names are consistent across technologies when handling ME objects.
In addition, readXenium() enables the user to decide if they want to keep all data that is vendor-specific (e.g., column with qv score), some columns of interest, or only the essential columns. The latter refers to feature names and locations of the detected transcripts, and segment ids and boundary locations of the segmentation results.
For CosMx and Merscope data we provide convenience functions that standardise the raw transcripts data into a MoleculeExperiment object and additionally read the boundaries included in the standard data releases.
repoDir <- system.file("extdata", package = "MoleculeExperiment")
repoDir <- paste0(repoDir, "/nanostring_Lung9_Rep1")
meCosmx <- readCosmx(repoDir,
keepCols = "essential",
addBoundaries = "cell"
)
meCosmx
#> class: MoleculeExperiment
#> 2 samples: sample_1 sample_2
#>
#> @molecules contents:
#> -detected assay:
#> Location range across all samples in assay "detected": [924.01,3002] x [26290,26398]
#>
#> @boundaries contents:
#> -cell:
#> 98 unique segment IDs: 1 2 3 4 5 6 ...
repoDir <- system.file("extdata", package = "MoleculeExperiment")
repoDir <- paste0(repoDir, "/vizgen_HumanOvarianCancerPatient2Slice2")
meMerscope <- readMerscope(repoDir,
keepCols = "essential",
addBoundaries = "cell"
)
meMerscope
#> class: MoleculeExperiment
#> 1 samples: vizgen_HumanOvarianCancerPatient2Slice2
#>
#> @molecules contents:
#> -detected assay:
#> Location range across all samples in assay "detected": [10219.02,10386.83] x [8350.93,8395.3]
#>
#> @boundaries contents:
#> -cell:
#> 24 unique segment IDs: 45862 45865 45872 45873 45882 45887 ...
A MoleculeExperiment object contains a @molecules slot and an optional @boundaries slot.
Both slots have a hierarchical list structure that consists of a nested list, ultimately ending in a data.frame/tibble. Traditional rectangular data structures, like dataframes, redundantly store gene names and sample IDs for the millions of transcripts. In contrast, data in a list enables us to avoid this redundancy and work with objects of smaller size.
The @molecules slot contains molecule-level information. The essential data it contains is the feature name (e.g., gene names) and x and y locations of the detected molecules (e.g., transcripts), in each sample. Nevertheless, the user can also decide to keep all molecule metadata (e.g., subcellular location: nucleus/cytoplasm).
The nested list in the molecules slot has the following hierarchical structure: “assay name” > “sample ID” > “feature name” > dataframe/tibble with X and Y locations (and other additional columns of interest).
showMolecules(me)
#> List of 1
#> $ detected:List of 2
#> ..$ sample1:List of 137
#> .. ..$ 2010300C02Rik : tibble [11 × 2] (S3: tbl_df/tbl/data.frame)
#> .. ..$ Acsbg1 : tibble [6 × 2] (S3: tbl_df/tbl/data.frame)
#> .. .. [list output truncated]
#> ..$ sample2:List of 143
#> .. ..$ 2010300C02Rik: tibble [9 × 2] (S3: tbl_df/tbl/data.frame)
#> .. ..$ Acsbg1 : tibble [10 × 2] (S3: tbl_df/tbl/data.frame)
#> .. .. [list output truncated]
The @boundaries slot contains information from segmentation analyses (e.g., cell boundaries, or nucleus boundaries).
The nested list in the boundaries slot has the following hierarchical structure: “assay name” > “sample ID” > “segment ID” > dataframe/tibble with the vertex coordinates defining the boundaries for each segment. For example, if the boundary information is for cells, the assay name can be set to “cell”; or “nucleus” if one is using nucleus boundaries.
showBoundaries(me)
#> List of 1
#> $ cell:List of 2
#> ..$ sample1:List of 5
#> .. ..$ 67500: tibble [13 × 2] (S3: tbl_df/tbl/data.frame)
#> .. ..$ 67512: tibble [13 × 2] (S3: tbl_df/tbl/data.frame)
#> .. .. [list output truncated]
#> ..$ sample2:List of 9
#> .. ..$ 65043: tibble [13 × 2] (S3: tbl_df/tbl/data.frame)
#> .. ..$ 65044: tibble [13 × 2] (S3: tbl_df/tbl/data.frame)
#> .. .. [list output truncated]
Here we introduce basic methods to access and manipulate data in an ME object, i.e., getters and setters, respectively.
The main getters are molecules() and boundaries(). NOTE: the output of these methods is the ME nested list, which can be very large on screen. Thus, these getters should be used when wanting to work with the data. To quickly view the slot contents, use showMolecules() and showBoundaries() instead.
# NOTE: output not shown as it is too large
# access molecules slot
molecules(me)
# access cell boundary information in boundaries slot
boundaries(me, "cell")
For ease of use, these getters have arguments that enable the transformation of the data from a nested ME list format to a data.frame format.
molecules(me, assayName = "detected", flatten = TRUE)
#> # A tibble: 1,739 × 4
#> x_location y_location feature_name sample_id
#> * <dbl> <dbl> <chr> <chr>
#> 1 4918. 6411. 2010300C02Rik sample1
#> 2 4901. 6417. 2010300C02Rik sample1
#> 3 4901. 6417. 2010300C02Rik sample1
#> 4 4910. 6417. 2010300C02Rik sample1
#> 5 4908. 6413. 2010300C02Rik sample1
#> 6 4911. 6407. 2010300C02Rik sample1
#> 7 4915. 6411. 2010300C02Rik sample1
#> 8 4916. 6412. 2010300C02Rik sample1
#> 9 4901. 6415. 2010300C02Rik sample1
#> 10 4906. 6417. 2010300C02Rik sample1
#> # ℹ 1,729 more rows
boundaries(me, assayName = "cell", flatten = TRUE)
#> # A tibble: 182 × 4
#> x_location y_location segment_id sample_id
#> * <dbl> <dbl> <chr> <chr>
#> 1 4905. 6400. 67500 sample1
#> 2 4899. 6401. 67500 sample1
#> 3 4894. 6408. 67500 sample1
#> 4 4890. 6418. 67500 sample1
#> 5 4887. 6423. 67500 sample1
#> 6 4887. 6425. 67500 sample1
#> 7 4890. 6427. 67500 sample1
#> 8 4891. 6427. 67500 sample1
#> 9 4894. 6426. 67500 sample1
#> 10 4908. 6421. 67500 sample1
#> # ℹ 172 more rows
Other getters include: features() and segmentIDs().
# get initial features in sample 1
head(features(me)[[1]])
#> [1] "2010300C02Rik" "Acsbg1" "Adamts2" "Adamtsl1"
#> [5] "Angpt1" "Aqp4"
segmentIDs(me, "cell")
#> $sample1
#> [1] "67500" "67512" "67515" "67521" "67527"
#>
#> $sample2
#> [1] "65043" "65044" "65051" "65055" "65063" "65064" "65067" "65070" "65071"
Main setters include molecules<-
and boundaries<-
.
For example, with boundaries<-
one can add new segmentation assay information
to the boundaries slot. Here we demonstrate this with the nucleus boundaries.
repoDir <- system.file("extdata", package = "MoleculeExperiment")
repoDir <- paste0(repoDir, "/xenium_V1_FF_Mouse_Brain")
nucleiMEList <- readBoundaries(
dataDir = repoDir,
pattern = "nucleus_boundaries.csv",
segmentIDCol = "cell_id",
xCol = "vertex_x",
yCol = "vertex_y",
keepCols = "essential",
boundariesAssay = "nucleus",
scaleFactorVector = 1
)
boundaries(me, "nucleus") <- nucleiMEList
me # note the addition of the nucleus boundaries to the boundaries slot
#> class: MoleculeExperiment
#> 2 samples: sample1 sample2
#>
#> @molecules contents:
#> -detected assay:
#> Location range across all samples in assay "detected": [4900,4919.98] x [6400.02,6420]
#>
#> @boundaries contents:
#> -cell:
#> 7 unique segment IDs: 67500 67512 67515 67521 67527 ...
#> -nucleus:
#> 5 unique segment IDs: 67500 67512 67515 67521 67527 ...
The additional boundaries can be accessed, e.g. for visualisation.
ggplot_me() +
# add cell segments and colour by cell id
geom_polygon_me(me, byFill = "segment_id", colour = "black", alpha = 0.1) +
# add molecule points and colour by feature name
geom_point_me(me, byColour = "feature_name", size = 0.1) +
# add nuclei segments and colour the border with red
geom_polygon_me(me, assayName = "nucleus", fill = NA, colour = "red") +
# zoom in to selected patch area
coord_cartesian(xlim = c(4900, 4919.98), ylim = c(6400.02, 6420))