R version: R Under development (unstable) (2021-10-19 r81077)
Bioconductor version: 3.15
Package: 2.1.0
In this vignette, we will introduce a data analysis workflow for GeoMx-NGS mRNA expression data.
The GeoMx Digital Spatial Profiler (DSP) is a platform for capturing spatially resolved, high-plex gene (or protein) expression data from tissue (Merritt et al., 2020). Formalin-fixed paraffin-embedded (FFPE) or fresh-frozen (FF) tissue sections are stained with barcoded in-situ hybridization probes that bind to endogenous mRNA transcripts. The user then selects regions of interest (ROIs) to profile; if desired, each ROI can be further subdivided into areas of illumination (AOIs) based on tissue morphology. The GeoMx instrument then photo-cleaves and collects the expression barcodes from each AOI segment separately for downstream sequencing and data processing.
The final result is a spatially resolved expression dataset covering every protein-coding gene (>18,000 genes) for each individual segment profiled from the tissue.
The motivation for this vignette is to enable scientists to work with GeoMx-NGS gene expression data and understand a standard data analysis workflow.
Our specific objectives:
Let’s install and load the GeoMx packages we need:
install.packages("devtools")
devtools::install_github("Nanostring-Biostats/NanoStringNCTools")
devtools::install_github("Nanostring-Biostats/GeomxTools", ref = "dev")
devtools::install_github("Nanostring-Biostats/GeoMxWorkflows", ref = "main")
library(NanoStringNCTools)
library(GeomxTools)
library(GeoMxWorkflows)
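Before going further, it can help to confirm that the three packages are actually available in the current library. The following is a minimal sketch using only base R; the variable names pkgs and installed are illustrative and are not part of the GeoMx packages.

```r
# Packages used in this vignette (names taken from the library() calls)
pkgs <- c("NanoStringNCTools", "GeomxTools", "GeoMxWorkflows")

# requireNamespace() returns TRUE/FALSE instead of raising an error,
# so it is safe to call for packages that may not be installed yet
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Report anything that still needs to be installed
if (!all(installed)) {
  message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
}
```

If any package is reported as missing, rerun the corresponding installation command before continuing.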
In this vignette, we will analyze a GeoMx kidney dataset created with the human whole transcriptome atlas (WTA) assay. The dataset includes 4 diabetic kidney disease (DKD) and 3 healthy kidney tissue samples. Regions of interest (ROI) were spatially profiled to focus on two different kidney structures: tubules or glomeruli. One glomerular ROI contains the entirety of a single glomerulus. Each tubular ROI contains multiple tubules that were segmented into distal (PanCK+) and proximal (PanCK-) tubule areas of illumination (AOI).
Download and unzip the kidney dataset found on the NanoString website.
The key data files are:
We first locate the downloaded files:
# Reference the main folder 'file.path' containing the sub-folders with each
# data file type:
datadir <- system.file("extdata", "WTA_NGS_Example",
                       package="GeoMxWorkflows")
# to locate a specific file path replace the above line with
# datadir <- file.path("~/Folder/SubFolder/DataLocation")
# replace the Folder, SubFolder, DataLocation as needed

# the DataLocation folder should contain a dccs, pkcs, and annotation folder
# with each set of files present as needed
# automatically list files in each directory for use
DCCFiles <- dir(file.path(datadir, "dccs"), pattern = ".dcc$",
                full.names = TRUE, recursive = TRUE)
PKCFiles <- unzip(zipfile = dir(file.path(datadir, "pkcs"), pattern = ".zip$",
                                full.names = TRUE, recursive = TRUE))
SampleAnnotationFile <-
    dir(file.path(datadir, "annotation"), pattern = ".xlsx$",
        full.names = TRUE, recursive = TRUE)
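An empty file list at this point usually means that datadir points to the wrong folder, and the problem only surfaces later when the data are loaded. The sketch below counts the files found in each expected sub-folder. The helper count_geomx_inputs is hypothetical (it is not part of GeomxTools), and the demonstration runs on a throwaway directory of empty placeholder files rather than on the kidney data. The patterns escape the dot so that it matches a literal period.

```r
# Hypothetical helper (not part of GeomxTools): count the input files found
# under a data directory laid out as dccs/, pkcs/, and annotation/
count_geomx_inputs <- function(datadir) {
  n_files <- function(subdir, pattern) {
    length(dir(file.path(datadir, subdir), pattern = pattern,
               full.names = TRUE, recursive = TRUE))
  }
  c(dcc        = n_files("dccs", "\\.dcc$"),
    pkc_zip    = n_files("pkcs", "\\.zip$"),
    annotation = n_files("annotation", "\\.xlsx$"))
}

# Demonstration on a temporary directory with empty placeholder files
demo_dir <- tempfile("geomx_layout_demo")
for (sub in c("dccs", "pkcs", "annotation")) {
  dir.create(file.path(demo_dir, sub), recursive = TRUE, showWarnings = FALSE)
}
file.create(file.path(demo_dir, "dccs", c("a.dcc", "b.dcc", "notes.txt")))
file.create(file.path(demo_dir, "pkcs", "panel.zip"))
file.create(file.path(demo_dir, "annotation", "annotations.xlsx"))

found <- count_geomx_inputs(demo_dir)
print(found)

# Stop early if any of the three file types is missing
stopifnot(all(found > 0))
```

On real data, calling count_geomx_inputs(datadir) before loading gives a quick indication of whether each sub-folder contains what is expected.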
We then load the data to create a data object using the readNanoStringGeoMxSet function.
# load data
demoData <-
    readNanoStringGeoMxSet(dccFiles = DCCFiles,
                           pkcFiles = PKCFiles,
                           phenoDataFile = SampleAnnotationFile,
                           phenoDataSheet = "Template",
                           phenoDataDccColName = "Sample_ID",
                           protocolDataColNames = c("aoi", "roi"),
                           experimentDataColNames = c("panel"))
All of the expression, annotation, and probe information is now linked and stored together in a single data object.
For more details on this object’s structure and accessors, please refer to the “GeoMxSet Object Overview” section at the end of this vignette.
First let’s access the PKC files, to ensure that the expected PKCs have been
loaded for this study. For the demo data we are using the file
library(knitr)
pkcs <- annotation(demoData)
modules <- gsub(".pkc", "", pkcs)
kable(data.frame(PKCs = pkcs, modules = modules))
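The module names above are derived by stripping the file extension with a regular expression. As a side note, an unescaped dot matches any character, so the pattern ".pkc" can remove more than the extension for some file names. The sketch below compares three ways of dropping the extension; the PKC file names are hypothetical and chosen only to show the difference.

```r
# Hypothetical PKC file names, for illustration only
pkc_names <- c("Example_Panel_v1.0.pkc", "my_pkc_panel.pkc")

# Unescaped pattern: "." matches any character, so "_pkc" is removed too
loose <- gsub(".pkc", "", pkc_names)

# Escaped and anchored pattern: removes only a trailing ".pkc"
strict <- sub("\\.pkc$", "", pkc_names)

# Base R helper that drops the final file extension
sans_ext <- tools::file_path_sans_ext(pkc_names)

print(data.frame(PKCs = pkc_names, loose = loose, strict = strict,
                 sans_ext = sans_ext))
```

For file names that contain "pkc" only in the extension, all three approaches give the same module name.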
Now that we have loaded the data, we can visually summarize the experimental design for our dataset to look at the different types of samples and ROI/AOI segments that have been profiled. We present this information in a Sankey diagram.
library(dplyr)
library(ggforce)

# select the annotations we want to show, use `` to surround column names with
# spaces or special symbols
count_mat <- count(pData(demoData), `slide name`, class, region, segment)
# simplify the slide names
count_mat$`slide name` <- gsub("disease", "d",
                               gsub("normal", "n", count_mat$`slide name`))
# gather the data and plot in order: class, slide name, region, segment
test_gr <- gather_set_data(count_mat, 1:4)
test_gr$x <- factor(test_gr$x,
                    levels = c("class", "slide name", "region", "segment"))
# plot Sankey
ggplot(test_gr, aes(x, id = id, split = y, value = n)) +
    geom_parallel_sets(aes(fill = region), alpha = 0.5, axis.width = 0.1) +
    geom_parallel_sets_axes(axis.width = 0.2) +
    geom_parallel_sets_labels(color = "white", size = 5) +
    theme_classic(base_size = 17) +
    theme(legend.position = "bottom",
          axis.ticks.y = element_blank(),
          axis.line = element_blank(),
          axis.text.y = element_blank()) +
    scale_y_continuous(expand = expansion(0)) +
    scale_x_discrete(expand = expansion(0)) +
    labs(x = "", y = "") +
    annotate(geom = "segment", x = 4.25, xend = 4.25,
             y = 20, yend = 120, lwd = 2) +
    annotate(geom = "text", x = 4.19, y = 70, angle = 90, size = 5,
             hjust = 0.5, label = "100 segments")
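The Sankey diagram is driven entirely by the table of counts, so it can be useful to see what that table looks like on a small example. The sketch below uses base R only and a made-up annotation table standing in for pData(demoData); the slide names, classes, and segment labels are illustrative assumptions and are not taken from the kidney dataset. Here aggregate() plays the role of count().

```r
# Made-up annotation table standing in for pData(demoData); all values are
# illustrative and do not come from the kidney dataset
toy <- data.frame(
  `slide name` = c("disease1", "disease1", "disease1", "disease1",
                   "normal1", "normal1"),
  class   = c("DKD", "DKD", "DKD", "DKD", "normal", "normal"),
  region  = c("glomerulus", "glomerulus", "tubule", "tubule",
              "tubule", "tubule"),
  segment = c("whole ROI", "whole ROI", "PanCK+", "PanCK-",
              "PanCK+", "PanCK+"),
  check.names = FALSE
)

# Base R counterpart of count(): one row per distinct combination of the
# four annotations, with the number of segments in column n
counts <- aggregate(list(n = rep(1L, nrow(toy))), by = toy, FUN = sum)
print(counts)

# Every segment should be counted exactly once
stopifnot(sum(counts$n) == nrow(toy))
```

The same check, sum(count_mat$n) == nrow(pData(demoData)), can be applied to the real count table before plotting.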