This vignette demonstrates the scGraphVerse workflow on a single-cell RNA-seq dataset. We show how to:
In this real-data analysis, we’ll work with two public PBMC (Peripheral Blood Mononuclear Cell) datasets from 10X Genomics. Our strategy is to focus specifically on B cells and identify a common set of highly expressed genes across both datasets. This approach allows us to compare regulatory networks between different experimental batches while controlling for cell type and gene selection effects.
Data Processing Workflow: 1. Load two PBMC datasets (3k and 4k cells) from TENxPBMCData 2. Use SingleR for automated cell type annotation 3. Select top 500 most expressed genes in B cells from each dataset 4. Find intersection of expressed genes to ensure comparable gene sets 5. Subset datasets to B cells only with common gene set
This preprocessing ensures we have clean, comparable data for network inference.
# Note: This vignette requires external packages from Suggests
# Install if needed:
# BiocManager::install(c("TENxPBMCData", "scater", "SingleR", "celldex"))
# Helper function to preprocess PBMC data
preprocess_pbmc <- function(pbmc_obj) {
sce <- scater::logNormCounts(pbmc_obj)
symbols_tenx <- SummarizedExperiment::rowData(sce)$Symbol_TENx
valid <- !is.na(symbols_tenx) & symbols_tenx != ""
sce <- sce[valid, ]
rownames(sce) <- make.unique(symbols_tenx[valid])
SummarizedExperiment::assay(sce, "logcounts") <-
as.matrix(SummarizedExperiment::assay(sce, "logcounts"))
colnames(sce) <- paste0("cell_", seq_len(ncol(sce)))
return(sce)
}
# 1. Load and preprocess PBMC data
if (requireNamespace("TENxPBMCData", quietly = TRUE)) {
pbmc_obj <- TENxPBMCData::TENxPBMCData("pbmc3k")
pbmc_obj1 <- TENxPBMCData::TENxPBMCData("pbmc4k")
sce <- preprocess_pbmc(pbmc_obj)
sce1 <- preprocess_pbmc(pbmc_obj1)
# 2. Cell type annotation
if (requireNamespace("celldex", quietly = TRUE) &&
requireNamespace("SingleR", quietly = TRUE)) {
ref <- celldex::HumanPrimaryCellAtlasData()
pred1 <- SingleR::SingleR(test = sce1, ref = ref, labels=ref$label.main)
SummarizedExperiment::colData(sce1)$pcell <- pred1$labels
pred <- SingleR::SingleR(test = sce, ref = ref, labels=ref$label.main)
SummarizedExperiment::colData(sce)$pcell <- pred$labels
# 3. Select top 100 B-cell genes
genes <- selgene(
object = sce,
top_n = 100,
cell_type = "B_cell",
cell_type_col = "pcell",
remove_rib = TRUE,
remove_mt = TRUE
)
genes1 <- selgene(
object = sce1,
top_n = 100,
cell_type = "B_cell",
cell_type_col = "pcell",
remove_rib = TRUE,
remove_mt = TRUE
)
# 4. Find intersection
commong <- intersect(genes, genes1)
# 5. Subset data
pbmc1_sub <- sce[
commong,
SummarizedExperiment::colData(sce)$pcell == "B_cell"
]
pbmc2_sub <- sce1[
commong,
SummarizedExperiment::colData(sce1)$pcell == "B_cell"
]
pbmc1_sub <- pbmc1_sub[, 1:85]
pbmc2_sub <- pbmc2_sub[, 1:85]
# Create MultiAssayExperiment for multi-sample analysis
bcell_mae <- create_mae(list(pbmc1 = pbmc1_sub, pbmc2 = pbmc2_sub))
}
}
#> see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation
#> loading from cache
#> see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation
#> loading from cache
#> Using SCE assay 'logcounts' (log-normalized).
#> Subsetted to 344 cells where pcell = 'B_cell'.
#> Removed mitochondrial genes matching '^MT-'.
#> Removed ribosomal genes matching '^RP[SL]'.
#> Top 100 genes selected based on mean expression.
#> Using SCE assay 'logcounts' (log-normalized).
#> Subsetted to 607 cells where pcell = 'B_cell'.
#> Removed mitochondrial genes matching '^MT-'.
#> Removed ribosomal genes matching '^RP[SL]'.
#> Top 100 genes selected based on mean expression.
Now we’ll infer gene regulatory networks from our preprocessed B cell data using GENIE3. Unlike the simulation study where we had control over all parameters, here we’re working with real biological data that presents unique challenges: B cells have distinct expression patterns, lower total gene counts compared to the full transcriptome, and batch effects between the two PBMC datasets.
The function now accepts a MultiAssayExperiment
object as input, which
provides a structured way to handle multiple experimental conditions.
# Choose method: "GENIE3", "GRNBoost2", "ZILGM", "PCzinb" or "JRF"
method <- "GENIE3"
if (exists("bcell_mae")) {
networks <- infer_networks(
count_matrices_list = bcell_mae,
method = method,
nCores = 1
)
}
Here we apply a more stringent threshold (99th percentile) compared to the simulation study (95th percentile). This is appropriate for real data analysis where we expect higher noise levels and want to focus on the most confident regulatory relationships. With B cell data, we’re particularly interested in high-confidence edges controlling B cell function.
The functions now return and work with SummarizedExperiment
objects for
better data organization and metadata tracking.
if (exists("networks")) {
# Weighted adjacency (returns SummarizedExperiment)
wadj_se <- generate_adjacency(networks)
# Symmetrize (returns SummarizedExperiment)
swadj_se <- symmetrize(wadj_se, weight_function = "mean")
# Binary cutoff (top 1%, returns SummarizedExperiment)
binary_se <- cutoff_adjacency(
count_matrices = bcell_mae,
weighted_adjm_list = swadj_se,
n = 1,
method = method,
quantile_threshold = 0.99,
nCores = 1
)
# Plot
if (requireNamespace("ggraph", quietly = TRUE)) {
plotg(binary_se)
}
}
With only two PBMC datasets, our consensus network will include edges that appear in both batches, providing high confidence in the regulatory relationships.
if (exists("binary_se")) {
# Consensus by vote
consensus <- create_consensus(binary_se, method = "vote")
# Plot consensus network
if (requireNamespace("ggraph", quietly = TRUE)) {
plotg(consensus)
}
}
Here we demonstrate the use of classify_edges()
and
plot_network_comparison()
for detailed network comparison analysis.
The false_plot
parameter controls whether to include
false positive/negative plots.
if (exists("consensus")) {
# Note: For comparison with external databases like STRINGdb,
# use stringdb_adjacency() to fetch reference networks
# Community detection
if (requireNamespace("igraph", quietly = TRUE)) {
communities <- community_path(consensus)
}
}
#>
#>
#> Detecting communities...
#> Running pathway enrichment...
#> 'select()' returned 1:1 mapping between keys and columns
#> Reading KEGG annotation online: "https://rest.kegg.jp/link/hsa/pathway"...
#> Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/hsa"...
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
Unlike the simulation study where we used STRINGdb to create ground truth, here
we use it for validation of our inferred B cell regulatory network.
The keep_all_genes = TRUE
parameter ensures we retain information for all
genes in our dataset, even those not found in STRING, which is important for
comprehensive edge mining analysis.
# Note: This requires internet connection for STRINGdb downloads
# This section requires the STRINGdb package from Suggests
if (exists("consensus") && requireNamespace("STRINGdb", quietly = TRUE)) {
str_result <- stringdb_adjacency(
genes = rownames(consensus),
species = 9606,
required_score = 900,
keep_all_genes = TRUE
)
# Extract binary adjacency and symmetrize
str_binary <- str_result$binary
ground_truth_se <- symmetrize(
build_network_se(list(string_network = str_binary)),
weight_function = "mean"
)
ground_truth <- SummarizedExperiment::assay(ground_truth_se, 1)
# Edge mining: TP rates (returns list of dataframes)
em <- edge_mining(
consensus,
ground_truth = ground_truth,
query_edge_types = "TP"
)
# Display results
if (length(em) > 0) {
print(head(em[[1]]))
}
}
#> Registered S3 method overwritten by 'gplots':
#> method from
#> reorder.factor DescTools
#> Initializing STRINGdb...
#> Mapping genes to STRING IDs...
#> Mapped 85 genes to STRING IDs.
#> Retrieving **physical** interactions from STRING API...
#> Found 68 STRING physical interactions.
#> Adjacency matrices constructed successfully.
#> gene1 gene2 edge_type pubmed_hits
#> 2 ACTG1 ARPC2 TP 1
#> 41 HLA-A HLA-B TP 2922
#> 43 HLA-A HLA-C TP 1190
#> 44 HLA-B HLA-C TP 1194
#> 67 CD74 HLA-DRA TP 38
#> 70 HLA-B HLA-DRA TP 21
#> PMIDs
#> 2 34938284
#> 41 41069639,41050417,41046214,41035653,41034087,41011272,41003814,41001029,40962674,40944463,40941618,40910546,40896631,40868188,40873549,40866583,40835749,40820324,40791414,40770645,40739281,40721532,40668377,40637775,40617889,40577315,40575193,40574847,40565849,40535491,40517559,40510810,40508101,40508057,40504263,28520367,40475155,40469303,40451247,40405507,40390041,40387827,40366340,40360437,40342296,40314039,40303409,40284905,40239304,40230599,40200142,40171627,40170583,40149438,40127639,40087529,40074802,40071333,40052889,40020215,40010968,39979805,39937552,39933824,39928675,39914254,39901240,39898909,39892763,39890193,39836838,39833815,39824805,39804685,39768617,39738923,39728756,39727970,39726387,39723380,39716692,39715961,39713404,39694865,39684614,39672954,39642776,39605664,39586371,39558667,39530176,39469985,39459542,39456988,39447013,39435845,39374941,39337366,39336927,39306605
#> 43 41097894,41035653,41011272,41003814,40941618,40912174,40896631,40873549,40842414,40820324,40804474,40791414,40770645,40739281,40721532,40668377,40642077,40637873,40577315,40574847,40535491,40508057,40469303,40429368,40405507,40402323,40390041,40366340,40361234,40284905,40239304,40223959,40223381,40200142,40074802,40020215,39979805,39933824,39928675,39901240,39836838,39835139,39804685,39768617,39731480,39723380,39715961,39694865,39672954,39642776,39621955,39605664,39558667,39469985,39459542,39447013,39374941,39337964,39306605,39294236,39234850,39230170,39222660,39201523,39201460,39185409,39173558,39167735,39141684,39111186,39107575,39085621,39044005,39015089,38978376,38940584,38936012,38924641,38907894,38900146,38840925,38825445,38785545,38758119,38757301,38680423,38674456,38667507,38657099,38645783,38581695,38575661,38568509,38547417,38535155,38524775,38485627,38478091,38469033,38459565
#> 44 41104340,41035653,41032234,41011272,41003814,40941618,40896631,40873549,40862422,40820673,40820324,40806754,40791414,40784610,40770645,40744770,40739281,40721532,40706259,40693714,40668377,40663089,40577315,40574847,40557158,40535491,40508057,40492078,40469303,40405507,40390041,40366340,40345160,40284905,40239304,40200142,40193244,40074802,40065352,40020215,40004750,39979805,39933824,39928675,39901240,39836838,39807693,39804685,39790007,39768617,39723380,39715961,39694865,39672954,39663186,39651149,39642776,39605664,39558667,39533856,39483265,39470005,39469985,39459542,39457610,39447013,39436105,39374941,39364548,39329950,39307952,39306605,39294236,39234850,39230170,39222660,39201523,39201460,39185409,39173558,39167735,39158076,39125781,39111186,39107575,39101335,39085621,39060355,39041326,39019884,39015482,39015089,38988514,38978376,38978245,38940584,38936012,38907894,38900146,38887196
#> 67 40993240,40806540,40800081,40717170,40517735,40338916,40226614,40223959,39937308,39865657,39694280,39483471,39280905,38367852,38129488,37150240,36975409,36211422,35845067,35804895,35626874,35378753,35208514,34589742,34137173,33235138,33042148,32305991,32259312,32218455,30365966,29867922,28862321,26500140,26246143,24321376,23127183,15467430
#> 70 40087529,37780578,37317627,35991636,35759748,35739510,35592524,35356004,34275915,32547543,32218410,31009447,29867922,25901754,24278156,23028341,22952805,21543121,19143813,14564482,12445311
#> query_status
#> 2 hits_found
#> 41 hits_found
#> 43 hits_found
#> 44 hits_found
#> 67 hits_found
#> 70 hits_found
This case study illustrates how scGraphVerse enables end-to-end GRN reconstruction and validation in single-cell data. Users can swap inference algorithms, tune thresholds, and incorporate external prior networks.
sessionInfo()
#> R version 4.5.1 Patched (2025-08-23 r88802)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] TENxPBMCData_1.27.0 HDF5Array_1.37.0
#> [3] h5mread_1.1.1 rhdf5_2.53.6
#> [5] DelayedArray_0.35.3 SparseArray_1.9.1
#> [7] S4Arrays_1.9.1 abind_1.4-8
#> [9] Matrix_1.7-4 SingleCellExperiment_1.31.1
#> [11] SummarizedExperiment_1.39.2 Biobase_2.69.1
#> [13] GenomicRanges_1.61.5 Seqinfo_0.99.2
#> [15] IRanges_2.43.5 S4Vectors_0.47.4
#> [17] BiocGenerics_0.55.4 generics_0.1.4
#> [19] MatrixGenerics_1.21.0 matrixStats_1.5.0
#> [21] scGraphVerse_0.99.20 BiocStyle_2.37.1
#>
#> loaded via a namespace (and not attached):
#> [1] R.methodsS3_1.8.2 dichromat_2.0-0.1
#> [3] gld_2.6.8 Biostrings_2.77.2
#> [5] vctrs_0.6.5 ggtangle_0.0.7
#> [7] perturbR_0.1.3 digest_0.6.37
#> [9] png_0.1-8 shape_1.4.6.1
#> [11] proxy_0.4-27 Exact_3.3
#> [13] pcaPP_2.0-5 BiocBaseUtils_1.11.2
#> [15] gypsum_1.5.0 ggrepel_0.9.6
#> [17] bst_0.3-24 magick_2.9.0
#> [19] hdrcde_3.4 MASS_7.3-65
#> [21] fontLiberation_0.1.0 reshape2_1.4.4
#> [23] foreach_1.5.2 qvalue_2.41.0
#> [25] withr_3.0.2 xfun_0.53
#> [27] ggfun_0.2.0 survival_3.8-3
#> [29] doRNG_1.8.6.2 memoise_2.0.1
#> [31] ggbeeswarm_0.7.2 clusterProfiler_4.17.0
#> [33] gson_0.1.0 systemfonts_1.3.1
#> [35] tidytree_0.4.6 networkD3_0.4.1
#> [37] gtools_3.9.5 R.oo_1.27.1
#> [39] WeightSVM_1.7-16 KEGGREST_1.49.2
#> [41] httr_1.4.7 GENIE3_1.31.0
#> [43] hash_2.2.6.3 rhdf5filters_1.21.4
#> [45] rstudioapi_0.17.1 DOSE_4.3.0
#> [47] curl_7.0.0 ScaledMatrix_1.17.0
#> [49] ggraph_2.2.2 polyclip_1.10-7
#> [51] ExperimentHub_2.99.5 stringr_1.5.2
#> [53] pracma_2.4.4 doParallel_1.0.17
#> [55] evaluate_1.0.5 BiocFileCache_2.99.6
#> [57] hms_1.1.4 glmnet_4.1-10
#> [59] bookdown_0.45 irlba_2.3.5.1
#> [61] colorspace_2.1-2 filelock_1.0.3
#> [63] reticulate_1.43.0 readxl_1.4.5
#> [65] magrittr_2.0.4 readr_2.1.5
#> [67] viridis_0.6.5 ggtree_3.99.1
#> [69] lattice_0.22-7 XML_3.99-0.19
#> [71] scuttle_1.19.0 cowplot_1.2.0
#> [73] class_7.3-23 pillar_1.11.1
#> [75] nlme_3.1-168 iterators_1.0.14
#> [77] caTools_1.18.3 compiler_4.5.1
#> [79] beachmat_2.25.5 stringi_1.8.7
#> [81] DescTools_0.99.60 plyr_1.8.9
#> [83] mpath_0.4-2.26 fda_6.3.0
#> [85] crayon_1.5.3 scater_1.37.0
#> [87] gbm_2.2.2 gridGraphics_0.5-1
#> [89] chron_2.3-62 haven_2.5.5
#> [91] graphlayouts_1.2.2 org.Hs.eg.db_3.22.0
#> [93] bit_4.6.0 rootSolve_1.8.2.4
#> [95] dplyr_1.1.4 fastmatch_1.1-6
#> [97] codetools_0.2-20 BiocSingular_1.25.0
#> [99] bslib_0.9.0 e1071_1.7-16
#> [101] lmom_3.2 alabaster.ranges_1.9.1
#> [103] fds_1.8 MultiAssayExperiment_1.35.9
#> [105] splines_4.5.1 Rcpp_1.1.0
#> [107] dbplyr_2.5.1 sparseMatrixStats_1.21.0
#> [109] cellranger_1.1.0 knitr_1.50
#> [111] blob_1.2.4 BiocVersion_3.22.0
#> [113] robin_2.0.0 fs_1.6.6
#> [115] DelayedMatrixStats_1.31.0 pscl_1.5.9
#> [117] expm_1.0-0 ggplotify_0.1.3
#> [119] sqldf_0.4-11 tibble_3.3.0
#> [121] tzdb_0.5.0 tweenr_2.0.3
#> [123] pkgconfig_2.0.3 tools_4.5.1
#> [125] cachem_1.1.0 RSQLite_2.4.3
#> [127] viridisLite_0.4.2 DBI_1.2.3
#> [129] numDeriv_2016.8-1.1 distributions3_0.2.3
#> [131] celldex_1.19.0 fastmap_1.2.0
#> [133] rmarkdown_2.30 scales_1.4.0
#> [135] grid_4.5.1 AnnotationHub_3.99.6
#> [137] sass_0.4.10 patchwork_1.3.2
#> [139] BiocManager_1.30.26 graph_1.87.0
#> [141] alabaster.schemas_1.9.0 SingleR_2.11.4
#> [143] rpart_4.1.24 farver_2.1.2
#> [145] tidygraph_1.3.1 gsubfn_0.7
#> [147] yaml_2.3.10 deSolve_1.40
#> [149] cli_3.6.5 purrr_1.1.0
#> [151] lifecycle_1.0.4 askpass_1.2.1
#> [153] rainbow_3.8 mvtnorm_1.3-3
#> [155] BiocParallel_1.43.4 gtable_0.3.6
#> [157] parallel_4.5.1 ape_5.8-1
#> [159] jsonlite_2.0.0 bitops_1.0-9
#> [161] ggplot2_4.0.0 bit64_4.6.0-1
#> [163] yulab.utils_0.2.1 alabaster.matrix_1.9.0
#> [165] BiocNeighbors_2.3.1 proto_1.0.0
#> [167] jquerylib_0.1.4 alabaster.se_1.9.0
#> [169] GOSemSim_2.35.2 R.utils_2.13.0
#> [171] lazyeval_0.2.2 alabaster.base_1.9.5
#> [173] htmltools_0.5.8.1 enrichplot_1.29.2
#> [175] GO.db_3.22.0 rappdirs_0.3.3
#> [177] data.tree_1.2.0 tinytex_0.57
#> [179] glue_1.8.0 STRINGdb_2.21.0
#> [181] httr2_1.2.1 XVector_0.49.1
#> [183] gdtools_0.4.4 RCurl_1.98-1.17
#> [185] qpdf_1.4.1 treeio_1.33.0
#> [187] mclust_6.1.1 ks_1.15.1
#> [189] gridExtra_2.3 boot_1.3-32
#> [191] igraph_2.2.0 R6_2.6.1
#> [193] tidyr_1.3.1 fdatest_2.1.1
#> [195] ggiraph_0.9.2 gplots_3.2.0
#> [197] forcats_1.0.1 labeling_0.4.3
#> [199] cluster_2.1.8.1 rngtools_1.5.2
#> [201] Rhdf5lib_1.31.1 aplot_0.2.9
#> [203] plotrix_3.8-4 tidyselect_1.2.1
#> [205] vipor_0.4.7 ggforce_0.5.0
#> [207] fontBitstreamVera_0.1.1 AnnotationDbi_1.71.2
#> [209] rsvd_1.0.5 KernSmooth_2.23-26
#> [211] S7_0.2.0 fontquiver_0.2.1
#> [213] data.table_1.17.8 htmlwidgets_1.6.4
#> [215] fgsea_1.35.8 RColorBrewer_1.1-3
#> [217] rlang_1.1.6 rentrez_1.2.4
#> [219] beeswarm_0.4.0