Contents


1 Introduction

This package is a pipeline to identify the key gene regulators in a biological process, for example in cell differentiation and in cell development after stimulation. Given gene expression data and sample information, there are four major steps in this pipeline: (1) differential expression analysis; (2) regulator-target network inference; (3) enrichment analysis; and (4) regulators scoring and ranking.

In this tutorial, we are showing you how to perform RegEnrich analysis by starting with a quick example, followed by detailed explanation in each step and three case studies.


2 A quick example

To illustrate how to use RegEnrich pipline, here we simply show the basic steps, assuming you have had all input data and the parameters ready.

2.1 Including RegEnrich library

3 RegenrichSet object initialization

3.1 The RegenrichSet object

RegEnrich package is programmed in a style of S4 object system. The most fundamental class in RegEnrich is RegenrichSet, and majority functions are working with this class in the following analysis steps. So the RegenrichSet object must be initialized prior to RegEnrich analysis. The RegenrichSet object can be easily initialized by RegenrichSet function, which require several input data along with other parameters.

3.2 Input data

There are three fundamental input data for RegEnrich pipline.

3.2.1 Expression data

The first one is expression data (expr), which is a table (matrix) with m rows (genes or proteins) and n columns (samples). Here the expression data can be of genes, measured by microarray or RNA sequencing, and can be of proteins, measured by mass spectrometry, etc.

Here we downloaded the gene expression data (RNA-sequencing data in FPKM format) that is from (Bouquet et al. 2016) in GEO database (GSE63085). And only 52 samples were included in Lyme_GSE63085 dataset and can be loaded in the following way.

Althouth here RNA-sequencing data is reprensented in FPKM (Fragments Per Kilobase of transcript per Million) format, we do recomment raw read count format. To further work on this FPKM data, we convert the FPKM data (plus 1) into logarithm to the base 2.

3.2.3 Regulators

The third input is the regulators in the studied organisms. If the organism is human, RegEnrich by default provided transcrpiton factors and co-factors in humans as the regulators which are obtained from (Han et al. 2015; Marbach et al. 2016; and Liu et al. 2015).

You can define your own regulators in RegEnrich. For example, for the studies in other organisms such as mouse, yeast, drosophila and arabidopsis thaliana, you can use the transcription (co-)factors in the following links, and cite corresponding paper:

Mouse, Yeast, Drosophila, and Arabidopsis thaliana

But one thing to keep in mind, the type of names or IDs of regulators should be the same as those of genes in the expression data. For example, if ENSEMBL gene ID is used in the expression data, then the regulators should be represented by ENSEMBL gene ID as well.

3.3 Other parameters

In addition to previous 3 most fundamental input data, other parameters for RegEnrich analysis can be initialized here as well. These parameters include 3 groups.

First, parameters to perform differential expression analysis, such as method (differential test method), minMeanExpr (the threshold to remove the lowly expressed gene), design (design model matrix or formula), reduced (reduced model matrix or formula), contrast (to extract the coefficients in the model), etc. Here we consider the effect of different patients and time (week in sample information table) on gene expression. To identify the differential genes related to time, we can simply use LRT_LM method (likelihood retio test on two linear model) to perfrom differential expression analysis. So the corresponding parameters are defined by:

Second, parameters to perform regulator-target network inference, such as networkConstruction (network inference method), topNetPercent (what percentage of the top edges to retain), etc. Here we use COEN method (weighted gene coexpression network) and other default parameters to inference regulator-target network.

Third, parameters to perform enrichment analysis, such as enrichTest (enrichment method), minSize (minimum number of targets of regulator), etc. Here we use FET method (Fisher’s exact test) and other default parameters to perform enrichment analysis.

The detailed explaination of these parameters can be found in the help page of RegenrichSet function, which can be viewed by ?RegenrichSet. Unlike expression data and sample information data, these parameters can be re-specified in the later steps of RegEnrich analysis.

4 Differential expression analysis

In this step, the major goal is to obtain differential p-values and log2 fold changes of gene expression between different conditions. There are couple of packages being developed to perform differential expression analysis, such as DESeq2, edgeR, limma, DSS, EBSeq, and baySeq. The full tutorials of these packages have been already provided as vignettes or other documentation in these packages. Here, we provide a wraper function, regenrich_diffExpr, which allows you to choose either “Wald_DESeq2”, “LRT_DESeq2”, “limma”, or “LRT_LM” to perform the differential expression analysis on the RegenrichSet obejct.

4.1 Use the parameters initialized in the RegenrichSet object

Since RegenrichSet object is initialized with method = "LRT_LM", regenrich_diffExpr function performs differential expression analysis using likelihood ratio test on two linear models that are specified by design formula and reduced formula.

LRT_LM method is implemented for data with complicated experiment designs, in which it is less meaningful to calculate log2 fold changes. In the current version of RegEnrich, calculating the log2 fold change between conditions is not implemented in LRT_LM method. And the log2 fold changes of all genes are set to 0 by default.

5 Regulator-target network inference

RegEnrich provides two computational methods, COEN and GRN, to infer regulator-target network based on expression data. For COEN method, the weighted co-expression network is constructed using WGCNA package, and then the regulator-target network is the robust subnetwork of weighted co-expression network, and the nodes and edges are removed if they are not connected to any regulators. With respect to GRN, it infers regulator-target network using random forest algorithm. This method was initially described in GENIE3 package, and it is modified in RegEnrich to support parallel computing and to control the model accuracy. Regulator-target network inferences using COEN and GRN methods are shown bellow.

5.1 COEN (based on WGCNA)

regenrich_network is the function to perform regulator-target network inference. Since the networkConstruction = "COEN" parameter has been set during the RegenrichSet initialization, so by default RegEnrich constructs a COEN network.

What happens under the hood in above codes is updating object@topNetP slot, which is a Topnetwork class. RegEnrich provides a function to access this slot.

Please note that since the oganism of GSE63085 dataset is Homo sapien, the regulators used in RegEnrich by default are obtained from (Han et al. 2015; Marbach et al. 2016; and Liu et al. 2015). And we are using gene names in the expression data, so the regulators here are also represented by gene names.

Since network inference is generally very time-consuming, we suggest saving the RegenrichSet object with the regulator-target network in case of using it next time.

A more detailed explaination can be found in the help pages of regenrich_network (see ?regenrich_network) and TopNetwork-class (see ?"TopNetwork-class").

5.2 GRN (based on random forest)

Alternatively, you can build a gene regulatory network (GRN) by setting networkConstruction = "GRN" parameter in regenrich_network function. To accelarate computing, you can set number of CPU cores and random seeds using BPPARAM parameter. Here you can control the accuracy of network inferance by minR which are computed based on out-of-bag estimation in random forest. Please note that the lower minR is set, the less edges and potential less regulators are retained.

5.3 User defined network

It is also possible to provide a regulator-target network, which is obtained somewhere else. For example, this network can be constructed based on the relation network of transcription factors and their binding targets. Here we assigned the constructed COEN regulator-target network (a Topnetwork object) in object variable to object2 variable.

It is also fine to provide a 3-column table (data.frame object) of network edges, in which the first column is regulators, the second column is targets, and the third column is edge weight (reliability).


6 Enrichment analysis

RegEnrich provides two methods to perform enrichment analysis, i.e. Fisher’s exact test (FET) and gene set enrichment analysis (GSEA). Both methods are implemented in regenrich_enrich function. regenrich_enrich function updates object@resEnrich slot, which is a Enrich class. RegEnrich provides results_enrich function to access this slot.

6.1 Fisher’s exact test (FET)

Since the enrichTest = "FET" parameter has been set during the RegenrichSet initialization, so by default RegEnrich performs enrichment analysis using FET method.

6.2 Gene set enrichment analysis (GSEA)

Since the enrichTest = "FET" parameter has been set during the RegenrichSet initialization, but enrichTest = GSEA parameter can be re-specified in regenrich_enrich function to perform enrichment analysis using GSEA method. Typically, GSEA is slower than FET method, especially when the number of reg is large and the regulator-target network is also large. Reducing the number of permutation (nperm, default is 10,000) can be a good trial to have a look at preliminary results.

You can compare the order of enriched regulators obtained by FET and GSEA methods using plotOrders function.


7 Regulator scoring and ranking

The RegEnrich score is a summarized information from both differential expression analysis and regulator enrichment analysis for regulators. This step of RegEnrich analysis is done by regenrich_rankScore function.

Above all, the differential expression analysis is perormed by LRT_LM method, regulator-target network is infered by COEN method, and enrichment analysis is performed by FET method, so the scores and ranking summarize the importance of regulators by considering regulatory interactions in the studied biological process.

The expression of regulator and its targets can be viewed using following code.

Note that the previous analysis is a tutorial showing you how to perform basic RegEnrich analysis. As you known this analysis is based on only first 2000 genes, the real key regulators should be different from the previous results.

RegEnrich can work with different types of dataset, such as microarray data, RNAseq raw read count data, and mass spectrometriy proteomic data. The following section shows you 2 case studies of using RegEnrich to work with these 2 types of datasets.


8 Case studies

8.1 Case 1: Microarray (single-channel) data

8.1.1 Background

This case study analyzes gene expression changes of primary human hepatocytes after 6 and 24 h treatment with interferon alpha (IFN-α). The gene expression was examined using single-channel Affymetrix Human Genome U133 Plus 2.0 Arrays. And the raw data and normalized data is available in GEO database under GSE31193 accession ID.

8.1.2 Reading the data

There are several ways to read the data. You can download the normalized data file, decompress it, and then read it using read.csv function.

Reading the raw data (.cel files) and then normalize it using other normalization method is also possible. As there is a simpler way to read data from GEO database, this method is not included in this vignette.

The simplist way to read the data is using GEOquery library. To use this library, you must have this package installed.

This dataset contains samples treated with IL28B, but here we are focusing on only control samples and samples after 6 and 24 h IFN-α treatment.

Here to simplify the analysis, if there are multiple probes matching a gene, we use only one probe with higher average expression value to represent this gene.

Because the speed of network infernece is highly influenced by the number of genes, to quickly illustrate how to use RegEnrich, here we randomly take only 5,000 genes for analysis. If you would like to see the real result in the analysis, then the following data subsetting step should be discarded.

8.1.3 RegEnrich analysis

Here we would like to know which regulators play key roles in primary human hepatocytes after 24 h treatment with IFN-α.

8.2 Case 2: RNAseq read count data

8.2.1 Background

Here we show how to apply RegEnrich on the RNAseq data by analyzing Kang et al’s monocyte-macrophage-IFN stimulation dataset ( GSE130567). There are multiple experiment conditions in this study. But here we would like to focus on partial samples in which monocytes were cultured with 10 ng/ml human macrophage colonystimulating factor (M-CSF) in the presence (IFN-γ-primed macrophages) or absence (resting macrophages) of 100 U/ml human IFN-γ for 48 h. RNA were extracted and reverse transcripted followed by sequencing (50 bp, paired-end) using Illumina HiSeq 4000. Sequenced reads were mapped to reference human genome (hg19 assembly) using STAR aligner with default parameters. We will use the raw HT-seq counts for the RegEnrich analysis.

8.2.2 Reading the data

Since the sample information and raw read counts data are archived seperately in GEO database. First, we can read the sample information using GEOquery package.

Second, download read count file and decompose into a temporary folder.

Then read the raw read counts in these files.

Similar to the case 1, here we randomly take only 5,000 genes to quickly illustrate how to use RegEnrich, but to see the real result from the analysis, you should neglect the following step.

8.2.3 RegEnrich analysis


9 Session info

sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
#>  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
#> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] GEOquery_2.58.0             RegEnrich_1.0.0             SummarizedExperiment_1.20.0
#>  [4] Biobase_2.50.0              GenomicRanges_1.42.0        GenomeInfoDb_1.26.0        
#>  [7] IRanges_2.24.0              MatrixGenerics_1.2.0        matrixStats_0.57.0         
#> [10] BiocSet_1.4.0               tibble_3.0.4                dplyr_1.0.2                
#> [13] S4Vectors_0.28.0            BiocGenerics_0.36.0         BiocStyle_2.18.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] fgsea_1.16.0           colorspace_1.4-1       ellipsis_0.3.1         dynamicTreeCut_1.63-1 
#>   [5] qvalue_2.22.0          htmlTable_2.1.0        XVector_0.30.0         base64enc_0.1-3       
#>   [9] rstudioapi_0.11        bit64_4.0.5            fansi_0.4.1            AnnotationDbi_1.52.0  
#>  [13] xml2_1.3.2             codetools_0.2-16       splines_4.0.3          doParallel_1.0.16     
#>  [17] GOSemSim_2.16.0        impute_1.64.0          geneplotter_1.68.0     knitr_1.30            
#>  [21] Formula_1.2-4          annotate_1.68.0        WGCNA_1.69             cluster_2.1.0         
#>  [25] GO.db_3.12.0           png_0.1-7              readr_1.4.0            BiocManager_1.30.10   
#>  [29] compiler_4.0.3         httr_1.4.2             backports_1.1.10       assertthat_0.2.1      
#>  [33] Matrix_1.2-18          ontologyIndex_2.5      cli_2.1.0              limma_3.46.0          
#>  [37] htmltools_0.5.0        tools_4.0.3            gtable_0.3.0           glue_1.4.2            
#>  [41] GenomeInfoDbData_1.2.4 reshape2_1.4.4         DO.db_2.9              fastmatch_1.1-0       
#>  [45] Rcpp_1.0.5             vctrs_0.3.4            Biostrings_2.58.0      preprocessCore_1.52.0 
#>  [49] iterators_1.0.13       xfun_0.18              fastcluster_1.1.25     stringr_1.4.0         
#>  [53] ps_1.4.0               lifecycle_0.2.0        XML_3.99-0.5           DOSE_3.16.0           
#>  [57] zlibbioc_1.36.0        scales_1.1.1           hms_0.5.3              RColorBrewer_1.1-2    
#>  [61] curl_4.3               yaml_2.2.1             memoise_1.1.0          gridExtra_2.3         
#>  [65] ggplot2_3.3.2          rpart_4.1-15           latticeExtra_0.6-29    stringi_1.5.3         
#>  [69] RSQLite_2.2.1          genefilter_1.72.0      BiocIO_1.0.0           foreach_1.5.1         
#>  [73] randomForest_4.6-14    checkmate_2.0.0        BiocParallel_1.24.0    rlang_0.4.8           
#>  [77] pkgconfig_2.0.3        bitops_1.0-6           evaluate_0.14          lattice_0.20-41       
#>  [81] purrr_0.3.4            htmlwidgets_1.5.2      bit_4.0.4              tidyselect_1.1.0      
#>  [85] plyr_1.8.6             magrittr_1.5           bookdown_0.21          DESeq2_1.30.0         
#>  [89] R6_2.4.1               generics_0.0.2         Hmisc_4.4-1            DelayedArray_0.16.0   
#>  [93] DBI_1.1.0              pillar_1.4.6           foreign_0.8-80         survival_3.2-7        
#>  [97] KEGGREST_1.30.0        RCurl_1.98-1.2         nnet_7.3-14            crayon_1.3.4          
#> [101] utf8_1.1.4             rmarkdown_2.5          jpeg_0.1-8.1           locfit_1.5-9.4        
#> [105] grid_4.0.3             data.table_1.13.2      blob_1.2.1             digest_0.6.27         
#> [109] xtable_1.8-4           tidyr_1.1.2            munsell_0.5.0

References

Bouquet, Jerome, Mark J Soloski, Andrea Swei, Chris Cheadle, Scot Federman, Jean-Noel Billaud, Alison W Rebman, et al. 2016. “Longitudinal Transcriptome Analysis Reveals a Sustained Differential Gene Expression Signature in Patients Treated for Acute Lyme Disease.” MBio 7 (1). Am Soc Microbiol:e00100–16.

Han, Heonjong, Hongseok Shim, Donghyun Shin, Jung Eun Shim, Yunhee Ko, Junha Shin, Hanhae Kim, et al. 2015. “TRRUST: A Reference Database of Human Transcriptional Regulatory Interactions.” Scientific Reports 5. Nature Publishing Group:11432.

Liu, Zhi-Ping, Canglin Wu, Hongyu Miao, and Hulin Wu. 2015. “RegNetwork: An Integrated Database of Transcriptional and Post-Transcriptional Regulatory Networks in Human and Mouse.” Database 2015. Oxford University Press.

Marbach, Daniel, David Lamparter, Gerald Quon, Manolis Kellis, Zoltán Kutalik, and Sven Bergmann. 2016. “Tissue-Specific Regulatory Circuits Reveal Variable Modular Perturbations Across Complex Diseases.” Nature Methods 13 (4). Nature Publishing Group:366.