Using the ReactomeGSA package

Johannes Griss

2020-05-27

Introduction

The ReactomeGSA package is a client to the web-based Reactome Analysis System. Essentially, it performs a gene set analysis using the latest version of the Reactome pathway database as a backend.

The main advantages of using the Reactome Analysis System are:

Citation

To cite this package, use

Griss J. ReactomeGSA, https://github.com/reactome/ReactomeGSA (2019)

Installation

The ReactomeGSA package can be directly installed from Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

if (!require(ReactomeGSA))
  BiocManager::install("ReactomeGSA")

For more information, see https://bioconductor.org/install/.

Getting available methods

The Reactome Analysis System is continuously updated. Before starting your analysis, it is therefore good practice to check which methods are available.

This can simply be done by using:

library(ReactomeGSA)

available_methods <- get_reactome_methods(print_methods = FALSE, return_result = TRUE)

# only show the names of the available methods
available_methods$name
#> [1] "PADOG"  "Camera" "ssGSEA"

To get more information about a specific method, set print_details to TRUE and specify the method:

# Use this command to print the description of the specific method to the console
# get_reactome_methods(print_methods = TRUE, print_details = TRUE, method = "PADOG", return_result = FALSE)

# show the parameter names for the method
padog_params <- available_methods$parameters[available_methods$name == "PADOG"][[1]]

paste0(padog_params$name, " (", padog_params$type, ", ", padog_params$default, ")")
#>  [1] "use_interactors (bool, False)"             
#>  [2] "include_disease_pathways (bool, True)"     
#>  [3] "max_missing_values (float, 0.5)"           
#>  [4] "create_reactome_visualization (bool, True)"
#>  [5] "create_reports (bool, False)"              
#>  [6] "email (string, )"                          
#>  [7] "reactome_server (string, dev)"             
#>  [8] "sample_groups (string, )"                  
#>  [9] "discrete_norm_function (string, TMM)"      
#> [10] "continuous_norm_function (string, none)"
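When a method exposes many parameters, it can help to search the parameter table by name. The following is a minimal sketch in base R. Note that find_params is a hypothetical helper (not part of ReactomeGSA) and padog_params_demo is a hand-built stand-in that holds four of the rows printed above. In a live session, the padog_params object could presumably be passed instead, assuming it is a data.frame with the name, type, and default columns used above.

```r
# Hypothetical helper (not part of ReactomeGSA): return the formatted entries
# of a parameter table whose names contain the given search term.
find_params <- function(params, term) {
  hits <- params[grepl(term, params$name, fixed = TRUE), , drop = FALSE]
  if (nrow(hits) == 0) {
    return(character(0))
  }
  paste0(hits$name, " (", hits$type, ", ", hits$default, ")")
}

# Stand-in for the PADOG parameter table, copied from the output above.
padog_params_demo <- data.frame(
  name = c("use_interactors", "max_missing_values",
           "discrete_norm_function", "continuous_norm_function"),
  type = c("bool", "float", "string", "string"),
  default = c("False", "0.5", "TMM", "none"),
  stringsAsFactors = FALSE
)

# All parameters that control normalisation
norm_params <- find_params(padog_params_demo, "norm_function")
print(norm_params)
```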

Creating an analysis request

To start a gene set analysis, you first have to create an analysis request. This is a simple S4 class that takes care of submitting multiple datasets simultaneously to the analysis system.

When creating the request object, you already have to specify the analysis method you want to use:

# Create a new request object using 'Camera' for the gene set analysis
my_request <- ReactomeAnalysisRequest(method = "Camera")

my_request
#> ReactomeAnalysisRequestObject
#>   Method = Camera
#>   No request data stored
#> ReactomeAnalysisRequest

Setting parameters

To get a list of supported parameters for each method, use the get_reactome_methods function (see above).

Parameters are simply set using the set_parameters function:

# set the maximum number of allowed missing values to 50%
my_request <- set_parameters(request = my_request, max_missing_values = 0.5)

my_request
#> ReactomeAnalysisRequestObject
#>   Method = Camera
#>   Parameters:
#>   - max_missing_values: 0.5
#>   Datasets: none
#> ReactomeAnalysisRequest

Multiple parameters can be set simultaneously by adding more name-value pairs to the function call.
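As a sketch of such a call, the example below sets two parameters at once on a separate request object (demo_request is a name chosen here for illustration). The parameter names are taken from the PADOG listing above; whether a given method accepts them should be checked with get_reactome_methods.

```r
library(ReactomeGSA)

# Separate request object so the one built in this vignette stays untouched.
demo_request <- ReactomeAnalysisRequest(method = "Camera")

# Each name-value pair becomes one parameter of the request.
demo_request <- set_parameters(request = demo_request,
                               max_missing_values = 0.5,
                               discrete_norm_function = "TMM")

demo_request
```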

Adding datasets

One analysis request can contain multiple datasets. This can be used to, for example, visualize the results of an RNA-seq and Proteomics experiment (of the same / similar samples) side by side:

library(ReactomeGSA.data)
data("griss_melanoma_proteomics")

This is a limma EList object with the sample data already added:

class(griss_melanoma_proteomics)
#> [1] "EList"
#> attr(,"package")
#> [1] "limma"
head(griss_melanoma_proteomics$samples)
#>                patient.id condition cell.type
#> M-D MOCK PBMCB         P3      MOCK     PBMCB
#> M-D MCM PBMCB          P3       MCM     PBMCB
#> M-K MOCK PBMCB         P4      MOCK     PBMCB
#> M-K MCM PBMCB          P4       MCM     PBMCB
#> P-A MOCK PBMCB         P1      MOCK     PBMCB
#> P-A MCM PBMCB          P1       MCM     PBMCB

The dataset can now simply be added to the request using the add_dataset function:

my_request <- add_dataset(request = my_request, 
                          expression_values = griss_melanoma_proteomics, 
                          name = "Proteomics", 
                          type = "proteomics_int",
                          comparison_factor = "condition", 
                          comparison_group_1 = "MOCK", 
                          comparison_group_2 = "MCM",
                          additional_factors = c("cell.type", "patient.id"))
my_request
#> ReactomeAnalysisRequestObject
#>   Method = Camera
#>   Parameters:
#>   - max_missing_values: 0.5
#>   Datasets:
#>   - Proteomics (proteomics_int)
#>     No parameters set.
#> ReactomeAnalysisRequest

Several datasets (of the same experiment) can be added to one request. This RNA-seq data is stored as an edgeR DGEList object:

data("griss_melanoma_rnaseq")

# only keep genes with >= 100 reads in total
total_reads <- rowSums(griss_melanoma_rnaseq$counts)
griss_melanoma_rnaseq <- griss_melanoma_rnaseq[total_reads >= 100, ]

# this is an edgeR DGEList object
class(griss_melanoma_rnaseq)
#> [1] "DGEList"
#> attr(,"package")
#> [1] "edgeR"
head(griss_melanoma_rnaseq$samples)
#>        group lib.size norm.factors patient cell_type treatment
#> 195-13  MOCK 29907534    1.0629977      P1      TIBC      MOCK
#> 195-14   MCM 26397322    0.9927768      P1      TIBC       MCM
#> 195-19  MOCK 18194834    1.0077827      P2     PBMCB      MOCK
#> 195-20   MCM 24282215    1.0041410      P2     PBMCB       MCM
#> 197-11  MOCK 22628117    0.9522869      P1     PBMCB      MOCK
#> 197-12   MCM 23319849    1.0115732      P1     PBMCB       MCM

Again, the dataset can simply be added using add_dataset. This time, an extra argument is passed to the add_dataset call. Any such extra arguments are treated as dataset-level parameters.

# add the dataset
my_request <- add_dataset(request = my_request, 
                          expression_values = griss_melanoma_rnaseq, 
                          name = "RNA-seq", 
                          type = "rnaseq_counts",
                          comparison_factor = "treatment", 
                          comparison_group_1 = "MOCK", 
                          comparison_group_2 = "MCM",
                          additional_factors = c("cell_type", "patient"),
                          # This adds the dataset-level parameter 'discrete_norm_function' to the request
                          discrete_norm_function = "TMM")
#> Converting expression data to string... (This may take a moment)
#> Conversion complete
my_request
#> ReactomeAnalysisRequestObject
#>   Method = Camera
#>   Parameters:
#>   - max_missing_values: 0.5
#>   Datasets:
#>   - Proteomics (proteomics_int)
#>     No parameters set.
#>   - RNA-seq (rnaseq_counts)
#>     discrete_norm_function: TMM
#> ReactomeAnalysisRequest

Sample annotations

Datasets can be passed as limma EList, edgeR DGEList, any implementation of the Bioconductor ExpressionSet, or simply a data.frame.

For the first three, sample annotations are simply read from the respective slot. When supplying the expression values as a data.frame, the sample_data parameter has to be set using a data.frame where each row represents one sample and each column one property. If the sample_data option is set while providing the expression data as an EList, DGEList, or ExpressionSet, the data in sample_data will be used instead of the sample annotations in the expression data object.
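The data.frame case can be sketched as follows. All values and the gene and sample names are invented for illustration. The check assumes that samples are matched by name, i.e. that the row names of sample_data correspond to the column names of the expression table. The add_dataset call is shown as a comment only, since it needs an existing request object.

```r
# Toy expression table (values invented for illustration):
# one row per gene, one column per sample.
expr <- data.frame(
  S1 = c(10, 200, 35),
  S2 = c(12, 180, 40),
  S3 = c(30, 90, 33),
  S4 = c(28, 95, 37),
  row.names = c("GENE_A", "GENE_B", "GENE_C")
)

# Sample annotations: one row per sample, one column per property.
sample_data <- data.frame(
  condition = c("MOCK", "MOCK", "MCM", "MCM"),
  patient = c("P1", "P2", "P1", "P2"),
  row.names = c("S1", "S2", "S3", "S4"),
  stringsAsFactors = FALSE
)

# Assumption: samples are matched by name, so verify this before submitting.
annotations_match <- identical(rownames(sample_data), colnames(expr))
stopifnot(annotations_match)

# The two objects would then be passed together, for example:
# my_request <- add_dataset(request = my_request,
#                           expression_values = expr,
#                           sample_data = sample_data,
#                           name = "Toy data",
#                           type = "rnaseq_counts",
#                           comparison_factor = "condition",
#                           comparison_group_1 = "MOCK",
#                           comparison_group_2 = "MCM")
```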

Name

Each dataset has to have a name. This can be anything but has to be unique within one analysis request.

Type

The Reactome Analysis System supports different types of ’omics data. To get a list of supported types, use the get_reactome_data_types function.
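A minimal sketch of that call, assuming the function's default arguments and network access to the Reactome Analysis System. The identifiers it reports are the values expected by the type argument of add_dataset, such as "proteomics_int" and "rnaseq_counts" used above.

```r
library(ReactomeGSA)

# Query the analysis system for the supported data types.
# This contacts the Reactome Analysis System and therefore needs network access.
get_reactome_data_types()
```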

Defining the experimental design

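Defining the experimental design for a ReactomeAnalysisRequest is simple. It takes only three parameters: comparison_factor (the sample property to compare on), comparison_group_1, and comparison_group_2 (the two values of that property that are compared against each other).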

The value set in comparison_factor must match a column name in the sample data (either the slot in an EList, DGEList, or ExpressionSet object or in the sample_data parameter).

Additionally, it is possible to define blocking factors. These are supported by all methods that rely on linear models in the backend. Other methods may simply ignore this parameter. To find out whether a method supports blocking factors, use get_reactome_methods.

Blocking factors can be set by passing a vector of names to additional_factors. These should again reference properties (or columns) in the sample data.
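Since all of these values have to match the sample data, it can be worth checking a design before adding the dataset. The sketch below uses base R only. Note that check_design is a hypothetical helper (not part of ReactomeGSA), and proteomics_samples is a hand-built copy of the six proteomics sample rows printed earlier.

```r
# Hypothetical helper (not part of ReactomeGSA): stop with an error if the
# design references unknown columns or groups, otherwise return the number
# of samples in each of the two compared groups.
check_design <- function(sample_data, comparison_factor, group_1, group_2,
                         additional_factors = character(0)) {
  missing_cols <- setdiff(c(comparison_factor, additional_factors),
                          colnames(sample_data))
  if (length(missing_cols) > 0) {
    stop("Not found in sample data: ", paste(missing_cols, collapse = ", "))
  }

  groups <- as.character(sample_data[[comparison_factor]])
  missing_groups <- setdiff(c(group_1, group_2), groups)
  if (length(missing_groups) > 0) {
    stop("Groups without samples: ", paste(missing_groups, collapse = ", "))
  }

  vapply(c(group_1, group_2), function(g) sum(groups == g), integer(1))
}

# Copy of the first six proteomics sample annotations shown earlier.
proteomics_samples <- data.frame(
  patient.id = c("P3", "P3", "P4", "P4", "P1", "P1"),
  condition = c("MOCK", "MCM", "MOCK", "MCM", "MOCK", "MCM"),
  cell.type = rep("PBMCB", 6),
  stringsAsFactors = FALSE
)

group_sizes <- check_design(proteomics_samples,
                            comparison_factor = "condition",
                            group_1 = "MOCK",
                            group_2 = "MCM",
                            additional_factors = c("cell.type", "patient.id"))
print(group_sizes)
```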

Submitting the request

Once the ReactomeAnalysisRequest is created, the complete analysis can be run using perform_reactome_analysis:

result <- perform_reactome_analysis(request = my_request, compress = FALSE)
#> Submitting request to Reactome API...
#> Reactome Analysis submitted succesfully
#> Converting dataset Proteomics...
#> Mapping identifiers...
#> Performing gene set analysis using Camera
#> Analysing dataset 'RNA-seq' using Camera
#> Creating REACTOME visualization
#> Retrieving result...

Investigating the result

The result object is a ReactomeAnalysisResult S4 class with several helper functions to access the data.

To retrieve the names of all available results (generally one per dataset), use the names function:

names(result)
#> [1] "Proteomics" "RNA-seq"

For every dataset, different result types may be available. These can be shown using the result_types function:

result_types(result)
#> [1] "pathways"     "fold_changes"

The Camera analysis method returns two types of results: pathway-level data and gene- / protein-level fold changes.

A specific result can be retrieved using the get_result method:

# retrieve the fold-change data for the proteomics dataset
proteomics_fc <- get_result(result, type = "fold_changes", name = "Proteomics")
head(proteomics_fc)
#>   Identifier      logFC   AveExpr         t      P.Value    adj.P.Val         B
#> 1     Q14526  0.4937650 -3.346909 14.518996 1.516133e-10 8.061281e-07 13.987362
#> 2     Q6VY07  0.2981411 -3.330347 13.560202 4.131240e-10 1.098290e-06 13.129372
#> 3     P07093  1.7950301 -3.648968 12.284771 1.728229e-09 3.062997e-06 11.871270
#> 4     P10124  1.0758634 -3.436961 10.323745 2.015541e-08 2.679158e-05  9.633911
#> 5     P55210  0.5018522 -3.347932  9.511286 6.202629e-08 6.595875e-05  8.582641
#> 6     O43683 -0.4754083 -3.345551 -9.362828 7.674505e-08 6.800891e-05  8.381844
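The fold-change table is an ordinary data.frame, so it can be filtered with base R. The sketch below works on proteomics_fc_demo, a hand-built copy of three columns from the six rows printed above. The cut-offs (adjusted p-value below 0.05 and an absolute log fold change of at least 1) are arbitrary choices for illustration, not recommendations of the package.

```r
# Copy of three columns from the rows shown above.
proteomics_fc_demo <- data.frame(
  Identifier = c("Q14526", "Q6VY07", "P07093", "P10124", "P55210", "O43683"),
  logFC = c(0.4937650, 0.2981411, 1.7950301, 1.0758634, 0.5018522, -0.4754083),
  adj.P.Val = c(8.061281e-07, 1.098290e-06, 3.062997e-06,
                2.679158e-05, 6.595875e-05, 6.800891e-05),
  stringsAsFactors = FALSE
)

# Illustrative cut-offs: significant after multiple-testing correction
# and at least a two-fold change in either direction.
is_hit <- proteomics_fc_demo$adj.P.Val < 0.05 &
  abs(proteomics_fc_demo$logFC) >= 1

strong_hits <- proteomics_fc_demo$Identifier[is_hit]
print(strong_hits)
```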

Additionally, it is possible to directly merge the pathway-level data for all result sets using the pathways function:

combined_pathways <- pathways(result)

head(combined_pathways)
#>                                                                                                                              Name
#> R-HSA-163200  Respiratory electron transport, ATP synthesis by chemiosmotic coupling, and heat production by uncoupling proteins.
#> R-HSA-1428517                                                      The citric acid (TCA) cycle and respiratory electron transport
#> R-HSA-611105                                                                                       Respiratory electron transport
#> R-HSA-6799198                                                                                                Complex I biogenesis
#> R-HSA-72649                                                                              Translation initiation complex formation
#> R-HSA-72662                Activation of the mRNA upon binding of the cap-binding complex and eIFs, and subsequent binding to 43S
#>               Direction.Proteomics FDR.Proteomics PValue.Proteomics
#> R-HSA-163200                    Up   2.440247e-14      1.229963e-17
#> R-HSA-1428517                   Up   2.906918e-14      2.930361e-17
#> R-HSA-611105                    Up   7.331115e-14      1.108536e-16
#> R-HSA-6799198                   Up   1.481297e-11      2.986486e-14
#> R-HSA-72649                   Down   2.787931e-08      7.026035e-11
#> R-HSA-72662                   Down   5.583066e-08      1.688427e-10
#>               NGenes.Proteomics av_foldchange.Proteomics sig.Proteomics
#> R-HSA-163200                104               0.13854724           TRUE
#> R-HSA-1428517               144               0.12644683           TRUE
#> R-HSA-611105                 90               0.13912641           TRUE
#> R-HSA-6799198                53               0.14768317           TRUE
#> R-HSA-72649                  57              -0.09502561           TRUE
#> R-HSA-72662                  58              -0.09089517           TRUE
#>               Direction.RNA-seq  FDR.RNA-seq PValue.RNA-seq NGenes.RNA-seq
#> R-HSA-163200               Down 9.220252e-06   2.670118e-07            120
#> R-HSA-1428517              Down 2.273229e-05   7.405995e-07            164
#> R-HSA-611105               Down 1.185242e-04   5.764767e-06             99
#> R-HSA-6799198              Down 2.034228e-03   1.527972e-04             55
#> R-HSA-72649                Down 1.148816e-01   2.406795e-02             58
#> R-HSA-72662                Down 1.453653e-01   3.440094e-02             59
#>               av_foldchange.RNA-seq sig.RNA-seq
#> R-HSA-163200            -0.19744938        TRUE
#> R-HSA-1428517           -0.17597954        TRUE
#> R-HSA-611105            -0.18821852        TRUE
#> R-HSA-6799198           -0.18212003        TRUE
#> R-HSA-72649             -0.10430641       FALSE
#> R-HSA-72662             -0.07826746       FALSE
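One use of the merged table is to find pathways that are significant in both datasets but regulated in opposite directions. The sketch below uses base R on demo_pathways, a hand-built copy of the direction and significance columns printed above. The column names containing a hyphen have to be accessed with [[ ]] rather than $.

```r
# Copy of the direction and significance columns shown above.
demo_pathways <- data.frame(
  Direction.Proteomics = c("Up", "Up", "Up", "Up", "Down", "Down"),
  sig.Proteomics = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE),
  `Direction.RNA-seq` = c("Down", "Down", "Down", "Down", "Down", "Down"),
  `sig.RNA-seq` = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE),
  row.names = c("R-HSA-163200", "R-HSA-1428517", "R-HSA-611105",
                "R-HSA-6799198", "R-HSA-72649", "R-HSA-72662"),
  check.names = FALSE,  # keep the hyphen in "RNA-seq"
  stringsAsFactors = FALSE
)

# Significant in both datasets
both_sig <- demo_pathways$sig.Proteomics & demo_pathways[["sig.RNA-seq"]]

# Regulated in opposite directions
opposite <- demo_pathways$Direction.Proteomics !=
  demo_pathways[["Direction.RNA-seq"]]

discordant <- rownames(demo_pathways)[both_sig & opposite]
print(discordant)
```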

Visualising results

The ReactomeGSA package includes several basic plotting functions to visualise the pathway results. For comparative gene set analysis like the one presented here, two functions are available: plot_correlations and plot_volcano.

plot_correlations can be used to quickly assess how similar two datasets are on the pathway level:

plot_correlations(result)
#> Comparing 1 vs 2
#> [[1]]
#> Warning: Removed 232 rows containing missing values (geom_point).

Individual datasets can further be visualised using volcano plots of the pathway data:

plot_volcano(result, 2)

Session Info

sessionInfo()
#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ReactomeGSA.data_1.2.0 Seurat_3.1.5           edgeR_3.30.0          
#> [4] limma_3.44.1           ReactomeGSA_1.2.2     
#> 
#> loaded via a namespace (and not attached):
#>  [1] nlme_3.1-148        tsne_0.1-3          bitops_1.0-6       
#>  [4] progress_1.2.2      RcppAnnoy_0.0.16    RColorBrewer_1.1-2 
#>  [7] httr_1.4.1          sctransform_0.2.1   tools_4.0.0        
#> [10] R6_2.4.1            irlba_2.3.3         KernSmooth_2.23-17 
#> [13] uwot_0.1.8          lazyeval_0.2.2      colorspace_1.4-1   
#> [16] prettyunits_1.1.1   tidyselect_1.1.0    gridExtra_2.3      
#> [19] curl_4.3            compiler_4.0.0      plotly_4.9.2.1     
#> [22] labeling_0.3        caTools_1.18.0      scales_1.1.1       
#> [25] lmtest_0.9-37       ggridges_0.5.2      pbapply_1.4-2      
#> [28] rappdirs_0.3.1      stringr_1.4.0       digest_0.6.25      
#> [31] rmarkdown_2.1       pkgconfig_2.0.3     htmltools_0.4.0    
#> [34] htmlwidgets_1.5.1   rlang_0.4.6         farver_2.0.3       
#> [37] zoo_1.8-8           jsonlite_1.6.1      ica_1.0-2          
#> [40] gtools_3.8.2        dplyr_0.8.5         magrittr_1.5       
#> [43] patchwork_1.0.0     Matrix_1.2-18       Rcpp_1.0.4.6       
#> [46] munsell_0.5.0       ape_5.3             reticulate_1.15    
#> [49] lifecycle_0.2.0     stringi_1.4.6       yaml_2.2.1         
#> [52] MASS_7.3-51.6       gplots_3.0.3        Rtsne_0.15         
#> [55] plyr_1.8.6          grid_4.0.0          parallel_4.0.0     
#> [58] gdata_2.18.0        listenv_0.8.0       ggrepel_0.8.2      
#> [61] crayon_1.3.4        lattice_0.20-41     cowplot_1.0.0      
#> [64] splines_4.0.0       hms_0.5.3           locfit_1.5-9.4     
#> [67] knitr_1.28          pillar_1.4.4        igraph_1.2.5       
#> [70] future.apply_1.5.0  reshape2_1.4.4      codetools_0.2-16   
#> [73] leiden_0.3.3        glue_1.4.1          evaluate_0.14      
#> [76] data.table_1.12.8   BiocManager_1.30.10 vctrs_0.3.0        
#> [79] png_0.1-7           gtable_0.3.0        RANN_2.6.1         
#> [82] purrr_0.3.4         tidyr_1.1.0         future_1.17.0      
#> [85] assertthat_0.2.1    ggplot2_3.3.0       xfun_0.14          
#> [88] rsvd_1.0.3          survival_3.1-12     viridisLite_0.3.0  
#> [91] tibble_3.0.1        cluster_2.1.0       globals_0.12.5     
#> [94] fitdistrplus_1.1-1  ellipsis_0.3.1      ROCR_1.0-11