Contents

library(gemma.R)
library(data.table)
library(dplyr)
library(ggplot2)
library(ggrepel)
library(SummarizedExperiment)
library(pheatmap)
library(viridis)
library(listviewer)

1 About Gemma

Gemma is a web site, database and a set of tools for the meta-analysis, re-use and sharing of genomics data, currently primarily targeted at the analysis of gene expression profiles. Gemma contains data from thousands of public studies, referencing thousands of published papers. Every dataset in Gemma has passed a rigorous curation process that re-annotates the expression platform at the sequence level, which allows for more consistent cross-platform comparisons and meta-analyses.

For detailed information on the curation process, read this page or the latest publication.

2 Package cheat sheet

3 Installation instructions

3.1 Bioconductor

You can install gemma.R through Bioconductor with the following code:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("gemma.R")

4 Searching for datasets of interest in Gemma

Using the get_datasets function, datasets fitting various criteria can be accessed.

# accessing all mouse and human datasets
get_datasets(taxa = c('mouse','human')) %>% 
    select(experiment.shortName, experiment.name, 
           experiment.description,taxon.name) %>%
    head %>% gemma_kable
experiment.shortName experiment.name experiment.description taxon.name
GSE2018 Human Lung Transplant - BAL Bronchoalveolar lavage samp… human
GSE4523 Expression Studies of Melan… Melanotransferrin (MTf) or … mouse
GSE4036 perro-affy-human-186940 Our laboratory has develope… human
GSE4034 palme-affy-mouse-198967 Fear conditioning (FC) is a… mouse
GSE2866 Donarum-3R01NS040270-03S1 Succinate semialdehyde dehy… mouse
GSE3253 Exaggerated neuroinflammati… Acute cognitive impairment … mouse
# accessing human datasets with the word "bipolar"
get_datasets(query = 'bipolar',taxa = 'human') %>% 
    select(experiment.shortName, experiment.name, 
           experiment.description,taxon.name) %>%
    head %>% gemma_kable
experiment.shortName experiment.name experiment.description taxon.name
GSE67645 Transcriptome dynamics of d… To define molecular mechani… human
McLean Hippocampus McLean Hippocampus Hippocampus of schizophreni… human
byne-cc Corpus Callosum data from S… human
GSE45484 Gene-expression differences… Analysis of gene-expression… human
GSE134497 Total RNA sequecing for hum… Total RNA sequecing for hu… human
GSE160761 RNA sequencing in human i… Valproate(VPA) has been us… human
# access human datasets that were annotated with the ontology term for the
# bipolar disorder

get_datasets(taxa ='human', 
             uris = 'http://purl.obolibrary.org/obo/MONDO_0004985') %>%
    select(experiment.shortName, experiment.name, 
           experiment.description,taxon.name) %>%
    head %>% gemma_kable
experiment.shortName experiment.name experiment.description taxon.name
GSE5389 Adult postmortem brain tiss… Bipolar affective disorder … human
GSE5388 Adult postmortem brain tiss… Bipolar affective disorder … human
GSE7036 Expression profiling in mon… To identify genes dysregula… human
McLean Hippocampus McLean Hippocampus Hippocampus of schizophreni… human
McLean_PFC McLean_PFC Prefrontal cortex of schizo… human
stanley_feinberg Stanley consortium collecti… 50 samples of individuals f… human

Note that a single call of these functions will only return 20 results by default and a 100 results maximum, controlled by the limit argument. In order to get all available results, use get_all_pages function on the output of the function

get_datasets(taxa = 'human') %>% 
    get_all_pages() %>% 
    select(experiment.shortName, experiment.name, 
           experiment.description,taxon.name) %>%
    head %>% gemma_kable
experiment.shortName experiment.name experiment.description taxon.name
GSE2018 Human Lung Transplant - BAL Bronchoalveolar lavage samp… human
GSE4036 perro-affy-human-186940 Our laboratory has develope… human
GSE3489 Patterns of gene dysregulat… The neurodegenerative proce… human
GSE1923 Identification of PDGF-depe… Overall study: Identificati… human
GSE361 Mammary epithelial cell tra… Analysis of gene expression… human
GSE492 Effect of prostaglandin ana… The purpose of this study i… human

See larger queries section for more details. To keep this vignette simpler we will keep using the first 20 results returned by default in examples below.

The filter parameter of get_datasets allows filtering for datasets with specific properties. A list of such properties can be accessed using filter_properties.

filter_properties()$dataset %>% head %>% gemma_kable()
properties type description
accession.accession string NA
accession.accessionVersion string NA
accession.externalDatabase…. string NA
accession.externalDatabase.id integer NA
accession.externalDatabase…. string NA
accession.externalDatabase…. string NA

These properties can be used together to fine tune your results

# all datasets with more than 4 samples annotated for any disease
get_datasets(filter = 'bioAssayCount > 4 and allCharacteristics.category = disease') %>%
    select(experiment.shortName, experiment.name, 
           experiment.description,taxon.name) %>%
    head %>% gemma_kable
experiment.shortName experiment.name experiment.description taxon.name
GSE2018 Human Lung Transplant - BAL Bronchoalveolar lavage samp… human
GSE4036 perro-affy-human-186940 Our laboratory has develope… human
GSE2866 Donarum-3R01NS040270-03S1 Succinate semialdehyde dehy… mouse
GSE2426 Pre-Neoplastic Stage of Med… SUMMARY Medulloblastoma is… mouse
GSE2867 Zoghbi-5R01NS027699-14 A number of human neurodege… mouse
GSE3489 Patterns of gene dysregulat… The neurodegenerative proce… human
# all datasets with ontology terms for Alzheimer's disease and Parkinson's disease
# this is equivalent to using the uris parameter
get_datasets(filter = 'allCharacteristics.valueUri in (http://purl.obolibrary.org/obo/MONDO_0004975,http://purl.obolibrary.org/obo/MONDO_0005180
)')  %>%
    select(experiment.shortName, experiment.name, 
           experiment.description,taxon.name) %>%
    head %>% gemma_kable
experiment.shortName experiment.name experiment.description taxon.name
GSE1555 Vehicle and sAPPalpha-treat… Gene expression changes ind… mouse
GSE4757 Rogers-3U24NS043571-01S1 Alzheimer’s Disease (AD) is… human
GSE6613 Parkinson’s disease vs. con… Parkinson?s disease (PD) pr… human
GSE12685 Expression of mRNAs Regulat… In Alzheimer’s disease (AD)… human
GSE14499 Effect of BDNF on the APP t… We examined transgenic (TG)… mouse
GSE10908 Differential gene expressio… In a transgenic mouse model… mouse

Dataset information provided by get_datasets also includes some quality information that can be used to determine the suitability of any given experiment. For instance geeq.batchEffect column will be set to -1 if Gemma’s preprocessing has detected batch effects that were unable to be resolved by batch correction. More information about these and other fields can be found at the function documentation.

get_datasets(taxa = 'human', filter = 'bioAssayCount > 4') %>% 
     filter(geeq.batchEffect !=-1) %>% 
    select(experiment.shortName, experiment.name, 
           experiment.description,taxon.name) %>%
    head %>% gemma_kable
experiment.shortName experiment.name experiment.description taxon.name
GSE2018 Human Lung Transplant - BAL Bronchoalveolar lavage samp… human
GSE4036 perro-affy-human-186940 Our laboratory has develope… human
GSE3489 Patterns of gene dysregulat… The neurodegenerative proce… human
GSE1923 Identification of PDGF-depe… Overall study: Identificati… human
GSE361 Mammary epithelial cell tra… Analysis of gene expression… human
GSE492 Effect of prostaglandin ana… The purpose of this study i… human

Gemma uses multiple ontologies when annotating datasets and using the term URIs instead of free text to search can lead to more specific results. search_annotations function allows searching for annotation terms that might be relevant to your query.

search_annotations('bipolar') %>% 
    head %>% gemma_kable()
category.name category.URI value.name value.URI
disease http://www.ebi…/EFO_0000408 bipolar II disorder http://purl.obolibrary…/MONDO_0000693
disease http://www.ebi…/EFO_0000408 bipolar I disorder http://purl.obolibrary…/MONDO_0001866
disease http://www.ebi…/EFO_0000408 bipolar disorder http://purl.obolibrary…/MONDO_0004985
disease http://www.ebi…/EFO_0000408 bipolar II disoder http://www.ebi…/EFO_0009964
disease http://www.ebi…/EFO_0000408 Bipolar depressed NA
disease status NA bipolar disorder (BD) NA

5 Downloading expression data

Upon identifying datasets of interest, more information about specific ones can be requested. In this example we will be using GSE46416 which includes samples taken from healthy donors along with manic/euthymic phase bipolar disorder patients.

The data associated with specific experiments can be accessed by using get_datasets_by_ids

get_datasets_by_ids("GSE46416") %>%
    select(experiment.shortName, experiment.name, 
           experiment.description,taxon.name) %>%
    head %>% gemma_kable
experiment.shortName experiment.name experiment.description taxon.name
GSE46416 State- and trait-specific g… Gene expression profiles of… human

To access the expression data in a convenient form, you can use get_dataset_object. It is a high-level wrapper that combines various endpoint calls to return lists of annotated SummarizedExperiment or ExpressionSet objects that are compatible with other Bioconductor packages or a tidyverse-friendly long form tibble for downstream analyses. These include the expression matrix along with the experimental design, and ensure the sample names match between both when transforming/subsetting data.

dat <- get_dataset_object("GSE46416",
                          type = 'se') # SummarizedExperiment is the default output type

Note that the tidy format is less memory efficient but allows easy visualization and exploration with ggplot2 and the rest of the tidyverse.

To show how subsetting works, we’ll keep the “manic phase” data and the reference_subject_roles, which refers to the control samples in Gemma datasets.

# Check the levels of the disease factor
dat[[1]]$disease %>% unique()
[1] "bipolar disorder has_modifier euthymic phase"
[2] "reference subject role"                      
[3] "bipolar disorder has_modifier manic phase"   
# Subset patients during manic phase and controls
manic <- dat[[1]][, dat[[1]]$disease == "bipolar disorder,manic phase" | 
        dat[[1]]$disease == "reference subject role"]
manic
class: SummarizedExperiment 
dim: 21407 10 
metadata(8): title abstract ... gemmaSuitabilityScore taxon
assays(1): counts
rownames(21407): 2315430 2315554 ... 7385683 7385696
rowData names(4): Probe GeneSymbol GeneName NCBIid
colnames(10): Control, 15 Control, 8 ... Control, 1_DE50 Control,
  2_DE40
colData names(3): factorValues disease block

Let’s take a look at sample to sample correlation in our subset.

# Get Expression matrix
manicExpr <- assay(manic, "counts")


manicExpr %>% 
    cor %>% 
    pheatmap(col =viridis(10),border_color = NA,angle_col = 45,fontsize = 7)
Sample to sample correlations of bipolar patients during manic phase and controls.

Figure 1: Sample to sample correlations of bipolar patients during manic phase and controls

You can also use get_dataset_expression to only get the expression matrix, get_dataset_samples to get the metadata information. The output of this function can be simplified to match the version included in the output of get_dataset_object by using make_design on the output.

get_dataset_samples('GSE46416') %>% make_design('text') %>% head %>%
    gemma_kable()
factorValues disease block
Bipolar disorder patient euthymic phase, 17 c("disea…. bipolar disorder has_modifi… Batch_02_26/11/09
Bipolar disorder patient euthymic phase, 34 c("disea…. bipolar disorder has_modifi… Batch_04_02/12/09
Control, 15 c("disea…. reference subject role Batch_02_26/11/09
Bipolar disorder patient euthymic phase, 32 c("disea…. bipolar disorder has_modifi… Batch_04_02/12/09
Control, 8 c("disea…. reference subject role Batch_01_25/11/09
Bipolar disorder patient manic phase, 21 c("block…. bipolar disorder has_modifi… Batch_03_27/11/09

6 Platform Annotations

Expression data in Gemma comes with annotations for the gene each expression profile corresponds to. Using the get_platform_annotations function, these annotations can be retrieved independently of the expression data, along with additional annotations such as Gene Ontology terms.

Examples:

head(get_platform_annotations('GPL96') %>% select(-GOTerms))
     ProbeName   GeneSymbols                                     GeneNames
        <char>        <char>                                        <char>
1: 211750_x_at TUBA1C|TUBA1A             tubulin alpha 1c|tubulin alpha 1a
2:   216678_at                                                            
3:   216345_at        ZSWIM8            zinc finger SWIM-type containing 8
4:   207273_at                                                            
5: 216025_x_at        CYP2C9 cytochrome P450 family 2 subfamily C member 9
6: 218191_s_at        LMBRD1                     LMBR1 domain containing 1
        GemmaIDs    NCBIids
          <char>     <char>
1: 360802|172797 84790|7846
2:                         
3:        235733      23053
4:                         
5:         32964       1559
6:        303717      55788
head(get_platform_annotations('Generic_human_ncbiIds') %>% select(-GOTerms))
   ElementName  GeneSymbols
         <int>       <char>
1:       55236         UBA6
2:       79664         ICE2
3:   100126270     FMR1-AS1
4:   105373684    LINC01818
5:   124900245 LOC124900245
6:   102725051 LOC102725051
                                               GeneNames GemmaIDs   NCBIids
                                                  <char>    <int>     <int>
1:           ubiquitin like modifier activating enzyme 6   295849     55236
2: interactor of little elongation complex ELL subunit 2   336840     79664
3:                                  FMR1 antisense RNA 1  3157248 100126270
4:           long intergenic non-protein coding RNA 1818  9235895 105373684
5:                           small nucleolar RNA SNORA40 10578422 124900245
6:                          uncharacterized LOC102725051  9008981 102725051

If you are interested in a particular gene, you can see which platforms include it using get_gene_probes. Note that functions to search gene work best with unambigious identifiers rather than symbols.

# lists genes in gemma matching the symbol or identifier
get_genes('Eno2') %>% gemma_kable()
gene.symbol gene.ensembl gene.NCBI gene.name gene.aliases gene.MFX.rank taxon.name taxon.scientific taxon.ID taxon.NCBI taxon.database.name taxon.database.ID
ENO2 ENSG00000111674 2026 enolase 2 HEL-S-27…. 0.9033096 human Homo sapiens 1 9606 hg38 87
Eno2 ENSMUSG00000004267 13807 enolase 2, gamma neuronal D6Ertd37…. 0.8980022 mouse Mus musculus 2 10090 mm10 81
Eno2 ENSRNOG00000013141 24334 enolase 2 NSE, RNEN3 0.8822730 rat Rattus norvegicus 3 10116 rn6 86
# ncbi id for human ENO2
probes <- get_gene_probes(2026)

# remove the description for brevity of output
head(probes[,.SD, .SDcols = !colnames(probes) %in% c('mapping.Description','platform.Description')]) %>%
    gemma_kable()
mapping.name mapping.description platform.shortName platform.name platform.ID platform.type platform.description platform.troubled taxon.name taxon.scientific taxon.ID taxon.NCBI taxon.database.name taxon.database.ID
201313_at enolase 2 (gamma, neuronal) NA Affymetrix GeneChip Human G… 1 ONECOLOR The U133 set includes 2 ar… FALSE human Homo sapiens 1 9606 hg38 87
201313_at enolase 2 (gamma, neuronal) NA Affymetrix GeneChip Human G… 4 ONECOLOR Complete coverage of the H… FALSE human Homo sapiens 1 9606 hg38 87
40193_at enolase 2 (gamma, neuronal) NA Affymetrix GeneChip Human G… 8 ONECOLOR The Human Genome U95 (HG-U… FALSE human Homo sapiens 1 9606 hg38 87
1639 NA NA CHUGAI 41K 36 TWOCOLOR Patients and Sample Prepar… FALSE human Homo sapiens 1 9606 hg38 87
6621 NA NA UP_5Ka 38 TWOCOLOR A cDNA microarray with ~50… FALSE human Homo sapiens 1 9606 hg38 87
861 NA NA UP_5Ka 38 TWOCOLOR A cDNA microarray with ~50… FALSE human Homo sapiens 1 9606 hg38 87

7 Differential expression analyses

Gemma contains precomputed differential expression analyses for most of its datasets. Analyses can involve more than one factor, such as “sex” as well as “disease”. Some datasets contain more than one analysis to account for different factors and their interactions. The results are stored as resultSets, each corresponding to one factor (or their interaction). You can access them using get_differential_expression_values. From here on, we can explore and visualize the data to find the most differentially-expressed genes

Note that get_differential_expression_values can return multiple differentials per study if a study has multiple factors to contrast. Since GSE46416 only has one extracting the first element of the returned list is all we need.

dif_exp <- get_differential_expression_values('GSE46416')
dif_exp[[1]] %>% head %>% gemma_kable()
Probe NCBIid GeneSymbol GeneName pvalue corrected_pvalue rank contrast_113004_log2fc contrast_113004_tstat contrast_113004_pvalue contrast_113005_log2fc contrast_113005_tstat contrast_113005_pvalue
2982730 4018 LPA lipoprotein(a) 0.9185 0.9521 0.9647 -0.0247 -0.3587 0.7221 -0.0249 -0.3622 0.7196
2787851 166752 FREM3 FRAS1 related extracellular… 0.5850 0.7348 0.7960 0.1895 1.0119 0.3191 0.1403 0.7495 0.4590
2477558 0.3965 0.5931 0.6685 0.0918 1.1386 0.2633 0.1016 1.2604 0.2166
2910917 0.6506 0.7815 0.8325 0.1564 0.6739 0.5052 0.2096 0.9032 0.3731
3983537 140886 PABPC5 poly(A) binding protein cyt… 0.1064 0.3227 0.3299 0.1845 2.0320 0.0505 0.1603 1.7656 0.0869
3538213 23002 DAAM1 dishevelled associated acti… 0.2271 0.4477 0.5073 0.3398 1.7279 0.0936 0.1200 0.6104 0.5459

By default the columns names of the output correspond to contrast IDs. To see what conditions these IDs correspond to we can either use get_dataset_differential_expression_analyses to get the metadata about differentials of a given dataset, or set readableContrasts argument of get_differential_expression_values to TRUE. The former approach is usually better for a large scale systematic analysis while the latter is easier to read in an interactive session.

get_dataset_differential_expression_analyses function returns structured metadata about the differentials.

contrasts <- get_dataset_differential_expression_analyses('GSE46416')
contrasts %>% gemma_kable()
result.ID contrast.ID experiment.ID factor.category factor.category.URI factor.ID baseline.factors experimental.factors subsetFactor.subset subsetFactor probes.analyzed genes.analyzed
550248 113004 8997 disease http://www.ebi…/EFO_0000408 19134 disease,…. disease,…. FALSE logical(…. 21961 18959
550248 113005 8997 disease http://www.ebi…/EFO_0000408 19134 disease,…. disease,…. FALSE logical(…. 21961 18959

contrast.ID column corresponds to the column names in the output of get_differential_expression_values while result.ID corresponds to the name of the differential in the output object. Using them together will let one to access differentially expressed gene counts for each condition contrast

# using result.ID and contrast.ID of the output above, we can access specific
# results. Note that one study may have multiple contrast objects
seq_len(nrow(contrasts)) %>% sapply(function(i){
    result_set = dif_exp[[as.character(contrasts[i,]$result.ID)]]
    p_values = result_set[[glue::glue("contrast_{contrasts[i,]$contrast.ID}_pvalue")]]
    
    # multiple testing correction
    sum(p.adjust(p_values,method = 'BH') < 0.05)
}) -> dif_exp_genes

contrasts <- data.table(result.ID = contrasts$result.ID,
                        contrast.id = contrasts$contrast.ID,
                        baseline.factorValue = contrasts$baseline.factors,
                        experimental.factorValue = contrasts$experimental.factors,
                        n_diff = dif_exp_genes)

contrasts %>% gemma_kable()
result.ID contrast.id baseline.factorValue experimental.factorValue n_diff
550248 113004 disease,…. disease,…. 3
550248 113005 disease,…. disease,…. 1410
contrasts$baseline.factorValue
[[1]]
   category                         category.URI                  value
     <char>                               <char>                 <char>
1:  disease http://www.ebi.ac.uk/efo/EFO_0000408 reference subject role
                                    value.URI predicate predicate.URI object
                                       <char>    <lgcl>        <lgcl> <lgcl>
1: http://purl.obolibrary.org/obo/OBI_0000220        NA            NA     NA
   object.URI                summary     ID factor.ID factor.category
       <lgcl>                 <char>  <int>     <int>          <char>
1:         NA reference subject role 113006     19134         disease
                    factor.category.URI
                                 <char>
1: http://www.ebi.ac.uk/efo/EFO_0000408

[[2]]
   category                         category.URI                  value
     <char>                               <char>                 <char>
1:  disease http://www.ebi.ac.uk/efo/EFO_0000408 reference subject role
                                    value.URI predicate predicate.URI object
                                       <char>    <lgcl>        <lgcl> <lgcl>
1: http://purl.obolibrary.org/obo/OBI_0000220        NA            NA     NA
   object.URI                summary     ID factor.ID factor.category
       <lgcl>                 <char>  <int>     <int>          <char>
1:         NA reference subject role 113006     19134         disease
                    factor.category.URI
                                 <char>
1: http://www.ebi.ac.uk/efo/EFO_0000408
contrasts$experimental.factorValue
[[1]]
   category                         category.URI            value
     <char>                               <char>           <char>
1:  disease http://www.ebi.ac.uk/efo/EFO_0000408 bipolar disorder
                                      value.URI    predicate
                                         <char>       <char>
1: http://purl.obolibrary.org/obo/MONDO_0004985 has_modifier
                               predicate.URI      object object.URI
                                      <char>      <char>     <char>
1: http://purl.obolibrary.org/obo/RO_0002573 manic phase       <NA>
                                     summary     ID factor.ID factor.category
                                      <char>  <int>     <int>          <char>
1: bipolar disorder has_modifier manic phase 113004     19134         disease
                    factor.category.URI
                                 <char>
1: http://www.ebi.ac.uk/efo/EFO_0000408

[[2]]
   category                         category.URI            value
     <char>                               <char>           <char>
1:  disease http://www.ebi.ac.uk/efo/EFO_0000408 bipolar disorder
                                      value.URI    predicate
                                         <char>       <char>
1: http://purl.obolibrary.org/obo/MONDO_0004985 has_modifier
                               predicate.URI         object object.URI
                                      <char>         <char>     <char>
1: http://purl.obolibrary.org/obo/RO_0002573 euthymic phase       <NA>
                                        summary     ID factor.ID
                                         <char>  <int>     <int>
1: bipolar disorder has_modifier euthymic phase 113005     19134
   factor.category                  factor.category.URI
            <char>                               <char>
1:         disease http://www.ebi.ac.uk/efo/EFO_0000408

Alternatively we, since we are only looking at one dataset and one contrast manually, we can simply use readableContrasts.

de <- get_differential_expression_values("GSE46416",readableContrasts = TRUE)[[1]]
de %>% head %>% gemma_kable
Probe NCBIid GeneSymbol GeneName pvalue corrected_pvalue rank contrast_bipolar disorder has_modifier manic phase_logFoldChange contrast_bipolar disorder has_modifier manic phase_tstat contrast_bipolar disorder has_modifier manic phase_pvalue contrast_bipolar disorder has_modifier euthymic phase_logFoldChange contrast_bipolar disorder has_modifier euthymic phase_tstat contrast_bipolar disorder has_modifier euthymic phase_pvalue
2982730 4018 LPA lipoprotein(a) 0.9185 0.9521 0.9647 -0.0247 -0.3587 0.7221 -0.0249 -0.3622 0.7196
2787851 166752 FREM3 FRAS1 related extracellular… 0.5850 0.7348 0.7960 0.1895 1.0119 0.3191 0.1403 0.7495 0.4590
2477558 0.3965 0.5931 0.6685 0.0918 1.1386 0.2633 0.1016 1.2604 0.2166
2910917 0.6506 0.7815 0.8325 0.1564 0.6739 0.5052 0.2096 0.9032 0.3731
3983537 140886 PABPC5 poly(A) binding protein cyt… 0.1064 0.3227 0.3299 0.1845 2.0320 0.0505 0.1603 1.7656 0.0869
3538213 23002 DAAM1 dishevelled associated acti… 0.2271 0.4477 0.5073 0.3398 1.7279 0.0936 0.1200 0.6104 0.5459
# Classify probes for plotting
de$diffexpr <- "No"
de$diffexpr[de$`contrast_bipolar disorder has_modifier manic phase_logFoldChange` > 1.0 & 
        de$`contrast_bipolar disorder has_modifier manic phase_pvalue` < 0.05] <- "Up"
de$diffexpr[de$`contrast_bipolar disorder has_modifier manic phase_logFoldChange` < -1.0 & 
        de$`contrast_bipolar disorder has_modifier manic phase_pvalue` < 0.05] <- "Down"

# Upregulated probes
filter(de, diffexpr == "Up") %>%
    arrange(`contrast_bipolar disorder has_modifier manic phase_pvalue`) %>%
    select(Probe, GeneSymbol, `contrast_bipolar disorder has_modifier manic phase_pvalue`, 
        `contrast_bipolar disorder has_modifier manic phase_logFoldChange`) %>%
    head(10) %>% gemma_kable()
Probe GeneSymbol contrast_bipolar disorder has_modifier manic phase_pvalue contrast_bipolar disorder has_modifier manic phase_logFoldChange
2319550 RBP7 8.61e-05 1.0740
2548699 CYP1B1 1.00e-04 1.3225
3907190 SLPI 3.00e-04 1.0558
3629103 PCLAF 5.00e-04 1.2783
3545525 SLIRP 6.00e-04 1.3490
3146433 COX6C 9.00e-04 1.4670
2899102 H3C3 1.30e-03 1.0260
2538349 1.30e-03 1.0731
3635198 BCL2A1 1.80e-03 1.0798
2633191 GPR15 2.40e-03 1.2046
# Downregulated probes
filter(de, diffexpr == "Down") %>%
    arrange(`contrast_bipolar disorder has_modifier manic phase_pvalue`) %>%
    select(Probe, GeneSymbol, `contrast_bipolar disorder has_modifier manic phase_pvalue`, 
        `contrast_bipolar disorder has_modifier manic phase_logFoldChange`) %>%
    head(10) %>% gemma_kable()
Probe GeneSymbol contrast_bipolar disorder has_modifier manic phase_pvalue contrast_bipolar disorder has_modifier manic phase_logFoldChange
2775390 2.10e-06 -1.5558
3760268 1.15e-05 -1.8506
3124344 1.00e-04 -1.0370
3673179 2.00e-04 -1.0340
3245871 WDFY4 2.00e-04 -1.1569
3022689 SND1-IT1 2.00e-04 -1.2199
2679014 3.00e-04 -1.1752
4019758 4.00e-04 -1.4049
3336402 RBM14 4.00e-04 -1.0711
3930525 RUNX1-IT1 4.00e-04 -1.5169
# Add gene symbols as labels to DE genes
de$delabel <- ""
de$delabel[de$diffexpr != "No"] <- de$GeneSymbol[de$diffexpr != "No"]

# Volcano plot for bipolar patients vs controls
ggplot(
    data = de,
    aes(
        x = `contrast_bipolar disorder has_modifier manic phase_logFoldChange`,
        y = -log10(`contrast_bipolar disorder has_modifier manic phase_pvalue`),
        color = diffexpr,
        label = delabel
    )
) +
    geom_point() +
    geom_hline(yintercept = -log10(0.05), col = "gray45", linetype = "dashed") +
    geom_vline(xintercept = c(-1.0, 1.0), col = "gray45", linetype = "dashed") +
    labs(x = "log2(FoldChange)", y = "-log10(p-value)") +
    scale_color_manual(values = c("blue", "black", "red")) +
    geom_text_repel(show.legend = FALSE) +
    theme_minimal()
Differentially-expressed genes in bipolar patients during manic phase versus controls.

Figure 2: Differentially-expressed genes in bipolar patients during manic phase versus controls

8 Larger queries

To query large amounts of data, the API has a pagination system which uses the limit and offset parameters. To avoid overloading the server, calls are limited to a maximum of 100 entries, so the offset allows you to get the next batch of entries in the next call(s).

To simplify the process of accessing all available data, gemma.R includes the get_all_pages function which can use the output from one page to make all the follow up requests

get_platforms_by_ids() %>% 
    get_all_pages() %>% head %>% gemma_kable()
platform.ID platform.shortName platform.name platform.description platform.troubled platform.experimentCount platform.type taxon.name taxon.scientific taxon.ID taxon.NCBI taxon.database.name taxon.database.ID
1 GPL96 Affymetrix GeneChip Human G… The U133 set includes 2 ar… FALSE 397 ONECOLOR human Homo sapiens 1 9606 hg38 87
2 GPL1355 Affymetrix GeneChip Rat Gen… The GeneChip Rat Genome 23… FALSE 295 ONECOLOR rat Rattus norvegicus 3 10116 rn6 86
3 GPL1261 Affymetrix GeneChip Mouse G… All probe sets represented… FALSE 1295 ONECOLOR mouse Mus musculus 2 10090 mm10 81
4 GPL570 Affymetrix GeneChip Human G… Complete coverage of the H… FALSE 1481 ONECOLOR human Homo sapiens 1 9606 hg38 87
5 GPL81 Affymetrix GeneChip Murine … The MG-U74 set includes 3 … FALSE 198 ONECOLOR mouse Mus musculus 2 10090 mm10 81
6 GPL85 Affymetrix GeneChip Rat Gen… The RG-U34 set includes 3 … FALSE 88 ONECOLOR rat Rattus norvegicus 3 10116 rn6 86

Alternative way to access all pages is to do so manually. To see how many available results are there, you can look at the attributes of the output objects where additional information from the API response is appended.

platform_count = attributes(get_platforms_by_ids(limit = 1))$totalElements
print(platform_count)
[1] 634

After which you can use offset to access all available platforms.

lapply(seq(0,platform_count,100), function(offset){
    get_platforms_by_ids(limit = 100, offset = offset) %>%
        select(platform.ID, platform.shortName, taxon.name)
}) %>% do.call(rbind,.) %>% 
    head %>% gemma_kable()
platform.ID platform.shortName taxon.name
1 GPL96 human
2 GPL1355 rat
3 GPL1261 mouse
4 GPL570 human
5 GPL81 mouse
6 GPL85 rat

Many endpoints only support a single identifier:

get_dataset_annotations(c("GSE35974", "GSE46416"))
Error: Please specify one valid identifier for dataset.

In these cases, you will have to loop over all the identifiers you wish to query and send separate requests.

lapply(c("GSE35974", "GSE12649"), function(dataset) {
    get_dataset_annotations(dataset) %>% 
        mutate(experiment.shortName = dataset) %>%
        select(experiment.shortName, class.name, term.name)
}) %>% do.call(rbind,.) %>% gemma_kable()
experiment.shortName class.name term.name
GSE35974 biological sex female
GSE35974 disease depressive disorder
GSE35974 disease bipolar disorder
GSE35974 biological sex male
GSE35974 labelling biotin
GSE35974 disease schizophrenia
GSE35974 organism part cerebellum
GSE12649 disease schizophrenia
GSE12649 labelling biotin
GSE12649 disease bipolar disorder
GSE12649 organism part prefrontal cortex

9 Output options

9.1 Raw data

By default, Gemma API does some parsing on the raw API results to make it easier to work with inside of R. In the process, it drops some typically unused values. If you wish to fetch everything, use raw = TRUE. Instead of a data table, you’ll usually be served a list that represents the underlying JSON response.

get_gene_locations("DYRK1A") %>% gemma_kable()
chromosome strand nucleotide length taxon.name taxon.scientific taxon.ID taxon.NCBI taxon.database.name taxon.database.ID
11
33890705 118714 rat Rattus norvegicus 3 10116 rn6 86
21
37365572 160785 human Homo sapiens 1 9606 hg38 87
16
94370769 125608 mouse Mus musculus 2 10090 mm10 81
get_gene_locations("DYRK1A", raw = TRUE) %>% jsonedit()

9.2 File outputs

Sometimes, you may wish to save results to a file for future inspection. You can do this simply by providing a filename to the file parameter. The extension for this file will be one of three options:

  1. .json, if you requested results with raw=TRUE
  2. .csv if the results have no nested data tables
  3. .rds otherwise

You can also specify whether or not the new fetched results are allowed to overwrite an existing file by specifying the overwrite = TRUE parameter.

9.3 Memoise data

To speed up results, you can remember past results so future queries can proceed virtually instantly. This is enabled through the memoise package. To enable memoisation, simply set memoised = TRUE in the function call whenever you want to refer to the cache, both to save data for future calls and use the saved data for repeated calls.

By default this will create a cache in your local filesystem.

You can see the path to your cache by using rappdirs::user_cache_dir(appname = "gemmaR"). If you want to use a different directory, use options(gemma.cache = 'path').

If you’re done with your fetching and want to ensure no space is being used for cached results, or if you just want to ensure you’re getting up-to-date data from Gemma, you can clear the cache using forget_gemma_memoised.

9.4 Changing defaults

We’ve seen how to change raw = TRUE, overwrite = TRUE and memoised = TRUE in individual function calls. It’s possible that you want to always use the functions these ways without specifying the option every time. You can do this by simply changing the default, which is visible in the function definition. See below for examples.

options(gemma.memoised = TRUE) # always refer to cache
options(gemma.overwrite = TRUE) # always overwrite when saving files
options(gemma.raw = TRUE) # always receive results as-is from Gemma

10 Session info

sessionInfo()
R Under development (unstable) (2024-01-16 r85808)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] listviewer_4.0.0            viridis_0.6.5              
 [3] viridisLite_0.4.2           pheatmap_1.0.12            
 [5] SummarizedExperiment_1.33.3 Biobase_2.63.0             
 [7] GenomicRanges_1.55.3        GenomeInfoDb_1.39.7        
 [9] IRanges_2.37.1              S4Vectors_0.41.3           
[11] BiocGenerics_0.49.1         MatrixGenerics_1.15.0      
[13] matrixStats_1.2.0           ggrepel_0.9.5              
[15] ggplot2_3.5.0               dplyr_1.1.4                
[17] data.table_1.15.2           gemma.R_2.99.1             
[19] BiocStyle_2.31.0           

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0        farver_2.1.1            fastmap_1.1.1          
 [4] digest_0.6.34           timechange_0.3.0        lifecycle_1.0.4        
 [7] ellipsis_0.3.2          magrittr_2.0.3          compiler_4.4.0         
[10] rlang_1.1.3             sass_0.4.8              tools_4.4.0            
[13] utf8_1.2.4              yaml_2.3.8              knitr_1.45             
[16] labeling_0.4.3          S4Arrays_1.3.5          htmlwidgets_1.6.4      
[19] bit_4.0.5               curl_5.2.0              DelayedArray_0.29.9    
[22] xml2_1.3.6              RColorBrewer_1.1-3      abind_1.4-5            
[25] withr_3.0.0             purrr_1.0.2             grid_4.4.0             
[28] fansi_1.0.6             colorspace_2.1-0        scales_1.3.0           
[31] cli_3.6.2               rmarkdown_2.25          crayon_1.5.2           
[34] generics_0.1.3          rstudioapi_0.15.0       httr_1.4.7             
[37] cachem_1.0.8            stringr_1.5.1           zlibbioc_1.49.0        
[40] assertthat_0.2.1        BiocManager_1.30.22     XVector_0.43.1         
[43] vctrs_0.6.5             Matrix_1.6-5            jsonlite_1.8.8         
[46] bookdown_0.37           bit64_4.0.5             magick_2.8.3           
[49] systemfonts_1.0.5       jquerylib_0.1.4         glue_1.7.0             
[52] lubridate_1.9.3         stringi_1.8.3           gtable_0.3.4           
[55] munsell_0.5.0           tibble_3.2.1            pillar_1.9.0           
[58] rappdirs_0.3.3          htmltools_0.5.7         GenomeInfoDbData_1.2.11
[61] R6_2.5.1                evaluate_0.23           kableExtra_1.4.0       
[64] lattice_0.22-5          highr_0.10              memoise_2.0.1          
[67] bslib_0.6.1             Rcpp_1.0.12             svglite_2.1.3          
[70] gridExtra_2.3           SparseArray_1.3.4       xfun_0.42              
[73] pkgconfig_2.0.3