Chapter 6 Exploiting the cell ontology

6.1 Motivation

As previously discussed in Section 5.4, SingleR maps the labels in its references to the Cell Ontology. The most obvious advantage of doing this is to provide a standardized vocabulary with which to describe cell types, thus facilitating integrated analyses with multiple references. However, another useful feature of the Cell Ontology is its hierarchical organization of terms, allowing us to adjust cell type annotations to the desired resolution. This represents a more dynamic alternative to the static label.main and label.fine options in each reference.

6.2 Basic manipulation

We use the ontoProc package to load in the Cell Ontology. This produces an ontology_index object (from the ontologyIndex package) that we can query for various pieces of information.

## Ontology with 2249 terms
## 
## format-version: 1.2
## data-version: releases/2020-10-02
## ontology: cl
## 
## Properties:
##  id: character
##  name: character
##  parents: list
##  children: list
##  ancestors: list
##  obsolete: logical
##  RO:0002202: list
##  alt_id: list
##  comment: character
##  consider: list
##  created_by: character
##  creation_date: character
##  data-version: list
##  def: character
##  format-version: list
##  holds_over_chain: list
##  is_a: list
##  is_transitive: list
##  namespace: list
##  ontology: list
##  property_value: list
##  remark: list
##  replaced_by: list
##  subset: list
##  subsetdef: list
##  synonym: list
##  synonymtypedef: list
##  transitive_over: list
##  xref: list
## Roots:
##  CL:0000000 - cell
##  RO:0002202 - develops_from

The most immediate use of this object lies in mapping ontology terms to their plain-English descriptions. We can use this to translate annotations produced by SingleR() from the label.ont labels into a more interpretable form. We demonstrate this approach using celldex’s collection of mouse RNA-seq references (Aran et al. 2019).

##                         CL:0000000                         CL:0000001 
##                             "cell"            "primary cultured cell" 
##                         CL:0000002                         CL:0000003 
## "obsolete immortal cell line cell"                      "native cell" 
##                         CL:0000004                         CL:0000005 
##        "obsolete cell by organism"  "fibroblast neural crest derived"
##                                                                                                                                                                                                                     CL:0000000 
##                                   "\"A material entity of anatomical origin (part of or deriving from an organism) that has as its parts a maximally connected cell compartment surrounded by a plasma membrane.\" [CARO:mah]" 
##                                                                                                                                                                                                                     CL:0000001 
##                                                                 "\"A cultured cell that is freshly isolated from a organismal source, or derives in culture from such a cell prior to the culture being passaged.\" [ReO:mhb]" 
##                                                                                                                                                                                                                     CL:0000002 
##             "\"OBSOLETE: A cell line cell that is expected to be capable of an unlimited number of divisions, and is thus able to support indefinite growth/propagation in vitro as part of a immortal cell line.\" [ReO:mhb]" 
##                                                                                                                                                                                                                     CL:0000003 
## "\"A cell that is found in a natural setting, which includes multicellular organism cells 'in vivo' (i.e. part of an organism), and unicellular organisms 'in environment' (i.e. part of a natural environment).\" [CARO:mah]" 
##                                                                                                                                                                                                                     CL:0000004 
##                                                                                                                            "\"OBSOLETE: A classification of cells by the organisms within which they are contained.\" [FB:ma]" 
##                                                                                                                                                                                                                     CL:0000005 
##                                                                                                                           "\"Any fibroblast that is deriived from the neural crest.\" [https://orcid.org/0000-0001-5208-3432]"
## CL:0000136 CL:0000136 CL:0000136 CL:0000136 CL:0000136 CL:0000136 
## "fat cell" "fat cell" "fat cell" "fat cell" "fat cell" "fat cell"
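The translation step amounts to indexing a named character vector by term identifier. The following is a minimal, self-contained sketch that uses a hand-written lookup table instead of the real ontology; `term_names` and `labels` are hypothetical stand-ins for the `name` field of the ontology object and for a `label.ont` field, respectively.

```r
# Hand-written stand-in for the 'name' field of an ontology_index object.
# Identifiers and names are copied from output shown in this chapter.
term_names <- c(
    "CL:0000057" = "fibroblast",
    "CL:0000136" = "fat cell",
    "CL:0000540" = "neuron"
)

# Hypothetical per-cell ontology labels, e.g., as stored in a label.ont field.
labels <- c("CL:0000136", "CL:0000136", "CL:0000540", NA, "CL:0000057")

# Indexing by name translates each identifier; missing labels remain NA.
translated <- unname(term_names[labels])
print(translated)

# Tabulate the translated labels, keeping track of unmapped entries.
print(table(translated, useNA = "ifany"))
```

Labels without an ontology mapping stay as `NA` after translation, so it may be worth counting them before interpreting the results.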

Another interesting application involves examining the relationship between different terms. The ontology itself is a directed acyclic graph, so we can convert it into a graph object for advanced queries using the igraph package. Each edge represents an “is a” relationship, where the child vertex is a specialized case of the concept represented by its parent vertex.

## IGRAPH 5c4c8d2 DN-- 2248 3185 -- 
## + attr: name (v/c)
## + edges from 5c4c8d2 (vertex names):
##  [1] CL:0000010->CL:0000001 CL:0000000->CL:0000003 CL:0000057->CL:0000005
##  [4] CL:0000101->CL:0000006 CL:0000197->CL:0000006 CL:0002321->CL:0000007
##  [7] CL:0000333->CL:0000008 CL:0000578->CL:0000010 CL:0000333->CL:0000011
## [10] CL:0000034->CL:0000014 CL:0000039->CL:0000014 CL:0000586->CL:0000015
## [13] CL:0000014->CL:0000016 CL:0000015->CL:0000016 CL:0000015->CL:0000017
## [16] CL:0000015->CL:0000018 CL:0000413->CL:0000018 CL:0000408->CL:0000019
## [19] CL:0000015->CL:0000020 CL:0000586->CL:0000021 CL:0000014->CL:0000022
## [22] CL:0000021->CL:0000022 CL:0000021->CL:0000023 CL:0000021->CL:0000024
## + ... omitted several edges
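One way to build such a graph is to turn the per-term list of parents into an edge list, with each edge pointing from parent to child as in the printout above. The sketch below does this for a small invented hierarchy; the `parents` list is a toy stand-in for the `parents` field of the ontology object, and the term names are simplified for illustration rather than taken from the Cell Ontology.

```r
library(igraph)

# Toy stand-in for the 'parents' field of an ontology_index object:
# a named list mapping each term to its direct parents.
parents <- list(
    "cell" = character(0),
    "lymphocyte" = "cell",
    "T cell" = "lymphocyte",
    "B cell" = "lymphocyte",
    "regulatory T cell" = "T cell",
    "CD4 T cell" = "T cell",
    "CD4 regulatory T cell" = c("CD4 T cell", "regulatory T cell")
)

# One row per parent -> child edge; terms with several parents are repeated.
child <- rep(names(parents), lengths(parents))
parent <- unlist(parents, use.names = FALSE)
g <- graph_from_edgelist(cbind(parent, child), directed = TRUE)

print(g)

# An ontology of "is a" relationships should not contain cycles.
print(is_dag(g))
```

Terms that have neither parents nor children do not appear in any edge, so they would be absent from a graph built this way.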

One query involves identifying all descendants of a particular term of interest. This can be useful when searching for a cell type in the presence of variable annotation resolution; for example, a search for “epithelial cell” can be configured to pick up all descendant terms such as “endothelial cell” and “ependymal cell”.

##                        CL:0000624 
## "CD4-positive, alpha-beta T cell"
##                                                       CL:0000624 
##                                "CD4-positive, alpha-beta T cell" 
##                                                       CL:0000492 
##                                     "CD4-positive helper T cell" 
##                                                       CL:0001051 
## "CD4-positive, CXCR3-negative, CCR6-negative, alpha-beta T cell" 
##                                                       CL:0000791 
##                                       "mature alpha-beta T cell" 
##                                                       CL:0000792 
##      "CD4-positive, CD25-positive, alpha-beta regulatory T cell" 
##                                                       CL:0000793 
##                "CD4-positive, alpha-beta intraepithelial T cell"
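A descendant query can be sketched with igraph's `subcomponent()` on a toy graph whose edges point from parent to child. The term names and the `labels` vector below are invented for illustration. Note that `subcomponent()` defaults to `mode = "all"`, which ignores edge direction, so `mode = "out"` is needed to restrict the result to descendants.

```r
library(igraph)

# Toy ontology with parent -> child edges (names invented for illustration).
edges <- matrix(c(
    "cell", "lymphocyte",
    "lymphocyte", "T cell",
    "lymphocyte", "B cell",
    "T cell", "regulatory T cell",
    "T cell", "CD4 T cell",
    "CD4 T cell", "CD4 regulatory T cell",
    "regulatory T cell", "CD4 regulatory T cell"
), ncol = 2, byrow = TRUE)
g <- graph_from_edgelist(edges, directed = TRUE)

# mode = "out" follows edges away from the term, i.e., towards its children.
# The result includes the query term itself, which we drop here.
term <- "T cell"
hits <- subcomponent(g, term, mode = "out")
descendants <- setdiff(as_ids(hits), term)
print(descendants)

# Searching hypothetical per-cell labels for the term or any of its descendants.
labels <- c("B cell", "CD4 T cell", "T cell", "CD4 regulatory T cell")
is_t_cell <- labels %in% c(term, descendants)
print(is_t_cell)
```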

Alternatively, we might be interested in the last common ancestor (LCA) for a set of terms. This is the term (or, in some cases, the set of terms) furthest from the root of the ontology that is also an ancestor of all of the terms of interest. We will use this LCA concept in the next section to adjust resolution across multiple references.

##                        CL:0000624                        CL:0000785 
## "CD4-positive, alpha-beta T cell"                   "mature B cell" 
##                        CL:0000623 
##             "natural killer cell"
##   CL:0000542 
## "lymphocyte"
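The LCA definition above can be turned into a short procedure: collect the ancestors of each term, intersect them, and discard any common ancestor that sits above another common ancestor. The sketch below is one possible implementation on a toy graph and is not necessarily how the output above was produced; `ancestors_of()` and `find_lca()` are hypothetical helper names, and the term names are invented.

```r
library(igraph)

# Toy ontology with parent -> child edges (names invented for illustration).
edges <- matrix(c(
    "cell", "lymphocyte",
    "lymphocyte", "T cell",
    "lymphocyte", "B cell",
    "T cell", "regulatory T cell",
    "T cell", "CD4 T cell",
    "CD4 T cell", "CD4 regulatory T cell",
    "regulatory T cell", "CD4 regulatory T cell"
), ncol = 2, byrow = TRUE)
g <- graph_from_edgelist(edges, directed = TRUE)

ancestors_of <- function(graph, term) {
    # mode = "in" walks against the edge direction, i.e., towards the root.
    # The result includes the term itself.
    as_ids(subcomponent(graph, term, mode = "in"))
}

find_lca <- function(graph, terms) {
    # Terms that are an ancestor of (or equal to) every query term.
    common <- Reduce(intersect, lapply(terms, ancestors_of, graph = graph))
    # Strict ancestors of each common ancestor; these are closer to the root.
    higher <- unlist(lapply(common, function(x) {
        setdiff(ancestors_of(graph, x), x)
    }))
    # What remains are the common ancestors furthest from the root.
    setdiff(common, higher)
}

print(find_lca(g, c("CD4 T cell", "regulatory T cell")))
print(find_lca(g, c("CD4 regulatory T cell", "B cell")))
```

Because each term is counted among its own ancestors in this sketch, querying a term together with one of its descendants returns the term itself.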

6.3 Adjusting resolution

We can use the ontology graph to adjust the resolution of the reference labels by rolling up overly specific terms to their LCA. The findCommonAncestors() utility takes a set of terms and returns a list of potential LCAs for various subsets of those terms. Users can inspect this list to identify LCAs at the desired resolution and then map their descendant terms to those LCAs.

## $`CL:0000081`
## $`CL:0000081`$name
## [1] "blood cell"
## 
## $`CL:0000081`$descendents
## DataFrame with 2 rows and 2 columns
##                   name      set1
##            <character> <logical>
## CL:0000232 erythrocyte      TRUE
## CL:0000094 granulocyte      TRUE
## 
## 
## $`CL:0000126`
## $`CL:0000126`$name
## [1] "macroglial cell"
## 
## $`CL:0000126`$descendents
## DataFrame with 2 rows and 2 columns
##                       name      set1
##                <character> <logical>
## CL:0000127       astrocyte      TRUE
## CL:0000128 oligodendrocyte      TRUE
## 
## 
## $`CL:0000393`
## $`CL:0000393`$name
## [1] "electrically responsive cell"
## 
## $`CL:0000393`$descendents
## DataFrame with 2 rows and 2 columns
##                           name      set1
##                    <character> <logical>
## CL:0000540              neuron      TRUE
## CL:0000746 cardiac muscle cell      TRUE
## 
## 
## $`CL:0002320`
## $`CL:0002320`$name
## [1] "connective tissue cell"
## 
## $`CL:0002320`$descendents
## DataFrame with 2 rows and 2 columns
##                   name      set1
##            <character> <logical>
## CL:0000136    fat cell      TRUE
## CL:0000057  fibroblast      TRUE
## 
## 
## $`CL:0011115`
## $`CL:0011115`$name
## [1] "precursor cell"
## 
## $`CL:0011115`$descendents
## DataFrame with 2 rows and 2 columns
##                          name      set1
##                   <character> <logical>
## CL:0000047 neuronal stem cell      TRUE
## CL:0000576           monocyte      TRUE
## 
## 
## $`CL:0000066`
## $`CL:0000066`$name
## [1] "epithelial cell"
## 
## $`CL:0000066`$descendents
## DataFrame with 3 rows and 2 columns
##                        name      set1
##                 <character> <logical>
## CL:0000115 endothelial cell      TRUE
## CL:0000182       hepatocyte      TRUE
## CL:0000065   ependymal cell      TRUE

We can also use this function to synchronize multiple sets of terms to the same resolution. Here, we compare the mouse RNA-seq reference to the ImmGen dataset (Heng et al. 2008), which provides highly resolved annotation of immune cell types. The findCommonAncestors() function reports which set each descendant comes from (the MouseRNA and ImmGen columns in the output), allowing us to focus on LCAs that have representatives in both sets of terms.

## $`CL:0000126`
## $`CL:0000126`$name
## [1] "macroglial cell"
## 
## $`CL:0000126`$descendents
## DataFrame with 2 rows and 3 columns
##                       name  MouseRNA    ImmGen
##                <character> <logical> <logical>
## CL:0000127       astrocyte      TRUE     FALSE
## CL:0000128 oligodendrocyte      TRUE     FALSE
## 
## 
## $`CL:0000393`
## $`CL:0000393`$name
## [1] "electrically responsive cell"
## 
## $`CL:0000393`$descendents
## DataFrame with 2 rows and 3 columns
##                           name  MouseRNA    ImmGen
##                    <character> <logical> <logical>
## CL:0000540              neuron      TRUE     FALSE
## CL:0000746 cardiac muscle cell      TRUE     FALSE
## 
## 
## $`CL:0000623`
## $`CL:0000623`$name
## [1] "natural killer cell"
## 
## $`CL:0000623`$descendents
## DataFrame with 2 rows and 3 columns
##                              name  MouseRNA    ImmGen
##                       <character> <logical> <logical>
## CL:0000623    natural killer cell      TRUE      TRUE
## CL:0002438 NK1.1-positive natur..     FALSE      TRUE
## 
## 
## $`CL:0000813`
## $`CL:0000813`$name
## [1] "memory T cell"
## 
## $`CL:0000813`$descendents
## DataFrame with 2 rows and 3 columns
##                              name  MouseRNA    ImmGen
##                       <character> <logical> <logical>
## CL:0000897 CD4-positive, alpha-..     FALSE      TRUE
## CL:0000909 CD8-positive, alpha-..     FALSE      TRUE
## 
## 
## $`CL:0000815`
## $`CL:0000815`$name
## [1] "regulatory T cell"
## 
## $`CL:0000815`$descendents
## DataFrame with 2 rows and 3 columns
##                              name  MouseRNA    ImmGen
##                       <character> <logical> <logical>
## CL:0000792 CD4-positive, CD25-p..     FALSE      TRUE
## CL:0000815      regulatory T cell     FALSE      TRUE
## 
## 
## $`CL:0000819`
## $`CL:0000819`$name
## [1] "B-1 B cell"
## 
## $`CL:0000819`$descendents
## DataFrame with 2 rows and 3 columns
##                   name  MouseRNA    ImmGen
##            <character> <logical> <logical>
## CL:0000820 B-1a B cell     FALSE      TRUE
## CL:0000821 B-1b B cell     FALSE      TRUE

For example, we might notice that the mouse RNA-seq reference only has a single “T cell” term. To synchronize resolution across references, we would need to roll up all of ImmGen’s finely resolved subsets into that LCA as shown below. Of course, this results in some loss of precision and information; whether this is an acceptable price for simpler interpretation is a decision that is left to the user.

## DataFrame with 35 rows and 3 columns
##                              name  MouseRNA    ImmGen
##                       <character> <logical> <logical>
## CL:0000084                 T cell      TRUE      TRUE
## CL:0002427 resting double-posit..     FALSE      TRUE
## CL:0000809 double-positive, alp..     FALSE      TRUE
## CL:0002429 CD69-positive double..     FALSE      TRUE
## CL:0000624 CD4-positive, alpha-..     FALSE      TRUE
## ...                           ...       ...       ...
## CL:0002415 immature Vgamma1.1-p..     FALSE      TRUE
## CL:0002411 Vgamma1.1-positive, ..     FALSE      TRUE
## CL:0002416 mature Vgamma1.1-pos..     FALSE      TRUE
## CL:0002407 mature Vgamma2-posit..     FALSE      TRUE
## CL:0000815      regulatory T cell     FALSE      TRUE
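Once an LCA has been chosen, the roll-up itself is a simple relabelling: any label that is a descendant of the LCA is replaced by the LCA, and everything else is left alone. The sketch below illustrates this on a toy graph with invented term names; `fine`, `target` and `coarse` are hypothetical variable names, and a real analysis would presumably operate on ontology identifiers from both references.

```r
library(igraph)

# Toy ontology with parent -> child edges (names invented for illustration).
edges <- matrix(c(
    "cell", "lymphocyte",
    "lymphocyte", "T cell",
    "lymphocyte", "B cell",
    "T cell", "regulatory T cell",
    "T cell", "CD4 T cell",
    "CD4 T cell", "CD4 regulatory T cell",
    "regulatory T cell", "CD4 regulatory T cell"
), ncol = 2, byrow = TRUE)
g <- graph_from_edgelist(edges, directed = TRUE)

# Hypothetical fine-grained labels from a highly resolved reference.
fine <- c("CD4 T cell", "CD4 regulatory T cell", "B cell",
    "regulatory T cell", "T cell")

# The chosen LCA plus all of its descendants (mode = "out" follows children).
target <- "T cell"
under_target <- as_ids(subcomponent(g, target, mode = "out"))

# Roll every label under the LCA up to the LCA; keep other labels unchanged.
coarse <- ifelse(fine %in% under_target, target, fine)
print(data.frame(fine, coarse))
```

Keeping the original labels alongside the rolled-up ones, as in the printed data frame, makes it easy to return to the finer resolution later.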

Session information

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] igraph_1.2.6                celldex_1.0.0              
 [3] SummarizedExperiment_1.20.0 Biobase_2.50.0             
 [5] GenomicRanges_1.42.0        GenomeInfoDb_1.26.0        
 [7] IRanges_2.24.0              S4Vectors_0.28.0           
 [9] BiocGenerics_0.36.0         MatrixGenerics_1.2.0       
[11] matrixStats_0.57.0          ontoProc_1.12.0            
[13] ontologyIndex_2.5           BiocStyle_2.18.0           
[15] rebook_1.0.0               

loaded via a namespace (and not attached):
 [1] httr_1.4.2                    AnnotationHub_2.22.0         
 [3] bit64_4.0.5                   DelayedMatrixStats_1.12.0    
 [5] ontologyPlot_1.4              paintmap_1.0                 
 [7] shiny_1.5.0                   assertthat_0.2.1             
 [9] interactiveDisplayBase_1.28.0 BiocManager_1.30.10          
[11] BiocFileCache_1.14.0          blob_1.2.1                   
[13] GenomeInfoDbData_1.2.4        yaml_2.2.1                   
[15] BiocVersion_3.12.0            lattice_0.20-41              
[17] pillar_1.4.6                  RSQLite_2.2.1                
[19] glue_1.4.2                    digest_0.6.27                
[21] promises_1.1.1                XVector_0.30.0               
[23] Matrix_1.2-18                 htmltools_0.5.0              
[25] httpuv_1.5.4                  XML_3.99-0.5                 
[27] pkgconfig_2.0.3               bookdown_0.21                
[29] zlibbioc_1.36.0               purrr_0.3.4                  
[31] xtable_1.8-4                  processx_3.4.4               
[33] later_1.1.0.1                 tibble_3.0.4                 
[35] generics_0.1.0                ellipsis_0.3.1               
[37] DT_0.16                       magrittr_1.5                 
[39] crayon_1.3.4                  CodeDepends_0.6.5            
[41] mime_0.9                      memoise_1.1.0                
[43] evaluate_0.14                 ps_1.4.0                     
[45] graph_1.68.0                  tools_4.0.3                  
[47] lifecycle_0.2.0               stringr_1.4.0                
[49] DelayedArray_0.16.0           AnnotationDbi_1.52.0         
[51] callr_3.5.1                   compiler_4.0.3               
[53] rlang_0.4.8                   grid_4.0.3                   
[55] RCurl_1.98-1.2                rappdirs_0.3.1               
[57] htmlwidgets_1.5.2             bitops_1.0-6                 
[59] rmarkdown_2.5                 ExperimentHub_1.16.0         
[61] codetools_0.2-16              DBI_1.1.0                    
[63] curl_4.3                      R6_2.5.0                     
[65] knitr_1.30                    dplyr_1.0.2                  
[67] fastmap_1.0.1                 bit_4.0.4                    
[69] Rgraphviz_2.34.0              stringi_1.5.3                
[71] Rcpp_1.0.5                    vctrs_0.3.4                  
[73] sparseMatrixStats_1.2.0       dbplyr_1.4.4                 
[75] tidyselect_1.1.0              xfun_0.19                    

Bibliography

Aran, D., A. P. Looney, L. Liu, E. Wu, V. Fong, A. Hsu, S. Chak, et al. 2019. “Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage.” Nature Immunology 20 (2): 163–72.

Heng, Tracy S.P., Michio W. Painter, Kutlu Elpek, Veronika Lukacs-Kornek, Nora Mauermann, Shannon J. Turley, Daphne Koller, et al. 2008. “The immunological genome project: Networks of gene expression in immune cells.” Nature Immunology 9 (10): 1091–4. https://doi.org/10.1038/ni1008-1091.