1 Introduction

The rsemmed package lets users programmatically explore connections between the biological concepts present in the Semantic MEDLINE database (Kilicoglu et al. 2011).

1.1 Overview of Semantic MEDLINE

The Semantic MEDLINE database (SemMedDB) is a collection of annotations of sentences from the abstracts of articles indexed in PubMed. These annotations take the form of subject-predicate-object triples of information. These triples are also called predications.

An example predication is “Interleukin-12 INTERACTS_WITH IFNA1”. Here, the subject is “Interleukin-12”, the object is “IFNA1” (interferon alpha-1), and the predicate linking the subject and object is “INTERACTS_WITH”. The Semantic MEDLINE database consists of tens of millions of these predications.

Semantic MEDLINE also provides information on the broad categories into which biological concepts (predication subjects and objects) fall. This information is called the semantic type of a concept. The database assigns 4-letter codes to semantic types. For example, “gngm” represents “Gene or Genome”. Every concept in the database has one or more semantic types (abbreviated as “semtypes”).

Note: The information in Semantic MEDLINE is primarily computationally derived. Thus, some information will seem nonsensical. For example, the semantic types reported for a concept might not quite match the concept itself. The Semantic MEDLINE resource and this package are meant to facilitate an initial window of exploration into the literature. The hope is that this package helps guide more streamlined manual investigations of the literature.

1.2 Graph representation of SemMedDB

The predications in SemMedDB can be represented in graph form. Nodes represent concepts, and directed edges represent predicates (concept linkers). In particular, the Semantic MEDLINE graph is a directed multigraph because multiple predicates are often present between pairs of nodes (e.g., “A ASSOCIATED_WITH B” and “A INTERACTS_WITH B”). rsemmed relies on the igraph package for efficient graph operations.
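
As a rough illustration of this representation, the sketch below builds a tiny directed multigraph with igraph from a few predication triples. The triples and the object names (toy_preds, g_toy) are made up for illustration and are not taken from SemMedDB.

```r
library(igraph)

# Hypothetical predications: subject, object, and the predicate linking them
toy_preds <- data.frame(
    from      = c("A", "A", "B"),
    to        = c("B", "B", "C"),
    predicate = c("ASSOCIATED_WITH", "INTERACTS_WITH", "INHIBITS"),
    stringsAsFactors = FALSE
)

# The first two columns define the edges; other columns become edge attributes
g_toy <- graph_from_data_frame(toy_preds, directed = TRUE)

is_directed(g_toy)             # edges run from subject to object
any_multiple(g_toy)            # A and B are linked by two predicates
edge_attr(g_toy, "predicate")  # one predicate per edge
```

Because each predicate is stored as its own edge, two concepts can be linked by several parallel edges, which is what makes the graph a multigraph.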

1.2.1 Full data availability

The full data underlying the complete Semantic MEDLINE database is available from this National Library of Medicine site as SQL dump files. In particular, the PREDICATION table is the primary file that is needed to construct the database. More information about the Semantic MEDLINE database is available here.

See the inst/script folder for scripts to perform the following processing of these raw files:

  • Conversion of the original SQL dump files to a CSV file
  • Generation of the graph representation from the CSV file

The next section describes details about the processing that occurs in these scripts to generate the graph representation.

In this vignette, we will explore a much smaller subset of the full graph that suffices to show the full functionality of rsemmed.

1.2.2 Note about processed data

The graph representation of SemMedDB contains a processed and summarized form of the raw database. The toy example below illustrates the summarization performed.

Subject   Subject semtype   Predicate   Object   Object semtype
A         aapp              INHIBITS    B        gngm
A         gngm              INHIBITS    B        aapp

The two rows are treated as different predications in the raw database because the semantic types (“semtypes”) of the subject and object differ. In the processed data, such instances have been collapsed as shown below.

Subject   Subject semtype   Predicate   Object   Object semtype   # instances
A         aapp,gngm         INHIBITS    B        aapp,gngm        2

The different semantic types for a particular concept are collapsed into a single comma-separated string that is available via igraph::vertex_attr(g, "semtype").

The “# instances” column indicates that the “A INHIBITS B” predication was observed twice in the database. This piece of information is available as an edge attribute via igraph::edge_attr(g, "num_instances"). Similarly, predicate information is also an edge attribute accessible via igraph::edge_attr(g, "predicate").
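
The summarization in the toy example can be sketched in base R as follows. This is only an illustration under our own assumptions (including the column names and the alphabetical ordering of semtypes); the scripts in inst/script may perform the processing differently.

```r
# Toy predications mirroring the first table above
preds <- data.frame(
    subject         = c("A", "A"),
    subject_semtype = c("aapp", "gngm"),
    predicate       = c("INHIBITS", "INHIBITS"),
    object          = c("B", "B"),
    object_semtype  = c("gngm", "aapp"),
    stringsAsFactors = FALSE
)

# Node level: gather every semtype seen for a concept into one string
concepts <- data.frame(
    name    = c(preds$subject, preds$object),
    semtype = c(preds$subject_semtype, preds$object_semtype),
    stringsAsFactors = FALSE
)
node_semtype <- tapply(
    concepts$semtype, concepts$name,
    function(x) paste(sort(unique(x)), collapse = ",")
)

# Edge level: count how often each subject-predicate-object triple occurs
preds$num_instances <- 1L
edges <- aggregate(
    num_instances ~ subject + predicate + object,
    data = preds, FUN = sum
)

node_semtype
edges
```

Both concepts end up with the combined semtype string, and the two raw rows reduce to a single edge with a count of 2.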

A note of caution: Be careful when working with edge attributes in the Semantic MEDLINE graph manually. These operations can be very slow because there are over 18 million edges. Working with node/vertex attributes is much faster, but there are still a very large number of nodes (roughly 290,000).
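
One way to limit this cost, offered here as a general suggestion rather than something the package prescribes, is to extract an attribute once as a plain vector and then work on that vector. The sketch below uses a small hypothetical graph (g_toy) whose attribute names follow the ones described above.

```r
library(igraph)

# Hypothetical processed edges and nodes
toy_edges <- data.frame(
    from          = c("A", "A", "B"),
    to            = c("B", "C", "C"),
    predicate     = c("INHIBITS", "CAUSES", "INHIBITS"),
    num_instances = c(2L, 1L, 5L),
    stringsAsFactors = FALSE
)
toy_nodes <- data.frame(
    name    = c("A", "B", "C"),
    semtype = c("aapp,gngm", "aapp,gngm", "dsyn"),
    stringsAsFactors = FALSE
)
g_toy <- graph_from_data_frame(toy_edges, directed = TRUE, vertices = toy_nodes)

# Extract each edge attribute once as a plain vector ...
pred <- edge_attr(g_toy, "predicate")
n_inst <- edge_attr(g_toy, "num_instances")

# ... then use ordinary vectorized operations on those vectors
table(pred)
keep <- which(pred == "INHIBITS" & n_inst >= 2)
E(g_toy)[keep]

# Vertex attributes are accessed the same way
vertex_attr(g_toy, "semtype")
```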

The rest of this vignette will showcase how to use rsemmed functions to explore this graph.

2 Installation

To install rsemmed, start R and enter the following:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("rsemmed")

3 Example workflow

3.1 Loading packages and data

Load the rsemmed package and the g_small object, which contains a smaller version of the Semantic MEDLINE database.

library(rsemmed)
## Loading required package: igraph
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
data(g_small)

This loads an object of class igraph named g_small into the workspace. The SemMedDB graph object is a necessary input for most of rsemmed’s functions.

(The full processed graph representation linked above contains an object of class igraph named g.)

3.2 Finding nodes

The starting point for an rsemmed exploration is to find nodes related to the initial ideas of interest. For example, we may wish to find connections between the ideas “sickle cell trait” and “malaria”.

The rsemmed::find_nodes() function allows you to search for nodes by name. We supply the graph and a regular expression to use in searching through the name attribute of the nodes. Finding the most relevant nodes will generally involve iteration.

To find nodes related to the sickle cell trait, we can start by searching for nodes containing the word “sickle”. (Note: searches ignore capitalization.)

nodes_sickle <- find_nodes(g_small, pattern = "sickle")
nodes_sickle
## + 5/1038 vertices, named, from f0c97f6:
## [1] Sickle Cell Anemia      Sickle Hemoglobin       Sickle Cell Trait      
## [4] Sickle cell retinopathy sickle trait

We may decide that only sickle cell anemia and the sickle trait are important. Conventional R subsetting allows us to keep the 3 related nodes:

nodes_sickle <- nodes_sickle[c(1,3,5)]
nodes_sickle
## + 3/1038 vertices, named, from f0c97f6:
## [1] Sickle Cell Anemia Sickle Cell Trait  sickle trait
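
To make the matching behavior concrete, the following base R sketch mimics a case-insensitive regular expression search over a handful of node names. It assumes that find_nodes() behaves like grepl() with ignore.case = TRUE; that is an assumption made for illustration, not a description of the package internals.

```r
# Node names copied from the output above, plus one non-matching name
node_names <- c(
    "Sickle Cell Anemia", "Sickle Hemoglobin", "Sickle Cell Trait",
    "Sickle cell retinopathy", "sickle trait", "Malaria"
)

# Case-insensitive regular expression match
hits_sickle <- node_names[grepl("sickle", node_names, ignore.case = TRUE)]
hits_sickle

# Regular expression syntax allows narrower searches,
# for example names that end in "trait"
hits_trait <- node_names[grepl("trait$", node_names, ignore.case = TRUE)]
hits_trait
```

Anchors such as "^" and "$" are one way to cut down on the manual subsetting step when a broad pattern returns too many nodes.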

We can also search for nodes related to “malaria”:

nodes_malaria <- find_nodes(g_small, pattern = "malaria")
nodes_malaria
## + 32/1038 vertices, named, from f0c97f6:
##  [1] Malaria                                   
##  [2] Malaria, Falciparum                       
##  [3] Malaria, Cerebral                         
##  [4] Malaria Vaccines                          
##  [5] Antimalarials                             
##  [6] Malaria, Vivax                            
##  [7] Simian malaria                            
##  [8] Malaria, Avian                            
##  [9] Malarial parasites                        
## [10] Mixed malaria                             
## + ... omitted several vertices

Only the first 10 of the 32 results are printed. To display all of them, we can access the name attribute of the returned nodes:

nodes_malaria$name
##  [1] "Malaria"                                   
##  [2] "Malaria, Falciparum"                       
##  [3] "Malaria, Cerebral"                         
##  [4] "Malaria Vaccines"                          
##  [5] "Antimalarials"                             
##  [6] "Malaria, Vivax"                            
##  [7] "Simian malaria"                            
##  [8] "Malaria, Avian"                            
##  [9] "Malarial parasites"                        
## [10] "Mixed malaria"                             
## [11] "Plasmodium malariae infection"             
## [12] "Malaria antigen"                           
## [13] "Prescription of prophylactic anti-malarial"
## [14] "Induced malaria"                           
## [15] "Malaria serology"                          
## [16] "Algid malaria"                             
## [17] "Aminoquinoline antimalarial"               
## [18] "Plasmodium malariae"                       
## [19] "Malaria antibody"                          
## [20] "Malaria screening"                         
## [21] "Congenital malaria"                        
## [22] "Ovale malaria"                             
## [23] "Quartan malaria"                           
## [24] "Malaria Antibodies Test"                   
## [25] "Biguanide antimalarial"                    
## [26] "MALARIA RELAPSE"                           
## [27] "KIT, TEST, MALARIA"                        
## [28] "Malarial hepatitis"                        
## [29] "Malarial pigmentation"                     
## [30] "Malaria antigen test"                      
## [31] "Malarial nephrosis"                        
## [32] "Malaria smear"

Perhaps we only want to keep the nodes that relate to disease. We could use direct subsetting, but another option is to use find_nodes() again with nodes_malaria as the input. Using the match argument set to FALSE allows us to prune unwanted matches from our results.

Below we iteratively prune matches to keep only disease-related results. Though this is not as concise as direct subsetting, it is more transparent about what was removed.

nodes_malaria <- nodes_malaria %>%
    find_nodes(pattern = "anti", match = FALSE) %>%
    find_nodes(pattern = "test", match = FALSE) %>%
    find_nodes(pattern = "screening", match = FALSE) %>%
    find_nodes(pattern = "pigment", match = FALSE) %>%
    find_nodes(pattern = "smear", match = FALSE) %>%
    find_nodes(pattern = "parasite", match = FALSE) %>%
    find_nodes(pattern = "serology", match = FALSE) %>%
    find_nodes(pattern = "vaccine", match = FALSE)
nodes_malaria
## + 17/1038 vertices, named, from f0c97f6:
##  [1] Malaria                       Malaria, Falciparum          
##  [3] Malaria, Cerebral             Malaria, Vivax               
##  [5] Simian malaria                Malaria, Avian               
##  [7] Mixed malaria                 Plasmodium malariae infection
##  [9] Induced malaria               Algid malaria                
## [11] Plasmodium malariae           Congenital malaria           
## [13] Ovale malaria                 Quartan malaria              
## [15] MALARIA RELAPSE               Malarial hepatitis           
## [17] Malarial nephrosis
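
The pruning logic can be mimicked in base R on a few of the node names above. The sketch compares step-by-step pruning with a single pass that joins the patterns with "|" (regular expression alternation). It does not call rsemmed; the helper prune() is hypothetical, and whether find_nodes() treats alternation the same way rests on the assumption that its pattern is handled by a standard regular expression matcher.

```r
# A few node names from the malaria search results above
node_names <- c(
    "Malaria", "Malaria Vaccines", "Antimalarials", "Malaria, Vivax",
    "Malarial parasites", "Malaria antigen", "Malaria screening",
    "MALARIA RELAPSE", "KIT, TEST, MALARIA"
)
unwanted <- c("anti", "test", "screening", "pigment", "smear",
              "parasite", "serology", "vaccine")

# Hypothetical helper: drop every name that matches the pattern
prune <- function(x, pattern) x[!grepl(pattern, x, ignore.case = TRUE)]

# Step-by-step pruning: remove the matches for one pattern at a time
kept_stepwise <- Reduce(prune, unwanted, node_names)

# One-pass pruning: a single pattern built with alternation
kept_onepass <- prune(node_names, paste(unwanted, collapse = "|"))

kept_stepwise
identical(kept_stepwise, kept_onepass)
```

Note that a short pattern such as "anti" also removes "Malaria antigen", so it is worth checking what each pattern discards.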

The find_nodes() function can also be used with the semtypes argument which allows you to specify a character vector of semantic types to search for. If both pattern and semtypes are provided, they are combined with an OR operation. If you would like them to be combined with an AND operation, nest the calls in sequence.

## malaria OR disease (dsyn)
find_nodes(g_small, pattern = "malaria", semtypes = "dsyn")
## + 317/1038 vertices, named, from f0c97f6:
##   [1] Obstruction                                  
##   [2] Depressed mood                               
##   [3] Carcinoma                                    
##   [4] HIV-1                                        
##   [5] Infection                                    
##   [6] leukemia                                     
##   [7] Neoplasm                                     
##   [8] Renal tubular disorder                       
##   [9] Toxic effect                                 
##  [10] Vesicle                                      
## + ... omitted several vertices
## malaria AND disease (dsyn)
find_nodes(g_small, pattern = "malaria") %>%
    find_nodes(semtypes = "dsyn")
## + 16/1038 vertices, named, from f0c97f6:
##  [1] Malaria                       Malaria, Falciparum          
##  [3] Malaria, Cerebral             Malaria, Vivax               
##  [5] Simian malaria                Malaria, Avian               
##  [7] Mixed malaria                 Plasmodium malariae infection
##  [9] Induced malaria               Algid malaria                
## [11] Congenital malaria            Ovale malaria                
## [13] Quartan malaria               MALARIA RELAPSE              
## [15] Malarial hepatitis            Malarial nephrosis
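
The difference between the OR and AND combinations can be sketched with plain logical vectors. The node table below is hypothetical: the semtype strings are made up for illustration and do not come from g_small.

```r
# Hypothetical nodes with comma-separated semtype strings
nodes <- data.frame(
    name    = c("Malaria", "Malaria Vaccines", "Carcinoma", "Infant"),
    semtype = c("dsyn", "imft,phsu", "neop,dsyn", "aggp"),
    stringsAsFactors = FALSE
)

name_hit <- grepl("malaria", nodes$name, ignore.case = TRUE)

# A node may carry several semtypes, so split the string before matching
type_hit <- vapply(
    strsplit(nodes$semtype, ",", fixed = TRUE),
    function(types) "dsyn" %in% types,
    logical(1)
)

or_result <- nodes$name[name_hit | type_hit]    # either criterion matches
and_result <- nodes$name[name_hit & type_hit]   # both criteria match

or_result
and_result
```

The OR combination can only grow the result set, which is why the first call above returns far more nodes than the nested calls.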

Finally, you can also select nodes by exact name with the names argument. (Capitalization is ignored.)

find_nodes(g_small, names = "sickle trait")
## + 1/1038 vertex, named, from f0c97f6:
## [1] sickle trait
find_nodes(g_small, names = "SICKLE trait")
## + 1/1038 vertex, named, from f0c97f6:
## [1] sickle trait

3.3 Growing understanding by connecting nodes

Now that we have nodes related to the ideas of interest, we can develop further understanding by asking the following questions:

  • How are these node sets connected to each other? (Aim 1)
  • What ideas (nodes) are connected to these nodes? (Aim 2)

3.3.1 Aim 1: Connecting different node sets

To further Aim 1, we can use the rsemmed::find_paths() function. This function takes two sets of nodes from and to (corresponding to the two different ideas of interest) and returns all shortest paths between nodes in from (“source” nodes) and nodes in to (“target” nodes). That is, for every possible combination of a single node in from and a single node in to, all shortest undirected paths between those nodes are found.

paths <- find_paths(graph = g_small, from = nodes_sickle, to = nodes_malaria)
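
To illustrate what “all shortest undirected paths” means, the sketch below uses igraph directly on a small hypothetical graph. It is not the implementation of find_paths(); the graph, the node names, and the nested-list layout are assumptions chosen to resemble the description above.

```r
library(igraph)

# Hypothetical directed graph: S reaches M via Y, while S and M both point to X
g_toy <- graph_from_data_frame(
    data.frame(
        from = c("S", "M", "S", "Y"),
        to   = c("X", "X", "Y", "M"),
        stringsAsFactors = FALSE
    ),
    directed = TRUE
)

from_nodes <- c("S")
to_nodes <- c("M")

# mode = "all" ignores edge direction, so S - X - M also counts as a path
toy_paths <- lapply(from_nodes, function(src) {
    sp <- all_shortest_paths(g_toy, from = src, to = to_nodes, mode = "all")
    # Depending on the igraph version, the vertex paths are stored
    # in a component named "vpaths" or "res"
    if (!is.null(sp$vpaths)) sp$vpaths else sp$res
})

# One list element per source node, each holding vertex sequences
path_labels <- vapply(
    toy_paths[[1]],
    function(p) paste(as_ids(p), collapse = " "),
    character(1)
)
sort(path_labels)
```

If edge direction were respected, only the path through Y would be found, because no directed path leads from S to M through X.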

3.3.1.1 Information from find_paths()

The result of find_paths() is a list with one element for each of the nodes in from. Each element is itself a list of paths between from and to. In igraph, paths are represented as vertex sequences (class igraph.vs).

Recall that nodes_sickle contains the nodes below:

nodes_sickle
## + 3/1038 vertices, named, from f0c97f6:
## [1] Sickle Cell Anemia Sickle Cell Trait  sickle trait

Thus, paths is structured as follows:

  • paths[[1]] is a list of paths originating from Sickle Cell Anemia.
  • paths[[2]] is a list of paths originating from Sickle Cell Trait.
  • paths[[3]] is a list of paths originating from sickle trait.

With lengths() we can show the number of shortest paths starting at each of the three source (“from”) nodes:

lengths(paths)
## [1]  956  268 1601
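
The nested structure can be pictured with a small hand-made list. In this sketch the paths are plain character vectors rather than igraph vertex sequences, and the node labels are hypothetical.

```r
# Hypothetical result: two source nodes with 2 and 1 shortest paths
toy_paths <- list(
    list(c("a", "x", "m"), c("a", "y", "m")),
    list(c("b", "x", "m"))
)

# Number of paths per source node
lengths(toy_paths)

# Total number of paths
sum(lengths(toy_paths))

# Flatten one level to get a single list of paths
all_paths <- unlist(toy_paths, recursive = FALSE)
length(all_paths)

# The second path from the first source node
toy_paths[[1]][[2]]
```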

3.3.1.2 Displaying paths

There are two ways to display the information contained in these paths: rsemmed::text_path() and rsemmed::plot_path().

  • text_path() displays a text version of a path
  • plot_path() displays a graphical version of the path

For example, to show the 100th of the shortest paths originating from the first of the sickle trait nodes (paths[[1]][[100]]), we can use text_path() and plot_path() as below:

this_path <- paths[[1]][[100]]
tp <- text_path(g_small, this_path)
## Sickle Cell Anemia --- pulmonary complications :
## # A tibble: 4 x 5
##   from_semtype from                 via           to                  to_semtype
##   <chr>        <chr>                <chr>         <chr>               <chr>     
## 1 dsyn         Sickle Cell Anemia   CAUSES        pulmonary complica… patf      
## 2 patf         pulmonary complicat… CAUSES        Sickle Cell Anemia  dsyn      
## 3 patf         pulmonary complicat… COEXISTS_WITH Sickle Cell Anemia  dsyn      
## 4 patf         pulmonary complicat… MANIFESTATIO… Sickle Cell Anemia  dsyn      
## 
## pulmonary complications --- Malaria, Falciparum :
## # A tibble: 1 x 5
##   from_semtype from                   via           to                to_semtype
##   <chr>        <chr>                  <chr>         <chr>             <chr>     
## 1 patf         pulmonary complicatio… COEXISTS_WITH Malaria, Falcipa… dsyn
tp
## [[1]]
## # A tibble: 4 x 5
##   from_semtype from                 via           to                  to_semtype
##   <chr>        <chr>                <chr>         <chr>               <chr>     
## 1 dsyn         Sickle Cell Anemia   CAUSES        pulmonary complica… patf      
## 2 patf         pulmonary complicat… CAUSES        Sickle Cell Anemia  dsyn      
## 3 patf         pulmonary complicat… COEXISTS_WITH Sickle Cell Anemia  dsyn      
## 4 patf         pulmonary complicat… MANIFESTATIO… Sickle Cell Anemia  dsyn      
## 
## [[2]]
## # A tibble: 1 x 5
##   from_semtype from                   via           to                to_semtype
##   <chr>        <chr>                  <chr>         <chr>             <chr>     
## 1 patf         pulmonary complicatio… COEXISTS_WITH Malaria, Falcipa… dsyn
plot_path(g_small, this_path)

plot_path() plots the subgraph defined by the nodes on the path.

text_path() sequentially shows detailed information about semantic types and predicates for the pairs of nodes on the path. It also invisibly returns a list of tibbles containing the displayed information, where each list element corresponds to a pair of nodes on the path.
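
Because the returned value is a list of tables with identical columns, it can be stacked into a single table for filtering or tabulation. The sketch below uses a hypothetical stand-in (tp_toy) built from ordinary data frames, with column names taken from the printed output above; it assumes that the real tibbles can be combined in the same way.

```r
# Hypothetical stand-in for the list returned by text_path()
tp_toy <- list(
    data.frame(
        from_semtype = c("dsyn", "patf"),
        from = c("Sickle Cell Anemia", "pulmonary complications"),
        via = c("CAUSES", "COEXISTS_WITH"),
        to = c("pulmonary complications", "Sickle Cell Anemia"),
        to_semtype = c("patf", "dsyn"),
        stringsAsFactors = FALSE
    ),
    data.frame(
        from_semtype = "patf",
        from = "pulmonary complications",
        via = "COEXISTS_WITH",
        to = "Malaria, Falciparum",
        to_semtype = "dsyn",
        stringsAsFactors = FALSE
    )
)

# Stack the per-pair tables and record which step of the path each row came from
steps <- rep(seq_along(tp_toy), vapply(tp_toy, nrow, integer(1)))
path_table <- cbind(step = steps, do.call(rbind, tp_toy))

# Tabulate predicates and pull out the rows for one predicate
table(path_table$via)
path_table[path_table$via == "COEXISTS_WITH", c("step", "from", "to")]
```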

3.3.1.3 Refining paths with weights

Finding paths between node sets necessarily uses shortest path algorithms for computational tractability. However, when these algorithms are run without modification, the shortest paths tend to be less useful than desired.

For example, one of the shortest paths from “sickle trait” to “Malaria, Cerebral” goes through the node “Infant”:

this_path <- paths[[3]][[32]]
plot_path(g_small, this_path)

This likely isn’t the type of path we were hoping for. Why does such a path arise? For some insight, we can use the degree() function within the igraph package to look at the degree distribution for all nodes in the Semantic MEDLINE graph. We also show the degree of the “Infant” node in red.

plot(density(degree(g_small), from = 0),
    xlab = "Degree", main = "Degree distribution")
## The second node in the path is "Infant" --> this_path[2]
abline(v = degree(g_small, v = this_path[2]), col = "red", lwd = 2)
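
The effect of a high-degree node on shortest paths can be reproduced on a small hypothetical graph. In this sketch, the node "Hub" is linked to every other node, so it offers a shortcut between S and M even though a longer chain of more specific nodes also connects them. The graph and node names are made up for illustration.

```r
library(igraph)

# Hypothetical graph: "Hub" touches every node; P, Q, R form a chain from S to M
toy_edges <- data.frame(
    from = c("Hub", "Hub", "Hub", "Hub", "Hub", "S", "P", "Q", "R"),
    to   = c("S",   "M",   "P",   "Q",   "R",   "P", "Q", "R", "M"),
    stringsAsFactors = FALSE
)
g_toy <- graph_from_data_frame(toy_edges, directed = TRUE)

# Total degree (in + out) of each node
deg <- degree(g_toy)
sort(deg, decreasing = TRUE)

# The shortest undirected path from S to M runs through the hub
sp <- shortest_paths(g_toy, from = "S", to = "M", mode = "all")
hub_path <- as_ids(sp$vpath[[1]])
hub_path
```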