Contents

1 Introduction

SynExtend is a package of tools for working with objects of class Synteny built from the package DECIPHER’s FindSynteny() function.

Synteny maps provide a powerful tool for quantifying and visualizing where pairs of genomes share order. Typically these maps are built from predictions of orthologous pairs, where groups of pairs that provide contiguous and sequential blocks in their respective genomes are deemed a ‘syntenic block’. That designation of synteny can then used to further interrogate the predicted orthologs themselves, or query topics like genomic rearrangements or ancestor genome reconstruction.

FindSynteny takes a different approach, finding exactly matched shared k-mers and determining where shared k-mers, or blocks of proximate shared k-mers are significant. Combining the information generated by FindSynteny with locations of genomic features allows us to simply mark where features are linked by syntenic k-mers. These linked features represent potential orthologous pairs, and can be easily evaluated on the basis of the k-mers that they share, or alignment.

2 Package Structure

Currently SynExtend contains one set of functions for performig orthology predictions, as well as a rearrangement estimation function that is currently under construction.

2.1 Installation

  1. Install the latest version of R using CRAN.
  2. Install SynExtend in R by running the following commands:
if (!requireNamespace("BiocManager",
                      quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("SynExtend")

3 Usage

Using the FindSynteny function in DECIPHER builds an object of class Synteny. In this tutorial, a prebuilt DECIPHER database is used. For database construction see ?Seqs2DB in DECIPHER. This example starts with a database containing three endosymbiont genomes that were chosen to keep package data to a minimum.

library(SynExtend)
## Loading required package: DECIPHER
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
##     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rownames, sapply, setdiff, table, tapply,
##     union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
## 
## Attaching package: 'SynExtend'
## The following object is masked from 'package:stats':
## 
##     dendrapply
DBPATH <- system.file("extdata",
                      "Endosymbionts_v02.sqlite",
                      package = "SynExtend")

Syn <- FindSynteny(dbFile = DBPATH)
## ================================================================================
## 
## Time difference of 11.75 secs

Synteny maps represent where genomes share order. Simply printing a synteny object to the console displays a gross level view of the data inside. Objects of class Synteny can also be plotted to provide clear visual representations of the data inside. The genomes used in this example are distantly related and fairly dissimilar.

Syn
##           1         2         3        4
## 1     1 seq  95% hits  95% hits 91% hits
## 2 52 blocks     1 seq  95% hits 91% hits
## 3 51 blocks 42 blocks     1 seq 91% hits
## 4 75 blocks 75 blocks 75 blocks    1 seq
pairs(Syn)

Data present inside objects of class Synteny can also be accessed relatively easily. The object itself is functionally a matrix of lists, with data describing exactly matched k-mers present in the upper triangle, and data describing blocks of chained k-mers in the lower triangle. For more information see ?FindSynteny in the package DECIPHER.

print(head(Syn[[1, 2]]))
##      index1 index2 strand width start1 start2 frame1 frame2
## [1,]      1      1      1  2019 507817 656143      1      2
## [2,]      1      1      1  5532 509839 654121      1      2
## [3,]      1      1      1  2217 515517 648454      3      2
## [4,]      1      1      1   963 517737 646234      3      2
## [5,]      1      1      1  1953 518703 645268      3      2
## [6,]      1      1      1  3420 520659 643312      3      2
print(head(Syn[[2, 1]]))
##      index1 index2 strand score start1 start2   end1   end2 first_hit last_hit
## [1,]      1      1      1 72356 507817 593612 570377 656143         1       19
## [2,]      1      1      1 69241 682222 420739 743196 481721        20       48
## [3,]      1      1      1 60712 570396 541175 622767 593540        49       77
## [4,]      1      1      1 51362 155419 149200 223981 217755        78      101
## [5,]      1      1      1 49773      1 328415  44823 373239       102      130
## [6,]      1      1      1 27630 753628 385286 778610 410259       131      138

The above printed objects show the data for the comparison between the first and second genome in our database.

To take advantage of these synteny maps, we can then overlay the gene calls for each genome present on top of our map.

Next, GFF annotations for the associated genomes are parsed to provide gene calls in a use-able format. GFFs are not the only possible source of appropriate gene calls, but they are the source that was used during package construction and testing. Parsed GFFs can be constructed with gffToDataFrame, for full functionality, or GFFs can be imported via rtracklater::import() for limited functionality. GeneCalls for both the PairSummaries and NucleotideOverlap functions must be named list, and those names must match dimnames(Syn)[[1]].

# generating genecalls with local data:
GC <- gffToDataFrame(GFF = system.file("extdata",
                                       "GCF_021065005.1_ASM2106500v1_genomic.gff.gz",
                                       package = "SynExtend"),
                     Verbose = TRUE)
## ================================================================================
## Time difference of 5.522598 secs
# in an effort to be space conscious, not all original gffs are kept within this package
GeneCalls <- get(data("Endosymbionts_GeneCalls", package = "SynExtend"))

SynExtend’s gffToDataFrame function will directly import gff files into a usable format, and includes other extracted information.

print(head(GeneCalls[[1]]))
## DataFrame with 6 rows and 11 columns
##       Index    Strand     Start      Stop        Type                 ID
##   <integer> <integer> <integer> <integer> <character>        <character>
## 1         1         0       170       493        gene gene-M9394_RS00005
## 2         1         0       574      1935        gene gene-M9394_RS00010
## 3         1         0      1995      2522        gene gene-M9394_RS00015
## 4         1         0      2581      3321        gene gene-M9394_RS00020
## 5         1         1      4383      5396        gene gene-M9394_RS00025
## 6         1         0      5682      6377        gene gene-M9394_RS00030
##           Range                Product    Coding Translation_Table
##   <IRangesList>            <character> <logical>       <character>
## 1       170-493 cell envelope integr..      TRUE                11
## 2      574-1935 Tol-Pal system beta ..      TRUE                11
## 3     1995-2522 peptidoglycan-associ..      TRUE                11
## 4     2581-3321 tol-pal system prote..      TRUE                11
## 5     4383-5396 6-phosphogluconolact..      TRUE                11
## 6     5682-6377 2%2C3-diphosphoglyce..      TRUE                11
##          Contig
##     <character>
## 1 NZ_CP097751.1
## 2 NZ_CP097751.1
## 3 NZ_CP097751.1
## 4 NZ_CP097751.1
## 5 NZ_CP097751.1
## 6 NZ_CP097751.1

Raw GFF imports are also acceptable, but prevent alignments in amino acid space with PairSummaries().

X01 <- rtracklayer::import(system.file("extdata",
                                       "GCA_000875775.1_ASM87577v1_genomic.gff.gz",
                                       package = "SynExtend"))
class(X01)
print(X01)

SynExtend’s primary functions provide a way to identify where pairs of genes are explicitly linked by syntenic hits, and then summarize those links. The first step is just identifying those links.

Links <- NucleotideOverlap(SyntenyObject = Syn,
                           GeneCalls = GeneCalls,
                           Verbose = TRUE)
## 
## Reconciling genecalls.
## ================================================================================
## Finding connected features.
## ================================================================================
## Time difference of 0.678359 secs

The Links object generated by NucleotideOverlap is a raw representation of positions on the synteny map where shared k-mers link genes between paired genomes. As such, it is analagous in shape to objects of class Synteny. This raw object is unlikely to be useful to most users, but has been left exposed to ensure that this data remains accessible should a user desire to have access to it.

class(Links)
## [1] "LinkedPairs"
print(Links)
##            1          2          3         4
## 1   1 Contig  690 Pairs  681 Pairs 722 Pairs
## 2  750 Kmers   1 Contig  683 Pairs 721 Pairs
## 3  734 Kmers  710 Kmers   1 Contig 718 Pairs
## 4 2681 Kmers 2692 Kmers 2658 Kmers  1 Contig

This raw data can be processed to provide a straightforward summary of predicted pairs.

LinkedPairs1 <- PairSummaries(SyntenyLinks = Links,
                              DBPATH = DBPATH,
                              PIDs = FALSE,
                              Verbose = TRUE)
## 
## Preparing overhead data.
## Overhead complete.
## Collecting pairs.
## ================================================================================
## Time difference of 2.564016 secs

The object LinkedPairs1 is a data.frame where each row is populated by information about a predicted orthologous pair. By default PairSummaries uses a simple model to determine whether the k-mers that link a pair of genes are likely to provide an erroneous link. When set to Model = "Global", is is simply a prediction of whether the involved nucleotides are likely to describe a pair of genomic features whose alignment would result in a PID that falls within a random distribution. This model is effective if somewhat permissive, but is significantly faster than performing many pairwise alignments.

print(head(LinkedPairs1))
##      p1      p2 ExactMatch Consensus TotalKmers MaxKmer p1FeatureLength
## 1 1_1_1 2_1_310        324 0.7727273          1     324             324
## 2 1_1_2 2_1_309       1362 1.0000000          1    1362            1362
## 3 1_1_3 2_1_308        528 1.0000000          1     528             528
## 4 1_1_4 2_1_307        741 1.0000000          1     741             741
## 5 1_1_5 2_1_306       1014 1.0000000          1    1014            1014
## 6 1_1_6 2_1_305        696 1.0000000          1     696             696
##   p2FeatureLength Adjacent     TetDist PIDType PredictedPID
## 1             594        1 0.043957492      AA    0.4548953
## 2            1362        2 0.002943341      AA    0.9999764
## 3             528        2 0.005387480      AA    0.9982457
## 4             741        2 0.007901019      AA    0.9995543
## 5            1014        2 0.005767509      AA    0.9999043
## 6             696        2 0.000000000      AA    0.9995661

PairSummaries includes arguments that allow for aligning all pairs that are predicted, via PIDs = TRUE, while IgnoreDefaultStringSet = FALSE indicates that alignments should be performed in nucleotide or amino acid space as is appropriate for the linked sequences. Setting IgnoreDefaultStringSet = TRUE will force all alignments into nucleotide space.

As of SynExtend v 1.3.13, the functions ExtractBy and DisjointSet have been added to provide users with direct tools to work with PairSummaries objects.

SingleLinkageClusters <- DisjointSet(Pairs = LinkedPairs1,
                                     Verbose = TRUE)
## 
## Assigning initial root:
## ================================================================================
## 
## Time difference of 0.02947187 secs
## 
## Assigning final root:
## ================================================================================
## 
## Time difference of 0.01788187 secs
## 
## Assigning single linkage clusters.
## Assignments complete.
## 
## Time difference of 0.0534656 secs
# extract the first 10 clusters
Sets <- ExtractBy(x = LinkedPairs1,
                  y = DBPATH,
                  z = SingleLinkageClusters[1:10],
                  Verbose = TRUE)
## 
## Extracting Sequences:
## ================================================================================
## 
## Arranging Sequences:
## ================================================================================
## 
## Time difference of 0.2577453 secs
head(Sets)
## [[1]]
## DNAStringSet object of length 6:
##     width seq                                               names               
## [1]   324 TTGACAACAAAAAAAAATGATAA...TATTAAATTTTTCTCCTCAATAA 1_1_1
## [2]   411 GTGATCGTTTCTTTGTTGCTATA...AAAAGAAAAATCAAGATCAATAA 1_1_663
## [3]   135 TTGCCCAACACCCCTAACAAGAA...ATCAGGTTGCCCAACACCCCTAA 1_1_664
## [4]   594 TTGGCAGGACCAACGTTGACCAA...TATTAAATTTTTCTCCTCAATAA 2_1_310
## [5]   324 TTGACAACAAAAAAAAATGATAA...TATTAAATTTTTCTCCTCAATAA 3_1_272
## [6]  1059 ATGATATCAATCATTGTACATGT...TATTAAATTTTTCTCCTCAATAA 4_1_442
## 
## [[2]]
## DNAStringSet object of length 4:
##     width seq                                               names               
## [1]  1362 ATGTATTTAATAATTAGAAAAAC...GGTCCCCATTACATTTAGAATAA 1_1_2
## [2]  1362 ATGTATTTAATAATTAGAAAAAC...GGTCCCCATTACATTTAGAATAA 2_1_309
## [3]  1362 ATGTATTTAATAATTAGAAAAAC...GGTCCCCATTACATTTAGAATAA 3_1_273
## [4]  1296 ATGTTAATTATATTTTTATGGAA...GGTCTCCATTACATTTAGAATAA 4_1_443
## 
## [[3]]
## DNAStringSet object of length 4:
##     width seq                                               names               
## [1]   528 ATGAAACCAAATAATTTTTTTAA...GTGTTGTATTAATATATAAATAA 1_1_3
## [2]   528 ATGAAACCAAATAATTTTTTTAA...GTGTTGTATTAATATATAAATAA 2_1_308
## [3]   528 ATGAAACCAAATAATTTTTTTAA...GTGTTGTATTAATATATAAATAA 3_1_274
## [4]   528 ATGAAATCAAACAATTTTTTTAA...GCGTTGTATTAATATATAAATAA 4_1_444
## 
## [[4]]
## DNAStringSet object of length 4:
##     width seq                                               names               
## [1]   741 ATGATGATATATTTTAATATTAC...AACGTCTAATACACCTATCATAA 1_1_4
## [2]   741 ATGATGATATATTTTAATATTAC...AACGTCTAATACACCTATCATAA 2_1_307
## [3]   741 ATGATGATATATTTTAATATTAC...AACGTCTAATACACCTATCATAA 3_1_275
## [4]   738 ATGATATATTTTAATATTACAAA...AACGTCTAATACACCTATCATAA 4_1_445
## 
## [[5]]
## DNAStringSet object of length 4:
##     width seq                                               names               
## [1]  1014 ATGATACAAATTGTTTATGTAGT...TTATGGCGTTGAATTGCAAATGA 1_1_5
## [2]  1014 ATGATACAGATTGTTTATGTAGT...TTATGGCGTTGAATTGCAAATGA 2_1_306
## [3]  1014 ATGATACAAATTGTTTATGTAGT...TTATGGCGTTGAATTGCAAATGA 3_1_276
## [4]  1014 ATGATACAAATTATTTATGTAGC...TTATTACGTTGAATTGCAAATGA 4_1_446
## 
## [[6]]
## DNAStringSet object of length 4:
##     width seq                                               names               
## [1]   696 ATGCATATAACTAAATTAGTTTT...TTCAACATTATTATTTAAAATAA 1_1_6
## [2]   696 ATGCATATAACTAAATTAGTTTT...TTCAACATTATTATTTAAAATAA 2_1_305
## [3]   696 ATGCATATAACTAAATTAGTTTT...TTCAACATTATTATTTAAAATAA 3_1_277
## [4]   699 ATGCATATAACTAAATTAGTTTT...TTCAACATTATTATTTAAAATAA 4_1_447

Session Info:

sessionInfo()
## R version 4.4.0 RC (2024-04-16 r86468)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] SynExtend_1.17.0    DECIPHER_3.1.0      Biostrings_2.73.0  
## [4] GenomeInfoDb_1.41.0 XVector_0.45.0      IRanges_2.39.0     
## [7] S4Vectors_0.43.0    BiocGenerics_0.51.0 BiocStyle_2.33.0   
## 
## loaded via a namespace (and not attached):
##  [1] bit_4.0.5               jsonlite_1.8.8          highr_0.10             
##  [4] compiler_4.4.0          BiocManager_1.30.22     crayon_1.5.2           
##  [7] Rcpp_1.0.12             tinytex_0.50            blob_1.2.4             
## [10] magick_2.8.3            parallel_4.4.0          jquerylib_0.1.4        
## [13] yaml_2.3.8              fastmap_1.1.1           R6_2.5.1               
## [16] knitr_1.46              bookdown_0.39           GenomeInfoDbData_1.2.12
## [19] DBI_1.2.2               bslib_0.7.0             rlang_1.1.3            
## [22] cachem_1.0.8            xfun_0.43               sass_0.4.9             
## [25] bit64_4.0.5             RSQLite_2.3.6           memoise_2.0.1          
## [28] cli_3.6.2               magrittr_2.0.3          zlibbioc_1.51.0        
## [31] digest_0.6.35           lifecycle_1.0.4         vctrs_0.6.5            
## [34] evaluate_0.23           rmarkdown_2.26          httr_1.4.7             
## [37] pkgconfig_2.0.3         tools_4.4.0             htmltools_0.5.8.1      
## [40] UCSC.utils_1.1.0