Contents

1 Background and Rational

Protein domains is one of the most import types of annotation we have for proteins. For such annotation the Pfam database/tool is (by far) the most used tool and the de-facto way protein domains are currently defined. We have recently shown most human protein domains exist as multiple distinct variants termed domain isotypes - a until now overlooked aspect of protein domains. Domain isotypes are used in a cell, tissue, and disease-specific manner. Accordingly, we find that domain isotypes, compared to each other, modulate, or abolish the functionality of a protein domain. For more information check our preprint

This R package enables the user to read the Pfam prediction from both webserver and stand-alone runs into R and afterwards identify and classification domain isotypes from Pfam results.

These 5 isotypes are the reference isotype and four isotypes that, compared to the reference isotype, are best described as a truncation, an insertion, a deletion, or combinations thereof (“complex”) and are visualized in the figure below:

2 Installation

pfamAnalyzeR is part of the Bioconductor repository and community which means it is distributed with, and dependent on, Bioconductor. Installation of pfamAnalyzeR is easy and can be done from within the R terminal. If it is the first time you use Bioconductor (or don’t know if you have used it), simply copy-paste the following into an R session to install the basic Bioconductor packages (will only done if you don’t already have them):

if (!requireNamespace("BiocManager", quietly = TRUE)){
    install.packages("BiocManager")
    BiocManager::install()
}

If you already have installed Bioconductor, running these two commands will check whether updates for installed packages are available.

After you have installed the basic Bioconductor packages you can install pfamAnalyzeR by copy-pasting the following into an R session:

BiocManager::install("pfamAnalyzeR")

This will install the pfamAnalyzeR package as well as other R packages that are needed for pfamAnalyzeR to work.

3 Workflow

Lets take a look at how an analysis with pfamAnalyzeR looks

We start by loading the pfamAnalyzeR library.

library(pfamAnalyzeR)
#> Loading required package: readr
#> Loading required package: stringr
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

To showcase pfamAnalyzeR we have included the output of running a small toy dataset througth the pfam webserver in the R pacakge. This can be accessed as follows:

### Create sting pointing to result file
# note that you do not need to use the "system.file".
# That is only needed when accessing files in an R package
pfamResultFile <- system.file(
    "extdata/pfam_results.txt",
    package = "pfamAnalyzeR"
)

file.exists(pfamResultFile)
#> [1] TRUE

Now we can read in the Pfam result file and classify each domain into one of the 5 domain isotypes.

### Run entire pfam analysis
pfamRes <- pfamAnalyzeR(pfamResultFile)

With this done we can explore the results a little:


### Look at a few entries
pfamRes %>% 
    select(seq_id, hmm_name, type, domain_isotype, domain_isotype_simple) %>% 
    head()
#> # A tibble: 6 × 5
#>   seq_id         hmm_name type   domain_isotype domain_isotype_simple
#>   <chr>          <chr>    <chr>  <chr>          <chr>                
#> 1 TCONS_00000045 ASC      Family Insertion      Non-reference        
#> 2 TCONS_00000046 ASC      Family Insertion      Non-reference        
#> 3 TCONS_00000047 ASC      Family Insertion      Non-reference        
#> 4 TCONS_00000048 ASC      Family Insertion      Non-reference        
#> 5 TCONS_00000049 ASC      Family Truncation     Non-reference        
#> 6 TCONS_00000066 DUF3523  Family Complex        Non-reference

### Summarize domain isotype
table(pfamRes$domain_isotype)
#> 
#>    Complex   Deletion  Insertion  Reference Truncation 
#>         80          3         30        196         31

### Summarize domain isotype
table(pfamRes$domain_isotype_simple)
#> 
#> Non-reference     Reference 
#>           144           196

From which it can be seen that a large fraction (!) of protein domains found in regular data are non-reference isotypes.


Please note that pfamAnalyzeR performs the isotype analysis regardless of Pfam result type:

table(
    pfamRes$type,
    pfamRes$domain_isotype_simple
)
#>         
#>          Non-reference Reference
#>   Domain            78       126
#>   Family            40        62
#>   Repeat            26         8

Meaning if you are only interested in a specific annotation type you will have to subset it yourself.

4 SessionInfo

sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] pfamAnalyzeR_1.2.0 dplyr_1.1.3        stringr_1.5.0      readr_2.1.4       
#> [5] BiocStyle_2.30.0  
#> 
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.6.4         cli_3.6.1           knitr_1.44         
#>  [4] rlang_1.1.1         xfun_0.40           stringi_1.7.12     
#>  [7] generics_0.1.3      jsonlite_1.8.7      glue_1.6.2         
#> [10] htmltools_0.5.6.1   sass_0.4.7          hms_1.1.3          
#> [13] fansi_1.0.5         rmarkdown_2.25      evaluate_0.22      
#> [16] jquerylib_0.1.4     tibble_3.2.1        tzdb_0.4.0         
#> [19] fastmap_1.1.1       yaml_2.3.7          lifecycle_1.0.3    
#> [22] bookdown_0.36       BiocManager_1.30.22 compiler_4.3.1     
#> [25] pkgconfig_2.0.3     digest_0.6.33       R6_2.5.1           
#> [28] tidyselect_1.2.0    utf8_1.2.4          pillar_1.9.0       
#> [31] magrittr_2.0.3      bslib_0.5.1         withr_2.5.1        
#> [34] tools_4.3.1         cachem_1.0.8