
title: “Introduction to the XINA package”

author: “Lang Ho Lee, Sasha A. Singh”

date: “February 6, 2019”

vignette: >

output:

knitr:::html_vignette:

df_print: kable

toc: true

number_sections: true

1. Introduction

Quantitative proteomics experiments, using for instance isobaric tandem mass tagging approaches, are conducive to measuring changes in protein abundance over multiple time points in response to one or more conditions or stimulations. The aim of XINA is to determine which proteins exhibit similar patterns within and across experimental conditions, since proteins with co-abundance patterns may have common molecular functions. XINA imports multiple datasets, tags each dataset in silico, and combines the data for subsequent subgrouping into multiple clusters. The result is a single output depicting the variation across all conditions. XINA not only extracts co-abundance profiles within and across experiments, but also incorporates protein-protein interaction databases and integrative resources such as KEGG to infer interactors and molecular functions, respectively, and produces intuitive graphical outputs.

1-1. Main contribution

Easy-to-use software for clustering and network analyses, aimed at non-expert users.

1-2. Data inputs

Any type of quantitative proteomics data, labeled or label-free

2. XINA websites

https://cics.bwh.harvard.edu/software

http://bioconductor.org/packages/XINA/

https://github.com/langholee/XINA/

3. XINA installation

XINA requires R>=3.5.0.

# Install from Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("XINA")

# Install from GitHub
install.packages("devtools")
library(devtools)
install_github("langholee/XINA")

The first step is to load XINA:

library(XINA)

To follow this vignette, you may also need the following packages:

install.packages("igraph")
install.packages("ggplot2")
BiocManager::install("STRINGdb")

4. Example theoretical dataset

We generated an example dataset to show how XINA can be used in your research. To demonstrate XINA functions and allow users to perform similar exercises, we included a module that generates multiplexed time-series datasets from theoretical data. The data consist of three treatment conditions: ‘Control’, ‘Stimulus1’ and ‘Stimulus2’. Each condition has time-series data from 0 to 72 hours. As an example, we chose the mTOR pathway to be differentially regulated across the three conditions.

# Generate random multiplexed time-series data
random_data_info <- make_random_xina_data()

# The number of proteins
random_data_info$size

## [1] 500

# Time points
random_data_info$time_points

## [1] "0hr"  "2hr"  "6hr"  "12hr" "24hr" "48hr" "72hr"

# Three conditions
random_data_info$conditions

## [1] "Control"   "Stimulus1" "Stimulus2"

Read and check the randomly generated data:

Control <- read.csv("Control.csv", check.names=FALSE, stringsAsFactors = FALSE)
Stimulus1 <- read.csv("Stimulus1.csv", check.names=FALSE, stringsAsFactors = FALSE)
Stimulus2 <- read.csv("Stimulus2.csv", check.names=FALSE, stringsAsFactors = FALSE)

head(Control)

##   Accession                                                        Description 0hr    2hr    6hr   12hr   24hr   48hr
## 1     OR5H6 olfactory receptor family 5 subfamily H member 6 (gene/pseudogene) 0.5 0.7818 0.6189 0.1313 0.4946 0.9573
## 2     TCF21                                            transcription factor 21 0.5 0.4862 0.3966 0.3724 0.6120 0.4225
## 3      CIR1                               corepressor interacting with RBPJ, 1 0.5 0.6490 0.0912 0.2087 0.1535 0.6079
## 4      RRM1                      ribonucleotide reductase catalytic subunit M1 0.5 0.7568 0.8834 0.0226 0.5365 0.4793
## 5      VWA1                        von Willebrand factor A domain containing 1 0.5 0.2210 0.5243 0.0833 0.4742 0.9378
## 6      DAG1                                                     dystroglycan 1 0.5 0.8142 0.2755 0.0765 0.0660 0.7316
##     72hr
## 1 0.8320
## 2 0.3347
## 3 0.1953
## 4 0.3467
## 5 0.7766
## 6 0.9274

head(Stimulus1)

##   Accession                                                        Description 0hr    2hr    6hr   12hr   24hr   48hr
## 1     OR5H6 olfactory receptor family 5 subfamily H member 6 (gene/pseudogene) 0.5 0.7615 0.2942 0.8499 0.8538 0.3254
## 2     TCF21                                            transcription factor 21 0.5 0.7653 0.3204 0.1156 0.0216 0.7268
## 3      CIR1                               corepressor interacting with RBPJ, 1 0.5 0.9132 0.1886 0.9620 0.6913 0.8396
## 4      RRM1                      ribonucleotide reductase catalytic subunit M1 0.5 0.0504 0.7828 0.0196 0.0988 0.8253
## 5      VWA1                        von Willebrand factor A domain containing 1 0.5 0.0040 0.8108 0.3270 0.6577 0.2858
## 6      DAG1                                                     dystroglycan 1 0.5 0.3896 0.7706 0.4683 0.0130 0.2409
##     72hr
## 1 0.9968
## 2 0.0414
## 3 0.1757
## 4 0.0006
## 5 0.5280
## 6 0.5827

head(Stimulus2)

##   Accession                                                        Description 0hr    2hr    6hr   12hr   24hr   48hr
## 1     OR5H6 olfactory receptor family 5 subfamily H member 6 (gene/pseudogene) 0.5 0.9823 0.7190 0.6689 0.2337 0.1543
## 2     TCF21                                            transcription factor 21 0.5 0.7813 0.6489 0.1966 0.8922 0.3971
## 3      CIR1                               corepressor interacting with RBPJ, 1 0.5 0.8360 0.7728 0.1406 0.2874 0.0327
## 4      RRM1                      ribonucleotide reductase catalytic subunit M1 0.5 0.4515 0.3766 0.2886 0.9011 0.4001
## 5      VWA1                        von Willebrand factor A domain containing 1 0.5 0.4990 0.2913 0.7866 0.5184 0.0581
## 6      DAG1                                                     dystroglycan 1 0.5 0.3425 0.9147 0.5509 0.5427 0.2889
##     72hr
## 1 0.5299
## 2 0.6778
## 3 0.8593
## 4 0.3066
## 5 0.1563
## 6 0.2099
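
The `check.names=FALSE` argument in the `read.csv` calls above matters because the time-point column names start with a digit. The following is a minimal sketch of the effect, using a hypothetical two-protein table written to a temporary file (the toy values, object names and file path are illustrative assumptions, not XINA output):

```r
# Hypothetical toy table; column names mimic the time points used in this vignette
toy <- data.frame(Accession = c("P1", "P2"),
                  `0hr` = c(0.5, 0.5),
                  `2hr` = c(0.78, 0.49),
                  check.names = FALSE, stringsAsFactors = FALSE)
toy_file <- tempfile(fileext = ".csv")
write.csv(toy, toy_file, row.names = FALSE)

# Default behaviour: names that are not syntactically valid are rewritten
mangled <- read.csv(toy_file, stringsAsFactors = FALSE)
# With check.names = FALSE the original headers are kept
kept <- read.csv(toy_file, check.names = FALSE, stringsAsFactors = FALSE)

colnames(mangled)
colnames(kept)
```

With the default setting the headers become "X0hr" and "X2hr", which would no longer match a vector of time points such as "0hr" and "2hr".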

Since XINA needs to know which columns hold the kinetics data matrix, the user must supply a vector containing the column names of that matrix. These column names have to be the same in all input datasets (Control, Stimulus1 and Stimulus2).

head(Control[random_data_info$time_points])

##   0hr    2hr    6hr   12hr   24hr   48hr   72hr
## 1 0.5 0.7818 0.6189 0.1313 0.4946 0.9573 0.8320
## 2 0.5 0.4862 0.3966 0.3724 0.6120 0.4225 0.3347
## 3 0.5 0.6490 0.0912 0.2087 0.1535 0.6079 0.1953
## 4 0.5 0.7568 0.8834 0.0226 0.5365 0.4793 0.3467
## 5 0.5 0.2210 0.5243 0.0833 0.4742 0.9378 0.7766
## 6 0.5 0.8142 0.2755 0.0765 0.0660 0.7316 0.9274
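
Before clustering, it can help to confirm that every input table really contains the requested kinetics columns. The sketch below is one way to do this in base R. The `make_toy` helper and the toy tables are hypothetical stand-ins for the three CSV files, and this check is not part of XINA itself:

```r
# Hypothetical stand-ins for the three input tables
time_points <- c("0hr", "2hr", "6hr")
make_toy <- function() {
  data.frame(Accession = c("P1", "P2"),
             `0hr` = 0.5, `2hr` = runif(2), `6hr` = runif(2),
             check.names = FALSE, stringsAsFactors = FALSE)
}
datasets <- list(Control = make_toy(), Stimulus1 = make_toy(), Stimulus2 = make_toy())

# For each dataset, list the requested kinetics columns that are absent
missing_cols <- lapply(datasets, function(d) setdiff(time_points, colnames(d)))

# TRUE only if no dataset lacks any of the requested columns
all_present <- all(lengths(missing_cols) == 0)
all_present
```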

5. Package features

XINA is an R package that can examine, but is not limited to, time-series omics data from multiple experimental conditions. It has three modules: 1. model-based clustering analysis, 2. coregulation analysis, and 3. protein-protein interaction network analysis (we used STRING DB for this vignette).

5.1 Clustering analysis using model-based clustering or k-means clustering algorithm

XINA implements model-based clustering to classify features (genes or proteins) according to their expression profiles. Model-based clustering selects the number of clusters that optimizes the Bayesian information criterion (BIC). It is performed by the ‘mclust’ R package [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5096736/], which was also used by our previously developed tool mIMT-visHTS [https://www.ncbi.nlm.nih.gov/pubmed/26232111]. By default, XINA sum-normalizes each gene/protein time-series profile [https://www.ncbi.nlm.nih.gov/pubmed/19861354] to standardize all datasets. Most importantly, XINA assigns an electronic tag to each dataset’s proteins (similar to TMT) so that the multiple datasets can be combined into a single “super dataset” for subsequent clustering.
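
To make the normalization and tagging steps concrete, here is a minimal base R sketch on a hypothetical two-protein matrix. It assumes that sum-normalization means dividing each profile by its own total, and the `Control_` prefix is an illustrative tag format rather than XINA’s actual internal representation:

```r
# Hypothetical abundance matrix: rows are proteins, columns are time points
abundance <- matrix(c(1, 2, 1,
                      4, 4, 2),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("P1", "P2"), c("0hr", "2hr", "6hr")))

# Sum-normalization: divide each profile by its own total so every row sums to 1
normalized <- sweep(abundance, 1, rowSums(abundance), "/")

# In silico tag (assumed format): prefix each protein with its condition name
tagged <- normalized
rownames(tagged) <- paste("Control", rownames(normalized), sep = "_")
tagged
```

Tagged matrices from several conditions could then be stacked with `rbind` so that all profiles are clustered together.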

XINA uses the ‘mclust’ package for the model-based clustering. ‘mclust’ requires a fixed random seed to give reproducible clustering results.

set.seed(0)

‘nClusters’ is the number of desired clusters. ‘mclust’ chooses the optimal number of clusters based on the Bayesian information criterion (BIC). Note that ‘mclust’ reports the negative of the conventional BIC: in ‘mclust’ a higher BIC indicates a better clustering scheme, whereas under the conventional definition a lower BIC is better.
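
The sign convention can be checked on a simple model using only base R. The sketch below fits a single Gaussian to hypothetical data and computes both forms of the criterion. It does not call ‘mclust’; the "mclust-style" value is computed by hand as 2·logLik − k·log(n):

```r
set.seed(1)
x <- rnorm(50, mean = 2)

# Single-Gaussian model fitted as an intercept-only linear model
fit <- lm(x ~ 1)
ll <- logLik(fit)
k <- attr(ll, "df")   # number of estimated parameters (mean and variance)
n <- length(x)

bic_standard     <- -2 * as.numeric(ll) + k * log(n)  # conventional BIC: lower is better
bic_mclust_style <-  2 * as.numeric(ll) - k * log(n)  # sign-flipped form: higher is better

c(standard = bic_standard, stats_BIC = BIC(fit), mclust_style = bic_mclust_style)
```

The first two values agree, and the third is their exact negative, so ranking models by the highest sign-flipped value selects the same model as ranking by the lowest conventional BIC.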

# Data files
data_files <- paste(random_data_info$conditions, ".csv", sep='')
data_files

## [1] "Control.csv"   "Stimulus1.csv" "Stimulus2.csv"

# time points of the data matrix
data_column <- random_data_info$time_points
data_column

## [1] "0hr"  "2hr"  "6hr"  "12hr" "24hr" "48hr" "72hr"

Run the model-based clustering:

# Run the model-based clustering
clustering_result <- xina_clustering(data_files, data_column=data_column, nClusters=20)

In addition to model-based clustering, XINA supports k-means clustering:

clustering_result_km <- xina_clustering(data_files, data_column=data_column, nClusters=20, chosen_model='kmeans')
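
For readers unfamiliar with k-means, the following sketch applies base R’s `stats::kmeans` to a hypothetical set of six normalized profiles. It illustrates the algorithm only and is not a description of how XINA calls it internally:

```r
set.seed(0)
# Hypothetical normalized profiles: three rising and three falling proteins
rising  <- t(replicate(3, c(0.1, 0.3, 0.6) + rnorm(3, sd = 0.01)))
falling <- t(replicate(3, c(0.6, 0.3, 0.1) + rnorm(3, sd = 0.01)))
profiles <- rbind(rising, falling)
rownames(profiles) <- paste0("P", 1:6)
colnames(profiles) <- c("0hr", "2hr", "6hr")

# Several random starts reduce sensitivity to the initial cluster centers
km <- kmeans(profiles, centers = 2, nstart = 10)
km$cluster
```

Unlike model-based clustering, k-means uses exactly the number of clusters it is given and does not select it with an information criterion.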

For visualizing clustering results, XINA draws line graphs of the clustering results using ‘plot_clusters’.

library(ggplot2)
theme1 <- theme(title=element_text(size=8, face='bold'),
                axis.text.x = element_text(size=7),
                axis.text.y = element_blank(),
                axis.ticks.x = element_blank(),
                axis.ticks.y = element_blank(),
                axis.title.x = element_blank(),
                axis.title.y = element_blank())
plot_clusters(clustering_result, ggplot_theme=theme1)

XINA calculates the sample condition composition of each cluster; for example, in the output below, 96.67% of the proteins in cluster 20 come from Stimulus1. ‘plot_condition_compositions’ plots these compositions as pie charts. Sample composition information is insightful because it shows which specific patterns are closely related to each stimulus.

theme2 <- theme(legend.key.size = unit(0.3, "cm"),
                legend.text=element_text(size=5),
                title=element_text(size=7, face='bold'))
condition_composition <- plot_condition_compositions(clustering_result, ggplot_theme=theme2)

tail(condition_composition)

##    Cluster Condition  N Percent_ratio
## 54      18 Stimulus2 25         34.72
## 55      19   Control 57         98.28
## 56      19 Stimulus2  1          1.72
## 57      20   Control  1          1.67
## 58      20 Stimulus1 58         96.67
## 59      20 Stimulus2  1          1.67
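
The percentages in this table are within-cluster proportions. The base R sketch below computes them for a hypothetical set of six tagged proteins. The column names mirror the output above, but the calculation is an illustrative assumption about how such a table can be derived, not XINA’s code:

```r
# Hypothetical cluster assignments for proteins tagged by condition
assignments <- data.frame(
  Cluster   = c(1, 1, 1, 1, 2, 2),
  Condition = c("Control", "Control", "Control", "Stimulus1", "Stimulus1", "Stimulus2"),
  stringsAsFactors = FALSE
)

# Count proteins per cluster and condition, then convert to within-cluster percentages
counts  <- table(assignments$Cluster, assignments$Condition)
percent <- round(100 * prop.table(counts, margin = 1), 2)
percent
```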

5.2 Coregulation analysis

XINA supposes that proteins that comigrate between clusters in response to a given condition are more likely to be coregulated at the biological level than other proteins within the same clusters. This module requires at least two datasets to compare. XINA treats features assigned to the same cluster in an experimental condition as a coregulated group. It traces the comigrated proteins across experimental conditions and finds significant trends by 1) the number of member features (proteins) and 2) an enrichment test using Fisher’s exact test. The comigrations are displayed in an alluvial plot. In XINA, a comigration is defined as a group of proteins that show the same expression pattern, classified and evaluated by XINA clustering, in at least two dataset conditions. In other words, proteins that are assigned to the same cluster in two or more datasets are considered comigrated. XINA’s ‘alluvial_enriched’ is designed to find these comigrations and draw alluvial plots to visualize them.

classes <- as.vector(clustering_result$condition)
classes

## [1] "Control"   "Stimulus1" "Stimulus2"

all_cor <- alluvial_enriched(clustering_result, classes)

## [1] "length(selected_conditions) > 2, so XINA can't apply the enrichment filter\n            Can't apply the enrichment filter, so pval_threshold is ignored"

head(all_cor)

##   Control Stimulus1 Stimulus2 Comigration_size RowNum PValue Pvalue.adjusted TP FP FN TN Alluvia_color
## 1       0         0         1               10      1     NA              NA NA NA NA NA       #BEBEBE
## 2       0         0         2                1      2     NA              NA NA NA NA NA       #BEBEBE
## 3       0         0         3               13      3     NA              NA NA NA NA NA       #BEBEBE
## 4       0         0         4                2      4     NA              NA NA NA NA NA       #BEBEBE
## 5       0         0         5                3      5     NA              NA NA NA NA NA       #BEBEBE
## 6       0         0         6                3      6     NA              NA NA NA NA NA       #BEBEBE

You can narrow down comigrations by using the size (the number of comigrated proteins) filter.

cor_bigger_than_5 <- alluvial_enriched(clustering_result, classes, comigration_size=5)

## [1] "length(selected_conditions) > 2, so XINA can't apply the enrichment filter\n            Can't apply the enrichment filter, so pval_threshold is ignored"

head(cor_bigger_than_5)

##    Control Stimulus1 Stimulus2 Comigration_size RowNum PValue Pvalue.adjusted TP FP FN TN Alluvia_color
## 1        0         0         1               10      1     NA              NA NA NA NA NA       #BEBEBE
## 3        0         0         3               13      3     NA              NA NA NA NA NA       #BEBEBE
## 7        0         0         7                5      7     NA              NA NA NA NA NA       #BEBEBE
## 10       0         0        10                8     10     NA              NA NA NA NA NA       #BEBEBE
## 11       0         0        11                7     11     NA              NA NA NA NA NA       #BEBEBE
## 13       0         0        13                5     13     NA              NA NA NA NA NA       #BEBEBE
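
The idea behind the comigration size and the enrichment test can be sketched with base R on two conditions. The cluster labels below are hypothetical. The 2x2 table and the one-sided alternative are assumptions made for illustration and may differ from the contingency table that ‘alluvial_enriched’ builds internally:

```r
# Hypothetical cluster labels for the same ten proteins in two conditions
clusters <- data.frame(
  Protein   = paste0("P", 1:10),
  Control   = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  Stimulus1 = c(2, 2, 2, 1, 2, 3, 3, 3, 1, 1)
)

# Comigration sizes: number of proteins for each (Control, Stimulus1) cluster pair
comigrations <- as.data.frame(table(Control = clusters$Control,
                                    Stimulus1 = clusters$Stimulus1),
                              responseName = "Comigration_size")
comigrations <- comigrations[comigrations$Comigration_size > 0, ]

# 2x2 table for the comigration from Control cluster 1 to Stimulus1 cluster 2
in_source <- clusters$Control == 1
in_target <- clusters$Stimulus1 == 2
contingency <- table(in_source = in_source, in_target = in_target)

# One-sided Fisher's exact test: are source-cluster proteins over-represented in the target?
enrichment <- fisher.test(contingency, alternative = "greater")
enrichment$p.value
```

Filtering `comigrations` on `Comigration_size` is the analogue of the size filter shown above.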

5.3 Network analysis

XINA conducts protein-protein interaction (PPI) network analysis using the ‘igraph’ and ‘STRINGdb’ R packages. XINA constructs PPI networks for comigrated protein groups as well as for individual clusters of a specific experimental (dataset) condition. In the constructed networks, XINA finds influential players by calculating network centrality measures, including betweenness, closeness and eigenvector scores. For selected comigrated groups, XINA can run an enrichment test based on gene ontology and KEGG pathway terms to help interpret them.
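
To illustrate what a centrality score captures, the sketch below computes degree and eigenvector centrality for a hypothetical four-protein network using base R only. XINA itself relies on ‘igraph’ for these calculations; the edge list and the hand-rolled eigenvector computation here are assumptions made to keep the example dependency-free:

```r
# Hypothetical undirected PPI edge list
edges <- data.frame(from = c("A", "A", "A", "B"),
                    to   = c("B", "C", "D", "C"),
                    stringsAsFactors = FALSE)
nodes <- sort(unique(c(edges$from, edges$to)))

# Symmetric adjacency matrix built from the edge list
adj <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
adj[cbind(edges$from, edges$to)] <- 1
adj[cbind(edges$to, edges$from)] <- 1

# Degree: number of interaction partners of each protein
degree <- rowSums(adj)

# Eigenvector centrality: leading eigenvector of the adjacency matrix, scaled to a maximum of 1
leading <- abs(eigen(adj, symmetric = TRUE)$vectors[, 1])
eigen_score <- setNames(leading / max(leading), nodes)

degree
eigen_score
```

In this toy network protein A interacts with every other protein, so it receives the highest score under both measures.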

XINA’s example dataset uses human gene names, so download the human PPI database from STRING DB and run the XINA PPI network analysis.

library(STRINGdb)
string_db <- STRINGdb$new( version="10", species=9606, score_threshold=0, input_directory="" )
string_db

xina_result <- xina_analysis(clustering_result, string_db)

You can easily draw PPI networks of all the XINA clusters using the ‘xina_plot_all’ function. The PPI network plots will be stored in the working directory.

# XINA network plots labeled with gene names
xina_plot_all(xina_result, clustering_result)

If you want to see more, please check “README.md” in our GitHub XINA repository.