A probabilistic modeling framework that jointly analyzes personal genome and transcriptome data to estimate the probability that a variant has regulatory impact in that individual.

RIVER 1.3.0

`RIVER`

is an `R`

package of a probabilistic modeling framework, called *RIVER (RNA-Informed Variant Effect on Regulation)*, that jointly analyzes personal genome (WGS) and transcriptome data (RNA-seq) to estimate the probability that a variant has regulatory impact in that individual. It is based on a generative model that assumes that genomic annotations, such as the location of a variant with respect to regulatory elements, determine the prior probability that variant is a functional regulatory variant, which is an unobserved variable. The functional regulatory variant status then influences whether nearby genes are likely to display outlier levels of gene expression in that person.

*RIVER* is a hierarchical Bayesian model that predicts the regulatory effects of rare variants by integrating gene expression with genomic annotations. The *RIVER* consists of three layers: a set of nodes \(G = G_{1}, ..., G_{P}\) in the topmost layer representing \(P\) observed genomic annotations over all rare variants near a particular gene, a latent binary variable \(FR\) in the middle layer representing the unobserved funcitonal regulatory status of the rare variants, and one binary node \(E\) in the final layer representing expression outlier status of the nearby gene. We model each conditional probability distribution as follows: \[ FR | G \sim Bernoulli(\psi), \psi = logit^{-1}(\beta^T, G) \] \[E | FR \sim Categorical(\theta_{FR}) \] \[ \beta_i \sim N(0, \frac{1}{\lambda})\] \[\theta_{FR} \sim Beta(C,C)\] with parameters \(\beta\) and \(\theta\) and hyper-parameters \(\lambda\) and \(C\).

Because \(FR\) is unobserved, log-likelihood objective of *RIVER* over instances \(n = 1, ..., N\), \[
\sum_{n=1}^{N} log\sum_{FR_n= 0}^{1} P(E_n, G_n, FR_n | \beta, \theta),
\] is non-convex. We therefore optimize model parameters via Expectation-Maximization (EM) as follows:

In the E-step, we compute the posterior probabilities (\(\omega_n^{(i)}\)) of the latent variables \(FR_n\) given current parameters and observed data. For example, at the \(i\)-th iteration, the posterior probability of \(FR_n = 1\) for the \(n\)-th instance is \[ \omega_{1n}^{(i)} = P(FR_n = 1 | G_n, \beta^{(i)}, E_n, \theta^{(i)}) =\frac{P(FR_n = 1 | G_n, \beta^{(i)}) \cdotp P(E_n | FR_n = 1, \theta^{(i)})}{\sum_{FR_n = 0}^1 P(FR_n | G_n, \beta^{(i)}) \cdotp P(E_n | FR_n, \theta^{(i)})}, \] \[\omega_{0n}^{(i)} = 1 - \omega_{1n}^{(i)}.\]

In the M-step, at the \(i\)-th iteration, given the current estimates \(\omega^{(i)}\), the parameters (\(\beta^{(i+1)*}\)) are estimated as \[ \max_{\beta^{(i+1)}} \sum_{n = 1}^N \sum_{FR_n = 0}^1 log(P(FR_n | G_n, \beta^{(i+1)})) \cdotp \omega_{FR, n}^{(i)} - \frac{\lambda}{2}||\beta^{(i+1)}||_2, \] where \(\lambda\) is an L2 penalty hyper-parameter derived from the Gaussian prior on \(\beta\). The parameters \(\theta\) get updated as: \[ \theta_{s,t}^{(i+1)} = \sum_{n = 1}^{N} I(E_n = t) \cdotp \omega_{s,n}^{(i)} + C. \] where \(I\) is an indicator operator, \(t\) is the binary value of expression \(E_n\), \(s\) is the possible binary values of \(FR_n\) and \(C\) is a pseudo count derived from the Beta prior on . The E- and M-steps are applied iteratively until convergence.

The purpose of this section is to provide users a general sense of our package, `RIVER`

, including components and their behaviors and applications. We will briefly go over main functions, observe basic operations and corresponding outcomes. Throughout this section, users may have better ideas about which functions are available, which values to be chosen for necessary parameters, and where to seek help. More detailed descriptions are given in later sections.

First, we load `RIVER`

:

`library("RIVER")`

`RIVER`

consists of several functions mainly supporting two main functions including `evaRIVER`

and `appRIVER`

, which we are about to show how to use them here. We first load simulated data created beforehand for illustration.

```
filename <- system.file("extdata", "simulation_RIVER.gz",
package="RIVER")
dataInput <- getData(filename) # import experimental data
```

`getData`

combines different resources including genomic features, outlier status of gene expression, and N2 pairs having same rare variants into standardized data structures, called `ExpressionSet`

class.

`print(dataInput)`

```
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 18 features, 6122 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: indiv58:gene1614 indiv5:gene1331 ... indiv18:gene1111
## (6122 total)
## varLabels: Outlier N2pair
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:
```

```
Feat <- t(Biobase::exprs(dataInput)) # genomic features (G)
Out <- as.numeric(unlist(dataInput$Outlier))-1 # outlier status (E)
```

In the simulated data, an input object `dataInput`

consists of 18 genomic features and expression outlier status of 6122 samples and which samples belong to N2 pairs.

`head(Feat)`

```
## Feature1 Feature2 Feature3 Feature4 Feature5
## indiv58:gene1614 0.4890338 -1.2143183 -0.5015241 -0.8093881 -0.1916957
## indiv5:gene1331 -4.1665487 -0.6355490 -1.0914989 -2.8069442 -0.3921658
## indiv76:gene447 -0.4469724 -0.9440314 0.7386942 0.6786997 0.6043647
## indiv16:gene126 1.6252083 -2.2686506 1.4156013 -0.5662072 -0.1405224
## indiv45:gene1044 0.3086791 1.0044798 0.8493856 0.3515569 -0.3763434
## indiv6:gene258 0.9046627 1.7618399 -1.6258895 0.3961724 -0.2203089
## Feature6 Feature7 Feature8 Feature9 Feature10
## indiv58:gene1614 -0.06614698 0.1993153 -0.7544253 -1.3505036 -0.03477715
## indiv5:gene1331 -0.43831006 1.8175888 -0.8411557 2.1495237 -0.48691686
## indiv76:gene447 1.47627930 0.6521233 -0.6416004 1.0309900 -0.38262446
## indiv16:gene126 2.11676989 1.0670951 0.3404799 -1.5970916 -0.49910751
## indiv45:gene1044 1.45269690 -0.8368466 -1.0016646 0.1908291 -0.26598159
## indiv6:gene258 -0.76821018 -1.3436283 -0.7201516 0.5440035 0.31006097
## Feature11 Feature12 Feature13 Feature14 Feature15
## indiv58:gene1614 2.08675352 0.7283183 0.15074710 1.5183579 0.2226134
## indiv5:gene1331 2.57339374 0.4840111 0.58897093 -0.2069596 0.1000502
## indiv76:gene447 0.30621172 0.6690835 -1.39701213 1.4853201 1.6552545
## indiv16:gene126 0.58599041 -0.7052013 -0.07715282 -1.3326831 0.1719152
## indiv45:gene1044 1.45029299 -0.4530850 -0.09610983 1.4731617 -0.9372256
## indiv6:gene258 0.05174989 0.4026417 -0.50911462 -0.3623563 -2.5279770
## Feature16 Feature17 Feature18
## indiv58:gene1614 1.88846321 -0.4408236 -0.4558111
## indiv5:gene1331 0.22640429 1.2535793 -1.0163254
## indiv76:gene447 0.71720775 0.2758493 1.1378452
## indiv16:gene126 1.12919669 0.6166061 -1.6069772
## indiv45:gene1044 0.14794114 1.2241711 1.4916670
## indiv6:gene258 -0.01574568 -1.5835940 -1.2366762
```

`head(Out)`

`## [1] 1 1 1 0 1 1`

`Feat`

contains continuous values of genomic features (defined as \(G\) in the objective function) while `Out`

contains binary values representing outlier status of same samples as `Feat`

(defined as \(E\) in the objective function).

For evaluation, we hold out pairs of individuals at genes where only those two individuals shared the same rare variants. Except for the list of instances, we train *RIVER* with the rest of instances, compute the *RIVER* score (the posterior probability of having a functional regulatory variant given both WGS and RNA-seq data) from one individual, and assess the accuracy with respect to the second individualâ€™s held-out expression levels. Since there is currently quite few gold standard set of functional rare variants, using this labeled test data allow us to evaluate predictive accuracy of *RIVER* compared with genomic annotation model, *GAM*, that uses genomic annotations alone. We can observe a significant improvement by incorporating expression data.

To do so, we simply use `evaRIVER`

:

`evaROC <- evaRIVER(dataInput)`

```
## ::: EM iteration is terminated since it converges within a
## predefined tolerance (0.001) :::
```

`evaROC`

is an S4 object of class `evaRIVER`

which contains two AUC values from *RIVER* and *GAM*, specificity and sensitivity measures from the two models, and p-value of comparing the two AUC values.

`cat('AUC (GAM-genomic annotation model) = ', round(evaROC$GAM_auc,3), '\n')`

`## AUC (GAM-genomic annotation model) = 0.58`

`cat('AUC (RIVER) = ', round(evaROC$RIVER_auc,3), '\n')`

`## AUC (RIVER) = 0.8`

`cat('P-value ', format.pval(evaROC$pvalue, digits=2, eps=0.001), '***\n')`

`## P-value <0.001 ***`

We can visualize the ROC curves with AUC values:

```
par(mar=c(6.1, 6.1, 4.1, 4.1))
plot(NULL, xlim=c(0,1), ylim=c(0,1),
xlab="False positive rate", ylab="True positive rate",
cex.axis=1.3, cex.lab=1.6)
abline(0, 1, col="gray")
lines(1-evaROC$RIVER_spec, evaROC$RIVER_sens,
type="s", col='dodgerblue', lwd=2)
lines(1-evaROC$GAM_spec, evaROC$GAM_sens,
type="s", col='mediumpurple', lwd=2)
legend(0.7,0.2,c("RIVER","GAM"), lty=c(1,1), lwd=c(2,2),
col=c("dodgerblue","mediumpurple"), cex=1.2,
pt.cex=1.2, bty="n")
title(main=paste("AUC: RIVER = ", round(evaROC$RIVER_auc,3),
", GAM = ", round(evaROC$GAM_auc,3),
", P = ", format.pval(evaROC$pvalue, digits=2,
eps=0.001),sep=""))
```

Each ROC curve from either *RIVER* or *GAM* is computed by comparing the posterior probability given available data for the 1st individual with the outlier status of the 2nd individual in the list of held-out pairs and vice versa.

To extract posterior probabilities for prioritizing functional rare variants in any downstream analysis such as finding pathogenic rare variants in disease, you simply run `appRIVER`

to obtain the posterior probabilities:

`postprobs <- appRIVER(dataInput)`

```
## ::: EM iteration is terminated since it converges within a
## predefined tolerance (0.001) :::
```

`postprobs`

is an S4 object of class `appRIVER`

which contains subject IDs, gene names, \(P(FR = 1 | G)\), \(P(FR = 1 | G, E)\), and `fitRIVER`

, all the relevant information of the fitted *RIVER* including hyperparamters for further use.

Probabilities of rare variants being functional from *RIVER* and *GAM* for a few samples are shown below:

```
example_probs <- data.frame(Subject=postprobs$indiv_name,
Gene=postprobs$gene_name,
RIVERpp=postprobs$RIVER_posterior,
GAMpp=postprobs$GAM_posterior)
head(example_probs)
```

```
## Subject Gene RIVERpp GAMpp
## 1 indiv58 gene1614 0.4303673 0.2319994
## 2 indiv5 gene1331 0.3597397 0.1939262
## 3 indiv76 gene447 0.5249447 0.3061179
## 4 indiv16 gene126 0.1316062 0.2447880
## 5 indiv45 gene1044 0.4488103 0.2370526
## 6 indiv6 gene258 0.3598831 0.1714101
```

From left to right, it shows subject ID, gene name, posterior probabilities from *RIVER*, posterior probabilities from *GAM*.

To observe how much we can obtain additional information on functional rare variants by integrating the outlier status of gene expression into *RIVER* in the following figure.

`plotPosteriors(postprobs, outliers=as.numeric(unlist(dataInput$Outlier))-1)`