# 1 Introduction

To date, several packages have been developed to infer gene coexpression networks from expression data, such as WGCNA (Langfelder and Horvath 2008), CEMiTool (Russo et al. 2018) and petal (Petereit et al. 2016). However, network inference and analysis is a non-trivial task that requires solid statistical background, especially for data preprocessing and proper interpretation of results. Because of that, inexperienced researchers often struggle to choose the most suitable algorithms for their projects. Besides, different packages are required for each step of a standard network analysis, and their distinct syntaxes can hinder interoperability between packages, particularly for non-advanced R users. Here, we have developed an all-in-one R package that uses state-of-the-art algorithms to facilitate the workflow of biological network analysis, from data acquisition to analysis and interpretation. This will likely accelerate network analysis pipelines and advance systems biology research.

# 2 Installation

if(!requireNamespace('BiocManager', quietly = TRUE))
install.packages('BiocManager')

BiocManager::install("BioNERO")
# Load package after installation
library(BioNERO)
##
set.seed(123) # for reproducibility

For this tutorial, we will use maize (Zea mays) gene expression data normalized in TPM. The data were obtained from Shin et al. (2020) and were filtered for package size issues. For more information on the data set, see ?zma.se. The data set is stored as a SummarizedExperiment object.1 NOTE: In case you have many tab-separated expression tables in a directory, BioNERO has a helper function named dfs2one() to load all these files and combine them into a single data frame.

The input expression data in BioNERO can be both a SummarizedExperiment object or a gene expression matrix or data frame with genes in rows and samples in columns. However, we strongly recommend using SummarizedExperiment objects for easier interoperability with other Bioconductor packages.

data(zma.se)

# Take a quick look at the data
zma.se
## class: SummarizedExperiment
## dim: 10802 28
## assays(1): ''
## rownames(10802): ZeamMp030 ZeamMp044 ... Zm00001d054106 Zm00001d054107
## rowData names(0):
## colnames(28): SRX339756 SRX339757 ... SRX2792103 SRX2792104
## colData names(1): Tissue
SummarizedExperiment::colData(zma.se)
## DataFrame with 28 rows and 1 column
##                    Tissue
##               <character>
## SRX339756       endosperm
## SRX339757       endosperm
## SRX339758       endosperm
## SRX339762       endosperm
## SRX339763       endosperm
## ...                   ...
## SRX2792107 whole_seedling
## SRX2792108 whole_seedling
## SRX2792102 whole_seedling
## SRX2792103 whole_seedling
## SRX2792104 whole_seedling

## 3.1 Step-by-step data preprocessing

This section is suitable for users who want to have more control of their data analysis, as they can inspect the data set after each preprocessing step and analyze how different options to the arguments would affect the expression data. If you want a quick start, you can skip to the next section (Automatic, one-step data preprocessing).

Step 1: Replacing missing values. By default, replace_na() will replace NAs with 0. Users can also replace NAs with the mean of each row (generally not advisable, but it can be useful in very specific cases).

exp_filt <- replace_na(zma.se)
sum(is.na(zma.se))
## [1] 0

Step 2: Removing non-expressed genes. Here, for faster network reconstruction, we will remove every gene whose median value is below 10. The function’s default for min_exp is 1. For other options, see ?remove_nonexp.

exp_filt <- remove_nonexp(exp_filt, method="median", min_exp = 10)
dim(exp_filt)
## [1] 8529   28

Step 3 (optional): Filtering genes by variance. It is reasonable to remove genes whose expression values do not vary much across samples, since we often want to find genes that are more or less expressed in particular conditions. Here, we will keep only the top 2000 most variable genes. Users can also filter by percentile (e.g., the top 10% most variable genes).

exp_filt <- filter_by_variance(exp_filt, n=2000)
dim(exp_filt)
## [1] 2000   28

Step 4: Removing outlying samples. There are several methods to remove outliers. We have implemented the Z.K (standardized connectivity) method (Oldham, Langfelder, and Horvath 2012) in ZKfiltering(), which is a network-based approach to remove outliers. This method has proven to be more suitable for network analysis, since it can remove outliers that other methods (such as hierarchical clustering) cannot identify. By default, BioNERO considers all samples with ZK < 2 as outliers, but this parameter is flexible if users want to change it.

exp_filt <- ZKfiltering(exp_filt, cor_method = "pearson")
## Number of removed samples: 1
dim(exp_filt)
## [1] 2000   27

Step 5: Adjusting for confounding artifacts. This is an important step to avoid spurious correlations resulting from confounders. The method was described by Parsana et al. (2019), who developed a principal component (PC)-based correction for confounders. After correction, the expression data are quantile normalized, so every gene follows an approximate normal distribution.

exp_filt <- PC_correction(exp_filt)

## 3.2 Automatic, one-step data preprocessing

Alternatively, users can preprocess their data with a single function. The function exp_preprocess() is a wrapper for the functions replace_na(), remove_nonexp(), filter_by_variance(), ZKfiltering() and PC_correction(). The arguments passed to exp_preprocess() will be used by each of these functions to generate a filtered expression data frame in a single step.2 NOTE: Here, we are using TPM-normalized data. If you have expression data as raw read counts, set the argument vstransform = TRUE in exp_preprocess(). This will apply DESeq2’s variance stabilizing transformation (Love, Huber, and Anders 2014) to your count data.

final_exp <- exp_preprocess(zma.se, min_exp = 10, variance_filter = TRUE, n=2000)
## Number of removed samples: 1
dim(exp_filt) == dim(final_exp)
## [1] TRUE TRUE

# Take a look at the final data
final_exp
## class: SummarizedExperiment
## dim: 2000 27
## assays(1): ''
## rownames(2000): ZeamMp030 ZeamMp092 ... Zm00001d054093 Zm00001d054107
## rowData names(0):
## colnames(27): SRX339756 SRX339757 ... SRX2792103 SRX2792104
## colData names(1): Tissue

# 4 Exploratory data analysis

BioNERO includes some functions for easy data exploration. These functions were created to avoid having to type code chunks that, although small, will be used many times. The idea here is to make the user experience with biological network analysis as easy and simple as possible.

Plotting heatmaps: the function plot_heatmap() plots heatmaps of correlations between samples or gene expression in a single line. Users can use their preferred RColorBrewer’s palette, hide/show gene names, and activate/deactivate clustering for rows and/or columns.

# Heatmap of sample correlations
p <- plot_heatmap(final_exp, type = "samplecor")

# Heatmap of gene expression
p <- plot_heatmap(final_exp, type="expr")

Principal component analysis (PCA): the function plot_PCA() performs a PCA and plots PC1 vs PC2 (by default), as well the percentage of variance explained by each PC. Users can also choose to plot PC1 vs PC3 or PC2 vs PC3.

plot_PCA(final_exp)

# 5 Gene coexpression network inference

Now that we have our filtered and normalized expression data, we can reconstruct a gene coexpression network (GCN) with the WGCNA algorithm (Langfelder and Horvath 2008). First of all, we need to identify the most suitable $$\beta$$ power that makes the network satisfy the scale-free topology. We do that with the function SFT_fit(). Correlation values are raised to a power $$\beta$$ to amplify their distances and, hence, to make the module detection algorithm more powerful. The higher the value of $$\beta$$, the closer to the scale-free topology the network is. However, a very high $$\beta$$ power reduces mean connectivity, which is not desired. To solve this trade-off, we pick the lowest $$\beta$$ power above a certain threshold (by default in SFT_fit(), 0.8). This makes the network close to the scale-free topology without dramatically reducing the mean connectivity.

sft <- SFT_fit(final_exp, net_type="signed hybrid", cor_method="pearson")
##    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
## 1      3    0.220 -0.218          0.178   278.0    303.00  598.0
## 2      4    0.416 -0.382          0.272   196.0    199.00  472.0
## 3      5    0.573 -0.468          0.462   145.0    136.00  381.0
## 4      6    0.675 -0.536          0.584   110.0     95.70  312.0
## 5      7    0.748 -0.584          0.676    86.3     70.00  259.0
## 6      8    0.791 -0.653          0.735    68.8     51.90  221.0
## 7      9    0.803 -0.717          0.761    55.8     38.60  191.0
## 8     10    0.815 -0.775          0.790    45.8     29.90  167.0
## 9     11    0.821 -0.828          0.815    38.1     22.90  147.0
## 10    12    0.838 -0.874          0.850    32.0     17.90  130.0
## 11    13    0.847 -0.913          0.876    27.2     14.30  116.0
## 12    14    0.856 -0.943          0.893    23.2     11.80  104.0
## 13    15    0.875 -0.973          0.913    20.0      9.79   93.0
## 14    16    0.892 -0.997          0.937    17.3      8.00   83.9
## 15    17    0.897 -1.020          0.941    15.1      6.75   76.0
## 16    18    0.891 -1.070          0.948    13.3      5.79   69.7
## 17    19    0.888 -1.100          0.950    11.7      4.96   64.2
## 18    20    0.888 -1.130          0.957    10.4      4.27   59.4
sft$power ## [1] 9 power <- sft$power

As we can see, the optimal power is 9. However, we strongly recommend a visual inspection of the simulation of different $$\beta$$ powers, as WGCNA can fail to return the most appropriate $$\beta$$ power in some cases.3 PRO TIP: If your $$\beta$$ power is too low (say below 6), look at the plot as a sanity check. The function SFT_fit() automatically saves a ggplot object in the second element of the resulting list. To visualize it, you simply have to access the plot.

sft\$plot