BatchQC 2.5.0
Sequencing and microarray samples are often collected or processed in multiple batches or at different times. This often produces technical biases that can lead to incorrect results in the downstream analysis. BatchQC is a software tool that streamlines batch preprocessing and evaluation by providing shiny app interactive diagnostics, visualizations, and statistical analyses to explore the extent to which batch variation impacts the data. This is contained in a full R package which allows reproducibility of all shiny evaluations and analyses. BatchQC diagnostics help determine whether batch adjustment needs to be done, and how correction should be applied before proceeding with a downstream analysis. Moreover, BatchQC interactively applies multiple common batch effect approaches to the data, and the user can quickly see the benefits of each method. BatchQC is developed as a Shiny App, but is reproducible at the command line through the functions contained int he R package. The output of the shiny app is organized into multiple tabs, and each tab features an important part of the batch effect analysis and visualization of the data. The BatchQC interface has the following features:
BatchQC()
is the function that launches the Shiny App in interactive mode.
To begin, install Bioconductor and then install BatchQC:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("BatchQC")
To install the most up-to-date version of BatchQC, please install directly from github. You will need the devtools package. You can install both of these with the following commands:
if (!require("devtools", quietly = TRUE)) {
install.packages("devtools")
}
library(devtools)
install_github("wejlab/BatchQC")
You should now be able to load BatchQC and launch the shiny app.
library(BatchQC)
BatchQC()
Below is an example of how you may use the shiny app to analyze your data:
Upon launching the shiny app, you will be on the “Upload Data” screen. From here, you may upload the following data:
A preview of the selected data will show up under the “Input Summary” and “Full Metadata” tab. Input summary shows the counts file/assay on the Input summary tab and the metadata on the “Full Metadata” tab. If a preview does not load properly, please ensure your dataset is structured properly and a valid file type.
You MUST hit “Upload” for your data to be available to interact with the shiny app.
All shiny app functions can also be run from the command line; all command line execution require that your data is in a Summarized Experiment object. You can create a summarized experiment object from a counts and metadata matrix. The remainder of this documentation will utilize the signature data example data set included in the package. This can be loaded from the command line as a summarized experiment object with the following:
data(signature_data)
data(batch_indicator)
se_object <- BatchQC::summarized_experiment(signature_data, batch_indicator)
SummarizedExperiment::assayNames(se_object) <- "log_intensity"
se_object$batch <- as.factor(se_object$batch)
se_object$condition <- as.factor(se_object$condition)
Likewise, you can upload your own data into a summarized experiment object for use with command line functions by providing a data matrix and sample info matrix to the summarized_experiment function.
After you have successfully uploaded your data set of interest, you can apply one of the following Normalization methods if appropriate:
To use one of these two methods, select it from the normalization method drop down, select an assay to perform the normalization on from the drop down, and name the new assay (the prepopulated name can be changed to your preference). You may also choose to log-transform the results of either of these normalization methods or your raw data by selecting the “Log transform” box. Hit “Normalize” to complete the analysis. Your new assay will now be available for further analysis.
Alternatively, this can be called directly from the command line on an SE object and will return the same SE object with an additional assay with the name that you provide as the output_assay_name.
Our signature data set already has a log adjusted assay, so normalization is not needed. Examples are provided below on how you would complete this if needed via the command line:
# CPM Normalization
se_object <- BatchQC::normalize_SE(se = se_object, method = "CPM",
log_bool = FALSE, assay_to_normalize = "log_intensity",
output_assay_name = "CPM_normalized_counts")
# CPM Normalization w/log
se_object <- BatchQC::normalize_SE(se = se_object, method = "CPM",
log_bool = TRUE, assay_to_normalize = "log_intensity",
output_assay_name = "CPM_log_normalized_counts")
# DESeq Normalization
se_object <- BatchQC::normalize_SE(se = se_object, method = "DESeq",
log_bool = FALSE, assay_to_normalize = "log_intensity",
output_assay_name = "DESeq_normalized_counts")
# DESeq Normalization w/log
se_object <- BatchQC::normalize_SE(se = se_object, method = "DESeq",
log_bool = TRUE, assay_to_normalize = "log_intensity",
output_assay_name = "DESeq_log_normalized_counts")
# log adjust
se_object <- BatchQC::normalize_SE(se = se_object, method = "none",
log_bool = TRUE, assay_to_normalize = "log_intensity",
output_assay_name = "log_normalized_counts")
Select the appropriate Batch Effect correction model to create a batch-corrected assay for your data. You may select one of the following:
To apply your desired batch effect correction model, select the method from the drop down menu, then select the appropriate assay in the second drop down menu, select your batch variable (labeled batch in all built-in examples), and the covariates that you would like to preserve (not including the batch variable). If desired, revise the name for the corrected assay and then hit “Correct.”
A progress bar will pop up in the bottom right hand corner and a message will state “Correction Complete” in the bottom right hand corner when complete.
You are now ready to analyze your raw and batch effect corrected data sets!
The signature data is log normalized, so ComBat is one appropriate batch correction tool. Therefore, to run from the command line use the following:
# ComBat correction
se_object <- BatchQC::batch_correct(se = se_object, method = "ComBat",
assay_to_normalize = "log_intensity", batch = "batch",
covar = c("condition"), output_assay_name = "ComBat_corrected")
## Found 53 genes with uniform expression within a single batch (all zeros); these will not be adjusted for batch.
## Found3batches
## Adjusting for1covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
ComBat-Seq (or limma or sva) can be applied to a data set by providing “ComBat-Seq” (or “limma” or “sva”) to the method parameter in the function call.
This tab allows you to view the batch status of each condition. Select the variable that represents your batch variable and then select the variable that represents the condition. A table will then appear listing each condition in the row and how many samples in each condition are in each batch (displayed in the columns).
To recreate this table, run the following code, which will produce a tibble as output:
batch_design_tibble <- BatchQC::batch_design(se = se_object, batch = "batch",
covariate = "condition")
batch_design_tibble
## # A tibble: 10 × 4
## # Groups: condition [10]
## condition `1` `2` `3`
## <fct> <int> <int> <int>
## 1 1 6 12 9
## 2 2 6 0 0
## 3 3 0 6 0
## 4 4 0 6 0
## 5 5 0 5 0
## 6 6 0 6 0
## 7 7 0 6 0
## 8 8 0 0 9
## 9 9 0 0 9
## 10 10 0 0 9
This tab displays the Pearson correlation coefficient and Cramer’s V estimation based on the selected batch and condition.
This table can be recreated with the following code:
confound_table <- BatchQC::confound_metrics(se = se_object, batch = "batch")
confound_table
## Pearson Correlation Coefficient Cramer's V
## condition 0.9179524 0.8005757
Pearson correlation coefficient indicates the strength of the linear relationship between your batch and condition variables. It can range from -1 to 1. The closer the value is to 0, the less likely there is a batch effect. The closer the value is to -1 or 1, the more likely there is a batch effect.
To calculate only the pearson correlation coefficient, run the following:
pearson_cor_result <- BatchQC::std_pearson_corr_coef(batch_design_tibble)
pearson_cor_result
## [1] 0.9179524
This is an additional metric for batch effect and will be between 0 and 1. The closer the value is to 0, the lower the batch effect. The closer the value is to 1, the greater the batch effect
To calculate only Cramer’s V, run the following:
cramers_v_result <- BatchQC::cramers_v(batch_design_tibble)
cramers_v_result
## [1] 0.8005757
If both the Pearson Correlation Coefficient and the Cramer’s V test indicate low batch effect, you likely do not need to adjust your results for batch. However, if one or both metrics indicate some or high batch effect, you should use additional analysis tools (listed below) to explore the need for a batch correction for your results.
First, select your unadjusted assay, your batch variable and any covariates of interest. Click “Here we go!” to see the variation analysis results. The “Individual Variable” tab shows the individual, or raw variation, explained by each variable. The “Residual” tab displays the type 2, or residual, variation for each variable. A box plot is displayed along with a table with the variation specific to each gene listed. This table is searchable and can be displayed in ascending or descending order. Notice the amount of variation attributed to batch. In our signature data set, most of the variation is attributed to batch rather than the condition. This indicates a need for a batch correction.
The “Individual Variation Variable/Batch Ratio” tab displays the individual, or raw variation, divided by the batch. A ratio greater that 1 indicates that batch has a stronger affect than the variable of interest. And the “Residual Variation Variable/Batch Ratio” displays the residual, or type 2 variation, divided by the batch. A ratio greater that 1 indicates that batch has a stronger affect than the variable of interest.
Next, select the batch corrected assay, with the same batch and condition variable and observe the changes in variation. When batch effect is being properly corrected, you should notice that the variation explained by the batch variable has decreased in the batch corrected assay, indicating that the variation in the data is now more likely to be associated with your condition of interest rather than the batch. The ratio statistics should also be less than 1. This is visible in our example data set and confirms the need for batch adjustment.
To recreate the variation analysis plot for the log_intensity assay, use the following code:
ex_variation <- batchqc_explained_variation(se = se_object,
batch = "batch", condition = "condition", assay_name = "log_intensity")
EV_boxplot <- BatchQC::EV_plotter(batchqc_ev = ex_variation[[1]])
EV_boxplot
EV_boxplot_type_2 <- BatchQC::EV_plotter(batchqc_ev = ex_variation[[2]])
EV_boxplot_type_2
To recreate the variation analysis table for the log_intensity assay, use the following code:
ex_variation <- batchqc_explained_variation(se = se_object,
batch = "batch", condition = "condition", assay_name = "log_intensity")
EV_table <- BatchQC::EV_table(batchqc_ev = ex_variation[[1]])
EV_table
## Gene Name Explained batch condition Unexplained
## <char> <num> <num> <num> <num>
## 1: Gene1 36.61 26.52 14.12 63.39
## 2: Gene2 60.24 57.70 44.63 39.76
## 3: Gene3 49.31 34.32 36.55 50.69
## 4: Gene4 17.31 3.01 16.32 82.69
## 5: Gene5 85.54 81.74 69.30 14.46
## ---
## 1596: Gene1596 96.29 90.68 62.02 3.71
## 1597: Gene1597 92.51 87.16 54.09 7.49
## 1598: Gene1598 74.53 28.47 56.57 25.47
## 1599: Gene1599 16.14 5.56 12.94 83.86
## 1600: Gene1600 27.77 15.21 19.30 72.23
EV_table_type_2 <- BatchQC::EV_table(batchqc_ev = ex_variation[[2]])
EV_table_type_2
## Gene Name Explained batch condition Unexplained
## <char> <num> <num> <num> <num>
## 1: Gene1 36.61 22.49 10.09 63.39
## 2: Gene2 60.24 15.62 2.54 39.76
## 3: Gene3 49.31 12.76 14.99 50.69
## 4: Gene4 17.31 1.00 14.31 82.69
## 5: Gene5 85.54 16.24 3.80 14.46
## ---
## 1596: Gene1596 96.29 34.27 5.61 3.71
## 1597: Gene1597 92.51 38.41 5.34 7.49
## 1598: Gene1598 74.53 17.96 46.06 25.47
## 1599: Gene1599 16.14 3.20 10.58 83.86
## 1600: Gene1600 27.77 8.47 12.56 72.23
We can then compare our batch corrected assay, using the following code:
ex_variation <- batchqc_explained_variation(se = se_object,
batch = "batch", condition = "condition", assay_name = "ComBat_corrected")
EV_boxplot <- BatchQC::EV_plotter(batchqc_ev = ex_variation[[1]])
EV_boxplot
EV_boxplot_type_2 <- BatchQC::EV_plotter(batchqc_ev = ex_variation[[2]])
EV_boxplot_type_2
EV_table <- BatchQC::EV_table(batchqc_ev = ex_variation[[1]])
EV_table
## Gene Name Explained batch condition Unexplained
## <char> <num> <num> <num> <num>
## 1: Gene1 12.83 0.98 6.99 87.17
## 2: Gene2 5.37 0.01 4.60 94.63
## 3: Gene3 23.71 1.65 23.53 76.29
## 4: Gene4 12.85 0.23 12.72 87.15
## 5: Gene5 24.47 1.53 22.35 75.53
## ---
## 1596: Gene1596 49.51 2.84 43.03 50.49
## 1597: Gene1597 29.33 0.12 29.29 70.67
## 1598: Gene1598 70.83 19.44 69.39 29.17
## 1599: Gene1599 16.14 5.56 12.94 83.86
## 1600: Gene1600 15.65 0.05 13.21 84.35
EV_table_type_2 <- BatchQC::EV_table(batchqc_ev = ex_variation[[2]])
EV_table_type_2
## Gene Name Explained batch condition Unexplained
## <char> <num> <num> <num> <num>
## 1: Gene1 12.83 5.84 11.85 87.17
## 2: Gene2 5.37 0.77 5.36 94.63
## 3: Gene3 23.71 0.18 22.06 76.29
## 4: Gene4 12.85 0.12 12.61 87.15
## 5: Gene5 24.47 2.12 22.94 75.53
## ---
## 1596: Gene1596 49.51 6.48 46.68 50.49
## 1597: Gene1597 29.33 0.04 29.21 70.67
## 1598: Gene1598 70.83 1.44 51.39 29.17
## 1599: Gene1599 16.14 3.20 10.58 83.86
## 1600: Gene1600 15.65 2.45 15.60 84.35
Select your assay of interest and the batch and condition(s) of interest. We recommend viewing no more than 500 elements at a time (or the max number in your data set).
This heat map shows how similar each sample is to each other sample in your data (samples are plotted on both the x and y axis, so are identical to their self). The samples are clustered together based on similarity. Ideally, you would expect your samples to cluster and be more similar to samples from the same condition rather than from the sample batch. If there is clear visual clustering of the batch in your raw data, you should apply a batch correction method. Your batch corrected assay should have stronger clustering by condition.
Using the signature dataset, we initially see clustering based on batch, but after apply ComBat, our adjusted data set clusters more based on condition.
Use the following code to recreate the sample correlation heatmap:
heatmaps <- BatchQC::heatmap_plotter(se = se_object, assay = "log_intensity",
nfeature = 38, annotation_column = c("batch", "condition"),
log_option = "FALSE")
correlation_heatmap <- heatmaps$correlation_heatmap
correlation_heatmap
The heatmap shoes the sample clustering on the x axis and the y axis is each gene. The values on the heatmap represent gene expression. The dendrogram on the x axis is the same clustering arrangement as was seen in the sample correlations. This heatmap allows the viewer to see patterns of gene expression across the samples based on the selected variables. Remember that if batch is clustering together on your raw data, you should perform a batch effect correction to adjust your data.
Use the following code to recreate the heatmap:
heatmaps <- BatchQC::heatmap_plotter(se = se_object, assay = "log_intensity",
nfeature = 38, annotation_column = c("batch", "condition"),
log_option = FALSE)
heatmap <- heatmaps$topn_heatmap
heatmap