Contents

1 Preamble

1.1 Package aim

One of the steps in single cell RNA-seq analysis is the filtering of features (i.e. genes or transcripts) to retain only the most expressed ones. This allows to focus on features less affected by technical variations (relatively), and often increases the statistical power of the subsequent analyses.

This step is usually done by applying one or more arbitrary thresholds (i.e. mean TPM > 1, retention of features frequently detected across cells, etc.), or by using the data from spike-in RNA. This package aims to help the threshold decision, especially when spike-in controls are not available.

1.2 Method summary

Highly expressed features are less affected by technical variability than lowly expressed ones, and thus capture biological information more successfully. We propose to use highly expressed features as a reference to to estimate the drowning of biological variation in bins of features of decreasing expression. In this scenario, it is expected that features from major transcription programs will be correlated with each other across bins, unless the technical variation component dominates the expression results.

One of the package’s first steps is to select this reference bin of features (or ‘top window’ of features). We propose to use the 100 features with the highest mean expression across cells, and a reasonably low fraction of drop-outs (zero-expression values) as reference. Using more features will increase the noise in the biological variation reflected in this reference bin (as well as increase computation time), while using less features might result in insufficient capture of the biological complexity of the samples. Subsequent bins of features of decreasing mean expression across cells are then defined (by default, 1000 features per bin). Note that the size of the subsequent feature bins matters less than the size of the reference bin.

For comparison, several negative control bins are defined by randomly shuffling values of the top bin expression matrix, building a window that is expected to be completely uncorrelated. Then, every feature in each bin is correlated to every feature in the reference bin and the control bins to obtain correlation coefficient distributions. Ultimately, the variation of the distributions of correlation coefficients can be used to inform the threshold decision.

1.3 Installation

The stable version of scFeatureFilter can be installed from Bioconductor:

## try http:// if https:// URLs are not supported
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("scFeatureFilter")

Once installed, the package is ready to be loaded:

library(scFeatureFilter)

library(ggplot2)
library(cowplot) # multipanel figures + nice theme

2 Integrated usage

A single-function call option is available to the user, in which the entire package functionality is executed. This will filter the expression matrix and keep only the most informative highly expressed features, using default options:

# example dataset included with the package:
scData_hESC
## # A tibble: 60,468 x 33
##    tracking_id GSM922224_hESCp… GSM922225_hESCp… GSM922226_hESCp…
##    <chr>                  <dbl>            <dbl>            <dbl>
##  1 ENSG000000…           55.3             36.0             53.7  
##  2 ENSG000000…            0                0                0.129
##  3 ENSG000000…           54.0             55.5             41.9  
##  4 ENSG000000…            0.924            0.222            0.646
##  5 ENSG000000…            7.08             1.35             5.42 
##  6 ENSG000000…            0                0                0    
##  7 ENSG000000…            0                0                0    
##  8 ENSG000000…            7.88            29.2              2.15 
##  9 ENSG000000…           21.2              6.64             3.38 
## 10 ENSG000000…            3.59             3.47             3.64 
## # … with 60,458 more rows, and 29 more variables:
## #   GSM922227_hESCpassage0_Cell7_0 <dbl>, GSM922228_hESCpassage0_Cell8_0 <dbl>,
## #   GSM922230_hESCpassage0_Cell10_0 <dbl>,
## #   GSM922250_hESCpassage10_Cell1_0 <dbl>,
## #   GSM922251_hESCpassage10_Cell2_0 <dbl>,
## #   GSM922252_hESCpassage10_Cell3_0 <dbl>,
## #   GSM922253_hESCpassage10_Cell4_0 <dbl>,
## #   GSM922254_hESCpassage10_Cell5_0 <dbl>,
## #   GSM922255_hESCpassage10_Cell6_0 <dbl>,
## #   GSM922256_hESCpassage10_Cell7_0 <dbl>,
## #   GSM922257_hESCpassage10_Cell8_0 <dbl>,
## #   GSM922258_hESCpassage10_Cell9_0 <dbl>,
## #   GSM922259_hESCpassage10_Cell10_0 <dbl>,
## #   GSM922260_hESCpassage10_Cell11_0 <dbl>,
## #   GSM922261_hESCpassage10_Cell12_0 <dbl>,
## #   GSM922262_hESCpassage10_Cell13_0 <dbl>,
## #   GSM922263_hESCpassage10_Cell14_0 <dbl>,
## #   GSM922264_hESCpassage10_Cell15_0 <dbl>,
## #   GSM922265_hESCpassage10_Cell16_0 <dbl>,
## #   GSM922266_hESCpassage10_Cell17_0 <dbl>,
## #   GSM922267_hESCpassage10_Cell18_0 <dbl>,
## #   GSM922268_hESCpassage10_Cell19_0 <dbl>,
## #   GSM922269_hESCpassage10_Cell20_0 <dbl>,
## #   GSM922270_hESCpassage10_Cell21_0 <dbl>,
## #   GSM922271_hESCpassage10_Cell22_0 <dbl>,
## #   GSM922272_hESCpassage10_Cell23_0 <dbl>,
## #   GSM922273_hESCpassage10_Cell24_0 <dbl>,
## #   GSM922274_hESCpassage10_Cell25_0 <dbl>,
## #   GSM922275_hESCpassage10_Cell26_0 <dbl>

# filtering of the dataset with a single function call:
sc_feature_filter(scData_hESC)
## Mean expression of last top gene: 2114.40221875
## Number of windows: 18
## # A tibble: 7,442 x 33
##    geneName GSM922224_hESCp… GSM922225_hESCp… GSM922226_hESCp… GSM922227_hESCp…
##    <chr>               <dbl>            <dbl>            <dbl>            <dbl>
##  1 ENSG000…           40272.           68182            54616.           50846.
##  2 ENSG000…           30747.           32311.           38113.           34258.
##  3 ENSG000…           11057.           13508.           19671            14292.
##  4 ENSG000…           17100.           10635.           12243.           13642.
##  5 ENSG000…           19985.           41373.           23986.           28730.
##  6 ENSG000…           11290.           18370.           17370.           14370.
##  7 ENSG000…            7005.           12713.           10136.            9368.
##  8 ENSG000…           17239.           23962.           28032.           31566 
##  9 ENSG000…           14995.           30880.           17723.           19549 
## 10 ENSG000…            9676.            9654.           10235.           11405.
## # … with 7,432 more rows, and 28 more variables:
## #   GSM922228_hESCpassage0_Cell8_0 <dbl>,
## #   GSM922230_hESCpassage0_Cell10_0 <dbl>,
## #   GSM922250_hESCpassage10_Cell1_0 <dbl>,
## #   GSM922251_hESCpassage10_Cell2_0 <dbl>,
## #   GSM922252_hESCpassage10_Cell3_0 <dbl>,
## #   GSM922253_hESCpassage10_Cell4_0 <dbl>,
## #   GSM922254_hESCpassage10_Cell5_0 <dbl>,
## #   GSM922255_hESCpassage10_Cell6_0 <dbl>,
## #   GSM922256_hESCpassage10_Cell7_0 <dbl>,
## #   GSM922257_hESCpassage10_Cell8_0 <dbl>,
## #   GSM922258_hESCpassage10_Cell9_0 <dbl>,
## #   GSM922259_hESCpassage10_Cell10_0 <dbl>,
## #   GSM922260_hESCpassage10_Cell11_0 <dbl>,
## #   GSM922261_hESCpassage10_Cell12_0 <dbl>,
## #   GSM922262_hESCpassage10_Cell13_0 <dbl>,
## #   GSM922263_hESCpassage10_Cell14_0 <dbl>,
## #   GSM922264_hESCpassage10_Cell15_0 <dbl>,
## #   GSM922265_hESCpassage10_Cell16_0 <dbl>,
## #   GSM922266_hESCpassage10_Cell17_0 <dbl>,
## #   GSM922267_hESCpassage10_Cell18_0 <dbl>,
## #   GSM922268_hESCpassage10_Cell19_0 <dbl>,
## #   GSM922269_hESCpassage10_Cell20_0 <dbl>,
## #   GSM922270_hESCpassage10_Cell21_0 <dbl>,
## #   GSM922271_hESCpassage10_Cell22_0 <dbl>,
## #   GSM922272_hESCpassage10_Cell23_0 <dbl>,
## #   GSM922273_hESCpassage10_Cell24_0 <dbl>,
## #   GSM922274_hESCpassage10_Cell25_0 <dbl>,
## #   GSM922275_hESCpassage10_Cell26_0 <dbl>

3 Detailed usage

3.1 Loading data

scFeatureFilter uses an expression matrix (either as a matrix, a data.frame, a tibble or a SingleCellExperiment R object) as input. Support of Bioconductor’s expression set objects will arrive shortly. Note that features should be in rows, while cells should be in columns.

We recommend providing normalized expression values where at least library size is accounted for, such as TPM or FPKM, rather than raw counts.

An example dataset is supplied with the package:

scData_hESC
## # A tibble: 60,468 x 33
##    tracking_id GSM922224_hESCp… GSM922225_hESCp… GSM922226_hESCp…
##    <chr>                  <dbl>            <dbl>            <dbl>
##  1 ENSG000000…           55.3             36.0             53.7  
##  2 ENSG000000…            0                0                0.129
##  3 ENSG000000…           54.0             55.5             41.9  
##  4 ENSG000000…            0.924            0.222            0.646
##  5 ENSG000000…            7.08             1.35             5.42 
##  6 ENSG000000…            0                0                0    
##  7 ENSG000000…            0                0                0    
##  8 ENSG000000…            7.88            29.2              2.15 
##  9 ENSG000000…           21.2              6.64             3.38 
## 10 ENSG000000…            3.59             3.47             3.64 
## # … with 60,458 more rows, and 29 more variables:
## #   GSM922227_hESCpassage0_Cell7_0 <dbl>, GSM922228_hESCpassage0_Cell8_0 <dbl>,
## #   GSM922230_hESCpassage0_Cell10_0 <dbl>,
## #   GSM922250_hESCpassage10_Cell1_0 <dbl>,
## #   GSM922251_hESCpassage10_Cell2_0 <dbl>,
## #   GSM922252_hESCpassage10_Cell3_0 <dbl>,
## #   GSM922253_hESCpassage10_Cell4_0 <dbl>,
## #   GSM922254_hESCpassage10_Cell5_0 <dbl>,
## #   GSM922255_hESCpassage10_Cell6_0 <dbl>,
## #   GSM922256_hESCpassage10_Cell7_0 <dbl>,
## #   GSM922257_hESCpassage10_Cell8_0 <dbl>,
## #   GSM922258_hESCpassage10_Cell9_0 <dbl>,
## #   GSM922259_hESCpassage10_Cell10_0 <dbl>,
## #   GSM922260_hESCpassage10_Cell11_0 <dbl>,
## #   GSM922261_hESCpassage10_Cell12_0 <dbl>,
## #   GSM922262_hESCpassage10_Cell13_0 <dbl>,
## #   GSM922263_hESCpassage10_Cell14_0 <dbl>,
## #   GSM922264_hESCpassage10_Cell15_0 <dbl>,
## #   GSM922265_hESCpassage10_Cell16_0 <dbl>,
## #   GSM922266_hESCpassage10_Cell17_0 <dbl>,
## #   GSM922267_hESCpassage10_Cell18_0 <dbl>,
## #   GSM922268_hESCpassage10_Cell19_0 <dbl>,
## #   GSM922269_hESCpassage10_Cell20_0 <dbl>,
## #   GSM922270_hESCpassage10_Cell21_0 <dbl>,
## #   GSM922271_hESCpassage10_Cell22_0 <dbl>,
## #   GSM922272_hESCpassage10_Cell23_0 <dbl>,
## #   GSM922273_hESCpassage10_Cell24_0 <dbl>,
## #   GSM922274_hESCpassage10_Cell25_0 <dbl>,
## #   GSM922275_hESCpassage10_Cell26_0 <dbl>

The example dataset was processed by Mantsoki et al. (2016), data from human Embryonic Stem cells was generated by Yan et al. (2013).

3.2 Binning data

Firstly, a mean expression and a coefficient of variation column is added to the expression matrix provided using the calculate_cvs function:

calculate_cvs(scData_hESC)
## # A tibble: 17,929 x 36
##    geneName  mean    sd    cv GSM922224_hESCp… GSM922225_hESCp… GSM922226_hESCp…
##    <chr>    <dbl> <dbl> <dbl>            <dbl>            <dbl>            <dbl>
##  1 ENSG000… 70.0  73.3  1.05            55.3             36.0             53.7  
##  2 ENSG000…  1.00  2.53 2.53             0                0                0.129
##  3 ENSG000… 81.0  71.5  0.883           54.0             55.5             41.9  
##  4 ENSG000…  1.59  1.80 1.13             0.924            0.222            0.646
##  5 ENSG000…  9.53  8.44 0.886            7.08             1.35             5.42 
##  6 ENSG000… 21.7  17.0  0.781            7.88            29.2              2.15 
##  7 ENSG000… 18.0  15.6  0.868           21.2              6.64             3.38 
##  8 ENSG000…  4.11  6.64 1.62             3.59             3.47             3.64 
##  9 ENSG000…  7.22 17.2  2.38             0                0                0    
## 10 ENSG000…  4.94  7.40 1.50             1.36             3.34             0    
## # … with 17,919 more rows, and 29 more variables:
## #   GSM922227_hESCpassage0_Cell7_0 <dbl>, GSM922228_hESCpassage0_Cell8_0 <dbl>,
## #   GSM922230_hESCpassage0_Cell10_0 <dbl>,
## #   GSM922250_hESCpassage10_Cell1_0 <dbl>,
## #   GSM922251_hESCpassage10_Cell2_0 <dbl>,
## #   GSM922252_hESCpassage10_Cell3_0 <dbl>,
## #   GSM922253_hESCpassage10_Cell4_0 <dbl>,
## #   GSM922254_hESCpassage10_Cell5_0 <dbl>,
## #   GSM922255_hESCpassage10_Cell6_0 <dbl>,
## #   GSM922256_hESCpassage10_Cell7_0 <dbl>,
## #   GSM922257_hESCpassage10_Cell8_0 <dbl>,
## #   GSM922258_hESCpassage10_Cell9_0 <dbl>,
## #   GSM922259_hESCpassage10_Cell10_0 <dbl>,
## #   GSM922260_hESCpassage10_Cell11_0 <dbl>,
## #   GSM922261_hESCpassage10_Cell12_0 <dbl>,
## #   GSM922262_hESCpassage10_Cell13_0 <dbl>,
## #   GSM922263_hESCpassage10_Cell14_0 <dbl>,
## #   GSM922264_hESCpassage10_Cell15_0 <dbl>,
## #   GSM922265_hESCpassage10_Cell16_0 <dbl>,
## #   GSM922266_hESCpassage10_Cell17_0 <dbl>,
## #   GSM922267_hESCpassage10_Cell18_0 <dbl>,
## #   GSM922268_hESCpassage10_Cell19_0 <dbl>,
## #   GSM922269_hESCpassage10_Cell20_0 <dbl>,
## #   GSM922270_hESCpassage10_Cell21_0 <dbl>,
## #   GSM922271_hESCpassage10_Cell22_0 <dbl>,
## #   GSM922272_hESCpassage10_Cell23_0 <dbl>,
## #   GSM922273_hESCpassage10_Cell24_0 <dbl>,
## #   GSM922274_hESCpassage10_Cell25_0 <dbl>,
## #   GSM922275_hESCpassage10_Cell26_0 <dbl>

We can explore the mean - variance relationship of the dataset with the plot_mean_variance function:

library(magrittr) # to use the pipe %>%

calculate_cvs(scData_hESC) %>%
    plot_mean_variance(colourByBin = FALSE)

The top window (or reference bin) is then defined with the define_top_genes function, while the rest of bins are created with the bin_scdata function:

scData_hESC %>%
    calculate_cvs %>%
    define_top_genes(window_size = 100) %>%
    bin_scdata(window_size = 1000)
## Mean expression of last top gene: 2114.40221875
## Number of windows: 18
## # A tibble: 17,929 x 37
##    geneName   mean     sd    cv   bin GSM922224_hESCp… GSM922225_hESCp…
##    <chr>     <dbl>  <dbl> <dbl> <dbl>            <dbl>            <dbl>
##  1 ENSG000… 58450. 43460. 0.744     1           40272.           68182 
##  2 ENSG000… 30454. 32771. 1.08      1           30747.           32311.
##  3 ENSG000… 23484. 26032. 1.11      1           11057.           13508.
##  4 ENSG000… 21989. 33334. 1.52      1           17100.           10635.
##  5 ENSG000… 18749. 18501. 0.987     1           19985.           41373.
##  6 ENSG000… 17557. 12660. 0.721     1           11290.           18370.
##  7 ENSG000… 16748. 16228. 0.969     1            7005.           12713.
##  8 ENSG000… 16013. 20765. 1.30      1           17239.           23962.
##  9 ENSG000… 13890. 13973. 1.01      1           14995.           30880.
## 10 ENSG000… 11299.  7602. 0.673     1            9676.            9654.
## # … with 17,919 more rows, and 30 more variables:
## #   GSM922226_hESCpassage0_Cell6_0 <dbl>, GSM922227_hESCpassage0_Cell7_0 <dbl>,
## #   GSM922228_hESCpassage0_Cell8_0 <dbl>,
## #   GSM922230_hESCpassage0_Cell10_0 <dbl>,
## #   GSM922250_hESCpassage10_Cell1_0 <dbl>,
## #   GSM922251_hESCpassage10_Cell2_0 <dbl>,
## #   GSM922252_hESCpassage10_Cell3_0 <dbl>,
## #   GSM922253_hESCpassage10_Cell4_0 <dbl>,
## #   GSM922254_hESCpassage10_Cell5_0 <dbl>,
## #   GSM922255_hESCpassage10_Cell6_0 <dbl>,
## #   GSM922256_hESCpassage10_Cell7_0 <dbl>,
## #   GSM922257_hESCpassage10_Cell8_0 <dbl>,
## #   GSM922258_hESCpassage10_Cell9_0 <dbl>,
## #   GSM922259_hESCpassage10_Cell10_0 <dbl>,
## #   GSM922260_hESCpassage10_Cell11_0 <dbl>,
## #   GSM922261_hESCpassage10_Cell12_0 <dbl>,
## #   GSM922262_hESCpassage10_Cell13_0 <dbl>,
## #   GSM922263_hESCpassage10_Cell14_0 <dbl>,
## #   GSM922264_hESCpassage10_Cell15_0 <dbl>,
## #   GSM922265_hESCpassage10_Cell16_0 <dbl>,
## #   GSM922266_hESCpassage10_Cell17_0 <dbl>,
## #   GSM922267_hESCpassage10_Cell18_0 <dbl>,
## #   GSM922268_hESCpassage10_Cell19_0 <dbl>,
## #   GSM922269_hESCpassage10_Cell20_0 <dbl>,
## #   GSM922270_hESCpassage10_Cell21_0 <dbl>,
## #   GSM922271_hESCpassage10_Cell22_0 <dbl>,
## #   GSM922272_hESCpassage10_Cell23_0 <dbl>,
## #   GSM922273_hESCpassage10_Cell24_0 <dbl>,
## #   GSM922274_hESCpassage10_Cell25_0 <dbl>,
## #   GSM922275_hESCpassage10_Cell26_0 <dbl>

Then, we can plot the resulting bin definition using the same plot_mean_variance function, setting the colourByBin argument to TRUE:

myPlot <- scData_hESC %>%
    calculate_cvs %>%
    define_top_genes(window_size = 100) %>%
    bin_scdata(window_size = 1000) %>%
    plot_mean_variance(colourByBin = TRUE, density_color = "blue")
## Mean expression of last top gene: 2114.40221875
## Number of windows: 18

myPlot