# Chapter 3 Feature selection

## 3.1 Motivation

We often use scRNA-seq data in exploratory analyses to characterize heterogeneity across cells. Procedures like clustering and dimensionality reduction compare cells based on their gene expression profiles, which involves aggregating per-gene differences into a single (dis)similarity metric between a pair of cells. The choice of genes to use in this calculation has a major impact on the behavior of the metric and the performance of downstream methods. We want to select genes that contain useful information about the biology of the system while removing genes that contain random noise. This aims to preserve interesting biological structure without the variance that obscures that structure, and to reduce the size of the data to improve computational efficiency of later steps.

The simplest approach to feature selection is to select the most variable genes based on their expression across the population. This assumes that genuine biological differences will manifest as increased variation in the affected genes, compared to other genes that are only affected by technical noise or a baseline level of “uninteresting” biological variation (e.g., from transcriptional bursting). Several methods are available to quantify the variation per gene and to select an appropriate set of highly variable genes (HVGs). We will discuss these below using the 10X PBMC dataset for demonstration:

## class: SingleCellExperiment
## dim: 33694 3985
## assays(2): counts logcounts
## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
## rowData names(2): ID Symbol
## colnames(3985): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
##   TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(3): Sample Barcode sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

As well as the 416B dataset:

## class: SingleCellExperiment
## dim: 46604 185
## assays(2): counts logcounts
## rownames(46604): 4933401J01Rik Gm26206 ... CAAA01147332.1
##   CBFB-MYH11-mcherry
## rowData names(4): Length ENSEMBL SYMBOL SEQNAME
## colnames(185): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1
##   SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ...
##   SLX-11312.N712_S507.H5H5YBBXX.s_8.r_1
##   SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1
## colData names(10): Source Name cell line ... block sizeFactor
## reducedDimNames(0):
## mainExpName: endogenous
## altExpNames(2): ERCC SIRV

## 3.2 Quantifying per-gene variation

The simplest approach to quantifying per-gene variation is to compute the variance of the log-normalized expression values (i.e., “log-counts” ) for each gene across all cells (A. T. L. Lun, McCarthy, and Marioni 2016). The advantage of this approach is that the feature selection is based on the same log-values that are used for later downstream steps. In particular, genes with the largest variances in log-values will contribute most to the Euclidean distances between cells during procedures like clustering and dimensionality reduction. By using log-values here, we ensure that our quantitative definition of heterogeneity is consistent throughout the entire analysis.

Calculation of the per-gene variance is simple but feature selection requires modelling of the mean-variance relationship. The log-transformation is not a variance stabilizing transformation in most cases, which means that the total variance of a gene is driven more by its abundance than its underlying biological heterogeneity. To account for this effect, we use the modelGeneVar() function to fit a trend to the variance with respect to abundance across all genes (Figure 3.1).

At any given abundance, we assume that the variation in expression for most genes is driven by uninteresting processes like sampling noise. Under this assumption, the fitted value of the trend at any given gene’s abundance represents an estimate of its uninteresting variation, which we call the technical component. We then define the biological component for each gene as the difference between its total variance and the technical component. This biological component represents the “interesting” variation for each gene and can be used as the metric for HVG selection.

## DataFrame with 33694 rows and 6 columns
##              mean     total      tech       bio      p.value          FDR
##         <numeric> <numeric> <numeric> <numeric>    <numeric>    <numeric>
## LYZ       1.95605   5.05854  0.835343   4.22320 1.10535e-270 2.17411e-266
## S100A9    1.93416   4.53551  0.835439   3.70007 2.71037e-208 7.61576e-205
## S100A8    1.69961   4.41084  0.824342   3.58650 4.31572e-201 9.43177e-198
## HLA-DRA   2.09785   3.75174  0.831239   2.92050 5.93941e-132 4.86760e-129
## CD74      2.90176   3.36879  0.793188   2.57560 4.83931e-113 2.50485e-110
## ...           ...       ...       ...       ...          ...          ...
## TMSB4X    6.08142  0.441718  0.679215 -0.237497     0.992447            1
## PTMA      3.82978  0.486454  0.731275 -0.244821     0.990002            1
## HLA-B     4.50032  0.486130  0.739577 -0.253447     0.991376            1
## EIF1      3.23488  0.482869  0.768946 -0.286078     0.995135            1
## B2M       5.95196  0.314948  0.654228 -0.339280     0.999843            1

(Careful readers will notice that some genes have negative biological components, which have no obvious interpretation and can be ignored in most applications. They are inevitable when fitting a trend to the per-gene variances as approximately half of the genes will lie below the trend.)

Strictly speaking, the interpretation of the fitted trend as the technical component assumes that the expression profiles of most genes are dominated by random technical noise. In practice, all expressed genes will exhibit some non-zero level of biological variability due to events like transcriptional bursting. Thus, it would be more appropriate to consider these estimates as technical noise plus “uninteresting” biological variation, under the assumption that most genes do not participate in the processes driving interesting heterogeneity across the population.

## 3.3 Quantifying technical noise

The assumption in Section 3.2 may be problematic in rare scenarios where many genes at a particular abundance are affected by a biological process. For example, strong upregulation of cell type-specific genes may result in an enrichment of HVGs at high abundances. This would inflate the fitted trend in that abundance interval and compromise the detection of the relevant genes. We can avoid this problem by fitting a mean-dependent trend to the variance of the spike-in transcripts (Figure 3.2), if they are available. The premise here is that spike-ins should not be affected by biological variation, so the fitted value of the spike-in trend should represent a better estimate of the technical component for each gene.

## DataFrame with 46604 rows and 6 columns
##               mean     total      tech       bio      p.value          FDR
##          <numeric> <numeric> <numeric> <numeric>    <numeric>    <numeric>
## Lyz2       6.61097   13.8497   1.57131   12.2784 1.48993e-186 1.54156e-183
## Ccl9       6.67846   13.1869   1.50035   11.6866 2.21855e-185 2.19979e-182
## Top2a      5.81024   14.1787   2.54776   11.6310  3.80016e-65  1.13040e-62
## Cd200r3    4.83180   15.5613   4.22984   11.3314  9.46221e-24  6.08574e-22
## Ccnb2      5.97776   13.1393   2.30177   10.8375  3.68706e-69  1.20193e-66
## ...            ...       ...       ...       ...          ...          ...
## Rpl5-ps2   3.60625  0.612623   6.32853  -5.71590     0.999616     0.999726
## Gm11942    3.38768  0.798570   6.51473  -5.71616     0.999459     0.999726
## Gm12816    2.91276  0.838670   6.57364  -5.73497     0.999422     0.999726
## Gm13623    2.72844  0.708071   6.45448  -5.74641     0.999544     0.999726
## Rps12l1    3.15420  0.746615   6.59332  -5.84670     0.999522     0.999726

In the absence of spike-in data, one can attempt to create a trend by making some distributional assumptions about the noise. For example, UMI counts typically exhibit near-Poisson variation if we only consider technical noise from library preparation and sequencing. This can be used to construct a mean-variance trend in the log-counts (Figure 3.3) with the modelGeneVarByPoisson() function. Note the increased residuals of the high-abundance genes, which can be interpreted as the amount of biological variation that was assumed to be “uninteresting” when fitting the gene-based trend in Figure 3.1.

## DataFrame with 6 rows and 6 columns
##              mean     total      tech       bio   p.value       FDR
##         <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## LYZ       1.95605   5.05854  0.631190   4.42735         0         0
## S100A9    1.93416   4.53551  0.635102   3.90040         0         0
## S100A8    1.69961   4.41084  0.671491   3.73935         0         0
## HLA-DRA   2.09785   3.75174  0.604448   3.14730         0         0
## CD74      2.90176   3.36879  0.444928   2.92386         0         0
## CST3      1.47546   2.95646  0.691386   2.26507         0         0

Interestingly, trends based purely on technical noise tend to yield large biological components for highly-expressed genes. This often includes so-called “house-keeping” genes coding for essential cellular components such as ribosomal proteins, which are considered uninteresting for characterizing cellular heterogeneity. These observations suggest that a more accurate noise model does not necessarily yield a better ranking of HVGs, though one should keep an open mind - house-keeping genes are regularly DE in a variety of conditions (Glare et al. 2002; Nazari, Parham, and Maleki 2015; Guimaraes and Zavolan 2016), and the fact that they have large biological components indicates that there is strong variation across cells that may not be completely irrelevant.

## 3.4 Handling batch effects

Data containing multiple batches will often exhibit batch effects - see Multi-sample Chapter 1 for more details. We are usually not interested in HVGs that are driven by batch effects; instead, we want to focus on genes that are highly variable within each batch. This is naturally achieved by performing trend fitting and variance decomposition separately for each batch. We demonstrate this approach by treating each plate (block) in the 416B dataset as a different batch, using the modelGeneVarWithSpikes() function. (The same argument is available in all other variance-modelling functions.)

## DataFrame with 6 rows and 6 columns
##              mean     total      tech       bio      p.value          FDR
##         <numeric> <numeric> <numeric> <numeric>    <numeric>    <numeric>
## Lyz2      6.61235   13.8619   1.58416   12.2777  0.00000e+00  0.00000e+00
## Ccl9      6.67841   13.2599   1.44553   11.8143  0.00000e+00  0.00000e+00
## Top2a     5.81275   14.0192   2.74571   11.2734 3.89855e-137 8.43398e-135
## Cd200r3   4.83305   15.5909   4.31892   11.2719  1.17783e-54  7.00722e-53
## Ccnb2     5.97999   13.0256   2.46647   10.5591 1.20380e-151 2.98405e-149
## Hbb-bt    4.91683   14.6539   4.12156   10.5323  2.52639e-49  1.34197e-47

The use of a batch-specific trend fit is useful as it accommodates differences in the mean-variance trends between batches. This is especially important if batches exhibit systematic technical differences, e.g., differences in coverage or in the amount of spike-in RNA added. In this case, there are only minor differences between the trends in Figure 3.4, which indicates that the experiment was tightly replicated across plates. The analysis of each plate yields estimates of the biological and technical components for each gene, which are averaged across plates to take advantage of information from multiple batches.

Alternatively, we might consider using a linear model to account for batch effects and other unwanted factors of variation. This is more flexible as it can handle multiple factors and continuous covariates, though it is less accurate than block= in the special case of a multi-batch design. See Advanced Section 3.3 for more details.

As an aside, the wave-like shape observed above is typical of the mean-variance trend for log-expression values. (The same wave is present but much less pronounced for UMI data.) A linear increase in the variance is observed as the mean increases from zero, as larger variances are obviously possible when the counts are not all equal to zero. In contrast, the relative contribution of sampling noise decreases at high abundances, resulting in a downward trend. The peak represents the point at which these two competing effects cancel each other out.

## 3.5 Selecting highly variable genes

Once we have quantified the per-gene variation, the next step is to select the subset of HVGs to use in downstream analyses. A larger subset will reduce the risk of discarding interesting biological signal by retaining more potentially relevant genes, at the cost of increasing noise from irrelevant genes that might obscure said signal. It is difficult to determine the optimal trade-off for any given application as noise in one context may be useful signal in another. For example, heterogeneity in T cell activation responses is an interesting phenomena (Richard et al. 2018) but may be irrelevant noise in studies that only care about distinguishing the major immunophenotypes.

The most obvious selection strategy is to take the top $$n$$ genes with the largest values for the relevant variance metric. The main advantage of this approach is that the user can directly control the number of genes retained, which ensures that the computational complexity of downstream calculations is easily predicted. For modelGeneVar() and modelGeneVarWithSpikes(), we would select the genes with the largest biological components. This is conveniently done for us via getTopHVgs(), as shown below with $$n=1000$$.

##  chr [1:1000] "LYZ" "S100A9" "S100A8" "HLA-DRA" "CD74" "CST3" "TYROBP" ...

The choice of $$n$$ also has a fairly straightforward biological interpretation. Recall our trend-fitting assumption that most genes do not exhibit biological heterogeneity; this implies that they are not differentially expressed between cell types or states in our population. If we quantify this assumption into a statement that, e.g., no more than 5% of genes are differentially expressed, we can naturally set $$n$$ to 5% of the number of genes. In practice, we usually do not know the proportion of DE genes beforehand so this interpretation just exchanges one unknown for another. Nonetheless, it is still useful as it implies that we should lower $$n$$ for less heterogeneous datasets, retaining most of the biological signal without unnecessary noise from irrelevant genes. Conversely, more heterogeneous datasets should use larger values of $$n$$ to preserve secondary factors of variation beyond those driving the most obvious HVGs.

The main disadvantage of this approach that it turns HVG selection into a competition between genes, whereby a subset of very highly variable genes can push other informative genes out of the top set. This can be problematic for analyses of highly heterogeneous populations if the loss of important markers prevents the resolution of certain subpopulations. In the most extreme example, consider a situation where a single subpopulation is very different from the others. In such cases, the top set will be dominated by differentially expressed genes involving that distinct subpopulation, compromising resolution of heterogeneity between the other populations. (This can be recovered with a nested analysis, as discussed in Section 5.5, but we would prefer to avoid the problem in the first place.)

Another potential concern with this approach is the fact that the choice of $$n$$ is fairly arbitrary, with any value from 500 to 5000 considered “reasonable”. We have chosen $$n=1000$$ in the code above though there is no particular a priori reason for doing so. Our recommendation is to simply pick an arbitrary $$n$$ and proceed with the rest of the analysis, with the intention of testing other choices later, rather than spending much time worrying about obtaining the “optimal” value. Alternatively, we may pick one of the other selection strategies discussed in Advanced Section 3.5.

## 3.6 Putting it all together

The code chunk below will select the top 10% of genes with the highest biological components.

##  chr [1:1274] "LYZ" "S100A9" "S100A8" "HLA-DRA" "CD74" "CST3" "TYROBP" ...

We then have several options to enforce our HVG selection on the rest of the analysis.

## Session Info

R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_GB              LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] scater_1.22.0               ggplot2_3.3.5
[3] scran_1.22.0                scuttle_1.4.0
[5] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0
[7] Biobase_2.54.0              GenomicRanges_1.46.0
[9] GenomeInfoDb_1.30.0         IRanges_2.28.0
[11] S4Vectors_0.32.0            BiocGenerics_0.40.0
[13] MatrixGenerics_1.6.0        matrixStats_0.61.0
[15] BiocStyle_2.22.0            rebook_1.4.0

loaded via a namespace (and not attached):
[1] bitops_1.0-7              filelock_1.0.2
[3] tools_4.1.1               bslib_0.3.1
[5] utf8_1.2.2                R6_2.5.1
[7] irlba_2.3.3               vipor_0.4.5
[9] DBI_1.1.1                 colorspace_2.0-2
[11] withr_2.4.2               gridExtra_2.3
[13] tidyselect_1.1.1          compiler_4.1.1
[15] graph_1.72.0              BiocNeighbors_1.12.0
[17] DelayedArray_0.20.0       bookdown_0.24
[19] sass_0.4.0                scales_1.1.1
[21] rappdirs_0.3.3            stringr_1.4.0
[23] digest_0.6.28             rmarkdown_2.11
[25] XVector_0.34.0            pkgconfig_2.0.3
[27] htmltools_0.5.2           sparseMatrixStats_1.6.0
[29] fastmap_1.1.0             limma_3.50.0
[31] highr_0.9                 rlang_0.4.12
[33] DelayedMatrixStats_1.16.0 jquerylib_0.1.4
[35] generics_0.1.1            jsonlite_1.7.2
[37] BiocParallel_1.28.0       dplyr_1.0.7
[39] RCurl_1.98-1.5            magrittr_2.0.1
[41] BiocSingular_1.10.0       GenomeInfoDbData_1.2.7
[43] Matrix_1.3-4              ggbeeswarm_0.6.0
[45] Rcpp_1.0.7                munsell_0.5.0
[47] fansi_0.5.0               viridis_0.6.2
[49] lifecycle_1.0.1           stringi_1.7.5
[51] yaml_2.2.1                edgeR_3.36.0
[53] zlibbioc_1.40.0           grid_4.1.1
[55] ggrepel_0.9.1             parallel_4.1.1
[57] dqrng_0.3.0               crayon_1.4.1
[59] dir.expiry_1.2.0          lattice_0.20-45
[61] beachmat_2.10.0           locfit_1.5-9.4
[63] CodeDepends_0.6.5         metapod_1.2.0
[65] knitr_1.36                pillar_1.6.4
[67] igraph_1.2.7              codetools_0.2-18
[69] ScaledMatrix_1.2.0        XML_3.99-0.8
[71] glue_1.4.2                evaluate_0.14
[73] BiocManager_1.30.16       vctrs_0.3.8
[75] purrr_0.3.4               gtable_0.3.0
[77] assertthat_0.2.1          xfun_0.27
[79] rsvd_1.0.5                viridisLite_0.4.0
[81] tibble_3.1.5              beeswarm_0.4.0
[83] cluster_2.1.2             bluster_1.4.0
[85] statmod_1.4.36            ellipsis_0.3.2

### References

Glare, E. M., M. Divjak, M. J. Bailey, and E. H. Walters. 2002. “beta-Actin and GAPDH housekeeping gene expression in asthmatic airways is variable and not suitable for normalising mRNA levels.” Thorax 57 (9): 765–70.

Guimaraes, J. C., and M. Zavolan. 2016. “Patterns of ribosomal protein expression specify normal and malignant human cells.” Genome Biol. 17 (1): 236.

Lun, A. T. L., D. J. McCarthy, and J. C. Marioni. 2016. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell RNA-seq Data.” F1000Res. 5 (August).

Nazari, F., A. Parham, and A. F. Maleki. 2015. “GAPDH, -actin and -microglobulin, as three common reference genes, are not reliable for gene expression studies in equine adipose- and marrow-derived mesenchymal stem cells.” J Anim Sci Technol 57: 18.

Richard, A. C., A. T. L. Lun, W. W. Y. Lau, B. Gottgens, J. C. Marioni, and G. M. Griffiths. 2018. “T cell cytolytic capacity is independent of initial stimulation strength.” Nat. Immunol. 19 (8): 849–58.