**smartid**

`smartid` is a package that enables automated selection of group-specific signature genes, especially for rare populations. It generates lists of specific signature genes for labeled data using modified **Term Frequency-Inverse Document Frequency** (TF-IDF) methods and **expectation maximization** (EM). It can also be used as a new gene-set scoring method or data transformation method for unlabeled data. Multiple visualization functions are implemented in this package.

The `smartid` R package can be installed from Bioconductor or GitHub.

The most up-to-date version of `smartid` is hosted on GitHub (repository `DavisLaboratory/smartid`) and can be installed using the `devtools::install_github()` function provided by devtools. The Bioconductor release can be installed as follows:

```
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if (!requireNamespace("smartid", quietly = TRUE)) {
  BiocManager::install("smartid")
}
```

To give a quick start guide to `smartid`, here we use the package *splatter* to simulate a scRNA-seq dataset of 1000 genes × 3000 cells. The data consist of 4 groups; each has 2% DEGs except Group 4, which has no DEGs and serves as a negative control group.

```
library(smartid)
library(SummarizedExperiment)
library(splatter)
library(ggplot2)
library(scater)
## set seed for reproducibility
set.seed(123)
sim_params <- newSplatParams(
  nGenes = 1000,
  batchCells = 3000,
  group.prob = seq(0.1, 0.4, length.out = 4),
  de.prob = c(0.02, 0.02, 0.02, 0),
  # de.downProb = 0,
  de.facLoc = 0.5,
  de.facScale = 0.4
)
data_sim <- splatSimulate(sim_params, method = "groups")
## get up markers based on fold change
fc <- 1
cols <- paste0("DEFacGroup", seq_along(unique(data_sim$Group)))
defac <- as.data.frame(rowData(data_sim)[, cols])
up <- lapply(cols, \(id)
  dplyr::filter(defac, if_all(-!!sym(id), \(x) !!sym(id) / x > fc)) |>
    rownames())
slot(data_sim, "metadata")$up_markers <- setNames(up, cols)
slot(data_sim, "metadata")$up_markers
#> $DEFacGroup1
#> [1] "Gene31" "Gene42" "Gene172" "Gene225" "Gene308" "Gene312" "Gene352"
#> [8] "Gene391" "Gene425" "Gene436" "Gene547" "Gene650" "Gene696" "Gene893"
#> [15] "Gene904" "Gene913"
#>
#> $DEFacGroup2
#> [1] "Gene37" "Gene76" "Gene141" "Gene332" "Gene419" "Gene628" "Gene682"
#> [8] "Gene713" "Gene778" "Gene818"
#>
#> $DEFacGroup3
#> [1] "Gene26" "Gene28" "Gene357" "Gene405" "Gene462" "Gene518" "Gene833"
#>
#> $DEFacGroup4
#> character(0)
data_sim
#> class: SingleCellExperiment
#> dim: 1000 3000
#> metadata(2): Params up_markers
#> assays(6): BatchCellMeans BaseCellMeans ... TrueCounts counts
#> rownames(1000): Gene1 Gene2 ... Gene999 Gene1000
#> rowData names(8): Gene BaseGeneMean ... DEFacGroup3 DEFacGroup4
#> colnames(3000): Cell1 Cell2 ... Cell2999 Cell3000
#> colData names(4): Cell Batch Group ExpLibSize
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
```
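The tidy-evaluation expression used above to define the true up markers can be hard to read. The following base R sketch applies what appears to be the same rule to a small hypothetical table of DE factors (the objects `defac_toy`, `fc_toy` and `up_toy` and all their values are made up for illustration): a gene counts as an up marker of a group only if the ratio of its DE factor in that group to its DE factor in every other group is above the threshold.

```r
## hypothetical DE factors for 4 genes in 3 groups (not from the simulation)
defac_toy <- data.frame(
  DEFacGroup1 = c(2.0, 1.0, 1.0, 1.5),
  DEFacGroup2 = c(1.0, 0.5, 1.0, 1.5),
  DEFacGroup3 = c(1.0, 1.0, 1.0, 1.0),
  row.names = paste0("Gene", 1:4)
)
fc_toy <- 1

up_toy <- lapply(names(defac_toy), function(id) {
  ## DE factors of the same genes in all other groups
  others <- as.matrix(defac_toy[, setdiff(names(defac_toy), id), drop = FALSE])
  ## keep genes whose ratio to every other group is above the threshold
  keep <- rowSums(defac_toy[[id]] / others > fc_toy) == ncol(others)
  rownames(defac_toy)[keep]
})
names(up_toy) <- names(defac_toy)
up_toy
```

In this toy table only `Gene1` qualifies, as a marker of the first group. `Gene4` is elevated in two groups at once, so it is specific to neither.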

`smartid` can be used to accurately identify specific marker genes on labeled data. By adapting and modifying the TF-IDF approach, `smartid` shows robust power in finding marker genes, especially for rare populations, where many methods fail.

**marker identification of smartid includes 3 key steps:**

step 1. score samples

step 2. scale and transform scores

step 3. identify markers using expectation maximization (EM)

The first step is to score all samples/cells using a specified approach. The score can be composed of 3 terms: TF (term/feature frequency), IDF (inverse document/cell frequency) and IAE (inverse average expression of features). Each term has several available variants to suit labeled or unlabeled data. Users can call `idf_iae_methods()` to see the available methods for the IDF/IAE terms. More details on each term can be found in the help page of each function, e.g. `?idf`.

```
## show available methods
idf_iae_methods()
#>          unlabel HDBSCAN                label IGM              unlabel max
#>                    "hdb"                    "igm"                      "m"
#>                     null        label probability label relative frequency
#>                   "null"                   "prob"                     "rf"
#>               unlabel SD         unlabel standard
#>                     "sd"               "standard"
```

The basic versions of TF, IDF and IAE are defined as:

\(\mathbf{TF_{i,j}}=\frac{N_{i,j}}{\sum_{k}{N_{k,j}}},\) \(\mathbf{IDF_i} = \log\left(1+\frac{n}{n_i+1}\right),\) \(\mathbf{IAE_i} = \log\left(1+\frac{n}{\sum_{j=1}^{n}\hat N_{i,j}+1}\right)\)

Where \(N_{i,j}\) is the count of feature \(i\) in cell \(j\), so the TF denominator \(\sum_{k}{N_{k,j}}\) is the total count of cell \(j\) over all features; \(\hat N_{i,j}\) is \(\max(0,N_{i,j}-\mathrm{threshold})\); \(n\) is the total number of documents (cells); \(n_i\) is \(\sum_{j = 1}^{n} \mathrm{sign}(N_{i,j} > \mathrm{threshold})\), i.e. the number of cells in which feature \(i\) exceeds the threshold.
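As an illustration, the three basic terms can be computed by hand in base R. This is a minimal sketch on a hypothetical toy count matrix, not the package implementation; it assumes that TF divides each cell's counts by that cell's total (the CPM-like normalization), and all object names ending in `_toy` are made up for the example.

```r
## toy count matrix: 3 genes (rows) x 4 cells (columns), values are made up
counts_toy <- matrix(
  c(4, 0, 2, 0,
    1, 1, 1, 1,
    0, 3, 0, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("Gene", 1:3), paste0("Cell", 1:4))
)
threshold <- 0
n <- ncol(counts_toy)  # number of cells (documents)

## TF: counts divided by the total count of each cell (column)
tf_toy <- sweep(counts_toy, 2, colSums(counts_toy), "/")

## IDF: log(1 + n / (n_i + 1)), n_i = cells where gene i exceeds the threshold
n_i <- rowSums(counts_toy > threshold)
idf_toy <- log1p(n / (n_i + 1))

## IAE: log(1 + n / (sum_j max(0, N_ij - threshold) + 1))
n_hat <- pmax(counts_toy - threshold, 0)
iae_toy <- log1p(n / (rowSums(n_hat) + 1))

## one possible combination: each row of TF is weighted by its IDF and IAE
score_toy <- tf_toy * idf_toy * iae_toy
round(score_toy, 3)
```

`Gene2`, which is detected in every cell, receives the smallest IDF of the three genes, so ubiquitous features are down-weighted relative to features detected in only a few cells.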

Here for labeled data, we can choose logTF * IDF_prob * IAE_prob for marker identification: \[\mathbf{score}=\log \mathbf{TF}*\mathbf{IDF}_{prob}*\mathbf{IAE}_{prob}\]

The probability version of IDF is defined as: \[\mathbf{IDF_{i,j}} = \log\left(1+\frac{\frac{n_{i,j\in D}}{n_{j\in D}}}{\max\left(\frac{n_{i,j\in \hat D}}{n_{j\in \hat D}}\right)+ e^{-8}}*\frac{n_{i,j\in D}}{n_{j\in D}}\right)\]

And the probability version of IAE is defined as: \[\mathbf{IAE_{i,j}} = \log\left(1+\frac{\mathrm{mean}(\hat N_{i,j\in D})}{\max\left(\mathrm{mean}(\hat N_{i,j\in \hat D})\right)+ e^{-8}}*\mathrm{mean}(\hat N_{i,j\in D})\right)\]

Where \(D\) is the category of cell \(j\); \(\hat D\) is the category other than \(D\).
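The probability versions can be sketched in the same way. The code below is an illustrative reading of the two formulas rather than the package code: it assumes the maximum is taken over the other groups one by one, takes the constant \(e^{-8}\) literally from the formula, and uses a hypothetical helper `prob_term()` and made-up toy data.

```r
## shared form of both terms: log(1 + in / (max_out + eps) * in)
## stat: features x groups matrix holding one statistic per group
prob_term <- function(stat, eps = exp(-8)) {
  out <- vapply(seq_len(ncol(stat)), function(g) {
    in_g <- stat[, g]                                   # value in group D
    max_out <- apply(stat[, -g, drop = FALSE], 1, max)  # largest value among other groups
    log1p(in_g / (max_out + eps) * in_g)
  }, numeric(nrow(stat)))
  dimnames(out) <- dimnames(stat)
  out
}

## toy data: 3 genes x 6 cells in 3 groups (values are made up)
counts_toy <- matrix(
  c(5, 4, 0, 0, 0, 0,
    1, 1, 1, 1, 1, 1,
    0, 0, 2, 3, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("Gene", 1:3), paste0("Cell", 1:6))
)
label_toy <- c("A", "A", "B", "B", "C", "C")
threshold <- 0
groups <- unique(label_toy)

## per-group fraction of cells expressing each gene (input for IDF)
frac_expr <- sapply(groups, function(g)
  rowMeans(counts_toy[, label_toy == g, drop = FALSE] > threshold))
## per-group mean of thresholded expression (input for IAE)
mean_expr <- sapply(groups, function(g)
  rowMeans(pmax(counts_toy[, label_toy == g, drop = FALSE] - threshold, 0)))

idf_prob_toy <- prob_term(frac_expr)
iae_prob_toy <- prob_term(mean_expr)

## expand the group-level values to one column per cell
idf_cells <- idf_prob_toy[, label_toy]
round(idf_prob_toy, 3)
```

A gene detected only in group A (`Gene1`) gets a large value for A and zero elsewhere, while a gene detected everywhere (`Gene2`) gets the same small value in every group. This is why the probability terms favor group-specific features.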

TF here stands for gene frequency, which is similar to CPM, while IDF represents the inverse cell/sample frequency for scRNA-seq data, and IAE is the inverse average expression of each gene across all cells or cells in each labeled group.

Another advantage of `smartid` is that it can start from raw count data, with no need for pre-processing. The scoring is also quite fast.

```
## compute score
system.time(
  data_sim <- cal_score(
    data_sim,
    tf = "logtf",
    idf = "prob",
    iae = "prob",
    par.idf = list(label = "Group"),
    par.iae = list(label = "Group")
  )
)
#> user system elapsed
#> 0.442 0.012 0.454
## score and tf,idf,iae all saved
assays(data_sim)
#> List of length 7
#> names(7): BatchCellMeans BaseCellMeans BCV CellMeans TrueCounts counts score
names(metadata(data_sim))
#> [1] "Params" "up_markers" "tf" "idf" "iae"
```

Scaling is needed to find the markers specific to each group. However, standard scaling might fail because of rare populations. Here `smartid` uses a special scaling strategy, `scale_mgm()`, which scales imbalanced data using the given group labels. This avoids bias towards features with larger numerical ranges during feature selection.

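The scaling method is defined as:

\[z=\frac{x-\frac{\sum_k^{n_D}(\mu_k)}{n_D}}{sd}\]

Here \(\mu_k\) denotes the mean of group \(k\) and \(n_D\) the number of groups, so each feature is centered on the mean of the group means rather than on the overall mean, which would be dominated by the largest groups.

The scores are then transformed using softmax before being passed to the EM algorithm.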
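To make the scaling formula concrete, here is a minimal base R sketch. It is not the `scale_mgm()` implementation. It assumes that \(\mu_k\) is the mean of group \(k\), that \(n_D\) is the number of groups, and that `sd` is the overall standard deviation of the feature; the helpers `scale_by_group_means()` and `softmax()` and the toy data are hypothetical.

```r
## center on the mean of the group means instead of the overall mean
scale_by_group_means <- function(x, label) {
  group_means <- tapply(x, label, mean)  # one mean per group
  (x - mean(group_means)) / sd(x)        # assumption: overall sd in the denominator
}

## numerically stable softmax over a numeric vector
softmax <- function(x) {
  e <- exp(x - max(x))
  e / sum(e)
}

## toy feature: high in a rare group (5 cells), low in a large group (95 cells)
x_toy <- c(rep(10, 5), rep(0, 95))
label_toy <- rep(c("rare", "common"), times = c(5, 95))

z_standard <- as.numeric(scale(x_toy))           # centered on the overall mean (0.5)
z_mgm <- scale_by_group_means(x_toy, label_toy)  # centered on the mean of group means (5)

## scaled value of a cell from the common group under both strategies
c(standard = z_standard[100], mgm = z_mgm[100])

## softmax turns the scaled values into positive weights that sum to 1
p_toy <- softmax(z_mgm)
sum(p_toy)
```

With standard scaling the center sits almost on top of the large group, whereas centering on the mean of group means places the rare and the common group symmetrically around zero regardless of their sizes.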

```
top_m <- top_markers(
  data = data_sim,
  label = "Group",
  n = Inf # set Inf to get all features processed score
)
top_m
#> # A tibble: 4,000 × 3
#> # Groups: .dot [4]
#> .dot Genes Scores
#> <chr> <chr> <dbl>
#> 1 Group1 Gene352 0.0216
#> 2 Group1 Gene696 0.0200
#> 3 Group1 Gene425 0.0171
#> 4 Group1 Gene391 0.0160
#> 5 Group1 Gene225 0.0159
#> 6 Group1 Gene893 0.0147
#> 7 Group1 Gene913 0.0113
#> 8 Group1 Gene650 0.0107
#> 9 Group1 Gene172 0.00998
#> 10 Group1 Gene547 0.00980
#> # ℹ 3,990 more rows
```

The top n features for each group will be ordered and listed in `top_m`. `smartid` provides easy-to-use functions to visualize top feature scores in each group and compare with actual up-regulated DEGs.

The real up-regulated DEGs clearly rise to the top n features. For the negative control, Group 4, the shape of the top feature scores is very different from that of the groups with DEGs, which can provide additional insight into the data.

```
score_barplot(
  top_markers = top_m,
  column = ".dot",
  f_list = slot(data_sim, "metadata")$up_markers,
  n = 20
)
```

As we can see, one up-regulated DEG, ‘Gene76’, does not rise to the top in Group 2. We can check the relative expression of this gene using a violin plot. This gene is clearly not significantly more highly expressed in Group 2, and its average expression is quite low across all cells.

This is also confirmed by the simulation information, where the scale factor (`DEFacGroup2`) is higher in Group 2 but the GeneMean is too small to be confident. Thus this gene won’t be selected by `smartid`.

```
sin_score_boxplot(
  metadata(data_sim)$tf,
  features = "Gene76",
  ref.group = "Group2",
  label = data_sim$Group
)
#> Warning: `add_rownames()` was deprecated in dplyr 1.0.0.
#> ℹ Please use `tibble::rownames_to_column()` instead.
#> ℹ The deprecated feature was likely used in the smartid package.
#> Please report the issue at
#> <https://github.com/DavisLaboratory/smartid/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
```

```
## sim gene info
SummarizedExperiment::elementMetadata(data_sim)[76, ]
#> DataFrame with 1 row and 8 columns
#> Gene BaseGeneMean OutlierFactor GeneMean DEFacGroup1 DEFacGroup2
#> <character> <numeric> <numeric> <numeric> <numeric> <numeric>
#> 1 Gene76 0.00628427 1 0.00628427 1 1.73743
#> DEFacGroup3 DEFacGroup4
#> <numeric> <numeric>
#> 1 1 1
```

As we can see from above, the distribution of feature scores is distinctly different between groups with and without DEGs, and there is a clear separation (break point) between the real DEGs and non-DEGs.

To help automatically select real markers for each group, `smartid` uses an expectation maximization (EM) approach to identify which genes fall into the distribution of real DEGs and which do not.

Treating the distribution of scores as a mixture model, we can use the function `markers_mixmdl()` in `smartid` to separate features. There are 2 available mixture models to choose from: normal (Gaussian) or gamma. We choose “norm” here as it runs faster.

`smartid` can also plot the mixture distribution after EM. The top 2 components of Group 4 have very similar distributions, so no markers are selected for this group.

```
set.seed(123)
marker_ls <- markers_mixmdl(
  top_markers = top_m,
  column = ".dot",
  ratio = 2,
  dist = "norm",
  plot = TRUE
)
#> number of iterations= 88
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```
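The idea behind the EM step can be illustrated with a small, self-contained sketch. This is not the code used by `markers_mixmdl()`. It is a minimal two-component Gaussian mixture fitted by EM in base R on simulated scores, and the function name `em_two_gaussians()`, the initialization, the toy data and the 0.5 posterior cutoff are all assumptions made for illustration.

```r
em_two_gaussians <- function(x, max_iter = 500, tol = 1e-8) {
  ## crude initialization: one component at each end of the data
  mu <- c(min(x), max(x))
  sigma <- rep(sd(x), 2)
  lambda <- c(0.5, 0.5)
  loglik_old <- -Inf
  for (iter in seq_len(max_iter)) {
    ## E-step: posterior probability of each component for each score
    dens <- cbind(
      lambda[1] * dnorm(x, mu[1], sigma[1]),
      lambda[2] * dnorm(x, mu[2], sigma[2])
    )
    post <- dens / rowSums(dens)
    ## M-step: update mixing proportions, means and standard deviations
    n_k <- colSums(post)
    lambda <- n_k / length(x)
    mu <- colSums(post * x) / n_k
    sigma <- sqrt(colSums(post * outer(x, mu, "-")^2) / n_k)
    ## stop when the log-likelihood no longer changes
    loglik <- sum(log(rowSums(dens)))
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  list(lambda = lambda, mu = mu, sigma = sigma, posterior = post, iter = iter)
}

set.seed(123)
## 200 background scores and 20 clearly higher "marker" scores (simulated)
scores_toy <- c(rnorm(200, mean = 0, sd = 1), rnorm(20, mean = 8, sd = 1))
fit <- em_two_gaussians(scores_toy)

## features assigned to the component with the higher mean
top_comp <- which.max(fit$mu)
selected <- which(fit$posterior[, top_comp] > 0.5)
length(selected)
```

When the two fitted components are well separated, as in this toy example, the high-mean component isolates the marker-like scores. When they nearly coincide, as described for Group 4 above, no feature stands out from the background.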