ClassifyR provides a structured pipeline for cross-validated classification. Classification is viewed in terms of four stages: data transformation, feature selection, classifier training, and prediction. The driver functions crossValidate and runTests implement varieties of cross-validation.
Driver functions can use parallel processing capabilities in R to speed up cross-validations when many CPUs are available. The output of the driver functions is a ClassifyResult object which can be directly used by the performance evaluation functions. The process of classification is summarised by a flowchart.
Importantly, ClassifyR implements a number of methods for classification using different kinds of changes in measurements between classes. Most classifiers work with features where the means are different. In addition to changes in means (DM), ClassifyR also allows for classification using differential variability (DV; changes in scale) and differential distribution (DD; changes in location and/or scale).
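The three kinds of between-class change can be illustrated with standard base-R hypothesis tests on simulated data. This sketch is not ClassifyR code; it only shows which statistical test is sensitive to which kind of change (the simulated effect sizes are arbitrary):

```r
# Simulated data: one reference class and three alternatives, each differing
# from it in a different way.
set.seed(123)
classA <- rnorm(50, mean = 0, sd = 1)

dm <- rnorm(50, mean = 2, sd = 1)     # differential means (DM): location shift
dv <- rnorm(50, mean = 0, sd = 3)     # differential variability (DV): scale change
dd <- c(rnorm(25, -2), rnorm(25, 2))  # differential distribution (DD): shape change

t.test(classA, dm)$p.value               # small: detects the mean shift
bartlett.test(list(classA, dv))$p.value  # small: detects the variance change
ks.test(classA, dd)$p.value              # small: detects the distributional change
```

A plain t-test would miss the DV and DD cases above, which is why ClassifyR offers variance-based and distribution-based rankings alongside mean-based ones.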
To demonstrate some key features of ClassifyR, a data set consisting of the 2000 most variably expressed genes and 190 people will be used to quickly obtain results. The journal article corresponding to the data set was published in Scientific Reports in 2018 and is titled A Nasal Brush-based Classifier of Asthma Identified by Machine Learning Analysis of Nasal RNA Sequence Data.
Load the package.
A glimpse at the RNA measurements and sample classes.
data(asthma) # Contains measurements and classes variables.
measurements[1:5, 1:5]
##            HBB BPIFA1  XIST FCGR3B HBA2
## Sample 1  9.72  14.06 12.28  11.42 7.83
## Sample 2 11.98  13.89  6.35  13.25 9.42
## Sample 3 12.15  17.44 10.21   7.87 9.68
## Sample 4 10.60  11.87  6.27  14.75 8.96
## Sample 5  8.18  15.01 11.21   6.77 6.43

head(classes)

## [1] No  No  No  No  Yes No
## Levels: No Yes
The numeric matrix variable measurements stores the normalised values of the RNA gene abundances for each sample and the factor vector classes identifies which class the samples belong to. The measurements were normalised using DESeq2’s varianceStabilizingTransformation function, which produces \(log_2\)-like data.
For more complex data sets with multiple kinds of experiments (e.g. DNA methylation, copy number, gene expression on the same set of samples) a MultiAssayExperiment is recommended for data storage and supported by ClassifyR’s methods.
The crossValidate function offers a quick and simple way to start analysing a data set in ClassifyR. It is a wrapper for runTests, the core model building and testing function of ClassifyR. crossValidate must be supplied with measurements, a simple tabular data container or a list-like structure of related tabular data on a common set of samples. It may be of class matrix, data.frame, DataFrame, MultiAssayExperiment or a list of data.frames. For a data set with \(n\) observations and \(p\) variables, the crossValidate function will accept inputs of the following shapes:
|Data Type|\(n \times p\)|\(p \times n\)|
|---|---|---|
|matrix|✔||
|data.frame|✔||
|DataFrame|✔||
|MultiAssayExperiment||✔|
|list of data.frames|✔||
crossValidate must also be supplied with outcomes, which represents the prediction to be made in a variety of possible ways.
The type of classifier used can be changed with the classifier argument. The default is a random forest, which seamlessly handles categorical and numerical data. A full list of classifiers can be seen by running ?crossValidate. A feature selection step can be performed before classification using nFeatures and selectionMethod, which is a t-test by default. Similarly, the number of folds and number of repeats of cross-validation can be changed with the nFolds and nRepeats arguments. If desired, nCores can be specified to run the cross-validation in parallel. To perform 5-fold cross-validation of a Support Vector Machine with 2 repeats:
result <- crossValidate(measurements, classes, classifier = "SVM", nFeatures = 20, nFolds = 5, nRepeats = 2, nCores = 1)
## Processing sample set 10.
## Warning in .local(results, ...): Balanced Accuracy not found in all elements of results. Calculating it now.
crossValidate also allows data from multiple sources to be integrated into a single model. The integration method can be specified with the multiViewMethod argument. In this example, suppose the first 10 variables in the asthma data set are from one source and the remaining 1990 variables are from a second source. To integrate multiple data sets, each variable must be labelled with the data set it came from. This is done in a different manner depending on the data type of measurements.
If using Bioconductor’s DataFrame, this can be specified using mcols. In the column metadata, each column must have a dataset and a feature name.
measurementsDF <- DataFrame(measurements)
mcols(measurementsDF) <- data.frame(
  dataset = rep(c("dataset_1", "dataset_2"), times = c(10, 1990)),
  feature = colnames(measurementsDF)
)

result <- crossValidate(measurementsDF, classes,
                        classifier = "SVM",
                        nFolds = 5, nRepeats = 2,
                        multiViewMethod = "merge")
## Processing sample set 10.
## Processing sample set 10.
## Processing sample set 10.
## Warning in .local(results, ...): Balanced Accuracy not found in all elements of results. Calculating it now.
If using a list of data.frames, the name of each element in the list will be used as the data set name.
# Assigns first 10 variables to dataset_1, and the rest to dataset_2.
measurementsList <- list(
  (measurements |> as.data.frame())[1:10],
  (measurements |> as.data.frame())[11:2000]
)
names(measurementsList) <- c("dataset_1", "dataset_2")

result <- crossValidate(measurementsList, classes,
                        classifier = "SVM",
                        nFolds = 5, nRepeats = 2,
                        multiViewMethod = "merge")
performancePlot(result)
In the following sections, some of the most useful functions provided in ClassifyR will be demonstrated. However, a user could wrap any feature selection, training, or prediction function to the classification framework, as long as it meets some simple rules about the input and return parameters. See the appendix section of this guide titled “Rules for New Functions” for a description of these.
There are a few other frameworks for classification in R. The table below provides a comparison of which features they offer.
|Package|Run User-defined Classifiers|Parallel Execution on any OS|Parameter Tuning|Intel DAAL Performance Metrics|Ranking and Selection Plots|Class Distribution Plot|Sample-wise Error Heatmap|Direct Support for MultiAssayExperiment Input|
|---|---|---|---|---|---|---|---|---|
Although ClassifyR is primarily a cross-validation framework, it also provides a number of popular feature selection and classification functions that meet the requirements of functions usable by it (see the last section).

In the following tables, the function that is used when no function is explicitly specified by the user is marked as the default.
The functions below each produce a feature ranking. Subsets of different sizes from the top of the ranking are then tried, the classifier performance of each is evaluated, and the best subset of features is selected based on a criterion such as the balanced error rate.
|Function|Description|DM|DV|DD|
|---|---|---|---|---|
|differentMeansRanking|t-test ranking if two classes, F-test ranking if three or more|✔|||
|limmaRanking|Moderated t-test ranking using variance shrinkage|✔|||
|edgeRranking|Likelihood ratio test for count data ranking|✔|||
|bartlettRanking|Bartlett’s test non-robust ranking||✔||
|leveneRanking|Levene’s test robust ranking||✔||
|DMDranking|Difference in location (mean/median) and/or scale (SD, MAD, \(Q_n\))|✔|✔|✔|
|likelihoodRatioRanking|Likelihood ratio (normal distribution) ranking|✔|✔|✔|
|KolmogorovSmirnovRanking|Kolmogorov-Smirnov distance between distributions ranking|✔|✔|✔|
|KullbackLeiblerRanking|Kullback-Leibler distance between distributions ranking|✔|✔|✔|
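The rank-then-tune procedure described above can be sketched in plain base R. This is an illustration, not ClassifyR's implementation: the simulated data, the nearest-centroid rule used for scoring, and the candidate subset sizes are all chosen here for demonstration.

```r
# Two-class data with features in columns; the first 5 features are informative.
set.seed(1)
X <- matrix(rnorm(100 * 50), nrow = 100)      # 100 samples, 50 features
y <- factor(rep(c("No", "Yes"), each = 50))
X[y == "Yes", 1:5] <- X[y == "Yes", 1:5] + 2  # mean shift in 5 features

# Stage 1: rank features by t-test p-value (smaller is better).
pValues <- apply(X, 2, function(f) t.test(f ~ y)$p.value)
ranking <- order(pValues)

# Stage 2: try a few subset sizes; score each by the resubstitution error
# of a simple nearest-centroid rule on the selected features.
errorFor <- function(k) {
  top <- ranking[seq_len(k)]
  centroids <- sapply(levels(y), function(cl)
    colMeans(X[y == cl, top, drop = FALSE]))
  dists <- sapply(levels(y), function(cl)
    rowSums((X[, top, drop = FALSE] -
             matrix(centroids[, cl], nrow(X), k, byrow = TRUE))^2))
  mean(levels(y)[max.col(-dists)] != y)
}
sizes <- c(1, 5, 10, 20)
errors <- sapply(sizes, errorFor)
bestSize <- sizes[which.min(errors)]  # subset size with the lowest error
```

ClassifyR performs this tuning inside cross-validation (by default, by resubstitution on the training set), so the chosen subset size can differ from fold to fold.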
Likewise, a variety of classifiers is also provided.
|Function(s)|Description|DM|DV|DD|
|---|---|---|---|---|
|DLDAtrainInterface, DLDApredictInterface|Wrappers for sparsediscrim’s functions dlda and predict.dlda|✔|||
|classifyInterface|Wrapper for PoiClaClu’s Poisson LDA function classify|✔|||
|elasticNetGLMtrainInterface, elasticNetGLMpredictInterface|Wrappers for glmnet’s elastic net GLM functions glmnet and predict.glmnet|✔|||
|NSCtrainInterface, NSCpredictInterface|Wrappers for pamr’s Nearest Shrunken Centroid functions pamr.train and pamr.predict|✔|||
|fisherDiscriminant|Implementation of Fisher’s LDA for departures from normality|✔|✔*||
|mixModelsTrain, mixModelsPredict|Feature-wise mixtures of normals and voting|✔|✔|✔|
|naiveBayesKernel|Feature-wise kernel density estimation and voting|✔|✔|✔|
|randomForestTrainInterface, randomForestPredictInterface|Wrapper for randomForest’s functions randomForest and predict|✔|✔|✔|
|SVMtrainInterface, SVMpredictInterface|Wrapper for e1071’s functions svm and predict.svm|✔|✔ †|✔ †|
* If ordinary numeric measurements have been transformed to absolute deviations using subtractFromLocation.
† If the value of kernel is not “linear”.
If a desired selection or classification method is not already implemented, rules for writing functions to work with ClassifyR are outlined in the wrapper vignette. Please visit it for more information.
A number of methods are provided for users to enable classification in a feature-set-centric or interactor-centric way. The meta-feature creation functions should be used before cross-validation is done.
|Function|Description|Before CV|During CV|
|---|---|---|---|
|edgesToHubNetworks|Takes a two-column matrix or DataFrame and finds all nodes with at least a minimum number of interactions|✔||
|featureSetSummary|Considers sets of features and calculates their mean or median|✔||
|pairsDifferencesSelection|Finds a set of pairs of features whose measurement inequalities can be used for prediction||✔|
|kTSPclassifier|Voting classifier that uses inequalities between pairs of features to vote for one of two classes||✔|
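The pair-inequality idea behind kTSPclassifier can be sketched in base R for a single pair of features. This is illustrative only (the gene names, effect sizes, and one-pair rule are invented for the example); kTSPclassifier itself selects and combines votes from multiple pairs.

```r
# Two features whose ordering within a sample flips between the classes.
set.seed(2)
n <- 40
geneA <- c(rnorm(n / 2, 5), rnorm(n / 2, 8))  # higher in class "Yes"
geneB <- c(rnorm(n / 2, 8), rnorm(n / 2, 5))  # higher in class "No"
classes <- factor(rep(c("No", "Yes"), each = n / 2))

# The pair's rule: vote "Yes" when geneA exceeds geneB within the same sample.
votesYes <- geneA > geneB
predicted <- factor(ifelse(votesYes, "Yes", "No"), levels = levels(classes))
mean(predicted == classes)  # resubstitution accuracy of the one-pair rule
```

Because the rule only compares measurements within each sample, it is robust to sample-specific scaling, which is the main appeal of top-scoring-pairs classifiers.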
For more control over the finer aspects of cross-validation of a single data set, runTests may be employed in place of crossValidate. The variety of cross-validation and its parameters are specified by a CrossValParams object. The default setting is 100 permutations and five folds, with parameter tuning done by resubstitution. It is also recommended to specify a parallelParams setting: on Linux and macOS operating systems it should be MulticoreParam and on Windows computers it should be SnowParam. Note that each of these has an RNGseed option which needs to be set by the user, because some classifiers or feature selection functions have an element of randomisation. One example that works on all operating systems, but is best suited to Windows, is:
CVparams <- CrossValParams(parallelParams = SnowParam(16, RNGseed = 123))
CVparams
For the actual operations to do to the data to build a model of it, each of the stages should be specified by an object of class ModellingParams. This controls how class imbalance is handled (default is to downsample to the smallest class), any transformation that needs to be done inside of cross-validation (i.e. involving a computed value from the training set), any feature selection and the training and prediction functions to be used. The default is to do an ordinary t-test (two groups) or ANOVA (three or more groups) and classification using diagonal LDA.
## An object of class "ModellingParams"
## Slot "balancing":
## [1] "downsample"
##
## Slot "transformParams":
## NULL
##
## Slot "selectParams":
## An object of class 'SelectParams'.
## Selection Name: Difference in Means.
##
## Slot "trainParams":
## An object of class 'TrainParams'.
## Classifier Name: Diagonal LDA.
##
## Slot "predictParams":
## An object of class 'PredictParams'.
## Predictor Name: Diagonal LDA.
##
## Slot "doImportance":
## [1] FALSE
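The default "downsample" balancing described above can be illustrated in plain R: every class is randomly reduced to the size of the smallest class before training. This is a sketch of the behaviour, not ClassifyR's internal code.

```r
# An imbalanced two-class labelling: 120 "No" samples, 70 "Yes" samples.
set.seed(3)
classes <- factor(c(rep("No", 120), rep("Yes", 70)))

# Downsample each class to the size of the smallest class.
smallest <- min(table(classes))
keep <- unlist(lapply(levels(classes), function(cl)
  sample(which(classes == cl), smallest)))

table(classes[keep])  # both classes now have 70 samples
```

Balancing matters because classifiers trained on imbalanced data tend to favour the majority class, inflating apparent accuracy while degrading balanced accuracy.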
runTests is the main function in ClassifyR which handles the sample splitting and parallelisation, if used, of cross-validation. To begin with, a simple classifier will be demonstrated. It uses a t-test or ANOVA ranking (depending on the number of classes) for feature ranking and DLDA for classification. This classifier relies on differences in means between classes. No parameters need to be specified, because this is the default classification of runTests. By default, the number of features is tuned by resubstitution on the training set.
crossValParams <- CrossValParams(permutations = 5)
DMresults <- runTests(measurements, classes, crossValParams, verbose = 1)
## You have 2000 features and 190 samples and only one data-type.
## An object of class 'ClassifyResult'.
## Characteristics:
##    characteristic                   value
##    Selection Name     Difference in Means
##   Classifier Name            Diagonal LDA
##  Cross-validation 5 Permutations, 5 Folds
## Features: List of length 25 of feature identifiers.
## Predictions: A data frame of 950 rows.
## Performance Measures: None calculated yet.
Here, 5 permutations (non-default) and 5-fold cross-validation (default) are specified. For computers with more than 1 CPU, the number of cores to use can be given to runTests by using the argument parallelParams. The parameter seed is important to set for result reproducibility when doing a cross-validation such as this, because it employs randomisation to partition the samples into folds. Also, RNGseed is highly recommended to be set on the back-end specified to BPPARAM when doing parallel processing, because the first seed mentioned does not control random number generation within parallel processes. For more details about runTests and the parameter classes used by it, consult their help pages.
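The point about seeds and parallel workers can be demonstrated with base R's parallel package, which underlies the same issue: a plain set.seed in the main session does not govern the workers' random number streams, so the cluster's streams must be seeded explicitly (analogous to the RNGseed option of BiocParallel back-ends). This is an illustrative sketch, not ClassifyR code.

```r
library(parallel)

# Two workers drawing random numbers in parallel.
cl <- makeCluster(2)

# Seed the workers' RNG streams, then draw.
clusterSetRNGStream(cl, iseed = 123)
draws1 <- parSapply(cl, 1:4, function(i) rnorm(1))

# Re-seeding with the same iseed reproduces exactly the same draws.
clusterSetRNGStream(cl, iseed = 123)
draws2 <- parSapply(cl, 1:4, function(i) rnorm(1))
stopCluster(cl)

identical(draws1, draws2)  # TRUE: parallel results are reproducible
```

Without the clusterSetRNGStream calls, the two sets of draws would generally differ between runs, which is why unseeded parallel cross-validation is not reproducible.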
The most frequently selected gene can be identified using the distribution function and its relative abundance values for all samples can be displayed visually by plotFeatureClasses.
selectionPercentages <- distribution(DMresults, plot = FALSE)
head(selectionPercentages)
sortedPercentages <- sort(selectionPercentages, decreasing = TRUE)
head(sortedPercentages)
mostChosen <- names(sortedPercentages)[1] # The single most frequently selected gene.
bestGenePlot <- plotFeatureClasses(measurements, classes, mostChosen, dotBinWidth = 0.1,
                                   xAxisLabel = "Normalised Expression")