---
author: Sonali Arora (sarora@fredhutch.org)
output:
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{D. Machine Learning}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r style, echo=FALSE, results='asis'}
BiocStyle::markdown()
```

## Introduction to Machine Learning

Let's say that we are interested in predicting the run time of an athlete based on their shoe size, height and weight in a study of 100 people. We can do so using a simple linear regression model:

```{r eval=FALSE}
y = beta0 + beta1 * height + beta2 * weight + beta3 * shoe_size
```

Here y is the response variable (run time), n is the number of observations (100 people), p is the number of variables/features/predictors (3, i.e., height, weight and shoe size), and X is an n x p matrix.

This is a low-dimensional data set, where n >> p, but most of the data sets coming out of modern biological techniques are high dimensional, i.e., n << p. This poses statistical challenges, and simple linear regression can no longer help us. For example:

* Identify the risk factors (genes) for prostate cancer based on gene expression data.
* Predict the chances of breast cancer survival in a patient.
* Identify patterns of gene expression among different subtypes of breast cancer.

In all three examples listed above, n, the number of observations, is 30-40 patients, whereas p, the number of features, is approximately 30,000 genes. Try writing a linear regression formula for the outcome variable, y, in any of the above three scenarios.

Listed below are things that can go wrong with high-dimensional data:

- some of these predictors are useful, some are not
- if we include too many predictors, we can overfit the data

This is why we need machine learning. Let's first introduce some basic concepts and then dive into examples and a lab session.

**Supervised Learning** - Use a data set X to predict the association with a response variable Y. The response variable can be continuous or categorical.
For example: predicting the chances of breast cancer survival in a patient.

**Unsupervised Learning** - Discover the associations or patterns in X. No response variable is present. For example: cluster similar genes into groups.

**Training & Test Datasets** - Usually we split the observations into a training set and a test set. We fit the model on the training set and evaluate it on the test set. The test set error rate is an estimate of the model's performance on future data sets.

**Model Selection** - We usually consider numerous models for a given problem. For example, if we are trying to identify the genes responsible for a given disease using a gene expression data set, we could have the following models:

a) model 1 - use all 30,000 genes from the array to build a model
b) model 2 - include only genes related to the pathway that we know is upregulated in that disease
c) model 3 - include genes found in the literature which are known to influence this disease

It is highly recommended to use the test set only on our final model, to see how the model will do with new, unseen data. So how do we pick the best model, which can then be tested on the test data set?

**Cross-validation** - We can use different approaches to find the best model. Let's look at the commonly used approaches, namely the validation set approach, leave-one-out cross-validation, and k-fold cross-validation.

Briefly, the __validation set approach__ divides the full data set into three groups - a training set, a validation set and a test set. We train the models on the training set, evaluate their performance on the validation set, and then fit the best model on the test set.

__Leave-one-out cross-validation__ fits n models (where n is the number of observations in the training data set), each on n-1 observations, evaluating each model on the left-out observation. The best model is the one for which the total test error is smallest; that model is then used to predict the test set.
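To make the leave-one-out procedure concrete, here is a minimal sketch on simulated data (the variables `x` and `y` below are invented purely for illustration): each of the n fits leaves out exactly one observation and is scored on it.

```{r}
## Leave-one-out cross-validation on simulated data (illustrative only)
set.seed(1)
n <- 30
x <- rnorm(n)
y <- 2 * x + rnorm(n)                       # toy linear relationship
errs <- numeric(n)
for (i in seq_len(n)) {
  fit <- lm(y ~ x, subset = -i)             # fit on the other n - 1 points
  pred <- predict(fit, newdata = data.frame(x = x[i]))
  errs[i] <- (y[i] - pred)^2                # squared error on the held-out point
}
mean(errs)                                  # LOOCV estimate of the test error
```

Averaging the n squared errors gives the cross-validation estimate of the test error; comparing this estimate across candidate models is how the "best" model is chosen.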
Lastly, __k-fold cross-validation__ (here k = 5) splits the training data set into 5 folds and repeatedly trains the model on 4 of the folds while evaluating its performance on the fifth.

**Bias, Variance, Overfitting** - Bias refers to the average difference between the actual betas and the predicted betas; variance refers to the amount by which the predicted betas differ across experiments. As model complexity (the number of variables) increases, the bias decreases and the variance increases. This is known as the bias-variance tradeoff, and a model that has too much variance is said to be overfit.

## Datasets

For **unsupervised learning**, we will use RNA-Seq count data from the Bioconductor package `r Biocpkg("airway")`. From the abstract, a brief description of the RNA-Seq experiment on airway smooth muscle (ASM) cell lines: "Using RNA-Seq, a high-throughput sequencing method, we characterized transcriptomic changes in four primary human ASM cell lines that were treated with dexamethasone - a potent synthetic glucocorticoid (1 micromolar for 18 hours)."

```{r message=FALSE}
library(airway)
data("airway")
se <- airway
colData(se)
library("DESeq2")
dds <- DESeqDataSet(se, design = ~ cell + dex)
```

For **supervised learning**, we will use cervical count data from the Bioconductor package `r Biocpkg("MLSeq")`. This data set contains the expression of 714 miRNAs in human samples. There are 29 tumor and 29 non-tumor cervical samples. For learning purposes, we can treat these as two separate groups and run various classification algorithms.

```{r message=FALSE}
library(MLSeq)
filepath = system.file("extdata/cervical.txt", package = "MLSeq")
cervical = read.table(filepath, header = TRUE)
```

## Unsupervised Learning

Unsupervised learning is a set of statistical tools intended for the setting in which we have only a set of p features measured on n observations. We are primarily interested in discovering interesting things about the p features.
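As a toy illustration of this setting (the simulated matrix below is invented for demonstration), k-means clustering groups observations using the features alone - no response variable is ever supplied:

```{r}
## k-means on simulated data: two groups recovered without any labels
set.seed(2)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),   # 25 observations near 0
           matrix(rnorm(50, mean = 3), ncol = 2))   # 25 observations near 3
km <- kmeans(x, centers = 2, nstart = 20)
table(km$cluster)                                   # sizes of the two clusters
```

The algorithm is given only the 50 x 2 feature matrix; there is no held-out truth against which to score the result, which is exactly the difficulty of the unsupervised setting described above.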
Unsupervised learning is often performed as part of exploratory data analysis. These tools help us get a good idea about the data set. Unlike in a supervised learning problem, where we can use prediction to gain some confidence about our learning algorithm, there is no way to check our model. The learning algorithm is thus aptly named "unsupervised".

**RLOG TRANSFORMATION**

Many common statistical methods for exploratory analysis of multidimensional data, especially methods for clustering and ordination (e.g., principal-components analysis and the like), work best for (at least approximately) homoskedastic data; this means that the variance of an observed quantity (here, the expression strength of a gene) does not depend on the mean. In RNA-Seq data, the variance grows with the mean. If one performs PCA (principal-components analysis) directly on a matrix of normalized read counts, the result typically depends only on the few most strongly expressed genes, because they show the largest absolute differences between samples. As a solution, DESeq2 offers the regularized-logarithm transformation, or rlog for short. See the help page ?rlog for more information and options. The function rlog returns a SummarizedExperiment object which contains the rlog-transformed values in its assay slot:

```{r}
rld <- rlog(dds)
head(assay(rld))
```

To assess overall similarity between samples: which samples are similar to each other, which are different? Does this fit the expectation from the experiment's design? We use the R function dist to calculate the Euclidean distance between samples. To keep the distance measure from being dominated by a few highly variable genes, and to have a roughly equal contribution from all genes, we use it on the rlog-transformed data:

```{r}
sampleDists <- dist( t( assay(rld) ) )
sampleDists
```

Note the use of the function t to transpose the data matrix.
We need this because dist calculates distances between the rows of a data matrix, and our samples constitute the columns.

**HEATMAP**

We visualize the sample-to-sample distances in a heatmap, using the function heatmap.2 from the gplots package. Note that we have changed the row names of the distance matrix to contain the treatment type and cell line instead of the sample ID, so that we have all this information in view when looking at the heatmap.

```{r message=FALSE}
library("gplots")
library("RColorBrewer")
sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- paste( rld$dex, rld$cell, sep="-" )
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
hc <- hclust(sampleDists)
heatmap.2( sampleDistMatrix, Rowv=as.dendrogram(hc),
           symm=TRUE, trace="none", col=colors,
           margins=c(2,10), labCol=FALSE )
```

**PCA**

Another way to visualize sample-to-sample distances is a principal-components analysis (PCA). In this ordination method, the data points (here, the samples) are projected onto the 2D plane such that they spread out in the two directions which explain most of the differences in the data. The x-axis is the direction (or principal component) which separates the data points the most. The amount of the total variance contained in each direction is printed in the axis label. Here, we use the function plotPCA which comes with DESeq2. The two terms specified by intgroup are the interesting groups for labelling the samples; they tell the function to use them to choose colors.

```{r}
plotPCA(rld, intgroup = c("dex", "cell"))
```

From both visualizations, we see that the differences between cells are considerable, though not stronger than the differences due to treatment with dexamethasone. This shows why it will be important to account for this in differential testing by using a paired design ("paired", because each dex-treated sample is paired with one untreated sample from the same cell line).
We are already set up for this because we used the design formula ~ cell + dex when setting up the data object in the beginning.

**MDS**

Another plot, very similar to the PCA plot, can be made using the multidimensional scaling (MDS) function in base R. This is useful when we don't have the original data, but only a matrix of distances. Here we have the MDS plot for the distances calculated from the rlog-transformed counts:

```{r}
library(ggplot2)
mds <- data.frame(cmdscale(sampleDistMatrix))
mds <- cbind(mds, colData(rld))
qplot(X1, X2, color=dex, shape=cell, data=as.data.frame(mds))
```

### Exercise:
Use the plotMDS function from the limma package to make a similar plot. What is the advantage of using this function over base R's cmdscale?

**Solutions:**
A similar plot can be made using the plotMDS() function in limma, where the input is a matrix of log-fold expression values. The advantage is that the distances on the plot are proportional to log2-fold change, and that not only is the plot created, but the object (with the distance matrix) is also returned.

```{r plotMDS}
suppressPackageStartupMessages({
  library(limma)
  library(DESeq2)
  library(airway)
})
plotMDS(assay(rld), col=as.integer(dds$dex),
        pch=as.integer(dds$cell))
```

## Supervised Learning

In supervised learning, along with the p features, we also have a response Y measured on the same n observations. The goal is then to predict Y using X (an n x p matrix) for new observations.

For the cervical data, we know that the first 29 samples are non-tumor whereas the last 29 are tumor samples. We will code these as 0 and 1, respectively. We will randomly sample 20% of the data and use it as a test set; the remaining 80% will be used as training data.

```{r}
set.seed(9)
class = data.frame(condition = factor(rep(c(0, 1), c(29, 29))))
nTest = ceiling(ncol(cervical) * 0.2)
ind = sample(ncol(cervical), nTest, FALSE)

cervical.train = cervical[, -ind]
cervical.train = as.matrix(cervical.train + 1)
classtr = data.frame(condition = class[-ind, ])

cervical.test = cervical[, ind]
cervical.test = as.matrix(cervical.test + 1)
classts = data.frame(condition = class[ind, ])
```

MLSeq aims to make computation less complicated for the user and allows one to learn a model using various classifiers with a single function. The main function of this package is classify, which requires data in the form of a DESeqDataSet instance. The DESeqDataSet is a subclass of SummarizedExperiment, used to store the input values, intermediate calculations and results of a differential expression analysis. So let's create a DESeqDataSet object for both the training and the test set, and run DESeq on each:

```{r}
cervical.trainS4 = DESeqDataSetFromMatrix(countData = cervical.train,
    colData = classtr, formula(~condition))
cervical.trainS4 = DESeq(cervical.trainS4, fitType = "local")

cervical.testS4 = DESeqDataSetFromMatrix(countData = cervical.test,
    colData = classts, formula(~condition))
cervical.testS4 = DESeq(cervical.testS4, fitType = "local")
```

Classify using support vector machines:

```{r}
svm = classify(data = cervical.trainS4, method = "svm",
    normalize = "deseq", deseqTransform = "vst", cv = 5,
    rpt = 3, ref = "1")
svm
```

classify returns an object of class MLSeq, and we observe that it successfully fitted a model with 97.8% accuracy. We can access the slots of this S4 object with:

```{r}
getSlots("MLSeq")
```

And also ask about the trained model:
```{r}
trained(svm)
```

We can predict the class labels of our test data using predictClassify:

```{r}
pred.svm = predictClassify(svm, cervical.testS4)
table(pred.svm, relevel(cervical.testS4$condition, 2))
```

The other classification methods available are 'randomforest', 'cart' and 'bagsvm'.

### Exercise:
Train the same training data and test data using randomForest.

**Solutions:**

```{r}
rf = classify(data = cervical.trainS4, method = "randomforest",
    normalize = "deseq", deseqTransform = "vst", cv = 5,
    rpt = 3, ref = "1")
trained(rf)

pred.rf = predictClassify(rf, cervical.testS4)
table(pred.rf, relevel(cervical.testS4$condition, 2))
```

## SessionInfo

```{r}
sessionInfo()
```

## References

1. Zararsiz G, Goksuluk D, Korkmaz S, Eldem V, Duru IP, Unver T and Ozturk A (2014). MLSeq: Machine learning interface for RNA-Seq data. R package version 1.3.0.
2. Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM, Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri RA, Tantisira KG, Weiss ST and Lu Q (2014). "RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells." PLoS ONE, 9(6), e99625. http://www.ncbi.nlm.nih.gov/pubmed/24926665.
3. James G, Witten D, Hastie T and Tibshirani R. An Introduction to Statistical Learning with Applications in R.
4. Hastie T, Tibshirani R and Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction.