--- title: "Lab 7a: Machine learning exercises" output: BiocStyle::html_document: toc: true number_sections: false vignette: > % \VignetteIndexEntry{Lab 7a: Machine learning exercises} % \VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r style, echo=FALSE, results='asis'} BiocStyle::markdown() ``` ```{r setup, echo=FALSE, results='hide'} library(knitr) opts_chunk$set(cache=TRUE, error=FALSE) ``` # Exploratory hierarchical clustering with shiny ## A basic function The R source program `dfHclust.R` is in the github repository for lab 7. Source it into your R session, and then verify that it works with the call ``` data(mtcars) dfHclust(mtcars, labels=rownames(mtcars)) ``` If it fails, add any missing libraries, and do what it takes to get it to work. Interrupt the shiny session to proceed. ## Application to tissue discrimination Set up inputs to `dfHclust` using the `tissuesGeneExpression` data. ```{r gett} library(tissuesGeneExpression) data(tissuesGeneExpression) df = data.frame(t(e)) no = which(tab$SubType == "normal") df = df[no,] tisslabel = tab$Tissue[no] ``` Use `dfHclust(df[,1:50], tisslabel)` as a check. Interrupt ## Exercises. ### Symbol mapping Map the column names of `df` to gene symbols. Use `hgu133a.db`. Remove columns with unmappable symbols and rename the remaining columns with the symbols. ```{r mapit} library(hgu133a.db) nids = mapIds(hgu133a.db, keys= sub("^X", "", colnames(df)), keytype="PROBEID", column="SYMBOL") bad = which(is.na(nids)) if (length(bad)>0) { df = df[,-bad] nids = nids[-bad] colnames(df) = nids } dim(df) ``` Interrupt the shiny session and use the new `df` as input. Note that the clustering is based on three genes by default. Other default choices for the clustering are - the object:object distance used - the agglomeration algorithm - the height at which the tree is cut to define clusters Shift the view to the `silhouette` plot. With the default settings, the average silhouette value for five clusters is 0.35. Increase the `height for cut` value to 8. How many clusters are declared, and what is the average silhouette value? Add the gene CCL5 to the feature set used for clustering. Now how many clusters are declared? Interrupt the shiny session to proceed. ### Alphabetizing the selection options Modify `df` so that the column names are in alphabetical order. Use `dfHclust(df[,1:50], tisslabel)` for the new ordering. What is the average silhouette value for the default choices of `dfHclust` settings? Change the clustering method to `ward.D2`. What is the new average silhouette value? Interrupt the shiny session to proceed. ### Clustering with a gene set We have used an arbitrary selection of genes for these illustrations. Consider the idea that steady-state expression pattern of genes that are used to perform splicing is important for tissue differentiation. We can get a list of relevant genes on the hgu133a array as follows. ``` # using GO.db GOID TERM 22097 GO:0045292 mRNA cis splicing, via spliceosome ``` ```{r dogo} splg = select(hgu133a.db, keys="GO:0045292", keytype="GO", columns="SYMBOL") tokeep = intersect(splg$SYMBOL, colnames(df)) dfsp = df[, tokeep] ``` You should have 7 genes available after these operations. Use `dfsp` with dfHclust, and select all genes for clustering. As you add spliceosome-annotated genes into the clustering, does the appearance of the clustering tree improve? # NMF with drosophila expression patterns ## The drosmap package Install and attach the `drosmap` package. This is a simple repackaging of code and data provided at [BDGP](http://insitu.fruitfly.org/insitu-pp/prinPatCode.zip). ```{r doinst,eval=FALSE} library(BiocInstaller) biocLite("vjcitn/drosmap") library(drosmap) ``` ## Expression patterns A data.frame of spatially recorded gene expression patterns derived from blastocyst samples is available. We'll display some examples. ```{r lkexp, fig=TRUE} library(drosmap) data(expressionPatterns) data(template) imageBatchDisplay(expressionPatterns[,1:12], nrow=3, ncol=4, template=template[,-1]) ``` ## Exercises ### Comparing non-negative matrix factorizations of the expression pattern matrix We'll reduce the data matrix (for convenience) to 701 unique genes ```{r dou} data(uniqueGenes) uex = expressionPatterns[,uniqueGenes] ``` We'll begin with a factorization using a basis of rank 10. ```{r don1,cache=TRUE} set.seed(123) library(NMF) m10 =nmf(uex, rank=10) m10 ``` The authors of the [Wu et al. 2016 PNAS paper](http://www.pnas.org/content/113/16/4290.full) justify a rank 21 basis. ```{r don2,cache=TRUE} set.seed(123) library(NMF) m21 =nmf(uex, rank=21) ``` To visualize the clustering of the expression patterns with the rank 10 basis, use ```{r lk10,fig=TRUE,fig.height=3.8} imageBatchDisplay(basis(m10), nrow=4,ncol=3,template=template[,-1]) ``` The 'predicted' matrix with the rank 10 basis is ```{r lkpr} PM10 = basis(m10)%*%coef(m10) ``` Compare the faithfulness of the rank 10 and rank 21 approximations. ### Comparison to a cell fate schematic Produce the display of the `m21` basis with `imageBatchDisplay` and check that the constituents are similar to those shown as principal patterns below (from the Wu et al. paper). ```{r lippp,fig=TRUE,echo=FALSE} im = readPNG("fullFateMap.png") grid.raster(im) ``` Can any of the patterns found with the rank 10 basis be mapped to key anatomical components of the blastocyst fate schematic?