
# StatQuest: K-means clustering

```r
## demo of k-means clustering...

## Step 1: make up some data
x <- rbind(
  matrix(rnorm(100, mean=0, sd=0.3), ncol=2),  # cluster 1
  matrix(rnorm(100, mean=1, sd=0.3), ncol=2),  # cluster 2
  matrix(c(rnorm(50, mean=1, sd=0.3),          # cluster 3
           rnorm(50, mean=0, sd=0.3)), ncol=2))
colnames(x) <- c("x", "y")

## Step 2: show the data without clustering
plot(x)

## Step 3: show the data with the known clusters (this is just so we
## can see how well k-means clustering recreates the original clusters
## we created in step 1)
colors <- as.factor(c(
  rep("c1", times=50),
  rep("c2", times=50),
  rep("c3", times=50)))
plot(x, col=colors)

## Step 4: cluster the data
## NOTE: nstart=25, so kmeans() will cluster using 25 different starting
## points and return the best clustering.
(cl <- kmeans(x, centers=3, nstart=25))

## Step 5: plot the data, coloring the points by cluster
plot(x, col=cl$cluster)
```
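The code above hard-codes `centers=3` because we made the data with 3 clusters. When you don't know the number of clusters in advance, one common companion to `kmeans()` is an "elbow plot" of the total within-cluster sum of squares. This sketch is an addition to the original code, with the same kind of made-up data, and just illustrates the idea:

```r
## Sketch of the "elbow method" for choosing k: run kmeans() for several
## values of k and plot the total within-cluster sum of squares. The k
## where the curve bends and flattens (the "elbow") is a reasonable choice.
set.seed(42)

## make up data with 3 clusters, like the example above
x <- rbind(
  matrix(rnorm(100, mean=0, sd=0.3), ncol=2),
  matrix(rnorm(100, mean=1, sd=0.3), ncol=2),
  matrix(c(rnorm(50, mean=1, sd=0.3),
           rnorm(50, mean=0, sd=0.3)), ncol=2))

## tot.withinss = total within-cluster sum of squares for that clustering
wss <- sapply(1:8, function(k) kmeans(x, centers=k, nstart=25)$tot.withinss)

plot(1:8, wss, type="b",
     xlab="Number of clusters (k)",
     ylab="Total within-cluster sum of squares")
```

With this data the curve should drop steeply up to k=3 and flatten afterwards.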

# StatQuest: K-Nearest Neighbors

You’re going to freak out when you find out how easy it is to classify data!
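To see just how easy, here's a minimal sketch of KNN classification in R. It uses `knn()` from the "class" package (one of R's recommended packages) and the built-in iris dataset; this example is an illustration added here, not code from the original video:

```r
## classify iris flowers with k-nearest neighbors
library(class)

set.seed(42)

## Step 1: split the data into training and test sets
train.index <- sample(1:nrow(iris), size=0.7 * nrow(iris))
train.data <- iris[train.index, 1:4]    # the 4 measurement columns
test.data <- iris[-train.index, 1:4]
train.labels <- iris$Species[train.index]
test.labels <- iris$Species[-train.index]

## Step 2: classify each test flower by a majority vote of its
## 5 nearest neighbors in the training data
predictions <- knn(train=train.data, test=test.data, cl=train.labels, k=5)

## Step 3: see how well we did
table(predictions, test.labels)
mean(predictions == test.labels)  ## fraction classified correctly
```

That's it: no model fitting step at all, since KNN just stores the training data and votes at prediction time.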

# StatQuest: Hierarchical Clustering

```r
## NOTE: You can only have a few hundred rows and columns in a heatmap.
## Any more and the program will need too much memory to run. Also, it's
## hard to distinguish things when there are so many rows/columns.
## Usually I pick the top 100 genes of interest and focus the heatmap on
## those. There are, of course, a million ways to pick 100 "genes of
## interest". You might know what they are by doing differential gene
## expression, or you might just pick the top 100 with the most variation
## across the samples. Anything goes as long as it helps you gain insight
## into the data.

## first let's make up some data...
sample1 <- c(rnorm(n=5, mean=10), rnorm(n=5, mean=0), rnorm(n=5, mean=20))
sample2 <- c(rnorm(n=5, mean=0), rnorm(n=5, mean=10), rnorm(n=5, mean=10))
sample3 <- c(rnorm(n=5, mean=10), rnorm(n=5, mean=0), rnorm(n=5, mean=20))
sample4 <- c(rnorm(n=5, mean=0), rnorm(n=5, mean=10), rnorm(n=5, mean=10))
sample5 <- c(rnorm(n=5, mean=10), rnorm(n=5, mean=0), rnorm(n=5, mean=20))
sample6 <- c(rnorm(n=5, mean=0), rnorm(n=5, mean=10), rnorm(n=5, mean=10))

data <- data.frame(sample1, sample2, sample3, sample4, sample5, sample6)
head(data)

## draw heatmap without clustering
heatmap(as.matrix(data), Rowv=NA, Colv=NA)

## draw heatmap with clustering and use default settings for everything
##
## The heatmap function defaults to using...
##   the Euclidean distance to compare rows or columns, and
##   "complete" linkage to compare clusters.
##
## Also, heatmap defaults to scaling the rows so that each row has mean=0
## and standard deviation 1. This is done to keep the range of colors
## needed to show differences to a reasonable amount. There are only so
## many shades of a color we can easily differentiate...
heatmap(as.matrix(data))

## draw heatmap with clustering and custom settings...
## We'll use the "manhattan" distance to compare rows/columns
## and "centroid" linkage to compare clusters.
## The differences are... subtle...
heatmap(as.matrix(data),
        distfun=function(x) dist(x, method="manhattan"),
        hclustfun=function(x) hclust(x, method="centroid"))

## Now do the same thing, but this time, scale columns instead of rows...
heatmap(as.matrix(data),
        distfun=function(x) dist(x, method="manhattan"),
        hclustfun=function(x) hclust(x, method="centroid"),
        scale="column")

## if you want to save a heatmap to a PDF file...
pdf(file="my_first_heatmap.pdf", width=5, height=6)
heatmap(as.matrix(data))
dev.off()
```
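If you want the clusters themselves rather than just the picture, you can call `hclust()` directly (it's the same function `heatmap()` uses under the hood) and then cut the dendrogram with `cutree()`. This sketch is an addition to the original code; it uses a small made-up dataset of its own (`hc.data`) so it runs on its own:

```r
## hierarchical clustering "by hand" with hclust() and cutree()
set.seed(42)

## made-up data in the same style as the heatmap example
hc.data <- data.frame(
  sample1 = c(rnorm(n=5, mean=10), rnorm(n=5, mean=0), rnorm(n=5, mean=20)),
  sample2 = c(rnorm(n=5, mean=0), rnorm(n=5, mean=10), rnorm(n=5, mean=10)),
  sample3 = c(rnorm(n=5, mean=10), rnorm(n=5, mean=0), rnorm(n=5, mean=20)))

## compare rows with the Euclidean distance and "complete" linkage
## (the same defaults heatmap() uses)
hc <- hclust(dist(hc.data, method="euclidean"), method="complete")

plot(hc)                     ## draw the dendrogram

clusters <- cutree(hc, k=3)  ## cut the tree into 3 clusters
clusters                     ## a cluster label for each row
```

`cutree()` can also cut at a height instead of a cluster count, via its `h` parameter.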

# StatQuest: Fitting a curve to data, aka Lowess, aka Loess

As an added bonus, here’s some code that shows you how to use “lowess()” and “loess()” in R.

```r
## first let's make a noisy gamma distribution plot...
x <- seq(from=0, to=20, by=0.1)
y.gamma <- dgamma(x, shape=2, scale=2)
y.gamma.scaled <- y.gamma * 100

y.norm <- vector(length=201)
for (i in 1:201) {
  y.norm[i] <- rnorm(n=1, mean=y.gamma.scaled[i], sd=2)
}

plot(x, y.norm, frame.plot=FALSE, xlab="", ylab="", col="#d95f0e", lwd=1.5)

## now fit a lowess curve to the data...
## by default "lowess()" fits a line in each window using
## 2/3's of the data points.
##
## the first parameter, y.norm ~ x, says that y.norm is being
## modeled by x, and the second parameter, f, is the fraction
## of points to use in each window. Here, we're using 1/5 of the
## data points in each window.
lo.fit.gamma <- lowess(y.norm ~ x, f=1/5)

plot(x, y.norm, frame.plot=FALSE, xlab="", ylab="", col="#d95f0e", lwd=1.5)
lines(x, lo.fit.gamma$y, col="black", lwd=3)

## now fit a loess curve to the data...
## by default "loess()" fits a parabola in each window using
## 75% of the data points.
plx <- predict(loess(y.norm ~ x, span=1/5, degree=2,
                     family="symmetric", iterations=4), se=TRUE)

plot(x, y.norm, frame.plot=FALSE, xlab="", ylab="", col="#d95f0e", lwd=1.5)
lines(x, plx$fit, col="black", lwd=3)

## now let's add a confidence interval to the loess() fit...
plot(x, y.norm, type="n", frame.plot=FALSE, xlab="", ylab="",
     col="#d95f0e", lwd=1.5)
polygon(c(x, rev(x)),
        c(plx$fit + qt(0.975, plx$df) * plx$se,
          rev(plx$fit - qt(0.975, plx$df) * plx$se)),
        col="#99999977")
points(x, y.norm, col="#d95f0e", lwd=1.5)
lines(x, plx$fit, col="black", lwd=3)
```