StatQuest: What is a Statistical Model?


StatQuest: K-means clustering

## demo of k-means clustering...

## Step 1: make up some data
x <- rbind(
  matrix(rnorm(100, mean=0, sd = 0.3), ncol = 2), # cluster 1
  matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2), # cluster 2
  matrix(c(rnorm(50, mean = 1, sd = 0.3), # cluster 3
    rnorm(50, mean = 0, sd = 0.3)), ncol = 2))
colnames(x) <- c("x", "y")

## Step 2: show the data without clustering
plot(x)

## Step 3: show the data with the known clusters (this is just so we
## can see how well k-means clustering recreates the original clusters we
## created in step 1)
colors <- as.factor(c(
  rep("c1", times=50),
  rep("c2", times=50),
  rep("c3", times=50)))
plot(x, col=colors)

## Step 4: cluster the data
## NOTE: nstart=25, so kmeans() will cluster using 25 different starting points
## and return the best clustering.
(cl <- kmeans(x, centers=3, nstart=25)) 

## Step 5: plot the data, coloring the points with the clusters
plot(x, col = cl$cluster)
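To see how well k-means recovered the original clusters, you can cross-tabulate its assignments against the known labels. Note that the cluster numbers kmeans() returns are arbitrary (its cluster "1" might be our "c3"), so we check that each true cluster maps mostly onto one k-means cluster rather than expecting the labels to match. A self-contained sketch (it re-creates the made-up data from Step 1, with a seed so it's reproducible):

```r
## re-create the made-up data with a seed for reproducibility
set.seed(42)
x <- rbind(
  matrix(rnorm(100, mean=0, sd=0.3), ncol=2),   # cluster 1
  matrix(rnorm(100, mean=1, sd=0.3), ncol=2),   # cluster 2
  matrix(c(rnorm(50, mean=1, sd=0.3),           # cluster 3
    rnorm(50, mean=0, sd=0.3)), ncol=2))
truth <- rep(c("c1", "c2", "c3"), each=50)      # the known labels

cl <- kmeans(x, centers=3, nstart=25)

## cross-tabulate k-means assignments vs. the known clusters;
## each column should be dominated by a single row
table(cl$cluster, truth)
```

If the clusters are well separated (as they are here), each true cluster lands almost entirely in one k-means cluster.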

StatQuest: Hierarchical Clustering

## NOTE: You can only have a few hundred rows and columns in a heatmap
## Any more and the program will need too much memory to run. Also, it's
## hard to distinguish things when there are so many rows/columns.

## Usually I pick the top 100 genes of interest and focus the heatmap on those
## There are, of course, a million ways to pick 100 "genes of interest".
## You might know what they are from a differential gene expression analysis,
## or you might just pick the top 100 with the most variation across the
## samples. Anything goes as long as it helps you gain insight into the data.
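Here's a minimal sketch of the "top 100 with the most variation" idea, using a hypothetical expression matrix (genes in rows, samples in columns); the matrix, its size, and the gene/sample names are made up for illustration:

```r
## hypothetical expression matrix: 1000 made-up "genes" x 6 "samples"
set.seed(1)
expr <- matrix(rnorm(1000 * 6), nrow=1000,
  dimnames=list(paste0("gene", 1:1000), paste0("sample", 1:6)))

## variance of each gene across the samples
gene.vars <- apply(expr, 1, var)

## keep the 100 genes with the most variation
top100 <- expr[order(gene.vars, decreasing=TRUE)[1:100], ]
dim(top100)   # 100 genes x 6 samples
```

You could then hand `top100` straight to heatmap().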

## first, let's make up some data...

sample1 <- c(rnorm(n=5, mean=10), rnorm(n=5, mean=0), rnorm(n=5, mean=20))
sample2 <- c(rnorm(n=5, mean=0), rnorm(n=5, mean=10), rnorm(n=5, mean=10))
sample3 <- c(rnorm(n=5, mean=10), rnorm(n=5, mean=0), rnorm(n=5, mean=20))
sample4 <- c(rnorm(n=5, mean=0), rnorm(n=5, mean=10), rnorm(n=5, mean=10))
sample5 <- c(rnorm(n=5, mean=10), rnorm(n=5, mean=0), rnorm(n=5, mean=20))
sample6 <- c(rnorm(n=5, mean=0), rnorm(n=5, mean=10), rnorm(n=5, mean=10))

data <- data.frame(sample1, sample2, sample3, sample4, sample5, sample6)
head(data)

## draw heatmap without clustering
heatmap(as.matrix(data), Rowv=NA, Colv=NA)

## draw heatmap with clustering and use default settings for everything
##
## The heatmap function defaults to using...
## The Euclidean distance to compare rows or columns.
## "Complete-linkage" to compare clusters.
##
## Also, heatmap defaults to scaling the rows so that each row has mean=0
## and standard deviation of 1. This is done to keep the range of colors
## needed to show differences to a reasonable amount. There are only so
## many shades of a color we can easily differentiate...
heatmap(as.matrix(data))

## draw heatmap with clustering and custom settings...
## We'll use the 'manhattan' distance to compare rows/columns
## And "centroid" to compare clusters.
## The differences are... subtle...
heatmap(as.matrix(data),
  distfun=function(x) dist(x, method="manhattan"),
  hclustfun=function(x) hclust(x, method="centroid"))

## Now do the same thing, but this time, scale columns instead of rows...
heatmap(as.matrix(data),
  distfun=function(x) dist(x, method="manhattan"),
  hclustfun=function(x) hclust(x, method="centroid"),
  scale="column")
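heatmap() does the distance and linkage calculations internally, so you only ever see the dendrograms it draws. If you also want the cluster memberships themselves, you can run the same steps by hand with dist(), hclust(), and cutree(). A small self-contained sketch using the same manhattan/centroid settings as above (the matrix is made up just for illustration):

```r
## a small made-up matrix: 10 "genes" x 3 "samples"
set.seed(1)
m <- matrix(rnorm(30), nrow=10)

d <- dist(m, method="manhattan")     # same distance as the heatmap above
hc <- hclust(d, method="centroid")   # same linkage as the heatmap above

## cut the tree into 3 clusters and get each row's cluster number
groups <- cutree(hc, k=3)
groups
```

plot(hc) would draw the same dendrogram that heatmap() puts along the side.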

## if you want to save a heatmap to a PDF file....
pdf(file="my_first_heatmap.pdf", width=5, height=6)
heatmap(as.matrix(data))
dev.off()

StatQuest: Fitting a curve to data, aka Lowess, aka Loess

As an added bonus, here’s some code that shows you how to use “lowess()” and “loess()” in R.

## first let's make a noisy gamma distribution plot...
x <- seq(from=0, to=20, by=0.1)
y.gamma <- dgamma(x, shape=2, scale=2)
y.gamma.scaled <- y.gamma * 100

y.norm <- vector(length=201)
for (i in 1:201) {
  y.norm[i] <- rnorm(n=1, mean=y.gamma.scaled[i], sd=2)
}
plot(x, y.norm, frame.plot=FALSE, xlab="", ylab="", col="#d95f0e", lwd=1.5)

## now fit a lowess curve to the data...
## by default "lowess()" fits a line in each window using
## 2/3 of the data points.
##
## the first parameter, y.norm ~ x, says that y.norm is being
## modeled by x, and the second parameter, f, is the fraction
## of points to use in each window. Here, we're using 1/5 of the
## data points in each window.
lo.fit.gamma <- lowess(y.norm ~ x, f=1/5)

plot(x, y.norm, frame.plot=FALSE, xlab="", ylab="", col="#d95f0e", lwd=1.5)
lines(x, lo.fit.gamma$y, col="black", lwd=3)
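Since f controls the window size, it's worth seeing its effect directly: a small f follows the noise, while the default 2/3 gives a much smoother curve. A quick sketch on freshly made-up data (seeded so it's reproducible):

```r
## make up noisy gamma data, as above, but with a seed
set.seed(7)
x <- seq(from=0, to=20, by=0.1)
y.norm <- rnorm(n=length(x), mean=dgamma(x, shape=2, scale=2) * 100, sd=2)

fit.wiggly <- lowess(y.norm ~ x, f=1/10)  # small window: chases the noise
fit.smooth <- lowess(y.norm ~ x, f=2/3)   # default window: much smoother

plot(x, y.norm, frame.plot=FALSE, xlab="", ylab="", col="#d95f0e", lwd=1.5)
lines(fit.wiggly, col="grey50", lwd=2)
lines(fit.smooth, col="black", lwd=3)
```

The grey curve wiggles with every bump in the data; the black one only tracks the overall gamma shape.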

## now fit a loess curve to the data...
## by default "loess()" fits a parabola in each window using
## 75% of the data points.
plx <- predict(loess(y.norm ~ x, span=1/5, degree=2, family="symmetric", iterations=4), se=TRUE)

plot(x, y.norm, frame.plot=FALSE, xlab="", ylab="", col="#d95f0e", lwd=1.5)
lines(x, plx$fit, col="black", lwd=3)

## now let's add a confidence interval to the loess() fit...
plot(x, y.norm, type="n", frame.plot=FALSE, xlab="", ylab="", col="#d95f0e", lwd=1.5)
polygon(c(x, rev(x)), c(plx$fit + qt(0.975,plx$df)*plx$se, rev(plx$fit - qt(0.975,plx$df)*plx$se)), col="#99999977")
points(x, y.norm, col="#d95f0e", lwd=1.5)
lines(x, plx$fit, col="black", lwd=3)