StatQuest: Hierarchical Clustering

[sourcecode language=”R”]
## NOTE: You can only have a few hundred rows and columns in a heatmap
## Any more and the program will need to much memory to run. Also, it’s
## hard to distinguish things when there are so many rows/columns.

## Usually I pick the top 100 genes of interest and focus the heatmap on those
## There are, of course, a million ways to pick 100 "genes of interest".
## You might know what they are by doing differential gene expression
## you you might just pick the top 100 with the most variation accross the
## samples. Anything goes as long as it helps you gain insight into the data.

## first lets "make up some data"..

sample1 <- c(rnorm(n=5, mean=10), rnorm(n=5, mean=0), rnorm(n=5, mean=20))
sample2 <- c(rnorm(n=5, mean=0), rnorm(n=5, mean=10), rnorm(n=5, mean=10))
sample3 <- c(rnorm(n=5, mean=10), rnorm(n=5, mean=0), rnorm(n=5, mean=20))
sample4 <- c(rnorm(n=5, mean=0), rnorm(n=5, mean=10), rnorm(n=5, mean=10))
sample5 <- c(rnorm(n=5, mean=10), rnorm(n=5, mean=0), rnorm(n=5, mean=20))
sample6 <- c(rnorm(n=5, mean=0), rnorm(n=5, mean=10), rnorm(n=5, mean=10))

data <- data.frame(sample1, sample2, sample3, sample4, sample5, sample6)
head(data)

## draw heatmap without clustering
heatmap(as.matrix(data), Rowv=NA, Colv=NA)

## draw heatmap with clustering and use default settings for everything
##
## The heatmap function defaults to using…
## The Eucldian distance to compare rows or columns.
## "Complete-linkage" to compare clusters.
##
## Also, heatmap defaults to scaling the rows so that each row has mean=0
## and standard deviation of 1. This is done to keep the range of colors
## needed to show differences to a reasonable amount. There are only so
## many shades of a color we can easily differenitate…
heatmap(as.matrix(data))

## draw heatmap with clustering and custom settings…
## We’ll use the ‘manhattan’ distance to compare rows/columns
## And "centroid" to compare clusters.
## The differences are… subtle…
heatmap(as.matrix(data),
distfun=function(x) dist(x, method="manhattan"),
hclustfun=function(x) hclust(x, method="centroid"))

## Now do the same thing, but this time, scale columns instead of rows…
heatmap(as.matrix(data),
distfun=function(x) dist(x, method="manhattan"),
hclustfun=function(x) hclust(x, method="centroid"),
scale="column")

## if you want to save a heatmap to a PDF file….
pdf(file="my_first_heatmap.pdf", width=5, height=6)
heatmap(as.matrix(data))
dev.off()
[/sourcecode]

2 thoughts on “StatQuest: Hierarchical Clustering

  1. I am somewhat surprised and vexed that, in doing clustering, there are so many choices in terms of calculating the distance and choosing the method of clustering. It’s very… empirical… I somehow expected that certain parameters would fit best certain scenarios/ types of problems . Has anybody checked systematically the various permutations of parameters for “best fitting” of a problem (say gene expression). Perhaps, the scenarios need to be better defined in order to find this correspondence.
    It seems that there are cases in which a certain combination of parameters fit best the essence of the system , according to the researcher’s experience. Maybe scrutinizing such cases can reveal a certain feature of the system that is not encountered in seemingly similar cases for which the combination doesn’t work so well…? Maybe some “constant” a la theorems encountered in physics can do the trick? 😉

    • Unfortunately, no matter how you do it, there’s a large amount of “art” required to do “science”. Finding the “best” method is part of that art. However, this is what I say, “These things (heatmaps, PCA vs t-SNE vs MDS etc.) are just means of creating new testable hypotheses. So try them all and see if they lead to testable hypotheses. If so… test them!

Leave a Reply

Your email address will not be published. Required fields are marked *