StatQuest: Hierarchical Clustering

[sourcecode language=”R”]
## NOTE: You can only have a few hundred rows and columns in a heatmap
## Any more and the program will need to much memory to run. Also, it’s
## hard to distinguish things when there are so many rows/columns.

## Usually I pick the top 100 genes of interest and focus the heatmap on those
## There are, of course, a million ways to pick 100 "genes of interest".
## You might know what they are by doing differential gene expression
## you you might just pick the top 100 with the most variation accross the
## samples. Anything goes as long as it helps you gain insight into the data.

## first lets "make up some data"..

sample1 <- c(rnorm(n=5, mean=10), rnorm(n=5, mean=0), rnorm(n=5, mean=20))
sample2 <- c(rnorm(n=5, mean=0), rnorm(n=5, mean=10), rnorm(n=5, mean=10))
sample3 <- c(rnorm(n=5, mean=10), rnorm(n=5, mean=0), rnorm(n=5, mean=20))
sample4 <- c(rnorm(n=5, mean=0), rnorm(n=5, mean=10), rnorm(n=5, mean=10))
sample5 <- c(rnorm(n=5, mean=10), rnorm(n=5, mean=0), rnorm(n=5, mean=20))
sample6 <- c(rnorm(n=5, mean=0), rnorm(n=5, mean=10), rnorm(n=5, mean=10))

data <- data.frame(sample1, sample2, sample3, sample4, sample5, sample6)
head(data)

## draw heatmap without clustering
heatmap(as.matrix(data), Rowv=NA, Colv=NA)

## draw heatmap with clustering and use default settings for everything
##
## The heatmap function defaults to using…
## The Eucldian distance to compare rows or columns.
## "Complete-linkage" to compare clusters.
##
## Also, heatmap defaults to scaling the rows so that each row has mean=0
## and standard deviation of 1. This is done to keep the range of colors
## needed to show differences to a reasonable amount. There are only so
## many shades of a color we can easily differenitate…
heatmap(as.matrix(data))

## draw heatmap with clustering and custom settings…
## We’ll use the ‘manhattan’ distance to compare rows/columns
## And "centroid" to compare clusters.
## The differences are… subtle…
heatmap(as.matrix(data),
distfun=function(x) dist(x, method="manhattan"),
hclustfun=function(x) hclust(x, method="centroid"))

## Now do the same thing, but this time, scale columns instead of rows…
heatmap(as.matrix(data),
distfun=function(x) dist(x, method="manhattan"),
hclustfun=function(x) hclust(x, method="centroid"),
scale="column")

## if you want to save a heatmap to a PDF file….
pdf(file="my_first_heatmap.pdf", width=5, height=6)
heatmap(as.matrix(data))
dev.off()
[/sourcecode]

2 thoughts on “StatQuest: Hierarchical Clustering”

Ed Nedea

December 4, 2017 at 8:22 pm

I am somewhat surprised and vexed that, in doing clustering, there are so many choices in terms of calculating the distance and choosing the method of clustering. It’s very… empirical… I somehow expected that certain parameters would fit best certain scenarios/ types of problems . Has anybody checked systematically the various permutations of parameters for “best fitting” of a problem (say gene expression). Perhaps, the scenarios need to be better defined in order to find this correspondence.
It seems that there are cases in which a certain combination of parameters fit best the essence of the system , according to the researcher’s experience. Maybe scrutinizing such cases can reveal a certain feature of the system that is not encountered in seemingly similar cases for which the combination doesn’t work so well…? Maybe some “constant” a la theorems encountered in physics can do the trick? 😉

- Josh
  
  December 4, 2017 at 8:26 pm
  
  Unfortunately, no matter how you do it, there’s a large amount of “art” required to do “science”. Finding the “best” method is part of that art. However, this is what I say, “These things (heatmaps, PCA vs t-SNE vs MDS etc.) are just means of creating new testable hypotheses. So try them all and see if they lead to testable hypotheses. If so… test them!

StatQuest!!!

An epic journey through statistics and machine learning

StatQuest: Hierarchical Clustering

2 thoughts on “StatQuest: Hierarchical Clustering”

Leave a Reply Cancel reply