[sourcecode language="R"]

## NOTE: You can only have a few hundred rows and columns in a heatmap.
## Any more and the program will need too much memory to run. Also, it's
## hard to distinguish things when there are so many rows/columns.
## Usually I pick the top 100 genes of interest and focus the heatmap on those.
## There are, of course, a million ways to pick 100 "genes of interest".
## You might know what they are by doing differential gene expression,
## or you might just pick the top 100 with the most variation across the
## samples. Anything goes as long as it helps you gain insight into the data.

## First, let's make up some data...

sample1 <- c(rnorm(n=5, mean=10), rnorm(n=5, mean=0), rnorm(n=5, mean=20))
sample2 <- c(rnorm(n=5, mean=0), rnorm(n=5, mean=10), rnorm(n=5, mean=10))
sample3 <- c(rnorm(n=5, mean=10), rnorm(n=5, mean=0), rnorm(n=5, mean=20))
sample4 <- c(rnorm(n=5, mean=0), rnorm(n=5, mean=10), rnorm(n=5, mean=10))
sample5 <- c(rnorm(n=5, mean=10), rnorm(n=5, mean=0), rnorm(n=5, mean=20))
sample6 <- c(rnorm(n=5, mean=0), rnorm(n=5, mean=10), rnorm(n=5, mean=10))

data <- data.frame(sample1, sample2, sample3, sample4, sample5, sample6)
head(data)

## Draw a heatmap without clustering.

heatmap(as.matrix(data), Rowv=NA, Colv=NA)

## Draw a heatmap with clustering, using the default settings for everything.
##
## The heatmap function defaults to using...
## Euclidean distance to compare rows or columns.
## "Complete" linkage to compare clusters.
##
## Also, heatmap defaults to scaling the rows so that each row has mean = 0
## and standard deviation = 1. This is done to keep the range of colors
## needed to show differences reasonable. There are only so
## many shades of a color we can easily differentiate...

heatmap(as.matrix(data))
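## (Added sketch, not in the original post: to see what that row
## scaling does under the hood, we can do it ourselves with scale().
## scale() works column-wise, so we transpose, scale, then transpose back.)
scaled_data <- t(scale(t(as.matrix(data))))
round(rowMeans(scaled_data), digits=10) ## each row now has mean 0
apply(scaled_data, 1, sd)               ## ...and standard deviation 1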

## Draw a heatmap with clustering and custom settings...
## We'll use the "manhattan" distance to compare rows/columns
## and "centroid" linkage to compare clusters.
## The differences are... subtle...

heatmap(as.matrix(data),
        distfun=function(x) dist(x, method="manhattan"),
        hclustfun=function(x) hclust(x, method="centroid"))

## Now do the same thing, but this time scale columns instead of rows...

heatmap(as.matrix(data),
        distfun=function(x) dist(x, method="manhattan"),
        hclustfun=function(x) hclust(x, method="centroid"),
        scale="column")

## If you want to save a heatmap to a PDF file...

pdf(file="my_first_heatmap.pdf", width=5, height=6)
heatmap(as.matrix(data))

dev.off()
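## (Added sketch, not in the original post: grDevices can also write
## other bitmap formats; for example a PNG, where width and height
## are given in pixels rather than inches.)
png(file="my_first_heatmap.png", width=500, height=600)
heatmap(as.matrix(data))
dev.off()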

[/sourcecode]

I am somewhat surprised and vexed that, in doing clustering, there are so many choices for calculating the distance and choosing the clustering method. It's very... empirical... I somehow expected that certain parameters would best fit certain scenarios/types of problems. Has anybody systematically checked the various permutations of parameters for "best fit" on a given problem (say, gene expression)? Perhaps the scenarios need to be better defined in order to find this correspondence.
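One crude way to poke at this empirically (a sketch of my own, not a validated benchmark): the cophenetic correlation measures how faithfully a dendrogram preserves the original pairwise distances, so you can loop over distance/linkage combinations and compare the scores. It is only one criterion among many, and a high score does not by itself mean the clustering is biologically meaningful.

[sourcecode language="R"]
## Compare distance/linkage combinations by cophenetic correlation.
## Higher means the tree better preserves the original distances.
set.seed(42)
m <- matrix(rnorm(150), nrow=30) ## made-up data: 30 "genes" x 5 samples
for (d in c("euclidean", "manhattan")) {
  for (h in c("complete", "average", "centroid")) {
    dd <- dist(m, method=d)
    hc <- hclust(dd, method=h)
    cat(d, "+", h, ":", round(cor(cophenetic(hc), dd), 3), "\n")
  }
}
[/sourcecode]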

It seems that there are cases in which a certain combination of parameters best fits the essence of the system, according to the researcher's experience. Maybe scrutinizing such cases can reveal a certain feature of the system that is not encountered in seemingly similar cases where the combination doesn't work so well...? Maybe some "constant" a la the theorems encountered in physics can do the trick? 😉

Unfortunately, no matter how you do it, there's a large amount of "art" required to do "science". Finding the "best" method is part of that art. However, this is what I say: "These things (heatmaps, PCA vs. t-SNE vs. MDS, etc.) are just means of creating new testable hypotheses. So try them all and see if they lead to testable hypotheses. If so... test them!"