Here it is, folks! By popular demand, a StatQuest on linear discriminant analysis (LDA)! And just because you are awesome, you can also download the powerpoint slides.

Also, because you asked for it, here’s some sample R code that shows you how to get LDA working in R.

require(MASS) ## if you don't have the "MASS" package, install it with:
## install.packages("MASS")
require(ggplot2) ## if you don't have the ggplot2 package, install it with:
## install.packages("ggplot2")
## Load a sample data set
data("iris")
my.data <- iris
## look at the first 6 rows
head(my.data)
## create the lda model
model <- lda(formula = Species ~ ., data = my.data)
## get the x,y coordinates for the LDA plot
data.lda.values <- predict(model)
## create a dataframe that has all the info we need to draw a graph
plot.data <- data.frame(X=data.lda.values$x[,1], Y=data.lda.values$x[,2], Species=my.data$Species)
head(plot.data)
## draw a graph using ggplot2
p <- ggplot(data=plot.data, aes(x=X, y=Y)) +
geom_point(aes(color=Species)) +
theme_bw()
p
## you can save the graph with the following command:
## ggsave(file="my_graph.pdf")

If all went well, you should get a graph that looks like this:
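If you also want a quick sense of how well the model classifies the flowers it was trained on, you can cross-tabulate the LDA's predictions against the true species. Here's a minimal sketch (it refits the same model as above so it runs on its own):

```r
library(MASS) ## for lda()

## refit the same model as above on the iris data
data(iris)
model <- lda(formula = Species ~ ., data = iris)

## cross-tabulate the predicted species against the true species;
## the diagonal of the table counts the correct classifications
table(Predicted = predict(model)$class, Actual = iris$Species)
```

Each column of the table shows how flowers of one true species were assigned, so big numbers on the diagonal (and small numbers off it) mean good separation.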


Hello friend,

Thank you for your helpful video. Could you please provide me the code to run this analysis in R?

Best regards

Amber


Hi Joshua! Thanks a lot for your very helpful video!!

I have a little question for you…

I was thinking about how to find the variables that mainly contribute to the formation of the new axes. How do you concretely do that?

How would you correlate LD1 (coefficients of linear discriminants) with the variables?

Thanks in advance,

best

Madeleine


Madeleine,

I use R, so here’s how to do it in R. First do the LDA…

library(MASS) ## Load the "MASS" package (which contains the lda() function)

data(iris) ## load an example dataset

head(iris, 3) ## look at the first 3 rows

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

lda.results <- lda(formula = Species ~ ., data = iris) # now do LDA

lda.results$scaling # now look at the coefficients of the linear discriminants

                    LD1         LD2
Sepal.Length  0.8293776  0.02410215
Sepal.Width   1.5344731  2.16452123
Petal.Length -2.2012117 -0.93192121
Petal.Width  -2.8104603  2.83918785

… roughly speaking, the absolute value of the "scaling" values will tell you which variables were the most important for each linear discriminant.
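For example, to rank the variables by how much they contribute to LD1, you can sort the absolute values of the first column of `scaling`. A minimal sketch (it refits the model so the snippet runs on its own):

```r
library(MASS) ## for lda()

data(iris)
lda.results <- lda(formula = Species ~ ., data = iris)

## sort the variables by the absolute value of their LD1 coefficient,
## largest (most important for LD1) first
sort(abs(lda.results$scaling[, "LD1"]), decreasing = TRUE)
```

With the iris data, Petal.Width comes out on top and Sepal.Length at the bottom, matching the scaling table above.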

If you want to know how much variation each linear discriminant accounts for…

lda.results$svd^2/sum(lda.results$svd^2)


Hi Josh,

thanks for your answer!

so the higher the absolute value, the more the variable contributes to the separation of the groups/categories?

Do you think one can filter variables by keeping the ones with the highest "scaling" values and recomputing the LDA? …or can LDA not really be used for variable selection?

sorry, I’m wandering a bit off.

thanks in advance

best,

Madeleine


Yes, I think you can use LDA iteratively to filter out variables that are not helpful. Just like when you do a regression with a ton of variables and then leave some out and see if the sums of squares are still pretty much in your favor.
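As a rough sketch of that idea: drop the variable with the smallest absolute LD1 coefficient (for iris that is Sepal.Length, per the scaling table above), refit, and compare how well the two models classify the training data:

```r
library(MASS) ## for lda()

data(iris)
full.lda <- lda(formula = Species ~ ., data = iris)

## drop Sepal.Length, the variable with the smallest |LD1| coefficient
reduced.lda <- lda(formula = Species ~ Sepal.Width + Petal.Length + Petal.Width,
                   data = iris)

## compare training-set accuracy of the two models
mean(predict(full.lda)$class == iris$Species)
mean(predict(reduced.lda)$class == iris$Species)
```

If the reduced model classifies nearly as well as the full one, the dropped variable probably wasn't contributing much to the separation.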


Great!

Thank you very much :)

M
