StatQuest: Linear Regression (aka GLMs, part 1)

Linear regression is the first in a series of videos I'm going to do about General Linear Models.

I also made a companion StatQuest that shows how to do linear regression in R.

Here’s the code from the video if you want to try it out yourself:

## Here's the data from the example:
mouse.data <- data.frame(
  weight=c(0.9, 1.8, 2.4, 3.5, 3.9, 4.4, 5.1, 5.6, 6.3),
  size=c(1.4, 2.6, 1.0, 3.7, 5.5, 3.2, 3.0, 4.9, 6.3))

mouse.data # print the data to the screen in a nice format

## plot an x/y scatter plot of the data
plot(mouse.data$weight, mouse.data$size)

## create a "linear model" - that is, do the regression
mouse.regression <- lm(size ~ weight, data=mouse.data)
## generate a summary of the regression
summary(mouse.regression)

## add the regression line to our x/y scatter plot
abline(mouse.regression, col="blue")
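Once you have the fitted model, you can also use it to predict the size of mice you haven't measured. Here's a small sketch using predict() (the new weights below are made up for illustration; the snippet refits the same model so it runs on its own):

```r
## refit the model from above so this snippet is self-contained
mouse.data <- data.frame(
  weight=c(0.9, 1.8, 2.4, 3.5, 3.9, 4.4, 5.1, 5.6, 6.3),
  size=c(1.4, 2.6, 1.0, 3.7, 5.5, 3.2, 3.0, 4.9, 6.3))
mouse.regression <- lm(size ~ weight, data=mouse.data)

## NOTE: these new weights are made up for illustration
new.mice <- data.frame(weight=c(2.0, 4.0))

## predict() plugs the new weights into the fitted line
predict(mouse.regression, newdata=new.mice)
```

The data frame passed to newdata must have a column named after the predictor in the formula (here, weight).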

StatQuest: K-means clustering

## demo of k-means clustering...

## Step 1: make up some data
x <- rbind(
  matrix(rnorm(100, mean=0, sd=0.3), ncol=2),  # cluster 1, centered on (0, 0)
  matrix(rnorm(100, mean=1, sd=0.3), ncol=2),  # cluster 2, centered on (1, 1)
  matrix(c(rnorm(50, mean=1, sd=0.3),          # cluster 3, centered on (1, 0)
           rnorm(50, mean=0, sd=0.3)), ncol=2))
colnames(x) <- c("x", "y")

## Step 2: show the data without clustering
plot(x)

## Step 3: show the data with the known clusters (this is just so we
## can see how well k-means clustering recreates the original clusters we
## created in step 1)
colors <- as.factor(c(
  rep("c1", times=50),
  rep("c2", times=50),
  rep("c3", times=50)))
plot(x, col=colors)

## Step 4: cluster the data
## NOTE: nstart=25, so kmeans() will run the clustering from 25 different
## random starting points and return the best result (the one with the
## lowest total within-cluster sum of squares).
(cl <- kmeans(x, centers=3, nstart=25))

## Step 5: plot the data, coloring the points by the clusters kmeans() found
plot(x, col=cl$cluster)