StatQuest: Logistic Regression in R

Here’s a link to the source code on the StatQuest GitHub.

25 thoughts on “StatQuest: Logistic Regression in R”

    • Thanks for catching that. There was something odd with including the first line of the output from the summary() command. Once I deleted that line (which was commented out), the original code came back. Strange! But I’m very grateful you spotted the error before it became a problem for other people. Thanks!

  1. Hi Josh,
    Excellent videos. Your explanation has helped me grasp how to perform logistic regression in R. I was wondering whether you could demonstrate how to put the data in a bar graph with 95% confidence intervals, as is done in academic papers.

    • I can’t recreate that error. Are you sure you created “predicted.data” correctly?
      predicted.data <- data.frame(
      probability.of.hd=logistic$fitted.values,
      hd=data$hd)

      • Josh, when I run the following script, there is an error:

        > predicted.data <- data.frame(
        + probability.of.hd=logistic$fitted.values,
        + hd=data$hd)
        Error in data.frame(probability.of.hd = logistic$fitted.values, hd = data$hd) :
        arguments imply differing number of rows: 297, 303

        How can we resolve it?
        Thanks!

  2. I can’t recreate that error. Can you try running the code from the start and see if it happens again? Your error suggests that you might have skipped the line where we removed samples with “NA” in them:
    data <- data[!(is.na(data$ca) | is.na(data$thal)),]
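The mismatch in the error above (297 vs. 303) is what you would expect if glm() dropped the rows containing NA while the data frame still held them. Here is a minimal sketch on made-up data; the `toy` data frame and its values are hypothetical, and it assumes R's default `na.action` of `na.omit`:

```r
# Hypothetical toy data: 20 samples, 2 of which have NA in "ca"
toy <- data.frame(
  hd  = factor(rep(c("Healthy", "Unhealthy"), times = 10)),
  age = seq(40, 78, by = 2),
  ca  = c(NA, 0, 1, 2, 0, 1, NA, 3, 0, 1, 2, 0, 1, 0, 3, 2, 1, 0, 2, 1)
)

# glm() quietly drops the rows with NA, so the model is fit to fewer
# rows than the data frame contains
fit.with.na <- glm(hd ~ age + ca, data = toy, family = "binomial")
n.fitted <- length(fit.with.na$fitted.values)
n.rows   <- nrow(toy)

# Combining the two now fails with a "differing number of rows" error;
# tryCatch() captures the message instead of stopping the script
mismatch <- tryCatch(
  data.frame(probability.of.hd = fit.with.na$fitted.values, hd = toy$hd),
  error = function(e) conditionMessage(e)
)

# Removing the NA rows first keeps the data and the fitted values aligned
toy.clean <- toy[!is.na(toy$ca), ]
fit.clean <- glm(hd ~ age + ca, data = toy.clean, family = "binomial")
predicted.data <- data.frame(
  probability.of.hd = fit.clean$fitted.values,
  hd = toy.clean$hd
)
```

Comparing `n.fitted` with `n.rows` is a quick way to check whether the NA-removal step was skipped before building `predicted.data`.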

      • Hi Josh,

        The ggplot works well!

        A) May I know if I have to remove all NAs in the data set, or only in some of the variables?

        B) Do we have to do a randomization check before we remove the NA? We have more than 10,000 samples with NA.

        C) For the formula:
        predicted.data <- data.frame(
        probability.of.hd=logistic$fitted.values,
        hd=data$hd)

        Can I check with you, if I were to apply this to my model, whether the following are correct?
        A) logistic: my model name
        B) data: mydata (my data set)
        C) hd: my response Y
        D) fitted.values: keep it as fitted.values

        D) Lastly, I’m trying to do a confusion matrix of the variables, e.g. Sex & hd. Besides using xtabs, how do we check the log odds ratio & p-value?

        Please share with us the R script to find the better variables.

        Thanks for your kind assistance Josh! 👍🏼😊
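For the last question, one possible approach is to fit a logistic regression with a single predictor and read the log odds ratio and its p-value from the coefficient table. This is only a sketch on made-up counts (the `toy` data frame is hypothetical and is not the heart disease data set), using just `xtabs()`, `glm()`, and `summary()` from base R:

```r
# Hypothetical toy data standing in for sex and hd
toy <- data.frame(
  sex = factor(c(rep("F", 40), rep("M", 60))),
  hd  = factor(c(rep("Healthy", 30), rep("Unhealthy", 10),
                 rep("Healthy", 25), rep("Unhealthy", 35)))
)

# Table of hd by sex, as with xtabs in the question
counts <- xtabs(~ hd + sex, data = toy)

# Logistic regression with sex as the only predictor.
# "Healthy" is the first factor level, so the model predicts "Unhealthy".
fit.sex <- glm(hd ~ sex, data = toy, family = "binomial")
coefs <- summary(fit.sex)$coefficients

# The "sexM" row holds the log odds ratio (M relative to F) and its p-value
log.odds.ratio <- coefs["sexM", "Estimate"]
p.value        <- coefs["sexM", "Pr(>|z|)"]

# Cross-check: the same log odds ratio computed by hand from the table
odds.m  <- counts["Unhealthy", "M"] / counts["Healthy", "M"]
odds.f  <- counts["Unhealthy", "F"] / counts["Healthy", "F"]
by.hand <- log(odds.m / odds.f)
```

With a single two-level predictor, the fitted coefficient should agree with the value computed by hand from the table, which is a useful sanity check before moving on to models with more variables.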

  3. Hi Josh! Thank you for the amazing video, it helped clarify quite a few things!
    I have a question: the ggplot of the model I made doesn’t have an S shape but rather it looks like a half arch, starting from the bottom left corner and ending in the top left. Any suggestions on why that happened? In general the model has quite a low R^2 (0.2357887) but also a very small p-val (7.886781e-08).

    Thanks and cheers,
    Ginevra

  4. Hi Josh, This video is exactly what I need right now! Unfortunately when I try and access the data it says page not found. Is there another way to access the data so I can follow along?
    Thanks!

    • Unfortunately WordPress, which hosts StatQuest, will not allow me to upload text files or CSV files, so I can’t put the data up here. However, if you try again, I bet you can get the data now. If not, try again in an hour.
