Visualizing Regression Analysis Using ggplot2: A Comprehensive Guide

Understanding Regression Analysis and Its Visualization with ggplot2

Regression analysis is a statistical method used to model the relationship between two or more variables. In this article, we’ll delve into regression analysis, its types, and how to visualize it using ggplot2.

What is Regression Analysis?

Regression analysis is a statistical technique for describing the relationship between one dependent variable (the target) and one or more independent variables (the predictors). The goal is to estimate an equation that predicts the value of the target variable from the predictor variables.

Linear regression comes in two basic forms, distinguished by the number of predictors:

  • Simple Linear Regression: This form involves exactly one independent variable. It's used when we want to understand how a single independent variable affects the dependent variable.
  • Multiple Linear Regression: This form involves more than one independent variable. It's used when we want to understand how several independent variables together affect the dependent variable.
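
The difference between the two forms shows up directly in the model formula passed to lm. The following is a minimal sketch on simulated data; the variable names x1 and x2, the data frame sim, and the coefficient values are assumptions chosen purely for illustration.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated data: y depends on two predictors plus noise (illustrative values)
n  <- 200
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 5)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n, sd = 1)
sim <- data.frame(y, x1, x2)

# Simple linear regression: one predictor
fit_simple <- lm(y ~ x1, data = sim)

# Multiple linear regression: two predictors
fit_multiple <- lm(y ~ x1 + x2, data = sim)

coef(fit_simple)    # intercept and one slope
coef(fit_multiple)  # intercept and two slopes
```

Only the right-hand side of the formula changes between the two fits; everything else about the call to lm stays the same.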

Types of Regression

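Beyond the number of predictors, regression models also differ in how they are fitted and in what they assume about the data:

  • Ordinary Least Squares (OLS) Regression: OLS is the standard estimation method for simple and multiple linear regression. It chooses the coefficients that minimize the sum of squared residuals.
  • Generalized Linear Model (GLM): A GLM extends linear regression to responses that are not normally distributed, such as counts or binary outcomes, by combining a link function with a suitable response distribution.
  • Robust Regression: Robust regression reduces the influence of outliers and heavy-tailed errors, which can distort an OLS fit.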
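
To make the three approaches concrete, here is a hedged sketch that fits each of them in R. It assumes the MASS package (distributed with standard R installations) for rlm; the simulated data, the injected outlier, and the Poisson example are illustrative assumptions, not part of the original analysis.

```r
library(MASS)  # provides rlm() for robust regression

set.seed(1)
x <- 1:50
y <- 2 + 3 * x + rnorm(50, sd = 5)
y[50] <- 0                          # inject one gross outlier
d <- data.frame(x, y)

fit_ols    <- lm(y ~ x, data = d)   # ordinary least squares
fit_robust <- rlm(y ~ x, data = d)  # robust M-estimation (Huber weights by default)

# The outlier pulls the OLS slope away from the true value of 3;
# the robust fit is typically affected much less
coef(fit_ols)["x"]
coef(fit_robust)["x"]

# GLM sketch: Poisson regression for a count response with a log link
counts  <- rpois(50, lambda = exp(0.5 + 0.03 * x))
fit_glm <- glm(counts ~ x, family = poisson(link = "log"))
coef(fit_glm)
```

The rest of this article uses plain OLS via lm, which is also what geom_smooth(method = "lm") fits.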

Visualizing Regression with ggplot2

ggplot2 is a widely used R package for building plots layer by layer. In this section, we'll explore how to visualize a regression with it.

The Problem with the Given Code

The code we start from uses geom_smooth to draw a fitted line through the scatter plot. However, it doesn't display the regression equation or the R^2 value on the graph.
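
A starting point of this kind might look like the sketch below. The original code is not reproduced in this article, so the simulated data, the seed, and the object name p_base are assumptions made for illustration.

```r
library(ggplot2)

set.seed(1)  # reproducible noise

# Assumed example data: a linear trend with substantial noise
df <- data.frame(x = 1:100)
df$y <- 2 + 3 * df$x + rnorm(100, sd = 40)

# Scatter plot with a fitted straight line, but no equation or R^2 label
p_base <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x)
p_base
```

The plot shows the trend, but a reader cannot tell the slope, the intercept, or how well the line fits without a text label.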

To address this issue, we can create our own function that calculates the regression line equation and displays it on the graph.

Creating a Function for Regression Line Equation

We'll start by creating a function lm_eqn that takes a data frame with columns named x and y. The function fits the linear regression model with lm, substitutes the formatted coefficients and R^2 value into a plotmath expression, and returns that expression as a character string.

Code

# GET EQUATION AND R-SQUARED AS STRING
# SOURCE: https://groups.google.com/forum/#!topic/ggplot2/1TgH-kG5XMA

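lm_eqn <- function(df) {
    # Fit y on x; df must contain columns named x and y
    m <- lm(y ~ x, df)
    # Fill the plotmath template with the formatted intercept, slope, and R^2
    eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,
         list(a = format(unname(coef(m)[1]), digits = 2),
              b = format(unname(coef(m)[2]), digits = 2),
             r2 = format(summary(m)$r.squared, digits = 3)))
    # Return the expression as a string so it can be parsed when plotted
    as.character(as.expression(eq))
}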

Explanation

  • lm(y ~ x, df) fits the linear regression model of y on x.
  • substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2, ... fills the placeholders a, b, and r2 in a plotmath expression. Nothing is calculated here: in plotmath, %.% renders a centered dot (a multiplication sign) between b and x, == renders an equals sign, and ~ inserts a space.
  • The list(a = format(unname(coef(m)[1]), digits = 2), b = format(unname(coef(m)[2]), digits = 2), r2 = format(summary(m)$r.squared, digits = 3)) part extracts the intercept a and the slope b from the linear regression model, along with the R^2 value, and formats each one as text with a limited number of significant digits.
  • as.character(as.expression(eq)) converts the expression to a character string, which can be parsed back into an expression when the label is drawn.
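
To see what the function actually returns, it helps to print the string and parse it by hand. The sketch below repeats lm_eqn so that it runs on its own; the data frame toy and the seed are hypothetical, and parse(text = ...) is used only as a rough stand-in for what happens to the label when it is drawn.

```r
# lm_eqn as defined above, repeated so this sketch runs on its own
lm_eqn <- function(df) {
  m <- lm(y ~ x, df)
  eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,
                   list(a = format(unname(coef(m)[1]), digits = 2),
                        b = format(unname(coef(m)[2]), digits = 2),
                        r2 = format(summary(m)$r.squared, digits = 3)))
  as.character(as.expression(eq))
}

set.seed(1)
toy <- data.frame(x = 1:100)                 # hypothetical example data
toy$y <- 2 + 3 * toy$x + rnorm(100, sd = 40)

label <- lm_eqn(toy)
print(label)                    # a single string of plotmath source code

parsed <- parse(text = label)   # turn the string back into an expression
class(parsed)                   # "expression"
```

Because the coefficients are inserted as formatted text, they appear as quoted strings inside the label, which preserves the chosen number of digits when the expression is rendered.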

Adding Regression Line Equation and R^2 Value

Now that we have our function lm_eqn, let’s modify the original code to display the regression line equation and R^2 value on the graph.

Code

library(ggplot2)

df <- data.frame(x = c(1:100))
df$y <- 2 + 3 * df$x + rnorm(100, sd = 40)
p <- ggplot(data = df, aes(x = x, y = y)) +
           geom_smooth(method = "lm", se=FALSE, color="black", formula = y ~ x) +
           geom_point() +
           # annotate() draws the label once; geom_text() would redraw it for every row of df
           annotate("text", x = 25, y = 300, label = lm_eqn(df), parse = TRUE)
p

Explanation

  • label = lm_eqn(df) adds the regression line equation to the graph using our custom function.
  • parse = TRUE tells ggplot2 to parse the label string as a plotmath expression, so it is rendered as formatted math rather than as literal text.
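
Hard-coded coordinates such as x = 25, y = 300 only suit this particular data set. One possible alternative, sketched below, pins the label to the top-left corner of the panel by passing infinite coordinates to annotate. The hjust and vjust offsets, the object names, and the shorter R^2-only label are assumptions chosen to keep the example compact.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = 1:100)
df$y <- 2 + 3 * df$x + rnorm(100, sd = 40)

# Build a plotmath string for R^2 only (a shorter label than lm_eqn produces)
r2 <- summary(lm(y ~ x, df))$r.squared
r2_label <- sprintf("italic(r)^2 == %.3f", r2)

p_corner <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "black", formula = y ~ x) +
  # x = -Inf, y = Inf pins the text to the top-left corner of the panel;
  # hjust and vjust nudge it inward so it is not clipped by the panel border
  annotate("text", x = -Inf, y = Inf, hjust = -0.2, vjust = 2,
           label = r2_label, parse = TRUE)
p_corner
```

The same positioning works with the full label from lm_eqn; only the label argument changes.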

Conclusion

In this article, we explored regression analysis and its visualization with ggplot2. We created a custom function lm_eqn that calculates the regression line equation and displays it on the graph. With this function, you can easily add the regression line equation and R^2 value to your scatter plots using ggplot2.

Additional Example: Calculating Coefficients

If you want to work with the coefficients directly instead of going through the lm_eqn function, you can extract them from the fitted model:

# Extract Coefficients and Plot the Fitted Line

p <- ggplot(data = df, aes(x = x, y = y)) +
       geom_smooth(method = "lm", se=FALSE, color="black", formula = y ~ x) +
       geom_point()

# Extract Coefficients from Linear Regression Model
m <- lm(y ~ x, df)
a <- coef(m)[1]
b <- coef(m)[2]

# Calculate Predicted Value of Y from the Intercept and Slope,
# stored as a column so the layer below reads it from the data frame
df$y_pred <- a + b * df$x

# Plot Predicted Value of Y
p +
  geom_line(data = df, aes(y = y_pred), color = "red") +
  geom_point(color = "black")

This code extracts the intercept and slope from the fitted linear regression model rather than computing them by hand. It then uses them to calculate the predicted value of y for each x, which is drawn as a red line on top of the line from geom_smooth.
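
The same predicted values can be obtained from the model object itself. The following sketch, which regenerates the example data under an assumed seed, checks that the manual formula and predict agree; the names manual and from_predict are illustrative.

```r
set.seed(1)
df <- data.frame(x = 1:100)
df$y <- 2 + 3 * df$x + rnorm(100, sd = 40)

m <- lm(y ~ x, df)

# Manual prediction from the extracted intercept and slope
manual <- unname(coef(m)[1] + coef(m)[2] * df$x)

# Equivalent prediction using the model object directly
from_predict <- unname(predict(m))

all.equal(manual, from_predict)  # TRUE

# predict() also handles x values that were not in the original data
predict(m, newdata = data.frame(x = c(150, 200)))
```

Using predict avoids retyping the model equation, which matters once the formula has more than one term.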

Additional Example: Using Base R Graphics

If you prefer base R graphics to ggplot2, you can display the regression line equation and R^2 value with plot, abline, and text. No additional package is required:

# Display Regression Line Equation and R-Squared Value Using Base R Graphics

plot(df$x, df$y, pch = 1,
     main = "Scatter Plot with Regression Line",
     xlab = "X", ylab = "Y")

abline(lm(y ~ x, df), col = "red")
# parse() turns the string from lm_eqn into a plotmath expression for text()
text(25, 300, parse(text = lm_eqn(df)), adj = 0.5)
legend("topright", legend = c("Data Points", "Regression Line"),
       col = c("black", "red"), pch = c(1, NA), lty = c(NA, 1))

This code uses base R graphics to display the scatter plot with the regression line, its equation, and the R^2 value.

By using these functions, you can easily add regression line equations and R^2 values to your scatter plots without having to calculate coefficients manually.


Last modified on 2024-08-17