Understanding Regression Analysis and Its Visualization with ggplot2
Regression analysis is a statistical method used to model the relationship between two or more variables. In this article, we’ll delve into regression analysis, its types, and how to visualize it using ggplot2.
What is Regression Analysis?
Regression analysis is a statistical technique that helps us understand the relationship between one dependent variable (target) and one or more independent variables (predictors). The goal of regression analysis is to create an equation that can predict the value of the target variable based on the predictor variables.
Linear regression comes in two basic forms:
- Simple Linear Regression: This type of regression involves only one independent variable. It’s used when we want to understand how a single independent variable affects the dependent variable.
- Multiple Linear Regression: This type of regression involves more than one independent variable. It’s used when we want to understand how multiple independent variables together affect the dependent variable.
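As a rough illustration of the two cases, the sketch below fits both kinds of model with R's lm(). The data frame sim, its columns x1, x2, and y, and the coefficients used to simulate the data are all made up for this example.

```r
# Simulated data; variable names and true coefficients are illustrative assumptions
set.seed(42)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - 0.5 * sim$x2 + rnorm(n, sd = 0.5)

# Simple linear regression: a single predictor
simple_fit <- lm(y ~ x1, data = sim)

# Multiple linear regression: two predictors
multiple_fit <- lm(y ~ x1 + x2, data = sim)

coef(simple_fit)    # intercept and one slope
coef(multiple_fit)  # intercept and one slope per predictor
```

The only difference between the two calls is the right-hand side of the formula, where each additional predictor is added with +.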
Types of Regression
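Beyond the simple/multiple distinction, several regression methods are in common use:
- Ordinary Least Squares (OLS) Regression: OLS is the standard estimation method for simple and multiple linear regression. It chooses the coefficients that minimize the sum of squared residuals.
- Generalized Linear Model (GLM): GLM extends linear regression to response variables that are not normally distributed, such as counts or binary outcomes, by relating the predictors to the response through a link function.
- Robust Regression: Robust regression down-weights unusual observations. It is used when outliers or heavy-tailed errors would distort an OLS fit.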
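The three methods can be compared side by side in a short sketch. It assumes the MASS package (which ships with standard R installations) for rlm(). The data, including the two injected outliers and the count column, are simulated purely for illustration.

```r
library(MASS)  # provides rlm() for robust regression

set.seed(1)
d <- data.frame(x = 1:50)
d$y <- 5 + 2 * d$x + rnorm(50, sd = 3)
d$y[c(10, 40)] <- d$y[c(10, 40)] + 80                  # inject two outliers
d$count <- rpois(50, lambda = exp(0.5 + 0.03 * d$x))   # count response for the GLM

ols_fit    <- lm(y ~ x, data = d)                        # ordinary least squares
glm_fit    <- glm(count ~ x, family = poisson, data = d) # Poisson GLM, log link
robust_fit <- rlm(y ~ x, data = d)                       # M-estimation (Huber by default)

coef(ols_fit)
coef(glm_fit)
coef(robust_fit)
```

Comparing coef(ols_fit) with coef(robust_fit) gives a feel for how much the outliers pull on the least-squares estimates.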
Visualizing Regression with ggplot2
ggplot2 is a widely used R package that builds plots layer by layer. In this section, we’ll explore how to visualize regression using ggplot2.
The Problem with the Given Code
The provided code uses geom_smooth with method = "lm" to draw the fitted regression line over the scatter plot. However, geom_smooth doesn’t display the regression equation or the R^2 value on the graph.
To address this issue, we can create our own function that builds the regression equation as a string, which ggplot2 can then draw on the graph.
Creating a Function for Regression Line Equation
We’ll start by creating a function lm_eqn that takes a data frame as input. The data frame must contain columns named x and y. The function fits a linear regression model using lm, substitutes the coefficients and R^2 value into an equation expression, and returns that expression as a character string.
Code
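# GET EQUATION AND R-SQUARED AS STRING
# SOURCE: https://groups.google.com/forum/#!topic/ggplot2/1TgH-kG5XMA
lm_eqn <- function(df) {
  # Fit a simple linear model; df must contain columns named x and y
  m <- lm(y ~ x, df)
  # Build a plotmath expression from the formatted coefficients and R^2
  eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,
                   list(a = format(unname(coef(m)[1]), digits = 2),
                        b = format(unname(coef(m)[2]), digits = 2),
                        r2 = format(summary(m)$r.squared, digits = 3)))
  # Return the expression as a string so it can be parsed when plotted
  as.character(as.expression(eq))
}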
Explanation
- lm(y ~ x, df) fits the linear regression model.
- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2, ...) substitutes the formatted values into a plotmath expression. In plotmath, %.% draws a centered multiplication dot between b and x. It is a formatting symbol and does not perform any calculation.
- The list(a = format(unname(coef(m)[1]), digits = 2), b = format(unname(coef(m)[2]), digits = 2), r2 = format(summary(m)$r.squared, digits = 3)) part extracts the intercept a and slope b from the linear regression model, as well as the R^2 value, and formats each as a string.
- as.character(as.expression(eq)) converts the expression to a character string.
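To see what the final conversion produces, the sketch below builds the same kind of expression from made-up values (the strings "2.1", "3", and "0.81" are placeholders, not results from a fitted model) and then parses the resulting string back into an expression.

```r
# Hypothetical values standing in for a fitted intercept, slope, and R^2
eq <- substitute(
  italic(y) == a + b %.% italic(x) * "," ~~ italic(r)^2 ~ "=" ~ r2,
  list(a = "2.1", b = "3", r2 = "0.81")
)

# Convert the expression to a single character string
label <- as.character(as.expression(eq))
label

# parse() turns the string back into an expression that plotmath can render
parsed <- parse(text = label)
class(parsed)
```

This round trip is why the plotting code later asks for the label to be parsed: the function returns plain text, and the plotting layer has to turn it back into a math expression.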
Adding Regression Line Equation and R^2 Value
Now that we have our function lm_eqn, let’s modify the original code to display the regression line equation and R^2 value on the graph.
Code
library(ggplot2)
set.seed(123)  # make the simulated noise reproducible
df <- data.frame(x = 1:100)
df$y <- 2 + 3 * df$x + rnorm(100, sd = 40)
p <- ggplot(data = df, aes(x = x, y = y)) +
  geom_smooth(method = "lm", se = FALSE, color = "black", formula = y ~ x) +
  geom_point() +
  # annotate() draws the label once; geom_text() would redraw it for every row
  annotate("text", x = 25, y = 300, label = lm_eqn(df), parse = TRUE)
p
Explanation
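- label = lm_eqn(df) supplies the equation string built by our custom function.
- parse = TRUE tells ggplot2 to parse the label as a plotmath expression, so it is rendered as formatted math rather than as literal text.
- x = 25 and y = 300 position the label in data coordinates.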
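The position x = 25, y = 300 only suits this particular data set. One possible variation, sketched here on the assumption that a top-left placement is wanted, pins the label to the corner of the panel with -Inf and Inf and nudges it inward with hjust and vjust. The label below is a fixed placeholder string rather than the output of lm_eqn, so that the example runs on its own.

```r
library(ggplot2)

set.seed(123)
df <- data.frame(x = 1:100)
df$y <- 2 + 3 * df$x + rnorm(100, sd = 40)

# Placeholder plotmath string; in practice this would come from lm_eqn(df)
eq_label <- "italic(y) == a + b %.% italic(x)"

p <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "black", formula = y ~ x) +
  # -Inf/Inf pin the label to the top-left corner regardless of the data range
  annotate("text", x = -Inf, y = Inf, label = eq_label,
           parse = TRUE, hjust = -0.1, vjust = 1.5)
p
```

The hjust and vjust values are a matter of taste and may need adjusting for longer labels.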
Conclusion
In this article, we explored regression analysis and its visualization with ggplot2. We created a custom function lm_eqn that fits a linear model and returns the regression equation and R^2 value as a string. With this function, you can easily add the regression line equation and R^2 value to your scatter plots using ggplot2.
Additional Example: Calculating Coefficients
If you want to work with the coefficients directly rather than through the lm_eqn function, you can extract them from the fitted model:
# Extract coefficients from the linear regression model
m <- lm(y ~ x, df)
a <- unname(coef(m)[1])  # intercept
b <- unname(coef(m)[2])  # slope
# Calculate the predicted value of y from the intercept a and slope b
df$y_pred <- a + b * df$x
# Plot the data, the geom_smooth fit, and the predicted values (red line)
p <- ggplot(data = df, aes(x = x, y = y)) +
  geom_smooth(method = "lm", se = FALSE, color = "black", formula = y ~ x) +
  geom_point(color = "black") +
  geom_line(aes(y = y_pred), color = "red")
p
This code extracts the coefficients from the fitted linear regression model rather than computing them by hand. It then uses them to calculate the predicted value of y as a + b * x, which is plotted as a red line. Because geom_smooth fits the same model, the red line lies on top of the black one.
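As a quick sanity check, the hand-built prediction a + b * x can be compared with the fitted values that R computes itself. The sketch below uses its own simulated data frame d (an assumption for this example) and predict(), which returns the fitted values when no new data is supplied.

```r
set.seed(7)
d <- data.frame(x = 1:100)
d$y <- 2 + 3 * d$x + rnorm(100, sd = 40)

m <- lm(y ~ x, data = d)

# Prediction assembled by hand from the intercept and slope
manual <- unname(coef(m)[1] + coef(m)[2] * d$x)

# Fitted values as computed by the model object
builtin <- unname(predict(m))

# The two should agree up to floating-point tolerance
all.equal(manual, builtin)
```

In practice predict() is usually the more convenient choice, since it also accepts a newdata argument for predicting at x values that are not in the original data.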
Additional Example: Using Base R Graphics
If you prefer base R graphics to ggplot2, you can reuse lm_eqn to display the regression line equation and R^2 value:
# Display regression line equation and R-squared value using base R graphics
plot(df$x, df$y, pch = 16,
     main = "Scatter Plot with Regression Line",
     xlab = "X", ylab = "Y")
abline(lm(y ~ x, df), col = "red")
# lm_eqn() returns a plotmath string, so parse it before drawing
text(25, 300, parse(text = lm_eqn(df)), adj = 0.5)
legend("bottomright", legend = c("Data Points", "Regression Line"),
       col = c("black", "red"), pch = c(16, NA), lty = c(NA, 1))
This code uses base R graphics to display the scatter plot with the regression line, its equation, and the R^2 value. Because text() does not parse strings on its own, the label is passed through parse(text = ...) first.
By using these functions, you can easily add regression line equations and R^2 values to your scatter plots without having to calculate coefficients manually.
Last modified on 2024-08-17