Grouping Data and Creating a Summary: A Step-by-Step Guide with R

In this article, we’ll explore how to group data by category and summarize the results in R. We’ll start by examining the original data, then assign each row to a group, pivot the groups into columns, add totals, and compute each group’s share of the rows.

Understanding the Original Data

The original data is in a table format, with categories and corresponding values:

Category Value
14        1
13        2
32        1
63        4
24        1
77        3
51        2
19        4
15        1
24        4
32        3
10        1
...

We want to assign each category code to a named group such as C1, C2, and so on, where each group is a fixed set of category codes.

Creating Groups

To create these groups, we can use the dplyr package in R. First, let’s define which category codes belong to each group:

library(dplyr)

# Define the categories and group names
C1 <- c(14, 13, 24, 19, 77)
C2 <- c(32, 51, 63, 15, 10)
...
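
Before labeling rows, it can help to confirm that the group vectors do not overlap and to see whether any category is left unassigned. The following is a minimal sketch that uses only the categories visible in the table above; the names categories, overlap, and unassigned are illustrative and not part of the original code.

```r
# Group definitions, as in the snippet above
C1 <- c(14, 13, 24, 19, 77)
C2 <- c(32, 51, 63, 15, 10)

# Distinct category codes from the rows shown in the table (illustrative subset)
categories <- unique(c(14, 13, 32, 63, 24, 77, 51, 19, 15, 24, 32, 10))

# Codes that appear in both groups; ideally this is empty
overlap <- intersect(C1, C2)

# Codes in the data that belong to neither group; these would get NA as their group
unassigned <- setdiff(categories, c(C1, C2))

print(overlap)
print(unassigned)
```

If either vector is non-empty, the group definitions probably need another look before the data is summarized.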

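Next, we can put the rows shown above into a data frame and label each row with its group:

# Create a data frame from the rows shown above and add the group label.
# Group must be computed after Category exists, so it goes in mutate().
df <- data.frame(
  Category = c(14, 13, 32, 63, 24, 77, 51, 19, 15, 24, 32, 10),
  Value = c(1, 2, 1, 4, 1, 3, 2, 4, 1, 4, 3, 1)
) %>%
  mutate(Group = case_when(Category %in% C1 ~ "C1", Category %in% C2 ~ "C2"))

This creates a data frame with three columns: Category, Value, and Group. Categories that match neither vector get NA as their group.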
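
When there are many groups, nested conditions become hard to maintain. One alternative, sketched here under the assumption that group membership can be stored as a table, is to join the data against a lookup table. The names lookup, dat, and labelled are illustrative and do not appear in the original code.

```r
library(dplyr)

# Hypothetical lookup table: one row per category code with its group label
lookup <- data.frame(
  Category = c(14, 13, 24, 19, 77, 32, 51, 63, 15, 10),
  Group    = rep(c("C1", "C2"), each = 5)
)

# A few rows from the table shown earlier
dat <- data.frame(
  Category = c(14, 13, 32, 63, 24),
  Value    = c(1, 2, 1, 4, 1)
)

# left_join keeps every row of dat; unmatched categories would get NA in Group
labelled <- left_join(dat, lookup, by = "Category")
print(labelled)
```

Adding a new group then means adding rows to the lookup table instead of editing the labeling code.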

Pivoting the Data

To pivot the data into wide format, we can use the pivot_wider function from the tidyr package:

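# Pivot the data into wide format
library(tidyr)

df %>%
  group_by(Group) %>%
  mutate(Row = row_number()) %>%   # position of each row within its group
  ungroup() %>%
  pivot_wider(id_cols = Row, names_from = Group, values_from = Category)

This returns a new data frame with one column per group, plus a Row index column. Groups with fewer rows are padded with NA.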
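
A wide table needs every column to have the same length, which is awkward when groups differ in size. If only the membership of each group matters, a named list is a possible alternative. This is a base R sketch using split(); the data frame dat and the group labels in it are made up for illustration.

```r
# Illustrative long-format data: category codes that are already labeled
dat <- data.frame(
  Category = c(14, 13, 32, 63, 24, 77, 51),
  Group    = c("C1", "C1", "C2", "C2", "C1", "C1", "C2")
)

# split() returns a named list with one vector of categories per group
by_group <- split(dat$Category, dat$Group)
print(by_group)

# Number of categories in each group
print(lengths(by_group))
```

Each list element keeps its own length, so no NA padding is required.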

Adding Totals

To add totals for each group, we can use the adorn_totals function from the janitor package:

# Add a totals row using adorn_totals
library(janitor)

df %>%
  group_by(Group) %>%
  mutate(Row = row_number()) %>%
  ungroup() %>%
  pivot_wider(id_cols = Row, names_from = Group, values_from = Category) %>%
  adorn_totals()

This appends a row labeled "Total" at the bottom that holds the sum of each group's column.
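
As a cross-check on the totals, the same per-group sums can be computed directly from the long data without pivoting. This is a minimal sketch with made-up rows; dat and totals are illustrative names.

```r
library(dplyr)

# Illustrative long-format data: category codes that are already labeled
dat <- data.frame(
  Category = c(14, 13, 32, 63, 24, 77, 51),
  Group    = c("C1", "C1", "C2", "C2", "C1", "C1", "C2")
)

# One row per group with the sum of its category values
totals <- dat %>%
  group_by(Group) %>%
  summarise(Total = sum(Category))

print(totals)
```

The Total values should match the totals row of the wide table built from the same rows.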

Summarizing the Data

Finally, we can summarize the data by calculating the percentage of each group:

# Calculate the percentage of rows in each group
df %>%
  group_by(Group) %>%
  summarise(Percent = n() / nrow(df) * 100)

This returns one row per group with the percentage of all rows that fall into that group.
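
The same percentages can also be derived from the group counts, which keeps the counts visible next to the shares. This sketch uses dplyr's count() on made-up labels; dat and shares are illustrative names.

```r
library(dplyr)

# Illustrative group labels, one per row of data
dat <- data.frame(
  Group = c("C1", "C1", "C2", "C2", "C1", "C1", "C2")
)

# count() gives the rows per group in column n; dividing by the
# overall total turns the counts into percentages
shares <- dat %>%
  count(Group) %>%
  mutate(Percent = n / sum(n) * 100)

print(shares)
```

Because the denominator is sum(n), the Percent column adds up to 100 across groups.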

Putting it All Together

Here is the complete code:

library(dplyr)
library(tidyr)
library(janitor)

# Define the categories and group names
C1 <- c(14, 13, 24, 19, 77)
C2 <- c(32, 51, 63, 15, 10)
...

# Create a data frame from the rows shown above and add the group label
df <- data.frame(
  Category = c(14, 13, 32, 63, 24, 77, 51, 19, 15, 24, 32, 10),
  Value = c(1, 2, 1, 4, 1, 3, 2, 4, 1, 4, 3, 1)
) %>%
  mutate(Group = case_when(Category %in% C1 ~ "C1", Category %in% C2 ~ "C2"))

# Pivot the data into wide format and add a totals row
df %>%
  group_by(Group) %>%
  mutate(Row = row_number()) %>%
  ungroup() %>%
  pivot_wider(id_cols = Row, names_from = Group, values_from = Category) %>%
  adorn_totals()

# Calculate the percentage of rows in each group
df %>%
  group_by(Group) %>%
  summarise(Percent = n() / nrow(df) * 100)

This code builds the labeled data frame, prints the wide table with a totals row, and then prints the percentage of rows in each group.

Conclusion

In this article, we explored how to group data by category and summarize the results. We used the dplyr, tidyr, and janitor packages in R to label rows, pivot the groups into columns, add totals, and compute each group's share of the rows. The same steps apply to any data set where categories map to a fixed set of groups.

Last modified on 2024-12-24