Creating New Columns Based on Certain Strings Appearing in a Variable at Least Twice
In this post, we will explore how to create new columns based on certain strings appearing in a variable at least twice when grouped by another column. We’ll use the dplyr package in R and discuss how to define conditions inside case_when.
Problem Statement
We have a data frame containing two variables: ‘id’ and ‘var1’. We want to group the data frame by ‘id’, create new columns ‘condition1’, ‘condition2’, ‘condition3’, etc., and set the values based on whether certain strings appear in ‘var1’ at least twice. If a string appears only once or not at all, we’ll set its corresponding condition value to 0.
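To make the problem concrete, here is a small data frame to experiment with. The values are hypothetical and chosen only for illustration; they do not come from any real dataset. The cross-tabulation at the end shows how often each string occurs within each ‘id’, which is the quantity every condition depends on.

```r
# Hypothetical example data (invented for illustration).
df <- data.frame(
  id   = c(1, 1, 1, 2, 2, 2, 3, 3),
  var1 = c("A", "A", "B", "B", "B", "C", "A", "C")
)

# Cross-tabulate id against var1 to see the per-group counts.
counts <- table(df$id, df$var1)
print(counts)
```

With this data, only "A" in id 1 and "B" in id 2 occur at least twice, so those are the only cells that should end up flagged with 1.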
Example Code
Here’s an example code snippet that demonstrates how to create these new columns:
library(dplyr)
df2 <- df %>%
group_by(id) %>%
summarise(condition1 = as.integer(sum(var1 == "A") > 1),
condition2 = as.integer(sum(var1 == "B") > 1),
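condition3 = as.integer(sum(var1 == "C") > 1))
This returns one row per ‘id’. sum(var1 == "A") counts the matches within each group, and "> 1" is true when there are at least two of them. The conditions are evaluated independently, so a single ‘id’ can be flagged for more than one string. case_when is a different tool: it returns the first matching branch for each element, so it suits mutually exclusive categories rather than independent flags, and using it here takes more care.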
Solution
A simple way to express these conditions without case_when is to rely on logical values. In R, FALSE and TRUE are treated as 0 and 1 in arithmetic, so summing a comparison such as var1 == "A" counts the matches, and we can then check whether that count is greater than 1.
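The coercion can be checked on a plain vector in base R before bringing dplyr into it. The vector x below is a made-up example.

```r
# A made-up character vector standing in for var1 within one group.
x <- c("A", "B", "A", "C")

matches   <- x == "A"                    # logical vector: TRUE FALSE TRUE FALSE
n_matches <- sum(matches)                # TRUE counts as 1, so this is 2
flag      <- as.integer(n_matches > 1)   # 1, because "A" appears at least twice
```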
Here’s how you can do it:
library(tidyverse)
df %>%
group_by(id) %>%
mutate(condition1 = as.integer(sum(var1 == "A") > 1),
condition2 = as.integer(sum(var1 == "B") > 1),
condition3 = as.integer(sum(var1 == "C") > 1))
However, because mutate keeps every original row, each flag is repeated on all rows of its group instead of producing one row per ‘id’. That isn’t quite the output we want, so let’s re-examine our requirements.
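The difference between mutate and summarise is easiest to see by comparing row counts. This is a minimal sketch on hypothetical data, showing only ‘condition1’.

```r
library(dplyr)

# Hypothetical data: id 1 has "A" twice, id 2 has no "A".
df <- data.frame(
  id   = c(1, 1, 1, 2, 2),
  var1 = c("A", "A", "B", "B", "C")
)

# mutate() keeps all rows and repeats the group-level flag on each of them.
with_mutate <- df %>%
  group_by(id) %>%
  mutate(condition1 = as.integer(sum(var1 == "A") > 1)) %>%
  ungroup()

# summarise() collapses each group to a single row.
with_summarise <- df %>%
  group_by(id) %>%
  summarise(condition1 = as.integer(sum(var1 == "A") > 1))

nrow(with_mutate)     # 5
nrow(with_summarise)  # 2
```

Which one is appropriate depends on whether the flags should sit alongside the original rows or in a per-id summary table.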
Re-Examining Requirements
We need to create new columns ‘condition1’, ‘condition2’, etc., based on whether certain strings appear in ‘var1’ at least twice when grouped by ‘id’. Each condition should take a single value per ‘id’, either 1 or 0, and the conditions are independent of one another: the same ‘id’ may satisfy several of them or none.
Better Solution
A natural next attempt is to put case_when inside summarise, both from the dplyr package, with one case_when call per condition.
Here’s how you can do it:
library(dplyr)
df %>%
group_by(id) %>%
summarise(
condition1 = case_when(var1 == "A" ~ 1,
TRUE ~ 0),
condition2 = case_when(var1 == "B" ~ 1,
TRUE ~ 0),
condition3 = case_when(var1 == "C" ~ 1,
TRUE ~ 0)
)
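However, this approach still doesn’t give us the behavior we want. case_when is vectorised: it returns one value for every row in the group, so each expression marks the individual rows that equal the string rather than telling us whether the string occurs at least twice. Recent versions of dplyr also warn when summarise returns more than one row per group. We’ll need to rethink our strategy.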
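Calling case_when directly on a small vector shows the issue. The vector below is a hypothetical stand-in for the ‘var1’ values of a single group.

```r
library(dplyr)

# Hypothetical var1 values for one id.
var1 <- c("A", "A", "B")

# case_when() answers "is this row an A?" for every element...
per_row <- case_when(var1 == "A" ~ 1, TRUE ~ 0)
per_row          # 1 1 0
length(per_row)  # 3 values, not one flag for the group

# ...so the per-row result still has to be reduced to a single flag.
flag <- as.integer(sum(per_row) >= 2)
flag             # 1
```

Once the reduction step is added, case_when is no longer doing any useful work, which is why summing the comparison directly is simpler.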
Alternative Approach
Let’s assume that when a string appears at least twice in ‘var1’, its corresponding condition should be set to 1; otherwise, it should be set to 0. This can be achieved by using the following code:
library(dplyr)
df %>%
group_by(id) %>%
summarise(
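condition1 = as.integer(sum(var1 == "A") >= 2),
condition2 = as.integer(sum(var1 == "B") >= 2),
condition3 = as.integer(sum(var1 == "C") >= 2)
)
This approach works because var1 == "A" yields a logical vector and sum counts its TRUE values, that is, the number of times "A" appears in ‘var1’ within the group. Comparing this count to 2 tells us whether the string has appeared at least twice, and as.integer converts the resulting TRUE or FALSE to 1 or 0.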
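When there are many strings to check, writing one line per condition gets repetitive. The sketch below is one possible generalisation, not part of the original solution: the targets vector, the flag_counts helper, and the min_n parameter are all hypothetical names introduced here. It also passes na.rm = TRUE, on the assumption that missing values in ‘var1’ should simply be ignored; without it, a group containing NA would get NA instead of a count.

```r
library(dplyr)

# Hypothetical data, including one missing value.
df <- data.frame(
  id   = c(1, 1, 1, 2, 2, 2, 3, 3),
  var1 = c("A", "A", "B", "B", "B", "C", "A", NA)
)

# Hypothetical mapping from new column names to the strings they test.
targets <- c(condition1 = "A", condition2 = "B", condition3 = "C")

# Hypothetical helper: one 0/1 flag per target string for a vector x.
flag_counts <- function(x, targets, min_n = 2) {
  vapply(
    targets,
    function(s) as.integer(sum(x == s, na.rm = TRUE) >= min_n),
    integer(1)
  )
}

# An unnamed one-row data frame inside summarise() is unpacked into columns.
result <- df %>%
  group_by(id) %>%
  summarise(as.data.frame(as.list(flag_counts(var1, targets))))

result
```

For three strings the explicit version above is arguably clearer; the helper only starts to pay off as the list of strings grows.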
Summary
In conclusion, creating new columns based on certain strings appearing in a variable at least twice when grouped by another column requires careful consideration of how to define conditions. We’ve explored several approaches and arrived at the simplest one: inside summarise, count the occurrences of each string with sum and compare the count to 2.
Feel free to ask if you have any further questions!
Last modified on 2025-04-18