Converting Duplicate Rows to One Row with Collapsed Values
In this article, we will explore how to convert duplicate rows in a table to one row while collapsing certain values into a single column separated by a character.
Problem Statement
We are given a table that has duplicate rows based on the gene column. We want to remove these duplicates and collapse the values of the columns named chrQ, startq, endq, and geneq into a single column called matched. The matched column should contain the collapsed values separated by a semicolon.
Table Representation
The given table can be represented as follows:
test <- data.frame(
chr = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
start = c(1, 1, 1, 2, 2, 10),
end = c(5, 5, 5, 7, 7, 20),
gene = c("g1", "g1", "g1", "g2", "g2", "g3"),
chrQ = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
startq = c(1, 1, 1, 2, 3, 10),
endq = c(5, 5, 6, 7, 7, 20),
geneq = c("g1q", "g2q", "g3q", "g4q", "g5q", "g6q")
)
Solution
We can achieve this using the dplyr package in R. Here’s how to do it:
library(tidyverse)
test <- data.frame(
chr = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
start = c(1, 1, 1, 2, 2, 10),
end = c(5, 5, 5, 7, 7, 20),
gene = c("g1", "g1", "g1", "g2", "g2", "g3"),
chrQ = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
startq = c(1, 1, 1, 2, 3, 10),
endq = c(5, 5, 6, 7, 7, 20),
geneq = c("g1q", "g2q", "g3q", "g4q", "g5q", "g6q")
)
test %>%
group_by(chr, start, end, gene) %>%
unite("matched", chrQ, startq, endq, geneq, sep = " ") %>%
summarise(
matched = matched %>% paste0(collapse = "; "),
matched_total = n()
)
Output
The output of this code will be:
# A tibble: 3 × 6
# Groups: chr, start, end [3]
chr start end gene matched matched_total
<chr> <dbl> <dbl> <chr> <chr> <int>
1 chr1 1 5 g1 chr1 1 5 g1q; chr1 1 5 g2q; chr1 1 6 g3q 3
2 chr2 2 7 g2 chr2 2 7 g4q; chr2 3 7 g5q 2
3 chr2 10 20 g3 chr2 10 20 g6q 1
As we can see, the values of chrQ, startq, endq, and geneq have been collapsed into a single column called matched. The matched column contains the original values separated by semicolons.
Conclusion
In this article, we explored how to convert duplicate rows in a table to one row while collapsing certain values into a single column separated by a character. We achieved this using the dplyr package in R, specifically the group_by, unite, and summarise functions. The resulting output is a tidy dataset with the desired collapsed values.
Last modified on 2024-08-07