Convert Duplicate Rows to One Row with Collapsed Values in a Single Column Separated by Semicolons

Converting Duplicate Rows to One Row with Collapsed Values

In this article, we will explore how to convert duplicate rows in a table to one row while collapsing certain values into a single column separated by a character.

Problem Statement

We are given a table that has duplicate rows based on the gene column. We want to remove these duplicates and collapse the values of the columns named chrQ, startq, endq, and geneq into a single column called matched. The matched column should contain the collapsed values separated by a semicolon.

Table Representation

The given table can be represented as follows:

test <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
  start = c(1, 1, 1, 2, 2, 10),
  end = c(5, 5, 5, 7, 7, 20),
  gene = c("g1", "g1", "g1", "g2", "g2", "g3"),
  chrQ = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
  startq = c(1, 1, 1, 2, 3, 10),
  endq = c(5, 5, 6, 7, 7, 20),
  geneq = c("g1q", "g2q", "g3q", "g4q", "g5q", "g6q")
)

Solution

We can achieve this using the dplyr package in R. Here’s how to do it:

library(tidyverse)

test <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
  start = c(1, 1, 1, 2, 2, 10),
  end = c(5, 5, 5, 7, 7, 20),
  gene = c("g1", "g1", "g1", "g2", "g2", "g3"),
  chrQ = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
  startq = c(1, 1, 1, 2, 3, 10),
  endq = c(5, 5, 6, 7, 7, 20),
  geneq = c("g1q", "g2q", "g3q", "g4q", "g5q", "g6q")
)

test %>% 
  group_by(chr, start, end, gene) %>% 
  unite("matched", chrQ, startq, endq, geneq, sep = " ") %>% 
  summarise(
    matched = matched %>% paste0(collapse = "; "),
    matched_total = n()
  )

Output

The output of this code will be:

# A tibble: 3 × 6
# Groups:   chr, start, end [3]
   chr   start   end gene  matched                                  matched_total
   &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;                                            &lt;int&gt;
1 chr1      1     5 g1    chr1 1 5 g1q; chr1 1 5 g2q; chr1 1 6 g3q             3
2 chr2      2     7 g2    chr2 2 7 g4q; chr2 3 7 g5q                           2
3 chr2     10    20 g3    chr2 10 20 g6q                                       1

As we can see, the values of chrQ, startq, endq, and geneq have been collapsed into a single column called matched. The matched column contains the original values separated by semicolons.

Conclusion

In this article, we explored how to convert duplicate rows in a table to one row while collapsing certain values into a single column separated by a character. We achieved this using the dplyr package in R, specifically the group_by, unite, and summarise functions. The resulting output is a tidy dataset with the desired collapsed values.


Last modified on 2024-08-07