Understanding How to Subset Regions from AAString Objects in Biostrings

Understanding AAString Sets in Biostrings

Biostrings is a Bioconductor package for R that provides classes for various types of biological sequences, including DNA, RNA, and proteins. Two of these classes matter here: AAString, which holds a single amino acid (AA) sequence, and AAStringSet, which represents a set of such sequences.

In this article, we will explore how to subset regions from an AAString object. We will first examine the base approach using string manipulation functions, then delve into the complexities of working with Biostrings objects.

Base Approach

The first step in solving this problem is to treat each sequence as a plain character string. To extract specific positions from one of these sequences, we can use the strsplit function to split the sequence into individual characters, select the desired positions, and paste them back together.

Here’s an example:

seq1 = 'MEKIVLLLA'
seq2 = 'MEKIVLDIA'

paste(unlist(strsplit(seq2, split = ''))[c(1,3,6:9)], collapse='')
[1] "MKLDIA"

paste(unlist(strsplit(seq1, split = ''))[c(1,3,6:9)], collapse='')
[1] "MKLLLA"

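As we can see from the above code, splitting the sequence and selecting the desired positions does return the right residues. The result, however, is a plain character string rather than a Biostrings object. To keep the sequence type and the methods that come with it, we need to work with AAString objects.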
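The base-R steps above can be wrapped in a small function and applied to several sequences at once. This is only a sketch: pick_positions is a hypothetical helper name, and it uses substring in place of strsplit, which gives the same result for single-character positions.

```r
# Hypothetical helper: keep the given positions of one sequence string
pick_positions <- function(sequence, positions) {
  # substring() is vectorised over first/last, so this yields one
  # single-character string per requested position
  residues <- substring(sequence, first = positions, last = positions)
  paste(residues, collapse = "")
}

seq1 <- "MEKIVLLLA"
seq2 <- "MEKIVLDIA"

# vapply() applies the helper to each sequence and enforces a character result
picked <- vapply(c(seq1, seq2), pick_positions, character(1),
                 positions = c(1, 3, 6:9), USE.NAMES = FALSE)
picked
# [1] "MKLLLA" "MKLDIA"
```

The result is still a character vector, so it carries no information about the sequence type.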

Biostrings Approach

To work with Biostrings objects, we first need to create an AAString object for each sequence in our set.

seq1_AA = Biostrings::AAString(x = seq1, start = 1)
seq2_AA = Biostrings::AAString(x = seq2, start = 1)

seq1_AA[c(1,3,6:9)]
6-letter AAString object
seq: MKLLLA

Biostrings::subseq(seq1_AA[c(1,3,6:9)])
6-letter AAString object
seq: MKLLLA

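In this example, we create an AAString object for each sequence using the AAString constructor. Then, we select the desired positions from the first sequence with square-bracket indexing, which already returns the 6-letter result. Passing that result to the subseq function without a start or end leaves it unchanged; subseq is meant for extracting a single contiguous range.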
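To make the difference between the two operations concrete, here is a minimal sketch. The variable names tail_region and picked are illustrative and not part of the original example.

```r
library(Biostrings)

seq1_AA <- AAString("MEKIVLLLA")

# subseq() takes one contiguous window, given by start and end
tail_region <- subseq(seq1_AA, start = 6, end = 9)

# Square brackets accept arbitrary positions, contiguous or not
picked <- seq1_AA[c(1, 3, 6:9)]

as.character(tail_region)  # "LLLA"
as.character(picked)       # "MKLLLA"
```

Both calls return AAString objects, so the sequence type is preserved either way.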

However, working with Biostrings objects becomes more complex when the same selection has to be applied across multiple sequences: on an AAStringSet, square brackets select whole sequences rather than positions within them.
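A short sketch of why a set behaves differently from a single sequence. The set AAseq_12 is built here from the two example strings, and the names first_only and second_picked are illustrative.

```r
library(Biostrings)

AAseq_12 <- AAStringSet(c("MEKIVLLLA", "MEKIVLDIA"))

# On a set, single brackets select whole sequences, not residues
first_only <- AAseq_12[1]
length(first_only)  # 1 sequence
width(first_only)   # 9 residues, i.e. the full first sequence

# Double brackets pull out one AAString, whose positions can then be indexed
second_picked <- as.character(AAseq_12[[2]][c(1, 3, 6:9)])
second_picked       # "MKLDIA"
```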

Using XVector

One alternative is the subseq function. Its generic is defined in XVector, the infrastructure package that Biostrings builds on, and XVector is attached automatically when Biostrings is loaded.

library(Biostrings)  # also attaches XVector, which defines subseq()

# Combine the two example sequences into a set
AAseq_12 <- AAStringSet(c(seq1, seq2))

# subseq() takes one contiguous range per sequence;
# a single start/end pair is recycled across the whole set
subseq(AAseq_12, start = 6, end = 9)

In this example, we pass a start and an end position to the subseq function, which returns residues 6 to 9 of every sequence in the set.

Unfortunately, subseq cannot pick non-contiguous positions such as 1, 3, and 6 to 9 in a single call, because it accepts only one range per sequence.
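One way around this limitation, sketched here rather than taken from the original example, is to extract each contiguous piece with subseq and join the pieces element-wise with xscat from Biostrings. The set AAseq_12 is rebuilt from the two example strings, and stitched is an illustrative name.

```r
library(Biostrings)

AAseq_12 <- AAStringSet(c("MEKIVLLLA", "MEKIVLDIA"))

# Positions 1, 3 and 6:9 form three contiguous pieces;
# xscat() concatenates them sequence by sequence
stitched <- xscat(
  subseq(AAseq_12, start = 1, end = 1),
  subseq(AAseq_12, start = 3, end = 3),
  subseq(AAseq_12, start = 6, end = 9)
)

as.character(stitched)
```

For the two example sequences this should give MKLLLA and MKLDIA, matching the base-R result.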

Using IRanges

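Another approach is to loop over the set and subset each sequence individually. Note that IRanges is a separate Bioconductor package that Biostrings depends on, not a class inside Biostrings, and the loop below works without calling it directly.

library(Biostrings)

# AAseq_12: the two example sequences as a set
AAseq_12 <- AAStringSet(c(seq1, seq2))

# [[i]] returns one AAString, whose positions can be indexed directly
str_6 <- character(length(AAseq_12))
for (i in seq_along(AAseq_12)) {
  str_6[i] <- as.character(AAseq_12[[i]][c(1, 3, 6:9)])
}

Biostrings::AAStringSet(str_6)

In this example, we extract the same positions from every sequence, collect the results in a character vector, and then convert that vector to an AAStringSet using the AAStringSet constructor.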

This approach can be useful when working with multiple sequences that need to be subsetted in the same way.
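For readers who do want to involve IRanges, the following is a sketch of one possible way, not a method from the original example. It collapses an arbitrary position vector into contiguous runs with reduce and then stitches the runs together using subseq and xscat. The helper name extract_positions is hypothetical, and because reduce sorts the ranges, the selected residues always come back in sequence order.

```r
library(Biostrings)  # attaches IRanges and XVector as dependencies

# Hypothetical helper: keep the same positions from every sequence in a set
extract_positions <- function(x, positions) {
  # Collapse positions into contiguous runs, e.g. 1, 3, 6:9 -> [1,1] [3,3] [6,9]
  runs <- reduce(IRanges(start = positions, width = 1))
  # One subseq() call per run, each returning a set of the same length as x
  pieces <- lapply(seq_along(runs), function(k) {
    subseq(x, start = start(runs)[k], end = end(runs)[k])
  })
  # Concatenate the pieces sequence by sequence
  do.call(xscat, pieces)
}

AAseq_12 <- AAStringSet(c("MEKIVLLLA", "MEKIVLDIA"))
result <- extract_positions(AAseq_12, c(1, 3, 6:9))
as.character(result)
```

Compared with the loop, this keeps the work inside vectorised Biostrings calls, which may matter for large sets.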

Conclusion

Subsetting regions from a single AAString object is straightforward with square-bracket indexing, but an AAStringSet needs more care, because subseq accepts only one contiguous range per sequence. By extracting the positions from each sequence in turn and rebuilding the set, we can still apply the same selection to every sequence.

We hope that this article has provided a comprehensive overview of the challenges and opportunities when working with Biostrings objects, and that you will find it useful in your own work with these classes.


Last modified on 2023-05-25