Extracting Patterns from Strings in R Using Regular Expressions and the stringr Package

Pattern Extraction in Strings with R

=====================================================

In this article, we will explore how to extract different patterns from strings using the stringr package in R. We will use a specific example where we need to find phrases such as “number of subscribers,” “audited number of subscribers,” and “unaudited number of subscribers” in a given text.

Introduction


The stringr package is an add-on package for R, part of the tidyverse, that provides a consistent set of functions for working with strings. One of its most useful features is pattern matching with regular expressions, which lets us pull specific substrings out of a larger string.

In this article, we will focus on extracting different patterns from strings using regular expressions and the str_extract function from the stringr package.

Understanding Regular Expressions


Regular expressions (regex) are a way of describing patterns in strings. They use special characters and syntax to match specific sequences of characters.

In this example, we will use the following pattern:

\\b(number|audited|unaudited)\\s+of\\s+subscribers?

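This pattern matches one of the words “number”, “audited”, or “unaudited”, starting at a word boundary, followed by whitespace, “of”, whitespace, and then “subscriber” with an optional final “s”. Here \\b is a word boundary, \\s+ is one or more whitespace characters, and the trailing ? makes the preceding “s” optional. The backslashes are doubled because the pattern is written as an R string literal.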
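
As a rough illustration of how the pieces of this pattern behave, the sketch below checks a few strings with str_detect. The sample strings are made up for illustration and are not part of the original example.

```r
library(stringr)

pattern <- "\\b(number|audited|unaudited)\\s+of\\s+subscribers?"

# The doubled backslashes are R string escapes; the regex engine sees \b and \s.
writeLines(pattern)

# Hypothetical test strings, one per feature of the pattern
samples <- c(
  "number of subscribers",     # first alternative, plural form
  "number of subscriber",      # the trailing "s" is optional
  "audited   of subscribers",  # \s+ accepts a run of spaces
  "numbers of subscribers"     # "number" is not followed directly by whitespace
)

detected <- str_detect(samples, pattern)
print(detected)
# TRUE TRUE TRUE FALSE
```

Note that the third string matches even though “audited of subscribers” is not a phrase we are actually looking for, which hints at the limitation discussed later.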

Creating a Function to Extract Patterns


To extract the patterns, we need to create a function that takes in a string and applies the pattern matching logic.

Here is an example of how we can do this:

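library(stringr)

find_words <- function(text) {
  # First attempt: one of three words, then "of subscriber(s)"
  pattern <- "\\b(number|audited|unaudited)\\s+of\\s+subscribers?"

  # Return the first match in each string, or NA if there is none
  str_extract(text, pattern)
}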

This function takes a character vector text and returns, for each element, the first substring that matches the pattern, or NA if nothing matches. The pattern itself is a single character string holding the regular expression.
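
Because str_extract is vectorized, the same function can be applied to several strings at once. The following is a minimal sketch using two made-up sentences (an assumption for illustration), one of which contains no match.

```r
library(stringr)

find_words <- function(text) {
  pattern <- "\\b(number|audited|unaudited)\\s+of\\s+subscribers?"
  str_extract(text, pattern)
}

# Hypothetical inputs: the second sentence contains no matching phrase
texts <- c(
  "The number of subscribers rose.",
  "Revenue was flat this quarter."
)

result <- find_words(texts)
print(result)
# "number of subscribers" NA
```

The result has one element per input string, so it can be stored directly as a new column next to the original text.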

Limitations of the Existing Pattern


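The existing pattern does not cover all cases. It expects “audited” or “unaudited” to be followed directly by “of”, so in “audited number of subscribers” it matches only “number of subscribers” and drops the qualifier. Extra spaces between the words are not the problem, since \\s+ already accepts them.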
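
A quick check of what the first pattern actually returns for the qualified phrases may help here. This is only a sketch; the third input, with extra spaces, is a made-up variation.

```r
library(stringr)

pattern <- "\\b(number|audited|unaudited)\\s+of\\s+subscribers?"

qualified <- c(
  "audited number of subscribers",
  "unaudited number of subscribers",
  "audited  number   of subscribers"  # hypothetical input with extra spaces
)

extracted <- str_extract(qualified, pattern)
print(extracted)
# "number of subscribers" "number of subscribers" "number   of subscribers"
```

In every case the match starts at “number”, so the qualifier in front of it is lost, while the extra whitespace is matched without trouble.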

Improving the Pattern


To improve the pattern, we can make the qualifier an optional prefix: a group followed by ? matches either once or not at all.

library(stringr)

find_words <- function(text) {
  # Optional "audited " or "unaudited " prefix, then the core phrase
  pattern <- "(audited |unaudited )?number\\s+of\\s+subscribers"

  str_extract(text, pattern)
}

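This improved pattern uses parentheses and the | character to group two alternatives, and the ? after the group makes the whole group optional. The (audited |unaudited )? part therefore matches “audited ”, “unaudited ”, or nothing at all, so the qualifier is included in the result when it is present. Note that each alternative ends with a single literal space, so the qualifier must be separated from “number” by exactly one space.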
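
If the input may contain irregular spacing after the qualifier, one possible variation is to use \\s+ inside the optional group and anchor the match at word boundaries. The function name find_words_tolerant and the test strings below are hypothetical, and this is a sketch of one option rather than the article's own pattern.

```r
library(stringr)

# Hypothetical variant: (?: ) is a non-capturing group, (?:un)? makes the
# "un" prefix optional, and \s+ allows any run of whitespace after the qualifier.
find_words_tolerant <- function(text) {
  pattern <- "\\b(?:(?:un)?audited\\s+)?number\\s+of\\s+subscribers\\b"
  str_extract(text, pattern)
}

# Made-up inputs; the first has three spaces after "audited"
inputs <- c(
  "audited   number of subscribers",
  "unaudited number of subscribers",
  "number of subscribers"
)

tolerant <- find_words_tolerant(inputs)

# The pattern from the article, for comparison on the first input
strict <- str_extract(
  inputs[1],
  "(audited |unaudited )?number\\s+of\\s+subscribers"
)

print(tolerant)  # each input is returned in full
print(strict)    # "number of subscribers"
```

Whether this extra tolerance is worth the less readable pattern depends on how clean the source text is.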

Using the Improved Pattern


We can use this improved function with our sample texts:

text1 <- "On a year-on-year basis, the number of subscribers of Netflix increased 1.15% in November last year."
text2 <- "There is no confirmed audited number of subscribers in the Netflix's earnings report."
text3 <- "Netflix's unaudited number of subscribers has grown more than 1.50% at the last quarter."

find_words(text1)
# 'number of subscribers'

find_words(text2)
# 'audited number of subscribers'

find_words(text3)
# 'unaudited number of subscribers'

This shows that the improved pattern returns the full phrase, including the qualifier when there is one, for all three sample sentences.
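
One thing to keep in mind is that str_extract returns only the first match in each string. When a sentence may contain several of these phrases, str_extract_all can be used instead. The sketch below assumes a made-up sentence that contains two phrases.

```r
library(stringr)

pattern <- "(audited |unaudited )?number\\s+of\\s+subscribers"

# Hypothetical sentence with two matching phrases
text4 <- "The audited number of subscribers differs from the unaudited number of subscribers."

first_match <- str_extract(text4, pattern)

# str_extract_all returns a list with one character vector per input string
all_matches <- str_extract_all(text4, pattern)[[1]]

print(first_match)
# "audited number of subscribers"
print(all_matches)
# "audited number of subscribers" "unaudited number of subscribers"
```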

Conclusion


In this article, we explored how to extract different patterns from strings using the stringr package in R. We used a specific example where we needed to find phrases such as “number of subscribers,” “audited number of subscribers,” and “unaudited number of subscribers” in a given text.

We created a function that takes in a string and applies pattern matching logic using regular expressions. We then improved the pattern by making the “audited” or “unaudited” qualifier an optional prefix, which allowed us to extract all three phrases in full.

This technique can be applied to any problem where you need to extract specific patterns from strings, and it is an essential tool for any data analyst or programmer working with text data.


Last modified on 2024-01-11