Using R's rvest Package for Webscraping: A Step-by-Step Guide to Handling HTTP 500 Errors

Introduction to Webscraping with ‘rvest’

Webscraping is the process of automatically extracting data from websites. In this tutorial, we will use the popular R package ‘rvest’ to scrape information from a specific website.

Prerequisites

To follow along with this tutorial, you will need:

  • R installed on your system
  • The ‘rvest’ package installed in R (you can install it using install.packages("rvest"))
  • Basic knowledge of HTML and CSS

Understanding the Problem

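The problem is that the scraping loop stops partway through with an HTTP error 500. A 500 status (Internal Server Error) means the server was reached but failed while handling the request. read_html() turns that response into an R error, and an unhandled error aborts the whole loop, so the remaining pages are never scraped.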

Solution with ‘tryCatch’

To solve this issue, we can use R's built-in tryCatch function. It evaluates an expression and, if that expression raises an error, runs a handler function and returns the handler's value instead of stopping the script.
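Before applying this to real requests, the behaviour can be tried offline. The following is a minimal sketch, assuming a hypothetical fake_fetch() function that stands in for read_html() and fails for one id the way a server returning HTTP 500 would. The names fake_fetch, ids, and results are illustrative only.

```r
# Hypothetical stand-in for read_html(): fails for one id,
# as a request answered with HTTP 500 would.
fake_fetch <- function(id) {
  if (id == "0279") stop("HTTP error 500.")
  paste("page", id)
}

ids <- c("0278", "0279", "0280")
results <- character(0)

for (id in ids) {
  # If fake_fetch() raises an error, the handler runs and its
  # return value becomes the value of the whole tryCatch() call.
  page <- tryCatch(
    fake_fetch(id),
    error = function(e) {
      message("Skipping ", id, ": ", conditionMessage(e))
      NA_character_
    }
  )
  results[id] <- page
}

print(results)
```

The loop reaches all three ids, and the failed one is recorded as NA rather than ending the run.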

Here is how you would implement this in your R script:

library(rvest)

summary2 <- data.frame(matrix(nrow = 0, ncol = 4))
colnames(summary2) <- c("billnum", "sum", "type", "name_dis_part")
k <- c("0278", "0279", "0280")

for (i in k) {

  # First scrape

  # Sys.sleep(1) # Uncomment if necessary to pause between requests.

  webpage <- read_html(paste0("https://www.hcdn.gob.ar/proyectos/textoCompleto.jsp?exp=", i, "-D-2014"))
  billno <- html_nodes(webpage, 'h1')
  billno_text <- html_text(billno)
  
  billsum <- html_nodes(webpage, '.interno')
  billsum_text <- html_text(billsum)
  billsum_text <- gsub("\n", "", billsum_text)
  billsum_text <- gsub("\t", "", billsum_text)
  billsum_text <- gsub("    ", "", billsum_text)
  
  # Second scrape

  # Sys.sleep(1) # Uncomment if necessary to pause between requests.

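  # If the request fails (e.g. HTTP 500), return NULL instead of stopping the loop.
  link <- tryCatch(read_html(paste0("https://www.hcdn.gob.ar/proyectos/proyectoTP.jsp?exp=", i, "-D-2014")),
                   error = function(e) NULL)
  
  # is.null() yields a single TRUE/FALSE; is.na() on a parsed document does not.
  if (is.null(link)) {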
    
    type_text <- NA
    table_text <- NA
    
  } else {
  
    type <- html_nodes(link, 'h3')
    type_text <- html_text(type)
    table <- html_node(link, "table.table.table-bordered tbody")
  
    table_text <- html_text(table)
  
    table_text <- gsub("\n", "", table_text)
    table_text <- gsub("\t", "", table_text)
    table_text <- gsub("    ", "", table_text)
    
  }
  
  # Output
  
  summary2[i, 1] <- billno_text
  summary2[i, 2] <- billsum_text
  summary2[i, 3] <- type_text
  summary2[i, 4] <- table_text
}

# Print output
tibble::as_tibble(summary2)
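
The script cleans scraped text with the same sequence of gsub() calls in two places. As a sketch, that cleanup could be moved into one small function. The name clean_text is hypothetical and not part of rvest; it removes newlines and tabs first and then runs of four spaces, in the same order as the script.

```r
# Hypothetical helper: same cleanup steps as the gsub() calls in the loop.
clean_text <- function(x) {
  x <- gsub("[\n\t]", "", x)  # drop newlines and tabs
  gsub(" {4}", "", x)         # drop runs of four spaces
}

cleaned <- clean_text("Bill\n\t0278    summary")
print(cleaned)
```

Inside the loop, the call would then look like billsum_text <- clean_text(html_text(billsum)).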

Output

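The script prints a tibble with one row per bill number and four columns: billnum, sum, type, and name_dis_part. For any bill whose second request failed, type and name_dis_part are NA.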
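
A 500 error is sometimes temporary, so it can be worth retrying a request before giving up on it. The following is a minimal sketch of that idea, not part of rvest: read_safely is a hypothetical helper, and its reader, tries, and wait parameters are assumptions made for illustration. The reader argument is any function that takes a URL, which lets the sketch run offline with a simulated flaky server.

```r
# Hypothetical helper: call reader(url) up to `tries` times and
# return NULL if every attempt raises an error.
read_safely <- function(url, reader, tries = 3, wait = 1) {
  for (attempt in seq_len(tries)) {
    result <- tryCatch(reader(url), error = function(e) {
      message("Attempt ", attempt, " failed for ", url, ": ", conditionMessage(e))
      NULL
    })
    if (!is.null(result)) return(result)
    if (attempt < tries) Sys.sleep(wait)  # pause before the next attempt
  }
  NULL
}

# Simulated reader that fails twice and then succeeds.
calls <- 0
flaky_reader <- function(url) {
  calls <<- calls + 1
  if (calls < 3) stop("HTTP error 500.")
  paste("content of", url)
}

page <- read_safely("https://example.com/bill", reader = flaky_reader, wait = 0)
print(page)
```

In a real script, reader would be rvest::read_html, and a NULL result would be treated the same way as a failed request, by filling the affected columns with NA.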

In this tutorial, we have learned how to use ‘rvest’ to scrape information from websites and how to use tryCatch to keep a scraping loop running when a request fails with an HTTP 500 error.


Last modified on 2023-06-29