Introduction to Webscraping with ‘rvest’
Web scraping is the process of automatically extracting data from websites. In this tutorial, we will use the popular R package ‘rvest’ to scrape bill information from the website of Argentina's Chamber of Deputies (hcdn.gob.ar).
Prerequisites
To follow along with this tutorial, you will need:
- R installed on your system
- The ‘rvest’ package installed in R (you can install it using install.packages("rvest"))
- Basic knowledge of HTML and CSS
Understanding the Problem
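The problem is that the scraping loop keeps stopping with an HTTP error 500. A 500 status means the server failed to process the request for a particular page. read_html() turns that response into an R error, and an unhandled error aborts the whole loop, so the remaining pages are never scraped.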
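The failure mode can be reproduced without any network access. The sketch below is only an illustration: fetch_page() is a hypothetical stand-in for read_html() that fails for one bill number, and try() is used solely so that the script keeps running after the loop aborts.

```r
# Hypothetical stand-in for read_html(): fails for one input,
# the way a page answering with HTTP 500 would.
fetch_page <- function(id) {
  if (id == "0279") stop("HTTP error 500.")
  paste("page", id)
}

processed <- character(0)

# The loop has no error handling, so the first failure ends it.
outcome <- try(
  for (id in c("0278", "0279", "0280")) {
    processed <- c(processed, fetch_page(id))
  },
  silent = TRUE
)

processed                        # only the first page was collected
inherits(outcome, "try-error")   # TRUE: the loop stopped at "0279"
```

Note that "0280" is never requested, even though it would have succeeded.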
Solution with ‘tryCatch’
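To solve this issue, we can use R's built-in tryCatch function, which lets us catch an error raised by an expression and run a handler instead of stopping. If the handler returns a placeholder value for the failed page, the loop can carry on with the next one.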
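As a minimal sketch of the pattern in isolation, assume a hypothetical fetch_page() function that stands in for read_html() and fails for one bill number. The names fetch_page and results are illustrative and not part of ‘rvest’.

```r
# Hypothetical stand-in for read_html(): fails for one input.
fetch_page <- function(id) {
  if (id == "0279") stop("HTTP error 500.")
  paste("page", id)
}

results <- character(0)

for (id in c("0278", "0279", "0280")) {
  results[id] <- tryCatch(
    fetch_page(id),
    error = function(e) {
      # Report the failure and return a placeholder so the loop continues.
      message("Skipping ", id, ": ", conditionMessage(e))
      NA_character_
    }
  )
}

results   # three named entries; the one for "0279" is NA
```

The value returned by the error handler becomes the value of the whole tryCatch() call, which is why the failed page ends up as NA rather than stopping the loop.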
Here is how you would implement this in your R script:
library(rvest)

# Empty data frame that collects one row per bill
summary2 <- data.frame(matrix(nrow = 0, ncol = 4))
colnames(summary2) <- c("billnum", "sum", "type", "name_dis_part")

k <- c("0278", "0279", "0280")

for (i in k) {
  # First scrape: full text of the bill
  # Sys.sleep(1) # Uncomment if necessary to pause between requests
  webpage <- read_html(paste0("https://www.hcdn.gob.ar/proyectos/textoCompleto.jsp?exp=", i, "-D-2014"))
  billno <- html_nodes(webpage, "h1")
  billno_text <- html_text(billno)
  billsum <- html_nodes(webpage, ".interno")
  billsum_text <- html_text(billsum)
  billsum_text <- gsub("\n", "", billsum_text)
  billsum_text <- gsub("\t", "", billsum_text)
  billsum_text <- gsub(" {2,}", " ", billsum_text)  # collapse runs of spaces

  # Second scrape: this request can fail with HTTP 500, so catch the error
  # Sys.sleep(1) # Uncomment if necessary to pause between requests
  link <- tryCatch(
    read_html(paste0("https://www.hcdn.gob.ar/proyectos/proyectoTP.jsp?exp=", i, "-D-2014")),
    error = function(e) NULL  # return NULL instead of stopping the loop
  )

  # is.null() is a safe test here; is.na() on a parsed document returns
  # more than one value and makes if() fail
  if (is.null(link)) {
    type_text <- NA_character_
    table_text <- NA_character_
  } else {
    type <- html_nodes(link, "h3")
    type_text <- html_text(type)
    tbl_node <- html_node(link, "table.table.table-bordered tbody")
    table_text <- html_text(tbl_node)
    table_text <- gsub("\n", "", table_text)
    table_text <- gsub("\t", "", table_text)
    table_text <- gsub(" {2,}", " ", table_text)  # collapse runs of spaces
  }

  # Output: one row per bill, named after the bill number
  summary2[i, 1] <- billno_text
  summary2[i, 2] <- billsum_text
  summary2[i, 3] <- type_text
  summary2[i, 4] <- table_text
}

# Print output
tibble::as_tibble(summary2)
Output
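The code above will create a tibble with one row per bill and four columns: billnum, sum, type and name_dis_part. For any bill whose second page could not be loaded, type and name_dis_part are NA.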
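Because failed requests are recorded as NA, they are easy to find afterwards. The sketch below uses a small hand-made data frame with the same column names as summary2. All values in it are placeholders invented for illustration, not real scraped data.

```r
# Placeholder data with the same columns as summary2 (values are made up).
example <- data.frame(
  billnum = c("bill A", "bill B", "bill C"),
  sum = c("summary A", "summary B", "summary C"),
  type = c("type A", NA, "type C"),
  name_dis_part = c("table A", NA, "table C"),
  row.names = c("0278", "0279", "0280"),
  stringsAsFactors = FALSE
)

# Row names hold the bill numbers, so this lists the bills to retry.
failed <- rownames(example)[is.na(example$type)]
failed
```

The resulting vector could be fed back into the scraping loop as a new value of k to retry only the pages that failed.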
In this tutorial, we have learned how to use ‘rvest’ to scrape information from websites and handle errors that may occur during execution.
Last modified on 2023-06-29