07. Web scraping with R


This guide provides an overview of web scraping and how you can use R, with packages such as {rvest}, to scrape data from the web. Web scraping is a technique for extracting data from websites, and it is a powerful way to collect data for research, analysis, and visualization. You will learn how to scrape data from websites with R and save it in a format that can be used for further analysis.

Outcomes

  • Understand the concept of web scraping and its applications.
  • Learn how to use R to scrape data from websites.
  • Save scraped data in a format that can be used for further analysis.

HTML: the language of the web

Web scraping is a technique for extracting data from websites. Related techniques can pull data from documents such as PDF or DOCX files, but scraping is most often used to acquire the contents of the public-facing web.

The language of the web is HTML, a markup language. A raw HTML file is a semi-structured document built around the concept of tags. Tags are opened <tag> and closed </tag> in a hierarchical fashion, and each tag has a pre-defined meaning that determines how it is displayed when a browser parses the document. For example, <ul></ul> delimits an unordered list, and embedded inside will be a series of <li></li> tags, one per list item. The HTML fragment in Snippet 1, for instance, is displayed by a browser as in Snippet 2.

Snippet 1
<ul>
  <li>First item</li>
  <li>Second item</li>
  <li>Third item</li>
</ul>
Snippet 2
  • First item
  • Second item
  • Third item

Common HTML tags

  • <html>: The root element of an HTML page.
  • <head>: Contains meta-information about the document.
  • <body>: Contains the content of the document.
  • <h1> to <h6>: Header tags, with <h1> being the highest level.
  • <p>: Paragraph tag.
  • <ul>: Unordered list tag.
  • <ol>: Ordered list tag.
  • <li>: List item tag.
  • <a>: Anchor tag for hyperlinks.
  • <table>: Table tag.
  • <div>: Division tag, used to group elements.

This structure is what makes it possible to target and extract content from websites, as we will soon see. In addition to tags, however, we need to be aware of two kinds of CSS selectors: ids and classes. Ids and classes are attributes added to tags that allow developers to specify how a tag element should look or behave.

OK, that isn’t altogether insightful on its own, so let me give you an example. Imagine we have two lists much like the one in Snippet 1, but one serves as the table of contents of our page and the other is a basic list in the content area. Say we want the table of contents to appear in bold and the other list to appear as normal text. One way to do this is to distinguish between the two lists using a class attribute, as in Snippet 3.

Snippet 3
<ul class="toc">
  <li>First item</li>
  <li>Second item</li>
  <li>Third item</li>
</ul>

After doing this, a web designer would write a CSS rule (Snippet 4) that targets tags with the toc class and makes them bold. In our toy case, this only affects the unordered list with class="toc".

Snippet 4
/* CSS to make the toc class bold */
.toc {
  font-weight: bold;
}

Now, our list from Snippet 3 will be rendered in bold, appearing as in Snippet 5.

Snippet 5
  • First item
  • Second item
  • Third item

Ids work in a similar way, but use the aptly named id="..." attribute instead. The key difference is that an id should identify a single, unique element on a page, whereas a class can be shared by many elements.

All this is to say that the combination of the HTML tag structure and the use of CSS selectors tends to give the would-be web scraper various ways to target certain elements on a webpage and not others.

Web Scraping

While it has always been possible to navigate to a webpage, select, copy, and paste content into a document, web scraping makes this workflow automatic. This is particularly useful when you need to collect data from many pages or websites. Let’s consider the steps involved in web scraping (a compact code sketch follows the list):

  1. Download webpage content
  2. Parse the HTML structure
  3. Extract text content using the tags, CSS selectors, and structure
  4. Format and save the extracted content
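
These four steps map onto a compact {rvest} workflow. The sketch below is a minimal preview of what the rest of this guide unpacks step by step; the URL, the "h2" selector, and the output file name are illustrative choices, not fixed parts of the process.

# A minimal preview of the four steps (illustrative URL, selector, and file name)
library(rvest)

url <- "https://en.wikipedia.org/wiki/Web_scraping"
page <- read_html(url)                 # 1. download and 2. parse the webpage
headings <- page |>
  html_elements("h2") |>               # 3. target elements with a tag or CSS selector ...
  html_text()                          #    ... and extract their text content
writeLines(headings, "headings.txt")   # 4. save the extracted content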

Important

Before scraping a website, it is important to check the website’s terms of service and robots.txt file to ensure you are not violating any rules. Be respectful of the website’s resources and consider using an API if one is available.
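
One way to respect these rules from within R is with the {polite} package (Perepolkin 2023): its bow() function introduces your session to the site and consults the site’s robots.txt before any scraping takes place. The lines below are a minimal sketch of that check, using the Wikipedia URL from the next section.

# Check a site's scraping permissions before downloading anything
library(polite)

session <- bow("https://en.wikipedia.org/wiki/Web_scraping")
session  # printing the session reports whether the path is scrapable and any crawl delay

# Alternatively, inspect the site's robots.txt file directly
readLines("https://en.wikipedia.org/robots.txt") |> head()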

Download and parse

In R, the {rvest} package (Wickham 2024) is commonly used for web scraping. It provides the key function read_html(), which downloads and parses HTML documents. Here’s a simple example of how to download and parse a webpage using {rvest}, as seen in Snippet 6.

Snippet 6
# Load the rvest package
library(rvest)

# Specify the URL of the webpage to scrape
url <- "https://en.wikipedia.org/wiki/Web_scraping"

# Download and parse the webpage content
page <- read_html(url)

class(page)
  1. Replace the URL with the webpage you want to scrape.
  2. The read_html() function downloads the webpage content and parses it into an HTML document.
  3. The class() function is used to check the class of the page object, which should be xml_document.

Note that the R object page is an xml_document object, which is a representation of the HTML document. This object can be used to extract specific elements from the webpage as we will see in the next section.

For demonstration purposes, we will use the toy HTML document in Snippet 7, instead of the Wikipedia page, to illustrate how to extract text content from specific elements in a more simplified form. This will help us cover the basics of how to target and extract text from HTML documents using CSS selectors.

Snippet 7
<html>
  <body>
    <div class="article">
      <h1>Language Varieties</h1>

      <div class="excerpt" id="excerpt1">
        <h2>American English</h2>
        <p>American English is the variety of English spoken in the United States.
        It has several distinctive features in pronunciation, vocabulary, and grammar.</p>
      </div>

      <div class="excerpt" id="excerpt2">
        <h2>British English</h2>
        <p>British English refers to the English language as spoken and written in
        Great Britain. It differs from American English in spelling, vocabulary,
        and some grammatical constructions.</p>
      </div>
    </div>
  </body>
</html>

Before we move on, let’s note the tags, CSS selectors, and structure in the toy HTML document:

  • Tags: html, body, div, h1, h2, p
  • CSS selectors:
    • Classes: article, excerpt
    • Ids: excerpt1, excerpt2
  • Structure: inside the body tag, there is a div with class article containing two div elements with class excerpt and ids excerpt1 and excerpt2
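
To follow along with the examples that come next, you can parse this toy document straight from a string: read_html() accepts literal HTML as well as a URL. Below is a minimal sketch; the toy_html variable name is our own choice.

# Parse the toy HTML document from a string into an xml_document
library(rvest)

toy_html <- '
<html>
  <body>
    <div class="article">
      <h1>Language Varieties</h1>

      <div class="excerpt" id="excerpt1">
        <h2>American English</h2>
        <p>American English is the variety of English spoken in the United States.
        It has several distinctive features in pronunciation, vocabulary, and grammar.</p>
      </div>

      <div class="excerpt" id="excerpt2">
        <h2>British English</h2>
        <p>British English refers to the English language as spoken and written in
        Great Britain. It differs from American English in spelling, vocabulary,
        and some grammatical constructions.</p>
      </div>
    </div>
  </body>
</html>'

page <- read_html(toy_html)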

Extract Text Content

In {rvest}, the html_element() and html_elements() functions provide a way to target specific elements in the HTML document using tags and/or CSS selectors. Say we want to select the single <h1> element in the toy HTML document (Snippet 7); we can use html_element(), as in Snippet 8.

Snippet 8
# Target the h1 element
page |>
  rvest::html_element("h1")
  1. The html_element() function targets the first matching element in the HTML document.
{html_node}
<h1>

The html_element() function returns the first matching element as an ‘html_node’, not the content contained within it. If we want to extract the text content of the element, we can use html_text(), as in Snippet 9.

Snippet 9
# Extract the text content of the h1 element
page |>
  html_element("h1") |>
  html_text()
  1. Target the <h1> element.
  2. Extract the text content of the <h1> element with html_text().
[1] "Language Varieties"

The result is a character vector of length 1 containing the text content of the <h1> element.

In Snippet 9 we targeted and extracted a single element, <h1>. Since it is the only <h1> element in the document, we don’t need to worry about multiple matches. If we want to extract all of the <h2> elements, however, we use html_elements(), as in Snippet 10.

Snippet 10
# Extract the text content of all h2 elements
page |>
  html_elements("h2")
  1. Target all <h2> elements in the HTML document.
{xml_nodeset (2)}
[1] <h2>American English</h2>
[2] <h2>British English</h2>

Note that the R object is now an ‘xml_nodeset’, which is a list of ‘xml_node’ objects. To extract the text content of each element in the list, we can again use html_text(), as in Snippet 11.

Snippet 11
# Extract the text content of all h2 elements
page |>
  html_elements("h2") |>
  html_text()
  1. Target all <h2> elements in the HTML document.
  2. Extract the text content of all <h2> elements with html_text().
[1] "American English" "British English" 

The result is a character vector containing the text content of all <h2> elements in the HTML document.

Let’s consider targeting sections of the HTML within <div> elements. The <div> tag in HTML is a generic container for grouping elements. In the toy HTML document, we have three <div> elements: two nested inside a third. If we were to target just the <div> tag with html_elements(), we would get all three, as in Snippet 12.

Snippet 12
# Extract all div elements
page |>
  html_elements("div")
  1. Target all <div> elements in the HTML document.
{xml_nodeset (3)}
[1] <div class="article">\n      <h1>Language Varieties</h1>\n\n      <div cl ...
[2] <div class="excerpt" id="excerpt1">\n        <h2>American English</h2>\n  ...
[3] <div class="excerpt" id="excerpt2">\n        <h2>British English</h2>\n   ...

This is where our CSS selectors come in handy. We can use the class attribute to target specific <div> elements. For example, to target the <div> elements with the class excerpt, we can use html_elements("div.excerpt"), as in Snippet 13.

Snippet 13
# Extract div elements with class "excerpt"
page |>
  html_elements("div.excerpt")
  1. Target all <div> elements with the class excerpt in the HTML document.
{xml_nodeset (2)}
[1] <div class="excerpt" id="excerpt1">\n        <h2>American English</h2>\n  ...
[2] <div class="excerpt" id="excerpt2">\n        <h2>British English</h2>\n   ...

There are a number of CSS selector operators that {rvest} understands when targeting elements in the HTML document: . denotes a class, # denotes an id, and > denotes a child element. For example, to target the <div> element with the id excerpt1, we use html_element("div#excerpt1"), as in Snippet 14.

Snippet 14
# Extract div element with id "excerpt1"
page |>
  html_element("div#excerpt1")
  1. Target the <div> element with the id excerpt1 in the HTML document.
{html_node}
<div class="excerpt" id="excerpt1">
[1] <h2>American English</h2>
[2] <p>American English is the variety of English spoken in the United States ...

A child element is one that is nested directly within another element. So, for example, to target the <p> element within the <div> element with the id excerpt1, we use html_element("div#excerpt1 > p"), as in Snippet 15.

Snippet 15
# Extract p element within div element with id "excerpt1"
page |>
  html_element("div#excerpt1 > p")
{html_node}
<p>
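
From here, pulling out the paragraph’s text works just as before: pipe the targeted node to html_text(). A brief sketch:

# Extract the text of the paragraph inside the div with id "excerpt1"
page |>
  html_element("div#excerpt1 > p") |>
  html_text()

The result is a character vector of length 1 containing the American English description from the toy document.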

Organize Extracted Text

Now, let’s extract the content from the toy HTML document and organize it into a structured format, as in Table 1.

Table 1: Extracted text content from the toy HTML document.
variety          | description
-----------------|------------------------------------------------------------
American English | American English is the variety of English spoken in the United States. It has several distinctive features in pronunciation, vocabulary, and grammar.
British English  | British English refers to the English language as spoken and written in Great Britain. It differs from American English in spelling, vocabulary, and some grammatical constructions.

This can be achieved in a number of ways. Let’s look at a straightforward approach: targeting the elements directly. First, we extract the <h2> elements as a vector and then the <p> elements as another vector. We then combine these vectors into a data frame using the tibble() function from the {tibble} package, as in Snippet 16.

Snippet 16
library(tibble)

# Extract text content from the toy HTML document
varieties <-
  page |>
  html_elements("div.excerpt > h2") |>
  html_text()

descriptions <-
  page |>
  html_elements("div.excerpt > p") |>
  html_text()

df <- tibble(variety = varieties, description = descriptions)

df
  1. Target all <h2> elements within <div> elements with the class excerpt.
  2. Extract the text content of all <h2> elements.
  3. Target all <p> elements within <div> elements with the class excerpt.
  4. Extract the text content of all <p> elements.
  5. Combine the extracted text content into a data frame using tibble().
# A tibble: 2 × 2
  variety          description                                                  
  <chr>            <chr>                                                        
1 American English "American English is the variety of English spoken in the Un…
2 British English  "British English refers to the English language as spoken an…

At this point we have a data frame df that contains the extracted text content from the toy HTML document in a structured format. We can now save this data frame to a file and create the necessary documentation for our analysis.
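
A minimal sketch of the saving step, assuming the {readr} package and a file name of our own choosing:

# Write the data frame to a CSV file for later analysis
library(readr)

write_csv(df, "language_varieties.csv")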

This is a simple example of how to extract text content from HTML. As websites become more complex, targeting and extracting content can require more work. The principles, however, remain the same: target specific elements using tags and CSS selectors, extract the text content, and organize it into a structured format.

Next Steps

To build on these basics:

  • Practice extracting text from more complex webpage layouts
  • Consider using the {polite} package (Perepolkin 2023) for ethical scraping
  • Explore ways to collect metadata along with your text
  • Learn to handle different text encodings

References

Perepolkin, Dmytro. 2023. Polite: Be Nice on the Web. https://CRAN.R-project.org/package=polite.
Wickham, Hadley. 2024. Rvest: Easily Harvest (Scrape) Web Pages. https://rvest.tidyverse.org/.