07. Web scraping with R
This guide provides an overview of web scraping and how you can use R packages such as {rvest} to scrape data from the web. Web scraping is a powerful technique for collecting data for research, analysis, and visualization. You will learn how to use R to scrape data from websites and save it in a format that can be used for further analysis.
HTML: the language of the web
Web scraping is a technique used to extract data from websites. The same general techniques can be used to collect data from documents such as PDF or DOCX files, but scraping is most often used to acquire the contents of the public-facing web.
The language of the web is HTML. As a markup language, raw HTML is a semi-structured document built around the concept of tags. Tags are opened <tag> and closed </tag> in a hierarchical fashion. The tags come pre-defined in terms of how they are used and how a browser displays them when it parses the document. For example, <ul></ul> delimits an unordered list, and embedded inside it will be a series of <li></li> tags, one for each item in the list. So, for example, the HTML fragment in Snippet 1 is displayed by a browser as in Snippet 2.
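Snippet 1 itself is not reproduced here, but an unordered list fragment of the kind it describes looks roughly like this (an illustrative sketch only, not the exact snippet):

```html
<!-- An unordered list: <ul> opens the list and each <li> marks one item -->
<ul>
  <li>American English</li>
  <li>British English</li>
</ul>
```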
This structure is what makes it possible to target and extract content from websites, as we will soon see. However, in addition to tags we need to be aware of and understand two CSS selectors: ids and classes. Ids and classes are attributes added to tags that allow developers to specify how a tag element should behave.
OK, that isn’t altogether insightful. Let me give you an example. So imagine that we have two lists much like in Snippet 1 but one corresponds to the table of contents of our page and the other is used in the content area as a basic list. Say we want to make our table of contents appear in bold font and the other list to appear as normal text. One way to do this is to distinguish between these two lists using a class attribute, as in Snippet 3.
After doing this, a web designer would then create a CSS expression to target tags with the toc class and make them bold. In our toy case, this only targets our unordered list with class="toc".
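For illustration, a rule along these lines would achieve that effect. This is a sketch of the idea only, not necessarily the exact markup or CSS used in the guide's snippets:

```html
<!-- Illustrative sketch: a CSS rule that targets only the list carrying class="toc" -->
<style>
  ul.toc { font-weight: bold; }  /* table-of-contents list rendered in bold */
</style>

<ul class="toc">  <!-- matched by ul.toc: displayed in bold -->
  <li>Introduction</li>
</ul>

<ul>              <!-- no class: displayed as normal text -->
  <li>American English</li>
</ul>
```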
Now, our list from Snippet 3 will appear as in Snippet 5.
Ids work in a similar way, but instead have the apt id="..."
attribute.
All this is to say that the combination of the HTML tag structure and the use of CSS selectors tends to give the would-be web scraper various ways to target certain elements on a webpage and not others.
Web Scraping
While it has always been possible to navigate to a webpage, select, copy, and paste content into a document, web scraping makes this workflow automatic. This is particularly useful when you need to collect data from multiple pages or websites. Let’s consider the steps involved in web scraping:
- Download webpage content
- Parse the HTML structure
- Extract text content using the tags, CSS selectors, and structure
- Format and save the extracted content
Download and parse
In R, the {rvest} package (Wickham 2024) is commonly used for web scraping. It provides the key function read_html()
that downloads and parses HTML documents. Here’s a simple example of how to download and parse a webpage using {rvest}, as seen in Snippet 6.
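Snippet 6 is not shown here, but the basic pattern is short. A minimal sketch, assuming we download the Wikipedia article on web scraping (the URL is an illustrative assumption):

```r
# Load {rvest}; read_html() becomes available when the package is attached
library(rvest)

# Download and parse the webpage; 'page' holds the parsed HTML document
page <- read_html("https://en.wikipedia.org/wiki/Web_scraping")
```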
Note that the R object page
is an xml_document
object, which is a representation of the HTML document. This object can be used to extract specific elements from the webpage as we will see in the next section.
For demonstration purposes, we will use the toy HTML document in Snippet 7, instead of the Wikipedia page, to illustrate how to extract text content from specific elements in a more simplified form. This will help us cover the basics of how to target and extract text from HTML documents using CSS selectors.
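Snippet 7 is not reproduced here, so the sketches that follow use a stand-in document, toy_page, built to match the structure described below. The heading text and the placement of the h1 are placeholders, and the paragraph wording follows Table 1:

```r
library(rvest)

# A toy HTML document as a literal string; read_html() also accepts raw HTML
toy_html <- '
<html>
  <body>
    <div class="article">
      <h1>Varieties of English</h1>
      <div class="excerpt" id="excerpt1">
        <h2>American English</h2>
        <p>American English is the variety of English spoken in the United States. It has several distinctive features in pronunciation, vocabulary, and grammar.</p>
      </div>
      <div class="excerpt" id="excerpt2">
        <h2>British English</h2>
        <p>British English refers to the English language as spoken and written in Great Britain. It differs from American English in spelling, vocabulary, and some grammatical constructions.</p>
      </div>
    </div>
  </body>
</html>'

# Parse the string into an xml_document, just as we would a downloaded page
toy_page <- read_html(toy_html)
```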
Before we move on, let’s note the tags, CSS selectors, and structure in the toy HTML document:
- Tags: html, body, div, h1, h2, p
- CSS selectors:
  - Classes: article, excerpt
  - Ids: excerpt1, excerpt2
- Structure: inside the body tag, there is a div with class article containing two div elements with class excerpt and ids excerpt1 and excerpt2
Extract Text Content
In {rvest} the html_element() and html_elements() functions provide a way to target specific elements in the HTML document using tags and/or CSS selectors. Say we want to select the single <h1> element in the toy HTML document (Snippet 7); we can use html_element() as in Snippet 8.
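A sketch of that call, using the toy_page stand-in defined above:

```r
# Target the single <h1> element in the toy document
toy_page |> html_element("h1")
```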
The html_element()
function returns the first matching element, as an ‘html_node’, not the content contained within. If we want to extract the text content of the element, we can use html_text()
as in Snippet 9.
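Along the same lines, with the toy_page stand-in:

```r
# Target the <h1> element, then pull out its text content
toy_page |> html_element("h1") |> html_text()
```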
The result is a character vector of length 1 containing the text content of the <h1>
element.
Now in Snippet 9 we targeted and extracted a single element, <h1>
. This is the only <h1>
element so we don’t need to worry about multiple elements. However, if we wanted to extract all the <h2>
elements, we would use html_elements()
as in Snippet 10.
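One way this could look with the toy_page stand-in:

```r
# Target every <h2> element; the result is a set of nodes, not a single node
toy_page |> html_elements("h2")
```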
Note that the R object is now an ‘xml_nodeset’, which is a list of ‘xml_node’ objects. To extract the text content of each element in the set, we can use html_text() as in Snippet 11.
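Sketched with the toy_page stand-in, this yields one string per <h2> element:

```r
# Pull the text content out of each <h2> node in the set
toy_page |> html_elements("h2") |> html_text()
# With the toy stand-in this returns "American English" and "British English"
```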
The result is a character vector containing the text content of all <h2>
elements in the HTML document.
Let’s consider targeting sections of the HTML within <div> elements. The <div> tag in HTML is a generic container for grouping elements. In the toy HTML document, we have three <div> elements: two excerpt divs nested inside the outer article div. If we were to target just the <div> tag with html_elements(), we would get all three, as in Snippet 12.
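With the toy_page stand-in, that broad selection looks like this:

```r
# Selecting by tag alone matches all three <div> elements:
# the outer class="article" container and both class="excerpt" children
toy_page |> html_elements("div")
```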
This is where our CSS selectors come in handy. We can use the class
attribute to target specific <div>
elements. For example, to target the <div>
elements with the class excerpt
, we can use html_elements("div.excerpt")
as in Snippet 13.
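Sketched against the toy_page stand-in:

```r
# The .class syntax restricts the match to <div> elements with class "excerpt"
toy_page |> html_elements("div.excerpt")
```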
There are a number of operators used to target elements in the HTML document with {rvest}. The .
denotes a class, #
denotes an id, and >
denotes a child element. So for example, to target the <div>
element with the id excerpt1
we would use html_element("div#excerpt1")
.
A child element is one that is nested within another element. So for example, to target the <p>
element within the <div>
element with the id excerpt1
we would use html_element("div#excerpt1 > p")
.
Organize Extracted Text
Now, let’s extract the content from this toy HTML document and organize it into a structured format, as in Table 1.
| variety | description |
|---|---|
| American English | American English is the variety of English spoken in the United States. It has several distinctive features in pronunciation, vocabulary, and grammar. |
| British English | British English refers to the English language as spoken and written in Great Britain. It differs from American English in spelling, vocabulary, and some grammatical constructions. |
This can be achieved in a number of ways. Let’s look at a straightforward way of doing this by targeting the elements directly. First, let’s extract the <h2>
elements as a vector and then the <p>
elements as a vector. We can then combine these vectors into a data frame using the tibble()
function from the {tibble} package.
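A sketch of this approach, again using the toy_page stand-in; the object and column names below (df, variety, description) follow Table 1 and the surrounding prose, but the exact code in the guide may differ:

```r
library(rvest)
library(tibble)

# Extract the variety names (<h2>) and their descriptions (<p>) as character vectors
varieties    <- toy_page |> html_elements("h2") |> html_text()
descriptions <- toy_page |> html_elements("p")  |> html_text()

# Combine the two vectors into a two-column data frame
df <- tibble(variety = varieties, description = descriptions)
df
```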
From this point we have a data frame df
that contains the extracted text content from the toy HTML document in a structured format. We can now save this data frame to a file and create the necessary documentation files for our analysis.
This is a simple example of how to extract text content from HTML. As websites get more complicated, the process of targeting and extracting content can become more complex. However, the principles remain the same: target specific elements using tags and CSS selectors, extract the text content, and organize it into a structured format.
Next Steps
To build on these basics:
- Practice extracting text from more complex webpage layouts
- Consider using the {polite} package (Perepolkin 2023) for ethical scraping (see the sketch after this list)
- Explore ways to collect metadata along with your text
- Learn to handle different text encodings
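On the {polite} suggestion, a minimal sketch of its workflow, assuming the same illustrative Wikipedia URL and user agent string as above (both are assumptions for illustration):

```r
# bow() introduces the scraper to the host (checking robots.txt and rate limits);
# scrape() then fetches and parses the page, which can be queried with {rvest}
library(polite)
library(rvest)

session <- bow("https://en.wikipedia.org/wiki/Web_scraping",
               user_agent = "my-research-project")  # identify yourself honestly

page <- scrape(session)
page |> html_element("h1") |> html_text()
```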