Outcomes
- Understand the concept of web scraping and its applications.
- Learn how to use R to scrape data from websites.
- Save scraped data in a format that can be used for further analysis.
HTML: the language of the web
Web scraping is a technique for extracting data from websites. The same approach can be applied to documents such as PDF or DOCX files, but it is most often used to acquire the contents of the public-facing web.
The language of the web is HTML. As a markup language, raw HTML is a semi-structured document built around the concept of tags. Tags are opened <tag> and closed </tag> in a hierarchical fashion. The tags come pre-defined in terms of how they are used and displayed when a browser parses the document. For example, <ul></ul> delimits an unordered list. Embedded inside will be a series of <li></li> tags for items in the unordered list. So for example, the HTML fragment in Snippet 1 is displayed by a browser as in Snippet 2.
Common HTML tags
- <html>: The root element of an HTML page.
- <head>: Contains meta-information about the document.
- <body>: Contains the content of the document.
- <h1> to <h6>: Header tags, with <h1> being the highest level.
- <p>: Paragraph tag.
- <ul>: Unordered list tag.
- <ol>: Ordered list tag.
- <li>: List item tag.
- <a>: Anchor tag for hyperlinks.
- <table>: Table tag.
- <div>: Division tag, used to group elements.
This structure is what makes it possible to target and extract content from websites, as we will soon see. However, in addition to tags, we need to understand the two attributes that CSS selectors most often target: ids and classes. Ids and classes are attributes added to tags that allow developers to specify how a tag element should look and behave.
OK, that isn’t altogether insightful. Let me give you an example. Imagine that we have two lists much like the one in Snippet 1, but one corresponds to the table of contents of our page and the other is used in the content area as a basic list. Say we want our table of contents to appear in bold font and the other list to appear as normal text. One way to do this is to distinguish between the two lists using a class attribute, as in Snippet 3.
After doing this, a web designer would then create a CSS expression to target tags with the toc class and make them bold. In our toy case, this only targets our unordered list with class="toc".
Now, our list from Snippet 3 will appear as in Snippet 5.
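Ids work in a similar way, but instead use the aptly named id="..." attribute. Unlike a class, which can be shared by many elements, an id is meant to be unique within a page.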
All this is to say that the combination of the HTML tag structure and the use of CSS selectors tends to give the would-be web scraper various ways to target certain elements on a webpage and not others.
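To make the difference between tag, class, and id targeting concrete, here is a minimal sketch in R using {rvest}, the package introduced later in this section. The two lists, their item text, and the id "notes" are made up for illustration; only the toc class comes from the discussion above.

```r
library(rvest)

# Hypothetical page with two lists: only the first carries class="toc",
# and only the second carries an id.
html <- minimal_html('
  <ul class="toc">
    <li>Introduction</li>
    <li>Methods</li>
  </ul>
  <ul id="notes">
    <li>First note</li>
    <li>Second note</li>
  </ul>
')

# Tag selector: matches every <li> in both lists
all_items <- html |> html_elements("li") |> html_text2()

# Class selector (leading dot): only items inside the list with class "toc"
toc_items <- html |> html_elements("ul.toc li") |> html_text2()

# Id selector (leading hash): only items inside the list with id "notes"
note_items <- html |> html_elements("#notes li") |> html_text2()
```

The same selector syntax a web designer would use to style the toc list is what lets a scraper pick out that list and ignore the other.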
Web Scraping
While it has always been possible to navigate to a webpage, select and copy its content, and paste it into a document, web scraping automates this workflow. This is particularly useful when you need to collect data from multiple pages or websites. Let’s consider the steps involved in web scraping:
- Download webpage content
- Parse the HTML structure
- Extract text content using the tags, CSS selectors, and structure
- Format and save the extracted content
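The four steps above can be sketched end to end in R with {rvest}. This is only an illustration under stated assumptions: the HTML is a made-up in-memory string standing in for a downloaded page (so the sketch runs offline), and the output file name and temporary directory are hypothetical choices.

```r
library(rvest)

# Step 1 (download) is replaced here by an in-memory string so the sketch
# runs offline; with a real site, read_html() would be given a URL instead.
raw_html <- '
<html>
  <body>
    <h1>Sample page</h1>
    <p class="lead">First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>'

# Step 2: parse the HTML structure
page <- read_html(raw_html)

# Step 3: extract text using tags and CSS selectors
title <- page |> html_element("h1") |> html_text2()
paragraphs <- page |> html_elements("p") |> html_text2()

# Step 4: format as a data frame (one row per paragraph) and save to disk
result <- data.frame(title = title, text = paragraphs)
out_file <- file.path(tempdir(), "scraped.csv")  # hypothetical output location
write.csv(result, out_file, row.names = FALSE)
```

Storing one row per extracted element keeps the text tidy for later analysis; a CSV file is just one convenient format among several.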
Important
Before scraping a website, it is important to check the website’s terms of service and robots.txt file to ensure you are not violating any rules. Be respectful of the website’s resources and consider using an API if one is available.
Download and parse
In R, the {rvest} package (Wickham 2024) is commonly used for web scraping. It provides the key function read_html(), which downloads and parses HTML documents. Snippet 6 shows a simple example of how to download and parse a webpage using {rvest}.
Note that the R object page is an xml_document object, which is a representation of the HTML document. This object can be used to extract specific elements from the webpage, as we will see in the next section.
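As a quick offline check of what read_html() returns, the sketch below parses a literal HTML string rather than a URL. The string and the object name heading are made up for illustration.

```r
library(rvest)

# A literal HTML string stands in for a URL so the example runs offline
page <- read_html("<html><body><h1>Hello</h1><p>World</p></body></html>")

# Inspect the class of the parsed object; it includes "xml_document"
class(page)

# The parsed object can then be queried with CSS selectors
heading <- page |> html_element("h1") |> html_text2()
```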
For demonstration purposes, we will use the toy HTML document in Snippet 7, instead of the Wikipedia page, to illustrate how to extract text content from specific elements in a more simplified form. This will help us cover the basics of how to target and extract text from HTML documents using CSS selectors.
Before we move on, let’s note the tags, CSS selectors, and structure in the toy HTML document:
- Tags: html, body, div, h1, h2, p
- CSS selectors:
  - Classes: article, excerpt
  - Ids: excerpt1, excerpt2
- Structure: inside the body tag, there is a div with class article containing two div elements with class excerpt and ids excerpt1 and excerpt2
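The sketch below shows how each of these handles (tag, class, id, and structure) could be used with {rvest}. The HTML is a hypothetical reconstruction: only the tags, classes, ids, and nesting of the div elements follow the description above, while the placement of the h1, h2, and p tags and all of the text are assumptions.

```r
library(rvest)

# Hypothetical reconstruction of the toy document: the tag names, classes,
# ids, and div nesting follow the description; the text and the position of
# the headings and paragraphs are made up.
toy <- read_html('
<html>
  <body>
    <div class="article">
      <h1>Article title</h1>
      <div class="excerpt" id="excerpt1">
        <h2>First heading</h2>
        <p>First excerpt text.</p>
      </div>
      <div class="excerpt" id="excerpt2">
        <h2>Second heading</h2>
        <p>Second excerpt text.</p>
      </div>
    </div>
  </body>
</html>')

# By tag: every <h2> in the document
headings <- toy |> html_elements("h2") |> html_text2()

# By class: paragraphs nested in any element with class "excerpt"
excerpt_text <- toy |> html_elements(".excerpt p") |> html_text2()

# By id: the paragraph inside the element with id "excerpt2"
second <- toy |> html_element("#excerpt2 p") |> html_text2()

# By structure: count the <div>s that are direct children of div.article
n_excerpts <- toy |> html_elements("div.article > div") |> length()
```

Note that html_elements() returns every match, while html_element() returns only the first, which suits an id selector since an id should occur once per page.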
Next Steps
To build on these basics:
- Practice extracting text from more complex webpage layouts
- Consider using the {polite} package (Perepolkin 2023) for ethical scraping
- Explore ways to collect metadata along with your text
- Learn to handle different text encodings