GoodReads: Webscraping and Text Analysis with R (Part 1)

Inspired by this article about sentiment analysis and this guide to webscraping, I have decided to get my hands dirty by scraping and analyzing a sample of reviews on the website Goodreads.

The goal of this project is to demonstrate a complete example, going from data collection to machine learning analysis, and to illustrate a few of the dead ends and mistakes I encountered along the way. We’ll be looking at the reviews for five popular romance books. I deliberately chose books in the same genre to make the review text more homogeneous a priori; these five books are popular enough that I can easily pull a few thousand reviews for each, yielding a sizeable corpus with minimal effort. If you don’t like romance books, feel free to replicate the analysis with your genre of choice!

To make the article more digestible, I have divided it into three segments:

Part 1: Webscraping

Part 2: Exploratory data analysis and sentiment analysis

Part 3: Predictive analytics with machine learning

This post covers Part 1; the two following parts will be posted on DataScience+ at one-week intervals.

Part 1: Webscraping

Goodreads reviews are a trove of text content begging to be scraped, with an interesting non-text variable attached: the ratings left by the reviewers. But there is a problem: navigation between pages of comments is done through a JavaScript button, not an HTML link. Fear not: this problem actually has a pretty simple solution, through the use of the RSelenium package (which has a nice vignette here).

Setup

Let’s load the libraries we’ll require during the analyses, and define some variables we’ll use later.

library(data.table)   # Required for rbindlist
library(dplyr)        # Required to use the pipes %>% and some table manipulation commands
library(magrittr)     # Required to use the pipes %>%
library(rvest)        # Required for read_html
library(RSelenium)    # Required for webscraping with javascript

url <- "https://www.goodreads.com/book/show/18619684-the-time-traveler-s-wife#other_reviews"
book.title <- "The time traveler's wife"
output.filename <- "GR_TimeTravelersWife.csv"

Note that I’m working on a book-by-book basis. This means we have to manually change the variables above and re-run the script for each book. This could be automated to work on a grander scale, but that’s good enough for what I want to do here. Also, I’d rather not overload Goodreads’s servers by pulling massive amounts of data from them.

Let’s then start the RSelenium server. I have had some trouble with Firefox, and I have had to reinstall a previous version of the browser (which Firefox frowns upon), so your mileage may vary here.

startServer()
remDr <- remoteDriver(browserName = "firefox", port = 4444) # instantiate remote driver to connect to Selenium Server
remDr$open() # open web browser
remDr$navigate(url)

These instructions start an instance of Firefox and navigate to the url as if you were directly interacting with the browser.
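Note for readers on a newer setup: the comments at the end of this post report that a later Selenium update broke the direct startServer() approach, and that running Selenium in a Docker container is a workaround. Here is a minimal sketch of what the connection could look like in that case. The helper name connect.selenium, the host, the port and the Docker image are all assumptions to adapt to your own environment, not part of the original script.

```r
# Hypothetical helper: connect to a Selenium server that is already running,
# for instance in a Docker container started from a shell with something like
#   docker run -d -p 4445:4444 selenium/standalone-firefox
# The host, port and image name are assumptions; adjust them to your setup.
connect.selenium <- function(host = "localhost", port = 4445L,
                             browser = "firefox") {
  remDr <- RSelenium::remoteDriver(remoteServerAddr = host,
                                   port = port,
                                   browserName = browser)
  remDr$open()  # opens a browser session on the remote server
  remDr
}

# Usage (needs a running server, so it is left commented out here):
# remDr <- connect.selenium()
# remDr$navigate(url)
```

The rest of the walkthrough only relies on the remDr object, so it should not matter much how the connection was obtained.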

We then initialize the data frame that we’ll be populating with the data later.

global.df <- data.frame(book = character(),
                        reviewer = character(),
                        rating = character(),
                        review = character(),
                        stringsAsFactors = FALSE)

We are now ready to proceed with the webscraping process itself!

The webscraping process

To extract the content we want, we’ll be looping through the 100 pages or so of comments for each of the books. Here I remove the loop to show the code going through one page only and explain its workings line by line.

First, we need to identify “where” the reviews appear in the page code. This is done with SelectorGadget, a Chrome extension that helps you identify CSS selectors. Once we have identified the proper CSS selector (here “#bookReviews .stacked”), we just pass it to the findElements method of the remote driver.

reviews <- remDr$findElements("css selector", "#bookReviews .stacked")

We extract the html code for the reviews, then the text component.

reviews.html <- lapply(reviews, function(x){x$getElementAttribute("outerHTML")[[1]]})
reviews.list <- lapply(reviews.html, function(x){read_html(x) %>% html_text()} )
reviews.text <- unlist(reviews.list)
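To see what the read_html() / html_text() step does without a browser session, here is a small sketch on a made-up HTML fragment. The fragment is purely illustrative and does not reproduce Goodreads's actual markup.

```r
library(rvest)  # also makes the %>% pipe available

# Made-up HTML fragment standing in for one element's outerHTML
snippet <- "<div class='review'><a>Liz S</a> rated it <span>it was ok</span></div>"

# Parse the fragment, then keep only the text, dropping all the tags
snippet.text <- read_html(snippet) %>% html_text()
snippet.text
# "Liz S rated it it was ok"
```

All the tags disappear and the text nodes are concatenated, which is why the reviewer's name and the rating end up in the same string.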

We now have the text of the reviews as a character vector, but a quick inspection shows that there is still a lot of work to do to get clean text. We will do this with regular expressions (regex).

Cleaning the reviews with Regex

In my experience with text analytics, regular expressions are both a blessing and a curse. A blessing because how else can you remove all non-letter characters from a string in one short command? And a curse because it’s a fairly esoteric language that is hard to understand or remember when you re-read your code later. So if you are not familiar with regex, I would definitely advise very generous commenting at the brief moment in time when you actually understand what your code does.

# Removing all characters that are not letters or dash
reviews.text2 <- gsub("[^A-Za-z\\-]|\\.+", " ", reviews.text)
# Removing the end of line characters and extra spaces
reviews.clean <- gsub("\n|[ \t]+", " ", reviews.text2)
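As a quick illustration, here are the same two commands applied to a made-up string. The sample text is an assumption about what the raw output might look like, not actual scraped data.

```r
# Made-up raw text, imitating what html_text() might return for one review
raw <- "Liz S rated it 2 stars!\n\nLoved it... a must-read (really)."

# Step 1: replace everything that is not a letter or a dash with a space
step1 <- gsub("[^A-Za-z\\-]|\\.+", " ", raw)
# Step 2: collapse line breaks and runs of spaces/tabs into a single space
clean <- gsub("\n|[ \t]+", " ", step1)

# trimws() is only used here to drop the trailing space for display
trimws(clean)
# "Liz S rated it stars Loved it a must-read really"
```

Note that digits are removed along with the punctuation, while the dash in "must-read" survives.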

In order to write these commands, I have found these resources useful:

Putting the reviews in table format

We now have our reviews in a reasonably clean state. But due to the underlying structure of the HTML code, we have a problem: for each review, we have the name of the reviewer and their rating in one string, and the review itself in the following string. On top of that, because of the way Goodreads previews reviews, the beginning of each review appears twice in its string. We’ll have to clean all that, again using regex, to get our data in table format.

We start by counting the number of reviews we have, i.e. half the number of strings, and creating a temporary data frame that we’ll use to store the data before transferring it to the main data frame.

n <- floor(length(reviews)/2)
reviews.df <- data.frame(book = character(n),
                         reviewer = character(n),
                         rating = character(n),
                         review = character(n),
                         stringsAsFactors = FALSE)

We loop through our strings and populate our data frame by extracting the fields we want, review by review, based on recurring stop words. A for loop will do for this non-production example, but for production code you’d probably want to vectorize everything you can.

The following code might appear a bit cryptic so first I’ll explain what I’m going to do:

In the first part, I identify several expressions that can appear between the reviewer’s name and the rating, and use them in a regex to determine where the name ends in the string; then I extract the name.
In the second part, I identify several expressions that can appear at the end of the rating, and use them in a regex to determine where the rating ends. Sometimes none of these expressions appear, so I have a conditional telling R to go to the end of the string when it finds none of them (regexpr returns -1 when there is no match). Then I extract the rating.
In the third part, I remove the beginning of each review, which is repeated in the HTML code, by looking for the position in the string where the first 50 characters of the string appear a second time. I have a conditional in place to deal with cases where the review is short enough that its beginning is not repeated. I deal with the end of the review in the same way I did with the end of the rating.
Finally, note the structure of the loop: I’m not looping through the strings one by one, but through the reviews, each review taking 2 consecutive strings, hence the 2*j and 2*j-1 indices.

for(j in seq_len(n)){ # seq_len avoids running the loop when n is 0
  reviews.df$book[j] <- book.title
    
  #Isolating the name of the author of the review
  auth.rat.sep <- regexpr(" rated it | marked it | added it ", 
                          reviews.clean[2*j-1]) 
  reviews.df$reviewer[j] <- substr(reviews.clean[2*j-1], 5, auth.rat.sep-1)
    
  #Isolating the rating
  rat.end <- regexpr("· | Shelves| Recommend| review of another edition",
                     reviews.clean[2*j-1])
  if (rat.end==-1){rat.end <- nchar(reviews.clean[2*j-1])+1} # +1 so that the last character is kept
  reviews.df$rating[j] <- substr(reviews.clean[2*j-1], auth.rat.sep+10, rat.end-1)
    
  #Removing the beginning of each review that was repeated on the html file
  short.str <- substr(reviews.clean[2*j], 1, 50)
  rev.start <- unlist(gregexpr(short.str, reviews.clean[2*j]))[2]
  if (is.na(rev.start)){rev.start <- 1}
  rev.end <- regexpr("\\.+more|Blog", reviews.clean[2*j])
  if (rev.end==-1){rev.end <- nchar(reviews.clean[2*j])+1} # +1 so that the last character is kept
  reviews.df$review[j] <- substr(reviews.clean[2*j], rev.start, rev.end-1)
  }
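To make the three extraction steps easier to follow, here is a sketch that applies the same logic to a single review built from made-up strings. The helper parse.review and the sample strings are hypothetical; the stop words are a subset of those used in the loop, and the trimming of the end of the review is left out for brevity.

```r
# Hypothetical helper: the extraction steps applied to one review,
# i.e. one header string and one body string (both already cleaned).
parse.review <- function(header, body) {
  # Part 1: the reviewer's name ends where " rated it " (or a variant) starts
  sep <- regexpr(" rated it | marked it | added it ", header)
  reviewer <- substr(header, 5, sep - 1)

  # Part 2: the rating ends at the first stop word, or at the end of the string
  rat.end <- regexpr(" Shelves| Recommend| review of another edition", header)
  if (rat.end == -1) rat.end <- nchar(header) + 1  # + 1 keeps the last character
  rating <- substr(header, sep + 10, rat.end - 1)

  # Part 3: if the first 50 characters occur a second time, start there
  rev.start <- unlist(gregexpr(substr(body, 1, 50), body))[2]
  if (is.na(rev.start)) rev.start <- 1
  review <- substr(body, rev.start, nchar(body))

  data.frame(reviewer = reviewer, rating = rating, review = review,
             stringsAsFactors = FALSE)
}

# Made-up strings imitating the cleaned output (4 leading spaces in the header)
header <- "    Liz S rated it it was ok Shelves fiction"
full   <- "I recently read this book and honestly could not put it down until the very last page"
body   <- paste(substr(full, 1, 60), full)  # 60-character preview, then the full text

parsed <- parse.review(header, body)
parsed$reviewer        # "Liz S"
parsed$rating          # "it was ok"
parsed$review == full  # TRUE: the duplicated preview has been dropped
```

A review shorter than 50 characters has no second occurrence of its beginning, so parse.review(header, "Loved it")$review simply returns "Loved it".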

Now that our temporary data frame has been populated, we can transfer its content to our main data frame.

global.lst <- list(global.df, reviews.df)
global.df <- rbindlist(global.lst)

And finally, we tell RSelenium to “click” on the next page button, by passing the proper CSS selector that we identified with SelectorGadget. Final trick: I found that in the initial iterations, RSelenium was too slow to load the pages, and was not responding in time to the instructions at the beginning of the next loop, so we tell R to wait 3 seconds at the end of each loop.

NextPageButton <- remDr$findElement("css selector", ".next_page")
NextPageButton$clickElement()
Sys.sleep(3)
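The overall loop is omitted from this walkthrough. As a rough sketch of how the per-page steps could be wrapped, the hypothetical helper below takes two functions: one that scrapes the current page and returns a data frame, and one that tries to move to the next page and returns TRUE or FALSE. The names scrape.all.pages, scrape.page and go.next are assumptions for illustration, and the demo uses stand-in functions so that it runs without a browser.

```r
# Hypothetical helper (not from the original script): visit up to `max.pages`
# pages, collect one data frame per page, and stop early when there is no
# next page. `scrape.page` and `go.next` are supplied by the caller.
scrape.all.pages <- function(scrape.page, go.next, max.pages = 100, pause = 3) {
  pages <- list()
  for (i in seq_len(max.pages)) {
    pages[[length(pages) + 1]] <- scrape.page()
    # go.next() should return TRUE only if it moved to the next page
    if (i == max.pages || !isTRUE(go.next())) break
    Sys.sleep(pause)  # give the browser time to load the new page
  }
  do.call(rbind, pages)  # stack the per-page data frames
}

# With RSelenium, go.next could wrap the click in tryCatch(), assuming
# findElement() raises an error when the button is missing:
#   go.next <- function() tryCatch({
#     remDr$findElement("css selector", ".next_page")$clickElement(); TRUE
#   }, error = function(e) FALSE)

# Stand-ins for the RSelenium calls, so the sketch runs without a browser
fake.pages <- list(
  data.frame(reviewer = c("A", "B"), rating = c("it was ok", "liked it"),
             stringsAsFactors = FALSE),
  data.frame(reviewer = "C", rating = "it was amazing",
             stringsAsFactors = FALSE)
)
current <- 1
scrape.page <- function() fake.pages[[current]]
go.next <- function() {
  if (current >= length(fake.pages)) return(FALSE)
  current <<- current + 1
  TRUE
}

all.reviews <- scrape.all.pages(scrape.page, go.next, max.pages = 5, pause = 0)
nrow(all.reviews)  # 3
```

Keeping the browser-specific calls inside the two functions makes it possible to test the looping logic separately from the scraping itself.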

After closing the overall loop, we can save the final data frame in a file.

write.csv(global.df, output.filename)

The resulting data frame looks like this:

book	reviewer	rating	review
The time traveler’s wife	Liz S	it was ok	I recently read…
Eleanor & Park	Danielle	did not like it	Why can’t there be…
Me Before You	Swaps	it was amazing	This review has been…

You can find the full code, including the loops I have omitted here, on my github.

Part two and three

7 Comments

akshay naidu January 25, 2017

Hi @florentbuisson:disqus , this is great tutorial article. I have followed it and successfully extracted the reviews, however as for better understanding purpose you didn’t loop through all the pages of comments which is why I got only 30 reviews after following this tutorial.
Can you Please tell me what changes should I make in order to loop through hundreds of pages of comments in order to get all the reviews.
Thank You.
  
  Florent Buisson (Author) October 11, 2017
  
  Hi akshay naidu,
  Sorry it took me so long to answer (!), I’m not monitoring anymore the message on this page. You can find the complete code, including the loops, on my github: https://github.com/BuissonFlorent/GoodReads_TextMining/blob/master/GR_Webscraping.R
  Note however that due to a big change in Selenium, you’ll need to run it in a docker environment.

Florent Buisson December 16, 2016

Hi @ScottRobinett:disqus , sorry it took me so long to answer, I haven’t been around DataScience+ lately. The sad truth is that in the meantime, there has been a major update of the Selenium underlying mechanism, which means that this tutorial would not work anymore as is. A solution I have found for a more recent project (after much trial and errors) has been to use an older version of selenium and firefox in a docker container. You can find a tutorial here: https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-docker.html.

Also, I have never been able to make it work with chrome, even in the good ol’ days when it was working directly with firefox without the need for docker.

ScottRobinett October 11, 2016

try as I might, I can not get this to execute. Using browserName = “chrome”. Browser opens to the url, but times out and reviews = List of 0. ??? thanks for sharing.

Mara September 18, 2016

As an avid reader and Goodreads user, I’ve thought of doing pretty much exactly this so many times, and just never got around to it! I’m so glad that you did! Can’t wait to work through the parts to come!

Martin September 10, 2016

Very helpful, thanks. FYI, I think you’ve got a typo in your creation of reviews.list (though the version on github seems correct)
  
  Florent Buisson September 12, 2016
  
  Thanks for the comment and thanks for the tip; there was indeed a typo!
