Extract a table from html into a R table

95 views Asked by At

Im trying to extract the table of "AREA 1 Legal Frameworks | Criminalisation of consensual same-sex sexual acts" from https://database.ilga.org/criminalisation-consensual-same-sex-sexual-acts into a data table in R.

But when I build up my codes, I have a trouble in finding correct table nodes by their class to convert into a list of data frames.

So far I have the following code:

# Load the packages
library(RCurl)
library(xml2)
library(rvest)

# Download the web page
theurl <- "https://database.ilga.org/criminalisation-consensual-same-sex-sexual-acts"
webpage <- getURL(theurl)

# Parse the html document
htmldoc <- read_html(webpage)

# Find all the table nodes by their class
tablenodes <- html_nodes(htmldoc, ".tablesorter")

# Convert the table nodes to a list of data frames
tablelist <- html_table(tablenodes)

# Select the data frame that contains the table you want
tabledf <- tablelist[[1]]

But when I try to find the necessary table nodes, the tablenodes <- html_nodes(htmldoc, ".tablesorter") will return a list of 0 and prevent further steps to convert the table nodes to a list of data frames.

Can anybody help me figure out how to extract the correct table nodes and transfer them into a list of data frames?

1

There are 1 answers

2
thothal On

Your problem is that the website is JavaScript powered and apparently the table is loaded via JavaScript. You can easily verify that with the following code (in RStudio):

library(rvest)

url <- "https://database.ilga.org/criminalisation-consensual-same-sex-sexual-acts"

doc <- url %>% 
  read_html()
tf <- tempfile(fileext = ".html")
cat(as.character(doc), file = tf)
getOption("viewer")(tf)

You will see the HTML rvest sees:

Screenshot of the page loaded at the url

You need to fallback to RSelenium and friends to make sure the JavaScript is loaded.

Disclaimer: JavaScript powered webpages are often a nightmare to scrape from and I am by no means an expert, I developped the following code simply by trial an derror and maybe there are better/smarter ways of doing it w/o the heavy burden of RSelenium (about which I woudl be also very curious to learn).

library(rvest)
library(RSelenium)

url <- "https://database.ilga.org/criminalisation-consensual-same-sex-sexual-acts"
## First problem to get the server started
## worked after switching to firefox and providing NULL for chromever
## cf https://stackoverflow.com/questions/45395849/cant-execute-rsdriver-connection-refused
rD <- rsDriver(browser = "firefox", port = 4555L, verbose = TRUE, chromever = NULL)
remDr <- rD[["client"]]
remDr$navigate(url)

## Next problem: table seems not to be loaded before visually shown on the screen
## needed to manually scroll down
remDr$executeScript('window.scrollBy(0, 4500)')

## Finally we can scrape it
(tab <- read_html(x = remDr$getPageSource()[[1]]) %>% 
  html_element(".mainTable") %>% 
  html_table())

# # A tibble: 314 × 11
#    ``                    ``    ``    ``    ``    Penalties Penalties Penalties Penalties ``    ``   
#    <chr>                 <chr> <chr> <chr> <chr> <chr>     <chr>     <chr>     <chr>     <chr> <chr>
#  1 Jurisdiction          xLeg… xLas… xDec… xMec… xDeath P… xMax Pri… xFine     xOther P… "xEx… "xEn…
#  2 x Cite EntryReport e… Ileg… 2018  -     N/A   No legal… 2 Years   No        No        "Sin… ""   
#  3 x Cite EntryReport e… Legal -     1995  Legi… No        No        No        No        "The… ""   
#  4 x Cite EntryReport e… Ileg… 1966  -     N/A   No        2 Years   Yes       No        "Art… ""   
#  5 x Cite EntryReport e… Legal -     1980  Legi… No        No        No        No        "Pub… ""   
#  6 x Cite EntryReport e… Legal -     1990  Legi… No        No        No        No        "Alt… ""   
#  7 x Cite EntryReport e… Legal -     2021  Legi… No        No        No        No        "Art… ""   
#  8 x Cite EntryReport e… Legal -     2001  Legi… No        No        No        No        "The… ""   
#  9 x Cite EntryReport e… Legal 1995  2022  Judi… No        No        No        No        "Con… ""   
# 10 x Cite EntryReport e… Legal -     1903  Legi… No        No        No        No        "Law… ""   
# # ℹ 304 more rows
# # ℹ Use `print(n = ...)` to see more rows

rD$server$stop()