I'm looking to scrape each of the boxscores for each game from basketball reference, as an example here:
https://www.basketball-reference.com/boxscores/202402220CHI.html
The tables are shown for different periods of the game, e.g. Q1, Q2, Q3, Q4 and you can see these by clicking on the various options:
My goal is to get each of these tables (for EACH TEAM on EACH DAY of the season), join them together, then specify which period the table applies to, e.g. Q1, Q2, etc. most likely by adding a column that says "Q1", "Q2", etc.
My attempt so far:
library(rvest)
library(tidyverse)
##sample only - ultimately this will include all teams and all months and days
month <- c('02')
year <- c('2024')
day <- c('220','270','280')
team <- c('CHI')
make_url <- function(team, year, month, day) {
paste0(
'https://www.basketball-reference.com/boxscores/', year, month, day, team, '.html'
)
}
dates <- expand.grid(
team = team, year = year, month = month, day = day
)
urls <- dates |>
mutate(
url = make_url(team, year, month, day),
team = team,
date = paste(year, month, gsub('.{1}$', '', day), sep = '-'),
.keep = 'unused'
)
scrape_table <- function(url) {
page_html <- url %>%
rvest::read_html()
page_html %>%
rvest::html_nodes("table") %>%
rvest::html_table(header = FALSE)
}
safe_scrape_table <- purrr::safely(scrape_table)
tbl_scrape <- purrr::map(urls$url, \(url) {
Sys.sleep(5)
safe_scrape_table(url)
}) |>
set_names(paste(urls$team, urls$date, sep = '-'))
final_result <- tbl_scrape |>
purrr::transpose() |>
pluck('result')
This is where I am stuck. I can see that list[[1]] and list[[9]] are the Game outputs, list[[2]] and list[[10]] are the Q1 outputs, and so on.
How can I only get the ones that I need and bind them altogether? I only require Q1, Q2, Q3 and Q4.
I also need to add a column that is effectively the title of each, e.g. 'CHI-2024-02-22' so I know which game these stats relate to.
Lastly, I am looking to add two columns, one for the home team and one for the away team. I know these details appear on each page, but I cannot figure out how to get them?


Try this. Inside the function "getPageTable", I read the game date and the table headings and all of the tables on the page. I filter out only the box scores for the quarters 1, 2, 3, 4. and remove the headings off the rows and then add on the game date and table heading onto the table and then merge into 1 data frame per game.
See the comments for additional details.
Edited