R Targets with DBI/SQL database

229 views Asked by At

I have a targets pipeline which should fill and use a DBI database (in this case duckdb. Both SQL as well as duckdb are chosen for performance reasons).

How can I effectively use {targets} in combination with an SQL table.

A minimal, reproducible example would look like this, which reads two datasets, combines them, creates a model, and a plot.

# in _targets.R
library(targets)

# preparation once:
# d <- ggplot2::mpg |> dplyr::mutate(id = 1:dplyr::n())
# write_csv(d |> dplyr::select(id, cty), "data1.csv")
# write_csv(d |> dplyr::select(id, displ, cyl), "data2.csv")

get_data <- function(file) {
  read_csv(file, col_types = cols()) %>%
    as_tibble()
}

combine_data <- function(d1, d2) {
  left_join(d1, d2, by = "id")
}

fit_model <- function(data) {
  lm(cty ~ displ + cyl, data) |> 
    coefficients()
}

plot_model <- function(model, data) {
  ggplot(data) +
    geom_point(aes(x = displ, y = cty, color = cyl)) +
    geom_abline(intercept = model[1], slope = model[2]) +
    theme_gray(24)
}

tar_option_set(packages = c("tibble", "readr", "dplyr", "ggplot2"))

list(
  tar_target(file1, "data1.csv", format = "file"),
  tar_target(file2, "data2.csv", format = "file"),
  tar_target(data1, get_data(file1)),
  tar_target(data2, get_data(file2)),
  tar_target(data, combine_data(data1, data2)),
  tar_target(model, fit_model(data)),
  tar_target(plot, plot_model(model, data))
)

# run: tar_make() -> runs all targets
# change a value in data2.csv -> tar_make() -> rebuilds only dependencies

This should then be used with this connection to a database

# write to Database
con <- DBI::dbConnect(duckdb::duckdb(), "data.db")
on.exit(DBI::dbDisconnect(con, shutdown = TRUE), add = TRUE)
DBI::dbWriteTable(con, "data1", d |> dplyr::select(id, cty))
DBI::dbWriteTable(con, "data2", d |> dplyr::select(id, displ, cyl))

# in _targets.R... unclear at best...
list(
  tar_target(data1, ???), # if I use DBI::dbReadTable(con, "data1") it duplicates the data and returns it to R
)

The problem is, that if I return the values of the table back to R, this might take too long. Is there a database-aware adapter for targets and is this even possible/intended with targets?

0

There are 0 answers