Reading sparse data in h2o using SVMLight


I am trying to read a dataset in SVMLight format into h2o. Writing it to a file on disk and reading it back works, but reading it directly from R's memory does not. Is there a different function, or a different way of calling the function I used below?

Here's an example (R 3.3.3, h2o 3.10.3.6):

library(data.table)
library(h2o)

set.seed(1000)
tot_obs <- 100
tot_var <- 500
vars_per_obs <- round(.0*tot_var,0):round(.1*tot_var,0)

#randomly generated data
mat.dt <- do.call('rbind', lapply(1:tot_obs, function(n) {
    nvar <- sample(vars_per_obs,1)
    if(nvar>0) data.table(obs=n, var=sample(1:tot_var,nvar))[, value:=sample(10:50,.N,replace=TRUE)]
}))

event.dt <- data.table(obs=1:tot_obs)[, is_event:=sample(0:1,.N,prob=c(.9,.1),replace=TRUE)]

#SVMLight format
setorder(mat.dt, obs, var)
mat.agg.dt <- mat.dt[, .(feature=paste(paste0(var,":",value), collapse=" ")), obs]
mat.agg.dt <- merge(event.dt, mat.agg.dt, by="obs", sort=FALSE, all.x=TRUE)
mat.agg.dt[is.na(feature), feature:=""]
mat.agg.dt[, svmlight:=paste(is_event,feature)][, c("obs","is_event","feature"):=NULL]
fwrite(mat.agg.dt, file="svmlight.txt", col.names=FALSE)

#h2o
localH2o <- h2o.init(nthreads=-1, max_mem_size="4g")
h2o.no_progress()

#works
h2o.orig <- h2o.importFile("svmlight.txt", parse=TRUE)

#does NOT work
tmp <- as.h2o(mat.agg.dt)
h2o.orig.1 <- h2o.parseRaw(tmp, parse_type="SVMLight")

1 answer

Answered by Erin LeDell

The easy answer is that you probably don't have enough memory in R to perform this action, so one solution is to increase the memory available to R (if that's an option for you). It could also mean that your H2O cluster doesn't have enough memory, so you could increase that as well.

The only way to go directly from R memory to the H2O cluster is the as.h2o() function, so you are definitely using the right command. Under the hood, the as.h2o() function writes the frame from R memory to disk (stored in a temp file) and then reads it directly into the H2O cluster using H2O's native parallel read functionality.
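To make that round trip concrete, here is a minimal sketch of doing the same two steps by hand. It is not the actual as.h2o() implementation: write_svmlight_tmp() is a hypothetical helper name, the three toy lines are made up for illustration, and base R's writeLines() is used only so the file contains the raw SVMLight text with no quoting.

```r
# Hypothetical helper (not part of h2o): write SVMLight lines to a temp file
# and return the path, mirroring the "write to disk" half of the round trip.
write_svmlight_tmp <- function(lines) {
  path <- tempfile(fileext = ".txt")
  writeLines(lines, path)  # one observation per line, written verbatim
  path
}

# Toy data in "label index:value ..." form; the last row has no features.
svm_lines <- c("0 3:12 17:40", "1 5:33", "0 ")
svm_path <- write_svmlight_tmp(svm_lines)

# With a running cluster (after h2o.init()), the file can then be read
# the same way as in the question:
# h2o.orig.1 <- h2o.importFile(svm_path, parse = TRUE)
```

The import call is left commented out so the snippet runs without a cluster; uncomment it once h2o.init() has been called.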

We recently added the ability to use data.table's read/write functionality any place that we use base R. Since you have data.table installed, you should be able to get around this bottleneck by adding this to the top of your script: options("h2o.use.data.table"=TRUE). This forces the use of data.table instead of base R to write to disk for the first half of the as.h2o() conversion process. It should work for you, since it does exactly what your code already does when you use fwrite to write to disk and h2o.importFile() to read the file back in.
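As a small sketch of how that option might be set defensively, the snippet below turns it on only when data.table is actually installed. The variable name use_dt is an assumption for illustration; the option name itself is the one given above, and it needs to be set before as.h2o() is called.

```r
# Enable the data.table code path only if data.table is installed.
# use_dt is an illustrative variable name, not part of h2o.
use_dt <- requireNamespace("data.table", quietly = TRUE)
options("h2o.use.data.table" = use_dt)

# Confirm what the option is currently set to.
getOption("h2o.use.data.table")
```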

Also, you don't need the last line with h2o.parseRaw():

tmp <- as.h2o(mat.agg.dt)
h2o.orig.1 <- h2o.parseRaw(tmp, parse_type="SVMLight")

You can just do:

h2o.orig.1 <- as.h2o(mat.agg.dt)

A related post shows how to use data.table to solve the reverse problem (using as.data.frame() instead of as.h2o()).