I'm trying to replicate the procedure proposed here on my data.
target is the categorical variable that I want to predict, and I want to force the first split of the classification tree to be made on split.variable (also categorical). By the nature of the objects, if split.variable is 1, target can only be 1, while if it is 0, target can be 0 or 1. This leads to:
> table(training_set$target, training_set$split.variable)
     0  1
  0 69  0
  1 59 56
I'm able to create tr1 and tr2. Creating tr3 returns an error [Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels] because, if I'm correct, it is "empty", so there is no need for it [see also this post].
tr1 <- ctree(target ~ split.variable, data = training_set, maxdepth = 1) # create the first split at split.variable
tr2 <- ctree(target ~ split.variable + ., data = training_set,           # then the left branch...
             subset = predict(tr1, type = "node") == 2)
fix_ids <- function(x, startid = 1L) {
  id <- startid - 1L
  new_node <- function(x) {
    id <<- id + 1L
    if(is.terminal(x)) return(partynode(id, info = info_node(x)))
    partynode(id,
      split = split_node(x),
      kids = lapply(kids_node(x), new_node),
      surrogates = surrogates_node(x),
      info = info_node(x))
  }
  return(new_node(x))
}
no <- node_party(tr1)
no$kids <- list(
  fix_ids(node_party(tr2), startid = 2L)
  #, fix_ids(node_party(tr3), startid = 5L)
)
no # visualize the structure
[1] root
|   [2] V2 <= 1
|   |   [3] V15 <= -2.489 *
|   |   [4] V15 > -2.489 *
mdf <- model.frame(target ~ split.variable + ., data = training_set)
tr <- party(no,
  data = mdf,
  fitted = data.frame(
    "(fitted)" = fitted_node(no, data = mdf),
    "(response)" = model.response(mdf),
    check.names = FALSE),
  terms = terms(mdf), )
However, running party(...) gives the following error:
Error in kids_node(node)[[i]] : subscript out of bounds
The only reference to this error that I was able to find is this GitHub issue.
Here is the traceback:
8: is.terminal(node)
7: fitted_node(kids_node(node)[[i]], data, vmatch, obs[indx], perm)
6: fitted_node(no, data = mdf)
5: data.frame(`(fitted)` = fitted_node(no, data = mdf), `(response)` = model.response(mdf),
check.names = FALSE)
4: party(no, data = mdf, fitted = data.frame(`(fitted)` = fitted_node(no,
data = mdf), `(response)` = model.response(mdf), check.names = FALSE),
terms = terms(mdf), )
3: .is.positive.intlike(x)
2: .traceback(x, max.lines = max.lines)
1: traceback(party(no, data = mdf, fitted = data.frame(`(fitted)` = fitted_node(no,
data = mdf), `(response)` = model.response(mdf), check.names = FALSE),
terms = terms(mdf), ))
I can't tell whether this is caused by the missing branch, by mlr, or by something else specific to my data.
Your issue
The problem is that in no$kids you define only the first subtree and leave out the second subtree (which consists of just a terminal node). You can simply set this up with the correct id as partynode(5L), i.e., as the second element of the no$kids list, after the fixed-up tr2 subtree. This is already sufficient here.
If the node you are subsetting had an info associated with it (not the case here), you would also have to pass that on through the info argument of partynode().
After that you can follow the steps from the other answer to set up your constparty object.
More generally
I don't understand why you are doing this in the first place. If split.variable = 1 always implies target = 1, then there seems to be no point in modeling that. So why not just model the subset of the data with split.variable = 0?
But even if you decide that you want to model it, ctree chooses split.variable as the first split anyway. So all of this manual forcing of the split does not seem to be necessary in the first place.
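Returning to the fix under "Your issue": the sketch below illustrates, on a hypothetical toy tree rather than the data from the question, that fitted_node() can assign every observation once the root has a partynode for each branch of its split. The data frame dat, the variable x, and the break at 0 are assumptions made up for illustration; only partynode(), partysplit(), and fitted_node() from partykit are used.

```r
library(partykit)

# Toy data (hypothetical): column 1 plays the role of split.variable,
# column 2 is a made-up numeric predictor.
dat <- data.frame(
  split.variable = factor(c(0, 0, 1, 1)),
  x = c(-1, 1, -1, 1)
)

# Stand-in for the fixed-up tr2 subtree: node 2 splits on column 2 (x)
# at an assumed break of 0 and has the terminal nodes 3 and 4.
left <- partynode(2L,
  split = partysplit(2L, breaks = 0),
  kids = list(partynode(3L), partynode(4L)))

# Root splits on column 1 (split.variable). Both kids are supplied:
# the subtree above and a bare terminal node with the next free id, 5.
root <- partynode(1L,
  split = partysplit(1L, index = 1:2),
  kids = list(left, partynode(5L)))

# Terminal node id for each row of dat
ids <- fitted_node(root, data = dat)
print(ids)
```

In the question's terms, this would presumably correspond to setting no$kids to list(fix_ids(node_party(tr2), startid = 2L), partynode(5L)) before calling party().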