I have used Microsoft's "Data Science End to End Walkthrough" to set myself up with R Server, and their example works perfectly.
The example (New York taxi data) uses non-categorical variables (ie distance, taxi fare etc.) to predict a categorical variable (1 or 0 for whether or not a tip was paid).
I am trying to predict a similar binary output using categorical variables as an input, using linear regression (the rxLinMod function), and am coming up with an error.
The error says that the number of parameters does not match the number of variables, however it looks to me like the number of variables is actually the number of levels within each factor (variable).
To Replicate
Create a table called example in SQL Server:
USE [my_database];
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
CREATE TABLE [dbo].[example](
[Person] [nvarchar](max) NULL,
[City] [nvarchar](max) NULL,
[Bin] [integer] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY];
Put data in it:
insert into [dbo].[example] values ('John','London',0);
insert into [dbo].[example] values ('Paul','New York',0);
insert into [dbo].[example] values ('George','Liverpool',1);
insert into [dbo].[example] values ('Ringo','Paris',1);
insert into [dbo].[example] values ('John','Sydney',1);
insert into [dbo].[example] values ('Paul','Mexico City',1);
insert into [dbo].[example] values ('George','London',1);
insert into [dbo].[example] values ('Ringo','New York',1);
insert into [dbo].[example] values ('John','Liverpool',1);
insert into [dbo].[example] values ('Paul','Paris',0);
insert into [dbo].[example] values ('George','Sydney',0);
insert into [dbo].[example] values ('Ringo','Mexico City',0);
I also use a SQL function which returns variables in table format, as that's what it looks like is required from the Microsoft example. Create the function formatAsTable:
USE [my_database];
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
CREATE FUNCTION [dbo].[formatAsTable] (
@City nvarchar(max)='',
@Person nvarchar(max)='')
RETURNS TABLE
AS
RETURN
(
-- Add the SELECT statement with parameter references here
SELECT
@City AS City,
@Person AS Person
);
We now have a table with two categorical variables - Person, and City.
Let's start predicting. In R, run the following:
library(RevoScaleR)
# Set up the database connection
connStr <- "Driver=SQL Server;Server=<servername>;Database=<dbname>;Uid=<uid>;Pwd=<password>"
sqlShareDir <- paste("C:\\AllShare\\",Sys.getenv("USERNAME"),sep="")
sqlWait <- TRUE
sqlConsoleOutput <- FALSE
cc <- RxInSqlServer(connectionString = connStr, shareDir = sqlShareDir,
wait = sqlWait, consoleOutput = sqlConsoleOutput)
rxSetComputeContext(cc)
# Set the SQL which gets our data base
sampleDataQuery <- "SELECT * from [dbo].[example] "
# Set up the data source
inDataSource <- RxSqlServerData(sqlQuery = sampleDataQuery, connectionString = connStr,
colClasses = c(City = "factor",Bin="logical",Person="factor"
),
rowsPerRead=500)
Now, set up the linear regression model.
isWonObj <- rxLinMod(Bin ~ City+Person,data = inDataSource)
Look at the model object:
isWonObj
Notice it looks like this:
...
Total independent variables: 11 (Including number dropped: 3)
...
Coefficients:
Bin
(Intercept) 6.666667e-01
City=London -1.666667e-01
City=New York 4.450074e-16
City=Liverpool 3.333333e-01
City=Paris 4.720871e-16
City=Sydney -1.666667e-01
City=Mexico City Dropped
Person=John -1.489756e-16
Person=Paul -3.333333e-01
Person=George Dropped
Person=Ringo Dropped
It says there are 11 variables, which is fine, as this is the sum of levels in the factors.
Now, when I try to predict the Bin value based on City and Person, I get an error:
First I format the City and Person I want to predict for as a table. Then, I predict using this as an input.
sq<-"SELECT City, Person FROM [dbo].[formatAsTable]('London','George')"
pred<-RxSqlServerData(sqlQuery = sq,connectionString = connStr
, colClasses = c(City = "factor",Person="factor"))
If you check the pred object, it looks as expected:
> head(pred)
City Person
1 London George
Now when I try to predict, I get an error.
scoredOutput <- RxSqlServerData(
connectionString = connStr,
table = "binaryOutput"
)
rxPredict(modelObject = isWonObj, data = pred, outData = scoredOutput,
predVarNames = "Score", type = "response", writeModelVars = FALSE, overwrite = TRUE,checkFactorLevels = FALSE)
The error says:
INTERNAL ERROR: In rxPredict, the number of parameters does not match the number of variables: 3 vs. 11.
I can see where the 11 comes from, but I have only supplied 2 values to the predict query - so I can't see where the 3 comes from, or why there is a problem.
Any assistance is appreciated!
The answer appears to be consistent with how R treats factor variables, however the error message could have made a more clear distinction between factors, levels, variables and parameters.
It appears that the parameters input to generate a prediction cannot simply be a character or factor with no levels. They need to have the same levels as the factors of the same variable used in the model parameterisation.
As such, the following lines:
... should be replaced with this:
I have seen other examples with categorical variables which seem to work without this, but maybe the levels were in there anyway.
I hope this saves someone as many hours as I lost on it!
Edit for SLSvenR's Response
I think my comment regarding having the same levels as the training set still holds.
I can't comment on whether this is good or bad - however it looks like one way to get around this is to apply the levels of the training data to the test set, which I assume you can do in real time.