Monday, September 5, 2016

Matching test data factor levels for random forest models in R

When using the random forest learning algorithm in R, the following errors are frequently encountered when predicting against validation or test data:
  1. New factor levels not present in the training data
  2. Type of predictors in new data do not match that of the training data
Both occur because the factor levels or column types of the test data do not match those of the training data. As mentioned in many forums and blogs, this can be resolved by matching the levels of the test data to those of the training data as follows:

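for (colName in names(testData)) {
    # factor() re-encodes by label; assigning to levels() would only rename levels by position.
    if (is.factor(trainingData[[colName]])) {
        testData[[colName]] <- factor(testData[[colName]], levels = levels(trainingData[[colName]]))
    }
}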
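
How the levels are matched matters, and a small sketch may make that concrete. It uses only base R and made-up toy vectors (trainColour and testColour are hypothetical names, not from any real data set). It compares assigning to levels(), which renames existing levels by position, with rebuilding the factor through factor(), which keeps each value's label.

```r
# Hypothetical toy data: the test set contains only two of the three
# levels seen in training.
trainColour <- factor(c("blue", "green", "red"))
testColour  <- factor(c("green", "red", "green"))

# Positional renaming: the first test level ("green") is relabelled "blue",
# the second ("red") is relabelled "green".
renamed <- testColour
levels(renamed) <- levels(trainColour)

# Label-based re-encoding: values keep their labels and gain the full level set.
reencoded <- factor(testColour, levels = levels(trainColour))

as.character(renamed)    # "blue" "green" "blue"
as.character(reencoded)  # "green" "red" "green"

# A value never seen in training becomes NA and needs separate handling.
withUnseen <- factor(c("red", "purple"), levels = levels(trainColour))
is.na(withUnseen)        # FALSE TRUE
```

When the test data has exactly the same levels in the same order as the training data, both approaches give the same result. They diverge as soon as the test data is missing a level or orders its levels differently.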

Very often, however, the model is trained once and persisted as an RDS file. At evaluation time the model is loaded and used to predict on the test data, and the training data is no longer available.
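
The persist-and-reload step can be sketched as follows. To stay runnable without a fitted model, a plain list stands in for the model here. Its shape and the temporary file path are assumptions made for illustration only.

```r
# Stand-in for a fitted model: a plain list carrying level information
# (hypothetical, not a real random forest object).
fittedModel <- list(forest = list(xlevels = list(colour = c("blue", "green", "red"))))

modelFileName <- tempfile(fileext = ".rds")  # hypothetical path
saveRDS(fittedModel, modelFileName)          # done once, on the training side

rm(fittedModel)                              # the scoring side starts without it
model <- readRDS(modelFileName)
model$forest$xlevels$colour                  # levels survive the round trip
```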

There is not much information available on how to match levels when only the model is at hand. A closer look at the random forest implementation in R shows that the model keeps the level information in its forest$xlevels field. The following code snippet can be used to match levels from the model to the test data:

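model <- readRDS(modelFileName)
xlevels <- model$forest$xlevels
for (colName in names(xlevels)) {
    # Only factor predictors carry character levels; other entries are skipped.
    if (is.character(xlevels[[colName]])) {
        testData[[colName]] <- factor(testData[[colName]], levels = xlevels[[colName]])
    }
}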
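
The same idea can be wrapped in a small helper, shown here as a sketch. The function name matchLevelsToModel and the mock list are hypothetical. The sketch also assumes that forest$xlevels is a named list holding a character vector for each factor predictor and a non-character placeholder for the others, which is worth verifying with str(model$forest$xlevels) on a real model.

```r
# Hypothetical helper: re-encode factor predictors in newData using a named
# list of levels shaped like model$forest$xlevels.
matchLevelsToModel <- function(newData, xlevels) {
    for (colName in intersect(names(xlevels), names(newData))) {
        lv <- xlevels[[colName]]
        if (is.character(lv)) {
            # Go through character so values are matched by label, not by code.
            newData[[colName]] <- factor(as.character(newData[[colName]]), levels = lv)
        }
    }
    newData
}

# Stand-in for model$forest$xlevels (assumed shape, not read from a real model).
mockXlevels <- list(colour = c("blue", "green", "red"), size = 0)

testData <- data.frame(
    colour = c("green", "purple", "red"),
    size = c(1.5, 2.0, 3.2),
    stringsAsFactors = FALSE
)

matched <- matchLevelsToModel(testData, mockXlevels)
levels(matched$colour)           # "blue" "green" "red"
unseen <- is.na(matched$colour)  # TRUE only for the "purple" row
```

Rows flagged in unseen hold values the model never saw during training. They have to be dropped, imputed, or mapped to a known level before calling predict.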