When using the random forest learning algorithm in R, the following errors are frequently encountered when predicting against validation or test data:
- New factor levels not present in the training data
- Type of predictors in new data do not match that of the training data
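Before predicting, it can help to find out which columns would trigger these errors. The following is a minimal base-R sketch, not a definitive check: the two data frames hold made-up toy values, and the names `typeMismatch` and `newLevels` are introduced here purely for illustration.

```r
# Toy stand-ins for the real data frames (illustrative values only)
trainingData <- data.frame(
  color = factor(c("red", "green", "blue")),
  size  = c(1.0, 2.5, 3.0)
)
testData <- data.frame(
  color = factor(c("red", "purple")),
  size  = c("1.0", "2.0"),
  stringsAsFactors = FALSE
)

shared <- intersect(names(trainingData), names(testData))

# Columns whose class differs between training and test data
typeMismatch <- shared[vapply(shared, function(col) {
  !identical(class(trainingData[[col]]), class(testData[[col]]))
}, logical(1))]

# Test values that never appeared as a training level, per factor column
factorCols <- shared[vapply(shared, function(col) {
  is.factor(trainingData[[col]])
}, logical(1))]
newLevels <- lapply(factorCols, function(col) {
  setdiff(unique(as.character(testData[[col]])), levels(trainingData[[col]]))
})
names(newLevels) <- factorCols
newLevels <- Filter(length, newLevels)  # keep only columns with unseen values

print(typeMismatch)
print(newLevels)
```

With this toy data, `size` is reported as a type mismatch (numeric in training, character in test) and `purple` is reported as a value of `color` that was not seen in training.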
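When the training data is still at hand, a common fix is to give each factor column of the test data the same levels as the training data:
for (colName in names(testData)) {
  if (is.factor(trainingData[[colName]])) {
    # Re-encode by label rather than assigning to levels(), which only renames by position; unseen values become NA
    testData[[colName]] <- factor(as.character(testData[[colName]]), levels = levels(trainingData[[colName]]))
  }
}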
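One caveat is worth illustrating when matching levels. Assigning to `levels()` renames the existing levels in order, while `factor(x, levels = ...)` looks each value up by its label. The sketch below uses made-up toy values to show the difference when the test column contains only some of the training levels.

```r
trainLevels <- c("blue", "green", "red")        # levels seen during training (toy values)
testColor   <- factor(c("red", "blue", "red"))  # test column holds only two of them

# Positional relabelling: the integer codes stay, only the labels are swapped in order
relabelled <- testColor
levels(relabelled) <- trainLevels

# Re-encoding by label: each value is looked up in the training levels
reencoded <- factor(as.character(testColor), levels = trainLevels)

print(as.character(relabelled))
print(as.character(reencoded))
```

In the relabelled copy every `red` silently turns into `green`, because `red` was the second level of the test factor and `green` is the second training level. The re-encoded copy keeps the original values and only widens the set of levels.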
But very often the training data is used to create a model which is persisted as an RDS file. During evaluation, the model is loaded and used for prediction on the test data. In this case the training data won't be available during the prediction.
There is not much information out there on how to match levels when only the model is available. A closer look at the random forest implementation in R shows that the model keeps the level information of its predictors in the forest$xlevels field. The following code snippet can be used to match levels from the model to the test data:
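model <- readRDS(modelFileName)
for (colName in names(testData)) {
  modelLevels <- model$forest$xlevels[[colName]]
  # Only factor predictors carry character levels; other columns are left untouched
  if (is.character(modelLevels)) {
    testData[[colName]] <- factor(as.character(testData[[colName]]), levels = modelLevels)
  }
}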
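The same idea can be wrapped in a small helper that also reports values the model has never seen. This is only a sketch under stated assumptions: `alignToLevels` is a hypothetical name, and `fakeModel` is a plain list that mimics the shape of `model$forest$xlevels` so the example runs without a fitted model.

```r
# Hypothetical helper: align factor columns of newData with a list of levels
# shaped like model$forest$xlevels (a character vector per factor predictor).
alignToLevels <- function(newData, xlevels) {
  for (colName in intersect(names(xlevels), names(newData))) {
    lv <- xlevels[[colName]]
    if (!is.character(lv)) next  # skip predictors without factor levels
    values <- as.character(newData[[colName]])
    unseen <- setdiff(unique(values[!is.na(values)]), lv)
    if (length(unseen) > 0) {
      warning(sprintf("Column '%s' has values unseen in training: %s",
                      colName, paste(unseen, collapse = ", ")))
    }
    newData[[colName]] <- factor(values, levels = lv)  # unseen values become NA
  }
  newData
}

# Stand-in for a fitted model (assumption), saved and reloaded as an RDS file
fakeModel <- list(forest = list(xlevels = list(color = c("blue", "green", "red"), size = 0)))
modelFileName <- tempfile(fileext = ".rds")
saveRDS(fakeModel, modelFileName)
model <- readRDS(modelFileName)

testData <- data.frame(color = c("red", "purple"), size = c(1, 2), stringsAsFactors = TRUE)
aligned <- suppressWarnings(alignToLevels(testData, model$forest$xlevels))

print(levels(aligned$color))
print(is.na(aligned$color))
```

Values that were not present during training end up as NA after alignment, so the caller still has to decide how to treat those rows before prediction.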