Monday, September 5, 2016

Matching test data factor levels for random forest models in R

When using the random forest learning algorithm in R, the following errors are frequently encountered while predicting against validation or test data:
  1. New factor levels not present in the training data
  2. Type of predictors in new data do not match that of the training data
Both occur because the factor levels or types in the test data do not match those in the training data. As mentioned in many forums and blogs, this can be resolved by matching the levels of the test data to those of the training data as follows:

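for(colName in names(testData)) {
    if (is.factor(trainingData[[colName]])) {
        # factor() maps values by label; assigning levels() directly would relabel them by position
        testData[[colName]] = factor(testData[[colName]], levels = levels(trainingData[[colName]]))
    }
}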

But very often the training data is used to create a model which is persisted as an RDS file. During evaluation, the model is loaded and used for prediction on the test data. In this case the training data won't be available during the prediction.

There is not much information out there on how to match levels when only the model is available. A closer look at the random forest implementation in R shows that the model keeps the level information in its forest$xlevels field. The following code snippet can be used to match levels from the model to the test data:

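model = readRDS(modelFileName)
for(colName in names(testData)) {
    xlevels = model$forest$xlevels[[colName]]
    # Only factor predictors carry character levels; other columns are left untouched
    if (is.character(xlevels)) {
        testData[[colName]] = factor(testData[[colName]], levels = xlevels)
    }
}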


Saturday, August 6, 2016

R Code Snippets

Handling properties file

Sample File: test.properties
key1=value1
key2=value2
key3=value3


Read properties from file
filePath = "/path/to/properties/file"
props = read.table(filePath, header=FALSE, sep="=", row.names=1, strip.white=TRUE, na.strings="NA", stringsAsFactors=FALSE)


Properties can be accessed using their keys by props[key, 1]

Example:
value = props["key1", 1]
print(value)


Prints value1

Loading choices for a Shiny app drop down from properties file
loadChoicesFromPropertiesFile = function(filePath) {
  props = read.table(filePath, header=FALSE, sep="=", row.names=1, strip.white=TRUE, na.strings="NA", stringsAsFactors=FALSE)
  choices = list()
  for(key in row.names(props)) {
    choices[[paste0(props[key, 1])]] = key
  }
  return (choices)
}


Defining the select drop down in ui.R
myOptions = loadChoicesFromPropertiesFile(filePath)
selectInput("myOptions", label = h4("Options"), choices = myOptions)

Thursday, April 2, 2015

Avoiding XPathParser.initXPath(...) infinite loop in Apache Xalan XSLT

The problem and its solution are described explicitly in this mailing list post: http://marc.info/?l=xalan-j-users&m=104758138708998.

During an XSLT transformation using org.apache.xalan.processor.TransformerFactoryImpl, org.apache.xpath.compiler.XPathParser.initXPath() goes into an infinite loop if the XSL file is invalid.

The solution is to set an error listener on the TransformerFactory and rethrow the exception from the listener.

Sample Code:
    // Requires java.util.HashMap, javax.xml.transform.* and javax.xml.transform.stream.*
    String xslFilePath = "/path/to/xsl/file";
    String inputXmlPath = "/path/to/input/xml";
    String outputXmlPath = "/path/to/output/xml";
    TransformerFactory factory = new org.apache.xalan.processor.TransformerFactoryImpl();
    factory.setErrorListener(new ErrorListener() {
        public void warning(TransformerException exception) throws TransformerException {
            throw exception;
        }
        public void fatalError(TransformerException exception) throws TransformerException {
            throw exception;
        }
        public void error(TransformerException exception) throws TransformerException {
            throw exception;
        }
    });

    /*
     * TransformerFactory factory = TransformerFactory.newInstance();
     * uses org.apache.xalan.xsltc.trax.TransformerFactoryImpl which doesn't
     * have the infinite loop problem.
     */

    StreamSource xslStream = new StreamSource(xslFilePath);
    Transformer transformer = factory.newTransformer(xslStream);
    StreamSource in = new StreamSource(inputXmlPath);
    StreamResult out = new StreamResult(outputXmlPath);
    HashMap<String, Object> context = new HashMap<>();
    // set context values into map
    transformer.setParameter("context", context);
    transformer.transform(in, out);
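
The listener wiring can be tried without Xalan on the classpath. The following is a minimal sketch, assuming the JDK's default TransformerFactory rather than the Xalan interpretive one; the class name RethrowingListenerDemo and the helper compiles are made up for illustration. The JDK factory reports a compile error as an exception either way, so this shows only how the rethrowing listener is installed and how an invalid stylesheet surfaces as a TransformerException, not the Xalan loop itself:

```java
import java.io.StringReader;
import javax.xml.transform.ErrorListener;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; not part of Xalan or of the sample code above.
class RethrowingListenerDemo {

    // Rethrows every reported problem so the caller sees it as an exception.
    static final class RethrowingErrorListener implements ErrorListener {
        @Override
        public void warning(TransformerException e) throws TransformerException { throw e; }
        @Override
        public void error(TransformerException e) throws TransformerException { throw e; }
        @Override
        public void fatalError(TransformerException e) throws TransformerException { throw e; }
    }

    // Returns true if the stylesheet text compiles, false if it is rejected.
    static boolean compiles(String xsl) {
        TransformerFactory factory = TransformerFactory.newInstance();
        factory.setErrorListener(new RethrowingErrorListener());
        try {
            factory.newTransformer(new StreamSource(new StringReader(xsl)));
            return true;
        } catch (TransformerException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String valid = "<xsl:stylesheet version=\"1.0\" "
                + "xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
                + "<xsl:template match=\"/\"><out/></xsl:template></xsl:stylesheet>";
        String broken = "<xsl:stylesheet version=\"1.0\""; // not well-formed XML
        System.out.println("valid compiles:  " + compiles(valid));
        System.out.println("broken compiles: " + compiles(broken));
    }
}
```

Keeping the listener as a named class rather than an anonymous one makes it easy to reuse on every factory the application creates.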

Monday, May 12, 2014

Embedding Tomcat8: Minimal

Required dependencies:
  • tomcat-embed-core.jar
  • tomcat-embed-logging-juli.jar
  • servlet-api.jar
  • log4j-1.2.7.jar
  • commons-beanutils-1.8.0.jar
  • commons-codec-1.9.jar
  • commons-collections-3.2.1.jar
  • commons-digester-2.1.jar
  • commons-io-2.4.jar
  • commons-lang-2.6.jar


Sample minimal code to embed and programmatically configure Tomcat 8:

package org.jr.server;

import java.io.File;

import org.apache.catalina.core.StandardContext;
import org.apache.catalina.startup.Tomcat;

public enum Tomcat8Server {
    INSTANCE;
   
    public void startServer(String contextPath, int port) {
        try {
            String docBase = ".";
            Tomcat tomcat = new Tomcat();
            tomcat.setPort(port);
            StandardContext ctx = (StandardContext)tomcat.addContext(contextPath, new File(docBase).getAbsolutePath());
            Tomcat.addServlet(ctx, "mainServlet", new MainServlet());
            ctx.addServletMapping("/*", "mainServlet");
            tomcat.start();
            tomcat.getServer().await();
        } catch(Exception e) {
            e.printStackTrace();
        }
    }
   
    public static void main(String[] args) {
        INSTANCE.startServer("", 8090);
    }
}

Wednesday, October 2, 2013

Remote debugging Java programs using jdb

The command line tool jdb can be a quick and convenient option for debugging Java programs, particularly in environments where running an IDE would be slow or add overhead.

It can be used with any Java application, whether a standalone program with a main method or a web application.

Steps:

Enable remote debugging for the Java application by adding the following JVM parameter:

-agentlib:jdwp=transport=dt_socket,address=8000,server=y,suspend=n

The jdb tool can be invoked on the same machine as the running Java program or a remote machine that can access this machine.

jdb -connect com.sun.jdi.SocketAttach:hostname=hostname,port=8000
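
The two command lines above have to agree on the port: the address= value given to the JVM is the port= value given to jdb. As a small sketch of that relationship, the hypothetical helper below (the class and method names are illustrative only) builds both strings from the same host and port:

```java
// Hypothetical helper; the class and method names are illustrative only.
class JdwpCommands {

    // JVM option that makes the target application listen for a debugger on the given port.
    static String agentOption(int port, boolean suspend) {
        return "-agentlib:jdwp=transport=dt_socket,address=" + port
                + ",server=y,suspend=" + (suspend ? "y" : "n");
    }

    // jdb command line that attaches to that port on the given host.
    static String jdbAttachCommand(String host, int port) {
        return "jdb -connect com.sun.jdi.SocketAttach:hostname=" + host + ",port=" + port;
    }

    public static void main(String[] args) {
        int port = 8000;
        System.out.println(agentOption(port, false));
        System.out.println(jdbAttachCommand("hostname", port));
    }
}
```

Passing true for suspend produces suspend=y, which makes the JVM wait for the debugger to attach before running the application.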



Refer to https://docs.oracle.com/javase/7/docs/technotes/tools/windows/jdb.html for more details.

Saturday, September 28, 2013

Remote debugging Java applications using Eclipse IDE

A Java application can be debugged remotely by attaching an IDE that holds its source, following a client-server approach. The running Java application acts as the server, and the IDE with the source attached acts as the client.

Server (Java application)
The application to be debugged should be started with the following JVM argument for Java 5.0 and beyond.
-agentlib:jdwp=transport=dt_socket,address=8000,server=y,suspend=n

With these options, the application's JVM listens for a debugger connection on port 8000.

Executing java -agentlib:jdwp=help on the command prompt shows the help and list of options.

For versions before Java 5.0, the server should be started with
-Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=8000,suspend=n
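
Before switching to the IDE, it can help to confirm that the application really was started with one of these options. The following is a minimal sketch, assuming the check runs inside the application itself; the class name DebugAgentCheck and the method hasJdwpAgent are hypothetical. It inspects the JVM arguments exposed by the standard RuntimeMXBean:

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper; not part of the JDK or of Eclipse.
class DebugAgentCheck {

    // True if any JVM argument enables JDWP, in either the -agentlib or the older -Xrunjdwp form.
    static boolean hasJdwpAgent(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.startsWith("-agentlib:jdwp") || arg.startsWith("-Xrunjdwp")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Arguments the current JVM was launched with, excluding the main class arguments.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JDWP agent enabled: " + hasJdwpAgent(jvmArgs));
    }
}
```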

Client (Eclipse IDE)
  1. Click Debug Configurations in the Debug button menu.
  2. In the Debug Configurations window, right click Remote Java Application and click New.
  3. Give an appropriate name for the remote debug configuration.
  4. The Connection Type should be Standard (Socket Attach).
  5. Enter the Host on which the Java application is running.
  6. Enter the Port which was used as address= while starting the Java application.
  7. In the Source tab select a project or source jar.
  8. Click Apply to save the settings.
  9. Set appropriate breakpoints in the source and start debugging by clicking Debug in the Debug Configurations window.