HURDAT Data Mining

Predicting Atlantic basin tropical cyclone landfalls

In the spring of 2009, I took Data Mining (COMP 790) at UNC. It was my first graduate-level class and I thoroughly enjoyed it -- even if it did get slow at times, and frustrating at others. We covered a variety of data mining and machine learning techniques, and each student was to develop a final project that demonstrated their understanding of one of them. I decided to try predicting whether or not tropical cyclones would make landfall based on the earliest stages of their development.

My initial plan was to use an SVM (Support Vector Machine) to classify storms into two sets: those that made landfall and those that did not. I used the fantastic libSVM implementation of C-SVM with a Radial Basis Function kernel for this portion of the project. Given that SVM is a fairly complex data mining technique, I decided to compare its performance to a fairly naive k-nearest neighbor (k-NN) classifier I developed.
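
For illustration, here is a minimal sketch of what a naive k-nearest neighbor classifier of this kind might look like. It is not the project's actual code: the function name `knn_predict`, the two-number feature vectors, and the toy training points are all assumptions made up for the example, and the labels simply use 1 for landfall and 0 for no landfall.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs, where `features` is a
    tuple of floats and `label` is 1 (landfall) or 0 (no landfall).
    Both the layout and the labels are assumptions for this sketch.
    """
    # Sort every training point by Euclidean distance to the query and
    # keep the k closest. This brute-force scan is what makes naive k-NN slow.
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy, made-up feature vectors (not HURDAT data).
toy_train = [
    ((0.0, 0.0), 0), ((0.1, 0.2), 0), ((0.2, 0.1), 0),
    ((1.0, 1.0), 1), ((0.9, 1.1), 1), ((1.1, 0.9), 1),
]
print(knn_predict(toy_train, (0.95, 1.0)))
print(knn_predict(toy_train, (0.05, 0.1)))
```

Because every prediction scans and sorts the whole training set, a classifier written this way gets slower as the training data grows, whereas a trained SVM only needs to evaluate its support vectors.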

Surprisingly, the k-NN classifier did not perform much worse than the SVM classifier, though it was much slower. My best-performing classifiers of each variety had accuracy around 70-75%, whereas my worst of each type had accuracy close to 50%. I decided to combine the classifiers into a two-step hybrid classifier. Essentially, I took the output of both classifiers, computed a weighted average, and used that value to perform the actual classification. The idea here is that unless an individual classifier is very confident in its prediction for some input data point, it will require support from the other classifier in order for the data point to be classified as a landfall.
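
The weighted-average idea can be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes each classifier produces a landfall score between 0 and 1 (for example, a probability estimate from the SVM and the fraction of landfall votes among the k neighbors), and the function name `hybrid_predict`, the default weight, and the 0.5 threshold are hypothetical choices rather than the values used in the project.

```python
def hybrid_predict(svm_score, knn_score, svm_weight=0.5, threshold=0.5):
    """Combine two landfall scores in [0, 1] into a single decision.

    Returns (is_landfall, combined_score). The weight and threshold
    defaults are placeholders for this sketch, not tuned values.
    """
    if not 0.0 <= svm_weight <= 1.0:
        raise ValueError("svm_weight must be between 0 and 1")
    # Weighted average of the two scores; the k-NN gets the remaining weight.
    combined = svm_weight * svm_score + (1.0 - svm_weight) * knn_score
    return combined >= threshold, combined


# One very confident classifier can carry the decision by itself...
print(hybrid_predict(0.9, 0.2)[0])   # True
# ...but a lukewarm one cannot without support from the other.
print(hybrid_predict(0.6, 0.2)[0])   # False
print(hybrid_predict(0.6, 0.6)[0])   # True
```

With equal weights and a 0.5 threshold, a single classifier has to score well above 0.5 to outvote a doubtful partner, which is the "needs support unless very confident" behavior described above.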

This approach greatly improved my accuracy: combining two classifiers that each performed well on their own pushed it over 80%. More interestingly, this approach seemed to place a lower bound on predictive performance, as even using my two worst-performing classifier models I was able to achieve almost 70% accuracy -- 15-20 percentage points better than either of the individual classifier models was able to do on its own.

I'd also like to note that this was my first project of any scale using Python. The whole project was done using Python and bash scripts on the front end and MySQL on the back end.

Links