Titanic: machine learning from disaster
Jan 8, 2018
2 minute read

A happy new year to one and all. Over the Christmas holidays I had a crack at my first Kaggle competition. Kaggle hosts competitions based around machine learning. The premise: they provide a data set, and you must build a model and predict an outcome.

The competition I took part in, whilst I was filled to the brim with turkey, mince pies, and Costco millionaire shortbread traybake (which is honestly the nicest thing in the world and worth a Costco membership alone), uses data from the sinking of the Titanic. You are given a set of 891 passengers, some characteristics (passenger class, ticket price, age, gender, etc.), and whether they survived or not. Based on this, you must build a binary classification model which is able to predict whether a passenger survives or dies based on their characteristics.
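For anyone who hasn't seen the competition, a minimal sketch of the task in scikit-learn might look something like this. It assumes the train.csv file from the Kaggle competition page; the feature set and model choice here are illustrative, not my actual submission:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Load Kaggle's training data (assumes train.csv from the competition page)
train = pd.read_csv("train.csv")

# A couple of simple features: sex encoded as 0/1, plus class and fare
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
features = ["Pclass", "Sex", "Fare"]

X = train[features]
y = train["Survived"]

# Baseline binary classifier; 5-fold cross-validation gives a rough accuracy
model = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(model, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f}")
```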

I used Python (with the pandas and scikit-learn libraries) in a Jupyter notebook to do my analysis and build the models. I did all of my initial analysis in this notebook, and my final submission is here. The final notebook contains just code, whilst the initial one is annotated with my thought process, contains charts and plots, and is much more detailed.

The hardest part, and where I believe my models lose accuracy, involves the passengers' ages. In the training data set, not all of the passengers' ages are known. As such, in order to include age in the final model, some gaps must be filled. After playing with linear regression (terribly), I decided to bin the ages into bands and used random forests (RF) to predict the age band. I then randomly assigned an age from the predicted age band based on a Gaussian distribution. This process could definitely be improved, as the RF model wasn't particularly accurate on unseen data. In future, I think I would assign the mean age of the band and do more parameter analysis to refine the age model.
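Roughly, the approach looked like the following sketch. The band edges, feature list, and the width of the Gaussian are illustrative assumptions here, not my exact notebook code:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

train = pd.read_csv("train.csv")
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
features = ["Pclass", "Sex", "SibSp", "Parch", "Fare"]

# Bin the known ages into bands (these band edges are illustrative)
edges = [0, 12, 18, 30, 45, 60, 80]
known = train[train["Age"].notna()].copy()
known["AgeBand"] = pd.cut(known["Age"], bins=edges, labels=False)

# Train an RF classifier to predict the age band from other features
band_model = RandomForestClassifier(n_estimators=100, random_state=0)
band_model.fit(known[features], known["AgeBand"])

# Predict a band for each passenger with a missing age
missing = train["Age"].isna()
bands = band_model.predict(train.loc[missing, features])

# Draw an age from a Gaussian centred on the predicted band,
# then clip it so it stays inside the band's limits
lo = np.array(edges)[bands]
hi = np.array(edges)[bands + 1]
mid = (lo + hi) / 2
ages = rng.normal(loc=mid, scale=(hi - lo) / 4)
train.loc[missing, "Age"] = np.clip(ages, lo, hi)
```

Assigning the band's mean age instead would just replace the last two lines with `train.loc[missing, "Age"] = mid`, which removes the sampling noise at the cost of clumping the imputed ages.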

So, how did I do? Well, run the notebooks and submit the predictions! Loljk - the model in the first notebook had an accuracy of 77% on the test data, and my second submission scored a marginally better 78%. In closing, I am just glad that we now have the sense to never again cry “full steam ahead!” in a foolish endeavour regardless of warnings of danger. Or not…
