The classic Titanic machine learning competition is a fantastic way to apply your machine learning skills and compare your work with others.

Even though the full dataset is publicly available and you could cheat your way to a perfect score, it is far more satisfying to compete fairly and still achieve a good result.

In this notebook, I focus on the preprocessing aspect of machine learning, i.e., cleaning the dataset and engineering new features, which earned a top 4% position on the leaderboard with XGBoost.

This competition is an excellent illustration of the power of understanding your data and making it more useful before diving into specific ML algorithms.

The notebook is divided into five parts:

Chapter 1: Missing values
Chapter 2: Feature engineering
Chapter 3: Assessing the features
Chapter 4: Preparing the data for training
Chapter 5: Building a model to predict Titanic survival
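To make the workflow concrete, here is a minimal sketch of the steps above. It is not the notebook's exact pipeline: the imputation strategy, the FamilySize feature, and the XGBoost hyperparameters are illustrative assumptions, though the column names (Pclass, Sex, Age, Fare, SibSp, Parch, Survived) come from the standard Kaggle train.csv.

```python
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("train.csv")

# Chapter 1 (illustrative): fill missing ages with the median age
# within each passenger class rather than a single global value.
df["Age"] = df.groupby("Pclass")["Age"].transform(lambda s: s.fillna(s.median()))

# Chapter 2 (illustrative): engineer a family-size feature from the
# siblings/spouses and parents/children counts.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Chapter 4: encode the categorical Sex column and select the feature matrix.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
features = ["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch", "FamilySize"]
X, y = df[features], df["Survived"]

# Chapter 5: fit an XGBoost classifier and estimate accuracy with
# 5-fold cross-validation (hyperparameters are assumed, not tuned).
model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
print(cross_val_score(model, X, y, cv=5).mean())
```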

The success of the prediction boils down to the judicious selection of useful features in addition to creating new ones. Below is the correlation matrix of the existing and engineered features in this dataset.
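One way to produce such a matrix, continuing from the sketch above (it reuses the assumed `df` and `features` names): pandas computes the pairwise correlations and seaborn renders them as a heatmap.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations of the (assumed) numeric features plus the target.
corr = df[features + ["Survived"]].corr()

# Annotated heatmap of the correlation matrix.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation of existing and engineered features")
plt.show()
```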


You can find my Jupyter notebook on Kaggle.