Recently I started to play around on kaggle.com. The first competition is Titanic - Machine Learning from Disaster. After finishing the tutorial, you are encouraged to post your own submission. The solution you end up with after completing the tutorial is straightforward, and submitting it gives a Public Score of 0.77511. Surely we can improve on it!

Add validation

We can’t judge the correctness of a solution without comparing its predictions with values that we know are correct. We can split our data into training and validation sets using the train_test_split function from sklearn.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# train_data and test_data are the DataFrames loaded earlier in the tutorial
y = train_data["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

# Hold out part of the training data for validation (25% by default)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)


model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(train_X, train_y)
val_predictions = model.predict(val_X)

mae = mean_absolute_error(val_y, val_predictions)
print(f"MAE {mae}")

With our data divided into training and validation subsets, we can train the model on the training data and check it against the validation data. We use the mean absolute error (MAE) as the measure. Because the labels are only 0 or 1, the MAE is simply the fraction of passengers the model gets wrong. The current model has an MAE of 0.21524663677130046. Let’s see if we can improve it.
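To make the metric concrete: with labels that are only 0 or 1, each absolute error is either 0 (correct) or 1 (wrong), so the MAE equals the share of wrong predictions, i.e. 1 minus accuracy. A minimal sketch in plain Python, with made-up labels that are not taken from the Titanic data and hypothetical helper names:

```python
# Made-up labels for illustration only (not from the Titanic data).
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]


def mean_absolute_error_binary(y_true, y_pred):
    """Mean of |true - predicted| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


mae = mean_absolute_error_binary(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(f"MAE {mae}, accuracy {acc}")
```

Two of the eight predictions are wrong, so the MAE and the accuracy add up to 1. A lower MAE therefore means the same thing as a higher accuracy here.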

Selecting model parameters

The random forest classifier accepts many parameters; here we tune two of them, the number of estimators and the max depth. We can write a function that helps us find the best pair of values. Since the model is small and trains quickly, we can simply iterate over all candidate pairs and keep the one with the lowest MAE.

def get_mae(n_estimators, max_depth, train_X, val_X, train_y, val_y):
    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=1)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

candidate_max_depth = [3, 4, 5, 6]
candidate_n_estimators = [100, 120, 150, 200, 300]


# Try every (max_depth, n_estimators) pair and record its MAE
scores = {
    (depth, n_estimators): get_mae(n_estimators, depth, train_X, val_X, train_y, val_y)
    for depth in candidate_max_depth
    for n_estimators in candidate_n_estimators
}
best_tree_size = min(scores, key=scores.get)
print(f"Best tree size {best_tree_size}")

In our case, the best pair is a max depth of 3 with 120 estimators.
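The exhaustive search can also be written as a small reusable helper built on itertools.product. This is only a sketch: grid_search and toy_score are hypothetical names, and toy_score is a stand-in so the snippet runs on its own. In the notebook you would pass a function that calls get_mae instead.

```python
from itertools import product


def grid_search(score_fn, **candidates):
    """Evaluate score_fn on every combination of candidate values.

    Returns (best_params, scores), where best_params has the lowest score.
    """
    names = list(candidates)
    scores = {}
    for values in product(*(candidates[name] for name in names)):
        params = dict(zip(names, values))
        scores[values] = score_fn(**params)
    best_values = min(scores, key=scores.get)
    return dict(zip(names, best_values)), scores


# Stand-in scoring function (hypothetical), lowest at max_depth=4, n_estimators=150.
def toy_score(max_depth, n_estimators):
    return abs(max_depth - 4) + abs(n_estimators - 150) / 100


best, scores = grid_search(
    toy_score,
    max_depth=[3, 4, 5, 6],
    n_estimators=[100, 120, 150, 200, 300],
)
print(best)
```

With the functions from this post, the scoring function could be a lambda that forwards max_depth and n_estimators to get_mae together with the four data splits. Note that min returns the first pair it encounters when several pairs share the lowest score, so ties are broken by iteration order.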

Add age

Many rows have no age. Let’s fill them in with a value close to the average age of 28 (https://www.shiftcomm.com/insights/never-let-go-titanic-survival-101/). The code below uses 25.

features = ["Pclass", "Sex", "Age", "Parch", "SibSp"]
# Fill the missing ages with a constant
X = pd.get_dummies(train_data[features]).fillna(25)
X_test = pd.get_dummies(test_data[features]).fillna(25)

MAE 0.19730941704035873

If we fill the missing ages with 1 instead:

MAE 0.20179372197309417

So it looks like the value used to fill the gaps does affect the result.
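Instead of hard-coding the fill value, it can be computed from the training data. The sketch below uses the median age, which is a swapped-in choice rather than what the notebook does, and tiny made-up DataFrames (train_df and test_df are hypothetical names) so that it runs on its own.

```python
import pandas as pd

# Tiny made-up frames standing in for train_data and test_data.
train_df = pd.DataFrame({"Pclass": [1, 3, 2, 3], "Age": [38.0, None, 26.0, 35.0]})
test_df = pd.DataFrame({"Pclass": [3, 1], "Age": [None, 47.0]})

# Compute the fill value from the training data only...
age_fill = train_df["Age"].median()

# ...and use it for both the training and the test features.
train_filled = train_df.fillna({"Age": age_fill})
test_filled = test_df.fillna({"Age": age_fill})

print(age_fill)
print(test_filled)
```

Passing a dict to fillna limits the fill to the Age column, and reusing the training value for the test frame keeps both sets of features consistent.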

After applying all those changes, we end up with a Public Score of 0.77990, a small improvement over the 0.77511 we started with.

Conclusion

In this example, adding the Age feature improved the result, and the value used to fill the missing data made a difference too. The model can be improved further, but for now I lack the skills to do it.

You can find my notebook here