Near, far, wherever you are: that's what Celine Dion sang on the Titanic movie soundtrack, and wherever you are, you can follow this Python machine learning analysis using the Titanic dataset provided by Kaggle.
We are going to make some predictions about this event. Let's get started! First, find the dataset on Kaggle.
Let’s start by adding some libraries.
Pandas is great for handling datasets, while matplotlib and seaborn are libraries for graphics.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
We load the dataset.
train = pd.read_csv("train.csv")
First, let's take a quick look at the data:
train.head()
Here's the Data Dictionary, so we can better understand what each column means. Kaggle defines the columns as follows:

Survived: 0 = No, 1 = Yes
Pclass: ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
Name: passenger name
Sex: sex of the passenger
Age: age in years
SibSp: number of siblings / spouses aboard
Parch: number of parents / children aboard
Ticket: ticket number
Fare: passenger fare
Cabin: cabin number
Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
As we can see, the ship was very big, so there must have been a lot of people aboard. Let's see how many:
train.count()
Ok, we can see 891 rows in total. Some columns have null values; we will deal with those later. Let's see how many men and women were aboard:
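That graphic can be reproduced with a seaborn count plot, presumably something like:

sns.countplot(x='Sex', data=train)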
According to this graphic, there were more men than women. Let's see exactly how many:
train[train['Sex'].str.match("female")].count()   # 314 females
train[train['Sex'].str.match("male")].count()     # 577 males
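As a side note, a quicker way to get both counts at once is value_counts:

train['Sex'].value_counts()   # male: 577, female: 314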
In the movie, we can see actors like Leonardo DiCaprio, Kate Winslet, and Kathy Bates. If you are interested in the full cast and crew, see: Titanic Cast & Crew.
In the cast, Leonardo DiCaprio's character was Jack Dawson, Kate Winslet played Rose DeWitt Bukater, and Kathy Bates played Molly Brown.
Let's take a look at the Name column to search for them:
train[train["Name"].str.contains("Dawson")] No resultsCopy
train[train["Name"].str.contains("Bukater")] No resultsCopy
train[train["Name"].str.contains("Brown")] Click on the image below to see the resultCopy
Molly Brown was a real passenger on the Titanic; you can read more about her on Wikipedia. The article explains that she was a rich woman, so it is no wonder she was traveling in first class. We can also see that most of the men were traveling in third class (as was Jack Dawson in the movie).
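One way to verify that claim directly is a crosstab of sex against class:

pd.crosstab(train['Sex'], train['Pclass'])

For men, the third-class count is larger than the first- and second-class counts combined.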
Fortunately, Molly did survive. Let's see how many people survived, broken down by class:
sns.countplot(x='Survived', hue='Pclass', data=train)
Let's see how many people survived, broken down by sex:
sns.countplot(x='Survived', hue='Sex', data=train)
We can infer that, like Molly, if you were a woman traveling in first class, you probably survived. On the other hand, if you were a man in third class, your chances of surviving were not good.
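To put numbers on that intuition, we can compute the survival rate per sex and class:

train.groupby(['Sex', 'Pclass'])['Survived'].mean()

First-class women survived at a very high rate, while third-class men had by far the lowest survival rate of any group.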
As we saw before, we have some null values for Age. Let's first check how age is distributed per class:
plt.figure(figsize=(10,7))
sns.boxplot(x='Pclass', y='Age', data=train)
Let's write a function that replaces a null age with the average age of the corresponding class:
def add_age(cols):
    Age = cols["Age"]
    Pclass = cols["Pclass"]
    if pd.isnull(Age):
        # fill a missing age with the mean age of that passenger's class
        return int(train[train["Pclass"] == Pclass]["Age"].mean())
    else:
        return Age
Here, we call the function this way:
train["Age"] = train[["Age", "Pclass"]].apply(add_age,axis=1) Copy
The Cabin column has lots of null values, so we simply remove it:
train.drop("Cabin",inplace=True,axis=1) Copy
Finally, we remove some rows with null values:
train.dropna(inplace=True)
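Before moving on, a quick sanity check confirms that no nulls remain in any column:

train.isnull().sum()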
Ok, we are done cleaning the data. Now we are going to convert some categorical data into numeric data, for example the Sex column.
Let's use the get_dummies function from Pandas. It will create two columns, one for male and one for female.
pd.get_dummies(train["Sex"]) Copy
We can drop the first of these columns, because either column already determines the value of the other: if male is 1, then female is 0, and vice versa.
sex = pd.get_dummies(train["Sex"], drop_first=True)
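A quick peek at the result shows a single column named male (holding 1/True for men and 0/False for women, depending on your pandas version):

sex.head()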
Let's do the same for Embarked and Pclass:
embarked = pd.get_dummies(train["Embarked"], drop_first=True)
pclass = pd.get_dummies(train["Pclass"], drop_first=True)
We add these variables to the dataset:
train = pd.concat([train, pclass, sex, embarked], axis=1)
Then, we remove some columns that we are not going to use for our model.
train.drop(["PassengerId","Pclass","Name","Sex","Ticket","Embarked"],axis=1,inplace=True) Copy
Now our dataset is ready for the model.
X will contain all the features, and y will contain the target variable:
X = train.drop("Survived", axis=1)
y = train["Survived"]
We will use train_test_split from scikit-learn's model_selection module to split our data (older tutorials import it from cross_validation, a module that has since been removed). 70% of the data will be training data and 30% will be testing data.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
Let’s use Logistic Regression to train the model:
from sklearn.linear_model import LogisticRegression

logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)
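One caveat: on recent scikit-learn versions the default solver may emit a convergence warning on this data. If that happens, raising the iteration limit, for example LogisticRegression(max_iter=1000), usually resolves it.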
Let's see how accurate our model's predictions are:
predictions = logmodel.predict(X_test)

from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))
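If you just want a single accuracy number, it can also be computed directly:

from sklearn.metrics import accuracy_score
accuracy_score(y_test, predictions)   # roughly 0.78 with this split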
We got 78% accuracy, not bad. Let's look at the confusion matrix:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions)
Taking "survived" as the positive class, scikit-learn arranges the matrix as [[TN, FP], [FN, TP]], which gives us:

True negatives: 107 (we predicted the passenger did not survive, and they did not)
False positives: 21 (we predicted the passenger survived, but they did not)
False negatives: 26 (we predicted the passenger did not survive, but they did)
True positives: 60 (we predicted the passenger survived, and they did)
We can still improve our model, but this tutorial was intended to show how to do some exploratory analysis, clean up data, make predictions, and talk about this event and this wonderful movie.
Thanks!
Diego Lescano.