Logistic regression is a machine learning algorithm primarily used for binary classification. In linear regression we would model the probability with the equation p(X) = β0 + β1X.
The problem is that these predictions are not sensible for classification, since, of course, the true probability must fall between 0 and 1. To avoid this problem, we must model p(X) using a function that gives outputs between 0 and 1 for all values of X. Logistic regression is named after the function used at its core, the logistic function:
p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))
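To see why this function keeps predictions in a sensible range, here is a small illustrative sketch (the coefficient values b0 and b1 are arbitrary placeholders, not fitted values):

import numpy as np

def logistic(x, b0=0.0, b1=1.0):
    # p(X) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)); the output is always between 0 and 1
    return np.exp(b0 + b1 * x) / (1 + np.exp(b0 + b1 * x))

print(logistic(np.array([-10.0, 0.0, 10.0])))  # roughly [0.0000454, 0.5, 0.9999546]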
We will be working with the Titanic Data Set from Kaggle. We’ll be trying to predict a binary classification: survived or deceased.
Let’s begin by implementing logistic regression in Python for classification. We’ll use a “semi-cleaned” version of the Titanic data set; if you use the data set hosted directly on Kaggle, you may need to do some additional cleaning.
Import Libraries
Let’s import some libraries to get started!
Pandas and Numpy for easier analysis.
import pandas as pd
import numpy as np
Seaborn and Matplotlib for data visualization.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
The Data
Let’s start by reading in the titanic_train.csv file into a pandas dataframe.
train = pd.read_csv('titanic_train.csv')
train.info()

RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
Exploratory Data Analysis
Let’s begin some exploratory data analysis! We’ll start by checking out missing data!
Missing Data
We can use seaborn to create a simple heatmap to see where we are missing data!
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Roughly 20 percent of the Age data is missing. The proportion of Age missing is likely small enough for reasonable replacement with some form of imputation. Looking at the Cabin column, it looks like we are just missing too much of that data to do something useful with at a basic level. We’ll probably drop this column later, or change it to another feature like “Cabin Known: 1 or 0”.
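If we did want to keep some of the Cabin signal, the “Cabin Known: 1 or 0” idea could look like the sketch below (an optional aside; the Cabin_Known column name is made up here and isn’t used in the rest of this walkthrough):

# Optional: flag whether a cabin was recorded at all (1) or missing (0)
train['Cabin_Known'] = train['Cabin'].notnull().astype(int)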
Let’s continue on by visualizing some more of the data!
Countplot of people who survived based on their sex.
sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')
Countplot of people who survived based on their Passenger class.
sns.countplot(x='Survived',hue='Pclass',data=train,palette='rainbow')
Distribution of passenger ages.
train['Age'].hist(bins=30,color='darkred',alpha=0.7)
Distribution of the fares paid by passengers.
train['Fare'].hist(color='green',bins=40,figsize=(8,4))
Data Cleaning
We want to fill in the missing age data instead of just dropping those rows. One way to do this is by filling in the mean age of all the passengers (imputation). However, we can be smarter about this and check the average age by passenger class. For example:
sns.boxplot(x='Pclass',y='Age',data=train,palette='winter')
We can see the wealthier passengers in the higher classes tend to be older, which makes sense. We’ll use these average age values to impute based on Pclass for Age.
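As an optional sanity check (not part of the original walkthrough), you can look at the typical age per passenger class directly; the hard-coded values used in impute_age below are approximations of these per-class centres:

# Typical age per passenger class, used to pick the imputation values below
train.groupby('Pclass')['Age'].median()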
def impute_age(cols):
    # Fill a missing age with a typical age for the passenger's class
    Age = cols['Age']
    Pclass = cols['Pclass']
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)
Check that heat map again!
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Let’s go ahead and drop the Cabin column and the rows where Embarked is NaN.
train.drop('Cabin',axis=1,inplace=True)
train.dropna(inplace=True)
Converting Categorical Features
We’ll need to convert categorical features to dummy variables using pandas! Otherwise our machine learning algorithm won’t be able to directly take in those features as inputs.
sex = pd.get_dummies(train['Sex'],drop_first=True)
embark = pd.get_dummies(train['Embarked'],drop_first=True)
train.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)
train = pd.concat([train,sex,embark],axis=1)
Great! Our data is ready for our model!
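As an optional check (not in the original walkthrough), you can confirm that every remaining column is numeric before handing the frame to scikit-learn:

# Every remaining column should now have a numeric dtype
print(train.dtypes)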
Building a Logistic Regression model
Let’s start by splitting our data into a training set and test set (there is another test.csv file that you can play around with in case you want to use all this data for training).
Train Test Split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train.drop('Survived',axis=1),
                                                    train['Survived'], test_size=0.30,
                                                    random_state=101)
Training and Predicting
from sklearn.linear_model import LogisticRegression

logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Let’s move on to evaluate our model!
Evaluation
We can check precision, recall, and f1-score using a classification report!
from sklearn.metrics import classification_report

print(classification_report(y_test,predictions))

             precision    recall  f1-score   support

          0       0.81      0.93      0.86       163
          1       0.85      0.65      0.74       104

avg / total       0.82      0.82      0.81       267
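If you also want the raw counts behind these scores, a confusion matrix is a natural companion to the report above; a minimal sketch using scikit-learn (not shown in the original walkthrough):

from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, predictions))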
This was a brief overview of how to use a logistic regression model with Python. I also demonstrated some useful methods for data cleaning along the way. The full notebook can be found here on GitHub.
Thank You!