
Regression Models

- Published on July 10, 2017 at 6:18 am

In my last post I demonstrated how to obtain linear regression parameter estimates in R using only matrices and linear algebra. Using the well-known Boston data set of housing characteristics, I calculated ordinary least-squares parameter estimates via the closed-form solution. In this post I’ll do the same thing in Python using `numpy` arrays, and then compare our estimates to those obtained with the `OLS` function from the `statsmodels` package.

First, let’s import the modules and functions we’ll need. We’ll use `numpy` for matrix and linear algebra. In the last post, we obtained the Boston housing data set from R’s `MASS` library. In Python, we can find the same data set in the `scikit-learn` module.

```python
import numpy as np
import pandas as pd
from numpy.linalg import inv
from sklearn.datasets import load_boston
from statsmodels.regression.linear_model import OLS
```

Next, we can load the Boston data using the `load_boston` function. For those who aren’t familiar with it, the Boston data set contains 13 economic, geographic, and demographic features for 506 tracts in the Boston area, drawn from the 1970 census, as well as the median home price in each of the tracts.

Now we’re ready to start. Recall from my previous post that linear regression typically takes the form:

$$y = X\beta + \epsilon$$

where \(y\) is a vector of the response variable, \(X\) is the matrix of our feature variables (sometimes called the ‘design’ matrix), and \(\beta\) is a vector of parameters that we want to estimate. \(\epsilon\) is the error term; it represents features that affect the response but are not explicitly included in our model.
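To make the notation concrete, here is a minimal sketch (with made-up dimensions and coefficients, not the Boston data) that generates data from this model:

```python
import numpy as np

rng = np.random.default_rng(0)

# simulate y = X @ beta + eps with 100 observations and 3 features
n, p = 100, 3
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.0, 0.5])       # "true" parameters we later estimate
eps = rng.normal(scale=0.1, size=n)     # unobserved error term
y = X @ beta + eps

print(X.shape, y.shape)   # (100, 3) (100,)
```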

From the `boston` object, we will extract the features, which are conveniently already a numpy array, and assign them to `X`. Our target variable is the median home value (in thousands of US dollars) for each tract. We assign the target to the variable `y`.

```python
# load the boston data set
boston = load_boston()

# obtain the feature matrix as a numpy array
X = boston.data

# obtain the target variable as a numpy array
y = boston.target
```

If we examine the features, we can see that `X` is the expected shape.

```python
print(X.shape)
# (506, 13)
```

We can also look at the feature names.

```python
feature_names = boston.feature_names
print(feature_names)
# ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX'
#  'PTRATIO' 'B' 'LSTAT']
```

If you would like to see more details about these features and the data set, you can view the `DESCR` attribute of the `boston` object.

```python
print(boston.DESCR)
```

We will need to add a vector of ones to our feature matrix for the intercept term. We can create it easily in `numpy` and then add it as the first column in our feature matrix.

```python
# create a column vector of ones...
ones = np.ones(shape=y.shape)[..., None]

# ...and prepend it to the feature matrix
X = np.concatenate((ones, X), 1)
```
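An equivalent, slightly more compact way to prepend the intercept column is `np.column_stack`; a sketch with placeholder arrays:

```python
import numpy as np

y = np.arange(5.0)                  # placeholder target, just for the shape
X = np.arange(10.0).reshape(5, 2)   # placeholder feature matrix

# stack a ones column in front of the existing features
X_aug = np.column_stack((np.ones(len(y)), X))
print(X_aug.shape)   # (5, 3)
```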

With the preparatory work out of the way, we can now implement the closed-form solution to obtain OLS parameter estimates.

$$\beta = (X^TX)^{-1}X^Ty$$

We do this in Python using the numpy arrays we just created, the `inv()` function, and the `transpose()` and `dot()` methods.

```python
# calculate coefficients using the closed-form solution
coeffs = inv(X.transpose().dot(X)).dot(X.transpose()).dot(y)
```
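A side note: explicitly inverting \(X^TX\) works fine here, but it can be numerically fragile when features are nearly collinear. `numpy.linalg.lstsq` solves the same least-squares problem without forming the inverse. A sketch on synthetic data (not the Boston set) showing the two approaches agree:

```python
import numpy as np
from numpy.linalg import inv, lstsq

rng = np.random.default_rng(1)
X = np.column_stack((np.ones(50), rng.normal(size=(50, 2))))
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=50)

# closed-form solution, as in the post
coeffs_inv = inv(X.T @ X) @ X.T @ y

# least-squares solver: avoids inverting X^T X explicitly
coeffs_lstsq, *_ = lstsq(X, y, rcond=None)

print(np.allclose(coeffs_inv, coeffs_lstsq))   # True
```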

Let’s examine them to see if they make sense. Here I convert the `coeffs` array to a `pandas` DataFrame and add the feature names as an index.

```python
# extract the feature names of the boston data set and prepend the intercept
feature_names = np.insert(boston.feature_names, 0, 'INT')

# collect results into a DataFrame for pretty printing
results = pd.DataFrame({'coeffs': coeffs}, index=feature_names)
print(results.round(2))
#           coeffs
# INT        36.49
# CRIM       -0.11
# ZN          0.05
# INDUS       0.02
# CHAS        2.69
# NOX       -17.80
# RM          3.80
# AGE         0.00
# DIS        -1.48
# RAD         0.31
# TAX        -0.01
# PTRATIO    -0.95
# B           0.01
# LSTAT      -0.53
```

Now we have our parameter vector β. These are the linear relationships between the median home value and each of the features. For example, we see that an increase of one unit in the number of rooms (RM) is associated with a $3,810 increase in home value. We can find the relationship between the response and any of the feature variables in the same manner, paying careful attention to the units in the data description. Notice that one of our features, ‘CHAS’, is a dummy variable which takes a value of 0 or 1 depending on whether or not the tract is adjacent to the Charles River. The coefficient of ‘CHAS’ tells us that homes in tracts adjacent to the Charles River (coded as 1) have a median price that is $2,690 higher than homes in tracts that do not border the river (coded as 0) when the other variables are held constant.
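The “one-unit increase” interpretation follows directly from the linear form: bumping a single feature by one unit changes the fitted value by exactly that feature’s coefficient. A sketch with made-up numbers (a hypothetical intercept plus two features, loosely echoing INT, RM, and PTRATIO):

```python
import numpy as np

# hypothetical coefficients: intercept first, then two features
coeffs = np.array([36.49, 3.80, -0.95])

x = np.array([1.0, 6.0, 15.0])   # one observation, intercept term first
x_plus = x.copy()
x_plus[1] += 1.0                 # bump the first feature by one unit

# the fitted values differ by exactly that feature's coefficient
diff = x_plus @ coeffs - x @ coeffs
print(round(diff, 2))   # 3.8
```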

Now we’ll estimate the same model using the `OLS` function from `statsmodels` and assign the estimated parameters to `coeffs_lm`.

```python
# create a linear model and extract the parameters
coeffs_lm = OLS(y, X).fit().params
```

Last of all, we place our newly estimated parameters next to our original ones in the `results` DataFrame and compare.

```python
results['coeffs_lm'] = coeffs_lm
print(results.round(2))
#          coeffs  coeffs_lm
# INT       36.49      36.49
# CRIM      -0.11      -0.11
# ZN         0.05       0.05
# INDUS      0.02       0.02
# CHAS       2.69       2.69
# NOX      -17.80     -17.80
# RM         3.80       3.80
# AGE        0.00       0.00
# DIS       -1.48      -1.48
# RAD        0.31       0.31
# TAX       -0.01      -0.01
# PTRATIO   -0.95      -0.95
# B          0.01       0.01
# LSTAT     -0.53      -0.53
```

Once again, our results are identical. Now you know how these estimates are obtained using the closed-form solution. As I mentioned in my R post on the same topic, you’d never actually implement linear regression this way in practice. You would use the `OLS` function from `statsmodels`, or the `LinearRegression` estimator from the `scikit-learn` package if you’d prefer to approach linear regression from a machine learning standpoint. These tools are very quick, require very little code, and provide a number of diagnostic statistics, including \(R^2\), t-statistics, and p-values.
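For completeness, here is a sketch (on synthetic data rather than the Boston set) showing that scikit-learn’s `LinearRegression` recovers the same estimates as the closed-form solution. Note that it adds the intercept itself by default, so we pass it the raw feature matrix without a ones column:

```python
import numpy as np
from numpy.linalg import inv
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=80)

# closed-form solution with an explicit intercept column
X_aug = np.column_stack((np.ones(len(y)), X))
coeffs = inv(X_aug.T @ X_aug) @ X_aug.T @ y

# scikit-learn handles the intercept internally (fit_intercept=True)
lr = LinearRegression().fit(X, y)

print(np.allclose(coeffs, np.r_[lr.intercept_, lr.coef_]))   # True
```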

I have made the code from this post available at my Github here.