Multiple Linear Regression in Python

This is super easy to carry out using the statsmodels package:

    import statsmodels.api as sm
    X = df[["X1", "X2"]]
    X = sm.add_constant(X)
    y = df["y"]
    model = sm.OLS(y, X).fit()
    predictions = model.predict(X)
    model.summary()
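For intuition about what fit() computes: the OLS coefficients are just the least-squares solution for the constant-augmented design matrix. A numpy-only sketch (with hypothetical random data standing in for df) that reproduces what model.params would contain:

```python
import numpy as np

# hypothetical data standing in for df[["X1", "X2"]] and df["y"]
rng = np.random.default_rng(0)
X = rng.random((100, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + 0.01 * rng.standard_normal(100)

# add_constant just prepends a column of ones
X_const = np.column_stack([np.ones(len(X)), X])

# least-squares solve: coefficients should come out close to [1.0, 2.0, -0.5]
params, *_ = np.linalg.lstsq(X_const, y, rcond=None)
print(params)
```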

I love the formula API in statsmodels; it reminds me of my R days.
With your example above:

    import statsmodels.formula.api as smf

    model = smf.ols("y ~ X1 + X2", data=df).fit()

And then you can get all funky:

  • C(X1) treats X1 as categorical and expands it into dummy variables
  • X1:X2 is the product of the two variables
  • X1*X2 expands to both variables plus their product
  • -1 removes the intercept
  • you can also apply functions, e.g. np.log(X1)
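To make the first two bullets concrete, here is a numpy-only sketch of the design-matrix columns those formula terms generate (toy hypothetical arrays; in statsmodels the actual expansion is done for you by patsy):

```python
import numpy as np

X1 = np.array([0, 1, 2, 1])        # categorical-style values
X2 = np.array([1.0, 2.0, 3.0, 4.0])

# C(X1): one dummy column per level, dropping the first (reference) level
levels = np.unique(X1)             # [0, 1, 2]
dummies = (X1[:, None] == levels[None, 1:]).astype(float)

# X1:X2 is the elementwise product; X1*X2 would add both variables too
interaction = X1 * X2

design = np.column_stack([np.ones(len(X1)), dummies, interaction])
print(design.shape)  # (4, 4): intercept, two dummies, interaction
```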

Did not know you could do that!
Especially the variable manipulation directly.
Thanks!


Was also unaware that it is so flexible!

The problem with statsmodels, imo, is that it computes too much when you call fit(), so it is a bit slow. It would be nice if fit() were more flexible about which results to calculate 🙂


I did a little comparison, and it seems scikit-learn is about 3 times faster than statsmodels, but the implementation below is 5 times faster than scikit-learn!

import numpy as np
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

# generate random data-set
np.random.seed(0)
x = np.random.rand(10000, 1)
y = 2 + 3 * x + np.random.rand(10000, 1)
# scikit-learn implementation
regression_model = LinearRegression()
%%timeit 
regression_model.fit(x, y)
499 µs ± 8.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
print(regression_model.intercept_)
print(regression_model.coef_)
[2.49209753]
[[3.006329]]
# Statsmodel implementation
X = sm.add_constant(x)
%%timeit
model = sm.OLS(y, X).fit()
1.43 ms ± 22.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
model.params
array([2.55808002, 2.93655106])
# implementation according to https://www.jstor.org/stable/24305577?seq=1
def lr(x_avg,y_avg,Sxy,Sx,n,new_x,new_y):
    """
    x_avg: average of previous x, if no previous sample, set to 0
    y_avg: average of previous y, if no previous sample, set to 0
    Sxy: covariance of previous x and y, if no previous sample, set to 0
    Sx: variance of previous x, if no previous sample, set to 0
    n: number of previous samples
    new_x: new incoming 1-D numpy array x
    new_y: new incoming 1-D numpy array y
    """
    new_n = n + len(new_x)

    new_x_avg = (x_avg*n + np.sum(new_x))/new_n
    new_y_avg = (y_avg*n + np.sum(new_y))/new_n

    if n > 0:
        x_star = (x_avg*np.sqrt(n) + new_x_avg*np.sqrt(new_n))/(np.sqrt(n)+np.sqrt(new_n))
        y_star = (y_avg*np.sqrt(n) + new_y_avg*np.sqrt(new_n))/(np.sqrt(n)+np.sqrt(new_n))
    elif n == 0:
        x_star = new_x_avg
        y_star = new_y_avg
    else:
        raise ValueError

    new_Sx = Sx + np.sum((new_x-x_star)**2)
    new_Sxy = Sxy + np.sum((new_x-x_star).reshape(-1) * (new_y-y_star).reshape(-1))

    beta = new_Sxy/new_Sx
    alpha = new_y_avg - beta * new_x_avg
    return new_Sxy, new_Sx, new_n, alpha, beta, new_x_avg, new_y_avg
%%timeit
Sxy, Sx, n, alpha, beta, new_x_avg, new_y_avg = lr(0, 0, 0, 0, 0, x.reshape(-1), y.reshape(-1))
98.1 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
print(alpha)
print(beta)
2.508280405287606
3.0061896512716526

Try wrapping numba jit around the function and see what happens
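A sketch of that suggestion. slope_intercept is a hypothetical simplified batch version of lr() (single chunk, no running state), and the try/except lets the example run even without numba installed:

```python
import numpy as np

try:
    from numba import njit  # optional dependency
except ImportError:
    def njit(func):       # fallback: plain Python if numba is missing
        return func

@njit
def slope_intercept(x, y):
    # batch case of lr(): closed-form simple-regression coefficients
    x_avg = x.mean()
    y_avg = y.mean()
    Sx = np.sum((x - x_avg) ** 2)
    Sxy = np.sum((x - x_avg) * (y - y_avg))
    beta = Sxy / Sx
    return y_avg - beta * x_avg, beta

rng = np.random.default_rng(0)
x = rng.random(10000)
y = 2 + 3 * x + rng.random(10000)
alpha, beta = slope_intercept(x, y)  # first call pays the compile cost
```

Since the body is a handful of vectorised numpy reductions (already compiled C), jit may have little left to speed up at this size.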


Hmm, exactly the same time.
I mean, 100 microseconds is probably at the limit of how fast electrons can move around my system, lol.

I wonder if it would help if the function were more complex?


It should make a difference if it's part of a workflow where it's run several thousand times.
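That repeated-call workflow is exactly what the paper's running-state parameters are for: you can feed data in chunks and update the fit without revisiting old samples. A sketch using the standard pairwise merge of sums (Chan et al. style, so slightly different algebra from the lr() above, but the same result):

```python
import numpy as np

def lr_update(x_avg, y_avg, Sxy, Sx, n, new_x, new_y):
    # merge running means and (co)variance sums with a new chunk
    m = len(new_x)
    new_n = n + m
    dx = new_x.mean() - x_avg
    dy = new_y.mean() - y_avg
    new_x_avg = x_avg + dx * m / new_n
    new_y_avg = y_avg + dy * m / new_n
    Sx = Sx + np.sum((new_x - new_x.mean()) ** 2) + dx * dx * n * m / new_n
    Sxy = Sxy + np.sum((new_x - new_x.mean()) * (new_y - new_y.mean())) + dx * dy * n * m / new_n
    return new_x_avg, new_y_avg, Sxy, Sx, new_n

rng = np.random.default_rng(0)
x = rng.random(10000)
y = 2 + 3 * x + rng.random(10000)

state = (0.0, 0.0, 0.0, 0.0, 0)
for start in range(0, 10000, 1000):        # feed the data in 10 chunks
    state = lr_update(*state, x[start:start + 1000], y[start:start + 1000])

x_avg, y_avg, Sxy, Sx, n = state
beta = Sxy / Sx
alpha = y_avg - beta * x_avg
print(np.allclose(np.polyfit(x, y, 1), [beta, alpha]))  # True: matches a full refit
```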
