Multiple Linear Regression in Python

This is super easy to carry out using the statsmodels package:

    import statsmodels.api as sm
    X = df[["X1", "X2"]]
    X = sm.add_constant(X)
    y = df["y"]
    model = sm.OLS(y, X).fit()
    predictions = model.predict(X)
    model.summary()
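For intuition about what fit() computes: the OLS coefficients are just the least-squares solution for the constant-augmented design matrix. A numpy-only sketch (with hypothetical random data standing in for df) that reproduces what model.params would contain:

```python
import numpy as np

# hypothetical data standing in for df[["X1", "X2"]] and df["y"]
rng = np.random.default_rng(0)
X = rng.random((100, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + 0.01 * rng.standard_normal(100)

# add_constant just prepends a column of ones
X_const = np.column_stack([np.ones(len(X)), X])

# least-squares solve: coefficients should come out close to [1.0, 2.0, -0.5]
params, *_ = np.linalg.lstsq(X_const, y, rcond=None)
print(params)
```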

I love the formula API in statsmodels; it reminds me of my R days.
With your example above:

    import statsmodels.formula.api as smf

    model = smf.ols("y ~ X1 + X2", data=df).fit()

And then you can get all funky:

  • C(X1) treats X1 as categorical and expands it into dummy variables
  • X1:X2 is the product of the two variables
  • X1*X2 expands to both variables plus their product
  • -1 removes the intercept
  • you can also apply functions, e.g. np.log(X1)
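To make the first two bullets concrete, here is a numpy-only sketch of the design-matrix columns those formula terms generate (toy hypothetical arrays; in statsmodels the actual expansion is done for you by patsy):

```python
import numpy as np

X1 = np.array([0, 1, 2, 1])        # categorical-style values
X2 = np.array([1.0, 2.0, 3.0, 4.0])

# C(X1): one dummy column per level, dropping the first (reference) level
levels = np.unique(X1)             # [0, 1, 2]
dummies = (X1[:, None] == levels[None, 1:]).astype(float)

# X1:X2 is the elementwise product; X1*X2 would add both variables too
interaction = X1 * X2

design = np.column_stack([np.ones(len(X1)), dummies, interaction])
print(design.shape)  # (4, 4): intercept, two dummies, interaction
```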

Did not know you could do that!
Especially the variable manipulation directly.
Thanks!


Was also unaware that it is so flexible!

The problem with statsmodels, imo, is that it computes too much when you call fit(), so it is a bit slow. It would be nice if fit() were more flexible about which results to calculate 🙂


I did a little comparison, and it seems scikit-learn is about 3 times faster than statsmodels, but the implementation below is 5 times faster than scikit-learn!

import numpy as np
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

# generate random data-set
np.random.seed(0)
x = np.random.rand(10000, 1)
y = 2 + 3 * x + np.random.rand(10000, 1)
# scikit-learn implementation
regression_model = LinearRegression()
%%timeit 
regression_model.fit(x, y)
499 µs ± 8.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
print(regression_model.intercept_)
print(regression_model.coef_)
[2.49209753]
[[3.006329]]
# Statsmodel implementation
X = sm.add_constant(x)
%%timeit
model = sm.OLS(y, X).fit()
1.43 ms ± 22.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
model.params
array([2.55808002, 2.93655106])
# implementation according to https://www.jstor.org/stable/24305577?seq=1
def lr(x_avg,y_avg,Sxy,Sx,n,new_x,new_y):
    """
    x_avg: average of previous x, if no previous sample, set to 0
    y_avg: average of previous y, if no previous sample, set to 0
    Sxy: covariance of previous x and y, if no previous sample, set to 0
    Sx: variance of previous x, if no previous sample, set to 0
    n: number of previous samples
    new_x: new incoming 1-D numpy array x
    new_y: new incoming 1-D numpy array y
    """
    new_n = n + len(new_x)

    new_x_avg = (x_avg*n + np.sum(new_x))/new_n
    new_y_avg = (y_avg*n + np.sum(new_y))/new_n

    if n > 0:
        x_star = (x_avg*np.sqrt(n) + new_x_avg*np.sqrt(new_n))/(np.sqrt(n)+np.sqrt(new_n))
        y_star = (y_avg*np.sqrt(n) + new_y_avg*np.sqrt(new_n))/(np.sqrt(n)+np.sqrt(new_n))
    elif n == 0:
        x_star = new_x_avg
        y_star = new_y_avg
    else:
        raise ValueError

    new_Sx = Sx + np.sum((new_x-x_star)**2)
    new_Sxy = Sxy + np.sum((new_x-x_star).reshape(-1) * (new_y-y_star).reshape(-1))

    beta = new_Sxy/new_Sx
    alpha = new_y_avg - beta * new_x_avg
    return new_Sxy, new_Sx, new_n, alpha, beta, new_x_avg, new_y_avg
%%timeit
Sxy, Sx, n, alpha, beta, new_x_avg, new_y_avg = lr(0, 0, 0, 0, 0, x.reshape(-1), y.reshape(-1))
98.1 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
print(alpha)
print(beta)
2.508280405287606
3.0061896512716526

Try wrapping numba jit around the function and see what happens
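A sketch of that suggestion. slope_intercept is a hypothetical simplified batch version of lr() (single chunk, no running state), and the try/except lets the example run even without numba installed:

```python
import numpy as np

try:
    from numba import njit  # optional dependency
except ImportError:
    def njit(func):       # fallback: plain Python if numba is missing
        return func

@njit
def slope_intercept(x, y):
    # batch case of lr(): closed-form simple-regression coefficients
    x_avg = x.mean()
    y_avg = y.mean()
    Sx = np.sum((x - x_avg) ** 2)
    Sxy = np.sum((x - x_avg) * (y - y_avg))
    beta = Sxy / Sx
    return y_avg - beta * x_avg, beta

rng = np.random.default_rng(0)
x = rng.random(10000)
y = 2 + 3 * x + rng.random(10000)
alpha, beta = slope_intercept(x, y)  # first call pays the compile cost
```

Since the body is a handful of vectorised numpy reductions (already compiled C), jit may have little left to speed up at this size.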


Hmm, exactly the same time.
I mean, 100 microseconds is probably at the limit of how fast electrons can move around my system, lol.

I wonder if it would help if the function were more complex?


It should make a difference if it's part of a workflow where it's run several thousand times.
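That repeated-call workflow is exactly what the paper's running-state parameters are for: you can feed data in chunks and update the fit without revisiting old samples. A sketch using the standard pairwise merge of sums (Chan et al. style, so slightly different algebra from the lr() above, but the same result):

```python
import numpy as np

def lr_update(x_avg, y_avg, Sxy, Sx, n, new_x, new_y):
    # merge running means and (co)variance sums with a new chunk
    m = len(new_x)
    new_n = n + m
    dx = new_x.mean() - x_avg
    dy = new_y.mean() - y_avg
    new_x_avg = x_avg + dx * m / new_n
    new_y_avg = y_avg + dy * m / new_n
    Sx = Sx + np.sum((new_x - new_x.mean()) ** 2) + dx * dx * n * m / new_n
    Sxy = Sxy + np.sum((new_x - new_x.mean()) * (new_y - new_y.mean())) + dx * dy * n * m / new_n
    return new_x_avg, new_y_avg, Sxy, Sx, new_n

rng = np.random.default_rng(0)
x = rng.random(10000)
y = 2 + 3 * x + rng.random(10000)

state = (0.0, 0.0, 0.0, 0.0, 0)
for start in range(0, 10000, 1000):        # feed the data in 10 chunks
    state = lr_update(*state, x[start:start + 1000], y[start:start + 1000])

x_avg, y_avg, Sxy, Sx, n = state
beta = Sxy / Sx
alpha = y_avg - beta * x_avg
print(np.allclose(np.polyfit(x, y, 1), [beta, alpha]))  # True: matches a full refit
```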
