This is super easy to carry out using the statsmodels package:
import statsmodels.api as sm
X = df[["X1", "X2"]]  # double brackets to select multiple columns
X = sm.add_constant(X)
y = df["y"]
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()
I love the formula API in statsmodels, it reminds me of my R days.
With your example above:
import statsmodels.formula.api as smf
model = smf.ols("y ~ X1 + X2", data=df).fit()
And then you can get all funky:
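For instance (a rough sketch with made-up data standing in for your df — the point is that transformations like np.log() and I() live right inside the formula string, patsy-style):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# toy data standing in for the df from the example above
rng = np.random.default_rng(0)
df = pd.DataFrame({"X1": rng.uniform(1, 10, 200), "X2": rng.uniform(1, 10, 200)})
df["y"] = 1 + 2 * np.log(df["X1"]) + 0.5 * df["X2"] ** 2 + rng.normal(0, 0.1, 200)

# np.log() applies a function to a column; I() escapes arithmetic like squaring,
# all without touching the DataFrame itself
model = smf.ols("y ~ np.log(X1) + I(X2 ** 2)", data=df).fit()
print(model.params)
```

No manual feature engineering needed before the fit.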
Did not know you could do that!
Especially the variable manipulation directly.
Thanks!
Was also unaware that it is so flexible!
The problem with statsmodels imo is that it calculates too much when calling fit(), so it is a bit slow. It would be nice if it were more flexible about which results to calculate when calling fit().
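If all you want is the coefficients and none of the summary statistics, a plain least-squares solve skips all of that — a minimal sketch using the same toy data as the benchmark below:

```python
import numpy as np

# same toy data as the benchmark below
np.random.seed(0)
x = np.random.rand(10000, 1)
y = 2 + 3 * x + np.random.rand(10000, 1)

# solve least squares directly; no p-values, R-squared, etc. are computed
X = np.hstack([np.ones_like(x), x])           # add the intercept column
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # [[intercept], [slope]]
print(coef.ravel())  # ≈ [2.5, 3.0]; the +U(0,1) noise shifts the intercept by ~0.5
```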
I did a little comparison and it seems scikit-learn is about 3 times faster than statsmodels, but the incremental implementation below is 5 times faster than scikit-learn!
import numpy as np
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
# generate random data-set
np.random.seed(0)
x = np.random.rand(10000, 1)
y = 2 + 3 * x + np.random.rand(10000, 1)
# scikit-learn implementation
regression_model = LinearRegression()
%%timeit
regression_model.fit(x, y)
499 µs ± 8.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
print(regression_model.intercept_)
print(regression_model.coef_)
[2.49209753]
[[3.006329]]
# statsmodels implementation
X = sm.add_constant(x)
%%timeit
model = sm.OLS(y, X).fit()
1.43 ms ± 22.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
model.params
array([2.55808002, 2.93655106])
# implementation according to https://www.jstor.org/stable/24305577?seq=1
def lr(x_avg, y_avg, Sxy, Sx, n, new_x, new_y):
    """
    x_avg: average of previous x, if no previous sample, set to 0
    y_avg: average of previous y, if no previous sample, set to 0
    Sxy: covariance of previous x and y, if no previous sample, set to 0
    Sx: variance of previous x, if no previous sample, set to 0
    n: number of previous samples
    new_x: new incoming 1-D numpy array x
    new_y: new incoming 1-D numpy array y
    """
    new_n = n + len(new_x)

    new_x_avg = (x_avg*n + np.sum(new_x))/new_n
    new_y_avg = (y_avg*n + np.sum(new_y))/new_n

    if n > 0:
        x_star = (x_avg*np.sqrt(n) + new_x_avg*np.sqrt(new_n))/(np.sqrt(n)+np.sqrt(new_n))
        y_star = (y_avg*np.sqrt(n) + new_y_avg*np.sqrt(new_n))/(np.sqrt(n)+np.sqrt(new_n))
    elif n == 0:
        x_star = new_x_avg
        y_star = new_y_avg
    else:
        raise ValueError("n must be non-negative")

    new_Sx = Sx + np.sum((new_x - x_star)**2)
    new_Sxy = Sxy + np.sum((new_x - x_star).reshape(-1) * (new_y - y_star).reshape(-1))

    beta = new_Sxy/new_Sx
    alpha = new_y_avg - beta * new_x_avg
    return new_Sxy, new_Sx, new_n, alpha, beta, new_x_avg, new_y_avg
%%timeit
Sxy, Sx, n, alpha, beta, new_x_avg, new_y_avg = lr(0, 0, 0, 0, 0, x.reshape(-1,1), y)
98.1 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
print(alpha)
print(beta)
2.508280405287606
3.0061896512716526
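Since the whole point of this implementation is incremental updates, here's a quick sanity check (repeating lr so the snippet runs standalone) that feeding the data in two batches reproduces the full-batch fit — np.polyfit is used only as the reference:

```python
import numpy as np

# lr() as defined above, repeated so this snippet is self-contained
def lr(x_avg, y_avg, Sxy, Sx, n, new_x, new_y):
    new_n = n + len(new_x)
    new_x_avg = (x_avg*n + np.sum(new_x))/new_n
    new_y_avg = (y_avg*n + np.sum(new_y))/new_n
    if n > 0:
        x_star = (x_avg*np.sqrt(n) + new_x_avg*np.sqrt(new_n))/(np.sqrt(n)+np.sqrt(new_n))
        y_star = (y_avg*np.sqrt(n) + new_y_avg*np.sqrt(new_n))/(np.sqrt(n)+np.sqrt(new_n))
    else:
        x_star, y_star = new_x_avg, new_y_avg
    new_Sx = Sx + np.sum((new_x - x_star)**2)
    new_Sxy = Sxy + np.sum((new_x - x_star).reshape(-1) * (new_y - y_star).reshape(-1))
    beta = new_Sxy/new_Sx
    alpha = new_y_avg - beta * new_x_avg
    return new_Sxy, new_Sx, new_n, alpha, beta, new_x_avg, new_y_avg

np.random.seed(0)
x = np.random.rand(10000)
y = 2 + 3 * x + np.random.rand(10000)

# fit on the first half, then update the running statistics with the second half
Sxy, Sx, n, alpha, beta, x_avg, y_avg = lr(0, 0, 0, 0, 0, x[:5000], y[:5000])
Sxy, Sx, n, alpha, beta, x_avg, y_avg = lr(x_avg, y_avg, Sxy, Sx, n, x[5000:], y[5000:])

# the incremental result matches a full-batch least-squares fit
slope, intercept = np.polyfit(x, y, 1)
print(alpha, beta)  # same intercept and slope as the full-batch fit
```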
Try wrapping numba jit around the function and see what happens
Hmm, exactly the same time.
I mean, 100 microseconds is probably at the limit of how fast electrons can move round my system? lol
Wonder if it would make a difference if the function were more complex?
It should make a difference if it's part of a workflow where it's run several thousand times.