Time series cross validation and forecasting

How do we do time series cross-validation in machine learning (XGBoost etc.)? I’ve done high-level econometrics, so I’m more interested in the applied code than the theory.

There are many snippets of the process, but no information on the whole pipeline (and even fewer examples use multivariate series):
CV + tuning → best model → forecast.

Bonus marks for ensembles, e.g. using the predictions as inputs to a second round of learning with further CV (stacking).

I’d really appreciate any code or resources on the matter.

Hi Alexandros,

Welcome to the forum.

If you are looking for someone to do your homework, I think it’s only right that you look at a map and look for StruggleVille.com

We have all gone through that so it’s only fair.

What are your findings so far?

I’ve been using sktime to do a lot of the sliding-window and grid-search stuff.
My problem is a 1-step forecast, so for cross-validation I build a time series of 1-step-ahead predictions. Scoring that series is much harsher than the score that comes out of the grid search, and it is the one I use as my model validation score.
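For anyone following along, here is a minimal sketch of that idea: tune on an initial window with time-ordered splits, then walk forward one step at a time to build the harsher out-of-sample series. It uses scikit-learn rather than sktime, with `Ridge` as a stand-in for XGBoost, and the data, `initial_train` and `param_grid` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic multivariate data (assumption: swap in your own features/target).
# Row t of y is assumed to already be the 1-step-ahead target for row t of X.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
y = 0.5 * X["x1"] - 0.3 * X["x2"] + rng.normal(scale=0.1, size=n)

initial_train = 200                      # hypothetical size of the tuning window
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}

# 1) Tune on the initial window with expanding, time-ordered splits (no shuffling).
search = GridSearchCV(
    Ridge(),
    param_grid,
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_mean_squared_error",
)
search.fit(X.iloc[:initial_train], y.iloc[:initial_train])
grid_mse = -search.best_score_

# 2) Walk forward: refit on everything before t, predict row t only.
preds = []
for t in range(initial_train, n):
    model = Ridge(**search.best_params_)
    model.fit(X.iloc[:t], y.iloc[:t])
    preds.append(model.predict(X.iloc[[t]])[0])
walk_forward = pd.Series(preds, index=X.index[initial_train:])
wf_mse = mean_squared_error(y.iloc[initial_train:], walk_forward)

# 3) Refit the chosen configuration on all data for the real forecast.
final_model = Ridge(**search.best_params_).fit(X, y)

print(f"grid-search MSE: {grid_mse:.4f}, walk-forward MSE: {wf_mse:.4f}")
```

The hyperparameters are fixed after step 1 here to keep it short; re-tuning inside the loop is closer to a fully nested scheme but much slower.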
Now I’m working on the ensemble part of the problem. To ensemble, I use the predictions/errors as inputs into another CV model.
I’m learning sktime pipelines so I can build many models on different datasets or subsets, then combine the smaller models with a second layer of models.
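A rough sketch of that two-layer setup, again in plain scikit-learn rather than sktime pipelines. The key point is that the second layer only ever sees out-of-sample predictions from the first layer, and is itself validated walk-forward. The helper `walk_forward_predictions`, the base models and the window sizes are all illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor


def walk_forward_predictions(model, X, y, start):
    """One-step out-of-sample predictions: refit on rows [0, t), predict row t."""
    preds = {}
    for t in range(start, len(X)):
        fitted = clone(model).fit(X.iloc[:t], y.iloc[:t])
        preds[X.index[t]] = fitted.predict(X.iloc[[t]])[0]
    return pd.Series(preds)


# Synthetic data (assumption); y[t] is taken to be the 1-step-ahead target for X[t].
rng = np.random.default_rng(1)
n = 250
X = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
y = 0.5 * X["x1"] + np.sin(X["x2"]) + rng.normal(scale=0.1, size=n)

base_models = {
    "ridge": Ridge(alpha=1.0),
    "tree": DecisionTreeRegressor(max_depth=3, random_state=0),
}

# Layer 1: out-of-sample predictions from each base model.
base_start = 100
level1 = pd.DataFrame(
    {name: walk_forward_predictions(m, X, y, base_start) for name, m in base_models.items()}
)

# Layer 2: the meta-model learns from past base predictions only,
# and is scored on its own walk-forward predictions.
meta_start = 50
y_level1 = y.loc[level1.index]
stacked = walk_forward_predictions(LinearRegression(), level1, y_level1, meta_start)

errors = {name: ((level1[name] - y)[stacked.index] ** 2).mean() for name in base_models}
errors["stacked"] = ((stacked - y[stacked.index]) ** 2).mean()
print(pd.Series(errors))
```

Feeding base-model errors in as extra columns of `level1` works the same way, as long as each error is only known at the time the meta-model uses it.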

P.S. Do you have any material on time series PCA? How do I build a series of the first or second principal component? A simple time-series PCA over a large period would do. Bonus karma for anything with rolling PCA, preferably a PCA that occurs at each time period (if that’s even possible). If I use a GARCH and DCC model, surely I can get a covariance matrix at each point in time, then take the eigenvalues. What’s tripping me up is reconstructing the time series of the 1st, 2nd, etc. components.

Building PCA time series is not difficult once you know how:

Get the first few components:

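import pandas as pd
from sklearn.decomposition import PCA

# rets: DataFrame of asset returns (rows = dates, columns = assets).
# corr_mtx is not defined in this post; rets.corr() is assumed here.
corr_mtx = rets.corr()

num_components = 3
pca = PCA(n_components=num_components)
principal_components = pca.fit_transform(corr_mtx)
col_labels = ['PCA' + str(x) for x in range(num_components)]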

Then you just multiply that with the original returns:


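# Weight each asset's returns by the component loadings, rescaled to sum to 1.
# Note: this rescaling is unstable for components whose loadings nearly cancel.
pca_returns = pd.DataFrame(index=rets.index, columns=col_labels, dtype=float)
for i, comp in enumerate(pca.components_):
    pca_returns.iloc[:, i] = rets.multiply(comp / comp.sum()).sum(axis=1)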

And Robert is your father’s brother
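On the rolling part of the question, one possible sketch is below. It is not the method from the post above: it takes unit-norm eigenvectors of a trailing-window sample covariance (a GARCH/DCC covariance could be swapped in at that step) and applies them to the next row of returns, so each point uses only past data. Eigenvector signs are arbitrary, so each window is aligned with the previous one. The function name `rolling_pca_series`, the window length and the synthetic data are all assumptions.

```python
import numpy as np
import pandas as pd


def rolling_pca_series(rets, window, n_components=2):
    """Component return at t = rets[t] projected on eigenvectors fitted to rows [t-window, t)."""
    cols = [f"PC{k + 1}" for k in range(n_components)]
    out = pd.DataFrame(np.nan, index=rets.index, columns=cols)
    values = rets.to_numpy()
    prev = None
    for t in range(window, len(rets)):
        cov = np.cov(values[t - window:t], rowvar=False)
        _, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
        vecs = eigvecs[:, ::-1][:, :n_components]   # largest components first
        if prev is None:
            flip = np.sign(vecs.sum(axis=0))        # first window: positive net loading
        else:
            flip = np.sign((vecs * prev).sum(axis=0))  # later windows: match previous sign
        flip[flip == 0] = 1.0
        vecs = vecs * flip
        prev = vecs
        out.iloc[t] = values[t] @ vecs
    return out


# Synthetic returns driven by one common factor (assumption, for illustration only).
rng = np.random.default_rng(2)
dates = pd.date_range("2020-01-01", periods=300, freq="B")
factor = rng.normal(scale=0.01, size=300)
loadings = np.array([1.0, 0.8, 1.2, 0.9])
rets = pd.DataFrame(
    factor[:, None] * loadings + rng.normal(scale=0.003, size=(300, 4)),
    index=dates,
    columns=list("ABCD"),
)

pcs = rolling_pca_series(rets, window=60)
print(pcs.dropna().head())
```

The first `window` rows are NaN by construction. Sign alignment only works well while the components stay well separated; when two eigenvalues get close, the components can swap order between windows.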