Variable Selection for Forecasting, from Plots and Gut Feel to Data-Driven Search Algorithms

Selecting the right drivers for a forecast often matters more than the model class itself. In energy demand forecasting, for example, transforming and selecting weather inputs has delivered accuracy gains between 3.7 and 5.2 percent compared to using raw weather data, a material lift at scale that translates directly into better staffing, purchasing, and hedging decisions (Energy Informatics, 2023). In electricity load cases, curating weather stations and features is a known best practice precisely because it improves forecast skill and business value (Hong, 2015; see also evidence on station selection in Moreno-Carbonell et al., 2020). The broader methodological literature likewise shows that principled selection and shrinkage reduce error and overfitting risk, whether via information criteria, penalization, or Bayesian methods (Tibshirani, 1996; George and McCulloch, 1993).

Below we outline what variable selection is, how it evolved, why exogenous-variable handling can create look-ahead bias, and how to implement modern alternatives, from open source to automated platforms.

A short timeline of variable selection in forecasting

  • Visual lag inspection
    Early forecasters eyeballed scatterplots and lagged correlograms to spot leading indicators, a useful but subjective practice that is hard to scale.
  • Correlation and information criteria
    Correlation screens and stepwise inclusion with AIC or BIC aim to pick parsimonious sets that balance fit and complexity (stepAIC in R’s MASS; discussion of AIC vs BIC trade-offs on Cross Validated). Stepwise search works, but it is greedy, considering one variable at a time, and it can be unstable when predictors are collinear (Zhang, 2016).
  • Penalized regression and sparse models
    Methods like the LASSO perform shrinkage and selection simultaneously, improving out-of-sample generalization in high-dimensional settings (Tibshirani, 1996; time-varying extensions appear in macro and finance, e.g., Kapetanios et al., 2018).
  • Bayesian variable selection and averaging
    Spike-and-slab priors enable probabilistic inclusion and account for model uncertainty, often yielding stronger predictive performance when many candidates and lags are on the table (George and McCulloch, 1993; overview in Ishwaran and Rao, 2005; applications and software in bsts).
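
To make the "shrinkage and selection simultaneously" idea concrete, here is a minimal sketch of the LASSO fitted by cyclic coordinate descent in plain NumPy. The data are synthetic, and the function names, the penalty value alpha=0.2, and the choice of which two candidates carry signal are all illustrative assumptions, not taken from any of the cited studies. The objective has the same form as the one scikit-learn documents for its Lasso estimator, which is what you would normally use in practice.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly zero inside the band."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1/2n)*||y - X b||^2 + alpha*||b||_1 by cyclic coordinate descent.

    Assumes the columns of X are standardised and y is centred.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove every fitted effect except feature j.
            residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ residual / n
            beta[j] = soft_threshold(rho, alpha) / col_sq[j]
    return beta

rng = np.random.default_rng(0)
n, p = 300, 8
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
# Synthetic target: only candidates 0 and 3 carry signal (illustrative assumption).
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=n)
y = y - y.mean()

beta = lasso_coordinate_descent(X, y, alpha=0.2)
selected = [j for j in range(p) if beta[j] != 0.0]
print("coefficients:", np.round(beta, 2))
print("selected candidates:", selected)
```

The coefficients on the two informative candidates are shrunk toward zero, while the penalty sets the noise candidates to exactly zero. In a real forecasting setting the penalty would be tuned with a time-ordered validation scheme rather than fixed by hand.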

The exogenous pitfall, why treating drivers as exogenous can leak the future

Many machine learning and time series models treat drivers (independent variables) as exogenous. If you evaluate such models using the actual future values of the drivers, you leak information that would not have been available at forecast time, which inflates apparent accuracy. Time series evaluation must therefore use rolling or expanding origins and simulate the information set that was available at the forecast date to avoid look-ahead bias (Hyndman, Forecasting: Principles and Practice; see tsCV and rolling-origin examples in Hewamalage et al., 2022, and the practical guide in Hyndman’s blog).
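
The size of the effect is easy to see on synthetic data. The sketch below runs an expanding-window, one-step-ahead evaluation twice: once feeding the model the realised next-period driver (leaky), and once feeding it a driver forecast built only from history (honest). The data-generating process, the AR(1) driver forecast, and all names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Synthetic driver x follows an AR(1); target y depends on the same-period driver.
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.7 * x[t - 1] + rng.normal()
y = 1.5 * x + rng.normal(scale=0.5, size=n)

def fit_slope(a, b):
    """OLS slope of b on a without intercept (series are zero-mean by construction)."""
    return float(a @ b / (a @ a))

leaky_err, honest_err = [], []
for origin in range(100, n - 1):               # expanding window, one step ahead
    x_hist, y_hist = x[: origin + 1], y[: origin + 1]
    slope = fit_slope(x_hist, y_hist)          # y on x, history only
    phi = fit_slope(x_hist[:-1], x_hist[1:])   # AR(1) for the driver, history only

    x_future_actual = x[origin + 1]            # NOT known at the forecast date
    x_future_forecast = phi * x_hist[-1]       # what could actually be known

    leaky_err.append(y[origin + 1] - slope * x_future_actual)
    honest_err.append(y[origin + 1] - slope * x_future_forecast)

leaky_mae = float(np.mean(np.abs(leaky_err)))
honest_mae = float(np.mean(np.abs(honest_err)))
print(f"MAE with realised future driver (leaky): {leaky_mae:.2f}")
print(f"MAE with forecast driver (honest):       {honest_mae:.2f}")
```

Both runs use the same fitted relationship between target and driver; the only difference is where the next-period driver value comes from. The leaky number is the one that will not survive contact with production.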

Econometrics largely moved away from treating many macro drivers as exogenous during the 1970s and 1980s. Christopher Sims’ “Macroeconomics and Reality” (1980) proposed vector autoregressions (VARs), in which all variables are modeled jointly as endogenous. The Sveriges Riksbank Prize in Economic Sciences in 2011 recognized Sims and Sargent for empirical methods that show how shocks propagate, including VARs (Nobel Prize press release, 2011; background in Christiano, 2012). Modeling the system jointly forces you to forecast the drivers as well, which removes the leakage that occurs when you feed realized exogenous values into test folds.
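
As a rough illustration of what "forecasting the drivers as well" means, the following sketch fits a two-variable VAR(1) without intercept by least squares and iterates it forward. The coefficient matrix, noise scale, and sample size are made-up values for a synthetic system; a real project would more likely rely on a library such as the statsmodels VAR class, which also handles lag-order selection.

```python
import numpy as np

rng = np.random.default_rng(2)
A_true = np.array([[0.6, 0.2],
                   [0.1, 0.5]])    # illustrative coefficients for a stable system
n = 400
data = np.zeros((n, 2))            # column 0: target, column 1: driver
for t in range(1, n):
    data[t] = A_true @ data[t - 1] + rng.normal(scale=0.3, size=2)

# VAR(1) without intercept: regress each period on the previous one.
lagged, current = data[:-1], data[1:]
B, *_ = np.linalg.lstsq(lagged, current, rcond=None)
A_hat = B.T                        # so that y_t is approximated by A_hat @ y_{t-1}

def forecast(last_obs, A, steps):
    """Iterate the fitted system forward; the driver is forecast, never observed."""
    path, state = [], last_obs
    for _ in range(steps):
        state = A @ state
        path.append(state)
    return np.array(path)

path = forecast(data[-1], A_hat, steps=6)
print("estimated coefficient matrix:\n", np.round(A_hat, 2))
print("6-step joint forecast (target, driver):\n", np.round(path, 3))
```

Every row of the forecast path contains a predicted driver value alongside the predicted target, so no realised future input is ever needed at evaluation time.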

What good variable selection looks like in 2025

  • Define decision-first targets
    Align KPIs like RMSE, MAE, or MASE with business costs and horizons, and evaluate with rolling-origin procedures so you see true decision-time error (FPP3 and Hewamalage et al., 2022).
  • Search broadly, then shrink
    Assemble candidate features, e.g., lags, calendar signals, weather, prices, policy dummies, and apply penalization or Bayesian selection to control variance while keeping signal (Tibshirani, 1996; George and McCulloch, 1993).
  • Prefer system models when drivers co-move
    When predictors and targets influence each other, move to VAR or VECM so the drivers are forecasted, not borrowed from the future (statsmodels VAR; R vars package).
  • Quantify real gains
    Log feature-set changes with their out-of-sample impact. In energy time series, better weather feature engineering yields measurable gains, for instance the 3.7 to 5.2 percent improvement cited above (Energy Informatics, 2023). Similar domain-specific studies corroborate that targeted exogenous signals raise accuracy when handled correctly (MIT CTL capstone, 2024).
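
For reference, the error metrics named in the first point above can be written in a few lines. This is a minimal sketch in plain Python; the function names and the toy numbers are illustrative, and MASE is computed in its usual form, as the forecast MAE divided by the in-sample MAE of a naive (or seasonal naive) forecast.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5

def mase(actual, predicted, train, season=1):
    """Forecast MAE scaled by the in-sample MAE of a (seasonal) naive forecast."""
    naive_errors = [abs(train[i] - train[i - season]) for i in range(season, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    return mae(actual, predicted) / scale

# Toy numbers, for illustration only.
train = [10.0, 12.0, 11.0, 13.0, 12.0]
actual = [14.0, 13.0]
predicted = [13.0, 13.5]

print("MAE :", mae(actual, predicted))
print("RMSE:", round(rmse(actual, predicted), 4))
print("MASE:", mase(actual, predicted, train))
```

With these toy numbers the forecast MAE is 0.75 and the naive in-sample MAE is 1.5, so MASE is 0.5: the forecast errs half as much as a naive forecast did on the training data. A value above 1 would mean the candidate feature set is not beating the naive benchmark.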

Implementing variable selection, three practical paths

1) Open-source, programmatic workflow
If you need full control and auditability:

  • Python, penalized and Bayesian
    Use scikit-learn for LASSO and elastic net, or PyStan and PyMC for Bayesian models. For system modeling, the statsmodels VAR API supports lag order selection and multi-step forecasting, which avoids leakage because all series are forecast jointly (statsmodels VAR docs; overview in statsmodels VAR guide).
  • R, stepwise and spike-and-slab
    MASS::stepAIC provides AIC-based stepwise search, while bsts implements spike-and-slab priors that perform Bayesian variable selection and model averaging, especially useful with many candidate lags and indicators (stepAIC; bsts manual). For system modeling, the vars package estimates VAR, SVAR, and VECM and includes impulse responses and FEVD for diagnostics (CRAN vars).
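
For readers working in Python who want to see what an AIC-driven stepwise search does, here is a minimal forward-selection sketch in NumPy. It is not a port of MASS::stepAIC, which can also search backward and in both directions; the function names, the Gaussian AIC computed up to an additive constant, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def gaussian_aic(X, y):
    """AIC up to a constant for an OLS fit: n*log(RSS/n) + 2*(number of coefficients)."""
    n = len(y)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))
    return n * np.log(rss / n) + 2 * X.shape[1]

def forward_select(X, y):
    """Greedy forward search: add the candidate that lowers AIC most; stop when none does."""
    n, p = X.shape
    intercept = np.ones((n, 1))
    chosen, remaining = [], list(range(p))
    best_aic = gaussian_aic(intercept, y)
    while remaining:
        scores = [
            (gaussian_aic(np.hstack([intercept, X[:, chosen + [j]]]), y), j)
            for j in remaining
        ]
        aic, j = min(scores)
        if aic >= best_aic:
            break
        best_aic = aic
        chosen.append(j)
        remaining.remove(j)
    return chosen, best_aic

rng = np.random.default_rng(3)
n, p = 200, 6
X = rng.normal(size=(n, p))
# Synthetic target: only candidates 1 and 4 carry signal (illustrative assumption).
y = 2.0 * X[:, 1] - 1.0 * X[:, 4] + rng.normal(size=n)

chosen, aic = forward_select(X, y)
print("order of inclusion:", chosen)
```

Because AIC penalizes complexity only lightly, a search like this can admit an occasional noise candidate after the true drivers, which is one reason the penalized and Bayesian approaches above are often preferred when the candidate set is large.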

2) Structured evaluation for leakage-free accuracy
No matter the toolchain, enforce rolling-origin evaluation and forbid use of realized future exogenous inputs in validation folds. Hyndman’s texts and notes give concrete, reproducible setups for multi-horizon evaluation and tsCV (FPP3; tsCV tutorial; methodological review in Hewamalage et al., 2022).

3) No-code platforms for speed and coverage
For teams that want broad model coverage and modern selection without writing code, platforms like Indicio automate variable search, feature transformations, and benchmarking across statistical, econometric, and ML models, then operationalize the best configurations with proper backtesting, all through a user-friendly interface (Indicio, variable selection). Tools in this category are designed to surface measurable accuracy improvements rapidly, while still enforcing leakage-free evaluation and repeatable pipelines.

Bringing it together, a clean, leakage-free selection pipeline

  • Curate your candidate set
    Domain-informed features, lag structures, interactions, and transformations, including external data like weather or policy calendars where relevant, since these often drive real gains in practice (Energy Informatics, 2023; Hong, 2015).
  • Run selection with shrinkage or Bayesian priors
    Use penalization to stabilize estimates or spike-and-slab to capture model uncertainty (Tibshirani, 1996; bsts).
  • Prefer VAR when causality runs both ways
    Co-evolving drivers and targets should enter a joint system to avoid exogeneity assumptions and look-ahead bias (Sims, 1980; Nobel Prize, 2011).
  • Evaluate exactly as you will operate
    Rolling or prequential evaluation with the correct information set, not random splits, so reported gains persist in production (Hyndman tsCV; Hewamalage et al., 2022).

Bottom line

Variable selection is not a checkbox; it is the backbone of accurate and reliable forecasting. Pair modern selection, shrinkage, and Bayesian averaging with system models when drivers and targets co-move, evaluate with leakage-proof protocols, and you will ship forecasts that stand up in production. If you value speed to impact, consider a no-code platform like Indicio to automate the heavy lifting while still adhering to best-practice evaluation and deployment (Indicio).
