In the energy sector, time series forecasting is important because detecting bottlenecks early requires precise consumption and generation forecasts. In previous contributions, we discussed which models suit specific time series. However, several methods may be suitable, each with advantages and disadvantages for the particular application in which the forecast will be used. Therefore, it’s worthwhile to make a finer distinction, particularly among non-parametric models, and to compare conventional machine learning methods with deep learning models. Here, the focus is less on the properties of the time series and more on the models and their suitability for specific applications.
Conventional Machine Learning
Conventional machine learning encompasses several methods, including:
- Linear Regression,
- Support Vector Machines (SVM),
- k-Nearest Neighbors (kNN),
- Decision Trees,
- Random Forest.
Here, we will focus primarily on tree learning methods and their boosting, providing only a brief description of the advantages and disadvantages of the other methods without delving deeper into their functioning. The following paragraphs rely on [1, 3, 4].
Linear regression is well-suited for processing large datasets, including time series. It is notably easy to implement, understand, and interpret. However, it has limitations when dealing with nonlinear relationships between features (e.g., the time of day or the current weekday) and the predicted variable. Consequently, substantial effort may be required to construct features whose relationship to the target is approximately linear.
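One common way to make calendar features more usable for a linear model is a cyclical sine/cosine encoding. The following is a minimal sketch of that idea, not necessarily the feature engineering used for the forecasts discussed here; the function name `cyclical_encode` is an illustrative assumption.

```python
import math

def cyclical_encode(value, period):
    """Map a periodic value (e.g. hour 0-23) onto the unit circle."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)

# Hour 23 and hour 0 are adjacent in time. The encoding keeps them close,
# whereas the raw values 23 and 0 are far apart.
hour_features = [cyclical_encode(h, 24) for h in range(24)]
weekday_features = [cyclical_encode(d, 7) for d in range(7)]
```

Each calendar value becomes two features, so a linear model can weight the sine and cosine parts separately to approximate a smooth daily or weekly pattern.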
Support Vector Machines
Support Vector Machines (SVM) were originally formulated for binary classification; multi-class problems and regression (support vector regression) are handled through extensions. They can handle a large number of features and are fairly robust against overfitting. However, they were not designed for time series data and may struggle to capture temporal dependencies effectively. Additionally, they require extensive data preprocessing and hyperparameter tuning.
k-Nearest Neighbors (kNN) methods are suitable for small datasets that need to be processed quickly without prior training. They recognize local patterns, making them a good fit for real-time predictions on streaming data when computational resources are limited. However, with larger datasets and many features, kNN becomes computationally intensive. Choosing the value of ‘k’ can take time and effort, as a poorly chosen value leads to overfitting or underfitting. Furthermore, kNN has difficulty identifying complex, larger patterns and is susceptible to random noise in the data.
Decision Trees, Random Forest, and Boosting
Decision trees, including their derived ensemble and boosting methods, share many common characteristics and are thus discussed together.
Decision trees learn a hierarchy of decision rules from the provided training data. They excel at capturing nonlinear relationships in time series while remaining highly interpretable. They can also handle missing data and outliers. However, a drawback is that minor variations in the data can significantly influence the results, and decision trees tend to overfit the training data. A general issue with methods based on decision trees is their difficulty in generalizing beyond the training data (Figure 1). Figure 1 shows how a decision tree-based model learns the seasonal component but not the linear trend: each leaf predicts an average of training targets, so the forecast cannot leave the range of values seen during training.
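The extrapolation limit can be illustrated with a small, self-written regression tree. This is a simplified sketch rather than a production implementation; the synthetic signal, the feature choice (time index and hour of day), and the function names are assumptions made for illustration.

```python
import math

def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def fit_tree(X, y, depth):
    """Greedy regression tree; each leaf stores the mean of its training targets."""
    leaf = sum(y) / len(y)
    if depth == 0 or len(y) < 4:
        return leaf
    best = None  # (error, feature index, threshold)
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for row, t in zip(X, y) if row[f] <= thr]
            right = [t for row, t in zip(X, y) if row[f] > thr]
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, f, thr)
    if best is None:  # all rows identical, no split possible
        return leaf
    _, f, thr = best
    left_rows = [(row, t) for row, t in zip(X, y) if row[f] <= thr]
    right_rows = [(row, t) for row, t in zip(X, y) if row[f] > thr]
    return (f, thr,
            fit_tree([r for r, _ in left_rows], [t for _, t in left_rows], depth - 1),
            fit_tree([r for r, _ in right_rows], [t for _, t in right_rows], depth - 1))

def predict(tree, row):
    """Follow the decision rules until a leaf (a plain number) is reached."""
    while isinstance(tree, tuple):
        f, thr, left, right = tree
        tree = left if row[f] <= thr else right
    return tree

def signal(t):
    """Synthetic series: linear trend plus a daily seasonal component."""
    return 0.05 * t + math.sin(2 * math.pi * t / 24)

X_train = [[t, t % 24] for t in range(240)]
y_train = [signal(t) for t in range(240)]
tree = fit_tree(X_train, y_train, depth=6)

future_t = 480  # far beyond the training period
forecast = predict(tree, [future_t, future_t % 24])
```

Because every leaf is a mean of training targets, `forecast` stays at or below `max(y_train)`, although the true signal has kept rising with the trend.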
The problem of decision tree overfitting can be mitigated by forming an ensemble of multiple decision trees (Random Forest). To avoid producing the same tree repeatedly, the individual trees are trained on random subsets of the training data and features. The size of these subsets, as well as the depth of the decision trees (the number of successive splits) and other parameters, can be adjusted during hyperparameter tuning. For the final prediction of the ensemble, the predictions of all decision trees are averaged in regression tasks, while in classification tasks, a majority vote is taken. This ensemble of decision trees retains most of the strengths of a single decision tree. However, training time increases, and the model becomes harder to interpret.
The ensemble idea is developed further in boosting methods, which typically build many shallower decision trees. Different ways of creating and weighting the trees lead to different methods; bagging is listed first for comparison, as it underlies Random Forest rather than being a boosting method itself:
- Bagging (Bootstrap Aggregation): Similar to Random Forest, each decision tree is trained on a different subset of the training data. The subsets are drawn by bootstrapping, i.e., sampling with replacement. For regression, the predictions of all decision trees are averaged, while for classification, a majority vote is taken.
- AdaBoost (Adaptive Boosting): Decision trees are created sequentially, and each new tree is improved by increasing the weights of the training samples that the previous trees predicted poorly, with the aim of minimizing the loss function. The prediction of each tree is weighted based on its performance on the training data, and a weighted average is calculated for regression or a weighted vote for classification.
- Gradient Boosting: As in AdaBoost, trees are added sequentially. However, here the loss function of the entire ensemble is minimized using gradient descent: each new tree is fitted to the negative gradient of the loss (for squared error, the residuals of the current ensemble), and the contribution of each tree is weighted. The ensemble prediction is the weighted sum of the trees’ outputs, which is used directly for regression and converted into class probabilities for classification.
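To make the sequential idea concrete, here is a minimal sketch of gradient boosting for squared error with one-split trees (stumps) on a single feature. It is a simplification under stated assumptions: a fixed `learning_rate` replaces per-tree weight optimization, and the function names and the sine-shaped toy data are illustrative only.

```python
import math

def fit_stump(xs, residuals):
    """Best single-split regression stump (assumes at least two distinct x values)."""
    best = None  # (error, threshold, left mean, right mean)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, ml, mr)
    _, thr, ml, mr = best
    return lambda x: ml if x <= thr else mr

def gradient_boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]

    def model(x):
        return base + learning_rate * sum(s(x) for s in stumps)
    return model

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(60)]
ys = [math.sin(x) for x in xs]
model = gradient_boost(xs, ys, n_trees=50, learning_rate=0.1)
training_error = mse(model, xs, ys)
```

Each added stump corrects part of what the ensemble so far gets wrong, so the training error shrinks as trees are added; in practice, the number of trees and the learning rate are tuned on held-out data to avoid overfitting.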
Deep Learning refers to artificial neural networks (ANN) with multiple layers. Each layer consists of neurons whose values typically depend on the neurons in the previous layer. For a neuron Nⱼ, a weighted sum of the values xᵢ of the neurons in the previous layer with the weights wᵢⱼ is computed. Then, an activation function ϕ is applied to make nonlinear relationships possible in an ANN:

$$N_j = \phi\left(\sum_i w_{ij}\, x_i\right)$$
In the first layer, the inputs xᵢ are the features and, optionally, past segments of the target time series rather than neurons from a previous layer. Frequently used activation functions ϕ include the ReLU function, the hyperbolic tangent, and the sigmoid function. A simple ANN with three layers and varying neuron counts per layer is depicted in Figure 2. The ANN learns to make accurate predictions by minimizing the loss function using backpropagation and gradient descent.
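The computation of one layer, a weighted sum followed by an activation function, can be written out in a few lines. The sketch below uses made-up inputs and weights and omits bias terms, which the description above does not mention; `layer_forward` is an illustrative name.

```python
import math

def relu(z):
    return max(0.0, z)

def layer_forward(x, weights, activation):
    """weights[i][j] is w_ij, connecting input x_i to neuron N_j."""
    n_out = len(weights[0])
    return [
        activation(sum(x[i] * weights[i][j] for i in range(len(x))))
        for j in range(n_out)
    ]

x = [1.0, 2.0]                 # two inputs (hypothetical values)
W = [[0.5, -1.0, 0.0],         # weights from x_0 to the three neurons
     [0.25, 0.5, -1.0]]        # weights from x_1 to the three neurons

outputs = layer_forward(x, W, relu)           # → [1.0, 0.0, 0.0]
outputs_tanh = layer_forward(x, W, math.tanh)
```

Stacking several such layers, with the outputs of one layer serving as the inputs of the next, gives the multi-layer network described above.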
In addition, neurons can have feedback loops. Such networks are called recurrent neural networks (RNNs). These feedback loops provide the network with a form of memory, enabling it to learn relationships between past and new predictions of the model. A typical example of an RNN is the Long Short-Term Memory (LSTM) network.
One of the strengths of Deep Learning in time series prediction is its ability to autonomously learn complex relationships within the time series itself. This means there’s no need to manually search for features within the time series; Deep Learning methods learn these features automatically. However, this comes at the cost of interpretability, since the relationships within the networks are often less comprehensible. Furthermore, Deep Learning models can handle large datasets, and in fact need them in order to learn more complex relationships. These methods can adapt well to changes in patterns by being retrained on new data. Nevertheless, Deep Learning models are more challenging to train, and fine-tuning hyperparameters can be quite labor-intensive. [5, 6]
Comparison of Models
Both boosting methods based on decision trees and deep learning methods can yield good results in time series prediction. Therefore, it’s worth comparing these models to assess their suitability for different applications.
Requirements for Training Data and Features
The requirements for training data and features differ between decision tree-based methods and deep learning.
For decision tree-based methods, training data should be well-structured and consist of observations of the target variable and its corresponding features. Identifying temporal dependencies in the data involves the deliberate creation of suitable features and the selection of appropriate models for decision trees. Decision tree-based methods are well-suited for time series where significant features are easy to identify. They stand out for their efficiency and relatively short training duration.
In contrast, deep learning methods can handle less structured data and allow for the simultaneous prediction of multiple time steps. They learn important features from unstructured data and can recognize temporal relationships in the data. Deep learning can identify complex dependencies within time series and their features, which is advantageous in certain applications. However, deep learning methods typically require a substantial amount of data, hyperparameter tuning, and a long training time to perform optimally.
In many applications, it’s essential to define ranges within which the target variable is expected to fall. This may be crucial, for instance, in predicting peak loads with a certain level of confidence. To make such probability statements, quantile predictions can be used.
For ensembles of many decision trees, such as Random Forest, this is relatively straightforward. The predictions of the individual decision trees yield a distribution for the target variable. By aggregating the trees with quantiles instead of the mean, quantiles can be determined without training an additional model. In the case of gradient boosting, however, a new model must be created because the entire ensemble is optimized for one loss function. In this scenario, it is necessary to adapt the loss function (for example, to the quantile loss) and train a new model for each desired quantile. Similar approaches can also be applied to other decision tree-based methods.
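Both ideas can be sketched briefly: reading quantiles off the spread of the individual trees’ predictions, and the quantile (pinball) loss that a model is trained on when it should predict a specific quantile directly. The per-tree predictions below are made-up numbers, and the function names are illustrative assumptions.

```python
def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss, averaged over all observations.

    Under-predictions are penalized with weight q, over-predictions with 1 - q.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)

# Hypothetical predictions of ten individual trees for one time step.
tree_predictions = [41.0, 43.5, 39.8, 45.2, 42.1, 44.0, 40.7, 46.3, 42.9, 43.1]

lower = quantile(tree_predictions, 0.1)
median = quantile(tree_predictions, 0.5)
upper = quantile(tree_predictions, 0.9)
```

For a high quantile such as q = 0.9, the pinball loss punishes forecasts that are too low far more than forecasts that are too high, which pushes the trained model toward the upper end of the distribution.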
In the field of deep learning, two options are presented here. First, by introducing the random element of “dropout,” which temporarily deactivates random neurons, it’s possible to generate a probability distribution of predictions. However, this requires numerous predictions, each with a different random set of deactivated neurons. The second option is to adjust the loss function and create a model for each desired quantile. Both methods involve substantial computational effort.
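The dropout option can be sketched with a tiny network that is evaluated many times, each time with a different random set of deactivated hidden neurons. The weights below are arbitrary illustrative numbers, not a trained model, and the function name is an assumption.

```python
import random

def forward_with_dropout(x, w_hidden, w_out, drop_rate, rng):
    """One forward pass through a single hidden ReLU layer with random dropout."""
    hidden = []
    for j in range(len(w_hidden[0])):
        z = sum(x[i] * w_hidden[i][j] for i in range(len(x)))
        a = max(0.0, z)  # ReLU activation
        # Inverted dropout: deactivate with probability drop_rate,
        # rescale the surviving activations.
        keep = rng.random() >= drop_rate
        hidden.append(a / (1.0 - drop_rate) if keep else 0.0)
    return sum(h * w for h, w in zip(hidden, w_out))

rng = random.Random(0)         # fixed seed for reproducibility
x = [0.8, 0.3]                 # hypothetical input features
w_hidden = [[0.6, -0.4, 0.9, 0.2],
            [0.1, 0.7, -0.5, 0.8]]
w_out = [1.2, 0.5, 0.9, 0.7]

# Many stochastic passes give an empirical distribution of predictions.
samples = sorted(
    forward_with_dropout(x, w_hidden, w_out, 0.2, rng) for _ in range(1000)
)
lower, upper = samples[50], samples[949]   # roughly the 5% and 95% quantiles
```

The need for hundreds of forward passes per forecast is the computational cost mentioned above.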
Often, it is necessary to predict more than just one time step into the future. For example, to predict the intraday energy price for the entire upcoming day, all 96 time steps of the following day need to be forecasted, as the data is available in 15-minute intervals.
Deep learning methods allow for easy adjustment of the number of predictions by adapting the number of neurons in the final layer to the desired output. This enables the prediction of all time steps simultaneously, preserving possible relationships between individual time steps and allowing the most recently transmitted data to be used for all time steps.
For decision tree-based methods, this is more complex. There are various approaches to predicting multiple time steps. Suppose we want to predict n time steps. The simplest approach is to train a single model whose features use only data that is at least n time steps old (Figure 3a). Otherwise, data from the future would be used, which is not available in practice. However, this means that the most recently transmitted data are not used as features for most of the predicted time steps.
To utilize this data, recursive predictions can be made, where previously made predictions serve as features for subsequent predictions (Figure 3b). However, there is a risk that errors in predictions will propagate and result in significant errors in predictions further into the future. One way to overcome this is to train an individual model for each time step (Figure 3c). This allows the use of the latest data while only actual measured data is used, preventing error propagation. However, this approach is computationally intensive, as it requires creating n models. 
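The recursive and the per-step strategies can be contrasted in a short sketch. To keep it compact, a one-feature linear model stands in for the tree-based model, with the last observed value as the only feature; the function names and the synthetic series are assumptions for illustration.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return a, my - a * mx

def recursive_forecast(series, n_steps):
    """One one-step model; each prediction is fed back as the next input."""
    a, b = fit_line(series[:-1], series[1:])
    last, out = series[-1], []
    for _ in range(n_steps):
        last = a * last + b      # errors made here propagate to later steps
        out.append(last)
    return out

def direct_forecast(series, n_steps):
    """One model per horizon h; each uses only the last measured value."""
    out = []
    for h in range(1, n_steps + 1):
        a, b = fit_line(series[:-h], series[h:])   # learn y[t + h] from y[t]
        out.append(a * series[-1] + b)
    return out

# Synthetic hourly series with a daily cycle (illustrative values).
series = [50 + 10 * math.sin(2 * math.pi * t / 24) for t in range(96)]
recursive = recursive_forecast(series, 4)
direct = direct_forecast(series, 4)
```

The direct variant trains `n_steps` separate models, which is where the additional computational effort comes from, but none of its inputs is itself a prediction.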
Predicting time series data in the energy sector is crucial for early bottleneck detection and accurate forecasting. The various models come with their own advantages and disadvantages. When dealing with non-parametric models, it’s worthwhile to compare conventional machine learning methods with deep learning. Conventional machine learning methods require structured training data, while deep learning can process unstructured data and learn the important features from it. Quantile forecasts require adjustments in both cases, but they are simpler to obtain with tree-based ensembles than with deep learning. For multi-step forecasting, deep learning models can flexibly adapt the number of predictions, whereas decision tree-based methods require dedicated strategies. Choosing the appropriate model depends on the application and the data.
Further Information
- Predictions in Energy Economics – Which Methods Are Suitable?
- Predictions in Energy Economics – Which Error Metrics Are Suitable?
- Grundlagen künstlicher Intelligenz und Machine Learning in der Energiewirtschaft
- Anwendungsfälle von Supervised Machine Learning in der Energiewirtschaft
- Supervised Machine Learning
[1] Russell, S., & Norvig, P. (2021). Artificial Intelligence: A Modern Approach (4th ed.). Prentice Hall.
[2] Chauhan, N. K., & Singh, K. (2018). A Review on Conventional Machine Learning vs Deep Learning. In 2018 International Conference on Computing, Power and Communication Technologies (GUCON) (pp. 347-352). doi: 10.1109/GUCON.2018.8675097.
[3] Müller, A. C., & Guido, S. (2016). Introduction to Machine Learning with Python: A Guide for Data Scientists. O’Reilly Media, Inc.
[4] Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media, Inc.
[5] Brownlee, J. (2018). Deep Learning for Time Series Forecasting: Predict the Future with MLPs, CNNs, and LSTMs in Python. Machine Learning Mastery.
[6] Gamboa, J. C. B. (2017). Deep Learning for Time-Series Analysis. arXiv preprint arXiv:1701.01887.
[7] Bontempi, G., Ben Taieb, S., & Le Borgne, Y. A. (2013). Machine Learning Strategies for Time Series Forecasting. In Business Intelligence: Second European Summer School, eBISS 2012 (pp. 62-77).
[8] Haben, S., Voss, M., & Holderbaum, W. (2023). Core Concepts and Methods in Load Forecasting: With Applications in Distribution Networks. Springer Nature.
[9] Mahalakshmi, G., Sridevi, S., & Rajaram, S. (2016, January). A Survey on Forecasting of Time Series Data. In 2016 International Conference on Computing Technologies and Intelligent Data Engineering (ICCTIDE’16) (pp. 1-8). IEEE.