Comparison of ARMA and Multilayer Perceptron Based Methods for Economic Time Series Forecasting
Aistis Raudys and Jonas Mockus
Institute of Mathematics and Informatics,
Akademijos 4, Vilnius 2600, Lithuania
Abstract. In this paper two popular time series prediction methods, the Auto Regressive Moving Average (ARMA) model and the multilayer perceptron (MLP), are compared while forecasting seven real economic time series. It is shown that the predictions of both methods are poor in ill-structured problems. In the well-structured cases, when prediction accuracy is high, the MLP predicts better, providing a lower average prediction error.
Introduction
Forecasting economic time series is an important task in modern data analysis. The problem can be formulated as follows. Let X_{1}, X_{2}, …, X_{R} be a multivariate time series consisting of R p-variate vectors. The objective is to forecast the first component Y_{j} = X_{1j} of a vector X_{j}
on the basis of the information contained in a sequence of p-variate vectors X_{j-1}, X_{j-2}, …, X_{j-K}, where K is the number of past vectors we want to use. Usually, for the prediction of each particular value we use only a finite number K of "history" vectors X_{j-1}, X_{j-2}, …, X_{j-K}, where K << R. For the prediction of the parameter Y_{j} = X_{1j} one seeks a functional relationship Y_{j} = f(X_{j-1}, X_{j-2}, …, X_{j-K}).
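The construction of the "history" vectors can be sketched in code. This is an illustrative numpy sketch, not the software used in the paper; the function name and the array layout are our own choices.

```python
import numpy as np

def make_history_matrix(X, K):
    """Build the lagged design matrix for one-step-ahead forecasting.

    X : array of shape (R, p) -- R observations of a p-variate series.
    K : number of past vectors used as predictors (K << R).

    Returns (Z, y): row Z[t] stacks X_{t-1}, ..., X_{t-K} into one
    vector of length K*p, and y[t] is the first component of X_{t}
    (the forecasted feature Y_t = X_{1,t}).
    """
    R, p = X.shape
    rows, targets = [], []
    for t in range(K, R):
        rows.append(X[t - K:t][::-1].ravel())  # X_{t-1}, ..., X_{t-K}
        targets.append(X[t, 0])                # Y_t = X_{1,t}
    return np.array(rows), np.array(targets)
```

Each row of Z then serves as the input of a prediction model, while y holds the corresponding targets.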
There is a number of methods for solving the time series forecasting problem. The most popular are regression trees, neural networks, autoregression and others. We concentrate on two of them: the Auto Regressive Moving Average (ARMA) model and the multilayer perceptron (MLP).
In the ARMA prediction model, it is supposed that the vectors X_{1}, X_{2}, …, X_{R} are samples of a stationary Gaussian process (see, e.g., Box and Jenkins, 1974), and

\hat{Y}_{t} = \sum_{i=1}^{K} \sum_{j=1}^{p} a_{ij} X_{j,t-i} + \sum_{i=1}^{q} b_{i} \varepsilon_{t-i},  (1)

where
\hat{Y}_{t} is the t-th day (a moment of the time series) prediction,
X_{j,t-i} is the j-th component of the (t-i)-th day vector,
\varepsilon_{t} = Y_{t} - \hat{Y}_{t} is the t-th day prediction error,
K is the number of "history" moments,
p is the vector size,
q is the number of Moving Average (MA) parameters.
The problem is to find the unknown coefficients a_{ij}, b_{i} from the "learning-set" data.
The multilayer perceptron calculates an output as follows:

\hat{Y}_{t} = \sum_{j=1}^{h} v_{j} f\big( \sum_{i=1}^{K} \sum_{l=1}^{p} w_{ilj} X_{l,t-i} + w_{0j} \big) + v_{0},  (2)

where
f is a non-linear activation function, e.g. f(x) = 1/(1 + e^{-x}),
w_{ilj} are the MLP hidden layer coefficients,
v_{j} are the MLP output layer coefficients,
p is the vector size (dimensionality),
h is the number of hidden neurones,
K is the number of "history" moments,
and the prediction task is to find the coefficients v_{j}, w_{ilj}.
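The forward pass of expression (2) can be sketched as follows. This is a minimal numpy sketch under the assumption of one hidden layer of sigmoid neurones; the function name and argument layout are illustrative, not the authors' software.

```python
import numpy as np

def mlp_forecast(x, W, w0, v, v0):
    """One-step forecast of expression (2).

    x  : flattened history vector of length N = K*p,
    W  : hidden-layer weights of shape (h, N), w0 : biases of length h,
    v  : output-layer weights of length h,    v0 : output bias.
    """
    f = lambda a: 1.0 / (1.0 + np.exp(-a))   # sigmoid activation
    hidden = f(W @ x + w0)                   # h hidden-neurone outputs
    return v @ hidden + v0                   # scalar prediction
```

The prediction task then amounts to choosing W, w0, v, v0 so that the forecasts match the learning-set targets.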
The objective of the paper is to apply both the MLP and the ARMA prediction methods to several real data sets and to compare their accuracy. In the next section, the details of the methods are described. In the third section, the economic data sets used in the experiments are given; in the fourth section, the criteria used to compare the accuracy of the different prediction methods are discussed. The fifth section contains the results of the experimental analysis.
Prediction methods
ARMA
Definitions
Define the ARMA model as

Y_{t} = \sum_{i=1}^{K} \sum_{j=1}^{p} a_{ij} X_{j,t-i} + \sum_{i=1}^{q} b_{i} \varepsilon_{t-i} + \varepsilon_{t}.  (3)

We assume that E(\varepsilon_{t}) = 0 and E(\varepsilon_{t} \varepsilon_{s}) = 0 for t ≠ s,
where
\hat{Y}_{t} is the prediction for the t-th day,
X_{j,t-i} is the j-th component of the data vector of the (t-i)-th day,
\varepsilon_{t} is the prediction error of the t-th day,
K is the "history length",
p is the dimensionality of the data vector,
Kp is the number of AR parameters,
q is the number of MA parameters.
Minimising the Residual
An advantage of residual minimisation is that one may see directly how the objective depends on the unknown parameters. Using the equalities (3), we define the residuals by the recurrent expressions:

\varepsilon_{t} = Y_{t} - \hat{Y}_{t}, t = K+1, …, R,  (4)

where
Y_{t} are the original values,
\hat{Y}_{t} are the forecasted values.
The following sum is minimised:

S = \sum_{t=K+1}^{R} \varepsilon_{t}^{2},  (5)

where
\varepsilon_{t} is the prediction error of the day t,
R is the number of moments.
The logarithm of the sum is often used to decrease the variation of the objective by improving the scales (see Mockus, 1997).
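The objective (5), with the residuals built recurrently as in (4), can be sketched for the univariate case as follows. This is an illustrative sketch, not the self-optimising ARMA software used in the experiments; the parameter packing is an assumption of ours.

```python
import numpy as np

def arma_objective(params, y, K, q, use_log=True):
    """Sum-of-squares objective (5) for a univariate ARMA(K, q) model.

    params : concatenated (a_1..a_K, b_1..b_q),
    y      : the observed series,
    K, q   : AR and MA orders.
    Residuals are produced by the recurrent expressions (4).
    """
    a, b = params[:K], params[K:K + q]
    R = len(y)
    eps = np.zeros(R)
    s = 0.0
    for t in range(max(K, q), R):
        # AR part on past values plus MA part on past residuals
        pred = a @ y[t - K:t][::-1] + b @ eps[t - q:t][::-1]
        eps[t] = y[t] - pred
        s += eps[t] ** 2
    return np.log(s) if use_log else s
```

Minimising this function over the parameter vector (with any general-purpose optimiser) yields the ARMA coefficients; the optional logarithm corresponds to the scale improvement mentioned above.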
Multilayer perceptron
The MLP consists of several layers of neurones. The neurones are connected with each other as shown in Figure 1.
The input signals are submitted to the hidden layer neurones. One can use several hidden layers, where the outputs of each layer are submitted to the inputs of the next layer's neurones. In terms of the standard regression models, each single neurone uses a non-linear activation function f:

o = f\big( \sum_{i=1}^{N} w_{i} x_{i} + w_{0} \big),  (6)

where
o is the forecasted value,
x_{1}, …, x_{N} are the values used for forecasting,
w_{0}, w_{1}, …, w_{N} are the neurone parameters.
The idea of the ANN model is that the activation function roughly represents the activation of a real neurone. The outputs of the highest hidden layer neurones are submitted to the output layer neurones. In the analysis, only a one-hidden-layer perceptron was used, with the following data processing function:

\hat{Y}_{t} = f_{out}\big( \sum_{j=1}^{h} v_{j} f_{j}\big( \sum_{i=1}^{N} w_{ij} x_{i} + w_{0j} \big) + v_{0} \big),  (7)

where
\hat{Y}_{t} is the prediction of the t-th day,
x_{1}, …, x_{N} are the components of the "history" vectors X_{t-1}, …, X_{t-K} (the data of the preceding days),
K is the "history length",
p is the dimensionality of the data vector,
w_{ij} are the MLP hidden layer coefficients,
v_{j} are the MLP output layer coefficients,
N = Kp is the number of inputs,
f_{j}, f_{out} are the activation functions. In the output layer

f_{out}(x) = c / (1 + e^{-x}),  (8)

where
c is a scaling parameter to fit the MLP outputs to the data.
In the hidden layer we used the traditional sigmoid function:

f_{j}(x) = 1 / (1 + e^{-x}).
The non-linearity of the activation function means that the perceptron is a non-linear forecasting algorithm. We use the MLP to forecast on the basis of the information contained in the K preceding vectors. Typically, in order to find the unknown coefficients v_{j}, w_{ij}, we minimise the sum-of-squares cost function:

Cost = \sum_{t=K+1}^{R} (Y_{t} - \hat{Y}_{t})^{2},  (9)

where
Y_{t} is a real value from the learning-set data,
\hat{Y}_{t} is a forecasted value,
R is the size of the learning-set, i.e. the number of vectors used to find the unknown weights of the perceptron.
The cost function depends on the unknown parameters represented as vectors (w_{ij}; i = 0, …, N and v_{j}; j = 0, …, h). Expression (9) shows that the residuals are non-linear functions of the parameters if the activation function is non-linear. The minimum conditions

\partial Cost / \partial w_{ij} = 0  (10)

and

\partial Cost / \partial v_{j} = 0  (11)

form a system of non-linear equations that, as usual, has multiple solutions.
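Conditions (10)-(11) are usually attacked numerically by gradient descent (backpropagation). The following is a minimal sketch with a linear output layer; the learning step, epoch count and initialisation scale are illustrative defaults of ours, not the values used in the experiments.

```python
import numpy as np

def train_mlp(Z, y, h=3, lr=0.1, epochs=500, seed=0):
    """Minimise the sum-of-squares cost (9) by plain gradient descent.

    Z : design matrix of history vectors, shape (R, N); y : targets.
    Returns the fitted weights (W, w0, v, v0) of a one-hidden-layer
    perceptron with sigmoid hidden neurones and a linear output.
    Starting weights are selected randomly, as discussed in the text.
    """
    rng = np.random.default_rng(seed)
    R, N = Z.shape
    W = rng.normal(scale=0.5, size=(h, N)); w0 = np.zeros(h)
    v = rng.normal(scale=0.5, size=h);      v0 = 0.0
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    for _ in range(epochs):
        H = sig(Z @ W.T + w0)           # hidden outputs, shape (R, h)
        pred = H @ v + v0               # linear output layer
        e = pred - y                    # residuals
        # gradients of 0.5 * sum(e^2)
        gv = H.T @ e; gv0 = e.sum()
        gH = np.outer(e, v) * H * (1 - H)
        gW = gH.T @ Z; gw0 = gH.sum(axis=0)
        W -= lr * gW / R; w0 -= lr * gw0 / R
        v -= lr * gv / R; v0 -= lr * gv0 / R
    return W, w0, v, v0
```

Because of the multiple solutions of (10)-(11), different random starts can end in different local minima, which is why the initialisation question discussed below matters.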
An important parameter in MLP training is the learning step value. Typically, for each new data set it is chosen after a few preliminary experiments (see Rumelhart et al., 1986).
The MLP training process encounters a multi-modality problem, similarly to the case of the ARMA model. Thus, the perceptron's starting weights are important. Traditionally, while training the perceptron, the starting weights are selected randomly. We initialised the weights randomly, and also used our "active initialisation" technique (Raudys, 1998): we mapped the data into a two-variate space, actively controlled the hidden layer weights in the two-variate space, and then gradually returned to the multivariate feature space. For the real world data discussed, both initialisation approaches resulted in the same accuracy. Therefore, in the tables below we present the results obtained with random weight initialisation.
Real world data sets used in experiments
To compare the accuracy of both prediction methods we used several sets of real economic data obtained from different sources. Define a data set as a table of numbers

X = (x_{ij}), i = 1, …, p, j = 1, …, R.  (12)

Here each column represents one event in the time series data:

X_{j} = (x_{1j}, x_{2j}, …, x_{pj})^{T}.  (13)

The i-th row we call the i-th feature: (x_{i1}, x_{i2}, …, x_{iR}). We predict the values of the main feature for the next day.
Seven real data sets were used to compare the prediction algorithms. On some data sets unsatisfactory results were obtained. Attempting to improve the prediction accuracy, both the original and the transformed data sets were used (see section 6). The transformation was:

z_{ij} = x_{i,j+1} - x_{ij}, j = 1, …, R-1,  (14)

where
z_{ij} is the transformed data,
x_{ij} is the original data,
R is the number of records in the old data.
A transformed vector is the difference between today and yesterday.
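Transformation (14) and its inversion (needed to return forecasts to the original scale) can be sketched as follows; a minimal numpy sketch with illustrative function names.

```python
import numpy as np

def difference(x):
    """Transformation (14): each transformed value is today minus
    yesterday, so R records become R-1 differences."""
    return x[1:] - x[:-1]

def undifference(z, x0):
    """Invert the transform: rebuild the original series from the
    differences z and the first original record x0."""
    return np.concatenate([[x0], x0 + np.cumsum(z)])
```

Forecasting is then done on the differenced series, and the forecasts are mapped back by accumulating the differences onto the last observed value.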
London Stock Exchange data
The data set name is stock; the number of features (the dimensionality of the time series) is 23; the number of records is 1383; the forecasted feature is 23. The data represent London Stock Exchange closing rates over a period of approximately three years. The data contain information such as currency exchange rates, Dow Jones indexes, closing rates of stocks of some of the biggest companies, etc. This data set is the largest one in this research. Therefore, in order to reduce the influence of the high data dimensionality, we performed a feature selection. The experiments have been performed with three data sets containing 2, 8 and all 23 features.
AT&T share data
The data set name is att; the dimensionality is 1; the number of records is 488. The data contain AT&T stock closing rates.
Intel stock data
The data set name is intel; the dimensionality is 1; the number of records is 492. The data contain closing rates of Intel stocks (see Figure 2).
Figure 2. Intel data set 1-st feature (original on the left, transformed on the right).
Call centre data
The data set name is call; the dimensionality is 15; the forecasted feature is 5 (the number of calls per day); the feature list: month, day, year, alternative call count, call count, unknown, unknown, and 8 binary specific events; the number of records is 365. The data contain the number of calls per day.
Lithuanian banks stock data
The data set name is bank; the number of features is 8; the forecasted feature is 7 (during the experiments with the bank data set we obtained the best result forecasting Šiaulių bankas (7) share values); the feature list: stock session code, Bankas Hermis, Vilniaus bankas, Lietuvos taupomasis bankas, Vilniaus bankas, bankas Snoras, Šiaulių bankas, Ūkio bankas; the number of records is 120. The data contain stock rates of 7 major Lithuanian banks.
Currency exchange data
The data set name is currency; the number of features is 4; the forecasted feature is 3 (during the experiments with the currency data set we obtained the best result forecasting the Lb to $ exchange rate); the number of records was 464. The data contained the exchange rates to the $ of four major currencies: Dm, Fr, Lb, Yen.
USA bond rates data
The data set name is bond; the number of features is 8; the forecasted feature is 4; the number of records was 5893. The data contain daily 1, 2, 3, 5 and 10 year US bond rates over the period from 1976 to 1998. The data are ill-structured and are not shown in the results table.
Methodology of experiments
To evaluate the forecasting accuracy, the data sets were split into three equal parts. The first part, a learning-set, was used for training, i.e. the estimation of the unknown prediction equation parameters (the perceptron weights, the ARMA parameters). The second part, a validation-set, was used to select the best variant of the prediction methods: the number of inputs, the number of hidden neurones, an optimal stopping moment in MLP training, the ARMA model order parameters, etc. The third part of each data set was used to test the performance of the final variants of the MLP and ARMA models. It is called a test-set.
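The three-way split described above can be sketched as follows (a minimal sketch; the function name is illustrative, and the parts are taken in time order, as the text implies).

```python
import numpy as np

def split_three(X):
    """Split a series into three equal consecutive parts:
    learning-set, validation-set and test-set."""
    n = len(X) // 3
    return X[:n], X[n:2 * n], X[2 * n:3 * n]
```

Keeping the parts consecutive, rather than shuffling, preserves the temporal structure that the forecasting models rely on.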
The ARMA software used in the experiments was self-optimising (see Mockus, 1997), and the learning and validation sets were used together to estimate both the prediction model architecture and the model parameters. The MLP software was not self-optimising. Therefore, we applied the MLP predictor several times, sequentially changing the number of previous observation vectors used for forecasting, and also the number of neurones in the hidden layer. The other parameters were set to their default values. The MLP was trained on the learning-set. The best set of parameters was determined using the information in the validation-set.
For the evaluation of the average accuracy of both prediction methods, two criteria were used. The first one is the relative efficiency of the method in comparison with the "random walk" method (the random walk predicts: tomorrow-will-be-as-today):

J = 100 (\sigma^{2} - \sigma^{2}_{RW}) / \sigma^{2}_{RW},  (15)

where
\sigma^{2} = \frac{1}{R-1} \sum_{t=2}^{R} (Y_{t} - \hat{Y}_{t})^{2} is the mean square prediction error (MSPE),
\sigma^{2}_{RW} = \frac{1}{R-1} \sum_{t=2}^{R} (Y_{t} - Y_{t-1})^{2} is the MSPE of the random walk method,
Y_{t} is the forecasted feature value of the day t (the moment t),
R is the sample size.
J is the efficiency coefficient of the method in comparison with the random walk: J = 0 means that the algorithm is of the same quality as the random walk, J = -100 means that the algorithm forecasts ideally, and J > 0 means that the algorithm is worse than the random walk.
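The efficiency coefficient (15) can be computed directly; a minimal numpy sketch, with the function name our own.

```python
import numpy as np

def j_efficiency(y_true, y_pred):
    """Relative efficiency (15) against the random walk, in percent.
    J = 0: same quality as the random walk; J = -100: ideal forecast;
    J > 0: worse than the random walk."""
    mspe = np.mean((y_true[1:] - y_pred[1:]) ** 2)
    # random walk: tomorrow-will-be-as-today
    mspe_rw = np.mean((y_true[1:] - y_true[:-1]) ** 2)
    return 100.0 * (mspe - mspe_rw) / mspe_rw
```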
For evaluating the accuracy of the MLP predictor, in addition to the relative efficiency we used the traditional correlation coefficient r between the actual and the predicted values, estimated on the training (validation) and the test data sets:

r = \sum_{t} (Y_{t} - \bar{Y})(\hat{Y}_{t} - \bar{\hat{Y}}) / \sqrt{ \sum_{t} (Y_{t} - \bar{Y})^{2} \sum_{t} (\hat{Y}_{t} - \bar{\hat{Y}})^{2} },  (16)

where
Y_{t} are the original values,
\hat{Y}_{t} are the forecasted values,
\bar{Y} is the mean of the original values,
\bar{\hat{Y}} is the mean of the forecasted values.
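Formula (16) is the standard Pearson correlation; a direct sketch of it (illustrative function name):

```python
import numpy as np

def prediction_correlation(y, yhat):
    """Correlation coefficient (16) between the actual values y
    and the forecasted values yhat."""
    y = np.asarray(y, float); yhat = np.asarray(yhat, float)
    dy, dyh = y - y.mean(), yhat - yhat.mean()   # centred values
    return (dy @ dyh) / np.sqrt((dy @ dy) * (dyh @ dyh))
```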
Results
The results of the experiments are presented in several ways. There are tables containing the correlation coefficients and the comparison-with-RW numbers evaluated on the validation and the test sets. In the tables, the estimates obtained on the learning set are denoted by the index "train". The train-set estimates by definition represent well only the training set; therefore, the efficiency of the prediction algorithm is evaluated by the test-set estimates. In addition to the tables, two scatter diagrams are presented, which depict the true and the forecasted parameter values using the MLP and ARMA algorithms. In Table 1, MS is the number of learning epochs (an epoch is one learning pass using all the vectors in the learning set), Deep is the number of days back, # hidden is the number of neurones in the hidden layer, r_{train} is the correlation on the learning set, r_{test} is the correlation on the test set, J_{valid} is the validation set comparison to the random walk (in %), J_{train} is the learning set comparison to the random walk (in %), and J_{test} is the test set comparison to the random walk (in %).
Table 1. Results of a number of experiments with MLP using different parameters, for the stock data set.
MS  | Deep | # hidden | r_{train} | r_{test}  | J_{valid} | J_{train} | J_{test}
112 | 1    | 1        | 0.28023   | 0.283288  | -58.2809  | -27.5862  | -31.7638
112 | 1    | 2        | 0.283477  | 0.282115  | -58.2816  | -27.6427  | -31.765
109 | 1    | 3        | 0.283722  | 0.272426  | -58.2733  | -27.6777  | -31.5691
111 | 1    | 4        | 0.285586  | 0.278071  | -58.2859  | -27.6673  | -31.6706
285 | 2    | 1        | 0.85115   | 0.981302  | -87.9664  | -56.2395  | -78.7807
299 | 2    | 2        | 0.863091  | 0.981265  | -87.9156  | -58.174   | -80.5493
299 | 2    | 3        | 0.865426  | 0.979076  | -87.6547  | -57.1322  | -77.0033
327 | 2    | 4        | 0.864981  | 0.979685  | -87.731   | -58.5318  | -80.2028
237 | 3    | 1        | 0.815711  | 0.975249  | -85.638   | -53.0071  | -74.9967
237 | 3    | 2        | 0.821114  | 0.973893  | -85.227   | -53.1312  | -73.574
268 | 3    | 3        | 0.835287  | 0.970705  | -84.8299  | -56.4441  | -78.1757
277 | 3    | 4        | 0.827319  | 0.969812  | -84.6009  | -55.3923  | -77.0132
308 | 4    | 1        | 0.815684  | 0.976644  | -83.5278  | -56.0419  | -84.1125
136 | 1    | 7        | 0.286319  | 0.280904  | -58.2725  | -27.7822  | -31.71
Table 1 shows that the efficiency of the MLP prediction depends on the number of inputs (the number of previous days used for prediction), the number of hidden neurones and, of course, on the number of learning iterations. The best prediction accuracy J_{valid}, estimated on the validation set, is shown in bold. Table 1 contains several estimates; the most interesting are r_{test} and J_{test}. Since the prediction methods are compared on several data sets, a shorter presentation is used for the other data.
In Table 2: Data Set is the name of the data set, RW r_{test} is the correlation coefficient between the Random Walk (RW) forecasted values and the original values, MLP r_{test} is the correlation coefficient between the MLP forecasted values and the original values in the test set, MLP J_{test} is the minimum (best) J coefficient obtained in the test set during the series of experiments with the MLP, and ARMA J_{test} is the J coefficient between the ARMA forecasted values and the original values in the test set. The best prediction accuracy J_{test} for each data set is in bold.
Table 2. Forecasting results of all data set using ARMA and MLP prediction methods
Data Set Name | ARMA J_{test} | MLP J_{test} | MLP r_{test} | RW r_{test}
att           | 0.721         | 4.871        | 0.986        | 0.986
bank(7)       | 43.940        | 17.244       | 0.704        | 0.743
stock1        | -28.436       | -28.927      | 0.026        | 0.006
stock2        | -54.584       | -78.780      | 0.981        | 0.006
stock8        | -33.005       | -86.614      | 0.983        | 0.006
stock23       | -             | 32.264       | 0.519        | 0.006
call          | -19.482       | -21.399      | 0.665        | 0.535
currency(3)   | 43.940        | 89.299       | 0.910        | 0.932
intel         | 23.273        | 1356.960     | 0.684        | 0.983
intel^{(tra)} | -25.382       | -26.352      | 0.223        | 0.098
intel^{*}     | 23.273        | 1056.720     | 0.877        | 0.983
Figure 3 shows scatter diagrams of the closing rates (on the horizontal axis) and the predicted rates (on the vertical axis). For the prediction we used only 2 features: the first and the eleventh. The straight line in both figures marks an exact prediction (zero forecasting error). For the MLP forecast with architecture 4-3-1, the correlation coefficient is r_{test} = 0.981; the ARMA correlation coefficient is r_{test} = 0.898; the RW correlation coefficient is r_{test} = 0. We see that for this data set the MLP outperforms the ARMA and RW in the sense of a higher correlation.
Figure 3. London Stock Exchange closing index forecasting (the test data set) by MLP (a) and ARMA (b) methods. Points - MLP(a) or ARMA(b) prediction, plain line - ideal prediction.
Table 2 shows that for some data sets the prediction accuracy was low using both the ARMA and MLP models: only on four data sets (stock-2, stock-8, call and intel^{tra}) out of eleven did we obtain a higher accuracy than the trivial random walk method. For these four data sets the multilayer perceptron was more efficient. Curiously, however, for seven data sets the random walk was the most effective prediction algorithm. Note that many different methods failed on the same seven data sets, which suggests the absence of structure in these data sets. The difference between the ARMA model and the MLP was rather small in these cases. The optimisation and the parameter estimation of the ARMA model are much faster.
The traditional criterion used to estimate the prediction accuracy, the correlation coefficient between the true and predicted values, often fails to evaluate the prediction performance: in some cases the MLP resulted in a lower accuracy than the random walk method (see att, intel, currency3), even though the correlation coefficient was high. A possible reason for this phenomenon can be traced by inspecting Fig. 3. There we see that the correlation coefficient for the MLP algorithm is rather high; however, the slope of the regression line in this figure differs from the bisector, which represents an ideal prediction. It means that, when solving real life problems, one needs to consider several accuracy measures.
Conclusions
An MLP forecast can be made using several models of different architecture. In the first one, the output layer neurone acts as described by equation (7). The second one does not use the activation function at all:

\hat{Y}_{t} = \sum_{j=1}^{h} v_{j} f_{j}\big( \sum_{i=1}^{N} w_{ij} x_{i} + w_{0j} \big) + v_{0}.  (17)

In the first case, the predicted values are mapped into the interval (0.4, 0.6). The output of the activation function varies between 0 and 1; thus, the forecasted feature varies approximately in the linear part of the activation function. Then the influence of very large deviations of the target, the true (measured) values of the forecasted feature, is reduced. Consequently, the activation function brings some stability to MLP training. In the other case, we do not use the activation function and do not map the forecasted feature; thus, it may vary in a very wide interval. In this case, when abnormal deviations of the forecasted feature values were observed in the training set, the prediction accuracy was much lower. The reason for the more successful use of the MLP predictor with the non-linear output is that the non-linearity in the output layer of the MLP predictor makes this algorithm more robust to outliers, i.e. atypical observations.
The tables show that ARMA is slightly better when the data are ill-structured and thus difficult to forecast. This difference, however, is not significant, since the prediction accuracy is low (the random walk method is not outperformed). The MLP exhibits better results than ARMA in the cases when the data set is well-structured and thus more suitable for prediction. On some well-structured data sets we obtained much better results in comparison with the random walk. In these cases the MLP forecasting accuracy was notably higher than the ARMA one.
On the ill-structured data sets the regular methods did not outperform the RW prediction (the highest tomorrow-prediction accuracy was achieved by using today's values). Note that the ARMA model implements the random walk exactly if the first AR weight is equal to one and all the others are zero. That is not possible for the non-linear MLP. If the non-linearity is involved in the well-structured data sets, then the MLP may show good results.
It seems that many economic data sets are ill-structured. The results are discouraging, not better than the trivial random walk; just occasionally we surpassed the RW (see the intel J_{test} = 23.273 and att J_{test} = 0.721 data sets). That means the structure of currency exchange rates, stock values and similar data sets changes in time in a hardly predictable way; therefore, one needs additional information to make good economic forecasting.
We obtained a very low prediction accuracy for the Intel data set using both the MLP and ARMA methods. We made an attempt to forecast the first differences between the values, and later to return to the original variables:

z_{t} = y_{t+1} - y_{t}, \hat{y}_{t+1} = y_{t} + \hat{z}_{t},

where
z_{t} are the transformed values,
y_{t} are the original values,
\hat{z}_{t} are the forecasted transformed values,
\hat{y}_{t} are the forecasted original values.
This strategy did not improve the results. The forecasted (using both the MLP and ARMA) differences \hat{z}_{t} were so tiny that, after the transformation back, these values showed no difference from the RW forecasted values. As expected, if the data in the original form forecast poorly, forecasting the differences will not significantly improve the results.
The conclusions are based on a comparatively small number of real life data sets. In future research, it would be desirable to make experiments using more sets of real life data, and artificial data sets too. Special attention should be paid to the cases when the prediction equations are non-linear with respect to the input parameters.
Acknowledgements
The authors express their gratitude to Dr. A. Long from City University, London, UK, and to A. S. Soofi, Vidas Plačiakis and J. Juodagalvytė from Vytautas Magnus University, Kaunas, Lithuania, for the real world data sets used in the experiments.
References
Box, G. E. P., Jenkins, G. M. (1974). Time Series Analysis: Forecasting and Control. Mir, Moscow, 1-406 (in Russian).
Mockus, J. (1997). The set of examples of global and discrete optimisation, part I. Informatica, 8, 237-264.
Mockus, J. (1997). The set of examples of global and discrete optimisation, part II. Informatica, 8, 495-526.
Raudys, A. (1998). In Advances in Pattern Recognition, Springer Lecture Notes in Computer Science, Vol. 1451 (Proc. Joint IAPR Int. Workshops SSPR'98 and SPR'98, Sydney, Australia, August 11-13, 1998). Springer Verlag, 989-996.
Rumelhart, D. E., Hinton, G. E., Williams, R. J. (1986). Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. I, 318-362. Bradford Books, Cambridge, MA.
Soofi, A., Mockus, J. (1996). Long memory processes and exchange rate forecasting. In Proceedings of the Conference on Forecasting Financial Markets, Chemical Bank and Imperial College, London.