Comparison of ARMA and Multilayer Perceptron Based Methods for Economic Time Series Forecasting

Aistis Raudys and Jonas Mockus
Institute of Mathematics and Informatics,
Akademijos 4, Vilnius 2600, Lithuania

Abstract. In this paper two popular time series prediction methods, the Auto Regression Moving Average (ARMA) model and the multilayer perceptron (MLP), are compared while forecasting seven real economic time series. It is shown that the predictions of both methods are poor in ill-structured problems. In the well-structured cases, when prediction accuracy is high, the MLP predicts better, providing a lower average prediction error.

Introduction

Forecasting economic time series is an important task in modern data analysis. The problem can be formulated as follows. Let X1, X2, … , XR be a multivariate time series consisting of R p-variate vectors. The objective is to forecast the first component Yj = X1j of a vector Xj

on the basis of the information contained in the sequence of p-variate vectors Xj-1, Xj-2, … , Xj-K, where K is the number of past vectors we want to use. Usually, for the prediction of each particular value Yj only a finite number K of "history" vectors Xj-1, Xj-2, … , Xj-K is used, where K << R. For the prediction of Yj = X1j one seeks a functional relationship Yj = f (Xj-1, Xj-2, … , Xj-K).
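In practice the pairs of "history" vectors and targets can be assembled directly from the data matrix. The following minimal sketch (our illustration, not part of the software used in the paper) assumes the series is stored as a NumPy array with p rows (features) and R columns (time moments):

```python
import numpy as np

def make_history_matrix(X, K):
    """Build (inputs, targets) pairs for forecasting the first feature.

    X is a p x R array (rows are features, columns are time moments).
    The target for moment j is Y_j = X[0, j]; the input is the K preceding
    p-variate vectors X_{j-1}, ..., X_{j-K}, flattened into one row.
    """
    p, R = X.shape
    inputs, targets = [], []
    for j in range(K, R):
        history = X[:, j - K:j][:, ::-1]   # columns X_{j-1}, ..., X_{j-K}
        inputs.append(history.T.ravel())   # most recent vector first
        targets.append(X[0, j])            # the main (first) feature
    return np.array(inputs), np.array(targets)
```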

There is a number of methods for solving the time series forecasting problem; the most popular include regression trees, neural networks and autoregression. We concentrate on two of them: the Auto Regression Moving Average (ARMA) model and the Multilayer Perceptron (MLP).

In the ARMA prediction model, it is supposed that the vectors X1, X2, … , XR are samples of a stationary Gaussian process (see, e.g., Box and Jenkins, 1974), and

\hat{X}_{t,1} = \sum_{i=1}^{K} \sum_{j=1}^{p} a_{ij} X_{t-i,\,j} + \sum_{i=1}^{q} b_i \varepsilon_{t-i},    (1)

where

\hat{X}_{t,1} is the t-th day (a moment of the time series) prediction,

X_i = (X_{i,1}, …, X_{i,p}) is the i-th day vector,

\varepsilon_t = X_{t,1} - \hat{X}_{t,1} is the t-th day prediction error,

K is the number of "history" moments,

p is the vector size,

q is the number of Moving Average (MA) parameters.

The problem is to find the unknown coefficients a_{ij}, b_i from the "learning-set" data.

The multilayer perceptron calculates an output as follows:

\hat{X}_{t,1} = \sum_{j=1}^{h} v_j \, f\!\left( \sum_{i=1}^{K} \sum_{l=1}^{p} w_{j,il} X_{t-i,\,l} + w_{j0} \right) + v_0,    (2)

where

f is a non-linear activation function, e.g. the sigmoid f(s) = 1/(1 + e^{-s}),

w_{j,il} are the MLP hidden layer coefficients,

v_j are the MLP output layer coefficients,

p is the vector size (dimensionality),

h is the number of hidden neurones,

K is the number of "history" moments,

and the prediction task is to find the coefficients v_j and w_{j,il}.

The objective of the paper is to apply both the ANN and the ARMA prediction methods to several real data sets and to compare their accuracy. In the next section, the details of the methods are described. In the third section, the economic data sets used in the experiments are given; in the fourth section the criteria used to compare the accuracy of the different prediction methods are discussed. The fifth section contains the results of the experimental analysis.

Prediction methods

ARMA

Definitions

Define the ARMA model as

X_{t,1} = \sum_{i=1}^{K} \sum_{j=1}^{p} a_{ij} X_{t-i,\,j} + \sum_{i=1}^{q} b_i \varepsilon_{t-i} + \varepsilon_t.    (3)

We assume that

E(\varepsilon_t) = 0

and

E(\varepsilon_t \varepsilon_s) = 0, \quad t \ne s,

where

\hat{X}_{t,1} = X_{t,1} - \varepsilon_t is the prediction for the t-th day,

X_i is the data vector of the i-th day,

\varepsilon_t is the prediction error of the t-th day,

K is the "history length",

p is the dimensionality of the data vector,

K·p is the number of AR parameters,

q is the number of MA parameters.

Minimising the Residual

An advantage of residual minimisation is that one may see directly how the objective depends on the unknown parameters. Using the equalities (3), we define the residuals by the recurrent expression

\varepsilon_t = X_{t,1} - \hat{X}_{t,1} = X_{t,1} - \sum_{i=1}^{K} \sum_{j=1}^{p} a_{ij} X_{t-i,\,j} - \sum_{i=1}^{q} b_i \varepsilon_{t-i},    (4)

where

X_{t,1} are the original values,

\hat{X}_{t,1} are the forecasted values.

The following sum is minimised:

S = \sum_{t=K+1}^{R} \varepsilon_t^2,    (5)

where

\varepsilon_t is the prediction error of the day t,

R is the number of moments.

The logarithm of the sum is often used to decrease the objective variation by improving the scales (see Mockus, 1997).
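To make the dependence of (5) on the unknown coefficients explicit, the residuals (4) can be computed recurrently for any trial coefficients and the resulting sum handed to a general-purpose optimiser (e.g. scipy.optimize.minimize). The sketch below is only an illustration of equations (3)-(5); the function and variable names are ours, not those of the software used in the experiments:

```python
import numpy as np

def arma_residual_sum(a, b, X, K, q):
    """Sum of squared residuals (5) for the ARMA model (3).

    a -- K x p matrix of AR coefficients a_ij,
    b -- length-q vector of MA coefficients b_i,
    X -- p x R data matrix, row 0 holds the forecasted feature.
    """
    p, R = X.shape
    eps = np.zeros(R)                  # residuals; unknown early values taken as 0
    total = 0.0
    for t in range(max(K, q), R):
        recent = X[:, t - K:t][:, ::-1].T          # rows: X_{t-1}, ..., X_{t-K}
        ar_part = np.sum(a * recent)               # sum_i sum_j a_ij X_{t-i,j}
        ma_part = np.dot(b, eps[t - q:t][::-1])    # sum_i b_i eps_{t-i}
        eps[t] = X[0, t] - ar_part - ma_part       # recurrent residual (4)
        total += eps[t] ** 2
    return total
```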

Multilayer perceptron

The MLP consists of several layers of neurones. The neurones are connected with each other as shown in Figure 1.

The input signals are submitted to the hidden layer neurones. One can use several hidden layers, where the outputs of each layer are submitted to the inputs of the next layer neurones. In terms of the standard regression models, each single neurone uses a non-linear activation function f:

y = f\!\left( \sum_{i=1}^{N} w_i x_i + w_0 \right),    (6)

where

y is the forecasted value,

x_1, …, x_N are the values used for forecasting,

w_0, w_1, …, w_N are the neurone parameters (weights).

The idea of the ANN model is that the activation function roughly represents the activation of a real neurone. The outputs of the highest hidden layer neurones are submitted to the output layer neurones. In the analysis, only a one hidden layer perceptron was used, with the following data processing function:

\hat{X}_{t,1} = f_{out}\!\left( \sum_{j=1}^{h} v_j \, f_j\!\left( \sum_{i=1}^{N} w_{ji} x_i + w_{j0} \right) + v_0 \right),    (7)

where

\hat{X}_{t,1} is the prediction of the t-th day,

X_i is the data of the day i (the i-th day vector),

K is the "history length",

p is the dimensionality of the data vector,

w_{ji} are the MLP hidden layer coefficients,

v_j are the MLP output layer coefficients,

N = K·p is the number of inputs (x_1, …, x_N are the components of the vectors X_{t-1}, …, X_{t-K}),

f_j, f_{out} are the activation functions. In the output layer

f_{out}(s) = \frac{c}{1 + e^{-s}},    (8)

where

c is a scaling parameter to fit the MLP outputs to the data.

In the hidden layer we used the traditional sigmoid function:

f_j(s) = \frac{1}{1 + e^{-s}}.

The non-linearity of the activation function means that the perceptron is a non-linear forecasting algorithm. We use the MLP to forecast on the basis of the information contained in the K preceding vectors, i.e. the components X_{t-j,\,i} (i = 1 … p, j = 1 … K).

Typically, in order to find the unknown coefficients v_j, w_{ji}, we minimise the sum-of-squares cost function

cost = \sum_{t=K+1}^{R} \left( Y_t - \hat{Y}_t \right)^2,    (9)

where

Y_t is a real value from the learning-set data,

\hat{Y}_t is the forecasted value,

R is the size of the learning-set, i.e. the number of vectors used to find the unknown weights of the perceptron.

The cost function depends on the unknown parameters represented as the vectors (w_{ji}; i = 0, …, N and v_j; j = 0, …, h). Expression (9) shows that the residuals are non-linear functions of the parameters if the activation function is non-linear. The minimum conditions

\frac{\partial \, cost}{\partial w_{ji}} = 0    (10)

and

\frac{\partial \, cost}{\partial v_j} = 0    (11)

form a system of non-linear equations that, as usual, has multiple solutions.

An important parameter in MLP training is the learning step value. Typically, for each new data set it is chosen after a few preliminary experiments (see Rumelhart et al., 1986).

The MLP training process encounters a multi-modality problem, similarly as in the case of the ARMA model. Thus, the perceptron's starting weights are important. Traditionally, while training the perceptron, the starting weights are selected randomly. We initialised the weights randomly, and also used our "active initialisation" technique (Raudys, 1998): we mapped the data into a two-variate space, actively controlled the hidden layer weights in that two-variate space, and then gradually returned to the multivariate feature space. For the real world data discussed, both initialisation approaches resulted in the same accuracy. Therefore, in the tables below we present the results obtained with the random weight initialisation.
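The one hidden layer model (7) with the sigmoid activation and the sum-of-squares cost (9) can be trained by plain gradient descent. The sketch below is only an illustration of the equations above (it uses the linear output variant (17) and random weight initialisation); it is not the software used in the experiments, and the helper names are ours:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_mlp(inputs, targets, h, step=0.01, epochs=300, seed=0):
    """Train a one hidden layer MLP by gradient descent on the cost (9).

    inputs  -- R x N matrix of history vectors, targets -- length-R vector,
    h       -- number of hidden neurones, step -- learning step value.
    W holds the hidden layer weights w_ji, v the output weights v_j
    (bias terms in column/entry 0); the output neurone is linear here.
    """
    rng = np.random.default_rng(seed)
    R, N = inputs.shape
    W = rng.normal(scale=0.1, size=(h, N + 1))
    v = rng.normal(scale=0.1, size=h + 1)
    X1 = np.hstack([np.ones((R, 1)), inputs])        # prepend 1 for the bias

    for _ in range(epochs):
        hidden = sigmoid(X1 @ W.T)                   # R x h hidden outputs
        out = v[0] + hidden @ v[1:]                  # linear output layer
        err = out - targets                          # residuals of the cost (9)
        # gradients of 0.5 * sum(err^2); the factor 2 is absorbed into the step
        grad_v = np.concatenate([[err.sum()], hidden.T @ err])
        delta = (err[:, None] * v[1:]) * hidden * (1.0 - hidden)
        grad_W = delta.T @ X1
        v -= step * grad_v / R
        W -= step * grad_W / R
    return W, v
```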

Real world data sets used in experiments

To compare the accuracy of both prediction methods we used several sets of real economic data obtained from different sources. Define a data set as a table of numbers

X = \{ x_{ij} \}, \quad i = 1, …, p, \; j = 1, …, R.    (12)

Here each column represents one event (moment) of the time series data:

X_j = (x_{1j}, x_{2j}, …, x_{pj})^T.    (13)

The i-th row we call the i-th feature: (x_{i1}, x_{i2}, …, x_{iR}). We predict the values of the main feature for the next day.

Seven real data sets were used to compare the prediction algorithms. On some data sets unsatisfactory results were obtained. In an attempt to improve the prediction accuracy, both the original and the transformed data sets were used (see section 6). The transformation was the following:

\tilde{x}_{ij} = x_{i,\,j+1} - x_{ij}, \quad j = 1, …, R - 1,    (14)

where

\tilde{x}_{ij} is the transformed data,

x_{ij} is the original data,

R is the number of records in the old data.

The transformed vector is the difference between today and yesterday.
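A minimal sketch of the transformation (14), assuming the data set is a NumPy array with features in rows and records in columns (the function name is ours):

```python
import numpy as np

def difference(X):
    """Transformation (14): each new record is today's value minus yesterday's.

    X is a p x R array; the transformed data set has R - 1 records.
    """
    return X[:, 1:] - X[:, :-1]
```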

London Stock Exchange data

Data set name is stock; number of features (dimensionality of the time series) is 23; number of records is 1383; forecasted feature is 23. The data represent London Stock Exchange closing rates over a period of approximately three years. The data contain information such as currency exchange rates, Dow Jones indexes, closing rates of stocks of some of the biggest companies, etc. This data set is the largest one in this research. Therefore, in order to reduce the influence of the high data dimensionality, we performed a feature selection. The experiments have been performed with three data sets containing 2, 8 and all the 23 features.

AT&T share data

Data set name is att; dimensionality is 1; number of records is 488. The data contain AT&T stock closing rates.

Intel stock data

Data set name is intel; dimensionality is 1; number of records is 492. The data contain closing rates of Intel stocks (see Figure 2).


Figure 2. Intel data set, first feature (original on the left, transformed on the right).

Call centre data

Data set name is call; dimensionality is 15; forecasted feature is 5 (the number of calls per day); feature list: month, day, year, alternative call count, call count, unknown, unknown, 8 binary specific events; number of records is 365. The data contain the number of calls per day.

Lithuanian banks stock data

Data set name is bank; number of features is 8; forecasted feature is 7 (during the experiments with the bank data set we obtained the best result forecasting Šiaulių bankas (7) share values); feature list: stock session code, Bankas Hermis, Vilniaus bankas, Lietuvos taupomasis bankas, Vilniaus bankas, bankas Snoras, Šiaulių bankas, Ūkio bankas; number of records is 120. The data contain stock rates of 7 major Lithuanian banks.

Currency exchange data

Data set name is currency; number of features is 4; forecasted feature is 3 (during the experiments with the currency data set we obtained the best result forecasting the Lb to $ exchange rate); number of records is 464. The data contain the exchange rates to the $ of four major currencies: Dm, Fr, Lb, Yen.

USA bond rates data

Data set name is bond; number of features is 8; forecasted feature is 4; number of records is 5893. The data contain daily 1, 2, 3, 5 and 10 year US bond rates over the period from 1976 to 1998. This data set is ill-structured and is not shown in the results table.

Methodology of experiments

To evaluate the forecasting accuracy, the data sets were split into three equal parts. The first part, the learning-set, was used for training, i.e. the estimation of the unknown prediction equation parameters (the perceptron weights, the ARMA parameters). The second part, the validation-set, was used to select the best variant of the prediction methods: the number of inputs, the number of hidden neurones, the optimal stopping moment in MLP training, the ARMA model order parameters, etc. The third part of each data set was used to test the performance of the final variants of the MLP and ARMA models. It is called the test-set.
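The split can be illustrated as follows; this is a sketch under the assumption that the records are ordered in time and stored column-wise in a NumPy array:

```python
def split_three_ways(X):
    """Split a p x R data set into learning, validation and test parts.

    The time order is preserved: the earliest third is the learning-set,
    the middle third the validation-set, the last third the test-set.
    """
    R = X.shape[1]
    a, b = R // 3, 2 * R // 3
    return X[:, :a], X[:, a:b], X[:, b:]
```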

The ARMA software used in the experiments was self-optimising (see Mockus, 1997), and the learning and validation sets were used together to estimate both the prediction model architecture and the model parameters. The MLP software was not self-optimising. Therefore we applied the MLP predictor several times, sequentially changing the number of previous observation vectors used for forecasting and the number of neurones in the hidden layer. The other parameters were set to their default values. The MLP was trained on the learning-set. The best set of parameters was determined using the information in the validation-set.

For the evaluation of the average accuracy of both prediction methods two criteria were used. The first one is the relative efficiency of the method in comparison with the "random walk" method (the random walk predicts: tomorrow-will-be-as-today):

J = 100 \left( \frac{MSPE}{MSPE_{RW}} - 1 \right),    (15)

where

MSPE = \frac{1}{n} \sum_{t=1}^{n} \left( Y_t - \hat{Y}_t \right)^2

is the mean square prediction error (MSPE) of the method,

MSPE_{RW} is the MSPE of the random walk method,

\hat{Y}_t is the forecasted feature value of the day t (the moment t),

n is the sample size.

J is the efficiency coefficient of the method in comparison with the random walk: J = 0 means that the algorithm is of the same quality as the random walk, J = -100 means that the algorithm forecasts ideally, and J > 0 means that the algorithm is worse than the random walk.

For evaluating the accuracy of the MLP predictor, in addition to the relative efficiency we used the traditional correlation coefficient r between the actual and the predicted values, estimated on the training (validation) and the test data sets:

r = \frac{ \sum_t ( Y_t - \bar{Y} ) ( \hat{Y}_t - \bar{\hat{Y}} ) }{ \sqrt{ \sum_t ( Y_t - \bar{Y} )^2 \; \sum_t ( \hat{Y}_t - \bar{\hat{Y}} )^2 } },    (16)

where

Y_t are the original values,

\hat{Y}_t are the forecasted values,

\bar{Y} is the mean of the original values,

\bar{\hat{Y}} is the mean of the forecasted values.
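Both criteria are straightforward to compute; a minimal sketch, assuming (15) is the percentage ratio of mean square prediction errors as written above (the function names are ours):

```python
import numpy as np

def j_coefficient(y_true, y_pred, y_today):
    """Relative efficiency (15) versus the random walk 'tomorrow as today'."""
    mspe = np.mean((y_true - y_pred) ** 2)
    mspe_rw = np.mean((y_true - y_today) ** 2)   # RW forecast = today's value
    return 100.0 * (mspe / mspe_rw - 1.0)

def correlation(y_true, y_pred):
    """Correlation coefficient (16) between the actual and predicted values."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])
```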

Results

The results of the experiments are presented in several ways. There are tables containing the correlation coefficients and the comparison-with-RW numbers evaluated on the validation and the test sets. In the tables, the estimates obtained on the learning and validation sets are denoted by the indices "train" and "valid". The train-set estimates by definition represent well only the training set; therefore the efficiency of the prediction algorithms is evaluated by the test-set estimates. In addition to the tables, two scatter diagrams are presented which depict the true and the forecasted parameter values using the MLP and ARMA algorithms. In Table 1, MS is the number of learning epochs (an epoch is one learning pass using all the vectors of the learning set), Deep is the number of days back, # hidden is the number of neurones in the hidden layer, rtrain is the correlation on the learning set, rtest is the correlation on the test set, Jvalid is the validation set comparison to the random walk (in %), Jtrain is the learning set comparison to the random walk (in %), and Jtest is the test set comparison to the random walk (in %).

Table 1. Results of a number of experiments with MLP using different parameters, for the stock data set.

MS    Deep  # hidden  rtrain    rtest     Jvalid    Jtrain    Jtest
112   1     1         0.28023   0.283288  -58.2809  -27.5862  -31.7638
112   1     2         0.283477  0.282115  -58.2816  -27.6427  -31.765
109   1     3         0.283722  0.272426  -58.2733  -27.6777  -31.5691
111   1     4         0.285586  0.278071  -58.2859  -27.6673  -31.6706
285   2     1         0.85115   0.981302  -87.9664  -56.2395  -78.7807
299   2     2         0.863091  0.981265  -87.9156  -58.174   -80.5493
299   2     3         0.865426  0.979076  -87.6547  -57.1322  -77.0033
327   2     4         0.864981  0.979685  -87.731   -58.5318  -80.2028
237   3     1         0.815711  0.975249  -85.638   -53.0071  -74.9967
237   3     2         0.821114  0.973893  -85.227   -53.1312  -73.574
268   3     3         0.835287  0.970705  -84.8299  -56.4441  -78.1757
277   3     4         0.827319  0.969812  -84.6009  -55.3923  -77.0132
308   4     1         0.815684  0.976644  -83.5278  -56.0419  -84.1125
136   1     7         0.286319  0.280904  -58.2725  -27.7822  -31.71

Table 1 shows that the efficiency of the MLP prediction depends on the number of inputs (the number of previous days used for prediction), the number of hidden neurones and, of course, on the number of learning iterations. The best prediction accuracy on the validation set, Jvalid = -87.9664, was obtained with Deep = 2 and one hidden neurone. Table 1 contains several estimates; the most interesting ones are rtest and Jtest. When comparing the prediction methods on several data sets, a shorter presentation is used for the other data.

In Table 2: Data Set is the name of the data set, RW rtest is the correlation coefficient between the Random Walk (RW) forecasted values and the original values, MLP rtest is the correlation coefficient between the forecasted and the original values in the test set for the best MLP variant found during the series of experiments, MLP Jtest is the J coefficient of that MLP variant in the test set, and ARMA Jtest is the J coefficient between the ARMA forecasted values and the original values in the test set. For each data set, the lower (more negative) Jtest indicates the better prediction accuracy.

 

Table 2. Forecasting results for all data sets using the ARMA and MLP prediction methods

Data Set Name   ARMA Jtest  MLP Jtest  MLP rtest  RW rtest
att             0.721       4.871      0.986      0.986
bank(7)         43.940      17.244     0.704      0.743
stock1          -28.436     -28.927    0.026      0.006
stock2          -54.584     -78.780    0.981      0.006
stock8          -33.005     -86.614    0.983      0.006
stock23         -           32.264     0.519      0.006
call            -19.482     -21.399    0.665      0.535
currency(3)     43.940      89.299     0.910      0.932
intel           23.273      1356.960   0.684      0.983
intel(tra)      -25.382     -26.352    0.223      0.098
intel*          23.273      1056.720   0.877      0.983

(tra) Transformed data set: we forecasted the differences between neighbouring moments (see sections 3, 3.3 and 6).
stock2, stock8 - forecasting using just 2 or 8 features from all 23.
* means that the MLP model without the non-linear activation function in the output layer is used.
The dash in the stock23 row means that the ARMA model broke down on this (quite big) data set.

Figure 3 shows scatter diagrams of the closing rates (on the horizontal axis) and the predicted rates (on the vertical axis). For the prediction we used only 2 features: the first and the eleventh ones. The straight line in both figures marks an exact prediction (zero forecasting error). For the MLP forecast with the 4-3-1 architecture the correlation coefficient is rtest = 0.981, the ARMA correlation coefficient is rtest = 0.898, and the RW correlation coefficient is rtest = 0. We see that for this data set the MLP outperforms the ARMA and RW in the sense of a higher correlation.


Figure 3. London Stock Exchange closing index forecasting (the test data set) by MLP (a) and ARMA (b) methods. Points - MLP(a) or ARMA(b) prediction, plain line - ideal prediction.

Table 2 shows that for some data sets the prediction accuracy was low with both the ARMA and MLP models: only for four data sets (stock2, stock8, call and intel(tra)) out of eleven did we obtain higher accuracy than the trivial random walk method. For these four data sets the multilayer perceptron was more efficient. Curiously, however, for seven data sets the random walk was the most effective prediction algorithm. Note that many different methods failed on the same seven data sets. This suggests the absence of structure in these data sets. The difference between the ARMA model and the MLP was rather small in these cases, and the optimisation and parameter estimation of the ARMA model is much faster.

The traditional criterion used to estimate the prediction accuracy, the correlation coefficient between the true and the predicted values, often fails to evaluate the prediction performance: in some cases the MLP resulted in lower accuracy than the random walk method (see att, intel, currency(3)), yet the correlation coefficient was high. A possible reason for this phenomenon can be traced by inspecting Fig. 3. There we see that the correlation coefficient for the MLP algorithm is rather high; however, the slope of the regression line in this figure differs from the bisector, which represents an ideal prediction. This means that, when solving real life problems, one needs to consider several accuracy measures.

 

Conclusions

The MLP forecast can be made using several models of different architecture. The first one is when the output layer neurone acts as described by equation (7). The second does not use the activation function in the output layer at all:

\hat{X}_{t,1} = \sum_{j=1}^{h} v_j \, f_j\!\left( \sum_{i=1}^{N} w_{ji} x_i + w_{j0} \right) + v_0.    (17)

In the first case, the predicted values are mapped into the interval (0.4, 0.6). The output of the activation function varies between 0 and 1; thus, the forecasted feature varies approximately within the linear part of the activation function. Then the influence of very large deviations of the target (the true, measured values of the forecasted feature) is reduced. Consequently, the activation function brings some stability to MLP training. In the other case, we do not use the activation function and do not map the forecasted feature; thus, it may vary in a very wide interval. In this case, when abnormal deviations of the forecasted feature values were observed in the training set, the prediction accuracy was much lower. A reason for the more successful use of the MLP predictor with the non-linear output is that the non-linearity in the output layer makes the algorithm more robust to outliers (atypical observations).

The tables show that ARMA is slightly better when the data are ill-structured and thus difficult to forecast. This difference, however, is not significant, since the prediction accuracy is low (the random walk method is not outperformed). The MLP exhibits better results than ARMA in cases when the data set is well-structured and thus more suitable for prediction. On some well-structured data sets we obtained much better results in comparison with the random walk. In these cases the MLP forecasting accuracy was notably higher than that of ARMA.

On the ill-structured data sets the regular methods did not outperform the RW prediction (the highest accuracy in predicting tomorrow was achieved by simply using today's values). Note that the ARMA model implements the random walk exactly if the first AR weight is equal to one and all the others are zero. That is not possible for the non-linear MLP. If non-linearity is involved in the well-structured data sets, then the MLP may show good results.

It seems that many economic data sets are ill-structured. The results are discouraging, not better than the trivial random walk; just occasionally we surpassed the RW (see the intel Jtest = 23.273 and att Jtest = 0.721 data sets). That means that the structure of currency exchange rates, stock values and similar data sets changes in time in a hardly predictable way; therefore one needs additional information to make good economic forecasts.

We obtained very low prediction accuracy for the Intel data set using both the MLP and ARMA methods. We made an attempt to forecast the first differences between the values, and later to return to the original variables:

\tilde{x}_t = x_{t+1} - x_t, \qquad \hat{x}_{t+1} = x_t + \hat{\tilde{x}}_t,

where

\tilde{x}_t are the transformed values,

x_t are the original values,

\hat{\tilde{x}}_t are the forecasted transformed values,

\hat{x}_{t+1} are the forecasted original values.

This strategy did not improve the results. The forecasted (using both the MLP and ARMA) differences \hat{\tilde{x}}_t were so tiny that, after the transformation back, these values showed almost no difference from the RW forecasted values. As expected, if the data in the original form forecast poorly, forecasting the differences does not improve the situation significantly.
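The collapse towards the random walk is easy to see once the forecasted differences are added back to today's values; a small sketch with made-up illustrative numbers:

```python
import numpy as np

def back_transform(today_values, forecast_diffs):
    """Return forecasts of the original feature from forecasted differences."""
    return today_values + forecast_diffs

today = np.array([71.2, 72.0, 71.5])            # illustrative closing rates
tiny_diffs = np.array([0.01, -0.02, 0.00])      # forecasted differences near zero
print(back_transform(today, tiny_diffs))        # practically equal to today's values,
                                                # i.e. the random walk forecast
```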

The conclusions are based on a comparatively small number of real life data sets. In future research, it would be desirable to perform experiments using more real life data sets and artificial data sets, too. Special attention should be paid to the cases when the prediction equations are non-linear with respect to the input parameters.

 

Acknowledgements

The authors express their gratitude to Dr. A. Long from City University, London, UK, and to A. S. Soofi, Vidas Plačiakis and J. Juodagalvyte from Vytautas Magnus University, Kaunas, Lithuania, for the real world data sets used in the experiments.

 

References

Box, G. E. P., Jenkins, G. M. (1974). Time Series Analysis: Forecasting and Control. Mir, Moscow, 1-406 (in Russian).

Mockus, J. (1997). The set of examples of global and discrete optimisation, part I. Informatica, 8, 237-264.

Mockus, J. (1997). The set of examples of global and discrete optimisation, part II. Informatica, 8, 495-526.

Raudys, A. (1998). In: Advances in Pattern Recognition, Springer Lecture Notes in Computer Science, Vol. 1451 (Proc. Joint IAPR Int. Workshops SSPR'98 and SPR'98, Sydney, Australia, August 11-13, 1998). Springer Verlag, 989-996.

Rumelhart, D. E., Hinton, G. E., Williams, R. J. (1986). Learning internal representations by error propagation. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. I, 318-362. Bradford Books, Cambridge, MA.

Soofi, A., Mockus, J. (1996). Long memory process and exchange rate forecasting. In: Proceedings of the Conference on Forecasting Financial Markets, Chemical Bank and Imperial College, London, 1996.