
Bayesian Heuristic Approach to Automated Parameter Tuning, Investigation of Examples

Jonas Mockus

Institute of Mathematics and Informatics,
Akademijos 4, LT-2600 Vilnius, Lithuania
jmockus@gmail.com

Abstract

Results of complexity theory show the limitations of exact analysis. This explains the popularity of heuristic algorithms. It is well known that the efficiency of heuristics depends on their parameters. Thus we need automatic procedures for tuning the heuristics. That helps in comparing the results of different heuristics and enhances their efficiency, too.

In this paper the theory and applications of the Bayesian Heuristic Approach are discussed. Examples of the Bayesian approach to automated tuning of heuristics are investigated.

All the algorithms are implemented as platform-independent Java applets or servlets. Readers can easily verify and apply the results for studies and for real-life optimization models.

The information is on the main web-site:
http://pilis.if.ktu.lt/~jmockus
and four mirrors.

Keywords: Optimization, Bayesian, Tuning Heuristics, Distance Studies

Introduction


The efficiency of metaheuristics depends on parameters. Often this relation is defined by statistical simulation and has many local minima. Therefore methods of stochastic global optimization are needed to optimize the parameters.

Traditional numerical analysis considers optimization algorithms that guarantee some accuracy for all the functions to be optimized; this includes exact algorithms. Limiting the maximal error requires a computational effort that often increases exponentially with the size of the problem [1].

An alternative is the average analysis where the expected error is made as small as possible [2]. The average is taken over a set of functions to be optimized. The average analysis is called the Bayesian Approach (BA) [3,4]. Application of BA to optimization of heuristics is called the Bayesian Heuristic Approach (BHA) [5].

In the examples of this paper the distances from the global minimum are not known. Thus the efficiency of methods is tested by comparison with the average results of the "pure" Monte Carlo search. Monte Carlo converges and has no parameters that would have to be adjusted to a given problem by expert knowledge or additional test runs.

Possibilities of application are illustrated by several examples designed for graduate distance studies in the Internet environment. All the algorithms are implemented as platform-independent Java applets or servlets; therefore readers can easily verify and apply the results for studies and for real-life heuristic optimization problems.

Bayesian Approach (BA)

Tuning of heuristics is regarded as an optimization problem. We are looking for such parameters $x\in R^m$ of a heuristic $h=h(x)$ that provide the best results. Denote the original function to be optimized as $\phi(\omega)$, $\omega \in R^{m_o}$. We know only that this function belongs to some family $\Phi$ of functions. Thus we cannot define the optimization quality by just a single sample $\phi(\omega) \in \Phi$; the whole family $\Phi$ has to be regarded. Assume that the search time is limited and defined as the number $n$ of iterations.

Denote by $(\phi^n,\omega^n)$ the results obtained by applying a heuristic $h$ $n$ times to a function $\phi(\omega) \in \Phi$. Here $\phi^n=\phi(\omega^n)$ and $\omega^n=\omega^n(h)$, where $\omega^n(h)$ is the final decision of the heuristic $h$ after $n$ iterations. Formally, after $n$ iterations the heuristic $h$ transforms the original function $\phi(\omega) \in \Phi$ into another function $f(x)\in F$. Here $f(x)=\phi(\omega^n(h(x)))$ belongs to the family $F$ of functions $\phi$ transformed by the heuristic $h$. The transformed function $f(x)$ shows how the value of the original function $\phi(\omega)$, obtained by applying the heuristic $h$ $n$ times, depends on the heuristic parameters $x$.
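To make this definition concrete, here is a minimal Java sketch. The Heuristic interface and the objective phi are assumed placeholders, not the GMJ API: the transformed function $f(x)$ is estimated by applying the parametrized heuristic $n$ times and keeping the best value found.

```java
import java.util.Random;
import java.util.function.ToDoubleFunction;

// Assumed placeholder for a parametrized heuristic, not the GMJ API:
// one call produces one decision omega using the parameters x.
interface Heuristic {
    double[] nextDecision(double[] x, Random rnd);
}

class TransformedFunction {
    // f(x) = phi(omega^n(h(x))): the best value of phi found by
    // applying the heuristic h with parameters x for n iterations.
    static double f(double[] x, Heuristic h, ToDoubleFunction<double[]> phi,
                    int n, long seed) {
        Random rnd = new Random(seed);
        double best = Double.POSITIVE_INFINITY;
        for (int i = 0; i < n; i++) {
            double[] omega = h.nextDecision(x, rnd);
            best = Math.min(best, phi.applyAsDouble(omega));
        }
        return best;
    }
}
```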

Denote the results of the $n$-th iteration by $(y^n,x^n)$. Here $y^n=f(x^n)$ and $x^n=(x_1^n,...,x_m^n)$. Using the same heuristic parameters $x$ we obtain different results $f(x)$, depending on which sample function $\phi$ we optimize. Thus this is optimization under uncertainty. In optimization under uncertainty two different approaches are widely used: the Min-Max and the Bayesian.

The Min-Max approach minimizes the maximal deviation from the solution; here the worst case is regarded. The Bayesian approach searches for the minimal expected deviation; thus the average case is investigated. If the family $F$ is a small set of well-defined functions $f(x)$, then the Min-Max approach is efficient. Otherwise we consider the average case, because the worst case is too pessimistic.

Applying the Bayesian approach we fix a prior distribution $P$ on the set $F$ of functions $f(x)$. We need this to define average results. The prior distribution is transformed into a posterior distribution $P^n$ on a subset $F^n$ of the family $F$. We do that by conditional probabilities.

$\displaystyle P^n(y,x)=P(f(x)<y\vert f(x^i)= y^i, i=1,...,n)$     (1)

Here $P(f(x)<y\vert f(x^i)= y^i, i=1,...,n)$ shows how the probability $P$ of the event $f(x)<y$ depends on the calculation results $y^1,...,y^n$.

Now we can define a risk function $R_0(x)$. The risk function $R_0(x)$ shows how the expected deviation $r(x)$ from the solution depends on the parameters $x$. Here $r(x)=f(x)-f(x^*)$, where $x^*$ is the optimal and $x$ the current value of the heuristic parameters. Owing to the linearity of the deviation $r(x)$, the risk function $R_0(x)$ can be expressed as the difference $R_0(x)={\bf E} f(x) - {\bf E} f(x^*)$. The second component is a constant associated with the set $F$ and the prior distribution $P$. Since ${\bf E} f(x^*)$ does not depend on the decisions $x$, we disregard this component. Then we can define the risk function as the expected value of $f(x)$:

$\displaystyle R(x) = {\bf E} f(x) = \int_F f(x)\, dP(f(x))$     (2)

The Bayesian Approach (BA) is defined by fixing a prior distribution $P$ on a set of functions $f(x)$ and by minimizing the risk function $R(x)$ [6,7,3,8,9]. The risk function shows how the average deviation depends on the parameters $x$. The distribution $P$ is regarded as a stochastic model of $f(x), x \in R^m$, where $f(x)$ may be a deterministic or a stochastic function. This is possible because, using the Bayesian approach, uncertain deterministic functions can be regarded as stochastic functions [10,6,11,12,13]. That is an essential feature of the Bayesian approach in this setup. For example, if several values of some deterministic function $y_i=f(x_i)$, $i=1,...,n$, are known, then the level of uncertainty can be represented by the conditional standard deviation $s_n(x)$ of the corresponding stochastic function $f(x)=f(x,\nu)$, where $\nu$ is a stochastic variable. The aim of BA is to provide as small an average deviation as possible.
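As an illustration, the sketch below performs one Bayesian-style search step in one dimension, assuming a Wiener-process model (the classical choice in [7]): between two observed points the conditional mean is the linear interpolation of the observations and the conditional variance is $\sigma^2(x-a)(b-x)/(b-a)$. The grid and the lower-confidence-bound score with coefficient kappa are illustrative simplifications, not the exact risk function of the method "Bayes".

```java
// One Bayesian-style step under an assumed Wiener-process model:
// pick the next observation point by minimizing a surrogate of the risk.
class BayesStep {
    // xs: sorted observation points, ys: observed values f(xs[i]).
    static double nextPoint(double[] xs, double[] ys, double sigma, double kappa) {
        double bestScore = Double.POSITIVE_INFINITY, bestX = xs[0];
        for (int i = 0; i + 1 < xs.length; i++) {
            double a = xs[i], b = xs[i + 1];
            for (int k = 1; k < 100; k++) {                   // grid inside (a, b)
                double x = a + (b - a) * k / 100.0;
                double mean = ys[i] + (ys[i + 1] - ys[i]) * (x - a) / (b - a);
                double var = sigma * sigma * (x - a) * (b - x) / (b - a);
                double score = mean - kappa * Math.sqrt(var); // lower confidence bound
                if (score < bestScore) { bestScore = score; bestX = x; }
            }
        }
        return bestX; // the next point to observe
    }
}
```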

Software Framework for Global Optimization (GMJ)

In optimization problems theory and software are interconnected. The final results depend on the mathematical theory of optimization and the software implementation. Thus we have to regard them both.

Bayesian Search, "Bayes"

Representing different examples as parts of some general set-up, we need basic theoretical and software tools. The examples should be united by a common framework, which we call GMJ (Global Minimizer by Java). The Bayesian Heuristic Approach serves as the common theoretical concept. We apply this approach both for automatic tuning of heuristic parameters and for the search for optimal mixtures of heuristics.

That is just the initial part of the GMJ. It is important to make GMJ open for development by users. Users contribute their own optimization methods in addition to the Bayesian ones. User optimization models are included as GMJ tasks. The results of optimization are represented by GMJ analysis objects. A minimal set of methods, tasks, and analysis objects is implemented by default; the rest depends on users.

Monte Carlo, Uniform Search, "Mig1"

Often the calculation of the deviation from the exact global minimum is as difficult as the original global optimization problem. We can define some $\epsilon$-bounds by using additional properties of the objective function, such as a Lipschitz constant. Otherwise a different standard should be defined. The uniform Monte Carlo search can be used:
$\displaystyle x^{n+1}= \arg \min_{1 \le i \le n} f(x^i)$     (3)

Here the components $x_j^i$ are generated randomly by the uniform distribution on the feasible intervals $a_j \le x_j^i \le b_j$, $j=1,...,m$, $i=1,...,n$.
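A minimal Java sketch of the search (3); the objective f and the bounds a, b are assumed inputs supplied by the caller:

```java
import java.util.Random;
import java.util.function.ToDoubleFunction;

class Mig1 {
    // Uniform Monte Carlo search: n uniform trials, keep the best point.
    static double[] search(ToDoubleFunction<double[]> f,
                           double[] a, double[] b, int n, long seed) {
        Random rnd = new Random(seed);
        double[] best = null;
        double bestVal = Double.POSITIVE_INFINITY;
        for (int i = 0; i < n; i++) {
            double[] x = new double[a.length];
            for (int j = 0; j < a.length; j++)
                x[j] = a[j] + (b[j] - a[j]) * rnd.nextDouble(); // uniform on [a_j, b_j]
            double y = f.applyAsDouble(x);
            if (y < bestVal) { bestVal = y; best = x; }
        }
        return best; // arg min over the n trials
    }
}
```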

Monte Carlo is apparently the simplest method providing convergence with probability one for any continuous objective function. Thus we can test the efficiency of global optimization methods by comparing them with the Monte Carlo search (3). The analysis tool of GMJ, "Convergence", shows how the best values of the objective function depend on the iteration number. A method of search can be classified as efficient only if it converges faster than Monte Carlo.

Convergence

The 'Convergence' window of GMJ shows how the best value of the objective depends on the number of iterations. For most methods this is a decreasing step-function; steps happen at 'improving' iterations. In global optimization the convergence line is often nearly logarithmic. That illustrates the high complexity of global optimization in general.

Projection

The 'Projection' window of GMJ is apparently the simplest two-dimensional visualization of multi-dimensional objective functions. Using the Projection we may obtain some useful information about the objective if the variables are changed separately. The method 'Exkor' does just that.

Other optimization methods change many variables at each iteration. Here the 'Projection' shows how the density of observations depends on the variables $x$. In efficient methods the density is higher around the global minimum. Comparing the 'Projection' windows of different methods, we see higher densities around the global minima in both Bayesian methods, 'Exkor' and 'Bayes'.

The illustrations are the projections of the 'Exkor' and 'Bayes' methods in Fig. 1 and Fig. 2.

[Fig. 1: projection of the method 'Exkor'.]
[Fig. 2: projection of the method 'Bayes'.]

Improving Expert Heuristics

The main objective of BHA is improving any given heuristic by defining the best parameters and the best ``mixtures'' of different heuristics. The examples indicate that heuristic decision rules mixed and adapted by BHA often outperform the best individual heuristics. In addition, BHA provides almost sure convergence. However, the final results of BHA depend on the quality of the specific heuristics including the expert knowledge. Therefore the BHA should be regarded as a tool for enhancing the heuristics but not for replacing them.

Many well-known optimization algorithms, such as Genetic Algorithms [14], GRASP [15], and Tabu Search [16], may be regarded as metaheuristics that can be improved using BHA.

Genetic Algorithms [14] are an important "source" of useful stochastic search heuristics. It is well known that the results of genetic algorithms depend on the mutation and cross-over parameters. The Bayesian Heuristic Approach can be used to optimize those parameters.

In Tabu Search, the best combinations of short- and long-term memory and the best balance of intensification and diversification strategies can be obtained using BHA.

Hence, the Bayesian Heuristic Approach can be applied to improve various stochastic or heuristic algorithms of discrete optimization. A proof of convergence of the discrete search method is an asset. If there is none, the convergence conditions can be provided by tuning the BHA parameters [17].

Optimization of Mixed Heuristics

Greedy Heuristics

Greedy heuristics build a system from scratch. The well-known greedy heuristic $h_j=c_j/g_j$ is the specific value of object $j$. The greedy heuristic algorithm, "take the greatest feasible $h_j$", is very fast but may get stuck at some non-optimal decision.

We move the heuristic algorithm out of such non-optimal decisions by taking decision $j$ with probability $r_j=\rho_x(h_j)$, where $\rho_x(h_j)$ is an increasing function of $h_j$ and $x=(x_1,...,x_m)$ is a parameter vector. The Bayesian approach is used to optimize the parameters $x$ by minimizing the best result $f_K(x)$ obtained by applying the randomized heuristic algorithm $\rho_x(h_j)$ $K$ times.

Optimization of $x$ adapts the heuristic algorithm $\rho_x(h_j)$ to a given problem. Let us illustrate the parametrization of $\rho_x(h_j)$ by three randomization functions: $r_i^l=h_i^l / \sum_j h_j^l$, $l=0,1,\infty$. Here the upper index $l=0$ denotes the Monte Carlo component (randomization by the uniform distribution). The index $l=1$ defines the linear component of randomization. The index $\infty$ denotes the pure heuristic with no randomization: $r_i^{\infty}=1$ if $h_i=\max_j h_j$, and $r_i^{\infty}= 0$ otherwise. Here the parameters $x=(x_0, x_1, x_{\infty})$ define the probabilities of using the randomizations $l=0,1,\infty$. The optimal $x$ may be applied to solving different but related problems, too [17]. That is important in "on-line" optimization.
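A minimal sketch of this randomization in Java, assuming all $h_j>0$ and $x_0+x_1+x_{\infty}=1$: first the index $l$ is drawn with probabilities $(x_0, x_1, x_{\infty})$, then the decision $j$ is selected accordingly.

```java
import java.util.Random;

class RandomizedGreedy {
    // Choose decision j using the mixture of randomizations l = 0, 1, infinity.
    static int choose(double[] h, double x0, double x1, Random rnd) {
        double u = rnd.nextDouble();
        if (u < x0) {                    // l = 0: uniform Monte Carlo component
            return rnd.nextInt(h.length);
        } else if (u < x0 + x1) {        // l = 1: probability proportional to h_j
            double sum = 0;
            for (double v : h) sum += v;
            double r = rnd.nextDouble() * sum, acc = 0;
            for (int j = 0; j < h.length; j++) {
                acc += h[j];
                if (r <= acc) return j;
            }
            return h.length - 1;
        } else {                         // l = infinity: pure greedy, max h_j
            int best = 0;
            for (int j = 1; j < h.length; j++) if (h[j] > h[best]) best = j;
            return best;
        }
    }
}
```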

Cut Optimization, "Cut"

This is a simple example of a typical task using the Bayesian approach for tuning heuristics. The two-dimensional "guillotine" cut is regarded. The 'mixture' of three 'greedy' heuristics is optimized. The term 'greedy' describes heuristics that build a system from scratch by adding elements one by one. The 'mixture' means that different heuristics are selected randomly at each iteration with some fixed probabilities.

The first heuristic is denoted 'Random': here we select cut elements with equal probabilities. The second heuristic is 'Max Size First'; the third is 'Min Size First'. The names speak for themselves. We search for the best mixture of these three heuristics at a fixed optimization time. The time parameter is important because the optimal mixture depends on the number of iterations.
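The per-iteration choice of a heuristic can be sketched as a simple roulette wheel over the fixed probabilities; the vector p (e.g. {pRandom, pMaxSizeFirst, pMinSizeFirst}) is an assumed input that the tuning procedure optimizes:

```java
import java.util.Random;

class HeuristicMixture {
    // Roulette-wheel selection of one of the heuristics by fixed probabilities.
    static int select(double[] p, Random rnd) {
        double u = rnd.nextDouble(), acc = 0;
        for (int k = 0; k < p.length; k++) {
            acc += p[k];
            if (u <= acc) return k; // index of the chosen heuristic
        }
        return p.length - 1;        // guard against rounding of the probabilities
    }
}
```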

A sample of the optimization of a two-dimensional "guillotine" cut problem is shown in Fig. 3. The numbers show how many times the corresponding heuristic is used.
A sample of the optimization of a three-dimensional packing problem is shown in Fig. 4.
[Fig. 3: a sample result of the two-dimensional "guillotine" cut optimization.]
[Fig. 4: a sample result of the three-dimensional packing optimization.]

Permutation Heuristics

Permutation heuristics improve some initial expert decision. Here the expert knowledge is involved in the initial decision. Applying BHA for global minimization, different permutations of some feasible solution $y^0$ are tested. The heuristics are defined as the difference $h_i=v(y^i)-v(y^0)$ between the permuted solution $y^i$ and the original solution $y^0$.
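A minimal sketch of one permutation heuristic, assuming the objective $v$ is supplied by the caller; a single swap produces the permuted solution $y^i$, and the returned value is $h_i=v(y^i)-v(y^0)$:

```java
import java.util.function.ToDoubleFunction;

class PermutationHeuristic {
    // Swap positions i and j of y0 and return h = v(permuted) - v(y0).
    static double delta(int[] y0, int i, int j, ToDoubleFunction<int[]> v) {
        int[] yi = y0.clone();
        int t = yi[i]; yi[i] = yi[j]; yi[j] = t; // one permutation of y^0
        return v.applyAsDouble(yi) - v.applyAsDouble(y0);
    }
}
```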

Optimization of Simulated Annealing (SA)

A good example is the optimization of a profiled school scheduling model. The vector objective is evaluated by linear scalarization. The relative importance of the various factors is defined by the school authorities. The initial schedule is provided by the school and is improved by permutations.

Permutations are made by trying to close the gaps between lectures. The best obtained schedule is recorded after each iteration. Changes to worse schedules are made with some probabilities. This is done to improve the convergence conditions.

Three different heuristics are regarded: two heuristics with fixed parameters and a third heuristic with parameters obtained by global stochastic optimization.

Using the first heuristic we keep the current schedule until a better schedule is found. The parameter $x_1$ is the probability of passing over the next teacher. In this way we can reach only a sort of local minimum.

A natural first step to provide convergence to the global minimum is Simulated Annealing (SA): we move from the current schedule $i$ to the permuted schedule $i+1$ with probability

$\displaystyle r_{i+1} = \cases{ e^{-h_{i+1} / (x_2 / \ln (1+N))}, & if $h_{i+1} > 0$ \cr 1, & otherwise }$     (4)

Here $N$ is the iteration number and $x_2$ is the "initial temperature". The logarithmic "cooling schedule" $\ln(1+N)$ follows from the convergence conditions [18].

The difference from traditional SA is that here we want to improve the average results for some fixed number of iterations $N=K$. Thus the cooling rate should be regarded, too. A way to do this is to introduce the cooling rate parameter $x_3$.

This transforms expression (4) into

$\displaystyle r_{i+1} = \cases{ e^{-h_{i+1} / (x_2 / \ln (1+x_3 N))}, & if $h_{i+1} > 0$ \cr 1, & otherwise }$     (5)

where $x_2 \ge 0$ defines an "initial temperature" of SA and $x_3 \ge 0$ describes the "cooling rate".
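A minimal sketch of the acceptance rule (5); following the usual SA convention, an improving permutation ($h_{i+1} \le 0$) is assumed to be always accepted:

```java
import java.util.Random;

class Annealing {
    // Accept a permuted schedule with probability (5); h is the change of
    // the objective, x2 the initial temperature, x3 the cooling rate.
    static boolean accept(double h, double x2, double x3, int N, Random rnd) {
        if (h <= 0) return true;                        // improvement: always accept
        double temperature = x2 / Math.log(1 + x3 * N); // parametrized cooling
        return rnd.nextDouble() < Math.exp(-h / temperature);
    }
}
```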

The third heuristic optimizes all three parameters $x=(x_1,x_2,x_3)$ for a fixed optimization time defined by the number of "internal" iterations $N=K$. Fig. 5 shows the results of SA optimization in school scheduling. Fig. 6 shows the initial and optimized student schedules.

[Fig. 5: results of SA optimization in school scheduling.]
[Fig. 6: initial and optimized student schedules.]

Distance Studies

For graduate-level distance and classroom studies of the theory of games and markets in the Internet environment, a set of examples of global and discrete optimization was implemented using platform-independent Java applets and servlets. Here is the list of web-sites:
http://pilis.if.ktu.lt/~jmockus
http://eta.ktl.mii.lt/~mockus
http://proin.ktu.lt/~mockus
http://mockus.us/optimum
The theoretical background and the complete description of the software are in the file 'stud2.pdf'. The examples are described in two separate sections: "Global Optimization" for continuous variables, and "Discrete Optimization" for discrete optimization plus applications of linear and dynamic programming.
All the results for international users are in English. Specific examples designed for Lithuanian universities are in Lithuanian.

The video-conferencing is regular: each Friday from 8:00 until 9:30 EET (EEST). Currently the language is Lithuanian because no foreign students are connected. However, the essential part of the web-site is in English, so English broadcasts are possible, too. The main difficulty for English-speaking students is apparently the early hours (7:00-8:30 CET).
The connection is free. A static IP and NetMeeting are used. Advance notification is assumed. Interactive mode is easy using some standard audio and video tools.
Fig. 7 is a snapshot of a video-conference about the Wiener model, with some "on-line" graphical explanations in response to students' questions.
[Fig. 7: a snapshot of a video-conference about the Wiener model.]
All the video-conferences are recorded and posted on the web-site:
http://distance.ktu.lt/vips/join.php?sr=242

Conclusions

1. The objective of the paper is to start scientific collaboration with colleagues working on similar lines.
2. The growing power of the Internet presents new problems and opens new possibilities for distance scientific collaboration and graduate studies. Therefore some nontraditional ways of presenting scientific results should be defined.
3. The paper shows an example of a specific style designed to encourage new approaches to the presentation and sharing of scientific results.
4. The results of optimization show the possibilities of nontraditional ways of graduate studies and scientific collaboration by creating and using a specific environment for e-education.
5. Examples of applications of the Bayesian Heuristic Approach show the efficiency of automated tuning of heuristics.

Bibliography

1
Horst, R., Pardalos, P.:
Handbook of Global Optimization.
Kluwer Academic Publishers, Dordrecht/Boston/London (1995)

2
Calvin, J., Zilinskas, A.:
One-dimensional P-algorithm with convergence rate $O(n^{-3+\delta})$ for smooth functions.
Journal of Optimization Theory and Applications 106 (2000) 297-307

3
Diaconis, P.:
Bayesian numerical analysis.
In: Statistical Decision Theory and Related Topics.
Springer Verlag (1988) 163-175

4
Mockus, J., Mockus, L.:
Bayesian approach to global optimization and applications.
In: Theory of Optimal Decision. Volume 12.
Institute of Mathematics and Cybernetics, Academy of Sciences of the Lithuanian SSR, Vilnius, Lithuania (1987) 57-70 (in Russian)

5
Mockus, J.:
A Set of Examples of Global and Discrete Optimization: Application of Bayesian Heuristic Approach.
Kluwer Academic Publishers (2000) ISBN 0-7923-6359-0.

6
DeGroot, M.:
Optimal Statistical Decisions.
McGraw-Hill, New York (1970)

7
Mockus, J.:
Bayesian approach to global optimization.
Kluwer Academic Publishers, Dordrecht-London-Boston (1989)

8
Berger, J.:
Statistical Decision Theory and Bayesian Analysis.
Springer-Verlag, Berlin, Heidelberg (1985)

9
Kadane, J., Wasilkowski, G.:
Average case $\epsilon$-complexity in computer science - a Bayesian view.
In: Bayesian Statistics 2.
Elsevier Science Publishers (1985) 361-374

10
Lindley, D.:
Bayesian Statistics: A Review.
SIAM, Philadelphia (1971)

11
Savage, L.:
Foundations of Statistics.
Wiley, New York (1954)

12
Fine, T.L.:
Theories of Probability.
Academic Press, New York (1973)

13
Zilinskas, A.:
Global optimization: axiomatic of statistical models, algorithms and their applications.
Mokslas, Vilnius, Lithuania (1986) (in Russian)

14
Goldberg, D.E.:
Genetic Algorithms in Search, Optimization, and Machine Learning.
Addison-Wesley, Reading, MA (1989)

15
Mavridou, T., Pardalos, P., Pitsoulis, L., Resende, M.:
A GRASP for the biquadratic assignment problem.
European Journal of Operational Research 105 (1998) 613-621

16
Glover, F.:
Tabu search: improved solution alternatives.
In: Mathematical Programming. State of the Art 1994.
University of Michigan (1994) 64-92

17
Mockus, J., Eddy, W., Mockus, A., Mockus, L., Reklaitis, G.:
Bayesian Heuristic Approach to Discrete and Global Optimization.
Kluwer Academic Publishers, ISBN 0-7923-4327-1, Dordrecht-London-Boston (1997)

18
Cohn, H., Fielding, M.:
Simulated annealing: Searching for an optimal temperature schedule.
SIAM J. Optim. 9 (1999) 779-802
