1 General Ideas of
Bayesian Heuristic Approach

1 Outline

The traditional numerical analysis considers optimization algorithms that guarantee some accuracy for all functions to be optimized. This includes the exact algorithms. That is the worst case analysis. To limit maximal errors one needs computational efforts that often increase exponentially with the size of the problem. This is the main disadvantage of the worst case analysis.

The alternative is the average case analysis. Here an average error is not limited but is made as small as possible. The average is taken over a set of functions to be optimized. The average case analysis is the Bayesian Approach (BA) [Diaconis, 1988,Mockus, 1989a].

There are several ways to apply BA in optimization. The first one, the Direct Bayesian Approach (DBA), is defined by fixing a prior distribution $P$ on a set of functions $f(x)$ and by minimizing the Bayesian risk function [DeGroot, 1970,Mockus, 1989a]. The risk function $R(x)$ is the expected deviation from the global minimum at a fixed point $x$. The distribution $P$ is considered as a stochastic model of $f(x),\ x \in R^m$, where $f(x)$ might be a deterministic or a stochastic function. In the Gaussian case, assuming [Mockus, 1989a] that the $(n+1)$th observation is the last one

R(x)=\frac {1}{\sqrt 2\pi s_n(x)}\times \nonumber \\
...\min(c_n,z) e^{-\frac{1}{ 2}
{(\frac{y-m_n(x)} { s_n(x)})}^2}d z.

Here $c_n=\min_i z_i -\epsilon,\ z_i=f(x_i)$. $m_n(x)$ is a conditional expectation with respect to observed values $z_i,\ i=1,...,n$. $s_n^2(x)$ is a conditional variance, and $\epsilon>0$ is a correction parameter. The minimum of the risk function $R(x)$ is obtained [Mockus, 1989a] at the point

x_{n+1}= \arg \max_x \frac { s_n(x)}{m_n(x)-c_n}.

The objective of DBA, used mainly in continuous cases, is to provide as small average error as possible while keeping convergence conditions.

Another way, the Bayesian Heuristic Approach (BHA), means fixing a prior distribution $P$ on a set of auxiliary functions $f_K(x)$. These functions define the best values obtained by using $K$ times some heuristic $h(x)$. It is assumed that heuristics $h(x)$ depend on some continuous parameters $x \in R^m$. The heuristic helps to optimize an original function $C(y)$ of variables $y \in R^n$, where $m < n$ [Mockus et al., 1997]. As usual, components of $y$ are discrete variables. Heuristics are based on expert opinions about the decision priorities. Now both DBA and BHA will be considered in detail.

2 Direct Bayesian Approach (DBA)

The Wiener process is a common [Kushner, 1964a,Saltenis, 1971,Torn and Zilinskas, 1989] stochastic model in the one-dimensional case $m=1$.
Figure 1.1 shows the conditional expectation, the conditional standard, and the risk function with respect to available evaluations.
Figure 1.1: The Wiener model. The conditional expectation $m(x)$, the conditional standard $s(x)$, and the risk function $R(x)$ with respect to observed values $x(1), y(1), x(2), y(2),...$.
The Wiener model implies continuity of almost all sample functions $f(x)$. The model assumes that increments $f(x_4)-f(x_3)$ and $f(x_2)-f(x_1)$, $x_1 < x_2 < x_3 < x_4$, are stochastically independent. Here $f(x)$ is Gaussian $(0,\sigma x)$ at any fixed $x >0$. Note, that the Wiener process originally provided a mathematical model of a particle in the Brownian motion.

The Wiener model is extended to multidimensional case, too [Mockus, 1989a]. However, simple approximate stochastic models are preferable if $m >1$. Approximate models are designed by replacing the traditional Kolmogorov consistency conditions. These conditions require the inversion of matrices of $n$th order for computing the conditional expectation ${m_n(x)}$ and variance ${s_n^2(x)}$ [*].

Replacing the regular consistency conditions by:
- continuity of the risk function $R(x)$
- convergence of $x_n$ to the global minimum
- simplicity of expressions of $m_n(x)$ and $s_n(x)$
the following simple expression of $R(x)$ is obtained using the results of [Mockus, 1989a]

R(x) =\min_{1 \le i \le n} z_i- \min_ {1 \le i \le n} \frac {
\Vert x-x_i \Vert^2} {z_i -c_n}.\nonumber \end{eqnarray}

The aim of DBA is to reduce the expected deviation. In addition, DBA has some good asymptotic properties, too. It is shown [Mockus, 1989a] that

d^* / d_a = \left(\frac {f_a-f^*+\epsilon}{ \epsilon
}\right)^{1/2},\ n \rightarrow \infty. \nonumber \end{eqnarray}

Here $d^*$ is a density of points $x_i$ around the global optimum. $d_a$ is an average density of $x_i$ in the feasible area. $f_a$ is an average value of $f(x)$ in this area. $f^*$ is an average value of $f(x)$ around the global minimum. $\epsilon$ is the correction parameter in expression (1.1). That means that DBA provides convergence to the global minimum for any continuous $f(x)$ and a greater density of observations $x_i$ around the global optimum, if $n$ is large.

Note, that the correction parameter $\epsilon$ has a similar influence as the temperature in simulated annealing. However, that is a superficial similarity. The good asymptotic behavior is just some "by-product" of DBA. The reason is that Bayesian decisions are applied for small size samples where asymptotic properties are not noticeable. To choose the next point $x_{n+1}$ by DBA, one minimizes the expected deviation $R(x)$ from the global optimum (see Figure 1). The minimization of $R(x)$ is a complicated auxiliary optimization problem. That means that DBA is useful for expensive, in terms of computing times, functions of a few ($m < 20$) continuous variables. This happens in wide variety of problems.

Some examples are described in [Mockus, 1989a]. These include maximization of the yield of differential amplifiers, optimization of mechanical systems of a shock absorber, optimization of composite laminates, evaluation of parameters of immunological models and nonlinear time series, planning of extremal experiments on thermostable polymeric compositions. A large set of test problems in global optimization is considered in [Floudas et al., 1999].

3 Bayesian Heuristic Approach (BHA)

The Bayesian Heuristic Approach (BHA) is for optimization of simple objective functions with large variable numbers. That is the case in many discrete optimization problems. As usual, these problems are solved using heuristics based on an expert opinion. Heuristics often involve randomization[*] procedures that depend on some empirically defined parameters.

The examples of such parameters are:
- an initial temperature of the simulated annealing,
- probabilities of different randomization algorithms in their "mixture."
In these problems DBA is a convenient tool for optimization of the continuous parameters of various heuristic techniques and their mixtures. That is the Bayesian Heuristic Approach (BHA) [Mockus et al., 1997].

Applying DBA, the expert knowledge is involved by defining a prior distribution. In BHA one exploits expert knowledge "twice". First time one does that by defining the heuristics. Second time the expert knowledge is involved by defining a prior distribution, needed to optimize parameters of heuristics using DBA.

4 Illustrative Examples

The example of a knapsack problem is convenient to illustrate basic principles of BHA in the discrete optimization. Given a set of objects
$j=1,...,n$ with values $c_j$ and weights $g_j$, find the most valuable collection of a limited weight $g$

\max_y C(y),\ C(y)=\sum_{j=1}^n c_j y_j, \ \sum_{j=1}^n g_j
y_j\le g. \nonumber \end{eqnarray}

Here the objective function $C(y)$ depends on $n$ Boolean variables $y=(y_1,...,y_n)$, where $y_j=1$, if an object $j$ is in the collection, and $y_j=0$, otherwise. The well known greedy heuristic $h_j=c_j/g_j$ is the specific value of an object $j$. The greedy heuristic algorithm: "take the greatest feasible $h_j$", is very fast but it may get stuck in some non optimal decision. One may force the heuristic algorithm out of there by taking a decision $j$ with some probability $r_j=\rho_x(h_j)$[*]

Here DBA is applied to optimize parameters $x$ of the auxiliary function $f_K(x)$. This function defines the best results obtained applying $K$ times a randomized heuristic algorithm $\rho_x(h_j)$. That is the most expensive operation of BHA. Therefore, the parallel computing of $f_K(x)$ should be used, if possible[*]. Optimization of $x$ adapts the heuristic algorithm $\rho_x(h_j)$ to a given problem. The parameter $x$ may be useful to solve related problems [Mockus et al., 1997]. That helps using the "on-line" optimization.

We illustrate the parameterization of $\rho_x(h_j)$ by using three randomization functions $r_i^l=h_i^l / \sum_j h_j^l,\
l=0,1,\infty $. Here the superscript $l=0$ denotes the uniformly distributed component. $l=1$ defines the linear component of randomization. The superscript $\infty$ denotes the pure heuristics with no randomization. That means $r_i^{\infty}=1$, if $h_i=\max_j h_j$ and $r_i^{\infty}= 0$ , otherwise. The parameter $x=(x_0, x_1, x_{\infty})$ defines probabilities of using the randomizations $l=0,1,\infty$, correspondingly. That can be understand as some "lottery of algorithms" where $x_l$ is a probability to "win" the algorithm $l$. The detail description of the knapsack example is in the next chapter.

5 Heuristics

The main objective of BHA is to improve any given heuristic by defining the best parameters of some heuristic and/or the best ``mixtures'' of different heuristics. Heuristic decision rules, mixed and adapted by BHA, often outperform[*] the best individual heuristics, as judged by examples. In addition, BHA provides almost sure convergence. However, the results of BHA depend on the quality of specific heuristics. The quality of heuristics reflects the expert knowledge. That means, BHA should be considered as a tool for enhancing the heuristics but not for replacing them.

Many well-known optimization algorithms, such as the Genetic Algorithms (GA) [Goldberg, 1989], GRASP [Mavridou et al., 1998], and the Tabu Search (TS) [Glover, 1994], may be considered as some general heuristics that can be improved using BHA. There are specific heuristics tailored to fit particular problems. For example, the Gupta heuristic was the best one while applying BHA to the flow-shop problem [Mockus et al., 1997].

The Genetic Algorithms [Goldberg, 1989] is an important "source" of interesting and useful stochastic search heuristics. It is well known [Androulakis and Venkatasubramanian, 1991] that results of the genetic algorithms depend on mutation and crossover parameters. The Bayesian Heuristic Approach could be used in optimizing those parameters.

In the GRASP system [Mavridou et al., 1998], a heuristic is repeated many times. During each iteration, a greedy randomized solution is constructed and the neighborhood around that solution is searched for the local optimum. The "greedy" component adds one element at a time, until a solution is constructed. A possible application of BHA in GRASP is in optimizing a random selection of a candidate to be in the solution. BHA might be useful as a local component, too, by randomizing the local decisions and optimizing the corresponding parameters.

In the Tabu search, two important parameters may be optimized using BHA. Those are:

The examples show that the Bayesian Heuristics Approach is useful applying many well-known stochastic and heuristic algorithms of discrete optimization. The convergence conditions, if needed, are provided by tuning the BHA [Mockus et al., 1997].
jonas mockus 2004-03-03