2 Explaining Bayesian Heuristic Approach by Example of Knapsack Problem

In the previous chapter we showed how BHA works while searching for a solution of the knapsack problem. Now we shall use this problem to illustrate the similarities and differences of various approaches, including BHA.

The reason for such special attention is that the knapsack problem is a simple example of an important family of NP-complete problems[*]. Note that this example is good mainly for illustrating BHA and alternative approaches. The reason is that the simple greedy heuristic of the knapsack problem is very efficient. Therefore, only marginal improvements can be made by more sophisticated methods, including BHA. In scheduling problems, such as flow-shop and batch scheduling [Mockus et al., 1997], BHA works more efficiently. However, the scheduling problems are more difficult to explain. Thus, we start the explanations with the knapsack problem.

The knapsack problem is to maximize the total value of a collection of objects when the total weight of these objects is limited by $g$. Denote the value of object $i$ by $c_i$ and its weight by $g_i$.

\begin{eqnarray}
\max_y \sum_{i=1}^n c_i y_i
\\
\sum_{i=1}^n g_i y_i \le g
\\
y_i \in \{0,1\}. \nonumber
\end{eqnarray}


Here the objective $C(y)=\sum_{i=1}^n c_i y_i$ depends on $n$ Boolean variables $y_i,\ i=1,...,n$.

1 Exact Algorithms


1 Exhaustive Search

The decision $m$ means that a collection of objects
$I(m)=\{i : y_i(m)=1\}$ is selected. The value of the collection is
$c(m)=C(y(m))$. This collection is described by the decision vector
$y=y(m)=(y_i(m),\ i=1,...,n)$. The simplest exact algorithm is to compare all the collections:
  1. select a current collection $I(m^N),\ N=1,...,K$,
  2. record the best current collection $I^*(m^N)$,
  3. stop when all the collections are considered $N=K=2^n$.
The best current collection $I^*(m^N)$ denotes the most valuable collection satisfying the weight limit that is obtained during $N$ iterations. The best current collection is updated during the recording operation in step two: it is replaced by a better current collection. The exhaustive search of all the decisions needs $K=2^n$ iterations. This means time $T=C\ 2^n$, where the constant $C$ is the observation time (the CPU time needed to evaluate sum (2.1) and to test inequality (2.2)).
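For illustration, here is a minimal sketch of the exhaustive search in Java (the class and variable names are illustrative, not part of the software described later in this chapter):

\begin{verbatim}
// Exhaustive search: compare all K = 2^n collections (n < 63 assumed).
public class ExhaustiveKnapsack {
    public static double best(double[] c, double[] g, double limit) {
        int n = c.length;
        double bestValue = 0.0;               // value of the best current collection
        for (long m = 0; m < (1L << n); m++) {
            double value = 0.0, weight = 0.0;
            for (int i = 0; i < n; i++) {
                if ((m & (1L << i)) != 0) {   // y_i(m) = 1: object i is selected
                    value += c[i];
                    weight += g[i];
                }
            }
            if (weight <= limit && value > bestValue) {
                bestValue = value;            // recording operation of step two
            }
        }
        return bestValue;
    }
}
\end{verbatim}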


2 Branch & Bound (B&B)

The efficiency of exact algorithms can be improved by the Branch & Bound (B&B) techniques:
  1. define the decision tree,
  2. define the upper limits of the branches,
  3. cut the branch, if the upper limit is less than the value of the best current collection,
  4. stop, if there are no more branches to consider.
For example,
  1. branch 0: $y_1=0$, branch 1: $y_1=1$,
  2. the upper limit of branch 1:

    \begin{eqnarray}
C^1= c_1+ \max_y \sum_{i=2}^n c_i y_i, \nonumber \\
g_1+\sum_{i=2}^n g_i y_i \le g. \nonumber
\end{eqnarray}


    the upper limit of branch 0:

    \begin{eqnarray}
C^0= \max_y \sum_{i=2}^n c_i y_i, \nonumber \\
\sum_{i=2}^n g_i y_i \le g. \nonumber
\end{eqnarray}


  3. cut branch 0, if $C^0 \le c_1$,
  4. stop, if there are no more branches to consider.
In this algorithm, the number of iterations to obtain the exact solution is $K \le 2^n$ and the time is $T \le C\ 2^n$. Usually, the time $T$ is much less than the exact upper limit $C\ 2^n$. However, this limit may be approached, for example, if all the prices $c_i$ and weights $g_i$ are almost equal. In such a case no branches will be cut. That is a reason why approximate algorithms are preferred if $n$ is large.
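A minimal B&B sketch in Java follows; for brevity it uses the crude upper bound obtained by adding the values of all remaining objects, while tighter bounds (such as the subproblem bounds $C^0$, $C^1$ above) cut more branches. The names are illustrative:

\begin{verbatim}
// Branch & Bound: branch on y_i; cut a branch when its upper bound
// cannot exceed the value of the best current collection.
public class KnapsackBB {
    private double[] c, g;
    private double limit, bestValue;

    public double solve(double[] c, double[] g, double limit) {
        this.c = c; this.g = g; this.limit = limit;
        bestValue = 0.0;
        branch(0, 0.0, 0.0);
        return bestValue;
    }

    private void branch(int i, double value, double weight) {
        if (weight > limit) return;                 // infeasible branch
        if (value > bestValue) bestValue = value;   // record best current value
        if (i == c.length) return;
        double bound = value;                       // crude upper bound:
        for (int j = i; j < c.length; j++) bound += c[j];
        if (bound <= bestValue) return;             // cut the branch
        branch(i + 1, value + c[i], weight + g[i]); // branch 1: y_i = 1
        branch(i + 1, value, weight);               // branch 0: y_i = 0
    }
}
\end{verbatim}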


2 Approximate Algorithms


1 Monte Carlo

The simplest approximate algorithm is Monte-Carlo (MC):
  1. select a collection $I(m^N),\ N=1,...,K$ with probability [*]

    \begin{eqnarray}
r(m)={1 \over 2^n},
\end{eqnarray}


  2. record the best current collection $I^*(m^N)$,
  3. stop, if the number of iterations $N=K$.
Recording just the best current collection, one samples with replacement. Otherwise, one should keep in memory up to $2^n$ previous samples. In the case with replacement, one does not know when the optimum is reached. Therefore, MC stops when the time allocated for the optimization is exhausted. This algorithm converges to the exact solution with probability one as the number of iterations $K \rightarrow \infty$. However, the convergence is very slow, nearly logarithmic. Note that the time $T=C\ 2^n$ of the exhaustive search is not an upper limit for MC with replacement.
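A minimal sketch of the MC algorithm in Java; selecting each $y_i$ by a fair coin yields the uniform probability $r(m)=1/2^n$ over all collections (the names are illustrative):

\begin{verbatim}
import java.util.Random;

// Monte Carlo: sample collections uniformly, with replacement,
// and record the best feasible one.
public class MonteCarloKnapsack {
    public static double best(double[] c, double[] g, double limit, int K) {
        Random rnd = new Random();
        double bestValue = 0.0;
        for (int N = 1; N <= K; N++) {
            double value = 0.0, weight = 0.0;
            for (int i = 0; i < c.length; i++) {
                if (rnd.nextBoolean()) {   // y_i = 1 with probability 1/2
                    value += c[i];
                    weight += g[i];
                }
            }
            if (weight <= limit && value > bestValue) bestValue = value;
        }
        return bestValue;
    }
}
\end{verbatim}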


2 Greedy Heuristics

Define the heuristic $h_i$ as the specific value of the object $i$

\begin{eqnarray}
h_i= {c_i\over g_i}.
\end{eqnarray}


This "greedy" heuristic is the well known and widely used in the knapsack problem. The Greedy Heuristic Algorithms prefer a feasible[*] object with the highest heuristic $h_i$:
  1. select the object $i$ with the best heuristic $h_i$ in the starting list[*],
  2. move the object from the starting list into the current one,
  3. test feasibility of the collection which includes this object,
  4. if the collection is infeasible, delete the object from the starting list,
  5. test whether the starting list is empty,
  6. if the starting list is not empty, go to step one,
  7. if the starting list is empty, stop and record the current list as the best collection $I^*(m^N)$, where $N \le n$ is the number of the last iteration.
The greedy heuristic algorithm is fast, $T < C\ n^2$. The inequality holds because objects are gradually removed from the starting list. The algorithm may get stuck at some non-optimal decision. The greedy heuristic $h_i$ is obviously useless if the specific values are all equal, $h_i=h, \ i=1,...,n$. In such cases the outcome depends on the numbering of objects.
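A minimal sketch of the greedy heuristic algorithm in Java (illustrative names; ties are broken by the numbering of objects, as noted above):

\begin{verbatim}
import java.util.ArrayList;
import java.util.List;

// Greedy heuristic: always take the object with the largest
// specific value h_i = c_i / g_i that still fits.
public class GreedyKnapsack {
    public static double best(double[] c, double[] g, double limit) {
        List<Integer> starting = new ArrayList<>();
        for (int i = 0; i < c.length; i++) starting.add(i);
        double value = 0.0, weight = 0.0;
        while (!starting.isEmpty()) {
            int best = starting.get(0);        // object with the best heuristic
            for (int i : starting)
                if (c[i] / g[i] > c[best] / g[best]) best = i;
            if (weight + g[best] <= limit) {   // feasibility test
                value += c[best];
                weight += g[best];
            }
            starting.remove(Integer.valueOf(best)); // delete from starting list
        }
        return value;
    }
}
\end{verbatim}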


3 Randomized Heuristics


1 Linear Randomization

Heuristic algorithms are shaken out of non-optimal decisions by choosing an object $i$ with some probability $r_i$ that is proportional to the specific value $h_i$,

\begin{eqnarray}
r_i={h_i\over \sum_{j=1}^n h_j }.
\end{eqnarray}


This algorithm is similar to that of Greedy Heuristic:
  1. select an object $i$ from the starting list at random with the probability $r_i$,
  2. move the object from the starting list into the current one,
  3. test feasibility of the collection including this object,
  4. if the collection is infeasible, delete the object from the starting list,
  5. test whether the starting list is empty,
  6. if the starting list is not empty, go to step 1,
  7. if the starting list is empty, stop and record the current list as a current collection $I(m^N),\ N=1,...,K$,
  8. record the best current collection $I^*(m^N)$,
  9. start a new iteration $N+1$ by restoring the starting list to its initial state (the set of all the objects) and return to step 1,
  10. stop after $K$ iterations and record the best current collection $I^*(m^K)$.
This algorithm is better than the greedy one in the sense that it converges with probability one as $K \rightarrow \infty$. The algorithm is expected to be better than Monte Carlo, too, since it includes expert knowledge by relating the decision probabilities to heuristics.
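A minimal sketch of the random selection used in step 1 ("roulette wheel" sampling over the starting list); the names are illustrative:

\begin{verbatim}
import java.util.List;
import java.util.Random;

// Select an object i from the starting list with probability
// r_i = h_i / sum_j h_j (linear randomization, h_i > 0 assumed).
public class LinearRandomization {
    public static int pick(List<Integer> starting, double[] h, Random rnd) {
        double sum = 0.0;
        for (int i : starting) sum += h[i];
        double u = rnd.nextDouble() * sum;  // uniform point on [0, sum)
        for (int i : starting) {
            u -= h[i];
            if (u <= 0.0) return i;         // u fell into the slot of object i
        }
        return starting.get(starting.size() - 1); // guard against round-off
    }
}
\end{verbatim}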

2 Mixed Randomization

We may choose the preferred heuristic algorithm by considering each of them separately. Here the quality estimate of an algorithm is the best result obtained after $K$ iterations. That is the traditional way.

One solves the same problem in a more general setup by considering a "mixture" of different heuristic algorithms, for example, the Monte Carlo ($l=0$), the Linear Randomization ($l=1$), and the Greedy Heuristics ($l=2$).

Then the Mixed Randomization uses the algorithm denoted by the index $l$ with some probability $x(l)$ defined by the probability distribution $x=(x(l),\ l=0,1,2)$:
  1. select an algorithm $l$ at random with probability $x(l)$,
  2. record the best current collection $I^*(m^N),\
N=1,...,K$,
  3. start the new iteration $N+1$ by returning to step 1,
  4. stop after $K$ iterations and record the best current collection $I^*(m^K)$; denote its value by $f_K(x)=c(m^K)$.
Here $f_K(x)$ is the best result obtained by applying the Mixed Randomization $K$ times with the probability distribution $x$[*].

That is a way to extend the family of possible decisions from the discrete set $x(l) \in \{0,1\}, \ l=0,1,2$ to a continuous one. That is important because the best results are often obtained using a mixture of algorithms [Mockus et al., 1997].

Here we mixed three approximate algorithms: the Monte Carlo, the Linear Randomization, and the Greedy Heuristics. The following expressions define a larger family of different approximate algorithms, including these three:

\begin{eqnarray}
r_i^{l}={h_i^l\over \sum_{j=1}^n h_j^l},\ l=0,1,2,...,L,
\end{eqnarray}


and

\begin{eqnarray}
r_i^{\infty}(m)&=&\cases {1, &if $h_i=\max_j h_j$, \cr 0, &otherwise.\cr}
\end{eqnarray}


Here the superscript $l=0$ denotes the Monte-Carlo randomization. The superscripts $l=1,\ l=2,....,l=L$ define a family of "polynomial" randomizations. The superscript $\infty$ denotes the greedy heuristics with no randomization. The Mixed Randomization means using randomization functions $r_i^{l},\ l=0,1,2,...,L,\infty$ with probabilities $x(l)$ and recording the best result $f_K(x)$ obtained during $K$ iterations.
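A minimal sketch of the family (2.6) in Java: the power $l=0$ gives the uniform Monte Carlo probabilities, $l=1$ the linear randomization, and larger $l$ approaches the pure greedy heuristic (2.7). The names are illustrative:

\begin{verbatim}
// Probabilities r_i^l = h_i^l / sum_j h_j^l for a given power l.
public class PolynomialRandomization {
    public static double[] probabilities(double[] h, int l) {
        double[] r = new double[h.length];
        double sum = 0.0;
        for (int j = 0; j < h.length; j++) sum += Math.pow(h[j], l);
        for (int i = 0; i < h.length; i++) r[i] = Math.pow(h[i], l) / sum;
        return r;
    }
}
\end{verbatim}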

3 Optimization of Mixture

The optimization of the mixture $x$ is difficult because $f_K(x)$ is a stochastic function and, as usual, a multimodal one. A natural way to consider such problems is to regard the functions $f_K(x)$ as samples of a stochastic function defined by some prior distribution. The next observation is obtained by minimizing the expected deviation from the exact solution. This technique is called the Bayesian Heuristic Approach (BHA) [Mockus et al., 1997].


4 Permutation

If we build a solution from "scratch," we may apply so-called greedy heuristics [Helman et al., 1993]. Building from scratch is convenient if no initial expert decision is known. Otherwise, building from scratch ignores the expert information included in that initial decision. As usual, permutation algorithms are used to improve initial decisions.

In the knapsack problem, the complete collection $I$ is the set of all $n$ objects. The initial collection is a set of objects $I(m^0)=\{i:y_i(m^0) =1\} \subseteq I$ satisfying the inequality $\sum_{i \in I(m^0)} g_i y_i(m^0) \le g$. The value of this collection is $c(m^0)$. The complement of the initial set $I(m^0)$ is denoted $I^0(m^0)$. We denote permuted collections by $I(m^j),\ j=1,...,J$ and the values of these collections by $c(m^j)$. We define the direct permutation heuristics as

\begin{eqnarray}
h_j=h(m^j)=c(m^j)-c(m^0).
\end{eqnarray}


To avoid confusion, the longer symbols $h(m^j)$ and $hn(m^j)$ are sometimes used instead of the short ones $h_j$ and $hn_j$. We define the normalized permutation heuristics as

\begin{eqnarray}
hn_j=hn(m^j)={h_j-A_j \over A^j-A_j},
\end{eqnarray}


where $A_j=\min_{k =1}^J h_k$ and $A^j=\max_{k=1}^J h_k$. Normalized heuristics (2.9) can easily be converted into probabilities by expressions such as (2.6). Direct heuristics (2.8) are convenient in some well-known algorithms, such as Simulated Annealing. Here is an example of a permutation algorithm using linear randomization (a sketch of the normalization used in step 2 follows the list):
  1. make $J$ feasible changes $m_j^0,\ j=1,...,J$ of the initial collection $I(m^0)$, by replacing randomly selected objects using the following replacement algorithm:
    1. select at random, with equal probabilities, an object $j_1$ from the initial collection $I(m^0)$,
    2. delete this object,
    3. select at random, with equal probabilities, an object $j_0$ from the complement $I^0(m^0)=\{i:y_i(m^0)=0\}$ of the initial collection; skip this step if the complement is empty,
    4. put the object $j_0$ into the initial collection,
    5. test the weight limit and return to step one of the replacement algorithm if the updated collection is too heavy,
  2. define normalized permutation heuristics $ hn_j$ for all the changes $j=1,...,J$,
  3. define probabilities $r_j$ by expression (2.13),
  4. select current collection $I(m^N),\ N=1,...,K$ at random with probability $r_j$,
  5. record the best current collection $I^*(m^N)$,
  6. stop after $K$ iterations and record the best current collection $I^*(m^K)$.
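A minimal sketch of the normalization (2.9) used in step 2; the names are illustrative:

\begin{verbatim}
// Normalized permutation heuristics
// hn_j = (h_j - min_k h_k) / (max_k h_k - min_k h_k).
public class NormalizedHeuristics {
    public static double[] normalize(double[] h) {
        double lo = h[0], hi = h[0];
        for (double v : h) { lo = Math.min(lo, v); hi = Math.max(hi, v); }
        double[] hn = new double[h.length];
        for (int j = 0; j < h.length; j++)
            hn[j] = (hi > lo) ? (h[j] - lo) / (hi - lo) : 1.0; // equal h_k case
        return hn;
    }
}
\end{verbatim}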


1 Simulated Annealing

Simulated Annealing (SA) is a popular global optimization method. The features of SA, considered as a permutation algorithm, are:
- only one permutation is made, $J=1$,
- the direct heuristic is used

\begin{eqnarray}
h_j=c(m^j)-c(m^0),
\end{eqnarray}


- if $h_j \ge 0$, the new collection $I(m^j)$ is selected with probability one,
- if $h_j < 0$, the new collection is selected with probability $r_j$ that declines exponentially,

\begin{eqnarray}
r_j&=&\cases {e^{{h_j \over x /\ln (1+N)}}, &if $h_j < 0$, \cr
1, &otherwise.\cr}
\end{eqnarray}


Here $N$ is the iteration number and $x$ is the "initial temperature." The difference between this formulation and traditional simulated annealing algorithms is that we optimize the parameter $x$ for some fixed number of iterations $N=K$. We disregard the asymptotic behavior; the asymptotic properties are beyond the scope of the Bayesian Heuristic Approach.
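A minimal sketch of the exponential acceptance rule above in Java, with the temperature $x/\ln(1+N)$ declining in the iteration number $N$ (names are illustrative):

\begin{verbatim}
import java.util.Random;

// Accept an improving change always; accept a worsening change
// (h_j < 0) with probability exp(h_j / (x / ln(1+N))).
public class SaAcceptance {
    public static boolean accept(double hj, double x, int N, Random rnd) {
        if (hj >= 0) return true;
        double temperature = x / Math.log(1.0 + N);
        return rnd.nextDouble() < Math.exp(hj / temperature);
    }
}
\end{verbatim}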


2 Genetic Algorithms

1 General Idea

Genetic Algorithms (GA) present a general and visual way of permutation and randomization. GA is a methodology for searching a solution space in a manner similar to the natural selection procedure in biological evolution [Holland, 1975]. Each candidate solution is represented by a string of symbols. The set of solutions at some state $m$ is called the population of the $m$th generation. The population evolves for a prescribed number $M$ of generations. The basic structure processed by GA is the string. Strings are composed of a sequence of characters.

A simple GA consists of
- one reproductive plan, defined as the fitness-proportional reproduction,
- two genetic operators, defined as crossover and mutation.
The probability of selection is defined in proportion to the fitness of the individuals.

During the crossover operation one splits two selected strings at some random position $s$. Then two new strings are created, by exchanging all the characters up to the split point $s$. During the mutation operation one alters some string characters at random. A mutation is of $l$-th order, if one changes $l$ elements during one mutation operation. Both the crossover and the mutation are feasible, if they satisfy all constraints.

It follows from this definition that GA may be considered as a special case of the Permutation and Randomization. Thus, one may include GA into the framework of Bayesian Heuristic Approach. We illustrate that by using the Knapsack problem as an example.

2 Explaining Genetic Algorithm by Knapsack Example

In the Knapsack example the string of the collection $I(m)$ is represented as the Boolean vector

\begin{eqnarray}
y(m)=(y_i(m),\ i=1,...,n).
\end{eqnarray}


The fitness of a feasible[*] collection $m^j$ is $c(m^j)$. The probability to select a string $m^j$ is

\begin{eqnarray}
r_j={hn_j \over \sum_{k=1}^J hn_k},
\end{eqnarray}


where $ hn_j$ is the normalized permutation heuristics

\begin{eqnarray}
hn_j=hn(m^j)={c(m^j)-A_j \over A^j-A_j}.
\end{eqnarray}


Here $A_j=\min_{k =1}^J\ c(m^k)$ and $A^j=\max_{k=1}^J\ c(m^k)$.

During the crossover operation one splits two selected strings at some random position $s,\ 1<s<n$. Then two new strings are created, by exchanging all the characters up to the split point $s$.

During the mutation operation one produces $J$ mutants by inverting randomly selected components $y_j$ of the Boolean vector $y$. A mutation is of $l$-th order if one changes $l,\ 1 \le l \le n$ components during one mutation operation. A mutant is fertile if it satisfies all the constraints.
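A minimal sketch of the two operators for Boolean strings, in Java (illustrative names; the feasibility of the results must still be tested against the weight limit):

\begin{verbatim}
import java.util.Random;

public class GaOperators {
    // Crossover: exchange the characters of a and b up to split point s.
    public static void crossover(boolean[] a, boolean[] b, int s) {
        for (int i = 0; i < s; i++) {
            boolean t = a[i]; a[i] = b[i]; b[i] = t;
        }
    }

    // Mutation of order l: invert l randomly selected components of y.
    public static void mutate(boolean[] y, int l, Random rnd) {
        for (int k = 0; k < l; k++) {
            int i = rnd.nextInt(y.length);
            y[i] = !y[i];
        }
    }
}
\end{verbatim}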

Denote by $I(m^N)$ a current collection at $N$th iteration. Denote by $I^*(m^N)$ the best current collection obtained during $N$ iterations.

Here is an example of a simple GA algorithm written in BHA terms. The algorithm is similar to that of Permutation.

  1. produce a number $J$ of fertile mutants by replacing randomly selected objects using feasible changes $m_j^0,\ j=1,...,J$ of the initial collection $I(m^0)$,
  2. define normalized permutation heuristics $hn_j,\ j=1,...,J$ for all these mutants using expression (2.14),
  3. define probabilities $r_j$ by expression (2.13),
  4. select two mutants $I_j(m^+)$ and $I_j(m^-)$ at random with probabilities $r_j$[*],
  5. select a split point $s$ at random with probability $1/n$,
  6. exchange the components $y_i(m^+)$ and $y_i(m^-)$, $i \le s$, of these two mutants,
  7. update the normalized permutation heuristics $hn_j$ to reflect the crossover results,
  8. update the probabilities $r_j$ using (2.13),
  9. select a current collection at random with the probability $r_j$,
  10. record the best current collection $I^*(m^1)$,
  11. go to step 4 with the probability $x(0)$,
  12. produce $J$ fertile mutants by feasible changes of the current collection,
  13. define normalized permutation heuristics $hn_j,\ j=1,...,J$,
  14. define probabilities $r_j$ by expression (2.13),
  15. select a current collection at random with the probability $r_j$,
  16. record the best current collection $I^*(m^2)$ and return to step 11,
  17. stop after $K$ iterations and record the best current collection $I^*(m^K)$.
It is tacitly assumed that some segments of strings define specific "traits." In such cases, one may "enforce" good traits by uniting "good" segments. One may expose bad traits by uniting "bad" segments, too. Then, reproduction plans tend to favor the good traits and to reject the bad ones.

Thus, the crossover operation "improves" the population if the "traits" assumption is true. If not, then the crossover may be considered merely as a sort of mutation. That may help to jump across the area dividing separate "local" minima. However, a similar jump may be accomplished by high-order mutations, too.

As usual [Androulakis and Venkatasubramanian, 1991], mutations are considered as more "radical" operations compared with crossovers. That is correct if one changes many elements of the "genetic" sequence during one mutation. This happens if the mutation order $l$ is close to the string length $n$.

The results of some real life examples of network optimization [Mockus, 1967] and parameter grouping [Dzemyda and Senkiene, 1990] show that low-order mutations, when merely a few elements of the decision vector $r(k)$ are changed, work better. In such cases, the mutation may be considered as a less radical operation, because fewer components of the vector $y$ are changed compared with the crossover operation.

We may improve the efficiency of the simple GA algorithm by using the crossover rate $x(0)$ (see step eleven of the algorithm) as an optimization variable in the framework of BHA. This was applied in solving the Batch Scheduling problems [Mockus et al., 1997]. If the crossover rate is considered as a time function [Androulakis and Venkatasubramanian, 1991], the parameters of this function can be optimized using BHA.

3 Software Examples of Knapsack Problem

There are two software versions: one in C++ and one in Java. In the C++ version, prices $c_i$ and weights $g_i$ are generated randomly. In the Java case, $c_i,\ g_i$ reflect a real life example [Malisauskas, 1998]. The optimal probability $x(\infty)$ of the greedy heuristic $h_i=c_i/g_i$ is close to one, but not exactly one. Some degree of randomization helps to escape non-optimal decisions. The randomized procedures $x(l),r_i^{(l)}$ are defined by (2.6) and (2.7). Both software versions consider just three components.

The C++ version uses the Monte Carlo, the Linear, and the Quadratic randomizations. The corresponding optimization parameters are
$x=(x(0), x(1), x(2))$. The Monte Carlo and Linear randomizations and the pure Greedy heuristic are applied in the Java version. The corresponding optimization parameters are
$x=(x(0), x(1), x(\infty))$.

We are looking for the probability distribution $x$ that provides the maximal $f_K(x)$ after $K$ iterations. Bayesian methods [Mockus, 1989a, Mockus and Mockus, 1990, Mockus et al., 1997] are used for that.


1 C++

The aim of the GMC version is to estimate the average error of BHA. A set of knapsack problems with random prices $c_i$ and weights $g_i$ is considered. The results of BHA and of the exact deterministic B&B algorithm are in Table 2.1.

Average results were obtained by repeating the optimization procedures $K=100$ times at fixed parameters $c_i,\ g_i$ and probability distribution $x$.


Table 2.1: Comparison of the Bayesian method and the exact one.
$K_B=100$, and $K=1$
$N_O$ $K_E$ $f_B$ $f_E$ $\delta_B\%$ $x_B(0)$ $x_B(1)$ $x_B(2)$
50 313 9.56057 9.62347 0.654 0.0114 0.0280 0.9605
100 526 13.0703 13.1241 0.411 0.0316 0.0412 0.9271
150 771 16.6301 16.6301 0.000 0.0150 0.1945 0.7904
200 875 37.4665 37.4859 0.050 0.0315 0.0530 0.9437
250 568 53.7781 53.9000 0.226 0.0091 0.0511 0.9397
300 1073 28.3144 28.6106 1.034 0.0113 0.0835 0.9050
350 1416 30.4016 31.7527 4.254 0.0064 0.0646 0.9288
400 2876 32.1632 33.3192 3.469 0.0202 0.0452 0.9344
450 1038 105.467 105.578 0.105 0.0101 0.0149 0.9748
500 2132 39.3583 42.1047 6.521 0.0078 0.1556 0.8365

In Table 2.1, $N_O$ is the number of objects, $K$ is the number of repetitions, $K_B$ is the total number of observations by the Bayesian method, $K_E$ is the number of nodes considered by the exact method, $f_B$ is the best result of the Bayesian method, $f_E$ is the exact optimum, and $\delta_B\%$ is the mean percentage error of the Bayesian algorithm. $x_B(l),\ l=0,1,2$, are the optimized probabilities of different randomizations obtained using the Bayesian method. If the deviation of some solution does not exceed $5\%$, we call it a $5\%$ solution. One can expect that the $5\%$ solution satisfies the applications where the level of data uncertainty is not less than $5\%$.

Table 2.1 shows that from 313 to 2876 nodes must be considered to obtain the exact solution, while only 100 observations are needed to obtain the $5\%$ solution by the Bayesian method. The deviation exceeds $1\%$ in only three cases out of ten. The average deviation is $1.67\%$.

Assume that roughly the same computing time is necessary for one node and for one observation. Then the Bayesian $5\%$ solution is about three times "cheaper" than the exact one if the number of objects is fifty. If this number is 400, then the Bayesian $5\%$ solution is almost thirty times cheaper. Other examples, in particular those applying BHA to a family of scheduling problems [Mockus et al., 1997], show higher efficiency of BHA.


2 Java

Here we optimize a "mixture" $x$ of the Monte Carlo randomization, the linear randomization, and the pure greedy heuristic. The aim is to show how BHA works while solving a real life knapsack problem. The example illustrates how to apply the Java software system for global optimization called GMJ1. Therefore, several figures are included. They illustrate the input and output of the GMJ1 graphical interface.

1 Data File

The data represent the weights, the values, the numbers, and the names of inventory items of the "Norveda" shop, which sells "Hitachi" electrical tools. Table 2.2 shows a fragment of the data file 'norveda.txt'.

Table 2.2: A fragment of data file 'norveda.txt.'
Weight Value Number Name
3.8 2830 1 CNF35U
10.5 4170 2 CNF65U
11.5 3850 1 CM12Y
1.8 1500 2 CE16
1.7 1500 2 CN16
17.0 2100 4 CC14
20.9 2890 2 J9312N
0.9 330 8 D6SH
1.7 1170 10 D10YA
1.3 630 5 D10VC


Here the fields, separated by spaces, denote these parameters of the objects: the weight, the value, the number of units, and the name.

2 Implementing Task

The algorithm is implemented as a $Task$ of the GMJ1 system. Figure 2.1 shows a fragment of the file 'Knapsack.java'.
Figure 2.1: A fragment of the knapsack program.
\begin{figure}\begin{codebox}{4.8in}
\begin{verbatim}public class Knapsack ...
private String data_url_old = new String();\end{verbatim}\end{codebox}\end{figure}

3 Running GMJ1

Figure 2.2 (top) shows the input page.

On the $Property$ fields,

$Total\ Weight$
is the weight limit, from 10 to 10,000,
$URL\ of\ data\ file$
is the URL address.
This data file is on the same server as the applet. If the data is not correct, the corresponding field turns red. In the black-and-white Figure 2.2, red is shown as the dark shadow. Therefore, the incorrect URL address is not legible.

On the $Dimension$ fields,

$Min$
is the minimal value of $x(l)$,
$Max$
is the maximal value of $x(l)$,
$Default$
is the default value of $x(l)$.
The values $x(l)$ show the proportions of each method. A mixture of three methods of picking objects is considered: the Monte Carlo, the Linearly Randomized Heuristics, and the pure Greedy Heuristics. Probabilities $x_l$ of methods $l,\ l=1,2,...$ are related to the proportions $x(l)$ in this way: $x_l=x(l) / \sum_j x(j)$.
Figure 2.2: Input page (top) and output page (bottom).
\begin{figure}\centerline{
\epsfig{file=gmjknap.eps,width=12.0cm}
}\vspace{0.5cm}
\centerline{
\epsfig{file=knapresults.eps,width=12.0cm}
}\end{figure}
Figure 2.2 (bottom) shows the output page that opens when the computing is completed. Here $Iteration$ means the iteration number where the best value was reached, $F(x)$ defines the best value, and "Monte Carlo," "Randomized Heuristics," and "Greedy Heuristics" show the optimal part of each of the three methods in the mixture.

 

Figure 2.3 (top) shows how the best value of $f_K(x)$ changes subject to the iteration number.

Figure 2.3 (bottom) shows how $f_K(x)$ changes subject to proportions of the Monte Carlo randomization.

Figure 2.3: The best obtained value subject to iteration numbers (top), the objective function subject to proportions of the Monte Carlo randomization (bottom).
\begin{figure}\centerline{
\epsfig{file=knapconv.eps,height=8.0cm,width=12.0cm}
}\centerline{
\epsfig{file=knapprojectmc.eps,height=8.0cm,width=12.0cm}
}\protect\end{figure}

Figure 2.4 (top) shows how $f_K(x)$ changes subject to proportions of the linear randomization.

Figure 2.4 (bottom) shows how $f_K(x)$ changes subject to proportions of the pure greedy heuristics.

Figure 2.4: The objective function subject to proportions of the linear randomization (top), the objective function subject to proportions of the pure greedy heuristics (bottom).
\begin{figure}\centerline{
\epsfig{file=knapprojectrh.eps,height=8.0cm,width=12.0cm}
}\centerline{
\epsfig{file=knapprojectgh.eps,height=8.0cm,width=12.0cm}
}\protect\end{figure}

It is hard to notice any regularity in the $Projection$ windows in Figures 2.3 (bottom) and 2.4. The reason is that all the variables change together during the optimization. To see good projections, one uses the method $Exkor$, where the variables change one by one (see Figures 9.4 and 9.5).

4 Results

Figure 2.5: Table of results
\begin{figure}\centerline{
\epsfig{file=knaptable1.eps,width=12.0cm}
}\end{figure}
Figure 2.5 shows how the results of optimization depend on the optimization method (the first column), the number of iterations (the second column), and the weight limit (the third column). The fourth column shows the iteration number where the best value was obtained; the fifth column shows the best value[*]. The sixth, the seventh, and the eighth columns define the optimal mixture of the three search methods. These mixtures can be expressed in percentage terms by dividing each of the three numbers by their sum and multiplying by 100.

The results suggest that

Now we consider some related problems. In these problems there are no obvious heuristics. Therefore, when designing BHA algorithms one should consult experts. The experts may define decision rules defining greedy heuristics or suggest initial solutions needed for permutation heuristics.

4 Related Problems

1 Collecting Trains

The objective is to collect the minimal number of trains to remove $m$ cars from a goods-station. The optimization variables are $y_{ij}$. Here $i=1,...,m$ denotes a car and $j=1,...,n$ denotes a train. The Boolean variable $y_{ij}=1$ if the car $i$ is assigned to the train $j$. Otherwise $y_{ij}=0$.

The constraints are

\begin{eqnarray}
\sum_{i=1}^m a_i y_{ij} \le A_{j}
\\
\sum_{i=1}^m b_i y_{ij} \le B_{j}
\end{eqnarray}


Here
$a_i$ is the weight of car $i$.
$A_{j}$ is the maximal weight of train $j$.
$b_i$ is the length of car $i$.
$B_{j}$ is the maximal length of train $j$.

The objective function is the number of trains

\begin{eqnarray}
v(y)=\sum_{ij} y_{ij}
\end{eqnarray}


2 Optimal Cut

1 Formalization

The task is to cut a rod of length $a$ into $m$ pieces of lengths $a_i,\ i=1,...,m$ with minimal waste $v$. The optimization variables are $y_{ik}$.
Here
$i=1,...,m$ denotes a segment.
$k=1,...,K$ defines an ordinal number of the segment.

The Boolean variable $y_{ik}=1$ if the segment $i$ is assigned to the rod with ordinal number $k$.
Otherwise, $y_{ik}=0$.
The constraint is

\begin{eqnarray}
\sum_{ik} a_i y_{ik} \le a.
\end{eqnarray}


Here
$a_i$ is the length of segment $i$ including the saw section.
$a$ is the length of the rod. There are no gaps between segments.

The objective function is the waste. The waste $v(y)$ is defined as the difference between the length of the rod and the sum $a(y)$ of the lengths of all the segments cut,
$v(y)=a-a(y)$. One needs the optimization if the sum of the lengths of all segments to be cut $A > a$.

\begin{eqnarray}
\min_y \ v(y)
\\
y=(y_{ik},\ i=1,...,m,\ k=1,...,K), \nonumber \\
a-a(y) \ge 0. \nonumber
\end{eqnarray}


2 Multi-Rod Case

If there are $n>1$ rods of lengths $a(j),\ j=1,...,n$ to be cut into segments, then one minimizes the total waste

\begin{eqnarray}
\min_y \ v(y)
\\
y=(y_{ijk},\ i=1,...,m,\ j=1,...,n,\ k=1,...,K), \nonumber \\
a(y(j)) \le a(j) . \nonumber
\end{eqnarray}


Here
$v(y)=\sum_j (a(j)-a(y(j)))$,
$y(j)=(y_{ijk},\ i=1,...,m,\ k=1,...,K)$,
$a(y(j))=\sum_{i,k} a_i y_{ijk}, \ j=1,...,n$,
index $i=1,...,m$ denotes a segment,
index $j=1,...,n$ denotes a rod,
index $k=1,...,K$ defines an ordinal number of the segment $i$ in the rod $j$.

The Boolean variable $y_{ijk}=1$, if the segment $i$ is assigned to the rod $j$ with ordinal number $k$.
Otherwise $y_{ijk}=0$.
The optimization is needed if $\sum_i a_i > \sum_j a(j)$.

3 Solutions

The exact solutions are practical only for small problems. For larger problems one can apply all the heuristic methods described in the knapsack section 2. Only the heuristics of the optimal cut problem are different from those used in the knapsack problem and should be defined by experts. The trivial heuristic $h_i$ is the length $a_i$ of a segment $i$. In randomization techniques this means assigning higher probabilities to segments of greater length.

3 Sawmill Problem

The task is to produce $m$ boards with minimal waste. There are $n$ timbers. The optimization variables are $y_{ijk}$.
Here
$i=1,...,m$ denotes a board.
$j=1,...,n$ denotes a timber.
$k=1,...,K$ defines an ordinal number of the board $i$ in the timber $j$.

The Boolean variable $y_{ijk}=1$,
if the board $i$ is assigned to the timber $j$ with ordinal number $k$.
Otherwise $y_{ijk}=0$.
The constraints are

\begin{eqnarray}
a_i y_{ijk} \le d_{ijk}.
\end{eqnarray}


Here
$a_i$ is the width of board $i$ including the saw section.
$d_{ijk}$ is the diameter of timber $j$ at the location $k$ of board $i$.
That means, $d_{ijk}$ depends on the order of the boards in the timber $j$ and on their thickness $b_i$. There are no gaps between boards.

The objective function is the waste
$v(y),\ y=(y_{ijk},\ i=1,...,m,\ j=1,...,n,\ k=1,...,K)$. The waste $v(y)$ is defined as the difference between the cross-section area of all the timbers used and the sawn section of all the boards produced. Expression (2.21) defines constraints for the simple composition of timbers (see Figure 2.6).

Figure 2.6: Simple composition
\begin{figure}\begin{center}
\epsfig{file=rastai1.eps,width=10.0cm}\end{center}\end{figure}
For more complicated compositions, for example, the segment one (see Figure 2.7), different constraints should be defined.
Figure 2.7: Segment composition
\begin{figure}\begin{center}
\epsfig{file=rastai2.eps,width=10.0cm}\end{center}\end{figure}

4 Optimal Diet

As usual, food is sold in supermarkets and meals are served in restaurants only in some units. For example, one cannot buy one third of a bottle of milk or order two thirds of a hamburger. In this case, solving the optimal diet problem one faces difficulties similar to those in the knapsack problem. The reason is that some of the optimization variables are integer, including the binary ones, which we shall regard separately.

1 Formal Description

One minimizes the extended cost[*] while keeping some health constraints

\begin{eqnarray}
\min_y \left(\sum_{i=1}^n (c_i-s_i) y_i +
L \left(\sum_{i=1}^n g_{i 1} y_i - g_1\right)\right),
\\
\sum_{i=1}^n g_{ij} y_i \ge g_j,\ j=1,...,m,
\\
y_i \in \{0,1\},\ i \le n_{bin},\ y_i \in \{0,1,2,...\},\ n_{bin} < i \le n_{int},\ y_i \ge 0.
\end{eqnarray}


Here
index $i$ denotes an item of the diet,
$c_i$ denotes the unit price,
$y_i$ is the number of units,
$s_i$ is the amount of pleasure obtained while consuming a unit of the food $i$, estimated as some benefit in money terms,
$g_{i1}$ is the amount of calories in a unit of item $i$,
$g_{ij},\ j >1$ is the amount of some basic food ingredient $j$, such as proteins, carbohydrates, fats, etc., contained in a unit of item $i$,
$g_j$ defines the amounts needed to keep healthy,
$L >0$ weighs the unpleasant results due to excess calories, estimated as some losses in money terms.

2 Solution Methods

1 Exact Algorithms

The exact solution of the diet problem (2.22)-(2.24) can be obtained by Mixed Integer Linear Programming (MILP). The free MILP software in Java is in the web-sites (see the section "Software Systems," under the title "LPJ"). However, one observes an exponential growth of the computing time with the number of integer variables, mainly the binary ones. Therefore, one prefers approximate methods if the number of those variables is large.

2 Continuous Approximation

This means approximating the integer variables $y_i$ by continuous ones. This way the Mixed Integer Linear Programming (MILP) problem is reduced to a simple Linear Programming (LP) problem. The disadvantage is large round-off errors. The rounding uncertainty of the binary variables makes the LP solution almost irrelevant.

3 Heuristics

First we separate the easy part of the diet problem from the hard one. The original MILP problem (2.22)-(2.24) is replaced by a set of LP problems corresponding to all possible combinations of the integer variables represented by the vector
$d=(y_i \in \{0,1\},\ i \le n_{bin},\ y_i \in \{0,1,2,...,K\},\ n_{bin} < i \le n_{int})$.

\begin{eqnarray}
y(d)= \arg \min_y \left(\sum_{i=1}^n (c_i-s_i) y_i +
L \left(\sum_{i=1}^n g_{i 1} y_i - g_1\right)\right),
\\
\sum_{i=1}^n g_{ij} y_i \ge g_j ,\ j=1,...,m,\ y_i \ge 0,
\end{eqnarray}


Theoretically, one obtains the exact solution by comparing all possible values of the discrete vector $d$:

\begin{eqnarray}
\min_d y(d)
\end{eqnarray}


This is not practical, except for very small numbers $n_{bin}$ and $n_{int}$, because the number of different vectors $d$ is
$N_d=2^{n_{bin}}\, K^{n_{int}-n_{bin}}$.

The practical advantage of the formulation (2.25)-(2.28) is that, using this formulation, one can directly apply all the heuristic methods described in the knapsack example.

The specific structure of the diet problem is exploited by defining the greedy heuristics or by setting a proper initial diet to be improved later by permutations. The second way seems convenient for the diet problem, because one just improves a diet that the person already likes. Using greedy heuristics, one designs new diets from scratch. In all the cases BHA helps to optimize the heuristic parameters (see section 2).

3 Software Example

Only continuous variables are optimized. That is appropriate when considering only the basic food items such as bread, milk, butter, juice, etc. The diet taste is included indirectly: first by selecting the food list, then by choosing the most favored item.

Figure 2.8

Figure 2.8: Input and Output of the Diet Problem.
\begin{figure}\centerline{
\epsfig{file=diet3.eps,width=12.0cm}
}\end{figure}
shows the input data and the results of optimization. The selected list reflects a vegetarian taste. The most favored item is apples.

The upper window shows the selected list and related data. The lower-right window shows the optimization results.

Note that there are only four non-zero items, because there are only four constraints defining the lower limits for calories, proteins, fats, and carbohydrates. This is a specific property of linear programming solutions.

5 Travelling Salesman Problem


The problem is to minimize the total path length of a salesman visiting $I$ cities and returning home (see [Miller and Pekny, 1991]).

Formally, a decision $d$ is a sequence of numbers $n_i \in \{1,...,I\}$ that defines the sequence of visits to different cities,
$d=(n_i, \ i=1,...,I-1)$.

A decision is feasible, $d \in D$, if there are no gaps in the path. The feasible decision is optimal if

\begin{eqnarray}
\min_{d \in D}\ v(d),
\end{eqnarray}


where $v(d)$ is the length of path $d$.

\begin{eqnarray}
v(d)=\sum_{i=1}^{I} c_i
\\
c_i=\vert\vert z(i+1)-z(i)\vert\vert,\ i=1,...,I-1,\ c_I=\vert\vert z(I)-z(1)\vert\vert \nonumber
\end{eqnarray}


Here the vector $z(i)=z_{n_i}$ defines coordinates of the city number $n_i$. In the two-dimensional space $z_{n_i}=(x_{n_i},y_{n_i}),\ x_{n_i} \in R,\ y_{n_i} \in R$. The distances between the cities are defined as Euclidean distances between the corresponding points $z_i$.

In the following numerical experiments, "cities" are considered to be points $z_{n_i}$ in the ten-dimensional unit cube [Mockus et al., 1997]. The multi-dimensional case is convenient for the comparison of different methods, because the choice of alternative paths is wider here.

Besides, in some real life traveling salesman problems, more than two dimensions are considered. For example, if a global traveling salesman is sensitive to the time lag, then the third dimension is time. If a product passes through many separate reactors[*], then one minimizes the differences of some parameters between adjacent reactors, in addition to the geometric coordinates.

1 Greedy Heuristics

Consider the heuristics

\begin{eqnarray}
h_i=-c_i
\end{eqnarray}


Here $-c_i$ is taken with the minus sign because we regard greater heuristics as the better ones. The greedy decision is to visit the nearest new[*] city

\begin{eqnarray}
i(g)=arg \max_{j >i} h_j,\ i=1,...,I-1
\end{eqnarray}


It is well known that the "nearest city" heuristic usually provides nearly optimal solutions. Thus, this heuristic should be the main component of any efficient heuristic algorithm. However, to provide the convergence conditions [Mockus et al., 1997], some randomization should be introduced, too.
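A minimal sketch of the "nearest city" heuristic in Java, assuming a precomputed distance matrix (the names are illustrative):

\begin{verbatim}
// Greedy tour: from the current city always go to the nearest
// city that has not been visited yet (h_j = -c_j).
public class NearestCity {
    public static int[] tour(double[][] dist) {
        int n = dist.length;
        boolean[] visited = new boolean[n];
        int[] path = new int[n];
        visited[0] = true;                  // start from city 0
        for (int step = 1; step < n; step++) {
            int from = path[step - 1], next = -1;
            for (int j = 0; j < n; j++)
                if (!visited[j] && (next < 0 || dist[from][j] < dist[from][next]))
                    next = j;
            path[step] = next;
            visited[next] = true;
        }
        return path;                        // the return leg closes the tour
    }
}
\end{verbatim}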

As in the knapsack example, the compromise between the efficiency of the pure greedy heuristics and the convergence conditions is provided by adding the deterministic component (2.34), representing the greedy heuristics, to the stochastic components represented by the polynomial (2.33).

\begin{eqnarray}
r_i^{l}={h_i^l\over
\sum_{j=1}^n h_j^l},\ l=0,1,2,...,L,
\end{eqnarray}


and

\begin{eqnarray}
r_i^{\infty}(m)&=&\cases {1, &if $h_i=\max_j h_j$, \cr 0,
&otherwise.\cr}
\end{eqnarray}


Here the superscript $l=0$ denotes the Monte-Carlo randomization. The superscripts $l=1,\ l=2,....,l=L$ define a family of "polynomial" randomizations. The superscript $\infty$ denotes the greedy heuristics with no randomization.

In the traveling salesman example considered in [Mockus et al., 1997], a similar objective was approached by the "sharp" polynomial expression of probabilities

\begin{eqnarray}
r_i^{l}=\frac{h_i^{Sl}}{\sum_{j=1}^n h_j^{Sl}},\ l=0,1,2,...,L,
\end{eqnarray}


where $S$ is large.

Using the Bayesian Heuristic Approach [Mockus et al., 1997], the randomization functions $r_i^{l},\ l=0,1,2,...,L,\infty$ are applied with probabilities $x(l)$. The best results $f_K(x)$ obtained during $K$ iterations are recorded.

The minimum of the function $f_K(x)$ is achieved by optimizing the "mixture" $x$ of different randomization techniques $r_i$. The optimization of the mixture $x$ is difficult because $f_K(x)$ is a stochastic function and, as usual, a multimodal one. That is the reason why the Bayesian methods are applied to minimize $f_K(x)$.

In the numerical examples [Mockus et al., 1997], only three terms were considered, $L=2$. The best results were obtained using the power $S=32$. In this case, the probability of going to some distant city is almost zero. However, even a very small deviation from the pure heuristics is significant. For example, the probability of not going to the nearest city (in a problem of 100 cities) is about $1\%$ in one iteration and about $27\%$ in all 100 iterations. One may consider these sparse "deviations" from the pure heuristic decisions as some new "initial" points preventing the heuristics from being trapped. This is a way to provide the convergence.

The number of cities $I$ ranges from 100 to 500. The $I$ points (representing cities) are generated 300 times, by sampling from a uniform distribution in the 10-dimensional unit cube.

Usually the exact optimum is not obtained, because the number of problems ($5 \times 300$) and the size of the problems (from 100 to 500 cities) are too large. Consequently, we merely compare the Bayesian methods with the pure heuristics. Both a simple sequential nearest city algorithm and a more complicated local permutation algorithm are considered.


2 Permutation Heuristics

For purposes of the algorithm involving local permutations, some initial travel path must be used. One tries to improve this initial fixed path by selecting a pair $i$ of adjacent cities $n_i-n_{i+1}$ and afterwards considering another pair $j$ of adjacent cities $n_j-n_{j+1}$. The pairs are chosen so that by reconnecting $n_i$ to $n_j$ and $n_{i+1}$ to $n_{j+1}$ we still obtain a path that visits all cities. We seek a new path that is shorter. The initial path is selected by a greedy heuristic, and then the reconnection steps are repeated $I$ times. The local permutation heuristic is $h_l=-v(d_l)$, where $v(d_l)$ is the total length of the travelling salesman path corresponding to the permutation $l$. A sketch of one such reconnection step follows.
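A minimal sketch of one reconnection (a "2-opt" step) in Java, for positions $i < j$ in the current path (illustrative names):

\begin{verbatim}
// Reconnect n_i -> n_j and n_{i+1} -> n_{j+1} by reversing the
// segment between them; keep the change only if the path shortens.
public class LocalPermutation {
    public static boolean improve(int[] path, double[][] dist, int i, int j) {
        int n = path.length;
        double before = dist[path[i]][path[(i + 1) % n]]
                      + dist[path[j]][path[(j + 1) % n]];
        double after  = dist[path[i]][path[j]]
                      + dist[path[(i + 1) % n]][path[(j + 1) % n]];
        if (after >= before) return false;
        for (int a = i + 1, b = j; a < b; a++, b--) { // reverse segment
            int t = path[a]; path[a] = path[b]; path[b] = t;
        }
        return true;
    }
}
\end{verbatim}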

3 Calculation Results

Table 2.3 illustrates the results of the Bayesian algorithms employing the simple greedy heuristic of selecting the nearest neighbor. The results of the pure greedy heuristics and of the pure local permutations are shown for comparison, too.


Table 2.3: Results of the Bayesian method using greedy heuristics.
$I$ $S$ $f_G$ $f_B$ $f_{GL}$ $f_{G-B}$ $d_{G-B}$
100 64 78.90 77.62 77 1.28 0.39
200 128 144.88 143.30 142 1.58 0.71
300 128 205.90 203.95 202 1.95 0.85
400 128 264.62 262.63 260 1.99 1.05
500 128 321.35 319.10 316 2.25 1.32

In this table the letter $B$ stands for the Bayesian method, the letter $G$ denotes the pure greedy heuristics, and the symbol $GL$ denotes the pure local permutations. The table provides sample means $f$ and sample variances $d$. Thus $f_{G-B}$ denotes the mean of the difference between the results of the greedy heuristic and the Bayesian method, and $d_{G-B}$ denotes the corresponding variance. The symbol $I$ stands for the number of cities, and the letter $S$ is the power of the "sharp" polynomial (see expression (2.35)). The Bayesian method performs 46 observations.

The results show that the Bayesian method is roughly $1\%$ better than the pure greedy heuristic. The improvement declines with the size of the problem, from $1.6\%$ for $I= 100$ to $0.7\%$ for $I = 500$. Comparing the columns $f_B$ and $f_{GL}$ we see the advantage of the local permutation heuristics. Therefore, the Bayesian method is applied here, too. Table 2.4 demonstrates the results of the Bayesian algorithms for the case of local permutation heuristics. The results of the pure greedy heuristics and of the pure local permutations are shown, too.


Table 2.4: Results of the Bayesian method using permutation heuristics.
$I$ $S$ $f_{G}$ $f_{GL}$ $f_{BL}$ $f_{G-BL}$ $d_{G-BL}$ $f_{GL-BL}$
100 32 78.7 77 76 2.7 0.6 1
200 32 144.5 142 140 4.5 0.7 2
300 32 206.1 202 200 6.1 1.2 2
400 32 265.3 260 258 7.3 1.4 2
500 32 321.5 316 313 8.5 1.5 3

In Table 2.4 the symbol $BL$ stands for the Bayesian method, the symbol $GL$ denotes a pure local permutation heuristic. The table provides sample means $f$ and sample variances $d$. Thus, the symbol $f_{GL-BL}$ denotes the average difference between the pure local permutation heuristic and the Bayesian method. The symbol $f_{G-BL}$ denotes the average difference between greedy heuristics and the Bayesian method using local permutations. The symbol $d_{G-BL}$ denotes the corresponding variance. Here the Bayesian method performed 26 observations and the algorithm stops after 50 repetitions.

Sharp polynomial randomization (2.35) was chosen as a result of some additional experimentation. When the uniform distribution was included, the average value of $f_{G-BL}$ was $1.5$ for $I= 100$ (one hundred cities). Using expression (2.35) with $S=1$, the average value of $f_{G-BL}$ was $2.5$ for the same $I= 100$. The best average gain was achieved with $S=32$ (see Table 2.4). The distribution of coefficients $x_i$ from expression (2.35) shows which powers of $h_i$ are most important in calculating $r_i^0(m)$.

The results of Table 2.4 show that the Bayesian method is approximately $1\%$ better than the pure heuristics (see the average gain $f_{GL-BL}$). The improvement is almost independent of the size of the problem. We consider $1\%$ to be a good result, because this improvement is obtained near the global minimum, where even a fraction of a percent is important.

4 Software Example 1

Here four algorithms are implemented in Java 1.1:
the author calls them the $nearest\ neighbor$, the $repeated\ nearest\ neighbor$, the $two\ optimal$, and the $two\ optimal\ (random)$. Figure 2.9 shows the nearest neighbor and the repeated nearest neighbor examples. The repeated nearest neighbor starts from random starting points. The two optimal is a version of the permutation algorithm described in section 5.2. The two optimal (random) uses the results of the nearest neighbor algorithm as the starting path.

Figure 2.9: The nearest neighbor (top) and the repeated nearest neighbor (bottom) examples.
\begin{figure}\centerline{
\epsfig{file=travel1.near.eps,width=12.0cm,height=10.5cm}
}\centerline{
\epsfig{file=travel1.randnear.eps,width=12.0cm,height=10.5cm}
}\end{figure}
Figure 2.10 shows the two optimal and the random two optimal examples.
Figure 2.10: The two optimal (top) and random two optimal (bottom) examples.
\begin{figure}\centerline{
\epsfig{file=travel1.nn.eps,width=12.0cm,height=10.5cm}
}\centerline{
\epsfig{file=travel1.randnn.eps,width=12.0cm,height=10.5cm}
}\end{figure}

5 Software Example 2

In this example two algorithms are regarded:
the $nearest\ neighbor$ and the $bubble$. Figure 2.11 shows the nearest neighbor and the bubble examples. The bubble algorithm is similar to the nearest neighbor; however, the bubble starts from three connected cities.
Figure 2.11: The nearest neighbor (top) and the bubble (bottom) examples.
\begin{figure}\centerline{
\epsfig{file=travel2.near.eps,width=12.0cm,height=10.5cm}
}\centerline{
\epsfig{file=travel2.bubl.eps,width=12.0cm,height=10.50cm}
}\end{figure}

6 Nearest Tasks

The first task is to implement in Java the Bayesian Heuristic Algorithm (BHA) described in section 3.2 and to extend it to permutation heuristics. Using the permutation heuristics, the Bayesian version of Simulated Annealing (see section 2.4) seems to be an interesting alternative to the BHA. To perform the statistical comparison of these algorithms, one prefers a Java 1.2 implementation, because it works about ten times faster on similar problems.
