Home Articles Reviews Interviews JDK Glossary Features Discussion Search
Home » Articles » Neural Networks » Advanced

# Optimizing Functions on the Real Numbers

By Forrest Briggs

## Abstract

This article provides a guide to optimizing general functions of the form f : |Rn → |R. Optimizing such functions is important to AI because they arise in many situations, from training neural nets to industrial applications like antennae design. Techniques for solving this problem include analytical solution, gradient descent, annealing, evolution, particle swarm optimization and hybrid approaches. The basics of these techniques are presented, along with novel enhancements and extensions.

## Notation and Terminology

Functions of the form f : |Rn → |R take n real valued inputs and return a single real valued output. This may sometimes be referred to as a fitness function. Our goal is to find x such that f(x) is minimized or maximized. In this document, bold variables refer to vectors with dimension n. For example x = (x1, x2, ... , xn) and 0 is the n dimensional 0 vector. Regular type face variables are scalars, such as y = 4. ||x|| means the magnitude of the vector x and is defined as the square root of the sum of the squares of each of its components. The term individual will be used to refer to an n dimensional vector that is a member of a population. The fitness of an individual x is the evaluation of f at x.

## Analytical solution

The partial derivative of a function f : |Rn → |R with respect to the ith component of x, is δf(x) / δxi, which is defined as
 δ f(x) limit f(x1, x2, ... , xi + h, ... , xn) - f(x1, x2, ... , xn) = h → 0 δ xi h

The gradient of a f(x), denoted f(x) is the vector of the partial derivative of f with respect to x1 ... xn.

f(x) = ( δf(x) / δx1, δf(x) / δx2, ... , δf(x) / δxn )

If we have a formula for f, then we can analytically compute its gradient and find its optima by solving a system of equations f(x) = 0. For each solution, the second derivative test determines if the solution is a minima or a maxima.

## General Optimization Problems

We do not have a formula for f in many real problems, so more general techniques are necessary. There are many things that can make an optimization of this form difficult. We may not have a formula for f because it is computed by an algorithm or comes from real world data. In this case, it is not possible to analytically take the derivative of f and solve for its zeros. The function may be discontinuous, non-differentiable, have a fractal structure, change over time as we are trying to optimize it and/or only give us an expected value instead of its actual value each time we evaluate it. Functions usually have both local and global optima. In the search for global optima, a great problem is getting stuck in local optima. If you have control over the form that f takes, such as if it is a fitness function that quantifies the performance of a neural net, consider ways to make f as nice for optimization algorithms as possible by avoiding discontinuity, non-differentiability, local optima and large constant regions. Binary right or wrong classifications of behavior make for discontinuous fitness functions. Despite all of these potential complications, the algorithms discussed below are still applicable to almost any real valued optimization problem.

Most general optimization algorithms start by evaluating f at a random or informed guess or set of guesses. They then repeatedly make more guesses and evaluations of f, using information gathered from previous guesses to make better guesses in the future. It is desirable for an optimization algorithm to make as few evaluations of f as possible.

If we imagine a function from |R2 to |R as a 3D height-map for a mountain, the way gradient descent would find the lowest point on the mountain is by dropping a lot of balls and letting them roll downhill. Gradient descent starts with a guess or set of guesses. For each guess, we compute the gradient of the function to be optimized at that point. The gradient tells us the direction that the function changes most. We then make a new guess that moves a small step from the original guess in the direction that produces the most increase or decrease in f, depending on if we are maximizing or minimizing. Suppose our initial guess is x0. If we are minimizing, each subsequent guess is generated according to the rule:

xt+1 = xt - atf(xt)

at is the learning rate at iteration t, which controls how much the guess changes in each iteration. Small changes are good for fine tuning a solution once a guess that that is close to an optima is found. Large changes are good for skipping over local optima and getting to global optima. at can remain constant or change during the course of optimization according to an annealing schedule. The annealing schedule makes at decrease as time goes on, so at the beginning of the optimization, the optimizer is less likely to get stuck in local optima and more likely to find global optima. At the end of the optimization, once a guess that is near the optimal solution is found, small changes find the exact optima. Some annealing schedules might make at decay exponentially, and/or oscillate, for example:

at = e-kt or at = e-kt cos(wt)

Annealing can be applied to any optimization algorithm that has free parameters that control how much guesses change at each iteration. Even with annealing, it is unlikely that there is a trajectory of xt from a single initial guess x0 to a global optima. It is usually necessary to start from multiple initial guesses and use the best result from all of them. If a particular guess is not making any progress, adding a random noise to it may help bring it out of local optima.

Momentum can increase the performance of gradient descent by smoothing the trajectory of xt. Gradient descent with momentum is describe by the following rule:

xt+1 = xt - atf(xt) + m(xt - xt-1)

The larger the value of m, the more each change from xt to xt+1 with be like the previous change from xt-1 to xt. Values of m > 1 will probably be unstable.

There are many other enhancements and variations that can be made to gradient descent, such as conjugate gradient descent.

Even if a formula for f is not available, as long as we can evaluate the function, we can numerically compute its gradient at a point by definition. However, it must be differentiable for this calculation to give a meaningful result. This may be slow, since computing f(x) for some particular x requires 2 · n evaluations of f. The backpropagation neural net training algorithm minimizes a function from n real valued weights of a neural network to a single real valued error using gradient descent. The key to the back propagation algorithm is that it uses the chain rule to efficiently compute the gradient of this function.

## Evolution

Evolutionary optimization algorithms work on the principle of natural selection. In nature, an individual that is fit is more likely to reproduce than an individual that is not fit. Parents pass their genes with slight modification to their children. The children are similar to the parents, but some are fitter and some are less fit. The fitter ones reproduce more and over the course of multiple generations, the population becomes more fit.

When optimizing f : |Rn → |R, each individual is represented as an n dimensional vector xi. We initially create a population of k individuals {x1 ... xk } and assign each component of each individual to a random value. Choosing the initial random values from a gaussian distribution sometimes produces better results than choosing them from a uniform distribution. This is the first generation.

The next step is to pair each individual xi with the evaluation of f at that point, f(xi). We then chose the p individuals with the highest or lowest evaluations of f, depending on if we are maximizing or minimizing. These are the parents of this generation. We want to generate a new set of children and keep the parents for the next generation (since all of the children could be worse than the parents).

If the population is stored as an array of individuals, then to find the p best individuals, we sort the entire array and chose the first p. If the population is of a fixed size k, then we should create k - p children to maintain the same population size in the next generation. Since the array has been sorted, we can leave the parents where they are and create the children after them in the population array. The best possible efficiency 1 of a sorting the entire population is O(k · log k). Merge-sort has a guaranteed runtime of Θ(k · log k) and quick-sort has an expected runtime of O(k · log k), although in very unlikely cases it can have a runtime of O(k2). Still, people usually use quick-sort instead of merge-sort because it is faster and has less memory requirements. This is because quick-sort works with data in place, while merge-sort often copies it in blocks.

Storing and sorting the population with a priority queue instead of an array can be more efficient, give simpler code and has other advantages. A priority queue is a data-structure that allows its users to enqueueue items with an associated priority and dequeue the item with the lowest or highest priority. An efficient implementation of a priority queue can that already contains n items can enqueueue a new item in O(log n) time and dequeueue the item with the optimal priority in O(log n). A priority queue can therefore be used to sort (this is the basis of heapsort.) A simple was to do this is to enqueueue all of the items, and then dequeueue all of the items. They come out sorted by priority. This technique has the same asymptotic runtime as any other optimal sorting algorithm. With a priority queue, we begin evolution by enqueueuing k randomly generated individuals that are prioritized by fitness. We then dequeueue the p best individuals and store them in an array of parents for this generation, then re-enqueueue the parents into the priority queue (so they will still be there in the next generation). We can now create as many children as we want and evaluate their fitness, then enqueueue them into the priority queue. Unlike with the array, we never have to throw away old inviduals. They remain in the queue in case we ever need them again, and later we will.

Whether the population is stored and sorted with an array or a priority queue, the basic evolutionary cycle is the same. In each generation, we chose p of the best indivuduals as parents, then create more children based on the parents. We repeat this process until we find an individual with an accepable fitness. This same procedure is applicable when optimizing on other domains besides |Rn.

Children are created from parents through a combination of mating and mutations. Below are some examples of mating and mutation operators. These are just some possibilities though. The more diverse the set of operators an evolutionary algorithm has, the less likely it is to be hopelessly stuck in local minima. Be creative and come up with operators of your own.

Mating consists of combining the parameters of two parents to form a single new individual. Let parent a=(a1, ... , an), parent b=(b1, ... , bn) and child c=(c1, ... , cn). A simple way of creating a child that is a mix of two parents it to randomly use the value from one or the other parent for each component of the child.

```set parent bias to a random number between 0 and 1
for i = 1 to n
if a random number between 0 and 1 is < parent bias then
ci = ai
else
ci = bi
```
First we chose a random parent bias between 0 and 1. A parent bias of .5 specifies that the components of the child will be selected from both parents with equal likelyhood. A bias closer to 0 or 1 species that the components of the child will be selected more from on parent than the other. For each component of the child, we chose a random number between 0 and 1. If that number is less than the parent bias, then we assign that component of the child to one parent. Otherwise we assign it to that component of the other parent. Another way of mating two parents is to make the components of the child a random linear interpolation between the components of the parents.
```for i = 1 to n
set p to a random number between 0 and 1
ci = p · ai + (1 - p) · bi
```
The value of p could be chosen randomly just once for the entire child instead of for each of its component.

Mutation operates on a single individual and usually produces a similar but different individual. Mutation can be applied to children that are created by mating two parents or to direct copies of a single parent. In fact, mating is not necessary for evolution, but self-replication and mutation is. Suppose we are mutating an individual x = {b1, ... , bn}.

Noise is a kind of mutatation that consists of adding random values to a subset of the components of an individual. One way of chosing which components to mutate is to chose a noise rate between 0 and 1. Then for each component, if a random number between 0 and 1 is less than the noise rate, mutate that component. The mutation rate should be randomly chosen between 0 and 1. We could also arbitrarily specify the mutation rate, but thus would introduce a free parameter into the evolutionary algorithm that must either be chosen by hand or by other means. Where possible, it is best to avoid free parameters. The implementation presented below uses as few as possible, but some are unavoidable. Once we chose a subset of the parameters of an individual to add noise to, we can chose what kind of noise and how much to add. We may add a random variable sampled from a uniform distribution in a particular range or from a gaussian distribution. For example:

```set noise likelyhood to a random number between 0 and 1
for i = 1 to n
if a random number between 0 and 1 is < noise likelyhood then
```
Uniform random variable:
```		xi = xi + a random number between -r and r
```
or gausian random variable:
```		xi = xi + a random gaussian · g
```
The free parameters r and p control how much noise is added to each component. It may be useful to normalize r and p so they are proportional to the magnitude of x. The normalized parameters r' = ||x|| · r and p' = ||p|| can be substituted into the above agorithm. Choices of r' and p' that are less than one will be more appropriate for different f with different scales than an arbitrary un-normalized choice for r and p. Gaussian noise is usually better than uniform noise. To understand why, consider a single initial point in |R2. Mutating a large number of copies of the initial point with uniform noise will result in a square of points. The noise is making more individuals in the directions of the corners of the square. Mutating with gaussian noise makes individuals that are distributed more evenly in all directions around the parent.

Scaling is another mutation that can be applied to a subset of the components of an individual. This subset is chosen in the same way as for noise. A scaling factor is randomly chosen between -2 and 2 either for all components or for each component. This means that the scaling operator can decrease any component to 0, increase it by up to a factor of 2 and/or switch its parity. Then each component in the chosen subset is mutliplied by either the one scaling factor or its individual scaling factor. For example:

```set scale likelyhood to a random number between 0 and 1
set scale factor to a random number between -2 and 2
for i = 1 to n
if a random number between 0 and 1 is < scale likelyhood then
xi = xi · scale factor
```

Another type of mutation is component swapping. First we randomly chose an integer number of swaps to make between 0 and q inclusive (where q is a free parameter.) Then we randomly chose two components of x such as xi and xj, where i and j are random numbers between 1 and n inclusive.

There are many more complex operators for real valued evolutionary algorithms, including unimodal normal distribution crossover (UNDX), simplex crossover (SPX), blend crossover (BLX), Parent-Centric Recombination (PCX) [5].

We now have an assortment of operators at our disposal that create children from parents. The final piece to creating a new generation is chosing which operators to apply. Suppose we wish to create c children using p parents. For each child, we must decide if it will be created by mating or be a copy of a single parent. Then we must decide which, if any of noise, scaling and swaping to mutate the child with. The following pseudo code illustrates the process of creating c children from p parents, which constitutes a generation of evolution:

```get p best parents parents in population
set mating rate to a random number between 0 and 1
set noise rate to a random number between 0 and 1
set scale rate to a random number between 0 and 1
set swap rate to a random number between 0 and 1
for i = 1 to c
if a random number between 0 and 1 < mating rate then
make a new child by mating two randomly chosen parents
else
make a new child by copying a randomly chosen parent

if a random number between 0 and 1 < noise rate
apply noise to the child

if a random number between 0 and 1 < scale rate
apply scale mutation to the child

if a random number between 0 and 1 < swap rate
apply swap mutation to the child

evaluate child fitness

```
Notice that each generation has different rates of sexual vs. asexual reproduction, noise, scaling and swapping.

The concept of annealing can be applied to evolutionary algorithms. Just as gradient descent has the parameter at that determines how much xt changes with each timestep, evolution has the parameters r and g that determine how much noise changes children from parents. These parameters can also be made a function of time. As with gradient descent, if a population is showing little improvment because it is stuck in a local optima or flat region of the fitness function, and/or its population lacks diversity, adding more random noise to the parents or including parents with lower fitness may be helpful.

Another way of dynamically changing r, g and any other free parameters in the evolutionary algorithm is meta-evolution. We specify r and g per individual and evolve their values just like x. In fact they can simply be appended as the last elements of x and evolved as a part of it. This does not mean that r and g are in any way evaluated by f, just that whenever a new child is created, r and g are mated and mutated as though they were components of x. In addition to optimizing f, the evolutionary algorithm is optimizing its own parameters to evolve faster.

When there are large regions in f that are constant, a population can evolve to be all very similar individuals. This is bad, since a diverse population evolves faster and is less likely to get stuck in local optima. One way of preventing this is to use islands. Each island is a different population that primarily mates and mutates only within itself. Rarely, inviduals from one island population are taken to a different population. If a population has already become all the same or very similar and is in a constant region of f, it can be restored to diversity by breaking ties in fitness with a measure of uniqueness. The similarity of any two individuals a and b is inversely proportional to ||a - b||2, the magnitude squared of the distance between the two individuals. The uniqueness of an individual a with respect to a set of parents {p1, ... , pn} is the sum of ||a - pi|| over all pi. In a population with all the same fitness, evolving for individuals with high uniqueness makes the population more diverse and helps push the population out of local optima.

## Particle Swarm Optimization

Particle Swarm Optimization (PSO) is based on the metaphor that a group of agents with personal memory and social interactions are swarming around in the domain of f, looking for its optima. Each particle, which represents an agent, has a position and velocity in |Rn, a mass, and possibly information about its history [2][3][4]. The position and velocity of each particle are initially random. Their mass may be random or 1. Let xi(t) and vi(t) be the position and velocity of particle i at time t and mi be the mass of particle i. As time progresses in steps of Δt, the particle moves according to:

xi(t + Δt) = xi(t) + Δt · vi(t)

Δt is a free parameter that controls how fast through the domain of f particles move. At every timestep, we evaluate f(xi(t) ). Thus, the guesses at optima that PSO makes follow the paths of its particles through the domain of f. The particle is subject to acceleration due to many different forces. In most documents describing PSO, the changes in velocity of a particle are not ususally represented as the result of forces. However, this description facilitates the formulation of many particle behaviors and is computationally equivelent to the standard vesion of PSO. It changes during each timestep according to:

vi(t + Δt) = μ · vi(t) + Δt · Fi(t) / mi

Here μ is a free paramter that controls friction and Fi(t) is the force acting on particle i at time t. Setting μ to 1 gives no friction, while values near but smaller than 1 can help make the system more stable. Values greater than 1 will be unstable. It may be necessary to impose an arbitrary maximum velocity. The combination of these rules give particles momentum and with the right force vector, complex swarming behavior. Which forces are used consititue the primary differences between PSO implementations. The most common forces cause a particle to seek the best evaluation of f that it or any other particle has found and to avoid other particles. Multiple forces are simply summed.

A force that causes particle i to seek the best evaluation of f found by any particle can be computed in the following way. For the entire swarm, store the position of the best evaluation of f found and the evaluation of f at that point. Every time a particle moves, check if the evalatution of f at its position is better than the best, and if so, set the best to the position of the particle that moved. Let the position of the best evaluation of f be gbest = (gbest1, ... , gbestn). Let Fi(t) = ( fi(t)1, ... , fi(t)n ). In other words, fi(t)j is the jth component of the force on particle i at time t.

fi(t)j = random number between 0 and 2 · ( gbestj - xi(t)j )

A force that causes each particle to seek the best evaluation of f that it individually has found can be constructed in a similar way, by making each particle keep track of the best evaluation of f that it has seen.

Making particles repel eachother helps keep the solutions that they try diverse. Let a swarm consist of n particles with positions x1 ... xn. The force acting on particle i is the sum of the forces repelling it from each other particle.

Fi(t) = d · Σ (xi - xj) / || xi - xj ||2 for all j ≠ i

Where d is a free parameter that controls the strength of the force.

Notice that the attraction to best evaluations of f is proportional to the distance from the particle to the evaluation, while the magntitude of the repulsive force is inversely proportional to the distance between the two particles. This means that repulsion acts mostly on particles that are close.

A force with completely random components may also be used.

The overall PSO algorithm can be summarized as follows:

```for i = 1 to k
randomize the position and velocity of particle i

while stop criteria not reached
for i = 1 to k
compute the forces on particle i
update the position of particle i
evaluate f at the position of particle i
update the velocity of particle i
```
Notice that k, the number of particles in the swarm, is a free parameter. Tribes is an algorithm based on PSO that eliminates this parameter.

## Hybrid Evolution and PSO

It is possible to use evolution and PSO in a single optimization algorithm that is more robust than either alone [1]. The algorithm simply alternates between doing evolution for a short time, and then particle swarm optimization for a short time. Each scheme makes use of the results from the other. Using a priority queue to store the population makes this particularly simple.

Suppose we start by choosing a certain number of random guesses and insert them into the priority queue, prioritized by their fitness (evaluation of f). This gives evolution or PSO something to start with. Evolution proceeds for a fixed number of generations. In each generation, it choses the best p individuals in the population as parents by dequeueing and then re-enqueueuing them. It creates c children from the parents through mutation and mating and enqueueues them prioritized by fitness. Then PSO takes over for a while. We create a list of the k best individuals by by dequeueing and then re-enqueueuing them from the population. We create a swarm of k particles where the initial position of each particle is one of the k best individuals in the population. The initial velocity of each particle is random. We may chose to reset gbest for the swarm each cylcle of PSO, or we may make gbest the best individual in the entire population. We run the particle swarm for a fixed number of timesteps. Each time a particle changes position and we evaluate f at its position, we enqueueue the position into the population. The results of PSO are then accesible to evolution. Another way of using the results of evolution in PSO is to give the particles random initial positions and have a force that attracts to the m best individuals in the population.

Pseudo code for alternating between evolution and PSO:

```let the population be a priority queue, prioritized by fitness

create q new random individuals and evaluate their fitness'
enqueueue them into the population

while stop criteria not reached
for i = 1 to number of generations for evolution cycle
create a list of p parents by dequeueing and
then re-enqueueueing the p best individuals

create c new children by mating and mutating the parents
and evaluate their fitness

enqueueue children into the population

create a list of the k best individuals by by dequeueing and
then re-enqueueueing create a swarm of k particles, where each
particle's position is one of the k best individuals and with
random velocity.

for i = 1 to number of timesteps for PSO cycle
for each particle
compute forces on particle,
then update its position and velocity

evaluate f at the position of particle and
enqueueue position of particle into population
```

In many situations, when we evaluate f(x), what we get is not the true value of f, but a random value. We can approach the true value f(x) by averaging multiple evaluations. For example, suppose we are training a neural net to output a particular function g for which we have a closed formula. The error of the neural net f(x) as a function of its parameters x at approximating g is the squared difference between the output of g and the neural net's output, averaged over some set of values in g's domain, but what set? People usually chose a fixed number of points and distribute them evenly throughout an interval. This gives a fitness for the net, but to a limited accuracy. A more powerful approach is to adaptively update the fitness of the neural net, such that nets that with high fitness have their fitness more thuroughly evaluated.

This concept can be incorporated into the general evolution and hybrid algorithms. Up until now, we have paired each individual x with the evaluation of f at x. Now we will pair x with the average of all evaluations of f at x. To accomplish this, with each individual x, we store the number of times f has been evaluated at x as well as the sum of all of these evaluations. The average is the sum divided by the number of times f has been evaluated. The first time an individual x is added to the population, we evaluate f(x) q times. Then every subsequent time that x is enqueueued into the population (as in when it is used as a parent in evolution or an initial condition for a particle in PSO,) we evaluate f(x) w more times and update the average fitness of x. Here q and w are free parameters that determine how many times f is evaluated when x is created and used as a parent.

Using this technique, individuals with high average fitness over a small number of evaluations are tested on more evaluations to ensure that they truly are fit. As the fitness of a population increases, making further optimizations to it becomes increasingly difficult. The difference in fitness between individuals after a long time of optimization is usually small compared to at the start of optimization. It is necessary to be able to evaluate the fitness of individuals with enough precision to differentiate them. Some problems greatly benefit from adaptive precision in the fitness function, such as training a neural net to approximate a fractal function.

## Conclusion

Each optimization problem presents different challenges. It is often necessary to analyze the performance of an optimization algorithm as it is running to determine how it can be enhanced. Do not hesitate to make things up on the spot and test them. There are still many ways to enhance optimization algorithms that have not yet been discovered.

### Footnotes

1 - There are sorting algorithms that run in less than O(n · log n) time, but they are not applicable to real numbers.

### Source Code

Meta Learner - Java source code for training a spiking neural net with hybrid evolutionary / PSO algorithm on an uncertain fitness function.

References [1] Particle Swarm, Genetic Algorithm, and their Hybrids: Optimization of a Profiled Corrugated Horn Antenna by Jacob Robinson, Seelig Sinton, and Yahya Rahmat-Samii.

[2] Neural Network Learning using Particle Swarm Optimizers by Matt Settles and Bart Rylander.

[3] Particle Swarm Optimization by James Kennedy and Russell Eberhart.

[4] PSO Tutorial by Xiaohui Hu.

[5] Real-Coded Evolutionary Algorithms with Parent-Centric Recombination by Kalyanmoy Deb, Dhiraj Joshi and Ashish Anand.

Submitted: 07/07/2004