The Bregman Algorithm (1/4)

Optimization With Bregman Divergences

In the 1960s Lev Meerovich Bregman developed an optimization algorithm [1] which became rather popular at the beginning of the 2000s. It's not my intention to present the proofs for all the algorithmic finesse, but rather the general ideas behind why it is so appealing. Some ramifications and new algorithms that were derived from the original Bregman algorithm (e.g. [2, 3, 4]) will be presented in subsequent posts.

Finding a common point in a set of convex sets

Let’s assume we are given a family of convex sets $A_i$, with $i=1, \ldots, n$. Let’s assume further that the intersection $R:=\bigcap_{i}A_i$ of all these sets is not empty. We want to find any point in this intersection $R$. In the figure below we would seek any point in the shaded area.

Figure: example of a family of convex sets and their intersection (shaded area)

A more concrete application of this problem formulation is solving linear systems of equations. Each equation in the system describes a line or a plane (both of which are convex sets), and solving the linear system comes down to finding a point which lies on every one of these lines/planes, that is, a point in their intersection.
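For instance, the $2\times 2$ system $x + y = 2$, $x - y = 0$ describes the two lines $A_1 = \lbrace (x,y) \mid x+y=2\rbrace$ and $A_2 = \lbrace (x,y) \mid x-y=0\rbrace$, and its unique solution $(1,1)$ is exactly the single point in the intersection $A_1\cap A_2$.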

An important tool in finding such a point in the intersection will be the so-called Bregman divergence.

Bregman Divergence

The definition presented here has been extracted from [1]. We consider a function $D:\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}$ with the following six properties. Note that through these properties the function $D$ is explicitly linked to a family of convex sets as defined above.

  1. $D(x, y) \geqslant 0$ for all $x$, $y$ and $D(x,y)=0$ if and only if $x=y$.
  2. For any $y\in\mathbb{R}^n$ and any $A_i$ there exists an $x := P_i(y)\in A_i$ such that $D(x,y)=\min\lbrace D(z,y)\ \vert\ z\in A_i\rbrace$.
  3. For each $A_i$ and each $y\in\mathbb{R}^n$ the function $D(\cdot,y)-D(\cdot,P_i(y))$ is convex over $A_i$.
  4. The derivative $D'(\cdot, y)$ of the function $D(\cdot, y)$ with respect to its first argument exists at $x=y$ and satisfies $D'(y,y)=0$.
  5. For each $z\in\bigcap_{i}A_i$ and for every real number $L$ the set $\lbrace x\in\mathbb{R}^n \vert D(z,x) \leqslant L \rbrace$ is compact.
  6. If $D(x^{(n)},y^{(n)})\to 0$ and $y^{(n)}\to y^*$ and the set of elements of the sequence $(x^{(n)})$ is compact, then we have $x^{(n)}\to y^*$.

These six conditions are rather technical and necessary to show that the algorithm presented here is actually well-behaved. We won't look into them in detail here. Let us just remark that one can show that if $f:\mathbb{R}^n\to\mathbb{R}$ is a strictly convex and differentiable function, then the following construction for $D(\cdot,\cdot)$ fulfils the aforementioned conditions.

$$\begin{equation} D(x,y) := f(x) - f(y) - \langle\nabla f(y),\, x-y\rangle \end{equation}$$

This construction reflects more or less what we nowadays call a Bregman Divergence and it shows that there are actually quite a lot of functions that fulfil the aforementioned conditions.

The function $D(\cdot,\cdot)$ as defined above represents the difference between a function $f$ and its first order Taylor approximation. It's also interesting to note that the Bregman divergence behaves a bit like a distance (mostly because of the requirements from point 1), even though it is in general neither symmetric nor does it satisfy a triangle inequality. In fact, the Bregman divergence built with $f(x):=\lVert x\rVert^2$ gives us $D(x,y) = \lVert x-y\rVert^2$, which is the squared Euclidean distance between $x$ and $y$.
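As a quick sanity check, here is a minimal sketch in Julia of this construction (the helper bregman_divergence and the hand-coded gradient are my own illustration, not part of the original paper), evaluated for $f(x)=\lVert x\rVert^2$ and compared against the squared Euclidean distance:

using LinearAlgebra

# Bregman divergence D(x, y) = f(x) - f(y) - <∇f(y), x - y>,
# built from a convex, differentiable function f and its gradient gradf.
bregman_divergence(f, gradf, x, y) = f(x) - f(y) - dot(gradf(y), x - y)

f(x) = dot(x, x)      # f(x) = ‖x‖²
gradf(x) = 2 .* x     # ∇f(x) = 2x

x = [1.0, 2.0]
y = [3.0, -1.0]

println(bregman_divergence(f, gradf, x, y))  # 13.0
println(norm(x - y)^2)                       # 13.0 (up to rounding), the squared Euclidean distance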

The Algorithm

The idea is actually quite simple. We start with an arbitrary point $x^{(0)}\in\mathbb{R}^n$ and compute iteratively

$$\begin{equation} x^{(k+1)} = \arg\min_{z\in A_{k+1}} D(z, x^{(k)}) \end{equation}$$

If, during your iterations, you have run over all sets $A_i$, you simply start over with the first one. One can show that the order does not matter that much. It is important, however, that each set $A_i$ is used in the iterations.

The algorithm does the following. Starting from an arbitrary point $x^{(0)}$, we project that point onto one of our sets, say $A_1$, and obtain a point $x^{(1)}$. Then we project that point $x^{(1)}$ onto $A_2$ and obtain a point $x^{(2)}$, and so on.
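In code, the iteration can be sketched roughly as follows. This is a minimal sketch of my own: the function cyclic_projections and the list projections of Bregman projection operators $P_i$ are illustrative names, and a real implementation would add a proper stopping criterion instead of a fixed number of sweeps.

# Bare-bones sketch of the iteration: cycle through the Bregman projections
# P_1, ..., P_n (passed in as a list of functions) for a fixed number of sweeps.
function cyclic_projections(projections, x0, sweeps)
    x = x0
    for _ in 1:sweeps
        for P in projections    # one full pass over all sets A_i
            x = P(x)            # x ← P_i(x), i.e. project the current iterate onto A_i
        end
    end
    return x
end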

A similar technique, using orthogonal projections, had already been known before Bregman's work. In addition, if we base our Bregman divergence on the squared Euclidean norm, then Bregman's algorithm coincides with the iterative approach using orthogonal projections. Bregman's algorithm generalizes this approach to projections which are not necessarily orthogonal, as the figure below suggests. There the sequence $(y^{(j)})$ uses orthogonal projections and the sequence $(z^{(j)})$ uses non-orthogonal projections.

Solving Linear Systems

As mentioned, the algorithm may be used to solve linear systems of equations. We do so by using the squared Euclidean norm as the underlying function for the Bregman divergence. In that case every step is an orthogonal projection onto one of the lines. Implementing the algorithm in Julia is straightforward.
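For reference, the orthogonal projection onto a line (or, more generally, a hyperplane) $\lbrace x \mid \langle a, x\rangle = b\rbrace$ with normal vector $a$ and offset $b$ has the closed form

$$\begin{equation} P(p) = p - \frac{\langle a, p\rangle - b}{\langle a, a\rangle}\, a \end{equation}$$

which is exactly the formula used in the helper function ProjectOntoLine below.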

using LinearAlgebra

# A line (or hyperplane) described by <normal_vector, x> = offset.
struct Line
    normal_vector
    offset
end

# Compute the orthogonal projection onto a line
function ProjectOntoLine(line, point)
    return point - ((dot(line.normal_vector, point) - line.offset) /
        dot(line.normal_vector, line.normal_vector)) * line.normal_vector
end

function BregmanAlgorithm(matrix, rhs)
    solution = zeros(length(rhs))
    i = 0
    # stop when the residual drops below 1.0e-10
    while norm(matrix * solution - rhs) > 1.0e-10
        # Extract the next line from the system of equations, cycling through the rows.
        row = (i % length(rhs)) + 1
        line = Line(matrix[row, :], rhs[row])
        # Compute the projection of the previous iterate onto the line.
        solution = ProjectOntoLine(line, solution)
        i = i + 1
    end
    return solution
end

A = [1 1 1;
     1 2 1;
     4 0 3]

b = [7; 6; 9]

x_ref = A \ b  # reference solution to check that everything is correct.
x = BregmanAlgorithm(A, b)
println("Final solution: ", x)
println("Error: ", norm(x - x_ref))

After about 18500 iterations the algorithm returns the correct solution (up to the specified threshold for the residual). Obviously it's not the fastest method. However, the Bregman algorithm is appealing because it is rather generic and can be used to solve a multitude of problems.

References

[1] Bregman, L. M., “The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming.” USSR Computational Mathematics and Mathematical Physics, 1967, 7, 200–217 (English translation of the Russian original)

[2] Bauschke H. H. and Combettes P. L., “Iterating Bregman Retractions.” SIAM Journal on Optimization, 2002, 13(4), 1159–1173

[3] Goldstein T. and Osher S., “The Split Bregman Method for L1-Regularized Problems.” SIAM Journal on Imaging Sciences, 2009, 2(2), 323–343

[4] Hoeltgen, L. “Bregman Iteration for Optical Flow.” Master Thesis, Saarland University, Saarbrücken, Germany, 2010
