AATA Steepest Descent

Section 6.4 Steepest Descent

As mentioned in the previous section, this method writes as

\begin{equation*} \boldsymbol{x_{k+1}} = \boldsymbol{x_k} - \alpha_k \boldsymbol{\nabla_{x_k}}f. \end{equation*}

MATLAB/Octave implementation. Below we implement this method and use it to find the minimum of three functions:

The quadratic function
\begin{equation*} f(x,y)=5x^2+4xy+y^2-6x-4y+15. \end{equation*}
An exact calculation shows that its minimum is attained at
\begin{equation*} (x,y) = (-1,4), \end{equation*}
where the function attains the value 10.
The quartic function
\begin{equation*} f(x,y)=5x^2+4xy+y^2-6x-4y+15+2x^4+x^2y^2. \end{equation*}
One can prove that this function has a single critical point at about
\begin{equation*} (-0.1472439849989..., 2.245791889806...). \end{equation*}
This must be a minimum since clearly the function goes to \(+\infty\) for \(x^2+y^2\to\infty\text{.}\)
The Rosenbrock's function
\begin{equation*} f(x,y) = 100*(y - x^2)^2 + (1 - x)^2. \end{equation*}
This function has a single minimum at \((1,1)\) but this minimum is "hidden" in a very narrow curved valley and so it is very hard to reach by numerical algorithms. You can check by yourself that the steepest descent method fails with this function even if you start from a point very close to the minimum such as \((1.01,0.99)\text{.}\)

In order to find the minimum of the second function, you need to comment out the definition of the first one (lines 7,8) and uncomment the definition of the second one (lines 11,12). The definition of the third one is at lines 15,16.

Line-search. As mentioned above, it is not easy in general to implement the line-search steepest descent algorithm because one has to minimize at every step a function of one variable, for instance by finding its critical points.

Things are much simpler when a function is quadratic. Consider indeed the function

\begin{equation} f(x) = \frac{1}{2}\boldsymbol{x}^TA\boldsymbol{x} - \boldsymbol{b}^T\boldsymbol{x} +c.\label{sec-optimization-quadratic}\tag{6.4.1} \end{equation}

Given a point \(\boldsymbol{p_0}\) and a direction \(\boldsymbol{v}\text{,}\) the parametric equations of the line passing through \(\boldsymbol{p_0}\) and parallel to the vector \(\boldsymbol{v}\) are given by

\begin{equation*} \boldsymbol{p}(\alpha) = \boldsymbol{p_0} + \alpha\boldsymbol{v} \end{equation*}

In case of the line-search algorithm, \(\boldsymbol{p_0} = \boldsymbol{x_k}\) and \(\boldsymbol{v} = \boldsymbol{\nabla_{x_k}}f\text{,}\) so that

\begin{equation*} g_k(\alpha) = f(\boldsymbol{p}(\alpha)) = \frac{1}{2}(\boldsymbol{x_k}-\alpha\boldsymbol{\nabla_{x_k}}f)^T\, A\,(\boldsymbol{x_k}-\alpha\boldsymbol{\nabla_{x_k}}f) - \boldsymbol{b}^T(\boldsymbol{x_k}-\alpha\boldsymbol{\nabla_{x_k}}f) + c. \end{equation*}

Then

\begin{equation*} g'_k(\alpha) = \frac{1}{2}(-\boldsymbol{\nabla_{x_k}}f)^T\, A\,(\boldsymbol{x_k}-\alpha\boldsymbol{\nabla_{x_k}}f) + \frac{1}{2}(\boldsymbol{x_k}-\alpha\boldsymbol{\nabla_{x_k}}f)^T\, A\,(-\boldsymbol{\nabla_{x_k}}f) - \boldsymbol{b}^T(-\boldsymbol{\nabla_{x_k}}f) \end{equation*}

\begin{equation*} = -\alpha \boldsymbol{\nabla_{x_k}}f^T\boldsymbol{\nabla_{x_k}}f + (\boldsymbol{x_k})^T\,A\,\boldsymbol{\nabla_{x_k}}f- \boldsymbol{b}^T(-\boldsymbol{\nabla_{x_k}}f), \end{equation*}

where we used the fact that

\begin{equation*} (\nabla_{x_k}f)^T\,A\,\boldsymbol{x_k} = (\boldsymbol{x_k})^T\,A\,\boldsymbol{\nabla_{x_k}}f \end{equation*}

because \(A\) is a symmetric matrix.

Notice now that \((\boldsymbol{x_k})^T\,A - \boldsymbol{b}^T = (\boldsymbol{\nabla_{x_k}}f)^T\text{,}\) so that the root of \(g'_k(\alpha)\) (that is necessarily a minimum) is ultimately given by

\begin{equation*} \alpha_k = \frac{\nabla_{x_k}f^T\nabla_{x_k}f}{\nabla_{x_k}f^T\,A\,\nabla_{x_k}f}. \end{equation*}

Hence we get the following algorithm:

select an initial point \(x\text{;}\)
select a threshold \(\epsilon>0\text{;}\)
until \(\|\nabla_{x}f\|<\epsilon\text{,}\) set \(x \to x-\displaystyle\frac{\nabla_{x}f^T\nabla_{x}f}{\nabla_{x}f^T\,A\,\nabla_{x}f}\nabla_{x}f.\)
print \(x\text{.}\)

Implementation of the steepest descent method with linear search. In the code below we exploit the fact that the function to minimize is quadratic to write the code in matricial notation, just as in Eq. (6.4.1). The elements \(A,b,c\) appearing in the expression of the function are defined at the code line 4. In this case we include lines with three different choices for the step-size (two of which commented out), so that the reader can easily compare the algorithm behavior depending on the chosen step-size strategy (whether it is constant, decreasing or optimized via the line-search algorithm). For instance, with the parameters given in the code, the algorithm reaches the requested tolerance in about 10 steps with the linear search, in about 1000 steps with the diminishing size and about 4000 with the constant step.

Implementation with plots. This is a small modification of the code above that keeps track of each intermediate point found by the algorithm. This way, we are able to plot the trajectory determined by the iterative algorithm.

Convergence. Are these methods going to converge? In some case, at least ignoring the computational errors, the answer is positive:

Theorem 6.4.1.

Let \(x_{k+1}=x_k-C_k\nabla_{x_k}f\) and suppose that:

the smallest eigenvalue of the \(C_k\) is not approaching 0;
the setpsize is chosen according to the line-search rule;

Then every limit point \(x^*\) of \(\{x_k\}\) is a critical point of \(f\text{,}\) namely \(\nabla_{x^*}f=0\text{.}\)

Notice that this does not grant that \(x^*\) is a minimum: it could be a saddle point of a local max!

In case \(f\) is quadratic, though, there is only one critical point...

Theorem 6.4.2.

Let \(f(x)=\frac{1}{2}x^TAx+\boldsymbol{b}^Tx+c\text{,}\) with \(A\) symmetric and with all positive eigenvalues, and let \(x^*\) be the position of its minimum.

Then, for the steepest descent method:

with \(\alpha_k\) given by line search, \(x_k\to x^*\) for all initial points;
with fixed step size \(\alpha\text{,}\) the same happens provided that \(\alpha<2/\lambda_A\text{,}\) where \(\lambda_A\) is the largest eigenvalue of \(A\text{.}\)

Rate of Convergence. We say that \(x_k\to x^*\) with order \(p\) if, from some \(k_0\) on,

\begin{equation*} \|x_{k+1}-x^*\|\leq K\|x_k-x^*\|^p \end{equation*}

The convergence is linear if \(p=1\) and \(K<1\text{:}\) \(\|x_k-x^*\|\simeq a^k\text{,}\) \(a<1\text{.}\)

It is sublinear if \(p=1\) and \(K=1\text{:}\) \(\|x_k-x^*\|\simeq 1/k\text{.}\)

It is superlinear if \(p>1\text{:}\) \(\|x_k-x^*\|\simeq a^{2^k}\) for some \(a<1\text{.}\)

Theorem 6.4.3.

Let \(f(x)=\frac{1}{2}x^TAx+\boldsymbol{b}^Tx+c\text{,}\) with \(A\) symmetric and with all positive eigenvalues, and let \(x^*\) be the position of its minimum. Then the sequence \(x_k\) generated by the steepeset descent method with exact line search converges to \(x^*\) linearly with \(K=\frac{k(A)-1}{k(A)+1}\text{.}\)

Remark: if \(k(A)=1\text{,}\) the convergence is actually quadratic! If \(k(A)\) is close to 1, the convergence is fast while, if \(k(A)\) is large, the convergence gets slower and slower: