Section 6.4 Steepest Descent
As mentioned in the previous section, the update rule of this method is
\begin{equation*}
\boldsymbol{x_{k+1}} = \boldsymbol{x_k} - \alpha_k \boldsymbol{\nabla_{x_k}}f.
\end{equation*}
Python implementation. Below we implement this method and use it to find the minimum of three functions (a sketch of such an implementation follows the list):
- The quadratic function
\begin{equation*}
f(x,y)=5x^2+4xy+y^2-6x-4y+15.
\end{equation*}
An exact calculation shows that its minimum is attained at
\begin{equation*}
(x,y) = (-1,4),
\end{equation*}
where the function attains the value 10.
- The quartic function
\begin{equation*}
f(x,y)=5x^2+4xy+y^2-6x-4y+15+2x^4+x^2y^2.
\end{equation*}
One can prove that this function has a single critical point at about
\begin{equation*}
(-0.1472439849989..., 2.245791889806...).
\end{equation*}
This must be a minimum since the function clearly goes to \(+\infty\) as \(x^2+y^2\to\infty\text{.}\)
- Rosenbrock's function
\begin{equation*}
f(x,y) = 100(y - x^2)^2 + (1 - x)^2.
\end{equation*}
This function has a single minimum at \((1,1)\text{,}\) but this minimum is "hidden" in a very narrow curved valley, so it is very hard to reach by numerical algorithms. You can check for yourself that the steepest descent method fails on this function even if you start from a point very close to the minimum, such as \((1.01,0.99)\text{.}\)
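Below is a minimal sketch of such an implementation in NumPy, using a simple backtracking (Armijo) line search as the step-size rule; the function names, the tolerance and the Armijo constant are our own choices, made only for illustration.
\begin{verbatim}
import numpy as np

def steepest_descent(f, grad, x0, eps=1e-6, max_iter=10000):
    """Steepest descent with a simple backtracking (Armijo) line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:        # stop when the gradient is small
            return x
        alpha = 1.0
        # halve the step until it yields a sufficient decrease of f
        while f(x - alpha * g) > f(x) - 1e-4 * alpha * (g @ g):
            alpha *= 0.5
        x = x - alpha * g
    return x

# Quadratic test function: minimum at (-1, 4), where f = 10
f_quad = lambda p: 5*p[0]**2 + 4*p[0]*p[1] + p[1]**2 - 6*p[0] - 4*p[1] + 15
g_quad = lambda p: np.array([10*p[0] + 4*p[1] - 6, 4*p[0] + 2*p[1] - 4])

# Quartic test function: single critical point near (-0.147, 2.246)
f_quart = lambda p: f_quad(p) + 2*p[0]**4 + p[0]**2*p[1]**2
g_quart = lambda p: g_quad(p) + np.array([8*p[0]**3 + 2*p[0]*p[1]**2,
                                          2*p[0]**2*p[1]])

print(steepest_descent(f_quad, g_quad, [0.0, 0.0]))    # approx. (-1, 4)
print(steepest_descent(f_quart, g_quart, [0.0, 0.0]))  # approx. (-0.147, 2.246)
\end{verbatim}
The backtracking rule above merely guarantees that \(f\) decreases at every step; for quadratic functions the optimal step size can be computed exactly, as shown next.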
The optimal step size \(\alpha_k\) can be computed explicitly when \(f\) is a quadratic function, namely when
\begin{equation}
f(\boldsymbol{x}) = \frac{1}{2}\boldsymbol{x}^TA\boldsymbol{x} - \boldsymbol{b}^T\boldsymbol{x} +c,\label{sec-optimization-quadratic}\tag{6.4.1}
\end{equation}
with \(A\) a symmetric matrix.
Given a point \(\boldsymbol{p_0}\) and a direction \(\boldsymbol{v}\text{,}\) the parametric equations of the line passing through \(\boldsymbol{p_0}\) and parallel to the vector \(\boldsymbol{v}\) are given by
\begin{equation*}
\boldsymbol{p}(\alpha) = \boldsymbol{p_0} + \alpha\boldsymbol{v}
\end{equation*}
In the case of the line-search algorithm, \(\boldsymbol{p_0} = \boldsymbol{x_k}\) and \(\boldsymbol{v} = -\boldsymbol{\nabla_{x_k}}f\text{,}\) so that
\begin{equation*}
g_k(\alpha) = f(\boldsymbol{p}(\alpha)) = \frac{1}{2}(\boldsymbol{x_k}-\alpha\boldsymbol{\nabla_{x_k}}f)^T\, A\,(\boldsymbol{x_k}-\alpha\boldsymbol{\nabla_{x_k}}f) - \boldsymbol{b}^T(\boldsymbol{x_k}-\alpha\boldsymbol{\nabla_{x_k}}f) + c.
\end{equation*}
Then
\begin{equation*}
g'_k(\alpha) = \frac{1}{2}(-\boldsymbol{\nabla_{x_k}}f)^T\, A\,(\boldsymbol{x_k}-\alpha\boldsymbol{\nabla_{x_k}}f)
+ \frac{1}{2}(\boldsymbol{x_k}-\alpha\boldsymbol{\nabla_{x_k}}f)^T\, A\,(-\boldsymbol{\nabla_{x_k}}f) - \boldsymbol{b}^T(-\boldsymbol{\nabla_{x_k}}f)
\end{equation*}
\begin{equation*}
= \alpha\,(\boldsymbol{\nabla_{x_k}}f)^T\, A\,\boldsymbol{\nabla_{x_k}}f - (\boldsymbol{x_k})^T\,A\,\boldsymbol{\nabla_{x_k}}f + \boldsymbol{b}^T\boldsymbol{\nabla_{x_k}}f,
\end{equation*}
where we used the fact that
\begin{equation*}
(\boldsymbol{\nabla_{x_k}}f)^T\,A\,\boldsymbol{x_k} = (\boldsymbol{x_k})^T\,A\,\boldsymbol{\nabla_{x_k}}f
\end{equation*}
because \(A\) is a symmetric matrix. Notice now that \((\boldsymbol{x_k})^T\,A - \boldsymbol{b}^T = (\boldsymbol{\nabla_{x_k}}f)^T\text{,}\) so that the root of \(g'_k(\alpha)\) (which is necessarily a minimum of \(g_k\text{,}\) since the coefficient \((\boldsymbol{\nabla_{x_k}}f)^T A\,\boldsymbol{\nabla_{x_k}}f\) of \(\alpha\) is positive for a positive-definite \(A\)) is ultimately given by
\begin{equation*}
\alpha_k = \frac{(\boldsymbol{\nabla_{x_k}}f)^T\boldsymbol{\nabla_{x_k}}f}{(\boldsymbol{\nabla_{x_k}}f)^T\,A\,\boldsymbol{\nabla_{x_k}}f}.
\end{equation*}
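For concreteness, here is the formula worked out for the quadratic test function above, written in the form (6.4.1) with \(A=\begin{pmatrix}10&4\\4&2\end{pmatrix}\) and \(\boldsymbol{b}=(6,4)^T\text{;}\) the starting point \(\boldsymbol{x_0}=(0,0)\) is chosen purely for illustration:
\begin{equation*}
\boldsymbol{\nabla_{x_0}}f = A\boldsymbol{x_0}-\boldsymbol{b} = (-6,-4)^T,\qquad
\alpha_0 = \frac{(-6)^2+(-4)^2}{(\boldsymbol{\nabla_{x_0}}f)^T A\,\boldsymbol{\nabla_{x_0}}f} = \frac{52}{584}\approx 0.089,
\end{equation*}
so that \(\boldsymbol{x_1} = \boldsymbol{x_0} - \alpha_0\boldsymbol{\nabla_{x_0}}f \approx (0.534,\,0.356)\text{.}\)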
Hence we get the following algorithm (a Python sketch of it follows the list):
- select an initial point \(x\text{;}\)
- select a threshold \(\epsilon>0\text{;}\)
- until \(\|\nabla_{x}f\|<\epsilon\text{,}\) set \(x \to x-\displaystyle\frac{\nabla_{x}f^T\nabla_{x}f}{\nabla_{x}f^T\,A\,\nabla_{x}f}\nabla_{x}f\text{;}\)
- print \(x\text{.}\)
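The sketch below (again our own code, not the only way to write it) is a direct transcription of this algorithm in NumPy, applied to the quadratic test function from the beginning of the section, for which \(A=\begin{pmatrix}10&4\\4&2\end{pmatrix}\) and \(\boldsymbol{b}=(6,4)^T\text{:}\)
\begin{verbatim}
import numpy as np

def steepest_descent_quadratic(A, b, x0, eps=1e-8, max_iter=10000):
    """Steepest descent with exact line search for f(x) = x^T A x / 2 - b^T x + c."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = A @ x - b                    # gradient of the quadratic form
        if np.linalg.norm(g) < eps:      # threshold test
            break
        alpha = (g @ g) / (g @ A @ g)    # optimal step from the line search
        x = x - alpha * g
    return x

A = np.array([[10.0, 4.0], [4.0, 2.0]])
b = np.array([6.0, 4.0])
print(steepest_descent_quadratic(A, b, x0=[0.0, 0.0]))   # approx. (-1, 4)
\end{verbatim}
No backtracking is needed here because the minimum of \(g_k\) along the search line is known in closed form.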
Theorem 6.4.1.
Let \(x_{k+1}=x_k-C_k\nabla_{x_k}f\) and suppose that:
- the smallest eigenvalues of the matrices \(C_k\) do not approach 0;
- the stepsize is chosen according to the line-search rule.
Then every limit point of the sequence \(x_k\) is a critical point of \(f\text{.}\)
Theorem 6.4.2.
Let \(f(x)=\frac{1}{2}x^TAx-\boldsymbol{b}^Tx+c\) as in (6.4.1), with \(A\) symmetric and with all positive eigenvalues, and let \(x^*\) be the position of its minimum. Then, for the steepest descent method:
- with \(\alpha_k\) given by line search, \(x_k\to x^*\) for all initial points;
- with fixed step size \(\alpha\text{,}\) the same happens provided that \(\alpha<2/\lambda_A\text{,}\) where \(\lambda_A\) is the largest eigenvalue of \(A\text{.}\)
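The fixed-step-size condition is easy to check numerically. The following sketch assumes the \(A\) and \(\boldsymbol{b}\) of the quadratic test function used earlier, for which \(\lambda_A\approx 11.66\text{:}\)
\begin{verbatim}
import numpy as np

A = np.array([[10.0, 4.0], [4.0, 2.0]])
b = np.array([6.0, 4.0])
lam_max = np.linalg.eigvalsh(A).max()    # largest eigenvalue, about 11.66

def fixed_step_descent(alpha, n_iter=2000):
    """Gradient descent with a constant step size on f(x) = x^T A x / 2 - b^T x."""
    x = np.zeros(2)
    for _ in range(n_iter):
        x = x - alpha * (A @ x - b)
    return x

print(fixed_step_descent(0.9 * 2 / lam_max))  # below the threshold: approx. (-1, 4)
print(fixed_step_descent(1.1 * 2 / lam_max))  # above the threshold: the iterates blow up
\end{verbatim}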
The speed of convergence of a sequence \(x_k\to x^*\) is usually quantified by constants \(p\geq 1\) and \(K>0\) such that, for \(k\) large enough,
\begin{equation*}
\|x_{k+1}-x^*\|\leq K\|x_k-x^*\|^p.
\end{equation*}
The convergence is linear if \(p=1\) and \(K<1\text{:}\) \(\|x_k-x^*\|\simeq a^k\text{,}\) \(a<1\text{.}\) It is sublinear if \(p=1\) and \(K=1\text{:}\) for instance \(\|x_k-x^*\|\simeq 1/k\text{.}\) It is superlinear if \(p>1\text{:}\) for example, when \(p=2\) (quadratic convergence), \(\|x_k-x^*\|\simeq a^{2^k}\) for some \(a<1\text{.}\)
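These regimes can be told apart numerically by looking at the ratios \(\|x_{k+1}-x^*\|/\|x_k-x^*\|\text{.}\) The sketch below does this for steepest descent with exact line search on the quadratic test function (the starting point and the number of iterations are arbitrary choices):
\begin{verbatim}
import numpy as np

A = np.array([[10.0, 4.0], [4.0, 2.0]])
b = np.array([6.0, 4.0])
x_star = np.linalg.solve(A, b)         # exact minimum, (-1, 4)

x = np.zeros(2)                        # arbitrary starting point
errors = []
for _ in range(25):
    errors.append(np.linalg.norm(x - x_star))
    g = A @ x - b
    x = x - (g @ g) / (g @ A @ g) * g  # exact line-search step

# The ratios settle around a constant smaller than 1 instead of tending to 0:
# on this function the method converges linearly (p = 1, K < 1).
for e0, e1 in zip(errors, errors[1:]):
    print(e1 / e0)
\end{verbatim}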