
Section 6.5 Newton's Method

Newton's method searches for an optimal point \(x^*\) of \(f(x)\) by looking for a zero of the gradient vector field \(\nabla_x f\text{.}\) Its formulation, as in the 1-dim case, comes from the Taylor expansion:
\begin{equation*} \nabla_{y}f \simeq \nabla_{x} f + Hess_{x}f\cdot(y-x) \end{equation*}
so, in order to have \(\nabla_{y}f=0\text{,}\) the best bet is choosing \(y\) so that
\begin{equation*} y = x - (Hess_{x}f)^{-1}\cdot(\nabla_{x} f) \end{equation*}
As in the 1-dim case, Newton's method converges in a single step when \(f(x)\) is quadratic. This is not very surprising: when \(f(x)=\frac{1}{2}x^TAx+\boldsymbol{b}^Tx+C\) with \(A\) symmetric and invertible, its Hessian \(Hess_{x}f\) is constant and equal to \(A\text{,}\) so applying Newton's method amounts to solving the single linear system \(Ax=-\boldsymbol{b}\text{.}\)
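Indeed, with this notation \(\nabla_{x}f=Ax+\boldsymbol{b}\text{,}\) so a single Newton step from any starting point \(x\) gives
\begin{equation*} y = x - A^{-1}(Ax+\boldsymbol{b}) = -A^{-1}\boldsymbol{b}\text{,} \end{equation*}
which is exactly the unique critical point of \(f\text{.}\)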

The algorithm:
  1. select an initial point \(x\text{;}\)
  2. select a threshold \(\epsilon>0\text{;}\)
  3. until \(\|\nabla_{x}f\|<\epsilon\text{:}\) \(x\to x-(Hess_{x}f)^{-1}\cdot\nabla_{x}f\text{.}\)
  4. print \(x\text{.}\)
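A minimal MATLAB/Octave sketch of this loop (the names newton_min, gradf and hessf are placeholders introduced here, not necessarily those used in the implementation discussed below):

    function x = newton_min(gradf, hessf, x, tol, maxit)
      % Newton iteration for a critical point: gradf and hessf are function
      % handles returning the gradient (a column vector) and the Hessian at x.
      for k = 1:maxit
        g = gradf(x);
        if norm(g) < tol
          break
        end
        % Solve the Newton linear system instead of forming the inverse.
        x = x - hessf(x) \ g;
      end
    end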

A concrete implementation. Here we implement Newton's method and use it to find the minima of the same three problems introduced in the previous section. In this case:
  1. Using the same tolerance of \(10^{-5}\text{,}\) we get the minimum of the quadratic map in a single step...
  2. ... and the minimum of the quartic map in 6 steps.
  3. The case of Rosenbrock's function is tough even for Newton's method. If we relax the tolerance to \(10^{-3}\text{,}\) then the method converges to the minimum in almost 4000 steps. Not an optimal result, but at least we finally get convergence. (A sketch of how such a run might be set up follows this list.)
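For illustration, a hypothetical run of the sketch above on Rosenbrock's function in its standard form \(f(x,y)=100(y-x^2)^2+(1-x)^2\) could look as follows; the starting point is the classical \((-1.2,1)\) and is an assumption, not necessarily the one used in the experiments above.

    % Gradient and Hessian of Rosenbrock's function, entered by hand.
    gradf = @(p) [-400*p(1)*(p(2) - p(1)^2) - 2*(1 - p(1)); 200*(p(2) - p(1)^2)];
    hessf = @(p) [1200*p(1)^2 - 400*p(2) + 2, -400*p(1); -400*p(1), 200];
    x = newton_min(gradf, hessf, [-1.2; 1], 1e-3, 10000)   % assumed starting point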
A second example. This example uses MATLAB's symbolic capabilities to avoid inputting the gradient and the Hessian of the function by hand. Since Octave does not support symbolic calculations, you need to run this code in MATLAB.
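A minimal sketch of this symbolic approach, assuming MATLAB's Symbolic Math Toolbox is available (the objective and starting point below are placeholders, not necessarily those of the example):

    syms u v
    f = 100*(v - u^2)^2 + (1 - u)^2;            % placeholder objective
    g = gradient(f, [u, v]);                    % symbolic gradient
    H = hessian(f, [u, v]);                     % symbolic Hessian
    gradf = matlabFunction(g, 'Vars', {u, v});  % numeric function handles
    hessf = matlabFunction(H, 'Vars', {u, v});
    x = [-1.2; 1];                              % assumed starting point
    for k = 1:10000
        gk = gradf(x(1), x(2));
        if norm(gk) < 1e-3
            break
        end
        x = x - hessf(x(1), x(2)) \ gk;
    end
    disp(x)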
Convergence

Does this method always converge? Short answer: No. Nicer answer: Yes, provided that we start "close enough".

Remark 1. The theorem above says nothing about starting points that are farther away: from those, the method may or may not converge.

Remark 2. This method only looks for zeros of the gradient, so there is no guarantee that the point it finds is a minimum: it could be a saddle point or even a maximum!

Rate of Convergence

We show this fact, namely that near \(x^*\) the convergence is quadratic, in dimension 1 and for the root-finding form of the iteration \(x_{k+1}=x_k-f(x_k)/f'(x_k)\text{,}\) where \(f(x^*)=0\) and \(f'(x^*)\neq0\) (in our setting, \(f\) plays the role of the gradient):

\begin{equation*} |x_{k+1}-x^*|=\left|x_k-x^*-\frac{f(x_k)}{f'(x_k)}\right| = \left|\frac{1}{f'(x_k)}\right|\cdot\left|f'(x_k)(x_k-x^*)-f(x_k)\right|\text{.} \end{equation*}

Expanding \(f\) about \(x_k\) and using \(f(x^*)=0\text{,}\) we get \(f'(x_k)(x_k-x^*)-f(x_k)=\frac{1}{2}f''(x_k)(x_k-x^*)^2+O(|x_k-x^*|^3)\text{,}\) so that

\begin{equation*} |x_{k+1}-x^*| = \left|\frac{1}{f'(x_k)}\right|\cdot\left|\frac{1}{2}f''(x_k)(x_k-x^*)^2+O(|x_k-x^*|^3)\right| \leq M|x_k-x^*|^2, \end{equation*}

where \(M\) is any constant bounding \(\left|\frac{f''(x)}{2f'(x)}\right|\) in a neighborhood of \(x^*\text{.}\)
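As a quick numerical illustration of this quadratic rate (an arbitrary example, not one of the problems above), applying the root-finding iteration to \(f(x)=x^2-2\) roughly squares the error at every step:

    % The number of correct digits roughly doubles per iteration.
    f  = @(x) x.^2 - 2;
    df = @(x) 2*x;
    x = 2;                                   % arbitrary starting point
    for k = 1:5
        x = x - f(x)/df(x);
        fprintf('step %d: |x_k - sqrt(2)| = %.3e\n', k, abs(x - sqrt(2)));
    end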