- These notes draw on the paper by Jonathan Shewchuk and the Quora answer by Duane Rich.
Introduction
- Solve linear systems $Ax=b$ where $A$ is symmetric and positive definite.
- Symmetric: $A = A^T$.
- Positive definite: $x^TAx > 0$ for all $x\ne 0$.
- The method is understood in terms of minimizing a function of the quadratic form $$f(x) = \frac{1}{2}x^TAx - b^Tx.$$
- If $A$ is symmetric, then $\nabla f = Ax-b$, so $\nabla f=0$ implies $Ax=b$.
- If $A$ is also positive definite, then $f$ is a minimum at the $x$ satisfying $Ax=b$.
- $f(x)$ is a minimum if $f(p)-f(x)>0$ for all points $p\ne x$.
- Expand $f(p)-f(x)$, add and subtract $x^TAx$, and use $b-Ax=0$ to find $f(p)-f(x)=\frac{1}{2}(p-x)^TA(p-x)>0$ by positive definiteness.
- For symmetric positive definite $A$, solving $Ax=b$ can be done by methods that minimize $f(x)$.
- In the following, we’ll consider a two-component system with
$$A = \left[\begin{array}{c c}4 & 2 \\ 2 & 4\end{array}\right],\quad b = \left[\begin{array}{c}4 \\ 6\end{array}\right].$$
- The solution is labeled $x^*$.
- A plot of $f(x)$ is shown below.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
A = np.array([[4,2],
              [2,4]])
b = np.array([4,6])
xs = np.linalg.solve(A,b)
L,Q = np.linalg.eig(A)
Λ = np.diag(L)
print(Λ)
def f(x):
    return 0.5*np.dot(x, np.dot(A,x)) - np.dot(b,x)

def fminusf(e):
    return 0.5*np.dot(e, np.dot(A,e))
nx = 200
ny = 100
x = np.linspace(-3,3,nx)
y = np.linspace(-3,3,ny)
X,Y = np.meshgrid(x,y) # meshgrid puts x in rows and y in cols
CX, CY = np.zeros((ny,nx)), np.zeros((ny,nx))
Z = np.zeros((ny, nx))
Z2 = np.zeros((ny, nx))
for i in range(nx):
    for j in range(ny):
        xx = np.array([x[i], y[j]])
        Z[j,i] = f(xx)
        Z2[j,i] = fminusf(xx)
        cc = np.dot(np.dot(np.sqrt(Λ),np.transpose(Q)), xx)
        CX[j,i] = cc[0]
        CY[j,i] = cc[1]
plt.rc('font', size=14)
fig = plt.figure(figsize=(5,5))
levels = np.linspace(np.min(Z), np.max(Z), 50)
plt.contour(X,Y,Z, levels);
plt.plot([0,0], [-3,3],':', color='gray')
plt.plot([-3,3],[0,0], ':', color='gray')
plt.plot(xs[0], xs[1], 'o', color='black')
plt.xlabel(r'$x_0$')
plt.ylabel(r'$x_1$')
plt.text(xs[0]-0.3, xs[1]-0.45, r'$x^*$');
[[6. 0.]
[0. 2.]]
Gradient Descent Method
The Conjugate Gradient (CG) method is sometimes motivated by presenting the Gradient Descent method first.
- The gradient of $f(x)$ is
$$\nabla f = Ax-b$$
- This points up the curve and is the direction of steepest ascent.
- The direction of steepest descent is $$-\nabla f = b-Ax.$$
- This is also the residual $r=b-Ax$.
- Starting with an initial guess vector $x_0$, step along the descent direction $d=r$:
$$x_{k+1} = x_k + \alpha d_k.$$
- Note, we distinguish $d$ and $r$ here for convenience later when $d\ne r$.
- Here, $\alpha$ is a parameter chosen so that $f(x_{k+1})$ is a minimum.
- That is, $\alpha$ is chosen so that $f(x)$ is minimum along the line defined by $x_{k+1}=x_k+\alpha d_k$. Using $x(\alpha)=x_k+\alpha d_k$ along the line, we have $$\frac{df}{d\alpha} = \left(\frac{df}{dx}\right)^T\frac{dx}{d\alpha}= (\nabla f)^Td_k=0,$$ $$= (Ax-b)^Td_k =(A(x_k+\alpha d_k)-b)^Td_k = (Ax_k-b+\alpha Ad_k)^Td_k = -r_k^Td_k+\alpha d_k^TAd_k=0.$$ $$\alpha=\frac{r_k^Td_k}{d_k^TAd_k}.$$
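As a quick sanity check of this formula (a minimal numerical sketch reusing the A and b defined above; the starting point is an arbitrary choice), we can verify that $df/d\alpha$ vanishes at the computed $\alpha$:

xt = np.array([2.5, -2.0])                  # arbitrary starting point
d = b - np.dot(A, xt)                       # steepest-descent direction r = b - A x (so d = r here)
α = np.dot(d, d)/np.dot(d, np.dot(A, d))    # α = r^T d / d^T A d
xnew = xt + α*d
print(np.dot(np.dot(A, xnew) - b, d))       # df/dα at the minimizing α: ~0 up to roundoff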
Algorithm
- Guess $x_0$
- Repeat:
- $r = b - Ax_0 = d$
- $\alpha=\frac{r^Td}{d^TAd}.$
- $x = x_0 + \alpha d$
- break if $|x-x_0|/|x| < tol$
- $x_0 = x$
def GD(A, b, x0, tol=1E-2, maxit=10000):
    xstep = np.empty(0)                            # record the iterates for plotting
    for i in range(1, maxit):
        xstep = np.append(xstep, x0)
        r = b - np.dot(A, x0)                      # residual = steepest-descent direction
        d = r
        α = np.dot(r, d)/np.dot(d, np.dot(A, d))   # line-search step size
        x = x0 + α*d
        if np.linalg.norm(x-x0)/np.linalg.norm(x) <= tol:
            return x, xstep
        x0 = x.copy()
    print(f'warning, no convergence in {maxit} iterations')
    return x, xstep
#--------------------------------
A = np.array([[4,2],
              [2,4]])
b = np.array([4,6])
x0 = np.array([2.5,-2])
x, xstep = GD(A,b,x0)
print(xstep)
xstep = np.reshape(xstep,(int(len(xstep)/2),2))
print(xstep[:,0])
#-------------------------------
plt.rc('font', size=14)
fig = plt.figure(figsize=(5,5))
plt.contour(X,Y,Z, levels);
plt.plot([0,0], [-3,3],':', color='gray')
plt.plot([-3,3],[0,0], ':', color='gray')
plt.plot(xs[0], xs[1], 'o', color='black')
plt.xlabel(r'$x_0$')
plt.ylabel(r'$x_1$')
plt.plot(xstep[:,0], xstep[:,1], 'k-');
[ 2.5 -2. 1.86567164 0.85447761 0.79870671 0.61737429
0.66246077 1.23048101 0.43328982 1.17955413 0.40402586 1.31124192
0.35480276 1.30030345 0.34851722 1.32858837]
[2.5 1.86567164 0.79870671 0.66246077 0.43328982 0.40402586
0.35480276 0.34851722]
- Gradient Descent can require many steps to traverse long valleys.
- Each step direction is orthogonal to the previous step direction.
- In deriving $\alpha$ above, we had $(\nabla f)^Td_k=0$ with $d_k=r_k$. This gradient is evaluated at $x_{k+1}$, so $\nabla f = Ax_{k+1}-b=-r_{k+1}$, and it is clear that $r_{k+1}$ is orthogonal to $r_k$ (see the numerical check below).
- If the gradient happens to coincide with an eigenvector of $A$, then convergence is immediate.
- Similar descent directions may occur repeatedly.
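As a quick numerical check of this orthogonality (a minimal sketch reusing the xstep array returned by the GD run above), the dot product of consecutive step directions is essentially zero:

s1 = xstep[1,:] - xstep[0,:]     # first GD step direction
s2 = xstep[2,:] - xstep[1,:]     # second GD step direction
print(np.dot(s1, s2))            # ~0: successive directions are orthogonal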
Center on the solution
- $f(x)-f(x^*) = \frac{1}{2}(x-x^*)^TA(x-x^*)$
- Let $e=(x-x^*)$.
- To see this, expand $f(x)-f(x^*) = \frac{1}{2}x^TAx - b^Tx - \frac{1}{2}{x^*}^TAx^* + b^Tx^*$ and use $b-Ax^*=0$; the cross terms combine to give $\frac{1}{2}(x-x^*)^TA(x-x^*)$.
- Let $g(e)=f(x)-f(x^*)$.
$$g(e) = \frac{1}{2}e^TAe.$$
- This is $f(x)-f(x^*)$ with contours centered on the origin.
- Isocontours of g are defined by constant $e^TAe$.
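A quick numerical check that $f(x)-f(x^*)=g(e)$ (a minimal sketch using the f, fminusf, A, and xs defined in the code above; the test point is arbitrary):

xt = np.array([1.0, -2.0])        # arbitrary test point
e = xt - xs
print(f(xt) - f(xs), fminusf(e))  # the two values should agree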
Basis shift
- Let $Q$ be the matrix whose columns $q_j$ are eigenvectors of $A$.
- These $q_j$ are orthonormal with $Q^{-1}=Q^T$ since $A$ is symmetric.
- Let $\Lambda$ be the diagonal matrix of eigenvalues of $A$.
$$AQ = Q\Lambda,$$ $$A = Q\Lambda Q^{-1} = Q\Lambda Q^T.$$
- For some vector $v_1$, we have
$$Av_1 = Q\Lambda Q^Tv_1.$$
- Now, because the $q_j$ form a basis, we can write $v_1$ as a linear combination of the $q_j$, or, for convenience, as a linear combination of $\lambda_j^{-1/2}q_j$, which are just scaled $q_j$.
- The vector of coefficients of the linear combination is denoted $\hat{v}_1$, and is $v_1$ in the $Q$ basis (the basis of eigenvectors of $A$).
- Now, multiply $Av_1 = Q\Lambda Q^Tv_1$ by $v_2^T$:
$$v_2^TAv_1 = (v_2^T)Q\Lambda Q^Tv_1 = (Q\Lambda^{-1/2}\hat{v}_2)^TQ\Lambda Q^T(Q\Lambda^{-1/2}\hat{v}_1) = \hat{v}_2^T\Lambda^{-1/2}Q^TQ\Lambda Q^TQ\Lambda^{-1/2}\hat{v}_1 = \hat{v}_2^T\hat{v}_1.$$
- This uses $(AB)^T = B^TA^T$ and $Q^TQ=I$.
- Summarizing:
- If we take $v_1=v_2=e$, then $e^TAe=\hat{e}^T\hat{e}$. Since $e^TAe$ is constant on isocontours of $g(e)$, the length of $\hat{e}$ is constant there, so isocontours of $g(e)$ are spheres in the $Q$ basis.
- If $\hat{v}_2$ is orthogonal to $\hat{v}_1$ in the transformed space, then $v_2^TAv_1=0$ (checked numerically below).
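The identity $v_2^TAv_1 = \hat{v}_2^T\hat{v}_1$ is easy to check numerically (a minimal sketch using the $Q$ and $\Lambda$ computed above; the test vectors are arbitrary):

T = np.dot(np.sqrt(Λ), np.transpose(Q))    # the transform Λ^{1/2} Q^T
v1 = np.array([1.0, 2.0])                  # arbitrary test vectors
v2 = np.array([-0.5, 3.0])
v1h = np.dot(T, v1)
v2h = np.dot(T, v2)
print(np.dot(v2, np.dot(A, v1)), np.dot(v2h, v1h))   # the two values should agree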
Plot $f$ in the original and transformed spaces
- Transform the $x$ coordinates plotted in the above figures to $\hat{v}$ coordinates by multiplying $x$ by $\Lambda^{1/2}Q^T$.
- Note the orange and blue scaled eigenvectors. These effectively stretch and compress in the given directions so that the contours are spherical in the transformed space.
- The scaled eigenvectors have length 1 and are rotated in the transformed space.
fig = plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
levels = np.linspace(np.min(Z), np.max(Z), 50)
plt.contour(X,Y,Z, levels);
plt.plot([0,0], [-3,3],':', color='gray')
plt.plot([-3,3],[0,0], ':', color='gray')
plt.plot(xs[0], xs[1], 'o', color='black')
plt.xlabel(r'$x_0$')
plt.ylabel(r'$x_1$')
plt.arrow(xs[0], xs[1], Q[0,0]/Λ[0,0]**0.5, Q[1,0]/Λ[0,0]**0.5, head_width=0.1, length_includes_head=True, color='blue')
plt.arrow(xs[0], xs[1], Q[0,1]/Λ[1,1]**0.5, Q[1,1]/Λ[1,1]**0.5, head_width=0.1, length_includes_head=True, color='orange');
plt.gca().set_aspect('equal', adjustable='box')
plt.subplot(1,2,2)
plt.contour(CX,CY,Z, levels);
plt.plot([0,0],[np.min(CX),np.max(CX)], color='gray', linestyle=':')
plt.plot([np.min(CX),np.max(CX)],[0,0], color='gray', linestyle=':')
cs = np.dot(np.dot(np.sqrt(Λ),np.transpose(Q)), xs)
plt.plot(cs[0], cs[1], 'o', color='black');
plt.xlim([np.min(CX), np.max(CX)])
plt.ylim([np.min(CX), np.max(CX)])
plt.xlabel(r'$\hat{v}_0=(\Lambda^{1/2}Q^Tx)_0$')
plt.ylabel(r'$\hat{v}_1=(\Lambda^{1/2}Q^Tx)_1$');
plt.gca().set_aspect('equal', adjustable='box')
v = np.dot(np.dot(np.sqrt(Λ),np.transpose(Q)), xs)
w = np.dot(np.dot(np.sqrt(Λ),np.transpose(Q)), Q[:,0]/Λ[0,0]**0.5)
plt.arrow(v[0], v[1], w[0], w[1], head_width=0.1, length_includes_head=True, color='blue')
v = np.dot(np.dot(np.sqrt(Λ),np.transpose(Q)), xs)
w = np.dot(np.dot(np.sqrt(Λ),np.transpose(Q)), Q[:,1]/Λ[1,1]**0.5)
plt.arrow(v[0], v[1], w[0], w[1], head_width=0.1, length_includes_head=True, color='orange');
plt.tight_layout()
Plot the Gradient Descent solution in the original and transformed spaces
- In the original space, the GD directions are orthogonal.
- In the transformed space, the GD directions are not orthogonal.
fig = plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
levels = np.linspace(np.min(Z), np.max(Z), 50)
plt.contour(X,Y,Z, levels);
plt.plot([0,0], [-3,3],':', color='gray')
plt.plot([-3,3],[0,0], ':', color='gray')
plt.plot(xs[0], xs[1], 'o', color='black')
plt.plot(xstep[:,0], xstep[:,1], 'k-');
plt.xlabel(r'$x_0$')
plt.ylabel(r'$x_1$')
plt.gca().set_aspect('equal', adjustable='box')
plt.subplot(1,2,2)
plt.contour(CX,CY,Z, levels);
plt.plot([0,0],[np.min(CX),np.max(CX)], color='gray', linestyle=':')
plt.plot([np.min(CX),np.max(CX)],[0,0], color='gray', linestyle=':')
cs = np.dot(np.dot(np.sqrt(Λ),np.transpose(Q)), xs)
plt.plot(cs[0], cs[1], 'o', color='black');
plt.xlim([np.min(CX), np.max(CX)])
plt.ylim([np.min(CX), np.max(CX)])
plt.xlabel(r'$\hat{x}_0=(\Lambda^{1/2}Q^Tx)_0$')
plt.ylabel(r'$\hat{x}_1=(\Lambda^{1/2}Q^Tx)_1$');
plt.gca().set_aspect('equal', adjustable='box')
cstep = xstep.copy()
for i in range(len(xstep[:,0])):
    cstep[i,:] = np.dot(np.dot(np.sqrt(Λ),np.transpose(Q)), xstep[i,:])
plt.plot(cstep[:,0], cstep[:,1], 'k-');
plt.tight_layout()
Conjugate directions
- Consider the first two descent directions in the above plot. Denote these $d_1$ and $d_2$.
- In the transformed space, we’ll denote these $c_1$ and $c_2$.
- If the second descent direction $c_2$ were orthogonal to $c_1$ in the transformed space, then we would land on the answer on the second step.
- The orthogonality condition was given above: $d_2^TAd_1 = c_2^Tc_1 = 0$.
- We say that $d_2$ is A-orthogonal to $d_1$.
- $d_2$ and $d_1$ are said to be conjugate.
fig = plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
levels = np.linspace(np.min(Z), np.max(Z), 50)
plt.contour(X,Y,Z, levels);
plt.plot([0,0], [-3,3],':', color='gray')
plt.plot([-3,3],[0,0], ':', color='gray')
plt.plot(xs[0], xs[1], 'o', color='black')
plt.xlabel(r'$x_0$')
plt.ylabel(r'$x_1$')
plt.gca().set_aspect('equal', adjustable='box')
xstep = xstep[0:3,:]
xstep[2,:] = xs
plt.plot(xstep[:,0], xstep[:,1], 'k-');
plt.subplot(1,2,2)
plt.contour(CX,CY,Z, levels);
plt.plot([0,0],[np.min(CX),np.max(CX)], color='gray', linestyle=':')
plt.plot([np.min(CX),np.max(CX)],[0,0], color='gray', linestyle=':')
cs = np.dot(np.dot(np.sqrt(Λ),np.transpose(Q)), xs)
plt.plot(cs[0], cs[1], 'o', color='black');
plt.xlim([np.min(CX), np.max(CX)])
plt.ylim([np.min(CX), np.max(CX)])
plt.xlabel(r'$\hat{x}_0=(\Lambda^{1/2}Q^Tx)_0$')
plt.ylabel(r'$\hat{x}_1=(\Lambda^{1/2}Q^Tx)_1$');
plt.gca().set_aspect('equal', adjustable='box')
cstep = xstep.copy()
for i in range(len(xstep[:,0])):
    cstep[i,:] = np.dot(np.dot(np.sqrt(Λ),np.transpose(Q)), xstep[i,:])
plt.plot(cstep[:,0], cstep[:,1], 'k-');
#print(np.dot(cstep[2,:]-cstep[1,:], cstep[1,:]-cstep[0,:]))
plt.tight_layout()
- In two dimensions, the first direction was the negative gradient at our starting point.
- Minimizing $f(x)$ along this direction gives the second point.
- Starting at the second point and minimizing $f(x)$ along the direction conjugate to the first direction gives the solution.
- The method converges in two steps.
- In three dimensions, we start with a given direction (the negative gradient at our starting point) and take a first step.
- For the second step, the direction should be conjugate to the first direction so that the second direction is orthogonal to the first in the transformed space.
- But there are infinitely many vectors conjugate to the first direction since, in the transformed space, there is a two-dimensional plane orthogonal to the first direction.
- Only one of these vectors in the plane will intersect with the solution, but we can’t know that vector without knowing the solution.
- A good choice for the second direction is to project the transformed negative gradient vector onto this plane.
- We then minimize $f(x)$ along this direction.
- For the third step, there is only one descending direction that is orthogonal to the two previous directions, and the solution intersects this direction and is at the absolute minimum of $f(x)$.
- Essentially, in a three-dimensional problem, the last two steps are the same as the two-dimensional problem shown in the figures above.
- The method converges in three steps.
- In more than three dimensions, the pattern is similar.
- We start at a guess for $x$ and take a step in the descent direction along the negative gradient.
- For succeeding steps, we project the transformed negative gradient descent direction into the hyperplane orthogonal to all previous transformed directions.
- At each of these succeeding steps, we minimize $f(x)$ along the given direction.
- For $n$ dimensions, the method converges in $n$ steps.
- Note that at each step a hyperplane is considered that has one fewer dimensions than at the previous step. The solution is always contained in this hyperplane because of the spherical symmetry of the transformed coordinate space.
Mathematical details
Point 1
- Guess $x_1$.
- Set the descent direction: $d_1=-\nabla f(x_1) = b-Ax_1 = r_1.$
Point 2
- Step along $d_1$ to get $x_2 = x_1 + \alpha_1d_1$.
- Set $\alpha_1=\frac{r_1^Td_1}{d_1^TAd_1}$ to minimize $f(x_1+\alpha_1d_1)$ with respect to $\alpha_1$, as shown above.
- At $x_2$ compute $r_2 = b-Ax_2 = -\nabla f(x_2)$. This is the next direction of steepest descent.
- In the transformed space, project $\hat{r}_2$ onto the direction orthogonal to $\hat{d}_1$. This gives $\hat{d}_2$. Then apply the inverse transform $Q\Lambda^{-1/2}$ to recover $d_2$.
- Projection. Consider a vector $a$ and orthonormal vectors $w$, $x$, and $y$. Assume $a$ can be written as a linear combination of $w$, $x$, and $y$ (so that $a$ is contained in the space spanned by $w$, $x$, and $y$). Then we can write $a=(a^Tw)w + (a^Tx)x + (a^Ty)y$.
- $(P_{x,y}a)=a-(a^Tw)w = (a^Tx)x+(a^Ty)y$ is the projection of $a$ onto the subspace spanned by $x$ and $y$, which is orthogonal to $w$.
- $(P_ya)=a-(a^Tw)w-(a^Tx)x = (a^Ty)y$ is the projection of $a$ onto the subspace spanned by $y$, which is orthogonal to the subspace (plane) spanned by $w$ and $x$.
- If, e.g., $w$ is not unit length, we first normalize it so that $(a^Tw)w \rightarrow \left(a^T\frac{w}{|w|}\right)\frac{w}{|w|} = \frac{a^Tw}{w^Tw}w$.
- Now the projection of $\hat{r}_2$ onto the space orthogonal to $\hat{d}_1$ is $$\hat{d}_2 = \hat{r}_2 - \left(\frac{\hat{r}_2^T\hat{d}_1}{\hat{d}_1^T\hat{d}_1}\right)\hat{d}_1.$$
- The term in parentheses is a scalar. Hence, $d_2$ is a linear combination of $r_2$ and $d_1$ and is in the space spanned by $r_2$ and $d_1$.
- Apply the inverse transform (multiply by $Q\Lambda^{-1/2}$) and use $\hat{v} = (\Lambda^{1/2}Q^T)v$ and $A=Q\Lambda Q^T$:
$$d_2 = r_2 - \left(\frac{\hat{r}_2^T\hat{d}_1}{\hat{d}_1^T\hat{d}_1}\right)d_1 = r_2 - \left(\frac{(\Lambda^{1/2}Q^Tr_2)^T(\Lambda^{1/2}Q^Td_1)}{(\Lambda^{1/2}Q^Td_1)^T(\Lambda^{1/2}Q^Td_1)}\right)d_1 = r_2 - \left(\frac{r_2^TQ\Lambda^{1/2}\Lambda^{1/2}Q^Td_1}{d_1^TQ\Lambda^{1/2}\Lambda^{1/2}Q^Td_1}\right)d_1.$$
$$d_2 = r_2 - \left(\frac{r_2^TAd_1}{d_1^TAd_1}\right)d_1.$$
- Note that we don’t need to know $Q$ or $\Lambda$.
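As a numerical check of this formula (a minimal sketch using the example system; the starting guess is an arbitrary choice), the $d_2$ it produces is indeed A-orthogonal to $d_1$:

x1 = np.array([2.5, -2.0])                           # arbitrary starting guess
d1 = b - np.dot(A, x1)                               # first direction = residual
α1 = np.dot(d1, d1)/np.dot(d1, np.dot(A, d1))
x2 = x1 + α1*d1
r2 = b - np.dot(A, x2)
d2 = r2 - (np.dot(r2, np.dot(A, d1))/np.dot(d1, np.dot(A, d1)))*d1
print(np.dot(d2, np.dot(A, d1)))                     # ~0: d2 is conjugate (A-orthogonal) to d1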
Point 3
- $x_3 = x_2 + \alpha_2d_2$
- $\alpha_2 = \frac{r_2^Td_2}{d_2^TAd_2}$
- $r_3 = b-Ax_3 = -\nabla f(x_3)$
- $d_3 = r_3 - \left(\frac{r_3^TAd_1}{d_1^TAd_1}\right)d_1 - \left(\frac{r_3^TAd_2}{d_2^TAd_2}\right)d_2$
- This is the projection of $\hat{r}_3$ onto the space orthogonal to $\hat{d}_1$ and $\hat{d}_2$, transformed back from the eigenvector basis.
In general, for point $k$ we have:
- $x_k = x_{k-1} + \alpha_{k-1}d_{k-1}$
- $\alpha_{k-1} = \frac{r_{k-1}^Td_{k-1}}{d_{k-1}^TAd_{k-1}}$
- $r_k = b-Ax_k$
- $d_k = r_k - \sum_{i=1}^{k-1}\beta_{k,i}d_i$
- $\beta_{k,i} = \frac{r_k^TAd_i}{d_i^TAd_i}$, defined for $i<k$
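A direct, unoptimized implementation of these general formulas (a minimal sketch; the function name and loop structure are my own, and all previous directions are stored explicitly, which is exactly the inefficiency addressed below):

def conjugate_directions(A, b, x0, nsteps=None):
    nsteps = len(b) if nsteps is None else nsteps
    x = np.array(x0, dtype=float)
    r = b - np.dot(A, x)
    d = r.copy()                                    # d_1 = r_1
    dirs = []                                       # all previous directions d_i
    for k in range(nsteps):
        α = np.dot(r, d)/np.dot(d, np.dot(A, d))    # α_k = r_k^T d_k / d_k^T A d_k
        x = x + α*d
        dirs.append(d)
        r = b - np.dot(A, x)                        # new residual
        d = r.copy()
        for di in dirs:                             # d_k = r_k - Σ β_{k,i} d_i
            d = d - (np.dot(r, np.dot(A, di))/np.dot(di, np.dot(A, di)))*di
    return x

print(conjugate_directions(A, b, np.array([2.5, -2.0])))   # should match xs for the 2x2 example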
Algorithmic adjustments
- As written above, the calculation of the directions $d_k$ at any given step $k$ requires all previous directions. This is inefficient and expensive.
- The equations can be simplified by showing that $r_i^Tr_j=0$ for $i\ne j$.
Show that $r_i^Tr_j=0$ for $i\ne j$
That is, the residuals are mutually orthogonal. This is shown as follows.
* Find a relation between the directions $d$ and residuals $r$ regarding orthogonality and vector spaces.
* We’ll use the A-orthogonality of the $d$’s.
Starting from $r_k = b-Ax_k = b-A(x_{k-1}+\alpha_{k-1}d_{k-1}) = r_{k-1}-\alpha_{k-1}Ad_{k-1}$ and continuing by sequential expansion of the $r$'s:
$$r_k = r_1 - \sum_{i=1}^{k-1}\alpha_iAd_i.\tag{1}$$
* Now, the vectors $d$ are linearly independent, as are the vectors $Ad$. Furthermore, the $d$'s and the $Ad$'s each span the same space as the $r$'s, so we can write $r_1$ as
$$r_1 = \sum_{j=1}^n\delta_jAd_j,\tag{2}$$
where $\delta_j$ are constant coefficients.
* Solve for the $\delta_j$ by multiplying by some $d_i^T$ and using the A-orthogonality: $d_i^TAd_j=0$ for $i\ne j$:
$$d_i^Tr_1 = \sum_{j=1}^n\delta_jd_i^TAd_j = \delta_id_i^TAd_i,$$
$$\delta_i = \frac{d_i^Tr_1}{d_i^TAd_i}\xrightarrow{i \rightarrow j} \delta_j = \frac{d_j^Tr_1}{d_j^TAd_j}\tag{3}.$$
* Now, solve Eq. (1) for $r_1$:
$$r_1 = r_k + \sum_{i=1}^{k-1}\alpha_iAd_i.$$
This is true for arbitrary $k$, and we can therefore set $k=j$ and replace $k$ in this equation with $j$. We then insert this into Eq. (3):
$$\delta_j = \frac{d_j^T(r_j+\sum_{i=1}^{j-1}\alpha_iAd_i)}{d_j^TAd_j} = \frac{d_j^Tr_j}{d_j^TAd_j}\tag{4},$$
by A-orthogonality of the $d$’s.
* Note that $\mathbf{\delta_j=\alpha_j}$.
* Insert Eq. (4) into Eq. (2) and insert the result into Eq. (1):
$$r_k = \sum_{j=1}^n\alpha_jAd_j - \sum_{j=1}^{k-1}\alpha_jAd_j = \sum_{j=k}^n\alpha_jAd_j.$$
* Now, multiply this with some $d_i^T$:
$$d_i^Tr_k = \sum_{j=k}^n\alpha_jd_i^TAd_j = 0\text{ for } i<k.$$
* **Hence, $r_k$ is orthogonal to the space spanned by $d_i$ for $i<k$.** That is, $r_k$ is orthogonal to all the previous $d$'s.
* Above, we had $d_k = r_k - \sum_{i=1}^{k-1}\beta_{k,i}d_i$, which we can solve for $r_k$:
$$r_k = d_k + \sum_{i=1}^{k-1}\beta_{k,i}d_i.$$
* Hence, $r_k$ is a linear combination of $d_i$ for $i\le k$ so that $r_k$ is in the space spanned by $d_i$ for $i\le k$.
* So, if $r_k$ is orthogonal to the space spanned by $d_i$ for $i<k$, then $r_k$ is orthogonal to $r_i$ for $i<k$.
* $r_2\perp r_1$, and $r_3\perp r_2$, etc., so
$$r_i^Tr_j=0,\,\,\mbox{for } i\ne j.$$
Expressions for $\alpha$ and $\beta$
- Above, we had $r_k = b-Ax_k = b-A(x_{k-1}+\alpha_{k-1}d_{k-1}) = r_{k-1}-\alpha_{k-1}Ad_{k-1}$, or shifting indices and replacing $k$ with $i$ for convenience: $$r_{i+1} = r_i - \alpha_iAd_i.$$
- Multiply this by some $r_k^T$, rearrange, and use the $r_i^Tr_j=0$ for $i\ne j$ relation: $$r_k^TAd_i = \frac{r_k^Tr_i - r_k^Tr_{i+1}}{\alpha_i} = \begin{cases} \frac{1}{\alpha_k}r_k^Tr_k & i=k, \\ -\frac{1}{\alpha_{k-1}}r_k^Tr_k & i=k-1, \\ 0 & i<k-1. \end{cases}$$
- Now, $\beta_{k,i}=\frac{r_k^TAd_i}{d_i^TAd_i}$ is defined for $i<k$. Considering the numerator, the first case above is not relevant. The middle and last cases give: $$\beta_{k,i} = \begin{cases} -\frac{1}{\alpha_{k-1}}\frac{r_k^Tr_k}{d_{k-1}^TAd_{k-1}} & i=k-1, \\ 0 & i<k-1. \end{cases}$$
- Inserting the expression for $\alpha_{k-1} = \frac{r_{k-1}^Td_{k-1}}{d_{k-1}^TAd_{k-1}}$, we have $$\beta_{k,k-1} = -\frac{r_k^Tr_k}{r_{k-1}^Td_{k-1}},\qquad \beta_{k,i}=0 \text{ for } i<k-1.$$
- In the expression above for $d_k = r_k - \sum_{i=1}^{k-1}\beta_{k,i}d_i$, only the final term in the summation is nonzero, and we can denote $\beta_{k,i}$ as just $\beta_k$: $$d_k = r_k - \beta_kd_{k-1},$$ $$\beta_k = -\frac{r_k^Tr_k}{r_{k-1}^Td_{k-1}}.$$
- Finally, multiplying $d_k = r_k - \beta_kd_{k-1}$ by $r_k^T$, we have
$$r_k^Td_k = r_k^Tr_k - \beta_kr_k^Td_{k-1} = r_k^Tr_k$$
since we showed above that $d_i^Tr_k = 0$ for $i<k$.
- Hence, we can rewrite $\beta_k$ and $\alpha_{k-1}$ as $$\beta_k = -\frac{r_k^Tr_k}{r_{k-1}^Tr_{k-1}},$$ $$\alpha_{k-1} = \frac{r_{k-1}^Tr_{k-1}}{d_{k-1}^TAd_{k-1}}.$$
Conjugate Gradient Algorithm
- Note, $r_k = r_{k-1}-\alpha Ad_{k-1}$ is used instead of $r_k=b-Ax_k$ to minimize matrix-vector multiplications. Only $Ad_{k-1}$ is needed above, and can be computed once for each iteration. If $r_k=b-Ax_k$ is used, we also need to compute $Ax_k$.
- Shewchuk notes that this can cause some accumulation of roundoff error and suggests occasionally using $r_k=b-Ax_k$ to compute the correct residual.
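Putting the pieces together, here is a minimal sketch of the CG iteration built from the update formulas above (the function name, the residual-norm stopping test, and the tolerance are my own choices; the GD routine earlier used a relative step-size test instead):

def CG(A, b, x0, tol=1E-10, maxit=10000):
    x = np.array(x0, dtype=float)
    r = b - np.dot(A, x)                   # initial residual
    d = r.copy()                           # first direction is steepest descent
    rr = np.dot(r, r)
    if np.sqrt(rr) <= tol:
        return x
    for k in range(maxit):
        Ad = np.dot(A, d)                  # the one matrix-vector product per iteration
        α = rr/np.dot(d, Ad)               # α = r^T r / d^T A d
        x = x + α*d
        r = r - α*Ad                       # recurrence instead of r = b - A x
        rrnew = np.dot(r, r)
        if np.sqrt(rrnew) <= tol:
            return x
        β = -rrnew/rr                      # β_k = -r_k^T r_k / r_{k-1}^T r_{k-1}
        d = r - β*d                        # d_k = r_k - β_k d_{k-1}
        rr = rrnew
    print(f'warning, no convergence in {maxit} iterations')
    return x

#--------------------------------
A = np.array([[4,2],
              [2,4]])
b = np.array([4,6])
print(CG(A, b, np.array([2.5,-2.0])))      # should match np.linalg.solve(A,b)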