Lecture 29: Gradient Descent Methods

Alternatives to exploiting problem structure by operator splitting are suggested by the least action principle. The action

$$S(q, \dot{q}) = \int_{t_0}^{t_1} L\left(t, q(t), \dot{q}(t)\right)\, dt,$$

is a functional over the phase space of the system, e.g., $S : \mathbb{R}^{2n} \to \mathbb{R}$ for a system composed of $n$ point masses. The least action principle states that the observed trajectory minimizes the action, hence it is to be expected that optimization algorithms that solve the problem

$$\min_{q, \dot{q}} S$$

would be of interest. This indeed is the case and leads to a class of methods that can exhibit remarkably fast convergence and more readily generalize to variable physical properties and arbitrary domain geometry.

1. Spatially dependent diffusivity

A first step in considering linear operators that still exhibit structure but are more complex than the constant-coefficient discretization of the Laplacian $\nabla^2$ is to consider spatially varying diffusivity, in which case the steady-state heat equation in domain $\Omega$ becomes

$$-\nabla \cdot (\alpha \nabla u) = f, \qquad (1)$$

again with Dirichlet boundary conditions $u = b$ on $\partial\Omega$. Maintaining simple domain geometry for now, the centered finite-difference discretization of (1) on $\Omega = [0,1] \times [0,1]$ with grid points $(x_i = ih,\ y_j = jh)$, $h = 1/(n+1)$, becomes

$$-\alpha_{i+1/2,j}\, u_{i+1,j} - \alpha_{i-1/2,j}\, u_{i-1,j} - \alpha_{i,j+1/2}\, u_{i,j+1} - \alpha_{i,j-1/2}\, u_{i,j-1} + 4\alpha_{i,j}\, u_{i,j} = c_{i,j}, \qquad (2)$$

where $\alpha_{i,j} = (\alpha_{i+1/2,j} + \alpha_{i-1/2,j} + \alpha_{i,j+1/2} + \alpha_{i,j-1/2})/4$ denotes a diffusivity average at $(i,j)$ and $\boldsymbol{c}$ contains the boundary conditions and forcing term as before, $\boldsymbol{c} = \boldsymbol{b} + h^2 \boldsymbol{f}$. The sparsity pattern is the same as in the constant-diffusivity case, but the system $\boldsymbol{A}\boldsymbol{u} = \boldsymbol{c}$ now has a system matrix with variable coefficients. The matrix $\boldsymbol{A}$ expresses a self-adjoint operator through a symmetric discretization, namely centered finite differences on a uniform grid. It can therefore be expected to be symmetric, $\boldsymbol{A} = [a_{k,r}] = \boldsymbol{A}^T = [a_{r,k}]$, as verified by considering row $k = (j-1)n + i$, which has non-zero components in columns $k$, $k \pm 1$, $k \pm n$. It is sufficient to verify symmetry for entries within the lower triangle of $\boldsymbol{A}$. The $(k, k-1)$ component is the coefficient of $u_{i-1,j}$ in (2), $a_{k,k-1} = -\alpha_{i-1/2,j}$. Symmetry of $\boldsymbol{A}$ would require $a_{k,k-1} = a_{k-1,k}$. The $a_{k-1,k}$ component arises from row $k-1$,

$$-\alpha_{i-1/2,j}\, u_{i,j} - \alpha_{i-3/2,j}\, u_{i-2,j} - \alpha_{i-1,j+1/2}\, u_{i-1,j+1} - \alpha_{i-1,j-1/2}\, u_{i-1,j-1} + 4\alpha_{i-1,j}\, u_{i-1,j} = c_{i-1,j}.$$

The diagonal element for row $k-1$ has indices $(i-1,j)$ and the $k$th column has indices $(i,j)$, for which $a_{k-1,k} = -\alpha_{i-1/2,j}$, indeed verifying $a_{k,k-1} = a_{k-1,k}$. Such opaque index manipulations can readily be avoided by the symmetry considerations stated above: a self-adjoint operator expressed through a symmetric discretization. The physical argument is even simpler. Diffusivity expresses how readily heat is transferred between two spatial positions of unequal temperature, and there is no reason for this material property to differ when considering the heat flux from point $P$ to point $Q$, $q_{PQ} = \alpha_{PQ}(u_P - u_Q)$, from that from point $Q$ to $P$, $q_{QP} = \alpha_{QP}(u_Q - u_P)$. Setting $q_{PQ} = -q_{QP}$ to account for the direction of heat flow leads to $\alpha_{PQ} = \alpha_{QP}$, and this material property is reflected in the symmetry of $\boldsymbol{A}$. Note that even though the operator $\nabla \cdot (\alpha \nabla)$ might be self-adjoint under appropriate boundary conditions, an unsymmetric discretization such as one-sided finite differences can lead to a non-symmetric system matrix $\boldsymbol{A}$.
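
The symmetry of the discretization (2) can also be verified numerically. Below is a minimal sketch, assuming NumPy/SciPy; the assembly routine `assemble` and the sample diffusivity field are illustrative choices, not notation from the text:

```python
import numpy as np
import scipy.sparse as sp

def assemble(n, alpha):
    """Assemble the system matrix of (2) on the n-by-n interior grid,
    h = 1/(n+1), with diffusivity sampled at the cell faces."""
    h = 1.0 / (n + 1)
    A = sp.lil_matrix((n * n, n * n))
    for j in range(1, n + 1):
        for i in range(1, n + 1):
            k = (j - 1) * n + (i - 1)            # 0-based row index
            aE = alpha((i + 0.5) * h, j * h)     # alpha_{i+1/2,j}
            aW = alpha((i - 0.5) * h, j * h)     # alpha_{i-1/2,j}
            aN = alpha(i * h, (j + 0.5) * h)     # alpha_{i,j+1/2}
            aS = alpha(i * h, (j - 0.5) * h)     # alpha_{i,j-1/2}
            A[k, k] = aE + aW + aN + aS          # = 4 alpha_{i,j}
            if i < n: A[k, k + 1] = -aE
            if i > 1: A[k, k - 1] = -aW
            if j < n: A[k, k + n] = -aN
            if j > 1: A[k, k - n] = -aS
    return A.tocsr()

alpha = lambda x, y: 1.0 + x + y                 # sample diffusivity field
A = assemble(16, alpha)
print(abs(A - A.T).max())                        # 0.0: A is symmetric
```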

The implications for iterative method convergence can again be surmised from the one-dimensional case with homogeneous boundary conditions, $\partial_x(\alpha(x)\, \partial_x u) = f$, $u(0) = u(1) = 0$. The convergence rate of an iterative method depends on the eigenvalues of the matrix $\boldsymbol{A}$ obtained by discretization of the operator $\partial_x(\alpha(x)\, \partial_x)$. The regular Sturm-Liouville eigenproblem $\partial_x(\alpha(x)\, \partial_x u) = \lambda u$, $u(0) = u(1) = 0$, is known to have a solution, albeit one difficult to obtain analytically. Replacing analytical estimates by a numerical experiment taking $\alpha(x) = 1 + cx$, Fig. 1 shows that convergence becomes marginally slower as the diffusivity gradient $c$ increases, though the main difficulty is the spectral radius $\rho(\boldsymbol{M}) \approx 1$ already present for constant diffusivity.

Figure 1. Spectral radius of the Jacobi iteration matrix $\boldsymbol{M} = \boldsymbol{I} - \boldsymbol{D}^{-1}\boldsymbol{A}$ for $\partial_x(\alpha(x)\, \partial_x u) = f$ with increasing diffusivity gradient, $\alpha = 1 + cx$.
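
The experiment behind Fig. 1 can be reproduced along the following lines; a minimal sketch assuming NumPy, with the helper name `jacobi_radius` chosen here for illustration:

```python
import numpy as np

def jacobi_radius(m, c):
    """Spectral radius of M = I - D^{-1} A for the 1D discretization of
    d/dx(alpha(x) du/dx) = f with alpha = 1 + c x on m interior points."""
    h = 1.0 / (m + 1)
    x = np.arange(1, m + 1) * h
    aE = 1.0 + c * (x + h / 2)                   # alpha_{i+1/2}
    aW = 1.0 + c * (x - h / 2)                   # alpha_{i-1/2}
    A = np.diag(aE + aW) - np.diag(aE[:-1], 1) - np.diag(aW[1:], -1)
    M = np.eye(m) - A / (aE + aW)[:, None]       # M = I - D^{-1} A
    return np.max(np.abs(np.linalg.eigvals(M)))

for c in [0, 1, 5, 10]:
    print(c, jacobi_radius(64, c))               # rho(M) stays near 1 for all c
```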

2. Steepest descent

The heat equation can be obtained as the stationary solution, $\delta\Phi = 0$, of an optimization problem for the functional

$$\Phi(u, \nabla u) = -\int_\Omega \left[ \frac{1}{2}\, \alpha\, (\nabla u) \cdot (\nabla u) + u f \right] d\boldsymbol{x}, \qquad (3)$$

among all functions $u$ that satisfy the boundary condition $u = b$ on $\partial\Omega$. The above can be understood as the generalization of the one-dimensional case

$$\Phi(u, u') = -\int_0^1 \left( \frac{1}{2}\, \alpha\, u' u' + u f \right) dx.$$

The stationarity condition for $\Phi$ is

$$\delta\Phi = -\int_0^1 \left( \alpha u'\, \delta u' + f\, \delta u \right) dx = -\int_0^1 \left( \alpha u'\, \frac{d}{dx}\delta u + f\, \delta u \right) dx = -\left[ \alpha u'\, \delta u \right]_{x=0}^{x=1} + \int_0^1 \left( \frac{d}{dx}(\alpha u') - f \right) \delta u\, dx = 0.$$

Since all admissible $u$ must satisfy the boundary conditions, the perturbations are null at the endpoints, $\delta u(0) = \delta u(1) = 0$, and stationarity for arbitrary perturbations $\delta u$ implies that

$$\frac{d}{dx}(\alpha u') - f = 0,$$

the one-dimensional variable diffusivity heat equation.

How can the above observations guide algorithm construction? The key point is that the discrete problem should also be expressible as an optimization problem for $\Phi : \mathbb{R}^m \to \mathbb{R}$,

$$\Phi(\boldsymbol{u}) = \frac{1}{2}\, \boldsymbol{u}^T \boldsymbol{A} \boldsymbol{u} - \boldsymbol{u}^T \boldsymbol{c} = \frac{1}{2} \sum_{j=1}^m \sum_{k=1}^m u_j a_{jk} u_k - \sum_{j=1}^m u_j c_j,$$

with $\boldsymbol{A} = [a_{jk}]$. The discrete stationarity condition is $\nabla_{\boldsymbol{u}} \Phi = 0$, leading to

$$\frac{\partial \Phi}{\partial u_l} = \frac{1}{2} \sum_{j=1}^m \sum_{k=1}^m \left( \delta_{lj} a_{jk} u_k + u_j a_{jk} \delta_{lk} \right) - \sum_{j=1}^m \delta_{lj} c_j.$$

Using the Kronecker delta properties $\delta_{ll} = 1$, $\delta_{lj} = 0$ for $l \neq j$ gives

$$\frac{\partial \Phi}{\partial u_l} = \frac{1}{2} \sum_{k=1}^m a_{lk} u_k + \frac{1}{2} \sum_{j=1}^m u_j a_{jl} - c_l,$$

which for symmetric $\boldsymbol{A}$ leads to

$$\frac{\partial \Phi}{\partial u_l} = \sum_{j=1}^m a_{lj} u_j - c_l = 0 \;\Rightarrow\; \boldsymbol{A}\boldsymbol{u} = \boldsymbol{c}. \qquad (4)$$
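
Identity (4) is easy to sanity-check numerically by comparing $\boldsymbol{A}\boldsymbol{u} - \boldsymbol{c}$ against a finite-difference approximation of $\nabla\Phi$; a short sketch assuming NumPy, with a randomly generated symmetric positive definite matrix standing in for the discretized operator:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6
B = rng.standard_normal((m, m))
A = B @ B.T + m * np.eye(m)                 # symmetric positive definite test matrix
c = rng.standard_normal(m)
u = rng.standard_normal(m)

Phi = lambda v: 0.5 * v @ A @ v - v @ c     # discrete functional

# central finite-difference approximation of the gradient of Phi
eps = 1e-6
g = np.zeros(m)
for l in range(m):
    e = np.zeros(m)
    e[l] = eps
    g[l] = (Phi(u + e) - Phi(u - e)) / (2 * eps)

print(np.max(np.abs(g - (A @ u - c))))      # ~1e-9: grad Phi = A u - c
```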

Symmetric discretization of the self-adjoint operator $\nabla \cdot (\alpha \nabla u)$ produces a symmetric matrix that is unitarily diagonalizable, $\boldsymbol{A} = \boldsymbol{Q} \boldsymbol{\Lambda} \boldsymbol{Q}^T$, and, as seen previously, has strictly positive eigenvalues. Hence stationary points of $\Phi(\boldsymbol{u})$ are minima and the solution to $\boldsymbol{A}\boldsymbol{u} = \boldsymbol{c}$ can be sought by minimizing $\Phi(\boldsymbol{u})$.

Equation (4) states that the gradient of $\Phi$ is opposite to the direction of the residual, $\nabla \Phi = \boldsymbol{A}\boldsymbol{u} - \boldsymbol{c} = -\boldsymbol{r}$. Since the gradient is the direction of fastest increase of $\Phi$, travel in the opposite direction will decrease $\Phi$, leading to an update

$$\boldsymbol{u}_{k+1} = \boldsymbol{u}_k + \beta_k \boldsymbol{r}_k, \qquad (5)$$

of the current approximation $\boldsymbol{u}_k$. The correction direction is also referred to as a search direction for the optimization procedure. In the residual correction formulation

$$\boldsymbol{r}_k = \boldsymbol{c} - \boldsymbol{A}\boldsymbol{u}_k, \quad \boldsymbol{e}_k = \boldsymbol{B}\boldsymbol{r}_k, \quad \boldsymbol{u}_{k+1} = \boldsymbol{u}_k + \boldsymbol{e}_k,$$

steepest descent corresponds to the choice $\boldsymbol{B} = \beta_k \boldsymbol{I}$. The remaining question is to determine how far to travel along the search direction $-\nabla\Phi(\boldsymbol{u}_k) = \boldsymbol{r}_k$. As $\beta$ increases, the local gradient direction changes. Steepest descent proceeds along the $\boldsymbol{r}_k$ direction until further decrease is no longer possible, that is, when the new gradient direction is orthogonal to the previous one:

$$\boldsymbol{r}_k^T \boldsymbol{r}_{k+1} = 0 \;\Rightarrow\; \boldsymbol{r}_k^T \left( \boldsymbol{c} - \boldsymbol{A}\boldsymbol{u}_{k+1} \right) = \boldsymbol{r}_k^T \left[ \boldsymbol{c} - \boldsymbol{A}\left( \boldsymbol{u}_k + \beta_k \boldsymbol{r}_k \right) \right] = \boldsymbol{r}_k^T \left( \boldsymbol{r}_k - \beta_k \boldsymbol{A} \boldsymbol{r}_k \right) = 0 \;\Rightarrow\; \beta_k = \frac{\boldsymbol{r}_k^T \boldsymbol{r}_k}{\boldsymbol{r}_k^T \boldsymbol{A} \boldsymbol{r}_k}.$$
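
Collecting (5) and the optimal step length gives the full iteration. A minimal sketch, assuming NumPy and a symmetric positive definite `A`:

```python
import numpy as np

def steepest_descent(A, c, maxiter=500, tol=1e-8):
    """Solve A u = c by update (5) with the optimal step length beta_k."""
    u = np.zeros_like(c)
    r = c - A @ u                     # initial residual
    for _ in range(maxiter):
        Ar = A @ r
        beta = (r @ r) / (r @ Ar)     # beta_k = r^T r / (r^T A r)
        u = u + beta * r              # travel along the residual direction
        r = r - beta * Ar             # r_{k+1} = c - A u_{k+1}
        if np.linalg.norm(r) < tol:
            break
    return u
```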

The convergence rate is given by the spectral radius of $\boldsymbol{M} = \boldsymbol{I} - \boldsymbol{B}\boldsymbol{A}$, which becomes

$$\boldsymbol{M} = \boldsymbol{I} - \beta_k \boldsymbol{A} = \boldsymbol{I} - \frac{\boldsymbol{r}_k^T \boldsymbol{r}_k}{\boldsymbol{r}_k^T \boldsymbol{A} \boldsymbol{r}_k}\, \boldsymbol{A}.$$

Recall that the one-dimensional, constant-diffusivity heat equation had eigenvalues of $\boldsymbol{A}$

$$\nu_l = 4 \sin^2\left( \frac{l \pi h}{2} \right), \quad l = 1, 2, \dots, m.$$

The eigenvalues of the gradient descent iteration matrix are therefore

$$\lambda_l = 1 - \beta_k \nu_l, \quad l = 1, 2, \dots, m.$$

Since $\beta_k$ is the inverse of a Rayleigh quotient, if the residual is in the direction of eigenvector $l$, then $\beta_k = 1/\nu_l$ and $\lambda_l = 0$, suggesting the possibility of fast convergence. However, the eigenvalues $\nu_l$ of $\boldsymbol{A}$ are spread throughout the interval $(0, 4)$, such that the residual components in other eigendirections are not significantly reduced. The typical behavior of gradient descent is rapid decrease of the residual in the first few iterations followed by slow convergence to the solution. Consider the problem

$$-u_{xx} = \pi^2 \sum_{k=1}^K k^2 \sin(k \pi x), \quad u(0) = u(1) = 0,$$

with solution

$$u(x) = \sum_{k=1}^K \sin(k \pi x).$$

The behavior of gradient descent is shown in Fig. 2. A good approximation of the solution shape is obtained after only a few gradient descent iterations, but convergence thereafter is slow.

Figure 2. Convergence of gradient descent. Blue: exact solution. Orange, green, red: iterates after 4, 40, 400 iterations.
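
The experiment of Fig. 2 can be set up along these lines, reusing the `steepest_descent` sketch above; the grid size and the value of `K` are illustrative choices:

```python
import numpy as np

m, K = 127, 4
h = 1.0 / (m + 1)
x = np.arange(1, m + 1) * h
ks = np.arange(1, K + 1)

# tridiagonal stencil of -u'' (scaled by h^2), homogeneous Dirichlet BCs
A = 2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
f = np.pi**2 * (ks**2 * np.sin(np.pi * np.outer(x, ks))).sum(axis=1)
c = h**2 * f
u_exact = np.sin(np.pi * np.outer(x, ks)).sum(axis=1)

for iters in [4, 40, 400]:
    u = steepest_descent(A, c, maxiter=iters, tol=0.0)
    print(iters, np.max(np.abs(u - u_exact)))   # error stalls after early gains
```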

3. Conjugate gradient

Steepest descent is characterized by a correction in the direction of the residual (5). Enforcing $\boldsymbol{r}_k^T \boldsymbol{r}_{k+1} = 0$ leads to orthogonality of both successive residuals and successive correction directions. A more insightful interpretation of (3) is to recognize the role of the scalar products

$$(f, g) = \int_\Omega \alpha\, (\nabla f) \cdot (\nabla g)\, d\boldsymbol{x}, \qquad (\boldsymbol{u}, \boldsymbol{v}) = \boldsymbol{u}^T \boldsymbol{A} \boldsymbol{v},$$

in the continuum and discrete cases, respectively. Similarly to how vectors that satisfy $\boldsymbol{u}^T \boldsymbol{v} = 0$ are said to be orthogonal, those that satisfy $\boldsymbol{u}^T \boldsymbol{A} \boldsymbol{v} = 0$ are said to be $\boldsymbol{A}$-conjugate. Gradient descent minimizes $\Phi$ along each search direction, with orthogonality of successive residuals measured in the Euclidean scalar product. However, the variational formulation suggests that a more appropriate norm is the $\boldsymbol{A}$-norm

$$\| \boldsymbol{e}_k \|_{\boldsymbol{A}} = \left( \boldsymbol{e}_k^T \boldsymbol{A} \boldsymbol{e}_k \right)^{1/2}.$$

This leads to a modification of the search directions $\boldsymbol{p}_k$, which are no longer taken in the direction of the residual and mutually orthogonal, but rather $\boldsymbol{A}$-conjugate:

$$\boldsymbol{p}_{k+1}^T \boldsymbol{A} \boldsymbol{p}_k = 0.$$

Algorithm Conjugate gradient

$\boldsymbol{x}_0 = \boldsymbol{0}, \quad \boldsymbol{r}_0 = \boldsymbol{c}, \quad \boldsymbol{p}_0 = \boldsymbol{r}_0$

for $k = 1 : \text{MaxIter}$

$\qquad \beta_k = \boldsymbol{r}_{k-1}^T \boldsymbol{r}_{k-1} \,/\, (\boldsymbol{p}_{k-1}^T \boldsymbol{A} \boldsymbol{p}_{k-1})$

$\qquad \boldsymbol{x}_k = \boldsymbol{x}_{k-1} + \beta_k \boldsymbol{p}_{k-1}$

$\qquad \boldsymbol{r}_k = \boldsymbol{r}_{k-1} - \beta_k \boldsymbol{A} \boldsymbol{p}_{k-1}$

$\qquad \gamma_k = \boldsymbol{r}_k^T \boldsymbol{r}_k \,/\, (\boldsymbol{r}_{k-1}^T \boldsymbol{r}_{k-1})$

$\qquad \boldsymbol{p}_k = \boldsymbol{r}_k + \gamma_k \boldsymbol{p}_{k-1}$

end
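
A direct transcription of the algorithm into Python; a sketch assuming NumPy, with illustrative names. Note how the $\gamma_k$-weighted update keeps successive search directions $\boldsymbol{A}$-conjugate:

```python
import numpy as np

def conjugate_gradient(A, c, maxiter=200, tol=1e-10):
    """Solve A x = c for symmetric positive definite A."""
    x = np.zeros_like(c)
    r = c.copy()                      # r_0 = c since x_0 = 0
    p = r.copy()                      # p_0 = r_0
    rho = r @ r
    for _ in range(maxiter):
        Ap = A @ p
        beta = rho / (p @ Ap)         # step length beta_k
        x = x + beta * p
        r = r - beta * Ap
        rho_new = r @ r
        if np.sqrt(rho_new) < tol:
            break
        gamma = rho_new / rho         # gamma_k
        p = r + gamma * p             # new direction, A-conjugate to the old
        rho = rho_new
    return x
```

On the test problem of Fig. 2, this routine drives the residual to near machine precision in far fewer iterations than steepest descent, reflecting the minimization of the $\boldsymbol{A}$-norm of the error over a growing set of $\boldsymbol{A}$-conjugate directions.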