Lecture 8: Least Squares Approximation

A typical scenario in many sciences is the acquisition of $m$ numbers to describe some object that is understood to actually require only $n \ll m$ parameters. For example, $m$ voltage measurements $u_i$ of an alternating current could readily be reduced to three parameters, the amplitude, phase and frequency of $u(t) = a \sin(\omega t + \varphi)$. Very often a simple first-degree polynomial approximation $y = ax + b$ is sought for a large data set $\mathcal{D} = \{(x_i, y_i),\ i = 1, \dots, m\}$. All of these are instances of data compression, a problem that can be solved in a linear algebra framework.

1. Projection

Consider a partition of a vector space $U$ into orthogonal subspaces, $U = V \oplus W$, $V = W^{\perp}$, $W = V^{\perp}$. Within the typical scenario described above $U = \mathbb{R}^m$, $V \subset \mathbb{R}^m$, $W \subset \mathbb{R}^m$, $\dim V = n$, $\dim W = m-n$. If $\boldsymbol{V} = [\,\boldsymbol{v}_1\ \dots\ \boldsymbol{v}_n\,] \in \mathbb{R}^{m \times n}$ is a basis for $V$ and $\boldsymbol{W} = [\,\boldsymbol{w}_1\ \dots\ \boldsymbol{w}_{m-n}\,] \in \mathbb{R}^{m \times (m-n)}$ is a basis for $W$, then $\boldsymbol{U} = [\,\boldsymbol{v}_1\ \dots\ \boldsymbol{v}_n\ \boldsymbol{w}_1\ \dots\ \boldsymbol{w}_{m-n}\,]$ is a basis for $U$. Even though the matrices $\boldsymbol{V}, \boldsymbol{W}$ are not necessarily square, they are said to be orthonormal, in the sense that all columns are of unit norm and orthogonal to one another. Computation of the matrix product $\boldsymbol{V}^T\boldsymbol{V}$ leads to the formation of the identity matrix within $\mathbb{R}^{n \times n}$,

$$\boldsymbol{V}^T\boldsymbol{V} = \begin{bmatrix} \boldsymbol{v}_1^T \\ \boldsymbol{v}_2^T \\ \vdots \\ \boldsymbol{v}_n^T \end{bmatrix} \begin{bmatrix} \boldsymbol{v}_1 & \boldsymbol{v}_2 & \dots & \boldsymbol{v}_n \end{bmatrix} = \begin{bmatrix} \boldsymbol{v}_1^T\boldsymbol{v}_1 & \boldsymbol{v}_1^T\boldsymbol{v}_2 & \dots & \boldsymbol{v}_1^T\boldsymbol{v}_n \\ \boldsymbol{v}_2^T\boldsymbol{v}_1 & \boldsymbol{v}_2^T\boldsymbol{v}_2 & \dots & \boldsymbol{v}_2^T\boldsymbol{v}_n \\ \vdots & \vdots & \ddots & \vdots \\ \boldsymbol{v}_n^T\boldsymbol{v}_1 & \boldsymbol{v}_n^T\boldsymbol{v}_2 & \dots & \boldsymbol{v}_n^T\boldsymbol{v}_n \end{bmatrix} = \boldsymbol{I}_n .$$

Similarly, 𝑾T𝑾=𝑰m-n. Whereas for the square orthogonal matrix 𝑼 multiplication both on the left and the right by its transpose leads to the formation of the identity matrix

$$\boldsymbol{U}^T\boldsymbol{U} = \boldsymbol{U}\boldsymbol{U}^T = \boldsymbol{I}_m,$$

the same operations applied to rectangular orthonormal matrices lead to different results

$$\boldsymbol{V}^T\boldsymbol{V} = \boldsymbol{I}_n, \qquad \boldsymbol{V}\boldsymbol{V}^T = \begin{bmatrix} \boldsymbol{v}_1 & \boldsymbol{v}_2 & \dots & \boldsymbol{v}_n \end{bmatrix} \begin{bmatrix} \boldsymbol{v}_1^T \\ \boldsymbol{v}_2^T \\ \vdots \\ \boldsymbol{v}_n^T \end{bmatrix} = \sum_{i=1}^{n} \boldsymbol{v}_i \boldsymbol{v}_i^T, \qquad \operatorname{rank}(\boldsymbol{v}_i \boldsymbol{v}_i^T) = 1 .$$

A simple example is provided by taking $\boldsymbol{V} = \boldsymbol{I}_{m,n}$, the first $n$ columns of the identity matrix, in which case

$$\boldsymbol{V}\boldsymbol{V}^T = \sum_{i=1}^{n} \boldsymbol{e}_i \boldsymbol{e}_i^T = \begin{bmatrix} \boldsymbol{I}_n & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix} \in \mathbb{R}^{m \times m} .$$

Applying $\boldsymbol{P} = \boldsymbol{V}\boldsymbol{V}^T$ to some vector $\boldsymbol{b} \in \mathbb{R}^m$ leads to a vector $\boldsymbol{r} = \boldsymbol{P}\boldsymbol{b}$ whose first $n$ components are those of $\boldsymbol{b}$, and the remaining $m-n$ are zero. The subtraction $\boldsymbol{b} - \boldsymbol{r}$ leads to a new vector $\boldsymbol{s} = (\boldsymbol{I} - \boldsymbol{P})\boldsymbol{b}$ that has the first $n$ components equal to zero, and the remaining $m-n$ the same as those of $\boldsymbol{b}$. Such operations are referred to as projections, and for $\boldsymbol{V} = \boldsymbol{I}_{m,n}$ correspond to projection onto $\operatorname{span}\{\boldsymbol{e}_1, \dots, \boldsymbol{e}_n\}$.

Figure 1. Projection in $\mathbb{R}^2$. The vectors $\boldsymbol{r}, \boldsymbol{s} \in \mathbb{R}^2$ have two components, but could be expressed through scaling of $\boldsymbol{e}_1, \boldsymbol{e}_2$.
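To make the above concrete, here is a minimal Julia sketch of these projections for $\boldsymbol{V} = \boldsymbol{I}_{m,n}$; the sizes $m=5$, $n=2$ and the vector $\boldsymbol{b}$ are arbitrary illustrative choices, and the identity I comes from the standard LinearAlgebra package.

using LinearAlgebra

# Projection onto span{e1,...,en} using V = I(m,n), the first n columns of the identity matrix
m, n = 5, 2
V = Matrix{Float64}(I, m, n)   # m x n orthonormal matrix
P = V * V'                     # projection matrix
b = collect(1.0:m)             # example vector b = [1,2,3,4,5]
r = P * b                      # keeps the first n components of b, zeros the rest
s = (I - P) * b                # zeros the first n components, keeps the rest
@show r s
@show r + s ≈ b                # the two projections recompose b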

Returning to the general case, the orthogonal matrix $\boldsymbol{U} \in \mathbb{R}^{m \times m}$ and the orthonormal matrices $\boldsymbol{V} \in \mathbb{R}^{m \times n}$, $\boldsymbol{W} \in \mathbb{R}^{m \times (m-n)}$ are associated with linear mappings $\boldsymbol{b} = \boldsymbol{f}(\boldsymbol{x}) = \boldsymbol{U}\boldsymbol{x}$, $\boldsymbol{r} = \boldsymbol{g}(\boldsymbol{b}) = \boldsymbol{P}\boldsymbol{b}$, $\boldsymbol{s} = \boldsymbol{h}(\boldsymbol{b}) = (\boldsymbol{I} - \boldsymbol{P})\boldsymbol{b}$. The mapping $\boldsymbol{f}$ gives the components in the $\boldsymbol{I}$ basis of a vector whose components in the $\boldsymbol{U}$ basis are $\boldsymbol{x}$. The mappings $\boldsymbol{g}, \boldsymbol{h}$ project a vector onto $\operatorname{span}\{\boldsymbol{v}_1, \dots, \boldsymbol{v}_n\}$ and $\operatorname{span}\{\boldsymbol{w}_1, \dots, \boldsymbol{w}_{m-n}\}$, respectively. When $\boldsymbol{V}, \boldsymbol{W}$ are orthonormal matrices the projections are also orthogonal, $\boldsymbol{r} \perp \boldsymbol{s}$. Projection can also be carried out onto nonorthogonal spanning sets, but the process is fraught with possible error, especially when the angle between basis vectors is small, and will be avoided henceforth.

Notice that projection of a vector already in the spanning set simply returns the same vector, which leads to a general definition.

Definition. The mapping $\boldsymbol{f}: U \to U$ is called a projection if $\boldsymbol{f} \circ \boldsymbol{f} = \boldsymbol{f}$, or if for any $\boldsymbol{u} \in U$, $\boldsymbol{f}(\boldsymbol{f}(\boldsymbol{u})) = \boldsymbol{f}(\boldsymbol{u})$. With $\boldsymbol{P}$ the matrix associated with $\boldsymbol{f}$, a projection matrix satisfies $\boldsymbol{P}^2 = \boldsymbol{P}$.

For example, $\boldsymbol{P} = \boldsymbol{V}\boldsymbol{V}^T$ with orthonormal $\boldsymbol{V}$ is a projection matrix, since
$$\boldsymbol{P}^2 = \boldsymbol{P}\boldsymbol{P} = \boldsymbol{V}\boldsymbol{V}^T\boldsymbol{V}\boldsymbol{V}^T = \boldsymbol{V}(\boldsymbol{V}^T\boldsymbol{V})\boldsymbol{V}^T = \boldsymbol{V}\boldsymbol{I}\boldsymbol{V}^T = \boldsymbol{V}\boldsymbol{V}^T = \boldsymbol{P} .$$
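The idempotency can also be verified numerically. In the sketch below, an orthonormal $\boldsymbol{V}$ is obtained from the QR-factorization of a random matrix; the sizes and the use of qr, norm, and dot from the standard LinearAlgebra package are illustrative choices, not part of the lecture code.

using LinearAlgebra

# Numerical check that P = V*V' is a projection matrix when V has orthonormal columns
m, n = 6, 3
V = Matrix(qr(randn(m, n)).Q)[:, 1:n]   # orthonormal basis of a random n-dimensional subspace
P = V * V'

@show norm(V' * V - I)      # ~0: columns of V are orthonormal
@show norm(P * P - P)       # ~0: P is idempotent, P^2 = P

b = randn(m)
r = P * b; s = (I - P) * b  # projections onto the subspace and its orthogonal complement
@show dot(r, s)             # ~0: r is orthogonal to s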

2. Gram-Schmidt

Orthonormal vector sets $\{\boldsymbol{q}_1, \dots, \boldsymbol{q}_n\}$ are of the greatest practical utility, leading to the question of whether such a set can be obtained from an arbitrary set of vectors $\{\boldsymbol{a}_1, \dots, \boldsymbol{a}_n\}$. This is possible for linearly independent vectors, through what is known as the Gram-Schmidt algorithm:

  1. Start with an arbitrary direction 𝒂1

  2. Divide by its norm to obtain a unit-norm vector 𝒒1=𝒂1/||𝒂1||

  3. Choose another direction 𝒂2

  4. Subtract off its component along previous direction(s) 𝒂2-(𝒒1T𝒂2)𝒒1

  5. Divide by norm 𝒒2=(𝒂2-(𝒒1T𝒂2)𝒒1)/||𝒂2-(𝒒1T𝒂2)𝒒1||

  6. Repeat the above

The component subtracted in step 4 is the orthogonal projection of $\boldsymbol{a}_2$ onto $\boldsymbol{q}_1$,
$$\boldsymbol{P}_1\boldsymbol{a}_2 = (\boldsymbol{q}_1\boldsymbol{q}_1^T)\boldsymbol{a}_2 = \boldsymbol{q}_1(\boldsymbol{q}_1^T\boldsymbol{a}_2) = (\boldsymbol{q}_1^T\boldsymbol{a}_2)\,\boldsymbol{q}_1 .$$

The above geometrical description can be expressed in terms of matrix operations as

$$\boldsymbol{A} = \begin{bmatrix} \boldsymbol{a}_1 & \boldsymbol{a}_2 & \dots & \boldsymbol{a}_n \end{bmatrix} = \begin{bmatrix} \boldsymbol{q}_1 & \boldsymbol{q}_2 & \dots & \boldsymbol{q}_n \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & \dots & r_{1n} \\ 0 & r_{22} & r_{23} & \dots & r_{2n} \\ 0 & 0 & r_{33} & \dots & r_{3n} \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & \dots & 0 & r_{nn} \end{bmatrix} = \boldsymbol{Q}\boldsymbol{R},$$

equivalent to the system

$$\begin{cases} \boldsymbol{a}_1 = r_{11}\boldsymbol{q}_1 \\ \boldsymbol{a}_2 = r_{12}\boldsymbol{q}_1 + r_{22}\boldsymbol{q}_2 \\ \ \ \vdots \\ \boldsymbol{a}_n = r_{1n}\boldsymbol{q}_1 + r_{2n}\boldsymbol{q}_2 + \dots + r_{nn}\boldsymbol{q}_n \end{cases} .$$

The system is easily solved by forward substitution resulting in what is known as the (modified) Gram-Schmidt algorithm, transcribed below both in pseudo-code and in Julia.

Algorithm (Gram-Schmidt)

Given n vectors 𝒂1,…,𝒂n

Initialize 𝒒1=𝒂1,…,𝒒n=𝒂n, 𝑹=𝑰n

for i=1 to n

rii=(𝒒iT𝒒i)1/2

if rii<ϵ break;

𝒒i=𝒒i/rii

for j=i+1 to n

rij=𝒒iT𝒒j; 𝒒j=𝒒j-rij𝒒i

end

end

return 𝑸,𝑹

function mgs(A)
  # Modified Gram-Schmidt QR-factorization: A (m x n) = Q (orthonormal columns) * R (upper triangular)
  m,n=size(A); Q=copy(A); R=zeros(n,n)
  for i=1:n
    R[i,i]=sqrt(Q[:,i]'*Q[:,i])      # norm of column i after removal of previous components
    if (R[i,i]<eps())                # (near-)dependent column encountered: stop
      break
    end
    Q[:,i]=Q[:,i]/R[i,i]             # normalize to unit length
    for j=i+1:n
      R[i,j]=Q[:,i]'*Q[:,j]          # component of remaining columns along q_i
      Q[:,j]=Q[:,j]-R[i,j]*Q[:,i]    # subtract it off (modified Gram-Schmidt update)
    end
  end
  return Q,R
end;

Note that the normalization condition ||𝒒i||=1 is satisfied by two values ±rii, so results from the above implementation might give orthogonal vectors 𝒒1,…,𝒒n of different orientations than those returned by library implementations such as the qr function in Julia's LinearAlgebra package. The implementations provided by computational packages contain many refinements of the basic algorithm, and it is usually preferable to use these in applications.
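As a quick usage check of mgs, the following sketch compares it against the library factorization; the test matrix B and the use of qr, norm, and istriu from LinearAlgebra are illustrative assumptions.

using LinearAlgebra

B = randn(8, 4)                  # random test matrix with (almost surely) independent columns
Q, R = mgs(B)

@show norm(Q * R - B)            # ~0: the factorization reproduces B
@show norm(Q' * Q - I)           # ~0: columns of Q are orthonormal
@show istriu(R)                  # true: R is upper triangular

F = qr(B)                        # library QR; columns may differ from mgs by a sign
@show norm(abs.(Matrix(F.Q)[:, 1:4]) - abs.(Q))   # ~0 once signs are ignored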

By analogy to arithmetic and polynomial algebra, the Gram-Schmidt algorithm furnishes a factorization

𝑸𝑹=𝑨

with $\boldsymbol{Q} \in \mathbb{R}^{m \times n}$ having orthonormal columns and $\boldsymbol{R} \in \mathbb{R}^{n \times n}$ an upper triangular matrix; this is known as the QR-factorization. Since the column vectors of $\boldsymbol{Q}$ were obtained through linear combinations of the column vectors of $\boldsymbol{A}$ (and vice versa, $\boldsymbol{R}$ being invertible for linearly independent columns), we have

$$C(\boldsymbol{A}) = C(\boldsymbol{Q}) .$$

3. QR solution of linear algebra problems

The QR-factorization can be used to solve basic problems within linear algebra.

3.1. Transformation of coordinates

Recall that when given a vector $\boldsymbol{b} \in \mathbb{R}^m$, an implicit basis is assumed, the canonical basis given by the column vectors of the identity matrix $\boldsymbol{I} \in \mathbb{R}^{m \times m}$. The coordinates $\boldsymbol{x}$ in another basis $\boldsymbol{A} \in \mathbb{R}^{m \times m}$ can be found by solving the equation

𝑰𝒃=𝒃=𝑨𝒙,

by an intermediate change of coordinates to the orthogonal basis 𝑸. Since the basis 𝑸 is orthogonal the relation 𝑸T𝑸=𝑰 holds, and changes of coordinates from 𝑰 to 𝑸, 𝑸𝒄=𝒃, are easily computed 𝒄=𝑸T𝒃. Since matrix multiplication is associative

𝒃=𝑨𝒙=(𝑸𝑹)𝒙=𝑸(𝑹𝒙),

the relations 𝑹𝒙=𝑸T𝒃=𝒄 are obtained, stating that 𝒙 also contains the coordinates of 𝒄 in the basis 𝑹. The three steps are:

  1. Compute the QR-factorization, 𝑸𝑹=𝑨;

  2. Find the coordinates of 𝒃 in the orthogonal basis 𝑸, 𝒄=𝑸T𝒃;

  3. Find the coordinates of 𝒙 in basis 𝑹, 𝑹𝒙=𝒄.

Since 𝑹 is upper-triangular,

$$\begin{bmatrix} r_{11} & r_{12} & r_{13} & \dots & r_{1m} \\ 0 & r_{22} & r_{23} & \dots & r_{2m} \\ 0 & 0 & r_{33} & \dots & r_{3m} \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & \dots & 0 & r_{mm} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{m-1} \\ x_m \end{bmatrix} = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_{m-1} \\ c_m \end{bmatrix},$$

the coordinates of 𝒄 in the 𝑹 basis are easily found by back substitution.

Algorithm (Back substitution)

Given 𝑹 upper-triangular, vector 𝒄

for i=m down to 1

if rii<ϵ break;

xi=ci/rii

for j=i-1 down to 1

cj=cj-rjixi

end

end

return 𝒙

function bcks(R,c)
  # Back substitution for the upper-triangular system R*x = c
  m,n=size(R); x=zeros(m,1)
  for i=m:-1:1
    x[i]=c[i]/R[i,i]            # solve for component i
    for j=i-1:-1:1
      c[j]=c[j]-R[j,i]*x[i]     # remove its contribution from the remaining equations
    end
  end
  return x
end;

The above operations are carried out in the background by the backslash operation A\b to solve A*x=b, inspired by the scalar mnemonic $ax = b \Rightarrow x = (1/a)\,b$.
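The three steps can be tied together with the mgs and bcks functions above; in the following sketch the 3×3 matrix $\boldsymbol{A}$ and the vector $\boldsymbol{b}$ are arbitrary illustrative values.

# Coordinates of b in the basis given by the columns of A, via QR-factorization
A = [1.0 2.0 0.0;
     0.0 1.0 1.0;
     1.0 0.0 1.0]
b = [3.0, 2.0, 1.0]

Q, R = mgs(A)          # step 1: A = Q*R
c = Q' * b             # step 2: coordinates of b in the orthogonal basis Q
x = vec(bcks(R, c))    # step 3: solve R*x = c by back substitution

@show A * x - b        # ~0: x are the coordinates of b in the basis A
@show x - A \ b        # ~0: agrees with the built-in backslash solve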

3.2. General orthogonal bases

The above approach for the real vector space $\mathbb{R}^m$ can be used to determine orthogonal bases for any other vector space by appropriate modification of the scalar product. For example, within the space of smooth functions $\mathcal{C}^{\infty}[-1,1]$ that can be differentiated an arbitrary number of times, the Taylor series

$$f(x) = f(0) \cdot 1 + f'(0)\, x + \frac{f''(0)}{2!}\, x^2 + \dots + \frac{f^{(n)}(0)}{n!}\, x^n + \dots$$

is seen to be a linear combination of the monomial basis $\boldsymbol{M} = [\,1\ \ x\ \ x^2\ \dots\,]$ with scaling coefficients $\{f(0),\, f'(0),\, \tfrac{1}{2}f''(0),\, \dots\}$. The scalar product

$$(f, g) = \int_{-1}^{1} f(x)\, g(x)\, dx$$

can be seen as the extension to the $[-1,1]$ continuum of the vector dot product. Orthogonalization of the monomial basis with respect to the above scalar product leads to the definition of another family of polynomials, known as the Legendre polynomials

$$Q_0(x) = \left(\tfrac{1}{2}\right)^{1/2}, \quad Q_1(x) = \left(\tfrac{3}{2}\right)^{1/2} x, \quad Q_2(x) = \left(\tfrac{5}{8}\right)^{1/2} (3x^2 - 1), \quad Q_3(x) = \left(\tfrac{7}{8}\right)^{1/2} (5x^3 - 3x), \ \dots$$

The Legendre polynomials are usually given with a different scaling such that $P_k(1) = 1$, rather than the unit norm condition $\|Q_k\| = (Q_k, Q_k)^{1/2} = 1$. The above results can be recovered by sampling the interval $[-1,1]$ at points $x_i = (i-1)h - 1$, $h = 2/(m-1)$, $i = 1, \dots, m$, and approximating the integral by a Riemann sum

$$\int_{-1}^{1} f(x)\, L_j(x)\, dx \approx h \sum_{i=1}^{m} f(x_i)\, L_j(x_i) = h\, \boldsymbol{f}^T \boldsymbol{L}_j .$$

Figure 2. Comparison of the monomial basis (left) to the Legendre polynomial basis (right). The “resolution” of $P_3(x)$ can be interpreted as the number of crossings of the $y=0$ ordinate axis, and is greater than that of the corresponding monomial $x^3$.
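The sampling procedure can be sketched by reusing the mgs function on sampled monomials; the number of samples m=200, the cubic monomial basis, and the comparison against the normalized $Q_2$ are illustrative choices.

# Discrete recovery of (normalized) Legendre polynomials by orthogonalizing sampled monomials
m = 200; h = 2/(m-1)
x = [-1 + (i-1)*h for i in 1:m]       # sample points on [-1,1]
M = [x.^0 x.^1 x.^2 x.^3]             # sampled monomial basis 1, x, x^2, x^3
Q, R = mgs(M)                         # orthonormal in the Euclidean dot product
L = Q / sqrt(h)                       # rescale: orthonormal in the scalar product (f,g) ≈ h*f'*g

Q2 = sqrt(5/8) .* (3 .* x.^2 .- 1)    # normalized Legendre polynomial Q_2 at the sample points
err = min(maximum(abs.(L[:,3] .- Q2)), maximum(abs.(L[:,3] .+ Q2)))
@show err                             # small; the computed column may differ by a sign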

3.3. Least squares

The approach to compressing data $\mathcal{D} = \{(x_i, y_i)\,|\, i = 1, \dots, m\}$ suggested by calculus concepts is to form the sum of squared differences between $y(x_i)$ and $y_i$, for example for $y(x) = a_0 + a_1 x$ when carrying out linear regression,

$$S(a_0, a_1) = \sum_{i=1}^{m} (y(x_i) - y_i)^2 = \sum_{i=1}^{m} (a_0 + a_1 x_i - y_i)^2$$

and seek $(a_0, a_1)$ that minimize $S(a_0, a_1)$. The function $S(a_0, a_1) \ge 0$ can be thought of as the height of a surface above the $a_0 a_1$ plane, and the gradient $\nabla S$ is defined as a vector in the direction of steepest slope. At some point on the surface, if the gradient is different from the zero vector, $\nabla S \ne \boldsymbol{0}$, travel in the direction of the gradient would increase the height, and travel in the opposite direction would decrease it. The minimal value of $S$ is attained when no local travel can decrease the function value, which is known as the stationarity condition, stated as $\nabla S = \boldsymbol{0}$. Applying this to determining the coefficients $(a_0, a_1)$ of a linear regression leads to the equations

$$\frac{\partial S}{\partial a_0} = 0 \ \Rightarrow\ 2 \sum_{i=1}^{m} (a_0 + a_1 x_i - y_i) = 0 \ \Rightarrow\ m\, a_0 + \left(\sum_{i=1}^{m} x_i\right) a_1 = \sum_{i=1}^{m} y_i,$$
$$\frac{\partial S}{\partial a_1} = 0 \ \Rightarrow\ 2 \sum_{i=1}^{m} (a_0 + a_1 x_i - y_i)\, x_i = 0 \ \Rightarrow\ \left(\sum_{i=1}^{m} x_i\right) a_0 + \left(\sum_{i=1}^{m} x_i^2\right) a_1 = \sum_{i=1}^{m} x_i y_i.$$
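These two equations form a 2×2 linear system for $(a_0, a_1)$; the sketch below solves it directly for some made-up sample data (the x, y values are purely illustrative).

# Direct solution of the 2x2 stationarity system for (a0, a1) using sample data
x = [0.0, 0.5, 1.0, 1.5, 2.0]
y = [1.1, 2.0, 2.9, 4.1, 5.0]
m = length(x)

Nsum = [m       sum(x);
        sum(x)  sum(x.^2)]
bsum = [sum(y), sum(x .* y)]
a0, a1 = Nsum \ bsum
@show a0 a1        # intercept and slope of the fitted line y = a0 + a1*x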

The above calculations can become tedious, and do not illuminate the geometrical essence of the calculation, which can be brought out by a reformulation in terms of a matrix-vector product that highlights the particular linear combination sought in a linear regression. Form a vector of errors with components $e_i = y(x_i) - y_i$, where for linear regression $y(x) = a_0 + a_1 x$. Recognize that $y(x_i)$ is a linear combination of $1$ and $x_i$ with coefficients $a_0, a_1$, or in vector form

$$\boldsymbol{e} = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix} - \boldsymbol{y} = \begin{bmatrix} \boldsymbol{1} & \boldsymbol{x} \end{bmatrix} \boldsymbol{a} - \boldsymbol{y} = \boldsymbol{A}\boldsymbol{a} - \boldsymbol{y} .$$

The norm of the error vector $\|\boldsymbol{e}\|$ is smallest when $\boldsymbol{A}\boldsymbol{a}$ is as close as possible to $\boldsymbol{y}$. Since $\boldsymbol{A}\boldsymbol{a}$ is within the column space of $\boldsymbol{A}$, $\boldsymbol{A}\boldsymbol{a} \in C(\boldsymbol{A})$, the required condition is for $\boldsymbol{e}$ to be orthogonal to the column space,

$$\boldsymbol{e} \perp C(\boldsymbol{A}) \ \Rightarrow\ \boldsymbol{A}^T \boldsymbol{e} = \begin{bmatrix} \boldsymbol{1}^T \\ \boldsymbol{x}^T \end{bmatrix} \boldsymbol{e} = \begin{bmatrix} \boldsymbol{1}^T \boldsymbol{e} \\ \boldsymbol{x}^T \boldsymbol{e} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} = \boldsymbol{0},$$
$$\boldsymbol{A}^T \boldsymbol{e} = \boldsymbol{0} \ \Rightarrow\ \boldsymbol{A}^T (\boldsymbol{A}\boldsymbol{a} - \boldsymbol{y}) = \boldsymbol{0} \ \Rightarrow\ (\boldsymbol{A}^T \boldsymbol{A})\, \boldsymbol{a} = \boldsymbol{A}^T \boldsymbol{y} = \boldsymbol{b} .$$

The above is known as the normal system, with $\boldsymbol{N} = \boldsymbol{A}^T\boldsymbol{A}$ the normal matrix. The system $\boldsymbol{N}\boldsymbol{a} = \boldsymbol{b}$ can be interpreted as seeking the coordinates of the vector $\boldsymbol{b} = \boldsymbol{A}^T\boldsymbol{y}$ in the $\boldsymbol{N} = \boldsymbol{A}^T\boldsymbol{A}$ basis. An example can be constructed by randomly perturbing a known function $y(x) = a_0 + a_1 x$ to simulate measurement noise, and comparing the approximate $\tilde{\boldsymbol{a}}$ obtained by solving the normal system with the exact $\boldsymbol{a}$.

  1. Generate some data on a line and perturb it by some random quantities

    m=100; x=LinRange(0,1,m); a=[2; 3];
    a0=a[1]; a1=a[2]; yex=a0 .+ a1*x; y=yex+(rand(m) .- 0.5);

  2. Form the matrices 𝑨, 𝑵=𝑨T𝑨, vector 𝒃=𝑨T𝒚

    A=ones(m,2); A[:,2]=x; N=A'*A; b=A'*y;

  3. Solve the system $\boldsymbol{N}\tilde{\boldsymbol{a}} = \boldsymbol{b}$, and form the linear combination $\tilde{\boldsymbol{y}} = \boldsymbol{A}\tilde{\boldsymbol{a}}$ closest to $\boldsymbol{y}$

    atilde=N\b; [a atilde]

$$\begin{bmatrix} 2.0 & 1.9215699010834906 \\ 3.0 & 3.03714411616737 \end{bmatrix} \qquad (6)$$

The normal matrix $\boldsymbol{N} = \boldsymbol{A}^T\boldsymbol{A}$ can however be an ill-advised choice of basis. Consider $\boldsymbol{A} \in \mathbb{R}^{2 \times 2}$ given by

$$\boldsymbol{A} = \begin{bmatrix} \boldsymbol{a}_1 & \boldsymbol{a}_2 \end{bmatrix} = \begin{bmatrix} 1 & \cos\theta \\ 0 & \sin\theta \end{bmatrix},$$

where the first column vector is taken from the identity matrix, $\boldsymbol{a}_1 = \boldsymbol{e}_1$, and the second is the one obtained by rotating it by angle $\theta$. If $\theta = \pi/2$, the normal matrix is orthogonal, $\boldsymbol{A}^T\boldsymbol{A} = \boldsymbol{I}$, but for small $\theta$, $\boldsymbol{A}$ and $\boldsymbol{N} = \boldsymbol{A}^T\boldsymbol{A}$ are approximated by

$$\boldsymbol{A} \approx \begin{bmatrix} 1 & 1 \\ 0 & \theta \end{bmatrix}, \qquad \boldsymbol{N} = \begin{bmatrix} \boldsymbol{n}_1 & \boldsymbol{n}_2 \end{bmatrix} \approx \begin{bmatrix} 1 & 1 \\ 1 & 1 + \theta^2 \end{bmatrix}.$$

When $\theta$ is small $\boldsymbol{a}_1, \boldsymbol{a}_2$ are almost collinear (they differ by terms of order $\theta$), and $\boldsymbol{n}_1, \boldsymbol{n}_2$ even more so (they differ by terms of order $\theta^2$). This can lead to amplification of small errors, but can be avoided by recognizing that the best approximation in the 2-norm is identical to the Euclidean concept of orthogonal projection. The orthogonal projector onto $C(\boldsymbol{A})$ is readily found by QR-factorization, and the steps to solve least squares become

  1. Compute 𝑸𝑹=𝑨

  2. The projection of 𝒚 onto the column space of 𝑨 is 𝒛=𝑸𝑸T𝒚, and has coordinates 𝒄=𝑸T𝒚 in the orthogonal basis 𝑸.

  3. The same 𝒛 can also be obtained by a linear combination of the columns of 𝑨, 𝒛=𝑨𝒂=𝑸𝑸T𝒚, and replacing 𝑨 with its QR-factorization gives 𝑸𝑹𝒂=𝑸𝒄, which leads to the system 𝑹𝒂=𝒄, solved by back substitution.

Q,R=mgs(A); c=Q'*y;
aQR=R\c; [a atilde aQR]

$$\begin{bmatrix} 2.0 & 1.9215699010834906 & 1.9065620791027633 \\ 3.0 & 3.03714411616737 & 3.0503545613518166 \end{bmatrix} \qquad (7)$$

The above procedure carries over to approximation by higher-degree polynomials.

m=100; n=6; x=LinRange(0,1,m); a=rand(-10:10,n,1); A=ones(m,1);
for j=1:n-1
  global A
  A = [A x.^j];
end
yex=A*a; y=yex .+ 0.001*(rand(m,1) .- 0.5); N=A'*A;
b=A'*y;
atilde=N\b; Q,R=mgs(A); c=Q'*y;
aQR=R\c; [a atilde aQR]

$$\begin{bmatrix} 10.0 & 10.000017230385858 & 10.000017230388655 \\ -1.0 & -0.9992383469668031 & -0.9992383470406295 \\ 5.0 & 4.989621421635248 & 4.989621422094993 \\ -1.0 & -0.9668390169144284 & -0.9668390180268995 \\ -3.0 & -3.0392026170942916 & -3.0392026159405616 \\ -7.0 & -6.984207346901643 & -6.984207347332273 \end{bmatrix} \qquad (8)$$
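The advantage of the QR route can be quantified by comparing condition numbers of the matrices formed above; a minimal sketch, assuming the cond function from Julia's LinearAlgebra package:

using LinearAlgebra

# The normal matrix squares the condition number, cond(N) = cond(A)^2 in the 2-norm,
# so data errors are amplified far more when solving N*a = b than R*a = c.
@show cond(A)      # condition number of the monomial-basis matrix formed above
@show cond(N)      # = cond(A)^2, much larger
@show cond(R)      # comparable to cond(A), since A = Q*R with orthonormal Q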

The least squares problem can thus be summarized as: given data $\boldsymbol{b}$, form $\boldsymbol{A}$ and find $\boldsymbol{x}$ such that $\|\boldsymbol{e}\| = \|\boldsymbol{A}\boldsymbol{x} - \boldsymbol{b}\|$ is minimized, where $\boldsymbol{e} = \boldsymbol{b} - \boldsymbol{A}\boldsymbol{x}$ is the error vector.