Least Squares Approximation

Synopsis. Having established the key theoretical results, the main linear algebra problems can now be solved. The typical scenario in data science applications is that a high-dimensional vector representation of some object is available, and greater insight is sought through a description as a linear combination of a small number of vectors. The high-dimensional object might not be exactly recovered, so the focus is on obtaining the best possible approximation. In Euclidean spaces with distances measured by the 2-norm, best approximants are readily found by generalizing the Pythagorean theorem to high-dimensional spaces.

1. Orthogonal projection

Consider a partition of a vector space $\mathcal{U}$ into orthogonal subspaces $\mathcal{U} = \mathcal{V} \oplus \mathcal{W}$, $\mathcal{V} = \mathcal{W}^{\perp}$, $\mathcal{W} = \mathcal{V}^{\perp}$, typically $\mathcal{U} = \mathbb{R}^m$, $\mathcal{V} \subset \mathbb{R}^m$, $\mathcal{W} \subset \mathbb{R}^m$, $\dim \mathcal{V} = n$, $\dim \mathcal{W} = m - n$. If $\bm{V} = [\, \bm{v}_1 \ \dots \ \bm{v}_n \,] \in \mathbb{R}^{m \times n}$ is a basis for $\mathcal{V}$ and $\bm{W} = [\, \bm{w}_1 \ \dots \ \bm{w}_{m-n} \,] \in \mathbb{R}^{m \times (m-n)}$ is a basis for $\mathcal{W}$, then $\bm{U} = [\, \bm{v}_1 \ \dots \ \bm{v}_n \ \bm{w}_1 \ \dots \ \bm{w}_{m-n} \,]$ is a basis for $\mathcal{U}$. Even though the matrices $\bm{V}, \bm{W}$ are not necessarily square, they are said to be orthonormal when all columns are of unit norm and orthogonal to one another. In this case computation of the matrix product $\bm{V}^T \bm{V}$ leads to the formation of the identity matrix within $\mathbb{R}^{n \times n}$,

$$\bm{V}^T \bm{V} = \begin{bmatrix} \bm{v}_1^T \\ \bm{v}_2^T \\ \vdots \\ \bm{v}_n^T \end{bmatrix} \begin{bmatrix} \bm{v}_1 & \bm{v}_2 & \dots & \bm{v}_n \end{bmatrix} = \begin{bmatrix} \bm{v}_1^T \bm{v}_1 & \bm{v}_1^T \bm{v}_2 & \dots & \bm{v}_1^T \bm{v}_n \\ \bm{v}_2^T \bm{v}_1 & \bm{v}_2^T \bm{v}_2 & \dots & \bm{v}_2^T \bm{v}_n \\ \vdots & \vdots & \ddots & \vdots \\ \bm{v}_n^T \bm{v}_1 & \bm{v}_n^T \bm{v}_2 & \dots & \bm{v}_n^T \bm{v}_n \end{bmatrix} = \bm{I}_n.$$

Similarly, $\bm{W}^T \bm{W} = \bm{I}_{m-n}$. Whereas for the square orthogonal matrix $\bm{U}$ multiplication both on the left and on the right by its transpose leads to the formation of the identity matrix

$$\bm{U}^T \bm{U} = \bm{U} \bm{U}^T = \bm{I}_m,$$

the same operations applied to rectangular orthogonal matrices lead to different results

$$\bm{V}^T \bm{V} = \bm{I}_n, \quad \bm{V} \bm{V}^T = \begin{bmatrix} \bm{v}_1 & \bm{v}_2 & \dots & \bm{v}_n \end{bmatrix} \begin{bmatrix} \bm{v}_1^T \\ \bm{v}_2^T \\ \vdots \\ \bm{v}_n^T \end{bmatrix} = \sum_{i=1}^{n} \bm{v}_i \bm{v}_i^T, \quad \operatorname{rank}(\bm{v}_i \bm{v}_i^T) = 1.$$

A simple example is provided by taking $\bm{V} = \bm{I}_{m,n}$, the first $n$ columns of the identity matrix, in which case

$$\bm{V} \bm{V}^T = \sum_{i=1}^{n} \bm{e}_i \bm{e}_i^T = \begin{bmatrix} \bm{I}_n & \bm{0} \\ \bm{0} & \bm{0} \end{bmatrix} \in \mathbb{R}^{m \times m}.$$

Applying $\bm{P} = \bm{V} \bm{V}^T$ to some vector $\bm{b} \in \mathbb{R}^m$ leads to a vector $\bm{r} = \bm{P} \bm{b}$ whose first $n$ components are those of $\bm{b}$, and the remaining $m-n$ are zero. The subtraction $\bm{b} - \bm{r}$ leads to a new vector $\bm{s} = (\bm{I} - \bm{P}) \bm{b}$ whose first $n$ components are zero, and whose remaining $m-n$ components are the same as those of $\bm{b}$. Such operations are referred to as projections, and for $\bm{V} = \bm{I}_{m,n}$ correspond to projection onto $\operatorname{span}\{\bm{e}_1, \dots, \bm{e}_n\}$.
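As a concrete illustration, these projections can be checked numerically in Julia (a minimal sketch; the sizes $m, n$ and the vector $\bm{b}$ below are arbitrary choices, not taken from the text):

using LinearAlgebra

m, n = 5, 2
V = Matrix{Float64}(I, m, n)   # V = I_{m,n}, the first n columns of the identity
P = V*V'                       # projection matrix onto span{e_1,...,e_n}
b = [1.0, 2.0, 3.0, 4.0, 5.0]
r = P*b                        # [1, 2, 0, 0, 0]: the first n components of b are kept
s = (I - P)*b                  # [0, 0, 3, 4, 5]: the last m-n components of b are kept
println(r + s == b, " ", r'*s) # b is recovered exactly and the two parts are orthogonal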

Returning to the general case, the orthogonal matrices $\bm{U} \in \mathbb{R}^{m \times m}$, $\bm{V} \in \mathbb{R}^{m \times n}$, $\bm{W} \in \mathbb{R}^{m \times (m-n)}$ are associated with linear mappings $\bm{b} = \bm{f}(\bm{x}) = \bm{U} \bm{x}$, $\bm{r} = \bm{g}(\bm{b}) = \bm{P} \bm{b}$, $\bm{s} = \bm{h}(\bm{b}) = (\bm{I} - \bm{P}) \bm{b}$. The mapping $\bm{f}$ gives the components in the $\bm{I}$ basis of a vector whose components in the $\bm{U}$ basis are $\bm{x}$. The mappings $\bm{g}, \bm{h}$ project a vector onto $\operatorname{span}\{\bm{v}_1, \dots, \bm{v}_n\}$ and $\operatorname{span}\{\bm{w}_1, \dots, \bm{w}_{m-n}\}$, respectively. When $\bm{V}, \bm{W}$ are orthogonal matrices the projections are also orthogonal, $\bm{r} \perp \bm{s}$. Projection can also be carried out onto nonorthogonal spanning sets, but the process is fraught with possible error, especially when the angle between basis vectors is small, and will be avoided henceforth. Notice that projection of a vector already in the spanning set simply returns the same vector, which leads to a general definition.

Definition. The mapping $\bm{f}: \mathcal{U} \to \mathcal{U}$ is called a projection if $\bm{f} \circ \bm{f} = \bm{f}$, or if for any $\bm{u} \in \mathcal{U}$, $\bm{f}(\bm{f}(\bm{u})) = \bm{f}(\bm{u})$. With $\bm{P}$ the matrix associated with $\bm{f}$, a projection matrix satisfies $\bm{P}^2 = \bm{P}$.

Orthogonal projections onto the column space $C(\bm{Q})$ of an orthonormal matrix $\bm{Q}$ are of great practical utility, and satisfy the above definition:

$$\bm{P}_{\bm{Q}} = \bm{Q} \bm{Q}^T, \qquad \bm{P}_{\bm{Q}}^2 = \bm{P}_{\bm{Q}} \bm{P}_{\bm{Q}} = \bm{Q} \bm{Q}^T \bm{Q} \bm{Q}^T = \bm{Q} (\bm{Q}^T \bm{Q}) \bm{Q}^T = \bm{Q} \bm{I} \bm{Q}^T = \bm{Q} \bm{Q}^T = \bm{P}_{\bm{Q}}.$$
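A quick numerical check of the idempotency property (a sketch; the random test matrix and the use of the built-in qr to produce an orthonormal $\bm{Q}$ are choices made here for illustration):

using LinearAlgebra

A = rand(6, 3)            # arbitrary matrix with (almost surely) independent columns
Q = Matrix(qr(A).Q)       # 6×3 orthonormal basis for C(A)
P = Q*Q'                  # orthogonal projection onto C(A)
println(norm(P*P - P))    # ≈ 0, confirming P² = P
println(norm(Q'*Q - I))   # ≈ 0, confirming QᵀQ = I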

2. Gram-Schmidt algorithm

Orthonormal vector sets $\{\bm{q}_1, \dots, \bm{q}_n\}$ are of the greatest practical utility, leading to the question of whether such a set can be obtained from an arbitrary set of vectors $\{\bm{a}_1, \dots, \bm{a}_n\}$. This is possible for linearly independent vectors, through what is known as the Gram-Schmidt algorithm:

  1. Start with an arbitrary direction $\bm{a}_1$

  2. Divide by its norm to obtain a unit-norm vector $\bm{q}_1 = \bm{a}_1 / \|\bm{a}_1\|$

  3. Choose another direction $\bm{a}_2$

  4. Subtract off its component along the previous direction(s): $\bm{a}_2 - (\bm{q}_1^T \bm{a}_2) \bm{q}_1$

  5. Divide by the norm: $\bm{q}_2 = (\bm{a}_2 - (\bm{q}_1^T \bm{a}_2) \bm{q}_1) / \|\bm{a}_2 - (\bm{q}_1^T \bm{a}_2) \bm{q}_1\|$

  6. Repeat the above

The component subtracted in step 4 is recognized as the orthogonal projection of $\bm{a}_2$ onto the direction $\bm{q}_1$,

$$\bm{P}_1 \bm{a}_2 = (\bm{q}_1 \bm{q}_1^T) \bm{a}_2 = \bm{q}_1 (\bm{q}_1^T \bm{a}_2) = (\bm{q}_1^T \bm{a}_2) \bm{q}_1.$$

The above geometrical description can be expressed in terms of matrix operations as

$$\bm{A} = \begin{bmatrix} \bm{a}_1 & \bm{a}_2 & \dots & \bm{a}_n \end{bmatrix} = \begin{bmatrix} \bm{q}_1 & \bm{q}_2 & \dots & \bm{q}_n \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & \dots & r_{1n} \\ 0 & r_{22} & r_{23} & \dots & r_{2n} \\ 0 & 0 & r_{33} & \dots & r_{3n} \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & \dots & 0 & r_{nn} \end{bmatrix} = \bm{Q} \bm{R},$$

equivalent to the system

$$\begin{cases} \bm{a}_1 = r_{11} \bm{q}_1 \\ \bm{a}_2 = r_{12} \bm{q}_1 + r_{22} \bm{q}_2 \\ \;\;\vdots \\ \bm{a}_n = r_{1n} \bm{q}_1 + r_{2n} \bm{q}_2 + \dots + r_{nn} \bm{q}_n. \end{cases}$$

The system is easily solved by forward substitution resulting in what is known as the (modified) Gram-Schmidt algorithm, transcribed below both in pseudo-code and in Julia.

Algorithm (Gram-Schmidt)

Given $n$ vectors $\bm{a}_1, \dots, \bm{a}_n$

Initialize $\bm{q}_1 = \bm{a}_1, \dots, \bm{q}_n = \bm{a}_n$, $\bm{R} = \bm{I}_n$

for $i = 1$ to $n$

  $r_{ii} = (\bm{q}_i^T \bm{q}_i)^{1/2}$

  if $r_{ii} < \epsilon$ break

  $\bm{q}_i = \bm{q}_i / r_{ii}$

  for $j = i+1$ to $n$

    $r_{ij} = \bm{q}_i^T \bm{q}_j$;  $\bm{q}_j = \bm{q}_j - r_{ij} \bm{q}_i$

  end

end

return $\bm{Q}, \bm{R}$

using LinearAlgebra

function mgs(A)
  m,n = size(A); Q = float(copy(A)); R = Matrix{Float64}(I,n,n)
  for i=1:n
    R[i,i] = sqrt(Q[:,i]'*Q[:,i])       # norm of current (already orthogonalized) column
    if R[i,i] < eps()                   # (near-)zero norm signals a dependent column
      break
    end
    Q[:,i] = Q[:,i]/R[i,i]              # normalize
    for j=i+1:n
      R[i,j] = Q[:,i]'*Q[:,j]           # component of column j along q_i
      Q[:,j] = Q[:,j] - R[i,j]*Q[:,i]   # subtract it off
    end
  end
  return Q,R
end

The input matrix $\bm{A}$ might have linearly dependent columns, in which case $r_{ii} \cong 0$ for some $i$, and the if-instruction interrupts the algorithm. The Gram-Schmidt algorithm furnishes a factorization

$$\bm{Q} \bm{R} = \bm{A}$$

with $\bm{Q} \in \mathbb{R}^{m \times n}$ an orthonormal matrix and $\bm{R} \in \mathbb{R}^{n \times n}$ an upper triangular matrix, known as the $QR$-factorization. Since the column vectors within $\bm{Q}$ were obtained through linear combinations of the column vectors of $\bm{A}$, $C(\bm{A}) = C(\bm{Q})$. The $QR$-factorization is of great utility in solving problems within linear algebra.
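As a usage sketch of the mgs function above (the random test matrix is an arbitrary choice), the returned factors can be checked against the defining properties of the $QR$-factorization:

using LinearAlgebra

A = rand(8, 4)           # test matrix with (almost surely) independent columns
Q, R = mgs(A)            # Gram-Schmidt factorization defined above
println(norm(Q*R - A))   # ≈ 0: A is recovered from the factors
println(norm(Q'*Q - I))  # ≈ 0: the columns of Q are orthonormal
println(istriu(R))       # true: R is upper triangular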

3. Least squares problems in $\mathbb{R}^m$

3.1. Problem formulation and solution by orthogonal projection

A typical situation in applications is that a vector $\bm{y} \in \mathbb{R}^m$ represents a complex object with $m \gg 1$. A simpler representation of the object is sought through a linear combination $\bm{v} = \bm{A} \bm{c}$, with $\bm{A} \in \mathbb{R}^{m \times n}$ and $n < m$, usually $m \gg n$ (Fig. 1).

Figure 1. Least squares problem: find $\bm{v} \in C(\bm{A})$, $\bm{A} \in \mathbb{R}^{m \times n}$, closest to some given $\bm{y}$ in the 2-norm.

The magnitude of the difference between the exact object $\bm{y}$ and the approximation $\bm{A} \bm{c}$ is measured through a norm, and in least squares the 2-norm is adopted. The problem is to minimize the error $\varepsilon = \|\bm{e}\| = \|\bm{y} - \bm{A} \bm{c}\| = (\bm{e}^T \bm{e})^{1/2}$. This is stated mathematically as

$$\min_{\bm{c} \in \mathbb{R}^n} \|\bm{y} - \bm{A} \bm{c}\|,$$

and within $\mathbb{R}^m$ the minimal (2-norm) distance is obtained when $\bm{y} - \bm{v}$ is orthogonal to $C(\bm{A})$. Note that if another type of norm were adopted, the orthogonality condition would no longer necessarily hold. For the 2-norm however, the orthogonality condition leads to a straightforward solution of the problem through orthogonal projection:

  1. Find an orthonormal basis for the column space of $\bm{A}$ by $QR$-factorization, $\bm{Q} \bm{R} = \bm{A}$.

  2. State that $\bm{v}$ is the projection of $\bm{y}$: $\bm{v} = \bm{P}_{C(\bm{A})} \bm{y} = \bm{P}_{\bm{Q}} \bm{y} = \bm{Q} \bm{Q}^T \bm{y}$.

  3. State that $\bm{v}$ is within the column space of $\bm{A}$: $\bm{v} = \bm{A} \bm{c} = \bm{Q} \bm{R} \bm{c}$.

  4. Set equal the two expressions of $\bm{v}$: $\bm{Q} \bm{Q}^T \bm{y} = \bm{Q} \bm{R} \bm{c}$. This is an equality between two linear combinations of the columns of $\bm{Q}$. For $\bm{Q}$ orthonormal the scaling coefficients of the two linear combinations must be equal, leading to $\bm{R} \bm{c} = \bm{Q}^T \bm{y}$.

  5. Solve the triangular system to find $\bm{c}$ (see the sketch following this list).
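These steps can be packaged into a small function, sketched below. The name lsq_qr is a hypothetical choice, and the built-in qr is used here in place of the mgs routine above; the data in the usage line is illustrative only.

using LinearAlgebra

# Least squares solution of min ||y - A*c|| by orthogonal projection onto C(A)
function lsq_qr(A, y)
    F = qr(A)               # step 1: orthonormal basis for C(A)
    Q = Matrix(F.Q)         # thin factor, size m×n
    R = F.R                 # n×n upper triangular factor
    return R \ (Q'*y)       # steps 2-5: solve R*c = Qᵀ*y by back substitution
end

# usage: fit a line through noisy samples of y = 1 + 2x
x = 0:0.1:1
c = lsq_qr([ones(length(x)) x], 1 .+ 2 .* x .+ 0.01*randn(length(x)))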

3.2. Linear regression

The approach to compressing data $D = \{(x_i, y_i) \mid i = 1, \dots, m\}$ suggested by calculus concepts is to form the sum of squared differences between $y(x_i)$ and $y_i$, for example for $y(x) = c_0 + c_1 x$ when carrying out linear regression,

$$S(c_0, c_1) = \sum_{i=1}^{m} (y(x_i) - y_i)^2 = \sum_{i=1}^{m} (c_0 + c_1 x_i - y_i)^2,$$

and seek $(c_0, c_1)$ that minimize $S(c_0, c_1)$. The function $S(c_0, c_1) \geq 0$ can be thought of as the height of a surface above the $c_0 c_1$ plane, and the gradient $\nabla S$ is defined as a vector in the direction of steepest slope. If at some point on the surface the gradient is different from the zero vector, $\nabla S \neq \bm{0}$, travel in the direction of the gradient would increase the height, and travel in the opposite direction would decrease it. The minimal value of $S$ is attained when no local travel can decrease the function value, which is known as the stationarity condition, stated as $\nabla S = \bm{0}$. Applying this to determining the coefficients $(c_0, c_1)$ of a linear regression leads to the equations

$$\frac{\partial S}{\partial c_0} = 0 \;\Rightarrow\; 2 \sum_{i=1}^{m} (c_0 + c_1 x_i - y_i) = 0 \;\Rightarrow\; m c_0 + \left( \sum_{i=1}^{m} x_i \right) c_1 = \sum_{i=1}^{m} y_i,$$
$$\frac{\partial S}{\partial c_1} = 0 \;\Rightarrow\; 2 \sum_{i=1}^{m} (c_0 + c_1 x_i - y_i) x_i = 0 \;\Rightarrow\; \left( \sum_{i=1}^{m} x_i \right) c_0 + \left( \sum_{i=1}^{m} x_i^2 \right) c_1 = \sum_{i=1}^{m} x_i y_i.$$
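Before moving to the matrix reformulation, these two equations can be checked numerically; the small, noise-free dataset below is illustrative only, not the one used later in the text.

# assemble and solve the 2×2 stationarity system for exact data y = 2 + 3x
x = [0.0, 0.25, 0.5, 0.75, 1.0]
y = 2 .+ 3 .* x
m = length(x)
Sx = sum(x); Sxx = sum(x.^2); Sy = sum(y); Sxy = sum(x.*y)
c = [m Sx; Sx Sxx] \ [Sy, Sxy]   # recovers the coefficients [2.0, 3.0]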

The above calculations can become tedious, and do not illuminate the geometrical essence of the calculation, which can be brought out by a reformulation in terms of a matrix-vector product that highlights the particular linear combination sought in a linear regression. Form a vector of errors with components $e_i = y(x_i) - y_i$, where for linear regression $y(x) = c_0 + c_1 x$. Recognize that $y(x_i)$ is a linear combination of $1$ and $x_i$ with coefficients $c_0, c_1$, or in vector form

$$\bm{e} = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \end{bmatrix} - \bm{y} = \begin{bmatrix} \bm{1} & \bm{x} \end{bmatrix} \bm{c} - \bm{y} = \bm{A} \bm{c} - \bm{y}.$$

The norm of the error vector $\|\bm{e}\|$ is smallest when $\bm{A} \bm{c}$ is as close as possible to $\bm{y}$. Since $\bm{A} \bm{c}$ lies within the column space, $\bm{A} \bm{c} \in C(\bm{A})$, the required condition is for $\bm{e}$ to be orthogonal to the column space,

$$\bm{e} \perp C(\bm{A}) \;\Leftrightarrow\; \bm{A}^T \bm{e} = \begin{bmatrix} \bm{1}^T \\ \bm{x}^T \end{bmatrix} \bm{e} = \begin{bmatrix} \bm{1}^T \bm{e} \\ \bm{x}^T \bm{e} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} = \bm{0},$$
$$\bm{A}^T \bm{e} = \bm{0} \;\Rightarrow\; \bm{A}^T (\bm{A} \bm{c} - \bm{y}) = \bm{0} \;\Rightarrow\; (\bm{A}^T \bm{A}) \bm{c} = \bm{A}^T \bm{y} = \bm{b} \;\Rightarrow\; \bm{N} \bm{c} = \bm{b}.$$

The above is known as the normal system, where $\bm{N} = \bm{A}^T \bm{A}$ is the normal matrix. The system $\bm{N} \bm{c} = \bm{b}$ can be interpreted as seeking the coordinates of the vector $\bm{b} = \bm{A}^T \bm{y}$ in the basis given by the columns of $\bm{N} = \bm{A}^T \bm{A}$. An example can be constructed by randomly perturbing a known function $y(x) = c_0 + c_1 x$ to simulate measurement noise, and comparing the coefficients $\bm{c}$ obtained by solving the normal system with the exact values.

m=100; x=(0:m-1)./m; c0=2; c1=3; yex=c0.+c1*x;   # sample the exact line y = c0 + c1*x on [0,1)
y=(yex.+rand(m,1).-0.5);                          # perturb by uniform noise in [-0.5,0.5)
A=ones(m,2); A[:,2]=x[:]; At=transpose(A); N=At*A; b=At*y;   # normal matrix N and right-hand side b
c = N\b

$$\bm{c} = \begin{bmatrix} 1.9524719146017488 \\ 3.114552636517776 \end{bmatrix} \qquad (1)$$

3.3. Least squares polynomial approximations of data

Forming the normal system of equations can lead to numerical difficulties, especially when the columns of $\bm{A}$ are close to linear dependence. It is preferable to adopt the general procedure of solving a least squares problem by projection, in which case the above linear regression becomes:

$$\bm{Q} \bm{R} = \bm{A}, \qquad \bm{R} \bm{c} = \bm{Q}^T \bm{y}.$$
using LinearAlgebra                    # provides qr
QR=qr(A); Q=Matrix(QR.Q); R=QR.R;      # thin factors: Q is m×2 orthonormal, R is 2×2 upper triangular
c = R\(transpose(Q)*y)

$$\bm{c} = \begin{bmatrix} 1.9524719146017504 \\ 3.114552636517771 \end{bmatrix} \qquad (2)$$

The above procedure can easily be extended to define quadratic or cubic regression, the problem of finding the best polynomial of degree 2 or 3 that fits the data. Quadratic regression is accomplished simply by adding a column to $\bm{A}$ containing the squares of the $\bm{x}$ vector,

$$\bm{A} = \begin{bmatrix} \bm{a}_1 & \bm{a}_2 & \bm{a}_3 \end{bmatrix} = \begin{bmatrix} \bm{1} & \bm{x} & \bm{x}^2 \end{bmatrix},$$

where the column vector $\bm{a}_k = \bm{x}^{k-1} \in \mathbb{R}^m$ has components $x_i^{k-1}$ for $i = 1, 2, \dots, m$.

m=100; x=(0:m-1)./m; c0=2; c1=3; c2=-5; yex=c0.+c1*x.+c2*x.^2;   # sample the exact quadratic
y=(yex.+rand(m,1).-0.5);                                          # perturb by uniform noise
A=ones(m,3); A[:,2]=x[:]; A[:,3]=x[:].^2;                         # regression matrix [1 x x.^2]
QR=qr(A); Q=Matrix(QR.Q); R=QR.R;                                 # thin QR factors
c = R\(transpose(Q)*y)

$$\bm{c} = \begin{bmatrix} 1.968925863819198 \\ 2.9292845098793223 \\ -4.836797382728068 \end{bmatrix} \qquad (3)$$