Lecture 8: Least Squares Approximation

A typical scenario in many sciences is the acquisition of $m$ numbers to describe some object that is understood to actually require only $n \ll m$ parameters. For example, $m$ voltage measurements $u_i$ of an alternating current could readily be reduced to three parameters, the amplitude, phase and frequency of $u(t) = a \sin(\omega t + \varphi)$. Very often a simple first-degree polynomial approximation $y = ax + b$ is sought for a large data set $\mathcal{D} = \{(x_i, y_i),\ i = 1, \dots, m\}$. All of these are instances of data compression, a problem that can be solved in a linear algebra framework.

1. Projection

Consider a partition of a vector space $U$ into orthogonal subspaces, $U = V \oplus W$, $V = W^{\perp}$, $W = V^{\perp}$. Within the typical scenario described above $U = \mathbb{R}^m$, $V \subset \mathbb{R}^m$, $W \subset \mathbb{R}^m$, $\dim V = n$, $\dim W = m-n$. If $\boldsymbol{V} = [\,\boldsymbol{v}_1\ \dots\ \boldsymbol{v}_n\,] \in \mathbb{R}^{m \times n}$ is a basis for $V$ and $\boldsymbol{W} = [\,\boldsymbol{w}_1\ \dots\ \boldsymbol{w}_{m-n}\,] \in \mathbb{R}^{m \times (m-n)}$ is a basis for $W$, then $\boldsymbol{U} = [\,\boldsymbol{v}_1\ \dots\ \boldsymbol{v}_n\ \boldsymbol{w}_1\ \dots\ \boldsymbol{w}_{m-n}\,]$ is a basis for $U$. Even though the matrices $\boldsymbol{V}, \boldsymbol{W}$ are not necessarily square, they are said to be orthonormal, in the sense that all columns are of unit norm and orthogonal to one another. Computation of the matrix product $\boldsymbol{V}^T\boldsymbol{V}$ leads to the formation of the identity matrix within $\mathbb{R}^{n \times n}$,

$$\boldsymbol{V}^T\boldsymbol{V} = \begin{bmatrix} \boldsymbol{v}_1^T \\ \boldsymbol{v}_2^T \\ \vdots \\ \boldsymbol{v}_n^T \end{bmatrix} \begin{bmatrix} \boldsymbol{v}_1 & \boldsymbol{v}_2 & \dots & \boldsymbol{v}_n \end{bmatrix} = \begin{bmatrix} \boldsymbol{v}_1^T\boldsymbol{v}_1 & \boldsymbol{v}_1^T\boldsymbol{v}_2 & \dots & \boldsymbol{v}_1^T\boldsymbol{v}_n \\ \boldsymbol{v}_2^T\boldsymbol{v}_1 & \boldsymbol{v}_2^T\boldsymbol{v}_2 & \dots & \boldsymbol{v}_2^T\boldsymbol{v}_n \\ \vdots & \vdots & \ddots & \vdots \\ \boldsymbol{v}_n^T\boldsymbol{v}_1 & \boldsymbol{v}_n^T\boldsymbol{v}_2 & \dots & \boldsymbol{v}_n^T\boldsymbol{v}_n \end{bmatrix} = \boldsymbol{I}_n .$$

Similarly, 𝑾T𝑾=𝑰m-n. Whereas for the square orthogonal matrix 𝑼 multiplication both on the left and the right by its transpose leads to the formation of the identity matrix

$$\boldsymbol{U}^T\boldsymbol{U} = \boldsymbol{U}\boldsymbol{U}^T = \boldsymbol{I}_m,$$

the same operations applied to rectangular orthonormal matrices lead to different results

$$\boldsymbol{V}^T\boldsymbol{V} = \boldsymbol{I}_n, \qquad \boldsymbol{V}\boldsymbol{V}^T = \begin{bmatrix} \boldsymbol{v}_1 & \boldsymbol{v}_2 & \dots & \boldsymbol{v}_n \end{bmatrix} \begin{bmatrix} \boldsymbol{v}_1^T \\ \boldsymbol{v}_2^T \\ \vdots \\ \boldsymbol{v}_n^T \end{bmatrix} = \sum_{i=1}^{n} \boldsymbol{v}_i \boldsymbol{v}_i^T, \qquad \operatorname{rank}(\boldsymbol{v}_i \boldsymbol{v}_i^T) = 1 .$$

A simple example is provided by taking $\boldsymbol{V} = \boldsymbol{I}_{m,n}$, the first $n$ columns of the identity matrix, in which case

$$\boldsymbol{V}\boldsymbol{V}^T = \sum_{i=1}^{n} \boldsymbol{e}_i \boldsymbol{e}_i^T = \begin{bmatrix} \boldsymbol{I}_n & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix} \in \mathbb{R}^{m \times m} .$$

Applying $\boldsymbol{P} = \boldsymbol{V}\boldsymbol{V}^T$ to some vector $\boldsymbol{b} \in \mathbb{R}^m$ leads to a vector $\boldsymbol{r} = \boldsymbol{P}\boldsymbol{b}$ whose first $n$ components are those of $\boldsymbol{b}$, and the remaining $m-n$ are zero. The subtraction $\boldsymbol{b} - \boldsymbol{r}$ leads to a new vector $\boldsymbol{s} = (\boldsymbol{I} - \boldsymbol{P})\boldsymbol{b}$ that has the first $n$ components equal to zero, and the remaining $m-n$ the same as those of $\boldsymbol{b}$. Such operations are referred to as projections, and for $\boldsymbol{V} = \boldsymbol{I}_{m,n}$ correspond to projection onto $\operatorname{span}\{\boldsymbol{e}_1, \dots, \boldsymbol{e}_n\}$.

Figure 1. Projection in $\mathbb{R}^2$. The vectors $\boldsymbol{r}, \boldsymbol{s} \in \mathbb{R}^2$ have two components, but could be expressed through scaling of $\boldsymbol{e}_1, \boldsymbol{e}_2$.
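To make the above concrete, here is a minimal Julia sketch of these projections for $\boldsymbol{V} = \boldsymbol{I}_{m,n}$; the sizes $m=5$, $n=2$ and the vector $\boldsymbol{b}$ are arbitrary illustrative choices, and the identity I comes from the standard LinearAlgebra package.

using LinearAlgebra

# Projection onto span{e1,...,en} using V = I(m,n), the first n columns of the identity matrix
m, n = 5, 2
V = Matrix{Float64}(I, m, n)   # m x n orthonormal matrix
P = V * V'                     # projection matrix
b = collect(1.0:m)             # example vector b = [1,2,3,4,5]
r = P * b                      # keeps the first n components of b, zeros the rest
s = (I - P) * b                # zeros the first n components, keeps the rest
@show r s
@show r + s ≈ b                # the two projections recompose b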

Returning to the general case, the orthogonal matrix $\boldsymbol{U} \in \mathbb{R}^{m \times m}$ and the orthonormal matrices $\boldsymbol{V} \in \mathbb{R}^{m \times n}$, $\boldsymbol{W} \in \mathbb{R}^{m \times (m-n)}$ are associated with linear mappings $\boldsymbol{b} = \boldsymbol{f}(\boldsymbol{x}) = \boldsymbol{U}\boldsymbol{x}$, $\boldsymbol{r} = \boldsymbol{g}(\boldsymbol{b}) = \boldsymbol{P}\boldsymbol{b}$, $\boldsymbol{s} = \boldsymbol{h}(\boldsymbol{b}) = (\boldsymbol{I} - \boldsymbol{P})\boldsymbol{b}$. The mapping $\boldsymbol{f}$ gives the components in the $\boldsymbol{I}$ basis of a vector whose components in the $\boldsymbol{U}$ basis are $\boldsymbol{x}$. The mappings $\boldsymbol{g}, \boldsymbol{h}$ project a vector onto $\operatorname{span}\{\boldsymbol{v}_1, \dots, \boldsymbol{v}_n\}$ and $\operatorname{span}\{\boldsymbol{w}_1, \dots, \boldsymbol{w}_{m-n}\}$, respectively. When $\boldsymbol{V}, \boldsymbol{W}$ are orthonormal matrices the projections are also orthogonal, $\boldsymbol{r} \perp \boldsymbol{s}$. Projection can also be carried out onto nonorthogonal spanning sets, but the process is fraught with possible error, especially when the angle between basis vectors is small, and will be avoided henceforth.

Notice that projection of a vector already in the spanning set simply returns the same vector, which leads to a general definition.

Definition. The mapping $\boldsymbol{f}: U \to U$ is called a projection if $\boldsymbol{f} \circ \boldsymbol{f} = \boldsymbol{f}$, or if for any $\boldsymbol{u} \in U$, $\boldsymbol{f}(\boldsymbol{f}(\boldsymbol{u})) = \boldsymbol{f}(\boldsymbol{u})$. With $\boldsymbol{P}$ the matrix associated with $\boldsymbol{f}$, a projection matrix satisfies $\boldsymbol{P}^2 = \boldsymbol{P}$.

For example, $\boldsymbol{P} = \boldsymbol{V}\boldsymbol{V}^T$ with orthonormal $\boldsymbol{V}$ is a projection matrix, since
$$\boldsymbol{P}^2 = \boldsymbol{P}\boldsymbol{P} = \boldsymbol{V}\boldsymbol{V}^T\boldsymbol{V}\boldsymbol{V}^T = \boldsymbol{V}(\boldsymbol{V}^T\boldsymbol{V})\boldsymbol{V}^T = \boldsymbol{V}\boldsymbol{I}\boldsymbol{V}^T = \boldsymbol{V}\boldsymbol{V}^T = \boldsymbol{P} .$$
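The idempotency can also be verified numerically. In the sketch below, an orthonormal $\boldsymbol{V}$ is obtained from the QR-factorization of a random matrix; the sizes and the use of qr, norm, and dot from the standard LinearAlgebra package are illustrative choices, not part of the lecture code.

using LinearAlgebra

# Numerical check that P = V*V' is a projection matrix when V has orthonormal columns
m, n = 6, 3
V = Matrix(qr(randn(m, n)).Q)[:, 1:n]   # orthonormal basis of a random n-dimensional subspace
P = V * V'

@show norm(V' * V - I)      # ~0: columns of V are orthonormal
@show norm(P * P - P)       # ~0: P is idempotent, P^2 = P

b = randn(m)
r = P * b; s = (I - P) * b  # projections onto the subspace and its orthogonal complement
@show dot(r, s)             # ~0: r is orthogonal to s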

2. Gram-Schmidt

Orthonormal vector sets $\{\boldsymbol{q}_1, \dots, \boldsymbol{q}_n\}$ are of the greatest practical utility, leading to the question of whether such a set can be obtained from an arbitrary set of vectors $\{\boldsymbol{a}_1, \dots, \boldsymbol{a}_n\}$. This is possible for linearly independent vectors, through what is known as the Gram-Schmidt algorithm:

  1. Start with an arbitrary direction 𝒂1

  2. Divide by its norm to obtain a unit-norm vector 𝒒1=𝒂1/||𝒂1||

  3. Choose another direction 𝒂2

  4. Subtract off its component along previous direction(s) 𝒂2-(𝒒1T𝒂2)𝒒1

  5. Divide by norm 𝒒2=(𝒂2-(𝒒1T𝒂2)𝒒1)/||𝒂2-(𝒒1T𝒂2)𝒒1||

  6. Repeat the above

The component subtracted in step 4 is the orthogonal projection of $\boldsymbol{a}_2$ onto $\boldsymbol{q}_1$,
$$\boldsymbol{P}_1\boldsymbol{a}_2 = (\boldsymbol{q}_1\boldsymbol{q}_1^T)\boldsymbol{a}_2 = \boldsymbol{q}_1(\boldsymbol{q}_1^T\boldsymbol{a}_2) = (\boldsymbol{q}_1^T\boldsymbol{a}_2)\,\boldsymbol{q}_1 .$$

The above geometrical description can be expressed in terms of matrix operations as

$$\boldsymbol{A} = \begin{bmatrix} \boldsymbol{a}_1 & \boldsymbol{a}_2 & \dots & \boldsymbol{a}_n \end{bmatrix} = \begin{bmatrix} \boldsymbol{q}_1 & \boldsymbol{q}_2 & \dots & \boldsymbol{q}_n \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & \dots & r_{1n} \\ 0 & r_{22} & r_{23} & \dots & r_{2n} \\ 0 & 0 & r_{33} & \dots & r_{3n} \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & \dots & 0 & r_{nn} \end{bmatrix} = \boldsymbol{Q}\boldsymbol{R},$$

equivalent to the system

$$\begin{cases} \boldsymbol{a}_1 = r_{11}\boldsymbol{q}_1 \\ \boldsymbol{a}_2 = r_{12}\boldsymbol{q}_1 + r_{22}\boldsymbol{q}_2 \\ \ \ \vdots \\ \boldsymbol{a}_n = r_{1n}\boldsymbol{q}_1 + r_{2n}\boldsymbol{q}_2 + \dots + r_{nn}\boldsymbol{q}_n \end{cases} .$$

The system is easily solved by forward substitution resulting in what is known as the (modified) Gram-Schmidt algorithm, transcribed below both in pseudo-code and in Julia.

Algorithm (Gram-Schmidt)

Given n vectors 𝒂1,…,𝒂n

Initialize 𝒒1=𝒂1,…,𝒒n=𝒂n, 𝑹=𝑰n

for i=1 to n

rii=(𝒒iT𝒒i)1/2

if rii<ϵ break;

𝒒i=𝒒i/rii

for j=i+1 to n

rij=𝒒iT𝒒j; 𝒒j=𝒒j-rij𝒒i

end

end

return 𝑸,𝑹

function mgs(A)
  # Modified Gram-Schmidt QR-factorization: A (m x n) = Q (orthonormal columns) * R (upper triangular)
  m,n=size(A); Q=copy(A); R=zeros(n,n)
  for i=1:n
    R[i,i]=sqrt(Q[:,i]'*Q[:,i])      # norm of column i after removal of previous components
    if (R[i,i]<eps())                # (near-)dependent column encountered: stop
      break
    end
    Q[:,i]=Q[:,i]/R[i,i]             # normalize to unit length
    for j=i+1:n
      R[i,j]=Q[:,i]'*Q[:,j]          # component of remaining columns along q_i
      Q[:,j]=Q[:,j]-R[i,j]*Q[:,i]    # subtract it off (modified Gram-Schmidt update)
    end
  end
  return Q,R
end;

Note that the normalization condition ||𝒒i||=1 is satisfied by two values ±rii, so results from the above implementation might give orthogonal vectors 𝒒1,…,𝒒n of different orientations than those returned by library implementations such as the qr function in Julia's LinearAlgebra package. The implementations provided by computational packages contain many refinements of the basic algorithm, and it is usually preferable to use these in applications.
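As a quick usage check of mgs, the following sketch compares it against the library factorization; the test matrix B and the use of qr, norm, and istriu from LinearAlgebra are illustrative assumptions.

using LinearAlgebra

B = randn(8, 4)                  # random test matrix with (almost surely) independent columns
Q, R = mgs(B)

@show norm(Q * R - B)            # ~0: the factorization reproduces B
@show norm(Q' * Q - I)           # ~0: columns of Q are orthonormal
@show istriu(R)                  # true: R is upper triangular

F = qr(B)                        # library QR; columns may differ from mgs by a sign
@show norm(abs.(Matrix(F.Q)[:, 1:4]) - abs.(Q))   # ~0 once signs are ignored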

By analogy to arithmetic and polynomial algebra, the Gram-Schmidt algorithm furnishes a factorization

𝑸𝑹=𝑨

with $\boldsymbol{Q} \in \mathbb{R}^{m \times n}$ having orthonormal columns and $\boldsymbol{R} \in \mathbb{R}^{n \times n}$ an upper triangular matrix; this is known as the QR-factorization. Since the column vectors of $\boldsymbol{Q}$ were obtained through linear combinations of the column vectors of $\boldsymbol{A}$ (and vice versa, $\boldsymbol{R}$ being invertible for linearly independent columns), we have

$$C(\boldsymbol{A}) = C(\boldsymbol{Q}) .$$

3. QR solution of linear algebra problems

The QR-factorization can be used to solve basic problems within linear algebra.

3.1. Transformation of coordinates

Recall that when given a vector $\boldsymbol{b} \in \mathbb{R}^m$, an implicit basis is assumed, the canonical basis given by the column vectors of the identity matrix $\boldsymbol{I} \in \mathbb{R}^{m \times m}$. The coordinates $\boldsymbol{x}$ in another basis $\boldsymbol{A} \in \mathbb{R}^{m \times m}$ can be found by solving the equation

𝑰𝒃=𝒃=𝑨𝒙,

by an intermediate change of coordinates to the orthogonal basis 𝑸. Since the basis 𝑸 is orthogonal the relation 𝑸T𝑸=𝑰 holds, and changes of coordinates from 𝑰 to 𝑸, 𝑸𝒄=𝒃, are easily computed 𝒄=𝑸T𝒃. Since matrix multiplication is associative

𝒃=𝑨𝒙=(𝑸𝑹)𝒙=𝑸(𝑹𝒙),

the relations 𝑹𝒙=𝑸T𝒃=𝒄 are obtained, stating that 𝒙 also contains the coordinates of 𝒄 in the basis 𝑹. The three steps are:

  1. Compute the QR-factorization, 𝑸𝑹=𝑨;

  2. Find the coordinates of 𝒃 in the orthogonal basis 𝑸, 𝒄=𝑸T𝒃;

  3. Find the coordinates of 𝒙 in basis 𝑹, 𝑹𝒙=𝒄.

Since 𝑹 is upper-triangular,

$$\begin{bmatrix} r_{11} & r_{12} & r_{13} & \dots & r_{1m} \\ 0 & r_{22} & r_{23} & \dots & r_{2m} \\ 0 & 0 & r_{33} & \dots & r_{3m} \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & \dots & 0 & r_{mm} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{m-1} \\ x_m \end{bmatrix} = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_{m-1} \\ c_m \end{bmatrix},$$

the coordinates of 𝒄 in the 𝑹 basis are easily found by back substitution.

Algorithm (Back substitution)

Given 𝑹 upper-triangular, vector 𝒄

for i=m down to 1

if rii<ϵ break;

xi=ci/rii

for j=i-1 down to 1

cj=cj-rjixi

end

end

return 𝒙

function bcks(R,c)
  # Back substitution for the upper-triangular system R*x = c
  m,n=size(R); x=zeros(m,1)
  for i=m:-1:1
    x[i]=c[i]/R[i,i]            # solve for component i
    for j=i-1:-1:1
      c[j]=c[j]-R[j,i]*x[i]     # remove its contribution from the remaining equations
    end
  end
  return x
end;

The above operations are carried out in the background by the backslash operation A\b to solve A*x=b, inspired by the scalar mnemonic $ax = b \Rightarrow x = (1/a)\,b$.
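The three steps can be tied together with the mgs and bcks functions above; in the following sketch the 3×3 matrix $\boldsymbol{A}$ and the vector $\boldsymbol{b}$ are arbitrary illustrative values.

# Coordinates of b in the basis given by the columns of A, via QR-factorization
A = [1.0 2.0 0.0;
     0.0 1.0 1.0;
     1.0 0.0 1.0]
b = [3.0, 2.0, 1.0]

Q, R = mgs(A)          # step 1: A = Q*R
c = Q' * b             # step 2: coordinates of b in the orthogonal basis Q
x = vec(bcks(R, c))    # step 3: solve R*x = c by back substitution

@show A * x - b        # ~0: x are the coordinates of b in the basis A
@show x - A \ b        # ~0: agrees with the built-in backslash solve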

3.2. General orthogonal bases

The above approach for the real vector space $\mathbb{R}^m$ can be used to determine orthogonal bases for any other vector space by appropriate modification of the scalar product. For example, within the space of smooth functions $\mathcal{C}^{\infty}[-1,1]$ that can be differentiated an arbitrary number of times, the Taylor series

$$f(x) = f(0) \cdot 1 + f'(0)\, x + \frac{f''(0)}{2!}\, x^2 + \dots + \frac{f^{(n)}(0)}{n!}\, x^n + \dots$$

is seen to be a linear combination of the monomial basis $\boldsymbol{M} = [\,1\ \ x\ \ x^2\ \dots\,]$ with scaling coefficients $\{f(0),\, f'(0),\, \tfrac{1}{2}f''(0),\, \dots\}$. The scalar product

$$(f, g) = \int_{-1}^{1} f(x)\, g(x)\, dx$$

can be seen as the extension to the $[-1,1]$ continuum of the vector dot product. Orthogonalization of the monomial basis with respect to the above scalar product leads to the definition of another family of polynomials, known as the Legendre polynomials

$$Q_0(x) = \left(\tfrac{1}{2}\right)^{1/2}, \quad Q_1(x) = \left(\tfrac{3}{2}\right)^{1/2} x, \quad Q_2(x) = \left(\tfrac{5}{8}\right)^{1/2} (3x^2 - 1), \quad Q_3(x) = \left(\tfrac{7}{8}\right)^{1/2} (5x^3 - 3x), \ \dots$$

The Legendre polynomials are usually given with a different scaling such that $P_k(1) = 1$, rather than the unit norm condition $\|Q_k\| = (Q_k, Q_k)^{1/2} = 1$. The above results can be recovered by sampling the interval $[-1,1]$ at points $x_i = (i-1)h - 1$, $h = 2/(m-1)$, $i = 1, \dots, m$, and approximating the integral by a Riemann sum

$$\int_{-1}^{1} f(x)\, L_j(x)\, dx \approx h \sum_{i=1}^{m} f(x_i)\, L_j(x_i) = h\, \boldsymbol{f}^T \boldsymbol{L}_j .$$

Figure 2. Comparison of the monomial basis (left) to the Legendre polynomial basis (right). The “resolution” of $P_3(x)$ can be interpreted as the number of crossings of the $y=0$ ordinate axis, and is greater than that of the corresponding monomial $x^3$.
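The sampling procedure can be sketched by reusing the mgs function on sampled monomials; the number of samples m=200, the cubic monomial basis, and the comparison against the normalized $Q_2$ are illustrative choices.

# Discrete recovery of (normalized) Legendre polynomials by orthogonalizing sampled monomials
m = 200; h = 2/(m-1)
x = [-1 + (i-1)*h for i in 1:m]       # sample points on [-1,1]
M = [x.^0 x.^1 x.^2 x.^3]             # sampled monomial basis 1, x, x^2, x^3
Q, R = mgs(M)                         # orthonormal in the Euclidean dot product
L = Q / sqrt(h)                       # rescale: orthonormal in the scalar product (f,g) ≈ h*f'*g

Q2 = sqrt(5/8) .* (3 .* x.^2 .- 1)    # normalized Legendre polynomial Q_2 at the sample points
err = min(maximum(abs.(L[:,3] .- Q2)), maximum(abs.(L[:,3] .+ Q2)))
@show err                             # small; the computed column may differ by a sign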

3.3. Least squares

The approach to compressing data $\mathcal{D} = \{(x_i, y_i)\,|\, i = 1, \dots, m\}$ suggested by calculus concepts is to form the sum of squared differences between $y(x_i)$ and $y_i$, for example for $y(x) = a_0 + a_1 x$ when carrying out linear regression,

$$S(a_0, a_1) = \sum_{i=1}^{m} (y(x_i) - y_i)^2 = \sum_{i=1}^{m} (a_0 + a_1 x_i - y_i)^2$$

and seek $(a_0, a_1)$ that minimize $S(a_0, a_1)$. The function $S(a_0, a_1) \ge 0$ can be thought of as the height of a surface above the $a_0 a_1$ plane, and the gradient $\nabla S$ is defined as a vector in the direction of steepest slope. At some point on the surface, if the gradient is different from the zero vector, $\nabla S \ne \boldsymbol{0}$, travel in the direction of the gradient would increase the height, and travel in the opposite direction would decrease it. The minimal value of $S$ is attained when no local travel can decrease the function value, which is known as the stationarity condition, stated as $\nabla S = \boldsymbol{0}$. Applying this to determining the coefficients $(a_0, a_1)$ of a linear regression leads to the equations

$$\frac{\partial S}{\partial a_0} = 0 \ \Rightarrow\ 2 \sum_{i=1}^{m} (a_0 + a_1 x_i - y_i) = 0 \ \Rightarrow\ m\, a_0 + \left(\sum_{i=1}^{m} x_i\right) a_1 = \sum_{i=1}^{m} y_i,$$
$$\frac{\partial S}{\partial a_1} = 0 \ \Rightarrow\ 2 \sum_{i=1}^{m} (a_0 + a_1 x_i - y_i)\, x_i = 0 \ \Rightarrow\ \left(\sum_{i=1}^{m} x_i\right) a_0 + \left(\sum_{i=1}^{m} x_i^2\right) a_1 = \sum_{i=1}^{m} x_i y_i.$$
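These two equations form a 2×2 linear system for $(a_0, a_1)$; the sketch below solves it directly for some made-up sample data (the x, y values are purely illustrative).

# Direct solution of the 2x2 stationarity system for (a0, a1) using sample data
x = [0.0, 0.5, 1.0, 1.5, 2.0]
y = [1.1, 2.0, 2.9, 4.1, 5.0]
m = length(x)

Nsum = [m       sum(x);
        sum(x)  sum(x.^2)]
bsum = [sum(y), sum(x .* y)]
a0, a1 = Nsum \ bsum
@show a0 a1        # intercept and slope of the fitted line y = a0 + a1*x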

The above calculations can become tedious, and do not illuminate the geometrical essence of the calculation, which can be brought out by a reformulation in terms of a matrix-vector product that highlights the particular linear combination sought in a linear regression. Form a vector of errors with components $e_i = y(x_i) - y_i$, where for linear regression $y(x) = a_0 + a_1 x$. Recognize that $y(x_i)$ is a linear combination of $1$ and $x_i$ with coefficients $a_0, a_1$, or in vector form

$$\boldsymbol{e} = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix} - \boldsymbol{y} = \begin{bmatrix} \boldsymbol{1} & \boldsymbol{x} \end{bmatrix} \boldsymbol{a} - \boldsymbol{y} = \boldsymbol{A}\boldsymbol{a} - \boldsymbol{y} .$$

The norm of the error vector $\|\boldsymbol{e}\|$ is smallest when $\boldsymbol{A}\boldsymbol{a}$ is as close as possible to $\boldsymbol{y}$. Since $\boldsymbol{A}\boldsymbol{a}$ is within the column space of $\boldsymbol{A}$, $\boldsymbol{A}\boldsymbol{a} \in C(\boldsymbol{A})$, the required condition is for $\boldsymbol{e}$ to be orthogonal to the column space,

$$\boldsymbol{e} \perp C(\boldsymbol{A}) \ \Rightarrow\ \boldsymbol{A}^T \boldsymbol{e} = \begin{bmatrix} \boldsymbol{1}^T \\ \boldsymbol{x}^T \end{bmatrix} \boldsymbol{e} = \begin{bmatrix} \boldsymbol{1}^T \boldsymbol{e} \\ \boldsymbol{x}^T \boldsymbol{e} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} = \boldsymbol{0},$$
$$\boldsymbol{A}^T \boldsymbol{e} = \boldsymbol{0} \ \Rightarrow\ \boldsymbol{A}^T (\boldsymbol{A}\boldsymbol{a} - \boldsymbol{y}) = \boldsymbol{0} \ \Rightarrow\ (\boldsymbol{A}^T \boldsymbol{A})\, \boldsymbol{a} = \boldsymbol{A}^T \boldsymbol{y} = \boldsymbol{b} .$$

The above is known as the normal system, with $\boldsymbol{N} = \boldsymbol{A}^T\boldsymbol{A}$ the normal matrix. The system $\boldsymbol{N}\boldsymbol{a} = \boldsymbol{b}$ can be interpreted as seeking the coordinates of the vector $\boldsymbol{b} = \boldsymbol{A}^T\boldsymbol{y}$ in the $\boldsymbol{N} = \boldsymbol{A}^T\boldsymbol{A}$ basis. An example can be constructed by randomly perturbing a known function $y(x) = a_0 + a_1 x$ to simulate measurement noise, and comparing the approximate $\tilde{\boldsymbol{a}}$ obtained by solving the normal system with the exact $\boldsymbol{a}$.

  1. Generate some data on a line and perturb it by some random quantities

    m=100; x=LinRange(0,1,m); a=[2; 3];
    a0=a[1]; a1=a[2]; yex=a0 .+ a1*x; y=yex+(rand(m) .- 0.5);

  2. Form the matrices 𝑨, 𝑵=𝑨T𝑨, vector 𝒃=𝑨T𝒚

    A=ones(m,2); A[:,2]=x; N=A'*A; b=A'*y;

  3. Solve the system $\boldsymbol{N}\tilde{\boldsymbol{a}} = \boldsymbol{b}$, and form the linear combination $\tilde{\boldsymbol{y}} = \boldsymbol{A}\tilde{\boldsymbol{a}}$ closest to $\boldsymbol{y}$

    atilde=N\b; [a atilde]

$$\begin{bmatrix} 2.0 & 1.9215699010834906 \\ 3.0 & 3.03714411616737 \end{bmatrix} \qquad (6)$$

The normal matrix $\boldsymbol{N} = \boldsymbol{A}^T\boldsymbol{A}$ can however be an ill-advised choice of basis. Consider $\boldsymbol{A} \in \mathbb{R}^{2 \times 2}$ given by

$$\boldsymbol{A} = \begin{bmatrix} \boldsymbol{a}_1 & \boldsymbol{a}_2 \end{bmatrix} = \begin{bmatrix} 1 & \cos\theta \\ 0 & \sin\theta \end{bmatrix},$$

where the first column vector is taken from the identity matrix, $\boldsymbol{a}_1 = \boldsymbol{e}_1$, and the second is the one obtained by rotating it by angle $\theta$. If $\theta = \pi/2$, the normal matrix is orthogonal, $\boldsymbol{A}^T\boldsymbol{A} = \boldsymbol{I}$, but for small $\theta$, $\boldsymbol{A}$ and $\boldsymbol{N} = \boldsymbol{A}^T\boldsymbol{A}$ are approximated by

$$\boldsymbol{A} \approx \begin{bmatrix} 1 & 1 \\ 0 & \theta \end{bmatrix}, \qquad \boldsymbol{N} = \begin{bmatrix} \boldsymbol{n}_1 & \boldsymbol{n}_2 \end{bmatrix} \approx \begin{bmatrix} 1 & 1 \\ 1 & 1 + \theta^2 \end{bmatrix}.$$

When $\theta$ is small $\boldsymbol{a}_1, \boldsymbol{a}_2$ are almost collinear (they differ by terms of order $\theta$), and $\boldsymbol{n}_1, \boldsymbol{n}_2$ even more so (they differ by terms of order $\theta^2$). This can lead to amplification of small errors, but can be avoided by recognizing that the best approximation in the 2-norm is identical to the Euclidean concept of orthogonal projection. The orthogonal projector onto $C(\boldsymbol{A})$ is readily found by QR-factorization, and the steps to solve least squares become

  1. Compute 𝑸𝑹=𝑨

  2. The projection of 𝒚 onto the column space of 𝑨 is 𝒛=𝑸𝑸T𝒚, and has coordinates 𝒄=𝑸T𝒚 in the orthogonal basis 𝑸.

  3. The same 𝒛 can also be obtained by a linear combination of the columns of 𝑨, 𝒛=𝑨𝒂=𝑸𝑸T𝒚, and replacing 𝑨 with its QR-factorization gives 𝑸𝑹𝒂=𝑸𝒄, which leads to the system 𝑹𝒂=𝒄, solved by back substitution.

Q,R=mgs(A); c=Q'*y;
aQR=R\c; [a atilde aQR]

$$\begin{bmatrix} 2.0 & 1.9215699010834906 & 1.9065620791027633 \\ 3.0 & 3.03714411616737 & 3.0503545613518166 \end{bmatrix} \qquad (7)$$

The above procedure carries over to approximation by higher-degree polynomials.

m=100; n=6; x=LinRange(0,1,m); a=rand(-10:10,n,1); A=ones(m,1);
for j=1:n-1
  global A
  A = [A x.^j];
end
yex=A*a; y=yex .+ 0.001*(rand(m,1) .- 0.5); N=A'*A;
b=A'*y;
atilde=N\b; Q,R=mgs(A); c=Q'*y;
aQR=R\c; [a atilde aQR]

$$\begin{bmatrix} 10.0 & 10.000017230385858 & 10.000017230388655 \\ -1.0 & -0.9992383469668031 & -0.9992383470406295 \\ 5.0 & 4.989621421635248 & 4.989621422094993 \\ -1.0 & -0.9668390169144284 & -0.9668390180268995 \\ -3.0 & -3.0392026170942916 & -3.0392026159405616 \\ -7.0 & -6.984207346901643 & -6.984207347332273 \end{bmatrix} \qquad (8)$$
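The advantage of the QR route can be quantified by comparing condition numbers of the matrices formed above; a minimal sketch, assuming the cond function from Julia's LinearAlgebra package:

using LinearAlgebra

# The normal matrix squares the condition number, cond(N) = cond(A)^2 in the 2-norm,
# so data errors are amplified far more when solving N*a = b than R*a = c.
@show cond(A)      # condition number of the monomial-basis matrix formed above
@show cond(N)      # = cond(A)^2, much larger
@show cond(R)      # comparable to cond(A), since A = Q*R with orthonormal Q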

The least squares problem can thus be summarized as: given data $\boldsymbol{b}$, form $\boldsymbol{A}$ and find $\boldsymbol{x}$ such that $\|\boldsymbol{e}\| = \|\boldsymbol{A}\boldsymbol{x} - \boldsymbol{b}\|$ is minimized, where $\boldsymbol{e} = \boldsymbol{b} - \boldsymbol{A}\boldsymbol{x}$ is the error vector.