MATH347DS L07: Least squares
Gram-Schmidt algorithm, $QR$ factorization
Projection onto subspaces
Orthogonal projectors
Best approximation in the 2-norm
Linear regression
Polynomial approximation
Polynomial interpolation
Orthonormal vector sets

Definition. The Kronecker delta symbol is defined as
$$\delta_{ij} = \begin{cases} 1, & i = j \\ 0, & i \neq j. \end{cases}$$

Definition. A set of vectors $\{q_1, q_2, \dots, q_n\}$, $q_i \in \mathbb{R}^m$, is said to be orthonormal if $q_i^T q_j = \delta_{ij}$.

The column vectors $e_1, e_2, \dots, e_m$ of the identity matrix $I$ are orthonormal, $e_i^T e_j = \delta_{ij}$.
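A minimal Julia check of the definition; the matrices below are example choices, not from the slides.

```julia
using LinearAlgebra

Q = Matrix(1.0I, 3, 3)          # columns e1, e2, e3 of the identity
@show norm(Q' * Q - I)          # ≈ 0: q_i' * q_j = δ_ij, so Q'Q = I

V = [1.0 1.0; 0.0 1.0]          # columns are not orthonormal
@show norm(V' * V - I)          # > 0: fails the test
```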
Gram-Schmidt algorithm
An arbitrary vector set can be transformed into an orthonormal set by the Gram-Schmidt algorithm
Idea:
Start with an arbitrary direction
Divide by its norm to obtain a unit-norm vector
Choose another direction
Subtract off its component along previous direction(s)
Divide by norm
Repeat the above
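A small numerical sketch of these steps for two vectors in $\mathbb{R}^3$ (the vectors are arbitrary illustrative values):

```julia
using LinearAlgebra

a1 = [1.0, 1.0, 0.0];  a2 = [1.0, 0.0, 1.0]   # two arbitrary directions

q1 = a1 / norm(a1)                 # divide by norm -> unit vector
v2 = a2 - (q1' * a2) * q1          # subtract component along q1
q2 = v2 / norm(v2)                 # divide by norm

@show q1' * q2                     # ≈ 0: orthogonal
@show norm(q1), norm(q2)           # ≈ 1: unit norm
```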
Matrix formulation of Gram-Schmidt ($QR$ factorization)

Consider $A \in \mathbb{R}^{m \times n}$ with linearly independent columns. By linear combinations of the columns of $A$ a set of orthonormal vectors will be obtained. This can be expressed as a matrix product
$$A = Q R,$$
with $Q \in \mathbb{R}^{m \times n}$, $R \in \mathbb{R}^{n \times n}$. The matrix $R$ is upper-triangular (also referred to as right-triangular) since to find vector $q_1$ only vector $a_1$ is used, to find vector $q_2$ only vectors $a_1, a_2$ are used, and so on.

The above is equivalent to the system
$$a_1 = r_{11} q_1, \quad a_2 = r_{12} q_1 + r_{22} q_2, \quad \dots, \quad a_n = r_{1n} q_1 + r_{2n} q_2 + \dots + r_{nn} q_n.$$
Matrix formulation of Gram-Schmidt ($QR$ factorization)

The system can be solved to find $q_1, r_{11}, r_{12}, \dots, r_{1n}$ by:

Imposing $\lVert q_1 \rVert = 1$: $r_{11} = \lVert a_1 \rVert$, $q_1 = a_1 / r_{11}$

Computing projections of $a_2, \dots, a_n$ along $q_1$: $r_{1j} = q_1^T a_j$, $j = 2, \dots, n$

Subtracting components along $q_1$ from $a_2, \dots, a_n$: $a_j \leftarrow a_j - r_{1j} q_1$, $j = 2, \dots, n$

The above steps reduce the size of the system by 1. Repeating the steps completes the solution. The overall process is known as the Gram-Schmidt algorithm.
Gram-Schmidt algorithm

Algorithm (Gram-Schmidt)
Given $n$ vectors $a_1, a_2, \dots, a_n$
Initialize $q_1 = a_1$, $q_2 = a_2$, ..., $q_n = a_n$, $R = 0$
for $i = 1$ to $n$
  $r_{ii} = \lVert q_i \rVert$;  $q_i = q_i / r_{ii}$
  for $j = i+1$ to $n$
    $r_{ij} = q_i^T a_j$;  $q_j = q_j - r_{ij} q_i$
  end
end
return $Q = [q_1 \; q_2 \; \dots \; q_n]$, $R$
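A runnable Julia transcription of the algorithm above, meant as a sketch (the function name gramschmidt is ours):

```julia
using LinearAlgebra

# Gram-Schmidt on the columns of A, returning Q (orthonormal columns)
# and upper-triangular R such that A ≈ Q*R.
function gramschmidt(A)
    m, n = size(A)
    Q = float.(A)                        # q_i initialized to a_i
    R = zeros(n, n)
    for i in 1:n
        R[i, i] = norm(Q[:, i])
        Q[:, i] ./= R[i, i]              # normalize q_i
        for j in i+1:n
            R[i, j] = Q[:, i]' * A[:, j] # projection of a_j along q_i
            Q[:, j] .-= R[i, j] .* Q[:, i]
        end
    end
    return Q, R
end

A = [1.0 1.0; 1.0 0.0; 0.0 1.0]          # linearly independent columns
Q, R = gramschmidt(A)
@show norm(Q' * Q - I)                   # ≈ 0: orthonormal columns
@show norm(A - Q * R)                    # ≈ 0: A = QR recovered
```

In practice the LinearAlgebra routine qr computes an equivalent factorization (via Householder reflections); the later examples use it.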
$QR$ factorization

For $A \in \mathbb{R}^{m \times n}$ with linearly independent columns, the Gram-Schmidt algorithm furnishes a factorization
$$A = Q R,$$
with $Q \in \mathbb{R}^{m \times n}$ with orthonormal columns and $R \in \mathbb{R}^{n \times n}$ an upper triangular matrix.

Since the column vectors within $Q$ were obtained through linear combinations of the column vectors of $A$ we have
$$C(A) = C(Q).$$
Orthogonal projection of a vector along another vector

Consider a vector $b \in \mathbb{R}^m$ and a unit-norm vector $q \in \mathbb{R}^m$, $\lVert q \rVert = 1$.

Definition. The orthogonal projection of $b$ along direction $q$ is the vector $(q^T b)\, q$.

Scalar-vector multiplication commutativity: $(q^T b)\, q = q\, (q^T b)$

Matrix multiplication associativity: $q\, (q^T b) = (q q^T)\, b$, with $q q^T \in \mathbb{R}^{m \times m}$

Definition. The matrix $P = q q^T$ is the orthogonal projector along direction $q$, $\lVert q \rVert = 1$.
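A quick numerical illustration in Julia (the vectors are arbitrary example values):

```julia
using LinearAlgebra

b = [2.0, 1.0, 3.0]
q = normalize([1.0, 1.0, 0.0])   # unit-norm direction

P = q * q'                       # orthogonal projector along q (m×m)
@show norm((q' * b) * q - P * b) # ≈ 0: scalar and matrix forms agree
@show norm(P * P - P)            # ≈ 0: projecting twice changes nothing
```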
Orthogonal projection onto a subspace

Consider $n$ orthonormal vectors $q_1, \dots, q_n \in \mathbb{R}^m$ grouped into a matrix $Q = [q_1 \; q_2 \; \dots \; q_n] \in \mathbb{R}^{m \times n}$.

The orthogonal projection of $b \in \mathbb{R}^m$ onto the subspace spanned by $q_1, \dots, q_n$ is
$$Q Q^T b = q_1 (q_1^T b) + q_2 (q_2^T b) + \dots + q_n (q_n^T b).$$

Definition. The orthogonal projector onto $C(Q)$, with $Q \in \mathbb{R}^{m \times n}$ having orthonormal column vectors, is $P = Q Q^T$.
Complementary orthogonal projector

Given $b \in \mathbb{R}^m$ and $Q \in \mathbb{R}^{m \times n}$ with orthonormal columns

Definition. The complementary orthogonal projector to $P = Q Q^T$ is $I - Q Q^T$, where $Q$ is a matrix with orthonormal columns.

The complementary orthogonal projector projects a vector onto the left null space, $(I - Q Q^T)\, b \in N(Q^T)$.
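A Julia sketch covering this and the previous slide: build $P = Q Q^T$ from a $QR$ factorization and check that $P b$ and $(I - P) b$ split $b$ into orthogonal pieces (the matrix and vector are arbitrary example values).

```julia
using LinearAlgebra

A = [1.0 0.0; 1.0 1.0; 0.0 1.0]          # full column rank
Q = Matrix(qr(A).Q)[:, 1:size(A, 2)]     # orthonormal basis of C(A)
b = [1.0, 2.0, 3.0]

P  = Q * Q'                              # projector onto C(Q)
Pc = I - P                               # complementary projector
@show norm(P * b + Pc * b - b)           # ≈ 0: the two pieces recompose b
@show norm(Q' * (Pc * b))                # ≈ 0: (I-P)b lies in N(Q')
```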
Orthogonal projectors and linear systems

Consider the linear system $A x = b$ with $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, $b \in \mathbb{R}^m$. Orthogonal projectors and knowledge of the four fundamental matrix subspaces allow us to succinctly express whether there exist no solutions, a single solution, or an infinite number of solutions:

Consider the factorization $A = Q R$, the orthogonal projector $P = Q Q^T$, and the complementary orthogonal projector $I - Q Q^T$.

If $(I - Q Q^T)\, b \neq 0$, then $b$ has a component outside the column space of $A$, and $A x = b$ has no solution.

If $(I - Q Q^T)\, b = 0$, then $b \in C(A)$ and the system has at least one solution.

If $N(A) = \{ 0 \}$ (the null space only contains the zero vector, i.e., the null space has dimension 0), the system has a unique solution.

If $\dim N(A) = p > 0$, then a vector in the null space is written as
$$z = c_1 z_1 + \dots + c_p z_p, \quad A z_i = 0,$$
and if $x_0$ is a solution of $A x = b$, so is $x_0 + z$, since
$$A (x_0 + z) = A x_0 + A z = b + 0 = b.$$
The linear system then has a $p$-parameter family of solutions.
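A Julia sketch of the solvability test (example data chosen here): compare $(I - Q Q^T)\, b$ to zero.

```julia
using LinearAlgebra

A = [1.0 0.0; 1.0 1.0; 0.0 1.0]
Q = Matrix(qr(A).Q)[:, 1:size(A, 2)]     # orthonormal basis of C(A)

b1 = A * [1.0, 2.0]                      # in C(A) by construction
b2 = [1.0, 0.0, 0.0]                     # not in C(A)

@show norm(b1 - Q * (Q' * b1))           # ≈ 0: A x = b1 has a solution
@show norm(b2 - Q * (Q' * b2))           # > 0: A x = b2 has no solution
```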
Best approximation in the 2-norm (least squares)

Mathematical statement: solve the minimization problem
$$\min_{c \in \mathbb{R}^n} \lVert y - A c \rVert_2.$$

Approach: project $y$ onto the column space of $A$:

Find an orthonormal basis for the column space of $A$ by $QR$ factorization, $A = Q R$

State that $Q Q^T y$ is the projection of $y$ onto $C(A)$

State that $A c = Q R c$ is within the column space of $A$, $C(A) = C(Q)$

Set equal the two expressions of the best approximation of $y$: $Q R c = Q Q^T y \;\Rightarrow\; R c = Q^T y$

Solve the triangular system $R c = Q^T y$ to find $c$ (in Julia, Matlab, Octave: c=R\(Q'y))
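A Julia sketch of this procedure with made-up data, checked against the built-in least-squares solve A\y:

```julia
using LinearAlgebra

A = [1.0 0.0; 1.0 1.0; 1.0 2.0]          # 3 equations, 2 unknowns
y = [1.0, 2.0, 2.5]                      # y not in C(A): no exact solution

F = qr(A)
Q = Matrix(F.Q)[:, 1:size(A, 2)]
R = F.R
c = R \ (Q' * y)                         # triangular solve R c = Q' y
@show c
@show norm(c - A \ y)                    # ≈ 0: matches built-in least squares
```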
Least squares: linear regression, calculus approach

In many scientific fields the problem arises of determining the straight line $y = a_0 + a_1 x$ that best approximates data $(x_i, y_i)$, $i = 1, \dots, m$. The problem is to find the coefficients $a_0, a_1$, and this is referred to as the linear regression problem.

The calculus approach: form the sum of squared differences between $a_0 + a_1 x_i$ and $y_i$,
$$S(a_0, a_1) = \sum_{i=1}^{m} (y_i - a_0 - a_1 x_i)^2,$$
and seek $a_0, a_1$ that minimize $S$ by solving the equations
$$\frac{\partial S}{\partial a_0} = 0, \quad \frac{\partial S}{\partial a_1} = 0.$$
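Setting the two partial derivatives to zero gives a 2×2 linear system (the normal equations). A small Julia check with made-up data that it agrees with the least-squares solve from the previous slide:

```julia
using LinearAlgebra

x = [0.0, 1.0, 2.0, 3.0]
y = [1.1, 1.9, 3.2, 3.8]                 # made-up data points
m = length(x)

# ∂S/∂a0 = 0, ∂S/∂a1 = 0 give the 2×2 normal equations
N = [m sum(x); sum(x) sum(x .^ 2)]
rhs = [sum(y), sum(x .* y)]
a = N \ rhs                              # coefficients [a0, a1]

A = [ones(m) x]                          # same fit as a least squares problem
@show a
@show norm(a - A \ y)                    # ≈ 0: the two approaches agree
```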
Geometry of linear regression

Form a vector of errors with components $e_i = y_i - (a_0 + a_1 x_i)$. Recognize that $a_0 + a_1 x_i$ is a linear combination of $\mathbf{1} = [1 \;\; 1 \;\; \dots \;\; 1]^T$ and $x = [x_1 \;\; x_2 \;\; \dots \;\; x_m]^T$ with coefficients $a_0, a_1$, or in vector form
$$e = y - A a, \quad A = [\mathbf{1} \;\; x], \quad a = \begin{bmatrix} a_0 \\ a_1 \end{bmatrix}.$$

The norm of the error vector $\lVert e \rVert$ is smallest when $A a$ is as close as possible to $y$. Since $A a$ is within the column space of $A$, $C(A)$, the required condition is for $e$ to be orthogonal to the column space,
$$A^T e = A^T (y - A a) = 0.$$
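A Julia check of this orthogonality condition, reusing the made-up regression data from above:

```julia
using LinearAlgebra

x = [0.0, 1.0, 2.0, 3.0]
y = [1.1, 1.9, 3.2, 3.8]
A = [ones(length(x)) x]

a = A \ y                        # least-squares coefficients [a0, a1]
e = y - A * a                    # error (residual) vector
@show norm(A' * e)               # ≈ 0: e is orthogonal to C(A)
```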