Least Squares Approximation

Synopsis. Having established the key theoretical results, the main linear algebra problems can now be solved. The typical scenario in data science applications is that a high-dimensional vector representation of some object is available, and greater insight is sought through a description as a linear combination of a small number of vectors. The high-dimensional object might not be exactly recovered, so the focus is on obtaining the best possible approximation. In Euclidean spaces with distances measured by the 2-norm, best approximants are readily found by generalizing the Pythagorean theorem to high-dimensional spaces.

1. Orthogonal projection

Consider a partition of a vector space $\mathcal{U}$ into orthogonal subspaces $\mathcal{U} = \mathcal{V} \oplus \mathcal{W}$, $\mathcal{V} = \mathcal{W}^{\perp}$, $\mathcal{W} = \mathcal{V}^{\perp}$, typically $\mathcal{U} = \mathbb{R}^m$, $\mathcal{V} \subset \mathbb{R}^m$, $\mathcal{W} \subset \mathbb{R}^m$, $\dim \mathcal{V} = n$, $\dim \mathcal{W} = m - n$. If $\bm{V} = [\, \bm{v}_1 \ \dots \ \bm{v}_n \,] \in \mathbb{R}^{m \times n}$ is a basis for $\mathcal{V}$ and $\bm{W} = [\, \bm{w}_1 \ \dots \ \bm{w}_{m-n} \,] \in \mathbb{R}^{m \times (m-n)}$ is a basis for $\mathcal{W}$, then $\bm{U} = [\, \bm{v}_1 \ \dots \ \bm{v}_n \ \bm{w}_1 \ \dots \ \bm{w}_{m-n} \,]$ is a basis for $\mathcal{U}$. Even though the matrices $\bm{V}, \bm{W}$ are not necessarily square, they are said to be orthonormal when all columns are of unit norm and orthogonal to one another. In this case computation of the matrix product $\bm{V}^T \bm{V}$ leads to the formation of the identity matrix within $\mathbb{R}^{n \times n}$,

$$\bm{V}^T \bm{V} = \begin{bmatrix} \bm{v}_1^T \\ \bm{v}_2^T \\ \vdots \\ \bm{v}_n^T \end{bmatrix} \begin{bmatrix} \bm{v}_1 & \bm{v}_2 & \dots & \bm{v}_n \end{bmatrix} = \begin{bmatrix} \bm{v}_1^T \bm{v}_1 & \bm{v}_1^T \bm{v}_2 & \dots & \bm{v}_1^T \bm{v}_n \\ \bm{v}_2^T \bm{v}_1 & \bm{v}_2^T \bm{v}_2 & \dots & \bm{v}_2^T \bm{v}_n \\ \vdots & \vdots & \ddots & \vdots \\ \bm{v}_n^T \bm{v}_1 & \bm{v}_n^T \bm{v}_2 & \dots & \bm{v}_n^T \bm{v}_n \end{bmatrix} = \bm{I}_n.$$

Similarly, $\bm{W}^T \bm{W} = \bm{I}_{m-n}$. Whereas for the square orthogonal matrix $\bm{U}$ multiplication both on the left and on the right by its transpose leads to the formation of the identity matrix

$$\bm{U}^T \bm{U} = \bm{U} \bm{U}^T = \bm{I}_m,$$

the same operations applied to rectangular orthogonal matrices lead to different results

$$\bm{V}^T \bm{V} = \bm{I}_n, \quad \bm{V} \bm{V}^T = \begin{bmatrix} \bm{v}_1 & \bm{v}_2 & \dots & \bm{v}_n \end{bmatrix} \begin{bmatrix} \bm{v}_1^T \\ \bm{v}_2^T \\ \vdots \\ \bm{v}_n^T \end{bmatrix} = \sum_{i=1}^{n} \bm{v}_i \bm{v}_i^T, \quad \operatorname{rank}(\bm{v}_i \bm{v}_i^T) = 1.$$

A simple example is provided by taking $\bm{V} = \bm{I}_{m,n}$, the first $n$ columns of the identity matrix, in which case

$$\bm{V} \bm{V}^T = \sum_{i=1}^{n} \bm{e}_i \bm{e}_i^T = \begin{bmatrix} \bm{I}_n & \bm{0} \\ \bm{0} & \bm{0} \end{bmatrix} \in \mathbb{R}^{m \times m}.$$

Applying $\bm{P} = \bm{V} \bm{V}^T$ to some vector $\bm{b} \in \mathbb{R}^m$ leads to a vector $\bm{r} = \bm{P} \bm{b}$ whose first $n$ components are those of $\bm{b}$, and the remaining $m-n$ are zero. The subtraction $\bm{b} - \bm{r}$ leads to a new vector $\bm{s} = (\bm{I} - \bm{P}) \bm{b}$ whose first $n$ components are zero, and whose remaining $m-n$ components are the same as those of $\bm{b}$. Such operations are referred to as projections, and for $\bm{V} = \bm{I}_{m,n}$ correspond to projection onto $\operatorname{span}\{\bm{e}_1, \dots, \bm{e}_n\}$.
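As a concrete illustration, these projections can be checked numerically in Julia (a minimal sketch; the sizes $m, n$ and the vector $\bm{b}$ below are arbitrary choices, not taken from the text):

using LinearAlgebra

m, n = 5, 2
V = Matrix{Float64}(I, m, n)   # V = I_{m,n}, the first n columns of the identity
P = V*V'                       # projection matrix onto span{e_1,...,e_n}
b = [1.0, 2.0, 3.0, 4.0, 5.0]
r = P*b                        # [1, 2, 0, 0, 0]: the first n components of b are kept
s = (I - P)*b                  # [0, 0, 3, 4, 5]: the last m-n components of b are kept
println(r + s == b, " ", r'*s) # b is recovered exactly and the two parts are orthogonal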

Returning to the general case, the orthogonal matrices $\bm{U} \in \mathbb{R}^{m \times m}$, $\bm{V} \in \mathbb{R}^{m \times n}$, $\bm{W} \in \mathbb{R}^{m \times (m-n)}$ are associated with linear mappings $\bm{b} = \bm{f}(\bm{x}) = \bm{U} \bm{x}$, $\bm{r} = \bm{g}(\bm{b}) = \bm{P} \bm{b}$, $\bm{s} = \bm{h}(\bm{b}) = (\bm{I} - \bm{P}) \bm{b}$. The mapping $\bm{f}$ gives the components in the $\bm{I}$ basis of a vector whose components in the $\bm{U}$ basis are $\bm{x}$. The mappings $\bm{g}, \bm{h}$ project a vector onto $\operatorname{span}\{\bm{v}_1, \dots, \bm{v}_n\}$ and $\operatorname{span}\{\bm{w}_1, \dots, \bm{w}_{m-n}\}$, respectively. When $\bm{V}, \bm{W}$ are orthogonal matrices the projections are also orthogonal, $\bm{r} \perp \bm{s}$. Projection can also be carried out onto nonorthogonal spanning sets, but the process is fraught with possible error, especially when the angle between basis vectors is small, and will be avoided henceforth. Notice that projection of a vector already in the spanning set simply returns the same vector, which leads to a general definition.

Definition. The mapping $\bm{f}: \mathcal{U} \to \mathcal{U}$ is called a projection if $\bm{f} \circ \bm{f} = \bm{f}$, or if for any $\bm{u} \in \mathcal{U}$, $\bm{f}(\bm{f}(\bm{u})) = \bm{f}(\bm{u})$. With $\bm{P}$ the matrix associated with $\bm{f}$, a projection matrix satisfies $\bm{P}^2 = \bm{P}$.

Orthogonal projections onto the column space $C(\bm{Q})$ of an orthonormal matrix $\bm{Q}$ are of great practical utility, and satisfy the above definition:

$$\bm{P}_{\bm{Q}} = \bm{Q} \bm{Q}^T, \qquad \bm{P}_{\bm{Q}}^2 = \bm{P}_{\bm{Q}} \bm{P}_{\bm{Q}} = \bm{Q} \bm{Q}^T \bm{Q} \bm{Q}^T = \bm{Q} (\bm{Q}^T \bm{Q}) \bm{Q}^T = \bm{Q} \bm{I} \bm{Q}^T = \bm{Q} \bm{Q}^T = \bm{P}_{\bm{Q}}.$$
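A quick numerical check of the idempotency property (a sketch; the random test matrix and the use of the built-in qr to produce an orthonormal $\bm{Q}$ are choices made here for illustration):

using LinearAlgebra

A = rand(6, 3)            # arbitrary matrix with (almost surely) independent columns
Q = Matrix(qr(A).Q)       # 6×3 orthonormal basis for C(A)
P = Q*Q'                  # orthogonal projection onto C(A)
println(norm(P*P - P))    # ≈ 0, confirming P² = P
println(norm(Q'*Q - I))   # ≈ 0, confirming QᵀQ = I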

2. Gram-Schmidt algorithm

Orthonormal vector sets $\{\bm{q}_1, \dots, \bm{q}_n\}$ are of the greatest practical utility, leading to the question of whether such a set can be obtained from an arbitrary set of vectors $\{\bm{a}_1, \dots, \bm{a}_n\}$. This is possible for linearly independent vectors, through what is known as the Gram-Schmidt algorithm:

  1. Start with an arbitrary direction $\bm{a}_1$

  2. Divide by its norm to obtain a unit-norm vector $\bm{q}_1 = \bm{a}_1 / \|\bm{a}_1\|$

  3. Choose another direction $\bm{a}_2$

  4. Subtract off its component along the previous direction(s): $\bm{a}_2 - (\bm{q}_1^T \bm{a}_2) \bm{q}_1$

  5. Divide by the norm: $\bm{q}_2 = (\bm{a}_2 - (\bm{q}_1^T \bm{a}_2) \bm{q}_1) / \|\bm{a}_2 - (\bm{q}_1^T \bm{a}_2) \bm{q}_1\|$

  6. Repeat the above

The component subtracted in step 4 is recognized as the orthogonal projection of $\bm{a}_2$ onto the direction $\bm{q}_1$,

$$\bm{P}_1 \bm{a}_2 = (\bm{q}_1 \bm{q}_1^T) \bm{a}_2 = \bm{q}_1 (\bm{q}_1^T \bm{a}_2) = (\bm{q}_1^T \bm{a}_2) \bm{q}_1.$$

The above geometrical description can be expressed in terms of matrix operations as

$$\bm{A} = \begin{bmatrix} \bm{a}_1 & \bm{a}_2 & \dots & \bm{a}_n \end{bmatrix} = \begin{bmatrix} \bm{q}_1 & \bm{q}_2 & \dots & \bm{q}_n \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & \dots & r_{1n} \\ 0 & r_{22} & r_{23} & \dots & r_{2n} \\ 0 & 0 & r_{33} & \dots & r_{3n} \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & \dots & 0 & r_{nn} \end{bmatrix} = \bm{Q} \bm{R},$$

equivalent to the system

$$\begin{cases} \bm{a}_1 = r_{11} \bm{q}_1 \\ \bm{a}_2 = r_{12} \bm{q}_1 + r_{22} \bm{q}_2 \\ \;\;\vdots \\ \bm{a}_n = r_{1n} \bm{q}_1 + r_{2n} \bm{q}_2 + \dots + r_{nn} \bm{q}_n. \end{cases}$$

The system is easily solved by forward substitution resulting in what is known as the (modified) Gram-Schmidt algorithm, transcribed below both in pseudo-code and in Julia.

Algorithm (Gram-Schmidt)

Given $n$ vectors $\bm{a}_1, \dots, \bm{a}_n$

Initialize $\bm{q}_1 = \bm{a}_1, \dots, \bm{q}_n = \bm{a}_n$, $\bm{R} = \bm{I}_n$

for $i = 1$ to $n$

  $r_{ii} = (\bm{q}_i^T \bm{q}_i)^{1/2}$

  if $r_{ii} < \epsilon$ break

  $\bm{q}_i = \bm{q}_i / r_{ii}$

  for $j = i+1$ to $n$

    $r_{ij} = \bm{q}_i^T \bm{q}_j$;  $\bm{q}_j = \bm{q}_j - r_{ij} \bm{q}_i$

  end

end

return $\bm{Q}, \bm{R}$

using LinearAlgebra

function mgs(A)
  m,n = size(A); Q = float(copy(A)); R = Matrix{Float64}(I,n,n)
  for i=1:n
    R[i,i] = sqrt(Q[:,i]'*Q[:,i])       # norm of current (already orthogonalized) column
    if R[i,i] < eps()                   # (near-)zero norm signals a dependent column
      break
    end
    Q[:,i] = Q[:,i]/R[i,i]              # normalize
    for j=i+1:n
      R[i,j] = Q[:,i]'*Q[:,j]           # component of column j along q_i
      Q[:,j] = Q[:,j] - R[i,j]*Q[:,i]   # subtract it off
    end
  end
  return Q,R
end

The input matrix $\bm{A}$ might have linearly dependent columns, in which case $r_{ii} \cong 0$ for some $i$, and the if-instruction interrupts the algorithm. The Gram-Schmidt algorithm furnishes a factorization

$$\bm{Q} \bm{R} = \bm{A}$$

with $\bm{Q} \in \mathbb{R}^{m \times n}$ an orthonormal matrix and $\bm{R} \in \mathbb{R}^{n \times n}$ an upper triangular matrix, known as the $QR$-factorization. Since the column vectors within $\bm{Q}$ were obtained through linear combinations of the column vectors of $\bm{A}$, $C(\bm{A}) = C(\bm{Q})$. The $QR$-factorization is of great utility in solving problems within linear algebra.
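As a usage sketch of the mgs function above (the random test matrix is an arbitrary choice), the returned factors can be checked against the defining properties of the $QR$-factorization:

using LinearAlgebra

A = rand(8, 4)           # test matrix with (almost surely) independent columns
Q, R = mgs(A)            # Gram-Schmidt factorization defined above
println(norm(Q*R - A))   # ≈ 0: A is recovered from the factors
println(norm(Q'*Q - I))  # ≈ 0: the columns of Q are orthonormal
println(istriu(R))       # true: R is upper triangular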

3. Least squares problems in $\mathbb{R}^m$

3.1. Problem formulation and solution by orthogonal projection

A typical situation in applications is that a vector $\bm{y} \in \mathbb{R}^m$ represents a complex object with $m \gg 1$. A simpler representation of the object is sought through a linear combination $\bm{v} = \bm{A} \bm{c}$, with $\bm{A} \in \mathbb{R}^{m \times n}$ and $n < m$, usually $m \gg n$ (Fig. 1).

Figure 1. Least squares problem: find $\bm{v} \in C(\bm{A})$, $\bm{A} \in \mathbb{R}^{m \times n}$, closest to some given $\bm{y}$ in the 2-norm.

The magnitude of the difference between the exact object $\bm{y}$ and the approximation $\bm{A} \bm{c}$ is measured through a norm, and in least squares the 2-norm is adopted. The problem is to minimize the error $\varepsilon = \|\bm{e}\| = \|\bm{y} - \bm{A} \bm{c}\| = (\bm{e}^T \bm{e})^{1/2}$. This is stated mathematically as

$$\min_{\bm{c} \in \mathbb{R}^n} \|\bm{y} - \bm{A} \bm{c}\|,$$

and within $\mathbb{R}^m$ the minimal (2-norm) distance is obtained when $\bm{y} - \bm{v}$ is orthogonal to $C(\bm{A})$. Note that if another type of norm were adopted, the orthogonality condition would no longer necessarily hold. For the 2-norm however, the orthogonality condition leads to a straightforward solution of the problem through orthogonal projection:

  1. Find an orthonormal basis for the column space of $\bm{A}$ by $QR$-factorization, $\bm{Q} \bm{R} = \bm{A}$.

  2. State that $\bm{v}$ is the projection of $\bm{y}$: $\bm{v} = \bm{P}_{C(\bm{A})} \bm{y} = \bm{P}_{\bm{Q}} \bm{y} = \bm{Q} \bm{Q}^T \bm{y}$.

  3. State that $\bm{v}$ is within the column space of $\bm{A}$: $\bm{v} = \bm{A} \bm{c} = \bm{Q} \bm{R} \bm{c}$.

  4. Set equal the two expressions of $\bm{v}$: $\bm{Q} \bm{Q}^T \bm{y} = \bm{Q} \bm{R} \bm{c}$. This is an equality between two linear combinations of the columns of $\bm{Q}$. For $\bm{Q}$ orthonormal the scaling coefficients of the two linear combinations must be equal, leading to $\bm{R} \bm{c} = \bm{Q}^T \bm{y}$.

  5. Solve the triangular system to find $\bm{c}$ (see the sketch following this list).
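These steps can be packaged into a small function, sketched below. The name lsq_qr is a hypothetical choice, and the built-in qr is used here in place of the mgs routine above; the data in the usage line is illustrative only.

using LinearAlgebra

# Least squares solution of min ||y - A*c|| by orthogonal projection onto C(A)
function lsq_qr(A, y)
    F = qr(A)               # step 1: orthonormal basis for C(A)
    Q = Matrix(F.Q)         # thin factor, size m×n
    R = F.R                 # n×n upper triangular factor
    return R \ (Q'*y)       # steps 2-5: solve R*c = Qᵀ*y by back substitution
end

# usage: fit a line through noisy samples of y = 1 + 2x
x = 0:0.1:1
c = lsq_qr([ones(length(x)) x], 1 .+ 2 .* x .+ 0.01*randn(length(x)))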

3.2. Linear regression

The approach to compressing data $D = \{(x_i, y_i) \mid i = 1, \dots, m\}$ suggested by calculus concepts is to form the sum of squared differences between $y(x_i)$ and $y_i$, for example for $y(x) = c_0 + c_1 x$ when carrying out linear regression,

$$S(c_0, c_1) = \sum_{i=1}^{m} (y(x_i) - y_i)^2 = \sum_{i=1}^{m} (c_0 + c_1 x_i - y_i)^2,$$

and seek $(c_0, c_1)$ that minimize $S(c_0, c_1)$. The function $S(c_0, c_1) \geq 0$ can be thought of as the height of a surface above the $c_0 c_1$ plane, and the gradient $\nabla S$ is defined as a vector in the direction of steepest slope. If at some point on the surface the gradient is different from the zero vector, $\nabla S \neq \bm{0}$, travel in the direction of the gradient would increase the height, and travel in the opposite direction would decrease it. The minimal value of $S$ is attained when no local travel can decrease the function value, which is known as the stationarity condition, stated as $\nabla S = \bm{0}$. Applying this to determining the coefficients $(c_0, c_1)$ of a linear regression leads to the equations

$$\frac{\partial S}{\partial c_0} = 0 \;\Rightarrow\; 2 \sum_{i=1}^{m} (c_0 + c_1 x_i - y_i) = 0 \;\Rightarrow\; m c_0 + \left( \sum_{i=1}^{m} x_i \right) c_1 = \sum_{i=1}^{m} y_i,$$
$$\frac{\partial S}{\partial c_1} = 0 \;\Rightarrow\; 2 \sum_{i=1}^{m} (c_0 + c_1 x_i - y_i) x_i = 0 \;\Rightarrow\; \left( \sum_{i=1}^{m} x_i \right) c_0 + \left( \sum_{i=1}^{m} x_i^2 \right) c_1 = \sum_{i=1}^{m} x_i y_i.$$
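Before moving to the matrix reformulation, these two equations can be checked numerically; the small, noise-free dataset below is illustrative only, not the one used later in the text.

# assemble and solve the 2×2 stationarity system for exact data y = 2 + 3x
x = [0.0, 0.25, 0.5, 0.75, 1.0]
y = 2 .+ 3 .* x
m = length(x)
Sx = sum(x); Sxx = sum(x.^2); Sy = sum(y); Sxy = sum(x.*y)
c = [m Sx; Sx Sxx] \ [Sy, Sxy]   # recovers the coefficients [2.0, 3.0]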

The above calculations can become tedious, and do not illuminate the geometrical essence of the calculation, which can be brought out by a reformulation in terms of a matrix-vector product that highlights the particular linear combination sought in a linear regression. Form a vector of errors with components $e_i = y(x_i) - y_i$, where for linear regression $y(x) = c_0 + c_1 x$. Recognize that $y(x_i)$ is a linear combination of $1$ and $x_i$ with coefficients $c_0, c_1$, or in vector form

$$\bm{e} = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \end{bmatrix} - \bm{y} = \begin{bmatrix} \bm{1} & \bm{x} \end{bmatrix} \bm{c} - \bm{y} = \bm{A} \bm{c} - \bm{y}.$$

The norm of the error vector $\|\bm{e}\|$ is smallest when $\bm{A} \bm{c}$ is as close as possible to $\bm{y}$. Since $\bm{A} \bm{c}$ lies within the column space, $\bm{A} \bm{c} \in C(\bm{A})$, the required condition is for $\bm{e}$ to be orthogonal to the column space,

$$\bm{e} \perp C(\bm{A}) \;\Leftrightarrow\; \bm{A}^T \bm{e} = \begin{bmatrix} \bm{1}^T \\ \bm{x}^T \end{bmatrix} \bm{e} = \begin{bmatrix} \bm{1}^T \bm{e} \\ \bm{x}^T \bm{e} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} = \bm{0},$$
$$\bm{A}^T \bm{e} = \bm{0} \;\Rightarrow\; \bm{A}^T (\bm{A} \bm{c} - \bm{y}) = \bm{0} \;\Rightarrow\; (\bm{A}^T \bm{A}) \bm{c} = \bm{A}^T \bm{y} = \bm{b} \;\Rightarrow\; \bm{N} \bm{c} = \bm{b}.$$

The above is known as the normal system, where $\bm{N} = \bm{A}^T \bm{A}$ is the normal matrix. The system $\bm{N} \bm{c} = \bm{b}$ can be interpreted as seeking the coordinates of the vector $\bm{b} = \bm{A}^T \bm{y}$ in the basis given by the columns of $\bm{N} = \bm{A}^T \bm{A}$. An example can be constructed by randomly perturbing a known function $y(x) = c_0 + c_1 x$ to simulate measurement noise, and comparing the coefficients $\bm{c}$ obtained by solving the normal system with the exact values.

m=100; x=(0:m-1)./m; c0=2; c1=3; yex=c0.+c1*x;   # sample the exact line y = c0 + c1*x on [0,1)
y=(yex.+rand(m,1).-0.5);                          # perturb by uniform noise in [-0.5,0.5)
A=ones(m,2); A[:,2]=x[:]; At=transpose(A); N=At*A; b=At*y;   # normal matrix N and right-hand side b
c = N\b

$$\bm{c} = \begin{bmatrix} 1.9524719146017488 \\ 3.114552636517776 \end{bmatrix} \qquad (1)$$

3.3. Least squares polynomial approximations of data

Forming the normal system of equations can lead to numerical difficulties, especially when the columns of $\bm{A}$ are close to linear dependence. It is preferable to adopt the general procedure of solving a least squares problem by projection, in which case the above linear regression becomes:

$$\bm{Q} \bm{R} = \bm{A}, \qquad \bm{R} \bm{c} = \bm{Q}^T \bm{y}.$$
using LinearAlgebra                    # provides qr
QR=qr(A); Q=Matrix(QR.Q); R=QR.R;      # thin factors: Q is m×2 orthonormal, R is 2×2 upper triangular
c = R\(transpose(Q)*y)

$$\bm{c} = \begin{bmatrix} 1.9524719146017504 \\ 3.114552636517771 \end{bmatrix} \qquad (2)$$

The above procedure can easily be extended to define quadratic or cubic regression, the problem of finding the best polynomial of degree 2 or 3 that fits the data. Quadratic regression is accomplished simply by adding a column to $\bm{A}$ containing the squares of the $\bm{x}$ vector,

$$\bm{A} = \begin{bmatrix} \bm{a}_1 & \bm{a}_2 & \bm{a}_3 \end{bmatrix} = \begin{bmatrix} \bm{1} & \bm{x} & \bm{x}^2 \end{bmatrix},$$

where the column vector $\bm{a}_k = \bm{x}^{k-1} \in \mathbb{R}^m$ has components $x_i^{k-1}$ for $i = 1, 2, \dots, m$.

m=100; x=(0:m-1)./m; c0=2; c1=3; c2=-5; yex=c0.+c1*x.+c2*x.^2;   # sample the exact quadratic
y=(yex.+rand(m,1).-0.5);                                          # perturb by uniform noise
A=ones(m,3); A[:,2]=x[:]; A[:,3]=x[:].^2;                         # regression matrix [1 x x.^2]
QR=qr(A); Q=Matrix(QR.Q); R=QR.R;                                 # thin QR factors
c = R\(transpose(Q)*y)

$$\bm{c} = \begin{bmatrix} 1.968925863819198 \\ 2.9292845098793223 \\ -4.836797382728068 \end{bmatrix} \qquad (3)$$