Model Reduction

1. Projection of mappings

The least-squares problem

$$ \min_{\bm{x} \in \mathbb{R}^n} \| \bm{y} - \bm{A} \bm{x} \| \qquad (1) $$

focuses on a simpler representation of a data vector $\bm{y} \in \mathbb{R}^m$ as a linear combination of the column vectors of $\bm{A} \in \mathbb{R}^{m \times n}$. Consider some phenomenon modeled as a function between vector spaces $\bm{f}: X \to Y$, such that for input parameters $\bm{x} \in X$, the state of the system is $\bm{y} = \bm{f}(\bm{x})$. For most models $\bm{f}$ is differentiable, a transcription of the condition that the system should not exhibit jumps in behavior when changing the input parameters. Then, by an appropriate choice of units and origin, a linearized model

$$ \bm{y} = \bm{A} \bm{x}, \quad \bm{A} \in \mathbb{R}^{m \times n}, $$

is obtained if $\bm{y} \in C(\bm{A})$, expressed as (1) if $\bm{y} \notin C(\bm{A})$.
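The two cases can be checked numerically. Below is a minimal sketch, assuming numpy, with $\bm{A}$ and the data randomly generated only for illustration: the least-squares solver reproduces $\bm{y}$ exactly when $\bm{y} \in C(\bm{A})$, and returns the minimizer of (1) otherwise.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 3
A = rng.standard_normal((m, n))

# Case y in C(A): the linearized model y = A x holds exactly.
x_true = rng.standard_normal(n)
y_in = A @ x_true
x, *_ = np.linalg.lstsq(A, y_in, rcond=None)
print(np.linalg.norm(y_in - A @ x))   # ~0: y is reproduced exactly

# Case y not in C(A): (1) yields the best approximation within C(A).
y_out = y_in + rng.standard_normal(m)
x, *_ = np.linalg.lstsq(A, y_out, rcond=None)
print(np.linalg.norm(y_out - A @ x))  # > 0: least-squares residual
```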

A simpler description is often sought, typically based on the recognition that the inputs and outputs of the model can themselves be obtained as linear combinations $\bm{x} = \bm{B} \bm{u}$, $\bm{y} = \bm{C} \bm{v}$, involving a smaller set of parameters $\bm{u} \in \mathbb{R}^q$, $\bm{v} \in \mathbb{R}^p$, with $p < m$, $q < n$. The column spaces of the matrices $\bm{B} \in \mathbb{R}^{n \times q}$, $\bm{C} \in \mathbb{R}^{m \times p}$ are vector subspaces of the original sets of inputs and outputs, $C(\bm{B}) \leq \mathbb{R}^n$, $C(\bm{C}) \leq \mathbb{R}^m$. The sets of column vectors of $\bm{B}, \bm{C}$ each form a reduced basis for the system inputs and outputs if they are chosen to be of full rank. The reduced bases are assumed to have been orthonormalized through the Gram-Schmidt procedure, such that $\bm{B}^T \bm{B} = \bm{I}_q$ and $\bm{C}^T \bm{C} = \bm{I}_p$. Expressing the model inputs and outputs in terms of the reduced bases leads to

$$ \bm{C} \bm{v} = \bm{A} \bm{B} \bm{u} \Rightarrow \bm{v} = \bm{C}^T \bm{A} \bm{B} \bm{u} \Rightarrow \bm{v} = \bm{R} \bm{u}. $$

The matrix $\bm{R} = \bm{C}^T \bm{A} \bm{B}$ is called the reduced system matrix and is associated with a mapping $\bm{g}: U \to V$ that is a restriction of the mapping $\bm{f}$ to the vector subspaces $U, V$. When $\bm{f}$ is an endomorphism, $\bm{f}: X \to X$, $m = n$, the same reduced basis is used for both inputs and outputs, $\bm{x} = \bm{B} \bm{u}$, $\bm{y} = \bm{B} \bm{v}$, and the reduced system is

$$ \bm{v} = \bm{R} \bm{u}, \quad \bm{R} = \bm{B}^T \bm{A} \bm{B}. $$
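A small numerical sketch of the construction (assuming numpy; thin QR factorization is used here in place of Gram-Schmidt to orthonormalize randomly generated bases) confirms that the reduced output $\bm{R} \bm{u}$ agrees with $\bm{C}^T \bm{A} \bm{B} \bm{u}$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p, q = 10, 8, 3, 2
A = rng.standard_normal((m, n))       # full model matrix

# Orthonormal reduced bases; thin QR stands in for Gram-Schmidt.
B, _ = np.linalg.qr(rng.standard_normal((n, q)))   # B^T B = I_q
C, _ = np.linalg.qr(rng.standard_normal((m, p)))   # C^T C = I_p

R = C.T @ A @ B                       # reduced system matrix
u = rng.standard_normal(q)
print(np.allclose(R @ u, C.T @ (A @ (B @ u))))     # v = R u
```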

Since $\bm{B}$ is assumed to have orthonormal columns, the projector onto $C(\bm{B})$ is $\bm{P}_{\bm{B}} = \bm{B} \bm{B}^T$. Applying the projector to the initial model

$$ \bm{P}_{\bm{B}} \bm{y} = \bm{P}_{\bm{B}} \bm{A} \bm{x} $$

leads to $\bm{B} \bm{B}^T \bm{y} = \bm{B} \bm{B}^T \bm{A} \bm{x}$, and since $\bm{v} = \bm{B}^T \bm{y}$ the relation $\bm{B} \bm{v} = \bm{B} \bm{B}^T \bm{A} \bm{B} \bm{u}$ is obtained, conveniently grouped as

$$ \bm{B} \bm{v} = \bm{B} (\bm{B}^T \bm{A} \bm{B}) \bm{u} \Rightarrow \bm{B} \bm{v} = \bm{B} (\bm{R} \bm{u}), $$

again leading to the reduced model $\bm{v} = \bm{R} \bm{u}$. The above calculation highlights that the reduced model is a projection of the full model $\bm{y} = \bm{A} \bm{x}$ onto $C(\bm{B})$.
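The projection interpretation can be verified numerically as well. In the sketch below (again assuming numpy and randomly generated data), applying $\bm{P}_{\bm{B}}$ to the full model output for an input drawn from $C(\bm{B})$ coincides with lifting the reduced output back through $\bm{B}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, q = 6, 2
A = rng.standard_normal((n, n))                    # endomorphism, m = n
B, _ = np.linalg.qr(rng.standard_normal((n, q)))   # B^T B = I_q

P_B = B @ B.T                                      # projector onto C(B)
R = B.T @ A @ B                                    # reduced system matrix

u = rng.standard_normal(q)
x = B @ u                                          # input drawn from C(B)
# Projecting the full model matches lifting the reduced one: P_B A x = B (R u)
print(np.allclose(P_B @ (A @ x), B @ (R @ u)))
```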

2. Reduced bases

2.1. Correlation matrices

Correlation coefficient.
Consider two functions $x_1, x_2: \mathbb{R} \to \mathbb{R}$ that represent data streams in time of inputs $x_1(t)$ and outputs $x_2(t)$ of some system. A basic question arising in modeling and data science is whether the inputs and outputs are themselves in a functional relationship. This usually is a consequence of incomplete knowledge of the system: while $x_1, x_2$ might be assumed to be the most relevant input and output quantities, this is not yet fully established. A typical approach is then to carry out repeated measurements, leading to a data set $D = \{ (x_1(t_i), x_2(t_i)) \mid i = 1, \dots, N \}$, thus defining a relation. Let $\bm{x}_1, \bm{x}_2 \in \mathbb{R}^N$ denote the vectors containing the input and output values. The mean values $\mu_1, \mu_2$ of the input and output are estimated by the statistics

$$ \mu_1 \cong \langle x_1 \rangle = \frac{1}{N} \sum_{i=1}^N x_1(t_i) = E[x_1], \quad \mu_2 \cong \langle x_2 \rangle = \frac{1}{N} \sum_{i=1}^N x_2(t_i) = E[x_2], $$

where $E$ is the expectation, seen to be a linear mapping $E: \mathbb{R}^N \to \mathbb{R}$ whose associated matrix is

$$ \bm{E} = \frac{1}{N} \begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix}, $$

and the means are also obtained by matrix-vector multiplication (a linear combination),

$$ \langle x_1 \rangle = \bm{E} \bm{x}_1, \quad \langle x_2 \rangle = \bm{E} \bm{x}_2. $$
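As a brief illustration of the expectation as a linear mapping (a sketch assuming numpy; the data is random), the row matrix $\bm{E}$ reproduces the sample mean by matrix-vector multiplication:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100
x1 = rng.standard_normal(N)

E = np.ones((1, N)) / N      # matrix of the expectation mapping E: R^N -> R
print((E @ x1).item())       # mean as a matrix-vector product
print(x1.mean())             # agrees with the usual sample mean
```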

Deviation from the mean is measured by the standard deviation, defined for $x_1, x_2$ by

$$ \sigma_1 = \sqrt{E[(x_1 - \mu_1)^2]}, \quad \sigma_2 = \sqrt{E[(x_2 - \mu_2)^2]}. $$

Note that the standard deviations are no longer linear mappings of the data.

Assume that the origin is chosen such that $\langle x_1 \rangle = \langle x_2 \rangle = 0$. One tool to establish whether the relation $D$ is also a function is to compute the correlation coefficient

$$ \rho(x_1, x_2) = \frac{E[x_1 x_2]}{\sigma_1 \sigma_2} = \frac{E[x_1 x_2]}{\sqrt{E[x_1^2]\, E[x_2^2]}}, $$

that can be expressed in terms of a scalar product and 2-norm as

$$ \rho(x_1, x_2) = \frac{\bm{x}_1^T \bm{x}_2}{\| \bm{x}_1 \|\, \| \bm{x}_2 \|}. $$

Squaring each side of the norm property $\| \bm{x}_1 + \bm{x}_2 \| \leq \| \bm{x}_1 \| + \| \bm{x}_2 \|$ leads to

$$ (\bm{x}_1 + \bm{x}_2)^T (\bm{x}_1 + \bm{x}_2) \leq \bm{x}_1^T \bm{x}_1 + \bm{x}_2^T \bm{x}_2 + 2 \| \bm{x}_1 \|\, \| \bm{x}_2 \| \Rightarrow \bm{x}_1^T \bm{x}_2 \leq \| \bm{x}_1 \|\, \| \bm{x}_2 \|, $$

known as the Cauchy-Schwarz inequality, which implies $-1 \leq \rho(x_1, x_2) \leq 1$. Depending on the value of $\rho$, the variables $x_1(t), x_2(t)$ are said to be:

  1. uncorrelated, if $\rho = 0$;

  2. correlated, if $\rho = 1$;

  3. anti-correlated, if $\rho = -1$.

The numerator of the correlation coefficient is known as the covariance of $x_1, x_2$,

$$ \mathrm{cov}(x_1, x_2) = E[x_1 x_2]. $$

The correlation coefficient can be interpreted as a normalization of the covariance, and the relation

$$ \mathrm{cov}(x_1, x_2) = \bm{x}_1^T \bm{x}_2 = \rho(x_1, x_2)\, \| \bm{x}_1 \|\, \| \bm{x}_2 \|, $$

is the two-variable version of a more general relationship encountered when the system inputs and outputs become vectors.
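These formulas are straightforward to evaluate numerically. The sketch below (assuming numpy, with a synthetic noisy-signal pair chosen for illustration) computes $\rho$ from the scalar product and norms and compares it with numpy's built-in correlation coefficient:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 500
t = np.linspace(0.0, 1.0, N)
x1 = np.sin(2 * np.pi * t)
x2 = x1 + 0.1 * rng.standard_normal(N)    # noisy copy of x1

# Shift to zero mean, as assumed when choosing the origin.
x1 = x1 - x1.mean()
x2 = x2 - x2.mean()

rho = (x1 @ x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
print(rho)                                # close to 1: strongly correlated
print(np.corrcoef(x1, x2)[0, 1])          # matches numpy's built-in
```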

Patterns in data.
Consider now a related problem: whether the input and output parameters $\bm{x} \in \mathbb{R}^n$, $\bm{y} \in \mathbb{R}^m$ thought to characterize a system are actually well chosen, or whether they are redundant in the sense that a more insightful description is furnished by $\bm{u} \in \mathbb{R}^q$, $\bm{v} \in \mathbb{R}^p$ with fewer components, $p < m$, $q < n$. Applying the same ideas as in the correlation coefficient, a sequence of $N$ measurements is made, leading to the data sets

$$ \bm{X} = \begin{bmatrix} \bm{x}_1 & \bm{x}_2 & \cdots & \bm{x}_n \end{bmatrix} \in \mathbb{R}^{N \times n}, \quad \bm{Y} = \begin{bmatrix} \bm{y}_1 & \bm{y}_2 & \cdots & \bm{y}_m \end{bmatrix} \in \mathbb{R}^{N \times m}. $$

Again, by appropriate choice of the origin, the means of the above measurements are assumed to be zero,

$$ E[\bm{x}] = \bm{0}, \quad E[\bm{y}] = \bm{0}. $$

Covariance matrices can be constructed by

$$ \bm{C}_{\bm{X}} = \bm{X}^T \bm{X} = \begin{bmatrix} \bm{x}_1^T \\ \bm{x}_2^T \\ \vdots \\ \bm{x}_n^T \end{bmatrix} \begin{bmatrix} \bm{x}_1 & \bm{x}_2 & \cdots & \bm{x}_n \end{bmatrix} = \begin{bmatrix} \bm{x}_1^T \bm{x}_1 & \bm{x}_1^T \bm{x}_2 & \cdots & \bm{x}_1^T \bm{x}_n \\ \bm{x}_2^T \bm{x}_1 & \bm{x}_2^T \bm{x}_2 & \cdots & \bm{x}_2^T \bm{x}_n \\ \vdots & \vdots & \ddots & \vdots \\ \bm{x}_n^T \bm{x}_1 & \bm{x}_n^T \bm{x}_2 & \cdots & \bm{x}_n^T \bm{x}_n \end{bmatrix} \in \mathbb{R}^{n \times n}. $$

Consider now the SVDs $\bm{C}_{\bm{X}} = \bm{N} \bm{\Lambda} \bm{N}^T$, $\bm{X} = \bm{U} \bm{\Sigma} \bm{S}^T$, and from

$$ \bm{C}_{\bm{X}} = \bm{X}^T \bm{X} = (\bm{U} \bm{\Sigma} \bm{S}^T)^T \bm{U} \bm{\Sigma} \bm{S}^T = \bm{S} \bm{\Sigma}^T \bm{U}^T \bm{U} \bm{\Sigma} \bm{S}^T = \bm{S} \bm{\Sigma}^T \bm{\Sigma} \bm{S}^T = \bm{N} \bm{\Lambda} \bm{N}^T, $$

identify $\bm{N} = \bm{S}$ and $\bm{\Lambda} = \bm{\Sigma}^T \bm{\Sigma}$.

Recall that the SVD returns an ordered set of singular values $\sigma_1 \geq \sigma_2 \geq \cdots$, and associated singular vectors. In many applications the singular values decrease quickly, often exponentially fast. Taking the first $q$ singular modes then gives a basis set suitable for model reduction,

$$ \bm{x} = \bm{S}_q \bm{u} = \begin{bmatrix} \bm{s}_1 & \bm{s}_2 & \cdots & \bm{s}_q \end{bmatrix} \bm{u}. $$
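The identification $\bm{N} = \bm{S}$, $\bm{\Lambda} = \bm{\Sigma}^T \bm{\Sigma}$ and the choice of the reduced basis $\bm{S}_q$ can be illustrated as follows (a sketch assuming numpy, with synthetic data built to have $q$ dominant directions):

```python
import numpy as np

rng = np.random.default_rng(5)
N, n, q = 200, 6, 2

# Synthetic measurements dominated by q directions, plus small noise.
modes, _ = np.linalg.qr(rng.standard_normal((n, q)))
X = rng.standard_normal((N, q)) @ modes.T + 0.01 * rng.standard_normal((N, n))
X = X - X.mean(axis=0)                    # zero-mean columns, E[x] = 0

U, sigma, St = np.linalg.svd(X, full_matrices=False)
S = St.T                                  # X = U Sigma S^T
C_X = X.T @ X

# Lambda = Sigma^T Sigma: eigenvalues of C_X are squared singular values.
print(np.allclose(C_X @ S, S @ np.diag(sigma**2)))
print(sigma)                              # rapid decay beyond the first q

S_q = S[:, :q]                            # reduced basis, x ~ S_q u
```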