Model Reduction

1. Projection of mappings

1.1. Reduced matrices

The least-squares problem

\min_{\bm{x} \in \mathbb{R}^n} \| \bm{y} - \bm{A} \bm{x} \|   (1)

focuses on a simpler representation of a data vector $\bm{y} \in \mathbb{R}^m$ as a linear combination of the column vectors of $\bm{A} \in \mathbb{R}^{m \times n}$. Consider some phenomenon modeled as a function between vector spaces $\bm{f} : X \to Y$, such that for input parameters $\bm{x} \in X$, the state of the system is $\bm{y} = \bm{f}(\bm{x})$. For most models $\bm{f}$ is differentiable, a transcription of the condition that the system should not exhibit jumps in behavior when changing the input parameters. Then, by appropriate choice of units and origin, a linearized model

\bm{y} = \bm{A} \bm{x}, \quad \bm{A} \in \mathbb{R}^{m \times n},

is obtained, exact if $\bm{y} \in C(\bm{A})$, and expressed as the least-squares problem (1) if $\bm{y} \notin C(\bm{A})$.

A simpler description is often sought, typically based on recognition that the inputs and outputs of the model can themselves be obtained as linear combinations $\bm{x} = \bm{B} \bm{u}$, $\bm{y} = \bm{C} \bm{v}$, involving a smaller set of parameters $\bm{u} \in \mathbb{R}^q$, $\bm{v} \in \mathbb{R}^p$, with $p < m$, $q < n$. The column spaces of the matrices $\bm{B} \in \mathbb{R}^{n \times q}$, $\bm{C} \in \mathbb{R}^{m \times p}$ are vector subspaces of the original sets of inputs and outputs, $C(\bm{B}) \leq \mathbb{R}^n$, $C(\bm{C}) \leq \mathbb{R}^m$. The sets of column vectors of $\bm{B}, \bm{C}$ each form a reduced basis for the system inputs and outputs if they are chosen to be of full rank. The reduced bases are assumed to have been orthonormalized through the Gram-Schmidt procedure, such that $\bm{B}^T \bm{B} = \bm{I}_q$ and $\bm{C}^T \bm{C} = \bm{I}_p$. Expressing the model inputs and outputs in terms of the reduced bases leads to

\bm{C} \bm{v} = \bm{A} \bm{B} \bm{u} \Rightarrow \bm{v} = \bm{C}^T \bm{A} \bm{B} \bm{u} \Rightarrow \bm{v} = \bm{R} \bm{u}.

The matrix $\bm{R} = \bm{C}^T \bm{A} \bm{B} \in \mathbb{R}^{p \times q}$ is called the reduced system matrix and is associated with a mapping $\bm{g} : U \to V$ that is the restriction of the mapping $\bm{f}$ to the vector subspaces $U, V$. When $\bm{f}$ is an endomorphism, $\bm{f} : X \to X$, $m = n$, the same reduced basis is used for both inputs and outputs, $\bm{x} = \bm{B} \bm{u}$, $\bm{y} = \bm{B} \bm{v}$, and the reduced system is

\bm{v} = \bm{R} \bm{u}, \quad \bm{R} = \bm{B}^T \bm{A} \bm{B}.

Since the columns of $\bm{B}$ are orthonormal, the projector onto $C(\bm{B})$ is $\bm{P}_{\bm{B}} = \bm{B} \bm{B}^T$. Applying the projector to the initial model

\bm{P}_{\bm{B}} \bm{y} = \bm{P}_{\bm{B}} \bm{A} \bm{x}

leads to $\bm{B} \bm{B}^T \bm{y} = \bm{B} \bm{B}^T \bm{A} \bm{x}$, and since $\bm{v} = \bm{B}^T \bm{y}$, the relation $\bm{B} \bm{v} = \bm{B} \bm{B}^T \bm{A} \bm{B} \bm{u}$ is obtained, conveniently grouped as

\bm{B} \bm{v} = \bm{B} (\bm{B}^T \bm{A} \bm{B}) \bm{u} \Rightarrow \bm{B} \bm{v} = \bm{B} (\bm{R} \bm{u}),

again leading to the reduced model $\bm{v} = \bm{R} \bm{u}$. The above calculation highlights that the reduced model is a projection of the full model $\bm{y} = \bm{A} \bm{x}$ onto $C(\bm{B})$.
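As a concrete illustration, the following minimal sketch (assuming NumPy; the matrix $\bm{A}$, dimensions, and random basis directions are arbitrary choices, not part of the text above) forms an orthonormal reduced basis by QR factorization and checks that the reduced model reproduces the projection of the full model onto $C(\bm{B})$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, q = 8, 3                          # full and reduced dimensions (illustrative)
A = rng.standard_normal((m, m))      # full system matrix (endomorphism case)

# Orthonormal reduced basis: QR of q arbitrary directions gives B^T B = I_q
B, _ = np.linalg.qr(rng.standard_normal((m, q)))

R = B.T @ A @ B                      # reduced system matrix, q x q

# For an input x = B u lying in C(B), the reduced model gives P_B A x
u = rng.standard_normal(q)
x = B @ u
v = R @ u
print(np.allclose(B @ v, B @ B.T @ A @ x))   # True: B v = P_B A x
```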

1.2. Dynamical system model reduction

An often encountered situation is the reduction of a large-dimensional dynamical system

\bm{M} \ddot{\bm{x}} + \bm{D} \dot{\bm{x}} + \bm{K} \bm{x} = \bm{f}, \quad \bm{M}, \bm{D}, \bm{K} \in \mathbb{R}^{m \times m}, \quad \bm{x}, \bm{f} : \mathbb{R}_+ \to \mathbb{R}^m,   (2)
\dot{\bm{x}} = \frac{\mathrm{d} \bm{x}}{\mathrm{d} t}, \quad \ddot{\bm{x}} = \frac{\mathrm{d} \dot{\bm{x}}}{\mathrm{d} t},

a generalization to multiple degrees of freedom of the damped oscillator equation

m \ddot{x} + d \dot{x} + k x = f.

In (2), $\bm{x}(t)$ are the time-dependent coordinates of the system, $\bm{f}(t)$ the forces acting on the system, and $\bm{M}, \bm{D}, \bm{K}$ are the mass, drag, and stiffness matrices, respectively.

When $m \gg 1$, a reduced description is sought by linear combination of $n \ll m$ basis vectors

\bm{x} \cong \tilde{\bm{x}} = \bm{B} \bm{y} \Rightarrow \bm{M} \bm{B} \ddot{\bm{y}} + \bm{D} \bm{B} \dot{\bm{y}} + \bm{K} \bm{B} \bm{y} = \bm{f}.

Choose $\bm{B} \in \mathbb{R}^{m \times n}$ to have orthonormal columns, and project (2) onto $C(\bm{B})$ by multiplication with the projector $\bm{P} = \bm{B} \bm{B}^T$:

\bm{B} \bm{B}^T \bm{M} \bm{B} \ddot{\bm{y}} + \bm{B} \bm{B}^T \bm{D} \bm{B} \dot{\bm{y}} + \bm{B} \bm{B}^T \bm{K} \bm{B} \bm{y} = \bm{B} \bm{B}^T \bm{f}
\Rightarrow \bm{B} \left( \bm{B}^T \bm{M} \bm{B} \ddot{\bm{y}} + \bm{B}^T \bm{D} \bm{B} \dot{\bm{y}} + \bm{B}^T \bm{K} \bm{B} \bm{y} - \bm{B}^T \bm{f} \right) = \bm{0} \Rightarrow \bm{B} \bm{z} = \bm{0}.

Since $N(\bm{B}) = \{ \bm{0} \}$, deduce $\bm{z} = \bm{0}$, hence

\bm{B}^T \bm{M} \bm{B} \ddot{\bm{y}} + \bm{B}^T \bm{D} \bm{B} \dot{\bm{y}} + \bm{B}^T \bm{K} \bm{B} \bm{y} = \bm{B}^T \bm{f}.

Introduce the notations

\tilde{\bm{M}} = \bm{B}^T \bm{M} \bm{B}, \quad \tilde{\bm{D}} = \bm{B}^T \bm{D} \bm{B}, \quad \tilde{\bm{K}} = \bm{B}^T \bm{K} \bm{B}

for the reduced mass, drag, and stiffness matrices, with $\tilde{\bm{M}}, \tilde{\bm{D}}, \tilde{\bm{K}} \in \mathbb{R}^{n \times n}$ of smaller size. The reduced coordinates and forces are

\tilde{\bm{f}} = \bm{B}^T \bm{f}, \quad \bm{y}, \tilde{\bm{f}} \in \mathbb{R}^n.

The resulting reduced dynamical system is

\tilde{\bm{M}} \ddot{\bm{y}} + \tilde{\bm{D}} \dot{\bm{y}} + \tilde{\bm{K}} \bm{y} = \tilde{\bm{f}}.
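A minimal sketch of this reduction, assuming NumPy and placeholder system matrices (the diagonal stiffness and Rayleigh-type damping below are illustrative choices, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 100, 4                        # full and reduced dimensions (illustrative)

M = np.eye(m)                        # placeholder mass matrix
K = np.diag(np.linspace(1.0, 10.0, m))   # placeholder stiffness matrix
D = 0.05 * M + 0.01 * K              # assumed Rayleigh-type damping
f = rng.standard_normal(m)           # placeholder forcing

B, _ = np.linalg.qr(rng.standard_normal((m, n)))   # orthonormal reduced basis

# Reduced matrices and forcing: n x n and n-vector instead of m x m and m-vector
M_r, D_r, K_r = B.T @ M @ B, B.T @ D @ B, B.T @ K @ B
f_r = B.T @ f
print(M_r.shape, D_r.shape, K_r.shape, f_r.shape)   # (4, 4) three times, (4,)
```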

2. Reduced bases

One element is missing from the description of model reduction above: how is $\bm{B}$ determined? Domain-specific knowledge can often dictate an appropriate basis (e.g., a Fourier basis for periodic phenomena). Alternatively, an appropriate basis can be extracted from observations of the phenomenon, an approach known as data-driven modeling.

2.1. Correlation matrices

Correlation coefficient.
Consider two functions $x_1, x_2 : \mathbb{R} \to \mathbb{R}$ that represent data streams in time of inputs $x_1(t)$ and outputs $x_2(t)$ of some system. A basic question arising in modeling and data science is whether the inputs and outputs are themselves in a functional relationship. This usually is a consequence of incomplete knowledge of the system, such that while $x_1, x_2$ might be assumed to be the most relevant input and output quantities, this is not yet fully established. A typical approach is then to carry out repeated measurements, leading to a data set $D = \{ (x_1(t_i), x_2(t_i)) \mid i = 1, \dots, N \}$, thus defining a relation. Let $\bm{x}_1, \bm{x}_2 \in \mathbb{R}^N$ denote vectors containing the input and output values. The mean values $\mu_1, \mu_2$ of the input and output are estimated by the statistics

\mu_1 \cong \bar{x}_1 = \frac{1}{N} \sum_{i=1}^{N} x_1(t_i) = E[x_1], \quad \mu_2 \cong \bar{x}_2 = \frac{1}{N} \sum_{i=1}^{N} x_2(t_i) = E[x_2],

where $E$ is the expectation, seen to be a linear mapping $E : \mathbb{R}^N \to \mathbb{R}$, whose associated matrix is

\bm{E} = \frac{1}{N} \begin{bmatrix} 1 & 1 & \dots & 1 \end{bmatrix},

and the means are also obtained by matrix-vector multiplication (linear combination),

\bar{x}_1 = \bm{E} \bm{x}_1, \quad \bar{x}_2 = \bm{E} \bm{x}_2.
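For instance, a small check (NumPy assumed; the sample values are arbitrary) that the averaging matrix reproduces the sample mean:

```python
import numpy as np

x1 = np.array([2.0, 4.0, 6.0])            # arbitrary sample values
E = np.full((1, x1.size), 1.0 / x1.size)  # expectation as a 1 x N matrix
print(E @ x1, x1.mean())                  # [4.] 4.0 -- the same mean
```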

Deviation from the mean is measured by the standard deviation, defined for $x_1, x_2$ by

\sigma_1 = \sqrt{E[(x_1 - \mu_1)^2]}, \quad \sigma_2 = \sqrt{E[(x_2 - \mu_2)^2]}.

Note that the standard deviations are no longer linear mappings of the data.

Assume that the origin is chosen such that $\bar{x}_1 = \bar{x}_2 = 0$. One tool to establish whether the relation $D$ is also a function is to compute the correlation coefficient

\rho(x_1, x_2) = \frac{E[x_1 x_2]}{\sigma_1 \sigma_2} = \frac{E[x_1 x_2]}{\sqrt{E[x_1^2]} \sqrt{E[x_2^2]}},

that can be expressed in terms of a scalar product and the 2-norm as

\rho(x_1, x_2) = \frac{\bm{x}_1^T \bm{x}_2}{\| \bm{x}_1 \| \, \| \bm{x}_2 \|}.
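A quick numerical check (NumPy; the synthetic signals below are illustrative assumptions) compares this dot-product form against the statistical definition:

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 200)
x1 = np.sin(2 * np.pi * t)                          # input stream
x2 = 0.8 * x1 + 0.1 * rng.standard_normal(t.size)   # noisy output stream

x1 = x1 - x1.mean()                  # center the data so the means vanish
x2 = x2 - x2.mean()

rho = (x1 @ x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
print(rho, np.corrcoef(x1, x2)[0, 1])    # the two values agree
```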

Squaring each side of the norm property $\| \bm{x}_1 + \bm{x}_2 \| \leq \| \bm{x}_1 \| + \| \bm{x}_2 \|$ leads to

(\bm{x}_1 + \bm{x}_2)^T (\bm{x}_1 + \bm{x}_2) \leq \bm{x}_1^T \bm{x}_1 + \bm{x}_2^T \bm{x}_2 + 2 \| \bm{x}_1 \| \, \| \bm{x}_2 \| \Rightarrow \bm{x}_1^T \bm{x}_2 \leq \| \bm{x}_1 \| \, \| \bm{x}_2 \|,

known as the Cauchy-Schwarz inequality, which implies $-1 \leq \rho(x_1, x_2) \leq 1$. Depending on the value of $\rho$, the variables $x_1(t), x_2(t)$ are said to be:

  1. uncorrelated, if $\rho = 0$;

  2. correlated, if $\rho = 1$;

  3. anti-correlated, if $\rho = -1$.

The numerator of the correlation coefficient is known as the covariance of $x_1, x_2$,

\operatorname{cov}(x_1, x_2) = E[x_1 x_2].

The correlation coefficient can be interpreted as a normalization of the covariance, and (up to the $1/N$ factor in the expectation) the relation

\operatorname{cov}(x_1, x_2) = \bm{x}_1^T \bm{x}_2 = \rho(x_1, x_2) \, \| \bm{x}_1 \| \, \| \bm{x}_2 \|,

is the two-variable version of a more general relationship encountered when the system inputs and outputs become vectors.

Patterns in data.
Consider now a related problem: whether the input and output parameters $\bm{x} \in \mathbb{R}^n$, $\bm{y} \in \mathbb{R}^m$ thought to characterize a system are actually well chosen, or whether they are redundant in the sense that a more insightful description is furnished by $\bm{u} \in \mathbb{R}^q$, $\bm{v} \in \mathbb{R}^p$ with fewer components, $p < m$, $q < n$. Applying the same ideas as in the correlation coefficient, a sequence of $N$ measurements is made, leading to the data sets

\bm{X} = \begin{bmatrix} \bm{x}_1 & \bm{x}_2 & \dots & \bm{x}_n \end{bmatrix} \in \mathbb{R}^{N \times n}, \quad \bm{Y} = \begin{bmatrix} \bm{y}_1 & \bm{y}_2 & \dots & \bm{y}_m \end{bmatrix} \in \mathbb{R}^{N \times m}.

Again, by appropriate choice of the origin, the means of the above measurements are assumed to be zero,

E[\bm{x}] = \bm{0}, \quad E[\bm{y}] = \bm{0}.

Covariance matrices can be constructed by

\bm{C}_{\bm{X}} = \bm{X}^T \bm{X} = \begin{bmatrix} \bm{x}_1^T \\ \bm{x}_2^T \\ \vdots \\ \bm{x}_n^T \end{bmatrix} \begin{bmatrix} \bm{x}_1 & \bm{x}_2 & \dots & \bm{x}_n \end{bmatrix} = \begin{bmatrix} \bm{x}_1^T \bm{x}_1 & \bm{x}_1^T \bm{x}_2 & \dots & \bm{x}_1^T \bm{x}_n \\ \bm{x}_2^T \bm{x}_1 & \bm{x}_2^T \bm{x}_2 & \dots & \bm{x}_2^T \bm{x}_n \\ \vdots & \vdots & \ddots & \vdots \\ \bm{x}_n^T \bm{x}_1 & \bm{x}_n^T \bm{x}_2 & \dots & \bm{x}_n^T \bm{x}_n \end{bmatrix} \in \mathbb{R}^{n \times n}.

Consider now the SVDs $\bm{C}_{\bm{X}} = \bm{N} \bm{\Lambda} \bm{N}^T$, $\bm{X} = \bm{U} \bm{\Sigma} \bm{S}^T$, and from

\bm{C}_{\bm{X}} = \bm{X}^T \bm{X} = (\bm{U} \bm{\Sigma} \bm{S}^T)^T \bm{U} \bm{\Sigma} \bm{S}^T = \bm{S} \bm{\Sigma}^T \bm{U}^T \bm{U} \bm{\Sigma} \bm{S}^T = \bm{S} \bm{\Sigma}^T \bm{\Sigma} \bm{S}^T = \bm{N} \bm{\Lambda} \bm{N}^T,

identify $\bm{N} = \bm{S}$ and $\bm{\Lambda} = \bm{\Sigma}^T \bm{\Sigma}$.

Recall that the SVD returns an ordered set of singular values $\sigma_1 \geq \sigma_2 \geq \dots$, and associated singular vectors. In many applications the singular values decrease quickly, often exponentially fast. Taking the first $q$ singular modes then gives a basis set suitable for model reduction,

\bm{x} = \bm{S}_q \bm{u} = \begin{bmatrix} \bm{s}_1 & \bm{s}_2 & \dots & \bm{s}_q \end{bmatrix} \bm{u}.
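The identification $\bm{N} = \bm{S}$, $\bm{\Lambda} = \bm{\Sigma}^T \bm{\Sigma}$ and the extraction of a reduced basis can be checked numerically; the sketch below (NumPy; the synthetic low-rank-plus-noise data is an illustrative assumption) builds $\bm{S}_q$ from the first $q$ right singular vectors:

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, q = 500, 10, 2                 # samples, variables, reduced dimension

# Synthetic centered data dominated by q latent directions (illustrative)
X = rng.standard_normal((N, q)) @ rng.standard_normal((q, n))
X += 0.01 * rng.standard_normal((N, n))
X -= X.mean(axis=0)                  # zero-mean columns

_, sigma, St = np.linalg.svd(X, full_matrices=False)
C = X.T @ X                          # covariance matrix C_X (unnormalized)
lam = np.linalg.eigvalsh(C)[::-1]    # eigenvalues of C_X, descending

print(np.allclose(lam, sigma**2))    # True: Lambda = Sigma^T Sigma
S_q = St[:q].T                       # reduced basis: first q right singular vectors
```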

3. Stochastic systems - Karhunen-Loève theorem

The data reduction inherent in SVD representations is a generic feature of natural phenomena. A paradigm for physical systems is the evolution of correlated behavior against a backdrop of thermal energy, typically represented as a form of noise.

One mathematical technique to model such systems is the definition of a stochastic process $\{ X_t \}_{a \leq t \leq b}$, where for each fixed $t$, $X_t$ is a random variable, i.e., a measurable function $X : \Omega \to E$ from a set of possible outcomes $\Omega$ to a measurable space $E$. The set $\Omega$ is the sample space of a probability triple $(\Omega, \mathcal{F}, P)$, where for $S \subseteq E$

P(X \in S) = P(\{ \omega \in \Omega \mid X(\omega) \in S \}).

A measurable space is a set coupled with a procedure to determine measurable subsets, known as a $\sigma$-algebra.

Theorem. Let $X_t$ be a zero-mean ($\mathbb{E}[X_t] = 0$), square-integrable stochastic process defined over the probability space $(\Omega, \mathcal{F}, P)$ and indexed by $t$, $a \leq t \leq b$. Then $X_t$ admits a representation

X_t = \sum_{k=1}^{\infty} Z_k e_k(t),

with

Z_k = \int_a^b X_t \, e_k(t) \, \mathrm{d} t, \quad \mathbb{E}[Z_k] = 0, \quad \mathbb{E}[Z_i Z_j] = \delta_{ij} \sigma_j.
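As a classical example, for the Wiener process on $[0, 1]$ the eigenfunctions are $e_k(t) = \sqrt{2} \sin((k - \tfrac{1}{2}) \pi t)$. The sketch below (NumPy; the discretization sizes are arbitrary choices) estimates the leading Karhunen-Loève mode from an ensemble of sample paths via the SVD of the previous section and compares it with the analytical $e_1$:

```python
import numpy as np

rng = np.random.default_rng(4)
N, paths = 200, 2000                     # time steps, sample paths (illustrative)
t = (np.arange(N) + 0.5) / N             # midpoints of [0, 1] subintervals

dW = rng.standard_normal((paths, N)) / np.sqrt(N)
W = np.cumsum(dW, axis=1)                # zero-mean Wiener sample paths

# Empirical KL modes: right singular vectors of the path ensemble
_, _, Vt = np.linalg.svd(W, full_matrices=False)

e1 = np.sqrt(2.0) * np.sin(0.5 * np.pi * t)   # analytical first mode
e1_hat = Vt[0] * np.sqrt(N)              # rescale to unit L2([0, 1]) norm
print(abs(e1 @ e1_hat) / N)              # close to 1: the modes align
```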