MATH347DS L06: The singular value decomposition
The singular value decomposition (SVD)
Motivation
Theorem
Another essential diagram: SVD finds orthonormal bases for the fundamental matrix subspaces
SVD computation
Rank-1 expansion of a matrix
SVD in image compression, analysis
The pseudo-inverse
Motivation: the singular value decomposition (SVD)
Motivation: the FTLA does not specify bases for the four fundamental subspaces $C(A)$, $N(A^T)$, $C(A^T)$, $N(A)$.
Question: Is there some “natural” basis for the fundamental matrix subspaces?
Consider the linear mapping $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$, $f(x) = A x$, $A \in \mathbb{R}^{m \times n}$
The input is given in the identity matrix basis, $x = I_n x$
The output is also obtained in the identity matrix basis, $y = I_m y = A x$
In these bases the effect of $A$ might be costly to compute
What would be simpler? One possibility: a simple scaling of each component, $y_i = \sigma_i x_i$
Stating desired behavior
Seek different bases for the domain and codomain of $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$, $f(x) = A x$:
an orthonormal basis in $\mathbb{R}^n$, $V = [\, v_1 \ v_2 \ \dots \ v_n \,]$, $V^T V = V V^T = I$
an orthonormal basis in $\mathbb{R}^m$, $U = [\, u_1 \ u_2 \ \dots \ u_m \,]$, $U^T U = U U^T = I$
impose that the effect of $A$ in the new bases is a simple component scaling, $A v_j = \sigma_j u_j$
Note that $A V = U \Sigma$ implies $A = U \Sigma V^T$, with $\Sigma$ containing the scaling factors $\sigma_j$
Can it be done? Yes: Singular value decomposition theorem
Theorem. (SVD) For any $A \in \mathbb{R}^{m \times n}$ there exist orthogonal $U \in \mathbb{R}^{m \times m}$, $V \in \mathbb{R}^{n \times n}$, and pseudo-diagonal $\Sigma \in \mathbb{R}^{m \times n}$, such that $A = U \Sigma V^T$.
The SVD is determined by eigendecomposition of $A^T A$ and $A A^T$ (a numerical check follows below):
$A^T A = V (\Sigma^T \Sigma) V^T$, an eigendecomposition of $A^T A$. The columns of $V$ are eigenvectors of $A^T A$ and called right singular vectors of $A$
$A A^T = U (\Sigma \Sigma^T) U^T$, an eigendecomposition of $A A^T$. The columns of $U$ are eigenvectors of $A A^T$ and called left singular vectors of $A$
The matrix $\Sigma$ has zero elements except for the diagonal, which contains $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_{\min(m,n)} \geq 0$, the singular values of $A$, computed as the square roots of the eigenvalues of $A^T A$ (or $A A^T$)
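A minimal numerical check of the theorem, reusing the 2×2 matrix from the computation slide below (otherwise an arbitrary choice): the square roots of the eigenvalues of $A^T A$ should reproduce the singular values.
∴ using LinearAlgebra
∴ A = [2. -1.; -3. 1.]                # example matrix (same as in the computation slide)
∴ F = svd(A)                          # F.U, F.S, F.Vt with A = F.U*Diagonal(F.S)*F.Vt
∴ λ = eigvals(Symmetric(A'*A))        # real eigenvalues of AᵀA, in ascending order
∴ sqrt.(reverse(λ)) ≈ F.S             # σᵢ = √λᵢ(AᵀA) → true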
An all-encompassing diagram: SVD and matrix subspaces
SVD of $A = U \Sigma V^T$, with $r = \operatorname{rank}(A)$, reveals orthonormal bases for all four fundamental subspaces: $C(A) = \operatorname{span}\{u_1, \dots, u_r\}$, $N(A^T) = \operatorname{span}\{u_{r+1}, \dots, u_m\}$, $C(A^T) = \operatorname{span}\{v_1, \dots, v_r\}$, $N(A) = \operatorname{span}\{v_{r+1}, \dots, v_n\}$, as extracted in the sketch below
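A minimal sketch of reading off the four subspace bases from a full SVD; the rank-1 matrix below is an assumption chosen for illustration.
∴ using LinearAlgebra
∴ A = [1. 2. 3.; 2. 4. 6.]            # rank-1 example (assumption), m=2, n=3
∴ F = svd(A, full=true); r = rank(A)  # full SVD: F.U is 2×2, F.V is 3×3
∴ F.U[:, 1:r]                         # orthonormal basis for C(A)
∴ F.U[:, r+1:end]                     # orthonormal basis for N(Aᵀ)
∴ F.V[:, 1:r]                         # orthonormal basis for C(Aᵀ)
∴ F.V[:, r+1:end]                     # orthonormal basis for N(A)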
SVD Computation
From $A = U \Sigma V^T$ deduce $A^T A = V (\Sigma^T \Sigma) V^T$ and $A A^T = U (\Sigma \Sigma^T) U^T$, hence $V$ is the eigenvector matrix of $A^T A$ and $U$ is the eigenvector matrix of $A A^T$
SVD computation is available in Julia, Octave, Matlab, Mathematica ...
∴ using LinearAlgebra
∴ short(x) = round(x,digits=6);
∴ A=[2 -1; -3 1]; F=svd(A); U=F.U; Σ=Diagonal(F.S); Vt=F.Vt; short.([A U*Σ*Vt])
∴ short.([U Σ Vt'])
(Figure: diagram of the SVD of $A$ and the associated subspace bases.)
Additive decomposition of $A$
$A = U \Sigma V^T$; carrying out the block multiplication gives the expansion $A = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \dots + \sigma_r u_r v_r^T$
The above is known as a “rank-one” expansion since $\operatorname{rank}(u_i v_i^T) = 1$. Note that $u_i v_i^T \in \mathbb{R}^{m \times n}$ is a matrix whose columns are scalings of $u_i$
SVD theorem: $A = \sum_{i=1}^{r} \sigma_i u_i v_i^T$ with $r = \operatorname{rank}(A)$. Often the singular values decay rapidly, $\sigma_1 \geq \sigma_2 \geq \dots$, so the first few terms capture most of $A$ (verified in the sketch below)
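A short check of the rank-one expansion on the same example matrix (an arbitrary choice):
∴ using LinearAlgebra
∴ A = [2. -1.; -3. 1.]                                 # example matrix (assumption)
∴ U, σ, V = svd(A)                                     # destructure SVD factors
∴ sum(σ[i]*U[:,i]*V[:,i]' for i=1:length(σ)) ≈ A       # Σᵢ σᵢ uᵢ vᵢᵀ recovers A → true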
Low-rank matrix approximation
Full SVD: $A = U \Sigma V^T = \sum_{i=1}^{r} \sigma_i u_i v_i^T$
Truncated SVD: $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$, $k < r$, a rank-$k$ approximation of $A$
Many applications, e.g., image compression (see the sketch below)
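A sketch of rank-$k$ compression on a synthetic “image”; the matrix X below stands in for pixel data and is an assumption.
∴ using LinearAlgebra
∴ X = [i*j + 0.01*randn() for i=1:64, j=1:64]   # synthetic image with correlated columns (assumption)
∴ U, σ, V = svd(X)
∴ k = 2                                         # keep only the k largest singular values
∴ Xk = U[:,1:k]*Diagonal(σ[1:k])*V[:,1:k]'      # rank-k approximation: k(m+n+1) numbers vs mn
∴ norm(X - Xk)/norm(X)                          # small relative error despite heavy truncation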
Why does SVD image compression work?
Consider $x, y \in \mathbb{R}^m$, data streams in time of inputs and outputs
Is there some function linking outputs to inputs, $y = f(x)$?
Seek answer by first asking: is $y$ correlated to $x$?
Introduce mean values $\bar{x} = E[x] = \frac{1}{m} \sum_{i=1}^{m} x_i$, $\bar{y} = E[y] = \frac{1}{m} \sum_{i=1}^{m} y_i$
$E$ is the expectation, a linear mapping, whose associated matrix is $\frac{1}{m} [\, 1 \ 1 \ \dots \ 1 \,]$
Shift data such that $\bar{x} = \bar{y} = 0$. Define correlation coefficient $\rho_{xy} = \frac{x^T y}{\lVert x \rVert \, \lVert y \rVert}$
uncorrelated, if $\rho_{xy} \cong 0$; correlated, if $\rho_{xy} \cong 1$; anti-correlated, if $\rho_{xy} \cong -1$.
Examples
Correlated signals
∴ t=0:0.01:1; x1=1.0*t; x2=t.^2; rho=transpose(x1)*x2/norm(x1)/norm(x2)
Uncorrelated signals
∴ m=size(x1)[1]; x3=2*(rand(m,1).-0.5)[:,1]; rho=transpose(x1)*x3/norm(x1)/norm(x3)
Anticorrelated signals
∴ x4=-t.^2; rho=transpose(x1)*x4/norm(x1)/norm(x4)
Extend correlation to input, output vectors
Are input and output parameters $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$ well chosen?
Perhaps components are redundant; a more economical description might be possible
Extend idea from correlation coefficient: take $N$ measurements, gathered as columns of data matrices $X = [\, x^{(1)} \ \dots \ x^{(N)} \,] \in \mathbb{R}^{n \times N}$, $Y = [\, y^{(1)} \ \dots \ y^{(N)} \,] \in \mathbb{R}^{m \times N}$
Choose origin such that $E[x] = E[y] = 0$
Covariance matrix (generalization of single variable variance): $C_X = \frac{1}{N} X X^T$ (a minimal computation follows below)
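A minimal covariance computation, assuming measurements stored as the columns of X; the synthetic correlated data is an assumption.
∴ using LinearAlgebra, Statistics
∴ N = 1000; x1 = randn(N)
∴ X = [x1'; (0.9*x1 + 0.1*randn(N))']   # 2×N data, second component correlated with first (assumption)
∴ X = X .- mean(X, dims=2)              # shift origin so E[x] = 0
∴ C = X*X'/N                            # covariance matrix; large off-diagonal entries ⇒ correlated components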
Reduced description by truncation of covariance matrix SVD
SVDs: $X = U \Sigma V^T$, $Y = \tilde{U} \tilde{\Sigma} \tilde{V}^T$, hence $C_X = \frac{1}{N} U (\Sigma \Sigma^T) U^T$
Take first $k$ column vectors of $U$, $U_k = [\, u_1 \ \dots \ u_k \,]$, those associated with the largest singular values
System description in terms of $\tilde{x} = U_k^T x \in \mathbb{R}^k$ is more economical than that in terms of $x \in \mathbb{R}^n$
In image compression, successive pixel columns are correlated and reduced descriptions are possible (see the sketch below)
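A sketch of the reduced description on synthetic correlated data (the data and the choice k = 1 are assumptions):
∴ using LinearAlgebra, Statistics
∴ N = 1000; x1 = randn(N)
∴ X = [x1'; (0.9*x1 + 0.1*randn(N))']   # correlated 2×N data (assumption)
∴ X = X .- mean(X, dims=2)
∴ U, σ, V = svd(X); k = 1               # keep only the dominant direction
∴ Z = U[:,1:k]'*X                       # economical k×N description, x̃ = Uₖᵀx
∴ norm(X - U[:,1:k]*Z)/norm(X)          # small ⇒ little information lost in reduction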
SVD solution of linear algebra problems: linear systems, least squares
Consider linear system $A x = b$, $A \in \mathbb{R}^{n \times n}$, $\operatorname{rank}(A) = n$. SVD solution steps:
Compute the SVD, $A = U \Sigma V^T$;
Find the coordinates of $b$ in the orthogonal basis $U$, $c = U^T b$;
Scale the coordinates $c_i$ by the inverse of the singular values, $y_i = c_i / \sigma_i$, $i = 1, \dots, n$, such that $\Sigma y = c$ is satisfied;
Find the coordinates of $x$ in basis $V$, $x = V y$.
What if $m \neq n$, or $\operatorname{rank}(A) = r < \min(m, n)$? The above procedure still works with a simple modification of step 3, with $i$ going now from 1 to $r$:
$A = U \Sigma V^T$;
$c = U^T b$;
$y_i = c_i / \sigma_i$, $i = 1, \dots, r$; $y_i = 0$, $i = r+1, \dots, n$,
$x = V y$.
If $b \in C(A)$, an exact solution is obtained. If $b \notin C(A)$, the above steps give the best approximation of $b$ by a linear combination of the columns of $A$ in the 2-norm, i.e., the least squares solution (see the sketch below)
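A sketch of the four solution steps on a small overdetermined system; the matrix and right-hand side are assumptions.
∴ using LinearAlgebra
∴ A = [1. 0.; 1. 1.; 1. 2.]; b = [1., 2., 2.]   # overdetermined system, m=3 > n=2 (assumption)
∴ U, σ, V = svd(A)                              # step 1: (thin) SVD
∴ c = U'*b                                      # step 2: coordinates of b in basis U
∴ r = sum(σ .> 1e-12)                           # numerical rank
∴ y = c[1:r] ./ σ[1:r]                          # step 3: scale by 1/σᵢ, i = 1,…,r
∴ x = V[:,1:r]*y                                # step 4: x in terms of basis V
∴ x ≈ A\b                                       # matches the least squares solution → true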
Pseudo-inverse matrix
Since the steps to solve a linear system or find the best approximation are identical, define a matrix that carries out all steps:
Recall $A = U \Sigma V^T$. Define $\Sigma^+ \in \mathbb{R}^{n \times m}$, a pseudo-diagonal matrix with diagonal elements $1/\sigma_1, \dots, 1/\sigma_r, 0, \dots, 0$.
Gather all above steps into a single matrix $A^+ = V \Sigma^+ U^T \in \mathbb{R}^{n \times m}$, called the pseudo-inverse of $A$.
The solution to a linear system (either exact solution or best approximation) is then $x = A^+ b$.
In Julia, Matlab, Octave, the above procedure is implemented through x=A\b; a sketch constructing $A^+$ explicitly follows below.
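A sketch constructing $A^+$ from the SVD factors and comparing with the built-ins; the full-rank example matrix is an assumption, so all $1/\sigma_i$ are defined.
∴ using LinearAlgebra
∴ A = [1. 0.; 1. 1.; 1. 2.]; b = [1., 2., 2.]   # same example system (assumption)
∴ U, σ, V = svd(A)
∴ Ap = V*Diagonal(1 ./ σ)*U'                    # A⁺ = VΣ⁺Uᵀ (full-rank case)
∴ Ap ≈ pinv(A)                                  # agrees with the built-in pseudo-inverse → true
∴ Ap*b ≈ A\b                                    # x = A⁺b is the least squares solution → true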