MATH661

Lecture 25: Introduction to nonlinear approximation

All of the approximation techniques presented so far have been based upon linear approximation. For example, the polynomial interpolant

p_{m} (t) = a_{0} + a_{1} t + \dots + a_{m} t^{m}

of a function $f (t)$ based upon the data set $𝒟 = \{(x_{i}, y_{i}), i = 0, 1, \dots, m\}$ is linear in the unknown coefficients $a_{0}, \dots, a_{m}$ . A topic of current research is the development of nonlinear approximation procedures, for instance an approximation of $f : ℝ \to ℝ$ by $g : ℝ \to ℝ$

f (t) ≅ g (t, 𝒂),

where $g$ is nonlinear in the parameters $𝒂 \in ℝ^{n}$ . Whereas the fundamental theorem of linear algebra completely characterizes linear approximants, there is currently no complete theory of nonlinear approximants. This has led to the widespread use of heuristic techniques, such as artificial neuron approximants that mimic biological neurons. The moniker “machine learning” has become associated with the adoption of such techniques, though the field is perhaps more insightfully seen as a natural development from linear approximants to nonlinear approximants.
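The distinction between linear and nonlinear dependence on the parameters can be checked numerically. The following is a minimal sketch, not part of the lecture: `exp_model` is a hypothetical nonlinear approximant $g (t, 𝒂) = a_{0} e^{a_{1} t}$ chosen only for illustration, and the parameter values are arbitrary. It tests whether evaluating at a combination of parameter vectors equals the same combination of the evaluations.

```python
import math

def poly(t, a):
    """Evaluate p(t) = a[0] + a[1] t + ... + a[m] t^m by Horner's rule."""
    result = 0.0
    for coeff in reversed(a):
        result = result * t + coeff
    return result

def exp_model(t, a):
    """Hypothetical nonlinear model g(t, a) = a[0] * exp(a[1] * t)."""
    return a[0] * math.exp(a[1] * t)

def combine(alpha, a, beta, b):
    """Return the parameter vector alpha * a + beta * b."""
    return [alpha * ai + beta * bi for ai, bi in zip(a, b)]

# Arbitrary illustrative values (assumptions, not from the notes).
t = 0.7
a, b = [1.0, -2.0], [0.5, 3.0]
alpha, beta = 2.0, -1.5
c = combine(alpha, a, beta, b)

# Linear in the parameters: the gap vanishes up to rounding.
lin_gap = abs(poly(t, c) - (alpha * poly(t, a) + beta * poly(t, b)))

# Nonlinear in the parameters: the gap is of order one.
nonlin_gap = abs(
    exp_model(t, c) - (alpha * exp_model(t, a) + beta * exp_model(t, b))
)
print(lin_gap, nonlin_gap)
```

For the polynomial the gap is at rounding level, while for the exponential model superposition in the parameters fails.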

1.Historical analogue - operator calculus

The appearance of heuristic solution techniques in nonlinear approximation is typical of the exploration of new mathematical fields. An instructive comparable development is the refinement of the formal rules of Heaviside operator calculus into a complete theory of distributions after some six decades of mathematical research.

1.1.Heaviside's study of the telegrapher's equations

In the late nineteenth century, Heaviside studied the telegrapher's equations, a system of linear PDEs for the current $I (x, t)$ and voltage $V (x, t)$ along a transmission line,

\frac{\partial}{\partial x} V (x, t) = - L \frac{\partial}{\partial t} I (x, t) - R I (x, t)

\frac{\partial}{\partial x} I (x, t) = - C \frac{\partial}{\partial t} V (x, t) - G V (x, t)

Heaviside avoided direct solution of the PDEs by reducing them to an algebraic formulation. For example, for the ODE for $y (t)$

\frac{d y}{d t} + a y = b

Heaviside considered the associated algebraic problem for $Y (s)$

s Y + a Y = b \Rightarrow Y (s) = \frac{b}{a + s} \Rightarrow y (t) = ℒ^{- 1} [Y (s)]
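The operational solution can be compared against direct numerical integration. The sketch below makes two assumptions that the notes do not state: a zero initial condition $y (0) = 0$, and a reading of $b / (a + s)$ in Heaviside's operational sense, that is, as an operator acting on a constant forcing switched on at $t = 0$. Under those assumptions the resulting function is $y (t) = (b / a) (1 - e^{- a t})$. The function names and the values of $a$, $b$ are illustrative only.

```python
import math

def rk4_solve(a, b, t_end, steps):
    """Integrate y' = b - a*y from y(0) = 0 to t_end with classical RK4."""
    def rhs(y):
        return b - a * y

    h = t_end / steps
    y = 0.0
    for _ in range(steps):
        k1 = rhs(y)
        k2 = rhs(y + 0.5 * h * k1)
        k3 = rhs(y + 0.5 * h * k2)
        k4 = rhs(y + h * k3)
        y += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return y

def operational_solution(a, b, t):
    """y(t) = (b/a)(1 - exp(-a t)), assuming y(0) = 0 and constant forcing b."""
    return (b / a) * (1.0 - math.exp(-a * t))

# Illustrative parameter values (assumptions, not from the notes).
a, b, t_end = 2.0, 3.0, 1.5
y_num = rk4_solve(a, b, t_end, steps=200)
y_op = operational_solution(a, b, t_end)
print(y_num, y_op)
```

The two values agree to within the time-stepping error, which is consistent with the algebraic manipulation producing a genuine solution of the ODE under the stated assumptions.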

1.2.Development of mathematical theory of operator calculus

Why should I refuse a good dinner simply because I don't understand the digestive processes involved? (Heaviside, ?)

Heaviside's formal framework (1890's) for solving ODEs was discounted since it lacked mathematical rigour.

Russian mathematicians established the first results in the 1920s (Vladimirov)
Theory of Distributions (Schwartz, 1950s)

2.Basic approximation theory

Consider a function $f : ℝ^{d} \to ℝ$ , with $d ≫ 1$ assumed large and $f$ of unknown form, difficult to compute for general input. Seek $g : ℝ^{n} \to ℝ$ , $T : ℝ^{d} \to ℝ^{n}$ such that

|| f - g \circ T || < ε

for some $ε > 0$ .

2.1.Linear approximation example

Choose a basis set (monomials, exponentials, wavelets) $\{ϕ_{1}, ϕ_{2}, \dots\}$ for the approximation of functions in the Hilbert space $L^{2} (ℝ)$

g_{n} (t) = \sum_{j = 1}^{n} (f, ϕ_{j}) ϕ_{j} = \sum_{j = 1}^{n} c_{j} ϕ_{j}

The approximation is convergent if

\lim_{n \to \infty} || f - g_{n} || = 0.

In practice this assumes that the coefficients $c_{j} = (f, ϕ_{j})$ decrease rapidly with $j$.

Theorem. (Parseval) The Fourier transform is unitary. For $A, B : ℝ \to ℂ$ , square integrable over one period and $2 π$ -periodic, with Fourier series

$A (t) = \sum_{n = - \infty}^{\infty} a_{n} e^{i n t}, B (t) = \sum_{n = - \infty}^{\infty} b_{n} e^{i n t},$

the coefficients satisfy

$\sum_{n = - \infty}^{\infty} a_{n} {\overline{b}}_{n} = \frac{1}{2 π} \int_{- π}^{π} A (t) \overline{B} (t) d t .$

Bessel inequality:

\sum_{j = 1}^{n} {| (f, ϕ_{j}) |}^{2} ⩽ {|| f ||}_{2}^{2} .
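The Bessel inequality, in the form $\sum {| (f, ϕ_{j}) |}^{2} ⩽ {|| f ||}^{2}$, can be observed numerically. The sketch below assumes the orthonormal system $ϕ_{n} (t) = e^{i n t}$ with the normalized inner product $(f, g) = \frac{1}{2 π} \int_{- π}^{π} f \overline{g} d t$, and takes a square wave as an illustrative test function, for which ${|| f ||}^{2} = 1$. The coefficients are approximated by a midpoint rule; all names are hypothetical.

```python
import cmath
import math

def fourier_coefficient(f, n, samples=4096):
    """Midpoint-rule estimate of c_n = (1/2pi) * int_{-pi}^{pi} f(t) e^{-int} dt."""
    h = 2.0 * math.pi / samples
    total = 0j
    for k in range(samples):
        t = -math.pi + (k + 0.5) * h
        total += f(t) * cmath.exp(-1j * n * t)
    return total * h / (2.0 * math.pi)

def square_wave(t):
    """Illustrative test function: +1 on [0, pi), -1 on [-pi, 0)."""
    return 1.0 if t >= 0.0 else -1.0

# Squared magnitudes of the coefficients for |n| <= 20.
max_n = 20
power = {n: abs(fourier_coefficient(square_wave, n)) ** 2
         for n in range(-max_n, max_n + 1)}

# Partial sums of |c_n|^2 over |n| <= N for increasing cutoffs N.
cutoffs = (5, 10, 20)
energies = [sum(p for n, p in power.items() if abs(n) <= N) for N in cutoffs]

norm_sq = 1.0  # (1/2pi) * integral of |f|^2 for the square wave
for N, e in zip(cutoffs, energies):
    print(N, e, e <= norm_sq)
```

The partial sums increase with the cutoff and stay below the squared norm, approaching it slowly because the square-wave coefficients decay only as $1 / n$.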

Fourier coefficient decay: for $f \in C^{(k - 1)} (ℝ)$ , $f^{(k - 1)}$ absolutely continuous,

| c_{n} | ⩽ \min_{0 ⩽ j ⩽ k} \frac{{|| f^{(j)} ||}_{1}}{{| n |}^{j}} .

In practice: coefficients decay as

$1 / n$ for functions with discontinuities on a set of Lebesgue measure 0;
$1 / n^{2}$ for functions with discontinuous first derivative on a set of Lebesgue measure 0;
$1 / n^{3}$ for functions with discontinuous second derivative on a set of Lebesgue measure 0.

Fourier coefficients of analytic functions decay faster than any inverse power of $n$, $c_{n} = o (n^{- p}), \forall p \in ℕ$ , a property known as exponential convergence.
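The first two decay rates listed above can be checked numerically. This is a minimal sketch with two illustrative test functions that are not taken from the notes: a square wave (jump discontinuity) and the triangle wave $| t |$ (continuous, with a discontinuous first derivative). It assumes the convention $c_{n} = \frac{1}{2 π} \int_{- π}^{π} f (t) e^{- i n t} d t$ and approximates the integral by a midpoint rule. For odd $n$ both scaled sequences should approach the constant $2 / π$.

```python
import cmath
import math

def fourier_coefficient(f, n, samples=4096):
    """Midpoint-rule estimate of c_n = (1/2pi) * int_{-pi}^{pi} f(t) e^{-int} dt."""
    h = 2.0 * math.pi / samples
    total = 0j
    for k in range(samples):
        t = -math.pi + (k + 0.5) * h
        total += f(t) * cmath.exp(-1j * n * t)
    return total * h / (2.0 * math.pi)

def square_wave(t):
    """Jump discontinuity at t = 0 (and at the period boundary)."""
    return 1.0 if t >= 0.0 else -1.0

def triangle_wave(t):
    """Continuous, with a kink (discontinuous derivative) at t = 0."""
    return abs(t)

odd_modes = (1, 3, 5, 7, 9, 11)

# If |c_n| ~ C / n, then n * |c_n| is roughly constant.
scaled_square = [n * abs(fourier_coefficient(square_wave, n)) for n in odd_modes]

# If |c_n| ~ C / n^2, then n^2 * |c_n| is roughly constant.
scaled_triangle = [n**2 * abs(fourier_coefficient(triangle_wave, n))
                   for n in odd_modes]

for n, s, tr in zip(odd_modes, scaled_square, scaled_triangle):
    print(n, s, tr)
```

Only odd modes are used because the even-mode coefficients of both test functions vanish for $n \neq 0$.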

Denote such approximation operators by $ℒ$ ; they are linear,

ℒ (α f + β g) = α ℒ (f) + β ℒ (g)

2.2.Non-Linear approximation example

Choose a basis set (monomials, exponentials, wavelets) $\{ϕ_{1}, ϕ_{2}, \dots\}$ for the approximation of functions in the Hilbert space $L^{2} (ℝ)$

g_{n} (t) = \sum_{j = 1}^{n} c_{j} ϕ_{j}

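Let $Φ_{n} = \{ϕ_{k (1)}, ϕ_{k (2)}, \dots, ϕ_{k (n)}\}$ be the $n$ basis functions with the largest coefficients in absolute value, ordered such that

| (f, ϕ_{k (1)}) | ⩾ | (f, ϕ_{k (2)}) | ⩾ \dots ⩾ | (f, ϕ_{k (n)}) | .

Choose $c_{j}$ = $(f, ϕ_{k (j)})$ , and

g_{n} (t) = \sum_{j = 1}^{n} c_{j} ϕ_{k (j)} .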

Denote such approximation operators by $𝒢$ ; they are nonlinear, since the selected index set $k (1), \dots, k (n)$ depends on $f$.
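The difference between the two constructions can be seen by working directly with coefficient vectors in an orthonormal basis. The following is a minimal sketch under that assumption; the coefficient values and function names are hypothetical. `linear_approx` keeps a fixed index set (the first $n$ coefficients), while `nterm_approx` keeps the $n$ coefficients of largest absolute value.

```python
import math

def linear_approx(coeffs, n):
    """Keep the first n coefficients; the index set does not depend on f."""
    return [c if j < n else 0.0 for j, c in enumerate(coeffs)]

def nterm_approx(coeffs, n):
    """Keep the n coefficients largest in absolute value; the index set depends on f."""
    order = sorted(range(len(coeffs)), key=lambda j: abs(coeffs[j]), reverse=True)
    keep = set(order[:n])
    return [c if j in keep else 0.0 for j, c in enumerate(coeffs)]

def add(u, v):
    """Componentwise sum of two coefficient vectors."""
    return [ui + vi for ui, vi in zip(u, v)]

def error(coeffs, approx):
    """l2 norm of the discarded coefficients (equals the L2 error for an orthonormal basis)."""
    return math.sqrt(sum((c - a) ** 2 for c, a in zip(coeffs, approx)))

# Hypothetical coefficient vectors of two functions f and g.
f = [0.1, 0.0, 3.0, 0.2, -2.0]
g = [1.0, 0.5, -2.9, 0.0, 0.1]
n = 2

# The fixed-index truncation commutes with addition.
linear_is_additive = linear_approx(add(f, g), n) == add(linear_approx(f, n),
                                                         linear_approx(g, n))

# The n-term selection does not commute with addition.
nterm_is_additive = nterm_approx(add(f, g), n) == add(nterm_approx(f, n),
                                                       nterm_approx(g, n))

print(linear_is_additive, nterm_is_additive)
print(error(f, linear_approx(f, n)), error(f, nterm_approx(f, n)))
```

For the same number of retained terms, the selection by magnitude never gives a larger error than the fixed truncation, at the price of losing superposition.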

3.Nonlinear approximation by composition

Consider a function $f : ℝ^{d} \to ℝ$ , with $d ≫ 1$ assumed large and $f$ of unknown form, difficult to compute for general input. Seek $g : ℝ^{n} \to ℝ$ , $T : ℝ^{d} \to ℝ^{n}$ such that

|| f - g \circ T || < ε

for some $ε > 0$ .

What questions do you ask?

Does $T$ exist?

$\forall f, ε, \exists T$ such that $|| f - g \circ T || < ε$

Can arbitrary $ε$ be achieved?

Can we construct $T$?

By what procedure?
$T = T_{1} \circ T_{2} \circ \dots \circ T_{J}$
with $T_{i}$ simple modifications of identity (ReLU)
$\min_{T_{1}, \dots, T_{J}} || f - g \circ T_{1} \circ T_{2} \circ \dots \circ T_{J} ||$
$𝑻_{j} (𝒙) = η (𝑨_{j} 𝒙 + 𝒃_{j})$
$η (t) = \begin{cases} 0 & t < 0 \\ t & t ⩾ 0 \end{cases}$
At what cost?

How big is $n$?
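The composition $T = T_{1} \circ T_{2} \circ \dots \circ T_{J}$ of ReLU layers $𝑻_{j} (𝒙) = η (𝑨_{j} 𝒙 + 𝒃_{j})$ can be sketched in a few lines. The example below is illustrative only: the helper names `layer` and `compose` are hypothetical, and the weights are picked by hand (not obtained from the minimization above) so that the composition reproduces $| x | = η (x) + η (- x)$.

```python
def relu(t):
    """eta(t) = 0 for t < 0, t for t >= 0."""
    return t if t >= 0 else 0.0

def layer(A, b):
    """Return the map T(x) = eta(A x + b), with eta applied componentwise."""
    def T(x):
        return [relu(sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i)
                for row, b_i in zip(A, b)]
    return T

def compose(*maps):
    """compose(T1, T2, ..., TJ)(x) = T1(T2(... TJ(x))); TJ is applied first."""
    def T(x):
        for m in reversed(maps):
            x = m(x)
        return x
    return T

# Hand-picked weights (assumption): T2 maps R -> R^2, T1 maps R^2 -> R.
T2 = layer([[1.0], [-1.0]], [0.0, 0.0])   # x -> (eta(x), eta(-x))
T1 = layer([[1.0, 1.0]], [0.0])           # (u, v) -> eta(u + v)
T = compose(T1, T2)

for x in (-3.0, 0.0, 2.5):
    print(x, T([x]))
```

In an actual approximation procedure the matrices $𝑨_{j}$ and vectors $𝒃_{j}$ would be found by the minimization over $T_{1}, \dots, T_{J}$ rather than chosen by hand.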