|
This text introduces the key concepts of scientific computing.
This textbook is drafted and meant to be worked through in TeXmacs, a scientific editing platform that features “live documents” with embedded computational examples constructed in freely available mathematical software systems such as Asymptote, Eukleides, Gnuplot, Julia, Maxima, and Octave.
2.1. Linear Combinations
2.2. Singular Value Decomposition
2.3. Problems of Linear Algebra
2.4. Linear Operator Factorization
3.1. Pointwise approximation
3.1.1. Interpolation in the monomial basis
3.1.2. Interpolation in orthogonal bases
3.1.3. Interpolation in the B-spline basis
3.2. Approximation over an interval
4.1. Linear Operators
4.1.1. Integration
4.1.2. Differentiation
4.2. Nonlinear Operators
4.2.1. Linear approximants
4.2.2. Nonlinear approximants
Most scientific disciplines introduce an idea of the amount of some entity or property of interest. Furthermore, the amount is usually combined with the concept of a number, an abstraction of the observation that the two sets {Mary, Jane, Tom} and {apple, plum, cherry} seem quite different, but we can match one distinct person to one distinct fruit as in {Mary→plum, Jane→apple, Tom→cherry}. In contrast, we cannot do the same matching of distinct persons to a distinct color from the set {red, green}: one of the colors must be shared between two persons. A formal definition of the concept of a number from the above observations is surprisingly difficult since it would be self-referential due to the appearance of the numbers “one” and “two”. Leaving this aside, the key concept is that of a quantity of some property of interest that is expressed through a number.
Several types of numbers have been introduced in mathematics to express different types of quantities, and the following will be used throughout this text:
The set of natural numbers, $\mathbb{N} = \{0, 1, 2, \dots\}$, infinite and countable;
The set of integers, $\mathbb{Z} = \{\dots, -2, -1, 0, 1, 2, \dots\}$, infinite and countable;
The set of rational numbers, $\mathbb{Q} = \{p/q : p, q \in \mathbb{Z}, q \neq 0\}$, infinite and countable;
The set of real numbers, $\mathbb{R}$, infinite, not countable, can be ordered;
The set of complex numbers, $\mathbb{C} = \{x + iy : x, y \in \mathbb{R}\}$, infinite, not countable, cannot be ordered.
These sets of numbers form a hierarchy, with $\mathbb{N} \subset \mathbb{Z} \subset \mathbb{Q} \subset \mathbb{R} \subset \mathbb{C}$. The size of a set of numbers is an important aspect of its utility in describing natural phenomena. The set {Mary, Jane, Tom} has three elements, and its size is defined by the cardinal number 3. The sets $\mathbb{N}$ and $\mathbb{Z}$ have an infinite number of elements, but a relation such as

$$ f(n) = \begin{cases} 2n, & n \geq 0 \\ -2n - 1, & n < 0 \end{cases} $$

defines a one-to-one correspondence between $\mathbb{Z}$ and $\mathbb{N}$, so these sets are of the same size, denoted by the transfinite number $\aleph_0$ (aleph-zero). The rationals can also be placed into a one-to-one correspondence with $\mathbb{N}$, hence

$$ \operatorname{card}(\mathbb{N}) = \operatorname{card}(\mathbb{Z}) = \operatorname{card}(\mathbb{Q}) = \aleph_0 . $$

In contrast there is no one-to-one mapping of the reals to the naturals, and the cardinality of the reals is $\mathfrak{c}$ (Fraktur-script c). Georg Cantor established set theory and introduced a proof technique known as the diagonal argument to show that $\mathfrak{c} = 2^{\aleph_0} > \aleph_0$. Intuitively, there are exponentially more reals than naturals.
One of the foundations of the scientific method is quantification, ascribing numbers to phenomena of interest. To exemplify the utility of different types of number to describe natural phenomena, consider common salt (sodium chloride, Fig. ), which has the chemical formula NaCl, with the sodium ions (Na⁺) and chloride ions (Cl⁻) spatially organized in a cubic lattice with an edge length $a$ (1 Å = $10^{-10}$ m) between atoms of the same type. Setting the origin of a Cartesian coordinate system at a sodium atom, the position of some atom within the lattice is

$$ \boldsymbol{r} = \frac{a}{2} \left( i \, \boldsymbol{e}_1 + j \, \boldsymbol{e}_2 + k \, \boldsymbol{e}_3 \right), \qquad i, j, k \in \mathbb{Z} . $$

Sodium atoms are found at positions where $i + j + k$ is even, while chloride atoms are found at positions where $i + j + k$ is odd. The Cartesian coordinates $(x, y, z)$ describe some arbitrary position in space, which is conceptualized as a continuum and placed into one-to-one correspondence with $\mathbb{R}^3$. A particular lattice position can be specified simply through a label consisting of three integers $(i, j, k) \in \mathbb{Z}^3$. The position can be recovered through a scaling operation

$$ (x, y, z) = \frac{a}{2} \, (i, j, k), $$

and the number $a/2$ that modifies the length scale from the integer labels to the physical position is called a scalar.
A computer has a finite amount of memory, hence cannot represent all numbers, but rather subsets of the above number sets. Current digital computers internally use numbers represented through binary digits, or bits. Many computer number types are defined for specific purposes, and are often encountered in applications such as image representation or digital data acquisition. Here are the main types.
The number types uint8, uint16, uint32, uint64 represent subsets of the natural numbers (unsigned integers) using 8, 16, 32, 64 bits respectively. An unsigned integer with $b$ bits can store a natural number in the range from $0$ to $2^b - 1$. Two arbitrary natural numbers $m, n \in \mathbb{N}$ can be added and will give another natural number, $m + n \in \mathbb{N}$. In contrast, addition of computer unsigned integers is only defined within the specific range $0$ to $2^b - 1$. If $m + n > 2^b - 1$, the result might be displayed as the maximum possible value or as $(m + n) \bmod 2^b$ (wrap-around).
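As a brief illustration in Julia (one of the systems mentioned in the preface), 8-bit unsigned addition wraps around modulo $2^8$; the particular values below are arbitrary choices:

```julia
x = UInt8(250); y = UInt8(10)    # both within the uint8 range 0 to 255
x + y                            # wraps around: 0x04, i.e., (250 + 10) mod 256 = 4
typemin(UInt8), typemax(UInt8)   # (0x00, 0xff), i.e., 0 and 255
```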
The number types int8, int16, int32, int64 represent subsets of the integers. One bit is used to store the sign of the number, so the subset of $\mathbb{Z}$ that can be represented with $b$ bits is from $-2^{b-1}$ to $2^{b-1} - 1$.
Computers approximate the real numbers through the set of floating point numbers. Floating point numbers that use 32 bits are known as single precision, while those that use 64 bits are double precision. A floating point number is stored internally as $x = \pm (0.b_1 b_2 \dots b_m)_2 \times 2^{\pm (e_1 e_2 \dots e_p)_2}$, where $b_1, \dots, b_m$ are bits within the mantissa of length $m$, and $e_1, \dots, e_p$ are bits within the exponent of length $p$, along with signs for each. The default number type is usually double precision, more concisely referred to as Float64. Common irrational constants such as $\pi$ and $e$ are predefined as irrationals, and casting to Float64 or Float32 gives a floating point approximation. Unicode notation is recognized. Specification of a decimal point indicates a floating point number; its absence indicates an integer.
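A brief Julia illustration of these conventions (the reported integer type assumes a 64-bit machine):

```julia
typeof(2), typeof(2.0)    # (Int64, Float64): the decimal point selects a float
π                         # an Irrational constant, displayed as π = 3.1415926535897...
Float64(π), Float32(π)    # 3.141592653589793, 3.1415927f0
```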
The approximation of the reals by the floats is characterized by: floatmax(), the largest float; floatmin(), the smallest positive float; and eps(), known as machine epsilon. Machine epsilon highlights the difference between floating point and real numbers since it is defined as the largest number $\epsilon$ for which the floating point computation $1 + \epsilon$ still evaluates to $1$. In the reals, $1 + \epsilon = 1$ of course implies $\epsilon = 0$, but floating point numbers exhibit “granularity”, in the sense that over a unit interval there are small steps that are indistinguishable from zero due to the finite number of bits available for a float. The granularity of double precision expressed by machine epsilon is sufficient to represent natural phenomena, and floating point errors can usually be kept under control.
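A quick check in Julia, where `eps()` returns the spacing between 1.0 and the next larger float, closely related to the definition above:

```julia
eps(), floatmax(), floatmin()   # ≈ 2.22e-16, 1.80e308, 2.23e-308 for Float64
1.0 + eps()/2 == 1.0            # true: an increment below the granularity is lost
1.0 + eps()   == 1.0            # false: eps() is the gap to the next float above 1.0
```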
Keep in mind that perfect accuracy is a mathematical abstraction, not encountered in nature. In fields such as sociology or psychology, 3 digits of accuracy are excellent; in mechanical engineering this might increase to 6 digits, and in electronic engineering to 8 digits. The most precisely known physical constant is the Rydberg constant, known to about 12 digits, hence a mathematical statement of exact equality such as

$$ x = \pi $$

is unlikely to have any real significance, while a statement of approximate equality with a stated accuracy such as

$$ | x - \pi | \leq 10^{-12} $$

is much more informative.
Within the reals certain operations are undefined, such as division by zero. Special float constants are defined to handle such situations: Inf is a float meant to represent infinity, and NaN (“not a number”) is meant to represent an undefinable result of an arithmetic operation such as $0/0$.
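A brief Julia illustration:

```julia
1.0 / 0.0, -1.0 / 0.0    # (Inf, -Inf)
0.0 / 0.0                # NaN
Inf + 1.0, NaN == NaN    # (Inf, false): NaN compares unequal even to itself
```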
Complex numbers are specified by two reals, in Cartesian form as $z = x + i y$, $x, y \in \mathbb{R}$, or in polar form as $z = \rho e^{i\theta}$, $\rho = |z|$, $\theta = \arg z$. The computer type complex is similarly defined from two floats, and the additional constant I is defined to represent $\sqrt{-1}$. Functions are available to obtain the real and imaginary parts within the Cartesian form, or the absolute value and argument of the polar form.
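A minimal sketch in Julia, where the imaginary unit is written `im` (the value of `z` is an arbitrary choice):

```julia
z = 1.0 + 2.0im               # Cartesian form from two floats
real(z), imag(z)              # (1.0, 2.0)
abs(z), angle(z)              # polar form: modulus and argument in radians
abs(z) * exp(angle(z) * im)   # reconstructs z, up to roundoff
```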
The reals form an algebraic structure known as a field, $(\mathbb{R}, +, \cdot)$. The set of floats together with floating point addition and multiplication is denoted as $(\mathbb{F}, \oplus, \otimes)$. Operations with floats do not have the same properties as those with reals, but are assumed to have a relative error bounded by machine epsilon,

$$ \frac{| \mathrm{fl}(x) - x |}{| x |} \leq \epsilon, $$

where $\mathrm{fl}(x)$ is the floating point representation of $x \in \mathbb{R}$. The above is often used in error analysis as

$$ \mathrm{fl}(x) = x \, (1 + \delta), \qquad | \delta | \leq \epsilon . $$
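A quick Julia check of this bound for a number that is not exactly representable in binary (the value 0.1 is an arbitrary choice):

```julia
x  = 0.1                         # Float64 representation of the real number 1/10
xb = big"0.1"                    # a much more precise BigFloat value of 1/10
abs(big(x) - xb) / xb <= eps()   # true: relative representation error is below eps
```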
Computer number sets are a first example of approximation: replacing some complicated object with a simpler one. It is one of the key mathematical ideas studied throughout this text.
Care should be exercised about the cumulative effect of many floating point operations. An informative example is offered by Zeno's paradox of motion, which purports that fleet-footed Achilles could never overcome an initial head start of length $a$ given to the lethargic Tortoise since, as stated by Aristotle: “In a race, the quickest runner can never overtake the slowest, since the pursuer must first reach the point whence the pursued started, so that the slower must always hold a lead.”
The above is often formulated by considering that the first half of the initial head start must be overcome, then another half, and so on. The distance traversed after $n$ such steps is

$$ d_n = a \left( \frac{1}{2} + \frac{1}{4} + \dots + \frac{1}{2^n} \right) = a \left( 1 - \frac{1}{2^n} \right) . $$

Calculus resolves the paradox through the rigorous definition of the limit, $\lim_{n \to \infty} d_n = a$, and the definition of instantaneous velocity, $v = \lim_{\Delta t \to 0} \Delta x / \Delta t$.
Undertake a numerical investigation and consider two scenarios, with the step sizes summed in decreasing or in increasing order (taking $a = 1$),

$$ s_n = \frac{1}{2} + \frac{1}{4} + \dots + \frac{1}{2^n}, \qquad \tilde{s}_n = \frac{1}{2^n} + \frac{1}{2^{n-1}} + \dots + \frac{1}{2} . $$

In $\mathbb{R}$, associativity ensures $s_n = \tilde{s}_n$.
Irrespective of the value of $n$, $s_n = \tilde{s}_n$ in floating point arithmetic as well. Recall however that computers use binary representations internally, so division by powers of two might have unique features (indeed, it corresponds to a bit shift operation). Try subdividing the head start by a different number, perhaps $\sqrt{2}$, to get an “irrational” numerical investigation of Zeno's paradox of motion. Define now $s_n$, the distance traversed by step sizes scaled by $r = 1/\sqrt{2}$ summed from one to $n$, and $\tilde{s}_n$, traversed by step sizes scaled by $r$ summed from $n$ down to one,

$$ s_n = r + r^2 + \dots + r^n, \qquad \tilde{s}_n = r^n + r^{n-1} + \dots + r . $$

Again, in the reals the above two expressions are equal, $s_n = \tilde{s}_n$, but this is no longer verified computationally for all $n$, not even within a tolerance of machine epsilon.
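A minimal Julia sketch of this investigation (the function names `zeno_fwd`, `zeno_bwd` and the chosen values of $r$ and $n$ are illustrative, not fixed by the text):

```julia
# Sum the n step sizes r^1, ..., r^n in decreasing (fwd) or increasing (bwd) order.
zeno_fwd(r, n) = sum(r^k for k in 1:n)      # largest terms added first
zeno_bwd(r, n) = sum(r^k for k in n:-1:1)   # smallest terms added first

for r in (0.5, 1/sqrt(2.0)), n in (10, 30, 50)
    println((r = r, n = n, diff = zeno_fwd(r, n) - zeno_bwd(r, n)))
end
```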
This example gives a first glimpse of the steps that need to be carried out in addition to mathematical analysis to fully characterize an algorithm. Since $s_n \neq \tilde{s}_n$ in floating point arithmetic, a natural question is whether one is more accurate than the other. For some arbitrary ratio $r$, the exact value of the sum is known,

$$ S_n = r + r^2 + \dots + r^n = r \, \frac{1 - r^n}{1 - r}, $$

and can be used to evaluate the errors $e_n = | s_n - S_n |$, $\tilde{e}_n = | \tilde{s}_n - S_n |$.
Carrying out the computations leads to the results shown in Fig. .
Note that the errors are about the size of machine epsilon for $s_n$ (summation starting from the largest term), but are zero for $\tilde{s}_n$: it seems that the small-to-large summation ordering gives the exact value. A bit of reflection reinforces this interpretation: first adding small quantities allows for carry-over digits to be accounted for.
This example is instructive beyond the immediate adage of “add small quantities first”. It highlights the blend of empirical and analytical approaches that is prevalent in scientific computing.
Single values given by some algorithm are of little value in the practice of scientific computing. The main goal is the construction of a sequence of approximations that enables assessment of the quality of an approximation. Recall from calculus that a sequence $\{x_n\}$ converges to $x$ if $|x_n - x|$ can be made as small as desired for all $n$ beyond some threshold. In precise mathematical language this is stated through:

$$ \lim_{n \to \infty} x_n = x \iff \forall \varepsilon > 0, \ \exists N(\varepsilon) \ \text{such that} \ n > N(\varepsilon) \Rightarrow | x_n - x | < \varepsilon . $$
Though it might seem natural to require a sequence of approximations $\{x_n\}$ to converge to an exact solution $x$,

$$ \lim_{n \to \infty} x_n = x, $$
such a condition is problematic on multiple counts:
the exact solution is rarely known;
the best approximation might be achieved for some finite range of indices $n$, rather than in the limit $n \to \infty$.
Both situations arise when approximating numbers and serve as useful reference points when considering the approximation of other types of mathematical objects such as functions. For example, the number $\pi$ is readily defined in geometric terms as the ratio of circle circumference to diameter, but can only be approximately expressed by a rational number, e.g., $\pi \cong 22/7$. The exact value of $\pi$ is only obtained as the limit of an infinite number of operations with rationals. There are many such infinite representations, one of which is the Leibniz series

$$ \frac{\pi}{4} = \sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1} = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \dots . $$
No finite truncation

$$ p_n = 4 \sum_{k=0}^{n} \frac{(-1)^k}{2k+1} $$

of the above Leibniz series equals $\pi$, i.e.,

$$ p_n \neq \pi, \quad \forall n \in \mathbb{N} . $$

Rather, the Leibniz series should be understood as an algorithm, i.e., a sequence of elementary operations that leads to successively more accurate approximations of $\pi$.
Complex analysis provides a convergence proof starting from properties of the function

$$ f(t) = \frac{1}{1 + t^2} = \sum_{k=0}^{\infty} (-1)^k t^{2k} . $$

For $|t| \leq \rho < 1$ the sequence of partial sums of this geometric series converges uniformly,

$$ \sum_{k=0}^{n} (-1)^k t^{2k} \to \frac{1}{1 + t^2}, $$

and can be integrated term by term to give

$$ \arctan x = \int_0^x \frac{\mathrm{d} t}{1 + t^2} = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k+1}}{2k+1}, $$

which evaluated at $x = 1$ recovers the Leibniz series for $\pi/4 = \arctan 1$.
This elegant result does not however address the points raised above: if $\pi$ were not known, how could the convergence of the sequence $\{p_n\}$ be assessed? A simple numerical experiment indicates that the familiar value of $\pi$ is only recovered for large $n$, with 10000 terms insufficient to ensure five significant digits.
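A minimal Julia sketch of such an experiment (the chosen values of $n$ are illustrative):

```julia
# Partial sums p_n = 4 * sum_{k=0}^{n} (-1)^k / (2k+1) of the Leibniz series.
leibniz(n) = 4 * sum((-1)^k / (2k + 1) for k in 0:n)

for n in (10, 100, 10_000)
    println((n = n, p = leibniz(n), err = abs(leibniz(n) - π)))
end
```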
Instead of evaluating the distance to an unknown limit, as in $|x_n - x| < \varepsilon$, one could evaluate whether terms get closer to one another, as in the Cauchy criterion $|x_m - x_n| < \varepsilon$ for all $m, n > N(\varepsilon)$, a condition that can readily be checked in an algorithm.
Note that the distance between any two terms after the threshold $N(\varepsilon)$ must be smaller than an arbitrary tolerance $\varepsilon$. For example, the sequence of partial sums of the harmonic series, $x_n = \sum_{k=1}^{n} 1/k$, is not a Cauchy sequence (it diverges) even though the distance between successive terms can be made arbitrarily small,

$$ | x_{n+1} - x_n | = \frac{1}{n+1} \to 0 . $$
Verification of decreasing successive distance is therefore a necessary but not sufficient condition to assess whether a sequence is a Cauchy sequence. Furthermore, the distance between successive iterates is not necessarily an indication of the distance to the limit. Reprising the Leibniz example, successive terms can be further apart than the distance to the limit, though terms separated by 2 are closer than the distance to the limit (a consequence of the alternating Leibniz series signs),

$$ | p_{n+1} - p_n | = \frac{4}{2n+3} > | \pi - p_n |, \qquad | p_{n+2} - p_n | = \frac{8}{(2n+3)(2n+5)} < | \pi - p_n | . $$
Another question is whether a Cauchy sequence is itself convergent. For sequences of reals this is true, but within the rationals the Leibniz sequence furnishes a counterexample since it consists of rationals and converges to an irrational. Such aspects that arise in number approximation sequences become even more important when considering approximation sequences composed of vectors or functions.
Consideration of floating point arithmetic indicates that adaptation of the mathematical concept of convergence is required in scientific computation. Recall that machine epsilon is the largest number $\epsilon$ such that the floating point computation $1 + \epsilon = 1$ is true, and it characterizes the granularity of the floating point system. A reasonable adaptation of the notion of convergence might be: the sequence $\{x_n\}$ is deemed numerically convergent if $|x_m - x_n| \leq \epsilon$ for all $m, n$ beyond some threshold $N$.
What emerges is the need to consider a degree of uncertainty in an approximating sequence. If the uncertainty can be bounded to the intrinsic granularity of the number system, a good approximation is obtained.
The problem of approximating numbers uncovers generic aspects of scientific computing:
different models of some phenomenon are possible, and it is necessary to establish the correspondence between models as well as the correspondence of a model to theory;
scientific computation seeks to establish viable approximation techniques for the mathematical objects that arise in models;
correspondence of a model to theory is established through properties of approximation sequences, not single results of a particular approximation technique;
physical limitations of computer memory require revisiting of mathematical concepts to characterize approximation sequence behavior, and impart a stochastic aspect to approximation techniques;
computational experiments are a key aspect, giving an empirical aspect to scientific computing that is not found in deductive or analytical mathematics.
The objective of scientific computation is to solve some problem by constructing a sequence of approximations $\{x_n\}$. The condition suggested by mathematical analysis would be $\lim_{n \to \infty} x_n = x$, with $x$ the exact solution. As already seen in the Leibniz series approximation of $\pi$, acceptable accuracy might only be obtained for large $n$. Since $x$ could be an arbitrarily complex mathematical object, such slowly converging approximating sequences are of little practical interest. Scientific computing seeks approximations of the solution with rapidly decreasing error. This change of viewpoint with respect to analysis is embodied in the concepts of rate and order of convergence: the sequence $\{x_n\}$, $x_n \to x$, is said to have order of convergence $p$ and rate of convergence $r$ if

$$ \lim_{n \to \infty} \frac{| x_{n+1} - x |}{| x_n - x |^p} = r . $$
As previously discussed, the above definition is of limited utility since:
The solution is unknown;
The limit is impractical to attain.
A more useful approach is to determine estimates of the rate and order of convergence over some range of iterations that are sufficiently accurate. Rewriting () as

$$ | x_{n+1} - x | \cong r \, | x_n - x |^{\, p} \quad \text{for large } n $$

suggests introducing the distance between successive iterates, $d_n = | x_{n+1} - x_n |$, and considering the condition

$$ d_{n+1} \cong r \, d_n^{\, p} . $$
As an example, consider the derivative of a function $f$ at a point $x_0$, as given by the calculus definition

$$ f'(x_0) = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h}, $$

and construct a sequence of approximations

$$ x_n = \frac{f(x_0 + h_n) - f(x_0)}{h_n}, \qquad h_n \to 0 . $$

Start with a numerical experiment, and compute the sequence $\{x_n\}$.
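A minimal Julia sketch of such an experiment; the choices $f = \sin$, $x_0 = \pi/4$, $h_0 = 1$, and halving step sizes $h_n = h_0/2^n$ are assumptions for illustration, not fixed by the text:

```julia
# Forward-difference approximations of f'(x0) with step sizes h_n = h0 / 2^n.
f(x)   = sin(x)
x0, h0 = π/4, 1.0
xseq = [(f(x0 + h0/2^n) - f(x0)) / (h0/2^n) for n in 0:30]
[xseq abs.(xseq .- cos(x0))]    # approximations alongside the error |x_n - f'(x0)|
```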
Investigation of the numerical results indicates increasing accuracy in the estimate of $f'(x_0)$ with decreasing step size $h_n$. The distance between successive approximation sequence terms also decreases. It is more intuitive to analyze convergence behavior through a plot rather than a numerical table.
The intent of the rate and order of approximation definitions is to state that the distance between successive terms behaves as

$$ d_{n+1} \cong r \, d_n^{\, p}, $$

in the hope that this is a Cauchy sequence, and successively closer terms actually indicate convergence. The convergence parameters can be isolated by taking logarithms, leading to a linear dependence,

$$ \log d_{n+1} = \log r + p \log d_n . $$

Subtraction of successive terms gives $\log d_{n+1} - \log d_n = p \, (\log d_n - \log d_{n-1})$, leading to an average slope estimate

$$ p \cong \frac{1}{N} \sum_{n} \frac{\log d_{n+1} - \log d_n}{\log d_n - \log d_{n-1}} . $$

The above computations indicate $p \cong 1$, known as linear convergence. Figure b shows the common practice of depicting guide lines of slope 1 (black) and slope 2 (green) to visually ascertain the order of convergence. Once the order of approximation is determined, the rate of approximation is estimated from

$$ \log r \cong \frac{1}{N} \sum_{n} \left( \log d_{n+1} - p \log d_n \right) . $$

The above results suggest successive approximants become closer spaced according to

$$ d_{n+1} \cong \tfrac{1}{2} \, d_n . $$
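Continuing the sketch above, the order and rate can be estimated from the successive distances of `xseq`, restricted to an assumed range of indices (the first 21 terms) over which the approximation still improves:

```julia
# Estimate the order p as the average slope of log d_{n+1} versus log d_n.
d    = abs.(diff(xseq[1:21]))              # distances between successive iterates
logd = log.(d[d .> 0])                     # drop any zero distances before taking logs
p    = sum(diff(logd)[2:end] ./ diff(logd)[1:end-1]) / (length(logd) - 2)
r    = exp(sum(logd[2:end] .- p .* logd[1:end-1]) / (length(logd) - 1))
println((p = p, r = r))                    # expect p ≈ 1, r ≈ 1/2 (linear convergence)
```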
Repeat the above experiment at a different point $x_0$, using a different approximation of the derivative, for example the centered difference

$$ x_n = \frac{f(x_0 + h_n) - f(x_0 - h_n)}{2 h_n} . $$

For this experiment, in addition to the rate and order of approximation obtained from the successive distances $d_n$, also determine the rate and order of convergence using the true errors,

$$ e_n = | x_n - f'(x_0) |, \qquad e_{n+1} \cong r \, e_n^{\, p} . $$
Given some approximation sequence $\{x_n\}$, $x_n \to x$, with $x$ the solution of problem, it is of interest to construct a more rapidly convergent sequence $\{y_n\}$, $y_n \to x$. Knowledge of the order of convergence can be used to achieve this purpose by writing

$$ x_{n+1} - x \cong C (x_n - x)^p, \qquad x_{n+2} - x \cong C (x_{n+1} - x)^p, $$

and taking the ratio to obtain

$$ \frac{x_{n+2} - x}{x_{n+1} - x} \cong \left( \frac{x_{n+1} - x}{x_n - x} \right)^{p} . $$

For $p \in \mathbb{N}$, the above is a polynomial equation of degree $p$ in $x$ that can be solved to obtain $x$. Since () is an approximation, solving () gives an approximation of the exact limit.
One of the widely used acceleration techniques was published by Aitken (1926, but had been in use since Medieval times) for $p = 1$, in which case () gives

$$ x \cong \frac{x_{n+2} \, x_n - x_{n+1}^2}{x_{n+2} - 2 x_{n+1} + x_n} = x_n - \frac{(x_{n+1} - x_n)^2}{x_{n+2} - 2 x_{n+1} + x_n} . $$

The above suggests that starting from $\{x_n\}$, the sequence $\{y_n\}$ with

$$ y_n = x_n - \frac{(x_{n+1} - x_n)^2}{x_{n+2} - 2 x_{n+1} + x_n} $$

might converge faster towards the limit. Investigate by revisiting the numerical experiment on approximation of the derivative, using the accelerated sequence $\{y_n\}$.
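A minimal sketch of this investigation, reusing the assumed sequence `xseq` and point `x0` from the derivative experiment above:

```julia
# Aitken extrapolation y_n = x_n - (x_{n+1}-x_n)^2 / (x_{n+2} - 2 x_{n+1} + x_n).
aitken(x) = [x[n] - (x[n+1] - x[n])^2 / (x[n+2] - 2x[n+1] + x[n]) for n in 1:length(x)-2]

yseq = aitken(xseq[1:12])
[abs.(xseq[1:10] .- cos(x0)) abs.(yseq .- cos(x0))]   # errors before and after acceleration
```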
Several approaches may be used in the construction of an approximating sequence $\{x_n\}$. The approaches exemplified below for the approximation of the real number $\pi$ can be generalized when the limit is some other type of mathematical object.
Returning to the Leibniz series

$$ \frac{\pi}{4} = \sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1}, $$

the sequence of approximations is $\{p_n\}$ with general term

$$ p_n = 4 \sum_{k=0}^{n} \frac{(-1)^k}{2k+1} . $$

Note that successive terms are obtained by an additive correction,

$$ p_{n+1} = p_n + \frac{4 \, (-1)^{n+1}}{2n+3} . $$
Another example, again giving an approximation of $\pi$, is the Srinivasa Ramanujan series

$$ \frac{1}{\pi} = \frac{2 \sqrt{2}}{9801} \sum_{k=0}^{\infty} \frac{(4k)! \, (1103 + 26390 k)}{(k!)^4 \, 396^{4k}}, $$
that can be used to obtain many digits of accuracy with just a few terms.
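A hedged Julia sketch evaluating the first three terms of this series in BigFloat arithmetic (the precision of 200 bits and the number of terms are arbitrary choices):

```julia
# First terms of the Ramanujan series for 1/π, evaluated in extended precision.
setprecision(BigFloat, 200) do
    s = sum(factorial(big(4k)) * (1103 + 26390k) /
            (factorial(big(k))^4 * big(396)^(4k)) for k in 0:2)
    p = 9801 / (2 * sqrt(big(2)) * s)   # approximation of π
    println(abs(p - big(π)))            # roughly 8 new correct digits per term
end
```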
An example of the generalization of this approach is the Taylor series of a function. For example, the familiar sine power series

$$ \sin x = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k+1}}{(2k+1)!} = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \dots $$

is analogous, but with rationals now replaced by monomials, and the limit is now a function, $\sin x$. The general term is

$$ f_n(x) = \sum_{k=0}^{n} \frac{(-1)^k x^{2k+1}}{(2k+1)!}, $$

and the same type of additive correction appears, this time for functions,

$$ f_{n+1}(x) = f_n(x) + \frac{(-1)^{n+1} x^{2n+3}}{(2n+3)!} . $$
Approximating sequences need not be constructed by adding a correction. Consider the approximation of $\pi$ given by Wallis's product (1656),

$$ \frac{\pi}{2} = \prod_{k=1}^{\infty} \frac{2k}{2k-1} \cdot \frac{2k}{2k+1} = \frac{2}{1} \cdot \frac{2}{3} \cdot \frac{4}{3} \cdot \frac{4}{5} \cdot \frac{6}{5} \cdot \frac{6}{7} \cdots, $$

for which the general term and the multiplicative correction are

$$ w_n = 2 \prod_{k=1}^{n} \frac{4 k^2}{4 k^2 - 1}, \qquad w_{n+1} = w_n \cdot \frac{4 (n+1)^2}{4 (n+1)^2 - 1} . $$
Another famous example is the Viète formula from 1593,

$$ \frac{2}{\pi} = \frac{\sqrt{2}}{2} \cdot \frac{\sqrt{2 + \sqrt{2}}}{2} \cdot \frac{\sqrt{2 + \sqrt{2 + \sqrt{2}}}}{2} \cdots, $$

in which the correction is multiplicative with numerators given by nested radicals. Similar to the $\sum$ symbol for addition and the $\prod$ symbol for multiplication, the N symbol is used here to denote nested radicals,

$$ \operatorname*{N}_{k=1}^{n} a_k = \sqrt{a_1 + \sqrt{a_2 + \dots + \sqrt{a_n}}} . $$

In the case of the Viète formula, $a_k = 2$ for all $k$.
Yet another alternative is that of continued fractions, with one possible approximation of $\pi$ given by

$$ \pi = 3 + \cfrac{1^2}{6 + \cfrac{3^2}{6 + \cfrac{5^2}{6 + \dots}}} . $$

A notation is introduced for continued fractions using the K symbol,

$$ b_0 + \operatorname*{K}_{k=1}^{n} \frac{a_k}{b_k} = b_0 + \cfrac{a_1}{b_1 + \cfrac{a_2}{b_2 + \cfrac{a_3}{\;\ddots\; + \cfrac{a_n}{b_n}}}} . $$

Using this notation, the sequences arising in the above continued fraction representation of $\pi$ are chosen as $a_k = (2k-1)^2$ for $k \geq 1$, and $b_0 = 3$, $b_k = 6$ for $k \geq 1$.
The above correction techniques used arithmetic operations. The repeated radical coefficients in the Viète formula suggest consideration of repeated composition of arbitrary functions to construct the approximant,

$$ x_n = (g_1 \circ g_2 \circ \dots \circ g_n)(t_0) = g_1(g_2(\dots g_n(t_0) \dots)) . $$

This is now a general framework in which all of the preceding correction approaches can be expressed. For example, the continued fraction formula () is recovered through the functions

$$ g_k(t) = \frac{(2k-1)^2}{6 + t}, $$

and evaluation of the composite function at $t_0 = 0$,

$$ \pi \cong 3 + (g_1 \circ g_2 \circ \dots \circ g_n)(0) . $$
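A minimal Julia sketch of this repeated composition, assuming the continued fraction representation written above (the names `gk` and `compose_pi` are illustrative):

```julia
# Compose g_1 ∘ g_2 ∘ ... ∘ g_n with g_k(t) = (2k-1)^2 / (6 + t), evaluated at 0.
gk(k) = t -> (2k - 1)^2 / (6 + t)
compose_pi(n) = 3 + foldr(∘, [gk(k) for k in 1:n])(0.0)

[abs(compose_pi(n) - π) for n in (5, 10, 20)]   # errors of the truncated composition
```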
This general framework is of significant current interest since such composition of nonlinear functions is the basis of deep neural network approximations.
The cornerstone of scientific computing is construction of approximating sequences.
The problem of number approximation leads to definition of concepts and techniques that can be extended to more complex mathematical objects.
A primary objective is the construction of efficient approximating sequences, with efficiency characterized through concepts such as order and speed of convergence.
Though often the focus of mathematical analysis, the limiting behavior of the sequence is of secondary interest. As seen in the approximation of a derivative, the approximating sequence might diverge, yet give satisfactory answers for some range of indices.
Though by far the most widely studied and used approach to approximation, additive corrections are not the only possibility.
Alternative correction techniques include: multiplication, continued fractions, or repeated function composition.
Repeated composition of functions is used in constructing deep neural network approximants.
<include|L03.tm>
<include|L04.tm>
<include|L05.tm>
<include|L06.tm>
<include|L07.tm>
<include|L08.tm>
<include|L09.tm>
<include|L10.tm>
<include|L11.tm>
<include|L12.tm>
<include|L13.tm>
<include|L14.tm>
<include|L15.tm>
<include|L16.tm>
<include|L17.tm>
<include|L18.tm>
<include|L19.tm>
<include|L20.tm>
<include|L21.tm>
<include|L22.tm>
<include|L23.tm>
<include|L24.tm>
<include|L25.tm>
<include|L26.tm>
<include|L27.tm>
<include|L28.tm>