Number Approximation

1.Numbers

1.1.Number sets

Most scientific disciplines introduce an idea of the amount of some entity or property of interest. Furthermore, the amount is usually combined with the concept of a number, an abstraction of the observation that the two sets A={Mary, Jane, Tom} and B={apple, plum, cherry} seem quite different, but we can match one distinct person to one distinct fruit as in {Mary↔plum, Jane↔apple, Tom↔cherry}. In contrast, we cannot match each distinct person to a distinct color from the set {red, green}: one of the colors must be shared between two persons. A formal definition of the concept of a number from the above observations is surprisingly difficult, since it would be self-referential due to the appearance of the numbers “one” and “two”. Leaving this aside, the key concept is that of a quantity of some property of interest, expressed through a number.

Several types of numbers have been introduced in mathematics to express different types of quantities, and the following will be used throughout this text:

The set of natural numbers, ℕ = {0, 1, 2, 3, …}, infinite and countable; also ℕ₊ = {1, 2, 3, …};

The set of integers, ℤ = {0, ±1, ±2, ±3, …}, infinite and countable;

The set of rational numbers, ℚ = {p/q, p ∈ ℤ, q ∈ ℕ₊}, infinite and countable;

The set of real numbers ℝ, infinite, not countable, can be ordered;

The set of complex numbers, ℂ = {x + iy, x, y ∈ ℝ}, infinite, not countable, cannot be ordered.

These sets of numbers form a hierarchy, with ℕ ⊂ ℤ ⊂ ℚ ⊂ ℝ ⊂ ℂ. The size of a set of numbers is an important aspect of its utility in describing natural phenomena. The set S = {Mary, Jane, Tom} has three elements, and its size is defined by the cardinal number |S| = 3. The sets ℕ, ℤ, ℚ, ℝ, ℂ have an infinite number of elements, but the relation

z = −n/2 for n even,  z = (n+1)/2 for n odd,

defines a one-to-one correspondence between n ∈ ℕ and z ∈ ℤ, so these sets are of the same size, denoted by the transfinite number ℵ₀ (aleph-zero). The rationals can also be placed into a one-to-one correspondence with ℕ, hence

|ℕ| = |ℤ| = |ℚ| = ℵ₀.

In contrast, there is no one-to-one mapping of the reals to the naturals, and the cardinality of the reals is |ℝ| = 𝔠 (Fraktur-script c). Georg Cantor established set theory and introduced a proof technique known as the diagonal argument to show that 𝔠 = 2^ℵ₀. Intuitively, there are exponentially more reals than naturals.
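The pairing between ℕ and ℤ given above can be checked directly; a minimal sketch (the function name z_of is an illustrative choice, not standard):

```julia
# One-to-one correspondence ℕ → ℤ from the piecewise formula:
# even n maps to -n/2, odd n maps to (n+1)/2.
z_of(n) = iseven(n) ? -div(n, 2) : div(n + 1, 2)

[z_of(n) for n in 0:6]   # enumerates 0, 1, -1, 2, -2, 3, -3
```

The first naturals enumerate the integers in the order 0, 1, −1, 2, −2, …, and no integer is visited twice, which is the one-to-one property used to equate the two cardinalities.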

1.2.Quantification

One of the foundations of the scientific method is quantification, ascribing numbers to phenomena of interest. To exemplify the utility of different types of number to describe natural phenomena, consider common salt (sodium chloride, Fig. 1) which has the chemical formula NaCl, with the sodium ions (Na⁺) and chloride ions (Cl⁻) spatially organized in a cubic lattice, with an edge length a = 5.6402 Å (1 Å = 10⁻¹⁰ m) between atoms of the same type. Setting the origin of a Cartesian coordinate system Oxyz at a sodium atom, the position of some atom within the lattice is

(x, y, z) = (i·a/2, j·a/2, k·a/2), i, j, k ∈ ℤ.

Sodium atoms are found at positions where i+j+k is even, while chloride atoms are found at positions where i+j+k is odd. The Cartesian coordinates (x, y, z) describe an arbitrary position in space, which is conceptualized as a continuum and placed into one-to-one correspondence with ℝ³. A particular lattice position can be specified simply through a label consisting of three integers, (i, j, k) ∈ ℤ³. The position is recovered through a scaling operation

(x, y, z) = (a/2)(i, j, k),

and the number a/2, which rescales the unit length to a/2, is called a scalar.
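The scaling relation and the parity rule can be sketched as two short functions (latticepos and atomtype are illustrative names, not part of any library):

```julia
# Position of lattice site (i,j,k) in the NaCl lattice with edge length a
a = 5.6402                                  # Å, edge between atoms of the same type
latticepos(i, j, k) = (a/2) .* (i, j, k)    # scalar a/2 rescales integer labels

# Sodium occupies sites with i+j+k even, chloride those with i+j+k odd
atomtype(i, j, k) = iseven(i + j + k) ? "Na" : "Cl"

latticepos(1, 1, 0), atomtype(1, 1, 0)
```

The continuum position is recovered from the integer label by a single scalar multiplication, exactly the scaling operation above.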

Figure 1. Left: Polycrystalline sodium chloride. Right: Cubic lattice structure of a single sodium chloride crystal.

1.3.Computer number sets

A computer has a finite amount of memory, hence cannot represent all numbers, but rather subsets of the above number sets. Current digital computers internally use numbers represented through binary digits, or bits. Many computer number types are defined for specific purposes, and are often encountered in applications such as image representation or digital data acquisition. Here are the main types.

Subsets of ℕ

The number types uint8, uint16, uint32, uint64 represent subsets of the natural numbers (unsigned integers) using 8, 16, 32, 64 bits respectively. An unsigned integer with b bits can store a natural number in the range from 0 to 2^b − 1. Two arbitrary natural numbers, written as i, j ∈ ℕ, can always be added to give another natural number, k = i + j. In contrast, addition of computer unsigned integers is only defined within the range 0 to 2^b − 1. If k > 2^b − 1, the result might be saturated to the maximum representable value or wrapped around to k mod 2^b; the examples below exhibit wraparound.

i=UInt8(15); j=UInt8(10); k=i+j

25

i=UInt8(150); j=UInt8(200); k=i+j; [k i+j mod(350,256)]

[ 94 94 94 ] (1)

i=UInt8(150); j=UInt8(200); k=i-j; [k i-j mod(-50,256)]

[ 206 206 206 ] (2)

typeof(i-j)

UInt8

Subsets of ℤ

The number types int8, int16, int32, int64 represent subsets of the integers. One bit is used to store the sign of the number, so the subset of ℤ that can be represented ranges from −2^(b−1) to 2^(b−1) − 1.

i=Int8(15); j=Int8(21); k=i+j

36

i=Int8(100); j=Int8(101); k=i+j; [k i+j mod(201,128)-128]

[ -55 -55 -55 ] (3)

typeof(k)

Int8

[typemin(Int8) typemax(Int8)]

[ -128 127 ] (4)

Subsets of ℚ, ℝ, ℂ

Computers approximate the real numbers through the set 𝔽 of floating point numbers. Floating point numbers that use b=32 bits are known as single precision, while those that use b=64 bits are double precision. A floating point number x ∈ 𝔽 is stored internally as x = ±.B1B2…Bm × 2^(±b1b2…be), where Bi, i = 1, …, m, are bits within the mantissa of length m, and bj, j = 1, …, e, are bits within the exponent, along with signs ± for each. The default number type is usually double precision, more concisely referred to as Float64. Common irrational constants such as e and π are predefined as irrationals; casting to Float64 or Float32 gives a floating point approximation. Unicode notation is recognized. Presence of a decimal point indicates a floating point number; its absence indicates an integer.

pi

π

typeof(pi)

Irrational{:π}

[Float32(pi) Float64(pi) Float64(π)]

[ 3.1415927410125732 3.141592653589793 3.141592653589793 ] (5)

a=2.3; b=2; c=3.; [typeof(a) typeof(b) typeof(c)]

DataType[Float64 Int64 Float64]

The approximation of the reals ℝ by the floats 𝔽 is characterized by: floatmax(), the largest float; floatmin(), the smallest positive normalized float; and eps(), known as machine epsilon, the distance from 1 to the next larger float. Machine epsilon highlights the difference between floating point and real numbers: any increment ε smaller than about ϵ/2 satisfies 1 + ε = 1 in floating point arithmetic. In ℝ, of course, 1 + ε = 1 implies ε = 0, but floating point numbers exhibit “granularity”, in the sense that over a unit interval there are small steps that are indistinguishable from zero due to the finite number of bits available for a float. The granularity of double precision expressed by machine epsilon is sufficient to represent natural phenomena, and floating point errors can usually be kept under control.

[floatmin(Float32) floatmax(Float32) eps(Float32)]

[ 1.1754944e-38 3.4028235e38 1.1920929e-7 ] (6)

[floatmin(Float64) floatmax(Float64) eps(Float64)]

[ 2.2250738585072014e-308 1.7976931348623157e308 2.220446049250313e-16 ] (7)
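The granularity expressed by eps() can be probed directly: increments well below ϵ/2 are absorbed when added to 1, while eps() itself reaches the next representable float.

```julia
# Increments below roughly eps()/2 are lost when added to 1.0,
# while eps() itself produces the next representable float above 1.0.
[1.0 + eps()/4 == 1.0   1.0 + eps() == 1.0   nextfloat(1.0) == 1.0 + eps()]
```

The first comparison holds (the increment is rounded away), the second fails (1 + ϵ is the next float, distinct from 1), and the third confirms that eps() is exactly the gap to nextfloat(1.0).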

Keep in mind that perfect accuracy is a mathematical abstraction, not encountered in nature. In fields such as sociology or psychology, 3 digits of accuracy are excellent; in mechanical engineering this might increase to 6 digits, and in electronic engineering to 8 digits. The most precisely known physical constant is the Rydberg constant, known to about 12 digits. Hence a mathematical statement such as

x=2.6309283450461248350319486319845

is unlikely to have any real significance, while

x=2.631±0.0005

is much more informative.
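Truncating a value to the physically meaningful number of digits, as in the second statement, can be done with Julia's round and its sigdigits keyword:

```julia
# Retain only the digits that carry physical significance
x = 2.6309283450461248350319486319845
round(x, sigdigits = 4)   # keeps 4 significant digits
```

The remaining digits of x carry no information beyond the stated uncertainty ±0.0005, so discarding them loses nothing of significance.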

Within the reals certain operations are undefined such as 1/0. Special float constants are defined to handle such situations: Inf is a float meant to represent infinity, and NaN (“not a number”) is meant to represent an undefinable result of an arithmetic operation.

[1/0 -1.0/0.0 1/Inf -1/Inf Inf/Inf]

[ Inf -Inf 0.0 -0.0 NaN ] (8)

Complex numbers z ∈ ℂ are specified by two reals, in Cartesian form as z = x + iy, x, y ∈ ℝ, or in polar form as z = ρe^(iθ), ρ, θ ∈ ℝ, ρ ≥ 0. The computer type Complex is similarly defined from two floats, and the additional constant im is defined to represent √−1 = i = e^(iπ/2). Functions are available to obtain the real and imaginary parts of the Cartesian form, or the absolute value and argument of the polar form.

z1=1+1im; z2=1-im; [z1+z2 z1/z2]

[ 2.0+0.0i -0.0+1.0i ] (9)

[real(z1) real(z2) real(z1+z2) real(z1/z2)]

[ 1.0 1.0 2.0 -0.0 ] (10)

[imag(z1) imag(z2) imag(z1+z2) imag(z1/z2)]

[ 1.0 -1.0 0.0 1.0 ] (11)

[abs(z1) abs(z2) abs(z1+z2) abs(z1/z2)]

[ 1.4142135623730951 1.4142135623730951 2.0 1.0 ] (12)

[angle(z1) angle(z2) angle(z1+z2) angle(z1/z2)]

[ 0.7853981633974483 -0.7853981633974483 0.0 1.5707963267948966 ] (13)

Figure 2. Hierarchy of number types in the Julia language.

2.Approximation

2.1.Axiom of floating point arithmetic

The reals form an algebraic structure known as a field, (ℝ, +, ·). The set of floats together with floating point addition and multiplication is denoted (𝔽, ⊕, ⊗). Operations with floats do not have the same properties as those with reals, but are assumed to have a relative error bounded by machine epsilon

∀x, y ∈ ℝ, |(fl(x) ⊙ fl(y) − x ∘ y)/(x ∘ y)| ≤ ϵ,  (⊙, ∘) ∈ {(⊕, +), (⊗, ·)},

where fl(x)𝔽 is the floating point representation of x. The above is often used in error analysis as

fl(x) ⊙ fl(y) = (x ∘ y)(1 + ε), |ε| ≤ ϵ.

Computer number sets are a first example of approximation: replacing some complicated object with a simpler one. It is one of the key mathematical ideas studied throughout this text.
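The axiom can be probed empirically for Float32 arithmetic, taking Float64 results as a stand-in for exact real arithmetic (a sketch under that assumption; at Float32 scale the Float64 sum is effectively exact):

```julia
# Check the relative error bound of Float32 addition against a
# Float64 reference, which is effectively exact at Float32 scale.
xf, yf = Float32(0.1), Float32(0.7)
exact  = Float64(xf) + Float64(yf)      # reference value for x + y
relerr = abs(Float64(xf + yf) - exact) / abs(exact)
relerr <= eps(Float32)
```

Correctly rounded arithmetic in fact guarantees a relative error of at most ϵ/2 for a single operation, comfortably within the axiom's bound.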

2.2.Cumulative floating point operations

Care should be exercised about the cumulative effect of many floating point operations. An informative example is offered by Zeno's paradox of motion, which purports that fleet-footed Achilles could never overcome an initial head start of D=2 given to the lethargic Tortoise since, as stated by Aristotle:

In a race, the quickest runner can never over­take the slowest, since the pursuer must first reach the point whence the pursued started, so that the slower must always hold a lead.

The above is often formulated by considering that the first half of the initial head start must be overcome, then another half and so on. The distance traversed after N such steps is

D_N = 1 + 1/2 + ⋯ + 1/2^N = (1 − (1/2)^(N+1))/(1 − 1/2) = 2[1 − (1/2)^(N+1)] < 2.

Calculus resolves the paradox by rigorous definition of the limit D = lim_(N→∞) D_N = 2 and definition of velocity as v(t) = lim_(δt→0) (D(t+δt) − D(t))/δt, with δt = 1/N, D(t) = 2[1 − (1/2)^(t/δt)].

Undertake a numerical investigation and consider two scenarios, with decreasing or increasing step sizes

D_N = 1 + 1/2 + ⋯ + 1/2^N,  C_N = 1/2^N + 1/2^(N−1) + ⋯ + 1.

In (ℝ, +, ·), associativity and commutativity of addition ensure D_N = C_N.

N=10; D=2.0 .^ (0:-1:-N); C=2.0 .^ (-N:1:0); sum(D)==sum(C)

true

N=20; D=2.0 .^ (0:-1:-N); C=2.0 .^ (-N:1:0); sum(D)==sum(C)

true

Irrespective of the value of N, D_N = C_N in floating point arithmetic. Recall however that computers use binary representations internally, so division by powers of two has unique features (indeed, it corresponds to a bit shift operation). Try subdividing the head start by a different number, perhaps π, to get an “irrational” numerical investigation of Zeno's paradox of motion. Define now the distance S_N, traversed in step sizes successively scaled by 1/π starting from one, and T_N, the same steps summed in reverse order, starting from π^(−N):

S_N = 1 + 1/π + 1/π² + ⋯ + 1/π^N,  T_N = 1/π^N + 1/π^(N−1) + ⋯ + 1.

Again, in the reals the above two expressions are equal, SN=TN, but this is no longer verified computationally for all N, not even within a tolerance of machine epsilon.

fpi=Float64(pi);
N=10; S=fpi .^ (0:-1:-N); T=fpi .^ (-N:1:0); sum(S)==sum(T)

true

N=20; S=fpi .^ (0:-1:-N); T=fpi .^ (-N:1:0); sum(S)==sum(T)

false

sum(S)-sum(T)<eps(Float64)

false

This example gives a first glimpse of the steps that need to be carried out, in addition to mathematical analysis, to fully characterize an algorithm. Since S_N ≠ T_N, a natural question is whether one is more accurate than the other. For some arbitrary ratio a, the exact value is known,

E_N = (1 − (1/a)^(N+1))/(1 − 1/a),

and can be used to evaluate the errors |SN-EN|, |TN-EN|.

function E(N,a)
   (1-(1/a)^(N+1))/(1-(1/a))
end;
function εs(N,a)
   S=a .^ (0:-1:-N)
   abs(sum(S)-E(N,a))
end;
function εt(N,a)
   T=a .^ (-N:1:0)
   abs(sum(T)-E(N,a))
end;

Carrying out the computations leads to results in Fig. 3.

n=30; errs=zeros(Float64,n); errt=zeros(Float64,n);
for i=1:n
  errs[i]=εs(i,fpi); errt[i]=εt(i,fpi);
end
clf(); plot(1:n,errs,1:n,errt,marker="o"); title("Summation error"); grid("on"); xlabel("n"); ylabel("εs,εt"); legend(["εs"; "εt"]);

Figure 3. Summation order errors.

Note that the errors are about the size of machine epsilon for S_N, but are zero for T_N: it seems that the summation ordering T_N = a^(−N) + a^(−N+1) + ⋯ + 1 gives the exact value. A bit of reflection reinforces this interpretation: adding the small quantities first allows carry-over digits to be accounted for.

This example is instructive beyond the immediate adage of “add small quantities first”. It highlights the blend of empirical and analytical approaches that is prevalent in scientific computing.
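One systematic way to “add small quantities first” without reordering the terms is compensated summation. The following Kahan summation sketch (a standard technique, not part of this text's development) tracks the lost low-order bits explicitly:

```julia
# Kahan compensated summation: a correction term c recovers the
# low-order digits lost when a small term is added to a large sum.
function kahan_sum(xs)
    s = 0.0; c = 0.0
    for x in xs
        y = x - c          # apply the correction from the previous step
        t = s + y          # low-order digits of y may be lost here
        c = (t - s) - y    # recover exactly what was lost
        s = t
    end
    return s
end

# Summing the S_N ordering (large terms first) with compensation
# should agree closely with the accurate small-to-large ordering.
fpi = Float64(pi)
S = fpi .^ (0:-1:-20)
kahan_sum(S), sum(reverse(S))
```

Compensated summation attains an error bound essentially independent of the summation order, at the cost of roughly four floating point operations per term.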

3.Successive approximations

3.1.Sequences in ℝ

Single values given by some algorithm are of little value in the practice of scientific computing. The main goal is the construction of a sequence of approximations {x_n}_(n∈ℕ) that enables assessment of the quality of an approximation. Recall from calculus that {x_n}_(n∈ℕ) converges to x if |x_n − x| can be made as small as desired for all n beyond some threshold. In precise mathematical language this is stated through:

Definition. {x_n}_(n∈ℕ) converges to x if ∀ε > 0, ∃N(ε) such that |x_n − x| < ε for all n > N(ε).

Though it might seem natural to require a sequence of approximations to converge to an exact solution x

lim_(n→∞) x_n = x,

such a condition is problematic on multiple counts:

  1. the exact solution is rarely known;

  2. the best approximation might be achieved for some finite range n₁ ≤ n ≤ n₂, rather than in the n → ∞ limit.

Both situations arise when approximating numbers and serve as useful reference points when considering the approximation of other types of mathematical objects such as functions. For example, the number π is readily defined in geometric terms as the ratio of circle circumference to diameter, but can only be approximately expressed by a rational number, e.g., π ≅ 22/7. The exact value of π is only obtained as the limit of an infinite number of operations with rationals. There are many such infinite representations, one of which is the Leibniz series

π/4 = 1 − 1/3 + 1/5 − 1/7 + 1/9 − ⋯.

No finite term

L_n = Σ_(k=0)^n (−1)^k/(2k+1)

of the above Leibniz series equals π/4, i.e.,

∄ n ∈ ℕ such that L_n = π/4.

Rather, the Leibniz series should be understood as an algorithm, i.e., a sequence of elementary operations that leads to successively more accurate approximations of π/4,

lim_(n→∞) L_n = π/4.

Complex analysis provides a convergence proof starting from properties of the arctan function

d/dz arctan(z) = 1/(1+z²) ⇒ π/4 = arctan(1) − arctan(0) = ∫₀¹ dz/(1+z²).

For |z| < 1 the sequence S_n = Σ_(k=0)^n (−z²)^k of partial sums of a geometric series converges uniformly,

Σ_(k=0)^∞ (−z²)^k = lim_(n→∞) (1 − (−z²)^(n+1))/(1 − (−z²)) = 1/(1+z²),

and can be integrated term by term to give

π/4 = ∫₀¹ Σ_(k=0)^∞ (−z²)^k dz = Σ_(k=0)^∞ ∫₀¹ (−z²)^k dz = Σ_(k=0)^∞ (−1)^k/(2k+1).

This elegant result does not, however, address the points raised above: if π were not known, how could the convergence of the sequence {L_n}_(n∈ℕ) be assessed? A simple numerical experiment indicates that the familiar value of π is only recovered for large n, with 10000 terms insufficient to ensure five significant digits.

function L(n)
  L=1.0; s=-1.0
  for k=1:n
    L += s/(2*k+1); s = -s
  end
  return 4*L
end

L

[L(100) L(1000) L(10000) Float64(π)]

[ 3.1514934010709914 3.1425916543395442 3.1416926435905346 3.141592653589793 ] (14)

3.2.Cauchy sequences

Instead of evaluating the distance to an unknown limit, as in |L_n − π| < ε, one could evaluate whether terms get closer to one another, as in |L_n − L_m| < ε, a condition that can readily be checked within an algorithm.
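This condition translates directly into a stopping criterion. A sketch (leibniz_until is an illustrative name) that advances the Leibniz partial sums until successive iterates agree to within a tolerance:

```julia
# Advance the Leibniz partial sums until successive iterates differ
# by less than a tolerance ε; return the estimate of π and the steps taken.
function leibniz_until(ε, nmax)
    L = 1.0; s = -1.0
    for k in 1:nmax
        Lnew = L + s / (2k + 1)
        abs(Lnew - L) < ε && return 4Lnew, k   # successive-difference test
        L, s = Lnew, -s
    end
    return 4L, nmax
end

val, steps = leibniz_until(1e-4, 10^6)
```

With ε = 10⁻⁴ the iteration stops as soon as the step size 1/(2k+1) falls below the tolerance; as discussed next, a small successive difference does not by itself guarantee proximity to the limit.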

Definition. {x_n}_(n∈ℕ) is a Cauchy sequence if ∀ε > 0, ∃N(ε) such that |x_n − x_m| < ε for all m, n > N(ε).

Note that the distance between any two terms beyond the threshold N(ε) must be smaller than the arbitrary tolerance ε. For example, the sequence a_n = √n is not a Cauchy sequence even though the distance between successive terms can be made arbitrarily small,

a_(n+1) − a_n = √(n+1) − √n = (√(n+1) − √n)(√(n+1) + √n)/(√(n+1) + √n) = 1/(√(n+1) + √n) < 1/(2√n).
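A quick computation confirms this behaviour: the successive differences of a_n = √n shrink toward zero even as the sequence itself grows without bound.

```julia
# Successive differences of a_n = √n become tiny while a_n diverges:
# decreasing gaps between terms do not imply a Cauchy sequence.
n = 10^4
[sqrt(n + 1) - sqrt(n)   1/(2*sqrt(n))   sqrt(n)]
```

At n = 10⁴ the gap is below 1/(2√n) = 0.005, yet a_n = 100 and keeps growing, so no single limit is being approached.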

Verification of decreasing successive distances is therefore a necessary but not sufficient condition for a sequence to be a Cauchy sequence. Furthermore, the distance between successive iterates is not necessarily an indication of the distance to the limit. Reprising the Leibniz example, successive terms can be further apart than the distance to the limit, though terms separated by two indices are closer than the distance to the limit (a consequence of the alternating signs in the Leibniz series)

n=1000; [log10(abs(L(n)-L(n-1))) log10(abs(L(n)-π))]

[ -2.6991870973082537 -3.000434185835426 ] (15)

 [log10(abs(L(n)-L(n-2))) log10(abs(L(n)-π))]

[ -5.698969895788488 -3.000434185835426 ] (16)

Another question is whether a Cauchy sequence is itself convergent. For sequences of reals this is true, but within the rationals the Leibniz sequence furnishes a counterexample: its terms are rational, yet it converges to the irrational π/4 and hence has no limit in ℚ. Such aspects that arise in number approximation sequences become even more important when considering approximation sequences composed of vectors or functions.
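The rationality of every partial sum can be exhibited directly with Julia's exact Rational type (Lrat is an illustrative name; small n avoids integer overflow in the denominators):

```julia
# The Leibniz partial sums are exact rational numbers,
# yet their limit π/4 is irrational.
Lrat(n) = sum((-1)^k // (2k + 1) for k in 0:n)

Lrat(5), Float64(Lrat(5))   # an exact element of ℚ and its float value
```

Each L_n lives in ℚ, so the sequence is Cauchy in ℚ without converging in ℚ; completeness is a property of ℝ, not of ℚ.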

3.3.Sequences in 𝔽

Consideration of floating point arithmetic indicates that an adaptation of the mathematical concept of convergence is required in scientific computation. Recall that machine epsilon ϵ characterizes the granularity of the floating point system: increments smaller than about ϵ/2 are lost entirely when added to 1. A reasonable adaptation of the notion of convergence might be:

Definition. {x_n}_(n∈ℕ), x_n ∈ 𝔽, converges to x ∈ 𝔽 if ∀ε > ϵ, ∃N(ε) such that |x_n − x| < ε for n > N(ε).

What emerges is the need to consider a degree of uncertainty in an approximating sequence. If the uncertainty can be bounded to the intrinsic granularity of the number system, a good approximation is obtained.

Summary.
The problem of approximating numbers uncovers generic aspects of scientific computing: