Iteration

One of the first important tools a beginner programmer learns is “The Loop.” Unlike assignment or expressions, “looping” brings extra power to a program. Typically, a beginner without the “looping” power will copy-paste a section of code with minor edits to accomplish a repetitive task.

Once “looping” is learned, it becomes second nature. After all, many real-life tasks can be modeled by iteration (“loops”). What most programmers do not realize is that, mathematically, iteration is quite sophisticated.

Function Iteration when Programming

Procedural programming languages (think C, Java, Python, etc.) obscure the math behind looping, favoring simplicity instead.1 However, it is relatively easy to recover. Let s_t be the memory state of a loop at iteration t, i_t be external input from, say, a user or a sensor, and let o_t be the output of the loop. What we get is a simple sequence:

[Diagram: the state chain s_0 \to s_1 \to s_2 \to \cdots, with external inputs i_t flowing in and outputs o_t flowing out at each step. I omitted some of the inputs and outputs for simplicity.]

How does this relate to functions? In the diagram above, I used f to denote the function moving the state from t to t+1—in other words, the body of the loop! The process looks like this:

(s_{t+1},o_t) = f(s_t,i_t)

To me, this still looks a little too complicated. Let's simplify it. First, if you've programmed for a while, you know that a lot of the loops (I'd say almost all) you write don't take external input. The data is already there at the start. Second, giving the output its own variable is completely unnecessary. We can just assume that some portion of s contains the output of the previous step. We normally don't even care about the in-between outputs.2 The diagram then becomes:

[Diagram: the simplified chain s_0 \to s_1 \to \cdots \to s_n, where each arrow is an application of f.]

Let's walk through the process.

  1. Someone/something sets up s_0.
  2. The function f is applied repeatedly: s_{t+1} = f(s_t) for t = 0, 1, \ldots, n-1.
  3. The program outputs s_n (the last iterate).
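In code, this whole process is only a few lines. Here's a minimal sketch (the helper name iterate is mine, just for illustration):

def iterate(f, s0, n):
    # Apply the loop body f to the state n times: s_0 -> s_1 -> ... -> s_n
    s = s0
    for _ in range(n):
        s = f(s)
    return s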

Pretty simple, huh? But really, all we've done so far is play around with notation. Let's use a concrete example instead:

x = 10.0
for _ in range(10):
    x = x / 2.0
print(x)
# x = 0.009765625

Cool. So s_0 = 10, f(s) = s/2, and s_{10}=5/512. Intuitively, as n\to\infty, s_n\to 0, which leads to the next section...

Fixed Points

So, it turns out that well before anyone was writing loops in Python, mathematicians were studying what happens when you stuff the output of a function back into its input. Why? Because it's a pretty convenient way to model the natural world. Take, for instance, a population of rabbits with no predators, and assume that each rabbit produces about 1.5 offspring per year. We can express the number of rabbits as

r_{t+1} = f_\mathrm{rabbits}(r_t) = r_t + 1.5 r_t = 2.5 r_t

Nothing complicated here—the number of rabbits next year is completely dependent on the number in the current year. Of course, this is unrealistic—rabbits cannot reproduce indefinitely—so we modify the equation so that the rabbit population stops increasing when the population reaches the maximum capacity C:

f(r) = r + \alpha r(C-r)

So what happens when we want to know how many rabbits there will be in 5 years? Or maybe 10 years? All we have to do is keep reapplying f to the original population size:

r_{t+n} = f^n(r_t) = (\underbrace{f\circ \cdots \circ f}_n)(r_t)
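Numerically, though, reapplying f is just a loop. Here's a minimal sketch with made-up parameter values (\alpha, C, and the starting population are all invented for illustration):

alpha, C = 0.001, 1000.0      # growth parameter and carrying capacity (made up)

def f(r):
    # One year of capacity-constrained rabbit growth
    return r + alpha * r * (C - r)

r = 100.0                     # initial population (also made up)
for year in range(10):
    r = f(r)
print(r)                      # population after 10 years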

The algebra for this quickly becomes too complicated to be useful. Instead, I decided to make a graph to show what happens (for fun):

[Interactive graph of the iterated map f^n(r), with sliders for \alpha, n, and C]

I encourage you to play around with the sliders to get a sense for how each parameter affects the graph. It's the most visually interesting (in my opinion) to set n at its maximum and vary \alpha. However, what we're interested in here is what happens to the graph when \alpha is fixed and n increases—in other words, when the function f is continually reapplied. The first key observation is this: when \alpha is low enough, increasing n does not change where the two lines intersect. Mathematically, this translates to:

\bar r = f(\bar r)

where \bar r is known as a fixed point. A little algebra will show that the only two fixed points are \bar r \in \{0,C\}: setting \bar r = \bar r + \alpha\bar r(C-\bar r) forces \alpha\bar r(C-\bar r) = 0. Intuitively, our rabbit population cannot grow if a) it does not exist or b) it is at capacity.

But what happens if the rabbit population is just slightly different from one of the fixed points? Some more algebra can be enlightening:

\begin{aligned} f^n(\bar r+\delta) &\approx f^n(\bar r) + (f^n)'(\bar r)\delta \\ &= \bar r + f'(\bar r)^n \delta \end{aligned}

where the derivative can be expressed recursively:

\begin{aligned} (f^n)'(r) &= f'(f^{n-1}(r))\,(f^{n-1})'(r) \\ (f^n)'(\bar r) &= f'(\bar r)(f^{n-1})'(\bar r) = f'(\bar r)^n \end{aligned}

There are obviously three cases here:

\left|f'(\bar r)\right| < 1: f^n(\bar r + \delta) converges to \bar r as n\to\infty.
\left|f'(\bar r)\right| > 1: f^n(\bar r + \delta) diverges (perhaps to a different fixed point).
\left|f'(\bar r)\right| = 1: undetermined (but probably pretty rare).

In words, if the derivative of f at a fixed point is between -1 and 1, the fixed point is attractive; if it is outside that range, it is repelling. The basin of attraction for a fixed point is the set of points which converge to it in the limit (in \mathbb{R}^2, coloring the basins tastefully yields pretty fractal pictures).
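For the rabbit map, f'(r) = 1 + \alpha(C - 2r), so f'(0) = 1 + \alpha C and f'(C) = 1 - \alpha C. Here's a quick sketch that classifies both fixed points (same made-up parameters as before):

alpha, C = 0.001, 1000.0

def f_prime(r):
    # Derivative of f(r) = r + alpha * r * (C - r)
    return 1.0 + alpha * (C - 2.0 * r)

for fixed_point in (0.0, C):
    slope = abs(f_prime(fixed_point))
    kind = "attractive" if slope < 1 else "repelling" if slope > 1 else "undetermined"
    print(fixed_point, slope, kind)
# 0.0 is repelling (slope 2.0); C is attractive (slope 0.0)

So an empty warren stays empty only in theory: any positive perturbation gets pushed away from 0 and pulled toward C.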

Orbits

We can do other fun things with fixed points. For instance, an orbit is just a generalization of fixed points to multiple iterations:

f^k(\tilde x_k) = \tilde x_k

Above, \tilde x_k is a point in the orbit. A 2-orbit in a rabbit population might have the following interpretation:

  1. A population below but near capacity produces a big crop of offspring, who initially consume minimal resources (grass).
  2. The next year, the offspring become adults and grass becomes scarce (since the population is over capacity). Fewer offspring are produced.
  3. A smaller population of rabbits, again below but near capacity, survives to the third year.

Really, any interpretation will do to explain an orbit as long as it includes delay effects.
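We can even find a 2-orbit numerically just by iterating. When \alpha C is a bit above 2, the fixed point at C turns repelling (f'(C) = 1 - \alpha C < -1) and a stable 2-cycle appears. A sketch, again with invented numbers:

alpha, C = 0.0022, 1000.0     # alpha * C = 2.2, past the first period-doubling

def f(r):
    return r + alpha * r * (C - r)

r = 100.0
for _ in range(1000):         # burn in: let the transient die out
    r = f(r)
for _ in range(4):            # the population now alternates between two values
    r = f(r)
    print(r)                  # roughly 746 and 1163, repeating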

Number theory makes a surprising entrance here. Clearly, for any integer l such that l \mid k (read "l divides k"), the set of k-orbits includes the set of l-orbits; this includes the set of 1-orbits (fixed points). Thus, for example, if you plot f^2, f^4, \ldots, they will share a common set of locations where they cross the identity line through the origin.3 Another interesting consequence is that the intersection of the orbit sets for two distinct primes is exactly the set of 1-orbits (fixed points). This is an important concept for random number generators. To show this, observe that

f^m(\tilde x) = f^n(\tilde x) = \tilde x \implies f^{\left|km-ln\right|}(\tilde x) = \tilde x

for positive integers k,l,m,n. Let k = m^{n-2} and l = (m^{n-1}-1)/n; then km-ln=1 for distinct primes n and m. This implies that any \tilde x that is both an n-orbit and an m-orbit is also a fixed point (a 1-orbit).4
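As a quick sanity check with m = 3 and n = 5: k = 3^{5-2} = 27 and l = (3^{5-1} - 1)/5 = 16, so km - ln = 81 - 80 = 1. Or, in code:

m, n = 3, 5                                   # any two distinct primes
k, l = m ** (n - 2), (m ** (n - 1) - 1) // n
print(k * m - l * n)                          # 1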

Practically speaking, if you choose two prime values of n (limited to 2,3,5,7,11,13,17,19) on the graph above, the only crossing-points they will share are fixed points.

Gradient Descent

A good example of why understanding iteration and fixed points is important comes from machine learning: gradient descent. Typically, the goal is to minimize a loss function L (usually differentiable) with respect to some parameters \theta. Vanilla gradient descent updates the weights using the following:

\theta_{t+1} = \theta_t - \alpha \frac{\partial L}{\partial \theta}(\theta_t) = g(\theta_t)

Look familiar? We just keep applying g to the weights until they converge! This very general view is easily extended to include momentum (just redefine \theta and g as needed) and stochasticity (replace g with a family of functions (g_i)).
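As a sketch of this view, here is vanilla gradient descent on the toy loss L(\theta) = (\theta - 3)^2 (the loss and learning rate are arbitrary choices for illustration):

alpha = 0.1                   # learning rate (arbitrary)

def g(theta):
    # g(theta) = theta - alpha * dL/dtheta, with L(theta) = (theta - 3)**2
    return theta - alpha * 2.0 * (theta - 3.0)

theta = 0.0
for _ in range(100):
    theta = g(theta)
print(theta)                  # approaches the fixed point theta = 3

Note that g'(3) = 1 - 2\alpha = 0.8, so by the earlier criterion the minimum is an attractive fixed point.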

The typical issue in gradient descent is that g usually has more than one fixed point; in fact, there are usually a lot of fixed points.5 So understanding what generates these fixed points is pretty crucial to understanding the local minima that hinder global convergence.

Reinforcement Learning

Yet another machine learning example of fixed points comes from Bellman's equation, commonly seen in Reinforcement Learning:

V(s) = \max_a \mathbb E_{s'} \{r(s,a,s') + \gamma V(s') \mid s,a \}

In words, the value of state s is the expected discounted reward collected over time when the policy is to always choose the best action a. This is also a fixed-point equation, except it is over the whole function V instead of a single real number:

V(s) = B(V)(s)

It turns out that the optimal value function V^* is the unique fixed point of B. If V is represented as a finite-length vector, the two basic algorithms for computing V^*, Value Iteration and Policy Iteration, follow almost immediately.
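Here is a minimal Value Iteration sketch for a tiny two-state, two-action MDP; every number in it is invented for illustration:

import numpy as np

gamma = 0.9
# P[a, s, s'] = transition probability; R[a, s, s'] = reward (made-up MDP)
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, 0.0]]])

V = np.zeros(2)
for _ in range(1000):
    # Apply the Bellman operator B: V(s) = max_a E_{s'}[r + gamma * V(s')]
    Q = (P * (R + gamma * V)).sum(axis=2)   # Q[a, s]
    V = Q.max(axis=0)
print(V)                                    # approximately the unique fixed point V*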

Evolutionary Algorithms

Even evolutionary algorithms allow an iterative interpretation. Let P be a population and (e_i) be a family of functions which modify the population. At each time step, we compute the new population according to

P_{t+1} = e_t(P_t)

By choosing e_i and P_0 carefully, the hope is that P_n (quickly) converges to an approximately optimal population.6 By this (very) general definition, stochastic gradient descent and its variations are just evolutionary algorithms with single-member populations.7
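Under that definition, even a toy mutation-plus-selection loop fits the P_{t+1} = e_t(P_t) mold. A sketch that evolves a population of scalars toward the minimum of (x - 3)^2, using the same e at every step for simplicity (all numbers invented):

import random

def e(P):
    # One generation: mutate everyone, then keep the fitter half (truncation selection)
    mutated = [x + random.gauss(0, 0.1) for x in P]
    survivors = sorted(P + mutated, key=lambda x: (x - 3.0) ** 2)
    return survivors[:len(P)]

P = [random.uniform(-10, 10) for _ in range(20)]
for _ in range(200):
    P = e(P)
print(P[0])                   # best member, close to 3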

Conclusion

By lifting many commonplace techniques and algorithms to the abstraction of functions, we can glean insights that might have been obscured on the lower level by algebra and technicalities. In the modern world where deep learning abounds, this abstraction is increasingly important; after all, what is a neural network except a function that happens to be efficient to compute and train?

Footnotes

  1. Functional languages, of course, do not obscure this, at the cost of imposing extra mental effort on the programmer for basic problems. At least in my opinion. I’m sure serious Haskell or Lisp developers would challenge me. For complex problems (i.e. multi-threaded and asynchronous), the tables flip and functional paradigms are easier to reason about.
  2. Except when debugging.
  3. I encourage you to try this on the graph above.
  4. The choice of integer coefficients relies on Fermat's Little Theorem, and is sufficient but not unique.
  5. A notable exception is linear regression.
  6. Of course, the introduction of stochasticity in allowing a family of functions brings more complication to what a "fixed point" is, but I feel that the basic intuition is the same.
  7. Increasing the population size might lead to interesting parallel algorithms. One idea would be to have independent processors apply gradient descent to their own parameter vectors (i.e. mutation) and periodically crossover between processors.
