Planet Scheme

July 22, 2014

Programming Praxis

Maximum Sum Path

Sometimes here at Programming Praxis we publish exercises taken from homework in computer science classes, sometimes we publish exercises that are based on or related to problems at Project Euler, and sometimes we publish exercises that based on common technical interview questions. Today we hit the trifecta: a common interview question that is frequently assigned as homework in computer science classes and that also appears at Project Euler (twice!):

Given a binary tree with integers stored in its nodes, find the maximum sum of any path from root to leaf.

For instance, in the binary tree represented by the triangle shown below, the maximum sum path goes from the 3 at the root, to its child 7, to its grandchild 4, to the leaf node 9, a sum of 23. Since there is no other path with a greater sum, the maximum sum is 23:

  7 4
 2 4 6
8 5 9 3

Your task is to write a program to compute the maximum sum path through a binary tree; also write a variant of the program that returns the path. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at July 22, 2014 09:00 AM

July 18, 2014

Programming Praxis

How Fermat Factored Integers

Living in the age of computers, I’ve always been fascinated by how much arithmetic it was possible to do before the advent of computers. The Egyptians knew the peasant method of multiplication when the built the pyramids. Archimedes knew the value of pi two centuries before Christ, with an approximation we still use today. Heron had an algorithm for calculating square roots two thousand years ago. And so on.

By the time of Fermat, Euler and Gauss, arithmetic was done with simple algorithms, rules of thumb, and large tables. Sadly, the tables were frequently wrong; Charles Babbage invented his Difference Engine for the purpose of automatically constructing accurate tables.

We studied Fermat’s algorithm for factoring integers in a previous exercise:

function fermat(n)
    x := floor(sqrt(n))
    r := x * x - n
    t := 2 * x + 1
    while r is not a square
        r := r + t
        t := t + 2
    x := (t - 1) / 2
    y := sqrt(r)
    return x - y, x + y

But how did Fermat actually work his algorithm? In addition to the basic arithmetic, he used some simple rules for identifying squares and a table of the squares of numbers similar to that on the next page, with the largest square at least as large as n.

Let’s consider the factorization of n = 13290059 as an example. First, Fermat removed small factors by trial division, say for factors less than 30 or 50. Then, Fermat looked up n in the table. If n appeared in the body of the table, then it is a perfect square, and the square root is its factor. But in the general case, Fermat finds x as the square root of the next smaller entry in the table. For 13290059, x = 3645, because 36452 < 13290059 < 36462. Then t = 7291 and r = -4034, which is not a perfect square (because it’s negative).

Now the iteration begins. Fermat calculated r and t in 45 steps as follows, first r, then t below it, then a sum line, then the sum r = r + t, then t incremented by 2 from the previous t, then a sum line, then a new sum r, at each step checking if r is a square:

0:  r =  -4034 = 3645 * 3645 - 13290059
    t =   7291 = 2 * 3645 + 1
1:  r =   3257 ends in 7; not a square
    t =   7293 = 7291 + 2
2:  r =  10550 ends in 50; not a square
    t =   7295 = 7293 + 2
3:  r =  17845 ends in 45; not a square
44: r = 318662 ends in 2; not a square
    t =   7381 = 7379 + 2
45: r = 326041 in table: 571 * 571

At that point, Fermat could calculate the factors of n: x = (7381 – 1) / 2 = 3690, y = 571, and the factors of n are xy = 3690 − 571 = 3119 and x + y = 3690 + 571 = 4261.

The key to Fermat’s calculation was identifying the case where r is a square. Fermat used three rules: First, the last two digits of the number had to be 00, e1, e4, 25, o6 or e9, where e is any even digit and o is any odd digit. Second, the digital root of the number had to be 1, 4, 7 or 9: sum the digits of n; if the result is less than 10, then return the sum of the digits, otherwise return the digital root of the sum of the digits. In practice, this goes much more quickly than it appears by “casting out nines” wherever they appear in the digits of n or the result. Third, if he had not already determined that the number was not a square, he looked it up in his table of squares. The whole calculation goes much more quickly than I can describe it here.

And that’s how Fermat factored integers. Of course, sometimes he failed, when it took too many iterations to find a square, or when his table of squares was too small, or when he made an error in the arithmetic. Fortunately, it was easy enough for him to confirm that any factorization he found was correct by multiplying the factors.

Your task is to write a program that factors integers the same way Fermat factored integers. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the program in the comments below.

by programmingpraxis at July 18, 2014 09:00 AM

July 15, 2014

Programming Praxis

How Many Distinct Products In A Times Table?

We all learned our times tables in elementary school. For instance, here is the standard 9 by 9 times table:

1  2  3  4  5  6  7  8  9
2  4  6  8 10 12 14 16 18
3  6  9 12 15 18 21 24 27
4  8 12 16 20 24 28 32 36
5 10 15 20 25 30 35 40 45
6 12 18 24 30 36 42 48 54
7 14 21 28 35 42 49 56 63
8 16 24 32 40 48 56 64 72
9 18 27 36 45 54 63 72 81

That table has 36 distinct products: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 15, 16, 18, 20, 21, 24, 25, 27, 28, 30, 32, 35, 36, 40, 42, 45, 48, 49, 54, 56, 63, 64, 72 and 81.

Your task is to write a program that computes the number of distinct products in an n by n times table. There is an obvious O(n2) algorithm; can you do better? When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at July 15, 2014 09:00 AM

July 14, 2014

Jeremy Kun

An Update on “Coloring Resilient Graphs”

A while back I announced a preprint of a paper on coloring graphs with certain resilience properties. I’m pleased to announce that it’s been accepted to the Mathematical Foundations of Computer Science 2014, which is being held in Budapest this year. Since we first published the preprint we’ve actually proved some additional results about resilience, and […]

by Jeremy Kun at July 14, 2014 03:00 PM

July 13, 2014

GNU Guix

Guix at OpenBio Codefest 2014

On Wednesday, July 9th, David Thompson gave a brief introduction to GNU Guix at the Open Bioinformatics Codefest 2014 hackathon. The objective of the Codefest is to give developers of bioinformatics software a chance to be fully focused on their projects for a few days and work in person. These developers are concerned with the reliability and reproducibility of their operating systems, and the limitations of their package management utilities.

See the slides on-line.

by David Thompson at July 13, 2014 08:26 PM

July 11, 2014

Programming Praxis

FSM Generator

A few weeks ago we had an exercise that solved the problem of removing singleton occurrences of a given character from a string while leaving multiple occurrences of the given character intact. Our solution used a finite state machine that iterated over the input string writing output as it went.

The finite state machine used in that solution was hand-crafted to solve the given problem. But it is possible to write a program that takes a definition of a finite state machine and creates a function that takes an input string and applies the finite state machine to it. Such a program isn’t hard to create and provides a useful utility for small parsing problems.

Your task is to write a program that generates a finite state machine from a simple description; the format and capabilities of the machine are up to you, but it should be at least powerful enough to solve the “remove singleton” problem. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at July 11, 2014 09:00 AM

July 08, 2014

Programming Praxis

SQUFOF, Continued Fraction Version

Square Forms Factorization, abbreviated SQUFOF, is a special-purpose method for factoring integers. It is commonly used to split small semi-primes in the double-large-prime variants of the quadratic sieve and number field sieve factoring algorithms. We studied the “binary quadratic forms” variant of SQUFOF in a previous exercise; today we study the “continued fraction” variant of SQUFOF.

Although the mathematical basis of SQUFOF is the quadratic forms of Gauss, SQUFOF was discovered by Daniel Shanks while he was investigating the failures of the continued fraction factoring algorithm; you should read his paper, which is as good a demonstration of the process of mathematical discovery as anything I have ever read. Shanks’ algorithm expands the continued fraction of the square root of n, the number being factored, searching among the convergents for a quadratic form that is a perfect square; in this, it differs from the CFRAC algorithm, which combines multiple convergents to assemble a perfect square. We copy the algorithm from the paper by Jason Gower and Sam Wagstaff that provides the standard description of SQUFOF:

[ Note: In the description that follows, be careful to distinguish the symbols and Q; the first one has a circumflex over the Q, the second one doesn't. Some browsers render the "combining circumflex accent" readably, but some don't. On the browsers available to me, only one out of four rendered the symbol readably, sorta kinda, and only when I expanded the font to be quite large. If in doubt, look at the program on the next page, where the first symbol is named qhat and the second symbol is named q. ]

The variable N is the integer to factor, S remembers q0, q holds the current qi, P and P’ hold two consecutive values of Pi, and Q hold two consecutive values of Qi, and t is a temporary variable used in updating Q. Finally, B is an upper bound on the number of forms tested for being square forms before the algorithm gives up.

1. Initialize: Read the odd positive integer N to be factored. If N is the square of an integer, output the square root and stop. If N ≡ 1 mod 4, then set D ← 2N; otherwise, set DN. In any case, set S ← ⌊√D⌋, Q̂ ← 1, PS, QDP · P, L ← ⌊2√2√D⌋, B ← 2 · L, and i ← 0.

2. Cycle forward to find a proper square form: Steps 2a through 2e are repeated for i = 1, 2, 3, ….

2a: Set q ← ⌊(S + P) / Q⌋ and P’q · QP.

2b: If Ql, then:

If Q is even, put the pair (Q / 2, P mod (Q / 2) onto the QUEUE; otherwise, if QL / 2, then put the pair (Q, P mod Q) onto the QUEUE.

2c: Set tQ̂ + q · (PP’), Q̂ ← t, and PP’.

2d: If i is odd, go to Step 2e. If Q is not the square of an integer, go to Step 2e. Otherwise, set r ← √Q, a positive integer. If there is no pair (r, t) in the QUEUE for which r divides Pt, then go to Step 3. If r > 1 and there is a pair (r, t) in the QUEUE for which r divides Pt, then remove all pairs from the beginning of the QUEUE up to and including this pair and go to Step 2e. If r = 1 and there is a pair (1, t) in the QUEUE, then the algorithm fails because we have traversed the entire principal period of quadratic forms of discriminant 4N without finding a proper square form.

2e: Let ii + 1. If i > B, give up. Otherwise, go to Step 2a.

3. Compute an inverse square root of the square form: Set Q̂ ← r, PP + r · ⌊(SP) / r⌋, and Q ← (DP · P) / Q̂.

4. Cycle in the reverse direction to find a factor of N:

4a: Set q ← ⌊(S + P) / Q⌋ and P’q · QP.

4b: If P = P’, then go to Step 5.

4c: Set tQ̂ + q · (PP’, Q̂ ← Q, Qt, and PP’ and go to Step 4a.

5. Print the factor of N: If Q is even, set QQ / 2. Output the factor Q of N.

Later, in Section 5.2 of their paper, Gowers and Wagstaff give instructions for using a multiplier m to help find a factor of N, changing the algorithm given above as follows:

Change Step 1 to this: Read the odd positive integer N to be factored. If N is the square of an integer, output the square root and stop. If mN ≡ 1 mod 4, then set D ← 2mN; otherwise, set DmN. In any case, set S ← ⌊√D⌋, Q̂ ← 1, PS, QDP · P, L ← ⌊2√2√D⌋, B ← 2 · L, and i ← 0.

Change Step 2b to: Let gQ / gcd(Q, 2m). If gL, put (g, P mod g) onto the QUEUE. [ Note: If you're reading the linked version of the Gowers and Wagstaff paper, this formula is erroneously given as (g, B mod g); that error cost me several hours of pain. ]

In Step 5, replace Q by Q / gcd(Q, 2m) before the factor Q is output.

To find a factor, first use multiplier m = 1. If that fails, try another multiplier from the set 3, 5, 7, 11, 3 · 5 = 15, 3 · 7 = 21, 3 · 11 = 33, 5 · 7 = 35, 5 · 11 = 55, 7 · 11 = 77, 3 · 5 · 7 = 105, 3 · 5 · 11 = 165, 3 · 7 · 11 = 231, 5 · 7 · 11 = 385, 3 · 5 · 7 · 11 = 1155, or any other squarefree product of small odd primes.

Your task is to implement SQUFOF with multipliers. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at July 08, 2014 09:00 AM

July 07, 2014

GNU Guix

GNU dmd 0.2 released

GNU dmd 0.2 has been released. It provides new features such as the ability to load new definitions for existing services, as well as bug fixes and an improved manual.

GNU dmd is a dependency-based service manager meant to be used as the init system in GNU. It is written in Guile Scheme.

by Ludovic Courtès at July 07, 2014 10:04 PM

Jeremy Kun

When Greedy Algorithms are Good Enough: Submodularity and the (1 – 1/e)-Approximation

Greedy algorithms are among the simplest and most intuitive algorithms known to humans. Their name essentially gives their description: do the thing that looks best right now, and repeat until nothing looks good anymore or you’re forced to stop. Some of the best situations in computer science are also when greedy algorithms are optimal or near-optimal. There is a […]

by Jeremy Kun at July 07, 2014 03:00 PM

July 04, 2014

Programming Praxis

Minimum And Maximum

We have today a two-part interview question:

First, write a program that takes an array of integers and returns the minimum and maximum elements of the array. Use as few comparisons as possible.

Second, write a program that takes an array of integers and returns the second minimum and maximum elements of the array; that is, the second-smallest element and the second-largest element. Again, use as few comparisons as possible.

Your task is to write the two programs described above, and state the number of comparisons each makes in the general case; be sure they are proof against malicious input. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at July 04, 2014 09:00 AM

July 02, 2014

Jeremy Kun

AMS Network Science Mathematical Research Community

I don’t usually write promotional posts because I don’t enjoy reading them as much as I enjoy reading the technical posts. But I know that a lot of early graduate students and undergraduates read my blog, and this would be of interest to many of them. I just got back from Utah yesterday where I attended […]

by Jeremy Kun at July 02, 2014 06:47 PM

July 01, 2014

Programming Praxis

Moonrise, Moonset

We have previously studied the times of sunrise and sunset and the phases of the moon. Today we look at the times of moonrise and moonset.

The standard source for astronomical calculations is the Naval Observatory; they point to a 1989 article in Sky & Telescope for the calculation. The article isn’t online, but the code is: a horribly ugly BASIC program, reproduced on the next page.

Your task is to write a program that calculates the times of moonrise and moonset. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at July 01, 2014 09:00 AM

Andy Wingo

flow analysis in guile

Greets, and welcome back to the solipsism! I've been wandering the wilderness with my Guile hackings lately, but I'm finally ready to come back to civilization. Hopefully you will enjoy my harvest of forest fruit.

Today's article is about flow analysis and data structures. Ready? Let's rock!

flow analysis

Many things that a compiler would like to know can be phrased as a question of the form, "What do I know about the data flowing through this particular program point?" Some things you might want to know are:

  1. The set of variables that must be live.

  2. The set of variables that happen to be live. This is the same as (1) except it includes variables that aren't needed but haven't been clobbered by anything.

  3. The set of expressions whose results are still valid (i.e., haven't been clobbered by anything else).

  4. An upper and lower bound on the range of numeric variables.

Et cetera. I'll talk about specific instances of flow analysis problems in the future, but today's article is a bit more general.

The first thing to note about these questions is that they don't necessarily need or have unique answers. If GCC decides that it can't prove anything about the ranges of integers in your program, it's not the end of the world -- it just won't be able to do some optimizations that it would like to do.

At the same time, there are answers that are better and worse than others, and answers that are just invalid. Consider a function of the form:

int f():
  int a = 1
  int b = 2
  int c = a + b
  int d = b + c
  int z = x + y
  return z

In this function, there are 27 different expressions, including the return, and 27 different program points. (You can think of a program point as a labelled sub-expression. In this example none of the expressions have sub-expressions.) If we number the program points in order from 0 to 26, we will have a program that first executes expression 0 (int a = 1), then 1, and so on to the end.

Let's plot some possible solutions to the live variable flow-analysis problem for this program.

Here we see two solutions to the problem (in light and dark blue), along with a space of invalid solutions (in red). The Y axis corresponds to the variables of the program, starting with a on the bottom and finishing with z on the top.

For example, consider position 4 in the program, corresponding to int e = c + d. It is marked in the graph with a vertical bar. After position 4, the only values that are used in the rest of the program are d and e. These are the variables that are contained within the light-blue area. It wouldn't be invalid to consider a, b, and c to be live also, but it also wouldn't be as efficient to allocate space and reason about values that won't contribute to the answer. The dark blue space holds those values that may harmlessly be considered to be live, but which actually aren't live.

It would, however, be invalid to consider the variable f to be live after position 4, because it hasn't been defined yet. This area of the variable space is represented in red on the graph.

Of course, the space of all possible solutions isn't possible to represent nicely on a two-dimensional graph; we're only able to show two with colors, and that not very well as they overlap. This difficulty cuts close to the heart of the data-flow problem: that it ultimately requires computing a two-dimensional answer, which necessarily takes time and space O(n2) in program size.

Or does it?

classical flow analysis frameworks

The classical way to do flow analysis is to iterate a set of data-flow equations over an finite lattice until you reach a fixed point.

That's a pithy sentence that deserves some unpacking. If you're already comfortable with what it means, you can skip a couple sections.

Still here? Cool, me too. Let's take a simple example of sign analysis. The problem is to determine, for the integer variables of a program, at every point in the program, which ones may be negative (-), which ones may be zero (0), and which may be positive (+). All of these are conceptually bit-flags.

For example, in this program:

int f(int x):
 L0:  while (x >= 0)
 L1:    int y = x - 1
 L2:    x = y
 L3:  return x

We can assign the flags -0+ to the argument x as the initial state that flows into L0, because we don't know what it is ahead of time, and it is the only variable in scope. We start by representing the initial state of the solution as a set of sets of state values:

state := {L0: {x: -0+}, L1: Ø, L2: Ø, L3: Ø}

In this notation, Ø indicates a program point that hasn't been visited yet.

Now we iterate through all labels in the program, propagating state to their successors. Here is where the specific problem being solved "hooks in" to the generic classical flow analysis framework: before propagating to a successor, a flow equation transforms the state that flows into a program point to a state that flows out, to the particular successor. In this case we could imagine equations like this:

visit_test(expr, in, true_successor, false_successor):
  if expr matches "if var >= 0":
    # On the true branch, var is not negative.
    propagate(in + {var: in[var] - -}, true_successor)
    # On the false branch, var is not zero and not positive.
    propagate(in + {var: in[var] - 0+}, false_successor)
  else if ...

visit_expr(expr, in, successor):
  if expr matches "left = right - 1":
    if in[right] has +:
      if in[right] has 0:
        # Subtracting one from a non-negative arg may be negative.
        propagate(in + {left: in[right] + -}, successor)
        # Subtracting one from a positive arg may be 0.
        propagate(in + {left: in[right] + 0}, successor)
      # Subtracting one from a nonpositive arg will be negative.
      propagate(in + {left: -}, successor)
  else if expr matches "left = right":
    propagate(in + {left: in[right]}, successor)

The meat of classical data-flow analysis is the meet operation:

propagate(out, successor):
  if state[successor] is Ø:
    state[successor] = out
    state[successor] = meet(out, state[successor]):

# A version of meet for sign analysis
meet(out, in):
  return intersect_vars_and_union_values(out, in)

Let's run this algorithm by hand over the example program. Starting from the initial state, we propagate the L0→L1 and L0→L3 edges:

visit_test("if x <= 0", {x: -0+}, L1, L3)
→ propagate({x: 0+}, L1)
→ state[L1] = {x: 0+}
→ propagate({x: -}, L3)
→ state[L3] = {x: -}

Neat. Let's keep going. The successor of L1 is L2:

visit_expr("y = x - 1", {x: 0+}, L2)
→ propagate({x: 0+, y: -0+}, L2)
→ state[L1] = {x: 0+, y: -0+}

L2→L0 is a back-edge, returning to the top of the loop:

visit_expr("x = y", {x: 0+, y: -0+}, L0)
→ propagate({x: -0+, y: -0+}, L0)
→ state[L0] = meet({x: -0+, y: -0+}, state[L0])
→ state[L0] = meet({x: -0+, y: -0+}, {x: -0+})
→ state[L0] = {x: 0+}

Finally, L3 has no successors, so we're done with this iteration. The final state is:

{L0: {x: -0+},
 L1: {x: 0+},
 L2: {x: 0+, y: -0+},
 L3: {x: -}}

which indeed corresponds with what we would know intuitively.

fixed points and lattices

Each of the steps in our example flow analysis was deterministic: the result was calculated from the inputs and nothing else. However the backwards branch in the loop, L2→L0, could have changed inputs that were used by the previous L0→L1 and L0→L3 forward edges. So what we really should do is iterate the calculation to a fixed point: start it over again, and run it until the state doesn't change any more.

It's easy to see in this case that running it again won't end up modifying the state. But do we know that in all cases? How do we know that iteration would terminate at all? It turns out that a few simple conditions are sufficient.

The first thing to ensure is that state space being explored is finite. Here we can see this is the case, because there are only so many ways you can combine -, 0, and +. Each one may be present or not, and so we have 2n = 23 = 8 possible states. The elements of the state array will be a set with at most one entry for each variable, so the whole state space is finite. That at least ensures that an answer exists.

Next, the "meet" operation has to be commutative, associative, and idempotent. The above example used intersect_vars_and_union_values. We intersect vars because it only makes sense to talk about a variable at a program point if the variable dominates the program point. It didn't make sense to propagate y on the L2→L0 branch, for example. It's usually a good idea to model a data-flow problem using sets, as set union and intersection operations fulfill these commutative, associative, and distributive requirements.

Finally, the state being modelled should have a partial order, and functions that add information along control-flow edges -- above, visit_test and visit_expr -- should preserve this partial ordering. That is to say, visit_test and visit_expr should be monotonic. This means that no matter on what control paths data propagates, we keep building towards an answer with more information, making forward progress. This condition is also easily fulfilled with sets, or more generally with any lattice. (A lattice is nothing more than a data type that fulfills these conditions.)

Iterating the data-flow equations until the state stops changing will find a fixed point of the lattice. Whether you find the greatest or least fixed point is another question; I can't help linking to Paul Khuong's old article on Québécois student union strikes for a lovely discussion.

Another question is, how many iterations are necessary to reach a fixed point? I would first note that although in our walk-through we iterated in forward order (L0, L1, L2, L3), we could have visited nodes in any order and the answer would be the same. I'll cut to the chase and say that if:

  1. you represent your state with bitvectors

  2. the control-flow graph is reducible (has only natural loops)

  3. the meet operation on values is bitvector union or intersection

  4. you visit the program points in topologically sorted order

If these conditions are fulfilled, then you will reach a fixed point after LC + 2 iterations, where LC is the "loop-connectness number" of your graph. You can ensure (1), (3), and (4) by construction. (Reverse post-order numbering is an easy way to fulfill (4).) (2) can be ensured by using programming languages without goto (a for loop is always a natural loop) but can be violated by optimizing compilers (for example, via contification).

Loop connectedness is roughly equivalent to the maximum nesting level of loops in the program, which has experimentally been determined to rarely exceed 3. Therefore in practice, data-flow analysis requires a number of steps that is O(n * 5) = O(n) in program size.

For more information on data-flow analysis, including full proofs and references, see Carl Offner's excellent, excellent manuscript "Notes on Graph Algorithms used in Optimizing Compilers". I don't know of any better free resource than that. Thanks, Carl!

an aside: the kCFA algorithms

I just finished describing what I called "classical" data-flow analysis. By that I mean to say that people have been doing it since the 1970s, which is classical enough as far as our industry goes. However with the rise of functional languages in the 1980s, it became unclear how to apply classical data-flow analysis on a language like Scheme. Let's hear it from the horse's mouth:

This brings us to the summer of 1984. The mission was to build the world's most highly-optimising Scheme compiler. We wanted to compete with C and Fortran. The new system was T3, and the compiler was to be called Orbit. We all arrived at WRL and split up responsibility for the compiler. Norman was going to do the assembler. Philbin was going to handle the runtime (as I recall). Jonathan was project leader and (I think) wrote the linker. Kranz was to do the back end. Kelsey, the front end. I had passed the previous semester at CMU becoming an expert on data-flow analysis, a topic on which I completely grooved. All hot compilers do DFA. It is necessary for all the really cool optimisations, like loop-invariant hoisting, global register allocation, global common subexpression elimination, copy propagation, induction-variable elimination. I knew that no Scheme or Lisp compiler had ever provided these hot optimisations. I was burning to make it happen. I had been writing 3D graphics code in T, and really wanted my floating-point matrix multiplies to get the full suite of DFA optimisation. Build a DFA module for T, and we would certainly distinguish ourselves from the pack. So when we divided up the compiler, I told everyone else to back off and loudly claimed DFA for my own. Fine, everyone said. You do the DFA module. Lamping signed up to do it with me.

Lamping and I spent the rest of the summer failing. Taking trips to the Stanford library to look up papers. Hashing things out on white boards. Staring into space. Writing little bits of experimental code. Failing. Finding out *why* no one had ever provided DFA optimisation for Scheme. In short, the fundamental item the classical data-flow analysis algorithms need to operate is not available in a Scheme program. It was really depressing. I was making more money than I'd ever made in my life ($600/week). I was working with *great* guys on a cool project. I had never been to California before, so I was discovering San Francisco, my favorite city in the US and second-favorite city in the world. Silicon Valley in 1984 was beautiful, not like the crowded strip-mall/highway hell hole it is today. Every day was perfect and beautiful when I biked into work. I got involved with a gorgeous redhead. And every day, I went in to WRL, failed for 8 hours, then went home.

It was not a good summer.

At the end of the summer, I slunk back to CMU with my tail between my legs, having contributed not one line of code to Orbit.

Olin Shivers, A history of T

It took him another 7 years, but Shivers stuck with it, and in the end came out with the family of algorithms known as k-CFA. Instead of focusing on loops, which Scheme doesn't have syntactically, Shivers used continuation-passing style to ruthlessly simplify Scheme into a dialect consisting of not much more than function calls, and focused his attention on function calls. The resulting family of flow algorithms can solve flow equations even in the presence of higher-order functions -- a contribution to computer science born out of necessity, failure, and stubbornness.

With all those words, you'd think that I'd be itching to use k-CFA in Guile, and I'm not. Unfortunately even the simplest, least expressive version (0-CFA) is O(n2); 1-CFA is exponential. I don't have time for that. Instead, Guile is able to use classical DFA because it syntactically distinguishes labelled continuations and functions, and contifies functions to continuations where possible, which makes the Scheme DFA problem exactly the same as in any other language.

n times what?

Now that we have established that the number of visit operations is O(n), it remains to be seen what the individual complexity of a visit operation is in order to determine the total complexity. The naïve thing is just to use bitvectors, with each of the bitvectors having as many entries as the program has variables, times however many bits we are using.

This leads to O(|L|*|V|) space and time complexity, where |L| is the number of program points (labels) and |V| is the number of variables. As the number of variables is generally proportional to the size of program, we can approximate this as O(n2).

In practice, this means that we can use data-flow analysis to programs up to about 10000 labels in size. Sign analysis on a 10000-label function would require 100002*3/8 = 37.5 MB of memory, which is already a bit hefty. It gets worse if you need to represent more information. I was recently doing some flow-sensitive type and range inference, storing 12 bytes per variable per program point; for a 10000-label function, that's more than a gigabyte of memory. Badness.

shared tails

Although it was the type inference case that motivated this investigation, sign inference is similar and more simple so let's go with that. The visit_expr and visit_test functions above are only ever going to add additional information about the variables that are used in or defined by an expression; in practice this is a small finite number. What if we chose a representation of state that could exploit this fact by only adding O(1) amounts of data, sharing a common tail with preceding expressions?

If we draw a control-flow graph for the sign analysis program, we get something like:

The goal is to create a data structure that looks like the dominator tree. For "normal" control-flow edges -- those whose destination only have one predecessor -- we can avoid the "meet" operations, and just copy the predecessor's out set to the successor's in set. We then define "meet" as an adjoin operation that effectively conses the new information onto a shared tail, if it wasn't there already. The first iteration through the CFG will initialize the shared tail of a given control-flow join to the set of variables flowing into the join's dominator. Subsequent information will adjoin (cons) on new incoming values. In this case the resulting data structure ends up looking like:

Here the italic references like L1 indicate shared structure, and the tuples annotating the edges represent additional information flow, beyond that information that was already present in the successor's dominator.

Of course, you can implement this with linked lists and it will work fine. The problem there will be lookup speed -- when your visit operation (visit_expr or visit_test) goes to look up the sign of a variable, or the same happens via the meet operation, you get O(n) lookup penalties. Anticipating this, I implemented this with a version of Phil Bagwell's vhashes, which promise O(log n) variable lookup. See Guile's documentation, or Bagwell's excellent paper.

Note that you can't remove items from sets once they have been added in a shared-tail flow analysis; to keep the meet function monotonic, you have to instead insert tombstone entries. Not so nice, but it is what it is.

A shared-tail flow analysis consumes only O(1) additional memory per node, leading to O(n) space complexity. I have some measured space and time graphs below that show this experimentally as well.

space and time

Unfortunately, lookup time on branchy vhashes is really terrible: O(log n) in the best case, and O(n) at worst. This is especially acute because there is no easy way to do unions or intersections on vhashes -- you end up having to compute the unshared heads of the two vhashes you are merging, and looking up elements in one in the other... I could go on, but I think you believe me when I say it gets complicated and slow. It's possible to beat a bitvector approach in time for relatively "big" problems like type analysis, but for common subexpression elimination where I was just storing a bit per expression, it was tough to beat the speed of bitvectors.

I started looking for another solution, and in the end came on a compromise that I am much happier with, and again it's Phil Bagwell to the rescue. Instead of relying on vhashes that explicitly share state, I use Clojure-style persistent sparse bit-sets and bit-maps that share state opportunistically.

Guile's intset module implements a bitvector as a functional tree whose branches are vectors and whose leaves are fixnums. Each leaf represents one range of 32 integers, and each branch on top of it increases the range by a factor of 8. Branches can be sparse, so not all integers in the range of an intset need leaves.

As you would expect, adjoining an element onto such a tree is O(log n). Intersecting is much faster than vhashes though, as intsets partition the key space into power-of-two blocks. Intsets try hard to share state, so that if your adjoin would return the same value, the result is the same object, at the same address. This allows sub-trees to be compared for equality via pointer comparison, which is a great fast-path for intersection and union.

Likewise, Guile's new intmap module allow the association of larger values with integer keys.

science! fetch your white coats and lab books!

I had the chance to actually test the system with all three of these data structures, so I compiled one of Guile's bigger files and recorded the memory used and time taken when solving a 1-bit flow analysis problem. This file has around 600 functions, many of them small nested functions, many of them macro-generated, some of them quite loopy, and one big loopless one (6000 labels) to do the initialization.

First, a plot of how many bytes are consumed per label during while solving this 1-bit DFA.

Note that the X axis is on a log scale.

The first thing that pops out at me from these graphs is that the per-label overhead vhash sizes are indeed constant. This is a somewhat surprising result for me; I thought that iterated convergence would make this overhead depend on the size of the program being compiled.

Secondly we see that the bitvector approach, while quadratic in overall program size, is still smaller until we get to about 1000 labels. It's hard to beat the constant factor for packed bitvectors! Note that I restricted the Y range, and the sizes for the bitvector approach are off the charts for N > 1024.

The intset size is, as we expected, asymptotically worse than vhashes, but overall not bad. It stays on the chart at least. Surprisingly, intsets are often better than vhashes for small functions, where we can avoid allocating branches at all -- note the "shelves" in the intset memory usage, at 32 and 256 entries, respectively, corresponding to the sizes that require additional levels in the tree. Things keep on rising with n, but sublinearly (again, recall that the X axis is on a log scale).

Next, a plot of how many nanoseconds it takes per label to solve the DFA equation above.

Here we see, as expected, intsets roundly beating vhashes for all n greater than about 200 or so, and show sublinear dependence on program size.

The good results for vhashes for the largest program are because the largest program in this file doesn't have any loops, and hardly any branching either. It's the best case for vhashes: all appends and no branches. Unfortunately, it's not the normal case.

It's not quite fair to compare intsets to bitvectors, as Guile's bitvectors are implemented in C and intsets are implemented in Scheme, which runs on a bytecode VM right now. But still, the results aren't bad, with intsets even managing to beat bitvectors for the biggest function. The gains there probably pay for the earlier losses.

This is a good result, considering that the goal was to reduce the space complexity of the algorithm. The 1-bit case is also the hardest case; when the state size grows, as in type inference, the gains of using structure-sharing trees grow accordingly.


Let's wrap up this word-slog. A few things to note.

Before all this DFA work in Guile, I had very little appreciation of the dangers of N-squared complexity. I mean, sometimes I had to to think about it, but not often, expecially if your constant factors are low, or so I thought. But I got burned by it; hopefully the next time, if any, will be a long time coming.

I was happily, pleasantly surprised at the expressiveness and power of Bagwell/Clojure-style persistent data structures when applied to the kinds of problems that I work on. Space-sharing can make a fundamental difference to the characteristics of an algorithm, and Bagwell's data structures can do that well. Intsets simplified my implementations because I didn't have to reason much about space-sharing on my own -- finding the right shared tail for vhashes is, as I said, an unmitigated mess.

Finally I would close by saying that I was happy to fail in such interesting (to me) ways. It has been a pleasant investigation and I hope I have been able to convey some of the feeling of it. If you want to see what the resulting algorithm looks like in practice, see compute-truthy-expressions.

Until next time, happy hacking!

by Andy Wingo at July 01, 2014 08:00 AM

June 27, 2014

Programming Praxis


We’re going to try something different today. We often have interview questions here, but always of the type that require writing a program. Today, we will have one of the brain-teaser type of interview question:

How many raindrops fall on the planet every year?

Your task is to estimate the answer to the question posed above. When you are finished, you are welcome to read a suggested solution or to discuss your solution in the comments below.

by programmingpraxis at June 27, 2014 09:00 AM

June 24, 2014

Programming Praxis

Busy Beaver

Yesterday was Alan Turing’s birthday, so today we will write a program for a turing machine.

The Busy Beaver is a turing machine that performs the maximum work for a given configuration of machine; the concept was invented by Tibor Radó in his 1962 paper On Non-Computable Functions. We won’t look at the underlying theory, although it is fascinating if you have the time. Instead, we’ll be content to implement the first few busy beavers. Here are the two-symbol busy beavers for one, two, three and four states, from Wikipedia; the two symbols are 0 and 1, the states are letters A through D plus the halting state H, and the moves L and R are left and right:

0 1RH
1 unused
  A B
0 1RB 1LA
1 1LB 1RH
  A B C
0 1RB 0RC 1LC
1 1RH 1RB 1LA
  A B C D
0 1RB 1LA 1RH 1RD
1 1LB 0RC 1LD 0RA

Your task is to implement the four busy beavers. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at June 24, 2014 09:00 AM