# Planet Scheme

## October 21, 2016

### Programming Praxis

#### Checking Primality

[ I haven’t had time to prepare a new exercise because I’ve been busy finishing up a big project at work. So the bad news is we have a repeat exercise, but the good news is that my project is done! ]

We have today an exercise that has appeared on message boards at least a dozen times in the last few weeks; it must be that time in the semester when beginning programming students are working on their mid-term homework assignments.

Write a program that, given an integer, determines whether or not it is prime.

Your task is to write a program that determines if a given integer is prime, in the style of a beginning programmer who is less than confident of his knowledge of variable assignment and loops; do not use a complicated algorithm. When you are finished, you are welcome to read a suggested solution, or to post your own solution or discuss the exercise in the comments below.

## October 20, 2016

### Peter Bex

#### CHICKEN's numeric tower: part 5

Now that we have covered the most important algorithms, it's time to take a look at the internal data representation of extended numerals. This will be the final part in this series of posts.

## Ratnums and cplxnums

Recall from my data representations article that fixnums are represented as immediate values, directly within a machine word. Flonums are represented in boxed form, by putting them in an opaque bytevector-like structure.

The data representations of complex numbers and rational numbers are pretty simple. Each have their own type tag, and they both contain two slots: the numerator and denominator in case of rational numbers, and the real and imaginary parts in case of complex numbers.

As you can see in the above diagram, the representations of ratnums and cplxnums are very similar. In the example, the slots contain just fixnums. Rational numbers are the simplest here: they can only contain integers (bignums or fixnums). Complex numbers can consist of any number type except other complex numbers, but the exactness of the real and imaginary components must match. This means you can't have 1.5+2/3i, for example.

In its most complex (haha!) form, a complex number contains a rational number in both the real and the imaginary parts, and these rational numbers both contain bignums as their numerator and denominator. In this situation, the entire complex number takes up a whopping minimum of 29 machine words: 3 words for the wrapper complex number, 2 times 3 words for each of the ratnums, and 4 times 5 words for the bignums inside the ratnums.

We'll now look into why bignums require at least 5 words.

## Bignums

Initially I tried to represent bignums as a kind of opaque bytevector, much like how flonums are stored. Memory-wise this is the best representation as it has no unnecessary overhead: only 2 extra words; one for the header and one for the sign. On a 32-bit machine it would look like this:

This representation is somewhat wasteful, because it uses a full machine word to represent the sign, which is only one bit of information! This is done to keep the bignum digits word-aligned, which is important for performance. The sign could be shoved into the header if we really wanted to be frugal on memory, but doing so would also complicate type detection. Alternatively, we could store the bignum's digits in 2s complement form so the sign is simply the high bit of the top digit, but that complicates several algorithms.

Regarding the "bytevector" part: because the limbs are word-aligned, it makes more sense to represent the size in words rather than bytes. Unfortunately, there's no way to do this with the current data representation of CHICKEN. This was the direct cause of the following bug: Someone tried to represent the largest known prime number in CHICKEN, and it failed to behave correctly because we didn't have enough header bits to represent its size. This was just for fun, so no harm was done, but when someone will actually need such numbers in practice, they're out of luck. One of these days we're going to have to tackle this problem...

### Performance takes a hit

When I first integrated the "numbers" egg into CHICKEN 5, I also did some benchmarking. It turned out that my initial version made some benchmarks up to 8 times slower, though on average it would slow things down by a factor of 2. As pointed out by Alex Shinn and Felix Winkelmann, the reason it impacts some benchmarks so badly has to do with allocation.

Let's compile a loop going from zero to n, like so:

```;; Very silly code, calculates 100 * 100 in a stupid way
(let lp ((i 0))
(if (= i 100)
(* i i)

Originally, in CHICKEN 4 without the full numeric tower, the compiled C code looked like this:

```/* lp in k207 in k204 in k201 */
static void C_fcall f_220(C_word t0,C_word t1,C_word t2){
C_word tmp;
C_word t3;
C_word t4;
C_word t5;
C_word t6;
C_word *a;
loop:
C_check_for_interrupt;
if(!C_demand(C_calculate_demand(4, 0, 2))) {
C_save_and_reclaim_args((void *)trf_220, 3, t0, t1, t2);
}
a=C_alloc(4); /* Allocate flonum for overflow situation */
if(C_truep(C_i_nequalp(t2, C_fix(100)))) {
t3=t1;
{
C_word av2[2];
av2[0] = t3;
av2[1] = C_a_i_times(&a, 2, t2, t2);
((C_proc)(void*)(*((C_word*)t3+1)))(2, av2);
}
} else {
t3 = C_a_i_plus(&a, 2, t2, C_fix(1));
C_trace("test.scm:4: lp");
t5=t1;
t6=t3;
t1=t5;
t2=t6;
goto loop;
}
}```

It's not much to look at, but this is very close to optimal code: It's a C loop, which allocates a fixed size of memory from the stack/nursery into which it can write the result of + or *, in case they would overflow.

The compiler knows how it can compile + and * to "inlineable" C functions. Many of the most performance-critical functions are built into the compiler like that. But because the compiler (currently) doesn't perform range analysis, it's not smart enough to figure out that none of these operators in this example can cause an overflow. This bites us especially hard when introducing bignums: because we need to assume that any operator may overflow, we must be able to allocate a bignum. And assuming the result of these operators may be bignums, the next iteration of the loop is a bignum. Adding two bignums of unknown sizes together results in another bignum of unknown size.

Because of the above, we can't pre-allocate in a tight C loop. Instead, we must split our loop in two. This is needed to allow the garbage collector to kick in: if you'll recall from the garbage collector post, we need a continuation both for liveness analysis and as a place to jump back to after GC.

One part of our lp calls an allocating procedure, wrapping up the other part in a continuation:

```/* k251 in lp in k223 in k220 in k217 in k214 (first part of our "lp") */
static void C_ccall f_253(C_word c, C_word *av) {
C_word tmp;
C_word t0 = av[0];
C_word t1 = av[1];
C_word t2;
C_word *a;
C_check_for_interrupt;
if(!C_demand(C_calculate_demand(0, c, 2))) {
C_save_and_reclaim((void *)f_253, 2, av);
}
C_trace("test.scm:6: lp");
t2 = ((C_word*)((C_word*)t0)[2])[1];
f_236(t2, ((C_word*)t0)[3], t1);
}

/* lp in k223 in k220 in k217 in k214 (second part of our "lp") */
static void C_fcall f_236(C_word t0, C_word t1, C_word t2){
C_word tmp;
C_word t3;
C_word *a;
C_check_for_interrupt;
if(!C_demand(C_calculate_demand(4, 0, 3))) {
C_save_and_reclaim_args((void *)trf_236, 3, t0, t1, t2);
}
a=C_alloc(4);
if(C_truep(C_i_nequalp(t2, C_fix(100)))) {
C_trace("test.scm:5: *");
{
C_word av2[4];
av2[0] = C_SCHEME_UNDEFINED;
av2[1] = t1;
av2[2] = t2;
av2[3] = t2;
C_2_basic_times(4,av2);
}
} else {
/* Allocate continuation of (add1 i), which is the first part (f_253) */
t3=(*a = C_CLOSURE_TYPE|3, a[1] = (C_word)f_253,
a[2] = ((C_word*)t0)[2], a[3] = t1, tmp = (C_word)a, a += 4, tmp);
{
C_word av2[4];
av2[0] = C_SCHEME_UNDEFINED;
av2[1] = t3;
av2[2] = t2;
av2[3] = C_fix(1);
C_2_basic_plus(4,av2);
}
}
}```

As you can imagine, allocating a continuation on the stack every time is pretty heavy, and function calling isn't as cheap as a goto loop either. The first part of the loop doesn't even do anything. It just acts as a continuation to be received by the plus call. You can probably imagine how terrible the code would look if we compiled something like (/ (* (+ a b) (+ c d)) 2). That's at least 4 continuations, instead of a few simple statements.

For this reason, my patch was rejected (and rightly so!). The message was clear: code that doesn't use bignums should never pay a performance penalty just because bignums exist.

In order to fix this situation, I had to come up with a radical change to how bignums worked, or face the possibility that a full numeric tower would not make it into CHICKEN 5.

### Adding a new "scratch space" memory region

If we want to make the extended numeric operators as fast as the originals, we must be able to inline them. This prevents garbage collection, because we don't get the continuation for an inlined call. But what if they allocate some unknown quantity of memory? We can't allocate on the stack or heap, because that could cause either to fill up, requiring a GC.

So, the obvious solution is to allocate these objects elsewhere. A separate memory space in which bignums can be stored. But what if that space fills up? Don't we need to initiate a GC then? But this is where we're in luck: bignums are not compound objects! They are huge slabs of opaque data, much like strings. Because they can't refer to other objects, we are dealing with a simplified garbage collection problem: only the objects pointing to a bignum need to be updated.

Unfortunately, finding all the live objects that point to a bignum would be quite difficult. Luckily, like many problems in computer science, this can be easily solved by adding another level of indirection. While we're calling inline functions, we can allocate small objects on the stack, which will remain there, never moving until the next GC. We can use this to our advantage: whenever a bignum is needed, we allocate a fixed-size wrapper object on the stack. This object points into the scratch space, where the actual bignum data lives. See the following diagram:

In the diagram, we have a bignum representing the number 24386824307922, which we've put in a list and a vector, and we also have the rational number 1/24386824307922, which refers to the same bignum in its denominator. All these objects can be on the stack or on the heap. We have no control over them; the user can set any object slot to hold the bignum. We do have control over the wrapper object, and only the wrapper object directly points into scratch space. Because bignums are opaque objects in Scheme, the wrapper is invisible. Thus, user code is (in principle) unable to access the wrapper's data slot, so there will be no direct references to the bignum data portion. This means we're free to move it around without updating anything but the wrapper object's slot.

Note that in the scratch space, we also store a back-pointer to the wrapper object's slot. This allows us to update the wrapper object after moving its matching bignum data blob around. This way, we can reallocate the scratch space when more space is needed.

Some of the native functions like Karatsuba multiplication or Burnikel-Ziegler division generate many temporary values. All such hand-written code has been tuned to erase a bignum's back-pointer when that bignum is no longer needed. It makes the code quite a bit hairier, but it allows (limited) garbage collection to be done when reallocating the scratch space.

With this setup, all numeric operations only need to allocate memory to hold a bignum wrapper object. This is a fixed size, much like in CHICKEN 4, and it means numeric operations can once again be inlined!

Oh, and why a bignum takes up 5 words? Well, sometimes we know that a procedure receives 2 fixnums. In that case, we can pre-allocate a bignum for the case when it overflows. Because we know in its maximum size in advance, there's no need for the scratch space; we can just allocate it in the nursery. For uniformity reasons, such a bignum still requires a wrapper object (2 words) and a bignum data blob (3 words: its header, the sign and one limb). This sounds complicated, but it shortens the specialised code for two fixnums, and allocating only in the nursery is also faster.

## Some parting thoughts

Adding full numeric tower support has been extremely educational for me. I'm not really a math person, but having a clear goal like this motivated me to dive in deep into the literature. Overall, I'm happy with how it turned out, but there are always improvements.

For example, instead of doing everything in C it would (of course!) be preferable to do it all in Scheme. Unfortunately, CHICKEN's design makes it hard to do this in an efficient way: it's currently impossible to export Scheme procedures that can be called inline (i.e., non-CPS calls) without copying their full code into the call site. If we can find a way, it would be possible to do 90% of the algorithms in Scheme. The principle on which this would be based can be found in an old paper about the Lucid Common Lisp implementation. Basically, you implement a handful of primitives natively, and everything else can be done in Lisp. For example, SBCL is implemented this way too.

As far as I can tell, of the more popular Scheme implementations, Gambit is the only one that actually does this. I've been very impressed with Gambit in general. Besides having pretty readable Scheme code for bignum algorithms, Gambit has some superior bignum algorithms, most get close to (and in rare cases even surpass) GMP performance. This is mostly due to the hard work of Bradley Lucier, a numerical analyst who has also provided useful feedback on some of my work on the numbers egg, and this series of blog posts. He really knows this stuff! Most other Scheme implementations are in C and still pretty slow due to the algorithms they use, unless of course they use GMP.

In CHICKEN, there is a lot of room for optimisations. But I also think we shouldn't try to implement every algorithm under the sun. Things should generally be fast enough to serve the most common cases. Typical code doesn't use bignums, and if it does it's only small bignums (for instance, when using 64-bit integers with the FFI), which is why I think we should optimise for these cases. For example, my implementations of Karatsuba and Burnikel-Ziegler aren't great, so if anyone feels like having a stab at improving these things we already have (or simply replacing them with a better algorithm), please do!

## References

### Christian Kellermann

#### New year's todos: Making a PDF calendar with cairo

In this piece I will tell you about my latest hackish script: Creating a PDF calendar with CHICKEN Scheme's cairo binding. This includes a nice detour about date arithmentic as well as unreadable code for the actual calendar generation.

Still interested? Good! Read on, my dear.

In the past the wife has made a family calendar for our wall in A3 format. This has been done manually (in Adobe Illustrator IIRC) and involved a lot of tedious repetitive entries of all holidays and anniversaries.

The argument has been that computer generated calendars look ugly. A quick research showed that, there are hardly any free software programs out there that would come close to the original.

So I started out to write up my own. What would I need for a calendar? Let‘s count:

• A date representation, that allows one to query the weekday of a given date, and getting to know the date for the easter sunday.
• A graphic backend that can render PDFs so I can go to a copy shop to print the A3 pages.

Not too much! From a previous (unfinished) project I had some bits and pieces for representing dates in a '(year month day) list structure, even possibly representing „partial“ dates as I called them. With partial dates one would leave off the right tail(s) for uncertain dates, i.e. when you want all entries from a given month you‘d use '(year month) and so on. But I digress…

I also implemented the weekday query already (since in said project I have implemented a date-picker widget) using Gauss‘ method. Searching for a method to calculate the date for easter sunday as a lot of german holidays depend on this, I stumbled again over Mr. Gauss for having found a method to properly calculate this date. I am using the enhanced method as told by wikipedia (german only, sorry).

In scheme that beast looks like this (almost verbatim taken from wikipedia):

```(define (easter-sunday date)
(let* ((x (first date))
(k (quotient x 100))
(m (- (+ 15 (quotient (+ (* 3 k) 3) 4))
(quotient (+ (* 8 k) 13) 25)))
(s (- 2 (quotient (+ (* 3 k) 3) 4) ))
(a (modulo x 19))
(d (modulo (+ (* 19 a) m) 30))
(r (quotient
(quotient (+ d a) 11)
29))
(og (- (+ 21 d) r))
(sz (- 7 (modulo (+ x (quotient x 4) s) 7)))
(oe (- 7 (modulo (- og sz) 7)))
(os (+ og oe)))
(if (< os 32)
(list x 3 os)
(list x 4 (modulo os 31)))))```

The only adjustment I have made is that I return a proper date and not March 32nd for April 1st.

So the essential holidays can be created like this:

```;; the name is a bit misleading, it should be called 'days depending
;; on the easter date' but that's also a dumb name
(define (roman-catholic-holidays date)
(let* ((y (first date))
(es (easter-sunday date))
(christmas-day (day-of-week `(,y 12 24)))
`(,y 12 24)
(days+ `(,y 12 24) (- (add1 christmas-day))))))
`(((,y 1 6) . "Heilig. Drei Könige")
(,(days+ es -52) . "Weiberfastnacht")
(,(days+ es -48) . "Rosenmontag")
(,(days+ es -47) . "Faschingsdienstag")
(,(days+ es -46) . "Aschermittwoch")
(,(days+ es -7) . "Palmsonntag")
(,(days+ es -3) . "Gründonnerstag")
(,(days+ es -2) . "Karfreitag")
(,es . "Ostersonntag")
(,(days+ es 1) . "Ostermontag")
(,(days+ es 39) . "Christi Himmelfahrt")
(,(days+ es 49) . "Pfingstsonntag")
(,(days+ es 50) . "Pfingstmontag")
(,(days+ es 60) . "Fronleichnam")
((,y 12 24) . "Heiligabend")
((,y 12 25) . "1. Weihnachtsfeiertag")
((,y 12 26) . "2. Weihnachtsfeiertag"))))```

(Not all of these are bank holidays, but I have included them for good measure, after I got through all this trouble…)

You may note that this may contain more than one entry for a given date. The code recognises this by „querying“ with filter-map instead of alist-ref and concatenating results in a proper way.

So with all this we can think about drawing stuff. In CHICKEN Scheme I have a nice binding to the cairo library available, that does support PDF surfaces, so let‘s use this.

As for drawing primitives I need a rectangle and some nice TTF font rendering, both easily available.

As I had a nice layout to copy from, writing the resulting calendar has been a breeze.

The result looks like this:

See the full image for further details.

The font used in this example is the nice EB Garamond by Georg Duffner.

The code includes a second mechanism for anniversaries/birthdays which have been omitted in this screenshot.

Things you will have to change to make it work for yourself:

• Localisation of weekday names and months
• Check the holidays for your location, even for Germany these differ wildly from state to state
• Another layout for different paper sizes
• Adjust the global year variable
• A currently broken idea is the display of moon phases for fun. Calculating moon phases is another problem altogether I have learned…

If you are willing to do all this, then download cal.tgz (142KB) and run it with „csi -s cal.scm“ to get a „calendar-2017.pdf“.

If you have cool ideas, I am looking forward to reading about them (or patches!).

## October 18, 2016

### Peter Bex

#### CHICKEN's numeric tower: part 4

In this instalment of the blog series, we'll take a look at how string->number and number->string are implemented for bignums.

## Performing base conversions

Performing calculations is all nice and useful, but you eventually want to print the results back to the user. And of course the user needs a way to enter such large numbers into the system. So, converting between numbers and strings is an essential operation. The Scheme standard requires support for conversion between bases 2, 8, 10 and 16, but many practical implementations support conversion between arbitrary bases, and why not? It doesn't really require more effort.

### Converting strings to numbers

Let's start with the simpler of the two directions: string->number. The naive way of converting a string in base n to a number is to scan the string from left to right (high to low), adding the digit to the result and multiplying the result by n:

```/* "result" is a pre-allocated, zeroed out bignum of the right size */
while (*str != '0') /* Assuming NUL-terminated string */
{
int digit = hex_char_to_digit((int)*str++);
}```

This is very simple and elegant, but also quite slow. The Scheme48 code also checks for invalid characters in this loop, while in CHICKEN, the reader performs this check. So, when converting a digit stream to a bignum, we already know the digit stream contains only valid digits.

A simple improvement can be made to this algorithm: we can avoid traversing the entire bignum for every string digit. Instead, we collect multiple digits in a register until we fill up a halfword. Then, we scale up the bignum, adding the collected halfword:

```do {
C_word big_digit = 0;  /* Collected digit */
C_word factor = radix; /* Multiplication factor for bignum */

/* Keep collecting from str while factor fits a half-digit */
while (str < str_end && C_fitsinbignumhalfdigitp(factor)) {
str_digit = hex_char_to_digit((int)*str++);
big_digit = radix * big_digit + str_digit;
}

/* Scaling up with carry avoids traversing bignum twice */
big_digit = bignum_digits_destructive_scale_up_with_carry(
digits, last_digit, factor / radix, big_digit);

if (big_digit) { /* If there was a carry, increment size of bignum */
(*last_digit++) = big_digit;
}
} while (str < str_end);```

Remember the previous post in this series? Now you know where the design behind bignum_digits_destructive_scale_up_with_carry comes from: we can scale up the bignum with an initial "carry" value that's our collected digit. The return value is the resulting carry (if any), so we know when to increase the bignum's size by moving the last_digit pointer. This pointer makes it easier to detect the final length of the bignum, which can be hard to predict precisely. We can't predict the exact size, but we can calculate the maximum size. Because we allocate this maximum size, this moving of the end pointer is safe.

If the string's base is a power of two, we can perform an even better optimisation: We don't need to multiply or add to the bignum, we can just write straight to the bignum's digits!

```int radix_shift = C_ilen(radix) - 1;  /* Integer length (the power of two) */
C_uword big_digit = 0;       /* The current bignum digit being constructed */
int n = 0;                /* The number of bits read so far into big_digit */

/* Read from least to most significant digit.  This is much easier! */
while (str_end > str_start) {
str_digit = hex_char_to_digit((int)*--str_end);

big_digit |= (C_uword)str_digit << n;
n += radix_shift;                             /* Processed n bits so far */

if (n >= C_BIGNUM_DIGIT_LENGTH) {             /* Filled up the digit? */
n -= C_BIGNUM_DIGIT_LENGTH;                 /* Number of remainder bits */
*digits++ = big_digit;
big_digit = str_digit >> (radix_shift - n); /* Keep only the remainder */
}
}
/* If radix isn't an exact divisor of digit length, write final remainder */
if (n > 0) *digits++ = big_digit;```

From my benchmarks it looks like CHICKEN's string->number implementation is among the fastest of the popular Scheme implementations for power of two bases, due to this bit banging loop.

### Converting numbers to strings

The naive way of converting a number to a string in base n is to do the opposite of converting a string in base n to a number: we repeatedly divide by the target base and prepend this number to the string.

```char *characters = "0123456789abcdef";

/* This fills the "buf" array *back to front*, so index starts at the
* end of "buf".  counter represents # of characters written.
*/
do {
*index-- = characters[digit];

/* If we reached the current string's length, reallocate in
* increments of BIGNUM_STR_BLOCK_SIZE.
*/
if (++counter == len) {
char *newbuf = C_malloc(len + BIGNUM_STR_BLOCK_SIZE);
if (newbuf == NULL) return ERR;

C_memcpy(newbuf + BIGNUM_STR_BLOCK_SIZE, buf, len);
C_free(buf);
buf = newbuf;
index = newbuf + BIGNUM_STR_BLOCK_SIZE - 1;
len += BIGNUM_STR_BLOCK_SIZE;
} while(bignum_length(working_copy) > 0);```

This is the original version as provided by Scheme48. Again, it's very short and clean. It operates on a copy of the bignum, which it destructively scales down by dividing it by radix. The remainder digit is written to the string after conversion to a character. Many implementations use this algorithm, but it can be improved pretty easily, in basically the same way we improved the reverse operation. Instead of dividing by the radix on ever loop iteration, you can chop off a big lump. This is the remainder of dividing by a large number. Then, you divide this remainder in a loop while emitting string digits until you hit zero, then repeat until the bignum is zero.

Another improvement over the Scheme48 code is that you can pre-calculate a (pessimistic) upper bound on the number of digits, so you can avoid the reallocation (which implies a memory copy). For powers of two, this can be done precisely. For other radixes you can shorten the buffer only once at the end, and only if it turns out to be necessary.

This is really very simple, except for the finishing up part, where we shorten the buffer:

```int steps;
C_uword base;         /* This is the "lump" we cut off every time (divisor) */
C_uword *scan = start + C_bignum_size(bignum); /* Start scanning at the end */

/* Calculate the largest power of radix (string base) that fits a halfdigit.
* If radix is 10, steps = log10(2^halfdigit_bits), base = 10^steps
*/
steps++;

base /= radix; /* Back down: we always overshoot in the loop by one step */

while (scan > start) {
/* Divide by base. This chops "steps" string digits off of the bignum */
big_digit = bignum_digits_destructive_scale_down(start, scan, base);

if (*(scan-1) == 0) scan--; /* Adjust if we exhausted the highest digit */

for(i = 0; i < steps && index >= buf; ++i) {      /* Emit string digits */
C_uword tmp = big_digit / radix;
big_digit = tmp;
}
}

/* Move index onto first nonzero digit.  We're writing a bignum
here: it can't consist of only zeroes. */
while(*++index == '0');

if (negp) *--index = '-';

/* Shorten with distance between start and index. */
if (buf != index) {
i = C_header_size(string) - (index - buf);
C_memmove(buf, index, i); /* Move start of number to beginning. */
C_block_header(string) = C_STRING_TYPE | i; /* Mutate strlength. */
}```

Finally, if the radix is a power of two, we can do a straight bit-to-bit extraction like we did with string->number:

```int radix_shift = C_ilen(radix) - 1;    /* Integer length (the power of two) */

/* Again, we go from least significant to most significant digit */
while (scan < end) {
C_uword big_digit = *scan++;
int big_digit_len = C_BIGNUM_DIGIT_LENGTH;

while(big_digit_len > 0 && index >= buf) {
*index-- = characters[radix_digit]; /* Write it (as character) to string */
big_digit >>= radix_shift;                            /* Drop this digit */
}
}```

Unfortunately, there are some caveats that make this slightly trickier than you would guess. The above code is simplified, and only works for some radixes. If your radix is 2ⁿ, and your base digit size is 2ᵐ bits, then this code works if m is a multiple of n. Otherwise, you'll need to take care of overlaps. Octal numbers are the biggest problem here, because they're 3 bits per string digit and the bignum digit sizes of 32 or 64 bits don't divide cleanly by 3.

Getting this right complicates the algorithm enough to make it slightly too hairy to present here (there's some more shifting and if checks involved). If you're interested how to handle this, you can always study the CHICKEN sources.

#### Divide and conquer

Because of the many divisions, number->string is much slower than string->number. Luckily, we can speed up the former by relying once again on a recursive divide and conquer style algorithm.

This requires you to know the string's expected size in the target base, and will divide the number by half that. For example, if you wish to convert the number 12345678 to a decimal string, you can decide to split it in two. If you had a perfect guess of the string's length (which is 8), you can split the expected string in two halves by dividing the number by 10⁴, or 10000, giving us the quotient 1234 and the remainder 5678. These can recursively be converted to a string and finally appended together. Note that if you have a pessimistic upper limit of the final string length, it'll be slower, but will still produce a correct result. The code for this is quite straightforward:

```(define (integer->string/recursive n base expected-string-size)
(let*-values (((halfsize) (fxshr (fx+ expected-string-size 1) 1))
((b^M/2) (integer-power base halfsize))
((hi lo) (quotient&remainder n b^M/2))
((strhi) (number->string hi base))
((strlo) (number->string (abs lo) base)))
(string-append strhi
;; Fix up any leading zeroes that were stripped from strlo
(make-string (fx- halfsize (string-length strlo)) #\0)
strlo)))```

Because the number 120034, when split in two, generates 120 and 34, we need the make-string call to add back the leading zeroes, otherwise we would get 12034 as the output. This can be omitted if you have a more low-level number->string implementation which doesn't truncate leading zeroes.

While I was researching this, I found out about a technique called the "scaled remainder tree" algorithm. This algorithm is supposedly even faster than the simple recursive algorithm I just showed you. Unfortunately, I was unable to wrap my head around it. Maybe you will have better luck!

There doesn't seem to be that much information on how to efficiently perform base conversion, even though there are quite a few clever implementation techniques out there. If you spend the time searching, you'll be sure to find some gems.

• The y-cruncher page on radix conversion is full of ideas. Y-cruncher is a program to calculate digits of pi by efficiently using multiple CPU cores. This page has several algorithms, including recursive string splitting and scaled remainder tree. Unfortunately, the program itself is proprietary so you can't study its implementations of these algorithms.
• Modern Computer Arithmetic, which has quickly become my favourite bignum textbook, has an interesting alternative string to number technique based on iterative multiplication of the input string. I still want to try that out some day. It also hints at scaled remainder tree for number to string conversion, but doesn't explain how it works.
• Division-Free Binary-to-Decimal Conversion by Cyril Bouvier and Paul Zimmermann explains a number to string conversion based on the scaled remainder tree. Unfortunately, this paper only confused me.
• It looks like Gambit has a recursive divide-and-conquer implementation of string->number that's slightly different from ours. Their recursive number->string implementation is rather interesting too.

### Programming Praxis

#### An Impossible Interview Question

Today’s exercise is an interview question from Amazon:

You are to write a program that reads a stream of characters from the input and returns a random character from the stream. Any character should have an equal probability of being returned. You may only use a single character of storage space; in particular, you may not save multiple characters from the input stream.

Your task is to write a program that solves the Amazon interview task. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

## October 15, 2016

### Peter Bex

#### CHICKEN's numeric tower: part 3

Now that you understand the basic bignum algorithms, let's look at various tricks to speed up these operations.

## Faster multiplication

Like I mentioned in the previous part of this series, the primary school method for addition and subtraction is the fastest known, for bignums of any size. And you can't really get better than O(n). However, primary school multiplication is O(n²), and there are several better algorithms than that.

### Multiplication by a fixnum

As you now know, multiplication is done by looping over the half-digits of the two bignum arguments in a nested loop. Nested loops are something to be avoided as much as possible, because this means you're looking at a quadratic time algorithm, in other words it will perform O(n²) operations.

When multiplying a bignum by a fixnum, the naive implementation is to "promote" the fixnum into a bignum, and then perform a standard bignum multiplication. However, if the fixnum fits in a half-digit, you can avoid the nested loop. Complexity-wise, this isn't a great improvement, as the outer loop is only run once anyway. But, because the small value fits in a machine word, you only read from one bignum address instead of two on every iteration of the inner loop. Now that makes a big difference! In Scheme48, this improved algorithm looks like this:

```static void
bignum_destructive_scale_up(bignum_type bignum, bignum_digit_type factor)
{
bignum_digit_type carry = 0;
bignum_digit_type * scan = (BIGNUM_START_PTR (bignum));
bignum_digit_type two_digits;
bignum_digit_type product_low;
#define product_high carry
bignum_digit_type * end = (scan + (BIGNUM_LENGTH (bignum)));
BIGNUM_ASSERT ((factor > 1) && (factor < BIGNUM_RADIX_ROOT));
while (scan < end)
{
two_digits = (*scan);
product_low = ((factor * (HD_LOW (two_digits))) + (HD_LOW (carry)));
product_high =
((factor * (HD_HIGH (two_digits))) +
(HD_HIGH (product_low)) +
(HD_HIGH (carry)));
(*scan++) = (HD_CONS ((HD_LOW (product_high)), (HD_LOW (product_low))));
carry = (HD_HIGH (product_high));
}
/* A carry here would be an overflow, i.e. it would not fit.
Hopefully the callers allocate enough space that this will
never happen.
*/
BIGNUM_ASSERT (carry == 0);
return;
#undef product_high
}```

The current version of this algorithm in CHICKEN is a bit shorter and slightly more versatile because it returns the carry if the result doesn't fit:

```static C_uword bignum_digits_destructive_scale_up_with_carry(
C_uword *start, C_uword *end, C_uword factor, C_uword carry)
{
C_uword digit, p;

assert(C_fitsinbignumhalfdigitp(carry));
assert(C_fitsinbignumhalfdigitp(factor));

while (start < end) {
digit = (*start);

p = factor * C_BIGNUM_DIGIT_LO_HALF(digit) + carry;
carry = C_BIGNUM_DIGIT_LO_HALF(p);

p = factor * C_BIGNUM_DIGIT_HI_HALF(digit) + C_BIGNUM_DIGIT_HI_HALF(p);
(*start++) = C_BIGNUM_DIGIT_COMBINE(C_BIGNUM_DIGIT_LO_HALF(p), carry);
carry = C_BIGNUM_DIGIT_HI_HALF(p);
}
return carry;
}```

This is a destructive operation, which means it doesn't operate on an "empty" target bignum. Instead, you copy the original bignum, which is then mutated in-place. That makes it faster because you're only reading and writing to a single digit array, so it's much more localised in memory. In the next post, we'll see why returning the carry can be helpful.

Strictly speaking, you could do the same for addition and subtraction: if one of the arguments is a bignum and the other a fixnum, you could destructively add or subtract it. In fact, Scheme48 does this, but the CHICKEN implementation does not. If you'll recall, the CHICKEN bignum implementation already copies the larger of the two numbers into a new bignum and modifies it in-place while adding the smaller number. The effect of this is almost the same as Scheme48 does, while also improving the default case of adding two bignums.

There's a second optimisation that can be done, which Scheme48 does not do at all. If the factor by which we're multiplying is a power of two, you can simply shift the result by the 2log of the factor! The code for detecting this is pretty ugly, so I'm not going to show it. Just remember that this makes a huge difference in practice.

### Karatsuba's multiplication

Multiplying two bignums can be done faster, too. There are two important algorithms: Karatsuba multiplication and FFT-based multiplications like Schönhage-Strassen. I must confess that my understanding of the FFT is too weak to implement, let alone explain the Schönhage-Strassen algorithm, so if you want to read about that, check the reading list at the end of this post.

Luckily, the Karatsuba algorithm is easy to explain. It was named after the Russian mathematician Anatoly Karatsuba, who was the first to reject the idea that O(n²) is the best we can do. His approach is based on divide and conquer. It uses a simple algebraic trick to reduce work on each step.

Let's say we want to multiply two bignums of two limbs, x[1,2] and y[1,2] in base B. The result is of course xy. Rephrased as an equation:

` xy = (x₁·B + x₂) · (y₁·B + y₂)`

This can be written out as:

` xy = (x₁·y₁)·B² + (x₁·y₂ + x₂·y₁)·B + x₂·y₂`

If we call these three components a, b and c, then xy = a·B² + b·B + c.

Now, you can calculate all three components separately, but this requires exactly as many steps as the "school book" algorithm we already have, namely O(n²). However, the crucial insight that Karatsuba had is that you can derive b from a and c with a simpler multiplication. He actually did this by first making it more complex. This shows the man's genius: complicating things to simplify them isn't exactly intuitive! So we have a, b and c:

```a = x₁·y₁
b = x₁·y₂ + x₂·y₁
c = x₂·y₂
```

We first complicate b by adding (a-a) and (c-c), then expand:

```b = x₁·y₂ + x₂·y₁ + (a - a) + (c - c)
b = x₁·y₂ + x₂·y₁ + ((x₁·y₁) - (x₁·y₁)) + (x₂·y₂) - (x₂·y₂)
```

Next, we remove the parentheses and reorder by moving one a and one c to the left, and finally we can re-group the moved variables:

```b = x₁·y₂ + x₂·y₁ + x₁·y₁ - x₁·y₁ + x₂·y₂ - x₂·y₂
b = x₁·y₁ + x₂·y₂ + x₂·y₁ + x₁·y₂ - x₁·y₁ - x₂·y₂
b = (x₁ + x₂)·(y₁ + y₂) - x₁·y₁ - x₂·y₂
```

And ta-da, as if by magic, we have:

```b = (x₁ + x₂)·(y₁ + y₂) - a - c
```

If you compare this with the original definition of b, you'll notice that the new definition has one multiplication instead of two. The multiplication factors are slightly larger than in the original formula, but because there's only one, we end up doing less work. Of course, we'll still have to do all those additions and subtractions, but with big enough numbers, it's faster than multiplying twice. That's because the complexity of this algorithm has dropped from O(n²) to something less. The exact complexity turns out to be O(n^log3).

Now, this is the asymptotic complexity. You can easily tell from the new formula of b that it uses several more operations and it requires splitting up x and y in two parts, which requires allocating memory for them and copying out the parts. This is a lot of overhead, which means that Karatsuba's algorithm only becomes more efficient than the "school book" algorithm for (very) large numbers. The same is true for the FFT-based algorithms mentioned earlier, those improve performance only for even larger numbers, because they split up the numbers into even more pieces. This is a (better) reason why it's probably not worth adding the extra code and complexity of an FFT implementation to CHICKEN core.

In code, a naive implementation of Karatsuba's algorithm is quite simple as well:

```;; Where this is called, we ensure that |x| <= |y|
(define (karatsuba-multiply x y)
(let* ((rs (fx* (signum x) (signum y)))     ; Result's sign (-1 or +1)
(x (abs x))
(n (bignum-digit-count y))
(n/2 (fxshr n 1))                    ; (floor (/ n 2))
(bits (fx* n/2 *bignum-digit-bits*))
(x-hi (extract-digits x n/2 #f))     ; x[n..n/2]
(x-lo (extract-digits x 0 n/2)))     ; x[n/2..0]
(let* ((y (abs y))
(y-hi (extract-digits y n/2 #f))   ; y[n..n/2]
(y-lo (extract-digits y 0 n/2))    ; y[n/2..0]
(a  (* x-hi y-hi))
(b  (* x-lo y-lo))
(c  (* (- x-hi x-lo)
(- y-hi y-lo))))
(* rs (+ (arithmetic-shift a (fx* bits 2))
(+ (arithmetic-shift (+ b (- a c)) bits)
b)))))))```

Especially helpful here is that the operators prefixed by fx indicate clearly when we're working with fixnums. This uses absolute numbers because that's simpler to deal with, especially when using extract-digits to quickly get a range of bignum digits. It might be possible to make this faster by operating directly on the signed numbers.

In CHICKEN 5, this algorithm has been translated to C for stupid reasons which I will explain in a later post. But this Scheme code is taken directly from the numbers egg and cleaned up only slightly.

According to MCA, an optimal implementation should avoid allocating the intermediate calculations of (- x-hi x-lo) and (- y-hi y-lo). In a low level C-based implementation like CHICKEN 5's, it might be easier to perform in-place modification of these numbers, but so far I haven't been successful in doing this. Nevertheless, our Karatsuba implementation is efficient enough for now.

Sometimes, the Karatsuba algorithm is referred to as the Toom-Cook algorithm. That's because Toom and Cook figured out a way to generalise the algorithm. This way, you can split the numbers into any number of components, instead of two components like Karatsuba did. Apparently there's a sweet spot in number sizes where a 3-way split is faster than a 2-way split, but pretty soon after that, the numbers get big enough and the FFT-based algorithms overtake Toom-Cook in efficiency.

## Faster division

Faster multiplication is interesting, but division is the real tortoise in this race, so let's see how we can speed that up. It turns out that the approaches are rather similar to those of multiplication.

### Division by a fixnum

Recall that the division algorithm needs to "guess" how many times the denominator fits in the numerator based on the first half-digit (plus some magic surrounding the second half-digit). If the denominator is itself is only a half-digit, there's no need to guess.

So, just like when multiplying by a small fixnum, we have a destructive division algorithm that operates on a copy of the numerator. The Scheme48 version I started with:

```/* Given (denominator > 1), it is fairly easy to show that
(quotient_high < BIGNUM_RADIX_ROOT), after which it is easy to see
that all digits are < BIGNUM_RADIX. */

static bignum_digit_type
bignum_destructive_scale_down(bignum_type bignum, bignum_digit_type denominator)
{
bignum_digit_type numerator;
bignum_digit_type remainder = 0;
bignum_digit_type two_digits;
#define quotient_high remainder
bignum_digit_type * start = (BIGNUM_START_PTR (bignum));
bignum_digit_type * scan = (start + (BIGNUM_LENGTH (bignum)));
BIGNUM_ASSERT ((denominator > 1) && (denominator < BIGNUM_RADIX_ROOT));
while (start < scan)
{
two_digits = (*--scan);
numerator = (HD_CONS (remainder, (HD_HIGH (two_digits))));
quotient_high = (numerator / denominator);
numerator = (HD_CONS ((numerator % denominator), (HD_LOW (two_digits))));
(*scan) = (HD_CONS (quotient_high, (numerator / denominator)));
remainder = (numerator % denominator);
}
return (remainder);
#undef quotient_high
}```

And the smaller version based, again, on the division algorithm from Hacker's Delight:

```static C_uword bignum_digits_destructive_scale_down(
C_uword *start, C_uword *end, C_uword denominator)
{
C_uword digit, k = 0;
C_uhword q_j_hi, q_j_lo;

/* Single digit divisor case from Hacker's Delight, Figure 9-1,
*/
while (start < end) {
digit = (*--end);

k = C_BIGNUM_DIGIT_COMBINE(k, C_BIGNUM_DIGIT_HI_HALF(digit)); /* j */
q_j_hi = k / denominator;
k -= q_j_hi * denominator;

k = C_BIGNUM_DIGIT_COMBINE(k, C_BIGNUM_DIGIT_LO_HALF(digit)); /* j-1 */
q_j_lo = k / denominator;
k -= q_j_lo * denominator;

*end = C_BIGNUM_DIGIT_COMBINE(q_j_hi, q_j_lo);
}
return k; /* The remainder */
}```

And, just like with multiplication, if you're dividing by a power of two, there's an easy optimisation: you can simply shift the numerator right by as many bits as the 2log of the denominator. The remainder is formed by the bits that were shifted out.

### Burnikel-Ziegler division

And here too, there's a divide and conquer algorithm named after the mathematician (or, in this case, mathematicians) who discovered it. This algorithm is a lot more complicated than Karatsuba's relatively simple algebraic trick, and a lot harder to implement correctly. The paper is long and still not very helpful when it comes down to the crucial details. I found the super-elegant presentation in MCA to be more helpful in figuring out details. Especially the algorithm's start-up procedure is tricky to get right. I will use the explanation style from MCA because it is simpler than the original paper.

The algorithm is similar to the classical "school book" division algorithm, but "in the large". The basic idea is that we operate on partial bignums at a time instead of on half-digits. The core algorithm handles only 2n/n division. This means that the numerator must be twice the size of the denominator. It splits the numerator in two halves. Each half is then divided by the entire denominator and finally recombined to form the result. These two divisions are themselves also done in two steps, thereby making the numbers smaller in the recursion.

Because the intermediate division only divides by the first half of the denominator, the result may end up negative. So, like the schoolbook method, this algorithm also makes an adjustment when re-joining the two intermediate results. The core algorithm is as follows:

• If the denominator is smaller than some limit, fall back to "primary school" algorithm, otherwise:
• Split the denominator B in two: B₁·β + B₂. So, if B is a bignum of n limbs, the base β is half that.
• Next, split the numerator A into four such parts: A₁·β³ + A₂·β² + A₃·β + A₄.
• First half:
• Divide A₁·β + A₂ by B₁, yielding the guessed quotient Q̂₁ and remainder R₁ (the recursive step).
• Combine the remainder R₁ with A₃ and subtract Q̂·B₂, yielding R̂₁ = R₁·β + A₃ - Q̂·B₂.
• While R̂₁ < Q̂₁·B₂, adjust the guess; Q̂₁:=Q̂₁-1 and R̂₁:=R̂₁+B.
• Second half:
• Divide R̂₁ by B₁, yielding the guessed quotient Q̂₂ and remainder R (another recursive step).
• Combine the remainder R with A₄ and subtract Q̂·B₂, yielding R̂ = R + A₄ - Q̂·B₂.
• While R̂ < Q̂₂·B₂, adjust the guess; Q̂₂:=Q̂₂-1 and R̂:=R̂+B.
• Recombination of quotient after division:
• The final quotient is Q̂₁·β + Q̂₂, the final remainder is just .

The interesting part is that B₂ is only ever used for checking the guess. It is not involved in any division. Of course, in the recursion B₂ is also split in two parts, so the high half will be used in the next division, and so on.

In the diagram you can see how it works on a (very) small sample:

In the diagram, B₁ = 31 is shown in white on the left in the first and fourth rows. B₂ = 21 is shown in green on the left in the second and final rows. In the first row you also see highlighted in white A₁·β + A₂ = 3456 and R₁ = 15. In the second row, A₃ = 78 is shown in white, as it drops down to form R̂₁ = R₁·β + A₃ = 1578 with Q̂₁ = 111.

Between the second and third rows, Q̂₁ = 110 is adjusted to 111 and R̂₁ = -753 is adjusted to 2368 by adding the numerator.

In the third row we continue with the second half, dividing R̂₁ by B₁ in the same manner and then recombining R = 43 with A₄ = 67 and subtracting Q̂·B₂ = 1575. As no more adjustments are needed, we're done, with R̂ = 2792 and Q̂₂ = 75. Combine Q̂₁,Q̂₂ into Q = 11075 and we're done!

Burnikel and Ziegler present the algorithm in their paper in a bit of a roundabout way that didn't make sense to me at first. It requires understanding the big picture, which they don't really explain up front. So it's best to read the paper entirely, and then go back and re-read it to grasp the details. It's a bit bottom-up in a sense, because they refactor it into two algorithms; one for dividing numbers 2n/n, and one for dividing numbers 3n/2n. This confused me no end, as it resulted in a bit of a cyclic definition.

In the explanation I gave above, 2n/n is the overall algorithm as a whole. The first and second "halves" of the algorithm are really identical, and represented by Burnikel and Ziegler as two calls to the 3n/2n algorithm. This can be seen in the Scheme code below:

```(define (digit-bits n) (fx* *bignum-digit-bits* n))    ; Small helper

;; Here and in 2n/1n we pass both b and [b1, b2] to avoid splitting
;; up the number more than once.  This is a helper function for 2n/n.
(define (burnikel-ziegler-3n/2n a12 a3 b b1 b2 n)
(if (< (arithmetic-shift a12 (fxneg (digit-bits n))) b1)
(let* ((n/2 (fxshr n 1))                     ; (floor (/ n 2))
(b11 (extract-digits b1 n/2 #f))      ; b1[n..n/2]
(b12 (extract-digits b1 0 n/2)))      ; b1[n/2..0]
(burnikel-ziegler-2n/1n a12 b1 b11 b12 n))
;; Don't bother dividing if a1 is a larger number than b1.
;; We use a maximum guess instead (see paper for proof).
(let ((base*n (digit-bits n)))
(values (- (arithmetic-shift 1 base*n) 1)  ; B^n-1
(+ (- a12 (arithmetic-shift b1 base*n)) b1))))
(let ((r1a3 (+ (arithmetic-shift r1 (digit-bits n)) a3)))
(let lp ((r^ (- r1a3 (* q^ b2)))
(q^ q^))
(if (negative? r^)
(lp (+ r^ b) (- q^ 1))                     ; Adjust!
(values q^ r^))))))

;; The main 2n/n algorithm which calls 3n/2n twice.  Here, a is the
;; numerator, b the denominator, n is the length of b (in digits) and
;; b1 and b2 are the two halves of b (these never change).
(define (burnikel-ziegler-2n/1n a b b1 b2 n)
(if (or (fxodd? n) (fx< n DIV-LIMIT))                ; Can't recurse?
(quotient&remainder a b)                         ; Use school division
(let* ((n/2 (fxshr n 1))
;; Split a and b into n-sized parts [a1, ..., a4] and [b1, b2]
(a12 (extract-digits a n #f))             ; a[m..n]
(a3  (extract-digits a n/2 n))            ; a[n..n/2]
(a4  (extract-digits a 0 n/2)))           ; a[n..0]
;; Calculate high quotient and intermediate remainder (first half)
(receive (q1 r1) (burnikel-ziegler-3n/2n a12 a3 b b1 b2 n/2)
;; Calculate low quotient and final remainder (second half)
(receive (q2 r) (burnikel-ziegler-3n/2n r1 a4 b b1 b2 n/2)
;; Recombine quotient parts as q = [q1,q2]
(values (+ (arithmetic-shift q1 (digit-bits n/2)) q2) r))))))```

The reason b1, b2 are passed in but not a1, ..., a4 has to do with the "full" algorithm, which deals with unbalanced division where a may be bigger than 2n, given b of size n. There, b is constant, so it pays off to "cache" b1 and b2. Because a keeps changing, we don't cache it.

This full algorithm for dividing two numbers x and y of arbitrary lengths is as follows: If the denominator y is of size n, we can simply chop up the numerator x into n-sized pieces. We then perform a division algorithm on those pieces, using a sort of "sliding window" over x. This passes [x_{i+1},x_i] and y to 2n/n, and recombines the remainder r_i with x_{i-1} to get [r_i,x_{i-1}], which is used for 2n/n in the next iteration, and so on.

Well, in theory it's simple...

```(define (quotient&remainder/burnikel-ziegler x y return-quot? return-rem?)
;; The caller will already have verified that abs(x) > abs(y), but we
;; need to remember the signs of the input and make them absolute.
(let* ((q-neg? (if (negative? y) (not (negative? x)) (negative? x)))
(r-neg? (negative? x))
(x (abs x))
(y (abs y))
(s (bignum-digit-count y))
;; Define m as min{2^k|(2^k)*DIV_LIMIT > s}.
;; This ensures we shift as little as possible (less pressure
;; on the GC) while maintaining a power of two until we drop
;; below the threshold, so we can always split N in half.
(m (fx* 2 (integer-length (fx/ s DIV-LIMIT))))
(j (fx/ (fx+ s (fx- m 1)) m))  ; j = s/m, rounded up
(n (fx* j m))
;; Normalisation, just like with normal school division
(norm-shift (fx- (digit-bits n) (integer-length y)))
(x (arithmetic-shift x norm-shift))
(y (arithmetic-shift y norm-shift))
;; l needs to be the smallest value so that a < base^{l*n}/2
(l (fx/ (fx+ (bignum-digit-count x) n) n))
(l (if (fx= (digit-bits l) (integer-length x)) (fx+ l 1) l))
(t (fxmax l 2))
(y-hi (extract-digits y (fxshr n 1) #f))   ; y[n..n/2]
(y-lo (extract-digits y 0 (fxshr n 1))))   ; y[n/2..0]
(let lp ((zi (arithmetic-shift x (fxneg (digit-bits (fx* (fx- t 2) n)))))
(i (fx- t 2))
(quot 0))
(receive (qi ri) (burnikel-ziegler-2n/1n zi y y-hi y-lo n)
(let ((quot (+ (arithmetic-shift quot (digit-bits n)) qi)))
(if (fx> i 0)
(let ((zi-1 (let* ((base*n*i-1 (fx* n (fx- i 1)))
(base*n*i   (fx* n i))
;; x_{i-1} = x[n*i..n*(i-1)]
(xi-1 (extract-digits x base*n*i-1 base*n*i)))
(+ (arithmetic-shift ri (digit-bits n)) xi-1))))
(lp zi-1 (fx- i 1) quot))
;; De-normalise remainder, but only if necessary
(let ((rem (if (or (not return-rem?) (eq? 0 norm-shift))
ri
(arithmetic-shift ri (fxneg norm-shift)))))
;; Return values (quot, rem or both) with correct sign:
(cond ((and return-quot? return-rem?)
(values (if q-neg? (- quot) quot)
(if r-neg? (- rem) rem)))
(return-quot? (if q-neg? (- quot) quot))
(else (if r-neg? (- rem) rem))))))))))```

As you can see, this procedure is extremely hairy. The trickery is in how the bignums are chopped up into n-sized pieces. The sizes we use should be nice powers of two, which improves the algorithm's effectiveness. Notice the (or (fxodd? n) (fx< n DIV-LIMIT)) check in 2n/1n; whenever that is true, we fall back to the school division. This has to be avoided as much as possible, so that's why we try to scale up the number x to nicely rounded multiples of n. At the same time, you have to make sure that the numbers don't grow too large, because that would create more work for the algorithm!

The particular calculation is tricky, but the idea is simple: you want to scale up both numbers to the closest power of two that's larger than the cutoff size. Then, the numerator is scaled up so that it is a size that's a multiple of n, the final size of the denominator. No doubt my implementation of this part of the algorithm can be simplified.

First, start with the reading list of the previous post, because most of those references discuss advanced algorithms as well. The ones below are either more specific or more advanced than the descriptions you'll find in the standard references.

## October 14, 2016

### Programming Praxis

#### Three More List Manipulation Exercises

We have three more list-manipulation exercises today, for those students midway through their beginning programming course who need practice with linked lists:

1. Define a function that takes two lists, the second of which is a list of non-negative integers the same length as the first list, and returns a list of elements from the first list, in reverse order, each repeated a number of times as specified by the corresponding element of the second list.
2. Rearrange the integers in a list so that all the odd numbers appear before all the even numbers, with both odd and even numbers in the same relative order in the output as they were in the input.
3. Write a function that returns the nth item from the end of a list.

Your task is to write the three specified programs. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

## October 13, 2016

### Peter Bex

#### CHICKEN's numeric tower: part 2

This is the second part documenting my journey to add full numeric tower support to CHICKEN core. In this post I explain some of the basic algorithms. You'll need to understand these before going on to the next part, which deals with fancier versions of these algorithms.

## Classical numerical algorithms

Like I mentioned in my previous post, the Scheme48 numerical code used only the so-called "classical" algorithms. Comments in the Scheme48 code refer to Donald Knuth's seminal work, The Art of Computer Programming, Volume 2, chapter 4. Interestingly, after these classical algorithms, Knuth explains a few faster algorithms, but Scheme48 did not implement these.

Addition and subtraction are extremely simple algorithms: you simply loop over the limbs of both numbers simultaneously, and add them together, taking care to propagate the carry or borrow from the previous position. This is the same algorithm you learned in primary school. The difference is that the computer can add a whole machine word, while at school you would handle one decimal position at a time. This is O(n) in complexity:

For subtraction the algorithm is the same, except it uses borrowing instead of carrying. You might wonder what happens if the value being subtracted is bigger than the one being subtracted from. If those numbers are both positive, that results in a negative number, but when subtracting a negative number from a smaller positive number, its result would be positive.

The solution is simple in case you're using unsigned representation with explicit sign: You compare the absolute values first. If the second value is larger than the first, you swap them first. Then you subtract them and toggle the sign of the result: If a - b = x, then multiplying all factors by -1 gives: -a + b = -x, or simply b - a = -x.

As far as I'm aware, the primary school algorithm is it. There are no shortcuts, and no quicker ways around it. However, Scheme48 used a surprising representation for their bignums: the limbs inside the bignum did not make use of the top two bits in the machine word. Presumably they did this for portability and correctness. You see, in C, signed overflow is undefined, just so compilers can eke out a little more performance. I think this is completely ridiculous, and it's another source of security issues, but that's what life with C is like.

However, CHICKEN uses the -fwrapv compiler option to enforce sane overflow behaviour. That means CHICKEN bignums are free to use all available bits in a machine word. This representation will also use slightly less memory for really large bignums, especially on 32-bit systems. But, more importantly, it's faster because there's less masking and checking going on. Here's the heart of Scheme48's bignum addition:

```while (scan_y < end_y)
{
sum = ((*scan_x++) + (*scan_y++) + carry);
if (sum < BIGNUM_RADIX)     /* No overflow */
{
(*scan_r++) = sum;
carry = 0;
}
else                        /* Overflow, adjust and set carry */
{
carry = 1;
}
}```

And here is CHICKEN's:

```while (scan_y < end_y) {
digit = *scan_r;
if (carry) {
sum = digit + *scan_y++ + 1;
carry = sum <= digit; /* Overflow if wrapped result is smaller or equal */
} else {
sum = digit + *scan_y++;
carry = sum < digit;  /* Overflow if wrapped result is smaller */
}
(*scan_r++) = sum;
}```

Aside from the difference in coding style, you can see that Scheme48 needs to adjust the result if we got a carry. The BIGNUM_RADIX is the maximum bignum digit value plus one. In terms of instructions, this masking and checking doesn't make that much of a difference, surprisingly enough.

But while tweaking this algorithm, I discovered that a nice performance improvement could be gained: First, copy the larger bignum to the result bignum, and then you loop over the second bignum, adding its limbs to the result's limbs, modifying it in-place. I suppose this is faster because you're only manipulating two pointers at a time rather than three. This is why scan_x is not used in the CHICKEN version. This requires memcpy to be fast, so on some systems, the CHICKEN approach can potentially be slower than the Scheme48 one.

### Multiplication

Multiplication is where things start to get more interesting. The classical algorithm is still pretty basic, but much slower because it's O(n²) in complexity. This is because in this algorithm, we multiply each position in the first number by every position in the other number, in a nested loop:

As the graphic attempts to clarify, we take only half-digits when multiplying, because the result must fit a single digit. This slows things down even further, because we can only iterate over the limbs at half speed. In Scheme48's code, it looked like this:

```#define x_digit x_digit_high
#define y_digit y_digit_high
#define product_high carry
while (scan_x < end_x)
{
x_digit = (*scan_x++);
x_digit_low = (HD_LOW (x_digit));
x_digit_high = (HD_HIGH (x_digit));
carry = 0;
scan_y = start_y;
scan_r = (start_r++);
while (scan_y < end_y)
{
y_digit = (*scan_y++);
y_digit_low = (HD_LOW (y_digit));
y_digit_high = (HD_HIGH (y_digit));
product_low =
((*scan_r) +
(x_digit_low * y_digit_low) +
(HD_LOW (carry)));
product_high =
((x_digit_high * y_digit_low) +
(x_digit_low * y_digit_high) +
(HD_HIGH (product_low)) +
(HD_HIGH (carry)));
(*scan_r++) =
(HD_CONS ((HD_LOW (product_high)), (HD_LOW (product_low))));
carry =
((x_digit_high * y_digit_high) +
(HD_HIGH (product_high)));
}
(*scan_r) += carry;
}```

The #define statements at the start are rather interesting, and seem to have been meticulously chosen to maximise re-use of variables. This was probably done to cajole inefficient compilers into re-using registers. Some of the bignum code is originally from 1986, when C compilers weren't very sophisticated! The HD_CONS macro combines two halfwords together, while the HD_LOW and HD_HIGH extract the low and high halfword from a machine word, respectively:

```#define HD_LOW(digit) ((digit) & BIGNUM_HALF_DIGIT_MASK)
#define HD_HIGH(digit) ((digit) >> BIGNUM_HALF_DIGIT_LENGTH)
#define HD_CONS(high, low) (((high) << BIGNUM_HALF_DIGIT_LENGTH) | (low))```

Remember, Scheme48 bignum digits use only 30 bits on a 32-bit machine and 62 bits on a 64-bit machine, so the masking and shifting is required. Because CHICKEN bignum digits now used the full machine word, I was able to replace it with another, much shorter implementation, which relies on "automatic" truncation of machine words:

```/* From Hacker's Delight, Figure 8-1 (top part) */
for (j = 0; j < length_y; ++j) {
yj = C_uhword_ref(yd, j);
if (yj == 0) continue;
carry = 0;
for (i = 0; i < length_x; ++i) {
product = (C_uword)C_uhword_ref(xd, i) * yj +
(C_uword)C_uhword_ref(rd, i + j) + carry;
C_uhword_set(rd, i + j, product);
carry = C_BIGNUM_DIGIT_HI_HALF(product);
}
C_uhword_set(rd, j + length_x, carry);
}```

As the comment says, this code is adapted from the fantastic booklet "Hacker's Delight" by Henry S. Warren, so any elegance you see in this code is not due to me! The original code is even more elegant, but it assumes little-endian order of bignum digits and the halfwords within these digits. On big endian machines the halfwords will be swapped within their machine words, so I introduced C_uhword_ref and C_uhword_set, which are ugly macros to select the higher/lower halfword of the relevant machine word:

```/* The bignum digit representation is fullword- little endian, so on
* LE machines the halfdigits are numbered in the same order.  On BE
* machines, we must swap the odd and even positions.
*/
#ifdef C_BIG_ENDIAN
#define C_uhword_ref(x, p)           ((C_uhword *)(x))[(p)^1]
#else
#define C_uhword_ref(x, p)           ((C_uhword *)(x))[(p)]
#endif
#define C_uhword_set(x, p, d)        (C_uhword_ref(x,p) = (d))```

The (C_uhword *) casts here ensure that only a halfword is extracted. Most machines have an instruction to fetch a halfword into a register, which is much faster than masking it out. So, even if it's ugly and hacky, I vastly prefer this over the Scheme48 code.

### Division

Oh boy, where to start? The above algorithms are so simple, but division, now that's quite a can of worms. To make things worse, many textbooks (including Knuth) gloss over important details, assuming that readers can figure it out on their own.

The first problem is that, unlike the above algorithms, the traditional pen and paper-algorithm doesn't translate well to the computer. Let's look at an example division, performed by hand as you would have learned it in primary school. Here, we divide 543456 (the dividend or numerator) by 344 (the divisor or denominator):

The notation might differ slightly from what you're used to (different schools use different notations, apparently), but the algorithm should be familiar: Given a denominator of n digits, you take n+1 digits from the numerator (but n in the first step!), then divide them by the denominator. You write the quotient on the right. Then you subtract the remainder from the digits you took from the numerator, and you continue with the next digit, until you hit the last digit of the numerator. The final subtraction gives you the remainder at the bottom, and the digits you wrote on the right together form the quotient.

There is a problem with this "algorithm", though: it requires you to divide each numerator part by the entire denominator. If the denominator is a bignum, you're still dividing one bignum by another! Using this algorithm recursively won't work either, because it doesn't reduce the denominator's size.

However, it turns out that you can guess the results of these intermediate divisions, based on the first few digits of both numbers. Intuitively, you can get a pretty good guess of how many times a number fits in another by doing a trial division of their leading digits.

For example, a number like 3xx can fit about 2x times in a number like 7xxx. In other words, our guess is 7/3 = 2. For example, the number 300 will fit 23 times in 7000. This guess isn't completely accurate: for example, the number 399 will fit only 17 times in 7000. Note that the leading digit is now a 1 instead of a 2, which means our guess was bad. So in some cases we need to correct the guess. Note that a guess may never exceed 9, because we're calculating one decimal position of the quotient. All this leads to the following relatively simple algorithm:

• Make a guess based on trial division of the leading digits as described above;
• Multiply the denominator by the guess, to get a result;
• Subtract this result from the numerator, but:
• If the subtraction goes below zero, add back the denominator and adjust the guess.

This algorithm would work, but it takes many iterations. It can be improved by taking into account two leading digits of the denominator, instead of one. This improves the accuracy of the guess, and it can be done easily if we only use halfdigits in our calculation (which we'll have to do anyway to avoid overflow when multiplying). In the picture below, for simplicity and brevity, each digit represents one halfword.

The picture above is pretty complicated! I hope it clarifies the algorithm a little. The picture clearly shows two places where this algorithm guessed wrong, in which case we need to adjust some values (shown in red).

To understand the algorithm, first note the highlighted quotient digits with a question mark below them. These indicate that the quotient digit is a guess.

We tentatively multiply this guess by the first halfdigit of the denominator, and subtract it from the current remainder, giving a result in green. Then, we append the next digit from the denominator (in blue) to the result we just got. Finally, we multiply the next digit from the numerator (yellow) by its first digit, and see if the number is less than the combined intermediate remainder. This means the guess was correct; otherwise the guess is incorrect, because the remainder would be negative.

If the guess was wrong, we need to adjust the guess by subtracting one and performing the check again until the guess is correct. You can see this happening near the bottom of the first column in the above picture.

Once we have a correct guess based on the first two halfdigits, we go ahead and calculate the remainder. To do this, we multiply the full n digits of the denominator by the guess, and subtract the first n digits of the remaining numerator. All this can be done simultaneously, in O(n), even!

Unfortunately, after having calculated the remainder, it can turn out negative. This means the original guess was bad after all! In this case we must make a last-minute adjustment, by subtracting one from the quotient, and then adding the denominator to the remainder. This is shown in the picture in the first two steps of the second column.

The actual implementation of this horribly complicated algorithm in Scheme48 was also very complex and extremely long (it's all of the stuff between lines 1045 and 1383). So, instead of attempting to understand and rework this to be faster and more consistent with CHICKEN core, once again I opted to steal an implementation from Hacker's Delight. It looks like this:

```static C_regparm void
bignum_destructive_divide_normalized(C_word big_u, C_word big_v, C_word big_q)
{
C_uword *v = C_bignum_digits(big_v),
*u = C_bignum_digits(big_u),
*q = big_q == C_SCHEME_UNDEFINED ? NULL : C_bignum_digits(big_q),
p,               /* product of estimated quotient & "denominator" */
hat, qhat, rhat, /* estimated quotient and remainder digit */
vn_1, vn_2;      /* "cached" values v[n-1], v[n-2] */
C_word t, k;              /* Two helpers: temp/final remainder and "borrow" */
/* We use plain ints here, which theoretically may not be enough on
* 64-bit for an insanely huge number, but it is a _lot_ faster.
*/
int n = C_bignum_size(big_v) * 2,       /* in halfwords */
m = (C_bignum_size(big_u) * 2) - 2; /* Correct for extra digit */
int i, j;		                  /* Just two loop variables */

/* Part 2 of Gauche's aforementioned trick: */
if (C_uhword_ref(v, n-1) == 0) n--;

/* These won't change during the loop, but are used in every step. */
vn_1 = C_uhword_ref(v, n-1);
vn_2 = C_uhword_ref(v, n-2);

/* See also Hacker's Delight, Figure 9-1.  This is almost exactly that. */
for (j = m - n; j >= 0; j--) {
/* First, determine the initial guess: */
hat = C_BIGNUM_DIGIT_COMBINE(C_uhword_ref(u, j+n), C_uhword_ref(u, j+n-1));
if (hat == 0) {
if (q != NULL) C_uhword_set(q, j, 0);
continue;
}
qhat = hat / vn_1;
rhat = hat % vn_1;

/* Next, keep making early adjustments to the guess
* until it matches the first two digits:
*/

/* Two whiles is faster than one big check with an OR.  Thanks, Gauche! */
while(qhat >= (1UL << C_BIGNUM_HALF_DIGIT_LENGTH)) { qhat--; rhat += vn_1; }
while(qhat * vn_2 > C_BIGNUM_DIGIT_COMBINE(rhat, C_uhword_ref(u, j+n-2))
&& rhat < (1UL << C_BIGNUM_HALF_DIGIT_LENGTH)) {
qhat--;
rhat += vn_1;
}

/* Finally, multiply and subtract: */
k = 0;
for (i = 0; i < n; i++) {
p = qhat * C_uhword_ref(v, i);
t = C_uhword_ref(u, i+j) - k - C_BIGNUM_DIGIT_LO_HALF(p);
C_uhword_set(u, i+j, t);
k = C_BIGNUM_DIGIT_HI_HALF(p) - (t >> C_BIGNUM_HALF_DIGIT_LENGTH);
}
t = C_uhword_ref(u,j+n) - k;
C_uhword_set(u, j+n, t);

/* Subtracted too much?
*/
if (t < 0) {
qhat--;
k = 0;
for (i = 0; i < n; i++) {
t = (C_uword)C_uhword_ref(u, i+j) + C_uhword_ref(v, i) + k;
C_uhword_set(u, i+j, t);
k = t >> C_BIGNUM_HALF_DIGIT_LENGTH;
}
C_uhword_set(u, j+n, (C_uhword_ref(u, j+n) + k));
}
if (q != NULL) C_uhword_set(q, j, qhat);
} /* end of "j" loop */
}```

There are some shoutouts to Gauche, which is a beautifully-crafted Scheme implementation in C. The particular "trick" referred to here simplifies the calculation of our allocation sizes a little bit by ensuring we never shift more than a halfdigit when normalising (see next section).

As you can see from the implementation, the "multiply and subtract" is actually done in one loop which scans over the remainder u and denominator v at the same time, so this is not "magic"; we can perform the multiply and subtract steps over the entire bignum in one efficient O(n) loop. Perhaps surprisingly, the overall algorithm is O(n²), just like multiplication. Division is still much slower than multiplication because each "step" performs more operations (just look at the algorithms!).

### Normalisation

A real-world implementation of the above division algorithm will try to reduce the number of guess adjustments. This is done by first normalising or scaling the numbers. This is done by multiplying both the numerator and denominator with the same power of two before starting to do the division. Afterwards, the remainder must be scaled back by dividing by that power of two. Instead of multiplying and dividing, you can of course just shift the numbers.

The number by which is multiplied depends on the numerator's first digit; it must be scaled up to be at least half of the base. In base 10, you need to scale it up to at least 5, while in a "full machine word" base it's even easier: you simply shift the entire number so that the highest bit of the most significant limb is set. How Scheme48 did this:

```bignum_digit_type v1 = (BIGNUM_REF ((denominator), (length_d - 1)));
while (v1 < (BIGNUM_RADIX / 2))  /* Is the high bit set yet? */
{
v1 <<= 1;
shift += 1;
}```

In the CHICKEN version, we take a simpler approach by subtracting the integer length from the digit length, which effectively is the same as counting the number of leading zeroes ("nlz"):

```C_uword v1 = *(C_bignum_digits(denominator) + length - 1);
shift = C_BIGNUM_DIGIT_LENGTH - C_ilen(v1); /* nlz */```

Then, both numbers are copied into temporary buffers which are shifted left in-place by the number of bits calculated here.

Normalisation works by preventing the algorithm from overshooting. Think about it: any guess may always be too high, never too low! So if you scale the first digit to be as high as possible, you can't so easily make a guess that is too high. It's weird, but the math seems to work out.

## A reading list for beginners

I am writing this blog post series mostly as a quick overview and introduction to the struggles and approaches taken in CHICKEN's bignum implementation. It is not intended as a full-on tutorial. If you are serious about implementing a full numeric tower (good for you!) or diving deeper into the CHICKEN code, you'll need more. Unfortunately, good and easy to understand documentation is surprisingly hard to find, so here's a reading list to save you some effort.

• Knuth's The Art of Computer Programming, Vol. 2: Seminumerical Algorithms. The definitive reference. Many will say that every self-respecting hacker should have read these books, but truth be told they're rather tough to get through. But even if you do give up working through the books, they serve as great reference material. Why are these books so tough? They're math-heavy (especially the first book), and Knuth uses his own "hypothetical" MIX architecture for all code examples and exercises. Yes, everything is in assembly language! Nevertheless, the books are very thorough, and they're obviously written out of love for the craft.
• Tom St Denis's book Multi-Precision Math is much more gentle than Knuth's books. This is the companion book for LibTomMath, a public domain, well-commented bignum library, explicitly written to be easy to understand. The book and library cover mostly classic algorithms, but there are also a handful of "advanced" algorithms, and several special-purpose optimised versions.
• Per Brinch Hansen's Multiple-Length Division Revisited: A Tour of the Minefield. This little gem is helpful if you are having trouble following textbook explanations of the classical division algorithm. It was written out of frustration with the poor quality of existing explanations.
• MpNT: A Multi-Precision Number Package by Tiplea et al. is another overview of a library's algorithms. This is a bit terser and more math-heavy than the LibTom book, but also covers several more advanced algorithms. This is a very good and complete reference.
• Finally, Modern Computer Arithmetic by Richard Brent and Paul Zimmermann is probably the tersest, but also the most complete guide to efficient algorithms that I've found so far. These guys know what they're talking about: this book truly covers the "state of the art". Only for advanced students of numerics :)
• As a bonus, if you're serious about efficiency: The "algorithms" section of the GMP manual. These are terse and incomplete, and you usually won't get a complete understanding just by reading them. However, since GMP is the most popular implementation, it is also the fastest: Researchers usually create a proof of concept implementation for GMP and compare it to the existing algorithms. So, it is important to know which algorithms GMP is currently using, and then try to find better papers that explain them.

## October 12, 2016

### Andy Wingo

#### An incomplete history of language facilities for concurrency

I have lately been in the market for better concurrency facilities in Guile. I want to be able to write network servers and peers that can gracefully, elegantly, and efficiently handle many tens of thousands of clients and other connections, but without blowing the complexity budget. It's a hard nut to crack.

Part of the problem is implementation, but a large part is just figuring out what to do. I have often thought that modern musicians must be crushed under the weight of recorded music history, but it turns out in our humble field that's also the case; there are as many concurrency designs as languages, just about. In this regard, what follows is an incomplete, nuanced, somewhat opinionated history of concurrency facilities in programming languages, with an eye towards what I should "buy" for the Fibers library I have been tinkering on for Guile.

* * *

Modern machines have the raw capability to serve hundreds of thousands of simultaneous long-lived connections, but it’s often hard to manage this at the software level. Fibers tries to solve this problem in a nice way. Before discussing the approach taken in Fibers, it’s worth spending some time on history to see how we got here.

One of the most dominant patterns for concurrency these days is “callbacks”, notably in the Twisted library for Python and the Node.js run-time for JavaScript. The basic observation in the callback approach to concurrency is that the efficient way to handle tens of thousands of connections at once is with low-level operating system facilities like `poll` or `epoll`. You add all of the file descriptors that you are interested in to a “poll set” and then ask the operating system which ones are readable or writable, as appropriate. Once the operating system says “yes, file descriptor 7145 is readable”, you can do something with that socket; but what? With callbacks, the answer is “call a user-supplied closure”: a callback, representing the continuation of the computation on that socket.

Building a network service with a callback-oriented concurrency system means breaking the program into little chunks that can run without blocking. Whereever a program could block, instead of just continuing the program, you register a callback. Unfortunately this requirement permeates the program, from top to bottom: you always pay the mental cost of inverting your program’s control flow by turning it into callbacks, and you always incur run-time cost of closure creation, even when the particular I/O could proceed without blocking. It’s a somewhat galling requirement, given that this contortion is required of the programmer, but could be done by the compiler. We Schemers demand better abstractions than manual, obligatory continuation-passing-style conversion.

Callback-based systems also encourage unstructured concurrency, as in practice callbacks are not the only path for data and control flow in a system: usually there is mutable global state as well. Without strong patterns and conventions, callback-based systems often exhibit bugs caused by concurrent reads and writes to global state.

Some of the problems of callbacks can be mitigated by using “promises” or other library-level abstractions; if you’re a Haskell person, you can think of this as lifting all possibly-blocking operations into a monad. If you’re not a Haskeller, that’s cool, neither am I! But if your typey spidey senses are tingling, it’s for good reason: with promises, your whole program has to be transformed to return promises-for-values instead of values anywhere it would block.

An obvious solution to the control-flow problem of callbacks is to use threads. In the most generic sense, a thread is a language feature which denotes an independent computation. Threads are created by other threads, but fork off and run independently instead of returning to their caller. In a system with threads, there is implicitly a scheduler somewhere that multiplexes the threads so that when one suspends, another can run.

In practice, the concept of threads is often conflated with a particular implementation, kernel threads. Kernel threads are very low-level abstractions that are provided by the operating system. The nice thing about kernel threads is that they can use any CPU that is the kernel knows about. That’s an important factor in today’s computing landscape, where Moore’s law seems to be giving us more cores instead of more gigahertz.

However, as a building block for a highly concurrent system, kernel threads have a few important problems.

One is that kernel threads simply aren’t designed to be allocated in huge numbers, and instead are more optimized to run in a one-per-CPU-core fashion. Their memory usage is relatively high for what should be a lightweight abstraction: some 10 kilobytes at least and often some megabytes, in the form of the thread’s stack. There are ongoing efforts to reduce this for some systems but we cannot expect wide deployment in the next 5 years, if ever. Even in the best case, a hundred thousand kernel threads will take at least a gigabyte of memory, which seems a bit excessive for book-keeping overhead.

Kernel threads can be a bit irritating to schedule, too: when one thread suspends, it’s for a reason, and it can be that user-space knows a good next thread that should run. However because kernel threads are scheduled in the kernel, it’s rarely possible for the kernel to make informed decisions. There are some “user-mode scheduling” facilities that are in development for some systems, but again only for some systems.

The other significant problem is that building non-crashy systems on top of kernel threads is hard to do, not to mention “correct” systems. It’s an embarrassing situation. For one thing, the low-level synchronization primitives that are typically provided with kernel threads, mutexes and condition variables, are not composable. Also, as with callback-oriented concurrency, one thread can silently corrupt another via unstructured mutation of shared state. It’s worse with kernel threads, though: a kernel thread can be interrupted at any point, not just at I/O. And though callback-oriented systems can theoretically operate on multiple CPUs at once, in practice they don’t. This restriction is sometimes touted as a benefit by proponents of callback-oriented systems, because in such a system, the callback invocations have a single, sequential order. With multiple CPUs, this is not the case, as multiple threads can run at the same time, in parallel.

Kernel threads can work. The Java virtual machine does at least manage to prevent low-level memory corruption and to do so with high performance, but still, even Java-based systems that aim for maximum concurrency avoid using a thread per connection because threads use too much memory.

In this context it’s no wonder that there’s a third strain of concurrency: shared-nothing message-passing systems like Erlang. Erlang isolates each thread (called processes in the Erlang world), giving each it its own heap and “mailbox”. Processes can spawn other processes, and the concurrency primitive is message-passing. A process that tries receive a message from an empty mailbox will “block”, from its perspective. In the meantime the system will run other processes. Message sends never block, oddly; instead, sending to a process with many messages pending makes it more likely that Erlang will pre-empt the sending process. It’s a strange tradeoff, but it makes sense when you realize that Erlang was designed for network transparency: the same message send/receive interface can be used to send messages to processes on remote machines as well.

No network is truly transparent, however. At the most basic level, the performance of network sends should be much slower than local sends. Whereas a message sent to a remote process has to be written out byte-by-byte over the network, there is no need to copy immutable data within the same address space. The complexity of a remote message send is O(n) in the size of the message, whereas a local immutable send is O(1). This suggests that hiding the different complexities behind one operator is the wrong thing to do. And indeed, given byte read and write operators over sockets, it’s possible to implement remote message send and receive as a process that serializes and parses messages between a channel and a byte sink or source. In this way we get cheap local channels, and network shims are under the programmer’s control. This is the approach that the Go language takes, and is the one we use in Fibers.

Structuring a concurrent program as separate threads that communicate over channels is an old idea that goes back to Tony Hoare’s work on “Communicating Sequential Processes” (CSP). CSP is an elegant tower of mathematical abstraction whose layers form a pattern language for building concurrent systems that you can still reason about. Interestingly, it does so without any concept of time at all, instead representing a thread’s behavior as a trace of instantaneous events. Threads themselves are like functions that unfold over the possible events to produce the actual event trace seen at run-time.

This view of events as instantaneous happenings extends to communication as well. In CSP, one communication between two threads is modelled as an instantaneous event, partitioning the traces of the two threads into “before” and “after” segments.

Practically speaking, this has ramifications in the Go language, which was heavily inspired by CSP. You might think that a channel is just a an asynchronous queue that blocks when writing to a full queue, or when reading from an empty queue. That’s a bit closer to the Erlang conception of how things should work, though as we mentioned, Erlang simply slows down writes to full mailboxes rather than blocking them entirely. However, that’s not what Go and other systems in the CSP family do; sending a message on a channel will block until there is a receiver available, and vice versa. The threads are said to “rendezvous” at the event.

Unbuffered channels have the interesting property that you can `select` between sending a message on channel a or channel b, and in the end only one message will be sent; nothing happens until there is a receiver ready to take the message. In this way messages are really owned by threads and never by the channels themselves. You can of course add buffering if you like, simply by making a thread that waits on either sends or receives on a channel, and which buffers sends and makes them available to receives. It’s also possible to add explicit support for buffered channels, as Go, core.async, and many other systems do, which can reduce the number of context switches as there is no explicit buffer thread.

Whether to buffer or not to buffer is a tricky choice. It’s possible to implement singly-buffered channels in a system like Erlang via an explicit send/acknowlege protocol, though it seems difficult to implement completely unbuffered channels. As we mentioned, it’s possible to add buffering to an unbuffered system by the introduction of explicit buffer threads. In the end though in Fibers we follow CSP’s lead so that we can implement the nice `select` behavior that we mentioned above.

As a final point, `select` is OK but is not a great language abstraction. Say you call a function and it returns some kind of asynchronous result which you then have to `select` on. It could return this result as a channel, and that would be fine: you can add that channel to the other channels in your `select` set and you are good. However, what if what the function does is receive a message on a channel, then do something with the message? In that case the function should return a channel, plus a continuation (as a closure or something). If `select` results in a message being received over that channel, then we call the continuation on the message. Fine. But, what if the function itself wanted to `select` over some channels? It could return multiple channels and continuations, but that becomes unwieldy.

What we need is an abstraction over asynchronous operations, and that is the main idea of a CSP-derived system called “Concurrent ML” (CML). Originally implemented as a library on top of Standard ML of New Jersey by John Reppy, CML provides this abstraction, which in Fibers is called an operation1. Calling `send-operation` on a channel returns an operation, which is just a value. Operations are like closures in a way; a closure wraps up code in its environment, which can be later called many times or not at all. Operations likewise can be performed2 many times or not at all; performing an operation is like calling a function. The interesting part is that you can compose operations via the `wrap-operation` and `choice-operation` combinators. The former lets you bundle up an operation and a continuation. The latter lets you construct an operation that chooses over a number of operations. Calling `perform-operation` on a choice operation will perform one and only one of the choices. Performing an operation will call its `wrap-operation` continuation on the resulting values.

While it’s possible to implement Concurrent ML in terms of Go’s channels and baked-in `select` statement, it’s more expressive to do it the other way around, as that also lets us implement other operations types besides channel send and receive, for example timeouts and condition variables.

1 CML uses the term event, but I find this to be a confusing name. In this isolated article my terminology probably looks confusing, but in the context of the library I think it can be OK. The jury is out though.

2 In CML, synchronized.

* * *

Thanks very much to Matthew Flatt, Matthias Felleisen, and Michael Sperber for pushing me towards CML. In the beginning I thought its benefits were small and complication large, but now I see it as being the reverse. Happy hacking :)

### Guile News

#### GNU Guile 2.0.13 released [security fixes]

We've just released a new version of GNU Guile, version 2.0.13, which is a security release for Guile (see the original announcement).

This handles a significant security vulnerability affecting the live REPL, CVE-2016-8606. Due to the nature of this bug, Guile applications themselves in general aren't vulnerable, but Guile developers are. Arbitrary Scheme code may be used to attack your system in this scenario. (A more minor security issue is also addressed, CVE-2016-8605.)

There is also a lesson here that applies beyond Guile: the presumption that "localhost" is only accessible by local users can't be guaranteed by modern operating system environments. If you are looking to provide local-execution-only, we recommend using Unix domain sockets or named pipes. Don't rely on localhost plus some port.

To give context, Guile supports a nice live-hacking feature where a user can expose a REPL to connect to, through Geiser or so on. This allows Guile users to hack programs even while programs are running.

When using the live hacking feature, the default in Guile has been to expose a port over localhost to which code may be passed. The assumption for this is that only a local user may write to localhost, so it should be safe. Unfortunately, users simultaneously developing Guile and operating modern browsers are vulnerable to a combination of an HTML form protocol attack and a DNS rebinding attack. How to combine these attacks is published in the article How to steal any developer's local database.

In Guile's case, the general idea is that you visit some site which presumably loads some JavaScript code (or tricks the developer into pressing a button which performs a POST), and the site operator switches the DNS from their own IP to 127.0.0.1. Then a POST is done from the website to 127.0.0.1 with the body containing Scheme code. This code is then executed by the Guile interpreter on the listening port.

The version we are releasing mitigates this problem by detecting incoming HTTP connections and closing them before executing any code.

However, there is a better long term solution, which is already available even to users of older versions of Guile: Guile supports Unix domain sockets in POSIX environments. For example, users may run the command:

to open and listen to a socket at /tmp/guile-socket. Geiser users may then connect using M-x geiser-connect-local. This is considerably safer.

We hope that other program authors take heed of this lesson as well: many programs make use of localhost + port as a way of limiting connections. Unfortunately, in today's complex networked environment, this isn't a safe assumption. It's very difficult to predict what programs may provide a way of chaining requests to an application listening on localhost, and certainly difficult on a system where web browsers are involved. Take heed!

## October 11, 2016

### Programming Praxis

#### Three List Manipulation Exercises

We must be approaching the middle of the autumn academic term, because the beginning-programmer message boards are starting to fill up with linked-list exercises; one enterprising fellow even sent me an email asking for help on his homework. Here are three tasks that involve manipulating linked lists. We’ve probably done all three before, and they’re simple enough for many of the people that read and respond here, but learners seem to have trouble with them, so we’ll discuss them here:

1. Write a function that takes a linked list of integers and returns a new linked list with all the odd integers removed.
2. Write a function that removes every nth item from a linked list.
3. Write a function that reverses the two halves of a list; if the number of items in the list is odd, the middle element should go with the second half of the list

Your task is to write the three functions described above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

## October 10, 2016

### Peter Bex

#### CHICKEN's numeric tower: part 1

Originally, CHICKEN only supported fixnums (small integers) and flonums (floating-point numbers). The upcoming CHICKEN 5 will support a full numeric tower, adding arbitrary-length integers, rational numbers and complex numbers. This is the first in a series of blog posts about my journey to make this a reality. We'll start with a bit of background information. Later parts will dive into the technical details.

## In the beginning, there were two numerical types

Like I mentioned, CHICKEN originally only supported fixnums and flonums. This is still the case in CHICKEN 4. When a fixnum overflows, it is coerced into a flonum. On 32-bit systems, this buys us 52 bits of precision, which is more than the 30 bits of precision fixnums offer:

``` #;1> most-positive-fixnum
1073741823
#;2> (+ most-positive-fixnum 1)
1073741824.0```

This works reasonably well, and is well-behaved until you go beyond the 52 bits supported by the floating-point representation:

``` #;3> (flonum-print-precision 100)
#;4> (expt 2 53)
9007199254740992.0
#;5> (+ (expt 2 53) 1)
9007199254740992.0
#;6> (= (expt 2 53) (+ (expt 2 53) 1))
#t```

On a 64-bit machine, overflow of the 62 bits of a fixnum to the 52 bits of a flonum is rather weird:

``` #;1> (= most-positive-fixnum (- (+ most-positive-fixnum 1) 2))
#t```

Since we only have fixnums and flonums, any attempt to enter a rational number will result in a flonum:

``` #;1> 1/2
0.5
#;2> 1/3
0.333333333333333```

Complex numbers are not supported at all:

``` #;1> 1+2i

Error: unbound variable: 1+2i```

Of course, some people still needed to work with complex numbers, so a long time ago, Thomas Chust created the "complex" egg. This added complex number support to the reader, and the basic numeric operators were overridden to support complex numbers. About a year later, Felix Winkelmann created the original version of the "numbers" egg, using Thomas's code for complex numbers. This added arbitrarily large integers ("bignums" in Lisp parlance) and rational number support via the GNU MP library. Thus, CHICKEN finally had a full numeric tower, and it was completely optional, too. Pretty awesome!

## Cracks start to show

Unfortunately, it's not as awesome as it sounds. There are some problems with having parts of the numeric tower as an add-on, instead of having it all in core:

• In a Scheme with modules, + from the scheme module should always refer to the same procedure. So if a module imports that + instead of the one from the numbers module, it will not understand extended numeric types. This means that you can't easily combine a library that uses numbers with one that doesn't. If you pass a bignum to the library that does not use numbers, it will raise an exception. This is mostly a problem with Scheme itself, which doesn't have a clean way to define polymorphic procedures. This makes the numeric tower a built-in special case. It is possible to mutate procedures, but allowing for that implies a big performance hit on all code, even if you don't use the numbers egg.
• The numbers egg extends the reader to support extended numeric literals. This means that if some code somewhere loads the numbers egg, the reader extension is active even though you didn't load numbers yourself. This can cause confusion because normal numeric operations don't accept these numbers. For an example, see this bug report.
• Speaking of extended numeric literals: the compiler doesn't know how to serialise those into the generated C code. This means you can't compile Scheme code containing such literals. You'd have to use string->number everywhere, instead. I found a clever hack to make this work with the numbers egg, but it isn't fool-proof. For instance, it doesn't work when cross-compiling to a platform with different endianness, or if one platform is 32-bit and the other is 64-bit.
• The compiler can optimise tight loops by using inline C functions for primitive operations such as the built-in numerical procedures. A current weak spot of CHICKEN is that (as far as I know), eggs can't add such inline C function replacements. So, any code that uses the numbers egg is doomed to have bad performance in critical loops. I think making inlining of C functions available for user code would be a great project (hint, hint!).
• Because the FFI (foreign function interface) is built into the compiler, it doesn't support bignums. This means 64-bit integers returned from C are converted to flonums, losing precision. Eggs can't hook into the FFI deeply enough to override this.

One could argue that these are all language or implementation limitations. On the one hand, that's a fair argument. On the other hand, keeping everything "open" so it can be tweaked by the user prevents many optimisations. It also makes the implementation more complex. For instance, there are hooks in core specifically for the numbers egg, to support reading and writing extended literals. The numeric tower needs deeper integration than most other things because numbers are a basic type, much like symbols, strings or lists. So, it makes more sense to have this in the core system.

## The start of my quest

Traditionally, Lisps have supported a full numeric tower. At least since the MacLISP days (the early 1970s; see also The History of Lisp), bignums have been pretty standard. Scheme formalises this in the standard, but it does not require full support for all numeric types. Still, in my opinion any serious Lisp or Scheme implementation should support the full numerical tower. It's one of those things that make Lisp unique and more cause for that famous smugness of us Lisp weenies.

It is fantastic when a language supports arbitrarily large integers. Not having to worry about overflows helps prevent various nasty security bugs (luckily, overflowing into flonums, like CHICKEN, mitigates most of these). Bignums can also make it much easier to interact with native code, because integer width is never a problem. It basically frees the programmer from having to think about "unimportant" low-level details. Rational numbers (i.e., fractions like 1/2 or 3/5) and complex numbers are just icing on the cake that add a real feeling of "thoroughness" to Lisp.

This idea, and the fact that other "proper" Scheme implementations support the full numeric tower out of the box always frustrated me. I believe people are less likely to take CHICKEN seriously as a full Scheme implementation. Especially new users are often surprised when CHICKEN does not work as expected. Tutorials don't mention that the numeric tower is partly optional!

More experienced users were also frustrated with the limitations of having numbers as a separate egg, like you can see for example in this thread. In it, some of the problems are indicated, and it is also made clear why a GNU MP-based implementation should not be part of CHICKEN.

From all of this, I decided that the best way to get bignums into core would be to start with finding a good BSD-licensed implementation. Then I could replace GMP with this new implementation in the numbers egg, tweak it to use CHICKEN's naming conventions and finally integrate the new code into core. How hard could it be, really? Little did I suspect that 5 years later, the code would finally be introduced to core!

## A very slow, but BSD-licensed implementation

Finding a BSD-licensed bignum implementation is not very difficult, and I quickly settled on the Scheme48 implementation, which was originally taken from MIT Scheme. I've always admired Scheme48 for its extremely clean and easy to understand code base, and CHICKEN core already used the syntax-rules implementation from Scheme48, so it made a lot of sense to use their code. Unfortunately, it turned out that the implementation was extremely inefficient, especially when dealing with rational numbers ("ratnums"). After a few weeks of intensive hacking to fix the worst problems, it was finally ready.

This new implementation was much more efficient than the GMP-based numbers egg, but that's only because the GMP-based version relied heavily on finalizers to clean up memory. The new version integrated properly with the CHICKEN garbage collector. This reduced a whole lot of overhead. Having said that, GMP itself is the fastest bignum implementation you'll ever find, so if you can at all get away with using it in your project, do so!

## CHICKEN 5 is announced

The CHICKEN core team (of which I'm a member) decided that CHICKEN 5 should be a clean break, with no backwards compatibility. We wanted to finally restructure the core libraries, which had become rather messy, and change a few confusing aspects about modules. Doing this with backwards compatibility would sap too much development energy and possibly result in an even bigger mess. When this decision was made, I decided that this would be the perfect opportunity to finally integrate the numbers egg into core.

I had been working on the numbers egg on and off over the past years, hoping for a good moment to add it to core. When the opportunity presented itself, at first I naively thought a few tweaks would suffice to integrate it. I thought I only had to make some name changes and rearrange some functions. The Scheme48 code base used very descriptive and highly abstract naming, whereas CHICKEN uses terse names and has both inline and CPS variants for primitive operations. Besides, quite a bit of code in the numbers egg was purely in Scheme, whereas CHICKEN has a more-or-less official C API. So, I had to convert some of the functions to C. This would probably also result in some performance improvements.

## Small changes lead to a total rewrite

During the conversion to C, I noticed various opportunities for performance improvements. For instance, the Scheme48 code still relied on malloc() to allocate temporary numbers in several places. Where this was done, the final result of an operation would then be allocated into GC-managed memory and the temporary buffer was immediately freed.

Rewriting the code to allocate directly in GC-able memory resulted in quite the restructuring of the code, because we'd need to have a restartable continuation at every point where an allocation would take place. For example, here's the code for negating a bignum:

```static void big_neg(C_word c, C_word self, C_word k, C_word x)
{
bignum_type big = big_of(x); /* Extract bignum data */
C_word negated_big = bignum_new_sign(big, !(BIGNUM_NEGATIVE_P (big)));
C_return_bignum(k, negated_big);
}

static bignum_type bignum_new_sign(bignum_type bignum, int negative_p)
{
bignum_type result =
(bignum_allocate ((BIGNUM_LENGTH (bignum)), negative_p));  /* mallocs */
bignum_destructive_copy (bignum, result);  /* basically a manual memcpy */
return (result);
}```

It looks very simple, but a lot is going on under the hood. The C_return_bignum function contained all the hairy complexity; it would either convert the bignum to a fixnum, deallocate the bignum and call the passed continuation, or it would set up a continuation that would copy the bignum into a heap-allocated copy and deallocate the original bignum, and pass that to an allocation function.

This was changed into the following, which uses the core's _u_ naming convention to indicate that the function is unsafe, i.e. it doesn't check its arguments:

```void C_ccall C_u_bignum_negate(C_word c, C_word self, C_word k, C_word x)
{
C_word kab[C_SIZEOF_CLOSURE(3)], *ka = kab, k2, negp, size;

/* Create continuation k2, to call after allocation */
k2 = C_closure(&ka, 3, (C_word)bignum_negate_2, k, x);

negp = C_i_not(C_u_i_bignum_negativep(x)); /* Toggle sign */
size = C_u_i_bignum_size(x);
C_allocate_bignum(3, (C_word)NULL, k2, size, negp, C_SCHEME_FALSE);
}

static void bignum_negate_2(C_word c, C_word self, C_word new_big)
{
C_word k = C_block_item(self, 1), /* Extract original continuation */
old_big = C_block_item(self, 2); /* Extract original bignum */

/* Copy old bignum digits to newly allocated (negated) bignum */
C_memcpy(C_bignum_digits(new_big), C_bignum_digits(old_big),

C_kontinue(k, new_big); /* "Return" the new bignum by calling k with it */
}```

The new version looks hairier but does less, because it allocates the bignum directly into the nursery or the heap. Because this may require a GC, it needs to have a continuation, which can be invoked from the GC's trampoline. That's the reason this has to be cut into two separate C functions. There are functions that allocate 2 bignums or even more, which I had to cut up into 3 or more functions!

Besides using "native" naming conventions, this new version also gets rid of the unnecessary, un-CHICKENish bignum_type abstraction. Instead, it uses only C_word as its type. This also removed the need for some questionable type casts. Luckily, the final negating version that ended up in CHICKEN 5 is a lot simpler and again only one function, but that required a breakthrough in thinking that I hadn't had at this point yet. I will discuss this breakthrough in the final post in this series.

After having taken care of all the functions, very little remained of the original Scheme48 code. It was completely mutilated! Earlier I had to rewrite some of the Scheme code to improve performance, and now I was restructuring the C code. To top it off, after studying other bignum implementations, it became clear that the Scheme48 code was pretty slow when compared to other Schemes. It only implemented the "classical" algorithms, and it was optimised for readability, not speed.

So, I studied up on better algorithms to make it perform more acceptably. In the next few parts, I'll share with you a few things that I've learned.

## October 07, 2016

### Programming Praxis

#### Sticks

We continue our occasional series of textbook exercises:

You are given a bunch of sticks, each of integer length. Two sticks can be combined into a single, larger stick by a process that costs the sum of the lengths of the two sticks. Your goal is to build a single stick combining all the original sticks, at minimal cost.

For example, suppose you initially have three sticks of lengths 1, 2, and 4. You could combine sticks 2 and 4 at a cost of 6, then combine that stick with stick 1 at a cost of 7, for a total cost of 13. But it is better to first combine sticks 1 and 2 at a cost of 3, then combine that stick with stick 4 at a cost of 7, for a total cost of 10.

Your task is to write a program that combines a bunch of sticks at minimal cost. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

## October 04, 2016

### Programming Praxis

#### I’m Embarrassed!

Matthew Arcus pointed out a bug in my solution to the previous problem:

```> (dollar 1234567.9999)
\$1,234,567.100```

I can’t remember a similarly bad bug in any previous Programming Praxis problem.

You get the day off today. There is no exercise for you to do. You are welcome to read or run my corrected solution, or to post your own solution or discuss the exercise in the comments below.

## September 30, 2016

### Programming Praxis

#### Dollar Format

We have a simple task today, a function that formats a number in dollar format. A number like 1234567.8912 should be rounded to two positions after the decimal point, have commas inserted every three positions before the decimal point, and have a dollar sign prepended; thus, the function should format 1234567.8912 as \$1,234,567.89.

Your task is to write a function that returns numbers in dollar format; if your language provides such a facility natively, you are not permitted to use it. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.