Planet Scheme

February 05, 2016

Programming Praxis


Every so often I get it in mind to start solving the daily cryptogram in the newspaper. For instance:


The first word is either A or I, and the repeated two-letter words NK, NK, NA and NA suggest words like IF, IS, IT or OF, ON, OR. Further, the digraph NK also appears as YNK. So we guess that NK is IS, YNK is HIS, and V is A. Now the author LVCE AHVNK is _A__ __AI_, which rather strongly suggests MARK TWAIN. Now we have the phrase WH_N TH_ S_N IS SHININ_, which is almost certainly WHEN THE SUN IS SHINING, and the rest of the cryptogram is easy: A BANKER IS A FELLOW WHO LENDS YOU HIS UMBRELLA WHEN THE SUN IS SHINING AND WANTS IT BACK THE MINUTE IT BEGINS TO RAIN. — MARK TWAIN.

We’ve written a program to solve cryptograms, a long time ago, using a dictionary attack, but sometimes it’s still fun to solve them by hand. One way a computer can help is by creating frequency tables of single letters, digraphs, and trigraphs; in the puzzle shown above, a table of digraphs leads quickly to NX, which appears six times, along with NK (three times) and NA (two times), which quickly limits the possibilities for the letter N. We didn’t use it in the analysis above, but AYG is the most common trigram, and both times it appears it is a word by itself, so it is almost certainly THE or AND. Thus, frequency tables quickly lead to a solution.

Your task is to write a program that generates frequency tables, including single characters, digraphs, trigraphs, and other lengths. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at February 05, 2016 09:00 AM

February 04, 2016

Andy Wingo

guile compiler tasks

Hey! We released Guile 2.1.2, including the unboxing work, and we fixed the slow bootstrap problem by shipping pre-built bootstraps in tarballs. A pretty OK solution in my opinion; check it out!

future work

At this point I think I'm happy with Guile's compiler and VM, enough for now. There is a lot more work to do but it's a good point at which to release a stable series. There will probably be a number of additional pre-releases, but not any more significant compiler/VM work that must be done before a release.

However, I was talking with Guilers at FOSDEM last weekend and we realized that although we do a pretty good job at communicating the haps in compiler-land, we don't do a good job at sharing a roadmap or making it possible for other folks to join the hack. And indeed, it's been difficult to do so while things were changing so much: I had to get things right in my head before joining in the confusion of other people's heads.

In that spirit I'd like to share a list of improvements that it would be nice to make at some point. If you take one of these tasks, be my guest: find me on IRC (wingo on freenode) and let me know, and I'll help as I am able. You need to be somewhat independent; I'm not offering a proper mentoring or anything, more like office hours or something, where you come with the problem you are having and I commiserate and give context/background/advice as I am able.

So with that out of the way, here's a huge list of stuff! Following this, more details on each one.

  1. stripping binaries

  2. full source in binaries

  3. cps in in binaries

  4. linking multiple modules together

  5. linking a single executable

  6. instruction explosion

  7. elisp optimizations

  8. prompt removal

  9. basic register allocation

  10. optimal register allocation

  11. unboxed record fields

  12. textual CPS

  13. avoiding arity checks

  14. unboxed calls and returns

  15. module-level inlining

  16. cross-module inlining

As a bonus, in the end I'll give some notes on native compilation. But first, the hacks!

stripping binaries

Guile uses ELF as its object file format, and currently includes source location information as DWARF data. On space-constrained devices this might be too much. Your task: add a hack to the linker that can strip existing binaries. Read Ian Lance Taylor's linker articles for more background, if you don't know things about linkers yet.

full source in binaries

Wouldn't it be nice if the ELF files that Guile generates actually included the source as well as the line numbers? We could do that, in a separate strippable ELF section. This point is like the reverse of the previous point :)

cps in in binaries

We could also include the CPS IR in ELF files too. This would enable some kinds of link-time optimization and cross-module inlining. You'd need to define a binary format for CPS, like LLVM bitcode or so. Neat stuff :)

linking multiple modules together

Currently in Guile, just about every module is a separate .go file. Loading a module will cause a few stat calls and some seeks and reads and all that. Wouldn't it be nice if you could link together all the .go files that were commonly used into one object? Again this is a linker hack, but it needs support from the run-time as well: when the run-time goes to load a file, it should first check in a registry if that file has been logically provided by some other file. We'd be able to de-duplicate constant data from various modules. However there is an initialization phase when loading a .go file which effectively performs all the relocations needed by constants that need a fix-up at load-time; see the ELF article I linked to above for more. For some uses, it would be OK to produce one relocation/initialization procedure. For others, if you expected to only load a fraction of the modules in a .go file, it would be a lose on startup time,
so you would probably need to support lazy relocation when a module is first loaded.

Anyway, your task would be to write a linker hack that loads a bunch of .go files, finds the relocations in them, de-duplicates the constants, and writes out a combined .go file that includes a table of files contained in it. Good luck :) This hack would work great for Emacs, where it's effectively a form of unexec that doesn't actually rely on unexec.

linking a single executable

In the previous task, you could end up with the small guile binary that links to libguile (or your binary linking to libguile), and then a .go file containing all the modules you are interestd in. It sure would be nice to be able to link those together into just one binary, or at least to link the .go into the Guile binary. If the Guile is statically linked itself, you would have a statically linked application. If it's dynamically linked, it would remain dynamically linked. Again, a linker hack, but one that could provide a nicer way to distribute Guile binaries.

instruction explosion

Now we get more to the compiler side of things. Currently in Guile's VM there are instructions like vector-ref. This is a little silly: there are also instructions to branch on the type of an object (br-if-tc7 in this case), to get the vector's length, and to do a branching integer comparison. Really we should replace vector-ref with a combination of these test-and-branches, with real control flow in the function, and then the actual ref should use some more primitive unchecked memory reference instruction. Optimization could end up hoisting everything but the primitive unchecked memory reference, while preserving safety, which would be a win. But probably in most cases optimization wouldn't manage to do
this, which would be a lose overall because you have more instruction dispatch.

Well, this transformation is something we need for native compilation anyway. I would accept a patch to do this kind of transformation on the master branch, after version 2.2.0 has forked. In theory this would remove most all high level instructions from the VM, making the bytecode closer to a virtual CPU, and likewise making it easier for the compiler to emit native code as it's working at a lower level.

elisp optimizations

Guile implements Emacs Lisp, and does so well. However it hasn't been the focus of a lot of optimization. Emacs has a lot of stuff going on on its side, and so have we, so we haven't managed to replace the Elisp interpreter in Emacs with one written in Guile, though Robin Templeton has brought us a long way forward. We need someone to do both the integration work but also to poke the compiler and make sure it's a clear win.

prompt removal

It's pretty natural to use delimited continuations when compiling some kind of construct that includes a break statement to Guile, whether that compiler is part of Elisp or just implemented as a Scheme macro. But, many instances of prompts can be contified, resulting in no overhead at run-time. Read up on contification and contify the hell out of some prompts!

basic register allocation

Guile usually tries its best to be safe-for-space: only the data which might be used in the future of a program is kept alive, and the rest is available for garbage collection. Notably, this applies to function arguments, temporaries, and lexical variables: if a value is dead, the GC can collect it and re-use its space. However this isn't always what you want. Sometimes you might want to have all variables that are in scope to be available, for better debugging. Your task would be to implement a "slot allocator" (which is really register allocation) that keeps values alive in the parts of the programs that they dominate.

optimal register allocation

On the other hand, our slot allocator -- which is basically register allocation, but for stack slots -- isn't so great. It does OK but you can often end up shuffling values in a loop, which is the worst. Your task would be to implement a proper register allocator: puzzle-solving, graph-coloring, iterative coalescing, something that really tries to do a good job. Good luck!

unboxed record fields

Guile's "structs", on which records are implemented, support unboxed values, but these values are untyped, not really integrated with the record layer, and always boxed in the VM. Your task would be to design a language facility that allows us to declare records with typed fields, and to store unboxed values in those fields, and to cause access to their values to emit boxing/unboxing instructions around them. The optimizer will get rid of those boxing/unboxing instructions if it can. Good luck!

textual CPS

The CPS language is key to all compiler work in Guile, but it doesn't have a nice textual form like LLVM IR does. Design one, and implement a parser and an unparser!

avoiding arity checks

If you know the procedure you are calling, like if it's lexically visible, then if you are calling it with the right number of arguments you can skip past the argument check and instead do a call-label directly into the body. Would be pretty neat!

unboxed calls and returns

Likewise if a function's callers are all known, it might be able to unbox its arguments or return value, if that's a good idea. Tricky! You could start with a type inference pass or so, and maybe that could produce some good debugging feedback too.

module-level inlining

Guile currently doesn't inline anything that's not lexically visible. Unfortunately this restriction extends to top-level definitions in a module: they are treated as mutable and so never inlined/optimized/etc. Probably we need to change the semantics here such that a module can be compiled as a unit, and all values which are never mutated can be assumed to be constant. Probably you also want a knob to turn off this behavior, but really you can always re-compile and re-load a module as a whole if re-loading a function at run-time doesn't work because it was inlined. Anyway. Some semantic work here, but some peval work as well. Be careful!

cross-module inlining

Likewise Guile currently doesn't inline definitions from other modules. However for small functions this really hurts. Guile should probably serialize tree-il for small definitions in .go files, and allow peval to speculatively inline imported definitions. This is related to the previous point and has some semantic implications.

bobobobobobonus! native compilation

Thinking realistically, native compilation is the next step. We have the object file format, cool. We will need the ability to call out from machine code in .go files to run-time functions, so we need to enhance the linker, possibly even with things like PLT/GOT sections to avoid dirtying too many pages. We need to lower the CPS even further, to get closer to some kind of machine model, then go specific, with an assembler for each architecture. The priority in the beginning will be simplicity and minimal complexity; good codegen will come later. This is obviously the most attractive thing but it's also the most tricky, design-wise. I want to do at least part of this, so though you can't have it all, you are welcome to help :)

That's it for now. I'll amend the post with more things as and when I think of them. Comments welcome too, as always. Happy hacking!

by Andy Wingo at February 04, 2016 09:38 PM

February 03, 2016

Guile News

GNU Guile 2.1.2 released (beta)

We are chuffed to announce GNU Guile release 2.1,2, the next pre-release in what will become the 2.2 stable series. This release features unboxed arithmetic and dramatically faster build times, along with a number of small speed and memory improvements.

See the original announcement at for full details.

by Andy Wingo at February 03, 2016 04:42 PM

February 02, 2016

Joshua Herman

Using JS on Arduino (Part 1)

Using Javascript to control an arduino seems far fetched but there is this great library call Johnny Five which allows you to control an arduino (or many other boards see . In this article we will go over how to set up using your Arduino with javascript. The first step should be loading the Arduino application on your computer and running the blink app.

by Joshua Herman ( at February 02, 2016 03:57 PM

Programming Praxis

Prime Gaps

We studied maximal prime gaps in a recent exercise. In today’s exercise we will look at minimal prime gaps — the smallest gap of a given size. For instance, the minimal prime gap of 2 is the gap from 3 to 5, the minimal prime gap of 4 is the gap from 7 to 11, the minimal prime gap of 6 is the gap from 13 to 19, and so on.

Your task is to write a program that finds minimal prime gaps. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at February 02, 2016 09:00 AM

January 31, 2016

Joe Marshall

Race results are in

Some people wanted to compare machines, so here is the exact code I used and some sample values I got from running it on my laptop. I'm curious what values other people get.
;;; -*- Mode: scheme; coding:us-ascii -*-

;;; This is code for MIT/GNU Scheme.  We attempt to measure the speed of ASSQ.
;;; The code avoids obvious abstractions because we don't want to
;;; measure the speed of function calling.

;; Inline the standard scheme primitives.
(declare (usual-integrations))

;; We change the size and alignment of the heap by consing a small
;; data structure to this list.
(define *change-heap-size-and-alignment* '())

;;; Creates a test alist that looks like this:
;;; ((13 . "13")
;;;  (23 . "23")
;;;  (25 . "25")
;;;  (18 . "18")
;;;  (0 . "0")
;;;  (19 . "19")
;;;  (5 . "5")
;;;  (4 . "4")
;;;  (6 . "6")
;;;    ...
(define (make-test-alist n-elements)
  (let ((alist 
         (fold-left (lambda (alist number)
                      (alist-cons number (number->string number)
                    ;; shuffle the numbers from 1 to n
                    (map car
                          (map (lambda (n)
                                 (cons n (random 1.0)))
                               (iota n-elements))
                          (lambda (l r) (< (cdr l) (cdr r))))))))
    (set! *change-heap-size-and-alignment* 
          (cons (vector #f) *change-heap-size-and-alignment*))

;;; Creates an alist of <size> entries and then measures the time to
;;; perform n-lookups on it.  Specialized to fixnum-only arithmetic.

(define (time-alist size n-lookups)
  (let ((test-alist (make-test-alist size)))
     (lambda ()
       (do ((i   0 (fix:+ i 1))
            (idx 0 (if (fix:> idx size)
                       (fix:+ idx 1)))
            (answer '() (assq idx test-alist)))
           ((fix:>= i n-lookups) answer))))))
Here's a sample run on my laptop. We make an alist with 10 elements and call ASSQ 100,000,000 times on it, fetching each entry about the same number of times.
1 ]=> (do ((i 0 (+ i 1))) ((not (< i 10))) (time-alist 10 100000000))

;process time: 2260 (2260 RUN + 0 GC); real time: 2259
;process time: 2260 (2260 RUN + 0 GC); real time: 2265
;process time: 2290 (2290 RUN + 0 GC); real time: 2291
;process time: 2250 (2250 RUN + 0 GC); real time: 2247
;process time: 2260 (2260 RUN + 0 GC); real time: 2259
;process time: 2240 (2240 RUN + 0 GC); real time: 2240
;process time: 2240 (2240 RUN + 0 GC); real time: 2243
;process time: 2250 (2250 RUN + 0 GC); real time: 2258
;process time: 2240 (2240 RUN + 0 GC); real time: 2247
;process time: 2250 (2250 RUN + 0 GC); real time: 2250
Process time is reported in milliseconds, so it took about 2.26 seconds to do 100,000,000 million lookups. This divides out to .0000000225 seconds = 22.5 nanoseconds per lookup.

It should be about linear with the size of the list, so we'd expect a 100 element list to take somewhere around 225 nanoseconds.

1 ]=> (do ((i 0 (+ i 1))) ((not (< i 10))) (time-alist 100 100000000))

;process time: 20720 (20720 RUN + 0 GC); real time: 20753
;process time: 20700 (20700 RUN + 0 GC); real time: 20733
;process time: 20640 (20640 RUN + 0 GC); real time: 20671
;process time: 20690 (20690 RUN + 0 GC); real time: 20695
;process time: 20670 (20670 RUN + 0 GC); real time: 20690
;process time: 21010 (21010 RUN + 0 GC); real time: 21026
;process time: 20800 (20800 RUN + 0 GC); real time: 20832
;process time: 20760 (20760 RUN + 0 GC); real time: 20747
;process time: 20710 (20710 RUN + 0 GC); real time: 20702
;process time: 20690 (20690 RUN + 0 GC); real time: 20700
;Value: #t
Testing a hash table:
(define (make-test-hash-table n-entries)
  (alist->hash-table (make-test-alist n-entries)))

;;; Creates a hash-table of <size> entries and then measures the time to
;;; perform n-lookups on it.  Specialized to fixnum-only arithmetic.

(define (time-hash-table size n-lookups)
  (let ((test-hash-table (make-test-hash-table size)))
     (lambda ()
       (do ((i   0 (fix:+ i 1))
            (idx 0 (if (fix:> idx size)
                       (fix:+ idx 1)))
            (answer '() (hash-table/get test-hash-table idx #f)))
           ((fix:>= i n-lookups) answer))))))
Put 10 elements or a thousand in a hash table, it takes a constant amount of time to look things up:
1 ]=> (do ((i 0 (+ i 1))) ((not (< i 10))) (time-hash-table 10 100000000))

;process time: 8320 (8320 RUN + 0 GC); real time: 8321
;process time: 8300 (8300 RUN + 0 GC); real time: 8304
;process time: 8420 (8420 RUN + 0 GC); real time: 8419
;process time: 8280 (8280 RUN + 0 GC); real time: 8304
;process time: 8380 (8380 RUN + 0 GC); real time: 8387
;process time: 8280 (8280 RUN + 0 GC); real time: 8288
;process time: 8320 (8320 RUN + 0 GC); real time: 8311
;process time: 8330 (8330 RUN + 0 GC); real time: 8327
;process time: 8290 (8290 RUN + 0 GC); real time: 8290
;process time: 8310 (8310 RUN + 0 GC); real time: 8307
;Value: #t

1 ]=> (do ((i 0 (+ i 1))) ((not (< i 10))) (time-hash-table 1000 100000000))

;process time: 8400 (8400 RUN + 0 GC); real time: 8403
;process time: 8550 (8550 RUN + 0 GC); real time: 8553
;process time: 8620 (8620 RUN + 0 GC); real time: 8639
;process time: 8420 (8420 RUN + 0 GC); real time: 8435
;process time: 8400 (8400 RUN + 0 GC); real time: 8425
;process time: 8460 (8460 RUN + 0 GC); real time: 8455
;process time: 8460 (8460 RUN + 0 GC); real time: 8459
;process time: 8480 (8480 RUN + 0 GC); real time: 8486
;process time: 8500 (8500 RUN + 0 GC); real time: 8502
;process time: 8520 (8520 RUN + 0 GC); real time: 8518
;Value: #t

Testing an rb-tree:
(define (make-test-rb-tree n-entries)
  (alist->rb-tree (make-test-alist n-entries) fix:= fix:<))

;;; Creates a rb-tree of <size> entries and then measures the time to
;;; perform n-lookups on it.  Specialized to fixnum-only arithmetic.

(define (time-rb-tree size n-lookups)
  (let ((test-rb-tree (make-test-rb-tree size)))
     (lambda ()
       (do ((i   0 (fix:+ i 1))
            (idx 0 (if (fix:> idx size)
                       (fix:+ idx 1)))
            (answer '() (rb-tree/lookup test-rb-tree idx #f)))
           ((fix:>= i n-lookups) answer))))))

1 ]=> (do ((i 0 (+ i 1))) ((not (< i 10))) (time-rb-tree 10 100000000))

;process time: 3910 (3910 RUN + 0 GC); real time: 3908
;process time: 3810 (3810 RUN + 0 GC); real time: 3805
;process time: 4090 (4090 RUN + 0 GC); real time: 4090
;process time: 3970 (3970 RUN + 0 GC); real time: 3967
;process time: 4060 (4060 RUN + 0 GC); real time: 4051
;process time: 3980 (3980 RUN + 0 GC); real time: 3979
;process time: 4040 (4040 RUN + 0 GC); real time: 4040
;process time: 4090 (4090 RUN + 0 GC); real time: 4094
;process time: 3810 (3810 RUN + 0 GC); real time: 3810
;process time: 4090 (4090 RUN + 0 GC); real time: 4092
;Value: #t

1 ]=> (do ((i 0 (+ i 1))) ((not (< i 10))) (time-rb-tree 100 100000000))

;process time: 7700 (7700 RUN + 0 GC); real time: 7720
;process time: 7760 (7760 RUN + 0 GC); real time: 7767
;process time: 7700 (7700 RUN + 0 GC); real time: 7710
;process time: 7890 (7890 RUN + 0 GC); real time: 7893
;process time: 7920 (7920 RUN + 0 GC); real time: 7914
;process time: 7650 (7650 RUN + 0 GC); real time: 7646
;process time: 7740 (7740 RUN + 0 GC); real time: 7738
;process time: 7760 (7760 RUN + 0 GC); real time: 7761
;process time: 7670 (7670 RUN + 0 GC); real time: 7671
;process time: 8140 (8140 RUN + 0 GC); real time: 8136
;Value: #t

1 ]=> (do ((i 0 (+ i 1))) ((not (< i 10))) (time-rb-tree 1000 100000000))

;process time: 13130 (13130 RUN + 0 GC); real time: 13124
;process time: 13160 (13160 RUN + 0 GC); real time: 13153
;process time: 13150 (13150 RUN + 0 GC); real time: 13149
;process time: 13140 (13140 RUN + 0 GC); real time: 13140
;process time: 13310 (13310 RUN + 0 GC); real time: 13304
;process time: 13170 (13170 RUN + 0 GC); real time: 13172
;process time: 13140 (13140 RUN + 0 GC); real time: 13167
;process time: 13250 (13250 RUN + 0 GC); real time: 13238
;process time: 13300 (13300 RUN + 0 GC); real time: 13318
;process time: 13420 (13420 RUN + 0 GC); real time: 13416
;Value: #t
And wt-trees while we're at it:
(define (make-test-wt-tree n-elements)
  (alist->wt-tree number-wt-type (make-test-alist n-elements)))

(define (time-wt-tree size n-lookups)
  (let ((test-wt-tree (make-test-wt-tree size)))
     (lambda ()
       (do ((i   0 (fix:+ i 1))
            (idx 0 (if (fix:> idx size)
                       (fix:+ idx 1)))
            (answer '() (wt-tree/lookup test-wt-tree idx #f)))
           ((fix:>= i n-lookups) answer))))))
1 ]=> (do ((i 0 (+ i 1))) ((not (< i 10))) (time-wt-tree 100 100000000))

;process time: 6400 (6400 RUN + 0 GC); real time: 6397
;process time: 6740 (6740 RUN + 0 GC); real time: 6736
;process time: 6760 (6760 RUN + 0 GC); real time: 6763
;process time: 6070 (6070 RUN + 0 GC); real time: 6068
;process time: 6450 (6450 RUN + 0 GC); real time: 6461
;process time: 6800 (6800 RUN + 0 GC); real time: 6812
;process time: 6330 (6330 RUN + 0 GC); real time: 6346
;process time: 6060 (6060 RUN + 0 GC); real time: 6066
;process time: 6050 (6050 RUN + 0 GC); real time: 6039
;process time: 6300 (6300 RUN + 0 GC); real time: 6303
;Value: #t

by Joe Marshall ( at January 31, 2016 05:28 AM

January 30, 2016

Joe Marshall

Alist vs. hash-table

An alist is a simple data structure that holds key-value pairs in a linked list. When a key is looked up, the list is searched to find it. The time it takes is proportional to the length of the list, or the number of entries.

A hash-table is a more complex data structure that holds key-value pairs in a set of "hash buckets". When a key is looked up, it is first "hashed" to find the correct bucket, then that bucket is searched for the entry. The time it takes depends on a number of things, the hash algorithm, the number of buckets, the number of entries in the bucket, etc. A hash-table can be faster than an alist because the hashing step is quick and the subsequent search step will have very few entries to search.

In theory, an alist takes time proportional to the number of entries, but a hash-table takes constant time independent of the number of entries. Let's find out if this is true for MIT/GNU Scheme.

I wrote a little program that measures how long it takes to look things up in an alist vs. a hash table. Here's what I measured:
It does indeed seem that alists are linear and hash tables are constant in lookup time. But the hashing step of a hash table does take time, so short alists end up being faster that hash tables. The breakeven point looks like a tad over 25 elements. So if you expect 25 or fewer entries, an alist will perform better than a hash table. (Of course different implementations will have different break even points.)

A tree data structure is slightly more complex than an alist, but simpler than a hash table. Looking up an entry in a tree takes time proportional to the logarithm of the number of entries. The logarithm function grows quite slowly, so a tree performs pretty well over a very large range of entries. A tree is slower than an alist until you have about 15 entries. At this point, the linear search of an alist cannot compete with the logarithmic search of a tree. The time it takes to search a tree grows, but quite slowly. It takes more than 100 entries before a tree becomes as slow as a hash table.
With a big enough tree, the growth is so slow that you can pretend it is constant.

by Joe Marshall ( at January 30, 2016 05:08 AM

January 29, 2016

Programming Praxis

Compare Strings With One Error

I don’t know if today’s exercise comes from an interview question or from somebody’s homework, but it’s a good exercise:

Given two strings, determine if they are the same except for exactly one difference. Two identical strings fail to match. Two strings that differ in two or more characters fail to match. Two strings with lengths that differ by one match if and only if they are identical except for one additional character. Indicate the index where the difference occurs, or report failure if the two input strings do not differ in exactly one character.

Your task is to write a program that compares strings with one error. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at January 29, 2016 04:24 PM

January 28, 2016

Joe Marshall

Latest amusement

Here's a procedure that CDRs down a list until it hits the end:
(define (last-cdr list)
  (let loop ((tail list))
    (if (pair? tail)
        (loop (cdr tail))
The inner loop of the procedure compiles to this:
 (cmp q (r 7) (@r 6))         ;; Interrupt check
 (jge (@pcr interrupt-12))

 (mov q (r 0) (@r 4))         ;; Fetch 'tail'

 (mov q (r 1) (r 0))          ;; Extract the tag
 (shr q (r 1) (&u #x3a))

 (cmp b (r 1) (&u 1))         ;; Check for pair?, jump if not
 (jne (@pcr label-14))

 (and q (r 0) (r 5))          ;; Mask the tag

 (mov q (r 1) (@ro 0 8))      ;; Fetch the CDR

 (mov q (@r 4) (r 1))         ;; Assign 'tail'

 (jmp (@pcr loop-3))          ;; Do it again.
On my laptop, each iteration of this loop takes a few nanoseconds.

I noticed that there seems to be a lot more going on than CDRing down a list. The interrupt check is responsible for a lot of the overhead. MIT/GNU Scheme polls for interrupts. The compiler inserts an interrupt check at every procedure entry and backwards branch. This ensures that only a small, bounded number of instructions can run between interrupt polls. The interrupt poll and the heap check are simultaneous because of a hack. To interrupt Scheme, you save the "Top of Heap" pointer and set it to zero. This causes the heap check to fail as if there were an out of memory condition. The out of memory handler first checks to see if this was actually an interrupt masquerading as an out of memory, and restores the heap pointer if so.

The two instructions,

(cmp q (r 7) (@r 6))         ;; Interrupt check
 (jge (@pcr interrupt-12))
perform the check. The first compares register 7 with the contents of memory that register 6 points at. The convention for the compiler is that register 7 contains the free pointer and register 6 holds the address of MEMTOP. The interrupt check does a memory cycle.

The reason you poll for interrupts is that you need the virtual machine to be in a coherent state because the interrupt could attempt to run arbitrary code. The garbage collector in particular has to be able to parse the machine state
at an interrupt. Interrupts are polled at apply time. Before application, the interrupts are checked. If an interrupt is to be processed, a continuation is pushed that will do the original application, and an interrupt handler is applied instead.

In MIT-Scheme, the virtual machine is a stack machine, and at apply time the entire state of the virtual machine is pointed to by the stack pointer. This is a good state to be in when you handle interrupts or garbage collect. But this means that arguments are passed on the stack. This instruction:

(mov q (r 0) (@r 4))
loads register 0 with the contents of memory at register 4. (The compiler convention is that register 4 is the stack pointer.) In other words, it is fetching the value of the argument tail from the stack. This is the second memory cycle.

Between interrupt polls, the compiled code can freely manipulate Scheme objects without worrying about being embarrassed by the garbage collector. But at apply time, when a possible interrupt poll could happen, the compiled code must put the virtual machine back into a coherent state. In particular, modified values are stored back on the stack.I This instruction just before the jump puts the value of tail back on the stack before jumping to the top of the loop.

(mov q (@r 4) (r 1)) 
That's three memory cycles in addition to the actual fetching of the CDR in this instruction:
(mov q (r 1) (@ro 0 8))
Of course these are quite inexpensive memory cycles because the top of stack is cached, but there is at least the time time it takes to validate the cache entry.

The interrupt poll occurs every time around the loop, so we copy the arguments back and forth between the stack and registers on each iteration. If we unroll the loop we can avoid some of the copying:

(define (last-cdr list)
  (let loop ((tail list))
    (if (pair? tail)
        (loop (cdr tail))

(define (last-cdr2 list)
  (let loop ((tail list))
    (if (not (pair? tail))
        (let ((tail (cdr tail)))
          (if (not (pair? tail))
              (loop (cdr tail)))))))
The loop in last-cdr2 compiles to this:
 (cmp q (r 7) (@r 6))           ;; Interrupt check
 (jge (@pcr interrupt-15))  

 (mov q (r 0) (@r 4))           ;; Fetch 'tail'

 (mov q (r 1) (r 0))            ;; Extract the tag
 (shr q (r 1) (&u #x3a))

 (cmp b (r 1) (&u 1))           ;; Check for pair?
 (jne (@pcr label-17))

 (and q (r 0) (r 5))            ;; Mask the tag

 (mov q (r 1) (@ro 0 8))        ;; Fetch the CDR

 (mov q (@r 4) (r 1))           ;; Assign 'tail'

 (mov q (r 0) (r 1))
 (shr q (r 0) (&u #x3a))        ;; Extract the tag

 (cmp b (r 0) (&u 1))           ;; Check for pair?
 (jne (@pcr label-19))

 (and q (r 1) (r 5))            ;; Mask the tag

 (mov q (r 0) (@ro 1 8))        ;; Fetch the CDR

 (mov q (@r 4) (r 0))           ;; Assign 'tail'

 (jmp (@pcr loop-6))            ;; Do it again
If I count correctly, there are six memory cycles per iteration, two of which are CDRs. We also avoid a single jmp instruction per iteration.

Here's the timing on my machine:

3.643 nanoseconds per cdr
3.643 nanoseconds per cdr
3.711 nanoseconds per cdr
3.643 nanoseconds per cdr
3.576 nanoseconds per cdr

2.766 nanoseconds per cdr
2.699 nanoseconds per cdr
2.834 nanoseconds per cdr
2.699 nanoseconds per cdr

by Joe Marshall ( at January 28, 2016 05:55 AM

January 26, 2016

Programming Praxis

Higher-Order String Functions

Scheme provides the higher-order functions map, for-each and fold that operate on lists. In today’s exercise we will write versions of those functions that operate on strings:

(string-map proc str) applies proc to each character in str, replacing the character with the output of proc. proc takes a single character and returns a single character.

(string-for-each proc str) applies proc to each character in str; the function is executed only for its side-effects. proc takes a single character and returns nothing.

(string-fold proc base str) calls function (proc base c) on each character of str, working left to right, accumulating the result in base as it goes. proc takes a base of any type, and a character, and returns a value of the same type as base. (string-fold-right proc base str) is the same, except that it works from right to left.

Your task is to implement these four functions in the language of your choice. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at January 26, 2016 09:00 AM

January 25, 2016

Joe Marshall

Locality and GC

I was curious about how long it took to CDR down a list. I wrote a little program that adds up the elements in a list
and tried it out.

(define test-list (iota (expt 10 6)))

(define (sum-list list)
  (let loop ((total 0)
	     (tail  list))
    (if (pair? tail)
	(loop (+ total (car tail)) (cdr tail))
	(begin (if (not (null? tail))
		   (error:not-list list 'sum-list))

(define (test-sum-list n)
   (do ((i 0 (+ i 1))
        (answer '() (sum-list test-list)))
       ((not (< i n)) answer)))

1 ]=> (show-time (lambda () (test-sum-list 1000)))

;process time: 4280 (4280 RUN + 0 GC); real time: 4281
;Value: 499999500000
But after garbage collection, the performance dropped a bit.
1 ]=> (gc-flip)

;GC #18 15:18:59: took:   0.13   (0%) CPU,   0.13   (0%) real; free: 246114893
;Value: 246114893

1 ]=> (show-time (lambda () (test-sum-list 1000)))

;process time: 4490 (4490 RUN + 0 GC); real time: 4483
;Value: 499999500000
Now it's taking 4.49 ms.

The garbage collector in MIT Scheme is a very simple, stop the world, copying collector. The GC algorithm traverses the heap in breadth-first order. An undesired effect of this is data structures being spread about the heap. When I first built the test-list, the cons cells were close together. Here are the first few addresses:
1 ]=> (do ((tail test-list (cdr tail))
                (i 0 (+ i 1)))
               ((not (< i 10)) #f)
             (display (object-datum tail))
;Value: #f
But after garbage collection...
1 ]=> (gc-flip)

;GC #19 15:33:28: took:   0.15   (1%) CPU,   0.16   (0%) real; free: 246114593
;Value: 246114593

1 ]=> (do ((tail test-list (cdr tail))
              (i 0 (+ i 1)))
             ((not (< i 10)) #f)
           (display (object-datum tail))
;Value: #f
All the cons cells are on different pages.

This is annoying, so I hacked the GC so that it transports chains of cons-cells eagerly. If we do the same experiment, before the flip,

1 ]=> (show-time (lambda () (test-sum-list 1000)))

;process time: 4280 (4280 RUN + 0 GC); real time: 4280
;Value: 499999500000
And after.
1 ]=> (gc-flip)

;GC #5 15:38:53: took:   0.23   (2%) CPU,   0.24   (0%) real; free: 249429521
;Value: 249429521

1 ]=> (show-time (lambda () (test-sum-list 1000)))

;process time: 4100 (4100 RUN + 0 GC); real time: 4100
;Value: 499999500000
Now we see a performance increase. If we look at the test list,
1 ]=> (do ((tail test-list (cdr tail))
           (i 0 (+ i 1)))
          ((not (< i 10)) #f)
        (display (object-datum tail))
;Value: #f

You can see that the cons-cells addresses are next to each other.

by Joe Marshall ( at January 25, 2016 01:17 AM

January 22, 2016

GNU Guix

Meet Guix at FOSDEM!

One week to FOSDEM! This year, there will be no less than six Guix-related talks. This and the fact that we are addressing different communities is exciting.

First, on Saturday morning, in the GNU Guile track (room K.3.201):

On Saturday afternoon:

On Sunday noon:

See you there!

About GNU Guix

GNU Guix is a functional package manager for the GNU system. The Guix System Distribution or GuixSD is an advanced distribution of the GNU system that relies on GNU Guix and respects the user's freedom.

In addition to standard package management features, Guix supports transactional upgrades and roll-backs, unprivileged package management, per-user profiles, and garbage collection. Guix uses low-level mechanisms from the Nix package manager, except that packages are defined as native Guile modules, using extensions to the Scheme language. GuixSD offers a declarative approach to operating system configuration management, and is highly customizable and hackable.

GuixSD can be used on an i686 or x86_64 machine. It is also possible to use Guix on top of an already installed GNU/Linux system, including on mips64el and armv7.

by Ludovic Courtès at January 22, 2016 09:33 AM

Programming Praxis


The Shannon entropy of a file is a measure of the information content in the file; higher entropy implies more information. Shannon entropy is computed as H = -1 * sum(pi * log2(pi)) where pi is the frequency of each symbol i in the input (frequency is the percentage of the total number of symbols). Shannon entropy is just one facet of the study of information theory, which is fundamental to computer science.

Your task is to compute the Shannon entropy of a file. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at January 22, 2016 09:00 AM

January 21, 2016

Andy Wingo

talks i would like to give in 2016

Every year I feel like I'm trailing things in a way: I hear of an amazing conference with fab speakers, but only after the call for submissions had closed. Or I see an event with exactly the attendees I'd like to schmooze with, but I hadn't planned for it, and hey, maybe I could have even spoke there.

But it's a new year, so let's try some new things. Here's a few talks I would love to give this year.

building languages on luajit

Over the last year or two my colleagues and I have had good experiences compiling in, on, and under LuaJIT, and putting those results into production in high-speed routers. LuaJIT has some really interesting properties as a language substrate: it has a tracing JIT that can punch through abstractions, it has pretty great performance, and it has a couple of amazing escape hatches that let you reach down to the hardware in the form of the FFI and the DynASM assembly generator. There are some challenges too. I can tell you about them :)

try guile for your next project!

This would be a talk describing Guile, what it's like making programs with it, and the kind of performance you can expect out of it. If you're a practicing programmer who likes shipping small programs that work well, are fun to write, and run with pretty good performance, I think Guile can be a great option.

I don't get to do many Guile talks because hey, it's 20 years old, so we don't get the novelty effect. Still, I judge a programming language based on what you can do with it, and recent advances in the Guile implementation have expanded its scope significantly, allowing it to handle many problem sizes that it couldn't before. This talk will be a bit about the language, a bit about the implementation, and a bit about applications or problem domains.

compiling with persistent data structures

As part of Guile's recent compiler improvements, we switched to a somewhat novel intermediate language. It's continuation-passing-style, but based on persistent data structures. Programming with it is interesting and somewhat different than other intermediate languages, and so this would be a talk describing the language and what it's like to work in it. Definitely a talk for compiler people, by a compiler person :)

a high-performance networking with luajit talk

As I mentioned above, my colleagues and I at work have been building really interesting things based on LuaJIT. In particular, using the Snabb Switch networking toolkit has let us build an implementation of a "lightweight address family translation router" -- the internet-facing component of an IPv4-as-a-service architecture, built on an IPv6-only network. Our implementation flies.

It sounds a bit specialized, and it is, but this talk could go two ways.

One version of this talk could be for software people that aren't necessarily networking specialists, describing the domain and how with Snabb Switch, LuaJIT, compilers, and commodity x86 components, we are able to get results that compete well with offerings from traditional networking vendors. Building specialized routers and other network functions in software is an incredible opportunity for compiler folks.

The other version would be more for networking people. We'd explain the domain less and focus more on architecture and results, and look more ahead to challenges of 100Gb/s ports.

let me know!

I'll probably submit some of these to a few conferences, but if you run an event and would like me to come over and give one of these talks, I would be flattered :) Maybe that set of people is empty, but hey, it's worth a shot. Probably contact via the twitters has the most likelihood of response.

There are some things you need to make sure are covered before reaching out, of course. It probably doesn't need repeating in 2016, but make sure that you have a proper code of conduct, and that that you'll be able to put in the time to train your event staff to create that safe space that your attendees need. Getting a diverse speaker line-up is important to me too; conferences full of white dudes like me are not only boring but also serve to perpetuate an industry full of white dudes. If you're reaching out, reach out to women and people of color too, and let me know that you're working on it. This old JSConf EU post has some ideas too. Godspeed, and happy planning!

by Andy Wingo at January 21, 2016 11:59 AM

January 19, 2016

Andy Wingo

unboxing in guile

Happy snowy Tuesday, hackfolk! I know I said in my last dispatch that I'd write about Lua soon, but that article is still cooking. In the meantime, a note on Guile and unboxing.

on boxen, on blitzen

Boxing is a way for a programming language implementation to represent a value.

A boxed value is the combination of a value along with a tag providing some information about the value. Both the value and the tag take up some space. The value can be thought to be inside a "box" labelled with the tag and containing the value.

A value's tag can indicate whether the value's bits should be interpreted as an unsigned integer, as a double-precision floating-point number, as an array of words of a particular data type, and so on. A tag can also be used for other purposes, for example to indicate whether a value is a pointer or an "immediate" bit string.

Whether values in a programming language are boxed or not is an implementation consideration. It can be the case that in languages with powerful type systems that a compiler can know what the representation of all values are in all parts of all programs, and so boxing is never needed. However, it's much easier to write a garbage collector if values have a somewhat uniform representation, with tag bits to tell the GC how to trace any pointers that might be contained in the object. Tags can also carry run-time type information needed by a dynamically typed language like Scheme or JavaScript, to allow for polymorphic predicates like number? or pair?.

Boxing all of the values in a program can incur significant overhead in space and in time. For example, one way to implement boxes is to allocate space for the tag and the value on the garbage-collected heap. A boxed value would then be referred to via a pointer to the corresponding heap allocation. However, most memory allocation systems align their heap allocations on word-sized boundaries, for example on 8-byte boundaries. That means that the low 3 bits of a heap allocation will always be zero. If you make a bit string whose low 3 bits are not zero, it cannot possibly be a valid pointer. In that case you can represent some types within the set of bit strings that cannot be valid pointers. These values are called "immediates", as opposed to "heap objects". In Guile, we have immediate representations for characters, booleans, some special values, and a subset of the integers. Alternately, a programming language implementation can represent values as double-precision floating point numbers, and shove pointers into the space of the NaN values. And for heap allocations, some systems can associate one tag with a whole page of values, minimizing per-value boxing overhead.

The goal of these optimizations is to avoid heap allocation for some kinds of boxes. While most language implementations have good garbage collectors that make allocation fairly cheap, the best way to minimize allocation cost is to refrain from it entirely.

In Guile's case, we currently use a combination of low-bit tagging for immediates, including fixnums (a subset of the integers), and tagged boxes on the heap for everything else, including floating-point numbers.

Boxing floating-point numbers obviously incurs huge overhead on floating-point math. You have to consider that each intermediate value produced by a computation will result in the allocation of another 8 bytes for the value and 4 or 8 bytes for the tag. Given that Guile aligns allocations on 8-byte boundaries, the result is a 16-byte allocation in either case. Consider this loop to sum the doubles in a bytevector:

(use-modules (rnrs bytevectors))
(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (+ sum (bytevector-ieee-double-native-ref v i)))

Each trip through the loop is going to allocate not one but two heap floats: one to box the result of bytevector-ieee-double-native-ref (whew, what a mouthful), and one for the sum. If we have a bytevector of 10 million elements, that will be 320 megabytes of allocation. Guile can allocate short-lived 16-byte allocations at about 900 MB/s on my machine, so summing this vector is going to take at least 350ms, just for the allocation. Indeed, without unboxing I measure this loop at 580ms for a 10 million element vector:

> (define v (make-f64vector #e10e6 1.0))
> ,time (f64-sum v)
$1 = 1.0e7
;; 0.580114s real time, 0.764572s run time.  0.268305s spent in GC.

The run time is higher than the real time due to parallel marking. I think in this case, allocation has even higher overhead because it happens outside the bytecode interpreter. The add opcode has a fast path for small integers (fixnums), and if it needs to work on flonums it calls out to a C helper. That C helper doesn't have a pointer to the thread-local freelist so it has to go through a more expensive allocation path.

Anyway, in the time that Guile takes to fetch one f64 value from the vector and add it to the sum, the CPU ticked through some 150 cycles, so surely we can do better than this.

unboxen, unblitzen

Let's take a look again at the loop to see where the floating-point allocations are produced.

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (+ sum (bytevector-ieee-double-native-ref v i)))

It turns out there's no reason for the loquatiously-named bytevector-ieee-double-native-ref to return a boxed number. It's a monomorphic function that is well-known to the Guile compiler and virtual machine, and it even has its own opcode. In Guile 2.0 and until just a couple months ago in Guile 2.2, this function did box its return value, but that was because the virtual machine had no facility for unboxed values of any kind.

To allow bytevector-ieee-double-native-ref to return an unboxed double value, the first item of business was then to support unboxed values in Guile's VM. Looking forward to unboxed doubles, we made a change such that all on-stack values are 64 bits wide, even on 32-bit systems. (For simplicity, all locals in Guile take up the same amount of space. For the same reason, fetching 32-bit floats also unbox to 64-bit doubles.)

We also made a change to Guile's "stack maps", which are data structures that tell the garbage collector which locals are live in a stack frame. There is a stack map recorded at every call in a procedure, to be used when an activation is pending on the stack. Stack maps are stored in a side table in a separate section of the compiled ELF library. Live values are traced by the garbage collector, and dead values are replaced by a special "undefined" singleton. The change we made was to be able to indicate that live values were boxed or not, and if they were unboxed, what type they were (e.g. unboxed double). Knowing the type of locals helps the debugger to print values correctly. Currently, all unboxed values are immediates, so the GC doesn't need to trace them, but it's conceivable that we could have unboxed pointers at some point. Anyway, instead of just storing one bit (live or dead) per local in the stack map, we store two, and reserve one of the bit patterns to indicate that
the local is actually an f64 value.

But the changes weren't done then: since we had never had unboxed locals, there were quite a few debugging-related parts of the VM that assumed that we could access the first slot in an activation to see if it was a procedure. This dated from a time in Guile where slot 0 would always be the procedure being called, but the check is bogus ever since Guile 2.2 allowed local value slots corresponding to the closure or procedure arguments to be re-used for other values, if the closure or argument was dead. Another nail in the coffin of procedure-in-slot-0 was driven by closure optimizations, in which closures whose callees are all visible could specialize the representation of their closure in non-standard ways. It took a while, but unboxing f64 values flushed out these bogus uses of slot 0.

The next step was to add boxing and unboxing operations to the VM (f64->scm and scm->f64, respectively). Then we changed bytevector-ieee-double-native-ref to return an unboxed value and then immediately box it via f64->scm. Similarly for bytevector-ieee-double-native-set!, we unbox the value via scm->f64, potentially throwing a type error. Unfortunately our run-time type mismatch errors got worse; although the source location remains the same, scm->f64 doesn't include the reason for the unboxing. Oh well.

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (let ((f64 (bytevector-ieee-double-native-ref v i))
                  (boxed (f64->scm f64)))
              (+ sum boxed))

When we lower Tree-IL to CPS, we insert the needed f64->scm and scm->f64 boxing and unboxing operations around bytevector accesses. Cool. At this point we have a system with unboxed f64 values, but which is slower than the original version because every f64 bytevector access involves two instructions instead of one, although the instructions themselves together did the same amount of work. However, telling the optimizer about these instructions could potentially eliminate some of them. Let's keep going and see where we get.

Let's attack the other source of boxes, the accumulation of the sum. We added some specialized instuctions to the virtual machine to support arithmetic over unboxed values. Doing this is potentially a huge win, because not only do you avoid allocating a box for the result, you also avoid the type checks on the incoming values. So we add f64+, f64-, and so on.

Unboxing the + to f64+ is a tricky transformation, and relies on type analysis. Our assumption is that if type analysis indicates that we are in fact able to replace a generic arithmetic instruction with a combination of operand unboxing, unboxed arithmetic, and a boxing operation, then we should do it. Separating out the boxes and the monomorphic arithmetic opens the possibility to remove the resulting box, and possibly remove the unboxing of operands too. In this case, we run an optimization pass and end up with something like:

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (let ((f64 (bytevector-ieee-double-native-ref v i))
                  (boxed (f64->scm f64)))
               (f64+ (scm->f64 sum)
                     (scm->f64 boxed)))))

Scalar replacement via fabricated expressions will take the definition of boxed as (f64->scm f64) and fabricate a definition of f64 as (scm->f64 boxed), which propagates down to the f64+ so we get:

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (let ((f64 (bytevector-ieee-double-native-ref v i))
                  (boxed (f64->scm f64)))
               (f64+ (scm->f64 sum)

Dead code elimination can now kill boxed, so we end up with:

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (let ((f64 (bytevector-ieee-double-native-ref v i)))
               (f64+ (scm->f64 sum)

Voilà, we removed one allocation. Yay!

As we can see from the residual code, we're still left with one f64->scm boxing operation. That expression is one of the definitions of sum, one of the loop variables. The other definition is 0.0, the starting value. So, after specializing arithmetic operations, we go through the set of multiply-defined variables ("phi" variables) and see what we can do to unbox them.

A phi variable can be unboxed if all of its definitions are unboxable. It's not always clear that you should unbox, though. For example, maybe you know via looking at the definitions for the value that it can be unboxed as an f64, but all of its uses are boxed. In that case it could be that you throw away the box when unboxing each definition, only to have to re-create them anew when using the variable. You end up allocating twice as much instead of not at all. It's a tricky situation. Currently we assume a variable with multiple definitions should only be unboxed if it has an unboxed use. The initial set of unboxed uses is the set of operands to scm->f64. We iterate this set to a fixed point: unboxing one phi variable could cause others to be unbox as well. As a heuristic, we only require one unboxed use; it could be there are other uses that are boxed, and we could indeed hit that pessimal double-allocation case. Oh well!

In this case, the intermediate result looks something like:

(define (f64-sum v)
  (let lp ((i 0) (sum (scm->f64 0.0)))
    (let ((sum-box (f64->scm sum)))
      (if (< i (bytevector-length v))
          (lp (+ i 8)
              (let ((f64 (bytevector-ieee-double-native-ref v i)))
                  (f64+ (scm->f64 sum-box)

After the scalar replacement and dead code elimination passes, we end up with something more like:

(define (f64-sum v)
  (let lp ((i 0) (sum (scm->f64 0.0)))
    (let ((sum-box (f64->scm sum)))
      (if (< i (bytevector-length v))
          (lp (+ i 8)
              (f64+ sum
                    (bytevector-ieee-double-native-ref v i)))

Well this is looking pretty good. There's still a box though. Really we should sink this to the exit, but as it happens there's something else that accidentally works in our favor: loop peeling. By peeling the first loop iteration, we create a control-flow join at the loop exit that defines a phi variable. That phi variable is subject to the same optimization, sinking the box down to the join itself. So in reality the result looks like:

(define (f64-sum v)
  (let ((i 0)
        (sum (scm->f64 0.0))
        (len (bytevector-length v)))
     (if (< i len)
         (let ((i (+ i 8))
               (sum (f64+ sum
                          (bytevector-ieee-double-native-ref v i))))
           (let lp ((i i) (sum sum))
             (if (< i len)
                 (lp (+ i 8)
                     (f64+ sum (bytevector-ieee-double-native-ref v i)))

As you can see, the peeling lifted the length computation up to the top too, which is a bonus. We should probably still implement allocation sinking, especially for loops for which peeling isn't an option, but the current status often works well. Running f64-sum on a 10-million-element packed double array goes down from 580ms to 99ms, or to some 25 or 30 CPU cycles per element, and of course no time in GC. Considering that this loop still has the overhead of bytecode interpretation and cache misses, I think we're doing A O K.


It used to be that using packed bytevectors of doubles was an easy way to make your program slower using types (thanks to Sam Tobin-Hochstadt for that quip). The reason is that although a packed vector of doubles uses less memory, every access to it has to allocate a new boxed number. Compare to "normal" vectors where sure, it uses more memory, but fetching an element fetches an already-boxed value. Now with the unboxing optimization, this situation is properly corrected... in most cases.

The major caveat is that for unboxing to work completely, each use of a potentially-unboxable value has to have an alternate implementation that can work on unboxed values. In our example above, the only use was f64+ (which internally is really called fadd), so we win. Writing an f64 to a bytevector can also be unboxed. Unfortunately, bytevectors and simple arithmetic are currently all of the unboxable operations. We'll implement more over time, but it's a current limitation.

Another point is that we are leaning heavily on the optimizer to remove the boxes when it can. If there's a bug or a limitation in the optimizer, it could be the box stays around needlessly. It happens, hopefully less and less but it does happen. To be sure you get the advantages, you need to time the code and see if it's spending significant time in GC. If it is, then you need to disassemble your code to see where that's happening. It's not a very nice thing, currently. The Scheme-like representations I gave above were written by hand; the CPS intermediate language is much more verbose than that.

Another limitation is that function arguments and return values are always boxed. Of course, the compiler can inline and contify a lot of functions, but that means that to use abstraction, you need to build up a mental model of what the inliner is going to do.

Finally, it's not always obvious to the compiler what the type of a value is, and that necessarily limits unboxing. For example, if we had started off the loop by defining sum to be 0 instead of 0.0, the result of the loop as a whole could be either an exact integer or an inexact real. Of course, loop peeling mitigates this to an extent, unboxing sum within the loop after the first iteration, but it so happens that peeling also prevents the phi join at the loop exit from being unboxed, because the result from the peeled iteration is 0 and not 0.0. In the end, we are unable to remove the equivalent of sum-box, and so we still allocate once per iteration. Here is a clear case where we would indeed need allocation sinking.

Also, consider that in other contexts the type of (+ x 1.0) might actually be complex instead of real, which means that depending on the type of x it might not be valid to unbox this addition. Proving that a number is not complex can be non-obvious. That's the second way that fetching a value from a packed vector of doubles or floats is useful: it's one of the rare times that you know that a number is real-valued.

on integer, on fixnum

That's all there is to say about floats. However, when doing some benchmarks of the floating-point unboxing, one user couldn't reproduce some of the results: they were seeing huge run-times for on a microbenchmark that repeatedly summed the elements of a vector. It turned out that the reason was that they were on a 32-bit machine, and one of the loop variables used in the test was exceeding the fixnum range. Recall that fixnums are the subset of integers that fit in an immediate value, along with their tag. Guile's fixnum tag is 2 bits, and fixnums have a sign bit, so the most positive fixnum on a 32-bit machine is 229—1, or around 500 million. It sure is a shame not to be able to count up to #xFFFFFFFF without throwing an allocation party!

So, we set about seeing if we could unbox integers as well in Guile. Guile's compiler has a lot more visibility as to when something is an integer, compared to real numbers. Anything used as an index into a vector or similar data structure must be an exact integer, and any query as to the length of a vector or a string or whatever is also an integer.

Note that knowing that a value is an exact integer is insufficient to unbox it: you have to also know that it is within the range of your unboxed integer data type. Here we take advantage of the fact that in Guile, type analysis also infers ranges. So, cool. Because the kinds of integers that can be used as indexes and lengths are all non-negative, our first unboxed integer type is u64, the unsigned 64-bit integers.

If Guile did native compilation, it would always be a win to unbox any integer operation, if only because you would avoid polymorphism or any other potential side exit. For bignums that are within the unboxable range, the considerations are similar to the floating-point case: allocation costs dominate, so unboxing is almost always a win, provided that you avoid double-boxing. Eliminating one allocation can pay off a lot of instruction dispatch.

For fixnums, though, things are not so clear. Immediate tagging is such a cheap way of boxing that in an interpreter, the extra instructions you introduce could outweigh any speedup from having faster operations.

In the end, I didn't do science and I decided to just go ahead and unbox if I could. We are headed towards native compilation, this is a necessary step along that path, and what the hell, it seemed like a good idea at the time.

Because there are so many more integers in a typical program than floating-point numbers, we had to provide unboxed integer variants of quite a number of operations. Of course we could unconditionally require unboxed arguments to vector-ref, string-length and so on, but in addition to making u64 variants of arithmetic, we also support bit operations like logand and such. Unlike the current status with floating point numbers, we can do test-and-branch over unboxed u64 comparisons, and we can compare u64 values to boxed SCM values.

In JavaScript, making sure an integer is unboxed is easy: you just do val | 0. The bit operation | truncates the value to a uint32 32-bit two's-complement signed integer (thanks to Slava for the correction). In Guile though, we have arbitrary-precision bit operations, so although (logior val 0) would assert that val is an integer, it wouldn't necessarily mean that it's unboxable.

Instead, the Guile idiom for making sure you have an unboxed integer in a particular range should go like this:

(define-inlinable (check-uint-range x mask)
  (let ((x* (logand x mask)))
    (unless (= x x*)
      (error "out of range" x))

A helper like this is useful to assert that an argument to a function is of a particular type, especially given that arguments to functions are always boxed and treated as being of unknown type. The logand asserts that the value is an integer, and the comparison asserts that it is within range.

For example, if we want to implement a function that does modular 8-bit addition, it can go like:

(define-inlinable (check-uint8 x)
  (check-uint-range x #xff))
(define-inlinable (truncate-uint8 x)
  (logand x #xff))
(define (uint8+ x y)
  (truncate-uint8 (+ (check-uint8 x) (check-uint8 y))))

If we disassemble this function, we get something like:

Disassembly of #<procedure uint8+ (x y)> at #xa8d0f8:

   0    (assert-nargs-ee/locals 3 2)    ;; 5 slots (2 args)
   1    (scm->u64/truncate 4 3)
   2    (load-u64 1 0 255)
   5    (ulogand 4 4 1)
   6    (br-if-u64-=-scm 4 3 #f 17)     ;; -> L1
;; [elided code to throw an error if x is not in range]
  23    (scm->u64/truncate 3 2)
  24    (ulogand 3 3 1)
  25    (br-if-u64-=-scm 3 2 #f 18)     ;; -> L2
;; [elided code to throw an error if y is not in range]
  43    (uadd 4 4 3)
  44    (ulogand 4 4 1)
  45    (u64->scm 3 4)
  46    (return-values 2)               ;; 1 value

The scm->u64/truncate instructions unbox an integer, but truncating it to the u64 range. They are used when we know that any additional bits won't be used, as in this case where we immediately do a logand of the unboxed value. All in all it's not a bad code sequence; there are two possible side exits for each argument (not an integer signalled by the unboxing, and out of range signalled by the explicit check), and no other run-time dispatch. For now I think we can be pretty happy with the code.

That's about it for integer unboxing. We also support unboxed signed 64-bit integers, mostly for use as operands or return values from bytevector-s8-ref and similar unboxed accessors on bytevectors. There are fewer operations that have s64 variants, though, compared to u64 variants.


Up until now in Guile, it could be that you might have to avoid Scheme if you needed to do some kinds of numeric computation. Unboxing floating-point and integer numbers makes it feasible to do more computation in Scheme instead of having to rely in inflexible C interfaces. At the same time, as a Scheme hacker I feel much more free knowing that I can work on 64-bit integers without necessarily allocating bignums. I expect this optimization to have a significant impact on the way I program, and what I program. We'll see where this goes, though. Until next time, happy hacking :)

by Andy Wingo at January 19, 2016 11:57 AM