Planet Scheme

August 04, 2020

Programming Praxis

Loglog

There are 19,055 distinct words in the Bible:

$ cat bible.txt | tr -cs A-Za-z ‘
‘ | sort -u | wc -w
19055

It’s easy enough to count the number of distinct items in a set (its “cardinality”) when the set is small, but when the set is large, the intermediate storage required for the distinct items can be overwhelming.

Phillipe Flajolet and various co-authors wrote a series of papers in which they developed methods of estimating the cardinality of a set with only a small amount of auxiliary storage, using randomization; Flajolet’s algorithms can be seen as an improvement on Robert Morris’ counting algorithm that we studied in a previous exercise. We will study Flajolet’s loglog algorithm in today’s exercise and perhaps have a look at his other algorithms in future exercises.

The basic idea is to apply a hash function to each element of the set. The first bit of the hash value will be zero about half the time, the first two bits of the hash value will be zero about a quarter of the time, the first three bits of the hash value will be zero about an eighth of the time, and so on; by looking at the maximum number of leading zero-bits, we can estimate the cardinality of the set. Flajolet extends this algorithm by splitting the counts among 2k buckets and averaging the estimated cardinalities; the bucket is selected randomly by looking at the last k bits of the hash value.

Your task is to implement Flajolet’s loglog algorithm. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at August 04, 2020 09:00 AM

July 23, 2020

GNU Guix

Improve Internationalization Support for the Guix Data Service

The first half of my Outreachy internship is already over and I am really excited to share my experience. Over the past weeks I’ve had the opportunity to work on the Guix Data Service, watch myself change, and accomplish way more than I thought I would.

The Guix Data Service processes, stores and provides data about Guix over time. It provides a complementary interface to Guix itself by having a web interface and API to browse and access the data.

The work I have done so far revolves around storing translated lint checker descriptions as well as package synopsis and descriptions in the Guix Data Service PostgreSQL database and making them available through the Guix Data Service web interface.

Initially the Guix Data Service database had translated versions of lint warning messages available, but they were not accessible through the web interface, so I made that possible during the contribution period.

Working on making lint warning messages available on the web interface made it easier for me to understand how translations for lint checker descriptions and package synopsis and descriptions would be stored in the database and later on be made available through the Guix Data Service web interface. At this point, the Guix Data Service supports package synopsis and descriptions as well as lint checker descriptions in various locales.

Guix Data Service page for the audacity package, in the Spanishlocale

Hopefully these changes will provide the Guix Data Service users with a more feasible way to interact with Guix data.

I have to note that this is my first internship and I was initially reluctant to believe that I would be able to tackle or successfully accomplish the tasks I was assigned, but with my mentor’s help and guidance I managed to. So far it has been a rewarding experience because it has helped me make progress in so many aspects, whilst contributing to a project that will potentially increase inclusion.

While working on this project, I’ve significantly improved my Guile, SQL, and Git skills and I am now more aware of how software localization is achieved. In addition to getting more technically skilled, this internship has taught me how to manage time and emotions when dealing with more than one activity at a time.

Now that a good share of what was initially planned to be done is accomplished, my mentor suggested working on something using the Guix Data Service data and I will be engaged in that during the remaining half.

These first 7 weeks of my internship have gone by really fast, but I have enjoyed everything and I am so eager to experience what's to come.

by Danjela Lura at July 23, 2020 12:00 PM

July 21, 2020

Programming Praxis

Lost Boarding Pass

We have today a fun little problem from probability:

On a sold-out flight, 100 people line up to board the plane. The first passenger in the line has lost his boarding pass but was allowed in, regardless. He takes a random seat. Each subsequent passenger takes his or her assigned seat if available, or a random unoccupied seat, otherwise. What is the probability that the last passenger to board the plane finds his seat unoccupied?

Your task is to determine the requested probability, either by reasoning mathematically or by writing a program to demonstrate the probability. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at July 21, 2020 09:00 AM

July 17, 2020

GNU Guix

Running a Ganeti cluster on Guix

The latest addition to Guix's ever-growing list of services is a little-known virtualization toolkit called Ganeti. Ganeti is designed to keep virtual machines running on a cluster of servers even in the event of hardware failures, and to make maintenance and recovery tasks easy.

It is comparable to tools such as Proxmox or oVirt, but has some distinctive features. One is that there is no GUI: third party ones exist, but are not currently packaged in Guix, so you are left with a rich command-line client and a fully featured remote API.

Another interesting feature is that installing Ganeti on its own leaves you no way to actually deploy any virtual machines. That probably sounds crazy, but stems from the fact that Ganeti is designed to be API-driven and automated, thus it comes with a OS API and users need to install one or more OS providers in addition to Ganeti. OS providers offer a declarative way to deploy virtual machine variants and should feel natural to Guix users. At the time of writing, the providers available in Guix are debootstrap for provisioning Debian- and Ubuntu-based VMs, and of course a Guix provider.

Finally Ganeti comes with a sophisticated instance allocation framework that efficiently packs virtual machines across a cluster while maintaining N+1 redundancy in case of a failover scenario. It can also make informed scheduling decisions based on various cluster tags, such as ensuring primary and secondary nodes are on different power distribution lines.

(Note: if you are looking for a way to run just a few virtual machines on your local computer, you are probably better off using libvirt or even a Childhurd as Ganeti is fairly heavyweight and requires a complicated networking setup.)

Preparing the configuration

With introductions out of the way, let's see how we can deploy a Ganeti cluster using Guix. For this tutorial we will create a two-node cluster and connect instances to the local network using an Open vSwitch bridge with no VLANs. We assume that each node has a single network interface named eth0 connected to the same network, and that a dedicated partition /dev/sdz3 is available for virtual machine storage. It is possible to store VMs on a number of other storage backends, but a dedicated drive (or rather LVM volume group) is necessary to use the DRBD integration to replicate VM disks.

We'll start off by defining a few helper services to create the Open vSwitch bridge and ensure the physical network interface is in the "up" state. Since Open vSwich stores the configuration in a database, you might as well run the equivalent ovs-vsctl commands on the host once and be done with it, but we do it through the configuration system to ensure we don't forget it in the future when adding or reinstalling nodes.

(use-modules (gnu)
             (gnu packages linux)
             (gnu packages networking)
             (gnu services shepherd))

(define (start-interface if)
  #~(let ((ip #$(file-append iproute "/sbin/ip")))
      (invoke/quiet ip "link" "set" #$if "up")))

(define (stop-interface if)
  #~(let ((ip #$(file-append iproute "/sbin/ip")))
      (invoke/quiet ip "link" "set" #$if "down")))

;; This service is necessary to ensure eth0 is in the "up" state on boot
;; since it is otherwise unmanaged from Guix PoV.
(define (ifup-service if)
  (let ((name (string-append "ifup-" if)))
    (simple-service name shepherd-root-service-type
                    (list (shepherd-service
                           (provision (list (string->symbol name)))
                           (start #~(lambda ()
                                      #$(start-interface if)))
                           (stop #~(lambda (_)
                                     #$(stop-interface if)))
                           (respawn? #f))))))

;; Note: Remove vlan_mode to use tagged VLANs.
(define (create-openvswitch-bridge bridge uplink)
  #~(let ((ovs-vsctl (lambda (cmd)
                       (apply invoke/quiet
                              #$(file-append openvswitch "/bin/ovs-vsctl")
                              (string-tokenize cmd)))))
      (and (ovs-vsctl (string-append "--may-exist add-br " #$bridge))
           (ovs-vsctl (string-append "--may-exist add-port " #$bridge " "
                                     #$uplink
                                     " vlan_mode=native-untagged")))))

(define (create-openvswitch-internal-port bridge port)
  #~(invoke/quiet #$(file-append openvswitch "/bin/ovs-vsctl")
                  "--may-exist" "add-port" #$bridge #$port
                  "vlan_mode=native-untagged"
                  "--" "set" "Interface" #$port "type=internal"))

(define %openvswitch-configuration-service
  (simple-service 'openvswitch-configuration shepherd-root-service-type
                  (list (shepherd-service
                         (provision '(openvswitch-configuration))
                         (requirement '(vswitchd))
                         (start #~(lambda ()
                                    #$(create-openvswitch-bridge
                                       "br0" "eth0")
                                    #$(create-openvswitch-internal-port
                                       "br0" "gnt0")))
                         (respawn? #f)))))

This defines a openvswitch-configuration service object that creates a logical switch br0, connects eth0 as the "uplink", and creates a logical port gnt0 that we will use later as the main network interface for this system. We also create an ifup service that can bring network interfaces up and down. By themselves these variables do nothing, we also have to add them to our operating-system configuration below.

Such a configuration might be suitable for a small home network. In a datacenter deployment you would likely use tagged VLANs, and maybe a traditional Linux bridge instead of Open vSwitch. You can also forego bridging altogether with a routed networking setup, or do any combination of the three.

With this in place, we can start creating the operating-system configuration that we will use for the Ganeti servers:

;; [continued from the above configuration snippet]

(use-service-modules base ganeti linux networking ssh)

(operating-system
  (host-name "node1")
  [...]
  ;; Ganeti requires that each node and the cluster name resolves to an
  ;; IP address.  The easiest way to achieve this is by adding everything
  ;; to the hosts file.
  (hosts-file (plain-file "hosts" "
127.0.0.1       localhost
::1             localhost

192.168.1.200   ganeti.lan
192.168.1.201   node1
192.168.1.202   node2
"))
  (kernel-arguments
   (append %default-kernel-arguments
           '(;; Disable DRBDs usermode helper, as Ganeti is the only entity
             ;; that should manage DRBD.
             "drbd.usermode_helper=/run/current-system/profile/bin/true")))

  (packages (append (map specification->package
                         '("qemu" "drbd-utils" "lvm2"
                           "ganeti-instance-guix"
                           "ganeti-instance-debootstrap"))
                    %base-packages))

  (services (cons* (service ganeti-service-type
                            (ganeti-configuration
                             (file-storage-paths '("/srv/ganeti/file-storage"))
                             (os
                              (list (guix-os %default-guix-variants)
                                    (debootstrap-os
                                     (list (debootstrap-variant
                                            "buster"
                                            (debootstrap-configuration
                                             (suite "buster")))
                                           (debootstrap-variant
                                            "testing+contrib+paravirtualized"
                                            (debootstrap-configuration
                                             (suite "testing")
                                             (hooks
                                              (local-file
                                               "paravirt-hooks"
                                               #:recursive? #t))
                                             (extra-pkgs
                                              (delete "linux-image-amd64"
                                                      %default-debootstrap-extra-pkgs))
                                             (components '("main" "contrib"))))))))))

                   ;; Ensure the DRBD kernel module is loaded.
                   (service kernel-module-loader-service-type
                            '("drbd"))

                   ;; Create a static IP on the "gnt0" Open vSwitch interface.
                   (service openvswitch-service-type)
                   %openvswitch-configuration-service
                   (ifup-service "eth0")
                   (static-networking-service "gnt0" "192.168.1.201"
                                              #:netmask "255.255.255.0"
                                              #:gateway "192.168.1.1"
                                              #:requirement '(openvswitch-configuration)
                                              #:name-servers '("192.168.1.1"))

                   (service openssh-service-type
                            (openssh-configuration
                             (permit-root-login 'without-password)))
                   %base-services)))

Here we declare two OS "variants" for the debootstrap OS provider. Debootstrap variants rely on a set of scripts (known as "hooks") in the installation process to do things like configure networking, install bootloader, create users, etc. In the example above, the "buster" variant uses the default hooks provided by Guix which configures network and GRUB, whereas the "testing+contrib+paravirtualized" variant use a local directory next to the configuration file named "paravirt-hooks" (it is copied into the final system closure).

We also declare a default guix-os variant provided by Guix's Ganeti service.

Ganeti veterans may be surprised that each OS variant has its own hooks. The Ganeti deployments I've seen use a single set of hooks for all variants, sometimes with additional logic inside the script based on the variant. Guix offers a powerful abstraction that makes it trivial to create per-variant hooks, obsoleting the need for a big /etc/ganeti/instance-debootstrap/hooks directory. Of course you can still create it if you wish and set the hooks property of the variants to #f.

Not all Ganeti options are exposed in the configuration system yet. If you find it limiting, you can add custom files using extra-special-file, or ideally extend the <ganeti-configuration> data type to suite your needs. You can also use gnt-cluster copyfile and gnt-cluster command to distribute files or run executables, but undeclared changes in /etc may be lost on the next reboot or reconfigure.

Initializing a cluster

At this stage, you should run guix system reconfigure with the new configuration on all nodes that will participate in the cluster. If you do this over SSH or with guix deploy, beware that eth0 will lose network connectivity once it is "plugged in to" the virtual switch, and you need to add any IP configuration to gnt0.

The Guix configuration system does not currently support declaring LVM volume groups, so we will create these manually on each node. We could write our own declarative configuration like the %openvswitch-configuration-service, but for brevity and safety reasons we'll do it "by hand":

pvcreate /dev/sdz3
vgcreate ganetivg /dev/sdz3

On the node that will act as the "master node", run the init command:

Warning: this will create new SSH keypairs, both host keys and for the root user! You can prevent that by adding --no-ssh-init, but then you will need to distribute /var/lib/ganeti/known_hosts to all hosts, and authorize the Ganeti key for the root user in openssh-configuration. Here we let Ganeti manage the keys for simplicity. As a bonus, we can automatically rotate the cluster keys in the future using gnt-cluster renew-crypto --new-ssh-keys.

gnt-cluster init \
    --master-netdev=gnt0 \
    --vg-name=ganetivg \
    --enabled-disk-templates=file,plain,drbd \
    --drbd-usermode-helper=/run/current-system/profile/bin/true \
    --enabled-hypervisors=kvm \
    --hypervisor-parameters=kvm:kvm_flag=enabled \
    --nic-parameters=mode=openvswitch,link=br0 \
    --no-etc-hosts \
    ganeti.lan

--no-etc-hosts prevents Ganeti from automatically updating the /etc/hosts file when nodes are added or removed, which makes little sense on Guix because it is recreated every reboot/reconfigure.

See the gnt-cluster manual for information on the available options. Most can be changed at runtime with gnt-cluster modify.

If all goes well, the command returns no output and you should have the ganeti.lan IP address visible on gnt0. You can run gnt-cluster verify to check that the cluster is in good shape. Most likely it complains about something:

root@node1 ~# gnt-cluster verify
Submitted jobs 3, 4
Waiting for job 3 ...
Thu Jul 16 18:26:34 2020 * Verifying cluster config
Thu Jul 16 18:26:34 2020 * Verifying cluster certificate files
Thu Jul 16 18:26:34 2020 * Verifying hypervisor parameters
Thu Jul 16 18:26:34 2020 * Verifying all nodes belong to an existing group
Waiting for job 4 ...
Thu Jul 16 18:26:34 2020 * Verifying group 'default'
Thu Jul 16 18:26:34 2020 * Gathering data (1 nodes)
Thu Jul 16 18:26:34 2020 * Gathering information about nodes (1 nodes)
Thu Jul 16 18:26:35 2020 * Gathering disk information (1 nodes)
Thu Jul 16 18:26:35 2020 * Verifying configuration file consistency
Thu Jul 16 18:26:35 2020 * Verifying node status
Thu Jul 16 18:26:35 2020   - ERROR: node node1: hypervisor kvm parameter verify failure (source cluster): Parameter 'kernel_path' fails validation: not found or not a file (current value: '/boot/vmlinuz-3-kvmU')
Thu Jul 16 18:26:35 2020 * Verifying instance status
Thu Jul 16 18:26:35 2020 * Verifying orphan volumes
Thu Jul 16 18:26:35 2020 * Verifying N+1 Memory redundancy
Thu Jul 16 18:26:35 2020 * Other Notes
Thu Jul 16 18:26:35 2020 * Hooks Results

When using the KVM hypervisor, Ganeti expects to find a dedicated kernel image for virtual machines in /boot. For this tutorial we only use fully virtualized instances (meaning each VM runs its own kernel), so we can set kernel_path to an empty string to make the warning disappear:

gnt-cluster modify -H kvm:kernel_path=

Now let's add our other machine to the cluster:

gnt-node add node2

Ganeti will log into the node, copy the cluster configuration and start the relevant Shepherd services. You may need to authorize node1's SSH key first. Run gnt-cluster verify again to check that everything is in order:

gnt-cluster verify

If you used --no-ssh-init earlier you will likely get SSH host key warnings here. In that case you should update /var/lib/ganeti/known_hosts with the new node information, and distribute it with gnt-cluster copyfile or by adding it to the OS configuration.

The above configuration will make three operating systems available:

# gnt-os list
Name
debootstrap+buster
debootstrap+testing+contrib+paravirtualized
guix+default

Let's try them out. But first we'll make Ganeti aware of our network so it can choose a static IP for the virtual machines.

# gnt-network add --network=192.168.1.0/24 --gateway=192.168.1.1 lan
# gnt-network connect -N mode=openvswitch,link=br0 lan

Now we can add an instance:

root@node1 ~# gnt-instance add --no-name-check --no-ip-check \
    -o debootstrap+buster -t drbd --disk 0:size=5G \
    --net 0:network=lan,ip=pool bustervm1
Thu Jul 16 18:28:58 2020  - INFO: Selected nodes for instance bustervm1 via iallocator hail: node1, node2
Thu Jul 16 18:28:58 2020  - INFO: NIC/0 inherits netparams ['br0', 'openvswitch', '']
Thu Jul 16 18:28:58 2020  - INFO: Chose IP 192.168.1.2 from network lan
Thu Jul 16 18:28:58 2020 * creating instance disks...
Thu Jul 16 18:29:03 2020 adding instance bustervm1 to cluster config
Thu Jul 16 18:29:03 2020 adding disks to cluster config
Thu Jul 16 18:29:03 2020  - INFO: Waiting for instance bustervm1 to sync disks
Thu Jul 16 18:29:03 2020  - INFO: - device disk/0:  0.60% done, 5m 26s remaining (estimated)
[...]
Thu Jul 16 18:31:08 2020  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)
Thu Jul 16 18:31:08 2020  - INFO: Instance bustervm1's disks are in sync
Thu Jul 16 18:31:08 2020  - INFO: Waiting for instance bustervm1 to sync disks
Thu Jul 16 18:31:08 2020  - INFO: Instance bustervm1's disks are in sync
Thu Jul 16 18:31:08 2020 * running the instance OS create scripts...
Thu Jul 16 18:32:09 2020 * starting instance...

Ganeti will automatically select the optimal primary and secondary node for this VM based on available cluster resources. You can manually specify primary and secondary nodes with the -n and -s options.

By default Ganeti assumes that the new instance is already configured in DNS, so we need --no-name-check and --no-ip-check to bypass some sanity tests.

Try adding another instance, now using the Guix OS provider with the 'plain' (LVM) disk backend:

gnt-instance add --no-name-check --no-ip-check -o guix+default \
    -t plain --disk 0:size=5G -B memory=1G,vcpus=2 \
    --net 0:network=lan,ip=pool \
    guix1

The guix+default variant has a configuration that starts an SSH server and authorizes the hosts SSH key, and configures static networking based on information from Ganeti. To use other configuration files, you should declare variants with the config file as the configuration property. The Guix provider also supports "OS parameters" that lets you specify a specific Guix commit or branch:

gnt-instance add --no-name-check --no-ip-check \
    -o guix+gnome -O "commit=<commit>" \
    -H kvm:spice_bind=0.0.0.0,cpu_type=host \
    -t file --file-storage-dir=/srv/ganeti/file-storage \
    --disk 0:size=20G -B minmem=1G,maxmem=6G,vcpus=3 \
    --net 0:network=lan,ip=pool -n node1 \
    guix2

You can connect to a VM serial console using gnt-instance console <instance>. For this last VM we used a hypothetical 'guix+gnome' variant, and added a graphical SPICE console that you can connect to remotely using the spicy command.

If you are new to Ganeti, the next steps is to familiarize yourself with the gnt- family commands. Fun stuff to do include gnt-instance migrate to move VMs between hosts, gnt-node evacuate to migrate all VMs off a node, and gnt-cluster master-failover to move the master role to a different node.

If you wish to start over for any reason, you can use gnt-cluster destroy.

Final remarks

The declarative nature of Guix maps well to Ganetis OS API. OS variants can be composed and inherit from each other, something that is not easily achieved with traditional configuration management tools. The author had a lot of fun creating native data types in the Guix configuration system for Ganetis OS configuration, and it made me wonder whether other parts of Ganeti could be made declarative such as aspects of instance and cluster configuration. In any case I'm happy and excited to finally be able to use Guix as a Ganeti host OS.

Like most services in Guix, Ganeti comes with a system test that runs in a VM and ensures that things like initializing a cluster work. The continuous integration system runs this automatically whenever a dependency is updated, and provides comfort that both the package and service is in a good shape. Currently it has rudimentary service tests, but it can conceivably be extended to provision a real cluster inside Ganeti and try things like master-failover and live migration.

So far only the KVM hypervisor has been tested. If you use LXC or Xen with Ganeti, please reach out to guix-devel@gnu.org and share your experience.

About GNU Guix

GNU Guix is a transactional package manager and an advanced distribution of the GNU system that respects user freedom. Guix can be used on top of any system running the kernel Linux, or it can be used as a standalone operating system distribution for i686, x86_64, ARMv7, and AArch64 machines.

In addition to standard package management features, Guix supports transactional upgrades and roll-backs, unprivileged package management, per-user profiles, and garbage collection. When used as a standalone GNU/Linux distribution, Guix offers a declarative, stateless approach to operating system configuration management. Guix is highly customizable and hackable through Guile programming interfaces and extensions to the Scheme language.

by Marius Bakke at July 17, 2020 03:00 PM

July 14, 2020

Programming Praxis

Binary Concatenation

We have an interview question today:

The concatenation of the first four integers, written in binary, is 11011100; that is, 1 followed by 10 followed by 11 followed by 100. That concatenated number resolves to 220. A similar process can convert the concatenation of the first n binary numbers to a normal decimal number.

Your task is to compute the nth binary concatenation in the manner described above; report the result modulo 109+7, because the result grows so quickly. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at July 14, 2020 09:00 AM

July 07, 2020

Programming Praxis

Trailing Zero-Bits

Today’s exercise indulges in some bit-hackery:

Given a positive integer, count the number of trailing zero-bits in its binary representation. For instance, 1810 = 100102, so it has 1 trailing zero-bit, and 4810 = 1100002, so it has 4 trailing zero-bits.

Your task is to write a program that counts the number of trailing zero-bits in the binary representation of a positive integer. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at July 07, 2020 09:00 AM

July 03, 2020

Programming Praxis

Spelling Numbers

Your task is to write a program that lists all of the numbers from zero to one hundred, inclusive, in alphabetical order. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at July 03, 2020 09:00 AM

July 01, 2020

GNU Guix

Securing updates

Software deployment tools like Guix are in a key position when it comes to securing the “software supply chain”—taking source code fresh from repositories and providing users with ready-to-use binaries. We have been paying attention to several aspects of this problem in Guix: authentication of pre-built binaries, reproducible builds, bootstrapping, and security updates.

A couple of weeks ago, we addressed the elephant in the room: authentication of Guix code itself by guix pull, the tool that updates Guix and its package collection. This article looks at what we set out to address, how we achieved it, and how it compares to existing work in this area.

What updates should be protected against

The problem of securing distro updates is often viewed through the lens of binary distributions such as Debian, where the main asset to be protected are binaries themselves. The functional deployment model that Guix and Nix implement is very different: conceptually, Guix is a source distribution, like Gentoo if you will.

Pre-built binaries are of course available and very useful, but they’re optional; we call them substitutes because they’re just that: substitutes for local builds. When you do choose to accept substitutes, they must be signed by one of the keys you authorized (this has been the case since version 0.6 in 2014).

Guix consists of source code for the tools as well as all the package definitions—the distro. When users run guix pull, what happens behind the scene is equivalent to git clone or git pull. There are many ways this can go wrong. An attacker can trick the user into pulling code from an alternate repository that contains malicious code or definitions for backdoored packages. This is made more difficult by the fact that code is fetched over HTTPS from Savannah by default. If Savannah is compromised (as happened in 2010), an attacker can push code to the Guix repository, which everyone would pull. The change might even go unnoticed and remain in the repository forever. An attacker with access to Savannah can also reset the main branch to an earlier revision, leading users to install outdated software with known vulnerabilities—a downgrade attack. These are the kind of attacks we want to protect against.

Authenticating Git checkouts

If we take a step back, the problem we’re trying to solve is not specific to Guix and to software deployment tools: it’s about authenticating Git checkouts. By that, we mean that when guix pull obtains code from Git, it should be able to tell that all the commits it fetched were pushed by authorized developers of the project. We’re really looking at individual commits, not tags, because users can choose to pull arbitrary points in the commit history of Guix and third-party channels.

Checkout authentication requires cryptographically signed commits. By signing a commit, a Guix developer asserts that they are the one who made the commit; they may be its author, or they may be the person who applied somebody else’s changes after review. It also requires a notion of authorization: we don’t simply want commits to have a valid signature, we want them to be signed by an authorized key. The set of authorized keys changes over time as people join and leave the project.

To implement that, we came up with the following mechanism and rule:

  1. The repository contains a .guix-authorizations file that lists the OpenPGP key fingerprints of authorized committers.
  2. A commit is considered authentic if and only if it is signed by one of the keys listed in the .guix-authorizations file of each of its parents. This is the authorization invariant.

(Remember that Git commits form a directed acyclic graph (DAG) where each commit can have zero or more parents; merge commits have two parent commits, for instance. Do not miss Git for Computer Scientists for a pedagogical overview!)

Let’s take an example to illustrate. In the figure below, each box is a commit, and each arrow is a parent relationship:

Example commit graph.

This figure shows two lines of development: the orange line may be the main development branch, while the purple line may correspond to a feature branch that was eventually merged in commit F. F is a merge commit, so it has two parents: D and E.

Labels next to boxes show who’s in .guix-authorizations: for commit A, only Alice is an authorized committer, and for all the other commits, both Bob and Alice are authorized committers. For each commit, we see that the authorization invariant holds; for example:

  • commit B was made by Alice, who was the only authorized committer in its parent, commit A;
  • commit C was made by Bob, who was among the authorized committers as of commit B;
  • commit F was made by Alice, who was among the authorized committers of both parents, commits D and E.

The authorization invariant has the nice property that it’s simple to state, and it’s simple to check and enforce. This is what guix pull implements. If your current Guix, as returned by guix describe, is at commit A and you want to pull to commit F, guix pull traverses all these commits and checks the authorization invariant.

Once a commit has been authenticated, all the commits in its transitive closure are known to be already authenticated. guix pull keeps a local cache of the commits it has previously authenticated, which allows it to traverse only new commits. For instance, if you’re at commit F and later update to a descendant of F, authentication starts at F.

Since .guix-authorizations is a regular file under version control, granting or revoking commit authorization does not require special support. In the example above, commit B is an authorized commit by Alice that adds Bob’s key to .guix-authorizations. Revocation is similar: any authorized committer can remove entries from .guix-authorizations. Key rotation can be handled similarly: a committer can remove their former key and add their new key in a single commit, signed by the former key.

The authorization invariant satisfies our needs for Guix. It has one downside: it prevents pull-request-style workflows. Indeed, merging the branch of a contributor not listed in .guix-authorizations would break the authorization invariant. It’s a good tradeoff for Guix because our workflow relies on patches carved into stone tablets (patch tracker), but it’s not suitable for every project out there.

Bootstrapping

The attentive reader may have noticed that something’s missing from the explanation above: what do we do about commit A in the example above? In other words, which commit do we pick as the first one where we can start verifying the authorization invariant?

We solve this bootstrapping issue by defining channel introductions. Previously, one would identify a channel simply by its URL. Now, when introducing a channel to users, one needs to provide an additional piece of information: the first commit where the authorization invariant holds, and the fingerprint of the OpenPGP key used to sign that commit (it’s not strictly necessary but provides an additional check). Consider this commit graph:

Example commit graph with introduction.

On this figure, B is the introduction commit. Its ancestors, such as A are considered authentic. To authenticate, C, D, E, and F, we check the authorization invariant.

As always when it comes to establishing trust, distributing channel introductions is very sensitive. The introduction of the official guix channel is built into Guix. Users obtain it when they install Guix the first time; hopefully they verify the signature on the Guix tarball or ISO image, as noted in the installation instructions, which reduces chances of getting the “wrong” Guix, but it is still very much trust-on-first-use (TOFU).

For signed third-party channels, users have to provide the channel’s introduction in their channels.scm file, like so:

(channel
  (name 'my-channel)
  (url "https://example.org/my-channel.git")
  (introduction
   (make-channel-introduction
    "6f0d8cc0d88abb59c324b2990bfee2876016bb86"
    (openpgp-fingerprint
     "CABB A931 C0FF EEC6 900D  0CFB 090B 1199 3D9A EBB5"))))

The guix describe command now prints the introduction if there’s one. That way, one can share their channel configuration, including introductions, without having to be an expert.

Channel introductions also solve another problem: forks. Respecting the authorization invariant “forever” would effectively prevent “unauthorized” forks—forks made by someone who’s not in .guix-authorizations. Someone publishing a fork simply needs to emit a new introduction for their fork, pointing to a different starting commit.

Last, channel introductions give a point of reference: if an attacker manipulates branch heads on Savannah to have them point to unrelated commits (such as commits on an orphan branch that do not share any history with the “official” branches), authentication will necessarily fail as it stumbles upon the first unauthorized commit made by the attacker. In the figure above, the red branch with commits G and H cannot be authenticated because it starts from A, which lacks .guix-authorizations and thus fails the authorization invariant.

That’s all for authentication! I’m glad you read this far. At this point you can take a break or continue with the next section on how guix pull prevents downgrade attacks.

Downgrade attacks

An important threat for software deployment tools is downgrade or roll-back attacks. The attack consists in tricking users into installing older, known-vulnerable software packages, which in turn may offer new ways to break into their system. This is not strictly related to the authentication issue we’ve been discussing, except that it’s another important issue in this area that we took the opportunity to address.

Guix saves provenance info for itself: guix describe prints that information, essentially the Git commits of the channels used during git pull:

$ guix describe
Generation 149  Jun 17 2020 20:00:14    (current)
  guix 8b1f7c0
    repository URL: https://git.savannah.gnu.org/git/guix.git
    branch: master
    commit: 8b1f7c03d239ca703b56f2a6e5f228c79bc1857e

Thus, guix pull, once it has retrieved the latest commit of the selected branch, can verify that it is doing a fast-forward update in Git parlance—just like git pull does, but compared to the previously-deployed Guix. A fast-forward update is when the new commit is a descendant of the current commit. Going back to the figure above, going from commit A to commit F is a fast-forward update, but going from F to A or from D to E is not.

Not doing a fast-forward update would mean that the user is deploying an older version of the Guix currently used, or deploying an unrelated version from another branch. In both cases, the user is at risk of ending up installing older, vulnerable software.

By default guix pull now errors out on non-fast-forward updates, thereby protecting from roll-backs. Users who understand the risks can override that by passing --allow-downgrades.

Authentication and roll-back prevention allow users to safely refer to mirrors of the Git repository. If git.savannah.gnu.org is down, one can still update by fetching from a mirror, for instance with:

guix pull --url=https://github.com/guix-mirror/guix

If the repository at this URL is behind what the user already deployed, or if it’s not a genuine mirror, guix pull will abort. In other cases, it will proceed.

Unfortunately, there is no way to answer the general question “is X the latest commit of branch B ?”. Rollback detection prevents just that, rollbacks, but there’s no mechanism in place to tell whether a given mirror is stale. To mitigate that, channel authors can specify, in the repository, the channel’s primary URL. This piece of information lives in the .guix-channel file, in the repository, so it’s authenticated. guix pull uses it to print a warning when the user pulls from a mirror:

$ guix pull --url=https://github.com/guix-mirror/guix
Updating channel 'guix' from Git repository at 'https://github.com/guix-mirror/guix'...
Authenticating channel 'guix', commits 9edb3f6 to 3e51f9e (44 new commits)...
guix pull: warning: pulled channel 'guix' from a mirror of https://git.savannah.gnu.org/git/guix.git, which might be stale
Building from this channel:
  guix      https://github.com/guix-mirror/guix 3e51f9e
…

So far we talked about mechanics in a rather abstract way. That might satisfy the graph theorist or the Git geek in you, but if you’re up for a quick tour of the implementation, the next section is for you!

A long process

We’re kinda celebrating these days, but the initial bug report was opened… in 2016. One of the reasons was that we were hoping the general problem was solved already and we’d “just” have to adapt what others had done. As for the actual design: you would think it can be implemented in ten lines of shell script invoking gpgv and git. Perhaps that’s a possibility, but the resulting performance would be problematic—keep in mind that users may routinely have to authenticate hundreds of commits. So we took a long road, but the end result is worth it. Let’s recap.

Back in April 2016, committers started signing commits, with a server-side hook prohibiting unsigned commits. In July 2016, we had proof-of-concept libgit2 bindings with the primitives needed to verify signatures on commits, passing them to gpgv; later Guile-Git was born, providing good coverage of the libgit2 interface. Then there was a two-year hiatus during which no code was produced in that area.

Everything went faster starting from December 2019. Progress was incremental and may have been hard to follow, even for die-hard Guix hackers, so here are the major milestones:

Whether you’re a channel author or a user, the feature is now fully documented in the manual, and we’d love to get your feedback!

SHA-1

We can’t really discuss Git commit signing without mentioning SHA-1. The venerable crytographic hash function is approaching end of life, as evidenced by recent breakthroughs. Signing a Git commit boils down to signing a SHA-1 hash, because all objects in the Git store are identified by their SHA-1 hash.

Git now relies on a collision attack detection library to mitigate practical attacks. Furthermore, the Git project is planning a hash function transition to address the problem.

Some projects such as Bitcoin Core choose to not rely on SHA-1 at all. Instead, for the commits they sign, they include in the commit log the SHA512 hash of the tree, which the verification scripts check.

Computing a tree hash for each commit in Guix would probably be prohibitively costly. For now, for lack of a better solution, we rely on Git’s collision attack detection and look forward to a hash function transition.

As for SHA-1 in an OpenPGP context: our authentication code rejects SHA-1 OpenPGP signatures, as recommended.

Related work

A lot of work has gone into securing the software supply chain, often in the context of binary distros, sometimes in a more general context; more recent work also looks into Git authentication and related issues. This section attempts to summarize how Guix relates to similar work that we’re aware of in these two areas. More detailed discussions can be found in the issue tracker.

The Update Framework (TUF) is a reference for secure update systems, with a well-structured spec and a number of implementations. TUF is a great source of inspiration to think about this problem space. Many of its goals are shared by Guix. Not all the attacks it aims to protect against (Section 1.5.2 of the spec) are addressed by what’s presented in this post: indefinite freeze attacks, where updates never become available, are not addressed per se (though easily observable), and slow retrieval attacks aren’t addressed either. The notion of role is also something currently missing from the Guix authentication model, where any authorized committer can touch any files, though the model and .guix-authorizations format leave room for such an extension.

However, both in its goals and system descriptions, TUF is biased towards systems that distribute binaries as plain files with associated meta-data. That creates a fundamental impedance mismatch. As an example, attacks such as fast-forward attacks or mix-and-match attacks don’t apply in the context of Guix; likewise, the repository depicted in Section 3 of the spec has little in common with a Git repository.

Developers of OPAM, the OCaml package manager, adapted TUF for use with their Git-based package repository, later updated to write Conex, a separate tool to authenticate OPAM repositories. OPAM is interesting because like Guix it’s a source distro and its package repository is a Git repository containing “build recipe”. To date, it appears that opam update itself does not authenticate repositories though; it’s up to users or developer to run Conex.

Another very insightful piece of work is the 2016 paper On omitting commits and committing omissions. The paper focuses on the impact of malicious modifications to Git repository meta-data. An attacker with access to the repository can modify, for instance, branch references, to cause a rollback attack or a “teleport” attack, causing users to pull an older commit or an unrelated commit. As written above, guix pull would detect such attacks. However, guix pull would fail to detect cases where metadata modification does not yield a rollback or teleport, yet gives users a different view than the intended one—for instance, a user is directed to an authentic but different branch rather than the intended one. The “secure push” operation and the associated reference state log (RSL) the authors propose would be an improvement.

Wrap-up and outlook

Guix now has a mechanism that allows it to authenticate updates. If you’ve run guix pull recently, perhaps you’ve noticed additional output and a progress bar as new commits are being authenticated. Apart from that, the switch has been completely transparent. The authentication mechanism is built around the commit graph of Git; in fact, it’s a mechanism to authenticate Git checkouts and in that sense it is not tied to Guix and its application domain. It is available not only for the main guix channel, but also for third-party channels.

To bootstrap trust, we added the notion of channel introductions. These are now visible in the user interface, in particular in the output of guix describe and in the configuration file of guix pull and guix time-machine. While channel configuration remains a few lines of code that users typically paste, this extra bit of configuration might be intimidating. It certainly gives an incentive to provide a command-line interface to manage the user’s list of channels: guix channel add, etc.

The solution here is built around the assumption that Guix is fundamentally a source-based distribution, and is thus completely orthogonal to the public key infrastructure (PKI) Guix uses for the signature of substitutes. Yet, the substitute PKI could probably benefit from the fact that we now have a secure update mechanism for the Guix source code: since guix pull can securely retrieve a new substitute signing key, perhaps it could somehow handle substitute signing key revocation and delegation automatically? Related to that, channels could perhaps advertise a substitute URL and its signing key, possibly allowing users to register those when they first pull from the channel. All this requires more thought, but it looks like there are new opportunities here.

Until then, if you’re a user or a channel author, we’d love to hear from you! We’ve already gotten feedback that these new mechanisms broke someone’s workflow; hopefully it didn’t break yours, but either way your input is important in improving the system. If you’re into security and think this design is terrible or awesome, please do provide feedback.

It’s a long and article describing a long ride on a path we discovered as we went, and it felt like an important milestone to share!

Acknowledgments

Thanks to everyone who provided feedback, ideas, or carried out code review during this long process, notably (in no particular order): Christopher Lemmer Webber, Leo Famulari, David Thompson, Mike Gerwitz, Ricardo Wurmus, Werner Koch, Justus Winter, Vagrant Cascadian, Maxim Cournoyer, Simon Tournier, John Soo, and Jakub Kądziołka. Thanks also to janneke, Ricardo, Marius, and Simon for reviewing an earlier draft of this post.

About GNU Guix

GNU Guix is a transactional package manager and an advanced distribution of the GNU system that respects user freedom. Guix can be used on top of any system running the kernel Linux, or it can be used as a standalone operating system distribution for i686, x86_64, ARMv7, and AArch64 machines.

In addition to standard package management features, Guix supports transactional upgrades and roll-backs, unprivileged package management, per-user profiles, and garbage collection. When used as a standalone GNU/Linux distribution, Guix offers a declarative, stateless approach to operating system configuration management. Guix is highly customizable and hackable through Guile programming interfaces and extensions to the Scheme language.

by Ludovic Courtès at July 01, 2020 05:40 PM

June 30, 2020

Programming Praxis

Shuffle An Array

Today’s exercise comes to us from Leetcode via Reddit:

Given an array consisting of 2n elements in the form
[x1,x2,…,xn,y1,y2,…,yn], return the array in the form [x1,y1,x2,y2,…,xn,yn].

The Reddit poster claims to be new to Scheme and functional programming, and was thinking of a solution using length and list-ref, but couldn’t solve the problem.

Your task is to show the student how to solve the problem. When you are
finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at June 30, 2020 09:00 AM

June 20, 2020

Göran Weinholt

Akku Archive Improvements

Akku.scm is a language package manager for R6RS and R7RS Scheme. The software that powers the package index has been growing beyond the simple one-liner it was in the beginning and today I’ve finally pushed it to a public repository. I’ve also made preparations for hosting packages as tarballs directly in the archive.

Tarballs

The Akku archive has never hosted packages directly. The index points at git repositories and commit revisions. These are added to each project’s Akku.lock file and are used when akku install clones the repository.

This has two major drawbacks. Cloning a git repository can be really slow. The repositories are also hosted on sites like GitHub where users sometimes decide to force-push or remove the repositories completely. I feel this is likely to happen more often the more politics and business influences GitHub in the future.

I’ve prepared the Akku archive to host tarballs directly. These are made with git archive from the submitted git repository. Downloading these is much faster than cloning a repository, they are not at risk of being removed at a whim, and they are cached in a local shared cache. Other package managers as a rule host their own archives as well, so this is nothing unusual.

Provenance

It’s important to me that users of Akku can trust that they get original software that has not been tampered with. I review all code that goes into the archive to protect Akku against use in supply chain attacks.

Building tarballs changes the equation a little bit since you now need to trust that the tarballs have not been tampered with. Tarballs are verified when they are downloaded, but how do you know that they match the original software?

This can be seen as an issue of provenance, or providing proof of the history of a piece of software. Here is the chain for the new tarballs:

  • Akku packages are submitted through akku publish by a developer (or by the Snow mirror software) as a .akku file with a detached GPG signature. This signature can be independently verified by fetching the key from the keyservers.

    The signed .akku file contains a git commit id. Because it is signed by the person who submitted the package, we can use the signature to verify that it was not tampered with after it went into the archive.

    Copies of these files are hosted under /archive/packages.

  • The archive software creates a tarball from the original repository using the git archive command. It also creates a new .akku file which contains information about the original repository and commit id as a comment. The non-comment part of the file contains the URL and hash of the new tarball. Like other .akku files, it is signed. This provides a signature linking the original git commit to the new tarball’s hash.

    These files are available under /archive/pkg. The signature is made with the current Akku archive key, which is in turn signed by my own key (which is in the Debian keyring).

  • The .akku files for Snow packages and the new tarballs are combined using akku archive-scan and written to Akku-index.scm, which is then XZ-compressed and signed with the archive key. The akku update command verifies the signature when it downloads this file. When Akku creates an Akku.lock file it incorporates the hash from the index, which is verified when akku install runs.

The above should make it possible for any interested party to check the integrity of the archive. It also protects against attackers uploading funky tarballs that don’t match the git repository.

All git repositories and Snow packages are mirrored in the archive under /archive/mirror. This mirror is not used in the index and is mostly provided for backup purposes.

Beta testers

The new index with tarballs is not live yet, it needs some testing.

Anyone who wants to do so can try it and report successes or failures in the comments section below or in GitLab issues. Here is how to update to the new archive manually:

curl https://archive.akkuscm.org/beta/Akku-index.scm \
  > ~/.local/share/akku/index.db

There is a GPG signature (.sig) in the same directory in case you want to verify that it was not tampered with.

Run akku lock in your existing project to get a lockfile that uses the new index. Then run akku install to download your packages as usual.

If all goes well then some time soon the switch to the new index will happen and akku update will use the new style index. You will still be able to revert to the old index by downloading Akku-origin.scm manually from the archive site and then use that as your index. This file will keep being maintained because that is where the Akku website generator finds pointers to upstream Git repositories.

Further reading

More about Akku:

by weinholt at June 20, 2020 12:00 AM

June 19, 2020

Programming Praxis

Summing A String

In a string consisting of digits and other non-digit characters, the digits form an embedded series of positive integers. For instance, the string “123abc45def” contains the embedded integers 123 and 45, which sum to 168.

Your task is to write a program that takes a string and writes the sum of the integers embedded in the string. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at June 19, 2020 09:00 AM

June 16, 2020

Programming Praxis

Counting Fingers

A little girl counts on her fingers in a curious way. She counts 1 on her thumb, 2 on her index finger, 3 on her middle finger, 4 on her ring finger, and 5 on her pinkie finger, then works back, counting 6 on her ring finger, 7 on her middle finger, 8 on her index finger, and 9 on her thumb, when she again turns around and counts 10 on her index finger, 11 on her middle finger, and so on.

Your task is to write a program that determines which finger the little girl will be on when she reaches a thousand. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at June 16, 2020 09:00 AM

June 15, 2020

GNU Guix

Guix Further Reduces Bootstrap Seed to 25%

We are delighted to announce that the second reduction by 50% of the Guix bootstrap binaries has now been officially released!

The initial set of binaries from which packages are built now weighs in at approximately 60~MiB, a quarter of what it used to be.

In a previous blog post we elaborate on why this reduction and bootstrappability in general is so important. One reason is to eliminate---or greatly reduce the attack surface of---a “trusting trust” attack. Last summer at the Breaking Bitcoin conference, Carl Dong gave a fun and remarkably gentle introduction and at FOSDEM2020 I also gave a short talk about this. If you choose to believe that building from source is the proper way to do computing, then it follows that the “trusting trust” attack is only a symptom of an incomplete or missing bootstrap story.

Further Reduced Binary Seed bootstrap

Last year, the first reduction removed the GCC, glibc and Binutils binary seeds. The new Further Reduced Binary Seed bootstrap, merged in Guix master last month, removes the “static-binaries tarball” containing GNU Awk, Bash, Bzip2, the GNU Core Utilities, Grep, Gzip, GNU Make, Patch, sed, Tar, and Xz. It replaces them by Gash and Gash Core Utils. Gash is a minimalist POSIX shell written in Guile Scheme, while Gash Core Utils is a Scheme implementation for most of the tools found in GNU Coreutils, as well as the most essential bits of Awk, grep and sed.

After three new GNU Mes releases with numerous Mes C Library updates and fixes, a major update of Gash and the first official Gash Utils release, and the delicate balancing of 17 new bootstrap source packages and versions, the bottom of the package graph now looks like this (woohoo!):

                              gcc-mesboot (4.9.4)
                                      ^
                                      |
                                    (...)
                                      ^
                                      |
               binutils-mesboot (2.14), glibc-mesboot (2.2.5),
                          gcc-core-mesboot (2.95.3)
                                      ^
                                      |
            bash-mesboot (2.05), bzip2-mesboot, gawk-mesboot (3.0.0)
       diffutils-mesboot (2.7), patch-mesboot (2.5.9), sed-mesboot (1.18)
                                      ^
                                      |
                             gnu-make-mesboot (3.82)
                                      ^
                                      |
                                gzip-mesboot (1.2.4)
                                      ^
                                      |
                                  tcc-boot
                                      ^
                                      |
                                  mes-boot
                                      ^
                                      |
                          gash-boot, gash-utils-boot
                                      ^
                                      |
                                      *
                 bootstrap-mescc-tools, bootstrap-mes (~12 MiB)
                            bootstrap-guile (~48 MiB)

full graph

We are excited that the Nlnet Foundation has sponsored this work!

However, we aren't done yet; far from it.

Lost Paths

The idea of reproducible builds and bootstrappable software is not very new. Much of that was implemented for the GNU tools in the early 1990s. Working to recreate it in present time shows us much of that practice was forgotten.

Readers who are familiar with the GNU toolchain may have noticed the version numbers of the *-mesboot source packages in this great new bootstrap: They are ancient! That's a problem.

Typically, newer versions of the tool chain fix all kinds of bugs, make the software easier to build and add support for new CPU architectures, which is great. However---more often than not--- simultaneously new features are introduced or dependencies are added that are not necessary for bootstrapping and may increase the bootstrap hurdle. Sometimes, newer tools are more strict or old configure scripts do not recognise newer tool versions.

A trivial example is GNU sed. In the current bootstrap we are using version 1.18, which was released in 1993. Until recently the latest version of sed we could hope to bootstrap was sed-4.2.2 (2012). Newer releases ship as xz-compressed tarballs only, and xz is notoriously difficult to bootstrap (it needs a fairly recent GCC and try building that without sed).

Luckily, the sed maintainers (Jim Meyering) were happy to correct this mistake and starting from release sed-4.8 (2020) also gzip-compressed tarballs will be shipped. Similar for the GNU Core Utils: Releases made between 2011 and 2019 will probably be useless for bootstrapping. Confronted with this information, also the coreutils maintainers (Pádraig Brady) were happy to release coreutils-8.32 also in gzip compression from now on.

Even these simple cases show that solving bootstrap problems can only be done together: For GNU it really is a project-wide responsibility that needs to be addressed.

Most bootstrap problems or loops are not so easy to solve and sometimes there are no obvious answers, for example:

and while these examples make for a delightful puzzle from a bootstrappability perspective, we would love to see the maintainers of GNU softwares to consider bootstrappability and start taking more responsibility for the bootstrap story of their packages.

Towards a Universal, Full Source Bootstrap

Our next target will be a third reduction by ~50%; the Full-Source bootstrap will replace the MesCC-Tools and GNU Mes binaries by Stage0 and M2-Planet.

The Stage0 project by Jeremiah Orians starts everything from ~512 bytes; virtually nothing. Have a look at this incredible project if you haven’t already done so.

We are most grateful and excited that the Nlnet Foundation has again decided to sponsor this work!

While the reduced bootstrap currently only applies to the i686-linux and x86_64-linux architectures, we are thrilled that ARM will be joining soon. The Trusted ARM bootstrapping work is progressing nicely, and GNU Mes is now passing its entire mescc test suite on native ARMv7, and passing nigh its entire gcc test suite on native ARMv7. Work is underway to compile tcc using that GNU Mes. Adding this second architecture is a very important one towards the creation of a universal bootstrap!

Upcoming releases of Gash and Gash-Utils will allow us to clean up the bottom of the package graph and remove many of the “vintage” packages. In particular, the next version of Gash-Utils will be sophisticated enough to build everything up to gcc-mesboot using only old versions of GNU Make and Gzip. This is largely thanks to improvements to the implementation of Awk, which now includes nearly all of the standard features.

Looking even further into the future, we will likely have to remove the “vintage” GCC-2.95.3 that was such a helpful stepping stone and reach straight for GCC-4.6.4. Interesting times ahead!

About Bootstrappable Builds and GNU Mes

Software is bootstrappable when it does not depend on a binary seed that cannot be built from source. Software that is not bootstrappable---even if it is free software---is a serious security risk for a variety of reasons. The Bootstrappable Builds project aims to reduce the number and size of binary seeds to a bare minimum.

GNU Mes is closely related to the Bootstrappable Builds project. Mes aims to create an entirely source-based bootstrapping path for the Guix System and other interested GNU/Linux distributions. The goal is to start from a minimal, easily inspectable binary (which should be readable as source) and bootstrap into something close to R6RS Scheme.

Currently, Mes consists of a mutual self-hosting scheme interpreter and C compiler. It also implements a C library. Mes, the scheme interpreter, is written in about 5,000 lines of code of simple C. MesCC, the C compiler, is written in scheme. Together, Mes and MesCC can compile a lightly patched TinyCC that is self-hosting. Using this TinyCC and the Mes C library, it is possible to bootstrap the entire Guix System for i686-linux and x86_64-linux.

About GNU Guix

GNU Guix is a transactional package manager and an advanced distribution of the GNU system that respects user freedom. Guix can be used on top of any system running the kernel Linux, or it can be used as a standalone operating system distribution for i686, x86_64, ARMv7, and AArch64 machines.

In addition to standard package management features, Guix supports transactional upgrades and roll-backs, unprivileged package management, per-user profiles, and garbage collection. When used as a standalone GNU/Linux distribution, Guix offers a declarative, stateless approach to operating system configuration management. Guix is highly customizable and hackable through Guile programming interfaces and extensions to the Scheme language.

by Jan (janneke) Nieuwenhuizen at June 15, 2020 12:00 PM

June 05, 2020

Programming Praxis

2Max

Today’s exercise comes from Stack Overflow:

Given an array A consisting of N integers, return the maximum sum of two numbers whose digits add up to an equal sum. If there are not two numbers whose digits have an equal sum, the function should return -1. For example, A = [51, 71, 17, 42] would output 93 because there are two sets of numbers with the same digit-sum, (51, 42) with a digit-sum of 6 and (17, 71) with a digit-sum of 8, and the first pair has the maximum sum of two numbers of 93.

Your task is to write a program to calculated the requested maximum sum. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at June 05, 2020 09:00 AM

June 03, 2020

Andy Wingo

a baseline compiler for guile

Greets, my peeps! Today's article is on a new compiler for Guile. I made things better by making things worse!

The new compiler is a "baseline compiler", in the spirit of what modern web browsers use to get things running quickly. It is a very simple compiler whose goal is speed of compilation, not speed of generated code.

Honestly I didn't think Guile needed such a thing. Guile's distribution model isn't like the web, where every page you visit requires the browser to compile fresh hot mess; in Guile I thought it would be reasonable for someone to compile once and run many times. I was never happy with compile latency but I thought it was inevitable and anyway amortized over time. Turns out I was wrong on both points!

The straw that broke the camel's back was Guix, which defines the graph of all installable packages in an operating system using Scheme code. Lately it has been apparent that when you update the set of available packages via a "guix pull", Guix would spend too much time compiling the Scheme modules that contain the package graph.

The funny thing is that it's not important that the package definitions be optimized; they just need to be compiled in a basic way so that they are quick to load. This is the essential use-case for a baseline compiler: instead of trying to make an optimizing compiler go fast by turning off all the optimizations, just write a different compiler that goes from a high-level intermediate representation straight to code.

So that's what I did!

it don't do much

The baseline compiler skips any kind of flow analysis: there's no closure optimization, no contification, no unboxing of tagged numbers, no type inference, no control-flow optimizations, and so on. The only whole-program analysis that is done is a basic free-variables analysis so that closures can capture variables, as well as assignment conversion. Otherwise the baseline compiler just does a traversal over programs as terms of a simple tree intermediate language, emitting bytecode as it goes.

Interestingly the quality of the code produced at optimization level -O0 is pretty much the same.

This graph shows generated code performance of the CPS compiler relative to new baseline compiler, at optimization level 0. Bars below the line mean the CPS compiler produces slower code. Bars above mean CPS makes faster code. You can click and zoom in for details. Note that the Y axis is logarithmic.

The tests in which -O0 CPS wins are mostly because the CPS-based compiler does a robust closure optimization pass that reduces allocation rate.

At optimization level -O1, which adds partial evaluation over the high-level tree intermediate language and support for inlining "primitive calls" like + and so on, I am not sure why CPS peels out in the lead. No additional important optimizations are enabled in CPS at that level. That's probably something to look into.

Note that the baseline of this graph is optimization level -O1, with the new baseline compiler.

But as I mentioned, I didn't write the baseline compiler to produce fast code; I wrote it to produce code fast. So does it actually go fast?

Well against the -O0 and -O1 configurations of the CPS compiler, it does excellently:

Here you can see comparisons between what will be Guile 3.0.3's -O0 and -O1, compared against their equivalents in 3.0.2. (In 3.0.2 the -O1 equivalent is actually -O1 -Oresolve-primitives, if you are following along at home.) What you can see is that at these optimization levels, for these 8 files, the baseline compiler is around 4 times as fast.

If we compare to Guile 3.0.3's default -O2 optimization level, or -O3, we see bigger disparities:

Which is to say that Guile's baseline compiler runs at about 10x the speed of its optimizing compiler, which incidentally is similar to what I found for WebAssembly compilers a while back.

Also of note is that -O0 and -O1 take essentially the same time, with -O1 often taking less time than -O0. This is because partial evaluation can make the program smaller, at a cost of being less straightforward to debug.

Similarly, -O3 usually takes less time than -O2. This is because -O3 is allowed to assume top-level bindings that aren't exported from a module can be transformed to lexical bindings, which are more available for contification and inlining, which usually leads to smaller programs; it is a similar debugging/performance tradeoff to the -O0/-O1 case.

But what does one gain when choosing to spend 10 times more on compilation? Here I have a gnarly graph that plots performance on some microbenchmarks for all the different optimization levels.

Like I said, it's gnarly, but the summary is that -O1 typically gets you a factor of 2 or 4 over -O0, and -O2 often gets you another factor of 2 above that. -O3 is mostly the same as -O2 except in magical circumstances like the mbrot case, where it adds an extra 16x or so over -O2.

worse is better

I haven't seen the numbers yet of this new compiler in Guix, but I hope it can have a good impact. Already in Guile itself though I've seen a couple interesting advantages.

One is that because it produces code faster, Guile's boostrap from source can take less time. There is also a felicitous feedback effect in that because the baseline compiler is much smaller than the CPS compiler, it takes less time to macro-expand, which reduces bootstrap time (as bootstrap has to pay the cost of expanding the compiler, until the compiler is compiled).

The second fortunate result is that now I can use the baseline compiler as an oracle for the CPS compiler, when I'm working on new optimizations. There's nothing worse than suspecting that your compiler miscompiled itself, after all, and having a second compiler helps keep me sane.

stay safe, friends

The code, you ask? Voici.

Although this work has been ongoing throughout the past month, I need to add some words on the now before leaving you: there is a kind of cognitive dissonance between nerding out on compilers in the comfort of my home, rain pounding on the patio, and at the same time the world on righteous fire. I hope it is clear to everyone by now that the US police are an essentially racist institution: they harass, maim, and murder Black people at much higher rates than whites. My heart is with the protestors. Godspeed to you all, from afar. At the same time, all my non-Black readers should reflect on the ways they participate in systems that support white supremacy, and on strategies to tear them down. I know I will be. Stay safe, wear eye protection, and until next time: peace.

by Andy Wingo at June 03, 2020 08:39 PM