Planet Scheme

March 24, 2017

Programming Praxis

Symbolic Differentiation

One of the original motivations for the Lisp language was to write a program that performed symbolic differentiation. In this exercise we look at the symbolic differentiator in Section 2.3.2 of SICP, which handles expressions containing only addition and multiplication according to the following rules:

\frac{dc}{dx} = 0, with c \not= x,

\frac{dx}{dx} = 1,

\frac{d(u+v)}{dx} = \frac{du}{dx} + \frac{dv}{dx}, and

\frac{d(uv)}{dx} = u\frac{dv}{dx} + v\frac{du}{dx}.

Your task is to write a program that performs symbolic differentiation according to the rules given above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.


by programmingpraxis at March 24, 2017 09:00 AM

March 21, 2017

Programming Praxis

Word Sets, Solved

I completed the solution to the previous exercise, which you can look at if you wish. We’ll have a new exercise on Friday.


by programmingpraxis at March 21, 2017 09:00 AM

March 20, 2017

GNU Guix

Creating bundles with guix pack

Guix just got a new command, dubbed guix pack, which we think many developers will find useful.

Last week we were celebrating the release of GNU Guile 2.2.0, the Scheme implementation that powers Guix. This is a major milestone and Guile developers naturally wanted to make it easy for users to discover all the goodies of 2.2.0 as soon as possible. One of the major roadblocks to that, as for any non-trivial piece of software, is deployment: because your distro is unlikely to have Guile 2.2.0 packaged on Day 1, you have to build it by yourself, which means getting the right dependencies installed and then building Guile itself. That’s not difficult for a developer, but it’s certainly cumbersome.

Andy Wingo, the driving force behind Guile, thought that it would be nice to propose a binary tarball of Guile 2.2.0 on the day of its release. Guix had already been providing binary tarballs for a couple of years, so why not do the same for Guile? Essentially, the new guix pack command is a generalization of what Guix was already using.

Making packs

So how does it work? The basic idea is simple: you type

guix pack guile

and the command returns in /gnu/store a good old tarball that contains binaries for Guile and all its dependencies. If you run, say,

guix pack guile emacs geiser

then you get a complete “Guile SDK” containing Guile, Emacs, Geiser, and all their dependencies.

When you extract the tarball, you get a /gnu/store directory with a bunch of sub-directories with these long hashes, one of which is the “profile” containing Guile, Emacs, and Geiser.

You wouldn’t want to ask users to type /gnu/store/war325pv1iixj13k6y8yplzagpknfn0c-profile/bin/guile to launch Guile, though. So guix pack has a command-line option to create symlinks in the image.

guix pack -S /opt/gnu/bin=bin guile emacs geiser

The command above creates a /opt/gnu/bin symlink to the bin directory of the profile in the tarball, such that users can simply type /opt/gnu/bin/guile to run Guile.

Recipients of a binary tarball are expected to either extract it in their root file system (yes!) where it will create /gnu and /opt/gnu in this case:

# cd /
# tar xf /path/to/pack.tar.gz
# /opt/gnu/bin/guile --version
guile (GNU Guile) 2.2.0

… or they can chroot into it, possibly relying on user namespaces and thereby avoiding root privileges:

$ mkdir /tmp/pack
$ cd /tmp/pack
$ tar xf /path/to/pack.tar.gz
$ unshare -mrf chroot . /opt/gnu/bin/guile --version
guile (GNU Guile) 2.2.0

The good thing with this is that, because Guix captures the complete dependency graph of packages, the tarball contains everything that’s needed to run Guile and is going to work in exactly the same way on any system that runs the kernel Linux!

Bells and whistles

Of course a popular approach to run such “application bundles” is Docker. Since the image format for Docker is documented and fairly easy to produce, we added an option to produce images in this format (Ricardo Wurmus initially contributed Docker support for the low-level guix archive tool but we found that it made more sense to have it in guix pack):

guix pack -f docker -S /opt/gnu=/ guile emacs geiser

The resulting tarball can be passed to docker load, and people can then use docker run to actually run the application.

One of the goodies that comes for free is cross-compilation: Guix supports cross-compilation, so you can create a pack consisting of software cross-compiled for a given platform, specified by the usual GNU triplet. For example, the following command creates a pack with binaries for GNU/Linux on ARMv7:

guix pack --target=arm-linux-gnueabihf guile

… while the command below creates a pack with Windows binaries using the MinGW cross-compiler:

guix pack --target=i686-w64-mingw32 guile

All the package transformation options that Guix supports are available to guix pack. Let’s say you’re a developer of a large piece of software such as a web browser like IceCat and you’d like your users to test whether the current master branch actually fixes the bug you attempted to fix. In this case, you can build a pack of IceCat, but replace the source that’s specified in the distribution with the snapshot of master you’re interested in:

guix pack icecat --with-source=./icecat-48.8.0.master.tar.gz

Of course the resulting pack is going to be pretty big in this case, but I’m sure the general pattern can be useful.

Wait, didn’t you say that “app bundles get it wrong”?

It turns out that we Guix developers have been saying that binary “application bundles” à la Docker are problematic for a number of reasons:

  1. Composability: each bundle comes with a complete operating system, minus the kernel, and there is little or no sharing happening among bundles, notably in terms of disk space and memory usage.
  2. Security updates: since an “app bundle” is essentially a complete operating system, one has to be careful and apply security updates to all the software in each bundle. Unfortunately, that doesn’t always happen as has been famously reported on several occasions.
  3. Reproducibility: Docker images, for instance, are often hardly “reproducible” in the sense of a reproducible build process. First, Dockerfiles start out with a “base layer” that is typically a huge binary blob of some major distro. On top of that, they run a number of commands such as apt-get install whose result likely depends on the time at which they are run. Docker’s best practices document suggests ways to mitigate the problem, such as “version pinning”, but the whole approach remains rather brittle.
  4. Experimentation: Once you have this big binary blob, sure you can run the application you wanted, but you can do little more than that—you may or may not be able to find the corresponding source code, and you’d have a hard time fiddling with one of the components of the software stack.

We pride ourselves with having a tool set that caters to some of the use cases that “app bundles” and “containerization” try to address while having none of these drawbacks. So how do Guix packs fit into that picture?

First of all, the intended use case is different: we view guix pack as a tool that makes it easy to try out a piece of software on a non-Guix machine. But it is clear that for production, our recommendation is to use Guix directly, to get security updates and generally address all the above issues. :-)

That said, let’s see how these issues affect Guix packs. First, composability of Guix packs turns out to be pretty good. If you receive two different Guix packs for different pieces of software, you can unpack both in your root directory (or union-mount them in the same place): packages that differ have a different /gnu/store file name with a different hash, so they won’t collide; packages that are identical (say the C library or GTK+) will have the same /gnu/store file name so they’ll actually be shared.

That means that for security updates, you could always fetch a new pack of your application with the security updates and extract it in place. However, that requires you as a user to manually pay attention to vulnerabilities in all the software that comes with the pack, so clearly you’re better off using Guix instead and regularly upgrading. No wonders.

Packs themselves are reproducible bit-by-bit. If you know the Guix commit that was used to build a given pack, you can thus run the same guix pack command on another machine and verify that you get the exact same tarball. Currently not 100% of the packages Guix provides are reproducible bit-by-bit; we’re getting closer to that goal though, in part due to the fact that Guix builds are isolated by default, and also thanks to the efforts of everyone in the Reproducible Builds project to address sources of non-determinism in free software.

Because Guix packs are reproducible, you can not only reproduce the exact same pack but also create packs with variants of the software—for instance, changing the version of one of the packages in the stack. Of course this part requires you to have Guix installed somewhere, but at least you can easily fiddle with the software stack and “compile” your own variant of the software stack down to a new pack.

We hope you’ll enjoy packs and Guix, and would welcome your feedback on the guix-devel mailing list and on #guix on Freenode!

About GNU Guix

GNU Guix is a transactional package manager for the GNU system. The Guix System Distribution or GuixSD is an advanced distribution of the GNU system that relies on GNU Guix and respects the user&aposs freedom.

In addition to standard package management features, Guix supports transactional upgrades and roll-backs, unprivileged package management, per-user profiles, and garbage collection. Guix uses low-level mechanisms from the Nix package manager, except that packages are defined as native Guile modules, using extensions to the Scheme language. GuixSD offers a declarative approach to operating system configuration management, and is highly customizable and hackable.

GuixSD can be used on an i686 or x86_64 machine. It is also possible to use Guix on top of an already installed GNU/Linux system, including on mips64el and armv7.

by Ludovic Courtès (guix-devel@gnu.org) at March 20, 2017 01:45 PM

March 17, 2017

Programming Praxis

Word Sets

Today’s exercise is an interview question;

Given a list of words and a set of characters, determine which words in the list can be formed from the given characters. For instance, given the characters a, c and t, the words act and cat can be formed, but not the word stop.

Your task is to write the word classified described above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.


by programmingpraxis at March 17, 2017 09:00 AM

March 15, 2017

Andy Wingo

guile 2.2 omg!!!

Oh, good evening my hackfriends! I am just chuffed to share a thing with yall: tomorrow we release Guile 2.2.0. Yaaaay!

I know in these days of version number inflation that this seems like a very incremental, point-release kind of a thing, but it's a big deal to me. This is a project I have been working on since soon after the release of Guile 2.0 some 6 years ago. It wasn't always clear that this project would work, but now it's here, going into production.

In that time I have worked on JavaScriptCore and V8 and SpiderMonkey and so I got a feel for what a state-of-the-art programming language implementation looks like. Also in that time I ate and breathed optimizing compilers, and really hit the wall until finally paging in what Fluet and Weeks were saying so many years ago about continuation-passing style and scope, and eventually came through with a solution that was still CPS: CPS soup. At this point Guile's "middle-end" is, I think, totally respectable. The backend targets a quite good virtual machine.

The virtual machine is still a bytecode interpreter for now; native code is a next step. Oddly my journey here has been precisely opposite, in a way, to An incremental approach to compiler construction; incremental, yes, but starting from the other end. But I am very happy with where things are. Guile remains very portable, bootstrappable from C, and the compiler is in a good shape to take us the rest of the way to register allocation and native code generation, and performance is pretty ok, even better than some natively-compiled Schemes.

For a "scripting" language (what does that mean?), I also think that Guile is breaking nice ground by using ELF as its object file format. Very cute. As this seems to be a "Andy mentions things he's proud of" segment, I was also pleased with how we were able to completely remove the stack size restriction.

high fives all around

As is often the case with these things, I got the idea for removing the stack limit after talking with Sam Tobin-Hochstadt from Racket and the PLT group. I admire Racket and its makers very much and look forward to stealing fromworking with them in the future.

Of course the ideas for the contification and closure optimization passes are in debt to Matthew Fluet and Stephen Weeks for the former, and Andy Keep and Kent Dybvig for the the latter. The intmap/intset representation of CPS soup itself is highly endebted to the late Phil Bagwell, to Rich Hickey, and to Clojure folk; persistent data structures were an amazing revelation to me.

Guile's virtual machine itself was initially heavily inspired by JavaScriptCore's VM. Thanks to WebKit folks for writing so much about the early days of Squirrelfish! As far as the actual optimizations in the compiler itself, I was inspired a lot by V8's Crankshaft in a weird way -- it was my first touch with fixed-point flow analysis. As most of yall know, I didn't study CS, for better and for worse; for worse, because I didn't know a lot of this stuff, and for better, as I had the joy of learning it as I needed it. Since starting with flow analysis, Carl Offner's Notes on graph algorithms used in optimizing compilers was invaluable. I still open it up from time to time.

While I'm high-fiving, large ups to two amazing support teams: firstly to my colleagues at Igalia for supporting me on this. Almost the whole time I've been at Igalia, I've been working on this, for about a day or two a week. Sometimes at work we get to take advantage of a Guile thing, but Igalia's Guile investment mainly pays out in the sense of keeping me happy, keeping me up to date with language implementation techniques, and attracting talent. At work we have a lot of language implementation people, in JS engines obviously but also in other niches like the networking group, and it helps to be able to transfer hackers from Scheme to these domains.

I put in my own time too, of course; but my time isn't really my own either. My wife Kate has been really supportive and understanding of my not-infrequent impulses to just nerd out and hack a thing. She probably won't read this (though maybe?), but it's important to acknowledge that many of us hackers are only able to do our work because of the support that we get from our families.

a digression on the nature of seeking and knowledge

I am jealous of my colleagues in academia sometimes; of course it must be this way, that we are jealous of each other. Greener grass and all that. But when you go through a doctoral program, you know that you push the boundaries of human knowledge. You know because you are acutely aware of the state of recorded knowledge in your field, and you know that your work expands that record. If you stay in academia, you use your honed skills to continue chipping away at the unknown. The papers that this process reifies have a huge impact on the flow of knowledge in the world. As just one example, I've read all of Dybvig's papers, with delight and pleasure and avarice and jealousy, and learned loads from them. (Incidentally, I am given to understand that all of these are proper academic reactions :)

But in my work on Guile I don't actually know that I've expanded knowledge in any way. I don't actually know that anything I did is new and suspect that nothing is. Maybe CPS soup? There have been some similar publications in the last couple years but you never know. Maybe some of the multicore Concurrent ML stuff I haven't written about yet. Really not sure. I am starting to see papers these days that are similar to what I do and I have the feeling that they have a bit more impact than my work because of their medium, and I wonder if I could be putting my work in a more useful form, or orienting it in a more newness-oriented way.

I also don't know how important new knowledge is. Simply being able to practice language implementation at a state-of-the-art level is a valuable skill in itself, and releasing a quality, stable free-software language implementation is valuable to the world. So it's not like I'm negative on where I'm at, but I do feel wonderful talking with folks at academic conferences and wonder how to pull some more of that into my life.

In the meantime, I feel like (my part of) Guile 2.2 is my master work in a way -- a savepoint in my hack career. It's fine work; see A Virtual Machine for Guile and Continuation-Passing Style for some high level documentation, or many of these bloggies for the nitties and the gritties. OKitties!

getting the goods

It's been a joy over the last two or three years to see the growth of Guix, a packaging system written in Guile and inspired by GNU stow and Nix. The laptop I'm writing this on runs GuixSD, and Guix is up to some 5000 packages at this point.

I've always wondered what the right solution for packaging Guile and Guile modules was. At one point I thought that we would have a Guile-specific packaging system, but one with stow-like characteristics. We had problems with C extensions though: how do you build one? Where do you get the compilers? Where do you get the libraries?

Guix solves this in a comprehensive way. From the four or five bootstrap binaries, Guix can download and build the world from source, for any of its supported architectures. The result is a farm of weirdly-named files in /gnu/store, but the transitive closure of a store item works on any distribution of that architecture.

This state of affairs was clear from the Guix binary installation instructions that just have you extract a tarball over your current distro, regardless of what's there. The process of building this weird tarball was always a bit ad-hoc though, geared to Guix's installation needs.

It turns out that we can use the same strategy to distribute reproducible binaries for any package that Guix includes. So if you download this tarball, and extract it as root in /, then it will extract some paths in /gnu/store and also add a /opt/guile-2.2.0. Run Guile as /opt/guile-2.2.0/bin/guile and you have Guile 2.2, before any of your friends! That pack was made using guix pack -C lzip -S /opt/guile-2.2.0=/ guile-next glibc-utf8-locales, at Guix git revision 80a725726d3b3a62c69c9f80d35a898dcea8ad90.

(If you run that Guile, it will complain about not being able to install the locale. Guix, like Scheme, is generally a statically scoped system; but locales are dynamically scoped. That is to say, you have to set GUIX_LOCPATH=/opt/guile-2.2.0/lib/locale in the environment, for locales to work. See the GUIX_LOCPATH docs for the gnarlies.)

Alternately of course you can install Guix and just guix package -i guile-next. Guix itself will migrate to 2.2 over the next week or so.

Welp, that's all for this evening. I'll be relieved to push the release tag and announcements tomorrow. In the meantime, happy hacking, and yes: this blog is served by Guile 2.2! :)

by Andy Wingo at March 15, 2017 10:56 PM

March 14, 2017

Programming Praxis

Square Pyramidal Numbers

Cannonballs are traditionally stacked in a pyramid with a square base. A stack with a square base of 15 cannonballs has 15 × 15 = 225 on the bottom level, 14 × 14 = 196 on the level above that, and so on, a total of 1240 cannonballs.

Your task is to write a program to compute the number of cannonballs in a stack; use it to compute the number of cannonballs in a stack with a base of 1,000,000 cannonballs on a side. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.


by programmingpraxis at March 14, 2017 09:00 AM

March 10, 2017

Programming Praxis

Disordered Binary Search Trees

In a recent exercise we wrote a program to determine if a binary tree was balanced. Today’s exercise is similar:

Write a program to determine if a binary tree satisfies the binary search tree property that all left children are less than their parent and all right children are greater than their parent.

Your task is to write a program to determine if a binary tree is a binary search tree. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.


by programmingpraxis at March 10, 2017 09:00 AM

March 09, 2017

GNU Guix

Join GNU Guix for GSoC

As in previous years, Guix participates in the Google Summer of Code (GSoC), under the aegis of the GNU Project.

We have collected project ideas for Guix, GuixSD, as well as the GNU Shepherd, covering a range of topics. If you are passionate about computing freedom, Scheme, functional programming, or operating system development, check out the proposed projects. The list is far from exhaustive, so feel free to bring your own!

You can get in touch with us on the mailing lists and on the #guix channel on the Freenode IRC network.

If you are an eligible student, make sure to apply by April 3rd.

About GNU Guix

GNU Guix is a transactional package manager for the GNU system. The Guix System Distribution or GuixSD is an advanced distribution of the GNU system that relies on GNU Guix and respects the user&aposs freedom.

In addition to standard package management features, Guix supports transactional upgrades and roll-backs, unprivileged package management, per-user profiles, and garbage collection. Guix uses low-level mechanisms from the Nix package manager, except that packages are defined as native Guile modules, using extensions to the Scheme language. GuixSD offers a declarative approach to operating system configuration management, and is highly customizable and hackable.

GuixSD can be used on an i686 or x86_64 machine. It is also possible to use Guix on top of an already installed GNU/Linux system, including on mips64el and armv7.

by Ludovic Courtès (guix-devel@gnu.org) at March 09, 2017 01:00 PM

March 08, 2017

Greg Hendershott

Please scroll

Recently I got more time to catch up on racket-mode. I improved two things that happen to fit one theme — an extraordinarily advanced UX concept I call, “scrolling down to the point of interest.”

Visit definition

In racket-mode you can M-. (press Alt and period) to visit the definition of a Racket identifier.

There is no Racket library function that supports this, exactly. The identifier-binding function gives you a filename, but not a position within. And actually, it gives you two filenames (and identifier symbols), because the location and symbol of the definition might differ from the location and symbol under which it is provided.

I’ve also found it can be tricky when something is renamed more than once — for example both renamed and wrapped in a contract. Of the three (or more) names involved, identifier-binding will return only two. For example in (provide (contract-out [rename orig new contract])) it reports (1) the contract wrapper’s generated identifier and (2) new — but not (3) orig. Unfortunately the definition of orig is our desired destination.

So, I need to treat identifier-binding as a valuable head start — but maybe not the real answer.

I need to check each file and try to find the identifier within. This isn’t a job for regular expression; the name might not appear textually at the definition site (think of define-er macros). Instead I read the file as syntax and walk it. Sometimes it makes sense to search the syntax after it has been fully-expanded. Sometimes it helps to walk it unexpanded, looking for some special forms, for example the “rename” variant of contract-out.

If after all that, we can’t find the position, racket-mode plops you at the start of the file.

The change I made was to reduce the chance of that happening. Details in the commit message and diff.

View documentation

In racket-mode you can C-c C-d to view Racket’s HTML documentation in your default web browser. It should (a) open the correct page and (b) scroll to the item on that page. Unfortunately (b) didn’t always happen on macOS. Under certain conditions, macOS is reluctant to open file: URLs and scroll down to the anchor/fragment (the bit after the # in the URL).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
# Will open the default browser to the top of the define.html page
# but not scroll down to the define-values item:
$ open 'file:///Applications/Racket_v6.7/doc/reference/define.html#%28form._%28%28quote._~23~25kernel%29._define-values%29%29'

# Ditto
$ osascript -e 'open location "file:///Applications/Racket_v6.7/doc/reference/define.html#%28form._%28%28quote._~23~25kernel%29._define-values%29%29"'

# But this works!
$ osascript -e 'tell application "chrome" to open location "file:///Applications/Racket_v6.7/doc/reference/define.html#%28form._%28%28quote._~23~25kernel%29._define-values%29%29"'

# Well, you also want the "activate" at the end:
$ osascript -e 'tell application "chrome" to open location "file:///Applications/Racket_v6.7/doc/reference/define.html#%28form._%28%28quote._~23~25kernel%29._define-values%29%29" activate'

Interestingly the generic open seems to work fine for http:. Also fine if a file: location is under your home directory instead of /Applications. Because security?1 Anyway, this is probably why Racket developers and power users haven’t noticed, if they’re building Racket (and its docs) from HEAD.

So I can use osascript if I know the default browser. How do I know that? Ugh. OK. This information seems to reside in Library/Preferences/com.apple.LaunchServices/com.apple.launchservices.secure.plist. Read the JSON, find the correct entry, grab the browser part of com.company.browser, and hopefully we’re good.

Pontification

Computer science mostly isn’t science.

Software engineering mostly isn’t engineering.

Sometimes success is simply scrolling.

  1. Security theater? I don’t see why it’s safe to load a page, but risky to scroll to an anchor in it? I also don’t see why /Applications is more risky than your home dir — much less some rando http: location? 

by Greg Hendershott at March 08, 2017 10:00 PM

March 07, 2017

Programming Praxis

RC40

We examined the Solitaire Cipher of Bruce Schneier in a previous exercise. In an article on his blog, Ted Unangst complains that the cipher is too complicated for manual use, and proposes instead RC40, a variant of RC4 that uses only 40 characters instead of 256. Unangst’s alphabet is the 26 lower-case letters, the ten digits 0 through 9, the characters period, question mark, and space, and a “shift” character that gives an upper-case version of those 39 characters; two consecutive shift characters enter or leave caps-lock mode. Otherwise, RC40 is exactly the same as RC4, except that all references to 256 are changed to 40.

Your task is to write a program that enciphers and deciphers strings based on the RC40 cipher; use that program to decipher the string “5cxaxlfrfhy6kh38fbplm0mDko58xs.l9Hkz8” with key “tedunangst”. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.


by programmingpraxis at March 07, 2017 09:00 AM

March 06, 2017

Andy Wingo

it's probably spam

Greetings, peoples. As you probably know, these words are served to you by Tekuti, a blog engine written in Scheme that uses Git as its database.

Part of the reason I wrote this blog software was that from the time when I was using Wordpress, I actually appreciated the comments that I would get. Sometimes nice folks visit this blog and comment with information that I find really interesting, and I thought it would be a shame if I had to disable those entirely.

But allowing users to add things to your site is tricky. There are all kinds of potential security vulnerabilities. I thought about the ones that were important to me, back in 2008 when I wrote Tekuti, and I thought I did a pretty OK job on preventing XSS and designing-out code execution possibilities. When it came to bogus comments though, things worked well enough for the time. Tekuti uses Git as a log-structured database, and so to delete a comment, you just revert the change that added the comment. I added a little security question ("what's your favorite number?"; any number worked) to prevent wordpress spammers from hitting me, and I was good to go.

Sadly, what was good enough in 2008 isn't good enough in 2017. In 2017 alone, some 2000 bogus comments made it through. So I took comments offline and painstakingly went through and separated the wheat from the chaff while pondering what to do next.

an aside

I really wondered why spammers bothered though. I mean, I added the rel="external nofollow" attribute on links, which should prevent search engines from granting relevancy to the spammer's links, so what gives? Could be that all the advice from the mid-2000s regarding nofollow is bogus. But it was definitely the case that while I was adding the attribute to commenter's home page links, I wasn't adding it to links in the comment. Doh! With this fixed, perhaps I will just have to deal with the spammers I have and not even more spammers in the future.

i digress

I started by simply changing my security question to require a number in a certain range. No dice; bogus comments still got through. I changed the range; could it be the numbers they were using were already in range? Again the bogosity continued undaunted.

So I decided to break down and write a bogus comment filter. Luckily, Git gives me a handy corpus of legit and bogus comments: all the comments that remain live are legit, and all that were ever added but are no longer live are bogus. I wrote a simple tokenizer across the comments, extracted feature counts, and fed that into a naive Bayesian classifier. I finally turned it on this morning; fingers crossed!

My trials at home show that if you train the classifier on half the data set (around 5300 bogus comments and 1900 legit comments) and then run it against the other half, I get about 6% false negatives and 1% false positives. The feature extractor interns sequences of 1, 2, and 3 tokens, and doesn't have a lower limit for number of features extracted -- a feature seen only once in bogus comments and never in legit comments is a fairly strong bogosity signal; as you have to make up the denominator in that case, I set it to indicate that such a feature is 99.9% bogus. A corresponding single feature in the legit set without appearance in the bogus set is 99% legit.

Of course with this strong of a bias towards precise features of the training set, if you run the classifier against its own training set, it produces no false positives and only 0.3% false negatives, some of which were simply reverted duplicate comments.

It wasn't straightforward to get these results out of a Bayesian classifier. The "smoothing" factor that you add to both numerator and denominator was tricky, as I mentioned above. Getting a useful tokenization was tricky. And the final trick was even trickier: limiting the significant-feature count when determining bogosity. I hate to cite Paul Graham but I have to do so here -- choosing the N most significant features in the document made the classification much less sensitive to the varying lengths of legit and bogus comments, and less sensitive to inclusions of verbatim texts from other comments.

We'll see I guess. If your comment gets caught by my filters, let me know -- over email or Twitter I guess, since you might not be able to comment! I hope to be able to keep comments open; I've learned a lot from yall over the years.

by Andy Wingo at March 06, 2017 02:16 PM

March 03, 2017

Programming Praxis

Balanced Binary Search Trees

The story exploded across the intertubes a few days ago: A software engineer trying to enter the United States on a work visa was asked to prove his occupation by writing a program for the Customs and Border Protection agent:

Write a function to check if a Binary Search Tree is balanced.

Let’s help him.

Your task is to write a function to check if a binary search tree is balanced. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.


by programmingpraxis at March 03, 2017 09:00 AM

February 28, 2017

Programming Praxis

A Fun Little Task

Today’s exercise is a fun little task:

Given two strings a and b, with a no longer than b, determine if there is any permutation of a that appears in b. For instance, a permutation of string xyz appears starting at the fourth character (counting from zero) of string afdgzyxksldfm.

Your task is to write a program to determine is any permutation of a string is a substring of another string. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.


by programmingpraxis at February 28, 2017 09:00 AM

February 24, 2017

Programming Praxis

Nth Smallest Item In Binary Search Tree

A binary search tree is a binary tree in which all nodes in the left subtree are less than the current node and all nodes in the right subtree are greater than the current node. Items arrive in random order, are inserted into the binary search tree, and an in-order traversal produces the items in ascending order.

Your task is to write a program that finds the nth smallest item in a binary search tree, without enumerating all of the items in the tree. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.


by programmingpraxis at February 24, 2017 09:00 AM

February 21, 2017

Programming Praxis

It Was A Beautiful Weekend

It was a beautiful three-day weekend here in St Louis, with temperatures in the 70s all three days. It’s supposed to be winter here, with high temperatures around 50 and lows in the 30s, or colder, but the weatherman goofed and gave us beautiful weather.

I took the opportunity to do things other than write an exercise. I apologize. I’ll be back with another exercise on Friday. Probably. The beautiful weather is supposed to last another few days. . . .

In the meantime, pick an exercise you have done yet, and solve it. Or, enjoy the weather where you are.


by programmingpraxis at February 21, 2017 09:00 AM