In this talk, Ed will give a live-coding introduction to normalization by evaluation. He will then show how Graal and Truffle, on the JVM, can be (ab)used to JIT functional languages, and discuss why this seems like a promising direction for evaluating dependently typed languages in particular.
Q&A
Question: How usable is this for writing a real program today?
Answer: It is certainly more usable than the naive evaluation strategy.
Question: What does this mean for the overall Coda project going forward? Can we get any sort of status report or hints where it might go next?
Answer: A lot of my work lately has been on probabilistic programming, and using it to make a nicer equality saturation engine.
Question: What extra tricks could a JIT built for a functional language expose?
Answer: Well, the bits I didn't get to really share are the examples of how it has to trampoline.
What I wind up doing is that each call in tail position starts a "potential" trampoline, which is to say that the evaluator puts a little exception handler in place that handles tail-call exceptions that make it through. Then I start a Bloom filter of which closure bodies I've seen so far on the local call chain, turning the Java hash function into a little Int64-sized Bloom filter with ~5 hashes in it.
The idea is that when I go to enter a closure, I throw if the bits for the current closure are already set in the filter. The handler turns it into a loop if the closure actually is on the local stack, and hits the fallback trampoline in the cases where I got a false positive.
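Roughly, that filter looks like the following (an illustrative Haskell sketch only; the real thing lives inside the Truffle interpreter and keys off Java's hashCode, and the way five bit positions are carved out of one hash below is made up):

    import Data.Bits (setBit, testBit, shiftR, (.&.))
    import Data.Word (Word64)

    -- A 64-bit Bloom filter summarising which closure bodies are already on
    -- the local tail-call chain.
    newtype SeenBodies = SeenBodies Word64

    emptySeen :: SeenBodies
    emptySeen = SeenBodies 0

    -- Derive five bit positions (0..63) from a single hash of a closure body.
    bitsFor :: Int -> [Int]
    bitsFor h = [ (h `shiftR` (6 * i)) .&. 63 | i <- [0 .. 4] ]

    -- Record the body of a closure we are entering in tail position.
    insertBody :: Int -> SeenBodies -> SeenBodies
    insertBody h (SeenBodies w) = SeenBodies (foldr (flip setBit) w (bitsFor h))

    -- True means the body *might* already be on the chain (false positives
    -- are possible). Before entering a closure we check this; if it fires we
    -- throw the tail-call exception and let the handler either loop or fall
    -- back to the full trampoline.
    mightBeOnChain :: Int -> SeenBodies -> Bool
    mightBeOnChain h (SeenBodies w) = all (testBit w) (bitsFor h)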
So that is trampolining. For polymorphic inline caching, each 'app' node becomes a little polymorphic inline cache of which closure body is 'expected' for the function. If it hits, you get to pass the arguments directly in registers, etc., by the time it gets through the assembler, and it builds PAPs like GHC when there aren't enough arguments. The PIC is a bit strange because it's indexed by the arity of the functions expected, and it turns into expected chains of result closure bodies in the presence of overapplication. E.g. if you call foo x y z, and foo always evaluates to a 2-ary function with a particular closure body, it'll basically inline the whole thing right there, and then figure out what it expects the 1-ary result to be and do the same there using a nested PIC.

The main headache is that the trampoline story wound up being way messier than I'd hoped. I'd originally wanted to have it gradually install better trampolines: check for self-tail-calls, check for self-tail-calls with a different environment, check for the need for a full trampoline, and degrade gracefully. But that didn't work well with the "partial escape analysis" pass that Truffle/Graal uses to avoid actually passing arguments, so I'd always wind up building a big Object[] array with all the arguments, putting everything into boxes; my memory usage pattern was awful and it was doing all sorts of extra work. The current Bloom-filter craziness was a compromise; in theory a better functional JIT would be able to avoid that tax.
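To make that 'app'-node inline cache concrete, here is a toy Haskell model of its shape; the types and the Cache representation are hypothetical, and in the real Truffle node a hit is what lets Graal pass the arguments in registers, build PAPs on underapplication, and chain nested caches on overapplication:

    import Data.IORef (IORef, readIORef, modifyIORef')

    -- Illustrative types: a closure is identified by the code body it closes over.
    type BodyId  = Int
    data Value   = VInt Int | VClosure Closure
    data Closure = Closure { bodyId :: BodyId, arity :: Int, code :: [Value] -> Value }

    -- One entry per closure body this 'app' node has seen; the entry stands in
    -- for the specialised, directly callable version of that body. (The real
    -- cache is also indexed by arity, so underapplication can build a PAP and
    -- overapplication can chain into a nested cache.)
    type Cache = [(BodyId, [Value] -> Value)]

    callThroughPIC :: IORef Cache -> Closure -> [Value] -> IO Value
    callThroughPIC ref f args = do
      cache <- readIORef ref
      case lookup (bodyId f) cache of
        Just fast -> pure (fast args)   -- hit: dispatch straight to the cached body
        Nothing   -> do                 -- miss: take the generic path and grow the cache
          modifyIORef' ref ((bodyId f, code f) :)
          pure (code f args)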
Question: Are you using Haskell for that, or Python/Julia, etc.?
Answer: I've been bouncing back and forth between Haskell and Rust for a lot of this. In particular I'm looking at Rust for representing the database that is my crazy syntax tree for equality saturation and the rule engine, but all my probabilistic programming work has been in Haskell so far.
Question: You mentioned not quite getting the performance you were hoping for, but can you say something about the performance you did see?
Answer: Well, it was a few x faster than a fairly naive evaluator, but also a few x slower than the direct 'let's just write the example in Java' test. I was hoping it would lean much more cleanly toward one side or the other. I was like, if I get 10x it's a win, if I get 2x it's a loss; getting 5x is a meh =P. What I would like to see is a similar framework for something more Haskelly, but I'm not sure how it'd work.
Question: What parts of probabilistic programming are you looking to use/improve? I vaguely remember in a guanxi talk you briefly mentioned something about pumping random bits into an arithmetic decoder to do sampling without replacement in succinct space(?), but you didn't explain.
Answer: That thing didn't work out so well in the end; it really got in the way of variable reordering. Unpacking the probabilistic programming thing a bit: step 1 is using control/antithetic variates for HMC, which Piponi wrote a nice paper about. This has been really useful. But in general, there is a pretty straightforward way to make a probabilistic programming monad that uses 'scores' and to replace it, via traced MCMC, with one that doesn't. If you make those scores use AD, you can use HMC on that monad rather than naive MCMC. And this can be done in multiple layers, which enables you to compute the expectations for the control variates.
To explain what I mean: if f is a function of a random variable, and g is another function of that same random variable, we can look at something like
f(X) + b (g(X) - E[g(X)])
for some constant b. Now E[g(X) - E[g(X)]] = 0,
so this has the same expected value as f(X), but if f and g are correlated it can have much lower variance. I can also replace E[g(X)] with an approximation of it produced by another layer of MCMC/HMC, and you just compute the b that minimizes variance, which is a little exercise in algebra. This works for multiple g's, in parallel or nested, basically as a way of taking crappy models and using them to improve the variance in better models. The trick is that Bayesians in general ignore computation time, so you spend time proportional to the relative computational cost on improving each estimated E[g(X)] vs. computing the raw f(X) function. When mixed with HMC, this is a pretty good 'universal evaluator'.
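Concretely, the little exercise in algebra comes out to b* = -Cov(f(X), g(X)) / Var(g(X)), because Var[f(X) + b(g(X) - E[g(X)])] = Var f + b^2 Var g + 2b Cov(f, g). An illustrative Haskell sketch of estimating that from paired samples (the names here are made up):

    mean :: [Double] -> Double
    mean xs = sum xs / fromIntegral (length xs)

    -- (Biased) sample covariance; cov gs gs is the sample variance of g.
    cov :: [Double] -> [Double] -> Double
    cov xs ys = mean (zipWith (*) (map (subtract (mean xs)) xs)
                                  (map (subtract (mean ys)) ys))

    -- The variance-minimising control-variate coefficient b* = -Cov(f,g)/Var g.
    optimalB :: [Double] -> [Double] -> Double
    optimalB fs gs = negate (cov fs gs / cov gs gs)

    -- Control-variate estimate of E[f(X)], given f and g evaluated at the same
    -- draws of X and an (approximate) E[g(X)], e.g. from an inner MCMC/HMC layer.
    cvEstimate :: [Double] -> [Double] -> Double -> Double
    cvEstimate fs gs eg = mean fs + optimalB fs gs * (mean gs - eg)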
The next step is how to apply that to equality saturation, which centers on rethinking equality saturation as a series of mutations, MCMC-like moves, that turn one syntax tree into another; you could think of each rewrite rule as a bidirectional mutation that you might apply. Now, if you do this naively, it's strictly worse than equality saturation: you are basically doing random walks through program space and judging them by whether they improve some theoretical or actual measurement of performance, taking all the moves that make it faster and enough of the ones that make it slower to maintain detailed balance. But what I want is something more like equality saturation than this sort of STOKE-like mutation procedure. With equality saturation you just keep adding new facts to the database, and proofs that things are equal, and hope the unifier puts an amount of pressure that balances the work of the rewrite rules expanding the terms. Then you pick the 'best' program using a crappy metric.
Here I want something more like an ensemble, where I have a batch of equivalent programs, but NOT all of them, and then transform the rewrite rules into things that either add or REMOVE (by running them backwards) programs from this set. Then when I pick one out of the 'swarm' (using probabilistic programming, again!), it basically lets me weight the rules based on how much they help in practice. The nice thing is that the longer you run, the faster the code gets and the more I learn about the weights to apply, at least in theory. The headache is making the right kind of database for storing this, computing the detailed-balance calculations (every edge has to have an inverse, lest I lack proper support), etc.
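To give a flavour of what one of those bidirectional moves looks like as a Metropolis-Hastings step, here is a cartoon in Haskell; the Move type, the proposal probabilities, and the cost-based weighting are all stand-ins for illustration, not the actual engine:

    import System.Random (randomRIO)

    -- A bidirectional rewrite: 'propose' applies it, and qFwd / qBwd are the
    -- probabilities of proposing this move and its inverse. That ratio is what
    -- the detailed-balance bookkeeping needs (every edge must have an inverse).
    data Move prog = Move
      { propose :: prog -> prog
      , qFwd    :: Double
      , qBwd    :: Double
      }

    -- Cheaper programs get exponentially more weight.
    weight :: (prog -> Double) -> prog -> Double
    weight cost p = exp (negate (cost p))

    -- One Metropolis-Hastings step over program space.
    mhStep :: (prog -> Double) -> Move prog -> prog -> IO prog
    mhStep cost mv p = do
      let p'     = propose mv p
          accept = min 1 (weight cost p' * qBwd mv / (weight cost p * qFwd mv))
      u <- randomRIO (0, 1)
      pure (if u < accept then p' else p)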
That is the gist of it anyway; not sure how coherent that was.
Question: Lots of food for thought at least. Sounds kind of like primary sample space Metropolis light transport.
Answer: MLT is a huge influence on me, in particular all the recent Bitterli work where he managed to show how to use reservoir sampling and the like in an unbiased way that I just never would have thought could be made unbiased. The other bits are to use something called product sampling and variations on multiple importance sampling (sketched below) as yet more ways to take crappy approximations of a space and get something better. Each of these, control/antithetic variates, product sampling, MIS, all sort of hit the same space. In particular, for a cheap source of crappy models there is a nice paper I keep playing with: http://ddg.math.uni-goettingen.de/pub/GeodesicsInHeat.pdf
Discretize the space, use that to reconstruct geodesics, and use those to move around rather than the discrete approximation itself: super cheap to evaluate, but not as bad as the naive discrete version.
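For reference, the multiple importance sampling combination mentioned above, in its usual balance-heuristic form (a generic sketch, nothing specific to the engine):

    -- Balance-heuristic weight for technique i, given every technique's density
    -- evaluated at the sample: w_i = p_i / sum_j p_j.
    balanceWeight :: [Double] -> Int -> Double
    balanceWeight pdfs i = (pdfs !! i) / sum pdfs

    -- Contribution of one sample drawn from technique i: f(x) weighted by w_i
    -- and divided by the density it was actually drawn from. Several cheap,
    -- imperfect proposal densities combine into a single unbiased estimator.
    misContribution :: Double -> [Double] -> Int -> Double
    misContribution fx pdfs i = balanceWeight pdfs i * fx / (pdfs !! i)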
I do love that the probability-theoretic equality saturation engine doesn't require me to figure out how to make my rules terminate. I just run, and get better and better programs over time, the idea being that the expectation of any individual program being in the 'swarm' and then being selected should match the detailed balance criterion; it's a weird form of 'slightly dependent' replica exchange. Making the math work there has been where all my headaches have kicked in… What I'd kind of like is a compiler that is well suited to doing little bits of AD on parts of a program, to setting up these little residuated probabilistic programming problems, and also to handling the heavily data-layout-dependent stuff, like how to store the database and implement the not-quite-Rete network of rules. Then I'd implement all of this in that… but "there's a hole in that bucket, dear Liza, dear Liza," as the song goes.
One thing I completely, abjectly forgot to mention is that the approach to NBE there is almost entirely taken from David Thrane Christiansen; he has a paper that covers pretty much the same code I live-coded.
I'm usually over on irc.freenode.net in ##coda and #haskell, so assuming I didn't manage to answer all questions for all time, feel free to bug me there.
Cadenza: Building Fast Functional Languages Fast
Edward Kmett
Fellow and VP of Software Engineering, Groq Inc.