Piccolo - A Stackless Lua Interpreter

2024-05-01


piccolo is an interpreter for the Lua language written in pure, mostly safe Rust with an eye towards safe sandboxing and resiliency. It uses gc-arena for its garbage collection system, and in fact gc-arena was originally created as part of piccolo.

You can try it out below in an the interactive REPL. I'm not much of a web developer, and this is a little homegrown web terminal thingy, so hopefully this works for you. I'm going to be using REPLs like this to demonstrate a lot of what makes piccolo unique, so if it doesn't work or you or you don't have Javascript turned on, then this might be a pretty boring post and I'm sorry!

↓ Type Some Lua Here ↓[1]

I find REPLs to be really magical and inviting,[2] and I end up eventually wanting to attach them to everything I ever make.[3] Not just a REPL but the idea that your program is a sort of Living Thing that understands a Language, and if the normal UI isn't fit for purpose and you're enough of a Witch, you can always just Speak to the program in the way it naturally understands... cast a Spell to achieve your Goals, or maybe just have a Conversation. I think it helps close the gap between the author of a program and the user. I'm not better than the user, who am I to tell them what they can and can't say to the program?

I hope you feel the same way about REPLs as I do because there are a lot of them on this page, and I'm going to ask you to type things into them. If you're not into it, well... I'll try and always give you working code that you can copy and paste, but where's the fun in that?


I said in my last post in this series that my goal wasn't to try to sell anyone on gc-arena or piccolo[4] and I think that's still true here. piccolo is pretty rough around the edges[5] right now, but it's more complete than you might think[6] and it has some interesting and unique features. Still, I'm not telling you to go out and replace LuaJIT or Luau or The Ur Lua with piccolo just yet.

In this post, I just want to talk about what makes piccolo special, and hopefully you'll find it interesting. In a future post, I'm going to tie all of this together, the way gc-arena and piccolo are designed, how they work now, and how I wish they could work in the future, but this post is going to focus just on piccolo as it works right now.

History of piccolo

When I was first writing piccolo (in 2019), I had noticed that nobody had quite figured out how to make VMs for certain kinds of languages that could be competitive with C while being implemented primarily in safe Rust. Mostly I'm referring to problems surrounding garbage collection, rather than something like designing fast interpreter loops (which is something I'm not very good at yet!).

Languages that have ownership semantics similar to Rust are of course much easier to write VMs for in Rust, because the implemented language can snarf[7] much of the functionality from the host language. It's pretty easy to express the exact semantics of Rc by just... using Rc itself to implement the language. There are several such scripting languages that try to have matching ownership and mutability semantics with Rust and I think that's honestly a great idea because sharing these core semantics with the host language just removes a huge amount of cognitive burden when crossing the language boundary, and you can make an embedded language this way that feels like it fits in perfectly with the host.

However, I also think it's a bit of a disappointment if only Rust-like languages can be easily made using Rust. Certainly this is not actually true, and there are plenty of other Rust runtimes for languages with "unrestricted ownership" (proper garbage collection, unrestricted references... the terms for this are a bit all over the place, but what I mean is languages like Lua). At the time at least, when I surveyed language runtimes written in Rust they broadly fell into one of several categories, none of which was what I wanted for piccolo...

  1. Languages with ownership semantics similar to Rust, so no "true garbage collection" / "unrestricted ownership" or whatever you want to call it (dyon, rune, etc...)
  2. Language runtimes with true garbage collection (tracing or generational collector) but whose implementations were wildly, infectiously unsafe as they would be in C, due to the nature of garbage collected pointers and their interactions with the Rust stack.
  3. Language runtimes for languages that are meant to have proper garbage collection but the implementer used Rc or similar and left the problem of what to do about reference cycles for later (RustPython).

I wanted to have a language for Rust that felt as natural as Lua does for C, and one that had true garbage collection[8] ... and frankly I really like just plain vanilla Lua. I think it matches perfectly with Rust because they're so different, I think having an Rust embedded language that frees you from even having to think about ownership is very powerful because it can be used for things where having to think about ownership can be more trouble than its worth. Let each language play to their strengths, and Rust and Lua in a lot of ways have complementary strengths.

Since I just looooove Lua so much and I had so much experience with vanilla PUC-Rio Lua (aka The Ur Lua), I decided to try and write an interpreter designed similarly to[9] PUC-Rio's Lua, with a similar sort of garbage collector, but because I was primarily interested in sandboxing untrusted scripts, somehow made of mostly safe Rust.[10] gc-arena was born out of my efforts to solve this problem.

But I really don't intend to throw any shade at any of the projects I listed above or any other Rust-implemented language runtime written in a different style. These are hard problems and gc-arena is not a perfect solution. In fact, in the early days of gc-arena and piccolo, I ran into so many seemingly unsolvable problems that I became hopelessly frustrated and gave up on piccolo entirely for about four years.

It was only through Ruffle's use of gc-arena and Ruffle contributors helping to get through the roadblocks we encountered that I was eventually able to pick piccolo back up. Today, there are not nearly so many unsolved problems in trying to use gc-arena to implement a language like Lua or ActionScript, but it really came down to Ruffle contributors helping to solve each issue one by one over the intervening years.

BUT, even with all of the major roadblocks overcome (pointers to Unsize types, GC finalization, non-'static Any, a bunch more I've forgotten) influence from the biggest limitation of gc-arena stubbornly remained: the "Mutation XOR Collection" design of gc-arena that was talked about in the last post. gc-arena's design requires that code that wants to access garbage collected data do so through special mutation methods, and that collection must ONLY happen when no mutation methods are executing.

This "Mutation XOR Collection" design means that calls to Arena::mutate must return before garbage collection can safely take place, to prove that no garbage collected pointers exist anywhere on the Rust stack. Unless I were willing to just give up on ever hoping to match Lua runtimes written in C, or were willing to limit the places piccolo could be used,[11] I had to figure out a way to make the entire execution context of piccolo suspendable, just to be able to leave calls to Arena::mutate, so that garbage collection could take place at arbitrary points.

At the beginning, this limitation infuriated me, and I spent ages trying anything I could to find an acceptable way around it. It still certainly is limiting, but now that piccolo has gotten further along I think what I've ended up with is actually very cool, and what started out as purely a painful compromise might actually end up being piccolo's "killer feature"...

A "Stackless" Interpreter Design

Some of the biggest, most interesting features of piccolo come from its "stackless" design, originally born only from necessity due to the limitations of gc-arena. This design is similar to other "stackless" interpreters, and the one most people have heard of is Stackless Python, so if you're familiar with it, most of what you know will be applicable to piccolo as well.

"Stackless" here is jargon that's used in a couple of places, not just in interpreter design. You may have heard that Rust has "stackless" coroutines, and the word "stackless" as I'm using it here means the exact same thing. It means that piccolo's Lua runtime is not "stackful", it does not rely on the Rust function call stack to maintain its execution state, and execution can be suspended at any time. This applies not just for plain interpreted Lua bytecode but also for Rust code executed from Lua (callbacks) in any form, for any depth of Lua -> Rust -> Lua calls. The overriding design decision made early in piccolo's life was that execution of Lua (and ALL callback code called from Lua) must always be able to be suspended, and that execution would be driven from the outside by polling:

// This is pseudocode to demonstrate the "stackless" or "trampoline" VM style,
// how this really works inside piccolo is slightly more complex.

// Set up the execution of some Lua code and produce an object representing the
// execution state. The internal Lua / Callback call stack is reified into this
// object, and does not rely on the normal Rust function call stack.
let execution_state = ...;

// We need to call a method to make progress, and call that method in a loop
// until the running Lua code is complete.
loop {
    match execution_state.poll() {
        None => {
            // We are not yet done and must continue to call
            // `execution_state.poll()`. We could suspend the computation at any
            // time however by just exiting the loop.
        }
        Some(result) => {
            // Our execution is done and we receive the result, and should not
            // call `execution_state.poll()` anymore.
            break;
        }
    }
}

This should be extremely familiar to anyone who has ever touched Rust futures, and the similarity is no accident. The core of the design of piccolo is virtually the same as Async Rust: that all long running operations are reified into objects that must be polled to completion.

The obvious benefit for piccolo is that it becomes trivial now to exit a call to gc_arena::Arena::mutate and allow for collection, since we can now do so in-between calls to execution_state.poll().[12] What I didn't fully appreciate when I began writing piccolo is that this style of writing an interpreter, though at times more difficult, comes with many other benefits that make it (in my opinion) a worthwhile goal, or at least a very interesting place in the design space and I hope a unique niche that piccolo can fill.

Benefits of Stackless

Cancellation

An obvious side benefit of polling execution in the style of Future is that, just like a Future, execution can be canceled at any time by just... not continuing to call poll.

Let's go back to the REPL from above, but this time, let's see what happens when we run some Lua code that never returns.

If you don't know much Lua, try typing something like:

while true do end

or maybe

repeat
    print("Look Around You")
until false
↓ Just Endlessly Do It ∞ ↓

You should see a big interrupt button appear, and when you press it, the command should stop. How this works under the hood in this demo is that inside this webpage there is some javascript that looks something like this:

this.interval = setInterval(() => {
    if (this.executor.step(8192)) {
        this.finish();
    }
}, 0);

This is the "poll loop" that we talked about above that polls running Lua code to completion. This is still not exactly how it would look when using piccolo directly but it's a little closer... The executor there is a piccolo::Executor object,[13] and Executor::step is called in a loop until the code has completed. Here, Lua execution actually hooks into the normal Javascript event loop, every time the closure is run, the piccolo::Executor is "stepped" for 8192 "steps". The "steps" value here is referred to inside piccolo as "fuel" and (more or less) corresponds to a number of Lua VM instructions to run before returning. Since the timeout given to setInterval is 0, we run this function regularly and rapidly but without blocking the main Javascript event loop. When the interrupt button is pressed, the interval is canceled and the executor is dropped, interrupting execution. In fact, every REPL on the page works in the same way and shares the main Javascript event loop, so all of them can execute Lua code concurrently.

Interruptable Lua code is not something new to piccolo, PUC-Rio Lua (and most of its forks) have something like this in the form of lua_sethook. This function allows you to, among a few other things, set "hook" function that runs every count VM instructions, and one of the things this function can do when run is interrupt running code by calling e.g. lua_error.[14] So we can imagine a situation in which we can set up something similar to what piccolo is doing here, either by running Lua code in a different thread and waiting for an interrupt flag to be set in the hook function, or by pumping an event loop from within the hook function or something similar.[15]

However, I would argue that the way piccolo is structured makes this effortless and natural due to its stackless design. Since PUC-Rio Lua is written in normal, stackful style, the best thing it can offer is a hook function that will be periodically called by the VM loop, whereas with piccolo the user never loses control to the VM in the first place. piccolo is designed such that a call to Executor::step should always return in a reasonable, bounded amount of time proportional to the "fuel" it is given,[16] so it is not necessary to provide an equivalent to lua_hook at all.

Pre-emptive Concurrency

One of Lua's best and most defining features is its support for coroutines. Coroutines in Lua can be used to provide seamless cooperative multitasking, and are especially powerful for things like game development where some kind of script must execute concurrently with the running simulation of a video game.

However, Lua coroutines only provide cooperative multitasking, the script must decide where and when to yield control to the caller, and a buggy (or malicious) script that does not yield control may need to be interrupted and canceled (via lua_sethook) or might make the game hang.

Rust (at least, unstable Rust) also has coroutines, and they are used behind the scenes to implement async. In Rust, like in Lua, these coroutines provide cooperative multitasking, Rust code must decide when to call await, or an implementation of Future::poll must decide when to return. A buggy implementation of Future will hang an async executor thread just like a buggy Lua coroutine might hang a game loop.

In piccolo, running Lua code acts very similarly to a Rust "task" (a term for something that implements Future and is run on an async "executor"), and like Rust tasks, they can easily be run concurrently. However, piccolo works very hard to guarantee that piccolo::Executor::step returns in a bounded amount of time, even without the cooperation of the running Lua code. So, by using several independent piccolo::Executor "tasklets" and multiplexing calls to each piccolo::Executor::step, we can give Lua pre-emptive multitasking.

It's easier to understand with a demonstration. The two REPLs below are connected to one Lua instance. Instead of a single print function, they have the functions print_left and print_right to print in the left or right REPL console. They share global variables, so we can use this to demonstrate that the two interfaces are running Lua code on the same VM.

In the left REPL, type something like this

i = 0
while true do
    i = i + 1
end

While that is running in the left REPL, in the right REPL type this:

while true do
    print_right(i)
end
↓ These two REPLs are connected to the same Lua instance! ↓

You should notice that it appears that two separate Lua REPLs access the same state, seemingly in parallel! In fact, if you copied the exact code above, the right REPL probably prints values of i seemingly at random, every thousand and a half or so increments.

In this demo, this behavior comes from the way that running Lua code is run inside setInterval callbacks... The REPLs here work exactly the same as any of the REPLs above except that they both share a Lua instance, and this really is the only difference. There are two setInterval callbacks calling Executor::step being run by the browser at the same time and each callback is run in a round-robin fashion.[17] In a plain Rust environment you could get the same behavior by looping and calling Executor::step for one executor then another in turn, in a simple round-robin scheduler. This is very similar in a way to OS threads which also are pre-emptively scheduled, but instead of using an OS scheduler, we write our own scheduler and execute some running Lua for a time slice via calls to Executor::step.[18] In fact, though you can't observe actual data races here or even an analogue of it within Lua, you can observe a mirror of other race condition problems that OS concurrency primitives are meant to solve, and a custom scheduler for these Lua "tasklets" might even want to provide a version of common OS primitives to prevent them and aid scheduling.[19]

This is also again, VERY similar to how rust Futures (well, tasks) work when running on an async executor. Async tasks can be round robin scheduled or in some more advanced way, but each task has a "slice" of execution given to it by calling its Future::poll method. The difference in piccolo is that Lua scripts are not actually aware that they are being scheduled this way, and from the perspective of Lua, this scheduling happens pre-emptively. Rust callbacks in piccolo are more privileged than this and actually receive a piccolo::Fuel object that they can use to consume fuel proportional to their work, and must be trusted to cooperatively schedule themselves (you can always incorrectly write loop {} in a callback, after all), but Lua code cannot break out, at least not on its own.

Yet another way to look at this is that piccolo executes Lua code sort of as if there are something like invisible coroutine.yield statements inserted everywhere, but ones that operate on a different level of abstraction from the real coroutine.yield, ones which regular Lua code cannot interact with.

Let's imagine we transformed the above code that I asked you to type in the paired REPLs into something like this in plain Lua:

-- This example is in a big function loaded in the REPL below so you can easily
-- call it, and I'll do the same thing with later examples. I'm not gonna ask
-- you to type *that* much.
function coroutine_example()
    local i = 0

    local co1 = coroutine.create(function()
        while true do
            i = i + 1
            coroutine.yield()
        end
    end)

    local co2 = coroutine.create(function()
        while true do
            print(i)
            coroutine.yield()
        end
    end)

    while true do
        coroutine.resume(co1)
        coroutine.resume(co2)
    end
end
↓ The above code is loaded here you want to run it ↓

This behaves sort of like the code above but in a much more predictable way. If you know Lua and are comfortable with coroutines, you can probably tell that the above code is pretty much just a convoluted version of a simple for loop, but it's enough to demonstrate the idea. We have two coroutines that execute their own loops independent of each other and we schedule between them, but this time we require that the body of the coroutines decide where to yield to the scheduler to allow the other task to run. Stackless execution in piccolo is almost the same as if we could painlessly automatically insert these calls to coroutine.yield everywhere in the body of our running Lua tasks and use this to pre-emptively rather than cooperatively schedule them.

Fuel, Pacing, and Custom Scheduling

In the last section where I transformed the code executed in the concurrent REPLs into basic Lua coroutines, you may have noticed a big difference between the two. In the first example, the scheduling between the two Lua REPLs was somewhat random and hard to discern, the left REPL would increment the i value more than a thousand times for every time the right REPL printed the value of i, but in the second example the first task would increment the i value once for every time the second task printed i. The reason for this has to do with how the javascript for this page is actually written, but it's a perfect, simple example of something using piccolo enables: custom tasklet scheduling.

I've mentioned "fuel" before in the context of piccolo::Executor::step. Here is the real signature of Executor::step inside piccolo:

impl Executor {
    /// Runs the VM for a period of time controlled by the `fuel` parameter.
    ///
    /// The VM and callbacks will consume fuel as they run, and `Executor::step`
    /// will return as soon as `Fuel::can_continue()` returns false *and some
    /// minimal positive progress has been made*.
    ///
    /// Returns `false` if the method has exhausted its fuel, but there is
    /// more work to do, and returns `true` if no more progress can be made.
    pub fn step(self, ctx: Context<'gc>, fuel: &mut Fuel) -> bool {
        ...
    }
}

The method requires a fuel: &mut Fuel parameter to, well, "fuel" the VM, and the running VM consumes this fuel as it runs. Fuel is a very simple wrapper around an i32 value (you can see the current implementation here), that is decremented by Executor as it runs Lua code, and also optionally by any Rust callback that it calls. piccolo's ultimate goal is to enable treating all loaded Lua code as potentially malicious, but Rust callbacks are never on the other side of this security boundary. Callbacks are meant to cooperate with the execution system of piccolo and act as part of the security boundary to potentially malicious Lua, and as such, they can consume Fuel or even "interrupt" the "flow" of fuel to the Executor that calls them.

This system makes a lot of sense to provide, and not only strictly for security. piccolo's goal is to enable Lua tasks to run concurrently not only with Rust but with each other, and as such there are many ways we we might want to give certain tasks more or less time to run. We could imagine a game engine where we want to provide a sandbox for running Lua code such that no matter what, if the script is badly written or buggy, that the game simulation can continue without being compromised. Tasks could be scheduled such that they are assigned a certain amount of fuel "per second" up to a predefined "tank limit", giving them a kind of "burst" fuel. In this way, a task that periodically needs a lot of computational power can get it, but a broken task that has infinite looped will always use a much smaller amount of sustainable fuel per frame.[20]

Besides just "consuming" fuel, another thing a Rust callback can do is interrupt fuel. This is quite similar in behavior to just consuming all of the remaining fuel so the difference isn't that important, but it exists to mesh well with the use case described before, where we want to give tasks a sizeable "reserve tank" of fuel. "Interrupting" fuel flow makes the outer Executor::step immediately return to the Rust code calling it, no matter the amount of fuel currently consumed. This is mostly useful for technical purposes, for example if one Lua task is waiting on some event and cannot possibly currently make any progress, or if some callback must return to the outer Rust caller immediately to take effect.

This is what is happening on REPLs on this page when print is called! I noticed when testing the REPL I was writing that calling print in a hot loop slowed down the whole page quite a lot, and mostly just to create and destroy a bunch of output divs faster than they could be read as the output went far past the console scrollback limit. So, to fix this, I made print callbacks in this page always call Fuel::interrupt to make the page more responsive during hot loops. When you call print in a REPL on this page, it immediately yields control to the browser task queue! This is the sort of thing that having deep control over VM scheduling allows you to do: customize it to make it work in many different situations.[21]

"Symmetric" Coroutines and coroutine.yieldto

It's tough to talk about coroutines because there tend to not be universally agreed upon definitions, but I'm going to try. You might want to have the wikipedia article on coroutines and possibly this paper open if you want the full extra credit for this section.

Lua has what is usually referred to as "asymmetric coroutines", which are (as far as I can tell) the most commonly seen type of coroutine. This is also the same sort of coroutine that Rust supports with the (unstable) std::ops::Coroutine trait. As such, this can feel like a fancy term for a simple idea, but it refers to the limitation that coroutines yield only to their caller. It is also possible to instead support fully "symmetric" coroutines that can yield to any other coroutine, not just the calling one!

This is probably easier to understand expressed as a Lua function that almost provides what we want. Symmetric coroutines work as if we had the following function in Lua:

-- We want to yeild control to another coroutine directly, so we need to yield
-- control *and* resume another coroutine somehow at the same time.
function yieldto(co)
    -- Unfortunately this doesn't quite work as is, because there is no such
    -- thing as a "tail yield" in standard Lua. This will resume the given
    -- coroutine, but it will keep around a stack frame for *this* coroutine
    -- until the entire sequence of coroutines eventually returns, which may
    -- be *never*. As it is, this "works" but it is a stack leak.
    coroutine.yield(coroutine.resume(co))
end

We want a function that can yield to another coroutine by resuming that coroutine and yielding to the caller... whatever that other coroutine would have yielded. The problem is that this function as written is a stack leak: there is no way for normal Lua to "tail yield" like it can "tail return", the yieldto function as written will consume stack space for the current call to coroutine.resume, only giving the stack space up when the stack of coroutines eventually finishes. True symmetric coroutines do not have this limitation, and can mutually yield to each other without bound.

Because of the way piccolo's Executor works, Lua control flow that might normally be expressed as a Rust function call (such as a callback resuming another coroutine) is reified into the structure of the Executor, and the actual Rust control flow always "jumps back up" to Executor::step. This is actually the origin of the term "trampoline style" when referring to stackless interpreters, that control flow always "jumps back" to the same place. In PUC-Rio Lua, coroutine.resume is a normal C function call, so it is impossible to directly support this "tail yield" operation and avoid the stack leak, but piccolo's design just so happens to allow easily providing this as a builtin function: coroutine.yieldto, enabling full symmetric coroutines!

Let's see how it works...

-- I'm deeply sorry for this example...
function browse_internet()
    local be_bored
    local be_optimistic
    local be_dooming

    be_bored = coroutine.create(function()
        while true do
            print("I'm bored, I think I'll mindlessly browse The Internet")
            coroutine.yieldto(be_optimistic)
        end
    end)

    be_optimistic = coroutine.create(function()
        while true do
            print("Maybe The Internet won't be so bad this time")
            coroutine.yieldto(be_dooming)
        end
    end)

    be_dooming = coroutine.create(function()
        while true do
            print("I think I need a break from The Internet")
            coroutine.yieldto(be_bored)
        end
    end)

    coroutine.resume(be_bored)
end
↓ You can run the 100% accurate Internet Simulator below ↓

Now, unless you're already pretty familiar with coroutines (or for some unearthly reason you decided to stop reading this and instead go carefully read the paper I linked earlier), you might not know that "symmetric" and "asymmetric" coroutines are actually of equivalent expressive power. Let's pretend that we don't have coroutine.yieldto and transform the previous example a bit to make up for it.

-- I'm still sorry...
function browse_internet()
    -- Because we don't have proper `coroutine.yieldto`, we need some way of
    -- returning to the outer level and resuming the next coroutine. We can't
    -- provide this as a function because there's no way around the stack
    -- leak, but we can provide it as an outer "runner".
    function run_coroutines(co, ...)
        -- Because we can't provide a normal function, we instead require that
        -- every coroutine always yield the coroutine that should be run *next*,
        -- and in this way we avoid the stack leak. I've kept this purposefully
        -- simple and am not handling resume arguments or errors.
        local _, next = coroutine.resume(co, ...)
        if not next then
            return
        end
        return run_coroutines(next)
    end

    -- Afterwards, we change every call to `coroutine.yieldto` to
    -- `coroutine.yield`, and wrap the coroutines in our "runner".

    local be_bored
    local be_optimistic
    local be_dooming

    be_bored = coroutine.create(function()
        while true do
            print("I'm bored, I think I'll mindlessly browse The Internet")
            coroutine.yield(be_optimistic)
        end
    end)

    be_optimistic = coroutine.create(function()
        while true do
            print("Maybe The Internet won't be so bad this time")
            coroutine.yield(be_dooming)
        end
    end)

    be_dooming = coroutine.create(function()
        while true do
            print("I think I need a break from The Internet")
            coroutine.yield(be_bored)
        end
    end)

    run_coroutines(be_bored)
end
↓ After the above transform, our simulator is still 100% Accurate ↓

So, the coroutine.yieldto function that piccolo provides doesn't actually make Lua fundamentally any more powerful, instead it is more of a convenience. So why bring this up? Well, besides it being a very neat function to be able to provide, and piccolo being able to provide it in a way that doesn't require any outside "runner", I wanted to bring attention to the idea of transforming code like this.

It's no coincidence that piccolo has an easy time providing coroutine.yieldto. The above transform takes normal control flow and turns it into control flow via return values. This is very nearly the exact same transform that has already been done by piccolo's stackless design that I've been talking about this whole time.

In fact, let's look at the actual implementation of coroutine.yieldto in the code for the coroutine lib inside piccolo:

coroutine.set(ctx, "yieldto", Callback::from_fn(&ctx, |ctx, _, mut stack| {
    let thread: Thread = stack.from_front(ctx)?;
    Ok(CallbackReturn::Yield {
        to_thread: Some(thread),
        then: None,
    })
})).unwrap();

Ignoring some of the unimportant details, we see that the 'yieldto' field is set to a callback function, and that callback function takes a single argument of a Thread. Then, it returns an enum value CallbackReturn::Yield and states which thread to yield to (the normal coroutine.yield function simply sets to_thread to None instead). This is exactly the same as the transform that we've already done above, which shows why this is so simple for piccolo to provide: piccolo::Executor already works like this.

The "Big Lie"

So far I have talked a lot about piccolo's unique design, and how it allows piccolo to have powers that other Lua interpreters can't have. I have been lying to you! The actual truth is rather complicated, and you need the context of everything I've said so far to fully understand it.

The real truth is... PUC-Rio Lua can already sort of do about 70% of the same things piccolo can do. In fact, piccolo is not uniquely designed at all, it is the natural conclusion to the way PUC-Rio Lua already works.

Let's start by doing something that I think almost nobody who uses PUC-Rio Lua or Luau or LuaJIT knows that they can do.[22] We're going to implement tasklets using the plain Lua C API!

I don't have the energy to get normal PUC-Rio Lua 5.4 working in a browser with Emscripten, so you won't be able to run these examples interactively, you'll just have to trust me (or set up a C build environment with Lua 5.4 and try them yourself). You'll also have to understand C and the PUC-Rio Lua C API to fully understand these examples, but hopefully I can comment them enough to show what's going on even if you don't.

#include <assert.h>
#include <stdbool.h>

#include "lauxlib.h"
#include "lua.h"
#include "lualib.h"

// We will set up a "hook" function for the Lua VM to periodically call.
//
// In this case, the "hook" function always *yields*, which will only work if
// the calling Lua thread is itself a coroutine (and not the main thread).
//
// We are sort of using the "hook" function to *externally insert* calls to
// `coroutine.yield` periodically in otherwise unmodified Lua.
void yield_hook(lua_State* L, lua_Debug* _ar) {
    lua_yield(L, 0);
}

int main(int _argc, char** _argv) {
    // Open the main Lua state and all of the Lua stdlib.
    lua_State* L = luaL_newstate();
    luaL_openlibs(L);

    // Create a thread separate from the main one to use as our coroutine
    // thread.
    lua_State* co = lua_newthread(L);

    // Load *unmodified* Lua code, no manual calls to `coroutine.yield` are
    // necessary here.
    assert(
        luaL_loadstring(
            co,
            "while true do\n"
            "    print('hello')\n"
            "end\n"
        )
        == LUA_OK
    );
    // Set our hook function on the coroutine thread, *forcing* the coroutine to
    // yield whenever the hook is called.
    //
    // In this case, the hook will be called every 256 VM instructions.
    lua_sethook(co, yield_hook, LUA_MASKCOUNT, 256);

    // Our main loop.
    //
    // Every time through the loop, we resume our coroutine. The hook function
    // *externally* causes the called Lua code to periodically yield without
    // having to modify our Lua source code to manually add `coroutine.yield`
    // statements.
    //
    // When running this C code with my current version of Lua 5.4, I see 64
    // "hello" lines for every 1 "there" line, showing that execution correctly
    // periodically returns to C.
    while (true) {
        int nresults;
        assert(lua_resume(co, NULL, 0, &nresults) == LUA_YIELD);
        lua_pop(co, nresults);
        printf("there\n");
    }
}

The example above shows a fully working, if simplistic, Lua tasklet system. In the same way that piccolo's Executor::step function works "as though" there are invisible periodic calls to coroutine.yield everywhere, calling lua_yield from a lua_Hook function also (and much more literally) inserts invisible periodic calls to coroutine.yield. This is more or less everything required for a tasklet!

PUC-Rio Lua can do about 70% of what piccolo can do, right out of the box! The problem is the last 30%. Let's modify the example above very slightly...

#include <assert.h>
#include <stdbool.h>

#include "lauxlib.h"
#include "lua.h"
#include "lualib.h"

// Same as last time, we effectively insert invisible periodic calls to
// `coroutine.yield`.
void yield_hook(lua_State* L, lua_Debug* _ar) {
    lua_yield(L, 0);
}

int main(int _argc, char** _argv) {
    // Open the main Lua state and all of the Lua stdlib.
    lua_State* L = luaL_newstate();
    luaL_openlibs(L);

    // Create a thread separate from the main one to use as our coroutine
    // thread.
    lua_State* co = lua_newthread(L);

    // Still *unmodified* Lua code with no manual calls to `coroutine.yield`.
    //
    // We make one small change though, before calling `print('hello')`, we call
    // `table.sort` to sort a Lua table. The callback isn't important here, but
    // what's important is that `table.sort` is a C function which calls a Lua
    // function (the comparator).
    //
    // We put a big for loop in the comparator function just to make sure the VM
    // spends some actual time here, but no matter what, the same behavior will
    // eventually occur if you use Lua -> C -> Lua callbacks at all.
    assert(
        luaL_loadstring(
            co,
            "while true do\n"
            "    table.sort({3, 2, 1}, function(a, b)\n"
            "        for _ = 1,1000000 do end\n"
            "        return a < b\n"
            "    end)\n"
            "    print('hello')\n"
            "end\n"
        )
        == LUA_OK
    );
    // Set our hook function just like last time to execute every 256 VM
    // instructions.
    lua_sethook(co, yield_hook, LUA_MASKCOUNT, 256);

    // Main loop, unmodified from the previous example.
    while (true) {
        int nresults;
        assert(lua_resume(co, NULL, 0, &nresults) == LUA_YIELD);
        lua_pop(co, nresults);
        printf("there\n");
    }
}

If you try and run this C code, it will immediately error on this assert in the main loop:

        assert(lua_resume(co, NULL, 0, &nresults) == LUA_YIELD);

The reason for this is that lua_resume is erroring and returns LUA_ERRRUN instead of LUA_YIELD. This is happening because lua_yield, which is being called from our "hook" function, cannot be called from within most C callbacks. What is the C callback? It's our call to the stdlib function table.sort within the body of the tasklet loop. At the time that the call to lua_yield is called incorrectly, the C call stack looks something like this (simplified):

main -> lua_resume -> luaV_execute (the main VM loop) -> sort (in ltablib.c) -> lua_call -> luaV_execute (the main VM loop again, for the comparator) -> yield_hook -> lua_yield

PUC-Rio Lua uses the normal C stack for much of its internal state, and calls to lua_yield are expressed as a C longjmp, jumping back up to an upper C frame and popping any call frames in-between from the call stack. So, certain operations are simply disallowed when the inner call to longjmp would destroy essential information about the runtime state.

There IS a way around this problem, however. Ultimately, the problem is that the call to table.sort, a C function, in turn calls a Lua function with the C API function lua_call. Any Lua function called this way is disallowed from calling coroutine.yield (or its C equivalent lua_yield). PUC-Rio's C API provides a special version of lua_call to get around this: lua_callk. You can read in more detail about the entire situation in the section of the PUC-Rio Lua 5.4 manual called Handling Yields in C.

This does work, and in this way, PUC-Rio Lua provides the ability to yield from situations like this Lua -> C -> Lua sandwich. However, table.sort is not written this way, and in fact almost none of the stdlib is written this way at all![23] The reason for this is, frankly, that transforming C code to work this way is enormously difficult. The C code in question must be able to handle a longjmp, when the inner Lua code triggering a longjmp will destroy (not even unwind!) the current C stack up to where it was called, and the only way for the C code to resume is through the lua_KFunction and lua_KContext passed to lua_callk. There are no Drop impls to rely on, no automatic memory management, no coroutines, the C code must be transformed so that it relies entirely on a type pointed to by a lua_KContext for its state, so that it can be suspended at any time.[24]

This is not the only problem, either. By repurposing normal Lua coroutine yields like this to yield back to C, you take away Lua coroutines from the usable part of the Lua language. If we were to try to use normal coroutines in our tasklet system, the inner lua_yield from the hook function would just yield to the nearest thing that has called lua_resume, which in this case would be the Lua thread which called coroutine.resume and not the top-level C code. I love coroutines,[25] and Lua without coroutines is frankly no longer really Lua, but with enough effort, I think you could get around this problem too! Remember the transform we did before, where we made symmetric coroutines out of asymmetric ones? You can do something similar with the normal Lua C API but wrapping coroutine.yield in a special version that instead returned whether or not it is a real yield or a synthetic one from the lua_Hook function. You would have to go further than this to make it work, restructuring all of the other coroutine methods so that which thread was waiting on the results of which other thread was kept in an array rather than the C stack, so that the coroutine "tasklet" system continues to work while providing a similar, inner system for "normal" Lua coroutines.

You could also do the work of re-implementing every single function in the stdlib that calls back into Lua in such a way that it used lua_callk with a continuation function instead of lua_call, too, so that every C function in the stdlib became suspendable. For good measure, you could also periodically yield long running callbacks even if they didn't call back into Lua, just to make sure that execution always jumped back out to the outermost C code in a bounded amount of time.

So lets summarize this list of theoretical changes we can make to PUC-Rio Lua to make a complete tasklet system.[26]

  1. Use lua_Hook functions to insert synthetic calls to lua_yield within all Lua code.
  2. Make all of the stdlib that calls Lua functions (including those that implicitly call metamethods) suspendable by using lua_callk and continuation functions.
  3. Reimplement the Lua coroutine library making it one level of abstraction up from normal calls to lua_yield, so that normal Lua coroutines can still work. We would need to implement coroutine.resume in a different way that does not use the C stack. We can do a transform similar to implementing "symmetric" coroutines over "asymmetric" ones here, where we implement "Lua" coroutines over our lower level "synthetic" yielding. Lua calls to coroutine.yield and coroutine.resume would now both be a yield to the calling C code, and the yielded values would tell the outer C code what to do next (whether to resume another coroutine or yield to whatever the "upper" coroutine was). As a side effect, coroutine.yieldto becomes easy to implement.
  4. For good measure, keep track of some unit of time cost in all callbacks, and insert calls to lua_yieldk in all long running callbacks so we know that control will always return to the outer calling C code in a reasonable amount of time.

We have now reached, very roughly, the current design of piccolo.

Rust Coroutines, Lua Coroutines, and Snarfing

In the previous section I laid out a rough explanation of how to transform PUC-Rio Lua as it exists today and build a system similar to what piccolo forces by design. However, I am not aware of anyone ever doing anything like this on a grand scale.[27] The reason for this, I think, is simple, and that is that it is just monumentally hard to write C callbacks this way!

The same problem exists in piccolo though, which I alluded to near the beginning of this post. In piccolo, long running callbacks are represented by a trait called Sequence which allows them to be suspended. More precisely, it is not so much that they are suspended as it is that their API must mirror the outer Executor API in piccolo: they must be polled to completion. Now, the situation is not nearly as bad here as trying to use lua_callk / lua_pcallk / lua_yieldk in plain C, but fundamentally it can still be more than a little painful.

The Sequence trait shares a lot in common with the Future trait, in that both represent an operation that must be polled to completion. Like I said before when I was introducing the "stackless" design, this similarity is no accident.

I used the slang word "snarf" casually near the beginning of this post without really explaining it. As I understand it, snarfing is something from PLT jargon where if you implement an inner programming language B in an outer programming language A, features from language A can be very easily and automatically incorporated into language B. The most common example I see here is actually garbage collection, if you implement a runtime for a garbage collected language within another garbage collected language, and you're okay with the GC semantics from the outer language being reflected in the inner language, then you can snarf garbage collection from the outer language. Think of implementing Lua in something like Go, even though the specifics of the GC semantics in Lua may not be expressible in Go,[28] it would probably be worth it to just snarf garbage collection from Go and use plain Go pointers as Lua references.

Snarfing can also be simpler things like implementing the stdlib of the inner language using the stdlib of the outer language, in PUC-Rio Lua, there is actually a good deal of functionality snarfed from C, most of it bad (like os.setlocale).

With all of this context finally out of the way, I can say what I've wanted to say almost from the beginning of this very long blog post: The original design I wanted with piccolo and gc-arena was for Lua to snarf coroutines from Rust. I'm going to talk about this in more detail in a future post because this post is so long already, but Sequence's similarity to Future is because I want to use Rust coroutines to implement piccolo.

Think about it... why is PUC-Rio Lua's C interpreter written the way it is? Why do lua_callk and lua_pcallk and lua_yieldk exist at all... they exist because C does not have coroutines. This entire post I have been dancing around this issue without addressing it because I feared it wouldn't make sense without a mountain of context, but the entire time I've been talking about "reifing state" that would "normally be held inside the call stack" into objects that can be "polled to completion"... that is the very core of what a coroutine is.

The only real downside to gc-arena and piccolo is having to do this transform manually rather than letting the Rust compiler do it. The pain of using gc-arena and piccolo is THE SAME pain that existed before Rust Async was stabilized, with Future combinator libraries having to fill the gap. In fact, an older version of gc-arena tried to provide combinators like this to try and fill the gap, but making it fit into piccolo in a generic way was just too painful and the combinator library was dropped. piccolo::Sequence actually comes from the remains of this combinator library.

And all of this exists solely because I can't figure out how to make a Rust coroutine implement gc_arena::Collect.[29] If I could figure this out, all of the problems with gc-arena and piccolo could melt away, and the Rust compiler could do the painful transformation into "stackless" interpreter design largely for us. Even the term "stackless" is shared with Rust coroutines.

Zooming Out

I'm gonna spend a bit of time here zooming out some more. Hopefully I won't zoom out so far that I stop even being anchored to reality.

I think Rust's really cool core idea is the same that is shared by all systems programming languages: that they are meant to be the last stop in a line of other choices. I honestly don't think every single bit of software needs to be written in Rust or any systems programming language. To me, systems programming languages are languages where if you need to make system A work with system B, no matter what those systems are, you can use them. Rust and C are languages that you're supposed to use when what you're making needs to fit almost anywhere. They're supposed to be the languages with the fewest possible assumptions about how the rest of the world works, becasue they're meant to be a host or glue language to inner systems with better assumptions, which are more fit to purpose.

I know that this perspective on systems programming languages is not universal, and that the real world is actually quite resistent to putting things into neat little boxes like this, but I think this perspective is at least a useful one.

As such, I always flinch a little when I see people trying to write systems in Rust as though they're trying to figure out the one solution for something, assuming that no other solution would EVER need to exist within the same program, or even need to exist at all. I think one size fits all solutions to problems are not where Rust's strength is. Global async reactors / executors, whole-program garbage collectors,[30] heck even whole program allocators,[31] all of these things always make me some amount of uncomfortable because I just think that.. systems programming languages are meant for making BIG end products or libraries that last a long time, where more than one of these kinds of systems might need to exist at once, or you may need to take them apart and use them a la carte. It's not that I think these are wrong to use or wrong to make, I just don't prefer to use those kinds of solutions myself if I can avoid them because also honestly the user of my library knows better than me. There's a tradeoff in programming between flexibility and fitness for purpose, systems programming is the way it is because it's supposed to be ultimately flexible, it's what you use when a more friendly, more tailored tool isn't flexible enough. I don't like going against that idea when writing code that I want to last for a long time.[32]


One of my favorite parts of pretty much every version of Lua is how painless it is to have multiple copies of the interpreter. If you've ever run into large garbage collector pauses in other languages, this rarely happens in Lua not because its garbage collector is just that good, but because you aren't forced to have just ONE of them in your program, you can have as many of them as you need, each in isolation from each other! Lua is a language that actually meshes very well with my vague ideas about the core idea of systems programming, because PUC-Rio Lua was written to fit almost anywhere. It's actually amazing how neatly it fits into C, how it is fine being the bottom of the hierarchy, it's just a C library and you just call C functions. Two distant parts of a program both use Lua? Different versions of Lua with different loaded libraries? You can make it work! It doesn't read external files unless you tell it to, it doesn't do anything unless you tell it to because it makes very few assumptions. It's your tool, and it fits neatly into whatever program you already have. I think this is why it has remained so popular for so many years. [33]

I want to make a version of Lua that feels for Rust like PUC-Rio feels for C, but to go even further. I want to make a version of Lua that fits anywhere as much as I possibly can make it. Untrusted input! Cancellation! I want to make a version of Lua with the fewest possible opinions about how the rest of the program is structured. I know piccolo is a little far from that in several ways right now, but that's my ultimate goal. I think stackless interpreters actually fit this idea of being as unobtrusive as possible, of fitting anywhere better than classical language interpreters.

Garbage collection systems in general are very often at odds with this idea of fitting anywhere. There can only be one boehm gc in a C program, after all. It's interesting to me that garbage collector systems as a library for C have to be so much slower than garbage collectors written for other languages but written in C. The problem is not that C can't write fast garbage collection systems, the problem is that the C language itself has so few "global assumptions" built into it. It's much easier to write a garbage collector for a language that can annotate where GC pointers are on the language's call stack or in heap allocated objects than one like C, where it is extremely difficult to have such things.


Progress in systems programming languages seems to happen when new abstractions are invented that give new fundamental powers but do NOT introduce more required assumptions. I think coroutines are one of these, and that all systems programming languages should have a stackless coroutine system because it is the sort of thing that can fit into other systems as much as possible. I think there is also some kind of deep connection between higher level languages whose compilers / interpreters do things like annotate where garbage collected pointers are stored in the language's call stack or automatically insert garbage collector safe points, and the idea of coroutines as a general reification of the call stack itself, letting the language do this manipulation rather than a specialized compiler.

I came up with this connection way back in early 2019, but if we could make Rust coroutines implement Collect, then this makes yield / await into an abstraction of the garbage collector safe point. When a Future or Coroutine yields control to the caller, all of the (apparent) stack variables are guaranteed to be stored inside the state of the running coroutine. This would allow gc-arena to easily separate collection and mutation in normal, straight line Rust code that is simply annotated with awaits (or yields) to mark where garbage collection can safely take place in the same way that a higher level language runtime inserts safe points to mark where garbage collection can safely take place.[34]

I think Rust is so close to having some very interesting, novel powers with its coroutines by simply being able to combine existing features together. I can automatically serialize a custom struct with #[derive(Serialize)], and I can automatically transform a function body into a state machine, but what I cannot do is #[derive(Serialize)] this state machine, nor can I #[derive(Collect)] it. Why not??

This deserves its own blog post, but I felt like I couldn't rightly close out this post without at least mentioning it. In my next post I'm going to explore this idea more fully and hopefully try and actually make it "work" by pretending to be able to #[derive(Collect)] a coroutine. I think that Rust might need more than just this feature to make this system workable, but if it did work, it could represent a largely painless, general, isolated system for tracing garbage collection in Rust. A garbage collection system that can fit anywhere.

Bye!