Thoughts on JIT

A week ago I detailed Vua, a memory safe language that unprivileged Viridis programs have to use in order to exist within a single memory context.

I also mentioned that I have a decent amount of preliminary work done generating bytecode for this language, and the next logical step seems to be interpreting that bytecode and then, likely, JIT’ing it. This is a pretty standard path, and JITs usually strike a good compromise between the flexibility of loosely typed dynamic languages and the hard-earned performance of tuned assembly.

However, now that I’ve converted modified Lua syntax into modified LuaJIT bytecode to a first approximation, and I’m pondering writing a VM to execute this bytecode, I’m wondering if it’s a better idea to skip the VM entirely and instead make some simple modifications to the bytecode so it serves as an intermediate representation of the program rather than an (easily) executable artifact itself. This way, limitations on the bytecode could be lifted (like the fixed instruction size and the need to encode constants in 8-bit fields), and Vua could still have a version of LuaJIT’s super-fast parser. But instead of writing a VM, I could just write an architecture-specific backend for the compiler to convert this intermediate bytecode into assembly.

My reasoning here is that, the more I think about it, the less I think a VM is really buying us much. In a scripting environment, where Lua and JITs in general find a lot of traction, a lot of code is short running, likely to execute once and only once, so skipping the expensive compilation stage actually improves performance, and JITing then handles the potentially longer running, more intensive parts of the code.

Well, short running doesn’t describe much of our use case. In fact, pretty much everything at the system level is going to be running the entire time the machine is up, so even “long running” sounds like an understatement compared to “always running”. In that context, it seems likely to me that the overhead of AOT (ahead of time) compiling everything is going to be recouped over the running life of the system, especially if we have a kernel-controlled cache of true binaries on disk.

Another advantage of the VM is debuggability. The VM knows a lot about the code it’s running, and when something breaks it’s usually the VM that handles it. It knows the exact bytecode instruction, the exact error, and the line in the source, whereas with a traditional binary the kernel can only be so specific (e.g. this program caused a hardware exception, or misused a kernel interface). The VM also provides a body of common code in software, which conveys certain debugging advantages (e.g. the ability to stop whenever a given table is read or written without having to identify every single place in the code where a read or write is done).

But Viridis is free from this accuracy problem because of the 100% opt-in to Vua. The “virtual machine” for Vua could actually be the physical machine: Viridis can know the exact instruction and exact error because it also knows where the source is and compiled it to assembly itself. As for common code, I’m willing to give up this ease in favor of GDB-style debugging, or simply debug recompilation, especially since I bet that this common code is actually bad for performance. Yes, it’s more likely to be in cache, but similar to the highly optimized instruction dispatch code in LuaJIT’s VM, we likely gain more from having far fewer and more predictable branches than we do from preventing cache misses.

I also like the idea of static compilation because it gives us a freer hand with assembly, and particularly register usage. The LuaJIT VM still obeys the platform’s C ABI because it’s designed to cooperate with arbitrary C code. We don’t have that restriction (there is no arbitrary C code, just kernel that we can warp however we want), but even without it, the most efficient use of registers in a VM is to pin certain info and locations into known registers and then use them consistently in specific instruction handlers. For example, the LuaJIT VM always has a register with the current instruction in it. Obviously that makes sense when every handler is likely to need information from the instruction to complete its work, but static compilation doesn’t need to care about that. Same with having a register that always points to the constant lookup table, etc. etc. During execution, VM “registers” are almost always just stack locations, because the VM’s handlers aren’t flexible enough to use more than the handful of hardware registers whose roles it has pinned down.

Which isn’t to say that the VM handlers are poorly written; it’s just that adding comparisons to deal with hardware registers and stack “registers” in the same code erases any performance gain from using the hardware registers in the first place. You could specialize the bytecode, but then your bytecode isn’t portable, and you double the amount of code required to deal with pretty much every instruction.

Anyway, tight register usage is fine when you’re trying to keep to yourself and coexist with C… but with Viridis, if the VM never uses a register, say R11, it just never gets used outside of the kernel, and that’s obviously not acceptable.

Aside from the VM, the benefits of JIT, like being able to use runtime feedback optimization, aren’t off the table with static compilation. Meanwhile other interesting, if heretical, avenues of optimization open up. Like what if you could do extreme inlining across the program->library->kernel barriers? I’m interested in creating a system where there is no set ABI, except for interacting with the kernel. Of course that’s all future work.

The only real problem I see immediately is that static compilation makes a loose type system quite a pain. If a function can be called with multiple types, or getting a value from a table can yield multiple types, that’s hard to reconcile in pure assembly – which is part of why successful JITs carefully select the pieces of code they compile. I need to do more research into how this should be dealt with, but I’m confident that a combination of guards and repeated, specialized compilation can be used to manage dynamic typing on the fly.
