Using Turbolizer to inspect the V8 JIT compiler

Browser engines have sophisticated JIT compilers optimizing your JS code, but the process is usually completely opaque. You normally get no information about what code is getting JIT compiled, how loops might be unrolled, how memory accesses are optimized, and so on. V8 in particular is pretty good at hiding all of this, even when you're using the debugger or a profiler.

But V8 does give you the ability to inspect the JIT process, using the --trace-turbo flag and a tool called Turbolizer. You can generate a trace from the command line, then view it with the visualization tool provided in the V8 repo:

node --trace-turbo <(echo "for (var i = 0; i < 10000000; i++) { i * 2; }")

Some notes: the --trace-turbo option refers to TurboFan, V8's optimizing compiler, which only compiles code it considers hot. Using a small number of iterations (like 100) won't produce any optimized code, since the loop never gets hot enough; upping it to 10000000 will.
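If you don't want to rely on the heuristics, you can also force optimization by hand. Here's a minimal sketch, assuming the loop lives in a file (test.js is a name I made up) and that your node build accepts V8's --allow-natives-syntax flag; the %-prefixed calls are V8 internals, not standard JS:

// test.js: ask TurboFan to compile hot() instead of waiting for it to get hot
function hot() {
  for (var i = 0; i < 10000000; i++) { i * 2; }
}
%PrepareFunctionForOptimization(hot); // newer V8 versions want this first
hot();                                // run once so V8 gathers type feedback
%OptimizeFunctionOnNextCall(hot);     // request TurboFan compilation
hot();                                // this call runs the optimized code

Running node --allow-natives-syntax --trace-turbo test.js should then produce a trace for hot regardless of the iteration count.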

The data generated consists of phases of the JIT compilation process, divided into nodes of the sea-of-nodes graph generated by the compiler:

generated data

Turbolizer visualizes this data, showing the control flow of the generated code at various points in the compilation process. From the first generated bytecode, where you can see the original branch (i < 10000000):

first step
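(As an aside, you don't need Turbolizer just to see this initial Ignition bytecode; V8 can print it directly. Assuming your node version passes these V8 flags through, and reusing the hypothetical test.js from above:

node --print-bytecode --print-bytecode-filter=hot test.js

The filter flag limits the dump to the hot function, since --print-bytecode on its own prints bytecode for everything, including node's internals.)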

To the last step:

last step

Notice that most of the additions here are DeoptimizeIf blocks or additional branches. A deoptimization block bails out to the unoptimized code if certain runtime checks fail (here, overflow checks). For more background on deoptimization, see https://users.soe.ucsc.edu/~renau/docs/iiswc16.pdf
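You can watch one of these bailouts fire by breaking the compiler's speculation on purpose. A sketch, using V8's --trace-deopt flag (deopt.js is a made-up file name):

// deopt.js: train add() on small integers, then violate that assumption
function add(a, b) { return a + b; }

for (let i = 0; i < 100000; i++) add(i, 1); // integer feedback; add() gets optimized
add(0.5, 1); // a float fails the small-integer check, triggering a deopt

Running node --trace-deopt deopt.js should log the deoptimization and its reason when the float arrives.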

Some other explanations: a "merge" is the opposite of a branch, a control-flow node that joins two paths back together; together with the effect chain, this is what makes sure operations aren't reordered. A V8 developer goes into more depth on these concepts here: https://stackoverflow.com/questions/57463700/meaning-of-merge-phi-effectphi-and-dead-in-v8-terminology
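To make that concrete, here's where those nodes come from in ordinary code: an if/else becomes a Branch with two control paths, the point where the paths rejoin becomes a Merge, and a variable assigned on both paths becomes a Phi at that Merge (this is the standard sea-of-nodes construction described in the links above, not something specific to this trace):

function pick(c) {
  let x;
  if (c) {    // Branch: control splits into a true path and a false path
    x = 1;
  } else {
    x = 2;
  }
  return x;   // the paths rejoin at a Merge; x here is Phi(1, 2)
}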

The actual optimization steps performed are the following: loop peeling, early optimization, store-store elimination, control flow optimization, memory optimization, and late optimization. In this example, the only steps that actually produce changes are loop peeling and late optimization, the latter eliminating a DeoptimizeUnless.
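For reference, loop peeling copies the first iteration of a loop out in front of it, so that work that only matters on the first pass can be hoisted out of the loop body. In source-level terms (a conceptual sketch; the compiler does this on its IR, not on your JS):

// before peeling
for (var i = 0; i < 10000000; i++) { i * 2; }

// after peeling (conceptually): the first iteration runs outside the loop
var i = 0;
if (i < 10000000) {
  i * 2; // peeled first iteration
  i++;
  for (; i < 10000000; i++) { i * 2; }
}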

The DeoptimizeUnless bailed out if its result didn't fit in a Smi, V8's "small integer" representation (a tagged 31- or 32-bit integer, depending on platform). It's not clear what produced that check in the first place, but it's easy to see why it could be eliminated: the largest number in the code above is 10,000,000, and the largest value computed is twice that, both well under the 32-bit limit.

eliminated step
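The arithmetic behind that elimination checks out on either kind of platform. The exact Smi range is platform-dependent (a 31-bit payload on 32-bit builds, 32 bits on 64-bit builds), but the loop's values fit comfortably in both:

const maxComputed = 2 * 10000000;    // 20,000,000, the largest value the loop produces
const smiMax32bit = 2 ** 30 - 1;     // 1,073,741,823 on 32-bit platforms
const smiMax64bit = 2 ** 31 - 1;     // 2,147,483,647 on 64-bit platforms
console.log(maxComputed <= smiMax32bit); // true, so the check can never fail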

Also of note is the turbo-<*>.cfg file, which contains several blocks of intermediate representation code, followed by the final assembly code. The blocks represent nodes in the control flow graph, a coarser representation of the program than the sea-of-nodes graph; this write-up is a helpful explanation of the difference: https://darksi.de/d.sea-of-nodes/

Here’s the full file generated from this example. It’s possible to pick out some elements from the original JS code, like the constant being defined as n75:

0 1 n75 Int32Constant[10000000]   pos:76 <|@

Then the branch after comparing the iteration counter to n75 (i < 10000000):

0 1 n30 Int32LessThan   n21 n75  type:Boolean pos:76 <|@
0 0 n31 Branch[None|NoSafetyCheck]   n30 Ctrl: n18 -> B7 B11  <|@
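My best reading of the fields in these lines, since the .cfg format doesn't seem to be documented anywhere (I haven't been able to pin down what the leading pair of integers means):

n30             the node id
Int32LessThan   the opcode
n21 n75         its inputs: the loop counter i and the constant n75 from above
type:Boolean    the inferred type of the node's value
pos:76          the source position (the same offset the constant carries)
<|@             the record terminator the .cfg format uses on every line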

Some other interesting things to note: block 1 contains a safety check on the OSR (on-stack replacement) values, referring to the way the compiler switches between optimized and unoptimized versions of code while it's running. It's a check that a certain OSRValue (the eighth index of this array) is a small integer, though it's not clear to me what this number represents or why it might not be a small integer.
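OSR itself is easy to trigger: it's what happens when V8 decides to optimize a function while that function is still executing, typically partway through a hot loop, swapping the live stack frame for an optimized one. A sketch, assuming V8's --trace-osr flag is passed through by node (osr.js is a made-up name):

// osr.js: f() is still on the stack when its loop gets hot,
// so V8 replaces the running frame with an optimized one (OSR)
function f() {
  let sum = 0;
  for (let i = 0; i < 10000000; i++) sum += i;
  return sum;
}
console.log(f());

Running node --trace-osr osr.js should log when the replacement happens.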

This presentation is helpful for understanding what the various opcodes represent, how they change over the stages of the compiler, how the sea-of-nodes graph is formed, and how the control flow graph is formed: https://docs.google.com/presentation/d/1Z9iIHojKDrXvZ27gRX51UxHD-bKf1QcPzSijntpMJBM/edit#slide=id.g19134d40cb_0_29

Conclusions

While this is a great tool for understanding how TurboFan works, it's pretty impractical for analyzing performance. The IR is hard enough to decipher even for a small example like this; for more complicated code it's likely far worse. It's hard to imagine a situation where reading the JITed code would help you where the Chrome profiler would not, unless you're developing V8 itself or trying to find bugs in it.

It also doesn’t give any insight into when the optimized code is generated or how the compilation interacts with the rest of V8. While there’s a script for using the tool with perf to trace the performance of the JITed code, this option requires first building V8 from source, then, on my system, cloning the Linux kernel and building perf from source, so I didn’t try this. But for those willing to put in the work, that seems like it could yield better insights about how the compiler affects runtime performance.