Peephole optimizations: adding `opt_respond_to` to the Ruby VM, part 4

In “The Ruby Syntax Holy Grail: adding opt_respond_to to the Ruby VM, part 3”, I found what I referred to as the “Holy Grail” of Ruby syntax. I’m way overstating it, but it’s a readable, sequential way of viewing how a large portion of the Ruby syntax is compiled. Here’s a snippet of it as a reminder:

// prism_compile.c
static void
pm_compile_node(rb_iseq_t *iseq, const pm_node_t *node, LINK_ANCHOR *const ret, bool popped, pm_scope_node_t *scope_node)
{
    const pm_parser_t *parser = scope_node->parser;
    //...
    switch (PM_NODE_TYPE(node)) {
      //...
      case PM_ARRAY_NODE: {
        // [foo, bar, baz]
        // ^^^^^^^^^^^^^^^
        const pm_array_node_t *cast = (const pm_array_node_t *) node;
        pm_compile_array_node(iseq, (const pm_node_t *) cast, &cast->elements, &location, ret, popped, scope_node);
        return;
      }
      //...
      case PM_MODULE_NODE: {
        // module Foo; end
        //...
      }
      //...
}

The file that code lives in, prism_compile.c, is enormous. pm_compile_node itself is 1800+ lines, and the overall file is around 11 thousand lines. It’s daunting to say the least, but there are some obvious directions I can ignore - I’m trying to optimize a method call to respond_to?, so I can sidestep the majority of the Ruby syntax.

Still, where do I go, specifically?

Sage wisdom

Helpfully, I got two identical sets of direction based on part 3. One from Kevin Newton, creator of Prism:

https://x.com/kddnewton/status/1872280281409105925?s=46

And one from byroot, who inspired this whole series:

https://bsky.app/profile/byroot.bsky.social/post/3le6xypzykc2x

I don’t want to jump to conclusions, but I think I need to look at the peephole optimizer 😆.

And exactly what is a “peephole optimizer”? Kevin described the process as “specialization comes after compilation”. From Wikipedia:

Peephole optimization is an optimization technique performed on a small set of compiler-generated instructions, known as a peephole or window, that involves replacing the instructions with a logically equivalent set that has better performance. https://en.wikipedia.org/wiki/Peephole_optimization

This seems to fit my goal pretty well. I want to replace the current opt_send_without_block instruction with a specialized opt_respond_to instruction, optimized for respond_to? method calls.
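To make the idea concrete, here is a toy peephole pass written in Ruby. It is only a sketch: the instructions are plain arrays I made up for illustration, not CRuby’s real instruction structs, and the `peephole` method is hypothetical. It slides a two-instruction window over the list and collapses an empty `newarray` followed by a zero-argument `freeze` send into one specialized instruction:

```ruby
# Toy model only: each instruction is a plain array, e.g. [:send, :freeze, 0]
# meaning "send :freeze with 0 arguments". CRuby does not store them this way.
def peephole(insns)
  out = []
  i = 0
  while i < insns.length
    cur = insns[i]
    nxt = insns[i + 1] # nil when cur is the last instruction

    if cur == [:newarray, 0] && nxt == [:send, :freeze, 0]
      # Replace the two-instruction window with one specialized instruction
      out << [:opt_ary_freeze]
      i += 2
    else
      out << cur
      i += 1
    end
  end
  out
end

before = [[:putself], [:newarray, 0], [:send, :freeze, 0], [:send, :pp, 1], [:leave]]
optimized = peephole(before)
p optimized
# => [[:putself], [:opt_ary_freeze], [:send, :pp, 1], [:leave]]
```

The real optimizer edits a linked list in place rather than building a new array, but the shape is the same: look at a small window, and swap it for something cheaper when it matches a known pattern.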

Finding the optimizer

So where are peephole optimizations happening in CRuby today? In Étienne’s PR, he added optimization code to a function called… iseq_peephole_optimize. A little on the nose, don’t you think? Kevin’s comment also mentioned iseq_peephole_optimize - seems like the winner.

I want to make the link between iseq_peephole_optimize and where we left off at pm_compile_node. Let’s dig into some code!

Disassembling an existing optimization

I’m going to use Étienne’s frozen array optimization to get to the optimizer and see how it relates. If you want to follow along, start with the setup instructions from part 3.

His optimization only applies to array and hash literals being frozen. So we’ll write a teensy Ruby program to demonstrate, and put it in test.rb at the root of our CRuby project:

# test.rb
pp [].freeze

The best way to run test.rb here is to use make. It will not only run the file, but also make sure things like C files get recompiled as necessary when you make changes. Let’s run our file, but dump the instructions it would generate for the Ruby VM:

RUNOPT0=--dump=insns make runruby

RUNOPT0 lets us add an option to the ruby call, so it’s effectively ruby --dump=insns test.rb. Here are the instructions we see - we can confirm that we are getting the optimized opt_ary_freeze instruction from Étienne’s PR:

== disasm: #<ISeq:<main>./test.rb:3 (3,0)-(3,12)>
0000 putself                      (   3)[Li]
0001 opt_ary_freeze               [], <calldata!mid:freeze, argc:0, ARGS_SIMPLE>
0004 opt_send_without_block       <calldata!mid:pp, argc:1, FCALL|ARGS_SIMPLE>
0006 leave
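If you have any CRuby handy, you can get a similar dump from inside Ruby itself using RubyVM::InstructionSequence. This is a small sketch, not part of the make workflow above, and which instructions you see depends on the Ruby version: a build without Étienne’s optimization should show a newarray followed by a send of freeze, rather than opt_ary_freeze.

```ruby
# Assumes CRuby: RubyVM::InstructionSequence is implementation-specific.
# Compile the same snippet as test.rb in-process and print its instructions.
disasm = RubyVM::InstructionSequence.compile("pp [].freeze").disasm
puts disasm
```

This only shows the final instructions, after every optimization has run. To see the moment the instruction changes, we still need the debugger.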

You never know what code is truly doing until you run it. So far, I’ve just been reading and navigating the CRuby source. iseq_peephole_optimize lives in compile.c - let’s set a breakpoint and take a look 🕵🏼‍♂️.

Using the debugger

We can debug C code in CRuby almost as easily as we can use a debugger/binding.pry.

On macOS, you can use lldb, and on Docker/Linux, you can use gdb. I’m going to do everything in lldb to start, but I’ll show some equivalent gdb commands afterwards.

Let’s start by looking at the peephole optimization code for [].freeze, inside of iseq_peephole_optimize. I’ll add comments above each line to explain what I think it’s doing:

// compile.c
static int
iseq_peephole_optimize(rb_iseq_t *iseq, LINK_ELEMENT *list, const int do_tailcallopt)
{
         // ...
         // if the instruction is a `newarray` of zero length
3469:    if (IS_INSN_ID(iobj, newarray) && iobj->operands[0] == INT2FIX(0)) {
             // grab the next element after the current instruction
3470:        LINK_ELEMENT *next = iobj->link.next;
             // if `next` is an instruction, and the instruction is `send`
3471:        if (IS_INSN(next) && (IS_INSN_ID(next, send))) {
3472:            const struct rb_callinfo *ci = (struct rb_callinfo *)OPERAND_AT(next, 0);
3473:            const rb_iseq_t *blockiseq = (rb_iseq_t *)OPERAND_AT(next, 1);
3474:
                 // if the callinfo is "simple", with zero arguments,
                 // and there isn't a block provided(?), and the method id (mid) is `freeze`
                 // which is represented by `idFreeze`
3475:            if (vm_ci_simple(ci) && vm_ci_argc(ci) == 0 && blockiseq == NULL && vm_ci_mid(ci) == idFreeze) {
                     // change the instruction to `opt_ary_freeze`
3476:                iobj->insn_id = BIN(opt_ary_freeze);
                     // remove the `send` instruction, we don't need it anymore
3481:                ELEM_REMOVE(next);

Now I’ll use lldb to see where this code runs in relation to our prism compilation. In CRuby, to debug you run make lldb-ruby instead of make runruby. You’ll see some setup code run, and then you’ll be left at a prompt, prefixed by (lldb):

> make lldb-ruby
lldb  -o 'command script import -r ../misc/lldb_cruby.py' ruby --  ../test.rb
(lldb) target create "ruby"
Current executable set to '/Users/johncamara/Projects/ruby/build/ruby' (arm64).
(lldb) settings set -- target.run-args  "../test.rb"
(lldb) command script import -r ../misc/lldb_cruby.py
lldb scripts for ruby has been installed.
(lldb)

At this point, we haven’t actually run anything. We can now set our breakpoint, then run the program. I’ll add a breakpoint on line 3476, right after all of the if statements have succeeded:

(lldb) break set --file compile.c --line 3476
Breakpoint 1: where = ruby`iseq_peephole_optimize + 2276 at compile.c:3476:17

With our breakpoint set, we call run to run the program:

(lldb) run

You’ll see something like the following. It ran the program until it hit our breakpoint, right after identifying a frozen array literal:

(lldb) run
Process 50923 launched: '/ruby/build/ruby' (arm64)
Process 50923 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: ruby`iseq_peephole_optimize(...) at compile.c:3476:17
   3473             const rb_iseq_t *blockiseq = (rb_iseq_t *)OPERAND_AT(next, 1);
   3474
   3475             if (vm_ci_simple(ci) && vm_ci_argc(ci) == 0 && blockiseq == NULL && vm_ci_mid(ci) == idFreeze) {
-> 3476                 iobj->insn_id = BIN(opt_ary_freeze);
   3477                 iobj->operand_size = 2;
   3478                 iobj->operands = compile_data_calloc2(iseq, iobj->operand_size, sizeof(VALUE));
   3479                 iobj->operands[0] = rb_cArray_empty_frozen;

I want to see where we are in relation to all our prism compilation code. We can use bt to get the backtrace:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: ruby`iseq_peephole_optimize(...) at compile.c:3476:29
    frame #1: ruby`iseq_optimize(...) at compile.c:4352:17
    frame #2: ruby`iseq_setup_insn(...) at compile.c:1619:5
    frame #3: ruby`pm_iseq_compile_node(...) at prism_compile.c:10139:5
    frame #4: ruby`pm_iseq_new_with_opt_try(...) at iseq.c:1029:5
    frame #5: ruby`rb_protect(...) at eval.c:1033:18
    frame #6: ruby`pm_iseq_new_with_opt(...) at iseq.c:1082:5
    frame #7: ruby`pm_new_child_iseq(...) at prism_compile.c:1271:27
    frame #8: ruby`pm_compile_node(...) at prism_compile.c:9458:40
    frame #9: ruby`pm_compile_node(...) at prism_compile.c:9911:17
    frame #10: ruby`pm_compile_scope_node(...) at prism_compile.c:6598:13
    frame #11: ruby`pm_compile_node(...) at prism_compile.c:9784:9
    frame #12: ruby`pm_iseq_compile_node(...) at prism_compile.c:10122:9
    frame #13: ruby`pm_iseq_new_with_opt_try(...) at iseq.c:1029:5
    frame #14: ruby`rb_protect(...) at eval.c:1033:18
    frame #15: ruby`pm_iseq_new_with_opt(...) at iseq.c:1082:5
    frame #16: ruby`pm_iseq_new_top(...) at iseq.c:906:12
    frame #17: ruby`load_iseq_eval(...) at load.c:756:24
    frame #18: ruby`require_internal(...) at load.c:1296:21
    frame #19: ruby`rb_require_string_internal(...) at load.c:1402:22
    frame #20: ruby`rb_require_string(...) at load.c:1388:12
    frame #21: ruby`rb_f_require(...) at load.c:1029:12
    frame #22: ruby`ractor_safe_call_cfunc_1(...) at vm_insnhelper.c:3624:12
    frame #23: ruby`vm_call_cfunc_with_frame_(...) at vm_insnhelper.c:3801:11
    frame #24: ruby`vm_call_cfunc_with_frame(...) at vm_insnhelper.c:3847:12
    frame #25: ruby`vm_call_cfunc_other(...) at vm_insnhelper.c:3873:16
    frame #26: ruby`vm_call_cfunc(...) at vm_insnhelper.c:3955:12
    frame #27: ruby`vm_call_method_each_type(...) at vm_insnhelper.c:4779:16
    frame #28: ruby`vm_call_method(...) at vm_insnhelper.c:4916:20
    frame #29: ruby`vm_call_general(...) at vm_insnhelper.c:4949:12
    frame #30: ruby`vm_sendish(...) at vm_insnhelper.c:5968:15
    frame #31: ruby`vm_exec_core(...) at insns.def:898:11
    frame #32: ruby`rb_vm_exec(...) at vm.c:2595:22
    frame #33: ruby`rb_iseq_eval(...) at vm.c:2850:11
    frame #34: ruby`rb_load_with_builtin_functions(...) at builtin.c:54:5
    frame #35: ruby`Init_builtin_features at builtin.c:74:5
    frame #36: ruby`ruby_init_prelude at ruby.c:1750:5
    frame #37: ruby`ruby_opt_init(...) at ruby.c:1811:5
    frame #38: ruby`prism_script(...) at ruby.c:2215:13
    frame #39: ruby`process_options(...) at ruby.c:2538:9
    frame #40: ruby`ruby_process_options(...) at ruby.c:3169:12
    frame #41: ruby`ruby_options(...) at eval.c:117:16
    frame #42: ruby`rb_main(...) at main.c:43:26
    frame #43: ruby`main(...) at main.c:68:12

Whoa. That thing is huge! This is not the backtrace I was expecting! Seems like I missed a codepath in my earlier explorations. I got it right, up until prism_script:

main
which calls rb_main
which calls ruby_options, then ruby_process_options, then process_options
which calls prism_script
The next call I expected was pm_iseq_new_main, but instead we head into ruby_opt_init
which calls Init_builtin_features

This path seems to go through some gem preloading logic, which is why we see the rb_require calls:

void
Init_builtin_features(void)
{
    rb_load_with_builtin_functions("gem_prelude", NULL);
}

By default CRuby loads gem_prelude, which lives in ruby/gem_prelude.rb. Here’s that file, shortened for brevity:

require 'rubygems'
require 'error_highlight'
require 'did_you_mean'
require 'syntax_suggest/core_ext'

Compiling on-the-fly

There’s something I’ve learned here that seems obvious in hindsight, but I hadn’t considered. Ruby will only compile what is actually loaded, and only at the point it gets loaded. If I never load a particular piece of code, it never gets compiled. Or if I defer loading it until later, it does not get compiled until later.

We can actually demonstrate this by deferring a require:

sleep 10

require "net/http"

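If we run this using make lldb-ruby, we can see the delayed compilation in action. The first breakpoint, ruby.c line 2616, is where pm_iseq_new_main gets called in my checkout (more on that in the next section):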

(lldb) break set --file ruby.c --line 2616
(lldb) run
// hits our prism compile code
(lldb) next
(lldb) break set --file compile.c --line 3476
(lldb) continue
// waits 10 seconds, then compiles the contents of "net/http"
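Another way to watch the same behaviour without a debugger is to require a file that cannot be compiled at all. This is just a sketch with a made-up file name: the broken file sits on disk without causing any trouble, and the error only surfaces at the moment require tries to parse and compile it.

```ruby
require "tmpdir"

events = []

Dir.mktmpdir do |dir|
  # Hypothetical file name; the contents are deliberately invalid Ruby.
  path = File.join(dir, "broken_later.rb")
  File.write(path, "def oops(\n")
  events << :file_written

  # Nothing has looked at the file yet, so the program keeps running.
  events << :before_require

  begin
    require path
  rescue SyntaxError
    # Raised only now, when Ruby actually loads the file.
    events << :syntax_error_at_require
  end
end

p events
```

If Ruby compiled everything up front, the program could not have reached the lines before the require.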

Getting to our test.rb file

I’d rather see just my code in test.rb get compiled, so I’m going to set a breakpoint directly on pm_iseq_new_main, which for me is in ruby.c on line 2616:

(lldb) break set --file ruby.c --line 2616
(lldb) run
Process 32534 launched: '/ruby/build/ruby' (arm64)
Process 32534 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: ruby`process_options(...) at ruby.c:2616:38
   2613         if (!result.ast) {
   2614             pm_parse_result_t *pm = &result.prism;
   2615             int error_state;
-> 2616             iseq = pm_iseq_new_main(&pm->node, opt->script_name, path, parent, optimize, &error_state);
   2617
   2618             pm_parse_result_free(pm);
   2619

Now when we run the backtrace I am seeing what I expected, because we’ve skipped the gem_prelude compilation. This is the exact flow I walked through in part 2:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: ruby`process_options(...) at ruby.c:2616:38
    frame #1: ruby`ruby_process_options(...) at ruby.c:3169:12
    frame #2: ruby`ruby_options(...) at eval.c:117:16
    frame #3: ruby`rb_main(...) at main.c:43:26
    frame #4: ruby`main(...) at main.c:68:12

From here, we can set our iseq_peephole_optimize breakpoint and see only our specific code get compiled. Since we’re already in the running program, we call continue to keep executing:

(lldb) break set --file compile.c --line 3476
Breakpoint 2: where = ruby`iseq_peephole_optimize + 2276 at compile.c:3476:17
(lldb) continue
Process 55336 resuming
Process 55336 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: ruby`iseq_peephole_optimize() at compile.c:3476:17
   3473             const rb_iseq_t *blockiseq = (rb_iseq_t *)OPERAND_AT(next, 1);
   3474
   3475             if (vm_ci_simple(ci) && vm_ci_argc(ci) == 0 && blockiseq == NULL && vm_ci_mid(ci) == idFreeze) {
-> 3476                 iobj->insn_id = BIN(opt_ary_freeze);
   3477                 iobj->operand_size = 2;
   3478                 iobj->operands = compile_data_calloc2(iseq, iobj->operand_size, sizeof(VALUE));
   3479                 iobj->operands[0] = rb_cArray_empty_frozen;

If we call bt from here to get the backtrace, we finally see the connection between prism_compile.c and compile.c. pm_iseq_compile_node calls iseq_setup_insn, which runs the optimization logic. In the previous post, I saw iseq_setup_insn, but I didn’t know what it meant or what it did. Now we know. This is what Kevin Newton referred to earlier: specialization comes after compilation. Prism compiles the node in the standard way, then the peephole optimization layer - the specialization - is applied after:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
  * frame #0: ruby`iseq_peephole_optimize(...) at compile.c:3476:17
    frame #1: ruby`iseq_optimize(...) at compile.c:4352:17
    frame #2: ruby`iseq_setup_insn(...) at compile.c:1619:5
    frame #3: ruby`pm_iseq_compile_node(...) at prism_compile.c:10139:5
    frame #4: ruby`pm_iseq_new_with_opt_try(...) at iseq.c:1029:5
    frame #5: ruby`rb_protect(...) at eval.c:1033:18
    frame #6: ruby`pm_iseq_new_with_opt(...) at iseq.c:1082:5
    frame #7: ruby`pm_iseq_new_main(...) at iseq.c:930:12
    frame #8: ruby`process_options(...) at ruby.c:2616:20
    frame #9: ruby`ruby_process_options(...) at ruby.c:3169:12
    frame #10: ruby`ruby_options(...) at eval.c:117:16
    frame #11: ruby`rb_main(...) at main.c:43:26
    frame #12: ruby`main(...) at main.c:68:12

From here, we can inspect and see the current instruction using expr:

(lldb) expr *(iobj)
(INSN) $4 = {
  link = {
    type = ISEQ_ELEMENT_INSN
    next = 0x000000011f6568d0
    prev = 0x000000011f656850
  }
  insn_id = YARVINSN_newarray
  operand_size = 1
  sc_state = 0
  operands = 0x000000011f640118
  insn_info = (line_no = 1, node_id = 3, events = 0)
}

We see that iobj contains a link to a subsequent instruction, as well as an insn_id and some other metadata. The instruction is currently YARVINSN_newarray. If we run next, that should run iobj->insn_id = BIN(opt_ary_freeze);, and our instruction should change:

(lldb) next
(lldb) expr *(iobj)
(INSN) $5 = {
  //...
  insn_id = YARVINSN_opt_ary_freeze
  //...
}

It does! The instruction was changed from newarray to opt_ary_freeze! The optimization is at least partially complete (I’m not sure if more is involved, yet).

Making one small step towards `opt_respond_to`

This is already the longest and densest post in the series. But I’d love to make some actual progress towards a new instruction. Let’s pattern match on respond_to? in the peephole optimizer.

Here is our sample program:

puts "Did you know you can write to $stdout?" if $stdout.respond_to?(:write)

Running it with RUNOPT0=--dump=insns make runruby, we get the following instructions:

== disasm: #<ISeq:<main>./test.rb:1 (1,0)-(1,76)>
0000 getglobal                              :$stdout                  (   1)[Li]
0002 putobject                              :write
0004 opt_send_without_block                 <calldata!mid:respond_to?, argc:1, ARGS_SIMPLE>
0006 branchunless                           14
0008 putself
0009 putchilledstring                       "Did you know you can write to $stdout?"
0011 opt_send_without_block                 <calldata!mid:puts, argc:1, FCALL|ARGS_SIMPLE>
0013 leave
0014 putnil
0015 leave

I want to match on this line:

0004 opt_send_without_block       <calldata!mid:respond_to?, argc:1, ARGS_SIMPLE>

Here’s my attempt. I’m going to copy what the newarray freeze optimization is doing, and just try changing a few things to match my example. Right underneath the code we’ve been debugging for newarray, I’m adding this:

// If the instruction is `send`. The dump shows this call as `0004 opt_send_without_block`,
// but the `freeze` match above also checks for a plain `send`, so I'm doing the same here
if (IS_INSN_ID(iobj, send)) {
    // Pull the same info the `newarray` optimization does
    const struct rb_callinfo *ci = (struct rb_callinfo *)OPERAND_AT(iobj, 0);
    const rb_iseq_t *blockiseq = (rb_iseq_t *)OPERAND_AT(iobj, 1);

    // <calldata!mid:respond_to?, argc:1, ARGS_SIMPLE>
    // 1. We have ARGS_SIMPLE, which is probably what `vm_ci_simple(ci)` checks for
    // 2. We have argc:1, which should match `vm_ci_argc(ci) == 1`
    // 3. We send without a block, hence blockiseq == NULL
    // 4. The method id (mid) from `vm_ci_mid(ci)` matches `idRespond_to`. I searched for names
    //    similar to `idFreeze`, swapping in `idRespond`, and found `idRespond_to`
    if (vm_ci_simple(ci) && vm_ci_argc(ci) == 1 && blockiseq == NULL && vm_ci_mid(ci) == idRespond_to) {
        // Placeholder statement so there is a line to set a breakpoint on
        int i = 0;
    }
}

Now I’ll follow the same debugging steps as before, but I’ll add a breakpoint in compile.c where I added my new code. Specifically, I’m setting a breakpoint at the int i = 0; so I am inside the if statement:

(lldb) break set --file ruby.c --line 2616
Breakpoint 1: where = ruby`process_options + 4068 at ruby.c:2616:38
(lldb) run
(lldb) break set --file compile.c --line 3491
Breakpoint 2: where = ruby`iseq_peephole_optimize + 2536 at compile.c:3491:17
(lldb) continue
Process 61925 resuming
Process 61925 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: ruby`iseq_peephole_optimize(...) at compile.c:3491:17
   3488         const rb_iseq_t *blockiseq = (rb_iseq_t *)OPERAND_AT(iobj, 1);
   3489
   3490         if (vm_ci_simple(ci) && vm_ci_argc(ci) == 1 && blockiseq == NULL && vm_ci_mid(ci) == idRespond_to) {
-> 3491             int i = 0;
   3492         }
   3493     }
   3494

I think it worked! It pattern matched on the characteristics of the respond_to? call, and hit the breakpoint set on int i = 0;. It’s a tiny step, but it’s a first step in the direction of adding the optimization.
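As a cross-check from the Ruby side, the same characteristics (method id respond_to?, one argument) can be matched against the compiled output. This is a rough sketch, assuming CRuby: it walks the array form from RubyVM::InstructionSequence#to_a, whose layout is an implementation detail, and it looks at the final instructions rather than the list the peephole optimizer sees.

```ruby
# Assumes CRuby: RubyVM::InstructionSequence is implementation-specific.
source = 'puts "Did you know you can write to $stdout?" if $stdout.respond_to?(:write)'
iseq = RubyVM::InstructionSequence.compile(source)

# The last element of #to_a is the instruction body: a mix of line numbers,
# label symbols, and arrays of the form [opcode, operand, ...].
body = iseq.to_a.last

# Keep instructions whose call data names respond_to? with exactly one argument,
# mirroring the vm_ci_mid / vm_ci_argc checks in the C code.
respond_to_calls = body.select do |insn|
  next false unless insn.is_a?(Array)

  insn.drop(1).any? do |operand|
    operand.is_a?(Hash) && operand[:mid] == :respond_to? && operand[:orig_argc] == 1
  end
end

respond_to_calls.each { |insn| puts insn.inspect }
```

The opcode printed here depends on the Ruby you run it with; the call data is the part that should stay recognizable.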

Using `gdb`

For anyone wanting to do the same work using gdb, it’s pretty similar. Let’s start off by creating a breakpoints.gdb file in the root of your project. This sets up your initial breakpoint, the same way we set a breakpoint in lldb before calling run:

break ruby.c:2616

When you run make gdb-ruby, you can use the same backtrace command, bt:

> make gdb-ruby
Thread 1 "ruby" hit Breakpoint 4, process_options (...) at ../ruby.c:2616
2616	            iseq = pm_iseq_new_main(&pm->node, opt->script_name, path, parent, optimize, &error_state);
(gdb) bt
#0  process_options (...) at ../ruby.c:2616
#1  in ruby_process_options (...) at ../ruby.c:3169
#2  in ruby_options (...) at ../eval.c:117
#3  in rb_main (...) at ../main.c:43
#4  in main (...) at ../main.c:68
(gdb)

From here, you can set your next breakpoint so that you can see the compilation solely for the newarray instruction from our test.rb program:

(gdb) break compile.c:3476
Breakpoint 5 at 0xaaaabaa22f14: file ../compile.c, line 3476
(gdb) continue
Continuing.

Thread 1 "ruby" hit Breakpoint 5, iseq_peephole_optimize (...) at ../compile.c:3476
3476	                iobj->insn_id = BIN(opt_ary_freeze);

Similar to the lldb command expr, we can inspect the contents of locals using p or print in gdb:

(gdb) p *(iobj)
$2 = {link = {type = ISEQ_ELEMENT_INSN, next = 0xaaaace797ef0, prev = 0xaaaace797e70}, insn_id = YARVINSN_newarray,
  operand_size = 1, sc_state = 0, operands = 0xaaaace796ac8, insn_info = {line_no = 1, node_id = 3, events = 0}}

Finishing up

Ok, this went pretty long. Good on you for sticking in there with me! We’ve found the optimizer, and we’ve pattern matched our way to a respond_to? call. Next, we need to add the new instruction definition and try to actually replace the send with our new instruction. See you next time! 👋🏼