A silly optimization: adding opt_respond_to to the Ruby VM, part 6

In part 5, we finally got our new instruction defined and outputting as part of our bytecode. If you didn’t run it yourself, you just had to trust me that it really did run.

But, I just dropped most of the implementation code in without explaining it. Let’s start off by walking through the basic version, then start planning for the true optimization.

The progress so far

Here’s our sample Ruby program:

puts "Did you know you can write to $stdout?" if $stdout.respond_to?(:write)

First, we’ll disassemble the code using make run, and run it using our C changes (you can pull the work in progress here):

RUNOPT0=--dump=insns make run

This gives us a new set of instructions. Most of it is the same as Ruby master, but opt_send_without_block is changed to opt_respond_to. The calldata containing respond_to? is still there, and I think it’ll stay even once we finish the whole implementation:

# == disasm: #<ISeq:<main>./test.rb:1 (1,0)-(1,76)>
0000 getglobal                :$stdout                  (   1)[Li]
0002 putobject                :write
# our new instruction!
0004 opt_respond_to           <calldata!mid:respond_to?, argc:1, ARGS_SIMPLE>
0006 branchunless             14
0008 putself
0009 putchilledstring         "Did you know you can write to $stdout?"
0011 opt_send_without_block   <calldata!mid:puts, argc:1, FCALL|ARGS_SIMPLE>
0013 leave
0014 putnil
0015 leave
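
If you don’t have a patched build handy, a stock CRuby can print the same kind of listing from inside Ruby. This is a minimal sketch, assuming a CRuby where RubyVM::InstructionSequence is available. On an unpatched Ruby the respond_to? call shows up as a generic send-style instruction rather than opt_respond_to.

```ruby
# Compile the sample line to bytecode without running it, then print the listing.
# RubyVM::InstructionSequence is specific to CRuby.
source = 'puts "Did you know you can write to $stdout?" if $stdout.respond_to?(:write)'
iseq = RubyVM::InstructionSequence.compile(source)
disasm = iseq.disasm
puts disasm
```

The exact instruction names and offsets depend on the Ruby version, so treat the output as a reference point rather than something that must match the listing above.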

Our current implementation is mostly just a pass-through to the normal respond_to? method, with some debug information printed. Running it without the --dump=insns option, this is the output we get:

> make run

symbol:File
Did you know you can write to $stdout?

File is the VM’s internal type for the receiver, $stdout (its Ruby class is IO, but IO objects are stored as the built-in File type), and symbol is the type of the method argument, :write.

📝 In previous posts, we used make runruby and make lldb-ruby/make gdb-ruby. Based on feedback from Ruby maintainers in the know (like byroot), it seems like make run and make lldb/make gdb are the better options in 99% of cases. These commands use “miniruby”, which is all the Ruby syntax without loading stdlib and gems, so it should run faster. If you do need the stdlib and standard gems, you’ll want to continue using make runruby and friends.

Breaking down the changes

The last post was running pretty long, so I dumped all the code at the end without explanation. Let’s break each section down, starting with our insns.def change to the virtual machine DSL:

//insns.def
DEFINE_INSN
opt_respond_to
(CALL_DATA cd)
(VALUE recv, VALUE mid)
(VALUE val)
{
    val = vm_opt_respond_to(recv, mid);
    CALL_SIMPLE_METHOD();
}

We have some context for how a virtual machine instruction is defined from the previous post, so let’s break this down:

- opt_respond_to is the name of the instruction
- (CALL_DATA cd) is the one “operand”, the call data of the method. I don’t think we’ll need this for our optimized version, but I think if we use a fallback it would still be required
- (VALUE recv, VALUE mid) are the values this instruction is expecting to be popped off the stack so they can be used in the call. In our sample program instructions this should correspond to getglobal :$stdout and putobject :write. $stdout is recv, or the “receiver”. :write is mid, or the “method id”
- (VALUE val) is the return value. Whatever gets set to val gets pushed onto the stack at the end of the instruction. The next instruction in our example is branchunless, which pops our val off the stack and tests it

Next is the body of the instruction:

- val = vm_opt_respond_to(recv, mid); here I followed the convention of other instructions which need some custom logic - they put their code inside of a vm_ prefixed function named after their instruction, and define it in vm_insnhelper.c. My function takes the receiver and the method id, and we’ll dive into that in a bit
- I think CALL_SIMPLE_METHOD(); will use the calldata to call the original method. Normally you would check the return value of the vm_ function to determine whether you want to pass through to the original implementation. In my case, my function is just printing some debug information so I let it always call the original
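
The pop and push bookkeeping described above can be modeled in a few lines of plain Ruby. This is only a toy, with an Array standing in for the VM stack. It is not how CRuby implements its stack, and the variable names are made up for illustration.

```ruby
# Toy model of the operand stack around opt_respond_to (not CRuby internals).
stack = []
stack.push($stdout)   # getglobal :$stdout
stack.push(:write)    # putobject :write

# opt_respond_to: the two declared values come off the stack, last pushed first.
mid  = stack.pop
recv = stack.pop
val  = recv.respond_to?(mid)
stack.push(val)       # the declared return value goes back on the stack

# branchunless: pops val and jumps only when it is falsy.
jump_taken = !stack.pop
```

After these steps the stack is empty again and jump_taken is false, which matches the real program falling through to the puts call.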

We’ve dug into most of the pattern matching logic in compile.c in previous posts, so I’ll skip that part and focus on the instruction override:

// compile.c
const struct rb_callinfo *ci = (struct rb_callinfo *)OPERAND_AT(iobj, 0);
//...
iobj->insn_id = BIN(opt_respond_to);
iobj->operand_size = 1;
iobj->operands = compile_data_calloc2(
  iseq, 
  iobj->operand_size, 
  sizeof(VALUE)
);
iobj->operands[0] = (VALUE)ci;

Once it’s found an instruction that matches a send to respond_to?, we override the current information. First we set insn_id to BIN(opt_respond_to), which we know expands to the enum value YARVINSN_opt_respond_to.

The rest seems… redundant? It already had ci at the first operand position, and it already had an operand_size of 1. It’s possible I don’t need to reallocate the operands, but I’ll need some guidance around that. It’s probably not harmful, but possibly unnecessary.

Last we’ve got our vm_opt_respond_to function:

// vm_insnhelper.c
static VALUE
vm_opt_respond_to(VALUE recv, VALUE mid)
{
  if (SYMBOL_P(mid)) {
    printf("symbol:");
  } else if (STRING_P(mid)) {
    printf("string:");
  }
  printf("%s\n", rb_builtin_type_name(TYPE(recv)));
  return Qundef;
}

It’s purely a debug function right now. It prints “symbol:” if mid is a symbol (SYMBOL_P and STRING_P are each “predicate” functions, hence the _P), “string:” if we have a string. Then it prints the type of the receiver and a new line. This is how we end up with symbol:File when we run our program:

puts "Did you know you can write to $stdout?" if $stdout.respond_to?(:write)
# symbol:File
# Did you know you can write to $stdout?

What’s next?

I’m missing some things at the moment:

- Tests
- Logic for handling the private/protected param
- Actual optimization code 😅

1. Tests

There should already be tests for respond_to?, so I’ll start running those and rely on them for the moment.

As might be expected for an entire language, there are tons of tests. There is also RubySpec, which is the standard spec suite for every Ruby language implementation. It’s automatically included in the repository as well.

I’ll rely on those specs for now:

> make test-spec SPECOPTS="../spec/ruby/core/kernel/respond_to_spec.rb"

ruby 3.5.0dev (2025-01-04T14:32:13Z opt-respond-to 5688434f63) +PRISM [arm64-darwin24]
[\ | ==================100%================== | 00:00:00]      0F      0E 

Finished in 0.007758 seconds

1 file, 13 examples, 24 expectations, 0 failures, 0 errors, 0 tagged

As expected, it still works so far since my version is basically a pass-through. We’ll see if we need more specs later on or if the base set is enough.

2. Logic for handling the private/protected param

respond_to? takes a second parameter - include_all - which determines whether to include private and protected methods.
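
Here is a small illustration of that second parameter. The class and method names are invented for the example.

```ruby
# A throwaway class with one public and one private method.
account_class = Class.new do
  def balance; end

  private

  def pin; end
end

account = account_class.new
public_check      = account.respond_to?(:balance)    # public method
private_check     = account.respond_to?(:pin)        # private, include_all defaults to false
include_all_check = account.respond_to?(:pin, true)  # private, include_all is true
```

Only the last call sees the private method, so any optimized path has to either honor the flag or hand the two-argument form back to the regular method.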

I’ve never seen someone use this second parameter, but I’m sure it’s out there somewhere 🤷‍♂️. Piotr Szotkowski recently told me he’s a fan of the flip-flop operator - so the world is full of surprises 😉! Part of me wants to ignore it for optimizing and just pass through in that case, but that’s a total cop out.

I think there is some VM magic I need to utilize to handle an optional argument, applying special attributes for dynamic stack pointer adjustment. For instance, opt_send_without_block is defined like this:

DEFINE_INSN
opt_send_without_block
(CALL_DATA cd)
(...)
(VALUE val)
// attr bool handles_sp = true;
// attr rb_snum_t sp_inc = sp_inc_of_sendish(cd->ci);
// attr rb_snum_t comptime_sp_inc = sp_inc_of_sendish(ci);
{
  //...
}

It doesn’t specify the pop values, but instead uses the syntax (...) similar to argument forwarding in Ruby. It then specifies some stack pointer (“sp”) counts (those comments are actual code!), which I think allows it to handle a dynamic number of values to pop off the stack.
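
A rough sketch of the arithmetic involved, assuming that a send-style instruction pops the receiver, every argument, and an optional block argument, then pushes one return value. The helper name below is hypothetical and the real computation lives in C inside CRuby, so treat this as a mental model only.

```ruby
# Hypothetical restatement of the stack pointer change for a send-style call.
def sp_inc_for_send(argc, block_arg: false)
  popped = argc + 1 + (block_arg ? 1 : 0)  # arguments + receiver + optional block argument
  pushed = 1                                # the return value
  pushed - popped
end

one_arg  = sp_inc_for_send(1)  # shape of $stdout.respond_to?(:write)
two_args = sp_inc_for_send(2)  # shape of $stdout.respond_to?(:write, true)
```

Under that assumption the stack shrinks by one slot for the one-argument call and by two for the two-argument call, which is why a fixed (VALUE recv, VALUE mid) signature can only describe the first form.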

This seems complex for my case, where I have one required and one optional argument. I’ll defer this one for the moment.

3. Actual optimization code

I actually don’t know if this is optimizable in a meaningful way. I’d be lying if I said I didn’t care if there’s an optimization win here - that’s the most satisfying/impactful outcome.

This entire series is inspired by Optimizing Ruby’s JSON, Part 2, and one of the goals of that work was to reduce setup costs. Here’s some of the JSON.dump method in its original form:

def dump(obj, anIO = nil, limit = nil, kwargs = nil)
  #...
  if anIO.respond_to?(:to_io)
    anIO = anIO.to_io
  elsif limit.nil? && !anIO.respond_to?(:write)
    anIO, limit = nil, anIO
  end
  #...
end

The majority of the time, anIO is nil, so it won’t have a to_io or write method. That means in a micro-benchmark running millions of times the call to respond_to? is pure overhead. The solution in the post was to avoid the call when nil, but how fast can we make it if we did a silly, nil-specific optimization?
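
To make the “avoid the call when nil” idea concrete, here is a minimal sketch. The normalize_io helper is hypothetical and is not the actual JSON code. It keeps the same branching as the excerpt above but returns early for nil, so neither respond_to? call runs in the common case.

```ruby
# Hypothetical helper: same outcome as the original branches, with a nil guard first.
def normalize_io(an_io, limit)
  return [nil, limit] if an_io.nil?  # common case: skip both respond_to? calls

  if an_io.respond_to?(:to_io)
    an_io = an_io.to_io
  elsif limit.nil? && !an_io.respond_to?(:write)
    an_io, limit = nil, an_io        # the second argument was really the limit
  end
  [an_io, limit]
end

nil_case   = normalize_io(nil, nil)  # guard path
limit_case = normalize_io(10, nil)   # an Integer is neither IO-like nor writable
```

For a nil argument the original branches would also end with a nil IO and an unchanged limit, so the guard changes only how much work is done, not the result.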

Setting up a performance baseline

Let’s set up a benchmark to see what our current performance is, as a baseline. In CRuby there are built-in benchmarking scripts we can use. We’ll define a new benchmark for respond_to?:

# benchmark/object_respond_to.yml
prelude: |
  class Base; def foo; end end
  class OneTwentyEight < Base
    128.times { include(Module.new) }
  end
  obj = OneTwentyEight.new  
benchmark:
  respond_to_false: obj.respond_to?(:bar)
  respond_to_true: obj.respond_to?(:foo)
  respond_to_nil_false: nil.respond_to?(:bar)
loop_count: 1_000_000

This YAML first sets up a prelude, which is Ruby code to set up our benchmark:

- It defines a Base class with a foo method
- Creates a child class called OneTwentyEight, which extends the Base class
- Includes Module.new 128 times, to create a lot of ancestors to search for methods
- Instantiates OneTwentyEight to call from the benchmark

The benchmark keys specify what operations to run. respond_to_false checks respond_to? for a method that doesn’t exist, and respond_to_true checks for a method that does exist. respond_to_nil_false is unrelated to the prelude, but lets me test how fast looking for a method on nil is.

The loop_count is how many iterations the code will run. I believe it runs several times, and then calculates how many times per second it should be able to run. Aaron Patterson created this benchmark in a PR that never merged, so thanks to him for that!
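
If you want a quick feel for these numbers without the CRuby build tooling, the same three cases can be timed with the Benchmark module from the standard library. This is a rough stand-in, not what make benchmark runs. Each iteration also pays for a lambda call, so the absolute figures will be lower than the ones reported by the real benchmark.

```ruby
require "benchmark"

# Same setup as the YAML prelude, built with Class.new so it can live in a script.
base = Class.new { def foo; end }
deep = Class.new(base) { 128.times { include(Module.new) } }
obj = deep.new

loop_count = 1_000_000
cases = {
  respond_to_false:     -> { obj.respond_to?(:bar) },
  respond_to_true:      -> { obj.respond_to?(:foo) },
  respond_to_nil_false: -> { nil.respond_to?(:bar) },
}

# Time each case once and convert to iterations per second.
results = cases.transform_values do |work|
  seconds = Benchmark.realtime { loop_count.times { work.call } }
  loop_count / seconds
end

results.each { |name, ips| puts format("%-22s %.3fM i/s", name, ips / 1_000_000.0) }
```

It is useful for comparing the three cases against each other on one Ruby, not for comparing against the table below.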

We can run the benchmark using make benchmark ITEM='respond_to'. I get the following output on a clean master branch:

# Iteration per second (i/s)
|                      |compare-ruby|built-ruby|
|:---------------------|-----------:|---------:|
|respond_to_nil_false  |     29.029M|   28.259M|
|                      |       1.03x|         -|
|respond_to_false      |     29.177M|   29.121M|
|                      |       1.00x|         -|
|respond_to_true       |     33.503M|   32.481M|
|                      |       1.03x|         -|

compare-ruby is the version of Ruby the project was built with (yes, building Ruby requires Ruby 🫨). For me, that’s Ruby 3.4. built-ruby is my local, built version. The differences in performance are pretty negligible - probably differences in compile flags used to build Rubies. The performance of each stays pretty close, and can flip-flop a bit between iterations.

You can run a lot of respond_to?s in a second! The found method cases are the fastest, and the miss cases are consistently slower.

A first silly optimization

Now that we have a baseline, let’s try two optimizations to see what our upper limit might be:

- A nil-specific check that always returns false
- A nil-specific check that has a hard-coded set of possible methods

First, we’ll change opt_respond_to into a common pattern. Many instructions will call a method, and if the method returns Qundef, they’ll revert to a base-case path. In our case right now, that’s CALL_SIMPLE_METHOD(). I assume Qundef exists to specify “undefined” behavior, to differentiate from Qnil which could be a valid return value:

// insns.def
DEFINE_INSN
opt_respond_to
(CALL_DATA cd)
(VALUE recv, VALUE mid)
(VALUE val)
{
  val = vm_opt_respond_to(recv, mid);
  if (UNDEF_P(val)) {
    CALL_SIMPLE_METHOD();
  }
}

And here is our silliest optimization. If recv is nil, always return false. Otherwise, return Qundef:

// vm_insnhelper.c
static VALUE
vm_opt_respond_to(VALUE recv, VALUE mid)
{
  if (NIL_P(recv)) {
    return Qfalse;
  }

  return Qundef;
}
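
The shape of this fast path and its fallback can be modeled in plain Ruby. This is a sketch of the pattern only. The undef_marker object stands in for Qundef, and the lambdas are hypothetical stand-ins for the C function and the instruction body.

```ruby
# A unique marker that can never be confused with true, false, or nil.
undef_marker = Object.new.freeze

# Stand-in for vm_opt_respond_to: answer for nil, otherwise decline.
fast_path = ->(recv, _mid) { recv.nil? ? false : undef_marker }

# Stand-in for the instruction body: fall back to the real method when declined.
opt_respond_to = lambda do |recv, mid|
  val = fast_path.call(recv, mid)
  val.equal?(undef_marker) ? recv.respond_to?(mid) : val
end

nil_result    = opt_respond_to.call(nil, :==)        # answered by the fast path
string_result = opt_respond_to.call("abc", :upcase)  # falls back to respond_to?
```

A separate marker is needed because false and nil are both legitimate return values. Note that nil_result comes back false even though nil really does respond to ==.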

Let’s rerun our benchmark, and see what we get:

> make benchmark ITEM='respond_to'

# Iteration per second (i/s)
|                      |compare-ruby|built-ruby|
|:---------------------|-----------:|---------:|
|respond_to_false      |     29.121M|   27.795M|
|                      |       1.05x|         -|
|respond_to_true       |     32.241M|   31.544M|
|                      |       1.02x|         -|
|respond_to_nil_false  |     26.872M|   57.894M|
|                      |           -|     2.15x|

Oh, not bad! Around a 2x improvement. But ya know, it’s totally incorrect. We can add a spec to the respond_to_spec to check. It fails, as expected:

it "returns true for checking for `==` on nil" do
  nil.respond_to?(:==).should == true
end

# make test-spec SPECOPTS="../spec/ruby/core/kernel/respond_to_spec.rb"
# 1)
# Kernel#respond_to? returns true for checking for `==` on nil FAILED
# Expected false == true
# to be truthy but was false
# [/ | ==================100%================== | 00:00:00]      1F      0E 
# Finished in 0.017146 seconds
# 1 file, 14 examples, 25 expectations, 1 failure, 0 errors, 0 tagged

A second, slightly less silly optimization

What if I added some overhead, but not a ton of overhead? First, I got every method available to me from an irb session:

nil.methods
# => [:rationalize, :&, :===, :inspect, :=~, :to_a, ...]

Then I took that list and put it into an array of C strings. The first time we call our vm_opt_respond_to function, it populates an rb_id_table with each of the available method names using rb_id_table_insert. rb_id_table is an internal CRuby hashtable structure which revolves around IDs, which I believe typically correspond to method names.

If the recv is nil, we use method_id_table to check if one of our hard-coded method names is being checked by respond_to?, using rb_id_table_lookup. If the lookup finds the ID, we return Qtrue, otherwise Qfalse.

static struct rb_id_table *method_id_table = NULL;

static VALUE
vm_opt_respond_to(VALUE recv, VALUE mid)
{
  if (method_id_table == NULL) {
    const char *method_names[] = {
      "rationalize", "&", "===", "inspect", "=~", "to_a", "to_s", "to_i", "to_f", "to_r",
      "to_c", "nil?", "pretty_print_cycle", "|", "to_h", "^", "to_json", "to_yaml",
      "pretty_print", "pretty_print_instance_variables", "pretty_print_inspect", "singleton_class",
      "dup", "itself", "methods", "singleton_methods", "protected_methods", "private_methods",
      "public_methods", "instance_variables", "instance_variable_get", "instance_variable_set",
      "instance_variable_defined?", "remove_instance_variable", "instance_of?", "kind_of?",
      "is_a?", "display", "frozen?", "class", "then", "yield_self", "tap", "TypeName",
      "public_send", "extend", "clone", "<=>", "pretty_inspect", "!~", "method", "eql?",
      "respond_to?", "public_method", "singleton_method", "define_singleton_method", "hash",
      "freeze", "object_id", "Namespace", "send", "to_enum", "enum_for", "equal?", "!",
      "__send__", "==", "!=", "__id__", "instance_eval", "instance_exec"
    };

    size_t method_names_size = sizeof(method_names) / sizeof(method_names[0]);
    method_id_table = rb_id_table_create(method_names_size);

    for (size_t i = 0; i < method_names_size; i++) {
      ID id = rb_intern(method_names[i]);
      rb_id_table_insert(method_id_table, id, Qtrue);
    }
  }
  if (NIL_P(recv)) {
    ID id = rb_check_id(&mid);
    if (!id) return Qfalse;

    VALUE val;
    if (rb_id_table_lookup(method_id_table, id, &val)) {
      return Qtrue;
    } else {
      return Qfalse;
    }
  }

  return Qundef;
}
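
A quick way to sanity-check the table idea is to model it in Ruby and compare it with the real respond_to?. This sketch builds the table from nil.methods at runtime instead of hard-coding names, and it converts strings with to_sym, which is looser than the rb_check_id call in the C version. The lambda and variable names are made up for the example.

```ruby
# Snapshot nil's methods once, like the lazily built table in the C version.
nil_method_table = nil.methods.to_h { |name| [name, true] }

# Stand-in for the nil branch of vm_opt_respond_to.
fast_nil_respond_to = lambda do |mid|
  name = mid.is_a?(String) ? mid.to_sym : mid
  nil_method_table.key?(name)
end

# Compare the table's answer with the real method for a few names.
probes = [:==, :to_a, :nil?, :bar, "inspect", "missing"]
mismatches = probes.reject { |mid| fast_nil_respond_to.call(mid) == nil.respond_to?(mid) }
```

An empty mismatches list means the snapshot agrees with respond_to? for these probes. The snapshot goes stale as soon as a method is added to or removed from NilClass afterwards, and the hard-coded C list has the same weakness.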

How fast is this version, now that we’re doing some actual work?

# Iteration per second (i/s)
|                      |compare-ruby|built-ruby|
|:---------------------|-----------:|---------:|
|respond_to_false      |     29.668M|   28.738M|
|                      |       1.03x|         -|
|respond_to_true       |     33.320M|   29.829M|
|                      |       1.12x|         -|
|respond_to_nil_false  |     28.610M|   53.084M|
|                      |           -|     1.86x|

Still pretty fast! It even passes our spec now:

[| | ==================100%================== | 00:00:00]      0F      0E 
Finished in 0.008847 seconds
1 file, 14 examples, 25 expectations, 0 failures, 0 errors, 0 tagged

Back to reality

Ok - we described our base code. We walked through next steps. We ran some specs and got a feel for some benchmarks. It seems like our upper limit on performance may be about 2x how fast it currently runs - and it’s probably unattainable. But it’s nice to know the potential ceiling on performance from where things currently are.

Next time we’ll dig into some previous optimization improvements to respond_to? in older PRs, how respond_to? works currently, and hopefully make our first real optimization improvement. See you next time!

PS - You can find the code changes made in the branch here.