In part 5, we finally got our new instruction defined and emitted as part of our bytecode. If you didn’t run it yourself, you just had to trust me that it really did run.
But, I just dropped most of the implementation code in without explaining it. Let’s start off by walking through the basic version, then start planning for the true optimization.
The progress so far
Here’s our sample Ruby program:
puts "Did you know you can write to $stdout?" if $stdout.respond_to?(:write)
First, we’ll disassemble the code using make run, which runs it against our C changes (you can pull the work in progress here):
RUNOPT0=--dump=insns make run
This gives us a new set of instructions. Most of it is the same as Ruby master, but opt_send_without_block is changed to opt_respond_to. The calldata containing respond_to? is still there, and I think it’ll stay even once we finish the whole implementation:
# == disasm: #<ISeq:<main>./test.rb:1 (1,0)-(1,76)>
0000 getglobal :$stdout ( 1)[Li]
0002 putobject :write
# our new instruction!
0004 opt_respond_to <calldata!mid:respond_to?, argc:1, ARGS_SIMPLE>
0006 branchunless 14
0008 putself
0009 putchilledstring "Did you know you can write to $stdout?"
0011 opt_send_without_block <calldata!mid:puts, argc:1, FCALL|ARGS_SIMPLE>
0013 leave
0014 putnil
0015 leave
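If you don’t have a CRuby checkout handy, a rough way to look at the same instruction listing is RubyVM::InstructionSequence, which is CRuby-specific. This is only a sketch: on a stock Ruby the call will show up as a regular send-style instruction rather than opt_respond_to, and the exact listing varies by Ruby version.

```ruby
# Compile the sample line without running it, then print its YARV instructions.
source = 'puts "Did you know you can write to $stdout?" if $stdout.respond_to?(:write)'
iseq = RubyVM::InstructionSequence.compile(source)
disasm = iseq.disasm
puts disasm
```

The calldata for respond_to? should be visible in the output either way, since that part is unchanged by our work.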
Our current implementation is mostly just a pass-through to the normal respond_to? method, with some debug information printed. Running it without the --dump=insns option, this is the output we get:
> make run
symbol:File
Did you know you can write to $stdout?
File is the type of the receiver, $stdout, and symbol is the type of the method argument, :write.
📝 In previous posts, we used make runruby and make lldb-ruby/make gdb-ruby. Based on feedback from Ruby maintainers in the know (like byroot), it seems like make run and make lldb/make gdb are the better options in 99% of cases. These commands use “miniruby”, which is all the Ruby syntax without loading stdlib and gems, so it should run faster. If you do need the stdlib and standard gems, you’ll want to continue using make runruby and friends.
Breaking down the changes
The last post was running pretty long, so I dumped all the code at the end without explanation. Let’s break each section down, starting with our insns.def change to the virtual machine DSL:
// insns.def
DEFINE_INSN
opt_respond_to
(CALL_DATA cd)
(VALUE recv, VALUE mid)
(VALUE val)
{
    val = vm_opt_respond_to(recv, mid);
    CALL_SIMPLE_METHOD();
}
We have some context for how a virtual machine instruction is defined from the previous post, so let’s break this down:
- opt_respond_to is the name of the instruction
- (CALL_DATA cd) is the one “operand”, the call data of the method. I don’t think we’ll need this for our optimized version, but I think if we use a fallback it would still be required
- (VALUE recv, VALUE mid) are the values this instruction is expecting to be popped off the stack so they can be used in the call. In our sample program instructions this should correspond to getglobal :$stdout and putobject :write. $stdout is recv, or the “receiver”. :write is mid, or the “method id”
- (VALUE val) is the return value. Whatever gets set to val gets pushed onto the stack at the end of the instruction. The next instruction in our example is branchunless, which pops our val off the stack and tests it
- Next is the body of the instruction: val = vm_opt_respond_to(recv, mid); Here I followed the convention of other instructions which need some custom logic - they put their code inside of a vm_ prefixed function named after their instruction, and define it in vm_insnhelper.c. My function takes the receiver and the method id, and we’ll dive into that in a bit
- I think CALL_SIMPLE_METHOD(); will use the calldata to call the original method. Normally you would check the return value of the vm_ function to determine whether you want to pass through to the original implementation. In my case, my function is just printing some debug information so I let it always call the original
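To make the pop/push flow a little more concrete, here is a toy model in plain Ruby. It is only an illustration: a regular Array stands in for the VM stack, and the real instruction of course runs in C rather than calling respond_to? like this.

```ruby
# Toy model only: a plain Array stands in for the VM stack.
stack = []

# getglobal :$stdout  -> pushes the receiver
stack.push($stdout)
# putobject :write    -> pushes the method name
stack.push(:write)

# opt_respond_to: pop the operands (last pushed comes off first), push one result.
mid  = stack.pop
recv = stack.pop
val  = recv.respond_to?(mid)
stack.push(val)

# branchunless: pops val and jumps only when it is falsy.
branch_taken = !stack.pop
puts "jump to putnil? #{branch_taken}"
```

After the branch the stack is empty again, which matches the listing: two pushes, one instruction that turns two values into one, and one instruction that consumes the result.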
We’ve dug into most of the pattern matching logic in compile.c in previous posts, so I’ll skip that part and focus on the instruction override:
// compile.c
const struct rb_callinfo *ci = (struct rb_callinfo *)OPERAND_AT(iobj, 0);
//...
iobj->insn_id = BIN(opt_respond_to);
iobj->operand_size = 1;
iobj->operands = compile_data_calloc2(
    iseq,
    iobj->operand_size,
    sizeof(VALUE)
);
iobj->operands[0] = (VALUE)ci;
Once it’s found an instruction that matches a send to respond_to?, we override the current information. First we set insn_id to BIN(opt_respond_to), which we know expands to the enum value YARVINSN_opt_respond_to.
The rest seems… redundant? It already had ci at the first operand position, and it already had an operand_size of 1. It’s possible I don’t need to rebuild the operands at all, but I’ll need some guidance around that. It’s probably not harmful, but possibly unnecessary.
Last, we’ve got our vm_opt_respond_to function:
// vm_insnhelper.c
static VALUE
vm_opt_respond_to(VALUE recv, VALUE mid)
{
    if (SYMBOL_P(mid)) {
        printf("symbol:");
    } else if (STRING_P(mid)) {
        printf("string:");
    }
    printf("%s\n", rb_builtin_type_name(TYPE(recv)));
    return Qundef;
}
It’s purely a debug function right now. It prints “symbol:” if mid is a symbol and “string:” if it is a string (SYMBOL_P and STRING_P are each “predicate” checks, hence the _P). Then it prints the type of the receiver and a newline. This is how we end up with symbol:File when we run our program:
puts "Did you know you can write to $stdout?" if $stdout.respond_to?(:write)
# symbol:File
# Did you know you can write to $stdout?
What’s next?
I’m missing some things at the moment:
- Tests
- Logic for handling the private/protected param
- Actual optimization code 😅
1. Tests
There should already be tests for respond_to?, so I’ll start running those and rely on them for the moment.
As might be expected for an entire language, there are tons of tests. There is also RubySpec, which is the standard spec suite for every Ruby language implementation. It’s automatically included in the repository as well.
I’ll rely on those specs for now:
> make test-spec SPECOPTS="../spec/ruby/core/kernel/respond_to_spec.rb"
ruby 3.5.0dev (2025-01-04T14:32:13Z opt-respond-to 5688434f63) +PRISM [arm64-darwin24]
[\ | ==================100%================== | 00:00:00] 0F 0E
Finished in 0.007758 seconds
1 file, 13 examples, 24 expectations, 0 failures, 0 errors, 0 tagged
As expected, it still works so far since my version is basically a pass-through. We’ll see if we need more specs later on or if the base set is enough.
2. Logic for handling the private/protected param
respond_to? takes a second parameter - include_all - which determines whether to include private and protected methods.
I’ve never seen someone use this second parameter, but I’m sure it’s out there somewhere 🤷♂️. Piotr Szotkowski recently told me he’s a fan of the flip-flop operator - so the world is full of surprises 😉! Part of me wants to ignore it for optimizing and just pass through in that case, but that’s a total cop out.
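For anyone who hasn’t run into it either, here is a small sketch of what include_all changes. The Vault class and its method names are made up for illustration.

```ruby
# Hypothetical class, used only to show the second argument of respond_to?.
class Vault
  def unlock; end

  private

  def combination; end
end

vault = Vault.new
public_only = vault.respond_to?(:combination)        # private methods are skipped by default
with_all    = vault.respond_to?(:combination, true)  # include_all = true
puts "default: #{public_only}, include_all: #{with_all}"
```

Any optimized instruction has to either honor that second argument or fall back to the regular call when it is present.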
I think there is some VM magic I need to utilize to handle an optional argument, applying special attributes for dynamic stack pointer adjustment. For instance, opt_send_without_block is defined like this:
DEFINE_INSN
opt_send_without_block
(CALL_DATA cd)
(...)
(VALUE val)
// attr bool handles_sp = true;
// attr rb_snum_t sp_inc = sp_inc_of_sendish(cd->ci);
// attr rb_snum_t comptime_sp_inc = sp_inc_of_sendish(ci);
{
    //...
}
It doesn’t specify the pop values, but instead uses the syntax (...), similar to argument forwarding in Ruby. It then specifies some stack pointer (“sp”) counts (those comments are actual code!), which I think allows it to handle a dynamic number of values to pop off the stack.
This seems complex for my case, where I have one required and one optional argument. I’ll defer this one for the moment.
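As a rough mental model of why the pop count has to be dynamic, here is a toy version in Ruby. simulate_send is a hypothetical helper, not anything from CRuby, and it ignores blocks, keyword arguments, and everything else the real sp_inc_of_sendish has to account for.

```ruby
# Toy model of a "sendish" instruction: argc comes from the call data,
# so the number of values popped is only known per call site.
def simulate_send(stack, argc)
  args = stack.pop(argc)   # arguments were pushed last
  recv = stack.pop         # the receiver sits underneath them
  stack.push(recv.respond_to?(*args))
  stack
end

one_arg = simulate_send([$stdout, :write], 1)        # pops 2 values, pushes 1
two_arg = simulate_send([$stdout, :write, true], 2)  # pops 3 values, pushes 1
puts "one arg: #{one_arg.inspect}, two args: #{two_arg.inspect}"
```

With a fixed (VALUE recv, VALUE mid) signature the instruction always pops exactly two values, which is why the include_all form doesn’t fit it as written.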
3. Actual optimization code
I actually don’t know if this is optimizable in a meaningful way. I’d be lying if I said I didn’t care if there’s an optimization win here - that’s the most satisfying/impactful outcome.
This entire series is inspired by Optimizing Ruby’s JSON, Part 2, and one of the goals of that work was to reduce setup costs. Here’s some of the JSON.dump method in its original form:
def dump(obj, anIO = nil, limit = nil, kwargs = nil)
  #...
  if anIO.respond_to?(:to_io)
    anIO = anIO.to_io
  elsif limit.nil? && !anIO.respond_to?(:write)
    anIO, limit = nil, anIO
  end
  #...
end
The majority of the time, anIO is nil, so it won’t have a to_io or write method. That means in a micro-benchmark running millions of times the call to respond_to? is pure overhead. The solution in the post was to avoid the call when nil, but how fast can we make it if we did a silly, nil-specific optimization?
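To see the overhead in isolation, here is a sketch that copies only the argument shuffling from the snippet above into a standalone method. normalize_dump_args is a hypothetical name, and the rest of JSON.dump is left out.

```ruby
# Hypothetical helper that mirrors only the argument shuffling quoted above.
def normalize_dump_args(an_io, limit)
  if an_io.respond_to?(:to_io)
    an_io = an_io.to_io
  elsif limit.nil? && !an_io.respond_to?(:write)
    an_io, limit = nil, an_io
  end
  [an_io, limit]
end

nil_case   = normalize_dump_args(nil, nil)  # both respond_to? calls run, and both miss
limit_case = normalize_dump_args(10, nil)   # a non-IO in the IO slot is treated as the limit
p nil_case
p limit_case
```

In the common nil case the method does two respond_to? lookups just to end up where it started.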
Setting up a performance baseline
Let’s set up a benchmark to see what our current performance is, as a baseline. In CRuby there are built-in benchmarking scripts we can use. We’ll define a new benchmark for respond_to?:
# benchmark/object_respond_to.yml
prelude: |
  class Base; def foo; end end
  class OneTwentyEight < Base
    128.times { include(Module.new) }
  end
  obj = OneTwentyEight.new
benchmark:
  respond_to_false: obj.respond_to?(:bar)
  respond_to_true: obj.respond_to?(:foo)
  respond_to_nil_false: nil.respond_to?(:bar)
loop_count: 1_000_000
This YAML first sets up a prelude, which is Ruby code to set up our benchmark:
- It defines a Base class with a foo method
- Creates a child class called OneTwentyEight, which extends the Base class
- Includes Module.new 128 times, to create a lot of ancestors to search for methods
- Instantiates OneTwentyEight to call from the benchmark
The benchmark keys specify what operations to run. respond_to_false checks respond_to? for a method that doesn’t exist, and respond_to_true checks for a method that does exist. respond_to_nil_false is unrelated to the prelude, but lets me test how fast looking for a method on nil is.
The loop_count is how many iterations the code will run. I believe it runs several times, and then calculates how many times per second it should be able to run. Aaron Patterson created this benchmark in a PR that never merged, so thanks to him for that!
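If you want a quick sanity check outside of the CRuby build, the same prelude can be timed by hand. This is a rough sketch, not the harness make benchmark uses: the ips helper is my own, the iteration count is smaller, and the block-call overhead means the numbers will not be comparable to the ones in this post.

```ruby
# Same prelude as the YAML benchmark.
class Base; def foo; end end
class OneTwentyEight < Base
  128.times { include(Module.new) }
end
obj = OneTwentyEight.new

# Hypothetical helper: run the block `iterations` times and return calls per second.
def ips(iterations)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { yield }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  iterations / elapsed
end

n = 200_000
results = {
  respond_to_false:     ips(n) { obj.respond_to?(:bar) },
  respond_to_true:      ips(n) { obj.respond_to?(:foo) },
  respond_to_nil_false: ips(n) { nil.respond_to?(:bar) },
}
results.each { |name, rate| puts format("%-22s %.1fM i/s", name, rate / 1_000_000.0) }
```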
We can run the benchmark using make benchmark ITEM='respond_to'. I get the following output on a clean master branch:
# Iteration per second (i/s)
| |compare-ruby|built-ruby|
|:---------------------|-----------:|---------:|
|respond_to_nil_false | 29.029M| 28.259M|
| | 1.03x| -|
|respond_to_false | 29.177M| 29.121M|
| | 1.00x| -|
|respond_to_true | 33.503M| 32.481M|
| | 1.03x| -|
compare-ruby is the version of Ruby the project was built with (yes, building Ruby requires Ruby 🫨). For me, that’s Ruby 3.4. built-ruby is my local, built version. The differences in performance are pretty negligible - probably differences in compile flags used to build the Rubies. The performance of each stays pretty close, and can flip-flop a bit between iterations.
You can run a lot of respond_to? calls in a second! The found method cases are the fastest, and the miss cases are consistently slower.
A first silly optimization
Now that we have a baseline, let’s try two optimizations to see what our upper-limit might be:
- A nil-specific check that always returns false
- A nil-specific check that has a hard-coded set of possible methods
First, we’ll change opt_respond_to into a common pattern. Many instructions will call a method, and if the method returns Qundef, they’ll revert to a base-case path. In our case right now, that’s CALL_SIMPLE_METHOD(). I assume Qundef exists to specify “undefined” behavior, to differentiate from Qnil, which could be a valid return value:
// insns.def
DEFINE_INSN
opt_respond_to
(CALL_DATA cd)
(VALUE recv, VALUE mid)
(VALUE val)
{
    val = vm_opt_respond_to(recv, mid);
    if (UNDEF_P(val)) {
        CALL_SIMPLE_METHOD();
    }
}
And here is our silliest optimization. If recv is nil, always return false. Otherwise, return Qundef:
// vm_insnhelper.c
static VALUE
vm_opt_respond_to(VALUE recv, VALUE mid)
{
    if (NIL_P(recv)) {
        return Qfalse;
    }
    return Qundef;
}
Let’s rerun our benchmark, and see what we get:
> make benchmark ITEM='respond_to'
# Iteration per second (i/s)
| |compare-ruby|built-ruby|
|:---------------------|-----------:|---------:|
|respond_to_false | 29.121M| 27.795M|
| | 1.05x| -|
|respond_to_true | 32.241M| 31.544M|
| | 1.02x| -|
|respond_to_nil_false | 26.872M| 57.894M|
| | -| 2.15x|
Oh, not bad! Around a 2x improvement. But ya know, it’s totally incorrect. We can add a spec to the respond_to_spec to check. It fails, as expected:
it "returns true for checking for `==` on nil" do
  nil.respond_to?(:==).should == true
end
# make test-spec SPECOPTS="../spec/ruby/core/kernel/respond_to_spec.rb"
# 1)
# Kernel#respond_to? returns true for checking for `==` on nil FAILED
# Expected false == true
# to be truthy but was false
# [/ | ==================100%================== | 00:00:00] 1F 0E
# Finished in 0.017146 seconds
# 1 file, 14 examples, 25 expectations, 1 failure, 0 errors, 0 tagged
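The fast-path-or-fallback shape, and the bug, can both be reproduced in a few lines of plain Ruby. This is a model of the pattern only: UNDEF, fast_respond_to and opt_respond_to are stand-ins I made up for Qundef, vm_opt_respond_to and the instruction body.

```ruby
UNDEF = Object.new.freeze  # stand-in for Qundef: "the fast path has no answer"

# Model of the silly fast path: nil never responds to anything.
def fast_respond_to(recv, _mid)
  return false if recv.nil?
  UNDEF
end

# Model of the instruction body: use the fast answer, or fall back to the real call.
def opt_respond_to(recv, mid)
  val = fast_respond_to(recv, mid)
  val = recv.respond_to?(mid) if val.equal?(UNDEF)  # plays the role of CALL_SIMPLE_METHOD()
  val
end

puts "fallback path: #{opt_respond_to('abc', :upcase)}"
puts "fast path:     #{opt_respond_to(nil, :==)} (real answer: #{nil.respond_to?(:==)})"
```

A dedicated sentinel matters because false and nil are both legitimate answers, so neither can double as “please fall back”.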
A second, slightly less silly optimization
What if I added some overhead, but not a ton of overhead? First, I got every method available to me from an irb session:
nil.methods
# "rationalize", "&", "===", "inspect", "=~", "to_a",...
Then I took that and put it into an array of C strings. The first time we call our vm_opt_respond_to function, it populates an rb_id_table with each of the available method names using rb_id_table_insert. rb_id_table is an internal CRuby hashtable structure which revolves around IDs, which I believe typically correspond to method names.
If the recv is nil, we use method_id_table to check if one of our hard-coded method names is being checked by respond_to?, using rb_id_table_lookup. If it returns true, we return Qtrue, otherwise Qfalse.
static struct rb_id_table *method_id_table = NULL;

static VALUE
vm_opt_respond_to(VALUE recv, VALUE mid)
{
    if (method_id_table == NULL) {
        const char *method_names[] = {
            "rationalize", "&", "===", "inspect", "=~", "to_a", "to_s", "to_i", "to_f", "to_r",
            "to_c", "nil?", "pretty_print_cycle", "|", "to_h", "^", "to_json", "to_yaml",
            "pretty_print", "pretty_print_instance_variables", "pretty_print_inspect", "singleton_class",
            "dup", "itself", "methods", "singleton_methods", "protected_methods", "private_methods",
            "public_methods", "instance_variables", "instance_variable_get", "instance_variable_set",
            "instance_variable_defined?", "remove_instance_variable", "instance_of?", "kind_of?",
            "is_a?", "display", "frozen?", "class", "then", "yield_self", "tap", "TypeName",
            "public_send", "extend", "clone", "<=>", "pretty_inspect", "!~", "method", "eql?",
            "respond_to?", "public_method", "singleton_method", "define_singleton_method", "hash",
            "freeze", "object_id", "Namespace", "send", "to_enum", "enum_for", "equal?", "!",
            "__send__", "==", "!=", "__id__", "instance_eval", "instance_exec"
        };
        size_t method_names_size = sizeof(method_names) / sizeof(method_names[0]);
        method_id_table = rb_id_table_create(method_names_size);
        for (size_t i = 0; i < method_names_size; i++) {
            ID id = rb_intern(method_names[i]);
            rb_id_table_insert(method_id_table, id, Qtrue);
        }
    }
    if (NIL_P(recv)) {
        ID id = rb_check_id(&mid);
        if (!id) return Qfalse;
        VALUE val;
        if (rb_id_table_lookup(method_id_table, id, &val)) {
            return Qtrue;
        } else {
            return Qfalse;
        }
    }
    return Qundef;
}
How fast is this version, now that we’re doing some actual work?
# Iteration per second (i/s)
| |compare-ruby|built-ruby|
|:---------------------|-----------:|---------:|
|respond_to_false | 29.668M| 28.738M|
| | 1.03x| -|
|respond_to_true | 33.320M| 29.829M|
| | 1.12x| -|
|respond_to_nil_false | 28.610M| 53.084M|
| | -| 1.86x|
Still pretty fast! It even passes our spec now:
[| | ==================100%================== | 00:00:00] 0F 0E
Finished in 0.008847 seconds
1 file, 14 examples, 25 expectations, 0 failures, 0 errors, 0 tagged
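Passing the spec doesn’t make a hard-coded list correct, though. Here is a small Ruby model of the same idea, assuming a one-time snapshot of nil’s methods stored in a Hash in place of the rb_id_table. The names NIL_METHOD_TABLE, snapshot_respond_to and shout are made up for the example.

```ruby
# One-time snapshot of nil's public methods, keyed by name (like id -> Qtrue).
NIL_METHOD_TABLE = nil.methods.to_h { |name| [name, true] }.freeze

# Model of the table lookup: only consults the snapshot, never the live object.
def snapshot_respond_to(mid)
  NIL_METHOD_TABLE.key?(mid.to_sym)
end

known = snapshot_respond_to(:==)

# Someone reopens NilClass after the snapshot was taken.
class NilClass
  def shout; "NIL!"; end
end

stale = snapshot_respond_to(:shout)
live  = nil.respond_to?(:shout)
puts "== known: #{known}, shout in snapshot: #{stale}, shout for real: #{live}"
```

The snapshot keeps answering from the moment it was built, so any method defined on nil later is invisible to it. That is one reason I’m treating these numbers as a ceiling rather than a plan.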
Back to reality
Ok - we described our base code. We walked through next steps. We ran some specs and got a feel for some benchmarks. It seems like our upper limit on performance may be about 2x how fast it currently runs - and it’s probably unattainable. But it’s nice to know the potential ceiling on performance from where things currently are.
Next time we’ll dig into some previous optimization improvements to respond_to? in older PRs, how respond_to? works currently, and hopefully make our first real optimization improvement. See you next time!
PS - You can find the code changes made in the branch here.