A silly optimization: adding opt_respond_to to the Ruby VM, part 6

Jan 6, 2025 JP Camara

In part 5, we finally got our new instruction defined and showing up in our bytecode. If you didn’t run it yourself, you just had to trust me that it really did run.

But, I just dropped most of the implementation code in without explaining it. Let’s start off by walking through the basic version, then start planning for the true optimization.

The progress so far

Here’s our sample Ruby program:

puts "Did you know you can write to $stdout?" if $stdout.respond_to?(:write)

First, we’ll disassemble the code using make run, and run it using our C changes (you can pull the work in progress here):

RUNOPT0=--dump=insns make run

This gives us a new set of instructions. Most of it is the same as Ruby master, but opt_send_without_block is changed to opt_respond_to. The calldata containing respond_to? is still there, and I think it’ll stay even once we finish the whole implementation:

# == disasm: #<ISeq:<main>./test.rb:1 (1,0)-(1,76)>
0000 getglobal                :$stdout                  (   1)[Li]
0002 putobject                :write
# our new instruction!
0004 opt_respond_to           <calldata!mid:respond_to?, argc:1, ARGS_SIMPLE>
0006 branchunless             14
0008 putself
0009 putchilledstring         "Did you know you can write to $stdout?"
0011 opt_send_without_block   <calldata!mid:puts, argc:1, FCALL|ARGS_SIMPLE>
0013 leave
0014 putnil
0015 leave

Our current implementation is mostly just a pass-through to the normal respond_to? method, with some debug information printed. Running it without the --dump=insns option, this is the output we get:

> make run

symbol:File
Did you know you can write to $stdout?

File is the type of the receiver, $stdout, and symbol is the type of the method argument, :write.

📝 In previous posts, we used make runruby and make lldb-ruby/make gdb-ruby. Based on feedback from Ruby maintainers in the know (like byroot), it seems like make run and make lldb/make gdb are the better options in 99% of cases. These commands use “miniruby”, which supports all of the Ruby syntax without loading the stdlib or gems, so it should run faster. If you do need the stdlib and standard gems, you’ll want to continue using make runruby and friends.

Breaking down the changes

The last post was running pretty long, so I dumped all the code at the end without explanation. Let’s break each section down, starting with our insns.def change to the virtual machine DSL:

//insns.def
DEFINE_INSN
opt_respond_to
(CALL_DATA cd)
(VALUE recv, VALUE mid)
(VALUE val)
{
    val = vm_opt_respond_to(recv, mid);
    CALL_SIMPLE_METHOD();
}

We have some context for how a virtual machine instruction is defined from the previous post, so let’s break this down:

opt_respond_to is the name of the instruction
(CALL_DATA cd) is the one “operand”, the call data of the method. I don’t think we’ll need this for our optimized version, but I think if we use a fallback it would still be required
(VALUE recv, VALUE mid) are the values this instruction is expecting to be popped off the stack so they can be used in the call. In our sample program instructions this should correspond to getglobal :$stdout and putobject :write. $stdout is recv, or the “receiver”. :write is mid, or the “method id”
(VALUE val) is the return value. Whatever gets set to val gets pushed onto the stack at the end of the instruction. The next instruction in our example is branchunless, which pops our val off the stack and tests it
Next is the body of the instruction:
- val = vm_opt_respond_to(recv, mid); here I followed the convention of other instructions which need some custom logic - they put their code inside of a vm_ prefixed function named after their instruction, and define it in vm_insnhelper.c. My function takes the receiver and the method id, and we’ll dive into that in a bit
- I think CALL_SIMPLE_METHOD(); will use the calldata to call the original method. Normally you would check the return value of the vm_ function to determine whether you want to pass through to the original implementation. In my case, my function is just printing some debug information so I let it always call the original
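To make the pop/push flow concrete, here is a rough Ruby model of what the instruction does with the stack. It is only a sketch: run_opt_respond_to is a made-up helper, a plain Array stands in for the VM stack, and calling respond_to? directly stands in for CALL_SIMPLE_METHOD().

```ruby
# Toy model of the VM stack around opt_respond_to.
# `run_opt_respond_to` is a hypothetical helper, not a CRuby API.
def run_opt_respond_to(stack)
  mid  = stack.pop             # pushed last, by `putobject :write`
  recv = stack.pop             # pushed first, by `getglobal :$stdout`
  val  = recv.respond_to?(mid) # stand-in for CALL_SIMPLE_METHOD()
  stack.push(val)              # the result is what `branchunless` pops next
  stack
end

stack = ["a string", :upcase]
run_opt_respond_to(stack)
p stack # → [true]
```

The order of the pops matters: the last value pushed (the method name) comes off first, and the receiver comes off second.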

We’ve dug into most of the pattern matching logic in compile.c in previous posts, so I’ll skip that part and focus on the instruction override:

// compile.c
const struct rb_callinfo *ci = (struct rb_callinfo *)OPERAND_AT(iobj, 0);
//...
iobj->insn_id = BIN(opt_respond_to);
iobj->operand_size = 1;
iobj->operands = compile_data_calloc2(
  iseq, 
  iobj->operand_size, 
  sizeof(VALUE)
);
iobj->operands[0] = (VALUE)ci;

Once it’s found an instruction that matches a send to respond_to?, we override the current information. First we set insn_id to BIN(opt_respond_to), which we know expands to the enum value YARVINSN_opt_respond_to.

The rest seems… redundant? It already had ci at the first operand position, it was already an operand_size of 1. It’s possible I don’t need to recompile this, but I’ll need some guidance around that. It’s probably not harmful, but possibly unnecessary.

Last we’ve got our vm_opt_respond_to function:

// vm_insnhelper.c
static VALUE
vm_opt_respond_to(VALUE recv, VALUE mid)
{
  if (SYMBOL_P(mid)) {
    printf("symbol:");
  } else if (STRING_P(mid)) {
    printf("string:");
  }
  printf("%s\n", rb_builtin_type_name(TYPE(recv)));
  return Qundef;
}

It’s purely a debug function right now. It prints “symbol:” if mid is a symbol (SYMBOL_P and STRING_P are each “predicate” functions, hence the _P), “string:” if we have a string. Then it prints the type of the receiver and a new line. This is how we end up with symbol:File when we run our program:

puts "Did you know you can write to $stdout?" if $stdout.respond_to?(:write)
# symbol:File
# Did you know you can write to $stdout?

What’s next?

I’m missing some things at the moment:

Tests
Logic for handling the private/protected param
Actual optimization code 😅

1. Tests

There should already be tests for respond_to?, so I’ll start running those and rely on them for the moment.

As might be expected for an entire language, there are tons of tests. There is also RubySpec, which is the standard spec suite for every Ruby language implementation. It’s automatically included in the repository as well.

I’ll rely on those specs for now:

> make test-spec SPECOPTS="../spec/ruby/core/kernel/respond_to_spec.rb"

ruby 3.5.0dev (2025-01-04T14:32:13Z opt-respond-to 5688434f63) +PRISM [arm64-darwin24]
[\ | ==================100%================== | 00:00:00]      0F      0E 

Finished in 0.007758 seconds

1 file, 13 examples, 24 expectations, 0 failures, 0 errors, 0 tagged

As expected, it still works so far since my version is basically a pass-through. We’ll see if we need more specs later on or if the base set is enough.

2. Logic for handling the private/protected param

respond_to? takes a second parameter - include_all - which determines whether to include private and protected methods.
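If you have never seen the second parameter in action either, here is a small illustration. The Vault class and its method names are invented for the example.

```ruby
# `Vault` is a made-up class used only to show include_all.
class Vault
  def unlock; end

  private

  def combination; end
end

vault = Vault.new
p vault.respond_to?(:unlock)            # public method → true
p vault.respond_to?(:combination)       # private, include_all defaults to false → false
p vault.respond_to?(:combination, true) # include_all = true → true
```

Any optimized instruction has to give the same answers for both the one-argument and two-argument forms, or fall back to the regular method call for the form it doesn't handle.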

I’ve never seen someone use this second parameter, but I’m sure it’s out there somewhere 🤷‍♂️. Piotr Szotkowski recently told me he’s a fan of the flip-flop operator - so the world is full of surprises 😉! Part of me wants to ignore it for optimizing and just pass through in that case, but that’s a total cop out.

I think there is some VM magic I need to utilize to handle an optional argument, applying special attributes for dynamic stack pointer adjustment. For instance, opt_send_without_block is defined like this:

DEFINE_INSN
opt_send_without_block
(CALL_DATA cd)
(...)
(VALUE val)
// attr bool handles_sp = true;
// attr rb_snum_t sp_inc = sp_inc_of_sendish(cd->ci);
// attr rb_snum_t comptime_sp_inc = sp_inc_of_sendish(ci);
{
  //...
}

It doesn’t specify the pop values, but instead uses the syntax (...) similar to argument forwarding in Ruby. It then specifies some stack pointer (“sp”) counts (those comments are actual code!), which I think allows it to handle a dynamic number of values to pop off the stack.

This seems complex for my case, where I have one required and one optional argument. I’ll defer this one for the moment.

3. Actual optimization code

I actually don’t know if this is optimizable in a meaningful way. I’d be lying if I said I didn’t care if there’s an optimization win here - that’s the most satisfying/impactful outcome.

This entire series is inspired by Optimizing Ruby’s JSON, Part 2, and one of the goals of that work was to reduce setup costs. Here’s some of the JSON.dump method in its original form:

def dump(obj, anIO = nil, limit = nil, kwargs = nil)
  #...
  if anIO.respond_to?(:to_io)
    anIO = anIO.to_io
  elsif limit.nil? && !anIO.respond_to?(:write)
    anIO, limit = nil, anIO
  end
  #...
end

The majority of the time, anIO is nil, so it won’t have a to_io or write method. That means in a micro-benchmark running millions of times the call to respond_to? is pure overhead. The solution in the post was to avoid the call when nil, but how fast can we make it if we did a silly, nil-specific optimization?
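For reference, the "avoid the call when nil" idea could look roughly like the sketch below. This is my own illustration and not the actual json gem code: normalize_dump_args is a hypothetical helper that isolates just the argument shuffling from the snippet above.

```ruby
# Hypothetical helper, not part of the json gem: it isolates the
# argument shuffling from JSON.dump and skips it entirely for nil.
def normalize_dump_args(an_io, limit)
  unless an_io.nil? # common case: no respond_to? calls at all
    if an_io.respond_to?(:to_io)
      an_io = an_io.to_io
    elsif limit.nil? && !an_io.respond_to?(:write)
      # the second positional argument was really the limit
      an_io, limit = nil, an_io
    end
  end
  [an_io, limit]
end

p normalize_dump_args(nil, nil) # → [nil, nil]
p normalize_dump_args(10, nil)  # → [nil, 10]
```

Skipping the branch for nil is safe because nil responds to neither to_io nor write, so the original code would leave both values unchanged anyway.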

Setting up a performance baseline

Let’s set up a benchmark to see what our current performance is, as a baseline. In CRuby there are built-in benchmarking scripts we can use. We’ll define a new benchmark for respond_to?:

# benchmark/object_respond_to.yml
prelude: |
  class Base; def foo; end end
  class OneTwentyEight < Base
    128.times { include(Module.new) }
  end
  obj = OneTwentyEight.new  
benchmark:
  respond_to_false: obj.respond_to?(:bar)
  respond_to_true: obj.respond_to?(:foo)
  respond_to_nil_false: nil.respond_to?(:bar)
loop_count: 1_000_000

This YAML first sets up a prelude, which is Ruby code to set up our benchmark:

It defines a Base class with a foo method
Creates a child class called OneTwentyEight, which extends the Base class
Includes Module.new 128 times, to create a lot of ancestors to search for methods
Instantiates OneTwentyEight to call from the benchmark

The benchmark keys specify what operations to run. respond_to_false checks respond_to? for a method that doesn’t exist, and respond_to_true checks for a method that does exist. respond_to_nil_false is unrelated to the prelude, but lets me test how fast looking for a method on nil is.

The loop_count is how many iterations the code will run. I believe it runs several times, and then calculates how many times per second it should be able to run. Aaron Patterson created this benchmark in a PR that never merged, so thanks to him for that!
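To get a feel for what an "iterations per second" number means, here is a hand-rolled approximation. It is not how the real harness measures things: iterations_per_second is a made-up helper, and the block call adds overhead, so expect lower numbers than the table reports.

```ruby
# Rough stand-in for a benchmark harness; `iterations_per_second`
# is a hypothetical helper, not part of CRuby's benchmark tooling.
def iterations_per_second(loop_count)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  loop_count.times { yield }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  loop_count / elapsed
end

ips = iterations_per_second(1_000_000) { nil.respond_to?(:bar) }
puts format("%.3fM i/s", ips / 1_000_000.0)
```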

We can run the benchmark using make benchmark ITEM='respond_to'. I get the following output on a clean master branch:

# Iteration per second (i/s)
|                      |compare-ruby|built-ruby|
|:---------------------|-----------:|---------:|
|respond_to_nil_false  |     29.029M|   28.259M|
|                      |       1.03x|         -|
|respond_to_false      |     29.177M|   29.121M|
|                      |       1.00x|         -|
|respond_to_true       |     33.503M|   32.481M|
|                      |       1.03x|         -|

compare-ruby is the version of Ruby the project was built with (yes, building Ruby requires Ruby 🫨). For me, that’s Ruby 3.4. built-ruby is my local, built version. The differences in performance are pretty negligible - probably differences in compile flags used to build Rubies. The performance of each stays pretty close, and can flip-flop a bit between iterations.

You can run a lot of respond_to?s in a second! The found method cases are the fastest, and the miss cases are consistently slower.

A first silly optimization

Now that we have a baseline, let’s try two optimizations to see what our upper-limit might be:

A nil specific check that always returns false
A nil specific check that has a hard-coded set of possible methods

First, we’ll change opt_respond_to into a common pattern. Many instructions will call a method, and if the method returns Qundef, they’ll revert to a base-case path. In our case right now, that’s CALL_SIMPLE_METHOD(). I assume Qundef exists to specify “undefined” behavior, to differentiate from Qnil which could be a valid return value:

// insns.def
DEFINE_INSN
opt_respond_to
(CALL_DATA cd)
(VALUE recv, VALUE mid)
(VALUE val)
{
  val = vm_opt_respond_to(recv, mid);
  if (UNDEF_P(val)) {
    CALL_SIMPLE_METHOD();
  }
}

And here is our silliest optimization. If recv is nil, always return false. Otherwise, return Qundef:

// vm_insnhelper.c
static VALUE
vm_opt_respond_to(VALUE recv, VALUE mid)
{
  if (NIL_P(recv)) {
    return Qfalse;
  }

  return Qundef;
}
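The same "fast path, else fall back" shape can be modeled in plain Ruby. This is a sketch of the pattern only: UNDEF is a made-up sentinel playing the role of Qundef, and both methods are hypothetical stand-ins for the C code above.

```ruby
UNDEF = Object.new.freeze # hypothetical sentinel standing in for Qundef

# Fast path: only knows about nil, and (wrongly) always answers false.
def fast_respond_to(recv, _mid)
  return false if recv.nil?
  UNDEF
end

# Mirrors the instruction body: keep the fast answer unless it is UNDEF.
def opt_respond_to(recv, mid)
  val = fast_respond_to(recv, mid)
  val.equal?(UNDEF) ? recv.respond_to?(mid) : val
end

p opt_respond_to("str", :upcase) # falls back to the real method → true
p opt_respond_to(nil, :bar)      # fast path → false
p opt_respond_to(nil, :==)       # fast path → false, although nil.respond_to?(:==) is true
```

A separate sentinel is needed because false and nil are both legitimate answers, so neither can double as "I didn't handle this".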

Let’s rerun our benchmark, and see what we get:

> make benchmark ITEM='respond_to'

# Iteration per second (i/s)
|                      |compare-ruby|built-ruby|
|:---------------------|-----------:|---------:|
|respond_to_false      |     29.121M|   27.795M|
|                      |       1.05x|         -|
|respond_to_true       |     32.241M|   31.544M|
|                      |       1.02x|         -|
|respond_to_nil_false  |     26.872M|   57.894M|
|                      |           -|     2.15x|

Oh, not bad! Around a 2x improvement. But ya know, it’s totally incorrect. We can add a spec to the respond_to_spec to check. It fails, as expected:

it "returns true for checking for `==` on nil" do
  nil.respond_to?(:==).should == true
end

# make test-spec SPECOPTS="../spec/ruby/core/kernel/respond_to_spec.rb"
# 1)
# Kernel#respond_to? returns true for checking for `==` on nil FAILED
# Expected false == true
# to be truthy but was false
# [/ | ==================100%================== | 00:00:00]      1F      0E 
# Finished in 0.017146 seconds
# 1 file, 14 examples, 25 expectations, 1 failure, 0 errors, 0 tagged

A second, slightly less silly optimization

What if I added some overhead, but not a ton of overhead? First, I got every method available to me from an irb session:

nil.methods
# "rationalize", "&", "===", "inspect", "=~", "to_a",...

Then I took that and put it into an array of C strings. The first time we call our vm_opt_respond_to function, it populates an rb_id_table with each of the available method names using rb_id_table_insert. rb_id_table is an internal CRuby hashtable structure which revolves around IDs, which I believe typically correspond to method names.

If the recv is nil, we use method_id_table to check if one of our hard-coded method names is being checked by respond_to?, using rb_id_table_lookup. If it returns true, we return Qtrue, otherwise Qfalse.

static struct rb_id_table *method_id_table = NULL;

static VALUE
vm_opt_respond_to(VALUE recv, VALUE mid)
{
  if (method_id_table == NULL) {
    const char *method_names[] = {
      "rationalize", "&", "===", "inspect", "=~", "to_a", "to_s", "to_i", "to_f", "to_r",
      "to_c", "nil?", "pretty_print_cycle", "|", "to_h", "^", "to_json", "to_yaml",
      "pretty_print", "pretty_print_instance_variables", "pretty_print_inspect", "singleton_class",
      "dup", "itself", "methods", "singleton_methods", "protected_methods", "private_methods",
      "public_methods", "instance_variables", "instance_variable_get", "instance_variable_set",
      "instance_variable_defined?", "remove_instance_variable", "instance_of?", "kind_of?",
      "is_a?", "display", "frozen?", "class", "then", "yield_self", "tap", "TypeName",
      "public_send", "extend", "clone", "<=>", "pretty_inspect", "!~", "method", "eql?",
      "respond_to?", "public_method", "singleton_method", "define_singleton_method", "hash",
      "freeze", "object_id", "Namespace", "send", "to_enum", "enum_for", "equal?", "!",
      "__send__", "==", "!=", "__id__", "instance_eval", "instance_exec"
    };

    size_t method_names_size = sizeof(method_names) / sizeof(method_names[0]);
    method_id_table = rb_id_table_create(method_names_size);

    for (size_t i = 0; i < method_names_size; i++) {
      ID id = rb_intern(method_names[i]);
      rb_id_table_insert(method_id_table, id, Qtrue);
    }
  }
  if (NIL_P(recv)) {
    ID id = rb_check_id(&mid);
    if (!id) return Qfalse;

    VALUE val;
    if (rb_id_table_lookup(method_id_table, id, &val)) {
      return Qtrue;
    } else {
      return Qfalse;
    }
  }

  return Qundef;
}
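In Ruby terms, the lookup table idea is roughly the following. This is an analogy rather than a translation: NIL_METHOD_NAMES and nil_respond_to? are made-up names, and the set is snapshotted from nil.public_methods at load time instead of being typed out by hand.

```ruby
require "set"

# Hypothetical Ruby rendering of the lookup table: built once, then reused.
NIL_METHOD_NAMES = Set.new(nil.public_methods).freeze

def nil_respond_to?(mid)
  # to_sym accepts both Symbol and String, like respond_to? does
  NIL_METHOD_NAMES.include?(mid.to_sym)
end

p nil_respond_to?(:==)    # → true
p nil_respond_to?("to_a") # → true
p nil_respond_to?(:bar)   # → false
```

One difference worth noting: to_sym creates a symbol for any string it is given, while the C version uses rb_check_id and returns Qfalse early when no ID exists for the name. Both versions share the same weakness, which is that methods defined on nil after the table is built are invisible to it.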

How fast is this version, now that we’re doing some actual work?

# Iteration per second (i/s)
|                      |compare-ruby|built-ruby|
|:---------------------|-----------:|---------:|
|respond_to_false      |     29.668M|   28.738M|
|                      |       1.03x|         -|
|respond_to_true       |     33.320M|   29.829M|
|                      |       1.12x|         -|
|respond_to_nil_false  |     28.610M|   53.084M|
|                      |           -|     1.86x|

Still pretty fast! It even passes our spec now:

[| | ==================100%================== | 00:00:00]      0F      0E 
Finished in 0.008847 seconds
1 file, 14 examples, 25 expectations, 0 failures, 0 errors, 0 tagged

Back to reality

Ok - we described our base code. We walked through next steps. We ran some specs and got a feel for some benchmarks. It seems like our upper limit on performance may be about 2x how fast it currently runs - and it’s probably unattainable. But it’s nice to know the potential ceiling on performance from where things currently are.

Next time we’ll dig into some previous optimization improvements to respond_to? in older PRs, how respond_to? works currently, and hopefully make our first real optimization improvement. See you next time!

PS - You can find the code changes made in the branch here.


Defining an instruction: adding opt_respond_to to the Ruby VM, part 5

Dec 31, 2024 JP Camara

In Peephole optimizations: adding opt_respond_to to the Ruby VM, part 4, we dug deep. We found the connection between prism compilation and the specialization we need for our new bytecode, called “peephole optimization”. We learned how to debug and step through C code in the Ruby runtime, and we added some logic for matching the “pattern” of the instruction we want to change.

Now that we know where the specialization needs to go and how to match what needs to be specialized - what do we actually replace it with? How do we get the virtual machine to recognize opt_respond_to?

Pattern matching bytecode instructions

opt_ary_freeze has been a great learning tool - let’s see what it teaches us about adding a new instruction name.

Here’s a refresher on how iseq_peephole_optimize matches on newarray, and then replaces it with opt_ary_freeze:

if (
  IS_INSN_ID(iobj, newarray)
) {
  LINK_ELEMENT *next = iobj->link.next;
  if (
    IS_INSN(next) &&
    IS_INSN_ID(next, send)
  ) {
    //... more if statements
    iobj->insn_id = BIN(opt_ary_freeze);

It first checks if the instruction id of iobj is newarray
It grabs the next element and checks if it’s an instruction, then checks if the instruction is send (it also checks if the method id is “freeze”, not shown above)
If those checks match, it replaces the instruction id with opt_ary_freeze
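Those three steps can be sketched as a toy peephole pass in Ruby. The instruction shapes here ([name, *operands] arrays) are simplified assumptions for illustration and look nothing like CRuby's linked list of INSN structs.

```ruby
# Toy peephole pass over an array of [name, *operands] instructions.
# `peephole_optimize` is a hypothetical helper, not CRuby code.
def peephole_optimize(insns)
  out = []
  i = 0
  while i < insns.length
    cur = insns[i]
    nxt = insns[i + 1]
    if cur == [:newarray, 0] && nxt == [:send, :freeze, 0]
      out << [:opt_ary_freeze] # replace the pair with the specialized instruction
      i += 2                   # skip the `send` we just folded in
    else
      out << cur
      i += 1
    end
  end
  out
end

program = [[:putself], [:newarray, 0], [:send, :freeze, 0], [:send, :pp, 1], [:leave]]
p peephole_optimize(program)
# → [[:putself], [:opt_ary_freeze], [:send, :pp, 1], [:leave]]
```

The "peephole" is the two-instruction window (cur and nxt): the pass never looks at the program as a whole, only at small adjacent groups.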

That’s pretty reasonable to follow. But how do IS_INSN and IS_INSN_ID work? What is BIN? What type is opt_ary_freeze - where is it defined? How do we add new instructions ourselves?

Macros and enums

BIN, IS_INSN and IS_INSN_ID are all C macros that revolve around interacting with virtual machine instructions.

Macros in C get embedded directly into your code in a preprocessing step before being compiled, so you can write things that look pretty odd compared to a typical C-like syntax. Here’s the definition for BIN:

#define BIN(n) YARVINSN_##n

📝 BIN probably stands for “Binary INstruction”

That ## is kind of like string interpolation, but the result is a static part of your actual code. This means that anywhere BIN is called, it’s kind of like saying YARVINSN_#{n} in Ruby. So this code:

iobj->insn_id = BIN(opt_ary_freeze);

Gets transformed into this, right before the program is compiled:

iobj->insn_id = YARVINSN_opt_ary_freeze;

Here’s the definition for IS_INSN_ID:

#define IS_INSN_ID(iobj, insn) (INSN_OF(iobj) == BIN(insn))

Based on our understanding of macros and BIN, so far, it gets transformed into:

#define IS_INSN_ID(iobj, insn) (INSN_OF(iobj) == YARVINSN_##insn)

Here’s the definition for INSN_OF, it just casts insn to an INSN type, and accesses its instruction id:

#define INSN_OF(insn) \
  (((INSN*)(insn))->insn_id)

That means the expanded version of IS_INSN_ID is:

#define IS_INSN_ID(iobj, insn) \
  (((INSN*)(iobj))->insn_id == YARVINSN_##insn)

Here’s the definition for IS_INSN:

#define IS_INSN(link) ((link)->type == ISEQ_ELEMENT_INSN)

If we combined all of it together, and manually inline it, here’s what our original pattern matching code looks like:

if (
  (((INSN*)(iobj))->insn_id == YARVINSN_newarray)
) {
  LINK_ELEMENT *next = iobj->link.next;
  if (
    (next)->type == ISEQ_ELEMENT_INSN &&
    (((INSN*)(next))->insn_id == YARVINSN_send)
  ) {
    //... more if statements
    iobj->insn_id = YARVINSN_opt_ary_freeze;

I’m glad CRuby adds those macros… this expanded code is a lot less readable.

Why did I expand it and make that original code less clear? I wanted to think through what the instructions really look like at runtime, and why. As far as I can infer, all of these macros let you focus on a syntax that looks just like our VM instructions, while making sure there are no name collisions behind the scenes.

Ok, I still haven’t shown where the instruction comes from. Here’s the file where you can actually find an enum containing the entire list of VM instructions:

// insns.inc
enum ruby_vminsn_type {
  BIN(nop),
  BIN(getlocal),
  //...
  BIN(opt_ary_freeze),
  //...
};

// or, expanded by the preprocessor!
enum ruby_vminsn_type {
  YARVINSN_nop,
  YARVINSN_getlocal,
  //...
  YARVINSN_opt_ary_freeze,
  //...
};

insns.inc gets included anywhere we need instruction checks, like in compile.c. These enum values are globally available anywhere this file is included. Thanks to BIN prepending all of their names with YARVINSN_, we can use them in a convenient syntax without having any collisions.

So if I search the CRuby repo for insns.inc, where can I find it? Hmmm, I can’t 🤔. insns.inc is a generated file! I can only see it locally, after compiling the entire project. Where does that file get generated from?

A virtual machine DSL

While insns.inc tells us the name of each instruction available, the file it is generated from defines every instruction available in the Ruby virtual machine, and how it should respond to that instruction. It’s called insns.def. The file looks a lot like C, but it’s actually a kind of DSL.

It lets you define a simplified set of information for the instruction. That simplified format is then compiled into a more comprehensive, C compatible version.

It’s compilers all the way down… 😵‍💫

The top of the file defines the format. I don’t fully understand it, but let’s walk through it:

/*
DEFINE_INSN
instruction_name
(type operand, type operand, ..)
(pop_values, ..)
(return values ..)
// attr type name contents..
{
  .. // insn body
}
*/

An instruction consists of:

A name ✅
Operands - like our “call data”. I’m not sure what other types of operands are typically used for, but I know opt_ary_freeze puts the frozen array here as well
Values to pop off the virtual machine stack so we can operate on them. This should be values that we’ve seen pushed onto the stack in previous instructions
A return value
(I don’t fully understand the value of attr type but it seems to influence what code gets generated by the instruction definition)
A C-compatible body

That’s a lot, baked into a small interface. Let’s look at a very simple example. Here’s one of the simplest instructions available, putnil:

DEFINE_INSN
putnil
()
()
(VALUE val)
{
    val = Qnil;
}

Looks… pointless? The putnil instruction takes no arguments, and has a return value of val. The only thing the code block does is set val equal to Qnil, which is a special value in CRuby representing Ruby’s nil. What does that accomplish?

📝 VALUE is a special type in CRuby that points at a Ruby object, usually located on the heap. When you see VALUE, this often means we’re looking at a value you’d use in a Ruby program.

This file is compiled into regular C code, and the context of this simple instruction becomes clearer:

// vm.inc
/* insn putnil()()(val) */
INSN_ENTRY(putnil)
{
  //...
  VALUE val;
  //...
#   define NAME_OF_CURRENT_INSN putnil
#   line 331 "../insns.def"
{
  val = Qnil;
}
  //...
  INC_SP(INSN_ATTR(sp_inc));
  TOPN(0) = val;
  //...
}

The return value VALUE val is declared
It’s set to val = Qnil, the instruction we saw in insns.def
INC_SP is called, which I believe “increments” the “stack pointer”, giving us extra space on the stack to push onto?
TOPN(0) = val sets val to the top of the stack
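Here is how I picture those four steps, modeled in Ruby. This is a guess at the mechanics and not a description of the real VM: ToyVM is a made-up class, and a Ruby Array plus an integer stands in for raw memory and the stack pointer.

```ruby
# Toy stack with an explicit stack pointer, loosely following the generated C.
# `ToyVM` is hypothetical; the real VM works on raw memory, not a Ruby Array.
class ToyVM
  attr_reader :stack, :sp

  def initialize
    @stack = []
    @sp = 0
  end

  def putnil
    val = nil             # the instruction body: val = Qnil
    @sp += 1              # INC_SP: make room for one more value
    @stack[@sp - 1] = val # TOPN(0) = val: write into the new top slot
  end
end

vm = ToyVM.new
vm.putnil
p vm.stack # → [nil]
p vm.sp    # → 1
```

Seen this way, putnil is not pointless at all: its whole job is to leave a nil on top of the stack for the next instruction to consume.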

I think I’ll dig more into that next time. But let’s get back to the task at hand - it’s time to try and get our respond_to? call replaced with opt_respond_to!

Adding to the DSL

It took me a bit of banging my head against a wall, but here is the working instruction and specialization, in a basic form:

// insns.def
DEFINE_INSN
opt_respond_to
(CALL_DATA cd)
(VALUE recv, VALUE mid)
(VALUE val)
{
    val = vm_opt_respond_to(recv, mid);
    CALL_SIMPLE_METHOD();
}

// compile.c
if (IS_INSN_ID(iobj, send)) {
  const struct rb_callinfo *ci = (struct rb_callinfo *)OPERAND_AT(iobj, 0);
  const rb_iseq_t *blockiseq = (rb_iseq_t *)OPERAND_AT(iobj, 1);

  if (vm_ci_simple(ci) && vm_ci_argc(ci) == 1 && blockiseq == NULL && vm_ci_mid(ci) == idRespond_to) {
      iobj->insn_id = BIN(opt_respond_to);
      iobj->operand_size = 1;
      iobj->operands = compile_data_calloc2(iseq, iobj->operand_size, sizeof(VALUE));
      iobj->operands[0] = (VALUE)ci;
  }
}

If we dump the instructions, we finally see our new instruction opt_respond_to. It’s not really doing anything yet, but it’s there!

puts "Did you know you can write to $stdout?" if $stdout.respond_to?(:write)

# > RUNOPT0=--dump=insns make run
# == disasm: #<ISeq:<main>./test.rb:1 (1,0)-(1,76)>
# 0000 getglobal                :$stdout                  (   1)[Li]
# 0002 putobject                :write
# 0004 opt_respond_to           <calldata!mid:respond_to?, argc:1, ARGS_SIMPLE>
# 0006 branchunless             14
# 0008 putself
# 0009 putchilledstring         "Did you know you can write to $stdout?"
# 0011 opt_send_without_block   <calldata!mid:puts, argc:1, FCALL|ARGS_SIMPLE>
# 0013 leave
# 0014 putnil
# 0015 leave

Sorry to just dump the code here, and split. We’ll dig into it more, and expand on it next time! There is lots more to do, and to explain, but I’m excited about this milestone! See you next time! 👋🏼

PS - since something is working now, I’ve pushed up my basic code so far, here.


Peephole optimizations: adding `opt_respond_to` to the Ruby VM, part 4

Dec 27, 2024 JP Camara

In The Ruby Syntax Holy Grail: adding opt_respond_to to the Ruby VM, part 3, I found what I referred to as the “Holy Grail” of Ruby syntax. I’m way overstating it, but it’s a readable, sequential way of viewing how a large portion of the Ruby syntax is compiled. Here’s a snippet of it as a reminder:

// prism_compile.c
static void
pm_compile_node(rb_iseq_t *iseq, const pm_node_t *node, LINK_ANCHOR *const ret, bool popped, pm_scope_node_t *scope_node)
{
    const pm_parser_t *parser = scope_node->parser;
    //...
    switch (PM_NODE_TYPE(node)) {
      //...
      case PM_ARRAY_NODE: {
        // [foo, bar, baz]
        // ^^^^^^^^^^^^^^^
        const pm_array_node_t *cast = (const pm_array_node_t *) node;
        pm_compile_array_node(iseq, (const pm_node_t *) cast, &cast->elements, &location, ret, popped, scope_node);
        return;
      }
      //...
      case PM_MODULE_NODE: {
        // module Foo; end
        //...
      }
      //...
}

The file that code lives in, prism_compile.c, is enormous. pm_compile_node itself is 1800+ lines, and the overall file is 11 thousand lines. It’s daunting to say the least, but there are some obvious directions I can ignore - I’m trying to optimize a method call to respond_to?, so I can sidestep a majority of the Ruby syntax.

Still, where do I go, specifically?

Sage wisdom

Helpfully, I got two identical sets of direction based on part 3. One from Kevin Newton, creator of Prism:

https://x.com/kddnewton/status/1872280281409105925?s=46

And one from byroot, who inspired this whole series:

https://bsky.app/profile/byroot.bsky.social/post/3le6xypzykc2x

I don’t want to jump to conclusions, but I think I need to look at the peephole optimizer 😆.

And exactly what is a “peephole optimizer”? Kevin described the process as “specialization comes after compilation”. From Wikipedia:

Peephole optimization is an optimization technique performed on a small set of compiler-generated instructions, known as a peephole or window, that involves replacing the instructions with a logically equivalent set that has better performance. https://en.wikipedia.org/wiki/Peephole_optimization

This seems to fit my goal pretty well. I want to replace the current opt_send_without_block instruction with a specialized opt_respond_to instruction, optimized for respond_to? method calls.

Finding the optimizer

So where are peephole optimizations happening in CRuby today? In Étienne’s PR, he added optimization code to a function called… iseq_peephole_optimize. A little on the nose, don’t you think? Kevin’s comment also mentioned iseq_peephole_optimize - seems like the winner.

I want to make the link between iseq_peephole_optimize and where we left off at pm_compile_node. Let’s dig into some code!

Disassembling an existing optimization

I’m going to use Étienne’s frozen array optimization to get to the optimizer and see how it relates. If you want to follow along, start with the setup instructions from part 3.

His optimization only applies to array and hash literals being frozen. So we’ll write a teensy Ruby program to demonstrate, and put it in test.rb at the root of our CRuby project:

# test.rb
pp [].freeze

The best way to run test.rb here is to use make. It will not only run the file, but also make sure things like C files get recompiled as necessary when you make changes. Let’s run our file, but dump the instructions it would generate for the Ruby VM:

RUNOPT0=--dump=insns make runruby

RUNOPT0 lets us add an option to the ruby call, so it’s effectively ruby --dump=insns test.rb. Here are the instructions we see - we can confirm that we are getting the optimized opt_ary_freeze instruction from Étienne’s PR:

== disasm: #<ISeq:<main>./test.rb:3 (3,0)-(3,12)>
0000 putself                      (   3)[Li]
0001 opt_ary_freeze               [], <calldata!mid:freeze, argc:0, ARGS_SIMPLE>
0004 opt_send_without_block       <calldata!mid:pp, argc:1, FCALL|ARGS_SIMPLE>
0006 leave
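If you don't have a CRuby checkout handy, you can get a similar dump from any installed CRuby using RubyVM::InstructionSequence. The output depends on your Ruby version, so treat this as a way to explore, not a guaranteed match for the listing above.

```ruby
# CRuby-only: RubyVM::InstructionSequence is not available on other implementations.
disasm = RubyVM::InstructionSequence.compile("pp [].freeze").disasm
puts disasm
# On a Ruby that includes the frozen array optimization, expect an
# opt_ary_freeze line; older versions show newarray followed by a send.
```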

You never know what code is truly doing until you run it. So far, I’ve just been reading and navigating the CRuby source. iseq_peephole_optimize lives in compile.c - let’s set a breakpoint and take a look 🕵🏼‍♂️.

Using the debugger

We can debug C code in CRuby almost as easily as we can use a debugger/binding.pry.

For MacOS, you can use lldb, and for Docker/Linux, you can use gdb. I’m going to do everything in lldb to start, but I’ll show some equivalent commands for gdb after.

Let’s start by looking at the peephole optimization code for [].freeze, inside of iseq_peephole_optimize. I’ll add comments above each line to explain what I think it’s doing:

// compile.c
static int
iseq_peephole_optimize(rb_iseq_t *iseq, LINK_ELEMENT *list, const int do_tailcallopt)
{
         // ...
         // if the instruction is a `newarray` of zero length
3469:    if (IS_INSN_ID(iobj, newarray) && iobj->operands[0] == INT2FIX(0)) {
             // grab the next element after the current instruction
3470:        LINK_ELEMENT *next = iobj->link.next;
             // if `next` is an instruction, and the instruction is `send`
3471:        if (IS_INSN(next) && (IS_INSN_ID(next, send))) {
3472:            const struct rb_callinfo *ci = (struct rb_callinfo *)OPERAND_AT(next, 0);
3473:            const rb_iseq_t *blockiseq = (rb_iseq_t *)OPERAND_AT(next, 1);
3474:
                 // if the callinfo is "simple", with zero arguments,
                 // and there isn't a block provided(?), and the method id (mid) is `freeze`
                 // which is represented by `idFreeze`
3475:            if (vm_ci_simple(ci) && vm_ci_argc(ci) == 0 && blockiseq == NULL && vm_ci_mid(ci) == idFreeze) {
                     // change the instruction to `opt_ary_freeze`
3476:                iobj->insn_id = BIN(opt_ary_freeze);
                     // remove the `send` instruction, we don't need it anymore
3481:                ELEM_REMOVE(next);
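To check my understanding of the pattern, here’s a toy model of the same idea in Ruby. Everything in it is an assumption for illustration: Insn, CallInfo and toy_peephole are names I invented, and a plain array stands in for CRuby’s linked list of instructions. It only mirrors the shape of the C logic: look at an instruction, peek at the next one, and swap the pair for something cheaper.

```ruby
# Toy stand-ins for CRuby's instruction list; these are NOT the real structs.
Insn = Struct.new(:name, :operands)
CallInfo = Struct.new(:mid, :argc, :simple)

# Walk the instructions and replace `newarray 0` + `send :freeze`
# with a single `opt_ary_freeze`, like the C code above does.
def toy_peephole(insns)
  out = []
  i = 0
  while i < insns.length
    iobj = insns[i]
    nxt = insns[i + 1]
    # a zero-length `newarray` followed by a `send`
    if iobj.name == :newarray && iobj.operands[0] == 0 && nxt && nxt.name == :send
      ci, blockiseq = nxt.operands
      # simple call, no arguments, no block, and the method is `freeze`
      if ci.simple && ci.argc == 0 && blockiseq.nil? && ci.mid == :freeze
        out << Insn.new(:opt_ary_freeze, [[].freeze, ci])
        i += 2 # skip the `send`, we don't need it anymore
        next
      end
    end
    out << iobj
    i += 1
  end
  out
end

program = [
  Insn.new(:putself, []),
  Insn.new(:newarray, [0]),
  Insn.new(:send, [CallInfo.new(:freeze, 0, true), nil]),
  Insn.new(:send, [CallInfo.new(:pp, 1, true), nil]),
  Insn.new(:leave, []),
]
p toy_peephole(program).map(&:name)
```

The real optimizer mutates the instruction in place and unlinks the next element instead of building a new list, but the matching conditions are the same ones.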

Now I’ll use lldb to see where this code runs in relation to our prism compilation. To debug in CRuby, you run make lldb-ruby instead of make runruby. You’ll see some setup code run, and then you’ll be left at a prompt, prefixed by (lldb):

> make lldb-ruby
lldb  -o 'command script import -r ../misc/lldb_cruby.py' ruby --  ../test.rb
(lldb) target create "ruby"
Current executable set to '/Users/johncamara/Projects/ruby/build/ruby' (arm64).
(lldb) settings set -- target.run-args  "../test.rb"
(lldb) command script import -r ../misc/lldb_cruby.py
lldb scripts for ruby has been installed.
(lldb)

At this point, we haven’t actually run anything. We can now set our breakpoint, then run the program. I’ll add a breakpoint right after all if statements have succeeded:

(lldb) break set --file compile.c --line 3476
Breakpoint 1: where = ruby`iseq_peephole_optimize + 2276 at compile.c:3476:17

With our breakpoint set, we call run to run the program:

(lldb) run

You’ll see something like the following. It ran the program until it hit our breakpoint, right after identifying a frozen array literal:

(lldb) run
Process 50923 launched: '/ruby/build/ruby' (arm64)
Process 50923 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: ruby`iseq_peephole_optimize(...) at compile.c:3476:17
   3473             const rb_iseq_t *blockiseq = (rb_iseq_t *)OPERAND_AT(next, 1);
   3474
   3475             if (vm_ci_simple(ci) && vm_ci_argc(ci) == 0 && blockiseq == NULL && vm_ci_mid(ci) == idFreeze) {
-> 3476                 iobj->insn_id = BIN(opt_ary_freeze);
   3477                 iobj->operand_size = 2;
   3478                 iobj->operands = compile_data_calloc2(iseq, iobj->operand_size, sizeof(VALUE));
   3479                 iobj->operands[0] = rb_cArray_empty_frozen;

I want to see where we are in relation to all our prism compilation code. We can use bt to get the backtrace:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: ruby`iseq_peephole_optimize(...) at compile.c:3476:29
    frame #1: ruby`iseq_optimize(...) at compile.c:4352:17
    frame #2: ruby`iseq_setup_insn(...) at compile.c:1619:5
    frame #3: ruby`pm_iseq_compile_node(...) at prism_compile.c:10139:5
    frame #4: ruby`pm_iseq_new_with_opt_try(...) at iseq.c:1029:5
    frame #5: ruby`rb_protect(...) at eval.c:1033:18
    frame #6: ruby`pm_iseq_new_with_opt(...) at iseq.c:1082:5
    frame #7: ruby`pm_new_child_iseq(...) at prism_compile.c:1271:27
    frame #8: ruby`pm_compile_node(...) at prism_compile.c:9458:40
    frame #9: ruby`pm_compile_node(...) at prism_compile.c:9911:17
    frame #10: ruby`pm_compile_scope_node(...) at prism_compile.c:6598:13
    frame #11: ruby`pm_compile_node(...) at prism_compile.c:9784:9
    frame #12: ruby`pm_iseq_compile_node(...) at prism_compile.c:10122:9
    frame #13: ruby`pm_iseq_new_with_opt_try(...) at iseq.c:1029:5
    frame #14: ruby`rb_protect(...) at eval.c:1033:18
    frame #15: ruby`pm_iseq_new_with_opt(...) at iseq.c:1082:5
    frame #16: ruby`pm_iseq_new_top(...) at iseq.c:906:12
    frame #17: ruby`load_iseq_eval(...) at load.c:756:24
    frame #18: ruby`require_internal(...) at load.c:1296:21
    frame #19: ruby`rb_require_string_internal(...) at load.c:1402:22
    frame #20: ruby`rb_require_string(...) at load.c:1388:12
    frame #21: ruby`rb_f_require(...) at load.c:1029:12
    frame #22: ruby`ractor_safe_call_cfunc_1(...) at vm_insnhelper.c:3624:12
    frame #23: ruby`vm_call_cfunc_with_frame_(...) at vm_insnhelper.c:3801:11
    frame #24: ruby`vm_call_cfunc_with_frame(...) at vm_insnhelper.c:3847:12
    frame #25: ruby`vm_call_cfunc_other(...) at vm_insnhelper.c:3873:16
    frame #26: ruby`vm_call_cfunc(...) at vm_insnhelper.c:3955:12
    frame #27: ruby`vm_call_method_each_type(...) at vm_insnhelper.c:4779:16
    frame #28: ruby`vm_call_method(...) at vm_insnhelper.c:4916:20
    frame #29: ruby`vm_call_general(...) at vm_insnhelper.c:4949:12
    frame #30: ruby`vm_sendish(...) at vm_insnhelper.c:5968:15
    frame #31: ruby`vm_exec_core(...) at insns.def:898:11
    frame #32: ruby`rb_vm_exec(...) at vm.c:2595:22
    frame #33: ruby`rb_iseq_eval(...) at vm.c:2850:11
    frame #34: ruby`rb_load_with_builtin_functions(...) at builtin.c:54:5
    frame #35: ruby`Init_builtin_features at builtin.c:74:5
    frame #36: ruby`ruby_init_prelude at ruby.c:1750:5
    frame #37: ruby`ruby_opt_init(...) at ruby.c:1811:5
    frame #38: ruby`prism_script(...) at ruby.c:2215:13
    frame #39: ruby`process_options(...) at ruby.c:2538:9
    frame #40: ruby`ruby_process_options(...) at ruby.c:3169:12
    frame #41: ruby`ruby_options(...) at eval.c:117:16
    frame #42: ruby`rb_main(...) at main.c:43:26
    frame #43: ruby`main(...) at main.c:68:12

Whoa. That thing is huge! This is not the backtrace I was expecting! Seems like I missed a codepath in my earlier explorations. I got it right, up until prism_script:

main
which calls rb_main
which calls ruby_options, then ruby_process_options, then process_options
which calls prism_script
The next instruction I expected was pm_iseq_new_main, but instead we head into ruby_opt_init
which calls Init_builtin_features

This path seems to go through some gem preloading logic, which is why we see the rb_require calls:

void
Init_builtin_features(void)
{
    rb_load_with_builtin_functions("gem_prelude", NULL);
}

By default CRuby loads gem_prelude, which lives in ruby/gem_prelude.rb. Here’s that file, shortened for brevity:

require 'rubygems'
require 'error_highlight'
require 'did_you_mean'
require 'syntax_suggest/core_ext'

Compiling on-the-fly

There’s something I’ve learned here that seems obvious in hindsight, but I hadn’t considered. Ruby will only compile what is actually loaded, and only at the point it gets loaded. If I never load a particular piece of code, it never gets compiled. Or if I defer loading it until later, it does not get compiled until later.

We can actually demonstrate this by deferring a require:

sleep 10

require "net/http"

If we run this using make lldb-ruby, we can see the delayed compilation in action:

(lldb) break set --file ruby.c --line 2616
(lldb) run
// hits our prism compile code
(lldb) next
(lldb) break set --file compile.c --line 3476
(lldb) continue
// waits 10 seconds, then compiles the contents of "net/http"
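You can also see the same behavior without a debugger. Here’s a small sketch, assuming that a syntax error is a fair stand-in for “this file got parsed and compiled”: a broken file sits on disk without complaint, and Ruby only notices once we load it. The deferred_compile_events name and the temp file are mine, just for the demo.

```ruby
require "tempfile"

# Write a file with a deliberate syntax error, then load it later.
# Returns the events in the order they happened.
# `deferred_compile_events` is a hypothetical helper name.
def deferred_compile_events
  events = []
  broken = Tempfile.new(["broken", ".rb"])
  broken.write("def oops(\n")
  broken.close
  # The file exists, but nothing has tried to parse or compile it yet
  events << :file_written
  begin
    load broken.path
  rescue SyntaxError
    # Ruby only looks at the file when it is loaded
    events << :syntax_error_on_load
  end
  events
ensure
  broken&.unlink
end

p deferred_compile_events
```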

Getting to our test.rb file

I’d rather see just my code in test.rb get compiled, so I’m going to set a breakpoint directly on pm_iseq_new_main, which for me is in ruby.c on line 2616:

(lldb) break set --file ruby.c --line 2616
(lldb) run
Process 32534 launched: '/ruby/build/ruby' (arm64)
Process 32534 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: ruby`process_options(...) at ruby.c:2616:38
   2613         if (!result.ast) {
   2614             pm_parse_result_t *pm = &result.prism;
   2615             int error_state;
-> 2616             iseq = pm_iseq_new_main(&pm->node, opt->script_name, path, parent, optimize, &error_state);
   2617
   2618             pm_parse_result_free(pm);
   2619

Now when we run the backtrace I am seeing what I expected, because we’ve skipped the gem_prelude compilation. This is the exact flow I walked through in part 2:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: ruby`process_options(...) at ruby.c:2616:38
    frame #1: ruby`ruby_process_options(...) at ruby.c:3169:12
    frame #2: ruby`ruby_options(...) at eval.c:117:16
    frame #3: ruby`rb_main(...) at main.c:43:26
    frame #4: ruby`main(...) at main.c:68:12

From here, we can set our iseq_peephole_optimize breakpoint and see only our specific code get compiled. Since we’re already in the running program, we call continue to keep executing:

(lldb) break set --file compile.c --line 3476
Breakpoint 2: where = ruby`iseq_peephole_optimize + 2276 at compile.c:3476:17
(lldb) continue
Process 55336 resuming
Process 55336 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: ruby`iseq_peephole_optimize() at compile.c:3476:17
   3473             const rb_iseq_t *blockiseq = (rb_iseq_t *)OPERAND_AT(next, 1);
   3474
   3475             if (vm_ci_simple(ci) && vm_ci_argc(ci) == 0 && blockiseq == NULL && vm_ci_mid(ci) == idFreeze) {
-> 3476                 iobj->insn_id = BIN(opt_ary_freeze);
   3477                 iobj->operand_size = 2;
   3478                 iobj->operands = compile_data_calloc2(iseq, iobj->operand_size, sizeof(VALUE));
   3479                 iobj->operands[0] = rb_cArray_empty_frozen;

If we call bt from here to get the backtrace, we finally see the connection between prism_compile.c and compile.c. pm_iseq_compile_node calls iseq_setup_insn, which runs the optimization logic. In the previous post, I saw iseq_setup_insn, but I didn’t know what it meant or what it did. Now we know. This is what Kevin Newton referred to earlier: specialization comes after compilation. Prism compiles the node in the standard way, then the peephole optimization layer - the specialization - is applied after:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
  * frame #0: ruby`iseq_peephole_optimize(...) at compile.c:3476:17
    frame #1: ruby`iseq_optimize(...) at compile.c:4352:17
    frame #2: ruby`iseq_setup_insn(...) at compile.c:1619:5
    frame #3: ruby`pm_iseq_compile_node(...) at prism_compile.c:10139:5
    frame #4: ruby`pm_iseq_new_with_opt_try(...) at iseq.c:1029:5
    frame #5: ruby`rb_protect(...) at eval.c:1033:18
    frame #6: ruby`pm_iseq_new_with_opt(...) at iseq.c:1082:5
    frame #7: ruby`pm_iseq_new_main(...) at iseq.c:930:12
    frame #8: ruby`process_options(...) at ruby.c:2616:20
    frame #9: ruby`ruby_process_options(...) at ruby.c:3169:12
    frame #10: ruby`ruby_options(...) at eval.c:117:16
    frame #11: ruby`rb_main(...) at main.c:43:26
    frame #12: ruby`main(...) at main.c:68:12

From here, we can inspect and see the current instruction using expr:

(lldb) expr *(iobj)
(INSN) $4 = {
  link = {
    type = ISEQ_ELEMENT_INSN
    next = 0x000000011f6568d0
    prev = 0x000000011f656850
  }
  insn_id = YARVINSN_newarray
  operand_size = 1
  sc_state = 0
  operands = 0x000000011f640118
  insn_info = (line_no = 1, node_id = 3, events = 0)
}

We see that iobj contains a link to a subsequent instruction, as well as an insn_id and some other metadata. The instruction is currently YARVINSN_newarray. If we run next, that should run iobj->insn_id = BIN(opt_ary_freeze);, and our instruction should change:

(lldb) next
(lldb) expr *(iobj)
(INSN) $5 = {
  //...
  insn_id = YARVINSN_opt_ary_freeze
  //...
}

It does! The instruction was changed from newarray to opt_ary_freeze! The optimization is at least partially complete (I’m not sure if more is involved, yet).

Making one small step towards `opt_respond_to`

This is already the longest and densest post in the series. But I’d love to make some actual progress towards a new instruction. Let’s pattern match on respond_to? in the peephole optimizer.

Here is our sample program:

puts "Did you know you can write to $stdout?" if $stdout.respond_to?(:write)

Run with RUNOPT0=--dump=insns make runruby, we get the following instructions:

== disasm: #<ISeq:<main>./test.rb:1 (1,0)-(1,76)>
0000 getglobal                              :$stdout                  (   1)[Li]
0002 putobject                              :write
0004 opt_send_without_block                 <calldata!mid:respond_to?, argc:1, ARGS_SIMPLE>
0006 branchunless                           14
0008 putself
0009 putchilledstring                       "Did you know you can write to $stdout?"
0011 opt_send_without_block                 <calldata!mid:puts, argc:1, FCALL|ARGS_SIMPLE>
0013 leave
0014 putnil
0015 leave

I want to match on this line:

0004 opt_send_without_block       <calldata!mid:respond_to?, argc:1, ARGS_SIMPLE>

Here’s my attempt. I’m going to copy what the newarray freeze optimization is doing, and just try changing a few things to match my example. Right underneath the code we’ve been debugging for newarray, I’m adding this:

// If the instruction is `send_without_block`, ie `0004 opt_send_without_block`
if (IS_INSN_ID(iobj, send_without_block)) {
    // Pull the same info the `newarray` optimization does
    const struct rb_callinfo *ci = (struct rb_callinfo *)OPERAND_AT(iobj, 0);
    const rb_iseq_t *blockiseq = (rb_iseq_t *)OPERAND_AT(iobj, 1);

    // <calldata!mid:respond_to?, argc:1, ARGS_SIMPLE>
    // 1. We have ARGS_SIMPLE, which is probably what `vm_ci_simple(ci)` checks for
    // 2. We have argc:1, which should match `vm_ci_argc(ci) == 1`
    // 3. We send without a block, hence blockiseq == NULL
    // 4. The method id (mid) from `vm_ci_mid(ci)` matches `idRespond_to`. I searched for names
    //    similar to `idFreeze`, swapping `Freeze` for `Respond`, and found `idRespond_to`
    if (vm_ci_simple(ci) && vm_ci_argc(ci) == 1 && blockiseq == NULL && vm_ci_mid(ci) == idRespond_to) {
        int i = 0;
    }
}

Now I’ll follow the same debugging steps as before, but I’ll add a breakpoint in compile.c where I added my new code. Specifically, I’m setting a breakpoint at the int i = 0; so I am inside the if statement:

(lldb) break set --file ruby.c --line 2616
Breakpoint 1: where = ruby`process_options + 4068 at ruby.c:2616:38
(lldb) run
(lldb) break set --file compile.c --line 3491
Breakpoint 2: where = ruby`iseq_peephole_optimize + 2536 at compile.c:3491:17
(lldb) continue
Process 61925 resuming
Process 61925 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: ruby`iseq_peephole_optimize(...) at compile.c:3491:17
   3488         const rb_iseq_t *blockiseq = (rb_iseq_t *)OPERAND_AT(iobj, 1);
   3489
   3490         if (vm_ci_simple(ci) && vm_ci_argc(ci) == 1 && blockiseq == NULL && vm_ci_mid(ci) == idRespond_to) {
-> 3491             int i = 0;
   3492         }
   3493     }
   3494

I think it worked! It pattern matched on the characteristics of the respond_to? call, and hit the breakpoint set on int i = 0;. It’s a tiny step, but it’s a first step in the direction of adding the optimization.
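If you want to see the instruction we’re matching on without rebuilding CRuby, this sketch pulls it out of the disassembly on a stock Ruby. The helper name insn_lines_mentioning is made up, and the instruction name at the start of the line will differ depending on your Ruby version.

```ruby
RESPOND_TO_SOURCE =
  'puts "Did you know you can write to $stdout?" if $stdout.respond_to?(:write)'

# Return only the disassembly lines that mention `needle`.
# `insn_lines_mentioning` is a hypothetical helper name; RubyVM is CRuby-only.
def insn_lines_mentioning(source, needle)
  RubyVM::InstructionSequence.compile(source).disasm.lines.select do |line|
    line.include?(needle)
  end
end

# Prints the line whose calldata has mid:respond_to?
puts insn_lines_mentioning(RESPOND_TO_SOURCE, "respond_to?")
```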

Using `gdb`

For anyone wanting to do the same work using gdb, it’s pretty similar. Let’s start off by creating a breakpoints.gdb file in the root of your project. This sets your initial breakpoint before the program runs, similar to how we set a breakpoint in lldb before calling run:

break ruby.c:2616

When you run make gdb-ruby, you can use the same backtrace command, bt:

> make gdb-ruby
Thread 1 "ruby" hit Breakpoint 4, process_options (...) at ../ruby.c:2616
2616	            iseq = pm_iseq_new_main(&pm->node, opt->script_name, path, parent, optimize, &error_state);
(gdb) bt
#0  process_options (...) at ../ruby.c:2616
#1  in ruby_process_options (...) at ../ruby.c:3169
#2  in ruby_options (...) at ../eval.c:117
#3  in rb_main (...) at ../main.c:43
#4  in main (...) at ../main.c:68
(gdb)

From here, you can set your next breakpoint so that you can see the compilation solely for the newarray instruction from our test.rb program:

(gdb) break compile.c:3476
Breakpoint 5 at 0xaaaabaa22f14: file ../compile.c, line 3476
(gdb) continue
Continuing.

Thread 1 "ruby" hit Breakpoint 5, iseq_peephole_optimize (...) at ../compile.c:3476
3476	                iobj->insn_id = BIN(opt_ary_freeze);

Similar to the lldb command expr, we can inspect the contents of locals using p or print in gdb:

(gdb) p *(iobj)
$2 = {link = {type = ISEQ_ELEMENT_INSN, next = 0xaaaace797ef0, prev = 0xaaaace797e70}, insn_id = YARVINSN_newarray,
  operand_size = 1, sc_state = 0, operands = 0xaaaace796ac8, insn_info = {line_no = 1, node_id = 3, events = 0}}

Finishing up

Ok, this went pretty long. Good on you for sticking in there with me! We’ve found the optimizer, and we’ve pattern matched our way to a respond_to? call. Next, we need to add the new instruction definition and try to actually replace the send with our new instruction. See you next time! 👋🏼


The Ruby Syntax Holy Grail: adding `opt_respond_to` to the Ruby VM, part 3

Dec 25, 2024 JP Camara

In Finding the compiler: adding opt_respond_to to the Ruby VM, part 2, I found the entrypoint into the compiler! It takes the root of our abstract syntax tree - pm->node - and produces a rb_iseq_t. rb_iseq_t is an “InstructionSequence”, which represents our virtual machine bytecode. Here’s the code where we left off:

// ruby.c
static VALUE
process_options(int argc, char **argv, ruby_cmdline_options_t *opt)
{
    //...
    pm_parse_result_t *pm = &result.prism;
    int error_state;
    iseq = pm_iseq_new_main(&pm->node, opt->script_name, path, parent, optimize, &error_state);

Nothing about this code screams “I’m the compiler!”. But I am taking an educated guess, since:

The compiler should produce instruction sequences, or iseqs
We know this is the function that returns our program’s iseq when main.c is called
This is the only line that produces an iseq in this function

With all that lining up, I’m confident this is the place we need to investigate further. Now let’s find what needs to change to add our new bytecode. Stepping into pm_iseq_new_main, there are a few layers I need to wade through to get to something that seems promising.

Getting your own environment setup

Before I dig in further, let’s take a quick step back. In case you want to join in at home, here are some simple(ish) steps for doing that.

Check out my guides on how to set up your development environment to hack on CRuby. I have a docker guide and a MacOS guide. The one thing I didn’t add to them was cloning the repo. So you’ll need to git clone the Ruby repository.
Building CRuby can take a few minutes the first time you run it. After you’ve built everything, you can easily test your local setup using a file named test.rb. Create a test.rb file in the root of your CRuby folder.
You can run make runruby, and it will run whatever is inside your test.rb file. You can even use debug tools to debug the C code you’re running and inspecting - we’ll talk more about those later.

Back to the investigation

First we’ve got the function pm_iseq_new_main, which seems to set us up as the <main> rb_iseq_t.

// iseq.c
rb_iseq_t *
pm_iseq_new_main(pm_scope_node_t *node, VALUE path, VALUE realpath, const rb_iseq_t *parent, int opt, int *error_state)
{
    iseq_new_setup_coverage(path, (int) (node->parser->newline_list.size - 1));

    return pm_iseq_new_with_opt(node, rb_fstring_lit("<main>"),
                                path, realpath, 0,
                                parent, 0, ISEQ_TYPE_MAIN, opt ? &COMPILE_OPTION_DEFAULT : &COMPILE_OPTION_FALSE, error_state);
}

This looked immediately familiar to me. It sticks out because I’ve seen that <main> before. Let’s run a simple Ruby program:

begin
  raise
rescue => e
  puts e.backtrace
end

All our program does is raise an error, rescue the error, then puts the backtrace. What does that backtrace look like?

../test.rb:2:in '<main>'

Oh yea! We are executing our code at the top-level of the program. And that top level is referred to as <main>. I think that’s being named by our pm_iseq_new_with_opt(node, rb_fstring_lit("<main>")... call - neat!
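You can poke at that name from Ruby too. This sketch compares the label of an iseq compiled from a string with one compiled from a file, using RubyVM::InstructionSequence#label. The iseq_labels helper and the temp file are just scaffolding I added for the demo.

```ruby
require "tempfile"

# Return the labels of an iseq compiled from a string and one compiled
# from a file. `iseq_labels` is a hypothetical helper name.
def iseq_labels
  from_string = RubyVM::InstructionSequence.compile("1 + 2").label

  file = Tempfile.new(["label_demo", ".rb"])
  file.write("1 + 2\n")
  file.close
  from_file = RubyVM::InstructionSequence.compile_file(file.path).label

  [from_string, from_file]
ensure
  file&.unlink
end

p iseq_labels
# => ["<compiled>", "<main>"]
```

Code compiled from a file gets the <main> label, which lines up with the rb_fstring_lit("<main>") argument we just saw.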

iseq_new_setup_coverage just sets up some optional coverage information, so let’s move to pm_iseq_new_with_opt:

rb_iseq_t *
pm_iseq_new_with_opt(pm_scope_node_t *node, VALUE name, VALUE path, VALUE realpath,
                     int first_lineno, const rb_iseq_t *parent, int isolated_depth,
                     enum rb_iseq_type type, const rb_compile_option_t *option, int *error_state)
{
    rb_iseq_t *iseq = iseq_alloc();
    ISEQ_BODY(iseq)->prism = true;
    //...
    struct pm_iseq_new_with_opt_data data = {
        .iseq = iseq,
        .node = node
    };
    rb_protect(pm_iseq_new_with_opt_try, (VALUE)&data, error_state);

    if (*error_state) return NULL;

    return iseq_translate(iseq);
}

This code allocates (iseq_alloc) an rb_iseq_t struct and sets it as being part of prism. I believe rb_protect is to allow handling of errors that might be raised while running a particular function? Looking at the git blame I see Peter Zhu added it to catch errors, so confirmed ✅. Not a lot is happening here otherwise, so let’s jump into pm_iseq_new_with_opt_try:

VALUE
pm_iseq_new_with_opt_try(VALUE d)
{
    struct pm_iseq_new_with_opt_data *data = (struct pm_iseq_new_with_opt_data *)d;

    // This can compile child iseqs, which can raise syntax errors
    pm_iseq_compile_node(data->iseq, data->node);

    // This raises an exception if there is a syntax error
    finish_iseq_build(data->iseq);

    return Qundef;
}

This is the most promising piece of code so far. It’s the first thing that kindly tells me in explicit terms: “I am going to compile something”. Presumably pm_iseq_compile_node compiles data->node into the data->iseq. It’s in a new file called prism_compile.c. Let’s check it out!

// prism_compile.c
VALUE
pm_iseq_compile_node(rb_iseq_t *iseq, pm_scope_node_t *node)
{
    //...
    if (pm_iseq_pre_execution_p(iseq)) {
        //...
        pm_compile_node(iseq, (const pm_node_t *) node, body, false, node);
        //...
    }
    else {
        //...
        pm_compile_node(iseq, (const pm_node_t *) node, ret, false, node);
    }

    CHECK(iseq_setup_insn(iseq, ret));
    return iseq_setup(iseq, ret);
}

😮‍💨. There are many layers to this compilation. Primarily, this function seems to do two things: “compile” the node, then “setup” the iseq. I don’t know why the iseq “setup” is required yet. Let’s start with pm_compile_node and I’ll come back to the rest:

static void
pm_compile_node(rb_iseq_t *iseq, const pm_node_t *node, LINK_ANCHOR *const ret, bool popped, pm_scope_node_t *scope_node)
{
    const pm_parser_t *parser = scope_node->parser;
    //...
    switch (PM_NODE_TYPE(node)) {
      case PM_ALIAS_GLOBAL_VARIABLE_NODE:
        // alias $foo $bar
        // ^^^^^^^^^^^^^^^
        pm_compile_alias_global_variable_node(iseq, (const pm_alias_global_variable_node_t *) node, &location, ret, popped, scope_node);
        return;
      //...
      case PM_ARRAY_NODE: {
        // [foo, bar, baz]
        // ^^^^^^^^^^^^^^^
        const pm_array_node_t *cast = (const pm_array_node_t *) node;
        pm_compile_array_node(iseq, (const pm_node_t *) cast, &cast->elements, &location, ret, popped, scope_node);
        return;
      }
      //...
      case PM_FLIP_FLOP_NODE: {
        // if foo .. bar; end
        //    ^^^^^^^^^^
        const pm_flip_flop_node_t *cast = (const pm_flip_flop_node_t *) node;
        //...
        pm_compile_flip_flop(cast, else_label, then_label, iseq, location.line, ret, popped, scope_node);
        //...
      }
      //...
      case PM_IT_LOCAL_VARIABLE_READ_NODE: {
        // -> { it }
        //      ^^
        if (!popped) {
            PUSH_GETLOCAL(ret, location, scope_node->local_table_for_iseq_size, 0);
        }

        return;
      }
      //...
      case PM_MODULE_NODE: {
        // module Foo; end
        //...
      }
      //...
}

pm_compile_node puts the “fun” in “function”. It’s really cool! This 1800+ line monster seems to cover a huge swath of Ruby syntax. Maybe all of it? The prism_compile.c file is 11 thousand lines long, as each case of this switch statement branches off into more granular node compilations, like pm_compile_array_node and pm_compile_flip_flop.
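To get a feel for the shape of that switch statement, here’s a tiny toy version in Ruby. It is nothing like the real thing: the node format (nested arrays), the three node types, and the toy_compile_node name are all invented for illustration. What it keeps is the structure: one big case on the node type, recursion into child nodes, instructions appended to ret, and a popped flag for when the result isn’t needed.

```ruby
# A toy "compiler": nodes are arrays like [:int, 1] or
# [:call, receiver, :method_name, [args]]. This is an invented format,
# not prism's AST. Emitted instructions are appended to `ret`.
def toy_compile_node(node, ret, popped)
  type, *rest = node
  case type
  when :int
    # A literal whose value is unused doesn't need an instruction
    ret << [:putobject, rest[0]] unless popped
  when :array
    elements = rest[0]
    elements.each { |element| toy_compile_node(element, ret, popped) }
    ret << [:newarray, elements.length] unless popped
  when :call
    receiver, mid, args = rest
    # The receiver and arguments are always needed by the call itself
    toy_compile_node(receiver, ret, false)
    args.each { |arg| toy_compile_node(arg, ret, false) }
    ret << [:send, mid, args.length]
    ret << [:pop] if popped
  else
    raise ArgumentError, "unknown node type: #{type.inspect}"
  end
  ret
end

# [].freeze
p toy_compile_node([:call, [:array, []], :freeze, []], [], false)
# => [[:newarray, 0], [:send, :freeze, 0]]
```

Notice the toy output for [].freeze is a newarray followed by a send, with nothing special about freeze at this stage.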

With that in mind, it is also an utterly daunting file to consider for the opt_respond_to instruction. Do I edit this file? Where would I even start? I need to swap out a method call to respond_to? - there is code that seems to handle method calls:

case PM_CALL_NODE:
    // foo
    // ^^^
    //
    // foo.bar
    // ^^^^^^^
    //
    // foo.bar() {}
    // ^^^^^^^^^^^^
    pm_compile_call_node(iseq, (const pm_call_node_t *) node, ret, popped, scope_node);
    return;

Maybe that’s it?

I think I need to use a cheat code here to give me some direction. In the previous post, I mentioned Étienne Barrié’s PR to add optimized instructions for frozen literal Hash and Array. I’ve been mostly ignoring it so far, but I think it’s time I use that for a bit of direction on where to go from here.

I think we’re close! So far, I’ve navigated the code manually. In the next post, we’re going to actually run and debug some code, and dig a bit into Étienne’s work. See you then! 👋🏼


Finding the compiler: adding `opt_respond_to` to the Ruby VM, part 2

Dec 23, 2024 JP Camara

In Adding opt_respond_to to the Ruby VM: part 1, inspired by recent JSON gem optimizations, I set my goal: I want to add a new bytecode instruction to the Ruby VM which optimizes respond_to? calls. I took this Ruby code:

if $stdout.respond_to?(:write)
  puts "Did you know you can write to $stdout?"
end

And identified what bytecode instructions matter most (I think):

== disasm: #<ISeq:<compiled>@<compiled>:1 (1,0)-(3,3)>
# ...
0002 putobject                 :write
0004 opt_send_without_block    <calldata!mid:respond_to?, argc:1, ARGS_SIMPLE>

This seems pretty low-level! But it’s still very high-level in terms of what I need to actually do. I know what instructions matter, but how can I change them?

A little help from Git

Thankfully, @byroot gave me some helpful direction in the form of a recent PR. Étienne Barrié recently merged a PR to add optimized instructions for frozen literal Hash and Array. It adds special handling for frozen, empty arrays and hashes so that when you use them in your code, calls like [].freeze will not result in any additional object allocations. Cool! A neat enhancement, and kind of perfect for me to analyze.
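If you’re curious whether your own Ruby has this optimization, one rough way to check is to count allocations around [].freeze. This is only a sketch: allocations_during is a name I made up, GC.stat(:total_allocated_objects) is a coarse counter, and the number you get depends on your Ruby version, so I’m not going to promise a specific output.

```ruby
# Count how many objects were allocated while the block ran.
# `allocations_during` is a hypothetical helper name.
def allocations_during
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

count = allocations_during { 1_000.times { [].freeze } }
puts "objects allocated by 1_000 x [].freeze: #{count}"
```

On a Ruby with the frozen literal optimization I’d expect this to be close to zero, and roughly one allocation per call without it.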

I think the use-case is simpler than what I’d need for opt_respond_to, but looking at that PR I can see an important part of adding a new bytecode instruction is in compile.c. I’ll eventually need to add some new logic there, but what steps does CRuby take to get to compile.c?

Starting from main

Knowing a bit about the CRuby source, and how C programs start, I know there is a main function that kicks everything off. In CRuby, it’s helpfully in main.c, at the root of the project:

int
main(int argc, char **argv)
{
    //...
    return rb_main(argc, argv);
}

static int
rb_main(int argc, char **argv)
{
    RUBY_INIT_STACK;
    ruby_init();
    return ruby_run_node(ruby_options(argc, argv));
}

I’m going to guess at not needing RUBY_INIT_STACK or ruby_init for adding a new instruction. I took a peek at it and it all seems related to setting up the runtime, and creating data structures needed for the Ruby virtual machine. Past that, there are only two function calls: ruby_options and ruby_run_node. ruby_options sounds like it would just get the options needed for the program. Maybe we need to go into ruby_run_node?

int
ruby_run_node(void *n)
{
    rb_execution_context_t *ec = GET_EC();
    int status;
    if (!ruby_executable_node(n, &status)) {
        rb_ec_cleanup(ec, (NIL_P(ec->errinfo) ? TAG_NONE : TAG_RAISE));
        return status;
    }
    return rb_ec_cleanup(ec, rb_ec_exec_node(ec, n));
}

Maybe? It looks like if the “node” n isn’t executable, it fails. I’ll concentrate instead on the success path on the last line. rb_ec_exec_node will run first, followed by rb_ec_cleanup.

📝 All this ec_* stuff seems to stand for execution_context, which presumably is the state of the runtime at any given point?

📝 I’m not Mr. C programmer - so I didn’t know what void *n meant. Looking it up, this seems to be a way of specifying a generic pointer type that can point to any data type.

Let’s start by checking rb_ec_exec_node:

static int
rb_ec_exec_node(rb_execution_context_t *ec, void *n)
{
    volatile int state;
    rb_iseq_t *iseq = (rb_iseq_t *)n;
    if (!n) return 0;

    EC_PUSH_TAG(ec);
    if ((state = EC_EXEC_TAG()) == TAG_NONE) {
        rb_iseq_eval_main(iseq);
    }
    EC_POP_TAG();
    return state;
}

Hmmmm. This is the first place where I don’t like what I’m seeing. My primary concern is that void *n is getting cast to rb_iseq_t. The class we used to compile our Ruby sample in part 1 - RubyVM::InstructionSequence - is defined in a C file called iseq.c. So in CRuby, iseq stands for “InstructionSequence”. If we already have an iseq, I think it means our code has already been compiled and we’ve gone too far.
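The same split between “compile” and “run” is visible from Ruby. In this sketch, compile hands back an instruction sequence without running anything, and eval runs it afterwards. The compile_then_run name is mine; compile and eval are the real RubyVM::InstructionSequence methods.

```ruby
# Compile source into an iseq first, then evaluate the already-compiled
# iseq as a separate step. `compile_then_run` is a hypothetical helper name.
def compile_then_run(source)
  iseq = RubyVM::InstructionSequence.compile(source) # compiled, not yet run
  [iseq.class, iseq.eval]                            # now it runs
end

p compile_then_run("1 + 2")
# => [RubyVM::InstructionSequence, 3]
```

By the time something is holding an iseq, compilation is done, which matches my worry about rb_ec_exec_node being too late in the process.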

Stepping back to `ruby_options`

ruby_run_node doesn’t do much aside from calling rb_ec_exec_node. So if ruby_run_node and rb_ec_exec_node are not the right functions… that only leaves ruby_options. Not what I would expect, but let’s check:

void *
ruby_options(int argc, char **argv)
{
    rb_execution_context_t *ec = GET_EC();
    enum ruby_tag_type state;
    void *volatile iseq = 0;

    EC_PUSH_TAG(ec);
    if ((state = EC_EXEC_TAG()) == TAG_NONE) {
        iseq = ruby_process_options(argc, argv);
    }
    else {
        rb_ec_clear_current_thread_trace_func(ec);
        int exitcode = error_handle(ec, ec->errinfo, state);
        ec->errinfo = Qnil; /* just been handled */
        iseq = (void *)INT2FIX(exitcode);
    }
    EC_POP_TAG();
    return iseq;
}

There’s a lot going on in here, but I’m drawn to the iseq = ruby_process_options(argc, argv) line. Let’s dig into ruby_process_options. This is a big one:

void *
ruby_process_options(int argc, char **argv)
{
    ruby_cmdline_options_t opt;
    VALUE iseq;
    const char *script_name = (argc > 0 && argv[0]) ? argv[0] : ruby_engine;

    if (!origarg.argv || origarg.argc <= 0) {
        origarg.argc = argc;
        origarg.argv = argv;
    }
    set_progname(external_str_new_cstr(script_name));  /* for the time being */
    rb_argv0 = rb_str_new4(rb_progname);
    rb_vm_register_global_object(rb_argv0);

#ifndef HAVE_SETPROCTITLE
    ruby_init_setproctitle(argc, argv);
#endif

    iseq = process_options(argc, argv, cmdline_options_init(&opt));

    //...

    return (void*)(struct RData*)iseq;
}

Most of this function seems to be VM setup. But I think we’re getting closer with iseq = process_options(...).

Checking process_options… whoa, this is a ~350 line function! It’s a bit much to paste in here in full, but scanning the code, I think we’re on the right track. There are all sorts of option initializations here:

static VALUE
process_options(int argc, char **argv, ruby_cmdline_options_t *opt)
{
    //...
    if (FEATURE_SET_P(opt->features, yjit)) {
        bool rb_yjit_option_disable(void);
        opt->yjit = !rb_yjit_option_disable(); // set opt->yjit for Init_ruby_description() and calling rb_yjit_init()
    }
    //...
    ruby_mn_threads_params();
    Init_ruby_description(opt);
    //...
    ruby_gc_set_params();
    ruby_init_loadpath();
    //...
}

Among many other things, it sets up options for yjit, mn threads, the program description, garbage collection params, and the loadpath. That’s just scratching the surface of this function. Then around 240 lines into the function, I see a very promising if statement:
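Some of this setup is visible from Ruby itself once the program is running. Here is a minimal sketch, assuming CRuby; the RubyVM::YJIT check is guarded because that constant is not present on every build, and the variable names are only for illustration:

```ruby
# Inspect, from Ruby, some of the state that startup option processing leaves behind.
# Assumption: running on CRuby. RubyVM::YJIT may be absent, so the check is guarded.

description = RUBY_DESCRIPTION   # the constant defined by Init_ruby_description
load_path   = $LOAD_PATH         # the load path that ruby_init_loadpath starts populating

# true only when YJIT exists in this build and is currently enabled
yjit_on = !!(defined?(RubyVM::YJIT) &&
             RubyVM::YJIT.respond_to?(:enabled?) &&
             RubyVM::YJIT.enabled?)

puts description
puts "load path entries: #{load_path.size}"
puts "YJIT enabled: #{yjit_on}"
```

Running it with and without the --yjit flag should flip the last line on builds that include YJIT.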

static VALUE
process_options(int argc, char **argv, ruby_cmdline_options_t *opt)
{
    //...
    struct {
        rb_ast_t *ast;
        pm_parse_result_t prism;
    } result = {0};
    // ... ~240 lines of option handling
    if (!rb_ruby_prism_p()) {
        ast_value = process_script(opt);
        if (!(result.ast = rb_ruby_ast_data_get(ast_value))) return Qfalse;
    }
    else {
        prism_script(opt, &result.prism);
    }

The beginning of the function sets up a struct that holds either an rb_ast_t or a pm_parse_result_t. Prism is the new default Ruby parser as of Ruby 3.4, so we’re getting close. rb_ast_t must be the format for the prior CRuby parser.

From a naming perspective, I would never have guessed that ruby_options is the place that parses our Ruby code. In principle I guess this is all preamble to actually running the program, so it kind of relates.

I won’t dig into prism_script. It should create our Abstract Syntax Tree (AST), which I expect the compiler will use later:

typedef struct {
    //...
    /** The resulting scope node that will hold the generated AST. */
    pm_scope_node_t node;
    //...
} pm_parse_result_t;
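To get a feel for what "an AST held in a scope node" means, you can parse the same program from Ruby. This is only an analogy: as far as I know, RubyVM::AbstractSyntaxTree exposes the nodes of CRuby's original parser rather than Prism's pm_scope_node_t, so treat it as a sketch and not a view into prism_script.

```ruby
# Parse a snippet into an AST from Ruby.
# Assumption: RubyVM::AbstractSyntaxTree shows the original parser's node types,
# which differ from the Prism structs used inside process_options.
source = 'puts "Did you know you can write to $stdout?" if $stdout.respond_to?(:write)'

ast = RubyVM::AbstractSyntaxTree.parse(source)

puts ast.type   # the root of the tree is a scope node
pp ast          # prints the nested nodes under that scope
```

The root being a scope node mirrors the idea in the struct above: the parse result is a scope that wraps the rest of the tree.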

Ok, here we go! I think we’ve got it with this next section! The pm_scope_node_t node (which should be set by prism_script) is used to create our rb_iseq_t *iseq inside of pm_iseq_new_main!

// ~320 lines into the function
pm_parse_result_t *pm = &result.prism;
int error_state;
iseq = pm_iseq_new_main(&pm->node, opt->script_name, path, parent, optimize, &error_state);

We now have the entrypoint into creating our InstructionSequence (our iseq, or rb_iseq_t). I wanted to start digging into the actual compiler, but I think I’ll stop here for today.
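The same source-to-iseq step is available from Ruby through RubyVM::InstructionSequence. A minimal sketch, assuming CRuby (this goes through the public Ruby API, not through pm_iseq_new_main directly):

```ruby
# Compile source text into an instruction sequence from Ruby.
# Assumption: CRuby, where RubyVM::InstructionSequence is available.
source = 'puts "Did you know you can write to $stdout?" if $stdout.respond_to?(:write)'

iseq = RubyVM::InstructionSequence.compile(source)

puts iseq.class    # RubyVM::InstructionSequence, the Ruby-level view of an iseq
puts iseq.disasm   # the bytecode listing for the compiled program
```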

Now that we know the entrypoint into the compiler, we can start figuring out what code might need to change to add a new bytecode instruction. Next up, I’m hoping we can find the appropriate area that needs that change. See you then! 👋🏼

Adding `opt_respond_to` to the Ruby VM: part 1

Dec 22, 2024 JP Camara

@byroot has been posting a series on optimizations he and others have made to the json gem, and it’s been 🔥🔥🔥. Enjoyable and informative, I highly recommend reading what he’s posted so far.

In his second post, he mentions the possibility of improving performance by adding an additional method cache. This would involve compiling respond_to? calls in a special way:

It actually wouldn’t be too hard to add such a cache, we’d need to modify the Ruby compiler to compile respond_to? calls into a specialized opt_respond_to instruction that does have two caches instead of one. The first cache would be used to look up respond_to? on the object to make sure it wasn’t redefined, and the second one to look up the method we’re interested in. Or perhaps even 3 caches, as you also need to check if the object has a respond_to_missing? method defined in some cases.

That’s an idea I remember discussing in the past with some fellow committers, but I can’t quite remember if there was a reason we didn’t do it yet.
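To make the respond_to_missing? case from the quote concrete, here is a small sketch. The Settings class is hypothetical; it only illustrates why a respond_to? call can end up involving more than one method lookup.

```ruby
# A hypothetical class that answers respond_to? dynamically.
class Settings
  def initialize(values)
    @values = values
  end

  # Handle calls such as settings.port for any key in @values.
  def method_missing(name, *args)
    return @values[name] if @values.key?(name)

    super
  end

  # Consulted by respond_to? when no real method with that name exists.
  def respond_to_missing?(name, include_private = false)
    @values.key?(name) || super
  end
end

settings = Settings.new({ port: 8080 })

p settings.respond_to?(:port)   # answered via respond_to_missing?
p settings.respond_to?(:host)   # not a key, so falls through to super
p settings.port
```

A single respond_to?(:port) call here involves finding respond_to? itself, failing to find port, and then finding respond_to_missing?, which is where the idea of two or three caches comes from.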

Inspired by his comment, I’m going to add a new bytecode instruction - opt_respond_to - to the Ruby VM, for fun. I don’t know how to add a new bytecode instruction (yet). I don’t know if one would get accepted by the Ruby team. I don’t know if adding it will actually provide a meaningful enhancement to performance. But let’s give it a try, shall we?

Understanding the requirements

I know I want to add a new Ruby Virtual Machine bytecode called opt_respond_to. What does code using respond_to? look like today, after being compiled? Here’s some code to evaluate:

if $stdout.respond_to?(:write)
  puts "Did you know you can write to $stdout?"
end

We can compile it using RubyVM::InstructionSequence. We compile the code, then we disassemble it to see the actual Ruby bytecode:

puts RubyVM::InstructionSequence.compile(DATA.read).disassemble

__END__
if $stdout.respond_to?(:write)
  puts "Did you know you can write to $stdout?"
end

📝 The __END__ format is just a convenient way of supplying some text to your program. Here we put all our Ruby code we want to compile after __END__ and it will be available to our program as an IO object called DATA. Thanks to Drew Bragg for the tip!

Compiling using InstructionSequence gives us:

== disasm: #<ISeq:<compiled>@<compiled>:1 (1,0)-(3,3)>
0000 getglobal                    :$stdout                  (   1)[Li]
0002 putobject                    :write
0004 opt_send_without_block       <calldata!mid:respond_to?, argc:1, ARGS_SIMPLE>
0006 branchunless                 14
0008 putself                                                (   2)[Li]
0009 putstring                    "Did you know you can write to $stdout?"
0011 opt_send_without_block       <calldata!mid:puts, argc:1, FCALL|ARGS_SIMPLE>
0013 leave
0014 putnil
0015 leave

I’m going to guess at the parts I think matter most - putobject, opt_send_without_block and branchunless.

putobject pushes the symbol :write onto the vm stack
opt_send_without_block is given <calldata!mid:respond_to?, argc:1, ARGS_SIMPLE>. I’m guessing calldata is a format for specifying metadata about what is being called. We’ve got the method name, respond_to?, how many args are being used, argc:1, and that the arguments are “simple”, ARGS_SIMPLE. mid stands for… “method id”?
branchunless isn’t specifically related to creating a new instruction, but it’s informative for how the respond_to? result is used. I believe it means “if the last result is falsy, jump to instruction 14”. 14 in this case is the putnil near the bottom of the bytecode. Each instruction seems to be prefixed with a zero-padded offset that identifies its location: putobject is at 0002, opt_send_without_block is at 0004, branchunless is at 0006 and putnil is at 0014.
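If you would rather work with the instructions as data than read the disassembly, InstructionSequence#to_a returns them as nested arrays. A minimal sketch, assuming CRuby; the exact instruction names can differ between Ruby versions:

```ruby
# List the instruction names of a compiled snippet.
# Assumption: CRuby. The last element of to_a is the body, which mixes
# line numbers (Integers), labels and events (Symbols), and instructions (Arrays).
source = <<~'RUBY'
  if $stdout.respond_to?(:write)
    puts "Did you know you can write to $stdout?"
  end
RUBY

iseq = RubyVM::InstructionSequence.compile(source)
body = iseq.to_a.last

instructions = body.grep(Array)     # keep only the instruction entries
names = instructions.map(&:first)   # the first element is the instruction name

p names
```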

I think the only thing I will be changing is taking calls to respond_to?, and making that an opt_respond_to bytecode instead of opt_send_without_block. We’ll see!

There are actually a few variations for respond_to? that I wasn’t aware of. The interface takes a symbol or a string as the first parameter, and then a boolean for whether to include private and protected methods:

$stdout.respond_to?("write")
$stdout.respond_to?("write", true)
$stdout.respond_to?(:write, true)
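Here is how those variations behave at runtime. The Greeter class is hypothetical and exists only to have one public and one private method to ask about:

```ruby
# A hypothetical class with one public and one private method.
class Greeter
  def hello
    "hi"
  end

  private

  def secret
    "shh"
  end
end

greeter = Greeter.new

results = {
  symbol: greeter.respond_to?(:hello),                  # symbol name, public method
  string: greeter.respond_to?("hello"),                 # string name works too
  private_default: greeter.respond_to?(:secret),        # private methods are hidden by default
  private_included: greeter.respond_to?(:secret, true)  # second argument includes them
}

p results
```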

Let’s see the bytecode for these variations:

== disasm: #<ISeq:<compiled>@<compiled>:1 (1,0)-(3,33)>
0000 getglobal                    :$stdout                  (   1)[Li]
0002 putstring                    "write"
0004 opt_send_without_block       <calldata!mid:respond_to?, argc:1, ARGS_SIMPLE>
0006 pop
0007 getglobal                    :$stdout                  (   2)[Li]
0009 putstring                    "write"
0011 putobject                    true
0013 opt_send_without_block       <calldata!mid:respond_to?, argc:2, ARGS_SIMPLE>
0015 pop
0016 getglobal                    :$stdout                  (   3)[Li]
0018 putobject                    :write
0020 putobject                    true
0022 opt_send_without_block       <calldata!mid:respond_to?, argc:2, ARGS_SIMPLE>
0024 leave

It doesn’t make a huge difference. The string version is nearly identical, except we see putstring "write" instead of putobject :write. The boolean versions add a putobject which pushes the boolean onto the stack, and the opt_send_without_block call has argc:2 instead of argc:1. Good to understand, but functionally the same.
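The argc difference can also be pulled out of the disassembly programmatically. This sketch assumes the calldata text keeps the mid:..., argc:N shape shown above, which is a display format and could change between Ruby versions:

```ruby
# Extract the argc of each respond_to? call from the disassembly text.
# Assumption: the "mid:respond_to?, argc:N" text format seen in the listing above.
source = <<~'RUBY'
  $stdout.respond_to?("write")
  $stdout.respond_to?("write", true)
  $stdout.respond_to?(:write, true)
RUBY

disasm = RubyVM::InstructionSequence.compile(source).disasm

argcs = disasm.scan(/mid:respond_to\?, argc:(\d+)/).flatten.map(&:to_i)

p argcs   # one entry per respond_to? call, in source order
```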

I’m going to keep these explorations shorter, and break them into parts, so I’m going to stop here. So far we’ve:

Identified that I want to create an opt_respond_to instruction for the Ruby VM
Compiled a simple respond_to? example, and examined what the current bytecode looks like
Identified what I think are the relevant instructions that need to be converted to opt_respond_to

Next up, I’m going to walk through the path CRuby takes from starting the program, to compiling our code, starting with main.c. Here we go!

20 days of ruby gems: part 1

Dec 16, 2024 JP Camara

Over on BlueSky, Gregory Brown suggested a #20daygemchallenge. Post gems you’ve either used time and time again, or have inspired you in some way, in no particular order. Mohit Sindhwani suggested writing about them at the end, which sounded like a great idea!

I’m breaking it into two parts. Here’s my breakdown of my first 10 gems posted:

First is HTTParty: https://github.com/jnunemaker/httparty

HTTParty is the OG http gem. It’s widely used and dead simple. There are a million http options out there, but HTTParty still remains a simple, common option. It sits on top of Net::HTTP, so it has the same timeout concern as any Net::HTTP usage. But many Ruby http gems use Net::HTTP, so that’s not a particular knock against it!

Second is Async: https://github.com/socketry/async

Async is the de facto FiberScheduler for Ruby. That’s what allows Fibers to parallelize blocking operations in Ruby 3+. It’s a great gem and a great ecosystem of tools as well, particularly revolving around all things IO and web protocols.

I talk about it in more detail in my In-Depth Ruby Concurrency talk from RubyConf and I’ll have future articles about it as well.

Third is Pitchfork: https://github.com/Shopify/pitchfork

Pitchfork is an evolution of the Unicorn web server. Its innovation is a forking technique called “Reforking”, where processes are forked multiple times from existing “warm” processes, getting to a point of maximally optimized Copy-on-Write performance.

Like async, I talk about it in more detail in my In-Depth Ruby Concurrency talk from RubyConf and I’ll have future articles about it as well.

Fourth is Ractor TVar: https://github.com/ko1/ractor-tvar

Ractor TVar is an implementation of software transactional memory in Ruby. I learned about it from the Mooro gem, which is a later gem pick. It’s a fascinating and largely unknown library that koichi seemed to have released alongside Ractors and not updated since. I’m very curious to read the source more to better understand how it works and maybe even give it a try in some real code. It only documents examples supporting Ractors but claims to also support threads.

Fifth is Strong Migrations: https://github.com/ankane/strong_migrations

I prefer my database migrations to be stress-free and zero-downtime. The strong migrations gem helps me sleep at night. It detects unsafe operations and blocks them from being run, offering safe alternatives.

There are a few other gems that help keep migrations safe, but I prefer the explicit style of strong migrations. Most other options are a bit “magical”, and will try to rewrite things for you.

Sixth is ZipKit: https://github.com/julik/zip_kit

I’ve mostly used ZipTricks in the past, and ZipKit is the successor to that gem. Being able to stream writes to a zip file is amazing for scaling and ZipKit makes it dead simple.

Using it you can, for instance, stream a file to S3, zipping it on the fly as it’s being uploaded! Streaming is the only way you can reasonably manage operations on very large files, so having this option is critical.

Seventh is Falcon: https://github.com/socketry/falcon

Falcon is a web server based on the FiberScheduler (provided by the async gem). It’s a very scalable server, particularly for IO bound operations. There’s a great talk focusing on it from RailsConf 2023 called Look ma, no jobs.

I also did some benchmarking of its web socket performance compared to a node.js implementation and it is very close in performance!

Here’s the code for it:

https://gist.github.com/jpcamara/8a1a09c9c67347c4e32384b9ce806b70

Like async and pitchfork, I talk about it in more detail in my In-Depth Ruby Concurrency talk from RubyConf and I’ll have future articles about it as well.

Eighth is OJ: https://github.com/ohler55/oj

OJ has historically been the fastest JSON parser in the Ruby world. Usually you can just drop it into a project as a JSON replacement and see things immediately speed up, especially if you are doing any heavy JSON processing.

The JSON gem was recently taken over by the Ruby GitHub organization and byroot has been making some big performance improvements to it - maybe we’ll see parity at some point but OJ is still a great choice.

Ninth is io-event: https://github.com/socketry/io-event

The async gem is the public interface, but io-event is what powers the scheduling at the OS level. It provides all of the integrations with each operating system’s kernel event queue: io_uring and epoll for Linux, and kqueue for MacOS. IOCP support for Windows is still in progress, so it falls back to a basic Ruby select there. If you don’t know why any of that is useful, it’s because it’s an important part of keeping the “Reactor” pattern of asynchronous IO efficient.
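For a rough idea of what the “basic Ruby select” version of a reactor looks like, here is a tiny sketch using only IO.select and pipes. It is not how io-event is implemented; the names and the one second timeout are just for illustration.

```ruby
# A tiny select-based event loop: wait until any reader has data, then handle it.
# Assumption: illustrative only; io-event uses kernel event queues where available.
reader_a, writer_a = IO.pipe
reader_b, writer_b = IO.pipe

writer_b.write("from b")
writer_a.write("from a")

received = {}
pending = { reader_a => :a, reader_b => :b }   # IO objects we are still waiting on

until pending.empty?
  ready, = IO.select(pending.keys, nil, nil, 1)   # block until something is readable
  break if ready.nil?                             # nil means the timeout expired

  ready.each do |io|
    received[pending.delete(io)] = io.read_nonblock(64)
  end
end

[reader_a, writer_a, reader_b, writer_b].each(&:close)

p received
```

The loop only wakes up when the OS reports a readable IO, which is the part that epoll, kqueue and io_uring do more efficiently at scale.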

Like async, pitchfork and falcon (😅), I talk about the reactor pattern in more detail in my In-Depth Ruby Concurrency talk from RubyConf and I’ll have future articles about it as well. I obviously like concurrency 🙂.

Tenth is Glimmer: https://github.com/AndyObtiva/glimmer

I wasn’t familiar with Glimmer but I learned about it at RubyConf. It’s a DSL for building UIs with pure Ruby and has bindings for desktop app UI layers as well as the web. It’s a really cool concept and I look forward to learning more about it by watching How to build basic desktop applications in Ruby. I’ve been working on a cross platform app using React Native and Tauri - maybe I’ll port some of it to Glimmer as an experiment.

After I finish 11 through 20, I’ll post about them as well. Give these gems a try! 👋

My MacOS setup for hacking on CRuby

Dec 2, 2024 JP Camara

I recently posted my docker setup for hacking on CRuby, which showed how I test Linux features when working on CRuby. But most of the time, I just build CRuby directly on MacOS.

The Building Ruby guide from ruby-lang.org is the most up-to-date guide on doing this, but I like to spell it out exactly in order of how I do it day-to-day. So this is for me more than anything, but you may find it helpful!

# /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
xcode-select --install
brew update
brew install openssl@3
brew install autoconf
brew install gperf
brew install libffi
brew install libyaml
brew install zlib
brew install gmp
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

export CONFIGURE_ARGS=""
for ext in openssl readline libyaml zlib; do
  CONFIGURE_ARGS="${CONFIGURE_ARGS} --with-$ext-dir=$(brew --prefix $ext)"
done

./autogen.sh
mkdir build && cd build
mkdir ./.rubies

../configure --prefix="/path/to/ruby/build/.rubies/ruby-master" --disable-install-doc --config-cache --enable-debug-env optflags="-O0 -fno-omit-frame-pointer" CFLAGS="-DRUBY_DEBUG -O0" --with-opt-dir=$(brew --prefix gmp):$(brew --prefix jemalloc)

make install

That’s pretty much everything I do when setting things up!

If you want to run YJIT in “dev” mode, you add --enable-yjit=dev to the configure call:

../configure --prefix="/path/to/ruby/build/.rubies/ruby-master" --disable-install-doc --config-cache --enable-debug-env optflags="-O0 -fno-omit-frame-pointer" CFLAGS="-DRUBY_DEBUG -O0" --with-opt-dir=$(brew --prefix gmp):$(brew --prefix jemalloc) --enable-yjit=dev

From here, the simplest way to run some code is to place a test.rb file in the root of the project and run it using make runruby. To run it in a debug mode, you can run make lldb-ruby.

Counting C method calls in CRuby

Nov 28, 2024 JP Camara

There is a central macro in CRuby, RUBY_VM_CHECK_INTS, which is a very hot path for the Ruby runtime. It’s an important part of how threads are managed, and it’s called constantly. I was curious just how often it was called, and it turned out CRuby comes with some handy debugging functionality for just this scenario.

Inside of debug_counter.h, I changed #define USE_DEBUG_COUNTER 0 to #define USE_DEBUG_COUNTER 1 and added this line later in that file:

RB_DEBUG_COUNTER(rb_vm_check_ints)

Then inside vm_core.h I updated RUBY_VM_CHECK_INTS to add a debug increment:

#define RUBY_VM_CHECK_INTS(ec) rb_vm_check_ints(ec)
static inline void
rb_vm_check_ints(rb_execution_context_t *ec)
{
    RB_DEBUG_COUNTER_INC(rb_vm_check_ints); // increment!

After that I ran the following simple Ruby program:

10_000.times {}

And this was printed after it ran:

[RUBY_DEBUG_COUNTER]    rb_vm_check_ints    21,055

Iterating a loop ten thousand times results in roughly twenty-one thousand calls to RUBY_VM_CHECK_INTS, a little over two per iteration. Exactly what I was looking to measure!

I’d like to know the proper configuration to compile without having to manually modify USE_DEBUG_COUNTER in the header file. Maybe someone can comment and let me know how? It has something to do with CFLAGS, I think.

Update 12/5/24 Thanks to Mohit Sindhwani for some advice on how to add the CFLAGS!

My docker setup for hacking on CRuby

Nov 27, 2024 JP Camara

I run on MacOS, but I often want to test Linux behaviors when working on the CRuby implementation.

Here’s the Dockerfile I use:

FROM ubuntu:24.04

# Preventing dialog prompts when installing packages
ENV DEBIAN_FRONTEND=noninteractive

# Update and install basic build dependencies and Rust
RUN apt-get update && apt-get install -y \
    git \
    curl \
    build-essential \
    autoconf \
    libreadline-dev \
    libssl-dev \
    libyaml-dev \
    libncurses5-dev \
    zlib1g-dev \
    libffi-dev \
    bison \
    libgdbm-dev \
    libgdbm-compat-dev \
    libreadline6-dev \
    libssl-dev \
    libgmp-dev \
    liburing-dev \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

# Install Rust via rustup
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y

RUN apt-get update && apt-get install -y ruby
RUN apt-get update && apt-get install -y gdb

# Add Rust to the PATH so the cargo and rustc commands are available
ENV PATH="/root/.cargo/bin:${PATH}"

# Create a directory for the Ruby source code
WORKDIR /usr/src/ruby

# Copy Ruby source code from your local directory
COPY . .

# This will be the default command when you run the container.
CMD [ "/bin/bash" ]

To build the image and start a container from it, you can use the following two commands:

docker build -t ruby-source-build-env .
docker run -it --mount type=bind,src=.,target=/usr/src/ruby ruby-source-build-env

Based on our Dockerfile, docker run will open up a bash shell for you. From there, I run the following commands to build CRuby:

./autogen.sh
mkdir build && cd build
mkdir ./.rubies
../configure --prefix="/usr/src/ruby/build/.rubies/ruby-master" --disable-install-doc --config-cache --enable-debug-env optflags="-O0 -fno-omit-frame-pointer"
make install

We now have CRuby operating under Ubuntu Linux! From here, the simplest way to run some code is to place a test.rb file in the root of the project and run it using make runruby.

Calculating the largest known prime in Ruby

Nov 26, 2024 JP Camara

Looking to impress your Ruby friends by calculating the largest known prime, 2 ** 136_279_841-1?

On Ruby 3.4.0-preview2 and earlier, 2 ** 136_279_841-1 logs a warning and returns Infinity 😔:

2 ** 136_279_841-1
# warning: in a**b, b may be too big
# => Infinity

Thanks to @mametter, Ruby 3.4 will handle this calculation just fine! See Do not round a**b to infinity.

Knowing this, you excitedly use your ruby manager of choice to pull down ruby master:

rvm install ruby-head

You run ruby -e "puts 2 ** 136_279_841-1", and your excitement is slowly eroded. An hour into calculating, you terminate the command in frustration 😫.

Is @mametter a liar?!

As it turns out, there is a critically important library you need for accelerating “Bignum” calculations: GMP, the GNU Multiple Precision Arithmetic Library. It’s even specifically mentioned in the CRuby guide to building ruby.

Without it, you can kiss your largest prime calculating dreams goodbye 👋.

You reinstall ruby head, making sure gmp is available:

brew install gmp
rvm reinstall ruby-head --with-gmp-dir=$(brew --prefix gmp)

With a bit of hope in your heart, you try again:

ruby -e "puts 2 ** 136_279_841-1"

Success! @mametter was telling the truth!

Within around 5 seconds, your terminal is filled with a beautiful output of 41,024,320 digits. Your Ruby friends cheer and carry you off on their shoulders.
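You can sanity check that digit count without computing the number at all, since 2 ** p - 1 has floor(p * log10(2)) + 1 digits. A minimal sketch; the GMP check assumes the Integer::GMP_VERSION constant, which CRuby only defines when it was built with GMP:

```ruby
# Estimate the digit count of 2 ** p - 1 without building the huge integer.
# 2 ** p is never a power of 10, so subtracting 1 does not change the digit count.
exponent = 136_279_841
digits = (exponent * Math.log10(2)).floor + 1

# Cross-check the formula against a Mersenne prime small enough to print.
small = 2**127 - 1
small_digits_exact    = small.to_s.length
small_digits_estimate = (127 * Math.log10(2)).floor + 1

# Assumption: Integer::GMP_VERSION is only present on GMP-enabled builds.
gmp = Integer.const_defined?(:GMP_VERSION) ? Integer::GMP_VERSION : nil

puts "digits in 2 ** #{exponent} - 1: #{digits}"
puts "2 ** 127 - 1 has #{small_digits_exact} digits (estimate: #{small_digits_estimate})"
puts "GMP: #{gmp || 'not available in this build'}"
```

If the last line reports that GMP is not available, that is a hint the full calculation will be painfully slow.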

This was all inspired by Matz’s keynote at RubyConf 2024 - where he mentioned that Ruby 3.4 can now calculate the largest known prime. For fun, I tried it on my mac and just let it keep running - 2 hours later, it was still running! I’d never heard of GMP, but now I know!
