The Curious Case of the Bad Clone, by Ufuk Kayserilioglu

Abstract

On Sept 4th 2020, I got pinged on a revert PR to fix a 150% slowdown on the Shopify monolith. It was a two-line change reverting the addition of a Sorbet signature on a Shop method, implicating Sorbet as the suspect.

That was the start of a journey that took me through a deeper understanding of the Sorbet, Rails and Ruby codebases. The fix is in Ruby 3.0 and I can sleep better now.

This talk will take you on that journey with me. Along the way, you will find tools and tricks that you can use for debugging similar problems. We will also delve into some nuances of Ruby, which is always fun.

Details

Intended Audience:

This talk is intended for all audiences that have a basic familiarity working with Ruby. No prior knowledge of Sorbet or Rails is required.

Outcomes:

This talk aims to provide an arsenal of debugging tools for the audience while taking them through a fun bug chasing journey.

The surface reason of the problem was the method_added hook that was being added to the Shop model to make Sorbet sig blocks work. This is not a widely known technique but has specific uses, and the audience should be more familiar with them after the talk.

The root cause of the bug turns out to be related to the difference in behaviour between .clone and .dup and with how .clone does the wrong thing with cloning singleton classes. Thus the talk will give a deeper understanding of these nuances to the audience as well.

Outline:

  • The Bug

    • Sorbet signatures being reverted
    • 150% slowdown after adding signatures
    • Despite our team's best efforts to optimize Sorbet runtime
    • Flamegraphs show a lot of time being spent in T::Private::Methods::MethodHooks#method_added
    • Revert brings us back to normalcy :thinking_face:
  • Sorbet

    • Start using puts debugging in Sorbet runtime
    • bundle open sorbet-runtime and instrument (Key Takeaway #1)
    • Run bin/rails console with DISABLE_SORBET=1 and EAGER_LOAD=1 to simulate prod behaviour.
    • Notice that normal flow through the method doesn't trigger the problem but shop.clone does.
    • Observe non-sensical output from method_added
    • Add a breakpoint in method_added using binding.pry if $break (Key Takeway #2)
    • Load app, set $break = true and call shop.clone
    • Observe backtrace. Method is being called directly from Kernel.clone; doesn't make sense.
    • The same thing does not happen on a PORO, so what is it about an AR model that makes the difference?
    • Try to find initialize_clone or initialize_dup definitions in Rails
    • Try to understand Rails callback mechanism.
  • Rails

    • Pair with a Rails expert
    • Noticing that it is all class methods triggering method_added
    • Try to replicate the problem on a custom model. Notice that some method_added hooks are being called but only for mattr_ defined methods.
    • Noticing that Rails mattr setters redefine the method on every set. (Key Takeaway #3)
    • Colleague noticing that Rails doesn't even need to set the query (?) methods. Leading to this commit on Rails
    • Slowly realizing that this is probably not a Rails problem, since the call stack has nothing related to Rails.
    • Take a look at Kernel.clone and Kernel.dup code.
    • Notice that Kernel.clone does a whole lot more copying than dup. Especially that it is copying the singleton class, etc. (Key Takeaway #4)
    • Try to replicate the setup in Shop in other AR models. Add a custom model and notice that we can trigger the same case by including the Shopify custom podding module.
  • Ruby

    • At this point, realize that the problem is something that is triggered by Rails but not in Rails
    • Try to isolate the custom podding code that seems to be changing behaviour for models that include it
    • Isolate it to the podding code registering a before_initialize callback on the model.
    • Try to find a small example. This is one of the most important steps you can take. (Key Takeaway #5)
    • Based on the code for Kernel.clone realize that someone should be setting shop.singleton_class, but who?
    • Colleague suggests taking a look at ObjectSpace.trace_object_allocations and associated ObjectSpace.allocation_sourcefile/ObjectSpace.allocation_sourceline.
    • Add that to small repro script to figure out the source. Realize that it is instance_exec being called by before_initialize which ends up creating a singleton class for the model object.
    • Realize that it should be the Kernel.clone method doing something wrong.
    • Dig in to what that code is doing.
    • Explain how singleton classes work, include diagrams, etc (Key Takeaway #6)
    • Explain what the clone code is trying to do.
    • Explain what it ends up doing instead.
    • Explain why the class was not named (since it is an in-progress clone)
    • Explain why method_added was being called but not singleton_method_added.
    • Explain how the fix PR resolves the problem.

Pitch

This bug hunt ended up being the most interesting one I've had in my 20+ year career and it thought me a lot about some of the areas of Sorbet, Rails and Ruby. I've had the help of experts which are Rails Core and Ruby Core members which have also been amazing.

I feel like there is a lot information captured in this bug hunt that would be useful for others in their similar adventures. Sharing the debugging methods and the deeper reasons for what exactly was breaking, and how I was able to hone in on it, should be educational for a general audience of any level of expertise.

This plays really well with the proposed topic of "Let Me Tell You About This Bug"

Edit proposal

Submissions

RailsConf 2021 - Accepted [Edit]

Add submission