The Curious Case of the Bad Clone, by Ufuk Kayserilioglu
Abstract
On Sept 4th 2020, I got pinged on a revert PR to fix a 150% slowdown on the Shopify monolith. It was a two-line change reverting the addition of a Sorbet signature on a Shop method, implicating Sorbet as the suspect.
That was the start of a journey that took me through a deeper understanding of the Sorbet, Rails and Ruby codebases. The fix is in Ruby 3.0 and I can sleep better now.
This talk will take you on that journey with me. Along the way, you will find tools and tricks that you can use for debugging similar problems. We will also delve into some nuances of Ruby, which is always fun.
Details
Intended Audience:
This talk is intended for all audiences that have a basic familiarity working with Ruby. No prior knowledge of Sorbet or Rails is required.
Outcomes:
This talk aims to provide an arsenal of debugging tools for the audience while taking them through a fun bug chasing journey.
The surface reason of the problem was the method_added
hook that was being added to the Shop
model to make Sorbet sig
blocks work. This is not a widely known technique but has specific uses, and the audience should be more familiar with them after the talk.
The root cause of the bug turns out to be related to the difference in behaviour between .clone
and .dup
and with how .clone
does the wrong thing with cloning singleton classes. Thus the talk will give a deeper understanding of these nuances to the audience as well.
Outline:
The Bug
- Sorbet signatures being reverted
- 150% slowdown after adding signatures
- Despite our team's best efforts to optimize Sorbet runtime
- Flamegraphs show a lot of time being spent in
T::Private::Methods::MethodHooks#method_added
- Revert brings us back to normalcy :thinking_face:
Sorbet
- Start using puts debugging in Sorbet runtime
bundle open sorbet-runtime
and instrument (Key Takeaway #1)- Run
bin/rails console
withDISABLE_SORBET=1
andEAGER_LOAD=1
to simulate prod behaviour. - Notice that normal flow through the method doesn't trigger the problem but
shop.clone
does. - Observe non-sensical output from
method_added
- Add a breakpoint in
method_added
usingbinding.pry if $break
(Key Takeway #2) - Load app, set
$break = true
and callshop.clone
- Observe backtrace. Method is being called directly from
Kernel.clone
; doesn't make sense. - The same thing does not happen on a PORO, so what is it about an AR model that makes the difference?
- Try to find
initialize_clone
orinitialize_dup
definitions in Rails - Try to understand Rails callback mechanism.
Rails
- Pair with a Rails expert
- Noticing that it is all class methods triggering
method_added
- Try to replicate the problem on a custom model. Notice that some
method_added
hooks are being called but only formattr_
defined methods. - Noticing that Rails mattr setters redefine the method on every set. (Key Takeaway #3)
- Colleague noticing that Rails doesn't even need to set the query (
?
) methods. Leading to this commit on Rails - Slowly realizing that this is probably not a Rails problem, since the call stack has nothing related to Rails.
- Take a look at
Kernel.clone
andKernel.dup
code. - Notice that
Kernel.clone
does a whole lot more copying thandup
. Especially that it is copying the singleton class, etc. (Key Takeaway #4) - Try to replicate the setup in
Shop
in other AR models. Add a custom model and notice that we can trigger the same case by including the Shopify custom podding module.
Ruby
- At this point, realize that the problem is something that is triggered by Rails but not in Rails
- Try to isolate the custom podding code that seems to be changing behaviour for models that include it
- Isolate it to the podding code registering a
before_initialize
callback on the model. - Try to find a small example. This is one of the most important steps you can take. (Key Takeaway #5)
- Based on the code for
Kernel.clone
realize that someone should be settingshop.singleton_class
, but who? - Colleague suggests taking a look at
ObjectSpace.trace_object_allocations
and associatedObjectSpace.allocation_sourcefile
/ObjectSpace.allocation_sourceline
. - Add that to small repro script to figure out the source. Realize that it is
instance_exec
being called bybefore_initialize
which ends up creating a singleton class for the model object. - Realize that it should be the
Kernel.clone
method doing something wrong. - Dig in to what that code is doing.
- Explain how singleton classes work, include diagrams, etc (Key Takeaway #6)
- Explain what the clone code is trying to do.
- Explain what it ends up doing instead.
- Explain why the class was not named (since it is an in-progress clone)
- Explain why
method_added
was being called but notsingleton_method_added
. - Explain how the fix PR resolves the problem.
Pitch
This bug hunt ended up being the most interesting one I've had in my 20+ year career and it thought me a lot about some of the areas of Sorbet, Rails and Ruby. I've had the help of experts which are Rails Core and Ruby Core members which have also been amazing.
I feel like there is a lot information captured in this bug hunt that would be useful for others in their similar adventures. Sharing the debugging methods and the deeper reasons for what exactly was breaking, and how I was able to hone in on it, should be educational for a general audience of any level of expertise.
This plays really well with the proposed topic of "Let Me Tell You About This Bug"