Sometimes your fastest queries can cause the most problems. I will take you beyond slow query optimization and instead zero in on the performance impact of the sheer quantity of your datastore hits. Using real-world examples dealing with datastores such as Elasticsearch, MySQL, and Redis, I will demonstrate how many fast queries can wreak just as much havoc as a few big slow ones. With each example I will make use of the simple tools available in Ruby to reduce or eliminate the need for these fast and seemingly innocuous datastore hits.
During this talk I will work through some of the bottlenecks our team found in our code that were causing our datastores to significantly underperform. As we grew we started to scale our infrastructure horizontally by processing data in parallel using libraries such as Resque and Sidekiq. With every additional worker we were increasing the hits made to our datastores and eventually we were making so many that even simple requests became painfully slow.
1) Bulk ActiveRecord serialization:
We use Elasticsearch to allow clients to search their data, but our primary source of truth is MySQL. In order to get records from MySQL into Elasticsearch we have to serialize them. For the longest time we were using ActiveModel Serializers to painlessly serialize millions of ActiveRecord objects a day. This worked great until we found ourselves serializing over 200 million records a day. At that point the speed at which we were serializing them dropped dramatically. The solution to this problem came in the form of bulk serialization. By processing the records to be serialized as a group, instead of individually, we were able to use a local cache to retrieve all the data needed to serialize each record and eliminate 90% of our database calls.
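A minimal sketch of the bulk idea, not our production code: `FakeClientTable`, `Record`, `serialize_one`, and `serialize_bulk` are hypothetical names, and the fake table stands in for a MySQL-backed model so the number of lookups can be counted without a database.

```ruby
# Hypothetical stand-in for a database table. It counts every lookup so the
# difference between per-record and bulk serialization is visible.
class FakeClientTable
  attr_reader :query_count

  def initialize(rows)
    @rows = rows
    @query_count = 0
  end

  # One "query" per call, similar in spirit to a find by primary key.
  def find(id)
    @query_count += 1
    @rows.fetch(id)
  end

  # One "query" for many ids, similar in spirit to a WHERE id IN (...) lookup.
  def find_all(ids)
    @query_count += 1
    @rows.select { |id, _row| ids.include?(id) }
  end
end

Record = Struct.new(:id, :client_id, :title)

# Per-record serialization: every record triggers its own lookup.
def serialize_one(record, clients)
  client = clients.find(record.client_id)
  { id: record.id, title: record.title, client_name: client[:name] }
end

# Bulk serialization: fetch everything the batch needs once, keep it in a
# local hash, and serialize each record from that hash.
def serialize_bulk(records, clients)
  cache = clients.find_all(records.map(&:client_id).uniq)
  records.map do |record|
    client = cache.fetch(record.client_id)
    { id: record.id, title: record.title, client_name: client[:name] }
  end
end
```

With this sketch, serializing a batch of N records costs one lookup instead of N, while producing the same output as the per-record version.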
2) Don’t make requests you don’t need:
As mentioned above, we also use Elasticsearch. One of the things we use Elasticsearch for is reporting. Every night we query over 3 billion documents in Elasticsearch and create detailed reports from them. We did a lot of work to make sure our slow Elasticsearch queries ran fast, but the time it took us to build all the reports was still painfully long. When we looked closer at our reports, we realized a good number of them contained no data. Even though the queries for the reports with no data executed quickly, they were still unnecessary requests, which meant we could save time and resources by skipping them altogether.
We were able to apply this same concept to many of our MySQL queries. It was as simple as checking a hash or an array to see if it was empty. If it was, then we knew the MySQL query would return nothing and we would avoid making it.
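The empty check can be sketched roughly as follows. `CountingSearchClient` and `build_report` are hypothetical names invented for illustration; the client only counts requests and is not a real Elasticsearch or MySQL API.

```ruby
# Hypothetical datastore client that counts how many requests it receives.
class CountingSearchClient
  attr_reader :request_count

  def initialize
    @request_count = 0
  end

  # Pretend to query the datastore for the given ids.
  def search(ids)
    @request_count += 1
    ids.map { |id| { id: id } }
  end
end

# Skip the request entirely when we already know it cannot return anything.
def build_report(asset_ids, client)
  return [] if asset_ids.empty?

  client.search(asset_ids)
end
```

The guard clause costs a single in-memory check, while the request it avoids would cost a network round trip and work on the datastore, however fast the query itself is.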
3) Replacing Redis requests with a local cache:
When we serialize ActiveRecord objects for Elasticsearch we not only make requests to MySQL, we also fetch data from Redis. The Redis requests we were making were simple get requests that were blindingly fast. However, because we were making so many, they were responsible for 65% of the runtime on some of our jobs. By making a single request to Redis and memoizing the result in a local hash, we were able to increase the speed of our code and decrease the load on Redis.
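A rough sketch of memoizing in a local hash, under stated assumptions: `FakeRedis` and `MemoizedLookup` are hypothetical classes written for this example. `FakeRedis` only mimics a `get(key)` call and counts invocations so the sketch runs without a Redis server.

```ruby
# Hypothetical stand-in for a Redis client. It exposes only get(key)
# and counts how many times it is called.
class FakeRedis
  attr_reader :get_count

  def initialize(data)
    @data = data
    @get_count = 0
  end

  def get(key)
    @get_count += 1
    @data[key]
  end
end

# Wraps the client with a local hash so each key is fetched at most once
# for the lifetime of this object (for example, one job run).
class MemoizedLookup
  def initialize(redis)
    @redis = redis
    @local = {}
  end

  def get(key)
    # key? is used instead of ||= so that nil results are memoized too.
    return @local[key] if @local.key?(key)

    @local[key] = @redis.get(key)
  end
end
```

The trade-off in a sketch like this is staleness: the local hash never sees later changes in Redis, so it suits values that are stable for the duration of a job.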
We were able to apply this same concept by caching our sharded database setup using ActiveRecord’s connection object. Prior to using ActiveRecord’s connection object we were constantly fetching the data from Redis. This was fine to start, but as our configuration grew we were eventually reading 7.8MB/second from Redis. Using ActiveRecord’s connection object as a cache completely eliminated those requests.
I joined Kenna Security in 2015 and was the 7th engineer hired. At that time we were handling 50 million records in our database and processing a couple million a day. Now our engineering team has over 35 people, we handle over 3 billion records in our database, and we process over 200 million of those records a day. Getting to that scale was not easy, but we learned a lot along the way. Scale is always one of the biggest challenges for a growing company, and the ability to process data in parallel can speed things up tremendously. But it is easy to scale horizontally past the abilities of your external resources. For growing companies, this talk focuses on the simple improvements that can help them control their external resource use and spending while scaling with minimal effort. For companies that are stable in size or can’t increase their resource spend, this talk will give them the tools they need to scale with the infrastructure they have. Having experienced a lot of these issues firsthand, I know what it takes to scale and I also know how to get the biggest bang for your buck.
I am an MIT grad with an Aerospace Engineering degree, which is most likely why I gravitate towards optimizing performance issues as a Software Engineer. I joined Kenna in 2015, where I have had the opportunity to work on some of the most challenging aspects of our code base. This includes scaling Elasticsearch to 3 billion docs, sharding MySQL databases, and taming Resque's usage of Redis. All of these had huge performance gains, but in the end, I found equally significant gains by using Ruby.
RubyConf 2018 - Accepted