Improving Query Performance by 10,000x

Photograph by Hello I’m Nik on Unsplash

Notice your application slowing to a crawl as you get more data in your system? You’re not alone. While we’ve all heard the warnings about premature optimization, at some point you’re going to need to spend some time figuring out how to improve the performance of your system.

We recently ran into such a situation at Sky Ledge. A simple query to retrieve time series data took 30 seconds to run in our staging environment. These were simple queries with indexes on the columns of interest. We’d expect the requests to complete in <1 second, not 30 seconds.

We pride ourselves on a good user experience. Waiting 30 seconds for a graph to render makes for very poor UX, so it was time for some sleuthing and database optimization.

Context

Sky Ledge is a platform that helps customers quickly create control room experiences to gain insights into their operations.

One of the foundational components of Sky Ledge is our Asset API, which makes it very easy to track and visualize data from objects in the real world. Assets are entities within Sky Ledge that represent a real-world object (e.g. a car). Metrics are data streams associated with an asset. Examples of metrics are speed, elevation, temperature and so on. Each metric tracks a single value against the asset with an associated timestamp.

Sky Ledge in action

Our metrics table looks as follows (we use AWS RDS, running Postgres):

create table asset_metric
(
    asset_id    uuid not null,
    metric_name text,
    timestamp   timestamptz,
    value       double precision
);
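
For illustration, a single reading might be stored like this (the asset id and value here are made up):

-- Hypothetical example row: one speed reading for one asset
insert into asset_metric (asset_id, metric_name, timestamp, value)
values ('6f9619ff-8b86-d011-b42d-00c04fc964ff', 'speed', now(), 62.5);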

We’ve also got an index on the asset_metric table:

create index on asset_metric (timestamp, asset_id, metric_name);

The Problem

The problem occurred while I was creating a demo for a client in our staging environment. As part of the demo, I displayed temperature data from an asset within the Sky Ledge platform.

To retrieve the temperature data, we make the following query to the database:

select * from asset_metric
where
    asset_id = 'abc123' and metric_name = 'speed'
order by timestamp desc
limit 1000;

This query was consistently taking 30+ seconds to complete. It’s a simple query, and I knew we had an index on this table. The cause wasn’t immediately apparent.

Connecting to the Database

The first step was to connect to the database. We take security very seriously, so both our staging and production databases are locked down inside a private VPC on AWS, with no access to the public internet.

I set up an SSH jump box to connect to the staging database. See here for more details on how to do so (I wouldn’t recommend this approach for a production DB).

What does our data look like?

Next, I wanted to explore whether the data distribution might be contributing to the problem. We had other demos on staging that were performing without issues, so the slowdown was limited to specific assets.

There are 7,000,000 rows in the asset_metric table. While not tiny, I’d expect better performance than what we’re seeing. The 7 million rows are spread between 106 different assets. There are also 710 asset / metric pairs (speed, temperature, elevation and so on).

Rows:                 7,050,951
Unique assets:        106
Asset / metric pairs: 710

I next ran a few queries to understand the data distribution. 84% of the data is associated with just ten assets in our staging environment. The dataset is very highly skewed.
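
The queries themselves were simple aggregates, something along these lines (a sketch, not the exact queries I ran):

-- Overall counts: rows, unique assets, asset / metric pairs
select
    count(*)                                as total_rows,
    count(distinct asset_id)                as unique_assets,
    count(distinct (asset_id, metric_name)) as asset_metric_pairs
from asset_metric;

-- Rows per asset, heaviest assets first
select asset_id, count(*) as row_count
from asset_metric
group by asset_id
order by row_count desc;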

Distribution of database rows by asset
Distribution of database rows by asset / metric pair

Testing Different Asset / Metric Pairs

Instead of testing the query on a single asset / metric pair, I compiled a list of nine asset / metric pairs. This would give a good indication of performance across the board. These are:

+-------+-------------+--------+
| Asset | Metric      | Rows   |
+-------+-------------+--------+
| 1     | speed       | 169000 |
| 2     | speed       | 100000 |
| 3     | speed       |  75000 |
| 4     | speed       |  20000 |
| 5     | speed       |  10000 |
| 6     | speed       |   5000 |
| 7     | speed       |   2000 |
| 8     | temperature |    700 |
| 9     | speed       |    400 |
+-------+-------------+--------+

We have a good spread, from metrics that update very frequently to those that update very infrequently. This would help show whether the issue happens across the board or only in certain circumstances.

How I measured query performance

As I was connecting to the database via the jump box, network latencies would affect round-trip query times. Instead, I used explain analyse to measure the time it took for the DB to execute the query. This allowed me to measure performance directly in the database without worrying about the network’s variability.
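
For each pair, the measurement looked something like this (with the real asset id in place of the placeholder):

explain analyse
select * from asset_metric
where
    asset_id = 'abc123' and metric_name = 'speed'
order by timestamp desc
limit 1000;

explain analyse actually runs the query and reports the plan the database chose along with the real execution time, which is exactly what we want here.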

The Current Query

I ran the queries on each of the nine asset / metric pairs to get a baseline.

The results surprised me. In 8 of the 9 cases, the query planner used a backwards index scan. I had assumed it would be a sequential scan, given the time taken. Intuition is good, but it pays to confirm your intuition. Instead of wasting my time going down a rabbit hole of why the index wasn’t being used, I now knew the issue was the index’s performance.

The only time it used a sequential scan was for the temperature metric. There are orders of magnitude more speed than temperature metrics, so that’s likely the reason.

It’s already using an index scan, and Index Scans are Fast, so why the slow results for all but the most frequent metrics?

Un-reversing the Index

The first observation is that the database was performing a backwards index scan. Backward scans can be slower than a standard index scan for multicolumn indexes (which is what we have).

Given that we almost always query the latest data (with order by timestamp desc), it seems logical to order our index with the timestamp descending. So, I created a new index:

create index on asset_metric (timestamp desc, asset_id, metric_name);

and reran the queries.
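
A quick aside: by default, create index locks the table against writes while it builds. That didn’t matter on our staging database, but on a production table you’d likely want Postgres’s concurrently option, which builds the index without blocking writes (at the cost of a slower build, and it can’t run inside a transaction block):

-- Same index, built without blocking writes to the table
create index concurrently on asset_metric (timestamp desc, asset_id, metric_name);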

No change for the temperature metric (still a sequential scan), but progress everywhere else! We’ve significantly improved the query performance for the other slow asset / metric pairs by changing to a forward index scan…

…but it’s still terrible. Four seconds is better than 25 seconds, but still way too slow for our needs. We’re all about building great experiences for our customers, and waiting seconds for a simple graph to load is not that.

Simplifying the Index

Maybe the index is too complicated and is causing issues for the query planner? I tested that assumption by creating a new index using only the metric name and asset id:

create index on asset_metric (metric_name, asset_id);

I figured this would make it easy to find the relevant data for each asset / metric pair.

And sure enough, the results looked great! We’d gone from 4 seconds down to less than ten milliseconds, a 400x improvement!

Except for the one asset / metric pair that still took 11 seconds to return the query. I ran the query multiple times to make sure it wasn’t a weird quirk, but it consistently took 10+ seconds.

The query planner was still using the original index for the more frequent asset / metric pairs (I had removed the un-reversed index by this point so that I was only testing one variable at a time).

This wasn’t the case for the less frequent pairs. Here, it used a Bitmap Index Scan and combined both the original index and the new index.

Is there a way to eliminate that annoying edge case and get excellent performance across the board?

Improved Index: Changing Column Order

The previous index gave a major clue: it was useful to search by the asset id and metric name for the less frequent pairs.

Does column ordering matter with multicolumn indexes? It does.

The canonical example is a phonebook. Much like a phonebook is sorted first by last name, then by first name, multicolumn indexes are sorted by the first column, then the second column, then the third, and so on.

Just as you can’t look somebody up by their first name alone in a phonebook, you can’t use a multicolumn index to look up a row by its second or third columns.
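
To make that concrete, consider our original index against the query we’ve been running:

-- Original index: sorted by timestamp first
-- create index on asset_metric (timestamp, asset_id, metric_name);

select * from asset_metric
where
    asset_id = 'abc123' and metric_name = 'speed'
order by timestamp desc
limit 1000;

The filter only touches the second and third index columns, so Postgres can’t jump straight to the rows for that asset and metric. The best it can do is walk the whole index in timestamp order and discard non-matching rows as it goes, which is the slow backwards scan we saw earlier.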

We always look up metric data by asset id and metric name, so it seems logical that these should be the first columns in the index. I created the following index:

create index on asset_metric (metric_name, asset_id, timestamp desc);

Here, we change the order of the columns to ensure the index is sorted first by metric_name and asset_id. We also change timestamp to be descending, as per our earlier results. Let’s see how this performs:

Excellent performance across the board! All queries ran consistently in less than two milliseconds.
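
As a final bit of cleanup, the interim indexes from the experiments above would now just slow down writes for no benefit, so they can be dropped. The names below assume Postgres’s auto-generated defaults; check the actual names with \d asset_metric in psql first:

-- Assumed auto-generated names; verify before dropping
drop index if exists asset_metric_metric_name_asset_id_idx;
drop index if exists asset_metric_timestamp_asset_id_metric_name_idx;

Whether the original timestamp-first index is still worth keeping depends on whether any other queries filter primarily on time.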

We managed to improve our queries by 15,000x in the worst case! There’s nothing revolutionary about the steps we took above. It’s a typical example of the kind of DB optimization you have to make in the real world. Modern databases can handle incredible amounts of data.

A basic understanding of how databases work can make a world of difference. Instead of choking at 7 million rows, our table should now be able to scale to hundreds of millions of rows for this use case. It took me no more than two hours to find the bottleneck and fix it. All it took was a simple tweak to our index.
