Say No to Randos (in Your Database)
29 Nov 2017
When I used my first ORM, I wondered "why didn't they include a random() method?" It seemed like such an easy thing to add. While there are many reasons you may want to pull a record out of your database at random, you shouldn't be using SQL's ORDER BY RANDOM() unless you'll only be randomizing a limited number of records. In this post, we'll examine how such a simple-looking SQL operator can cause a lot of performance pain, as well as a few different techniques we can use to fix it.
As you might know, I run CodeTriage, the best way to get started helping open source, and I’ve written about improving the database performance on that site:
- Finding Slow queries with Rack Mini Profiler
- Finding Slow Queries with Heroku’s Expensive Query Dashboard
- How I Reduced my DB server load by 80%
Recently, I was running the heroku pg:outliers command to see what it looked like after some of those optimizations, and I was surprised to find I was spending 32% of database time in two queries with a RANDOM() in them.
$ heroku pg:outliers
14:52:35.890252 | 19.9% | 186,846 | 02:38:39.448613 | SELECT "repos".* FROM "repos" WHERE (repos.id not in (?,?)) ORDER BY random() LIMIT $1
08:59:35.017667 | 12.1% | 2,532,339 | 00:01:13.506894 | SELECT "users".* FROM "users" WHERE ("users"."github_access_token" IS NOT NULL) ORDER BY RANDOM() LIMIT $1
Let’s take a look at the first query to understand why it’s slow.
SELECT
"repos".*
FROM "repos"
WHERE
(repos.id not in (?,?))
ORDER BY
random()
LIMIT $1
This query runs once a week to encourage users who have an account, but aren't subscribed to help out, to sign up to "triage" issues on an open source repo. We send an email with 3 repo suggestions, including a random repo. That seems like a good use of RANDOM(); after all, we literally want a random result. Why is this bad?
While we're telling Postgres to only give us 1 record, the ORDER BY random() LIMIT 1 doesn't only do that. It orders ALL the records before returning one. While you might think it's doing something like Array#sample, it's really doing Array#shuffle.first. When I wrote this code, it was pretty dang fast because I only had a few repos in the database. But now there are 2,761 repos and growing. And EVERY time this query executes, the database must load rows for each of those repos and spend CPU power to shuffle them.
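Conceptually, the query above behaves as if it were written like this (the ids in the NOT IN list are just placeholder values): a random number gets computed for every row that makes it past the WHERE clause, all of those rows get sorted by that number, and only then does the LIMIT kick in.
SELECT
"repos".*,
RANDOM() AS sort_value /* evaluated once for every matching row */
FROM "repos"
WHERE
(repos.id NOT IN (1, 2)) /* placeholder ids */
ORDER BY
sort_value
LIMIT 1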
You can see another query that was doing the same thing with the user table:
=> EXPLAIN ANALYZE SELECT "users".* FROM "users" WHERE ("users"."github_access_token" IS NOT NULL) ORDER BY RANDOM() LIMIT 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Limit (cost=1471.00..1471.01 rows=1 width=2098) (actual time=12.747..12.748 rows=1 loops=1)
-> Sort (cost=1471.00..1475.24 rows=8464 width=2098) (actual time=12.745..12.745 rows=1 loops=1)
Sort Key: (random())
Sort Method: top-N heapsort Memory: 26kB
-> Seq Scan on users (cost=0.00..1462.54 rows=8464 width=2098) (actual time=0.013..7.327 rows=8726 loops=1)
Filter: (github_access_token IS NOT NULL)
Rows Removed by Filter: 13510
Total runtime: 12.811 ms
(8 rows)
It takes almost 13ms for each execution of this relatively small query.
So if ORDER BY RANDOM() is bad, what do we fix it with? This is a surprisingly difficult question. It largely depends on your application and how you're accessing the data.
In my case, I fixed the issue by generating a random ID and then pulling that record. In this instance, I know that the IDs are relatively contiguous, so I pull the highest ID, pick a random number between 1 and @@max_id, then perform a query that grabs a record with an id >= that value.
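In SQL terms, the approach looks roughly like this two-step sketch (the max id is cached in my application code, and the random number between 1 and that max is generated there too, not in the database):
/* Step 1: grab the highest id; the application caches this value */
SELECT MAX(id) FROM users;

/* Step 2: the application picks a random integer between 1 and that max
   and substitutes it for the literal below */
SELECT "users".*
FROM "users"
WHERE "users"."github_access_token" IS NOT NULL
AND id >= 42 /* random value chosen by the application */
LIMIT 1;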
Is it faster? Oh yeah. Here's the same query as before with the RANDOM() replaced:
=> EXPLAIN ANALYZE SELECT "users".* FROM "users" WHERE ("users"."github_access_token" IS NOT NULL) AND id >= 55 LIMIT 1;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..0.17 rows=1 width=2098) (actual time=0.009..0.009 rows=1 loops=1)
-> Seq Scan on users (cost=0.00..1469.36 rows=8459 width=2098) (actual time=0.009..0.009 rows=1 loops=1)
Filter: ((github_access_token IS NOT NULL) AND (id >= 55))
Total runtime: 0.039 ms
We went from ~13ms to sub-1ms query execution time.
Note: If you're curious about the "contiguous" requirement, read these comments on Reddit.
There are some pretty severe caveats here to watch out for. My implementation caches the max id, which is fine for my use cases, but it might not be for yours. It's possible to do this entirely in SQL using something like:
WHERE
/* ... */
AND id IN (
SELECT
FLOOR(
RANDOM() * (SELECT MAX(id) FROM issues)
) + 1
)
As always, benchmark your SQL queries before and after an optimization. This implementation doesn't handle sparsely populated ID values very well, and doesn't account for randomly selecting an id that is greater than any id actually available once your WHERE conditions are applied. Essentially, if you were to do it "right", you would need to apply the same WHERE conditions to the subquery for MAX(id) as to your main query.
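For example, a more careful version of the snippet above might look something like this (still just a sketch; the state = 'open' condition is a hypothetical stand-in for whatever filters your real query applies):
SELECT
*
FROM issues
WHERE
state = 'open' /* hypothetical filter, repeat whatever your real WHERE uses */
AND id IN (
SELECT
FLOOR(
RANDOM() * (SELECT MAX(id) FROM issues WHERE state = 'open')
) + 1
)
Even then, if the randomly chosen id happens to be missing or filtered out, the query simply returns zero rows, so the sparse-ID caveat above still applies.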
For my cases, it's fine if I get some failures, and I know that I'm only applying the most basic of WHERE conditions. Your needs might not be so flexible.
If you're thinking "is there no built-in way to do this?", it turns out there is: TABLESAMPLE, which was introduced in Postgres 9.5. Thanks to @HotFusionMan for introducing it to me.
Here's the best blog I've found on using TABLESAMPLE. The downside is that it's not "truly random" (if that matters to your application), and you cannot use it to retrieve only 1 result. I was able to hack around that by writing a query that sampled only 1% of the table, used that 1% to get ids, and then limited to the first record. Something like:
SELECT
*
FROM repos
WHERE
id IN (
SELECT
id
FROM repos
TABLESAMPLE SYSTEM(1) /* 1 percent */
)
LIMIT 1
I was also told in a Reddit thread on /r/postgres that there’s another way to do this.
While this works and is much faster than ORDER BY RANDOM() for queries returning LOTS of data (thousands or tens of thousands of rows), it's very slow for queries that have very little data.
When I was optimizing https://www.codetriage.com, I found another query that uses RANDOM(). It is being used to find open source issues for a specific repo. Due to the way issues are stored, the IDs are not very contiguous, so my previous trick (grabbing a record with an id >= a random value) wouldn't work as well. I needed a more robust way to randomize the data, and I thought perhaps TABLESAMPLE might perform better.
While some repos have thousands of issues, 50% have 27 or fewer issues. When I used the TABLESAMPLE technique for this query, it made my small queries really slow, and my previously slow queries fast. Since my numbers skew towards the small side for that query, it wasn't a net gain, so I stuck with the original RANDOM() method.
Update: To clarify, I know that random() is not slow, and the article does not say that it is. It says ORDER BY random() is slow. It's the ordering of a large number of records that causes the pain.
Update 2: No, using a random offset is not faster than the >= method with IDs. In fact, using an offset over a large number of records can also be a huge perf problem.
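For reference, the offset approach in question looks something like this (a sketch; the offset would be a random number, capped at the row count, picked by the application):
SELECT "users".*
FROM "users"
WHERE "users"."github_access_token" IS NOT NULL
LIMIT 1 OFFSET 5000; /* random offset chosen by the application */
Postgres has to generate and then throw away every row before the offset, so a large random offset still ends up scanning a big chunk of the table.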
Have you replaced RANDOM() with another, more efficient technique? Let me know about it on Twitter @schneems.