How on earth does someone accidentally delete 85% of their users’ GitHub tokens? I suspected something might be wrong when I got an email from a service I run called CodeTriage, a free web app that helps people find open source projects and issues to work on. While I get plenty of emails from my service, I don’t often get ones with the subject line “Code Triage auth failure”. Before we can understand what happened, let’s look into why this email even exists.

For CodeTriage to work, it needs info from GitHub. Specifically, it needs to know about all the issues an open source library has open. To do that, we need to make authenticated API requests, and to make API requests, we need an API token. Now while an API token is good, even better is a VALID API token, which unfortunately the system would lose from time to time.

When an invalid API token is in the system, it causes random failures in the chain of pulling in issues, so I took steps to mitigate the problem. Once a week, a script cycles through all users and uses the GitHub API to check each token. If a token is bad, I remove it from the database and email the user to let them know they need to re-authorize.

That’s why my system deletes tokens. Now let’s look at the code; see if you can spot the problem that caused me to delete 85% of my users’ tokens.

I use a home-rolled GitHub API library I wrote called git_hub_bub, mostly because none of the existing libraries gave me enough control, and also because I thought the name was funny. Mostly because I thought the name was funny. I implemented the logic to check a token on my User model:

def auth_is_valid?
  GitHubBub.get("https://#{ENV['GITHUB_APP_ID']}:#{ENV['GITHUB_APP_SECRET']}@api.github.com/applications/#{ENV['GITHUB_APP_ID']}/tokens/#{self.token}", {}, token: nil)
  true
rescue GitHubBub::RequestError
  false
end

Then I hacked together a quick task to cycle through each user and check their token:

task check_user_auth: :environment do
  User.find_each(conditions: "token is not null") do |user|
    if user.auth_is_valid?
      # Do nothing, auth is good
    else
      user.update_attributes(token: nil)
    end
  end
end

When a token is not valid, I remove it, because why would I want a bad token in my database?

Usually when someone’s token is invalid, it’s because they made an update to their profile, not because they revoked the token on purpose. Either way, to fix the issue we need a new token.

To get a new token, a user can log back into the system, which will automatically update their GitHub credentials. To let people know that their tokens are invalid and need to be updated (along with instructions for doing so), I wrote another task that sends out emails:

task warn_invalid_token: :environment do
  User.find_each(conditions: "token is null") do |user|
    next unless Date.today.thursday?
    ::UserMailer.invalid_token(user).deliver
  end
end

We don’t want to swamp people with emails, so we only send this once a week until they log back in to update their token.

You win points if you guess what day the emails go out on.

I implemented this logic back in 2014 and for nearly 3 years it ran fine. Occasionally people would get bad tokens, but then they would re-auth and be on their way.

When I got that fateful email, I knew something was wrong because I hadn’t deployed any code recently and I didn’t modify my GitHub account. So what was up?

As soon as I could, I went to the console and did a token check. Yes, my token was gone, so the email was correct. I then wondered if other people’s were missing too, and sure enough, out of roughly 20,000 users, about 17,000 were missing tokens. My jaw just about dropped to the floor.
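The console check itself was nothing fancy, roughly along these lines (a reconstruction from memory; the exact queries aren’t important):

# In a production Rails console
User.count                   # => roughly 20,000 users total
User.where(token: nil).count # => roughly 17,000 users now missing a token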

It turns out there was a bug in my logic, but not necessarily in my code. After all, it had run flawlessly for a few years. So if my code was fine, where was the bug?

Looking at the update time of some of the records, I was able to place them roughly around the time of another event: a GitHub outage.

So while my code was correctly looping through and checking all the tokens, it was also dutifully deleting them when they came back as “bad”. The thing was, during the outage ALL requests were coming back without a success status code. Nearly all of the tokens on CodeTriage were deleted before the GitHub servers came back up.

For me this wasn’t actually the end of the world. I don’t need EVERY user to have a token, just enough of them. Once I realized how the failure happened, I put some guards in place:

1) Make an API call to check that the API is up

response          = Excon.get("https://status.github.com/api/status.json").body
github_api_status = JSON.parse(response)["status"]
next unless github_api_status == "good"

So now if the GitHub API is down, this guard trips before we run any of our token checks. The API could still fail mid-way through the list, but this at least prevents kicking off the checks in the middle of a known GitHub outage.

2) Duplicate token checks

Most of the tokens are valid, so when one comes back as “invalid” we can spend more time verifying that it’s not a mistake. To do this we check the same token 3 times before deleting it.
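Here is a minimal sketch of what that retry might look like; the helper name, retry count argument, and pause are illustrative rather than the exact code in the app:

# Re-check a seemingly bad token a few times before trusting the result,
# since a single failure could be a transient network blip or API hiccup.
def auth_definitely_invalid?(attempts: 3)
  attempts.times do
    return false if auth_is_valid? # one successful check keeps the token
    sleep 1                        # brief pause before re-checking
  end
  true
end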

3) Don’t delete the tokens

This was a no-brainer. Instead of deleting the tokens, I’m now moving them to a new field, old_token. If such a mass token event happens again, I can recover much more easily. I contacted GitHub support after my mass token deletion, and the first thing they asked me for was an example token, which I didn’t have because I had deleted all of them. Keeping a record of your really important values can be a good idea.
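The change itself is roughly a one-column migration plus a tweak to the task (a sketch; the actual migration and task code in the app may differ):

# Add somewhere to park "bad" tokens instead of destroying them outright
class AddOldTokenToUsers < ActiveRecord::Migration
  def change
    add_column :users, :old_token, :string
  end
end

# In the check task: keep a copy of the value we're about to clear
user.update_attributes(old_token: user.token, token: nil)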

On this theme, you might be tempted to use a gem like acts_as_paranoid. I would say: don’t. I’ve heard a lot of complaints about this gem, mostly around the sheer amount of data bloat it causes.

What if this had been mission-critical data?

If these tokens really were irreplaceable, how would I have recovered? I’m running on Heroku (I work there), and my “standard” Postgres instance includes point-in-time rollback for up to 4 days. This means I can pick an arbitrary day and time and generate a new database with the data available at that moment. You can read about how to do this in the documentation. It’s also worth mentioning that “premium” databases have a point-in-time rollback window of 10 days.

If this really had been a “stop the world” event, I could have rolled back in time, gotten all the tokens back, and been up and running fairly quickly.

If you’re not running on Heroku, you should set up continuous archiving with a tool such as WAL-E or WAL-G.

Luckily for me the impact wasn’t that severe, and the service was able to run just fine without this data. On the other hand, if the impact had been more severe, I would have had no option but to roll back, and then I would have had all the tokens again.

It’s also worth mentioning that you can schedule periodic backups of your database using the heroku pg:backups:schedule command; however, this puts load on the database while the backup is being taken. It also prevents maintenance tasks from running on the database.

I checked in with our database team, and the consistent story seems to be: don’t use it unless you really need to. Even if I did have a pg:backup, there would be a gap in time between when the backup was taken and when the tokens were deleted (meaning I might have only been able to recover most of the tokens instead of all of them). On the standard plan I’m eligible to store 25 backups.

If your DB is under heavy load, you can also add a follower database and take the backup from the follower instead. There’s a whole article on all the ways to back up your database if you’re interested.

As the saying goes, you own your uptime, and that includes service failures and data loss. It’s important to think about how your service will be affected when sensitive data goes missing. The biggest mistake I made wasn’t the code I wrote; it was not thinking about the edge case that someone else’s API might go down.

Your service will fail. Rather than only trying to prevent failure, we can own recovery too. Does everything go down, or will failures be graceful? When you lose data, is it gone for good, or can you recover it?

I’m in the process of patching git_hub_bub to make checking for an explicit error code easier so I can harden my API token checks, for example only deleting a token on an explicit 404 instead of on ANY non-200 result. Also, some helpful souls on lobste.rs pointed out that I could do this check when I get a 401 from GitHub instead of in a batch. Regardless of fixes to the logic and handling of edge cases, it’s important that when my service falls down, there’s a way for it to get back up.
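For the curious, the hardened check might end up looking something like this; it assumes the patched GitHubBub::RequestError exposes the HTTP status of the failed response, which is my guess at the future interface rather than the library’s current API:

def auth_is_valid?
  GitHubBub.get("https://#{ENV['GITHUB_APP_ID']}:#{ENV['GITHUB_APP_SECRET']}@api.github.com/applications/#{ENV['GITHUB_APP_ID']}/tokens/#{self.token}", {}, token: nil)
  true
rescue GitHubBub::RequestError => e
  # Assumed interface: only a definitive "not found" means the token is gone;
  # anything else (outage, rate limit, 5xx) re-raises instead of deleting.
  raise unless e.status == 404
  false
end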

Update: Made it clear in the post that I intend to patch the original failure mode. Added a shout-out to a good idea on Lobsters.