Backward-compatible database migrations

Tute Costa February 5, 2020

My team has a policy of not applying destructive changes to our database. We don’t, for example, remove a column or table when deprecated, and we don’t rename either, as renaming is akin to removing a name the code uses.

Our deployment process performs database changes about a minute before the related code changes go live. During this time, Rails might run database queries using an out-of-date schema, erroring out. As the usage of our app continues to increase, even sub-second rename migrations bring down several in-flight requests.

For some time, we dealt with this by deploying after hours, with the consequence of having to work after hours. Also, this doesn’t guarantee that no users will be using the new mobile features we released in 2019.

The best way we found to apply database changes without interruptions is to do them in two separate deployments:

  1. Configure the model to ignore the deprecated field (we use Rails’ self.ignored_columns API)
  2. Run the destructive migration in the next deployment


Example

Once we removed the column offices.activation_token. First, we configured ActiveRecord to ignore it:

1
2
3
class Office < ApplicationRecord
  self.ignored_columns = %w(activation_token)
  # ...

With that change, ActiveRecord won’t use that name in any of its Office queries. If it’s still actively used in any part of the codebase, our tests will fail, telling us where to update code (assuming good test coverage, otherwise the app will fail during manual tests).

After deploying the first step to production, we safely deploy the migration:

1
2
3
4
5
class RemoveActivationTokenFromOffices < ActiveRecord::Migration[6.0]
  def change
    remove_column :offices, :activation_token, :string
  end
end

The migration deletes the column from our database, and because of step 1, we know it is unused. We may now clean up the temporary code added in step 1 and deploy safely at any time of day.

The two-stage deployment approach is safer, but is it worth it to take this precaution even when a tiny percent of our traffic is affected for a given day?

Well, yes! Let me tell you how it can go much worse.


One day…

One day, we let a rename_column migration go by in code review. The table was small enough, the developer was new to our team, his pair-buddy was on scheduled PTO, and the ones who noticed the issue didn’t GitHub-reject it in time.

After he merged the seemingly safe commit, I merged a time-consuming migration, diligently filling in our Pull Request template for changesets with DB changes, where I specified it would take a while to run and that we should publish after hours. I edited the comment to mention the person deploying the code, without knowing that edits in GitHub don’t send notifications as I expected.

So he did a routine deployment to production. During working hours. During a weekday. After the first migration committed, the running code in the servers went expectedly stale (notably, the first page in our users’ workflow; many exception notifications started coming in).

Then my UPDATE ran for 9.5 minutes, delaying in turn the updated code from running alongside its first migration, blocking users from authenticating. Hundreds of users saw our error screen that afternoon.

A minute after the SQL query finished, the code updated, and everything went back to normal (with some of us losing hair in the process).

That morning, as per usual, the deployer sent a report for the commits he would publish in the afternoon. I should have verified whether he was aware of the non-standard deployment, but our deployments have been sailing smoothly for months, and I forgot. After all, we had the checklist in GitHub already; not good enough, as I learned that day.

That afternoon he announced again he’d deploy soon. And soon after, we started encountering the errors from notification services and client support messages.

Steps we took to prevent this from happening again

The deploy report script now warns when there are migrations. I added the following snippet:

1
2
3
4
5
6
7
8
migrations=`git log $last_deploy..$release_candidate db/migrate | wc -l`;
if [ "$migrations" -gt 0 ]
then
       echo "DEPLOYMENT CONTAINS DB MIGRATIONS"
       git diff $last_deploy..$release_candidate --stat db/migrate
       echo "- Are they destructive? Consider making backward compatible"
       echo "- Do they affect large tables? Consider deploying after hours"
fi

The Data team is now the GitHub CodeOwner of db/**/*, so they get automatically notified when a changeset touches database files, and they can spot potential issues preventively.

We are discussing whether we add the strong_migrations gem to disallow such merges (PR template checklist and code reviews weren’t good enough).


We now have more automatic safety checks in place, making perfect storms like this less prone to happen in our future.