Backward-compatible database migrations
My team has a policy of not applying destructive changes to our database. We don’t, for example, remove a column or table when it is deprecated, and we don’t rename them either, since a rename effectively removes a name the running code still uses.
Our deployment process applies database changes about a minute before the related code changes go live. During that window, Rails can run queries against an out-of-date schema and error out. As usage of our app keeps increasing, even a sub-second rename migration brings down several in-flight requests.
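Borrowing the column name from the example later in the post, here is roughly what that failure window looks like (a sketch, assuming PostgreSQL):

```ruby
# Illustrative only: the pre-deployment code still running on the servers
# keeps writing to the column the migration just removed.
office = Office.find(office_id) # office_id: placeholder for a real lookup
office.update!(activation_token: SecureRandom.hex)
# The generated UPDATE references the dropped column, so it fails with
# something like:
#   ActiveRecord::StatementInvalid (PG::UndefinedColumn):
#     column "activation_token" of relation "offices" does not exist
```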
For some time, we dealt with this by deploying after hours, with the consequence of having to work after hours. Even then, there is no guarantee that nobody is using the app, especially the mobile features we released in 2019.
The best way we found to apply database changes without interruptions is to do them in two separate deployments:

- Configure the model to ignore the deprecated field (we use Rails’ `self.ignored_columns` API)
- Run the destructive migration in the next deployment
Example
At one point we removed the column `offices.activation_token`. First, we configured ActiveRecord to ignore it:
```ruby
class Office < ApplicationRecord
  self.ignored_columns = %w(activation_token)
  # ...
end
```
With that change, ActiveRecord won’t use that name in any of its `Office` queries. If the column is still actively used in any part of the codebase, our tests will fail and tell us where to update the code (assuming good test coverage; otherwise the app will fail during manual testing).
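For instance, attribute-style access stops working once the column is ignored, which is what surfaces in the test suite (a rough sketch, not an actual spec of ours):

```ruby
# With activation_token ignored, ActiveRecord no longer defines the attribute,
# so any leftover attribute-style access blows up:
office = Office.new
office.activation_token
# => NoMethodError (undefined method `activation_token')
```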
After deploying the first step to production, we safely deploy the migration:
```ruby
class RemoveActivationTokenFromOffices < ActiveRecord::Migration[6.0]
  def change
    remove_column :offices, :activation_token, :string
  end
end
```
The migration deletes the column from our database, and thanks to step 1 we know nothing uses it anymore. We can now clean up the temporary `ignored_columns` entry from step 1 and deploy safely at any time of day.
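After that cleanup the model is back to its original shape (sketch):

```ruby
class Office < ApplicationRecord
  # The ignored_columns line from step 1 is gone: the column no longer exists
  # in the database and nothing in the code references it.
  # ...
end
```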
The two-stage deployment approach is safer, but is it worth taking this precaution even when only a tiny percentage of our traffic would be affected on a given day?
Well, yes! Let me tell you how it can go much worse.
One day…
One day, we let a `rename_column` migration go by in code review. The table was small enough, the developer was new to our team, his pair-buddy was on scheduled PTO, and the ones who noticed the issue didn’t reject it on GitHub in time.
After he merged the seemingly safe commit, I merged a time-consuming migration, diligently filling in our Pull Request template for changesets with DB changes, specifying that it would take a while to run and that we should publish it after hours. I then edited the comment to mention the person who would deploy the code, not knowing that GitHub doesn’t send notifications for mentions added in an edit.
So he did a routine deployment to production. During working hours. On a weekday. After the first migration committed, the code still running on the servers was, predictably, stale (notably on the first page of our users’ workflow; exception notifications started pouring in).
Then my `UPDATE` ran for 9.5 minutes, which in turn delayed the updated code from going live alongside the first migration, blocking users from authenticating the whole time.
Hundreds of users saw our error screen that afternoon.
A minute after the SQL query finished, the code updated, and everything went back to normal (with some of us losing hair in the process).
That morning, as per usual, the deployer sent a report of the commits he would publish in the afternoon. I should have verified whether he was aware of the non-standard deployment, but our deployments had been sailing smoothly for months, and I forgot. After all, we already had the checklist in GitHub; not good enough, as I learned that day.
That afternoon he announced again that he’d deploy soon. And soon after, the errors started coming in from our notification services and from client support messages.
Steps we took to prevent this from happening again
The deploy report script now warns when there are migrations. I added the following snippet:
```sh
# Count the commits between the last deploy and the release candidate that
# touch db/migrate (--oneline gives one line per commit).
migrations=`git log --oneline $last_deploy..$release_candidate -- db/migrate | wc -l`
if [ "$migrations" -gt 0 ]
then
  echo "DEPLOYMENT CONTAINS DB MIGRATIONS"
  git diff $last_deploy..$release_candidate --stat -- db/migrate
  echo "- Are they destructive? Consider making backward compatible"
  echo "- Do they affect large tables? Consider deploying after hours"
fi
```
The Data team is now the GitHub CodeOwner of `db/**/*`, so they get automatically notified when a changeset touches database files and can spot potential issues preventively.
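The rule looks roughly like this in a CODEOWNERS file (the team handle here is made up):

```
# .github/CODEOWNERS
db/**/*  @our-org/data-team
```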
We are discussing whether to add the `strong_migrations` gem to disallow such merges outright (the PR template checklist and code reviews weren’t good enough).
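If we adopt it, the setup is essentially a Gemfile entry; the gem then aborts migrations it considers dangerous, such as the rename that bit us (a sketch based on the gem’s documented behavior, not our codebase):

```ruby
# Gemfile
gem "strong_migrations"

# With the gem in place, running a migration that calls rename_column raises
# StrongMigrations::UnsafeMigration with an explanation of a safer multi-step
# alternative, unless the operation is explicitly wrapped in
# safety_assured { ... }.
```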
We now have more automatic safety checks in place, making perfect storms like this one less likely to happen in the future.