Skip to content

stevegrossi

Migrations

Tended 2 years ago (3 times) Planted 3 years ago Mentioned 2 times

Contents

A form of technical Change management and an essential engineering skill, migrations involve replacing elements of a running system over time. On the smaller end, you might be renaming a column on a large and frequently-accessed database table over the course of a day. On the larger end, replacing all of Github’s permissions code incrementally over the course of two years or even, when necessary, gradually replacing entire legacy systems.

A General Approach and Example

Migrations come into play when you can’t easily or safely replace the old element with the new element in a single change. The general pattern I’ve seen work is to:

  1. Introduce the new element alongside the old one
  2. Gradually switch from the old element to the new one, preserving your ability to switch back if needed
  3. Once you’re entirely using the new element, remove the old one

A simple example is renaming a frequently-accessed column on a large database table. There are at least two reasons not to just use SQL’s RENAME COLUMN:

  • RENAME COLUMN takes out an ACCESS EXCLUSIVE lock on the table, preventing reads and writes for the duration of the operation. If this takes more than a few seconds (such as on a large table), queries on this table from your application will hang, potentially leading to timeouts and downtime.
  • Renaming a column breaks the API contract of your code. Especially on a high-throughput system, running code looking for the old column name will fail after it’s renamed, and new code looking for the new column name will fail until it’s renamed. Even if new code is deployed at exactly the time you run RENAME COLUMN, one version of the code will fail for the duration of that operation, resulting in errors.

To safely rename a column requires a multi-stage approach:

  1. Add a new column new_column_name to the table while keeping old_column_name in place
  2. Deploy code to write all future updates to both columns so that they always contain the same data moving forward, while continuing to read from old_column_name.
  3. Populate new_column_name with the value of old_column_name for all records created prior to step 2 (often called a “backfill”). At this point, we should have a guarantee that new_column_name and old_column_name contain the same data for all rows. A query like SELECT COUNT(*) FROM table WHERE new_column_name <> old_column_name, which should remain 0, can help confirm this.
  4. Deploy code to now read from new_column_name.
  5. Assuming your app continues to work as expected, you can now safely remove the old_column_name.

Of course, this is quite a bit more work than a single RENAME COLUMN query, so it’s worth considering the tradeoff between development effort and avoiding downtime. But as your app and user-base grow the tradeoff typically tilts in favor of avoiding downtime.

Antipatterns

The most common anti-pattern I’ve seen is migrations that are never completed. This can happen for a number of reasons, most of them sociotechnical, e.g.

  • In large apps, it can be difficult to track down all uses of the old element for replacement with the new element
  • On large teams, colleagues might be adding new uses of the old element at the same time you’re trying to replace it with the new one. This is where a ratchet approach comes in handy, halting use of the old element moving forward. Better yet, Martin Fowler also suggests wrapping the old element in an abstraction layer so that callers have a unified API to call for the duration of the migration.
  • In changing organizations (which is to say, almost all of them), priorities shift and people get moved or leave, which can leave migrations in an unfinished state. We can mitigate this risk by limiting the duration of migrations, keeping them as small as possible and staying focused on moving them forward.

Unfinished migrations impose a significant burden, so it’s best to avoid these by ensuring migrations are followed through on or rolled back. Some of those burdens are

  • Developer productivity: with in-progress migrations both the old and new elements are in place, and if the migrations is ever to be completed all new code will need to interact with both elements (potentially doubling the cost of development) until the old one can be removed.
  • Performance: there is a performance cost to updating both the old and new elements until a switchover can be made, making your app slower for users and/or more costly to run.

Resources

Mentions

  • ratchet approach

    …The ratchet approach is a valuable tool for managing technical [[Migrations]]. One example is the colorfully-named [shitlist approach](https://sirupsen…

  • sociotechnical systems

    …experience](https://martinfowler.com/articles/patterns-legacy-displacement/) with failed [[Migrations]] off of legacy systems resulting from focusing only on the…