This post is a slightly revised and extended version of a post that was originally published here.
These days our customers expect their applications to be up and running all the time, with no down-time at all. At the same time we need to be able to add new features to an existing application and fix defects. Is that even possible? Yes, it is, but it does not come for free. We have to make a certain effort to achieve what are called zero-downtime deployments.
If we need to achieve zero down-time, we cannot use the classical way of deploying new versions of our application, where we used to stop the current version and put up a maintenance page for all potential users who wanted to use the application while we were deploying. We would tell them something along the lines of:
“Sorry, but our site is currently down for maintenance. We apologize for the inconvenience. Please come back later.”
But we cannot afford to do that anymore. Every minute of down-time means a lot of missed opportunities and, with that, potential revenue. So we have to install the new version of the application while the current version is still up and running. For that we either need additional servers at hand onto which we can install the new version, or we need to find a way to run two versions of the same application on the same servers. This is also called a non-destructive deployment.
OK, now we have spoken about non-destructive deployments and zero-downtime deployments. There are various ways this can be achieved. The most popular ones are the rolling update, the blue-green deployment and the canary release. Let’s start with the most popular one, the rolling update, which is for example the default strategy in Kubernetes.
A rolling update is an update of an application (or service) of which we have multiple identical instances running, say all on version 1.0. When we now want to roll out a new version 1.1, we do this in such a way that at all times at least some instances are up and running, and the application thus remains available to its consumers. The process takes down one instance of the application and replaces it with a new instance running version 1.1. If the replacement went well, the next instance is updated, and so on, until all instances are on version 1.1. Only when all instances are updated is the rollout complete.
If a problem arises during the rollout, the process is stopped and, depending on the choice of the developers or DevOps engineers, either left at that point, or the already upgraded application instances are automatically rolled back to their previous version.
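The rolling-update logic described above can be sketched in a few lines of Python. This is a toy orchestrator for illustration only; the instance representation, the health check and the version strings are all assumptions, not part of any real tooling:

```python
# Minimal sketch of a rolling update: replace instances one at a time,
# health-check each replacement, and roll back on failure.

def rolling_update(instances, new_version, health_check, rollback_on_failure=True):
    """instances: list of dicts like {"id": 1, "version": "1.0"}."""
    upgraded = []
    for instance in instances:
        old_version = instance["version"]
        instance["version"] = new_version          # replace this one instance
        if not health_check(instance):             # replacement failed
            if rollback_on_failure:
                instance["version"] = old_version
                for done in upgraded:              # undo already upgraded ones
                    done["version"] = old_version
            return False
        upgraded.append(instance)                  # the remaining instances
    return True                                    # keep serving traffic

instances = [{"id": i, "version": "1.0"} for i in range(3)]
ok = rolling_update(instances, "1.1", health_check=lambda inst: True)
```

Note that at no point are all instances down at the same time, which is exactly what makes the update zero-downtime.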
Now what exactly is a blue-green deployment? In this case we have a current version up and running in production; we label this version blue. Then we install a new version of our application in production, this time labeled green. Once green is completely installed, smoke tested and ready to go, we funnel all traffic through it. After waiting for some time, until we are sure that no rollback is needed, the blue version is obsolete and can be decommissioned. The blue label is now free again: when we deploy the next new version, we will call it the blue version, and so on. We permanently switch from blue to green to blue to green.
Since we leave the current (blue) version running while we deploy the new (green) bits we usually have more time to execute the deployment. Once the new version is installed we can run some tests against it – also called smoke tests – to make sure the application is working as expected. We also use this opportunity to warm up the new application and potentially pre-fill the caches (if we use any) so that once it is hit by the public it is operating with maximal speed.
It is important to note that during this time the new application is not visible to the public; we can only reach it internally, e.g. to test it. Once we are sure the new version is working as expected, we reconfigure the routing from the current version to the new version. This reconfiguration happens (nearly) instantaneously, and all public traffic is now funneled through the new version. We can keep the previous version around for a while, until we are sure that no unexpected fatal error was introduced with the new version of the application.
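The near-instantaneous switch can be pictured as a single atomic reassignment of the router's live target. This is a deliberately simplistic sketch; the `Router` class and the version labels are invented for illustration:

```python
# Toy sketch of a blue-green switch: the router holds one pointer to the
# live deployment; switching all traffic is a single reassignment.

class Router:
    def __init__(self, live):
        self.live = live            # the deployment all public traffic goes to

    def route(self, request):
        return f"{self.live} handled {request}"

    def switch(self, new_live):
        self.live = new_live        # (nearly) instantaneous cut-over; the
                                    # previous deployment stays installed, so a
                                    # rollback is just another switch() back

router = Router(live="blue-v1.0")
router.switch("green-v1.1")         # green was smoke-tested internally first
```

Because the old deployment is left untouched, rolling back is simply calling `switch()` with the previous label again.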
A canary release is exactly the same as a blue-green deployment, with one important distinction: instead of switching to the new “green” version in one go, traffic to the application or the service is gradually shifted from blue to green. Initially we may funnel, say, only 1% of the traffic to green and leave the rest on blue. We then observe what happens with all requests that go through green. Ideally we have some key performance indicators that we can measure, comparing the new values with the ones that were valid for requests going to the blue version. If all looks OK, we funnel more and more traffic to green, until we reach 100%. If during this process of gradually funneling more and more traffic to green something goes wrong, we can flip all traffic back to blue in an instant.
The key here is, that we have some monitoring in place which gives us reliable feedback about the behavior of green compared to blue. This can be such things as the total duration of a request or the CPU or memory consumption of green compared to blue.
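One simple way to implement such a gradual traffic shift is deterministic bucketing: hash some stable request key into 100 buckets and send a bucket to green once it falls below the current canary percentage. This is a sketch under assumptions (the request-id scheme and the bucket count are invented for illustration):

```python
import zlib

def choose_backend(request_id, green_percent):
    """Route request_id to 'green' for roughly green_percent of requests."""
    bucket = zlib.crc32(request_id.encode()) % 100   # stable bucket 0..99
    return "green" if bucket < green_percent else "blue"

# gradually ramp up: 1% -> 10% -> 50% -> 100% of 10,000 sample requests
shares = {p: sum(choose_backend(f"req-{i}", p) == "green" for i in range(10000))
          for p in (1, 10, 50, 100)}
```

Because the bucketing is deterministic, a given client keeps hitting the same backend as long as the percentage is unchanged, which keeps the KPI comparison between blue and green clean.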
Bad things happen; it’s just a sad truth. Sometimes we introduce a severe bug with a new release of our software. This can happen no matter how well we test. In this situation we need a way to roll the application back to the previous known-good version. This rollback needs to be foolproof and quick.
When we are using zero-downtime or non-destructive deployment we gain this possibility for free. Since the new version has been deployed without destroying or overwriting the previous version the latter is still around as long as we want after we have switched the public traffic over to the new version. We thus “only” need to redirect the traffic back to the old version if bad things happen. Again, re-routing traffic is a near instantaneous operation and can be fully automated so that a rollback is absolutely risk-free.
Compare this with a rollback of a classically deployed application, where the deployment used to be destructive: once the new version was in place, the old version was gone. One had to find the correct previous bits and re-install them… a nightmare! Most often the steps to execute during a rollback were perhaps documented, but never exercised. A huge risk, and nerve-wracking for everyone.
Database Schema Changes
The attentive reader may now say: “Wait a second, what about the case where a deployment encompasses breaking database schema changes?”
This is an interesting question and it needs some further discussion. In a nutshell, we need to deploy schema changes separately from code changes and ALWAYS make the schema changes such that they are backwards compatible. I’m going to describe the how and why in much more detail in the remainder of this post.
Let’s now look at the steps needed to roll out breaking database schema changes. It is a three-step process:
- Roll out a backwards compatible schema change
- Roll out the code changes
- Roll out the backwards incompatible schema changes
In step 1 we roll out a first version of the schema changes that is still backwards compatible, such that the current (old) version of the application is not affected by the changes in any way. Only once this backwards compatible change has been successfully deployed can we, in step 2, roll out the new and modified code that uses the modified DB schema. This rollout can take any of the forms discussed above, for example a rolling update. Once we have made sure that the new code is working as expected and no rollback is required, we can prepare and execute step 3, in which we clean up the database schema, e.g. by removing obsolete tables, table columns, views or indexes, to name just a few possibilities.
Let’s make a simple example. Assume we have a table `Address` with, among others, a column `Street`. Further assume that the `Street` column contains the name of the street as well as the (optional) house number. We now have a feature request that the content of the `Street` column should be split into two new columns, `StreetName` and `HouseNumber`, where the latter is optional. Furthermore we want to remove the column `Street` from the table.
Step 1: Migrate schema & data
In this case our first step would be to roll out a script that:
- adds the two columns `StreetName` and `HouseNumber` to the table
- splits the data of the column `Street` into the street name and the house number according to given rules (e.g. by using a regular expression) and fills the new columns of the table with the calculated data
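To make the splitting rule concrete, here is a sketch of the logic in Python. The regular expression is an assumption for illustration; real-world street data will need more elaborate rules:

```python
import re

# Simplistic splitting rule: everything up to an optional trailing number
# is the street name; the trailing number (possibly with a suffix like
# "12a") becomes the house number, or None if there is none.
STREET_RE = re.compile(r"^(?P<name>.*?)\s*(?P<number>\d+\w*)?$")

def split_street(street):
    m = STREET_RE.match(street.strip())
    return m.group("name"), m.group("number")
```

The same rule would then be expressed in SQL in the actual migration script.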
In case you’re using a PostgreSQL database the migration script could look similar to the one below:
```sql
-- Schema change (the DEFAULT is needed so that existing rows
-- satisfy the NOT NULL constraint)
ALTER TABLE Address
  ADD COLUMN StreetName VARCHAR NOT NULL DEFAULT '',
  ADD COLUMN HouseNumber VARCHAR NULL;

-- Data migration (simplistic logic)
UPDATE Address
SET StreetName  = SPLIT_PART(Street, ' ', 1),
    HouseNumber = SPLIT_PART(Street, ' ', 2);

-- optional: add an index, e.g. on StreetName
CREATE INDEX StreetName_idx ON Address (StreetName);
```
When authoring the migration script, make sure you consider how much data is in the target database, as it will influence how long the migration takes and how it affects the responsiveness of the application while the script is running. You will want to avoid your application stalling because the database is overloaded.
Please note that while executing step 2 below we may, for a short time, have a state where both old and new versions of the application or service are up and running. In this case we need a mechanism to deal with address records that are created by the old instances, migrating their `Street` column live into the columns `StreetName` and `HouseNumber`. This can be done with a trigger. For PostgreSQL it would look similar to this:
```sql
-- A BEFORE trigger is required here: an AFTER ROW trigger cannot
-- modify the row that is being inserted or updated.
CREATE TRIGGER my_insert_trigger
  BEFORE INSERT OR UPDATE OF Street ON Address
  FOR EACH ROW
  EXECUTE PROCEDURE split_street();
-- provide implementation of function split_street ...
```
That is, when a new record is added, or an existing record has its `Street` column updated, the trigger executes the function `split_street`, which does the actual migration and fills the fields `StreetName` and `HouseNumber`.
We can then test the result of the rollout, either manually or with prepared automated regression tests. We specifically want to verify that known edge cases, such as non-existent house numbers, have been handled correctly by the schema (and data) migration.
Since we have not touched the existing data, that is, the column `Street` with its data is still part of the table `Address`, our application is not affected in any way.
Assuming that we use Docker containers to deploy and run our artifacts the whole process can be sketched like this:
Step 2: Deploy new version of app
In this step we can now roll out our code changes, that is, the new version of the application or service that uses the migrated schema and data. This rollout can happen in any of the ways mentioned above, for example as a rolling update. Since the previous step has left the database in a backwards compatible state, it is not a problem to temporarily have two different versions of the application co-existing in the system. The old instances of the application will continue using the `Street` column of the table until they are replaced by new instances; the new instances will immediately start using the new columns of the table.
During the short phase of the rolling upgrade we have the following situation, where e.g. instance 1 is already upgraded and thus consumes the new DB schema and the migrated data, whereas instance n is still on the previous version and thus uses the old schema with the non-migrated data.
Step 3: Cleanup
In this step we are going to clean up the leftovers from the previous steps. After completing step 2 we no longer need the column `Street` of the table `Address` and should therefore remove it. Technically we could just leave it there, but it is always a good idea to clean up code and/or database schema and data: obsolete code, tables, columns, views, etc. add to the technical debt of an application.
In this step we need to deploy a script that removes the obsolete column Street from the table. It could look similar to this, assuming we are using a PostgreSQL DB:
```sql
ALTER TABLE Address DROP COLUMN Street;
```
Please note that in PostgreSQL, indexes and table constraints involving the column are automatically dropped along with it; however, if anything outside the table depends on the column, such as a view, we have to drop that object first (or use CASCADE).
Once again, assuming that we use Docker containers to ship and run our artifacts and that we deploy to Kubernetes (k8s), the third step can be illustrated as follows:
After we have run this migration script our database is clean and no technical debt is left.
In this post I have discussed in detail what zero-downtime deployments are and how they can be achieved by using the technique of non-destructive deployments. One variant of this technique is the so-called blue-green deployment. These days it is frequently used in fully automated CI/CD pipelines.
I have also shown how backwards incompatible database schema changes and data migrations can be structured to allow for zero-downtime deployments, by splitting the process into three distinct steps: first a backwards compatible schema and data change is deployed, then the code changes are rolled out, and finally the database schema and the data are cleaned up by removing obsolete artifacts.