Zero Downtime Deployment - Transitional Db Schema

Question

The question "Achieving Zero Downtime Deployment" touched on the same issue, but I need some advice on a strategy that I am considering.

Context

A web-based application with Apache/PHP for server-side processing and MySQL DB/filesystem for persistence.

We are currently building the infrastructure. All networking hardware will have redundancy and all main network cables will be used in bonded pairs for fault-tolerance. Servers are being configured as high-availability pairs for hardware fault-tolerance and will be load-balanced for both virtual-machine fault-tolerance and general performance.

It is my intent that we are able to apply updates to the application without any downtime. I have taken great pains when designing the infrastructure to ensure that I can provide 100% uptime; it would be extremely disappointing to then have 10-15 minutes of downtime every time an update was applied. This is particularly significant as we intend to have a very rapid release cycle (sometimes it may reach one or more releases per day).

Network Topology

This is a summary of the network:

                      Load Balancer
             |----------------------------|
              /       /         \       \  
             /       /           \       \ 
 | Web Server |  DB Server | Web Server |  DB Server |
 |-------------------------|-------------------------|
 |   Host-1   |   Host-2   |   Host-1   |   Host-2   |
 |-------------------------|-------------------------|
            Node A        \ /        Node B
              |            /            |
              |           / \           |
   |---------------------|   |---------------------|
           Switch 1                  Switch 2
    
   And onward to VRRP enabled routers and the internet

Note: DB servers use master-master replication

Suggested Strategy

To achieve this, I am currently thinking of breaking the DB schema upgrade scripts into two parts. The upgrade would look like this:

1. Web-server A is taken offline; traffic continues to be processed by web-server B.
2. Transitional Schema changes are applied to the DB servers.
3. Web-server A code-base is updated, caches are cleared, and any other upgrade actions are taken.
4. Web-server A is brought online and web-server B is taken offline.
5. Web-server B code-base is updated, caches are cleared, and any other upgrade actions are taken.
6. Web-server B is brought online.
7. Final Schema changes are applied to the DB servers.
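
The ordering of the steps above can be sketched as a small simulation. This is only an illustration: `rolling_upgrade`, `drain`, and `restore` are hypothetical names, the "load balancer" is just an in-memory set, and each real action (schema change, code update) is reduced to a log entry. The one rule it enforces is that the last live web server is never taken out of rotation.

```python
from typing import Callable, List


def rolling_upgrade(nodes: List[str], log: Callable[[str], None]) -> None:
    """Walk through the upgrade steps, keeping at least one node in rotation."""
    in_rotation = set(nodes)  # stand-in for the load balancer's pool

    def drain(node: str) -> None:
        # Refuse to take the only remaining live node offline.
        if in_rotation == {node}:
            raise RuntimeError("refusing to drain the last live node")
        in_rotation.discard(node)
        log(f"drain {node}")

    def restore(node: str) -> None:
        in_rotation.add(node)
        log(f"restore {node}")

    first, *rest = nodes

    # Steps 1-3: first node goes offline, transitional schema is applied,
    # then the first node is upgraded.
    drain(first)
    log("apply transitional schema")
    log(f"upgrade {first}")

    # Steps 4-6: bring the upgraded node back before draining the next one.
    restore(first)
    for node in rest:
        drain(node)
        log(f"upgrade {node}")
        restore(node)

    # Step 7: every node runs the new code, so compatibility can be removed.
    log("apply final schema")
```

Calling `rolling_upgrade(["A", "B"], print)` prints the steps in the order listed above; with a single node the first `drain` raises, which is the point of the guard.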

'Transitional Schema' would be designed to establish a cross-version compatible DB. This would mostly make use of table views that simulate the old version schema whilst the table itself would be altered to the new schema. This allows the old version to interact with the DB as normal. The table names would include schema version numbers to ensure that there won't be any confusion about which table to write to.
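
As a rough sketch of the idea, the example below uses Python's built-in `sqlite3` module as a stand-in for MySQL. The table `users`, its `fullname` column, the versioned table `users_v2`, and the naive split on the first space are all assumptions made up for illustration. SQLite lets the whole change run inside one transaction and uses an `INSTEAD OF` trigger to make the view writable; MySQL behaves differently on both points (DDL statements commit implicitly, and triggers cannot be attached to views), so treat this as an illustration of the concept, not a recipe for MySQL.

```python
import sqlite3


def apply_transitional_schema(conn: sqlite3.Connection) -> None:
    """Move data to a versioned table and expose the old shape as a view."""
    conn.executescript(
        """
        BEGIN;
        -- New-version table, with the schema version in its name.
        CREATE TABLE users_v2 (
            id         INTEGER PRIMARY KEY,
            first_name TEXT NOT NULL,
            last_name  TEXT NOT NULL
        );
        -- Naive split on the first space (assumes 'First Last' values).
        INSERT INTO users_v2 (id, first_name, last_name)
            SELECT id,
                   substr(fullname, 1, instr(fullname, ' ') - 1),
                   substr(fullname, instr(fullname, ' ') + 1)
            FROM users;
        DROP TABLE users;
        -- The old name now points at a view that simulates the old schema.
        CREATE VIEW users AS
            SELECT id, first_name || ' ' || last_name AS fullname
            FROM users_v2;
        -- Old-version writes are redirected to the new table.
        CREATE TRIGGER users_insert INSTEAD OF INSERT ON users
        BEGIN
            INSERT INTO users_v2 (first_name, last_name)
            VALUES (substr(NEW.fullname, 1, instr(NEW.fullname, ' ') - 1),
                    substr(NEW.fullname, instr(NEW.fullname, ' ') + 1));
        END;
        COMMIT;
        """
    )


def apply_final_schema(conn: sqlite3.Connection) -> None:
    """Remove the backwards-compatible view (and its trigger)."""
    conn.executescript("DROP VIEW users;")
```

Old-version code keeps issuing `SELECT fullname FROM users` and `INSERT INTO users (fullname) ...` unchanged, while new-version code reads and writes `users_v2` directly.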

'Final Schema' would remove the backwards compatibility and tidy the schema.

Question

In short, will this work?

More specifically:

1. Will there be problems due to the potential for concurrent writes at the specific point of the transitional schema change? Is there a way to make sure that the group of queries that modify the table and create the backwards-compatible view are executed consecutively, i.e. with any other queries being held in a buffer until the schema changes are completed, which will generally take only milliseconds?
2. Are there simpler methods that provide this degree of stability whilst also allowing updates without downtime? I would also prefer to avoid the 'evolutionary' schema strategy, as I do not wish to become locked into backwards schema compatibility.

score 4 · Accepted Answer · answered Jan 27 '16 at 18:47

It sounds like what you are really looking for is not so much High Availability as Continuous Availability.

Essentially, your plan will work, but you seem to have noticed the major flaw in your setup: database schema changes in a release could cause either downtime or the failure of a still-available node to operate correctly. The Continuous Availability approach solves this by creating a number of Production environments.

Production One

This environment is your current live version of the software being utilized by users. It has its own web servers, application servers, and database servers and tablespace. It operates independently of any other environment. The Load Balancer which owns the domain resolution endpoint for these services is currently pointing to these web servers.

Production Two

This is basically a release staging environment that is identical to Production One. You can perform your release upgrades here and do your sanity tests before your go-live event. It also allows you to safely perform your database changes on this environment. The Load Balancer does not currently point to this environment.

Production DR

This is another duplicate at a separate data center located in a different region of the world. It allows you to fail over in the event of a catastrophe by doing a DNS switch at the Load Balancer.

Go Live

This event is essentially updating the DNS record to cycle to Production Two from Production One or vice-versa. This takes a while to propagate throughout the DNS servers of the world so you leave both environments running for a while. Some users MAY be working in existing sessions on the old version of your software. Most users will be establishing new sessions on the upgraded version of your software.

Data Migration

The only drawback here is that data written during that window is split across the two environments, so not all of it is available to all users at that time. There is clearly important user data in the previous version's database that now needs to be migrated safely to the new database schema. This can be accomplished with a well-tested data export and migration script, batch job, or similar ETL process.
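
A minimal sketch of such a migration script follows, using Python's `sqlite3` module with two connections standing in for the old and new databases. Everything specific here is an assumption: the `users` table, the `updated_at` integer timestamp column used to find rows changed after the cutoff, the `fullname` to `first_name`/`last_name` mapping, and the function name `migrate_since`. It does not handle rows deleted in the old database or primary-key collisions with rows created in the new database during the same window.

```python
import sqlite3


def migrate_since(old_db: sqlite3.Connection,
                  new_db: sqlite3.Connection,
                  cutoff: int) -> int:
    """Copy rows changed after `cutoff` from the old schema into the new one.

    Returns the number of rows migrated.
    """
    rows = old_db.execute(
        "SELECT id, fullname, updated_at FROM users "
        "WHERE updated_at > ? ORDER BY updated_at",
        (cutoff,),
    ).fetchall()

    # One transaction on the target: either every row lands or none does.
    with new_db:
        for row_id, fullname, updated_at in rows:
            # Naive old-to-new mapping: split on the first space.
            first_name, _, last_name = fullname.partition(" ")
            new_db.execute(
                "INSERT OR REPLACE INTO users "
                "(id, first_name, last_name, updated_at) VALUES (?, ?, ?, ?)",
                (row_id, first_name, last_name, updated_at),
            )
    return len(rows)
```

Because the insert replaces on the primary key, running the script a second time with the same cutoff rewrites the same rows and does not duplicate them.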

Conclusion

Once you have fully completed your release event, Production Two is now your primary and you begin working on installing the next release to Production One for the next deployment cycle.

Drawbacks

This is a complex environment to set up, and it requires a large amount of system resources, often two to three times what a single environment needs. Operating this way can be expensive, especially if you have very large, heavily used systems.

So, if I have understood correctly, you suggest that instead of a 'transitional' DB schema change that is applied while the Db is still in use, Db-A is kept online with the old schema whilst Db-B is updated to the new schema. When the update is ready for release, the web servers are changed over and the data that was written to Db A whilst the update was being prepared is migrated to Db B (presumably by getting all changes applied after a specific time-stamp). — , Jan 27 '16 at 19:03
@PeterScott You got it. Just keep in mind that you don't want to run the script until you are sure that all active sessions are over in the old system and enough time has passed that all DNS caches have been updated to the new CNAME or IP address. — maple_shaft, Jan 27 '16 at 19:04
I should be OK on both those points; the sessions are being persisted in the Db rather than server storage to avoid sessions being tied to specific virtual machines, and I currently intend to try to use a non-DNS-based load balancer. I won't have data-center-level redundancy, but that can wait for a year or so after application launch. — , Jan 27 '16 at 19:10

DocSalvager · Answer 2 · 2016-01-29T10:07:12.583

Your strategy is sound. I would only suggest that you consider expanding the "Transitional Schema" into a complete set of "transaction tables".

With transaction tables, SELECTs (queries) are performed against the normalized tables in order to assure correctness. But all database INSERTs, UPDATEs, and DELETEs are always written to the denormalized transaction tables.

Then a separate, concurrent process applies those changes (perhaps using Stored Procedures) to the normalized tables per the business rules and schema requirements established.

Most of the time, this would be virtually instantaneous. But separating the actions allows the system to accommodate excessive activity and schema update delays.

During schema changes on database (B), data updates on the active database (A) would go into its transaction tables and be immediately applied to its normalized tables.

On bringing database (B) back up, the transactions from (A) would be applied to it by writing them to (B)'s transaction tables. Once that part is done, (A) could be brought down and the schema changes applied there. (B) would finish applying the transactions from (A) while also handling its own live transactions. Those live transactions would queue just as (A)'s did, and they would be applied to (A) in the same way when (A) came back up.

A transaction table row might look something like...

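| ROWID | TRANSNR | DB | TABLE | SQL STATEMENT                  |
|-------|---------|----|-------|--------------------------------|
| 0     | 0       | A  | Name  | INSERT INTO Name ...           |
| 1     | 0       | A  | Addr  | INSERT INTO Addr ...           |
| 2     | 0       | A  | Phone | INSERT INTO Phone ...          |
| 3     | 1       | A  | Stats | UPDATE Stats SET NrOfUsers=... |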
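
To make the mechanics concrete, here is a small sketch using Python's `sqlite3` module. The `journal` table layout follows the illustration above, but the column names (`entry_id`, `tbl`, `sql_statement`, `applied`), the `Name` and `Stats` table definitions, and the functions `record` and `apply_pending` are all hypothetical. Storing raw SQL text mirrors the illustration; a real system might prefer a structured payload so the applier can map old-schema writes onto a new schema.

```python
import sqlite3
from itertools import groupby


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with two normalized tables and a journal."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Name  (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE Stats (id INTEGER PRIMARY KEY, NrOfUsers INTEGER);
        INSERT INTO Stats (id, NrOfUsers) VALUES (1, 0);
        CREATE TABLE journal (
            entry_id      INTEGER PRIMARY KEY AUTOINCREMENT,
            transnr       INTEGER NOT NULL,
            db            TEXT    NOT NULL,
            tbl           TEXT    NOT NULL,
            sql_statement TEXT    NOT NULL,
            applied       INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return conn


def record(conn, transnr, db, tbl, sql_statement):
    """The application only ever appends to the journal."""
    with conn:
        conn.execute(
            "INSERT INTO journal (transnr, db, tbl, sql_statement) "
            "VALUES (?, ?, ?, ?)",
            (transnr, db, tbl, sql_statement),
        )


def apply_pending(conn) -> int:
    """Apply unapplied journal entries, one DB transaction per TRANSNR."""
    pending = conn.execute(
        "SELECT entry_id, transnr, sql_statement FROM journal "
        "WHERE applied = 0 ORDER BY transnr, entry_id"
    ).fetchall()
    applied = 0
    for _, entries in groupby(pending, key=lambda row: row[1]):
        with conn:  # commit the whole group, or roll it back on error
            for entry_id, _, statement in entries:
                conn.execute(statement)
                conn.execute(
                    "UPDATE journal SET applied = 1 WHERE entry_id = ?",
                    (entry_id,),
                )
                applied += 1
    return applied
```

Pausing the process that calls `apply_pending` is what would give a schema change its quiet window: writes keep accumulating in `journal` and are applied once the normalized tables are ready again.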

The transaction "tables" could actually be rows in a separate NoSQL database or even sequential files, depending on performance requirements. A bonus is that the application (website in this case) coding gets a bit simpler since it writes only to the transaction tables.

The idea follows the same principles as double-entry bookkeeping, and for similar reasons.

Transaction tables are analogous to a bookkeeping "journal". The fully normalized tables are analogous to a bookkeeping "ledger" with each table being somewhat like a bookkeeping "account".

In bookkeeping, each transaction gets two entries in the journal. One for the "debited" ledger account, and the other for the "credited" account.

In an RDBMS, a "journal" (transaction table) gets an entry for each normalized table to be altered by that transaction.

The DB column in the table illustration above indicates on which database the transaction originated, thus allowing the queued rows from the other database to be filtered out and not reapplied when the second database is brought back up.
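
The filtering described here can be sketched in a few lines of Python. The `JournalRow` type and the `rows_to_replay` function are hypothetical names; the fields simply mirror the columns of the illustration.

```python
from typing import List, NamedTuple


class JournalRow(NamedTuple):
    rowid: int
    transnr: int
    db: str      # database on which the transaction originated
    table: str
    sql: str


def rows_to_replay(journal: List[JournalRow], target_db: str) -> List[JournalRow]:
    """Select the rows to send to `target_db` when it comes back up.

    Rows that originated on the target are skipped, since the target
    already applied them before it went down.
    """
    return sorted(
        (row for row in journal if row.db != target_db),
        key=lambda row: row.rowid,
    )
```

For example, if (B)'s journal holds rows replayed from (A) alongside (B)'s own live rows, `rows_to_replay(journal_b, "A")` returns only the rows that originated on (B).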

I like the comparison to book-keeping. So, if I've understood, the transaction tables allow me to place a very small delay on writing data to a particular normalised table so that I can apply all schema changes without risk of interruption mid-way through the changes? Then, with the table's schema up-to-date, I can resume the process that applies the denormalised transactions to the normalised tables (this process being capable of mapping the old schema data queries to the new schema)? — , Jan 29 '16 at 10:49
Yes. You would modify the mapping stored procedures (or whatever) to accommodate both old and new data. New NOT-NULL columns might be filled from old data with a code that means "prompt for this on user update." Columns to be split (e.g. FULLNAME into FIRST and LAST) would need some algorithm. I recommend adding one or more "comment-like" columns to tables for new business requirements that come up. If you don't, I guarantee users will appropriate other columns for that purpose, and fixing the data will then be almost impossible. — DocSalvager, Jan 29 '16 at 11:09
How would you prevent SELECT queries structured for the old schema from being applied to the new schema? I could create a table view and rename the schema table (with a schema version number), but this would still be problematic whilst the schema changes are being applied, since they apply directly to the normalised table. — , Jan 29 '16 at 11:15
When you add a table, column, or anything else to an RDBMS, you are actually just adding rows to a set of internal tables that can be written to only by the RDBMS engine. The DBAs manage the database by querying them through VIEWs. Since Oracle, IBM, MS, etc. are the experts and say this is the best way, seems we should follow their lead. Create a set of VIEWs for each version of the application. You can model them after the (usually fairly denormalized) tables the developers want you to create so you can properly normalize to prevent corrupted data. — DocSalvager, Jan 29 '16 at 11:36
Thanks. I'll need to think about this. I am building an ORM layer in the application that completely removes all state persistence logic from the main domain; being more based in server-side programming, I tend to solve problems more from that side than the DB administration side. Using my current strategy, the Db would be quite flat, with the ORM actively managing the raw tables. Adding table views and, possibly, a transaction log adds greater complexity to the ORM, but it also allows multiple schema versions to be supported without data splintering. — , Jan 29 '16 at 14:23
VIEWs should be completely transparent. I know they are in Oracle. I think they are in SQL Server. To your ORM layer and app, they look like tables and present a flatter schema. VIEWs are just a declarative way to eliminate a lot of the complexity needed to maintain data integrity when done in code. They take up no space since they contain no data. "Materialized VIEWs", on the other hand, do cache the VIEW results in a table. They're mostly used in Data Warehousing, where the data only changes on a predictable schedule. People often confuse the two. — DocSalvager, Jan 31 '16 at 08:59