What can you do to decrease the number of deployment bugs of a live website?

Question

I'm sure lots of you have experienced this problem. A website or web application is running live. You want to upload the next version, but you haven't accounted for everything: setting a value to false in a configuration file, inserting another record into the database, and lots of other minor steps, which can sometimes add up to 20 or more parameters.

As soon as you upload the new version, everything breaks. Fixing the problem may take only 20 minutes, but the stress you endure and the financial and goodwill damage to the company are sometimes hard to forget.

What are the ways to reduce these types of bugs, which arise from the initial configuration when deploying a new version?

PS: Please don't mention check-lists, because we already have them. The problem with check-lists is that, they should always get updated, but they won't.

You should not be breaking your website when you update it. If you are...then your procedure is wrong. — Ramhound, Sep 13 '11 at 11:44
"The problem with check-lists is that, they should always get updated, but they won't" In that case, you're doomed. We can tell you good things to do, and -- just like the check-list -- they won't get done. If you can't keep checklists up to date, you should consider finding another kind of job where errors and sloppiness are tolerated better. Perhaps government service. — S.Lott, Sep 13 '11 at 13:01
If you haven't figured out everything, you should not be deploying. — HLGEM, Sep 13 '11 at 14:27

Oded · Answer 1 · 2011-09-13T13:24:56.810

28

Two things:

A staging environment, as similar as possible to the live environment, where you do test deployments.
Ample testing of this environment after deployment, both automated and non-automated.
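The automated part of that post-deployment testing can start as a simple smoke test that requests a handful of key pages on staging. The following is only a sketch: the function names, the host and the page list are hypothetical, and a real suite would check much more than status codes.

```python
from typing import Callable, Iterable, List, Optional, Tuple
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code  # urlopen reports 4xx/5xx responses as exceptions
    except OSError:
        return None  # DNS failure, refused connection, timeout


def smoke_test(
    base_url: str,
    paths: Iterable[str],
    fetch: Callable[[str], Optional[int]] = fetch_status,
) -> List[Tuple[str, Optional[int]]]:
    """Return (path, status) for every page that did not answer 200."""
    failures = []
    for path in paths:
        status = fetch(base_url.rstrip("/") + path)
        if status != 200:
            failures.append((path, status))
    return failures
```

Calling something like `smoke_test("https://staging.example.test", ["/", "/login"])` right after each staging deployment, and refusing to promote the build when the returned list is non-empty, covers the "does it even start" class of errors. The `fetch` parameter exists so the check itself can be tested without a network.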

There are other things that can be done.

I suggest reading the 5 part blog series about automated deployment by Troy Hunt. The tooling he uses is MS stack specific, but the concepts are universal.

edited Sep 13 '11 at 13:24

answered Sep 13 '11 at 11:07

You mean that all websites all over the world have a **staging environment**? – Saeed Neamati Sep 13 '11 at 11:09
15

Not all of them. Which is why they have such problems with deployment. Any site of significant size that I have worked with does have one. – Oded Sep 13 '11 at 11:31
@Saeed Neamati - Of course not. This is the exact reason so many websites do not actually work like they should (e.g. my credit union's external load payment website). When it happens, your customers within the field only laugh at you. In my case I have no choice but to use my credit union's website. – Ramhound Sep 13 '11 at 11:46
6

@saeed: I can't speak for the world but all mine damn sure do. – Wyatt Barnett Sep 13 '11 at 11:54
1

@saeed all the good ones do. – HLGEM Sep 13 '11 at 14:26

score 13 · Answer 2 · answered Sep 13 '11 at 12:49

I wonder why no one mentioned version control, which is one of the most important ways to save yourself from trouble while updating or upgrading.

First, your deployment should be just a clone of the stable branch on your repository. Everything including config files, SQL files, install/update scripts MUST be version controlled.

Second, you need to have "some sort of" staging area -- it could be anything -- a local server, a temporary virtual cloud server you just spawned, a very simple virtual host setup, or a fully-fledged custom application that you maintain along with the main app. The difference between this "staging area" and your "development area" is that the former more closely models (or simulates) your actual deployment environment. For example, you can develop on PHP 5.3.x with the Apache module, but if your host runs PHP 5.2.x with FCGI, your staging area must run the same.
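One way to keep the staging area honest is to record what each environment actually runs and compare the two descriptions before every release. A minimal sketch, assuming you collect those facts into plain dictionaries yourself (the function name and the keys are made up for illustration):

```python
from typing import Any, Dict, Tuple


def parity_report(staging: Dict[str, Any], production: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map every setting that differs to its (staging, production) values."""
    report = {}
    for key in sorted(set(staging) | set(production)):
        # A key missing on one side shows up as None for that side.
        if staging.get(key) != production.get(key):
            report[key] = (staging.get(key), production.get(key))
    return report


# Hypothetical descriptions matching the PHP example above.
staging_env = {"php": "5.3", "sapi": "apache2handler", "os": "linux"}
production_env = {"php": "5.2", "sapi": "fcgi", "os": "linux"}
```

Here `parity_report(staging_env, production_env)` would flag `php` and `sapi`, which is exactly the mismatch that lets a bug pass staging and still break the live site.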

Then, you first write and test your updates on your development environment. Merge those changes to the staging area repository, and again test. At this point you can make any changes to your config to suit your deployment -- since it's version controlled, nothing is going to be lost, and you can always revert back in case of problems.

Finally, merge the staging area changes into your live deployment copy.

The complexity of your staging area should reflect the complexity and scope of your app. But in any case Version control is indispensable.

Of course if you don't use Version control -- none of this applies -- but then it's as naive as writing a database in Logo.

+1 but I didn't mention it because I just assumed version control was understood... — maple_shaft, Sep 13 '11 at 13:29
Yes, amazing how many people only source control the code they care about and not things like configurations, SQL etc. — HLGEM, Sep 13 '11 at 14:21
@HLGEM, You are right, sadly. I source control everything; I even have a subversion server running at home for NON DEVELOPMENT documents like my resume and cooking recipes. :) — maple_shaft, Sep 13 '11 at 18:01
@maple_shaft, Ohhhh, I never thought to version control my resume, what a great idea. — HLGEM, Sep 13 '11 at 18:31
Certainly a great idea -- one day you would look at the log and see what you learned and how you became more and more experienced as the time went by. And, if you commit once every month or two, your log after 25 years would be very interesting. — treecoder, Sep 13 '11 at 18:39

score 6 · Answer 3 · edited May 28 '13 at 19:44

Yes, you need a test or staging environment where you go through all of the steps, but keeping separate configuration files for separate environments is a must.

Environments
|_property_files
    |_ dev
        |_ com.bla.util
        |    |_ example.properties
        |_ com.bla.beans
        |    |_ someconfig.xml
    |_ test
        ....
    |_ production
        ....
|_database_updates
    |_ dev
        |_ insert_new_static_data.sql
        |_ ...

...

Basically, my build and deployment scripts take an environment property, fetch the environment-specific metadata files (XML files and the like) for that environment, and replace the defaults in my build location with them before packaging. My deployment scripts then look for any SQL files under database_updates and execute them on the database configured for that environment.
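The overlay step described above could look roughly like this. It is a sketch in Python rather than my actual build scripts, and it assumes the directory layout shown earlier; the function name is made up.

```python
import shutil
from pathlib import Path
from typing import List


def apply_environment(env_root: Path, env: str, build_dir: Path) -> List[str]:
    """Copy property_files/<env>/** over the build directory before packaging.

    Returns the relative paths that were written, so the build log shows
    exactly which files were environment-specific.
    """
    source = env_root / "property_files" / env
    if not source.is_dir():
        raise FileNotFoundError(f"no property files for environment {env!r}")
    copied = []
    for path in sorted(source.rglob("*")):
        if path.is_file():
            relative = path.relative_to(source)
            target = build_dir / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)  # overwrites the default file
            copied.append(relative.as_posix())
    return copied
```

An unknown environment name raises immediately, so a typo in the environment property fails the build instead of silently packaging the development settings.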

I could do this with a custom build task, but I actually just use some JUnit tests to do it for me. If any SQL exception occurs, the test fails and thus the deployment fails. Generally speaking, the SQL scripts are also written to be idempotent: if the necessary data already exists in the environment, they skip the insert or update.
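To illustrate both ideas (fail the deployment on any SQL error, and skip data that is already there), here is a small sketch using Python's sqlite3 module instead of JUnit. The table name and the statement are invented for the example.

```python
import sqlite3
from typing import Iterable

# Inserts the row only when it is not already present, so the script can be
# run against an environment any number of times.
INSERT_IF_MISSING = """
INSERT INTO static_data (name, value)
SELECT 'max_upload_mb', '20'
WHERE NOT EXISTS (SELECT 1 FROM static_data WHERE name = 'max_upload_mb')
"""


def run_updates(conn: sqlite3.Connection, statements: Iterable[str]) -> None:
    """Run data updates in one transaction; any SQL error aborts the deployment."""
    try:
        with conn:  # commits on success, rolls back the data changes on error
            for sql in statements:
                conn.execute(sql)
    except sqlite3.Error as exc:
        raise RuntimeError(f"database update failed, aborting deployment: {exc}") from exc
```

Because the whole batch runs in one transaction, a failure in the third script does not leave the first two half-applied.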

I also have a similar directory for batch or shell scripts that I need to run for a specific environment.

The telling part of your question is this: "they should always get updated, but they won't."

These configurations drive your automated builds and deployments so if you DON'T update them then your builds fail and your manager gets emailed about it. It is just as important then for the team to maintain the build and deployment configurations for a specific release as it is for them to check in code that compiles. Either infraction breaks the build.
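A build step that enforces this can be very small. The sketch below assumes simple `key=value` property files and a list of keys that the release requires; both function names and the example keys are hypothetical.

```python
from typing import Dict, Iterable


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, ignoring blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if separator:
            values[key.strip()] = value.strip()
    return values


def require_keys(config: Dict[str, str], required: Iterable[str]) -> None:
    """Raise if any required key is absent or empty, which fails the build."""
    missing = sorted(key for key in required if not config.get(key, "").strip())
    if missing:
        raise ValueError("release configuration incomplete, missing: " + ", ".join(missing))
```

The list of required keys lives in version control next to the code that reads them, so whoever adds a new setting has to add it to the list, and every environment that lacks a value then breaks the build rather than production.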

In short, greater adoption of continuous integration (CI) principles will help remove the pain of production releases.

score 6 · Answer 4 · answered Sep 13 '11 at 12:39

As suggested, use a staging system. This gives you the opportunity to test your changes in an environment that behaves like the live one.

This brings up another point: have testers. Testing the stuff I wrote myself doesn't find as many bugs as when someone else tests my application.

Another thing: automate your deployment process. Do db migrations with ant migrate, deploy the newest version automatically from svn via capistrano etc. When you deploy something, you shouldn't have to do more than one click, with everything else happening automatically. Especially for websites which need some configuration setup, manual deployment steps are a nightmare, and the chance that something goes wrong is huge.
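The "one click" can be as plain as a script that runs every step in a fixed order and stops at the first failure. A minimal sketch; the step names and the commands in the example list are placeholders for whatever your own migration and deployment tools are.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, Sequence[str]]

# Placeholder commands; substitute your real migration and deploy commands.
EXAMPLE_STEPS: List[Step] = [
    ("migrate database", ["ant", "migrate"]),
    ("deploy code", ["cap", "deploy"]),
]


def run_pipeline(steps: Sequence[Step], runner: Callable = subprocess.run) -> List[str]:
    """Run each step in order and stop at the first non-zero exit code.

    Returns the names of the completed steps; raises RuntimeError on failure
    so that later steps never run against a half-finished deployment.
    """
    completed = []
    for name, command in steps:
        result = runner(command)
        if result.returncode != 0:
            raise RuntimeError(
                f"step {name!r} failed with exit code {result.returncode}; "
                f"completed before failure: {completed}"
            )
        completed.append(name)
    return completed
```

The point is that the order of the steps is written down once, in code, instead of being remembered by whoever happens to deploy that day.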

score 6 · Answer 5 · answered Sep 13 '11 at 18:26

For something that absolutely, positively must not break consider having an A and a B system and use a load balancer to route all requests to A while you upgrade and test B, and then route everything to B while you update A.

For bonus points, add C and ensure your systems are geographically separated so an earthquake won't take 2 of them out simultaneously.

For many applications, I admit that this is overkill.

It also complicates any transaction management you need to do, but the problems are not insurmountable.
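The switching logic itself is simple; the hard parts are the infrastructure and the transaction handling mentioned above. As a rough sketch of the idea only (the class, the pool names and the health check are invented, and a real load balancer would be configured through its own interface):

```python
from typing import Callable, Dict, List


class BlueGreenRouter:
    """Tracks which server pool (A or B) currently receives live traffic."""

    def __init__(self, pools: Dict[str, List[str]], active: str) -> None:
        if active not in pools:
            raise KeyError(active)
        self._pools = pools
        self._active = active

    @property
    def active(self) -> str:
        return self._active

    def backends(self) -> List[str]:
        """Hosts that should receive requests right now."""
        return list(self._pools[self._active])

    def switch_to(self, name: str, is_healthy: Callable[[str], bool]) -> None:
        """Route traffic to another pool, but only if every host in it is healthy."""
        candidates = self._pools[name]  # KeyError for an unknown pool
        unhealthy = [host for host in candidates if not is_healthy(host)]
        if unhealthy:
            raise RuntimeError(f"refusing to switch to {name!r}; unhealthy: {unhealthy}")
        self._active = name
```

You upgrade and test B while A serves traffic, call `switch_to("B", ...)`, and then upgrade A. If B misbehaves after the switch, going back to A is one more call.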

Bill Michell

1

This is the correct answer – Sep 13 '11 at 18:33
2

Thank you. But versioning, staging systems and one touch deploys are all essential too. – Bill Michell Sep 13 '11 at 18:50

Alex Feinman · Answer 6 · 2013-08-11T04:00:25.027

In addition to the excellent suggestions above to have a pre-production environment and use automated testing:

Reduce the complexity of the codebase. Less code, generally, means fewer bugs and an easier time finding them. This is the philosophy behind refactoring, separation of concerns, and so forth.

Segment the codebase. One common approach is to separate it into:

a few core parts that change slowly and are shared across the site
many leaf parts that may change more quickly but each only impact a smaller part of the site

This understanding of your code base allows you to focus your development and testing on the core parts, since bugs there will have the most drastic effect.

score 4 · Answer 7 · answered Sep 13 '11 at 11:11

1) Deploy to the test site first and test your changes

2) Have all configuration in a configuration file (web config or similar). This config should then be specific to each installation of the application and never overwritten by a deployment. Any changes are then deliberate, rather than the result of forgetting to change something that should be different from test.
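For point 2, the deployment step has to copy the new release without touching the config file that already exists on the target. A minimal sketch, assuming a plain file copy and Python 3.8 or later; the function name and the default file name are illustrative.

```python
import shutil
from pathlib import Path
from typing import Sequence


def deploy_files(release_dir: Path, target_dir: Path,
                 protected: Sequence[str] = ("web.config",)) -> None:
    """Copy a release over the target, leaving protected files on the target alone.

    Files whose names match `protected` are skipped at every directory level,
    so the target keeps its own environment-specific copy.
    """
    shutil.copytree(
        release_dir,
        target_dir,
        ignore=shutil.ignore_patterns(*protected),
        dirs_exist_ok=True,  # the target already exists; update it in place
    )
```

With this in place, a production setting only changes when somebody edits the production file on purpose.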

Tom Squires

And make sure to have someone code review that configuration for each different environment. – HLGEM Sep 13 '11 at 14:24

score 4 · Answer 8 · answered Sep 13 '11 at 19:00

A well executed release is all about planning and communication. So prior to conducting a release consider these questions:

How long is the release likely to take, and are there any risks in letting people continue to interface with my product while the release is underway? If there is a risk to the system, consider taking the system offline and putting up a "System is currently undergoing maintenance" message in its place.
Are there any customers I may need to notify about the release ahead of time? Do I need to tell them about a possible service interruption, or performance degradation while the release is underway? Personally I always err on the side of over-communicating and telling all customers about an upcoming release or maintenance window on a public blog or a similar venue.
What are my contingency plans should the release go awry? For example, if the release goes poorly, should we roll back and restore the system to the way it was to minimize any time we are offline? And if so, are the steps for rolling back a release well documented? Or should I have the right people on call or on hand to assist with troubleshooting problems should they occur? Personally, I think the best way to approach the planning of any release is to assume something will go wrong with the release. That way I have forced myself to think through some of these issues ahead of time.
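On the rollback question: one common arrangement, offered here only as a sketch and not as the one true way, keeps each release in its own directory and points a `current` symlink at the live one. Rolling back then means pointing the link at the previous directory. The function name and layout are assumptions, and the code assumes a POSIX file system.

```python
import os
from pathlib import Path
from typing import Optional


def activate(releases_dir: Path, release: str, current_link: Path) -> Optional[str]:
    """Point current_link at releases_dir/release.

    Returns the name of the release that was active before (None on the first
    deployment), which is the value to pass back in to roll back.
    """
    target = releases_dir / release
    if not target.is_dir():
        raise FileNotFoundError(f"unknown release {release!r}")
    previous = os.readlink(current_link) if current_link.is_symlink() else None
    temporary = current_link.with_name(current_link.name + ".tmp")
    if temporary.is_symlink():
        temporary.unlink()  # left over from an interrupted run
    os.symlink(target, temporary, target_is_directory=True)
    os.replace(temporary, current_link)  # swap the link in a single rename
    return Path(previous).name if previous else None
```

Database changes are not covered by this; a release that alters the schema needs its own documented way back.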

Next, when it comes to executing a release, one of the best ways to ensure that it will run smoothly is to practice, practice, practice, and to document everything you encounter along the way. So, well in advance of deploying new code to production, practice deploying the code to a secure, properly sandboxed staging environment first. Have the person who will be responsible for deploying to production perform the test deployment to staging. Consider this your dress rehearsal and conduct yourself as you would if it were the real thing. Document everything you do, every step of the way: every command you execute, any SQL code you run, any files you modify and how you modified them. For each step, also document what you expect to see if the procedure is executed properly. If and when you encounter a problem of some kind, document what you did to resolve it.

When the practice deployment is complete, look over your notes and see if you can refine the process to eliminate errors. Then do it all over again. Keep practicing until executing a release becomes as routine as following a simple instruction sheet, like "log in to this machine, execute this command; then log in to the database and execute this SQL command; then..."

Listed above are the things an operations or release management team can do to help a release run smoothly. But what can the engineering team do to help minimize the risks in a release?

Keep releases small. Simply put, the more complex the set of code changes contained by a release, the more risky the release will become. Do your operations team a favor by planning to have a larger number of small releases, rather than a smaller number of large releases over the same period of time.
Test, test, test. Don't just test your code in your QA environment, use the staging environment to test your software as well. Often there are bugs that have little or nothing to do with the code itself, but rather have a root cause that lies in the configuration of the environment itself (or some mix of the two). To find these issues you need to test your code in an environment that closely mirrors production, a.k.a. staging.

As a last word, sometimes what matters most is not what we do to prevent things from going wrong, but how we conduct ourselves when they do go wrong. Therefore, I think it is important to build a culture in your company around operational transparency. Don't try to hide issues from customers; be forthcoming. Use Twitter actively to let customers know if there are issues your ops team is currently aware of and working to resolve (Lighthouse is awesome at this!). Consider publishing a "status" page for your service that customers can reference to see if anything is wrong (TypePad offers a great example of this). Bottom line, always err on the side of over-communication. Your customers will thank you for it.

score 1 · Answer 9 · answered Sep 14 '11 at 23:38

Many answers here already tell you how to implement your specific solution to the problem, but, as far as I can tell, the real problem isn't one of properly migrating/updating a website. It may be that the design/architecture behind it is fragile.

If that's true, you'll have to adjust the architecture of your system so that it is robust enough to continue functioning properly even if configuration settings change or are not properly set, and so that it degrades gracefully when such problems occur. Ideally, if you've added new functionality or changed old functionality in ways that require a new database column, your site will work even if the column is missing (maybe without the new functionality, or with a degraded form of the old functionality). Your client should not be losing money - at worst, he might not be getting new money from the improvements you've put in.
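As a small illustration of the missing-column case, here is a sketch using Python's sqlite3 module. The `products` table and the `discount` column are invented for the example, and other databases expose their schema differently.

```python
import sqlite3
from typing import List, Optional, Tuple


def has_column(conn: sqlite3.Connection, table: str, column: str) -> bool:
    """Check the live schema; `table` must be a trusted identifier, not user input."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return any(row[1] == column for row in rows)  # row[1] is the column name


def list_products(conn: sqlite3.Connection) -> List[Tuple[str, Optional[float]]]:
    """Return (name, discount) pairs; discount is None if the column is not deployed yet."""
    if has_column(conn, "products", "discount"):
        return conn.execute("SELECT name, discount FROM products ORDER BY name").fetchall()
    # Degraded mode: the new feature is simply absent, the old one keeps working.
    return [(name, None) for (name,) in conn.execute("SELECT name FROM products ORDER BY name")]
```

If the schema update is forgotten during a release, the site shows products without discounts instead of failing on every page that touches the table.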

If your system is fragile enough that configuration settings can cause such serious problems, program updates are not going to be the only sources for problems - and figuring out how to do the updates safely will only increase the damage you'll experience when failure comes from a different source.
