Monday, April 07, 2014

Why does it work in staging but not in production?

This is a question that I am sure was faced by every developer and operation engineer out there. There can be multiple answers to this question, and I'll try to offer some of the ones we arrived at, having to do mainly with our Chef workflow, but that can be applied I think to any other configuration management tool.

1) A Chef cookbook version in staging is different from the version in production

This is a common scenario, and it's supposed to work this way. You do want to test out new versions of your cookbooks in staging first, then update the version of the cookbook in production.

2) A feature flag is turned on in staging but turned off in production

We have Chef attributes defined in attributes/default.rb that serve as feature flags. If a certain attribute is true, some recipe code or template section gets included which wouldn't be included if the attribute were false. The situation can occur where a certain attribute is set to true in the staging environment but is set to false in the production environment, at which point things can get out of sync. Again, this is expected, as you do want to test new features out in staging first, but don't forget to turn them on in production at some point.

3) A block of code or template is included in staging but not in production

We had this situation very recently. Instead of using attributes as feature flags, we were directly comparing the environment against 'stg' or 'prod' inside an if block in a template, and only including that template section if the environment was 'stg'. So things were working perfectly in staging, but mysteriously the template section wasn't even there in production. An added difficulty was that the template in question was peppered with non-indented if blocks, so it took us a while to figure out what was going on.

Two lessons here:

a) Make your templates readable by indenting code blocks.

b) Use attributes as feature flags, and don't compare directly against the current environment. This way, it's easier to always look at the default attribute file and see if a given feature flag is true or false.

4) A modification is made to the cookbook version in production directly on the Chef server

I blogged about this issue in the past. Suppose you have an environments file that pins a given cookbook (let's designate it as cookbook C) to 1.0.1 in staging and to 1.0.0 in production. You want to upgrade production to 1.0.1, because it was tested in staging and it worked fine. However, instead of i) modifying the environments/prod.rb file and pinning the cookbook C to 1.0.1, ii) updating the Chef server via "knife environment from file environments/prod.rb" and iii) committing your changes in git, you modify the version of the cookbook C directly on the Chef server with "knife environment edit prod".

Then, the next time you or somebody else modifies environments/prod.rb to bump up another cookbook to the next version, the version of cookbook C in that file is still 1.0.0, so when you upload environments/prod.rb to the Chef server, it will downgrade cookbook C from 1.0.1 to 1.0.0. Chaos will ensue the next time chef-client runs on the nodes that have recipes from cookbook C. Production will be broken, while staging will still happily work.

Here are 2 other scenarios not related directly to staging vs production, but instead having the potential to break production altogether.

You forget to upload the new version of the cookbook to the Chef server

You make all of your modifications to the cookbook, you commit your code to git, but for some reason you forget to upload the cookbook to the Chef server. Particularly if you keep the same version of the cookbook that is in staging (and possibly in production), then your modifications won't take effect and you may spend some quality time pulling your hair.

You upload a cookbook to the Chef server without bumping its version

There is another, even worse, scenario though: you do upload your cookbook to the Chef server, but you realize that you didn't bump up the version number compared to what is currently pinned to production. As a consequence, all the nodes in production that have recipes from that cookbook will be updated the next time they run chef-client. That's a nasty one. It does happen. So make sure you pay attention to your cookbook versioning process and stick to it!

2 comments:

JCD said...

You seem to have forgotten (at least) one important factor: data. The customer data is different in production than the data you have in stage. Not always is the data the same, sometimes because of legal requirements (credit cards come to mind) and sometimes just because things change so fast. Perhaps you have not seen that particular scenario, but it is not uncommon in web environments.

Grig Gheorghiu said...

JCD -- thanks for the comment, very good point!

Modifying EC2 security groups via AWS Lambda functions

One task that comes up again and again is adding, removing or updating source CIDR blocks in various security groups in an EC2 infrastructur...