
"In my company, we have this distributed system over Kubernetes. Some microservices are shared among all customers, and others are up to them to upgrade. This system has to interact with A LOT of customer's services (VPNs, private APIs, databases, log aggregators, etc.), and each customer can have wildly different environments.

We follow a very common software development lifecycle:

  • Our "engineering teams" work on features or bug fixes; once done
  • Our QA teams test the changes to ensure they work; and then
  • We have a release process where new container images are generated, and more thorough tests are done.

The reasoning behind this development -> QA -> release pipeline is sound. It allows the engineering team to focus on their tasks without being overwhelmed by an influx of customer requests. It also creates space for product evolution and ensures the quality one would expect from the thorough QA/release process.

However, when a customer requires a new feature or a bug fix, this software development lifecycle can backfire, because we are not sure our changes will address all aspects of the customer's environment. For example, we may reproduce an issue and write a fix, but when it is deployed, the database may turn out to use a different encoding, or the VPN server may require a specific cipher.
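
(To make the encoding example concrete: a deploy-time preflight self-check could catch such a mismatch before testing even starts. A minimal sketch, assuming a PostgreSQL database; the expected encoding and the check itself are assumptions for illustration.)

```python
# Hypothetical deploy-time preflight: fail fast if the customer's
# database encoding differs from what our images were built against.
import psycopg2  # assumes a PostgreSQL database

EXPECTED_ENCODING = "UTF8"  # assumption for illustration

def check_database_encoding(dsn: str) -> None:
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW server_encoding;")
            (encoding,) = cur.fetchone()
        if encoding != EXPECTED_ENCODING:
            raise RuntimeError(
                f"database encoding is {encoding}, expected {EXPECTED_ENCODING}"
            )
    finally:
        conn.close()
```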

Technically, it is not hard to test changes earlier. We can release separate images, use feature flags, etc. Customers also always have development and UAT environments.
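
For instance, a feature flag could gate an unreleased change so that only the customer piloting it exercises the new code path. A minimal sketch; the FEATURE_<FLAG> environment-variable convention and the flag/customer names are made up for illustration:

```python
import os

# Hypothetical convention: FEATURE_<FLAG> holds a comma-separated list
# of customer IDs the unreleased change is enabled for.
def is_enabled(flag: str, customer_id: str) -> bool:
    enabled_for = os.environ.get(f"FEATURE_{flag.upper()}", "")
    return customer_id in {c.strip() for c in enabled_for.split(",") if c.strip()}

# Only the pilot customer runs the new code path; everyone else
# keeps the released behaviour.
if is_enabled("new_vpn_cipher", customer_id="acme"):
    ...  # change under test
else:
    ...  # current released behaviour
```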

The problem is the policy. I'd like to suggest we use a different workflow in situations where we need to test against the real environments.

So, is there some known process where a development team can test an in-progress change together with a customer, bypassing the default SDLC?

brandizzi
  • Is there a way to do a silent release to their environment so you can do integration testing before the customer begins testing? There is no magic technique here. You've got to put it out there and try it. – Greg Burghardt Jul 30 '23 at 20:15
  • Does your customer have their own "test environment" you can deploy to first? – Greg Burghardt Jul 30 '23 at 20:16
  • What kind of software are you talking about? – Greg Burghardt Jul 30 '23 at 20:16
  • ... desktop, client/server, web, mobile, embedded software? What kind of "customer environment"? Something which involves a lot of other applications only available at the customer site? Or some environment which can be replicated in your own infrastructure? – Doc Brown Jul 30 '23 at 20:29
  • This seems too brief to be an answer, but you may want to look into "continuous deployment" and especially "testing in production". Most sources will cover applications where the same organisation does the development and owns the environment, but not all, and the same ideas are applicable if it's a customer-owned environment. – bdsl Jul 30 '23 at 21:47
  • @GregBurghardt and @DocBrown thank you for the comments; I edited the question to add more context. – brandizzi Jul 30 '23 at 22:31
  • @bdsl thanks! Continuous deployment would surely solve the issue here; we should probably work in that direction. However, it is a vaster philosophy that will take time to implement. We can technically test the changes before the full release, and I'd like to suggest that, but as an IC I don't want to sound like I didn't do my research on ways of doing it. – brandizzi Jul 30 '23 at 22:36
  • It sounds like your issues could be solved by the QA team having a better test environment: one which reflects all of your customers' production environments in a more adequate way. So whenever one of your customers finds an issue which slipped through QA, think about how you can prevent it from happening again. Such improvements may not be restricted to the test environments; one may also find improvements in the automated tests of your devs, or in the self-checks or configuration of the software (like detecting the database encoding automatically, etc.). – Doc Brown Jul 31 '23 at 06:13
  • Let me add a literal answer to your question: adapting a default SDLC to real needs has a name, and I am sure you know it already: this is called Agile. You want interactions over processes, you want customer collaboration, and you want to change your ways of working instead of following a plan blindly - these are at least three of the four main points of the [Agile Manifesto](https://agilemanifesto.org/) – Doc Brown Jul 31 '23 at 06:20
  • @DocBrown that's true. Alas, in some places even suggesting that the process is not that agile can be seen as an offence to leadership. But I get from the comments that there is no widely known term for what I'm looking for, which is a good answer anyway. – brandizzi Jul 31 '23 at 21:32
  • @DocBrown thanks for the suggestion; fortunately, we already improve the documentation and tests after these learnings. I'd just like to learn earlier. Anyway, I get that this is not a common concept, so I'll have to find my own words for it. Thank you all! – brandizzi Jul 31 '23 at 21:33

1 Answer

I deal with this exact problem constantly at my work. Here are three ways we have dealt with it for different clients:

Blue/Green

You keep two production environments. One is open to the public, the other is for pilot testing only. When testing passes, you swap the environments, making go-live a breeze.

The main sticking point with this methodology is the database. There can be only one database (which means database changes require special handling), and it can only contain production data (which means testing on it is a risk; e.g., your tests might accidentally send emails to real customers).

While this may seem like an expensive option (it doubles the hardware cost), many customers already have a second production environment for disaster recovery, usually just sitting there gathering dust. This is a way to put it to work.
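
Since the question mentions Kubernetes: there, the swap itself can be as small as repointing the public Service's selector from the blue slot to the green one. A rough sketch using the official Python client; the service, namespace, and label names are placeholders:

```python
from kubernetes import client, config  # pip install kubernetes

def swap_to(slot: str, service: str = "app", namespace: str = "prod") -> None:
    """Repoint the public Service at the 'blue' or 'green' slot."""
    config.load_kube_config()  # use load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()
    # Merge-patch only the selector; both slots keep running,
    # so rolling back is just another swap.
    v1.patch_namespaced_service(
        name=service,
        namespace=namespace,
        body={"spec": {"selector": {"app": "app", "slot": slot}}},
    )

swap_to("green")   # go-live
# swap_to("blue")  # roll back
```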

Outage window

Testing is done in lower environments, then promoted to production during an outage window. During the outage window, the firewall is manipulated so that only QA can access production.
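
In a Kubernetes setup, one way to approximate "only QA can access production" is a temporary NetworkPolicy that allows ingress only from QA's address range. A sketch with the official Python client; the CIDR and policy name are placeholders, and whether this actually blocks outside traffic depends on how traffic enters the cluster (the real mechanism may be a cloud firewall rule instead):

```python
from kubernetes import client, config

QA_CIDR = "203.0.113.0/24"  # placeholder for QA's address range

def lock_to_qa(namespace: str = "prod") -> None:
    """Allow ingress only from QA for the duration of the window."""
    config.load_kube_config()
    policy = client.V1NetworkPolicy(
        metadata=client.V1ObjectMeta(name="qa-only-window"),
        spec=client.V1NetworkPolicySpec(
            pod_selector=client.V1LabelSelector(),  # all pods in the namespace
            policy_types=["Ingress"],
            ingress=[client.V1NetworkPolicyIngressRule(
                _from=[client.V1NetworkPolicyPeer(
                    ip_block=client.V1IPBlock(cidr=QA_CIDR))],
            )],
        ),
    )
    client.NetworkingV1Api().create_namespaced_network_policy(namespace, policy)

# Deleting the policy afterwards reopens normal access:
# client.NetworkingV1Api().delete_namespaced_network_policy("qa-only-window", "prod")
```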

If bugs are found, the client can either fix forward (e.g. a simple configuration issue) or roll back and reschedule. The latter is required for code defects.

When analyzing the defects, the emphasis is on reproducing the production issue in the lower environment first. If it cannot be reproduced, special attention is given to the differences between the lower environment and production; if possible, the environments are brought into sync. If not, QA notes the delta in their test plans for future releases. This way you get continuous improvement.

Signoff on risk

We get the client to sign off on the risk that some additional defects may be found in production if test environments are not production-like. The signoff includes an agreement on who will pay for fixing such defects.

Once you have the signoff, production defects are not necessarily a bad thing; in fact, they are sales opportunities with a captive customer. That being said, a good partner should try to identify the risks beforehand and mitigate them somehow, e.g. by using network test tools to prove connectivity to new network connections (as sketched below), or by incorporating feature flags to turn the new feature off while issues are resolved.
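
The connectivity proof can be as simple as a scripted TCP reachability check against each new endpoint before go-live. A minimal sketch; the endpoint list is hypothetical:

```python
import socket

# Hypothetical new customer-side endpoints to prove out before go-live.
ENDPOINTS = [("vpn.customer.example", 1194), ("db.customer.example", 5432)]

def reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host, port in ENDPOINTS:
    print(f"{host}:{port} -> {'ok' if reachable(host, port) else 'UNREACHABLE'}")
```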

John Wu