
In the past, I've worked in a variety of environments. Desktop apps, games, embedded stuff, web services, command line jobs, web sites, database reporting, and so on. All of these environments shared the same trait: no matter their complexity, no matter their size, I could always have a subset or slice of the application on my machine or in a dev environment to test.

Today I do not. Today I find myself in an environment whose primary focus is scalability. Reproducing the environment is prohibitively costly. Taking a slice of the environment, while plausible (some of the pieces would need to be simulated, or run in a single-instance mode they're not designed for), kinda defeats the purpose since it obscures the concurrency and load that the real system encounters. Even a small "test" system has its flaws: things behave differently when you've got 2 nodes than when you have 64.

My usual approach to optimization (measure, try something, verify correctness, measure differences, repeat) doesn't really work here since I can't effectively do steps 2 and 3 for the parts of the problem that matter (concurrency robustness and performance under load). This scenario doesn't seem unique though. What is the common approach to doing this sort of task in this sort of environment?

There are some related questions:

  • this question has to do with hardware (like spectrum analyzers) being unavailable, which can be (relatively) easily emulated.
  • this question is about tracking down bugs that only exist in production environments, which is helpful - but a different sort of activity.
Telastyn
  • Short answer: the answers to the second linked question apply as well. More logging will not only help debugging, it will also help to test and optimize. You might just have to log different things, especially running times and resource usage. – Doc Brown Jul 08 '14 at 15:27
  • Can you time-multiplex parts of the production environment between production and testing? – Patrick Jul 08 '14 at 15:27
  • @DocBrown - sure, but logging won't help me see if an alternative implementation will be correct or more performant in production until it's actually in production - which certainly seems to be too late. – Telastyn Jul 08 '14 at 15:52
  • `Reproducing the environment is prohibitively costly.` - How much does a show-stopping production bug cost? What about 2 bugs? At unpredictable times (most likely when you have the majority of your users putting load on the system at the same time). Weigh that against the cost of setting up a minimal reproduction environment - you might find it's not that prohibitively expensive after all. – Jess Telford Jul 10 '14 at 20:17
  • For some reason, I have a feeling that this just means the system is badly designed or organized. If the system is well organized and modular, setting up a test case or optimization scenario wouldn't be `prohibitively costly`. – InformedA Jul 15 '14 at 04:52
  • @randoma - I'm sorry, I don't see how good design would mitigate the challenges of optimizing at scale. Even the most modular system will behave differently at scale than for some subset of data and hardware. Hell, the modular system is likely to behave far differently due to the increased number of interconnections. – Telastyn Jul 15 '14 at 11:39
  • @Telastyn I meant to say that with a modular system, I can set up static testing scenarios within a module to reproduce bugs and optimization scenarios more easily. I agree that there might be corner cases where the cost is `prohibitively high` though. In that case, I would ask for help from senior members. Cheers. – InformedA Jul 15 '14 at 11:46

3 Answers

Admittedly it's tough, but I am sure in lots of comparable situations this is primarily an organizational problem. The only viable approach is probably a mixture of combined measures, not a single silver bullet. Some things you can try:

  • Logging: as I wrote already in a comment, extensive time and resource logging (which is a kind of profiling) can help you identify the real bottlenecks in production. It may not tell you whether an alternative implementation will work better, but it will surely keep you from optimizing the completely wrong part of your application. (A timing sketch follows this list.)

  • Test what you can test beforehand - thoroughly, with a lot of upfront planning. Sure, things will behave differently in production, but not all of them. The correctness of a different implementation can often be checked beforehand; whether it scales well is a separate question. But planning helps a lot: think hard about which problems your test environment can solve for you and which it cannot. There are almost always things that look untestable at first glance, yet on a second look more is possible than you'd expect.

  • Work as a team. When trying a new approach or idea, discuss it with at least one other person on your team. When you implement a different algorithm, insist on code inspections and QA. The more bugs and problems you avoid beforehand, the fewer serious problems you will have to solve in production.

  • Since you cannot test everything beforehand, expect problems to show up in production. So prepare a really good fall-back strategy when bringing new code into production. If the new code risks being slower than the old solution, or risks crashing, make sure you can switch back to the previous version ASAP. (A minimal fall-back switch is sketched after this list.) If it risks destroying production data, make sure you have a good backup/recovery procedure in place. And make sure you actually detect such failures by adding some validation mechanism to your system.

  • Keep a project diary or a solution log - seriously. Each day you find out something new about the environment, write it down - success stories as well as failure stories. Don't make the same mistake twice.
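
For the logging point, the timing side can be as small as a decorator around the suspect entry points. A minimal sketch in Python - the logger name and the `handle_request` example are illustrative placeholders, not from any particular system:

```python
# Minimal sketch of time logging as lightweight profiling.
# The "perf" logger name and handle_request() are illustrative placeholders.
import functools
import logging
import time

log = logging.getLogger("perf")

def timed(func):
    """Log the wall-clock duration of every call so production logs reveal hotspots."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            log.info("%s took %.1f ms", func.__qualname__, elapsed_ms)
    return wrapper

@timed
def handle_request(payload):
    ...  # the real work being measured
```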
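
And for the fall-back point, a runtime switch back to the proven code path can be tiny. A sketch, assuming a hypothetical `USE_NEW_PIPELINE` flag and placeholder `process_old`/`process_new` functions:

```python
# Sketch of a fall-back switch; USE_NEW_PIPELINE, process_old() and
# process_new() are hypothetical names used only for illustration.
import logging
import os

log = logging.getLogger("rollout")

def process_old(order):
    ...  # the proven implementation currently in production

def process_new(order):
    ...  # the candidate implementation being introduced

def process(order):
    # Unset USE_NEW_PIPELINE (or redeploy the old build) to revert quickly.
    if os.environ.get("USE_NEW_PIPELINE") == "1":
        try:
            return process_new(order)
        except Exception:
            log.exception("new pipeline failed, falling back to the old path")
    return process_old(order)
```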

So the gist is: when you cannot rely on trial and error, your best options are conservative, classic upfront planning and QA techniques.

Doc Brown

If you can't reproduce the live environment, the uncomfortable reality is that whatever you do, it won't have been sufficiently tested.

So, what can you do?

Well, whatever has to scale - be it a process, a server cluster or a database volume - should be tested with the zero, one, infinity rule in mind to tease out where the potential bottlenecks and limitations are, whether that's I/O, CPU count, CPU load, inter-process communication, etc.
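
In miniature, zero/one/many testing might look like the parametrised test below - `process_batch` and the boundary sizes are placeholders, and the "many" end would of course be far larger (and probably run by a dedicated load-testing team) in practice:

```python
# Illustrative pytest parametrisation over the zero/one/many boundaries.
# process_batch() is a stand-in for whatever component has to scale.
import pytest

def process_batch(items):
    # Placeholder: pretend the system reports how many items it handled.
    return len(items)

@pytest.mark.parametrize("n_items", [0, 1, 100_000])
def test_zero_one_many(n_items):
    items = [{"id": i} for i in range(n_items)]
    assert process_batch(items) == n_items
```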

Once you have this, you should get a feel for what kind of testing is affected. If it is unit testing, then this traditionally sits with the developer; if it is integration/system testing, then there may be touch points with other teams who may be able to assist with additional expertise or, better still, tools.

Speaking of tools, it isn't really the remit of the developer to load test a system beyond what is possible in their development environment. This should be pushed onto the test department or another third party.

The elephant in the room of course is that systems do not always scale in predictable ways!

In a former life, I was a DBA for bank databases with billions of rows, and armed with execution plans we could generally predict how long queries would take on an idle database for given input volumes. However, once those volumes reached a certain size, the execution plan would change and performance would rapidly deteriorate unless the query or database was tuned.

Robbie Dee

I'd suggest experiments.

Logging will find bottlenecks. You can then try an alternative implementation on some machines, or even on all machines with a certain probability, or for a limited time period. Then compare the logs again to check for improvements.
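
A sketch of the "on all machines with a certain probability" idea - the 5% ratio, the logger name and the `old_impl`/`new_impl` placeholders are assumptions for illustration only:

```python
# Sketch of a probabilistic experiment in production. The ratio and the
# old_impl()/new_impl() placeholders are assumptions, not a real API.
import logging
import random
import time

log = logging.getLogger("experiment")
EXPERIMENT_RATIO = 0.05  # fraction of requests routed to the alternative code path

def old_impl(request):
    ...  # current production implementation

def new_impl(request):
    ...  # alternative implementation under evaluation

def handle(request):
    variant = "new" if random.random() < EXPERIMENT_RATIO else "old"
    start = time.perf_counter()
    result = new_impl(request) if variant == "new" else old_impl(request)
    elapsed_ms = (time.perf_counter() - start) * 1000
    log.info("variant=%s elapsed_ms=%.1f", variant, elapsed_ms)
    return result
```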

It's the same theory-experiment-measure cycle you're used to, but more expensive to set up - since hypotheses have to be run in production - and depending on your volume, receiving significant data from production may be slow as well.

orip