110

I recently interviewed at Amazon. During a coding session, the interviewer asked why I declared a variable in a method. I explained my process, and he challenged me to solve the same problem with fewer variables. For example (this wasn't from the interview), I started with Method A and then improved it to Method B by removing int s. He was pleased and said this would reduce the method's memory usage.

I understand the logic behind it, but my question is:

When is it appropriate to use Method A vs. Method B, and vice versa?

You can see that Method A is going to have higher memory usage, since int s is declared, but it only has to perform one calculation, i.e. a + b. On the other hand, Method B has lower memory usage, but has to perform two calculations, i.e. a + b twice. When do I use one technique over the other? Or, is one of the techniques always preferred over the other? What are things to consider when evaluating the two methods?

Method A:

private bool IsSumInRange(int a, int b)
{
    int s = a + b;

    if (s > 1000 || s < -1000) return false;
    else return true;
}

Method B:

private bool IsSumInRange(int a, int b)
{
    if (a + b > 1000 || a + b < -1000) return false;
    else return true;
}
Peter Mortensen
Corey P
  • 232
    I'm willing to bet that a modern compiler will generate the same assembly for both of those cases. – 17 of 26 Sep 04 '18 at 18:25
  • 13
    I rolled back the question to its original state, since your edit invalidated my answer - please don't do that! If you ask a question about how to improve your code, then don't change the question by improving the code in the shown way - this makes the answers look meaningless. – Doc Brown Sep 04 '18 at 19:40
  • 77
    Wait a second, they asked to get rid of `int s` while being totally fine with those magic numbers for upper and lower bounds? – null Sep 04 '18 at 20:52
  • Method B will perform the calculation twice when the sum is < -1000. There are times to optimize for memory (IO ops), the other 99% will be about performance. – RandomUs1r Sep 04 '18 at 21:23
  • 34
    Remember: profile before optimizing. With modern compilers, Method A and Method B may be optimized to the same code (using higher optimization levels). Also, with modern processors, they could have instructions that perform more than addition in a single operation. – Thomas Matthews Sep 04 '18 at 22:07
  • 6
    Just for any future answers, try to ignore the code 'format' and try to focus on the overall question. I'm not asking how to clean up my code; Methods A and B are just examples. – Corey P Sep 04 '18 at 23:01
  • 142
    Neither; optimize for readability. – Andy Sep 04 '18 at 23:19
  • 4
    What language is this in? The language may affect which method uses less memory – Ferrybig Sep 05 '18 at 06:55
  • @17of26: Only if you enable optimizations. Also don't just bet, look it up on godbolt.org compiler explorer – PlasmaHH Sep 05 '18 at 09:34
  • 5
    I'm wondering if maybe the examiner _wanted_ you to recognise that the variable `s` being declared in your first answer isn't going to cause greater memory usage... – Baldrickk Sep 05 '18 at 09:46
  • 2
    @T.Sar A really impressive candidate would have pointed this out at the time. Although it does sound like the interviewer genuinely wasn't testing in this manner, which is a shame – Lightness Races in Orbit Sep 05 '18 at 11:15
  • This depends a lot on what the purpose of the code is. How many people are likely to become responsible to look at and maintain the code? Is it likely we will need to expand it later? If it is your own code you can do what you want, but if you are part of some team it would be wise to think about stuff like that. – mathreadler Sep 05 '18 at 14:20
  • 4
    Not to mention here that it would have been much simpler to say `return !(condition)`. – chrylis -cautiouslyoptimistic- Sep 05 '18 at 17:21
  • 4
    Also, `int abs(int)` can be implemented branchless on typical architectures, so I'd throw `return abs(a+b)<=1000` into the ring ... – Hagen von Eitzen Sep 05 '18 at 18:47
  • 3
    @T.Sar Which is why technical coding interviews are generally nonsense. They don't even replicate the actual working environment anyways... just nerves can get to you and end up "failing" something you'd otherwise be able to do. – code_dredd Sep 05 '18 at 20:49
  • 2
    @code_dredd I agree that coding interviews are usually nonsensical and more or less useless. Usually, a few paid test-drive weeks are far better at weeding good from bad candidates than those types of questions. I fumbled quite a few interviews because I panicked myself... – T. Sar Sep 05 '18 at 23:21
  • 3
    All the sum answers currently suffer from overflow. (Not particularly relevant to your answer except get it right before you think about "optimizing".) – philipxy Sep 05 '18 at 23:45
  • 1
    The [C++ equivalent](https://godbolt.org/z/ZAMkH8) code compiled by g++ produces the exact same code. For other languages, they have JIT so I'm not sure. –  Sep 06 '18 at 09:38
  • @philipxy Sure, it may overflow, but nonetheless, it's right as is. You can't check for overflow everywhere, so in general, you simply have to ignore the problem. – maaartinus Sep 06 '18 at 10:56
  • 1
    @Andy if you are working for Amazon you need to be able to optimize for speed. Your code could be deployed on thousands of servers. If it runs slow, it could cost the company a lot of money and affect a lot of customers. You can keep it readable at the same time, but it is not a kindergarden. – rghome Sep 06 '18 at 11:29
  • 1
    @maaartinus It's not clear what you are trying to say. It is easy to write IsSumInRange so it is total, so it is not the case that "you simply have to ignore the problem". (Whether A is currently right depends on its specification. Since B is supposed to calculate what A does it's presumably ok.) – philipxy Sep 06 '18 at 11:30
  • @philipxy I wrote "in *general*, you simply have to ignore the problem". This doesn't mean, that you can't detect overflow in special cases. – maaartinus Sep 06 '18 at 15:39
  • 9
    BTW, never write `if (condition) then return false else return true;` Just write `return !condition;` – kevin cline Sep 06 '18 at 17:15
  • @rghome I'm sure they also profile their code to see if/what improvements could be had, and don't try to optimize every line. An unreadable line which contains a bug can also cost Amazon millions. – Andy Sep 06 '18 at 23:11
  • Method A has a lower memory consumption as only 2 variables have to be loaded at the same moment, the compiler can replace the value of `a` by `s` and do the same with B when it compares the now called `a` with the target value. With the other code, it needs 1 more value in scope to store the value of the result of the calculation, before it can compare it – Ferrybig Sep 10 '18 at 08:38
  • The difference between declaring and not declaring a variable should be negligible, if there even is a difference - if the question is about that specifically, fair enough, but you seem to be asking the more general question. Beyond that, it's a trade-off - you need to understand how much memory and time different approaches use, and how much of each you have available. I'm not sure there's a general answer here (apart from "don't optimize prematurely"). – Bernhard Barker Sep 10 '18 at 10:47
  • Well this part of your interview explains a lot about why Amazon's engineers can't build a decent API to save their lives. Remember that interviews are two-way streets: You are deciding if you want to work with them as much as they are deciding if they want to work with you. Premature optimization exercises like this are pointless and should be a red flag. Would love to see someone flip the interview tables one day and get the interviewer to write code instead. – CubicleSoft Sep 11 '18 at 14:08

15 Answers

146

Instead of speculating about what may or may not happen, let's just look, shall we? I'll have to use C++ since I don't have a C# compiler handy (though see the C# example from VisualMelon), but I'm sure the same principles apply regardless.

We'll include the two alternatives you encountered in the interview. We'll also include a version that uses abs as suggested by some of the answers.

#include <cstdlib>

bool IsSumInRangeWithVar(int a, int b)
{
    int s = a + b;

    if (s > 1000 || s < -1000) return false;
    else return true;
}

bool IsSumInRangeWithoutVar(int a, int b)
{
    if (a + b > 1000 || a + b < -1000) return false;
    else return true;
}

bool IsSumInRangeSuperOptimized(int a, int b) {
    return (abs(a + b) <= 1000);
}

Now compile it with no optimization whatsoever: g++ -c -o test.o test.cpp

Now we can see precisely what this generates: objdump -d test.o

0000000000000000 <_Z19IsSumInRangeWithVarii>:
   0:   55                      push   %rbp              # begin a call frame
   1:   48 89 e5                mov    %rsp,%rbp
   4:   89 7d ec                mov    %edi,-0x14(%rbp)  # save first argument (a) on stack
   7:   89 75 e8                mov    %esi,-0x18(%rbp)  # save b on stack
   a:   8b 55 ec                mov    -0x14(%rbp),%edx  # load a into edx
   d:   8b 45 e8                mov    -0x18(%rbp),%eax  # load b into eax
  10:   01 d0                   add    %edx,%eax         # add a and b
  12:   89 45 fc                mov    %eax,-0x4(%rbp)   # save result as s on stack
  15:   81 7d fc e8 03 00 00    cmpl   $0x3e8,-0x4(%rbp) # compare s to 1000
  1c:   7f 09                   jg     27                # jump to 27 if it's greater
  1e:   81 7d fc 18 fc ff ff    cmpl   $0xfffffc18,-0x4(%rbp) # compare s to -1000
  25:   7d 07                   jge    2e                # jump to 2e if it's greater or equal
  27:   b8 00 00 00 00          mov    $0x0,%eax         # put 0 (false) in eax, which will be the return value
  2c:   eb 05                   jmp    33 <_Z19IsSumInRangeWithVarii+0x33>
  2e:   b8 01 00 00 00          mov    $0x1,%eax         # put 1 (true) in eax
  33:   5d                      pop    %rbp
  34:   c3                      retq

0000000000000035 <_Z22IsSumInRangeWithoutVarii>:
  35:   55                      push   %rbp
  36:   48 89 e5                mov    %rsp,%rbp
  39:   89 7d fc                mov    %edi,-0x4(%rbp)
  3c:   89 75 f8                mov    %esi,-0x8(%rbp)
  3f:   8b 55 fc                mov    -0x4(%rbp),%edx
  42:   8b 45 f8                mov    -0x8(%rbp),%eax  # same as before
  45:   01 d0                   add    %edx,%eax
  # note: unlike other implementation, result is not saved
  47:   3d e8 03 00 00          cmp    $0x3e8,%eax      # compare to 1000
  4c:   7f 0f                   jg     5d <_Z22IsSumInRangeWithoutVarii+0x28>
  4e:   8b 55 fc                mov    -0x4(%rbp),%edx  # since s wasn't saved, load a and b from the stack again
  51:   8b 45 f8                mov    -0x8(%rbp),%eax
  54:   01 d0                   add    %edx,%eax
  56:   3d 18 fc ff ff          cmp    $0xfffffc18,%eax # compare to -1000
  5b:   7d 07                   jge    64 <_Z22IsSumInRangeWithoutVarii+0x2f>
  5d:   b8 00 00 00 00          mov    $0x0,%eax
  62:   eb 05                   jmp    69 <_Z22IsSumInRangeWithoutVarii+0x34>
  64:   b8 01 00 00 00          mov    $0x1,%eax
  69:   5d                      pop    %rbp
  6a:   c3                      retq

000000000000006b <_Z26IsSumInRangeSuperOptimizedii>:
  6b:   55                      push   %rbp
  6c:   48 89 e5                mov    %rsp,%rbp
  6f:   89 7d fc                mov    %edi,-0x4(%rbp)
  72:   89 75 f8                mov    %esi,-0x8(%rbp)
  75:   8b 55 fc                mov    -0x4(%rbp),%edx
  78:   8b 45 f8                mov    -0x8(%rbp),%eax
  7b:   01 d0                   add    %edx,%eax
  7d:   3d 18 fc ff ff          cmp    $0xfffffc18,%eax
  82:   7c 16                   jl     9a <_Z26IsSumInRangeSuperOptimizedii+0x2f>
  84:   8b 55 fc                mov    -0x4(%rbp),%edx
  87:   8b 45 f8                mov    -0x8(%rbp),%eax
  8a:   01 d0                   add    %edx,%eax
  8c:   3d e8 03 00 00          cmp    $0x3e8,%eax
  91:   7f 07                   jg     9a <_Z26IsSumInRangeSuperOptimizedii+0x2f>
  93:   b8 01 00 00 00          mov    $0x1,%eax
  98:   eb 05                   jmp    9f <_Z26IsSumInRangeSuperOptimizedii+0x34>
  9a:   b8 00 00 00 00          mov    $0x0,%eax
  9f:   5d                      pop    %rbp
  a0:   c3                      retq

We can see from the stack addresses (for example, the -0x4 in mov %edi,-0x4(%rbp) versus the -0x14 in mov %edi,-0x14(%rbp)) that IsSumInRangeWithVar() uses 16 extra bytes on the stack.

Because IsSumInRangeWithoutVar() allocates no space on the stack to store the intermediate value s it has to recalculate it, resulting in this implementation being 2 instructions longer.

Funny, IsSumInRangeSuperOptimized() looks a lot like IsSumInRangeWithoutVar(), except it compares to -1000 first, and 1000 second.

Now let's compile with only the most basic optimizations: g++ -O1 -c -o test.o test.cpp. The result:

0000000000000000 <_Z19IsSumInRangeWithVarii>:
   0:   8d 84 37 e8 03 00 00    lea    0x3e8(%rdi,%rsi,1),%eax
   7:   3d d0 07 00 00          cmp    $0x7d0,%eax
   c:   0f 96 c0                setbe  %al
   f:   c3                      retq

0000000000000010 <_Z22IsSumInRangeWithoutVarii>:
  10:   8d 84 37 e8 03 00 00    lea    0x3e8(%rdi,%rsi,1),%eax
  17:   3d d0 07 00 00          cmp    $0x7d0,%eax
  1c:   0f 96 c0                setbe  %al
  1f:   c3                      retq

0000000000000020 <_Z26IsSumInRangeSuperOptimizedii>:
  20:   8d 84 37 e8 03 00 00    lea    0x3e8(%rdi,%rsi,1),%eax
  27:   3d d0 07 00 00          cmp    $0x7d0,%eax
  2c:   0f 96 c0                setbe  %al
  2f:   c3                      retq

Would you look at that: all three variants are identical. The compiler does something quite clever: abs(a + b) <= 1000 is equivalent to a + b + 1000 <= 2000 when the comparison is unsigned, which is what setbe tests, because a negative number then becomes a very large positive number. The lea instruction performs both additions in a single instruction, and the conditional branches are eliminated entirely.
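The unsigned-comparison trick described above can be checked with a small Python sketch. This is only an emulation under stated assumptions: Python integers are unbounded, so the 32-bit register is imitated with a mask, and the names MASK32, in_range_two_compares and in_range_one_unsigned_compare are made up for illustration. The argument s stands for the already-computed 32-bit sum a + b.

```python
MASK32 = 0xFFFFFFFF  # assumption: emulate a 32-bit register by keeping the low 32 bits


def in_range_two_compares(s):
    """The straightforward check from the source code: -1000 <= s <= 1000."""
    return not (s > 1000 or s < -1000)


def in_range_one_unsigned_compare(s):
    """Emulate lea + cmp + setbe: add 1000, treat the result as unsigned, compare to 2000."""
    shifted = (s + 1000) & MASK32  # sums below -1000 wrap around to huge unsigned values
    return shifted <= 2000


if __name__ == "__main__":
    samples = [-2**31, -1001, -1000, -1, 0, 1, 1000, 1001, 2**31 - 1]
    for s in samples:
        assert in_range_two_compares(s) == in_range_one_unsigned_compare(s)
    print("both checks agree on all samples")
```

Both functions agree at the boundaries (-1001, -1000, 1000, 1001) and at the extremes of the 32-bit range, which is why the compiler is free to replace two signed comparisons with a single unsigned one.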

To answer your question, almost always the thing to optimize for is not memory or speed, but readability. Reading code is a lot harder than writing it, and reading code that's been mangled to "optimize" it is a lot harder than reading code that's been written to be clear. More often than not, these "optimizations" have negligible or, as in this case, exactly zero actual impact on performance.


Follow up question, what changes when this code is in an interpreted language instead of compiled? Then, does the optimization matter or does it have the same result?

Let's measure! I've transcribed the examples to Python:

def IsSumInRangeWithVar(a, b):
    s = a + b
    if s > 1000 or s < -1000:
        return False
    else:
        return True

def IsSumInRangeWithoutVar(a, b):
    if a + b > 1000 or a + b < -1000:
        return False
    else:
        return True

def IsSumInRangeSuperOptimized(a, b):
    return abs(a + b) <= 1000

from dis import dis
print('IsSumInRangeWithVar')
dis(IsSumInRangeWithVar)

print('\nIsSumInRangeWithoutVar')
dis(IsSumInRangeWithoutVar)

print('\nIsSumInRangeSuperOptimized')
dis(IsSumInRangeSuperOptimized)

print('\nBenchmarking')
import timeit
print('IsSumInRangeWithVar: %fs' % (min(timeit.repeat(lambda: IsSumInRangeWithVar(42, 42), repeat=50, number=100000)),))
print('IsSumInRangeWithoutVar: %fs' % (min(timeit.repeat(lambda: IsSumInRangeWithoutVar(42, 42), repeat=50, number=100000)),))
print('IsSumInRangeSuperOptimized: %fs' % (min(timeit.repeat(lambda: IsSumInRangeSuperOptimized(42, 42), repeat=50, number=100000)),))

Run with Python 3.5.2, this produces the output:

IsSumInRangeWithVar
  2           0 LOAD_FAST                0 (a)
              3 LOAD_FAST                1 (b)
              6 BINARY_ADD
              7 STORE_FAST               2 (s)

  3          10 LOAD_FAST                2 (s)
             13 LOAD_CONST               1 (1000)
             16 COMPARE_OP               4 (>)
             19 POP_JUMP_IF_TRUE        34
             22 LOAD_FAST                2 (s)
             25 LOAD_CONST               4 (-1000)
             28 COMPARE_OP               0 (<)
             31 POP_JUMP_IF_FALSE       38

  4     >>   34 LOAD_CONST               2 (False)
             37 RETURN_VALUE

  6     >>   38 LOAD_CONST               3 (True)
             41 RETURN_VALUE
             42 LOAD_CONST               0 (None)
             45 RETURN_VALUE

IsSumInRangeWithoutVar
  9           0 LOAD_FAST                0 (a)
              3 LOAD_FAST                1 (b)
              6 BINARY_ADD
              7 LOAD_CONST               1 (1000)
             10 COMPARE_OP               4 (>)
             13 POP_JUMP_IF_TRUE        32
             16 LOAD_FAST                0 (a)
             19 LOAD_FAST                1 (b)
             22 BINARY_ADD
             23 LOAD_CONST               4 (-1000)
             26 COMPARE_OP               0 (<)
             29 POP_JUMP_IF_FALSE       36

 10     >>   32 LOAD_CONST               2 (False)
             35 RETURN_VALUE

 12     >>   36 LOAD_CONST               3 (True)
             39 RETURN_VALUE
             40 LOAD_CONST               0 (None)
             43 RETURN_VALUE

IsSumInRangeSuperOptimized
 15           0 LOAD_GLOBAL              0 (abs)
              3 LOAD_FAST                0 (a)
              6 LOAD_FAST                1 (b)
              9 BINARY_ADD
             10 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             13 LOAD_CONST               1 (1000)
             16 COMPARE_OP               1 (<=)
             19 RETURN_VALUE

Benchmarking
IsSumInRangeWithVar: 0.019361s
IsSumInRangeWithoutVar: 0.020917s
IsSumInRangeSuperOptimized: 0.020171s

Disassembly in Python isn't terribly interesting, since the bytecode "compiler" doesn't do much in the way of optimization.

The performance of the three functions is nearly identical. We might be tempted to go with IsSumInRangeWithVar() due to its marginal speed gain. I'll add, though, that as I tried different parameters to timeit, sometimes IsSumInRangeSuperOptimized() came out fastest, so I suspect external factors are responsible for the difference rather than any intrinsic advantage of one implementation.

If this is really performance critical code, an interpreted language is simply a very poor choice. Running the same program with pypy, I get:

IsSumInRangeWithVar: 0.000180s
IsSumInRangeWithoutVar: 0.001175s
IsSumInRangeSuperOptimized: 0.001306s

Just using pypy, which uses JIT compilation to eliminate a lot of the interpreter overhead, has yielded a performance improvement of 1 or 2 orders of magnitude. I was quite shocked to see IsSumInRangeWithVar() is an order of magnitude faster than the others. So I changed the order of the benchmarks and ran again:

IsSumInRangeSuperOptimized: 0.000191s
IsSumInRangeWithoutVar: 0.001174s
IsSumInRangeWithVar: 0.001265s

So it seems it's not actually anything about the implementation that makes it fast, but rather the order in which I do the benchmarking!

I'd love to dig into this more deeply, because honestly I don't know why this happens. But I believe the point has been made: micro-optimizations like whether to declare an intermediate value as a variable or not are rarely relevant. With an interpreted language or highly optimized compiler, the first objective is still to write clear code.

If further optimization might be required, benchmark. Remember that the best optimizations come not from the little details but the bigger algorithmic picture: pypy is going to be an order of magnitude faster for repeated evaluation of the same function than cpython because it uses faster algorithms (JIT compiler vs interpretation) to evaluate the program. And there's the coded algorithm to consider as well: a search through a B-tree will be faster than a linked list.

After ensuring you're using the right tools and algorithms for the job, be prepared to dive deep into the details of the system. The results can be very surprising, even for experienced developers, and this is why you must have a benchmark to quantify the changes.
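Since the run order visibly influenced the PyPy numbers above, one way to make a benchmark less sensitive to it is to rotate the order and gather all timings before printing anything. The following is a minimal sketch, not a rigorous harness: the snake_case function names and the benchmark_rotated helper are hypothetical, and the repeat and number values are arbitrary. It does not explain the effect; it only stops any one function from always going first.

```python
import timeit


def is_sum_in_range_with_var(a, b):
    s = a + b
    return not (s > 1000 or s < -1000)


def is_sum_in_range_without_var(a, b):
    return not (a + b > 1000 or a + b < -1000)


def is_sum_in_range_abs(a, b):
    return abs(a + b) <= 1000


def benchmark_rotated(funcs, repeat=5, number=10000):
    """Time each function once per rotation of the run order; keep its best time."""
    best = {f.__name__: float("inf") for f in funcs}
    for shift in range(len(funcs)):
        order = funcs[shift:] + funcs[:shift]  # every function gets a turn at going first
        for f in order:
            t = min(timeit.repeat(lambda: f(42, 42), repeat=repeat, number=number))
            best[f.__name__] = min(best[f.__name__], t)
    return best


if __name__ == "__main__":
    results = benchmark_rotated(
        [is_sum_in_range_with_var, is_sum_in_range_without_var, is_sum_in_range_abs]
    )
    # Print only after every measurement has been collected.
    for name, seconds in results.items():
        print("%s: %fs" % (name, seconds))
```

If one implementation only wins when it happens to run first, the rotation makes that visible, which is a hint that the measurement rather than the code is responsible.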

Phil Frost
  • 6
    To provide an example in C#: [SharpLab produces identical asm for both methods](https://sharplab.io/#v2:EYLgZgpghgLgrgJwgZwLQGMD2A7ZMFQCW2MyANDCFMgLYA+AAgAwAEDAjANwCwAUAwGY2AJhYBhFgG8+LWSwAOCQgDdYEFsEyYANiwCSyAMpwae7ACUo2AOYQAggApiMFlDItnGgJQy503nKBHiQsyCwAvK4sANQaPLy+QR5gLA5hAHws7Ew5LHR0oSwAPCyo2TlebADsLGBQ2sgQ8UmyEA3qDDX4cE2JLAC+fX2KKmoaWroGxqYWVrYAQk4hbsEuwD4Bfn2BhCkOUDEaLJnlrPlRscDFpaeVnbX1jc0tbY3VLN29m7KD3yx9gjYABYWABZByVfyBX79IA==) (Desktop CLR v4.7.3130.00 (clr.dll) on x86) – VisualMelon Sep 06 '18 at 12:53
  • @PhilFrost This was an incredible answer and exactly what I was looking for. Follow up question, what changes when this code is in an interpreted language instead of compiled? Then, does the optimization matter or does it have the same result? – Corey P Sep 06 '18 at 13:56
  • 2
    @VisualMelon funilly enough the positive check: "return (((a + b) >= -1000) && ((a+b) <= 1000)); " gives a different result. :https://sharplab.io/#v2:EYLgZgpghgLgrgJwgZwLQGMD2A7ZMFQCW2MyANDCFMgLYA+AAgAwAEDAjANwCwAUAwGY2AJhYBhFgG8+LWSwAOCQgDdYEFsEyYANiwCSyAMpwae7ACUo2AOYQAggApiMFlDItnGgJQy503nKBHiQsyCwAvK4sANQaPLy+QR5gLA5hAHws7Ew5LHR0oSwAPCyo2TlebADsLGBQ2sgQ8UmyEA3qDDX4cE2JLAC+fX2KKmoaWroGxqYWVrYAQk4hbsEuwD4Bfn2BhCkOUDEaLJnlrPlRscDFpaeVnbX1jc0tbY3VLN29m7KD3yzDSlUMHUmh0+iMJjMlhsEDESxcK086z6/ha71SDn2h3Wx0iZRyTEqADIiRioNEcUVIrcvJw0X1fv0gA== – Pieter B Sep 06 '18 at 13:57
  • 12
    Readability can potentially make a program easier to optimize too. The compiler can easily rewrite to use *equivalent* logic like it is above, only if it can actually figure out what you're trying to do. If you use a lot of [old-school bithacks](http://graphics.stanford.edu/~seander/bithacks.html), cast back and forth between ints and pointers, reuse mutable storage etc. it may be much harder for the compiler to prove that a transformation is equivalent, and it'll just leave what you wrote, which may be suboptimal. – Alex Celeste Sep 06 '18 at 14:12
  • 1
    @Corey see edit. – Phil Frost Sep 06 '18 at 14:56
  • Those PyPy total times seem very short for a language that stops and JIT-compiles. It's also short enough that CPU frequency ramp-up from idle to max turbo might be a factor, especially if you're not using a Skylake (with hardware P-states). Benchmarking in seconds instead of core clock cycles requires controlling for CPU frequency. As a sanity check, does the time scale linearly with the repeat count? If not, you're measuring overhead. Also, won't a JIT be able to inline and hoist the calc out of a loop? Unless you use the output of one as input to the next, measuring latency not tput. – Peter Cordes Sep 06 '18 at 20:40
  • @PeterCordes Wouldn't that mean the first run is slower, not faster? Not sure how much optimization the pypy jit is capable of -- there's a module that can inspect the generated machine code but I haven't had time to play with it. – Phil Frost Sep 06 '18 at 21:49
  • Yeah, the ramp-up effect would normally make the earlier stuff slower. The times aren't so short that the ~8us of pause while the CPU switches frequency and voltage ([Lost Cycles on Intel? An inconsistency between rdtsc and CPU\_CLK\_UNHALTED.REF\_TSC](https://stackoverflow.com/q/45472147)) shouldn't be making anything slower. I just mentioned it as another reason why your bench interval is too short. Perhaps it just interprets at first, and then stops to JIT, like the HotSpot JVM? If it decided to stop and JIT-compile right before the last iteration, that would suck. – Peter Cordes Sep 06 '18 at 22:25
  • Anyway, if you just crank up the repeat-count iterations so they take about 0.5 to 1.0 seconds, do the results stay similar? Your PyPy results are definitely surprising. (I don't normally look at Python, though, so IDK what kind of gotchas might exist for `timeit` on such a short function.) Oh, I just tried it. Those times are per-call averages, not the actual total measurement interval. So IDK. – Peter Cordes Sep 06 '18 at 22:26
  • @PeterCordes Yeah, increasing the iterations by 100x, whichever thing is tested first is still an order of magnitude faster. Oddly, if I benchmark all three implementations once, then again, then a third time, on each iteration the same implementation is slower than it was the first time it was benchmarked. I kinda wonder if `timeit` is somehow different under pypy. http://doc.pypy.org/en/latest/cpython_differences.html says "The timeit module behaves differently under PyPy: it prints the average time and the standard deviation, instead of the minimum, since the minimum is often misleading." – Phil Frost Sep 06 '18 at 22:32
  • The first one runs after `import timeit`, while the others run right after a `print` returns. Perhaps there's some warm-up effect there, or influence on what the JIT does? What if you collect all 3 results before printing anything? – Peter Cordes Sep 06 '18 at 22:33
  • Oh, `timeit` is still trying to time each call separately? This function is way too short for that, call overhead will dominate. To realistically let a JIT do anything, you need to call it in a loop over an array, or with input feeding output, or something, and time the whole loop. – Peter Cordes Sep 06 '18 at 22:34
  • No, the `number` parameter times calling the passed function a lot of times, in my most recent tests, 100000000 times. I grok JIT warmup and all that, the baffling thing is the _first_ thing to get benchmarked is the fastest. That's not warm-up, it's cool-down. – Phil Frost Sep 06 '18 at 22:36
  • My point was that the PyPy doc says it prints average *and standard deviation*, so internally `timeit` must still record timestamps around this near-trivial thing. It doesn't explain this crazy "cool-down" effect, but it does mean there's huge overhead. (See [clflush to invalidate cache line via C function](https://stackoverflow.com/a/51830976) and my answer on [Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths](https://stackoverflow.com/a/51989771) for how hard it is to time *very* short intervals, even with non-portable raw `rdtsc` – Peter Cordes Sep 07 '18 at 00:27
  • 2
    @Corey: this answer is actually telling you exactly what I wrote in my answer: there is no difference when you use a decent compiler, and instead focus on readability. Of course, it looks better founded - maybe you believe me now. – Doc Brown Sep 07 '18 at 05:39
  • I suspect the reason why the order of your benchmarking seems to matter might have to do with branch prediction in the CPU. https://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array – maple_shaft Sep 07 '18 at 11:35
  • @maple_shaft I seriously doubt it. The code and the input is the same every time, and yet if I run it a few million times it somehow becomes up to three orders of magnitude slower. – Phil Frost Sep 07 '18 at 11:40
  • 1
    @PeterCordes `timeit` times running the callable `number` times, and then it repeats that test `repeat` times, and I print the fastest of those repeats, the assumption being the fastest time is the intrinsic speed of the function without any JIT overhead, interruptions by other processes on the host, cache misses, etc. The cpython timeit documentation concurs, but for some reason pypy has decided this is "misleading" but they don't say why. Maybe it's re-compiling the function periodically even though nothing has changed. Maybe it's a bug. I don't really know. – Phil Frost Sep 07 '18 at 12:00
  • 1
    `!(a + b > 1000 or a + b < -1000)` and `abs(a + b) <= 1000` can produce different functionality when `a+b == INT_MIN` given the usual trouble with `abs(INT_MIN)`. This compiler happened to produce desired code, yet was not obliged to with `abs(a + b) <= 1000`. – chux - Reinstate Monica Sep 07 '18 at 14:39
  • @chux: gcc also has to undo the `abs` and turn it back into a range-check (like I commented [on another answer that suggested the `abs()` "optimization"](https://softwareengineering.stackexchange.com/questions/377927/when-to-optimize-for-memory-vs-performance-speed-for-a-method/378045?noredirect=1#comment831637_377940), then apply the unsigned-compare trick: `(unsigned)(a+b+999) <= 1998U`. (I can't repro the `a+b+1000 < 2000` output on Godbolt with -O1 or -O3 with a few gcc versions. https://godbolt.org/z/d0isuc). Other compilers (like clang) don't manage to undo it. – Peter Cordes Sep 07 '18 at 15:28
  • Re why order matters for PyPy: Did you burn through the JIT warmup period? – Kevin Sep 08 '18 at 01:20
  • @Kevin JIT warmup has already been discussed at length in the comments. – Phil Frost Sep 08 '18 at 03:06
66

To answer the stated question:

When to optimize for memory vs performance speed for a method?

There are two things you have to establish:

  • What is limiting your application?
  • Where can I reclaim the most of that resource?

In order to answer the first question, you have to know what the performance requirements for your application are. If there are no performance requirements then there is no reason to optimize one way or the other. The performance requirements help you to get to the place of "good enough".

The method you provided wouldn't cause any performance issues on its own, but if it runs within a loop that processes a large amount of data, you have to start thinking a little differently about how you approach the problem.

Detecting what is limiting the application

Start looking at the behavior of your application with a performance monitor. Keep an eye on CPU, disk, network, and memory usage while it's running. One or more items will be maxed out while everything else is moderately used (unless you hit the perfect balance, but that almost never happens).

When you need to look deeper, typically you would use a profiler. There are memory profilers and process profilers, and they measure different things. The act of profiling does have a significant performance impact, but you are instrumenting your code to find out what's wrong.

Let's say you see your CPU and disk usage peaked. You would first check for "hot spots": code that either is called more often than the rest or takes a significantly larger share of the processing time.
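As an illustration of hunting for hot spots with a process profiler, here is a minimal sketch using Python's standard cProfile and pstats modules. The workload (count_pairs_in_range) and the helper profile_report are invented for this example; a real application would profile its own entry point instead.

```python
import cProfile
import io
import pstats


def is_sum_in_range(a, b):
    s = a + b
    return -1000 <= s <= 1000


def count_pairs_in_range(values):
    """Hypothetical workload: calls is_sum_in_range once for every pair of values."""
    return sum(1 for a in values for b in values if is_sum_in_range(a, b))


def profile_report(values, top=5):
    """Profile one run of the workload and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    count_pairs_in_range(values)
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)  # most expensive call paths first
    return out.getvalue()


if __name__ == "__main__":
    print(profile_report(list(range(-600, 600, 3))))
```

The call-count column shows how often each function ran and the cumulative-time column shows where the time went, which are the two kinds of hot spot described above. Profiling itself slows the program down, so the absolute numbers matter less than the ranking.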

If you can't find any hot spots, you would then start looking at memory. Perhaps you are creating more objects than necessary and your garbage collection is working overtime.
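To make the memory side concrete, the sketch below uses Python's standard tracemalloc module to compare the peak allocation of two ways of computing the same result. The helper peak_bytes and the two example functions are assumptions made for illustration; tracemalloc only sees allocations made through Python's allocator, so treat the figures as indicative.

```python
import tracemalloc


def peak_bytes(func, *args):
    """Run func(*args) once and return (result, peak traced bytes during the call)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()  # also discards the collected traces
    return result, peak


def sum_squares_with_list(n):
    return sum([i * i for i in range(n)])  # builds the whole list before summing


def sum_squares_with_generator(n):
    return sum(i * i for i in range(n))  # produces one value at a time


if __name__ == "__main__":
    for func in (sum_squares_with_list, sum_squares_with_generator):
        result, peak = peak_bytes(func, 100000)
        print("%s: result=%d, peak=%d bytes" % (func.__name__, result, peak))
```

The list version keeps every intermediate object alive at once, which is the "creating more objects than necessary" situation; the generator version produces the same sum with a much smaller peak.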

Reclaiming performance

Think critically. The following list of changes is in order of how much return on investment you'll get:

  • Architecture: look for communication choke points
  • Algorithm: the way you process data might need to change
  • Hot spots: minimizing how often you call the hot spot can yield a big bonus
  • Micro optimizations: it's not common, but sometimes you really do need to think of minor tweaks (like the example you provided), particularly if it is a hot spot in your code.

In situations like this, you have to apply the scientific method. Come up with a hypothesis, make the changes, and test it. If you meet your performance goals, you're done. If not, go to the next thing in the list.


Answering the question in bold:

When is it appropriate to use Method A vs. Method B, and vice versa?

Honestly, this is the last step in trying to deal with performance or memory problems. The impact of Method A vs. Method B will be really different depending on the language and platform (in some cases).

Just about any compiled language with a halfway decent optimizer will generate similar code with either of those structures. However those assumptions don't necessarily remain true in proprietary and toy languages that don't have an optimizer.

Precisely which will have a better impact depends on whether sum is a stack variable or a heap variable. This is a language implementation choice. In C, C++ and Java for example, number primitives like an int are stack variables by default. Your code has no more memory impact by assigning to a stack variable than you would have with fully inlined code.

Other optimizations that you might find in C libraries (particularly older ones), such as deciding whether to copy a 2-dimensional array down first or across first, are platform dependent. They require some knowledge of how the chipset you are targeting best optimizes memory access. There are subtle differences between architectures.
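
To make "down first or across first" concrete, here is a small C# sketch of the two traversal orders over a rectangular array. The class and method names are hypothetical. A C# `int[,]` is laid out row by row, so the across-first loop touches adjacent memory while the down-first loop jumps by a full row each step. Both return the same total; any speed difference would have to be measured on the actual target.

```csharp
public static class GridSum
{
    // Across first: the inner loop walks along a row, so consecutive
    // reads are adjacent in memory for a C# rectangular array.
    public static long SumAcrossFirst(int[,] grid)
    {
        long total = 0;
        for (int row = 0; row < grid.GetLength(0); row++)
        {
            for (int col = 0; col < grid.GetLength(1); col++)
            {
                total += grid[row, col];
            }
        }
        return total;
    }

    // Down first: the inner loop walks down a column, so consecutive
    // reads are one full row apart in memory.
    public static long SumDownFirst(int[,] grid)
    {
        long total = 0;
        for (int col = 0; col < grid.GetLength(1); col++)
        {
            for (int row = 0; row < grid.GetLength(0); row++)
            {
                total += grid[row, col];
            }
        }
        return total;
    }
}
```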

Bottom line is that optimization is a combination of art and science. It requires some critical thinking, as well as a degree of flexibility in how you approach the problem. Look for big things before you blame small things.

Berin Loritsch
  • 45,784
  • 7
  • 87
  • 160
  • 2
    This answer focuses on my question the most and doesn't get caught up on my coding examples, i.e. Method A and Method B. – Corey P Sep 04 '18 at 23:05
  • 18
    I feel like this is the generic answer to "How do you address performance bottlenecks" but you would be hard pressed to identify relative memory usage from a particular function based on whether it had 4 or 5 variables using this method. I also question how relevant this level of optimization is when the compiler (or interpreter) may or may not optimize this away. – Eric Sep 05 '18 at 00:24
  • @Eric, as I mentioned, the last category of performance improvement would be your micro-optimizations. The only way to have a good guess if it will have any impact is by measuring performance/memory in a profiler. It is rare that those types of improvements have payoff, but in timing sensitive performance problems you have in simulators a couple well placed changes like that can be the difference between hitting your timing target and not. I think I can count on one hand the number of times that paid off in over 20 years of working on software, but it's not zero. – Berin Loritsch Sep 05 '18 at 00:41
  • @BerinLoritsch Again, in general I agree with you, but in this specific case I do not. I've provided my own answer, but I've not personally seen any tools that will flag or even give you ways to potentially identify performance issues related to stack memory size of a function. – Eric Sep 05 '18 at 01:09
  • @DocBrown, I have remedied that. Regarding the second question, I pretty much agree with you. – Berin Loritsch Sep 05 '18 at 12:50
  • *It requires some knowledge of how the chipset you are targeting best optimizes memory access.* No it doesn't, it requires knowing whether the language you're using does row-major or column-major 2D arrays (or how your data structure is organized). All hardware does *much* better with sequential contiguous access to all the bytes in a cache line than to strided accesses. – Peter Cordes Sep 06 '18 at 20:44
  • Your answer only talks about memory usage in the context of total memory-allocation of the whole program. The other performance-relevant kind of memory consumption is your *cache* footprint / working set, over an inner loop or over an outer loop. Memory is cheap, but cache is not. A 1MiB lookup table replacing a short computation might look good in a microbenchmark, and have negligible impact on your total memory footprint, but in real use (when you only index it between other memory operations) can lead to tons of cache misses, at least in L2 cache. – Peter Cordes Sep 06 '18 at 20:52
  • @Corey: so you'd profile with performance counters for cache misses to see if touching less stack space and other data could help, if you had an actual case where it was a tradeoff between computation and space (unlike the silly example in your question). e.g. `perf stat -d ./my_program` on Linux to get counts over the whole program, or record a profile of last-level cache misses. [Linux C++: how to profile time wasted due to cache misses?](https://stackoverflow.com/q/2486840). – Peter Cordes Sep 06 '18 at 20:54
45

"this would reduce memory" - em, no. Even if this were true (which, for any decent compiler, it is not), the difference would most probably be negligible in any real-world situation.

However, I would recommend using Method A* (Method A with a slight change):

private bool IsSumInRange(int a, int b)
{
    int sum = a + b;

    if (sum > 1000 || sum < -1000) return false;
    else return true;
    // (yes, the former statement could be cleaned up to
    // return abs(sum)<=1000;
    // but let's ignore this for a moment)
}

but for two completely different reasons:

  • by giving the variable s a descriptive name, the code becomes clearer

  • it avoids having the same summation logic twice in the code, so the code becomes more DRY, which makes it less error-prone when it is changed.

Doc Brown
  • 199,015
  • 33
  • 367
  • 565
  • 36
    I would clean it up even further and go with "return sum > -1000 && sum < 1000;". – 17 of 26 Sep 04 '18 at 18:28
  • Why wouldn't it reduce memory usage? Lets assume my code example was perfectly formatted to anyone's liking, then would A or B be better? – Corey P Sep 04 '18 at 18:31
  • 36
    @Corey any decent optimizer will use a CPU register for the `sum` variable, thus leading to zero memory usage. And even if not, this is only a single word of memory in a “leaf” method. Considering how incredibly memory-wasteful Java or C# can be otherwise due to their GC and object model, a local `int` variable literally does not use any noticeable memory. This is pointless micro-optimization. – amon Sep 04 '18 at 18:42
  • @amon, thanks for the clarification. So, lets assume it's not an `int`, but perhaps something a little more complex which has a noticeable memory usage. How would I then consider my options? – Corey P Sep 04 '18 at 18:49
  • 10
    @Corey: if it is "**a little** more complex", it probably won't become "a noticeable memory usage". Maybe if you construct a really more complex example, but that makes it a different question. Note also that even if you don't create a specific variable for an expression, the runtime environment may still internally create temporary objects for complex intermediate results, so it completely depends on the details of the language, environment, optimization level, and whatever you call "noticeable". – Doc Brown Sep 04 '18 at 19:33
  • 1
    I'd reorder to `if (-1000 < sum || sum < 1000) return true; else return false;` because ordering the conditionals this way makes it clearer that you're doing a range check. – Dan Is Fiddling By Firelight Sep 04 '18 at 20:41
  • 8
    In addition to the points above, I'm pretty sure how C# / Java chooses to store `sum` would be an *implementation detail* and I doubt anyone could make a convincing case as to whether or not a silly trick like avoiding one local `int` would lead to this or that amount of memory usage in the long term. IMO readability is more important. Readability can be subjective, but FWIW, personally I'd rather you never do the same computation twice, not for CPU usage, but because I only have to inspect your addition once when I'm looking for a bug. – jrh Sep 04 '18 at 22:12
  • 2
    ... also note that garbage collected languages in general are an unpredictable, "churning sea of memory" that (for C# anyway) might only be cleaned up **when needed**, I remember making a program that allocated gigabytes of RAM and it only started "cleaning up" after itself when memory became scarce. If the GC doesn't need to run, it might take its sweet time and save your CPU for more pressing matters. – jrh Sep 04 '18 at 22:13
  • 1
    Be careful @Dan - that's not the same test. (Even if you change `||` to `&&`, you also got the comparisons wrong: consider `sum == 1000`, for example). – Toby Speight Sep 05 '18 at 14:52
  • @TobySpeight whooops. 30s code edit strikes again.... – Dan Is Fiddling By Firelight Sep 05 '18 at 15:13
33

You can do better than both of those with

return (abs(a + b) > 1000);

Most processors (and hence compilers) can do abs() in a single operation. You not only have fewer sums, but also fewer comparisons, which are generally more computationally expensive. It also removes the branching, which is much worse on most processors because it stops pipelining being possible.

The interviewer, as other answers have said, is plant life and has no business conducting a technical interview.

That said, his question is valid. And the answer to when you optimise and how, is when you've proved it's necessary, and you've profiled it to prove exactly which parts need it. Knuth famously said that premature optimisation is the root of all evil, because it's too easy to try to gold-plate unimportant sections, or make changes (like your interviewer's) which have no effect, whilst missing the places which really do need it. Until you've got hard proof it's really necessary, clarity of code is the more important target.

Edit: FabioTurati correctly points out that this has the opposite logic sense to the original (my mistake!), and that this illustrates a further consequence of Knuth's quote: we risk breaking the code while we're trying to optimise it.
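
A version of the abs() idea that keeps the original sense (true when the sum is in range) might look like the following C# sketch. The class name RangeCheck is hypothetical. Widening to long before adding is an extra assumption on my part: it means sums that would overflow an int are reported as out of range instead of wrapping around, which differs from what the original methods do at those extremes.

```csharp
using System;

public static class RangeCheck
{
    // Returns true when a + b lies in [-1000, 1000].
    public static bool IsSumInRange(int a, int b)
    {
        // Widen to long first so the addition itself cannot overflow.
        long sum = (long)a + b;
        return Math.Abs(sum) <= 1000;
    }
}
```

The boundary values 1000 and -1000 count as in range here, matching the comparisons in Method A and Method B.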

Graham
  • 1,996
  • 1
  • 12
  • 11
  • Good answer, but would be improved if 1000 was replaced by a well named constant. – user949300 Sep 04 '18 at 22:17
  • @user949300 Agreed, except I don't know what the constant represents, so I can't help there. :) Still, my point was more about optimisation than coding standards. If we're adding a named constant, I'd also like better variable names than "a" and "b", both parameters declared const, and Doxygen comments for the function declaration. – Graham Sep 04 '18 at 22:33
  • Yes, a and b might be usefully renamed. I'm actually not a fan of routinely "consting" function arguments as it add verbosity and noise for almost zero benefit. YMMV. – user949300 Sep 04 '18 at 22:48
  • The code was simply to demonstrate my question. See my note. – Corey P Sep 04 '18 at 22:58
  • 2
    @Corey, I am quite sure the Graham pins the request *"he challenged me to solve the same problem with less variables"* as expected. If I'd be the interviewer, I'd expect that answer, not moving `a+b` into `if` and doing it twice. You understand it wrong *"He was pleased and said this would reduce memory usage by this method"* - he was nice to you, hiding his disappointment with this meaningless explanation about memory. You shouldn't be taking it serious to ask question here. Did you get a job? My guess you didn't :-( – Sinatr Sep 05 '18 at 13:30
  • 1
    You are applying 2 transformations at the same time: you have turned the 2 conditions into 1, using `abs()`, and you also have a single `return`, instead of having one when the condition is true ("if branch") and another one when it's false ("else branch"). When you change code like this, be careful: there's the risk to inadvertently write a function that returns true when it should return false, and vice versa. Which is exactly what happened here. I know you were focusing on another thing, and you've done a nice job at it. Still, *this could have easily cost you the job...* – Fabio says Reinstate Monica Sep 05 '18 at 20:49
  • 2
    @FabioTurati Well spotted - thanks! I'll update the answer. And it's a good point about refactoring and optimisation, which makes Knuth's quote even more relevant. We should prove we need the optimisation before we take the risk. – Graham Sep 06 '18 at 01:24
  • 2
    *Most processors (and hence compilers) can do abs() in a single operation.* Unfortunately not the case for integers. ARM64 has a conditional negate it can use if flags are already set from an `adds`, and ARM has predicated reverse-sub (`rsblt` = reverse-sub if less-than), but everything else requires multiple extra instructions to implement `abs(a+b)` or `abs(a)`. https://godbolt.org/z/Ok_Con shows x86, ARM, AArch64, PowerPC, MIPS, and RISC-V asm output. **It's only by transforming the comparison into a range-check `(unsigned)(a+b+999) <= 1998U` that gcc can optimize it like in Phil's answer.** – Peter Cordes Sep 06 '18 at 21:15
  • Floating-point is different: abs is simply masking off the top bit, because IEEE FP uses a sign/magnitude representation, and most ISAs have either a SIMD AND instruction or an `fabs` instruction. [Fastest way to compute absolute value using SSE](https://stackoverflow.com/q/32408665). Integer `abs()` is fast and branchless though: a couple to a few instructions, so it's still basically cheap, and gcc is smart enough to undo it and turn it back into a range check that allows the unsigned-compare trick. `abs(a+b)<1000` is easy for humans to understand, so it's not bad. But clang does worse. – Peter Cordes Sep 06 '18 at 21:22
  • 2
    The "improved" code in this answer is still wrong, since it produces a different answer for `IsSumInRange(INT_MIN, 0)`. The original code returns `false` because `INT_MIN+0 > 1000 || INT_MIN+0 < -1000`; but the "new and improved" code returns `true` because `abs(INT_MIN+0) < 1000`. (Or, in some languages, it'll throw an exception or have undefined behavior. Check your local listings.) – Quuxplusone Sep 07 '18 at 03:36
16

When is it appropriate to use Method A vs. Method B, and vice versa?

Hardware is cheap; programmers are expensive. So the cost of the time you two wasted on this question is probably far worse than either answer.

Regardless, most modern compilers would find a way to optimize the local variable into a register (instead of allocating stack space), so the methods are probably identical in terms of executable code. For this reason, most developers would pick the option that communicates the intention most clearly (see Writing really obvious code (ROC)). In my opinion, that would be Method A.

On the other hand, if this is purely an academic exercise, you can have the best of both worlds with Method C:

private bool IsSumInRange(int a, int b)
{
    a += b;
    return (a >= -1000 && a <= 1000);
}
John Wu
  • 26,032
  • 10
  • 63
  • 84
  • 17
    `a+=b` is a neat trick but I have to mention (just in case it isn't implied from the rest of the answer), from my experience methods that mess with parameters can be very hard to debug and maintain. – jrh Sep 04 '18 at 22:21
  • 1
    I agree @jrh. I am a strong advocate for ROC, and that sort of thing is anything but. – John Wu Sep 04 '18 at 22:40
  • 3
    "Hardware is cheap; programmers are expensive." In the world of consumer electronics, that statement is false. If you sell millions of units, then it is a very good investment to spend $500.000 in additional development cost to save $0,10 on the hardware costs per unit. – Bart van Ingen Schenau Sep 05 '18 at 10:34
  • @BartvanIngenSchenau I am not sure Amazon builds consumer electronics. I think they mostly build server software, and server hardware is cheap compared to the teams that maintain the software that runs on it. – John Wu Sep 05 '18 at 16:28
  • @JohnWu, my point was that it isn't a universal truth. For companies like Amazon it can certainly be true. – Bart van Ingen Schenau Sep 05 '18 at 16:47
  • 2
    @JohnWu: You simplified out the `if` check, but forgot to reverse the result of the comparison; your function is now returning `true` when `a + b` is *not* in the range. Either add a `!` to the outside of the condition (`return !(a > 1000 || a < -1000)`), or distribute the `!`, inverting tests, to get `return a <= 1000 && a >= -1000;` Or to make the range check flow nicely, `return -1000 <= a && a <= 1000;` – ShadowRanger Sep 05 '18 at 19:03
  • Thanks @ShadowRanger. Simple logic errors are another reason to write ROC. – John Wu Sep 05 '18 at 19:18
  • 1
    @JohnWu: Still slightly off at the edge cases, distributed logic requires `<=`/`>=`, not `<`/`>` (with `<`/`>`, 1000 and -1000 are treated as being out of range, original code treated them as in range). – ShadowRanger Sep 05 '18 at 19:30
  • This is absolutely the correct answer. I spent many years working with 1K of RAM or even less in the case of microcontrollers. Now with the advent of 4GL's and scalable EC2 instances.. this kind of painstaking detail is not economical. There are occasions where coding for execution speed is still necessary.. but memory usage is now pretty much a pointless concern. – Richard Sep 05 '18 at 21:02
  • @Richard Correct in spirit (according to some aspect of code style you don't elucidate, straightforwardness I guess) but it can overflow (like other answers' code). – philipxy Sep 05 '18 at 23:41
  • 1
    (BTW, I'm pretty sure that in the current version of this answer, `return (a >= -1000 || a <= 1000);` will ALWAYS evaluate to `true`.) @ShadowRanger – mathmandan Sep 06 '18 at 15:16
  • @mathmandan: Yup. Seriously, just copy and paste: `return a <= 1000 && a >= -1000;` or `return -1000 <= a && a <= 1000;`, either is correct, it's just a matter of personal style. Don't have edit privileges here, or I'd do it myself. – ShadowRanger Sep 06 '18 at 17:39
11

I would optimize for readability. Method X:

private bool IsSumInRange(int number1, int number2)
{
    return IsValueInRange(number1+number2, -1000, 1000);
}

private bool IsValueInRange(int Value, int Lowerbound, int Upperbound)
{
    return  (Value >= Lowerbound && Value <= Upperbound);
}

Small methods that do just 1 thing but are easy to reason about.

(This is personal preference, I like positive testing instead of negative, your original code is actually testing whether the value is NOT outside the range.)

Pieter B
  • 12,867
  • 1
  • 40
  • 65
  • 6
    This. (Upvoted comments above that were similar re: readability). 30 years ago, when we were working with machines that had less than 1mb of RAM, squeezing performance was necessary - just like the y2k problem, get a few hundred thousand records that each have a few bytes of memory being wasted due to unused vars and references, etc and it adds up quick when you only have 256k of RAM. Now that we are dealing with machines that have multiple gigabytes of RAM, saving even a few MB of RAM use vs readability and maintainability of code isn't a good trade. – ivanivan Sep 05 '18 at 13:19
  • @ivanivan: I don't think the "y2k problem" was really about memory. From a data-entry standpoint, entering two digits is more efficient than entering four, and keeping things as entered is easier than converting them to some other form. – supercat Sep 05 '18 at 16:13
  • 10
    Now you have to trace through 2 functions to see what's happening. You can't take it at face value, because you can't tell from the name whether these are inclusive or exclusive bounds. And if you add that information, the name of the function is longer than the code to express it. – Peter Sep 05 '18 at 16:27
  • 1
    Optimise readability and make small, easy-to-reason functions – sure, agree. But I strongly disagree that renaming `a` and `b` to `number1` and `number2` aids readability in any way. Also your naming of the functions is inconsistent: why does `IsSumInRange` hard-code the range if `IsValueInRange` accept it as arguments? – leftaroundabout Sep 05 '18 at 19:30
  • The 1st function can overflow. (Like other answers' code.) Although the complexity of the overflow-safe code is an argument for putting it into a function. – philipxy Sep 05 '18 at 23:36
  • @leftaroundabout I wanted to keep the original function the same. Personally I would use the "IsValueInRange" function over the "IsSumInRange". About renaming a and b to number1 and number2 is because "a" doesn't tell you anything about what that integer is. – Pieter B Sep 06 '18 at 07:49
  • @philipxy and that's why you should never copy code from stackexchange. It's a proof of concept. Actually I can't recall ever seeing code on stack that you can use in production. Range checking, overflow protection, proper exception handling, it's never there. And it would defeat the purpose of answering the question. – Pieter B Sep 06 '18 at 07:52
  • @ivanivan When we were working with much weaker machines, and thus had much weaker optimization, more tricks were needed to get acceptable performance. Now, many of those tricks make no difference at all, and most of the rest are rarely significant enough to bother, assuming they hinder readability at all. – Deduplicator Sep 06 '18 at 13:33
6

In short, I don't think the question has much relevance in current computing, but from a historical perspective it's an interesting thought exercise.

Your interviewer is likely a fan of The Mythical Man-Month. In the book, Fred Brooks makes the case that programmers will generally need two versions of key functions in their toolbox: a memory-optimized version and a CPU-optimized version. Fred based this on his experience leading the development of the IBM System/360 operating system, where machines might have as little as 8 kilobytes of RAM. In such machines, memory required for local variables in functions could potentially be important, especially if the compiler did not effectively optimize them away (or if code was written in assembly language directly).

In the current era, I think you would be hard pressed to find a system where the presence or absence of a local variable in a method would make a noticeable difference. For a variable to matter, the method would need to be recursive with deep recursion expected. Even then, it's likely that the stack depth would be exceeded, causing Stack Overflow exceptions, before the variable itself caused an issue. The only real scenario where it may be an issue is with very large arrays allocated on the stack in a recursive method. But that is also unlikely, as I think most developers would think twice about unnecessary copies of large arrays.

Eric
  • 785
  • 4
  • 5
4

After the assignment s = a + b; the variables a and b are not used anymore. Therefore, no memory is used for s if you are not using a completely brain-damaged compiler; memory that was used anyway for a and b is re-used.

But optimising this function is utter nonsense. If you could save space, it would be maybe 8 bytes while the function is running (which is recovered when the function returns), so absolutely pointless. If you could save time, it would be a single-digit number of nanoseconds. Optimising this is a total waste of time.

gnasher729
  • 42,090
  • 4
  • 59
  • 119
3

Local value type variables are allocated on the stack or (more likely for such small pieces of code) use registers in the processor and never get to see any RAM. Either way they are short lived and nothing to worry about. You start considering memory use when you need to buffer or queue data elements in collections that are both potentially large and long lived.

Then it depends what you care about most for your application. Processing speed? Response time? Memory footprint? Maintainability? Consistency in design? All up to you.

Martin Maat
  • 18,218
  • 3
  • 30
  • 57
  • 4
    Nitpicking: .NET at least (language of the post is unspecified) doesn't make any guarantees about local variables being allocated "on the stack". See ["the stack is an implementation detail"](https://blogs.msdn.microsoft.com/ericlippert/2009/04/27/the-stack-is-an-implementation-detail-part-one/) by Eric Lippert. – jrh Sep 04 '18 at 22:19
  • 1
    @jrh Local variables on stack or heap may be an implementation detail, but if someone really wanted a variable on the stack there's `stackalloc` and now `Span`. Possibly useful in a hot spot, after profiling. Also, some of the docs around structs imply that value types *may* be on the stack while reference types *will not* be. Anyway, at best you might avoid a bit of GC. – Bob Sep 06 '18 at 02:27
2

As other answers have said, you need to think what you're optimising for.

In this example, I suspect that any decent compiler would generate equivalent code for both methods, so the decision would have no effect on the run time or memory!

What it does affect is the readability of the code.  (Code is for humans to read, not just computers.)  There's not too much difference between the two examples; when all other things are equal, I consider brevity to be a virtue, so I'd probably pick Method B.  But all other things are rarely equal, and in a more complex real-world case, it could have a big effect.

Things to consider:

  • Does the intermediate expression have any side-effects?  If it calls any impure functions or updates any variables, then of course duplicating it would be a matter of correctness, not just style.
  • How complex is the intermediate expression?  If it does lots of calculations and/or calls functions, then the compiler may not be able to optimise it, and so this would affect performance.  (Though, as Knuth said, “We should forget about small efficiencies, say about 97% of the time”.)
  • Does the intermediate variable have any meaning?  Could it be given a name that helps to explain what's going on?  A short but informative name could explain the code better, while a meaningless one is just visual noise.
  • How long is the intermediate expression?  If long, then duplicating it could make the code longer and harder to read (especially if it forces a line break); if not, the duplication could be shorter over all.
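
The side-effect point is the one that changes behaviour, so here is a small C# sketch of it. Everything in it is hypothetical (the SideEffectDemo class, the Load helper, and a NextReading method that consumes one queued value per call). It only shows that duplicating an impure expression, the way Method B duplicates a + b, is no longer a pure style choice.

```csharp
using System.Collections.Generic;

public static class SideEffectDemo
{
    // Hypothetical source of values; each call to NextReading consumes one.
    private static readonly Queue<int> Readings = new Queue<int>();

    public static void Load(params int[] values)
    {
        Readings.Clear();
        foreach (int value in values) Readings.Enqueue(value);
    }

    public static int NextReading() => Readings.Dequeue();

    // Method A style: the impure call happens once and both
    // comparisons look at the same value.
    public static bool InRangeWithLocal()
    {
        int reading = NextReading();
        return !(reading > 1000 || reading < -1000);
    }

    // Method B style: when the first comparison is false, the second
    // comparison calls NextReading again and sees a different value.
    public static bool InRangeDuplicated()
    {
        return !(NextReading() > 1000 || NextReading() < -1000);
    }
}
```

With the readings 500 and -1500 queued, InRangeWithLocal consumes only 500 and returns true, while InRangeDuplicated consumes both values and returns false.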
gidds
  • 789
  • 3
  • 8
1

As many of the answers have pointed out, attempting to tune this function with modern compilers won't make any difference. An optimizer can most likely figure out the best solution (up-vote to the answer that showed the assembler code to prove it!). You stated that the code in the interview was not exactly the code you were asked to compare, so perhaps the actual example makes a bit more sense.

But let's take another look at this question: this is an interview question. So the real issue is, how should you answer it assuming that you want to try and get the job?

Let's also assume that the interviewer does know what they are talking about and they are just trying to see what you know.

I would mention that, ignoring the optimizer, the first may create a temporary variable on the stack whereas the second wouldn't, but would perform the calculation twice. Therefore, the first uses more memory but is faster.

You could mention that a calculation may in any case require a temporary variable to store the result (so that it can be compared), so whether you name that variable or not might not make any difference.

I would then mention that in reality the code would be optimized and most likely equivalent machine code would be generated since all the variables are local. However, it does depend on what compiler you are using (it was not that long ago that I could get a useful performance improvement by declaring a local variable as "final" in Java).

You could mention that the stack in any case lives in its own memory page, so unless your extra variable caused the stack to overflow the page, it won't in reality allocate any more memory. If it does overflow it will want a whole new page though.

I would mention that a more realistic example might be the choice of whether to use a cache to hold the results of many computations or not and this would raise a question of cpu vs memory.
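
A minimal C# sketch of that CPU-versus-memory choice follows. The CachedSquareSum class and its stand-in "expensive" computation are hypothetical, chosen only so the trade is visible: every cached result costs a dictionary entry and saves a recomputation.

```csharp
using System.Collections.Generic;

public sealed class CachedSquareSum
{
    // Memory spent: one dictionary entry per distinct input seen.
    private readonly Dictionary<int, long> _cache = new Dictionary<int, long>();

    // Number of times the computation actually ran (CPU spent).
    public int ComputeCount { get; private set; }

    public int CachedEntries => _cache.Count;

    // Stand-in for an expensive computation: the sum of squares 1..n.
    private long Compute(int n)
    {
        ComputeCount++;
        long total = 0;
        for (long i = 1; i <= n; i++) total += i * i;
        return total;
    }

    // Returns the cached result when present, otherwise computes and stores it.
    public long Get(int n)
    {
        if (!_cache.TryGetValue(n, out long value))
        {
            value = Compute(n);
            _cache[n] = value;
        }
        return value;
    }
}
```

Whether the cache pays off depends on how often inputs repeat and how large the cache is allowed to grow, which is exactly the kind of question that needs measurement.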

All this demonstrates that you know what you are talking about.

I would leave it to the end to say that it would be better to focus on readability instead. Although true in this case, in the interview context it may be interpreted as "I don't know about performance but my code reads like a Janet and John story".

What you should not do is trot out the usual bland statements about how code optimization is not necessary, don't optimize until you have profiled the code (this just indicates you can't see bad code for yourself), hardware costs less than programmers, and please, please, don't quote Knuth "premature blah blah ...".

Code performance is a genuine issue in a great many organisations and many organisations need programmers who understand it.

In particular, with organisations such as Amazon, some of the code has huge leverage. A code snippet may be deployed on thousands of servers or millions of devices and may be called billions of times a day every day of the year. There may be thousands of similar snippets. The difference between a bad algorithm and a good one can easily be a factor of a thousand. Do the numbers and multiply all this up: it makes a difference. The potential cost to the organisation of non-performing code can be very significant or even fatal if a system runs out of capacity.

Furthermore, many of these organisations work in a competitive environment. So you cannot just tell your customers to buy a bigger computer if your competitor's software already works OK on the hardware that they have, or if the software runs on a mobile handset that can't be upgraded. Some applications are particularly performance critical (games and mobile apps come to mind) and may live or die according to their responsiveness or speed.

Over two decades, I have personally worked on many projects where systems have failed or been unusable due to performance issues and I have been called in to optimize them. In all cases it has been due to bad code written by programmers who didn't understand the impact of what they were writing. Furthermore, it is never one piece of code; it is always everywhere. When I turn up, it is way too late to start thinking about performance: the damage has been done.

Understanding code performance is a good skill to have in the same way as understanding code correctness and code style. It comes out of practice. Performance failures can be as bad as functional failures. If the system doesn't work, it doesn't work. Doesn't matter why. Similarly, performance and features that are never used are both bad.

So, if the interviewer asks you about performance I would recommend to try and demonstrate as much knowledge as possible. If the question seems a bad one, politely point out why you think it would not be an issue in that case. Don't quote Knuth.

rghome
  • 668
  • 5
  • 12
0

You should first optimize for correctness.

Your function fails for input values that are close to int.MaxValue:

int a = int.MaxValue - 200;
int b = int.MaxValue - 200;
bool inRange = test.IsSumInRangeA(a, b);

This returns true because the sum overflows to -402. The function also doesn't work when a and b are both int.MinValue + 200 (the sum incorrectly wraps around to 400).

We won't know what the interviewer was looking for unless he or she chimes in, but "overflow is real".

In an interview situation, ask questions to clarify the scope of the problem: What are the allowed maximum and minimum input values? Once you have those, you can throw an exception if the caller submits values outside of the range. Or (in C#), you can use a checked {} section, which would throw an exception on overflow. Yes, it's more work and more complicated, but sometimes that's what it takes.
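
As a rough sketch of the checked {} option, the addition can be wrapped so that an overflow raises OverflowException instead of silently wrapping around. The class name CheckedRange is hypothetical, and whether throwing is the right response to out-of-range inputs is exactly the kind of requirement to clarify with the interviewer.

```csharp
using System;

public static class CheckedRange
{
    // Returns true when a + b lies in [-1000, 1000].
    // Throws OverflowException when a + b does not fit in an int.
    public static bool IsSumInRange(int a, int b)
    {
        int sum;
        checked
        {
            sum = a + b; // overflow throws here instead of wrapping around
        }
        return sum >= -1000 && sum <= 1000;
    }
}
```

With this version, the call from the example above (both arguments int.MaxValue - 200) throws rather than returning true.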

TomEberhard
  • 117
  • 2
  • The methods were only examples. They were not written to be correct, but to illustrate the actual question. Thanks for the input though! – Corey P Sep 06 '18 at 17:14
  • I think the interview question is directed at performance, so you need to answer the intent of the question. The interviewer is not asking about behaviour at the limits. But interesting side point anyway. – rghome Sep 07 '18 at 12:15
  • 1
    @Corey Good interviewers ask questions to 1) assess the candidate's ability concerning the issue, as suggested by rghome here, yet also 2) as an opening into the larger issues (like the unspoken functional correctness) and depth of related knowledge - this is more so in later career interviews - good luck. – chux - Reinstate Monica Sep 07 '18 at 16:58
0

Your question should have been: "Do I need to optimize this at all?".

Versions A and B differ in one important detail that makes A preferable, but it is unrelated to optimization: You do not repeat code.

The actual "optimization" is called common subexpression elimination, which is what pretty much every compiler does. Some do this basic optimization even when optimizations are turned off. So that isn't truly an optimization (the generated code will almost certainly be exactly the same in every case).
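To make the comparison concrete, here are the two variants from the question transcribed to C (the names `is_sum_in_range_a` and `is_sum_in_range_b` are mine, not the asker's). With optimizations enabled, a compiler would typically compute `a + b` only once in both. Comparing the generated assembly, for example with `gcc -O2 -S`, is the way to confirm that for a particular compiler rather than taking it on faith.

```c
#include <stdbool.h>

/* Variant A: the sum is named once and reused. */
bool is_sum_in_range_a(int a, int b) {
    int s = a + b;
    return !(s > 1000 || s < -1000);
}

/* Variant B: the subexpression a + b is written out twice. */
bool is_sum_in_range_b(int a, int b) {
    return !(a + b > 1000 || a + b < -1000);
}
```

Both functions return the same result for every input whose sum does not overflow; the only difference is how the source reads.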

But if it isn't an optimization, then why is it preferable? Alright, you don't repeat code, who cares!

Well, first of all, you do not run the risk of accidentally getting half of the conditional clause wrong. But more importantly, someone reading this code can grok immediately what you're trying to do, instead of an if((((wtf||is||this||longexpression)))) experience. What the reader gets to see is if(one || theother), which is a good thing. Not rarely, it happens that you are that other person, reading your own code three years later and thinking "WTF does this mean?". In that case it's always helpful if your code immediately communicates what the intent was. With a common subexpression being named properly, that's the case.
Also, if at any time in the future, you decide that e.g. you need to change a+b to a-b, you have to change one location, not two. And there's no risk of (again) getting the second one wrong by accident.

About your actual question, what you should optimize for: first of all, your code should be correct. This is the absolutely most important thing. Code that isn't correct is bad code, even more so if despite being incorrect it "works fine", or at least looks like it works fine. After that, code should be readable (readable by someone unfamiliar with it).
As for optimizing... one certainly shouldn't deliberately write anti-optimized code, and certainly I'm not saying you shouldn't spend a thought on the design before you start out (such as choosing the right algorithm for the problem, not the least efficient one).

But for most applications, most of the time, the performance that you get after running correct, readable code using a reasonable algorithm through an optimizing compiler is just fine, there's no real need to worry.

If that isn't the case, i.e. if the application's performance indeed doesn't meet the requirements, and only then, you should worry about doing such local optimizations as the one you attempted. Preferably, though, you would reconsider the top-level algorithm. If you call a function 500 times instead of 50,000 times because of a better algorithm, this has a larger impact than saving three clock cycles on a micro-optimization. If you don't stall for several hundred cycles on a random memory access all the time, this has a larger impact than doing a few cheap calculations extra, etc.

Optimization is a difficult matter (you can write entire books about that and get to no end), and spending time on blindly optimizing some particular spot (without even knowing whether that's the bottleneck at all!) is usually wasted time. Without profiling, optimization is very hard to get right.

But as a rule of thumb, when you're flying blind and just need/want to do something, or as a general default strategy, I would suggest optimizing for "memory".
Optimizing for "memory" (in particular spatial locality and access patterns) usually yields a benefit because, unlike once upon a time when everything was "kinda the same", nowadays accessing RAM is among the most expensive things (short of reading from disk!) that you can in principle do. ALU operations, on the other hand, are cheap and getting faster every week. Memory bandwidth and latency don't improve nearly as fast. Good locality and good access patterns can easily make a 5x difference (20x in extreme, contrived examples) in runtime compared to bad access patterns in data-heavy applications. Be nice to your caches, and you will be a happy person.

To put the previous paragraph into perspective, consider what the different things that you can do cost you. Executing something like a+b takes (if not optimized out) one or two cycles, but the CPU can usually start several instructions per cycle, and can pipeline non-dependent instructions so more realistically it only costs you something around half a cycle or less. Ideally, if the compiler is good at scheduling, and depending on the situation, it might cost zero.
Fetching data ("memory") costs you either 4-5 cycles if you're lucky and it's in L1, and around 15 cycles if you are not so lucky (L2 hit). If the data isn't in the cache at all, it takes several hundred cycles. If your haphazard access pattern exceeds the TLB's capabilities (easy to do with only ~50 entries), add another few hundred cycles. If your haphazard access pattern actually causes a page fault, it costs you a few ten thousand cycles in the best case, and several million in the worst.
Now think about it, what's the thing you want to avoid most urgently?
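As a rough illustration of what "access pattern" means here, the sketch below sums the same matrix twice in C: once row by row, which walks memory sequentially, and once column by column, which jumps `cols` elements between consecutive accesses. The function names and the flat row-major layout are choices made for this example only. Both functions return the same sum; how much their running times differ depends on the machine and on the matrix size, so that part has to be measured.

```c
#include <stddef.h>

/* Row-by-row traversal: consecutive accesses touch adjacent addresses. */
long sum_row_major(const int *m, size_t rows, size_t cols) {
    long total = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += m[r * cols + c];
    return total;
}

/* Column-by-column traversal: consecutive accesses are `cols` ints apart. */
long sum_col_major(const int *m, size_t rows, size_t cols) {
    long total = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            total += m[r * cols + c];
    return total;
}
```

The arithmetic is identical in both versions. Only the order in which memory is touched changes, which is exactly the kind of difference the cycle counts above are about.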

Damon
  • 253
  • 1
  • 3
0

When to optimize for memory vs performance speed for a method?

After getting the functionality right first. Then selectively concern oneself with micro-optimizations.


As an interview question regarding optimizations, the code does provoke the usual discussion, yet it misses the higher-level question: is the code functionally correct?

C, C++ and other languages regard int overflow in a + b as a problem. It is not well defined: C calls it undefined behavior. It is not specified to "wrap", even though that is the common behavior.

bool IsSumInRange(int a, int b) {
    int s = a + b;  // Overflow possible
    if (s > 1000 || s < -1000) return false;
    else return true;
}

Such a function called IsSumInRange() would be expected to be well defined and perform correctly for all int values of a,b. The raw a + b is not. A C solution could use:

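#include <limits.h>
#include <stdbool.h>

#define N 1000
bool IsSumInRange_FullRange(int a, int b) {
  // Reject pairs whose sum would overflow int; such a sum is certainly outside [-N, N].
  if (a >= 0) {
    if (b > INT_MAX - a) return false;
  } else {
    if (b < INT_MIN - a) return false;
  }
  int sum = a + b;  // Cannot overflow here
  if (sum > N || sum < -N) return false;
  else return true;
}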

The above code could be optimized by using a wider integer type than int, if available, as below or distributing the sum > N, sum < -N tests within the if (a >= 0) logic. Yet such optimizations may not truly lead to "faster" emitted code given a smart compiler nor be worth the extra maintenance of being clever.

  long long sum = a;
  sum += b;

Even using abs(sum) is prone to problems when sum == INT_MIN.
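The wider-type variant mentioned above could be completed along these lines. It is only a sketch: it assumes `long long` is wide enough to hold any sum of two `int` values (true on common platforms, but not guaranteed by the C standard, hence the compile-time guard), and the function name `IsSumInRange_Wide` is made up here.

```c
#include <limits.h>
#include <stdbool.h>

#define N 1000

/* Assumption: long long can represent every sum of two int values. */
#if LLONG_MAX / 2 <= INT_MAX
#error "long long is not wide enough for this approach"
#endif

bool IsSumInRange_Wide(int a, int b) {
    long long sum = a;   /* widen before adding, so the addition cannot overflow */
    sum += b;
    return sum >= -N && sum <= N;   /* two comparisons instead of abs() */
}
```

Comparing against both bounds directly sidesteps the `abs()` corner case altogether.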

0

What kind of compilers are we talking about, and what sort of "memory"? Because in your example, assuming a reasonable optimizer, the expression a+b needs to generally be stored in a register (a form of memory) prior to doing such arithmetic.

So if we're talking about a dumb compiler that encounters a+b twice, it's going to allocate more registers (memory) in your second example, because your first example might just store that expression once in a single register mapped to the local variable, but we're talking about very silly compilers at this point... unless you're working with another type of silly compiler that stack spills every single variable all over the place, in which case maybe the first one would cause it more grief to optimize than the second*.

* I still want to scratch that and think the second one is likely to use more memory with a dumb compiler even if it's prone to stack spills, because it might end up allocating three registers for a+b and spill a and b more. If we're talking about the most primitive optimizer, then capturing a+b to s will probably "help" it use fewer registers/stack spills.

This is all extremely speculative in rather silly ways absent measurements/disassembly. Even in the worst-case scenarios, this is not a "memory vs. performance" case (because even among the worst optimizers I can think of, we're not talking about anything but temporary memory like stack/register); it's purely a "performance" case at best. Among any reasonable optimizer the two are equivalent, and if one is not using a reasonable optimizer, why obsess about optimization so microscopic in nature, especially absent measurements? That's like an instruction selection/register allocation assembly-level focus, which I would never expect anyone looking to stay productive to have when using, say, an interpreter that stack spills everything.

When to optimize for memory vs performance speed for a method?

As for this question, if I can tackle it more broadly, often I don't find the two diametrically opposed. Especially if your access patterns are sequential, and given the speed of the CPU cache, often a reduction in the number of bytes processed sequentially for non-trivial inputs translates (up to a point) to plowing through that data faster. Of course there are breaking points where if the data is much, much smaller in exchange for way, way more instructions, it might be faster to process sequentially in larger form in exchange for fewer instructions.

But I've found many devs tend to underestimate how much a reduction in memory use in these types of cases can translate to proportional reductions in time spent processing. It's very humanly intuitive to translate performance costs to instructions rather than memory access to the point of reaching for big LUTs in some vain attempt to speed up some small computations, only to find performance degraded with the additional memory access.
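To show the kind of trade being described, here is a bit-counting routine in C written both ways: computed with ALU instructions only, and looked up in a table. The table is deliberately tiny (256 bytes), and all names are invented for this sketch. The degradation described above is something you would expect with much larger tables that compete for cache, so which version wins in a given program is a question for the profiler.

```c
#include <stdint.h>

/* Computed version: ALU work only, no table in memory. */
unsigned popcount_computed(uint32_t x) {
    unsigned count = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Table version: one lookup per byte, table filled lazily on first use. */
static uint8_t bit_table[256];
static int table_ready = 0;

unsigned popcount_table(uint32_t x) {
    if (!table_ready) {
        for (unsigned i = 0; i < 256; i++)
            bit_table[i] = (uint8_t)popcount_computed(i);
        table_ready = 1;
    }
    return bit_table[x & 0xFF] + bit_table[(x >> 8) & 0xFF]
         + bit_table[(x >> 16) & 0xFF] + bit_table[x >> 24];
}
```

Both return the same counts; the table version trades instructions for memory accesses, which only pays off while the table stays in cache.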

For sequential access cases through some huge array (not talking local scalar variables like in your example), I go by the rule that less memory to sequentially plow through translates to greater performance, especially when the resulting code is simpler than otherwise, until it doesn't, until my measurements and profiler tell me otherwise, and it matters, in the same way I assume sequentially reading a smaller binary file on disk would be faster to plow through than a bigger one (even if the smaller one requires some more instructions), until that assumption is shown to no longer apply in my measurements.