65

I just used ~1 billion as the count for a z-index in CSS, and was thinking about the comparisons that must go on. Is there a difference in performance on the ALU level in comparisons between very large numbers vs very small ones?

For example, would one of these two snippets be more expensive than the other?

snippet 1

for (int i = 0; i < 10000000; i++){
    if (i < 10000000000000) {
        //do nothing
    }
}

snippet 2

for (int i = 0; i < 10000000; i++){
    if (i < 1000) {
        //do nothing
    }
}
J.Todd
  • 3,833
  • 5
  • 22
  • 27
  • 9
    are you aware about [how branch prediction works](http://stackoverflow.com/a/11227902/839601)? – gnat Feb 02 '15 at 14:55
  • 13
    OP isn't asking how much time the branching will take. Clearly, the example is intended to ensure that it takes exactly the same time in both snippets. The question is about whether the individual `CMP` machine instruction will be slower if `i` is larger. – Kilian Foth Feb 02 '15 at 15:08
  • 1
    I edited `snippet 2` to have the same number of loops as `snippet 1` since more loops would be more expensive, which wasn't the intended focus. – J.Todd Feb 02 '15 at 15:26
  • 3
    As edited, any difference in timing is almost certainly due to branch prediction, as @gnat said. However, this difference is likely to be infinitesimal, especially if measured in a high-level language running on a modern operating system. *Are you actually seeing a difference, or is this a hypothetical question?* – kdgregory Feb 02 '15 at 15:55
  • 2
    Also, any actual implementation difference is going to be extremely architecture-specific, to the level of the micro-architecture. It's entirely possible that the microcode for Intel Sandy Bridge is different than that for Ivy Bridge, never mind that any arbitrary ARM chip is almost certainly different than an arbitrary Intel chip. – kdgregory Feb 02 '15 at 15:58
  • 18
    Since this is done in CSS, converting a string to an integer will likely dominate the comparison operation itself in terms of time spent executing. –  Feb 02 '15 at 20:13
  • 59
    If you needed to use 1000000000 as a z-index in a CSS file, you have done something wrong. – Bergi Feb 02 '15 at 21:08
  • 2
    (A decent optimizing compiler will elide those loops entirely.) – David Conrad Feb 02 '15 at 21:14
  • 2
    Cache misses are much more important than compare or arithmetic – Basile Starynkevitch Feb 02 '15 at 21:45
  • 1
    In snippet 1, the controlling expression for the `if` statement is always-true, so it compiles (on any proper optimizing compiler) to no code at all. – R.. GitHub STOP HELPING ICE Feb 03 '15 at 04:52
  • It could be less expensive if it were in the critical path, but it's almost surely not. – user541686 Feb 03 '15 at 07:39
  • 6
    For CSS, the overhead of converting text into an integer will depend on the number of digits being converted (where a 6 digit number like 1000000 may be approximately 6 times as expensive as a 1 digit number like 1); and this overhead may be orders of magnitude larger than the overhead of integer comparisons. – Brendan Feb 03 '15 at 10:11
  • 1
    Am I the only person who just wants to change the loop to read for (int i = 1000; i < 10000000; i++) ? – Jon Story Feb 03 '15 at 14:57
  • @Bergi or mindlessly hit the zero key a bunch of times to ensure the item would be on top. 10 would have probably sufficed, but 1000000000 allows me to set *so* many more z-indexes below it and not worry about an overlap. – J.Todd Feb 03 '15 at 19:48
  • 1
    @MediaWebDev I hate it when people think like that. It's inevitable that I'll have to add a z-index of 1000000010 just to override 1000000000. Who has 1 billion elements on a webpage? – MiniRagnarok Feb 03 '15 at 20:20
  • @MiniRagnarok It was a humorous exaggeration.. I know this is Programmers SE, but come on, loosen up a bit ;P – J.Todd Feb 03 '15 at 20:22
  • 5
    @MediaWebDev I have legitimately had to override something similar to 99,999,999,999. It's not humorous it's horrifying. – MiniRagnarok Feb 03 '15 at 20:28
  • An integer will have the same number of bits no matter how large it is, and all those bits will need to be compared - even if they're zero. – Mark Ransom Feb 03 '15 at 20:56
  • 1
    It could depend on the language, specifically: how it handles different datatypes and whether or not all of those numbers are the same or different types. Since you are talking about CSS is this a Javascript question? – Octopus Feb 04 '15 at 01:24
  • By the way, the number in the question is 10 trillion, not 1 billion. This is actually a significant difference in that the 1 billion fits in a 32-bit integer (either signed or unsigned,) whereas 10 trillion does not. Of course, if the number were floating-point, they'd both fit in a 32-bit word, but floating-point math is more expensive than integer math on most CPUs. – reirab Feb 04 '15 at 15:38

7 Answers7

82

Every processor I've worked on does comparison by subtracting one of the operands from the other, discarding the result and leaving the processor's flags (zero, negative, etc.) alone. Because subtraction is done as a single operation, the contents of the operands don't matter.

The best way to answer the question for sure is to compile your code into assembly and consult the target processor's documentation for the instructions generated. For current Intel CPUs, that would be the Intel 64 and IA-32 Architectures Software Developer’s Manual.

The description of the CMP ("compare") instruction is in volume 2A, page 3-126, or page 618 of the PDF, and describes its operation as:

temp ← SRC1 − SignExtend(SRC2);
ModifyStatusFlags; (* Modify status flags in the same manner as the SUB instruction*)

This means the second operand is sign-extended if necessary, subtracted from the first operand and the result placed in a temporary area in the processor. Then the status flags are set the same way as they would be for the SUB ("subtract") instruction (page 1492 of the PDF).

There's no mention in the CMP or SUB documentation that the values of the operands have any bearing on latency, so any value you use is safe.

Blrfl
  • 20,235
  • 2
  • 49
  • 75
  • 5
    What if the number gets too big for 32-bit arithmetic? Would it then not be split to slower computation? – Falco Feb 02 '15 at 17:44
  • 3
    @Falco Not on a CPU with a 64-bit ALU (which is pretty much all of them except in the embedded space these days.) – reirab Feb 02 '15 at 17:52
  • So in the worst-case your website would only run slow on smartphones - and it would be a hell to debug ^^ better be safe than sorry and stay below 2^31-1 – Falco Feb 02 '15 at 17:58
  • 8
    @Falco: Yes, but since the question asks about ALU performance, the implication is that the values fit within the CPU's word size or the abilities of any SIMD instructions it might have. Operating on larger numbers than that would have to be implemented with multiple instructions outside the CPU. That was very common 30 years ago when you just had 8- or 16-bit registers to work with. – Blrfl Feb 02 '15 at 17:58
  • 6
    @Falco How would that require debugging? It's not a bug; it's just a bit slower to do 64-bit ops on a CPU that doesn't natively support 64-bit ops. Suggesting that one should never use a number above 2^31-1 seems a bit ridiculous. – reirab Feb 02 '15 at 21:19
  • @reirab please don't lash out! Noticeably degraded performance is considered a bug in every professional software I've worked on. Furthermore degraded performance of websites can have quite a bit hit on customer retention (there is a related question on ux). And actually needing Numbers above 2^31 for Z-Index seems a bit ridiculous - and if you don't need them, why risk even a 1% chance of degraded performance, if there is no downside to choosing a little lower number ? – Falco Feb 03 '15 at 08:48
  • 1
    @Falco I wasn't trying to lash out. I see what you're saying now. Yes, _in the context of a CSS z-index_, sure, using numbers that can't be represented with 32-bits is excessive. Your previous comment sounded like you were suggesting that people should never use 64-bit numbers at all, which is the part that seemed ridiculous. As far as CSS goes, though, the performance difference won't be noticeable unless you're running the comparison many, many, many times. The initial conversion from a string will likely dominate the execution time. – reirab Feb 03 '15 at 14:54
  • 2
    @Falco Having said that, do the rendering engines in browsers even use integers to represent z-indices? Most rendering engines I'm familiar with use single-precision floats for everything (until the final rasterization stage,) but I haven't really studied browser rendering engines. – reirab Feb 03 '15 at 14:57
  • @reirab Since when is there a `sub` instruction in x64 that takes a 64bit immediate? No idea how you'd do such a comparison without a performance hit - certainly not in any modern fixed size RISC ISA (ARMv8 manages some 64bit patterns for immediates but for from all) and not even in x64. There's a reason why people when optimizing code favor small immediates. – Voo Feb 03 '15 at 21:47
  • @Voo It can do either `CMP` or `SUB` on 2 64-bit registers, though... If you had to do a single immediate load first, I don't think that would qualify as "noticeably degraded performance." Also, the situation would be the same with 32-bit numbers on a 32-bit fixed-size ISA. Regardless of whether you're doing the compare once or a trillion times in a loop, you'd only need to do the immediate load once. – reirab Feb 03 '15 at 22:14
  • @MediaWebDev: If you're curious, that list is 6502, 8080, Z80, VAX, 8085, 8088 (and descendents), 6800, 68000, SPARC and a teensy bit of ARM. Probably should have said "worked with." The only processor I actually worked on was a class project that wasn't capable of much. – Blrfl Feb 04 '15 at 01:39
  • @reirab don't forget those nine bytes of machine code. If you're _really_ unlucky you'll cause a tight loop to span multiple pages of cache and increase the cache miss ratio. Also don't forget the instruction decoding cache (16 bytes on an Opteron, IIRC), which the tight loop might have fit into before but now it definitely won't, requiring not only L1 reads but also redecodes. It doesn't sound like much, but if your loop fits into 16 bytes, an extra tick per iteration might become noticeable. Only if the compiler is smart enough to move it out you do "only need to do the immediate load once". – John Dvorak Feb 04 '15 at 14:59
  • @JanDvorak If the compiler is _not_ smart enough to move it out, you should find a new compiler. To be fair, though, I have seen compilers do some pretty dumb things, including one that was reading and writing the incrementing variable in a `for` loop to memory on each iteration, even though it was in a register that wasn't being modified by the loop body. The worst part about that situation was that it was on an architecture that had dedicated loop-control hardware that was capable of running a loop for a specified number of times with _no overhead at all_, which the compiler didn't use. – reirab Feb 04 '15 at 15:07
  • @JanDvorak Of course, the programmer can always explicitly do the immediate load once, too, should they find that they have a stupid compiler (e.g. by storing it in a separate variable before the loop and, perhaps, marking said variable as `register`.) As long as you do the immediate load elsewhere, it looks like the 64-bit, 2-register `sub` and `cmp` instructions are only 3 bytes each. – reirab Feb 04 '15 at 15:18
  • @reirab The question was whether comparing a smaller number was faster than a larger and the answer to that is simply yes. The immediate will only stay in a register if the code does a negligible amount of work this has little to do with how intelligent the compiler is. – Voo Feb 05 '15 at 16:33
  • @Voo No, the question specifically asked about the **ALU** level, not loading operands (or loading instructions.) Those are both good things to be aware of, but they're not what the question asked. As such, the answer to the question asked is no, assuming both operands fit within the word size of the ALU. Also, accumulator architectures aside, if your loop needs to use all of the data registers, not leaving one for the immediate, then the immediate load's performance hit will almost certainly be negligible. – reirab Feb 05 '15 at 16:45
25

Is there a difference in performance on the ALU level in comparisons between very large numbers vs very small ones?

It's very unlikely, unless going from a small number to a large number changes your numeric type, say from an int to a long. Even then, the difference might not be significant. You're more likely to see a difference if your programming language silently switches to arbitrary precision arithmetic under the covers.

Nonetheless, your particular compiler might be performing some clever optimizations that you are not aware of. The way you find out is to measure. Run a profiler on your code; see which comparisons take the longest. Or simply start and stop a timer.

Robert Harvey
  • 198,589
  • 55
  • 464
  • 673
  • It should be mentioned, that the proposed Numbers in the Question are of different numeric type in a typical 32-bit integer type... – Falco Feb 03 '15 at 16:43
19

Many processors have "small" instructions which can perform arithmetic operations, including comparisons, on certain immediately-specified operands. Operands other than those special values must either use a larger instruction format or, in some cases, must use a "load value from memory" instruction. In the ARM Cortex-M3 instruction set, for example, there are at least five ways a value might be compared to a constant:

    cmp r0,#1      ; One-word instruction, limited to values 0-255

    cmp r0,#1000   ; Two-word instruction, limited to values 0-255 times a power of 2

    cmn r0,#1000   ; Equivalent to comparing value with -1000
                   ; Two-word instruction, limited to values 0-255 times a power of 2

    mov r1,#30000  ; Two words; can handle any value 0-65535
    cmp r0,r1      ; Could use cmn to compare to values -1 to -65535

    ldr r1,[constant1000000] ; One or two words, based upon how nearby the constant is
    cmp r0,r1
    ...

constant1000000:
    dd  1000000

The first form is the smallest; the second and third form may or may not execute as quickly, depending upon the speed of the memory from which code is fetched. The fourth form form will almost certainly be slower than the first three, and the fifth form even slower, but the latter can be used with any 32-bit value.

On older x86 processors, short-form compare instructions would execute faster than long-form ones, but many newer processors will convert both the long and short forms to the same representation when they are first fetched, and store that uniform representation in the cache. Thus, while embedded controllers (like those found on many mobile platforms) will have a speed difference, many x86-based computers won't.

Note also that in many cases where a constant is used heavily within a loop, a compiler will only need to load the constant into a register once--before the loop starts--rendering timing distinctions moot. On the other hand, there are some situations, even in small loops, where that won't always happen; if a loop is small but heavily-executed, there occasionally may be a major performance between comparisons involving short immediate values and those involving longer ones.

supercat
  • 8,335
  • 22
  • 28
  • On MIPS you can only have 16-bit immediates, so definitely comparison with 1 will be shorter and (probably) faster than 1000000. Maybe the same to Sparc and PowerPC. And I think I have read from some sources that Intel also optimizes operations on small immediates in several cases but I'm not sure for comparison or not – phuclv Feb 03 '15 at 05:28
  • @LưuVĩnhPhúc: A register can be loaded before the loop. At that point, the actual comparison will be the same number of instructions in either case. – cHao Feb 03 '15 at 15:26
  • As the Loop was just an example by the op and the question was for example an z-index, if you have 1000 objects, each with its own z-index and you set them to 100000000...1000000999 or to 10000...10999 and you loop over them for sorting before rendering, there are many compares and many load instructions. There it could make a difference! – Falco Feb 03 '15 at 16:49
  • @Falco: In that case, immediates wouldn't even factor in; loading and comparing against a register seems pretty much inevitable. – cHao Feb 04 '15 at 09:10
  • @cHao: If one is comparing Z indices against each other, they'd be in registers. If one is handling certain ranges of indices differently that might entail immediate comparisons. Normally constants would get loaded before a loop starts, but if e.g. one had a loop that needed to read pairs of values from memory and compare the first value of each pair with with five different (non-uniformly-spaced) constants in the range 100000 to 100499, and the other value with five other such constants, it may be much faster to subtract 100250 (kept in a register) and then compare with values -250 to 250... – supercat Feb 04 '15 at 16:48
  • @supercat: CSS z-indexes only have meaning relative to each other. If you're comparing them against hard-coded values, you're doing it wrong. :P – cHao Feb 04 '15 at 16:54
  • ...than to compare against the larger values directly, since registers would be too cramped to hold all ten. If code is time critical, it may be necessary to experimentally optimize it for the system one is using, while be prepared to change it around if it needs to run on another system (which is slow enough that the code would still be time-critical). – supercat Feb 04 '15 at 16:55
  • In general, CSS indices are only meaningful when compared against each other, *unless* one is specializing one's code to work with objects that are created using some known policy. – supercat Feb 04 '15 at 16:56
5

The short answer to this question is, no, there's no time difference to compare two numbers based on the magnitude of those numbers assuming they're stored in the same data type (e.g. both 32-bit ints or both 64-bit longs.)

Furthermore, up to the word size of the ALU, it's incredibly unlikely that comparing two integers to each other will ever take more than 1 clock cycle, as this is a trivial operation equivalent to a subtraction. I think every architecture I've ever dealt with had single-cycle integer comparison.

The only cases I can think of that I've encountered where a comparison of two numbers was not a single-cycle operation are the following:

  • Instructions where there's actually a memory latency in fetching operands, but that has nothing to do with how the comparison itself works (and generally isn't possible on RISC architectures, though it is usually possible on CISC designs, like x86/x64.)
  • Floating-point comparisons may be multi-cycle, depending on architecture.
  • The numbers in question don't fit in the word size of the ALU and, thus, the comparison must be broken up into multiple instructions.
reirab
  • 204
  • 1
  • 7
4

@RobertHarvey's answer is good; consider this answer a supplement to his.


You should also consider Branch Prediction:

In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch (e.g. an if-then-else structure) will go before this is known for sure. The purpose of the branch predictor is to improve the flow in the instruction pipeline. Branch predictors play a critical role in achieving high effective performance in many modern pipelined microprocessor architectures such as x86.

Basically, in your example, if the if statement inside the loop always returns the same answer, then the system can optimize it by guessing correctly which way it will branch. In your example, because the if statement in the first case always returns the same result, it will run slightly faster than the second case.

Excellent Stack Overflow question on the subject

durron597
  • 7,590
  • 9
  • 37
  • 67
3

It depends on the implementation, but it would be very, very unlikely.

I admit that I have not read through the implementation details of the various browser engines, and CSS does not specify any particular type of storage for numbers. But I believe that it is safe to assume that all of the major browsers are using 64-bit double-precision floating-point numbers ("doubles", to borrow a term from C/C++) to handle most of their numeric needs in CSS, because this is what JavaScript uses for numbers, and so using the same type makes integration easier.

From the computer's standpoint, all doubles carry the same amount of data: 64 bits, whether the value is 1 or -3.14 or 1000000 or 1e100. The amount of time it takes to do an operation on these numbers doesn't depend on the actual value of those numbers, because it's always working on the same amount of data. There's a tradeoff in doing things this way, in that doubles can't accurately represent all numbers (or even all numbers within their range), but they can get close enough for most matters, and the kinds of things CSS does aren't numerically-demanding enough to need more precision than that. Combine this with the benefits of straight-across compatibility with JavaScript, and you've got a pretty strong case for doubles.

It's not impossible that someone might implement CSS using a variable-length encoding for numbers. If someone used a variable-length encoding, then comparing against small numbers would be less expensive than comparing against large numbers, because large numbers have more data to crunch. These kinds of encodings can be more precise than binary, but they are also much slower, and for CSS in particular, the precision gains are probably not enough to be worth the performance hit. I would be very surprised to learn that any browser did things this way.

Now, in theory, there is one possible exception to everything I've said above: comparing against zero is often faster than comparing against other numbers. This isn't because zero is short (if that were the reason, then 1 should be just as fast, but it's not). It's because zero lets you cheat. It's the only number where all the bits are off, so if you know that one of the values is zero, you don't even have to look at the other value as a number: if any of the bits on then it's not equal to zero, and then you only have to look at one bit to see if it's greater than or less than zero.

The Spooniest
  • 2,160
  • 12
  • 9
0

If this code was being interpreted each time it ran, there'd be a difference as it takes longer to tokenise and interpret 10000000000000 compared to 1000. However, this is the obvious first optimisation of interpreters in this case: tokenise once and interpret the tokens.

Mark Hurd
  • 343
  • 1
  • 3
  • 12