28

I have been taught that shifting in binary is much more efficient than multiplying by 2^k. So I wanted to experiment, and I used the following code to test this out:

#include <time.h>
#include <stdio.h>

int main() {
    clock_t launch = clock();
    int test = 0x01;
    int runs;

    //simple loop that oscillates between int 1 and int 2
    for (runs = 0; runs < 100000000; runs++) {

        // I first compiled + ran it a few times with this:
        test *= 2;

        // then I recompiled + ran it a few times with:
        test <<= 1;

        // set back to 1 each time
        test >>= 1;
    }

    clock_t done = clock();
    double diff = (done - launch);
    printf("%f\n",diff);
}

For both versions, the print out was approximately 440000, give or take 10000. There was no (visually, at least) significant difference between the two versions' outputs. So my question is, is there something wrong with my methodology? Should there even be a visual difference? Does this have something to do with the architecture of my computer, the compiler, or something else?

NicholasFolk
  • 47
    Whomever taught you that was clearly mistaken. That belief has not been true since the 1970s, for typically-used compilers on typically-used architectures. Good for you for testing out this claim. I have heard this nonsensical claim made about *JavaScript* for heaven's sake. – Eric Lippert Jul 21 '14 at 20:18
  • 21
    The best way to answer questions like these is to look at the assembly code the compiler is producing. Compilers typically have an option to produce a copy of the assembly language they are generating. For the GNU GCC compilers this is '-S'. – Charles E. Grant Jul 21 '14 at 20:19
  • 8
    One should point out that after looking this with `gcc -S`, the code for `test *= 2` is actually compiled to `shll $1, %eax` When invoked with `gcc -O3 -S` there's not even a loop. The two clock calls are a line apart: `callq _clock` `movq %rax, %rbx` `callq _clock` –  Jul 21 '14 at 21:00
  • 6
    "I have been taught that shifting in binary is much more efficient than multiplying by 2^k"; we get taught a lot of things that turn out to be wrong (or at least out of date). A smartish compiler will use the same shift operation for both. – John Bode Jul 21 '14 at 21:06
  • If you want a real kick, measure the difference between multiplying and dividing an integer by 13. – John Jul 21 '14 at 23:51
  • 4
    @JohnBode - but that statement is not wrong. Binary shifts tend to be more efficient at the hardware level. It's just that C provides a higher-level abstraction, so you cannot observe this. It's not the hypothesis that's wrong, it's the experiment that is conducted in the wrong way. – el.pescado - нет войне Jul 22 '14 at 13:19
  • 1
    BTW. It amazes me that people still think their compilers are that dumb. – el.pescado - нет войне Jul 22 '14 at 13:19
  • Back in the 80s, I had a cheap C compiler that pretty much didn't do any optimization. So this kind of stuff actually made a difference. It was fun toying with that. It didn't take long before I started using an optimizing compiler, however, so that fun was then relegated to handwritten assembly. – MetalMikester Jul 22 '14 at 13:36
    If I remember correctly, this optimization was present in Turbo C back in the late '80s/early '90s. – el.pescado - нет войне Jul 22 '14 at 13:56
  • 10
    Always, always check the generated assembly code when working on this kind of optimization, to be sure you're measuring what you think you're measuring. A huge number of "why am I seeing these times" questions on SO wind up boiling down to the compiler completely eliminating operations because the results aren't used. – Russell Borogove Jul 22 '14 at 15:53
  • @el.pescado Correct - that's the one I switched to after toying with that other very basic compile. Can't remember the name, but it generated very slow code. It was fine for learning the basics, however. – MetalMikester Jul 22 '14 at 17:27
  • Funnily enough neither `mul` nor `lshl` would be the most efficient instructions on modern x86 for a multiplication times 2 the last time I checked. Turns out `add` is the best to pick! I'd say chances are good that a mediocre compiler would go to the extra length of changing multiplications by 2, but may miss out shifts by 1 - probably not, but that's one way I can see how this "optimization" may actually backfire. – Voo Jul 22 '14 at 23:04
  • These comments are sad, and people like Eric Lippert should be extremely embarrassed. Of course shift is faster than multiply (https://gmplib.org/~tege/x86-timing.pdf), but not when multiplying by *compile-time constants that the compiler can optimize into shifts*. Sheesh. Change those shift amounts and multipliers to variables with values unknown to the compiler, and the times will be different. The benchmark doesn't test what it is intended to test--heck, the compiler is likely to optimize out the entire loop, but we have no idea what optimization level the OP used. – Jim Balter Jun 28 '20 at 01:21

11 Answers

46

As said in the other answer, most compilers will automatically optimize multiplications to be done with bitshifts.

This is a very general rule when optimizing: most 'optimizations' will actually mislead the compiler about what you really mean, and might even worsen performance.

Only optimize when you have noticed a performance problem and have measured what the problem is. (And most of the code we write isn't executed often enough for this to matter, so we don't need to bother.)

The big downside to optimizing is that the 'optimized' code is often much less readable. So in your case, always go for multiplication when you are looking to multiply. And go for bit shifting when you want to move bits.
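
As a small illustration of that guideline (a sketch; the function names are made up):

#include <stdint.h>

/* Arithmetic intent: write a multiplication and let the compiler
   pick the instruction. */
uint32_t quadruple(uint32_t n)
{
    return n * 4;            /* clearer than n << 2 */
}

/* Bit-manipulation intent: write a shift, because you really mean bits. */
uint32_t top_byte(uint32_t word)
{
    return (word >> 24) & 0xFFu;
}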

Thirler
  • 22
    Always use the operation that is semantically correct. If you were manipulating bit masks, or positioning small integers within larger integers, shift is the appropriate operation. – ddyer Jul 21 '14 at 19:49
  • Of course :), added that to my answer. – Thirler Jul 21 '14 at 19:54
  • 2
    Would there ever (practically speaking) be a need to optimize a multiplication to a shift operator in a high level software application? It seems, since the compiler already optimizes, that the only time it is useful to have this knowledge is when programming at a very low level (at least, below the compiler). – NicholasFolk Jul 21 '14 at 20:19
  • 12
    @NicholasFolk nope. Do what is simplest to understand. If you were writing assembly directly it can be useful... or if you were writing an optimizing compiler, again it could be useful. But outside of those two cases it's a trick that obscures what you are doing and makes the next programmer (who is an [axe murderer who knows where you live](http://blog.codinghorror.com/coding-for-violent-psychopaths/)) curse your name and think of taking up a hobby. –  Jul 21 '14 at 20:53
  • @NicholasFolk Depends on what you're doing. As always, you should profile before doing optimizations, but one area where these types of optimizations can have huge effects is in image processing. Doing a couple of multiplications for each pixel in a large image can take a lot of time even on fast computers. Converting the multiplications to shifts and additions is a pretty standard way of speeding up image processing. – Leo Jul 21 '14 at 22:21
  • 1
    As general rule, (1) you should be very familiar with the language's specification for shift operations before using them (e.g. their behavior on signed vs unsigned integers of various widths). (2) if you use them for performance reasons, you should know three things: (2.1) application-wide performance profiling, (2.2) micro-benchmarking (and CPU-level timing), (2.3) know how to read disassembly code, though it's not necessary to know how to write them. – rwong Jul 22 '14 at 05:46
  • @NicholasFolk: I don't know whether they've got round to sorting them but it wasn't very long ago that all the major games platform compilers failed to perform this simple optimisation. I'd be kind of surprised if the 3DS compiler does to this day. – Jack Aidley Jul 22 '14 at 10:43
  • 2
    @NicholasFolk: Optimizations at this level are almost always obscured or rendered moot by the CPU architecture anyway. Who cares if you save 50 cycles when just fetching the arguments from memory and writing them back takes over 100? Micro-optimizations like this made sense when memory ran at (or close to) the speed of the CPU, but not so much today. – TMN Jul 22 '14 at 16:10
  • 1
    Premature optimization is the root of all evil -- DonaldKnuth – tzerb Jul 22 '14 at 22:12
  • 2
    Because i'm tired of seeing that 10% of that quote, and because it hits the nail on the head here: "There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We *should* forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. ... – cHao Jul 22 '14 at 22:36
  • 2
    "Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only *after* that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail." (Knuth, "Structured Programming with `go to` Statements") Emphasis is his, IIRC. – cHao Jul 22 '14 at 22:37
26

The compiler recognizes constants and converts multiplies to shifts where appropriate.
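
A quick way to verify this is to compile a tiny function and look at the output (with GCC or Clang, something like -O2 -S); a minimal sketch:

/* With optimization on, GCC and Clang typically turn this into a shift
   (or an lea) rather than an imul; check with -O2 -S. */
unsigned times_eight(unsigned x)
{
    return x * 8;
}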

ddyer
  • The compiler recognizes constants that are powers of 2.... and converts to shifts. Not all constants can be changed into shifts. – quickly_now Jul 22 '14 at 09:59
  • 4
    @quickly_now: They can be converted into combinations of shifts and additions/subtractions. – user541686 Jul 22 '14 at 10:36
  • 2
    A classic compiler optimizer bug is to convert divides into right shifts, which works for positive dividends but is off by 1 for negative. – ddyer Jul 22 '14 at 16:54
  • 1
    @quickly_now I believe the term 'where appropriate' covers the idea that some constants cannot be rewritten as shifts. – Pharap Jul 23 '14 at 03:19
23

Whether shifting is faster than multiplication depends on the architecture of your CPU. Back in the days of the Pentium and earlier, shifting was often faster than multiplication, depending on the number of 1 bits in your multiplicand. For example, if your multiplicand was 320, that's 101000000 in binary: only two bits set (2^8 + 2^6).

a *= 320;               // Slower
a = (a<<6) + (a<<8);    // Faster

But if you had more than two bits ...

a *= 324;                        // About same speed
a = (a<<2) + (a<<6) + (a<<8);    // About same speed

a *= 340;                                 // Faster
a = (a<<2) + (a<<4) + (a<<6) + (a<<8);    // Slower
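
The shift-and-add form can be derived mechanically from the set bits of the constant; here is a small sketch of that idea (the helper name is made up):

#include <stdio.h>

/* Multiply by a constant using one shift-and-add per set bit of k.
   320 = 0b101000000, so this effectively does (a << 6) + (a << 8). */
static unsigned mul_by_shifts(unsigned a, unsigned k)
{
    unsigned result = 0;
    for (int bit = 0; k != 0; bit++, k >>= 1) {
        if (k & 1)
            result += a << bit;
    }
    return result;
}

int main(void)
{
    printf("%u\n", mul_by_shifts(7, 320));   /* prints 2240, same as 7 * 320 */
    return 0;
}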

On a little microcontroller like a PIC18 with single cycle multiply, but no barrel shifter, multiplication is faster if you're shifting by more than 1 bit.

a  *= 2;   // Exactly the same speed
a <<= 1;   // Exactly the same speed

a  *= 4;   // Faster
a <<= 2;   // Slower

Note that that's the opposite of what was true on older Intel CPUs.

But it's still not that simple. If I remember correctly, due to its superscalar architecture, a Pentium was able to process either one multiply instruction or two shift instructions simultaneously (as long as they weren't dependent on each other). This means that if you wanted to multiply two variables by a power of 2, then shifting might be better.

a  *= 4;   // 
b  *= 4;   // 

a <<= 2;   // Both lines execute in a single cycle
b <<= 2;   // 
Rocketmagnet
  • 5
    +1 "Whether shifting is faster than multiplication depends on the architecture of your CPU." Thank you for actually going into the history a bit and showing that most computer myths do actually have some logical basis. – Pharap Jul 23 '14 at 03:23
11

You've got several problems with your test program.

First, you're not actually using the value of test. There is no way, within the C standard, that the value of test matters. The optimizer is thus completely free to remove it. Once it's removed, your loop is actually empty. The only visible effect would be to set runs = 100000000, but runs also isn't used. So the optimizer can (and should!) remove the entire loop. Easy fix: also print the computed value. Note that a sufficiently determined optimizer could still optimize away the loop (it relies entirely on constants known at compile time).

Second, you do two operations that cancel each other out. The optimizer is allowed to notice this and cancel them, again leaving an empty loop that gets removed. This one is downright hard to fix. You can switch to an unsigned int (so overflow isn't undefined behavior), but that of course just results in 0. And simple things (like, say, test += 1) are easy enough for the optimizer to figure out, and it does.

Finally, you assume that test *= 2 is actually going to be compiled to a multiply. That's a very simple optimization; if the bitshift is faster, the optimizer will use it instead. To get around this, you'd have to use something like implementation-specific inline assembly.

Or, I suppose, just check your microprocessor data sheet to see which is faster.

When I checked the assembly output of compiling your program with gcc -S -O3 using version 4.9, the optimizer actually saw through every simple variation above, and several more. In all cases, it removed the loop (assigning a constant to test), the only thing left was the calls to clock(), the convert/subtract, and the printf.
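
Putting those fixes together, here is a sketch of a version that is harder to optimize away (reading the factor from the command line is my own choice here; printing test keeps the result observable):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
    /* The factor is only known at run time, so the loop cannot be folded
       into a constant at compile time. */
    unsigned factor = (argc > 1) ? (unsigned)atoi(argv[1]) : 2u;
    unsigned test = 1u;

    clock_t launch = clock();
    for (long runs = 0; runs < 100000000L; runs++) {
        test *= factor;   /* swap in "test <<= 1;" for the shift variant */
        test >>= 1;
    }
    clock_t done = clock();

    /* Printing test makes its value observable, so the work can't be discarded. */
    printf("test = %u, ticks = %f\n", test, (double)(done - launch));
    return 0;
}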

derobert
  • 1
    Note also that the optimizer can (and will) optimize out operations on constants (even in a loop) as shown in [sqrt c# vs sqrt c++](http://programmers.stackexchange.com/a/185449/40980) where the optimizer was able to replace a loop summing a value with the actual sum. To defeat that optimization you need to use something determined at runtime (such as a command line argument). –  Jul 22 '14 at 22:07
  • @MichaelT Yep. That's what I meant by "Note that a sufficiently determined optimizer could still optimize away the loop (it relies entirely on constants known at compile time)." – derobert Jul 22 '14 at 22:08
  • I get what you are saying, but I don't think the compiler is removing the entire loop. You can easily test out this theory by simply increasing the number of iterations. You will see that increasing the iterations does make the program take longer. If the loop was entirely removed this wouldn't be the case. – DollarAkshay May 02 '18 at 14:37
  • @AkshayLAradhya I can't say what *your* compiler is doing, but I confirmed again that `gcc -O3` (now with 7.3) still removes the loop entirely. (Make sure to switch to a long instead of int if required, otherwise it optimizes it into an infinite loop due to overflow). – derobert May 02 '18 at 14:48
8

I think it would be more helpful for the questioner to have a more differentiated answer, because I see several unexamined assumptions in the questions and in some of the answers or comments.

The resulting relative runtime of shifting and multiplication has nothing to do with C. When I say C, I do not mean the instance of a specific implementation, such as this or that version of GCC, but the language. I do not mean to take this ad absurdum, but to use an extreme example for illustration: you could implement a completely standards compliant C compiler and have multiplication take an hour, while shifting takes milliseconds - or the other way around. I am not aware of any such performance restrictions in C or C++.

You may not care about this technicality in argumentation. Your intention was probably to just test out the relative performance of doing shifts versus multiplications and you chose C, because it is generally perceived as a low level programming language, so one may expect its source code to translate into corresponding instructions more directly. Such questions are very common and I think a good answer should point out that even in C your source code does not translate into instructions as directly as you may think in a given instance. I have given you some possible compilation results below.

This is where comments that question the usefulness of substituting this equivalence in real-world software come in. You can see some in the comments to your question, such as the one from Eric Lippert. It is in line with the reaction you will generally get from more seasoned engineers in response to such optimizations. If you use binary shifts in production code as a blanket means of multiplying and dividing, people will most likely cringe at your code and have some degree of emotional reaction ("I have heard this nonsensical claim made about JavaScript for heaven's sake.") to it that may not make sense to novice programmers, unless they better understand the reasons for those reactions.

Those reasons are primarily a combination of the decreased readability and the futility of such optimization, as you may have found out already by comparing their relative performance. However, I do not think that people would have as strong a reaction if substitution of shift for multiplication were the only example of such optimizations. Questions like yours frequently come up in various forms and in various contexts. I think what more senior engineers actually react to so strongly, at least I have at times, is that there is potential for a much wider range of harm when people employ such micro-optimizations liberally across the code base.

If you work at a company like Microsoft on a large code base, you will spend a lot of time reading other engineers' source code, or attempting to locate certain code therein. It may even be your own code that you will be trying to make sense of in a few years' time, particularly at some of the most inopportune moments, such as when you have to fix a production outage following a call you received while on pager duty on a Friday night, about to head out for a night of fun with friends … If you spend that much time reading code, you will appreciate it being as readable as possible. Imagine reading your favorite novel, but the publisher has decided to release a new edition where they use abbrv. all ovr th plc bcs thy thnk it svs spc. That is akin to the reaction other engineers may have to your code if you sprinkle it with such optimizations. As other answers have pointed out, it is better to clearly state what you mean, that is, to use the operation that is semantically closest to the idea you are trying to express.

Even in those environments, though, you may find yourself solving an interview question where you are expected to know this or some other equivalence. Knowing them is not bad and a good engineer would be aware of the arithmetic effect of binary shifting. Note that I did not say that this makes a good engineer, but that a good engineer would know, in my opinion. In particular, you may still find some manager, usually towards the end of your interview loop, who will grin broadly at you in anticipation of the delight to reveal this smart engineering "trick" to you in a coding question and prove that he/she, too, used to be or is one of the savvy engineers and not "just" a manager. In those situations, just try to look impressed and thank him/her for the enlightening interview.

Why did you not see a speed difference in C? The most likely answer is that the multiplication and the shift in your program resulted in the same assembly code. Consider these two functions:

int shift(int i) { return i << 2; }
int multiply(int i) { return i * 2; }

With optimization enabled, neither compiles to a multiply instruction; the shift, for example, becomes:

shift(int):
    lea eax, [0+rdi*4]
    ret

On GCC without optimizations, i.e. using the flag "-O0", you may get this:

shift(int):
    push    rbp
    mov rbp, rsp
    mov DWORD PTR [rbp-4], edi
    mov eax, DWORD PTR [rbp-4]
    sal eax, 2
    pop rbp
    ret
multiply(int):
    push    rbp
    mov rbp, rsp
    mov DWORD PTR [rbp-4], edi
    mov eax, DWORD PTR [rbp-4]
    add eax, eax
    pop rbp
    ret

As you can see, passing "-O0" to GCC does not mean that it will not be somewhat smart about what kind of code it produces. In particular, notice that even in this case the compiler avoided the use of a multiply instruction. You can repeat the same experiment with shifts by other numbers and even multiplications by numbers that are not powers of two. Chances are that on your platform you will see a combination of shifts and additions, but no multiplications. It seems like a bit of a coincidence for the compiler to apparently avoid using multiplications in all those cases if multiplications and shifts really had the same cost, does it not? But I do not mean to supply supposition for proof, so let us move on.

You could rerun your test with the above code and see if you notice a speed difference now. Even then, though, you are not testing shift versus multiply, as you can see from the absence of a multiplication, but rather the code that GCC generated, with a certain set of flags, for the C operations of shift and multiply in a particular instance. So, in another test, you could edit the assembly code by hand and use an "imul" instruction in the code for the "multiply" method instead.

If you want to defeat some of those smarts of the compiler, you could define more general shift and multiply functions, and you will end up with something like this:

int shift(int i, int j) { return i << j; }
int multiply(int i, int j) { return i * j; }

Which may yield the following assembly code:

shift(int, int):
    mov eax, edi
    mov ecx, esi
    sal eax, cl
    ret
multiply(int, int):
    mov eax, edi
    imul    eax, esi
    ret

Here we finally have, even on the highest optimization level of GCC 4.9, the expression in assembly instructions you may have expected when you initially set out on your test. I think that in itself can be an important lesson in performance optimization. We can see the difference it made to substitute variables for concrete constants in our code, in terms of the smarts that the compiler is able to apply.

Micro-optimizations like the shift-multiply substitution are some very low-level optimizations that a compiler can usually easily do by itself. Other optimizations that are much more impactful on performance require an understanding of the intention of the code that is often not accessible by the compiler or can only be guessed at by some heuristic. That is where you as a software engineer come in, and it typically does not involve substituting multiplications with shifts. It involves factors such as avoiding a redundant call to a service that produces I/O and can block a process. If you go to your hard disk or, god forbid, to a remote database for some extra data you could have derived from what you already have in memory, the time you spend waiting outweighs the execution of a million instructions.

Now, I think we have strayed a bit far from your original question, but I think pointing this out to a questioner, especially if we suppose someone who is just starting to get a grasp on the translation and execution of code, can be extremely important to address not just the question itself but many of its (possibly) underlying assumptions.

So, which one will be faster? I think it is a good approach you chose to actually test out the performance difference. In general, it is easy to be surprised by the runtime performance of some code changes. There are many techniques modern processors employ and the interaction between software can be complex, too. Even if you should get beneficial performance results for a certain change in one situation, I think it is dangerous to conclude that this type of change will always yield performance benefits. I think it is dangerous to run such tests once, say "Okay, now I know which one's faster!" and then indiscriminately apply that same optimization to production code without repeating your measurements.
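
As a sketch of what such a measurement could look like for the two functions above (the volatile operand and the command-line argument are my additions, meant to keep both operands out of the compiler's sight; back-to-back loops are still a crude methodology, so repeat and interleave the runs):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int shift(int i, int j)    { return i << j; }
int multiply(int i, int j) { return i * j; }

int main(int argc, char **argv)
{
    /* Both operands are unknown at compile time: j comes from the command
       line and i is re-read from a volatile on every iteration, so the
       compiler can neither fold the work away nor hoist it out of the loop. */
    int j = (argc > 1) ? atoi(argv[1]) : 3;   /* keep j small to avoid overflow */
    volatile int i = 5;
    unsigned acc = 0;

    clock_t t0 = clock();
    for (long n = 0; n < 100000000L; n++)
        acc += (unsigned)multiply(i, j);
    clock_t t1 = clock();
    for (long n = 0; n < 100000000L; n++)
        acc += (unsigned)shift(i, j);
    clock_t t2 = clock();

    printf("multiply: %f  shift: %f  (acc=%u)\n",
           (double)(t1 - t0), (double)(t2 - t1), acc);
    return 0;
}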

So what if the shift is faster than the multiplication? There are certainly indications why that would be true. GCC, as you can see above, appears to think (even without optimization) that avoiding direct multiplication in favor of other instructions is a good idea. The Intel 64 and IA-32 Architectures Optimization Reference Manual will give you an idea of the relative cost of CPU instructions. Another resource, more focused on instruction latency and throughput, is http://www.agner.org/optimize/instruction_tables.pdf. Note that they are not a good predictor of absolute runtime, but of performance of instructions relative to each other. In a tight loop, as your test is simulating, the metric of "throughput" should be most relevant. It is the number of cycles that an execution unit will typically be tied up for when executing a given instruction.

So what if the shift is NOT faster than the multiplication? As I said above, modern architectures can be quite complex and things such as branch prediction, caching, pipelining, and parallel execution units can make it hard to predict the relative performance of two logically equivalent pieces of code at times. I really want to emphasize this, because this is where I am not happy with most answers to questions like these and with the camp of people outright saying that it is simply not true (anymore) that shifting is faster than multiplication.

No, as far as I am aware we did not invent some secret engineering sauce in the 1970's or whenever to suddenly annul the cost difference of a multiplication unit and a bit shifter. A general multiplication, in terms of logical gates, and certainly in terms of logical operations, is still more complex than a shift with a barrel shifter in many scenarios, on many architectures. How this translates into overall runtime on a desktop computer may be a bit opaque. I do not know for sure how they are implemented in specific processors, but here is an explanation of a multiplication: Is integer multiplication really same speed as addition on modern CPU

While here is an explanation of a Barrel Shifter. The documents I have referenced in the previous paragraph give another view on the relative cost of operations, by proxy of CPU instructions. The engineers over at Intel frequently seem to get similar questions: intel developer zone forums clock cycles for integer multiplication and addition in core 2 duo processor

Yes, in most real-life scenarios, and almost certainly in JavaScript, attempting to exploit this equivalence for performance's sake is probably a futile undertaking. However, even if we forced the use of multiplication instructions and then saw no difference in run-time, that is more due to the nature of the cost metric we used, to be precise, and not because there is no cost difference. End-to-end runtime is one metric and if it is the only one we care about, all is well. But that does not mean that all cost differences between multiplication and shifting have simply disappeared. And I think it is certainly not a good idea to convey that idea to a questioner, by implication or otherwise, who is obviously just starting to get an idea of the factors involved in the run-time and cost of modern code. Engineering is always about trade-offs. Inquiry into and explanation as to what tradeoffs modern processors have made to exhibit the execution time that we as users end up seeing may yield a more differentiated answer. And I think a more differentiated answer than "this is simply not true anymore" is warranted if we want to see fewer engineers check in micro-optimized code obliterating readability, because it takes a more general understanding of the nature of such "optimizations" to spot its various, diverse incarnations than simply referring to some specific instances as out of date.

6

What you see is the effect of the optimiser.

The optimiser's job is to make the resulting compiled code either smaller or faster (but rarely both at the same time... but like many things... IT DEPENDS on what the code is).

In PRINCIPLE, any call to a multiplication library, or, frequently, even use of a hardware multiplier will be slower than just doing a bitwise shift.

So... if the naive compiler generated a call to a library for the operation *2, then of course it would run slower than a bitwise shift.

However optimisers are there to detect patterns and figure out how to make the code smaller/faster/ whatever. And what you have seen is the compiler detecting that *2 is the same as a shift.

Just as a matter of interest, I was looking today at the generated assembler for some operations like *5... not actually looking for that, but other things, and along the way I noticed that the compiler had turned *5 into:

  • shift
  • shift
  • add original number

So the optimiser of my compiler was smart enough (at least for certain small constants) to generate inline shifts and adds instead of calls to a general purpose multiply library.
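
In C terms, that sequence is equivalent to something like this (a sketch of the expression, not the compiler's actual output):

/* x * 5 rewritten as the shift/shift/add sequence described above:
   ((x << 1) << 1) + x  ==  (x << 2) + x  ==  x * 4 + x  ==  x * 5 */
unsigned times_five(unsigned x)
{
    return (x << 2) + x;
}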

The art of compiler optimisers is a whole separate subject, filled with magic, and really properly understood by about 6 people on the whole planet :)

quickly_now
3

Try timing it with:

for (runs = 0; runs < 100000000; runs++) {
      ;
}

The compiler should recognize that the value of test is unchanged after each iteration of the loop and that its final value is unused, and eliminate the loop entirely.
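
A complete version of that control measurement might look like this (a sketch, using the same iteration count as the question):

#include <stdio.h>
#include <time.h>

int main(void)
{
    clock_t launch = clock();

    /* Empty loop: if this takes roughly as long as the original program,
       the loop overhead (or its elimination) dominates the measurement. */
    for (long runs = 0; runs < 100000000L; runs++) {
        ;
    }

    clock_t done = clock();
    printf("%f\n", (double)(done - launch));
    return 0;
}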

2

Multiplication is a combination of shifts and additions.

In the case you've mentioned, I don't believe it matters whether the compiler optimises it or not - "multiply x by two" can be implemented as either:

  • Shift bits of x one place to the left.
  • Add x to x.

These are each basic atomic operations; one isn't faster than the other.

Change it to "multiply x by four", (or any 2^k, k>1) and it's a little different:

  • Shift bits of x two places to the left.
  • Add x to x and call it y, add y to y.

On a basic architecture, it's simple to see that the shift is more efficient - taking one vs. two operations, since we cannot add y to y until we know what y is.

Try the latter (or any 2^k, k>1), with appropriate options to prevent your compiler from optimising them to be the same thing in implementation. You should find the shift is faster, taking O(1) compared to the repeated addition in O(k).

Obviously, where the multiplicand is not a power of two, a combination of shifts and additions (one where the number of each is non-zero) is necessary.
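
To make the single-shift versus repeated-addition comparison concrete, a small sketch (assuming k is smaller than the width of unsigned):

/* Multiply x by 2^k with a single shift: one operation regardless of k. */
unsigned times_pow2_shift(unsigned x, unsigned k)
{
    return x << k;
}

/* The same result via repeated doubling: k dependent additions,
   each one waiting on the previous sum. */
unsigned times_pow2_add(unsigned x, unsigned k)
{
    while (k--)
        x += x;
    return x;
}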

OJFord
  • 1
    What's a "basic atomic operation"? Couldn't one argue that in a shift, the operation can be applied to every bit in parallel, while in an addition the leftmost bits depends on the other bits? – Bergi Jul 22 '14 at 10:54
  • 2
    @Bergi: I'm guessing he means that both shift and add are single machine instructions. You'd have to look at the instruction set documentation to see the cycle counts for each, but yes, an add is often a multi-cycle operation whereas a shift is usually performed in a single cycle. – TMN Jul 22 '14 at 16:15
  • Yes, that might be the case, but multiplication is a single machine instruction as well (though of course it might need more cycles) – Bergi Jul 22 '14 at 17:25
  • @Bergi, that too is arch dependent. What arch are you thinking of that shifts in fewer cycles than it does 32-bit addition (or x-bit as applicable)? – OJFord Jul 22 '14 at 17:37
  • I don't know any particular architectures, no (and my computer engineering courses have faded), probably both instructions take less than one cycle. I probably was thinking in terms of microcode or even logic gates, where a shift would likely be cheaper. – Bergi Jul 22 '14 at 18:17
  • @bergi It is true that there are bit dependencies in addition operations due to the chaining of full adders, whereas shifts merely require bits to be rerouted. However, modern adders have carry lookaheads and other fancy parts that close the speed gap somewhat. You are however right that it's incorrect to say 'one isn't faster than the other', instead it should be said that 'one isn't noticeably faster than the other' as depending on hardware implementation, one usually will be faster. – Pharap Jul 23 '14 at 03:35
1

Multiplication of signed or unsigned values by powers of two is equivalent to left-shifting, and most compilers will make the substitution. Division of unsigned values, or signed values which the compiler can prove are never negative, is equivalent to right-shifting, and most compilers will make that substitution (though some aren't sophisticated enough to prove when signed values can't be negative).

It should be noted, however, that division of potentially-negative signed values is not equivalent to right-shifting. An expression like (x+8)>>4 is not equivalent to (x+8)/16. The former, in 99% of compilers, will map values from -24 to -9 to -1, -8 to +7 to 0, and +8 to +23 to 1 [rounding numbers almost symmetrically about zero]. The latter will map -39 to -24 to -1, -23 to +7 to 0, and +8 to +23 to +1 [grossly asymmetrical, and likely not what was intended]. Note that even when values are not expected to be negative, use of >>4 will likely yield faster code than /16 unless the compiler can prove the values can't be negative.
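
A small sketch that prints both mappings, so the asymmetry around zero is visible:

#include <stdio.h>

int main(void)
{
    /* Compare (x + 8) >> 4 (arithmetic shift on most compilers, which rounds
       toward minus infinity) with (x + 8) / 16 (C division, which truncates
       toward zero) around the boundaries mentioned above. */
    int samples[] = { -39, -24, -23, -9, -8, 7, 8, 23 };
    int count = (int)(sizeof samples / sizeof samples[0]);

    for (int i = 0; i < count; i++) {
        int x = samples[i];
        printf("x = %4d   (x+8)>>4 = %3d   (x+8)/16 = %3d\n",
               x, (x + 8) >> 4, (x + 8) / 16);
    }
    return 0;
}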

supercat
0

Some more info I just checked out.

On x86_64, the MUL opcode has a 10 cycle latency and 1/2 cycle throughput. MOV, ADD and SHL have a latency of 1 cycle, with 2.5, 2.5, and 1.7 cycle throughput.

A multiplication by 15 written as shifts and adds would require 3 SHL and 3 ADD ops, and probably a couple of MOVs.

https://gmplib.org/~tege/x86-timing.pdf
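
For comparison, here are both shift-based forms of a multiply by 15 (a sketch; when compilers avoid the multiplier at all, they tend to prefer the shift-and-subtract form):

/* 15 = 16 - 1: one shift and one subtract. */
unsigned times15_sub(unsigned x)
{
    return (x << 4) - x;
}

/* 15 = 8 + 4 + 2 + 1: three shifts and three adds. */
unsigned times15_add(unsigned x)
{
    return (x << 3) + (x << 2) + (x << 1) + x;
}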

Rich Remer
0

Your methodology is flawed. The loop increment and condition check themselves account for much of that time.

  • Try running an empty loop and measure the time (call it base).
  • Now add 1 shift operation and measure the time (call it s1).
  • Next add 10 shift operations and measure the time (call it s2)

If everything is working correctly, (s2 - base) should be about 10 times (s1 - base). Otherwise something else is coming into play here.
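
A sketch of that three-step measurement (the volatile qualifier is my addition, to discourage the compiler from deleting the shifts outright):

#include <stdio.h>
#include <time.h>

#define ITERATIONS 100000000L
#define SHIFT1   test <<= 1;
#define SHIFT10  SHIFT1 SHIFT1 SHIFT1 SHIFT1 SHIFT1 SHIFT1 SHIFT1 SHIFT1 SHIFT1 SHIFT1

int main(void)
{
    /* volatile: each shift is an observable access the compiler must keep */
    volatile unsigned test = 2;
    clock_t t0, t1, t2, t3;

    t0 = clock();
    for (long n = 0; n < ITERATIONS; n++) { (void)test; }   /* base: loop only */
    t1 = clock();
    for (long n = 0; n < ITERATIONS; n++) { SHIFT1 }        /* 1 shift per pass */
    t2 = clock();
    for (long n = 0; n < ITERATIONS; n++) { SHIFT10 }       /* 10 shifts per pass */
    t3 = clock();

    printf("base: %f  one shift: %f  ten shifts: %f\n",
           (double)(t1 - t0), (double)(t2 - t1), (double)(t3 - t2));
    return 0;
}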

Now I actually tried this myself and figured, if loops are causing a problem, why not remove them entirely. So I went ahead and did this:

#include <stdio.h>
#include <time.h>

int main(){

    int test = 2;
    clock_t launch = clock();

    test << 6;
    test << 6;
    test << 6;
    test << 6;
    //.... 1 million times
    test << 6;

    clock_t done = clock();
    printf("Time taken : %ld\n", (long)(done - launch));
    return 0;
}

And there you have your result:

1 million shift operations in under 1 millisecond?

I did the same thing for multiplication by 64 and got the same result. So probably the compiler is ignoring the operation completely since, as others mentioned, the result is never used and test is never changed.

[Screenshot: shift operator timing result]

DollarAkshay