
[edit#2] If anyone from VMWare can hit me up with a copy of VMWare Fusion, I'd be more than happy to run the same tests as a VirtualBox vs VMWare comparison. Somehow I suspect the VMWare hypervisor will be better tuned for hyper-threading (see my answer too)

I'm seeing something curious. As I increase the number of cores on my Windows 7 x64 virtual machine, the overall compile time increases instead of decreasing. Compiling is usually very well suited to parallel processing, as in the middle part (post dependency mapping) you can simply invoke a compiler instance on each of your .c/.cpp/.cs/whatever files to build partial objects for the linker to take over. So I would have imagined that compiling would scale very well with the number of cores.
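To make concrete what I mean by a compiler instance per file, here's a minimal sketch of that fan-out stage (the paths are hypothetical, and driving csc.exe per file is purely illustrative - C# normally compiles whole projects, so this really mirrors the C/C++ model):

    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Threading.Tasks;

    class ParallelCompileSketch
    {
        static void Main()
        {
            // Hypothetical source folder; in a real build the file list comes
            // from the dependency graph computed up front.
            string[] sources = Directory.GetFiles(@"C:\proj\src", "*.cs");

            var options = new ParallelOptions
            {
                MaxDegreeOfParallelism = Environment.ProcessorCount
            };

            // One compiler instance per file, each emitting a partial object
            // (.netmodule) for a later link/merge step to combine.
            Parallel.ForEach(sources, options, file =>
            {
                var psi = new ProcessStartInfo("csc.exe",
                    "/nologo /target:module /out:" +
                    Path.ChangeExtension(file, ".netmodule") + " " + file)
                { UseShellExecute = false };
                using (var compiler = Process.Start(psi))
                    compiler.WaitForExit();
            });
        }
    }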

But what I'm seeing is:

  • 8 cores: 1.89 sec
  • 4 cores: 1.33 sec
  • 2 cores: 1.24 sec
  • 1 core: 1.15 sec

Is this simply a design artifact of a particular vendor's hypervisor implementation (type 2: VirtualBox in my case), or something more pervasive across VMs that keeps hypervisor implementations simpler? With so many factors, I seem to be able to make arguments both for and against this behavior - so if someone knows more about this than me, I'd be curious to read your answer.

Thanks, Sid

[edit:addressing comments]

@MartinBeckett: Cold compiles were discarded.

@MonsterTruck: Couldn't find an open-source project to compile directly. Would be great, but I can't screw up my dev env right now.

@Mr Lister, @philosodad: I have 8 HW threads and am using VirtualBox, so it should be a 1:1 mapping without emulation

@Thorbjorn: I have 6.5 GB for the VM and a smallish VS2012 project - it's quite unlikely that I'm swapping in/out, thrashing the page file.

@All: If someone can point to an open source VS2010/VS2012 project, that might be a better community reference than my (proprietary) VS2012 project. Orchard and DNN seem to need environment tweaking to compile in VS2012. I really would like to see if someone with VMWare Fusion also sees this (for a VMWare vs VirtualBox comparison)

Test details:

  • Hardware: Macbook Pro Retina
    • CPU : Core i7 @ 2.3 GHz (quad-core, hyper-threaded = 8 cores in Windows Task Manager)
    • Memory : 16 GB
    • Disk : 256GB SSD
  • Host OS: Mac OS X 10.8
  • VM type: VirtualBox 4.1.18 (type 2 hypervisor)
  • Guest OS: Windows 7 x64 SP1
  • Compiler: VS2012 compiling a solution with 3 C# Azure projects
    • Compile times measured by a VS2012 plugin called 'VSCommands'
    • All tests run 5 times, first 2 runs discarded, last 3 averaged (the harness sketched below reproduces this protocol from the command line)
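For anyone wanting to reproduce that protocol outside of VSCommands, a rough command-line harness might look like this (the MSBuild and solution paths are assumptions to adjust for your environment; /t:Rebuild and /m are standard MSBuild switches):

    using System;
    using System.Diagnostics;
    using System.Linq;

    class CompileTimer
    {
        // Hypothetical paths - adjust for your own environment.
        const string MsBuild =
            @"C:\Windows\Microsoft.NET\Framework64\v4.0.30319\MSBuild.exe";
        const string Solution = @"C:\proj\MySolution.sln";

        static void Main()
        {
            var times = new double[5];
            for (int i = 0; i < times.Length; i++)
            {
                // /t:Rebuild forces a full compile every run; /m lets MSBuild
                // build independent projects in parallel.
                var psi = new ProcessStartInfo(MsBuild,
                    "\"" + Solution + "\" /t:Rebuild /m /nologo /v:q")
                { UseShellExecute = false };

                var sw = Stopwatch.StartNew();
                using (var msbuild = Process.Start(psi))
                    msbuild.WaitForExit();
                times[i] = sw.Elapsed.TotalSeconds;

                Console.WriteLine("Run {0}: {1:F2} sec", i + 1, times[i]);
            }

            // Discard the two warm-up runs, average the remaining three.
            Console.WriteLine("Average: {0:F2} sec", times.Skip(2).Average());
        }
    }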
DeepSpace101
  • Probably the file I/O slowing it down with multiple tasks, and the disk access being to the virtualised drive – Martin Beckett Aug 11 '12 at 04:32
  • I'd like to reproduce this on my own machine. Can you please upload a sample project somewhere? I suspect the virtual machine is playing tricks here. Try booting to Windows natively (Bootcamp) and see if you observe the same behaviour - I doubt you will. – Apoorv Aug 11 '12 at 05:01
  • Are you sure that the virtual machine is actually using more cores, not just _simulating_ the use of more cores? – Mr Lister Aug 11 '12 at 07:22
  • @MartinBeckett my first thought too, but it's likely that the source is cached in memory (since OP is discarding the cold run times). – Daniel B Aug 11 '12 at 08:10
  • I would suggest running some basic CPU benchmarks (the multi-threaded kind) under various #core settings on the VM. This should tell you whether the problem lies with the setup of the environment, or if it's something to do with the actual compiler. It definitely *is* possible that the dependencies in your source code make parallelisation less beneficial (even negative, as in your case). – Daniel B Aug 11 '12 at 08:15
  • I'm pretty sure VS won't allocate more than one core per project to the compilation. – TZHX Aug 11 '12 at 09:04
  • What are we compiling here? A lot of the time, the overhead of parallelizing a task doesn't pay off until you hit a certain scale. See how compiling Apache or RavenDB does. – Wyatt Barnett Aug 11 '12 at 11:00
  • When you try this, look back at your mac and see how many CPUs are working and how many threads your VM is using. If this is the same for a windows VM with 4 cores as one for 8 cores while they execute the same task, then you are just picking up the overhead of emulating multiple cores without any of the benefits. – philosodad Aug 11 '12 at 14:55
  • @TZHX - it's an option in the compile settings in VS: /MP – Martin Beckett Aug 11 '12 at 15:48
  • @DanielB - writes might be slow, especially if the VM waits until the virtual disk confirms a commit and the VM is itself single threaded. There is a lot of smart SW in the queue handling in the SATA bus on a bare machine – Martin Beckett Aug 11 '12 at 15:50
  • You probably run out of memory in your virtual machine so it starts swapping. –  Aug 11 '12 at 15:52
  • Same thing has happened to me before with Java using Maven 3.x to compile on an i3. Letting it default to *"4"* threads was much slower, near 50% slower, than telling it explicitly to only use 2 cores. I think it has something to do with the hyper-threading context switching and overlapping I/O. –  Aug 11 '12 at 18:49
  • Another thing to keep in mind: Cores created by hyperthreading don't have the same performance as the real cores. That probably explains the big jump when going from 4 cores to 8 (the other 4 are no doubt hyperthreaded "cores") but it doesn't explain the rest of what you're seeing. – Loren Pechtel Aug 13 '12 at 02:56
  • Thanks for this article. I have a MBP 2011 with Core i7 2.2, 16 GB DDR3-1333 and a 512 GB SSD. I run Win7 x64 and Debian 6 x64 in VMWare Fusion. What I see is that when I assign 4, 6 or 8 cores, my MBP starts heating up a lot without doing anything special. What really is behind the core-count parameter? –  Nov 23 '12 at 11:18

3 Answers


Answer: It doesn't slow down - it does scale up with the number of CPU cores. The project used in the original question was 'too small' (it's actually a ton of development work, but small/optimized from a compiler's point of view) to reap the benefits of multiple cores. At this small scale, instead of planning how to spread the work, spawning multiple compiler processes etc., it's faster to hammer at the work serially right off the bat.
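To illustrate that effect outside of a compiler, here's a hedged toy sketch using nothing but the TPL: with many tiny work items the parallel version loses to the plain serial loop, and only with fewer, larger items does it pull ahead:

    using System;
    using System.Diagnostics;
    using System.Threading.Tasks;

    class OverheadVsGains
    {
        static double _sink; // keeps the JIT from optimizing the work away

        static void Time(string label, Action action)
        {
            var sw = Stopwatch.StartNew();
            action();
            Console.WriteLine("{0,-15}{1,6} ms", label, sw.ElapsedMilliseconds);
        }

        static double Burn(int iterations)
        {
            double a = 0;
            for (int i = 0; i < iterations; i++) a += Math.Sqrt(i);
            return a;
        }

        static void Main()
        {
            // Many tiny work items: the per-item cost of handing work to the
            // thread pool dominates, so the serial loop tends to win.
            // (The race on _sink is harmless here; we only care about timing.)
            Time("serial/tiny", () =>
                { for (int i = 0; i < 2000000; i++) _sink = Math.Sqrt(i); });
            Time("parallel/tiny", () =>
                Parallel.For(0, 2000000, i => { _sink = Math.Sqrt(i); }));

            // A few large work items: the work dwarfs the overhead and the
            // parallel version scales with cores, like a big multi-project build.
            Time("serial/large", () =>
                { for (int i = 0; i < 16; i++) _sink = Burn(20000000); });
            Time("parallel/large", () =>
                Parallel.For(0, 16, i => { _sink = Burn(20000000); }));
        }
    }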

This is based on a new experiment I ran after reading the comments on the question (and out of personal curiosity). I used a larger VS project - Umbraco CMS's source code - since it's large, open sourced, and one can directly load up the solution file and rebuild (hint: load up umbraco_675b272bb0a3\src\umbraco.sln in VS2010/VS2012).

NOW, what I see is what I expect, i.e. compiles scale up!! Well, up to a certain point, since I find:

Table of results

Takeaways:

  • A new VM core results in a new OS X Thread within the VirtualBox process
  • Compile times scale up as expected (compiles are long enough)
  • At 8 VM cores, core emulation might be kicking in within VirtualBox as the penalty is massive (50% hit)
  • The above is likely because OS X is unable to present 4 hyper-threaded cores (8 h/w threads) as 8 full cores to VirtualBox

That last point caused me to monitor the CPU history across all the cores via 'Activity Monitor' (CPU history) and what I found was

OS X CPU history graph

Takeaways:

  • At one VM core, the activity seems to hop across the 4 HW cores. Makes sense, as a way to distribute heat evenly across the cores.

  • Even at 4 virtual cores (and 27 VirtualBox OS X threads, or ~800 OS X threads overall), only the even HW threads (0,2,4,6) are almost saturated while the odd HW threads (1,3,5,7) sit at almost 0%. More likely the scheduler works in terms of HW cores and NOT HW threads, so I speculate that perhaps the OS X 64-bit kernel/scheduler isn't optimized for hyper-threaded CPUs? Or, looking at the 8-VM-core setup, perhaps it only starts using them at very high CPU utilization? Something funny is going on ... well, that's a separate question for some Darwin developers ... (the affinity sketch below is one way to probe this from inside the guest)
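For anyone who wants to probe the hyper-threading angle from inside the Windows guest (or natively via Bootcamp) rather than reading the OS X graphs, a sketch like this pins a CPU-bound loop to different logical CPUs. The sibling layout assumed in the masks is just that - an assumption - so verify it for your machine:

    using System;
    using System.Diagnostics;
    using System.Threading.Tasks;

    class AffinityProbe
    {
        static void Main()
        {
            // Assumption: Windows enumerates hyper-thread siblings adjacently,
            // so logical CPUs 0,2,4,6 are four distinct physical cores while
            // 0,1,2,3 are two physical cores with both HT siblings loaded.
            RunPinned("4 physical cores (0,2,4,6)", 0x55); // 01010101
            RunPinned("2 cores, both HT (0,1,2,3)", 0x0F); // 00001111
        }

        static void RunPinned(string label, long mask)
        {
            Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)mask;

            var sw = Stopwatch.StartNew();
            Parallel.For(0, 4, _ =>
            {
                double a = 0;
                for (int i = 0; i < 100000000; i++) a += Math.Sqrt(i);
            });
            // If HT siblings behaved like real cores, both runs would take
            // about the same time; in practice the second is usually slower.
            Console.WriteLine("{0}: {1} ms", label, sw.ElapsedMilliseconds);
        }
    }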

[edit]: I'd love to try the same in VMWare Fusion. Chances are it won't be this bad. I wonder if they showcase this as a selling point for their commercial product ...

Footer:

In case the images ever disappear, the compile time table is (text, ugly!)

Cores in VM   Avg compile time (sec)   Host/OS X threads   Host/OS X CPU consumption
     1                11.83                   24                  105-115%
     2                10.04                   25                  140-190%
     4                 9.59                   27                  180-270%
     8                14.18                   31                  240-430%
DeepSpace101
  • I suspect the drop between 4 and 8 is a combination of the VM not being optimised for HT, and HT not in any way being equal to twice as many cores (at *best* a 30% performance increase, usually far less). – Daniel B Aug 13 '12 at 06:25
  • @DanielB: At 4=>8 cores, the issue isn't just that it's a mere +30% boost (vs +100%) like you suggested - it's that the performance is actually -50%. If the hardware threads were totally 'dead/useless' and work was being diverted to the other cores, the performance delta would be 0. So therefore I'd be more inclined to say it's the design of the VirtualBox type 2 hypervisor. I wonder how VMWare Fusion is ... – DeepSpace101 Aug 13 '12 at 08:33
  • "At one VM core, the activity seems to be hopping across the 4 HW cores. Makes sense, to distribute heat evenly at core levels" - not necessarily, it is usually better to re-schedule on the same core (for cache etc) but the hypervisor is just picking one at randon, or the least-used core because it thinks its a general-purpose processing where other processes are using those cores. In this case, the scheduler optimisation works against you (but in a very minor way) – gbjbaanb Aug 13 '12 at 09:40
  • @Sid agreed, I'm just pointing out that with HT you're going to get (greatly) diminishing returns a lot sooner than you'd think, if you assumed it's actually anything like a 100% improvement. In this case, it could easily be contention for your HD that's causing this, hence my earlier suggestion for some artificial CPU benchmarks. – Daniel B Aug 13 '12 at 12:26

There is only one possible reason for this to be happening, which is that your overhead is exceeding your gains.

You may be emulating the multiple cores, rather than assigning actual cores or even processes or even threads from the host machine. That seems pretty likely to me, and obviously is going to give you negative speedup.

The other possibility is that the process itself doesn't parallelize well, and even attempting to parallelize it is costing you more in communication overhead than you're gaining.
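A quick way to tell these two cases apart is to give each thread a fixed amount of work and watch the elapsed time as the thread count rises. A hedged sketch (plain TPL, nothing VM-specific):

    using System;
    using System.Diagnostics;
    using System.Threading.Tasks;

    class ScalingProbe
    {
        static void Main()
        {
            foreach (int threads in new[] { 1, 2, 4, 8 })
            {
                var sw = Stopwatch.StartNew();
                Parallel.For(0, threads,
                    new ParallelOptions { MaxDegreeOfParallelism = threads },
                    _ =>
                    {
                        double a = 0;
                        for (int i = 0; i < 100000000; i++) a += Math.Sqrt(i);
                    });
                // Fixed work *per thread*: on real cores the elapsed time stays
                // roughly flat as threads increase; if it grows linearly, the
                // extra "cores" are being time-sliced or emulated.
                Console.WriteLine("{0} thread(s): {1} ms",
                    threads, sw.ElapsedMilliseconds);
            }
        }
    }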

philosodad
  • `your overhead is exceeding your gains`: True but that pretty much covers everything without knowing what is really causing it :) ... I'm using VirtualBox and have the physical cores, so assumed the mapping should be 1:1 without emulation. I'm going to search for a LARGE open source VS2012 so others can reference it too... brb – DeepSpace101 Aug 12 '12 at 17:00
  • @Sid according to this answer http://superuser.com/a/297727 the virtualbox VM should use the host cores appropriately. But I'd still check out what is happening on the host, to make sure that the expected behavior is occurring. – philosodad Aug 13 '12 at 01:03

You are not alone ...

Same thing has happened to me before with Java using Maven 3.x to compile on an i3. Letting it default to "4" threads was much slower, near 50% slower, than telling it explicitly to only use 2 cores.

I think it has something to do with the hyper-threading context switching and overlapping I/O.

It makes sense once you start thinking about it. You can prove what is causing the degradation with a good system-wide profiling tool.