Having competed in hardware benchmarking (you know, overclocking the hardware and competing over whose machine calculates Pi the quickest kind of stuff) and also having written some benchmarking code myself, first and foremost I want to say that benchmarking hardware accurately is a very complicated subject. It is a blend of what you want to benchmark, how you want to benchmark it, where you want to benchmark it and, most importantly, why you want to benchmark.
Let me discuss these properties in detail.
What:
There are different kinds of things to benchmark in any given system. You might want to benchmark, for example, memory bandwidth or latency, hard drive read/write latency or bandwidth, or "CPU speed" in some generic sense. However, it is probably not as simple as that, which brings me to the next point...
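To make "memory bandwidth" concrete, here is a minimal sketch of one way to measure it; the buffer size, iteration count and the choice of memcpy as the workload are my illustrative assumptions, not a rigorous methodology:

```c
/* Minimal memory-bandwidth sketch: time repeated copies of a buffer
 * much larger than the last-level cache. Sizes and iteration counts
 * are illustrative, not tuned. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void) {
    const size_t size = 256UL * 1024 * 1024;  /* 256 MiB, well past typical caches */
    const int iters = 10;
    char *src = malloc(size), *dst = malloc(size);
    if (!src || !dst) return 1;
    memset(src, 1, size);  /* touch pages so the OS actually backs them */
    memset(dst, 0, size);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        memcpy(dst, src, size);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    /* each memcpy reads size bytes and writes size bytes */
    printf("~%.2f GB/s\n", 2.0 * size * iters / secs / 1e9);
    free(src); free(dst);
    return 0;
}
```

Note that even this trivial sketch already makes implementation choices for you: memcpy measures a streaming read+write pattern, which says little about, say, random-access latency.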
How:
As there are different things to benchmark, there are different ways to benchmark them. For example, one could benchmark CPU/GPU GFLOPS with a simple naive matrix multiplication. This operation might or might not correlate with the general performance characteristics of the piece of hardware in question: a CPU scoring well in a matrix multiplication benchmark might not score as well in, say, a data compression benchmark. There are multiple variables that impact the result, and perhaps the most important one is the underlying microarchitecture and how well the code in question can take advantage of its strengths and avoid its weaknesses. So it all boils down to implementation: a single benchmark is a single implementation of the algorithm used!
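As a concrete example, here is roughly what such a naive GFLOPS benchmark might look like; the matrix size, the loop order and the operation count (about 2*N^3 floating-point operations) are illustrative assumptions:

```c
/* Naive matrix multiplication benchmark sketch: N*N*N multiply-adds,
 * so roughly 2*N^3 floating-point operations in total. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 512

int main(void) {
    static double a[N][N], b[N][N], c[N][N];  /* static: zero-initialized */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = (double)rand() / RAND_MAX;
            b[i][j] = (double)rand() / RAND_MAX;
        }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("~%.2f GFLOPS\n", 2.0 * N * N * N / secs / 1e9);
    /* keep c observable so the compiler can't discard the work */
    printf("checksum: %f\n", c[N/2][N/2]);
    return 0;
}
```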
Where:
It is a completely different ball game to carry out benchmarks in a virtual environment such as a virtual machine, compared to running as a process inside a multitasking operating system, compared again to running the code as the only thing on the machine (that is, without an OS, drivers or anything else in the way: booting an x86-based machine from a custom MBR into real mode and working your way up from there). The more "other" code the target machine has running in the "background", the less reliable the benchmark results are going to be; running a benchmark on a server with high CPU load isn't going to produce very reliable results. Likewise, since on most modern systems the operating system and its kernel provide a layer of abstraction, kernel settings and parameters have some impact on the performance of processes. For example, process/thread priorities and scheduling parameters can have a very measurable impact on benchmark results. As a general rule, the less you have going on on the machine the better, so the ideal environment is to dedicate the hardware solely to the purpose of benchmarking.
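As a sketch of reducing that background noise, on Linux you could pin the benchmark process to one core and request a real-time scheduling class before the timed section. Both calls below are Linux-specific, and SCHED_FIFO typically requires root:

```c
/* Sketch of reducing OS interference on Linux: pin the benchmark
 * to one core and request a real-time scheduling class. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                     /* pin to core 0 */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    struct sched_param sp = { .sched_priority = 1 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");     /* usually fails without root */

    /* ... run the timed benchmark loop here ... */
    return 0;
}
```

Even then you are only reducing interference from the scheduler, not eliminating it: interrupts, frequency scaling and other cores still share the machine.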
Why:
What purpose is the benchmark trying to serve? Do you want something like a theoretical indication of how a given piece of hardware performs when running that specific algorithm or set of algorithms (without forgetting that a benchmark only benchmarks that particular implementation of that algorithm)? Or are you trying to mimic the "real world performance" of the hardware as closely as possible? It all depends, because obviously performance in matrix multiplication has little to do with performance in serving web pages or database queries.
Of course these points are very vague, but what I'd want people to keep in mind when they talk about benchmarks in terms of hardware performance is that it depends: a benchmark is merely an indicator and is always case specific. There is so much you can do when writing truly fast code that it's not even funny to compare benchmark results between different kinds of CPUs (say, a very tailored implementation of some arbitrary task, optimized for certain instruction set extensions, certain cache sizes, a certain branch misprediction penalty, a certain level of instruction-level parallelism and a certain memory access latency). It is all about taking advantage of the hardware's strengths and avoiding its weaknesses; the loop-order tweak sketched below is a tiny example.
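To illustrate how much a mere implementation detail can matter, here is a hypothetical drop-in replacement for the timed triple loop in the matrix multiplication sketch above. It performs exactly the same arithmetic on the same a, b and c arrays, only in a different loop order:

```c
/* Same algorithm, different implementation: the i-k-j loop order
 * turns the innermost accesses into sequential streams over b and c,
 * which is far friendlier to the cache than the i-j-k order above.
 * On many machines this variant reports noticeably higher GFLOPS
 * despite doing identical work. */
for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++) {
        double aik = a[i][k];
        for (int j = 0; j < N; j++)
            c[i][j] += aik * b[k][j];
    }
```

So which number is the CPU's "matrix multiplication score"? Both, and neither: each is a score for one implementation on one microarchitecture.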
tl;dr: It is hard to get benchmarking right (if that's even possible), you want to know some low-level concepts of the target hardware, and you have to be aware that everything is case specific. There can never be a universal benchmark.