Having competed in hardware benchmarking (you know, overclocking the hardware and competing over whose machine calculates Pi the quickest kind of stuff) and also having written some benchmarking code myself, first and foremost I want to say that benchmarking hardware accurately is a very complicated subject. It is a blend of what you want to benchmark, how you want to benchmark it, where you want to benchmark it and, most importantly, why you want to benchmark.
Let me discuss these properties in detail.
What:
There are different kinds of things to benchmark in any given system. You might want to benchmark, for example, memory bandwidth or latency, hard drive read/write latency or bandwidth, or "CPU speed" in some generic sense. However, it is probably not as simple as that, which brings me to the next point...
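To make "memory bandwidth" concrete, here is a minimal sketch of one way to measure it; the buffer size, iteration count and the choice of memcpy as the workload are my illustrative assumptions, not a rigorous methodology:

```c
/* Minimal memory-bandwidth sketch: time repeated copies of a buffer
 * much larger than the last-level cache. Sizes and iteration counts
 * are illustrative, not tuned. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void) {
    const size_t size = 256UL * 1024 * 1024;  /* 256 MiB, well past typical caches */
    const int iters = 10;
    char *src = malloc(size), *dst = malloc(size);
    if (!src || !dst) return 1;
    memset(src, 1, size);  /* touch pages so the OS actually backs them */
    memset(dst, 0, size);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        memcpy(dst, src, size);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    /* each memcpy reads size bytes and writes size bytes */
    printf("~%.2f GB/s\n", 2.0 * size * iters / secs / 1e9);
    free(src); free(dst);
    return 0;
}
```

Note that even this trivial sketch already makes implementation choices for you: memcpy measures a streaming read+write pattern, which says little about, say, random-access latency.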
How:
As there are different things to benchmark, there are different ways to benchmark them. For example, one could benchmark CPU/GPU GFLOPS with a simple naive matrix multiplication. This operation might or might not correlate with the general performance characteristics of the piece of hardware in question: a CPU scoring well in a matrix multiplication benchmark might not score as well in, say, a data compression benchmark. There are multiple variables that impact the result, and perhaps the most important one is the underlying microarchitecture and how well the code in question can take advantage of its strengths and avoid its weaknesses. So it all boils down to implementation: a single benchmark is a single implementation of the algorithm used!
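As a concrete example, here is roughly what such a naive GFLOPS benchmark might look like; the matrix size, the loop order and the operation count (about 2*N^3 floating-point operations) are illustrative assumptions:

```c
/* Naive matrix multiplication benchmark sketch: N*N*N multiply-adds,
 * so roughly 2*N^3 floating-point operations in total. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 512

int main(void) {
    static double a[N][N], b[N][N], c[N][N];  /* static: zero-initialized */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = (double)rand() / RAND_MAX;
            b[i][j] = (double)rand() / RAND_MAX;
        }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("~%.2f GFLOPS\n", 2.0 * N * N * N / secs / 1e9);
    /* keep c observable so the compiler can't discard the work */
    printf("checksum: %f\n", c[N/2][N/2]);
    return 0;
}
```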
Where:
It is a completely different ball game to carry out benchmarks in a virtual environment such as a virtual machine, compared to running as a process inside a multitasking operating system, compared again to running the code as the only thing on the machine (that is, without an OS, drivers or anything else in the way: booting an x86-based machine from a custom MBR into real mode and working your way up from there). The more "other" code the target machine has running in the "background", the less reliable the benchmark results are going to be; running a benchmark on a server with high CPU load isn't going to produce very reliable results. Likewise, since on most modern systems the operating system and its kernel provide a layer of abstraction, kernel settings and parameters have some impact on the performance of processes. For example, process/thread priorities and scheduling parameters can have a very measurable impact on benchmark results. As a general rule, the less you have going on on the machine the better, so the ideal environment is to dedicate the hardware solely to the purpose of benchmarking.
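As a sketch of reducing that background noise, on Linux you could pin the benchmark process to one core and request a real-time scheduling class before the timed section. Both calls below are Linux-specific, and SCHED_FIFO typically requires root:

```c
/* Sketch of reducing OS interference on Linux: pin the benchmark
 * to one core and request a real-time scheduling class. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                     /* pin to core 0 */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    struct sched_param sp = { .sched_priority = 1 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");     /* usually fails without root */

    /* ... run the timed benchmark loop here ... */
    return 0;
}
```

Even then you are only reducing interference from the scheduler, not eliminating it: interrupts, frequency scaling and other cores still share the machine.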
Why:
What purpose is the benchmark trying to serve? Do you want something like a theoretical indication of how a given piece of hardware performs when running that specific algorithm or set of algorithms (without forgetting that a benchmark only benchmarks that particular implementation of that algorithm)? Or are you trying to mimic the "real world performance" of the hardware as closely as possible? It all depends, because obviously performance in matrix multiplication has little to do with performance in serving web pages or database queries.
Of course these points are very vague, but what I'd want people to keep in mind when they talk about benchmarks in terms of hardware performance is that it depends: a benchmark is merely an indicator and is always case specific. There is so much you can do when writing truly fast code that it's not even funny to compare benchmark results between different kinds of CPUs (say, a very tailored implementation of some arbitrary task, optimized for certain instruction set extensions, certain cache sizes, a certain branch misprediction penalty, a certain level of instruction-level parallelism and a certain memory access latency). It is all about taking advantage of the hardware's strengths and avoiding its weaknesses; the loop-order tweak sketched below is a tiny example.
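To illustrate how much a mere implementation detail can matter, here is a hypothetical drop-in replacement for the timed triple loop in the matrix multiplication sketch above. It performs exactly the same arithmetic on the same a, b and c arrays, only in a different loop order:

```c
/* Same algorithm, different implementation: the i-k-j loop order
 * turns the innermost accesses into sequential streams over b and c,
 * which is far friendlier to the cache than the i-j-k order above.
 * On many machines this variant reports noticeably higher GFLOPS
 * despite doing identical work. */
for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++) {
        double aik = a[i][k];
        for (int j = 0; j < N; j++)
            c[i][j] += aik * b[k][j];
    }
```

So which number is the CPU's "matrix multiplication score"? Both, and neither: each is a score for one implementation on one microarchitecture.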
tl;dr: It is hard to get benchmarking right (if that's even possible), you want to know some low-level concepts of the target hardware, and you have to be aware that everything is case specific. There can never be a universal benchmark.