[...] (granted, in the microsecond environment) [...]
Microseconds add up if we're looping over millions to billions of things. A personal VTune/micro-optimization session from C++ (no algorithmic improvements):
T-Rex (12.3 million facets):
Initial Time: 32.2372797 seconds
Multithreading: 7.4896073 seconds
4.9201039 seconds
4.6946372 seconds
3.261677 seconds
2.6988536 seconds
SIMD: 1.7831 seconds
4-valence patch optimization: 1.25007 seconds
0.978046 seconds
0.970057 seconds
0.911041 seconds
Every step besides "multithreading", "SIMD" (handwritten to beat the compiler), and the 4-valence patch optimization was a micro-level memory optimization. Also, the code starting from the initial 32-second time was already optimized quite a bit (theoretically optimal algorithmic complexity), and this was a recent session. The original version, long before this recent session, took over 5 minutes to process.
Optimizing memory efficiency often helps by anywhere from several times to orders of magnitude in a single-threaded context, and even more in multithreaded contexts (the benefits of an efficient memory rep often multiply with multiple threads in the mix).
On the Importance of Micro-Optimization
I get a little agitated by this idea that micro-optimizations are a waste of time. I agree that it's good general advice, since many people do it incorrectly, based on hunches and superstitions rather than measurements -- but not everyone does. Done correctly, it doesn't necessarily yield a micro-level impact. If we take Intel's own Embree (raytracing kernel), test only the simple scalar BVH they've written (not the ray-packet version, which is far harder to beat), and then try to beat the performance of that data structure, it can be a most humbling experience even for a veteran used to profiling and tuning code for decades. And it's all because of the micro-optimizations applied. Their solution can process over a hundred million rays per second, when I've seen industrial professionals working in raytracing who can't get more than 3 million rays per second on the same hardware (same bounding volume hierarchy, same logarithmic search).
There's no way to take a straightforward implementation of a BVH with only an algorithmic focus and get over a hundred million primary ray intersections per second out of it with any optimizing compiler (even Intel's own ICC). A straightforward one often doesn't even reach a million rays per second. It takes a professional-quality solution just to get a few million rays per second, and it takes Intel-level micro-optimization to get over a hundred million.
Algorithms
I think micro-optimization isn't important unless performance already matters at the level of minutes versus seconds, or hours versus minutes. If we take a horrific algorithm like bubble sort over a massive input and compare it to even a basic implementation of merge sort, the former might take months to process while the latter takes maybe 12 minutes, as a result of quadratic vs. linearithmic complexity.
The difference between months and minutes will make most people, even those not working in performance-critical fields, consider the execution time unacceptable; nobody wants to wait months for a result.
Meanwhile, if we compare the non-micro-optimized, straightforward merge sort to quicksort (which is not at all algorithmically superior to merge sort, and only offers micro-level improvements for locality of reference), the micro-optimized quicksort might finish in 15 seconds as opposed to 12 minutes. Making users wait 12 minutes might be perfectly acceptable (coffee break kind of time).
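If you want to feel both gaps at a small scale, here's a minimal, hypothetical Java sketch (class and method names are mine, not from the discussion above): bubble sort against a straightforward merge sort against `Arrays.sort`, which for `int[]` is exactly that kind of micro-optimized quicksort (dual-pivot). Bubble sort's time should roughly quadruple per doubling of n; the other two should roughly double, with the tuned library sort winning on constants.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch: quadratic vs. linearithmic vs. micro-optimized linearithmic.
public class SortGap {
    static void bubbleSort(int[] a) {
        for (int i = 0; i < a.length - 1; i++)
            for (int j = 0; j < a.length - 1 - i; j++)
                if (a[j] > a[j + 1]) { int t = a[j]; a[j] = a[j + 1]; a[j + 1] = t; }
    }

    // Textbook top-down merge sort over a[lo, hi), using tmp as scratch space.
    static void mergeSort(int[] a, int[] tmp, int lo, int hi) {
        if (hi - lo <= 1) return;
        int mid = (lo + hi) >>> 1;
        mergeSort(a, tmp, lo, mid);
        mergeSort(a, tmp, mid, hi);
        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        while (i < mid) tmp[k++] = a[i++];
        while (j < hi)  tmp[k++] = a[j++];
        System.arraycopy(tmp, lo, a, lo, hi - lo);
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        for (int n = 20_000; n <= 80_000; n *= 2) {
            int[] base = rng.ints(n).toArray();
            int[] a = base.clone(), b = base.clone(), c = base.clone();

            long t0 = System.nanoTime();
            bubbleSort(a);                       // O(n^2): ~4x per doubling of n
            long t1 = System.nanoTime();
            mergeSort(b, new int[n], 0, n);      // O(n log n), straightforward
            long t2 = System.nanoTime();
            Arrays.sort(c);                      // O(n log n), heavily tuned
            long t3 = System.nanoTime();

            System.out.printf("n=%6d  bubble=%8.1f ms  merge=%6.2f ms  Arrays.sort=%6.2f ms%n",
                    n, (t1 - t0) / 1e6, (t2 - t1) / 1e6, (t3 - t2) / 1e6);
        }
    }
}
```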
I think the difference between, say, 12 minutes and 15 seconds is probably negligible to most people, and that's why micro-optimization is often considered useless: it's often only the difference between minutes and seconds, not months and minutes. The other reason I think it's considered useless is that it's often applied to areas that don't matter: some little area that isn't even loopy or critical, yielding some questionable 1% difference (which may very well just be noise). But for people who care about these kinds of time differences and are willing to measure and do it right, I think it's worth paying attention to at least the basic concepts of the memory hierarchy (specifically the upper levels relating to page faults and cache misses).
Java Leaves Plenty of Room for Good Micro-Optimizations
Phew, sorry -- with that sort of rant aside:
Does the "magic" of the JVM hinder the influence a programmer has over
micro-optimisations in Java?
A little bit, but not as much as people might think if you do it right. For example, in image processing with native code, handwritten SIMD, multithreading, and memory optimizations (access patterns, and possibly even representation, depending on the algorithm), it's easy to crunch hundreds of millions of 32-bit RGBA pixels (8-bit color channels) per second, and sometimes even billions.
It's impossible to get anywhere close in Java if you, say, made a `Pixel` object (that alone would inflate the size of a pixel from 4 bytes to 16 on 64-bit systems). But you might be able to get a whole lot closer if you avoided the `Pixel` object, used an array of bytes, and modeled an `Image` object. Java is still pretty competent there if you start using arrays of plain old data. I've tried these kinds of things before in Java and was quite impressed, provided that you don't create a bunch of teeny little objects everywhere that are 4 times bigger than normal (ex: use `int` instead of `Integer`) and start modeling bulk interfaces like an `Image` interface, not a `Pixel` interface. I'd even venture to say that Java can rival C++ performance if you are looping over plain old data and not objects (huge arrays of `float`, e.g., not `Float`).
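As a concrete sketch of that shape (the `Image` class here is hypothetical, just illustrating the bulk-interface idea, not any particular library's API):

```java
// One contiguous byte[] of RGBA data behind an Image interface, instead of
// millions of 16+ byte Pixel objects scattered through the heap.
public final class Image {
    private final byte[] rgba;   // 4 bytes per pixel, tightly packed
    private final int width, height;

    public Image(int width, int height) {
        this.width = width;
        this.height = height;
        this.rgba = new byte[width * height * 4];
    }

    // Bulk operation over plain old data: no per-pixel object is ever created.
    public void scaleBrightness(float factor) {
        for (int i = 0; i < rgba.length; i++) {
            if ((i & 3) == 3) continue;                   // skip the alpha channel
            int channel = rgba[i] & 0xFF;                 // read as unsigned byte
            rgba[i] = (byte) Math.min(255, (int) (channel * factor));
        }
    }

    public int width()  { return width; }
    public int height() { return height; }
}
```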
Perhaps even more important than the memory sizes is that an array of `int` guarantees a contiguous representation; an array of `Integer` does not. Contiguity is often essential for locality of reference, since it means multiple elements (ex: 16 `int`s) can all fit into a single cache line and potentially be accessed together prior to eviction given efficient memory access patterns. Meanwhile a single `Integer` might be stranded somewhere in memory with the surrounding memory being irrelevant, only to have that region loaded into a cache line to use a single integer prior to eviction, as opposed to 16 integers. Even if we got marvelously lucky and the surrounding `Integer`s were all right next to each other in memory, we could only fit 4 into a cache line accessible prior to eviction, as a result of `Integer` being 4 times larger -- and that's the best-case scenario.
And there are plenty of micro-optimizations to be had there, since we're unified under the same memory architecture/hierarchy. Memory access patterns matter no matter what language you use. Concepts like loop tiling/blocking might be applied far more often in C or C++, but they benefit Java just as much.
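For instance, here's a hedged sketch of loop tiling in Java, assuming a square matrix stored row-major in a flat array; the tile size is a tunable guess, not a universal constant:

```java
// Loop tiling/blocking sketch: transposing a row-major n x n matrix.
// The naive version strides through dst column-wise (roughly a cache miss per
// element at large n); the tiled version keeps both src and dst accesses
// inside small blocks that fit in cache before eviction.
public final class Transpose {
    static final int TILE = 32;   // a guess to tune per machine, not a magic number

    static void naive(float[] src, float[] dst, int n) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                dst[j * n + i] = src[i * n + j];
    }

    static void tiled(float[] src, float[] dst, int n) {
        for (int i = 0; i < n; i += TILE)
            for (int j = 0; j < n; j += TILE)
                for (int ii = i, iMax = Math.min(i + TILE, n); ii < iMax; ii++)
                    for (int jj = j, jMax = Math.min(j + TILE, n); jj < jMax; jj++)
                        dst[jj * n + ii] = src[ii * n + jj];
    }
}
```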
> I recently read in C++ sometimes the ordering of the data members can provide optimizations [...]
The order of data members generally doesn't matter in Java, but that's mostly a good thing. In C and C++, preserving the order of data members is often important for ABI reasons, so compilers don't mess with it. Human developers working there have to be careful to do things like arrange their data members in descending order (largest to smallest) to avoid wasting memory on padding. With Java, apparently the JVM can reorder fields for you on the fly to ensure proper alignment while minimizing padding, so provided that's the case, it automates something that average C and C++ programmers often do poorly, wasting memory (and not just memory, but often speed, by needlessly increasing the stride between AoS structures and causing more cache misses). Rearranging fields to minimize padding is a very robotic thing, so ideally humans don't deal with it. The only time field arrangement might matter in a way that requires a human to know the optimal arrangement is when the object is bigger than 64 bytes and we're arranging fields based on access pattern rather than optimal padding -- in which case it may be a more human endeavor (it requires understanding critical paths, some of which involves information a compiler can't possibly anticipate without knowing what users will do with the software).
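If you want to verify what layout the JVM actually chose, OpenJDK's JOL tool can print it; this sketch assumes the `org.openjdk.jol:jol-core` dependency is on the classpath:

```java
// Inspecting the field layout the JVM actually chose, using OpenJDK's JOL
// tool (assumes org.openjdk.jol:jol-core is on the classpath).
import org.openjdk.jol.info.ClassLayout;

public class LayoutDemo {
    // Declared smallest-to-largest on purpose; the JVM is free to reorder
    // these fields to minimize padding, unlike a C/C++ compiler.
    static class Mixed {
        byte b;
        short s;
        int i;
        long l;
    }

    public static void main(String[] args) {
        // Prints offsets, sizes, and any padding for each field.
        System.out.println(ClassLayout.parseClass(Mixed.class).toPrintable());
    }
}
```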
> If not, could people give examples of what tricks you can use in Java (besides simple compiler flags).
The biggest difference to me in terms of optimizing mentality between Java and C++ is that C++ might allow you to use objects a little (teeny) bit more than Java in a performance-critical scenario. For example, C++ can wrap an integer in a class with no overhead whatsoever (benchmarked all over the place). Java has to carry that metadata pointer-style + alignment padding overhead per object, which is why `Boolean` is bigger than `boolean` (but in exchange it provides uniform benefits like reflection and the ability to override any method not marked `final` on every single UDT).
It's a little bit easier in C++ to control the contiguity of memory layouts across non-homogeneous fields (ex: interleaving floats and ints into one array through a struct/class), since spatial locality is often lost (or at least control is lost) in Java when allocating objects through the GC.
... but the highest-performance solutions will often split those up anyway and use an SoA access pattern over contiguous arrays of plain old data. So for the areas that need peak performance, the strategies to optimize memory layout between Java and C++ are often the same, and will often have you demolishing those teeny object-oriented interfaces in favor of collection-style interfaces that can do things like hot/cold field splitting, SoA reps, etc. Non-homogeneous AoSoA reps seem kind of impossible in Java (unless you just use a raw array of bytes or something like that), but those are for rare cases where both sequential and random access patterns need to be fast while simultaneously having a mixture of field types for hot fields. To me, the bulk of the difference in optimization strategy between these two languages is moot if you are reaching for peak performance.
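A minimal sketch of what hot/cold field splitting plus an SoA rep can look like in Java (the names here are hypothetical):

```java
// Hot/cold field splitting + SoA: fields touched every simulation step live
// in their own packed arrays; rarely-touched "cold" data is split out so it
// never pollutes the cache lines the hot loop streams through.
public final class Particles {
    // Hot: read/written every step, stored SoA-style in contiguous arrays.
    private final float[] px, py, vx, vy;
    // Cold: touched rarely (debugging, UI), kept out of the hot arrays.
    private final String[] debugNames;

    public Particles(int count) {
        px = new float[count]; py = new float[count];
        vx = new float[count]; vy = new float[count];
        debugNames = new String[count];
    }

    // The hot loop only streams the four hot arrays: contiguous, predictable.
    public void integrate(float dt) {
        for (int i = 0; i < px.length; i++) {
            px[i] += vx[i] * dt;
            py[i] += vy[i] * dt;
        }
    }
}
```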
The differences vary quite a bit more if you are simply reaching for "good" performance -- there, not being able to do as much with small objects like `Integer` vs. `int` can be a bit more of a PITA, especially with the way it interacts with generics. It's a bit harder to build one generic data structure as a central optimization target in Java that works for `int`, `float`, etc. while avoiding those bigger, expensive UDTs. But often the most performance-critical areas will require hand-rolling your own data structures tuned for a very specific purpose anyway, so it's only annoying for code that is striving for good performance but not peak performance.
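The hand-rolled version of that is usually tiny anyway; here's a minimal sketch of a growable `int` array that sidesteps `ArrayList<Integer>`'s per-element boxing (hypothetical and deliberately bare; real codebases often reach for existing primitive-collection libraries instead):

```java
// A growable array of primitive ints: contiguous storage, no boxing.
public final class IntList {
    private int[] data = new int[16];
    private int size;

    public void add(int value) {
        if (size == data.length)
            data = java.util.Arrays.copyOf(data, size * 2);  // amortized O(1) growth
        data[size++] = value;
    }

    public int get(int index) { return data[index]; }
    public int size() { return size; }
}
```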
Object Overhead
Note that Java object overhead (metadata, loss of spatial locality, and temporary loss of temporal locality after an initial GC cycle) is often big for things that are really small (like `int` vs. `Integer`) and stored by the millions in some data structure that's largely contiguous and accessed in very tight loops. There seems to be a lot of sensitivity about this subject, so I should clarify that you don't want to worry about object overhead for big objects like images, just really minuscule objects like a single pixel.
If anyone feels doubtful about this part, I'd suggest writing a benchmark that sums a million random `int`s vs. a million random `Integer`s, and running it repeatedly (the `Integer`s will reshuffle in memory after an initial GC cycle).
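A crude sketch of such a benchmark (no JMH harness here, so warmup, JIT, and GC timing add noise; treat the numbers as illustrative only):

```java
import java.util.Random;

// Summing a million random ints vs. a million random Integers, repeatedly.
public class SumBench {
    public static void main(String[] args) {
        final int N = 1_000_000;
        Random rng = new Random(42);

        int[] primitives = new int[N];
        Integer[] boxed = new Integer[N];
        for (int i = 0; i < N; i++) {
            primitives[i] = rng.nextInt();
            boxed[i] = rng.nextInt();   // each one is a separate heap object
        }

        for (int pass = 0; pass < 5; pass++) {
            long t0 = System.nanoTime();
            long sum1 = 0;
            for (int v : primitives) sum1 += v;   // contiguous, cache-friendly
            long t1 = System.nanoTime();
            long sum2 = 0;
            for (Integer v : boxed) sum2 += v;    // a pointer chase per element
            long t2 = System.nanoTime();

            System.out.printf("pass %d: int[]=%6.2f ms  Integer[]=%6.2f ms  (%d, %d)%n",
                    pass, (t1 - t0) / 1e6, (t2 - t1) / 1e6, sum1, sum2);
        }
    }
}
```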
Ultimate Trick: Interface Designs That Leave Room to Optimize
So the ultimate Java trick, as I see it, if you are dealing with a place that handles a heavy load over small objects (ex: a `Pixel`, a 4-vector, a 4x4 matrix, a `Particle`, possibly even an `Account` if it only has a few small fields), is to avoid using objects for these teeny things and to use arrays (possibly chained together) of plain old data. The objects then become collection interfaces like `Image`, `ParticleSystem`, `Accounts`, a collection of matrices or vectors, etc., with individual elements accessed by index. This is also one of the ultimate design tricks in C and C++, since even without Java's object overhead and disjointed memory, modeling the interface at the level of a single particle prevents the most efficient solutions.
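To make the contrast concrete, here's a hedged sketch (hypothetical names) of the per-object design versus the collection design:

```java
// The "teeny object" design: millions of these, each a separate heap object.
class Account {
    long id;
    long balanceCents;
    void deposit(long cents) { balanceCents += cents; }
}

// The collection design: one object, contiguous parallel arrays inside.
final class Accounts {
    private long[] ids = new long[0];
    private long[] balanceCents = new long[0];
    private int count;

    public int add(long id) {
        if (count == ids.length) {                 // grow both arrays together
            int cap = Math.max(16, count * 2);
            ids = java.util.Arrays.copyOf(ids, cap);
            balanceCents = java.util.Arrays.copyOf(balanceCents, cap);
        }
        ids[count] = id;
        return count++;                            // callers hold an index, not an object
    }

    public void deposit(int index, long cents) { balanceCents[index] += cents; }

    // Bulk operations over contiguous data are where this design pays off.
    public long totalBalanceCents() {
        long sum = 0;
        for (int i = 0; i < count; i++) sum += balanceCents[i];
        return sum;
    }
}
```

Callers hold indices instead of object references, which is exactly what frees the representation to stay contiguous (or to be split, tiled, or rearranged later without touching the interface).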