11

Does the "magic" of the JVM hinder the influence a programmer has over micro-optimisations in Java? I recently read that in C++, the ordering of data members can sometimes provide optimizations (granted, in the microsecond environment), and I presumed a programmer's hands are tied when it comes to squeezing performance from Java?

I appreciate that a decent algorithm provides the greater speed gains, but once you have the correct algorithm, is Java harder to tweak because of the JVM's control?

If not, could people give examples of what tricks you can use in Java (besides simple compiler flags)?

user997112
  • The basic principle behind all Java optimization is this: The JVM has probably already done it better than you can. Optimization mostly involves following sensible programming practices and avoiding the usual things like concatenating strings in a loop. – Robert Harvey Nov 20 '12 at 23:40
  • The principle of micro-optimization in _all_ languages is that the compiler has already done it better than you can. The other principle of micro-optimization in all languages is that throwing more hardware at the problem is cheaper than a programmer's time spent micro-optimizing it. Programmers have to tend to scaling problems (suboptimal algorithms), but micro-optimization is a waste of time. Sometimes micro-optimization makes sense on embedded systems, where you can't throw more hardware at the problem, but Android using Java, and a rather poor implementation of it, shows most of them have enough hardware already. – Jan Hudec Nov 21 '12 at 09:54
  • for "Java performance tricks", worth studying are: [Effective Java](http://programmers.stackexchange.com/a/141149/31260 "as explained here"), [Angelika Langer Links - Java Performance](http://www.angelikalanger.com/Resources/Links/JavaPerformance.htm) and the performance-related articles by Brian Goetz in the _Java theory and practice_ and _Threading Lightly_ series listed [here](http://programmers.stackexchange.com/a/141159/31260) – gnat Nov 21 '12 at 10:06
  • Be extremely careful about tips and tricks - the JVM, operating systems and hardware move on - you're best off learning the performance tuning methodology and applying enhancements for *your* particular environment :-) – Martijn Verburg Nov 21 '12 at 11:10
  • In some cases, a VM can make optimizations at run time that are impractical to make at compile-time. Using managed memory can improve performance, though it will also often have a higher memory footprint. Unused memory is freed when convenient, rather than ASAP. – Brian Dec 06 '12 at 17:13
  • basically, understanding how the JVM looks at code is a great start... several books touch on it: how javac processes certain code and how it's optimized at run time. gnat's suggestion is a great place to start. – Rig Dec 08 '12 at 18:18
  • Does vectorization count as micro-optimization? C++ compilers are pretty bad, and JITs usually don't even bother to try. – Sebastian Redl Jan 03 '16 at 10:36

8 Answers

29

Micro-optimizations are almost never worth the time, and almost all the easy ones are done automatically by compilers and runtimes.

There is, however, one important area of optimization where C++ and Java are fundamentally different, and that is bulk memory access. C++ has manual memory management, which means you can optimize the application's data layout and access patterns to make full use of caches. This is quite hard, somewhat specific to the hardware you're running on (so performance gains may disappear on different hardware), but if done right, it can lead to absolutely breathtaking performance. Of course you pay for it with the potential for all kinds of horrible bugs.

With a garbage-collected language like Java, this kind of optimization cannot be done in your code. Some of it can be done by the runtime (automatically or through configuration, see below), and some of it is simply not possible (the price you pay for being protected from memory management bugs).

If not, could people give examples of what tricks you can use in Java (besides simple compiler flags)?

Compiler flags are irrelevant in Java because the Java compiler does almost no optimization; the runtime does.

And indeed Java runtimes have a multitude of parameters that can be tweaked, especially concerning the garbage collector. There's nothing "simple" about those options - the defaults are good for most applications, and getting better performance requires you to understand exactly what the options do and how your application behaves.
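
For illustration, here are a few commonly used HotSpot options (a sketch only; the available flags vary by JVM version and vendor, and the right values can only come from measuring *your* application):

    # Illustrative invocation; the flags are standard HotSpot options, but
    # the values and the choice of collector are placeholders, not advice.
    # Fix the heap size to avoid resize pauses, pick a collector suited to
    # the workload, and log GC activity so you can verify the effect.
    java -Xms2g -Xmx2g \
         -XX:+UseG1GC \
         -XX:MaxGCPauseMillis=50 \
         -XX:+PrintGCDetails \
         com.example.MyApp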

Michael Borgwardt
  • +1: basically what I was writing in my answer, maybe better formulation. – Klaim Nov 21 '12 at 10:05
  • +1: Very good points, explained in a very concise way: "This is quite hard ... but if done right, it can lead to absolutely breathtaking performance. Of course you pay for it with the potential for all kinds of horrible bugs." – Giorgio Dec 06 '12 at 17:09
  • Rico Mariani has an interesting analysis [here](http://blogs.msdn.com/b/ricom/archive/2005/05/10/416151.aspx) of Raymond Chen's series on optimizing a C++ program. In part 6 of that series, Raymond Chen hand-tunes his application's memory management, yielding massive performance gains. – Brian Dec 06 '12 at 17:10
  • "Of course you pay for it with the potential for all kinds of horrible bugs" - are you referring to the pure Java=Safe vs. C++=Unsafe issue, or do you have someting more specific in mind? – Martin Ba Dec 06 '12 at 18:24
  • @MartinBa: It's more that you pay for optimizing memory management. If you don't try to optimize memory management, C++ memory management isn't that difficult (avoid it entirely via STL or make it relatively easy using RAII). Of course, implementing RAII in C++ takes more lines of code than doing nothing in Java (i.e., because Java handles it for you). – Brian Dec 06 '12 at 18:59
  • @Martin Ba: Basically yes. Dangling pointers, buffer overflows, uninitialized pointers, errors in pointer arithmetic, all things that simply don't exist without manual memory management. And optimizing memory access pretty much requires you to do *a lot* of manual memory management. – Michael Borgwardt Dec 06 '12 at 22:40
  • I would disagree that optimizing memory access in C++ requires manual memory management in the sense of C-style manual memory management. – Martin Ba Dec 07 '12 at 09:13
  • @Martin Ba: maybe that should be a separate question. – Michael Borgwardt Dec 07 '12 at 09:19
  • There are a couple of things you can do in Java. One is object pooling, which maximizes the chances of memory locality of objects (unlike C++, where you can guarantee memory locality). – RokL Dec 07 '12 at 15:16
  • @UMad: Not unlike C++. You can write a GC in C++ if you wanted to and allocate your objects in it. Just. Like. Java. – Thomas Eding Dec 14 '12 at 01:17
5

Sure, at the micro-optimization level the JVM will do some things over which you'll have little control, compared especially to C and C++.

On the other hand, the variety of compiler behaviors with C and C++ especially will have a far greater negative impact on your ability to do micro-optimizations in any sort of vaguely portable way (even across compiler revisions).

It depends on what sort of project you're tweaking, what environments you're targeting and so on. And in the end, it doesn't really matter since you're getting a few orders of magnitude better results from algorithmic/data structure/program design optimizations anyway.

Telastyn
5

[...] (granted, in the microsecond environment) [...]

Microseconds add up if we're looping over millions to billions of things. A personal VTune/micro-optimization session from C++ (no algorithmic improvements):

T-Rex (12.3 million facets):
Initial Time: 32.2372797 seconds
Multithreading: 7.4896073 seconds
4.9201039 seconds
4.6946372 seconds
3.261677 seconds
2.6988536 seconds
SIMD: 1.7831 seconds
4-valence patch optimization: 1.25007 seconds
0.978046 seconds
0.970057 seconds
0.911041 seconds

Everything besides "multithreading", "SIMD" (handwritten to beat the compiler), and the 4-valence patch optimization were micro-level memory optimizations. Also, the code producing that initial time of 32 seconds was already optimized quite a bit (theoretically optimal algorithmic complexity), and this is from a recent session. The original version, long before this recent session, took over 5 minutes to process.

Optimizing memory efficiency can often help by anywhere from several times to orders of magnitude in a single-threaded context, and more in multithreaded contexts (the benefits of an efficient memory representation often multiply when multiple threads are in the mix).

On the Importance of Micro-Optimization

I get a little agitated by the idea that micro-optimizations are a waste of time. I agree that it's good general advice, but not everyone does them incorrectly, based on hunches and superstitions rather than measurements. Done correctly, they don't necessarily yield a merely micro-level impact. If we take Intel's own Embree (raytracing kernel), test only the simple scalar BVH they've written (not the ray packet version, which is exponentially harder to beat), and then try to beat the performance of that data structure, it can be a most humbling experience even for a veteran used to profiling and tuning code for decades. And it's all because of the micro-optimizations applied. Their solution can process over a hundred million rays per second when I've seen industrial professionals working in raytracing who can't get even 3 million rays per second on the same hardware (same bounding volume hierarchy, same logarithmic search).

There's no way to take a straightforward implementation of a BVH with only an algorithmic focus and get over a hundred million primary ray intersections per second out of it with any optimizing compiler (even Intel's own ICC). A straightforward one often doesn't even get a million rays per second. It takes professional-quality solutions to even get a few million rays per second, and Intel-level micro-optimization to get over a hundred million.

Algorithms

I think micro-optimization only matters when the performance differences you care about are at the level of minutes versus seconds, or hours versus minutes. If we take a horrific algorithm like bubble sort, use it on a massive input, and compare it to even a basic implementation of merge sort, the former might take months to process and the latter maybe 12 minutes, as a result of quadratic vs. linearithmic complexity.

The difference between months and minutes is probably going to make most people, even those not working in performance-critical fields, consider the execution time unacceptable if it requires users to wait months for a result.

Meanwhile, if we compare the non-micro-optimized, straightforward merge sort to quicksort (which is not at all algorithmically superior to merge sort and only offers micro-level improvements in locality of reference), the micro-optimized quicksort might finish in 15 seconds as opposed to 12 minutes. Making users wait 12 minutes might be perfectly acceptable (coffee-break kind of time).

I think the difference between, say, 12 minutes and 15 seconds is probably negligible to most people, and that's why micro-optimization is often considered useless: it's often only the difference between minutes and seconds, not minutes and months. The other reason I think it's considered useless is that it's often applied to areas that don't matter: some little area that isn't even loopy and critical and yields a questionable 1% difference (which may very well just be noise). But for people who care about these types of time differences and are willing to measure and do it right, I think it's worth paying attention to at least the basic concepts of the memory hierarchy (specifically the upper levels, relating to page faults and cache misses).

Java Leaves Plenty of Room for Good Micro-Optimizations

Phew, sorry -- with that sort of rant aside:

Does the "magic" of the JVM hinder the influence a programmer has over micro-optimisations in Java?

A little bit, but not as much as people might think if you do it right. For example, if you are doing image processing in native code with handwritten SIMD, multithreading, and memory optimizations (access patterns and possibly even representation, depending on the image processing algorithm), it's easy to crunch hundreds of millions of pixels per second for 32-bit RGBA pixels (8-bit color channels) and sometimes even billions per second.

It's impossible to get anywhere close in Java if you, say, make a Pixel object (that alone would inflate the size of a pixel from 4 bytes to 16 bytes on 64-bit JVMs).

But you might be able to get a whole lot closer if you avoid the Pixel object, use an array of bytes, and model an Image object. Java is still pretty competent there if you start using arrays of plain old data. I've tried these kinds of things before in Java and was quite impressed, provided that you don't create a bunch of teeny little objects everywhere that are 4 times bigger than normal (e.g., use int instead of Integer) and start modeling bulk interfaces like an Image interface, not a Pixel interface. I'd even venture to say that Java can rival C++ performance if you are looping over plain old data and not objects (huge arrays of float, e.g., not Float).
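
A minimal sketch of that bulk-interface idea (the class and method names here are hypothetical, not from any standard API): one contiguous array of packed pixels behind an Image interface, instead of millions of Pixel objects.

    // Sketch: model the collection (Image), not the element (Pixel).
    // One contiguous int[] holds packed 32-bit RGBA pixels.
    public final class Image {
        private final int width, height;
        private final int[] pixels; // packed RGBA, one int per pixel

        public Image(int width, int height) {
            this.width = width;
            this.height = height;
            this.pixels = new int[width * height];
        }

        public int getPixel(int x, int y)            { return pixels[y * width + x]; }
        public void setPixel(int x, int y, int rgba) { pixels[y * width + x] = rgba; }

        // Bulk operations run tight loops over the contiguous array.
        public void fill(int rgba) {
            java.util.Arrays.fill(pixels, rgba);
        }
    }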

Perhaps even more important than the memory sizes is that an array of int guarantees a contiguous representation; an array of Integer does not. Contiguity is often essential for locality of reference, since it means multiple elements (ex: 16 ints) can all fit into a single cache line and potentially be accessed together prior to eviction with efficient memory access patterns. Meanwhile, a single Integer might be stranded somewhere in memory with the surrounding memory being irrelevant, so that region of memory gets loaded into a cache line only to use a single integer prior to eviction. Even if we got marvelously lucky and surrounding Integers were all right next to each other in memory, we could only fit 4 into a cache line to be accessed prior to eviction, as a result of Integer being 4 times larger -- and that's the best-case scenario.

And there are plenty of micro-optimizations to be had there, since we're unified under the same memory architecture/hierarchy. Memory access patterns matter no matter what language you use; concepts like loop tiling/blocking might be applied far more often in C or C++, but they benefit Java just as much.
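
As a small illustration, a tiled matrix transpose looks essentially the same in Java as in C or C++ (a sketch; the tile size is a placeholder to be tuned by measurement):

    // Sketch: loop tiling/blocking in Java. Transposing in small tiles keeps
    // both source and destination accesses within cache-resident regions
    // instead of striding across the whole matrix on every write.
    public final class Transpose {
        static final int BLOCK = 32; // placeholder tile size; tune by measuring

        static void transposeTiled(float[] src, float[] dst, int n) {
            for (int i = 0; i < n; i += BLOCK)
                for (int j = 0; j < n; j += BLOCK)
                    for (int ii = i; ii < Math.min(i + BLOCK, n); ii++)
                        for (int jj = j; jj < Math.min(j + BLOCK, n); jj++)
                            dst[jj * n + ii] = src[ii * n + jj];
        }
    }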

I recently read that in C++, the ordering of data members can sometimes provide optimizations [...]

The order of data members generally doesn't matter in Java, but that's mostly a good thing. In C and C++, preserving the order of data members is often important for ABI reasons, so compilers don't mess with it. Human developers working there have to be careful to do things like arrange their data members in descending order (largest to smallest) to avoid wasting memory on padding. With Java, apparently the JIT can reorder the members for you on the fly to ensure proper alignment while minimizing padding, so provided that's the case, it automates something that average C and C++ programmers can often do poorly, wasting memory (which isn't just a memory waste, but often a speed waste, by needlessly increasing the stride between AoS structures and causing more cache misses). Rearranging fields to minimize padding is a very robotic thing, so ideally humans don't deal with it. The only time field arrangement might matter in a way that requires a human to know the optimal arrangement is if the object is bigger than 64 bytes and we're arranging fields based on access pattern (not optimal padding) -- in which case it might be a more human endeavor (it requires understanding critical paths, some of which is information a compiler can't possibly anticipate without knowing what users will do with the software).

If not, could people give examples of what tricks you can use in Java (besides simple compiler flags)?

The biggest difference to me in terms of an optimizing mentality between Java and C++ is that C++ might allow you to use objects a little (teeny) bit more than Java in a performance-critical scenario. For example, C++ can wrap an integer in a class with no overhead whatsoever (benchmarked all over the place). Java has to have that metadata pointer-style-plus-alignment-padding overhead per object, which is why Boolean is bigger than boolean (but in exchange it provides uniform benefits of reflection and the ability to override any function not marked final for every single UDT).

It's a little bit easier in C++ to control the contiguity of memory layouts across non-homogeneous fields (ex: interleaving floats and ints into one array through a struct/class), since spatial locality is often lost (or at least control over it is lost) in Java when allocating objects through the GC.

... but often the highest-performance solutions will split those up anyway and use an SoA access pattern over contiguous arrays of plain old data. So for the areas that need peak performance, the strategies for optimizing memory layout between Java and C++ are often the same, and will often have you demolishing those teeny object-oriented interfaces in favor of collection-style interfaces that can do things like hot/cold field splitting, SoA reps, etc. Non-homogeneous AoSoA reps seem kind of impossible in Java (unless you just use a raw array of bytes or something like that), but those are for rare cases where both sequential and random access patterns need to be fast while simultaneously having a mixture of field types for hot fields. To me, the bulk of the difference in optimization strategy (at this general kind of level) between these two is moot if you are reaching for peak performance.

The differences vary quite a bit more if you are simply reaching for "good" performance -- there, not being able to do as much with small objects like Integer vs. int can be a bit more of a PITA, especially with the way it interacts with generics. It's a bit harder to build one generic data structure as a central optimization target in Java that works for int, float, etc. while avoiding those bigger and more expensive UDTs, but often the most performance-critical areas will require hand-rolling your own data structures tuned for a very specific purpose anyway, so it's only annoying for code that strives for good performance but not peak performance.

Object Overhead

Note that Java object overhead (metadata, loss of spatial locality, and temporary loss of temporal locality after an initial GC cycle) is often big for things that are really small (like int vs. Integer) and stored by the millions in some data structure that's largely contiguous and accessed in very tight loops. There seems to be a lot of sensitivity about this subject, so I should clarify that you don't want to worry about object overhead for big objects like images, just really minuscule objects like a single pixel.

If anyone feels doubtful about this part, I'd suggest making a benchmark between summing up a million random ints vs. a million random Integers, and doing this repeatedly (the Integers will reshuffle in memory after an initial GC cycle).
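
A crude sketch of that benchmark (for anything serious, use a real harness and watch out for dead-code elimination; this only shows the shape of the test):

    // Sketch: sum a million random ints vs. a million random Integers,
    // repeatedly. The sums are printed so the JIT can't eliminate the loops.
    import java.util.Random;

    public class BoxingBench {
        public static void main(String[] args) {
            int n = 1_000_000;
            Random rnd = new Random(42);
            int[] primitives = new int[n];
            Integer[] boxed = new Integer[n];
            for (int i = 0; i < n; i++) {
                int v = rnd.nextInt();
                primitives[i] = v;
                boxed[i] = v; // autoboxing: one heap object per element
            }
            for (int round = 0; round < 10; round++) {
                long t0 = System.nanoTime();
                long sum = 0;
                for (int i = 0; i < n; i++) sum += primitives[i];
                long t1 = System.nanoTime();
                long sum2 = 0;
                for (int i = 0; i < n; i++) sum2 += boxed[i]; // unbox + pointer chase
                long t2 = System.nanoTime();
                System.out.printf("int[]: %d us, Integer[]: %d us (sums %d/%d)%n",
                        (t1 - t0) / 1000, (t2 - t1) / 1000, sum, sum2);
            }
        }
    }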

Ultimate Trick: Interface Designs That Leave Room to Optimize

So the ultimate Java trick, as I see it, if you are dealing with a place that handles a heavy load over small objects (ex: a Pixel, a 4-vector, a 4x4 matrix, a Particle, possibly even an Account if it only has a few small fields) is to avoid using objects for these teeny things and use arrays (possibly chained together) of plain old data. The objects then become collection interfaces like Image, ParticleSystem, Accounts, a collection of matrices or vectors, etc. Individual ones can be accessed by index, e.g. This is also one of the ultimate design tricks in C and C++, since even without the basic object overhead and disjointed memory, modeling the interface at the level of a single particle prevents the most efficient solutions.
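
A sketch of what that looks like for the particle case (hypothetical names; an SoA layout over parallel arrays of plain old data, with the interface modeled at the collection level):

    // Sketch: ParticleSystem is the object; particles are just indices into
    // parallel float arrays (SoA), not millions of tiny Particle objects.
    public final class ParticleSystem {
        private final float[] posX, posY, velX, velY;
        private int count;

        public ParticleSystem(int capacity) {
            posX = new float[capacity];
            posY = new float[capacity];
            velX = new float[capacity];
            velY = new float[capacity];
        }

        public int add(float x, float y, float vx, float vy) {
            posX[count] = x;  posY[count] = y;
            velX[count] = vx; velY[count] = vy;
            return count++; // the index is the "handle" to the particle
        }

        // Bulk update: a tight loop over contiguous primitives.
        public void step(float dt) {
            for (int i = 0; i < count; i++) {
                posX[i] += velX[i] * dt;
                posY[i] += velY[i] * dt;
            }
        }
    }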

ChrisF
  • Considering that bad performance in the bulk might actually have a decent chance of overwhelming peak performance in the critical areas, I don't think one can completely disregard the advantage of having good performance easily. And the trick of turning an array of structs into a struct of arrays somewhat breaks down when all (or nearly all) values comprising one of the original structs will be accessed at the same time. BTW: I see you are unearthing lots of oldish posts and adding your own good answer, sometimes even the good answer ;-) – Deduplicator Jan 03 '16 at 02:36
  • @Deduplicator Hope I'm not annoying people by bumping too much! This one got a little teeny bit ranty -- maybe I should improve it a bit. SoA vs. AoS is often a tough one for me (sequential vs. random access). I rarely know upfront which one I should use since there's often a mixture of sequential and random access in my case. The valuable lesson I often learned is to design interfaces that leave enough room to play with the data representation -- kinda bulkier interfaces that have big transform algorithms when possible (sometimes not possible with teeny bits randomly-accessed here and there). –  Jan 03 '16 at 02:44
  • Well, I only noticed because things are really slow. And I took my time with each one. – Deduplicator Jan 03 '16 at 02:52
  • I really wonder why `user204677` went away. Such a great answer. – oligofren Aug 27 '19 at 13:03
3

There's a middle area between micro-optimization, on the one hand, and good choice of algorithm, on the other.

It is the area of constant-factor speedups, and it can yield orders of magnitude.
The way it does so is by lopping off entire fractions of the execution time, like first 30%, then 20% of what's left, then 50% of that, and so on for several iterations, until there's hardly anything left.

You don't see this in little demo-style programs. Where you see it is in big serious programs with lots of class data structures, where the call stack is typically many layers deep. A good way to find the speedup opportunities is by examining random-time samples of the program's state.

Generally the speedups consist of things like:

  • minimizing calls to new by pooling and re-using old objects (see the sketch after this list),

  • recognizing things being done that are sort of in there for generality's sake, rather than actually being necessary,

  • revising the data structure by using different collection classes that have the same big-O behavior but take advantage of the accessing patterns actually used,

  • saving data that's been acquired by function calls instead of re-calling the function (it is a natural and amusing tendency of programmers to assume that functions with shorter names execute faster),

  • tolerating a certain amount of inconsistency between redundant data structures, as opposed to trying to keep them entirely consistent with notification events,

  • etc. etc.

But of course none of these things should be done without first being shown to be problems by taking samples.
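
For the pooling bullet, a minimal sketch of the idea (a hypothetical Pool class; as stated above, only worth doing where samples show allocation to be the actual cost):

    // Sketch: reuse instances instead of calling new in a hot path.
    import java.util.ArrayDeque;

    final class Pool<T> {
        interface Factory<T> { T create(); }

        private final ArrayDeque<T> free = new ArrayDeque<T>();
        private final Factory<T> factory;

        Pool(Factory<T> factory) { this.factory = factory; }

        T acquire() {
            T obj = free.poll();
            return (obj != null) ? obj : factory.create();
        }

        void release(T obj) {
            free.push(obj); // caller must reset the object's state first
        }
    }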

Mike Dunlavey
2

This question is very hard to answer, because it depends on language implementations.

In general there's very little room for such "micro-optimizations" these days. The main reason is that compilers take advantage of such optimizations during compilation. For example, there's no performance difference between pre-increment and post-increment operators in situations where their semantics are identical. Another example would be a loop like `for(int i=0; i<vec.size(); i++)`, where one could argue that instead of calling the `size()` member function during each iteration, it would be better to get the size of the vector before the loop and then compare against that single variable, thus avoiding a function call per iteration. However, there are cases in which a compiler will detect this silly case and cache the result. But this is only possible when the function has no side effects and the compiler can be sure that the vector size remains constant during the loop, so it applies merely to fairly trivial cases.
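
In Java terms, the two variants look like this (a sketch; as a comment below points out, hand-hoisting the bound can even backfire by defeating bounds-check elimination, so measure before assuming anything):

    import java.util.List;

    public class HoistExample {
        // size() written in the loop condition; JITs can often hoist it
        // themselves when the list provably doesn't change.
        static long sumPlain(List<Integer> list) {
            long sum = 0;
            for (int i = 0; i < list.size(); i++)
                sum += list.get(i);
            return sum;
        }

        // Hand-hoisted size; not necessarily faster in practice.
        static long sumHoisted(List<Integer> list) {
            long sum = 0;
            final int n = list.size();
            for (int i = 0; i < n; i++)
                sum += list.get(i);
            return sum;
        }
    }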

zxcdw
  • As for the second case, I don't think the compiler can optimize it in the foreseeable future. Detecting that it's safe to optimize `vec.size()` depends on proving that the size of the vector/list does not change inside the loop, which I believe is undecidable due to the halting problem. – Lie Ryan Nov 21 '12 at 07:00
  • @LieRyan I've seen multiple (simple) cases in which the compiler generated an exactly identical binary whether the result was manually "cached" or `size()` was called. I wrote some code and it turns out the behavior is highly dependent on the way the program operates. There are cases in which the compiler can guarantee that there's no possibility of the vector size changing during the loop, and then there are cases in which it can't, very much akin to the halting problem as you mentioned. For now I'm unable to verify my claim (C++ disassembly is a pain), so I edited the answer – zxcdw Nov 21 '12 at 09:33
  • @Lie Ryan: a lot of things that are undecidable in the general case are perfectly decidable for specific but common cases, and that's really all you need here. – Michael Borgwardt Nov 21 '12 at 10:12
  • @LieRyan If you only call `const` methods on this vector, I'm pretty sure many optimizing compilers will figure it out. – K.Steff Dec 06 '12 at 17:15
  • in C#, and I think I read in Java also, if you don't cache size the compiler knows it can remove the checks to see if you are going outside the array bounds, and if you do cache size it has to do the checks, which generally cost more than you are saving by caching. Trying to outsmart optimizers is rarely a good plan. – Kate Gregory Dec 07 '12 at 13:57
2

Java (as far as I'm aware) gives you no control over variable locations in memory, so you have a harder time avoiding things like false sharing and misalignment of variables (you can pad out a class with several unused members). Another thing I don't think you can take advantage of is instructions such as x86's `pause` (`_mm_pause` in C), but these things are CPU-specific, so if you figure you need them, Java may not be the language to use.
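
One partial workaround for false sharing is the padding trick alluded to above; a sketch (the JVM is free to lay out fields however it likes, so this is a heuristic, not a guarantee; it mirrors the inheritance-based padding used in code like the Disruptor):

    // Sketch: pad a hot field so it is unlikely to share a 64-byte cache
    // line with unrelated data written by other threads (false sharing).
    class Padding {
        protected long p1, p2, p3, p4, p5, p6, p7; // 56 bytes of padding
    }

    class PaddedCounter extends Padding {
        protected volatile long value; // the hot, contended field

        long incrementNonAtomically() {
            return ++value; // illustrates layout only; not an atomic increment
        }
    }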

There is also the `sun.misc.Unsafe` class, which gives you some of the flexibility of C/C++, but also the danger of C/C++.
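
For example, a minimal off-heap allocation through `sun.misc.Unsafe` (a sketch; the reflection dance is needed because the class isn't meant for application code, and misuse crashes the VM rather than throwing an exception):

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    public class OffHeapExample {
        public static void main(String[] args) throws Exception {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe) f.get(null);

            long addr = unsafe.allocateMemory(4 * 1024); // room for 1024 ints, off-heap
            try {
                unsafe.putInt(addr, 42);                 // no bounds checking!
                System.out.println(unsafe.getInt(addr)); // prints 42
            } finally {
                unsafe.freeMemory(addr);                 // manual, like C's free()
            }
        }
    }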

It might help you to look at the assembly code the JVM generates for your code.

To read about a Java app that looks at this kind of detail, see the Disruptor code released by LMAX.

James
1

could people give examples of what tricks you can use in Java (besides simple compiler flags)?

Other than improvements to algorithms, be sure to consider the memory hierarchy and how the processor makes use of it. There are big benefits in reducing memory access latencies once you understand how the language in question allocates memory to its data types and objects.

Java example: accessing a 1000x1000 array of ints

Consider the sample code below - it accesses the same area of memory (a 1000x1000 array of ints), but in a different order. On my Mac mini (Core i7, 2.7 GHz) the output is as follows, showing that traversing the array by rows more than doubles the performance (average over 100 rounds each).

Processing columns by rows*** took 4 ms (avg)
Processing rows by columns*** took 10 ms (avg) 

This is because the array is stored such that the elements of each row (consecutive column indices) are adjacent in memory, whereas consecutive rows are not. For the processor to actually use the data, it has to be transferred to its caches. Memory is transferred in blocks of bytes called cache lines - loading a cache line directly from memory introduces latency and thus decreases a program's performance.

For the Core i7 (Sandy Bridge), a cache line holds 64 bytes, so each memory access retrieves 64 bytes. Because the first test accesses memory in a predictable sequence, the processor will prefetch data before it is actually consumed by the program. Overall, this results in less latency on memory accesses and thus improves performance.

Sample code:

    package test;

    public class PerfTest {
      public static void main(String[] args) {
        int[][] numbers = new int[1000][1000];
        long startTime;
        long stopTime;
        long elapsedAvg;
        int tries;
        int maxTries = 100;
        long sink = 0; // accumulate into a sink so the JIT can't drop the loops

        // process columns by rows (row-major traversal: contiguous accesses)
        System.out.print("Processing columns by rows");
        for (tries = 0, elapsedAvg = 0; tries < maxTries; tries++) {
          startTime = System.currentTimeMillis();
          for (int r = 0; r < 1000; r++) {
            for (int c = 0; c < 1000; c++) {
              sink += numbers[r][c];
            }
          }
          stopTime = System.currentTimeMillis();
          elapsedAvg += ((stopTime - startTime) - elapsedAvg) / (tries + 1); // running average
        }

        System.out.format("*** took %d ms (avg)\n", elapsedAvg);

        // process rows by columns (column-major traversal: strided accesses)
        System.out.print("Processing rows by columns");
        for (tries = 0, elapsedAvg = 0; tries < maxTries; tries++) {
          startTime = System.currentTimeMillis();
          for (int c = 0; c < 1000; c++) {
            for (int r = 0; r < 1000; r++) {
              sink += numbers[r][c];
            }
          }
          stopTime = System.currentTimeMillis();
          elapsedAvg += ((stopTime - startTime) - elapsedAvg) / (tries + 1);
        }

        System.out.format("*** took %d ms (avg)\n", elapsedAvg);
        System.out.println("(sink = " + sink + ")"); // keep the result live
      }
    }
miraculixx
1

The JVM can and often does interfere, and the JIT compiler can change significantly between versions. Some micro-optimisations are impossible in Java due to language limitations, such as being hyper-threading friendly or using the latest Intel processors' SIMD extensions.

A highly informative blog on the topic from one of the Disruptor authors is recommended reading.

One always has to ask why bother using Java if you want micro-optimisations; there are many alternative methods for accelerating a function, such as using JNA or JNI to call into a native library.
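
For example, a minimal JNA binding might look like this (a sketch; it assumes the JNA library on the classpath and a POSIX libc, and `getpid` is just a stand-in for whatever hot function you would move to native code):

    import com.sun.jna.Library;
    import com.sun.jna.Native;

    public class JnaExample {
        // Bind the C library; each method maps to a native function by name.
        public interface CLib extends Library {
            CLib INSTANCE = (CLib) Native.loadLibrary("c", CLib.class);
            int getpid();
        }

        public static void main(String[] args) {
            System.out.println("pid = " + CLib.INSTANCE.getpid());
        }
    }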

Steve-o