14

Question:

The software industry's consensus is that clean and simple code is fundamental to the long-term viability of the code base and the organization that owns it. These properties lead to lower maintenance costs and increase the likelihood that the code base will continue to be developed.

However, SIMD code is different from general application code, and I would like to know whether there is a similar consensus regarding clean and simple code as it applies specifically to SIMD code.


Background to my question.

I write plenty of SIMD (single-instruction, multiple data) code for various image processing and analysis tasks. Recently I also had to port a small number of these functions from one architecture (SSE2) to another (ARM NEON).

The code is written for shrink-wrapped software, so it cannot depend on proprietary languages, such as MATLAB, that lack unrestricted redistribution rights.

An example of typical code structure:

  • Using OpenCV's matrix type (Mat) for all memory, buffer and lifetime management.
  • After checking the sizes (dimensions) of the input arguments, the pointers to the start addresses of the pixel rows are taken.
  • The pixel count and the start addresses of the pixel rows from each input matrix are passed into some low-level C++ functions.
  • These low-level C++ functions use SIMD intrinsics (for Intel architectures and ARM NEON), loading from and saving to raw pointer addresses. (A minimal sketch of such a function follows this list.)
  • Characteristics of these low-level C++ functions:
    • Exclusively one-dimensional (consecutive in memory)
    • Does not deal with memory allocations.
      (Every allocation, including temporaries, is handled by the outer code using OpenCV facilities.)
    • The name lengths of symbols (intrinsics, variable names, etc.) are roughly 10 - 20 characters, which is quite excessive.
      (Reads like techno-babble.)
    • Reuse of SIMD variables is discouraged because compilers have proven buggy at correctly handling code that is not written in the "single-assignment" coding style.
      (I've filed several compiler bug reports.)
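
To make that structure concrete, here is a minimal sketch of such a low-level function. It is purely illustrative: the function name and the saturating-add operation are placeholders, not taken from the actual code base.

#include <emmintrin.h>   // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Saturating add of two rows of 8-bit pixels (1-D, contiguous, no allocation).
void add_rows_u8_sse2(const uint8_t* src_a, const uint8_t* src_b,
                      uint8_t* dst, size_t pixel_count)
{
    size_t i = 0;
    // Main loop: 16 pixels per iteration.
    for (; i + 16 <= pixel_count; i += 16) {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src_a + i));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src_b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i),
                         _mm_adds_epu8(a, b));   // unsigned saturating add
    }
    // Scalar tail for the remaining 0..15 pixels.
    for (; i < pixel_count; ++i) {
        unsigned s = unsigned(src_a[i]) + unsigned(src_b[i]);
        dst[i] = uint8_t(s > 255 ? 255 : s);
    }
}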

What aspects of SIMD programming would cause the discussion to differ from the general case? Or, why is SIMD different?

In terms of initial development cost

  • It is well known that the initial development cost of C++ SIMD code with good performance is roughly 10x - 100x (with a wide margin) that of casually written C++ code.
  • As noted in the answers to Choosing between performance vs readable/cleaner code?, most code (including casually written code and SIMD code) is initially neither clean nor fast.
  • Evolutionary improvements in code performance (in both scalar and SIMD code) are discouraged (because they are seen as a kind of software rework), and their costs and benefits are not tracked.

In terms of propensity
(e.g. the Pareto principle, a.k.a. the 80-20 rule)

  • Even if image processing comprises only 20% of a software system (in both code size and functionality), it is comparatively slow, typically taking more than 80% of the CPU time.
    • This is due to the data size effect: A typical image size is measured in megabytes, whereas the typical size of non-image data is measured in kilobytes.
  • Within the image processing code, a SIMD programmer is trained to recognize almost automatically the 20% of the code comprising the hotspots, by identifying the loop structures in the C++ code. Thus, from a SIMD programmer's perspective, 100% of the "code that matters" is a performance bottleneck.
  • Often in an image processing system, multiple hotspots exist and take up comparable proportions of time. For example, there may be 5 hotspots each taking up (20%, 18%, 16%, 14%, 12%) of total time. To achieve a high performance gain, all of the hotspots need to be rewritten in SIMD.
    • This is summarized as the balloon-popping rule: a balloon cannot be popped twice.
    • Suppose there are some balloons, say 5 of them. The only way to eliminate them is to pop them one by one.
    • Once the first balloon is popped, the remaining 4 balloons comprise a higher percentage of the total execution time.
    • To make further gains, one must then pop another balloon.
      (This defies the 80-20 rule of optimization, which says that a good economic outcome can be achieved once the 20% of lowest-hanging fruit has been picked. See the arithmetic sketch after this list.)
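
To make the balloon-popping point concrete, here is a rough Amdahl's-law calculation using the five hotspots above and an assumed 8x speed-up per vectorized hotspot (an assumption, in the range of gains discussed later):

  • Popping only the first balloon (20% of the time, 8x faster): new running time = 0.80 + 0.20/8 = 0.825, i.e. an overall speed-up of only about 1.2x.
  • Popping all five balloons (80% of the time, 8x faster): new running time = 0.20 + 0.80/8 = 0.30, i.e. an overall speed-up of about 3.3x.
  • The large overall gain only materializes after every hotspot has been rewritten.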

In terms of readability and maintenance

  • SIMD code is patently hard to read.

    • This is true even if one follows every software engineering best-practice e.g. naming, encapsulation, const-correctness (and making side-effects obvious), function decomposition, etc.
    • This is true even for experienced SIMD programmers.
  • Optimal SIMD code is very contorted (see the remark below) compared to its equivalent C++ prototype code.

    • There are many ways to contort SIMD code, but only about 1 in 10 such attempts attains acceptably fast results.
    • (That is, gains in the range of 4x - 10x, which are needed to justify the high development cost. Even higher gains have been observed in practice.)

(Remark)
This is the main thesis of the MIT Halide project - quoting the paper's title verbatim:

"decoupling algorithms from schedules for easy optimization of image processing pipelines"

In terms of forward applicability

  • SIMD code is strictly tied to a single architecture. Each new architecture (or each widening of SIMD registers) requires a rewrite.
  • Unlike the majority of software development, each piece of SIMD code is typically written for a single purpose that never changes.
    (With the exception of porting to other architectures.)
  • Some architectures maintain perfect backward compatibility (Intel); some fall short by a trivial amount (ARM AArch64 replacing vtbl with vtblq), which is nevertheless enough to make some code fail to compile. (See the SSE2/NEON contrast below.)
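
To illustrate the porting cost, here is the earlier row-add sketch rewritten for NEON (again purely illustrative): every load, add and store intrinsic changes even though the algorithm does not.

#include <arm_neon.h>    // NEON intrinsics
#include <cstddef>
#include <cstdint>

void add_rows_u8_neon(const uint8_t* src_a, const uint8_t* src_b,
                      uint8_t* dst, size_t pixel_count)
{
    size_t i = 0;
    for (; i + 16 <= pixel_count; i += 16) {
        uint8x16_t a = vld1q_u8(src_a + i);
        uint8x16_t b = vld1q_u8(src_b + i);
        vst1q_u8(dst + i, vqaddq_u8(a, b));   // unsigned saturating add
    }
    for (; i < pixel_count; ++i) {            // scalar tail
        unsigned s = unsigned(src_a[i]) + unsigned(src_b[i]);
        dst[i] = uint8_t(s > 255 ? 255 : s);
    }
}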

In terms of skills and training

  • It is not clear what knowledge prerequisites are required to properly train a new programmer to write and maintain SIMD code.
  • College graduates who have learned SIMD programming in school seem to despise and dismiss it as an impractical career track.
  • Disassembly-reading and low-level performance profiling are cited as two fundamental skills for writing high-performance SIMD code. However, it is unclear how to systematically train programmers in these two skills.
  • Modern CPU architecture (which diverges significantly from what is taught in textbooks) makes training even more difficult.

In terms of correctness and defect-related costs

  • A single SIMD processing function is actually cohesive enough that one can establish correctness by:
    • Applying formal methods (with pen-and-paper), and
    • Verifying output integer ranges (with prototype code, performed offline outside run-time); see the test-harness sketch after this list.
  • The verification process is, however, very costly (an extra 100% of the time for code review and another 100% for prototype model checking), which roughly triples the already expensive development cost of SIMD code.
  • If a bug somehow slips through this verification process, it is nearly impossible to "repair" (fix) the suspect function; it must instead be replaced (rewritten).
  • SIMD code bears the brunt of defects in the C++ compiler (the optimizing code generator).
    • SIMD code generated using C++ expression templates also suffers greatly from defects of the compiler.
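
As a sketch of the kind of offline verification described above (using the illustrative add_rows_u8_sse2 function from the earlier sketch): the scalar prototype is kept as the reference, and all 256 x 256 pairs of 8-bit inputs are checked exhaustively, which is feasible for narrow integer types.

#include <cassert>
#include <cstdint>
#include <vector>

// Scalar prototype, kept as the reference implementation.
static uint8_t add_sat_u8(uint8_t a, uint8_t b) {
    unsigned s = unsigned(a) + unsigned(b);
    return uint8_t(s > 255 ? 255 : s);
}

void verify_add_rows_u8_sse2() {
    const size_t n = 256 * 256;                // every (a, b) pair of 8-bit inputs
    std::vector<uint8_t> a(n), b(n), out(n), ref(n);
    for (size_t i = 0; i < n; ++i) {
        a[i] = uint8_t(i & 0xFF);
        b[i] = uint8_t(i >> 8);
        ref[i] = add_sat_u8(a[i], b[i]);
    }
    add_rows_u8_sse2(a.data(), b.data(), out.data(), n);
    for (size_t i = 0; i < n; ++i)
        assert(out[i] == ref[i]);              // SIMD result must match the prototype
}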

In terms of disruptive innovations

  • Many solutions have been proposed from academia, but few are seeing widespread commercial usage.

    • MIT Halide
    • Stanford Darkroom
    • NT2 (Numerical Template Toolbox) and the related Boost.SIMD
  • Libraries with widespread commercial usage do not seem to be heavily SIMD-enabled.

    • Open-source libraries seem lukewarm to SIMD.
      • I recently made this first-hand observation after profiling a large number of OpenCV API functions, as of version 2.4.9.
      • Many other image processing libraries I have profiled also do not make heavy use of SIMD, or they miss the true hotspots.
    • Commercial libraries seem to avoid SIMD altogether.
      • In a few cases, I have even seen image processing libraries replace SIMD-optimized code from an earlier version with non-SIMD code in a later version, resulting in severe performance regressions.
        (The vendor's response is that it was necessary to avoid compiler bugs.)

The Programmers question Does low latency code sometimes have to be "ugly"? is related; I wrote an answer to it a few years ago explaining my viewpoint.

However, that answer is pretty much an "appeasement" of the "premature optimization" viewpoint, i.e. of the viewpoint that:

  • All optimizations are premature by definition (or, short-term by nature), and
  • The only optimization that has long-term benefit is toward simplicity.

But such viewpoints are contested in this ACM article.


All of that leads me to ask:
SIMD code is different from general application code, and I would like to know whether there is a similar industry consensus regarding the value of clean and simple code for SIMD code.

rwong
  • 3
    Do you have performance requirements? Can you meet your performance requirements without using SIMD? If not, the question is moot. – Charles E. Grant Dec 13 '14 at 23:41
  • 6
    This is far too long for a question, most likely because a good chunk of it is effectively an attempt to answer the question, and long even for an answer (partly because it touches on far more aspects than most reasonable answers do). –  Dec 13 '14 at 23:59
  • @CharlesE.Grant There is a product owner for all of the code (not me). I demonstrate the purpose (functionality) of each feature I work on, and I show the user-perceived impact of performance of that code. The product owner makes all judgement calls as to whether it is fast enough for the purpose, and leaves the optimization work to me. – rwong Dec 14 '14 at 00:04
  • 3
    I like to have the clean/simple/slow code (for the initial proof of concept and for later documentation purposes) in addition to the optimised alternative/s. This makes it easy to understand (as people can just read the clean/simple/slow code) and easy to verify (by comparing the optimised version to the clean/simple/slow version, manually and in unit tests). – Brendan Dec 14 '14 at 09:04
  • 2
    @Brendan I've been in a similar project and have used the testing approach with simple/slow code. While it's an option worth considering, it also has limitations. First, the performance difference might turn out prohibitive: tests using unoptimized code can run for hours... days. Second, for image processing it may turn out that bit-by-bit comparison simply won't work, when optimized code produces slightly different results - so one would have to use a more sophisticated comparison, e.g. root mean square diff. – gnat Dec 16 '14 at 12:27
  • 1
    @gnat: In general, the average software engineer's attitude toward SIMD unit-testing is detrimental. Firstly, a unit test programmer will think that "you only need to test the boundary conditions *ala* `INT_MAX`", and delete 90% of the test vectors from a SIMD unit test project, without understanding that SIMD boundary conditions can happen in many scattered places. Then, the test manager will complain that SIMD unit tests run too slow. Finally, the project manager will say that "Well it seems SIMD unit tests are passing with green. It seems safe to *skip the tests for now* to save time". – rwong Dec 16 '14 at 18:01
  • In some sense, SIMD programming requires a prerequisite of DSP and embedded numerical processing knowledge, which is uncommon among software developers. This missing knowledge leads to some dangerous assumptions about the perceived (lack of) importance of SIMD unit tests. – rwong Dec 16 '14 at 18:04
  • this attitude seems to be somewhat justified. In my experience, reliable and efficient unit tests for image processing code were difficult to develop and maintain (in particular, your point about necessity to understand DSP and num proc applies here). And if the code is intended for wide usage, it gets hard-to-impossible to design unit tests to cover all that needs to be covered. This all is so much different from "canonical" use cases where unit testing is known to work well... – gnat Dec 16 '14 at 18:18
  • But what is the alternative - because unit testing is not so easy, one should omit it? I remember around 20 years ago, I had to create a graphics library with some assembler optimizations. Creating the full code in C first, then optimizing the crucial parts by some handwritten assembly worked well. And of course, I kept the original C code for documentation *and* testing purposes, which was extremely helpful. I cannot believe the current situation with SIMD is extremely different from that (at least, not from the fact that it's SIMD; maybe for a different use case). – Doc Brown Dec 16 '14 at 19:59
  • 1
    @DocBrown what we meant is that the test code (along with *one of* the unoptimized or the optimized versions) tends to be deleted by people who under-appreciate (or misunderstand) the value or the complexity of this auxiliary code. – rwong Dec 16 '14 at 20:05
  • 3
    I'm voting to close this question as off-topic because it is not a conceptual programming problem as described in the [help/on-topic]. – durron597 Jul 18 '15 at 22:43

4 Answers

6

I have not written much SIMD code myself, but I wrote a lot of assembler code some decades ago. AFAIK, using SIMD intrinsics is essentially assembler programming, and your whole question could be rephrased just by replacing "SIMD" with the word "assembly". For example, the points you already mentioned, like

  • the code takes 10x to 100x longer to develop than "high level" code

  • it is tied to a specific architecture

  • the code is never "clean" nor easy to refactor

  • you need experts for writing and maintaining it

  • debugging and maintaining is hard, evolving really hard

are in no way "special" to SIMD - these points are true for any kind of assembly language, and they are all "industry consensus". And the conclusion in the software industry is also pretty much the same as for assembler:

  • don't write it if you don't have to - use a high-level language wherever possible and let compilers do the hard work

  • if the compilers are not sufficient, at least encapsulate the "low level" parts in some libraries, but avoid spreading the code all over your program (see the sketch after this list)

  • since it is almost impossible to write "self-documenting" assembler or SIMD code, try to compensate with lots of documentation.
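
A minimal sketch of that encapsulation advice, assuming per-architecture implementations along the lines of the add_rows_u8_sse2 / add_rows_u8_neon examples sketched in the question (the macro checks and names are illustrative):

#include <cstddef>
#include <cstdint>

// One portable entry point; the intrinsics stay hidden behind it.
void add_rows_u8(const uint8_t* a, const uint8_t* b, uint8_t* dst, size_t n)
{
#if defined(__SSE2__)
    add_rows_u8_sse2(a, b, dst, n);            // SSE2 implementation
#elif defined(__ARM_NEON)
    add_rows_u8_neon(a, b, dst, n);            // NEON implementation
#else
    for (size_t i = 0; i < n; ++i) {           // portable scalar fallback
        unsigned s = unsigned(a[i]) + unsigned(b[i]);
        dst[i] = uint8_t(s > 255 ? 255 : s);
    }
#endif
}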

Of course, there is indeed a difference from the situation with "classic" assembly or machine code: today, modern compilers typically produce high-quality machine code from a high-level language, often better optimized than assembler code written manually. For the SIMD architectures which are popular today, the quality of the available compilers is AFAIK far below that - and maybe it will never reach it, since automatic vectorization is still a topic of scientific research. See, for example, this article, which describes the differences in optimization between a compiler and a human, and gives a notion of how hard it might be to create good SIMD compilers.

As you already described in your question, there is also a quality problem with the current state-of-the-art libraries. So IMHO the best we can hope for is that in the coming years the quality of the compilers and libraries will increase; maybe the SIMD hardware will have to change to become more "compiler friendly"; maybe specialized programming languages supporting easier vectorization (like Halide, which you mentioned twice) will become more popular (wasn't that already a strength of Fortran?). According to Wikipedia, SIMD became "a mass product" around 15 to 20 years ago (and Halide is less than 3 years old, if I interpret the docs correctly). Compare this to the time compilers for "classic" assembly language needed to become mature. According to this Wikipedia article, it took almost 30 years (from ~1970 to the end of the 1990s) until compilers exceeded the performance of human experts (in producing non-parallel machine code). So we may just have to wait another 10 to 15 years until the same happens to SIMD-enabled compilers.

Doc Brown
  • [per my reading of Wikipedia article](http://programmers.stackexchange.com/a/222200/31260 "discussed in an answer about similar issues"), there seems to be a general _industry consensus_ that code optimized at low level is "considered difficult to use, due to the numerous technical details which must be remembered" – gnat Dec 16 '14 at 15:10
  • @gnat: yes, absolutely, but I think if I added this to my answer, I should add a dozen other things already mentioned by the OP in other words in his too-long question. – Doc Brown Dec 16 '14 at 16:10
  • agree, analysis in your answer looks good enough as is, adding that reference would carry a risk of "overloading" it – gnat Dec 16 '14 at 16:13
  • Actually, at some point I took a lot of SIMD code written in assembler, turned it all into C++ code using intrinsic functions, and made it several times faster in the process. For example by doing loop unrolling (trivial in a high level language, hard work in assembler) and sometimes improving algorithms. I'd never write SIMD code in assembler. – gnasher729 Jun 28 '22 at 07:51
6

My organization has dealt with this exact problem. Our products are in the video space, but much of the code we write is image processing that would work for still images, too.

We "solved" (or maybe "dealt with") the problem by writing our own compiler. This isn't quite as crazy as it sounds at first. It's got a restricted set of inputs. We know that all the code is working on images, mostly RGBA images. We set up some constraints, like that the input and output buffers can never overlap, so there's no pointer aliasing. Things like that.

We then write our code in the OpenGL Shading Language (glsl). It gets compiled to scalar code, SSE, SSE2, SSE3, AVX, Neon, and of course actual glsl. When we need to support a new platform, we update the compiler to output code for that platform.

We also do tiling of the images to improve cache coherency, and stuff like that. But by keeping the image processing to a small kernel, and using glsl (which doesn't even support pointers) we greatly reduce the complexity of compiling the code.

This approach isn't for everyone, and it has its own problems (you need to ensure compiler correctness, for example). But it has worked out fairly well for us.

user1118321
  • This sounds great! Is this a product you sell or make available stand-alone? (Also, is ‘AVC’ = AVX?) – Ahmed Fasih Jul 16 '16 at 01:05
  • Sorry, yes, I meant AVX (I'll fix that.). We don't currently sell the compiler as a stand-alone product, though it might happen in the future. – user1118321 Jul 16 '16 at 01:08
  • No joke, this sounds really neat. The closest thing I’ve seen like this is how the CUDA compiler used to be able to make “serial” programs that run on CPU for debugging—we hoped that would generalize into a way to write multi-threaded & SIMD CPU code, but alas. The next closest thing I can think of is OpenCL—did you folks evaluate OpenCL and find it inferior to your GLSL-to-all compiler? – Ahmed Fasih Jul 16 '16 at 01:26
  • 1
    Well OpenCL didn't exist when we started, I don't think. (Or if it did, it was fairly new.) So it didn't really come into the equation. – user1118321 Jul 16 '16 at 01:30
-1

It doesn't seem to add too much maintenance overhead if you consider using a higher-level language:

// Perform addition as a vector operation (System.Numerics.Vector<T>):
// each operation handles Vector<float>.Count lanes, so the data is walked
// in chunks (assumes the length is a multiple of the lane count).
float[] values = GetValues();
float[] increment = GetIncrement();
var result = new float[values.Length];

for (int i = 0; i < values.Length; i += Vector<float>.Count)
    (new Vector<float>(values, i) + new Vector<float>(increment, i)).CopyTo(result, i);

vs

List<float> values = GetValues();
List<float> increment = GetIncrement();

// Perform addition as a monadic sequence operation:
List<float> result = values.Zip(increment, (v, i) => v + i).ToList();

Of course you will have to face the limitations of the library, but you will not have to maintain it yourself. It could be a good balance between maintenance cost and performance win.

Den
  • per my reading, the option to use external libraries has been already investigated and addressed by asker: "Libraries with widespread commercial usage do not seem to be heavily SIMD-enabled..." – gnat Dec 16 '14 at 16:10
  • @gnat I have actually read that whole paragraph, not just the top-level bullet points, and the poster is not mentioning any general-purpose SIMD libraries, just the computer vision and image processing ones. Not to mention that an analysis of applying higher-level languages is completely missing, despite there being no C++ tag and no C++ specificity in the question title. This leads me to believe that while my answer will not be considered primary, it is likely to add value by making people aware of other options. – Den Dec 17 '14 at 09:13
  • 1
    To my understanding, the OP is asking if there exist solutions with widespread commercial usage. Though I appreciate your hint (maybe I can use the lib for a project here), from what I see RyuJIT is far from being a "widely accepted industry standard". – Doc Brown Dec 17 '14 at 11:05
  • @DocBrown maybe, but his actual question is formulated to be more generic: "...industry consensus regarding the value of clean and simple code for SIMD code...". I doubt there is any (official) consensus at all, but I submit that higher-level languages can reduce the difference between "usual" and SIMD code, just like C++ lets you forget about assembly, thus reducing maintenance costs. – Den Dec 17 '14 at 13:38
-1

I've done assembly programming in the past, not SIMD programming recently.

Have you considered using a SIMD-aware compiler like Intel's? Is A Guide to Vectorization with Intel® C++ Compilers interesting?

Several of your comments, like the one about "balloon-popping", suggest using a compiler (to get benefits throughout the code when you don't have a single hot-spot).

ChrisW
  • per my reading, this approach was tried by asker, see mentions of compiler bugs / defects in the question – gnat Dec 19 '14 at 05:02
  • The OP didn't say whether they'd tried **[the Intel compiler](http://en.wikipedia.org/wiki/Intel_C%2B%2B_Compiler)**, which is also the subject of **[this Programmers.SE topic](http://programmers.stackexchange.com/q/151375/19237)**. Most people haven't tried it. It's not for everyone; but it might suit the OP's business/question (better performance for lower coding/design/maintenance cost). – ChrisW Dec 19 '14 at 11:09
  • well what I read in the question suggests that the asker is aware of compilers for Intel and other architectures: "Some architectures maintain perfect backward compatibility (Intel); some fall short..." – gnat Dec 19 '14 at 11:34
  • "Intel" in that sentence means the Intel-the-chip-designer, not Intel-the-compiler-writer. – ChrisW Dec 19 '14 at 11:46