I am wondering what a good, performant algorithm for matrix multiplication of 4x4 matrices is. I am implementing some affine transformations and I am aware that there are several algorithms for efficient matrix multiplication, like Strassen. But are there some algorithms that are especially efficient for matrices that small? Most sources I glanced at look at which algorithms are asymptotically the most efficient.
-
I think there's some performance to be gained by noting that affine transformations for 3D only change a 4x3 submatrix, as the bottom row is always `0 0 0 1`. Therefore, you can avoid multiplying by this row. – Eric Dec 26 '15 at 21:07
-
You are right, this is an optimization that I have already applied in the naive O(n^3) implementation I use right now. – wirrbel Dec 26 '15 at 21:14
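For illustration, a minimal sketch of that affine shortcut, assuming row-major `float[16]` matrices whose bottom row is `0 0 0 1` (the function name and layout are just this example's choices):

```cpp
// Sketch: multiply two affine 4x4 matrices (row-major float[16]), skipping
// the bottom row.  Only the top 3x4 part is computed; the fixed row
// [0 0 0 1] is written directly, and b's bottom row contributes only a
// single term to the last column.
void mul_affine4x4(const float a[16], const float b[16], float out[16]) {
    for (int i = 0; i < 3; ++i) {                 // top three rows only
        for (int j = 0; j < 4; ++j) {
            float sum = a[i*4 + 0] * b[0*4 + j]
                      + a[i*4 + 1] * b[1*4 + j]
                      + a[i*4 + 2] * b[2*4 + j];
            if (j == 3) sum += a[i*4 + 3];        // b's bottom row is 0 0 0 1
            out[i*4 + j] = sum;
        }
    }
    out[12] = 0.f; out[13] = 0.f; out[14] = 0.f; out[15] = 1.f;  // fixed row
}
```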
-
Last time I played with this, the fastest answer was the most obvious. I wrote the most blindly naive code possible, and it was so effective at capturing my meaning that the compiler optimized the daylights out of it using SSE. It did brilliant things like keeping matrices in SSE registers, rather than letting them go to RAM. The more I tried to optimize it myself, the less effective the compiler was at converting my methods to SIMD. – Cort Ammon Dec 26 '15 at 21:54
5 Answers
Often, simple algorithms are the fastest for very small inputs, because more complex algorithms usually rely on some transformation that adds overhead. I think your best bet is not a more efficient algorithm (I think most libraries use straightforward methods), but a more efficient implementation, for example one using SIMD extensions (assuming x86 or amd64 code), or hand-written assembly. The memory layout should also be well thought out. You should be able to find enough resources on this.
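For illustration, a minimal sketch of what such a SIMD implementation can look like, assuming x86 SSE and row-major, 16-byte-aligned `float[16]` matrices (the function name and layout are just this example's choices):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Sketch: 4x4 multiply where each SSE register holds one whole row of B.
// Row i of the result is a linear combination of B's rows, weighted by
// the four entries of row i of A.
void mul4x4_sse(const float* a, const float* b, float* out) {
    __m128 b0 = _mm_load_ps(b + 0);
    __m128 b1 = _mm_load_ps(b + 4);
    __m128 b2 = _mm_load_ps(b + 8);
    __m128 b3 = _mm_load_ps(b + 12);
    for (int i = 0; i < 4; ++i) {
        __m128 r = _mm_mul_ps(_mm_set1_ps(a[i*4 + 0]), b0);
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[i*4 + 1]), b1));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[i*4 + 2]), b2));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[i*4 + 3]), b3));
        _mm_store_ps(out + i*4, r);
    }
}
```

With optimizations enabled, a modern compiler will often emit something very close to this from the plain scalar loop anyway, which matches the comment above about the naive code being auto-vectorized.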

For 4x4 mat/mat multiplication, algorithmic improvements are often out. The basic cubic-time complexity algorithm tends to fare quite well, and anything fancier than that is more likely to degrade rather than improve times. Just in general, fancy algorithms are ill-suited if there isn't a scalability factor involved (ex: trying to quicksort an array that always has 6 elements as opposed to a simple insertion or bubble sort). Doing things like matrix transposition here to improve locality of reference also doesn't actually aid locality of reference when an entire matrix can fit into one or two cache lines. At this kind of miniature scale, if you are doing 4x4 mat/mat multiplication en masse, the improvements will generally come from micro-level optimizations of instructions and memory, like proper cache-line alignment.
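As a concrete illustration of the cache-line point, here is a minimal sketch using C++11 `alignas` (the type name is made up for this example, and 64 bytes is a typical, not universal, cache-line size):

```cpp
// Sketch: a 4x4 float matrix is exactly 64 bytes, so aligning it to a
// 64-byte boundary keeps each matrix on a single cache line instead of
// letting it straddle two.  Illustrative type, not a library API.
struct alignas(64) Mat4 {
    float m[16];   // row-major storage
};

static_assert(sizeof(Mat4) == 64, "one matrix per cache line");
```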
-
Great answer! I've never heard of the SoA acronym (at least, in Dutch it is an acronym for 'seksueel overdraagbare aandoening', which means 'sexually transmitted disease'... but that's hopefully not what you mean here). The technique seems clear, I'm even quite surprised that there is a name for it. What does SoA stand for? – Ruben Dec 28 '15 at 07:58
-
@Ruben Structure of Arrays as opposed to Arrays of Structures. SoAs can also be PITAs -- best saved for your most critical paths. Here's a nice little link I found on the subject: http://stackoverflow.com/questions/17924705/structure-of-arrays-vs-array-of-structures-in-cuda – Dec 28 '15 at 08:06
-
You might want to mention C++11 / C11 [`alignas`](http://en.cppreference.com/w/cpp/language/alignas). – Deduplicator Jul 30 '17 at 18:58
Wikipedia lists four algorithms for matrix multiplication of two n×n matrices.
The classic one that a programmer would write is O(n^3) and is listed as the "Schoolbook matrix multiplication". Yep. O(n^3) is a bit of a hit. Let's look at the next best one.
The Strassen algorithm is O(n^2.807). This one would work - it has some restrictions (such as the size being a power of two) and it has a caveat in the description:
Compared to conventional matrix multiplication, the algorithm adds a considerable O(n^2) workload in additions/subtractions; so below a certain size, it will be better to use conventional multiplication.
For those who are interested in this algorithm and its origins, How did Strassen come up with his matrix multiplication method? can be a good read. It gives a hint at the complexity of that initial O(n^2) workload that is added and why this would be more expensive than just doing the classic multiplication.
So it really is O(n^2 + n^2.807), with the lower-exponent term being ignored when writing out big O. It appears that if you are working on a nice 2048x2048 matrix, this could be useful. For a 4x4 matrix, it is probably going to be slower, as that overhead eats up all the other time.
And then there's the Coppersmith–Winograd algorithm, which is O(n^2.373) with quite a few bits of improvements. It also comes with a caveat:
The Coppersmith–Winograd algorithm is frequently used as a building block in other algorithms to prove theoretical time bounds. However, unlike the Strassen algorithm, it is not used in practice because it only provides an advantage for matrices so large that they cannot be processed by modern hardware.
So, it's better when you are working on super large matrices, but again, not useful for a 4x4 matrix.
This is again reflected in Wikipedia's page in Matrix multiplication: Sub-cubic algorithms, which gets at why things can run faster:
Algorithms exist that provide better running times than the straightforward ones. The first to be discovered was Strassen's algorithm, devised by Volker Strassen in 1969 and often referred to as "fast matrix multiplication". It is based on a way of multiplying two 2 × 2 matrices which requires only 7 multiplications (instead of the usual 8), at the expense of several additional addition and subtraction operations. Applying this recursively gives an algorithm with a multiplicative cost of O(n^(log2 7)) ≈ O(n^2.807). Strassen's algorithm is more complex, and the numerical stability is reduced compared to the naïve algorithm, but it is faster in cases where n > 100 or so and appears in several libraries, such as BLAS.
And that gets to the core of why the algorithms are faster - you trade off some numerical stability and some additional setup. For a 4x4 matrix, that additional setup costs far more than the multiplications it saves.
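To make that trade-off concrete, here is the 2 × 2 scheme the quote refers to, written out on plain scalars as a sketch (in the actual algorithm the entries are submatrix blocks and the scheme is applied recursively): seven multiplications instead of eight, but many more additions and subtractions.

```cpp
// Sketch of Strassen's 2x2 scheme on plain scalars, for illustration only:
// 7 multiplications instead of 8, but 18 additions/subtractions instead of 4.
// At 4x4 scale the extra additions (plus recursion bookkeeping) cost more
// than the single multiplication saved.
void strassen2x2(const float a[2][2], const float b[2][2], float c[2][2]) {
    float m1 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1]);
    float m2 = (a[1][0] + a[1][1]) *  b[0][0];
    float m3 =  a[0][0]            * (b[0][1] - b[1][1]);
    float m4 =  a[1][1]            * (b[1][0] - b[0][0]);
    float m5 = (a[0][0] + a[0][1]) *  b[1][1];
    float m6 = (a[1][0] - a[0][0]) * (b[0][0] + b[0][1]);
    float m7 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1]);
    c[0][0] = m1 + m4 - m5 + m7;
    c[0][1] = m3 + m5;
    c[1][0] = m2 + m4;
    c[1][1] = m1 - m2 + m3 + m6;
}
```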
And now, to answer your question:
But are there some algorithms that are especially efficient for matrices that small?
No, there are no algorithms that are optimized for 4x4 matrix multiplication, because the O(n^3) one works quite reasonably until the matrices get large enough that the overhead of the fancier algorithms pays off. For your specific situation, there may be some shortcuts you could take knowing specific things beforehand about your matrices (like how much some data will be reused), but really the easiest thing to do is write good code for the O(n^3) solution, let the compiler handle it, and profile later on to see if the matrix multiplication is actually your slow spot.
Related on Math.SE: Minimal number of multiplications required to invert a 4x4 matrix
If you know for sure that you will only need to multiply 4x4 matrices, then you don't need to worry about a general algorithm at all. You can just take in two pointers, and use this:
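For instance, a fully unrolled sketch along those lines, assuming row-major `float[16]` operands and a destination that does not alias the inputs (the exact signature is just this example's choice):

```cpp
// Sketch of a fully unrolled 4x4 multiply (row-major float[16] operands;
// dst must not alias a or b).  Every index is a constant, so the compiler
// sees the entire dataflow with no loops or branches.
void mul4x4(const float* a, const float* b, float* dst) {
    dst[0]  = a[0]*b[0]   + a[1]*b[4]   + a[2]*b[8]   + a[3]*b[12];
    dst[1]  = a[0]*b[1]   + a[1]*b[5]   + a[2]*b[9]   + a[3]*b[13];
    dst[2]  = a[0]*b[2]   + a[1]*b[6]   + a[2]*b[10]  + a[3]*b[14];
    dst[3]  = a[0]*b[3]   + a[1]*b[7]   + a[2]*b[11]  + a[3]*b[15];
    dst[4]  = a[4]*b[0]   + a[5]*b[4]   + a[6]*b[8]   + a[7]*b[12];
    dst[5]  = a[4]*b[1]   + a[5]*b[5]   + a[6]*b[9]   + a[7]*b[13];
    dst[6]  = a[4]*b[2]   + a[5]*b[6]   + a[6]*b[10]  + a[7]*b[14];
    dst[7]  = a[4]*b[3]   + a[5]*b[7]   + a[6]*b[11]  + a[7]*b[15];
    dst[8]  = a[8]*b[0]   + a[9]*b[4]   + a[10]*b[8]  + a[11]*b[12];
    dst[9]  = a[8]*b[1]   + a[9]*b[5]   + a[10]*b[9]  + a[11]*b[13];
    dst[10] = a[8]*b[2]   + a[9]*b[6]   + a[10]*b[10] + a[11]*b[14];
    dst[11] = a[8]*b[3]   + a[9]*b[7]   + a[10]*b[11] + a[11]*b[15];
    dst[12] = a[12]*b[0]  + a[13]*b[4]  + a[14]*b[8]  + a[15]*b[12];
    dst[13] = a[12]*b[1]  + a[13]*b[5]  + a[14]*b[9]  + a[15]*b[13];
    dst[14] = a[12]*b[2]  + a[13]*b[6]  + a[14]*b[10] + a[15]*b[14];
    dst[15] = a[12]*b[3]  + a[13]*b[7]  + a[14]*b[11] + a[15]*b[15];
}
```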
(I would strongly recommend translating this in some automated fashion).
The compiler is then optimally positioned to optimize this code (reusing partial sums, reordering the math, etc.) because it can see everything: there are no dynamic loops and no control flow.
Hard to imagine that this can be beaten without using intrinsics.
You cannot directly compare asymptotic complexity if you define n differently. You are accustomed to comparing the complexity of algorithms on flat data structures like lists, where n is defined to be the total number of elements in the list, but the matrix algorithms define n to be only the length of one side.
By this definition of n, something as simple as looking at each element once in order to print it, what you would normally think of as O(n), is O(n^2). If you instead define n as the total number of elements in the matrix, i.e. n = 16 for a 4x4 matrix, then the naïve matrix multiplication is only O(n^1.5), which is pretty good.
Your best bet is to take advantage of parallelism by using SIMD instructions or a GPU, rather than trying to improve the algorithm based on the mistaken belief that O(n^3) is as bad as it would be if n were defined comparably to a flat data structure.
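To spell out that conversion as a small worked step (using N here for the total element count so it does not clash with the n above):

```latex
N = n^2 \;\Rightarrow\; n^3 = \bigl(N^{1/2}\bigr)^{3} = N^{3/2}
\qquad \text{e.g. } n = 4,\ N = 16:\ n^3 = 64 = 16^{1.5}
```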
