Based on the example here, I tried a very similar design, but instead of multiplying two matrices I just multiply every element of a matrix by 2.0.
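For reference, the computation I'm offloading is essentially the following (a simplified sketch; the function name is just illustrative, and my real HLS code additionally wraps this with the stream interface):

```c
/* Simplified sketch of the offloaded kernel (illustrative only). */
#define DIM 32

void scale_matrix(const float in[DIM][DIM], float out[DIM][DIM])
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            out[i][j] = in[i][j] * 2.0f;  /* one multiply per element */
}
```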
However, when comparing the results of multiplying a 32x32 matrix by 2.0 on the ARM (compiled with -O3) with the results on the hardware (i.e. the FPGA side), I noticed that the first took 1425 clock cycles while the second took 3654 clock cycles. So basically, the FPGA is almost 3 times slower (acceleration_factor = 1425/3654 ≈ 0.39).
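To be explicit about how I compute that factor (the cycle counts are my measurements, and 1 cycle = 10 ns at my 100 MHz clock):

```c
#include <stdio.h>

int main(void)
{
    const double clk_ns      = 10.0;    /* 1 cycle = 10 ns at 100 MHz */
    const double arm_cycles  = 1425.0;  /* measured on the ARM with -O3 */
    const double fpga_cycles = 3654.0;  /* measured on the FPGA side */

    printf("ARM  time: %.2f us\n", arm_cycles  * clk_ns / 1000.0);   /* 14.25 us */
    printf("FPGA time: %.2f us\n", fpga_cycles * clk_ns / 1000.0);   /* 36.54 us */
    printf("acceleration factor: %.3f\n", arm_cycles / fpga_cycles); /* ~0.390 */
    return 0;
}
```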
See this to check the acceleration factors I'm talking about in the matrix multiplication example.
I already tried changing the port that connects the ARM and the AXI DMA block from ACP to HP, and the results are the same.
I'm also using AXI DMA to transfer data from and to the DDR. I measured the MM2S (Memory-Mapped to Stream) transfer at 1343 clock cycles for 4096 bytes, which works out to a transfer speed of 290.8 MB/s. The S2MM (Stream to Memory-Mapped) transfer, in turn, runs at 167.2 MB/s, since it transferred 4096 bytes in 2336 clock cycles.
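For completeness, here is the arithmetic behind those two figures; my quoted numbers come out (up to rounding) if 1 MB is taken as 2^20 bytes:

```c
#include <stdio.h>

int main(void)
{
    const double clk_ns = 10.0;        /* 1 cycle = 10 ns at 100 MHz */
    const double bytes  = 4096.0;
    const double mm2s_cycles = 1343.0; /* measured MM2S transfer */
    const double s2mm_cycles = 2336.0; /* measured S2MM transfer */

    /* bytes / elapsed-seconds, then divide by 2^20 to get MB/s */
    double mm2s = bytes / (mm2s_cycles * clk_ns * 1e-9) / (1024.0 * 1024.0);
    double s2mm = bytes / (s2mm_cycles * clk_ns * 1e-9) / (1024.0 * 1024.0);

    printf("MM2S: %.1f MB/s\n", mm2s); /* ~290.9 MB/s */
    printf("S2MM: %.1f MB/s\n", s2mm); /* ~167.2 MB/s */
    return 0;
}
```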
I have several questions I hope you can help with:
1. Why is my FPGA design slower than the ARM when multiplying a matrix by 2.0, but not when multiplying two matrices?
2. Do these AXI DMA speeds look okay to you? Comparing them to Sadri's video, it seems I should be able to transfer much faster. What can I do to improve these transfer speeds?
3. I saw somewhere that S2MM transfers are expected to be slower than MM2S transfers on the Zedboard. Can you tell me why, and whether a difference this large makes sense?
4. I measured the time my PC takes to multiply a 32x32 matrix by 2.0: 3.84×10⁻⁶ seconds. Given that the same operation takes 1.42×10⁻⁵ s on the ARM and 3.85×10⁻⁵ s on the FPGA, the CPU is almost 4 times faster than the ARM and almost 10 times faster than the FPGA (see the timing sketch at the end of this post). If my objective was to design an FPGA accelerator for software, why am I so far off when I'm following an example?
Note: My clock frequency is 100 MHz, so each clock cycle is 10 ns.
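And here is a simplified sketch of how I time the operation on the PC (assuming a POSIX system with clock_gettime; the single-run measurement and the printing of out[0][0] to keep -O3 from removing the loop are just illustrative):

```c
#include <stdio.h>
#include <time.h>

#define DIM 32

int main(void)
{
    static float in[DIM][DIM], out[DIM][DIM];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            out[i][j] = in[i][j] * 2.0f;  /* the same scale-by-2.0 kernel */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* print an output element so the optimizer can't discard the loop */
    printf("elapsed: %.2e s (check: %f)\n", secs, out[0][0]);
    return 0;
}
```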