Based on the example here, I tried a very similar design, but instead of multiplying two matrices I just multiply every element of a matrix by 2.0.
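For reference, the computation I'm offloading is essentially the following (a simplified sketch; the function name is just illustrative, and my real HLS code additionally wraps this with the stream interface):

```c
/* Simplified sketch of the offloaded kernel (illustrative only). */
#define DIM 32

void scale_matrix(const float in[DIM][DIM], float out[DIM][DIM])
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            out[i][j] = in[i][j] * 2.0f;  /* one multiply per element */
}
```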
However, when comparing the results of multiplying a 32x32 matrix by 2.0 on the ARM (compiled with -O3) with the results on the hardware (i.e. the FPGA side), I noticed that the first took 1425 clock cycles while the second took 3654 clock cycles. So basically, the FPGA is almost 3 times slower (acceleration_factor = 1425/3654 ≈ 0.39).
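To be explicit about how I compute that factor (the cycle counts are my measurements, and 1 cycle = 10 ns at my 100 MHz clock):

```c
#include <stdio.h>

int main(void)
{
    const double clk_ns      = 10.0;    /* 1 cycle = 10 ns at 100 MHz */
    const double arm_cycles  = 1425.0;  /* measured on the ARM with -O3 */
    const double fpga_cycles = 3654.0;  /* measured on the FPGA side */

    printf("ARM  time: %.2f us\n", arm_cycles  * clk_ns / 1000.0);   /* 14.25 us */
    printf("FPGA time: %.2f us\n", fpga_cycles * clk_ns / 1000.0);   /* 36.54 us */
    printf("acceleration factor: %.3f\n", arm_cycles / fpga_cycles); /* ~0.390 */
    return 0;
}
```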
See this to check the acceleration factors I'm talking about in the matrix multiplication example.
I already tried changing the port that connects the ARM and the AXI DMA block from ACP to HP, and the results are the same.
I'm also using AXI DMA to transfer data from and to the DDR. I measured the MM2S (Memory-Mapped to Stream) transfer at 1343 clock cycles for 4096 bytes, which works out to a transfer speed of 290.8 MB/s. The S2MM (Stream to Memory-Mapped) transfer, in turn, runs at 167.2 MB/s, since it transferred 4096 bytes in 2336 clock cycles.
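For completeness, here is the arithmetic behind those two figures; my quoted numbers come out (up to rounding) if 1 MB is taken as 2^20 bytes:

```c
#include <stdio.h>

int main(void)
{
    const double clk_ns = 10.0;        /* 1 cycle = 10 ns at 100 MHz */
    const double bytes  = 4096.0;
    const double mm2s_cycles = 1343.0; /* measured MM2S transfer */
    const double s2mm_cycles = 2336.0; /* measured S2MM transfer */

    /* bytes / elapsed-seconds, then divide by 2^20 to get MB/s */
    double mm2s = bytes / (mm2s_cycles * clk_ns * 1e-9) / (1024.0 * 1024.0);
    double s2mm = bytes / (s2mm_cycles * clk_ns * 1e-9) / (1024.0 * 1024.0);

    printf("MM2S: %.1f MB/s\n", mm2s); /* ~290.9 MB/s */
    printf("S2MM: %.1f MB/s\n", s2mm); /* ~167.2 MB/s */
    return 0;
}
```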
I have several questions I hope you can help with:
1. Why is my FPGA design slower than the ARM when multiplying a matrix by 2.0, but not when multiplying two matrices?
2. Do these AXI DMA speeds look okay to you? Comparing them to Sadri's video, it seems I should be able to transfer much faster. What can I do to improve these transfer speeds?
3. I saw somewhere that S2MM transfers are expected to be slower than MM2S transfers on the Zedboard. Can you tell me why, and whether a difference this large makes sense?
4. I measured the time my PC takes to multiply a 32x32 matrix by 2.0: 3.84×10⁻⁶ seconds. Given that the same operation takes 1.42×10⁻⁵ s on the ARM and 3.85×10⁻⁵ s on the FPGA, the CPU is almost 4 times faster than the ARM and almost 10 times faster than the FPGA (see the timing sketch at the end of this post). If my objective was to design an FPGA accelerator for software, why am I so far off when I'm following an example?
Note: My clock frequency is 100 MHz, so each clock cycle is 10 ns.
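And here is a simplified sketch of how I time the operation on the PC (assuming a POSIX system with clock_gettime; the single-run measurement and the printing of out[0][0] to keep -O3 from removing the loop are just illustrative):

```c
#include <stdio.h>
#include <time.h>

#define DIM 32

int main(void)
{
    static float in[DIM][DIM], out[DIM][DIM];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            out[i][j] = in[i][j] * 2.0f;  /* the same scale-by-2.0 kernel */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* print an output element so the optimizer can't discard the loop */
    printf("elapsed: %.2e s (check: %f)\n", secs, out[0][0]);
    return 0;
}
```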