
My objective is to read seven 512x512 float matrices from the SD card into DDR memory (already accomplished; each matrix occupies around 1 MB), then pass them from DDR to my custom IP block (I'm doing this transfer with the AXI DMA block), normalize them inside the custom IP block, and then write them back to DDR memory (also via the AXI DMA block).

I'm building my custom IP block in Vivado HLS, following the steps in this Xilinx manual (which should be the ideal way to do it, since it's from Xilinx). It works for a 32x32 matrix.

Unfortunately, when I increase the matrix dimensions to 512x512, even if the only operation is multiplying each matrix element by 2.0, the BRAM_18K utilization is 365%!

What can I do to drastically decrease the percentage of resources used? I'll need to do lots of operations on the matrices inside the custom IP block, and if a simple multiplication by 2.0 already uses 365% of the BRAMs, a solution that only brings this example down to 80-90% is not good enough. What I'm looking for is something that brings the BRAM utilization in this example down to around 5%.
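
To make the problem concrete, here is a minimal sketch of the kind of kernel the manual's approach leads to when scaled up (the name `matrix_scale` and the port names are illustrative, not my actual code). The whole matrix is kept in a local array, and 512x512 floats is 1 MB, which on its own is already around 160% of the Zedboard's 280 BRAM_18K blocks, before any second buffer or FIFO is added:

```cpp
#include <hls_stream.h>
#include <ap_axi_sdata.h>

#define DIM 512

typedef ap_axis<32, 1, 1, 1> axis_t;

// Reinterpret the 32-bit stream word as a float.
union fp_word { float f; int i; };

void matrix_scale(hls::stream<axis_t> &in_stream, hls::stream<axis_t> &out_stream) {
#pragma HLS INTERFACE axis port=in_stream
#pragma HLS INTERFACE axis port=out_stream
#pragma HLS INTERFACE s_axilite port=return

    // This local array is what HLS maps to BRAM: 512 x 512 floats = 1 MB,
    // which on its own exceeds the 280 BRAM_18K blocks available.
    float buf[DIM][DIM];

read_loop:
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            fp_word w;
            w.i = in_stream.read().data.to_int();
            buf[i][j] = w.f;
        }

write_loop:
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            fp_word w;
            w.f = buf[i][j] * 2.0f;      // the "simple multiplication by 2.0"
            axis_t v;
            v.data = w.i;
            v.keep = -1;
            v.strb = -1;
            v.last = (i == DIM - 1) && (j == DIM - 1);
            out_stream.write(v);
        }
}
```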

João Pereira
  • 512x512x4 = a megabyte of matrix. How many 18K block RAMs do you have? You'll almost certainly have to do it in sub-matrices like BLAS does. – pjc50 Jun 03 '15 at 15:33
  • 280 18K BRAM blocks. BLAS? – João Pereira Jun 03 '15 at 17:16
  • https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms ; implementations usually subdivide matrices into tiles to fit in vector processors. – pjc50 Jun 03 '15 at 19:26
  • There's a big difference between 32x32 (2^10) and 512x512 (2^18). – mng Jun 03 '15 at 21:47
  • Well, you have to understand that the IP uses some of the BRAM for its internal processing, so going from a 32x32 to a 512x512 matrix increases the memory requirement by a factor of 16x16, and that is why you are having this problem. Follow the advice of the people here and do it in sub-matrices instead. Remember, this is not software; it is hardware, and hardware has its limitations. – FarhadA Jun 04 '15 at 06:29
  • Thanks for the answers, guys. I'm going to look into BLAS, but I'd like your opinion on something: my objective in converting my C algorithm to run on the Zedboard is to make the overall running time a lot smaller. By doing this (i.e. subdividing my 512x512 matrices), will I still see a big decrease in running time from software to hardware? – João Pereira Jun 04 '15 at 10:07
  • Also, do you know of any example/website of people dealing with this kind of big array on the Zedboard? I can't find any, which I find strange because I'm sure I'm not the first one to try to do this. – João Pereira Jun 04 '15 at 11:09
  • I'm doing a big matrix multiplication on an FPGA. In fact, it's much bigger than yours - something like 75000x75000. It's only tractable because of several levels of algorithmic optimisation before I get anywhere near the hardware. Which is the core of the question: what properties of the matrix, if any, can you exploit? – Henry Gomersall Jun 11 '15 at 14:56
  • Thanks for the answer. Well, each matrix is going to be normalized (so each value can be normalized separately, with a pipeline or something like that). Then the 7 matrices are going to be aggregated. How do you recommend I do this? I'm trying sub-matrices of 32x32, but it doesn't seem like a very good optimization... – João Pereira Jun 11 '15 at 15:07
  • Well back another layer, is your matrix symmetric perhaps, or hermitian, or positive (or negative) definite or semi-definite, or banded, or block, or with some nice eigenvectors (say a Fourier basis), or sparse in some other way, or is it a totally random full matrix with no structure whatsoever? – Henry Gomersall Jun 11 '15 at 15:12
  • Ah ok, now I understand what you want. Well, the matrices are "random" values, all positive. They are texture, shadow, and slope values of 512x512 images, so they don't follow any given pattern. After the normalization takes place, their values are between 0 and 1. – João Pereira Jun 11 '15 at 15:17
  • But they don't change? And they are positive definite, so that's something! – Henry Gomersall Jun 11 '15 at 15:18
  • To completely clarify. Stage 1: I receive 7 matrices and copy them to DDR. Stage 2: I normalize them and their values are now between 0 and 1. I can obviously overwrite them in place in DDR or use a new memory address. Stage 3: aggregate the 7 normalized matrices from stage 2. – João Pereira Jun 11 '15 at 15:22
  • Note: I can start aggregating as soon as I have at least 1 value of each of the 7 matrices – João Pereira Jun 11 '15 at 15:23
  • Aggregate how? By multiplication? or addition? Is the normalization simply scaling all the values by the largest? How fast do you want it to be? Normalization would be very fast on a CPU (though granted, it could be faster on an FPGA). Also, note that it is likely your textures and shadows and what not could be compressed (effectively made more diagonal) with some transformation - if there are correlations between elements, it is compressible. This may or may not be useful to you (likely not in your case). – Henry Gomersall Jun 11 '15 at 15:27
  • The basic point of all my questions is to get at whether you need the full matrix in memory in order to do everything you need. By way of example, if the normalisation was per row and the operations were all element wise, then it would be trivial - you'd just process a row at a time, copying from DDR and processing on FPGA. If you need every element, then you'll need to be a bit clever about what you're doing, and think hard about it. – Henry Gomersall Jun 11 '15 at 15:36
  • @HenryGomersall That's exactly what I need and what I'll do: process 2 lines at a time. The big problem I see is that for each two lines I need to transfer from MM (memory-mapped) to S (stream) and then from S back to MM. I've measured these transfer rates and they are 291 MB/s and 168 MB/s, so the transfers take up a lot of cycles. I'm just worried that having to do these relatively slow transfers each time will keep my parallel design from being very fast (a sketch of the driver-side transfers follows this thread). – João Pereira Jun 12 '15 at 09:59
  • Well, have you measured your throughput on a CPU for comparison? Memory bandwidth is a serious consideration, so you want to minimise the number of times you have to copy any part to the FPGA - if you can do it once, so much the better, or split it between CPU and FPGA or some other combo. Your question is so open-ended that it's basically impossible to answer as is. Good luck with it! – Henry Gomersall Jun 12 '15 at 11:55
  • Yes, I did; I even posted a question [here](http://electronics.stackexchange.com/questions/174095/zedboard-clock-cycles-analysis). If I pass the whole matrix from DDR into the block design I run out of BRAM, so I really don't know how to pass the whole matrix in a single transfer... – João Pereira Jun 12 '15 at 13:58
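
To make the transfer side of this discussion concrete, here is a sketch of what the PS-side code could look like with the bare-metal xaxidma driver in simple (non-scatter-gather) mode. The helper name `stream_matrix` is made up, and the exact driver calls and the DMA's buffer-length register width (it must allow a 1 MB transfer) should be checked against the BSP and IP configuration. The point is to issue one large transfer per direction per matrix rather than one per pair of rows, while the HLS kernel still consumes the stream a row at a time:

```cpp
#include "xaxidma.h"
#include "xil_cache.h"

#define DIM   512
#define BYTES (DIM * DIM * sizeof(float))

int stream_matrix(XAxiDma *dma, float *src, float *dst)
{
    /* Keep the DDR buffers coherent with what the DMA will read and write. */
    Xil_DCacheFlushRange((UINTPTR)src, BYTES);
    Xil_DCacheInvalidateRange((UINTPTR)dst, BYTES);

    /* Start the receive (S2MM) channel first, then the transmit (MM2S)
     * channel, so the IP's output has somewhere to go. */
    if (XAxiDma_SimpleTransfer(dma, (UINTPTR)dst, BYTES,
                               XAXIDMA_DEVICE_TO_DMA) != XST_SUCCESS)
        return -1;
    if (XAxiDma_SimpleTransfer(dma, (UINTPTR)src, BYTES,
                               XAXIDMA_DMA_TO_DEVICE) != XST_SUCCESS)
        return -1;

    /* Busy-wait until both channels are done; interrupts would be better. */
    while (XAxiDma_Busy(dma, XAXIDMA_DEVICE_TO_DMA) ||
           XAxiDma_Busy(dma, XAXIDMA_DMA_TO_DEVICE))
        ;

    return 0;
}
```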

1 Answer


My guess is that there isn't enough space in block RAM to store everything. You're going to have to find a way to work on it in smaller pieces that will fit in block RAM.
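
For an element-wise operation like this normalization, one way to do that is to let the matrix stream through the IP and keep only a small piece of it on chip at any time. Below is a minimal sketch (all names are made up, and the AXI-Stream plumbing assumes the usual HLS ap_axis side-channel struct) that buffers a single row, so the BRAM cost stays at one 2 KB row buffer no matter how big the matrix gets; the full frames only ever live in DDR and are fed through the AXI DMA:

```cpp
#include <hls_stream.h>
#include <ap_axi_sdata.h>

#define DIM 512

typedef ap_axis<32, 1, 1, 1> axis_t;
union fp_word { float f; int i; };

void normalize_stream(hls::stream<axis_t> &in_stream,
                      hls::stream<axis_t> &out_stream,
                      float scale) {
#pragma HLS INTERFACE axis port=in_stream
#pragma HLS INTERFACE axis port=out_stream
#pragma HLS INTERFACE s_axilite port=scale
#pragma HLS INTERFACE s_axilite port=return

rows:
    for (int i = 0; i < DIM; i++) {
        float row_buf[DIM];              // one row = 512 floats = 2 KB of BRAM

    read_row:
        for (int j = 0; j < DIM; j++) {
#pragma HLS PIPELINE II=1
            fp_word w;
            w.i = in_stream.read().data.to_int();
            row_buf[j] = w.f;
        }

    write_row:
        for (int j = 0; j < DIM; j++) {
#pragma HLS PIPELINE II=1
            fp_word w;
            w.f = row_buf[j] * scale;    // element-wise normalization
            axis_t v;
            v.data = w.i;
            v.keep = -1;
            v.strb = -1;
            v.last = (i == DIM - 1) && (j == DIM - 1);
            out_stream.write(v);
        }
    }
}
```

For a purely element-wise operation you don't even need the row buffer (read, scale, and write in a single pipelined loop), but the row buffer is the shape that generalizes to the "smaller pieces" idea: whatever tile you do need for later steps, size it to fit in BRAM and stream the rest.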

alex.forencich
  • I'm really surprised by this. I thought that the FPGA side of a Zedboard had way more space. So the way to go is to divide my 512x512 matrices into smaller pieces, or do you have another suggestion? – João Pereira Jun 04 '15 at 10:00
  • Yeah, you're going to have to either do some sort of divide and conquer approach to re-use resources in time or get a bigger FPGA. – alex.forencich Jun 05 '15 at 05:41
  • And with that divide-and-conquer strategy, in your experience, do you think I will still be able to significantly reduce the running time compared to software? – João Pereira Jun 05 '15 at 10:24
  • I suspect it depends on the structure (or lack of it) of your matrix. Does it have any nice properties that make it amenable to faster strategies? Is it a changing matrix? So many questions... – Henry Gomersall Jun 11 '15 at 14:54
  • I answered under your other response. If you write up an answer with some good tips (the ones you used on your 75000x75000 matrices), I will gladly mark it as the solution. – João Pereira Jun 11 '15 at 15:09
  • Well, it heavily depends on the structure of my matrix. Almost certainly, what I am doing is not applicable to you (well, it might be, but it would more likely be a red herring, hence my questioning). – Henry Gomersall Jun 11 '15 at 15:18