My objective is to read seven 512×512 float matrices from the SD card into DDR memory (this step is already accomplished; each matrix occupies about 1 MB, i.e. 512 × 512 × 4 bytes), then transfer them from DDR to my custom IP block via an AXI DMA block, normalize them inside the custom IP block, and then write them back to DDR memory (also through the AXI DMA block).
I'm building my custom IP block in Vivado HLS, following the steps from a Xilinx manual (which should be the ideal way to do this, since it's from Xilinx). It works for a 32×32 matrix.
Unfortunately, when I increase the matrix dimensions to 512×512, even a kernel that only multiplies each matrix element by 2.0 reports a BRAM_18K utilization of 365%!
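For context, my kernel is essentially equivalent to the following simplified, standalone sketch (the names and exact loop structure are illustrative, and the HLS interface pragmas are omitted so it compiles anywhere). The local `buf` array is what HLS maps onto BRAM: at 512 × 512 × 4 bytes it needs 1 MB of on-chip storage, far more than the device has.

```cpp
#include <cstddef>
#include <cassert>

#define DIM 512  // matrix is DIM x DIM floats

// Naive version: copy the whole matrix into a local array, then process it.
// HLS implements buf[][] with BRAM, which is what blows up the utilization.
void scale_buffered(const float in[DIM * DIM], float out[DIM * DIM]) {
    float buf[DIM][DIM];  // 512 * 512 * 4 B = 1 MB of on-chip storage

    // Stage 1: buffer the entire input matrix on chip.
    for (std::size_t i = 0; i < DIM; ++i)
        for (std::size_t j = 0; j < DIM; ++j)
            buf[i][j] = in[i * DIM + j];

    // Stage 2: the actual (trivial) computation.
    for (std::size_t i = 0; i < DIM; ++i)
        for (std::size_t j = 0; j < DIM; ++j)
            out[i * DIM + j] = 2.0f * buf[i][j];
}
```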
What can I do to drastically decrease the percentage of resources used? I'll need to perform many operations on the matrices inside the custom IP block, so if a simple multiplication by 2.0 already uses 365% of the BRAMs, a solution that only brings this example down to 80-90% is not good enough. What I'm looking for is a solution that brings the BRAM utilization in this example down to around 5%.
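I suspect the direction is to process each element as it arrives from the DMA, with no full-matrix copy on chip, something like the sketch below (again simplified and standalone; in the real kernel the arrays would presumably be `hls::stream` AXI-Stream ports and the pragma would be active, but I have not verified the resulting utilization):

```cpp
#include <cstddef>
#include <cassert>

// Streaming version: one element in, one element out, per loop iteration.
// No local copy of the matrix, so HLS should need no BRAM for data storage.
void scale_streamed(const float *in, float *out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        // #pragma HLS PIPELINE II=1   (pragma shown as a comment here)
        out[i] = 2.0f * in[i];
    }
}
```

Is this the intended approach, or is there a better way to keep large matrices out of BRAM entirely?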