
Let's say we have a database of five thousand 512-point discrete signals, each entry unique. The important point about these signals is that more than half of the 512 points are zero in every signal. Now we capture a 512-point discrete signal from outside (the details don't matter). This signal corresponds to one of the five thousand entries in the database. I compare the acquired signal by taking the Spearman correlation of the acquired signal with each of the five thousand signals; the entry that gives the highest correlation coefficient is the closest match and identifies the acquired signal. This correlation operation, especially a 512-point correlation, consumes a lot of time in MATLAB. Obviously, there is latency on the PC, which is the main reason for the time consumed: 2.3 seconds on average to correlate the acquired signal with each of the 5000 database entries.
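For reference, a minimal sketch of the brute-force matching described above, assuming the database is stored as a hypothetical 5000×512 matrix D and the acquired signal as a 512-element vector x (both names are placeholders):

    % Brute-force matching: Spearman correlation of x against every row of D.
    % corr (Statistics Toolbox) re-ranks both inputs internally on every call.
    rho = corr(D', x(:), 'Type', 'Spearman');   % 5000x1 vector of coefficients
    [bestRho, bestIdx] = max(rho);              % highest coefficient = closest match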

Now suppose I want to implement exactly this on an FPGA (Xilinx Virtex-7 family). I think the correlation can be done in parallel, because there is only one acquired signal and the database entries can be stored on the FPGA. Not all 5000 signals can be correlated in parallel, but if at least 1000 can be, the time on the FPGA should drop considerably. So my question is: how many signals could I correlate at one time on this FPGA, and approximately how long would the full 512-point correlation take if implemented on it?

  • First estimate storage requirements and price an FPGA with the required internal storage, or determine how fast you can stream all 5000 signals from external memory (SRAM, SDRAM, DDR, etc.). – Jul 26 '16 at 09:26

2 Answers


If we assume 2 bytes (16 bits) per sample point, then you have about 5 MB of database data (5000 × 512 × 2 bytes), which pretty much dictates external memory on all but the very largest FPGAs. The external memory interface will then be the rate-limiting step in the correlation process.

If you have a number of DDR memory chips, you could have an interface that's, say, 64 bits wide running at 200 MHz or so, which gives you a raw bandwidth of 3.2 GB/s. That allows you to scan through the entire database in under 2 ms.
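The back-of-envelope numbers behind that estimate (assuming a 64-bit DDR interface at a 200 MHz clock, transferring on both clock edges):

    bytesPerSample = 2;                       % 16-bit samples
    dbBytes   = 5000 * 512 * bytesPerSample;  % ~5.12 MB of database data
    busBytes  = 64 / 8;                       % 64-bit-wide interface
    transfers = 2 * 200e6;                    % DDR: two transfers per clock
    bandwidth = busBytes * transfers;         % 3.2e9 bytes/second
    scanTime  = dbBytes / bandwidth           % ~1.6 ms per full database pass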

Then, you just need to make sure you have enough parallel logic to calculate your correlation as fast as the data comes in. I'm not familiar enough with the details of Spearman Correlation to offer any suggestions there.

Dave Tweed

Without knowing more about your application, it's pretty clear that the problem is in MATLAB, not the PC. From your description, your processing algorithm is horribly inefficient, and a software rethink may save you a lot of time, effort, and money.

Consider the Spearman correlation. Instead of correlating the two data arrays x() and y() directly, it performs the operation $$ C = 1-\frac{6\sum_i (R_{x,i}-R_{y,i})^2}{n(n^2-1)} $$ where Rx and Ry are the rankings of the elements of x and y. That is, their positions in a sorted list.
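As a sanity check, here is that formula evaluated directly from rankings (tiedrank is from the Statistics Toolbox; note that this closed-form expression matches corr(...,'Type','Spearman') only when there are no ties, which holds for continuous-valued data):

    x = randn(512, 1);  y = randn(512, 1);    % example data, tie-free
    Rx = tiedrank(x);   Ry = tiedrank(y);     % rankings of each array
    n  = numel(x);
    C  = 1 - 6 * sum((Rx - Ry).^2) / (n * (n^2 - 1))
    % agrees with corr(x, y, 'Type', 'Spearman') in the tie-free case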

So when MATLAB calculates the summation, it first has to sort 5001 separate 512-element arrays and produce a ranking list for each, and this is what eats the time. As an illustration, a compiled BASIC program on my machine takes only about 10 msec to perform the core multiply-accumulates on 5000 512-sample arrays, producing a 5000-element array of results.

This is a gross waste of time: the rankings for 5000 of the 5001 arrays can be precomputed, since they don't change between runs.

So, rather than brute-forcing it in MATLAB, start by sorting your 5000 reference arrays and generating a corresponding ranking array for each. You only need to do this once; you then reuse these precomputed ranking arrays each time you evaluate a new data array.

When you get a new data array, sort it and produce a ranking array, then perform your squaring/accumulation. This will take far less time than your current approach.
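A minimal sketch of that restructuring, again using the placeholder names D (5000×512 reference matrix) and x (new 512-sample acquisition); bsxfun keeps it compatible with older MATLAB releases, and as with the closed-form formula above, the result is exact only for tie-free data:

    % One-time setup: rank every reference signal (rows of D).
    [m, n] = size(D);                         % m = 5000, n = 512
    Rdb = zeros(m, n);
    for k = 1:m
        Rdb(k, :) = tiedrank(D(k, :));
    end

    % Per acquisition: one ranking, then a single vectorized MAC pass.
    Rx  = tiedrank(x(:)');                    % rank the new signal once
    d2  = sum(bsxfun(@minus, Rdb, Rx).^2, 2); % sum of squared rank differences
    rho = 1 - 6 * d2 / (n * (n^2 - 1));       % Spearman coefficients, 5000x1
    [bestRho, bestIdx] = max(rho);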

Also note that, depending on what you do with your correlations, you may not need to do the division or the subtraction for every entry: C decreases monotonically as the accumulated sum grows, so the ordering of the matches is determined by the MAC results alone. For instance, if you are only interested in the 10 best correlations, you can sort the MAC results, pick the 10 smallest sums, and do the final computations only for those.
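Since the ordering is determined by the sums alone, the shortcut looks like this (continuing with d2 and n from the sketch above):

    % Ordering by d2 alone gives the same ranking as the full formula.
    [~, order] = sort(d2);                    % ascending: smallest d2 = best
    best10 = order(1:10);                     % indices of the 10 best matches
    rho10  = 1 - 6 * d2(best10) / (n * (n^2 - 1));  % finish only these 10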

All of this may be read as a suggestion that, rather than posting on the EE SE, you might try the Computational Science SE and ask for ways to speed up your computations.

WhatRoughBeast