I have a top-level function with the following structure:
struct TYPE1 {uint8 ch[16];};
struct TYPE2 {uint8 ch[100];};
void FUNCT(hls::stream<TYPE1> &inStream, hls::stream<TYPE2> &outStream){
#pragma HLS INTERFACE axis port=inStream
#pragma HLS INTERFACE axis port=outStream
#pragma HLS DATA_PACK variable=outStream struct_level
#pragma HLS DATA_PACK variable=inStream struct_level
TYPE1 inpx;
#pragma HLS ARRAY_PARTITION variable=inpx.ch complete dim=1
TYPE2 outpx;
#pragma HLS ARRAY_PARTITION variable=outpx.ch complete dim=1
inpx = inStream.read();
L0: for (int i = 0; i < 100; i++) {
L1: for (int cha = 0; cha < 16; cha++) {
acc[i] += inpx.ch[cha] * y;
}
// do more stuff
outpx.ch[i] = x; // write temp variable
}
outStream.write(outpx);
}
This top-level function receives a stream of pixels and should process one pixel per function call. A new pixel arrives every 528 clock cycles, so the function has 528 clock cycles to work on each pixel. I would therefore like to constrain the function latency to no more than 528 clock cycles while using as few resources as possible. Since loop L0 has 100 iterations, each iteration needs to finish within ~5 clock cycles if the iterations execute sequentially, so I do not need to unroll L0. With these requirements, I added the following directives:
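The cycle budget above can be written down as compile-time arithmetic. This is only a sketch: the constant names are hypothetical, and only the numbers 528, 100 and 16 come from the description. The last check assumes that each L1 iteration costs at least one clock cycle when L1 is left as a sequential loop.

```cpp
// Numbers from the description; the names are illustrative only.
constexpr int kCyclesPerPixel = 528; // one pixel arrives every 528 clock cycles
constexpr int kOuterIters     = 100; // trip count of L0
constexpr int kInnerIters     = 16;  // trip count of L1

// Whole clock cycles available per L0 iteration (integer division rounds down).
constexpr int kCyclesPerOuterIter = kCyclesPerPixel / kOuterIters;
static_assert(kCyclesPerOuterIter == 5, "528 / 100 leaves 5 whole cycles per L0 iteration");

// Total L1 iterations per pixel. Assuming at least one cycle per L1 iteration,
// fully sequential execution of both loops cannot fit into the 528-cycle budget.
constexpr int kTotalInnerIters = kOuterIters * kInnerIters;
static_assert(kTotalInnerIters == 1600, "100 * 16 inner iterations per pixel");
static_assert(kTotalInnerIters > kCyclesPerPixel, "sequential L0 x L1 exceeds the budget");
```

So a budget of ~5 cycles per L0 iteration implies that the 16 multiply-accumulates of L1 have to overlap in some way, even if L0 itself stays rolled.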
#pragma HLS LATENCY min=500 max=528 // directive for FUNCT
#pragma HLS UNROLL factor=1 // directive for L0 loop
However, the synthesized design has a function latency of over 3000 cycles, and the log shows the following warning message:
WARNING: [SCHED 204-71] Latency directive discarded for region FUNCT since it contains subloops.
Q1: How do I place the latency constraint on the function while preserving the loops? Would it make sense to manually unroll the loop by writing all the operations consecutively as below?
{ // manually unrolled L0 and L1
acc[0] = 0; acc[0] += inpx.ch[0] * x; acc[0] += inpx.ch[1] * y; acc[0] += inpx.ch[2] * z; ........ acc[0] += inpx.ch[15] * zz; do more operations on acc[0]
acc[1] = 0; acc[1] += inpx.ch[0] * x; acc[1] += inpx.ch[1] * y; acc[1] += inpx.ch[2] * z; ........ acc[1] += inpx.ch[15] * zz; do more operations on acc[1]
.........
acc[99] = 0; acc[99] += inpx.ch[0] * x; acc[99] += inpx.ch[1] * y; acc[99] += inpx.ch[2] * z; ........ acc[99] += inpx.ch[15] * zz; do more operations on acc[99]
}
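For comparison, a pragma-driven alternative to writing the operations out by hand might look like the sketch below. It is not a verified solution and rests on several assumptions: that `#pragma HLS UNROLL` without a factor fully unrolls L1, that the body of L0 then contains no subloops, and that a LATENCY directive placed inside L0 is applied per iteration instead of being discarded. Plain arrays stand in for the `hls::stream` ports, and `process_pixel`, `coeff` and the `acc + i` step are hypothetical placeholders for the elided parts of the original code.

```cpp
#include <cstdint>

// Sketch only: plain arrays replace the stream ports so that this compiles with
// an ordinary C++ compiler, which ignores the HLS pragmas.
// coeff is a hypothetical coefficient table standing in for x, y, z, ... zz.
void process_pixel(const uint8_t in[16], const uint8_t coeff[16], uint8_t out[100]) {
    L0: for (int i = 0; i < 100; i++) {
// Assumed to constrain one iteration of L0 (528 / 100 rounds down to 5 cycles).
#pragma HLS LATENCY max=5
        uint32_t acc = 0;
        L1: for (int cha = 0; cha < 16; cha++) {
// No factor given: request a full unroll so the L0 body has no subloop.
#pragma HLS UNROLL
            acc += in[cha] * coeff[cha];
        }
        // Placeholder for "do more stuff"; the real post-processing is not shown.
        out[i] = static_cast<uint8_t>(acc + i);
    }
}
```

Whether the scheduler can actually meet `max=5` depends on the clock period, the target device and the operations hidden behind "do more stuff", so the value should be checked against the synthesis report.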
Q2: Does HLS have a limit on how many operations can be written on a single line? Will it have problems parsing or compiling the source code if I write out the operations in place of loops with, say, 1000 iterations?