
I have byte-wide input data streaming through my system, and I'm implementing Huffman decoding on this incoming data. Since Huffman codewords don't have a fixed length, I need to build a very complex state machine covering all the possible combinations in order to minimize the decoding time. (The primary consideration of the implementation is to decode as many words as possible from the incoming data each cycle, and push only an incomplete codeword to the next cycle.)

I'm sure this implementation is inefficient; in particular, the combinatorial logic grows quickly, which hurts the system Fmax.

Is there an alternative technique that can ease the implementation? Fast decoding time is still the primary consideration.

Example:

  1. Input data per clock cycle : 8 data bit
  2. Encoded word length: 1 bit / 4 bit / 5 bit / 6 bit / 7 bit

Case 1) In a single clock cycle, the input data can contain multiple complete codewords, e.g. four in total (3 × 1-bit codewords and 1 × 5-bit codeword). This is a very straightforward case, since we can simply decode all of the incoming data in a single clock cycle.

Case 2) In a single clock cycle, the incoming 8 bits contain only 1 × 6-bit codeword. In this case, we need to push the leftover 2 bits into the next clock cycle's decoding. Hence, the next decoding process needs to decode from 10 bits of data instead of the original 8 bits.

Based on Case 2, if we apply this to all the possible decoding combinations, it translates into very bulky and complex logic. (This is the original problem statement.)
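For illustration, the carry-over behaviour in both cases can be modelled in software. This sketch assumes a made-up prefix code with the stated lengths (the actual codebook isn't given in the question):

```python
# Hypothetical prefix code using the stated lengths (1/4/5/6/7 bits).
CODEBOOK = {
    "0":       "A",  # 1-bit codeword
    "1000":    "B",  # 4-bit
    "10010":   "C",  # 5-bit
    "100110":  "D",  # 6-bit
    "1001110": "E",  # 7-bit
}

def decode_cycle(bits):
    """Decode as many complete codewords as possible from the bit string.
    Returns (symbols, leftover) -- the leftover bits are prepended to the
    next cycle's 8 new bits, exactly as in Case 2."""
    symbols = []
    pos = 0
    while pos < len(bits):
        for length in (1, 4, 5, 6, 7):        # the only legal codeword lengths
            cand = bits[pos:pos + length]
            if len(cand) == length and cand in CODEBOOK:
                symbols.append(CODEBOOK[cand])
                pos += length
                break
        else:
            break                             # only a partial codeword remains
    return symbols, bits[pos:]

# Case 1: 3 x 1-bit codewords + 1 x 5-bit codeword fill the byte exactly
print(decode_cycle("000" + "10010"))   # (['A', 'A', 'A', 'C'], '')
# Case 2: one 6-bit codeword, 2 bits pushed to the next cycle
print(decode_cycle("100110" + "10"))   # (['D'], '10')
```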

Learner
  • There's way too much info missing on what you already have, where the data, the codebook, and the output come and go to, what a "cycle" is to you, etc. for us to jump in and say "that's how you solve that!". – Marcus Müller Oct 05 '20 at 08:02
  • @MarcusMüller, I have edited the original post to elaborate more on the problem statement. By 'cycle' I simply mean that the implementation targets completing as many decoded words as possible in a single clock cycle (from a hardware circuitry perspective). – Learner Oct 05 '20 at 08:38
  • uff! That design is unusual, to say the least. Because this means that *per clock cycle* (!!) you need to be able to 1. segmentize the input word, 2. do up to 8 codeword->source word lookups, 3. emit 1 to 8 different source words. I don't even think step 1 is possible at any relevant clock rate, because for the (potentially) 8th codeword, you need to know where the 7th ends, and for that you need to know where the 6th ends... and so on, and that's a sequential problem that can't be parallelized for any speed gain – you'd be better off pipelining all this. What of your design already exists? – Marcus Müller Oct 05 '20 at 08:50
  • Software would decode 8 bits at a time with a lookup table. Can you do that in hardware? (note that if you have e.g. 2048-word RAM available then there's probably no point using less than the full address space, so you may as well expand it to 11 bits) – user253751 Oct 05 '20 at 17:39

2 Answers


If there's one thing that FPGAs are good at, it's lookup tables: both the little LUTs used for logic and the larger block RAMs that can be used as ROMs. In order to avoid falling behind, in the worst case you need to be able to decode up to 14 input bits (two 7-bit symbols) at a time. A 16K-word (but rather wide) ROM can easily do this in a single clock cycle. You just pre-compute all of the possible decodings of the 14-bit combinations and load them into the ROM. The outputs of the ROM will be:

  • The number of complete symbols found in the current set of 14 input bits (1-8, 3 bits)1
  • The decoded symbols (8 × 7 bits)
  • The number of bits left over for the next cycle (2-8, 3 bits)

for a total of 62 bits of width. This is just under a megabit of block RAM, which is well within the capability of many modern FPGAs. You just need to add a 16-bit register to hold the input data and a barrel shifter2 to move it around as each new byte comes in.
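The "pre-compute all of the possible decodings" step is just a software loop over all 2^14 addresses. A rough sketch (the codebook here is a made-up prefix code with the question's lengths; a real Huffman code is complete, so every address would decode cleanly):

```python
# Sketch of generating the 16K-entry decode ROM in software.
# CODEBOOK is a made-up prefix code with the stated lengths.
CODEBOOK = {
    "0":       "A",  # 1-bit codeword
    "1000":    "B",  # 4-bit
    "10010":   "C",  # 5-bit
    "100110":  "D",  # 6-bit
    "1001110": "E",  # 7-bit
}

def rom_entry(addr):
    """ROM contents for one 14-bit address: (symbol count, symbols,
    leftover bit count), capping the output at 8 symbols per cycle."""
    bits = format(addr, "014b")               # 14-bit window, MSB first
    symbols = []
    pos = 0
    while pos < 14 and len(symbols) < 8:      # at most 8 symbols emitted
        for length in (1, 4, 5, 6, 7):
            cand = bits[pos:pos + length]
            if len(cand) == length and cand in CODEBOOK:
                symbols.append(CODEBOOK[cand])
                pos += length
                break
        else:
            break                             # partial codeword at the end
    return len(symbols), symbols, 14 - pos

rom = [rom_entry(a) for a in range(1 << 14)]  # one entry per 14-bit pattern
print(rom[0][0], rom[0][2])                   # 8 6
```

Address 0 (fourteen '0' bits) decodes to eight 1-bit symbols with 6 bits left over, which is exactly the 8-symbol cap described above.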


1 Note that it doesn't matter if there are more than 8 symbols in the current set of input bits. If that happens, the rest of them get decoded on the next clock cycle.

2 And remember, if your FPGA has dedicated hardware multiplier modules, they make great barrel shifters!
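The multiplier trick works because a left shift by n is just multiplication by 2^n: feed the DSP block a one-hot operand and it shifts in a single multiplier delay. A quick software illustration (widths chosen to match the 14-bit window above):

```python
# x << n equals x * 2**n, so a hardware multiplier fed a one-hot operand
# (1 << n) performs an n-bit barrel shift in one multiplier delay.
def barrel_shift_left(x, n, width=14):
    return (x * (1 << n)) & ((1 << width) - 1)    # multiply, trim to width

# A logical right shift falls out the same way: multiply by 2**(width - n)
# and keep the upper `width` bits of the product.
def barrel_shift_right(x, n, width=14):
    return (x * (1 << (width - n))) >> width

print(barrel_shift_left(0b1, 5))       # 32
print(barrel_shift_right(0b10011010, 2))  # 38 (0b100110)
```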

Dave Tweed
Input data per clock cycle : 8 data bit
Encoded word length: 1 bit / 4 bit / 5 bit / 6 bit / 7 bit

Use a 14-bit register R that holds the 8 code bits from this cycle plus up to 6 bits of any partial code left over from the prior cycle.

The most direct way to process all 14 data bits is to use a bank of block RAMs, with the register serving as a 14-bit address. What comes out is up to 64 bits of decoded data plus a 3-bit code saying how many partial code bits remain. On a Xilinx 7-series part with 2 KB block RAMs, this would use 67 block RAMs. That is not cheap, but otherwise very straightforward once you generate the RAM data.

One way to do this using far fewer resources is a cascade of 8 small distributed RAMs.

To process one codeword of up to 7 bits in a single cycle, you need a 128-element x 11-bit distributed RAM. Put the lower 7 bits of R on the seven address bits of the RAM. The 8-bit decoded data plus the 3-bit code length will come out of the RAM based on the code input.

At least on a Xilinx part, a 128 x 11 distributed RAM is not that expensive (it will use 22 LUT6s and 11 F7MUXes, in 5.5 slices).

After each RAM, you use the code-length output from that stage to shift a copy of R over by between 1 and 7 bits to feed the next stage. The shift can be done using a 13:1 x 8-bit mux, which can be implemented in about 24 LUTs.

So one stage of decode + shift is about 46 LUTs, and you need 8 stages to process everything. I don't know the exact number offhand, but all 8 stages together should use fewer than 368 LUTs. The real figure should be significantly lower (closer to 150), since each successive stage needs fewer and fewer LUTs.

You can choose to either pipeline the stages for better clock rate, or cascade them asynchronously to get everything done in one cycle.
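This stage-by-stage scheme can be sanity-checked with a quick software model before committing to RTL. The codebook here is made up (the real one isn't given), and a `consumed` counter stands in for the data-valid tracking that the RTL needs:

```python
# Software model of the 8-stage cascade, one stage per decoded symbol.
# CODEBOOK is a made-up prefix code using the stated lengths.
CODEBOOK = {
    "0":       "A",  # 1-bit codeword
    "1000":    "B",  # 4-bit
    "10010":   "C",  # 5-bit
    "100110":  "D",  # 6-bit
    "1001110": "E",  # 7-bit
}

def stage(window):
    """One stage: the 128-entry distributed-RAM lookup. Matches a codeword
    at the front of the window; returns (symbol, code length), with
    length 0 meaning no complete codeword was found."""
    for length in (1, 4, 5, 6, 7):
        cand = window[:length]
        if len(cand) == length and cand in CODEBOOK:
            return CODEBOOK[cand], length
    return None, 0

def decode_14(window):
    """Cascade 8 stages over a 14-bit window. The `consumed` counter
    models the valid-bit tracking: it stops later stages from decoding
    the zero fill that the shifter pushes in."""
    out, consumed = [], 0
    for _ in range(8):                          # 8 unrolled stages
        sym, shift = stage(window)
        if shift == 0 or consumed + shift > 14:
            break                               # partial codeword: carry over
        out.append(sym)
        consumed += shift
        window = window[shift:] + "0" * shift   # shift consumed bits out
    return out, 14 - consumed                   # symbols + leftover bit count

print(decode_14("100110" + "1001110" + "0"))    # (['D', 'E', 'A'], 0)
print(decode_14("0" * 14))                      # eight 'A's, 6 bits left over
```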

Here is roughly what a non-pipelined version of the code would look like. Some details still need to be filled in.

library IEEE;
use IEEE.std_logic_1164.ALL;
use IEEE.numeric_std.ALL;

...

--decode lookups: 8-bit symbol in bits 7:0, 3-bit code length in bits 10:8
type rom_t is array(0 to 127) of std_logic_vector(10 downto 0);
constant rom : rom_t := (...);

--Array of code signals (the 14-bit decode window seen by each stage)
type code_t is array(0 to 8) of std_logic_vector(13 downto 0);
signal code : code_t;

--output bytes
type dout_t is array(0 to 7) of std_logic_vector(7 downto 0);
signal dout : dout_t;

--Array of shift (code length) signals
type shft_t is array(0 to 7) of std_logic_vector(2 downto 0);
signal shft : shft_t;

begin

--TODO:  some code needs to be added here to merge the new data byte
--with the remaining code bits from the last cycle.  (VHDL-2008
--registered conditional assignment; 8 new bits + 6 carried bits = 14.)
code(0) <= input_data_byte & code(8)(5 downto 0) when rising_edge(clk);

--Generate 8 stages of decode logic.
g_decode: for i in 0 to 7 generate
begin
    dout(i) <= rom(to_integer(unsigned(code(i)(6 downto 0))))( 7 downto 0); --data byte output of ROM
    shft(i) <= rom(to_integer(unsigned(code(i)(6 downto 0))))(10 downto 8); --code length output of ROM

    --The window for the next stage is shifted over to remove the code bits
    --consumed by this stage (note every alternative is 14 bits wide).
    code(i + 1) <=
                    code(i)              when shft(i) = "000" else --no complete code
        "0"       & code(i)(13 downto 1) when shft(i) = "001" else --code length was 1
        "0000"    & code(i)(13 downto 4) when shft(i) = "100" else --code length was 4
        "00000"   & code(i)(13 downto 5) when shft(i) = "101" else --code length was 5
        "000000"  & code(i)(13 downto 6) when shft(i) = "110" else --code length was 6
        "0000000" & code(i)(13 downto 7);                          --code length was 7

    --TODO:  The decoder will always generate 8 outputs, but not all of them
    --are always valid.  Some logic needs to be added to track how many bits
    --have been consumed and then generate a data-valid flag for each output.

end generate;

...
user4574