Input data per clock cycle: 8 data bits
Encoded word length: 1 bit / 4 bit / 5 bit / 6 bit / 7 bit
Use a 14 bit register R that holds the 8 code bits from this cycle + 6 bits of any prior partial codes from the prior cycle.
The most direct way to process all 14 bits of R is to use a bunch of block RAMs, with the register serving as a 14 bit address. What would come out is up to 64 bits of data and a 3 bit code saying how many partial code bits remain, 67 bits in total. On a Xilinx 7-series part with 2 KB (18 Kb) block RAMs, each configured as 16K x 1, this would use 67 block RAMs, one per output bit. This is not cheap, but otherwise very straightforward once you generate the RAM data.
One way to do this using far fewer resources is to use a set of 8 small distributed RAMs.
To process one 7-bit code in a single cycle you need a 128 element x 11-bit data distributed RAM. Put the code (lower 7 bits of R) on the seven address bits of the RAM. The 8-bit data + the 3-bit code length will come out of the RAM based on the code inputs.
At least on a Xilinx part a 128 x 11 distributed RAM is not that expensive (it will use 22 LUT6s and 11 F7MUXes, in 5.5 slices).
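The contents of that lookup table can be generated at elaboration time. Below is a minimal sketch, assuming codes are consumed starting from the least significant bit of R (which matches the right shifts in the listing further down). The package name, the record type code_entry_t and the function build_rom are hypothetical names made up for illustration, and rom_t is redeclared here only so the sketch stands alone. A code shorter than 7 bits leaves the upper address bits as don't-cares, so its entry is copied into every address whose low bits match.

```vhdl
library IEEE;
use IEEE.std_logic_1164.ALL;
use IEEE.numeric_std.ALL;

package decode_rom_pkg is
  type rom_t is array(0 to 127) of std_logic_vector(10 downto 0);

  --Hypothetical description of one code word (not from the original post).
  type code_entry_t is record
    bits : natural range 0 to 127;          --code word, LSB-aligned
    len  : natural range 1 to 7;            --code length in bits
    data : std_logic_vector(7 downto 0);    --decoded output byte
  end record;
  type code_table_t is array(natural range <>) of code_entry_t;

  function build_rom(table : code_table_t) return rom_t;
end package;

package body decode_rom_pkg is
  function build_rom(table : code_table_t) return rom_t is
    --Length field "000" marks addresses that match no code word.
    variable rom : rom_t := (others => (others => '0'));
  begin
    for a in rom'range loop
      for t in table'range loop
        --An address matches when its low 'len' bits equal the code word.
        if a mod 2**table(t).len = table(t).bits then
          rom(a) := std_logic_vector(to_unsigned(table(t).len, 3)) & table(t).data;
        end if;
      end loop;
    end loop;
    return rom;
  end function;
end package body;
```

With this in place the constant in the listing could be written as rom := build_rom(my_table), where my_table is a code_table_t you fill in from your actual code. A 1-bit code, for example, ends up in 64 of the 128 addresses. This assumes the code is prefix-free, so no address matches more than one entry.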
After each RAM you can use the code length output from the prior stage to shift a copy of R over by somewhere between 1 and 7 bits. The shift can be done using a 13:1 x 8 bit mux. Such a mux can be implemented using 24 LUTs.
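If the length field coming out of the lookup is only ever 0, 1, 4, 5, 6 or 7, the per-stage shift can also be written as a plain variable right shift and left to the synthesis tool to turn into a mux. This is a sketch under that assumption; the package name shift_pkg and the function drop_code are hypothetical.

```vhdl
library IEEE;
use IEEE.std_logic_1164.ALL;
use IEEE.numeric_std.ALL;

package shift_pkg is
  --Remove 'len' consumed bits from the bottom of the 14 bit register,
  --filling with zeros from the top. len = "000" leaves r unchanged.
  function drop_code(r   : std_logic_vector(13 downto 0);
                     len : std_logic_vector(2 downto 0))
    return std_logic_vector;
end package;

package body shift_pkg is
  function drop_code(r   : std_logic_vector(13 downto 0);
                     len : std_logic_vector(2 downto 0))
    return std_logic_vector is
  begin
    return std_logic_vector(shift_right(unsigned(r), to_integer(unsigned(len))));
  end function;
end package body;
```

Unlike an explicit when/else chain that lists only the legal lengths, this version also shifts for lengths 2 and 3, so the synthesized mux may come out a little larger. Whether that matters depends on the tool and the part.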
So one stage of decode + shift is 46 LUTs (22 + 24). You would need 8 stages to process everything. I don't know the exact number offhand, but all 8 stages should use fewer than 368 LUTs (8 x 46). It should actually be significantly less (closer to 150), since each successive stage will use fewer and fewer LUTs.
You can choose to either pipeline the stages for better clock rate, or cascade them combinationally to get everything done in one cycle.
Here is roughly what a non-pipelined version of the code would look like. Some details still need to be filled in.
library IEEE;
use IEEE.std_logic_1164.ALL;
use IEEE.numeric_std.ALL;
...
--decode lookups
type rom_t is array(0 to 127) of std_logic_vector(10 downto 0);
constant rom : rom_t := (...);
--Array of code signals
type code_t is array(0 to 8) of std_logic_vector(13 downto 0);
signal code : code_t;
--output bytes
type dout_t is array(0 to 7) of std_logic_vector(7 downto 0);
signal dout : dout_t;
--Array of shift signals
type shft_t is array(0 to 7) of std_logic_vector(2 downto 0);
signal shft : shft_t;
begin
--TODO: some code needs to be added here to merge the new
--data byte with the remaining code bits from the last cycle.
code(0) <= input_data_byte & code(8)(5 downto 0) when rising_edge(clk);--8 new bits + 6 bits left over from the last cycle = 14 bits
--Generate 8 stages of decode logic.
g_decode: for i in 0 to 7 generate
begin
dout(i) <= rom(to_integer(unsigned(code(i)(6 downto 0))))( 7 downto 0);--data byte output of ROM
shft(i) <= rom(to_integer(unsigned(code(i)(6 downto 0))))(10 downto 8);--code length output of ROM
--The code for the next stage is shifted over to remove the code bits from this stage
code(i + 1) <=
"0" & code(i)(13 downto 1) when shft(i) = "001" else --code length was 1
"0000" & code(i)(13 downto 4) when shft(i) = "100" else --code length was 4
"00000" & code(i)(13 downto 5) when shft(i) = "101" else --code length was 5
"000000" & code(i)(13 downto 6) when shft(i) = "110" else --code length was 6
"0000000" & code(i)(13 downto 7) when shft(i) = "111" else --code length was 7
code(i); --code length was 0 (no code decoded): pass through unchanged
--TODO: The decoder will always generate 8 outputs, but not all of them are always valid.
--Some logic needs to be added to track how many bits have been consumed and then generate a
--data-valid flag for each output.
--
end generate;
...
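For the data-valid TODO, one possible approach is to carry a count of how many undecoded bits are left alongside each stage. The sketch below is only that, a sketch: the entity valid_track and all of its port names are hypothetical, the 8 shft values are assumed to be packed into one 24 bit vector, and avail_in (the number of real bits sitting in code(0)) has to come from whatever merge logic you end up writing.

```vhdl
library IEEE;
use IEEE.std_logic_1164.ALL;
use IEEE.numeric_std.ALL;

entity valid_track is
  port (
    avail_in  : in  unsigned(3 downto 0);          --undecoded bits present in code(0), 0 to 14
    shft_flat : in  std_logic_vector(23 downto 0); --shft(7) & ... & shft(0), 3 bits each
    dvalid    : out std_logic_vector(7 downto 0);  --dvalid(i) = '1' when dout(i) is a real symbol
    avail_out : out unsigned(3 downto 0)           --bits left over for the next cycle
  );
end entity;

architecture rtl of valid_track is
  type avail_t is array(0 to 8) of unsigned(3 downto 0);
  signal avail : avail_t;
  signal ok    : std_logic_vector(7 downto 0);
begin
  avail(0) <= avail_in;

  g_valid: for i in 0 to 7 generate
    signal len : unsigned(2 downto 0);
  begin
    len <= unsigned(shft_flat(3*i + 2 downto 3*i));
    --A stage output is valid only if a code was found and all of its
    --bits were actually present in the register.
    ok(i) <= '1' when len /= 0 and len <= avail(i) else '0';
    --Only a valid stage consumes bits.
    avail(i + 1) <= avail(i) - len when ok(i) = '1' else avail(i);
  end generate;

  dvalid    <= ok;
  avail_out <= avail(8);
end architecture;
```

If you go this way, the shift in each decode stage should probably be gated by the same valid flag, so that a code straddling the cycle boundary stays in the register until the rest of it arrives.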