4

While trying to learn FPGA programming, I've decided to implement a massively parallel game of life. Here's my first attempt:

entity LifeCell is
    Port ( neighbours : in      std_logic_vector(7 downto 0);
           state        : inout std_logic;
           clk       : in   std_logic);
end LifeCell;

architecture Behavioral of LifeCell is

begin
    compute: process(clk)
    variable temp : integer range 0 to 8;
    begin
        if rising_edge(clk) then
            temp := 0;
            for i in neighbours'range loop
                if neighbours(i) = '1' then temp := temp + 1; 
                end if;
            end loop;

            if (temp = 3 or (temp = 2 and state = '1')) then
                state <= '1';
            else
                state <= '0';
            end if;

        end if;
    end process;
end Behavioral;

Then I realized that the Technology Diagram was using 13 LUTs. Hmmm... Maybe I can do better? Why not specify in advance the possible combinations?

function count3bits(v : std_logic_vector(7 downto 0)) return boolean is
begin
    case v is
        when "00000111" => return true;
        when "00001011" => return true;
        when "00001101" => return true;
        when "00001110" => return true;
        when "00010011" => return true;
        when "00010101" => return true;
        when "00010110" => return true;
        when "00011001" => return true;
        when "00011010" => return true;
        when "00011100" => return true;
        when "00100011" => return true;
        when "00100101" => return true;
        when "00100110" => return true;
        when "00101001" => return true;
        when "00101010" => return true;
        when "00101100" => return true;
        when "00110001" => return true;
        when "00110010" => return true;
        when "00110100" => return true;
        when "00111000" => return true;
        when "01000011" => return true;
        when "01000101" => return true;
        when "01000110" => return true;
        when "01001001" => return true;
        when "01001010" => return true;
        when "01001100" => return true;
        when "01010001" => return true;
        when "01010010" => return true;
        when "01010100" => return true;
        when "01011000" => return true;
        when "01100001" => return true;
        when "01100010" => return true;
        when "01100100" => return true;
        when "01101000" => return true;
        when "01110000" => return true;
        when "10000011" => return true;
        when "10000101" => return true;
        when "10000110" => return true;
        when "10001001" => return true;
        when "10001010" => return true;
        when "10001100" => return true;
        when "10010001" => return true;
        when "10010010" => return true;
        when "10010100" => return true;
        when "10011000" => return true;
        when "10100001" => return true;
        when "10100010" => return true;
        when "10100100" => return true;
        when "10101000" => return true;
        when "10110000" => return true;
        when "11000001" => return true;
        when "11000010" => return true;
        when "11000100" => return true;
        when "11001000" => return true;
        when "11010000" => return true;
        when "11100000" => return true;
        when others => return false;
    end case;
end count3bits;

function count2bits(v : std_logic_vector(7 downto 0)) return boolean is
begin
    case v is
        when "00000011" => return true;
        when "00000101" => return true;
        when "00000110" => return true;
        when "00001001" => return true;
        when "00001010" => return true;
        when "00001100" => return true;
        when "00010001" => return true;
        when "00010010" => return true;
        when "00010100" => return true;
        when "00011000" => return true;
        when "00100001" => return true;
        when "00100010" => return true;
        when "00100100" => return true;
        when "00101000" => return true;
        when "00110000" => return true;
        when "01000001" => return true;
        when "01000010" => return true;
        when "01000100" => return true;
        when "01001000" => return true;
        when "01010000" => return true;
        when "01100000" => return true;
        when "10000001" => return true;
        when "10000010" => return true;
        when "10000100" => return true;
        when "10001000" => return true;
        when "10010000" => return true;
        when "10100000" => return true;
        when "11000000" => return true;
        when others => return false;
    end case;
end count2bits;

begin
    compute: process(clk)
    begin
        if rising_edge(clk) then
            if (count3bits(neighbours) or (state = '1' and count2bits(neighbours))) then
                state <= '1';
            else
                state <= '0';
            end if;
        end if;
    end process;
end Behavioral;

Implementation is now down to 9 LUTs.

Now, here's the question. Is it possible to do better? (Spartan-6 only has 6-bit LUTs).

3 Answers3

5

Remember that Spartan-6 LUTs can be configured as a single 6-input LUT, or dual 5-input LUTs. I am specifically looking at the SLICEX variation, which is a subset and the least capable of all the Spartan-6 slices.

I think this can be done in 4 six-input LUTs.

The first three LUTs input the state of 6 of the surrounding cells, out outputs a 3 bit number that is the count of 1's in those six cells.

The fourth six-input LUT takes that 3-bit count, plus the remaining 2 surrounding cells and the current state of the center cell and output the new state of the cell.

This entire logic can fit inside a single SliceX block (but will also fit inside all other slice types). In the smallest Spartan-6, you could do 300 cells. In the largest, you can do just over 23,000.

None of this factors in the communication required to read out or set the status of the whole matrix. Reading it out could add one 5-input LUT + FF per cell, and initializing the matrix could add another 5 input LUT + FF per cell. This equates to another 6 input LUT (which has 2 FF's with it) per cell-- and a 25% increase in logic required to do a useful Game of Life.

  • Very nice. I almost got to that point. In any case, your scheme uses the same amount of resources and has the same performance as mine. – Dave Tweed May 03 '13 at 18:05
4

I believe you should be able to do this with 6 LUTs per cell. The basic approach is to divide the 8 inputs into two groups, 3 inputs in one group and 5 inputs in the other. You can count the ones in the first group with 2 LUTs, and count the ones in the second group with 3 LUTs, for a total of 5 intermediate result bits. One more LUT can combine those bits with the current state to give the next state.

Also, since the first 5 LUTs operate in parallel, the total latency is just two LUTs, so you should be able to operate this at a very high clock speed.

The following code gives the general idea. Some of the syntax details might be a little off — my VHDL is a little rusty.

-- This function uses 2 LUTs
function count3bits(v : std_logic_vector(2 downto 0)) return std_logic_vector(1 downto 0) is
begin
    case v is
        when "000" => return "00";
        when "001" => return "01";
        when "010" => return "01";
        when "011" => return "10";
        when "100" => return "01";
        when "101" => return "10";
        when "110" => return "10";
        when "111" => return "11";
        when others => return "00";
    end case;
end count3bits;

-- This function uses 3 LUTs
function count5bits(v : std_logic_vector(4 downto 0)) return std_logic_vector(2 downto 0) is
begin
    case v is
        when "00000" => return "000";
        when "00001" => return "001";
        when "00010" => return "001";
        when "00011" => return "010";
        when "00100" => return "001";
        when "00101" => return "010";
        when "00110" => return "010";
        when "00111" => return "011";
        when "01000" => return "001";
        when "01001" => return "010";
        when "01010" => return "010";
        when "01011" => return "011";
        when "01100" => return "010";
        when "01101" => return "011";
        when "01110" => return "011";
        when "01111" => return "100";
        when "10000" => return "001";
        when "10001" => return "010";
        when "10010" => return "010";
        when "10011" => return "011";
        when "10100" => return "010";
        when "10101" => return "011";
        when "10110" => return "011";
        when "10111" => return "100";
        when "11000" => return "010";
        when "11001" => return "011";
        when "11010" => return "011";
        when "11011" => return "100";
        when "11100" => return "011";
        when "11101" => return "100";
        when "11110" => return "100";
        when "11111" => return "101";
        when others  => return "000";
    end case;
end count5bits;

-- This function uses 1 LUT
function evaluate(a : std_logic_vector(1 downto 0),
                  b : std_logic_vector(2 downto 0),
                  s : std_logic) return std_logic is
    variable temp : std_logic_vector (5 downto 0);
begin
    temp := (a & b & s);
    case temp is
        -- cases where the state is true and the sum is 2
        when "10_000_1" => return '1';
        when "01_001_1" => return '1';
        when "00_010_1" => return '1';

        -- cases where the state is true and the sum is 3
        when "11_000_1" => return '1';
        when "10_001_1" => return '1';
        when "01_010_1" => return '1';
        when "00_011_1" => return '1';

        -- cases where the state is false and the sum is 3
        when "11_000_0" => return '1';
        when "10_001_0" => return '1';
        when "01_010_0" => return '1';
        when "00_011_0" => return '1';

        when others => return '0';
    end case;
end evaluate;

begin
    compute: process(clk)
    begin
        if rising_edge(clk) then
            state <= (evaluate(count3bits(neighbours(7 downto 5)),
                               count5bits(neighbours(4 downto 0)),
                               state);
        end if;
    end process;
end Behavioral;

The largest Spartan-6 has 23038 slices, each of which contains four 64-bit LUTs. Note that each of those LUTs can also be split into two 32-bit LUTs as long as they share the same inputs. This means that one Life cell only requires one slice to implement. You could do a 150×150 game of Life that runs at hundreds of millions of generations/second. Or you could use the BRAMs to double-buffer a game of Life the size of an HDTV screen (1920×1080) and run it at around a million generations/second.

Dave Tweed
  • 168,369
  • 17
  • 228
  • 393
4

I think it can fit in 3 LUTs but it won't be as elegant as David's solution. But reducing LUT is a goal, then this can work. Also, I have not worked with Spartan-6 FPGAs. I am basing this analysis on:

http://www.xilinx.com/support/documentation/user_guides/ug384.pdf specifically figure 6 on page 13.

Here it goes:

  1. LUT 0 is used as dual 5-input LUT. Its inputs are bits [4:0] of "neighbors".
  2. LUT 1 is used as 6-input LUT. Its inputs are {State, neighbors[4:0]}

Both work together to generate 3-bit encoded output as in this table:

enter image description here

Here is how to read the table. Row 1 says "if state is 0 and there are zero 1s in bits[4:0] then for Next State to be 1, there needs to be three 1s in remaining 3 bits." As you can see, there are 12 unique input combinations that need to be encoded. But there are opportunities to merge few. First, if there are 4 or 5 1s already then State won't be 1 so rows 5,6,11 and 12 can be encoded as single code point. Interestingly, rows 4 and 10 also can be represented as single code point. If there are three 1s already, then "State" is essentially don't care.

If we merge rows this way, there are 8 unique code point that contain enough information for processing remaining 3 bits. Please note that State is already encoded here and is not further required.

Code points are selected such a way that LUT0 generates consistent outputs and doesn't need to know about State input.

LUT2 will be 6-input LUT. It will have 3 bits from {LUT1, LUT0} and 3 remaining bits neighbors[7:5]. It will interpret those bits as per Action column of the table to generate next state.

I think it meets the requirement you set. I would like to hear if you see any issue.

mj6174
  • 664
  • 1
  • 5
  • 11