Xilinx XST won't infer block ram

Question

I'm having trouble getting the design of my FPGA '80s computer to fit on a Papilio Duo board, which uses a Spartan-6 (xc6slx9). The problem stems from RAM being inferred as distributed instead of block.

Short version: I'm using a generic entity to infer the RAM blocks (see below). For any address width up to 11, XST implements the RAM as distributed; for an address width of 12 or more, it is happy to put it into block RAM. I've tried attributes to mark it as block, but that doesn't seem to work.

Current solution : widen the address width of one instance, zeroing the high address bit... now the design fits.
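A minimal sketch of what that workaround might look like, assuming the RamTrueDualPort entity shown further down is compiled into the work library. The wrapper name Ram2kWidened, the internal signal names and the constant ZERO_DATA are hypothetical and not taken from the original design; the read-only video port is modelled here by tying port B's write enable low.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper: presents a 2K x 8 interface but instantiates the
-- RAM with ADDR_WIDTH = 12, leaving the upper half of the memory unused.
entity Ram2kWidened is
    port
    (
        -- Port A (read/write, CPU side)
        clock_a : in std_logic;
        clken_a : in std_logic;
        addr_a  : in std_logic_vector(10 downto 0);
        din_a   : in std_logic_vector(7 downto 0);
        dout_a  : out std_logic_vector(7 downto 0);
        wr_a    : in std_logic;

        -- Port B (read only, video side)
        clock_b : in std_logic;
        addr_b  : in std_logic_vector(10 downto 0);
        dout_b  : out std_logic_vector(7 downto 0)
    );
end Ram2kWidened;

architecture behavior of Ram2kWidened is
    constant ZERO_DATA : std_logic_vector(7 downto 0) := (others => '0');
    signal addr_a_wide : std_logic_vector(11 downto 0);
    signal addr_b_wide : std_logic_vector(11 downto 0);
begin
    -- Zero the extra high address bit so only the lower 2K is addressed.
    addr_a_wide <= '0' & addr_a;
    addr_b_wide <= '0' & addr_b;

    ram_inst : entity work.RamTrueDualPort
        generic map
        (
            ADDR_WIDTH => 12,
            DATA_WIDTH => 8
        )
        port map
        (
            clock_a => clock_a,
            clken_a => clken_a,
            addr_a  => addr_a_wide,
            din_a   => din_a,
            dout_a  => dout_a,
            wr_a    => wr_a,

            clock_b => clock_b,
            clken_b => '1',        -- port B always enabled
            addr_b  => addr_b_wide,
            din_b   => ZERO_DATA,  -- never written from port B
            dout_b  => dout_b,
            wr_b    => '0'
        );
end behavior;
```

The intermediate signals are there because VHDL-93 does not accept an expression such as '0' & addr_a directly as a port map actual.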

Long version :

The design requires three dual-port 2048x8-bit RAM modules. One port needs read/write access (CPU access); the other is read only (video controller). The two ports are asynchronous to each other and run in different clock domains.

Originally I used this module, RamDualPort:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity RamDualPort is
    generic
    (
        ADDR_WIDTH : integer;
        DATA_WIDTH : integer := 8
    );
    port
    (
        -- Port A
        clock_a : in std_logic;
        clken_a : in std_logic;
        addr_a : in std_logic_vector(ADDR_WIDTH-1 downto 0);
        din_a : in std_logic_vector(DATA_WIDTH-1 downto 0);
        dout_a : out std_logic_vector(DATA_WIDTH-1 downto 0);
        wr_a : in std_logic;

        -- Port B
        clock_b : in std_logic;
        addr_b : in std_logic_vector(ADDR_WIDTH-1 downto 0);
        dout_b : out std_logic_vector(DATA_WIDTH-1 downto 0)
    );
end RamDualPort;

architecture behavior of RamDualPort is 
    constant MEM_DEPTH : integer := 2**ADDR_WIDTH;
    type mem_type is array(0 to MEM_DEPTH-1) of std_logic_vector(DATA_WIDTH-1 downto 0);
    shared variable ram : mem_type;
begin

    process (clock_a)
    begin
        if rising_edge(clock_a) then

            if clken_a='1' then
                if wr_a = '1' then
                    ram(to_integer(unsigned(addr_a))) := din_a;
                end if;

                dout_a <= ram(to_integer(unsigned(addr_a)));
            end if;

        end if;
    end process;

    process (clock_b)
    begin
        if rising_edge(clock_b) then

            dout_b <= ram(to_integer(unsigned(addr_b)));

        end if;
    end process;

end;

There are a couple of problems with this: 1) depending on the address width, some instances are inferred as distributed RAM (the main problem I'm asking about), and 2) those that were inferred as block RAM were implemented as read-first, which has issues with asynchronous clocks on Spartan-6 devices.

The only way I could find to fix the read-first issue was to make both ports read/write with a new module "RamTrueDualPort" as follows:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity RamTrueDualPort is
    generic
    (
        ADDR_WIDTH : integer;
        DATA_WIDTH : integer := 8
    );
    port
    (
        -- Port A
        clock_a : in std_logic;
        clken_a : in std_logic;
        addr_a : in std_logic_vector(ADDR_WIDTH-1 downto 0);
        din_a : in std_logic_vector(DATA_WIDTH-1 downto 0);
        dout_a : out std_logic_vector(DATA_WIDTH-1 downto 0);
        wr_a : in std_logic;

        -- Port B
        clock_b : in std_logic;
        clken_b : in std_logic;
        addr_b : in std_logic_vector(ADDR_WIDTH-1 downto 0);
        din_b : in std_logic_vector(DATA_WIDTH-1 downto 0);
        dout_b : out std_logic_vector(DATA_WIDTH-1 downto 0);
        wr_b : in std_logic
    );
end RamTrueDualPort;

architecture behavior of RamTrueDualPort is 
    constant MEM_DEPTH : integer := 2**ADDR_WIDTH;
    type mem_type is array(0 to MEM_DEPTH-1) of std_logic_vector(DATA_WIDTH-1 downto 0);
    shared variable ram : mem_type;
begin

    process (clock_a)
    begin
        if rising_edge(clock_a) then

            if clken_a='1' then

                if wr_a = '1' then
                    ram(to_integer(unsigned(addr_a))) := din_a;
                end if;

                dout_a <= ram(to_integer(unsigned(addr_a)));

            end if;

        end if;
    end process;

    process (clock_b)
    begin
        if rising_edge(clock_b) then

            if clken_b='1' then

                if wr_b = '1' then
                    ram(to_integer(unsigned(addr_b))) := din_b;
                end if;

                dout_b <= ram(to_integer(unsigned(addr_b)));

            end if;

        end if;
    end process;

end;

So that fixed the read-first issue and those rams going to block ram are now implemented as write-first (NB: I don't actually care about read-first/write-first except for the Spartan 6 corrupting ram read-first issue).
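For readers unfamiliar with the terms, the sketch below contrasts the two read modes in a single-clock, single-port form. It is an illustration rather than part of the design above: the entity name RamModeSketch, its ports and the two memory names are hypothetical, and which mode XST actually maps onto a block RAM is best confirmed in the synthesis report.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity used only to contrast read-first and write-first.
entity RamModeSketch is
    port
    (
        clock   : in std_logic;
        wr      : in std_logic;
        addr    : in std_logic_vector(10 downto 0);
        din     : in std_logic_vector(7 downto 0);
        dout_rf : out std_logic_vector(7 downto 0);  -- read-first result
        dout_wf : out std_logic_vector(7 downto 0)   -- write-first result
    );
end RamModeSketch;

architecture behavior of RamModeSketch is
    type mem_type is array(0 to 2047) of std_logic_vector(7 downto 0);
    signal ram_rf : mem_type;           -- signal: new value visible after the process suspends
    shared variable ram_wf : mem_type;  -- variable: new value visible immediately
begin

    -- Read-first: the signal assignment has not taken effect yet when the
    -- read happens, so a simultaneous write and read returns the OLD data.
    process (clock)
    begin
        if rising_edge(clock) then
            if wr = '1' then
                ram_rf(to_integer(unsigned(addr))) <= din;
            end if;
            dout_rf <= ram_rf(to_integer(unsigned(addr)));
        end if;
    end process;

    -- Write-first: the variable is updated before the read, so a
    -- simultaneous write and read returns the NEW data.
    process (clock)
    begin
        if rising_edge(clock) then
            if wr = '1' then
                ram_wf(to_integer(unsigned(addr))) := din;
            end if;
            dout_wf <= ram_wf(to_integer(unsigned(addr)));
        end if;
    end process;

end behavior;
```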

Now the problem is getting the smaller 2k (addrWidth 11) instances onto block ram. As mentioned I've tried attributes but it still insists on putting it in distributed ram. I couldn't find any documentation on ram_style for variables (as opposed to signals) but guessed this: (Note the bit ram:variable)

constant MEM_DEPTH : integer := 2**ADDR_WIDTH;
type mem_type is array(0 to MEM_DEPTH-1) of std_logic_vector(DATA_WIDTH-1 downto 0);
shared variable ram : mem_type;
ATTRIBUTE ram_extract: string;
ATTRIBUTE ram_extract OF ram:variable is "yes";
ATTRIBUTE ram_style: string;
ATTRIBUTE ram_style OF ram:variable is "block";

Now XST spits out this which suggests the attribute syntax is understood: (Note mention of ram_extract and ram_style)

Synthesizing Unit <RamTrueDualPort_1>.
    Related source file is "C:/Documents and Settings/Brad/Projects/fpgabee/Hardware/FPGABeeCore/RamTrueDualPort.vhd".
        ADDR_WIDTH = 12
        DATA_WIDTH = 8
    Set property "ram_extract = yes" for signal <ram>.
    Set property "ram_style = block" for signal <ram>.
    Found 4096x8-bit dual-port RAM <Mram_ram> for signal <ram>.
    Found 8-bit register for signal <dout_b>.
    Found 8-bit register for signal <dout_a>.
    Summary:
    inferred   1 RAM(s).
    inferred  16 D-type flip-flop(s).
    inferred   1 Multiplexer(s).
Unit <RamTrueDualPort_1> synthesized.

Synthesizing Unit <RamTrueDualPort_2>.
    Related source file is "C:/Documents and Settings/Brad/Projects/fpgabee/Hardware/FPGABeeCore/RamTrueDualPort.vhd".
        ADDR_WIDTH = 11
        DATA_WIDTH = 8
    Set property "ram_extract = yes" for signal <ram>.
    Set property "ram_style = block" for signal <ram>.
    Found 2048x8-bit dual-port RAM <Mram_ram> for signal <ram>.
    Found 8-bit register for signal <dout_b>.
    Found 8-bit register for signal <dout_a>.
    Summary:
    inferred   1 RAM(s).
    inferred  16 D-type flip-flop(s).
    inferred   2 Multiplexer(s).
Unit <RamTrueDualPort_2> synthesized.

However the 2k blocks still end up distributed:

2048x8-bit dual-port distributed RAM                  : 2
4096x8-bit dual-port block RAM                        : 1

If I take out the redundant address line (ie: put it back to addrWidth=11) all three instances end up distributed and the design no longer fits:

2048x8-bit dual-port distributed RAM                  : 3

What to do? I really don't want to switch back to coregen for this.

PS: I'm an amateur at this - be gentle!

You could try our RAM implementation [PoC.mem.ocram.tdp_wf](https://github.com/VLSI-EDA/PoC/blob/master/src/mem/ocram/ocram_tdp_wf.vhdl). That's a true dual-port RAM with write-first behavior. It has been synthesised and tested with different tools and boards. Other implementations reside in the same folder. The files are filled with comments :). — Paebbels, Apr 06 '17 at 04:20
Thanks @Paebbels - I'll check it out though at first glance looks like yours uses a shared clock for each port. I need separate clocks. (I'm also interested in just understanding why XST won't use bram for this) — Brad Robinson, Apr 06 '17 at 04:55
Hmmm yes, the normal TDP ram is dual clocked, the WF variation not; because the synchronizers cause a higher delay. — Paebbels, Apr 06 '17 at 08:14

score 5 · Answer 1 · answered Apr 06 '17 at 05:36

If you know precisely what you want to end up with, there's no need to have Xst try to infer it from a behavioral model.

You can instantiate a block RAM object directly in HDL code. Details on the appropriate syntax, and the options involved, can be found in Xilinx UG615: Spartan-6 Libraries Guide for HDL Designs, around page 274 ("RAMB16BWER"). You can also use the BRAM_TDP_MACRO macro, which is explained on page 20 of the same document.

You'll need to be familiar with how the Spartan-6 block RAM element works. Information on this is available in Xilinx UG383: Spartan-6 FPGA Block RAM Resources.

(Note that the block RAM has standard widths of 9, 18, or 36 bits; you will probably want to use it in 9-bit mode, and just ignore the extra bit. It's there for designs which need parity bits.)
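As a rough sketch of the macro route, the following wraps BRAM_TDP_MACRO as a 2K x 8 RAM with one read/write port and one read-only port. The generic and port names are written as I recall them from UG615 and should be checked against the guide before use; the wrapper name Ram2kBlock and the constants ZERO_DATA and WE_OFF are hypothetical. The sketch uses 8-bit ports rather than the 9-bit mode mentioned above, which, as far as I recall, the macro also accepts for an 18Kb block with an 11-bit address.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

library UNIMACRO;
use UNIMACRO.vcomponents.all;

-- Hypothetical wrapper around the Xilinx BRAM_TDP_MACRO (2048 x 8).
entity Ram2kBlock is
    port
    (
        clock_a : in std_logic;
        clken_a : in std_logic;
        addr_a  : in std_logic_vector(10 downto 0);
        din_a   : in std_logic_vector(7 downto 0);
        dout_a  : out std_logic_vector(7 downto 0);
        wr_a    : in std_logic;

        clock_b : in std_logic;
        addr_b  : in std_logic_vector(10 downto 0);
        dout_b  : out std_logic_vector(7 downto 0)
    );
end Ram2kBlock;

architecture structural of Ram2kBlock is
    constant ZERO_DATA : std_logic_vector(7 downto 0) := (others => '0');
    constant WE_OFF    : std_logic_vector(0 downto 0) := "0";
    signal we_a        : std_logic_vector(0 downto 0);
begin
    -- The macro takes the write enable as a vector (one bit at this width).
    we_a(0) <= wr_a;

    bram_inst : BRAM_TDP_MACRO
        generic map
        (
            BRAM_SIZE     => "18Kb",
            DEVICE        => "SPARTAN6",
            READ_WIDTH_A  => 8,
            WRITE_WIDTH_A => 8,
            READ_WIDTH_B  => 8,
            WRITE_WIDTH_B => 8,
            WRITE_MODE_A  => "WRITE_FIRST",
            WRITE_MODE_B  => "WRITE_FIRST"
        )
        port map
        (
            DOA    => dout_a,
            DOB    => dout_b,
            ADDRA  => addr_a,
            ADDRB  => addr_b,
            CLKA   => clock_a,
            CLKB   => clock_b,
            DIA    => din_a,
            DIB    => ZERO_DATA,  -- port B is never written
            ENA    => clken_a,
            ENB    => '1',
            REGCEA => '1',
            REGCEB => '1',
            RSTA   => '0',
            RSTB   => '0',
            WEA    => we_a,
            WEB    => WE_OFF
        );
end structural;
```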

Thanks @duskwuff. I was aware of other approaches like this but wanted to understand if there was a reason for what I'm seeing. I thought perhaps I had the attribute syntax slightly wrong or something else trivial. Thanks again - those links will be super handy. — Brad Robinson, Apr 06 '17 at 07:31

scary_jeff · Answer 2 · 2017-04-06T10:53:33.990

Firstly, the supported coding styles are documented in the synthesis user guide. For Xilinx ISE, this would be the XST user guide. Page 200 and onward explains what is supported. Sadly it uses the old conv_integer function, but you are already using the newer numeric_std equivalents, and these will work just the same with XST. If you can find an older version of this document, it will contain a lot more examples. There are also examples in ISE; go to Edit > Language Templates > VHDL > Synthesis Constructs > Coding Examples > RAM. These include a specific 'Write-first' example under Block RAM > Dual Port > Asymmetric Ports.

Now on to your issues:

Depending on address width some are being inferred as distributed

This is deliberate. It is not efficient in most cases to use a block RAM when the address width is small. How small is 'small' will depend on the device family in use, but in general, if one of the data bits can be implemented using LUT resources and fit into one configurable logic block (CLB), then a distributed RAM will perform as well as a block RAM would have.

For your particular device, look at UG384, which has a section on 'Distributed RAM', containing a table listing what RAM sizes can be efficiently implemented in a single CLB. That's not to say that larger memories cannot use distributed RAM, just that one larger than this will not operate at as high a clock rate as a block RAM would have.

On top of this, distributed RAM cannot implement a 'true' dual-port memory, so if you have more than one port, and either multiple read addresses are used separately, or multiple write enables are used separately, XST should infer block RAM. Normally you will only be using the write interface of one port, and the read interface of another, so in my experience this is rarely a factor.

You should not normally be worried about whether a memory ends up being implemented using distributed or block RAM. Think of it as an advantage; because you have inferred the RAM instead of instantiated a primitive or used CoreGen, the tools can choose how to implement it based on the resources available in the device, what constraints you have set up, synthesis settings, etc.

It does seem odd that if your design doesn't fit, it isn't making use of the available block RAMs. I don't think you are doing anything wrong. If you really want to force a block RAM, you could try right-clicking 'Synthesize', choosing 'Process Properties', then under 'HDL Options', change 'RAM Style' to 'Block'. Note that in the help, the description for 'Auto' is:

XST determines the best implementation for each macro.

In general, I would be wary of changing this without a good understanding of exactly why the design will be better by forcing block RAM. This option could also be very wasteful if you have many small memories.

You could also try setting the synthesis 'Optimization Goal' to 'Area'. This might change the threshold for when a block RAM gets inferred over a distributed RAM.

Lastly, make sure you are using the latest tool version (14.7).

Hi @scary_jeff - thanks for the detailed reply. Most of what you've said I'm aware of and makes total sense to me. I normally don't worry about where rams end up - it's just that it wasn't fitting and I needed to dig into why. I wouldn't change the global/project options for ram style - that seems a bit extreme. Frustrating that the attributes that are supposed to control this don't seem to work. Yep running latest tools (ISE WebPack 14.7). Thanks again. — Brad Robinson, Apr 06 '17 at 12:11
@BradRobinson I have found before that the `ram_style` attribute only worked for forcing distributed RAM, and not the other way round. I feel your frustration! — scary_jeff, Apr 06 '17 at 15:28

score 1 · Answer 3 · answered Apr 10 '17 at 23:38

So a quick follow-up... as part of switching to coregen block RAM for this, I wrapped my existing RamTrueDualPort inside another module that effectively just passed everything through. (The intent was to use generics to switch the actual underlying implementation.)

Turns out I didn't need to - simply wrapping the original module inside another was enough for XST to start inferring block ram for the smaller blocks.

Go figure...
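For reference, a pass-through wrapper of the kind described might look like the sketch below. The name RamTrueDualPortWrapper and the instance label are hypothetical, and it assumes RamTrueDualPort from the question is compiled into the work library. There is no guarantee that this changes what XST infers in other designs or tool versions.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical pass-through wrapper: same generics and ports as
-- RamTrueDualPort, with every signal forwarded unchanged.
entity RamTrueDualPortWrapper is
    generic
    (
        ADDR_WIDTH : integer;
        DATA_WIDTH : integer := 8
    );
    port
    (
        -- Port A
        clock_a : in std_logic;
        clken_a : in std_logic;
        addr_a  : in std_logic_vector(ADDR_WIDTH-1 downto 0);
        din_a   : in std_logic_vector(DATA_WIDTH-1 downto 0);
        dout_a  : out std_logic_vector(DATA_WIDTH-1 downto 0);
        wr_a    : in std_logic;

        -- Port B
        clock_b : in std_logic;
        clken_b : in std_logic;
        addr_b  : in std_logic_vector(ADDR_WIDTH-1 downto 0);
        din_b   : in std_logic_vector(DATA_WIDTH-1 downto 0);
        dout_b  : out std_logic_vector(DATA_WIDTH-1 downto 0);
        wr_b    : in std_logic
    );
end RamTrueDualPortWrapper;

architecture behavior of RamTrueDualPortWrapper is
begin
    inner_ram : entity work.RamTrueDualPort
        generic map
        (
            ADDR_WIDTH => ADDR_WIDTH,
            DATA_WIDTH => DATA_WIDTH
        )
        port map
        (
            clock_a => clock_a,
            clken_a => clken_a,
            addr_a  => addr_a,
            din_a   => din_a,
            dout_a  => dout_a,
            wr_a    => wr_a,

            clock_b => clock_b,
            clken_b => clken_b,
            addr_b  => addr_b,
            din_b   => din_b,
            dout_b  => dout_b,
            wr_b    => wr_b
        );
end behavior;
```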
