How to represent Negative real numbers in Fixed Point representation

Question

I found this module for adding two fixed-point numbers. The manual for the module is here: https://opencores.org/project,verilog_fixed_point_math_library,manual

How do I add -1.5 (or any negative real number) to 0.5 (or any other real number)?

My problem is this: I know how to represent positive numbers in fixed-point representation, but I don't know how to represent -1.5 (or any negative real number). I tried taking the 2's complement of 1.5 and giving that as the input, but the output is wrong.

Edit: I tried what Wouter and Entrepreneur suggested, but that does not work either.

Edit: Added Testbench of module.

module qadd(
input [N-1:0] a,
input [N-1:0] b,
output [N-1:0] c
);


//Parameterized values
parameter Q = 15;
parameter N = 32;

reg [N-1:0] res;

assign c = res;

always @(a,b)
begin
//both negative
if(a[N-1] == 1 && b[N-1] == 1) begin
    //sign
    res[N-1] = 1;
    //whole
    res[N-2:0] = a[N-2:0] + b[N-2:0];
end
//both positive
else if(a[N-1] == 0 && b[N-1] == 0) begin
    //sign
    res[N-1] = 0;
    //whole
    res[N-2:0] = a[N-2:0] + b[N-2:0];
end
//subtract a-b
else if(a[N-1] == 0 && b[N-1] == 1) begin
    //sign
    if(a[N-2:0] > b[N-2:0])
        res[N-1] = 1;
    else
        res[N-1] = 0;
    //whole
    res[N-2:0] = a[N-2:0] - b[N-2:0];
end
//subtract b-a
else begin
    //sign
    if(a[N-2:0] < b[N-2:0])
        res[N-1] = 1;
    else
        res[N-1] = 0;
    //whole
    res[N-2:0] = b[N-2:0] - a[N-2:0];
end
end

endmodule

//Test Bench

module qadd_tf;

// Inputs
reg [32:0] a;
reg [32:0] b;

// Outputs
wire [32:0] c;

// Instantiate the Unit Under Test (UUT)
qadd #(16,33) uut (a, b, c);

initial begin
    // Initialize Inputs

    b[32]=1;
    b[31:16]= 16'b00000000_00000001;
    b[15:0] = 16'b10000000_00000000;
    a[32]=0;
    a[31:16]= 16'b00000000_00000000;
    a[15:0] = 16'b10000000_00000000;

    #100;


    #100;
end

endmodule

For the module at the link, they don't mention taking a 2's complement. They only say to set the sign bit. So I think they do the complement for you, based on this quote: "For subtraction, set the sign bit for the negative number." — Entrepreneur, Nov 01 '17 at 12:27
In general, using 2's complement on fixed-point values should work, since you are just mentally assigning smaller weights to each bit position. For example, bit 0 no longer represents a weight of 1 and instead might represent 1/256 for a 1-byte fraction. — Entrepreneur, Nov 01 '17 at 12:31
I think this is similar to https://stackoverflow.com/a/15528913/97073 — pre_randomize, Nov 01 '17 at 13:06

Wouter van Ooijen · Answer 1 · 2017-11-01T21:59:07.783

As Entrepreneur tries to say in his comment (at least I think that's what he means), and as I read the code, they use sign-magnitude, not 2's complement.

2's-complement addition doesn't have to distinguish between positive and negative numbers (that's one of the big points in favour of 2's complement).

I don't see what scaling factor is used, so I can't say how to represent -1.5, but you seem to know how to represent 1.5, so just set the sign bit of that.

Response to comment:

"Sign Bit =0, All integer part bits are 1, and all fractional part bits are 0, when I add -1.5 to 0.5. (-1.5 is represented as suggested by Wouter.) Also now, I have added Testbench of module, you can simulate it on your PC."

I'm no expert in this language (VHDL? Verilog?), but that is what this code should produce, specifically due to the line

res[N-2:0] = a[N-2:0] - b[N-2:0];

This generates an underflow (because |b| > |a|): the magnitudes are unsigned, so subtracting the larger from the smaller wraps around. I think that section should be

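else if(a[N-1] == 0 && b[N-1] == 1) begin
    // a is positive, b is negative: always subtract the smaller magnitude from the larger
    if(a[N-2:0] >= b[N-2:0]) begin
        // |a| >= |b|: result is positive (or zero)
        res[N-1] = 0;
        res[N-2:0] = a[N-2:0] - b[N-2:0];
    end
    else begin
        // |b| > |a|: result is negative
        res[N-1] = 1;
        res[N-2:0] = b[N-2:0] - a[N-2:0];
    end
end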

(And you might want to check the last section.)
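To make the wrap-around concrete, here is a small simulation-only sketch of the two subtractions using the magnitudes from the test bench (N = 33, Q = 16, so the magnitude field is 32 bits wide). The module name underflow_demo and the localparam names are made up for illustration and are not part of the library.

```verilog
// Simulation-only sketch: shows why a_mag - b_mag wraps when |b| > |a|.
// Assumes the test bench format: 32-bit magnitude, 16 fractional bits.
module underflow_demo;
    localparam [31:0] A_MAG = 32'h0000_8000; // |a| = 0.5
    localparam [31:0] B_MAG = 32'h0001_8000; // |b| = 1.5

    initial begin
        // Smaller minus larger wraps modulo 2^32:
        // integer bits all 1, fractional bits all 0.
        $display("a_mag - b_mag = %h", A_MAG - B_MAG); // ffff0000
        // Larger minus smaller gives the intended magnitude, 1.0.
        $display("b_mag - a_mag = %h", B_MAG - A_MAG); // 00010000
    end
endmodule
```

The first value is exactly the pattern reported in the comments (all integer bits 1, all fractional bits 0), which suggests the wrong operand order is the cause rather than the input encoding.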

I tried it, but that's not working either. Any other suggestions? — Jay Patel, Nov 01 '17 at 12:56
@JayPatel I'm sure you can be more specific than "not working". Analyze the result. What is wrong? Which bits are right? You're sitting in front of your code, we aren't. — Marcus Müller, Nov 01 '17 at 13:48
@MarcusMüller The answer is: sign bit = 0, all integer part bits are 1, and all fractional part bits are 0, when I add -1.5 to 0.5. (-1.5 is represented as suggested by Wouter.) I have also added the testbench of the module now, so you can simulate it on your PC. — Jay Patel, Nov 01 '17 at 14:54
Is that the error pattern for all sensible inputs, or just a random sample? — Marcus Müller, Nov 01 '17 at 15:09

score 0 · Answer 2 · answered Nov 01 '17 at 22:18

1) Regarding your code for the cases "subtract a-b" and "subtract b-a": the two sign assignments, res[N-1] = 1; and res[N-1] = 0;, are interchanged. When the positive operand has the larger magnitude, the result must be positive (sign bit 0), not negative.

2) Also, as Wouter pointed out, you must compare the magnitudes to make sure you never subtract the larger value from the smaller one. Doing so wraps around and produces a 2's-complement negative result, which your sign-magnitude scheme does not use, so many unintended bits end up set.

There are two systems being discussed here. (Assume all values including the result fit within 32 bits).

TWO'S COMPLEMENT : First let's look at 2's complement because that is the easier one to implement and is not the method used in your code. The arithmetic step of adding or subtracting will always be exactly the same, it will be a 32-bit addition. If you input one or both numbers as negative they must be written in 2's complement, and the result of the addition will also be 2's complement. That's the favorable aspect of 2's complement addition. The adder logic doesn't care if the number is negative or positive, it combines the bits using the same exact logical step. So to achieve subtraction you need to create a negative version of the number being subtracted, and represent it in 2's complement. Then add the two values the same as always. Remember, for both integer and fixed point, taking the 2's complement of a number can be done by flipping all 32 bits then adding a "1" to the lowest bit position.

Example with 4-bit values: (5 - 3) = (5 + -3) = (0101b + 2's comp(0011b)) = (0101b + 1101b) = 1 0010b. Discarding the carry out of the 4-bit result leaves 0010b = 2.
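The same idea applied to the numbers in the question can be sketched as follows. This is only an illustration under assumptions: the names qadd_tc and qadd_tc_tb are made up, the format is assumed to be 32 bits with Q = 15 fractional bits (the defaults of the module in the question), and it is not the library's adder, which uses the sign-flag scheme described next.

```verilog
// Sketch of a two's-complement fixed-point adder (hypothetical name qadd_tc).
// One plain addition covers every combination of signs.
module qadd_tc #(
    parameter Q = 15,  // fractional bits (documentation only; the add ignores it)
    parameter N = 32   // total width
) (
    input  signed [N-1:0] a,
    input  signed [N-1:0] b,
    output signed [N-1:0] c
);
    assign c = a + b;  // overflow is not detected in this sketch
endmodule

// Minimal test bench: -1.5 + 0.5 in Q15.
module qadd_tc_tb;
    reg  signed [31:0] a, b;
    wire signed [31:0] c;

    qadd_tc #(.Q(15), .N(32)) uut (.a(a), .b(b), .c(c));

    initial begin
        a = ~(32'h0000_C000) + 1;  // -1.5: invert +1.5 (0000c000), add 1 -> ffff4000
        b = 32'h0000_4000;         // +0.5
        #1;
        $display("c = %h", c);     // ffff8000, which is -1.0 in Q15
    end
endmodule
```

Note that the position of the binary point never enters the addition; Q only matters when you convert to or from real values, or when you multiply.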

POSITIVE MAGNITUDES WITH SIGN FLAG BIT: The second system discussed here, and the one your code seems to intend, represents every number as a positive binary magnitude, with the high bit reserved as a flag that marks the magnitude as lying in the negative direction, below 0. So 3 and -3 share the same magnitude bits, _0000000 00000000 00000000 00000011, where _ marks the sign position: the high bit (31) is 0 for a positive number and 1 for a negative number. This requires the addition operation to handle the various combinations of signs and magnitudes of the two operands, because internally you are always operating on positive values. To keep from producing a negative number internally, you must check the signs and compare the magnitudes of the operands to determine which operation is required, anticipate the sign of the result, and order the subtraction so that it never goes negative. After you have calculated the magnitude, apply the proper sign to the high bit of your result: 0 if the result should be positive, 1 if it should be negative.
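A minimal sketch of an adder that follows these rules is shown below. The module name qadd_sm and the internal signal names are hypothetical; the port list and the Q/N parameters mirror the module in the question. Treat it as one possible arrangement rather than the library's official fix, and note that magnitude overflow is not handled.

```verilog
// Sketch of a sign-magnitude fixed-point adder (hypothetical name qadd_sm).
// Bit N-1 is the sign flag; bits N-2:0 hold the unsigned magnitude.
module qadd_sm #(
    parameter Q = 15,  // fractional bits (documentation only here)
    parameter N = 32   // total width including the sign bit
) (
    input  [N-1:0] a,
    input  [N-1:0] b,
    output [N-1:0] c
);
    wire         a_sign = a[N-1];
    wire         b_sign = b[N-1];
    wire [N-2:0] a_mag  = a[N-2:0];
    wire [N-2:0] b_mag  = b[N-2:0];

    reg          r_sign;
    reg  [N-2:0] r_mag;

    always @* begin
        if (a_sign == b_sign) begin
            // Same sign: add magnitudes, keep the common sign.
            r_sign = a_sign;
            r_mag  = a_mag + b_mag;
        end else if (a_mag >= b_mag) begin
            // Opposite signs, |a| >= |b|: result takes the sign of a.
            r_sign = a_sign;
            r_mag  = a_mag - b_mag;
        end else begin
            // Opposite signs, |b| > |a|: result takes the sign of b.
            r_sign = b_sign;
            r_mag  = b_mag - a_mag;
        end
        // Normalize so that zero is never reported as negative.
        if (r_mag == 0)
            r_sign = 1'b0;
    end

    assign c = {r_sign, r_mag};
endmodule
```

With the test bench inputs from the question (instantiated as #(16,33), a = +0.5, b = -1.5), the signs differ and |b| > |a|, so this takes the last branch: the sign bit is 1 and the magnitude is 1.5 - 0.5 = 1.0, that is, -1.0.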
