3

I'm performing some operations with fractional numbers in a 16-bit FIXED-POINT processor.

I have to multiply the numbers \$ x=-6.35 \$, represented in \$ Q_{11} \$, and \$ y=-0.1 \$, represented in \$ Q_{14}\$.

First I represent the numbers in the respective notation in binary. The MSB is the sign bit.

So \$ x=11001.10100110011 \$ and \$ y=11.11100110011001 \$. I know the binary point is just in our mind and the processor treats this numbers as integers.

Ok then we multiply the numbers and get \$ x*y=11001000000100010011111001111011 \$. We eliminate the repeated sign bit and save the 16 MSB and represent the result in the appropriate format \$ Q_{10}\$: \$ x*y=100100.0000100010\$. This number corresponds to \$ - 27.966796875\$. But this doesn't make any sense, the result should be \$ 0.635\$.

What is going on here? Why is the result different? Am I missing something?

EDIT: So I realized that it was pointed out that I was performing unsigned multiplication. Taking what was said now I'm doing signed multiplication:

$$1100110100110011$$ $$1111100110011001$$ $$----------------$$ $$\color{blue}{1111111111111111}1100110100110011$$ $$\color{blue}{000000000000000}0000000000000000\text{_}$$ $$\color{blue}{00000000000000}0000000000000000\text{__}$$ $$\color{blue}{1111111111111}1100110100110011\text{___}$$ $$\color{blue}{111111111111}1100110100110011\text{____}$$ $$\color{blue}{00000000000}0000000000000000\text{_____}$$ $$\color{blue}{0000000000}0000000000000000\text{______}$$ $$\color{blue}{111111111}1100110100110011\text{_______}$$ $$\color{blue}{11111111}1100110100110011\text{________}$$ $$\color{blue}{0000000}0000000000000000\text{_________}$$ $$\color{blue}{000000}0000000000000000\text{__________}$$ $$\color{blue}{11111}1100110100110011\text{___________}$$ $$\color{blue}{1111}1100110100110011\text{____________}$$ $$\color{blue}{111}1100110100110011\text{_____________}$$ $$\color{blue}{11}1100110100110011\text{______________}$$ $$\color{blue}{0}0011001011001101\text{_______________}$$ $$----------------$$ $$\color{red}{1}\color{green}{0}{0010001000111111010101001111011}$$

The last carry (red) is eliminated. Then the repeated sign bit (green) is also shifted to the left, adding an extra 0 bit at the end. Finally we save the 16 most significant bit and represent the result in \$ Q_{10}\$:

$${0010001000111111}$$

But this is \$ 8.5615234375 \$, not \$ 0.635 \$... So we are getting closer but the result is still off. Any ideas?

Granger Obliviate
  • 1,391
  • 10
  • 29
  • 1
    If the decimal point is not always in the same place, it is not fixed point arithmetic. It is floating point. In which case you have to somehow keep track of where the decimal point is and shift the result accordingly. Also, when you multiply 2 16 bit numbers, the result could potentially require 32 bits. – user57037 Mar 27 '21 at 01:26
  • 2
    mkeith: this is not floating point. It is common to have different comma positions in fixed point for operands and results. But that doesnt make it floating point, which has its own specific meaning. – tobalt Mar 27 '21 at 07:11
  • @tobalt well without the exponent it is not floating point. I get that. But if the "comma position" is variable then you have to keep track of the comma position. Once you start tracking the comma position separately for each variable, it is pretty similar to floating point. – user57037 Mar 27 '21 at 17:16
  • 1
    mkeith: The idea of fixed point algorithms is that - although the comma position might be different for each operand - you don't need to keep track of it during run time as you define it at compile time. – tobalt Mar 27 '21 at 18:40
  • Granger, do you have your answer below? All you seem to be doing is to want support for signed Q notation multiplication. A \$Q_{11} \times Q_{14} = Q_{25}\$, obviously. If you are having any trouble, it looks to me more that it's trouble performing the signed multiplication, itself. Because the rest just falls out very easily. 0xffffcd33 \$\times\$ 0xfffff999 = 0x01453e7b. – jonk Mar 27 '21 at 19:12
  • @tobalt I just did go through a bunch of code where I did this. So I do know what you are talking about. I see how it is different from floating point. But I think true fixed point representation uses a fixed number of bits before and after the binary point (comma, whatever). Anyway, I don't want to argue about terminology. I do concede that floating point is not the right term for variable fixed point. – user57037 Mar 27 '21 at 19:16
  • @mkeith There's a wiki page on the topic. TI created the "concept" in order to document some of their DSP features. This goes back at least to late 1980's when I was working on Analog Devices ADSP-21xx family and the TI C30 and C40 DSPs. The ADSP multiplier even supported a normalized fixed format fraction type. – jonk Mar 27 '21 at 19:18
  • @mkeith [here?](https://en.wikipedia.org/wiki/Q_(number_format)) 0x01453e7b * 2^-25 = 0.635242313, which is what the OP is supposed to get. – jonk Mar 27 '21 at 19:33
  • 1
    @mkeith That's not how it works, in practice. You always keep track (have to) of the radix point, separately, if you are using a variety of formats. On the ADSP, the multiplier had a 40 bit result (39 if you exclude the sign bit.) But you were responsible for tracking the exponent sums. You could then properly truncate/round to a fixed size if you wanted to. I did entire complex-in, complex-out FFTs this way. – jonk Mar 27 '21 at 19:40
  • @jonk yes I understand that. 16-bit number times a 16-bit number, we get a 32-bit number. As you said \$ Q_{14} \times Q_{11}= Q_{25}\$. Therefore 25 bits for the fractional part, 5 bits for the integer part and 2 sign bits. Since we only need 1 sign bit, we shift to the left now having a \$ Q_{26}\$ number. Finally, we discard the 16 LSB obtaining a \$ Q_{10}\$ number. All is fine here, my doubt is really the multiplication itself, I'm not being able to get a correct result... – Granger Obliviate Mar 27 '21 at 19:44
  • @GrangerObliviate one option is to convert the negative numbers to positive or unsinged numbers, then do unsigned multiplication, then convert the result to the correct sign (based on sign of the input numbers). The conversion is very easy. Invert every bit, then add 1. This operation, bitwise invert and increment, is equivalent to negation in the two's complement system. – user57037 Mar 27 '21 at 20:01
  • 1
    @mkeith that is also a nice option since these are 2 negative numbers I can simply multiply their positive counterparts. Will try it later! – Granger Obliviate Mar 27 '21 at 20:27
  • @GrangerObliviate I'll write up something "slightly interesting" that avoids the need for you to first convert to positive values before multiplying. – jonk Mar 27 '21 at 21:01
  • @GrangerObliviate I've added an answer that I hope helps a little. – jonk Mar 27 '21 at 21:32

4 Answers4

4

What you have done is taken two signed numbers, and multiplied them together using unsigned arithmetic. Signed 2's compliment multiply is different.

In a higher-level language like C, C++ or Rust, make sure that your fractional numbers are sign extended into 32-bit signed integers and multiply them, then convert the result.

In assembly language, use the signed multiply instruction -- and if you're using a 32-bit processor and 32-bit data, make sure to sign-extend the numbers when you put them into the argument registers, or the result will be as if you did unsigned arithmetic.

TimWescott
  • 44,867
  • 1
  • 41
  • 104
3

I have to multiply the numbers \$ x=-6.35 \$, represented in \$ Q_{11} \$, and \$ y=-0.1 \$, represented in \$ Q_{14}\$. So \$ x=11001.10100110011 \$ and \$ y=11.11100110011001 \$

I have just calculated the corresponding values for \$x\$ and \$y\$ using scaling values \$2^{11} \$ and \$2^{14}\$ respectively, and found them to be correct in 2's complement signed notation. So there is nothing wrong in your starting point.

I know the binary point is just in our mind and the processor treats this numbers as integers. Ok then we multiply the numbers and get \$ x*y=11001000000100010011111001111011 \$

This is where things went wrong. You would have been absolutely correct if you were multiplying two unsigned numbers (unsigned arithmetic). It is straight forward. But if one of the numbers is signed, then you have to do signed arithmetic, which is a little different.

There are two points I would like to highlight:

  • When you multiply two signed numbers \$x\$ and \$y\$, each partial product has to be treated as signed and you have to properly sign-extend each partial product.
  • If the multiplier(\$y\$) is a negative number, the last partial product has to be computed differently. It should be computed as the 2's complement of the sign-extended multiplicant (\$x\$).

Based on the two above points I will illustrate a simple example \$x* y\$ for \$x=-3\$ and \$y = -2\$, in 3-bit 2's complement notation (pardon my naive formatting):

$$x=-3=101_2$$ $$y=-2=110_2$$ So \$x * y\$ in signed arithmetic will go like: $$101$$ $$110$$ $$---$$ $$\color{blue}{000}000$$ $$\color{blue}{11}101\text{_}$$ $$\color{blue}{0011}\text{__}$$ $$-------$$ $$\color{red}{1}{000110}$$

Notice how partial products are sign-extended (blue color).

Notice that since \$y\$ is negative, the last partial product was calculated as 2's complement of sign-extended \$x\$ which is \$ = 0011_2\$.

The 7th bit can be disregarded (red color), because the answer is only 6 bits when two 3-bit numbers are multiplied. The rest 6 bits hence form the signed result (product). Therefore, the answer is \$x*y = 000110_2=\color{green}{+6_{10}}\$.

For the same \$ x\$ and \$y\$, if you do unsigned arithmetic, you will get \$x*y= 011110_2=\color{red}{+30_{10}}\$ !!

Conclusion

In your example, while multiplying two signed binary numbers, you probably did (if hand-derived/calculator) unsigned arithmetic, not signed arithmetic, hence the wrong results.

Mitu Raj
  • 10,843
  • 6
  • 23
  • 46
  • 1
    Perfect! Thank you so much! I realized I needed to do the sign extension later, but I was still getting a wrong result. I didn't know about this property: "If the multiplier(y) is a negative number, the last partial product has to be computed differently. It should be computed as the 2's complement of the sign-extended multiplicant (x).". I think it is here that my mistake is now! I will try again and if it is correct now, I'll accept your answer. – Granger Obliviate Mar 27 '21 at 16:17
  • Hi again. Yeah I'm still getting it wrong. After doing what you told me, I got 000100010001111110101010011110110. Then I eliminate the extra sign bit and take the 16 MSB 0010001000111111. The correct format is Q10, so 001000.1000111111. But this is 8.5615234375, but not 0.635... So we are getting closer but the result is still off. Any ideas? – Granger Obliviate Mar 27 '21 at 16:59
  • I just edited my question showing the latest progress on the operation. Do you know what is wrong now? – Granger Obliviate Mar 27 '21 at 18:02
  • Your example is quite hectic and big to try on paper. If I get time, I will try it and I will let u know. – Mitu Raj Mar 27 '21 at 18:04
  • thank you, yes if you have time please check my calculations to help me understand what my mistake is. Thank you in advance for your help – Granger Obliviate Mar 27 '21 at 18:55
  • @GrangerObliviate your calculated value's 11th bit from LSB is wrong while addition. Sum got flawed with wrong carry from there afterwards. Re-check. – Mitu Raj Mar 27 '21 at 20:10
  • what do you mean? It is 1+ previous 1 carry which results in 0 with 1 carry. Is that wrong? – Granger Obliviate Mar 27 '21 at 20:34
  • I got the correct answer on addition: 0000000_10100_0101_0011_1110_0111_1011 --- which is in Q7.25. It translates to 0.635 approx. – Mitu Raj Mar 27 '21 at 20:42
  • 1
    Got it! I was not adding the numbers correctly. Like I would sum um 5 1's and only take a carry 1 instead of 10. Thank you so much! – Granger Obliviate Mar 27 '21 at 21:57
  • No problems – Mitu Raj Mar 28 '21 at 05:52
3

I believe the core problem you are having has to do with multiplying signed (and perhaps unsigned) integers in twos-complement notation.

Let me illustrate by taking a page out of the instruction set for the ADSP-21xx Family DSPs from Analog Devices:

........enter image description here

Please note that they include signed-signed multiplication, signed-unsigned multiplication, and so on. Pretty much every combination is there. There's a good reason why.

To provide a concrete example about why, let's assume that the word size is 16 bits. (I know, convenient.) Keeping in mind that a twos-complement negative number is just \$1+\left(\text{0xFFFF} - N\right)\$, then you find that multiplying two signed values results in the following table of possible permutations. (I'm using + for a positive value and - for a negative value.):

 N x M
(- x -)     [ 1 + (0xFFFF - N) ] * [ 1 + (0xFFFF - M) ]     +(M x N - [(N+M) << 16])
(- x +)     [ 1 + (0xFFFF - N) ] * M                        -(N x M - [M << 16])
(+ x -)     N * [ 1 + (0xFFFF - M) ]                        -(N x M - [N << 16])
(+ x +)     N x M                                           +(N x M)

In your case, you are performing (- x -), so you can directly multiply both your 16-bit values using an unsigned multiplication that produces a 32-bit result. Here, you'd multiply 0xCD33 by 0xF999 and get 0xC8113E7B as the result of this unsigned multiplication. Let's follow through:

    0x C811 3E7B
  - 0x CD33
  - 0x F999
  ------------
    0x 0145 3E7B

Which is the correct result (if you keep in mind that this is a \$Q_{25}\$ format, now.)

So, your answer, in binary, is: +0.1010001010011111001111011

You can round or truncate that as you see fit. But you will need to keep track of its format when doing that.

For example, if you wanted to round that into another 16-bit word while retaining maximum precision, you'd have 0.101000101010000 (the upper bit being '0' to retain the sign, of course) and that would now be in a \$Q_{15}\$ format. You have to "normalize" the value, first, counting the shifts required (6 of them) and the number of bits you are truncating/rounding off (16) in order to know that \$Q_{25} + 6 - 16 = Q_{15}\$.

This last bit is one of the reasons why a fully combinatorial barrel-shifter is very handy! If your MCU lacks one, I'd recommend finding another. The barrel-shifter is super-useful when you are screwing around like this. Analog Device's ADSP-21xx family was a marvel for its time (still is, I think.) You can read the chapters for the ADSP-21xx Family Manual at your leisure.

jonk
  • 77,059
  • 6
  • 73
  • 185
-1

I think the main problem is that you didn't do the two's complement multiplication correctly.

To convert -6.35 to binary with 11 fractional bits, do as follows.

  1. Multiply 6.35 * 211 = 13005.
  2. Optionally, convert to hex: 13005 -> 0x32CD.
  3. Convert to binary: 0011 0010 1100 1101
  4. Negate it by inverting every bit and adding 1: 0011 0010 1100 1101 -> 1100 1101 0011 1110

The binary point would be here: 1100 1.101 0011 1110.

To convert -0.1 to binary with 14 fractional bits, do as follows.

  1. Multiply 0.1 * 214 = 1638.
  2. Optionally convert to hex: 1638 -> 0x0666.
  3. Convert to binary: 0000 0110 0110 0110
  4. Negate it by inverting every bit, then adding 1: -> 1111 1001 1001 1010

The binary point would be here: 11.11 1001 1001 1010.

If you are having trouble with the two's compliment multiplication, convert the numbers to unsigned first then do the unsigned multiplication. The sign of the result is positive if both numbers have the same sign, and negative if both numbers have different signs. After multiplying using unsigned arithmetic, you can negate the result if needed (based on sign of two inputs).

It is easier to do the multiplication in decimal, but of course it doesn't matter what base is used the result will be the same number.

  1. Multiply: 1638 * 13005 = 21302190.
  2. Optionally convert to hex: 21302190 -> 0x1450BAE
  3. Convert to binary: 0001 0100 0101 0000 1011 1010 1110

In this case, we know that the result is positive because the two sign bits of the inputs were both negative. So this is the final result. If only one of the inputs had been negative, then we would negate the final result using the same procedure, invert every bit, then add one.

Since fractional bits accumulate, there are now a total of 25 fractional bits in the result. So the binary point will be located as follows:

000.1 0100 0101 0000 1011 1010 1110

If you convert that back to decimal it is pretty close to 0.635. The way I do it is as follows:

13005 * 1638 / (225) = 0.6349

But you can painstakingly add up the bits, too. 1/2 + 1/8 + 1/128 + 1/512 + 1/16384 + 1/64k + 1/128k + 1/256k + 1/1024k.... etc

user57037
  • 28,915
  • 1
  • 28
  • 81