I have a particular case of floating-point addition. As you know, for given floating-point numbers \$x, y\$, one of the steps of the addition involves the fixed-point sum:

$$ s = 1.m_x + (-1)^{s_x \oplus s_y} 2^{-\delta}1.m_y, $$

I'm assuming \$|x| > |y|\$; \$s_x\$ and \$s_y\$ are the signs of the inputs, and \$\delta\$ is the exponent difference. In all my references, the usefulness of the guard, round and sticky bits is always explained for the case where a subtraction is performed, which can require a left shift as the normalization step. I now have this special case to analyze:

$$ t = 1.m_x + 2^{-\gamma} 1.m_y + 2^{-\delta} 1.m_z $$

where $$ 0 \leq \gamma \leq \delta. $$
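To make the range of \$t\$ explicit, here is a quick bound. It assumes normalized significands, so that every \$1.m\$ lies in \$[1, 2)\$:

$$ 1 \leq 1.m_x \leq t < 2 + 2^{1-\gamma} + 2^{1-\delta} \leq 6 $$

Under that assumption the leading one of \$t\$ sits at weight \$2^0\$, \$2^1\$ or \$2^2\$, so normalization would need a right shift of 0, 1 or 2 positions.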

It is easy to see that the normalization process in such a case can only ever require a right shift, never a left shift. I want to perform a correctly rounded operation, so I would like to understand how to implement it. As a starting point, going back to the computation of \$s\$, I propose to analyze this special case:

$$ s = 1.m_x + 2^{-\delta}1.m_y, $$

Are all three special bits still necessary in this case, or are just two of them enough (guard and sticky, maybe)?
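To experiment with this, here is a minimal Python sketch. It assumes round-to-nearest-even and integer significands of `p` bits with the hidden bit included; the names `round_rne`, `add_guard_sticky` and `check` are made up for illustration. It aligns \$1.m_y\$ while keeping only one guard bit plus a sticky bit, then compares the result against rounding the exact sum.

```python
def round_rne(n, p):
    """Round a positive integer n to p significant bits, ties to even.
    Returns (q, shift) such that the rounded value is q * 2**shift."""
    shift = n.bit_length() - p
    if shift <= 0:
        return n << -shift, shift
    q, rem = n >> shift, n & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    if rem > half or (rem == half and q & 1):
        q += 1
        if q == 1 << p:          # rounding overflowed to the next binade
            q >>= 1
            shift += 1
    return q, shift


def add_guard_sticky(mx, my, delta, p):
    """Same-sign sum of 1.m_x and 2**-delta * 1.m_y using one guard bit
    and one sticky bit. Returns (significand, exponent increment)."""
    x_ext = mx << 1                       # append a zero guard bit to x
    y_wide = my << 1                      # y with room for the guard bit
    y_ext = y_wide >> delta               # aligned y, guard bit kept
    sticky = (y_wide & ((1 << delta) - 1)) != 0   # OR of shifted-out bits
    total = x_ext + y_ext
    exp_inc = 0
    if total >> (p + 1):                  # carry out: one right shift
        sticky = sticky or bool(total & 1)
        total >>= 1
        exp_inc = 1
    round_bit = total & 1
    kept = total >> 1
    if round_bit and (sticky or kept & 1):
        kept += 1
        if kept == 1 << p:                # rounding overflow
            kept >>= 1
            exp_inc += 1
    return kept, exp_inc


def check(p=4, max_delta=10):
    """Brute-force comparison against exact rounding for small p."""
    for mx in range(1 << (p - 1), 1 << p):
        for my in range(1 << (p - 1), 1 << p):
            for delta in range(max_delta + 1):
                exact = (mx << delta) + my          # units of 2**-delta ulp(x)
                q, shift = round_rne(exact, p)
                if add_guard_sticky(mx, my, delta, p) != (q, shift - delta):
                    return False
    return True
```

In this brute-force check the two versions agree for small `p`, which suggests that guard and sticky may be enough for this same-sign case; it is a test harness under the stated assumptions, not a proof.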

Once that question is answered, how can I work out how many bits to look at when I sum three numbers?
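For the three-term sum, a reference model that any reduced-width implementation could be checked against can be sketched as follows. It assumes round-to-nearest-even and `p`-bit integer significands with the hidden bit included; `round_sum3` is a hypothetical name. It forms \$t\$ exactly with arbitrary-precision integers and rounds once, so it says nothing about how few extra bits hardware would need; it only defines the expected result.

```python
def round_sum3(mx, my, mz, gamma, delta, p):
    """Correctly rounded t = 1.m_x + 2**-gamma * 1.m_y + 2**-delta * 1.m_z.
    Requires 0 <= gamma <= delta. Returns (significand, exponent increment
    relative to x), using exact integer arithmetic and a single rounding."""
    assert 0 <= gamma <= delta
    # Exact sum in units of 2**-delta times the ulp of x.
    n = (mx << delta) + (my << (delta - gamma)) + mz
    shift = n.bit_length() - p            # bits to drop; at least delta here
    if shift <= 0:
        return n << -shift, shift - delta
    q, rem = n >> shift, n & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    if rem > half or (rem == half and q & 1):   # ties to even
        q += 1
        if q == 1 << p:                   # rounding overflow
            q >>= 1
            shift += 1
    return q, shift - delta


# Example with p = 4: 1.000 + 1.000 + 1.000 = 3 = 1.100 * 2**1
print(round_sum3(8, 8, 8, 0, 0, 4))   # (12, 1)
```

The exponent increment returned here is the normalization right shift plus any rounding overflow, so a candidate implementation that keeps only a few extra bits can be compared against it input by input.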

user8469759
