I'm looking for a reasonably(*) space-efficient way to encode arbitrary-precision decimals (e.g. BigDecimal), such that when the encodings are sorted lexicographically by their bit patterns, the numbers come out in numeric order. This is related to this Stack Overflow question.
One simple/naive approach would be to convert the number to an IEEE-754-like representation with an 8-byte exponent (which is roughly what RR in NTL does?), but this is obviously quite an inefficient(*) representation.
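Just to make that naive layout concrete, here is a rough Java sketch of what I mean (the class name `NaiveOrderPreservingEncoding` is mine, and it only handles non-negative values; negatives would additionally need the remaining bytes complemented plus a terminator):

```java
import java.io.ByteArrayOutputStream;
import java.math.BigDecimal;

// Rough sketch of the "naive" encoding: a marker byte, a fixed 8-byte
// exponent, then the decimal mantissa digits. Non-negative values only.
public class NaiveOrderPreservingEncoding {

    public static byte[] encode(BigDecimal value) {
        if (value.signum() < 0) {
            throw new IllegalArgumentException("sketch handles non-negative values only");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (value.signum() == 0) {
            out.write(0x00);              // zero sorts below every positive value
            return out.toByteArray();
        }
        out.write(0x01);                  // marker for "positive"

        // Normalize to  0.d1d2... * 10^exponent  with d1 != 0.
        BigDecimal stripped = value.stripTrailingZeros();
        String digits = stripped.unscaledValue().toString();
        long exponent = (long) digits.length() - stripped.scale();

        // 8-byte big-endian exponent, with the sign bit flipped so that
        // unsigned lexicographic byte order matches signed numeric order.
        long biased = exponent ^ Long.MIN_VALUE;
        for (int i = 7; i >= 0; i--) {
            out.write((int) (biased >>> (8 * i)) & 0xFF);
        }

        // Mantissa digits as ASCII bytes; trailing zeros are already
        // stripped, so a shorter mantissa that is a prefix of a longer
        // one really is the smaller number.
        for (char c : digits.toCharArray()) {
            out.write(c);
        }
        return out.toByteArray();
    }
}
```

Two such byte arrays can then be compared as unsigned bytes (e.g. with `java.util.Arrays.compareUnsigned` on Java 9+), and the 8 exponent bytes are exactly the part that seems wasteful to me.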
Is there perhaps some other approach to this? Are there any optimizations possible for arbitrary-precision integers? The encoding need not support any operations other than sorting (and, if possible, decoding the number again).
(*) By reasonably space-efficient I don't mean cramming bits together into bytes like IEEE-754 does. I just don't want any unnecessarily large fields that will remain mostly empty in a large number of situations (as would be the case with an 8-byte exponent). The motivation is that the (encoded) number will later be processed by this scheme, which takes a few milliseconds per byte to run.
The runtime efficiency of the encoding/decoding, on the other hand, isn't much of an issue (anything less than 10 ms for 128 bytes of data will still be overshadowed by the runtime of the OPE scheme linked above).
(Also note, this should be a decimal representation, so there are no precision issues due to "binary stuff".)