What is the purpose of NaN boxing?

Question

Reading 21st Century C I arrived at chapter 6 at the section "Marking Exceptional Numeric Values with NaNs", where it explains the use of the bits in the mantissa to store some arbitrary bit patterns, to use them as markers or pointers (the book mentions that WebKit uses this technique).

I'm not really sure I understood the utility of this technique, that I see as an hack (it relies on the hardware not caring on the value of the mantissa in a NaN) but coming from a Java background I'm not used to the roughness of C.

Here is the snippet of code that sets and reads a marker in a NaN

#include <stdio.h>
#include <math.h> //isnan

double ref;

double set_na(){
    if (!ref) {
        ref=0/0.;
        char *cr = (char *)(&ref);
        cr[2]='a';
    }
    return ref;
}

int is_na(double in){
    if (!ref) return 0;  //set_na was never called==>no NAs yet.

    char *cc = (char *)(&in);
    char *cr = (char *)(&ref);
    for (int i=0; i< sizeof(double); i++)
        if (cc[i] != cr[i]) return 0;
    return 1;
}

int main(){
    double x = set_na();
    double y = x;
    printf("Is x=set_na() NA? %i\n", is_na(x));
    printf("Is x=set_na() NAN? %i\n", isnan(x));
    printf("Is y=x NA? %i\n", is_na(y));
    printf("Is 0/0 NA? %i\n", is_na(0/0.));
    printf("Is 8 NA? %i\n", is_na(8));
}

it prints:

Is x=set_na() NA? 1
Is x=set_na() NAN? 1
Is y=x NA? 1
Is 0/0 NA? 0
Is 8 NA? 0

and at JSValue.h webkit explains the encoding, but not why it's used.

What is the purpose of this technique? Are the benefits of space/performance high enough to balance its hackish nature?

to be clear the OP is asking where [signaling NaNs](http://en.wikipedia.org/wiki/NaN#Signaling_NaN) can be used — ratchet freak, Jan 31 '13 at 11:31
@ratchetfreak : the question is not about signaling NaN, as the webkit JSValue.h explains, But thank you for letting me discovery something new! — andijcr, Jan 31 '13 at 15:27
@BЈовић hope the examples are clear: the snippet shows how to set and read a bit pattern in a NaN, and the link points to where webkit uses this technique — andijcr, Jan 31 '13 at 15:29
Is there a reason to not use isnan() from math.h? http://pubs.opengroup.org/onlinepubs/009695299/functions/isnan.html — Hudson, Jan 31 '13 at 18:54
@Hudson isnan() si used in the second printf in the main. The purpose of is_an() is to test if the bit pattern of the double in input is equal to that saved inside ref global variable. — andijcr, Jan 31 '13 at 20:59

score 79 · Accepted Answer · edited Mar 25 '17 at 13:54

79

When you are implementing a dynamically typed language, you've got to have a single type which can hold any of your objects. There are three different approaches I'm aware of for this:

Firstly, you can pass around pointers. This is what the CPython implementation does. Every object is a PyObject pointer. These pointers get passed around and operations are performed by looking at details in the PyObject struct to figure out the type.

The disadvantage is that small values like numbers get stored as boxed values, So your little 5 gets stored as a block of memory somewhere. So this leads us to the union approach, which is used by Lua. Instead of a PyObject*, each value is a struct which one field to specify the type, and then a union of all the different supported types. That way we avoid allocating any memory for small values, instead storing them directly in the union.

The NaN approach stores everything as doubles, and reuses the unused portion of NaN for the extra storage. The advantage over the union method is that we save the type field. If it's a valid double, it's a double otherwise the mantissa is a pointer to the actual object.

Remember, this is every javascript object. Every variable, every value in an object, every expression. If we can reduce all of those from 96 bits to 64 bits that is pretty impressive.

Is it worth the hack? Recall that there is a lot of demand for efficient Javascript. Javascript is the bottleneck in many web applications, and so making it faster is a higher priority. It's reasonable to introduce a certain degree of hackiness for performance reasons. For most cases, it'd be a bad idea, because its introducing a degree of complexity for little gain. But in this specific case, it is worthwhile for memory and speed improvements.

edited Mar 25 '17 at 13:54

PythonJin

123
5

answered Jan 31 '13 at 15:36

Winston Ewert

24,732
12
72
103

3

Actually CPython caches small numbers. See http://hg.python.org/cpython/file/e6cc582cafce/Objects/longobject.c – Phillip Cloud Jan 31 '13 at 22:13
5

@cpcloud, true, but that detail didn't seem pertinent. – Winston Ewert Jan 31 '13 at 22:36
1

@WinstonEwert You're right. I thought the same thing after I read what I had written. – Phillip Cloud Feb 04 '13 at 20:46
I would like to make one little addition, the NaN approch does not mean you store everything in doubles. Just that you share space with a double. For example some VMs (JSC) the value, if done nothing to it, is a pointer. Transformation has to be applied to the NanBox to get the float. – nickik Jan 11 '15 at 13:29
@nickik, I don't understand what you are saying. – Winston Ewert Jan 12 '15 at 06:03
The question, if you dont change any bits, is it a float or a pointer? See this blogpost, it talks about Nan and NunBoxing. http://wingolog.org/archives/2011/05/18/value-representation-in-javascript-implementations – nickik Jan 16 '15 at 00:28
2

@Praxeolitic, pointers in a 64-bit processor typically don't actually use the whole 64 bits. For x86-64 only 48-bits are actually used, see https://en.wikipedia.org/wiki/X86-64#Virtual_address_space_details. You can also see details of one particular method here: https://github.com/zuiderkwast/nanbox/blob/master/nanbox.h#L194 – Winston Ewert Jun 17 '16 at 20:13
1

Storing numbers inline not only reduces the number of allocations (and thus GC pressure), it also saves you an indirection when using the values. This makes it more cache-friendly and can mean a significant speed-up for numeric computation. Caching number objects doesn't help this use case. – Sebastian Redl Mar 25 '17 at 17:17
4

Using bits of a primitive type to avoid "boxing" all values is a time-honored technique. Smalltalk used it in the 1970s, stealing one bit from 16-bit integers to signal either an object pointer or 15-bit `SmallInteger`. – Jonathan Eunice Mar 25 '17 at 21:24
2

@JonathanEunice, really? That just surprises me because there is really not a long of range in 16 bits that I'd be willing to give up a bit. – Winston Ewert Mar 26 '17 at 02:01
1

@WinstonEwert Indeed, there is not. But the times, available memory sizes, and implementation techniques were very different. Smalltalk objects were accessed through an "object table," so the addresses were really table indices, not absolute memory pointers. 2**15 objects used to seem like a lot. – Jonathan Eunice Mar 26 '17 at 02:35
@JonathanEunice, its not so much the number of objects, its the fact that my largest signed number would 16K which seems woefully inadaquete. – Winston Ewert Mar 26 '17 at 15:58
2

@WinstonEwert Understood. But 1/ Minicomputer machine words just 16-18 bits those days. Even full words had inadequate range for many quantities. 2/ Small integers were empirically the most populous object type/class in the system. 3/ There was a seamless bridge from small to full (arbitrary precision, boxed) integer objects. Unioning small integers with pointers was just an implementation optimization. 4/ As 32-bit CPUs came online in the Eighties, unboxed small integers got more breathing room. – Jonathan Eunice Mar 26 '17 at 18:49
@WinstonEwert - late to the party I know but... when a SmallInteger went outside of the 2^15 range it was converted into a BigInteger object - seamlessly so the programmer wouldn't experience the limitation. – DangerMouse Jan 14 '21 at 11:13

score 9 · Answer 2 · edited May 23 '17 at 12:40

Using NaN for "exceptional values" is a well known and sometimes helpful technique to avoid the need of an extra boolean variable this_value_is_invalid. Used wisely, it can help one to make his code more concise, cleaner, simpler, better readable without any performance trade-offs.

This technique has some pitfalls, of course (see here http://ppkwok.blogspot.co.uk/2012/11/java-cafe-1-never-write-nan-nan_24.html), but in languages like Java (or very similar C#) there are standard library functions like Float.isNaN to make dealing with NaNs simple. Of course, in Java you could use alternatively theFloat and Double class and in C# the nullable value types float? and double?, giving you the possibility of using null instead of NaN for invalid floating point numbers, but those techniques can have a significant negative influence on the performance and memory usage of your program.

In C the use of NaN is not 100% portable, that is true, but you can use it everywhere where the IEEE 754 floating point standard is available. AFAIK this is quite almost every mainstream hardware today (or at least the runtime environment of most compilers support it). For example, this SO post contains some information to find out more details about the use of NaN in C.

the auto-boxing in java is messy and should be avoided, just using it to be able to provide a null value is ridiculous and prone to bugs — ratchet freak, Jan 31 '13 at 15:26
i edited the question to link to where webkit uses NaN-boxing. It seems that webkit has a broader use of NaN, other than to signal 'NaN' — andijcr, Jan 31 '13 at 15:33

What is the purpose of NaN boxing?

2 Answers2

Linked