55

It's a sort of simple compression where you use one numeric variable to store many boolean / binary states, using doubling and the fact that every doubling number is 1 + the sum of all the previous ones.

I'm sure it must be an old, well-known technique, I'd like to know what it's called to refer to it properly. I've done several searches on every way I can think of to describe it, but found nothing beyond some blog articles where the article authors seem to have figured this out themselves and don't know what to call it, either (example 1, example 2).

For example, here's a very simple implementation intended to illustrate the concept:

packStatesIntoNumber () {
  let num = 0
  if (this.stateA) num += 1
  if (this.stateB) num += 2
  if (this.stateC) num += 4
  if (this.stateD) num += 8
  if (this.stateE) num += 16
  if (this.stateF) num += 32
  return num
}

unpackStatesFromNumber (num) {
  assert(num < 64)
  this.stateF = num >= 32; if (this.stateF) num -= 32
  this.stateE = num >= 16; if (this.stateE) num -= 16
  this.stateD = num >= 8; if (this.stateD) num -= 8
  this.stateC = num >= 4; if (this.stateC) num -= 4
  this.stateB = num >= 2; if (this.stateB) num -= 2
  this.stateA = num >= 1; if (this.stateA) num -= 1
}

You could also use bitwise operators, base 2 number parsing, enums... There are many more efficient ways to implement it, I'm interested in the name of the approach more generally.

Glorfindel
  • 3,137
  • 6
  • 25
  • 33
  • 9
    In C#, there are `enums`, and they can have a `Flags` attribute. They could make your code far simpler. – Bernhard Hiller Oct 15 '18 at 08:58
  • Yeah and usually you could simply turn an ordered array of binary states into a base 2 number then store or transmit that as base 64 or whatever is best for the application. The sample code above is intended to clearly illustrate the concept, not be optimised code. – user56reinstatemonica8 Oct 15 '18 at 09:02
  • I suppose referring to this as 'enumeration' would be clear enough for most. But this seems like the sort of basic technique that probably has a long history and a proper name. – user56reinstatemonica8 Oct 15 '18 at 09:16
  • 12
    I'd call this "simulating bit fields". It is almost always a bad idea unless space efficiency is overwhelmingly important. – Kilian Foth Oct 15 '18 at 09:17
  • @KilianFoth Bad idea presumably because it would be easy to have hard-to-detect decoding errors if the order or number of boolean states changes? – user56reinstatemonica8 Oct 15 '18 at 09:31
  • @user568458 Simply because it requires a lot of effort both for coding and execution to do something that could be done trivially - either in a bit more space, or with bitfields (if your language has them). – Kilian Foth Oct 15 '18 at 09:55
  • 2
    In C and C++ code, you'd commonly see `state |= (0x1 << 1); state |= (0x1 << 2);` etc. Ultimately it amounts to the same, but uses bit arithmetic and hex representation instead. – Neil Oct 15 '18 at 10:33
  • @KilianFoth Got it. The use case I had in mind was simplifying transmission where many variables add noise (in this case, replacing hash URLs intended for user copy/paste, from being like `stateA=1&stateB=0&stateC=...` to `states=encodedValue`). Yeah, it'd be crazy to just do this internally. – user56reinstatemonica8 Oct 15 '18 at 11:04
  • 7
    @KilianFoth A `bool` is generally stored as a 32 bit integer internally. As such, packing can make the difference of a factor of 32. That's really a lot. I mean, we programmers are always ready to throw away half of our resources, but I'm generally reluctant to throw away 97% of them. Such waste factors can easily make the difference between being able to run important use cases and running out of memory. – cmaster - reinstate monica Oct 15 '18 at 14:19
  • 3
    Historically, the typically way bit masks are used to declare, set and retrieve values. Using shifts is odd and not really the best illustration of the approach. – JimmyJames Oct 15 '18 at 16:22
  • 3
    @cmaster The reason bools are stored that way is because sharing a single memory location (32 or 64 bits on todays machines) can be very bad for cache performance unless you pay a lot of attention to the machine language code. If you have a truly massive number of bits it's probably worth it, but if not you are probably better off not pre-optimizing and just packing the bits up when you are ready to transmit to network or disk. – Bill K Oct 15 '18 at 20:33
  • 1
    @BillK The CPU does not care whether you load one byte, four bytes, or eight bytes from the same cache line. However, the CPU registers are 32/64 bit wide, so improper implementation of boolean values might introduce spurious extension commands to the assembler code. The impact of that is dwarved by the latency of the load itself. No, I believe that the four-byte `bool` is just a heritage of C. Before C99, the language did not have any boolean type, boolean values were just stored as `int`. Add to that the implicit promotion of smaller integer types to `int`, and any "bool" is 32 bits... – cmaster - reinstate monica Oct 15 '18 at 20:48
  • 3
    @cmaster storing multiple bit fields in a byte makes certain types of caching/parallelism really difficult to do when you are generating machine code. This is why higher level languages don't automatically pack bits. Imagine f1=!f1;f2=!f2;f3=!f3; A current CPU could have all the values calculated and ready to write before it ever reached that code, but if they can effect each other then they must be executed in sequence so that the output of one step can be used as the input to the next. With assembly or carefully coded c (not using struct) it's not an issue. f=0x7^f; would be fine – Bill K Oct 15 '18 at 20:58
  • 3
    @BillK: Storing multiple fields per *byte* adds complexity, but storing up to four fields into a 32-bit word wouldn't. I suspect the reason for using a 32-bit type is that when using smaller types, the size of a structure will depend upon the arrangement of members therein. Saying that a struct containing two `Int32` and four `Boolean` will always be 24 bytes is "easier" than saying that it could be anywhere from 12 to 24 depending upon the order in which the members are stored. – supercat Oct 15 '18 at 21:18
  • @BillK I was not talking about bit packing, I was talking about using one or four **bytes** to store a `bool`. I fully agree that bit-fiddling is cumbersome and impacts performance. But storing one `bool` per byte does not. From a performance perspective, a 32 bit `bool` type just does not make sense, and a 1 bit `bool` type doesn't either. The sensible compromise is the 8 bit `bool` type, but I don't know which languages actually implement it. – cmaster - reinstate monica Oct 15 '18 at 21:39
  • 1
    @cmaster an 8 bit bool still has the problem on a 32/64 bit machine--that they are retrieved with the same fetch if they live in the same word. But if you are hand-coding and not using a C struct it's fine. Heck either way is fine, but that's the reason compilers allocate an entire 32/64 bit word to a Boolean--memory is cheap so they focus on not being a potential performance bottleneck. – Bill K Oct 15 '18 at 22:24
  • My Programming I teacher back in college called this "The thing the cool kids do but you should stay away from because it is ugly and doesn't help that much." Several years later, having to maintain my old code for some exotic apps I made, I can fully understand why. – T. Sar Oct 17 '18 at 12:46
  • Also, worth of note - this is a special type of Magic Number and thus a code smell. It should be avoided unless there is a specific need for this type of trick. – T. Sar Oct 17 '18 at 12:49

3 Answers3

106

It's most commonly referred to as a bit field, and another term you'll often hear is bit masks, which are used to get or set individual bit values or the entire bit field at once.

Many programming languages have auxiliary structures to help with this. As @BernhardHiller notes in the comments, C# has enums with flags; Java has the EnumSet class.

Miles
  • 105
  • 1
Glorfindel
  • 3,137
  • 6
  • 25
  • 33
  • 4
    I would interpret "bit field" as using a language feature that allows individual bits to be assigned to fields of a structure rather than doing it manually with bitwise operators. – Peter Green Oct 15 '18 at 17:25
  • 22
    @PeterGreen That would be different than the standard interpretation. – Eric Oct 15 '18 at 19:26
  • 1
    "Bit Mapping" or "Bit Mapped" , whilst common for recordsets and array processing, can also apply in this case. When extracting common elements from multiple sets the value can be decomposed to identify components of a federated model. We even say this of octal filemode digits. Bit Masks (any mask) tend to be filters (as for IO ports and data direction registers). – mckenzm Oct 16 '18 at 03:50
  • 1
    C# also has `BitArray`, which allows storing an arbitrary amount of bits and indexing them (while flags are limited to an integer type and intended to be used as masks). – Luaan Oct 16 '18 at 06:31
  • True; I just mentioned the two structures I'm most familiar with. There are probably dozens out there, especially in other languages. – Glorfindel Oct 16 '18 at 06:43
  • Java also has [`EnumSet`](https://docs.oracle.com/javase/8/docs/api/java/util/EnumSet.html), which is more likely what you want for many applications (e.g., the OP's `stateA` through `stateF`). This class also implements the standard Collections interfaces, so it's more pleasant to use. – wchargin Oct 16 '18 at 07:45
20

Strange, quite a bit of different terms here but I don't see the one that came immediately to mind (and it's in the title of your question!)--Bit Packing is what I've always heard it termed.

I had thought this was really obvious but strangely when I google it this seems to be a term that is widely used but not officially defined (Wikipedia seems to redirect to bit field which is a way to do bit packing, but not a name for the process). Searching for the definition seems to lead to this page:

http://www.kinematicsoup.com/news/2016/9/6/data-compression-bit-packing-101

Which isn't great for SO purposes but it's the best definition/description I can find including this succinct description: "Bit-packing is a simple concept: Use as few bit as possible to store a piece of data."

Bill K
  • 2,699
  • 18
  • 18
  • Can you provide some references? Interesting term. – Greg Burghardt Oct 15 '18 at 16:16
  • 13
    Bit packing is technically correct but also refers to a more general a thing than just boolean states - storing data in general in the smallest number of bits as possible. For example, another use of it could mean compressing a `char` array by putting two `char`s into one `int`. – Izkata Oct 15 '18 at 16:22
  • @GregBurghardt You know, it's interesting. I didn't think about it when I posted because the term was so prevalent in the 80's/90's when I learned programming in C and assembly--now although a google search finds MANY mentions, there isn't a definitive Wikipedia page for it. The first answer in google has this definition: "Bit-packing is a simple concept: Use as few bit as possible to store a piece of data." http://www.kinematicsoup.com/news/2016/9/6/data-compression-bit-packing-101 – Bill K Oct 15 '18 at 16:39
  • that's when I learned about bit packing too, although you can get a lot crazier than simply repurposing unused 0's in what would nominally be integer values. some years back I ran into a system that stored one of its parameters as an 8 bit float. IIRC 5 bits for an unsigned mantissa (all values were positive no need to store the sign explicitly), and 3 more for a base 10 exponent. At the time I'd assumed it was a legacy hardware kludge with no path forward, but with machine learning having recently started doing stuff with int4 vs int8, I could see some workloads dropping down from FP16. – Dan Is Fiddling By Firelight Oct 15 '18 at 18:06
  • 1
    @DanNeely This kind of thing is also commonly supported by GPUs - trading between precision, memory and computation is pretty important there. This has been exploited pretty well with GPU-based computing too. – Luaan Oct 16 '18 at 06:32
14

There are many different terms used to describe this.

Most commonly the bits are called "bit flags" or "bit fields".
(However, it's worth noting that "bit fields" sometimes refers to a specific feature of the C and C++ languages, which is related but not exactly the same.)

The integer itself is referred to variously as either a "bit array", a "bit set" or a "bit vector", depending on usages and circumstances.

Either way, extracting the bits from the bit set/vector/array is done through shifting and masking.
(i.e. using a bit mask.)


For some examples of each term in active use:


It's not really pertinent to the question but I'd like to say: please do not use addition and subtraction to set and clear bits as those methods are prone to error.
(i.e. if you do num += 1 twice, the result is equivalent to num += 2.)

Prefer to use the appropriate bitwise operations instead, if your chosen language provides them:

packStatesIntoNumber ()
{
  let num = 0
  if (this.stateA) num |= 1
  if (this.stateB) num |= 2
  if (this.stateC) num |= 4
  if (this.stateD) num |= 8
  if (this.stateE) num |= 16
  if (this.stateF) num |= 32
  return num
}

unpackStatesFromNumber (num)
{
  this.stateF = ((num & 32) != 0);
  this.stateE = ((num & 16) != 0);
  this.stateD = ((num & 8) != 0);
  this.stateC = ((num & 4) != 0);
  this.stateB = ((num & 2) != 0);
  this.stateA = ((num & 1) != 0);
}
Pharap
  • 551
  • 3
  • 13
  • 1
    `this.stateF = (num & 32) ? true : false`, etc. No need to mutate `num` while you're extracting the values. – Roger Lipscombe Oct 16 '18 at 11:09
  • 3
    @RogerLipscombe Good point, I wasn't really reading through what the code was doing, just reacting to the use of `+` and `-`. I've now gone one better and used `!= 0` instead of a ternary, which I feel is more concise whilst still being expclit. – Pharap Oct 16 '18 at 11:31