How should I store "unknown" and "missing" values in a variable, while still retaining the difference between "unknown" and "missing"?

Question

Consider this an "academic" question. I have been wondering about avoiding NULLs from time to time, and this is an example where I can't come up with a satisfactory solution.

Let's assume I store measurements where on occasions the measurement is known to be impossible (or missing). I would like to store that "empty" value in a variable while avoiding NULL. Other times the value could be unknown. So, having the measurements for a certain time-frame, a query about a measurement within that time period could return 3 kinds of responses:

The actual measurement at that time (for example, any numerical value including 0)
A "missing"/"empty" value (i.e., a measurement was done, and the value is known to be empty at that point).
An unknown value (i.e., no measurement has been done at that point. It could be empty, but it could also be any other value).

Important Clarification:

Assuming you had a function get_measurement() returning one of "empty", "unknown" and a value of type "integer". Having a numerical value implies that certain operations can be done on the return value (multiplication, division, ...) but using such operations on NULLs will crash the application if not caught.

I would like to be able to write code, avoiding NULL checks, for example (pseudocode):

>>> value = get_measurement()  # returns `2`
>>> print(value * 2)
4

>>> value = get_measurement()  # returns `Empty()`
>>> print(value * 2)
Empty()

>>> value = get_measurement()  # returns `Unknown()`
>>> print(value * 2)
Unknown()

Note that none of the print statements caused exceptions (as no NULLs were used). So the empty & unknown values would propagate as necessary and the check whether a value is actually "unknown" or "empty" could be delayed until really necessary (like storing/serialising the value somewhere).

Side-Note: The reason I'd like to avoid NULLs, is primarily a brain-teaser. If I want to get stuff done I'm not opposed to using NULLs, but I found that avoiding them can make code a lot more robust in some cases.

Why do you wish to distinguish "measurement done but empty value" vs. "no measurement"? In fact, what does "measurement done but empty value" mean? Did the sensor fail to produce a valid value? In that case, how is that different from "unknown"? You aren't going to be able to go back in time and get the correct value. — DaveG, Aug 14 '18 at 13:27
@DaveG Assume fetching the number of CPUs in a server. If the server is switched off, or has been scrapped, that value simply doesn't exist. It will be a measurement which does not make any sense (maybe "missing"/"empty" are not the best terms). But the value is "known" to be nonsensical. If the server exists, but the process of fetching the value crashes, measuring it is valid, but fails, resulting in an "unknown" value. — exhuma, Aug 14 '18 at 13:42
@DocBrown You're right. I did not review the title after asking the question. I have moved the integer example into the example given in the "clarification". I hope this makes things a bit clearer. — exhuma, Aug 14 '18 at 16:04
Maybe consider asking a related question on dba.stackexchange.com? — TRiG, Aug 14 '18 at 16:46
Out of curiosity, what kind of measurement are you taking where "empty" isn't simply equal to the zero of whatever scale? "Unknown"/"missing" I can see being useful e.g. if a sensor isn't hooked up or if the sensor's raw output is garbage for one reason or another, but "empty" in every case I can think of can be more consistently represented by `0`, `[]`, or `{}` (the scalar 0, the empty list, and the empty map, respectively). Also, that "missing"/"unknown" value is basically exactly what `null` is for -- it represents that there _could_ be an object there, but there isn't. — Nic, Aug 14 '18 at 16:52
@NicHartley I..... see your point. I need to think on this. I know that I had the case in the past, but don't remember the exact situation it came up. — exhuma, Aug 14 '18 at 18:52
"If I want to get stuff done I'm not opposed to using NULLs, but I found that avoiding them can make code a lot more robust in some cases." Can you name what specific cases you would like to avoid them in? — noɥʇʎԀʎzɐɹƆ, Aug 14 '18 at 18:54
"Note that none of the print statements caused exceptions (as no NULLs were used). So the empty & unknown values would propagate as necessary and the check whether a value is actually "unknown" or "empty" could be delayed until really necessary" Python's NULL is `None`. `print(None)` doesn't raise an exception and a null was used. Can you clarify "empty & unknown values would propagate as necessary and the check whether a value is actually "unknown" or "empty" could be delayed until really necessary"? — noɥʇʎԀʎzɐɹƆ, Aug 14 '18 at 18:58
@noɥʇʎԀʎzɐɹƆ It avoids null-checks. Especially in languages without static type-checking or those which ignore the type of `null`. If you miss out on a type-check you will run into an exception where a behaviour like the `float('nan')` would be acceptable (that is, propagating the `NaN`). Like `null + 1` will raise an exception where `NaN + 1` will not. — exhuma, Aug 14 '18 at 19:01
@noɥʇʎԀʎzɐɹƆ to your second comment: The `print` statements in the example contained a multiplication operator which would have raised an exception if the result in `value` were `None`. — exhuma, Aug 14 '18 at 19:05
I recently went the other way with missing data. We have temperature sensors which occasionally did not return a value, so the value came thru as 0, or absolute zero degrees K. Which of course threw off graphs and such. Having that value translated to NULL and then discarding such values before graphing was a huge improvement. — DaveG, Aug 14 '18 at 19:10
@exhuma **Sometimes, it's good to force null-checks.** Writing a null check is uncomfortable, and for good reason: the feeling can force you to ask yourself "Where else could I accomplish the same thing the null-check does?" You could use `NaN` for one and `None` for another. Maybe you'll filter to remove `Missing` before sending the data to the function. Maybe for this particular data operation, unknown needs to be interpreted as increasing the statistical variance. It's not very often that you'll need to avoid a null-check on a function that acts on a single scalar. — noɥʇʎԀʎzɐɹƆ, Aug 14 '18 at 19:34
Whatever solution you do use for this, be sure to ask yourself if it suffers from similar problems to the ones that made you want to eliminate NULL in the first place. — Ray, Aug 14 '18 at 20:31
@DaveG "*You aren't going to be able to go back in time and get the correct value.*" If the input data is coming asynchronously, you may have received the bit saying that the measurement has been taken, but the value hasn't arrived yet. — RonJohn, Aug 14 '18 at 21:07
"just use -99 or 0" seems to be the solution chosen by some expensive sensors we bought at work. :-/ — Eric Duminil, Aug 15 '18 at 22:36
I'm confused by this question. Is there a problem with the solution you've outlined? — JimmyJames, Aug 16 '18 at 16:22
You need two fields, "Measurement Status" (Pending, Success, Failed, etc) and "Measurement Value" — Ben, Aug 17 '18 at 09:39
Alternately as @EricDuminil you can use "Sentinel" values which are impossible values you treat as special cases. — Ben, Aug 17 '18 at 10:04
@Ben "impossible" being key here. I've seen temperature sensors logging 0 for missing values. — Eric Duminil, Aug 18 '18 at 03:35

David Arno · Accepted Answer · 2018-08-15T07:41:58.707

85

The common way to do this, at least in functional languages, is to use a discriminated union. This is a value that is exactly one of: a valid int, a value that denotes "missing", or a value that denotes "unknown". In F#, it might look something like:

type Measurement =
    | Reading of value : int
    | Missing
    | Unknown of value : RawData

A Measurement value will then be a Reading, with an int value, or a Missing, or an Unknown with the raw data as value (if required).

However, if you aren't using a language that supports discriminated unions or an equivalent, this pattern is unlikely to be of much use to you. In that case you could, for example, use a class with an enum field that denotes which of the three cases holds the correct data.
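For readers without built-in discriminated unions, here is a rough Python sketch of the same shape. The three class names mirror the F# example; the `describe` function and the use of `bytes` as a stand-in for `RawData` are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Reading:
    value: int  # a real measurement, including 0


@dataclass(frozen=True)
class Missing:
    pass  # measured, and known to have no value


@dataclass(frozen=True)
class Unknown:
    raw: bytes = b""  # hypothetical stand-in for RawData


# A Measurement is exactly one of the three cases above.
Measurement = Union[Reading, Missing, Unknown]


def describe(m: Measurement) -> str:
    """Handle each case explicitly; no None is involved anywhere."""
    if isinstance(m, Reading):
        return f"reading of {m.value}"
    if isinstance(m, Missing):
        return "missing"
    if isinstance(m, Unknown):
        return "unknown"
    raise TypeError(f"not a Measurement: {m!r}")
```

Unlike F#, Python will not check at compile time that every case is handled, so the final `raise` is the only safety net in this sketch.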

edited Aug 15 '18 at 07:41

answered Aug 14 '18 at 12:59

David Arno


7

you can do sum types in OO languages but there is a fair bit of boiler plate to make them work https://stackoverflow.com/questions/3151702/discriminated-union-in-c-sharp – jk. Aug 14 '18 at 13:47
11

“[in non-functional languages] this pattern isn't likely of much use to you” — It’s a pretty common pattern in OOP. GOF has a variation of this pattern, and languages such as C++ offer native constructs to encode it. – Konrad Rudolph Aug 14 '18 at 15:28
isnt this saying Measurement is one of three possible datatypes rather than supplying a datatype? – Ewan Aug 14 '18 at 16:10
1

@KonradRudolph i'm not sure C++ unions count as discriminated unions - or did you mean something else? – jk. Aug 14 '18 at 16:28
14

@jk. Yes, they don’t count (well I guess they do; they’re just very bad in this scenario due to lack of safety). I meant `std::variant` (and its spiritual predecessors). – Konrad Rudolph Aug 14 '18 at 16:30
2

@Ewan No, it’s saying “Measurement is a datatype that is either … or …”. – Konrad Rudolph Aug 14 '18 at 16:31
1

This is correct in general. For specifically numeric applications, NaN is typically used. And not integers. – Frank Hileman Aug 14 '18 at 22:28
@KonradRudolph, very good point regarding eg C++ and other non "functional" languages that support DUs. I've slightly modified the wording to reflect that. – David Arno Aug 15 '18 at 07:43
1

+1 This is a typical case of "choose the proper tool for the job". – Guran Aug 15 '18 at 08:25
2

@DavidArno Well even without DUs there’s a “canonical” solution for this in OOP, which is to have a superclass of values with subclasses for valid and invalid values. But that’s probably going too far (and in practice it seems that most code bases eschew subclass polymorphism in favour of a flag for this, as shown in other answers). – Konrad Rudolph Aug 15 '18 at 09:08
1

@KonradRudolph Even that's overkill, really - just have a single immutable class that has three different static factory methods that each return one consistent possible object. Subclasses are only a good solution if you can have consistent behaviour with the superclass - if you'll need to do `if (x is SomeUnknown)` anyway, you're not *really* using polymorphism (sadly, I've met plenty of people who don't understand the difference, but still think they're doing "OOP"). – Luaan Aug 16 '18 at 16:47
@Luaan Right, ideally they can and do have consistent behaviour, namely to be monadic closures over a behaviour (see Eric Lippert’s answer). As to whether subclasses are overkill — this entirely depends on the verbosity of the language. In most mainstream languages you’re right though. – Konrad Rudolph Aug 16 '18 at 16:56
I've used the input of the current answers as guideline for a bit more research and came up with the same conclusion that discriminated unions are the cleanest way to solve this. Thanks for the detailed answer and patience :) – exhuma Sep 02 '18 at 12:14

Eric Lippert · Answer 2 · 2018-08-14T22:34:57.467

58

If you do not already know what a monad is, today would be a great day to learn. I have a gentle introduction for OO programmers here:

https://ericlippert.com/2013/02/21/monads-part-one/

Your scenario is a small extension to the "maybe monad", also known as Nullable<T> in C# and Optional<T> in other languages.

Let's suppose you have an abstract type to represent the monad:

abstract class Measurement<T> { ... }

and then three subclasses:

sealed class Unknown<T> : Measurement<T> { ... a singleton ...}
sealed class Empty<T> : Measurement<T> { ... a singleton ... }
sealed class Actual<T> : Measurement<T> { ... a wrapper around a T ...}

We need an implementation of Bind:

abstract class Measurement<T>
{
  public Measurement<R> Bind<R>(Func<T, Measurement<R>> f)
  {
    // Unknown and Empty short-circuit; only Actual invokes f.
    if (this is Unknown<T>) return Unknown<R>.Singleton;
    if (this is Empty<T>) return Empty<R>.Singleton;
    if (this is Actual<T>) return f(((Actual<T>)this).Value);
    throw ...
  }
}

From this you can write this simplified version of Bind:

public Measurement<R> Bind<R>(Func<T, R> f)
{
  // The explicit type argument selects the Measurement-returning overload.
  return this.Bind<R>(a => new Actual<R>(f(a)));
}

And now you're done. You have a Measurement<int> in hand. You want to double it:

Measurement<int> m = whatever;
Measurement<int> doubled = m.Bind(a => a * 2);
Measurement<string> asString = m.Bind(a => a.ToString());

And follow the logic; if m is Empty<int> then asString is Empty<string>, excellent.

Similarly, if we have

Measurement<int> First()

and

Measurement<double> Second(int i);

then we can combine two measurements:

Measurement<double> d = First().Bind(Second);

and again, if First() is Empty<int> then d is Empty<double> and so on.

The key step is to get the bind operation correct. Think hard about it.
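To make the shape of the bind operation concrete, here is a minimal Python sketch of the same idea. It is not a translation of the C# above: the method names `bind` and `map`, the use of dataclasses instead of singletons, and the `reciprocal` helper are all assumptions chosen for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class Measurement(Generic[T]):
    def bind(self, f: "Callable[[T], Measurement[R]]") -> "Measurement[R]":
        # Only Actual calls f; Unknown and Empty pass through untouched.
        if isinstance(self, Actual):
            return f(self.value)
        return self

    def map(self, f: Callable[[T], R]) -> "Measurement[R]":
        # The "simplified Bind": apply f, then wrap the plain result.
        return self.bind(lambda a: Actual(f(a)))


@dataclass(frozen=True)
class Unknown(Measurement[T]):
    pass


@dataclass(frozen=True)
class Empty(Measurement[T]):
    pass


@dataclass(frozen=True)
class Actual(Measurement[T]):
    value: T


def reciprocal(x: int) -> Measurement[float]:
    # Hypothetical second step that can itself yield Empty.
    return Empty() if x == 0 else Actual(1 / x)


print(Actual(2).map(lambda a: a * 2))  # Actual(value=4)
print(Empty().map(lambda a: a * 2))    # Empty()
print(Actual(0).bind(reciprocal))      # Empty()
```

The caller never tests for the non-value cases; the single `isinstance` check inside `bind` is the only place where that decision is made.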

edited Aug 14 '18 at 22:34

answered Aug 14 '18 at 22:29

Eric Lippert


I had a gut feeling that the problem felt like something to be tackled with a monad. I still don't feel 100% confident with monads themselves. I will definitely have a look at your article. I like the solution. Although I'm not 100% sure I would use this in production code where I am not certain that everyone in the team has grasped monads yet. Would be a good team-exercise though :) – exhuma Aug 15 '18 at 06:46
4

Monads (thankfully) are much easier to use than to understand. :) – Guran Aug 15 '18 at 14:02
1

I'd remark that the “simplified version of bind” lacks the very thing that characterises a monadic bind. This is just a simple functor mapping, which is also available for many types that _aren't_ monads. Why not call it `fmap` or similar? – leftaroundabout Aug 15 '18 at 17:16
11

@leftaroundabout: Precisely because I didn't want to get into that hair-splitting distinction; as the original poster notes, many people lack confidence when it comes to dealing with monads. Jargon-laden category theory characterizations of simple operations works against developing a sense of confidence and understanding. – Eric Lippert Aug 15 '18 at 17:30
It's not about jargon, it's about actually keeping simple things simple. `Bind' = (m, f) => m.Bind(a => pure(f(a))` invokes a tricky monad operation (well, it would be tricky in types more complex than `Optional`) for doing something which is always trivial if implemented directly. (In fact Haskell derives `instance Functor` for free.) Moreover, the fact that `Bind(pure . f)` reduces to `fmap f` is crucial to even approaching understanding of monads, and you don't get that if you never introduce `fmap` in the first place. – leftaroundabout Aug 15 '18 at 17:49
2

So your advice is to replace `Null` with `Nullable` + some boilerplate code? :) – Eric Duminil Aug 15 '18 at 22:34
For the beginners: where are the monads in the code above? Are `Empty` and `Unknown` monads but `Actual` not? – Claude Aug 16 '18 at 05:38
2

"Monad" is a term used to refer to the whole Measurement system of classes and all similarly laid out systems which can have multiple states and some inner generic value in at least one of those states. The two key things which make something a monad are A. having multiple states where one contains inner value(s) and B. the inner value(s) being generic and otherwise independent from the structure. In functional languages like Haskell, you can then make a function operate on "any kind of monad" and it would work for Option, Measurement, or even a List. – daboross Aug 16 '18 at 06:59
1

@Claude There is only one monad of interest in the code above. It is the following collection of things: the `Measurement`, `Unknown`, `Empty`, and `Actual` classes (which together constitute a single "type former"), the `Bind` method of `Measurement` (which constitutes the monad's bind operation), the `Actual` initializer method (which constitutes the monad's unit operation), and some proofs not shown above that the bind and unit behave sanely with respect to each other. – Daniel Wagner Aug 16 '18 at 15:07
2

I just want to point out how much I appreciate seeing the master of using an alphabet soup of generics in his blog posts suggest a solution that uses an alphabet soup of generics. Though I guess this is pretty tame compared to some of the stuff you've had to deal with! – bvoyelr Aug 16 '18 at 15:07
3

@Claude: You should read my tutorial. A monad is a generic type that follows certain rules and provides the ability to bind together a chain of operations, so in this case, `Measurement` is the monadic type. – Eric Lippert Aug 16 '18 at 15:45
5

@daboross: Though I agree that stateful monads are a good way to introduce monads, I don't think of carrying state as being the thing that characterizes a monad. I think of the fact that you can bind together a sequence of functions is the compelling thing; the statefulness is just an implementation detail. – Eric Lippert Aug 16 '18 at 15:47
1

@EricLippert Good point! I was introduced in a similar way and haven't gotten all the usefulness of monads without data but I guess that's more off topic than we need. Glad you mentioned it! – daboross Aug 18 '18 at 05:52

score 18 · Answer 3 · answered Aug 14 '18 at 14:48

I think that in this case a variation on a Null Object Pattern would be useful:

public class Measurement
{
    private int value;
    private bool isUnknown = false;
    private bool isMissing = false;

    private Measurement() { }
    public Measurement(int value) { this.value = value; }

    public int Value {
        get {
            if (!isUnknown && !isMissing)
            {
                return this.value;
            }
            throw new SomeException("...");
        }                   
    }

    public static readonly Measurement Unknown = new Measurement
    {
        isUnknown = true
    };

    public static readonly Measurement Missing = new Measurement
    {
        isMissing = true
    };
}

You can turn it into a struct, override Equals/GetHashCode/ToString, add implicit conversions from or to int, and if you want NaN-like behavior you can also implement your own arithmetic operators so that eg. Measurement.Unknown * 2 == Measurement.Unknown.
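As an illustration of the NaN-like behaviour, here is a small Python sketch rather than C#. The names `_Sentinel`, `UNKNOWN` and `EMPTY` are made up for this example, and the rule that the sentinel operand is simply returned (so the left one wins when both operands are sentinels) is a simplifying assumption, not a recommendation.

```python
class _Sentinel:
    """A non-numeric result that absorbs arithmetic instead of raising."""

    def __init__(self, name: str) -> None:
        self._name = name

    def __repr__(self) -> str:
        return f"{self._name}()"

    def _absorb(self, other):
        # Assumption: the sentinel itself is the result, whatever `other` is.
        return self

    # Route both normal and reflected operators to the same absorbing rule.
    __add__ = __radd__ = __sub__ = __rsub__ = _absorb
    __mul__ = __rmul__ = __truediv__ = __rtruediv__ = _absorb


UNKNOWN = _Sentinel("Unknown")
EMPTY = _Sentinel("Empty")

print(2 * 2)        # 4
print(EMPTY * 2)    # Empty()
print(2 * UNKNOWN)  # Unknown()
```

Every operator that should propagate has to be listed by hand; anything left out (comparison, `abs`, `round`, ...) will still raise or misbehave.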

That said, C#'s Nullable<int> implements all that, with the only caveat being that you can't differentiate between different types of nulls. I'm not a Java person, but my understanding is that Java's OptionalInt is similar, and other languages likely have their own facilities to represent an Optional type.

The most common implementation I've seen of this pattern involves inheritance. There could be a case for two sub classes: MissingMeasurement and UnknownMeasurement. They could implement or override methods in the parent Measurement class. +1 — Greg Burghardt, Aug 14 '18 at 18:59
Isn't the point of the *Null Object Pattern* that you don't fail on invalid values, but rather do nothing? — Chris Wohlert, Aug 14 '18 at 21:22
@ChrisWohlert in this case the object doesn't really have any methods except the `Value` getter, which absolutely should fail as you can't convert a `Unknown` back into an `int`. If the measurement had a, say, `SaveToDatabase()` method, then a good implementation would probably not perform a transaction if the current object is a null object (either via comparison with a singleton, or a method override). — Maciej Stachowski, Aug 15 '18 at 00:28
@MaciejStachowski Yeah, I'm not saying it should do nothing, I'm saying the *Null Object Pattern* isn't a good fit. Your solution might be fine, but I wouldn't call it the *Null Object Pattern*. — Chris Wohlert, Aug 15 '18 at 11:40

Ewan · Answer 4 · 2018-08-17T07:59:01.990

14

If you literally MUST use an integer then there is only one possible solution: use some of the possible values as 'magic numbers' that mean 'missing' and 'unknown',

e.g. 2,147,483,647 and 2,147,483,646.

If you just need the int for 'real' measurements, then create a more complicated data structure:

class Measurement {
    private int _value;   // backing field for Value
    public bool IsEmpty;
    public bool IsKnown;
    public int Value {
        get {
            if (!IsEmpty && IsKnown) return _value;
            throw new Exception("NaN");
        }
    }
}

Important Clarification:

You can achieve the maths requirement by overloading the operators for the class:

public static Measurement operator+ (Measurement a, Measurement b) {
    if(a.IsEmpty) { return b; }
    ...etc
}

edited Aug 17 '18 at 07:59

answered Aug 14 '18 at 13:00

Ewan


An optional could be a valid alternative. – Glenner003 Aug 14 '18 at 14:24
@Glenner003 How can an optional differentiate the reason for a missing value? – Kakturus Aug 14 '18 at 14:36
10

@Kakturus `Option<Option<Int>>` – Bergi Aug 14 '18 at 15:20
5

@Bergi You can't possibly think that's even remotely acceptable.. – BlueRaja - Danny Pflughoeft Aug 14 '18 at 16:43
@Bergi Why not just use a `Variant` at that point? It's no greater an abuse of the type system, at least. – Nic Aug 14 '18 at 16:54
Many years ago we reserved a small range of integers near NaN for an assortment of error conditions and then tested (value below reality floor) – arp Aug 14 '18 at 17:27
8

@BlueRaja-DannyPflughoeft Actually it fits the OP's description quite well, which has a nested structure as well. To become acceptable we'd introduce a proper type alias (or "newtype") of course - but a `type Measurement = Option<Int>` for a result that was an integer or an empty read is ok, and so is an `Option<Measurement>` for a measurement that might have been taken or not. – Bergi Aug 14 '18 at 19:00
@NicHartley What language is that? I don't know any type system with variadic higher-kinded types, but an Optional type is well known to many. – Bergi Aug 14 '18 at 19:03
@Bergi [C++](https://en.cppreference.com/w/cpp/utility/variant), though you could argue that it cheats a bit, since by default _everything_ is passed by value, and it can't actually hold a reference. (in contrast to e.g. C#, where most things are passed by value of reference, or whatever that's called) – Nic Aug 14 '18 at 19:17
7

@arp "Integers near NaN"? Could you explain what you mean by that? It seems somewhat counterintuitive to say that a number is "near" the very concept of something not being a number. – Nic Aug 14 '18 at 19:19
3

@Nic Hartley In our system a group of what would "naturally" have been the lowest possible negative integers was reserved as NaN. We used that space for encoding various reasons why those bytes represented something other than legitimate data. (It was decades ago and I may have fuzzed some of the details, but there was definitely a set of bits you could put into an integer value to make it throw NaN if you tried to do math with it.) – arp Aug 14 '18 at 20:15
1

@arp oh, I see my misunderstanding. When you said "near", I was thinking about mathematical distance on the number line, not similarity in the bit pattern (which, it sounds like, would be mathematical distance if you interpreted the bit pattern as another type of integer). The same sort of thing can be done with floats, too -- only a few of the 32 or 64 bits are required to mark a float as NaN, and it's not at all uncommon to record the reason in the remaining bits. – Nic Aug 14 '18 at 20:53
3

For values where the only valid values are 0 (or 1) and higher, the simple negatives (-1, -2, -3, ...) work best for tracking "invalid" values. They're easy for a human to remember if looking at them in storage, and there's no benefit to the computer to using something at the extreme end of the storable range. – Bobson Aug 15 '18 at 12:24
@Kakturus I was thinking when the optional is null no measurement,when it is absent measurement was performed but no meaningful value. – Glenner003 Aug 16 '18 at 05:25
The problem is that the OP wants to be able to math on the results. – Acccumulation Aug 16 '18 at 21:17

Federico Poloni · Answer 5 · 2018-08-16T02:28:04.410

If your variables are floating-point numbers, IEEE754 (the floating point number standard which is supported by most modern processors and languages) has your back: it is a little-known feature, but the standard defines not one, but a whole family of NaN (not-a-number) values, which can be used for arbitrary application-defined meanings. In single-precision floats, for instance, you have 22 free bits that you can use to distinguish between 2^{22} types of invalid values.

Normally, programming interfaces expose only one of them (e.g., Numpy's nan); I don't know if there is a built-in way to generate the others other than explicit bit manipulation, but it's just a matter of writing a couple of low-level routines. (You will also need one to tell them apart, because, by design, a == b always returns false when one of them is a NaN.)

Using them is better than reinventing your own "magic number" to signal invalid data, because they propagate correctly and signal invalid-ness: for instance, you don't risk shooting yourself in the foot if you use an average() function and forget to check for your special values.

The only risk is libraries not supporting them correctly, since they are quite an obscure feature: for instance, a serialization library may 'flatten' them all to the same nan (which looks equivalent to it for most purposes).
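A possible sketch of those low-level routines in Python, whose `float` is a 64-bit double (so a quiet NaN has 51 spare bits rather than 22). The helper names `make_nan` and `nan_payload` and the two codes are invented for this example. It assumes the platform keeps the payload bits when a double passes through `struct`, which is typical but not guaranteed, and it makes no promise that arithmetic preserves the payload.

```python
import math
import struct

# IEEE 754 double: sign (1 bit) | exponent (11 bits) | fraction (52 bits).
# An all-ones exponent with the top fraction bit set is a quiet NaN;
# the low 51 fraction bits can carry an application-defined code.
_QUIET_NAN = 0x7FF8_0000_0000_0000
_PAYLOAD_MASK = 0x0007_FFFF_FFFF_FFFF

MISSING_CODE = 1  # hypothetical application codes
UNKNOWN_CODE = 2


def make_nan(payload: int) -> float:
    """Build a quiet NaN whose low bits hold `payload`."""
    if not 0 <= payload <= _PAYLOAD_MASK:
        raise ValueError("payload does not fit in 51 bits")
    bits = _QUIET_NAN | payload
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def nan_payload(x: float) -> int:
    """Read the code back; needed because NaN == NaN is always False."""
    if not math.isnan(x):
        raise ValueError("not a NaN")
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits & _PAYLOAD_MASK


missing = make_nan(MISSING_CODE)
print(math.isnan(missing * 2))  # True: still a NaN after arithmetic
print(missing == missing)       # False: equality cannot tell NaNs apart
```

Whether `nan_payload(missing * 2)` still returns the original code depends on the hardware, so it is safest to read the payload before doing any arithmetic.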

David Moles · Answer 6 · 2018-08-15T18:34:06.800

Following on David Arno's answer, you can do something like a discriminated union in OOP, and in an object-functional style such as that afforded by Scala, by Java 8 functional types, or a Java FP library such as Vavr or Fugue it feels fairly natural to write something like:

var value = Measurement.of(2);
out.println(value.map(x -> x * 2));

var empty = Measurement.empty();
out.println(empty.map(x -> x * 2));

var unknown = Measurement.unknown();
out.println(unknown.map(x -> x * 2));

printing

Value(4)
Empty()
Unknown()

(Full implementation as a gist.)

An FP language or library provides other tools like Try (a.k.a. Maybe) (an object that contains either a value, or an error) and Either (an object that contains either a success value or a failure value) that could also be used here.

score 2 · Answer 7 · answered Aug 14 '18 at 18:12

The ideal solution to your problem is going to hinge on why you care about the difference between a known failure and a known unreliable measurement, and what downstream processes you want to support. Note, 'downstream processes' for this case does not exclude human operators or fellow developers.

Simply coming up with a "second flavor" of null doesn't give the downstream set of processes enough information for deriving a reasonable set of behaviors.

If you are relying instead on contextual assumptions about the source of bad behaviors being made by downstream code, I'd call that bad architecture.

If you know enough to distinguish between a reason for failure and a failure without a known reason, and that information is going to inform future behaviors, you should be communicating that knowledge downstream, or handling it inline.

Some patterns for handling this:

Sum types
Discriminated unions
Objects or structs containing an enum representing the result of the operation and a field for the result
Magic strings or magic numbers that are impossible to achieve via normal operation
Exceptions, in languages in which this use is idiomatic
Realizing that there isn't actually any value in differentiating between these two scenarios and just using null

score 2 · Answer 8 · answered Aug 14 '18 at 21:43

If I were concerned with "getting something done" rather than an elegant solution, the quick and dirty hack would be to simply use the strings "unknown", "missing", and 'string representation of my numeric value', which would then be converted from a string and used as needed. Implemented quicker than writing this, and in at least some circumstances, entirely adequate. (I'm now forming a betting pool on the number of downvotes...)

MickeyfAgain_BeforeExitOfSO


Upvoted for mentioning "getting something done." – barbecue Aug 15 '18 at 18:57
4

Some people might note that this suffers most of the same issues as using NULL, namely that it just switches from needing NULL checks to needing "unknown" and "missing" checks, but keeps the run time crash for the lucky, silent data corruption for the unlucky as the only indicators that you forgot a check. Even missing NULL checks have the advantage that linters might catch them, but this loses that. It does add a distinction between "unknown" and "missing", though, so it beats NULL there... – 8bittree Aug 16 '18 at 01:19

Dewi Morgan · Answer 9 · 2018-08-16T20:54:32.763

The gist of the question seems to be "How do I return two unrelated pieces of information from a method which returns a single int? I never want to check my return values, and nulls are bad, don't use them."

Let's look at what you are wanting to pass. You are passing either an int, or a non-int rationale for why you can't give the int. The question asserts that there will only be two reasons, but anyone who has ever made an enum knows that any list will grow. Scope to specify other rationales just makes sense.

Initially, then, this looks like it might be a good case for throwing an exception.

When you want to tell the caller something special which isn't in the return type, exceptions are often the appropriate system: exceptions are not just for error states, and allow you to return a lot of context and rationale to explain why you just can't int today.

And this is the ONLY system which allows you to return guaranteed-valid ints, and guarantee that every int operator and method that takes ints can accept the return value of this method without ever needing to check for invalid values like null, or magic values.

But exceptions are really only a valid solution if, as the name implies, this is an exceptional case, not the normal course of business.

And a try/catch and handler is just as much boilerplate as a null check, which was what was objected to in the first place.

And if the caller doesn't contain the try/catch, then the caller's caller has to, and so on up.

A naive second pass is to say "It's a measurement. Negative distance measurements are unlikely." So for some measurement Y, you can just have consts for

-1=unknown,
-2=impossible to measure,
-3=refused to answer,
-4=known but confidential,
-5=varies depending on moon phase, see table 5a,
-6=four-dimensional, measurements given in title,
-7=file system read error,
-8=reserved for future use,
-9=square/cubic so Y is same as X,
-10=is a monitor screen so not using X,Y measurements: use X as the screen diagonal,
-11=wrote the measurements down on the back of a receipt and it was laundered into illegibility but I think it was either 5 or 17,
-12=... you get the idea.

This is the way it is done in a lot of old C systems, and even in modern systems where there is a genuine constraint to int, and you can't wrap it to a struct or monad of some type.

If the measurements can be negative, then you just make your data type larger (eg long int) and have the magic values be higher than the range of the int, and ideally begin with some value that will show up clearly in a debugger.
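A small Python sketch of the in-band approach, assuming real readings are never negative. The `Reason` codes, `decode` and `mean_of_readings` are hypothetical names used only to show where the check has to live.

```python
from enum import IntEnum
from typing import Iterable, Union


class Reason(IntEnum):
    """Hypothetical in-band codes; only usable if real readings are >= 0."""
    UNKNOWN = -1
    IMPOSSIBLE = -2


def decode(raw: int) -> Union[int, Reason]:
    """Split a stored int into either a real reading or a reason code."""
    if raw >= 0:
        return raw
    return Reason(raw)  # raises ValueError for an unassigned negative code


def mean_of_readings(raws: Iterable[int]) -> float:
    """Average the real readings only, skipping every sentinel."""
    readings = [r for r in raws if r >= 0]
    if not readings:
        raise ValueError("no real readings to average")
    return sum(readings) / len(readings)


raws = [10, -1, 20, -2]
print(sum(raws) / len(raws))   # 6.75, silently wrong: sentinels were averaged in
print(mean_of_readings(raws))  # 15.0
```

The first `print` shows the failure mode: nothing crashes when the check is forgotten, the number is just wrong.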

There are good reasons to have them as a separate variable, rather than just having magic numbers, though. For example, strict typing, maintainability, and conforming to expectations.

In our third attempt, then, we look at cases where it is the normal course of business to have non-int values. For example, if a collection of these values may contain multiple non-integer entries. This means an exception handler may be the wrong approach.

In that case, it looks like a good case for a structure which passes the int, and the rationale. Again, this rationale can just be a const like the above, but instead of holding both in the same int, you store them as distinct parts of a structure. Initially, we have the rule that if the rationale is set, the int will not be set. But we are no longer tied to this rule; we can provide rationales for valid numbers too, if needs be.

Either way, every time you call it, you still need boilerplate, to test the rationale to see if the int is valid, then pull out and use the int part if the rationale lets us.
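As a rough Python illustration of such a structure and the boilerplate it forces on callers: Rationale, Reading and doubled are hypothetical names, and the VALID member is an assumption standing in for "no rationale set".

```python
from dataclasses import dataclass
from enum import Enum


class Rationale(Enum):
    # Hypothetical reasons; VALID means the int part may be used.
    VALID = 0
    UNKNOWN = 1
    IMPOSSIBLE = 2


@dataclass(frozen=True)
class Reading:
    value: int = 0
    rationale: Rationale = Rationale.VALID


def doubled(reading: Reading) -> Reading:
    # The boilerplate: test the rationale before touching the int.
    if reading.rationale is not Rationale.VALID:
        return reading
    return Reading(reading.value * 2)


print(doubled(Reading(2)))
print(doubled(Reading(rationale=Rationale.UNKNOWN)))
```

Every function that consumes a Reading needs the same check that doubled performs.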

This is where you need to investigate your reasoning behind "don't use null".

Like exceptions, null is meant to signify an exceptional state.

If a caller is calling this method and ignoring the "rationale" part of the structure completely, expecting a number without any error handling, and it gets a zero, then it'll handle the zero as a number, and be wrong. If it gets a magic number, it'll treat that as a number, and be wrong. But if it gets a null, it'll fall over, as it damn well should do.

So every time you call this method you must put in checks for its return value, however you handle the invalid values, whether in-band or out of band, try/catch, checking the struct for a "rationale" component, checking the int for a magic number, or checking an int for a null...

The alternative, to handle multiplication of an output which might contain an invalid int and a rationale like "My dog ate this measurement", is to overload the multiplication operator for that structure.

...And then overload every other operator in your application that might get applied to this data.

...And then overload all methods that might take ints.

...And all of those overloads will need to still contain checks for invalid ints, just so that you can treat the return type of this one method as if it were always a valid int at the point when you are calling it.
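To make the cost concrete, here is a small Python sketch using the Empty and Unknown names from the question (the NonValue base class is an assumption). Only multiplication is overloaded, so the very next operator already fails.

```python
class NonValue:
    """Base for marker values that absorb multiplication."""

    def __mul__(self, other):
        return self

    __rmul__ = __mul__  # also handles `2 * Empty()`

    def __repr__(self):
        return f"{type(self).__name__}()"


class Empty(NonValue):
    pass


class Unknown(NonValue):
    pass


print(Empty() * 2)    # Empty()
print(2 * Unknown())  # Unknown()

# Addition was not overloaded, so it still raises.
try:
    Empty() + 2
except TypeError as exc:
    print("TypeError:", exc)
```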

So the original premise is false in various ways:

If you have invalid values, you can't avoid checking for those invalid values at any point in the code where you're handling the values.
If you're returning anything other than an int, you're not returning an int, so you can't treat it like an int. Operator overloading lets you pretend to, but that's just pretend.
An int with magic numbers (including NULL, NaN, Inf...) is no longer really an int, it's a poor man's struct.
Avoiding nulls will not make code more robust, it will just hide the problems with ints, or move them into a complex exception-handling structure.

score 1 · Answer 10 · answered Aug 14 '18 at 19:22

I don't understand the premise of your question, but here's the face value answer. For Missing or Empty, you could do math.nan (Not a Number). You can perform any mathematical operations on math.nan and it will remain math.nan.

You can use None (Python's null) for an unknown value. You shouldn't be manipulating an unknown value anyway, and some languages (Python is not one of them) have special null operators so that the operation is only performed if the value is non-null; otherwise the value remains null.
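A quick Python check of both behaviours. Note that math.nan is a float, so integer measurements become floats under this scheme, and that NaN never compares equal to itself, so it has to be detected with math.isnan.

```python
import math

missing = math.nan
doubled = missing * 2
print(math.isnan(doubled))  # True: NaN propagates through arithmetic
print(missing == missing)   # False: use math.isnan, not ==

unknown = None
try:
    unknown * 2
except TypeError as exc:
    # None does not propagate; arithmetic on it raises.
    print("TypeError:", exc)
```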

Other languages have guard clauses (like Swift or Ruby), and Ruby has a conditional early return.

I've seen this solved in Python in a few different ways:

with a wrapper data structure, since numerical information is usually tied to an entity and has a measurement time. The wrapper can override magic methods like __mul__ so that no exceptions are raised when your Unknown or Missing values come up. Numpy and pandas might have such capability in them.
with a sentinel value (like your Unknown or -1/-2) and an if statement
with a separate boolean flag
with a lazy data structure: your function records some operation on the structure and returns it, and the outermost function that needs the actual result evaluates the lazy data structure
with a lazy pipeline of operations: similar to the previous one, but this one can be used on a set of data or a database
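One possible reading of the lazy-structure option, sketched in Python. Lazy, MISSING and UNKNOWN are hypothetical names; the idea is that operations are only recorded, and the single check for a non-value happens when the outermost caller asks for the result.

```python
MISSING = object()  # hypothetical sentinels
UNKNOWN = object()


class Lazy:
    """Records operations and runs them only when evaluate() is called."""

    def __init__(self, source, ops=()):
        self._source = source  # zero-argument callable producing the raw reading
        self._ops = tuple(ops)

    def __mul__(self, factor):
        def multiply(v):
            return v * factor

        return Lazy(self._source, self._ops + (multiply,))

    def evaluate(self):
        value = self._source()
        if value is MISSING or value is UNKNOWN:
            return value  # the only check, done at the outermost point
        for op in self._ops:
            value = op(value)
        return value


print((Lazy(lambda: 2) * 2 * 3).evaluate())               # 12
print((Lazy(lambda: UNKNOWN) * 2).evaluate() is UNKNOWN)  # True
```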

score 1 · Answer 11 · answered Aug 14 '18 at 20:08

How the value is stored in memory is dependent on the language and implementation details. I think what you mean is how the object should behave to the programmer. (This is how I read the question, tell me if I'm wrong.)

You've proposed an answer to that in your question already: use your own class that accepts any mathematical operation and returns itself without raising an exception. You say you want this because you want to avoid null checks.

Solution 1: don't avoid null checks

Missing can be represented as math.nan
Unknown can be represented as None

If you have more than one value, you can filter() to only apply the operation on values that aren't Unknown or Missing, or whatever values you want to ignore for the function.
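For instance, a sketch of the filter() idea with the mapping above (None for Unknown, math.nan for Missing); is_usable is a hypothetical helper name. Passing None as the predicate to filter() would be a mistake here, because it would also drop the legitimate reading 0.

```python
import math

readings = [3, None, math.nan, 0, 7]  # None = Unknown, nan = Missing


def is_usable(x):
    """True for real measurements, False for Unknown (None) or Missing (NaN)."""
    return x is not None and not (isinstance(x, float) and math.isnan(x))


doubled = [x * 2 for x in filter(is_usable, readings)]
print(doubled)  # [6, 0, 14]
```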

I can't imagine a scenario where you need a null-check on a function that acts on a single scalar. In that case, it's good to force null-checks.

Solution 2: use a decorator that catches exceptions

In this case, Missing could raise MissingException and Unknown could raise UnknownException when operations are performed on it.

@suppressUnknown(value=Unknown) # if an UnknownException is raised, return this value instead
@suppressMissing(value=Missing)
def sigmoid(value):
    ...

The advantage of this approach is that the properties of Missing and Unknown are only suppressed when you explicitly ask for them to be suppressed. Another advantage is that this approach is self-documenting: every function shows whether or not it expects an Unknown or a Missing and how the function handles it.

When a function that doesn't expect a Missing gets a Missing, it will raise immediately, showing you exactly where the error occurred instead of silently failing and propagating a Missing up the call chain. The same goes for Unknown.

sigmoid can still call sin, even though it doesn't expect a Missing or Unknown, since sigmoid's decorator will catch the exception.
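One way the decorators could be written, as a sketch only: the helper names suppress and _Raising are assumptions, and Missing and Unknown are modelled as objects whose arithmetic raises the matching exception.

```python
import functools
import math


class MissingException(Exception):
    pass


class UnknownException(Exception):
    pass


class _Raising:
    """Hypothetical marker whose arithmetic raises instead of propagating."""

    def __init__(self, name, exc_type):
        self._name, self._exc_type = name, exc_type

    def _fail(self, *args):
        raise self._exc_type(self._name)

    __neg__ = __add__ = __radd__ = __mul__ = __rmul__ = _fail

    def __repr__(self):
        return self._name


Missing = _Raising("Missing", MissingException)
Unknown = _Raising("Unknown", UnknownException)


def suppress(exc_type):
    """Build a decorator factory that swaps exc_type for a fallback value."""
    def factory(value):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                try:
                    return func(*args, **kwargs)
                except exc_type:
                    return value
            return wrapper
        return decorator
    return factory


suppressMissing = suppress(MissingException)
suppressUnknown = suppress(UnknownException)


@suppressUnknown(value=Unknown)
@suppressMissing(value=Missing)
def sigmoid(value):
    return 1 / (1 + math.exp(-value))


print(sigmoid(0))        # 0.5
print(sigmoid(Missing))  # Missing
print(sigmoid(Unknown))  # Unknown
```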

wonder what's the point of posting two answers to the same question (this is [your prior answer](https://softwareengineering.stackexchange.com/a/376876/31260), anything wrong with it?) — gnat, Aug 14 '18 at 21:25
@gnat This answer provides reasoning why it shouldn't be done the way the author shows, and I didn't want to go through the hassle of integrating two answers with different ideas- it's just easier to write two answers that can be read independently. I don't understand why you care so much about someone else's harmless reasoning. — noɥʇʎԀʎzɐɹƆ, Aug 14 '18 at 23:39

score 0 · Answer 12 · answered Aug 14 '18 at 20:51

Assume fetching the number of CPUs in a server. If the server is switched off, or has been scrapped, that value simply doesn't exist. It will be a measurement which does not make any sense (maybe "missing"/"empty" are not the best terms). But the value is "known" to be nonsensical. If the server exists, but the process of fetching the value crashes, measuring it is valid, but fails, resulting in an "unknown" value.

Both of these sound like error conditions, so I would judge that the best option here is to simply have get_measurement() throw both of these as exceptions immediately (such as DataSourceUnavailableException or SpectacularFailureToGetDataException, respectively). Then, if any of these issues occur, the data-gathering code can react to it immediately (such as by trying again in the latter case), and get_measurement() only has to return an int in the case that it can successfully get the data from the data source - and you know that the int is valid.
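A sketch of that shape in Python, using the two exception names suggested above. The dict passed as server is a hypothetical stand-in for a real data source, and report is an illustrative caller.

```python
class DataSourceUnavailableException(Exception):
    """The source does not exist (server off or scrapped): the 'empty' case."""


class SpectacularFailureToGetDataException(Exception):
    """The source exists but reading it failed: the 'unknown' case."""


def get_measurement(server: dict) -> int:
    # `server` is a hypothetical stand-in for a real data source.
    if not server.get("online", False):
        raise DataSourceUnavailableException("server is off or scrapped")
    if "cpus" not in server:
        raise SpectacularFailureToGetDataException("probe crashed")
    return server["cpus"]  # always a valid int when we get here


def report(server: dict) -> str:
    try:
        return str(get_measurement(server) * 2)
    except DataSourceUnavailableException:
        return "empty"
    except SpectacularFailureToGetDataException:
        return "unknown"


print(report({"online": True, "cpus": 4}))  # 8
print(report({"online": False}))            # empty
print(report({"online": True}))             # unknown
```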

If your situation doesn't support exceptions or can't make much use of them, then a good alternative is to use error codes, perhaps returned through a separate output to get_measurement(). This is the idiomatic pattern in C, where the actual output is stored in an input pointer and an error code is passed back as the return value.

score 0 · Answer 13 · answered Aug 16 '18 at 12:14

The given answers are fine, but still do not reflect the hierarchical relation between value, empty and unknown.

Unknown ranks highest.
Next, before a value is used, it must be established whether it is empty.
Last comes the value to calculate with.

An ugly (because of its failing abstraction) but fully operational solution in Java would be:

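// Outer Optional: absent means "unknown". Inner Optional: absent means "empty".
Optional<Optional<Integer>> unknowableValue;

// Runs only when the value is known (it may still be empty).
unknowableValue.ifPresent(emptiableValue -> ...);
Optional<Integer> emptiableValue = unknowableValue.orElse(Optional.empty());

// Runs only when the value is known and not empty.
emptiableValue.ifPresent(value -> ...);
int value = emptiableValue.orElse(0);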

Here functional languages with a nice type system are better.

In fact, the empty/missing and unknown non-values seem to be part of some process state, some production pipeline, like Excel spreadsheet cells with formulas referencing other cells. There one might think of storing contextual lambdas. Changing a cell would re-evaluate all recursively dependent cells.

In that case an int value would be obtained from an int supplier. An empty value would give an int supplier that throws an empty exception, or evaluates to empty (recursively upwards). Your main formula would connect all values and possibly also return an empty (value/exception). An unknown value would disable evaluation by throwing an exception.

Values would probably be observable, like a Java bound property, notifying listeners on change.

In short: the recurring pattern of needing values with the additional states empty and unknown seems to indicate that a more spreadsheet-like data model with bound properties might be better.

smci · Answer 14 · 2018-08-17T00:56:50.633

Yes, the concept of multiple different NA types exists in some languages; more so in statistical ones, where it's more meaningful (viz. the huge distinction between Missing-At-Random, Missing-Completely-At-Random, Missing-Not-At-Random).

If we're only measuring widget lengths, then it's not crucial to distinguish between 'sensor failure', 'power cut' and 'network failure' (although 'numerical overflow' does convey information).
But in e.g. data mining or a survey, asking respondents for e.g. their income or HIV status, a result of 'Unknown' is distinct from 'Decline to answer', and you can see that our prior assumptions about how to impute the latter will tend to be different from the former. So languages like SAS support multiple different NA types; the R language doesn't, and users very often have to hack around that; NAs at different points in a pipeline can be used to denote very different things.
There's also the case where we have multiple NA variables for a single entry ("multiple imputation"). Example: if I don't know any of a person's age, zipcode, education level or income, it's harder to impute their income.

As to how you represent different NA types in general-purpose languages that don't support them, generally people hack up things like floating-point NaN (requires converting integers), enums or sentinels (e.g. 999 or -1000) for integer or categorical values. Usually there isn't a very clean answer, sorry.
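The enum variant might look like this in Python. The NA class, the Income alias and the sample responses are all made up for illustration; the point is only that the two kinds of non-answer stay distinguishable until the analysis decides how to treat each.

```python
from enum import Enum
from typing import Union


class NA(Enum):
    # Hypothetical NA kinds for a survey question.
    UNKNOWN = "unknown"
    DECLINED = "declined to answer"


Income = Union[int, NA]

# Made-up sample data.
responses = [52000, NA.DECLINED, NA.UNKNOWN, 38000, NA.DECLINED]

observed = [r for r in responses if not isinstance(r, NA)]
declined = sum(1 for r in responses if r is NA.DECLINED)

print(sum(observed) / len(observed))  # 45000.0
print(declined)                       # 2
```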

ilhan · Answer 15 · 2018-08-30T15:05:40.053

R has built-in missing value support. https://medium.com/coinmonks/dealing-with-missing-data-using-r-3ae428da2d17

Edit: because I was downvoted I'm going to explain a bit.

If you are going to deal with statistics, I recommend using a statistics language such as R, because R is written by statisticians for statisticians. Missing values are such a big topic that a whole semester can be spent teaching them, and there are big books solely about missing values.

You can mark your missing data however you want, for example with a dot or "missing" or whatever. In R you can define what you mean by missing. You don't need to convert them.

The normal way to define missing values is to mark them as NA.

x <- c(1, 2, NA, 4, "")

Then you can see which values are missing:

is.na(x)

And the result will be:

FALSE FALSE  TRUE FALSE FALSE

As you can see, "" is not missing. You can treat "" as unknown, and NA is missing.

@Hulk, what other functional languages support missing values? Even if they support missing values I'm sure you cannot fill them with statistical methods in only one line of code. — ilhan, Aug 30 '18 at 15:02

score -1 · Answer 16 · answered Aug 14 '18 at 22:24

Is there a reason that the functionality of the * operator cannot be altered instead?

Most of the answers involve a lookup value of some sort, but it might just be easier to amend the mathematical operator in this case.

You would then be able to have similar empty()/unknown() functionality across your entire project.

Edward

This means you would have to overload _all_ operators – pipe Aug 16 '18 at 07:10
