17

What is the fastest way to find the first (smallest) integer that doesn't exist in a given list of unsorted integers (and that is greater than the list's smallest value)?

My primitive approach is to sort them and step through the list; is there a better way?

Fabian Zeindl
  • 318
  • 1
  • 2
  • 8

9 Answers

32

Assuming that you mean "integer" when you say "number", you can use a bitvector with one bit per value in the possible range (say your range includes integers between 1 and 256; then you can use a 256-bit, or 32-byte, bitvector). When you come across an integer at position n of your range, set the nth bit.

When you're done enumerating the collection of integers, you iterate over the bits in your bitvector, looking for the positions of any bits still set to 0. Those positions n are your missing integer(s).

This is O(2*N), therefore O(N) and probably more memory efficient than sorting the entire list.
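For illustration, here is a minimal Python sketch of this idea (the function name is made up; a Python int serves as the bitvector, though a bytearray or a dedicated bit-array type scales better for very large ranges):

def first_missing_bitvector(numbers):
    # One bit per value in [min, max]; first pass sets bits, second pass scans.
    lo, hi = min(numbers), max(numbers)
    bits = 0
    for n in numbers:
        bits |= 1 << (n - lo)          # mark value n as present
    for i in range(hi - lo + 1):
        if not (bits >> i) & 1:        # first unset bit = first missing value
            return lo + i
    return hi + 1                      # no gap: next integer after the maximum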

JasonTrue
  • 9,001
  • 1
  • 32
  • 49
  • This is essentially a "fleshed out" version of Peter Smith's answer http://programmers.stackexchange.com/a/145833/25199 – Jodrell Apr 24 '12 at 16:52
  • Yes, it wasn't there when I started writing. Sorry about my lack of prescience. I've typically seen this as an interview question (it was used at two places I've worked before). – JasonTrue Apr 24 '12 at 16:55
  • Very good explanation! I would be interested to see just how much more efficient this can be compared to a sort operation. This of course would be dependent on the nature of the list itself. – maple_shaft Apr 24 '12 at 16:57
  • 7
    Well, as a direct comparison, if you had all positive unsigned 32 bit integers but 1, you could solve the missing integer problem in about half a gigabyte of memory. If you sorted instead, you'd have to use over 8 gigabytes of memory. And sorting, except in special cases like this one (your list is sorted once you have a bitvector) is nearly always n log n or worse, so except in cases where the constant outweighs the complexity in cost, the linear approach wins. – JasonTrue Apr 24 '12 at 17:13
  • 1
    What if you don't know the range a priori? – Blrfl Apr 24 '12 at 17:13
  • 2
    If you have an integer data type, Blrfl, you certainly know the maximum extents of the range, even if you don't have enough information to narrow further. If you happen to know it's a small list, but don't know the exact size, sorting might be a simpler solution. – JasonTrue Apr 24 '12 at 17:17
  • 1
    Or do another loop first through the list to find the smallest and the largest element. Then you can allocate an array of exact size with the smallest value as the basic offset. Still O(N). – Secure Apr 24 '12 at 17:31
  • 1
  • @JPatrick: Not homework, business; I graduated in CS years ago :). – Fabian Zeindl Apr 24 '12 at 18:12
  • @JasonTrue: A recent edit in the question removed some ambiguities. With the question as-written when I wrote my comment, you'd have no way to know if the hole in the set [5, 6, 7, 8] is 4 or 9, which would have made knowing the range important. – Blrfl Apr 24 '12 at 19:01
  • 1
    Did you get this answer from pg 5, section 1.4 of "Programming Pearls" by Jon Bentley? http://www.cs.bell-labs.com/cm/cs/pearls/sec014.html – jmq Apr 24 '12 at 22:43
  • Actually, I didn't read Programming Pearls until after hearing a related question as a favorite interview question that a coworker at a previous company used. But yes, I have read that solution and thought perhaps that was where my coworker was first inspired... – JasonTrue Apr 24 '12 at 23:50
  • 1
    @JasonTrue Your time complexity estimate is correct, but not space complexity (memory efficiency). Yours is O(2^n) whereas in situ algorithms like Quicksort are O(log(n)) (space not time!). Clearly for memory, sorting wins. – scarfridge Apr 25 '12 at 07:44
  • In other words... use a histogram. (For a **small range** of integer values this is a good approach.) – Danny Varod Apr 25 '12 at 12:40
  • @scarfridge, since you still need to represent all 4 bytes of 2 billion positive 32 bit integers if you stored a list, you would at least need the 8 GB to store the original list if you chose a sorting approach, or 16GB if you stored all possible 32-bit integers; however, even 4 billion bits is only half a gigabyte. I assumed in-place in that example. – JasonTrue Apr 25 '12 at 16:14
  • 1
    @JasonTrue You need the list anyway. In your approach additionally 0.5 GB, sorting requires something around 30*size of stack frame, probably less than a KB. – scarfridge Apr 25 '12 at 19:10
  • No, you don't need the list if you use the BitVector solution. You can load records one at a time, record presence of the item, and discard the actual item. If you sort, you have to have the whole list. It's one of the advantages of the BitVector method. – JasonTrue Apr 25 '12 at 20:35
  • If I had a list like { -2000000000, 0, 2000000000 } I'd hope that my algorithm decided to sort it rather than use the bit array method. – Kirk Broadhurst Apr 26 '12 at 04:56
  • Yes, the equation is quite different when the collection is sparsely populated. The bit vector solution is best only when your problem is finding a small number of missing values in a large collection of items for which a set representation makes sense. I couldn't use it for "find the missing word" as easily as for "most integers", for example, unless there was a pretty simple mapping of string to element number (though Bloom filters work on a similar principle) – JasonTrue Apr 26 '12 at 07:30
  • If you prefer to talk about this from the perspective of sorting, think bucket sort (which is still O(n)). – Brian Apr 26 '12 at 17:47
  • 1
    There is a data-structure called "Barcode" that serves the need of having a sparsely populated list with binary values well: https://github.com/glazedlists/glazedlists/blob/master/source/ca/odell/glazedlists/impl/adt/Barcode.java – Fabian Zeindl Jan 12 '17 at 07:44
  • You don't need such a big array. It's enough to see the range [min;min+count] because at least one of them is missing – RiaD Nov 17 '21 at 08:46
  • If you have 100 million 64 bit integers supposed to start at 0, then the first missing integer is 100 million or less, so you only need a bit vector for (100 million + 1) integers. – gnasher729 Nov 17 '21 at 10:19
5

If you sort the entire list first, you pay the full worst-case sorting cost even when the gap could have been found early. Also, your choice of sort algorithm is critical.

Here's how I'd approach this problem:

  1. Build a min-heap from the list, so the smallest elements come out first.
  2. After each extraction, see whether a gap has appeared between the last two minima (see the sketch below).
  3. If you find a gap, then return: you have found your answer.
  4. If you don't find a gap, continue extracting.

Here's a visualization of a heap sort.
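A rough Python sketch of this early-exit idea (illustrative only; the function name is made up). It uses the standard heapq module: heapify is O(n), and you only pay for extractions up to the first gap, so the worst case is O(n log n) but a gap near the bottom is found quickly.

import heapq

def first_missing_heap(numbers):
    # Pop increasing minima from a min-heap and stop at the first gap.
    heap = list(numbers)
    heapq.heapify(heap)                # O(n) heap construction
    prev = heapq.heappop(heap)         # the smallest value
    while heap:
        nxt = heapq.heappop(heap)
        if nxt > prev + 1:             # gap found: stop early
            return prev + 1
        prev = nxt
    return prev + 1                    # values were contiguous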

Jim G.
  • 8,006
  • 3
  • 35
  • 66
5

Just to be esoteric and "clever", in the special case of the array having only one "hole", you can try an XOR-based solution:

  • Determine the range of your array; this is done by setting a "max" and "min" variable to the first element of the array, and for each element after that, if that element is less than the min or greater than the max, set the min or max to the new value.
  • If the size of the range (max - min + 1) is exactly one more than the cardinality of the set, there is only one "hole", so you can use XOR.
  • Initialize an integer variable X to zero.
  • For each integer from min to max inclusively, XOR that value with X and store the result in X.
  • Now XOR each integer in the array with X, storing each successive result to X as before.
  • When you're done, X will be the value of your "hole".

This will run in roughly 2N time similar to the bitvector solution, but requires less memory space for any N > sizeof(int). However, if the array has multiple "holes", X will be the XOR "sum" of all the holes, which will be difficult or impossible to separate into the actual hole values. In that case you fall back to some other method such as the "pivot" or "bitvector" approaches from other answers.
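A small Python sketch of the single-hole XOR trick (illustrative only; the function name is made up, and it raises an error rather than falling back to another method when there isn't exactly one hole):

def single_hole_xor(numbers):
    # Works only when exactly one value in [min, max] is missing (distinct inputs).
    lo, hi = min(numbers), max(numbers)    # determine the range
    if hi - lo != len(numbers):
        raise ValueError("not exactly one hole")
    x = 0
    for v in range(lo, hi + 1):            # XOR every value the full range would hold
        x ^= v
    for v in numbers:                      # XOR out every value actually present
        x ^= v
    return x                               # everything pairs off except the hole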

You could recurse this as well using something similar to the pivot method to further reduce complexity. Rearrange the array based on a pivot point (which will be the max of the left side and the min of the right; it'll be trivial to find the max and min of the full array while pivoting). If the left side of the pivot has one or more holes, recurse into that side only; otherwise recurse into the other side. At any point where you can determine there's only one hole, use the XOR method to find it (which should be cheaper overall than continuing to pivot all the way down to a collection of two elements with a known hole, which is the base case for the pure pivot algorithm).

KeithS
  • 21,994
  • 6
  • 52
  • 79
  • That's ridiculously clever and awesome! Now can you come up with a way to do this with a variable number of holes? :-D –  Dec 10 '17 at 01:02
3

What is the range of numbers you will encounter? If that range is not very large, you could solve this with two scans (linear time O(n)) using an array with as many elements as you have numbers, trading space for time. You could find the range dynamically with one more scan. To reduce space, you could assign 1 bit to each number, giving you 8 numbers worth of storage per byte.

Your other option, which may do better when the gap comes early and which works in situ instead of copying memory, is to modify selection sort to quit early as soon as the minimum found in a scanning pass is not exactly 1 more than the minimum found in the previous pass.
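A small Python sketch of that early-exit selection sort (illustrative; the name is made up, and it copies the input instead of modifying it in situ, which you could change by dropping the copy). Worst case is still O(n^2), but it stops after the pass that exposes the gap:

def first_missing_selection(numbers):
    # Selection sort with an early exit: stop once a pass's minimum
    # is not exactly one more than the previous pass's minimum.
    a = list(numbers)
    for i in range(len(a)):
        m = min(range(i, len(a)), key=lambda j: a[j])   # index of this pass's minimum
        a[i], a[m] = a[m], a[i]                         # selection-sort swap
        if i > 0 and a[i] > a[i - 1] + 1:               # gap: quit early
            return a[i - 1] + 1
    return a[-1] + 1                                    # values were contiguous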

Peter Smith
  • 2,587
  • 19
  • 21
2

No, not really. Since any not-yet-scanned number could always be the one that fills a given "hole", you cannot avoid scanning each number at least once and then comparing it to its possible neighbours. You could probably speed things up by building a binary tree and then traversing it from left to right until a hole is found, but that is essentially of the same time complexity as sorting, since it is sorting. And you probably won't come up with anything faster than Timsort.

pillmuncher
  • 853
  • 1
  • 6
  • 7
  • 2
    Are you saying that traversing a list is the same time complexity as sorting? – maple_shaft Apr 24 '12 at 17:01
  • @maple_shaft: No, I'm saying building a binary tree from random data and then traversing it left to right is equivalent to sorting and then traversing from small to big. – pillmuncher Apr 24 '12 at 23:12
1

Most ideas here are no more than just sorting. The bitvector version is plain bucket sort. Heap sort was also mentioned. It basically boils down to choosing the right sorting algorithm, which depends on time/space requirements and also on the range and number of elements.

In my view, using a heap structure is probably the most general solution (a heap basically gives you the smallest elements efficiently without a complete sort).

You could also analyze approaches which find the smallest numbers first and then scan for each integer larger than that. Or you find, say, the 5 smallest numbers, hoping they will contain a gap.

All of these algorithms have their strength depending on the input characteristics and program requirements.

Gere
  • 2,191
  • 2
  • 17
  • 21
1

Amazingly, there is a better way that is O(n), provided all the elements in the list are distinct. (The list does not need to be sorted.) It comes from the Functional Programming community, and is due to Richard Bird. It is the first problem in his book "Pearls of Functional Algorithm Design" (which is an amazing read). The code in Haskell is:

import Data.List (partition)

minfree xs = minfrom 0 xs

-- Invariant: the elements of xs are distinct and all >= a.
minfrom a xs | null xs            = a
             | length us == b - a = minfrom b vs
             | otherwise          = minfrom a us
  where (us, vs) = partition (< b) xs
        b        = a + 1 + length xs `div` 2

(One can precompute the lengths of the lists to save a little time, but this does not change the asymptotic complexity.)
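For example, minfree [8, 2, 3, 0, 12, 4, 1, 5] evaluates to 6, the smallest natural number not in that list.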

You can see that like other solutions in this thread, we do a partition. But the ingenious idea here is that if the left side of the partition is all full, we can recurse to the right, as there can be no free number on the left. We check that the left side is all full by checking that length us == b - a. b is the value we are partitioning by. us is the left side of the partition. Every element of xs is assumed to be greater than or equal to a.

For the complexity: for evaluating minfrom 0 xs, where length xs = n, we get the recurrence T(n) = T(n div 2) + Theta(n). This has solution T(n) = Theta(n). The idea is that the partition step cuts the size of the list in half, even though it takes a linear amount of time to do the partition. As Bird says, "A recursive process that performs Theta(n) processing steps in order to reduce the problem to another instance of at most half the size is also good enough [to get a linear-time algorithm]."

0

I believe I have come up with something that should work generally and efficiently if you are guaranteed not to have duplicates* (however, it should be extensible to any number of holes and any range of integers).

The idea behind this method is like quicksort, in that we find a pivot and partition around it, then recurse on the side(s) with a hole. To see which sides have a hole, we find the lowest and highest numbers and compare them with the pivot and the number of values on that side. Say the pivot is 17 and the minimum number is 11. If there are no holes, there should be 6 numbers to the left of the pivot (11, 12, 13, 14, 15, 16). If there are only 5, we know there is a hole on that side and we can recurse on just that side to find it. I'm having trouble explaining it more clearly than that, so let's take an example.

15 21 10 13 18 16 22 23 24 20 17 11 25 12 14

Pivot:

10 13 11 12 14 |15| 21 18 16 22 23 24 20 17 25

15 is the pivot, indicated by pipes (||). There are 5 numbers on the left side of the pivot, as there should be (15 - 10), and 9 on the right, where there should be 10 (25 - 15). So we recurse on the right side; we'll note that the previous bound was 15 in case the hole is adjacent to it (16).

[15] 18 16 17 20 |21| 22 23 24 25

Now there are 4 numbers on the left side but there should be 5 (21 - 16). So we recurse there, and again we'll note the previous bound (in brackets).

[15] 16 17 |18| 20 [21]

The left side has the correct 2 numbers (18 - 16), but the right has 1 instead of 2 (20 - 18). Depending on our ending conditions, we could compare the 1 number to the two sides (18, 20) and see that 19 is missing or recurse one more time:

[18] |20| [21]

The left side has a size of zero, with a gap between the pivot (20) and previous bound (18), so 19 is the hole.

*: If there are duplicates, you could probably use a hash set to remove them in O(N) time, keeping the overall method O(N), but that might take more time than using some other method.

Kevin
  • 152
  • 2
  • 11
  • 1
    I don't believe the OP said anything about there being just one hole. The input is an unsorted list of numbers -- they could be anything. It's not clear from your description how you'd determine how many numbers there "should be." – Caleb Apr 25 '12 at 18:25
  • @caleb It doesn't matter how many holes there are, just no duplicates (which can be removed in O(N) with a hash set, though in practice that may have more overhead than other methods). I've tried improving the description, see if it's better. – Kevin Apr 25 '12 at 23:12
  • This isn't linear, IMO. It's more like (logN)^2. At each step, you pivot the subset of the collection you care about (the half of the previous subarray which you've identified as having the first "hole"), then recurse into either the left side if it has a "hole", or the right side if the left side doesn't. (logN)^2 is still better than linear; if N increases tenfold you only take on the order of 2(log(N)-1) + 1 more steps. – KeithS Apr 26 '12 at 17:44
  • @Keith - unfortunately, you have to look at all the numbers at each level to pivot them, so it'll take about n + n/2 + n/4 + ... = 2n (technically, 2(n-m)) comparisons. – Kevin Apr 27 '12 at 00:25
0

A solution that doesn't use additional storage or assume the width (32 bits) of integers.

  1. In one linear pass, find the smallest number. Let's call this "min". O(n) time complexity.

  2. Pick a random pivot element and do a quicksort-style partition.

  3. If the pivot ended up at position ("pivot" - "min"), then recurse on the right side of the partition; otherwise recurse on the left side. The idea here is that if there were no holes, the pivot would land at the ("pivot" - "min")th position, so when it does, the first hole must lie to the right of the partition, and vice versa.

  4. The base case is an array of 1 element, and the hole lies between this element and the next one.

The expected total running time complexity is O(n) (8*n with the constants) and worst case is O(n^2). The time complexity analysis for a similar problem can be found here.
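A Python sketch along these lines (illustrative only; it partitions with list comprehensions rather than in place, so unlike the description above it does use extra storage, and its base case is an empty sub-list rather than a single element):

import random

def first_missing_partition(numbers):
    # Smallest integer greater than min(numbers) that is absent from numbers.
    # Assumes the integers are distinct and the list is non-empty.
    low = min(numbers)                              # step 1: one linear pass for the minimum
    def solve(items, low):
        # Every element of items is > low; the first hole above low is the answer.
        if not items:
            return low + 1                          # step 4: the hole follows low
        pivot = random.choice(items)                # step 2: random pivot
        left = [x for x in items if x < pivot]      # quicksort-style partition
        right = [x for x in items if x > pivot]
        if len(left) == pivot - low - 1:            # step 3: no hole below the pivot...
            return solve(right, pivot)              # ...so recurse on the right side
        return solve(left, low)                     # otherwise recurse on the left side
    return solve([x for x in numbers if x != low], low)

# e.g. first_missing_partition([5, 3, 6, 2, 8]) == 4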

aufather
  • 4,449
  • 2
  • 26
  • 24