53

A friend of mine was asked this interview question:

"There is a constant flow of numbers coming in from some infinite list of numbers, out of which you need to maintain a data structure that can return the top 100 highest numbers at any given point in time. Assume all the numbers are whole numbers only."

This is simple: you keep a list sorted in descending order and keep track of the lowest number in that list. If a new number is greater than that lowest number, you remove the lowest number and insert the new number into the sorted list at the appropriate position.

Then the question was extended:

"Can you make sure that insertion is O(1)? Is it possible?"

As far as I knew, even if you add a new number to the list and sort it again using any sort algorithm, it would at best be O(log n) for quicksort (I think). So my friend said it was not possible. But the interviewer was not convinced; he asked my friend to maintain some other data structure rather than a list.

I thought of a balanced binary tree, but even there you will not get O(1) insertion. So now I have the same question: is there any data structure that can do insertion in O(1) for the above problem, or is it not possible at all?

Sachin Shanbhag
  • 19
    Maybe this is just me misunderstanding the question, but why do you need to keep a *sorted* list? Why not just keep track of the lowest number, and if a number higher than that one is encountered, remove the lowest number and put in the new number, without keeping the list sorted. That would give you O(1). – EdoDodo Oct 26 '11 at 08:04
  • 36
    @EdoDodo - and after that operation, how do you know what the new lowest number is? – Damien_The_Unbeliever Oct 26 '11 at 08:08
  • 3
    @EdoDodo How do you get the new lowest number? –  Oct 26 '11 at 08:09
  • 2
    Considering the variety of answers, this makes a great interview question! – Steven Jeuris Oct 26 '11 at 10:41
  • 21
    Sort the list [O(100*log(100)) = O(1)] or do a linear search through it for the minimum [O(100) = O(1)] to get the new lowest number. Your list is a constant size, so all of these operations are also constant time. – Random832 Oct 26 '11 at 13:40
  • 7
    You don't have to keep the entire list sorted. You don't care what the highest or 2nd-highest number is. You just need to know what the lowest one is. So after you insert a new number, you just traverse the 100 numbers and see which is now lowest. That's constant-time. – Tom Zych Oct 26 '11 at 15:10
  • 2
    Hmm, come to think of it, since the list is constant size, sorting it is constant-time too. But the other way is faster. – Tom Zych Oct 26 '11 at 15:14
  • 1
Yes, I agree. We need not keep the list sorted; that was only my opinion. But traversing 100 numbers to see which is lowest is still not O(1), right? – Sachin Shanbhag Oct 26 '11 at 15:15
  • 2
    I think this really comes down to a matter of definitions: Namely, what's n? If n is the number of numbers you've seen then the 100 becomes a constant and it's obviously O(1). Or, since we are only paying attention to 100 numbers is n that 100? Then obviously we aren't going to find a solution better than log(n). – Loren Pechtel Oct 26 '11 at 17:15
  • 27
    The **asymptotic** order of an operation is only interesting **when the size of the problem can grow without bound.** It is very unclear from your question which quantity is growing without bound; it sounds like you're asking what the asymptotic order is for a problem whose size is bounded at 100; that's not even a sensible question to ask; something has to be growing without bound. If the question is "can you do it for keeping the top n, not the top 100, in O(1) time?" then the question is sensible. – Eric Lippert Oct 26 '11 at 18:32
  • 1
    Your friend forgot to ask a VERY important question: "Are the numbers arbitrary?" What if the answer was no? What if the interviewer tells you that they represent ages of human beings? This would change your answer SUBSTANTIALLY (for example you'd probably use a counting sort in that case). Interviewers ask ambiguous questions without enough info because they are trying to get you to ask clarifying questions. – riwalk Oct 26 '11 at 22:30
  • @Eric Lippert - Well, even I am not clear on that part. But as far as I know, there are arbitrary numbers coming in, and at any given point in time you need to return the top 100 highest. Or you can assume top n. But do you think the answer will differ in that case? i.e. top n instead of top 100. – Sachin Shanbhag Oct 27 '11 at 17:15
  • 2
    @SachinShanbhag: I can sort 100 numbers in O(1) time. I cannot sort n numbers in O(1) time. I cannot even insert a number into an already-sorted list of n numbers in O(1) time. So yes, this changes things significantly. It's possible the interviewer was testing if the interviewee understood Big O notation well enough to realize this. – Brian Mar 29 '13 at 23:18

11 Answers

35

Let's say k is the number of highest numbers you want to know (100 in your example). Then you can add a new number in O(k), which is also O(1), because O(k*g) = O(g) if k is nonzero and constant.
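As a sketch of this point (the helper names here are mine, not from the answer): a fixed-capacity sorted list where every operation touches at most k elements, so the per-number cost is O(k) = O(1) for constant k:

```python
import bisect

def make_topk_tracker(k=100):
    """Track the k highest numbers seen so far.
    The list stays sorted ascending, so top[0] is the current lowest."""
    top = []

    def add(n):
        if len(top) < k:
            bisect.insort(top, n)      # at most k shifts
        elif n > top[0]:               # beats the current lowest?
            top.pop(0)                 # drop the lowest: O(k)
            bisect.insort(top, n)      # O(k) again; constant for fixed k
        return top

    return add
```

No step here depends on how many numbers the stream has produced so far, only on k.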

duedl0r
  • 6
    O(50) is O(n), not O(1). Inserting into a list of length N in O(1) time means that time does not depend on value of N. That means if 100 becomes 10000, 50 must NOT become 5000. –  Oct 26 '11 at 08:37
  • 1
    @hamstergene: I don't share your opinion. Do you have a source which proves your claim? – duedl0r Oct 26 '11 at 08:46
  • 18
    @hamstergene - but in the case of this question, is `N` the size of the sorted list, or the number of items that have been processed thus far? If you process 10000 items, and keep the top 100 items in a list, or you process 1000000000 items, and keep the top 100 items in a sorted list, the insertion costs in that list remain the same. – Damien_The_Unbeliever Oct 26 '11 at 08:47
  • 3
    @duedl0r It's the basics. Read anything about Big-O notation, e.g. http://en.wikipedia.org/wiki/Analysis_of_algorithms or http://en.wikipedia.org/wiki/Big_O_notation You've confused time that is spent in one particular case (n=100) with *order of growth* which describes how time will grow as *n* grows. O(1) means that time does not grow when *n* grows. –  Oct 26 '11 at 08:53
  • 6
    @hamstergene: In that case you got the basics wrong. In your wikipedia link there is a property ("Multiplication by a constant"): `O(k*g) = O(g) if k not zero and constant`. => `O(50*1) = O(1)`. – duedl0r Oct 26 '11 at 08:57
  • 1
    @duedl0r 50 is not a constant. It's a function of list size (100). –  Oct 26 '11 at 09:02
  • 9
I think duedl0r is right. Let's reduce the problem and say that you only need the minimum and maximum values. Is this O(n) because the minimum and maximum are 2 (n = 2)? No. 2 is part of the definition of the problem. It's a constant, so it's the k in the O(k*something) that is equivalent to O(something) – xanatos Oct 26 '11 at 09:03
  • 9
    @hamstergene: what function are you talking about? the value 100 seems pretty constant to me.. – duedl0r Oct 26 '11 at 09:13
  • 1
    @duedl0r Order of complexity is not about how much time the algorithm will take, it's about what it depends on. The complexity of inserting into a list depends on list size. The time is constant because list size is constant, but dependency is still O(n), not O(1). –  Oct 26 '11 at 09:28
  • 1
    @duedl0r: You mean `O(100)` no? – Steven Jeuris Oct 26 '11 at 10:04
  • @StevenJeuris: Yes, for an unsorted list it's O(100). For a sorted list you can expect O(50), right? – duedl0r Oct 26 '11 at 10:16
  • 3
    @duedl0r: Oh, IC, well for asymptotic analysis you shouldn't even think that far. :) You could define it as O(k) where k is the list of items to keep in memory. Using e.g. a sorted list would then result in O(log k) I believe. – Steven Jeuris Oct 26 '11 at 10:23
  • 5
    Looking at the [Wikipedia Big-O](http://en.wikipedia.org/wiki/Big_O_notation) that @hamstergene linked, it actually has a *specific example* that describes using a constant-sized lookup-table as being **O(1)**. – Cyclops Oct 26 '11 at 15:49
  • @Cyclops, but this problem is not a lookup-table, which I read to be a static set of data to be used for reference. In this case the set of 100 values will change over time and we are discussing the effort to implement that change. – cdkMoose Oct 26 '11 at 16:58
  • Great observation. I've got an interview coming up with a major tech company--I'll keep this in mind. Just in case :) – riwalk Oct 26 '11 at 22:25
  • Sorry I'm a noob so what does O(x) mean, like in O(1) you are talking about? – Erica Xu Oct 26 '11 at 23:09
  • @Nosocks: Yes, x is just a number, like O(1) or O(100). It has to be a constant number, not a variable like `n` which grows if the problem gets bigger. – duedl0r Oct 27 '11 at 07:18
  • @duedl0r - So I assume that without having to keep the list sorted to get highest top 100, the order for inserting new number (if applicable) is always O(1) in above case which my friend was not able to identify in his interview, is it? – Sachin Shanbhag Oct 31 '11 at 13:58
  • 3
    @SachinShanbhag: Yes, searching in a set of constant count of elements is always O(1). – duedl0r Oct 31 '11 at 14:11
  • @EricaXu: [Big O Notation](http://en.wikipedia.org/wiki/Big_O_notation) . This is a concept that most programmers are expected to know. – Brian Mar 29 '13 at 23:22
  • It will be still O(g) even if k is zero, (but not theta) – RiaD Jan 11 '16 at 12:02
19

Keep the list unsorted. Figuring out whether or not to insert a new number will take longer, but the insertion itself will be O(1).
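A sketch of that idea (class and method names are mine): insertion is a single O(1) overwrite, and the O(k) minimum rescan is deferred until it is actually needed:

```python
class TopK:
    """Unsorted buffer of the k highest numbers seen (k = 100 in the question).
    Inserting is one overwrite; only *deciding* whether to insert may cost
    a scan for the minimum, and that scan is done lazily."""
    def __init__(self, k=100):
        self.k = k
        self.buf = []          # deliberately unsorted
        self.min_idx = None    # index of current minimum, or None if stale

    def offer(self, n):
        if len(self.buf) < self.k:
            self.buf.append(n)
            self.min_idx = None
            return
        if self.min_idx is None:   # O(k) rescan, only when the cache is stale
            self.min_idx = min(range(self.k), key=self.buf.__getitem__)
        if n > self.buf[self.min_idx]:
            self.buf[self.min_idx] = n   # O(1) overwrite, no shifting
            self.min_idx = None          # minimum is now unknown again
```

Whether this counts as an O(1) insert is exactly the smart-aleck point debated in the comments below.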

  • 7
    I think this would get you the [smart-aleck](http://english.stackexchange.com/questions/44245/which-smart-aleck-coined-the-term-smart-aleck/44246#44246) award if nothing else. *8') – Mark Booth Oct 26 '11 at 13:20
  • 4
    @Emilio, you are technically correct — and of course that is the best kind of correct… – Gareth Oct 27 '11 at 00:39
  • 1
But you can also keep track of the lowest of your 100 numbers; then you can decide whether to insert in O(1) as well. Only when you actually insert a number do you have to search for the new lowest, and that happens more rarely than deciding whether to insert, which happens for every new number. – Andrei Vajna Nov 03 '11 at 12:26
12

This is easy. The size of the list is constant, therefore the sort time of the list is constant. An operation that executes in constant time is said to be O(1). Therefore sorting the list is O(1) for a list of fixed size.

Kirk Broadhurst
9

Once you pass 100 numbers, the maximum cost you'll ever incur for the next number is the cost to check whether the number is in the highest 100 (let's label that CheckTime) plus the cost to enter it into that set and eject the lowest one (let's call that EnterTime). That is constant time (at least for bounded numbers), or O(1).

Worst = CheckTime + EnterTime

Next, if the distribution of numbers is random, the average cost decreases the more numbers you have. For example, the chance you will have to enter the 101st number into the maximum set is 100/101, the chances for the 1000th number would be 1/10, and the chances for the nth number would be 100/n. Thus, our equation for average cost will be:

Average = CheckTime + EnterTime * (100 / n)

Thus, as n approaches infinity, only CheckTime is important:

Average = CheckTime

If the numbers are bound, CheckTime is constant, and thus it is O(1) time.

If the numbers are not bound, the check time will grow with more numbers. Theoretically, this is because if the smallest number in the maximum set gets big enough, your check time will be greater because you'll have to consider more bits. That makes it seem like it will be slightly higher than constant time. However, you could also argue that the chance that the next number is in the highest set approaches zero as n approaches infinity and so the chance you'll need to consider more bits also approaches 0, which would be an argument for O(1) time.

I'm not positive, but my gut says that it is O(log(log(n))) time. This is because the chance that the lowest number increases is logarithmic, and the chance that the number of bits you need to consider for each check is logarithmic as well. I'm interested in other people's takes on this, because I'm not really sure...

Briguy37
  • Except that the list is arbitrary, what if it's a list of ever increasing numbers? – dan_waterworth Oct 29 '11 at 11:04
  • @dan_waterworth: If the infinite list is arbitrary and just happens to ever increase (the odds of which would be 1/∞!), that would fit the worst-case scenario of `CheckTime + EnterTime` for each number. This only makes sense if the numbers are unbounded, and so `CheckTime` and `EnterTime` will both increase at least logarithmically due to the increase in size of the numbers. – Briguy37 Oct 29 '11 at 19:06
  • 1
    The numbers aren't random, there are arbitrary. It makes no sense to talk about odds. – dan_waterworth Oct 30 '11 at 08:46
  • @dan_waterworth: You've said twice now that the numbers are arbitrary. Where are you getting this from? Also, I believe you can still apply statistics to arbitrary numbers starting with the random case, and improve their accuracy as you know more about the arbiter. For example, if you were the arbiter, it appears there would be a greater chance of selecting ever-increasing numbers than if, say, I was the arbiter ;) – Briguy37 Oct 31 '11 at 14:27
7

This one is easy if you know binary heap trees. Binary heaps support insertion in average constant time, O(1), and give you easy access to the first x elements.

ratchet freak
  • Why store the elements you don't need? (the values which are too low) Seems like a custom algorithm is more appropriate. Not saying you can't 'not add' the values when they aren't higher than the lowest ones. – Steven Jeuris Oct 26 '11 at 10:07
  • I don't know, my intuition tells me that a heap (of some flavor) could pull this off pretty well. Doesn't mean he would have to keep all of the elements to do so. I didn't research it but it "feels right" (TM). – Rig Oct 26 '11 at 12:42
  • A heap is the right structure in general for this, but you definitely don't want to keep the irrelevant numbers, and I am not sure how easy it would be to convert a heap to throwing away the unneeded numbers. – jprete Oct 26 '11 at 12:59
  • 3
    A heap could be modified to discard anything below some mth level (for binary heaps and k=100, m would be 7, since the number of nodes = 2^m-1). This would slow it down, but it would still be amortized constant time. – Plutor Oct 26 '11 at 13:40
  • 3
    If you used a binary min-heap (because then the top is the minimum, which you're checking all the time) and you find a new number > min, then you have to remove the top element before you can insert a new one. Removing the top (min) element will be O(logN) because you have to traverse every level of the tree once. So it's only technically true that inserts are average O(1) because in practice it's still O(logN) every time you find a number > min. – Scott Whitlock Oct 26 '11 at 18:01
  • 1
    @Plutor, you're assuming some guarantees that binary heaps don't give you. Visualising it as a binary tree, it could be the case that each element in the left branch is smaller than any element in the right branch, but you're assuming that the smallest elements are nearest the root. – Peter Taylor Oct 26 '11 at 18:55
  • Insertion into a binary heap is O(log N), where N is the size of the heap, not O(1). And it gives you easy access to the first 1 element, not to some k elements. (Unless you count O(k * log N) operation that destroys the heap as “easy”.) – svick Oct 26 '11 at 21:39
  • I don't think a binary heap would be an appropriate flavor because of the items mentioned. Maybe some flavor of Fibonacci heap. Oriented correctly, I think its attributes would be good for this. Again, it just "feels right". Maybe I'll look into it at some point. – Rig Oct 27 '11 at 03:56
6

If by the question the interviewer really meant to ask "can we make sure each incoming number is processed in constant time", then as many have already pointed out (e.g. see @duedl0r's answer), your friend's solution is already O(1), and it would be so even if he had used an unsorted list, or bubble sort, or whatever else. In this case the question does not make much sense, unless it was a trick question or you remember it wrong.

I assume the interviewer's question was meaningful, and that he was not asking how to make something O(1) that very obviously already is.

Because questioning algorithm complexity only makes sense when the size of the input grows indefinitely, and the only input that can grow here is the list size (the 100), I assume the real question was: "can we make sure we get the top N spending O(1) time per number (not O(N) as in your friend's solution)? Is it possible?"

The first thing that comes to mind is counting sort, which buys O(1) time per number for the top-N problem at the price of O(m) space, where m is the length of the range of incoming numbers. So yes, it is possible.
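A sketch of the counting-sort idea, assuming (as the answer does) a bounded value range of size m (here M = 1000; all names are mine):

```python
M = 1000          # size of the known value range [0, M); assumed bounded

tally = [0] * M   # O(m) space

def observe(n):
    """O(1) work per incoming number, regardless of N."""
    tally[n] += 1

def top_n(count=100):
    """Walk the tally from the top down; O(m), independent of stream length."""
    out = []
    for v in range(M - 1, -1, -1):
        out.extend([v] * min(tally[v], count - len(out)))
        if len(out) == count:
            break
    return out
```

The per-number cost is a single array increment; the O(m) walk is only paid when the top N is actually requested.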

hamstergene
4

Use a min-priority queue implemented with a Fibonacci heap, which has constant insertion time:

1. Insert first 100 elements into PQ
2. loop forever
       n = getNextNumber();
       if n > PQ.findMin() then
           PQ.deleteMin()
           PQ.insert(n)
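The same loop can be sketched with Python's `heapq`, a binary min-heap rather than a Fibonacci heap, so the min-replacement costs O(log k) instead of the amortized bounds the answer relies on:

```python
import heapq

def top_100(stream, k=100):
    """Min-heap of the k largest numbers seen; heap[0] is the smallest of them.
    heapq is a binary heap, so each replacement is O(log k)."""
    it = iter(stream)
    heap = [next(it) for _ in range(k)]   # insert first k elements into the PQ
    heapq.heapify(heap)
    for n in it:
        if n > heap[0]:                   # n beats the current minimum
            heapq.heapreplace(heap, n)    # deleteMin + insert in one step
    return heap
```

As the comments note, it is the deleteMin, not the insert, that dominates the cost here.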
  • 4
    _"Operations delete and delete minimum work in `O(log n)` amortized time"_, so this would still result in `O(log k)` where `k` is the amount of items to store. – Steven Jeuris Oct 26 '11 at 18:29
  • 1
    This is no different than [Emilio's answer](http://programmers.stackexchange.com/questions/116346/get-100-highest-numbers-from-an-infinite-list/116366#116366) which got dubbed the "smart-aleck award" since the *delete min* operates in **O(log n)** (according to Wikipedia). – Nicole Oct 26 '11 at 18:31
  • @Renesis Emilio's answer would be O(k) to find the minimum, mine is O(log k) – Gabe Moothart Oct 26 '11 at 20:52
  • 1
    @Gabe Fair enough, I just mean in principle. In other words, if you don't take 100 to be a constant, then this answer is not contant time either. – Nicole Oct 26 '11 at 21:02
  • @Renesis I've removed the (incorrect) statement from the answer. – Gabe Moothart Oct 26 '11 at 21:15
2

The task is clearly to find an algorithm that is O(1) in the length N of the required list of numbers. So it doesn't matter whether you need the top 100 numbers or the top 10000: the insertion time should be O(1).

The trick here is that although that O(1) requirement is mentioned for the list insert, the question didn't say anything about the order of search time in the whole number space, but it turns out this can be made O(1) as well. The solution then is as follows:

  1. Arrange for a hashtable with numbers for keys and pairs of linked-list pointers for values. Each pair of pointers is the start and end of a linked-list sequence; this will normally be just one element, then the next. Every element in the linked list goes next to the element with the next highest number, so the linked list contains the sorted sequence of required numbers. Keep a record of the lowest number.

  2. Take a new number x from the random stream.

  3. Is it higher than the last recorded lowest number? Yes => Step 4, No => Step 2

  4. Hit the hash table with the number just taken. Is there an entry? Yes => Step 5. No => Take a new number x-1 and repeat this step (this is a simple downward linear search, just bear with me here, this can be improved and I'll explain how)

  5. With the list element just obtained from the hash table, insert the new number just after the element in the linked list (and update the hash)

  6. Take the lowest number l recorded (and remove it from the hash/list).

  7. Hit the hash table with the number just taken. Is there an entry? Yes => Step 8. No => Take a new number l+1 and repeat this step (this is a simple upward linear search)

  8. With a positive hit the number becomes the new lowest number. Go to step 2

To allow for duplicate values the hash actually needs to maintain the start and end of the linked list sequence of elements that are duplicates. Adding or removing an element at a given key thus increases or decreases the range pointed to.

The insert here is O(1). The searches mentioned are, I guess, something like O(average difference between numbers). The average difference increases with the size of the number space, but decreases with the required length of the list of numbers.

So the linear search strategy is pretty poor, if the number space is large (e.g. for a 4 byte int type, 0 to 2^32-1) and N=100. To get around this performance issue you can keep parallel sets of hashtables, where the numbers are rounded to higher magnitudes (e.g. 1s, 10s, 100s, 1000s) to make suitable keys. In this way you can step up and down gears to perform the required searches more quickly. The performance then becomes an O(log numberrange), I think, which is constant, i.e. O(1) also.

To make this clearer, imagine that you have the number 197 to hand. You hit the 10s hash table, with '190', it's rounded to the nearest ten. Anything? No. So you go down in 10s until you hit say 120. Then you can start at 129 in the 1s hashtable, then try 128, 127 until you hit something. You've now found where in the linked list to insert the number 197. Whilst putting it in, you must also update the 1s hashtable with the 197 entry, the 10s hashtable with the number 190, 100s with 100, etc. The most steps you ever have to do here are 10 times the log of the number range.

I might have got some of the details wrong, but since this is the programmers exchange, and the context was interviews I would hope the above is a convincing enough answer for that situation.

EDIT I added some extra detail here to explain the parallel hashtable scheme and how it means the poor linear searches I mentioned can be replaced with an O(1) search. I've also realised there is of course no need to search for the next lowest number, because you can step straight to it by looking in the hashtable with the lowest number and progressing to the next element.

Benedict
  • 1
    The search has to be part of the insert function - they are not independent functions. Since your search is O(n), your insert function is also O(n). – Kirk Broadhurst Oct 26 '11 at 21:50
  • No. Using the strategy I have described, where more hashtables are used to traverse the number space more quickly, it is O(1). Please read my answer again. – Benedict Oct 27 '11 at 08:06
  • 1
    @Benedict, your answer says quite plainly that it has linear searches in steps 4 and 7. Linear searches are not O(1). – Peter Taylor Oct 27 '11 at 08:14
  • Yes, it does, but I deal with that later. Would you mind actually reading the rest please. If necessary I'll edit my answer to make it abundantly clear. – Benedict Oct 27 '11 at 08:17
  • @Benedict You are correct - excluding the search, your answer is O(1). Unfortunately this solution won't work without the search. – Kirk Broadhurst Oct 27 '11 at 08:33
  • Since my solution seems not to have been convincing enough I've tried to elaborate. – Benedict Oct 27 '11 at 09:02
1

Can we assume that the numbers are of a fixed data type, such as Integer? If so, then keep a tally of every single number that is added. This is an O(1) operation.

  1. Declare an array with as many elements as there are possible numbers:
  2. Read each number as it is streamed.
  3. Tally the number. Ignore it if that number has been tallied 100 times already as you'll never need it. This prevents overflows from tallying it an infinite number of times.
  4. Repeat from step 2.

VB.Net code:

Const Capacity As Integer = 100

Dim Tally(Integer.MaxValue) As Integer ' Assume all elements = 0
Dim Value As Integer
Do
    Value = ReadValue()
    If Tally(Value) < Capacity Then Tally(Value) += 1
Loop

When you return the list, you can take as long as you like. Simply iterate from the end of the tally and create a new list of the highest 100 values recorded. This is an O(n) operation, but that's irrelevant.

Dim List(Capacity) As Integer
Dim ListCount As Integer = 0
Dim Value As Integer = Tally.Length - 1
Dim ValueCount As Integer = 0
Do Until ListCount = List.Length OrElse Value < 0
    If Tally(Value) > ValueCount Then
        List(ListCount) = Value
        ValueCount += 1
        ListCount += 1
    Else
        Value -= 1
        ValueCount = 0
    End If
Loop
Return List

Edit: In fact, it doesn't really matter whether it's a fixed data type. Given that there are no imposed limits on memory (or hard disk) consumption, you could make this work for any range of positive integers.

Hand-E-Food
1

A hundred numbers are easily stored in an array of size 100. Any tree, list or set is overkill, given the task at hand.

If the incoming number is higher than the lowest (=last) in the array, run over all entries. Once you find the first that is smaller than your new number (you can use fancy searches to do that), run through the remainder of the array, pushing each entry "down" by one.

Since you keep the list sorted from the beginning, you don't need to run any sort algorithm at all. This is O(1).
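A minimal sketch of this scheme in Python (the constant K and the function name are mine): the array is kept sorted descending from the start, entries are pushed "down" by one, and no sort pass ever runs:

```python
K = 100  # fixed capacity, per the question

def insert(arr, n):
    """Insert n into the descending array `arr` of at most K entries.
    At most K comparisons and K shifts happen; K is fixed."""
    if len(arr) == K and n <= arr[-1]:
        return                  # not higher than the lowest (= last) entry
    i = 0
    while i < len(arr) and arr[i] >= n:
        i += 1                  # find the first entry smaller than n
    arr.insert(i, n)            # pushes the remainder "down" by one
    if len(arr) > K:
        arr.pop()               # the old lowest falls off the end
```

Whether those K shifts count as O(1) or O(K) is, again, the crux of the whole comment thread above.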

Jörg Z.
0

You could use a Binary Max-Heap. You'd have to keep track of a pointer to the minimum node (which could be unknown/null).

You start by inserting the first 100 numbers into the heap. The max will be at the top. After this is done, you will always keep 100 numbers in there.

Then when you get a new number:

if(minimumNode == null)
{
    minimumNode = findMinimumNode();
}
if(newNumber > minimumNode.Value)
{
    heap.Remove(minimumNode);
    minimumNode = null;
    heap.Insert(newNumber);
}

Unfortunately findMinimumNode is O(n), and you do incur that cost once per insert (but not during the insert :). Removing the minimum node and inserting the new node are, on average, O(1) because they'll tend towards the bottom of the heap.

Going the other way with a Binary Min-Heap, the min is at the top, which is great for finding the min for comparison, but sucks when you have to replace the minimum with a new number that's > min. That's because you have to remove the min node (always O(logN)) and then insert the new node (average O(1)). So, you still have O(logN) which is better than Max-Heap, but not O(1).

Of course, if N is constant, then you always have O(1). :)

Scott Whitlock