50

I'm trying to teach myself how to calculate Big-O notation for an arbitrary function. I found this function in a textbook. The book asserts that the function is O(n²). It gives an explanation as to why this is, but I'm struggling to follow it. I wonder if someone might be able to show me the math behind why this is so. Fundamentally, I understand that it is something less than O(n³), but I couldn't independently land on O(n²).

Suppose we are given three sequences of numbers, A, B, and C. We will assume that no individual sequence contains duplicate values, but that there may be some numbers that are in two or three of the sequences. The three-way set disjointness problem is to determine if the intersection of the three sequences is empty, namely, that there is no element x such that x ∈ A, x ∈ B, and x ∈ C.

Incidentally, this is not a homework problem for me -- that ship sailed years ago : ) -- just me trying to get smarter.

def disjoint(A, B, C):
    """Return True if there is no element common to all three lists."""
    for a in A:
        for b in B:
            if a == b:  # only check C if we found a match from A and B
                for c in C:
                    if a == c:  # (and thus a == b == c)
                        return False  # we found a common value
    return True  # if we reach this, the sets are disjoint

[Edit] According to the textbook:

In the improved version, it is not simply that we save time if we get lucky. We claim that the worst-case running time for disjoint is O(n²).

The book's explanation, which I struggle to follow, is this:

To account for the overall running time, we examine the time spent executing each line of code. The management of the for loop over A requires O(n) time. The management of the for loop over B accounts for a total of O(n²) time, since that loop is executed n different times. The test a == b is evaluated O(n²) times. The rest of the time spent depends upon how many matching (a, b) pairs exist. As we have noted, there are at most n such pairs, and so the management of the loop over C, and the commands within the body of that loop, use at most O(n²) time. The total time spent is O(n²).

(And to give proper credit ...) The book is: Data Structures and Algorithms in Python by Michael T. Goodrich et al., Wiley Publishing, p. 135

[Edit] As justification, below is the code before optimization:

def disjoint1(A, B, C):
    """Return True if there is no element common to all three lists."""
    for a in A:
        for b in B:
            for c in C:
                if a == b == c:
                    return False  # we found a common value
    return True  # if we reach this, the sets are disjoint

In the above, you can clearly see that this is O(n³), because each loop must run to its full length. The book would assert that in the simplified example (given first), the third loop contributes only O(n²), so the complexity equation goes as k + O(n²) + O(n²), which ultimately yields O(n²).

While I cannot prove this is the case (thus the question), the reader can agree that the complexity of the simplified algorithm is at most that of the original.

[Edit] And to prove that the simplified version is quadratic:

import time

# get_random(c) is a helper (not shown) that returns a list of c random numbers;
# disjoint2 is the simplified version from the top of the question.
if __name__ == '__main__':
    for c in [100, 200, 300, 400, 500]:
        l1, l2, l3 = get_random(c), get_random(c), get_random(c)
        start = time.time()
        disjoint1(l1, l2, l3)
        print(time.time() - start)
        start = time.time()
        disjoint2(l1, l2, l3)
        print(time.time() - start)

Yields (for each size, the first time is for disjoint1, the second for disjoint2):

0.02684807777404785
0.00019478797912597656
0.19134306907653809
0.0007600784301757812
0.6405444145202637
0.0018095970153808594
1.4873297214508057
0.003167390823364258
2.953308343887329
0.004908084869384766

Since the second differences are roughly constant, the simplified function is indeed quadratic:


[Edit] And yet even further proof:

If I assume worst case (A = B != C),

# disjoint1 and disjoint2 are here instrumented to return iteration counts
# instead of booleans (a sketch of the instrumentation is below the results).
if __name__ == '__main__':
    for c in [10, 20, 30, 40, 50]:
        l1, l2, l3 = range(0, c), range(0, c), range(5*c, 6*c)
        its1 = disjoint1(l1, l2, l3)
        its2 = disjoint2(l1, l2, l3)
        print(f"iterations1 = {its1}")
        print(f"iterations2 = {its2}")

yields:

iterations1 = 1000
iterations2 = 100
iterations1 = 8000
iterations2 = 400
iterations1 = 27000
iterations2 = 900
iterations1 = 64000
iterations2 = 1600
iterations1 = 125000
iterations2 = 2500

Using the second-difference test, the worst-case result is exactly quadratic (note that iterations2 is exactly c², while iterations1 is exactly c³).
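The instrumented versions aren't shown above; roughly, the counter tracks how many times the innermost loop body runs and is returned in place of the boolean. A sketch only -- the exact placement of the counter in the code I ran may differ slightly:

def disjoint2(A, B, C):
    """Simplified version, instrumented to count innermost-loop iterations."""
    iterations = 0
    for a in A:
        for b in B:
            if a == b:
                for c in C:
                    iterations += 1
                    if a == c:
                        return iterations
    return iterations

# disjoint1 is instrumented the same way, with the counter incremented
# inside its innermost loop.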


SteveJ
  • 6
    Either the book is wrong or your transcription is. – candied_orange Sep 08 '19 at 03:39
  • 4
    @candied_orange; to be fair, there is a third option. Both the book and my transcription could be correct. I think you can derive what that means : ) The book is using this as a specific example of how adding the a == b check reduces this from O(n³) to O(n²). – SteveJ Sep 08 '19 at 03:42
  • 2
    In "the worst case" everything goes against you. Sorry but big O doesn't care. – candied_orange Sep 08 '19 at 03:44
  • 4
    @candied_orange; I would prefer this not turn into a tit-for-tat discussion. Hopefully you can allow for the possibility that a vetted text book written by three college professors deserves the benefit of the doubt. – SteveJ Sep 08 '19 at 03:47
  • 6
    Nope. Wrong is wrong regardless of how well cited. Either explain why we can't simply assume these if's go the worst way they can when doing big O analysis or accept the results you're getting. – candied_orange Sep 08 '19 at 03:51
  • Also, books have [errata sheets](https://en.wikipedia.org/wiki/Erratum#Errata_sheets) for a reason. – candied_orange Sep 08 '19 at 03:55
  • 8
    @candied_orange; I've added some further justification to the best of my ability - not my strong suit. I would ask that you again allow for the possibility that you might indeed be incorrect. You have made your point, duly taken. – SteveJ Sep 08 '19 at 04:00
  • No, the optimization doesn’t reduce the _worst case_ complexity, because in the worst case that if becomes `if true`. Clearly `if true` can’t reduce algorithmic complexity. Not in languages that are eagerly evaluated, like this Python is (though some Python is lazily evaluated). If this were Haskell, it’d be different. – Telastyn Sep 08 '19 at 04:06
  • 1
    @SteveJ I'll willingly allow it for whatever that is worth. It's how I learn. I appreciate you editing the question. k + O(n²) + O(n²) would be the result if we were choosing either B or C to iterate over as we iterate over A. But we're choosing to maybe loop over C as we loop A and B. When it's worst case (big O) you always assume the worst. If you want to explain why that's wrong feel free. – candied_orange Sep 08 '19 at 04:11
  • Let me put it this way: anyone answering your question is entitled to tell you you're framing the question wrong and that your premise is flawed. Even if you can cite an authoritative book. If you ask why 2=3 then don't be surprised if the answer is "it's not". – candied_orange Sep 08 '19 at 04:22
  • @candied_orange; So I ran both functions and took the time values. I then ran the second-difference test to demonstrate that the simplified version is indeed a quadratic, not a cubic, function. – SteveJ Sep 08 '19 at 04:28
  • 8
    Random numbers aren’t your worst case. That proves nothing. – Telastyn Sep 08 '19 at 04:28
  • @Telastyn; The timing tests shows indeed that the simplified version is quadratic, see my update. – SteveJ Sep 08 '19 at 04:28
  • 2
    @Telastyn; So what is the worst case? I'll create the data sets to represent the worst case and let the results speak for themselves. – SteveJ Sep 08 '19 at 04:33
  • @stevej - as I mention in my answer, A and B always match, but C never does. A simple example is A and B are always 1 and C is always 2. You might need more elements in your lists to notice a difference from simple timing though. – Telastyn Sep 08 '19 at 04:35
  • @SteveJ regarding your quadratic edit: Telastyn already told you that your worst case is A=B!=C. You could have profiled that. This profiles the average case when you get lucky. – candied_orange Sep 08 '19 at 04:36
  • @Telastyn; With all due respect -- running enough random-number tests would eventually get me the worst case, but every time I do so, I get a quadratic. – SteveJ Sep 08 '19 at 04:49
  • 3
    @SteveJ with respect eventually playing the lottery pays off. Or you could simply test the winning numbers you've already been given. – candied_orange Sep 08 '19 at 04:52
  • 2
    @stevej - no... 100 pairs of random integers being 100% equal is unlikely to happen before the heat death of the universe. – Telastyn Sep 08 '19 at 04:53
  • @Telastyn, I'm likely going to delete this post, because it has lost its value. However, you would have to agree that I've both cited a trustworthy text and at least shown results to substantiate my assertions. So far, the only rebuttals I've seen are "because I say so" -- kind of a disappointing response from engineers. – SteveJ Sep 08 '19 at 04:56
  • @stevej - one moment, I'm updating my answer. – Telastyn Sep 08 '19 at 04:57
  • 4
    @Telastyn; Also, I'm wondering if you missed the requirement that; "We will assume that no individual sequence contains duplicate values." – SteveJ Sep 08 '19 at 05:01
  • @steveJ - I did miss that, but the worst case still applies. A and B match, but C doesn't. – Telastyn Sep 08 '19 at 05:08
  • @Telastyn; The results simply don't agree. If I change the measurement to iterations instead of time, I get iterations1 = 120050, iterations2 = 100, iterations1 = 990000, iterations2 = 100. In every way of measuring the results, the simplified version grows at a much, much slower rate than the non-simplified one, and the growth difference is not a constant, it is another order of magnitude. – SteveJ Sep 08 '19 at 05:19
  • @stevej - if you use worst case data? – Telastyn Sep 08 '19 at 05:20
  • @Telastyn; Yes, if I change the code to l1, l2, l3 = range(0, c), range(0,c), range(5*c, 6*c) (A = B != C), my results are: iterations1 = 1000, iterations2 = 100; iterations1 = 27000, iterations2 = 900; iterations1 = 125000, iterations2 = 2500. Note how differently the counts grow for the two functions. – SteveJ Sep 08 '19 at 05:27
  • 1
    @Telastyn; I've updated my answer to show worst case. Unless I'm missing something, worst case is exactly quadratic. – SteveJ Sep 08 '19 at 05:34
  • 8
    ahh. okay. The "no sequence has duplicate values" does change the worst case since C can only trigger once per any A. Sorry about the frustration - that's what I get for being on stackexchange late on a Saturday :D – Telastyn Sep 08 '19 at 05:35
  • 2
    @candied_orange; Can we now agree that indeed, given the latest proof code, that the worst case is quadratic? If I delete this question, and ask again, I won't be met with the same resistance? – SteveJ Sep 08 '19 at 05:37
  • @SteveJ The timings provided are not relevant to Big-O analysis and cannot be used to "prove" anything. At best, you can only observe results that are "consistent" with the expected behavior, but since this kind of analysis is an *abstract estimate of a function's behavior as input sizes grow towards infinity*, it's not really tied to, or dependent on, any particular or concrete computer system. – code_dredd Sep 08 '19 at 10:08
  • You could have added why you struggle with the explanation in the book. – Carsten S Sep 08 '19 at 12:12
  • @candied_orange: AFAICT, worst case is O(n²) because the authors decreed it cannot be O(n³). See [here](https://softwareengineering.stackexchange.com/a/397115/270812). – Eric Duminil Sep 08 '19 at 17:33
  • I assume all three sequences are the same size? Otherwise there cannot be a single _n_ and the big O result gets more complicated – jaskij Sep 09 '19 at 02:38
  • @code_dredd; not arguing with you, just following up for my own edification. Wouldn't timing be useful if run on the same hardware with multiple iterations -- at least when determining worst-case orders of magnitude? – SteveJ Sep 09 '19 at 02:46
  • 2
    @JanDorniak, usually the *n* is the size of the input. In this case, the most natural way to interpret it is the sum of the lengths of the three lists. Then the length of each list is *O(n)* and the analysis is correct. It is of course possible to do a more detailed analysis in terms of the lengths of the individual lists. – Carsten S Sep 09 '19 at 08:44
  • @CarstenS if you take n as the sum of their sizes you would have to prove that it does not matter what fraction of n each set is. Worst case is likely to be each having size of n/3 but I'm not sure and we go back to the same assumption. Still this all comes down to the fact that n is not defined in the question. – jaskij Sep 09 '19 at 09:45
  • 2
    @JanDorniak, as I said, each of the individual lists has length *O(n)*, which is sufficient for the argument to show that the algorithm is *O(n^2)*. – Carsten S Sep 09 '19 at 10:14
  • @CarstenS I missed that, you're right – jaskij Sep 09 '19 at 10:20
  • @JanDorniak, of course it would be better to state explicitly what *n* is. – Carsten S Sep 09 '19 at 11:54
  • @SteveJ *"not arguing with you, just following up"* I know; we're just discussing the topic. *"Wouldn't timing be useful if ran on the same hardware with multiple iterations"* It might be of *very limited* usefulness, but doesn't work as a proof of anything; that was your original stmt. *"worst case orders of magnitude?"* Not really; at best, you can only measure an *instance* of the problem, in a specific CPU architecture and microcode, with a specific OS workload, based on compiler, compilation options, etc, rather than the general problem itself. You can only extrapolate the latter. – code_dredd Sep 09 '19 at 18:41
  • 3
    @SteveJ You are very patient and tactful. Thank you for the question, I learned something. – MackM Sep 10 '19 at 16:29

8 Answers

70

The book is indeed correct, and it provides a good argument. Note that timings are not a reliable indicator of algorithmic complexity. The timings might only consider a special data distribution, or the test cases might be too small: algorithmic complexity only describes how resource usage or runtime scales beyond some suitably large input size.

The book makes the argument that complexity is O(n²) because the if a == b branch is entered at most n times. This is non-obvious because the loops are still written as nested. It is more obvious if we extract it:

def disjoint(A, B, C):
  AB = (a
        for a in A
        for b in B
        if a == b)
  ABC = (a
         for a in AB
         for c in C
         if a == c)
  for a in ABC:
    return False
  return True

This variant uses generators to represent intermediate results.

  • In the generator AB, we will have at most n elements (because of the guarantee that input lists won't contain duplicates), and producing the generator takes O(n²) time.
  • Producing the generator ABC involves a loop over the generator AB of length n and over C of length n, so that its algorithmic complexity is O(n²) as well.
  • These operations are not nested but happen independently, so that the total complexity is O(n² + n²) = O(n²).

Because pairs of input lists can be checked sequentially, it follows that determining whether any number of lists are disjoint can be done in O(n²) time.

This analysis is imprecise because it assumes that all lists have the same length. We can say more precisely that AB has at most length min(|A|, |B|) and producing it has complexity O(|A|•|B|). Producing ABC has complexity O(min(|A|, |B|)•|C|). Total complexity then depends how the input lists are ordered. With |A| ≤ |B| ≤ |C| we get total worst-case complexity of O(|A|•|C|).
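As an aside, a hypothetical wrapper (mine, not part of the book's code) could enforce that favourable ordering by making the two shortest containers the ones driving the nested loops; disjoint here is the quadratic version from the question:

def disjoint_ordered(A, B, C):
    # Sort the inputs by length so the nested loops run over the two
    # shortest containers and the longest one is only scanned on a match;
    # with |A| <= |B| <= |C| this gives the O(|A|*|C|) bound above.
    A, B, C = sorted((A, B, C), key=len)
    return disjoint(A, B, C)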

Note that efficiency wins are possible if the input containers allow for fast membership tests rather than having to iterate over all elements. This could be the case when they are sorted so that a binary search can be done, or when they are hash sets. Without explicit nested loops, this would look like:

for a in A:
  if a in B:  # might implicitly loop
    if a in C:  # might implicitly loop
      return False
return True

or in the generator-based version:

AB = (a for a in A if a in B)
ABC = (a for a in AB if a in C)
for a in ABC:
  return False
return True
amon
  • 5
    This would be so much clearer if we just abolished this magical `n` variable, and talked about the actual variables at play. – Alexander Sep 08 '19 at 18:58
  • 2
    @Alexander `n` *is* the actual variable at play... – code_dredd Sep 08 '19 at 23:53
  • 15
    @code_dredd No it's not, it has no direct connection to the code. It's an abstraction that envisions that `len(a) == len(b) == len(c)`, which although true in the context of time complexity analysis, tends to confuse the conversation. – Alexander Sep 09 '19 at 00:14
  • 1
    @Alexander I get your "technically, it's..." type of response, but nothing you've said changes what I noted. In time-complexity analysis `n` is the name of the variable generally used to represent the size of the input. If there's any confusion, it's b/c someone hasn't been careful with how it gets used (e.g. multiple inputs, but not different variables for each of them). This answer addressed that by instead referring to the cardinality of the sets in question (e.g. |A|, etc). – code_dredd Sep 09 '19 at 00:19
  • @Alexander you can do that but the point of the magical n is that you can use it to simplify until your big O expression visually matches a known category. – candied_orange Sep 09 '19 at 01:30
  • Thank you for the explanation -- breaking it out into series operations adds the clarity I needed. A follow-up question: wouldn't you agree that timing does help prove complexity if run on the same hardware over multiple iterations? At least in this case, I think it useful to differentiate between O(n³) and O(n²) – SteveJ Sep 09 '19 at 02:48
  • 1
    @candied_orange Converting something like `O(width * height)` to `O(n^2)` is easy. Deriving how something is `O(n^2)` (that is to say, getting "back to" `O(width * height)`) is hard. So as a matter of communication, it's more beneficial to provide the first form (because the second form is easily derived), than the opposite. – Alexander Sep 09 '19 at 04:21
  • 3
    @SteveJ Timing can help verify the complexity, but not only is good timing very tricky, it also runs into problems with the definition of Big-O notation. Let T(n) be the runtime and f(n) some complexity class. Then T(n) is in O(f(n)) if there exists some minimal input size n0 and some constant factor k, so that for every input n ≥ n0: T(n) ≤ k f(n). So the complexity class only describes *growth* at suitably large input sizes. We do not know up front what that minimum size n0 is. Below this size, other effects can dominate. This also means: best complexity doesn't imply best performance. – amon Sep 09 '19 at 05:42
  • 1
    All of that is true, but still, “timings are not a useful indicator of algorithmic complexity” is a bit of an exaggeration. For lots of algorithms, timing (of not-too-tiny randomised test cases) _can_ give a decent idea of the average-case complexity. It is mostly in advanced “big-O-optimised” algorithms like e.g. Schönhage-Strassen that you need to go to _huge_ inputs before the complexity becomes evident. – leftaroundabout Sep 09 '19 at 09:12
  • 11
    Perhaps saying that OP's code has worst case complexity O(|A|•|B| + min(|A|, |B|)•|C|) is enough to trigger understanding? – Pablo H Sep 09 '19 at 11:23
  • 3
    One other thing about timing tests: as you found out, they did not help you in understanding what was going on. On the other hand, they seem to have given you additional confidence in standing up to various incorrect but forcefully stated claims that the book was obviously wrong, so that's a good thing, and in this case, your testing beat intuitive hand-waving... For understanding, a more effective way of testing would be to run it in a debugger with breakpoints (or add prints of the variables' values) at the entry of each loop. – sdenham Sep 09 '19 at 15:42
  • Even though there may be two input variables, the worst case is when both variables have the same value. – Barmar Sep 09 '19 at 15:52
  • 5
    "Note that timings are not a useful indicator of algorithmic complexity. " I think this would be more accurate if it said "rigorous" or "reliable" rather than "useful". – Acccumulation Sep 09 '19 at 18:44
  • 2
    "This could be the case when they are sorted so that a binary search can be done" Of course, if they are sorted, then there are O(n) algorithms. Just take the smallest element of each set and compare them, then for the smallest of those element, take the next element of its set. – Acccumulation Sep 09 '19 at 18:54
  • Final follow up question. Instead of timing, I ultimately put a counter in the code (at the return statements) to count total iterations. Am I correct in asserting that to be an accurate measure of complexity (assuming that I can identify worst case and the counters are properly placed?) – SteveJ Sep 09 '19 at 19:31
  • @Acccumulation two very good points. I rephrased the beginning paragraph to use “reliable”. – amon Sep 09 '19 at 20:03
  • 1
    @SteveJ Counters are often better than timings in practice because they get rid of many confounding factors like random noise or constant factors. However, the counts still depend on the input data distribution, and may inadvertently only exercise best-case scenarios. Even random data is often the best case scenario! So counters are useful at building an intuition, especially when only making statements about a particular data distribution. But I think it's still a stretch to infer a complexity class from counts or measurements. – amon Sep 09 '19 at 20:08
  • Pablo H, when we do big O we throw out constants in front of n and other factors of a lower order. O(350n² + 40000n - 15) is just O(n²). The reason for this is we are not discussing small algorithmic variants with big O; we are discussing the effect of n as n gets huge. – Michael Sep 09 '19 at 22:09
  • @Michael: I know :-) Perhaps you confused |A| etc. with constants? They are [sizes of inputs](https://en.wikipedia.org/wiki/Big_O_notation#Multiple_variables). – Pablo H Sep 10 '19 at 12:23
8

Note that if all elements are different within each list, which is assumed, you can iterate over C at most once for each element of A (namely, when there is an element in B equal to it). So the inner loop is O(n²) in total.

RiaD
6

We will assume that no individual sequence contains duplicate values.

is a very important piece of information.

Otherwise, the worst case of the optimized version would still be O(n³), for example when A and B are equal and contain one element duplicated n times:

i = 0
def disjoint(A, B, C):
    global i
    for a in A:
        for b in B:
            if a == b:
                for c in C:
                    i+=1
                    print(i)
                    if a == c:
                        return False 
    return True 

print(disjoint([1] * 10, [1] * 10, [2] * 10))

which outputs:

...
...
...
993
994
995
996
997
998
999
1000
True

If the inputs of this function are considered to be three arbitrary collections, the above code is O(n³).

But, as mentioned by @sdenham :

by stipulation, it is being analyzed as a function from three sets to a boolean, which is O(n²) for a non-obvious (and therefore pedagogically useful) reason.

As explained by other answers, if no duplicates are allowed, the worst-case is indeed O(n²).

An additional optimization would be to use sets or dicts in order to test inclusion in O(1). In that case, disjoint would be O(n) for every input.
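A rough sketch of that idea (my code, not from the book; with hash-based sets the lookups are O(1) on average, so the whole check is expected O(n)):

def disjoint_with_sets(A, B, C):
    """Expected O(n): build the sets once, then do O(1)-average lookups."""
    B_set, C_set = set(B), set(C)
    return not any(a in B_set and a in C_set for a in A)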

Eric Duminil
  • Your last comment is quite interesting, hadn't thought of that. Are you suggesting that is due to your being able to do three O(n) operations in series? – SteveJ Sep 09 '19 at 02:44
  • 2
    Unless you get a perfect hash with at least one bucket per input element, you cannot test inclusion in O(1). A sorted set usually has O(log n) lookup. Unless you are talking about average cost, but that's not what the question is about. Still, getting a hard O(n log n) with a set backed by a balanced binary tree is trivial. – jaskij Sep 09 '19 at 02:50
  • @EricDuminil hash sets indeed have bad worst-case performance, but on average work well. A balanced binary tree, such as a red-black tree, guarantees an O(log n) lookup, getting worst case to O(n log n). The downside is that it requires an order on the elements (or their keys). – jaskij Sep 09 '19 at 10:18
  • I don't know if Python has such data structures, but C++'s `std::set` and `std::map` work that way if memory serves. Also ping me once you edit so that I can remove that downvote. – jaskij Sep 09 '19 at 10:22
  • @JanDorniak Python has sets and dicts. In fact using dicts in python is more common and easier than using map in C++ – Baldrickk Sep 09 '19 at 12:56
  • @Baldrickk that much I know, but do Python sets or dicts have guaranteed big O? Most containers in the C++ standard library have guarantees on big O; that's why I even mentioned it here – jaskij Sep 09 '19 at 14:12
  • 2
    @JanDorniak I think sets and dicts are hash tables in Python, as opposed to the red-black trees in C++. So the absolute worst case is worse, up to O(n) for a search, but the average case is O(1), as opposed to O(log n) for C++. https://wiki.python.org/moin/TimeComplexity. Given that it's a Python question, and that the domain of the problem leads to a high likelihood of average-case performance, I don't think the O(1) claim is a poor one. – Baldrickk Sep 09 '19 at 14:18
  • While a discussion of other functions to solve this problem may be informative in its own right (though there seems to be some confusion here), it does not answer the question asked. There is a proof (without quotes) that this function is O(n²), and it is not really an answer if one is left unsure why it is not O(n³). – sdenham Sep 09 '19 at 16:28
  • 4
    I think I see the issue here: when the authors say "we will assume that no individual sequence contains duplicate values", that is not a step in answering the question; it is, rather, a precondition under which the question is going to be addressed. For pedagogical purposes, this turns an uninteresting problem into one that challenges peoples' intuitions about big-O - and it seems to have been successful at that, judging by the number of people who have strongly insisted that O(n²) must be wrong... Also, while it is moot here, counting the number of steps in one example is not an explanation. – sdenham Sep 09 '19 at 17:22
  • You are missing the point that you are not answering the question that was posed. – sdenham Sep 09 '19 at 19:58
  • Furthermore, in your last reply, you also presented an answer to a question that was not posed - I wrote "counting the number of steps in one example is not an *explanation*." – sdenham Sep 09 '19 at 20:12
  • Even as an answer to what you thought the question was, note that there are a great many different curves that pass through two points - and anything you offer as explanation here is missing from that answer. – sdenham Sep 09 '19 at 22:04
  • @sdenham: I just stumbled upon this question again. Just to be sure : do you agree that, in the general case (e.g. with duplicate elements in the sequences), the above algorithm is O(n**3)? – Eric Duminil Oct 01 '21 at 07:17
  • 2
    @EricDuminil This code, if considered as a function from three arbitrary collections to a boolean, is O(n³). However, by stipulation, it is being analyzed as a function from three sets to a boolean, which is O(n²) for a non-obvious (and therefore pedagogically useful) reason. In situations where it is being used as the latter, it would be pointless to analyze it as the former. If it were written in a more strongly-typed language, it might not even be possible to create a program in which it is applied to arbitrary collections... – sdenham Oct 02 '21 at 12:59
  • @EricDuminil ...Partial functions are perfectly reasonable and even ubiquitous in math and comp. sci. - e.g. the binary sort, defined only for sorted collections (even though a Python implementation could be applied to unsorted ones.) It is certainly something we use and can justifiably analyze, even though the code could give wrong answers when applied outside of its domain... I hope this goes some way to answering your question "why?'' – sdenham Oct 02 '21 at 13:00
  • @sdenham Excellent explanation, thanks. I'll update my answer. – Eric Duminil Oct 02 '21 at 13:10
  • @EricDuminil A correction to my previous example - it should be "binary search", not "binary sort" - though I imagine you already recognized it... Please do update your answer, as doing so will unlock my downvote - which I now think was undeserved in the first place - so I can undo it. – sdenham Oct 02 '21 at 15:22
  • @sdenham: I didn't even realize you typed "sort" twice, and indeed understood it as binary search. I edited the answer and tried to integrate your comments. Thanks for the feedback. – Eric Duminil Oct 02 '21 at 19:34
3

To put things into the terms that your book uses:

I think you have no problem understanding that the check for a == b is worst-case O(n²).

Now in the worst case for the third loop, every a in A has a match in B, so the third loop will be called every time. In the case where a doesn't exist in C, it will run through the entire C set.

In other words, it runs once for every a and, within that run, once for every c, or n * n: O(n²).

So there is the O(n²) + O(n²) that your book points out.
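Spelled out as a raw count (my arithmetic, just restating the argument above):

a == b comparisons: n per a, over n values of a             ->  n * n
a == c comparisons: at most one scan of C per a, each <= n  ->  n * n
total: at most 2n², which is O(n²)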

Mars
0

The trick of the optimized method is to cut corners: only if a and b match will c be given a look. Now you may figure that in the worst case you would still have to evaluate each c. This is not true.

You probably think the worst case is that every check for a == b results in a run over C because every check for a == b returns a match. But this is not possible because the conditions for this are contradictory. For this to work you would need an A and a B that contain the same values. They may be ordered differently but each value in A would have to have a matching value in B.

Now here's the kicker. There is no way to organize these values so that for each a you would have to evaluate all b's before you find your match.

A: 1 2 3 4 5
B: 1 2 3 4 5

This would be done instantly because the matching 1's are the first element in both series. What about

A: 1 2 3 4 5
B: 5 4 3 2 1

That would work for the first run over A: only the last element in B would yield a hit. But the next iteration over A would already have to be quicker because the last spot in B is already occupied by 1. And indeed this would take only four iterations this time. And this gets a little better with every next iteration.

Now I am no mathematician, so I cannot prove this will end up as O(n²), but I can feel it on my clogs.

Martin Maat
  • 1
    The order of the elements does not play a role here. The significant requirement is that there are no duplicates; the argument then is that the loops can be transformed into two separate `O(n^2)` loops; which gives overall `O(n^2)` (constants are ignored). – AnoE Sep 09 '19 at 12:04
  • @AnoE Indeed, the order of the elements does not matter. Which is exactly what I am demonstrating. – Martin Maat Sep 09 '19 at 12:58
  • I see what you are trying to do, and what you are writing is not wrong, but from the perspective of OP, your answer is mainly showing why a particular train of thought is irrelevant; it is not explaining how to arrive at the actual solution. OP does not seem to give an indication that he actually thinks this is related to order. So it is unclear to me how this answer would help OP. – AnoE Sep 09 '19 at 13:51
0

Note that the algorithm given is unnecessarily complicated. An obvious O(n²) algorithm, which is also O(n²) for arrays with duplicated elements, is very simple:

  1. Write a function contains(array A, value X) which returns whether A contains X in O(n); this is trivial.

  2. Disjoint(array A, B, C): for a in A: if contains(B, a) and contains(C, a) return false. Finally return true. Obviously O(n²); see the sketch below.
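A minimal Python sketch of those two steps (my code, just following the description above; the names are mine):

def contains(A, x):
    """Linear scan: O(len(A))."""
    for a in A:
        if a == x:
            return True
    return False

def disjoint_simple(A, B, C):
    """O(n²): n iterations over A, each doing two O(n) scans."""
    for a in A:
        if contains(B, a) and contains(C, a):
            return False
    return True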

If your standard library implements sets (as a hash table or a balanced tree), membership checks are O(1) on average or O(log n), so the total would be O(n log n) or better. And if A, B and C are all sorted arrays, then you can check whether they are disjoint in O(n).
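For the sorted case, here is a sketch of the usual merge-style scan with three indices (again my code, not from the original answer; it assumes A, B and C are already sorted in ascending order):

def disjoint_sorted(A, B, C):
    """Assumes sorted inputs; runs in O(|A| + |B| + |C|)."""
    i = j = k = 0
    while i < len(A) and j < len(B) and k < len(C):
        if A[i] == B[j] == C[k]:
            return False  # common element found
        # the smallest of the three current values can no longer be in the
        # intersection, so advance the pointer sitting on it
        smallest = min(A[i], B[j], C[k])
        if A[i] == smallest:
            i += 1
        elif B[j] == smallest:
            j += 1
        else:
            k += 1
    return True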

To make the algorithm a bit faster, pay some attention to the sizes of the sets or arrays. For unsorted arrays, pick A as the array with the smallest number of elements and B as the one with the second smallest. This makes the worst case better, and it also rules out the most searches, since the number of common elements will be smallest between the two smallest sets.

The effect is stronger for sets with O(log n) lookup. The number of lookups grows linearly with the size of A, while the time for each lookup grows only logarithmically with the size of B or C, so you really want A to be the smallest set.

gnasher729
-1

I was baffled at first, but amon's answer is really helpful. I want to see if I can do a really concise version:

For a given value of a in A, the function compares a with every possible b in B, and it does it only once. So for a given a it performs a == b exactly n times.

B doesn't contain any duplicates (none of the lists do), so for a given a there will be at most one match. (That's the key.) Where there is a match, a will be compared against every possible c, which means that a == c is carried out exactly n times. Where there is no match, a == c doesn't happen at all.

So for a given a, there are either n comparisons or 2n comparisons. This happens for every a, so the best possible case is n² comparisons and the worst is 2n².

TLDR: every value of a is compared against every value of b and against every value of c, but not against every combination of b and c. The two issues add together, but they don't multiply.

-3

Think about it this way: some numbers may be in two or three of the sequences, but the average case of this is that for each element in set A, an exhaustive search is performed in B. It is guaranteed that every element in set A will be iterated over, but it is implied that less than half of the elements in set B will be iterated over.

When the elements in set B are iterated over, a further iteration happens if there's a match. This means that the average case for this disjoint function is O(n²), but the absolute worst case for it could be O(n³). If the book didn't go into detail, it would probably give you the average case as an answer.

Eddie Ekpo
  • 4
    The book is quite clear that O(n²) is the worst case, not the average case. – SteveJ Sep 08 '19 at 03:32
  • A description of a function in terms of big O notation usually only provides an upper bound on the growth rate of the function. Associated with big O notation are several related notations, using the symbols o, Ω, ω, and Θ, to describe other kinds of bounds on asymptotic growth rates. [Wikipedia - Big O](https://en.wikipedia.org/wiki/Big_O_notation) – candied_orange Sep 08 '19 at 04:50
  • 6
    "If the book didn't go into detail, it would probably give you average case as an answer." – Uhm, no. Without any explicit qualification, we are *usually* talking about worst-case step complexity in the RAM model. When talking about operations on data structures, and it is clear from the context, then we *might* be talking about *amortized* worst-case step complexity in the RAM model. Without *explicit* qualification, we will generally *not* talk about best case, average case, expected case, time complexity, or any other model except RAM. – Jörg W Mittag Sep 08 '19 at 06:07