3

This is kind of a multi-part question.

Is it possible to do a binary search tree if the data does not possess natural ordering? Would you be forced to impose artificial ordering to such data? Like images? Or executable files? Or video? or sound files? Items which do not possess an obvious alphabetic or numerical order (my idea of 'natural' total ordering).

Or would it just be better to use a hashmap at that point?

adv
  • 347
  • 1
  • 9
  • 6
    To the computer, a "natural" "obvious" or "reasonable" ordering is in no way different from an absurd, technocratic, arbitrary ordering. All that counts is that it's computable. – Kilian Foth Apr 27 '16 at 15:03
  • 2
    Aren't binary search trees based on ordering the data (numerically or alphabetically) left to right? Or comparable. I suppose comparability is the key, even if it is arbitrarily assigned. – adv Apr 27 '16 at 15:03
  • 2
    All that matters is that you can provide a function of the form `bool lessThan( T left, T right )` built-in types just have this provided by the system – Caleth Apr 27 '16 at 15:06
  • I assume that, by natural ordering, you're referring to the phenomenon of a binary tree degrading into a linked list if the insertion data is already sorted. Automatically-balancing binary trees solve this problem by rotating and reorienting subtrees until the height of all branches is the same (give or take 1 level). – Robert Harvey Apr 27 '16 at 15:09
  • What do you mean by "the data does not possess natural ordering". The whole point of an ordered data structure is to feed unordered (yet orderable) data into it, of course. So, perhaps you are talking about graph data that is not fully ordered or orderable? In graph data, you may know that foo is after bar, but not a total ordering of all the nodes. In other words, you're looking for a topological sort instead of a total order. For graphs we use different algorithms and data structures. – Erik Eidt Apr 27 '16 at 15:14
  • I am just learning binary search trees now, hence my stupid questions. It is understanding that I seek here. By natural ordering, I am referring to the phenomenon of binary tree always -- so far as my reading has gone -- are ordered either alphabetically or numerically in order to allow quicker searches. – adv Apr 27 '16 at 15:14
  • 1
    Yes, the structure of the tree is based on the ordering of the keys. – Robert Harvey Apr 27 '16 at 15:20

3 Answers3

7

Is it possible to do a binary search tree if the data does not possess natural ordering?

I don't know what the word "natural" means in your context; it seems vague.

Moreover, images, videos, executables and sound files all seem perfectly obviously orderable to me. Order them by byte ordinal comparison, in the event of a tie, the shorter file is smaller. Why do you think this is not a natural way to put something in order? That's how you put strings in order, so why not sound files? A sound file is just a string of bytes, so order it as a string of bytes.

Let me answer a question I know how to answer instead:

Is it possible to do a binary search tree if the data does not possess a total ordering?

No. Binary search trees require a total ordering.

What is a total order?

A total order simply means that you must provide a comparison operator that it can determine equality, greater than or less than on every pair of elements. And moreover, all the rules that you think of as obviously true for ordering must be met, such as:

  • A always equals A (reflexivity)
  • If A = B then B = A (symmetry of equality)
  • If A = B and B = C then A = C (transitivity)
  • If A < B and B < C then A < C (transitivity)
  • If A < B then B > A (antisymmetry of inequality)
  • ...

and so on. If you cannot meet these rules then you cannot do a binary search and consistently get correct results. If you can meet these rules, then you can.

Or would it just be better to use a hashmap at that point?

Better by what criterion? You haven't said what operations you intend to perform on this data structure other than searching. Hash maps are very good at some tasks that binary trees are bad at, and vice versa.

Eric Lippert
  • 45,799
  • 22
  • 87
  • 126
  • In practice, I think the typical binary tree would be ordered by a simple key of some sort, and not the binary content of an image, executable or sound file. Those items would be considered "payloads" for the tree nodes, retrievable using the simple key. – Robert Harvey Apr 27 '16 at 16:40
  • 1
    @RobertHarvey: Sure, that works too. My point was that there is no reason to suppose that an ordinal comparison of bytes is "natural" for strings containing bytes interpreted as text but "unnatural" for strings containing bytes interpreted as sound. Consider by contrast, say, a set of objects representing types in a type system. They are not strings of text, they are not strings of bytes, heck, they need not even have *names* or be *printable*, and there is no natural, obvious way to impose an ordering on them. I can think of lots of things that are unnatural to order; bytes aren't among them. – Eric Lippert Apr 27 '16 at 16:54
  • To sum up: Any value your program can encounter is just a string of bits, and all strings of bits have at least one total ordering, called [dictionary order](https://en.wikipedia.org/wiki/Lexicographical_order). So even if your data doesn't have a natural ordering, its _representation_ in bits does. – Kevin J. Chase Apr 28 '16 at 03:44
  • 2
    @KevinJ.Chase: That's not quite what I'm saying. I'm saying that files on disk are very clearly strings of bytes. Many programming languages do not make it easy to extract a string of bytes from an arbitrary data type. – Eric Lippert Apr 28 '16 at 03:46
  • So, in order for binary search trees to work well (lowest search time), they need to be totally, transitively, and antisymmetrically ordered? Because as I understand it, the median value has to be the root in order to minimize search times. – adv Apr 28 '16 at 15:34
  • @adv: I said *correct* results, not *best performance*. In order for a binary search tree **to work at all** there must be a total order imposed; if there is not then it is possible that there is an element that is in the tree that cannot be found! – Eric Lippert Apr 28 '16 at 15:36
  • Okay, good, so my understanding is correct then! – adv Apr 28 '16 at 15:37
  • @adv: Your statement that the median value **must** be the root in order to minimize search times is false, given no additional information about whether certain searches are more likely than others. Let us notate a tree as X if it is empty and (left-value-right) if it is not. Consider the trees `((X1X)2(((X3X)4(X5X))))` and `(((X1X)2X)3(X4(X5X)))` The first has 2 as the root, the second has 3, the median, as the root, but both are equally efficient in amortized number of comparisons per search. (That is, on average 2.2 comparisons for every search.) – Eric Lippert Apr 28 '16 at 15:44
  • @adv: But usually the property that we want is not *minimal* number of comparisons per search, but rather *sublinear* number of comparisons per search. To obtain this property we can use a number of techniques, the most common of which is to height-balance the tree. If you can maintain an invariant that, say, the path from any node to the deepest leaf, and the path from any node to the shallowest leaf, differ by no more than some multiple factor, then you can obtain sublinear average search times. – Eric Lippert Apr 28 '16 at 15:48
  • @adv: if the data in the tree is never changing then yes, it may make sense to organize the tree into a *complete tree*. But if nodes are being added and removed from the tree then maintaining completeness is an expensive proposition; maintaining a weaker property such as height balance is cheaper. – Eric Lippert Apr 28 '16 at 15:50
  • Right, but both 2 and 3 are around the median value, only off because of the odd number of leaves. The volatility of the content is something that I need to look into further; it seems that rebalancing trees is an expensive operation. Although, the [Day-Stout-Warren](http://stackoverflow.com/questions/14001676/balancing-a-bst) algorithm, seems to be fairly space-time-memory efficient. – adv Apr 28 '16 at 16:04
  • @adv: We can construct efficient trees like `((((X1X)2(X3X))4((X5X)6(X7X)))8((X10X)9(X11X)))` which has 8 as the root and is 2 away from the median of 6, but this tree is still optimal. Now, you are right to point out that it is characteristic of balanced trees to have the root value be close to the median value of its descendants. But it only need be close, not equal. – Eric Lippert Apr 28 '16 at 16:34
3

An order is just a relationship between two elements in a set. The relationship is not always "is greater than". For example the relationship "is the son of" is, mathematically speaking, an order. It is not the order you need for a binary search because it is not well defined for all the elements: what if I pick two siblings? This relation exist only for some of the elements of a given set.

Instead the order that exist in the natural numbers is good because FOR EVERY number in the set is possible to state which "is greater than". However until you define a total order every relationship is good: for images "is greener than" will work fine. That's all you need for a binary search, a total order; It's up to your creativity/needs to find the proper order proposition. For more information.

This is the documentation of the Comparable interface in java needed to build a TreeSet. It's particularly interesting the description of the method compareTo.

Robert Harvey
  • 198,589
  • 55
  • 464
  • 673
JoulinRouge
  • 678
  • 3
  • 9
  • 1
    Thanks for the link. Getting those 'jewels' from people who have learned this stuff for much longer than I helps shorten my learning curve immensely. – adv Apr 27 '16 at 15:42
  • When a relation exists for only some of the elements of a given set, we might call that a graph. – Erik Eidt Apr 27 '16 at 16:19
  • 2
    @JoulinRouge: You should add the such a relation must be reflexive, transitive, and antisymmetric. An arbitrary relation is not necessarily an order, take e.g. an equivalence relation (reflexive, symmetric, transitive). See e.g. https://en.wikipedia.org/wiki/Order_theory#Partially_ordered_sets – Giorgio Apr 27 '16 at 18:31
2

There's orderable data (which may be unordered), but it can be sorted by a variety of sorting algorithms. All of these sorting algorithms depend on the ability to perform a basic comparison ordering test between arbitrary given elements. Such a comparison must return the relative ordering of any two of the elements, for example, usually as -1 for less, 0 for equal, and 1 for greater. There is only one possible answer for the sorted result as a whole (barring duplicates).

There is also related data as you would have in a graph, possibly a directed graph, but not necessarily so. Given a graph we know something about the relative positions of the nodes via their connecting edges. However, there is no total ordering of all the nodes, just that some nodes are know to be before or after other nodes. We can perform a topological sort to order the nodes, but the bottom line is that there are many correct answers, so usually a topo sort will just pick one. A cyclic graph can have cycles, so again the ordering is (even more) arbitrary. Often we'll then look to other properties, for example, to choose the head element of a cycle to determine a good ordering for the domain.

Then there is unrelated data, where all we can do is compare for equality. For those, hashing is a reasonable data structure for storage and retrieval. There is no notion of sorting at all.

We should also consider that the same list of entities can be sorted by different properties or qualities. For example, files can be sorted by their size, which will provide a total ordering. They could be sorted by their timestamps. These may be useless for your domain, but still the point is to think about the various properties available for sorting, categorizing, etc...

Erik Eidt
  • 33,282
  • 5
  • 57
  • 91
  • Thanks, this was my general feeling, but I did not know for certain. I am going to wait and give others a chance to answer before marking. – adv Apr 27 '16 at 15:23
  • do note that on a real system you will always be able to come up with `some` scheme that provides an ordering (e.g. by size or by binary representation), but whether there is one that is helpful to the problem domain gives the three categorizations of this answer – Caleth Apr 27 '16 at 15:30
  • Yea, which is why I included the artificial ordering bit. Otherwise a search of the tree wouldn't be improved over a hashmap. – adv Apr 27 '16 at 15:31