5

My understanding is that one of the key features of a B-Tree (and a B+Tree) is that it is designed such that the size of its nodes are some multiple of the block size of whatever media the data is stored on.

Considering that, in a memory managed language like java/c#, we don't really have access to how, when and what order, data is accessed from the drive... can we still predictably benefit from the major advantage of this data structure?

Steven Evers
  • 28,200
  • 10
  • 75
  • 159

2 Answers2

8

Yes, B-Trees still make good sense in managed languages.

A few points of explanation:

  • If you're using the B-Tree as an on-disk data structure, then I can absolutely guarantee that disk IO will be your bottleneck, not the fact that you are using a managed language.
  • If you are using a B-Tree in memory, then you can still have considerable control over memory layout from a caching perspective. For example, you can use large arrays for data storage in Java/C# and store tree nodes/data in the arrays using offsets rather than having a separate object to represent each tree node.
  • The advantages of a data structure are largely independent of language, at least up to a constant % factor. So if a B-Tree makes sense for your algorithm / access pattern, it will probably do so regardless of what language you are using.
  • On top of all that, it is generally the case that Java/C# can be nearly as fast as C/C++ if well optimised.
mikera
  • 20,617
  • 5
  • 75
  • 80
  • +1 Good response. Thanks for that mikera. I'm also wondering, could you expand on your second point? Aren't the nodes in a b-tree usually implemented as AVL or red-black trees? – Steven Evers Jan 09 '12 at 14:38
4

The use of a managed language like Java, C# etc. has absolutely nothing to do with the way data is accessed from the drive, and in any case it certainly does not deprive developers from an iota of control over precisely how, when, and in what order data will be accessed from the drive.

The problem is elsewhere: managed languages suffer from the overhead of managed-to-native and native-to-managed transitions, where data often need to be copied from native buffers into managed buffers, and from the fact that they do not offer quick and easy support for inherently unsafe operations like picking four bytes from within an array of bytes and interpreting them as an integer. So, when you want to do a thing like that you have to invoke a function that will do the conversion for you, where in C++ you would just use a single machine instruction which dereferences a pointer.

Therefore, an implementation of a B-Tree in a managed language will suffer, but it will not be due to lack of control over precisely how, when, and in what order data is accessed from the drive.

Mike Nakis
  • 32,003
  • 7
  • 76
  • 111
  • While it's true that strict memory management is a hinderance to such low-level tricks, not all languages commonly considered "managed" are truly unable to do this. C# at least (it's a CLR thing, but I don't know if other CLR languages expose similar functionality) *can* do `unsafe` stuff. Python (of all things) as well, through `ctypes` (though yes, it leads tocalling C functions - except maybe with PyPy's JIT). –  Jan 08 '12 at 16:24
  • @delnan Good point. I just did not mention it, because when I have done that in the past people retorted with things like "oh, so in C# you have to do _unsafe_ stuff to achieve this!" and then for me it is facepalm and end-of-conversation, because I just do not where to begin explaining them where they went wrong. – Mike Nakis Jan 09 '12 at 02:17
  • Copying memory one of the fastest operation a computer can do. Disk access is one of the slowest. Any differences in performance will be negligible in practice. – Boris Yankov Jan 10 '12 at 00:37
  • 1
    Copying memory is just one of the things. And they tend to accumulate. But in any case, when writing a B-tree performance counts, so any technique which performs unnecessary operations can be said to "suffer" with respect to a technique which doesn't. Even if the total working time of both techniques is negligible compared to the time spent waiting for the disk. – Mike Nakis Jan 10 '12 at 00:43