Here's the scenario: say you have millions of JSON documents stored as text files. Each document is an array of "activity" objects, each of which contains a "created_datetime" attribute. What is the best way to merge/sort/filter/page through these activities via a web UI? For example, say we want to take a few thousand of the documents, merge them into one gigantic array, sort the array by the "created_datetime" attribute descending, and then page through it 10 activities at a time.
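
To pin down the semantics, here's a naive in-memory sketch of the result I'm after. It obviously won't scale to millions of documents; the directory layout and the "name"/"user_id" fields are invented for illustration, and I'm assuming the timestamps are stored as sortable strings such as ISO 8601.

    import json
    from glob import glob
    from operator import itemgetter

    # Load every document; each file holds one JSON array of activities.
    activities = []
    for path in glob("documents/*.json"):  # hypothetical layout
        with open(path) as f:
            activities.extend(json.load(f))

    # Sort descending by "created_datetime" (ISO 8601 strings sort
    # lexicographically in chronological order).
    activities.sort(key=itemgetter("created_datetime"), reverse=True)

    # Page through the result 10 activities at a time.
    PAGE_SIZE = 10
    first_page = activities[:PAGE_SIZE]
    second_page = activities[PAGE_SIZE:2 * PAGE_SIZE]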

Also keep in mind that roughly 25% of these JSON documents are updated every day, and updates have to make it into the view within 5 minutes.

My first thought is to parse all of the documents into an RDBMS table and then it would just be a simple query such as "select top 10 name, created_datetime from Activity where user_id=12345 order by created_datetime desc".
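
Something like the following, with Python's sqlite3 standing in for the RDBMS (so TOP becomes LIMIT); the Activity schema is just my guess at the minimum needed:

    import sqlite3

    conn = sqlite3.connect(":memory:")  # a real system would use a server DB
    conn.execute("""
        CREATE TABLE Activity (
            user_id          INTEGER,
            name             TEXT,
            created_datetime TEXT  -- ISO 8601 sorts chronologically
        )
    """)
    # This index is what keeps the ORDER BY ... LIMIT query cheap.
    conn.execute("""
        CREATE INDEX ix_activity_user_created
        ON Activity (user_id, created_datetime DESC)
    """)

    # The query from above, with TOP rewritten as LIMIT:
    rows = conn.execute(
        "SELECT name, created_datetime FROM Activity"
        " WHERE user_id = ? ORDER BY created_datetime DESC LIMIT 10",
        (12345,),
    ).fetchall()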

Some have suggested I use NoSQL techniques such as Hadoop or map/reduce instead. How exactly would this work?

For more background, see: Why is NoSQL better for this scenario?

Infin8Loop
  • NoSQL is not one-size-fits-all; there are many advantages and disadvantages to using it. However, given that you're dealing with JSON, you can use NoSQL to store the documents and query them in a way that easily meets your requirements; in your case it seems like a good fit, as Falcon suggested in the linked question. I'll wait for someone with better NoSQL knowledge to answer. – Benjamin Gruenbaum Feb 05 '13 at 20:26
  • How many is "tons"? If you can get a system with enough RAM (say 80 GB), and that's enough for now and the next two years, you could use an in-memory DB. There might also be other ways to divide and conquer: is there some other attribute you could partition the docs/objects by? – tgkprog May 06 '13 at 20:27
  • "How exactly would this work?" is kind of a vague question. How does map reduction work? – Erik Reppen Jul 05 '13 at 20:57
  • ^ I meant "Do you mean, 'How does map...'" – Erik Reppen Jul 05 '13 at 21:41

1 Answer

How to merge/sort/page through a huge amount of data?

Well, for sorting, look at Quicksort if the data is more or less randomized, or Timsort if it's highly ordered. (Quicksort with a naive pivot choice degenerates to quadratic time on highly ordered data.)
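
If your language's standard-library sort is already Timsort (Python's is; Java 7+ uses it for objects), you get that adaptive behavior for free. A minimal sketch, assuming the "created_datetime" values are strings that sort chronologically:

    from operator import itemgetter

    activities = [
        {"created_datetime": "2013-02-05T19:23:00Z"},
        {"created_datetime": "2013-02-05T20:26:00Z"},
        {"created_datetime": "2013-02-04T08:10:00Z"},
    ]

    # CPython's sorted() is Timsort: stable, and near-linear on
    # input that is already mostly ordered.
    newest_first = sorted(activities, key=itemgetter("created_datetime"), reverse=True)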

For merging, there's a pretty simple algorithm for this: list comparison.

  • Take two lists, list A and list B. Sort both lists using the same criteria.
  • Declare two iterators, one per list, each referring to a single list element. Initialize each to reference the first element of its list.
  • Repeat
    • Compare referenced element A (eA) to referenced element B (eB)
    • If eA < eB then handle eA appropriately and increment iterator-A
    • else if eB < eA then handle eB appropriately and increment iterator-B
    • else handle equality case appropriately and increment both iterators
  • until one iterator reaches the end of its list.
  • (optional) Handle the remaining elements in the other list, if appropriate

This basic concept can be used for a number of operations involving two lists, including merging, by specifying the "handle appropriately" cases. For merging, the appropriate way to handle an element is to append it to the output list, as in the sketch below.
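
A minimal sketch in Python, with the three "handle appropriately" cases filled in for merging:

    def merge_sorted(list_a, list_b, key):
        """Merge two lists that are already sorted ascending by `key`."""
        out = []
        i = j = 0
        while i < len(list_a) and j < len(list_b):
            ka, kb = key(list_a[i]), key(list_b[j])
            if ka < kb:        # handle eA: append it
                out.append(list_a[i])
                i += 1
            elif kb < ka:      # handle eB: append it
                out.append(list_b[j])
                j += 1
            else:              # equal: append both, advance both
                out.append(list_a[i])
                out.append(list_b[j])
                i += 1
                j += 1
        # The optional final step: whatever remains in the unfinished
        # list is already sorted, so append it wholesale.
        out.extend(list_a[i:])
        out.extend(list_b[j:])
        return out

(For merging more than two sorted inputs at once, Python's standard heapq.merge generalizes the same idea to k lists.)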

For paging, let your database engine handle this one.
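
For example, a sketch of offset paging against the Activity table the question describes, in SQLite syntax (LIMIT/OFFSET rather than SQL Server's TOP):

    import sqlite3

    PAGE_SIZE = 10

    def fetch_page(conn: sqlite3.Connection, user_id: int, page: int):
        """Return page `page` (0-based) of a user's activities, newest first."""
        return conn.execute(
            "SELECT name, created_datetime FROM Activity"
            " WHERE user_id = ?"
            " ORDER BY created_datetime DESC"
            " LIMIT ? OFFSET ?",
            (user_id, PAGE_SIZE, page * PAGE_SIZE),
        ).fetchall()

Note that OFFSET paging slows down on deep pages; keyset pagination (WHERE created_datetime < the last value seen) scales better if users page far back.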

Mason Wheeler
  • What you suggest are algorithms for merging data that is already in memory. This data is on the file system, in JSON documents; loading all of those documents into memory and merging/sorting them there would require an application that would not scale. I am looking for the best way to do this at scale. The only option I currently see is to parse the JSON into RDBMS tables and accept the parse cost as part of the expense of exposing this data via a decently performing UI. Can you think of an alternative to an RDBMS? – Infin8Loop Feb 05 '13 at 19:23
  • Some have suggested RDBMS is not suited for the volume of data and that I should use some flavor of NoSQL instead. How would that work? – Infin8Loop Feb 05 '13 at 19:24
  • @Joshiatto: How much data are you talking about? At work, our system manages SQL databases that regularly reach multiple hundreds of GB without trouble. You just need to know a few things about database tuning. NoSQL and extreme scalability only become necessary when you're working at seriously huge scales, on the order of Facebook, Google, or eBay. Otherwise, a well-maintained SQL database should work fine for you. – Mason Wheeler Feb 05 '13 at 19:49
  • @MasonWheeler: your last comment makes more sense as an answer than your actual answer does. Implementing the merge part of merge sort doesn't do much to answer the OP's question. I think the OP was looking for a way to query a lot of documents, which NoSQL document stores actually excel at, even if his DB is only a few megabytes. – Benjamin Gruenbaum Feb 05 '13 at 20:19