We have a codebase at work that:
- Ingests (low) thousands of small files. Each of these input files contains about 50k “micro-items”
- These “micro-items” are then clustered together to find “macro-items”
- The “macro-items” become the input to a variety of other important business computations and analyses; these macro-items are the lifeblood of our organization.
- Does all of this work using Apache’s Crunch library (which is backed by Hadoop)
It is true that some of these steps above are hard to do without Crunch. The “clustering problem” in particular is probably impossible to do strictly in-memory on one machine. There are just too many items that need to be considered in the clustering step.
You could also argue that we have so much data that any solution that isn't built for scale isn’t worth having.
That said, I feel like our codebase breaks SRP left and right. For example, the algorithms that parse raw input files cannot be easily separated from the Crunch “do it en masse” classes: if I have just one input file, I cannot parse it without running a full-scale Crunch job. Similarly, I cannot easily access, or even test, the “clustering algorithm” or the “important business computations” in isolation.
Is this a common problem in the big data space? Does SRP fly out the window when I have tons of data?
Is trying to separate this codebase into two projects, A and B, a reasonable goal? I am assuming Project (A) would define and properly test the parsing algorithms, the clustering logic, and the important business computations, while Project (B) would depend on Project (A) and use Crunch to do all of these things at scale.
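
To make that concrete, here is a rough sketch of the kind of split I am imagining. All class and method names (`MicroItem`, `MicroItemParser`, `ParseMicroItemFn`, the "id,value" line format) are invented for illustration, not our actual code:

```java
// --- Project (A): MicroItemParser.java ---
// Plain Java, no Crunch or Hadoop imports; usable and testable on one machine.
public class MicroItemParser {

    // Parses one raw line (assumed here to look like "id,value") into a micro-item.
    // Pure function of its input: no side effects, no distributed context.
    public MicroItem parse(String rawLine) {
        String[] fields = rawLine.split(",");
        return new MicroItem(fields[0], Double.parseDouble(fields[1]));
    }
}

// --- Project (A): MicroItem.java ---
public class MicroItem {
    private final String id;
    private final double value;

    public MicroItem(String id, double value) {
        this.id = id;
        this.value = value;
    }

    public String getId()    { return id; }
    public double getValue() { return value; }
}

// --- Project (B): ParseMicroItemFn.java ---
// Thin Crunch adapter: its only job is to lift the pure parser into a MapFn
// so the pipeline can apply it en masse via parallelDo.
import org.apache.crunch.MapFn;

public class ParseMicroItemFn extends MapFn<String, MicroItem> {

    private transient MicroItemParser parser;

    @Override
    public void initialize() {
        // Re-created on each task, so the parser never needs to be serialized.
        parser = new MicroItemParser();
    }

    @Override
    public MicroItem map(String rawLine) {
        return parser.parse(rawLine);
    }
}
```

Project (B) would then shrink to a handful of adapters like this one plus the pipeline wiring (reading the input files, `parallelDo`, the PTypes, and so on), while everything with actual business meaning lives in Project (A).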
Part of the reason I want to advocate for a strict separation is to vastly improve the testing of all the non-distributed computations. It is horrible when a distributed Crunch job fails and it's hard to pin down why.
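
With that split, the bulk of the test suite could be plain JUnit against Project (A): no MiniCluster, no pipeline, and failures point straight at the logic that broke. Again using the hypothetical names from the sketch above:

```java
import static org.junit.Assert.assertEquals;

import org.junit.Test;

// Runs in milliseconds on a laptop; nothing distributed is involved.
public class MicroItemParserTest {

    @Test
    public void parsesASingleRawLine() {
        MicroItem item = new MicroItemParser().parse("item-42,3.14");

        assertEquals("item-42", item.getId());
        assertEquals(3.14, item.getValue(), 1e-9);
    }
}
```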