We have a codebase at work that:
- Ingests (low) thousands of small files. Each of these input files contains about 50k “micro-items”
- These “micro-items” are then clustered together to find “macro-items”
- The “macro-items” become the input to a variety of other important business computations and analyses; these macro-items are the lifeblood of our organization.
- Does all of this work using Apache’s Crunch library (which is backed by Hadoop)
It is true that some of these steps above are hard to do without Crunch. The “clustering problem” in particular is probably impossible to do strictly in-memory on one machine. There are just too many items that need to be considered in the clustering step.
You could also argue that we have so much data that any solution that isn't built for scale isn’t worth having.
That said, I feel like our codebase breaks SRP left and right. For example, the algorithms that parse raw input files cannot be easily separated from the Crunch “do it en masse” classes: if I have just one input file, I cannot parse it without running a full-scale Crunch job. Similarly, I cannot easily access, or even test, the “clustering algorithm” or the “important business computations” in isolation.
Is this a common problem in the big data space? Does SRP fly out the window when I have tons of data?
Is trying to separate this codebase into two projects, A and B, a reasonable goal? I am assuming Project (A) would define and properly test the parsing algorithms, the clustering logic, and the important business computations, while Project (B) would depend on Project (A) and use Crunch to do all of these things at scale.
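
To make that concrete, here is a rough sketch of the kind of split I am imagining. All class and method names (`MicroItem`, `MicroItemParser`, `ParseMicroItemFn`, the "id,value" line format) are invented for illustration, not our actual code:

```java
// --- Project (A): MicroItemParser.java ---
// Plain Java, no Crunch or Hadoop imports; usable and testable on one machine.
public class MicroItemParser {

    // Parses one raw line (assumed here to look like "id,value") into a micro-item.
    // Pure function of its input: no side effects, no distributed context.
    public MicroItem parse(String rawLine) {
        String[] fields = rawLine.split(",");
        return new MicroItem(fields[0], Double.parseDouble(fields[1]));
    }
}

// --- Project (A): MicroItem.java ---
public class MicroItem {
    private final String id;
    private final double value;

    public MicroItem(String id, double value) {
        this.id = id;
        this.value = value;
    }

    public String getId()    { return id; }
    public double getValue() { return value; }
}

// --- Project (B): ParseMicroItemFn.java ---
// Thin Crunch adapter: its only job is to lift the pure parser into a MapFn
// so the pipeline can apply it en masse via parallelDo.
import org.apache.crunch.MapFn;

public class ParseMicroItemFn extends MapFn<String, MicroItem> {

    private transient MicroItemParser parser;

    @Override
    public void initialize() {
        // Re-created on each task, so the parser never needs to be serialized.
        parser = new MicroItemParser();
    }

    @Override
    public MicroItem map(String rawLine) {
        return parser.parse(rawLine);
    }
}
```

Project (B) would then shrink to a handful of adapters like this one plus the pipeline wiring (reading the input files, `parallelDo`, the PTypes, and so on), while everything with actual business meaning lives in Project (A).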
Part of the reason I want to advocate for a strict separation is to vastly improve the testing of all the non-distributed computations. It is horrible when a distributed Crunch job fails and it's hard to pin down why.
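
With that split, the bulk of the test suite could be plain JUnit against Project (A): no MiniCluster, no pipeline, and failures point straight at the logic that broke. Again using the hypothetical names from the sketch above:

```java
import static org.junit.Assert.assertEquals;

import org.junit.Test;

// Runs in milliseconds on a laptop; nothing distributed is involved.
public class MicroItemParserTest {

    @Test
    public void parsesASingleRawLine() {
        MicroItem item = new MicroItemParser().parse("item-42,3.14");

        assertEquals("item-42", item.getId());
        assertEquals(3.14, item.getValue(), 1e-9);
    }
}
```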