3

I have a one-way large ETL pipeline in Scala where I start with protobuf schema A and end up with protobuf schema B. I want to create a many-to-many mapping between the fields of schema A and schema B, where schema B uses a subset of the fields from schema A. The ETL is complex and has lots of transformations where information is stored in variables and then transformed in various ways before being output to schema B. Things I have attempted so far:

  • Take a fully populated instance of schema A whose values act as "flags", run it through the ETL, and match those flags in the schema B output. This doesn't account for transformations of the values, fails on some input constraints, and such "flags" cannot be attached to the values of boolean or enum fields.
  • Build a Scala compiler plugin that analyzes the AST for usages of schema A's types and tracks where they end up in schema B. This gets me most of the way there, but the approach introduces ambiguities and complexity, e.g. where variables are stored in common functions and re-used in different locations and scopes in the ETL code.
  • Doing something similar to the previous approach, but at runtime with AspectJ. This introduces the same sorts of problems as the previous bullet point.

Is there a lower-level or more straightforward approach to doing something like this? Like attaching some sort of flag to the data that will follow it during its transform and output?

Conor
  • 109
  • 7
  • For bullet point one, have you looked at the `Writer` monad from the Scalaz library? You could, for instance, define a `Writer[List[String], Boolean]`, and then create operators that record the history of transformations of a value by appending them to the list that travels with the value. You would need to translate the ETL code into `Writer` monad style. See http://eed3si9n.com/learning-scalaz/Writer.html – BobDalgleish Aug 02 '17 at 19:03
  • Update: we ended up solving this by extending our compiler plugin and refactoring the ETL codebase to standardize on input/output functions, i.e. we created a library of 8-10 methods that all take the input schema and return the output schema. From that we were able to use the AST to create the mapping between fields. So long as developers continue to use those functions (which we enforce as part of the build phase), we can continue to derive the mapping. – Conor Apr 07 '18 at 08:38
  • Nice solution! For the community, would you be willing to write up some more detail and post it as an answer to your question? – BobDalgleish Apr 08 '18 at 16:09
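The `Writer` suggestion from the first comment could be sketched in plain Scala with a hand-rolled stand-in (the Scalaz/Cats `Writer` works the same way; all names here are illustrative, not from the actual ETL):

```scala
// Minimal stand-in for a Writer monad: a value paired with a log of the
// source field and transformations it has passed through. The log travels
// with the value through every ETL step, so the final schema-B value
// carries its own provenance.
final case class Traced[A](value: A, log: List[String]) {
  def map[B](step: String)(f: A => B): Traced[B] =
    Traced(f(value), log :+ step)
  def flatMap[B](f: A => Traced[B]): Traced[B] = {
    val next = f(value)
    Traced(next.value, log ++ next.log)
  }
}

object Traced {
  // Tag a value with the schema-A field it originated from.
  def fromField[A](field: String, a: A): Traced[A] =
    Traced(a, List(s"A.$field"))
}

object TracedDemo extends App {
  // A toy transformation chain standing in for ETL steps.
  val out = Traced.fromField("userAge", 41)
    .map("increment")(_ + 1)
    .map("toString")(_.toString)

  println(out.value) // 42
  println(out.log)   // List(A.userAge, increment, toString)
}
```

As the comment notes, the cost is that the ETL code has to be rewritten to thread `Traced` (or `Writer`) values through every transformation, which is why the eventual solution of standardizing on a small library of input/output functions amounts to the same discipline enforced at a coarser granularity.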

2 Answers

1

I may be coming at this from an angle that doesn't quite fit your situation.

But I believe you should look at this more as a people and domain-understanding problem, and less as a technical problem. It sounds like you've been stuck with an ETL project that was written by one, or very few, people who are hard to get access to, and it's now your job to sort it out and take it over.

The 100% best thing you can do is get access to as many people as you can who had something to do with this project, and fit as much of their time into your discovery period as possible. If you can speak to one of the developers, that is the best-case scenario.

I can't see why you would expect much success from tinkering with the inputs and running the ETL process like a "black box." Just a few fundamental differences between the two schemas would cause lots of little quirks that would be very difficult to sort out. In the long run you would gain much better insight into the execution by studying the structure of the project, its comments, object naming, etc.

Surely you need to study the domain in question first, on either the source or the destination system, regardless of their schemas. Ask questions of the stakeholders who use the systems on either end to get an idea of what the source and destination schemas are trying to model. Then compare the table names and try to sort out differences in the models' assumptions, using the ETL project along the way to enrich your reasoning.

In summary, you do not want to come at this as a reverse-engineering problem with tracing tools and watching the results of inputs and outputs. You want to come at it as a business problem and involve the stakeholders, to get a good idea of the big picture.

Brandon Arnold
  • 1,253
  • 9
  • 15
  • Thank you for your thoughtful answer. Your assumptions are true to a large extent. This is an actively maintained ETL repository in BigCorp that we have little control over but have been asked to provide analysis of. So the solution is not entirely technical, since it could be somewhat easier if we had full control over the repository. However, even if the ETL team were to help us out, the technical problem of how to divine these things still remains. The question should assume we have full control over the repository, in which case we might get a better answer. – Conor Nov 06 '17 at 21:02
0

You could try inspecting lower-level operations during your ETL's execution.

For instance, if you want to discover a mapping from variable A's value to some output variable B's value:

  1. ...lots of instructions...
  2. Read value from A's address into address XYZ
  3. ...lots of instructions...
  4. Perform some operations with value from address XYZ
  5. ...lots of instructions...
  6. Copy value from address XYZ into B's address

Then you need to analyze the transformations performed on the values at addresses A, XYZ, and finally B.
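If modifying the source (even mechanically) is acceptable, the same idea can be sketched at the language level rather than the bytecode level, as a taint-style wrapper: each value carries the set of schema-A fields it was derived from, and combining two values unions their origins, which is exactly the many-to-many mapping you're after. This is an illustrative sketch, not the actual ETL code; all names are hypothetical:

```scala
// Illustrative dynamic taint tracking: a value plus the set of schema-A
// fields it was derived from. Transformations preserve the origin set;
// combining two values unions their origins.
final case class Tainted[A](value: A, origins: Set[String]) {
  def map[B](f: A => B): Tainted[B] = Tainted(f(value), origins)
  def combine[B, C](other: Tainted[B])(f: (A, B) => C): Tainted[C] =
    Tainted(f(value, other.value), origins ++ other.origins)
}

object TaintDemo extends App {
  val first = Tainted("Ada", Set("A.firstName"))
  val last  = Tainted("Lovelace", Set("A.lastName"))

  // A schema-B field built from two schema-A fields: the result records
  // both origins, giving one row of the many-to-many field mapping.
  val fullName = first.combine(last)(_ + " " + _)

  println(fullName.value)   // Ada Lovelace
  println(fullName.origins) // Set(A.firstName, A.lastName)
}
```

The drawback is the same one raised about the instrumentation approaches in the question: every read of a schema-A field has to be wrapped, either by hand or by a rewriting step such as a compiler plugin.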

Emerson Cardoso
  • 2,050
  • 7
  • 14
  • Can this be done without modifying the ETL source code (or with very minimal modification that can be done automatically)? – Conor Aug 09 '17 at 22:38
  • I did a quick search and found this tool, which allows inspection of the bytecode (regardless of whether you have the source): http://www.drgarbage.com/bytecode-visualizer/ – Emerson Cardoso Aug 10 '17 at 13:25