I have a large, one-way ETL pipeline in Scala that starts with protobuf schema A and ends with protobuf schema B. I want to build a many-to-many mapping between the fields of schema A and schema B, where schema B uses a subset of the fields from schema A. The ETL is complex, with many transformations in which information is stored in variables and reshaped in various ways before being written out to schema B. Things I have attempted so far:
- Take a fully populated instance of schema A whose values are sentinel "flags", run it through the ETL, and scan the schema B output for those flags. This doesn't account for transformations of the values, fails on some input constraints, and you cannot attach such "flags" to boolean or enum fields at all.
- Build a Scala compiler plugin that analyzes the AST for usages of schema A's types and tracks where they end up in schema B. This gets me most of the way there, but the approach introduces ambiguities and complexity, for example when values are stored in variables inside shared functions and reused in different locations and scopes across the ETL code.
- Do something similar to the previous, but at runtime with AspectJ. This runs into the same sorts of problems as the previous bullet point.
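To make the first attempt concrete, here is a minimal sketch of the sentinel-flag idea. The case classes are hypothetical stand-ins for my generated protobuf classes (the real schemas are much larger), and the `etl` function is a toy substitute for the real pipeline, but it reproduces two of the failure modes: a transformation rewrites the flag along with the data, and boolean/int fields can't carry a flag at all.

```scala
// Hypothetical stand-ins for the generated protobuf message classes;
// the real schemas A and B are assumed, not shown here.
case class SchemaA(userName: String, active: Boolean, score: Int)
case class SchemaB(displayName: String, rank: Int)

object FlagProbe {
  // Populate the string fields of A with unique sentinel "flags".
  // Note that active and score cannot carry a flag at all.
  val probe = SchemaA(userName = "FLAG_userName", active = true, score = 42)

  // Toy stand-in for the ETL: displayName is derived from userName.
  def etl(a: SchemaA): SchemaB =
    SchemaB(displayName = a.userName.toUpperCase, rank = a.score / 10)

  // Exact-match scan of B for the sentinel. It fails here because
  // toUpperCase rewrote the flag together with the data.
  def findFlags(b: SchemaB): Map[String, Boolean] =
    Map("displayName" -> (b.displayName == probe.userName))
}
```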
Is there a lower-level or more straightforward approach to doing something like this? For example, attaching some sort of flag to the data that follows it through its transformations and into the output?
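To clarify what I mean by a flag that "follows" the data, here is a sketch of the kind of thing I'm imagining: a wrapper that carries a set of origin field names alongside each value, so that origins merge whenever values are combined and survive arbitrary rewrites of the payload. All names here (`Tainted`, the `"A.firstName"` labels) are made up for illustration, not an existing library; my question is whether something like this, or a lower-level equivalent, already exists.

```scala
// Hypothetical provenance carrier: a value plus the schema-A fields
// it was derived from.
final case class Tainted[+T](value: T, origins: Set[String]) {
  // Transforming the payload preserves its origins.
  def map[U](f: T => U): Tainted[U] = Tainted(f(value), origins)

  // Combining two values unions their origin sets.
  def combine[U, V](other: Tainted[U])(f: (T, U) => V): Tainted[V] =
    Tainted(f(value, other.value), origins ++ other.origins)
}

object TaintDemo {
  val first = Tainted("Ada", Set("A.firstName"))
  val last  = Tainted("Lovelace", Set("A.lastName"))

  // The payload is merged and rewritten, but the origin set survives,
  // ending up as Set("A.firstName", "A.lastName").
  val display: Tainted[String] =
    first.combine(last)(_ + " " + _).map(_.toUpperCase)
}
```

The obvious cost is that every intermediate value in the ETL would have to be threaded through such a wrapper, which is why I'm asking whether a lower-level mechanism exists.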