
I am currently writing an application with the following structure:

  • input: the equivalent of an Excel workbook, i.e. a few tables with different headers and a few scalar values. They represent properties of hardware, workers, production processes, etc. in a factory, plus a few "global" parameters.

  • processing: compute some plan for the factory

  • output: the plan, i.e. a kind of schedule, in the form of a few table-like data structures, plus some KPIs (a rough sketch of these structures follows this list)
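
For concreteness, a rough sketch of the in-memory shape this ends up as (all table and column names below are made up purely for illustration):

    import pandas as pd

    # Illustrative only: the real tables describe machines, workers, processes, etc.
    inputs = {
        "machines": pd.DataFrame(columns=["name", "hourly_cost", "max_hours_per_day"]),
        "workers":  pd.DataFrame(columns=["name", "shift", "skills"]),
        "params":   {"planning_horizon_days": 7, "currency": "EUR"},  # the "global" scalars
    }

    outputs = {
        "schedule": pd.DataFrame(columns=["day", "machine", "worker", "process"]),
        "kpis":     {"total_cost": None, "utilisation": None},
    }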

The current application is implemented in Python. Other times I have used C++/C#/Java, R, ...

My problem is with the input and output steps. It seems they have to be done over and over for most applications I encounter, and that gets boring.

What I usually do is create a few data structures in memory to represent the input and the output. It could be a few classes, each representing a row of a table, so to speak, or a C# DataTable / pandas DataFrame / dict-of-dicts.
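
For example, one of those row classes might look like this (the class and field names are invented just to show the pattern):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class MachineRow:
        """One row of a hypothetical 'machines' input table."""
        name: str                                   # unique identifier of the machine
        hourly_cost: float                          # cost per hour of operation
        max_hours_per_day: Optional[float] = None   # None means "no limit"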

What I mean by "boring" is that I need to specify both the input and output logic at least three times each: once in the file, once in the code, and once in the documentation.

For example, for a CSV file I have to decide the number, position, and name of the columns and write a template or example of the file. Then I write and document some code to parse or write it and, finally, I have to write a "guide" that documents how the program handles the data, in particular missing or invalid values.
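
To make that concrete, the reading side ends up looking roughly like this for a hypothetical machines.csv (the column names and the rules for missing values are only examples, but every one of these decisions also has to be repeated in the template file and in the user guide):

    import pandas as pd

    def read_machines(path: str) -> pd.DataFrame:
        """Read the hypothetical 'machines' table from a CSV file."""
        df = pd.read_csv(
            path,
            usecols=["name", "hourly_cost", "max_hours_per_day"],  # columns fixed here
            dtype={"name": str},
        )
        # Rules that must also be spelled out in the documentation:
        df = df.dropna(subset=["name"])                    # rows without a name are ignored
        df["hourly_cost"] = df["hourly_cost"].fillna(0.0)  # missing cost is treated as zero
        return df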

Of course, when the data format changes, the code has to change too.

I realize that in a larger team this would not be a problem as different people would handle different tasks. But currently it's mostly me.

Is there some practice to abstract and speed up the process of translating the input and output formats to and from in-memory data structures? Or some pointers to further sources that could help me better understand the opportunities and pitfalls of implementing those procedures?

Finally, I would also appreciate it if you could point me to some useful libraries for that in Python, Java, or C#.

Andreas
  • Regarding: "I need to specify both the input and output logic at least three times each: once in the file, once in the code, and once in the documentation". When you say 'logic', do you mean the *schematic information* about the data (field types, lengths, nullable, etc.), or are you referring to metadata? Or something else? Secondly, is there enough metadata to auto-generate some/all of the documentation? (i.e. *from* the code and/or input data) – Tersosauros Mar 05 '18 at 14:22
  • I am not sure what you mean by metadata, but I meant both the schematic information and the meaning of the data. For example, if it is about power plants, then a field is about fuels, which must correspond to a table of handled fuels. I could do that with a DBMS, but most of the time people use hand-edited Excel files, each having a different "context" and, thus, different lists of handled fuels I have to match against. I could get some "metadata" if needed. I use "docstrings" whenever available, but I guess that would need some editing and pasting to obtain documentation for the final user. – Andreas Mar 05 '18 at 22:32
  • Sounds like you have a data integration / ETL type of issue, rather than a software engineering one. – Tersosauros Mar 10 '18 at 05:53
  • In some sense it is ETL. But I usually see "ETL" mentioned for handling large quantities of data on a DB or some BI framework. Here it would be overkill. Is there an ETL framework for "small data"? – Andreas Mar 14 '18 at 19:11

0 Answers