2

I have a large (>1,000 LOC) Python ETL script - call it fetch_and_transform_data.py - that fetches data from a remote database, appends the raw data to a local table, does some transformations and appends the transformed data to another local table. Currently the fetch_and_transform_data module has a class with a main method, which doesn't return anything.

My task is to write a periodic update script, which will call fetch_and_transform_data.main and handle these possible outcomes: 1) run failed, 2) run succeeded with zero rows added, 3) run succeeded with nonzero rows added. (And notify appropriately in each case).

What I'm confused about is how to pass these different outcomes to my update script.

A couple of solutions have come up for this, none of which seem like best practice to me:

  1. Run fetch_and_transform_data.main in a subprocess, and parse the output. (It prints the number of rows added.)
  2. Change fetch_and_transform_data.main to return a boolean, where False means no new records, then wrap in a try/except
  3. Query the local tables after the run and see if there are new rows with a recent append date

This seems like a fairly generic task and I thought there might be a standard way to do it. Thanks for the help, I hope this is the right place to post this.

To clarify, I can make any changes I like to fetch_and_transform_data.py.

2 Answers2

3

People are funny about zero. Computers don't care but we humans make a bigger deal about zero than we should. More on that in a moment.

I'm rejecting your 3rd option because it's simply giving up on communicating and dumping all the responsibly on the update script to figure out what happened on it's own. Bleh. If you can communicate results you should.

As for your 1st and 2nd options, I advocate squishing them together. The only reason you separated them is because, like a lot of coders, you feel funny about zero. Relax. It's just a number. It's ok to loop zero times.

So, rather than return a Boolean, just return the new_record_count. I'm sure, if you have to, you can write code that will tell you if it's true that there are zero records. Still not sure you have to.

Far more importantly, of your three solutions, only the 2nd had any plan at all to explain why the run failed. This is critical to get right. Dealing with things going wrong is easily 80% of programming.

Now you did ask about a standard way. When returning from main, the old school standard is: on success return 0 and on failure return some positive number that represents an error code. If, for some reason, exceptions aren't something you want to deal with, and output isn't something you want to parse, you could return the negation of number of records returned. That is, -3 means 3 records were successfully added. But that means your test for success is now if (x < 1) which is just weird to look at.

So on the whole I'd prefer returning the record count and throwing something when things go wrong. No, that doesn't conform to the old convention for how to return from main. But why does this have to be the main function anyway? Seems like this would make a nice API function.

I do feel compelled to point out that there is a wildly different option. Rather than returning the results it is also possible to use an output port. Pass fetch_and_transform_data a reference to something that implements that output port. The port can provide a way to say how many rows were added on success and provide a handler for errors.

The point of that is it simplifies the call down to a pure event. Nothing is returned. No exceptions come back. This minimizes coupling. The calling code only has to know the time to call is now. Nothing else. What happens after the call is for something else to deal with. In CQS terms this becomes a simple command. But doing this is a bit of work. Do it when it’s worth it.


In response to your comments:

I was checking for zero-ness because there is a process that generates a patch file and uploads them somewhere else, which only runs if there is new data. Is this a good enough reason? Or maybe the generate-patch-and-upload should just be changed to handle zero? - Josh Friedlander

I'm in the handle zero gracefully camp. When I tell you to do nothing don't complain about the nothing. Just do the nothing. Empty strings are glorious! Nulls are evil! My favorite pattern is the Null Object Pattern.

But I'm also in the pragmatic camp. So don't kill yourself doing this. Just give it a fair chance.

when you say "use an output port", is this different from writing to and reading from a file? And thanks for your help! - Josh Friedlander

Yes, this is not file IO. I'm talking about how using return is not required.

Rather than use return you can use an output port. When you return you have no idea who called you and no interface to talk through. Which is why so many people give up and break encapsulation by returning internal state.

Instead of sending control back where it came from you can accept an output port that tells you where to send it next. Now you're talking through an interface instead of sneaking back inside.

You can go as far as to demand both success and error handlers in the output port. Done that way calling this is an event. The caller doesn't know, nor does it want to know, what happens next. That's something else's problem.

Sounds nice but comes with the cost of setting this up. Something, somewhere knows what handlers are going to get used. That's configuration. We simplified the call by pushing the complexity there. Choose your poison.

I'm certainly not forbidding use of return. I brought this up because it looked like you were struggling to say everything you wanted to say through the return. Return limits you to saying things with what you can either find or build. An output port gives you an interface to call and lets you put polymorphism behind it. If that difference turns out to be valuable to you consider an output port.

candied_orange
  • 102,279
  • 24
  • 197
  • 315
  • `(Zero) It's just a number. It's ok to loop zero times.` I Hope Kent Beck or any other "influencer" read this and makes it canon or best practice for once and all. – Laiv May 30 '23 at 14:17
  • @Laiv aw shucks. You’re making me blush. : ) – candied_orange May 30 '23 at 15:05
  • Two responses: I was checking for zero-ness because there is a process that generates a patch file and uploads them somewhere else, which only runs if there is new data. Is this a good enough reason? Or maybe the generate-patch-and-upload should just be changed to handle zero? Secondly, when you say "use an output port", is this different from writing to and reading from a file? And thanks for your help! – Josh Friedlander May 31 '23 at 06:55
  • 1
    @JoshFriedlander better now? – candied_orange May 31 '23 at 15:08
  • yes, this is great - thank you! – Josh Friedlander Jun 01 '23 at 20:59
2

If one can make arbitrary changes to fetch_and_transform_data, I would adapt in a way it produces machine-readable output with all the information required for a calling script. For very simple cases, some additional output lines on standard output might be sufficient, for more complex cases, a separate log file can simplify things for the caller.

The question whether to run the script in a subprocess or call it directly in the same process as your update script depends mainly on the level robustness required. If you expect fetch_and_transform_data to crash the process with some out-of-memory error under certain circumstances, or to run into an endless loop, and you want the update script to handle such situations gracefully, then you need a subprocess.

If you do not need that level of robustness and want to run the code from fetch_and_transform_data in the same process as the caller, you can probably refactor fetch_and_transform_data.main into a function which returns all the information about the number of successful and unsuccessful rows processed, and maybe the full list of unsuccessful rows. Exceptions are more suitable when you want to stop the processing immediately at the first error. You should catch them as a last resort, but usually not use them as the primary result channel for the results of a recurring batch process.

Querying the local tables before and after the run is a good idea when you do not trust the returned results from fetch_and_transform_data, or when you cannot change the script by yourself. It may be useful when testing the script or when validating its output.

Doc Brown
  • 199,015
  • 33
  • 367
  • 565
  • Thanks! `If you expect fetch_and_transform_data to crash the process with some out-of-memory error under certain circumstances, or to run into an endless loop, and you want the update script to handle such situations gracefully, then you need a subprocess.` Yes. This persuaded me to switch to a subprocess. – Josh Friedlander May 31 '23 at 06:52