Unit testing for data munging pipelines made up of one-line functions

Question

In Mary Rose Cook's Practical Introduction to Functional Programming, she gives the following as an example of an anti-pattern

def format_bands(bands):
    for band in bands:
        band['country'] = 'Canada'
        band['name'] = band['name'].replace('.', '')
        band['name'] = band['name'].title()

since

the function does more than one thing
the name isn't descriptive
it has side effects

As a proposed solution, she suggests pipelining anonymous functions

pipeline_each(bands, [call(lambda x: 'Canada', 'country'),
                      call(lambda x: x.replace('.', ''), 'name'),
                      call(str.title, 'name')])

However, this seems to me to have the downside of being even less testable. At least format_bands could have a unit test to check that it does what it's meant to, but how would you test the pipeline? Or is the idea that the anonymous functions are so self-explanatory that they don't need to be tested?

My real-world application for this is in trying to make my pandas code more functional. I'll often have some sort of pipeline inside a "munging" function:

def munge_data(df):
    df['name'] = df['name'].str.lower()
    df = df.drop_duplicates()
    return df

Or rewriting in the pipeline style:

def munge_data(df):
    # assign takes keyword arguments: column name = value or callable
    munged = (df.assign(name=lambda x: x['name'].str.lower())
                .drop_duplicates())
    return munged

Any suggestions for best practices in this kind of situation?

Those individual lambda functions are too small to unit test. Test the final result. To put it another way, anonymous functions are not unit testable, so don't write the function as an anonymous function if you plan to unit test it individually. — Robert Harvey, Dec 14 '15 at 22:59
["Design For Testing, But Don't Code For Testing"](http://programmers.stackexchange.com/a/199328/31260) — gnat, Dec 15 '15 at 07:42

score 1 · Answer 1 · answered Jun 04 '16 at 00:08

I think you missed what is probably the more important part of the book's corrected example. The more fundamental change to the code is that it goes from operating on all values in a list to operating on one element.

There already exist functions like iter (in this case named pipeline_each) which perform a given operation on all elements in a list. There was no need to duplicate that with a for loop. Using a well-known list operation also makes your intent clear. With map you are transforming the values. With iter you are performing a side effect with each element. With a for loop you are... well, you don't really know until you look through it.

The example corrected code is still not very functional, because it (as far as I can tell) mutates the values in the list without returning them, preventing further piping or function composition. The functionally preferred method map would create a new list of bands with the updated country and name. Then you could pipe that output to the next function or compose map with another function that took a band list. With iter, it's like a pipelining dead end.
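To make the map point concrete, here is a minimal sketch. The helper names set_country and clean_name and the sample data are my own assumptions for illustration, not something from the article. Each step returns a new dict, so the output of one map can feed the next.

```python
def set_country(band):
    # Build a new dict instead of mutating the one passed in.
    return {**band, 'country': 'Canada'}

def clean_name(band):
    # Strip periods, then title-case the name, again on a new dict.
    return {**band, 'name': band['name'].replace('.', '').title()}

# Hypothetical sample data.
bands = [{'name': 'sunset rubdown', 'country': 'UK'},
         {'name': 'a silver mt. zion', 'country': 'Spain'}]

# Each map yields a fresh sequence that the next step can consume.
with_country = list(map(set_country, bands))
formatted = list(map(clean_name, with_country))
```

Because nothing is mutated, bands still holds the original records afterwards, and formatted can be handed on to whatever comes next.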

I think the end result code has small functions that are too trivial to bother testing here. After all, you shouldn't need to write unit tests against replace or title. Now maybe you do want to compose these together into your own function and unit test that the desired combination is achieved on a single item. Myself, I probably would have just changed format_bands to format_band singular, dropped the for loop, and called pipeline_each(bands, format_band). Then you could test format_band to make sure that you didn't forget something.
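A rough sketch of what I mean by the singular format_band, plus the kind of test I would write for it. I am assuming it returns a formatted copy rather than mutating its argument, and I use the built-in map as a stand-in for pipeline_each, whose exact signature I have not checked.

```python
def format_band(band):
    """Return a formatted copy of a single band record."""
    formatted = dict(band)  # shallow copy, so the caller's dict is untouched
    formatted['country'] = 'Canada'
    formatted['name'] = formatted['name'].replace('.', '').title()
    return formatted

def test_format_band():
    band = {'name': 'a silver mt. zion', 'country': 'Spain'}
    result = format_band(band)
    assert result == {'name': 'A Silver Mt Zion', 'country': 'Canada'}
    # The input record must be left as it was.
    assert band == {'name': 'a silver mt. zion', 'country': 'Spain'}

test_format_band()

# Applying it to a whole list; map stands in for pipeline_each here.
bands = [{'name': 'women', 'country': 'Germany'}]
formatted_bands = list(map(format_band, bands))
```

The test covers the combination of steps on one item, which is the part you could actually get wrong. It does not re-test replace or title themselves.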

Anyway, on to your code. Your second example of code does seem more pipeline-y. But that alone does not provide the benefits of functional programming. In practice, functional programming means ensuring compatibility of functions with other functions by defining their compatibility only in terms of their inputs and outputs. If there are hidden side effects inside a function, then even though its input and output line up with other functions, you can't know if they are compatible until run time. If, however, two functions are side-effect free and match output-to-input, then you can pipeline or compose them with little worry of unexpected results.
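Applied to your pandas case, one way this could look is sketched below. The step names lowercase_names and drop_duplicate_rows and the sample frame are assumptions of mine. Each step takes a DataFrame and returns a new one, so the steps line up output-to-input and can be chained with DataFrame.pipe.

```python
import pandas as pd

def lowercase_names(df):
    # assign returns a new DataFrame; the caller's frame is not modified.
    return df.assign(name=df['name'].str.lower())

def drop_duplicate_rows(df):
    # drop_duplicates also returns a new DataFrame by default.
    return df.drop_duplicates()

def munge_data(df):
    # Each step's output is the next step's input.
    return df.pipe(lowercase_names).pipe(drop_duplicate_rows)

# Hypothetical sample data.
raw = pd.DataFrame({'name': ['Alice', 'alice', 'Bob']})
munged = munge_data(raw)
```

With this shape you can test munge_data on a small hand-built frame and check the result. You can also test an individual step the same way if it ever grows beyond a one-liner.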
