
I have a script that compares, in some way, each line from file1 with each line from file2, and outputs the lines if there is a difference. I want to make it faster - right now it's in Python. I could use threads, but I would like to know whether there is some easier way to improve it.

Since each test is independent, it could run in parallel - I just need to make sure that each line from file1 is compared with each line from file2.

EDIT: The bottleneck so far is the processor (the comparison itself); disk usage isn't that heavy, but the core running the program sits at 100%. Note that the files are "large" (e.g. over 20 MB), so I understand that it takes some time to process them.
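
Schematically, the script's structure is roughly this (simplified; differs() here stands in for the actual comparison, which is more involved than a plain inequality test):

with open('file1') as f1, open('file2') as f2:
    for line1 in f1:
        f2.seek(0)                       # rewind file2 for every line of file1
        for line2 in f2:
            if differs(line1, line2):    # placeholder for the real comparison
                print(line1, line2)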

Bilesh Ganguly
MatthewRock
  • related: [Evaluating concurrent application design approaches on Linux](http://programmers.stackexchange.com/questions/263141/evaluating-concurrent-application-design-approaches-on-linux) – gnat Jul 07 '16 at 09:39
  • What is your bottleneck? Are the files big, or is the comparison complex? If it's a matter of size, the disk will always be slower than the CPU no matter what you try. If the calculations on the lines are complex, then you might get good results by loading the file and forking through the multiprocessing module instead of using threads. – Diane M Jul 07 '16 at 12:11
  • @ArthurHavlicek I've updated the question to mention the bottleneck. – MatthewRock Jul 07 '16 at 12:52

3 Answers


If you want real CPU parallelization then, as Mason stated, you need to get around the GIL by forking instead of using threads. This has extra overhead compared to threads, but it may work if processing time is the bottleneck.

The least hacky way to achieve this is to use multiprocessing.Pool with a variant of map. This dispatches your iterable to a pool of workers that consume the input, and the results are aggregated back in the parent process.

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    with Pool(5) as p:  # pool of 5 worker processes
        print(p.map(f, [1, 2, 3]))  # prints [1, 4, 9]
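
Applied to your case, a minimal sketch could look like the following (the file names and the compare() test are placeholders for whatever your script actually does):

from itertools import product
from multiprocessing import Pool

def compare(pair):
    # placeholder for the real per-pair test; return the pair if it should be output
    a, b = pair
    return pair if a != b else None

if __name__ == '__main__':
    with open('file1') as f1, open('file2') as f2:
        lines1 = f1.read().splitlines()
        lines2 = f2.read().splitlines()
    with Pool() as p:
        # a large chunksize keeps inter-process communication overhead down
        # when there are millions of line pairs
        for hit in p.imap_unordered(compare, product(lines1, lines2), chunksize=10000):
            if hit is not None:
                print(*hit)
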
Diane M

Probably not.

Python has what's known as the Global Interpreter Lock (GIL), which ensures that the interpreter is never running on more than one thread of a process at a time. This means that, unless your processing makes very heavy use of native code such as NumPy that spends most of its time outside the interpreter, it is impossible to speed it up by running it on multiple threads.

You might be able to get some speed gains by parallelizing via multiprocessing, but that can impose some heavy overhead for setup and communication, so it's hard to say for sure without testing it.
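
As a rough illustration of the GIL effect (not your actual workload; busy() is just a stand-in for CPU-bound work), running the same function under threads and under processes tends to look like this:

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def busy(n):
    # purely CPU-bound loop; it never releases the GIL for long
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, label):
    start = time.time()
    with executor_cls(max_workers=4) as ex:
        list(ex.map(busy, [2000000] * 4))
    print(label, round(time.time() - start, 2), 's')

if __name__ == '__main__':
    timed(ThreadPoolExecutor, 'threads:  ')    # usually no faster than running serially
    timed(ProcessPoolExecutor, 'processes:')   # usually close to 4x faster, minus fork/IPC overhead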

Mason Wheeler

I've managed to parallelize it using GNU Parallel.

First, I had to make slight changes to the script: I had to make sure that only the "file" part uses seek to rewind the file pointer (you can't seek on a pipe). I also had to use this hack to get UTF-8 stdout/stdin:

# Python 2 only: re-expose setdefaultencoding, which the site module removes at startup
reload(sys)
sys.setdefaultencoding('utf8')

After that, I was ready to call the script:

parallel --pipepart -a file_to_stdin ./myscript.py --secondfile second_file > result

(--pipepart splits the file given with -a into chunks and feeds them to the command's stdin, instead of passing lines from the file as arguments)

This way, the changes to the script were minimal, and concurrency was achieved.
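
For reference, the reworked script roughly follows this shape (argument handling and output format are simplified here; compare() stands in for the real test):

#!/usr/bin/env python
import sys

def compare(a, b):
    return a != b   # placeholder for the actual comparison

# very naive option parsing, just for the sketch
second_path = sys.argv[sys.argv.index('--secondfile') + 1]

with open(second_path) as second:
    for line1 in sys.stdin:      # chunk of the first file delivered by parallel --pipepart
        second.seek(0)           # only the real file can be rewound, not the stdin pipe
        for line2 in second:
            if compare(line1, line2):
                sys.stdout.write(line1.rstrip('\n') + '\t' + line2)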

MatthewRock
  • Hm, what are you using seek for? It kinda seems to me you are trying to do too low-level stuff. Reading two files line-by-line and comparing them normally isn't all that CPU-intensive. – Roel Schroeven Jul 11 '16 at 08:09
  • Oh, now I see you're not doing a simple string comparison between the lines from each file. If you're doing complex calculations, that can explain the CPU usage of course. – Roel Schroeven Jul 11 '16 at 08:31
  • @RoelSchroeven I compare two files, each line with each line - in other words, I am mapping a comparator function over the Cartesian product of the lines in both files. The files are big, so for each line in one file, I go through the whole second file. Then I need to go back to the beginning, so I use seek. I don't know another way to do this in Python, and this one works (but it's ugly). – MatthewRock Jul 11 '16 at 09:22
  • Yes, now I see. I thought you had to compare each line from the first file with the corresponding line from the second file (like a simple diff), with a total of n comparisons (assuming both files have the same number of lines). But I completely misunderstood; you need n*m comparisons. – Roel Schroeven Jul 11 '16 at 09:51