I recently needed to use parallel processing in Python. Parallel processing is very useful when:

  • you have a large set of data that you want to (or are able to) process as separate ‘chunks’.
  • you want to perform an identical process on each individual chunk (i.e. the basic code running on each chunk is the same). Of course, each chunk may have its own corresponding parameter requirements.
  • the order in which each chunk is processed is not important, i.e. the output result from one chunk does not affect the processing of a subsequent chunk.

Under these conditions, if you are working on a multi-core computer (which is true for virtually all of us these days), you can set up your code to run in parallel on several or all of your computer’s cores. Using multiple cores is essential for gaining any improvement in computation time. If you attempt such parallel processing on a single core, the computer will simply switch between separate computational threads on that one core, and the total computation time will stay the same (in fact, it will more likely increase because of the incessant switching between threads).


Anyhow, there are several ways of achieving multi-core parallel processing in Python. In this post, I will describe what I think is the simplest one to implement. It is the method I chose, and I am quite happy with the results.

Additionally, most online examples of parallel processing never mention how to handle input arguments other than the iteration parameter. There are several ways of handling that too, and I will again describe what I think is the simplest one to implement and maintain.

Say you have the following code setup:

arg1 = val1
arg2 = [val2, val3]
arg3 = ['val4', 'val5']
fileslist = ['list', 'of', 'files', 'that', 'are', 'to', 'be', 'processed']

for file in fileslist:
    print('Start: {}'.format(file))
    # perform a task with arg1
    # perform a task with arg2
    # print something with arg3
    # save some data to disk
    print('Status Update based on {}'.format(file))

Now, for parallel processing, the goal is to convert the for loop into a parallel-process controller, which will ‘assign’ file values from fileslist to the available cores.

To achieve this, there are two steps we need to perform. First, convert the contents of your for loop into a separate function that can be called. With the approach used here, that function receives only a single argument per call. Set up your function accordingly, planning for this single argument to be a tuple of variables. One of these variables will be the iteration variable, in our case file, and the rest will be the remaining variables required.

def loopfunc(argstuple):
    # unpack the single tuple argument back into the individual variables
    file, arg1, arg2, arg3 = argstuple
    print('Start: {}'.format(file))
    # perform a task with arg1
    # perform a task with arg2
    # print something with arg3
    # save some data to disk
    return 'Status Update based on {}'.format(file)
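
Incidentally, before bringing in any parallel machinery, you can sanity-check the refactored function serially by calling it with a hand-built tuple (a throwaway test, not part of the final script):

# serial test of the refactored loop body, using the variables defined above
print(loopfunc((fileslist[0], arg1, arg2, arg3)))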

Second, update the main code structure to enable multi-core processing. We will be using the module concurrent.futures. Let’s see the updated code first, before I explain what is happening.

import concurrent.futures

arg1 = val1
arg2 = [val2, val3]
arg3 = ['val4', 'val5']
fileslist = ['list', 'of', 'files', 'that', 'are', 'to', 'be', 'processed']

argslist = ((file, arg1, arg2, arg3) for file in fileslist)
with concurrent.futures.ProcessPoolExecutor() as executor:
    results = executor.map(loopfunc, argslist)

    for rs in results:
        print(rs)

OK, now let’s go over it. The with ... line creates the executor object that manages the pool of worker processes (and shuts the pool down when the block exits). In the next line, executor.map() is given two pieces of information: (a) the function that is to be repeatedly executed, and (b) an iterable whose elements are passed, one per call, as the argument of each function execution. Notice that when calling executor.map(), we are providing loopfunc as an object, and are not attempting to execute the function itself via loopfunc().

Now, argslist is meant to supply the arguments for all iterations of loopfunc, i.e. one argument tuple per entry in fileslist. However, in our case, only the fileslist variable is iterated over, while the other arguments are provided ‘as-is’. The workaround for this is to use a list comprehension (err… strictly a generator expression, since it is written with parentheses) to build a new variable (in our case argslist) that contains all the relevant arguments for each function call.

In this way, the first call is made as loopfunc( (fileslist[0], arg1, arg2, arg3) ), the second as loopfunc( (fileslist[1], arg1, arg2, arg3) ), and so on. Of course, within loopfunc(), we have already converted the single input argument back into the multiple variables we need.
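
To make that expansion concrete, here is a tiny standalone sketch (using invented placeholder values rather than the real arguments above) that simply prints the tuples the generator produces:

arg1 = 10
arg2 = [1, 2]
arg3 = ['a', 'b']
fileslist = ['file0.dat', 'file1.dat']

argslist = ((file, arg1, arg2, arg3) for file in fileslist)
for argstuple in argslist:
    print(argstuple)
# ('file0.dat', 10, [1, 2], ['a', 'b'])
# ('file1.dat', 10, [1, 2], ['a', 'b'])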

Values returned from loopfunc() are collected in results, which is looped over to print each value. The fun behavior here is that results is an iterator: it yields the return values in the same order as the inputs, but each value only becomes available once its corresponding call has finished, so the status prints end up interleaved with new chunks starting. For example, if you’re running on a 4-core machine, the output can look like the following, depending upon how fast each iteration runs (a small runnable toy that reproduces this behavior appears after the sample output):

Start: fileslist[0]
Start: fileslist[1]
Start: fileslist[2]
Start: fileslist[3]
Status Update based on fileslist[0]
Status Update based on fileslist[1] 
Start: fileslist[4]
Start: fileslist[5]
Status Update based on fileslist[2] 
Start: fileslist[6]
Status Update based on fileslist[3] 
Start: fileslist[7]
...
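
If you want to see this interleaving on your own machine, here is a small self-contained toy you can run as a script (slow_task and its sleep times are invented purely for illustration). Note the if __name__ == '__main__': guard: on platforms that spawn rather than fork worker processes (Windows, and macOS on recent Python versions), ProcessPoolExecutor requires the script’s entry point to be protected this way.

import concurrent.futures
import time

def slow_task(n):
    print('Start: {}'.format(n))
    time.sleep((n % 3) + 1)    # pretend each chunk takes a different amount of time
    return 'Status Update based on {}'.format(n)

if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        for rs in executor.map(slow_task, range(8)):
            print(rs)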

Without any arguments, ProcessPoolExecutor() creates as many worker processes as there are cores on your computer. This is great if you want to run your code and walk away for a few hours, letting your Python script take over your whole computational capability. However, if you only want to allow a specific number of processes, you can use ProcessPoolExecutor(max_workers=nproc), where nproc is the maximum number of processes you want to run simultaneously.
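
For example, if you want to leave one core free for the rest of the system (just one possible choice, nothing the module requires), you could do something like the following, reusing loopfunc and argslist from above:

import os

nproc = max(1, os.cpu_count() - 1)    # use all cores but one
with concurrent.futures.ProcessPoolExecutor(max_workers=nproc) as executor:
    results = executor.map(loopfunc, argslist)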

To-do

In my current implementation I have used the above method to work on ‘chunks’ of data and then saved the resulting output to disk with appropriate markers. However, another way to handle the output would be to take the result from each iteration and save it as an element in an array, at the correct array index.

This should not be hard to do; all I should need is to return both the output data and the corresponding array index from loopfunc(). I just haven’t done it (nor needed to do it) yet. I actually prefer saving the output from each chunk to disk separately, if possible, so that even if something crashes (or the power goes out, or whatever) and the process is interrupted, I won’t lose all the progress made until then.
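
For what it’s worth, a rough sketch of that idea might look like the following (untested, since as I said I haven’t needed it yet; it simply adds an index to the argument tuple and to the return value, reusing the setup variables from earlier):

def loopfunc(argstuple):
    idx, file, arg1, arg2, arg3 = argstuple
    # ... process the chunk as before ...
    output = 'data derived from {}'.format(file)
    return idx, output

argslist = ((i, file, arg1, arg2, arg3) for i, file in enumerate(fileslist))
outputs = [None] * len(fileslist)

with concurrent.futures.ProcessPoolExecutor() as executor:
    for idx, output in executor.map(loopfunc, argslist):
        outputs[idx] = output

Strictly speaking, executor.map() already yields results in input order, so the explicit index is somewhat redundant here, but it makes the intent obvious and would keep working if the collection were later switched to a completion-order approach (e.g. executor.submit() together with concurrent.futures.as_completed()).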