Multi-core parallel processing in Python with multiple arguments

I recently needed to use parallel processing in Python. Parallel processing is very useful when:

  • you have a large set of data that you want to (or are able to) process as separate ‘chunks’.
  • you want to perform an identical process on each individual chunk (i.e. the basic code running on each chunk is the same). Of course, each chunk may have its own corresponding parameter requirements.
  • the order in which each chunk is processed is not important, i.e. the output result from one chunk does not affect the processing of a subsequent chunk.

Under these conditions, if you are working on a multi-core computer (which is true for virtually all of us these days), you can set up your code to run in parallel across several or all of your computer’s cores. Using multiple cores is essential to gain any improvement in computation time. If you attempt such parallel processing on a single core, the computer will simply switch between separate computational threads on that one core, and the total computation time will stay the same (in fact, it will more likely increase because of the constant switching between threads).


Anyhow, there are several ways of achieving multi-core parallel processing in Python. In this post, I will describe what I think is the simplest one to implement. This is the method I chose, and I am quite happy with the results.

Additionally, most examples online that cover parallel processing never mention how to handle input arguments beyond the iteration parameter. There are several ways of doing that too, and I will describe what I think is the simplest to implement and maintain.

Say you have the following code setup:

arg1 = val1
arg2 = [val2, val3]
arg3 = ['val4', 'val5']
fileslist = ['list', 'of', 'files', 'that', 'are', 'to', 'be', 'processed']

for file in fileslist:
    print('Start: {}'.format(file))
    # perform a task with arg1
    # perform a task with arg2
    # print something with arg3
    # save some data to disk
    print('Status Update based on {}'.format(file))

Now, for parallel processing, the target is to convert the for loop into a parallel process controller, which will ‘assign’ file values from fileslist to available cores.

To achieve this, there are two steps. First, convert the contents of your for loop into a separate function that can be called. When used this way, the function receives only one argument, so set it up with the plan that this single argument will be a tuple of variables. One of these variables will be the iteration variable, in our case file, and the rest will be the remaining variables required.

def loopfunc(argstuple):
    # unpack the single tuple argument into the variables we need
    file, arg1, arg2, arg3 = argstuple
    print('Start: {}'.format(file))
    # perform a task with arg1
    # perform a task with arg2
    # print something with arg3
    # save some data to disk
    return 'Status Update based on {}'.format(file)

Second, update the main code structure to enable multi-core processing. We will be using the module concurrent.futures. Let’s see the updated code first, before I explain what is happening.

import concurrent.futures

arg1 = val1
arg2 = [val2, val3]
arg3 = ['val4', 'val5']
fileslist = ['list', 'of', 'files', 'that', 'are', 'to', 'be', 'processed']

argslist = ((file, arg1, arg2, arg3) for file in fileslist)
with concurrent.futures.ProcessPoolExecutor() as executor:
    results = executor.map(loopfunc, argslist)

    for rs in results:
        print(rs)

OK, now let’s go over it. The with ... line invokes the parallel-processing machinery and creates the executor object. In the next line, executor.map() is given two pieces of information: (a) the function that is to be repeatedly executed, and (b) an iterable supplying the argument tuple for each function call. Notice that when calling executor.map(), we provide loopfunc as an object; we do not attempt to execute the function ourselves via loopfunc(). (One caveat: on platforms that start worker processes by spawning, such as Windows, the pool-creation code must be placed under an if __name__ == '__main__': guard.)

Now, argslist is meant to supply the arguments for all iterations of loopfunc — that is, one tuple per element of fileslist. However, in our case, only the fileslist variable is iterated over, while the other arguments are provided ‘as-is’. The workaround is to use a generator expression (it looks like a list comprehension, but wrapped in parentheses) to build argslist, which yields one tuple containing all the relevant arguments for each function call.

In this way, the first process is created with loopfunc( (fileslist[0], arg1, arg2, arg3) ), the second process is created with loopfunc( (fileslist[1], arg1, arg2, arg3) ), and so on. Of course, within loopfunc(), we have already converted the input single argument into multiple arguments as we need.

Values returned from loopfunc() are collected in the variable results, which is looped over to print each value. The fun behavior here is that each rs item is printed as soon as that value becomes available; executor.map() yields results in the same order as the inputs, so the loop waits on each result in turn even though the processes themselves finish whenever they finish. For example, if you’re running on a 4-core machine, the output can look like the following, depending on the speed of each iteration:

Start: fileslist[0]
Start: fileslist[1]
Start: fileslist[2]
Start: fileslist[3]
Status Update based on fileslist[0]
Status Update based on fileslist[1] 
Start: fileslist[4]
Start: fileslist[5]
Status Update based on fileslist[2] 
Start: fileslist[6]
Status Update based on fileslist[3] 
Start: fileslist[7]
...

Without any arguments, ProcessPoolExecutor() creates as many worker processes as there are cores on your computer. This is great if you want to run your code and walk away for a few hours, letting your Python script take over your whole machine. However, if you only want to allow a specific number of processes, you can use ProcessPoolExecutor(max_workers=nproc), where nproc is the maximum number of worker processes to run at once.

To-do

In my current implementation I have used the above method to work on ‘chunks’ of data and then saved the resultant output with appropriate markers to disk. However, another way to implement parallel processing would be to take the output from each iteration, and save it as an element in an array, at the correct array index.

This should not be hard to do: all I need is to return both the output data and the marker for the correct array index. I just haven’t done it (nor needed to) yet. I actually prefer saving the output from each chunk to disk separately, where possible, so that even if something crashes (or the power goes out, or whatever) and the process is interrupted, I won’t lose all the progress made until then.



Creating Bandpass Bessel Filter with MATLAB

Bessel filters are incredibly useful in numerical analysis, especially for acoustic-type waveforms. This is because analog Bessel filters are characterized by a nearly constant group delay across the passband, which means that the shape of filtered waveforms is preserved.

Well, MATLAB provides some of the building blocks required to create a bandpass analog filter, but does not actually combine the pieces to make a usable filter function.

I created a function for my own research (assembled from pieces I found elsewhere, but it’s been too long and I don’t remember where each piece came from, sorry!). The function can be found at my MATLAB repository, specifically, here.

Here’s the documentation that I included with the function:

besselfilter. Function to implement a bandpass Bessel Filter.

[filtData, b, a] = besselfilter(order,low,high,sampling,data)

Inputs:

    - order:      Number of poles in the filter. Scalar numeric value.
                    Eg.: 4  
    - low:        Lower frequency bound (Hz). Scalar numeric value.
                    Eg.: 50000 (= 50kHz)
    - high:       Upper frequency bound (Hz). Scalar numeric value.
                    Eg.: 1000000 (= 1MHz)
    - sampling:   Sampling frequency (Hz). Scalar numeric value.
                    Eg.: 25000000 (= 25MHz)
    - data:       Input data. Numeric vector.
                    Eg.: data vector of size (n x 1)

Output:

    - filtData:   Output filtered data. Numeric vector. 
                    Eg.: data vector of size (n x 1)
    - b, a:       Numerator and denominator coefficients of the filter
                    transfer function. Numeric vectors.

MATLAB repository

I’ve been using MATLAB for quite a few years now, both for my own research as well as for work at VTTI. Well, I decided to share some of the code that I’ve been writing, which may come in handy to others in the same field. I’ve long appreciated the help I’ve received from the larger MATLAB community, and I thought I should start contributing as well. :)

Here’s the link: https://bitbucket.org/arnabocean/public-matlab/

Update in 2020: Well, my Bitbucket repository was Mercurial-based, as you will read below. Bitbucket has since dropped Mercurial support and become Git-only, so I moved to the more popular Git host, GitHub. Here’s where the repository resides now: https://github.com/arnabocean/public-matlab

I’ve been using bitbucket as my code repository. I know, GitHub seems to be more popular in the tech and programming community, but by coincidence I happened to learn Mercurial (Hg) first, and Mercurial it has been for me since. Bitbucket supports both Hg and Git, so you may use either based on your preference.

To be sure, these are (and will be) scripts that I am able to share, and in some cases snippets of code that may be useful for implementing certain logic scenarios. They will almost certainly not be pathbreaking new pieces of programming. :) I’ll be glad if you find them useful in any way.


Matlab: find a string within a cell array of strings

I just wanted to jot down a few points about Matlab programming. Specifically, this is about finding a string within another cell array of strings, where the thing I’m really interested in is the index of the cell array where the reference string occurs. For example, if my reference string is 'Gamma', and my cell array is {'Alpha','Beta','Gamma','Delta'}, then the result of the code should be 3.

Say,

cellArray = {'Alpha','Beta','Gamma','Delta','GammaSquared'};
refString = 'Gamma';

Method 1

This method uses the Matlab function strfind (link).

index = strfind(cellArray,refString);
index = find(~cellfun(@isempty,index));

Result:

index = 
    3   5

This method works great if the idea is to find a substring, i.e. when we are looking for all possible matches. It doesn’t work too well, however, if we’re looking for an exact match.

Method 2

This uses the Matlab function ismember (link).

index = find(ismember(cellArray,refString));

Result:

index = 
    3

Works great if the idea is to find a perfect match. However, let’s also keep tabs on the computation time.

tic; index = find(ismember(cellArray,refString)); toc;

Result:

Elapsed time is 0.001047 seconds.

Method 3

This uses the Matlab function strcmp (link).

index = find(strcmp(cellArray,refString));

Result:

index = 
    3

Same result as in Method 2, but what about computation time?

tic; index = find(strcmp(cellArray,refString)); toc;

Result:

Elapsed time is 0.000025 seconds.

Turns out Method 3 is more than 41 times faster to execute. So we have a winner!

Reference

Stack Overflow: How to search for a string in cell array in MATLAB?