Is your Python program running slow? Here is an idea to boost its performance. The concept of parallel processing is very helpful for data scientists and programmers who use Python for data science. Python, with its powerful libraries such as NumPy, SciPy and Matplotlib, has already reduced the time and cost of development. In this article, I'm going to discuss parallel processing as a way to speed up Python programs. Let's explore:

What is Parallel Processing?

Parallel processing is an approach for performing a complex operation by running tasks simultaneously across the multiple processors of the same computer. The main objective is to reduce the overall processing time of the program.

However, there is a caveat: communication between processes adds overhead, and in some circumstances it can increase the processing time of a task instead of decreasing it.

In Python, the multiprocessing module is very useful for executing independent tasks in parallel. It uses subprocesses rather than threads to accomplish this. With it, you can use all the processors on your machine, and each process runs in its own memory space allocated at execution time.

What is synchronous and asynchronous execution?

Parallel processing supports two modes of execution: synchronous and asynchronous.

In synchronous execution, once a process starts, it blocks the main program until it has completed.

Asynchronous execution involves no such locking; it finishes the overall job faster, but the results can come back in a different order than the inputs.

The multiprocessing module provides two classes for executing a function in parallel:

Pool class

Pool class: Synchronous execution

Pool.map() & Pool.starmap()
Pool.apply()

Pool class: Asynchronous execution

Pool.map_async() & Pool.starmap_async()
Pool.apply_async()

Process class
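
The examples in this article focus on the Pool class. For completeness, here is a minimal sketch of the Process class, which launches a single function in its own process with its own memory space; the function name and argument used here are my own illustration:

import multiprocessing as multip

def show_square(n):
    # this runs in a separate child process with its own memory
    print("Square of", n, "is", n * n)

if __name__ == "__main__":      # guard needed when the start method is 'spawn' (e.g. on Windows)
    proc = multip.Process(target=show_square, args=(5,))
    proc.start()                # launch the child process
    proc.join()                 # wait for it to finish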

What is the role of parallel processing in data science?

The field of data science is vast: it includes steps such as data gathering, processing, analysis, visualization, regression and predictive modelling to extract valuable insights from the collected data. Parallel processing doesn't require a supercomputer for faster execution; all it demands is a computer with multiple processors. It is clear that parallel processing is a ready-made remedy for data scientists who want to save extra effort and time.

Complex frameworks used for big data analysis, such as Hadoop and Spark, leverage the parallel computing concept to process massive datasets at higher speed.

The data science market is growing at an unstoppable pace, with a CAGR of 36.5 percent. Its overall market is expected to reach 128.2 billion USD by the end of 2022.

A report by LinkedIn reveals that most US cities face a shortage of data scientists in their companies' workforce; overall, the United States alone is short 151,171 people with data science skills. Thus, grabbing these opportunities with practical skills will not be that challenging for beginners.

Now, you must be wondering: how many parallel processes can be executed on your system?

The number of processes you can usefully run in parallel depends on the number of available processors. To check that figure, you can call the cpu_count() function of the multiprocessing module:

import multiprocessing as multip
print("Total number of processors on your machine is:", multip.cpu_count())
Total number of processors on your machine is: 4

Now, let's explore parallel computing in depth with Python:

Program to count the number of values falling within a given range in each row of a matrix.

We have a matrix in which, for each row, we have to count the values falling within a prescribed range. First, we will prepare the data. Here, RandomState from NumPy gives us a seeded random number generator, and its randint(a, b) method returns pseudorandom integers from a (inclusive) up to b (exclusive).

import numpy as nump
from time import time

rng = nump.random.RandomState(110)              # seeded random number generator
met = rng.randint(0, 11, size=[300000, 6])      # 300000 rows, 6 columns of integers in [0, 10]
data = met.tolist()                             # convert to a plain Python list of rows
data[:6]

Now let's understand the scenario without parallelization.

To see how much time this task takes without parallelization, we will use the function total_range(). It counts, for a single row, how many values fall within the prescribed range.

def total_range(row, min, max):
    total = 0
    for x in row:
        if min <= x <= max:
            total = total + 1
    return total

outcomes = []
for row in data:
    outcomes.append(total_range(row, min=5, max=9))
print(outcomes[:10])
[2, 1, 3, 6, 4, 4, 6, 3, 2, 3]
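
Since time() was imported earlier, here is a minimal sketch (my own addition; the exact figure will depend on your machine) of how you could time the serial run, so you can later compare it against the parallel versions:

start = time()
outcomes = [total_range(row, min=5, max=9) for row in data]
print("Serial run took:", time() - start, "seconds")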

Steps to parallelize a function:

To parallelize a function, you need to execute it across multiple processors. Start by initializing a Pool with a given number of worker processes. Then pass the function to one of the Pool's parallelization methods.

The apply(), map() and starmap() methods of multiprocessing.Pool() allow you to execute a function in parallel. All of them take the function to be parallelized as the main argument. The difference is that map() accepts only one iterable as an argument, while apply() accepts an args argument holding the parameters to pass to the parallelized function. Let's explore them in practice:

Pool.apply()

Let's parallelize our total_range() function with multiprocessing.Pool().

import multiprocessing as multip

poolv = multip.Pool(multip.cpu_count())       # one worker per processor
outcomes = [poolv.apply(total_range, args=(row, 5, 9)) for row in data]
poolv.close()

print(outcomes[:10])
[2, 1, 3, 6, 4, 4, 6, 3, 2, 3]

With Pool.map()

Pool.map() takes just one iterable as an argument. So we will modify our function, giving the min and max parameters default values: the new function total_range_row() takes only the row.

import multiprocessing as multip
def total_range_row(row, min=5, max=9):
    total = 0
    for x in row:
        if min <= x <= max:
            total = total + 1
    return total
poolv = multip.Pool(multip.cpu_count())
outcomes = poolv.map(total_range_row, [row for row in data])
poolv.close()
print(outcomes[:10])
[2, 1, 3, 6, 4, 4, 6, 3, 2, 3]

With Pool.starmap()

With starmap() we can avoid the previous step of redefining the function. It also takes a single iterable as an argument, the only difference being that each element of that iterable is itself an iterable of arguments. Thus, Pool.starmap() is a version of Pool.map() that accepts multiple arguments.

import multiprocessing as multip
poolv = multip.Pool(multip.cpu_count())
outcomes = poolv.starmap(total_range, [(row, 5, 9) for row in data])
poolv.close()
print(outcomes[:10])
[2, 1, 3, 6, 4, 4, 6, 3, 2, 3]

Asynchronous Parallel Processing:

The apply_async(), map_async() and starmap_async() methods let you run parallel processes asynchronously. Workers are not tied to the order in which jobs were submitted: each worker picks up the next job as soon as it finishes the previous one. This is why asynchronous parallel processing does not necessarily return the output in the same order as the input.

With Pool.apply_async():

apply_async() is very much like apply(); the only difference is that it accepts a callback function, which tells the program where each outcome should be stored as it arrives.

Below I'm defining a new function, total_range1(), that also accepts the iteration number x, so the outcomes can be sorted back into their original order afterwards.

import multiprocessing as multip

def total_range1(x, row, min, max):
    # return the iteration number together with the count so the results can be re-ordered
    total = 0
    for b in row:
        if min <= b <= max:
            total = total + 1
    return (x, total)

def take_outcomes(result):
    # callback (runs in the main process): collect each (iteration number, count) tuple
    outcomes.append(result)

poolv = multip.Pool(multip.cpu_count())
outcomes = []

for x, row in enumerate(data):
    poolv.apply_async(total_range1, args=(x, row, 5, 9), callback=take_outcomes)
poolv.close()
poolv.join()

outcomes.sort(key=lambda r: r[0])              # restore the original row order
outcomes = [total for x, total in outcomes]    # keep only the counts
print(outcomes[:10])
[2, 1, 3, 6, 4, 4, 6, 3, 2, 3]

With Pool.starmap_async():

import multiprocessing as multip
poolv = multip.Pool(multip.cpu_count())
outcomes = poolv.starmap_async(total_range1, [(x, row, 5, 9) for x, row in enumerate(data)]).get()
outcomes = [total for x, total in outcomes]    # get() returns (x, total) tuples in input order; keep only the counts
poolv.close()
print(outcomes[:10])
[2, 1, 3, 6, 4, 4, 6, 3, 2, 3]
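
map_async() was listed above but not demonstrated. For completeness, here is a minimal sketch of how it could be used with the total_range_row() function and data list defined earlier; map_async() returns immediately with an AsyncResult, and get() waits for the results, which come back in input order:

import multiprocessing as multip
poolv = multip.Pool(multip.cpu_count())
outcomes = poolv.map_async(total_range_row, [row for row in data]).get()
poolv.close()
print(outcomes[:10])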

With that, we have covered both the synchronous and asynchronous parallel processing methods in Python.

Programming languages like Python and R offer numerous packages that reduce the effort data scientists and programmers spend on data analysis. In this growing market, one can grab any of the available opportunities. Resources such as a Data Science Course, blogs, FAQs, videos and applications can be very helpful for beginners as well as professionals to drive their growth.