Multiprocessing with class functions and class attributes - oop

I have a pandas Dataframe, that has millions of rows and I have to do row-wise operations. Since I have a Multicore CPU, I would like to speed up that process using Multiprocessing. The way I would like to do this is to just split up the dataframe in equally sized dataframes and process each of them within a separate process. So far so good...
The problem is, that my code is written in OOP style and I get Pickle errors using a Multiprocess Pool. What I do is, I pass a reference to a class function self.X to the pool. I further use class attributes within X (only read access). I really don't want to switch back to functional programming style... Hence, is it possible to do Multiprocessing in an OOP envirnoment?

It should be possible as long as all elements in your class (that you pass to the sub-processes) is picklable. That is the only thing you have to make sure. If there are any elements in your class that are not, then you cannot pass it to a Pool. Even if you only pass self.x, everything else like self.y has to be picklable.
I do my pandas Dataframe processing like that:
import pandas as pd
import multiprocessing as mp
import numpy as np
import time
def worker(in_queue, out_queue):
for row in iter(in_queue.get, 'STOP'):
value = (row[1] * row[2] / row[3]) + row[4]
time.sleep(0.1)
out_queue.put((row[0], value))
if __name__ == "__main__":
# fill a DataFrame
df = pd.DataFrame(np.random.randn(1e5, 4), columns=list('ABCD'))
in_queue = mp.Queue()
out_queue = mp.Queue()
# setup workers
numProc = 2
process = [mp.Process(target=worker,
args=(in_queue, out_queue)) for x in range(numProc)]
# run processes
for p in process:
p.start()
# iterator over rows
it = df.itertuples()
# fill queue and get data
# code fills the queue until a new element is available in the output
# fill blocks if no slot is available in the in_queue
for i in range(len(df)):
while out_queue.empty():
# fill the queue
try:
row = next(it)
in_queue.put((row[0], row[1], row[2], row[3], row[4]), block=True) # row = (index, A, B, C, D) tuple
except StopIteration:
break
row_data = out_queue.get()
df.loc[row_data[0], "Result"] = row_data[1]
# signals for processes stop
for p in process:
in_queue.put('STOP')
# wait for processes to finish
for p in process:
p.join()
This way I do not have to pass big chunks of DataFrames and I do not have to think about picklable elements in my class.

Related

Enhancing performance of pandas groupby&apply

These days I've been stucked in problem of speeding up groupby&apply,Here is code:
dat = dat.groupby(['glass_id','label','step'])['equip'].apply(lambda x:'_'.join(sorted(list(x)))).reset_index()
which cost large time when data size grows.
I've try to change the groupby&apply to for type which didn't work;
then I tried to use unique() but still fail to speed up the running time.
I wanna a update code for less run-time,and gonna be very appreciate if there is a solvement to this problem
I think you can consider to use multiprocessing
Check the following example
import multiprocessing
import numpy as np
# The function which you use in conjunction with multiprocessing
def loop_many(sub_df):
grouped_by_KEY_SEQ_and_count=sub_df.groupby(['KEY_SEQ']).agg('count')
return grouped_by_KEY_SEQ_and_count
# You will use 6 processes (which is configurable) to process dataframe in parallel
NUMBER_OF_PROCESSES=6
pool = multiprocessing.Pool(processes=NUMBER_OF_PROCESSES)
# Split dataframe into 6 sub-dataframes
df_split = np.array_split(pre_sale, NUMBER_OF_PROCESSES)
# Process split sub-dataframes by loop_many() on multiple processes
processed_sub_dataframes=pool.map(loop_many,df_split)
# Close multiprocessing pool
pool.close()
pool.join()
concatenated_sub_dataframes=pd.concat(processed_sub_dataframes).reset_index()

How to run multiple functions synchronous in Jupyter Notebook?

I try to run multiple functions at the same time in Jupiter Notebook.
I have two web scraping functions that use Selenium and run for an infinite amount of time, both always creating an updated DataFrame. Another function merges the two DataFrames and does some calculations.
As the data always changes and the calculations from the different DataFrames need to be calculated within the same second (The two DataFrames update every 5 seconds), I wonder how I can run all functions at the same time.
As my code is mainly WebScraping I used this more to describe my goal and hopefully make it more readable. I already tried using 'multiprocessing' but it just does not do anything in the notebook.
def FirstWebScraping():
while True:
time.sleep(5).
#getting all data for DataFrame
def SecondtWebScraping():
while True:
time.sleep(5).
#getting all data for DataFrame
def Calculations():
while True:
#merging DataFrame from First- and SecondWebScraping
#doing calculations
#running this function infinite and looking for specific values
#Goal
def run_all_at_the_same_time()
FirstWebScraping()
SecondWebScraping()
Calculations()
Even though threading does not show the same benefits as multiprocessing it worked for me and with selenium. I put a waiting time at the beginning for the Calculations function and from there they were all looped infinitely.
from threading import Thread
if __name__ == '__main__':
Thread(target = FirstWebScraping).start()
Thread(target = SecondWebscraping).start()
Thread(target = Calculations).start()
You can run multiprocessing in Jupyter, if you follow two rules:
Put the worker functions in a separate module.
Protect the main process-only code with if __name__ == '__main__':
Assuming your three functions are moved to worker.py:
import multiprocessing as mp
import worker
if __name__ == '__main__':
mp.Process(target=worker.FirstWebScraping).start()
mp.Process(target=worker.SecondWebscraping).start()
mp.Process(target=worker.Calculations).start()

Why is my multiprocessing program spawning processes infinitely?

import time
from multiprocessing import Pool, RawArray, sharedctypes
from ctypes import c_int
def init_worker(X):
print(f"{X}")
def worker_func(i):
print(f"{X}")
time.sleep(i) # Some heavy computations
return
# We need this check for Windows to prevent infinitely spawning new child
# processes.
if __name__ == '__main__':
X =sharedctypes.RawValue(c_int)
X=3
with Pool(processes=4, initializer=init_worker, initargs=(X)) as pool:
pool.map(worker_func, [1,2,3,4])
print(X)
--- I am simply trying to print the value of X in each subprocess. This is a toy program in order to check whether I can share a value
and update it by using multiple processes.
This program spawns infinite number of processes because it there is no ,(comma) after the X in initargs=(X) ; it should be initargs=(X,) . That is because, if the comma is left-out that causes an error.

Monitor progress of dd.DataFrame.apply

How can I monitor the progress of a row-wise Dask DataFrame apply operation?
Wrapping the line with ProgressBar() doesn't seem to do anything, i.e. nothing is printed on the console?
from dask.diagnostics import ProgressBar
with ProgressBar():
df_calc = ddf.apply(myfunc, axis=1)
Dask operations are lazy by default. Computation only happens when you call compute or persist.
df = dd.read_csv(...) # This lazily builds up a computation
df = df[df.name == 'alice'] # This lazily builds up a computation
result = df.amount.sum() # This lazily builds up a computation
result = result.compute() # This triggers actual work

Make Pandas DataFrame apply() use all cores?

As of August 2017, Pandas DataFame.apply() is unfortunately still limited to working with a single core, meaning that a multi-core machine will waste the majority of its compute-time when you run df.apply(myfunc, axis=1).
How can you use all your cores to run apply on a dataframe in parallel?
You may use the swifter package:
pip install swifter
(Note that you may want to use this in a virtualenv to avoid version conflicts with installed dependencies.)
Swifter works as a plugin for pandas, allowing you to reuse the apply function:
import swifter
def some_function(data):
return data * 10
data['out'] = data['in'].swifter.apply(some_function)
It will automatically figure out the most efficient way to parallelize the function, no matter if it's vectorized (as in the above example) or not.
More examples and a performance comparison are available on GitHub. Note that the package is under active development, so the API may change.
Also note that this will not work automatically for string columns. When using strings, Swifter will fallback to a “simple” Pandas apply, which will not be parallel. In this case, even forcing it to use dask will not create performance improvements, and you would be better off just splitting your dataset manually and parallelizing using multiprocessing.
The simplest way is to use Dask's map_partitions. You need these imports (you will need to pip install dask):
import pandas as pd
import dask.dataframe as dd
from dask.multiprocessing import get
and the syntax is
data = <your_pandas_dataframe>
ddata = dd.from_pandas(data, npartitions=30)
def myfunc(x,y,z, ...): return <whatever>
res = ddata.map_partitions(lambda df: df.apply((lambda row: myfunc(*row)), axis=1)).compute(get=get)
(I believe that 30 is a suitable number of partitions if you have 16 cores). Just for completeness, I timed the difference on my machine (16 cores):
data = pd.DataFrame()
data['col1'] = np.random.normal(size = 1500000)
data['col2'] = np.random.normal(size = 1500000)
ddata = dd.from_pandas(data, npartitions=30)
def myfunc(x,y): return y*(x**2+1)
def apply_myfunc_to_DF(df): return df.apply((lambda row: myfunc(*row)), axis=1)
def pandas_apply(): return apply_myfunc_to_DF(data)
def dask_apply(): return ddata.map_partitions(apply_myfunc_to_DF).compute(get=get)
def vectorized(): return myfunc(data['col1'], data['col2'] )
t_pds = timeit.Timer(lambda: pandas_apply())
print(t_pds.timeit(number=1))
28.16970546543598
t_dsk = timeit.Timer(lambda: dask_apply())
print(t_dsk.timeit(number=1))
2.708152851089835
t_vec = timeit.Timer(lambda: vectorized())
print(t_vec.timeit(number=1))
0.010668013244867325
Giving a factor of 10 speedup going from pandas apply to dask apply on partitions. Of course, if you have a function you can vectorize, you should - in this case the function (y*(x**2+1)) is trivially vectorized, but there are plenty of things that are impossible to vectorize.
you can try pandarallel instead: A simple and efficient tool to parallelize your pandas operations on all your CPUs (On Linux & macOS)
Parallelization has a cost (instanciating new processes, sending data via shared memory, etc ...), so parallelization is efficiant only if the amount of calculation to parallelize is high enough. For very little amount of data, using parallezation not always worth it.
Functions applied should NOT be lambda functions.
from pandarallel import pandarallel
from math import sin
pandarallel.initialize()
# FORBIDDEN
df.parallel_apply(lambda x: sin(x**2), axis=1)
# ALLOWED
def func(x):
return sin(x**2)
df.parallel_apply(func, axis=1)
see https://github.com/nalepae/pandarallel
If you want to stay in native python:
import multiprocessing as mp
with mp.Pool(mp.cpu_count()) as pool:
df['newcol'] = pool.map(f, df['col'])
will apply function f in a parallel fashion to column col of dataframe df
Just want to give an update answer for Dask
import dask.dataframe as dd
def your_func(row):
#do something
return row
ddf = dd.from_pandas(df, npartitions=30) # find your own number of partitions
ddf_update = ddf.apply(your_func, axis=1).compute()
On my 100,000 records, without Dask:
CPU times: user 6min 32s, sys: 100 ms, total: 6min 32s
Wall time: 6min 32s
With Dask:
CPU times: user 5.19 s, sys: 784 ms, total: 5.98 s
Wall time: 1min 3s
To use all (physical or logical) cores, you could try mapply as an alternative to swifter and pandarallel.
You can set the amount of cores (and the chunking behaviour) upon init:
import pandas as pd
import mapply
mapply.init(n_workers=-1)
...
df.mapply(myfunc, axis=1)
By default (n_workers=-1), the package uses all physical CPUs available on the system. If your system uses hyper-threading (usually twice the amount of physical CPUs would show up as logical cores), mapply will spawn one extra worker to prioritise the multiprocessing pool over other processes on the system.
Depending on your definition of all your cores, you could also use all logical cores instead (beware that like this the CPU-bound processes will be fighting for physical CPUs, which might slow down your operation):
import multiprocessing
n_workers = multiprocessing.cpu_count()
# or more explicit
import psutil
n_workers = psutil.cpu_count(logical=True)
Here is an example of sklearn base transformer, in which pandas apply is parallelized
import multiprocessing as mp
from sklearn.base import TransformerMixin, BaseEstimator
class ParllelTransformer(BaseEstimator, TransformerMixin):
def __init__(self,
n_jobs=1):
"""
n_jobs - parallel jobs to run
"""
self.variety = variety
self.user_abbrevs = user_abbrevs
self.n_jobs = n_jobs
def fit(self, X, y=None):
return self
def transform(self, X, *_):
X_copy = X.copy()
cores = mp.cpu_count()
partitions = 1
if self.n_jobs <= -1:
partitions = cores
elif self.n_jobs <= 0:
partitions = 1
else:
partitions = min(self.n_jobs, cores)
if partitions == 1:
# transform sequentially
return X_copy.apply(self._transform_one)
# splitting data into batches
data_split = np.array_split(X_copy, partitions)
pool = mp.Pool(cores)
# Here reduce function - concationation of transformed batches
data = pd.concat(
pool.map(self._preprocess_part, data_split)
)
pool.close()
pool.join()
return data
def _transform_part(self, df_part):
return df_part.apply(self._transform_one)
def _transform_one(self, line):
# some kind of transformations here
return line
for more info see https://towardsdatascience.com/4-easy-steps-to-improve-your-machine-learning-code-performance-88a0b0eeffa8
The native Python solution (with numpy) that can be applied on the whole DataFrame as the original question asks (not only on a single column)
import numpy as np
import multiprocessing as mp
dfs = np.array_split(df, 8000) # divide the dataframe as desired
def f_app(df):
return df.apply(myfunc, axis=1)
with mp.Pool(mp.cpu_count()) as pool:
res = pd.concat(pool.map(f_app, dfs))
Here another one using Joblib and some helper code from scikit-learn. Lightweight (if you already have scikit-learn), good if you prefer more control over what it is doing since joblib is easily hackable.
from joblib import parallel_backend, Parallel, delayed, effective_n_jobs
from sklearn.utils import gen_even_slices
from sklearn.utils.validation import _num_samples
def parallel_apply(df, func, n_jobs= -1, **kwargs):
""" Pandas apply in parallel using joblib.
Uses sklearn.utils to partition input evenly.
Args:
df: Pandas DataFrame, Series, or any other object that supports slicing and apply.
func: Callable to apply
n_jobs: Desired number of workers. Default value -1 means use all available cores.
**kwargs: Any additional parameters will be supplied to the apply function
Returns:
Same as for normal Pandas DataFrame.apply()
"""
if effective_n_jobs(n_jobs) == 1:
return df.apply(func, **kwargs)
else:
ret = Parallel(n_jobs=n_jobs)(
delayed(type(df).apply)(df[s], func, **kwargs)
for s in gen_even_slices(_num_samples(df), effective_n_jobs(n_jobs)))
return pd.concat(ret)
Usage: result = parallel_apply(my_dataframe, my_func)
Instead of
df["new"] = df["old"].map(fun)
do
from joblib import Parallel, delayed
df["new"] = Parallel(n_jobs=-1, verbose=10)(delayed(fun)(i) for i in df["old"])
To me this is a slight improvement over
import multiprocessing as mp
with mp.Pool(mp.cpu_count()) as pool:
df["new"] = pool.map(fun, df["old"])
as you get a progress indication and automatic batching if the jobs are very small.
Since the question was "How can you use all your cores to run apply on a dataframe in parallel?", the answer can also be with modin. You can run all cores in parallel, though the real time is worse.
See https://github.com/modin-project/modin . It runs of top of dask or ray. They say "Modin is a DataFrame designed for datasets from 1MB to 1TB+." I tried: pip3 install "modin"[ray]". Modin vs pandas was - 12 sec on six cores vs. 6 sec.
In case you need to do something based on the column name inside the function beware that .apply function may give you some trouble. In my case I needed to change the column type using astype() function based on the column name. This is probably not the most efficient way of doing it but suffices the purpose and keeps the column names as the original one.
import multiprocessing as mp
def f(df):
""" the function that you want to apply to each column """
column_name = df.columns[0] # this is the same as the original column name
# do something what you need to do to that column
return df
# Here I just make a list of all the columns. If you don't use .to_frame()
# it will pass series type instead of a dataframe
dfs = [df[column].to_frame() for column in df.columns]
with mp.Pool(mp.cpu_num) as pool:
processed_df = pd.concat(pool.map(f, dfs), axis=1)