How to parallelize groupby() in dask? - pandas

I tried:
df.groupby('name').agg('count').compute(num_workers=1)
df.groupby('name').agg('count').compute(num_workers=4)
They take the same time, why num_workers does not work?
Thanks

By default, Dask will work with multi-threaded tasks which means it uses a single processor on your computer. (Note that using dask is nevertheless interesting if you have data that can't fit in memory)
If you want to use several processors to compute your operation, you have to use a different scheduler:
from dask import dataframe as dd
from dask.distributed import LocalCluster, Client
df = dd.read_csv("data.csv")
def group(num_workers):
start = time.time()
res = df.groupby("name").agg("count").compute(num_workers=num_workers)
end = time.time()
return res, end-start
print(group(4))
clust = LocalCluster()
clt = Client(clust, set_as_default=True)
print(group(4))
Here, I create a local cluster using 4 parallel processes (because I have a quadcore) and then set a default scheduling client that will use this local cluster to perform the Dask operations. With a CSV two columns file of 1.5 Gb, the standard groupby takes around 35 seconds on my laptop whereas the multiprocess one only takes around 22 seconds.

Related

Can I apply a function to a pandas dataframe asyncronously?

In my use case I need to fetch data from remote server. The code is roughly equivalent to:
def get_user_data(user_id):
time.sleep(5)
...
return data
df = pd.DataFrame({'user_id': ['uid1', 'uid2', 'uid3', ..., 'uid9999']})
answer = df['user_id'].apply(get_user_data)
It seems to me pandas could be running the get_user_data function asynchronously.
Note:
I've tried df['user_id'].swifter.apply(get_user_data) and using dask. They both give me an good speedup by running multiple functions in parallel, but my CPU, network, and remote server utilization remain very low.
Is there a way to do an asynchronous .apply() ?

Enhancing performance of pandas groupby&apply

These days I've been stucked in problem of speeding up groupby&apply,Here is code:
dat = dat.groupby(['glass_id','label','step'])['equip'].apply(lambda x:'_'.join(sorted(list(x)))).reset_index()
which cost large time when data size grows.
I've try to change the groupby&apply to for type which didn't work;
then I tried to use unique() but still fail to speed up the running time.
I wanna a update code for less run-time,and gonna be very appreciate if there is a solvement to this problem
I think you can consider to use multiprocessing
Check the following example
import multiprocessing
import numpy as np
# The function which you use in conjunction with multiprocessing
def loop_many(sub_df):
grouped_by_KEY_SEQ_and_count=sub_df.groupby(['KEY_SEQ']).agg('count')
return grouped_by_KEY_SEQ_and_count
# You will use 6 processes (which is configurable) to process dataframe in parallel
NUMBER_OF_PROCESSES=6
pool = multiprocessing.Pool(processes=NUMBER_OF_PROCESSES)
# Split dataframe into 6 sub-dataframes
df_split = np.array_split(pre_sale, NUMBER_OF_PROCESSES)
# Process split sub-dataframes by loop_many() on multiple processes
processed_sub_dataframes=pool.map(loop_many,df_split)
# Close multiprocessing pool
pool.close()
pool.join()
concatenated_sub_dataframes=pd.concat(processed_sub_dataframes).reset_index()

What is the fastest way to return one row from a big pyspark dataframe or koalas dataframe in databricks?

I have a big dataframe(20 Million rows, 35 columns) in koalas on a databricks notebook. I have performed some transform and join(merge) operations on it using python such as:
mdf.path_info = mdf.path_info.transform(modify_path_info)
x = mdf[['providerid','domain_name']].groupby(['providerid']).apply(domain_features)
mdf = ks.merge( mdf, x[['domain_namex','domain_name_grouped']], left_index=True, right_index=True)
x = mdf.groupby(['providerid','uid']).apply(userspecificdetails)
mmdf = mdf.merge(x[['providerid','uid',"date_last_purch","lifetime_value","age"]], how="left", on=['providerid','uid'])
After these operations, I want to display some rows of the dataframe to verify the resultant dataframe. I am trying to print/display as little as 1-5 rows of this big dataframe but because of spark's nature of lazy evaluation, all the print commands starts 6-12 spark jobs and runs forever after which cluster goes to an unusable state and then nothing happens.
mdf.head()
display(mdf)
mdf.take([1])
mdf.iloc[0]
also tried converting into a spark dataframe and then trying:
df = mdf.to_spark()
df.show(1)
df.rdd.takeSample(False, 1, seed=0)
df.first()
The cluster configuration I am using is 8worker_4core_8gb, meaning each worker and driver node is 8.0 GB Memory, 4 Cores, 0.5 DBU on the Databricks Runtime Version: 7.0 (includes Apache Spark 3.0.0, Scala 2.12)
Can someone please help by suggesting a faster rather fastest way to get/print one row of the big dataframe and which does not wait to process the whole 20Million rows of the dataframe.
As you write because of lazy evaluation, Spark will perform your transformations first and then show the one line. What you can do is reduce the size of your input data, and do the transformations on a much smaller dataset e.g.:
https://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample
df.sample(False, 0.1, seed=0)
You could cache the computation result after you convert to spark dataframe and then call the action.
df = mdf.to_spark()
# caches the result so the action called after this will use this cached
# result instead of re-computing the DAG
df.cache()
df.show(1)
You may want to free up the memory used for caching with:
df.unpersist()

Make Pandas DataFrame apply() use all cores?

As of August 2017, Pandas DataFame.apply() is unfortunately still limited to working with a single core, meaning that a multi-core machine will waste the majority of its compute-time when you run df.apply(myfunc, axis=1).
How can you use all your cores to run apply on a dataframe in parallel?
You may use the swifter package:
pip install swifter
(Note that you may want to use this in a virtualenv to avoid version conflicts with installed dependencies.)
Swifter works as a plugin for pandas, allowing you to reuse the apply function:
import swifter
def some_function(data):
return data * 10
data['out'] = data['in'].swifter.apply(some_function)
It will automatically figure out the most efficient way to parallelize the function, no matter if it's vectorized (as in the above example) or not.
More examples and a performance comparison are available on GitHub. Note that the package is under active development, so the API may change.
Also note that this will not work automatically for string columns. When using strings, Swifter will fallback to a “simple” Pandas apply, which will not be parallel. In this case, even forcing it to use dask will not create performance improvements, and you would be better off just splitting your dataset manually and parallelizing using multiprocessing.
The simplest way is to use Dask's map_partitions. You need these imports (you will need to pip install dask):
import pandas as pd
import dask.dataframe as dd
from dask.multiprocessing import get
and the syntax is
data = <your_pandas_dataframe>
ddata = dd.from_pandas(data, npartitions=30)
def myfunc(x,y,z, ...): return <whatever>
res = ddata.map_partitions(lambda df: df.apply((lambda row: myfunc(*row)), axis=1)).compute(get=get)
(I believe that 30 is a suitable number of partitions if you have 16 cores). Just for completeness, I timed the difference on my machine (16 cores):
data = pd.DataFrame()
data['col1'] = np.random.normal(size = 1500000)
data['col2'] = np.random.normal(size = 1500000)
ddata = dd.from_pandas(data, npartitions=30)
def myfunc(x,y): return y*(x**2+1)
def apply_myfunc_to_DF(df): return df.apply((lambda row: myfunc(*row)), axis=1)
def pandas_apply(): return apply_myfunc_to_DF(data)
def dask_apply(): return ddata.map_partitions(apply_myfunc_to_DF).compute(get=get)
def vectorized(): return myfunc(data['col1'], data['col2'] )
t_pds = timeit.Timer(lambda: pandas_apply())
print(t_pds.timeit(number=1))
28.16970546543598
t_dsk = timeit.Timer(lambda: dask_apply())
print(t_dsk.timeit(number=1))
2.708152851089835
t_vec = timeit.Timer(lambda: vectorized())
print(t_vec.timeit(number=1))
0.010668013244867325
Giving a factor of 10 speedup going from pandas apply to dask apply on partitions. Of course, if you have a function you can vectorize, you should - in this case the function (y*(x**2+1)) is trivially vectorized, but there are plenty of things that are impossible to vectorize.
you can try pandarallel instead: A simple and efficient tool to parallelize your pandas operations on all your CPUs (On Linux & macOS)
Parallelization has a cost (instanciating new processes, sending data via shared memory, etc ...), so parallelization is efficiant only if the amount of calculation to parallelize is high enough. For very little amount of data, using parallezation not always worth it.
Functions applied should NOT be lambda functions.
from pandarallel import pandarallel
from math import sin
pandarallel.initialize()
# FORBIDDEN
df.parallel_apply(lambda x: sin(x**2), axis=1)
# ALLOWED
def func(x):
return sin(x**2)
df.parallel_apply(func, axis=1)
see https://github.com/nalepae/pandarallel
If you want to stay in native python:
import multiprocessing as mp
with mp.Pool(mp.cpu_count()) as pool:
df['newcol'] = pool.map(f, df['col'])
will apply function f in a parallel fashion to column col of dataframe df
Just want to give an update answer for Dask
import dask.dataframe as dd
def your_func(row):
#do something
return row
ddf = dd.from_pandas(df, npartitions=30) # find your own number of partitions
ddf_update = ddf.apply(your_func, axis=1).compute()
On my 100,000 records, without Dask:
CPU times: user 6min 32s, sys: 100 ms, total: 6min 32s
Wall time: 6min 32s
With Dask:
CPU times: user 5.19 s, sys: 784 ms, total: 5.98 s
Wall time: 1min 3s
To use all (physical or logical) cores, you could try mapply as an alternative to swifter and pandarallel.
You can set the amount of cores (and the chunking behaviour) upon init:
import pandas as pd
import mapply
mapply.init(n_workers=-1)
...
df.mapply(myfunc, axis=1)
By default (n_workers=-1), the package uses all physical CPUs available on the system. If your system uses hyper-threading (usually twice the amount of physical CPUs would show up as logical cores), mapply will spawn one extra worker to prioritise the multiprocessing pool over other processes on the system.
Depending on your definition of all your cores, you could also use all logical cores instead (beware that like this the CPU-bound processes will be fighting for physical CPUs, which might slow down your operation):
import multiprocessing
n_workers = multiprocessing.cpu_count()
# or more explicit
import psutil
n_workers = psutil.cpu_count(logical=True)
Here is an example of sklearn base transformer, in which pandas apply is parallelized
import multiprocessing as mp
from sklearn.base import TransformerMixin, BaseEstimator
class ParllelTransformer(BaseEstimator, TransformerMixin):
def __init__(self,
n_jobs=1):
"""
n_jobs - parallel jobs to run
"""
self.variety = variety
self.user_abbrevs = user_abbrevs
self.n_jobs = n_jobs
def fit(self, X, y=None):
return self
def transform(self, X, *_):
X_copy = X.copy()
cores = mp.cpu_count()
partitions = 1
if self.n_jobs <= -1:
partitions = cores
elif self.n_jobs <= 0:
partitions = 1
else:
partitions = min(self.n_jobs, cores)
if partitions == 1:
# transform sequentially
return X_copy.apply(self._transform_one)
# splitting data into batches
data_split = np.array_split(X_copy, partitions)
pool = mp.Pool(cores)
# Here reduce function - concationation of transformed batches
data = pd.concat(
pool.map(self._preprocess_part, data_split)
)
pool.close()
pool.join()
return data
def _transform_part(self, df_part):
return df_part.apply(self._transform_one)
def _transform_one(self, line):
# some kind of transformations here
return line
for more info see https://towardsdatascience.com/4-easy-steps-to-improve-your-machine-learning-code-performance-88a0b0eeffa8
The native Python solution (with numpy) that can be applied on the whole DataFrame as the original question asks (not only on a single column)
import numpy as np
import multiprocessing as mp
dfs = np.array_split(df, 8000) # divide the dataframe as desired
def f_app(df):
return df.apply(myfunc, axis=1)
with mp.Pool(mp.cpu_count()) as pool:
res = pd.concat(pool.map(f_app, dfs))
Here another one using Joblib and some helper code from scikit-learn. Lightweight (if you already have scikit-learn), good if you prefer more control over what it is doing since joblib is easily hackable.
from joblib import parallel_backend, Parallel, delayed, effective_n_jobs
from sklearn.utils import gen_even_slices
from sklearn.utils.validation import _num_samples
def parallel_apply(df, func, n_jobs= -1, **kwargs):
""" Pandas apply in parallel using joblib.
Uses sklearn.utils to partition input evenly.
Args:
df: Pandas DataFrame, Series, or any other object that supports slicing and apply.
func: Callable to apply
n_jobs: Desired number of workers. Default value -1 means use all available cores.
**kwargs: Any additional parameters will be supplied to the apply function
Returns:
Same as for normal Pandas DataFrame.apply()
"""
if effective_n_jobs(n_jobs) == 1:
return df.apply(func, **kwargs)
else:
ret = Parallel(n_jobs=n_jobs)(
delayed(type(df).apply)(df[s], func, **kwargs)
for s in gen_even_slices(_num_samples(df), effective_n_jobs(n_jobs)))
return pd.concat(ret)
Usage: result = parallel_apply(my_dataframe, my_func)
Instead of
df["new"] = df["old"].map(fun)
do
from joblib import Parallel, delayed
df["new"] = Parallel(n_jobs=-1, verbose=10)(delayed(fun)(i) for i in df["old"])
To me this is a slight improvement over
import multiprocessing as mp
with mp.Pool(mp.cpu_count()) as pool:
df["new"] = pool.map(fun, df["old"])
as you get a progress indication and automatic batching if the jobs are very small.
Since the question was "How can you use all your cores to run apply on a dataframe in parallel?", the answer can also be with modin. You can run all cores in parallel, though the real time is worse.
See https://github.com/modin-project/modin . It runs of top of dask or ray. They say "Modin is a DataFrame designed for datasets from 1MB to 1TB+." I tried: pip3 install "modin"[ray]". Modin vs pandas was - 12 sec on six cores vs. 6 sec.
In case you need to do something based on the column name inside the function beware that .apply function may give you some trouble. In my case I needed to change the column type using astype() function based on the column name. This is probably not the most efficient way of doing it but suffices the purpose and keeps the column names as the original one.
import multiprocessing as mp
def f(df):
""" the function that you want to apply to each column """
column_name = df.columns[0] # this is the same as the original column name
# do something what you need to do to that column
return df
# Here I just make a list of all the columns. If you don't use .to_frame()
# it will pass series type instead of a dataframe
dfs = [df[column].to_frame() for column in df.columns]
with mp.Pool(mp.cpu_num) as pool:
processed_df = pd.concat(pool.map(f, dfs), axis=1)

Parallelizing apply function in pandas taking longer than expected

I have a simple cleaner function which removes special characters from a dataframe (and other preprocessing stuff). My dataset is huge and I want to make use of multiprocessing to improve performance. My idea was to break the dataset into chunks and run this cleaner function in parallel on each of them.
I used dask library and also the multiprocessing module from python. However, it seems like the application is stuck and is taking longer than running with a single core.
This is my code:
from multiprocessing import Pool
def parallelize_dataframe(df, func):
df_split = np.array_split(df, num_partitions)
pool = Pool(num_cores)
df = pd.concat(pool.map(func, df_split))
pool.close()
pool.join()
return df
def process_columns(data):
for i in data.columns:
data[i] = data[i].apply(cleaner_func)
return data
mydf2 = parallelize_dataframe(mydf, process_columns)
I can see from the resource monitor that all cores are being used, but as I said before, the application is stuck.
P.S.
I ran this on windows server 2012 (where the issue happens). Running this code on unix env, I was actually able to see some benefit from the multiprocessing library.
Thanks in advance.