Dask still Slower than Pandas on Large Dataset 3.2 Go - pandas

I am currently Trying Dask locally (parallel processing) for the first Time on a large Dataset (3.2 Go). I am comparing Dasks speed with pandas on simple computations. Using Dask seems to result in slower execution time in any task beside reading and transforming data.
example:
#pandas code
import numpy as np
import pandas as pd
import time
T=pd.read_csv("transactions_train.csv")
Data reading is slow, it takes about 1.5 minutes to read the data.
After trying simple Computations
%%time
T.price.mean()
this executes in about 3 seconds
as for dask:
from dask.distributed import Client, progress,LocalCluster
client = Client()
client
import dask.dataframe as dd
DT = dd.read_csv("transactions_train.csv")
executed in 0.551 seconds
%%time
DT.price.mean().compute()
Takes 25 seconds to run this.
It gets worse for heavy computation like modelling.
Any help would be appreciated as I am new to dask and not sure if I am not using it right.
My pc has 4 cores

Avoid calling compute repeatedly:
For example for these simple operations, do something like this
xmin, xmax = dask.compute(df.x.min(), df.x.max())

Related

Enhancing performance of pandas groupby&apply

These days I've been stucked in problem of speeding up groupby&apply,Here is code:
dat = dat.groupby(['glass_id','label','step'])['equip'].apply(lambda x:'_'.join(sorted(list(x)))).reset_index()
which cost large time when data size grows.
I've try to change the groupby&apply to for type which didn't work;
then I tried to use unique() but still fail to speed up the running time.
I wanna a update code for less run-time,and gonna be very appreciate if there is a solvement to this problem
I think you can consider to use multiprocessing
Check the following example
import multiprocessing
import numpy as np
# The function which you use in conjunction with multiprocessing
def loop_many(sub_df):
grouped_by_KEY_SEQ_and_count=sub_df.groupby(['KEY_SEQ']).agg('count')
return grouped_by_KEY_SEQ_and_count
# You will use 6 processes (which is configurable) to process dataframe in parallel
NUMBER_OF_PROCESSES=6
pool = multiprocessing.Pool(processes=NUMBER_OF_PROCESSES)
# Split dataframe into 6 sub-dataframes
df_split = np.array_split(pre_sale, NUMBER_OF_PROCESSES)
# Process split sub-dataframes by loop_many() on multiple processes
processed_sub_dataframes=pool.map(loop_many,df_split)
# Close multiprocessing pool
pool.close()
pool.join()
concatenated_sub_dataframes=pd.concat(processed_sub_dataframes).reset_index()

Is it possible to use Series.str.extract with Dask?

I'm currently processing a large dataset with Pandas and I have to extract some data using pandas.Series.str.extract.
It looks like this:
df['output_col'] = df['input_col'].str.extract(r'.*"mytag": "(.*?)"', expand=False).str.upper()
It works well, however, as it has to be done about ten times (using various source columns) the performance aren't very good. To improve the performance by using several cores, I wanted to try Dask but it doesn't seem to be supported (I cannot find any reference to an extract method in the dask's documentation).
Is there any way to performance such Pandas action in parallel?
I have found this method where you basically split your dataframe into multiple ones, create a process per subframes and then concatenate them back.
You should be able to do this like in pandas. It's mentioned in this segment of the documentation, but it might be valuable to expand it.
import pandas as pd
import dask.dataframe as dd
​
s = pd.Series(["example", "strings", "are useful"])
ds = dd.from_pandas(s, 2)
ds.str.extract("[a-z\s]{4}(.{2})", expand=False).str.upper().compute()
0 PL
1 NG
2 US
dtype: object
Your best bet is to use map_partitions, which enables you to perform general pandas operations to the parts of your series, and acts like a managed version of the multiprocessing method you linked.
def inner(df):
df['output_col'] = df['input_col'].str.extract(
r'.*"mytag": "(.*?)"', expand=False).str.upper()
return df
out = df.map_partitions(inner)
Since this is a string operation, you probably want processes (e.g., the distributed scheduler) rather than threads. Note, that your performance will be far better if you load your data using dask (e.g., dd.read_csv) rather than create the dataframe in memory and then pass it to dask.

How to parallelize groupby() in dask?

I tried:
df.groupby('name').agg('count').compute(num_workers=1)
df.groupby('name').agg('count').compute(num_workers=4)
They take the same time, why num_workers does not work?
Thanks
By default, Dask will work with multi-threaded tasks which means it uses a single processor on your computer. (Note that using dask is nevertheless interesting if you have data that can't fit in memory)
If you want to use several processors to compute your operation, you have to use a different scheduler:
from dask import dataframe as dd
from dask.distributed import LocalCluster, Client
df = dd.read_csv("data.csv")
def group(num_workers):
start = time.time()
res = df.groupby("name").agg("count").compute(num_workers=num_workers)
end = time.time()
return res, end-start
print(group(4))
clust = LocalCluster()
clt = Client(clust, set_as_default=True)
print(group(4))
Here, I create a local cluster using 4 parallel processes (because I have a quadcore) and then set a default scheduling client that will use this local cluster to perform the Dask operations. With a CSV two columns file of 1.5 Gb, the standard groupby takes around 35 seconds on my laptop whereas the multiprocess one only takes around 22 seconds.

pandas "isin" is much slower than numpy "in1d"

There is a huge difference between pandas "isin" and numpy "in1d" from the efficiency aspect. After some research I've noticed that the type of the data and the values that passed as parameter to the "in" method has huge impact on the run time. Anyway it looks that numpy implementation suffer much less from this problem.
What's going on here?
import timeit
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,(10**6),dtype='int8'),columns=['A'])
vals = np.array([5,7],dtype='int64')
f = lambda: df['A'].isin(vals)
g = lambda: pd.np.in1d(df['A'],vals)
print 'pandas:', timeit.timeit(stmt='f()',setup='from __main__ import f',number=10)/10
print 'numpy :', timeit.timeit(stmt='g()',setup='from __main__ import g',number=10)/10
>>
**pandas: 0.0541711091995
numpy : 0.000645089149475**
Numpy and Pandas use different algorithms for isin. For some cases numpy's version is faster and for some pandas'. For your test case numpy seems to be faster.
Pandas' version has however a better asymptotic running time, in will win for bigger datasets.
Let's assume that there are n elements in the data-series (df in your example) and m elements in the query (vals in your example).
Usually, Numpy's algorithm does the following:
Use np.unique(..) to find all unique elements in the series. Thus is done via sorting, i.e. O(n*log(n)), there might be N<=n unique elements.
For every element use binary search to look up whether element is in the series, i.e. O(m*log(N)) in overall.
Which leads to overall running time of O(n*log(n) + m*log(N)).
There are some hard-coded optimizations in place for the cases, when vals only few elements and for this cases numpy really shines.
Pandas does something different:
Populates a hash-map (wrapped khash-functionality) in order to find all unique elements, which takes O(n).
Looks-up in the hash map in O(1) for every query, i.e. O(m) in overall.
I overall, running time is O(n)+O(m), which is much better than Numpy's.
However, for smaller inputs, constant factors and not the asymptotic behavior is that what counts and it is just way better for Numpy. There are also other consideration, like memory consumption (which is higher for Pandas) which might play a role.
But if we take a bigger query set, the situation is completely different:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,(10**6),dtype='int8'),columns=['A'])
vals = np.array([5,7],dtype='int64')
vals2 = np.random.randint(0,10,(10**6),dtype='int64')
And now:
%timeit df['A'].isin(vals) # 17.0 ms
%timeit df['A'].isin(vals2) # 16.8 ms
%timeit pd.np.in1d(df['A'],vals) # 1.36
%timeit pd.np.in1d(df['A'],vals2) # 82.9 ms
Numpy is really losing ground as long as there are more queries. It can also be seen, that building of the hash-map is the bottleneck for Pandas and not the queries.
In the end it doesn't make much sense (even if I just did!) to evaluate the performance for only one input size - it should be done for a range of input sizes - there are some surprises to be discovered!
E.g. fun fact: if you would take
df = pd.DataFrame(np.random.randint(0,10,(10**6+1), dtype='int8'),columns=['A'])
i.e. 10^6+1 instead of 10^6, pandas would fall back to numpy's algorithm (which is not clever in my opinion) and would become better for small inputs but worse for big:
%timeit df['A'].isin(vals) # 6ms was 17.0 ms
%timeit df['A'].isin(vals2) # 100ms was 16.8 ms

convert dask dataframe to dataframe is too slow, it does not save time when using it parallel process

import pandas as pd
import dask.dataframe as dd
import time
import warnings
warnings.simplefilter('ignore')
data['x'] = range(1000)
data['y'] = range(1000)
def add(s):
s['sum'] = s['x']+s['y']
return s
start = time.time()
n_data = data.apply(add, axis=1)
print('it cost time is {} sec'.format(time.time()-start))
start = time.time()
d_data = dd.from_pandas(data, npartitions=10)
s_data = d_data.apply(add, axis=1)
print('it cost time is {} sec'.format(time.time()-start))
start = time.time()
s_data = s_data.compute()
print('but transform it cost time is {} sec'.format(time.time()-start))
the result is :
it cost time is 1.0297248363494873 sec
it cost time is 0.008629083633422852 sec
but transform it cost time is 1.3664238452911377 sec
Pandas apply is slow. Because you operate row-by-row with a Python function it has to use Python for loops rather than C for loops.
Dask dataframe's default scheduler uses threads, which are usually very good for fast vectorized Pandas operations, but won't help for slow Pandas operations that are bound by Python code. You could consider trying the multiprocessing or distributed schedulers. See http://docs.dask.org/en/latest/scheduling.html
However, I encourage you use Pandas better before you try Dask. Probably using fast Pandas APIs can accelerate your computation much more than Dask can.