Monitor progress of dd.DataFrame.apply - dataframe

How can I monitor the progress of a row-wise Dask DataFrame apply operation?
Wrapping the call in ProgressBar() doesn't seem to do anything: nothing is printed to the console.
from dask.diagnostics import ProgressBar
with ProgressBar():
    df_calc = ddf.apply(myfunc, axis=1)

Dask operations are lazy by default. Computation only happens when you call compute or persist, so a ProgressBar wrapped around the lazy apply call has nothing to display.
df = dd.read_csv(...) # This lazily builds up a computation
df = df[df.name == 'alice'] # This lazily builds up a computation
result = df.amount.sum() # This lazily builds up a computation
result = result.compute() # This triggers actual work

Related

(Dask) How to distribute expensive resource needed for computation?

What is the best way to distribute a task across a dataset that uses a relatively expensive-to-create resource or object for the computation.
# in pandas
df = pd.read_csv(...)
foo = Foo() # expensive initialization.
result = df.apply(lambda x: foo.do(x))
# in dask?
# is it possible to scatter the foo to the workers?
client.scatter(...
I plan on using this with dask_jobqueue with SGECluster.
foo = dask.delayed(Foo)() # create your expensive thing on the workers instead of locally
def do(row, foo):
    return foo.do(row)
df.apply(do, foo=foo) # include it as an explicit argument, not a closure within a lambda

Make Pandas DataFrame apply() use all cores?

As of August 2017, Pandas DataFrame.apply() is unfortunately still limited to working with a single core, meaning that a multi-core machine will waste the majority of its compute-time when you run df.apply(myfunc, axis=1).
How can you use all your cores to run apply on a dataframe in parallel?
You may use the swifter package:
pip install swifter
(Note that you may want to use this in a virtualenv to avoid version conflicts with installed dependencies.)
Swifter works as a plugin for pandas, allowing you to reuse the apply function:
import swifter
def some_function(data):
    return data * 10
data['out'] = data['in'].swifter.apply(some_function)
It will automatically figure out the most efficient way to parallelize the function, no matter if it's vectorized (as in the above example) or not.
More examples and a performance comparison are available on GitHub. Note that the package is under active development, so the API may change.
Also note that this will not work automatically for string columns. When using strings, Swifter will fall back to a “simple” Pandas apply, which will not be parallel. In this case, even forcing it to use dask will not improve performance, and you would be better off splitting your dataset manually and parallelizing with multiprocessing.
The simplest way is to use Dask's map_partitions. You need these imports (you will need to pip install dask):
import pandas as pd
import dask.dataframe as dd
and the syntax is
data = <your_pandas_dataframe>
ddata = dd.from_pandas(data, npartitions=30)
def myfunc(x,y,z, ...): return <whatever>
# scheduler='processes' selects the multiprocessing scheduler (older Dask releases spelled this compute(get=get))
res = ddata.map_partitions(lambda df: df.apply((lambda row: myfunc(*row)), axis=1)).compute(scheduler='processes')
(I believe that 30 is a suitable number of partitions if you have 16 cores). Just for completeness, I timed the difference on my machine (16 cores):
import timeit
import numpy as np
data = pd.DataFrame()
data['col1'] = np.random.normal(size = 1500000)
data['col2'] = np.random.normal(size = 1500000)
ddata = dd.from_pandas(data, npartitions=30)
def myfunc(x,y): return y*(x**2+1)
def apply_myfunc_to_DF(df): return df.apply((lambda row: myfunc(*row)), axis=1)
def pandas_apply(): return apply_myfunc_to_DF(data)
def dask_apply(): return ddata.map_partitions(apply_myfunc_to_DF).compute(scheduler='processes')
def vectorized(): return myfunc(data['col1'], data['col2'] )
t_pds = timeit.Timer(lambda: pandas_apply())
print(t_pds.timeit(number=1))
28.16970546543598
t_dsk = timeit.Timer(lambda: dask_apply())
print(t_dsk.timeit(number=1))
2.708152851089835
t_vec = timeit.Timer(lambda: vectorized())
print(t_vec.timeit(number=1))
0.010668013244867325
Giving a factor of 10 speedup going from pandas apply to dask apply on partitions. Of course, if you have a function you can vectorize, you should - in this case the function (y*(x**2+1)) is trivially vectorized, but there are plenty of things that are impossible to vectorize.
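To make the vectorization remark concrete, here is a small sketch showing that the row-wise apply and the vectorized call produce the same values for this function. The sample size and the fixed random seed are assumptions chosen for illustration; no timings are claimed.

```python
import numpy as np
import pandas as pd

def myfunc(x, y):
    return y * (x**2 + 1)

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is repeatable
data = pd.DataFrame({
    "col1": rng.normal(size=1000),
    "col2": rng.normal(size=1000),
})

# Row-wise: myfunc is called once per row with two scalars.
row_wise = data.apply(lambda row: myfunc(*row), axis=1)

# Vectorized: myfunc is called once with two whole columns.
vectorized = myfunc(data["col1"], data["col2"])

same = np.allclose(row_wise, vectorized)
print(same)
```

Because myfunc only uses arithmetic operators, it accepts whole columns unchanged; functions with Python-level branching or string handling typically do not.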
You can try pandarallel instead: a simple and efficient tool to parallelize your pandas operations on all your CPUs (on Linux & macOS).
Parallelization has a cost (instantiating new processes, sending data via shared memory, etc.), so it is efficient only if the amount of computation to parallelize is high enough. For very small amounts of data, parallelization is not always worth it.
Functions applied should NOT be lambda functions.
from pandarallel import pandarallel
from math import sin
pandarallel.initialize()
# FORBIDDEN
df.parallel_apply(lambda x: sin(x**2), axis=1)
# ALLOWED
def func(x):
    return sin(x**2)
df.parallel_apply(func, axis=1)
see https://github.com/nalepae/pandarallel
If you want to stay in native python:
import multiprocessing as mp
with mp.Pool(mp.cpu_count()) as pool:
    df['newcol'] = pool.map(f, df['col'])
This applies the function f in parallel to column col of dataframe df.
I just want to give an updated answer for Dask:
import dask.dataframe as dd
def your_func(row):
    # do something
    return row
ddf = dd.from_pandas(df, npartitions=30) # find your own number of partitions
ddf_update = ddf.apply(your_func, axis=1).compute()
On my 100,000 records, without Dask:
CPU times: user 6min 32s, sys: 100 ms, total: 6min 32s
Wall time: 6min 32s
With Dask:
CPU times: user 5.19 s, sys: 784 ms, total: 5.98 s
Wall time: 1min 3s
To use all (physical or logical) cores, you could try mapply as an alternative to swifter and pandarallel.
You can set the amount of cores (and the chunking behaviour) upon init:
import pandas as pd
import mapply
mapply.init(n_workers=-1)
...
df.mapply(myfunc, axis=1)
By default (n_workers=-1), the package uses all physical CPUs available on the system. If your system uses hyper-threading (usually twice the amount of physical CPUs would show up as logical cores), mapply will spawn one extra worker to prioritise the multiprocessing pool over other processes on the system.
Depending on your definition of all your cores, you could also use all logical cores instead (beware that like this the CPU-bound processes will be fighting for physical CPUs, which might slow down your operation):
import multiprocessing
n_workers = multiprocessing.cpu_count()
# or more explicit
import psutil
n_workers = psutil.cpu_count(logical=True)
Here is an example of a scikit-learn base transformer in which pandas apply is parallelized:
import multiprocessing as mp
import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator
class ParallelTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, n_jobs=1):
        """
        n_jobs - parallel jobs to run
        """
        self.n_jobs = n_jobs
    def fit(self, X, y=None):
        return self
    def transform(self, X, *_):
        X_copy = X.copy()
        cores = mp.cpu_count()
        partitions = 1
        if self.n_jobs <= -1:
            partitions = cores
        elif self.n_jobs <= 0:
            partitions = 1
        else:
            partitions = min(self.n_jobs, cores)
        if partitions == 1:
            # transform sequentially
            return X_copy.apply(self._transform_one)
        # split the data into batches
        data_split = np.array_split(X_copy, partitions)
        # one worker per batch, so n_jobs also limits the pool size
        pool = mp.Pool(partitions)
        # reduce step: concatenate the transformed batches
        data = pd.concat(
            pool.map(self._transform_part, data_split)
        )
        pool.close()
        pool.join()
        return data
    def _transform_part(self, df_part):
        return df_part.apply(self._transform_one)
    def _transform_one(self, line):
        # some kind of transformations here
        return line
for more info see https://towardsdatascience.com/4-easy-steps-to-improve-your-machine-learning-code-performance-88a0b0eeffa8
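The branch that turns n_jobs into a partition count in the transformer above can be read in isolation. The sketch below pulls it into a standalone function; the name resolve_partitions and the explicit cores argument are hypothetical and exist only so the mapping can be checked without a real machine.

```python
def resolve_partitions(n_jobs, cores):
    """Map an n_jobs setting to a number of partitions (hypothetical helper)."""
    if n_jobs <= -1:
        return cores               # negative values: use every core
    if n_jobs == 0:
        return 1                   # zero: fall back to sequential processing
    return min(n_jobs, cores)      # positive values: capped at the core count

# Example mapping for an assumed 8-core machine.
examples = {n: resolve_partitions(n, cores=8) for n in (-1, 0, 1, 4, 16)}
print(examples)
```

A result of 1 corresponds to the sequential path in transform; anything larger leads to the split-and-pool path.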
The native Python solution (with numpy) that can be applied to the whole DataFrame, as the original question asks (not only to a single column):
import numpy as np
import pandas as pd
import multiprocessing as mp
dfs = np.array_split(df, 8000) # divide the dataframe as desired
def f_app(df):
    return df.apply(myfunc, axis=1)
with mp.Pool(mp.cpu_count()) as pool:
    res = pd.concat(pool.map(f_app, dfs))
Here is another one using Joblib and some helper code from scikit-learn. It is lightweight (if you already have scikit-learn) and a good fit if you prefer more control over what it is doing, since joblib is easily hackable.
import pandas as pd
from joblib import parallel_backend, Parallel, delayed, effective_n_jobs
from sklearn.utils import gen_even_slices
from sklearn.utils.validation import _num_samples
def parallel_apply(df, func, n_jobs= -1, **kwargs):
    """ Pandas apply in parallel using joblib.
    Uses sklearn.utils to partition input evenly.
    Args:
        df: Pandas DataFrame, Series, or any other object that supports slicing and apply.
        func: Callable to apply
        n_jobs: Desired number of workers. Default value -1 means use all available cores.
        **kwargs: Any additional parameters will be supplied to the apply function
    Returns:
        Same as for normal Pandas DataFrame.apply()
    """
    if effective_n_jobs(n_jobs) == 1:
        return df.apply(func, **kwargs)
    else:
        ret = Parallel(n_jobs=n_jobs)(
            delayed(type(df).apply)(df[s], func, **kwargs)
            for s in gen_even_slices(_num_samples(df), effective_n_jobs(n_jobs)))
        return pd.concat(ret)
Usage: result = parallel_apply(my_dataframe, my_func)
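The even partitioning that parallel_apply relies on can be illustrated without scikit-learn. The helper below is a plain-Python sketch of the idea (the name even_slices is hypothetical, and it is not scikit-learn's implementation): it spreads n rows over n_packs slices whose sizes differ by at most one.

```python
def even_slices(n, n_packs):
    """Yield up to n_packs slices covering range(n) as evenly as possible."""
    start = 0
    for pack in range(n_packs):
        # The first (n % n_packs) slices receive one extra element.
        size = n // n_packs + (1 if pack < n % n_packs else 0)
        if size > 0:
            yield slice(start, start + size)
            start += size

rows = list(range(10))
slices = list(even_slices(len(rows), 3))
chunks = [rows[s] for s in slices]
print(chunks)
```

Each chunk would then be handed to one worker, and concatenating the per-chunk results in order restores the original row order.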
Instead of
df["new"] = df["old"].map(fun)
do
from joblib import Parallel, delayed
df["new"] = Parallel(n_jobs=-1, verbose=10)(delayed(fun)(i) for i in df["old"])
To me this is a slight improvement over
import multiprocessing as mp
with mp.Pool(mp.cpu_count()) as pool:
    df["new"] = pool.map(fun, df["old"])
as you get a progress indication and automatic batching if the jobs are very small.
Since the question was "How can you use all your cores to run apply on a dataframe in parallel?", Modin is another possible answer. It runs on all cores in parallel, though in my test the wall-clock time was worse.
See https://github.com/modin-project/modin . It runs on top of dask or ray. They say "Modin is a DataFrame designed for datasets from 1MB to 1TB+." I tried: pip3 install "modin[ray]". In my test, Modin took 12 seconds on six cores versus 6 seconds for pandas.
In case you need to do something based on the column name inside the function, beware that the .apply function may give you some trouble. In my case I needed to change the column type using the astype() function based on the column name. This is probably not the most efficient way of doing it, but it serves the purpose and keeps the original column names.
import multiprocessing as mp
import pandas as pd
def f(df):
    """ the function that you want to apply to each column """
    column_name = df.columns[0] # this is the same as the original column name
    # do whatever you need to do to that column
    return df
# Here I just make a list of all the columns. If you don't use .to_frame()
# it will pass series type instead of a dataframe
dfs = [df[column].to_frame() for column in df.columns]
with mp.Pool(mp.cpu_count()) as pool:
    processed_df = pd.concat(pool.map(f, dfs), axis=1)

newbie: holoviews curves from pandas follow up: issues with stream

The pandas dataframe rows correspond to successive time samples of a Kalman filter. I want to display the trajectory (truth, measurements and filter estimates) in a stream.
def show_tracker(index,data=run_tracker()):
    i = int(index)
    sleep(0.1)
    p = \
        hv.Scatter(data[0:i], kdims=['x'], vdims=['y'])(style=dict(color='r')) *\
        hv.Curve (data[0:i], kdims=['x.true'], vdims=['y.true']) *\
        hv.Scatter(data[0:i], kdims=['x.est'], vdims=['y.est'])(style=dict(color='darkgreen')) *\
        hv.Curve (data[0:i], kdims=['x.est'], vdims=['y.est'])(style=dict(color='lightgreen'))
    return p
%%opts Scatter [width=600,height=280]
ndx=TimeIndex()
hv.DynamicMap(show_tracker, kdims=[], streams=[ndx])
for i in range(N):
    ndx.update(index=i)
Issue 1: Axes are automatically set to the bounds of the data. Consequently, trajectory updates occur at the very edge of the plot boundaries. Is there a setting to allow some slop, or do I have to compute appropriate bounds in the show_tracker function?
Issue 2: With the Bokeh backend I can zoom and pan, but "Reset" causes the data set to be lost. How do I fix that?
Issue 3: The default data argument to show_tracker requires the function to be reexecuted to generate a new dataframe. Is there an easy way to address that?
Issue 1
This is one of the last outstanding issues for the 1.7 release coming next week, track this issue for updates. However we also just changed how the ranges are updated on a DynamicMap, if you want to update the ranges make sure to set %%opts Scatter {+framewise} or norm=dict(framewise=True) on one of the displayed objects as you're already doing for the style options.
Issue 2
This is an unfortunate shortcoming of the reset tool in bokeh, you can track this issue for updates.
Issue 3:
That depends on what exactly you're doing: has the data already been generated, or are you updating it on the fly? If you only have to generate the data once, you can create it outside the function, which means it will be in scope:
data = run_tracker()
def show_tracker(index):
    i = int(index)
    sleep(0.1)
    ...
    return p
If you actually want to generate new data dynamically the easiest thing to do is write a little class to keep track of the state. You can even make that class a Stream so you don't have to define it separately. Here's what that might look like:
class KalmanTracker(hv.streams.Stream):
    index = param.Integer(default=1)
    def __init__(self, **params):
        # Initializes empty data and parameters
        self.data = None
        super(KalmanTracker, self).__init__(**params)
    def update_data(self, index):
        # Update self.data here
        pass
    def get_view(self, index):
        # Update the data if index exceeds the data length, then
        # create a holoviews view of the data
        if self.data is None or len(self.data) < index:
            self.update_data(index)
        data = self.data[:index]
        ....
        return hv_obj
    def show(self):
        # Create DynamicMap to display and
        # pass in self as the Stream
        return hv.DynamicMap(self.get_view, kdims=[],
                             streams=[self])
tracker = KalmanTracker()
tracker.show()
# Should update data and plot
tracker.update(index=10)
Once you've done that you can also use the paramnb library to generate widgets from this class. You'd simply do this:
tracker = KalmanTracker()
paramnb.Widgets(tracker, callback=tracker.update)
tracker.show()

Multiprocessing with class functions and class attributes

I have a pandas Dataframe, that has millions of rows and I have to do row-wise operations. Since I have a Multicore CPU, I would like to speed up that process using Multiprocessing. The way I would like to do this is to just split up the dataframe in equally sized dataframes and process each of them within a separate process. So far so good...
The problem is that my code is written in OOP style and I get pickle errors when using a multiprocessing Pool. What I do is pass a reference to a class method, self.X, to the pool. I also use class attributes within X (read access only). I really don't want to switch back to a functional programming style. Hence, is it possible to do multiprocessing in an OOP environment?
It should be possible as long as all elements of your class (that you pass to the sub-processes) are picklable. That is the only thing you have to make sure of. If any element of your class is not picklable, you cannot pass it to a Pool. Even if you only pass self.x, everything else, like self.y, has to be picklable, because pickling a bound method pickles the whole instance.
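A quick way to check this before involving a Pool is to try pickling the bound method directly. The sketch below is an illustration under assumptions: the class names, the callback attribute, and the can_pickle helper are all hypothetical, and the classes must live at module level so pickle can find them by name.

```python
import pickle

class BrokenProcessor:
    def __init__(self, factor):
        self.factor = factor
        self.callback = lambda v: v   # a lambda attribute cannot be pickled

    def scale(self, value):
        return value * self.factor

class Processor(BrokenProcessor):
    def __getstate__(self):
        # Drop the unpicklable attribute before the instance is serialized.
        state = self.__dict__.copy()
        state.pop("callback", None)
        return state

def can_pickle(obj):
    """Return True if obj survives pickle.dumps (hypothetical helper)."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

# Pickling a bound method serializes the instance it is bound to.
broken_ok = can_pickle(BrokenProcessor(2).scale)
fixed_ok = can_pickle(Processor(2).scale)

# Round trip: the restored method still works, without the dropped attribute.
restored = pickle.loads(pickle.dumps(Processor(3).scale))
print(broken_ok, fixed_ok, restored(4))
```

Excluding attributes in __getstate__ only makes sense for state the worker does not need; anything the method reads must still be picklable.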
I do my pandas DataFrame processing like this:
import pandas as pd
import multiprocessing as mp
import numpy as np
import time
def worker(in_queue, out_queue):
    for row in iter(in_queue.get, 'STOP'):
        value = (row[1] * row[2] / row[3]) + row[4]
        time.sleep(0.1)
        out_queue.put((row[0], value))
if __name__ == "__main__":
    # fill a DataFrame (randn needs integer dimensions, so 1e5 is cast to int)
    df = pd.DataFrame(np.random.randn(int(1e5), 4), columns=list('ABCD'))
    in_queue = mp.Queue()
    out_queue = mp.Queue()
    # setup workers
    numProc = 2
    process = [mp.Process(target=worker,
                          args=(in_queue, out_queue)) for x in range(numProc)]
    # run processes
    for p in process:
        p.start()
    # iterator over rows
    it = df.itertuples()
    # fill queue and get data
    # code fills the queue until a new element is available in the output
    # fill blocks if no slot is available in the in_queue
    for i in range(len(df)):
        while out_queue.empty():
            # fill the queue
            try:
                row = next(it)
                in_queue.put((row[0], row[1], row[2], row[3], row[4]), block=True) # row = (index, A, B, C, D) tuple
            except StopIteration:
                break
        row_data = out_queue.get()
        df.loc[row_data[0], "Result"] = row_data[1]
    # signal the processes to stop
    for p in process:
        in_queue.put('STOP')
    # wait for processes to finish
    for p in process:
        p.join()
This way I do not have to pass big chunks of DataFrames and I do not have to think about picklable elements in my class.

How to filter tensor from queue based on some predicate in tensorflow?

How can I filter data stored in a queue using a predicate function? For example, let's say we have a queue that stores tensors of features and labels and we just need those that meet the predicate. I tried the following implementation without success:
feature, label = queue.dequeue()
if (predicate(feature, label)):
    enqueue_op = another_queue.enqueue(feature, label)
The most straightforward way to do this is to dequeue a batch, run them through the predicate test, use tf.where to produce a dense vector of the ones that match the predicate, and use tf.gather to collect the results, and enqueue that batch. If you want that to happen automatically, you can start a queue runner on the second queue - the easiest way to do that is to use tf.train.batch:
Example:
import numpy as np
import tensorflow as tf
a = tf.constant(np.array([5, 1, 9, 4, 7, 0], dtype=np.int32))
q = tf.FIFOQueue(6, dtypes=[tf.int32], shapes=[])
enqueue = q.enqueue_many([a])
dequeue = q.dequeue_many(6)
predmatch = tf.less(dequeue, [5])
selected_items = tf.reshape(tf.where(predmatch), [-1])
found = tf.gather(dequeue, selected_items)
secondqueue = tf.FIFOQueue(6, dtypes=[tf.int32], shapes=[])
enqueue2 = secondqueue.enqueue_many([found])
dequeue2 = secondqueue.dequeue_many(3) # XXX, hardcoded
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(enqueue) # Fill the first queue
    sess.run(enqueue2) # Filter, push into queue 2
    print(sess.run(dequeue2)) # Pop items off of queue2
The predicate produces a boolean vector; the tf.where produces a dense vector of the indexes of the true values, and the tf.gather collects items from your original tensor based upon those indexes.
A lot of things are hardcoded in this example that you'd need to make not-hardcoded in reality, of course, but hopefully it shows the structure of what you're trying to do (create a filtering pipeline). In practice, you'd want QueueRunners on there to keep things churning automatically. Using tf.train.batch is very useful to handle that automatically -- see Threading and Queues for more detail.
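The mask, index, and gather steps described above can be tried outside a TensorFlow session. The following is a NumPy analogy, not TensorFlow code: it reuses the values from the example and swaps in NumPy functions that play the roles of tf.less, tf.where, and tf.gather.

```python
import numpy as np

values = np.array([5, 1, 9, 4, 7, 0], dtype=np.int32)

mask = values < 5                  # boolean vector, the role of tf.less
indices = np.flatnonzero(mask)     # positions of the True entries, the role of tf.where plus reshape
found = np.take(values, indices)   # items at those positions, the role of tf.gather

print(indices.tolist(), found.tolist())
```

Three items pass the predicate here, which is why the TensorFlow example can dequeue exactly three elements from the second queue.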