I have some PySpark code that runs a machine learning model trained in sklearn on a PySpark DataFrame. It looks like this:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from pyspark.sql.functions import pandas_udf

X = np.random.rand(1000, 100)
y = np.random.randint(2, size=1000)
tree = RandomForestRegressor(n_jobs=4)
tree.fit(X, y)

pdf = pd.DataFrame(X)
df = spark.createDataFrame(pdf)

@pandas_udf('double')
def pandas_plus_one(*args):
    # Input/output are both pandas.Series of doubles
    return pd.Series(tree.predict(pd.concat([args[i] for i in range(100)], axis=1)))

df = df.withColumn('result', pandas_plus_one(*[df[i] for i in range(100)]))
My question is: is this the most efficient way to do this in PySpark? In particular, I would like to avoid the pd.concat, which copies all the Series (which were probably adjacent in memory anyway) into a new pandas DataFrame inside the UDF. The ideal solution would be for the Pandas UDF to accept a DataFrame as input, but I haven't found a way to make that work.
Note: I am not looking for solutions that involve Spark ML, scikit-spark, etc.
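(For reference, a minimal sketch of one possible direction, assuming Spark 3.0+: mapInPandas hands the function whole pandas DataFrames batch by batch, so no per-Series concat is needed. The schema string below is built from the 100 double columns created above.)

def predict_batch(batches):
    # mapInPandas receives an iterator of pandas DataFrames (one per Arrow batch)
    for batch in batches:
        batch['result'] = tree.predict(batch.values)
        yield batch

# original columns (named '0'..'99' here) plus the new prediction column
schema = ', '.join(f'`{c}` double' for c in df.columns) + ', result double'
result_df = df.mapInPandas(predict_batch, schema=schema)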
In an attempt to get some outlier plots on large datasets I need to convert a Spark DataFrame to pandas. Turning to Apache Arrow, a simple run is crashing my pyspark console when casting the id column to string (it works fine without the cast). Why?
Using Python version 3.8.9 (default, Apr 10 2021 15:47:22)
Spark context Web UI available at http://6d0b1018a45a:4040
Spark context available as 'sc' (master = local[*], app id = local-1621164597906).
SparkSession available as 'spark'.
>>> import time
>>> from pyspark.sql.functions import rand
>>> from pyspark.sql import functions as F
>>> spark = SparkSession.builder.appName("Console_Test").getOrCreate()
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
21/05/16 11:31:03 WARN SQLConf: The SQL config 'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' instead of it.
>>> a_df = spark.range(1 << 25).toDF("id").withColumn("x", rand())
>>> a_df = a_df.withColumn("id", F.col("id").cast("string"))
>>> start_t = time.time()
>>> a_pd = a_df.toPandas()
Killed
Additionally, I noticed that options such as spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000") seemingly have no effect, as the web UI shows significantly more than 5000 records being assigned to the tasks.
Any indication on how to resolve the pyspark console crash, or more directly render large scatter plots, would be highly appreciated. I have (unsuccessfully) tried to find a way to apply Table.to_pandas(split_blocks=True, self_destruct=True) but could not get the Table structure from a Spark DataFrame.
You are trying to convert 33.5 million (2^25) rows into a pandas DataFrame. This will lead to an OutOfMemoryError, as all the data will be transferred to the Spark driver.
A way to find outliers would be to calculate the histogram for the column x and then filter down a_df to the relevant bins in Spark before creating the Pandas dataframe:
hist = a_df.select("x").rdd.flatMap(lambda x: x).histogram(10) #create 10 bins
hist is a tuple of two arrays: the first array contains the boundaries of the bins and the second array contains the numbers of elements in each bin:
([1.7855041778425118e-08,
0.1000000152099446,
0.20000001256484742,
0.30000000991975023,
0.40000000727465307,
0.5000000046295558,
0.6000000019844587,
0.6999999993393615,
0.7999999966942644,
0.8999999940491672,
0.99999999140407],
[3355812,
3356891,
3352364,
3352438,
3357564,
3356213,
3354933,
3355144,
3357241,
3355832])
rand creates uniformly distributed random numbers, so the histogram in this case is not very interesting. But for real-world distributions the histogram will be useful.
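A minimal sketch of the filtering step (the bin choice is hypothetical; here only the rows falling into the last, highest bin are pulled to the driver):

from pyspark.sql import functions as F

# boundaries of the last bin from the histogram above
lower, upper = hist[0][-2], hist[0][-1]

# filter in Spark first, then convert only the remaining rows to pandas
outliers_pd = a_df.filter((F.col("x") >= lower) & (F.col("x") <= upper)).toPandas()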
I have a base DataFrame, and I want to apply to each element a specific function given by another DataFrame with the same index and columns, for example something like this:
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])
df_format = pd.DataFrame([[lambda x: f'{x:.1f}', lambda x: f'{x:.2f}'],
                          [lambda x: f'{x:.3f}', lambda x: f'{x:.4f}']])
for i in range(2):
    for j in range(2):
        df.iloc[i, j] = df_format.iloc[i, j](df.iloc[i, j])
print(df)
       0       1
0    1.0    2.00
1  3.000  4.0000
This isn't vectorised, and I wonder if there is a much more efficient way to do this, especially for larger DataFrames.
You could make the code much faster by working on columns and using NumPy's vectorize function. Direct element accesses to a pandas DataFrame (using iloc) or to the underlying NumPy arrays (using arr[i]) are slow, and Python loops are very slow too. Moreover, the data is stored column-wise internally, making column-wise operations faster than row-wise ones.
Here is a solution to vectorize your operation:
import numpy as np

def callOn(func, value):
    return func(value)

for j in range(2):
    # np.vectorize(callOn) generates a function calling callOn(x, y) for
    # each input pair (x, y) of zip(df_format[j], df[j]).
    df[j] = np.vectorize(callOn)(df_format[j], df[j])
However, note that NumPy does not truly vectorize the calls internally, since it is dealing with Python objects and functions. But this limitation comes from the assumption that every lambda could be different, with each defined as a plain Python object.
On my machine this code is about 200 times faster than the initial one using the following setup:
nRows, nCols = 1000, 20
fList = [lambda x: f'{x:.1f}', lambda x: f'{x:.2f}',
         lambda x: f'{x:.3f}', lambda x: f'{x:.4f}']
df = pd.DataFrame(np.random.randint(10, size=(nRows, nCols)))
df_format = pd.DataFrame(np.random.choice(fList, size=(nRows, nCols)))
I am working with time-series data formatted so that each row is a single instance of ID/time/data. This means that the rows don't correspond one-to-one across IDs; each ID has many rows across time.
I am trying to use dask.delayed to run a function on each entire ID sequence (the operation should be able to run on each ID at the same time, since the IDs don't affect each other). To do this I first loop over the ID tags, pull out all the data for that ID (with .loc in pandas, so it is a separate "mini" df), delay the function call on the mini df, add a column with the delayed values, and append the mini df to a list. At the end of the for loop I want to call dask.compute() on all the mini dfs at once, but for some reason the mini dfs' values are still delayed. Below is some pseudocode for what I just described.
I have a feeling that this may not be the best way to go about it, but it's what made sense at the time, and I can't understand what's wrong, so any help would be very much appreciated.
Here is what I am trying to do:
import dask

list_of_mini_dfs = []
for id in big_df['id'].unique():
    curr_df = big_df.loc[big_df['id'] == id]
    curr_df['new value 1'] = dask.delayed(myfunc)(args1)
    curr_df['new value 2'] = dask.delayed(myfunc)(args2)  # same func as previous line
    list_of_mini_dfs.append(curr_df)

list_of_mini_dfs = dask.delayed(list_of_mini_dfs).compute()

# Concat all mini dfs into a new big df.
As the code shows, I have to reach into my big/overall DataFrame to pull out each ID's sequence of data, since it is interspersed throughout the rows. I want to call a delayed function on that single ID's data and then return the values from the function call into the big/overall DataFrame.
Currently this method is not working: when I concat all the mini DataFrames back together, the two values I delayed are still delayed, which leads me to think it is due to the way I am delaying a function within a df and then trying to compute the list of DataFrames. I just can't see how to fix it.
Hopefully this was relatively clear and thank you for the help.
IIUC you are trying to do a sort of transform using dask. (As an aside, assigning a dask.delayed object into a DataFrame column stores the Delayed object itself, and computing the list of DataFrames does not reach into their cells to replace those objects with results, which is why your values stay delayed.)
import pandas as pd
import dask.dataframe as dd
import numpy as np

# generate big_df
dates = pd.date_range(start='2019-01-01', end='2020-01-01')
l = len(dates)

out = []
for i in range(1000):
    df = pd.DataFrame({"ID": [i]*l,
                       "date": dates,
                       "data0": np.random.randn(l),
                       "data1": np.random.randn(l)})
    out.append(df)

big_df = pd.concat(out, ignore_index=True)\
           .sample(frac=1)\
           .reset_index(drop=True)
Now you want to apply your function fun to the columns data0 and data1.
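(fun is not defined in the question; as a concrete, hypothetical stand-in, take a per-ID reduction such as the mean, which returns one row per ID as the merges below assume:)

# hypothetical fun: reduce each ID's rows to a single row of per-ID means
def fun(g):
    return g.mean()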
Pandas
out = big_df.groupby("ID")[["data0", "data1"]]\
            .apply(fun)\
            .reset_index()

df_pd = pd.merge(big_df, out, how="left", on="ID")
Dask
df = dd.from_pandas(big_df, npartitions=4)

out = df.groupby("ID")[["data0", "data1"]]\
        .apply(fun, meta={'data0': 'f8',
                          'data1': 'f8'})\
        .rename(columns={'data0': 'new_values0',
                         'data1': 'new_values1'})\
        .compute()  # here you need to compute, otherwise you'll get NaNs

df_dask = dd.merge(df, out,
                   how="left",
                   left_on=["ID"],
                   right_index=True)
The dask version is not necessarily faster than the pandas one, in particular if your df fits in RAM.
As of August 2017, Pandas DataFrame.apply() is unfortunately still limited to working with a single core, meaning that a multi-core machine will waste the majority of its compute time when you run df.apply(myfunc, axis=1).
How can you use all your cores to run apply on a dataframe in parallel?
You may use the swifter package:
pip install swifter
(Note that you may want to use this in a virtualenv to avoid version conflicts with installed dependencies.)
Swifter works as a plugin for pandas, allowing you to reuse the apply function:
import swifter

def some_function(data):
    return data * 10

data['out'] = data['in'].swifter.apply(some_function)
It will automatically figure out the most efficient way to parallelize the function, no matter if it's vectorized (as in the above example) or not.
More examples and a performance comparison are available on GitHub. Note that the package is under active development, so the API may change.
Also note that this will not work automatically for string columns. When using strings, Swifter will fall back to a “simple” pandas apply, which will not be parallel. In this case, even forcing it to use dask will not yield performance improvements, and you would be better off just splitting your dataset manually and parallelizing with multiprocessing, as sketched below.
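(A minimal sketch of that manual fallback, assuming f is a picklable, module-level function on single values:)

import multiprocessing as mp
import numpy as np
import pandas as pd

def apply_f(series):
    # run the plain pandas apply on one chunk, inside a worker process
    return series.apply(f)

chunks = np.array_split(data['in'], mp.cpu_count())
with mp.Pool(mp.cpu_count()) as pool:
    data['out'] = pd.concat(pool.map(apply_f, chunks))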
The simplest way is to use Dask's map_partitions. You need these imports (you will need to pip install dask):

import pandas as pd
import dask.dataframe as dd
from dask.multiprocessing import get

and the syntax is

data = <your_pandas_dataframe>
ddata = dd.from_pandas(data, npartitions=30)

def myfunc(x, y, z, ...): return <whatever>

res = ddata.map_partitions(lambda df: df.apply((lambda row: myfunc(*row)), axis=1)).compute(get=get)

(In recent Dask versions the get= keyword has been removed; use .compute(scheduler='processes') instead.)
(I believe that 30 is a suitable number of partitions if you have 16 cores.) Just for completeness, I timed the difference on my machine (16 cores):

import numpy as np
import pandas as pd
import timeit

data = pd.DataFrame()
data['col1'] = np.random.normal(size=1500000)
data['col2'] = np.random.normal(size=1500000)
ddata = dd.from_pandas(data, npartitions=30)

def myfunc(x, y): return y * (x**2 + 1)
def apply_myfunc_to_DF(df): return df.apply((lambda row: myfunc(*row)), axis=1)
def pandas_apply(): return apply_myfunc_to_DF(data)
def dask_apply(): return ddata.map_partitions(apply_myfunc_to_DF).compute(get=get)
def vectorized(): return myfunc(data['col1'], data['col2'])

t_pds = timeit.Timer(lambda: pandas_apply())
print(t_pds.timeit(number=1))
28.16970546543598

t_dsk = timeit.Timer(lambda: dask_apply())
print(t_dsk.timeit(number=1))
2.708152851089835

t_vec = timeit.Timer(lambda: vectorized())
print(t_vec.timeit(number=1))
0.010668013244867325
This gives a factor-of-ten speedup going from pandas apply to dask apply on partitions. Of course, if you have a function you can vectorize, you should: in this case the function (y*(x**2+1)) is trivially vectorized, but there are plenty of things that are impossible to vectorize.
You can try pandarallel instead: a simple and efficient tool to parallelize your pandas operations on all your CPUs (on Linux & macOS).
Parallelization has a cost (instantiating new processes, sending data via shared memory, etc.), so parallelization is efficient only if the amount of computation to parallelize is high enough. For very small amounts of data, parallelization is not always worth it.
Functions applied should NOT be lambda functions.
from pandarallel import pandarallel
from math import sin

pandarallel.initialize()

# FORBIDDEN
df.parallel_apply(lambda x: sin(x**2), axis=1)

# ALLOWED
def func(x):
    return sin(x**2)

df.parallel_apply(func, axis=1)
see https://github.com/nalepae/pandarallel
If you want to stay in native Python:

import multiprocessing as mp

with mp.Pool(mp.cpu_count()) as pool:
    df['newcol'] = pool.map(f, df['col'])

This will apply the function f in parallel to the column col of the DataFrame df.
Just want to give an updated answer for Dask:
import dask.dataframe as dd

def your_func(row):
    # do something
    return row

ddf = dd.from_pandas(df, npartitions=30)  # find your own number of partitions
ddf_update = ddf.apply(your_func, axis=1).compute()
On my 100,000 records, without Dask:
CPU times: user 6min 32s, sys: 100 ms, total: 6min 32s
Wall time: 6min 32s
With Dask:
CPU times: user 5.19 s, sys: 784 ms, total: 5.98 s
Wall time: 1min 3s
To use all (physical or logical) cores, you could try mapply as an alternative to swifter and pandarallel.
You can set the amount of cores (and the chunking behaviour) upon init:
import pandas as pd
import mapply
mapply.init(n_workers=-1)
...
df.mapply(myfunc, axis=1)
By default (n_workers=-1), the package uses all physical CPUs available on the system. If your system uses hyper-threading (usually twice the amount of physical CPUs would show up as logical cores), mapply will spawn one extra worker to prioritise the multiprocessing pool over other processes on the system.
Depending on your definition of "all your cores", you could also use all logical cores instead (beware that in that case CPU-bound processes will fight for physical CPUs, which might slow down your operation):
import multiprocessing
n_workers = multiprocessing.cpu_count()
# or more explicit
import psutil
n_workers = psutil.cpu_count(logical=True)
Here is an example of an sklearn base transformer in which the pandas apply is parallelized:

import multiprocessing as mp
import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator

class ParallelTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, n_jobs=1):
        """
        n_jobs - number of parallel jobs to run
        """
        self.n_jobs = n_jobs

    def fit(self, X, y=None):
        return self

    def transform(self, X, *_):
        X_copy = X.copy()
        cores = mp.cpu_count()
        if self.n_jobs <= -1:
            partitions = cores
        elif self.n_jobs <= 0:
            partitions = 1
        else:
            partitions = min(self.n_jobs, cores)
        if partitions == 1:
            # transform sequentially
            return X_copy.apply(self._transform_one)
        # split the data into batches
        data_split = np.array_split(X_copy, partitions)
        pool = mp.Pool(cores)
        # the reduce step: concatenation of the transformed batches
        data = pd.concat(pool.map(self._transform_part, data_split))
        pool.close()
        pool.join()
        return data

    def _transform_part(self, df_part):
        return df_part.apply(self._transform_one)

    def _transform_one(self, line):
        # some kind of transformation here
        return line
For more info see https://towardsdatascience.com/4-easy-steps-to-improve-your-machine-learning-code-performance-88a0b0eeffa8
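A hypothetical usage sketch (the data and n_jobs value are illustrative):

# parallel apply over all columns of a toy DataFrame
df = pd.DataFrame({'a': range(1000), 'b': range(1000)})
transformer = ParallelTransformer(n_jobs=-1)
result = transformer.fit(df).transform(df)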
A native Python solution (with numpy) that can be applied to the whole DataFrame, as the original question asks (not only to a single column):
import numpy as np
import pandas as pd
import multiprocessing as mp

dfs = np.array_split(df, 8000)  # divide the dataframe as desired

def f_app(df):
    return df.apply(myfunc, axis=1)

with mp.Pool(mp.cpu_count()) as pool:
    res = pd.concat(pool.map(f_app, dfs))
Here is another one using Joblib and some helper code from scikit-learn. Lightweight (if you already have scikit-learn), and good if you prefer more control over what it is doing, since joblib is easily hackable.
import pandas as pd
from joblib import Parallel, delayed, effective_n_jobs
from sklearn.utils import gen_even_slices
from sklearn.utils.validation import _num_samples

def parallel_apply(df, func, n_jobs=-1, **kwargs):
    """ Pandas apply in parallel using joblib.
    Uses sklearn.utils to partition input evenly.

    Args:
        df: Pandas DataFrame, Series, or any other object that supports slicing and apply.
        func: Callable to apply.
        n_jobs: Desired number of workers. Default value -1 means use all available cores.
        **kwargs: Any additional parameters will be supplied to the apply function.

    Returns:
        Same as for normal Pandas DataFrame.apply().
    """
    if effective_n_jobs(n_jobs) == 1:
        return df.apply(func, **kwargs)
    else:
        ret = Parallel(n_jobs=n_jobs)(
            delayed(type(df).apply)(df[s], func, **kwargs)
            for s in gen_even_slices(_num_samples(df), effective_n_jobs(n_jobs)))
        return pd.concat(ret)
Usage: result = parallel_apply(my_dataframe, my_func)
Instead of
df["new"] = df["old"].map(fun)
do
from joblib import Parallel, delayed
df["new"] = Parallel(n_jobs=-1, verbose=10)(delayed(fun)(i) for i in df["old"])
To me this is a slight improvement over
import multiprocessing as mp

with mp.Pool(mp.cpu_count()) as pool:
    df["new"] = pool.map(fun, df["old"])
as you get a progress indication and automatic batching if the jobs are very small.
Since the question was "How can you use all your cores to run apply on a dataframe in parallel?", the answer can also be modin. You can run all cores in parallel, though the real (wall-clock) time may be worse.
See https://github.com/modin-project/modin. It runs on top of dask or ray. They say "Modin is a DataFrame designed for datasets from 1MB to 1TB+." I tried pip3 install "modin[ray]". Modin vs pandas was 12 sec on six cores vs. 6 sec.
In case you need to do something based on the column name inside the function, beware that the .apply function may give you some trouble. In my case I needed to change the column type using the astype() function based on the column name. This is probably not the most efficient way of doing it, but it serves the purpose and keeps the column names the same as the original ones.
import multiprocessing as mp
import pandas as pd

def f(df):
    """ the function that you want to apply to each column """
    column_name = df.columns[0]  # this is the same as the original column name
    # do whatever you need to do to that column
    return df

# Here I just make a list of all the columns. If you don't use .to_frame(),
# it will pass a Series instead of a DataFrame.
dfs = [df[column].to_frame() for column in df.columns]

with mp.Pool(mp.cpu_count()) as pool:
    processed_df = pd.concat(pool.map(f, dfs), axis=1)