Transform after groupby in dask - pandas

I'm looking to perform a transform-like operation after a groupby using dask dataframes. Looking through the docs, it doesn't seem like dask currently offers this option. Does anyone have a workaround?
In practice: I'm looking to subtract the means of column B (after doing a groupby on A) from the raw values of B. In pure pandas, it looks like this:
def demean_and_log(x):
    x_log = np.log(x)
    x_log_mean = x_log.mean()
    return x_log - x_log_mean

log_demean_col = X.groupby(['A'])['B'].transform(demean_and_log)
However, this is very slow - as I'm dealing with very large dataframes, and since I'm using a custom transformation function in Python, pandas doesn't release the GIL.
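One pandas-only way to avoid the per-group Python callback is to take the log once over the whole column and then subtract a built-in group mean. The following is a minimal sketch on a small made-up frame (the data is illustrative and not from the question); it only checks that both routes give the same numbers and makes no claim about how it performs on very large data.

```python
import numpy as np
import pandas as pd

# Illustrative data; the column names A and B follow the question.
X = pd.DataFrame({"A": ["a", "a", "b", "b", "b"],
                  "B": [1.0, 2.0, 3.0, 4.0, 5.0]})

def demean_and_log(x):
    x_log = np.log(x)
    return x_log - x_log.mean()

# Approach from the question: a Python function is called once per group.
slow = X.groupby("A")["B"].transform(demean_and_log)

# Alternative: log the whole column once, then subtract the group mean
# computed by the built-in "mean" transform.
log_b = np.log(X["B"])
fast = log_b - log_b.groupby(X["A"]).transform("mean")
```

Both `slow` and `fast` are Series aligned with the original index of `X`.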

Related

pandas to_numeric a large wide dataframe

I need to apply pd.to_numeric to a long and wide (1000+ columns) dataframe where invalid values are coerced as NaN.
Currently I'm using
df.apply(pd.to_numeric, errors="coerce")
which can take a substantial amount of time because of the number of columns.
df.astype()
does not work either, as it does not accept a coerce option.
Any comment is appreciated.
As has already been noted in the comments, the amount of data you're working with makes it hard for pandas transformations to be anything but extremely slow.
I recommend you set up a PySpark session inside your local machine, transform the DataFrame column types and proceed to convert to Pandas at the end if you really need it.
In PySpark, you can convert all of your DataFrame's columns to float like this:
from pyspark.sql.functions import col
df = df.select(*(col(c).cast("float").alias(c) for c in df.columns))
Afterwards you can just save your DataFrame back to where you want it to be (or maybe stick to PySpark and join the group!):
df.toPandas().to_csv('my_file.csv')
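For reference, the coercion behaviour the question relies on can be seen on a tiny frame. This is a sketch with made-up column names and values, shown only to make the meaning of errors="coerce" concrete: strings that cannot be parsed become NaN and the column ends up with a float dtype.

```python
import pandas as pd

# Illustrative data: every column holds strings, some of them not numeric.
df = pd.DataFrame({"x": ["1", "2.5", "oops"],
                   "y": ["10", "bad", "3"]})

# Same call as in the question: unparseable entries are coerced to NaN.
converted = df.apply(pd.to_numeric, errors="coerce")
```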

joblib.Memory and pandas.DataFrame inputs

I've been finding that joblib.Memory.cache results in unreliable caching when using dataframes as inputs to the decorated functions. Playing around, I found that joblib.hash results in inconsistent hashes, at least in some cases. If I understand correctly, joblib.hash is used by joblib.Memory, so this is probably the source of the problem.
Problems seem to occur when new columns are added to dataframes followed by a copy, or when a dataframe is saved and loaded from disk. The following example compares the inconsistent hash output when applied to dataframes, or the consistent results when applied to the equivalent numpy data.
import pandas as pd
import joblib
df = pd.DataFrame({'A':[1,2,3],'B':[4.,5.,6.], })
df.index.name='MyInd'
df['B2'] = df['B']**2
df_copy = df.copy()
df_copy.to_csv("df.csv")
df_fromfile = pd.read_csv('df.csv').set_index('MyInd')
print("DataFrame Hashes:")
print(joblib.hash(df))
print(joblib.hash(df_copy))
print(joblib.hash(df_fromfile))
def _to_tuple(df):
    return (df.values, df.columns.values, df.index.values, df.index.name)
print("Equivalent Numpy Hashes:")
print(joblib.hash(_to_tuple(df)))
print(joblib.hash(_to_tuple(df_copy)))
print(joblib.hash(_to_tuple(df_fromfile)))
results in output:
DataFrame Hashes:
4e9352c1ffc14fb4bb5b1a5ad29a3def
2d149affd4da6f31bfbdf6bd721e06ef
6843f7020cda9d4d3cbf05dfc47542d4
Equivalent Numpy Hashes:
6ad89873c7ccbd3b76ae818b332c1042
6ad89873c7ccbd3b76ae818b332c1042
6ad89873c7ccbd3b76ae818b332c1042
The "Equivalent Numpy Hashes" output is the behavior I'd like. I'm guessing the problem is due to some kind of complex internal metadata that DataFrames utilize. Is there any canonical way to use joblib.Memory.cache on pandas DataFrames so it will cache based upon the data values only?
A "good enough" workaround would be if there is a way a user can tell joblib.Memory.cache to utilize something like my _to_tuple function above for specific arguments.
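As an illustration of what a value-only hash could look like, here is a sketch built on pandas' own pd.util.hash_pandas_object. The helper name frame_fingerprint is hypothetical, and this does not change how joblib.Memory hashes its arguments; it only shows a fingerprint that depends on the values, index, column names, and index name. The CSV round trip uses an in-memory buffer instead of a file.

```python
import hashlib
import io

import pandas as pd

def frame_fingerprint(df):
    """Hypothetical helper: digest of values, index, and labels only."""
    # One uint64 hash per row, covering the row's values and its index entry.
    row_hashes = pd.util.hash_pandas_object(df, index=True).to_numpy()
    h = hashlib.sha256()
    h.update(row_hashes.tobytes())
    # Column labels and the index name are not part of the row hashes.
    h.update("|".join(map(str, df.columns)).encode("utf-8"))
    h.update(str(df.index.name).encode("utf-8"))
    return h.hexdigest()

# Same construction as in the question.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4., 5., 6.]})
df.index.name = 'MyInd'
df['B2'] = df['B'] ** 2
df_copy = df.copy()

# Round trip through CSV, kept in memory for the sake of the example.
buf = io.StringIO()
df_copy.to_csv(buf)
buf.seek(0)
df_fromfile = pd.read_csv(buf).set_index('MyInd')

fingerprints = [frame_fingerprint(d) for d in (df, df_copy, df_fromfile)]
```

Note that a fingerprint like this still depends on column dtypes, so an integer column and a float column holding equal numbers should not be expected to match.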

Is it possible to use Series.str.extract with Dask?

I'm currently processing a large dataset with Pandas and I have to extract some data using pandas.Series.str.extract.
It looks like this:
df['output_col'] = df['input_col'].str.extract(r'.*"mytag": "(.*?)"', expand=False).str.upper()
It works well. However, as it has to be done about ten times (using various source columns), the performance isn't very good. To improve it by using several cores, I wanted to try Dask, but it doesn't seem to be supported (I cannot find any reference to an extract method in dask's documentation).
Is there any way to perform such a pandas operation in parallel?
I have found this method where you basically split your dataframe into multiple ones, create a process per subframes and then concatenate them back.
You should be able to do this like in pandas. It's mentioned in this segment of the documentation, but it might be valuable to expand it.
import pandas as pd
import dask.dataframe as dd

s = pd.Series(["example", "strings", "are useful"])
ds = dd.from_pandas(s, npartitions=2)
ds.str.extract(r"[a-z\s]{4}(.{2})", expand=False).str.upper().compute()
0    PL
1    NG
2    US
dtype: object
Your best bet is to use map_partitions, which enables you to perform general pandas operations to the parts of your series, and acts like a managed version of the multiprocessing method you linked.
def inner(df):
    df['output_col'] = df['input_col'].str.extract(
        r'.*"mytag": "(.*?)"', expand=False).str.upper()
    return df

out = df.map_partitions(inner)
Since this is a string operation, you probably want processes (e.g., the distributed scheduler) rather than threads. Note that your performance will be far better if you load your data using dask (e.g., dd.read_csv) rather than creating the dataframe in memory and then passing it to dask.
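To make the per-partition step concrete, here is what the extraction from the question does on a few sample strings in plain pandas. The sample strings are made up for illustration. Rows without a "mytag" entry come out as NaN, and because the leading .* is greedy, a row with several "mytag" entries would yield the last one.

```python
import pandas as pd

# Illustrative inputs; only the regex comes from the question.
s = pd.Series([
    '{"id": 1, "mytag": "alpha"}',
    '{"mytag": "beta", "other": "x"}',
    '{"id": 3}',
])

# Capture the quoted value following "mytag", then upper-case it.
out = s.str.extract(r'.*"mytag": "(.*?)"', expand=False).str.upper()
```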

Groupby and filter with dask

I want to use dask to do a basic groupby followed by a filter.
My dataset contains two indexes: ORDER_ID and PROD_ID. Each order, identified by its ORDER_ID, contains one or more products, each identified by its PROD_ID.
My objective is to remove every ORDER_ID that contains only one product.
Using pandas I can do it this way:
df = df.groupby('ORDER_ID').filter(lambda x: len(x) >= 2)
I didn't find any suitable solution with dask.
https://docs.dask.org/en/latest/dataframe-best-practices.html discusses the issues with pandas and dask.
For data that fits into RAM, Pandas can often be faster and easier to
use than Dask DataFrame. While “Big Data” tools can be exciting, they
are almost always worse than normal data tools while those remain
appropriate.
So this task is not working in pandas as it takes too much memory?
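One pattern that avoids groupby(...).filter is to count rows per ORDER_ID first and then keep the rows whose ORDER_ID passes the threshold. The sketch below is plain pandas on made-up data and assumes ORDER_ID and PROD_ID are ordinary columns; whether the same count-then-isin pattern suits a particular dask setup is an assumption that is not tested here.

```python
import pandas as pd

# Illustrative data: order 2 has a single product and should be dropped.
df = pd.DataFrame({"ORDER_ID": [1, 1, 2, 3, 3, 3],
                   "PROD_ID": [10, 11, 12, 13, 14, 15]})

# Reference result using the approach from the question.
reference = df.groupby("ORDER_ID").filter(lambda x: len(x) >= 2)

# Alternative: aggregate to per-order counts, then filter with a mask.
counts = df["ORDER_ID"].value_counts()
keep = counts[counts >= 2].index
filtered = df[df["ORDER_ID"].isin(keep)]
```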

create dask DataFrame from a list of dask Series

I need to create a dask DataFrame from a set of dask Series,
analogously to constructing a pandas DataFrame from lists
pd.DataFrame({'l1': list1, 'l2': list2})
I am not seeing anything in the API. The dask DataFrame constructor is not supposed to be called by users directly and takes a computation graph as its main argument.
In general I agree that it would be nice for the dd.DataFrame constructor to behave like the pd.DataFrame constructor.
If your series have well defined divisions then you might try dask.dataframe.concat with axis=1.
You could also try converting one of the series into a DataFrame and then use assignment syntax:
L = [...]  # placeholder: your list of dask Series
df = L[0].to_frame()
for s in L[1:]:
    df[s.name] = s