adding pandas df to dask df - pandas

One of the recent problems with dask I encountered was encodings that take a lot of time and I wanted to speed them up.
Problem: given a dask df (ddf), encode it, and return ddf.
Here is some code to start with:
# !pip install feature_engine
import dask.dataframe as dd
import pandas as pd
import numpy as np
from feature_engine.encoding import CountFrequencyEncoder
df = pd.DataFrame(np.random.randint(1, 5, (100,3)), columns=['a', 'b', 'c'])
# make it object cols
for col in df.columns:
df[col] = df[col].astype(str)
ddf = dd.from_pandas(df, npartitions=3)
x_freq = ddf.copy()
for col_idx, col_name in enumerate(x_freq.columns):
freq_enc = CountFrequencyEncoder(encoding_method='frequency')
col_to_encode = x_freq[col_name].to_frame().compute()
encoded_col = freq_enc.fit_transform(col_to_encode).rename(columns={col_name: col_name + '_freq'})
x_freq = dd.concat([x_freq, encoded_col], axis=1)
x_freq.head()
It will run fine as I would expect, adding pandas df to dask df - no problem.
But when I try another ddf, there is an error:
x_freq = x.copy()
# npartitions = x_freq.npartitions
# x_freq = x_freq.repartition(npartitions=npartitions).reset_index(drop=True)
for col_idx, col_name in enumerate(x_freq.columns):
freq_enc = CountFrequencyEncoder(encoding_method='frequency')
col_to_encode = x_freq[col_name].to_frame().compute()
encoded_col = freq_enc.fit_transform(col_to_encode).rename(columns={col_name: col_name + '_freq'})
x_freq = dd.concat([x_freq, encoded_col], axis=1)
break
x_freq.head()
Error is happening during concat:
ValueError: Unable to concatenate DataFrame with unknown division specifying axis=1
This is how I load "error" ddf:
ddf = dd.read_parquet(os.path.join(dir_list[0], '*.parquet'), engine='pyarrow').repartition(partition_size='100MB')
I read I should try repartition and/or reset index and/or use assign. Neither worked.
x_freq = x.copy()
in the second example is similar to:
x_freq = ddf.copy()
in the first example in a sense that x is just some ddf I'm trying to encode but it would be a lot of code to define it here.
Can anyone help, please?

Here's what I think might be going on.
Your parquet file probably doesn't have divisions information within it. You thus cannot just dd.concat, since it's not clear how the partitions align.
You can check this by
x_freq.known_divisions # is likely False
x_freq.divisions # is likely (None, None, None, None)
Since unknown divisions are the problem, you can re-create the issue by using the synthetic data in the first example
x_freq = ddf.clear_divisions().copy()
You might solve this problem by re-setting the index:
x_freq.reset_index().set_index(index_column_name)
where index_column_name is the name of the index column.
Consider also saving the data with the correct index afterwards so that it doesn't have to be calculated each time.
Note 1: Parallelization
By the way, since you're computing each column before working with it, you're not really utilizing dask's parallelization abilities. Here is a workflow that might utilize parallelization a bit better:
def count_frequency_encoder(s):
return s.replace(s.value_counts(normalize=True).compute().to_dict())
frequency_columns = {
f'{col_name}_freq': count_frequency_encoder(x_freq[col_name])
for col_name in x_freq.columns}
x_freq = x_freq.assign(**frequency_columns)
Note 2: to_frame
A tiny tip:
x_freq[col_name].to_frame()
is equivalent to
x_freq[[col_name]]

Related

Fastest way to compute only last row in pandas dataframe

I am trying to find the fastest way to compute the results only for the last row in a dataframe. For some reason, when I do so it is slower than computing the entire dataframe. What am I doing wrong here? What would be the correct way to access only the last two rows and compute their values?
Currently these are my results:
Processing time of add_complete(): 1.333 seconds
Processing time of add_last_row_only(): 1.502 seconds
import numpy as np
import pandas as pd
def add_complete(df):
df['change_a'] = df['a'].diff()
df['change_b'] = df['b'].diff()
df['factor'] = df['change_a'] * df['change_b']
def add_last_row_only(df):
df.at[df.index[-1], 'change_a_last_row'] = df['a'].iloc[-1] - df['a'].iloc[-2]
df.at[df.index[-1], 'change_b_last_row'] = df['b'].iloc[-1] - df['b'].iloc[-2]
df.at[df.index[-1], 'factor_last_row'] = df['change_a_last_row'].iloc[-1] * df['change_b_last_row'].iloc[-1]
def main():
a = np.arange(200_000_000).reshape(100_000_000, 2)
df = pd.DataFrame(a, columns=['a', 'b'])
add_complete(df)
add_last_row_only(df)
print(df.tail())
Unless I am missing something, for this kind of operation I would use numpy on the two last lines:
%%timeit
changes = np.diff(df.values[-2:,:],axis=0)
factor = np.product(changes)
21µs just this operation, yes, microseconds.
If I add insertion it increases to 511ms, even filling all with same value.
I suspect the problem comes from handling around a 1.5Gb dataframe, which actually doubles the size when inserting the two extra columns.
%%timeit
changes = np.diff(df.values[-2:,:],axis=0)
factor = np.product(changes)
df['factor']=factor
df['changes_a']=changes[0][0]
df['changes_b']=changes[0][1]

Convert pandas to dask code and it errors out

I have pandas code which works perfectly.
import pandas as pd
courses_df = pd.DataFrame(
[
["Jay", "MS"],
["Jay", "Music"],
["Dorsey", "Music"],
["Dorsey", "Piano"],
["Mark", "MS"],
],
columns=["Name", "Course"],
)
pandas_df_json = (
courses_df.groupby(["Name"])
.apply(lambda x: x.drop(columns="Name").to_json(orient="records"))
.reset_index(name="courses_json")
)
But when I convert the dataframe to Dask and try the same operation.
from dask import dataframe as dd
df = dd.from_pandas(courses_df, npartitions=2)
df.groupby(["Name"]).apply(lambda x: x.to_json(orient="records")).reset_index(
name="courses_json"
).compute()
And the error i get is
UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
df.groupby(["Name"]).apply(lambda x: x.to_json(orient="records")).reset_index(
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [37], in <module>
1 from dask import dataframe as dd
3 df = dd.from_pandas(courses_df, npartitions=2)
----> 4 df.groupby(["Name"]).apply(lambda x: x.drop(columns="Name").to_json(orient="records")).reset_index(
5 name="courses_json"
6 ).compute()
TypeError: _Frame.reset_index() got an unexpected keyword argument 'name'
My expected output from dask and pandas should be same that is
Name courses_json
0 Dorsey [{"Course":"Music"},{"Course":"Piano"}]
1 Jay [{"Course":"MS"},{"Course":"Music"}]
2 Mark [{"Course":"MS"}]
How do i achieve this in dask ?
My try so far
from dask import dataframe as dd
df = dd.from_pandas(courses_df, npartitions=2)
df.groupby(["Name"]).apply(
lambda x: x.drop(columns="Name").to_json(orient="records")
).compute()
UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
df.groupby(["Name"]).apply(
Out[57]:
Name
Dorsey [{"Course":"Piano"},{"Course":"Music"}]
Jay [{"Course":"MS"},{"Course":"Music"}]
Mark [{"Course":"MS"}]
dtype: object
I want to pass in a meta arguement and also want the second column
to have a meaningful name like courses_json
For the meta warning, Dask is expecting you to specify the column datatypes for the result. It's optional, but if you do not specify this it's entirely possible that Dask may infer faulty datatypes. One partition could for example be inferred as an int type and another as a float. This is particularly the case for sparse datasets. See the docs page for more details:
https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.apply.html
This should solve the warning:
from dask import dataframe as dd
df = dd.from_pandas(courses_df, npartitions=2)
new_df = df.groupby(["Name"]).apply(
lambda x: x.drop(columns="Name").to_json(orient="records"),
meta=("Name", "O")
).to_frame()
# rename columns
new_df.columns = ["courses_json"]
# use numeric int index instead of name as in the given example
new_df = new_df.reset_index()
new_df.compute()
The result of your computation is a dask Series, not a Dataframe. This is why you need to use numpy types here (https://www.w3schools.com/python/numpy/numpy_data_types.asp). It consists of an index and a value. And you're not directly able to name the second column without converting it back to a dataframe using the .to_frame() method.

How to fill a pandas dataframe in a list comprehension?

I need to fill a pandas dataframe in a list comprehension.
Although rows satisfying the criterias are appended to the dataframe.
However, at the end, dataframe is empty.
Is there a way to resolve this?
In real code, I'm doing many other calculations. This is a simplified code to regenerate it.
import pandas as pd
main_df = pd.DataFrame(columns=['a','b','c','d'])
main_df=main_df.append({'a':'a1', 'b':'b1','c':'c1', 'd':'d1'},ignore_index=True)
main_df=main_df.append({'a':'a2', 'b':'b2','c':'c2', 'd':'d2'},ignore_index=True)
main_df=main_df.append({'a':'a3', 'b':'b3','c':'c3', 'd':'d3'},ignore_index=True)
main_df=main_df.append({'a':'a4', 'b':'b4','c':'c4', 'd':'d4'},ignore_index=True)
print(main_df)
sub_df = pd.DataFrame()
df_columns = main_df.columns.values
def search_using_list_comprehension(row,sub_df,df_columns):
if row[0]=='a1' or row[0]=='a2':
dict= {a:b for a,b in zip(df_columns,row)}
print('dict: ', dict)
sub_df=sub_df.append(dict, ignore_index=True)
print('sub_df.shape: ', sub_df.shape)
[search_using_list_comprehension(row,sub_df,df_columns) for row in main_df.values]
print(sub_df)
print(sub_df.shape)
The problem is that you define an empty frame with sub_df = dp.DataFrame() then you assign the same variable within the function parameters and within the list comprehension you provide always the same, empty sub_df as parameter (which is always empty). The one you append to within the function is local to the function only. Another “issue” is using python’s dict variable as user defined. Don’t do this.
Here is what can be changed in your code in order to work, but I would strongly advice against it
import pandas as pd
df_columns = main_df.columns.values
sub_df = pd.DataFrame(columns=df_columns)
def search_using_list_comprehension(row):
global sub_df
if row[0]=='a1' or row[0]=='a2':
my_dict= {a:b for a,b in zip(df_columns,row)}
print('dict: ', my_dict)
sub_df = sub_df.append(my_dict, ignore_index=True)
print('sub_df.shape: ', sub_df)
[search_using_list_comprehension(row) for row in main_df.values]
print(sub_df)
print(sub_df.shape)

Updating single row is slow with mixed types in pandas

A simple line of code df.iloc[100] = df.iloc[500] gets very slow on a large DataFrame with mixed types due to the fact that pandas copies the entire columns (found it in the source code). What I don't get is why this behaviour is necessary and how to avoid it and force pandas to just update the relevant values if I am sure in advance that the dtypes are the same. When the DF is single-type then the copying doesn't take place and values are modified in-place.
I found a workaround that seems to have the desired effect but it works only on row numbers:
for c in df.columns:
df[c].array[100] = df[c].array[500]
It is literally 1000x faster than df.iloc[100] = df.iloc[500].
Here is how to reproduce the slowness of assignment:
import string
import itertools
import timeit
import numpy as np
import pandas as pd
data = list(itertools.product(range(200_000), string.ascii_uppercase))
df = pd.DataFrame(data, columns=['i', 'p'])
df['n1'] = np.random.randn(len(df))
df['n2'] = np.random.randn(len(df))
df['n3'] = np.random.randn(len(df))
df['n4'] = np.random.randn(len(df))
print(
timeit.timeit('df.loc[100] = df.loc[500]', number=100, globals=globals()) / 100
)
df_o = df.copy()
# Remove mixed types
for c in df_o.columns:
df_o[c] = df_o[c].astype('object')
print(
timeit.timeit('df_o.loc[100] = df_o.loc[500]', number=100, globals=globals()) / 100
)
This example alone shows 10x performance difference. I still don't fully understand why even with non-mixed types assigning a single row is quite slow.

How to access dask dataframe index value in map_paritions?

I am trying to use dask dataframe map_partition to apply a function which access the value in the dataframe index, rowise and create a new column.
Below is the code I tried.
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame(index = ["row0" , "row1","row2","row3","row4"])
df
ddf = dd.from_pandas(df, npartitions=2)
res = ddf.map_partitions(lambda df: df.assign(index_copy= str(df.index)),meta={'index_copy': 'U' })
res.compute()
I am expecting df.index to be the value in the row index, not the entire partition index which it seems to refer to. From the doc here, this work well for columns but not the index.
what you want to do is this
df.index = ['row'+str(x) for x in df.index]
and for that first create your pandas dataframe and then run this code after you will have your expected result.
let me know if this works for you.