Use pandas cut function in Dask - pandas

How can I use pd.cut() in Dask?
The dataset is too large to fit into memory, so I cannot load all of it into pandas before running pd.cut().
My current code works in pandas but needs to be converted to Dask:
import pandas as pd
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
# Group by name and add the columns sum (of amounts) and count (number of grouped rows)
df = (df.groupby('name')['amount'].agg(['sum', 'count']).reset_index().sort_values(by='name', ascending=True))
print(df.head(15))
# Group by bins and aggregate sum and count over the grouped rows
# (select the columns with a list; tuple-style ['sum', 'count'] fails on recent pandas)
df = df.groupby(pd.cut(df['name'],
                       bins=[0, 4, 8, 100],
                       labels=['namebin1', 'namebin2', 'namebin3']))[['sum', 'count']].sum().reset_index()
print(df.head(15))
Output:
name sum count
0 namebin1 5 3
1 namebin2 9 2
2 namebin3 8 1
I tried:
import pandas as pd
import dask.dataframe as dd
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
df = dd.from_pandas(df, npartitions=2)
df = df.groupby('name')['amount'].agg(['sum', 'count']).reset_index()
print(df.head(15))
df = df.groupby(df.map_partitions(pd.cut,
df['name'],
bins=[0,4,8,100],
labels=['namebin1', 'namebin2', 'namebin3']))['sum', 'count'].sum().reset_index()
print(df.head(15))
Gives error:
TypeError("cut() got multiple values for argument 'bins'",)

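You see this error because map_partitions passes each partition as the first positional argument to the function. pd.cut() therefore receives the partition as x and df['name'] as its second positional argument, which is bins, and that collides with the bins keyword argument (see the docs).
You can wrap pd.cut() in a custom function and call that instead, like so: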
import pandas as pd
import dask.dataframe as dd
def custom_cut(partition, bins, labels):
    result = pd.cut(x=partition["name"], bins=bins, labels=labels)
    return result
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
df = dd.from_pandas(df, npartitions=2)
df = df.groupby('name')['amount'].agg(['sum', 'count']).reset_index()
df = df.groupby(df.map_partitions(custom_cut,
                                  bins=[0, 4, 8, 100],
                                  labels=['namebin1', 'namebin2', 'namebin3']))[['sum', 'count']].sum().reset_index()
df.compute()
name sum count
namebin1 5 3
namebin2 9 2
namebin3 8 1
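To see where the TypeError comes from without involving Dask, the call that map_partitions effectively makes on each partition can be reproduced in plain pandas. This is only a sketch with a small made-up frame; the variable names message and binned are illustrative and not part of the original code.

```python
import pandas as pd

# Small illustrative frame standing in for a single Dask partition
df = pd.DataFrame({"name": [1, 5, 10], "amount": [5, 9, 8]})
bins = [0, 4, 8, 100]

# map_partitions(pd.cut, df['name'], bins=...) passes the partition first,
# so df["name"] ends up in the positional `bins` slot and clashes with bins=...
try:
    pd.cut(df, df["name"], bins=bins)
    message = None
except TypeError as exc:
    message = str(exc)

# The wrapper avoids this by passing only the column, by keyword
binned = pd.cut(x=df["name"], bins=bins,
                labels=["namebin1", "namebin2", "namebin3"])

print(message)
print(binned.tolist())
```

The first print shows the same "multiple values for argument 'bins'" message as in the question, while the second call bins the column as intended.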

Related

Pandas : How to Apply a Condition on Every Values of a Dataframe, Based on a Second Symmetrical Dataframe

I have a dictionary with two DataFrames: "quantity variation in %" and "prices". Both have the same shape.
Let's say I want to set the price to 0 wherever the quantity variation is greater than 100%.
import numpy as np; import pandas as pd
d = {'qty_pct': pd.DataFrame({ '2020': [200, 0.5, 0.4],
'2021': [0.9, 0.5, 500],
'2022': [0.9, 300, 0.4]}),
'price': pd.DataFrame({ '2020': [-6, -2, -9],
'2021': [ 2, 3, 4],
'2022': [ 4, 6, 8]})}
# I had something like that in mind ...
df = d['price'].applymap(lambda x: 0 if x[d['qty_pct']] >=1 else x)
P.S. If there is a way to do this on asymmetrical DataFrames, I would be curious to see how it's done.
Thanks,
I want to obtain this DataFrame:
price = pd.DataFrame({'2020': [ 0, -2, -9],
'2021': [ 2, 3, 0],
'2022': [ 4, 0, 8]})
Assuming price and qty_pct always have the same dimensions, you can just do:
d['price'][d['qty_pct'] >= 1] = 0
d['price']
2020 2021 2022
0 0 2 4
1 -2 3 0
2 -9 0 8
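If modifying d['price'] in place is not wanted, DataFrame.mask is one possible alternative that returns a new frame. The sketch below reuses the data from the question; the names qty_pct, price and result are chosen here for illustration.

```python
import pandas as pd

qty_pct = pd.DataFrame({'2020': [200, 0.5, 0.4],
                        '2021': [0.9, 0.5, 500],
                        '2022': [0.9, 300, 0.4]})
price = pd.DataFrame({'2020': [-6, -2, -9],
                      '2021': [2, 3, 4],
                      '2022': [4, 6, 8]})

# mask() replaces the cells where the condition is True and leaves the rest;
# the original price frame is not modified
result = price.mask(qty_pct >= 1, 0)
print(result)
```

The comparison qty_pct >= 1 is aligned with price on index and column labels, so both frames need matching labels for this to behave as expected.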

Retaining Group by character values in pandas dataframe

import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': ['sam', 'anu', 'rita', 'first', 'mid', 'last']})
print(df)
I have the data frame above and I would like to convert it as below.
Any help is much appreciated. Thank you!
Try with
out = df.groupby('A',as_index=False).agg(tuple)
A B
0 1 (sam, anu, rita)
1 2 (first, mid, last)
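The desired layout is not shown in the question, so the following is only a guess at a possible variant: if one string per group is preferred over a tuple, the same groupby can aggregate with a string join. The name out_str is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': ['sam', 'anu', 'rita', 'first', 'mid', 'last']})

# Join the strings of each group into one comma-separated value
out_str = df.groupby('A')['B'].agg(', '.join).reset_index()
print(out_str)
```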

How to return a list into a dataframe based on matching index of other column

I have two data frames: one with a single column holding lists of indices (built from a NumPy array), and another with two columns. I am trying to use the elements of each list in the first data frame (df) as index values into df2, to collect the matching o1 and o2 values. Any input would be welcome. Please note that the string 'A1' appears twice in column o1 of df2, and in my desired output the duplicates are removed from column o1.
import numpy as np
import pandas as pd
array_1 = np.array([[0, 2, 3], [3, 4, 6], [1, 2, 3, 6]], dtype=object)  # rows differ in length, so recent NumPy needs dtype=object
#dataframe 1
df = pd.DataFrame({ 'A': array_1})
#dataframe 2
df2 = pd.DataFrame({ 'o1': ['A1', 'B1', 'A1', 'C1', 'D1', 'E1', 'F1'], 'o2': [15, 17, 18, 19, 20, 7, 8]})
#desired output
df_output = pd.DataFrame({ 'A': array_1, 'o1': [['A1', 'C1'], ['C1', 'D1', 'F1'], ['B1','A1','C1','F1']],
'o2': [[15, 18, 19], [19, 20, 8], [17,18,19,8]] })
# Please note: index 0 of df holds 0 and 2, which both map to 'A1'; the output shows 'A1' only once because the duplicate is removed.
I believe you can explode df, use the result to look up rows in df2, and finally join back to df:
s = df['A'].explode()
df_output= df.join(df2.loc[s].groupby(s.index).agg(lambda x: list(set(x))))
Output:
A o1 o2
0 [0, 2, 3] [C1, A1] [18, 19, 15]
1 [3, 4, 6] [F1, D1, C1] [8, 19, 20]
2 [1, 2, 3, 6] [F1, B1, C1, A1] [8, 17, 18, 19]
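Because set() does not keep insertion order, the lists above come out in a different order than in the desired output. One possible way to drop duplicates while keeping first-seen order is dict.fromkeys; the sketch below assumes the same data as the question, and the names picked and collected are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [[0, 2, 3], [3, 4, 6], [1, 2, 3, 6]]})
df2 = pd.DataFrame({'o1': ['A1', 'B1', 'A1', 'C1', 'D1', 'E1', 'F1'],
                    'o2': [15, 17, 18, 19, 20, 7, 8]})

# One row per list element; the index still points at the originating row of df
s = df['A'].explode().astype(int)

# Look up the matching rows of df2, then group them by the originating row
picked = df2.loc[s.to_numpy()]
collected = picked.groupby(s.index.to_numpy()).agg(
    lambda x: list(dict.fromkeys(x))  # de-duplicate, keeping first-seen order
)

df_output = df.join(collected)
print(df_output)
```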

Printing unique list of indices in multiindex pandas dataframe

I am just starting out with pandas and have the following code:
import pandas as pd
d = {'num_legs': [4, 4, 2, 2, 2],
'num_wings': [0, 0, 2, 2, 2],
'class': ['mammal', 'mammal','bird-mammal', 'mammal', 'bird'],
'animal': ['cat', 'dog','cat', 'bat', 'penguin'],
'locomotion': ['walks', 'walks','hops', 'flies', 'walks']}
df = pd.DataFrame(data=d)
df = df.set_index(['class', 'animal', 'locomotion'])
I want to print everything that the animal cat does; here, that will be 'walks' and 'hops'.
I can filter to just the cat cross-section using
df2=df.xs('cat', level=1)
But from here, how do I access the level 'locomotion'?
You can use get_level_values on the index of the cross-section:
df.xs('cat', level=1).index.get_level_values(1)
Out[181]: Index(['walks', 'hops'], dtype='object', name='locomotion')
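Referring to the levels by name rather than by position may be easier to read, since xs drops the 'animal' level and shifts the positions of the remaining ones. A minimal sketch using the frame from the question (the variable moves is illustrative):

```python
import pandas as pd

d = {'num_legs': [4, 4, 2, 2, 2],
     'num_wings': [0, 0, 2, 2, 2],
     'class': ['mammal', 'mammal', 'bird-mammal', 'mammal', 'bird'],
     'animal': ['cat', 'dog', 'cat', 'bat', 'penguin'],
     'locomotion': ['walks', 'walks', 'hops', 'flies', 'walks']}
df = pd.DataFrame(data=d).set_index(['class', 'animal', 'locomotion'])

# Select the cat rows by level name, then read the 'locomotion' level;
# unique() guards against repeated values
moves = (df.xs('cat', level='animal')
           .index.get_level_values('locomotion')
           .unique()
           .tolist())
print(moves)
```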

How can pd.cut return a number as group?

Example
pd.cut(df['a'],[0,2,4,10,np.inf],right=False)
It returns [0,2),[2,4),[4,10),[10,np.inf) .
But how can I get [0],(0,2),[2,4),[4,10),[10,np.inf)?
If all values are non-negative integers, this could work, because the first bin [-inf, 1) then contains only 0 and the second bin [1, 2) contains only 1:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 3, 5, 7, 9, 11, 13]})
pd.cut(df['a'], [-np.inf, 1, 2, 4, 10, np.inf], right=False)
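If the bins should also display as in the question, custom labels can be passed to pd.cut. The sketch below assumes non-negative integer data; the sample values and the label strings are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, 3, 5, 10])

# Same edges as above, with labels describing the intended intervals;
# this is only valid while the values are non-negative integers
labels = ['[0]', '(0, 2)', '[2, 4)', '[4, 10)', '[10, inf)']
binned = pd.cut(s, [-np.inf, 1, 2, 4, 10, np.inf], right=False, labels=labels)
print(binned.tolist())
```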