Retaining Group by character values in pandas dataframe - pandas

import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': ['sam', 'anu', 'rita', 'first', 'mid', 'last']})
print(df)
I have the data frame above and would like to convert it as shown below.
Any help much appreciated. Thank you!

Try with
out = df.groupby('A',as_index=False).agg(tuple)
   A                   B
0  1    (sam, anu, rita)
1  2  (first, mid, last)
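If a tuple per group is not the shape you need, the same grouping can collect lists or a joined string instead. A minimal sketch, assuming the df from the question; the names as_tuples, as_lists and as_text are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': ['sam', 'anu', 'rita', 'first', 'mid', 'last']})

# One row per value of A, with the B values kept in their original order.
as_tuples = df.groupby('A', as_index=False).agg(tuple)

# Same grouping, collecting a list per group instead of a tuple.
as_lists = df.groupby('A', as_index=False).agg(list)

# Same grouping, joining the strings of each group into one string.
as_text = df.groupby('A')['B'].agg(', '.join).reset_index()

print(as_tuples)
print(as_lists)
print(as_text)
```

Tuples and lists keep the individual values accessible, while the joined string is easier to read or export.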

Related

pandas row wise comparison and apply condition

This is my dataframe:
df = pd.DataFrame(
{
"name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],
"score": [3, 5, 6, 2, 4, 1],
}
)
I want to compare the score of bob_x with bob_y and retain the row with the lowest score, and do the same for jay_x and jay_y. No change is required for mad and joe.
You can first split the names by _ and keep the first part, then groupby and keep the lowest value:
import pandas as pd
df = pd.DataFrame({"name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],"score": [3, 5, 6, 2, 4, 1]})
df['name'] = df['name'].str.split('_').str[0]
df.groupby('name')['score'].min().reset_index()
Result:
  name  score
0  bob      2
1  jay      4
2  joe      1
3  mad      5
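Note that the approach above overwrites the name column, so the _x / _y suffix is lost. If the original rows should be retained as they are, one possible variant is to group on a separate key and pick rows with idxmin. This is a sketch, not the answerer's method; the names base, keep and result are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],
                   "score": [3, 5, 6, 2, 4, 1]})

# Group key: the part of the name before the first underscore.
# It is kept as a separate Series so the name column stays untouched.
base = df["name"].str.split("_").str[0].rename("base")

# idxmin returns the row label of the lowest score within each group.
keep = df.groupby(base)["score"].idxmin()

# Select those rows and restore the original row order.
result = df.loc[keep].sort_index()
print(result)
```

Here the rows for bob_y and jay_y survive with their full names, and mad and joe are unchanged.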

Convert tabular pandas DataFrame into nested pandas DataFrame

Suppose I have a simple pd.DataFrame like so:
d = {'col1': [1, 20], 'col2': [3, 40], 'col3': [5, 50]}
df = pd.DataFrame(data=d)
df
   col1  col2  col3
0     1     3     5
1    20    40    50
Is there a way to convert this to a nested pandas DataFrame (df_new), so that calling df_new.values[0] gives this output:
array(
[0 1
1 3
2 5
Length: 3, dtype: int], dtype=object)
I still don't think I understand the exact requirement, but here is something:
One way of getting the desired output is this:
>>> pd.Series(df.T[0].values)
0 1
1 3
2 5
dtype: int64
If you want to have these as 2D arrays (with numpy imported as np):
>>> np.array(pd.DataFrame(df.T[0].values).reset_index())
array([[0, 1],
[1, 3],
[2, 5]])
>>> np.array(pd.DataFrame(df.T[1].values).reset_index())
array([[ 0, 20],
[ 1, 40],
[ 2, 50]])
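If the goal really is a DataFrame whose cells hold whole Series objects, so that df_new.values[0] is an object array containing a Series, something like the following might be closer. This is only a sketch under that reading of the question; the column name nested and the variable cells are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 20], 'col2': [3, 40], 'col3': [5, 50]})

# Pre-allocate an object array so each slot can hold a whole Series.
cells = np.empty(len(df), dtype=object)
for i, row in enumerate(df.to_numpy()):
    cells[i] = pd.Series(row)

# 'nested' is a hypothetical column name; one cell per original row.
df_new = pd.DataFrame({'nested': cells})

first = df_new.values[0]   # object array with a single element
print(first[0])            # the Series built from the first row
```

Storing Series inside cells gives up most vectorised operations, so this is usually worth it only when a downstream consumer expects exactly this layout.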

Use pandas cut function in Dask

How can I use pd.cut() in Dask?
Because of the large dataset, I am not able to put the whole dataset into memory before finishing the pd.cut().
Current code that is working in Pandas but needs to be changed to Dask:
import pandas as pd
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
#Groupby name and add column sum (of amounts) and count (number of grouped rows)
df = (df.groupby('name')['amount'].agg(['sum', 'count']).reset_index().sort_values(by='name', ascending=True))
print(df.head(15))
# Group by bins and change sum and count based on the grouped rows
df = df.groupby(pd.cut(df['name'],
                       bins=[0,4,8,100],
                       labels=['namebin1', 'namebin2', 'namebin3']))[['sum', 'count']].sum().reset_index()
print(df.head(15))
Output:
name sum count
0 namebin1 5 3
1 namebin2 9 2
2 namebin3 8 1
I tried:
import pandas as pd
import dask.dataframe as dd
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
df = dd.from_pandas(df, npartitions=2)
df = df.groupby('name')['amount'].agg(['sum', 'count']).reset_index()
print(df.head(15))
df = df.groupby(df.map_partitions(pd.cut,
df['name'],
bins=[0,4,8,100],
labels=['namebin1', 'namebin2', 'namebin3']))['sum', 'count'].sum().reset_index()
print(df.head(15))
Gives error:
TypeError("cut() got multiple values for argument 'bins'",)
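You're seeing this error because map_partitions passes each partition to pd.cut() as the first positional argument, which it doesn't expect (see the docs). The extra df['name'] then lands in the second positional slot, which is bins, and the bins= keyword supplies that argument a second time.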
You can wrap it in a custom function and call that instead, like so:
import pandas as pd
import dask.dataframe as dd
def custom_cut(partition, bins, labels):
    result = pd.cut(x=partition["name"], bins=bins, labels=labels)
    return result
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
df = dd.from_pandas(df, npartitions=2)
df = df.groupby('name')['amount'].agg(['sum', 'count']).reset_index()
df = df.groupby(df.map_partitions(custom_cut,
bins=[0,4,8,100],
labels=['namebin1', 'namebin2', 'namebin3']))[['sum', 'count']].sum().reset_index()
df.compute()
name sum count
namebin1 5 3
namebin2 9 2
namebin3 8 1
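The argument clash described above can be reproduced without Dask, by calling pd.cut the way map_partitions would. The sketch below uses a plain pandas frame as a stand-in for one partition (frame, message and binned are illustrative names) and then applies the same kind of wrapper:

```python
import pandas as pd

# Stand-in for a single partition; Dask is not needed to show the clash.
frame = pd.DataFrame({'name': [1, 5, 1, 10, 5, 1]})
bins = [0, 4, 8, 100]
labels = ['namebin1', 'namebin2', 'namebin3']

# Roughly what map_partitions(pd.cut, df['name'], bins=...) ends up calling:
# the partition fills x, the extra positional fills bins, then bins= repeats it.
try:
    pd.cut(frame, frame['name'], bins=bins, labels=labels)
    message = None
except TypeError as exc:
    message = str(exc)
print(message)

def custom_cut(partition, bins, labels):
    # Pick the column explicitly so pd.cut receives a 1-D input as x.
    return pd.cut(x=partition['name'], bins=bins, labels=labels)

binned = custom_cut(frame, bins=bins, labels=labels)
print(binned.tolist())
```

Because the wrapper takes the partition as its first parameter, it matches the calling convention of map_partitions, and bins and labels can be passed as keywords without conflict.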

Printing unique list of indices in multiindex pandas dataframe

I am just starting out with pandas and have the following code:
import pandas as pd
d = {'num_legs': [4, 4, 2, 2, 2],
'num_wings': [0, 0, 2, 2, 2],
'class': ['mammal', 'mammal','bird-mammal', 'mammal', 'bird'],
'animal': ['cat', 'dog','cat', 'bat', 'penguin'],
'locomotion': ['walks', 'walks','hops', 'flies', 'walks']}
df = pd.DataFrame(data=d)
df = df.set_index(['class', 'animal', 'locomotion'])
I want to print everything that the animal cat does; here, that will be 'walks' and 'hops'.
I can filter to just the cat cross-section using
df2=df.xs('cat', level=1)
But from here, how do I access the level 'locomotion'?
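You can use get_level_values on the index. After xs drops the 'animal' level, the remaining levels are 'class' (position 0) and 'locomotion' (position 1):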
df.xs('cat', level=1).index.get_level_values(1)
Out[181]: Index(['walks', 'hops'], dtype='object', name='locomotion')
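The same lookup can also be written with level names instead of positions, which may be easier to read once levels have been dropped. A small sketch, assuming the df from the question; cat_rows and moves are illustrative names:

```python
import pandas as pd

d = {'num_legs': [4, 4, 2, 2, 2],
     'num_wings': [0, 0, 2, 2, 2],
     'class': ['mammal', 'mammal', 'bird-mammal', 'mammal', 'bird'],
     'animal': ['cat', 'dog', 'cat', 'bat', 'penguin'],
     'locomotion': ['walks', 'walks', 'hops', 'flies', 'walks']}
df = pd.DataFrame(data=d).set_index(['class', 'animal', 'locomotion'])

# Cross-section for the cat rows, selecting the level by name.
cat_rows = df.xs('cat', level='animal')

# Unique locomotion values for those rows, as a plain Python list.
moves = cat_rows.index.get_level_values('locomotion').unique().tolist()
print(moves)
```

Calling unique() matters once the same animal has several rows with the same locomotion; in this small example the values are already distinct.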

pandas dataframe subplot grouping by columns

df = pd.DataFrame([[0, 1, 2], [0, 1, 2]])
df.plot(subplots=True)
I want one subplot for columns [0, 1] together and another for column [2]. Is there a way?
You can map the columns to group labels with Index.map and a dictionary, then call DataFrameGroupBy.plot on the resulting 2 groups:
mapping = {0:'a', 1:'a', 2:'b'}
df.groupby(df.columns.map(mapping.get), axis=1).plot()
Detail:
print (df.columns.map(mapping.get))
Index(['a', 'a', 'b'], dtype='object')
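Newer pandas versions emit a deprecation warning for groupby with axis=1, so it may be worth grouping the column labels by hand and plotting each group on its own axes. The following is a sketch of that idea, not part of the original answer; plot_groups is a hypothetical helper and it needs matplotlib installed when called:

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [0, 1, 2]])
mapping = {0: 'a', 1: 'a', 2: 'b'}

# Collect the column labels that belong to each group label.
column_groups = {}
for col in df.columns:
    column_groups.setdefault(mapping[col], []).append(col)
print(column_groups)

def plot_groups(frame, groups):
    # Hypothetical helper: draws one subplot per group of columns.
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    fig, axes = plt.subplots(nrows=len(groups), squeeze=False)
    for ax, (label, cols) in zip(axes.ravel(), groups.items()):
        frame[cols].plot(ax=ax, title=str(label))
    return fig
```

Calling plot_groups(df, column_groups) would then draw columns 0 and 1 on the first subplot and column 2 on the second.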