I have a dataset with three columns. I want to group the rows by key1 and print the groups sorted by the highest value in each group. The records within each group also have to be sorted.
The dataset looks like this:
key1,key2,val
b,y,21
c,y,25
c,z,10
b,x,20
b,z,5
c,x,17
a,x,15
a,y,18
a,z,100
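import pandas as pd
df = pd.read_csv('/tmp/hello.csv')
# largest val within each key1 group, repeated on every row of that group
df['max'] = df.groupby(['key1'])['val'].transform('max')
# sort by the group maximum, then by val (both descending) and drop the helper column
dff = df.sort_values(['max', 'val'], ascending=False).drop('max', axis=1)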
I use transform because it works on a per-group basis, and then I sort the values.
The code above produces my desired dataframe:
a,z,100
a,y,18
a,x,15
c,y,25
c,x,17
c,z,10
b,y,21
b,x,20
b,z,5
But the same code fails for the dataset below:
key1,key2,val
b,y,10
c,y,10
c,z,10
b,x,2
b,z,2
c,x,2
a,x,2
a,y,2
a,z,2
The desired output is:
key1,key2,val
c,y,10
c,z,10
c,x,2
b,y,10
b,x,2
b,z,2
a,x,2
a,y,2
a,z,2
Please help me in properly grouping and sorting the dataframe for my scenario.
Add the column key1 to sort_values: in the second DataFrame several groups share the same maximum value of 10, so sorting by max alone cannot distinguish the groups:
df['max'] = df.groupby(['key1'])['val'].transform('max')
dff=df.sort_values(['max','key1', 'val'], ascending=False).drop('max', axis=1)
print (dff)
key1 key2 val
8 a z 100
7 a y 18
6 a x 15
1 c y 25
5 c x 17
2 c z 10
0 b y 21
3 b x 20
4 b z 5
# the same code applied to the second dataset
df['max'] = df.groupby(['key1'])['val'].transform('max')
dff=df.sort_values(['max','key1', 'val'], ascending=False).drop('max', axis=1)
print (dff)
key1 key2 val
1 c y 10
2 c z 10
5 c x 2
0 b y 10
3 b x 2
4 b z 2
6 a x 2
7 a y 2
8 a z 2
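For reference, here is a minimal self-contained sketch of the approach above, run on the second dataset. The function name sort_groups_by_max and the helper column _group_max are illustrative names of my own, not pandas API, and the sketch assumes that groups with equal maxima should be ordered by key1 descending, as in the desired output.

```python
import pandas as pd


def sort_groups_by_max(df, key, val):
    """Order groups by their largest `val`, then rows inside each group by `val`.

    Everything is sorted descending; `key` is used as a tie-breaker so that
    groups sharing the same maximum are not interleaved.
    """
    group_max = df.groupby(key)[val].transform("max")
    return (
        df.assign(_group_max=group_max)  # hypothetical helper column name
        .sort_values(["_group_max", key, val], ascending=False)
        .drop(columns="_group_max")
    )


# Second dataset from the question (groups b and c share the maximum 10).
df = pd.DataFrame({
    "key1": list("bccbbcaaa"),
    "key2": list("yyzxzxxyz"),
    "val": [10, 10, 10, 2, 2, 2, 2, 2, 2],
})

result = sort_groups_by_max(df, "key1", "val")
print(result)
```

Without key1 in the sort keys, the rows of b and c that hold 10 would be sorted next to each other, which is exactly the interleaving seen in the question.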
I have a DataFrame with city coordinates, like this (example):
x y
A 10 20
B 20 30
C 15 60
I want to calculate their distances from each other, sqrt((x1 - x2)^2 + (y1 - y2)^2) for each pair, as a sort of multiplication table (example):
A B C
A 0 20 30
B 20 0 25
C 30 25 0
How can I do this? I've tried using apply function but need some guidance.
You can use pandas' vectorized arithmetic together with .apply(). Note that this gives each city's distance from the origin, not the distances between cities:
df['distance'] = (df['x'] ** 2 + df['y'] ** 2).apply(np.sqrt)
The easiest way is to use distance_matrix of scipy:
import pandas as pd
from scipy.spatial import distance_matrix
# coordinates from the question (x of C is 15)
df = pd.DataFrame({'x': [10, 20, 15], 'y': [20, 30, 60]}, index=list('ABC'))
pd.DataFrame(distance_matrix(df, df), index=df.index, columns=df.index)
Output:
A B C
A 0.000000 14.142136 40.311289
B 14.142136 0.000000 30.413813
C 40.311289 30.413813 0.000000
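If SciPy is not available, the same table can probably be built with NumPy broadcasting alone. This is only a sketch, assuming the coordinates from the question (x = 10, 20, 15) and plain Euclidean distance; coords, diff and dist_df are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 15], "y": [20, 30, 60]}, index=list("ABC"))

# (n, 2) array of coordinates
coords = df[["x", "y"]].to_numpy(dtype=float)

# Broadcasting (n, 1, 2) against (1, n, 2) yields every pairwise difference.
diff = coords[:, None, :] - coords[None, :, :]

# Euclidean norm over the last axis gives the (n, n) distance table.
dist = np.sqrt((diff ** 2).sum(axis=-1))

dist_df = pd.DataFrame(dist, index=df.index, columns=df.index)
print(dist_df)
```

The intermediate array holds n * n * 2 floats, so for a large number of cities distance_matrix or a chunked computation is likely the better choice.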
Ok, this is getting ridiculous ... I've spent way too much time on something that should be trivial.
I want to group a data frame by a column, then sort the groups (not within the group) by some condition (in my case maximum over some column B in the group).
I expected something along these lines:
df.groupby('A').sort_index(lambda group_content: group_content.B.max())
I also tried:
groups = df.groupby('A')
maxx = groups['B'].max()
groups.sort_index(...)
But, of course, there is no sort_index on a groupby object.
EDIT:
I ended up using (almost) the solution suggested by @jezrael
df['max'] = df.groupby('A')['B'].transform('max')
df = df.sort_values(['max', 'B'], ascending=True).drop('max', axis=1)
groups = df.groupby('A', sort=False)
I had to add ascending=True to sort_values and, more importantly, sort=False to groupby; otherwise the groups would come out sorted lexicographically (A contains strings).
If several groups can share the same maximum, use GroupBy.transform with max to create a helper column and then sort with DataFrame.sort_values:
df = pd.DataFrame({
'A':list('aaabcc'),
'B':[7,8,9,100,20,30]
})
df['max'] = df.groupby('A')['B'].transform('max')
df = df.sort_values(['max','A'])
print (df)
A B max
0 a 7 9
1 a 8 9
2 a 9 9
4 c 20 30
5 c 30 30
3 b 100 100
If the maximum values are always unique per group, use Series.argsort:
s = df.groupby('A')['B'].transform('max')
df = df.iloc[s.argsort()]
print (df)
A B
0 a 7
1 a 8
2 a 9
4 c 20
5 c 30
3 b 100
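To tie this back to the EDIT in the question, here is a small sketch of iterating over the groups in order of their maximum. It assumes the sample frame from this answer; group_max, ordered and group_order are illustrative names. B is added as a last sort key only to make the row order inside each group deterministic.

```python
import pandas as pd

df = pd.DataFrame({
    "A": list("aaabcc"),
    "B": [7, 8, 9, 100, 20, 30],
})

# Per-row copy of each group's maximum, used only as a sort key.
group_max = df.groupby("A")["B"].transform("max")

ordered = (
    df.assign(group_max=group_max)
    .sort_values(["group_max", "A", "B"])
    .drop(columns="group_max")
)

# sort=False keeps groups in order of first appearance in `ordered`,
# instead of re-sorting the group keys lexicographically.
group_order = [name for name, _ in ordered.groupby("A", sort=False)]
print(group_order)
```

With the default sort=True the same loop would visit the groups as a, b, c, which discards the ordering by maximum.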
I have a dataframe and want to use the pct_change method to calculate the % change between only 2 selected columns, B and C, and put the output into a new column. The code below doesn't seem to work. Can anyone help me?
df2 = pd.DataFrame(np.random.randint(0,50,size=(100, 4)), columns=list('ABCD'))
df2['new'] = df2.pct_change(axis=1)['B']['C']
Try:
df2['new'] = df2[['B','C']].pct_change(axis=1)['C']
pct_change computes the percentage change across all columns; select the column you need and assign it to a new column:
df2['new'] = df2.pct_change(axis=1)['C']
A B C D new
0 29 4 29 5 6.250000
1 14 35 2 40 -0.942857
2 5 18 31 10 0.722222
3 17 10 42 41 3.200000
4 24 48 47 35 -0.020833
IIUC, you can just do the following:
df2['new'] = (df2['C']-df2['B'])/df2['B']
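As a quick sanity check, the sketch below compares the pct_change route with the explicit formula. It assumes the five rows printed above rather than random data, and via_pct_change and manual are illustrative names.

```python
import numpy as np
import pandas as pd

# The five sample rows shown in the output above.
df2 = pd.DataFrame({
    "A": [29, 14, 5, 17, 24],
    "B": [4, 35, 18, 10, 48],
    "C": [29, 2, 31, 42, 47],
    "D": [5, 40, 10, 41, 35],
})

# Restrict to B and C first, so the change in C is measured against B.
via_pct_change = df2[["B", "C"]].pct_change(axis=1)["C"]

# Same quantity written out by hand.
manual = (df2["C"] - df2["B"]) / df2["B"]

df2["new"] = manual
print(df2)
print(np.allclose(via_pct_change, manual))
```

One thing to keep in mind with random integers from 0 to 49: whenever B is 0 the division yields inf (or NaN if C is 0 as well), so those rows may need separate handling.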
I have a dataframe with many columns; some of them contain prices and the rest contain volumes, as below:
year_month 0_fx_price_gy 0_fx_volume_gy 1_fx_price_yuy 1_fx_volume_yuy
1990-01 2 10 3 30
1990-01 2 20 2 40
1990-02 2 30 3 50
I need to group by year_month and take the mean of the price columns and the sum of the volume columns.
Is there a quick way to do this in one statement, e.g. average if the column name contains price and sum if it contains volume?
df.groupby('year_month').?
Note: this is just sample data with fewer columns, but the format is similar.
Expected output:
year_month 0_fx_price_gy 0_fx_volume_gy 1_fx_price_yuy 1_fx_volume_yuy
1990-01 2 30 2.5 70
1990-02 2 30 3 50
Create a dictionary that maps each matched column to its aggregation function and pass it to DataFrameGroupBy.agg; finally add reindex if the order of the output columns has changed:
d1 = dict.fromkeys(df.columns[df.columns.str.contains('price')], 'mean')
d2 = dict.fromkeys(df.columns[df.columns.str.contains('volume')], 'sum')
#merge dicts together
d = {**d1, **d2}
print (d)
{'0_fx_price_gy': 'mean', '1_fx_price_yuy': 'mean',
'0_fx_volume_gy': 'sum', '1_fx_volume_yuy': 'sum'}
Another way to build the dictionary:
d = {}
for c in df.columns:
    if 'price' in c:
        d[c] = 'mean'
    if 'volume' in c:
        d[c] = 'sum'
The solution can be simplified if there are only price and volume columns apart from the first column, which is filtered out with df.columns[1:]:
d = {x:'mean' if 'price' in x else 'sum' for x in df.columns[1:]}
df1 = df.groupby('year_month', as_index=False).agg(d).reindex(columns=df.columns)
print (df1)
year_month 0_fx_price_gy 0_fx_volume_gy 1_fx_price_yuy 1_fx_volume_yuy
0 1990-01 2.0 30 2.5 70
1 1990-02 2.0 30 3.0 50
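Putting the pieces together, here is a self-contained sketch using the three sample rows from the question. The name agg_map is illustrative, and the sketch assumes every column other than year_month contains either "price" or "volume" in its name.

```python
import pandas as pd

# The three sample rows from the question.
df = pd.DataFrame({
    "year_month": ["1990-01", "1990-01", "1990-02"],
    "0_fx_price_gy": [2, 2, 2],
    "0_fx_volume_gy": [10, 20, 30],
    "1_fx_price_yuy": [3, 2, 3],
    "1_fx_volume_yuy": [30, 40, 50],
})

# Map each value column to an aggregation based on its name.
agg_map = {}
for col in df.columns.drop("year_month"):
    if "price" in col:
        agg_map[col] = "mean"
    elif "volume" in col:
        agg_map[col] = "sum"

result = (
    df.groupby("year_month", as_index=False)
    .agg(agg_map)
    .reindex(columns=df.columns)  # restore the original column order
)
print(result)
```

Columns that match neither pattern are simply left out of agg_map here, so they would show up as all-NaN after reindex; drop them from the reindex call if that is not wanted.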