Groupby, Transform by Ranking - pandas

I am trying to do a groupby transform by rank, where ties on the same value are broken in order of appearance (method='first') and the ranking itself is descending (ascending=False), rather than doing a groupby rank followed by a pandas merge.
Sample code for the groupby rank and pandas merge approach:
import pandas as pd

data = {
    "id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "value": [10, 10, 20, 20, 30, 30, 40, 40, 20, 20],
}
df = pd.DataFrame(data)

# rank one representative row per id, then merge the ranks back on id
df_rank = df.drop_duplicates().copy()  # copy to avoid SettingWithCopyWarning
df_rank["rank"] = df_rank["value"].rank(method="first", ascending=False)
df = pd.merge(df, df_rank[["id", "rank"]], on="id", how="left")
df
Out[71]:
   id  value  rank
0   1     10   5.0
1   1     10   5.0
2   2     20   3.0
3   2     20   3.0
4   3     30   2.0
5   3     30   2.0
6   4     40   1.0
7   4     40   1.0
8   5     20   4.0
9   5     20   4.0
I want this done with the groupby transform method, or with a more optimized solution. Thanks!
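One merge-free sketch (my addition, not from the original post; it assumes each id maps to a single value, as in the sample): rank the first row of each id once, then broadcast the result back with Series.map instead of merging.
import pandas as pd

data = {
    "id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "value": [10, 10, 20, 20, 30, 30, 40, 40, 20, 20],
}
df = pd.DataFrame(data)

# Rank one row per id (first occurrence), then map the ranks back by id.
ranks = (df.drop_duplicates("id")
           .set_index("id")["value"]
           .rank(method="first", ascending=False))
df["rank"] = df["id"].map(ranks)
This gives the same output as the merge version but skips building and joining an intermediate frame.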

Related

pandas function to find index of the first future instance where column is less than each row's value

I'm new to Stack Overflow, and I just have a question about solving a problem in pandas. I am looking to create a function that returns the index of the first future instance where a column is less than each row's value for that column.
For example, consider the dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Val': [1, 2, 3, 4, 0, 1, -1, -2, -3]}, index = np.arange(0,9))
df
Index  Val
0      1
1      2
2      3
3      4
4      0
5      1
6     -1
7     -2
8     -3
I am looking for the output:
Index  F(Val)
0      4
1      4
2      4
3      4
4      6
5      6
6      7
7      8
8      NaN
Or the series/array equivalent of F(Val).
I've been able to solve this quite easily using for loops, but that is obviously extremely slow on the large dataset I am working with and not a very elegant or optimal solution. My hope is that there is an efficient pandas function that employs vectorization.
Also, as a bonus question (if anyone can assist), how might the maximum of Val between each row's index and its F(Val) index be computed using vectorization? The output should look like:
Index  G(Val)
0      4
1      4
2      4
3      4
4      1
5      1
6     -1
7     -2
8      NaN
Thanks!
You can use:
# flag decreases, delay by one row, and cumsum to label runs of Val
grp = df['Val'].lt(df['Val'].shift()).shift(fill_value=0).cumsum()
# each row's answer is the last index of its run, shifted up one row
df['F(Val)'] = df.groupby(grp).transform(lambda x: x.index[-1]).shift(-1)
print(df)
# Output
Val F(Val)
0 1 4.0
1 2 4.0
2 3 4.0
3 4 4.0
4 0 6.0
5 1 6.0
6 -1 7.0
7 -2 8.0
8 -3 NaN
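For intuition (my annotation, not part of the original answer), this is what the grouper looks like for the sample data; each run ends at the first value that breaks it, which is exactly the index the previous rows need:
print(grp.tolist())
# [0, 0, 0, 0, 0, 1, 1, 2, 3]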
Using numpy broadcasting and the lower triangle:
a = df['Val'].to_numpy()
# m[i, j] is True when i > j and a[i] <= a[j] (lower triangle only)
m = np.tril(a[:, None] <= a, k=-1)
# first qualifying row index per column; NaN where no future value qualifies
df['F(Val)'] = np.where(m.any(0), m.argmax(0), np.nan)
Same logic with expanding:
df['F(Val)'] = (df.loc[::-1, 'Val'].expanding()
                  .apply(lambda x: s.idxmax()
                         if len(s := (x.iloc[-2::-1] <= x.iloc[-1]))
                         else np.nan)
                )
Output (with one difference from the expected output, flagged below):
Val F(Val)
0 1 5.0 # here the next is 5
1 2 4.0
2 3 4.0
3 4 4.0
4 2 5.0
5 -2 7.0
6 -1 7.0
7 -2 8.0
8 -3 NaN
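A practical note (mine, not from the original answer): the broadcasting variant materializes an n x n boolean matrix, so its memory use is quadratic and it is best for short Series; the groupby variant is linear, and the expanding variant calls a Python lambda per row, which makes it the slowest on large inputs.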

Slice a dataframe by max value as an index

I currently have a dataframe of strain and stress, containing corresponding values. I want to slice the dataframe in a particular way - I want to find the max value in stress, and then take the next 5 rows of the dataframe. (I don't want to just find all the highest values in the column and sort by that.) Here is what I'm doing currently:
import pandas as pd
df = pd.DataFrame({"strain": [1,2,4,6,2,4,7,4,8,3,4,7,3,3,6,4,7,4,3,2],
"stress": [0,0.2,0.5,0.8,0.7,1,0.7,0.6,0.7,0.8,0.4,0.2,0,-0.5,-0.8,-1,-0.8,-0.9,-0.7,-0.6]})
#Sort by stress values
new_df = df.copy()
new_df = new_df.sort_values(by = ['stress'], ascending = False)
new_df = new_df[0:5]
And this is my current output:
print(new_df)
strain stress
5 4 1.0
3 6 0.8
9 3 0.8
4 2 0.7
6 7 0.7
So my code is sorting by the highest values in stress. However, I want to maintain the row order following the highest value in the column. This would be my expected output:
print(new_df)
strain stress
5 4 1.0
6 7 0.7
7 4 0.6
8 8 0.7
9 3 0.8
You can use argmax to find the index of the maximum:
imax = df.stress.argmax()  # positional index of the maximum stress
df.iloc[imax:imax + 5]     # the max row plus the four rows after it
Result:
strain stress
5 4 1.0
6 7 0.7
7 4 0.6
8 8 0.7
9 3 0.8
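A label-based variant (my addition; it assumes the default unique RangeIndex of the sample, where labels and positions coincide):
imax = df["stress"].idxmax()  # index label of the maximum stress
df.loc[imax:imax + 4]         # .loc slicing is inclusive of both ends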

Calculating temporal and spatial gradients while using groupby in a multi-index pandas dataframe

Say I have the following sample pandas dataframe of water content (i.e. "wc") values at specified depths along a column of soil:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 5, 3, 1], [1, 3, 5, 3, 2], [4, 6, 6, 3, 1],
                   [1, 2, 5, 3, 1], [1, 3, 5, 3, 2], [4, 6, 6, 3, 1]],
                  columns=pd.MultiIndex.from_product([['wc'], [10, 20, 30, 45, 80]]))
df['model'] = [5,5, 5, 6,6,6]
df['time'] = [0, 1, 2,0, 1, 2]
df.set_index(['time', 'model'], inplace=True)
>>> df
[Out]:
wc
10 20 30 45 80
time model
0 5 1 2 5 3 1
1 5 1 3 5 3 2
2 5 4 6 6 3 1
0 6 1 2 5 3 1
1 6 1 3 5 3 2
2 6 4 6 6 3 1
I would like to calculate the spatial (between columns) and temporal (between rows) gradients for each model "group" in the following structure:
wc temp_grad spat_grad
10 20 30 45 80 10 20 30 45 80 10 20 30 45
time model
0 5 1 2 5 3 1
1 5 1 3 5 3 2
2 5 4 6 6 3 1
0 6 1 2 5 3 1
1 6 1 3 5 3 2
2 6 4 6 6 3 1
My attempt involved writing a function first for the temporal gradients and combining this with groupby:
def temp_grad(df):
    # gradient of the depth-10 column along the time level of the index
    temp_grad = np.gradient(df[('wc', 10.0)], df.index.get_level_values(0))
    return pd.Series(temp_grad, index=df.index)

df[('temp_grad', 10.0)] = (df.groupby(level=['model'], group_keys=False)
                             .apply(temp_grad))
but I am not sure how to automate this so it applies to all wc columns, nor how to navigate the multi-indexing issues.
Assuming the function you wrote does what you want, then for temp_grad you can handle all the columns at once inside the apply: use np.gradient the same way you did in your function, but along axis=0 (rows), and build a DataFrame with the index and columns of the original data. For spat_grad, the model grouping does not matter, so no groupby is needed: call np.gradient directly on df['wc'], this time along axis=1 (columns), and build a DataFrame the same way. To get the expected output, concat all three of them:
df = pd.concat(
    [df['wc'],  # original data
     # temp_grad: all columns at once, gradient along axis=0 (time),
     # built into a dataframe with the group's own index and columns
     df['wc'].groupby(level=['model'], group_keys=False)
             .apply(lambda x: pd.DataFrame(
                 np.gradient(x, x.index.get_level_values(0), axis=0),
                 columns=x.columns, index=x.index)),
     # spat_grad: a row-wise operation, so no groupby is needed;
     # gradient along axis=1 (columns), with the depths as coordinates
     pd.DataFrame(np.gradient(df['wc'], df['wc'].columns, axis=1),
                  columns=df['wc'].columns, index=df['wc'].index)],
    keys=['wc', 'temp_grad', 'spat_grad'],  # rebuild the MultiIndex columns
    axis=1  # concat along the columns
)
and you get
print(df)
wc temp_grad spat_grad \
10 20 30 45 80 10 20 30 45 80 10 20
time model
0 5 1 2 5 3 1 0.0 1.0 0.0 0.0 1.0 0.1 0.2
1 5 1 3 5 3 2 1.5 2.0 0.5 0.0 0.0 0.2 0.2
2 5 4 6 6 3 1 3.0 3.0 1.0 0.0 -1.0 0.2 0.1
0 6 1 2 5 3 1 0.0 1.0 0.0 0.0 1.0 0.1 0.2
1 6 1 3 5 3 2 1.5 2.0 0.5 0.0 0.0 0.2 0.2
2 6 4 6 6 3 1 3.0 3.0 1.0 0.0 -1.0 0.2 0.1
30 45 80
time model
0 5 0.126667 -0.110476 -0.057143
1 5 0.066667 -0.101905 -0.028571
2 5 -0.080000 -0.157143 -0.057143
0 6 0.126667 -0.110476 -0.057143
1 6 0.066667 -0.101905 -0.028571
2 6 -0.080000 -0.157143 -0.057143
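For reference (my check, not from the original answer), the spat_grad values can be reproduced directly with np.gradient on one row: with non-uniform spacing it uses one-sided differences at the edges and weighted central differences in the interior.
import numpy as np

row = np.array([1, 2, 5, 3, 1], dtype=float)          # wc values of the first row
depths = np.array([10, 20, 30, 45, 80], dtype=float)  # column coordinates
print(np.gradient(row, depths))
# [ 0.1         0.2         0.12666667 -0.11047619 -0.05714286]
These match the first spat_grad row in the output above.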

Why do I get different group size numbers using pandas groupby() with or without column selection?

I am trying to use numpy.size() to count the group sizes for the groups from a pandas DataFrame groupby(), and I get a strange result.
>>> df=pd.DataFrame({'A':[1,1,2,2], 'B':[1,2,3,4],'C':[0.11,0.32,0.93,0.65],'D':["This","That","How","What"]})
>>> df
A B C D
0 1 1 0.11 This
1 1 2 0.32 That
2 2 3 0.93 How
3 2 4 0.65 What
>>> df.groupby('A',as_index=False).agg(np.size)
A B C D
0 1 2 2.0 2
1 2 2 2.0 2
>>> df.groupby('A',as_index=False)['C'].agg(np.size)
A C
0 1 8
1 2 8
>>> df.groupby('A',as_index=False)[['C']].agg(np.size)
A C
0 1 2.0
1 2 2.0
>>> grouped = df.groupby('A',as_index=False)
>>> grouped['C','D'].agg(np.size)
A C D
0 1 2.0 2
1 2 2.0 2
In the code, if groupby() is followed by ['C'], the group size is 8, which equals the correct group size times the number of columns, that is 2 * 4; if groupby() is followed by [['C']] or ['C','D'], the group size is right.
Why?
It seems that pandas tries to execute the aggregation first and then does the actual column selection.
If you want to know the group sizes, use one of these:
grouped.size()
grouped.agg("size")
Note that len(grouped) returns the number of groups, not their sizes.
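A quick demonstration of that explanation (my addition, not from the original answer): applying np.size to each whole group counts every cell of the 2 x 4 sub-frame, which is where the 8 comes from.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4],
                   'C': [0.11, 0.32, 0.93, 0.65],
                   'D': ["This", "That", "How", "What"]})

for key, group in df.groupby('A'):
    print(key, np.size(group))  # prints "1 8" and "2 8": 2 rows x 4 columns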

pandas aggregate include all groups

I have the following issue with groupby aggregation, i.e. adding groups which are not present in the dataframe but which, based on the desired output, should be included. An example:
import pandas as pd
from io import StringIO
csvdata = StringIO("""day,sale
1,1
2,4
2,10
4,7
5,2.3
7,4.4
2,3.4""")
# days 3 and 6 are intentionally not included here, but I'd like to have them in the output
df = pd.read_csv(csvdata, sep=",")
df1=df.groupby(['day'])['sale'].agg('sum').reset_index().rename(columns={'sale':'dailysale'})
df1
How can I get the following? Thank you!
day  dailysale
1    1.0
2    17.4
3    0.0
4    7.0
5    2.3
6    0.0
7    4.4
You can add Series.reindex with a specified range after aggregating the sum:
df1 = (df.groupby(['day'])['sale']
         .sum()
         .reindex(range(1, 8), fill_value=0)
         .reset_index(name='dailysale'))
print (df1)
day dailysale
0 1 1.0
1 2 17.4
2 3 0.0
3 4 7.0
4 5 2.3
5 6 0.0
6 7 4.4
Another idea is to use an ordered categorical, so that the sum aggregation adds the missing rows:
df['day'] = pd.Categorical(df['day'], categories=range(1, 8), ordered=True)
df1 = df.groupby(['day'])['sale'].sum().reset_index(name='dailysale')
print (df1)
day dailysale
0 1 1.0
1 2 17.4
2 3 0.0
3 4 7.0
4 5 2.3
5 6 0.0
6 7 4.4
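One caveat with the categorical approach (my addition, not from the original answer): recent pandas versions warn that the default observed=False for categorical groupers will change in the future, so it is safer to pass it explicitly; observed=False is what keeps the empty categories (days 3 and 6) in the result.
df1 = (df.groupby(['day'], observed=False)['sale']
         .sum()
         .reset_index(name='dailysale'))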