Sorting Pandas data frame with groupby and conditions - pandas

I'm trying to sort a data frame based on groups meeting conditions.
I'm getting a syntax error from the way I'm sorting the groups.
I'm also losing the initial order of the data frame before attempting the above.
This is the order of sorting that I'm trying to achieve:
1) Sort on First and Test columns.
2) Test==1 groups, sort on Secondary then by Final column.
---Test==0 groups, sort on Final column only.
import pandas as pd
df=pd.DataFrame({"First":[100,100,100,100,100,100,200,200,200,200,200],"Test":[1,1,1,0,0,0,0,1,1,1,0],"Secondary":[.1,.1,.1,.2,.2,.3,.3,.3,.3,.4,.4],"Final":[1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,10.10,11.11]})
def sorter(x):
    if x["Test"]==1:
        x.sort_values(['Secondary','Final'], inplace=True)
    else:
        x=x.sort_values('Final', inplace=True)
df=df.sort_values(["First","Test"],ascending=[False, False]).reset_index(drop=True)
df.groupby(['First','Test']).apply(lambda x: sorter(x))
df
Expected result:
First Test Secondary Final
200 1 0.4 10.1
200 1 0.3* 9.9*
200 1 0.3* 8.8*
200 0 0.4 11.11*
200 0 0.3 7.7*
100 1 0.5 2.2
100 1 0.1* 3.3*
100 1 0.1* 1.1*
100 0 0.3 6.6*
100 0 0.2 5.5*
100 0 0.2 4.4*

You can try sorting each group in descending order.
Given the sequence you described, the order of sorting will change. Will this work for you?
df=pd.DataFrame({"First":[100,100,100,100,100,100,200,200,200,200,200],"Test":[1,1,1,0,0,0,0,1,1,1,0],"Secondary":[.1,.5,.1,.9,.4,.1,.3,.3,.3,.4,.4],"Final":[1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,10.10,11.11]})
df = df.groupby(['First','Test']).apply(lambda x: x.sort_values(['First','Test','Secondary','Final'],ascending=False) if x.iloc[0]['Test']==1 else x.sort_values(['First','Test','Final'],ascending=False)).reset_index(drop=True)
df.sort_values(['First','Test'],ascending=[True,False])
Out:
Final First Secondary Test
3 2.20 100 0.5 1
4 3.30 100 0.1 1
5 1.10 100 0.1 1
0 6.60 100 0.1 0
1 5.50 100 0.4 0
2 4.40 100 0.9 0
8 10.10 200 0.4 1
9 9.90 200 0.3 1
10 8.80 200 0.3 1
6 11.11 200 0.4 0
7 7.70 200 0.3 0

The trick was to sort subsets separately and replace the values in the original df.
This came up in other solutions to pandas sorting problems.
import pandas as pd
df=pd.DataFrame({"First":[100,100,100,100,100,100,200,200,200,200,200],"Test":[1,1,1,0,0,0,0,1,1,1,0],"Secondary":[.1,.5,.1,.9,.4,.1,.3,.3,.3,.4,.4],"Final":[1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,10.10,11.11]})
df.sort_values(['First','Test','Secondary','Final'],ascending=False, inplace=True)
index_subset=df[df["Test"]==0].index
sorted_subset=df[df["Test"]==0].sort_values(['First','Final'],ascending=False)
df.loc[index_subset,:]=sorted_subset.values
print(df)
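The conditional per-group sort can also be sketched by sorting each group on its own and concatenating the pieces. This is only a minimal sketch, assuming descending order everywhere as in the expected result; `sort_group`, `pieces` and `result` are names made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "First": [100, 100, 100, 100, 100, 100, 200, 200, 200, 200, 200],
    "Test": [1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0],
    "Secondary": [.1, .5, .1, .9, .4, .1, .3, .3, .3, .4, .4],
    "Final": [1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.10, 11.11],
})

def sort_group(group):
    # Test==1 groups sort on Secondary, then Final; Test==0 groups on Final only
    keys = ["Secondary", "Final"] if group["Test"].iloc[0] == 1 else ["Final"]
    return group.sort_values(keys, ascending=False)

# groupby yields the (First, Test) keys in ascending order
pieces = [sort_group(group) for _, group in df.groupby(["First", "Test"])]

# reverse the list of groups to get First and Test descending
result = pd.concat(pieces[::-1], ignore_index=True)
print(result)
```

Iterating over the groups instead of calling `apply` keeps the group order explicit, at the cost of a Python-level loop.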

Related

How to add Multilevel Columns and create new column?

I am trying to create a "total" column in my dataframe
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
My dataframe
Room 1 Room 2 Room 3
on off on off on off
0 1 4 3 6 5 15
1 3 2 1 5 1 7
For each room, I want to create a total column and then a on% column.
I have tried the following, however, it does not work.
df.loc[:, slice(None), "total" ] = df.xs('on', axis=1,level=1) + df.xs('off', axis=1,level=1)
Let us try something fancy ~
df.stack(0).eval('total=on + off \n on_pct=on / total').stack().unstack([1, 2])
Room 1 Room 2 Room 3
off on total on_pct off on total on_pct off on total on_pct
0 4.0 1.0 5.0 0.2 6.0 3.0 9.0 0.333333 15.0 5.0 20.0 0.250
1 2.0 3.0 5.0 0.6 5.0 1.0 6.0 0.166667 7.0 1.0 8.0 0.125
Oof, this was a rough one, but you can do it like this if you want to avoid loops. Worth noting that it redefines your df twice, because I need the total columns first. Sorry about that, but it is the best I could do. If you have any questions, just comment.
import numpy as np
df = pd.concat([y.assign(**{'Total {0}'.format(x+1): y.iloc[:,0] + y.iloc[:,1]}) for x, y in df.groupby(np.arange(df.shape[1])//2,axis=1)],axis=1)
df = pd.concat([y.assign(**{'Percentage_Total{0}'.format(x+1): (y.iloc[:,0] / y.iloc[:,2])*100}) for x, y in df.groupby(np.arange(df.shape[1])//3,axis=1)],axis=1)
print(df)
This groups by the column's first index (rooms) and then loops through each group to add the total and percent on. The final step is to reindex using the unique rooms:
import pandas as pd
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
for room, group in df.groupby(level=0, axis=1):
    df[(room, 'total')] = group.sum(axis=1)
    df[(room, 'pct_on')] = group[(room, 'on')] / df[(room, 'total')]
result = df.reindex(columns=df.columns.get_level_values(0).unique(), level=0)
Output:
Room 1 Room 2 Room 3
on off total pct_on on off total pct_on on off total pct_on
0 1 4 5 0.2 3 6 9 0.333333 5 15 20 0.250
1 3 2 5 0.6 1 5 6 0.166667 1 7 8 0.125
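Since recent pandas releases deprecate `groupby(..., axis=1)`, here is a hedged sketch of the same result without grouping over columns: pull the 'on' and 'off' slices with `xs`, compute the new columns, and glue everything back together. The names `metrics`, `rooms` and `result` are only illustrative.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['Room 1', 'Room 2', 'Room 3'], ['on', 'off']])
df = pd.DataFrame([[1, 4, 3, 6, 5, 15], [3, 2, 1, 5, 1, 7]], columns=idx)

# one plain frame per metric, with the rooms as columns
on = df.xs('on', axis=1, level=1)
off = df.xs('off', axis=1, level=1)
total = on + off
pct_on = on / total

# concat with a dict puts the metric names on the outer column level,
# so swap the levels to get (room, metric)
metrics = {'on': on, 'off': off, 'total': total, 'pct_on': pct_on}
result = pd.concat(metrics, axis=1).swaplevel(axis=1)

# put the columns in room order, with the metrics in a fixed order per room
rooms = df.columns.get_level_values(0).unique()
result = result.reindex(columns=pd.MultiIndex.from_product([rooms, list(metrics)]))
print(result)
```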

pandas dataframe how to replace extreme outliers for all columns

I have a pandas dataframe with some very extreme values (more than 5 std).
I want to replace, per column, each value that is more than 5 std with the max of the other values in that column.
For example,
df = A B
1 2
1 6
2 8
1 115
191 1
Will become:
df = A B
1 2
1 6
2 8
1 8
2 1
What is the best way to do it without a for loop over the columns?
s=df.mask((df-df.apply(lambda x: x.std() )).gt(5))#mask where condition applies
s=s.assign(A=s.A.fillna(s.A.max()),B=s.B.fillna(s.B.max())).sort_index(axis = 0)#fill with max per column and resort frame
A B
0 1.0 2.0
1 1.0 6.0
2 2.0 8.0
3 1.0 8.0
4 2.0 1.0
Per the discussion in the comments, you need to decide what your threshold is. Say it is q=100; then you can do
q = 100
df.loc[df['A'] > q,'A'] = max(df.loc[df['A'] < q,'A'] )
df
This fixes column A:
A B
0 1 2
1 1 6
2 2 8
3 1 115
4 2 1
Do the same for B.
Calculate a column-wise z-score (assuming you deem a value an outlier if it lies more than a given number of standard deviations from the column mean), then build a boolean mask of the values outside your desired range:
def calc_zscore(col):
    return (col - col.mean()) / col.std()
zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5
After that it's up to you to fill the values marked with the boolean mask.
df[outlier_mask] = something
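One way to fill the masked values, sketched under the assumption that "the max other value" means the largest non-outlier in the same column. The function name `replace_outliers_with_max` and the sample data are made up for illustration. Note that with only a handful of rows no value can reach a z-score of 5, so the sample below uses 51 rows.

```python
import pandas as pd

def replace_outliers_with_max(df, n_std=5.0):
    # column-wise z-scores; mean() and std() are computed per column
    zscores = (df - df.mean()) / df.std()
    outlier_mask = zscores.abs() > n_std
    # mask() turns outliers into NaN, so max() only sees the remaining values
    kept = df.mask(outlier_mask)
    # fillna with a Series fills each column with its own value
    return kept.fillna(kept.max())

# hypothetical data: column A has one extreme value, column B has none
df = pd.DataFrame({
    "A": [1] * 49 + [2, 1000],
    "B": list(range(51)),
})
result = replace_outliers_with_max(df)
print(result.tail())
```

This also catches extreme negative values because of `.abs()`; drop it if only large positive values should count.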

Pandas cumsum only if positive else zero

I am making a table where I want to show that if there's no income, no expense can happen.
It's a cumulative sum table.
This is what I have:
Incoming  Outgoing  Total
0         150       -150
10        20        -160
100       30        -90
50        70        -110
Required output
Incoming  Outgoing  Total
0         150       0
10        20        0
100       30        70
50        70        50
I've tried
df.clip(lower=0)
and
df['new_column'].apply(lambda x : df['outgoing']-df['incoming'] if df['incoming']>df['outgoing'])
That doesn't work either.
Is there any other way?
Update:
A more straightforward approach inspired by your code using clip and without numpy:
diff = df['Incoming'].sub(df['Outgoing'])
df['Total'] = diff.mul(diff.ge(0).cumsum().clip(0, 1)).cumsum()
print(df)
# Output:
Incoming Outgoing Total
0 0 150 0
1 10 20 0
2 100 30 70
3 50 70 50
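To see why the one-liner works, here is the same idea as a self-contained sketch with the intermediate steps named. The helper name `running_total_from_first_surplus` is made up for illustration.

```python
import pandas as pd

def running_total_from_first_surplus(df):
    # per-row balance: income minus expense
    diff = df["Incoming"].sub(df["Outgoing"])
    # 0 before the first non-negative balance, 1 from that row onwards
    started = diff.ge(0).cumsum().clip(0, 1)
    # zero out the rows before the start, then accumulate
    return diff.mul(started).cumsum()

df = pd.DataFrame({
    "Incoming": [0, 10, 100, 50],
    "Outgoing": [150, 20, 30, 70],
})
df["Total"] = running_total_from_first_surplus(df)
print(df)
```

For this data `diff` is [-150, -10, 70, -20] and `started` is [0, 0, 1, 1], so the cumulative sum runs over [0, 0, 70, -20].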
Old answer:
Find the row where the balance is positive for the first time then compute the cumulative sum from this point:
import numpy as np
start = np.where(df['Incoming'] - df['Outgoing'] >= 0)[0][0]
df['Total'] = df.iloc[start:]['Incoming'].sub(df.iloc[start:]['Outgoing']) \
    .cumsum().reindex(df.index, fill_value=0)
Output:
>>> df
Incoming Outgoing Total
0 0 150 0
1 10 20 0
2 100 30 70
3 50 70 50
IIUC, you can check when Incoming is greater than Outgoing using np.where and assign a helper column. Then you can check when this new column is not null, using notnull(), calculate the difference, and use cumsum() on the result:
import numpy as np
df['t'] = np.where(df['Incoming'].ge(df['Outgoing']),0,np.nan)
df['t'] = df['t'].ffill()
df['Total'] = np.where(df['t'].notnull(),(df['Incoming'].sub(df['Outgoing'])),df['t'])
df['Total'] = df['Total'].cumsum()
df.drop('t',axis=1,inplace=True)
This will give back:
Incoming Outgoing Total
0 0 150 NaN
1 10 20 NaN
2 100 30 70.0
3 50 70 50.0

Get value of variable quantile per group

I have data that is categorized in groups, with a given quantile percentage per group. I want to create a threshold for each group that separates all values within the group based on the quantile percentage. So if one group has q=0.8, I want the lowest 80% of values given 1, and the upper 20% of values given 0.
So, given the data like this:
I want objects 1, 2 and 5 to get result 1 and the other 3 to get result 0. In total my data consists of 7,000,000 rows with 14,000 groups. I tried doing this with groupby.quantile, but that requires a constant quantile, whereas my data has a different one for each group.
Setup:
import numpy as np
import pandas as pd
num = 7_000_000
grp_num = 14_000
qua = np.around(np.random.uniform(size=grp_num), 2)
df = pd.DataFrame({
    "Group": np.random.randint(low=0, high=grp_num, size=num),
    "Quantile": 0.0,
    "Value": np.random.randint(low=100, high=300, size=num)
}).sort_values("Group").reset_index(0, drop=True)
def func(grp):
    grp["Quantile"] = qua[grp.Group]
    return grp
df = df.groupby("Group").apply(func)
Answer: (This is basically a for loop, so for performance you can try to apply numba to this)
def func2(grp):
    return grp.Value < grp.Value.quantile(grp.Quantile.iloc[0])
df["result"] = df.groupby("Group").apply(func2).reset_index(0, drop=True)
print(df)
Outputs:
Group Quantile Value result
0 0 0.33 156 1
1 0 0.33 259 0
2 0 0.33 166 1
3 0 0.33 183 0
4 0 0.33 111 1
... ... ... ... ...
6999995 13999 0.83 194 1
6999996 13999 0.83 227 1
6999997 13999 0.83 215 1
6999998 13999 0.83 103 1
6999999 13999 0.83 115 1
[7000000 rows x 4 columns]
CPU times: user 14.2 s, sys: 362 ms, total: 14.6 s
Wall time: 14.7 s
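A small variant of the same idea, shown on a tiny made-up frame so the numbers can be checked by hand: compute one threshold per group, map it back onto the rows, then compare. This is only a sketch; the `threshold` column and the sample data are assumptions, and it still loops over the groups in Python.

```python
import pandas as pd

df = pd.DataFrame({
    "Group": ["a"] * 5 + ["b"] * 5,
    "Quantile": [0.8] * 5 + [0.5] * 5,
    "Value": [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})

# one threshold per group, using that group's own quantile
thresholds = {
    name: grp["Value"].quantile(grp["Quantile"].iloc[0])
    for name, grp in df.groupby("Group")
}

# broadcast the thresholds back to the rows and compare
df["threshold"] = df["Group"].map(thresholds)
df["result"] = (df["Value"] < df["threshold"]).astype(int)
print(df)
```

With the default linear interpolation, group "a" gets a threshold of about 4.2 and group "b" gets 30, so a value exactly at the threshold gets 0; use `<=` if it should get 1.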

Forcing dataframe recalculation after a change of a specific cell

I start with a simple
df = pd.DataFrame({'units':[30,20]})
And I get
units
0 30
1 20
I then add a row to total the column:
my_sum = df.sum()
df = df.append(my_sum, ignore_index=True)
Finally, I add a column to calculate percentages off of the 'units' column:
df['pct'] = df.units / df.units[:-1].sum()
ending with this:
units pct
0 30 0.6
1 20 0.4
2 50 1.0
So far so good, but now the question: I want to change the middle number of units from 20 to, for example, 40. I can use this:
df.iloc[1, 0] = 40
or
df.iat[1, 0] = 40
which changes the cell, but the calculated values in both the last row and the second column don't change to reflect it:
units pct
0 30 0.6
1 40 0.4
2 50 1.0
How do I force these calculated values to adjust following the change in that particular cell?
Make a function that calculates it
def f(df):
    return df.append(df.sum(), ignore_index=True).assign(
        pct=lambda d: d.units / d.units.iat[-1])
df.iat[1, 0] = 40
f(df)
units pct
0 30 0.428571
1 40 0.571429
2 70 1.000000
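Note that `DataFrame.append` was removed in pandas 2.0, so on current versions the function above raises an AttributeError. A minimal sketch of the same idea with `pd.concat` follows; the name `with_totals` is made up for illustration.

```python
import pandas as pd

def with_totals(df):
    # column sums as a one-row frame, so it can be concatenated below the data
    total = df.sum().to_frame().T
    out = pd.concat([df, total], ignore_index=True)
    # the last row holds the total, so every row is divided by it
    return out.assign(pct=lambda d: d["units"] / d["units"].iat[-1])

df = pd.DataFrame({"units": [30, 20]})
df.iat[1, 0] = 40
result = with_totals(df)
print(result)
```

As in the answer above, keep `df` as the raw data and call the function again after every change, rather than storing the total row in `df` itself.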