Mask minimum values per group in a `pd.DataFrame` - pandas

Given a pd.DataFrame containing different time series in different groups, I want to create a mask over all rows that indicates, per group, the timepoints at which the minimum value with respect to type 0 is reached:
For example, given the pd.DataFrame:
>>> df
  group  type  time  value
0     A     0     0      4
1     A     0     1      5
2     A     1     0      6
3     A     1     1      7
4     B     0     0     11
5     B     0     1     10
6     B     1     0      9
7     B     1     1      8
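For reproducibility, the example frame can be built like this (my own construction from the values shown above, not part of the original question):

import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'type':  [0, 0, 1, 1, 0, 0, 1, 1],
    'time':  [0, 1, 0, 1, 0, 1, 0, 1],
    'value': [4, 5, 6, 7, 11, 10, 9, 8],
})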
In group A the minimum for type 0 is reached at timepoint 0; in group B the minimum for type 0 is reached at timepoint 1. Therefore, the resulting column should look like:
   is_min
0    True
1   False
2    True
3   False
4   False
5    True
6   False
7    True
I have created a version that seems very cumbersome: first finding the minima locations, then constructing the final column:
def get_minima(df):
    type_mask = df.type == 0
    min_value = df[type_mask].value.min()
    value_mask = df.value == min_value
    return df[type_mask & value_mask].time.max()

min_ts = df.groupby('group').apply(get_minima)
df['is_min'] = df.apply(lambda row: min_ts[row.group] == row.time, axis=1)

IIUC, you can try groupby + apply with min:
df['is_min'] = (df.groupby(['group', 'type'])['value']
                  .apply(lambda x: x == x.min()))
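A caveat, depending on your pandas version: since pandas 2.0 group_keys defaults to True for apply, so the result may come back with the group keys prepended to the index, which breaks the assignment. Passing group_keys=False keeps the original index (a sketch):

df['is_min'] = (df.groupby(['group', 'type'], group_keys=False)['value']
                  .apply(lambda x: x == x.min()))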
Same idea with transform + min to get the minimum, and eq to create the desired mask:
df['is_min'] = (df.groupby(['group', 'type'])['value']
                  .transform('min').eq(df['value']))
Output:
df
  group  type  time  value  is_min
0     A     0     0      4    True
1     A     0     1      5   False
2     A     1     0      6    True
3     A     1     1      7   False
4     B     0     0     11   False
5     B     0     1     10    True
6     B     1     0      9   False
7     B     1     1      8    True

You can remove the rows with an excluding merge: sort the values, subset to type == 0, and drop_duplicates to get the time per group that you need to exclude. Then merge with an indicator and keep only the left-only rows.
m = (df.sort_values('value').query('type == 0').drop_duplicates('group')
       .drop(columns=['type', 'value']))
#   group  time
# 0     A     0
# 5     B     1

df = (df.merge(m, how='outer', indicator=True).query('_merge == "left_only"')
        .drop(columns='_merge'))
  group  type  time  value
2     A     0     1      5
3     A     1     1      7
4     B     0     0     11
5     B     1     0      9
If you need the mask separately and don't want to automatically query to subset the rows, map the indicator:
df = df.merge(m, how='outer', indicator='is_min')
df['is_min'] = df['is_min'].map({'left_only': False, 'both': True})
  group  type  time  value  is_min
0     A     0     0      4    True
1     A     1     0      6    True
2     A     0     1      5   False
3     A     1     1      7   False
4     B     0     0     11   False
5     B     1     0      9   False
6     B     0     1     10    True
7     B     1     1      8    True
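Note that the outer merge re-sorts the rows by the join keys, as visible above. To keep df's original row order, a left merge (which preserves the order of the left frame) should work as well; a sketch reusing m from above:

df = df.merge(m, how='left', indicator='is_min')
df['is_min'] = df['is_min'].eq('both')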

Related

Pandas groupby of specific categorical column

With reference to Pandas groupby with categories with redundant nan:
import pandas as pd

df = pd.DataFrame({"TEAM": [1, 1, 1, 1, 2, 2, 2],
                   "ID": [1, 1, 2, 2, 8, 4, 5],
                   "TYPE": ["A", "B", "A", "B", "A", "A", "A"],
                   "VALUE": [1, 1, 1, 1, 1, 1, 1]})
df["TYPE"] = df["TYPE"].astype("category")
df = df.groupby(["TEAM", "ID", "TYPE"]).sum()
             VALUE
TEAM ID TYPE
1    1  A        1
        B        1
     2  A        1
        B        1
     4  A        0
        B        0
     5  A        0
        B        0
     8  A        0
        B        0
2    1  A        0
        B        0
     2  A        0
        B        0
     4  A        1
        B        0
     5  A        1
        B        0
     8  A        1
        B        0
Expected output
             VALUE
TEAM ID TYPE
1    1  A        1
        B        1
     2  A        1
        B        1
2    4  A        1
        B        0
     5  A        1
        B        0
     8  A        1
        B        0
I tried using astype("category") for TYPE, but it outputs the Cartesian product of every category level within every group.
What you want is a little unusual, but we can force it out of a pivot table:
out = df.pivot_table(index=['TEAM', 'ID'],
                     columns=['TYPE'],
                     values=['VALUE'],
                     aggfunc='sum',
                     observed=True,  # the key when working with categoricals;
                                     # also worth trying with the groupby from the post you linked
                     fill_value=0).stack()
print(out)
Output:
             VALUE
TEAM ID TYPE
1    1  A        1
        B        1
     2  A        1
        B        1
2    4  A        1
        B        0
     5  A        1
        B        0
     8  A        1
        B        0
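For comparison, observed=True on the groupby alone would not reproduce the expected output: it drops every unobserved (TEAM, ID, TYPE) combination, including the zero-valued B rows you want to keep, which is why the pivot_table/fill_value/stack detour is needed. A sketch, starting from the raw frame before the original groupby:

df.groupby(["TEAM", "ID", "TYPE"], observed=True).sum()
# team 2 keeps only its observed A rows; no B rows at all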
Here is one way to do it, based on the data you shared: reset the index, then use groupby with transform to keep the groups whose sum is greater than 0 (meaning at least one of the categories A or B is non-zero), and finally set the index back.
df.reset_index(inplace=True)
(df[df.groupby(['TEAM', 'ID'])['VALUE']
      .transform(lambda x: x.sum() > 0)]
   .set_index(['TEAM', 'ID', 'TYPE']))
             VALUE
TEAM ID TYPE
1    1  A        1
        B        1
     2  A        1
        B        1
2    4  A        1
        B        0
     5  A        1
        B        0
     8  A        1
        B        0

Fill rows in a data frame with a specific value based on a condition on a specific column

I have a data frame df:
df =
 A  B  C  D
 1  4  7  2
 2  6 -3  9
-2  7  2  4
I am interested in changing a whole row's values to 0 if its element in column C is negative, i.e. if df['C'] < 0, the corresponding row should be filled with the value 0, as shown below:
df =
 A  B  C  D
 1  4  7  2
 0  0  0  0
-2  7  2  4
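For reference, a minimal construction of this example frame (my own sketch, not part of the original question):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, -2], 'B': [4, 6, 7],
                   'C': [7, -3, 2], 'D': [2, 9, 4]})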
You can use DataFrame.where or mask:
df.where(df['C'] >= 0, 0)

   A  B  C  D
0  1  4  7  2
1  0  0  0  0
2 -2  7  2  4
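The complementary DataFrame.mask replaces values where the condition is True, so an equivalent call would be (a sketch):

df.mask(df['C'] < 0, 0)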
Another option, provided all columns are numeric, is simple masking via multiplication:
df.mul(df['C'] >= 0, axis=0)

   A  B  C  D
0  1  4  7  2
1  0  0  0  0
2 -2  7  2  4
You can also set values directly via loc as shown in this comment:
df.loc[df['C'] < 0] = 0
df

   A  B  C  D
0  1  4  7  2
1  0  0  0  0
2 -2  7  2  4
Which has the added benefit of modifying the original DataFrame (if you'd rather not return a copy).

Maximum of calculated pandas column and 0

I have a very simple problem (I guess) but can't find the right syntax for it.
Given the following DataFrame:

   A   B  C
0  7  12  2
1  5   4  4
2  4   8  2
3  9   2  3
I need to create a new column D equal, for each row, to max(0, A - B + C).
I tried np.maximum(df.A - df.B + df.C, 0), but it doesn't match: it gives me the maximum value of the calculated column for every row (= 10 in the example).
Finally, I would like to obtain the DataFrame below:

   A   B  C   D
0  7  12  2   0
1  5   4  4   5
2  4   8  2   0
3  9   2  3  10
Any help appreciated
Thanks
Let us try eval with clip:
df['D'] = df.eval('A-B+C').clip(lower=0)
Out[256]:
0     0
1     5
2     0
3    10
dtype: int64
You can use np.where:
s = df["A"] - df["B"] + df["C"]
df["D"] = np.where(s > 0, s, 0)  # or s.where(s > 0, 0)
print(df)

   A   B  C   D
0  7  12  2   0
1  5   4  4   5
2  4   8  2   0
3  9   2  3  10
To do this in one line you can use apply to apply the maximum function to each row separately.
In [19]: df['D'] = df.apply(lambda s: max(s['A'] - s['B'] + s['C'], 0), axis=1)

In [20]: df
Out[20]:
   A   B  C   D
0  7  12  2   0
1  5   4  4   5
2  4   8  2   0
3  9   2  3  10
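A side note on the np.maximum attempt in the question: np.maximum is element-wise, so it should in fact produce the desired column directly; the all-rows-get-10 behaviour described sounds like np.max (an aggregate) instead. A sketch:

import numpy as np

df['D'] = np.maximum(df['A'] - df['B'] + df['C'], 0)  # element-wise maximum with 0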

Delete rows in dataframe based on info from Series

I would like to delete all rows in the DataFrame that have a number of appearances = 10 and status = 1.
Example DataFrame X:

       ID  Status
0  366804       0
1  371391       1
2  383537       1
3  383538       0
4  383539       0
...
First I found all IDs with status = 1 that appear 10 times:
exclude = (X[X.Status == 1].groupby('ID')['Status']
             .value_counts().loc[lambda x: x == 10].index)
exclude is a MultiIndex:

MultiIndex([(371391, 1),
            (383537, 1),
            ...
Is it possible to delete rows in DataFrame X based on the ID info from this Series?
If your original DataFrame looks something like this:
print(df)

        ID  Status
0   366804       0
1   371391       1
2   383537       1
3   383538       0
4   383539       0
5   371391       1
6   371391       1
7   371391       1
8   371391       1
9   371391       1
10  371391       1
11  371391       1
12  371391       1
13  371391       1
And you group IDs and statuses together to find the IDs you want to exclude:
df2 = df.groupby(['ID', 'Status']).size().to_frame('size').reset_index()
print(df2)
       ID  Status  size
0  366804       0     1
1  371391       1    10
2  383537       1     1
3  383538       0     1
4  383539       0     1
excludes = df2.loc[(df2['size'] == 10) & (df2['Status'] == 1), 'ID']
print(excludes)
1    371391
Name: ID, dtype: int64
Then you could use Series.isin and invert the resulting boolean Series with ~:
df = df[~df['ID'].isin(excludes)]
print(df)
       ID  Status
0  366804       0
2  383537       1
3  383538       0
4  383539       0
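Since exclude from the question is already a MultiIndex of (ID, Status) pairs, you could also filter straight from it without recomputing the counts. A sketch, assuming exclude as constructed in the question:

X = X[~X['ID'].isin(exclude.get_level_values('ID'))]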

create new column based on other columns in pandas dataframe

What is the best way to create a set of new columns based on two other columns? (similar to a crosstab or SQL case statement)
This works, but performance is very slow on large dataframes:
for label in labels:
    df[label + '_amt'] = df.apply(lambda row: row['amount'] if row['product'] == label else 0, axis=1)
You can use pivot_table:
>>> df
   amount product
0       6       b
1       3       c
2       3       a
3       7       a
4       7       a
>>> df.pivot_table(index=df.index, values='amount',
...                columns='product', fill_value=0)
product  a  b  c
0        0  6  0
1        0  0  3
2        3  0  0
3        7  0  0
4        7  0  0
or,
>>> for label in df['product'].unique():
...     df[label + '_amt'] = (df['product'] == label) * df['amount']
...
>>> df
   amount product  b_amt  c_amt  a_amt
0       6       b      6      0      0
1       3       c      0      3      0
2       3       a      0      0      3
3       7       a      0      0      7
4       7       a      0      0      7
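A fully vectorized variant of the loop above, using pd.get_dummies so no Python-level iteration over labels is needed (a sketch):

amts = pd.get_dummies(df['product']).mul(df['amount'], axis=0).add_suffix('_amt')
df = df.join(amts)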