Delete rows in data frame based on condition of ranked values - pandas

If I have the DataFrame below:
import numpy as np
import pandas as pd

raw_data = {
    'code': [1, 1, 1, 1, 2, 2, 2, 2],
    'Date': ['2022-01-04', '2022-01-01', '2022-01-03', '2022-01-02', '2022-01-08', '2022-01-07', '2022-01-06', '2022-01-05'],
    'flag_check': [np.NaN, np.NaN, '11-33-24-33333', np.NaN, np.NaN, '11-55-24-33443', np.NaN, np.NaN],
    'rank': [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN]
}
df = pd.DataFrame(raw_data, columns=['code', 'Date', 'flag_check', 'rank'])
I need to do the following:
1- Rank the entries based on code, then Date.
2- Within each code group, fill the rank column with sequence numbers 1, 2, 3, ... based on the code and the Date.
3- Check the value of flag_check; if it is not null, delete all rows after it (within that code group).
Expected result

Here's a way to do it:
df['rank'] = df.groupby('code')['Date'].rank(method='dense').astype(int)  # rank dates within each code
df = df.sort_values(['code', 'Date'])
# per code, count how many rows come after the first non-null flag_check
x = df.groupby('code')['flag_check'].apply(lambda s: s.shift().notna().cumsum())
# keep only the rows up to and including that first non-null flag_check
df = df.loc[x[x == 0].index, :].reset_index(drop=True)
Input:
code Date flag_check rank
0 1 2022-01-04 NaN NaN
1 1 2022-01-01 NaN NaN
2 1 2022-01-03 11-33-24-33333 NaN
3 1 2022-01-02 NaN NaN
4 2 2022-01-08 NaN NaN
5 2 2022-01-07 11-55-24-33443 NaN
6 2 2022-01-06 NaN NaN
7 2 2022-01-05 NaN NaN
Output:
code Date flag_check rank
0 1 2022-01-01 NaN 1
1 1 2022-01-02 NaN 2
2 1 2022-01-03 11-33-24-33333 3
3 2 2022-01-05 NaN 1
4 2 2022-01-06 NaN 2
5 2 2022-01-07 11-55-24-33443 3

Annotated code
# order by Date
s = df.sort_values('Date')
# rank the Date column per code group
s['rank'] = s.groupby('code')['Date'].rank(method='dense')
# boolean mask: walking backwards through each code group, True for the rows
# at or before the first non-null flag_check (i.e. the rows to keep)
mask = s['flag_check'].notna()[::-1].groupby(df['code']).cummax()
Result
s[mask]
code Date flag_check rank
1 1 2022-01-01 NaN 1.0
3 1 2022-01-02 NaN 2.0
2 1 2022-01-03 11-33-24-33333 3.0
7 2 2022-01-05 NaN 1.0
6 2 2022-01-06 NaN 2.0
5 2 2022-01-07 11-55-24-33443 3.0
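A more compact variant of the first solution, using transform instead of apply so the per-group result always aligns with the original index (a sketch only; it assumes the same raw_data as above and that dates are unique within each code):
import pandas as pd

df = pd.DataFrame(raw_data, columns=['code', 'Date', 'flag_check', 'rank'])
df = df.sort_values(['code', 'Date'])
# sequence number within each code group (1, 2, 3, ...)
df['rank'] = df.groupby('code').cumcount() + 1
# running count of rows that come after the first non-null flag_check in each group
after_flag = df.groupby('code')['flag_check'].transform(lambda s: s.shift().notna().cumsum())
out = df[after_flag.eq(0)].reset_index(drop=True)
For the sample data this reproduces the Output table above.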


Pandas rolling average window for dates excluding row

import pandas as pd

df = pd.DataFrame(
    {"date": [pd.Timestamp("2022-01-01"), pd.Timestamp("2022-01-01"), pd.Timestamp("2022-01-01"),
              pd.Timestamp("2022-01-03"), pd.Timestamp("2022-01-03"), pd.Timestamp("2022-01-05")],
     "numbers": [1, 2, 3, 4, 7, 5]
    }
)
If I have the following df and I would like to get the rolling mean of the numbers values that fall before each row's date, how would I do that?
I know I can do
df["av"] = df["numbers"].shift(1).rolling(window=3).mean()
but this does not shift dynamically, so it still includes today.
My expected output for the new av column, for a 3-day window over the sample df, would be:
date numbers av
0 2022-01-01 1 NaN
1 2022-01-01 2 NaN
2 2022-01-01 3 NaN
3 2022-01-03 4 2.0
4 2022-01-03 7 2.0
5 2022-01-05 5 5.5
I think you need rolling means per unique date, with the current date excluded by shifting one day.
Here the mean is computed from its definition - sum / count:
df1 = (df.groupby('date')['numbers']
.agg(['sum','size'])
.asfreq('d', fill_value=0)
.rolling(window=3, min_periods=1)
.sum())
df['av'] = df['date'].map(df1['sum'].div(df1['size']).shift())
print (df)
date numbers av
0 2022-01-01 1 NaN
1 2022-01-01 2 NaN
2 2022-01-01 3 NaN
3 2022-01-03 4 2.0
4 2022-01-03 7 2.0
5 2022-01-05 5 5.5
Explanation:
First, aggregate sum and size (for the count):
print (df.groupby('date')['numbers'].agg(['sum','size']))
sum size
date
2022-01-01 6 3
2022-01-03 11 2
2022-01-05 5 1
Add the missing consecutive dates with DataFrame.asfreq:
print (df.groupby('date')['numbers']
.agg(['sum','size'])
.asfreq('d', fill_value=0))
sum size
date
2022-01-01 6 3
2022-01-02 0 0
2022-01-03 11 2
2022-01-04 0 0
2022-01-05 5 1
Apply a rolling sum over a 3-day window:
df1 = (df.groupby('date')['numbers']
.agg(['sum','size'])
.asfreq('d', fill_value=0)
.rolling(window=3, min_periods=1)
.sum())
print (df1)
sum size
date
2022-01-01 6.0 3.0
2022-01-02 6.0 3.0
2022-01-03 17.0 5.0
2022-01-04 11.0 2.0
2022-01-05 16.0 3.0
Divide the columns of df1 to get the averages:
print (df1['sum'].div(df1['size']))
date
2022-01-01 2.000000
2022-01-02 2.000000
2022-01-03 3.400000
2022-01-04 5.500000
2022-01-05 5.333333
Freq: D, dtype: float64
Exclude the current day with Series.shift by one day:
print (df1['sum'].div(df1['size']).shift())
date
2022-01-01 NaN
2022-01-02 2.0
2022-01-03 2.0
2022-01-04 3.4
2022-01-05 5.5
Freq: D, dtype: float64
Last, create the new column with Series.map:
print (df['date'].map(df1['sum'].div(df1['size']).shift()))
0 NaN
1 NaN
2 NaN
3 2.0
4 2.0
5 5.5
Name: date, dtype: float64
The answer above seems too complex; here is a simpler one. It assumes the date column is set as the index, so every row sharing the current date can be excluded inside each window:
def function1(ss: pd.Series):
    # drop every row that shares the newest date in the window, then average the rest
    return ss.loc[ss.index != ss.index.max()].mean()

df1 = df.set_index('date')
df1.assign(av=df1.rolling('4d').numbers.apply(function1))
numbers av
date
2022-01-01 1 NaN
2022-01-01 2 NaN
2022-01-01 3 NaN
2022-01-03 4 2.0
2022-01-03 7 2.0
2022-01-05 5 5.5

Calculate the cumulative count for all NaN values in specific column

I have a dataframe:
# create example df
df = pd.DataFrame(index=[1,2,3,4,5,6,7])
df['ID'] = [1,1,1,1,2,2,2]
df['election_date'] = pd.date_range("01/01/2010", periods=7, freq="M")
df['stock_price'] = [1,np.nan,np.nan,4,5,np.nan,7]
# sort values
df.sort_values(['election_date'], inplace=True, ascending=False)
df.reset_index(drop=True, inplace=True)
df
ID election_date stock_price
0 2 2010-07-31 7.0
1 2 2010-06-30 NaN
2 2 2010-05-31 5.0
3 1 2010-04-30 4.0
4 1 2010-03-31 NaN
5 1 2010-02-28 NaN
6 1 2010-01-31 1.0
I would like to calculate the cumulative count of all np.nan values in column stock_price for every ID.
The expected result is:
df
ID election_date stock_price cum_count_nans
0 2 2010-07-31 7.0 1
1 2 2010-06-30 NaN 0
2 2 2010-05-31 5.0 0
3 1 2010-04-30 4.0 2
4 1 2010-03-31 NaN 1
5 1 2010-02-28 NaN 0
6 1 2010-01-31 1.0 0
Any ideas how to solve it?
The idea is to reverse the order with indexing, and then, in a custom function, test for missing values, shift, and take the cumulative sum:
f = lambda x: x.isna().shift(fill_value=0).cumsum()
df['cum_count_nans'] = df.iloc[::-1].groupby('ID')['stock_price'].transform(f)
print (df)
ID election_date stock_price cum_count_nans
0 2 2010-07-31 7.0 1
1 2 2010-06-30 NaN 0
2 2 2010-05-31 5.0 0
3 1 2010-04-30 4.0 2
4 1 2010-03-31 NaN 1
5 1 2010-02-28 NaN 0
6 1 2010-01-31 1.0 0
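Equivalently (a sketch only, assuming the same df as above), you can sort ascending by election_date instead of reversing with iloc; index alignment puts the result back on the right rows:
f = lambda x: x.isna().shift(fill_value=0).cumsum()
df['cum_count_nans'] = df.sort_values('election_date').groupby('ID')['stock_price'].transform(f)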

Creating a monotonic list with groupby with missing dates

I am finding the rolling average of all counties over a period of time. However, the first county, county "A", is missing the time point for January 3rd.
from datetime import datetime
import pandas as pd

def crt_data():
    data = [[datetime(2020, 1, 1), 'A', 1],
            [datetime(2020, 1, 2), 'A', 2],
            # [datetime(2020, 1, 3), 'A', 3],
            [datetime(2020, 1, 4), 'A', 4],
            [datetime(2020, 1, 1), 'B', 10],
            [datetime(2020, 1, 2), 'B', 11],
            [datetime(2020, 1, 3), 'B', 12],
            [datetime(2020, 1, 4), 'B', 13],
            [datetime(2020, 1, 1), 'C', 4],
            [datetime(2020, 1, 2), 'C', 3],
            [datetime(2020, 1, 3), 'C', 2],
            [datetime(2020, 1, 4), 'C', 1]
            ]
    df = pd.DataFrame(data, columns=['my_date', 'County', 'cmi'])
    return df
df = crt_data()
print('\n \n roll over by timepoint')
df['my_mean'] = df.groupby('my_date')['cmi'].mean().reset_index(0, drop=True)
df = df.sort_values(by=['County', 'my_date'])
df['rolling_cmi2'] = df.my_mean.rolling(2).mean()
print(df)
my_date County cmi my_mean rolling_cmi2
0 2020-01-01 A 1 5.000000 NaN
1 2020-01-02 A 2 5.333333 5.166667
2 2020-01-04 A 4 7.000000 6.166667
3 2020-01-01 B 10 6.000000 6.500000
4 2020-01-02 B 11 NaN NaN
5 2020-01-03 B 12 NaN NaN
6 2020-01-04 B 13 NaN NaN
7 2020-01-01 C 4 NaN NaN
8 2020-01-02 C 3 NaN NaN
9 2020-01-03 C 2 NaN NaN
10 2020-01-04 C 1 NaN NaN
EDIT:
What I expect to see is something like this:
my_date County cmi my_mean rolling_cmi2
0 2020-01-01 A 1 5.000000 NaN
1 2020-01-02 A 2 5.333333 5.166667
3 2020-01-03 B 12 7.000000 6.166667
2 2020-01-04 A 4 7.000000 7.000000
When I group, I get no date for January 3rd and two dates for January 1st, which makes the rolling value incorrect.
How do I reduce this to a single average per date, with each average attached to the correct date? I know county C has all the correct time points; can I move county C to the top of the list to get a complete list of dates? How would you do this?
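One possible approach (a sketch only, not a verified answer): collapse the data to a single cross-county mean per date first, then take the rolling mean over that per-date series; county A's missing January 3rd row simply drops out of that date's mean. The names daily, my_mean and rolling_cmi2 are illustrative:
daily = df.groupby('my_date', as_index=False)['cmi'].mean().rename(columns={'cmi': 'my_mean'})
daily['rolling_cmi2'] = daily['my_mean'].rolling(2).mean()
print(daily)
my_date my_mean rolling_cmi2
0 2020-01-01 5.000000 NaN
1 2020-01-02 5.333333 5.166667
2 2020-01-03 7.000000 6.166667
3 2020-01-04 6.000000 6.500000
This gives one row per date with the correct date; whether the exact numbers match the expected table above is for the asker to confirm.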

How to build a window over positive and negative number ranges in a dataframe column?

I would like to have the average value and the max value in every positive and negative range.
From the sample data below:
import pandas as pd
test_list = [-1, -2, -3, -2, -1, 1, 2, 3, 2, 1, -1, -4, -5, 2 ,4 ,7 ]
df_test = pd.DataFrame(test_list, columns=['value'])
Which gives me a dataframe like this:
value
0 -1
1 -2
2 -3
3 -2
4 -1
5 1
6 2
7 3
8 2
9 1
10 -1
11 -4
12 -5
13 2
14 4
15 7
I would like to have something like this:
AVG1 = sum([-1, -2, -3, -2, -1]) / 5 = -1.8
Max1 = -3
AVG2 = sum([1, 2, 3, 2, 1]) / 5 = 1.8
Max2 = 3
AVG3 = sum([2, 4, 7]) / 3 = 4.3
Max3 = 7
If the solution needs a new column or a new dataframe, that is OK for me.
I know that I can use .mean like here:
pandas get column average/mean with round value
But that solution gives me the average of all positive and of all negative values. How can I build some kind of window so that I can calculate the average of the first negative group, then of the second positive group, and so on?
Regards
You can create a Series with np.sign to distinguish positive and negative values, compare it with its shifted values and take the cumulative sum to form consecutive groups, and then aggregate mean and max:
import numpy as np

s = np.sign(df_test['value'])
g = s.ne(s.shift()).cumsum()
df = df_test.groupby(g)['value'].agg(['mean','max'])
print (df)
mean max
value
1 -1.800000 -1
2 1.800000 3
3 -3.333333 -1
4 4.333333 7
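Note that max here is the largest value in each range, while the question's Max1 = -3 is the extreme of the negative range (its minimum). A minimal sketch of one way to get that signed extreme, reusing s and g from above (the names extreme and out are illustrative, not from the original answer):
extreme = lambda x: x.min() if x.iloc[0] < 0 else x.max()
out = df_test.groupby(g)['value'].agg(mean='mean', extreme=extreme)
For the sample data the extremes are -3, 3, -5 and 7 for the four groups.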
EDIT:
To find the local extremes, the solution from this answer is used:
test_list = [-1, -2, -3, -2, -1, 1, 2, 3, 2, 1, -1, -4, -5, 2 ,4 ,7 ]
df_test = pd.DataFrame(test_list, columns=['value'])
from scipy.signal import argrelextrema
#https://stackoverflow.com/a/50836425
n=2 # number of points to be checked before and after
# Find local peaks
df_test['min'] = df_test.iloc[argrelextrema(df_test.value.values, np.less_equal, order=n)[0]]['value']
df_test['max'] = df_test.iloc[argrelextrema(df_test.value.values, np.greater_equal, order=n)[0]]['value']
Then values after the extremes are replaced with missing values, separately for negative and positive groups:
s = np.sign(df_test['value'])
g = s.ne(s.shift()).cumsum()
df_test[['min1','max1']] = df_test[['min','max']].notna().astype(int).iloc[::-1].groupby(g[::-1]).cumsum()
df_test['min1'] = df_test['min1'].where(s.eq(-1) & df_test['min1'].ne(0))
df_test['max1'] = df_test['max1'].where(s.eq(1) & df_test['max1'].ne(0))
df_test['g'] = g
print (df_test)
value min max min1 max1 g
0 -1 NaN -1.0 1.0 NaN 1
1 -2 NaN NaN 1.0 NaN 1
2 -3 -3.0 NaN 1.0 NaN 1
3 -2 NaN NaN NaN NaN 1
4 -1 NaN NaN NaN NaN 1
5 1 NaN NaN NaN 1.0 2
6 2 NaN NaN NaN 1.0 2
7 3 NaN 3.0 NaN 1.0 2
8 2 NaN NaN NaN NaN 2
9 1 NaN NaN NaN NaN 2
10 -1 NaN NaN 1.0 NaN 3
11 -4 NaN NaN 1.0 NaN 3
12 -5 -5.0 NaN 1.0 NaN 3
13 2 NaN NaN NaN 1.0 4
14 4 NaN NaN NaN 1.0 4
15 7 NaN 7.0 NaN 1.0 4
So it is possible to separately aggregate the last 3 values per group with a lambda function and mean; rows with missing values in min1 or max1 are removed by default in groupby:
df1 = df_test.groupby(['g','min1'])['value'].agg(lambda x: x.tail(3).mean())
print (df1)
g min1
1 1.0 -2.000000
3 1.0 -3.333333
Name: value, dtype: float64
df2 = df_test.groupby(['g','max1'])['value'].agg(lambda x: x.tail(3).mean())
print (df2)
g max1
2 1.0 2.000000
4 1.0 4.333333
Name: value, dtype: float64

Insert multiple dates at the start of every group in pandas

I have a dataframe with millions of groups. I am trying to, for each group, add 3 months of dates (month end dates) at the top of every group. So if the first observation of a group is December 2019, I want to fill 3 rows prior to that observation with dates from September 2019 to November 2019. I also want to fill the group column with the relevant group ID and the other columns can remain as null values.
I would like to avoid looping if possible, as this is a very large dataset.
This is my before DataFrame:
import pandas as pd
before = pd.DataFrame({'Group':[1,1,1,1,1,2,2,2,2,2],
'Date':['31/10/2018','30/11/2018','31/12/2018','31/01/2019','28/02/2019','30/03/2001','30/04/2001','31/05/2001','30/06/2001','31/07/2001'],
'value':[1.1,1.7,1.9,2.3,1.5,2.8,2,2,2,2]})
This is my after DataFrame
import numpy as np
import pandas as pd

after = pd.DataFrame({'Group':[1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
'Date':['31/07/2018','31/08/2018','30/09/2018','31/10/2018','30/11/2018','31/12/2018','31/01/2019','28/02/2019','31/12/2000','31/01/2001','28/02/2001','30/03/2001','30/04/2001','31/05/2001','30/06/2001','31/07/2001'],
'value':[np.nan,np.nan,np.nan,1.1,1.7,1.9,2.3,1.5,np.nan,np.nan,np.nan,2.8,2,2,2,2]})
Because processing each group separately cannot be very fast when there are many groups, the idea is to get the first row of each Group with DataFrame.drop_duplicates, shift the months with offsets.MonthOffset, join everything together and add all the missing dates in between:
before['Date'] = pd.to_datetime(before['Date'], dayfirst=True)
df1 = before.drop_duplicates('Group')
# first and last shifted months - by 3 months and by 1 month
df11 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.offsets.MonthOffset(3))
df12 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.offsets.MonthOffset(1))
df = (pd.concat([df11, df12], sort=False, ignore_index=True)
.set_index('Date')
.groupby('Group')
.resample('m')
.size()
.reset_index(name='value')
.assign(value = np.nan))
print (df)
Group Date value
0 1 2018-07-31 NaN
1 1 2018-08-31 NaN
2 1 2018-09-30 NaN
3 2 2000-12-31 NaN
4 2 2001-01-31 NaN
5 2 2001-02-28 NaN
Last add to original and sorting:
df = pd.concat([before, df], ignore_index=True).sort_values(['Group','Date'])
print (df)
Group Date value
10 1 2018-07-31 NaN
11 1 2018-08-31 NaN
12 1 2018-09-30 NaN
0 1 2018-10-31 1.1
1 1 2018-11-30 1.7
2 1 2018-12-31 1.9
3 1 2019-01-31 2.3
4 1 2019-02-28 1.5
13 2 2000-12-31 NaN
14 2 2001-01-31 NaN
15 2 2001-02-28 NaN
5 2 2001-03-30 2.8
6 2 2001-04-30 2.0
7 2 2001-05-31 2.0
8 2 2001-06-30 2.0
9 2 2001-07-31 2.0
If only a few new months are needed, you can omit the groupby part:
before['Date'] = pd.to_datetime(before['Date'], dayfirst=True)
df1 = before.drop_duplicates('Group')
df11 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.offsets.MonthOffset(3))
df12 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.offsets.MonthOffset(2))
df13 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.offsets.MonthOffset(1))
df = (pd.concat([df11, df12, df13, before], ignore_index=True, sort=False)
.sort_values(['Group','Date']))
print (df)
Group Date value
0 1 2018-07-31 NaN
2 1 2018-08-31 NaN
4 1 2018-09-30 NaN
6 1 2018-10-31 1.1
7 1 2018-11-30 1.7
8 1 2018-12-31 1.9
9 1 2019-01-31 2.3
10 1 2019-02-28 1.5
1 2 2000-12-30 NaN
3 2 2001-01-30 NaN
5 2 2001-02-28 NaN
11 2 2001-03-30 2.8
12 2 2001-04-30 2.0
13 2 2001-05-31 2.0
14 2 2001-06-30 2.0
15 2 2001-07-31 2.0
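Depending on the pandas version, pd.offsets.MonthOffset may no longer be importable; in that case subtracting pd.DateOffset(months=n) should behave the same way. A minimal sketch under that assumption, reusing before from above:
before['Date'] = pd.to_datetime(before['Date'], dayfirst=True)
df1 = before.drop_duplicates('Group')
# pd.DateOffset(months=n) used in place of pd.offsets.MonthOffset(n)
df11 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.DateOffset(months=3))
df12 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.DateOffset(months=1))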