Fill consecutive NaNs with cumsum, to increment by one on each consecutive NaN - pandas

Given a dataframe with lots of missing values in a certain interval, my desired output dataframe should have every run of consecutive NaNs filled with a cumsum that starts from the last valid value and adds 1 for each NaN.
Given:
shop_id calendar_date quantity
0 2018-12-12 1
1 2018-12-13 NaN
2 2018-12-14 NaN
3 2018-12-15 NaN
4 2018-12-16 1
5 2018-12-17 NaN
Desired output:
shop_id calendar_date quantity
0 2018-12-12 1
1 2018-12-13 2
2 2018-12-14 3
3 2018-12-15 4
4 2018-12-16 1
5 2018-12-17 2

Use:
g = (~df.quantity.isnull()).cumsum()
df['quantity'] = df.fillna(1).groupby(g).quantity.cumsum()
shop_id calendar_date quantity
0 0 2018-12-12 1.0
1 1 2018-12-13 2.0
2 2 2018-12-14 3.0
3 3 2018-12-15 4.0
4 4 2018-12-16 1.0
5 5 2018-12-17 2.0
Details
Negate .isnull() to flag the rows where quantity has a valid value, and take the cumsum of that boolean Series, so that each valid value starts a new group:
g = (~df.quantity.isnull()).cumsum()
0 1
1 1
2 1
3 1
4 2
5 2
Use fillna(1) so that when you group by g and take the cumsum, the values increase by one from whatever the group's first valid value is:
df.fillna(1).groupby(g).quantity.cumsum()
0 1.0
1 2.0
2 3.0
3 4.0
4 1.0
5 2.0
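For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same group-then-cumsum idea. The frame is rebuilt by hand from the question's sample, so the construction itself is an assumption and not the asker's real data.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question (assumed construction).
df = pd.DataFrame({
    "shop_id": range(6),
    "calendar_date": pd.date_range("2018-12-12", periods=6, freq="D"),
    "quantity": [1, np.nan, np.nan, np.nan, 1, np.nan],
})

# Every valid value opens a new group; the NaNs after it share its label.
g = df["quantity"].notna().cumsum()

# Treat each NaN as +1 and accumulate inside each group.
df["quantity"] = df["quantity"].fillna(1).groupby(g).cumsum()
print(df)
```

Because the cumsum starts from the valid value itself, a run that begins with, say, 5 would continue 6, 7, 8 rather than restarting at 1.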

Another approach:
data
shop_id calender_date quantity
0 0 2018-12-12 1.0
1 1 2018-12-13 NaN
2 2 2018-12-14 NaN
3 3 2018-12-15 NaN
4 4 2018-12-16 1.0
5 5 2018-12-17 NaN
6 6 2018-12-18 NaN
7 7 2018-12-17 NaN
using np.where
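import numpy as np

# positions of the rows that already hold a valid quantity
where = np.where(data['quantity'] >= 1)[0]
r = []
for i in range(len(where)):
    try:
        # count from 1 up to the row before the next valid value
        r.extend(np.arange(1, where[i+1] - where[i] + 1))
    except IndexError:
        # last run: count up to the end of the frame
        r.extend(np.arange(1, len(data) - where[i] + 1))
data['quantity'] = r
print(data)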
shop_id calender_date quantity
0 0 2018-12-12 1
1 1 2018-12-13 2
2 2 2018-12-14 3
3 3 2018-12-15 4
4 4 2018-12-16 1
5 5 2018-12-17 2
6 6 2018-12-18 3
7 7 2018-12-17 4
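The loop above can probably be avoided. The sketch below is a different technique (groupby plus cumcount, not np.where) and assumes, like the loop does, that every valid quantity is 1 so each run simply restarts at 1. The frame is rebuilt by hand from the sample and only the quantity column is kept.

```python
import numpy as np
import pandas as pd

# Quantity column rebuilt from the sample above (assumed construction).
data = pd.DataFrame({
    "quantity": [1, np.nan, np.nan, np.nan, 1, np.nan, np.nan, np.nan],
})

# Label each run: a valid value starts a new group.
g = data["quantity"].notna().cumsum()

# cumcount numbers rows inside each group from 0, so add 1.
data["quantity"] = data["quantity"].groupby(g).cumcount() + 1
print(data)
```

If the valid values can be something other than 1, the fillna(1) plus cumsum variant from the first answer is the one that carries them forward.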

Related

How to match 2 rows column in pandas

I have a dataframe looks like below:
index Value Next_value number date
0 ABC DEF2 3 1/1/2023
1 ABC DEF2 4 2/1/2023
2 BDC DEF2 1 3/1/2023
3 BDC CCC2 2 4/1/2023
4 CCC ABC 10 5/1/2023
5 DEF BDC 11 6/1/2023
6 ABC DEF3 7 7/1/2023
7 BDD ABC 8 8/1/2023
I am trying to pull in, for each row, the number from the most recent earlier row whose Value matches the current row's Next_value. In the example above, index 4 (Next_value ABC) matches index 1 rather than index 0, because index 1 is the later of the two by date. Index 5 (Next_value BDC) matches index 3 rather than index 2 for the same reason, and index 7 matches index 6, because index 6 is the latest ABC record before it. You can assume the dataframe is sorted by date and time.
Since my dataframe is very large, I would prefer to avoid a cross join. The output I expect is:
index Value Next_value number prev_number date
0 ABC DEF2 3 NaN 1/1/2023
1 ABC DEF2 4 NaN 2/1/2023
2 BDC DEF2 1 NaN 3/1/2023
3 BDC CCC2 2 NaN 4/1/2023
4 CCC ABC 10 4 5/1/2023
5 DEF BDC 11 2 6/1/2023
6 ABC DEF3 7 NaN 7/1/2023
7 BDD ABC 8 7 8/1/2023
Use cross-mapping (between columns):
df.assign(prev_number=df['Next_value'].map(dict(zip(df['Value'], df['number']))))
While it is being built, dict(zip(df['Value'], df['number'])) keeps the last number among duplicated Value keys.
index Value Next_value number prev_number
0 0 ABC DEF2 3 NaN
1 1 ABC DEF2 4 NaN
2 2 BDC DEF2 1 NaN
3 3 BDC CCC2 2 NaN
4 4 CCC ABC 10 4.0
5 5 DEF BDC 11 2.0
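One thing worth checking before relying on the mapping: the dict is built from the whole frame, so "last" means last overall, not last before the current row. A small sketch on the question's full eight rows (frame rebuilt by hand, date column left out) shows the difference:

```python
import pandas as pd

# All eight rows from the question (assumed construction, dates omitted).
df = pd.DataFrame({
    "Value":      ["ABC", "ABC", "BDC", "BDC", "CCC", "DEF", "ABC", "BDD"],
    "Next_value": ["DEF2", "DEF2", "DEF2", "CCC2", "ABC", "BDC", "DEF3", "ABC"],
    "number":     [3, 4, 1, 2, 10, 11, 7, 8],
})

# Later duplicates overwrite earlier ones, so ABC ends up mapped to 7.
lookup = dict(zip(df["Value"], df["number"]))
mapped = df["Next_value"].map(lookup)
print(mapped)
```

Here row 4 gets 7 (from row 6, which comes later) instead of the 4 the question expects. On the six-row output shown above the two coincide, so whether this matters depends on whether a Value can reappear after it has been referenced.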
I think you need a merge_asof:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df = df.sort_values(by='date')
df['prev_number'] = pd.merge_asof(
    df, df,
    left_by='Next_value', right_by='Value',
    left_on='date', right_on='date'
)['number_y']
print(df)
Output:
index Value Next_value number date prev_number
0 0 ABC DEF2 3 2023-01-01 NaN
1 1 ABC DEF2 4 2023-01-02 NaN
2 2 BDC DEF2 1 2023-01-03 NaN
3 3 BDC CCC2 2 2023-01-04 NaN
4 4 CCC ABC 10 2023-01-05 4.0
5 5 DEF BDC 11 2023-01-06 2.0
6 6 ABC DEF3 7 2023-01-07 NaN
7 7 BDD ABC 8 2023-01-08 7.0
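A self-contained variant of the merge_asof idea, in case it helps to experiment. This is only a sketch: the frame is rebuilt by hand, the dates are assumed to be consecutive days as in the output above, and renaming the right-hand columns (so a single by= key can be used) is my own choice rather than part of the answer.

```python
import pandas as pd

# Sample rebuilt from the question; consecutive daily dates are assumed.
df = pd.DataFrame({
    "Value":      ["ABC", "ABC", "BDC", "BDC", "CCC", "DEF", "ABC", "BDD"],
    "Next_value": ["DEF2", "DEF2", "DEF2", "CCC2", "ABC", "BDC", "DEF3", "ABC"],
    "number":     [3, 4, 1, 2, 10, 11, 7, 8],
    "date":       pd.date_range("2023-01-01", periods=8, freq="D"),
})

# Right side: each row offers its number to later rows that name its Value.
right = df[["date", "Value", "number"]].rename(
    columns={"Value": "Next_value", "number": "prev_number"}
)

# Backward search on date within each key; exact-date matches are excluded
# so a row can never match itself.
out = pd.merge_asof(
    df, right, on="date", by="Next_value", allow_exact_matches=False
)
print(out)
```

Both inputs have to be sorted on the date key, which holds here because the sample is already in date order.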
Let's use drop_duplicates and map:
mapper = df.drop_duplicates('Value', keep='last').set_index('Value')['number']
df['prev_number'] = df['Next_value'].map(mapper)
Output:
index Value Next_value number prev_number
0 0 ABC DEF2 3 NaN
1 1 ABC DEF2 4 NaN
2 2 BDC DEF2 1 NaN
3 3 BDC CCC2 2 NaN
4 4 CCC ABC 10 4.0
5 5 DEF BDC 11 2.0
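A row-wise alternative with apply, which only looks at the rows before the current one:
def function1(ss: pd.Series):
    if ss.name > 0:
        # latest earlier row whose Value equals this row's Next_value
        ss1 = df1.loc[:ss.name-1].query("Value==@ss.Next_value").tail(1)
        if ss1.size > 0:
            ss["prev_number"] = ss1.number.squeeze()
    return ss
df1.apply(function1, axis=1)
Output: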
Next_value Value index number prev_number
0 DEF2 ABC 0 3 NaN
1 DEF2 ABC 1 4 NaN
2 DEF2 BDC 2 1 NaN
3 CCC2 BDC 3 2 NaN
4 ABC CCC 4 10 4.0
5 BDC DEF 5 11 2.0

pandas dataframe auto fill values if have same value on specific column [duplicate]

I have the data below. Newer pandas versions don't preserve the grouped columns after a groupby fillna/ffill/bfill. Is there a way to keep the grouping columns in the result?
data = """one;two;three
1;1;10
1;1;nan
1;1;nan
1;2;nan
1;2;20
1;2;nan
1;3;nan
1;3;nan"""
df = pd.read_csv(io.StringIO(data), sep=";")
print(df)
one two three
0 1 1 10.0
1 1 1 NaN
2 1 1 NaN
3 1 2 NaN
4 1 2 20.0
5 1 2 NaN
6 1 3 NaN
7 1 3 NaN
print(df.groupby(['one','two']).ffill())
three
0 10.0
1 10.0
2 10.0
3 NaN
4 20.0
5 20.0
6 NaN
7 NaN
With recent pandas versions, if we would like to keep the groupby columns, we need to add apply here:
out = df.groupby(['one','two']).apply(lambda x : x.ffill())
Out[219]:
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 NaN
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
Is this what you expect?
df['three']= df.groupby(['one','two'])['three'].ffill()
print(df)
# Output:
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 NaN
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
Yes. Set the grouping columns as the index first and then group on them, so that they are preserved in the result, as shown here:
df = pd.read_csv(io.StringIO(data), sep=";")
df.set_index(['one','two'], inplace=True)
df.groupby(['one','two']).ffill()
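A runnable sketch of the set_index route, with one extra step that is my own addition: reset_index() at the end to turn the group keys back into ordinary columns. The frame is rebuilt directly instead of going through read_csv, and grouping with level= is used to make it explicit that the keys live in the index.

```python
import numpy as np
import pandas as pd

# Same data as in the question, built directly (assumed construction).
df = pd.DataFrame({
    "one":   [1, 1, 1, 1, 1, 1, 1, 1],
    "two":   [1, 1, 1, 2, 2, 2, 3, 3],
    "three": [10, np.nan, np.nan, np.nan, 20, np.nan, np.nan, np.nan],
})

# Move the keys into the index, forward-fill within each group,
# then bring the keys back as columns.
filled = (
    df.set_index(["one", "two"])
      .groupby(level=["one", "two"])
      .ffill()
      .reset_index()
)
print(filled)
```

The leading NaN in group (1, 2) and both NaNs in group (1, 3) stay NaN, since there is nothing earlier in the group to fill from.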

Calculate the cumulative count for all NaN values in specific column

I have a dataframe:
# create example df
df = pd.DataFrame(index=[1,2,3,4,5,6,7])
df['ID'] = [1,1,1,1,2,2,2]
df['election_date'] = pd.date_range("01/01/2010", periods=7, freq="M")
df['stock_price'] = [1,np.nan,np.nan,4,5,np.nan,7]
# sort values
df.sort_values(['election_date'], inplace=True, ascending=False)
df.reset_index(drop=True, inplace=True)
df
ID election_date stock_price
0 2 2010-07-31 7.0
1 2 2010-06-30 NaN
2 2 2010-05-31 5.0
3 1 2010-04-30 4.0
4 1 2010-03-31 NaN
5 1 2010-02-28 NaN
6 1 2010-01-31 1.0
I would like to calculate the cumulative count of all np.nan for column stock_price for every ID.
The expected result is:
df
ID election_date stock_price cum_count_nans
0 2 2010-07-31 7.0 1
1 2 2010-06-30 NaN 0
2 2 2010-05-31 5.0 0
3 1 2010-04-30 4.0 2
4 1 2010-03-31 NaN 1
5 1 2010-02-28 NaN 0
6 1 2010-01-31 1.0 0
Any ideas how to solve it?
The idea is to reverse the row order by indexing, and then, in a custom function, test for missing values, shift them by one row, and take the cumulative sum:
f = lambda x: x.isna().shift(fill_value=0).cumsum()
df['cum_count_nans'] = df.iloc[::-1].groupby('ID')['stock_price'].transform(f)
print (df)
ID election_date stock_price cum_count_nans
0 2 2010-07-31 7.0 1
1 2 2010-06-30 NaN 0
2 2 2010-05-31 5.0 0
3 1 2010-04-30 4.0 2
4 1 2010-03-31 NaN 1
5 1 2010-02-28 NaN 0
6 1 2010-01-31 1.0 0
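Here is a self-contained sketch of that reverse, shift, cumsum idea. The frame is typed in already sorted (date column omitted), and I cast the NaN mask to int before shifting so that fill_value=0 matches the dtype; that cast is my own tweak, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Rows in the question's final (descending date) order; dates omitted.
df = pd.DataFrame({
    "ID":          [2, 2, 2, 1, 1, 1, 1],
    "stock_price": [7, np.nan, 5, 4, np.nan, np.nan, 1],
})

def nans_seen_so_far(s):
    # 1 where the value is missing, shifted so a row does not count itself.
    return s.isna().astype(int).shift(fill_value=0).cumsum()

# Walk each ID from oldest to newest by reversing the frame first;
# the result is aligned back to df by index on assignment.
df["cum_count_nans"] = (
    df.iloc[::-1].groupby("ID")["stock_price"].transform(nans_seen_so_far)
)
print(df)
```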

Backfill and Increment by one?

I have a column of a DataFrame that consists of 0's and NaN's:
Timestamp A B C
1 3 3 NaN
2 5 2 NaN
3 9 1 NaN
4 2 6 NaN
5 3 3 0
6 5 2 NaN
7 3 1 NaN
8 2 8 NaN
9 1 6 0
And I want to backfill it and increment the last value:
Timestamp A B C
1 3 3 4
2 5 2 3
3 9 1 2
4 2 6 1
5 3 3 0
6 5 2 3
7 3 1 2
8 2 8 1
9 1 6 0
You can use iloc[::-1] to reverse the data, and groupby().cumcount() to create the row counter:
s = df['C'].iloc[::-1].notnull()
df['C'] = df['C'].bfill() + s.groupby(s.cumsum()).cumcount()
Output
Timestamp A B C
0 1 3 3 4.0
1 2 5 2 3.0
2 3 9 1 2.0
3 4 2 6 1.0
4 5 3 3 0.0
5 6 5 2 3.0
6 7 3 1 2.0
7 8 2 8 1.0
8 9 1 6 0.0
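To see the pieces separately, a small sketch using only column C from the question (the frame is rebuilt by hand, so treat the construction as an assumption):

```python
import numpy as np
import pandas as pd

# Column C from the question (assumed construction).
df = pd.DataFrame({
    "C": [np.nan, np.nan, np.nan, np.nan, 0, np.nan, np.nan, np.nan, 0],
})

# Reversed mask: True where a value exists, read from the bottom up.
s = df["C"].iloc[::-1].notnull()

# Each valid value starts a group; cumcount is the distance back to it.
steps = s.groupby(s.cumsum()).cumcount()

# Backfill the anchor value and add the distance (aligned by index).
df["C"] = df["C"].bfill() + steps
print(df)
```

Since the distance is added to the backfilled value, an anchor of 0 gives a plain countdown, while a non-zero anchor would give that value plus the distance.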

To count every 3 rows to fit the condition by Pandas rolling

I have dataframe look like this:
raw_data ={'col0':[1,4,5,1,3,3,1,5,8,9,1,2]}
df = DataFrame(raw_data)
col0
0 1
1 4
2 5
3 1
4 3
5 3
6 1
7 5
8 8
9 9
10 1
11 2
What I want to do is count, over every window of 3 rows, how many rows meet the condition (df['col0']>3), and put the result in a new column that looks like this:
col0 col_roll_count3
0 1 0
1 4 1
2 5 2 #[index 0,1,2/ 4,5 fit the condition]
3 1 2
4 3 1
5 3 0 #[index 3,4,5/no fit the condition]
6 1 0
7 5 1
8 8 2
9 9 3
10 1 2
11 2 1
How can I achieve that?
I tried this but failed:
df['col_roll_count3'] = df[df['col0']>3].rolling(3).count()
print(df)
col0 col1
0 1 NaN
1 4 1.0
2 5 2.0
3 1 NaN
4 3 NaN
5 3 NaN
6 1 NaN
7 5 3.0
8 8 3.0
9 9 3.0
10 1 NaN
11 2 NaN
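Compare first, then take a rolling sum of the mask (min_periods=1 keeps the first two rows from coming out as NaN):
df['col_roll_count3'] = df['col0'].gt(3).astype(int).rolling(3, min_periods=1).sum()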
Let's use rolling, apply, np.count_nonzero:
df['col_roll_count3'] = df.col0.rolling(3, min_periods=1)\
    .apply(lambda x: np.count_nonzero(x > 3))
Output:
col0 col_roll_count3
0 1 0.0
1 4 1.0
2 5 2.0
3 1 2.0
4 3 1.0
5 3 0.0
6 1 0.0
7 5 1.0
8 8 2.0
9 9 3.0
10 1 2.0
11 2 1.0
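For comparison, a runnable sketch that gets the same column without apply, by summing a 0/1 mask over the window. Casting the result to int is my own addition to match the integer column in the question.

```python
import pandas as pd

df = pd.DataFrame({"col0": [1, 4, 5, 1, 3, 3, 1, 5, 8, 9, 1, 2]})

# 1 where the condition holds, 0 otherwise.
mask = df["col0"].gt(3).astype(int)

# Sum the mask over the current row and the two before it;
# min_periods=1 lets the first two rows use a shorter window.
df["col_roll_count3"] = mask.rolling(3, min_periods=1).sum().astype(int)
print(df)
```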