Create a column counting number of consecutive negative days - pandas

I have a huge (more than 3 million rows) pandas DataFrame containing the following data:
companyId dateBalance amount
1 2020-04-17 100
1 2020-04-18 40
1 2020-04-19 20
1 2020-04-20 -40
1 2020-04-21 30
2 2020-04-18 5
2 2020-04-19 1
2 2020-04-20 -6
2 2020-04-21 -60
2 2020-04-22 200
I would like to create a new column that counts the number of consecutive days the company has had a negative balance, so for this case we would have the following:
companyId dateBalance amount negCount
1 2020-04-17 100 0
1 2020-04-18 40 0
1 2020-04-19 20 0
1 2020-04-20 -40 1
1 2020-04-21 30 0
2 2020-04-18 5 0
2 2020-04-19 1 0
2 2020-04-20 -6 1
2 2020-04-21 -60 2
2 2020-04-22 200 0
Is there a quick way of doing this (i.e., some way that does not require iterating over every row)? Note that the count must reset at every sign change and for every different company as well.

Use groupby().cumsum() on the negation of the criterion (amount >= 0) to label the blocks, then group by company and block and number the rows within each block with cumcount():
blocks = df['amount'].ge(0).groupby(df['companyId']).cumsum()
df['negCount'] = df.groupby([df['companyId'],blocks]).cumcount()
Output:
companyId dateBalance amount negCount
0 1 2020-04-17 100 0
1 1 2020-04-18 40 0
2 1 2020-04-19 20 0
3 1 2020-04-20 -40 1
4 1 2020-04-21 30 0
5 2 2020-04-18 5 0
6 2 2020-04-19 1 0
7 2 2020-04-20 -6 1
8 2 2020-04-21 -60 2
9 2 2020-04-22 200 0
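One edge case worth checking: with cumcount(), a company whose first rows are already negative gets 0 for its first negative day, because that block does not start with a non-negative row. The sketch below is a possible variant, not part of the original answer. It sums the negative flag inside each block instead of counting rows. Company 3 is hypothetical data added only to show the leading-negative case, and the rows are assumed to be sorted by date within each company.

```python
import pandas as pd

# Question data plus a hypothetical company 3 that starts with negative balances
df = pd.DataFrame({
    "companyId": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3],
    "amount": [100, 40, 20, -40, 30, 5, 1, -6, -60, 200, -5, -7, 10],
})

neg = df["amount"].lt(0)

# Every non-negative row opens a new block within its company
blocks = (~neg).groupby(df["companyId"]).cumsum()

# Running total of negative rows inside each (company, block) pair
df["negCount"] = neg.astype(int).groupby([df["companyId"], blocks]).cumsum()

print(df)
```

For companies 1 and 2 this gives the same negCount as the cumcount() version; only the leading-negative rows of company 3 differ (1, 2 instead of 0, 1).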

Related

Merging two series with alternating dates into one grouped Pandas dataframe

Given are two series, like this:
#period1
DATE
2020-06-22 310.62
2020-06-26 300.05
2020-09-23 322.64
2020-10-30 326.54
#period2
DATE
2020-06-23 312.05
2020-09-02 357.70
2020-10-12 352.43
2021-01-25 384.39
These two series are correlated to each other, i.e. they each mark either the beginning or the end of a date period. The first series marks the end of a period1 period, the second series marks the end of period2 period. The end of a period2 period is at the same time also the start of a period1 period, and vice versa.
I've been looking for a way to aggregate these periods as date ranges, but apparently this is not easily possible with Pandas dataframes. Suggestions extremely welcome.
In the easiest case, the output layout should reflect the end dates of periods, which period type it was, and the amount of change between start and stop of the period.
Explicit output:
DATE CHG PERIOD
2020-06-22 NaN 1
2020-06-23 1.43 2
2020-06-26 12.0 1
2020-09-02 57.65 2
2020-09-23 35.06 1
2020-10-12 29.79 2
2020-10-30 25.89 1
2021-01-25 57.85 2
However, if there is any possibility of actually grouping by a date range consisting of start AND stop date, that would be much more favorable
Thank you!
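import pandas as pd
p1 = pd.DataFrame(data={'Date': ['2020-06-22', '2020-06-26', '2020-09-23', '2020-10-30'], 'val':[310.62, 300.05, 322.64, 326.54]})
p2 = pd.DataFrame(data={'Date': ['2020-06-23', '2020-09-02', '2020-10-12', '2021-01-25'], 'val':[312.05, 357.7, 352.43, 384.39]})
p1['period'] = 1
p2['period'] = 2
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([p1, p2]).sort_values('Date').reset_index(drop=True)
# absolute change relative to the previous (chronologically sorted) row
df['CHG'] = df['val'].diff(periods=1).abs()
# drop returns a new frame, so assign the result back
df = df.drop('val', axis=1)
df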
Output:
Date period CHG
0 2020-06-22 1 NaN
1 2020-06-23 2 1.43
2 2020-06-26 1 12.00
3 2020-09-02 2 57.65
4 2020-09-23 1 35.06
5 2020-10-12 2 29.79
6 2020-10-30 1 25.89
7 2021-01-25 2 57.85
EDIT: matching the format START - STOP - CHANGE - PERIOD
Starting from the above data frame:
# the previous row's end date is this row's start date
df['Start'] = df.Date.shift(periods=1)
df.rename(columns={'Date': 'Stop'}, inplace=True)
df = df[['Start', 'Stop', 'CHG', 'period']]
df
Output:
Start Stop CHG period
0 NaN 2020-06-22 NaN 1
1 2020-06-22 2020-06-23 1.43 2
2 2020-06-23 2020-06-26 12.00 1
3 2020-06-26 2020-09-02 57.65 2
4 2020-09-02 2020-09-23 35.06 1
5 2020-09-23 2020-10-12 29.79 2
6 2020-10-12 2020-10-30 25.89 1
7 2020-10-30 2021-01-25 57.85 2
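# df1 and df2 are assumed here to be the period1 and period2 series, indexed by date
# If needed, convert their indexes to datetime first: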
df1.index = pd.to_datetime(df1.index)
df2.index = pd.to_datetime(df2.index)
df = pd.concat([df1, df2], axis=1)
df.columns = ['start','stop']
df['CNG'] = df.bfill(axis=1)['start'].diff().abs()
df['PERIOD'] = 1
df.loc[df.stop.notna(), 'PERIOD'] = 2
df = df[['CNG', 'PERIOD']]
print(df)
Output:
CNG PERIOD
Date
2020-06-22 NaN 1
2020-06-23 1.43 2
2020-06-26 12.00 1
2020-09-02 57.65 2
2020-09-23 35.06 1
2020-10-12 29.79 2
2020-10-30 25.89 1
2021-01-25 57.85 2
2021-01-29 14.32 1
2021-02-12 22.57 2
2021-03-04 15.94 1
2021-05-07 45.42 2
2021-05-12 16.71 1
2021-09-02 47.78 2
2021-10-04 24.55 1
2021-11-18 41.09 2
2021-12-01 19.23 1
2021-12-10 20.24 2
2021-12-20 15.76 1
2022-01-03 22.73 2
2022-01-27 46.47 1
2022-02-09 26.30 2
2022-02-23 35.59 1
2022-03-02 15.94 2
2022-03-08 21.64 1
2022-03-29 45.30 2
2022-04-29 49.55 1
2022-05-04 17.06 2
2022-05-12 36.72 1
2022-05-17 15.98 2
2022-05-19 18.86 1
2022-06-02 27.93 2
2022-06-17 51.53 1
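The concat-based approach above can be tried on the four-row sample from the question with the sketch below. The names period1, period2, wide and out are illustrative and do not come from the answer. An explicit sort_index() is added so the result does not depend on how concat orders the combined index, and a Start column is derived from the shifted index to show each period as a start/stop pair.

```python
import pandas as pd

# The two series from the question, indexed by the end date of each period
period1 = pd.Series(
    [310.62, 300.05, 322.64, 326.54],
    index=pd.to_datetime(["2020-06-22", "2020-06-26", "2020-09-23", "2020-10-30"]),
)
period2 = pd.Series(
    [312.05, 357.70, 352.43, 384.39],
    index=pd.to_datetime(["2020-06-23", "2020-09-02", "2020-10-12", "2021-01-25"]),
)

# Align both series on the union of their dates; each row has exactly one value
wide = pd.concat([period1, period2], axis=1).sort_index()
wide.columns = ["p1", "p2"]

# Collapse the two columns into one chronological value series
value = wide["p1"].fillna(wide["p2"])

out = pd.DataFrame({
    "CHG": value.diff().abs(),                     # change since the previous end date
    "PERIOD": wide["p2"].notna().astype(int) + 1,  # 1 if the row came from period1, else 2
})

# The previous end date doubles as the start date of the current period
out.insert(0, "Start", out.index.to_series().shift())

print(out)
```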

Calculate day's difference between successive pandas dataframe rows with condition

I have a dataframe as following:
Company Date relTweet GaplastRel
XYZ 3/2/2020 1
XYZ 3/3/2020 1
XYZ 3/4/2020 1
XYZ 3/5/2020 1
XYZ 3/5/2020 0
XYZ 3/6/2020 1
XYZ 3/8/2020 1
ABC 3/9/2020 0
ABC 3/10/2020 1
ABC 3/11/2020 0
ABC 3/12/2020 1
The relTweet displays whether the tweet is relevant (1) or not (0).
I need to find the difference in days (GaplastRel) between successive rows for each company, with the condition that the gap is measured from the most recent previous relevant tweet (i.e. relTweet = 1). E.g. for the first record GaplastRel should be 0. For the 2nd record, GaplastRel should be 1, as the last relevant tweet was made one day ago.
Below is the example of needed output:
Company Date relTweet GaplastRel
XYZ 3/2/2020 1 0
XYZ 3/3/2020 1 1
XYZ 3/4/2020 1 1
XYZ 3/5/2020 1 1
XYZ 3/5/2020 0 1
XYZ 3/6/2020 1 1
XYZ 3/8/2020 1 2
ABC 3/9/2020 0 0
ABC 3/10/2020 1 0
ABC 3/11/2020 0 1
ABC 3/12/2020 1 2
Following is my code:
dataDf['Date'] = pd.to_datetime(dataDf['Date'], format='%m/%d/%Y')
dataDf['GaplastRel'] = (dataDf.groupby('Company', group_keys=False)
                        .apply(lambda g: g['Date'].diff().replace(0, np.nan).ffill()))
This code gives the difference in days between successive rows for each company without considering the relTweet = 1 condition. I am not sure how to apply the condition.
Following is the output of the above code:
Company Date relTweet GaplastRel
XYZ 3/2/2020 1 NaT
XYZ 3/3/2020 1 1 days
XYZ 3/4/2020 1 1 days
XYZ 3/5/2020 1 1 days
XYZ 3/5/2020 0 0 days
XYZ 3/6/2020 1 1 days
XYZ 3/8/2020 1 2 days
ABC 3/9/2020 0 NaT
ABC 3/10/2020 1 1 days
ABC 3/11/2020 0 1 days
ABC 3/12/2020 1 1 days
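Think about it differently: sometimes merge_asof is a better fit than groupby. Match each row with the most recent earlier relevant tweet of the same company, then subtract the dates: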
df1=df.loc[df['relTweet']==1,['Company','Date']]
df=pd.merge_asof(df,df1.assign(Date1=df1.Date),by='Company',on='Date', allow_exact_matches=False)
df['GaplastRel']=(df.Date-df.Date1).dt.days.fillna(0)
df
Output:
Company Date relTweet Date1 GaplastRel
0 XYZ 2020-03-02 1 NaT 0.0
1 XYZ 2020-03-03 1 2020-03-02 1.0
2 XYZ 2020-03-04 1 2020-03-03 1.0
3 XYZ 2020-03-05 1 2020-03-04 1.0
4 XYZ 2020-03-05 0 2020-03-04 1.0
5 XYZ 2020-03-06 1 2020-03-05 1.0
6 XYZ 2020-03-08 1 2020-03-06 2.0
7 ABC 2020-03-09 0 NaT 0.0
8 ABC 2020-03-10 1 NaT 0.0
9 ABC 2020-03-11 0 2020-03-10 1.0
10 ABC 2020-03-12 1 2020-03-10 2.0
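A self-contained version of the merge_asof idea, for experimenting, might look like the sketch below. It assumes the sample data from the question. The names rel, lastRel and out are illustrative, and the stable sort is there because merge_asof requires the `on` key to be sorted in both frames.

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["XYZ"] * 7 + ["ABC"] * 4,
    "Date": ["3/2/2020", "3/3/2020", "3/4/2020", "3/5/2020", "3/5/2020", "3/6/2020",
             "3/8/2020", "3/9/2020", "3/10/2020", "3/11/2020", "3/12/2020"],
    "relTweet": [1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1],
})
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# merge_asof needs the "on" key sorted; a stable sort keeps same-day rows in order
df = df.sort_values("Date", kind="stable").reset_index(drop=True)

# Lookup table of relevant tweets, carrying their own date as lastRel
rel = df.loc[df["relTweet"].eq(1), ["Company", "Date"]]
rel = rel.assign(lastRel=rel["Date"])

# For each row, find the latest strictly earlier relevant tweet of the same company
out = pd.merge_asof(df, rel, on="Date", by="Company", allow_exact_matches=False)

# Rows with no earlier relevant tweet get a gap of 0, as in the expected output
out["GaplastRel"] = (out["Date"] - out["lastRel"]).dt.days.fillna(0).astype(int)

print(out)
```

allow_exact_matches=False is what keeps a row from matching itself (or another tweet on the same day), so the gap is always measured against a previous day.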

How to merge two dataframes based on dates where the date difference is one day?

Input
df1
id A
2020-01-01 10
2020-02-07 20
2020-04-09 30
df2
id B
2019-12-31 50
2020-02-06 20
2020-02-07 70
2020-04-08 34
2020-04-09 44
Goal
df
id A B
2020-01-01 10 50
2020-02-07 20 20
2020-04-09 30 34
The details are as follows:
df1 is merged with df2 on id, which adds the columns from df2.
The type of id is datetime.
Merge rule: each row of df1 takes the df2 row from the previous day.
Could you simply add 1 day to df2's ID column before merging?
df1.merge(df2.assign(id=df2['id'] + pd.Timedelta(days=1)), on='id')
id A B
0 2020-01-01 10 50
1 2020-02-07 20 20
2 2020-04-09 30 34
Try pd.merge_asof
df = pd.merge_asof(df1,df2,on='id',tolerance=pd.Timedelta('1 day'),allow_exact_matches=False)
id A B
0 2020-01-01 10 50
1 2020-02-07 20 20
2 2020-04-09 30 34
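The two suggestions can be compared side by side on the sample data with the sketch below. The variable names shifted, exact and asof are illustrative. how='left' is an addition here so that df1 rows without a match would be kept, as merge_asof does by default.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": pd.to_datetime(["2020-01-01", "2020-02-07", "2020-04-09"]),
    "A": [10, 20, 30],
})
df2 = pd.DataFrame({
    "id": pd.to_datetime(["2019-12-31", "2020-02-06", "2020-02-07",
                          "2020-04-08", "2020-04-09"]),
    "B": [50, 20, 70, 34, 44],
})

# Approach 1: move df2 forward by one day, then do an ordinary equality merge
shifted = df2.assign(id=df2["id"] + pd.Timedelta(days=1))
exact = df1.merge(shifted, on="id", how="left")

# Approach 2: nearest earlier row within one day, excluding same-day matches
asof = pd.merge_asof(
    df1, df2, on="id",
    tolerance=pd.Timedelta("1 day"),
    allow_exact_matches=False,
)

print(exact)
print(asof)
```

On date-only data like this the two give the same result. They can differ when id carries a time of day: the shifted merge needs a timestamp exactly 24 hours earlier, while merge_asof accepts any earlier timestamp within the tolerance.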

Pandas function issues - equation output incorrect

def get_hours(row):
    if (row['Country']== 'Afghanistan' or row['Country']== 'Iraq' or row['Country']=='Somalia' or row['Country']=='Yemen') and row['conus_days']>0 or row['conus_days1']>0:
        return row['conus_days']*8 + row['conus_days1']*12
    elif (row['Country']== 'Afghanistan' or row['Country']== 'Iraq' or row['Country']=='Somalia' or row['Country']=='Yemen') and row['oconus_days']>0 or row['oconus_days1']>0:
        return row['oconus_days']*12 + row['oconus_days1']*8
    elif (row['Country']== 'Afghanistan' or row['Country']== 'Iraq' or row['Country']=='Somalia' or row['Country']=='Yemen'):
        return row['days_in_month']*12
    elif (row['Country'] == 'Germany') and row['conus_days']>0:
        return row['conus_days']*8 + row['conus_days1']*10
    elif (row['Country'] == 'Germany'):
        return row['days_in_month']*10
    elif row['Country'] == 'Conus':
        return row['working_days']*8
    else:
        return row['working_days']*8
forecast['hours'] = forecast.apply(lambda row: get_hours(row), axis=1)
print(forecast.head())
This returns the following output:
Name EID Start Date End Date Country year Month \
0 xx 123456 2019-08-01 2020-01-03 Afghanistan 2020 1
1 XX 3456789 2019-09-22 2020-02-16 Conus 2020 1
2 xx. 456789 2019-12-05 2020-03-12 Conus 2020 1
3 DR. 789456 2019-09-11 2020-03-04 Iraq 2020 1
4 JR. 985756 2020-01-03 2020-05-06 Germany 2020 1
days_in_month start_month end_month working_days conus_mth oconus_mth \
0 31 2020-01-01 2020-01-31 21 8 1
1 31 2020-01-01 2020-01-31 21 9 2
2 31 2020-01-01 2020-01-31 21 12 3
3 31 2020-01-01 2020-01-31 21 9 3
4 31 2020-01-01 2020-01-31 21 1 5
conus_days conus_days1 oconus_days oconus_days1 hours
0 0 0 2 25 224
1 0 0 0 0 168
2 0 0 0 0 168
3 0 0 0 0 372
4 1 28 0 0 344
The output on row 4 is incorrect; it should return 288.
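Wrapping each condition in parentheses, and in particular grouping the `or` terms together, lets each if statement be evaluated on its own and accurately. In Python, `and` binds more tightly than `or`, so `A and B or C` is evaluated as `(A and B) or C`. For row 4 (Germany), conus_days1 > 0 alone was therefore enough to trigger the first branch, which gave 1*8 + 28*12 = 344 instead of 1*8 + 28*10 = 288.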
def get_hours(row):
    if ((row['Country']== 'Afghanistan' or row['Country']== 'Iraq' or row['Country']=='Somalia' or row['Country']=='Yemen') and (row['conus_days']>0 or row['conus_days1']>0)):
        return row['conus_days']*8 + row['conus_days1']*12
    # the or-terms are grouped here as well, otherwise oconus_days1 > 0 alone would match for any country
    if ((row['Country']== 'Afghanistan' or row['Country']== 'Iraq' or row['Country']=='Somalia' or row['Country']=='Yemen') and (row['oconus_days']>0 or row['oconus_days1']>0)):
        return row['oconus_days']*12 + row['oconus_days1']*8
    if (row['Country']== 'Afghanistan' or row['Country']== 'Iraq' or row['Country']=='Somalia' or row['Country']=='Yemen'):
        return row['days_in_month']*12
    if ((row['Country'] == 'Germany') and row['conus_days']>0):
        return row['conus_days']*8 + row['conus_days1']*10
    if ((row['Country'] == 'Germany') and row['oconus_days']>0):
        return row['oconus_days']*10 + row['oconus_days1']*8
    if (row['Country'] == 'Germany'):
        return row['days_in_month']*10
    if (row['Country'] == 'Conus'):
        return row['working_days']*8
    else:
        return row['working_days']*8
forecast['hours'] = forecast.apply(lambda row: get_hours(row), axis=1)
print(forecast.head())
Name EID Start Date End Date Country year Month \
0 XX. 123456 2019-08-01 2020-01-03 Afghanistan 2020 1
1 xx 3456789 2019-09-22 2020-02-16 Conus 2020 1
2 Mh 456789 2019-12-05 2020-03-12 Conus 2020 1
3 DR 789456 2019-09-11 2020-03-04 Iraq 2020 1
4 JR 985756 2020-01-03 2020-05-06 Germany 2020 1
days_in_month start_month end_month working_days conus_mth oconus_mth \
0 31 2020-01-01 2020-01-31 21 8 1
1 31 2020-01-01 2020-01-31 21 9 2
2 31 2020-01-01 2020-01-31 21 12 3
3 31 2020-01-01 2020-01-31 21 9 3
4 31 2020-01-01 2020-01-31 21 1 5
conus_days conus_days1 oconus_days oconus_days1 hours
0 0 0 2 25 224
1 0 0 0 0 168
2 0 0 0 0 168
3 0 0 0 0 372
4 1 28 0 0 288
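The precedence issue is easier to avoid if the country test and the day tests are computed once and then nested. The sketch below is one possible restructuring, not the poster's code. HIGH_RATE, has_conus and has_oconus are illustrative names, and the small forecast frame only reproduces the columns get_hours reads for rows 0, 1, 3 and 4 of the printed output.

```python
import pandas as pd

# Countries that share the same rates in the original function
HIGH_RATE = {"Afghanistan", "Iraq", "Somalia", "Yemen"}

def get_hours(row):
    country = row["Country"]
    # Each compound test is evaluated once, so no and/or mixing is needed later
    has_conus = row["conus_days"] > 0 or row["conus_days1"] > 0
    has_oconus = row["oconus_days"] > 0 or row["oconus_days1"] > 0

    if country in HIGH_RATE:
        if has_conus:
            return row["conus_days"] * 8 + row["conus_days1"] * 12
        if has_oconus:
            return row["oconus_days"] * 12 + row["oconus_days1"] * 8
        return row["days_in_month"] * 12
    if country == "Germany":
        if row["conus_days"] > 0:
            return row["conus_days"] * 8 + row["conus_days1"] * 10
        if row["oconus_days"] > 0:
            return row["oconus_days"] * 10 + row["oconus_days1"] * 8
        return row["days_in_month"] * 10
    # Conus and every other country use working days at 8 hours
    return row["working_days"] * 8

# Only the columns the function reads, taken from the printed rows 0, 1, 3 and 4
forecast = pd.DataFrame({
    "Country": ["Afghanistan", "Conus", "Iraq", "Germany"],
    "days_in_month": [31, 31, 31, 31],
    "working_days": [21, 21, 21, 21],
    "conus_days": [0, 0, 0, 1],
    "conus_days1": [0, 0, 0, 28],
    "oconus_days": [2, 0, 0, 0],
    "oconus_days1": [25, 0, 0, 0],
})
forecast["hours"] = forecast.apply(get_hours, axis=1)
print(forecast[["Country", "hours"]])
```

The Germany row now yields 288, and the other three rows keep the values shown in the output above.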

Subtract day column from date column in pandas data frame

I have two columns in my data frame. One column is a date (df["Start_date"]) and the other is a number of days (df["days"]). I want to subtract the number-of-days column from the date column.
I was trying something like this
df["new_date"]=df["Start_date"]-datetime.timedelta(days=df["days"])
I think you need to_timedelta:
df["new_date"]=df["Start_date"]-pd.to_timedelta(df["days"], unit='D')
Sample:
np.random.seed(120)
start = pd.to_datetime('2015-02-24')
rng = pd.date_range(start, periods=10)
df = pd.DataFrame({'Start_date': rng, 'days': np.random.choice(np.arange(10), size=10)})
print (df)
Start_date days
0 2015-02-24 7
1 2015-02-25 0
2 2015-02-26 8
3 2015-02-27 4
4 2015-02-28 1
5 2015-03-01 7
6 2015-03-02 1
7 2015-03-03 3
8 2015-03-04 8
9 2015-03-05 9
df["new_date"]=df["Start_date"]-pd.to_timedelta(df["days"], unit='D')
print (df)
Start_date days new_date
0 2015-02-24 7 2015-02-17
1 2015-02-25 0 2015-02-25
2 2015-02-26 8 2015-02-18
3 2015-02-27 4 2015-02-23
4 2015-02-28 1 2015-02-27
5 2015-03-01 7 2015-02-22
6 2015-03-02 1 2015-03-01
7 2015-03-03 3 2015-02-28
8 2015-03-04 8 2015-02-24
9 2015-03-05 9 2015-02-24