Get the last week date from the previous group, for each group, in pandas?

hy_code date week_date last_week_date
340 880301 2006/10/31 2006/11/5 NaT
341 880301 2006/11/1 2006/11/5 NaT
342 880301 2006/11/2 2006/11/5 NaT
343 880301 2006/11/3 2006/11/5 NaT
355 880301 2006/11/21 2006/11/26 2006/11/5
482916 880969 2021/12/24 2021/12/26 NaT
482918 880969 2021/12/28 2022/1/2 2021/12/26
482919 880969 2021/12/29 2022/1/2 2021/12/26
482920 880969 2021/12/30 2022/1/2 2021/12/26
482921 880969 2021/12/31 2022/1/2 2021/12/26
Goal
I want to get the previous week date (last_week_date) for each group's week_date. The expected result is shown above.
Try
df['new_week_date'] = tmp_df.groupby(['hy_code','week_date'])['week_date'].shift(1), but it produced the wrong result:
hy_code date week_date last_week_date
340 880301 2006-10-31 2006-11-05 NaT
341 880301 2006-11-01 2006-11-05 2006-11-05
342 880301 2006-11-02 2006-11-05 2006-11-05
343 880301 2006-11-03 2006-11-05 2006-11-05
355 880301 2006-11-21 2006-11-26 NaT
482916 880969 2021-12-24 2021-12-26 NaT
482918 880969 2021-12-28 2022-01-02 2021-12-26
482919 880969 2021-12-29 2022-01-02 2022-01-02
482920 880969 2021-12-30 2022-01-02 2022-01-02
482921 880969 2021-12-31 2022-01-02 2022-01-02
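
For reference, a minimal sketch reconstructing the input frame above (the row index values are omitted, and keeping the dates as plain strings is an assumption made only for reproducibility):
import pandas as pd

df = pd.DataFrame({
    'hy_code': [880301] * 5 + [880969] * 5,
    'date': ['2006/10/31', '2006/11/1', '2006/11/2', '2006/11/3', '2006/11/21',
             '2021/12/24', '2021/12/28', '2021/12/29', '2021/12/30', '2021/12/31'],
    'week_date': ['2006/11/5'] * 4 + ['2006/11/26']
                 + ['2021/12/26'] + ['2022/1/2'] * 4,
})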

You can first compute the shifted week as a series with a double groupby:
s = (df
     .groupby(['hy_code', 'week_date'], sort=False)['week_date'].first()
     .groupby('hy_code').shift()
     .rename('last_week_date')
)
hy_code week_date
880301 2006/11/5 NaN
2006/11/26 2006/11/5
880969 2021/12/26 NaN
2022/1/2 2021/12/26
Name: last_week_date, dtype: object
Then merge it to the original data:
# ensure last_week_date is not pre-existing
# df = df.drop(columns='last_week_date')
# merge
df.merge(s, left_on=['hy_code','week_date'], right_index=True)
Output:
hy_code date week_date last_week_date
340 880301 2006/10/31 2006/11/5 NaN
341 880301 2006/11/1 2006/11/5 NaN
342 880301 2006/11/2 2006/11/5 NaN
343 880301 2006/11/3 2006/11/5 NaN
355 880301 2006/11/21 2006/11/26 2006/11/5
482916 880969 2021/12/24 2021/12/26 NaN
482918 880969 2021/12/28 2022/1/2 2021/12/26
482919 880969 2021/12/29 2022/1/2 2021/12/26
482920 880969 2021/12/30 2022/1/2 2021/12/26
482921 880969 2021/12/31 2022/1/2 2021/12/26

You can first drop_duplicates, then create the shifted week date and merge back:
out = df.drop_duplicates(['hy_code', 'week_date']).copy()
out['lastweekdate'] = out.groupby('hy_code')['week_date'].shift(1)
df = df.merge(out.drop(columns='date'), how='left')
df
Out[233]:
hy_code date week_date lastweekdate
0 880301 2006/10/31 2006/11/5 NaN
1 880301 2006/11/1 2006/11/5 NaN
2 880301 2006/11/2 2006/11/5 NaN
3 880301 2006/11/3 2006/11/5 NaN
4 880301 2006/11/21 2006/11/26 2006/11/5
5 880969 2021/12/24 2021/12/26 NaN
6 880969 2021/12/28 2022/1/2 2021/12/26
7 880969 2021/12/29 2022/1/2 2021/12/26
8 880969 2021/12/30 2022/1/2 2021/12/26
9 880969 2021/12/31 2022/1/2 2021/12/26

Moving Average Pandas Across Group

My data has the following structure:
import numpy as np
import pandas as pd

np.random.seed(25)
tdf = pd.DataFrame({
    'person_id': [1, 1, 1, 1,
                  2, 2,
                  3, 3, 3, 3, 3,
                  4, 4, 4,
                  5, 5, 5, 5, 5, 5, 5,
                  6,
                  7, 7,
                  8, 8, 8, 8, 8, 8, 8,
                  9, 9,
                  10, 10],
    'Date': ['2021-01-02', '2021-01-05', '2021-01-07', '2021-01-09',
             '2021-01-02', '2021-01-05',
             '2021-01-02', '2021-01-05', '2021-01-07', '2021-01-09', '2021-01-11',
             '2021-01-02', '2021-01-05', '2021-01-07',
             '2021-01-02', '2021-01-05', '2021-01-07', '2021-01-09', '2021-01-11', '2021-01-13', '2021-01-15',
             '2021-01-02',
             '2021-01-02', '2021-01-05',
             '2021-01-02', '2021-01-05', '2021-01-07', '2021-01-09', '2021-01-11', '2021-01-13', '2021-01-15',
             '2021-01-02', '2021-01-05',
             '2021-01-02', '2021-01-05'],
    'Quantity': np.floor(np.random.random(size=35) * 100)
})
I want to calculate a moving average (2 periods) over Date. For the first MA value we take 2021-01-02 and 2021-01-05 across all observations and calculate the mean (50); similarly for the other dates. The output does not need to keep the structure shown here; I just need the date and the MA column in the final data.
Thanks!
IIUC, you can first aggregate the identical dates, getting the sum and the count.
Then take the rolling sum over 2 dates (it looks like you want raw successive values rather than a defined time window, so I am assuming the data is already sorted by date).
Finally, take the ratio of sum to count to get the mean:
g = tdf.groupby('Date')['Quantity']
out = g.sum().rolling(2).sum()/g.count().rolling(2).sum()
output:
Date
2021-01-02 NaN
2021-01-05 50.210526
2021-01-07 45.071429
2021-01-09 41.000000
2021-01-11 44.571429
2021-01-13 48.800000
2021-01-15 50.500000
Name: Quantity, dtype: float64
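
As a quick sanity check on the 2021-01-05 value above, it is simply the plain mean of the 19 Quantity observations dated 2021-01-02 or 2021-01-05 combined:
# should print roughly 50.210526, matching the rolling result above
check = tdf.loc[tdf['Date'].isin(['2021-01-02', '2021-01-05']), 'Quantity'].mean()
print(check)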
Joining back to the original data:
g = tdf.groupby('Date')['Quantity']
s = g.sum().rolling(2).sum()/g.count().rolling(2).sum()
tdf.merge(s.rename('Quantity_MA(2)'), left_on='Date', right_index=True)
output:
person_id Date Quantity Quantity_MA(2)
0 1 2021-01-02 87.0 NaN
4 2 2021-01-02 41.0 NaN
6 3 2021-01-02 68.0 NaN
11 4 2021-01-02 11.0 NaN
14 5 2021-01-02 16.0 NaN
21 6 2021-01-02 51.0 NaN
22 7 2021-01-02 38.0 NaN
24 8 2021-01-02 51.0 NaN
31 9 2021-01-02 90.0 NaN
33 10 2021-01-02 45.0 NaN
1 1 2021-01-05 58.0 50.210526
5 2 2021-01-05 11.0 50.210526
7 3 2021-01-05 43.0 50.210526
12 4 2021-01-05 44.0 50.210526
15 5 2021-01-05 52.0 50.210526
23 7 2021-01-05 99.0 50.210526
25 8 2021-01-05 55.0 50.210526
32 9 2021-01-05 66.0 50.210526
34 10 2021-01-05 28.0 50.210526
2 1 2021-01-07 27.0 45.071429
8 3 2021-01-07 55.0 45.071429
13 4 2021-01-07 58.0 45.071429
16 5 2021-01-07 32.0 45.071429
26 8 2021-01-07 3.0 45.071429
3 1 2021-01-09 18.0 41.000000
9 3 2021-01-09 36.0 41.000000
17 5 2021-01-09 69.0 41.000000
27 8 2021-01-09 71.0 41.000000
10 3 2021-01-11 40.0 44.571429
18 5 2021-01-11 36.0 44.571429
28 8 2021-01-11 42.0 44.571429
19 5 2021-01-13 83.0 48.800000
29 8 2021-01-13 43.0 48.800000
20 5 2021-01-15 48.0 50.500000
30 8 2021-01-15 28.0 50.500000

How to merge two dataframes based on time?

Trying to merge two dataframes. The first is big, the second is small. Both have a datetime set as index.
I want the datetime value of the second one (and its row) merged in between the datetime values of the first one, sorted by time.
df1:
df1 = pd.read_csv(left_inputfile_to_read, decimal=".",sep=';', parse_dates = True, low_memory=False)
df1.columns = ['FLIGHT_ID','X', 'Y','MODE_C', 'SPEED', 'HEADING', 'TRK_ROCD', 'TIJD']
df1['datetime'] = pd.to_datetime(df1['TIJD'], infer_datetime_format = True, format="%Y-%M-%D %H:%M:%S")
df1.set_index(['datetime'], inplace=True)
print(df1)
FLIGHT_ID X Y MODE_C SPEED HEADING TRK_ROCD TIJD
datetime
2019-01-28 00:26:56 20034026 -13345 -1923 230.0 414 88 NaN 28-1-2019 00:26:56
2019-01-28 00:27:00 20034026 -13275 -1923 230.0 414 88 NaN 28-1-2019 00:27:00
2019-01-28 00:27:05 20034026 -13204 -1923 230.0 414 88 NaN 28-1-2019 00:27:05
2019-01-28 00:27:10 20034026 -13134 -1923 230.0 414 88 NaN 28-1-2019 00:27:10
2019-01-28 00:27:15 20034026 -13064 -1923 230.0 414 88 NaN 28-1-2019 00:27:15
... ... ... ... ... ... ... ... ...
2019-01-29 00:08:32 20035925 13443 -531 230.0 257 85 NaN 29-1-2019 00:08:32
2019-01-29 00:08:37 20035925 13487 -526 230.0 257 85 NaN 29-1-2019 00:08:37
2019-01-29 00:08:42 20035925 13530 -520 230.0 257 85 NaN 29-1-2019 00:08:42
2019-01-29 00:08:46 20035925 13574 -516 230.0 257 85 NaN 29-1-2019 00:08:46
2019-01-29 00:08:51 20035925 13617 -510 230.0 257 85 NaN 29-1-2019 00:08:51
551446 rows × 8 columns
df2:
df2 = pd.read_csv(right_inputfile_to_read, decimal=".",sep=';', parse_dates = True, low_memory=False)
df2['datetime'] = pd.to_datetime(df2['T_START'], infer_datetime_format = True, format="%Y-%M-%D %H:%M:%S" , dayfirst=True)
df2.set_index(['datetime'], inplace=True)
df2.drop(columns=['T_START', 'T_END', 'AIRFIELD'], inplace=True)
print(df2)
QNH MODE_C_CORRECTION
datetime
2019-01-28 02:14:00 1022 235
2019-01-28 02:14:00 1022 235
2019-01-28 02:16:00 1019 155
2019-01-28 02:21:00 1019 155
2019-01-28 02:36:00 1019 155
... ... ...
2019-01-28 21:56:00 1014 21
2019-01-28 22:56:00 1014 21
2019-01-28 23:26:00 1014 21
2019-01-28 23:29:00 1014 21
2019-01-28 23:52:00 1014 21
[69 rows x 2 columns]
The idea is that the first row of df2 should be inserted somewhere at 2019-01-28 02:14:00.
I have spent hours on Stack Overflow and in the pandas documentation (merge, join, concat) but cannot find the right solution.
The next step would be to interpolate the values in column 'QNH' to the rows that are in df1, based on that time.
Any help greatly appreciated!
Just concatenate two DataFrames and sort by date:
df = pd.concat([df1,df2]).sort_values(by='datetime')
For the next step you can use pandas.DataFrame.interpolate.
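A rough sketch of that next step (not from the original answer), assuming QNH should be interpolated in time onto df1's timestamps; method='time' weights by the actual gaps between rows:
merged = pd.concat([df1, df2]).sort_index()
# time-based interpolation uses the DatetimeIndex; assumes no problematic duplicate timestamps
merged['QNH'] = merged['QNH'].interpolate(method='time')
# keep only the original df1 rows, now carrying interpolated QNH values
result = merged[merged['FLIGHT_ID'].notna()]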

Pandas DF fill in for Missing Months

I have a dataframe of values that are mostly (but not always) quarterly.
I need to fill in any missing months so that it is complete.
Here I need to put it into a complete df from 2015-12 to 2021-03.
Thank you.
id date amt rate
0 15856 2015-12-31 85.09 0.0175
1 15857 2016-03-31 135.60 0.0175
2 15858 2016-06-30 135.91 0.0175
3 15859 2016-09-30 167.27 0.0175
4 15860 2016-12-31 173.32 0.0175
....
19 15875 2020-09-30 305.03 0.0175
20 15876 2020-12-31 354.09 0.0175
21 15877 2021-03-31 391.19 0.0175
You can use pd.date_range() to generate the month-end dates with freq='M', then reindex the datetime index.
df_ = (df.set_index('date')
         .reindex(pd.date_range('2015-12', '2021-03', freq='M'))
         .reset_index()
         .rename(columns={'index': 'date'}))
print(df_)
date id amt rate
0 2015-12-31 15856.0 85.09 0.0175
1 2016-01-31 NaN NaN NaN
2 2016-02-29 NaN NaN NaN
3 2016-03-31 15857.0 135.60 0.0175
4 2016-04-30 NaN NaN NaN
.. ... ... ... ...
58 2020-10-31 NaN NaN NaN
59 2020-11-30 NaN NaN NaN
60 2020-12-31 15876.0 354.09 0.0175
61 2021-01-31 NaN NaN NaN
62 2021-02-28 NaN NaN NaN
To fill the NaN values, you can use df_.fillna(0).
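For example, a hedged way to finish, assuming zeros are wanted for the missing amounts while the rate is carried forward (the forward-fill is an assumption, not part of the answer above):
df_['amt'] = df_['amt'].fillna(0)   # missing months get amount 0
df_['rate'] = df_['rate'].ffill()   # carry the last known rate forward (assumption)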

Pandas resample is jumbling date order

I'm trying to resample some tick data I have into 1-minute blocks. The code appears to work fine, but when I look at the resulting dataframe the order of the dates has been changed incorrectly. Below is what it looks like pre-resample:
Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10
2020-06-30 17:00:00 41.68 2 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 3 tptAsk tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.68 1 tptTrade tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 5 tptAsk tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 8 tptAsk tctRegular NaN 255 NaN 0 msNormal
... ... ... ... ... ... ... ... ... ...
2020-01-07 17:00:21 41.94 5 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:27 41.94 4 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:40 41.94 3 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:46 41.94 4 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:50 41.94 3 tptBid tctRegular NaN 255 NaN 0 msNormal
As you can see the date starts at 5pm on the 30th of June. Then I use this code:
one_minute_dataframe['Price'] = df.Var2.resample('1min').last()
one_minute_dataframe['Volume'] = df.Var3.resample('1min').sum()
one_minute_dataframe.index = pd.to_datetime(one_minute_dataframe.index)
one_minute_dataframe.sort_index(inplace = True)
And I get the following:
Price Volume
2020-01-07 00:00:00 41.73 416
2020-01-07 00:01:00 41.74 198
2020-01-07 00:02:00 41.76 40
2020-01-07 00:03:00 41.74 166
2020-01-07 00:04:00 41.77 143
... ... ...
2020-06-30 23:55:00 41.75 127
2020-06-30 23:56:00 41.74 234
2020-06-30 23:57:00 41.76 344
2020-06-30 23:58:00 41.72 354
2020-06-30 23:59:00 41.74 451
It seems to want to start from midnight on the 1st of July, and I've tried sorting the index but it still does not change.
Also, the datetime index seems to add lots of extra dates outside the ones that were originally in the dataframe and plonks them in the middle of the resampled one.
Any help would be great. Apologies if I've set this out poorly.
I see what's happened. Somewhere in the data download the month and day have been switched around. That's why it's putting July at the top: because it thinks it's January.
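A hedged sketch of a fix, assuming the raw timestamps are still available as day-first strings in a column (hypothetically called 'timestamp' here); re-parsing them with dayfirst=True before resampling restores the intended order:
# 'timestamp' is a hypothetical column name; adjust to wherever the raw strings live
df.index = pd.to_datetime(df['timestamp'], dayfirst=True)
df = df.sort_index()
one_minute_dataframe = pd.DataFrame()
one_minute_dataframe['Price'] = df.Var2.resample('1min').last()
one_minute_dataframe['Volume'] = df.Var3.resample('1min').sum()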

When grouping data, how can I cumsum milliseconds in a pandas DataFrame?

When using group by, how can I cumsum the milliseconds in a df?
The input is below.
inputs:
time key isValue
2018-03-04 00:00:06.520 1 NaN
2018-03-04 00:00:07.230 1 NaN
2018-03-04 00:00:08.140 1 1
2018-03-04 00:00:08.720 1 1
2018-03-04 00:00:09.110 1 1
2018-03-04 00:00:09.650 1 NaN
2018-03-04 00:00:10.360 1 NaN
2018-03-04 00:00:11.150 1 NaN
2018-03-04 00:00:11.770 2 NaN
2018-03-04 00:00:12.320 2 NaN
2018-03-04 00:00:12.910 2 1
2018-03-04 00:00:13.250 2 1
2018-03-04 00:00:13.960 2 1
2018-03-04 00:00:14.550 2 NaN
2018-03-04 00:00:15.250 2 NaN
....
And the output I want is below.
outputs:
key : time
1 : 1.030
2 : 1.050
3 : X.xxx
4 : X.xxx
....
Well, I'm using this code:
df.groupby(["key"])["time"].cumsum()
but I don't think it is correct.
I think you need:
df['new'] = df["time"].dt.microsecond.groupby(df["key"]).cumsum() / 1000
print (df)
time key isValue new
0 2018-03-04 00:00:06.520 1 NaN 520.0
1 2018-03-04 00:00:07.230 1 NaN 750.0
2 2018-03-04 00:00:08.140 1 1.0 890.0
3 2018-03-04 00:00:08.720 1 1.0 1610.0
4 2018-03-04 00:00:09.110 1 1.0 1720.0
5 2018-03-04 00:00:09.650 1 NaN 2370.0
6 2018-03-04 00:00:10.360 1 NaN 2730.0
7 2018-03-04 00:00:11.150 1 NaN 2880.0
8 2018-03-04 00:00:11.770 2 NaN 770.0
9 2018-03-04 00:00:12.320 2 NaN 1090.0
10 2018-03-04 00:00:12.910 2 1.0 2000.0
11 2018-03-04 00:00:13.250 2 1.0 2250.0
12 2018-03-04 00:00:13.960 2 1.0 3210.0
13 2018-03-04 00:00:14.550 2 NaN 3760.0
14 2018-03-04 00:00:15.250 2 NaN 4010.0