How to merge records with aggregate historical data? - pandas

I have a table with individual records and another which holds historical information about the individuals in the former.
I want to extract information about the individuals from the second table. Both tables have a timestamp. It is very important that the historical information happened before the record in the first table.
Date_Time name
0 2021-09-06 10:46:00 Leg It Liam
1 2021-09-06 10:46:00 Hollyhill Island
2 2021-09-06 10:46:00 Shani El Bolsa
3 2021-09-06 10:46:00 Kilbride Fifi
4 2021-09-06 10:46:00 Go
2100 2021-10-06 11:05:00 Slaneyside Babs
2101 2021-10-06 11:05:00 Hillview Joe
2102 2021-10-06 11:05:00 Fairway Flyer
2103 2021-10-06 11:05:00 Whiteys Surprise
2104 2021-10-06 11:05:00 Astons Lucy
The name is the variable that connects the two tables:
Date_Time name cc
13 2021-09-15 12:16:00 Hollyhill Island 6.00
14 2021-09-06 10:46:00 Hollyhill Island 4.50
15 2021-05-30 18:28:00 Hollyhill Island 3.50
16 2021-05-25 10:46:00 Hollyhill Island 2.50
17 2021-05-18 12:46:00 Hollyhill Island 2.38
18 2021-04-05 12:31:00 Hollyhill Island 3.50
19 2021-04-28 12:16:00 Hollyhill Island 3.75
I want to add aggregated data from this table to the first, such as the cc mean and count.
Date_Time name
1 2021-09-06 10:46:00 Hollyhill Island
For this row I would add 5 for the cc count and 3.126 for the cc mean. Remember, the historical records need to be from before the Date_Time of the individual record.
I am a bit confused about how to do this efficiently. I know I need to use groupby on the historical data.
Also, the individual records usually come in groups sharing the same Date_Time, if that makes it any easier.

IIUC:
try:
out = df1.merge(df2, on='name', suffixes=('', '_y'))
# merge both dfs on name
out = out.mask(out['Date_Time'] <= out['Date_Time_y']).dropna()
# drop rows where the historical record is not strictly earlier than the individual record
out = out.groupby(['Date_Time', 'name'])['cc'].agg(['count', 'mean']).reset_index()
# aggregate the values
output of out:
Date_Time name count mean
0 2021-09-06 10:46:00 Hollyhill Island 5 3.126
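For reference, a minimal self-contained sketch of the same idea, using the Hollyhill Island rows from the question (column names Date_Time, name, cc as above); it uses an explicit strict < filter instead of mask + dropna:
import pandas as pd

df1 = pd.DataFrame({
    'Date_Time': pd.to_datetime(['2021-09-06 10:46:00']),
    'name': ['Hollyhill Island'],
})
df2 = pd.DataFrame({
    'Date_Time': pd.to_datetime(['2021-09-15 12:16:00', '2021-09-06 10:46:00',
                                 '2021-05-30 18:28:00', '2021-05-25 10:46:00',
                                 '2021-05-18 12:46:00', '2021-04-05 12:31:00',
                                 '2021-04-28 12:16:00']),
    'name': ['Hollyhill Island'] * 7,
    'cc': [6.00, 4.50, 3.50, 2.50, 2.38, 3.50, 3.75],
})

out = df1.merge(df2, on='name', suffixes=('', '_y'))   # pair each record with all of its history
out = out[out['Date_Time_y'] < out['Date_Time']]       # keep only strictly earlier history
out = out.groupby(['Date_Time', 'name'])['cc'].agg(['count', 'mean']).reset_index()
print(out)   # count 5, mean 3.126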

Related

Pandas resample only when it makes sense

I have a time series that is very irregular. The time difference between two records can be 1 second or 10 days.
I want to resample the data every 1 hour, but only when the sequential records are less than 1 hour apart.
How can I approach this without making too many loops?
In the example below, I would like to resample only rows 5-6 (delta difference is 10 s) and rows 6-7 (delta difference is 50 s).
The others should remain as they are.
tmp=vals[['datumtijd','filter data']]
datumtijd filter data
0 1970-11-01 00:00:00 129.0
1 1970-12-01 00:00:00 143.0
2 1971-01-05 00:00:00 151.0
3 1971-02-01 00:00:00 151.0
4 1971-03-01 00:00:00 163.0
5 1971-03-01 00:00:10 163.0
6 1971-03-01 00:00:20 163.0
7 1971-03-01 00:01:10 163.0
8 1971-03-01 00:04:10 163.0
.. ... ...
244 1981-08-19 00:00:00 102.0
245 1981-09-02 00:00:00 98.0
246 1981-09-17 00:00:00 92.0
247 1981-10-01 00:00:00 89.0
248 1981-10-19 00:00:00 92.0
You can be a little explicit about this by using groupby on the hour-floor of the time stamps:
grouped = df.groupby(df['datumtijd'].dt.floor('1H')).mean()
This is explicitly looking for the hour of each existing data point and grouping the matching ones.
But you can also just do the resample and then filter out the empty data, as pandas can still do this pretty quickly:
resampled = df.resample('1H', on='datumtijd').mean().dropna()
In either case, you get the following (note that I changed the last time stamp just so that the console would show the hours):
filter data
datumtijd
1970-11-01 00:00:00 129.0
1970-12-01 00:00:00 143.0
1971-01-05 00:00:00 151.0
1971-02-01 00:00:00 151.0
1971-03-01 00:00:00 163.0
1981-08-19 00:00:00 102.0
1981-09-02 00:00:00 98.0
1981-09-17 00:00:00 92.0
1981-10-01 00:00:00 89.0
1981-10-19 03:00:00 92.0
One quick clarification: in your example, rows 5-8 all occur within the same hour as row 4, so they all get grouped together.
Also, see this related post.
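For completeness, a minimal self-contained sketch of both approaches side by side, using a few of the rows from the question (column names datumtijd and filter data as above):
import pandas as pd

df = pd.DataFrame({
    'datumtijd': pd.to_datetime([
        '1971-03-01 00:00:00', '1971-03-01 00:00:10', '1971-03-01 00:00:20',
        '1971-03-01 00:01:10', '1971-03-01 00:04:10', '1981-08-19 00:00:00',
    ]),
    'filter data': [163.0, 163.0, 163.0, 163.0, 163.0, 102.0],
})

# approach 1: group on the hour-floor of each timestamp
grouped = df.groupby(df['datumtijd'].dt.floor('1H'))['filter data'].mean()

# approach 2: resample to hourly bins, then drop the empty ones
resampled = df.resample('1H', on='datumtijd')['filter data'].mean().dropna()

print(grouped)
print(resampled)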

get the records before and after the nearest merge by 30 minutes in python

I have two data frames in CSV files. The first describes traffic incidents (df1) and the second holds the traffic record data for every 15 minutes (df2). I want to merge them based on the closest time. I used pandas merge_asof and got the nearest match, but I want the 30 minutes of records before and after the match from the traffic record data. I also want to join the closest incidents to the traffic data time: if an incident occurred at 14:02:00, it should be merged with the traffic data recorded at 14:00:00.
For example:
1- Incidents data
Date detector_id Incident_type
09/30/2015 8:00:00 1 crash
09/30/2015 8:02:00 1 congestion
04/22/2014 15:30:00 9 congestion
04/22/2014 15:33:00 9 Emergency vehicle
2 - Traffic data
Date detector_id traffic_volume
09/30/2015 7:30:00 1 55
09/30/2015 7:45:00 1 45
09/30/2015 8:00:00 1 60
09/30/2015 8:15:00 1 200
09/30/2015 8:30:00 1 70
04/22/2014 15:00:00 9 15
04/22/2014 15:15:00 9 7
04/22/2014 15:30:00 9 50
04/22/2014 15:45:00 9 11
04/22/2014 16:00:00 9 7
3 - The desired table
Date detector_id traffic_volume Incident_type
09/30/2015 7:30:00 1 55 NA
09/30/2015 7:45:00 1 45 NA
09/30/2015 8:00:00 1 60 Crash
09/30/2015 8:00:00 1 60 congestion
09/30/2015 8:15:00 1 200 NA
09/30/2015 8:30:00 1 70 NA
04/22/2014 15:00:00 9 15 NA
04/22/2014 15:15:00 9 7 NA
04/22/2014 15:30:00 9 50 Congestion
04/22/2014 15:30:00 9 50 Emergency vehicle
04/22/2014 15:45:00 9 11 NA
04/22/2014 16:00:00 9 7 NA
The code that I used as follow
Merge = pd.merge_asof(df2, df1, on='Date', by='detector_id',
                      allow_exact_matches=False, direction='nearest')
but it gave me this table.
Date detector_id traffic_volume Incident_type
09/30/2015 8:00:00 1 60 Crash
04/22/2014 15:30:00 9 50 Congestion
and I want to know the situation before and after the incidents occur.
Any idea?
Thank you.
*If I made a mistake by asking this way, please let me know.
For anyone who has the same problem and wants to do the merge using pandas.merge_asof: you have to use the tolerance parameter, which controls how large a time difference between the two datasets is still allowed to match.
You may face two problems, one related to Timedelta and one to sorting. The Timedelta problem is solved by converting the time columns to datetime, as follows:
df1.Date = pd.to_datetime(df1.Date)
df2.Date = pd.to_datetime(df2.Date)
and the sorting error is fixed by sorting both frames on the merge key in the call itself, as follows:
x = pd.merge_asof(df1.sort_values('Date'),   # sort_values fixes the "left keys must be sorted" error
                  df2.sort_values('Date'),
                  on='Date',
                  by='detector_id',
                  direction='backward',
                  tolerance=pd.Timedelta('45 min'))
With direction='nearest', each row is matched to the single closest record within the 45 minute tolerance, whether it comes before or after.
With direction='backward', each row is matched to the last record at or before it (up to 45 minutes earlier),
and with direction='forward', to the first record at or after it (up to 45 minutes later).
Thank you and hopefully this will help anyone in future.
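As a minimal sketch of how this looks on the example data above (column names as in the question; direction='backward' snaps each incident down to the preceding 15-minute traffic record):
import pandas as pd

df1 = pd.DataFrame({   # incidents
    'Date': pd.to_datetime(['2015-09-30 08:00:00', '2015-09-30 08:02:00',
                            '2014-04-22 15:30:00', '2014-04-22 15:33:00']),
    'detector_id': [1, 1, 9, 9],
    'Incident_type': ['crash', 'congestion', 'congestion', 'Emergency vehicle'],
})
df2 = pd.DataFrame({   # 15-minute traffic records
    'Date': pd.to_datetime(['2015-09-30 07:45:00', '2015-09-30 08:00:00', '2015-09-30 08:15:00',
                            '2014-04-22 15:15:00', '2014-04-22 15:30:00', '2014-04-22 15:45:00']),
    'detector_id': [1, 1, 1, 9, 9, 9],
    'traffic_volume': [45, 60, 200, 7, 50, 11],
})

x = pd.merge_asof(df1.sort_values('Date'), df2.sort_values('Date'),
                  on='Date', by='detector_id',
                  direction='backward',                 # latest traffic record at or before the incident
                  tolerance=pd.Timedelta('45 min'))
print(x)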

How to extract the last value of a group in pandas dataframe

I have a huge dataset that needs to be grouped by its 'ID' column, and only the last value of each ID needs to be exported into a SINGLE csv/excel file.
incl = ['A', 'B', 'C']
for k, g in df[df['ID'].isin(incl)].groupby('ID'):
    g.tail(1).to_csv(f'{k}.csv')
I have tried this, but it makes a separate csv file for each ID instead of one big file containing the last value of each group.
Sample data:
ID Date Open High Low
30 UNITY 2020-06-18 11.50 11.75 11.41
31 UNITY 2020-06-21 11.44 11.50 10.88
32 UNITY 2020-06-22 11.26 11.78 11.26
33 UNITY 2020-06-23 11.72 12.08 11.53
34 UNITY 2020-06-24 11.51 11.59 11.40
35 UNITY 2020-06-25 11.85 11.85 11.11
36 SSOM 2020-05-03 27.50 27.95 27.00
37 SSOM 2020-05-05 27.50 27.50 27.50
38 SSOM 2020-05-06 29.20 29.56 29.20
39 SSOM 2020-05-07 31.77 31.77 31.77
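A minimal sketch of one way to get a single file, built from a few of the sample rows above: keep only the last row per ID and write everything with one to_csv call (the output file name here is arbitrary):
import pandas as pd

df = pd.DataFrame({
    'ID':   ['UNITY', 'UNITY', 'SSOM', 'SSOM'],
    'Date': ['2020-06-24', '2020-06-25', '2020-05-06', '2020-05-07'],
    'Open': [11.51, 11.85, 29.20, 31.77],
})

incl = ['UNITY', 'SSOM']                          # groups to keep
last_rows = (df[df['ID'].isin(incl)]
             .groupby('ID', sort=False)
             .tail(1))                            # last row of each ID group
last_rows.to_csv('last_values.csv', index=False)  # one file for all groups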

Future dates calculating incorrectly in FBProphet - make_future_dataframe method

I'm trying to do a weekly forecast in FBProphet for just 5 weeks ahead. The make_future_dataframe method doesn't seem to be working right: it makes the correct one-week intervals except for the gap between Jul 3 and Jul 5; every other interval is a correct 7 days, or a week. Code and output below:
INPUT DATAFRAME
ds y
548 2010-01-01 3117
547 2010-01-08 2850
546 2010-01-15 2607
545 2010-01-22 2521
544 2010-01-29 2406
... ... ...
4 2020-06-05 2807
3 2020-06-12 2892
2 2020-06-19 3012
1 2020-06-26 3077
0 2020-07-03 3133
CODE
future = m.make_future_dataframe(periods=5, freq='W')
future.tail(9)
OUTPUT
ds
545 2020-06-12
546 2020-06-19
547 2020-06-26
548 2020-07-03
549 2020-07-05
550 2020-07-12
551 2020-07-19
552 2020-07-26
553 2020-08-02
All you need to do is create a dataframe with the dates you need for the predict method; using the make_future_dataframe method is not necessary. The odd gap comes from the frequency alias: freq='W' means weekly anchored on Sunday, so the first generated date snaps to the next Sunday (2020-07-05) after your last observation on Friday 2020-07-03. Passing freq='7D' instead should give plain 7-day steps.
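A minimal sketch of building the dates yourself, assuming m is the already-fitted Prophet model and the last training date is 2020-07-03 as in the output above:
import pandas as pd

# build the 5 future weekly dates yourself, exactly 7 days apart from the last observed Friday
future = pd.DataFrame({
    'ds': pd.date_range(start='2020-07-10', periods=5, freq='7D')
})
forecast = m.predict(future)
print(forecast[['ds', 'yhat']])
# alternatively, m.make_future_dataframe(periods=5, freq='7D') should also step in 7-day increments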

How to convert tic data to 5 minute OHLC?

I am learning KDB+ and have loaded the tick data into the table W as below. My question is: how do I transform the data into 5 (or n) minute OHLCVA bars?
"Stk_ID","Date","Time","Price","Chg","Vol","Amt","Ty"
300032,2011-03-03,09:51:40,20.40,0.00,10.0,20400.0,S
300032,2011-03-03,09:51:30,20.40,-0.01,9.0,18360.0,S
300032,2011-03-03,09:51:00,20.41,0.01,2.0,4082.0,B
300032,2011-03-03,09:51:00,20.40,-0.01,115.0,234599.0,S
300032,2011-03-03,09:50:45,20.41,0.00,10.0,20410.0,S
300032,2011-03-03,09:50:45,20.41,-0.02,7.0,14287.0,S
300032,2011-03-03,09:50:20,20.43,-0.01,4.0,8172.0,S
300032,2011-03-03,09:50:05,20.44,0.01,25.0,51100.0,B
300032,2011-03-03,09:50:00,20.43,-0.01,28.0,57204.0,S
I use this q code to get 1-minute data, but don't know how to get 5-minute bars:
select Open: first price,High: max price, Low: min price,Close: last price,Vol: sum vol, Amt: sum amt,Avg_Price: ((sum amt)%(sum vol))%100 by stk_id,time.hh,time.mm from asc W
result:
stk_id hh mm| Open High Low Close Vol Amt Avg_Price
------------| ----------------------------------------------------
000001 9 30| 16.24 16.24 16.22 16.24 3253 5282086 16.23758
000001 9 31| 16.22 16.24 16.21 16.21 1974 3204276 16.2324
000001 9 32| 16.23 16.23 16.2 16.2 3764 6102207 16.21203
000001 9 33| 16.21 16.21 16.19 16.2 4407 7143120 16.20858
000001 9 34| 16.2 16.2 16.19 16.19 1701 2756614 16.20584
000001 9 35| 16.19 16.21 16.19 16.21 2756 4466988 16.20823
000001 9 36| 16.22 16.25 16.22 16.24 3123 5076089 16.25389
000001 9 37| 16.25 16.27 16.25 16.27 1782 2897340 16.25892
Rather than grouping separately by time.hh and then time.mm, I'd recommend doing a single group:
by stk_id,time.minute
From there, all you need to do for 5 minute buckets is use xbar:
by stk_id,5 xbar time.minute
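Putting that together with the 1-minute query from the question (and assuming the same lower-case column names it uses), a 5-minute version would look something like:
select Open: first price, High: max price, Low: min price, Close: last price, Vol: sum vol, Amt: sum amt, Avg_Price: ((sum amt)%(sum vol))%100 by stk_id, 5 xbar time.minute from asc W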
A slightly more dynamic version of aggregating the values:
q)b:`Date`Time`stk!(`Date;(xbar;1;`Time.minute);`Stk_ID)
q)a:`op`cp`hp`lp!((first;`Price);(last;`Price);(max;`Price);(min;`Price))
q)?[w;();b;a]
Date Time stk | op cp hp lp
-----------------| -----------------------
2011 09:50 300032| 20.41 20.43 20.44 20.41
2011 09:51 300032| 20.4 20.4 20.41 20.4