I have a table with individual records and another table which holds historical information about the individuals in the first.
I want to extract information about the individuals from the second table. Both tables have a timestamp. It is very important that only historical information from before the record in the first table is used.
Date_Time name
0 2021-09-06 10:46:00 Leg It Liam
1 2021-09-06 10:46:00 Hollyhill Island
2 2021-09-06 10:46:00 Shani El Bolsa
3 2021-09-06 10:46:00 Kilbride Fifi
4 2021-09-06 10:46:00 Go
2100 2021-10-06 11:05:00 Slaneyside Babs
2101 2021-10-06 11:05:00 Hillview Joe
2102 2021-10-06 11:05:00 Fairway Flyer
2103 2021-10-06 11:05:00 Whiteys Surprise
2104 2021-10-06 11:05:00 Astons Lucy
The name column is the variable by which the two tables are joined:
Date_Time name cc
13 2021-09-15 12:16:00 Hollyhill Island 6.00
14 2021-09-06 10:46:00 Hollyhill Island 4.50
15 2021-05-30 18:28:00 Hollyhill Island 3.50
16 2021-05-25 10:46:00 Hollyhill Island 2.50
17 2021-05-18 12:46:00 Hollyhill Island 2.38
18 2021-04-05 12:31:00 Hollyhill Island 3.50
19 2021-04-28 12:16:00 Hollyhill Island 3.75
I want to add aggregated data from this table to the first, such as the cc mean and count.
Date_Time name
1 2021-09-06 10:46:00 Hollyhill Island
For this row I would add 5 for the cc count and 3.126 for the cc mean. Remember, the historical records need to be strictly before the Date_Time of the individual record (so row 14, which has the identical timestamp, is excluded).
I am a bit confused about how to do this efficiently. I know I need to groupby the historical data.
Also, the individual records usually come in groups sharing the same Date_Time, if that makes it any easier.
IIUC (if I understand correctly), try:

# merge both dataframes on name
out = df1.merge(df2, on='name', suffixes=('', '_y'))

# keep only historical rows strictly before the individual record's Date_Time
out = out.mask(out['Date_Time'] <= out['Date_Time_y']).dropna()

# aggregate the historical cc values per individual record
out = out.groupby(['Date_Time', 'name'])['cc'].agg(['count', 'mean']).reset_index()
Output of out:
Date_Time name count mean
0 2021-09-06 10:46:00 Hollyhill Island 5 3.126
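Equivalently, a plain boolean filter avoids the mask/dropna round trip (a minimal sketch under the same column assumptions):

out = df1.merge(df2, on='name', suffixes=('', '_y'))
out = out[out['Date_Time'] > out['Date_Time_y']]  # strictly earlier history only
out = out.groupby(['Date_Time', 'name'])['cc'].agg(['count', 'mean']).reset_index()

Since every row produced by the merge carries the individual record's Date_Time, grouping by ['Date_Time', 'name'] yields one aggregated row per individual record.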
I have a time series that is very irregular. The difference in time between two records can be 1s or 10 days.
I want to resample the data every 1h, but only when the sequential records are less than 1h.
How to approach this, without making too many loops?
In the example below, I would like to resample only rows 5-6 (the delta is 10s) and rows 6-7 (the delta is 50s).
The others should remain as they are.
tmp = vals[['datumtijd', 'filter data']]
datumtijd filter data
0 1970-11-01 00:00:00 129.0
1 1970-12-01 00:00:00 143.0
2 1971-01-05 00:00:00 151.0
3 1971-02-01 00:00:00 151.0
4 1971-03-01 00:00:00 163.0
5 1971-03-01 00:00:10 163.0
6 1971-03-01 00:00:20 163.0
7 1971-03-01 00:01:10 163.0
8 1971-03-01 00:04:10 163.0
.. ... ...
244 1981-08-19 00:00:00 102.0
245 1981-09-02 00:00:00 98.0
246 1981-09-17 00:00:00 92.0
247 1981-10-01 00:00:00 89.0
248 1981-10-19 00:00:00 92.0
You can be a little more explicit about this by using groupby on the hour-floor of the timestamps:
grouped = df.groupby(df['datumtijd'].dt.floor('1H')).mean()
This is explicitly looking for the hour of each existing data point and grouping the matching ones.
But you can also just do the resample and then filter out the empty data, as pandas can still do this pretty quickly:
resampled = df.resample('1H', on='datumtijd').mean().dropna()
In either case, you get the following (note that I changed the last time stamp just so that the console would show the hours):
filter data
datumtijd
1970-11-01 00:00:00 129.0
1970-12-01 00:00:00 143.0
1971-01-05 00:00:00 151.0
1971-02-01 00:00:00 151.0
1971-03-01 00:00:00 163.0
1981-08-19 00:00:00 102.0
1981-09-02 00:00:00 98.0
1981-09-17 00:00:00 92.0
1981-10-01 00:00:00 89.0
1981-10-19 03:00:00 92.0
One quick clarification: in your example, rows 5-8 all fall within the same clock hour (they differ only in minutes and seconds), so they all get grouped together.
Also, see this related post.
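A self-contained version of both approaches, for reference (a minimal sketch; the sample values are made up to mimic the question's data):

import pandas as pd

df = pd.DataFrame({
    'datumtijd': pd.to_datetime([
        '1971-03-01 00:00:00', '1971-03-01 00:00:10',
        '1971-03-01 00:00:20', '1971-03-01 00:01:10',
        '1971-03-02 12:00:00',
    ]),
    'filter data': [163.0, 163.0, 163.0, 163.0, 150.0],
})

# variant 1: group by the hour-floor; only hours that actually contain
# data appear in the result
grouped = df.groupby(df['datumtijd'].dt.floor('1H'))['filter data'].mean()

# variant 2: resample onto a full hourly grid, then drop the empty hours
resampled = df.resample('1H', on='datumtijd')['filter data'].mean().dropna()

Rows that are alone in their hour pass through unchanged in both variants, while the four rows sharing an hour collapse into one averaged row.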
I am trying to create a new column in which, e.g., the time 14:02 should be saved as 14.0, whereas 14:16 should be 14.5; this corresponds to half-hour units. Of course, 15-minute units should also be possible, and so on. This is my approach for full hours, but I need a higher resolution.
df["Time"] = df.StartDateTime.apply(lambda x: x.hour)
So long as the units evenly divide an hour you can round with that frequency and then divide by an hour.
import pandas as pd

df = pd.DataFrame({'Time': pd.timedelta_range('14:00:00', freq='4min', periods=10)})

for freq in ['30min', '15min', '20min', '10min']:
    df[freq] = df['Time'].dt.round(freq) / pd.Timedelta('1H')
Time 30min 15min 20min 10min
0 14:00:00 14.0 14.00 14.000000 14.000000
1 14:04:00 14.0 14.00 14.000000 14.000000
2 14:08:00 14.0 14.25 14.000000 14.166667
3 14:12:00 14.0 14.25 14.333333 14.166667
4 14:16:00 14.5 14.25 14.333333 14.333333
5 14:20:00 14.5 14.25 14.333333 14.333333
6 14:24:00 14.5 14.50 14.333333 14.333333
7 14:28:00 14.5 14.50 14.333333 14.500000
8 14:32:00 14.5 14.50 14.666667 14.500000
9 14:36:00 14.5 14.50 14.666667 14.666667
If you start from a datetime64[ns] column you can isolate the time by subtracting off the normalized date. For example:
df = pd.DataFrame({'Time': pd.date_range('2010-01-01 14:00:00', freq='4min', periods=5)})
df['Time_only'] = df['Time'] - df['Time'].dt.normalize()
# Time Time_only
#0 2010-01-01 14:00:00 14:00:00
#1 2010-01-01 14:04:00 14:04:00
#2 2010-01-01 14:08:00 14:08:00
#3 2010-01-01 14:12:00 14:12:00
#4 2010-01-01 14:16:00 14:16:00
print(df.dtypes)
#Time datetime64[ns]
#Time_only timedelta64[ns]
#dtype: object
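If you want the fractional-hour units directly from such a column, the two steps chain naturally (a small sketch, using the same assumed column name):

df['half_hours'] = (df['Time'] - df['Time'].dt.normalize()).dt.round('30min') / pd.Timedelta('1H')
# e.g. 14:02 -> 14.0, 14:16 -> 14.5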
I have a table that contains an ID, a date, and a float value, as below:
ID startDt Days
1328 2015-04-01 00:00:00.000 15
2444 2015-04-03 00:00:00.000 5.7
1658 2015-05-08 00:00:00.000 6
1329 2015-05-12 00:00:00.000 28.5
1849 2015-06-23 00:00:00.000 28.5
1581 2015-06-30 00:00:00.000 25.5
3535 2015-07-03 00:00:00.000 3
3536 2015-08-13 00:00:00.000 13.5
2166 2015-09-22 00:00:00.000 28.5
3542 2015-11-05 00:00:00.000 13.5
3543 2015-12-18 00:00:00.000 6
2445 2015-12-25 00:00:00.000 5.7
4096 2015-12-31 00:00:00.000 7.5
2446 2016-01-01 00:00:00.000 5.7
4287 2016-02-11 00:00:00.000 13.5
4288 2016-02-18 00:00:00.000 13.5
4492 2016-03-02 00:00:00.000 19.7
2447 2016-03-25 00:00:00.000 5.7
I am using a stored procedure which adds up the Days and then subtracts the total from a fixed value stored in a variable.
The total in the table is 245 and the variable is set to 245, so I should get a value of 0 when subtracting the two. However, I am getting a value of 5.6843418860808E-14 instead. I can't figure out why this is the case; I have even gone and re-entered each number in the table, but I still get the same result.
This is the SQL statement that I am using to calculate the result:
Declare @AL_Taken as float
Declare @AL_Remaining as float
Declare @EntitledLeave as float

Set @EntitledLeave = 245
Set @AL_Taken = (select sum(Days) from tblALMain)
Set @AL_Remaining = @EntitledLeave - @AL_Taken

Select @EntitledLeave, @AL_Taken, @AL_Remaining
The select returns the following:
245, 245, 5.6843418860808E-14
Can anyone suggest why I am getting this number when I should be getting 0?
Thanks for the help
Rob
I changed the data type to Decimal, as Tab Allenman suggested, and this resolved my issue. I still don't understand why I didn't get zero when using float, as all the values added up to 245 exactly (I even re-entered the values manually) and 245 - 245 should have given me 0.
Thanks again for all the comments and explanations.
Rob
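For what it's worth, the residue is characteristic of binary floating point: values such as 5.7 have no exact binary representation, so each stored value carries a tiny rounding error and those errors can survive the subtraction, whereas decimal stores exact base-10 digits. A quick Python illustration of the same effect (SQL Server's float is likewise a binary IEEE 754 type):

from decimal import Decimal

# 0.1, 0.2 and 0.3 are all stored inexactly in binary floating point
print(0.1 + 0.2 - 0.3)                                   # 5.551115123125783e-17, not 0.0

# exact base-10 arithmetic gives the expected zero
print(Decimal('0.1') + Decimal('0.2') - Decimal('0.3'))  # 0.0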
I am learning KDB+ and have loaded the tick data into the table W as below. My question is: how do I transform the data into 5-minute (or n-minute) OHLCVA bars?
"Stk_ID","Date","Time","Price","Chg","Vol","Amt","Ty"
300032,2011-03-03,09:51:40,20.40,0.00,10.0,20400.0,S
300032,2011-03-03,09:51:30,20.40,-0.01,9.0,18360.0,S
300032,2011-03-03,09:51:00,20.41,0.01,2.0,4082.0,B
300032,2011-03-03,09:51:00,20.40,-0.01,115.0,234599.0,S
300032,2011-03-03,09:50:45,20.41,0.00,10.0,20410.0,S
300032,2011-03-03,09:50:45,20.41,-0.02,7.0,14287.0,S
300032,2011-03-03,09:50:20,20.43,-0.01,4.0,8172.0,S
300032,2011-03-03,09:50:05,20.44,0.01,25.0,51100.0,B
300032,2011-03-03,09:50:00,20.43,-0.01,28.0,57204.0,S
I use the following q code to get 1-minute data, but don't know how to get 5-minute bars:
select Open: first price,High: max price, Low: min price,Close: last price,Vol: sum vol, Amt: sum amt,Avg_Price: ((sum amt)%(sum vol))%100 by stk_id,time.hh,time.mm from asc W
result:
stk_id hh mm| Open High Low Close Vol Amt Avg_Price
------------| ----------------------------------------------------
000001 9 30| 16.24 16.24 16.22 16.24 3253 5282086 16.23758
000001 9 31| 16.22 16.24 16.21 16.21 1974 3204276 16.2324
000001 9 32| 16.23 16.23 16.2 16.2 3764 6102207 16.21203
000001 9 33| 16.21 16.21 16.19 16.2 4407 7143120 16.20858
000001 9 34| 16.2 16.2 16.19 16.19 1701 2756614 16.20584
000001 9 35| 16.19 16.21 16.19 16.21 2756 4466988 16.20823
000001 9 36| 16.22 16.25 16.22 16.24 3123 5076089 16.25389
000001 9 37| 16.25 16.27 16.25 16.27 1782 2897340 16.25892
Rather than grouping separately by time.hh and then time.mm, I'd recommend doing a single group:
by stk_id,time.minute
From there, all you need to do for 5 minute buckets is use xbar:
by stk_id,5 xbar time.minute
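Putting that together with the original query (an untested sketch that simply swaps the new by-clause into the 1-minute version):

select Open: first price, High: max price, Low: min price, Close: last price, Vol: sum vol, Amt: sum amt, Avg_Price: ((sum amt)%(sum vol))%100 by stk_id, 5 xbar time.minute from asc W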
A slightly more dynamic version of aggregating the values:
q)b:`Date`Time`stk!(`Date;(xbar;1;`Time.minute);`Stk_ID)
q)a:`op`cp`hp`lp!((first;`Price);(last;`Price);(max;`Price);(min;`Price))
q)?[W;();b;a]
Date       Time  stk   | op    cp    hp    lp
-----------------------| -----------------------
2011.03.03 09:50 300032| 20.41 20.43 20.44 20.41
2011.03.03 09:51 300032| 20.4  20.4  20.41 20.4
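For 5-minute buckets in the functional form, the same xbar change applies: widen the interval in the bucket dictionary (same names as above):

q)b:`Date`Time`stk!(`Date;(xbar;5;`Time.minute);`Stk_ID)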