Removing Rows With The Same Id Based On Overlapping Time - pandas

I have the following dataframe:
df = pd.DataFrame({
    'user': ['Dan', 'Dan', 'Dan', 'Dan', 'Ron'],
    'start': ['2020-01-01 17:00:00', '2020-01-01 16:20:00', '2020-01-01 17:00:00',
              '2020-01-01 06:30:00', '2020-01-01 17:00:00'],
    'end': ['2020-01-01 21:00:00', '2020-01-02 01:00:00', '2020-01-01 21:15:00',
            '2020-01-01 10:00:00', '2020-01-01 21:00:00']
})
user  start                end
Dan   2020-01-01 17:00:00  2020-01-01 21:00:00
Dan   2020-01-01 16:20:00  2020-01-02 01:00:00
Dan   2020-01-01 17:00:00  2020-01-01 21:15:00
Dan   2020-01-01 06:30:00  2020-01-01 10:00:00
Ron   2020-01-01 17:00:00  2020-01-01 21:00:00
For the same user, I would like to keep just one of the records that have overlapping start-end time intervals (no matter which), for example:
user  start                end
Dan   2020-01-01 17:00:00  2020-01-01 21:00:00
Dan   2020-01-01 06:30:00  2020-01-01 10:00:00
Ron   2020-01-01 17:00:00  2020-01-01 21:00:00
My attempt, for a given user, is to run:
idx, single_user, single_start, single_end = df.to_records()[0]
a = df.loc[((df['user'] == single_user) & ((df['start'] < single_start) & (df['end'] < single_end))
            | ((df['start'] > single_start) & (df['end'] > single_end)))].index.tolist()
a.append(df.iloc[idx].name)
And obtain the result using:
test_df.iloc[a]
There must be a better way; is there a pandas method to tackle this?

Here is a potential approach:
# ensure datetime type for meaningful comparisons
df['end'] = pd.to_datetime(df['end'])
df['start'] = pd.to_datetime(df['start'])
# sort by start time
df = df.sort_values(by='start')
# valid data have a start time greater than the previous end time
# or no previous time (NaT in the shift)
s = df.groupby('user')['end'].shift()
df[df['start'].gt(s)|s.isna()]
Output:
user start end
3 Dan 2020-01-01 06:30:00 2020-01-01 10:00:00
1 Dan 2020-01-01 16:20:00 2020-01-02 01:00:00
4 Ron 2020-01-01 17:00:00 2020-01-01 21:00:00
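One caveat: this filter only compares each row with the immediately previous end time per user, so a row that overlaps an earlier kept interval but not the row just before it could slip through. A hedged variant of the same idea compares each start against the running maximum of all earlier end times:
# Sketch (assumes datetime 'start'/'end' columns): keep a row only if its start
# comes after every earlier end time for that user.
df = df.sort_values(['user', 'start'])
prev_max_end = df.groupby('user')['end'].transform(lambda e: e.cummax().shift())
deduped = df[df['start'].gt(prev_max_end) | prev_max_end.isna()]
On the sample data this yields the same three rows as above.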

Related

Pandas replace daily observations by monthly mean

Suppose I have a pandas Series with hourly observations:
pd_series = pd.Series(np.random.rand(26281), index = pd.date_range('2022-01-01', '2024-12-31', freq = 'H'))
pd_series
2022-01-01 00:00:00 0.933746
2022-01-01 01:00:00 0.588907
2022-01-01 02:00:00 0.229040
2022-01-01 03:00:00 0.557752
2022-01-01 04:00:00 0.798649
...
2024-12-30 20:00:00 0.314143
2024-12-30 21:00:00 0.670485
2024-12-30 22:00:00 0.300531
2024-12-30 23:00:00 0.075403
2024-12-31 00:00:00 0.716685
What I want is to replace every observation by the monthly average. I know that the average can be calculated as
pd_series.resample('MS').mean()
But how do I assign the monthly averages back to the respective observations?
Use Resampler.transform:
print (pd_series.resample('MS').transform('mean'))
2022-01-01 00:00:00 0.495015
2022-01-01 01:00:00 0.495015
2022-01-01 02:00:00 0.495015
2022-01-01 03:00:00 0.495015
2022-01-01 04:00:00 0.495015
...
2024-12-30 20:00:00 0.508646
2024-12-30 21:00:00 0.508646
2024-12-30 22:00:00 0.508646
2024-12-30 23:00:00 0.508646
2024-12-31 00:00:00 0.508646
Freq: H, Length: 26281, dtype: float64
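For reference, the same broadcast can also be written with a plain groupby on the calendar month (a sketch; it should give the same result as the Resampler.transform call above):
# Sketch: group by the month period of each timestamp and broadcast the mean.
monthly = pd_series.groupby(pd_series.index.to_period('M')).transform('mean')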

Replacing values in a row based on conditions

I'm trying to fill down a column based on 2 conditions: in this case, whether the index (a time series) falls between sunrise and sunset, in which case I want 1 in a new column called 'sunlight'; otherwise, I want the value to be zero. I'm new to pandas, coming from Excel, so I'm trying to do this as I would there, probably wrongly.
df['sunlight'] = 0
mask1 = df.index > df['sunrise']
mask2 = df.index < df['sunset']
df[mask1 & mask2]
df.loc[df[mask1 & mask2],'sunlight'] = 1
df
Index     sunrise   sunset    Sunlight
08:18:00  08:19:17  15:56:43  0
08:19:00  08:19:17  15:56:43  0
08:20:00  08:19:17  15:56:43  1
08:21:00  08:19:17  15:56:43  1
08:22:00  08:19:17  15:56:43  1
Let's look at a DataFrame with only one day of data at a frequency of one hour (not minutes) as an example.
df = pd.DataFrame({'sunrise': [pd.to_datetime('2020-01-01 08:19:17')]*24,
                   'sunset': [pd.to_datetime('2020-01-01 15:46:43')]*24},
                  index=pd.date_range('2020-01-01 00:00:00', '2020-01-01 23:00:00', freq='H'))
If you now cast the truth values to integers, you can multiply both conditions in one step.
df['sunlight'] = (df['sunrise'] < df.index).astype(int) * (df.index < df['sunset']).astype(int)
The output then looks like this:
sunrise sunset sunlight
2020-01-01 07:00:00 2020-01-01 08:19:17 2020-01-01 15:46:43 0
2020-01-01 08:00:00 2020-01-01 08:19:17 2020-01-01 15:46:43 0
2020-01-01 09:00:00 2020-01-01 08:19:17 2020-01-01 15:46:43 1
2020-01-01 10:00:00 2020-01-01 08:19:17 2020-01-01 15:46:43 1
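For completeness, the mask-based assignment attempted in the question also works once the boolean masks are passed directly to .loc (a minimal sketch against the question's df, which has datetime-like 'sunrise' and 'sunset' columns and a matching index):
# Sketch: combine the two conditions and assign through .loc with the mask itself,
# not with df[mask1 & mask2] as the row indexer.
df['sunlight'] = 0
mask1 = df.index > df['sunrise']
mask2 = df.index < df['sunset']
df.loc[mask1 & mask2, 'sunlight'] = 1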

How to efficiently up-sample hourly averages in a Series?

I had a look at the "sql-like" window functions for pandas, and at "rolling".
However, it seems to me I can't have a condition on timestamps in the index, but maybe I'm wrong.
So far, I've been writing this very inefficient code to get an hourly average as a window function.
Does anyone know a quicker and nicer method?
def avg_on_hour(data: pd.Series):
    new_series = pd.Series()
    start_date = data.index.min()
    end_date = data.index.max()
    delta = dt.timedelta(hours=1)
    this_time = start_date
    while this_time < end_date:
        this_date = this_time.date()
        this_hour = this_time.hour
        day_slice = data[(data.index.date == this_date) & (data.index.hour == this_hour)]
        day_avg = day_slice.mean()
        day_slice.iloc[:] = day_avg
        new_series = new_series.append(day_slice, verify_integrity=True)
        this_time = this_time + delta
    return new_series
Example:
Pandas has rolling on datetime, given that the series is datetime-indexed:
# sample data:
np.random.seed(1)
size = 10
s = pd.Series(np.random.rand(size),
              index=pd.date_range('2020-01-01', freq='7T', periods=size))

# rolling mean
s.rolling('1H').mean()
Output:
2020-01-01 00:00:00 0.417022
2020-01-01 00:07:00 0.568673
2020-01-01 00:14:00 0.379154
2020-01-01 00:21:00 0.359948
2020-01-01 00:28:00 0.317310
2020-01-01 00:35:00 0.279815
2020-01-01 00:42:00 0.266450
2020-01-01 00:49:00 0.276339
2020-01-01 00:56:00 0.289720
2020-01-01 01:03:00 0.303252
Freq: 7T, dtype: float64
Update: from your comment, it looks like you are looking for groupby:
s.groupby(s.index.floor('H')).transform('mean')
or
s.groupby(pd.Grouper(freq='H')).transform('mean')
Output:
2020-01-01 00:00:00 0.289720
2020-01-01 00:07:00 0.289720
2020-01-01 00:14:00 0.289720
2020-01-01 00:21:00 0.289720
2020-01-01 00:28:00 0.289720
2020-01-01 00:35:00 0.289720
2020-01-01 00:42:00 0.289720
2020-01-01 00:49:00 0.289720
2020-01-01 00:56:00 0.289720
2020-01-01 01:03:00 0.538817
Freq: 7T, dtype: float64
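The same per-hour broadcast can also be spelled with resample, which may read more naturally for a purely time-based grouping (a sketch, equivalent to the groupby forms above):
# Sketch: bin the series into hourly groups and broadcast each group's mean back.
s.resample('H').transform('mean')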

pandas resample uneven hourly data into 1D or 24h bins

I have weekly hourly FX data which I need to resample into '1D' or '24hr' bins, ending Monday through Thursday at 12:00pm and at 21:00 on Friday, totaling 5 days per week:
Date rate
2020-01-02 00:00:00 0.673355
2020-01-02 01:00:00 0.67311
2020-01-02 02:00:00 0.672925
2020-01-02 03:00:00 0.67224
2020-01-02 04:00:00 0.67198
2020-01-02 05:00:00 0.67223
2020-01-02 06:00:00 0.671895
2020-01-02 07:00:00 0.672175
2020-01-02 08:00:00 0.672085
2020-01-02 09:00:00 0.67087
2020-01-02 10:00:00 0.6705800000000001
2020-01-02 11:00:00 0.66884
2020-01-02 12:00:00 0.66946
2020-01-02 13:00:00 0.6701600000000001
2020-01-02 14:00:00 0.67056
2020-01-02 15:00:00 0.67124
2020-01-02 16:00:00 0.6691699999999999
2020-01-02 17:00:00 0.66883
2020-01-02 18:00:00 0.66892
2020-01-02 19:00:00 0.669345
2020-01-02 20:00:00 0.66959
2020-01-02 21:00:00 0.670175
2020-01-02 22:00:00 0.6696300000000001
2020-01-02 23:00:00 0.6698350000000001
2020-01-03 00:00:00 0.66957
So the number of hours in some days of the week is uneven, i.e. "Monday" = 00:00:00 Monday through 12:00:00 Monday, "Tuesday" (and likewise Wed, Thu) = 13:00:00 Monday through 12:00:00 Tuesday, and "Friday" = 13:00:00 through 21:00:00.
In trying to find a solution I see that base is now deprecated, and the offset/origin methods aren't working as expected, likely due to the uneven number of rows per day:
df.rate.resample('24h', offset=12).ohlc()
I've spent hours attempting to find a solution.
How can one simply bin into ohlc() columns all data rows between each 12:00:00 timestamp?
the desired output would look something like this:
Out[69]:
open high low close
2020-01-02 00:00:00.0000000 0.673355 0.673355 0.673355 0.673355
2020-01-03 00:00:00.0000000 0.673110 0.673110 0.668830 0.669570
2020-01-04 00:00:00.0000000 0.668280 0.668280 0.664950 0.666395
2020-01-05 00:00:00.0000000 0.666425 0.666425 0.666425 0.666425
Is this what you are looking for, using both origin and offset as parameters:
df.resample('24h', origin='start_day', offset='13h').ohlc()
For your example, this gives me:
open high low close
datetime
2020-01-01 13:00:00 0.673355 0.673355 0.66884 0.66946
2020-01-02 13:00:00 0.670160 0.671240 0.66883 0.66957
Since the period lengths are unequal, IMO it is necessary to craft the mapping yourself. More precisely, the 1.5-day length of the Monday period makes it impossible for freq='D' to do the mapping correctly in one step.
The hand-crafted code is also able to guard against records outside the well-defined periods.
Data
Slightly different timestamps are used to demonstrate the correctness of the code. The days run from Mon. to Fri.
import pandas as pd
import numpy as np
from datetime import datetime
import io
from pandas import Timestamp, Timedelta
df = pd.read_csv(io.StringIO("""
rate
Date
2020-01-06 00:00:00  0.673355
2020-01-06 23:00:00  0.673110
2020-01-07 00:00:00  0.672925
2020-01-07 12:00:00  0.672240
2020-01-07 13:00:00  0.671980
2020-01-07 23:00:00  0.672230
2020-01-08 00:00:00  0.671895
2020-01-08 12:00:00  0.672175
2020-01-08 23:00:00  0.672085
2020-01-09 00:00:00  0.670870
2020-01-09 12:00:00  0.670580
2020-01-09 23:00:00  0.668840
2020-01-10 00:00:00  0.669460
2020-01-10 12:00:00  0.670160
2020-01-10 21:00:00  0.670560
2020-01-10 22:00:00  0.671240
2020-01-10 23:00:00  0.669170
"""), sep=r"\s{2,}", engine="python")
df.set_index(pd.to_datetime(df.index), inplace=True)
Code
def find_day(ts: Timestamp):
    """Find the trading day with irregular length"""
    wd = ts.isoweekday()
    if wd == 1:
        return ts.date()
    elif wd in (2, 3, 4):
        return ts.date() - Timedelta("1D") if ts.hour <= 12 else ts.date()
    elif wd == 5:
        if ts.hour <= 12:
            return ts.date() - Timedelta("1D")
        elif 13 <= ts.hour <= 21:
            return ts.date()
    # out of range or nulls
    return None

# map the timestamps, and set as new index
df.set_index(pd.DatetimeIndex(df.index.map(find_day)), inplace=True)
# drop invalid values and collect ohlc
ans = df["rate"][df.index.notnull()].resample("D").ohlc()
Result
print(ans)
open high low close
Date
2020-01-06 0.673355 0.673355 0.672240 0.672240
2020-01-07 0.671980 0.672230 0.671895 0.672175
2020-01-08 0.672085 0.672085 0.670580 0.670580
2020-01-09 0.668840 0.670160 0.668840 0.670160
2020-01-10 0.670560 0.670560 0.670560 0.670560
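A quick sanity check of the mapping on a single timestamp (a hedged example reusing the imports above):
# Tuesday 09:00 is before the 13:00 cut-off, so it belongs to Monday's period.
print(find_day(Timestamp('2020-01-07 09:00:00')))  # 2020-01-06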
I ended up using a combination of groupby and day-of-week identification to arrive at my specific solution:
# get idxs of time to rebal (12:00:00)-------------------------------------
df['idx'] = range(len(df))  # get row index
days = []  # identify each row by day of week
for i in range(len(df.index)):
    days.append(df.index[i].date().weekday())
df['day'] = days
dtChgIdx = []  # stores "12:00:00" rows
justDates = df.index.date.tolist()  # gets just dates
res = []  # removes duplicate dates
[res.append(x) for x in justDates if x not in res]
justDates = res
grouped_dates = df.groupby(df.index.date)  # group entire df by dates
for i in range(len(grouped_dates)):
    tempDf = grouped_dates.get_group(justDates[i])  # look at each grouped date
    if tempDf['day'][0] == 6:
        continue  # skip Sundays
    times = []  # gets just the time portion of index
    for y in range(len(tempDf.index)):
        times.append(str(tempDf.index[y])[-8:])
    tempDf['time'] = times  # add time column to df
    tempDf['dayCls'] = np.where(tempDf['time'] == '12:00:00', 1, 0)  # flag "12:00:00" row
    dtChgIdx.append(tempDf.loc[tempDf['dayCls'] == 1, 'idx'][0])  # idx value
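For what it's worth, the row positions collected by the loops above can also be found directly from the index; a hedged, vectorized sketch of the same idea (it assumes df has a DatetimeIndex):
import numpy as np

# Flag the 12:00:00 row of every non-Sunday day straight from the index.
is_noon = (df.index.hour == 12) & (df.index.minute == 0) & (df.index.second == 0)
not_sunday = df.index.dayofweek != 6  # Monday=0 ... Sunday=6
dtChgIdx = np.flatnonzero(is_noon & not_sunday).tolist()  # positional row numbers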

Pandas DateTime Calculating Daily Averages

I have 2 columns of data in a pandas DF that looks like this, with the "DateTime" column in the format YYYY-MM-DD HH:MM:SS. This is the first 24 hrs, but the df is for one full year, or 8784 x 2.
BAFFIN BAY DateTime
8759 8.112838 2016-01-01 00:00:00
8760 7.977169 2016-01-01 01:00:00
8761 8.420204 2016-01-01 02:00:00
8762 9.515370 2016-01-01 03:00:00
8763 9.222840 2016-01-01 04:00:00
8764 8.872423 2016-01-01 05:00:00
8765 8.776145 2016-01-01 06:00:00
8766 9.030668 2016-01-01 07:00:00
8767 8.394983 2016-01-01 08:00:00
8768 8.092915 2016-01-01 09:00:00
8769 8.946967 2016-01-01 10:00:00
8770 9.620883 2016-01-01 11:00:00
8771 9.535951 2016-01-01 12:00:00
8772 8.861761 2016-01-01 13:00:00
8773 9.077692 2016-01-01 14:00:00
8774 9.116074 2016-01-01 15:00:00
8775 8.724343 2016-01-01 16:00:00
8776 8.916940 2016-01-01 17:00:00
8777 8.920438 2016-01-01 18:00:00
8778 8.926278 2016-01-01 19:00:00
8779 8.817666 2016-01-01 20:00:00
8780 8.704014 2016-01-01 21:00:00
8781 8.496358 2016-01-01 22:00:00
8782 8.434297 2016-01-01 23:00:00
I am trying to calculate daily averages of the "BAFFIN BAY" column and I've tried these approaches:
davg_df2 = df2.groupby(pd.Grouper(freq='D', key='DateTime')).mean()
davg_df2 = df2.groupby(pd.Grouper(freq='1D', key='DateTime')).mean()
davg_df2 = df2.groupby(by=df2['DateTime'].dt.date).mean()
All of these approaches yield the same answer, as shown below:
BAFFIN BAY
DateTime
2016-01-01 6.008044
However, if you do the math, the correct average for 2016-01-01 is 8.813134. Thank you kindly for your help. I'm assuming the grouping is just by day or 24 hrs to make consecutive DAILY averages, but the 3 approaches above are clearly looking at other data in my 8784 x 2 DF.
I just ran your df with this code and I get 8.813134:
df['DateTime'] = pd.to_datetime(df['DateTime'])
df = df.groupby(by=pd.Grouper(freq='D', key='DateTime')).mean()
print(df)
Output:
BAFFIN BAY
DateTime
2016-01-01 8.813134
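An equivalent spelling with resample, shown as a sketch (it assumes df2['DateTime'] has already been converted with pd.to_datetime):
# Sketch: index by the timestamp and take calendar-day means.
davg_df2 = df2.set_index('DateTime').resample('D').mean()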