Pandas: Filtered property behaves like unfiltered - pandas

I have DT, which is a datetime64 Series:
0 2019-12-12 18:43:00
1 2019-03-22 18:30:00
2 NaT
3 2019-04-17 02:00:00
4 2009-03-15 18:00:00
5 2019-04-02 20:25:00
6 2019-05-01 11:00:00
7 2019-04-10 17:00:00
8 1973-07-14 22:00:00
9 2019-06-06 19:00:00
10 2019-06-18 21:00:00
11 2019-06-12 22:00:00
12 2019-06-11 22:00:00
13 2018-06-15 01:00:00
14 1999-08-15 02:30:00
...
88110 2019-10-01 22:00:00
88111 2019-10-01 22:45:00
88112 2019-10-02 01:00:00
88113 2019-10-02 03:26:00
88114 2019-10-02 03:26:00
88115 2019-10-02 05:33:00
88116 2019-10-02 06:35:00
88117 2019-10-02 12:00:00
88118 2019-10-02 19:00:00
88119 2019-10-02 19:15:00
88120 2019-10-02 20:00:00
88121 2019-10-02 20:00:00
88122 2019-10-02 20:03:00
88123 2019-10-02 22:00:00
88124 2019-10-02 22:00:00
Name: date_time, Length: 88125, dtype: datetime64[ns]
and a piece of code:
DT[DT.between("2019-12-05", "2019-12-08") & DT.dt.weekday == 1].dt.weekday.value_counts()
which yields:
5 27
3 23
4 19
Name: date_time, dtype: int64
which includes weekdays 3, 4 and 5 but not a single row for the requested weekday 1!
So, when I code just:
DT[DT.between("2019-12-05", "2019-12-08")].dt.weekday
it yields:
3821 3
87138 3
87139 3
87140 3
87141 3
..
87328 5
87329 5
87330 5
87331 5
87332 5
This makes sense: the range covers three days, which correspond to three weekdays, and weekday 1 does not occur in it at all. So why does the & DT.dt.weekday == 1 filter not work?
Thank you a lot for your time!
UPDATE
When I try to use any other filter like & DT.dt.weekday == 2, & DT.dt.weekday == 3 etc., I get an empty Series as a result of the filtering like this:
DT[DT.between("2019-12-05", "2019-12-08") & DT.dt.weekday == 4]
Moreover, DT.dt.weekday == 1 on its own returns a normal boolean (True/False) Series!
Maybe we cannot filter on dt.(...) properties?

Turns out that this:
DT[DT.between("2019-12-05", "2019-12-08") & DT.dt.weekday == 1]
is performed as this:
DT[ (DT.between("2019-12-05", "2019-12-08") & DT.dt.weekday) == 1 ]
because & binds more tightly than == in Python. That is why the filter returned True for every row between 2019-12-05 and 2019-12-08: & DT.dt.weekday had no real effect, since the weekday was between 3 and 5 for every date in that range.
So, when I coded it like this:
DT[ (DT.between("2019-12-05", "2019-12-08")) & (DT.dt.weekday == 1) ]
everything worked as expected, i.e. nothing was selected. But this, on the other hand:
DT[ (DT.between("2019-12-05", "2019-12-08")) & (DT.dt.weekday == 3) ]
yielded a few rows corresponding to weekday 3.
So, once each condition A and B in an A & B filter expression is wrapped in its own parentheses, everything works as designed!
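A minimal sketch of the precedence issue, assuming a tiny hypothetical sample in place of the real DT Series (the four timestamps and the variable names are made up for illustration):

```python
import pandas as pd

# Plain Python groups the expression the same way: & is evaluated before ==.
unparenthesized = True & 3 == 1      # parsed as (True & 3) == 1
parenthesized = True & (3 == 1)      # the comparison is evaluated first

# Hypothetical stand-in for the DT Series from the question.
DT = pd.Series(pd.to_datetime([
    "2019-12-05 10:00",  # Thursday, weekday 3
    "2019-12-06 11:00",  # Friday, weekday 4
    "2019-12-07 12:00",  # Saturday, weekday 5
    "2019-12-10 09:00",  # Tuesday, weekday 1, outside the date range
]), name="date_time")

in_range = DT.between("2019-12-05", "2019-12-08")

# Each comparison is wrapped in parentheses (or stored in a variable),
# so & combines two boolean masks.
thursdays = DT[in_range & (DT.dt.weekday == 3)]
tuesdays = DT[in_range & (DT.dt.weekday == 1)]

print(thursdays)
print(tuesdays)
```

Storing each mask in its own variable, as with in_range here, is another way to avoid the problem, since no comparison operator is left next to &.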
Thank you all for your time anyway! =)

Related

Get the value in a dataframe based on a value and a date in another dataframe

I tried countless answers to similar problems here on SO but couldn't find anything that works for this scenario. It's driving me nuts.
I have these two Dataframes:
df_op:
index  Date                 Close   Name  LogRet
0      2022-11-29 00:00:00  240.33  MSFT  -0.0059
1      2022-11-29 00:00:00  280.57  QQQ   -0.0076
2      2022-12-13 00:00:00  342.46  ADBE   0.0126
3      2022-12-13 00:00:00  256.92  MSFT   0.0173
df_quotes:
index  Date                 Close   Name
72     2022-11-29 00:00:00  141.17  AAPL
196    2022-11-29 00:00:00  240.33  MSFT
73     2022-11-30 00:00:00  148.03  AAPL
197    2022-11-30 00:00:00  255.14  MSFT
11     2022-11-30 00:00:00  293.36  QQQ
136    2022-12-01 00:00:00  344.11  ADBE
198    2022-12-01 00:00:00  254.69  MSFT
12     2022-12-02 00:00:00  293.72  QQQ
I would like to add a column to df_op that indicates the close of the stock in df_quotes 2 days later. For example, the first row of df_op should become:
index  Date                 Close   Name  LogRet   Next
0      2022-11-29 00:00:00  240.33  MSFT  -0.0059  254.69
In other words:
for each row in df_op, find the corresponding Name in df_quotes with Date of 2 days later and copy its Close to df_op in column 'Next'.
I tried tens of combinations like this without success:
df_quotes[df_quotes['Date'].isin(df_op['Date'] + pd.DateOffset(days=2)) & df_quotes['Name'].isin(df_op['Name'])]
How can I do this without resorting to loops?
Try this:
# first convert to datetime
df_op['Date'] = pd.to_datetime(df_op['Date'])
df_quotes['Date'] = pd.to_datetime(df_quotes['Date'])
# merge on Date and Name, with the quote date shifted back 2 business days
(pd.merge(df_op,
          df_quotes[['Date','Close','Name']].rename({'Close':'Next'},axis=1),
          left_on=['Date','Name'],
          right_on=[df_quotes['Date'] - pd.tseries.offsets.BDay(2),'Name'],
          how = 'left')
 .drop(['Date_x','Date_y'],axis=1))
Output:
Date index Close Name LogRet Next
0 2022-11-29 0 240.33 MSFT -0.0059 254.69
1 2022-11-29 1 280.57 QQQ -0.0076 NaN
2 2022-12-13 2 342.46 ADBE 0.0126 NaN
3 2022-12-13 3 256.92 MSFT 0.0173 NaN
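The same lookup can also be sketched as a plain two-key merge, by first shifting the quote dates back so they line up with the df_op rows. This is only an illustration: it rebuilds the two frames from the question, uses calendar days (pd.Timedelta(days=2)) rather than the BDay(2) offset above, and the name lookup is made up.

```python
import pandas as pd

# Sample frames rebuilt from the tables in the question.
df_op = pd.DataFrame({
    "Date": pd.to_datetime(["2022-11-29", "2022-11-29", "2022-12-13", "2022-12-13"]),
    "Close": [240.33, 280.57, 342.46, 256.92],
    "Name": ["MSFT", "QQQ", "ADBE", "MSFT"],
    "LogRet": [-0.0059, -0.0076, 0.0126, 0.0173],
})
df_quotes = pd.DataFrame({
    "Date": pd.to_datetime([
        "2022-11-29", "2022-11-29", "2022-11-30", "2022-11-30",
        "2022-11-30", "2022-12-01", "2022-12-01", "2022-12-02",
    ]),
    "Close": [141.17, 240.33, 148.03, 255.14, 293.36, 344.11, 254.69, 293.72],
    "Name": ["AAPL", "MSFT", "AAPL", "MSFT", "QQQ", "ADBE", "MSFT", "QQQ"],
})

# Move every quote two calendar days earlier, so a quote dated D+2
# carries the key D and can be matched against df_op directly.
lookup = (
    df_quotes.assign(Date=df_quotes["Date"] - pd.Timedelta(days=2))
    .rename(columns={"Close": "Next"})[["Date", "Name", "Next"]]
)

# Left merge keeps every df_op row; rows without a quote get NaN.
result = df_op.merge(lookup, on=["Date", "Name"], how="left")
print(result)
```

Whether calendar days or business days are the right offset depends on how weekends and holidays should be handled, which the question does not specify.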

Overlap in seconds between datetime range and a time range

I have a dataframe like this:
df11 = pd.DataFrame(
{
"Start_date": ["2018-01-31 12:00:00", "2018-02-28 16:00:00", "2018-02-27 22:00:00"],
"End_date": ["2019-01-31 21:45:00", "2019-03-24 22:00:00", "2018-02-28 01:00:00"],
}
)
Start_date End_date
0 2018-01-31 12:00:00 2019-01-31 21:45:00
1 2018-02-28 16:00:00 2019-03-24 22:00:00
2 2018-02-27 22:00:00 2018-02-28 01:00:00
I need to check the overlap time duration in specific periods in seconds. My expected results are like this:
Start_date End_date 12h-16h 16h-22h 22h-00h 00h-02h30
0 2018-01-31 12:00:00 2019-01-31 21:45:00 14400 20700 0 0
1 2018-02-28 16:00:00 2019-03-24 22:00:00 0 21600 0 0
2 2018-02-27 22:00:00 2018-02-28 01:00:00 0 0 7200 3600
I have tried several approaches. This is one of my attempts, which I know is completely wrong:
df11['12h-16h']=np.where(df11['Start_date']<timedelta(hours=16, minutes=0, seconds=0) & df11['End_date']>timedelta(hours=12, minutes=0, seconds=0),(np.minimum(df11['End_date'],timedelta(hours=16, minutes=0, seconds=0)))-(np.maximum(df11['Start_date'],timedelta(hours=12, minutes=0, seconds=0)))
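One possible way to get the expected numbers is sketched below. It rests on assumptions read off the expected output rather than stated in the question: only the time of day of each timestamp matters (the dates are ignored), and an end time that is not later than the start time means the interval crosses midnight once. The names DAY, seconds_of_day and windows are hypothetical.

```python
import numpy as np
import pandas as pd

df11 = pd.DataFrame(
    {
        "Start_date": ["2018-01-31 12:00:00", "2018-02-28 16:00:00", "2018-02-27 22:00:00"],
        "End_date": ["2019-01-31 21:45:00", "2019-03-24 22:00:00", "2018-02-28 01:00:00"],
    }
)

DAY = 24 * 3600  # seconds in one day


def seconds_of_day(ts):
    """Seconds elapsed since midnight for each timestamp in a Series."""
    return ts.dt.hour * 3600 + ts.dt.minute * 60 + ts.dt.second


s = seconds_of_day(pd.to_datetime(df11["Start_date"]))
e = seconds_of_day(pd.to_datetime(df11["End_date"]))
# Assumption: an end time not after the start time belongs to the next day.
e = e.where(e > s, e + DAY)

# Window boundaries in seconds since midnight.
windows = {
    "12h-16h": (12 * 3600, 16 * 3600),
    "16h-22h": (16 * 3600, 22 * 3600),
    "22h-00h": (22 * 3600, 24 * 3600),
    "00h-02h30": (0, 2 * 3600 + 30 * 60),
}

for label, (a, b) in windows.items():
    # Overlap with the window on the start day ...
    same_day = (np.minimum(e, b) - np.maximum(s, a)).clip(lower=0)
    # ... plus overlap with the same window on the following day.
    next_day = (np.minimum(e, b + DAY) - np.maximum(s, a + DAY)).clip(lower=0)
    df11[label] = same_day + next_day

print(df11)
```

If the real requirement is to count every occurrence of a window across multi-day ranges, this sketch would not be enough, because it looks at one 24-hour cycle only.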

Pandas take daily mean within resampled date

I have a dataframe with trip counts every 20 minutes during a whole month, let's say:
Date Trip count
0 2019-08-01 00:00:00 3
1 2019-08-01 00:20:00 2
2 2019-08-01 00:40:00 4
3 2019-08-02 00:00:00 6
4 2019-08-02 00:20:00 4
5 2019-08-02 00:40:00 2
I want the mean across all days of the trip count for each 20-minute time slot. Desired output (for above values) looks like:
Date mean
0 00:00:00 4.5
1 00:20:00 3
2 00:40:00 3
..
72 23:40:00 ..
You can aggregate by times created by Series.dt.time, because there are always 00, 20, 40 minutes only and no seconds:
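df['Date'] = pd.to_datetime(df['Date'])
# select the numeric column explicitly so the datetime column is not aggregated too
df1 = df.groupby(df['Date'].dt.time)[['Trip count']].mean()
#alternative
#df1 = df.groupby(df['Date'].dt.strftime('%H:%M:%S'))[['Trip count']].mean()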
print (df1)
Trip count
Date
00:00:00 4.5
00:20:00 3.0
00:40:00 3.0
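For a self-contained check, here is a small sketch that rebuilds the six sample rows from the question and reshapes the result into the Date / mean layout the question asked for. The variable name mean_by_slot is made up for illustration.

```python
import pandas as pd

# The six sample rows from the question.
df = pd.DataFrame({
    "Date": [
        "2019-08-01 00:00:00", "2019-08-01 00:20:00", "2019-08-01 00:40:00",
        "2019-08-02 00:00:00", "2019-08-02 00:20:00", "2019-08-02 00:40:00",
    ],
    "Trip count": [3, 2, 4, 6, 4, 2],
})
df["Date"] = pd.to_datetime(df["Date"])

# Group by time of day, average the counts, then turn the group key
# back into a regular column named Date next to a column named mean.
mean_by_slot = (
    df.groupby(df["Date"].dt.time)["Trip count"]
    .mean()
    .rename("mean")
    .reset_index()
)
print(mean_by_slot)
```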

Interpolating datetime Index

I have a DataFrame (df) as follows, where 'date' is a datetime index (Y-M-D):
df :
values
date
2010-01-01 10
2010-01-02 20
2010-01-03 -30
I want to create a new df with an interpolated datetime index as follows:
values
date
2010-01-01 12:00:00 10
2010-01-01 17:00:00 15 # mean value betw. 2010-01-01 and 2010-01-02
2010-01-02 12:00:00 20
2010-01-02 17:00:00 -5 # mean value betw. 2010-01-02 and 2010-01-03
2010-01-03 12:00:00 -30
Can anyone help me on this?
I believe you need to add 12 hours to the index first, then reindex by the union of that index with the new 17:00 timestamps, and finally interpolate:
df1 = df.set_index(df.index + pd.Timedelta(12, unit='h'))
idx = (df.index + pd.Timedelta(17, unit='h')).union(df1.index)
df2 = df1.reindex(idx).interpolate()
print (df2)
values
date
2010-01-01 12:00:00 10.0
2010-01-01 17:00:00 15.0
2010-01-02 12:00:00 20.0
2010-01-02 17:00:00 -5.0
2010-01-03 12:00:00 -30.0
2010-01-03 17:00:00 -30.0
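A runnable sketch of the same steps on the three sample rows follows. The last line is an addition of mine, not part of the answer above: it drops the trailing 17:00 row, which the desired output in the question does not contain.

```python
import pandas as pd

# Sample frame from the question, with a daily DatetimeIndex named date.
df = pd.DataFrame(
    {"values": [10, 20, -30]},
    index=pd.DatetimeIndex(["2010-01-01", "2010-01-02", "2010-01-03"], name="date"),
)

# Move the original observations to 12:00.
df1 = df.set_index(df.index + pd.Timedelta(12, unit="h"))
# Add a 17:00 timestamp for every day and merge both sets of timestamps.
idx = (df.index + pd.Timedelta(17, unit="h")).union(df1.index)
# The new rows start as NaN; default linear interpolation treats rows as
# equally spaced, so each 17:00 row gets the midpoint of its neighbours.
df2 = df1.reindex(idx).interpolate()

# Assumption: the final 17:00 row has no following value and is not wanted.
df2 = df2.iloc[:-1]
print(df2)
```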

Setting the day in a pandas frame column, from a string list containing only the hours

I wonder if anyone could please help me with this issue: I have a pandas data frame (generated from a text file) which should have a structure similar to this one:
import pandas as pd
# UT holds time-of-day strings only ("23:00" ... "00:05"); the dates below just generate the sequence
data = {'Objtype' : ['bias', 'bias', 'flat', 'flat', 'StdStar', 'flat', 'Arc', 'Target1', 'Arc', 'Flat', 'Flat', 'Flat', 'bias', 'bias'],
        'UT' : pd.date_range("2016-08-31 23:00", "2016-09-01 00:05", freq="5min").strftime("%H:%M"),
        'Position' : ['P0', 'P0', 'P0', 'P0', 'P1', 'P1','P1', 'P2','P2','P2', 'P0', 'P0', 'P0', 'P0']}
df = pd.DataFrame(data=data)
I would like to do some operations that take the time of the observation into consideration, so I convert the UT column from strings to a numpy datetime64:
df['UT'] = pd.to_datetime(df['UT'])
Which gives me something like this:
Objtype Position UT
0 bias P0 2016-08-31 23:45:00
1 bias P0 2016-08-31 23:50:00
2 flat P0 2016-08-31 23:55:00
3 flat P0 2016-08-31 00:00:00
4 StdStar P1 2016-08-31 00:05:00
5 flat P1 2016-08-31 00:10:00
6 Arc P1 2016-08-31 00:15:00
7 Target1 P1 2016-08-31 00:20:00
However, in here there are two issues:
First) the year/month/day is assigned to the current one.
Second) the day has not changed from 23:59 -> 00:00. Rather it has gone backwards.
If we know the true date of the first row and that all entries are sequential (they always run from sunset to sunrise), how can we correct for these issues?
To find the time delta between 2 rows:
df.UT - df.UT.shift()
Out[48]:
0 NaT
1 00:05:00
2 00:05:00
3 -1 days +00:05:00
4 00:05:00
5 00:05:00
6 00:05:00
7 00:05:00
Name: UT, dtype: timedelta64[ns]
To find when time goes backwards:
df.UT - df.UT.shift() < pd.Timedelta(0)
Out[49]:
0 False
1 False
2 False
3 True
4 False
5 False
6 False
7 False
Name: UT, dtype: bool
To have an additional 1 day for each row going backward:
((df.UT - df.UT.shift() < pd.Timedelta(0))*pd.Timedelta(1, 'D'))
Out[50]:
0 0 days
1 0 days
2 0 days
3 1 days
4 0 days
5 0 days
6 0 days
7 0 days
Name: UT, dtype: timedelta64[ns]
To broadcast forward the additional days down the series, use the cumsum pattern:
((df.UT - df.UT.shift() < pd.Timedelta(0))*pd.Timedelta(1, 'D')).cumsum()
Out[53]:
0 0 days
1 0 days
2 0 days
3 1 days
4 1 days
5 1 days
6 1 days
7 1 days
Name: UT, dtype: timedelta64[ns]
Add this correction vector back to your original UT column:
df.UT + ((df.UT - df.UT.shift() < pd.Timedelta(0))*pd.Timedelta(1, 'D')).cumsum()
Out[51]:
0 2016-08-31 23:45:00
1 2016-08-31 23:50:00
2 2016-08-31 23:55:00
3 2016-09-01 00:00:00
4 2016-09-01 00:05:00
5 2016-09-01 00:10:00
6 2016-09-01 00:15:00
7 2016-09-01 00:20:00
Name: UT, dtype: datetime64[ns]
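Put together, the steps above can be sketched as one short script. The input here is hypothetical: a list of time-of-day strings plus an assumed known date for the first row (times and first_date are made-up names), and Series.diff() is used in place of df.UT - df.UT.shift(), which computes the same differences.

```python
import pandas as pd

# Hypothetical input: times in observation order, crossing midnight once.
times = ["23:45", "23:50", "23:55", "00:00", "00:05", "00:10"]
first_date = "2016-08-31"  # assumed known date of the first row

# Attach the first row's date to every time, as a naive starting point.
ut = pd.Series(pd.to_datetime([f"{first_date} {t}" for t in times]), name="UT")

# True wherever the clock appears to jump backwards (a midnight rollover).
went_back = ut.diff() < pd.Timedelta(0)

# The running count of rollovers is the number of days to add to each row.
day_offset = went_back.cumsum() * pd.Timedelta(days=1)

fixed = ut + day_offset
print(fixed)
```

This relies on the rows being in chronological order and on consecutive rows never being more than a day apart; otherwise a rollover could go unnoticed.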