I have a dataframe with trip counts every 20 minutes during a whole month, let's say:
Date Trip count
0 2019-08-01 00:00:00 3
1 2019-08-01 00:20:00 2
2 2019-08-01 00:40:00 4
3 2019-08-02 00:00:00 6
4 2019-08-02 00:20:00 4
5 2019-08-02 00:40:00 2
I want the mean trip count for each 20-minute slot of the day, averaged across all days. The desired output (for the values above) looks like:
Date mean
0 00:00:00 4.5
1 00:20:00 3
2 00:40:00 3
..
71 23:40:00 ..
You can aggregate by the times extracted with Series.dt.time; this works because the minutes are always 00, 20, or 40 and there are no seconds:
df['Date'] = pd.to_datetime(df['Date'])
df1 = df.groupby(df['Date'].dt.time).mean()
#alternative
#df1 = df.groupby(df['Date'].dt.strftime('%H:%M:%S')).mean()
print(df1)
Trip count
Date
00:00:00 4.5
00:20:00 3.0
00:40:00 3.0
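If the frame has more columns than Trip count, it is often safer to select the column of interest explicitly, so that only Trip count is averaged; a minimal sketch using the sample data above:

import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2019-08-01 00:00:00', '2019-08-01 00:20:00',
                                           '2019-08-01 00:40:00', '2019-08-02 00:00:00',
                                           '2019-08-02 00:20:00', '2019-08-02 00:40:00']),
                   'Trip count': [3, 2, 4, 6, 4, 2]})

# select the column before aggregating, so the datetime column is not averaged too
df1 = df.groupby(df['Date'].dt.time)['Trip count'].mean()
print(df1)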
I tried countless answers to similar problems here on SO but couldn't find anything that works for this scenario. It's driving me nuts.
I have these two Dataframes:
df_op:
 index                Date   Close  Name   LogRet
     0 2022-11-29 00:00:00  240.33  MSFT  -0.0059
     1 2022-11-29 00:00:00  280.57   QQQ  -0.0076
     2 2022-12-13 00:00:00  342.46  ADBE   0.0126
     3 2022-12-13 00:00:00  256.92  MSFT   0.0173
df_quotes:
 index                Date   Close  Name
    72 2022-11-29 00:00:00  141.17  AAPL
   196 2022-11-29 00:00:00  240.33  MSFT
    73 2022-11-30 00:00:00  148.03  AAPL
   197 2022-11-30 00:00:00  255.14  MSFT
    11 2022-11-30 00:00:00  293.36   QQQ
   136 2022-12-01 00:00:00  344.11  ADBE
   198 2022-12-01 00:00:00  254.69  MSFT
    12 2022-12-02 00:00:00  293.72   QQQ
I would like to add a column to df_op that indicates the close of the stock in df_quotes 2 days later. For example, the first row of df_op should become:
 index                Date   Close  Name   LogRet    Next
     0 2022-11-29 00:00:00  240.33  MSFT  -0.0059  254.69
In other words:
for each row in df_op, find the row in df_quotes with the same Name and a Date 2 days later, and copy its Close into a new column 'Next' in df_op.
I tried tens of combinations like this without success:
df_quotes[df_quotes['Date'].isin(df_op['Date'] + pd.DateOffset(days=2)) & df_quotes['Name'].isin(df_op['Name'])]
How can I do this without resorting to loops?
Try this:
# first convert to datetime
df_op['Date'] = pd.to_datetime(df_op['Date'])
df_quotes['Date'] = pd.to_datetime(df_quotes['Date'])

# merge on Date and Name, but with the quote dates offset 2 business days
(pd.merge(df_op,
          df_quotes[['Date', 'Close', 'Name']].rename({'Close': 'Next'}, axis=1),
          left_on=['Date', 'Name'],
          right_on=[df_quotes['Date'] - pd.tseries.offsets.BDay(2), 'Name'],
          how='left')
   .drop(['Date_x', 'Date_y'], axis=1))
Output:
Date index Close Name LogRet Next
0 2022-11-29 0 240.33 MSFT -0.0059 254.69
1 2022-11-29 1 280.57 QQQ -0.0076 NaN
2 2022-12-13 2 342.46 ADBE 0.0126 NaN
3 2022-12-13 3 256.92 MSFT 0.0173 NaN
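The NaNs appear because df_quotes has no row for that Name exactly 2 business days later. If you would rather take the next available quote on or after the target date, a hedged sketch with pd.merge_asof; the 'target' and 'QuoteDate' names are illustrative helpers, not part of the original frames:

import pandas as pd

df_op['Date'] = pd.to_datetime(df_op['Date'])
df_quotes['Date'] = pd.to_datetime(df_quotes['Date'])

# target date: 2 business days after each df_op row (merge_asof needs sorted keys)
left = (df_op.assign(target=df_op['Date'] + pd.tseries.offsets.BDay(2))
             .sort_values('target'))
right = (df_quotes[['Date', 'Name', 'Close']]
         .rename(columns={'Date': 'QuoteDate', 'Close': 'Next'})
         .sort_values('QuoteDate'))

# per Name, match the first quote on or after the target date
out = pd.merge_asof(left, right, left_on='target', right_on='QuoteDate',
                    by='Name', direction='forward')

With the sample data, this should fill the QQQ row from the 2022-12-02 quote (293.72) instead of leaving NaN.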
I'm trying to group some data by month in Python, but I need each month to start on the 25th. Is there a way to do that in pandas?
For weeks there is a way to start on Monday, Tuesday, and so on, but for months it is always the full calendar month:
pd.Grouper(key='date', freq='M')
You could offset the dates by 24 days and groupby:
import numpy as np
import pandas as pd

np.random.seed(1)
dates = pd.date_range('2019-01-01', '2019-04-30', freq='D')
df = pd.DataFrame({'date': dates,
                   'val': np.random.uniform(0, 1, len(dates))})

# shift each date back 24 days, then group by the shifted year/month
s = df['date'].sub(pd.DateOffset(24))
(df.groupby([s.dt.year, s.dt.month], as_index=False)
   .agg({'date': 'min', 'val': 'sum'})
)
gives
date val
0 2019-01-01 10.120368
1 2019-01-25 14.895363
2 2019-02-25 14.544506
3 2019-03-25 17.228734
4 2019-04-25 3.334160
Another example:
np.random.seed(1)
dates = pd.date_range('2019-01-20', '2019-01-30', freq='D')
df = pd.DataFrame({'date': dates,
                   'val': np.random.uniform(0, 1, len(dates))})

s = df['date'].sub(pd.DateOffset(24))
df['groups'] = df.groupby([s.dt.year, s.dt.month]).cumcount()
gives
date val groups
0 2019-01-20 0.417022 0
1 2019-01-21 0.720324 1
2 2019-01-22 0.000114 2
3 2019-01-23 0.302333 3
4 2019-01-24 0.146756 4
5 2019-01-25 0.092339 0
6 2019-01-26 0.186260 1
7 2019-01-27 0.345561 2
8 2019-01-28 0.396767 3
9 2019-01-29 0.538817 4
10 2019-01-30 0.419195 5
And you can see how the cumcount restarts on day 25.
I prepared the following test DataFrame:
Dat Val
0 2017-03-24 0
1 2017-03-25 0
2 2017-03-26 1
3 2017-03-27 0
4 2017-04-24 0
5 2017-04-25 0
6 2017-05-24 0
7 2017-05-25 2
8 2017-05-26 0
The first step is to compute a "shifted date" column:
df['Dat2'] = df.Dat + pd.DateOffset(days=-24)
The result is:
Dat Val Dat2
0 2017-03-24 0 2017-02-28
1 2017-03-25 0 2017-03-01
2 2017-03-26 1 2017-03-02
3 2017-03-27 0 2017-03-03
4 2017-04-24 0 2017-03-31
5 2017-04-25 0 2017-04-01
6 2017-05-24 0 2017-04-30
7 2017-05-25 2 2017-05-01
8 2017-05-26 0 2017-05-02
As you can see, March dates in Dat2 begin only at the original date 2017-03-25,
and so on.
The value of 1 falls in March (by Dat2) and the value of 2 falls in May (also by Dat2).
Then, to compute e.g. a sum by month, we can run:
df.groupby(pd.Grouper(key='Dat2', freq='MS')).sum()
getting:
Val
Dat2
2017-02-01 0
2017-03-01 1
2017-04-01 0
2017-05-01 2
So we have the correct grouping: 1 is in March and 2 is in May.
The advantage over the other answer is that all dates land on the first
day of a month, bearing in mind that e.g. 2017-03-01 in the
result means the period from 2017-03-25 to 2017-04-24 (inclusive).
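If you prefer the labels to show the true period starts (the 25th) rather than the first of the shifted month, a small sketch that shifts the keys back, assuming the df with the Dat2 column from above:

out = df.groupby(pd.Grouper(key='Dat2', freq='MS')).sum()
# undo the 24-day shift on the labels only, e.g. 2017-03-01 -> 2017-03-25
out.index = out.index + pd.DateOffset(days=24)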
I have a number of records in a dataframe where the maturity date
column is 31-12-9999 12:00:00 AM as the bonds never mature. This
naturally raises the error:
Out of bounds nanosecond timestamp: 9999-12-31 00:00:00
I see the max date is:
pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')
I just wanted to clarify the best approach to clean all date columns in the dataframe and fix my bug. My code, modelled off the docs:
df_Fix_Date = df_Date['maturity_date'].head(8)
display(df_Fix_Date)
display(df_Fix_Date.dtypes)
0 2020-08-15 00:00:00.000
1 2022-11-06 00:00:00.000
2 2019-03-15 00:00:00.000
3 2025-01-15 00:00:00.000
4 2035-05-29 00:00:00.000
5 2027-06-01 00:00:00.000
6 2021-04-01 00:00:00.000
7 9999-04-03 00:00:00.000
Name: maturity_date, dtype: object
def conv(x):
    return pd.Period(day=x % 100, month=x // 100 % 100, year=x // 10000, freq='D')
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date']) # convert to datetype
df_Fix_Date['maturity_date'] = pd.PeriodIndex(df_Fix_Date['maturity_date'].apply(conv)) # fix error
display(df_Fix_Date)
Output:
KeyError: 'maturity_date'
The problem is that you cannot convert to out-of-bounds datetimes. (Also note that df_Date['maturity_date'].head(8) is a Series, which is why indexing it with 'maturity_date' raises the KeyError; the code below assumes df_Fix_Date = df_Date[['maturity_date']].head(8), a one-column DataFrame.)
One solution is to replace the year 9999 with 2261:
df_Fix_Date['maturity_date'] = df_Fix_Date['maturity_date'].replace('^9999','2261',regex=True)
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'])
print(df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 2261-04-03
Another solution is to replace all dates with a year greater than 2261:
m = df_Fix_Date['maturity_date'].str[:4].astype(int) > 2261
df_Fix_Date['maturity_date'] = df_Fix_Date['maturity_date'].mask(m, '2261' + df_Fix_Date['maturity_date'].str[4:])
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'])
print(df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 2261-04-03
Or convert the problematic dates to NaT with the parameter errors='coerce':
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'], errors='coerce')
print(df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 NaT
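Since pandas 2.0 there may be a further option: non-nanosecond resolutions. Second precision covers year 9999, so the sentinel date can survive unchanged. A hedged sketch, assuming pandas >= 2.0:

import pandas as pd

# assuming pandas >= 2.0: cast to second resolution instead of nanoseconds,
# so 9999-12-31 stays within bounds and no replacement is needed
s = pd.Series(['2020-08-15 00:00:00.000', '9999-12-31 00:00:00.000'])
print(s.astype('datetime64[s]'))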
I wonder if anyone could please help me with this issue: I have a pandas data frame (generated from a text file) which should have a structure similar to this one:
import pandas as pd
data = {'Objtype': ['bias', 'bias', 'flat', 'flat', 'StdStar', 'flat', 'Arc',
                    'Target1', 'Arc', 'Flat', 'Flat', 'Flat', 'bias', 'bias'],
        'UT': pd.date_range("23:00", "00:05", freq="5min").values,
        'Position': ['P0', 'P0', 'P0', 'P0', 'P1', 'P1', 'P1', 'P2', 'P2', 'P2',
                     'P0', 'P0', 'P0', 'P0']}
df = pd.DataFrame(data=data)
I would like to do some operations taking into consideration the time of the observation, so I change the UT column from a string format to a numpy datetime64:
df['UT'] = pd.to_datetime(df['UT'])
Which gives me something like this:
Objtype Position UT
0 bias P0 2016-08-31 23:45:00
1 bias P0 2016-08-31 23:50:00
2 flat P0 2016-08-31 23:55:00
3 flat P0 2016-08-31 00:00:00
4 StdStar P1 2016-08-31 00:05:00
5 flat P1 2016-08-31 00:10:00
6 Arc P1 2016-08-31 00:15:00
7 Target1 P1 2016-08-31 00:20:00
However, there are two issues here:
First, the year/month/day is set to the current date.
Second, the day does not advance across 23:59 -> 00:00; rather, the timestamps go backwards.
If we know the true date of the first row, and we know that all the entries are sequential (they always go from sunset to sunrise), how could we correct for these issues?
To find the time delta between 2 rows:
df.UT - df.UT.shift()
Out[48]:
0 NaT
1 00:05:00
2 00:05:00
3 -1 days +00:05:00
4 00:05:00
5 00:05:00
6 00:05:00
7 00:05:00
Name: UT, dtype: timedelta64[ns]
To find when time goes backwards:
df.UT - df.UT.shift() < pd.Timedelta(0)
Out[49]:
0 False
1 False
2 False
3 True
4 False
5 False
6 False
7 False
Name: UT, dtype: bool
To have an additional 1 day for each row going backward:
((df.UT - df.UT.shift() < pd.Timedelta(0))*pd.Timedelta(1, 'D'))
Out[50]:
0 0 days
1 0 days
2 0 days
3 1 days
4 0 days
5 0 days
6 0 days
7 0 days
Name: UT, dtype: timedelta64[ns]
To broadcast forward the additional days down the series, use the cumsum pattern:
((df.UT - df.UT.shift() < pd.Timedelta(0))*pd.Timedelta(1, 'D')).cumsum()
Out[53]:
0 0 days
1 0 days
2 0 days
3 1 days
4 1 days
5 1 days
6 1 days
7 1 days
Name: UT, dtype: timedelta64[ns]
Add this correction vector back to your original UT column:
df.UT + ((df.UT - df.UT.shift() < pd.Timedelta(0))*pd.Timedelta(1, 'D')).cumsum()
Out[51]:
0 2016-08-31 23:45:00
1 2016-08-31 23:50:00
2 2016-08-31 23:55:00
3 2016-09-01 00:00:00
4 2016-09-01 00:05:00
5 2016-09-01 00:10:00
6 2016-09-01 00:15:00
7 2016-09-01 00:20:00
Name: UT, dtype: datetime64[ns]
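Finally, the question says the true date of the first row is known; to fix the first issue (wrong base date), shift the whole corrected series so that row 0 lands on that date. A small sketch, assuming a hypothetical known start of 2016-08-31:

# day-rollover correction from above
corrected = df.UT + ((df.UT - df.UT.shift() < pd.Timedelta(0)) * pd.Timedelta(1, 'D')).cumsum()

true_start = pd.Timestamp('2016-08-31')  # hypothetical known date of row 0
# move every timestamp by the same offset, preserving the times of day
df['UT'] = corrected + (true_start - corrected.iloc[0].normalize())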
I have a dataframe like this:
df = pd.read_csv("fileA.csv", dtype=str, delimiter=";", skiprows = None, parse_dates=['Date'])
Date Buy Sell
0 01.08.2009 01:00 15 25
1 01.08.2009 02:00 0 30
2 01.08.2009 03:00 10 18
But I need this one (in 15-minute periods):
Date Buy Sell
0 01.08.2009 01:00 15 25
1 01.08.2009 01:15 15 25
2 01.08.2009 01:30 15 25
3 01.08.2009 01:45 15 25
4 01.08.2009 02:00 0 30
5 01.08.2009 02:15 0 30
6 01.08.2009 02:30 0 30
7 01.08.2009 02:45 0 30
8 01.08.2009 03:00 10 18
....and so on.
I have tried df.resample(), but it did not work. Does someone know a nice pandas method?
If fileA.csv looks like this:
Date;Buy;Sell
01.08.2009 01:00;15;25
01.08.2009 02:00;0;30
01.08.2009 03:00;10;18
then you could parse the data with
df = pd.read_csv("fileA.csv", delimiter=";", parse_dates=['Date'])
so that df will look like this:
In [41]: df
Out[41]:
Date Buy Sell
0 2009-01-08 01:00:00 15 25
1 2009-01-08 02:00:00 0 30
2 2009-01-08 03:00:00 10 18
You might want to check df.info() to make sure you successfully parsed your data into a DataFrame with three columns, and that the Date column has dtype datetime64[ns]. Since the repr(df) you posted prints the date in a different format and the column headers do not align with the data, there is a good chance that the data has not yet been parsed properly. If that's true and you post some sample lines from the csv, we should be able to help you parse the data into a DataFrame.
In [51]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 3 columns):
Date 3 non-null datetime64[ns]
Buy 3 non-null int64
Sell 3 non-null int64
dtypes: datetime64[ns](1), int64(2)
memory usage: 96.0 bytes
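One caveat: a date like 01.08.2009 is ambiguous, and the output above shows it parsed as January 8. If the file uses day-first dates (1 August 2009), pass dayfirst=True when reading:

df = pd.read_csv("fileA.csv", delimiter=";", parse_dates=['Date'], dayfirst=True)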
Once you have the DataFrame correctly parsed, resampling to 15-minute periods can be done with asfreq, forward-filling the missing values:
In [50]: df.set_index('Date').asfreq('15T', method='ffill')
Out[50]:
Buy Sell
2009-01-08 01:00:00 15 25
2009-01-08 01:15:00 15 25
2009-01-08 01:30:00 15 25
2009-01-08 01:45:00 15 25
2009-01-08 02:00:00 0 30
2009-01-08 02:15:00 0 30
2009-01-08 02:30:00 0 30
2009-01-08 02:45:00 0 30
2009-01-08 03:00:00 10 18
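Equivalently, resample works too; note that recent pandas versions deprecate the 'T' alias in favor of 'min', so a version-tolerant sketch looks like:

# '15min' is the preferred alias in recent pandas ('15T' is deprecated)
df.set_index('Date').resample('15min').ffill()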