Adding additional date rows to pandas dataframe - pandas

I have a dataframe that holds quarterly earnings dates for a number stocks, looking like this:
df = pd.DataFrame( {
'ticker': ['AAPL', 'AAPL', 'AAPL','AAPL', 'MSFT', 'MSFT','MSFT', 'MSFT'],
'datetime': ['2015-01-01', '2015-04-01', '2015-07-01', '2015-12-01', '2015-01-01', '2015-04-01', '2015-07-01', '2015-12-01'],
})
df['datetime'] = pd.to_datetime(df['datetime'])
I would like to create an additional n rows before and after each date, to make it look like this (for n=1):
df = pd.DataFrame( {
'ticker': ['AAPL', 'AAPL', 'AAPL','AAPL','AAPL', 'AAPL', 'AAPL','AAPL','AAPL', 'AAPL', 'AAPL','AAPL', 'MSFT', 'MSFT','MSFT', 'MSFT','MSFT', 'MSFT','MSFT', 'MSFT','MSFT', 'MSFT','MSFT', 'MSFT'],
'datetime': ['2014-12-31','2015-01-01','2015-01-02','2015-03-30','2015-04-01', '2015-04-02', '2015-06-30','2015-07-01','2015-07-02', '2015-11-30','2015-12-01', '2015-12-02',
'2014-12-31','2015-01-01','2015-01-02','2015-03-30','2015-04-01', '2015-04-02', '2015-06-30','2015-07-01','2015-07-02', '2015-11-30','2015-12-01', '2015-12-02'],
})
df['datetime'] = pd.to_datetime(df['datetime'])

Substract and add to each date one day and then .explode():
t = pd.Timedelta(days=1)
df["datetime"] = df["datetime"].apply(lambda x: [x - t, x, x + t])
df = df.explode("datetime").reset_index(drop=True)
print(df)
Prints:
ticker datetime
0 AAPL 2014-12-31
1 AAPL 2015-01-01
2 AAPL 2015-01-02
3 AAPL 2015-03-31
4 AAPL 2015-04-01
5 AAPL 2015-04-02
6 AAPL 2015-06-30
7 AAPL 2015-07-01
8 AAPL 2015-07-02
9 AAPL 2015-11-30
10 AAPL 2015-12-01
11 AAPL 2015-12-02
12 MSFT 2014-12-31
13 MSFT 2015-01-01
14 MSFT 2015-01-02
15 MSFT 2015-03-31
16 MSFT 2015-04-01
17 MSFT 2015-04-02
18 MSFT 2015-06-30
19 MSFT 2015-07-01
20 MSFT 2015-07-02
21 MSFT 2015-11-30
22 MSFT 2015-12-01
23 MSFT 2015-12-02
EDIT: For n > 1:
n = 2
df["datetime"] = df["datetime"].apply(
lambda x: [x + pd.Timedelta(days=d) for d in range(-n, n)]
)
df = df.explode("datetime").reset_index(drop=True)
print(df)
Prints:
ticker datetime
0 AAPL 2014-12-30
1 AAPL 2014-12-31
2 AAPL 2015-01-01
3 AAPL 2015-01-02
4 AAPL 2015-03-30
5 AAPL 2015-03-31
...

you can use DateOffset and concat like:
n = 1 # nb of rows to add
res = (pd.concat([df.assign(datetime = df['datetime'] + pd.DateOffset(days=i))
for i in range(-n, n+1)])
.sort_values(['ticker','datetime'])
.reset_index(drop=True)
)
print(res.head(10))
ticker datetime
0 AAPL 2014-12-31
1 AAPL 2015-01-01
2 AAPL 2015-01-02
3 AAPL 2015-03-31
4 AAPL 2015-04-01
5 AAPL 2015-04-02
6 AAPL 2015-06-30
7 AAPL 2015-07-01
8 AAPL 2015-07-02
9 AAPL 2015-11-30

Related

pandas: get range between date columns

I have pandas DataFrame:
start_date finish_date progress_id
0 2018-06-23 08:28:50.681065+00 2018-06-23 08:28:52.439542+00 a387ab916f402cb3fbfffd29f68fd0ce
1 2019-03-18 14:23:17.328374+00 2019-03-18 14:54:50.979612+00 3b9dce04f32da32763124602557f92a3
2 2019-07-09 09:18:46.19862+00 2019-07-11 08:03:09.222385+00 73e17a05355852fe65b785c82c37d1ad
3 2018-07-27 15:39:17.666629+00 2018-07-27 16:13:55.086871+00 cc3eb34ae49c719648352c4175daee88
4 2019-04-24 18:42:40.272854+00 2019-04-24 18:44:57.507857+00 04ace4fe130d90c801e24eea13ee808e
I converted columns to datetime.date because I don't need time in df:
df['start_date'] = pd.to_datetime(df['start_date']).dt.date
df['finish_date'] = pd.to_datetime(df['finish_date']).dt.date
So, I need a new column which will be contain year-month if start_date and finish_date have same month. And if different put range between them. For example start_date = 06-2020, finish_date = 08-2020 the result is [06-2020, 07-2020, 08-2020]. Then I need to explode it by column.
I tried:
df['range'] = df.apply(lambda x: pd.date_range(x['start_date'], x['finish_date'], freq="M"), axis=1)
df = df.explode('range')
but as a result I had many NaT's in the column.
Any solutions will be great.
One alternative is the following. Assume you have the following dataframe, df:
start_date finish_date \
0 2018-06-23 08:28:50.681065+00 2018-06-23 08:28:52.439542+00
1 2019-03-18 14:23:17.328374+00 2019-03-18 14:54:50.979612+00
2 2019-07-09 09:18:46.19862+00 2019-07-11 08:03:09.222385+00
3 2018-07-27 15:39:17.666629+00 2018-07-27 16:13:55.086871+00
4 2019-04-24 18:42:40.272854+00 2019-04-24 18:44:57.507857+00
5 2019-05-24 18:42:40.272854+00 2019-04-24 18:44:57.507857+00
progress_id
0 a387ab916f402cb3fbfffd29f68fd0ce
1 3b9dce04f32da32763124602557f92a3
2 73e17a05355852fe65b785c82c37d1ad
3 cc3eb34ae49c719648352c4175daee88
4 04ace4fe130d90c801e24eea13ee808e
5 04ace4fe130d90c801e24eea13ee808e
It is the same you shared pllus one row where the dates (year and month) differ.
Then applying this:
df['start_date'] = pd.to_datetime(df['start_date'],format='%Y-%m-%d')
df['finish_date'] = pd.to_datetime(df['finish_date'],format='%Y-%m-%d')
df['finish_M_Y'] = df['finish_date'].dt.strftime('%Y-%m')
df['Start_M_Y'] = df['start_date'].dt.strftime('%Y-%m')
def range(row):
if row['Start_M_Y'] == row['finish_M_Y']:
val = row['Start_M_Y']
elif row['Start_M_Y'] != row['finish_M_Y']:
val = pd.date_range(row['Start_M_Y'] , row['finish_M_Y'], freq='M')
else:
val = -1
return val
df['Range'] = df.apply(range, axis=1)
df.explode('Range').drop(['Start_M_Y', 'finish_M_Y'], axis=1)
gives you
start_date finish_date \
0 2018-06-23 08:28:50.681065+00:00 2018-06-23 08:28:52.439542+00:00
1 2019-03-18 14:23:17.328374+00:00 2019-03-18 14:54:50.979612+00:00
2 2019-07-09 09:18:46.198620+00:00 2019-07-11 08:03:09.222385+00:00
3 2018-07-27 15:39:17.666629+00:00 2018-07-27 16:13:55.086871+00:00
4 2019-04-24 18:42:40.272854+00:00 2019-04-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
progress_id Range
0 a387ab916f402cb3fbfffd29f68fd0ce 2018-06
1 3b9dce04f32da32763124602557f92a3 2019-03
2 73e17a05355852fe65b785c82c37d1ad 2019-07
3 cc3eb34ae49c719648352c4175daee88 2018-07
4 04ace4fe130d90c801e24eea13ee808e 2019-04
5 04ace4fe130d90c801e24eea13ee808e 2019-05-31 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-06-30 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-07-31 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-08-31 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-09-30 00:00:00

How to convert to datetime if the format of dates changes gradually through the column?

df.head():
start_date end_date
0 03.09.2013 03.09.2025
1 09.08.2019 14.05.2020
2 03.08.2015 03.08.2019
3 31.03.2014 31.03.2019
4 02.02.2015 02.02.2019
5 21.08.2019 21.08.2024
when I do df.tail():
start_date end_date
30373 2019-07-05 00:00:00 2023-07-05 00:00:00
30374 2019-06-11 00:00:00 2023-06-11 00:00:00
30375 19.01.2017 2020-02-09 00:00:00 #these 2 start dates are just same as in head
30376 11.12.2009 2011-12-11 00:00:00
30377 2019-07-30 00:00:00 2023-07-30 00:00:00
when i do
df[start_date] = pd.to_datetime(df[start_date])
some dates have month converted as days.
The format is inconsistent through the column. How to convert properly?
Use dayfirst=True parameter:
df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
Or specify format by http://strftime.org/:
df['start_date'] = pd.to_datetime(df['start_date'], format='%d.%m.%Y')
df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
df['end_date'] = pd.to_datetime(df['end_date'], dayfirst=True)
print (df)
start_date end_date
0 2013-09-03 2025-09-03
1 2019-08-09 2020-05-14
2 2015-08-03 2019-08-03
3 2014-03-31 2019-03-31
4 2015-02-02 2019-02-02
5 2019-08-21 2024-08-21

Pandas Datetime format conversion error yyyy-mm-dd to yyyy-mm or yyyy/mm as a string

I have a bunch of data in the form of yyyy-mm-dd and I need it in the form of yyyy-mm (string format) so I can plot monthly bar charts
I don't receive any errors but it outputs incorrect data for some values and correct values for other
df = dx
print(df["Collection_End_Date"])
df['Date_Modified'] = pd.to_datetime(df['Collection_End_Date']).dt.strftime('%m/%y')
print(df["Date_Modified"])
0 25/02/2019
1 06/01/2019
2 10/02/2019
3 17/01/2019
4 18/03/2019
...
1149 27/01/2019
1150 04/03/2019
1151 10/02/2019
1152 10/03/2019
1153 24/02/2019
Name: Collection_End_Date, Length: 1154, dtype: object
0 02/19
1 06/19
2 10/19
3 01/19
4 03/19
...
1149 01/19
1150 04/19
1151 10/19
1152 10/19
1153 02/19
Name: Date_Modified, Length: 1154, dtype: object
The data in the csv file is yyyy-mm-dd but it outputs in the form of dd/mm/yyyy. After modifying the data it outputs data sometimes as mm/yyyy or as dd/yyyy. I need the data in a string format ideally
try using pd.to_datetime() and to_period and strftime to change the format of date
df = pd.DataFrame(
{
"Collection_End_Date": ["2019-01-07 12:00:00", "2019-01-07 12:00:00", "2019-02-08 12:00:00", "2019-01-05 12:00:00", "2019-01-05 12:00:00"]
}
)
df['Collection_End_Date'] = pd.to_datetime(df['Collection_End_Date'])
df['month_year'] = df['Collection_End_Date'].dt.to_period('M')
Collection_End_Date month_year
0 2019-01-07 12:00:00 2019-01
1 2019-01-07 12:00:00 2019-01
2 2019-02-08 12:00:00 2019-02
3 2019-01-05 12:00:00 2019-01
4 2019-01-05 12:00:00 2019-01
if you want to replace - with / in date you can do
df["Collection_End_Date"] = pd.to_datetime(df["Collection_End_Date"])
df['month_year'] = df['Collection_End_Date'].dt.to_period('M')
df['month_year'] = df['month_year'].dt.strftime('%Y/%m')
Collection_End_Date month_year
0 2019-01-07 12:00:00 2019/01
1 2019-01-07 12:00:00 2019/01
2 2019-02-08 12:00:00 2019/02
3 2019-01-05 12:00:00 2019/01
4 2019-01-05 12:00:00 2019/01

how do i access only specific entries of a dataframe having date as index

[this is tail of my DataFrame for around 1000 entries][1]
Open Close High Change mx_profitable
Date
2018-06-06 263.00 270.15 271.4 7.15 8.40
2018-06-08 268.95 273.00 273.9 4.05 4.95
2018-06-11 273.30 274.00 278.4 0.70 5.10
2018-06-12 274.00 282.85 284.4 8.85 10.40
I have to sort out the entries of only certain dates, for example, 25th of every month.
I think need DatetimeIndex.day with boolean indexing:
df[df.index.day == 25]
Sample:
rng = pd.date_range('2017-04-03', periods=1000)
df = pd.DataFrame({'a': range(1000)}, index=rng)
print (df.head())
a
2017-04-03 0
2017-04-04 1
2017-04-05 2
2017-04-06 3
2017-04-07 4
df1 = df[df.index.day == 25]
print (df1.head())
a
2017-04-25 22
2017-05-25 52
2017-06-25 83
2017-07-25 113
2017-08-25 144

Pandas - Group into 24-hour blocks, but not midnight-to-midnight

I have a time Series. I'd like to group into into blocks of 24-hour blocks, from 8am to 7:59am the next day. I know how to group by date, but I've tried and failed to handle this 8-hour offset using TimeGroupers and DateOffsets.
I think you can use Grouper with parameter base:
print df
date name
0 2015-06-13 00:21:25 1
1 2015-06-14 01:00:25 2
2 2015-06-14 02:54:48 3
3 2015-06-15 14:38:15 2
4 2015-06-15 15:29:28 1
print df.groupby(pd.Grouper(key='date', freq='24h', base=8)).sum()
name
date
2015-06-12 08:00:00 1.0
2015-06-13 08:00:00 5.0
2015-06-14 08:00:00 NaN
2015-06-15 08:00:00 3.0
alternatively to #jezrael's method you can use your custom grouper function:
start_ts = '2016-01-01 07:59:59'
df = pd.DataFrame({'Date': pd.date_range(start_ts, freq='10min', periods=1000)})
def my_grouper(df, idx):
return df.ix[idx, 'Date'].date() if df.ix[idx, 'Date'].hour >= 8 else df.ix[idx, 'Date'].date() - pd.Timedelta('1day')
df.groupby(lambda x: my_grouper(df, x)).size()
Test:
In [468]: df.head()
Out[468]:
Date
0 2016-01-01 07:59:59
1 2016-01-01 08:09:59
2 2016-01-01 08:19:59
3 2016-01-01 08:29:59
4 2016-01-01 08:39:59
In [469]: df.tail()
Out[469]:
Date
995 2016-01-08 05:49:59
996 2016-01-08 05:59:59
997 2016-01-08 06:09:59
998 2016-01-08 06:19:59
999 2016-01-08 06:29:59
In [470]: df.groupby(lambda x: my_grouper(df, x)).size()
Out[470]:
2015-12-31 1
2016-01-01 144
2016-01-02 144
2016-01-03 144
2016-01-04 144
2016-01-05 144
2016-01-06 144
2016-01-07 135
dtype: int64