Finding max date of the month in a list of pandas timeseries dates - pandas

I have a timeseries that does not contain every date (i.e. only trading dates). The series can be reproduced as follows:
dates=pd.Series(np.random.randint(100,size=30),index=pd.to_datetime(['2010-01-04', '2010-01-05', '2010-01-06', '2010-01-07',
'2010-01-08', '2010-01-11', '2010-01-12', '2010-01-13',
'2010-01-14', '2010-01-15', '2010-01-19', '2010-01-20',
'2010-01-21', '2010-01-22', '2010-01-25', '2010-01-26',
'2010-01-27', '2010-01-28', '2010-01-29', '2010-02-01',
'2010-02-02', '2010-02-03', '2010-02-04', '2010-02-05',
'2010-02-08', '2010-02-09', '2010-02-10', '2010-02-11',
'2010-02-12', '2010-02-16']))
I would like the last day of each month present in my list of dates, i.e. '2010-01-29' and '2010-02-16'.
I have looked at Get the last date of each month in a list of dates in Python
and more specifically...
import pandas as pd
import numpy as np
df = pd.read_csv('/path/to/file/') # Load a dataframe with your file
df.index = df['my_date_field'] # set the dataframe index with your date
dfg = df.groupby(pd.Grouper(freq='M')) # group by month / alternatively use 'MS' for Month Start (pd.TimeGrouper from the original answer was removed in pandas 1.0; pd.Grouper is its replacement)
# Finally, find the max date in each month
dfg.agg({'my_date_field': np.max})
# To specifically coerce the results of the groupby to a list:
dfg.agg({'my_date_field': np.max})['my_date_field'].tolist()
... but can't quite figure out how to adapt this to my application. Thanks in advance.
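(For reference, since pd.TimeGrouper was removed in pandas 1.0, here is a minimal sketch of the linked approach with its replacement, pd.Grouper, using inline data instead of the CSV:)

```python
import pandas as pd

# Inline data standing in for the CSV in the snippet above
df = pd.DataFrame({'my_date_field': pd.to_datetime(
    ['2010-01-04', '2010-01-29', '2010-02-01', '2010-02-16'])})
df.index = df['my_date_field']

# pd.Grouper replaces the removed pd.TimeGrouper; 'M' groups by month end
# (spelled 'ME' in pandas >= 2.2)
dfg = df.groupby(pd.Grouper(freq='M'))
last_dates = dfg.agg({'my_date_field': 'max'})['my_date_field'].tolist()
```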

You can try the following to get your desired output:
import numpy as np
import pandas as pd
dates=pd.Series(np.random.randint(100,size=30),index=pd.to_datetime(['2010-01-04', '2010-01-05', '2010-01-06', '2010-01-07',
'2010-01-08', '2010-01-11', '2010-01-12', '2010-01-13',
'2010-01-14', '2010-01-15', '2010-01-19', '2010-01-20',
'2010-01-21', '2010-01-22', '2010-01-25', '2010-01-26',
'2010-01-27', '2010-01-28', '2010-01-29', '2010-02-01',
'2010-02-02', '2010-02-03', '2010-02-04', '2010-02-05',
'2010-02-08', '2010-02-09', '2010-02-10', '2010-02-11',
'2010-02-12', '2010-02-16']))
This:
dates.groupby(dates.index.month).apply(pd.Series.tail,1).reset_index(level=0, drop=True)
Or this:
dates[dates.groupby(dates.index.month).apply(lambda s: np.max(s.index))]
Both should yield something like the following:
#2010-01-29 43
#2010-02-16 48
To convert it into a list:
dates.groupby(dates.index.month).apply(pd.Series.tail,1).reset_index(level=0, drop=True).tolist()
Or:
dates[dates.groupby(dates.index.month).apply(lambda s: np.max(s.index))].tolist()
Both yield something like:
#[43, 48]
If you're dealing with a dataset that spans beyond one year, then you will need to group by both year and month. The following should help:
import numpy as np
import pandas as pd
z = ['2010-01-04', '2010-01-05', '2010-01-06', '2010-01-07',
'2010-01-08', '2010-01-11', '2010-01-12', '2010-01-13',
'2010-01-14', '2010-01-15', '2010-01-19', '2010-01-20',
'2010-01-21', '2010-01-22', '2010-01-25', '2010-01-26',
'2010-01-27', '2010-01-28', '2010-01-29', '2010-02-01',
'2010-02-02', '2010-02-03', '2010-02-04', '2010-02-05',
'2010-02-08', '2010-02-09', '2010-02-10', '2010-02-11',
'2010-02-12', '2010-02-16', '2011-01-04', '2011-01-05',
'2011-01-06', '2011-01-07', '2011-01-08', '2011-01-11',
'2011-01-12', '2011-01-13', '2011-01-14', '2011-01-15',
'2011-01-19', '2011-01-20', '2011-01-21', '2011-01-22',
'2011-01-25', '2011-01-26', '2011-01-27', '2011-01-28',
'2011-01-29', '2011-02-01', '2011-02-02', '2011-02-03',
'2011-02-04', '2011-02-05', '2011-02-08', '2011-02-09',
'2011-02-10', '2011-02-11', '2011-02-12', '2011-02-16']
dates1 = pd.Series(np.random.randint(100,size=60),index=pd.to_datetime(z))
This:
dates1.groupby((dates1.index.year, dates1.index.month)).apply(pd.Series.tail,1).reset_index(level=(0,1), drop=True)
Or:
dates1[dates1.groupby((dates1.index.year, dates1.index.month)).apply(lambda s: np.max(s.index))]
Both yield something like:
# 2010-01-29 66
# 2010-02-16 80
# 2011-01-29 13
# 2011-02-16 10
I hope this proves useful.
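As a side note, a single monthly PeriodIndex key handles year and month at once, so the same line works whether or not the data spans multiple years (a sketch with a small made-up series):

```python
import pandas as pd

dates = pd.Series([1, 2, 3, 4],
                  index=pd.to_datetime(['2010-01-28', '2010-01-29',
                                        '2010-02-12', '2011-02-16']))
# to_period('M') keys on (year, month) in one go, so January 2010 and
# February 2011 land in separate groups automatically.
last_days = dates.groupby(dates.index.to_period('M')).apply(lambda s: s.index.max())
```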

You can group by month and take the last value of the index:
print (dates.groupby(dates.index.month).apply(lambda x: x.index[-1]))
1 2010-01-29
2 2010-02-16
dtype: datetime64[ns]
Another solution:
print (dates.groupby(dates.index.month).apply(lambda x: x.index.max()))
1 2010-01-29
2 2010-02-16
dtype: datetime64[ns]
To get a list, first convert the dates to strings with strftime:
print (dates.groupby(dates.index.month)
.apply(lambda x: x.index[-1]).dt.strftime('%Y-%m-%d').tolist())
['2010-01-29', '2010-02-16']
If you need the values on the last day of each month, use iloc:
print (dates.groupby(dates.index.month).apply(lambda x: x.iloc[-1]))
1 55
2 48
dtype: int64
print (dates.groupby(dates.index.month).apply(lambda x: x.iloc[-1]).tolist())
[55, 48]
EDIT:
To group by year and month together, convert the index to monthly periods with to_period:
dates=pd.Series(np.random.randint(100,size=30),index=pd.to_datetime(
['2010-01-04', '2010-01-05', '2010-01-06', '2010-01-07',
'2010-01-08', '2011-01-11', '2011-01-12', '2011-01-13',
'2012-01-14', '2012-01-15', '2012-01-19', '2012-01-20',
'2013-01-21', '2013-01-22', '2013-01-25', '2013-01-26',
'2013-01-27', '2013-01-28', '2013-01-29', '2013-02-01',
'2014-02-02', '2014-02-03', '2014-02-04', '2014-02-05',
'2015-02-08', '2015-02-09', '2015-02-10', '2015-02-11',
'2016-02-12', '2016-02-16']))
#print (dates)
print (dates.groupby(dates.index.to_period('m')).apply(lambda x: x.index[-1]))
2010-01 2010-01-08
2011-01 2011-01-13
2012-01 2012-01-20
2013-01 2013-01-29
2013-02 2013-02-01
2014-02 2014-02-05
2015-02 2015-02-11
2016-02 2016-02-16
Freq: M, dtype: datetime64[ns]
print (dates.groupby(dates.index.to_period('m'))
.apply(lambda x: x.index[-1]).dt.strftime('%Y-%m-%d').tolist())
['2010-01-08', '2011-01-13', '2012-01-20', '2013-01-29',
'2013-02-01', '2014-02-05', '2015-02-11', '2016-02-16']
print (dates.groupby(dates.index.to_period('m')).apply(lambda x: x.iloc[-1]))
2010-01 68
2011-01 96
2012-01 53
2013-01 4
2013-02 16
2014-02 18
2015-02 41
2016-02 90
Freq: M, dtype: int64
print (dates.groupby(dates.index.to_period('m')).apply(lambda x: x.iloc[-1]).tolist())
[68, 96, 53, 4, 16, 18, 41, 90]
EDIT1: If you need to convert the period index to end-of-month datetimes:
df = dates.groupby(dates.index.to_period('m')).apply(lambda x: x.index[-1])
df.index = df.index.to_timestamp('m')
print (df)
2010-01-31 2010-01-08
2011-01-31 2011-01-13
2012-01-31 2012-01-20
2013-01-31 2013-01-29
2013-02-28 2013-02-01
2014-02-28 2014-02-05
2015-02-28 2015-02-11
2016-02-29 2016-02-16
dtype: datetime64[ns]
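A further shortcut with the same setup: GroupBy.last() avoids the lambda entirely, both for the values and for the dates themselves (grouping the index as a series recovers the last observed date per month):

```python
import pandas as pd

dates = pd.Series([68, 96, 53],
                  index=pd.to_datetime(['2010-01-04', '2010-01-08', '2011-01-13']))
months = dates.index.to_period('M')
# Last value observed in each month
last_vals = dates.groupby(months).last()
# Last date observed in each month: group the index itself
last_days = dates.index.to_series().groupby(months).last()
```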

Related

How to change datetime to numeric discarding 0s at end [duplicate]

I have a dataframe in pandas called 'munged_data' with two columns, 'entry_date' and 'dob', which I have converted to Timestamps using pd.to_datetime. I am trying to figure out how to calculate ages of people based on the time difference between 'entry_date' and 'dob'; to do this I need to get the difference in days between the two columns (so that I can then do something like round(days/365.25)). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date - munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However i do not seem to be able to extract the days as an integer so that i can continue with my calculation.
Any help appreciated.
Using the pandas Timedelta type, available since v0.15.0, you can also do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
You need 0.11 for this (0.11rc1 is out, final prob next week)
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there is not yet full support for timedelta64[ns] scalars (similar to how Timestamps are used for datetime64[ns] now; this is coming in 0.12).
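In modern pandas the apply workaround is unnecessary: the subtraction yields a timedelta64[ns] column that supports the .dt accessor directly. A sketch of the same computation:

```python
import pandas as pd

df = pd.DataFrame({'age': [pd.Timestamp('20010101'), pd.Timestamp('20040601')]})
df['today'] = pd.Timestamp('20130419')
# The timedelta column supports .dt.days directly in modern pandas
df['years'] = (df['today'] - df['age']).dt.days / 365.25
```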
Not sure if you still need it, but in pandas 0.14 I usually use the .astype('timedelta64[X]') method:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.iloc[0] - df.iloc[1]  # .ix was removed in pandas 1.0; use .iloc
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.iloc[0] - df.iloc[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
Let's specify that you have a pandas series named time_difference which has type
numpy.timedelta64[ns]
One way of extracting just the day (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute. (Note: the older pd.tslib.Timedelta spelling was removed; use pd.Timedelta.)
To convert a number into days, construct a Timedelta and read .days:
pd.Timedelta(1985, unit='D').days
1985
(Note: 'Y' and 'M' units are no longer supported by pd.Timedelta in modern pandas, because years and months have no fixed length in days.)

Pandas DatetimeIndex to dataframe

How do you change a DatetimeIndex into a simple dataframe like this:
month
0 2013-07-31
1 2013-08-31
2 2013-09-30
3 2013-10-31
This is the DatetimeIndex:
DatetimeIndex(['2013-07-31', '2013-08-31', '2013-09-30', '2013-10-31',
'2013-11-30', '2013-12-31', '2014-01-31', '2014-02-28',
'2014-03-31', '2014-04-30', '2014-05-31', '2014-06-30'],
dtype='datetime64[ns]', freq='M')
Thank you.
The code below should work:
# Import libraries
import pandas as pd
import numpy as np
import datetime as dt
# Create dataframe
df = pd.DataFrame(pd.DatetimeIndex(['2013-07-31', '2013-08-31', '2013-09-30', '2013-10-31',
'2013-11-30', '2013-12-31', '2014-01-31', '2014-02-28',
'2014-03-31', '2014-04-30', '2014-05-31', '2014-06-30'],
dtype='datetime64[ns]', freq='M'), columns=['month'])
df.head(2)
Use the DataFrame constructor:
idx = pd.DatetimeIndex(['2013-07-31', '2013-08-31', '2013-09-30', '2013-10-31',
'2013-11-30', '2013-12-31', '2014-01-31', '2014-02-28',
'2014-03-31', '2014-04-30', '2014-05-31', '2014-06-30'],
dtype='datetime64[ns]', freq='M')
df = pd.DataFrame({'month':idx})
#alternative
#df = pd.DataFrame({'month':df1.index})
print (df)
month
0 2013-07-31
1 2013-08-31
2 2013-09-30
3 2013-10-31
4 2013-11-30
5 2013-12-31
6 2014-01-31
7 2014-02-28
8 2014-03-31
9 2014-04-30
10 2014-05-31
11 2014-06-30
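In newer pandas (>= 0.24) there is also Index.to_frame, which does this in one call; index=False gives the plain 0..n-1 RangeIndex from the question:

```python
import pandas as pd

idx = pd.DatetimeIndex(['2013-07-31', '2013-08-31', '2013-09-30', '2013-10-31'])
# to_frame turns the index into a single-column DataFrame;
# index=False resets to a default RangeIndex
df = idx.to_frame(index=False, name='month')
```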

aggregate data by quarter

I have a pivot pandas data frame (sales by region) that got created from another pandas data frame (sales by store) using the pivot_table method.
As an example:
df = pd.DataFrame(
{'store':['A','B','C','D','E']*7,
'region':['NW','NW','SW','NE','NE']*7,
'date':['2017-03-30']*5+['2017-04-05']*5+['2017-04-07']*5+['2017-04-12']*5+['2017-04-13']*5+['2017-04-17']*5+['2017-04-20']*5,
'sales':[30,1,133,9,1,30,3,135,9,11,30,1,140,15,15,25,10,137,9,3,29,10,137,9,11,30,19,145,20,10,30,8,141,25,25]
})
df['date'] = pd.to_datetime(df['date'])
df_sales = df.pivot_table(index = ['region'], columns = ['date'], aggfunc = [np.sum], margins = True)
df_sales = df_sales.iloc[:, range(0, df_sales.shape[1]-1)]  # .ix was removed in pandas 1.0; use .iloc
My goal is to do the following to the sales data frame, df_sales.
Create a new dataframe that summarizes sales by quarter. I could use the original dataframe df, or the sales_df.
Since we only have two quarters here (US fiscal calendar year), the quarterly aggregated data frame would look like:
2017Q1 2017Q2
10 27
31 37.5
133 139.17
I take the average over all days in Q1, and the same for Q2. For example, for the North East region, 'NE', Q1 is the average of only one day, 2017-03-30, i.e. 10, and Q2 is the average across 2017-04-05 to 2017-04-20, i.e.
(20+30+12+20+30+50)/6=27
Any suggestions?
ADDITIONAL NOTE: I would ideally do the quarter aggregations on the df_sales pivoted table since it's a much smaller dataframe to keep in memory. The current solution does it on the original df, but I am still seeking a way to do it in the df_sales dataframe.
UPDATE:
Setup:
df.date = pd.to_datetime(df.date)
df_sales = df.pivot_table(index='region', columns='date', values='sales', aggfunc='sum')
In [318]: df_sales
Out[318]:
date 2017-03-30 2017-04-05 2017-04-07 2017-04-12 2017-04-13 2017-04-17 2017-04-20
region
NE 10 20 30 12 20 30 50
NW 31 33 31 35 39 49 38
SW 133 135 140 137 137 145 141
Solution:
In [319]: (df_sales.groupby(pd.PeriodIndex(df_sales.columns, freq='Q'), axis=1)
...: .apply(lambda x: x.sum(axis=1)/x.shape[1])
...: )
Out[319]:
date 2017Q1 2017Q2
region
NE 10.0 27.000000
NW 31.0 37.500000
SW 133.0 139.166667
Solution based on the original DF:
In [253]: (df.groupby(['region', pd.PeriodIndex(df.date, freq='Q-DEC')])
...: .apply(lambda x: x['sales'].sum()/x['date'].nunique())
...: .to_frame('avg').unstack('date')
...: )
...:
Out[253]:
avg
date 2017Q1 2017Q2
region
NE 10.0 27.000000
NW 31.0 37.500000
SW 133.0 139.166667
NOTE: df - is the original DF (before "pivoting")
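Note that groupby(..., axis=1) is deprecated in recent pandas; transposing first gives the same per-quarter averages. A sketch with a trimmed-down version of df_sales:

```python
import pandas as pd

df_sales = pd.DataFrame(
    {pd.Timestamp('2017-03-30'): [10, 31, 133],
     pd.Timestamp('2017-04-05'): [20, 33, 135],
     pd.Timestamp('2017-04-20'): [50, 38, 141]},
    index=['NE', 'NW', 'SW'])
# Transpose so the dates become the row index, group those rows into
# quarters, take the mean, then transpose back to regions x quarters.
quarterly = (df_sales.T
             .groupby(pd.PeriodIndex(df_sales.columns, freq='Q'))
             .mean()
             .T)
```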

Selecting rows with specified days in datetimeindex dataframe - Pandas

I have a dataframe with a DatetimeIndex. I only need those rows whose index falls on days specified in a list, e.g. [1,2] for Monday and Tuesday. Is this possible in pandas in a single line of code?
IIUC then the following should work:
df[df.index.to_series().dt.dayofweek.isin([0,1])]
Example:
In [9]:
df = pd.DataFrame(index=pd.date_range(start=dt.datetime(2015,1,1), end = dt.datetime(2015,2,1)))
df[df.index.to_series().dt.dayofweek.isin([0,1])]
Out[9]:
Empty DataFrame
Columns: []
Index: [2015-01-05 00:00:00, 2015-01-06 00:00:00, 2015-01-12 00:00:00, 2015-01-13 00:00:00, 2015-01-19 00:00:00, 2015-01-20 00:00:00, 2015-01-26 00:00:00, 2015-01-27 00:00:00]
So this converts the DatetimeIndex to a Series so that we can call isin to test for membership; using .dt.dayofweek and passing 0,1 (which correspond to Monday and Tuesday), we use the boolean mask to mask the index.
Another way is to construct a boolean mask without converting to a Series:
In [12]:
df[(df.index.dayofweek == 0) | (df.index.dayofweek == 1)]
Out[12]:
Empty DataFrame
Columns: []
Index: [2015-01-05 00:00:00, 2015-01-06 00:00:00, 2015-01-12 00:00:00, 2015-01-13 00:00:00, 2015-01-19 00:00:00, 2015-01-20 00:00:00, 2015-01-26 00:00:00, 2015-01-27 00:00:00]
Or in fact this would work:
In [13]:
df[df.index.dayofweek < 2]
Out[13]:
Empty DataFrame
Columns: []
Index: [2015-01-05 00:00:00, 2015-01-06 00:00:00, 2015-01-12 00:00:00, 2015-01-13 00:00:00, 2015-01-19 00:00:00, 2015-01-20 00:00:00, 2015-01-26 00:00:00, 2015-01-27 00:00:00]
TIMINGS
In [14]:
%timeit df[df.index.dayofweek < 2]
%timeit df[np.in1d(df.index.dayofweek, [1, 2])]
1000 loops, best of 3: 464 µs per loop
1000 loops, best of 3: 521 µs per loop
So my last method is slightly faster here than the np.in1d method.
You could try this:
In [3]: import pandas as pd
In [4]: import numpy as np
In [5]: index = pd.date_range('11/23/2015', end = '11/30/2015', freq='d')
In [6]: df = pd.DataFrame(np.random.randn(len(index),2),columns=list('AB'),index=index)
In [7]: df
Out[7]:
A B
2015-11-23 -0.673626 -1.009921
2015-11-24 -1.288852 -0.338795
2015-11-25 -1.414042 -0.767050
2015-11-26 0.018223 -0.726230
2015-11-27 -1.288709 -1.144437
2015-11-28 0.121093 1.396825
2015-11-29 -0.791611 -1.014375
2015-11-30 1.223220 -1.223499
In [8]: df[np.in1d(df.index.dayofweek, [1, 2])]
Out[8]:
A B
2015-11-24 0.116678 -0.715655
2015-11-25 -1.494921 0.218176
1 is actually Tuesday here. But that should be fairly easy to account for if needed.
The previous answer was posted while writing this, as a comparison:
In [15]: %timeit df.loc[df.index.to_series().dt.dayofweek.isin([0,1]).values]
100 loops, best of 3: 2.01 ms per loop
In [16]: %timeit df[np.in1d(df.index.dayofweek, [0, 1])]
1000 loops, best of 3: 393 µs per loop
Note this comparison was done on the test DF I created and I don't know how it necessarily extends to larger dataframes, though performance should be consistent.
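In modern pandas the to_series() round-trip is no longer needed: DatetimeIndex.dayofweek returns an integer Index that supports .isin directly. A sketch on a fresh date range:

```python
import pandas as pd

df = pd.DataFrame(index=pd.date_range('2015-01-01', '2015-01-14'))
# dayofweek: Monday=0 ... Sunday=6; Index.isin builds the boolean mask
mon_tue = df[df.index.dayofweek.isin([0, 1])]
```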

pandas dataframe shift dates

I have a dataframe that is indexed by dates. I'd like to shift just the dates, one business day forward (Monday-Friday), without changing the size or anything else. Is there a simple way to do this?
You can shift with 'B' (I think this requires numpy >= 1.7):
In [11]: rng = pd.to_datetime(['21-11-2013', '22-11-2013'])
In [12]: rng.shift(1, freq='B') # 1 business day
Out[12]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-11-22 00:00:00, 2013-11-25 00:00:00]
Length: 2, Freq: None, Timezone: None
On the Series (same on a DataFrame):
In [21]: s = pd.Series([1, 2], index=rng)
In [22]: s
Out[22]:
2013-11-21 1
2013-11-22 2
dtype: int64
In [23]: s.shift(1, freq='B')
Out[23]:
2013-11-22 1
2013-11-25 2
dtype: int64
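Equivalently, you can add a BDay offset to the index yourself; this matches shift(1, freq='B') and leaves the values untouched:

```python
import pandas as pd
from pandas.tseries.offsets import BDay

s = pd.Series([1, 2], index=pd.to_datetime(['2013-11-21', '2013-11-22']))
# Thursday -> Friday, Friday -> Monday; only the index moves
s.index = s.index + BDay(1)
```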