pandas: resampling a dataframe and keeping the datetime index as a column

I'm trying to resample daily data to weekly data using pandas.
I'm using the following:
weekly_start_date = pd.Timestamp('01/05/2011')
weekly_end_date = pd.Timestamp('05/28/2013')
daily_data = daily_data[(daily_data["date"] >= weekly_start_date) & (daily_data["date"] <= weekly_end_date)]
daily_data = daily_data.set_index('date',drop=False)
weekly_data = daily_data.resample('7D',how=np.sum,closed='left',label='left')
The problem is weekly_data doesn't have the date column anymore.
What did I miss?
Thanks,

If I understand your question, it looks like you're doing the resampling correctly (pandas docs on resampling here: http://pandas.pydata.org/pandas-docs/stable/timeseries.html).
weekly_data = daily_data.resample('7D',how=np.sum,closed='left',label='left')
If the only issue is that you'd like the DatetimeIndex replicated in a column, you can just do this.
weekly_data['date'] = weekly_data.index.values
Apologies if I misunderstood the question. :)
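One caveat: in newer pandas versions the how= argument of resample was removed, so the same aggregation is written by chaining a method instead. A minimal sketch, assuming daily_data is indexed by date as above:
weekly_data = daily_data.resample('7D', closed='left', label='left').sum()
weekly_data['date'] = weekly_data.index  # copy the DatetimeIndex back into a column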

You can only resample by numeric columns:
In [11]: df = pd.DataFrame([[pd.Timestamp('1/1/2012'), 1, 'a', [1]], [pd.Timestamp('1/2/2012'), 2, 'b', [2]]], columns=['date', 'no', 'letter', 'li'])
In [12]: df1 = df.set_index('date', drop=False)
In [13]: df1
Out[13]:
                           date  no letter   li
date
2012-01-01 2012-01-01 00:00:00   1      a  [1]
2012-01-02 2012-01-02 00:00:00   2      b  [2]
In [15]: df1.resample('M', how=np.sum)
Out[15]:
            no
date
2012-01-31   3
We can see that it uses the dtype to determine whether it's numeric:
In [16]: df1.no = df1.no.astype(object)
In [17]: df1.resample('M', how=sum)
Out[17]:
            date  no  letter  li
date
2012-01-31     0   0       0   0
An awful hack for actual summing:
In [21]: rng = pd.date_range(weekly_start_date, weekly_end_date, freq='M')
In [22]: g = df1.groupby(rng.asof)
In [23]: g.apply(lambda t: t.apply(lambda x: x.sum(1))).unstack()
Out[23]:
                           date  no letter      li
2011-12-31  2650838400000000000   3     ab  [1, 2]
The date is the sum of the epoch nanoseconds...
(Hopefully I'm doing something silly, and there's an easier way!)
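In newer pandas (where how= is gone anyway), an easier route is to select the numeric columns before resampling, so non-numeric columns never get in the way. A sketch, assuming df1 from above:
df1[['no']].resample('M').sum()  # aggregate just the numeric column
# or pick an aggregation per column:
df1.resample('M').agg({'no': 'sum'})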

Related

Parsing date in pandas.read_csv

I am trying to read a CSV file which has in its first column date values specified in this format:
"Dec 30, 2021","1.1","1.2","1.3","1"
While I can define the types for the remaining columns using the dtype= argument, I do not know how to handle the date.
I have tried the obvious np.datetime64 without success.
Is there any way to specify a format to parse this date directly using read_csv method?
You may use parse_dates:
df = pd.read_csv('data.csv', parse_dates=['date'])
But in my experience it is a frequent source of errors; I think it is better to specify the date format and convert the date column manually. For example, in your case:
df = pd.read_csv('data.csv')
df['date'] = pd.to_datetime(df['date'], format='%b %d, %Y')
Just specify a list of columns that should be converted to dates in the parse_dates= argument of pd.read_csv:
>>> df = pd.read_csv('file.csv', parse_dates=['date'])
>>> df
        date    a    b    c  d
0 2021-12-30  1.1  1.2  1.3  1
>>> df.dtypes
date    datetime64[ns]
a              float64
b              float64
c              float64
d                int64
Update
What if I want to further specify the format for a, b, c and d? I used a simplified example; in my file, numbers are formatted like this: "2,345.55", and those are read as object by read_csv, not as float64 or int64 as in your example.
from datetime import datetime

converters = {
    'Date': lambda x: datetime.strptime(x, "%b %d, %Y"),
    'Number': lambda x: float(x.replace(',', ''))
}
df = pd.read_csv('data.csv', converters=converters)
Output:
>>> df
        Date   Number
0 2021-12-30  2345.55
>>> df.dtypes
Date      datetime64[ns]
Number           float64
dtype: object
# data.csv
Date,Number
"Dec 30, 2021","2,345.55"
Old answer
If you have a particular format, you can pass a custom function to the date_parser parameter:
from datetime import datetime
custom_date_parser = lambda x: datetime.strptime(x, "%b %d, %Y")
df = pd.read_csv('data.csv', parse_dates=['Date'], date_parser=custom_date_parser)
print(df)
# Output
        Date    A    B    C  D
0 2021-12-30  1.1  1.2  1.3  1
Or let pandas try to determine the format, as suggested by @richardec.
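Worth noting: in pandas 2.0+ the date_parser parameter is deprecated in favor of date_format, so (assuming a recent version) the custom parser above can be replaced with:
df = pd.read_csv('data.csv', parse_dates=['Date'], date_format='%b %d, %Y')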

Groupby two columns one of them is datetime

I have a data frame that I want to group by two columns, one of which is of datetime type. How can I do this?
import numpy as np
import pandas as pd
import datetime as dt

df = pd.DataFrame({
    'a': np.random.randn(6),
    'b': np.random.choice([5, 7, np.nan], 6),
    'c': np.random.choice(['panda', 'python', 'shark'], 6),
    # some ways to create systematic groups for indexing or groupby
    # this is similar to r's expand.grid(), see note 2 below
    'd': np.repeat(range(3), 2),
    'e': np.tile(range(2), 3),
    # a date range and set of random dates
    'f': pd.date_range('1/1/2011', periods=6, freq='D'),
    'g': np.random.choice(pd.date_range('1/1/2011', periods=365,
                                        freq='D'), 6, replace=False)
})
You can use pd.Grouper to specify groupby instructions. It can be used with a pd.DatetimeIndex to group data at a specified frequency via the freq parameter.
Assuming that you have this dataframe:
df = pd.DataFrame(dict(
a=dict(date=pd.Timestamp('2020-05-01'), category='a', value=1),
b=dict(date=pd.Timestamp('2020-06-01'), category='a', value=2),
c=dict(date=pd.Timestamp('2020-06-01'), category='b', value=6),
d=dict(date=pd.Timestamp('2020-07-01'), category='a', value=1),
e=dict(date=pd.Timestamp('2020-07-27'), category='a', value=3),
)).T
You can set the index to the date column, and it will be converted to a pd.DatetimeIndex. Then you can use pd.Grouper along with other columns; for the following example I use the category column.
The freq='M' parameter groups the index at month-end frequency. There are a number of offset string aliases that can be used with pd.Grouper.
df.set_index('date').groupby([pd.Grouper(freq='M'), 'category'])['value'].sum()
Result:
date        category
2020-05-31  a           1
2020-06-30  a           2
            b           6
2020-07-31  a           4
Name: value, dtype: int64
Another example with your MCVE:
df.set_index('g').groupby([pd.Grouper(freq='M'), 'c']).d.sum()
Result:
g           c
2011-01-31  panda     0
2011-04-30  shark     2
2011-06-30  panda     2
2011-07-31  panda     0
2011-09-30  panda     1
2011-12-31  python    1
Name: d, dtype: int32
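If you'd rather not set the index first, pd.Grouper also accepts a key argument naming the datetime column directly. A sketch using the first example's dataframe (note the .T construction leaves date as object dtype, so it needs converting first):
df['date'] = pd.to_datetime(df['date'])  # ensure datetime64 dtype
df.groupby([pd.Grouper(key='date', freq='M'), 'category'])['value'].sum()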

How to use pandas rename() on multi-index columns?

How can I simply rename a MultiIndex column of a pandas DataFrame, using the rename() function?
Let's look at an example and create such a DataFrame:
import pandas
df = pandas.DataFrame({'A': [1, 1, 1, 2, 2], 'B': range(5), 'C': range(5)})
df = df.groupby("A").agg({"B":["min","max"],"C":"mean"})
print(df)
    B         C
  min max  mean
A
1   0   2   1.0
2   3   4   3.5
I am able to select a given MultiIndex column by using a tuple for its name:
print(df[("B","min")])
A
1    0
2    3
Name: (B, min), dtype: int64
However, when using the same tuple naming with the rename() function, it does not seem to be accepted:
df.rename(columns={("B", "min"): "renamed"}, inplace=True)
print(df)
    B         C
  min max  mean
A
1   0   2   1.0
2   3   4   3.5
Any idea how rename() should be called to deal with Multi-Index columns?
PS: I am aware of the other options to flatten the column names beforehand, but that prevents one-liners, so I am looking for a cleaner solution (see my previous question).
This doesn't answer the question as worded, but it will work for your given example (assuming you want them all renamed with no MultiIndex):
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2], 'B': range(5), 'C': range(5)})
df = df.groupby("A").agg(
renamed=('B', 'min'),
B_max=('B', 'max'),
C_mean=('C', 'mean'),
)
print(df)
   renamed  B_max  C_mean
A
1        0      2     1.0
2        3      4     3.5
For more info, you can see the pandas docs and some related other questions.
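That said, rename() can target a single level of the MultiIndex through its level argument, which handles this example as long as the label is unique within that level. A sketch, assuming the grouped df from the question:
df = df.rename(columns={'min': 'renamed'}, level=1)
This turns ('B', 'min') into ('B', 'renamed') while keeping the MultiIndex intact.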

Pandas Index Datetime Switching Months and Days

I have a pandas df.index in the format below.
It's a string of day/month/year, so the first item is 05Sep2017, etc.:
05/09/17 #05Sep2017
07/09/17 #07Sep2017
...
18/10/17 #18Oct2017
Applying
df.index = pd.to_datetime(df.index)
to the above, transforms it to:
2017-05-09 #09May2017
2017-07-09 #09Jul2017
...
2017-10-18 #18Oct2017
What seems to be happening is that the first entries have the day and month switched. The last entry, where the day is greater than 12, is converted correctly.
I tried to switch month and day by converting the index to a column and applying:
df['date'] = df.index
df['date'].apply(lambda x: dt.datetime.strftime(x, '%Y-%d-%m'))
as well as:
df['date'].apply(lambda x: dt.datetime.strftime(x, '%Y-%m-%d'))
but to no avail.
How can I convert the index to datetime so that all entries are interpreted as day/month/year, please?
In pandas, the default display format of dates is YYYY-MM-DD.
df = df.set_index('date_col')
df.index = pd.to_datetime(df.index)
print (df)
            val
2017-05-09    4
2017-07-09    8
2017-10-18    2
print (df.index)
DatetimeIndex(['2017-05-09', '2017-07-09', '2017-10-18'], dtype='datetime64[ns]', freq=None)
You need strftime, but you lose the datetimes, because you get strings:
df.index = pd.to_datetime(df.index).strftime('%Y-%d-%m')
print (df.index)
Index(['2017-09-05', '2017-09-07', '2017-18-10'], dtype='object')
df.index = pd.to_datetime(df.index).strftime('%d-%b-%Y')
print (df)
             val
09-May-2017    4
09-Jul-2017    8
18-Oct-2017    2
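The root issue, though, is that pd.to_datetime guesses month-first for ambiguous strings. Assuming the index still holds the original day-first strings, declaring the format (or passing dayfirst=True) fixes the parsing itself:
df.index = pd.to_datetime(df.index, format='%d/%m/%y')
# or: df.index = pd.to_datetime(df.index, dayfirst=True)
After that, the index holds real datetimes with day and month in the right places, and strftime is only needed for display.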

Append a tuple to a dataframe as a row

I am looking for a solution to add rows to a dataframe. Here is the data I have:
A grouped object (obtained by grouping a dataframe on month and year, i.e. in this grouped object the key is [month, year] and the value is all the rows / dates in that month and year).
I want to extract all the month, year combinations and put them in a new dataframe. Issue: when I iterate over the grouped object, the (month, year) key is a tuple, so I converted the tuple into a list and added it to a dataframe using the append command. Instead of getting added as rows:
1 2014
2 2014
3 2014
it got added in one column:
0 1
1 2014
0 2
1 2014
0 3
1 2014
...
I want to store these values in a new dataframe. Here is how I want the new dataframe to be:
month year
1 2014
2 2014
3 2014
I tried converting the tuple to a list, and then I tried various other things like pivoting. Inputs would be really helpful.
Here is the sample code :
df = df.groupby(['month','year'])
df = pd.DataFrame()
for key, value in df:
    print "type of key is:", type(key)
    print "type of list(key) is:", type(list(key))
    df = df.append(list(key))
print df
When you do the groupby the resulting MultiIndex is available as:
In [11]: df = pd.DataFrame([[1, 2014, 42], [1, 2014, 44], [2, 2014, 23]], columns=['month', 'year', 'val'])
In [12]: df
Out[12]:
month year val
0 1 2014 42
1 1 2014 44
2 2 2014 23
In [13]: g = df.groupby(['month', 'year'])
In [14]: g.grouper.result_index
Out[14]:
MultiIndex(levels=[[1, 2], [2014]],
           labels=[[0, 1], [0, 0]],
           names=['month', 'year'])
Often this will be sufficient, and you won't need a DataFrame. If you do, one way is the following:
In [21]: pd.DataFrame(index=g.grouper.result_index).reset_index()
Out[21]:
   month  year
0      1  2014
1      2  2014
I thought there was a method to get this, but can't recall it.
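(In later pandas versions there is such a method: MultiIndex.to_frame. Assuming g from above, g.grouper.result_index.to_frame(index=False) gives the same two-column dataframe.)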
If you really want the tuples you can use .values or to_series:
In [31]: g.grouper.result_index.values
Out[31]: array([(1, 2014), (2, 2014)], dtype=object)
In [32]: g.grouper.result_index.to_series()
Out[32]:
month  year
1      2014    (1, 2014)
2      2014    (2, 2014)
dtype: object
You had initially declared both the groupby and the empty dataframe as df. Here's a modified version of your code that allows you to append a tuple as a dataframe row.
g = df.groupby(['month','year'])
df = pd.DataFrame()
for (key1, key2), value in g:
    row_series = pd.Series((key1, key2), index=['month','year'])
    df = df.append(row_series, ignore_index=True)
print df
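Note that DataFrame.append was removed in pandas 2.0, so on current versions you'd collect the rows and pd.concat them, or skip the loop entirely and build the frame from the group keys. A sketch, assuming g as above:
df = pd.DataFrame(list(g.groups.keys()), columns=['month', 'year'])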
If all you want are the unique values, you could use drop_duplicates
In [29]: df[['month','year']].drop_duplicates()
Out[29]:
   month  year
0      1  2014
2      2  2014