Understanding resampling of datetime in pandas

I have a question regarding resampling of DataFrames.
import pandas as pd
df = pd.DataFrame([['2005-01-20', 10], ['2005-01-21', 20],
                   ['2005-01-27', 40], ['2005-01-28', 50]],
                  columns=['date', 'num'])
# Convert the column to datetime
df['date'] = pd.to_datetime(df['date'])
# Resample and aggregate results by week
df = df.resample('W', on='date')['num'].sum().reset_index()
print(df.head())
# OUTPUT:
# date num
# 0 2005-01-23 30
# 1 2005-01-30 90
Everything works as expected, but I would like to better understand what exactly resample(), ['num'] and sum() do here.
QUESTION #1
Why does the following happen?
The result of df.resample('W', on='date') is DatetimeIndexResampler.
The result of df.resample('W', on='date')['num'] is pandas.core.groupby.SeriesGroupBy.
The result of df.resample('W', on='date')['num'].sum() is
date
2005-01-23 30
2005-01-30 90
Freq: W-SUN, Name: num, dtype: int64
QUESTION #2
Is there a way to produce the same results without resampling? For example, using groupby.

Answer1
As the docs say, .resample returns a Resampler object. You get a DatetimeIndexResampler here because the date column holds datetime values.
Selecting ['num'] on that Resampler object picks a single column, so you get a pandas.core.groupby.SeriesGroupBy rather than a DataFrame-level object.
By the way,
df.groupby([pd.Grouper(key='date', freq='W-SUN')])['num']
would return a pandas.core.groupby.SeriesGroupBy as well.
Calling .sum() then adds up the num values within each weekly group. The result is a Series because you are aggregating a single column (a pandas.core.series.Series) rather than the whole DataFrame.
Answer2
You can achieve the same result with groupby and pd.Grouper, as follows:
df.groupby([pd.Grouper(key='date', freq='W-SUN')])['num'].sum()
Output:
date
2005-01-23 30
2005-01-30 90
Name: num, dtype: int64
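To check that the two approaches agree, here is a minimal sketch that rebuilds the question's sample frame and compares both results. The variable names by_resample and by_grouper are only illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    [["2005-01-20", 10], ["2005-01-21", 20], ["2005-01-27", 40], ["2005-01-28", 50]],
    columns=["date", "num"],
)
df["date"] = pd.to_datetime(df["date"])

# Weekly sums via resample ('W' means weeks ending on Sunday, i.e. 'W-SUN')
by_resample = df.resample("W", on="date")["num"].sum()

# The same weekly sums via groupby with a Grouper on the date column
by_grouper = df.groupby(pd.Grouper(key="date", freq="W-SUN"))["num"].sum()

print(by_resample.tolist(), by_grouper.tolist())
print(by_resample.index.equals(by_grouper.index))
```

Both Series are indexed by the Sunday that closes each week, which is why the first label is 2005-01-23 even though the data starts on 2005-01-20.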

Related

Pandas reindex Dates To Subset of Dates from List

I am sorry, but there is online documentation and examples and I'm still not understanding. I have a pandas df with an index of dates in datetime format (yyyy-mm-dd) and I'm trying to resample or reindex this dataframe based on a subset of dates in the same format (yyyy-mm-dd) that are in a list. I have converted the df.index values to datetime using:
dfmla.index = pd.to_datetime(dfmla.index)
I've tried various things and I keep getting NaN's after applying the reindex. I know this must be a datatypes problem and my df is in the form of:
df.dtypes
Out[30]:
month int64
mean_mon_flow float64
std_mon_flow float64
monthly_flow_ln float64
std_anomaly float64
dtype: object
My data looks like this:
df.head(5)
Out[31]:
month mean_mon_flow std_mon_flow monthly_flow_ln std_anomaly
date
1949-10-01 10 8.565828 0.216126 8.848631 1.308506
1949-11-01 11 8.598055 0.260254 8.368006 -0.883938
1949-12-01 12 8.612080 0.301156 8.384662 -0.755149
1950-08-01 8 8.614236 0.310865 8.173776 -1.416887
1950-09-01 9 8.663943 0.351730 8.437089 -0.644967
My month_list (list datatype) looks like this:
month_list[0:2]
Out[37]: ['1950-08-01', '1950-09-01']
I need my condensed, new reindexed df to look like this:
month mean_mon_flow std_mon_flow monthly_flow_ln std_anomaly
date
1950-08-01 8 8.614236 0.310865 8.173776 -1.416887
1950-09-01 9 8.663943 0.351730 8.437089 -0.644967
thank you for your suggestions,
If you're certain that every date in month_list is in the index, you can use df.loc[month_list]; otherwise, use reindex:
df.reindex(pd.to_datetime(month_list))
Output:
month mean_mon_flow std_mon_flow monthly_flow_ln std_anomaly
date
1950-08-01 8 8.614236 0.310865 8.173776 -1.416887
1950-09-01 9 8.663943 0.351730 8.437089 -0.644967
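As a rough illustration of the reindex approach, the sketch below uses a cut-down frame with only the month column (the other columns are omitted for brevity). The extra date 1951-01-01 is made up to show what happens when a requested date is not in the index.

```python
import pandas as pd

# Cut-down version of the question's frame: only the 'month' column is kept
df = pd.DataFrame(
    {"month": [10, 11, 8, 9]},
    index=pd.to_datetime(["1949-10-01", "1949-11-01", "1950-08-01", "1950-09-01"]),
)
df.index.name = "date"

month_list = ["1950-08-01", "1950-09-01"]

# Convert the strings to datetimes so they match the DatetimeIndex labels
subset = df.reindex(pd.to_datetime(month_list))

# A date that is absent from the index produces a NaN row instead of an error
# ('1951-01-01' is a made-up date for illustration)
with_missing = df.reindex(pd.to_datetime(month_list + ["1951-01-01"]))

print(subset)
print(with_missing)
```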

Daily to Weekly Pandas conversion

I am trying to convert 15 years' worth of daily data into weekly data by taking the mean, diff and count of certain features. I tried using .resample, but I was not sure whether that is the most efficient way.
My sample data:
Date,Product,New Quantity,Price,Refund Flag
8/16/1994,abc,10,0.5,
8/17/1994,abc,11,0.9,1
8/18/1994,abc,15,0.6,
8/19/1994,abc,19,0.4,
8/22/1994,abc,22,0.2,1
8/23/1994,abc,19,0.1,
8/16/1994,xyz,16,0.5,1
8/17/1994,xyz,10,0.9,1
8/18/1994,xyz,12,0.6,1
8/19/1994,xyz,19,0.4,
8/22/1994,xyz,26,0.2,1
8/23/1994,xyz,30,0.1,
8/16/1994,pqr,0,0,
8/17/1994,pqr,0,0,
8/18/1994,pqr,1,1,
8/19/1994,pqr,2,0.6,
8/22/1994,pqr,9,0.1,
8/23/1994,pqr,12,0.2,
This is the output I am looking for:
Date,Product,Net_Quantity_diff,Price_avg,Refund
8/16/1994,abc,9,0.6,1
8/22/1994,abc,-3,0.15,0
8/16/1994,xyz,3,0.6,3
8/22/1994,xyz,4,0.15,1
8/16/1994,pqr,2,0.4,0
8/22/1994,pqr,3,0.15,0
I think the pandas resample method is indeed ideal for this. You can pass a dictionary to the agg method, defining which aggregation function to use for each column. For example:
import numpy as np
import pandas as pd
df = pd.read_csv('sales.txt')  # your sample data
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
# Empty refund flags become False, flagged rows become True
df['Refund Flag'] = df['Refund Flag'].fillna(0).astype(bool)
def span(s):
    # Range of the values within one week
    return np.max(s) - np.min(s)
# One aggregation function per column; weeks end on Sunday
df_weekly = df.resample('W').agg({'New Quantity': span,
                                  'Price': 'mean',
                                  'Refund Flag': 'sum'})
df_weekly
New Quantity Price Refund Flag
Date
1994-08-21 19 0.533333 4
1994-08-28 21 0.150000 2
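The result above aggregates all products together. If one row per product and week is wanted, as in the question's desired output, one possible sketch is to group by Product together with a weekly pd.Grouper. The helper last_minus_first and the output column names are assumptions chosen to mirror the question, and only the abc rows are included to keep the example short. Note that each week is labelled by its closing Sunday, not by its first day.

```python
import io
import pandas as pd

# The 'abc' rows from the question's sample data
csv_text = """Date,Product,New Quantity,Price,Refund Flag
8/16/1994,abc,10,0.5,
8/17/1994,abc,11,0.9,1
8/18/1994,abc,15,0.6,
8/19/1994,abc,19,0.4,
8/22/1994,abc,22,0.2,1
8/23/1994,abc,19,0.1,
"""
df = pd.read_csv(io.StringIO(csv_text))
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df["Refund Flag"] = df["Refund Flag"].fillna(0).astype(int)


def last_minus_first(s):
    # Hypothetical helper: change in quantity from the first to the last row of the week
    return s.iloc[-1] - s.iloc[0]


weekly = (
    df.sort_values("Date")
    .groupby(["Product", pd.Grouper(key="Date", freq="W")])
    .agg(
        Net_Quantity_diff=("New Quantity", last_minus_first),
        Price_avg=("Price", "mean"),
        Refund=("Refund Flag", "sum"),
    )
    .reset_index()
)
print(weekly)
```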

Reshape Pandas dataframe (partial transpose)

I have a csv similar to the following, where the column heading specifies the time (hour number):
Day,Location,1,2,3
1/1/2021,A,0.26,0.25,0.49
1/1/2021,B,0.8,0.23,0.55
1/1/2021,C,0.32,0.11,0.58
1/2/2021,A,0.67,0.72,0.49
1/2/2021,B,0.25,0.09,0.56
1/2/2021,C,0.83,0.54,0.7
When I load it as a dataframe using
df = pd.read_csv(open('VirusLevels.csv', 'r'), index_col=[0,1], header=0)
Pandas creates a dataframe with indices Day and Location, and column names 1, 2, and 3.
I need it to be reshaped as shown below, where Day and Time are the indices, and the Location is the column heading:
I've tried a lot of things and followed a lot of rabbitholes, but haven't been successful. The most on-point example I could find suggested something like the following, but it doesn't work (says "KeyError: 'Day'").
df.melt(id_vars=['Day'], var_name= 'Time',
value_name = 'VirusLevels').sort_values(by='Location').reset_index(drop=True)
Thanks in advance for any help.
Try:
df = pd.read_csv('VirusLevels.csv', index_col=[0,1])
df.rename_axis(columns='Time').stack().unstack('Location')
# or
# df.rename_axis('Time',axis='columns').stack().unstack('Location')
Output:
Location A B C
Day Time
1/1/2021 1 0.345307 0.099403 0.474077
2 0.299947 0.853091 0.352472
3 0.400975 0.599249 0.743099
1/2/2021 1 0.660258 0.003976 0.295406
2 0.425434 0.953433 0.418783
3 0.421021 0.844761 0.369561
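For reference, here is the same stack/unstack chain applied to the CSV from the question, read from an in-memory string instead of VirusLevels.csv so the sketch is self-contained. Because read_csv takes the hour labels from the header, they come through as the strings '1', '2' and '3'.

```python
import io
import pandas as pd

# The CSV content from the question, inlined for a self-contained example
csv_text = """Day,Location,1,2,3
1/1/2021,A,0.26,0.25,0.49
1/1/2021,B,0.8,0.23,0.55
1/1/2021,C,0.32,0.11,0.58
1/2/2021,A,0.67,0.72,0.49
1/2/2021,B,0.25,0.09,0.56
1/2/2021,C,0.83,0.54,0.7
"""
df = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1])

# Name the column axis 'Time', move it into the row index,
# then pivot the 'Location' level out into columns
reshaped = df.rename_axis(columns="Time").stack().unstack("Location")
print(reshaped)
```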

Groupby two columns one of them is datetime

I have a data frame that I want to group by two columns, one of which is of datetime type. How can I do this?
import datetime as dt
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'a': np.random.randn(6),
    'b': np.random.choice([5, 7, np.nan], 6),
    # integer group ids; the key name 'h' is assumed, because 'g' is used for the dates below
    'h': [1002, 300, 1002, 300, 1002, 300],
    'c': np.random.choice(['panda', 'python', 'shark'], 6),
    # some ways to create systematic groups for indexing or groupby
    # this is similar to r's expand.grid(), see note 2 below
    'd': np.repeat(range(3), 2),
    'e': np.tile(range(2), 3),
    # a date range and set of random dates
    'f': pd.date_range('1/1/2011', periods=6, freq='D'),
    'g': np.random.choice(pd.date_range('1/1/2011', periods=365, freq='D'), 6, replace=False)
})
You can use pd.Grouper to specify groupby instructions. It can be used with pd.DatetimeIndex index to group data with specified frequency using the freq parameter.
Assuming that you have this dataframe:
df = pd.DataFrame(dict(
a=dict(date=pd.Timestamp('2020-05-01'), category='a', value=1),
b=dict(date=pd.Timestamp('2020-06-01'), category='a', value=2),
c=dict(date=pd.Timestamp('2020-06-01'), category='b', value=6),
d=dict(date=pd.Timestamp('2020-07-01'), category='a', value=1),
e=dict(date=pd.Timestamp('2020-07-27'), category='a', value=3),
)).T
You can set the index to the date column, which converts it to a pd.DatetimeIndex. Then you can use pd.Grouper along with other columns. The following example uses the category column.
The freq='M' parameter groups the index by month-end frequency (newer pandas versions spell this alias 'ME'). A number of other time series offset aliases can be used in pd.Grouper.
df.set_index('date').groupby([pd.Grouper(freq='M'), 'category'])['value'].sum()
Result:
date category
2020-05-31 a 1
2020-06-30 a 2
b 6
2020-07-31 a 4
Name: value, dtype: int64
Another example with your mcve:
df.set_index('g').groupby([pd.Grouper(freq='M'), 'c']).d.sum()
Result:
g c
2011-01-31 panda 0
2011-04-30 shark 2
2011-06-30 panda 2
2011-07-31 panda 0
2011-09-30 panda 1
2011-12-31 python 1
Name: d, dtype: int32
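As an alternative sketch that does not depend on the month-end alias, the month can be derived with .dt.to_period('M') and used as a grouping key next to the category column. This is a different technique from pd.Grouper and labels each group by month (2020-05) rather than by month-end date. The frame below rebuilds the answer's example data with explicit column dtypes.

```python
import pandas as pd

# Same data as the answer's example, built column-wise so dtypes are explicit
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-05-01", "2020-06-01", "2020-06-01", "2020-07-01", "2020-07-27"]
    ),
    "category": ["a", "a", "b", "a", "a"],
    "value": [1, 2, 6, 1, 3],
})

# Group by calendar month (as a Period) and by category, then sum the values
monthly = df.groupby([df["date"].dt.to_period("M"), "category"])["value"].sum()
print(monthly)
```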

How do I get pandas update function to correctly handle numpy.datetime64?

I have a dataframe with a column that may contain None and another dataframe with the same index that has datetime values populated. I am trying to update the first from the second using pandas.update.
import numpy as np
import pandas as pd
df = pd.DataFrame([{'id': 0, 'as_of_date': np.datetime64('2017-05-08')}])
print(df.as_of_date)
df2 = pd.DataFrame([{'id': 0, 'as_of_date': None}])
print(df2.as_of_date)
df2.update(df)
print(df2.as_of_date)
print(df2.apply(lambda x: x['as_of_date'] - np.timedelta64(1, 'D'), axis=1))
This results in
0 2017-05-08
Name: as_of_date, dtype: datetime64[ns]
0 None
Name: as_of_date, dtype: object
0 1494201600000000000
Name: as_of_date, dtype: object
0 -66582 days +10:33:31.122941
dtype: timedelta64[ns]
So basically update converts the datetime to an integer (nanoseconds since the epoch) but keeps the dtype as object. Then if I try to do date math on it, I get wacky results because numpy doesn't know how to treat it.
I was hoping df2 would look like df after updating. How can I fix this?
Try this:
In [391]: df2 = df2.combine_first(df)
In [392]: df2
Out[392]:
as_of_date id
0 2017-05-08 0
In [396]: df2.dtypes
Out[396]:
as_of_date datetime64[ns]
id int64
dtype: object
A two-step approach:
First, fill the None values in df2 using the dates from df:
df2 = df2.combine_first(df)
Then, update all elements in df2 using the elements from df:
df2.update(df)
Without the second step, df2 only takes values from df where its own values are None.
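Putting the combine_first suggestion together with the date arithmetic from the question gives the sketch below. The explicit pd.to_datetime call is an added safeguard, not part of the answers above: it is assumed here so that the column is a datetime dtype regardless of how a given pandas version infers the combined column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([{"id": 0, "as_of_date": np.datetime64("2017-05-08")}])
df2 = pd.DataFrame([{"id": 0, "as_of_date": None}])

# Fill the missing values in df2 with the matching values from df
filled = df2.combine_first(df)

# Safeguard (assumption): make sure the column really is a datetime dtype
filled["as_of_date"] = pd.to_datetime(filled["as_of_date"])

# Date math now behaves as expected
previous_day = filled["as_of_date"] - pd.Timedelta(days=1)
print(previous_day)
```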