pandas group by week

I have the following test dataframe:
                  date user answer
0  2018-08-19 19:08:19  pga    yes
1  2018-08-19 19:09:27  pga     no
2  2018-08-19 19:10:45  lry     no
3  2018-09-07 19:12:31  lry    yes
4  2018-09-19 19:13:07  pga    yes
5  2018-10-22 19:13:20  lry     no
I am using the following code to group by week:
test.groupby(pd.Grouper(freq='W'))
I'm getting an error that Grouper is only valid with a DatetimeIndex, but I'm not sure how to restructure the DataFrame in order to group by week.

You probably have the date column stored as strings.
In order to use it in a Grouper with a frequency, start by converting this column to datetime:
df['date'] = pd.to_datetime(df['date'])
Then, since date is an "ordinary" data column (not the index), pass the key='date' parameter together with a frequency.
To sum up, here is a working example:
import pandas as pd
d = [['2018-08-19 19:08:19', 'pga', 'yes'],
['2018-08-19 19:09:27', 'pga', 'no'],
['2018-08-19 19:10:45', 'lry', 'no'],
['2018-09-07 19:12:31', 'lry', 'yes'],
['2018-09-19 19:13:07', 'pga', 'yes'],
['2018-10-22 19:13:20', 'lry', 'no']]
df = pd.DataFrame(data=d, columns=['date', 'user', 'answer'])
df['date'] = pd.to_datetime(df['date'])
gr = df.groupby(pd.Grouper(key='date',freq='W'))
for name, group in gr:
    print(' ', name)
    if len(group) > 0:
        print(group)
Note that the group key (name) is the ending date of the week, so the dates of the group members are earlier than or equal to the printed key.
You can change this by passing label='left' to the Grouper, as sketched below.
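For instance, a minimal sketch reusing the df built above, where label='left' keys each group by the left edge of its weekly bin instead of the week-ending date:
gr_left = df.groupby(pd.Grouper(key='date', freq='W', label='left'))
for name, group in gr_left:
    # name is now the left edge of the weekly bin
    print(' ', name)
    if len(group) > 0:
        print(group)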

Related

how to group by month and another column pandas data frame

I have a data frame that looks like the one below:
import pandas as pd
df = pd.DataFrame({'Date':['2019-08-06','2019-08-08','2019-08-01','2019-10-12'], 'Name':['A','A','B','C'], 'grade':[100,90,69,80]})
I want to group the data by month and year of the Date column and also by Name, then sum up the other columns.
So the desired output will be something similar to this:
df = pd.DataFrame({'Date':['2019-08', '2019-08', '2019-10'], 'Name':['A','B','C'], 'grade':[190,69,80]})
I have tried Grouper:
df.groupby(pd.Grouper(key='Date', freq='M')).sum()
However, it doesn't take the Name column into play and just drops the entire column.
Try:
df['Date'] = pd.to_datetime(df.Date)
df.groupby([df.Date.dt.to_period('M'), 'Name']).sum().reset_index()
      Date Name  grade
0  2019-08    A    190
1  2019-08    B     69
2  2019-10    C     80
I assume the Date column is of dtype datetime. Then group with:
grouped = df.groupby([df.Date.dt.year, df.Date.dt.month, 'Name']).sum()
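A minimal sketch of that grouping with the sample data from the question (the derived year/month keys are renamed here so the index levels stay distinct, and only the numeric grade column is summed):
import pandas as pd

df = pd.DataFrame({'Date': ['2019-08-06', '2019-08-08', '2019-08-01', '2019-10-12'],
                   'Name': ['A', 'A', 'B', 'C'],
                   'grade': [100, 90, 69, 80]})
df['Date'] = pd.to_datetime(df['Date'])

# group by year, month and Name; sum only the grade column
grouped = df.groupby([df.Date.dt.year.rename('year'),
                      df.Date.dt.month.rename('month'),
                      'Name'])['grade'].sum()
print(grouped)
# year  month  Name
# 2019  8      A       190
#              B        69
#       10     C        80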

Groupby two columns one of them is datetime

I have a data frame that I want to group by two columns, one of them of datetime type. How can I do this?
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': np.random.randn(6),
    'b': np.random.choice([5, 7, np.nan], 6),
    'c': np.random.choice(['panda', 'python', 'shark'], 6),
    # some ways to create systematic groups for indexing or groupby
    # this is similar to r's expand.grid(), see note 2 below
    'd': np.repeat(range(3), 2),
    'e': np.tile(range(2), 3),
    # a date range and set of random dates
    'f': pd.date_range('1/1/2011', periods=6, freq='D'),
    'g': np.random.choice(pd.date_range('1/1/2011', periods=365,
                                        freq='D'), 6, replace=False)
})
You can use pd.Grouper to specify groupby instructions. It can be used with a pd.DatetimeIndex to group data at a specified frequency via the freq parameter.
Assuming that you have this dataframe:
df = pd.DataFrame(dict(
    a=dict(date=pd.Timestamp('2020-05-01'), category='a', value=1),
    b=dict(date=pd.Timestamp('2020-06-01'), category='a', value=2),
    c=dict(date=pd.Timestamp('2020-06-01'), category='b', value=6),
    d=dict(date=pd.Timestamp('2020-07-01'), category='a', value=1),
    e=dict(date=pd.Timestamp('2020-07-27'), category='a', value=3),
)).T
You can set the index to the date column, and it will be converted to a pd.DatetimeIndex. Then you can use pd.Grouper along with other columns; for the following example I use the category column.
The freq='M' parameter groups the index by month. There are a number of offset aliases that can be used as freq in pd.Grouper.
df.set_index('date').groupby([pd.Grouper(freq='M'), 'category'])['value'].sum()
Result:
date category
2020-05-31 a 1
2020-06-30 a 2
b 6
2020-07-31 a 4
Name: value, dtype: int64
Another example with your mcve:
df.set_index('g').groupby([pd.Grouper(freq='M'), 'c']).d.sum()
Result:
g c
2011-01-31 panda 0
2011-04-30 shark 2
2011-06-30 panda 2
2011-07-31 panda 0
2011-09-30 panda 1
2011-12-31 python 1
Name: d, dtype: int32
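As a side note on those offset aliases: the same pattern works with any frequency string. A small sketch, assuming df is the date/category/value frame built above, grouping by quarter instead of month:
# same grouping as before, but with quarterly bins ('Q') instead of monthly ('M')
df.set_index('date').groupby([pd.Grouper(freq='Q'), 'category'])['value'].sum()
# roughly:
# 2020-06-30  a  3   (the May and June values for a)
#             b  6
# 2020-09-30  a  4   (the two July values for a)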

Finding the longest sequence of dates in a dataframe

I'd like to know how to find the longest unbroken sequence of dates (formatted as 2016-11-27) in a publish_date column (dates are not the index, though I suppose they could be).
There are a number of stack overflow questions which are similar, but AFAICT all proposed answers return the size of the longest sequence, which is not what I'm after.
I want to know e.g. that the stretch from 2017-01-01 to 2017-06-01 had no missing dates and was the longest such streak.
Here is an example of how you can do this:
import pandas as pd
import datetime
# initialize data
data = {'a': [1, 2, 3, 4, 5, 6, 7],
        'date': ['2017-01-01', '2017-01-03', '2017-01-05', '2017-01-06',
                 '2017-01-07', '2017-01-09', '2017-01-31']}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])
# mark rows that start a new run with 1; rows that directly continue
# the previous date (previous day + 1) get 0
df['mask'] = 1
df.loc[df['date'] - datetime.timedelta(days=1) == df['date'].shift(), 'mask'] = 0
# the cumulative sum turns the mask into a run id - each run of
# consecutive dates gets its own number
df['mask'] = df['mask'].cumsum()
# find largest sequence number and get this sequence
res = df.loc[df['mask'] == df['mask'].value_counts().idxmax(), 'date']
# extract min and max dates if you need
min_date = res.min()
max_date = res.max()
# print result
print('min_date: {}'.format(min_date))
print('max_date: {}'.format(max_date))
print('result:')
print(res)
The result will be:
min_date: 2017-01-05 00:00:00
max_date: 2017-01-07 00:00:00
result:
2 2017-01-05
3 2017-01-06
4 2017-01-07
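If you also need the length of that streak, it follows directly from the two boundary dates; a small addition to the example above:
# length of the longest run of consecutive dates, in days
streak_days = (max_date - min_date).days + 1
print('streak length: {} days'.format(streak_days))   # 3 for this data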

Create datetime from columns in a DataFrame

I got a DataFrame with these columns:
year month day gender births
I'd like to create a new column of type "Date" based on the year, month and day columns, formatted as "yyyy-mm-dd".
I'm just beginning in Python and I can't figure out how to proceed...
Assuming you are using pandas to create your dataframe, you can try:
>>> import pandas as pd
>>> df = pd.DataFrame({'year':[2015,2016],'month':[2,3],'day':[4,5],'gender':['m','f'],'births':[0,2]})
>>> df['dates'] = pd.to_datetime(df.iloc[:,0:3])
>>> df
   year  month  day gender  births      dates
0  2015      2    4      m       0 2015-02-04
1  2016      3    5      f       2 2016-03-05
Taken from the example here and the slicing (iloc use) "Selection" section of "10 minutes to pandas" here.
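If you'd rather not rely on column positions, pd.to_datetime also accepts the component columns selected by name; a minimal sketch with the same df:
# equivalent to the iloc slice above, but selecting year/month/day by name,
# so it keeps working if the column order changes
df['dates'] = pd.to_datetime(df[['year', 'month', 'day']])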
You can use .assign.
For example:
df2 = df.assign(ColumnDate = df.Column1.astype(str) + '-' + df.Column2.astype(str) + '-' + df.Column3.astype(str))
It is simple, and it is much faster than a row-wise apply with a lambda if you have tonnes of data.
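Applied to the year/month/day columns from the question, the same .assign idea can feed pd.to_datetime directly, so the new column is a real datetime rather than a string; a sketch (the concatenation builds 'yyyy-m-d' strings, which to_datetime parses):
# build the date string per row from the question's columns, then parse it
df2 = df.assign(
    Date=lambda x: pd.to_datetime(x.year.astype(str) + '-'
                                  + x.month.astype(str) + '-'
                                  + x.day.astype(str))
)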

Understanding resampling of datetime in pandas

I have a question regarding resampling of DataFrames.
import pandas as pd
df = pd.DataFrame([['2005-01-20', 10], ['2005-01-21', 20],
['2005-01-27', 40], ['2005-01-28', 50]],
columns=['date', 'num'])
# Convert the column to datetime
df['date'] = pd.to_datetime(df['date'])
# Resample and aggregate results by week
df = df.resample('W', on='date')['num'].sum().reset_index()
print(df.head())
# OUTPUT:
# date num
# 0 2005-01-23 30
# 1 2005-01-30 90
Everything works as expected, but I would like to better understand what exactly resample(), ['num'] and sum() do here.
QUESTION #1
Why does the following happen:
The result of df.resample('W', on='date') is DatetimeIndexResampler.
The result of df.resample('W', on='date')['num'] is pandas.core.groupby.SeriesGroupBy.
The result of df.resample('W', on='date')['num'].sum() is
date
2005-01-23 30
2005-01-30 90
Freq: W-SUN, Name: num, dtype: int64
QUESTION #2
Is there a way to produce the same results without resampling? For example, using groupby.
Answer1
As the docs say, .resample returns a Resampler object. Hence you get a DatetimeIndexResampler, because date is a datetime column.
Now, you get pandas.core.groupby.SeriesGroupBy because you are selecting a single Series from the dataframe based off the Resampler object.
Oh by the way,
df.groupby([pd.Grouper(key='date', freq='W-SUN')])['num']
would return a pandas.core.groupby.SeriesGroupBy as well.
Now when you call .sum(), you get the sum over the requested axis. The result is a plain Series because you are summing a single column (a pandas.core.series.Series) per group.
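A quick way to see this chain of objects for yourself (the exact class names can vary between pandas versions):
r = df.resample('W', on='date')
print(type(r))               # a Resampler (DatetimeIndexResampler)
print(type(r['num']))        # a SeriesGroupBy in older pandas, a column-restricted Resampler in newer ones
print(type(r['num'].sum()))  # a plain pandas Series indexed by week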
Answer2
You can achieve the same result using groupby with the help of Grouper, as follows:
df.groupby([pd.Grouper(key='date', freq='W-SUN')])['num'].sum()
Output:
date
2005-01-23 30
2005-01-30 90
Name: num, dtype: int64
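As a sanity check that the two routes agree, a small sketch rebuilding the original frame and comparing both results:
import pandas as pd

# rebuild the original frame from the question
df = pd.DataFrame([['2005-01-20', 10], ['2005-01-21', 20],
                   ['2005-01-27', 40], ['2005-01-28', 50]],
                  columns=['date', 'num'])
df['date'] = pd.to_datetime(df['date'])

via_resample = df.resample('W', on='date')['num'].sum()
via_grouper = df.groupby(pd.Grouper(key='date', freq='W-SUN'))['num'].sum()
print(via_resample.equals(via_grouper))   # True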