I have a data frame that I want to group by two columns, one of which is of datetime type. How can I do this?
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a': np.random.randn(6),
    'b': np.random.choice([5, 7, np.nan], 6),
    'c': np.random.choice(['panda', 'python', 'shark'], 6),
    # some ways to create systematic groups for indexing or groupby
    # this is similar to r's expand.grid()
    'd': np.repeat(range(3), 2),
    'e': np.tile(range(2), 3),
    # a date range and a set of random dates
    'f': pd.date_range('1/1/2011', periods=6, freq='D'),
    'g': np.random.choice(pd.date_range('1/1/2011', periods=365,
                                        freq='D'), 6, replace=False)
})
You can use pd.Grouper to specify groupby instructions. With a pd.DatetimeIndex, it groups the data at a specified frequency via the freq parameter.
Assuming that you have this dataframe:
df = pd.DataFrame(dict(
a=dict(date=pd.Timestamp('2020-05-01'), category='a', value=1),
b=dict(date=pd.Timestamp('2020-06-01'), category='a', value=2),
c=dict(date=pd.Timestamp('2020-06-01'), category='b', value=6),
d=dict(date=pd.Timestamp('2020-07-01'), category='a', value=1),
e=dict(date=pd.Timestamp('2020-07-27'), category='a', value=3),
)).T
You can set the index to the date column and it will be converted to a pd.DatetimeIndex. Then you can use pd.Grouper along with other columns; in the following example I use the category column.
The freq='M' parameter groups the index by month-end frequency. There are a number of offset alias strings that can be used with pd.Grouper.
df.set_index('date').groupby([pd.Grouper(freq='M'), 'category'])['value'].sum()
Result:
date category
2020-05-31 a 1
2020-06-30 a 2
b 6
2020-07-31 a 4
Name: value, dtype: int64
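If you would rather see this as a small table with one column per category, here is a sketch of the same aggregation reshaped with unstack (nothing is assumed beyond the dataframe above):

# same aggregation; categories become columns, missing combinations become 0
(df.set_index('date')
   .groupby([pd.Grouper(freq='M'), 'category'])['value'].sum()
   .unstack(fill_value=0))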
Another example with your MCVE:
df.set_index('g').groupby([pd.Grouper(freq='M'), 'c']).d.sum()
Result:
g c
2011-01-31 panda 0
2011-04-30 shark 2
2011-06-30 panda 2
2011-07-31 panda 0
2011-09-30 panda 1
2011-12-31 python 1
Name: d, dtype: int32
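Note that pd.Grouper also accepts a key parameter, so the same aggregation can be written without setting the index first; a sketch on the same MCVE dataframe:

# equivalent to the line above, but the Grouper points at the 'g' column directly
df.groupby([pd.Grouper(key='g', freq='M'), 'c'])['d'].sum()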
From a pandas dataframe with these columns:
DAT_MESURE datetime64[ns]
MES_TEMPERATURE object
and values:
I want to compute the mean temperature for each hour and get a new df.
For example, I want a new df with DAT_MESURE rounded down to the hour and the mean of the 4 values within that hour.
I want to get:
DAT_MESURE MES_TEMPERATURES
2020-08-01 00:00:00 21,xx
2020-08-01 01:00:00 22,xx
How can I code this in Python pandas, please?
Use:
# the temperatures are strings with a decimal comma, so convert them to floats first
df['MES_TEMPERATURE'] = df['MES_TEMPERATURE'].str.replace(',', '.', regex=False).astype(float)
# hourly bins labelled with the start of each hour
df1 = df.resample('H', on='DAT_MESURE')['MES_TEMPERATURE'].mean()
Or:
# the same via groupby, flooring each timestamp down to its hour
df2 = df.groupby(df['DAT_MESURE'].dt.floor('H'))['MES_TEMPERATURE'].mean()
If you need rounding to the nearest hour instead:
df3 = df.groupby(df['DAT_MESURE'].dt.round('H'))['MES_TEMPERATURE'].mean()
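To see these on concrete data, here is a minimal self-contained sketch; the sample readings are made up (the question's values were not shown), assuming quarter-hourly measurements stored as strings with a decimal comma:

import pandas as pd

# assumed sample data: 8 quarter-hourly readings covering two hours
df = pd.DataFrame({
    'DAT_MESURE': pd.date_range('2020-08-01 00:00:00', periods=8, freq='15min'),
    'MES_TEMPERATURE': ['21,1', '21,3', '21,2', '21,4', '22,0', '22,2', '22,1', '22,3'],
})

# decimal comma -> float, then hourly means labelled with the start of each hour
df['MES_TEMPERATURE'] = df['MES_TEMPERATURE'].str.replace(',', '.', regex=False).astype(float)
print(df.resample('H', on='DAT_MESURE')['MES_TEMPERATURE'].mean())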
I have a pandas series as follows...
0 2039-03-16
1 2056-01-21
2 2051-11-18
3 2064-03-05
4 2048-06-05
Name: BIRTH, dtype: datetime64[ns]
It was created from string data as follows
s = data['BIRTH']
s = pd.to_datetime(s)
s
I want to shift all dates after the year 2040 back by 100 years (so, for example, 2056 becomes 1956).
I can do this for a single record as follows
s.iloc[0].replace(year=s.iloc[0].year - 100)
but I really want to just run it over the whole series. I can't work it out. Help!??
PS - I know there are ways outside of pandas using Python's datetime module, but I'd like to learn how to do this within pandas please
Using DateOffset is the obvious choice here:
df['date'] - pd.offsets.DateOffset(years=100)
0 1939-03-16
1 1956-01-21
2 1951-11-18
3 1964-03-05
4 1948-06-05
Name: date, dtype: datetime64[ns]
Assign it back:
df['date'] -= pd.offsets.DateOffset(years=100)
df
date
0 1939-03-16
1 1956-01-21
2 1951-11-18
3 1964-03-05
4 1948-06-05
We have the offsets module to deal with non-fixed frequencies, and it comes in handy in situations like these.
To fix your own code, you would have wanted to apply datetime.replace row-wise using apply (not recommended):
df['date'].apply(lambda x: x.replace(year=x.year-100))
0 1939-03-16
1 1956-01-21
2 1951-11-18
3 1964-03-05
4 1948-06-05
Name: date, dtype: datetime64[ns]
Or using a list comprehension,
df.assign(date=[x.replace(year=x.year-100) for x in df['date']])
date
0 1939-03-16
1 1956-01-21
2 1951-11-18
3 1964-03-05
4 1948-06-05
Neither of these handles NaT entries very well.
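If you want to touch only the dates after 2040, as the question asks, here is a minimal sketch combining a boolean mask with the same DateOffset:

import pandas as pd

s = pd.to_datetime(pd.Series(['2039-03-16', '2056-01-21', '2051-11-18',
                              '2064-03-05', '2048-06-05'], name='BIRTH'))

# shift only the rows whose year is after 2040 back by a century; 2039 is left alone
mask = s.dt.year > 2040
s.loc[mask] = s.loc[mask] - pd.offsets.DateOffset(years=100)
print(s)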
How do I assign values to the months from April to September?
I would like the April value to equal 42000, May=41000, June=61200, July=71000, August=71000.
df.index
RangeIndex(start=0, stop=60, step=1)
For a mapping like this, you would typically define a dictionary and map the values. Use .split to get the month part of the date and fillna to fill only the missing values.
Data:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': ['2018-Jan', '2018-Feb', '2018-Mar', '2018-Apr', '2018-May',
'2018-Jun', '2018-Jul', '2018-Aug', '2018-Sep'],
'Value': [75267.169, 42258.868, 43793]+[np.NaN]*6})
Code:
d = {'Apr': 42000, 'May': 41000, 'Jun': 61200, 'Jul': 71000, 'Aug': 71000}
df['Value'] = df.Value.fillna(df.Date.str.split('-').str[1].map(d))
Output:
Date Value
0 2018-Jan 75267.169
1 2018-Feb 42258.868
2 2018-Mar 43793.000
3 2018-Apr 42000.000
4 2018-May 41000.000
5 2018-Jun 61200.000
6 2018-Jul 71000.000
7 2018-Aug 71000.000
8 2018-Sep NaN
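Since the Date strings parse cleanly with a %Y-%b format, here is an equivalent sketch that maps on real month names instead of splitting strings:

# parse '2018-Jan'-style strings, take the abbreviated month name, then map and fill
months = pd.to_datetime(df['Date'], format='%Y-%b').dt.strftime('%b')
df['Value'] = df['Value'].fillna(months.map(d))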
A super simple (and ugly) way to do it using pd.DataFrame.iloc:
to_fill = [42000,41000,61200,71000,71000]
df.iloc[54:59,1] = to_fill
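The hard-coded 54:59 slice only fits the asker's 60-row frame; on the sample frame above, a sketch that finds the target rows by their Date label instead of by position would be:

# label-based version of the same idea
target = ['2018-Apr', '2018-May', '2018-Jun', '2018-Jul', '2018-Aug']
df.loc[df['Date'].isin(target), 'Value'] = to_fill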
I have the following test dataframe:
date user answer
0 2018-08-19 19:08:19 pga yes
1 2018-08-19 19:09:27 pga no
2 2018-08-19 19:10:45 lry no
3 2018-09-07 19:12:31 lry yes
4 2018-09-19 19:13:07 pga yes
5 2018-10-22 19:13:20 lry no
I am using the following code to group by week:
test.groupby(pd.Grouper(freq='W'))
I'm getting an error that Grouper is only valid with DatetimeIndex, but I'm not sure how to structure this in order to group by week.
You probably have the date column stored as strings.
In order to use it in a Grouper with a frequency, start by converting this column to datetime:
df['date'] = pd.to_datetime(df['date'])
Then, since date is an "ordinary" data column (not the index), use the key='date' parameter together with a frequency.
To sum up, here is a working example:
import pandas as pd
d = [['2018-08-19 19:08:19', 'pga', 'yes'],
['2018-08-19 19:09:27', 'pga', 'no'],
['2018-08-19 19:10:45', 'lry', 'no'],
['2018-09-07 19:12:31', 'lry', 'yes'],
['2018-09-19 19:13:07', 'pga', 'yes'],
['2018-10-22 19:13:20', 'lry', 'no']]
df = pd.DataFrame(data=d, columns=['date', 'user', 'answer'])
df['date'] = pd.to_datetime(df['date'])
gr = df.groupby(pd.Grouper(key='date',freq='W'))
for name, group in gr:
    print(' ', name)
    if len(group) > 0:
        print(group)
Note that the group key (name) is the ending date of each week, so the dates of the group members are earlier than or equal to it.
You can change this by passing the label='left' parameter to Grouper.
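For example, here is a sketch of the same weekly grouping labelled by each week's start date instead of its end:

# label each weekly bin by its left (start) edge rather than the default right edge
df.groupby(pd.Grouper(key='date', freq='W', label='left'))['answer'].count()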
I have a question regarding resampling of DataFrames.
import pandas as pd
df = pd.DataFrame([['2005-01-20', 10], ['2005-01-21', 20],
['2005-01-27', 40], ['2005-01-28', 50]],
columns=['date', 'num'])
# Convert the column to datetime
df['date'] = pd.to_datetime(df['date'])
# Resample and aggregate results by week
df = df.resample('W', on='date')['num'].sum().reset_index()
print(df.head())
# OUTPUT:
# date num
# 0 2005-01-23 30
# 1 2005-01-30 90
Everything works as expected, but I would like to better understand what exactly resample(), ['num'] and sum() do here.
QUESTION #1
Why does the following happen?
The result of df.resample('W', on='date') is DatetimeIndexResampler.
The result of df.resample('W', on='date')['num'] is pandas.core.groupby.SeriesGroupBy.
The result of df.resample('W', on='date')['num'].sum() is
date
2005-01-23 30
2005-01-30 90
Freq: W-SUN, Name: num, dtype: int64
QUESTION #2
Is there a way to produce the same results without resampling? For example, using groupby.
ANSWER #1
As the docs say, .resample returns a Resampler object; you get a DatetimeIndexResampler because date is a datetime column.
Next, you get a pandas.core.groupby.SeriesGroupBy because selecting ['num'] pulls a single Series out of the dataframe via the Resampler object.
Oh by the way,
df.groupby([pd.Grouper(key='date', freq='W-SUN')])['num']
would return a pandas.core.groupby.SeriesGroupBy as well.
Now, when you call .sum(), you get the sum over the requested axis within each group. The result is a Series because the sum is computed over a pandas.core.series.Series (a single column).
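A quick way to see this chain for yourself (a sketch; the exact class names printed vary slightly across pandas versions):

import pandas as pd

df = pd.DataFrame([['2005-01-20', 10], ['2005-01-21', 20],
                   ['2005-01-27', 40], ['2005-01-28', 50]],
                  columns=['date', 'num'])
df['date'] = pd.to_datetime(df['date'])

r = df.resample('W', on='date')
print(type(r))               # the Resampler object
print(type(r['num']))        # the resampler/groupby restricted to the 'num' column
print(type(r['num'].sum()))  # a plain pandas Series after aggregation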
ANSWER #2
You can achieve the same result using groupby with the help of Grouper, as follows:
df.groupby([pd.Grouper(key='date', freq='W-SUN')])['num'].sum()
Output:
date
2005-01-23 30
2005-01-30 90
Name: num, dtype: int64