I have data that looks like this below and I'm trying to calculate the CRMSE (centered root mean squared error) by plant_name and year. Perhaps I need an .agg or a lambda function to do this for each groupby group (plant_name, year). The dataframe data for df3m1:
plant_name year month obsvals modelvals
0 ARIZONA I 2021 1 8.90 8.30
1 ARIZONA I 2021 2 7.98 7.41
2 CAETITE I 2021 1 9.10 7.78
3 CAETITE I 2021 2 6.05 6.02
The equation that I need to implement for each (plant_name, year) group looks like this:
crmse = sqrt( mean( ((obsvals - mean(obsvals)) - (modelvals - mean(modelvals)))**2 ) )
Combining a groupby with a calculation like this at the same time is still a bit advanced for me. Thank you. The final dataframe would look like:
plant_name year crmse
0 ARIZONA I 2021 ?
1 CAETITE I 2021 ?
I have tried things like this with groupby -
crmse = df3m1.groupby(['plant_name','year'])(( (df3m1.obsvals -
df3m1.obsvals.mean()) - (df3m1.modelvals - df3m1.modelvals.mean()) )
** 2).mean() ** .5
but get errors like this:
TypeError: 'DataFrameGroupBy' object is not callable
Using groupby is correct. Normally we would follow it with .agg, but computing crmse involves multiple columns (obsvals and modelvals), so instead we pass each group's entire dataframe to .apply and pick out the columns we need inside the function.
Code:
import numpy as np
import pandas as pd

def crmse(x, y):
    # centered RMSE: subtract each series' own mean before differencing
    return np.sqrt(np.mean(np.square((x - x.mean()) - (y - y.mean()))))

def f(df):
    return pd.Series(crmse(df['obsvals'], df['modelvals']), index=['crmse'])
crmse_series = (
df3m1
.groupby(['plant_name', 'year'])
.apply(f)
)
crmse_series
crmse
plant_name year
ARIZONA I 2021 0.015
CAETITE I 2021 0.645
You can merge the result back into the original dataframe with merge.
df3m1 = df3m1.merge(crmse_series, on=['plant_name', 'year'])
df3m1
plant_name year month obsvals modelvals crmse
0 ARIZONA I 2021 1 8.90 8.30 0.015
1 ARIZONA I 2021 2 7.98 7.41 0.015
2 CAETITE I 2021 1 9.10 7.78 0.645
3 CAETITE I 2021 2 6.05 6.02 0.645
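As a self-contained sketch, the whole computation can be reproduced from the four sample rows above (the lambda here is an equivalent alternative to the f helper shown in the answer, not the answer's exact code):

```python
import numpy as np
import pandas as pd

df3m1 = pd.DataFrame({
    'plant_name': ['ARIZONA I', 'ARIZONA I', 'CAETITE I', 'CAETITE I'],
    'year': [2021, 2021, 2021, 2021],
    'month': [1, 2, 1, 2],
    'obsvals': [8.90, 7.98, 9.10, 6.05],
    'modelvals': [8.30, 7.41, 7.78, 6.02],
})

def crmse(x, y):
    # centered RMSE: subtract each series' own mean before differencing
    return np.sqrt(np.mean(np.square((x - x.mean()) - (y - y.mean()))))

# One crmse value per (plant_name, year) group
crmse_df = (
    df3m1
    .groupby(['plant_name', 'year'])
    .apply(lambda g: crmse(g['obsvals'], g['modelvals']))
    .reset_index(name='crmse')
)

# Broadcast the per-group value back onto every row
df3m1 = df3m1.merge(crmse_df, on=['plant_name', 'year'])
```

The reset_index/merge step is one way to attach the per-group result; merging directly on the grouped index levels works as well.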
See Also:
Apply multiple functions to multiple groupby columns
I'm working on a fun side project and would like to compute a moving sum for number of wins for NBA teams over 2 year periods. Consider the sample pandas dataframe below,
pd.DataFrame({'Team':['Hawks','Hawks','Hawks','Hawks','Hawks'], 'Season':[1970,1971,1972,1973,1974],'Wins':[40,34,30,46,42]})
I would ideally like to compute the sum of the number of wins between 1970 and 1971, 1971 and 1972, 1972 and 1973, etc. An inefficient way would be to use a loop; is there a way to do this using the .groupby function?
This is a little bit of a hack, but you could group by df['Season'] // 2 * 2: floor-divide each season by two, then multiply by two again. The effect is to round each year down to a multiple of two, so note this produces non-overlapping two-year buckets rather than a true moving window.
df_sum = df.groupby(['Team', df['Season'] // 2 * 2])['Wins'].sum().reset_index()
Output:
Team Season Wins
0 Hawks 1970 74
1 Hawks 1972 76
2 Hawks 1974 42
If the years are ordered within each team, you can simply use rolling combined with groupby. For example:
import pandas as pd
df = pd.DataFrame({'Team':['Hawks','Hawks','Hawks','Hawks','Hawks'], 'Season':[1970,1971,1972,1973,1974],'Wins':[40,34,30,46,42]})
res = df.groupby('Team')['Wins'].rolling(2).sum()
print(res)
Out:
Team
Hawks 0 NaN
1 74.0
2 64.0
3 76.0
4 88.0
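If you then want the moving sums as a column on the original dataframe, one possible sketch (an addition on my part, not part of the answer above) is to drop the group level of the index so the result aligns row-for-row:

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Hawks'] * 5,
                   'Season': [1970, 1971, 1972, 1973, 1974],
                   'Wins': [40, 34, 30, 46, 42]})

# groupby(...).rolling(...) returns a (Team, original index) MultiIndex;
# droplevel(0) removes the Team level so assignment aligns with df's index.
df['Wins2yr'] = df.groupby('Team')['Wins'].rolling(2).sum().droplevel(0)
```

The first season of each team has no preceding year, so its value is NaN.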
I have a pandas time series ser
ser
>>>
date x
2018-01-01 0.912
2018-01-02 0.704
...
2021-02-01 1.285
and I want to take a cumulative sum by year and make each year into a column, as shown below. The date index should then just be the day within the year (e.g. Jan 01, Jan 02, ...; the formatting of month and day doesn't matter):
date 2018_x 2019_x 2020_x 2021_x 2022_x
Jan-01 0.912 ... ... ... ...
Jan-02 1.616 ... ... ... ...
...
I know how to groupby and take a cumulative sum, but then I want to do some sort of unstacking operation to get the years into columns
ser.groupby(ser.index.year).cumsum()
# what do I do next?
The standard pandas unstack() operation doesn't work here.
Can anyone please advise how to do this?
First you can aggregate the sum per MM-DD and year, then reshape with Series.unstack and take the cumulative sum:
df = ser.groupby([ser.index.strftime('%m-%d'), ser.index.year]).sum().unstack(fill_value=0).cumsum()
print (df)
date 2018 2021
date
01-01 0.912 0.000
01-02 1.616 0.000
02-01 1.616 1.285
Or, if there are no duplicated datetimes, create the MultiIndex without groupby:
ser.index = [ser.index.strftime('%m-%d'), ser.index.year]
df = ser.unstack(fill_value=0).cumsum()
print (df)
date 2018 2021
date
01-01 0.912 0.000
01-02 1.616 0.000
02-01 1.616 1.285
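A runnable sketch of the first approach, using only the three dates visible in the question (the rows elided by "..." there are left out):

```python
import pandas as pd

ser = pd.Series([0.912, 0.704, 1.285],
                index=pd.to_datetime(['2018-01-01', '2018-01-02', '2021-02-01']),
                name='x')
ser.index.name = 'date'

# Sum per (MM-DD, year), pivot the years into columns with unstack,
# then take the cumulative sum down each year's column.
df = (ser.groupby([ser.index.strftime('%m-%d'), ser.index.year])
         .sum()
         .unstack(fill_value=0)
         .cumsum())
```

fill_value=0 ensures that days missing from a given year contribute nothing to that year's cumulative sum.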
I have opened a dataframe in Julia where I have 3 columns like this:
day month year
1 1 2011
2 4 2015
3 12 2018
how can I make a new column called date that goes:
day month year date
1 1 2011 1/1/2011
2 4 2015 2/4/2015
3 12 2018 3/12/2018
I was trying with this:
df[!,:date]= df.day.*"/".*df.month.*"/".*df.year
but it didn't work.
In R I would do:
df$date=paste(df$day, df$month, df$year, sep="/")
is there anything similar?
thanks in advance!
Julia has a built-in Date type in its standard library:
julia> using Dates
julia> df[!, :date] = Date.(df.year, df.month, df.day)
3-element Vector{Date}:
2011-01-01
2015-04-02
2018-12-03
I have a dataframe that looks like the one below; the date is the index. How would I plot a time series showing a line for each of the years? I have tried df.plot(figsize=(15,4)), but this gives me one line.
Date Value
2008-01-31 22
2008-02-28 17
2008-03-31 34
2008-04-30 29
2009-01-31 33
2009-02-28 42
2009-03-31 45
2009-04-30 39
2019-01-31 17
2019-02-28 12
2019-03-31 11
2019-04-30 12
2020-01-31 24
2020-02-28 34
2020-03-31 43
2020-04-30 45
You can just do a groupby using year.
df = pd.read_clipboard()
df = df.set_index(pd.DatetimeIndex(df['Date']))
df.groupby(df.index.year)['Value'].plot()
In case you want to use each year as a data series and compare day by day:
import matplotlib.pyplot as plt
# Create a date column from index (easier to manipulate)
df["date_column"] = pd.to_datetime(df.index)
# Create a year column
df["year"] = df["date_column"].dt.year
# Create a zero-padded month-day column, e.g. "01-31"
df["month_day"] = df["date_column"].dt.strftime("%m-%d")
# Plot. Pivot will create for each year a column and these columns will be used as series.
df.pivot(index='month_day', columns='year', values='Value').plot(kind='line', figsize=(12, 8), marker='o')
plt.title("Values per Month-Day - Year comparison", y=1.1, fontsize=14)
plt.xlabel("Month-Day", labelpad=12, fontsize=12)
plt.ylabel("Value", labelpad=12, fontsize=12);
I have a dataframe df with values as:
df.iloc[1:4, 7:9]
Year Month
38 2020 4
65 2021 4
92 2022 4
I am trying to create a new MonthIdx column as:
df['MonthIdx'] = pd.to_timedelta(df['Year'], unit='Y') + pd.to_timedelta(df['Month'], unit='M') + pd.to_timedelta(1, unit='D')
But I get the error:
ValueError: Units 'M' and 'Y' are no longer supported, as they do not represent unambiguous timedelta values durations.
Following is the desired output:
df['MonthIdx']
MonthIdx
38 2020/04/01
65 2021/04/01
92 2022/04/01
You can pad the month value to two digits, concatenate it with the year, and parse the result as a datetime:
month = df.Month.astype(str).str.pad(width=2, side='left', fillchar='0')
df['MonthIdx'] = pd.to_datetime(df['Year'].astype(str) + month, format='%Y%m')
This will give you:
Year Month MonthIdx
0 2020 4 2020-04-01
1 2021 4 2021-04-01
2 2022 4 2022-04-01
You can then format the date as a string to match your exact format:
df['MonthIdx'] = df['MonthIdx'].apply(lambda x: x.strftime('%Y/%m/%d'))
Giving you:
Year Month MonthIdx
0 2020 4 2020/04/01
1 2021 4 2021/04/01
2 2022 4 2022/04/01
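An alternative sketch (an assumption on my part, not the answer's code): pd.to_datetime can assemble dates directly from component columns named year/month/day, which avoids the string round-trip entirely:

```python
import pandas as pd

df = pd.DataFrame({'Year': [2020, 2021, 2022], 'Month': [4, 4, 4]},
                  index=[38, 65, 92])

# to_datetime accepts a dataframe with year/month/day columns
# (case-insensitive); assign a constant Day=1 for the first of the month.
df['MonthIdx'] = pd.to_datetime(df.assign(Day=1)[['Year', 'Month', 'Day']])
df['MonthIdx'] = df['MonthIdx'].dt.strftime('%Y/%m/%d')
```

Because the assembled Series keeps the original index, this also preserves the question's 38/65/92 row labels.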