Pandas Calculate RMSE in Date Range Chunks by Year

I have data in a df and need to calculate the RMSE of one column ("Variation"), comparing each previous year's rows to the corresponding months of the current year over a fixed chunk of months. I cannot figure out how to set up the sequencing by year. For example, I need to calculate the RMSE for each year over the window running from exactly month == 5 through month == 2 of the following year, and print all the RMSE values of the "Variation" column by start year. My data looks like this:
month mean_mon_flow ... std_anomaly Variation
date ...
1992-04-01 00:00:00 4 12.265100 ... -1.074586 NaN
1992-05-01 00:00:00 5 12.533220 ... -1.017388 0.057198
1992-06-01 00:00:00 6 12.491247 ... -1.117406 -0.100018
1992-07-01 00:00:00 7 12.113165 ... -1.401221 -0.283815
1992-08-01 00:00:00 8 11.846904 ... -1.359026 0.042195
1992-09-01 00:00:00 9 11.526178 ... -0.299250 1.059776
1992-10-01 00:00:00 10 11.555834 ... -0.628162 -0.328911
1992-11-01 00:00:00 11 11.746104 ... -1.116374 -0.488213
1992-12-01 00:00:00 12 11.891824 ... -0.143343 0.973031
1993-01-01 00:00:00 1 11.997252 ... -0.486450 -0.343107
1993-02-01 00:00:00 2 12.028855 ... -0.862971 -0.376521
1993-03-01 00:00:00 3 12.063974 ... -0.596869 0.266102
1993-04-01 00:00:00 4 12.265100 ... -0.923695 -0.326826
1993-05-01 00:00:00 5 12.533220 ... 0.322987 1.246682
1993-06-01 00:00:00 6 12.491247 ... -0.478567 -0.801554
1993-07-01 00:00:00 7 12.113165 ... -0.274119 0.204448
1993-08-01 00:00:00 8 11.846904 ... -0.707968 -0.433849
1993-09-01 00:00:00 9 11.526178 ... 0.167246 0.875214
1993-10-01 00:00:00 10 11.555834 ... -0.089410 -0.256656
1993-11-01 00:00:00 11 11.746104 ... -1.046461 -0.957050
1993-12-01 00:00:00 12 11.891824 ... -1.293175 -0.246714
1994-01-01 00:00:00 1 11.997252 ... -1.505133 -0.211959
1994-02-01 00:00:00 2 12.028855 ... -0.610121 0.895012
1994-03-01 00:00:00 3 12.063974 ... -0.974184 -0.364063
1994-04-01 00:00:00 4 12.265100 ... -1.077609 -0.103424
The observed data from the current year looks like this:
month mean_mon_flow ... std_anomaly Variation
date ...
2021-05-01 00:00:00 5 12.533220 ... -0.935899 0.206586
2021-06-01 00:00:00 6 12.491247 ... -0.647261 0.288638
2021-07-01 00:00:00 7 12.113165 ... -0.711730 -0.064469
2021-08-01 00:00:00 8 11.846904 ... -0.482306 0.229424
2021-09-01 00:00:00 9 11.526178 ... -0.116989 0.365317
2021-10-01 00:00:00 10 11.555834 ... 0.319614 0.436603
2021-11-01 00:00:00 11 11.746104 ... 0.880379 0.560765
2021-12-01 00:00:00 12 11.891824 ... 0.630541 -0.249838
2022-01-01 00:00:00 1 11.997252 ... -0.151507 -0.782048
2022-02-01 00:00:00 2 12.028855 ... -0.237398 -0.085891
The result should be something like the table below. I've tried using a groupby statement to calculate the RMSE, but I'm not sure how to give groupby a range of dates.
year RMSE Variation
1992 number
1993 number
1994 number
.. ..
2020 number
thank you,

Here is some pre-processing of your dataframe of previous years. First, derive the year label by taking the year component of the date with 4 months subtracted, so that each May-to-February window shares a single label. Second, drop March and April, which fall outside the window.
import pandas as pd
from dateutil.relativedelta import relativedelta

# assumes 'date' is a datetime column; if it is the index, reset_index() first
# label each row with the year of (date - 4 months) so May..Feb share one year label
df_prev['year'] = pd.Series(df_prev['date'].dt.to_pydatetime() - relativedelta(months=4)).dt.year
# drop March and April, which fall outside the May-to-February window
df_prev = df_prev[~df_prev['month'].isin([3, 4])]
Then pivot df_prev into a matrix with years as columns and month as the index, and convert the table for the current year into a series indexed by month.
# previous years: one column per year, indexed by month
df_prev_vari = df_prev.set_index(['month', 'year'])[['Variation']].unstack().droplevel(0, axis=1)
# current year: a Series indexed by month
df_this_vari = df_this.set_index('month')['Variation']
Having month as the common index for both objects lets us subtract one from the other with index alignment, then square, take the mean, and take the square root:
(df_prev_vari.sub(df_this_vari, axis=0)**2).mean()**.5
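The expression above returns a Series of RMSE values indexed by year. As a minimal follow-up sketch (assuming df_prev_vari and df_this_vari are defined as above), you could reshape it into the requested two-column table like this:
rmse = (df_prev_vari.sub(df_this_vari, axis=0) ** 2).mean() ** 0.5
# Series indexed by year -> DataFrame with 'year' and 'RMSE Variation' columns
result = rmse.rename('RMSE Variation').rename_axis('year').reset_index()
print(result)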

Related

How to calculate the slope of a dataframe, up to a specific row number?

I have this data frame that looks like this:
PE CE time
0 362.30 304.70 09:42
1 365.30 303.60 09:43
2 367.20 302.30 09:44
3 360.30 309.80 09:45
4 356.70 310.25 09:46
5 355.30 311.70 09:47
6 354.40 312.98 09:48
7 350.80 316.70 09:49
8 349.10 318.95 09:50
9 350.05 317.45 09:51
10 352.05 315.95 09:52
11 350.25 316.65 09:53
12 348.63 318.35 09:54
13 349.05 315.95 09:55
14 345.65 320.15 09:56
15 346.85 319.95 09:57
16 348.55 317.20 09:58
17 349.55 316.26 09:59
18 348.25 317.10 10:00
19 347.30 318.50 10:01
In this data frame, I would like to calculate the slope of the first and second columns separately, over the time period starting from the first time (09:42 in this case, though it is not fixed and can vary) up to the time 12:00.
Please help me write it.
Computing the slope can be accomplished with the equation:
Slope = Rise / Run
Given that you want to compute the slope between two time entries, all you need to find is:
the Run = the distance between the start and end times (here taken as the difference of their index positions)
the Rise = the difference between the cell entries at the start and the end.
The tricky part of these calculations is making sure you handle the time values properly:
import pandas as pd
from datetime import datetime
Thus you can define a function:
def computeSelectedSlope(df: pd.DataFrame, start: str, end: str, timecol: str, datacol: str) -> float:
    assert timecol in df.columns   # prove timecol exists
    assert datacol in df.columns   # prove datacol exists
    # Rise: difference of the data values at the end and start times
    rise = (df[datacol][df[timecol] == datetime.strptime(end, '%H:%M:%S').time()].values[0] -
            df[datacol][df[timecol] == datetime.strptime(start, '%H:%M:%S').time()].values[0])
    # Run: difference of the index positions of the end and start times
    run = (int(df.index[df[timecol] == datetime.strptime(end, '%H:%M:%S').time()].values) -
           int(df.index[df[timecol] == datetime.strptime(start, '%H:%M:%S').time()].values))
    return rise / run
Now given a dataframe df of the form:
A B T
0 2.632 231.229 00:00:00
1 2.732 239.026 00:01:00
2 2.748 251.310 00:02:00
3 3.018 285.330 00:03:00
4 3.090 308.925 00:04:00
5 3.366 312.702 00:05:00
6 3.369 326.912 00:06:00
7 3.562 330.703 00:07:00
8 3.590 379.575 00:08:00
9 3.867 422.262 00:09:00
10 4.030 428.148 00:10:00
11 4.210 442.521 00:11:00
12 4.266 443.631 00:12:00
13 4.335 444.991 00:13:00
14 4.380 453.531 00:14:00
15 4.402 462.531 00:15:00
16 4.499 464.170 00:16:00
17 4.553 471.770 00:17:00
18 4.572 495.285 00:18:00
19 4.665 513.009 00:19:00
You can find the slope for any time difference by:
computeSelectedSlope(df, '00:01:00', '00:15:00', 'T', 'B')
Which yields 15.964642857142858
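A hypothetical application to the original PE/CE frame, assuming its time column holds strings such as '09:42' that are first converted to datetime.time objects, might look like this:
# convert 'HH:MM' strings to time objects so the lookups in computeSelectedSlope match
df['time'] = pd.to_datetime(df['time'], format='%H:%M').dt.time
slope_pe = computeSelectedSlope(df, '09:42:00', '10:01:00', 'time', 'PE')
slope_ce = computeSelectedSlope(df, '09:42:00', '10:01:00', 'time', 'CE')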

Difference between first row and current row, by group

I have a data set like this:
state,date,events_per_day
AM,2020-03-01,100
AM,2020-03-02,120
AM,2020-03-15,200
BA,2020-03-16,80
BA,2020-03-20,100
BA,2020-03-29,150
RS,2020-04-01,80
RS,2020-04-05,100
RS,2020-04-11,160
Now I need to compute the difference between the date in the first row of each group and the date in the current row.
i.e. the first row of each group:
for group "AM" the first date is 2020-03-01;
for group "BA" the first date is 2020-03-16;
for group "RS" it is 2020-04-01.
In the end, the result I want is:
state,date,events_per_day,days_after_first_event
AM,2020-03-01,100,0
AM,2020-03-02,120,1 <--- 2020-03-02 - 2020-03-01
AM,2020-03-15,200,14 <--- 2020-03-15 - 2020-03-01
BA,2020-03-16,80,0
BA,2020-03-20,100,4 <--- 2020-03-20 - 2020-03-16
BA,2020-03-29,150,13 <--- 2020-03-29 - 2020-03-16
RS,2020-04-01,80,0
RS,2020-04-05,100,4 <--- 2020-04-05 - 2020-04-01
RS,2020-04-11,160,10 <--- 2020-04-11 - 2020-04-01
I found How to calculate time difference by group using pandas? and it is almost to what I want. However, diff() returns the difference between consecutive lines, and I need the difference between the current line and the first line.
How can I do this?
Preprocessing:
# convert to datetime
df['date'] = pd.to_datetime(df['date'])
# extract the first date of each state:
first_dates = df.groupby('state')['date'].first()  # .min() works as well
Option 1: Index alignment
# set_index before subtraction allows index alignment
df['days_since_first'] = (df.set_index('state')['date'] - first_dates).values
Option 2: map
df['days_since_first'] = df['date'] - df['state'].map(first_dates)
Option 3: groupby.transform
df['days_since_first'] = df['date'] - df.groupby('state')['date'].transform('first')
Output:
state date events_per_day days_since_first
0 AM 2020-03-01 100 0 days
1 AM 2020-03-02 120 1 days
2 AM 2020-03-15 200 14 days
3 BA 2020-03-16 80 0 days
4 BA 2020-03-20 100 4 days
5 BA 2020-03-29 150 13 days
6 RS 2020-04-01 80 0 days
7 RS 2020-04-05 100 4 days
8 RS 2020-04-11 160 10 days
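The expected output in the question shows plain integers rather than Timedelta strings such as "1 days"; as a small follow-up sketch, the column can be converted with the .dt.days accessor:
# turn the Timedelta column into integer days to match the requested output
df['days_after_first_event'] = df['days_since_first'].dt.days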

Compare Cumulative Sales per Year-End

Using this sample dataframe:
np.random.seed(1111)
df = pd.DataFrame({
'Category':np.random.choice( ['Group A','Group B','Group C','Group D'], 10000),
'Sub-Category':np.random.choice( ['X','Y','Z'], 10000),
'Sub-Category-2':np.random.choice( ['G','F','I'], 10000),
'Product':np.random.choice( ['Product 1','Product 2','Product 3'], 10000),
'Units_Sold':np.random.randint(1,100, size=(10000)),
'Dollars_Sold':np.random.randint(100,1000, size=10000),
'Customer':np.random.choice(pd.util.testing.rands_array(10,25,dtype='str'),10000),
'Date':np.random.choice( pd.date_range('1/1/2016','12/31/2020',
freq='M'), 10000)})
I am trying to compare 12-month time frames with seaborn plots for a sub-grouping of Category. For example, I'd like to compare the cumulative 12 months ending 4-30 of each year against the same period in other years. I cannot wrap my head around how to get a running total of data for each respective year (5/1/17-4/30/18, 5/1/18-4/30/19, 5/1/19-4/30/20). The dates are just examples - I'd like to be able to compare different year-end data points, and even better would be the ability to compare 365-day windows. For instance, I'd love to compare 3/15/19-3/14/20 to 3/15/18-3/14/19, etc.
I envision a graph for each 'Category' (A,B,C,D) with lines for each respective year representing the running total starting with zero on May 1, building through April 30 of the next year. The x axis would be the month (starting with May 1) & y axis would be 'Units_Sold' as it grows.
Any help would be greatly appreciated!
One way is to convert the dates to fiscal quarters and extract the fiscal year:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('2019-01-01', '2019-12-31', freq='M'),
                   'Values': np.arange(12)})
# 'Q-APR' = quarters of a fiscal year ending in April, so qyear gives the fiscal year
df['fiscal_year'] = df.Date.dt.to_period('Q-APR').dt.qyear
Output:
Date Values fiscal_year
0 2019-01-31 0 2019
1 2019-02-28 1 2019
2 2019-03-31 2 2019
3 2019-04-30 3 2019
4 2019-05-31 4 2020
5 2019-06-30 5 2020
6 2019-07-31 6 2020
7 2019-08-31 7 2020
8 2019-09-30 8 2020
9 2019-10-31 9 2020
10 2019-11-30 10 2020
11 2019-12-31 11 2020
And now you can group by fiscal_year to your heart's content.
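To get the running total the question describes (starting at zero each May 1 and building through April 30), one possible sketch, assuming the sample df from the question with its Date, Category and Units_Sold columns, is to apply the same fiscal-year label and take a cumulative sum within each Category and fiscal year:
df['fiscal_year'] = df['Date'].dt.to_period('Q-APR').dt.qyear
df = df.sort_values('Date')
# running total of units sold within each category and fiscal year
df['cum_units'] = df.groupby(['Category', 'fiscal_year'])['Units_Sold'].cumsum()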

Pandas - Find difference based on two subsequent rows of Dataframe

I have a DataFrame that captures the date a ticket was raised by a customer in a column labelled date. If the ref_column of the current row is the same as in the following row, then the aging is the difference between the date of the current row and the date of the following row for the same cust_id. If the ref_column is not the same, then the aging is the difference between date and ref_date of the same row.
Given below is how my data is:
cust_id,date,ref_column,ref_date
101,15/01/19,abc,31/01/19
101,17/01/19,abc,31/01/19
101,19/01/19,xyz,31/01/19
102,15/01/19,abc,31/01/19
102,21/01/19,klm,31/01/19
102,25/01/19,xyz,31/01/19
103,15/01/19,xyz,31/01/19
Expected output:
cust_id,date,ref_column,ref_date,aging(in days)
101,15/01/19,abc,31/01/19,2
101,17/01/19,abc,31/01/19,14
101,19/01/19,xyz,31/01/19,0
102,15/01/19,abc,31/01/19,16
102,21/01/19,klm,31/01/19,10
102,25/01/19,xyz,31/01/19,0
103,15/01/19,xyz,31/01/19,0
Aging(in days) is 0 for the last entry for a given cust_id
Here's my approach:
import numpy as np
import pandas as pd

# convert dates to datetime type (ignore if they already are);
# the sample uses day-first strings such as 15/01/19
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df['ref_date'] = pd.to_datetime(df['ref_date'], dayfirst=True)

# customer group
groups = df.groupby('cust_id')

# where ref_column is the same as in the next row:
same_ = df['ref_column'].eq(groups['ref_column'].shift(-1))

# update these ones
df['aging'] = np.where(same_,
                       -groups['date'].diff(-1).dt.days,         # same ref as next row
                       df['ref_date'].sub(df['date']).dt.days)   # different ref than next row

# update last elements in groups:
last_idx = groups['date'].idxmax()
df.loc[last_idx, 'aging'] = 0
Output:
cust_id date ref_column ref_date aging
0 101 2019-01-15 abc 2019-01-31 2.0
1 101 2019-01-17 abc 2019-01-31 14.0
2 101 2019-01-19 xyz 2019-01-31 0.0
3 102 2019-01-15 abc 2019-01-31 16.0
4 102 2019-01-21 klm 2019-01-31 10.0
5 102 2019-01-25 xyz 2019-01-31 0.0
6 103 2019-01-15 xyz 2019-01-31 0.0

Filtering and comparing dates with Pandas

I would like to know how to filter dates at all the different time levels, i.e. find dates by year, month, day, hour, minute and/or second. For example, how do I find all dates that happened in 2014, or in January 2014, or only on 2nd January 2014, or ... down to the second?
So I have my date and time dataframe generated from pd.to_datetime
df
timeStamp
0 2014-01-02 21:03:04
1 2014-02-02 21:03:05
2 2016-02-04 18:03:10
So if I filter by the year 2014 then I would have as output:
timeStamp
0 2014-01-02 21:03:04
1 2014-02-02 21:03:05
Or as a different example I want to know the dates that happened in 2014 and at the 2nd of each month. This would also result in:
timeStamp
0 2014-01-02 21:03:04
1 2014-02-02 21:03:05
But if I asked for a date that happened on the 2nd of January 2014
timeStamp
0 2014-01-02 21:03:04
How can I achieve this at all the different levels?
Also how do you compare dates at these different levels to create an array of boolean indices?
You can filter your dataframe via boolean indexing like so:
df.loc[df['timeStamp'].dt.year == 2014]     # everything in 2014
df.loc[df['timeStamp'].dt.month == 5]       # everything in May of any year
df.loc[df['timeStamp'].dt.second == 4]      # everything at second == 4
df.loc[df['timeStamp'] == '2014-01-02']     # exact match (midnight on Jan 2, 2014)
df.loc[pd.to_datetime(df['timeStamp'].dt.date) == '2014-01-02']   # any time on Jan 2, 2014
... and so on and so forth.
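The same masks can be combined with & or | to filter on several levels at once, which also answers the question about building an array of boolean indices; a minimal sketch:
ts = df['timeStamp']
mask = (ts.dt.year == 2014) & (ts.dt.day == 2)   # the 2nd of any month in 2014
print(mask)           # boolean Series usable as an index
print(df.loc[mask])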
If you set the timestamp column as the index (with a datetime dtype) to get a DatetimeIndex, then you can use the following Partial String Indexing syntax:
df = df.set_index('timeStamp')
df.loc['2014']        # gets all of 2014
df.loc['2014-01']     # gets all of Jan 2014
df.loc['01-02-2014']  # gets all of Jan 2, 2014
I would just create a string series, then use str.contains() with wildcards. That will give you whatever granularity you're looking for.
s = df['timeStamp'].map(lambda x: x.strftime('%Y-%m-%d %H:%M:%S'))
print(df[s.str.contains('2014-..-.. ..:..:..')])
print(df[s.str.contains('2014-..-02 ..:..:..')])
print(df[s.str.contains('....-02-.. ..:..:..')])
print(df[s.str.contains('....-..-.. 18:03:10')])
Output:
timeStamp
0 2014-01-02 21:03:04
1 2014-02-02 21:03:05
timeStamp
0 2014-01-02 21:03:04
1 2014-02-02 21:03:05
timeStamp
1 2014-02-02 21:03:05
2 2016-02-04 18:03:10
timeStamp
2 2016-02-04 18:03:10
I think this also solves your question about boolean indices:
print(s.str.contains('....-..-.. 18:03:10'))
Output:
0 False
1 False
2 True
Name: timeStamp, dtype: bool