How to calculate monthly normals? - pandas

I have this df:
CODE TMAX TMIN PP
DATE
1991-01-01 000130 32.6 23.4 0.0
1991-01-02 000130 31.2 22.4 0.0
1991-01-03 000130 32.0 NaN 0.0
1991-01-04 000130 32.2 23.0 0.0
1991-01-05 000130 30.5 22.0 0.0
... ... ... ...
2020-12-27 158328 NaN NaN NaN
2020-12-28 158328 NaN NaN NaN
2020-12-29 158328 NaN NaN NaN
2020-12-30 158328 NaN NaN NaN
2020-12-31 158328 NaN NaN NaN
I have 30 years of data (1991-2020) for each CODE, and I want to calculate monthly normals of TMAX, TMIN and PP. For TMAX and TMIN I should first average the daily values within each month, so if January has 31 days I take the mean of those 31 values and get one value for January 1991, one for January 1992, etc. That gives me 30 Januarys (January 1991, January 1992, ..., January 2020), 30 Februarys, etc. After this I should average each group of months (Januarys with Januarys, Februarys with Februarys, etc.), so I end up with 12 values (one value for every month). Example:
(January1991 + January1992 + ..... + January 2020) /30
(February1991 + February1992 + ..... + February 2020) /30
.... same for every group of months.
So I'm using this code, but I don't know if it's OK.
from datetime import date
normalstemp=df[['CODE','TMAX','TMIN']].groupby([df.CODE, df.index.month]).mean().round(1)
For PP (precipitation) I should sum the daily PP values of each month, so if January has 31 days I sum all 31 values and get one total for January 1991, one for January 1992, etc. That gives me 30 Januarys (January 1991, January 1992, ..., January 2020), 30 Februarys (February 1991, February 1992, ..., February 2020), etc. After this I should average each group of months (Januarys with Januarys, Februarys with Februarys, etc.), so I end up with 12 values (one value for every month, the same as for TMAX and TMIN).
Example:
(January1991 + January1992 + ..... + January 2020) /30
(February1991 + February1992 + ..... + February 2020) /30
.... same for every group of months.
So I'm using this code, but I know it isn't correct because I'm not getting the mean of the Januarys, Februarys, etc.
normalspp=df[['CODE','PP']].groupby([df.CODE, df.index.month]).sum().round(1)
I only have basic knowledge of Python, so I would appreciate it if you can help me.
Thanks in advance.

Ver 2: Average by Year-Month and by Month
import pandas as pd
import numpy as np
x = pd.date_range(start='1/1/1991', end='12/31/2020',freq='D')
df = pd.DataFrame({'Date':x.tolist()*2,
'Code':['000130']*10958 + ['158328']*10958,
'TMAX': np.random.randint(6,10, size=21916),
'TMIN': np.random.randint(1,5, size=21916)
})
# Create a Month column to get Average by Month for all years
df['Month'] = df.Date.dt.month
# Create a Year-Month column to get Average of each Month within the Year
df['Year_Mon'] = df.Date.dt.strftime('%Y-%m')
# Print the Average of each Month within each Year for each code
print (df.groupby(['Code','Year_Mon'])['TMAX'].mean())
print (df.groupby(['Code','Year_Mon'])['TMIN'].mean())
# Print the Average of each Month irrespective of the year (for each code)
print (df.groupby(['Code','Month'])['TMAX'].mean())
print (df.groupby(['Code','Month'])['TMIN'].mean())
If you want to give a name for the TMAX Average value, you can add the reset_index and rename column. Here's code to do that.
print (df.groupby(['Code','Year_Mon'])['TMAX'].mean().reset_index().rename(columns={'TMAX':'TMAX_Avg'}))
The output of this will be:
Average of TMAX for each Year-Month for each Code
Code Year_Mon
000130 1991-01 7.225806
1991-02 7.678571
1991-03 7.354839
1991-04 7.500000
1991-05 7.516129
...
158328 2020-08 7.387097
2020-09 7.300000
2020-10 7.516129
2020-11 7.500000
2020-12 7.451613
Name: TMAX, Length: 720, dtype: float64
Average of TMIN for each Year-Month for each Code
Code Year_Mon
000130 1991-01 2.419355
1991-02 2.571429
1991-03 2.193548
1991-04 2.366667
1991-05 2.451613
...
158328 2020-08 2.451613
2020-09 2.566667
2020-10 2.612903
2020-11 2.666667
2020-12 2.580645
Name: TMIN, Length: 720, dtype: float64
Average of TMAX for each Month for each Code (all years combined)
Code Month
000130 1 7.540860
2 7.536557
3 7.482796
4 7.486667
5 7.444086
6 7.570000
7 7.507527
8 7.529032
9 7.501111
10 7.401075
11 7.482222
12 7.517204
158328 1 7.532258
2 7.563679
3 7.490323
4 7.555556
5 7.500000
6 7.497778
7 7.545161
8 7.483871
9 7.526667
10 7.529032
11 7.547778
12 7.524731
Name: TMAX, dtype: float64
Average of TMIN for each Month for each Code (all years combined)
This has the same Code/Month layout as the TMAX result above, with Name: TMIN. Since TMIN is drawn with np.random.randint(1,5), each monthly mean lies between 1 and 4.
Ver 1: Average by Year and Month for each Code
Here is one way to do this.
You can create a Year_Mon column. Then get the average of TMAX and TMIN for each month within the year by doing a groupby on ('Code','Year_Mon'). Other columns, such as PP, can be handled the same way.
See code for more details.
import pandas as pd
import numpy as np
# create a range of dates from 1/1/2018 thru 12/31/2020 for each day
x = pd.date_range(start='1/1/2018', end='12/31/2020',freq='D')
# create a dataframe with the date ranges x 2 for two codes
# TMIN is a random integer from 1 thru 4 (randint excludes the upper bound) - you can put your actual data here
# TMAX is a random integer from 6 thru 9 (randint excludes the upper bound) - you can put your actual data here
df = pd.DataFrame({'Date':x.tolist()*2,
'Code':['000130']*1096 + ['158328']*1096,
'TMAX': np.random.randint(6,10, size=2192),
'TMIN': np.random.randint(1,5, size=2192)
})
# Create a Year-Month column using df.Date.dt.strftime
df['Year_Mon'] = df.Date.dt.strftime('%Y-%m')
# Calculate the Average of TMAX and TMIN using groupby Code and Year_Mon
df['TMAX_Avg'] = df.groupby(['Code','Year_Mon'])['TMAX'].transform('mean')
df['TMIN_Avg'] = df.groupby(['Code','Year_Mon'])['TMIN'].transform('mean')
The output of this will be:
Date Code TMAX TMIN Year_Mon TMAX_Avg TMIN_Avg
0 2018-01-01 000130 8 2 2018-01 7.451613 2.129032
1 2018-01-02 000130 7 4 2018-01 7.451613 2.129032
2 2018-01-03 000130 9 2 2018-01 7.451613 2.129032
3 2018-01-04 000130 6 1 2018-01 7.451613 2.129032
4 2018-01-05 000130 9 4 2018-01 7.451613 2.129032
5 2018-01-06 000130 6 1 2018-01 7.451613 2.129032
6 2018-01-07 000130 9 2 2018-01 7.451613 2.129032
7 2018-01-08 000130 9 2 2018-01 7.451613 2.129032
8 2018-01-09 000130 7 2 2018-01 7.451613 2.129032
9 2018-01-10 000130 8 2 2018-01 7.451613 2.129032
10 2018-01-11 000130 8 3 2018-01 7.451613 2.129032
11 2018-01-12 000130 7 2 2018-01 7.451613 2.129032
12 2018-01-13 000130 7 1 2018-01 7.451613 2.129032
13 2018-01-14 000130 8 1 2018-01 7.451613 2.129032
14 2018-01-15 000130 7 3 2018-01 7.451613 2.129032
15 2018-01-16 000130 6 1 2018-01 7.451613 2.129032
16 2018-01-17 000130 6 3 2018-01 7.451613 2.129032
17 2018-01-18 000130 9 3 2018-01 7.451613 2.129032
18 2018-01-19 000130 7 2 2018-01 7.451613 2.129032
19 2018-01-20 000130 8 1 2018-01 7.451613 2.129032
20 2018-01-21 000130 9 4 2018-01 7.451613 2.129032
21 2018-01-22 000130 6 2 2018-01 7.451613 2.129032
22 2018-01-23 000130 9 4 2018-01 7.451613 2.129032
23 2018-01-24 000130 6 2 2018-01 7.451613 2.129032
24 2018-01-25 000130 8 3 2018-01 7.451613 2.129032
25 2018-01-26 000130 6 2 2018-01 7.451613 2.129032
26 2018-01-27 000130 8 1 2018-01 7.451613 2.129032
27 2018-01-28 000130 8 3 2018-01 7.451613 2.129032
28 2018-01-29 000130 6 1 2018-01 7.451613 2.129032
29 2018-01-30 000130 6 1 2018-01 7.451613 2.129032
30 2018-01-31 000130 8 1 2018-01 7.451613 2.129032
31 2018-02-01 000130 7 1 2018-02 7.250000 2.428571
32 2018-02-02 000130 6 2 2018-02 7.250000 2.428571
33 2018-02-03 000130 6 4 2018-02 7.250000 2.428571
34 2018-02-04 000130 8 3 2018-02 7.250000 2.428571
35 2018-02-05 000130 8 2 2018-02 7.250000 2.428571
36 2018-02-06 000130 6 3 2018-02 7.250000 2.428571
37 2018-02-07 000130 6 3 2018-02 7.250000 2.428571
38 2018-02-08 000130 7 1 2018-02 7.250000 2.428571
39 2018-02-09 000130 9 4 2018-02 7.250000 2.428571
40 2018-02-10 000130 8 2 2018-02 7.250000 2.428571
41 2018-02-11 000130 7 4 2018-02 7.250000 2.428571
42 2018-02-12 000130 8 1 2018-02 7.250000 2.428571
43 2018-02-13 000130 6 4 2018-02 7.250000 2.428571
44 2018-02-14 000130 6 1 2018-02 7.250000 2.428571
45 2018-02-15 000130 6 4 2018-02 7.250000 2.428571
46 2018-02-16 000130 8 2 2018-02 7.250000 2.428571
47 2018-02-17 000130 7 3 2018-02 7.250000 2.428571
48 2018-02-18 000130 9 3 2018-02 7.250000 2.428571
49 2018-02-19 000130 8 2 2018-02 7.250000 2.428571
If you want only the Code, Year-Month, and TMIN and TMAX values, you can do:
TMAX average for each month within the year:
print (df.groupby(['Code','Year_Mon'])['TMAX'].mean())
Output will be:
Code Year_Mon
000130 2018-01 7.451613
2018-02 7.250000
2018-03 7.774194
2018-04 7.366667
2018-05 7.451613
...
158328 2020-08 7.935484
2020-09 7.666667
2020-10 7.548387
2020-11 7.333333
2020-12 7.580645
TMIN average for each month within the year:
print (df.groupby(['Code','Year_Mon'])['TMIN'].mean())
Output will be:
Code Year_Mon
000130 2018-01 2.129032
2018-02 2.428571
2018-03 2.451613
2018-04 2.500000
2018-05 2.677419
...
158328 2020-08 2.709677
2020-09 2.166667
2020-10 2.161290
2020-11 2.366667
2020-12 2.548387
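For the normals described in the question, PP needs a monthly total before the averaging step, and both versions above stop short of that. The following is a minimal sketch, not a definitive implementation. It assumes DATE is a DatetimeIndex named DATE as in the question, and it uses made-up constant values (TMAX 30.0, TMIN 20.0, PP 1.0 per day) so the result is easy to check by hand. The helper name monthly_normals is hypothetical. It also assumes that a month with no PP observations at all should be left out of the average, which is why the sum uses min_count=1.

```python
import numpy as np
import pandas as pd

def monthly_normals(df):
    """Two-step normals: one value per CODE/year/month, then the mean per CODE/month."""
    d = df.reset_index()                 # DATE index becomes a regular column
    d["YEAR"] = d["DATE"].dt.year
    d["MONTH"] = d["DATE"].dt.month
    # Step 1: monthly mean for temperatures, monthly total for precipitation.
    monthly = d.groupby(["CODE", "YEAR", "MONTH"]).agg(
        TMAX=("TMAX", "mean"),
        TMIN=("TMIN", "mean"),
        # min_count=1 keeps an all-NaN month as NaN instead of a misleading 0.0
        PP=("PP", lambda s: s.sum(min_count=1)),
    )
    # Step 2: average the yearly values of each calendar month (NaN months are skipped).
    return monthly.groupby(level=["CODE", "MONTH"]).mean().round(1)

# Made-up data: two station codes, constant values, 1991-2020
dates = pd.date_range("1991-01-01", "2020-12-31", freq="D")
df = pd.DataFrame(
    {
        "CODE": np.repeat(["000130", "158328"], len(dates)),
        "TMAX": 30.0,
        "TMIN": 20.0,
        "PP": 1.0,
    },
    index=pd.DatetimeIndex(np.tile(dates, 2), name="DATE"),
)
normals = monthly_normals(df)
print(normals.loc["000130"].head(3))
```

With 1.0 of PP per day, the January normal comes out as 31.0 and the February normal as 28.3 (28 days plus 8 leap days spread over 30 years). The single groupby([df.CODE, df.index.month]).sum() from the question would return the 30-year total instead. For TMAX and TMIN, a single groupby(...).mean() over all days matches the two-step mean when each year contributes the same number of valid days to a month; with missing values or leap-year Februarys the two can differ slightly.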

Related

How to change monthly table into one column with date index?

I downloaded the Broad Dollar Index from FRED with the following format:
DATE RTWEXBGS
0 2006-01-01 100.0000
1 2006-02-01 100.2651
2 2006-03-01 100.5424
3 2006-04-01 100.0540
4 2006-05-01 97.8681
.. ... ...
194 2022-03-01 111.2659
195 2022-04-01 111.8324
196 2022-05-01 114.6075
197 2022-06-01 115.6957
198 2022-07-01 118.2674
I also got an Excel file of inflation rate with a different format:
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Annual
0 2022 0.07480 0.07871 0.08542 0.08259 0.08582 0.09060 0.08525 NaN NaN NaN NaN NaN NaN
1 2021 0.01400 0.01676 0.02620 0.04160 0.04993 0.05391 0.05365 0.05251 0.05390 0.06222 0.06809 0.07036 0.04698
2 2020 0.02487 0.02335 0.01539 0.00329 0.00118 0.00646 0.00986 0.01310 0.01371 0.01182 0.01175 0.01362 0.01234
3 2019 0.01551 0.01520 0.01863 0.01996 0.01790 0.01648 0.01811 0.01750 0.01711 0.01764 0.02051 0.02285 0.01812
4 2018 0.02071 0.02212 0.02360 0.02463 0.02801 0.02872 0.02950 0.02699 0.02277 0.02522 0.02177 0.01910 0.02443
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
104 1918 0.19658 0.17500 0.16667 0.12698 0.13281 0.13077 0.17969 0.18462 0.18045 0.18519 0.20741 0.20438 0.17284
105 1917 0.12500 0.15385 0.14286 0.18868 0.19626 0.20370 0.18519 0.19266 0.19820 0.19469 0.17391 0.18103 0.17841
106 1916 0.02970 0.04000 0.06061 0.06000 0.05941 0.06931 0.06931 0.07921 0.09901 0.10784 0.11650 0.12621 0.07667
107 1915 0.01000 0.01010 0.00000 0.02041 0.02020 0.02020 0.01000 -0.00980 -0.00980 0.00990 0.00980 0.01980 0.00915
108 1914 0.02041 0.01020 0.01020 0.00000 0.02062 0.01020 0.01010 0.03030 0.02000 0.01000 0.00990 0.01000 0.01349
How do I change the inflation table into a format similar to the dollar index?
Something like this (the Annual column is not taken into account):
df
###
Year Jan Feb Mar Apr May Jun Jul Aug \
0 2022 0.07480 0.07871 0.08542 0.08259 0.08582 0.09060 0.08525 NaN
1 2021 0.01400 0.01676 0.02620 0.04160 0.04993 0.05391 0.05365 NaN
2 2020 0.02487 0.02335 0.01539 0.00329 0.00118 0.00646 0.00986 NaN
Sep Oct Nov Dec Annual
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
month = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df_melt = pd.melt(df, id_vars=['Year'], value_vars=month, var_name='Month', value_name='Sales')
# build a 'YYYY-Mon' string and convert it to datetime type
df_melt['Date'] = pd.to_datetime(df_melt['Year'].astype(str) + '-' + df_melt['Month'].astype(str), format='%Y-%b')
df_melt = df_melt[['Date', 'Sales']]
df_melt
###
Date Sales
0 2022-01-01 0.07480
1 2021-01-01 0.01400
2 2020-01-01 0.02487
3 2022-02-01 0.07871
4 2021-02-01 0.01676
5 2020-02-01 0.02335
6 2022-03-01 0.08542
7 2021-03-01 0.02620
8 2020-03-01 0.01539
9 2022-04-01 0.08259
10 2021-04-01 0.04160
11 2020-04-01 0.00329
12 2022-05-01 0.08582
13 2021-05-01 0.04993
14 2020-05-01 0.00118
15 2022-06-01 0.09060
16 2021-06-01 0.05391
17 2020-06-01 0.00646
18 2022-07-01 0.08525
19 2021-07-01 0.05365
20 2020-07-01 0.00986
21 2022-08-01 NaN
22 2021-08-01 NaN
23 2020-08-01 NaN
24 2022-09-01 NaN
25 2021-09-01 NaN
26 2020-09-01 NaN
27 2022-10-01 NaN
28 2021-10-01 NaN
29 2020-10-01 NaN
30 2022-11-01 NaN
31 2021-11-01 NaN
32 2020-11-01 NaN
33 2022-12-01 NaN
34 2021-12-01 NaN
35 2020-12-01 NaN
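To get closer to the dollar index layout, the melted rows can also be sorted chronologically and the empty months dropped. A small sketch, assuming an English locale for the %b month abbreviations; the three-month sample frame and the column name Inflation are made up for illustration:

```python
import pandas as pd

# Tiny stand-in for the inflation table (Annual column left out)
df = pd.DataFrame({
    "Year": [2022, 2021],
    "Jan": [0.07480, 0.01400],
    "Feb": [0.07871, 0.01676],
    "Mar": [float("nan"), 0.02620],   # a month that has not been published yet
})
months = ["Jan", "Feb", "Mar"]

# Wide -> long: one row per (Year, Month)
tidy = df.melt(id_vars="Year", value_vars=months,
               var_name="Month", value_name="Inflation")
tidy["DATE"] = pd.to_datetime(tidy["Year"].astype(str) + "-" + tidy["Month"],
                              format="%Y-%b")

# Drop unpublished months, order by date, keep the two columns of interest
result = (tidy.dropna(subset=["Inflation"])
              .sort_values("DATE")[["DATE", "Inflation"]]
              .reset_index(drop=True))
print(result)
```

The result starts at 2021-01-01 and ends at 2022-02-01, one row per month, which is the same shape as the DATE/RTWEXBGS table.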

Moving Average Pandas Across Group

My data has the following structure:
import numpy as np
import pandas as pd
np.random.seed(25)
tdf = pd.DataFrame({'person_id' :[1,1,1,1,
2,2,
3,3,3,3,3,
4,4,4,
5,5,5,5,5,5,5,
6,
7,7,
8,8,8,8,8,8,8,
9,9,
10,10
],
'Date': ['2021-01-02','2021-01-05','2021-01-07','2021-01-09',
'2021-01-02','2021-01-05',
'2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11',
'2021-01-02','2021-01-05','2021-01-07',
'2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11','2021-01-13','2021-01-15',
'2021-01-02',
'2021-01-02','2021-01-05',
'2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11','2021-01-13','2021-01-15',
'2021-01-02','2021-01-05',
'2021-01-02','2021-01-05'
],
'Quantity': np.floor(np.random.random(size=35)*100)
})
I want to calculate a moving average (2 periods) over Date. For the first MA, we take 2021-01-02 and 2021-01-05 across all observations and calculate the MA (50), and similarly for the other dates. The output need not follow any particular structure; I just need the date and MA columns in the final data.
Thanks!
IIUC, you can aggregate the similar dates first, getting the sum and count.
Then take the sum per rolling 2 dates (here it doesn't look like you want to take care of a defined period but rather raw successive values, so I am assuming here prior sorting).
Finally, perform the ratio of sum and count to get the mean:
g = tdf.groupby('Date')['Quantity']
out = g.sum().rolling(2).sum()/g.count().rolling(2).sum()
output:
Date
2021-01-02 NaN
2021-01-05 50.210526
2021-01-07 45.071429
2021-01-09 41.000000
2021-01-11 44.571429
2021-01-13 48.800000
2021-01-15 50.500000
Name: Quantity, dtype: float64
joining the original data:
g = tdf.groupby('Date')['Quantity']
s = g.sum().rolling(2).sum()/g.count().rolling(2).sum()
tdf.merge(s.rename('Quantity_MA(2)'), left_on='Date', right_index=True)
output:
person_id Date Quantity Quantity_MA(2)
0 1 2021-01-02 87.0 NaN
4 2 2021-01-02 41.0 NaN
6 3 2021-01-02 68.0 NaN
11 4 2021-01-02 11.0 NaN
14 5 2021-01-02 16.0 NaN
21 6 2021-01-02 51.0 NaN
22 7 2021-01-02 38.0 NaN
24 8 2021-01-02 51.0 NaN
31 9 2021-01-02 90.0 NaN
33 10 2021-01-02 45.0 NaN
1 1 2021-01-05 58.0 50.210526
5 2 2021-01-05 11.0 50.210526
7 3 2021-01-05 43.0 50.210526
12 4 2021-01-05 44.0 50.210526
15 5 2021-01-05 52.0 50.210526
23 7 2021-01-05 99.0 50.210526
25 8 2021-01-05 55.0 50.210526
32 9 2021-01-05 66.0 50.210526
34 10 2021-01-05 28.0 50.210526
2 1 2021-01-07 27.0 45.071429
8 3 2021-01-07 55.0 45.071429
13 4 2021-01-07 58.0 45.071429
16 5 2021-01-07 32.0 45.071429
26 8 2021-01-07 3.0 45.071429
3 1 2021-01-09 18.0 41.000000
9 3 2021-01-09 36.0 41.000000
17 5 2021-01-09 69.0 41.000000
27 8 2021-01-09 71.0 41.000000
10 3 2021-01-11 40.0 44.571429
18 5 2021-01-11 36.0 44.571429
28 8 2021-01-11 42.0 44.571429
19 5 2021-01-13 83.0 48.800000
29 8 2021-01-13 43.0 48.800000
20 5 2021-01-15 48.0 50.500000
30 8 2021-01-15 28.0 50.500000
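The sum/count ratio matters because the dates have different numbers of rows. A toy sketch (values made up) comparing it with a rolling mean of per-date means, which weights every date equally regardless of how many rows it has:

```python
import pandas as pd

tdf = pd.DataFrame({
    "Date": ["2021-01-02", "2021-01-02", "2021-01-05", "2021-01-07"],
    "Quantity": [10.0, 20.0, 60.0, 30.0],
})
g = tdf.groupby("Date")["Quantity"]

# Pooled mean over every row in the 2-date window (the approach used above)
pooled = g.sum().rolling(2).sum() / g.count().rolling(2).sum()

# Mean of the two per-date means: each date counts once
mean_of_means = g.mean().rolling(2).mean()

print(pd.DataFrame({"pooled": pooled, "mean_of_means": mean_of_means}))
```

For the window 2021-01-02 to 2021-01-05 the pooled value is (10 + 20 + 60) / 3 = 30.0, while the mean of means is (15 + 60) / 2 = 37.5. The two agree on the last window, where both dates have a single row.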

Unable to find date in pandas

I have a dataset in this form:
company_name date
0 global_infotech 2019-06-15
1 global_infotech 2020-03-22
2 global_infotech 2020-08-30
3 global_infotech 2018-06-19
4 global_infotech 2018-06-15
5 global_infotech 2018-02-15
6 global_infotech 2018-11-22
7 global_infotech 2019-01-15
8 global_infotech 2018-12-15
9 global_infotech 2019-06-15
10 global_infotech 2018-12-19
11 global_infotech 2019-12-31
12 global_infotech 2019-02-18
13 global_infotech 2018-06-16
14 global_infotech 2019-02-10
15 global_infotech 2019-03-15
16 Qualcom 2019-07-11
17 Qualcom 2018-01-11
18 Qualcom 2018-05-29
19 Qualcom 2018-10-06
20 Qualcom 2018-11-11
21 Qualcom 2019-08-17
22 Qualcom 2019-02-22
23 Qualcom 2019-10-16
24 Qualcom 2018-06-22
25 Qualcom 2018-06-14
26 Qualcom 2018-06-16
27 Syscin 2018-02-10
28 Syscin 2019-02-16
29 Syscin 2018-04-12
30 Syscin 2018-08-22
31 Syscin 2018-09-16
32 Syscin 2019-04-20
33 Syscin 2018-02-28
34 Syscin 2018-01-19
Considering today's date as 1 January 2020, I want to write code that finds the number of times each company name occurs in the last 3 months. For example, if global_infotech appears 5 times from 1 Oct 2019 to 1 Jan 2020, then 5 should appear in front of every global_infotech row, like:
company_name date appearance_count_last_3_months
0 global_infotech 2019-06-15 5
1 global_infotech 2020-03-22 5
2 global_infotech 2020-08-30 5
3 global_infotech 2018-06-19 5
4 global_infotech 2018-06-15 5
5 global_infotech 2018-02-15 5
6 global_infotech 2018-11-22 5
7 global_infotech 2019-01-15 5
8 global_infotech 2018-12-15 5
9 global_infotech 2019-06-15 5
10 global_infotech 2018-12-19 5
11 global_infotech 2019-12-31 5
12 global_infotech 2019-02-18 5
13 global_infotech 2018-06-16 5
14 global_infotech 2019-02-10 5
15 global_infotech 2019-03-15 5
IIUC:
you can create a custom function:
def getcount(company,month=3,df=df):
    df=df.copy()
    df['date']=pd.to_datetime(df['date'],format='%Y-%m-%d',errors='coerce')
    df=df[df['company_name'].eq(company)]
    # largest number of rows in any bin of `month` months for this company
    val=df.groupby(pd.Grouper(key='date',freq=str(month)+'m')).count().max().iloc[0]
    df['appearance_count_last_3_months']=val
    return df
getcount('global_infotech')
#OR
getcount('global_infotech',3)
Update:
since you have 92 different companies so you can use for loop:
lst=[]
for x in df['company_name'].unique():
    lst.append(getcount(x))
out=pd.concat(lst)
If you print out then you will get your desired output
You can first filter the data for the last 3 months, and then groupby company name and merge back into the original dataframe.
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
# sample data
df = pd.DataFrame({
'company_name': ['global_infotech', 'global_infotech', 'Qualcom','another_company'],
'date': ['2019-02-18', '2021-07-02', '2021-07-01','2019-02-18']
})
df['date'] = pd.to_datetime(df['date'])
# filter for last 3 months
summary = df[df['date']>=datetime.now()-relativedelta(months=3)]
# groupby then aggregate with desired column name
summary = summary.rename(columns={'date':'appearance_count_last_3_months'})
summary = summary.groupby('company_name')
summary = summary.agg('count')
# merge summary back into original df, filling missing values with 0
df = df.merge(summary, left_on='company_name', right_index=True, how='left')
df['appearance_count_last_3_months'] = df['appearance_count_last_3_months'].fillna(0).astype('int')
# result:
df
company_name date appearance_count_last_3_months
0 global_infotech 2019-02-18 1
1 global_infotech 2021-07-02 1
2 Qualcom 2021-07-01 1
3 another_company 2019-02-18 0
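The question fixes "today" at 1 January 2020, whereas the code above filters relative to datetime.now(). A minimal sketch with a fixed reference date, assuming the window runs from 1 Oct 2019 to 1 Jan 2020 with both ends included; the five sample rows are a made-up subset and the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "company_name": ["global_infotech", "global_infotech", "global_infotech",
                     "Qualcom", "Syscin"],
    "date": ["2019-12-31", "2019-06-15", "2020-03-22", "2019-10-16", "2018-02-10"],
})
df["date"] = pd.to_datetime(df["date"])

today = pd.Timestamp("2020-01-01")          # reference date from the question
start = today - pd.DateOffset(months=3)     # 2019-10-01

# Count rows per company inside the window (between() includes both ends)
in_window = df["date"].between(start, today)
counts = df.loc[in_window, "company_name"].value_counts()

# Broadcast the per-company count to every row; companies with no hits get 0
df["appearance_count_last_3_months"] = (
    df["company_name"].map(counts).fillna(0).astype(int)
)
print(df)
```

Here global_infotech and Qualcom each have one row inside the window, so all of their rows get 1, and Syscin gets 0.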

pandas: rolling mean on time interval plus grouping on index

I am trying to find the 7-day rolling average for the hour of day for a category. The data frame is indexed on the category id and there is a time stamp plus other columns:
id name ds time x y z
6 red 2020-02-14 00:00:00 10 20 30
6 red 2020-02-14 01:00:00 20 40 50
6 red 2020-02-14 02:00:00 20 20 60
...
6 red 2020-02-21 00:00:00 20 30 60
6 red 2020-02-21 01:00:00 20 40 60
6 red 2020-02-21 02:00:00 10 40 60
7 green 2020-02-14 00:00:00 10 20 30
7 green 2020-02-14 01:00:00 20 40 50
7 green 2020-02-14 02:00:00 20 20 60
...
7 green 2020-02-21 00:00:00 20 30 60
7 green 2020-02-21 01:00:00 20 40 60
7 green 2020-02-21 02:00:00 10 40 60
What I would like as an output (with the rolling columns filled by the rolling mean wherever it is not NaN):
id name ds time x y z rolling_x rolling_y rolling_z
6 red 2020-02-14 00:00:00 10 20 30 NaN NaN NaN
6 red 2020-02-14 01:00:00 20 40 50 NaN NaN NaN
6 red 2020-02-14 02:00:00 20 20 60 NaN NaN NaN
...
6 red 2020-02-21 00:00:00 20 30 60
6 red 2020-02-21 01:00:00 20 40 60
6 red 2020-02-21 02:00:00 10 40 60
7 green 2020-02-14 00:00:00 10 20 30 NaN NaN NaN
7 green 2020-02-14 01:00:00 20 40 50 NaN NaN NaN
7 green 2020-02-14 02:00:00 20 20 60 NaN NaN NaN
...
7 green 2020-02-21 00:00:00 20 30 60
7 green 2020-02-21 01:00:00 20 40 60
7 green 2020-02-21 02:00:00 10 40 60
My approach:
df = df.assign(day=df['ds time'].dt.normalize(),
               hour=df['ds time'].dt.hour)
ret_df = df.merge(df.drop('ds time', axis=1)
                    .set_index('day')
                    .groupby(['id','hour']).rolling('7D').mean()
                    # errors='ignore' in case the group keys are not returned as columns
                    .drop(['hour','id'], axis=1, errors='ignore'),
                  on=['id','hour','day'],
                  how='left',
                  suffixes=('','_roll')
                 ).drop(['day','hour'], axis=1)
Sample data:
dates = pd.date_range('2020-02-21', '2020-02-25', freq='H')
np.random.seed(1)
df = pd.DataFrame({
'id': np.repeat([6,7], len(dates)),
'ds time': np.tile(dates,2),
'X': np.arange(len(dates)*2),
'Y': np.random.randint(0,10, len(dates)*2)
})
df.head()
Output ret_df.head():
id ds time X Y X_roll Y_roll
0 6 2020-02-21 00:00:00 0 5 0.0 5.0
1 6 2020-02-21 01:00:00 1 8 1.0 8.0
2 6 2020-02-21 02:00:00 2 9 2.0 9.0
3 6 2020-02-21 03:00:00 3 5 3.0 5.0
4 6 2020-02-21 04:00:00 4 0 4.0 0.0
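With only a few days of data it is hard to see the window at work in the head() above, since the first day has nothing earlier to average with. A smaller sketch for a single column, assuming three days of hourly made-up data and rows already sorted by time within each id; the column name X_roll is illustrative:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-02-21", periods=3 * 24, freq="60min")  # 3 days, hourly
df = pd.DataFrame({
    "id": np.repeat([6, 7], len(dates)),
    "ds time": np.tile(dates, 2),
    "X": np.arange(len(dates) * 2, dtype=float),
})
df["hour"] = df["ds time"].dt.hour

# 7-day time window, computed separately for each (id, hour of day) pair
rolled = (
    df.set_index("ds time")
      .groupby(["id", "hour"])["X"]
      .rolling("7D")
      .mean()
      .rename("X_roll")
      .reset_index()
)

# Attach the rolling values to the original rows
out = df.merge(rolled, on=["id", "hour", "ds time"], how="left")
print(out[out["hour"] == 0])
```

For id 6 at hour 0 the X values on the three days are 0, 24 and 48, so X_roll is 0.0, 12.0 and 24.0: each row averages the same hour over the days seen so far within the 7-day window.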

what is wrong with pandas date_time?

Following is the code for an example of using the pandas datetime module. As shown in the output, it is not consistent: it is mixing up day and month. Am I doing something wrong?
dates = ['20/11/17', '12/02/18', '02/05/18', '10/09/18',
'22/06/17', '12/02/15','19/11/17', '04/09/16',
'12/05/18', '11/04/15', '10/04/17', '13/06/16']
data = pd.DataFrame(data=dates, columns=['date'])
data['date_format'] = pd.to_datetime(dates)
data
Output:
date date_format
0 20/11/17 2017-11-20
1 12/02/18 2018-12-02
2 02/05/18 2018-02-05
3 10/09/18 2018-10-09
4 22/06/17 2017-06-22
5 12/02/15 2015-12-02
6 19/11/17 2017-11-19
7 04/09/16 2016-04-09
8 12/05/18 2018-12-05
9 11/04/15 2015-11-04
10 10/04/17 2017-10-04
11 13/06/16 2016-06-13
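The strings are day/month/year, but in the pandas version that produced this output, to_datetime without a hint tries month first and only falls back to day first when the month would be out of range (as in 20/11/17), which explains the mixed result. A minimal sketch of one way to make the parsing consistent, assuming every string follows dd/mm/yy:

```python
import pandas as pd

dates = ['20/11/17', '12/02/18', '02/05/18', '10/09/18']
data = pd.DataFrame(data=dates, columns=['date'])

# An explicit format removes the guesswork: day, then month, then two-digit year
data['date_format'] = pd.to_datetime(data['date'], format='%d/%m/%y')
print(data)
```

With the format given, 12/02/18 is parsed as 12 February 2018 rather than 2 December 2018. pd.to_datetime(data['date'], dayfirst=True) is a looser alternative when the exact format is not known in advance.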