Consider this sample data:
Month Location Products Sales Profit
JAN 1 43 32 20
JAN 2 82 54 25
JAN 3 64 43 56
FEB 1 37 28 78
FEB 2 18 15 34
FEB 3 5 2 4
MAR 1 47 40 14
The multi-index transformation I am trying to achieve is this:
JAN FEB MAR
Location Products Sales Profit Products Sales Profit Products Sales Profit
1 43 32 20 37 28 78 47 40 14
2 82 54 25 18 15 34 null null null
3 64 43 56 5 2 4 null null null
I tried versions of this:
df.stack().to_frame().T
It put all the data into one row. So, that's not the goal.
I presume I am close in that it should be a stacking or unstacking, melting or unmelting, but my attempts have all resulted in data oatmeal at this point. Appreciate your time trying to solve this one.
You can use pivot with reorder_levels and sort_index():
df.pivot(index='Location',columns='Month').reorder_levels(order=[1,0],axis=1).sort_index(axis=1)
Month FEB JAN MAR
Products Profit Sales Products Profit Sales Products Profit Sales
Location
1 37.0 78.0 28.0 43.0 20.0 32.0 47.0 14.0 40.0
2 18.0 34.0 15.0 82.0 25.0 54.0 NaN NaN NaN
3 5.0 4.0 2.0 64.0 56.0 43.0 NaN NaN NaN
In case you are interested, this answer explains the difference between swaplevel and reorder_levels.
Use pivot:
>>> df.pivot(index='Location', columns='Month').swaplevel(axis=1).sort_index(axis=1)
Month FEB JAN MAR
Products Profit Sales Products Profit Sales Products Profit Sales
Location
1 37.0 78.0 28.0 43.0 20.0 32.0 47.0 14.0 40.0
2 18.0 34.0 15.0 82.0 25.0 54.0 NaN NaN NaN
3 5.0 4.0 2.0 64.0 56.0 43.0 NaN NaN NaN
To preserve the month order, convert the Month column to an ordered CategoricalDtype before pivoting:
df['Month'] = df['Month'].astype(pd.CategoricalDtype(df['Month'].unique(), ordered=True))
out = df.pivot(index='Location', columns='Month').swaplevel(axis=1).sort_index(axis=1)
print(out)
# Output:
Month JAN FEB MAR
Products Profit Sales Products Profit Sales Products Profit Sales
Location
1 43.0 20.0 32.0 37.0 78.0 28.0 47.0 14.0 40.0
2 82.0 25.0 54.0 18.0 34.0 15.0 NaN NaN NaN
3 64.0 56.0 43.0 5.0 4.0 2.0 NaN NaN NaN
Update 2
Try to force the order of level 2 columns:
df1 = df.set_index(['Month', 'Location'])
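# pass categories explicitly; otherwise they are inferred in sorted (alphabetical) order
df1.columns = pd.CategoricalIndex(df1.columns, categories=df1.columns, ordered=True)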
df1 = df1.unstack('Month').swaplevel(axis=1).sort_index(axis=1)
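As an alternative to the categorical approach, the column order can be fixed after pivoting by reindexing against an explicitly built MultiIndex. This is only a minimal sketch, not part of the answers above: the names months, values, target and wide are illustrative, and the sample frame is rebuilt by hand from the question's data.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Month": ["JAN", "JAN", "JAN", "FEB", "FEB", "FEB", "MAR"],
    "Location": [1, 2, 3, 1, 2, 3, 1],
    "Products": [43, 82, 64, 37, 18, 5, 47],
    "Sales": [32, 54, 43, 28, 15, 2, 40],
    "Profit": [20, 25, 56, 78, 34, 4, 14],
})

# Desired order of both column levels (illustrative names)
months = list(df["Month"].unique())        # order of first appearance
values = ["Products", "Sales", "Profit"]   # original column order

# Pivot yields (value, Month) columns; swap the levels to get (Month, value)
wide = df.pivot(index="Location", columns="Month", values=values).swaplevel(axis=1)

# Reindex against the full product of months x values to impose the order
target = pd.MultiIndex.from_product([months, values])
wide = wide.reindex(columns=target)

print(wide)
```

Reindexing avoids sort_index entirely, so neither level is sorted alphabetically; combinations missing from the data (for example MAR at locations 2 and 3) come out as NaN.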
I have a dataframe of values that are mostly (but not always) quarterly.
I need to fill in any missing months so that it is complete.
Here I need to expand it into a complete df covering 2015-12 through 2021-03.
Thank you.
id date amt rate
0 15856 2015-12-31 85.09 0.0175
1 15857 2016-03-31 135.60 0.0175
2 15858 2016-06-30 135.91 0.0175
3 15859 2016-09-30 167.27 0.0175
4 15860 2016-12-31 173.32 0.0175
....
19 15875 2020-09-30 305.03 0.0175
20 15876 2020-12-31 354.09 0.0175
21 15877 2021-03-31 391.19 0.0175
You can use pd.date_range() with freq='M' to generate the month-end dates, then reindex on the datetime index. Give the end point as a full date: '2021-03' parses as 2021-03-01, which would drop the final 2021-03-31 row. (pandas 2.2 and later spell the month-end alias 'ME'.)
df_ = df.set_index('date').reindex(pd.date_range('2015-12-31', '2021-03-31', freq='M')).reset_index().rename(columns={'index': 'date'})
print(df_)
date id amt rate
0 2015-12-31 15856.0 85.09 0.0175
1 2016-01-31 NaN NaN NaN
2 2016-02-29 NaN NaN NaN
3 2016-03-31 15857.0 135.60 0.0175
4 2016-04-30 NaN NaN NaN
.. ... ... ... ...
58 2020-10-31 NaN NaN NaN
59 2020-11-30 NaN NaN NaN
60 2020-12-31 15876.0 354.09 0.0175
61 2021-01-31 NaN NaN NaN
62 2021-02-28 NaN NaN NaN
63 2021-03-31 15877.0 391.19 0.0175
To fill the NaN value, you can use df_.fillna(0).
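The reindexing idea can be tried on a small frame first. The sketch below is illustrative only: it uses the first three rows of the question's data, and it passes pd.offsets.MonthEnd() as the frequency so it does not depend on which month-end alias the installed pandas version accepts. The names month_ends, full and filled are made up for this example.

```python
import pandas as pd

# First three rows of the question's data
df = pd.DataFrame({
    "id": [15856, 15857, 15858],
    "date": pd.to_datetime(["2015-12-31", "2016-03-31", "2016-06-30"]),
    "amt": [85.09, 135.60, 135.91],
    "rate": [0.0175, 0.0175, 0.0175],
})

# Every month end between the first and last date, inclusive
month_ends = pd.date_range("2015-12-31", "2016-06-30", freq=pd.offsets.MonthEnd())

# Reindex on the date so missing months appear as NaN rows
full = (
    df.set_index("date")
      .reindex(month_ends)
      .rename_axis("date")
      .reset_index()
)

# Replace the gaps with zeros, as suggested above
filled = full.fillna(0)
print(filled)
```

Note that id becomes a float column after reindexing because the inserted rows hold NaN.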
I have this df:
CODE TMAX TMIN PP
DATE
1991-01-01 000130 32.6 23.4 0.0
1991-01-02 000130 31.2 22.4 0.0
1991-01-03 000130 32.0 NaN 0.0
1991-01-04 000130 32.2 23.0 0.0
1991-01-05 000130 30.5 22.0 0.0
... ... ... ...
2020-12-27 158328 NaN NaN NaN
2020-12-28 158328 NaN NaN NaN
2020-12-29 158328 NaN NaN NaN
2020-12-30 158328 NaN NaN NaN
2020-12-31 158328 NaN NaN NaN
I have 30 years of data (1991-2020) for each CODE, and I want to calculate monthly normals of TMAX, TMIN and PP. For TMAX and TMIN, I should first calculate the average for every month: if January has 31 days, I take the mean of those 31 values to get one value each for January 1991, January 1992, and so on. That gives 30 Januarys (January 1991, January 1992, ..., January 2020), 30 Februarys, etc. After this I should average each group of months (Januarys with Januarys, Februarys with Februarys, etc.), so I end up with 12 values, one per month. Example:
(January1991 + January1992 + ..... + January 2020) /30
(February1991 + February1992 + ..... + February 2020) /30
.... same for every group of months.
So I'm using this code, but I don't know if it's OK.
from datetime import date
normalstemp=df[['CODE','TMAX','TMIN']].groupby([df.CODE, df.index.month]).mean().round(1)
For PP (precipitation) I should sum the daily values within each month: if January has 31 days, I sum all 31 values to get one total each for January 1991, January 1992, etc. That gives 30 Januarys (January 1991, January 1992, ..., January 2020), 30 Februarys (February 1991, February 1992, ..., February 2020), etc. After this I should average each group of months (Januarys with Januarys, Februarys with Februarys, etc.), so I end up with 12 values, one per month, the same as for TMAX and TMIN.
Example:
(January1991 + January1992 + ..... + January 2020) /30
(February1991 + February1992 + ..... + February 2020) /30
.... same for every group of months.
So I'm using this code, but I know it isn't correct because I'm not getting the mean of the Januarys, Februarys, etc.
normalspp=df[['CODE','PP']].groupby([df.CODE, df.index.month]).sum().round(1)
I only have basic knowledge of Python, so I would appreciate any help.
Thanks in advance.
Ver 2: Average by Year-Month and by Month
import pandas as pd
import numpy as np
x = pd.date_range(start='1/1/1991', end='12/31/2020',freq='D')
df = pd.DataFrame({'Date':x.tolist()*2,
'Code':['000130']*10958 + ['158328']*10958,
'TMAX': np.random.randint(6,10, size=21916),
'TMIN': np.random.randint(1,5, size=21916)
})
# Create a Month column to get Average by Month for all years
df['Month'] = df.Date.dt.month
# Create a Year-Month column to get Average of each Month within the Year
df['Year_Mon'] = df.Date.dt.strftime('%Y-%m')
# Print the Average of each Month within each Year for each code
print (df.groupby(['Code','Year_Mon'])['TMAX'].mean())
print (df.groupby(['Code','Year_Mon'])['TMIN'].mean())
# Print the Average of each Month irrespective of the year (for each code)
print (df.groupby(['Code','Month'])['TMAX'].mean())
print (df.groupby(['Code','Month'])['TMIN'].mean())
If you want to give a name for the TMAX Average value, you can add the reset_index and rename column. Here's code to do that.
print (df.groupby(['Code','Year_Mon'])['TMAX'].mean().reset_index().rename(columns={'TMAX':'TMAX_Avg'}))
The output of this will be:
Average of TMAX for each Year-Month for each Code
Code Year_Mon
000130 1991-01 7.225806
1991-02 7.678571
1991-03 7.354839
1991-04 7.500000
1991-05 7.516129
...
158328 2020-08 7.387097
2020-09 7.300000
2020-10 7.516129
2020-11 7.500000
2020-12 7.451613
Name: TMAX, Length: 720, dtype: float64
Average of TMIN for each Year-Month for each Code
Code Year_Mon
000130 1991-01 2.419355
1991-02 2.571429
1991-03 2.193548
1991-04 2.366667
1991-05 2.451613
...
158328 2020-08 2.451613
2020-09 2.566667
2020-10 2.612903
2020-11 2.666667
2020-12 2.580645
Name: TMIN, Length: 720, dtype: float64
Average of TMAX for each Month for each Code (all years combined)
Code Month
000130 1 7.540860
2 7.536557
3 7.482796
4 7.486667
5 7.444086
6 7.570000
7 7.507527
8 7.529032
9 7.501111
10 7.401075
11 7.482222
12 7.517204
158328 1 7.532258
2 7.563679
3 7.490323
4 7.555556
5 7.500000
6 7.497778
7 7.545161
8 7.483871
9 7.526667
10 7.529032
11 7.547778
12 7.524731
Name: TMAX, dtype: float64
Average of TMIN for each Month for each Code (all years combined)
The output has the same shape as the TMAX table above: one row per Code and Month (24 rows), with Name: TMIN.
Ver 1: Average by Year and Month for each Code
Here is one way to do this.
You can create a Year_Mon column, then get the average of TMAX and TMIN for each month within each year by doing a groupby on ('Code','Year_Mon').
See code for more details.
import pandas as pd
import numpy as np
# create a range of dates from 1/1/2018 thru 12/31/2020 for each day
x = pd.date_range(start='1/1/2018', end='12/31/2020',freq='D')
# create a dataframe with the date ranges x 2 for two codes
# TMIN is a random integer from 1 thru 4 (randint excludes the upper bound) - you can put your actual data here
# TMAX is a random integer from 6 thru 9 (randint excludes the upper bound) - you can put your actual data here
df = pd.DataFrame({'Date':x.tolist()*2,
'Code':['000130']*1096 + ['158328']*1096,
'TMAX': np.random.randint(6,10, size=2192),
'TMIN': np.random.randint(1,5, size=2192)
})
# Create a Year-Month column using df.Date.dt.strftime
df['Year_Mon'] = df.Date.dt.strftime('%Y-%m')
# Calculate the Average of TMAX and TMIN using groupby Code and Year_Mon
df['TMAX_Avg'] = df.groupby(['Code','Year_Mon'])['TMAX'].transform('mean')
df['TMIN_Avg'] = df.groupby(['Code','Year_Mon'])['TMIN'].transform('mean')
The output of this will be:
Date Code TMAX TMIN Year_Mon TMAX_Avg TMIN_Avg
0 2018-01-01 000130 8 2 2018-01 7.451613 2.129032
1 2018-01-02 000130 7 4 2018-01 7.451613 2.129032
2 2018-01-03 000130 9 2 2018-01 7.451613 2.129032
3 2018-01-04 000130 6 1 2018-01 7.451613 2.129032
4 2018-01-05 000130 9 4 2018-01 7.451613 2.129032
5 2018-01-06 000130 6 1 2018-01 7.451613 2.129032
6 2018-01-07 000130 9 2 2018-01 7.451613 2.129032
7 2018-01-08 000130 9 2 2018-01 7.451613 2.129032
8 2018-01-09 000130 7 2 2018-01 7.451613 2.129032
9 2018-01-10 000130 8 2 2018-01 7.451613 2.129032
10 2018-01-11 000130 8 3 2018-01 7.451613 2.129032
11 2018-01-12 000130 7 2 2018-01 7.451613 2.129032
12 2018-01-13 000130 7 1 2018-01 7.451613 2.129032
13 2018-01-14 000130 8 1 2018-01 7.451613 2.129032
14 2018-01-15 000130 7 3 2018-01 7.451613 2.129032
15 2018-01-16 000130 6 1 2018-01 7.451613 2.129032
16 2018-01-17 000130 6 3 2018-01 7.451613 2.129032
17 2018-01-18 000130 9 3 2018-01 7.451613 2.129032
18 2018-01-19 000130 7 2 2018-01 7.451613 2.129032
19 2018-01-20 000130 8 1 2018-01 7.451613 2.129032
20 2018-01-21 000130 9 4 2018-01 7.451613 2.129032
21 2018-01-22 000130 6 2 2018-01 7.451613 2.129032
22 2018-01-23 000130 9 4 2018-01 7.451613 2.129032
23 2018-01-24 000130 6 2 2018-01 7.451613 2.129032
24 2018-01-25 000130 8 3 2018-01 7.451613 2.129032
25 2018-01-26 000130 6 2 2018-01 7.451613 2.129032
26 2018-01-27 000130 8 1 2018-01 7.451613 2.129032
27 2018-01-28 000130 8 3 2018-01 7.451613 2.129032
28 2018-01-29 000130 6 1 2018-01 7.451613 2.129032
29 2018-01-30 000130 6 1 2018-01 7.451613 2.129032
30 2018-01-31 000130 8 1 2018-01 7.451613 2.129032
31 2018-02-01 000130 7 1 2018-02 7.250000 2.428571
32 2018-02-02 000130 6 2 2018-02 7.250000 2.428571
33 2018-02-03 000130 6 4 2018-02 7.250000 2.428571
34 2018-02-04 000130 8 3 2018-02 7.250000 2.428571
35 2018-02-05 000130 8 2 2018-02 7.250000 2.428571
36 2018-02-06 000130 6 3 2018-02 7.250000 2.428571
37 2018-02-07 000130 6 3 2018-02 7.250000 2.428571
38 2018-02-08 000130 7 1 2018-02 7.250000 2.428571
39 2018-02-09 000130 9 4 2018-02 7.250000 2.428571
40 2018-02-10 000130 8 2 2018-02 7.250000 2.428571
41 2018-02-11 000130 7 4 2018-02 7.250000 2.428571
42 2018-02-12 000130 8 1 2018-02 7.250000 2.428571
43 2018-02-13 000130 6 4 2018-02 7.250000 2.428571
44 2018-02-14 000130 6 1 2018-02 7.250000 2.428571
45 2018-02-15 000130 6 4 2018-02 7.250000 2.428571
46 2018-02-16 000130 8 2 2018-02 7.250000 2.428571
47 2018-02-17 000130 7 3 2018-02 7.250000 2.428571
48 2018-02-18 000130 9 3 2018-02 7.250000 2.428571
49 2018-02-19 000130 8 2 2018-02 7.250000 2.428571
If you want only the Code, Year-Month, and TMIN and TMAX values, you can do:
TMAX average for each month within the year:
print (df.groupby(['Code','Year_Mon'])['TMAX'].mean())
Output will be:
Code Year_Mon
000130 2018-01 7.451613
2018-02 7.250000
2018-03 7.774194
2018-04 7.366667
2018-05 7.451613
...
158328 2020-08 7.935484
2020-09 7.666667
2020-10 7.548387
2020-11 7.333333
2020-12 7.580645
TMIN average for each month within the year:
print (df.groupby(['Code','Year_Mon'])['TMIN'].mean())
Output will be:
Code Year_Mon
000130 2018-01 2.129032
2018-02 2.428571
2018-03 2.451613
2018-04 2.500000
2018-05 2.677419
...
158328 2020-08 2.709677
2020-09 2.166667
2020-10 2.161290
2020-11 2.366667
2020-12 2.548387
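The question's two-step normals, including the PP case that the code above leaves out, could be sketched as follows. This is an illustration under assumptions: the data are random placeholders for a single CODE, and the helper columns YEAR and MONTH and the names monthly and normals are hypothetical.

```python
import numpy as np
import pandas as pd

# Placeholder daily data for one station; replace with the real frame
rng = np.random.default_rng(0)
dates = pd.date_range("1991-01-01", "2020-12-31", freq="D")
df = pd.DataFrame(
    {
        "CODE": "000130",
        "TMAX": rng.uniform(28, 34, len(dates)),
        "TMIN": rng.uniform(20, 25, len(dates)),
        "PP": rng.uniform(0, 5, len(dates)),
    },
    index=dates,
)

# Helper columns taken from the DatetimeIndex
df = df.assign(YEAR=df.index.year, MONTH=df.index.month)

# Step 1: one value per CODE, year and month
# (mean for temperatures, total for precipitation)
monthly = df.groupby(["CODE", "YEAR", "MONTH"]).agg(
    TMAX=("TMAX", "mean"),
    TMIN=("TMIN", "mean"),
    PP=("PP", "sum"),
)

# Step 2: average the yearly values of each calendar month
normals = monthly.groupby(level=["CODE", "MONTH"]).mean().round(1)
print(normals)
```

With real data, keep in mind that mean skips NaN values and that a month whose PP values are all NaN sums to 0 by default, which would pull the precipitation normal down.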
I'm trying to resample some tick data I have into 1 minute blocks. The code appears to work fine but when I look into the resulting dataframe it is changing the order of the dates incorrectly. Below is what it looks like pre resample:
Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10
2020-06-30 17:00:00 41.68 2 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 3 tptAsk tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.68 1 tptTrade tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 5 tptAsk tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 8 tptAsk tctRegular NaN 255 NaN 0 msNormal
... ... ... ... ... ... ... ... ... ...
2020-01-07 17:00:21 41.94 5 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:27 41.94 4 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:40 41.94 3 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:46 41.94 4 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:50 41.94 3 tptBid tctRegular NaN 255 NaN 0 msNormal
As you can see the date starts at 5pm on the 30th of June. Then I use this code:
one_minute_dataframe['Price'] = df.Var2.resample('1min').last()
one_minute_dataframe['Volume'] = df.Var3.resample('1min').sum()
one_minute_dataframe.index = pd.to_datetime(one_minute_dataframe.index)
one_minute_dataframe.sort_index(inplace = True)
And I get the following:
Price Volume
2020-01-07 00:00:00 41.73 416
2020-01-07 00:01:00 41.74 198
2020-01-07 00:02:00 41.76 40
2020-01-07 00:03:00 41.74 166
2020-01-07 00:04:00 41.77 143
... ... ...
2020-06-30 23:55:00 41.75 127
2020-06-30 23:56:00 41.74 234
2020-06-30 23:57:00 41.76 344
2020-06-30 23:58:00 41.72 354
2020-06-30 23:59:00 41.74 451
It seems to want to start from midnight on the 1st of July. But I've tried sorting the index and it still is not changing.
Also, the datetime index seems to add lots more dates outside the ones that were originally in the dataframe and plonks them in the middle of the resampled one.
Any help would be great. Apologies if I've set this out poorly
I see what's happened. Somewhere in the data download the month and day were switched around. That's why it's putting July at the top: it thinks it's January.
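One way to guard against a day/month swap is to parse the raw timestamps with an explicit format instead of letting pandas guess. The sketch below assumes the source file stores day-first strings such as '01/07/2020 17:00:21'; that layout, the timestamp column name and the sample values are assumptions for illustration, not taken from the actual download.

```python
import pandas as pd

# Hypothetical raw ticks with day-first timestamps
raw = pd.DataFrame({
    "timestamp": [
        "30/06/2020 17:00:00",
        "30/06/2020 17:00:30",
        "01/07/2020 17:00:21",
        "01/07/2020 17:00:50",
    ],
    "Var2": [41.68, 41.71, 41.94, 41.95],
    "Var3": [2, 3, 5, 3],
})

# An explicit format removes any ambiguity between day and month
raw["timestamp"] = pd.to_datetime(raw["timestamp"], format="%d/%m/%Y %H:%M:%S")
raw = raw.set_index("timestamp").sort_index()

# Last price and summed volume per minute; drop minutes without any ticks
bars = (
    raw.resample("1min")
       .agg({"Var2": "last", "Var3": "sum"})
       .dropna(subset=["Var2"])
       .rename(columns={"Var2": "Price", "Var3": "Volume"})
)
print(bars)
```

The dropna step also addresses the extra rows mentioned in the question: resample creates a bin for every minute between the first and last tick, including minutes in which nothing traded.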
When grouping the data, how can I cumsum milliseconds in a df?
The input is below.
inputs:
time key isValue
2018-03-04 00:00:06.520 1 NaN
2018-03-04 00:00:07.230 1 NaN
2018-03-04 00:00:08.140 1 1
2018-03-04 00:00:08.720 1 1
2018-03-04 00:00:09.110 1 1
2018-03-04 00:00:09.650 1 NaN
2018-03-04 00:00:10.360 1 NaN
2018-03-04 00:00:11.150 1 NaN
2018-03-04 00:00:11.770 2 NaN
2018-03-04 00:00:12.320 2 NaN
2018-03-04 00:00:12.910 2 1
2018-03-04 00:00:13.250 2 1
2018-03-04 00:00:13.960 2 1
2018-03-04 00:00:14.550 2 NaN
2018-03-04 00:00:15.250 2 NaN
....
And the output I want is below.
outputs
key : time
1 : 1.030
2 : 1.050
3 : X.xxx
4 : X.xxx
....
Well, I'm using this code
df.groupby(["key"])["time"].cumsum()
I don't think this code is correct.
I think you need:
df['new'] = df["time"].dt.microsecond.groupby(df["key"]).cumsum() / 1000
print (df)
time key isValue new
0 2018-03-04 00:00:06.520 1 NaN 520.0
1 2018-03-04 00:00:07.230 1 NaN 750.0
2 2018-03-04 00:00:08.140 1 1.0 890.0
3 2018-03-04 00:00:08.720 1 1.0 1610.0
4 2018-03-04 00:00:09.110 1 1.0 1720.0
5 2018-03-04 00:00:09.650 1 NaN 2370.0
6 2018-03-04 00:00:10.360 1 NaN 2730.0
7 2018-03-04 00:00:11.150 1 NaN 2880.0
8 2018-03-04 00:00:11.770 2 NaN 770.0
9 2018-03-04 00:00:12.320 2 NaN 1090.0
10 2018-03-04 00:00:12.910 2 1.0 2000.0
11 2018-03-04 00:00:13.250 2 1.0 2250.0
12 2018-03-04 00:00:13.960 2 1.0 3210.0
13 2018-03-04 00:00:14.550 2 NaN 3760.0
14 2018-03-04 00:00:15.250 2 NaN 4010.0
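If the goal is instead the elapsed time per key while isValue is set, a different calculation is needed. That reading is an assumption: on the sample rows it reproduces the 1.050 shown for key 2 but gives 0.970 rather than 1.030 for key 1. A minimal sketch on a subset of the question's rows, with illustrative names active and span:

```python
import numpy as np
import pandas as pd

# Subset of the question's rows: one NaN row on each side of the flagged run
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2018-03-04 00:00:07.230", "2018-03-04 00:00:08.140",
        "2018-03-04 00:00:08.720", "2018-03-04 00:00:09.110",
        "2018-03-04 00:00:09.650", "2018-03-04 00:00:12.320",
        "2018-03-04 00:00:12.910", "2018-03-04 00:00:13.250",
        "2018-03-04 00:00:13.960", "2018-03-04 00:00:14.550",
    ]),
    "key": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "isValue": [np.nan, 1, 1, 1, np.nan, np.nan, 1, 1, 1, np.nan],
})

# Keep only the rows where isValue is set
active = df[df["isValue"].notna()]

# Elapsed seconds between the first and last flagged row of each key
grouped = active.groupby("key")["time"]
span = (grouped.max() - grouped.min()).dt.total_seconds()
print(span)
```

Subtracting full timestamps avoids the problem with dt.microsecond, which only returns the sub-second part of each value and ignores the seconds.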