Sample and group by, and pivot time series data - pandas

I'm struggling to handle a complex (imho) operation on time series data.
I have a time series data set and would like to break it into non-overlapping, pivoted, grouped chunks. It is organized by customer, year, month, and value. For the purposes of this toy example, I am trying to break it out into a simple forecast of the next 3 months.
df = pd.DataFrame({'Customer': {0: 'a', 1: 'a', 2: 'a', 3: 'a', 4: 'a', 5: 'a', 6: 'a', 7: 'b', 8: 'b', 9: 'b'},
'Year': {0: 2020, 1: 2020, 2: 2020, 3: 2020, 4: 2020, 5: 2021, 6: 2021, 7: 2020, 8: 2020, 9: 2020},
'Month': {0: 8, 1: 9, 2: 10, 3: 11, 4: 12, 5: 1, 6: 2, 7: 1, 8: 2, 9: 3},
'Value': {0: 5.2, 1: 2.2, 2: 1.7, 3: 9.0, 4: 5.5, 5: 2.5, 6: 1.9, 7: 4.5, 8: 2.9, 9: 3.1}})
My goal is to create a dataframe where each row contains non-overlapping data in 3-month pivoted increments, so each row holds the 3 upcoming "Value" data points from that point in time. I'd also like the chunks to be anchored to the most recent data, so if the amount of data isn't a multiple of 3, the oldest leftover rows are dropped (see customer a, whose earliest month doesn't appear below).
| Customer | Year | Month | Month1 | Month2 | Month3 |
| --- | --- | --- | --- | --- | --- |
| b | 2020 | 1 | 4.5 | 2.9 | 3.1 |
| a | 2020 | 9 | 2.2 | 1.7 | 9.0 |
| a | 2020 | 12 | 5.5 | 2.5 | 1.9 |
Much appreciated.

One way is to first sort_values on your df so the latest month comes first, assign group numbers, and drop rows that aren't in complete groups of 3:
df = df.sort_values(["Year", "Month", "Customer"], ascending=False)
df["group"] = (df.groupby("Customer").cumcount()%3).eq(0).cumsum()
df = df[(df.groupby(["Customer", "group"])["Year"].transform("size").eq(3))]
df["num"] = (df.groupby("Customer").cumcount()%3+1).replace({1:3, 3:1})
print (df)
Customer Year Month Value group num
6 a 2021 2 1.9 2 3
5 a 2021 1 2.5 2 2
4 a 2020 12 5.5 2 1
3 a 2020 11 9.0 3 3
2 a 2020 10 1.7 3 2
1 a 2020 9 2.2 3 1
9 b 2020 3 3.1 5 3
8 b 2020 2 2.9 5 2
7 b 2020 1 4.5 5 1
Now you can pivot your data:
print (df.assign(Month=df["Month"].where(df["num"].eq(1)).bfill(),
                 Year=df["Year"].where(df["num"].eq(1)).bfill(),
                 num="Month"+df["num"].astype(str))
         .pivot(index=["Customer","Month","Year"], columns="num", values="Value")
         .reset_index())
num Customer Month Year Month1 Month2 Month3
0 a 9.0 2020.0 2.2 1.7 9.0
1 a 12.0 2020.0 5.5 2.5 1.9
2 b 1.0 2020.0 4.5 2.9 3.1

There might be a better way to do this, but this will give you the output you want:
First we add a Customer_chunk column to give rows belonging to the same chunk a shared ID, and we remove the extra rows.
df["Customer_chunk"] = (df[::-1].groupby("Customer").cumcount()) // 3
df = df.groupby(["Customer", "Customer_chunk"]).filter(lambda group: len(group) == 3)
Then we group by Customer and Customer_chunk to generate each column of the desired output.
df_grouped = df.groupby(["Customer", "Customer_chunk"])
colYear = df_grouped["Year"].first()
colMonth = df_grouped["Month"].first()
colMonth1 = df_grouped["Value"].first()
colMonth2 = df_grouped["Value"].nth(1)  # second value of each chunk (nth is 0-indexed)
colMonth3 = df_grouped["Value"].last()
And finally we create the output by merging every columns.
df_output = pd.concat([colYear, colMonth, colMonth1, colMonth2, colMonth3], axis=1).reset_index().drop("Customer_chunk", axis=1)
Some steps feel a bit clunky, and there's probably room for improvement in this code, but it shouldn't impact performance too much.
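For what it's worth, here is a possible tightening of the same idea, sketched under the assumption that chunks run backwards from each customer's most recent month and that the incomplete (oldest) chunk is dropped; the chunk and pos columns are just illustrative helpers:
import pandas as pd

tmp = df.sort_values(["Customer", "Year", "Month"]).copy()

# Number rows from the most recent backwards so leftover rows fall into the oldest chunk.
rev = tmp.groupby("Customer").cumcount(ascending=False)
tmp["chunk"] = rev // 3
tmp["pos"] = 3 - rev % 3  # 1 = earliest month of the chunk, 3 = latest

# Keep only complete 3-row chunks, then pivot each chunk onto one row.
tmp = tmp[tmp.groupby(["Customer", "chunk"])["pos"].transform("size").eq(3)]
values = (tmp.pivot(index=["Customer", "chunk"], columns="pos", values="Value")
             .rename(columns=lambda p: f"Month{p}"))
starts = tmp[tmp["pos"].eq(1)].set_index(["Customer", "chunk"])[["Year", "Month"]]
out = starts.join(values).reset_index().drop(columns="chunk")
print(out)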

Related

Sort only Certain Column Names based on Month and Year

I have more than 100 columns, organized as below:
import pandas as pd
data = [[11, 1, 6, 8, 45, 67, '30-06-2021', 43578, 3.4, '30-04-2022', 6.7, 5000, 6744, 8.9, 8978, '31-03-2022', '31-01-2022',
'28-02-2022', 5.6]]
dat = pd.DataFrame(data, columns = ['a', 'b', 't', 'u', 'g', 'd', 'Start', 'Apr-22Total', 'Mar-22Rate', 'Apr-22', 'Feb-22Rate', 'Feb-22Total', 'Jan-22Total',
'Apr-22Rate', 'Mar-22Total', 'Mar-22', 'Jan-22', 'Feb-22', 'Jan-22Rate'])
a b t u g d Start Apr-22Total Mar-22Rate Apr-22 Feb-22Rate Feb-22Total Jan-22Total Apr-22Rate Mar-22Total Mar-22 Jan-22 Feb-22 Jan-22Rate
0 11 1 6 8 45 67 30-06-2021 43578 3.4 30-04-2022 6.7 5000 6744 8.9 8978 31-03-2022 31-01-2022 28-02-2022 5.6
How can I reorder only the column names that contain a month and year, so that those columns follow month and year order?
My expectation is to end up with the month-year columns in chronological order, as in the output below.
Here's a solution:
cols = dat.filter(regex=r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d\d').columns
sorted_cols = cols[
    pd.Series(cols)
      .str.split('-', expand=True)
      .set_axis(['0', '1'], axis=1)
      .pipe(lambda x: x.assign(**{'0': pd.to_datetime(x['0'], format='%b').dt.month}))
      .sort_values(['0', '1'])
      .index
]
dat = dat[dat.drop(cols, axis=1).columns.tolist() + sorted_cols.tolist()]
Output:
>>> dat
a b t u g d Start Jan-22 Jan-22Rate Jan-22Total Feb-22 Feb-22Rate Feb-22Total Mar-22 Mar-22Rate Mar-22Total Apr-22 Apr-22Rate Apr-22Total
0 11 1 6 8 45 67 30-06-2021 31-01-2022 5.6 6744 28-02-2022 6.7 5000 31-03-2022 3.4 8978 30-04-2022 8.9 43578
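A hedged alternative that gives the same ordering with a plain Python sort key, assuming every matching column starts with a "Mon-YY" prefix (e.g. "Apr-22Total"):
import pandas as pd

cols = dat.filter(regex=r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d\d').columns
# Sort by (parsed month-year, remaining suffix) so "Apr-22" precedes "Apr-22Rate" and "Apr-22Total".
sorted_cols = sorted(cols, key=lambda c: (pd.to_datetime(c[:6], format='%b-%y'), c[6:]))
dat = dat[dat.drop(cols, axis=1).columns.tolist() + sorted_cols]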

Finding the max values of each subset in pandas [duplicate]

This question already has answers here:
Pandas Group to Divide by Max
(1 answer)
How to use pandas transform function to divide each row with max value grouped by another column [duplicate]
(1 answer)
Closed 12 months ago.
I'm trying to divide each value by the max value for the given year.
df = pd.DataFrame({'Fiscal Year': {0: 2020, 1: 2019, 2: 2021, 3: 2020, 4: 2021},
'Product Num': {0: 'widget', 1: 'doodad', 2: "widget", 3: 'doodad', 4: 'widget'},
'Value': {0: 1000, 1: 1100, 2: 900, 3: 1300, 4: 800}})
So, the expected result is:

Product | Year | New Value
--- | --- | ---
Widget | 2020 | .769
Doodad | 2019 | 1
Widget | 2021 | 1
Doodad | 2020 | 1
Widget | 2021 | .889
I know I can do groupby's and then go through each entry and figure it out that way, but that doesn't seem efficient. Is there a better way to do this?
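The linked duplicates boil down to groupby().transform('max'), which broadcasts each year's maximum back onto every row so the division is a single vectorized step. A minimal sketch against the sample frame above (the results match the expected table):
import pandas as pd

df = pd.DataFrame({'Fiscal Year': {0: 2020, 1: 2019, 2: 2021, 3: 2020, 4: 2021},
                   'Product Num': {0: 'widget', 1: 'doodad', 2: 'widget', 3: 'doodad', 4: 'widget'},
                   'Value': {0: 1000, 1: 1100, 2: 900, 3: 1300, 4: 800}})

# Divide each Value by the maximum Value of its Fiscal Year.
df['New Value'] = df['Value'] / df.groupby('Fiscal Year')['Value'].transform('max')
print(df[['Product Num', 'Fiscal Year', 'New Value']])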

Pandas transform rows with specific character

I am working on feature transformation and ran into this issue. Let me know what you think. Thanks!
I have a table with a Number column, and I want to create an Output column derived from it (copy-and-pasteable sample data is below).
Some info:
All the outputs are based on the values that contain a ':' (the number before the ':' is carried forward to the rows that follow).
I have 100M+ rows in this table, so performance needs to be considered.
Let me know if you have some good ideas. Thanks!
Here is some copy and paste-able sample data:
df = pd.DataFrame({'Number': {0: '1000',1: '1000021', 2: '15:00', 3: '23424234',
4: '23423', 5: '3', 6 : '9:00', 7: '3423', 8: '32', 9: '7:00'}})
Solution #1:
You can use .str.contains(':') with np.where() to pick out the matching values and return np.nan otherwise. Then use ffill() to fill down over the NaN values:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Number': {0: '1000',1: '1000021', 2: '15:00', 3: '23424234',
4: '23423', 5: '3', 6 : '9:00', 7: '3423', 8: '32', 9: '7:00'}})
df['Output'] = np.where(df['Number'].str.contains(':'),df['Number'].str.split(':').str[0],np.nan)
df['Output'] = df['Output'].ffill()
df
Solution #2 - Even easier, and potentially better performance: use some regex with str.extract() and then again ffill():
df['Output'] = df['Number'].str.extract(r'^(\d+):').ffill()
df
Out[1]:
Number Output
0 1000 NaN
1 1000021 NaN
2 15:00 15
3 23424234 15
4 23423 15
5 3 15
6 9:00 9
7 3423 9
8 32 9
9 7:00 7
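Given the 100M+ row concern, it may be worth timing both solutions on a blown-up copy of the sample before committing to one; a rough, hypothetical benchmark sketch (timings will vary by machine and pandas version):
import time
import numpy as np
import pandas as pd

big = pd.concat([df] * 100_000, ignore_index=True)  # ~1M rows built from the sample df above

t0 = time.perf_counter()
s1 = pd.Series(np.where(big['Number'].str.contains(':'),
                        big['Number'].str.split(':').str[0], np.nan)).ffill()
t1 = time.perf_counter()
s2 = big['Number'].str.extract(r'^(\d+):')[0].ffill()
t2 = time.perf_counter()

print(f"np.where + split: {t1 - t0:.2f}s   str.extract: {t2 - t1:.2f}s   same result: {s1.equals(s2)}")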
I think this is what you are looking for:
import pandas as pd
c = ['Number']
d = ['1:00',100,1001,1321,3254,'15:00',20,60,80,90,'4:00',26,45,90,89]
df = pd.DataFrame(data=d,columns=c)
temp = df['Number'].str.split(":", n=1, expand=True)
df['New_Val'] = temp[0].ffill()
print(df)
The output of this will be as follows:
Number New_Val
0 1:00 1
1 100 1
2 1001 1
3 1321 1
4 3254 1
5 15:00 15
6 20 15
7 60 15
8 80 15
9 90 15
10 4:00 4
11 26 4
12 45 4
13 90 4
14 89 4
It looks like your DataFrame has all string values; in my sample above I treated the column as a mix of numbers and strings.
Here's the solution if df['Number'] contains only strings.
df1 = pd.DataFrame({'Number': {0: '1000',1: '1000021', 2: '15:00', 3: '23424234',
4: '23423', 5: '3', 6 : '9:00', 7: '3423', 8: '32', 9: '7:00'}})
temp = df1['Number'].str.split(":", n=1, expand=True)
temp.loc[temp[1].astype(bool), 'New_val'] = temp[0]
df1['New_val'] = temp['New_val'].ffill()
print (df1)
The output of df1 will be:
Number New_val
0 1000 NaN
1 1000021 NaN
2 15:00 15
3 23424234 15
4 23423 15
5 3 15
6 9:00 9
7 3423 9
8 32 9
9 7:00 7

Standard deviation with groupby(multiple columns) Pandas

I am working with data from the California Air Resources Board.
site,monitor,date,start_hour,value,variable,units,quality,prelim,name
5407,t,2014-01-01,0,3.00,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,1,1.54,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,2,3.76,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,3,5.98,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,4,8.09,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,5,12.05,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,6,12.55,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
...
df = pd.concat([pd.read_csv(file, header = 0) for file in f]) #merges all files into one dataframe
df.dropna(axis = 0, how = "all", subset = ['start_hour', 'variable'],
          inplace = True) # drops trailing rows that have no data in them (NaN)
df.start_hour = pd.to_timedelta(df['start_hour'], unit = 'h')
df.date = pd.to_datetime(df.date)
df['datetime'] = df.date + df.start_hour
df.drop(columns=['date', 'start_hour'], inplace=True)
df['month'] = df.datetime.dt.month
df['day'] = df.datetime.dt.day
df['year'] = df.datetime.dt.year
df.set_index('datetime', inplace = True)
df = df.rename(columns={'value':'conc'})
I have multiple years of hourly PM2.5 concentration data and am trying to prepare graphs that show the average monthly concentration over many years (different graphs for each month). Here's an image of the graph I've created thus far. [![Bombay Beach][1]][1] However, I want to add error bars to the average concentration line but I am having issues when attempting to calculate the standard deviation. I've created a new dataframe d_avg that includes the year, month, day, and average concentration of PM2.5; here's some of the data.
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
year month day conc
0 2014 1 1 9.644583
1 2014 1 2 4.945652
2 2014 1 3 4.345238
3 2014 1 4 5.047917
4 2014 1 5 5.212857
5 2014 1 6 2.095714
After this, I found the monthly average m_avg and created a datetime index to plot datetime vs monthly avg conc (refer above, black line).
m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
m_avg['datetime'] = pd.to_datetime(m_avg.year.astype(str) + m_avg.month.astype(str), format='%Y%m') + MonthEnd(1)
[In]: m_avg.head(6)
[Out]:
year month conc datetime
0 2014 1 4.330985 2014-01-31
1 2014 2 2.280096 2014-02-28
2 2014 3 4.464622 2014-03-31
3 2014 4 6.583759 2014-04-30
4 2014 5 9.069353 2014-05-31
5 2014 6 9.982330 2014-06-30
Now I want to calculate the standard deviation of the d_avg concentration, and I've tried multiple things:
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].agg(np.std)
sd = d_avg['conc'].apply(lambda x: x.std())
However, each attempt has left me with the same problem in the resulting dataframe. I am unable to plot the standard deviation because I believe it is taking the standard deviation of the year and month columns too, which are the columns I am trying to group the data by. Here's what my resulting dataframe sd looks like:
year month sd
0 44.877611 1.000000 1.795868
1 44.877611 1.414214 2.355055
2 44.877611 1.732051 2.597531
3 44.877611 2.000000 2.538749
4 44.877611 2.236068 5.456785
5 44.877611 2.449490 3.315546
Please help me!
[1]: https://i.stack.imgur.com/ueVrG.png
I tried to reproduce your error and it works fine for me. Here's my complete code sample, which is pretty much exactly the same as yours EXCEPT for the generation of the original dataframe. So I'd suspect that part of the code. Can you provide the code that creates the dataframe?
import pandas as pd
columns = ['year', 'month', 'day', 'conc']
data = [[2014, 1, 1, 2.0],
[2014, 1, 1, 4.0],
[2014, 1, 2, 6.0],
[2014, 1, 2, 8.0],
[2014, 2, 1, 2.0],
[2014, 2, 1, 6.0],
[2014, 2, 2, 10.0],
[2014, 2, 2, 14.0]]
df = pd.DataFrame(data, columns=columns)
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year', 'month'], as_index=False)['conc'].mean()
m_std = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()
print(f'Concentrations:\n{df}\n')
print(f'Daily Average:\n{d_avg}\n')
print(f'Monthly Average:\n{m_avg}\n')
print(f'Monthly Standard Deviation:\n{m_std}\n')
Outputs:
Concentrations:
year month day conc
0 2014 1 1 2.0
1 2014 1 1 4.0
2 2014 1 2 6.0
3 2014 1 2 8.0
4 2014 2 1 2.0
5 2014 2 1 6.0
6 2014 2 2 10.0
7 2014 2 2 14.0
Daily Average:
year month day conc
0 2014 1 1 3.0
1 2014 1 2 7.0
2 2014 2 1 4.0
3 2014 2 2 12.0
Monthly Average:
year month conc
0 2014 1 5.0
1 2014 2 8.0
Monthly Standard Deviation:
year month conc
0 2014 1 2.828427
1 2014 2 5.656854
I decided to dance around my issue since I couldn't figure out what was causing the problem. I merged the m_avg and sd dataframes and dropped the year and month columns that were causing me issues. See code below, lots of renaming.
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std(ddof=0)
sd = sd.rename(columns={"conc":"sd", "year":"wrongyr", "month":"wrongmth"})
m_avg_sd = pd.concat([m_avg, sd], axis = 1)
m_avg_sd.drop(columns=['wrongyr', 'wrongmth'], inplace = True)
m_avg_sd['datetime'] = pd.to_datetime(m_avg_sd.year.astype(str) + m_avg_sd.month.astype(str), format='%Y%m') + MonthEnd(1)
and here's the new dataframe:
m_avg_sd.head(5)
Out[2]:
year month conc sd datetime
0 2009 1 48.350105 18.394192 2009-01-31
1 2009 2 21.929383 16.293645 2009-02-28
2 2009 3 15.094729 6.821124 2009-03-31
3 2009 4 12.021009 4.391219 2009-04-30
4 2009 5 13.449100 4.081734 2009-05-31
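A hedged alternative to the concat-and-rename workaround above: named aggregation can compute the monthly mean and standard deviation in a single groupby, so there are no duplicate year/month columns to drop (this assumes the d_avg frame from the question; ddof=0 mirrors the std call above):
import pandas as pd
from pandas.tseries.offsets import MonthEnd

m_avg_sd = (d_avg.groupby(['year', 'month'], as_index=False)
                 .agg(conc=('conc', 'mean'),
                      sd=('conc', lambda s: s.std(ddof=0))))  # ddof=0 to match the code above

# Same month-end datetime as above; zero-padding the month keeps '%Y%m' unambiguous.
m_avg_sd['datetime'] = pd.to_datetime(
    m_avg_sd['year'].astype(str) + m_avg_sd['month'].astype(str).str.zfill(2),
    format='%Y%m') + MonthEnd(1)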

Nested groupby in DataFrame and aggregate multiple columns

I am trying to do nested groupby as follows:
>>> df1 = pd.DataFrame({'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11',5: '2016-10-12'}, 'Stock': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC', 5: 'XYZ'}, 'Quantity': {0: 60,1: 50, 2: 40, 3: 30, 4: 20, 5: 10}, 'UiD':{0:1,1:1,2:1,3:2,4:2,5:3}, 'StartTime': {0: '08:00:00.241', 1: '08:00:00.243', 2: '12:34:23.563', 3: '08:14.05.908', 4: '18:54:50.100', 5: '10:08:36.657'}, 'Sign':{0:1,1:1,2:0,3:-1,4:0,5:-1}, 'leg1':{0:2,1:2,2:4,3:5,4:7,5:8}})
>>> df1
Date Quantity Sign StartTime Stock UiD leg1
0 2016-10-11 60 1 08:00:00.241 ABC 1 2
1 2016-10-11 50 1 08:00:00.243 ABC 1 2
2 2016-10-11 40 0 12:34:23.563 ABC 1 4
3 2016-10-11 30 -1 08:14.05.908 ABC 2 5
4 2016-10-11 20 0 18:54:50.100 ABC 2 7
5 2016-10-12 10 -1 10:08:36.657 XYZ 3 8
>>> dfg1=df1.groupby(['Date','Stock'])
>>> dfg1.apply(lambda x:x.groupby('UiD').first()).groupby(['Date','Stock']).apply(lambda x:np.sum(x['Quantity']))
Date Stock
2016-10-11 ABC 90
2016-10-12 XYZ 10
dtype: int64
>>>
>>> dfg1['leg1'].sum()
Date Stock
2016-10-11 ABC 20
2016-10-12 XYZ 8
Name: leg1, dtype: int64
So far so good. Now I am trying to concatenate the two results into a new DataFrame df2 as follows:
>>> df2 = pd.concat([dfg1['leg1'].sum(), dfg1.apply(lambda x:x.groupby('UiD').first()).groupby(['Date','Stock']).apply(lambda x:np.sum(x['Quantity']))],axis=1)
0 1
Date Stock
2016-10-11 ABC 20 90
2016-10-12 XYZ 8 10
>>>
I am wondering if there is a better way to re-write following line in order to avoid repetition of groupby(['Date','Stock'])
dfg1.apply(lambda x:x.groupby('UiD').first()).groupby(['Date','Stock']).apply(lambda x:np.sum(x['Quantity']))
Also this fails if ['Date','Stock'] contains 'UiD' as one of the keys or if ['Date','Stock'] is replaced by just ['UiD'].
Please restate your question to be clearer. You want to groupby(['Date','Stock']), then:
- take only the first record for each UiD and sum (aggregate) its Quantity, but also
- sum all leg1 values for that Date, Stock combination (not just the first-for-each-UiD).
Is that right?
In any case, you want to perform an aggregation (sum) on multiple columns, and the way to avoid repeating groupby(['Date','Stock']) is to keep one dataframe rather than stitching together two dataframes from two individual aggregate operations. Something like the following (I'll fix it once you confirm this is what you want):
def filter_first_UiD(g):
    # return g.groupby('UiD').first().agg(np.sum)
    return g.groupby('UiD').first().agg({'Quantity': 'sum', 'leg1': 'sum'})

df1.groupby(['Date','Stock']).apply(filter_first_UiD)
The way I dealt with the last scenario, where the nested groupby fails if ['Date','Stock'] contains 'UiD' as one of the keys or is replaced by just ['UiD'], is as follows:
keys = ['Date', 'Stock']  # the keys dfg1 was grouped on
qty = (dfg1['Quantity'].first() if 'UiD' in keys  # one UiD per group, so its first row is enough
       else dfg1.apply(lambda x: x.groupby('UiD').first())
                .groupby(keys)
                .apply(lambda x: np.sum(x['Quantity'])))
df2 = pd.concat([dfg1['leg1'].sum(), qty], axis=1)
But a more elegant solution is still an open question.
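For what it's worth, a minimal sketch of one way to get both aggregates out of a single groupby(['Date','Stock']): mark the first row per UiD up front, zero out Quantity on the other rows, and aggregate once (assuming the df1 defined in the question):
import pandas as pd

# True only for the first row of each (Date, Stock, UiD) combination.
first_per_uid = ~df1.duplicated(subset=['Date', 'Stock', 'UiD'])

df2 = (df1.assign(QtyFirst=df1['Quantity'].where(first_per_uid, 0))
          .groupby(['Date', 'Stock'])
          .agg(leg1=('leg1', 'sum'), Quantity=('QtyFirst', 'sum')))
print(df2)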