I'm struggling to handle a complex (imho) operation on time series data.
I have a time series data set and would like to break it into non-overlapping, pivoted chunks per group. It is organized by customer, year, month, and value. For the purposes of this toy example, I am trying to break it out into a simple forecast of the next 3 months.
df = pd.DataFrame({'Customer': {0: 'a', 1: 'a', 2: 'a', 3: 'a', 4: 'a', 5: 'a', 6: 'a', 7: 'b', 8: 'b', 9: 'b'},
'Year': {0: 2020, 1: 2020, 2: 2020, 3: 2020, 4: 2020, 5: 2021, 6: 2021, 7: 2020, 8: 2020, 9: 2020},
'Month': {0: 8, 1: 9, 2: 10, 3: 11, 4: 12, 5: 1, 6: 2, 7: 1, 8: 2, 9: 3},
'Value': {0: 5.2, 1: 2.2, 2: 1.7, 3: 9.0, 4: 5.5, 5: 2.5, 6: 1.9, 7: 4.5, 8: 2.9, 9: 3.1}})
My goal is to create a dataframe where each row contains non-overlapping data in 3-month pivoted increments, so each row has the 3 upcoming "Value" data points from that point in time. I'd also like this to favor the most recent data, so if the amount of data isn't a multiple of 3, the oldest leftover rows are dropped (see customer a in the example).
| Customer | Year | Month | Month1 | Month2 | Month3 |
| --- | --- | --- | --- | --- | --- |
| b | 2020 | 1 | 4.5 | 2.9 | 3.1 |
| a | 2020 | 9 | 2.2 | 1.7 | 9.0 |
| a | 2020 | 12 | 5.5 | 2.5 | 1.9 |
Much appreciated.
One way is to first sort_values on your df so the latest month goes first, assign group numbers, and drop those not in groups of 3:
df = df.sort_values(["Year", "Month", "Customer"], ascending=False)
df["group"] = (df.groupby("Customer").cumcount()%3).eq(0).cumsum()
df = df[(df.groupby(["Customer", "group"])["Year"].transform("size").eq(3))]
df["num"] = (df.groupby("Customer").cumcount()%3+1).replace({1:3, 3:1})
print (df)
Customer Year Month Value group num
6 a 2021 2 1.9 2 3
5 a 2021 1 2.5 2 2
4 a 2020 12 5.5 2 1
3 a 2020 11 9.0 3 3
2 a 2020 10 1.7 3 2
1 a 2020 9 2.2 3 1
9 b 2020 3 3.1 5 3
8 b 2020 2 2.9 5 2
7 b 2020 1 4.5 5 1
Now you can pivot your data:
print (df.assign(Month=df["Month"].where(df["num"].eq(1)).bfill(),
Year=df["Year"].where(df["num"].eq(1)).bfill(),
num="Month"+df["num"].astype(str))
.pivot(["Customer","Month","Year"], "num", "Value")
.reset_index())
num Customer Month Year Month1 Month2 Month3
0 a 9.0 2020.0 2.2 1.7 9.0
1 a 12.0 2020.0 5.5 2.5 1.9
2 b 1.0 2020.0 4.5 2.9 3.1
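A side note: on newer pandas releases the positional form of pivot above may raise a TypeError (the arguments were made keyword-only at some point, if I remember right), so you may need the keyword spelling. The same step written out that way (out is just an illustrative name):
out = (df.assign(Month=df["Month"].where(df["num"].eq(1)).bfill(),
                 Year=df["Year"].where(df["num"].eq(1)).bfill(),
                 num="Month" + df["num"].astype(str))
         .pivot(index=["Customer", "Month", "Year"], columns="num", values="Value")
         .reset_index())
print(out)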
There might be a better way to do this, but this will give you the output you want:
First we add a Customer_chunk column to give an ID to rows belonging to the same chunk, and we remove the extra rows.
df["Customer_chunk"] = (df[::-1].groupby("Customer").cumcount()) // 3
df = df.groupby(["Customer", "Customer_chunk"]).filter(lambda group: len(group) == 3)
Then we group by Customer and Customer_chunk to generate each column of the desired output.
df_grouped = df.groupby(["Customer", "Customer_chunk"])
colYear = df_grouped["Year"].first()
colMonth = df_grouped["Month"].first()
colMonth1 = df_grouped["Value"].first().rename("Month1")
colMonth2 = df_grouped["Value"].nth(1).rename("Month2")  # nth is 0-indexed: this is the 2nd value of each chunk
colMonth3 = df_grouped["Value"].last().rename("Month3")
And finally we create the output by concatenating all the columns.
df_output = pd.concat([colYear, colMonth, colMonth1, colMonth2, colMonth3], axis=1).reset_index().drop("Customer_chunk", axis=1)
Some steps feel a bit clunky; there's probably room for improvement in this code, but it shouldn't impact performance too much.
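For reference, printing df_output on the sample df above should give roughly the following (columns follow the order of the concat, with Customer coming from the group index):
print(df_output)
  Customer  Year  Month  Month1  Month2  Month3
0        a  2020     12     5.5     2.5     1.9
1        a  2020      9     2.2     1.7     9.0
2        b  2020      1     4.5     2.9     3.1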
I have a dataframe that looks like this.
d = {'name': ["eric", "eric","eric", "sean","sean","sean"], 'values': [1, 5, 7, 4, 2, 5]}
df = pd.DataFrame(data=d)
And I am trying to add a new column called df['sum'] that keeps a running total of df['values'] for each unique name in the df['name'] column, so that it would look like this:
name   values  sum
eric        1    1
eric        5    6
eric        7   13
sean        4    4
sean        2    6
sean        5   11
I tried using the below, but can't figure out how to get it to start over each time it gets to a new name.
for i in df['name']:
    df['sum'] = df['values'].cumsum()
Use groupby().cumsum() and assign the result back:
df['sum'] = df.groupby('name')['values'].cumsum()
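A quick self-contained check on the sample frame from the question:
import pandas as pd

d = {'name': ["eric", "eric", "eric", "sean", "sean", "sean"], 'values': [1, 5, 7, 4, 2, 5]}
df = pd.DataFrame(data=d)
df['sum'] = df.groupby('name')['values'].cumsum()   # running total restarts for each name
print(df)
#    name  values  sum
# 0  eric       1    1
# 1  eric       5    6
# 2  eric       7   13
# 3  sean       4    4
# 4  sean       2    6
# 5  sean       5   11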
I am working on feature transformation and ran into this issue. Let me know what you think. Thanks!
I have a table like this
And I want to create an output column like this
Some info:
All the outputs will be based on numbers that end with a ':'
I have 100M+ rows in this table. Need to consider performance issue.
Let me know if you have some good ideas. Thanks!
Here is some copy and paste-able sample data:
df = pd.DataFrame({'Number': {0: '1000',1: '1000021', 2: '15:00', 3: '23424234',
4: '23423', 5: '3', 6 : '9:00', 7: '3423', 8: '32', 9: '7:00'}})
Solution #1:
You can use .str.contains(':') with np.where() to identify the values and return np.nan otherwise. Then use ffill() to fill down over the NaN values:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Number': {0: '1000',1: '1000021', 2: '15:00', 3: '23424234',
4: '23423', 5: '3', 6 : '9:00', 7: '3423', 8: '32', 9: '7:00'}})
df['Output'] = np.where(df['Number'].str.contains(':'),df['Number'].str.split(':').str[0],np.nan)
df['Output'] = df['Output'].ffill()
df
Solution #2 - Even easier, and potentially better performance: you can use a regex with str.extract() and then ffill() again:
df['Output'] = df['Number'].str.extract(r'^(\d+):', expand=False).ffill()
df
Out[1]:
Number Output
0 1000 NaN
1 1000021 NaN
2 15:00 15
3 23424234 15
4 23423 15
5 3 15
6 9:00 9
7 3423 9
8 32 9
9 7:00 7
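One small follow-up (an assumption about what you need downstream, not something stated in the question): the extracted Output column still holds strings. If you want actual numbers, you can convert it afterwards:
df['Output'] = pd.to_numeric(df['Output'])   # '15' -> 15.0, NaN stays NaN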
I think this is what you are looking for:
import pandas as pd
c = ['Number']
d = ['1:00',100,1001,1321,3254,'15:00',20,60,80,90,'4:00',26,45,90,89]
df = pd.DataFrame(data=d,columns=c)
temp = df['Number'].str.split(":", n=1, expand=True)
df['New_Val'] = temp[0].ffill()
print(df)
The output of this will be as follows:
Number New_Val
0 1:00 1
1 100 1
2 1001 1
3 1321 1
4 3254 1
5 15:00 15
6 20 15
7 60 15
8 80 15
9 90 15
10 4:00 4
11 26 4
12 45 4
13 90 4
14 89 4
It looks like your DataFrame has string values; in the example above I treated them as a mix of numbers and strings.
Here's the solution if df['Number'] is all strings.
df1 = pd.DataFrame({'Number': {0: '1000',1: '1000021', 2: '15:00', 3: '23424234',
4: '23423', 5: '3', 6 : '9:00', 7: '3423', 8: '32', 9: '7:00'}})
temp = df1['Number'].str.split(":", n=1, expand=True)
temp.loc[temp[1].notna(), 'New_val'] = temp[0]  # only rows that actually contain ':'
df1['New_val'] = temp['New_val'].ffill()
print (df1)
The output of df1 will be:
Number New_val
0 1000 NaN
1 1000021 NaN
2 15:00 15
3 23424234 15
4 23423 15
5 3 15
6 9:00 9
7 3423 9
8 32 9
9 7:00 7
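If the column genuinely mixes numbers and strings, as in the mixed sample df built earlier in this answer, one possible variant (just a hedged sketch, borrowing the str.extract() idea from the other answer) is to cast everything to str first so the same logic applies regardless of dtype:
df['New_val'] = df['Number'].astype(str).str.extract(r'^(\d+):', expand=False).ffill()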
I'd like to apply rolling functions to a dataframe grouped by two columns, with repeated date entries. Specifically, with both "freq" and "window" given as date offsets (e.g. '2D', '3D'), not simply ints.
In principle, I'm trying to combine the methods from How to apply rolling functions in a group by object in pandas and pandas rolling sum of last five minutes.
Input
Here is a sample of the data, with one id=33 although we expect several id's.
X = [{'date': '2017-02-05', 'id': 33, 'item': 'A', 'points': 20},
{'date': '2017-02-05', 'id': 33, 'item': 'B', 'points': 10},
{'date': '2017-02-06', 'id': 33, 'item': 'B', 'points': 10},
{'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
{'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
{'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
{'date': '2017-02-13', 'id': 33, 'item': 'A', 'points': 4}]
df = pd.DataFrame(X)
df = df.set_index(pd.to_datetime(df['date'])).drop('date', axis=1)
df
id item points
date
2017-02-05 33 A 20
2017-02-05 33 B 10
2017-02-06 33 B 10
2017-02-11 33 A 1
2017-02-11 33 A 1
2017-02-11 33 A 1
2017-02-13 33 A 4
Goal
Sample each 'id' every 2 days (freq='2d') and return the sum of total points for each item over the previous three days (window='3D'), end-date inclusive
Desired Output
id A B
date
2017-02-05 33 20 10
2017-02-07 33 20 30
2017-02-09 33 0 10
2017-02-11 33 3 0
2017-02-13 33 7 0
E.g. on the right-inclusive end-date 2017-02-13, we sample the 3-day period 2017-02-11 to 2017-02-13. In this period, id=33 had a sum of A points equal to 1+1+1+4 = 7
Attempts
An attempt at groupby with pd.rolling_sum as follows didn't work, due to repeated dates:
df.groupby(['id', 'item'])['points'].apply(pd.rolling_sum, freq='4D', window=3)
ValueError: cannot reindex from a duplicate axis
Also note that in the documentation http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.rolling_apply.html, 'window' is an int representing the size of the sample period, not the number of days to sample.
We can also try resampling and using last; however, the desired look-back of 3 days doesn't seem to be used:
df.groupby(['id', 'item'])['points'].resample('2D', label='right', closed='right').\
apply(lambda x: x.last('3D').sum())
id item date
33 A 2017-02-05 20
2017-02-07 0
2017-02-09 0
2017-02-11 3
2017-02-13 4
B 2017-02-05 10
2017-02-07 10
Of course, setting up a loop over unique id's ID, selecting df_id = df[df['id']==ID], and summing over the periods does work, but it is computationally intensive and doesn't exploit groupby's nice vectorization.
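For concreteness, a rough sketch of the kind of loop I mean (not my exact code; the item names 'A' and 'B' are hardcoded from the sample, and df is assumed to carry the datetime index shown above):
results = []
for ID in df['id'].unique():
    df_id = df[df['id'] == ID]
    # sample every 2 days and sum the previous 3 days (end-date inclusive) per item
    for day in pd.date_range(df_id.index.min(), df_id.index.max(), freq='2D'):
        window = df_id[(df_id.index >= day - pd.Timedelta('2D')) & (df_id.index <= day)]
        sums = window.groupby('item')['points'].sum()
        results.append({'date': day, 'id': ID,
                        'A': sums.get('A', 0), 'B': sums.get('B', 0)})
looped = pd.DataFrame(results).set_index(['date', 'id'])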
Thanks to @jezrael for good suggestions so far.
Notes
Pandas version = 0.20.1
I'm a little confused as to why the documentation on rolling() here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html
suggests that the "window" parameter can be an int or offset, but on attempting df.rolling(window='3D',...) I get ValueError("window must be an integer").
It appears that the above documentation is not consistent with the latest code for rolling's window in ./core/window.py:
https://github.com/pandas-dev/pandas/blob/master/pandas/core/window.py
elif not is_integer(self.window):
raise ValueError("window must be an integer")
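As far as I can tell, an offset window like '3D' is only accepted when the object being rolled has a datetime-like index (or when on= points at a datetime column); otherwise rolling falls through to that "window must be an integer" branch. So, assuming the DatetimeIndex built above, something like this does run:
df['points'].rolling('3D').sum()   # works because the index is a DatetimeIndex
# (if the dates were a regular column instead, rolling(..., on='date') is the knob to reach for)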
It's easiest to handle resample and rolling with date frequencies when we have a single-level datetime index.
- However, I can't pivot/unstack appropriately without dealing with duplicate A/Bs, so I groupby and sum.
- I unstack one level, date, so I can use fill_value=0. Currently, I can't fill_value=0 when I unstack more than one level at a time. I make up for it with a transpose T.
- Now that I've got a single level in the index, I reindex with a date range from the min to max values in the index.
- Finally, I do a rolling 3-day sum and resample that result every 2 days with resample.
- I clean this up with a bit of renaming indices and one more pivot.
s = df.set_index(['id', 'item'], append=True).points
s = s.groupby(level=['date', 'id', 'item']).sum()
d = s.unstack('date', fill_value=0).T
tidx = pd.date_range(d.index.min(), d.index.max())
d = d.reindex(tidx, fill_value=0)
d1 = d.rolling('3D').sum().resample('2D').first().astype(d.dtypes).stack(0)
d1 = d1.rename_axis(['date', 'id']).rename_axis(None, axis=1)
print(d1)
A B
date id
2017-02-05 33 20 10
2017-02-07 33 20 20
2017-02-09 33 0 0
2017-02-11 33 3 0
2017-02-13 33 7 0
df = pd.DataFrame(X)
# group sum by day
df = df.groupby(['date', 'id', 'item'])['points'].sum().reset_index().sort_values(['date', 'id', 'item'])
# convert index to datetime index
df = df.set_index('date')
df.index = pd.DatetimeIndex(df.index)
# rolling sum by 3D
df['pointsum'] = df.groupby(['id', 'item'])['points'].transform(lambda x: x.rolling(window='3D').sum())
# reshape dataframe
df = df.reset_index().set_index(['date', 'id', 'item'])['pointsum'].unstack().reset_index().set_index('date').fillna(0)
df
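For what it's worth, on the sample data this should come out roughly as follows (only the dates actually present appear, since this version doesn't apply the 2-day sampling grid from the goal):
item        id     A     B
date
2017-02-05  33  20.0  10.0
2017-02-06  33   0.0  20.0
2017-02-11  33   3.0   0.0
2017-02-13  33   7.0   0.0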
I am trying to do nested groupby as follows:
>>> df1 = pd.DataFrame({'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11',5: '2016-10-12'}, 'Stock': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC', 5: 'XYZ'}, 'Quantity': {0: 60,1: 50, 2: 40, 3: 30, 4: 20, 5: 10}, 'UiD':{0:1,1:1,2:1,3:2,4:2,5:3}, 'StartTime': {0: '08:00:00.241', 1: '08:00:00.243', 2: '12:34:23.563', 3: '08:14.05.908', 4: '18:54:50.100', 5: '10:08:36.657'}, 'Sign':{0:1,1:1,2:0,3:-1,4:0,5:-1}, 'leg1':{0:2,1:2,2:4,3:5,4:7,5:8}})
>>> df1
Date Quantity Sign StartTime Stock UiD leg1
0 2016-10-11 60 1 08:00:00.241 ABC 1 2
1 2016-10-11 50 1 08:00:00.243 ABC 1 2
2 2016-10-11 40 0 12:34:23.563 ABC 1 4
3 2016-10-11 30 -1 08:14.05.908 ABC 2 5
4 2016-10-11 20 0 18:54:50.100 ABC 2 7
5 2016-10-12 10 -1 10:08:36.657 XYZ 3 8
>>> dfg1=df1.groupby(['Date','Stock'])
>>> dfg1.apply(lambda x:x.groupby('UiD').first()).groupby(['Date','Stock']).apply(lambda x:np.sum(x['Quantity']))
Date Stock
2016-10-11 ABC 90
2016-10-12 XYZ 10
dtype: int64
>>>
>>> dfg1['leg1'].sum()
Date Stock
2016-10-11 ABC 20
2016-10-12 XYZ 8
Name: leg1, dtype: int64
So far so good. Now I am trying to concatenate the two results into a new DataFrame df2 as follows:
>>> df2 = pd.concat([dfg1['leg1'].sum(), dfg1.apply(lambda x:x.groupby('UiD').first()).groupby(['Date','Stock']).apply(lambda x:np.sum(x['Quantity']))],axis=1)
0 1
Date Stock
2016-10-11 ABC 20 90
2016-10-12 XYZ 8 10
>>>
I am wondering if there is a better way to re-write the following line in order to avoid repeating groupby(['Date','Stock']):
dfg1.apply(lambda x:x.groupby('UiD').first()).groupby(['Date','Stock']).apply(lambda x:np.sum(x['Quantity']))
Also this fails if ['Date','Stock'] contains 'UiD' as one of the keys or if ['Date','Stock'] is replaced by just ['UiD'].
Please restate your question to be clearer. You want to groupby(['Date','Stock']), then:
- take only the first record for each UiD and sum (aggregate) its Quantity, but also
- sum all leg1 values for that Date,Stock combination (not just the first-for-each-UiD).
Is that right?
Anyway, you want to perform an aggregation (sum) on multiple columns, and yeah, the way to avoid repeating groupby(['Date','Stock']) is to keep one dataframe rather than trying to stitch together two dataframes from two separate aggregate operations. Something like the following (I'll fix it once you confirm this is what you want):
def filter_first_UiD(g):
    # return g.groupby('UiD').first().agg(np.sum)
    return g.groupby('UiD').first().agg({'Quantity':'sum', 'leg1':'sum'})

df1.groupby(['Date','Stock']).apply(filter_first_UiD)
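On the sample df1 this should come out along the lines of:
                  Quantity  leg1
Date       Stock
2016-10-11 ABC          90     7
2016-10-12 XYZ          10     8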
The way I dealt with the last scenario (avoiding the groupby failure when ['Date','Stock'] contains 'UiD' as one of the keys, or when ['Date','Stock'] is replaced by just ['UiD']) is as follows:
>>> df2 = pd.concat([dfg1['leg1'].sum(), dfg1['Quantity'].first() if 'UiD' in ['Date','Stock'] else dfg1.apply(lambda x:x.groupby('UiD').first()).groupby(['Date','Stock']).apply(lambda x:np.sum(x['Quantity']))],axis=1)
But a more elegant solution is still an open question.
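For what it's worth, here is how I might write that same idea more readably (just a sketch, not claiming it's the elegant solution I'm after; keys is a stand-in for whatever the grouping columns happen to be):
keys = ['Date', 'Stock']          # could equally be ['UiD']
gb = df1.groupby(keys)
if 'UiD' in keys:
    # each group holds a single UiD, so the first record per UiD is just first()
    qty = gb['Quantity'].first()
else:
    qty = gb.apply(lambda g: g.groupby('UiD')['Quantity'].first().sum())
df2 = pd.concat([gb['leg1'].sum(), qty], axis=1)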