Diff() function use with groupby for pandas - pandas

I am encountering an errors each time i attempt to compute the difference in readings for a meter in my dataset. The dataset structure is this.
id paymenttermid houseid houseid-meterid quantity month year cleaned_quantity
Datetime
2019-02-01 255 water 215 215M201 23.0 2 2019 23.0
2019-02-01 286 water 193 193M181 24.0 2 2019 24.0
2019-02-01 322 water 172 172M162 22.0 2 2019 22.0
2019-02-01 323 water 176 176M166 61.0 2 2019 61.0
2019-02-01 332 water 158 158M148 15.0 2 2019 15.0
I am attempting to generate a new column called consumption that computes the difference in quantities consumed for each house(identified by houseid-meterid) after every month of the year.
The code i am using to implement this is:
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(-1)
After executing this code, the consumption column is filled with NaN values. How can I correctly implement this logic.
The end result looks like this:
id paymenttermid houseid houseid-meterid quantity month year cleaned_quantity consumption
Datetime
2019-02-01 255 water 215 215M201 23.0 2 2019 23.0 NaN
2019-02-01 286 water 193 193M181 24.0 2 2019 24.0 NaN
2019-02-01 322 water 172 172M162 22.0 2 2019 22.0 NaN
2019-02-01 323 water 176 176M166 61.0 2 2019 61.0 NaN
2019-02-01 332 water 158 158M148 15.0 2 2019 15.0 NaN
Many thank in advance.
I have attempted to use
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(-1)
and
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(0)
and
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff()
all this commands result in the same behaviour as stated above.
Expected output should be:
Datetime houseid-meterid cleaned_quantity consumption
2019-02-01 215M201 23.0 20
2019-03-02 215M201 43.0 9
2019-04-01 215M201 52.0 12
2019-05-01 215M201 64.0 36
2019-06-01 215M201 100.0 20
what steps should i take?

Sort values by Datetime (if needed) then group by houseid-meterid before compute the diff for cleaned_quantity values then shift row to align with the right data:
df['consumption'] = (df.sort_values('Datetime')
.groupby('houseid-meterid')['cleaned_quantity']
.transform(lambda x: x.diff().shift(-1)))
print(df)
# Output
Datetime houseid-meterid cleaned_quantity consumption
0 2019-02-01 215M201 23.0 20.0
1 2019-03-02 215M201 43.0 9.0
2 2019-04-01 215M201 52.0 12.0
3 2019-05-01 215M201 64.0 36.0
4 2019-06-01 215M201 100.0 NaN

Related

Moving Average Pandas Across Group

My data has the following structure:
np.random.seed(25)
tdf = pd.DataFrame({'person_id' :[1,1,1,1,
2,2,
3,3,3,3,3,
4,4,4,
5,5,5,5,5,5,5,
6,
7,7,
8,8,8,8,8,8,8,
9,9,
10,10
],
'Date': ['2021-01-02','2021-01-05','2021-01-07','2021-01-09',
'2021-01-02','2021-01-05',
'2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11',
'2021-01-02','2021-01-05','2021-01-07',
'2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11','2021-01-13','2021-01-15',
'2021-01-02',
'2021-01-02','2021-01-05',
'2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11','2021-01-13','2021-01-15',
'2021-01-02','2021-01-05',
'2021-01-02','2021-01-05'
],
'Quantity': np.floor(np.random.random(size=35)*100)
})
And I want to calculate moving average (2 periods) over Date. So, the final output looks like the following. For first MA, we are taking 2021-01-02 & 2021-01-05 across all observations & calculate the MA (50). Similarly for other dates. The output need not be in the structure I'm showing the report. I just need date & MA column in the final data.
Thanks!
IIUC, you can aggregate the similar dates first, getting the sum and count.
Then take the sum per rolling 2 dates (here it doesn't look like you want to take care of a defined period but rather raw successive values, so I am assuming here prior sorting).
Finally, perform the ratio of sum and count to get the mean:
g = tdf.groupby('Date')['Quantity']
out = g.sum().rolling(2).sum()/g.count().rolling(2).sum()
output:
Date
2021-01-02 NaN
2021-01-05 50.210526
2021-01-07 45.071429
2021-01-09 41.000000
2021-01-11 44.571429
2021-01-13 48.800000
2021-01-15 50.500000
Name: Quantity, dtype: float64
joining the original data:
g = tdf.groupby('Date')['Quantity']
s = g.sum().rolling(2).sum()/g.count().rolling(2).sum()
tdf.merge(s.rename('Quantity_MA(2)'), left_on='Date', right_index=True)
output:
person_id Date Quantity Quantity_MA(2)
0 1 2021-01-02 87.0 NaN
4 2 2021-01-02 41.0 NaN
6 3 2021-01-02 68.0 NaN
11 4 2021-01-02 11.0 NaN
14 5 2021-01-02 16.0 NaN
21 6 2021-01-02 51.0 NaN
22 7 2021-01-02 38.0 NaN
24 8 2021-01-02 51.0 NaN
31 9 2021-01-02 90.0 NaN
33 10 2021-01-02 45.0 NaN
1 1 2021-01-05 58.0 50.210526
5 2 2021-01-05 11.0 50.210526
7 3 2021-01-05 43.0 50.210526
12 4 2021-01-05 44.0 50.210526
15 5 2021-01-05 52.0 50.210526
23 7 2021-01-05 99.0 50.210526
25 8 2021-01-05 55.0 50.210526
32 9 2021-01-05 66.0 50.210526
34 10 2021-01-05 28.0 50.210526
2 1 2021-01-07 27.0 45.071429
8 3 2021-01-07 55.0 45.071429
13 4 2021-01-07 58.0 45.071429
16 5 2021-01-07 32.0 45.071429
26 8 2021-01-07 3.0 45.071429
3 1 2021-01-09 18.0 41.000000
9 3 2021-01-09 36.0 41.000000
17 5 2021-01-09 69.0 41.000000
27 8 2021-01-09 71.0 41.000000
10 3 2021-01-11 40.0 44.571429
18 5 2021-01-11 36.0 44.571429
28 8 2021-01-11 42.0 44.571429
19 5 2021-01-13 83.0 48.800000
29 8 2021-01-13 43.0 48.800000
20 5 2021-01-15 48.0 50.500000
30 8 2021-01-15 28.0 50.500000

creating multi index from data grouped by month in Pandas

Consider this sample data:
Month Location Products Sales Profit
JAN 1 43 32 20
JAN 2 82 54 25
JAN 3 64 43 56
FEB 1 37 28 78
FEB 2 18 15 34
FEB 3 5 2 4
MAR 1 47 40 14
The multi-index transformation I am trying to achieve is this:
JAN FEB MAR
Location Products Sales Profit Products Sales Profit Products Sales Profit
1 43 32 29 37 28 78 47 40 14
2 82 54 25 18 15 34 null null null
3 64 43 56 5 2 4 null null null
I tried versions of this:
df.stack().to_frame().T
It put all the data into one row. So, that's not the goal.
I presume I am close in that it should be a stacking or unstacking, melting or unmelting, but my attempts have all resulted in data oatmeal at this point. Appreciate your time trying to solve this one.
You can use pivot with reorder_levels and sort_index():
df.pivot(index='Location',columns='Month').reorder_levels(order=[1,0],axis=1).sort_index(axis=1)
Month FEB JAN MAR
Products Profit Sales Products Profit Sales Products Profit Sales
Location
1 37.0 78.0 28.0 43.0 20.0 32.0 47.0 14.0 40.0
2 18.0 34.0 15.0 82.0 25.0 54.0 NaN NaN NaN
3 5.0 4.0 2.0 64.0 56.0 43.0 NaN NaN NaN
In case you are interested, this answer elaborates between swaplevel and reoder_levels.
Use pivot:
>>> df.pivot('Location', 'Month').swaplevel(axis=1).sort_index(axis=1)
Month FEB JAN MAR
Products Profit Sales Products Profit Sales Products Profit Sales
Location
1 37.0 78.0 28.0 43.0 20.0 32.0 47.0 14.0 40.0
2 18.0 34.0 15.0 82.0 25.0 54.0 NaN NaN NaN
3 5.0 4.0 2.0 64.0 56.0 43.0 NaN NaN NaN
To preserve order, you have to transform your Month column as CategoricalDtype before:
df['Month'] = df['Month'].astype(pd.CategoricalDtype(df['Month'].unique(), ordered=True))
out = df.pivot('Location', 'Month').swaplevel(axis=1).sort_index(axis=1)
print(out)
# Output:
Month JAN FEB MAR
Products Profit Sales Products Profit Sales Products Profit Sales
Location
1 43.0 20.0 32.0 37.0 78.0 28.0 47.0 14.0 40.0
2 82.0 25.0 54.0 18.0 34.0 15.0 NaN NaN NaN
3 64.0 56.0 43.0 5.0 4.0 2.0 NaN NaN NaN
Update 2
Try to force the order of level 2 columns:
df1 = df.set_index(['Month', 'Location'])
df1.columns = pd.CategoricalIndex(df1.columns, ordered=True)
df1 = df1.unstack('Month').swaplevel(axis=1).sort_index(axis=1)

Merge old and new table and fill values by date

I have df1:
Date
Symbol
Time
Quantity
Price
2020-09-04
AAPL
09:54:48
11.0
115.97
2020-09-16
AAPL
09:30:02
-11.0
115.33
2020-02-24
AMBA
09:30:02
22.0
64.24
2020-02-25
AMBA
14:01:28
-22.0
62.64
2020-07-14
AMGN
09:30:01
5.0
243.90
...
...
...
...
...
2020-12-08
YUMC
09:30:00
-22.0
56.89
2020-11-18
Z
14:20:01
12.0
100.68
2020-11-20
Z
09:30:01
-12.0
109.25
2020-09-04
ZS
09:45:24
9.0
135.94
2020-09-14
ZS
09:38:23
-9.0
126.41
and df2:
Date
USD
2
2020-02-01
22.702
3
2020-03-01
22.753
4
2020-06-01
22.601
5
2020-07-01
22.626
6
2020-08-01
22.739
..
...
...
248
2020-12-23
21.681
249
2020-12-28
21.482
250
2020-12-29
21.462
251
2020-12-30
21.372
252
2020-12-31
21.387
I want to add a new column "USD" from df2 by date in df1.
Trying
new_df = (dane5.reset_index()
.merge(kurz2,how='outer')
.fillna(0)
.set_index('Date'))
new_df.sort_index(inplace=True)
new_df= new_df[new_df['Symbol'] != 0]
print(new_df.head(50))
But I return zero value some rows:
Date
Symbol
Time
Quantity
Price
USD
2020-01-02
GL
10:31:14
13.0
104.550000
0.000
2020-01-02
ATEC
13:35:04
211.0
6.860000
0.000
2020-01-03
IOVA
14:02:32
56.0
25.790000
0.000
2020-01-03
TGNA
09:30:00
90.0
16.080000
0.000
2020-01-03
SCS
09:30:01
-70.0
20.100000
0.000
2020-01-03
SKX
09:30:09
34.0
41.940000
0.000
2020-01-06
IOVA
09:45:19
-56.0
24.490000
24.163
2020-01-06
GL
09:30:02
-13.0
103.430000
24.163
2020-01-06
SKX
15:55:15
-34.0
43.900000
24.163
2020-01-07
TGNA
15:55:16
-90.0
16.945000
23.810
2020-01-07
MRTX
09:46:18
-13.0
101.290000
23.810
2020-01-07
MRTX
09:34:10
13.0
109.430000
23.810
2020-01-08
ITCI
09:30:01
49.0
27.640000
0.000
Could you some help me please?
Sorry my bad English language.

Pandas resample is jumbling date order

I'm trying to resample some tick data I have into 1 minute blocks. The code appears to work fine but when I look into the resulting dataframe it is changing the order of the dates incorrectly. Below is what it looks like pre resample:
Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10
2020-06-30 17:00:00 41.68 2 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 3 tptAsk tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.68 1 tptTradetctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 5 tptAsk tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 8 tptAsk tctRegular NaN 255 NaN 0 msNormal
... ... ... ... ... ... ... ... ... ...
2020-01-07 17:00:21 41.94 5 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:27 41.94 4 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:40 41.94 3 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:46 41.94 4 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:50 41.94 3 tptBid tctRegular NaN 255 NaN 0 msNormal
As you can see the date starts at 5pm on the 30th of June. Then I use this code:
one_minute_dataframe['Price'] = df.Var2.resample('1min').last()
one_minute_dataframe['Volume'] = df.Var3.resample('1min').sum()
one_minute_dataframe.index = pd.to_datetime(one_minute_dataframe.index)
one_minute_dataframe.sort_index(inplace = True)
And I get the following:
Price Volume
2020-01-07 00:00:00 41.73 416
2020-01-07 00:01:00 41.74 198
2020-01-07 00:02:00 41.76 40
2020-01-07 00:03:00 41.74 166
2020-01-07 00:04:00 41.77 143
... ... ...
2020-06-30 23:55:00 41.75 127
2020-06-30 23:56:00 41.74 234
2020-06-30 23:57:00 41.76 344
2020-06-30 23:58:00 41.72 354
2020-06-30 23:59:00 41.74 451
It seems to want to start from midnight on the 1st of July. But I've tried sorting the index and it still is not changing.
Also, the datetime index seems to add lots more dates outside the ones that were originally in the dataframe and plonks them in the middle of the resampled one.
Any help would be great. Apologies if I've set this out poorly
I see what's happened. Somewhere in the data download the month and day have been switched around. That's why its putting July at the top, because it thinks it's January.

Pandas resample with percentage change

I am trying to resample my df to get an yearly data filling by percentage change.
Here is my dataframe.
data = {'year': ['2000', '2000', '2003', '2003', '2005', '2005'],
'country':['UK', 'US', 'UK','US','UK','US'],
'sales': [0, 10, 30, 25, 40, 45],
'cost': [0, 100, 300, 250, 400, 450]
}
df=pd.DataFrame(data)
dfL=df.copy()
dfL.year=dfL.year.astype('str') + '-01-01 00:00:00.00000'
dfL.year=pd.to_datetime(dfL.year)
dfL=dfL.set_index('year')
dfL
country sales cost
year
2000-01-01 UK 0 0
2000-01-01 US 10 100
2003-01-01 UK 30 300
2003-01-01 US 25 250
2005-01-01 UK 40 400
2005-01-01 US 55 550
I would like to get an output like the below..
country sales cost
year
2000-01-01 UK 0 0
2001-01-01 UK 10 100
2002-01-01 UK 20 200
2003-01-01 UK 30 300
2004-01-01 UK 35 350
2005-01-01 UK 40 400
2000-01-01 US 10 100
2001-01-01 US 15 150
2002-01-01 US 20 200
2003-01-01 US 25 250
2004-01-01 US 35 350
2005-01-01 US 45 450
I hope I would need to do a resample yearwise. but not very sure about the apply function to use.
Can any one help ?
Using resample + interpolate and reshape method stack and unstack
dfL=dfL.set_index('country',append=True).unstack().resample('YS').interpolate().stack().reset_index(level=1)
dfL
Out[309]:
country cost sales
year
2000-01-01 UK 0.0 0.0
2000-01-01 US 100.0 10.0
2001-01-01 UK 100.0 10.0
2001-01-01 US 150.0 15.0
2002-01-01 UK 200.0 20.0
2002-01-01 US 200.0 20.0
2003-01-01 UK 300.0 30.0
2003-01-01 US 250.0 25.0
2004-01-01 UK 350.0 35.0
2004-01-01 US 350.0 35.0
2005-01-01 UK 400.0 40.0
2005-01-01 US 450.0 45.0
I'd use a pivot_table to do this and then resample:
In [11]: res = dfL.pivot_table(index="year", columns="country", values=["sales", "cost"])
In [12]: res
Out[12]:
cost sales
country UK US UK US
year
2000-01-01 0 100 0 10
2003-01-01 300 250 30 25
2005-01-01 400 450 40 45
In [13]: res.resample("YS").interpolate()
Out[13]:
cost sales
country UK US UK US
year
2000-01-01 0.0 100.0 0.0 10.0
2001-01-01 100.0 150.0 10.0 15.0
2002-01-01 200.0 200.0 20.0 20.0
2003-01-01 300.0 250.0 30.0 25.0
2004-01-01 350.0 350.0 35.0 35.0
2005-01-01 400.0 450.0 40.0 45.0
Personally I'd keep it in this format, but if you want to stack it back, you can stack and reset_index:
In [14]: res.resample("YS").interpolate().stack(level=1).reset_index(level=1)
Out[14]:
country cost sales
year
2000-01-01 UK 0.0 0.0
2000-01-01 US 100.0 10.0
2001-01-01 UK 100.0 10.0
2001-01-01 US 150.0 15.0
2002-01-01 UK 200.0 20.0
2002-01-01 US 200.0 20.0
2003-01-01 UK 300.0 30.0
2003-01-01 US 250.0 25.0
2004-01-01 UK 350.0 35.0
2004-01-01 US 350.0 35.0
2005-01-01 UK 400.0 40.0
2005-01-01 US 450.0 45.0