Group by Index of Row in Pandas

I want to group and sum every 7 rows together (hence getting a total for each week). There are currently two columns: one for the date and the other for a float.
1/22/2020 NaN
1/23/2020 0.0
1/24/2020 1.0
1/25/2020 0.0
1/26/2020 3.0
1/27/2020 0.0
1/28/2020 0.0
1/29/2020 0.0
1/30/2020 0.0
1/31/2020 2.0
2/1/2020 1.0
2/2/2020 0.0
2/3/2020 3.0
2/4/2020 0.0
2/5/2020 0.0
2/6/2020 0.0
2/7/2020 0.0
2/8/2020 0.0
2/9/2020 0.0
2/10/2020 0.0
2/11/2020 1.0
2/12/2020 0.0
2/13/2020 1.0
2/14/2020 0.0
2/15/2020 0.0
2/16/2020 0.0
2/17/2020 0.0
2/18/2020 0.0
2/19/2020 0.0
2/20/2020 0.0
... ...
2/28/2020 0.0
2/29/2020 8.0
3/1/2020 6.0
3/2/2020 23.0
3/3/2020 20.0
3/4/2020 31.0
3/5/2020 68.0
3/6/2020 45.0
3/7/2020 119.0
3/8/2020 114.0
3/9/2020 64.0
3/10/2020 194.0
3/11/2020 397.0
3/12/2020 452.0
3/13/2020 590.0
3/14/2020 710.0
3/15/2020 61.0
3/16/2020 1389.0
3/17/2020 1789.0
3/18/2020 906.0
3/19/2020 3068.0
3/20/2020 4009.0
3/21/2020 4017.0
3/23/2020 25568.0
3/24/2020 10074.0
3/25/2020 12043.0
3/26/2020 18058.0
3/27/2020 17822.0
3/28/2020 19825.0
3/29/2020 19408.0

Assuming your date column is called dt and your value column is val:
import numpy as np
import pandas as pd

# in case it's not already in datetime format:
df["dt"] = pd.to_datetime(df["dt"])
# your data looks sorted, but sorted order is a prerequisite here:
df = df.sort_values("dt")
# integer-divide each row's position by 7 to label consecutive blocks of 7 rows
df = df.groupby(np.arange(len(df)) // 7).agg({"dt": ["min", "max"], "val": "sum"})
The aggregation on dt is there only so you can see explicitly which interval each group covers; taking just the min might be enough, or you can leave it out entirely.
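For intuition, np.arange(len(df)) // 7 just labels each consecutive block of 7 rows with the same integer, which is what the groupby buckets on; a tiny sketch with 10 rows:
import numpy as np
np.arange(10) // 7
# array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1]) -> rows 0-6 form group 0, rows 7-9 form group 1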

Set the date column as the index and use resample:
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df.resample('1W').sum()
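Note that resample('1W') bins by calendar week (weeks ending on Sunday by default, i.e. 'W-SUN'), so its buckets generally won't line up with fixed blocks of 7 rows counted from the first date. If you want 7-day windows anchored on the first timestamp instead, a '7D' rule does that:
df.resample('7D').sum()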

Related

Using diff() with groupby in pandas

I am encountering an error each time I attempt to compute the difference in readings for a meter in my dataset. The dataset structure is this:
id paymenttermid houseid houseid-meterid quantity month year cleaned_quantity
Datetime
2019-02-01 255 water 215 215M201 23.0 2 2019 23.0
2019-02-01 286 water 193 193M181 24.0 2 2019 24.0
2019-02-01 322 water 172 172M162 22.0 2 2019 22.0
2019-02-01 323 water 176 176M166 61.0 2 2019 61.0
2019-02-01 332 water 158 158M148 15.0 2 2019 15.0
I am attempting to generate a new column called consumption that computes the difference in quantities consumed for each house (identified by houseid-meterid) from one month of the year to the next.
The code I am using to implement this is:
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(-1)
After executing this code, the consumption column is filled with NaN values. How can I correctly implement this logic?
The current (incorrect) result looks like this:
id paymenttermid houseid houseid-meterid quantity month year cleaned_quantity consumption
Datetime
2019-02-01 255 water 215 215M201 23.0 2 2019 23.0 NaN
2019-02-01 286 water 193 193M181 24.0 2 2019 24.0 NaN
2019-02-01 322 water 172 172M162 22.0 2 2019 22.0 NaN
2019-02-01 323 water 176 176M166 61.0 2 2019 61.0 NaN
2019-02-01 332 water 158 158M148 15.0 2 2019 15.0 NaN
Many thanks in advance.
I have attempted to use
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(-1)
and
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(0)
and
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff()
All these commands result in the same behaviour as stated above.
Expected output should be:
Datetime houseid-meterid cleaned_quantity consumption
2019-02-01 215M201 23.0 20
2019-03-02 215M201 43.0 9
2019-04-01 215M201 52.0 12
2019-05-01 215M201 64.0 36
2019-06-01 215M201 100.0 20
What steps should I take?
Sort by Datetime (if needed), then group by houseid-meterid alone before computing the diff of the cleaned_quantity values, and shift each result up one row to align it with the right reading. (Grouping by year and month as well, as in your attempts, leaves a single row per group, so diff has nothing to compare against and returns all NaN.)
df['consumption'] = (df.sort_values('Datetime')
                       .groupby('houseid-meterid')['cleaned_quantity']
                       .transform(lambda x: x.diff().shift(-1)))
print(df)
# Output
Datetime houseid-meterid cleaned_quantity consumption
0 2019-02-01 215M201 23.0 20.0
1 2019-03-02 215M201 43.0 9.0
2 2019-04-01 215M201 52.0 12.0
3 2019-05-01 215M201 64.0 36.0
4 2019-06-01 215M201 100.0 NaN
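For completeness, a minimal, self-contained reproduction with the single meter from the expected output (values taken from the question):
import pandas as pd

df = pd.DataFrame({
    'Datetime': pd.to_datetime(['2019-02-01', '2019-03-02', '2019-04-01', '2019-05-01', '2019-06-01']),
    'houseid-meterid': ['215M201'] * 5,
    'cleaned_quantity': [23.0, 43.0, 52.0, 64.0, 100.0],
})
df['consumption'] = (df.sort_values('Datetime')
                       .groupby('houseid-meterid')['cleaned_quantity']
                       .transform(lambda x: x.diff().shift(-1)))
# the last reading per meter has no successor, so its consumption is NaN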

Make a mean of several years of dataframes, hour by hour

I have several dataframes of some value taken every hour, over several years, like this:
df1
Out[6]:
time P G(i) H_sun T2m WS10m Int
0 2005-01-01 00:10:00 0.0 0.0 0.0 0.68 2.11 0.0
1 2005-01-01 01:10:00 0.0 0.0 0.0 0.38 2.11 0.0
2 2005-01-01 02:10:00 0.0 0.0 0.0 0.08 2.11 0.0
3 2005-01-01 03:10:00 0.0 0.0 0.0 -0.22 2.11 0.0
4 2005-01-01 04:10:00 0.0 0.0 0.0 0.06 2.21 0.0
... ... ... ... ... ... ...
8755 2005-12-31 19:10:00 0.0 0.0 0.0 1.75 1.71 0.0
8756 2005-12-31 20:10:00 0.0 0.0 0.0 1.49 1.71 0.0
8757 2005-12-31 21:10:00 0.0 0.0 0.0 1.23 1.70 0.0
8758 2005-12-31 22:10:00 0.0 0.0 0.0 0.95 1.65 0.0
8759 2005-12-31 23:10:00 0.0 0.0 0.0 0.67 1.60 0.0
[8760 rows x 7 columns]
df2
Out[7]:
time P G(i) H_sun T2m WS10m Int
8760 2006-01-01 00:10:00 0.0 0.0 0.0 0.39 1.56 0.0
8761 2006-01-01 01:10:00 0.0 0.0 0.0 0.26 1.52 0.0
8762 2006-01-01 02:10:00 0.0 0.0 0.0 0.13 1.49 0.0
8763 2006-01-01 03:10:00 0.0 0.0 0.0 0.01 1.45 0.0
8764 2006-01-01 04:10:00 0.0 0.0 0.0 -0.45 1.65 0.0
... ... ... ... ... ... ...
17515 2006-12-31 19:10:00 0.0 0.0 0.0 4.24 1.32 0.0
17516 2006-12-31 20:10:00 0.0 0.0 0.0 4.00 1.32 0.0
17517 2006-12-31 21:10:00 0.0 0.0 0.0 3.75 1.32 0.0
17518 2006-12-31 22:10:00 0.0 0.0 0.0 4.34 1.54 0.0
17519 2006-12-31 23:10:00 0.0 0.0 0.0 4.92 1.76 0.0
[8760 rows x 7 columns]
and this for 10 years.
I'm trying to take the mean of the values at "20XX-01-01 00:10:00" across the years, to obtain something like "the mean of all values on 01 January at 00:10". Ideally the time column would be merged down to just "01-01 00:10:00".
Is it possible?
For now I only know the df.mean() function, which takes all the values of a column and returns a single result, and that's not what I want.
Join all the DataFrames together with concat:
df = pd.concat([df1, df2, df3, ..., df10])
Then aggregate the mean after mapping every timestamp onto the same year, e.g. 2005:
df['time'] = pd.to_datetime(df['time'])
# to remove 29 Feb so leap years don't create a sparse bucket:
# df = df[(df['time'].dt.month != 2) | (df['time'].dt.day != 29)]
df1 = df.groupby(pd.to_datetime(df['time'].dt.strftime('2005-%m-%d %H:%M:%S'))).mean()
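An alternative, if you'd rather end up with the "01-01 00:10" style index directly, is to group on a month-day-hour string key; a sketch, assuming df is the concatenated frame from above:
df['time'] = pd.to_datetime(df['time'])
out = df.groupby(df['time'].dt.strftime('%m-%d %H:%M')).mean(numeric_only=True)
Because the keys are zero-padded, their lexicographic order matches chronological order.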

Pandas resample is jumbling date order

I'm trying to resample some tick data I have into 1 minute blocks. The code appears to work fine, but when I look into the resulting dataframe it changes the order of the dates incorrectly. Below is what it looks like pre-resample:
Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10
2020-06-30 17:00:00 41.68 2 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 3 tptAsk tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.68 1 tptTrade tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 5 tptAsk tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 8 tptAsk tctRegular NaN 255 NaN 0 msNormal
... ... ... ... ... ... ... ... ... ...
2020-01-07 17:00:21 41.94 5 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:27 41.94 4 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:40 41.94 3 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:46 41.94 4 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:50 41.94 3 tptBid tctRegular NaN 255 NaN 0 msNormal
As you can see the date starts at 5pm on the 30th of June. Then I use this code:
one_minute_dataframe = pd.DataFrame()   # create the target frame first
one_minute_dataframe['Price'] = df.Var2.resample('1min').last()
one_minute_dataframe['Volume'] = df.Var3.resample('1min').sum()
one_minute_dataframe.index = pd.to_datetime(one_minute_dataframe.index)
one_minute_dataframe.sort_index(inplace=True)
And I get the following:
Price Volume
2020-01-07 00:00:00 41.73 416
2020-01-07 00:01:00 41.74 198
2020-01-07 00:02:00 41.76 40
2020-01-07 00:03:00 41.74 166
2020-01-07 00:04:00 41.77 143
... ... ...
2020-06-30 23:55:00 41.75 127
2020-06-30 23:56:00 41.74 234
2020-06-30 23:57:00 41.76 344
2020-06-30 23:58:00 41.72 354
2020-06-30 23:59:00 41.74 451
It seems to want to start from midnight on the 1st of July, but I've tried sorting the index and it still doesn't change.
Also, the datetime index seems to add lots more dates outside the ones that were originally in the dataframe and plonks them in the middle of the resampled one.
Any help would be great. Apologies if I've set this out poorly.
I see what's happened. Somewhere in the data download the month and day have been switched around. That's why it's putting July at the top: because it thinks it's January.
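If that is the cause, re-parsing the timestamps with dayfirst=True (or an explicit format string) before resampling avoids the swapped interpretation; a sketch, assuming the raw index holds day-first strings:
df.index = pd.to_datetime(df.index, dayfirst=True)
df = df.sort_index()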

Pandas dataframe column math when a row condition is met

I have a dataframe containing the following data. I would like to query the age column of each dataframe (1-4) for values between 295.0 and 305.0. For each dataframe there will be a single age value in this range and a corresponding subsidence value. I would like to take that subsidence value and add it to the remaining values in the dataframe.
For instance, in the first dataframe, at age 300.0, subsidence = 274.057861. In this case, 274.057861 would be added to the rest of the subsidence values in dataframe 1.
In the second dataframe, at age 299.0, subsidence = 77.773720, so 77.773720 would be added to the rest of the subsidence values in dataframe 2. And so on. Is it possible to do this easily in Pandas, or am I better off working towards an alternate solution?
Thanks :)
1 2 3 4 \
age subsidence age subsidence age subsidence age
0 0.0 -201.538712 0.0 -235.865433 0.0 134.728821 0.0
1 10.0 -77.446548 8.0 -102.183365 10.0 88.796074 10.0
2 20.0 44.901043 18.0 35.316868 20.0 35.871178 20.0
3 31.0 103.172806 28.0 98.238434 30.0 -17.901653 30.0
4 41.0 124.625687 38.0 124.719254 40.0 -13.381897 40.0
5 51.0 122.877541 48.0 130.725235 50.0 -25.396996 50.0
6 61.0 138.810898 58.0 140.301117 60.0 -37.057205 60.0
7 71.0 119.818176 68.0 137.433670 70.0 -11.587639 70.0
8 81.0 77.867607 78.0 96.285652 80.0 21.854662 80.0
9 91.0 33.612885 88.0 32.740803 90.0 67.754501 90.0
10 101.0 15.885051 98.0 8.626043 100.0 150.172699 100.0
11 111.0 118.089211 109.0 88.812439 100.0 150.172699 100.0
12 121.0 247.301956 119.0 212.000061 110.0 124.367874 110.0
13 131.0 268.748627 129.0 253.204819 120.0 157.066010 120.0
14 141.0 231.799255 139.0 292.828461 130.0 145.811783 130.0
15 151.0 259.626343 149.0 260.067993 140.0 175.388763 140.0
16 161.0 288.704651 159.0 240.051605 150.0 265.435791 150.0
17 171.0 249.121857 169.0 203.727097 160.0 336.471924 160.0
18 181.0 339.038055 179.0 245.738480 170.0 283.483582 170.0
19 191.0 395.920410 189.0 318.751160 180.0 381.575500 180.0
20 201.0 404.843445 199.0 338.245209 190.0 491.534424 190.0
21 211.0 461.865784 209.0 418.997559 200.0 495.025604 200.0
22 221.0 518.710632 219.0 446.496216 200.0 495.025604 200.0
23 231.0 483.963867 224.0 479.213287 210.0 571.982361 210.0
24 239.0 445.292389 229.0 492.352905 220.0 611.698608 220.0
25 249.0 396.609497 239.0 445.322144 230.0 645.545776 230.0
26 259.0 321.553558 249.0 429.429932 240.0 596.046265 240.0
27 269.0 306.150177 259.0 297.355103 250.0 547.157654 250.0
28 279.0 259.717468 269.0 174.210785 260.0 457.071472 260.0
29 289.0 301.114410 279.0 114.175957 270.0 438.705170 270.0
30 300.0 274.057861 289.0 91.768898 280.0 397.985535 280.0
31 310.0 216.760361 299.0 77.773720 290.0 426.858276 290.0
32 320.0 192.317093 309.0 73.767090 300.0 410.508331 300.0
33 330.0 179.511917 319.0 63.295345 300.0 410.508331 300.0
34 340.0 231.126053 329.0 -4.296405 310.0 355.303558 310.0
35 350.0 142.894958 339.0 -62.745190 320.0 284.932892 320.0
36 360.0 51.547047 350.0 -60.224789 330.0 251.817078 330.0
37 370.0 -39.064964 360.0 -85.826874 340.0 302.303925 340.0
38 380.0 -54.111374 370.0 -81.139206 350.0 207.799942 350.0
39 390.0 -68.999535 380.0 -40.080212 360.0 77.729439 360.0
40 400.0 -47.595322 390.0 -29.945852 370.0 -127.037209 370.0
41 410.0 13.159509 400.0 -26.656607 380.0 -109.327545 380.0
42 NaN NaN 410.0 -13.723764 390.0 -127.160942 390.0
43 NaN NaN NaN NaN 400.0 -61.404510 400.0
44 NaN NaN NaN NaN 410.0 13.058900 410.0
For the first dataframe, pick out the subsidence value where age falls in the window, then add it to the rest of the column:
mask = (df1.age > 295) & (df1.age < 305)
offset = df1.loc[mask, 'subsidence'].iloc[0]
df1.loc[~mask, 'subsidence'] += offset
You would need to update each of the dataframes the same way.
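A minimal sketch doing the same for all four frames in one pass (the names df1 through df4 are hypothetical):
for d in (df1, df2, df3, df4):
    mask = (d['age'] > 295.0) & (d['age'] < 305.0)
    offset = d.loc[mask, 'subsidence'].iloc[0]
    d.loc[~mask, 'subsidence'] += offset  # add the matched value to the remaining rows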

Week difference from current week to last day of previous week

I have a pivot pandas data frame (sales by region) that got created from another pandas data frame (sales by store) using the pivot_table method.
As an example:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'store': ['A', 'B', 'C', 'D', 'E'] * 7,
    'region': ['NW', 'NW', 'SW', 'NE', 'NE'] * 7,
    'date': ['2017-03-30'] * 5 + ['2017-04-05'] * 5 + ['2017-04-07'] * 5 + ['2017-04-12'] * 5
            + ['2017-04-13'] * 5 + ['2017-04-17'] * 5 + ['2017-04-20'] * 5,
    'sales': [30, 1, 133, 9, 1, 30, 3, 135, 9, 11, 30, 1, 140, 15, 15, 25, 10, 137, 9, 3,
              29, 10, 137, 9, 11, 30, 19, 145, 20, 10, 30, 8, 141, 25, 25],
})
df_sales = df.pivot_table(index=['region'], columns=['date'], aggfunc=[np.sum], margins=True)
df_sales = df_sales.iloc[:, :-1]   # drop the trailing 'All' margin column (.ix is deprecated)
My goal is to do the following to the sales data frame.
Add a column called week difference that computes the difference between the total sales for this week and the latest value (by date) from the previous week. Assumption: I always have data for some days each week, but not for fixed days.
The week difference column will change as new data comes in, but for the latest data it would look like:
>>> df_sales
sum \
sales
date 2017-03-30 2017-04-05 2017-04-07 2017-04-12 2017-04-13 2017-04-17
region
NE 10.0 20.0 30.0 12.0 20.0 30.0
NW 31.0 33.0 31.0 35.0 39.0 49.0
SW 133.0 135.0 140.0 137.0 137.0 145.0
All 174.0 188.0 201.0 184.0 196.0 224.0
date 2017-04-20 WeekDifference
region
NE 50.0 50.0-20.0
NW 38.0 38.0-39.0
SW 141.0 141.0-137.0
All 229.0 229.0-196.0
This is the difference between the latest date and the last day with data in the previous week. In this specific example, we are in the week of 2017-04-20, and the last day with data from the previous week is 2017-04-13.
I'd want to do this in a general way as data gets updated.
Using the same setup code as in the question:
Input:
sum \
sales
date 2017-03-30 2017-04-05 2017-04-07 2017-04-12 2017-04-13 2017-04-17
region
NE 10.0 20.0 30.0 12.0 20.0 30.0
NW 31.0 33.0 31.0 35.0 39.0 49.0
SW 133.0 135.0 140.0 137.0 137.0 145.0
All 174.0 188.0 201.0 184.0 196.0 224.0
date 2017-04-20
region
NE 50.0
NW 38.0
SW 141.0
All 229.0
Grab the latest column, then find the previous week's column either by an exact one-week offset or by masking on week-of-year:
last_col = df_sales.iloc[:, -1]                       # latest date column
last_date = pd.to_datetime(last_col.name[2])
# option 1: the column exactly one week before the latest date
prev_col = df_sales[('sum', 'sales', (last_date - pd.DateOffset(weeks=1)).strftime('%Y-%m-%d'))]
# option 2: the latest column in the previous ISO week (overwrites option 1)
weeks = pd.DatetimeIndex(df_sales.columns.get_level_values(2)).isocalendar().week.astype(int)
prev_col = df_sales.loc[:, (weeks == last_date.isocalendar()[1] - 1).to_numpy()].iloc[:, -1]
df_sales[('sum', 'sales', 'weekdifference')] = last_col.astype(str) + ' - ' + prev_col.astype(str)
print(df_sales)
Output:
sum \
sales
date 2017-03-30 2017-04-05 2017-04-07 2017-04-12 2017-04-13 2017-04-17
region
NE 10.0 20.0 30.0 12.0 20.0 30.0
NW 31.0 33.0 31.0 35.0 39.0 49.0
SW 133.0 135.0 140.0 137.0 137.0 145.0
All 174.0 188.0 201.0 184.0 196.0 224.0
date 2017-04-20 weekdifference
region
NE 50.0 50.0 - 20.0
NW 38.0 38.0 - 39.0
SW 141.0 141.0 - 137.0
All 229.0 229.0 - 196.0
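If a numeric difference is more useful than the display string, the same two columns subtract directly (weekdiff_value is a hypothetical name):
df_sales[('sum', 'sales', 'weekdiff_value')] = last_col - prev_col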