Week difference from current week to last day previous week - pandas

I have a pivoted pandas data frame (sales by region) that was created from another pandas data frame (sales by store) using the pivot_table method.
As an example:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'store': ['A', 'B', 'C', 'D', 'E'] * 7,
     'region': ['NW', 'NW', 'SW', 'NE', 'NE'] * 7,
     'date': ['2017-03-30'] * 5 + ['2017-04-05'] * 5 + ['2017-04-07'] * 5 + ['2017-04-12'] * 5
             + ['2017-04-13'] * 5 + ['2017-04-17'] * 5 + ['2017-04-20'] * 5,
     'sales': [30, 1, 133, 9, 1, 30, 3, 135, 9, 11, 30, 1, 140, 15, 15, 25, 10, 137, 9, 3,
               29, 10, 137, 9, 11, 30, 19, 145, 20, 10, 30, 8, 141, 25, 25]
    })
df_sales = df.pivot_table(index=['region'], columns=['date'], aggfunc=[np.sum], margins=True)
# .ix is deprecated; drop the trailing 'All' margin column with iloc instead
df_sales = df_sales.iloc[:, :-1]
My goal is to do the following to the sales data frame.
Add a column called week difference that computes the difference between the total sales for the current week and the latest value (by date) from the previous week. Assumption: I always have data for some days in each week, but not for fixed days.
The week difference column will change as new data comes in, but for the latest data it would look like:
>>> df_sales
sum \
sales
date 2017-03-30 2017-04-05 2017-04-07 2017-04-12 2017-04-13 2017-04-17
region
NE 10.0 20.0 30.0 12.0 20.0 30.0
NW 31.0 33.0 31.0 35.0 39.0 49.0
SW 133.0 135.0 140.0 137.0 137.0 145.0
All 174.0 188.0 201.0 184.0 196.0 224.0
date 2017-04-20 WeekDifference
region
NE 50.0 50.0-20.0
NW 38.0 38.0-39.0
SW 141.0 141.0-137.0
All 229.0 229.0-196.0
This is because it's the difference between the latest date and the last day of data in the previous week. In this specific example, we are in the week of 2017-04-20, and the last day of data from the previous week is 2017-04-13.
I'd like to do this in a general way as data gets updated.

Calculate the date of the last column, then pick the latest column of the previous week (a fixed DateOffset(weeks=-1) lookup would only work when that exact date exists as a column, so mask by ISO week instead):
# Column names in this pivot are ('sum', 'sales', '<date>') tuples
last_column = pd.to_datetime(df_sales.iloc[:, -1].name[2])
# Mask the columns that fall in the previous ISO week (.weekofyear is
# deprecated, hence isocalendar(); note week numbers reset at year end)
weeks = pd.to_datetime(df_sales.columns.get_level_values(2)).isocalendar().week.astype(int)
col_mask = (weeks == last_column.isocalendar()[1] - 1).to_numpy()
prev_week = df_sales.loc[:, col_mask].iloc[:, -1]  # latest column of the previous week
df_sales.loc[:, ('sum', 'sales', 'weekdifference')] = (
    df_sales.iloc[:, -1].astype(str) + ' - ' + prev_week.astype(str))
print(df_sales)
Output:
sum \
sales
date 2017-03-30 2017-04-05 2017-04-07 2017-04-12 2017-04-13 2017-04-17
region
NE 10.0 20.0 30.0 12.0 20.0 30.0
NW 31.0 33.0 31.0 35.0 39.0 49.0
SW 133.0 135.0 140.0 137.0 137.0 145.0
All 174.0 188.0 201.0 184.0 196.0 224.0
date 2017-04-20 weekdifference
region
NE 50.0 50.0 - 20.0
NW 38.0 38.0 - 39.0
SW 141.0 141.0 - 137.0
All 229.0 229.0 - 196.0
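One caveat on the week mask: week-number arithmetic breaks at a year boundary (week 1 minus 1 never matches week 52 or 53 of the prior year). A minimal sketch that compares actual dates instead, run before the weekdifference column is added:
dates = pd.to_datetime(df_sales.columns.get_level_values(2))
last = dates.max()
week_start = (last - pd.Timedelta(days=last.weekday())).normalize()  # Monday of the current week
prev_mask = (dates >= week_start - pd.Timedelta(weeks=1)) & (dates < week_start)
prev_week = df_sales.loc[:, prev_mask].iloc[:, -1]  # latest column of the previous week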

Diff() function use with groupby for pandas

I am encountering an error each time I attempt to compute the difference in readings for a meter in my dataset. The dataset structure is this:
id paymenttermid houseid houseid-meterid quantity month year cleaned_quantity
Datetime
2019-02-01 255 water 215 215M201 23.0 2 2019 23.0
2019-02-01 286 water 193 193M181 24.0 2 2019 24.0
2019-02-01 322 water 172 172M162 22.0 2 2019 22.0
2019-02-01 323 water 176 176M166 61.0 2 2019 61.0
2019-02-01 332 water 158 158M148 15.0 2 2019 15.0
I am attempting to generate a new column called consumption that computes the difference in quantities consumed for each house (identified by houseid-meterid) after every month of the year.
The code I am using to implement this is:
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(-1)
After executing this code, the consumption column is filled with NaN values. How can I correctly implement this logic?
The result I currently get looks like this:
id paymenttermid houseid houseid-meterid quantity month year cleaned_quantity consumption
Datetime
2019-02-01 255 water 215 215M201 23.0 2 2019 23.0 NaN
2019-02-01 286 water 193 193M181 24.0 2 2019 24.0 NaN
2019-02-01 322 water 172 172M162 22.0 2 2019 22.0 NaN
2019-02-01 323 water 176 176M166 61.0 2 2019 61.0 NaN
2019-02-01 332 water 158 158M148 15.0 2 2019 15.0 NaN
Many thanks in advance.
I have attempted to use
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(-1)
and
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(0)
and
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff()
All these commands result in the same behaviour as stated above.
Expected output should be:
Datetime houseid-meterid cleaned_quantity consumption
2019-02-01 215M201 23.0 20
2019-03-02 215M201 43.0 9
2019-04-01 215M201 52.0 12
2019-05-01 215M201 64.0 36
2019-06-01 215M201 100.0 20
What steps should I take?
The groupby keys are the problem: grouping by ["year", "month", "houseid-meterid"] puts each monthly reading in its own single-row group, so diff has no neighbouring row to subtract and returns NaN. Instead, sort the values by Datetime (if needed), group by houseid-meterid alone, compute the diff of the cleaned_quantity values, then shift the rows to align each difference with the right reading:
df['consumption'] = (df.sort_values('Datetime')
                       .groupby('houseid-meterid')['cleaned_quantity']
                       .transform(lambda x: x.diff().shift(-1)))
print(df)
# Output
Datetime houseid-meterid cleaned_quantity consumption
0 2019-02-01 215M201 23.0 20.0
1 2019-03-02 215M201 43.0 9.0
2 2019-04-01 215M201 52.0 12.0
3 2019-05-01 215M201 64.0 36.0
4 2019-06-01 215M201 100.0 NaN
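A shorter equivalent, if preferred: diff(-1) computes current minus next within each group, so negating it yields the same next-minus-current consumption (a sketch using the same column names as above):
df['consumption'] = -(df.sort_values('Datetime')
                        .groupby('houseid-meterid')['cleaned_quantity']
                        .diff(-1))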

Changing index type from a value_counts()

I am trying to change the index type from numeric to string after a value_counts():
df['value'].value_counts().sort_index()
output:
40.0 1448
45.0 28558
50.0 83675
55.0 96377
60.0 47351
65.0 13226
70.0 2602
75.0 568
80.0 72
100.0 52
105.0 53
Name: value, dtype: int64
expected output:
40.0 1448
45.0 28558
50.0 83675
55.0 96377
60.0 47351
65.0 13226
70.0 2602
75.0 568
80.0 72
100.0 52
105.0 53
Name: value, dtype: string
If you need to convert the sorted index values (like 40.0), use rename:
df['value'].value_counts().sort_index().rename(index=str)
If you need to convert the count values (like 1448), use Series.astype:
df['value'].value_counts().sort_index().astype(str)
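Another option, if you prefer to convert the index in place, is Index.astype on the result (a sketch, with s standing for the value_counts result):
s = df['value'].value_counts().sort_index()
s.index = s.index.astype(str)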

Group by Index of Row in Pandas

I want to group and sum every 7 rows together (hence getting a total for each week). There are currently two columns: one for the date and the other for a float.
1/22/2020 NaN
1/23/2020 0.0
1/24/2020 1.0
1/25/2020 0.0
1/26/2020 3.0
1/27/2020 0.0
1/28/2020 0.0
1/29/2020 0.0
1/30/2020 0.0
1/31/2020 2.0
2/1/2020 1.0
2/2/2020 0.0
2/3/2020 3.0
2/4/2020 0.0
2/5/2020 0.0
2/6/2020 0.0
2/7/2020 0.0
2/8/2020 0.0
2/9/2020 0.0
2/10/2020 0.0
2/11/2020 1.0
2/12/2020 0.0
2/13/2020 1.0
2/14/2020 0.0
2/15/2020 0.0
2/16/2020 0.0
2/17/2020 0.0
2/18/2020 0.0
2/19/2020 0.0
2/20/2020 0.0
... ...
2/28/2020 0.0
2/29/2020 8.0
3/1/2020 6.0
3/2/2020 23.0
3/3/2020 20.0
3/4/2020 31.0
3/5/2020 68.0
3/6/2020 45.0
3/7/2020 119.0
3/8/2020 114.0
3/9/2020 64.0
3/10/2020 194.0
3/11/2020 397.0
3/12/2020 452.0
3/13/2020 590.0
3/14/2020 710.0
3/15/2020 61.0
3/16/2020 1389.0
3/17/2020 1789.0
3/18/2020 906.0
3/19/2020 3068.0
3/20/2020 4009.0
3/21/2020 4017.0
3/23/2020 25568.0
3/24/2020 10074.0
3/25/2020 12043.0
3/26/2020 18058.0
3/27/2020 17822.0
3/28/2020 19825.0
3/29/2020 19408.0
Assuming your date column is called dt and your value column is val:
import numpy as np
import pandas as pd

# in case it's not already in datetime format:
df["dt"] = pd.to_datetime(df["dt"])
# the data looks sorted already, but sorted order is a prerequisite here:
df = df.sort_values("dt")
# integer-divide the positional index by 7 to label each 7-row chunk
df = df.groupby(np.arange(len(df)) // 7).agg({"dt": ["min", "max"], "val": "sum"})
The aggregation for dt is done only so you can see the aggregated interval explicitly; taking just the min might be enough, or you can ignore it altogether.
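For instance, on the sample above the first 7-row bin covers 1/22 through 1/28 and sums to 4.0, since sum() skips the leading NaN:
print(df.head(1))
#            dt                  val
#           min        max       sum
# 0  2020-01-22 2020-01-28       4.0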
Alternatively, set the date column as the index and use resample. Note that '1W' bins by calendar week (ending on Sunday by default), so the totals can differ from strict 7-row chunks, particularly since 3/22/2020 is missing from the data:
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df.resample('1W').sum()
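If exact 7-day windows anchored at the first date are wanted instead of calendar weeks, a sketch (the column names Date and Cases are placeholders; adjust to the actual ones):
df['week'] = (df['Date'] - df['Date'].min()).dt.days // 7
weekly = df.groupby('week')['Cases'].sum()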

Pandas resample with percentage change

I am trying to resample my df to get yearly data, filling the missing years by percentage change.
Here is my dataframe:
data = {'year': ['2000', '2000', '2003', '2003', '2005', '2005'],
        'country': ['UK', 'US', 'UK', 'US', 'UK', 'US'],
        'sales': [0, 10, 30, 25, 40, 45],
        'cost': [0, 100, 300, 250, 400, 450]}
df = pd.DataFrame(data)
dfL = df.copy()
dfL.year = dfL.year.astype('str') + '-01-01 00:00:00.00000'
dfL.year = pd.to_datetime(dfL.year)
dfL = dfL.set_index('year')
dfL
country sales cost
year
2000-01-01 UK 0 0
2000-01-01 US 10 100
2003-01-01 UK 30 300
2003-01-01 US 25 250
2005-01-01 UK 40 400
2005-01-01 US 45 450
I would like to get an output like the below..
country sales cost
year
2000-01-01 UK 0 0
2001-01-01 UK 10 100
2002-01-01 UK 20 200
2003-01-01 UK 30 300
2004-01-01 UK 35 350
2005-01-01 UK 40 400
2000-01-01 US 10 100
2001-01-01 US 15 150
2002-01-01 US 20 200
2003-01-01 US 25 250
2004-01-01 US 35 350
2005-01-01 US 45 450
I think I would need to do a year-wise resample, but I'm not sure which apply function to use.
Can anyone help?
Use resample + interpolate together with the reshape methods stack and unstack (interpolate defaults to linear interpolation, which matches the expected output):
dfL = (dfL.set_index('country', append=True)
          .unstack()
          .resample('YS')
          .interpolate()
          .stack()
          .reset_index(level=1))
dfL
Out[309]:
country cost sales
year
2000-01-01 UK 0.0 0.0
2000-01-01 US 100.0 10.0
2001-01-01 UK 100.0 10.0
2001-01-01 US 150.0 15.0
2002-01-01 UK 200.0 20.0
2002-01-01 US 200.0 20.0
2003-01-01 UK 300.0 30.0
2003-01-01 US 250.0 25.0
2004-01-01 UK 350.0 35.0
2004-01-01 US 350.0 35.0
2005-01-01 UK 400.0 40.0
2005-01-01 US 450.0 45.0
I'd use a pivot_table to do this and then resample:
In [11]: res = dfL.pivot_table(index="year", columns="country", values=["sales", "cost"])
In [12]: res
Out[12]:
cost sales
country UK US UK US
year
2000-01-01 0 100 0 10
2003-01-01 300 250 30 25
2005-01-01 400 450 40 45
In [13]: res.resample("YS").interpolate()
Out[13]:
cost sales
country UK US UK US
year
2000-01-01 0.0 100.0 0.0 10.0
2001-01-01 100.0 150.0 10.0 15.0
2002-01-01 200.0 200.0 20.0 20.0
2003-01-01 300.0 250.0 30.0 25.0
2004-01-01 350.0 350.0 35.0 35.0
2005-01-01 400.0 450.0 40.0 45.0
Personally I'd keep it in this format, but if you want to stack it back, you can stack and reset_index:
In [14]: res.resample("YS").interpolate().stack(level=1).reset_index(level=1)
Out[14]:
country cost sales
year
2000-01-01 UK 0.0 0.0
2000-01-01 US 100.0 10.0
2001-01-01 UK 100.0 10.0
2001-01-01 US 150.0 15.0
2002-01-01 UK 200.0 20.0
2002-01-01 US 200.0 20.0
2003-01-01 UK 300.0 30.0
2003-01-01 US 250.0 25.0
2004-01-01 UK 350.0 35.0
2004-01-01 US 350.0 35.0
2005-01-01 UK 400.0 40.0
2005-01-01 US 450.0 45.0
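Since the title mentions percentage change: once the frame is interpolated, the year-over-year percentage change itself can be taken with pct_change on the wide result (a sketch; note the UK zero baseline in 2000 produces inf for the first step):
In [15]: res.resample("YS").interpolate().pct_change()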

Pandas dataframe column math when row conditions is met

I have a dataframe containing the following data. I would like to query the age column of each dataframe (1-4) for values between 295.0 and 305.0. For each dataframe there will be a single age value in this range and a corresponding subsidence value. I would like to take the subsidence value and add it to the remaining values in the dataframe.
For instance, in the first dataframe, at age 300.0, subsidence = 274.057861. In this case, 274.057861 would be added to the rest of the subsidence values in dataframe 1.
In the second dataframe, at age 299.0, subsidence = 77.773720, so 77.773720 would be added to the rest of the subsidence values in dataframe 2, and so on. Is it possible to do this easily in Pandas, or am I better off working towards an alternate solution?
Thanks :)
1 2 3 4 \
age subsidence age subsidence age subsidence age
0 0.0 -201.538712 0.0 -235.865433 0.0 134.728821 0.0
1 10.0 -77.446548 8.0 -102.183365 10.0 88.796074 10.0
2 20.0 44.901043 18.0 35.316868 20.0 35.871178 20.0
3 31.0 103.172806 28.0 98.238434 30.0 -17.901653 30.0
4 41.0 124.625687 38.0 124.719254 40.0 -13.381897 40.0
5 51.0 122.877541 48.0 130.725235 50.0 -25.396996 50.0
6 61.0 138.810898 58.0 140.301117 60.0 -37.057205 60.0
7 71.0 119.818176 68.0 137.433670 70.0 -11.587639 70.0
8 81.0 77.867607 78.0 96.285652 80.0 21.854662 80.0
9 91.0 33.612885 88.0 32.740803 90.0 67.754501 90.0
10 101.0 15.885051 98.0 8.626043 100.0 150.172699 100.0
11 111.0 118.089211 109.0 88.812439 100.0 150.172699 100.0
12 121.0 247.301956 119.0 212.000061 110.0 124.367874 110.0
13 131.0 268.748627 129.0 253.204819 120.0 157.066010 120.0
14 141.0 231.799255 139.0 292.828461 130.0 145.811783 130.0
15 151.0 259.626343 149.0 260.067993 140.0 175.388763 140.0
16 161.0 288.704651 159.0 240.051605 150.0 265.435791 150.0
17 171.0 249.121857 169.0 203.727097 160.0 336.471924 160.0
18 181.0 339.038055 179.0 245.738480 170.0 283.483582 170.0
19 191.0 395.920410 189.0 318.751160 180.0 381.575500 180.0
20 201.0 404.843445 199.0 338.245209 190.0 491.534424 190.0
21 211.0 461.865784 209.0 418.997559 200.0 495.025604 200.0
22 221.0 518.710632 219.0 446.496216 200.0 495.025604 200.0
23 231.0 483.963867 224.0 479.213287 210.0 571.982361 210.0
24 239.0 445.292389 229.0 492.352905 220.0 611.698608 220.0
25 249.0 396.609497 239.0 445.322144 230.0 645.545776 230.0
26 259.0 321.553558 249.0 429.429932 240.0 596.046265 240.0
27 269.0 306.150177 259.0 297.355103 250.0 547.157654 250.0
28 279.0 259.717468 269.0 174.210785 260.0 457.071472 260.0
29 289.0 301.114410 279.0 114.175957 270.0 438.705170 270.0
30 300.0 274.057861 289.0 91.768898 280.0 397.985535 280.0
31 310.0 216.760361 299.0 77.773720 290.0 426.858276 290.0
32 320.0 192.317093 309.0 73.767090 300.0 410.508331 300.0
33 330.0 179.511917 319.0 63.295345 300.0 410.508331 300.0
34 340.0 231.126053 329.0 -4.296405 310.0 355.303558 310.0
35 350.0 142.894958 339.0 -62.745190 320.0 284.932892 320.0
36 360.0 51.547047 350.0 -60.224789 330.0 251.817078 330.0
37 370.0 -39.064964 360.0 -85.826874 340.0 302.303925 340.0
38 380.0 -54.111374 370.0 -81.139206 350.0 207.799942 350.0
39 390.0 -68.999535 380.0 -40.080212 360.0 77.729439 360.0
40 400.0 -47.595322 390.0 -29.945852 370.0 -127.037209 370.0
41 410.0 13.159509 400.0 -26.656607 380.0 -109.327545 380.0
42 NaN NaN 410.0 -13.723764 390.0 -127.160942 390.0
43 NaN NaN NaN NaN 400.0 -61.404510 400.0
44 NaN NaN NaN NaN 410.0 13.058900 410.0
For the first dataframe, pull the single subsidence value in the age window out as a scalar and add it to the whole column:
offset = df1.loc[(df1.age > 295) & (df1.age < 305), 'subsidence'].iloc[0]
df1['subsidence'] = df1['subsidence'] + offset
You need to update each dataframe accordingly.
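A sketch of repeating this across all four frames, assuming they are held as separate dataframes df1 through df4, each with age and subsidence columns (the names df2, df3 and df4 are placeholders):
for d in (df1, df2, df3, df4):
    mask = d['age'].between(295.0, 305.0)   # the single row near age 300
    offset = d.loc[mask, 'subsidence'].iloc[0]
    d['subsidence'] += offset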