Actual and Percentage Difference on consecutive columns in a Pandas or PySpark DataFrame

I would like to perform two different calculations across consecutive columns in a pandas or PySpark DataFrame.
The columns are weeks and the metrics are displayed as rows.
I want to calculate the actual and percentage differences across the columns, replicating the calculations I currently do in Excel.
Raw data:
Metrics Week20 Week21 Week22 Week23 Week24 Week25 Week26 Week27
Sales 20301 21132 20059 23062 19610 22734 22140 20699
TRXs 739 729 690 779 701 736 762 655
Attachment Rate 4.47 4.44 4.28 4.56 4.41 4.58 4.55 4.96
AOV 27.47 28.99 29.07 29.6 27.97 30.89 29.06 31.6
Profit 5177 5389 5115 5881 5001 5797 5646 5278
Profit per TRX 7.01 7.39 7.41 7.55 7.13 7.88 7.41 8.06

In pandas you could use the pct_change(axis=1) and diff(axis=1) methods:
df = df.set_index('Metrics')
# metrics that should get an "actual" (absolute) diff
actual = ['AOV', 'Attachment Rate']
# percentage change for everything else, formatted as strings with a '%' suffix
rep = (df[~df.index.isin(actual)].pct_change(axis=1).round(2)*100).fillna(0).astype(str).add('%')
# absolute change for the metrics listed in `actual`
rep = pd.concat([rep,
                 df[df.index.isin(actual)].diff(axis=1).fillna(0)])
In [131]: rep
Out[131]:
Week20 Week21 Week22 Week23 Week24 Week25 Week26 Week27
Metrics
Sales 0.0% 4.0% -5.0% 15.0% -15.0% 16.0% -3.0% -7.0%
TRXs 0.0% -1.0% -5.0% 13.0% -10.0% 5.0% 4.0% -14.0%
Profit 0.0% 4.0% -5.0% 15.0% -15.0% 16.0% -3.0% -7.0%
Profit per TRX 0.0% 5.0% 0.0% 2.0% -6.0% 11.0% -6.0% 9.0%
Attachment Rate 0 -0.03 -0.16 0.28 -0.15 0.17 -0.03 0.41
AOV 0 1.52 0.08 0.53 -1.63 2.92 -1.83 2.54
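For the PySpark side of the question, here is a minimal sketch of the same idea. It assumes the raw table above has been loaded into a Spark DataFrame named sdf (an illustrative name) with a string column Metrics and numeric week columns; the week and metric names are taken from the data above.
from pyspark.sql import functions as F

weeks = [f"Week{w}" for w in range(20, 28)]
actual = ["AOV", "Attachment Rate"]   # metrics that get an absolute diff

# the first week has no predecessor, so keep it as a literal "0"
cols = [F.col("Metrics"), F.lit("0").alias(weeks[0])]
for prev, cur in zip(weeks, weeks[1:]):
    pct = F.round((F.col(cur) - F.col(prev)) / F.col(prev) * 100, 0)
    diff = F.round(F.col(cur) - F.col(prev), 2)
    cols.append(
        F.when(F.col("Metrics").isin(actual), diff.cast("string"))
         .otherwise(F.concat(pct.cast("string"), F.lit("%")))
         .alias(cur)
    )

rep = sdf.select(*cols)
Because Spark works row-wise, each output week column is built as an expression over the current and previous week columns, with F.when switching between the absolute and percentage formulas per metric.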

Related

Date dependent calculation from 2 dataframes - average 6-month return

I am working with the following dataframe. I have data for multiple companies, each row associated with a specific datadate, so there are many rows related to many companies, with IPO dates from 2009 to 2022.
index ID price daily_return datadate daily_market_return mean_daily_market_return ipodate
0 1 27.50 0.008 01-09-2010 0.0023 0.03345 01-12-2009
1 2 33.75 0.0745 05-02-2017 0.00458 0.0895 06-12-2012
2 3 29.20 0.00006 08-06-2020 0.0582 0.0045 01-05-2013
3 4 20.54 0.00486 09-06-2018 0.0009 0.0006 27-11-2013
4 1 21.50 0.009 02-09-2021 0.0846 0.04345 04-05-2009
5 4 22.75 0.00539 06-12-2019 0.0003 0.0006 21-09-2012
...
26074 rows
I also have a dataframe containing the Market yield on US Treasury securities at 10-year constant maturity - measured daily. Each row represents the return associated with a specific day, each day from 2009 to 2022.
date dgs10
1 2009-01-02 2.46
2 2009-01-05 2.49
3 2009-01-06 2.51
4 2009-01-07 2.52
5 2009-01-08 2.47
6 2009-01-09 2.43
7 2009-01-12 2.34
8 2009-01-13 2.33
...
date dgs10
3570 2022-09-08 3.29
3571 2022-09-09 3.33
3572 2022-09-12 3.37
3573 2022-09-13 3.42
3574 2022-09-14 3.41
My goal is to calculate, for each ipodate (from dataframe 1), the average of the Market yield on US Treasury securities at 10-year constant maturity (from dataframe 2) over the previous 6 months. The result should either be in a new dataframe or in an additional column in dataframe 1. The two dataframes are not the same length. I tried using rolling(), but it doesn't seem to be working. Does anyone know how to fix this?
import numpy as np
import pandas as pd

# Make sure that all date columns are of type Timestamp. They are a lot easier
# to work with.
df1["ipodate"] = pd.to_datetime(df1["ipodate"], dayfirst=True)
df2["date"] = pd.to_datetime(df2["date"])

# Calculate the mean market yield of the previous 6 months. Six months is not a
# fixed length of time, so it is approximated here as 180 days.
tmp = df2.rolling("180D", on="date").mean()

# The values of the first 180 days are invalid, because there is insufficient
# data to calculate the rolling mean. You may consider extending df2 further
# back into 2008. (You may come up with other rules for this period.)
is_invalid = (tmp["date"] - tmp["date"].min()) / pd.Timedelta(1, "D") < 180
tmp.loc[is_invalid, "dgs10"] = np.nan

# Result
result = df1.merge(tmp, left_on="ipodate", right_on="date", how="left")
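If some ipodate values have no exact match in df2 (for example, IPOs falling on a weekend or holiday), the exact merge above yields NaN for those rows. A hedged variant using pd.merge_asof picks the most recent prior trading day instead; the column name mean_dgs10_6m is just illustrative:
# both frames must be sorted on their merge keys for merge_asof
result_asof = pd.merge_asof(
    df1.sort_values("ipodate"),
    tmp.rename(columns={"dgs10": "mean_dgs10_6m"}).sort_values("date"),
    left_on="ipodate",
    right_on="date",
    direction="backward",   # take the closest earlier date
)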

Add column for percentages

I have a df that looks like this:
Total Initial Follow Sched Supp Any
0 5525 3663 968 296 65 533
I transpose the df because I have to add a column with percentages based on the 'Total' column.
Now my df looks like this:
0
Total 5525
Initial 3663
Follow 968
Sched 296
Supp 65
Any 533
So, how can I add this percentage column?
The expected output looks like this
0 Percentage
Total 5525 100
Initial 3663 66.3
Follow 968 17.5
Sched 296 5.4
Supp 65 1.2
Any 533 9.6
Do you know how I can add this new column?
I'm working in JupyterLab with pandas and numpy.
Divide column 0 by the scalar from the 'Total' row with Series.div, then multiply by 100 with Series.mul and finally round with Series.round:
df['Percentage'] = df[0].div(df.loc['Total', 0]).mul(100).round(1)
print(df)
0 Percentage
Total 5525 100.0
Initial 3663 66.3
Follow 968 17.5
Sched 296 5.4
Supp 65 1.2
Any 533 9.6
Consider the df below:
In [1328]: df
Out[1328]:
b
a
Total 5525
Initial 3663
Follow 968
Sched 296
Supp 65
Any 533
In [1327]: df['Perc'] = round(df.b.div(df.loc['Total', 'b']) * 100, 1)
In [1330]: df
Out[1330]:
b Perc
a
Total 5525 100.0
Initial 3663 66.3
Follow 968 17.5
Sched 296 5.4
Supp 65 1.2
Any 533 9.6
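If you prefer to do the transpose and the percentage in one go, here is a small sketch starting from the original one-row wide frame from the question (the column name "count" is just illustrative):
out = df.T.rename(columns={0: "count"})   # rows become Total, Initial, Follow, ...
out["Percentage"] = out["count"].div(out.loc["Total", "count"]).mul(100).round(1)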

Pandas: Delete Rows or Interpolate

I'm trying to learn from IoT time-series data. The data comes from two different sources. In some measurements the difference between the sources is very small: one source has 11 rows and the second has 15 rows. In other measurements, one source has 30 rows and the second has 240 rows.
I thought to interpolate using:
df.resample('20ms').interpolate()
but saw that it deletes some rows.
Is there a way to interpolate without deleting rows, or should I delete rows instead?
EDIT - data and code:
#!/usr/bin/env python3.6
import pandas as pd

first_df_file_name = 'interpolate_test.in'
df = pd.read_csv(first_df_file_name, header=0, delimiter=' ')
print(df.head(5))

# attach a synthetic 100 ms DatetimeIndex so the data can be resampled
idx = 0
new_col = pd.date_range('1/1/2011 00:00:00.000000', periods=len(df.index), freq='100ms')
df.insert(loc=idx, column='date', value=new_col)
df.set_index('date', inplace=True)
upsampled = df.resample('20ms').interpolate()
print('20 ms, num rows', len(upsampled.index))
print(upsampled.head(5))
upsampled.to_csv('test_20ms.out')
upsampled = df.resample('60ms').interpolate()
print('60 ms, num rows', len(upsampled.index))
print(upsampled.head(5))
upsampled.to_csv('test_60ms.out')
This is the test input file (interpolate_test.in):
a b
100 200
200 400
300 600
400 800
500 1000
600 1100
700 1200
800 1300
900 1400
1000 2000
Here is part of the output:
# output when resampling to 20 ms - this is fine
a b
date
2011-01-01 00:00:00.000 100.0 200.0
2011-01-01 00:00:00.020 120.0 240.0
2011-01-01 00:00:00.040 140.0 280.0
2011-01-01 00:00:00.060 160.0 320.0
2011-01-01 00:00:00.080 180.0 360.0
60 ms, num rows 16
# output when resampling to 60 ms - data is lost
a b
date
2011-01-01 00:00:00.000 100.0 200.0
2011-01-01 00:00:00.060 160.0 320.0
2011-01-01 00:00:00.120 220.0 440.0
2011-01-01 00:00:00.180 280.0 560.0
2011-01-01 00:00:00.240 340.0 680.0
So, should I delete rows from the larger source instead of interpolating? If I interpolate, how can I avoid losing data?
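One way to keep every original sample while still producing a regular 60 ms grid is to interpolate on the union of the original index and the target grid, so no original timestamp is dropped. A minimal sketch, assuming df is the 100 ms-indexed frame built in the code above:
# build the coarser 60 ms grid over the same time span
target = pd.date_range(df.index[0], df.index[-1], freq='60ms')
# reindex on the union of both indexes, then interpolate the new grid points
combined = df.reindex(df.index.union(target)).interpolate(method='time')
# combined now contains all original 100 ms rows plus the 60 ms grid points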

How can I filter dataframe rows based on a quantile value of a column using groupby?

(There is probably a better way of asking this question, but hopefully this description makes it clearer.)
A simplified view of my dataframe, showing 10 random rows, is:
Duration starting_station_id ending_station_id
5163 420 3077 3018
113379 240 3019 3056
9730 240 3047 3074
104058 900 3034 3042
93110 240 3055 3029
93144 240 3016 3014
48999 780 3005 3024
30905 360 3019 3025
88132 300 3022 3048
12673 240 3075 3031
What I want to do is group by starting_station_id and ending_station_id and then filter out the rows where the Duration value for a group falls above the .99 quantile.
To do the groupby and quantile computation, I do:
df.groupby( ['starting_station_id', 'ending_station_id'] )[ 'Duration' ].quantile([.99])
and some partial output is:
3005 3006 0.99 3825.6
3007 0.99 1134.0
3008 0.99 5968.8
3009 0.99 9420.0
3010 0.99 1740.0
3011 0.99 41856.0
3014 0.99 22629.6
3016 0.99 1793.4
3018 0.99 37466.4
What I believe this is telling me is that for the group (3005, 3006), values >= 3825.6 fall above the .99 quantile. So I want to filter out the rows where the Duration value for that group is >= 3825.6 (and then do the same for all of the other groups).
What is the best way to do this?
Try this:
# per-group .99 quantile of Duration
thresholds = df.groupby(['starting_station_id', 'ending_station_id'])['Duration'].quantile(.99)
# look up each row's group threshold, then keep only the rows at or below it
mask = df.Duration.values <= thresholds[[(x, y) for x, y in zip(df.starting_station_id, df.ending_station_id)]].values
out = df[mask]
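An equivalent, arguably more idiomatic sketch uses groupby().transform, which broadcasts each group's quantile back onto its own rows and avoids the tuple lookup:
q99 = df.groupby(['starting_station_id', 'ending_station_id'])['Duration'] \
        .transform(lambda s: s.quantile(0.99))
out = df[df['Duration'] <= q99]   # keep only rows at or below each group's .99 quantile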

Pandas: aggregating by different columns with MultiIndex columns

I would like to take a dataframe with MultiIndex columns (where the index is a DatetimeIndex) and aggregate it with different functions depending on the column.
For example, consider the following table where the index is dates, the first level of columns is price and volume, and the second level of columns is tickers (e.g. AAPL and AMZN).
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"ticker": ["AAPL"] * 365,
                    'date': pd.date_range(start='20170101', end='20171231'),
                    'volume': [np.random.randint(50, 100) for i in range(365)],
                    'price': [np.random.randint(100, 200) for i in range(365)]})
df2 = pd.DataFrame({"ticker": ["AMZN"] * 365,
                    'date': pd.date_range(start='20170101', end='20171231'),
                    'volume': [np.random.randint(50, 100) for i in range(365)],
                    'price': [np.random.randint(100, 200) for i in range(365)]})
df = pd.concat([df1, df2])
grp = df.groupby(['date', 'ticker']).mean().unstack()
grp.head()
What I would like to do is aggregate the data by month, taking the mean of price and the sum of volume.
I would have thought that something along the lines of grp.resample("MS").agg({"price":"mean", "volume":"sum"}) should work, but it does not because of the MultiIndex columns. What's the best way to accomplish this?
You can group the original long-format frame by month and ticker, aggregate per column, and then unstack:
(df.groupby([pd.to_datetime(df.date).dt.strftime('%Y-%m'), df.ticker])
   .agg({"price": "mean", "volume": "sum"})
   .unstack())
Out[529]:
price volume
ticker AAPL AMZN AAPL AMZN
date
2017-01 155.548387 141.580645 2334 2418
2017-02 154.035714 156.821429 2112 2058
2017-03 154.709677 148.806452 2258 2188
2017-04 154.366667 149.366667 2271 2254
2017-05 154.774194 155.096774 2331 2264
2017-06 147.333333 145.133333 2220 2302
2017-07 149.709677 150.645161 2188 2412
2017-08 150.806452 154.645161 2265 2341
2017-09 157.033333 151.466667 2199 2232
2017-10 149.387097 145.580645 2303 2203
2017-11 154.100000 150.266667 2212 2275
2017-12 156.064516 149.290323 2265 2224
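A hedged alternative is to work directly on the MultiIndex-column frame grp from the question: resample each top-level block with its own aggregation and stitch the results back together with pd.concat, which rebuilds the price/volume level of the columns:
monthly = pd.concat(
    {
        "price": grp["price"].resample("MS").mean(),    # monthly mean per ticker
        "volume": grp["volume"].resample("MS").sum(),   # monthly total per ticker
    },
    axis=1,
)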