How to calculate cumulative percent change by each group? - pandas

I'd like to create a new column that calculates the cumulative percent change within each group.
Sample dataset:
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'A', 'B', 'B'],
                   'Col_1': [100, 200, 300, 400, 500],
                   'Col_2': [55, 66, 77, 88, 99]})
Methodology (see the example below):
| Group | Col_1 | Col_2 | Cumulative Percent Change       |
|-------|-------|-------|---------------------------------|
| A     | 100   | 55    | 1                               |
| A     | 200   | 66    | (66-55)/55 + 1                  |
| A     | 300   | 77    | ((77-66)/66) + ((66-55)/55 + 1) |
| B     | 400   | 88    | 1                               |
| B     | 500   | 99    | ((99-88)/88) + 1                |

You need to groupby twice: once to compute the percent change (with pct_change) and once for the cumulative sum plus 1 (cumsum and add(1)):
df['CPC'] = (df.groupby('Group')['Col_2']
               .pct_change()
               .fillna(0)
               .groupby(df['Group'])
               .cumsum()
               .add(1)
            )
Output:
Group Col_1 Col_2 CPC
0 A 100 55 1.000000
1 A 200 66 1.200000
2 A 300 77 1.366667
3 B 400 88 1.000000
4 B 500 99 1.125000
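
As a quick sanity check against the methodology table (a sketch; run it after the snippet above so df already has the CPC column):

# group A, row 2: (77-66)/66 + ((66-55)/55 + 1) = 0.1667 + 1.2 ≈ 1.366667
assert abs(df.loc[2, 'CPC'] - ((77 - 66) / 66 + (66 - 55) / 55 + 1)) < 1e-9
# group B, row 4: (99-88)/88 + 1 = 1.125
assert abs(df.loc[4, 'CPC'] - ((99 - 88) / 88 + 1)) < 1e-9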

Related

Python Data Frame - How can I evaluate/use a column being created on the fly

Suppose that I have a dataframe as follows:
+---------+-------+------------+
| Product | Price | Calculated |
+---------+-------+------------+
| A | 10 | 10 |
| B | 20 | NaN |
| C | 25 | NaN |
| D | 30 | NaN |
+---------+-------+------------+
The above can be created using the code below:
import numpy as np
import pandas as pd

data = {'Product': ['A', 'B', 'C', 'D'],
        'Price': [10, 20, 25, 30],
        'Calculated': [10, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data)
I want to update the column Calculated on the fly. For the 2nd row, Calculated = previous Calculated / previous Price, i.e. Calculated at row 2 is 10/10 = 1.
Now that we have the value for row 2, Calculated at row 3 would be 1/20, and so on and so forth.
Expected Output
+---------+-------+------------+
| Product | Price | Calculated |
+---------+-------+------------+
| A | 10 | 10 |
| B | 20 | 1 |
| C | 25 | 0.05 |
| D | 30 | 0.002 |
+---------+-------+------------+
The above can be achieved using loops, but I don't want to use loops; instead I need a vectorized approach to update the column Calculated. How can I achieve that?
You are looking for cumprod with a shift:
# one could also use `df['Calculated'].iloc[0]` instead of `.ffill()`
df['Calculated'] = df['Calculated'].ffill() / df['Price'].cumprod().shift(fill_value=1)
Output:
Product Price Calculated
0 A 10 10.000
1 B 20 1.000
2 C 25 0.050
3 D 30 0.002
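
For intuition, the closed form above matches this explicit loop (a sketch over the same df, following the recurrence Calculated[i] = Calculated[i-1] / Price[i-1]):

# seed with the first known value, then keep dividing by the previous row's price
out = [df.loc[0, 'Calculated']]
for price in df['Price'].iloc[:-1]:
    out.append(out[-1] / price)
# out -> [10.0, 1.0, 0.05, 0.002]

Dividing the seed by the cumulative product of prices, shifted down one row, collapses that chain of divisions into a single vectorized expression.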

Pandas Remove Duplicate Rows Based on Condition

I have a pandas data frame as follows
+---------+------------+------------+-------+--------+
| Product | Date | Adj_Date | Price | Factor |
+---------+------------+------------+-------+--------+
| A | 01-06-2020 | 01-07-2020 | 100 | 10 |
| A | 01-06-2020 | 01-08-2020 | 200 | 20 |
| B | 15-07-2020 | 01-07-2020 | 400 | 10 |
| B | 15-07-2020 | 01-08-2020 | 800 | 10 |
| C | 01-09-2020 | 01-07-2020 | 1000 | 10 |
| C | 01-09-2020 | 01-08-2020 | 1200 | 10 |
| D | 01-10-2020 | 01-11-2020 | 1400 | 10 |
| E | 01-10-2020 | 01-09-2020 | 1600 | 10 |
+---------+------------+------------+-------+--------+
Code to generate the data frame above:
import pandas as pd

data = {'Product': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'E'],
        'Date': ['01-06-2020', '01-06-2020', '15-07-2020', '15-07-2020', '01-09-2020', '01-09-2020', '01-10-2020', '01-10-2020'],
        'Adj_Date': ['01-07-2020', '01-08-2020', '01-07-2020', '01-08-2020', '01-07-2020', '01-08-2020', '01-11-2020', '01-09-2020'],
        'Price': [100, 200, 400, 800, 1000, 1200, 1400, 1600],
        'Factor': [10, 20, 10, 10, 10, 10, 10, 10]}
df = pd.DataFrame(data)
Desired Output:
+---------+------------+------------+-------+--------+--------------+
| Product | Date | Adj_Date | Price | Factor | Actual Price |
+---------+------------+------------+-------+--------+--------------+
| A | 01-06-2020 | 01-07-2020 | 100 | 10 | 10 |
| B | 15-07-2020 | 01-08-2020 | 800 | 10 | 80 |
| C | 01-09-2020 | 01-07-2020 | 1000 | 10 | 1000 |
| D | 01-10-2020 | 01-11-2020 | 1400 | 10 | 140 |
| E | 01-10-2020 | 01-09-2020 | 1600 | 10 | 1600 |
+---------+------------+------------+-------+--------+--------------+
The above result is based on comparing the two columns Date and Adj_Date. If there are 2 rows for any product, we choose the row in which Date is less than Adj_Date and the difference with Adj_Date is minimal. As we can see, for product A we have Date = 01-06-2020; this date was less than Adj_Date in both rows, but we choose Adj_Date = 01-07-2020 because the difference with this date is minimal. Using the same logic we opt for row 2 in the case of product B.
If Date is greater than Adj_Date in all cases, then we keep the first row.
The next part is to create the column Actual Price. Once we have a single row for each product, we divide Price by Factor to create Actual Price, but only if Date is less than Adj_Date. Otherwise Actual Price is equal to Price.
How can this result be achieved?
First, convert Date and Adj_Date to Timestamp. This will make your life a lot easier:
for col in ['Date', 'Adj_Date']:
    df[col] = pd.to_datetime(df[col], dayfirst=True)
Then:
import numpy as np

# Pick one row for each product
def pick_one(group):
    if len(group) == 1:
        return group
    diff = (group['Date'] - group['Adj_Date']).dt.days
    if (diff < 0).any():
        # among rows where Date precedes Adj_Date, keep the smallest gap
        cond = group.index == diff[diff < 0].idxmax()
    else:
        # Date is past Adj_Date everywhere: keep the first row
        cond = group.index == group.index[0]
    return group.loc[cond]

result = df.groupby('Product', as_index=False).apply(pick_one).droplevel(0)

# Calculate the Actual Price
result['Actual Price'] = np.where(result['Date'] < result['Adj_Date'],
                                  result['Price'] / result['Factor'],
                                  result['Price'])
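
On the sample data (parsed day-first as above), this should reproduce the desired output:

  Product       Date   Adj_Date  Price  Factor  Actual Price
0       A 2020-06-01 2020-07-01    100      10          10.0
3       B 2020-07-15 2020-08-01    800      10          80.0
4       C 2020-09-01 2020-07-01   1000      10        1000.0
6       D 2020-10-01 2020-11-01   1400      10         140.0
7       E 2020-10-01 2020-09-01   1600      10        1600.0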
You can use duplicated and groupby.idxmin to filter your dataframe, then apply a boolean mask to get your ActualPrice column.
import numpy as np

# ensure you have valid datetime objects.
# df[['Date', 'Adj_Date']] = df[['Date', 'Adj_Date']].apply(pd.to_datetime)

df1 = df.loc[
    df.assign(
        delta=np.where(
            df.duplicated(subset=["Product"], keep=False),
            (df["Date"] - df["Adj_Date"]).abs(),
            0,
        )
    )
    .groupby("Product")["delta"]
    .idxmin()
]

df1['ActualPrice'] = np.where(
    df1['Date'] <= df1['Adj_Date'],
    df1['Price'].div(df1['Factor']),
    df1['Price']
)
print(df1)
Product Date Adj_Date Price Factor ActualPrice
0 A 2020-01-06 2020-01-07 100 10 10.0
3 B 2020-07-15 2020-01-08 800 10 800.0
5 C 2020-01-09 2020-01-08 1200 10 1200.0
6 D 2020-01-10 2020-01-11 1400 10 140.0
7 E 2020-01-10 2020-01-09 1600 10 1600.0
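
Note that the dates in this output were parsed month-first (01-06-2020 became 2020-01-06), which is why the kept rows and prices differ from the desired table. The question's dates are dd-mm-yyyy, so the commented parsing line would need dayfirst=True:

# parse the dd-mm-yyyy strings day-first, as in the first answer
df[['Date', 'Adj_Date']] = df[['Date', 'Adj_Date']].apply(pd.to_datetime, dayfirst=True)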

Create new column in pandas depending on multiple conditions

I would like to create a new column based on various conditions.
Let's say I have a df where column A can equal any of the following: ['Single', 'Multiple', 'Commercial', 'Domestic', 'Other'], and column B has numeric values from 0-30.
I'm trying to get column C to be 'Moderate' if A = 'Single' or 'Multiple', and if it equals anything else, to consider the values in column B. If column A != 'Single' or 'Multiple', column C will equal 'Moderate' if 3 < B > 19 and 'High' if B >= 19.
I have tried various loop combinations but I can't seem to get it. Any help?
trial = []
for x in df['A']:
    if x == 'Single' or x == 'Multiple':
        trial.append('Moderate')
    elif x != 'Single' or x != 'Multiple':
        if df['B'] > 19:
            trial.append('Test')
df['trials'] = trial
Thank you kindly,
Denisse
It would be good if you provided some sample data. But with some that I created, you can see how to apply a function to each row of your DataFrame.
Data
valuesA = ['Single', 'Multiple', 'Commercial', 'Domestic', 'Other',
           'Single', 'Multiple', 'Commercial', 'Domestic', 'Other']
valuesB = [0, 10, 20, 25, 30, 25, 15, 10, 5, 3]
df = pd.DataFrame({'A': valuesA, 'B': valuesB})
| | A | B |
|---:|:-----------|----:|
| 0 | Single | 0 |
| 1 | Multiple | 10 |
| 2 | Commercial | 20 |
| 3 | Domestic | 25 |
| 4 | Other | 30 |
| 5 | Single | 25 |
| 6 | Multiple | 15 |
| 7 | Commercial | 10 |
| 8 | Domestic | 5 |
| 9 | Other | 3 |
Function to apply
You don't specify what happens if column B is less than or equal to 3, so I suppose that C will be 'Low'. Adapt the function as you need. Also, there may be a typo in your question where you say '3 < B > 19'; I changed it to '3 < B < 19'.
def my_function(x):
    if x['A'] in ['Single', 'Multiple']:
        return 'Moderate'
    else:
        if x['B'] <= 3:
            return 'Low'
        elif 3 < x['B'] < 19:
            return 'Moderate'
        else:
            return 'High'
New column
With the DataFrame and the new function, you can apply it to each row with the method apply, using the argument axis=1:
df['C'] = df.apply(my_function, axis=1)
| | A | B | C |
|---:|:-----------|----:|:---------|
| 0 | Single | 0 | Moderate |
| 1 | Multiple | 10 | Moderate |
| 2 | Commercial | 20 | High |
| 3 | Domestic | 25 | High |
| 4 | Other | 30 | High |
| 5 | Single | 25 | Moderate |
| 6 | Multiple | 15 | Moderate |
| 7 | Commercial | 10 | Moderate |
| 8 | Domestic | 5 | Moderate |
| 9 | Other | 3 | Low |
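
As an aside, the same branching can be vectorized with numpy.select instead of a row-wise apply; a sketch using the same thresholds (the first matching condition wins):

import numpy as np

conditions = [
    df['A'].isin(['Single', 'Multiple']),  # always 'Moderate'
    df['B'] <= 3,                          # otherwise 'Low' at the bottom
    df['B'] < 19,                          # then 'Moderate' in the middle band
]
df['C'] = np.select(conditions, ['Moderate', 'Low', 'Moderate'], default='High')

This produces the same column C as the apply version, but evaluates the conditions on whole arrays rather than row by row.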

PostgreSQL, Cumulative amount with interval

Hello there, I have this example dataset:
employee_id | amount | cumulative_amount
-------------+------------+-----------------
2 | 100 | 100
6 | 220 | 320
7 | 45 | 365
8 | 50 | 415
9 | 110 | 525
16 | 300 | 825
17 | 250 | 1075
18 | 200 | 1275
And an interval, let's say 300.
I'd like to pick only the rows that match the interval, with the condition:
Pick a value if it's >= the previously picked value + interval
(e.g. if the start value = 100, the next matching row is where cumulative_amount >= 400, and so on):
employee_id | amount | cumulative_amount
-------------+------------+-----------------
2 | 100 | 100 <-- $Start
6 | 220 | 320 - 400
7 | 45 | 365 - 400
8 | 50 | 415 <-- 1
9 | 110 | 525 - 715 (prev value (415)+300)
16 | 300 | 825 <-- 2
17 | 250 | 1075 - 1125 (825+300)
18 | 200 | 1275 <-- 3
so final result would be :
employee_id | amount | cumulative_amount
-------------+------------+-----------------
2 | 100 | 100
8 | 50 | 415
16 | 300 | 825
18 | 200 | 1275
How can I achieve this in PostgreSQL in the most efficient way?
The column cumulative_amount is a progressive sum of the column amount; it's calculated in another query whose result is the dataset above, and the table is ordered by employee_id.
Regards.
Not saying it is the most efficient way, but it is probably the easiest:
s=# create table s1(a int, b int, c int);
CREATE TABLE
Time: 10.262 ms
s=# copy s1 from stdin delimiter '|';
...
s=# with g as (select generate_series(100,1300,300) s)
, o as (select *,sum(b) over (order by a) from s1)
, c as (select *, min(sum) over (partition by g.s)
from o
join g on sum >= g.s and sum < g.s + 300
)
select a,b,sum from c
where sum = min
;
a | b | sum
----+-----+------
2 | 100 | 100
8 | 50 | 415
16 | 300 | 825
17 | 250 | 1075
(4 rows)
Here I used order by a, as you said your cumulative sum is by the first column (which is consistent with the third column).
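
If the threshold must depend on the previously picked row (>= previous picked value + 300, as in the question) rather than on a fixed grid of buckets, a recursive CTE can walk the rows one pick at a time. A sketch, assuming the cumulative result set is available as a table or CTE named dataset(employee_id, amount, cumulative_amount):

with recursive picked as (
    (select employee_id, amount, cumulative_amount
     from dataset
     order by cumulative_amount
     limit 1)                              -- start row
  union all
    (select d.employee_id, d.amount, d.cumulative_amount
     from picked p
     cross join lateral (
         select employee_id, amount, cumulative_amount
         from dataset
         where cumulative_amount >= p.cumulative_amount + 300
         order by cumulative_amount
         limit 1                           -- first row past the threshold
     ) d)
)
select * from picked;

On the sample this picks 100, 415, 825, 1275, matching the desired result; the recursion stops when no row clears the threshold.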

Need SQL select query to find duplicates and return min and max rows

I have the SQL query below.
SELECT
     height
    ,width
    ,ROUND(height / 0.0254, 0) AS "H1"
    ,FLOOR((width * 2) / 0.0254) AS "W1"
FROM iclr_max_dim_results mdim
    ,iclr_request req
WHERE mdim.request_oid = req.oid
    AND req.request_number = 102017
    AND req.version_number = 52731
GROUP BY height
ORDER BY height DESC
Below is the result of the query.
height | width | H1 | W1
-----------------------------------------
6.0223 | 0.1003 | 237 | 7
6.0198 | 0.2435 | 237 | 19
6.0185 | 0.3151 | 237 | 24
5.9944 | 1.6759 | 236 | 131
5.9931 | 1.6779 | 236 | 132
5.9576 | 1.7016 | 235 | 133
5.9563 | 1.7024 | 235 | 134
If we look at the last two columns, H1 and W1, in the first three rows, the value 237 repeats with 7, 19, and 24 respectively. I need to return only the rows with the min and max W1 value for each H1.
Here, in this case, the result shall be as below. We eliminated 237 | 19 since 7 and 24 are the min and max for 237.
6.0223 | 0.1003 | 237 | 7
6.0185 | 0.3151 | 237 | 24
5.9944 | 1.6759 | 236 | 131
5.9931 | 1.6779 | 236 | 132
5.9576 | 1.7016 | 235 | 133
5.9563 | 1.7024 | 235 | 134
How should I edit the SQL query to achieve this?
Thank you very much.
The query can be this:
SELECT a.*
FROM (...) a
JOIN (
SELECT H1, MIN(W1) as w1_min, MAX(W1) as w1_max
FROM (...) c
GROUP BY H1
) b ON b.H1 = a.H1 AND (b.w1_min = a.W1 OR b.w1_max = a.W1)
Replace ... with your original query, or create a VIEW from your original query and replace (...) with the view's name.
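
Alternatively, window functions avoid scanning the base query twice; a sketch, with (...) again standing for your original query:

SELECT height, width, H1, W1
FROM (
    SELECT a.*,
           MIN(W1) OVER (PARTITION BY H1) AS w1_min,  -- smallest W1 per H1
           MAX(W1) OVER (PARTITION BY H1) AS w1_max   -- largest W1 per H1
    FROM (...) a
) t
WHERE W1 = w1_min OR W1 = w1_max;

Rows whose W1 equals the per-H1 minimum or maximum are kept, which also keeps both rows when a group has exactly two.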