Pandas Remove Duplicate Rows Based on Condition

I have a pandas data frame as follows
+---------+------------+------------+-------+--------+
| Product | Date       | Adj_Date   | Price | Factor |
+---------+------------+------------+-------+--------+
| A       | 01-06-2020 | 01-07-2020 |   100 |     10 |
| A       | 01-06-2020 | 01-08-2020 |   200 |     20 |
| B       | 15-07-2020 | 01-07-2020 |   400 |     10 |
| B       | 15-07-2020 | 01-08-2020 |   800 |     10 |
| C       | 01-09-2020 | 01-07-2020 |  1000 |     10 |
| C       | 01-09-2020 | 01-08-2020 |  1200 |     10 |
| D       | 01-10-2020 | 01-11-2020 |  1400 |     10 |
| E       | 01-10-2020 | 01-09-2020 |  1600 |     10 |
+---------+------------+------------+-------+--------+
Code to generate the data frame above:
import pandas as pd

data = {'Product': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'E'],
        'Date': ['01-06-2020', '01-06-2020', '15-07-2020', '15-07-2020',
                 '01-09-2020', '01-09-2020', '01-10-2020', '01-10-2020'],
        'Adj_Date': ['01-07-2020', '01-08-2020', '01-07-2020', '01-08-2020',
                     '01-07-2020', '01-08-2020', '01-11-2020', '01-09-2020'],
        'Price': [100, 200, 400, 800, 1000, 1200, 1400, 1600],
        'Factor': [10, 20, 10, 10, 10, 10, 10, 10]}
df = pd.DataFrame(data)
Desired Output:
+---------+------------+------------+-------+--------+--------------+
| Product | Date       | Adj_Date   | Price | Factor | Actual Price |
+---------+------------+------------+-------+--------+--------------+
| A       | 01-06-2020 | 01-07-2020 |   100 |     10 |           10 |
| B       | 15-07-2020 | 01-08-2020 |   800 |     10 |           80 |
| C       | 01-09-2020 | 01-07-2020 |  1000 |     10 |         1000 |
| D       | 01-10-2020 | 01-11-2020 |  1400 |     10 |          140 |
| E       | 01-10-2020 | 01-09-2020 |  1600 |     10 |         1600 |
+---------+------------+------------+-------+--------+--------------+
The result above is based on comparing the two columns Date and Adj_Date. If a product has two rows, we choose the row in which Date is less than Adj_Date and the difference between Date and Adj_Date is minimal. As we can see, product A has Date = 01-06-2020; this date is less than Adj_Date in both rows, but we choose the row with Adj_Date = 01-07-2020 because the difference with this date is minimal. Using the same logic we opt for row 2 in the case of product B.
If Date is greater than Adj_Date in all rows, then we keep the first row.
The next part is to create the column Actual Price. Once we have a single row for each product, we divide Price by Factor to create Actual Price, but only if Date is less than Adj_Date. Otherwise Actual Price is equal to Price.
How can this result be achieved?

First, convert Date and Adj_Date to Timestamp. This will make your life a lot easier:
import pandas as pd

for col in ['Date', 'Adj_Date']:
    df[col] = pd.to_datetime(df[col], dayfirst=True)
Then:
import numpy as np

# Pick one row for each product
def pick_one(group):
    if len(group) == 1:
        return group
    diff = (group['Date'] - group['Adj_Date']).dt.days
    if (diff < 0).any():
        # Among rows with Date < Adj_Date, the largest (closest to zero)
        # negative diff is the smallest gap
        cond = group.index == diff[diff < 0].idxmax()
    else:
        # Date > Adj_Date everywhere: keep the first row
        cond = group.index == group.index[0]
    return group.loc[cond]

result = df.groupby('Product', group_keys=True).apply(pick_one).droplevel(0)

# Calculate the Actual Price
result['Actual Price'] = np.where(result['Date'] < result['Adj_Date'],
                                  result['Price'] / result['Factor'],
                                  result['Price'])
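For comparison, the same selection can also be expressed without groupby.apply. Below is a sketch (not the answer's method) that builds a sort key and uses drop_duplicates; the 10**9 sentinel is an arbitrary "worse than any real gap" value chosen for illustration:

```python
import numpy as np
import pandas as pd

data = {'Product': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'E'],
        'Date': ['01-06-2020', '01-06-2020', '15-07-2020', '15-07-2020',
                 '01-09-2020', '01-09-2020', '01-10-2020', '01-10-2020'],
        'Adj_Date': ['01-07-2020', '01-08-2020', '01-07-2020', '01-08-2020',
                     '01-07-2020', '01-08-2020', '01-11-2020', '01-09-2020'],
        'Price': [100, 200, 400, 800, 1000, 1200, 1400, 1600],
        'Factor': [10, 20, 10, 10, 10, 10, 10, 10]}
df = pd.DataFrame(data)
for col in ['Date', 'Adj_Date']:
    df[col] = pd.to_datetime(df[col], dayfirst=True)

# Rows with Date < Adj_Date compete on their gap in days; all other rows
# get a large sentinel, so they are picked only when no row of the product
# has Date < Adj_Date.
gap = (df['Adj_Date'] - df['Date']).dt.days
key = gap.where(gap > 0, 10**9)

# A stable sort keeps the original order among sentinel ties, which
# implements "keep the first row" when Date > Adj_Date everywhere.
result = (df.assign(key=key)
            .sort_values(['Product', 'key'], kind='stable')
            .drop_duplicates('Product')
            .drop(columns='key'))
result['Actual Price'] = np.where(result['Date'] < result['Adj_Date'],
                                  result['Price'] / result['Factor'],
                                  result['Price'])
```

This reproduces the desired output table above, including the tie-breaking rules for products C and E.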

You can use a helper delta column and groupby.idxmin to filter your dataframe, then a boolean mask to get your ActualPrice column.
import numpy as np
import pandas as pd

# ensure you have valid datetime objects (dayfirst matches the dd-mm-yyyy strings)
df[['Date', 'Adj_Date']] = df[['Date', 'Adj_Date']].apply(pd.to_datetime, dayfirst=True)

# rows where Date < Adj_Date compete on the smallest gap in days; all other
# rows get a huge sentinel, so one of them is picked (idxmin keeps the first)
# only when no row of the product has Date < Adj_Date
df1 = df.loc[
    df.assign(
        delta=np.where(
            df["Date"] < df["Adj_Date"],
            (df["Adj_Date"] - df["Date"]).dt.days,
            np.iinfo(np.int64).max,
        )
    )
    .groupby("Product")["delta"]
    .idxmin()
]

df1['ActualPrice'] = np.where(
    df1['Date'] < df1['Adj_Date'],
    df1['Price'].div(df1['Factor']),
    df1['Price']
)
print(df1)
  Product       Date   Adj_Date  Price  Factor  ActualPrice
0       A 2020-06-01 2020-07-01    100      10         10.0
3       B 2020-07-15 2020-08-01    800      10         80.0
4       C 2020-09-01 2020-07-01   1000      10       1000.0
6       D 2020-10-01 2020-11-01   1400      10        140.0
7       E 2020-10-01 2020-09-01   1600      10       1600.0

Related

Python Data Frame - How can I evaluate/use a column being created on the fly

Suppose that I have a dataframe as follows:
+---------+-------+------------+
| Product | Price | Calculated |
+---------+-------+------------+
| A       |    10 |         10 |
| B       |    20 |        NaN |
| C       |    25 |        NaN |
| D       |    30 |        NaN |
+---------+-------+------------+
The above can be created using the code below:
import numpy as np
import pandas as pd

data = {'Product': ['A', 'B', 'C', 'D'],
        'Price': [10, 20, 25, 30],
        'Calculated': [10, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data)
I want to update the column Calculated on the fly. For the 2nd row, Calculated = previous Calculated / previous Price, i.e. Calculated at row 2 is 10/10 = 1.
Now that we have the value for row 2, the Calculated for row 3 would be 1/20, and so on and so forth.
Expected Output
+---------+-------+------------+
| Product | Price | Calculated |
+---------+-------+------------+
| A       |    10 |         10 |
| B       |    20 |          1 |
| C       |    25 |       0.05 |
| D       |    30 |      0.002 |
+---------+-------+------------+
The above can be achieved using loops, but I don't want to use loops; I need a vectorized approach to update the column Calculated instead. How can I achieve that?
You are looking for cumprod with a shift:
# also `df['Calculated'].iloc[0]` instead of `.ffill()`
df['Calculated'] = df['Calculated'].ffill()/df.Price.cumprod().shift(fill_value=1)
Output:
Product Price Calculated
0 A 10 10.000
1 B 20 1.000
2 C 25 0.050
3 D 30 0.002
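To see why this works: unrolling the recurrence Calculated[i] = Calculated[i-1] / Price[i-1] gives Calculated[i] = Calculated[0] / (Price[0] * ... * Price[i-1]), i.e. the seed divided by a shifted cumulative product. A small sketch checking this against the sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Product': ['A', 'B', 'C', 'D'],
                   'Price': [10, 20, 25, 30],
                   'Calculated': [10, np.nan, np.nan, np.nan]})

# Seed divided by the running product of all *previous* prices;
# shift(fill_value=1) supplies an empty product (1) for the first row.
seed = df['Calculated'].iloc[0]
df['Calculated'] = seed / df['Price'].cumprod().shift(fill_value=1)
```

Here the shifted cumprod is [1, 10, 200, 5000], so the column becomes [10, 1, 0.05, 0.002] as in the expected output.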

How to calculate cumulative percent change by each group?

I'd like to create a new column to calculate the cumulative percent change by each group
Sample dataset:
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'A', 'B', 'B'],
                   'Col_1': [100, 200, 300, 400, 500],
                   'Col_2': [55, 66, 77, 88, 99]})
Methodology: See example below
| Group |Col_1 | Col_2 | Cumulative Percent Change |
|-------|------|--------|---------------------------------------|
| A | 100 | 55 | 1 |
| A | 200 | 66 |(66-55)/55 + 1 |
| A | 300 | 77 |((77-66)/66) + ((66-55)/55 + 1) |
| B | 400 | 88 | 1 |
| B | 500 | 99 |((99-88)/88) + 1 |
You need to groupby twice, once to compute the percent change (with pct_change) and once for the cumulative sum+1 (cumsum and add(1)):
df['CPC'] = (df.groupby('Group')['Col_2']
               .pct_change()
               .fillna(0)
               .groupby(df['Group'])
               .cumsum()
               .add(1))
Output:
Group Col_1 Col_2 CPC
0 A 100 55 1.000000
1 A 200 66 1.200000
2 A 300 77 1.366667
3 B 400 88 1.000000
4 B 500 99 1.125000
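As a sanity check against the methodology table, the third row of group A should be 1 + (66-55)/55 + (77-66)/66 ≈ 1.3667, and the cumulative sum must restart for group B. A sketch splitting the chain into its two groupby steps:

```python
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'A', 'B', 'B'],
                   'Col_1': [100, 200, 300, 400, 500],
                   'Col_2': [55, 66, 77, 88, 99]})

# First groupby: percent change within each group (NaN on each group's
# first row, replaced by 0 so the cumulative sum starts from the base).
pc = df.groupby('Group')['Col_2'].pct_change().fillna(0)

# Second groupby: the cumulative sum must also restart per group,
# then add 1 to shift the base.
df['CPC'] = pc.groupby(df['Group']).cumsum().add(1)
```

Dropping the second groupby would carry group A's accumulated changes into group B, which is why one groupby is not enough.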

Create new column in pandas depending on multiple conditions

I would like to create a new column based on various conditions
Let's say I have a df where column A can equal any of the following: ['Single', 'Multiple', 'Commercial', 'Domestic', 'Other'], column B has numeric values from 0-30.
I'm trying to get column C to be 'Moderate' if A = 'Single' or 'Multiple', and if it equals anything else, to consider the values in column B. If column A != 'Single' or 'Multiple', column C will equal Moderate if 3 < B > 19 and 'High' if B>=19.
I have tried various loop combinations but I can't seem to get it. Any help?
trial = []
for x in df['A']:
    if x == 'Single' or x == 'Multiple':
        trial.append('Moderate')
    elif x != 'Single' or x != 'Multiple':
        if df['B'] > 19:
            trial.append('Test')
df['trials'] = trial
Thank you kindly,
Denisse
It would be good if you provided some sample data. But with some data that I created, you can see how to apply a function to each row of your DataFrame.
Data
valuesA = ['Single', 'Multiple', 'Commercial', 'Domestic', 'Other',
           'Single', 'Multiple', 'Commercial', 'Domestic', 'Other']
valuesB = [0, 10, 20, 25, 30, 25, 15, 10, 5, 3]
df = pd.DataFrame({'A': valuesA, 'B': valuesB})
| | A | B |
|---:|:-----------|----:|
| 0 | Single | 0 |
| 1 | Multiple | 10 |
| 2 | Commercial | 20 |
| 3 | Domestic | 25 |
| 4 | Other | 30 |
| 5 | Single | 25 |
| 6 | Multiple | 15 |
| 7 | Commercial | 10 |
| 8 | Domestic | 5 |
| 9 | Other | 3 |
Function to apply
You don't specify what happens if column B is less than or equal to 3, so I suppose that C will be 'Low'. Adapt the function as you need. Also, maybe there is a typo in your question where you say '3 < B > 19'; I changed it to '3 < B < 19'.
def my_function(x):
    if x['A'] in ['Single', 'Multiple']:
        return 'Moderate'
    else:
        if x['B'] <= 3:
            return 'Low'
        elif 3 < x['B'] < 19:
            return 'Moderate'
        else:
            return 'High'
New column
With the DataFrame and the new function you can apply it to each row with the method apply using the argument 'axis=1':
df['C'] = df.apply(my_function, axis=1)
| | A | B | C |
|---:|:-----------|----:|:---------|
| 0 | Single | 0 | Moderate |
| 1 | Multiple | 10 | Moderate |
| 2 | Commercial | 20 | High |
| 3 | Domestic | 25 | High |
| 4 | Other | 30 | High |
| 5 | Single | 25 | Moderate |
| 6 | Multiple | 15 | Moderate |
| 7 | Commercial | 10 | Moderate |
| 8 | Domestic | 5 | Moderate |
| 9 | Other | 3 | Low |
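For larger frames, the same row-wise logic can also be expressed without apply. A sketch using np.select on the sample data above (conditions are evaluated top to bottom, and the first match wins, which mirrors the if/elif/else order of the function):

```python
import numpy as np
import pandas as pd

valuesA = ['Single', 'Multiple', 'Commercial', 'Domestic', 'Other',
           'Single', 'Multiple', 'Commercial', 'Domestic', 'Other']
valuesB = [0, 10, 20, 25, 30, 25, 15, 10, 5, 3]
df = pd.DataFrame({'A': valuesA, 'B': valuesB})

conditions = [
    df['A'].isin(['Single', 'Multiple']),  # always 'Moderate'
    df['B'] >= 19,                         # otherwise 'High' when B >= 19
    df['B'] > 3,                           # 'Moderate' when 3 < B < 19
]
choices = ['Moderate', 'High', 'Moderate']
df['C'] = np.select(conditions, choices, default='Low')
```

This produces the same C column as the apply-based function, with the B <= 3 rows falling through to the 'Low' default.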

PostgreSQL, Cumulative amount with interval

Hello there, I have this example dataset:
employee_id | amount | cumulative_amount
-------------+------------+-----------------
2 | 100 | 100
6 | 220 | 320
7 | 45 | 365
8 | 50 | 415
9 | 110 | 525
16 | 300 | 825
17 | 250 | 1075
18 | 200 | 1275
And an interval, let's say 300.
I'd like to pick only the rows that match the interval, with the condition:
Pick a value if it's >= previously picked value + interval
(e.g. if the start value = 100, the next matching row is the first where cumulative_amount >= 400, and so on):
employee_id | amount | cumulative_amount
-------------+------------+-----------------
2 | 100 | 100 <-- $Start
6 | 220 | 320 - 400
7 | 45 | 365 - 400
8 | 50 | 415 <-- 1
9 | 110 | 525 - 715 (prev value (415)+300)
16 | 300 | 825 <-- 2
17 | 250 | 1075 - 1125 (825+300)
18 | 200 | 1275 <-- 3
so final result would be :
employee_id | amount | cumulative_amount
-------------+------------+-----------------
2 | 100 | 100
8 | 50 | 415
16 | 300 | 825
18 | 200 | 1275
How can this be achieved in PostgreSQL in the most efficient way?
The column cumulative_amount is a running sum of the column amount; it is calculated in another query whose result is the dataset above, and the table is ordered by employee_id.
Regards.
Not saying it is the most efficient way, but probably the easiest:
s=# create table s1(a int, b int, c int);
CREATE TABLE
Time: 10.262 ms
s=# copy s1 from stdin delimiter '|';
...
s=# with g as (select generate_series(100,1300,300) s)
, o as (select *,sum(b) over (order by a) from s1)
, c as (select *, min(sum) over (partition by g.s)
from o
join g on sum >= g.s and sum < g.s + 300
)
select a,b,sum from c
where sum = min
;
a | b | sum
----+-----+------
2 | 100 | 100
8 | 50 | 415
16 | 300 | 825
17 | 250 | 1075
(4 rows)
Here I used order by a, as you said your cumulative sum is ordered by the first column (which is consistent with the third column).

Hive, ordering lines using a variable lag

I have the following hive table:
product | price
A | 100
B | 102
C | 220
D | 240
E | 242
F | 410
For every line I would like to divide the lower price by the current price; if the result is greater than or equal to 0.9 I would like to increment the row number. If the result is lower than 0.9, then the row number should be 1 for this line, the current price becomes the new lower price, and we iterate.
Result should look like:
product | price | row_number
A | 100 | 1
B | 102 | 2
C | 220 | 1
D | 240 | 2
E | 242 | 3
F | 410 | 1
Because:
lower price = 100: product A gets 1 as row_number
100/102 >= 0.9: product B gets 2 as row_number
100/220 < 0.9: product C gets 1 as row_number, lower price = 220
220/240 >= 0.9: product D gets 2 as row_number
220/242 >= 0.9: product E gets 3 as row_number
220/410 < 0.9: product F gets 1 as row_number, lower price = 410
I was thinking about creating a temporary_row_number just ordered by price:
product | price | temp_row_number
A | 100 | 1
B | 102 | 2
C | 220 | 3
D | 240 | 4
E | 242 | 5
F | 410 | 6
And then:
Select
    product,
    price,
    case
        when lag(price, temp_row_number - 1, 0) over () / price >= 0.9
            then lag(price, temp_row_number - 1, 0) over ()
        else price
    end as test
from my_table
This will retrieve:
product | price | test
A | 100 | 100
B | 102 | 100
C | 220 | 220
D | 240 | 240
E | 242 | 242
F | 410 | 410
But ideally I would like to retrieve
product | price | test
A | 100 | 100
B | 102 | 100
C | 220 | 220
D | 240 | 220
E | 242 | 220
F | 410 | 410
So I could compute the row_number column using the row_number() function ordered by product and price and get the expected result.
WITH CTE AS (
    select product, price,
           (case when price between 100 and 200 then 1
                 when price between 200 and 300 then 2
                 when price between 300 and 400 then 3 end) AS RN
    from #test
)
SELECT Product, Price, ROW_NUMBER() OVER (PARTITION BY RN ORDER BY RN)
FROM CTE
ORDER BY Product