Drop Duplicates based on Nearest Datetime condition - pandas

import pandas as pd

def nearest(items, pivot):
    return min(items, key=lambda x: abs(x - pivot))

df = pd.read_csv("C:/Files/input.txt", dtype=str)
duplicatesDf = df[df.duplicated(subset=['CLASS_ID', 'START_TIME', 'TEACHER_ID'], keep=False)]
duplicatesDf['START_TIME'] = pd.to_datetime(duplicatesDf['START_TIME'], format='%Y/%m/%d %H:%M:%S.%f')
print(duplicatesDf)
print(df['START_TIME'].dt.date)
df:
ID,CLASS_ID,START_TIME,TEACHER_ID,END_TIME
1,123,2020/06/01 20:47:26.000,o1,2020/06/02 00:00:00.000
2,123,2020/06/01 20:47:26.000,o1,2020/06/04 20:47:26.000
3,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:47:26.000
4,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:40:00.000
5,456,2020/06/01 20:47:26.000,o5,2020/06/08 20:00:26.000
So, I've got a dataframe like the one above. As you can see, I have multiple records with the same CLASS_ID, START_TIME and TEACHER_ID. Whenever multiple records like these are present, I would like to retain only one record: the one whose END_TIME is nearest to its START_TIME (at minute-level precision).
In this case,
for CLASS_ID 123, the record with ID 1 will be retained, as its END_TIME 2020/06/02 00:00:00.000 is nearer to its START_TIME 2020/06/01 20:47:26.000 than that of the record with ID 2, whose END_TIME is 2020/06/04 20:47:26.000. Similarly, for CLASS_ID 789, the record with ID 4 will be retained.
Hence the expected output will be:
ID,CLASS_ID,START_TIME,TEACHER_ID,END_TIME
1,123,2020/06/01 20:47:26.000,o1,2020/06/02 00:00:00.000
4,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:40:00.000
5,456,2020/06/01 20:47:26.000,o5,2020/06/08 20:00:26.000
I've been going through the following links,
https://stackoverflow.com/a/32237949,
https://stackoverflow.com/a/33043374
to find a solution but have unfortunately reached an impasse.
Hence, would some kind soul mind helping me out a bit? Many thanks.

IIUC, we can use .loc and idxmin() after creating a conditional column that measures the elapsed time between the start and the end; we then apply idxmin() as a groupby operation on your CLASS_ID column.
df.loc[
    df.assign(mins=(df["END_TIME"] - df["START_TIME"]))
    .groupby("CLASS_ID")["mins"]
    .idxmin()
]
ID CLASS_ID START_TIME TEACHER_ID END_TIME
0 1 123 2020-06-01 20:47:26 o1 2020-06-02 00:00:00
4 5 456 2020-06-01 20:47:26 o5 2020-06-08 20:00:26
3 4 789 2020-06-01 20:47:26 o3 2020-06-03 14:40:00
In steps:
Time delta:
print(df.assign(mins=(df["END_TIME"] - df["START_TIME"]))[['CLASS_ID','mins']])
CLASS_ID mins
0 123 0 days 03:12:34
1 123 3 days 00:00:00
2 789 1 days 18:00:00
3 789 1 days 17:52:34
4 456 6 days 23:13:00
Minimum index from the time delta column whilst grouping by CLASS_ID:
print(df.assign(mins=(df["END_TIME"] - df["START_TIME"]))
        .groupby("CLASS_ID")["mins"]
        .idxmin())
CLASS_ID
123 0
456 4
789 3
Name: mins, dtype: int64
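Note that the subtraction assumes both time columns are already datetimes; since the question reads everything in as strings, a minimal end-to-end sketch on the sample data would be (a sketch, not part of the original answer):

import pandas as pd

# sample data from the question
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "CLASS_ID": [123, 123, 789, 789, 456],
    "START_TIME": ["2020/06/01 20:47:26.000"] * 5,
    "TEACHER_ID": ["o1", "o1", "o3", "o3", "o5"],
    "END_TIME": ["2020/06/02 00:00:00.000", "2020/06/04 20:47:26.000",
                 "2020/06/03 14:47:26.000", "2020/06/03 14:40:00.000",
                 "2020/06/08 20:00:26.000"],
})

# convert the string columns to datetimes before subtracting
for col in ["START_TIME", "END_TIME"]:
    df[col] = pd.to_datetime(df[col], format="%Y/%m/%d %H:%M:%S.%f")

# keep the row with the smallest START_TIME -> END_TIME gap per CLASS_ID
out = df.loc[
    df.assign(mins=df["END_TIME"] - df["START_TIME"])
      .groupby("CLASS_ID")["mins"]
      .idxmin()
]
print(out)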

Related

For each unique item in a data frame column, get the last rows based on a looking/scanning window

patient_id  alert_id  alert_timestamp
3           xyz       2022-10-10
1           anp       2022-10-12
1           gfe       2022-10-10
2           fgy       2022-10-02
2           gpl       2022-10-03
1           gdf       2022-10-13
2           mkd       2022-10-23
1           liu       2022-10-01
I have a data frame (see the simplified version above) where, for each patient_id, I want to keep only the latest alert (i.e. the last one) that was sent out in a given window period, e.g. window_size = 7.
Note that the window size needs to look at consecutive days, i.e. between day 1 and day 1 + window_size. The range of alert_timestamp for each patient_id varies and is usually well beyond the window_size range.
Note that the data frame given above is a very simple example; the real one will have many more patient_ids and will be in a mixed order in terms of alert_timestamp and alert_id.
The approach is to start from the last alert_timestamp for a given patient_id and work back, using the window_size to select the alert that was the last one in that window time frame.
Please note the idea is to have a scanning/looking window, e.g. window_size = 7 days, that moves across the timestamps of each patient.
The end result I want is a data frame with the filtered alerts.
Expected output (for this example, window_size = 7):
patient_id  alert_id  alert_timestamp
1           liu       2022-10-01
1           gdf       2022-10-13
2           gpl       2022-10-03
2           mkd       2022-10-23
3           xyz       2022-10-10
What's the most efficient way to solve for this?
You can use:
# ensure datetime
df['alert_timestamp'] = pd.to_datetime(df['alert_timestamp'])

# get last date per patient
# and compute bins of 7 days
group = (df['alert_timestamp']
         .rsub(df.groupby('patient_id')['alert_timestamp']
                 .transform('max'))
         .floordiv(pd.Timedelta('7d'))  # or .dt.days.floordiv(7)
         )

# keep only most recent of each alert per group
out = (df
       .sort_values(by=['patient_id', 'alert_timestamp'])
       .groupby(['patient_id', group], as_index=False, sort=False)
       .last()
       )
Output:
patient_id alert_id alert_timestamp
0 1 liu 2022-10-01
1 1 gdf 2022-10-13
2 2 gpl 2022-10-03
3 2 mkd 2022-10-23
4 3 xyz 2022-10-10
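If you need this for other window sizes, the same binning idea can be wrapped in a small helper (a sketch; keep_last_per_window is a hypothetical name, not part of pandas):

import pandas as pd

def keep_last_per_window(df, window_size=7):
    """Keep the latest alert per patient within each window_size-day bin,
    counting bins backwards from each patient's most recent alert."""
    ts = pd.to_datetime(df['alert_timestamp'])
    # days back from each patient's latest alert, bucketed into windows
    bins = (ts.rsub(ts.groupby(df['patient_id']).transform('max'))
              .floordiv(pd.Timedelta(days=window_size)))
    return (df.assign(alert_timestamp=ts)
              .sort_values(['patient_id', 'alert_timestamp'])
              .groupby(['patient_id', bins], as_index=False, sort=False)
              .last())

# usage: out = keep_last_per_window(df, window_size=7)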

Pandas: Drop duplicates that appear within a time interval

We have a dataframe containing 'ID' and 'DAY' columns, which show when a specific customer made a complaint. We need to drop duplicates from the 'ID' column, but only if the duplicates happened within 30 days of each other. Please see the example below:
Current Dataset:
ID DAY
0 1 22.03.2020
1 1 18.04.2020
2 2 10.05.2020
3 2 13.01.2020
4 3 30.03.2020
5 3 31.03.2020
6 3 24.02.2021
Goal:
ID DAY
0 1 22.03.2020
1 2 10.05.2020
2 2 13.01.2020
3 3 30.03.2020
4 3 24.02.2021
Any suggestions? I have tried groupby and then creating a loop to calculate the difference between each combination, but because the dataframe has millions of rows this would take forever...
You can compute the difference between successive dates per group and use it to form a mask that removes rows less than 30 days apart:
df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)

mask = (df
        .sort_values(by=['ID', 'DAY'])
        .groupby('ID')['DAY']
        .diff().lt('30d')
        .sort_index()
        )
df[~mask]
NB: a potential drawback of this approach is that if the customer makes a new complaint within the 30 days, that complaint restarts the threshold for the next one, because the comparison is always against the immediately preceding complaint, whether it was kept or not.
output:
ID DAY
0 1 2020-03-22
2 2 2020-05-10
3 2 2020-01-13
4 3 2020-03-30
6 3 2021-02-24
Thus another approach might be to resample the data per group to 30 days:
(df
 .groupby('ID')
 .resample('30d', on='DAY').first()
 .dropna()
 .convert_dtypes()
 .reset_index(drop=True)
)
output:
ID DAY
0 1 2020-03-22
1 2 2020-01-13
2 2 2020-05-10
3 3 2020-03-30
4 3 2021-02-24
You can try grouping by the ID column and diffing the DAY column within each group:
from datetime import timedelta

df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)
m = timedelta(days=30)

out = df.groupby('ID').apply(lambda group: group[~group['DAY'].diff().abs().le(m)]).reset_index(drop=True)
print(out)
ID DAY
0 1 2020-03-22
1 2 2020-05-10
2 2 2020-01-13
3 3 2020-03-30
4 3 2021-02-24
To convert to original date format, you can use dt.strftime
out['DAY'] = out['DAY'].dt.strftime('%d.%m.%Y')
print(out)
ID DAY
0 1 22.03.2020
1 2 10.05.2020
2 2 13.01.2020
3 3 30.03.2020
4 3 24.02.2021
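For reuse, the mask-based idea from the first answer can be wrapped in a small helper with a configurable window (a sketch; drop_near_duplicates is a hypothetical name, not from either answer):

import pandas as pd

def drop_near_duplicates(df, id_col='ID', date_col='DAY', window='30d'):
    """Drop rows whose date falls within `window` of the previous row for the
    same ID. Note: the comparison is against the immediately preceding row,
    whether that row was kept or not."""
    mask = (df.sort_values([id_col, date_col])
              .groupby(id_col)[date_col]
              .diff()
              .lt(pd.Timedelta(window))
              .sort_index())
    return df[~mask]

# usage, after df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True):
# out = drop_near_duplicates(df, window='30d')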

Is there a way to group columns with multiple conditions using pandas?

My dataframe is similar to:
transaction date cash
0 1 2020-01-01 72
1 2 2020-01-03 100
2 2 2020-01-05 -75
3 3 2020-01-05 82
I want the output to group by transaction and to sum the cash for each transaction (if there are two amounts) BUT ALSO to return the later date. So for transaction 2 the end result would show transaction, date, cash as: 2, 1/5/2020, 25...
Not sure how to make tables to help the visuals in my question yet, sorry; please let me know if there are any questions.
Use groupby + agg. Check the docs examples.
output = df.groupby('transaction').agg({'date': 'max', 'cash': 'sum'})
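For reference, on the sample frame above (and assuming date is already a datetime), adding .reset_index() turns transaction back into a column; the result should look roughly like this:

output = df.groupby('transaction').agg({'date': 'max', 'cash': 'sum'}).reset_index()
print(output)
#    transaction       date  cash
# 0            1 2020-01-01    72
# 1            2 2020-01-05    25
# 2            3 2020-01-05    82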
This solution assumes that the date column is encoded as proper datetime instances. If this is not currently the case, try df['date'] = pd.to_datetime(df['date']) before doing the following.
>>> df
transaction date cash
0 1 2020-01-01 72
1 2 2020-01-03 100
2 2 2020-01-05 -75
3 3 2020-01-05 82
>>> transactions = df.groupby('transaction')
>>> pd.concat((transactions['cash'].sum(), transactions['date'].max()), axis=1)
cash date
transaction
1 72 2020-01-01
2 25 2020-01-05
3 82 2020-01-05
transactions['date'].max() picks the date furthest into the future of those with the same transaction ID.

How could I remove duplicates if duplicates mean less than 30 days?

(using sql or pandas)
I want to delete records if the date difference between two records is less than 30 days.
But the first record of each ID must be retained.
#example
ROW ID DATE
1 A 2020-01-01 -- first
2 A 2020-01-03
3 A 2020-01-31
4 A 2020-02-05
5 A 2020-02-28
6 A 2020-03-09
7 B 2020-03-06 -- first
8 B 2020-05-07
9 B 2020-06-02
#expected results
ROW ID DATE
1 A 2020-01-01
4 A 2020-02-05
6 A 2020-03-09
7 B 2020-03-06
8 B 2020-05-07
ROW 2,3 are within 30 days from ROW 1
ROW 5 is within 30 days from ROW 4
ROW 9 is within 30 days from ROW 8
This task cannot be solved by calling only vectorized methods.
The reason is that once a row is recognized as a duplicate, that row "does not count" when you check further rows.
E.g. after the rows 2020-01-03 and 2020-01-31 are deleted (as "too close" to the previous row), the 2020-02-05 row should be kept, because the distance to the previous retained row (2020-01-01) is now big enough.
So I came up with a solution based on a "function with memory":
def isDupl(elem):
    if isDupl.prev is None:
        isDupl.prev = elem
        return False
    dDiff = (elem - isDupl.prev).days
    rv = dDiff <= 30
    if not rv:
        isDupl.prev = elem
    return rv
This function should be invoked for each DATE in the current group (with the same ID), but before that isDupl.prev must be set to None.
So the function to apply to each group of rows is:
def isDuplGrp(grp):
    isDupl.prev = None
    return grp.DATE.apply(isDupl)
And to get the expected result, run:
df[~(df.groupby('ID').apply(isDuplGrp).reset_index(level=0, drop=True))]
(you may save it back to df).
The result is:
ROW ID DATE
0 1 A 2020-01-01
3 4 A 2020-02-05
5 6 A 2020-03-09
6 7 B 2020-03-06
7 8 B 2020-05-07
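The same "memory" idea can also be written as an explicit loop over each group, which avoids the function attribute (a sketch with the same logic, not the original answer's code):

import pandas as pd

def keep_spaced(dates, min_gap_days=30):
    # Boolean Series: keep the first row of the group, then keep any row that
    # is more than `min_gap_days` days after the last *kept* row.
    keep, last_kept = [], None
    for d in dates:
        ok = last_kept is None or (d - last_kept).days > min_gap_days
        if ok:
            last_kept = d
        keep.append(ok)
    return pd.Series(keep, index=dates.index)

# usage, assuming DATE is already datetime and sorted within each ID:
# mask = df.groupby('ID', group_keys=False)['DATE'].apply(keep_spaced)
# df[mask]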
And finally, a remark about the other solution:
It contains rows:
3 4 A 2020-02-05
4 5 A 2020-02-28
which are only 23 days apart, so this solution is wrong.
The same pertains to rows:
5 A 2020-02-28
6 A 2020-03-09
which are also too close in time.
You can try this:
Convert DATE to datetime64.
Get the first date of each group with df.groupby('ID')['DATE'].transform('first').
Add a filter to keep only dates at least 30 days after that first date.
Append the first date of each group back to the dataframe.
Code:
df['DATE'] = pd.to_datetime(df['DATE'])
df1 = df[(df['DATE'] - df.groupby('ID')['DATE'].transform('first')) >= pd.Timedelta(30, unit='D')]
df1 = df1.append(df.groupby('ID', as_index=False).agg('first')).sort_values(by=['ID', 'DATE'])
print(df1)
ROW ID DATE
0 1 A 2020-01-01
2 3 A 2020-01-31
3 4 A 2020-02-05
4 5 A 2020-02-28
5 6 A 2020-03-09
1 7 B 2020-03-06
7 8 B 2020-05-07
8 9 B 2020-06-02

How to remove duplicate entries using the latest time in Pandas

Here is the snippet:
import pandas as pd
from datetime import datetime

test = pd.DataFrame({'uid': [1, 1, 2, 2, 3, 3],
                     'start_time': [datetime(2017, 7, 20), datetime(2017, 6, 20),
                                    datetime(2017, 5, 20), datetime(2017, 4, 20),
                                    datetime(2017, 3, 20), datetime(2017, 2, 20)],
                     'amount': [10, 11, 12, 13, 14, 15]})
Output:
amount start_time uid
0 10 2017-07-20 1
1 11 2017-06-20 1
2 12 2017-05-20 2
3 13 2017-04-20 2
4 14 2017-03-20 3
5 15 2017-02-20 3
Desired Output:
amount start_time uid
0 10 2017-07-20 1
2 12 2017-05-20 2
4 14 2017-03-20 3
I want to group by uid and find the row with the latest start_time. Basically, I want to remove duplicate uids by selecting, for each uid, only the row with the latest start_time.
I tried test.groupby(['uid'])['start_time'].max() but it doesn't work, as it only returns the uid and start_time columns. I need the amount column as well.
Update: Thanks to #jezrael & #EdChum, you guys always help me out on this forum, thank you so much!
I tested both solutions in terms of execution time on a dataset of 1136 rows and 30 columns:
Method A: test.sort_values('start_time', ascending=False).drop_duplicates('uid')
Total execution time: 3.21 ms
Method B: test.loc[test.groupby('uid')['start_time'].idxmax()]
Total execution time: 65.1 ms
I guess groupby requires more time to compute.
Use idxmax to return the index of the latest time and use this to index the original df:
In[35]:
test.loc[test.groupby('uid')['start_time'].idxmax()]
Out[35]:
amount start_time uid
0 10 2017-07-20 1
2 12 2017-05-20 2
4 14 2017-03-20 3
Use sort_values by column start_time with drop_duplicates by uid:
df = test.sort_values('start_time', ascending=False).drop_duplicates('uid')
print (df)
amount start_time uid
0 10 2017-07-20 1
2 12 2017-05-20 2
4 14 2017-03-20 3
If need output with ordered uid:
print(test.sort_values('start_time', ascending=False)
          .drop_duplicates('uid')
          .sort_values('uid'))
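Equivalently (a minor variation, not from either answer), you can sort ascending and keep the last occurrence per uid, which gives the same three rows:

df = test.sort_values('start_time').drop_duplicates('uid', keep='last').sort_index()
print(df)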