How could I remove duplicates if duplicates mean less than 30 days apart? - sql

(using sql or pandas)
I want to delete a record if its date is less than 30 days after the previous kept record.
But the first record of each ID must be retained.
#example
ROW ID DATE
1 A 2020-01-01 -- first
2 A 2020-01-03
3 A 2020-01-31
4 A 2020-02-05
5 A 2020-02-28
6 A 2020-03-09
7 B 2020-03-06 -- first
8 B 2020-05-07
9 B 2020-06-02
#expected results
ROW ID DATE
1 A 2020-01-01
4 A 2020-02-05
6 A 2020-03-09
7 B 2020-03-06
8 B 2020-05-07
ROWs 2 and 3 are within 30 days of ROW 1.
ROW 5 is within 30 days of ROW 4.
ROW 9 is within 30 days of ROW 8.
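The answers below assume this data lives in a DataFrame named df. A minimal sketch to reproduce it (the column names and the datetime parsing are assumptions based on the example above):

import pandas as pd

# hypothetical reproduction of the example table
df = pd.DataFrame({
    'ROW': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'ID': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'DATE': pd.to_datetime(['2020-01-01', '2020-01-03', '2020-01-31',
                            '2020-02-05', '2020-02-28', '2020-03-09',
                            '2020-03-06', '2020-05-07', '2020-06-02']),
})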

This task cannot be solved by purely vectorized methods.
The reason is that once a row is recognized as a duplicate, that
row "does not count" when you check further rows.
E.g. after the 2020-01-03 and 2020-01-31 rows are deleted (as
"too close" to the previous row), the 2020-02-05 row should be
kept, because the distance to the previous remaining row (2020-01-01)
is now big enough.
So I came up with a solution based on a "function with memory":
def isDupl(elem):
    # remember the previous kept date between calls
    if isDupl.prev is None:
        isDupl.prev = elem
        return False
    dDiff = (elem - isDupl.prev).days
    rv = dDiff <= 30
    if not rv:
        # this row is kept, so it becomes the new reference point
        isDupl.prev = elem
    return rv
This function should be invoked for each DATE in the
current group (rows with the same ID), but isDupl.prev
must first be reset to None.
So the function to apply to each group of rows is:
def isDuplGrp(grp):
    isDupl.prev = None   # reset the memory before each new ID group
    return grp.DATE.apply(isDupl)
And to get the expected result, run:
df[~(df.groupby('ID').apply(isDuplGrp).reset_index(level=0, drop=True))]
(you may save it back to df).
The result is:
ROW ID DATE
0 1 A 2020-01-01
3 4 A 2020-02-05
5 6 A 2020-03-09
6 7 B 2020-03-06
7 8 B 2020-05-07
And finally, a remark about the other solution: it contains the rows
3 4 A 2020-02-05
4 5 A 2020-02-28
which are only 23 days apart, so that solution is wrong.
The same applies to the rows
5 A 2020-02-28
6 A 2020-03-09
which are also too close in time.
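As a sanity check, a small sketch (reusing df and the functions above) can verify that consecutive kept dates within each ID are more than 30 days apart:

# keep the non-duplicate rows, then inspect the gaps per ID
result = df[~(df.groupby('ID').apply(isDuplGrp).reset_index(level=0, drop=True))]
gaps = result.groupby('ID')['DATE'].diff().dropna()
assert (gaps > pd.Timedelta(days=30)).all()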

You can try this:
1. Convert DATE to datetime64.
2. Get the first date of each group with df.groupby('ID')['DATE'].transform('first').
3. Filter to keep only dates at least 30 days after that first date.
4. Append the first date of each group back to the dataframe.
Code:
df['DATE'] = pd.to_datetime(df['DATE'])
df1 = df[(df['DATE'] - df.groupby('ID')['DATE'].transform('first')) >= pd.Timedelta(30, unit='D')]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df1 = pd.concat([df1, df.groupby('ID', as_index=False).agg('first')]).sort_values(by=['ID', 'DATE'])
print(df1)
ROW ID DATE
0 1 A 2020-01-01
2 3 A 2020-01-31
3 4 A 2020-02-05
4 5 A 2020-02-28
5 6 A 2020-03-09
1 7 B 2020-03-06
7 8 B 2020-05-07
8 9 B 2020-06-02

Related

How to produce monthly count when given a date range in pandas?

I have a dataframe that records users, a label, and the start and end dates of each user being labelled as such, e.g.:
user  label  start_date  end_date
1     x      2018-01-01  2018-10-01
2     x      2019-05-10  2020-01-01
3     y      2019-04-01  2022-04-20
1     b      2018-10-01  2020-05-08
etc.
where each row is for a given user and label; a user can appear multiple times with different labels.
I want to get a count of users for every month for each label, such as this:
date     count_label_x  count_label_y  count_label_b  count_label_
2018-01  10             0              20             5
2018-02  2              5              15             3
2018-03  20             6              8              3
etc.
where, for instance, the user in the first row of the previous table should be counted once for every month between their start and end date. Since I only have a few labels, I can filter label by label and produce one output per label. But how do I check and count users within an interval?
Thanks
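A minimal sketch to reproduce the sample frame (column names are taken from the table above; parsing to datetime64 is assumed):

import pandas as pd

df = pd.DataFrame({
    'user': [1, 2, 3, 1],
    'label': ['x', 'x', 'y', 'b'],
    'start_date': pd.to_datetime(['2018-01-01', '2019-05-10',
                                  '2019-04-01', '2018-10-01']),
    'end_date': pd.to_datetime(['2018-10-01', '2020-01-01',
                                '2022-04-20', '2020-05-08']),
})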
You can use date_range combined with to_period to generate the active months, then pivot_table with aggfunc='nunique' to aggregate the unique users (if you want to count duplicated users, use aggfunc='count'):
out = (df
       .assign(period=[pd.date_range(a, b, freq='M').to_period('M')
                       for a, b in zip(df['start_date'], df['end_date'])])
       .explode('period')
       .pivot_table(index='period', columns='label', values='user',
                    aggfunc='nunique', fill_value=0)
       )
output:
label b x y
period
2018-01 0 1 0
2018-02 0 1 0
2018-03 0 1 0
2018-04 0 1 0
2018-05 0 1 0
...
2021-12 0 0 1
2022-01 0 0 1
2022-02 0 0 1
2022-03 0 0 1
Handling NaT: if a row's start and end fall within the same month (so date_range yields nothing and explode produces NaT) and you still want to count that row:
out = (df
       .assign(period=[pd.date_range(a, b, freq='M').to_period('M')
                       for a, b in zip(df['start_date'], df['end_date'])])
       .explode('period')
       .assign(period=lambda d: d['period'].fillna(d['start_date'].dt.to_period('M')))
       .pivot_table(index='period', columns='label', values='user',
                    aggfunc='nunique', fill_value=0)
       )
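If the exact layout from the question is needed (every month present and columns named count_label_*), a possible post-processing step is to reindex over the full month range and add a prefix. A sketch, assuming out from above:

# fill months with no active users and rename columns to the requested style
full_idx = pd.period_range(out.index.min(), out.index.max(), freq='M')
out = out.reindex(full_idx, fill_value=0).add_prefix('count_label_')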

Pandas: Drop duplicates that appear within a time interval

We have a dataframe containing 'ID' and 'DAY' columns, which show when a specific customer made a complaint. We need to drop duplicates from the 'ID' column, but only if the duplicates occurred at most 30 days apart. Please see the example below:
Current Dataset:
ID DAY
0 1 22.03.2020
1 1 18.04.2020
2 2 10.05.2020
3 2 13.01.2020
4 3 30.03.2020
5 3 31.03.2020
6 3 24.02.2021
Goal:
ID DAY
0 1 22.03.2020
1 2 10.05.2020
2 2 13.01.2020
3 3 30.03.2020
4 3 24.02.2021
Any suggestions? I have tried groupby and then a loop calculating the difference between each combination, but the dataframe has millions of rows, so this would take forever...
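A minimal sketch to reproduce the sample (the answers below assume a DataFrame df; DAY is assumed to be stored as dd.mm.yyyy strings, as in the table above):

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3, 3, 3],
    'DAY': ['22.03.2020', '18.04.2020', '10.05.2020', '13.01.2020',
            '30.03.2020', '31.03.2020', '24.02.2021'],
})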
You can compute the difference between successive dates per group and use it to form a mask to remove days that are less than 30 days apart:
df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)
mask = (df
        .sort_values(by=['ID', 'DAY'])
        .groupby('ID')['DAY']
        .diff().lt('30d')
        .sort_index()
        )
df[~mask]
NB. A potential drawback of this approach: if the customer makes a new complaint within the 30 days, this restarts the threshold for the next complaint.
output:
ID DAY
0 1 2020-03-22
2 2 2020-05-10
3 2 2020-01-13
4 3 2020-03-30
6 3 2021-02-24
Thus another approach might be to resample the data per group to 30 days:
(df
 .groupby('ID')
 .resample('30d', on='DAY').first()
 .dropna()
 .convert_dtypes()
 .reset_index(drop=True)
)
output:
ID DAY
0 1 2020-03-22
1 2 2020-01-13
2 2 2020-05-10
3 3 2020-03-30
4 3 2021-02-24
You can try grouping by the ID column and taking the diff of the DAY column within each group; the .abs() below guards against negative differences when dates within a group are not sorted:
from datetime import timedelta

df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)
m = timedelta(days=30)
out = (df.groupby('ID')
         .apply(lambda group: group[~group['DAY'].diff().abs().le(m)])
         .reset_index(drop=True))
print(out)
ID DAY
0 1 2020-03-22
1 2 2020-05-10
2 2 2020-01-13
3 3 2020-03-30
4 3 2021-02-24
To convert back to the original date format, you can use dt.strftime:
out['DAY'] = out['DAY'].dt.strftime('%d.%m.%Y')
print(out)
ID DAY
0 1 22.03.2020
1 2 10.05.2020
2 2 13.01.2020
3 3 30.03.2020
4 3 24.02.2021

Difference in weeks between first occurrence and current occurrence

I have a large dataset including items and dates. A simplified version looks as follows:
Item  Date
1     2018-10-01
2     2018-04-03
1     2018-10-16
2     2018-04-15
1     2018-10-20
2     2018-04-30
I want to add a column df['ItemAge'] showing the number of weeks between each occurrence of an item and the item's first occurrence, rounded to whole integers. The value is 0 on the date the item first occurs. Hence I want to obtain:
Item  Date        ItemAge
1     2018-10-01  0
2     2018-04-03  0
1     2018-10-16  2
2     2018-04-15  2
1     2018-10-20  3
2     2018-04-30  4
I am thinking of creating a StartDate variable holding the date of each item's first occurrence, then looping over the items, taking the difference in days between each occurrence and its StartDate, dividing by 7, and rounding to whole integers.
However, I don't know how to write this code in Python. Does anyone have any suggestions?
You can convert to datetime, group by Item, and transform to take the first date per group. Then subtract, take the resulting day counts, and divide by 7 to get weeks:
d = pd.to_datetime(df['Date'])
df['ItemAge'] = ((d-d.groupby(df['Item']).transform('first')).dt.days/7).round().astype(int)
output:
Item Date ItemAge
0 1 2018-10-01 0
1 2 2018-04-03 0
2 1 2018-10-16 2
3 2 2018-04-15 2
4 1 2018-10-20 3
5 2 2018-04-30 4
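An equivalent formulation, for what it's worth: divide the timedeltas by a one-week Timedelta instead of taking .dt.days / 7 (a sketch under the same assumptions; the two agree here because the dates carry no time-of-day component):

d = pd.to_datetime(df['Date'])
weeks = (d - d.groupby(df['Item']).transform('first')) / pd.Timedelta(weeks=1)
df['ItemAge'] = weeks.round().astype(int)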

Selecting all records from the previous 6 months when a particular value occurs in a column in pandas

I want to select all of the previous 6 months of records for a customer whenever the customer makes a particular transaction.
Data looks like:
Cust_ID Transaction_Date Amount Description
1 08/01/2017 12 Moved
1 03/01/2017 15 X
1 01/01/2017 8 Y
2 10/01/2018 6 Moved
2 02/01/2018 12 Z
Here, I want to look for the Description "Moved" and then select the previous 6 months of records for every Cust_ID.
Output should look like:
Cust_ID Transaction_Date Amount Description
1 08/01/2017 12 Moved
1 03/01/2017 15 X
2 10/01/2018 6 Moved
I want to do this in Python. Please help.
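A minimal sketch to reproduce the sample (the mm/dd/yyyy date format and column names are assumptions based on the table above):

import pandas as pd

df = pd.DataFrame({
    'Cust_ID': [1, 1, 1, 2, 2],
    'Transaction_Date': ['08/01/2017', '03/01/2017', '01/01/2017',
                         '10/01/2018', '02/01/2018'],
    'Amount': [12, 15, 8, 6, 12],
    'Description': ['Moved', 'X', 'Y', 'Moved', 'Z'],
})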
The idea is to create a Series of datetimes filtered to the Moved rows and shifted back by MonthOffset(6), then keep the rows whose mapped offset value is less than their Transaction_Date:
EDIT: get the datetimes within 6 months before each Moved value:
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])
df = df.sort_values(['Cust_ID', 'Transaction_Date'])
# number the runs of rows that end in a 'Moved' row (counted from the bottom)
df['g'] = df['Description'].iloc[::-1].eq('Moved').cumsum()
s = (df[df['Description'].eq('Moved')]
     .set_index(['Cust_ID', 'g'])['Transaction_Date'] - pd.offsets.MonthOffset(6))
mask = df.join(s.rename('a'), on=['Cust_ID', 'g'])['a'] < df['Transaction_Date']
df1 = df[mask].drop('g', axis=1)
EDIT1: use the minimal (first) Moved datetime per customer; the other Moved rows in each group are removed:
print (df)
Cust_ID Transaction_Date Amount Description
0 1 10/01/2017 12 X
1 1 01/23/2017 15 Moved
2 1 03/01/2017 8 Y
3 1 08/08/2017 12 Moved
4 2 10/01/2018 6 Moved
5 2 02/01/2018 12 Z
# convert to datetimes
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])
# mask to filter the Moved rows
mask = df['Description'].eq('Moved')
# filter and sort these rows
df1 = df[mask].sort_values(['Cust_ID', 'Transaction_Date'])
print (df1)
Cust_ID Transaction_Date Amount Description
1 1 2017-01-23 15 Moved
3 1 2017-08-08 12 Moved
4 2 2018-10-01 6 Moved
# flag repeated Cust_IDs among the Moved rows (so only the first is kept)
mask = df1.duplicated('Cust_ID')
# Series for mapping: first Moved date per customer, shifted back 6 months
s = df1[~mask].set_index('Cust_ID')['Transaction_Date'] - pd.offsets.MonthOffset(6)
print (s)
Cust_ID
1 2016-07-23
2 2018-04-01
Name: Transaction_Date, dtype: datetime64[ns]
# mask to filter out the other Moved rows (keep only the first per group)
m2 = ~mask.reindex(df.index, fill_value=False)
df1 = df[(df['Cust_ID'].map(s) < df['Transaction_Date']) & m2]
print (df1)
Cust_ID Transaction_Date Amount Description
0 1 2017-10-01 12 X
1 1 2017-01-23 15 Moved
2 1 2017-03-01 8 Y
4 2 2018-10-01 6 Moved
EDIT2:
# flag all but the last Moved row per customer
mask = df1.duplicated('Cust_ID', keep='last')
# Series for mapping: last Moved date per customer
s = df1[~mask].set_index('Cust_ID')['Transaction_Date']
print (s)
Cust_ID
1 2017-08-08
2 2018-10-01
Name: Transaction_Date, dtype: datetime64[ns]
m2 = ~mask.reindex(df.index, fill_value=False)
# keep rows between each Moved date and the following 6 months
df3 = df[df['Transaction_Date'].between(df['Cust_ID'].map(s), df['Cust_ID'].map(s + pd.offsets.MonthOffset(6))) & m2]
print (df3)
Cust_ID Transaction_Date Amount Description
3 1 2017-08-08 12 Moved
0 1 2017-10-01 12 X
4 2 2018-10-01 6 Moved

How to remove duplicate entries using the latest time in Pandas

Here is the snippet:
test = pd.DataFrame({'uid':[1,1,2,2,3,3],
'start_time':[datetime(2017,7,20),datetime(2017,6,20),datetime(2017,5,20),datetime(2017,4,20),datetime(2017,3,20),datetime(2017,2,20)],
'amount': [10,11,12,13,14,15]})
Output:
amount start_time uid
0 10 2017-07-20 1
1 11 2017-06-20 1
2 12 2017-05-20 2
3 13 2017-04-20 2
4 14 2017-03-20 3
5 15 2017-02-20 3
Desired Output:
amount start_time uid
0 10 2017-07-20 1
2 12 2017-05-20 2
4 14 2017-03-20 3
I want to group by uid and find the row with the latest start_time. Basically, I want to remove duplicate uids by keeping only the row with the latest start_time for each uid.
I tried test.groupby(['uid'])['start_time'].max(), but it doesn't work: it only returns the uid and start_time columns, and I need the amount column as well.
Update: Thanks to @jezrael & @EdChum, you guys always help me out on this forum, thank you so much!
I tested both solutions in terms of execution time on a dataset of 1136 rows and 30 columns:
Method A: test.sort_values('start_time', ascending=False).drop_duplicates('uid')
Total execution time: 3.21 ms
Method B: test.loc[test.groupby('uid')['start_time'].idxmax()]
Total execution time: 65.1 ms
I guess groupby requires more time to compute.
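For reference, a sketch of how such a comparison could be reproduced outside IPython with the standard library's timeit (absolute numbers will of course vary by machine and data size):

import timeit

n = 100
t_a = timeit.timeit(lambda: test.sort_values('start_time', ascending=False)
                                .drop_duplicates('uid'), number=n)
t_b = timeit.timeit(lambda: test.loc[test.groupby('uid')['start_time'].idxmax()],
                    number=n)
print(f'Method A: {t_a / n * 1e3:.2f} ms per run')
print(f'Method B: {t_b / n * 1e3:.2f} ms per run')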
Use idxmax to return the index of the latest time and use this to index the original df:
In[35]:
test.loc[test.groupby('uid')['start_time'].idxmax()]
Out[35]:
amount start_time uid
0 10 2017-07-20 1
2 12 2017-05-20 2
4 14 2017-03-20 3
Use sort_values on the start_time column combined with drop_duplicates on uid:
df = test.sort_values('start_time', ascending=False).drop_duplicates('uid')
print (df)
amount start_time uid
0 10 2017-07-20 1
2 12 2017-05-20 2
4 14 2017-03-20 3
If you need the output ordered by uid:
print (test.sort_values('start_time', ascending=False)
.drop_duplicates('uid')
.sort_values('uid'))
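For this sample, the ordered output matches the one above, since sorting by uid happens to restore the original row order:
   amount start_time  uid
0      10 2017-07-20    1
2      12 2017-05-20    2
4      14 2017-03-20    3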