Create New Column If Statement Based on Duplicate Rows in R - sql

I want to create a new column based on whether or not a row is a duplicate. My data is ordered by user # and then date. I want the new column to check whether the value in the first column equals the value in the row before, and then do the same for the date.
For example, given the first two columns of data below, I want to create a boolean flag in the third column indicating whether the row is a new user on a new day:
User# Date Unique
1 1/1/17 1
1 1/1/17 0
1 1/2/17 1
2 1/1/17 1
3 1/1/17 1
3 1/2/17 1

This may give you what you are looking for
library(dplyr)
User <- c(1,1,1,2,3,3)
Date <- c("1/1/17","1/1/17","1/2/17","1/1/17","1/1/17","1/2/17")
df <- data.frame(User,Date,stringsAsFactors = FALSE)
df <- df %>%
  group_by(User, Date) %>%
  mutate(Unique = if_else(duplicated(Date) == FALSE, 1, 0))

There might be a typo in the sample data set, as the last row is unique per the given criteria:
df1$Unique <- c(1, diff(df1$User) != 0 | diff(df1$Date) != 0)
User Date Unique
1 1 2017-01-01 1
2 1 2017-01-01 0
3 1 2017-01-02 1
4 2 2017-01-01 1
5 3 2017-01-01 1
6 3 2017-01-02 1
Update
If the users are stored as factors, then the following will work:
User <- c(1, 1, 1, 2, 3, 3)
User <- letters[User]
Date <- c("1/1/17", "1/1/17", "1/4/17", "1/1/17", "1/1/17", "1/2/17")
df1 <- data.frame(User, Date)
df1$Date <- as.Date(df1$Date, "%m/%d/%y")
df1$Unique <- c(1, diff(as.numeric(df1$User)) != 0 | diff(df1$Date) > 1)
User Date Unique
1 a 2017-01-01 1
2 a 2017-01-01 0
3 a 2017-01-04 1
4 b 2017-01-01 1
5 c 2017-01-01 1
6 c 2017-01-02 0

Related

How to achieve this in pandas dataframe

I have two dataframes, df1 and df2. df1:
Date        ID  total calls
24-02-2021  1   15
22-02-2021  1   25
20-02-2021  3   100
21-02-2021  4   30
df2:
Date        ID  total calls  match_flag
24-02-2021  1   16           1
22-02-2021  1   25           1
20-02-2021  3   99           1
24-02-2021  2   80           not_found
21-02-2021  4   25           0
I want to first match on ID and Date. If both match, I want to check an additional condition on total calls: if the difference between total calls in df1 and df2 is within ±1, I want to consider that row a match and set the flag to 1; if it does not satisfy the ±1 condition, I want to set the flag to 0; and if that Date for the ID is not found in df1, the flag should be not_found.
Update: df1 and df2 should now be matched on ID and DateId.
df1:
ID      Call_Date   TId     StartTime   EndTime     total calls  Type          Indicator  DateId
562124  18-10-2021  480271  18-10-2021  18-10-2021  1            Regular Call  SA         20211018
df2:
ID      total calls  DateId    Start_Time                    End_Time                      Indicator  Type  match_flag
562124  0            20211018  2021-10-18T13:06:00.000+0000  2021-10-18T13:07:00.000+0000  AD         R     not_found
You can use a merge:
s = df2.merge(df1, on=['Date', 'ID'], how='left')['total calls_y']

df2['match_flag'] = (df2['total calls']
                     .sub(s).abs().le(1)           # is absolute diff ≤ 1?
                     .astype(int)                  # convert to int
                     .mask(s.isna(), 'not_found')  # mask missing
                     )
output:
Date ID total calls match_flag
0 24-02-2021 1 16 1
1 22-02-2021 1 25 1
2 20-02-2021 3 99 1
3 24-02-2021 2 80 not_found
4 21-02-2021 4 25 0
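The question's update switches the key to ID and DateId. The same merge should carry over by changing the join columns; a minimal sketch, assuming both updated frames expose ID, DateId and total calls as shown above (the sample values below are hypothetical):

import pandas as pd

# hypothetical frames following the updated schemas (only the relevant columns)
df1 = pd.DataFrame({'ID': [562124], 'DateId': [20211018], 'total calls': [1]})
df2 = pd.DataFrame({'ID': [562124, 999999],
                    'DateId': [20211018, 20211018],
                    'total calls': [0, 5]})

# same idea as above, merging on ID and DateId instead of ID and Date
s = df2.merge(df1, on=['ID', 'DateId'], how='left')['total calls_y']
df2['match_flag'] = (df2['total calls']
                     .sub(s).abs().le(1)           # |df2 - df1| <= 1 ?
                     .astype(int)                  # True/False -> 1/0
                     .mask(s.isna(), 'not_found')  # no ID + DateId match in df1
                     )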

Leave the first TWO dates for each id

I have a dataframe of id number and dates:
import pandas as pd
df = pd.DataFrame([['1','01/01/2000'], ['1','01/07/2002'],['1', '04/05/2003'],
['2','01/05/2010'], ['2','08/08/2009'],
['3','12/11/2008']], columns=['id','start_date'])
df
id start_date
0 1 01/01/2000
1 1 01/07/2002
2 1 04/05/2003
3 2 01/05/2010
4 2 08/08/2009
5 3 12/11/2008
I am looking for a way to keep, for each id, only the first TWO dates (i.e. the two earliest dates).
For the example above, the output would be:
id start_date
0 1 01/01/2000
1 1 01/07/2002
2 2 08/08/2009
3 2 01/05/2010
4 3 12/11/2008
Thanks!
# ensure timestamp
df['start_date'] = pd.to_datetime(df['start_date'])
# sort values
df = df.sort_values(by=['id', 'start_date'])
# group and take the first 2 rows per id only
df_ = df.groupby('id')[['id', 'start_date']].head(2)
Just group by id and then you can call head. Be sure to sort your values first.
df = df.sort_values(['id', 'start_date'])
df.groupby('id').head(2)
full code:
df = pd.DataFrame([['1','01/01/2000'], ['1','01/07/2002'],['1', '04/05/2003'],
['2','01/05/2010'], ['2','08/08/2009'],
['3','12/11/2008']], columns=['id','start_date'])
# 1. convert 'start_time' column to datetime
df['start_date'] = pd.to_datetime(df['start_date'])
# 2. sort the dataframe ascending by 'start_time'
df.sort_values(by='start_date', ascending=True, inplace=True)
# 3. select only the first two occurrences of each id
df.groupby('id').head(2)
output:
id start_date
0 1 2000-01-01
1 1 2002-01-07
5 3 2008-12-11
4 2 2009-08-08
3 2 2010-01-05

Pivoting and transposing using pandas dataframe

Suppose that I have a pandas dataframe like the one below:
import pandas as pd
df = pd.DataFrame({'fk ID': [1,1,2,2],
'value': [3,3,4,5],
'valID': [1,2,1,2]})
The above would give me the following output:
print(df)
fk ID value valID
0 1 3 1
1 1 3 2
2 2 4 1
3 2 5 2
or
|fk ID| value | valId |
| 1 | 3 | 1 |
| 1 | 3 | 2 |
| 2 | 4 | 1 |
| 2 | 5 | 2 |
and I would like to transpose and pivot it in such a way that I get the following table, with the same order of column names:
| fk ID | value | valID | fk ID | value | valID |
| 1     | 3     | 1     | 1     | 3     | 2     |
| 2     | 4     | 1     | 2     | 5     | 2     |
The most straightforward solution I can think of is
df = pd.DataFrame({'fk ID': [1,1,2,2],
'value': [3,3,4,5],
'valID': [1,2,1,2]})
# concatenate the rows (Series) of each 'fk ID' group side by side
def flatten_group(g):
    return pd.concat(row for _, row in g.iterrows())

res = df.groupby('fk ID', as_index=False).apply(flatten_group)
However, using DataFrame.iterrows is not ideal and can be very slow if each group is large.
Furthermore, the above solution doesn't work if the 'fk ID' groups have different sizes. To see that, we can add a third group to the DataFrame:
>>> df2 = df.append({'fk ID': 3, 'value': 10, 'valID': 4},
...                 ignore_index=True)
>>> df2
fk ID value valID
0 1 3 1
1 1 3 2
2 2 4 1
3 2 5 2
4 3 10 4
>>> df2.groupby('fk ID', as_index=False).apply(flatten_group)
0 fk ID 1
value 3
valID 1
fk ID 1
value 3
valID 2
1 fk ID 2
value 4
valID 1
fk ID 2
value 5
valID 2
2 fk ID 3
value 10
valID 4
dtype: int64
The result is not a DataFrame as one could expect, because pandas can't align the columns of the groups.
To solve this I suggest the following solution. It should work for any group size, and should be faster for large DataFrames.
import numpy as np
def flatten_group(g):
    # flatten each group's data into a single row
    flat_data = g.to_numpy().reshape(1, -1)
    return pd.DataFrame(flat_data)
# group the rows by 'fk ID'
groups = df.groupby('fk ID', group_keys=False)
# get the maximum group size
max_group_size = groups.size().max()
# construct the new columns by repeating the
# original columns 'max_group_size' times
new_cols = np.tile(df.columns, max_group_size)
# aggregate the flattened rows
res = groups.apply(flatten_group).reset_index(drop=True)
# update the columns
res.columns = new_cols
Output:
# df
>>> res
fk ID value valID fk ID value valID
0 1 3 1 1 3 2
1 2 4 1 2 5 2
# df2
>>> res
fk ID value valID fk ID value valID
0 1 3 1 1.0 3.0 2.0
1 2 4 1 2.0 5.0 2.0
2 3 10 4 NaN NaN NaN
You can cast df to a NumPy array, reshape it, and cast it back to a DataFrame, then rename the columns (0..5).
This also works if the values are strings rather than numbers; a quick string example follows the code below.
import pandas as pd
df = pd.DataFrame({'fk ID': [1,1,2,2],
'value': [3,3,4,5],
'valID': [1,2,1,2]})
nrows = 2
array = df.to_numpy().reshape((nrows, -1))
pd.DataFrame(array).rename(mapper=lambda x: df.columns[x % len(df.columns)], axis=1)
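As a quick check of the string case mentioned above, here is a minimal sketch with a made-up frame of string values (hypothetical data, again two rows per 'fk ID'):

# made-up string data
df_str = pd.DataFrame({'fk ID': ['a', 'a', 'b', 'b'],
                       'value': ['x', 'x', 'y', 'z'],
                       'valID': ['1', '2', '1', '2']})
nrows = 2
array = df_str.to_numpy().reshape((nrows, -1))  # object dtype, so strings reshape fine
pd.DataFrame(array).rename(mapper=lambda x: df_str.columns[x % len(df_str.columns)], axis=1)
#   fk ID value valID fk ID value valID
# 0     a     x     1     a     x     2
# 1     b     y     1     b     z     2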
If your group sizes are guaranteed to be the same, you could merge your odd and even rows:
import pandas as pd
df = pd.DataFrame({'fk ID': [1,1,2,2],
'value': [3,3,4,5],
'valID': [1,2,1,2]})
df_even = df[df.index%2==0].reset_index(drop=True)
df_odd = df[df.index%2==1].reset_index(drop=True)
df_odd.join(df_even, rsuffix='_2')
Yields
fk ID value valID fk ID_2 value_2 valID_2
0 1 3 2 1 3 1
1 2 5 2 2 4 1
I'd expect this to be pretty performant, and it could be generalized to any number of rows per group (rather than assuming odd/even for two rows per group), but it does require the same number of rows per fk ID.
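A minimal sketch of that generalization, assuming equal group sizes and reusing the df defined above: number the rows within each 'fk ID' group with cumcount, then join the k-th rows of every group side by side (the _0/_1 column suffixes are my own naming):

# position of each row within its 'fk ID' group: 0, 1, ..., n-1
pos = df.groupby('fk ID').cumcount()
# take the k-th row of every group and place the pieces side by side
parts = [df[pos == k].reset_index(drop=True).add_suffix(f'_{k}')
         for k in range(pos.max() + 1)]
wide = pd.concat(parts, axis=1)
#    fk ID_0  value_0  valID_0  fk ID_1  value_1  valID_1
# 0        1        3        1        1        3        2
# 1        2        4        1        2        5        2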

How to assign a new column whose value runs from 1 to N based on time sequence and duplicated values in a dataframe?

Example:
id date seq
a 2019/11/01 1
a 2019/12/01 2
b 2019/10/01 1
c 2019/12/01 2
c 2019/11/01 1
I want to assign column seq based on columns date and id, where the latter contains duplicates. The details are as follows:
For values that are not duplicated in column id, like b, seq should be 1.
For values that are duplicated in column id, like a and c, seq should run from 1 to N (N being the repeat count), ordered by time (column date).
You can also use the rank method
import pandas as pd
df = pd.DataFrame({'id':[1,1,2,3,3],
'date':['2019/11/01',
'2019/12/01',
'2019/10/01',
'2019/12/01',
'2019/11/01']})
df['date'] = pd.to_datetime(df['date']) # first convert to datetime
df['seq'] = df.groupby('id')['date'].rank(method='dense').astype(int)
id date seq
0 1 2019-11-01 1
1 1 2019-12-01 2
2 2 2019-10-01 1
3 3 2019-12-01 2
4 3 2019-11-01 1
Use GroupBy.cumcount after converting the values to datetimes and sorting:
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['id','date'])
df['seq'] = df.groupby('id').cumcount() + 1
print (df)
id date seq
0 a 2019-11-01 1
1 a 2019-12-01 2
2 b 2019-10-01 1
4 c 2019-11-01 1
3 c 2019-12-01 2
If you need the same order as in the original, add DataFrame.sort_index:
df = df.sort_index()
print (df)
id date seq
0 a 2019-11-01 1
1 a 2019-12-01 2
2 b 2019-10-01 1
3 c 2019-12-01 2
4 c 2019-11-01 1

Selecting all the previous 6 months data records from occurrence of a particular value in a column in pandas

I want to select all of the previous 6 months' records for a customer whenever a particular transaction is done by that customer.
Data looks like:
Cust_ID Transaction_Date Amount Description
1 08/01/2017 12 Moved
1 03/01/2017 15 X
1 01/01/2017 8 Y
2 10/01/2018 6 Moved
2 02/01/2018 12 Z
Here, I want to look for the Description "Moved" and then select all records from the previous 6 months for every Cust_ID.
Output should look like:
Cust_ID Transaction_Date Amount Description
1 08/01/2017 12 Moved
1 03/01/2017 15 X
2 10/01/2018 6 Moved
I want to do this in python. Please help.
The idea is to create a Series of the Moved datetimes shifted back by a 6-month MonthOffset, then keep only the rows whose Transaction_Date is greater than the mapped offset:
EDIT: Get all datetimes for each Moved value:
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])
df = df.sort_values(['Cust_ID','Transaction_Date'])
df['g'] = df['Description'].iloc[::-1].eq('Moved').cumsum()
s = (df[df['Description'].eq('Moved')]
.set_index(['Cust_ID','g'])['Transaction_Date'] - pd.offsets.MonthOffset(6))
mask = df.join(s.rename('a'), on=['Cust_ID','g'])['a'] < df['Transaction_Date']
df1 = df[mask].drop('g', axis=1)
EDIT1: Get all datetimes for the Moved row with the minimal datetime per group; the other Moved rows in each group are removed:
print (df)
Cust_ID Transaction_Date Amount Description
0 1 10/01/2017 12 X
1 1 01/23/2017 15 Moved
2 1 03/01/2017 8 Y
3 1 08/08/2017 12 Moved
4 2 10/01/2018 6 Moved
5 2 02/01/2018 12 Z
#convert to datetimes
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])
#mask to select the Moved rows
mask = df['Description'].eq('Moved')
#filter and sort those rows
df1 = df[mask].sort_values(['Cust_ID','Transaction_Date'])
print (df1)
Cust_ID Transaction_Date Amount Description
1 1 2017-01-23 15 Moved
3 1 2017-08-08 12 Moved
4 2 2018-10-01 6 Moved
#get duplicated filtered rows in df1
mask = df1.duplicated('Cust_ID')
#create Series for map
s = df1[~mask].set_index('Cust_ID')['Transaction_Date'] - pd.offsets.MonthOffset(6)
print (s)
Cust_ID
1 2016-07-23
2 2018-04-01
Name: Transaction_Date, dtype: datetime64[ns]
#create mask to filter out other Moved rows (keep only the first per group)
m2 = ~mask.reindex(df.index, fill_value=False)
df1 = df[(df['Cust_ID'].map(s) < df['Transaction_Date']) & m2]
print (df1)
Cust_ID Transaction_Date Amount Description
0 1 2017-10-01 12 X
1 1 2017-01-23 15 Moved
2 1 2017-03-01 8 Y
4 2 2018-10-01 6 Moved
EDIT2:
#get last duplicated filtered rows in df1
mask = df1.duplicated('Cust_ID', keep='last')
#create Series for map
s = df1[~mask].set_index('Cust_ID')['Transaction_Date']
print (s)
Cust_ID
1 2017-08-08
2 2018-10-01
Name: Transaction_Date, dtype: datetime64[ns]
m2 = ~mask.reindex(df.index, fill_value=False)
#filter rows between each Moved date and the following 6 months
df3 = df[df['Transaction_Date'].between(df['Cust_ID'].map(s), df['Cust_ID'].map(s + pd.offsets.MonthOffset(6))) & m2]
print (df3)
Cust_ID Transaction_Date Amount Description
3 1 2017-08-08 12 Moved
0 1 2017-10-01 12 X
4 2 2018-10-01 6 Moved