Finding duplicate entries - pandas

I am working with the 515k Hotel Reviews dataset from Kaggle. There are 1492 unique hotel names and 1493 unique addresses. So at first it would appear that one (or possibly more) hotel has more than one address. But, if I do a groupby.count on the data, I get 1494 whether I groupby HotelName followed by Address or if I reverse the order.
In order to make this reproducible, hopefully this simplification will suffice:
import pandas as pd

data = {
    'HotelName': ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B', 'B', 'C', 'C'],
    'Address': [1, 2, 3, 4, 1, 2, 3, 4, 2, 2, 2, 3, 5]
}
df = pd.DataFrame(data, columns=['HotelName', 'Address'])
df['HotelName'].unique().shape[0] # Returns 4
df['Address'].unique().shape[0] # Returns 5
df.groupby(['Address', 'HotelName']).count().shape[0] # Returns 6
df.groupby(['HotelName', 'Address']).count().shape[0] # Returns 6
I would like to find the hotel names that have different addresses. So in my example, I would like to find A and C along with their addresses (1, 2 and 3, 5 respectively). That code should also be enough for me to find the addresses that have duplicate hotel names.

Use the nunique groupby aggregator:
>>> n_uniq = df.groupby('HotelName')['Address'].nunique()
>>> n_uniq
HotelName
A 2
B 1
C 2
D 1
Name: Address, dtype: int64
If you want to look at the distinct hotels with more than one address in the original dataframe,
>>> hotels_with_mult_addr = n_uniq.index[n_uniq > 1]
>>> df[df['HotelName'].isin(hotels_with_mult_addr)].drop_duplicates()
HotelName Address
0 A 1
2 C 3
8 A 2
12 C 5
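An equivalent one-step filter (an alternative sketch, not part of the answer above) uses groupby.filter to keep only the hotels that have more than one distinct address, then drops the repeated rows:
# keep rows belonging to hotels with more than one distinct address
dups = df.groupby('HotelName').filter(lambda g: g['Address'].nunique() > 1)
dups.drop_duplicates()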

If I understand you correctly, we can check which hotel has more than 1 unique address with groupby.transform('nunique'):
m = df.groupby('HotelName')['Address'].transform('nunique').ne(1)
print(df.loc[m])
HotelName Address
0 A 1
2 C 3
4 A 1
6 C 3
8 A 2
11 C 3
12 C 5
If you want to get a more concise view of what the duplicates are, use groupby.agg(set):
df.loc[m].groupby('HotelName')['Address'].agg(set).reset_index(name='addresses')
HotelName addresses
0 A {1, 2}
1 C {3, 5}
Step by step:
transform('nunique') gives us the number of unique addresses next to each row:
df.groupby('HotelName')['Address'].transform('nunique')
0 2
1 1
2 2
3 1
4 2
5 1
6 2
7 1
8 2
9 1
10 1
11 2
12 2
Name: Address, dtype: int64
Then we check which rows are not equal (ne) to 1 and filter those:
df.groupby('HotelName')['Address'].transform('nunique').ne(1)
0 True
1 False
2 True
3 False
4 True
5 False
6 True
7 False
8 True
9 False
10 False
11 True
12 True
Name: Address, dtype: bool

Groupby didn't do what you expected. After the groupby, here is what you got:
HotelName Address
0 A 1
4 A 1
HotelName Address
8 A 2
HotelName Address
1 B 2
5 B 2
9 B 2
10 B 2
HotelName Address
2 C 3
6 C 3
11 C 3
HotelName Address
3 D 4
7 D 4
HotelName Address
12 C 5
There are indeed 6 combinations!
If you want to know the duplication in each group, you should check the row index.
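A minimal sketch of what "check the row index" can look like: the .groups attribute of the groupby object maps each (Address, HotelName) combination to the row labels it contains.
# list the original row labels that fall into each combination
for key, idx in df.groupby(['Address', 'HotelName']).groups.items():
    print(key, list(idx))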

Here is the long way to do it, where rows with dfnew['count'] == 1 are unique:
df = pd.DataFrame(data, columns=['HotelName', 'Address'])
df = df.sort_values(by=['HotelName', 'Address']).reset_index(drop=True)
# number of rows per (HotelName, Address) pair
count = df.groupby(['HotelName', 'Address'])['Address'].count().reset_index(drop=True)
# running row number within each pair
df['rownum'] = df.groupby(['HotelName', 'Address']).cumcount() + 1
# keep the first row of each pair and attach the counts
dfnew = df[df['rownum'] == 1].reset_index(drop=True).drop(columns='rownum')
dfnew['count'] = count
dfnew
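A shorter equivalent of the long way above (an alternative sketch, not part of the answer) counts the rows per (HotelName, Address) pair directly with groupby.size:
# one row per pair, with the number of occurrences in a 'count' column
dfnew = df.groupby(['HotelName', 'Address']).size().reset_index(name='count')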

Related

Python: obtaining the first observation according to its date [duplicate]

I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = df.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
df['id'] = df['A'].astype(str) + df['B'].astype(str)
df[df['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
Had a similar situation but with a more complex column heading (e.g. "B val") in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select n rows with smallest values in specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
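As an example of the first benefit above (a sketch, not part of the answer), keeping the two smallest B values per group only means changing n:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=2).reset_index(drop=True)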
I found an answer that is a little bit more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we will get the min values on a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then, we merge this series result on the original data frame
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
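To match the index shown in the question's desired output, one can append reset_index (a small usage note on the same technique):
df.sort_values('B').drop_duplicates('A').reset_index(drop=True)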
The solution is, as written before:
df.loc[df.groupby('A')['B'].idxmin()]
If you use that solution but then get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
In my case, there were NaN values in column B, so I used dropna() and then it worked:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to select the rows where column B equals the group's minimum value:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4

Pandas add column with values from two df on partly matching column

This is most likely an easy question, but I am still stuck on how to solve it.
I have two dataframes which match my main dataframe on the column "giftID", and I want to create a new column in the main dataframe with the values from those dataframes for the matching giftID. I tried it with np.where and all different kinds of approaches but can't get it working.
df = pd.read_csv('../data/gifts.csv')
trip1 = df[:20].copy()
trip1['TripId'] = 0
subtours = [list(trip1['GiftId'])] * len(trip1)
trip1['Subtour'] = subtours
trip2 = df[20:41].copy()
#trip2['Subtour'] = [s]*len(trip2)
trip2['TripId'] = 1
trip2['Subtour'] = subtours = [list(trip2['GiftId'])] * len(trip2)
mini_tour = trip1.append(trip2)
grouped = mini_tour.groupby('TripId')
SA = Simulated_Anealing()
wrw = 0
for name, trip in grouped:
    tourId = trip['TripId'].unique()[0]
    optimized_trip, wrw_c = SA.simulated_annealing(trip)
    wrw += wrw_c
    subtours = [optimized_trip] * len(trip)
    mask = mini_tour['TripId'] == tourId
    mini_tour.loc[mask, 'Subtour'] = 0
Input:
df giftID weight
1 A 4
2 B 5
3 C 6
4 D 7
5 E 12
df1 giftID subtour
1 A 1, 3, 4
2 B 1, 3, 4
3 C 1, 3, 4
df2 giftID subtour
1 D 2, 5, 8
2 E 2, 5, 8
Output:
df giftID weight subtour
1 A 4 1, 3, 4
2 B 5 1, 3, 4
3 C 6 1, 3, 4
4 D 7 2, 5, 8
5 E 12 2, 5, 8
Firstly, you can pd.concat df1 and df2:
import pandas as pd
df12 = pd.concat([df1, df2], axis=0)  # axis=0 means row-wise
Then merge df12 with your main dataframe:
df_merge = pd.merge(df, df12, how='left', on='giftID')
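A self-contained sketch of those two steps, using small frames shaped like the question's example (the data and variable names here are assumptions for illustration):
import pandas as pd

df = pd.DataFrame({'giftID': ['A', 'B', 'C', 'D', 'E'],
                   'weight': [4, 5, 6, 7, 12]})
df1 = pd.DataFrame({'giftID': ['A', 'B', 'C'], 'subtour': ['1, 3, 4'] * 3})
df2 = pd.DataFrame({'giftID': ['D', 'E'], 'subtour': ['2, 5, 8'] * 2})

df12 = pd.concat([df1, df2], axis=0)                    # stack the two lookup frames row-wise
df_merge = pd.merge(df, df12, how='left', on='giftID')  # attach subtour by giftID
print(df_merge)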

Combine two dataframes in Pandas to generate many to many relationship

I have two lists, say
customers = ['a', 'b', 'c']
accounts = [1, 2, 3, 4, 5, 6, 7, 8, 9]
I want to generate a Pandas dataframe so:
All customers and accounts are used
There is a many-to-many relationship between customers and accounts (one customer 'may' have multiple accounts and an account 'may' be owned by multiple customers)
I want the many-to-many relationship to be random. That is, some customers will have one account and some will have more than one. Similarly, some accounts will be owned by just one customer and others by more than one.
Something like,
Customer Account
a        1
a        2
b        2
c        3
a        4
b        4
c        4
b        5
b        6
b        7
b        8
a        9
Since I am generating random data, in the worst case scenario, I can generate way too many accounts and discard the unused ones if the code is easier (essentially relaxing the requirement 1 above).
I am using sample(n=20, replace=True) to generate 20 records in both dataframes and then merging them into one based on the index. Is there an out-of-the-box API or library to do this, or is my code the recommended way?
customers = ['a', 'b', 'c']
accounts = [1, 2, 3, 4, 5, 6, 7, 8, 9]
customers_df = pd.DataFrame(data=customers)
customers_df = customers_df.sample(n=20, replace=True)
customers_df['new_index'] = range(20)
customers_df.set_index('new_index', inplace=True)
accounts_df = pd.DataFrame(data=accounts)
accounts_df = accounts_df.sample(n=20, replace=True)
accounts_df['new_index'] = range(20)
accounts_df.set_index('new_index', inplace=True)
combined_df = pd.merge(customers_df, accounts_df, on='new_index')
print(combined_df)
Edit: Modified the question and added sample code I have tried.
One way to accomplish this is to collect the set of all possible relationships with a cartesian product, then select from that list before building your dataframe:
import itertools
import random
import pandas as pd

customers = ['a', 'b', 'c']
accounts = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# every possible (customer, account) pairing
possible_associations = list(itertools.product(customers, accounts))
# draw 20 pairings (with replacement) and build the dataframe
df = (pd.DataFrame.from_records(random.choices(possible_associations, k=20),
                                columns=['customers', 'accounts'])
        .sort_values(['customers', 'accounts']))
print(df)
Output
customers accounts
0 a 2
3 a 2
15 a 2
18 a 4
16 a 5
14 a 7
7 a 8
12 a 8
1 a 9
2 b 5
9 b 5
8 b 8
11 b 8
19 c 2
17 c 3
5 c 4
4 c 5
6 c 5
13 c 5
10 c 7
To have a repeatable test result, start with np.random.seed(1) (in the target
version drop it).
Then proceed as follows:
Create a list of probabilities - how many accounts can have a customer, e.g.:
prob = [0.5, 0.25, 0.15, 0.09, 0.01]
Generate a Series stating how many owners shall have each account:
cnt = pd.Series(np.random.choice(range(1, len(prob) + 1), size=len(accounts),
                                 p=prob), name='Customer')
Its name is Customer, because it will be the source to create just
Customer column.
For my sample probabilities and generator seeding, the result is:
0 1
1 2
2 1
3 1
4 1
5 1
6 1
7 1
8 1
Name: Customer, dtype: int32
(the left column is the index, the right - actual values).
Because your data sample contains only 9 accounts, the result does not contain any larger numbers of owners. But in your target version, with more accounts, there will be accounts with greater numbers of owners.
Generate the result - cust_acct DataFrame, defining the assignment of customers
to accounts:
cust_acct = cnt.apply(lambda x: np.random.choice(customers, x, replace=False))\
    .explode().to_frame().join(pd.Series(accounts, name='Account')).reset_index(drop=True)
The result, for your sample data and my seeding and probabilities, is:
Customer Account
0 b 1
1 a 2
2 b 2
3 b 3
4 b 4
5 c 5
6 b 6
7 c 7
8 a 8
9 b 9
Of course, you can assume different probabilities in prob.
You can also choose another "top" number of owners (the number of entries in prob).
In this case no change in the code is needed, because the range of values in the first np.random.choice is set to accommodate the length of prob.
Note: Because your sample data contains only 3 customers, a different generator seeding can raise ValueError: Cannot take a larger sample than population when 'replace=False'.
The reason is that this error occurs whenever the drawn number of owners for some account is greater than 3.
But with your target data, with a greater number of customers, this error will not occur.
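Putting the steps above together into one runnable sketch (same sample data and probabilities; the seed is only there to make the output repeatable):
import numpy as np
import pandas as pd

np.random.seed(1)
customers = ['a', 'b', 'c']
accounts = [1, 2, 3, 4, 5, 6, 7, 8, 9]
prob = [0.5, 0.25, 0.15, 0.09, 0.01]

# how many owners each account gets
cnt = pd.Series(np.random.choice(range(1, len(prob) + 1), size=len(accounts), p=prob),
                name='Customer')

# draw that many distinct customers per account and pair them with the account ids
cust_acct = (cnt.apply(lambda x: np.random.choice(customers, x, replace=False))
                .explode()
                .to_frame()
                .join(pd.Series(accounts, name='Account'))
                .reset_index(drop=True))
print(cust_acct)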

Pandas: How to obtain top 2, middle 2 and bottom 2 rows in each group

Let's say I have a dataframe df as below. To obtain the first 2 and last 2 rows in each group, I have used groupby.nth:
df = pd.DataFrame({'A': ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'b'],
                   'B': [1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7]}, columns=['A', 'B'])
df.groupby('A').nth([0,1,-2,-1])
Result:
B
A
a 1
a 2
a 7
a 8
b 1
b 2
b 6
b 7
I'm not sure how to obtain the middle 2 rows. For example, in group 'A' there are 8 instances so my middle would be 4, 5 (n/2, n/2+1) and group 'B' my middle rows would be 3, 4 (n/2-0.5, n/2+0.5). Any guidance is appreciated.
sacul's answer is nice. Here I just follow your own idea and define a customized function:
def middle(x):
    if len(x) % 2 == 0:
        return x.iloc[int(len(x) / 2) - 1:int(len(x) / 2) + 1]
    else:
        return x.iloc[int((len(x) / 2 - 0.5)) - 1:int(len(x) / 2 + 0.5)]

pd.concat([middle(y) for _, y in df.groupby('A')])
Out[25]:
A B
3 a 4
4 a 5
10 b 3
11 b 4
You can use iloc to find the n//2 -1 and n//2 indices for each group (// is floor division):
g = df.groupby('A')
g.apply(lambda x: x['B'].iloc[[len(x)//2-1, len(x)//2]])
A
a 3 4
4 5
b 10 3
11 4
Name: B, dtype: int64
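If you also need the top 2 and bottom 2 from the question in the same result, one way (a sketch combining the question's idea with the answers above, not taken from either answer) is to select all the positions with iloc inside a single apply:
def top_middle_bottom(g):
    half = len(g) // 2
    # positions of the first 2, the middle 2 and the last 2 rows of the group
    pos = [0, 1, half - 1, half, len(g) - 2, len(g) - 1]
    return g.iloc[sorted(set(pos))]

df.groupby('A', group_keys=False).apply(top_middle_bottom)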

Comparing and replacing column items pandas dataframe

I have three columns C1, C2, C3 in a pandas dataframe. My aim is to replace C3_i by C2_j whenever C3_i = C1_j. These are all strings. I was trying where but failed. What is a good way to do this, avoiding a for loop?
If my data frame is
df=pd.DataFrame({'c1': ['a', 'b', 'c'], 'c2': ['d','e','f'], 'c3': ['c', 'z', 'b']})
Then I want c3 to be replaced by ['f','z','e']
I tried this, which takes a very long time:
for i in range(0, len(df)):
    for j in range(0, len(df)):
        if df.iloc[i]['c1'] == df.iloc[j]['c3']:
            df.iloc[j]['c3'] = df.iloc[i]['c2']
Use map with a Series created by set_index:
df['c3'] = df['c3'].map(df.set_index('c1')['c2']).fillna(df['c3'])
Alternative solution with update:
df['c3'].update(df['c3'].map(df.set_index('c1')['c2']))
print (df)
c1 c2 c3
0 a d f
1 b e z
2 c f e
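An alternative sketch (not part of the answer above): build the same c1 -> c2 mapping as a dict and pass it to Series.replace, which leaves values without a match untouched:
df['c3'] = df['c3'].replace(dict(zip(df['c1'], df['c2'])))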
Example data:
dataframe = pd.DataFrame({'a':['10','4','3','40','5'], 'b':['5','4','3','2','1'], 'c':['s','d','f','g','h']})
Output:
a b c
0 10 5 s
1 4 4 d
2 3 3 f
3 40 2 g
4 5 1 h
Code:
def replace(df):
    if len(dataframe[dataframe.b == df.a]) != 0:
        df['a'] = dataframe[dataframe.b == df.a].c.values[0]
    return df

dataframe = dataframe.apply(replace, axis=1)
Output:
a b c
0 10 5 s
1 d 4 d
2 f 3 f
3 40 2 g
4 s 1 h
Is this what you want?