pandas - get count of duplicate rows (matching across multiple columns)

I have a table like below - unique IDs and names. I want to return any duplicated names (based on matching First and Last).
Id First Last
1 Dave Davis
2 Dave Smith
3 Bob Smith
4 Dave Smith
I've managed to return a count of duplicates across all columns if I don't have an ID column, i.e.
import pandas as pd
dict2 = {'First': pd.Series(["Dave", "Dave", "Bob", "Dave"]),
'Last': pd.Series(["Davis", "Smith", "Smith", "Smith"])}
df2 = pd.DataFrame(dict2)
print(df2.groupby(df2.columns.tolist()).size().reset_index().\
rename(columns={0:'records'}))
Output:
First Last records
0 Bob Smith 1
1 Dave Davis 1
2 Dave Smith 2
I want to be able to return the duplicates (of first and last) when I also have an ID column, i.e.
import pandas as pd
dict1 = {'Id': pd.Series([1, 2, 3, 4]),
'First': pd.Series(["Dave", "Dave", "Bob", "Dave"]),
'Last': pd.Series(["Davis", "Smith", "Smith", "Smith"])}
df1 = pd.DataFrame(dict1)
print(df1.groupby(df1.columns.tolist()).size().reset_index().\
rename(columns={0:'records'}))
gives:
Id First Last records
0 1 Dave Davis 1
1 2 Dave Smith 1
2 3 Bob Smith 1
3 4 Dave Smith 1
I want (ideally):
First Last records Ids
0 Dave Smith 2 2, 4

First, keep only the duplicated rows: call DataFrame.duplicated on the columns to check, with keep=False so that all duplicates are flagged, and filter by boolean indexing. Then aggregate with GroupBy.agg, counting the rows with GroupBy.size and joining the Ids after converting them to strings:
tup = [('records','size'), ('Ids',lambda x: ','.join(x.astype(str)))]
df2 = (df1[df1.duplicated(['First','Last'], keep=False)]
.groupby(['First','Last'])['Id'].agg(tup)
.reset_index())
print (df2)
First Last records Ids
0 Dave Smith 2 2,4
Another idea is to aggregate all groups first and then filter with DataFrame.query:
tup = [('records','size'), ('Ids',lambda x: ','.join(x.astype(str)))]
df2 = (df1.groupby(['First','Last'])['Id'].agg(tup)
.reset_index()
.query('records != 1'))
print (df2)
First Last records Ids
2 Dave Smith 2 2,4
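For reference, the same result can probably also be written with named aggregation. This is only a minimal sketch, assuming pandas 0.25 or later; the variable name dupes is made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "First": ["Dave", "Dave", "Bob", "Dave"],
    "Last": ["Davis", "Smith", "Smith", "Smith"],
})

# One row per (First, Last) pair: group size plus the Ids joined as text.
dupes = (
    df1.groupby(["First", "Last"])
       .agg(records=("Id", "size"),
            Ids=("Id", lambda s: ", ".join(s.astype(str))))
       .reset_index()
)

# Keep only the pairs that occur more than once.
dupes = dupes[dupes["records"] > 1].reset_index(drop=True)
print(dupes)
```

The filter on records plays the same role as the query call above; resetting the index afterwards is optional.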

Related

How to return difference between two column values in pandas

I have 1 dataframe and want to check and then return the difference in values between two columns of the same dataframe only if there is a value in the 2nd column. The 2nd column in my example below is AppliancesO and first column is AppliancesH
Item Name AppliancesH AppliancesO
1 Joe TV TV
2 Mary [TV; Fridge] TV
3 Jack [Microwave;TV;Fridge] [Computer;TV;Fridge]
4 Pete [Fridge;Oven]
and 1000 more rows as such
The output I am looking for is
Item Name AppliancesH AppliancesO Diff
1 Joe TV TV
2 Mary [TV; Fridge] TV Fridge
3 Jack [Microwave;TV;Fridge] [Computer;TV;Fridge] [Microwave;Computer]
4 Pete [Fridge;Oven]
I know how to compare the columns to determine if they are different, but I don't know how to return the difference
df.loc[(df['AppliancesH']!=df['AppliancesO'])& ~df.AppliancesO.isna()][['Name','AppliancesH', 'AppliancesO','Diff']]
Assuming the following data
>>> dict_ = {'AppliancesH': {1: ['TV'], 2: ['TV', 'Fridge'], 3: ['Microwave', 'TV', 'Fridge'], 4: ['Fridge', 'Oven']}, 'AppliancesO': {1: ['TV'], 2: ['TV'], 3: ['Computer', 'TV', 'Fridge'], 4: []}, 'Name': {1: 'Joe', 2: 'Mary', 3: 'Jack', 4: 'Pete'}}
>>> df = pd.DataFrame(dict_)
>>> df
AppliancesH AppliancesO Name
1 [TV] [TV] Joe
2 [TV, Fridge] [TV] Mary
3 [Microwave, TV, Fridge] [Computer, TV, Fridge] Jack
4 [Fridge, Oven] [] Pete
You can use set's symmetric_difference method to perform this operation. Let's first define the callable we need:
def symdif(s: pd.Series) -> list:
    h = s.AppliancesH
    o = s.AppliancesO
    return h and o and sorted(set(h).symmetric_difference(o))
and use it via pandas.DataFrame.apply
>>> df['Diff'] = df.apply(axis=1, func=symdif)
>>> df
AppliancesH AppliancesO Name Diff
1 [TV] [TV] Joe []
2 [TV, Fridge] [TV] Mary [Fridge]
3 [Microwave, TV, Fridge] [Computer, TV, Fridge] Jack [Computer, Microwave]
4 [Fridge, Oven] [] Pete []
Here is another way:
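# select the two list columns explicitly so the result keeps df's index
# (applymap is called DataFrame.map in pandas 2.1 and later)
df['Differences'] = (df[['AppliancesH', 'AppliancesO']]
                     .applymap(set)
                     .apply(lambda x: set.symmetric_difference(*x), axis=1)
                     .map(list))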
This can also be done with XOR operator
def find_diff(row):
    if row.isna().any():
        return []
    diff = set(row['AppliancesH']) ^ set(row['AppliancesO'])
    return list(diff)

df.apply(find_diff, axis=1)
You might also need to write a function that converts those strings to a list
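A rough sketch of such a conversion, assuming the cells are plain strings like "TV" or "[TV; Fridge]" as shown in the question. The helper names to_list and diff are hypothetical, and the missing value for Pete is represented here as None:

```python
import pandas as pd

def to_list(cell):
    """Turn 'TV' or '[TV; Fridge]' into a list; missing or empty cells give []."""
    if not isinstance(cell, str):
        return []
    parts = cell.strip().strip("[]").split(";")
    return [p.strip() for p in parts if p.strip()]

df = pd.DataFrame({
    "Name": ["Joe", "Mary", "Jack", "Pete"],
    "AppliancesH": ["TV", "[TV; Fridge]", "[Microwave;TV;Fridge]", "[Fridge;Oven]"],
    "AppliancesO": ["TV", "TV", "[Computer;TV;Fridge]", None],
})

def diff(row):
    h, o = to_list(row["AppliancesH"]), to_list(row["AppliancesO"])
    # No difference is reported when the second column is empty.
    return sorted(set(h) ^ set(o)) if o else []

df["Diff"] = df.apply(diff, axis=1)
print(df)
```

Sorting the result is a choice made here only to get a stable order, since sets are unordered.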

Subtract 1 from column based on another DataFrame. Pandas

Trying to figure out how to subtract a constant from a column based on the presence of a value in another DataFrame. For example, if I have the below DataFrame a that contains a column called person name and count:
a = pd.DataFrame({
"person":["Bob", "Kate", "Joe", "Mark"],
"count":[3, 4, 5, 4],
})
person count
0 Bob 3
1 Kate 4
2 Joe 5
3 Mark 4
And a second DataFrame that contains Person and whatever other arbitrary columns:
b = pd.DataFrame({
"person":["Bob", "Joe"],
"foo":['a', 'b'],
})
person foo
0 Bob a
1 Joe b
My hope is that I can change the first DataFrame to look like the below. Specifically decreasing count by one for any instance of person within DataFrame b. It is safe to assume that DataFrame b will always be a subset of DataFrame a and person will be unique.
person count
0 Bob 2
1 Kate 4
2 Joe 4
3 Mark 4
Many thanks in advance!
a["count"] -= a.person.isin(b.person)
With isin we get a boolean Series that is True for each person who appears in b and False otherwise. Treated as integers (True is 1, False is 0), it can be subtracted from the count column to get
>>> a
person count
0 Bob 2
1 Kate 4
2 Joe 4
3 Mark 4
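Spelled out step by step, the idea looks roughly like this. It is a small sketch that uses the counts from the question's printed table; the intermediate name in_b is only for illustration:

```python
import pandas as pd

a = pd.DataFrame({"person": ["Bob", "Kate", "Joe", "Mark"],
                  "count": [3, 4, 5, 4]})
b = pd.DataFrame({"person": ["Bob", "Joe"], "foo": ["a", "b"]})

# True where the person from a also appears in b.
in_b = a["person"].isin(b["person"])

# Booleans count as 1 and 0, so this subtracts 1 only on matching rows.
a["count"] = a["count"] - in_b.astype(int)
print(a)
```

The explicit astype(int) is not strictly needed, but it makes the conversion visible.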
This answer assumes that df2 (b in the question) can contain a name more than once; df1 is a in the question. If each name appears at most once, it is enough to check whether the person is present in the second DataFrame. In df2, count the occurrences of each name:
df2_counts = df2['person'].value_counts()
In df1, map these counts onto the names (0 where a name is absent) and then subtract them:
df1['subtracts'] = df1['person'].map(df2_counts).fillna(0).astype(int)
df1['count_new'] = df1['count'] - df1['subtracts']
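Create a list of person names from DataFrame b:
listDFB = DFB['person'].tolist()
Loop through dfa and update count for every person found in that list:
for i, rw in dfa.iterrows():
    if rw['person'] in listDFB:
        # write back to dfa; assigning to rw would only change a copy
        dfa.at[i, 'count'] = rw['count'] - 1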

How to append/concat 2 Pandas Dataframe with different columns

How to concat/append based on common column values?
I'm creating some dfs from some files, and I want to compile them.
The columns don't always match, but there will always be some common columns (I only know a few columns guaranteed to match, but there's a lot of columns, and I'd like to retain as much info as possible)
df1:
Name Status
John 1
Jane 2
df2:
Extra1 Extra2 Name Status
a b Bob 2
c d Nancy 2
Desired output:
Either this (the order doesn't matter):
Extra1 Extra2 Name Status
a b Bob 2
c d Nancy 2
NULL NULL John 1
NULL NULL Jane 2
Or this (the order doesn't matter):
Name Status
John 1
Jane 2
Bob 2
Nancy 2
I've tried these, but they don't give the result I want:
df = pd.concat([df2, df], axis=0, ignore_index=True)
df = df.set_index('Name').combine_first(df2.set_index('Name')).reset_index()
Thanks
Not sure why the tables aren't being formatted, it shows up fine in the preview
import pandas as pd
df1 = pd.DataFrame({'Name':['John', 'Jane'],'Status':[1,2]})
df2 = pd.DataFrame({'Extra1':['a','b'],'Extra2':['c','d'],'Name':['bob', 'nancy'],'Status':[2,2]})
df = pd.concat([df1,df2], axis=0, ignore_index=True)
Gives me
Name Status Extra1 Extra2
John 1 NaN NaN
Jane 2 NaN NaN
bob 2 a c
nancy 2 b d
Which looks to me like your desired output.
And your tables aren't formatted correctly because you need empty newlines between text and tables.
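If the second desired output (only the shared columns) is what you are after, pd.concat has a join parameter that should cover it. A minimal sketch using the data from the question; the names all_cols and common are just for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["John", "Jane"], "Status": [1, 2]})
df2 = pd.DataFrame({"Extra1": ["a", "c"], "Extra2": ["b", "d"],
                    "Name": ["Bob", "Nancy"], "Status": [2, 2]})

# join="outer" (the default) keeps every column and fills the gaps with NaN.
all_cols = pd.concat([df1, df2], ignore_index=True)

# join="inner" keeps only the columns present in every frame.
common = pd.concat([df1, df2], join="inner", ignore_index=True)
print(common)
```

With the default outer join nothing is lost, so that is likely the better fit if you want to retain as much information as possible.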

How to sort 2 columns in a Dataframe, one sorted in descending order and the other in alphabetical order corresponding to the 1st column

The DataFrame looks something like
Names Rank
Michael 8
David 6
Christopher 6
Brian 5
Amanda 3
Heather 8
Sarah 2
Rebecca 4
Expected output
Names Rank
Heather 8
Michael 8
Christopher 6
David 6
Brian 5
Rebecca 4
Amanda 3
Sarah 2
Here, I need to first sort the rank column in descending order and then the Name column in alphabetical order.
My code :
df = df.sort_values(['Name'],ascending = True)
df = df.'Name'.sort_values(['Rank'],ascending = False)
df
This code gives me only the sorted Rank; the Name column doesn't get sorted.
df = df.sort_values(['Rank', 'Name'],ascending = [False, True])
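A self-contained version with the sample data, as a quick check. It assumes the column is called Name, as in the code above (the printed table in the question labels it Names):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Michael", "David", "Christopher", "Brian",
             "Amanda", "Heather", "Sarah", "Rebecca"],
    "Rank": [8, 6, 6, 5, 3, 8, 2, 4],
})

# Rank descending first; ties are then broken by Name ascending.
df = (df.sort_values(["Rank", "Name"], ascending=[False, True])
        .reset_index(drop=True))
print(df)
```

Each entry in ascending applies to the column at the same position in the list of sort keys.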

Compare two data frames for different values in a column

I have two DataFrames. How can I compare them by operator name and, where the names match, add the count and time values from the second DataFrame to the first?
In [2]: df1 In [3]: df2
Out[2]: Out[3]:
Name count time Name count time
0 Bob 123 4:12:10 0 Rick 9 0:13:00
1 Alice 99 1:01:12 1 Jone 7 0:24:21
2 Sergei 78 0:18:01 2 Bob 10 0:15:13
85 rows x 3 columns 105 rows x 3 columns
I want to get:
In [5]: df1
Out[5]:
Name count time
0 Bob 133 4:27:23
1 Alice 99 1:01:12
2 Sergei 78 0:18:01
85 rows x 3 columns
Set Name as the index on both DataFrames, add them together, and then use update to write the sums back into df1.
df1 = df1.set_index('Name')
df1.update(df1 + df2.set_index('Name'))
df1 = df1.reset_index()
Out[759]:
Name count time
0 Bob 133.0 04:27:23
1 Alice 99.0 01:01:12
2 Sergei 78.0 00:18:01
Note: I assume time columns in both df1 and df2 are already in correct date/time format. If they are in string format, you need to convert them before running above commands as follows:
df1.time = pd.to_timedelta(df1.time)
df2.time = pd.to_timedelta(df2.time)
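If you would rather keep count as an integer, one possible variation is to line df2 up with df1 first and add column by column. This is only a sketch using the three visible rows from the question; it assumes each Name appears at most once in df2, and the name matched is made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Bob", "Alice", "Sergei"],
                    "count": [123, 99, 78],
                    "time": pd.to_timedelta(["4:12:10", "1:01:12", "0:18:01"])})
df2 = pd.DataFrame({"Name": ["Rick", "Jone", "Bob"],
                    "count": [9, 7, 10],
                    "time": pd.to_timedelta(["0:13:00", "0:24:21", "0:15:13"])})

# Line df2 up with df1's names; names missing from df2 become NaN / NaT.
matched = df2.set_index("Name").reindex(df1["Name"].to_numpy())

# Missing matches contribute 0 (or a zero Timedelta) to the sums.
df1["count"] = df1["count"] + matched["count"].fillna(0).astype(int).to_numpy()
df1["time"] = df1["time"] + matched["time"].fillna(pd.Timedelta(0)).to_numpy()
print(df1)
```

Using to_numpy() adds the values by position, which is safe here because matched was reindexed to follow df1's row order.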