Aggregate/Remove duplicate rows in DataFrame based on swapped index levels - pandas

Sample input
import pandas as pd
df = pd.DataFrame([
['A', 'B', 1, 5],
['B', 'C', 2, 2],
['B', 'A', 1, 1],
['C', 'B', 1, 3]],
columns=['from', 'to', 'type', 'value'])
df = df.set_index(['from', 'to', 'type'])
Which looks like this:
value
from to type
A B 1 5
B C 2 2
A 1 1
C B 1 3
Goal
I now want to remove "duplicate" rows from this in the following sense: for each row with an arbitrary index (from, to, type), if there exists a row (to, from, type), the value of the second row should be added to the first row and the second row be dropped. In the example above, the row (B, A, 1) with value 1 should be added to the first row and dropped, leading to the following desired result.
Sample result
value
from to type
A B 1 6
B C 2 2
C B 1 3
This is my best try so far. It feels unnecessarily verbose and clunky:
# aggregate val of rows with (from,to,type) == (to,from,type)
df2 = df.reset_index()
df3 = df2.rename(columns={'from':'to', 'to':'from'})
df_both = df.join(df3.set_index(
['from', 'to', 'type']),
rsuffix='_b').sum(axis=1)
# then remove the second, i.e. the (to,from,t) row
rows_to_keep = []
rows_to_remove = []
for a,b,t in df_both.index:
if (b,a,t) in df_both.index and not (b,a,t) in rows_to_keep:
rows_to_keep.append((a,b,t))
rows_to_remove.append((b,a,t))
df_final = df_both.drop(rows_to_remove)
df_final
Especially the second "de-duplication" step feels very unpythonic. (How) can I improve these steps?

Not sure how much better this is, but it's certainly different
import pandas as pd
from collections import Counter
df = pd.DataFrame([
['A', 'B', 1, 5],
['B', 'C', 2, 2],
['B', 'A', 1, 1],
['C', 'B', 1, 3]],
columns=['from', 'to', 'type', 'value'])
df = df.set_index(['from', 'to', 'type'])
ls = df.to_records()
ls = list(ls)
ls2=[]
for l in ls:
i=0
while i <= l[3]:
ls2.append(list(l)[:3])
i+=1
counted = Counter(tuple(sorted(entry)) for entry in ls2)

Related

Pandas Dataframe GroupBy Agg - LAMBDA - single values go to preexisting or new lists and preexisting lists fusion

I have this DataFrame to groupby key:
df = pd.DataFrame({
'key': ['1', '1', '1', '2', '2', '3', '3', '4', '4', '5'],
'data1': [['A', 'B', 'C'], 'D', 'P', 'E', ['F', 'G', 'H'], ['I', 'J'], ['K', 'L'], 'M', 'N', 'O']
'data2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})
df
I want to make the groupby key and sum data2, it's ok for this part.
But concerning data1, I want to :
If a list doesn't exist yet:
Single values don't change when key was not duplicated
Single values assigned to a key are combined into a new list
If a list already exist:
Other single values are append to it
Other lists values are append to it
The resulting DataFrame should then be :
dfgood = pd.DataFrame({
'key': ['1', '2', '3', '4', '5'],
'data1': [['A', 'B', 'C', 'D', 'P'], ['F', 'G', 'H', 'E'], ['I', 'J', 'K', 'L'], ['M', 'N'], 'O']
'data2': [6, 9, 13, 17, 10]
})
dfgood
In fact, I don't really care about the order of data1 values into the lists, it could also be any structure that keep them together, even a string with separators or a set, if it's easier to make it go the way you think best to do this.
I thought about two solutions :
Going that way :
dfgood = df.groupby('key', as_index=False).agg({
'data1' : lambda x: x.iloc[0].append(x.iloc[1]) if type(x.iloc[0])==list else list(x),
'data2' : sum,
})
dfgood
It doesn't work because of index out of range in x.iloc[1].
I also tried, because data1 was organized like this in another groupby from the question on this link:
dfgood = df.groupby('key', as_index=False).agg({
'data1' : lambda g: g.iloc[0] if len(g) == 1 else list(g)),
'data2' : sum,
})
dfgood
But it's creating new lists from preexisting lists or values and not appending data to already existing lists.
Another way to do it, but I think it's more complicated and there should be a better or faster solution :
Turning data1 lists and single values into individual series with apply,
use wide_to_long to keep single values for each key,
Then groupby applying :
dfgood = df.groupby('key', as_index=False).agg({
'data1' : lambda g: g.iloc[0] if len(g) == 1 else list(g)),
'data2' : sum,
})
dfgood
I think my problem is that I don't know how to use lambdas correctly and I try stupid things like x.iloc[1] in the previous example. I've looked at a lot of tutorial about lambdas, but it's still fuzzy in my mind.
There is problem combinations lists with scalars, possible solution is create first lists form scalars and then flatten them in groupby.agg:
dfgood = (df.assign(data1 = df['data1'].apply(lambda y: y if isinstance(y, list) else [y]))
.groupby('key', as_index=False).agg({
'data1' : lambda x: [z for y in x for z in y],
'data2' : sum,
})
)
print (dfgood)
key data1 data2
0 1 [A, B, C, D, P] 6
1 2 [E, F, G, H] 9
2 3 [I, J, K, L] 13
3 4 [M, N] 17
4 5 [O] 10
Another idea is use flatten function for flatten only lists, not strings:
#https://stackoverflow.com/a/5286571/2901002
def flatten(foo):
for x in foo:
if hasattr(x, '__iter__') and not isinstance(x, str):
for y in flatten(x):
yield y
else:
yield x
dfgood = (df.groupby('key', as_index=False).agg({
'data1' : lambda x: list(flatten(x)),
'data2' : sum}))
You could explode to get individual rows, then aggregate again with groupby+agg after taking care of masking the duplicated values in data2 (to avoid summing duplicates):
(df.explode('data1')
.assign(data2=lambda d: d['data2'].mask(d.duplicated(['key', 'data2']), 0))
.groupby('key')
.agg({'data1': list, 'data2': 'sum'})
)
output:
data1 data2
key
1 [A, B, C, D, P] 6
2 [E, F, G, H] 9
3 [I, J, K, L] 13
4 [M, N] 17
5 [O] 10

Dictionary Unique Keys Rename and Replace

I have a dictionary format structure like this
df = pd.DataFrame({'ID' : ['A', 'B', 'C'],
'CODES' : [{"1407273790":5,"1801032636":20,"1174813554":1,"1215470448":2,"1053754655":4,"1891751228":1},
{"1497066526":19,"1801032636":16,"1215470448":11,"1891751228":18},
{"1215470448":8,"1407273790":4},]})
Now I want to create a unique list of keys and create names for them like this -
np_code np_rename
1407273790 np_1
1801032636 np_2
1174813554 np_3
1215470448 np_4
1053754655 np_5
1891751228 np_6
1497066526 np_7
And finally replace the new names in main dataframe df -
df = pd.DataFrame({'ID' : ['A', 'B', 'C'],
'CODES' : [{"np_1":5,"np_2":20,"np_3":1,"np_4":2,"np_5":4,"np_6":1},
{"np_7":19,"1801032636":16,"np_4":11,"np_6":18},
{"np_4":8,"np_1":4},]})
You can use apply here:
Assuming the unique list dataframe is unique_list_df:
u = df['CODES'].map(lambda x: [*x.keys()]).explode().unique()
d = dict(zip(u,'np_'+pd.Index((pd.factorize(u)[0]+1).astype(str))))
f = lambda x: {d.get(k,k): v for k,v in x.items()}
df['CODES'] = df['CODES'].apply(f)
print(df)
ID CODES
0 A {'np_1': 5, 'np_2': 20, 'np_3': 1, 'np_4': 2, ...
1 B {'np_7': 19, 'np_2': 16, 'np_4': 11, 'np_6': 18}
2 C {'np_4': 8, 'np_1': 4}

How to show rows with data which are not equal?

I have two tables
import pandas as pd
import numpy as np
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
df1 = pd.DataFrame(np.array([[1, 2, 4], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
print(df1.equals(df2))
I want to compare them. I want the same result if I would use function df.compare(df1) or at least something close to it. Can't use above fnction as my complier states that 'DataFrame' object has no attribute 'compare'
First approach:
Let's compare value by value:
In [1183]: eq_df = df1.eq(df2)
In [1196]: eq_df
Out[1200]:
a b c
0 True True False
1 True True True
2 True True True
Then let's reduce it down to see which rows are equal for all columns
from functools import reduce
In [1285]: eq_ser = reduce(np.logical_and, (eq_df[c] for c in eq_df.columns))
In [1288]: eq_ser
Out[1293]:
0 False
1 True
2 True
dtype: bool
Now we can print out the rows which are not equal
In [1310]: df1[~eq_ser]
Out[1315]:
a b c
0 1 2 4
In [1316]: df2[~eq_ser]
Out[1316]:
a b c
0 1 2 3
Second approach:
def diff_dataframes(
df1, df2, compare_cols=None
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
"""
Given two dataframes and column(s) to compare, return three dataframes with rows:
- common between the two dataframes
- found only in the left dataframe
- found only in the right dataframe
"""
df1 = df1.fillna(pd.NA)
df = df1.merge(df2.fillna(pd.NA), how="outer", on=compare_cols, indicator=True)
df_both = df.loc[df["_merge"] == "both"].drop(columns="_merge")
df_left = df.loc[df["_merge"] == "left_only"].drop(columns="_merge")
df_right = df.loc[df["_merge"] == "right_only"].drop(columns="_merge")
tup = namedtuple("df_diff", ["common", "left", "right"])
return tup(df_both, df_left, df_right)
Usage:
In [1366]: b, l, r = diff_dataframes(df1, df2)
In [1371]: l
Out[1371]:
a b c
0 1 2 4
In [1372]: r
Out[1372]:
a b c
3 1 2 3
Third approach:
In [1440]: eq_ser = df1.eq(df2).sum(axis=1).eq(len(df1.columns))

update values in dataframe

I have a dataframe in which the second column is an array. I have an another dataframe which has 2 columns, from which the value has to be updated in the first dataframe.
I already tried using update, explode, map, assign method.
df = pd.DataFrame({'Account': ['A1','A2','A3']})
groups = np.array([['g1','g2'],['g3','g4'],['g1','g2','g3']])
df["Group"] = groups.tolist()
key_values = pd.DataFrame({'ID': ['1','2','3','4','5'],'Group': ['g1','g2','g3','g4','g5']})
keys = key_values.set_index('Key')['ID']
ag = Accounts_Group.explode('Group')
Setup
m = key_values.set_index('Group')['ID']
Option 1
explode + map
f = df.explode('Group')
res = f['Group'].map(m).groupby(level=0).agg(list)
0 [1, 2]
1 [3, 4]
2 [1, 2, 3]
Name: Group, dtype: object
Option 2
List comprehension + map
res = [[*map(m.get, el)] for el in df['Group']]
[['1', '2'], ['3', '4'], ['1', '2', '3']]
To assign it back:
df.assign(Group=res)
Account Group
0 A1 [1, 2]
1 A2 [3, 4]
2 A3 [1, 2, 3]
Firstly convert them to strings and replace them. Then you can convert them to list again from string using ast
import ast
df['keys']=df.astype(str).replace(to_replace=list(key_values['Group']),value=list(key_values['ID']),regex=True)['Group']
df['keys']=df['keys'].apply(lambda x: ast.literal_eval(x))
print(df)
Account Group keys
0 A1 [g1, g2] [1, 2]
1 A2 [g3, g4] [3, 4]
2 A3 [g1, g2, g3] [1, 2, 3]

Processing each N rows in for loop, print list of processed items

Following reproducible script is intended to process each two rows in a dataframe with length 5. For each two processed rows, I like to print a list of items that have been processed.
import pandas as pd
import itertools
my_dict = {
'name' : ['a', 'b', 'c', 'd', 'e'],
'age' : [10, 20, 30, 40, 50]
}
df = pd.DataFrame(my_dict)
for index, row in itertools.islice(df.iterrows(), 2):
rowlist = (row.name)
print('Processed two rows {}'.format(rowlist))
Output:
a
b
I'm looking for a way to get following desired output:
[a,b]
[c,d]
[e]
Tried:
print(df.groupby(df.index//2)['name'].agg(list))
Out:
0 [a, b]
1 [c, d]
2 [e]
Name: name, dtype: object
0 [a, b]
1 [c, d]
2 [e]
Name: name, dtype: object
Thanks for your help!