Extension of Comparing columns of dataframes and returning the difference - pandas

This is an extension of my previous question at: Comparing columns of dataframes and returning the difference.
After comparing the columns of all the dataframes in my collection of 37 dataframes, I found that some of the dataframes have the same columns while others have different ones. So there is now a need to compare these different dataframes and return the difference. This step should continue until all the dataframes have been sorted into groups, i.e., dataframes that share the same columns go into one group and dataframes with different columns go into other groups.
For example:
df = [None] * 6
df[0] = pd.DataFrame({'a':[1,2,3],'b':[3,4,5], 'c':[7,8,3], 'd':[1,5,3]})
df[1] = pd.DataFrame({'a':[1,2,3],'b':[3,4,5], 'c':[7,8,3], 'd':[1,5,3]})
df[2] = pd.DataFrame({'a':[1,2,3],'b':[3,4,5], 'x':[7,8,3], 'y':[1,5,3]})
df[3] = pd.DataFrame({'a':[1,2,3],'b':[3,4,5], 'c':[7,8,3], 'd':[1,5,3]})
df[4] = pd.DataFrame({'a':[1,2,3],'b':[3,4,5], 'x':[7,8,3], 'z':[1,5,3]})
df[5] = pd.DataFrame({'a':[1,2,3],'b':[3,4,5], 'x':[7,8,3], 'y':[1,5,3]})
# code to group the dataframes into similar and different cols groups
nsame = []
same = []
for i in range(0, len(df)):
    for j in range(i+1, len(df)):
        if not (df[i].columns.equals(df[j].columns)):
            nsame.append(j)
        else:
            same.append(i)
When I print the same-columns group (same) produced by the code above, the output is:
print(same)
[0, 0, 1, 2]
Desired output:
print(same)
[0, 1, 3]
Perhaps I need a recursive function to group all the dataframes with the same columns into one group and those with different columns into other groups. However, the tricky part is that there can be more than two groups. For example, in the above code, there are 3 groups:
Group1: df[0], df[1], df[3]
Group2: df[2], df[5]
Group3: df[4]
Can someone help here?

Here is one way
s = pd.Series([','.join(x) for x in df])  # iterating a DataFrame yields its column labels
s.groupby(s).groups  # the output here already puts the dfs into groups
Out[695]:
{'a,b,c,d': Int64Index([0, 1, 3], dtype='int64'),
'a,b,x,y': Int64Index([2, 5], dtype='int64'),
'a,b,x,z': Int64Index([4], dtype='int64')}
[y.index.tolist() for x , y in s.groupby(s)]
Out[699]: [[0, 1, 3], [2, 5], [4]]
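If some column labels are not strings, the ','.join trick above will raise a TypeError; a plain-Python variant (my own sketch, not part of the answer) keyed on the tuple of columns gives the same groups:
from collections import defaultdict

groups = defaultdict(list)
for i, d in enumerate(df):
    # dataframes with an identical column tuple end up under the same key
    groups[tuple(d.columns)].append(i)

print(list(groups.values()))  # [[0, 1, 3], [2, 5], [4]]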

Isn't it easier to pass all column names into a separate pandas dataframe, i.e.:
a - b - c - d
a - b - c - d
a - b - x - y
...
and just do a simple groupby over the columns?
The count() series over the groupby result will be the desired output.
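A minimal sketch of that idea, assuming the df list from the question (the names cols and groups are mine):
cols = pd.DataFrame([list(d.columns) for d in df])  # one row of column names per dataframe
groups = cols.groupby(list(cols.columns))
print(groups.groups)  # maps each column signature to the matching dataframe indices
print(groups.size())  # how many dataframes share each signature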

Related

Pandas - find rows sharing two out of three common values, order-independent, and collect the value pairs

Given a dataframe, I am looking for rows where two out of three values are in common, regardless of the columns (and hence the order) in which they appear. I would like to then collect those common pairs.
Please note:
a pair of values can appear in at most two rows
a value can appear only once in a row
I would like to know what the most efficient/elegant way is in numpy or pandas to solve this problem.
For example, taking as input the dataframe
d = {'col1': [1, 2, 5, 9], 'col2': [2, 7, 1, 2], 'col3': [3, 3, 2, 7]}
df = pd.DataFrame(data=d)
   col1  col2  col3
0     1     2     3
1     2     7     3
2     5     1     2
3     9     2     7
I expect as a result an array, a list, or something like
1 2
2 3
2 7
as the values (1,2), (2,3) and (2,7) are present in two rows (first and third, first and second, and second and fourth, respectively).
I cannot find a concise solution.
At the moment I have sketched a numpy solution such as
import numpy as np

def func(x):
    rows, columns = x.shape[0], x.shape[1]
    res = []
    for i in range(0, rows):
        for j in range(i+1, rows):
            aux = np.intersect1d(x[i,:], x[j,:])
            if aux.size > 1:
                res.append(aux)
    return res
which outputs
func(df.values)
Out: [array([2, 3]), array([1, 2]), array([2, 7])]
It looks rather cumbersome; how could I get it done with one of those cool numpy/pandas one-liners?
I would suggest using Python's built-in set operations to do most of the heavy lifting, and just applying them with pandas:
import itertools
import pandas as pd
d = {'col1': [1, 2,5,9], 'col2': [2, 7,1,2],'col3': [3, 3,2,7]}
df = pd.DataFrame(data=d)
pairs = df.apply(set, axis=1).apply(lambda x: set(itertools.combinations(x, 2))).explode()
out = set(pairs[pairs.duplicated()])
Output:
{(2, 3), (1, 2), (2, 7)}
Optionally to get it in list[np.ndarray] format:
out = list(map(np.array, out))
Similar approach to that of @Chrysophylaxs, but in pure Python:
from itertools import combinations
from collections import Counter
c = Counter(s for x in df.to_numpy().tolist() for s in set(combinations(set(x), r=2)))
out = [k for k,v in c.items() if v>1]
# [(2, 3), (1, 2), (2, 7)]
df = df.assign(col4=df.index)

def function1(ss: pd.Series):
    ss1 = ss.value_counts().loc[lambda ss: ss >= 2]
    return ss1.index.tolist() if ss1.size >= 2 else None

df.merge(df, how='cross', suffixes=('', '_2')).query("col4!=col4_2").filter(regex=r'col[^4]', axis=1)\
    .apply(function1, axis=1).dropna().drop_duplicates()
out
1 [2, 3]
2 [1, 2]
7 [2, 7]

Why does pandas.DataFrame.apply produce a Series instead of a DataFrame

I do not really understand why the following code returns a Series and not a DataFrame.
import pandas as pd
df = pd.DataFrame([[4,9]]*3, columns = ["A", "B"])
def plus_2(x):
    y = []
    for i in range(0, len(x)):
        y.append(x[i] + 2)
    return y
df_row = df.apply(plus_2, axis = 1) # Applied to each row
df_row
While if I change to axis=0 it produces a DataFrame as expected:
import pandas as pd
df = pd.DataFrame([[4,9]]*3, columns = ["A", "B"])
def plus_2(x):
    y = []
    for i in range(0, len(x)):
        y.append(x[i] + 2)
    return y
df_row = df.apply(plus_2, axis = 0) # Applied to each column
df_row
Here is the output:
   A   B
0  6  11
1  6  11
2  6  11
In the first example, where you use axis=1, the function is applied at row level.
That means that for each row plus_2 returns y, which is a list of two elements; the list as a whole is a single object, so the overall result is a pd.Series.
Based on your example, three lists (of two elements each) are returned, one list per row.
You could expand this result into two columns (each element of the list becomes a new column) by adding result_type="expand" to apply:
df_row = df.apply(lambda x: plus_2(x), axis=1, result_type="expand")
# output
   0   1
0  6  11
1  6  11
2  6  11
In the second approach you have axis=0, so the function is applied at column level.
That means that for each column plus_2 returns y, so plus_2 is applied twice, separately for column A and column B. This is why it returns a DataFrame: the input is a DataFrame with columns A and B, plus_2 is applied to each column, and the results become the A and B columns of the output.
Based on your example, two lists (of three elements each) are returned, one list per column.
So the main difference between axis=1 and axis=0 is:
if you apply at row level, apply will return:
[6, 11]
[6, 11]
[6, 11]
if you apply at column level, apply will return:
[6, 6, 6]
[11, 11, 11]
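As a side note (my own addition, not part of the answer above): if the row-wise function returns a pd.Series instead of a plain list, apply with axis=1 also assembles a DataFrame, without needing result_type="expand":
def plus_2_series(x):
    # returning a Series makes apply(axis=1) build a DataFrame
    return pd.Series([v + 2 for v in x], index=x.index)

df.apply(plus_2_series, axis=1)
#    A   B
# 0  6  11
# 1  6  11
# 2  6  11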

Assert that an integer is in a list, on a pandas Series

I have a DataFrame with two pandas Series as follows:
   value accepted_values
0      1    [1, 2, 3, 4]
1      2    [5, 6, 7, 8]
I would like to efficiently check if the value is in accepted_values using pandas methods.
I already know I can do something like the following, but I'm interested in a faster approach if there is one (it took around 27 seconds on a 1-million-row DataFrame):
import pandas as pd
df = pd.DataFrame({"value":[1, 2], "accepted_values": [[1,2,3,4], [5, 6, 7, 8]]})
def check_first_in_second(values: pd.Series):
    return values[0] in values[1]
are_in_accepted_values = df[["value", "accepted_values"]].apply(
check_first_in_second, axis=1
)
if not are_in_accepted_values.all():
    raise AssertionError("Not all value in accepted_values")
I think if you create a DataFrame from the list column you can compare with DataFrame.eq and test whether at least one value per row matches with DataFrame.any:
df1 = pd.DataFrame(df["accepted_values"].tolist(), index=df.index)
are_in_accepted_values = df1.eq(df["value"], axis=0).any(axis=1).all()
Another idea:
are_in_accepted_values = all(v in a for v, a in df[["value", "accepted_values"]].to_numpy())
I found a little optimisation to your second idea. Using a bit more numpy than pandas makes it faster (more than 3x, tested with time.perf_counter()).
values = df["value"].values
accepted_values = df["accepted_values"].values
are_in_accepted_values = all(s in e for s, e in np.column_stack([values, accepted_values]))
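If you want to reproduce the comparison, a rough timing harness could look like this (a sketch of mine, not from the answer; the repetition factor is arbitrary and just aims at roughly 1 million rows):
import time
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [1, 2], "accepted_values": [[1, 2, 3, 4], [5, 6, 7, 8]]})
big = pd.concat([df] * 500_000, ignore_index=True)  # ~1M rows

start = time.perf_counter()
values = big["value"].values
accepted_values = big["accepted_values"].values
result = all(s in e for s, e in np.column_stack([values, accepted_values]))
print(result, f"{time.perf_counter() - start:.2f}s")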

Select rows where number can be found in list

Given the following data, I hope to select the rows where num appears in list. In this case, rows 1 and 2 will be selected; row 3 is not selected since 3 can't be found in [4, 5].
Following is the dataframe; how should we write the filter query?
cat1 = pd.DataFrame({"num": [1, 2, 3],
                     "list": [[1, 2, 3], [3, 2], [4, 5]]})
One possible solution with list comprehension, zip and in passed to boolean indexing:
df = cat1[[a in b for a, b in zip(cat1.num, cat1.list)]]
Or solution with DataFrame.apply with axis=1 for processing per rows:
df = cat1[cat1.apply(lambda x: x.num in x.list, axis=1)]
Or create DataFrame and test membership:
df = cat1[pd.DataFrame(cat1.list.tolist()).isin(cat1.num).any(axis=1)]
print (df)
   num       list
0    1  [1, 2, 3]
1    2     [3, 2]
A different solution, if you are using pandas 0.25, is using explode():
cat1[cat1['num'].isin(cat1.explode('list').query("num==list").loc[:,'num'])]
   num       list
0    1  [1, 2, 3]
1    2     [3, 2]

How to remove rows from a dataframe based on their column values existence in another df?

Given two dataframes A and B, which both have columns 'x' and 'y', how can I efficiently remove all rows from A whose (x, y) pairs appear in B?
I thought about implementing it with a row iterator on A, checking for each pair whether it exists in B, but I am guessing this is the least efficient way...
I tried using the .isin function as suggested in Filter dataframe rows if value in column is in a set list of values but couldn't make use of it for multiple columns.
Example dataframes:
A = pd.DataFrame([[1, 2], [1, 4], [3, 4], [2, 4]], columns=['x', 'y'])
B = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])
C should contain [1,4] and [2,4] after the operation.
In pandas master (or in the upcoming 0.13) isin will also accept DataFrames, but the problem is that it just looks at the values in each column, and not at an exact row combination of the columns.
Taken from @AndyHayden's comment here (https://github.com/pydata/pandas/issues/4421#issuecomment-23052472), a similar approach with set:
In [3]: mask = pd.Series(map(set(B.itertuples(index=False)).__contains__, A.itertuples(index=False)))
In [4]: A[~mask]
Out[4]:
   x  y
1  1  4
3  2  4
Or a more readable version:
set_B = set(B.itertuples(index=False))
mask = [x not in set_B for x in A.itertuples(index=False)]
The possible advantage of this compared to @Acorbe's answer is that it preserves the index of A and does not remove duplicate rows in A (but that depends on what you want, of course).
As I said, in 0.13 isin will also accept DataFrames. However, I don't think this will solve the issue, because the index also has to be the same:
In [27]: A.isin(B)
Out[27]:
x y
0 True True
1 False True
2 False False
3 False False
You can work around this by converting B to a dict, but then it does not look at the combination of both columns, just at each column separately:
In [28]: A.isin(B.to_dict(outtype='list'))
Out[28]:
x y
0 True True
1 True True
2 True True
3 False True
For those looking for a single-column solution:
new_df = df1[~df1["column_name"].isin(df2["column_name"])]
The ~ is a logical operator for NOT.
So this will create a new dataframe containing the rows of df1 whose df1["column_name"] values are not found in df2["column_name"].
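For the multi-column case in the question, one common pattern (a sketch on my part, not claimed by this answer) is a left merge with indicator=True, keeping only the rows that appear solely in A:
C = (A.merge(B, on=["x", "y"], how="left", indicator=True)
       .query('_merge == "left_only"')
       .drop(columns="_merge"))
print(C)
#    x  y
# 1  1  4
# 3  2  4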
One option would be to generate two sets, say A_set, B_set, whose elements are the rows of the DataFrames. Hence, the fast set difference operation A_set - B_set can be used.
A_set = set(map(tuple, A.values))  # we need hashable objects before generating a set
B_set = set(map(tuple, B.values))
C_set = A_set - B_set
C_set
{(1, 4), (2, 4)}
C = pd.DataFrame([c for c in C_set], columns=['x','y'])
x y
0 2 4
1 1 4
This procedure involves some preliminary conversion operations, though.