The union of the intersection of rows of n dataframes - pandas

Say I have n dataframes, in this example n = 3.
**df1**
A B C
0 True 3 21.0
1 True 1 23.0
2 False 2 25.0
3 False 4 25.5
4 False 1 25.0
5 True 0 26.0
**df2**
A B C
0 True 3 21.0
1 True 1 23.0
2 False 2 25.0
3 False 4 25.5
4 True 2 19.0
**df3**
A B C
0 True 3 21.0
1 True 2 23.0
2 False 2 25.0
3 False 1 25.0
4 False 4 25.5
5 True 0 27.50
**dfn** ...
I want one dataframe that includes all the rows whose Column C value appears in every dataframe df1 through dfn. So this is a kind of union of the intersection of the dataframes on a column, in this case Column C. For the dataframes above, the rows with 19.0, 26.0 and 27.50 don't make it into the final dataframe, which is:
**Expected df**
A B C
0 True 3 21.0
1 True 1 23.0
2 False 2 25.0
3 False 4 25.5
4 False 1 25.0
0 True 3 21.0
1 True 1 23.0
2 False 2 25.0
3 False 4 25.5
0 True 3 21.0
1 True 2 23.0
2 False 2 25.0
3 False 1 25.0
4 False 4 25.5
So a row makes it into the final dataframe if and only if its Column C value is seen in all dataframes.
Reproducible code:
import pandas as pd
df1 = pd.DataFrame({'A': [True, True, False, False, False, True], 'B': [3, 1, 2, 4, 1, 0],
                    'C': [21.0, 23.0, 25.0, 25.5, 25.0, 26.0]})
df2 = pd.DataFrame({'A': [True, True, False, False, False], 'B': [3, 1, 2, 4, 2],
                    'C': [21.0, 23.0, 25.0, 25.5, 19.0]})
df3 = pd.DataFrame({'A': [True, True, False, False, False, True], 'B': [3, 2, 2, 1, 4, 0],
                    'C': [21.0, 23.0, 25.0, 25.0, 25.5, 27.5]})
dfn = ...

The straightforward approach seems to be to compute the common C values (the n-way intersection, as a set), then filter with .isin:
common_C_values = set.intersection(set(df1['C']), set(df2['C']), set(df3['C']))
df_all = pd.concat([df1,df2,df3])
df_all = df_all[ df_all['C'].isin(common_C_values) ]

You can use pd.concat:
# put the C columns of all DataFrames side by side
df_C = pd.concat([df1, df2, df3], axis=1)['C']
# concatenate all DataFrames vertically
df_all = pd.concat([df1, df2, df3])
# keep only the rows whose C value appears in every DataFrame's C column
df_all.loc[df_all.apply(lambda x: df_C.eq(x.C).sum().all(), axis=1)]
Out[105]:
A B C
0 True 3 21.0
1 True 1 23.0
2 False 2 25.0
3 False 4 25.5
4 False 1 25.0
0 True 3 21.0
1 True 1 23.0
2 False 2 25.0
3 False 4 25.5
0 True 3 21.0
1 True 2 23.0
2 False 2 25.0
3 False 1 25.0
4 False 4 25.5

For simplicity, store your dataframes in a list. We'll make use of set operations to speed this up as much as possible.
df_list = [df1, df2, df3, ...]
common_idx = set.intersection(*[set(df['C']) for df in df_list])
print(common_idx)
{21.0, 23.0, 25.0, 25.5}
Thanks to @smci for the improvement! set.intersection finds the C values common to all the dataframes. Finally, call pd.concat to join the dataframes vertically, then use query to filter on the common values obtained in the previous step.
pd.concat(df_list, ignore_index=True).query('C in @common_idx')
A B C
0 True 3 21.0
1 True 1 23.0
2 False 2 25.0
3 False 4 25.5
4 False 1 25.0
5 True 3 21.0
6 True 1 23.0
7 False 2 25.0
8 False 4 25.5
9 True 3 21.0
10 True 2 23.0
11 False 2 25.0
12 False 1 25.0
13 False 4 25.5
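
The same idea generalizes to any number of dataframes. A minimal sketch of a reusable helper, assuming the frames are collected in a list (the function name here is just illustrative, not an established API):
from functools import reduce
import pandas as pd

def concat_rows_with_common_values(df_list, col='C'):
    # intersect the sets of values in `col` across every dataframe
    common = reduce(set.intersection, (set(df[col]) for df in df_list))
    # stack the dataframes and keep only rows whose `col` value is common to all
    stacked = pd.concat(df_list, ignore_index=True)
    return stacked[stacked[col].isin(common)]

# usage
result = concat_rows_with_common_values([df1, df2, df3])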

Related

pandas: change all rows with Type X if one Type X row has Result = 1

Here is a simple pandas df:
>>> df
Type Var1 Result
0 A 1 NaN
1 A 2 NaN
2 A 3 NaN
3 B 4 NaN
4 B 5 NaN
5 B 6 NaN
6 C 1 NaN
7 C 2 NaN
8 C 3 NaN
9 D 4 NaN
10 D 5 NaN
11 D 6 NaN
The object of the exercise is: if column Var1 = 3, set Result = 1 for all that Type.
This finds the rows with 3 in Var1 and sets Result to 1,
df['Result'] = df['Var1'].apply(lambda x: 1 if x == 3 else 0)
but I can't figure out how to then catch all the same Type and make them 1. In this case it should be all the As and all the Cs. Doesn't have to be a one-liner.
Any tips please?
Create a boolean mask and, to map True/False to 1/0, convert the values to integers:
df['Result'] = df['Type'].isin(df.loc[df['Var1'].eq(3), 'Type']).astype(int)
#alternative
df['Result'] = np.where(df['Type'].isin(df.loc[df['Var1'].eq(3), 'Type']), 1, 0)
print (df)
Type Var1 Result
0 A 1 1
1 A 2 1
2 A 3 1
3 B 4 0
4 B 5 0
5 B 6 0
6 C 1 1
7 C 2 1
8 C 3 1
9 D 4 0
10 D 5 0
11 D 6 0
Details:
Get all Type values that match the condition:
print (df.loc[df['Var1'].eq(3), 'Type'])
2 A
8 C
Name: Type, dtype: object
Test the original Type column against the filtered types:
print (df['Type'].isin(df.loc[df['Var1'].eq(3), 'Type']))
0 True
1 True
2 True
3 False
4 False
5 False
6 True
7 True
8 True
9 False
10 False
11 False
Name: Type, dtype: bool
Or use GroupBy.transform with any to test whether at least one value matches; this solution is slower for larger dataframes:
df['Result'] = df['Var1'].eq(3).groupby(df['Type']).transform('any').astype(int)
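A hedged third variant along the same lines: compute a per-Type flag once with groupby, then map it back onto the Type column (the intermediate name flag is just illustrative):
# 1 if any row of that Type has Var1 == 3, else 0
flag = df.groupby('Type')['Var1'].apply(lambda s: s.eq(3).any()).astype(int)
# broadcast the per-Type flag back to every row
df['Result'] = df['Type'].map(flag)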

Slicing an entire pandas dataframe instead of a series changes the data type and assigns NaN values, what is happening?

I was trying to do some cleaning on a dataset, where instead of applying a condition to a pandas series
head_only[head_only.BasePay > 70000]
I applied the condition to the whole data frame
head_only[head_only > 70000]
I attached images of my observation; could anyone help me understand what is happening?
Your second approach raises an error if the dataframe has both numeric and string columns:
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2.0, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
print (df[df > 5])
TypeError: '>' not supported between instances of 'str' and 'int'
If you compare only the numeric columns, it keeps the values greater than 4 and converts all other numbers to missing values:
df1 = df.select_dtypes(np.number)
print (df1[df1 > 4])
B C D E
0 NaN 7.0 NaN 5.0
1 5.0 8.0 NaN NaN
2 NaN 9.0 5.0 6.0
3 5.0 NaN 7.0 9.0
4 5.0 NaN NaN NaN
5 NaN NaN NaN NaN
Here at least one value in each column was replaced by NaN, so the integer columns are converted to floats (because NaN is a float):
print (df1[df1 > 4].dtypes)
B float64
C float64
D float64
E float64
dtype: object
If you need to compare all numeric columns and keep a row when at least one of them matches the condition, use DataFrame.any to test whether at least one value per row is True:
# the comparison returns a boolean DataFrame
print ((df1 > 7))
B C D E
0 False False False False
1 False True False False
2 False True False False
3 False False False True
4 False False False False
5 False False False False
print ((df1 > 7).any(axis=1))
0 False
1 True
2 True
3 True
4 False
5 False
dtype: bool
print (df1[(df1 > 7).any(axis=1)])
B C D E
1 5 8.0 3 3
2 4 9.0 5 6
3 5 4.0 7 9
Or, if you need to filter the original dataframe with all its columns, you can restrict the comparison to the numeric columns with DataFrame.select_dtypes:
print (df[(df.select_dtypes(np.number) > 7).any(axis=1)])
A B C D E F
1 b 5 8.0 3 3 a
2 c 4 9.0 5 6 a
3 d 5 4.0 7 9 b
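
As a side note on what head_only[head_only > 70000] actually does: indexing a DataFrame with a boolean DataFrame is equivalent to DataFrame.where, which keeps matching cells and masks everything else with NaN. A small sketch on toy data (not the asker's dataset):
import pandas as pd

toy = pd.DataFrame({'B': [4, 5, 4], 'C': [7.0, 8.0, 9.0]})

# boolean-DataFrame indexing keeps matching cells and fills the rest with NaN
masked = toy[toy > 4]
# which is the same as an explicit where()
same = toy.where(toy > 4)

print(masked.equals(same))  # True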

Python Pandas Where Condition Is Not Working

I have created a where condition with Python.
filter = data['Ueber'] > 2.3
data[filter]
Here you can see my dataset.
Saison Spieltag Heimteam ... Ueber Unter UeberUnter
0 1819 3 Bayern München ... 1.30 3.48 Ueber
1 1819 3 Werder Bremen ... 1.75 2.12 Unter
2 1819 3 SC Freiburg ... 2.20 1.69 Ueber
3 1819 3 VfL Wolfsburg ... 2.17 1.71 Ueber
4 1819 3 Fortuna Düsseldorf ... 1.46 2.71 Ueber
Unfortunately, my greater than condition is not working. What's the problem?
Thanks
Just for the sake of clarity: if the column you want to use in the conditional check really contains floats, then it should work.
Example DataFrame:
>>> df = pd.DataFrame({'num': [-12.5, 60.0, 50.0, -25.10, 50.0, 51.0, 71.0]}, dtype=float)
>>> df
num
0 -12.5
1 60.0
2 50.0
3 -25.1
4 50.0
5 51.0
6 71.0
Conditional check to compare:
>>> df['num'] > 50.0
0 False
1 True
2 False
3 False
4 False
5 True
6 True
Name: num, dtype: bool
Result:
>>> df [ df['num'] > 50.0 ]
num
1 60.0
5 51.0
6 71.0
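
If the comparison really returns nothing useful, one common cause (only a guess, since the question does not show dtypes) is that the Ueber column was read in as text rather than as floats. A hedged sketch for checking and converting:
# check how the column was parsed; 'object' usually means strings
print(data['Ueber'].dtype)

# convert to numbers; anything unparsable becomes NaN instead of raising
data['Ueber'] = pd.to_numeric(data['Ueber'], errors='coerce')

print(data[data['Ueber'] > 2.3])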

Reading Pandas Data frame if condition occurs in any row/column

I am attempting to read a data frame that has values in a random row/column order, and I would like to get a new column where all the values containing 'that' are collected.
Input:
0 1 2 3 4
0 this=1 that=2 who=2 was=3 where=5
1 that=4 who=5 this=1 was=3 where=2
2 was=2 who=7 this=7 that=3 where=7
3 was=3 who=4 this=7 that=1 where=8
4 that=1 who=3 this=4 was=1 where=3
Output:
0
0 that=2
1 that=4
2 that=3
3 that=1
4 that=1
I have been able to get the correct result with the following code, but with larger data frames it takes a long time to complete:
df1 = pd.DataFrame([['this=1', 'that=2', 'who=2', 'was=3', 'where=5'],
                    ['that=4', 'who=5', 'this=1', 'was=3', 'where=2'],
                    ['was=2', 'who=7', 'this=7', 'that=3', 'where=7'],
                    ['was=3', 'who=4', 'this=7', 'that=1', 'where=8'],
                    ['that=1', 'who=3', 'this=4', 'was=1', 'where=3']],
                   columns=[0, 1, 2, 3, 4])
df2 = pd.DataFrame()
for i in df1.index:
    data = [name for name in df1[i] if name[0:4] == 'that']
    df2 = df2.append(pd.DataFrame(data))
df1[df1.apply(lambda x: x.str.contains('that'))].stack()
Let's break this down:
df1.apply(lambda x: x.str.contains('that')) applies our lambda function to the entire dataframe. In English it reads: if 'that' is in the value, return True:
0 1 2 3 4
0 False True False False False
1 True False False False False
2 False False False True False
3 False False False True False
4 True False False False False
Wrapping that in df1[] returns the values instead of True/False:
0 1 2 3 4
0 NaN that=2 NaN NaN NaN
1 that=4 NaN NaN NaN NaN
2 NaN NaN NaN that=3 NaN
3 NaN NaN NaN that=1 NaN
4 that=1 NaN NaN NaN NaN
stack() will stack all the values into one Series. stack() drops NaN values by default, which is what you needed.
If the extra index is tripping you up, you can also reset the index to get a single flat series:
df1[df1.apply(lambda x: x.str.contains('that'))].stack().reset_index(drop=True)
0 that=2
1 that=4
2 that=3
3 that=1
4 that=1
dtype: object
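
If the apply-based version is still too slow on a large frame, one possible shortcut (assuming, as in the sample data, that every cell is a string and each row contains exactly one 'that=' entry) is to flatten the values into a single Series and filter with a vectorized string method:
import pandas as pd

# flatten the dataframe row by row into one Series of strings
flat = pd.Series(df1.to_numpy().ravel())
# keep only the 'that=...' entries; the order follows the original rows
result = flat[flat.str.startswith('that')].reset_index(drop=True)
print(result)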

pandas compare two columns with different sizes

Suppose we have two dataframes with the same column but different sizes. How do we compare the two columns and get the indices of the rows with the same values in both? In df1 and df2, age is common to both, but df1 has 1000 rows and df2 has 200 rows. I want the indices of rows that have the same age value.
You can use .loc to align on the index labels:
df1.age < df2.loc[df1.index].age
Example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'age': np.random.randint(1, 10, 10)})
df2 = pd.DataFrame({'age': np.random.randint(1, 10, 20)})
Output:
0 True
1 True
2 False
3 True
4 True
5 False
6 False
7 True
8 False
9 False
Name: age, dtype: bool
Get everything in one dataframe:
df1.assign(age_2=df2.loc[df1.index].age, cond=df1.age < df2.loc[df1.index].age)
Output:
age age_2 cond
0 3 5 True
1 3 8 True
2 6 6 False
3 4 7 True
4 4 7 True
5 5 2 False
6 2 2 False
7 3 7 True
8 6 3 False
9 5 4 False
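
Since the question asks specifically for the indices of rows whose age value appears in the other dataframe, here is a hedged sketch using isin, which works regardless of the differing lengths:
# index labels in df1 whose age also occurs somewhere in df2
idx1 = df1.index[df1['age'].isin(df2['age'])]
# and the other way around
idx2 = df2.index[df2['age'].isin(df1['age'])]
print(idx1)
print(idx2)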