How can I select rows from one DataFrame where part of the row's index is in another DataFrame's index and meets certain criteria? - pandas

I have two DataFrames. df provides a lot of data. test_df describes whether certain tests have passed or not. I need to select from df only the rows where the tests have not failed by looking up this info in test_df. So far, I'm able to reduce my test_df to passed_tests. So, what's left is to select only the rows from df where the relevant part of the row index is in passed_tests. How can I do that?
Updates:
test_df doesn't have unique rows. Where there are duplicate rows (and there may be more than one duplicate), the most positive test result takes priority, i.e. True > Ok > False.
My code:
import pandas as pd
import numpy as np
index = [np.array(['foo', 'foo', 'foo', 'foo', 'qux', 'qux', 'qux']), np.array(['a', 'a', 'b', 'b', 'a', 'b', 'b'])]
data = np.array(['False', 'True', 'False', 'False', 'False', 'Ok', 'False'])
columns = ["Passed?"]
test_df = pd.DataFrame(data, index=index, columns=columns)
print(test_df)
index = [np.array(['foo', 'foo', 'foo', 'foo', 'qux', 'qux', 'qux', 'qux']),
np.array(['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b']),
np.array(['1', '2', '1', '2', '1', '2', '1', '2'])]
data = np.random.randn(8, 2)
columns = ["X", "Y"]
df = pd.DataFrame(data, index=index, columns=columns)
print(df)
passed_tests = test_df.loc[test_df['Passed?'].isin(['True', 'Ok'])]
print(passed_tests)
df
X Y
foo a 1 0.589776 -0.234717
2 0.105161 1.937174
b 1 -0.092252 0.143451
2 0.939052 -0.239052
qux a 1 0.757239 2.836032
2 -0.445335 1.352374
b 1 2.175553 -0.700816
2 1.082709 -0.923095
test_df
Passed?
foo a False
a True
b False
b False
qux a False
b Ok
b False
passed_tests
Passed?
foo a True
qux b Ok
required solution
X Y
foo a 1 0.589776 -0.234717
2 0.105161 1.937174
qux b 1 2.175553 -0.700816
2 1.082709 -0.923095

You need reindex with method='ffill', then check the values with isin, and finally use boolean indexing:
print (test_df.reindex(df.index, method='ffill'))
Passed?
foo a 1 True
2 True
b 1 False
2 False
qux a 1 False
2 False
b 1 Ok
2 Ok
mask = test_df.reindex(df.index, method='ffill').isin(['True', 'Ok'])['Passed?']
print (mask)
foo a 1 True
2 True
b 1 False
2 False
qux a 1 False
2 False
b 1 True
2 True
Name: Passed?, dtype: bool
print (df[mask])
X Y
foo a 1 -0.580448 -0.168951
2 -0.875165 1.304745
qux b 1 -0.147014 -0.787483
2 0.188989 -1.159533
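Putting it all together, a minimal end-to-end sketch (this assumes test_df already has one row per (level_0, level_1) pair; see the EDIT below for removing duplicates first):
# Forward-fill the two-level test index onto df's three-level index,
# then keep only the rows whose test result is 'True' or 'Ok'
mask = test_df.reindex(df.index, method='ffill')['Passed?'].isin(['True', 'Ok'])
result = df[mask]
print(result)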
EDIT:
To remove duplicates, an easy approach is:
get the MultiIndex levels as columns with reset_index
sort_values - Passed? column descending, first and second columns ascending
drop_duplicates - keep only the first value per (level_0, level_1) pair
set_index to restore the MultiIndex
rename_axis to remove the index names
test_df = (test_df.reset_index()
                  .sort_values(['level_0', 'level_1', 'Passed?'], ascending=[True, True, False])
                  .drop_duplicates(['level_0', 'level_1'])
                  .set_index(['level_0', 'level_1'])
                  .rename_axis([None, None]))
print (test_df)
Passed?
foo a True
b False
qux a False
b Ok
Another solution is simpler - sort first, then groupby with first:
test_df = (test_df.sort_values('Passed?', ascending=False)
                  .groupby(level=[0, 1])
                  .first())
print (test_df)
Passed?
foo a True
b False
qux a False
b Ok
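Note that plain sorting works here only because the status strings happen to sort in priority order lexicographically ('True' > 'Ok' > 'False'); a quick check:
print(sorted(['False', 'Ok', 'True']))
# ['False', 'Ok', 'True'] - ascending priority, so ascending=False picks the best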
EDIT1:
Convert the values to an ordered Categorical. This is needed when the status strings don't sort in priority order on their own, e.g. 'Acceptable' sorts before 'False' lexicographically even though it outranks it.
index = [np.array(['foo', 'foo', 'foo', 'foo', 'qux', 'qux', 'qux']), np.array(['a', 'a', 'b', 'b', 'a', 'b', 'b'])]
data = np.array(['False', 'True', 'False', 'False', 'False', 'Acceptable', 'False'])
columns = ["Passed?"]
test_df = pd.DataFrame(data, index=index, columns=columns)
#print (test_df)
cat = ['False', 'Acceptable', 'True']
test_df["Passed?"] = test_df["Passed?"].astype(pd.CategoricalDtype(categories=cat, ordered=True))
print (test_df["Passed?"])
foo a False
a True
b False
b False
qux a False
b Acceptable
b False
Name: Passed?, dtype: category
Categories (3, object): [False < Acceptable < True]
test_df = test_df.sort_values('Passed?', ascending=False).groupby(level=[0,1]).first()
print (test_df)
Passed?
foo a True
b False
qux a False
b Acceptable
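As a bonus, an ordered Categorical also supports direct comparisons against a category, so you can filter by priority; a small sketch, assuming the categories above:
# keep rows strictly better than 'False', i.e. 'Acceptable' or 'True'
print(test_df[test_df['Passed?'] > 'False'])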

Related

How to show rows with data which are not equal?

I have two tables
import pandas as pd
import numpy as np
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
df1 = pd.DataFrame(np.array([[1, 2, 4], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
print(df1.equals(df2))
I want to compare them and get the same result as df.compare(df1), or at least something close to it. I can't use that function because my interpreter states that 'DataFrame' object has no attribute 'compare'.
First approach:
Let's compare value by value:
In [1183]: eq_df = df1.eq(df2)
In [1196]: eq_df
Out[1200]:
a b c
0 True True False
1 True True True
2 True True True
Then let's reduce it down to see which rows are equal for all columns
from functools import reduce
In [1285]: eq_ser = reduce(np.logical_and, (eq_df[c] for c in eq_df.columns))
In [1288]: eq_ser
Out[1293]:
0 False
1 True
2 True
dtype: bool
Now we can print out the rows which are not equal
In [1310]: df1[~eq_ser]
Out[1315]:
a b c
0 1 2 4
In [1316]: df2[~eq_ser]
Out[1316]:
a b c
0 1 2 3
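As an aside, the reduce step can be written more directly with DataFrame.all; a minimal equivalent sketch:
eq_ser = df1.eq(df2).all(axis=1)  # True where every column matches
print(df1[~eq_ser])               # rows that differ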
Second approach:
from collections import namedtuple
from typing import Tuple

def diff_dataframes(
    df1, df2, compare_cols=None
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Given two dataframes and column(s) to compare, return three dataframes with rows:
    - common between the two dataframes
    - found only in the left dataframe
    - found only in the right dataframe
    """
    df1 = df1.fillna(pd.NA)
    df = df1.merge(df2.fillna(pd.NA), how="outer", on=compare_cols, indicator=True)
    df_both = df.loc[df["_merge"] == "both"].drop(columns="_merge")
    df_left = df.loc[df["_merge"] == "left_only"].drop(columns="_merge")
    df_right = df.loc[df["_merge"] == "right_only"].drop(columns="_merge")
    tup = namedtuple("df_diff", ["common", "left", "right"])
    return tup(df_both, df_left, df_right)
Usage:
In [1366]: b, l, r = diff_dataframes(df1, df2)
In [1371]: l
Out[1371]:
a b c
0 1 2 4
In [1372]: r
Out[1372]:
a b c
3 1 2 3
Third approach:
In [1440]: eq_ser = df1.eq(df2).sum(axis=1).eq(len(df1.columns))
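This produces the same boolean mask as the first approach; quick usage:
print(df1[~eq_ser])  # rows that differ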

Update categories in two Series / Columns for comparison

If I try to compare two Series with different categories I get an error:
a = pd.Categorical([1, 2, 3])
b = pd.Categorical([4, 5, 3])
df = pd.DataFrame({'a': a, 'b': b})
a b
0 1 4
1 2 5
2 3 3
df.a == df.b
# TypeError: Categoricals can only be compared if 'categories' are the same.
What is the best way to update categories in both Series? Thank you!
My solution:
df['b'] = df.b.cat.add_categories(df.a.cat.categories.difference(df.b.cat.categories))
df['a'] = df.a.cat.add_categories(df.b.cat.categories.difference(df.a.cat.categories))
df.a == df.b
Output:
0 False
1 False
2 True
dtype: bool
Another idea, with union_categoricals:
from pandas.api.types import union_categoricals
union = union_categoricals([df.a, df.b]).categories
df['a'] = df.a.cat.set_categories(union)
df['b'] = df.b.cat.set_categories(union)
print (df.a == df.b)
0 False
1 False
2 True
dtype: bool
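If you don't need to keep the categorical dtype and only want the elementwise comparison, a plain cast also sidesteps the error; a minimal sketch:
# compare the underlying values directly, dropping the categorical dtype
print(df.a.astype(object) == df.b.astype(object))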

Pandas: select multiple rows or default with new API

I need to retrieve multiple rows (which could be duplicated), and if an index label does not exist, get a default value. An example with Series:
s = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c'])
labels = ['a', 'd', 'f']
result = s.loc[labels]
result = result.fillna(my_default_value)
Now I'm using a DataFrame; an equivalent with names is:
df = pd.DataFrame({
"Person": {
"name_1": "Genarito",
"name_2": "Donald Trump",
"name_3": "Joe Biden",
"name_4": "Pablo Escobar",
"name_5": "Dalai Lama"
}
})
default_value = 'No name'
names_to_retrieve = ['name_1', 'name_2', 'name_8', 'name_3']
result = df.loc[names_to_retrieve]
result = result.fillna(default_value)
With both examples it's throwing a warning saying:
FutureWarning: Passing list-likes to .loc or [] with any missing
label will raise KeyError in the future, you can use .reindex() as an
alternative.
The documentation for that issue says you should use reindex, but it also says that it won't work with duplicates...
Is there any way to work without warnings and duplicated indexes?
Thanks in advance
Let's try merge:
result = (pd.DataFrame({'label':labels})
.merge(s.to_frame(name='x'), left_on='label',
right_index=True, how='left')
.set_index('label')['x']
)
Output:
label
a 0.0
a 1.0
d NaN
f NaN
Name: x, dtype: float64
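The same merge idea carries over to the DataFrame example from the question; a minimal sketch, assuming the df, names_to_retrieve and default_value defined there:
result = (pd.DataFrame({'label': names_to_retrieve})
            .merge(df, left_on='label', right_index=True, how='left')
            .set_index('label')
            .fillna(default_value))
print(result)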
How about:
on_values = s.loc[s.index.intersection(labels).unique()]
off_values = pd.Series(default_value, index=pd.Index(labels).difference(s.index))
result = pd.concat([on_values, off_values])
Check isin, then add the missing labels with concat:
out = s[s.index.isin(labels)]
out = pd.concat([out, pd.Series(index=set(labels) - set(s.index), dtype='float').fillna(0)])
out
out
Out[341]:
a 0.0
a 1.0
d 0.0
f 0.0
dtype: float64
You can write a simple function to handle the rows present in labels and those missing from labels separately, then join. When True, the in_order argument ensures that if you specify labels = ['d', 'a', 'f'], the output is ordered ['d', 'a', 'f'].
from typing import Union

def reindex_with_dup(s: Union[pd.Series, pd.DataFrame], labels, fill_value=np.nan, in_order=True):
    labels = pd.Series(labels)
    # rows whose labels exist in s (duplicates preserved, like .loc)
    s1 = s.loc[labels[labels.isin(s.index)]]
    # rows for the missing labels, filled with fill_value
    if isinstance(s, pd.Series):
        s2 = pd.Series(fill_value, index=labels[~labels.isin(s.index)])
    elif isinstance(s, pd.DataFrame):
        s2 = pd.DataFrame(fill_value, index=labels[~labels.isin(s.index)],
                          columns=s.columns)
    s = pd.concat([s1, s2])
    if in_order:
        s = s.loc[labels.drop_duplicates()]
    return s
reindex_with_dup(s, ['d', 'a', 'f'], fill_value='foo')
#d foo
#a 0
#a 1
#f foo
#dtype: object
This retains the .loc behavior: if your index contains duplicates and your labels do too, the selection is duplicated accordingly:
reindex_with_dup(s, ['d', 'a', 'a', 'f', 'f'], fill_value='foo')
#d foo
#a 0
#a 1
#a 0
#a 1
#f foo
#f foo
#dtype: object

How to find the value by checking the flag

The data frame is below:
uid,col1,col2,flag
1001,a,b,{'a':True,'b':False}
1002,a,b,{'a':False,'b':True}
out
a
b
By checking the flag: if the a flag is true then print a in the out column, and if the b flag is true then print b in the out column.
IIUC, you can use dot after DataFrame constructor:
m=pd.DataFrame(df['flag'].tolist()).fillna(False)
final=df.assign(New=m.dot(m.columns))
print(final)
uid col1 col2 flag New
0 1001 a b {'a': True, 'b': False} a
1 1002 a b {'a': False, 'b': True} b
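For reference, m.dot(m.columns) works because multiplying a string by a boolean keeps or drops it ('a' * True is 'a', 'a' * False is ''), so the dot product concatenates the column names wherever the flag is True (several True flags would concatenate to e.g. 'ab'); a tiny standalone sketch:
m = pd.DataFrame({'a': [True, False], 'b': [False, True]})
print(m.dot(m.columns))
# 0    a
# 1    b
# dtype: object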
If you just want to evaluate the flags column (and col1 and col2 won't be used in any way as per your question), then you can simply get the first key from the flags dict where the value is True:
df.flag.apply(lambda x: next((k for k,v in x.items() if v), ''))
(instead of '' you can, of course, supply any other value for the case that none of the values in the dict is True)
Example:
import pandas as pd
import io
import ast
s = '''uid,col1,col2,flag
1001,a,b,"{'a':True,'b':False}"
1002,a,b,"{'a':False,'b':True}"
1003,a,b,"{'a':True,'b':True}"
1004,a,b,"{'a':False,'b':False}"'''
df = pd.read_csv(io.StringIO(s))
df.flag = df.flag.map(ast.literal_eval)
df['out'] = df.flag.apply(lambda x: next((k for k,v in x.items() if v), ''))
Result
uid col1 col2 flag out
0 1001 a b {'a': True, 'b': False} a
1 1002 a b {'a': False, 'b': True} b
2 1003 a b {'a': True, 'b': True} a
3 1004 a b {'a': False, 'b': False}
Method 1
We can also use Series.apply to convert each dictionary to a Series, then drop the False cells with boolean indexing + DataFrame.stack and pull a or b out of the index with Index.get_level_values. Note this assumes exactly one True per row; with zero or several True flags the stacked length won't match df:
s = df['flag'].apply(pd.Series)
df['new'] = s[s].stack().index.get_level_values(1)
#df['new'] = np.dot(s, s.columns) #or this
print(df)
Method 2:
We can also check the items with Series.apply and save the key in a list if the value is True.
Finally we use Series.explode if we want to get rid of the list.
df['new']=df['flag'].apply(lambda x: [k for k,v in x.items() if v])
df = df.explode('new')
print(df)
or without apply:
df=df.assign(new=[[k for k,v in d.items() if v] for d in df['flag']]).explode('new')
print(df)
Output
uid col1 col2 flag new
0 1001 a b {'a': True, 'b': False} a
1 1002 a b {'a': False, 'b': True} b

Selecting data from a dataframe based on a tuple

Suppose I have the following dataframe
df = pd.DataFrame({'vals': [1, 2, 3, 4],
                   'ids': ['a', 'b', 'a', 'n']})
I want to select all the rows which are in the list
[(1, 'a'), (3, 'f')]
I have tried using boolean indexing like so
to_search = { 'vals' : [1,3],
'ids' : ['a', 'f']
}
df.isin(to_search)
I expect only the first row to match but I get the first and the third row
ids vals
0 True True
1 True False
2 True True
3 False False
Is there any way to match exactly the values at a particular index instead of matching any value?
You might create a DataFrame for what you want to match, and then merge it:
In [32]: df2 = pd.DataFrame([[1, 'a'], [3, 'f']], columns=['vals', 'ids'])
In [33]: df.merge(df2)
Out[33]:
ids vals
0 a 1
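An alternative sketch (not from the original answer): build row tuples and test membership with isin, which keeps the original index and avoids the merge:
mask = pd.Series(list(zip(df['vals'], df['ids'])), index=df.index).isin([(1, 'a'), (3, 'f')])
print(df[mask])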