I want those customers present in the data frame which have more 'false' values than 'true' values. Any suggestion on how to achieve that? - pandas

The data frame :
df = pd.DataFrame({'A': ['cust1', 'cust1', 'cust2', 'cust1',
                         'cust2', 'cust1', 'cust2', 'cust2', 'cust2', 'cust1'],
                   'B': ['true', 'true', 'true', 'false',
                         'false', 'false', 'false', 'true', 'false', 'true']})
Output: ['cust2']

First get the counts with crosstab, then filter the index values by comparing the columns with boolean indexing; for "greater than" use Series.gt:
df1 = pd.crosstab(df['A'], df['B'])
print (df1)
B false true
A
cust1 2 3
cust2 3 2
c = df1.index[df1['false'].gt(df1['true'])].tolist()
#if True, False are boolean
#c = df1.index[df1[False].gt(df1[True])].tolist()
print (c)
['cust2']
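As a sketch of an alternative (assuming column B only ever contains the strings 'false' and 'true'), you can get the same counts with groupby + value_counts and compare the columns directly:
# Count 'false' vs 'true' per customer and keep those with more 'false'
counts = df.groupby('A')['B'].value_counts().unstack(fill_value=0)
result = counts.index[counts['false'] > counts['true']].tolist()
print(result)  # ['cust2']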

df[df['B']=='false'].groupby(['A']).count().sort_values(by='B', ascending=False).index[0]
Explanation: take only the rows with 'false', group by 'A' and count. Now sort the counts in descending order and get the first index ('A') value. Note that this returns the single customer with the most 'false' rows rather than all customers with more 'false' than 'true' rows, but it gives the right answer for this data.

You can use the index of the crosstab result to isolate the rows where the greater value is in 'false':
result = list(dataframe.index[dataframe['false'].gt(dataframe['true'])])  # avoid naming the variable "list", which shadows the built-in

Related

Generate combinations with specified order with itertools.combinations

I used itertools.combinations to generate combinations for a dataframe's index. I'd like the combinations in a specified order (High - Mid - Low).
Example
import pandas as pd
from itertools import combinations

d = {'levels': ['High', 'High', 'Mid', 'Low', 'Low', 'Low', 'Mid'], 'converted': [True, True, True, False, False, True, False]}
df = pd.DataFrame(data=d)
df_ = pd.crosstab(df['levels'], df['converted'])
df_
converted False True
levels
High 0 2
Low 2 1
Mid 1 1
list(combinations(df_.index, 2)) returns [('High', 'Low'), ('High', 'Mid'), ('Low', 'Mid')]
I'd like the third group to be ('Mid', 'Low'), how can I achieve this ?
Use DataFrame.reindex first; note that the first and second tuples in the resulting list are swapped compared with the original output:
order = ['High','Mid','Low']
a = list(combinations(df_.reindex(order).index, 2))
print (a)
[('High', 'Mid'), ('High', 'Low'), ('Mid', 'Low')]
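A possible alternative sketch, assuming the same df_: make the index an ordered categorical and sort it, so the pair order follows the declared category order:
# Sort the index as an ordered categorical, then take combinations
order = ['High', 'Mid', 'Low']
idx = df_.index.astype(pd.CategoricalDtype(order, ordered=True))
print(list(combinations(idx.sort_values(), 2)))
# [('High', 'Mid'), ('High', 'Low'), ('Mid', 'Low')]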

Pandas: select multiple rows or default with new API

I need to retrieve multiples rows (which could be duplicated) and if the index does not exist get a default value. An example with Series:
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c'])
labels = ['a', 'd', 'f']
result = s.loc[labels]
result = result.fillna(my_default_value)
Now, I'm using DataFrame, an equivalent with names is:
df = pd.DataFrame({
    "Person": {
        "name_1": "Genarito",
        "name_2": "Donald Trump",
        "name_3": "Joe Biden",
        "name_4": "Pablo Escobar",
        "name_5": "Dalai Lama"
    }
})
default_value = 'No name'
names_to_retrieve = ['name_1', 'name_2', 'name_8', 'name_3']
result = df.loc[names_to_retrieve]
result = result.fillna(default_value)
With both examples it's throwing a warning saying:
FutureWarning: Passing list-likes to .loc or [] with any missing
label will raise KeyError in the future, you can use .reindex() as an
alternative.
The documentation for this deprecation says that you should use reindex, but it also says that it won't work with duplicates...
Is there any way to work without warnings and duplicated indexes?
Thanks in advance
Let's try merge:
result = (pd.DataFrame({'label': labels})
            .merge(s.to_frame(name='x'), left_on='label',
                   right_index=True, how='left')
            .set_index('label')['x'])
Output:
label
a 0.0
a 1.0
d NaN
f NaN
Name: x, dtype: float64
How about:
on_values = s.loc[s.index.intersection(labels).unique()]
off_values = pd.Series(default_value, index=pd.Index(labels).difference(s.index))
result = pd.concat([on_values, off_values])
Check isin with append
out = s[s.index.isin(labels)]
out = out.append(pd.Series(index=list(set(labels) - set(s.index)), dtype='float').fillna(0))
out
Out[341]:
a 0.0
a 1.0
d 0.0
f 0.0
dtype: float64
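Note that Series.append was removed in pandas 2.0; a minimal equivalent sketch with pd.concat:
# Same idea with pd.concat instead of the removed Series.append
missing = pd.Series(0.0, index=list(set(labels) - set(s.index)))
out = pd.concat([s[s.index.isin(labels)], missing])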
You can write a simple function to handle the rows in labels and missing from labels separately, then join them. When True, the in_order argument ensures that if you specify labels = ['d', 'a', 'f'], the output is ordered ['d', 'a', 'f'].
import numpy as np
import pandas as pd
from typing import Union

def reindex_with_dup(s: Union[pd.Series, pd.DataFrame], labels, fill_value=np.nan, in_order=True):
    labels = pd.Series(labels)
    s1 = s.loc[labels[labels.isin(s.index)]]
    if isinstance(s, pd.Series):
        s2 = pd.Series(fill_value, index=labels[~labels.isin(s.index)])
    if isinstance(s, pd.DataFrame):
        s2 = pd.DataFrame(fill_value, index=labels[~labels.isin(s.index)],
                          columns=s.columns)
    s = pd.concat([s1, s2])
    if in_order:
        s = s.loc[labels.drop_duplicates()]
    return s
reindex_with_dup(s, ['d', 'a', 'f'], fill_value='foo')
#d foo
#a 0
#a 1
#f foo
#dtype: object
This retains the .loc behavior that if your index is duplicated and your labels are duplicated it duplicates the selection:
reindex_with_dup(s, ['d', 'a', 'a', 'f', 'f'], fill_value='foo')
#d foo
#a 0
#a 1
#a 0
#a 1
#f foo
#f foo
#dtype: object
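Applied to the DataFrame from the question (a usage sketch, assuming the df and names_to_retrieve defined above):
result = reindex_with_dup(df, names_to_retrieve, fill_value='No name')
print(result)
#               Person
# name_1      Genarito
# name_2  Donald Trump
# name_8       No name
# name_3     Joe Biden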

How to find the value by checking the flag

The data frame is below:
uid,col1,col2,flag
1001,a,b,{'a':True,'b':False}
1002,a,b,{'a':False,'b':True}
out
a
b
By checking the flag: if the 'a' flag is true then print a in the out column, and if the 'b' flag is true then print b in the out column.
IIUC, you can use dot after DataFrame constructor:
m = pd.DataFrame(df['flag'].tolist()).fillna(False)
final = df.assign(New=m.dot(m.columns))
print(final)
    uid col1 col2                     flag New
0  1001    a    b  {'a': True, 'b': False}   a
1  1002    a    b  {'a': False, 'b': True}   b
If you just want to evaluate the flags column (and col1 and col2 won't be used in any way as per your question), then you can simply get the first key from the flags dict where the value is True:
df.flag.apply(lambda x: next((k for k,v in x.items() if v), ''))
(instead of '' you can, of course, supply any other value for the case that none of the values in the dict is True)
Example:
import pandas as pd
import io
import ast
s = '''uid,col1,col2,flag
1001,a,b,"{'a':True,'b':False}"
1002,a,b,"{'a':False,'b':True}"
1003,a,b,"{'a':True,'b':True}"
1004,a,b,"{'a':False,'b':False}"'''
df = pd.read_csv(io.StringIO(s))
df.flag = df.flag.map(ast.literal_eval)
df['out'] = df.flag.apply(lambda x: next((k for k,v in x.items() if v), ''))
Result
uid col1 col2 flag out
0 1001 a b {'a': True, 'b': False} a
1 1002 a b {'a': False, 'b': True} b
2 1003 a b {'a': True, 'b': True} a
3 1004 a b {'a': False, 'b': False}
Method 1
We can also use Series.apply to convert each dictionary to a series, then drop the False values with boolean indexing + DataFrame.stack and select a or b from the index with Index.get_level_values:
s = df['flag'].apply(pd.Series)
df['new'] = s[s].stack().index.get_level_values(1)
#df['new'] = np.dot(s, s.columns)  # or this
print(df)
Method 2:
We can also check the items with Series.apply and save the key in a list if the value is True.
Finally we use Series.explode if we want to get rid of the list.
df['new']=df['flag'].apply(lambda x: [k for k,v in x.items() if v])
df = df.explode('new')
print(df)
or without apply:
df=df.assign(new=[[k for k,v in d.items() if v] for d in df['flag']]).explode('new')
print(df)
Output
uid col1 col2 flag new
0 1001 a b {'a': True, 'b': False} a
1 1002 a b {'a': False, 'b': True} b
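A further hedged sketch, assuming every row has at least one True flag: build a boolean frame and use DataFrame.idxmax, which returns the first column holding the row maximum (True):
m = pd.DataFrame(df['flag'].tolist(), index=df.index)
df['new'] = m.idxmax(axis=1)  # first key whose value is True in each row
Rows with no True flag would incorrectly get the first column name, so this only fits data where one flag is always set.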

How can I select rows from one DataFrame, where a part of the row's index is in another DataFrame's index and meets certain criteria?

I have two DataFrames. df provides a lot of data. test_df describes whether certain tests have passed or not. I need to select from df only the rows where the tests have not failed by looking up this info in test_df. So far, I'm able to reduce my test_df to passed_tests. So, what's left is to select only the rows from df where the relevant part of the row index is in passed_tests. How can I do that?
Updates:
test_df doesn't have unique rows. Where there are duplicate rows (and there may be more than one duplicate), the test that was the most positive takes priority, i.e. True > Ok > False.
My code:
import pandas as pd
import numpy as np

index = [np.array(['foo', 'foo', 'foo', 'foo', 'qux', 'qux', 'qux']),
         np.array(['a', 'a', 'b', 'b', 'a', 'b', 'b'])]
data = np.array(['False', 'True', 'False', 'False', 'False', 'Ok', 'False'])
columns = ["Passed?"]
test_df = pd.DataFrame(data, index=index, columns=columns)
print(test_df)

index = [np.array(['foo', 'foo', 'foo', 'foo', 'qux', 'qux', 'qux', 'qux']),
         np.array(['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b']),
         np.array(['1', '2', '1', '2', '1', '2', '1', '2'])]
data = np.random.randn(8, 2)
columns = ["X", "Y"]
df = pd.DataFrame(data, index=index, columns=columns)
print(df)

passed_tests = test_df.loc[test_df['Passed?'].isin(['True', 'Ok'])]
print(passed_tests)
df
X Y
foo a 1 0.589776 -0.234717
2 0.105161 1.937174
b 1 -0.092252 0.143451
2 0.939052 -0.239052
qux a 1 0.757239 2.836032
2 -0.445335 1.352374
b 1 2.175553 -0.700816
2 1.082709 -0.923095
test_df
Passed?
foo a False
a True
b False
b False
qux a False
b Ok
b False
passed_tests
Passed?
foo a True
qux b Ok
required solution
X Y
foo a 1 0.589776 -0.234717
2 0.105161 1.937174
qux b 1 2.175553 -0.700816
2 1.082709 -0.923095
You need reindex with method='ffill', then check the values with isin, and finally use boolean indexing:
print (test_df.reindex(df.index, method='ffill'))
Passed?
foo a 1 True
2 True
b 1 False
2 False
qux a 1 False
2 False
b 1 Ok
2 Ok
mask = test_df.reindex(df.index, method='ffill').isin(['True', 'Ok'])['Passed?']
print (mask)
foo a 1 True
2 True
b 1 False
2 False
qux a 1 False
2 False
b 1 True
2 True
Name: Passed?, dtype: bool
print (df[mask])
X Y
foo a 1 -0.580448 -0.168951
2 -0.875165 1.304745
qux b 1 -0.147014 -0.787483
2 0.188989 -1.159533
EDIT:
To remove the duplicates, it is easier to:
get the columns out of the MultiIndex with reset_index
sort_values - the Passed? column descending, the first and second columns ascending
drop_duplicates - keep only the first value per pair
set_index to restore the MultiIndex
rename_axis to remove the index names
test_df = (test_df.reset_index()
                  .sort_values(['level_0','level_1', 'Passed?'], ascending=[1,1,0])
                  .drop_duplicates(['level_0','level_1'])
                  .set_index(['level_0','level_1'])
                  .rename_axis([None, None]))
print (test_df)
Passed?
foo a True
b False
qux a False
b Ok
Another solution is simpler - sort first and then groupby with first:
test_df = (test_df.sort_values('Passed?', ascending=False)
                  .groupby(level=[0,1])
                  .first())
print (test_df)
Passed?
foo a True
b False
qux a False
b Ok
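To complete the picture, a short sketch combining the deduplicated test_df with the reindex/isin mask from above to get the required rows of df:
# Re-apply the ffill-reindex mask after deduplication
mask = test_df.reindex(df.index, method='ffill').isin(['True', 'Ok'])['Passed?']
print(df[mask])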
EDIT1:
Convert values to ordered Categorical.
index = [np.array(['foo', 'foo', 'foo', 'foo', 'qux', 'qux', 'qux']),
         np.array(['a', 'a', 'b', 'b', 'a', 'b', 'b'])]
data = np.array(['False', 'True', 'False', 'False', 'False', 'Acceptable', 'False'])
columns = ["Passed?"]
test_df = pd.DataFrame(data, index=index, columns=columns)
#print (test_df)
cat = ['False', 'Acceptable','True']
test_df["Passed?"] = test_df["Passed?"].astype('category', categories=cat, ordered=True)
print (test_df["Passed?"])
foo a False
a True
b False
b False
qux a False
b Acceptable
b False
Name: Passed?, dtype: category
Categories (3, object): [False < Acceptable < True]
test_df = test_df.sort_values('Passed?', ascending=False).groupby(level=[0,1]).first()
print (test_df)
Passed?
foo a True
b False
qux a False
b Acceptable

Selecting data from a dataframe based on a tuple

Suppose I have the following dataframe
df = pd.DataFrame({'vals': [1, 2, 3, 4],
                   'ids': ['a', 'b', 'a', 'n']})
I want to select all the rows which are in the list
[ (1,a), (3,f) ]
I have tried using boolean indexing like so
to_search = {'vals': [1, 3],
             'ids': ['a', 'f']}
df.isin(to_search)
I expect only the first row to match but I get the first and the third row
ids vals
0 True True
1 True False
2 True True
3 False False
Is there any way to match exactly the values at a particular index instead of matching any value?
You might create a DataFrame for what you want to match, and then merge it:
In [32]: df2 = pd.DataFrame([[1, 'a'], [3, 'f']], columns=['vals', 'ids'])
In [33]: df.merge(df2)
Out[33]:
ids vals
0 a 1
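Another option as a sketch, assuming you want exact (vals, ids) pairs: build a MultiIndex from the two columns and test membership with Index.isin, which accepts a list of tuples:
# Match exact (vals, ids) pairs via MultiIndex membership
pairs = [(1, 'a'), (3, 'f')]
mask = df.set_index(['vals', 'ids']).index.isin(pairs)
print(df[mask])  # keeps only row 0 (vals=1, ids='a')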