How to find the value by checking the flag - pandas

My data frame is below:
uid,col1,col2,flag
1001,a,b,{'a':True,'b':False}
1002,a,b,{'a':False,'b':True}
Expected out column:
out
a
b
By checking the flag: if the 'a' flag is true, print a in the out column; if the 'b' flag is true, print b in the out column.

IIUC, you can use dot after the DataFrame constructor:
m = pd.DataFrame(df['flag'].tolist()).fillna(False)
final = df.assign(New=m.dot(m.columns))
print(final)
    uid col1 col2                      flag New
0  1001    a    b   {'a': True, 'b': False}   a
1  1002    a    b   {'a': False, 'b': True}   b
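For reference, a minimal self-contained sketch of the dot trick, assuming flag already holds real dicts (e.g. after ast.literal_eval, as in the next answer):
import pandas as pd

df = pd.DataFrame({
    'uid': [1001, 1002],
    'col1': ['a', 'a'],
    'col2': ['b', 'b'],
    'flag': [{'a': True, 'b': False}, {'a': False, 'b': True}],
})

# One boolean column per dict key; missing keys become False
m = pd.DataFrame(df['flag'].tolist()).fillna(False)

# Dotting a boolean frame with its own column names concatenates,
# per row, the names of the columns that are True
final = df.assign(New=m.dot(m.columns))
print(final['New'].tolist())  # ['a', 'b']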

If you just want to evaluate the flag column (and col1 and col2 aren't used in any way, as per your question), then you can simply take the first key from the flag dict whose value is True:
df.flag.apply(lambda x: next((k for k,v in x.items() if v), ''))
(instead of '' you can, of course, supply any other fallback for the case where none of the values in the dict is True)
Example:
import pandas as pd
import io
import ast
s = '''uid,col1,col2,flag
1001,a,b,"{'a':True,'b':False}"
1002,a,b,"{'a':False,'b':True}"
1003,a,b,"{'a':True,'b':True}"
1004,a,b,"{'a':False,'b':False}"'''
df = pd.read_csv(io.StringIO(s))
df.flag = df.flag.map(ast.literal_eval)
df['out'] = df.flag.apply(lambda x: next((k for k,v in x.items() if v), ''))
Result:
    uid col1 col2                      flag out
0  1001    a    b   {'a': True, 'b': False}   a
1  1002    a    b   {'a': False, 'b': True}   b
2  1003    a    b    {'a': True, 'b': True}   a
3  1004    a    b  {'a': False, 'b': False}

Method 1:
We can also use Series.apply to convert each dictionary to a Series, then drop the False entries with boolean indexing plus DataFrame.stack, and pull a or b out of the resulting index with Index.get_level_values:
s = df['flag'].apply(pd.Series)
df['new'] = s[s].stack().index.get_level_values(1)
#df['new'] = np.dot(s, s.columns)  # or this (requires import numpy as np)
print(df)
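To see why this works, here is a sketch of the intermediates on the same two-row sample:
import pandas as pd

df = pd.DataFrame({'flag': [{'a': True, 'b': False}, {'a': False, 'b': True}]})
s = df['flag'].apply(pd.Series)
print(s)
#        a      b
# 0   True  False
# 1  False   True

# Masking with s[s] turns the False cells into NaN, stack() drops them,
# and level 1 of the remaining index is the surviving column name
print(s[s].stack().index.get_level_values(1).tolist())
# ['a', 'b']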
Method 2:
We can also iterate over the items with Series.apply and collect a key in a list whenever its value is True.
Finally we use Series.explode to get rid of the lists.
df['new'] = df['flag'].apply(lambda x: [k for k, v in x.items() if v])
df = df.explode('new')
print(df)
Or without apply:
df = df.assign(new=[[k for k, v in d.items() if v] for d in df['flag']]).explode('new')
print(df)
Output:
    uid col1 col2                      flag new
0  1001    a    b   {'a': True, 'b': False}   a
1  1002    a    b   {'a': False, 'b': True}   b

Related

Join 2 data frames with special column matching

I want to join two dataframes and get the result below. I tried many ways, but they all fail.
I want only the texts in df2['A'] which contain a text from df1['A']. What do I need to change in my code?
I want:
0 A0_link0
1 A1_link1
2 A2_link2
3 A3_link3
import pandas as pd

df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
    })
df2 = pd.DataFrame(
    {
        "A": ["A0_link0", "A1_link1", "A2_link2", "A3_link3", "A4_link4", "An_linkn"],
        "B": ["B0_link0", "B1_link1", "B2_link2", "B3_link3", "B4_link4", "Bn_linkn"],
    })
result = pd.concat([df1, df2], ignore_index=True, join="inner", sort=False)
print(result)
Create an intermediate dataframe and map:
d = (df2.assign(key=df2['A'].str.extract(r'([^_]+)'))
        .set_index('key'))
df1['A'].map(d['A'])
Output:
0 A0_link0
1 A1_link1
2 A2_link2
3 A3_link3
Name: A, dtype: object
Or merge if you want several columns from df2: df1.merge(d, left_on='A', right_index=True).
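A sketch of that merge variant, which keeps every column of df2 (the suffixes are chosen here just to disambiguate the duplicate A column):
out = df1.merge(d, left_on='A', right_index=True, suffixes=('', '_df2'))
print(out)
#     A     A_df2         B
# 0  A0  A0_link0  B0_link0
# 1  A1  A1_link1  B1_link1
# 2  A2  A2_link2  B2_link2
# 3  A3  A3_link3  B3_link3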
You can also set both indexes to the An prefix and pd.concat on columns:
result = (pd.concat([df1.set_index(df1['A']),
                     df2.set_index(df2['A'].str.split('_').str[0])],
                    axis=1, join="inner", sort=False)
            .reset_index(drop=True))
print(result)
    A         A         B
0  A0  A0_link0  B0_link0
1  A1  A1_link1  B1_link1
2  A2  A2_link2  B2_link2
3  A3  A3_link3  B3_link3
df2.A.loc[df2.A.str.split('_', expand=True).iloc[:, 0].isin(df1.A)]
0 A0_link0
1 A1_link1
2 A2_link2
3 A3_link3
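The same idea without expand=True keeps the whole rows of df2 rather than just column A (a sketch):
out = df2[df2['A'].str.split('_').str[0].isin(df1['A'])]
print(out)
#           A         B
# 0  A0_link0  B0_link0
# 1  A1_link1  B1_link1
# 2  A2_link2  B2_link2
# 3  A3_link3  B3_link3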

How to show rows with data which are not equal?

I have two tables:
import pandas as pd
import numpy as np
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
df1 = pd.DataFrame(np.array([[1, 2, 4], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
print(df1.equals(df2))
I want to compare them and get the same result as DataFrame.compare (df1.compare(df2)), or at least something close to it. I can't use that function because my interpreter states that 'DataFrame' object has no attribute 'compare' (it was only added in pandas 1.1).
First approach:
Let's compare value by value:
In [1183]: eq_df = df1.eq(df2)

In [1196]: eq_df
Out[1196]:
      a     b      c
0  True  True  False
1  True  True   True
2  True  True   True
Then let's reduce it down to see which rows are equal across all columns:
from functools import reduce

In [1285]: eq_ser = reduce(np.logical_and, (eq_df[c] for c in eq_df.columns))

In [1288]: eq_ser
Out[1288]:
0    False
1     True
2     True
dtype: bool
Now we can print out the rows which are not equal:
In [1310]: df1[~eq_ser]
Out[1310]:
   a  b  c
0  1  2  4

In [1316]: df2[~eq_ser]
Out[1316]:
   a  b  c
0  1  2  3
Second approach:
from collections import namedtuple
from typing import Tuple


def diff_dataframes(
    df1, df2, compare_cols=None
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Given two dataframes and column(s) to compare, return three dataframes with rows:
    - common between the two dataframes
    - found only in the left dataframe
    - found only in the right dataframe
    """
    df1 = df1.fillna(pd.NA)
    df = df1.merge(df2.fillna(pd.NA), how="outer", on=compare_cols, indicator=True)
    df_both = df.loc[df["_merge"] == "both"].drop(columns="_merge")
    df_left = df.loc[df["_merge"] == "left_only"].drop(columns="_merge")
    df_right = df.loc[df["_merge"] == "right_only"].drop(columns="_merge")
    tup = namedtuple("df_diff", ["common", "left", "right"])
    return tup(df_both, df_left, df_right)
Usage:
In [1366]: b, l, r = diff_dataframes(df1, df2)

In [1371]: l
Out[1371]:
   a  b  c
0  1  2  4

In [1372]: r
Out[1372]:
   a  b  c
3  1  2  3
Third approach:
In [1440]: eq_ser = df1.eq(df2).sum(axis=1).eq(len(df1.columns))
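Both the reduce and the sum/eq variants amount to DataFrame.all over axis 1, which is a bit more direct:
eq_ser = df1.eq(df2).all(axis=1)
print(df1[~eq_ser])
#    a  b  c
# 0  1  2  4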

Pandas apply function on multiple columns

I am trying to apply a function to every column in a dataframe. When I try it with a single fixed column name it works, but when I try passing the column name as an argument to the function I get an error.
How do you properly pass arguments to apply a function on a data frame?
def result(row, c):
    if row[c] >= 0 and row[c] <= 1:
        return 'c'
    elif row[c] > 1 and row[c] <= 2:
        return 'b'
    else:
        return 'a'

cols = list(df.columns.values)
for c in cols:
    df[c] = df.apply(result, args = (c), axis=1)
TypeError: ('result() takes exactly 2 arguments (21 given)', u'occurred at index 0')
Input data frame format:
d = {'c1': [1, 2, 1, 0], 'c2': [3, 0, 1, 2]}
df = pd.DataFrame(data=d)
df
   c1  c2
0   1   3
1   2   0
2   1   1
3   0   2
As an aside, the TypeError comes from args = (c): parentheses alone don't make a tuple, so apply unpacks the column-name string character by character; args=(c,) would fix the original code. But you don't need to pass the column name to apply at all here. Since you only want to check whether each value falls in a certain range and return 'a', 'b' or 'c', you can make the following changes:
def result(val):
    if 0 <= val <= 1:
        return 'c'
    elif 1 < val <= 2:
        return 'b'
    return 'a'

cols = list(df.columns.values)
for c in cols:
    df[c] = df[c].apply(result)
Note that this will replace your column values.
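If you would rather keep the originals, you can write the labels to new columns instead (the _cat suffix is just an illustrative choice):
for c in list(df.columns):  # snapshot the names, since the loop adds columns
    df[c + '_cat'] = df[c].apply(result)
print(df)
#    c1  c2 c1_cat c2_cat
# 0   1   3      c      a
# 1   2   0      b      c
# 2   1   1      c      c
# 3   0   2      c      b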
A faster way is np.select. Note that chained comparisons such as 0<=df[col]<=1 don't work on a whole Series, so the conditions are built with Series.between and boolean operators:
import numpy as np
values = ['c', 'b']
for col in df.columns:
    df[col] = np.select([df[col].between(0, 1), (df[col] > 1) & (df[col] <= 2)],
                        values, default='a')
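Since the bins are contiguous, pd.cut is another vectorized option. A sketch, where out-of-range values come back as NaN and are then replaced with 'a':
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 1, 0], 'c2': [3, 0, 1, 2]})
for col in df.columns:
    # [0, 1] -> 'c', (1, 2] -> 'b'; anything outside the bins becomes NaN
    binned = pd.cut(df[col], bins=[0, 1, 2], labels=['c', 'b'],
                    include_lowest=True)
    df[col] = binned.astype(object).where(binned.notna(), 'a')
print(df)
#   c1 c2
# 0  c  a
# 1  b  c
# 2  c  c
# 3  c  b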

Pandas merge two columns into Json

I have a pandas dataframe like below
Col1 Col2
0 a apple
1 a anar
2 b ball
3 b banana
I am looking to output json which outputs like
{ 'a' : ['apple', 'anar'], 'b' : ['ball', 'banana'] }
Use groupby with apply(list), then convert the Series to JSON with Series.to_json:
j = df.groupby('Col1')['Col2'].apply(list).to_json()
print (j)
{"a":["apple","anar"],"b":["ball","banana"]}
If you want to write the JSON to a file:
s = df.groupby('Col1')['Col2'].apply(list)
s.to_json('file.json')
Check the difference between to_json and to_dict:
j = df.groupby('Col1')['Col2'].apply(list).to_json()
d = df.groupby('Col1')['Col2'].apply(list).to_dict()
print (j)
{"a":["apple","anar"],"b":["ball","banana"]}
print (d)
{'a': ['apple', 'anar'], 'b': ['ball', 'banana']}
print (type(j))
<class 'str'>
print (type(d))
<class 'dict'>
You can groupby() 'Col1', apply() list to 'Col2' and convert with to_dict(). Use:
df.groupby('Col1')['Col2'].apply(list).to_dict()
Output:
{'a': ['apple', 'anar'], 'b': ['ball', 'banana']}
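Side note: the desired output in the question uses single quotes, which is a Python dict repr rather than valid JSON. If a file with valid JSON is the goal, one option (a sketch) is to go through the dict with the standard json module:
import json

d = df.groupby('Col1')['Col2'].apply(list).to_dict()
with open('file.json', 'w') as f:
    json.dump(d, f)  # writes {"a": ["apple", "anar"], "b": ["ball", "banana"]}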

pandas: filter rows with list elements beginning with string?

I have the following dataframe.
d = pd.DataFrame({'a': [['foo', 'bar'], ['bar'], ['fah', 'baz']]})
I'd like to return just the rows whose lists in a contain an element beginning with f - i.e. the first and third rows.
This is what I've tried:
d[d.a.is_in('f')]
Use any with a generator expression inside a list comprehension:
d = d[[any(y.startswith('f') for y in x) for x in d['a']]]
print (d)
a
0 [foo, bar]
2 [fah, baz]
Detail (converted to a list here only for display):
print ([list(y.startswith('f') for y in x) for x in d['a']])
[[True, False], [False], [True, False]]
A solution using .apply(), iterating over the individual list elements, checking each with .startswith() and testing whether the resulting list is non-empty:
import pandas as pd
df = pd.DataFrame({'a': [['foo', 'bar'], ['bar'], ['fah', 'baz']]})
df = df[df.a.apply(lambda x: len([el for el in x if el.startswith('f')]) > 0)]
print(df)
which results in:
a
0 [foo, bar]
2 [fah, baz]
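A vectorized alternative (a sketch; explode plus the .str accessor avoids the Python-level loop over list elements):
import pandas as pd

d = pd.DataFrame({'a': [['foo', 'bar'], ['bar'], ['fah', 'baz']]})

# Explode the lists to one element per row, test the prefix, then ask
# per original row (index level 0) whether any element matched
mask = d['a'].explode().str.startswith('f').groupby(level=0).any()
print(d[mask])
#             a
# 0  [foo, bar]
# 2  [fah, baz]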