Left Join and Anti-Join on the same data frames in Pandas

I have 2 dataframes like these:
df1 = pd.DataFrame(data = {'col1' : ['finance', 'accounting'], 'col2' : ['f1', 'a1']})
df2 = pd.DataFrame(data = {'col1' : ['finance', 'finance', 'finance', 'accounting', 'accounting', 'IT', 'IT'], 'col2' : ['f1', 'f2', 'f3', 'a1', 'a2', 'I1', 'I2']})
df1
col1 col2
0 finance f1
1 accounting a1
df2
col1 col2
0 finance f1
1 finance f2
2 finance f3
3 accounting a1
4 accounting a2
5 IT I1
6 IT I2
I would like to do LEFT JOIN on col1 and ANTI-JOIN on col2. The output should look like this:
col1 col2
finance f2
finance f3
accounting a2
Could someone please help me do this properly in pandas? I tried both join and merge, but neither worked for me. Thanks in advance.

You can merge and filter:
(df1.merge(df2, on='col1', suffixes=('_', None))   # df1's col2 gets the '_' suffix, df2's keeps its name
 .loc[lambda d: d['col2'] != d.pop('col2_')]       # keep rows where df2's col2 differs from df1's, dropping the helper column
)
Output:
col1 col2
1 finance f2
2 finance f3
4 accounting a2
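If you prefer the classic indicator-based anti-join, here is a minimal sketch of the same result (assuming df1/df2 as defined above; the variable names candidates/merged/out are just for illustration): first restrict df2 to the col1 values present in df1, then drop the (col1, col2) pairs that also appear in df1.
import pandas as pd

df1 = pd.DataFrame({'col1': ['finance', 'accounting'], 'col2': ['f1', 'a1']})
df2 = pd.DataFrame({'col1': ['finance', 'finance', 'finance', 'accounting', 'accounting', 'IT', 'IT'],
                    'col2': ['f1', 'f2', 'f3', 'a1', 'a2', 'I1', 'I2']})

# keep only rows whose col1 occurs in df1 (the "left join on col1" part)
candidates = df2[df2['col1'].isin(df1['col1'])]

# anti-join: drop rows whose (col1, col2) pair also exists in df1
merged = candidates.merge(df1, on=['col1', 'col2'], how='left', indicator=True)
out = merged.loc[merged['_merge'] == 'left_only', ['col1', 'col2']]
print(out)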

Just for fun, here's another way (other than the really elegant solution by @mozway):
df2 = (df2
    .reset_index()                    # save index as column 'index'
    .set_index('col1')                # make 'col1' the index
    .loc[df1.col1, :]                 # filter for 'col1' values in df1
    .set_index('col2', append=True)   # add 'col2' to the index
    .drop(index=df1.set_index(list(df1.columns)).index)  # build a multi-index from df1 and drop all matches from df2
    .reset_index()                    # make 'col1' and 'col2' columns again
    .set_index('index')               # make 'index' the index again
    .rename_axis(index=None))         # make the index anonymous
Output:
col1 col2
1 finance f2
2 finance f3
4 accounting a2

Related

Pandas pivot table or groupby absolute maximum of column

I have a dataframe df as:
Col1 Col2
A -5
A 3
B -2
B 15
I need to get the following:
Col1 Col2
A -5
B 15
where, for each group in Col1, the row with the maximum absolute value of Col2 is selected. I am not sure how to proceed with this.
Use DataFrameGroupBy.idxmax on the absolute values to get the index of each group's row, then select those rows with DataFrame.loc:
df = df.loc[df['Col2'].abs().groupby(df['Col1']).idxmax()]
# alternative: reassign the column with its absolute values first
df = df.loc[df.assign(Col2=df['Col2'].abs()).groupby('Col1')['Col2'].idxmax()]
print(df)
Col1 Col2
0 A -5
3 B 15
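Just as a sketch of an alternative (assuming the same df; the helper column name abs_col2 is arbitrary): sort by the absolute value and keep the first row per group.
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'A', 'B', 'B'], 'Col2': [-5, 3, -2, 15]})

# sort by |Col2| descending, then keep the first (i.e. absolute-maximum) row per Col1
out = (df.assign(abs_col2=df['Col2'].abs())
         .sort_values('abs_col2', ascending=False)
         .drop_duplicates('Col1')
         .drop(columns='abs_col2')
         .sort_index())
print(out)
#   Col1  Col2
# 0    A    -5
# 3    B    15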

Substring column in Pandas based on another column

I'm trying to substring a column based on the length of another column but the resultset is NaN. What am I doing wrong?
import pandas as pd
df = pd.DataFrame([['abcdefghi','xyz'], ['abcdefghi', 'z']], columns=['col1', 'col2'])
df.col1.str[:df.col2.str.len()]
0 NaN
1 NaN
Name: col1, dtype: float64
Here is what I am expecting:
0 'abc'
1 'a'
I don't think string indexing will take a Series. I would use a list comprehension:
df['extract'] = [r.col1[:len(r.col2)] for _,r in df.iterrows()]
Or
df['extract'] = [s1[:len(s2)] for s1,s2 in zip(df.col1, df.col2)]
Output:
col1 col2 extract
0 abcdefghi xyz abc
1 abcdefghi z a
Using numpy and converting the array to pd.Series:
import numpy as np

def slicer(start=None, stop=None, step=1):
    return np.vectorize(lambda x: x[start:stop:step], otypes=[str])

df["new_str"] = pd.Series(
    [slicer(0, i)(c) for i, c in zip(df["col2"].apply(len), df["col1"].values)]
)
print(df)
col1 col2 new_str
0 abcdefghi xyz abc
1 abcdefghi z a
Here is a solution using lambda:
df['new'] = df.apply(lambda row: row['col1'][0:len(row['col2'])], axis=1)
Result:
col1 col2 new
0 abcdefghi xyz abc
1 abcdefghi z a

Filter dataframe based on the quantile per group of values

Let's suppose that I have a dataframe like that:
import pandas as pd
df = pd.DataFrame({'col1':['A','A', 'A', 'B','B'], 'col2':[2, 4, 6, 3, 4]})
I want to keep only the rows whose col2 value is less than the x-th quantile of col2, computed separately for each group of values in col1.
For example, for the 60th percentile the dataframe should look like this:
col1 col2
0 A 2
1 A 4
2 B 3
How can I do this efficiently in pandas?
We can use transform with quantile:
df[df.col2.lt(df.groupby('col1').col2.transform(lambda x : x.quantile(0.6)))]
col1 col2
0 A 2
1 A 4
3 B 3
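A two-step sketch of the same filter (assuming the same df): compute each group's quantile once with groupby().quantile(), map it back onto the rows, and compare.
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'A', 'B', 'B'], 'col2': [2, 4, 6, 3, 4]})

# 60th percentile of col2 within each col1 group
q = df.groupby('col1')['col2'].quantile(0.6)

# keep rows whose col2 is below their group's quantile
out = df[df['col2'] < df['col1'].map(q)]
print(out)
#   col1  col2
# 0    A     2
# 1    A     4
# 3    B     3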

Conditional operation on pandas column

df1 =
name col1
a 1
b 2
c 3
d 4
df2 =
name col2
b 3
c 9
a 2
d 3
I want to match names in both data frames and multiply the other two columns accordingly, so the output would look like this:
df3 =
name col_new
a 2
b 6
c 27
d 12
Use Series.map for correct alignment, multiply with Series.mul, and extract the original column with DataFrame.pop:
df1['col_new'] = df1.pop('col1').mul(df1['name'].map(df2.set_index('name')['col2']))
For a new DataFrame use DataFrame.assign:
df3 = df1.assign(col_new = df1.pop('col1').mul(df1['name'].map(df2.set_index('name')['col2'])))
Or another solution with DataFrame.merge and a left join:
df3 = df1.merge(df2, on='name', how='left')
df3['col_new'] = df3.pop('col1').mul(df3.pop('col2'))
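For reference, here is a runnable sketch of the map-and-multiply approach with the sample data above (keeping df1's col1 intact rather than popping it):
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c', 'd'], 'col1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'name': ['b', 'c', 'a', 'd'], 'col2': [3, 9, 2, 3]})

# align df2's col2 to df1's names, multiply, and keep only name/col_new
df3 = df1.assign(col_new=df1['col1'].mul(df1['name'].map(df2.set_index('name')['col2'])))[['name', 'col_new']]
print(df3)
#   name  col_new
# 0    a        2
# 1    b        6
# 2    c       27
# 3    d       12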

Group values together in Pandas column, then filter values in another column

The Pandas DataFrame looks like this:
Col1 Col2
A 1
A 1
A 1
B 0
B 0
B 1
B 1
B 1
C 1
C 1
C 1
C 1
I want to group on Col1, then check Col2 to see whether all values for that group (e.g. A) are 1. In this example the desired output would be:
[A, C]
(because only A and C have all values set to 1). How do I do this?
In your case, use groupby with all:
df.groupby('Col1').Col2.all().loc[lambda x: x].index.tolist()
Out[350]: ['A', 'C']
Or without groupby
df.loc[~df.Col1.isin(df.Col1[df.Col2.eq(0)]),'Col1'].unique()
Out[352]: array(['A', 'C'], dtype=object)
From a comment by cs95:
df.loc[df['Col2'].astype(bool).groupby(df['Col1']).transform('all'), 'Col1'].unique()
We can use all with groupby:
out = df.Col2.groupby(df.Col1).all()
out.index[out].tolist()
# ['A', 'C']
Comprehension
[k for k, d in df.Col2.eq(1).groupby(df.Col1) if d.all()]
['A', 'C']
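One more sketch, assuming Col2 only ever holds 0 or 1: a group whose minimum is 1 contains only 1s.
import pandas as pd

df = pd.DataFrame({'Col1': list('AAABBBBBCCCC'),
                   'Col2': [1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1]})

# per-group minimum; groups with min == 1 contain only 1s
keep = df.groupby('Col1')['Col2'].min()
print(keep[keep.eq(1)].index.tolist())  # ['A', 'C']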