Left Join and Anti-Join on the same data frames in Pandas

I have 2 dataframes like these:
df1 = pd.DataFrame(data = {'col1' : ['finance', 'accounting'], 'col2' : ['f1', 'a1']})
df2 = pd.DataFrame(data = {'col1' : ['finance', 'finance', 'finance', 'accounting', 'accounting', 'IT', 'IT'], 'col2' : ['f1', 'f2', 'f3', 'a1', 'a2', 'I1', 'I2']})
df1
col1 col2
0 finance f1
1 accounting a1
df2
col1 col2
0 finance f1
1 finance f2
2 finance f3
3 accounting a1
4 accounting a2
5 IT I1
6 IT I2
I would like to do LEFT JOIN on col1 and ANTI-JOIN on col2. The output should look like this:
col1 col2
finance f2
finance f3
accounting a2
Could someone please help me do this properly in pandas? I tried both join and merge, but neither worked for me. Thanks in advance.

You can merge and filter:
(df1.merge(df2, on='col1', suffixes=('_', None))   # df1's col2 gets the '_' suffix, df2's keeps its name
 .loc[lambda d: d['col2'] != d.pop('col2_')]       # keep rows where df2's col2 differs from df1's, dropping the helper column
)
Output:
col1 col2
1 finance f2
2 finance f3
4 accounting a2
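If you prefer the classic indicator-based anti-join, here is a minimal sketch of the same result (assuming df1/df2 as defined above; the variable names candidates/merged/out are just for illustration): first restrict df2 to the col1 values present in df1, then drop the (col1, col2) pairs that also appear in df1.
import pandas as pd

df1 = pd.DataFrame({'col1': ['finance', 'accounting'], 'col2': ['f1', 'a1']})
df2 = pd.DataFrame({'col1': ['finance', 'finance', 'finance', 'accounting', 'accounting', 'IT', 'IT'],
                    'col2': ['f1', 'f2', 'f3', 'a1', 'a2', 'I1', 'I2']})

# keep only rows whose col1 occurs in df1 (the "left join on col1" part)
candidates = df2[df2['col1'].isin(df1['col1'])]

# anti-join: drop rows whose (col1, col2) pair also exists in df1
merged = candidates.merge(df1, on=['col1', 'col2'], how='left', indicator=True)
out = merged.loc[merged['_merge'] == 'left_only', ['col1', 'col2']]
print(out)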

Just for fun, here's another way (other than the really elegant solution by @mozway):
df2 = (df2
    .reset_index()                    # save index as column 'index'
    .set_index('col1')                # make 'col1' the index
    .loc[df1.col1, :]                 # filter for 'col1' values in df1
    .set_index('col2', append=True)   # add 'col2' to the index
    .drop(index=df1.set_index(list(df1.columns)).index)  # build a multi-index from df1 and drop all matches from df2
    .reset_index()                    # make 'col1' and 'col2' columns again
    .set_index('index')               # make 'index' the index again
    .rename_axis(index=None))         # make the index anonymous
Output:
col1 col2
1 finance f2
2 finance f3
4 accounting a2

Related

Pandas pivot table or groupby absolute maximum of column

I have a dataframe df as:
Col1 Col2
A -5
A 3
B -2
B 15
I need to get the following:
Col1 Col2
A -5
B 15
where, for each group in Col1, the row with the maximum absolute value of Col2 is selected. I am not sure how to proceed with this.
Use DataFrameGroupBy.idxmax on the absolute values to get the index of each group's row, then select those rows with DataFrame.loc:
df = df.loc[df['Col2'].abs().groupby(df['Col1']).idxmax()]
# alternative: reassign the column with its absolute values first
df = df.loc[df.assign(Col2=df['Col2'].abs()).groupby('Col1')['Col2'].idxmax()]
print(df)
Col1 Col2
0 A -5
3 B 15
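Just as a sketch of an alternative (assuming the same df; the helper column name abs_col2 is arbitrary): sort by the absolute value and keep the first row per group.
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'A', 'B', 'B'], 'Col2': [-5, 3, -2, 15]})

# sort by |Col2| descending, then keep the first (i.e. absolute-maximum) row per Col1
out = (df.assign(abs_col2=df['Col2'].abs())
         .sort_values('abs_col2', ascending=False)
         .drop_duplicates('Col1')
         .drop(columns='abs_col2')
         .sort_index())
print(out)
#   Col1  Col2
# 0    A    -5
# 3    B    15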

Substring column in Pandas based on another column

I'm trying to substring a column based on the length of another column but the resultset is NaN. What am I doing wrong?
import pandas as pd
df = pd.DataFrame([['abcdefghi','xyz'], ['abcdefghi', 'z']], columns=['col1', 'col2'])
df.col1.str[:df.col2.str.len()]
0 NaN
1 NaN
Name: col1, dtype: float64
Here is what I am expecting:
0 'abc'
1 'a'
I don't think string indexing will take a Series. I would use a list comprehension:
df['extract'] = [r.col1[:len(r.col2)] for _,r in df.iterrows()]
Or
df['extract'] = [s1[:len(s2)] for s1,s2 in zip(df.col1, df.col2)]
Output:
col1 col2 extract
0 abcdefghi xyz abc
1 abcdefghi z a
Using numpy and converting the array to pd.Series:
import numpy as np

def slicer(start=None, stop=None, step=1):
    return np.vectorize(lambda x: x[start:stop:step], otypes=[str])

df["new_str"] = pd.Series(
    [slicer(0, i)(c) for i, c in zip(df["col2"].apply(len), df["col1"].values)]
)
print(df)
col1 col2 new_str
0 abcdefghi xyz abc
1 abcdefghi z a
Here is a solution using lambda:
df['new'] = df.apply(lambda row: row['col1'][0:len(row['col2'])], axis=1)
Result:
col1 col2 new
0 abcdefghi xyz abc
1 abcdefghi z a

Filter dataframe based on the quantile per group of values

Let's suppose that I have a dataframe like that:
import pandas as pd
df = pd.DataFrame({'col1':['A','A', 'A', 'B','B'], 'col2':[2, 4, 6, 3, 4]})
I want to keep only the rows whose col2 value is less than the x-th quantile of col2, computed separately for each group of values in col1.
For example, for the 60th percentile the dataframe should look like this:
col1 col2
0 A 2
1 A 4
2 B 3
How can I do this efficiently in pandas?
We can use transform with quantile:
df[df.col2.lt(df.groupby('col1').col2.transform(lambda x : x.quantile(0.6)))]
col1 col2
0 A 2
1 A 4
3 B 3
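A two-step sketch of the same filter (assuming the same df): compute each group's quantile once with groupby().quantile(), map it back onto the rows, and compare.
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'A', 'B', 'B'], 'col2': [2, 4, 6, 3, 4]})

# 60th percentile of col2 within each col1 group
q = df.groupby('col1')['col2'].quantile(0.6)

# keep rows whose col2 is below their group's quantile
out = df[df['col2'] < df['col1'].map(q)]
print(out)
#   col1  col2
# 0    A     2
# 1    A     4
# 3    B     3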

Conditional operation on pandas column

df1 =
name col1
a 1
b 2
c 3
d 4
df2 =
name col2
b 3
c 9
a 2
d 3
I want to match names in both data frames and multiply the other two columns accordingly, so the output would look like this:
df3 =
name col_new
a 2
b 6
c 27
d 12
Use Series.map for correct alignment, multiply with Series.mul, and extract the original column with DataFrame.pop:
df1['col_new'] = df1.pop('col1').mul(df1['name'].map(df2.set_index('name')['col2']))
For a new DataFrame use DataFrame.assign:
df3 = df1.assign(col_new = df1.pop('col1').mul(df1['name'].map(df2.set_index('name')['col2'])))
Or another solution with DataFrame.merge and a left join:
df3 = df1.merge(df2, on='name', how='left')
df3['col_new'] = df3.pop('col1').mul(df3.pop('col2'))
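For reference, here is a runnable sketch of the map-and-multiply approach with the sample data above (keeping df1's col1 intact rather than popping it):
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c', 'd'], 'col1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'name': ['b', 'c', 'a', 'd'], 'col2': [3, 9, 2, 3]})

# align df2's col2 to df1's names, multiply, and keep only name/col_new
df3 = df1.assign(col_new=df1['col1'].mul(df1['name'].map(df2.set_index('name')['col2'])))[['name', 'col_new']]
print(df3)
#   name  col_new
# 0    a        2
# 1    b        6
# 2    c       27
# 3    d       12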

Group values together in Pandas column, then filter values in another column

The Pandas DataFrame looks like this:
Col1 Col2
A 1
A 1
A 1
B 0
B 0
B 1
B 1
B 1
C 1
C 1
C 1
C 1
I want to group on Col1, then check Col2 to see whether all values for that group (e.g. A) are 1. In this example the desired output would be:
[A, C]
(because only A and C have all values set to 1). How do I do this?
In your case, use groupby with all:
df.groupby('Col1').Col2.all().loc[lambda x: x].index.tolist()
Out[350]: ['A', 'C']
Or without groupby
df.loc[~df.Col1.isin(df.Col1[df.Col2.eq(0)]),'Col1'].unique()
Out[352]: array(['A', 'C'], dtype=object)
From a comment by cs95:
df.loc[df['Col2'].astype(bool).groupby(df['Col1']).transform('all'), 'Col1'].unique()
We can use all with groupby:
out = df.Col2.groupby(df.Col1).all()
out.index[out].tolist()
# ['A', 'C']
Comprehension
[k for k, d in df.Col2.eq(1).groupby(df.Col1) if d.all()]
['A', 'C']
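One more sketch, assuming Col2 only ever holds 0 or 1: a group whose minimum is 1 contains only 1s.
import pandas as pd

df = pd.DataFrame({'Col1': list('AAABBBBBCCCC'),
                   'Col2': [1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1]})

# per-group minimum; groups with min == 1 contain only 1s
keep = df.groupby('Col1')['Col2'].min()
print(keep[keep.eq(1)].index.tolist())  # ['A', 'C']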