How to do intersection match between 2 DataFrames in Pandas?

Assume there are two DataFrames, A and B, like the following:
A:
a A
b B
c C
B:
1 2
3 4
How can I produce a DataFrame C like this?
a A 1 2
a A 3 4
b B 1 2
b B 3 4
c C 1 2
c C 3 4
Is there a function in pandas that can do this operation?

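First, all values have to be unique in each DataFrame.
I think you need itertools.product, which builds the Cartesian product (cross join) of the two columns:
import pandas as pd
from itertools import product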
A = pd.DataFrame({'a':list('abc')})
B = pd.DataFrame({'a':[1,2]})
C = pd.DataFrame(list(product(A['a'], B['a'])))
print (C)
0 1
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c 2
Pure pandas solutions with MultiIndex.from_product:
mux = pd.MultiIndex.from_product([A['a'], B['a']])
C = pd.DataFrame(mux.values.tolist())
print (C)
0 1
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c 2
C = mux.to_frame().reset_index(drop=True)
print (C)
0 1
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c 2
A cross join with merge, using a temporary key column filled with the same scalar by assign:
df = pd.merge(A.assign(tmp=1), B.assign(tmp=1), on='tmp').drop('tmp', axis=1)
df.columns = ['a','b']
print (df)
a b
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c 2
EDIT:
A = pd.DataFrame({'a':list('abc'), 'b':list('ABC')})
B = pd.DataFrame({'a':[1,3], 'c':[2,4]})
print (A)
a b
0 a A
1 b B
2 c C
print (B)
a c
0 1 2
1 3 4
C = pd.merge(A.assign(tmp=1), B.assign(tmp=1), on='tmp').drop('tmp', axis=1)
C.columns = list('abcd')
print (C)
a b c d
0 a A 1 2
1 a A 3 4
2 b B 1 2
3 b B 3 4
4 c C 1 2
5 c C 3 4
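On pandas 1.2 or newer, the temporary key column can be avoided with the how='cross' option of merge. A minimal sketch, assuming the same A and B as in the EDIT above:

```python
import pandas as pd

A = pd.DataFrame({'a': list('abc'), 'b': list('ABC')})
B = pd.DataFrame({'a': [1, 3], 'c': [2, 4]})

# how='cross' builds the Cartesian product directly (pandas >= 1.2),
# so no helper column has to be added and dropped.
C = A.merge(B, how='cross')

# Both inputs have a column named 'a', so merge adds suffixes;
# rename all four columns explicitly.
C.columns = list('abcd')
print(C)
```

Every row of A is paired with every row of B, in the same row order as the tmp-key version.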

Related

How to create a rolling unique count by group using pandas

I have a dataframe like the following:
group value
1 a
1 a
1 b
1 b
1 b
1 b
1 c
2 d
2 d
2 d
2 d
2 e
I want to create a column that counts how many unique values have appeared so far within each group, like below:
group value group_value_id
1 a 1
1 a 1
1 b 2
1 b 2
1 b 2
1 b 2
1 c 3
2 d 1
2 d 1
2 d 1
2 d 1
2 e 2

Use a custom lambda function with GroupBy.transform and factorize:
df['group_value_id']=df.groupby('group')['value'].transform(lambda x:pd.factorize(x)[0]) + 1
print (df)
group value group_value_id
0 1 a 1
1 1 a 1
2 1 b 2
3 1 b 2
4 1 b 2
5 1 b 2
6 1 c 3
7 2 d 1
8 2 d 1
9 2 d 1
10 2 d 1
11 2 e 2
The lambda is needed because rank does not work on string values:
df['group_value_id'] = df.groupby('group')['value'].rank('dense')
print (df)
DataError: No numeric types to aggregate

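This can also be solved with category codes (transform keeps the original index; the codes follow the sorted order of the values):
df['group_val_id'] = (df.groupby('group')['value']
                        .transform(lambda x: x.astype('category').cat.codes + 1))
df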
group value group_val_id
0 1 a 1
1 1 a 1
2 1 b 2
3 1 b 2
4 1 b 2
5 1 b 2
6 1 c 3
7 2 d 1
8 2 d 1
9 2 d 1
10 2 d 1
11 2 e 2
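Both approaches above number the distinct values inside each group. That equals a running unique count only when equal values sit next to each other, as they do in the sample. If a value can reappear later in the same group, one possible alternative (not taken from the answers above) is to flag the first occurrence of each (group, value) pair and take a cumulative sum per group:

```python
import pandas as pd

df = pd.DataFrame({
    'group': [1] * 7 + [2] * 5,
    'value': list('aabbbbc') + list('dddde'),
})

# 1 where a (group, value) pair appears for the first time, 0 otherwise.
first_seen = (~df.duplicated(['group', 'value'])).astype(int)

# Running total of first occurrences within each group.
df['group_value_id'] = first_seen.groupby(df['group']).cumsum()
print(df)
```

On the sample data this gives the same column as the factorize solution.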

Reorder pandas DataFrame based on repetitive set of integer in index

I have a pandas DataFrame with several columns and could not find a way to order its rows as follows:
I need to order the DataFrame by the field I, but in a repeating sequential order (0, 1, 2, 3, then 0, 1, 2, 3 again, like groups).
Input
I category tags
1 A #25-74
1 B #26-170
0 C #29-106
2 A #18-109
3 B #26-86
2 A #26-108
2 C #30-125
1 B #28-145
0 B #29-93
0 D #21-102
1 F #26-108
2 F #30-125
3 A #28-145
3 D #29-93
0 B #21-102
Needed Order:
I category tags
0 C #29-106
1 B #25-74
2 F #18-109
3 C #26-86
0 B #29-93
1 D #26-170
2 B #26-108
3 B #28-145
0 C #21-102
1 D #28-145
2 A #30-125
3 A #29-93
0 B #21-102
1 A #26-108
2 C #30-125
I have searched for different ways to sort but couldn't find a way to sort using only pandas.
I appreciate every help!

One idea is a helper column created by GroupBy.cumcount, followed by DataFrame.sort_values:
df['a'] = df.groupby('I').cumcount()
df = df.sort_values(['a','I'])
print (df)
I category tags a
2 0 C #29-106 0
0 1 A #25-74 0
3 2 A #18-109 0
4 3 B #26-86 0
8 0 B #29-93 1
1 1 B #26-170 1
5 2 A #26-108 1
12 3 A #28-145 1
9 0 D #21-102 2
7 1 B #28-145 2
6 2 C #30-125 2
13 3 D #29-93 2
14 0 B #21-102 3
10 1 F #26-108 3
11 2 F #30-125 3
Or first sort by column I and then change the order with Series.argsort and DataFrame.iloc (stable sorts keep the existing row order among ties):
df = df.sort_values('I', kind='stable')
df = df.iloc[df.groupby('I').cumcount().argsort(kind='stable')]
print (df)
I category tags
2 0 C #29-106
0 1 A #25-74
3 2 A #18-109
4 3 B #26-86
8 0 B #29-93
1 1 B #26-170
5 2 A #26-108
12 3 A #28-145
9 0 D #21-102
7 1 B #28-145
6 2 C #30-125
13 3 D #29-93
14 0 B #21-102
10 1 F #26-108
11 2 F #30-125
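The cumcount idea can be tried end to end on a small frame. This is only a sketch: the data and the helper column name rep are made up for illustration, and the helper is dropped again after sorting.

```python
import pandas as pd

# Small hypothetical frame; only the I column matters for the ordering.
df = pd.DataFrame({'I': [1, 1, 0, 2, 0, 2],
                   'category': list('ABCABC')})

out = (
    df.assign(rep=df.groupby('I').cumcount())  # 0 for the first row of each I, 1 for the second, ...
      .sort_values(['rep', 'I'])               # repetition number first, then I inside it
      .drop(columns='rep')                     # remove the helper column
)
print(out)
```

The I column of out reads 0, 1, 2, 0, 1, 2, and the original index labels are kept so each row can still be traced back.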

Replace values of duplicated rows with first record in pandas?

Input
df
id label
a 1
b 2
a 3
a 4
b 2
b 3
c 1
c 2
d 2
d 3
Expected
df
id label
a 1
b 2
a 1
a 1
b 2
b 2
c 1
c 1
d 2
d 2
For id a the label becomes 1 and for id b it becomes 2, because 1 and 2 are the first records for a and b.
Try
I referred to this post, but still could not solve it.

Use GroupBy.transform with 'first':
df['lb2']=df.groupby('id').label.transform('first')
df
Out[87]:
id label lb2
0 a 1 1
1 b 2 2
2 a 3 1
3 a 4 1
4 b 2 2
5 b 3 2
6 c 1 1
7 c 2 1
8 d 2 2
9 d 3 2
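The answer writes the result to a new column lb2. To get the expected output exactly, the same transform can overwrite label. A self-contained sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'id': list('abaabbccdd'),
                   'label': [1, 2, 3, 4, 2, 3, 1, 2, 2, 3]})

# transform('first') broadcasts the first label of each id to all of its rows.
df['label'] = df.groupby('id')['label'].transform('first')
print(df)
```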

Concatenate alternate scalar column to pandas based on condition

I have a master DataFrame and a tag list, as follows:
import pandas as pd
i = ['A'] * 2 + ['B'] * 3 + ['A'] * 4 + ['B'] * 5
master = pd.DataFrame(i, columns=['cat'])
tag = [0, 1]
How to insert a column of tags that is normal for cat: A, but reversed for cat: B? Expected output is:
cat tags
0 A 0
1 A 1
2 B 1
3 B 0
4 B 1
5 A 0
6 A 1
7 A 0
8 A 1
9 B 1
10 B 0
...

EDIT: Because each consecutive group has to be processed separately, I tried to create a general solution:
tag = ['a','b','c']
r = range(len(tag))
r1 = range(len(tag)-1, -1, -1)
print (dict(zip(r1, tag)))
{2: 'a', 1: 'b', 0: 'c'}
m1 = master['cat'].eq('A')
m2 = master['cat'].eq('B')
s = master['cat'].ne(master['cat'].shift()).cumsum()
master['tags'] = master.groupby(s).cumcount() % len(tag)
master.loc[m1, 'tags'] = master.loc[m1, 'tags'].map(dict(zip(r, tag)))
master.loc[m2, 'tags'] = master.loc[m2, 'tags'].map(dict(zip(r1, tag)))
print (master)
cat tags
0 A a
1 A b
2 B c
3 B b
4 B a
5 A a
6 A b
7 A c
8 A a
9 B c
10 B b
11 B a
12 B c
13 B b
Another approach is to create a DataFrame from the tags and merge it with a left join:
tag = ['a','b','c']
s = master['cat'].ne(master['cat'].shift()).cumsum()
master['g'] = master.groupby(s).cumcount() % len(tag)
d = {'A': tag, 'B':tag[::-1]}
df = pd.DataFrame([(k, i, x)
                   for k, v in d.items()
                   for i, x in enumerate(v)], columns=['cat','g','tags'])
print (df)
cat g tags
0 A 0 a
1 A 1 b
2 A 2 c
3 B 0 c
4 B 1 b
5 B 2 a
master = master.merge(df, on=['cat','g'], how='left').drop('g', axis=1)
print (master)
cat tags
0 A a
1 A b
2 B c
3 B b
4 B a
5 A a
6 A b
7 A c
8 A a
9 B c
10 B b
11 B a
12 B c
13 B b
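The idea is to use numpy.tile to repeat the tag values often enough to cover the matched rows (the repeat count comes from integer division), then trim by slicing and assign through both masks. Here tag = [0, 1], as in the question:
import numpy as np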
le = len(tag)
m1 = master['cat'].eq('A')
m2 = master['cat'].eq('B')
s1 = m1.sum()
s2 = m2.sum()
master.loc[m1, 'tags'] = np.tile(tag, s1 // le + le)[:s1]
#swapped order for m2 mask
master.loc[m2, 'tags'] = np.tile(tag[::-1], s2// le + le)[:s2]
print (master)
cat tags
0 A 0.0
1 A 1.0
2 B 1.0
3 B 0.0
4 B 1.0
5 A 0.0
6 A 1.0
7 A 0.0
8 A 1.0

If I understand correctly, use GroupBy.cumcount + Series.mod.
Then invert the sequence where cat is B with Series.mask:
s = df.groupby('cat').cumcount().mod(2)
df['tags'] = s.mask(df['cat'].eq('B'), ~s.astype(bool)).astype(int)
print(df)
cat tags
0 A 0
1 A 1
2 B 1
3 B 0
4 B 1
5 A 0
6 A 1
7 A 0
8 A 1
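The snippet above counts over all rows of each cat. For the full 14-row master from the question, a variant that restarts the count in each consecutive run reproduces the expected column. This is a sketch for the two-tag case only; the names run_id and pos are illustrative:

```python
import pandas as pd

cats = ['A'] * 2 + ['B'] * 3 + ['A'] * 4 + ['B'] * 5
master = pd.DataFrame({'cat': cats})

# Label each consecutive run of equal cat values: 1, 1, 2, 2, 2, 3, ...
run_id = master['cat'].ne(master['cat'].shift()).cumsum()

# Position inside the run, alternating 0, 1, 0, 1, ...
pos = master.groupby(run_id).cumcount() % 2

# Keep the sequence for A and flip it (0 <-> 1) for B.
master['tags'] = pos.where(master['cat'].eq('A'), 1 - pos)
print(master)
```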

numpy.place might help here:
#create temp column :
mapp={'A':0,'B':1}
res = (master.assign(temp=master.cat.map(mapp),
                     tag=master.cat
                     )
       )
#locate point where B changes to A
split_point = res.loc[res.temp.diff().eq(-1)].index
split_point
Int64Index([5], dtype='int64')
#split into sections :
spl = np.split(res.cat,split_point)
def replace(entry):
    np.place(entry.to_numpy(), entry=="A",[0,1])
    np.place(entry.to_numpy(),entry=="B",[1,0])
    return entry
res.tag = pd.concat(map(replace,spl))
res.drop('temp',axis=1)
cat tag
0 A 0
1 A 1
2 B 1
3 B 0
4 B 1
5 A 0
6 A 1
7 A 0
8 A 1
9 B 1
10 B 0
11 B 1
12 B 0
13 B 1

column names to column, pandas

What is the opposite function of pivot in pandas?
For example I have
a b c
1 1 2
2 2 3
3 1 2
What I want
a newcol newcol2
1 b 1
1 c 2
2 b 2
2 c 3
3 b 1
3 c 2

Use pd.melt: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html
import pandas as pd
df = pd.DataFrame({'a':[1,2,3],'b':[1,2,1],'c':[2,3,2]})
pd.melt(df,id_vars=['a'])
Out[8]:
a variable value
0 1 b 1
1 2 b 2
2 3 b 1
3 1 c 2
4 2 c 3
5 3 c 2
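The melt output above has default column names and is ordered by variable. To match the layout in the question, the columns can be named with var_name and value_name and the rows sorted afterwards. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 1], 'c': [2, 3, 2]})

out = (
    df.melt(id_vars='a', var_name='newcol', value_name='newcol2')
      .sort_values(['a', 'newcol'], ignore_index=True)  # group the rows by a, then by former column name
)
print(out)
```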
