How to do intersection match between 2 DataFrames in Pandas? - pandas

Assume there are two DataFrames, A and B, like the following:
A:
a A
b B
c C
B:
1 2
3 4
How can I produce a DataFrame C like this?
a A 1 2
a A 3 4
b B 1 2
b B 3 4
c C 1 2
c C 3 4
Is there a function in Pandas that can do this operation?

First, all values have to be unique in each DataFrame.
I think you need product:
import pandas as pd
from itertools import product
A = pd.DataFrame({'a':list('abc')})
B = pd.DataFrame({'a':[1,2]})
C = pd.DataFrame(list(product(A['a'], B['a'])))
print (C)
0 1
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c 2
Pure pandas solutions with MultiIndex.from_product:
mux = pd.MultiIndex.from_product([A['a'], B['a']])
C = pd.DataFrame(mux.values.tolist())
print (C)
0 1
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c 2
C = mux.to_frame().reset_index(drop=True)
print (C)
0 1
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c 2
Solution using a cross join: merge on a helper column that assign fills with the same scalar in both DataFrames:
df = pd.merge(A.assign(tmp=1), B.assign(tmp=1), on='tmp').drop('tmp', axis=1)
df.columns = ['a','b']
print (df)
a b
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c 2
EDIT:
A = pd.DataFrame({'a':list('abc'), 'b':list('ABC')})
B = pd.DataFrame({'a':[1,3], 'c':[2,4]})
print (A)
a b
0 a A
1 b B
2 c C
print (B)
a c
0 1 2
1 3 4
C = pd.merge(A.assign(tmp=1), B.assign(tmp=1), on='tmp').drop('tmp', axis=1)
C.columns = list('abcd')
print (C)
a b c d
0 a A 1 2
1 a A 3 4
2 b B 1 2
3 b B 3 4
4 c C 1 2
5 c C 3 4
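The helper-column merge above can also be written without the temporary column. The following is a minimal sketch, assuming pandas 1.2 or newer, where merge accepts how='cross'; the data is the same as in the EDIT above.

```python
import pandas as pd

A = pd.DataFrame({'a': list('abc'), 'b': list('ABC')})
B = pd.DataFrame({'a': [1, 3], 'c': [2, 4]})

# how='cross' pairs every row of A with every row of B (Cartesian product).
# The overlapping column name 'a' gets the default suffixes (_x, _y),
# so the columns are renamed afterwards, as in the answer above.
C = A.merge(B, how='cross')
C.columns = list('abcd')
print(C)
```

Row order follows the left DataFrame first, so each row of A is repeated once per row of B.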

Related

How to create a rolling unique count by group using pandas

I have a dataframe like the following:
group value
1 a
1 a
1 b
1 b
1 b
1 b
1 c
2 d
2 d
2 d
2 d
2 e
I want to create a column with how many unique values there have been so far for the group. Like below:
group value group_value_id
1 a 1
1 a 1
1 b 2
1 b 2
1 b 2
1 b 2
1 c 3
2 d 1
2 d 1
2 d 1
2 d 1
2 e 2
Use custom lambda function with GroupBy.transform and factorize:
df['group_value_id']=df.groupby('group')['value'].transform(lambda x:pd.factorize(x)[0]) + 1
print (df)
group value group_value_id
0 1 a 1
1 1 a 1
2 1 b 2
3 1 b 2
4 1 b 2
5 1 b 2
6 1 c 3
7 2 d 1
8 2 d 1
9 2 d 1
10 2 d 1
11 2 e 2
A custom function is needed because GroupBy.rank fails on the non-numeric value column:
df['group_value_id'] = df.groupby('group')['value'].rank('dense')
print (df)
DataError: No numeric types to aggregate
It can also be solved with category codes (transform keeps the original index, so the result aligns with df):
df['group_val_id'] = (df.groupby('group')['value']
                        .transform(lambda x: x.astype('category').cat.codes + 1))
df
group value group_val_id
0 1 a 1
1 1 a 1
2 1 b 2
3 1 b 2
4 1 b 2
5 1 b 2
6 1 c 3
7 2 d 1
8 2 d 1
9 2 d 1
10 2 d 1
11 2 e 2
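The two approaches above agree on the sample data, but they are not equivalent in general. As a minimal sketch on made-up data (the small DataFrame below is an assumption for illustration, not from the question): factorize numbers values in order of first appearance, while category codes number them in sorted order.

```python
import pandas as pd

# Hypothetical data where 'b' appears before 'a' within group 1.
df = pd.DataFrame({'group': [1, 1, 1, 2, 2],
                   'value': ['b', 'a', 'b', 'd', 'e']})

g = df.groupby('group')['value']

# Order of first appearance within each group.
by_appearance = g.transform(lambda x: pd.factorize(x)[0]) + 1

# Sorted (lexical) order of the distinct values within each group.
by_sorted = g.transform(lambda x: x.astype('category').cat.codes) + 1

print(df.assign(by_appearance=by_appearance, by_sorted=by_sorted))
```

If the values within each group are already in sorted order, as in the question, both columns come out the same.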

Reorder pandas DataFrame based on repetitive set of integer in index

I have a pandas DataFrame with several columns, and I could not find a way to order its rows as follows:
I need to order the DataFrame by the label column (I), but in a repeating sequential order (like groups).
Input
I category tags
1 A #25-74
1 B #26-170
0 C #29-106
2 A #18-109
3 B #26-86
2 A #26-108
2 C #30-125
1 B #28-145
0 B #29-93
0 D #21-102
1 F #26-108
2 F #30-125
3 A #28-145
3 D #29-93
0 B #21-102
Needed Order:
I category tags
0 C #29-106
1 B #25-74
2 F #18-109
3 C #26-86
0 B #29-93
1 D #26-170
2 B #26-108
3 B #28-145
0 C #21-102
1 D #28-145
2 A #30-125
3 A #29-93
0 B #21-102
1 A #26-108
2 C #30-125
I have searched for different ways to sort but couldn't find a way to sort using only pandas.
I appreciate every help!
One idea is to add a helper column with GroupBy.cumcount and then use DataFrame.sort_values:
df['a'] = df.groupby('I').cumcount()
df = df.sort_values(['a','I'])
print (df)
I category tags a
2 0 C #29-106 0
0 1 A #25-74 0
3 2 A #18-109 0
4 3 B #26-86 0
8 0 B #29-93 1
1 1 B #26-170 1
5 2 A #26-108 1
12 3 A #28-145 1
9 0 D #21-102 2
7 1 B #28-145 2
6 2 C #30-125 2
13 3 D #29-93 2
14 0 B #21-102 3
10 1 F #26-108 3
11 2 F #30-125 3
Or sort by column I first and then change the order with Series.argsort and DataFrame.iloc:
df = df.sort_values('I')
# a stable sort keeps the I order among rows with the same cumcount
df = df.iloc[df.groupby('I').cumcount().argsort(kind='mergesort')]
print (df)
I category tags
2 0 C #29-106
0 1 A #25-74
3 2 A #18-109
4 3 B #26-86
8 0 B #29-93
1 1 B #26-170
5 2 A #26-108
12 3 A #28-145
9 0 D #21-102
7 1 B #28-145
6 2 C #30-125
13 3 D #29-93
14 0 B #21-102
10 1 F #26-108
11 2 F #30-125
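To see what the helper column does, here is a minimal sketch on a smaller, made-up frame (the data and the names occurrence and _occ are assumptions for illustration). cumcount numbers each row by how many times its I value has already appeared, so sorting by that number first and by I second produces the repeating 0, 1, 2, ... pattern.

```python
import pandas as pd

# Hypothetical small input, not the data from the question.
df = pd.DataFrame({'I': [1, 1, 0, 2, 0, 2, 1],
                   'category': list('ABCABCD')})

# 0 for the first occurrence of each I value, 1 for the second, and so on.
occurrence = df.groupby('I').cumcount()

# Sort by occurrence, then by I; drop the helper column afterwards.
out = (df.assign(_occ=occurrence)
         .sort_values(['_occ', 'I'])
         .drop(columns='_occ'))
print(out)
```

Using assign with a temporary column leaves the original df untouched, which avoids having to remove the helper column from it later.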

Replace values of duplicated rows with first record in pandas?

Input
df
id label
a 1
b 2
a 3
a 4
b 2
b 3
c 1
c 2
d 2
d 3
Expected
df
id label
a 1
b 2
a 1
a 1
b 2
b 2
c 1
c 1
d 2
d 2
For id a the label value is 1, and for id b it is 2, because 1 and 2 are the first records for a and b.
What I tried
I referred to this post, but it still did not solve the problem.
Use GroupBy.transform with 'first':
df['lb2']=df.groupby('id').label.transform('first')
df
Out[87]:
id label lb2
0 a 1 1
1 b 2 2
2 a 3 1
3 a 4 1
4 b 2 2
5 b 3 2
6 c 1 1
7 c 2 1
8 d 2 2
9 d 3 2
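The answer above writes the result to a new column lb2. If the goal is the exact expected frame from the question, the same call can overwrite label directly. A minimal sketch, rebuilding the question's input by hand:

```python
import pandas as pd

df = pd.DataFrame({'id': list('abaabbccdd'),
                   'label': [1, 2, 3, 4, 2, 3, 1, 2, 2, 3]})

# transform('first') broadcasts each group's first label back to every row
# of that group, keeping the original row order and index.
df['label'] = df.groupby('id')['label'].transform('first')
print(df)
```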

Concatenate alternate scalar column to pandas based on condition

I have a master DataFrame and a tag list, as follows:
import pandas as pd
i = ['A'] * 2 + ['B'] * 3 + ['A'] * 4 + ['B'] * 5
master = pd.DataFrame(i, columns=['cat'])
tag = [0, 1]
How can I insert a column of tags that follows the normal order for cat A but the reversed order for cat B? The expected output is:
cat tags
0 A 0
1 A 1
2 B 1
3 B 0
4 B 1
5 A 0
6 A 1
7 A 0
8 A 1
9 B 1
10 B 0
...
EDIT: Because each consecutive group has to be processed separately, I tried to create a general solution:
tag = ['a','b','c']
r = range(len(tag))
r1 = range(len(tag)-1, -1, -1)
print (dict(zip(r1, tag)))
{2: 'a', 1: 'b', 0: 'c'}
m1 = master['cat'].eq('A')
m2 = master['cat'].eq('B')
s = master['cat'].ne(master['cat'].shift()).cumsum()
master['tags'] = master.groupby(s).cumcount() % len(tag)
master.loc[m1, 'tags'] = master.loc[m1, 'tags'].map(dict(zip(r, tag)))
master.loc[m2, 'tags'] = master.loc[m2, 'tags'].map(dict(zip(r1, tag)))
print (master)
cat tags
0 A a
1 A b
2 B c
3 B b
4 B a
5 A a
6 A b
7 A c
8 A a
9 B c
10 B b
11 B a
12 B c
13 B b
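The key step in the solution above is numbering the consecutive runs with ne, shift and cumsum. The following is a minimal sketch of that step in isolation, applied to the question's original tag = [0, 1]; the names run_id, pos, forward and backward are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

master = pd.DataFrame({'cat': ['A'] * 2 + ['B'] * 3 + ['A'] * 4 + ['B'] * 5})
tag = [0, 1]

# A new run starts wherever cat differs from the previous row;
# the cumulative sum of those change flags gives each run its own id.
run_id = master['cat'].ne(master['cat'].shift()).cumsum()

# Position inside each run, wrapped to the length of the tag list.
pos = (master.groupby(run_id).cumcount() % len(tag)).to_numpy()

forward = np.array(tag)
backward = forward[::-1]

# Forward tags for A rows, reversed tags for B rows.
master['tags'] = np.where(master['cat'].eq('A'), forward[pos], backward[pos])
print(master)
```

Because the position restarts at every run, each block of A begins with the first tag and each block of B begins with the last one.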
Another approach is to create a DataFrame from the tags and merge it with a left join:
tag = ['a','b','c']
s = master['cat'].ne(master['cat'].shift()).cumsum()
master['g'] = master.groupby(s).cumcount() % len(tag)
d = {'A': tag, 'B':tag[::-1]}
df = pd.DataFrame([(k, i, x)
                   for k, v in d.items()
                   for i, x in enumerate(v)], columns=['cat','g','tags'])
print (df)
cat g tags
0 A 0 a
1 A 1 b
2 A 2 c
3 B 0 c
4 B 1 b
5 B 2 a
master = master.merge(df, on=['cat','g'], how='left').drop('g', axis=1)
print (master)
cat tags
0 A a
1 A b
2 B c
3 B b
4 B a
5 A a
6 A b
7 A c
8 A a
9 B c
10 B b
11 B a
12 B c
13 B b
The idea is to use numpy.tile to repeat the tag values enough times to cover the number of matched rows (the repeat count comes from integer division), then trim the result by slicing and assign it through both masks:
import numpy as np
le = len(tag)
m1 = master['cat'].eq('A')
m2 = master['cat'].eq('B')
s1 = m1.sum()
s2 = m2.sum()
master.loc[m1, 'tags'] = np.tile(tag, s1 // le + le)[:s1]
#swapped order for m2 mask
master.loc[m2, 'tags'] = np.tile(tag[::-1], s2// le + le)[:s2]
print (master)
cat tags
0 A 0.0
1 A 1.0
2 B 1.0
3 B 0.0
4 B 1.0
5 A 0.0
6 A 1.0
7 A 0.0
8 A 1.0
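The tile-and-slice step above can be tried on its own. This is a minimal sketch with a made-up row count n; it uses ceiling division for the repeat count, which is the smallest number of repeats that covers n (the answer's s1 // le + le also works, it just tiles more than needed).

```python
import numpy as np

tag = [0, 1]
n = 5  # hypothetical number of rows to fill

# Ceiling division: the smallest repeat count whose tiled length is >= n.
reps = -(-n // len(tag))

# Repeat the tag list, then cut it down to exactly n values.
cycled = np.tile(tag, reps)[:n]
print(cycled)
```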
If I understand correctly, use GroupBy.cumcount with Series.mod,
then invert the sequence where cat is B with Series.mask:
s = df.groupby('cat').cumcount().mod(2)
df['tags'] = s.mask(df['cat'].eq('B'), ~s.astype(bool)).astype(int)
print(df)
cat tags
0 A 0
1 A 1
2 B 1
3 B 0
4 B 1
5 A 0
6 A 1
7 A 0
8 A 1
numpy.place might help here:
#create temp column :
mapp={'A':0,'B':1}
res = (master.assign(temp=master.cat.map(mapp),
                     tag=master.cat))
#locate point where B changes to A
split_point = res.loc[res.temp.diff().eq(-1)].index
split_point
Int64Index([5], dtype='int64')
#split into sections :
spl = np.split(res.cat,split_point)
def replace(entry):
    np.place(entry.to_numpy(), entry=="A",[0,1])
    np.place(entry.to_numpy(),entry=="B",[1,0])
    return entry
res.tag = pd.concat(map(replace,spl))
res.drop('temp',axis=1)
cat tag
0 A 0
1 A 1
2 B 1
3 B 0
4 B 1
5 A 0
6 A 1
7 A 0
8 A 1
9 B 1
10 B 0
11 B 1
12 B 0
13 B 1

column names to column, pandas

What is the opposite function of pivot in Pandas?
For example I have
a b c
1 1 2
2 2 3
3 1 2
What I want
a newcol newcol2
1 b 1
1 c 2
2 b 2
2 c 3
3 b 1
3 c 2
Use pd.melt: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html
import pandas as pd
df = pd.DataFrame({'a':[1,2,3],'b':[1,2,1],'c':[2,3,2]})
pd.melt(df,id_vars=['a'])
Out[8]:
a variable value
0 1 b 1
1 2 b 2
2 3 b 1
3 1 c 2
4 2 c 3
5 3 c 2
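The output above has the right content but uses the default column names and groups rows by variable. As a minimal sketch, the var_name and value_name parameters of melt plus a sort give the exact layout asked for in the question; the sort relies on the values in a being unique, as they are here.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 1], 'c': [2, 3, 2]})

out = (df.melt(id_vars='a', var_name='newcol', value_name='newcol2')
         .sort_values(['a', 'newcol'])   # rows for the same a sit together
         .reset_index(drop=True))
print(out)
```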