Concatenate alternate scalar column to pandas DataFrame based on condition

I have a master DataFrame and a tag list, as follows:
import pandas as pd
i = ['A'] * 2 + ['B'] * 3 + ['A'] * 4 + ['B'] * 5
master = pd.DataFrame(i, columns=['cat'])
tag = [0, 1]
How can I insert a column of tags that is in normal order for cat A, but reversed for cat B? Expected output is:
cat tags
0 A 0
1 A 1
2 B 1
3 B 0
4 B 1
5 A 0
6 A 1
7 A 0
8 A 1
9 B 1
10 B 0
...

EDIT: Because it is necessary to process each consecutive group separately, I tried to create a general solution:
tag = ['a','b','c']
r = range(len(tag))
r1 = range(len(tag)-1, -1, -1)
print (dict(zip(r1, tag)))
{2: 'a', 1: 'b', 0: 'c'}
m1 = master['cat'].eq('A')
m2 = master['cat'].eq('B')
# identify consecutive groups of equal values
s = master['cat'].ne(master['cat'].shift()).cumsum()
master['tags'] = master.groupby(s).cumcount() % len(tag)
# map positions to tags, forward order for A and reversed for B
master.loc[m1, 'tags'] = master.loc[m1, 'tags'].map(dict(zip(r, tag)))
master.loc[m2, 'tags'] = master.loc[m2, 'tags'].map(dict(zip(r1, tag)))
print (master)
cat tags
0 A a
1 A b
2 B c
3 B b
4 B a
5 A a
6 A b
7 A c
8 A a
9 B c
10 B b
11 B a
12 B c
13 B b
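For the original two-tag case, the same consecutive-group idea collapses to a few lines with numpy.where (a minimal sketch, assuming the original tag = [0, 1] and a fresh master):
import numpy as np

# consecutive-run id: increments whenever 'cat' changes
grp = master['cat'].ne(master['cat'].shift()).cumsum()
c = master.groupby(grp).cumcount() % 2             # 0,1,0,1,... restarting per run
master['tags'] = np.where(master['cat'].eq('A'), c, 1 - c)   # flipped for B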
Another approach is to create a DataFrame from the tags and merge with a left join:
tag = ['a','b','c']
s = master['cat'].ne(master['cat'].shift()).cumsum()
master['g'] = master.groupby(s).cumcount() % len(tag)
d = {'A': tag, 'B':tag[::-1]}
df = pd.DataFrame([(k, i, x)
                   for k, v in d.items()
                   for i, x in enumerate(v)], columns=['cat','g','tags'])
print (df)
cat g tags
0 A 0 a
1 A 1 b
2 A 2 c
3 B 0 c
4 B 1 b
5 B 2 a
master = master.merge(df, on=['cat','g'], how='left').drop('g', axis=1)
print (master)
cat tags
0 A a
1 A b
2 B c
3 B b
4 B a
5 A a
6 A b
7 A c
8 A a
9 B c
10 B b
11 B a
12 B c
13 B b
The idea is to use numpy.tile to repeat the tag values by the number of matched values with integer division, then filter by indexing and assign via both masks:
import numpy as np

# with the original tag = [0, 1]
le = len(tag)
m1 = master['cat'].eq('A')
m2 = master['cat'].eq('B')
s1 = m1.sum()
s2 = m2.sum()
master.loc[m1, 'tags'] = np.tile(tag, s1 // le + le)[:s1]
# swapped order for the m2 mask
master.loc[m2, 'tags'] = np.tile(tag[::-1], s2 // le + le)[:s2]
print (master)
cat tags
0 A 0.0
1 A 1.0
2 B 1.0
3 B 0.0
4 B 1.0
5 A 0.0
6 A 1.0
7 A 0.0
8 A 1.0
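The tags column comes out as float because the .loc assignments fill a new column that starts out as NaN; once both masks are assigned, it can be cast back to integer:
master['tags'] = master['tags'].astype(int)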

IIUC, use GroupBy.cumcount + Series.mod.
Then we invert the sequence where cat is B with Series.mask:
s = df.groupby('cat').cumcount().mod(2)
df['tags'] = s.mask(df['cat'].eq('B'), ~s.astype(bool)).astype(int)
print(df)
cat tags
0 A 0
1 A 1
2 B 1
3 B 0
4 B 1
5 A 0
6 A 1
7 A 0
8 A 1
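Note (an observation beyond the original answer): groupby('cat').cumcount() counts across all rows of each category, so the 0/1 pattern does not restart at each consecutive run; on the full 14-row input it diverges from the expected output at row 9, where a new B run begins. Restarting per run only requires the shift/cumsum grouper used above:
s = df.groupby(df['cat'].ne(df['cat'].shift()).cumsum()).cumcount().mod(2)
df['tags'] = s.mask(df['cat'].eq('B'), 1 - s)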

numpy.place might help here:
import numpy as np

# create a temp column encoding the category:
mapp = {'A': 0, 'B': 1}
res = master.assign(temp=master.cat.map(mapp), tag=master.cat)
# locate the points where B changes to A
split_point = res.loc[res.temp.diff().eq(-1)].index
split_point
Int64Index([5], dtype='int64')
# split into sections:
spl = np.split(res.cat, split_point)
def replace(entry):
    # np.place recycles the replacement values over the matching positions
    np.place(entry.to_numpy(), entry == "A", [0, 1])
    np.place(entry.to_numpy(), entry == "B", [1, 0])
    return entry
res.tag = pd.concat(map(replace, spl))
res.drop('temp', axis=1)
cat tag
0 A 0
1 A 1
2 B 1
3 B 0
4 B 1
5 A 0
6 A 1
7 A 0
8 A 1
9 B 1
10 B 0
11 B 1
12 B 0
13 B 1
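This works because numpy.place recycles its replacement values over the True positions of the mask, so [0, 1] and [1, 0] repeat for as long as each section lasts. A minimal demonstration of that cycling:
import numpy as np
a = np.array(list('AAAAA'), dtype=object)
np.place(a, a == 'A', [0, 1])   # values are recycled across all True slots
print(a)   # [0 1 0 1 0]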

Related

Setting values whose counts are lower than a threshold as 'other'

I want to set items with count <= 1 as 'other'. Code for the input table:
import pandas as pd
df=pd.DataFrame({"item":['a','a','a','b','b','c','d']})
input table:
item
0 a
1 a
2 a
3 b
4 b
5 c
6 d
expected output:
item result
0 a a
1 a a
2 a a
3 b b
4 b b
5 c other
6 d other
How could I achieve that?
Use Series.where with a check whether all values are duplicated by Series.duplicated with keep=False:
df['result'] = df.item.where(df.item.duplicated(keep=False), 'other')
Or use GroupBy.transform with 'size' and check greater than 1 by Series.gt:
df['result'] = df.item.where(df.groupby('item')['item'].transform('size').gt(1), 'other')
Or use Series.map with Series.value_counts:
df['result'] = df.item.where(df['item'].map(df['item'].value_counts()).gt(1), 'other')
print (df)
item result
0 a a
1 a a
2 a a
3 b b
4 b b
5 c other
6 d other
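All three conditions build the same boolean mask; for example, the value_counts route first maps each item to its frequency:
print (df['item'].map(df['item'].value_counts()))
0    3
1    3
2    3
3    2
4    2
5    1
6    1
Name: item, dtype: int64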
Use numpy.where with GroupBy.transform and Series.le:
In [926]: import numpy as np
In [927]: df['result'] = np.where(df.groupby('item')['item'].transform('count').le(1), 'other', df.item)
In [928]: df
Out[928]:
item result
0 a a
1 a a
2 a a
3 b b
4 b b
5 c other
6 d other
Or use GroupBy.size with merge:
In [917]: x = df.groupby('item').size().reset_index()
In [919]: ans = df.merge(x)
In [921]: ans['result'] = np.where(ans[0].le(1), 'other', ans.item)
In [923]: ans = ans.drop(0, axis=1)
In [924]: ans
Out[924]:
item result
0 a a
1 a a
2 a a
3 b b
4 b b
5 c other
6 d other

Reorder pandas DataFrame based on a repetitive set of integers in the index

I have a pandas DataFrame that contains some columns, and I didn't find a way to order the rows as follows:
I need to order the DataFrame by the field I, but in sequential order (like groups).
Input
I category tags
1 A #25-74
1 B #26-170
0 C #29-106
2 A #18-109
3 B #26-86
2 A #26-108
2 C #30-125
1 B #28-145
0 B #29-93
0 D #21-102
1 F #26-108
2 F #30-125
3 A #28-145
3 D #29-93
0 B #21-102
Needed Order:
I category tags
0 C #29-106
1 B #25-74
2 F #18-109
3 C #26-86
0 B #29-93
1 D #26-170
2 B #26-108
3 B #28-145
0 C #21-102
1 D #28-145
2 A #30-125
3 A #29-93
0 B #21-102
1 A #26-108
2 C #30-125
I have searched for different ways to sort but couldn't find a way to do this using only pandas.
I appreciate any help!
One idea is a helper column created by GroupBy.cumcount, followed by DataFrame.sort_values:
df['a'] = df.groupby('I').cumcount()
df = df.sort_values(['a','I'])
print (df)
I category tags a
2 0 C #29-106 0
0 1 A #25-74 0
3 2 A #18-109 0
4 3 B #26-86 0
8 0 B #29-93 1
1 1 B #26-170 1
5 2 A #26-108 1
12 3 A #28-145 1
9 0 D #21-102 2
7 1 B #28-145 2
6 2 C #30-125 2
13 3 D #29-93 2
14 0 B #21-102 3
10 1 F #26-108 3
11 2 F #30-125 3
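The helper column a stays in the result; if it is not needed, drop it after sorting:
df = df.drop('a', axis=1)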
Or first sort by column I and then change the order with Series.argsort and DataFrame.iloc:
df = df.sort_values('I')
df = df.iloc[df.groupby('I').cumcount().argsort()]
print (df)
I category tags
2 0 C #29-106
0 1 A #25-74
3 2 A #18-109
4 3 B #26-86
8 0 B #29-93
1 1 B #26-170
5 2 A #26-108
12 3 A #28-145
9 0 D #21-102
7 1 B #28-145
6 2 C #30-125
13 3 D #29-93
14 0 B #21-102
10 1 F #26-108
11 2 F #30-125

If a column value does not have a certain number of occurrences in a dataframe, how to duplicate all rows with that column value?

Say this my dataframe
A B
0 a 5
1 b 2
2 d 5
3 g 3
4 m 2
5 c 0
6 u 5
7 p 3
8 q 1
9 z 1
If a particular value in column B does not occur a certain number of times, I want to duplicate all rows which have that value in column B.
For the df above, say this particular count is 3. If a value in column B occurs fewer than three times, then all rows with that value are duplicated. So rows with B values of 0, 1, and 2 are duplicated, but rows with a B value of 5 are not.
Desired result:
A B
0 a 5
1 b 2
2 d 5
3 g 3
4 m 2
5 c 0
6 u 5
7 p 3
8 q 1
9 z 1
10 b 2
11 m 2
12 g 3
13 p 3
14 c 0
15 c 0
Here is my approach
n = 3  # threshold
# spread each B group's A values across columns, one column per occurrence
df2 = (df.assign(columns = df.groupby('B').cumcount())
         .pivot_table(columns = 'columns',
                      index = 'B',
                      values = 'A',
                      aggfunc = 'first')
       )
r = max(n, len(df2.columns))
df2 = df2.reindex(columns = range(r))
notNaN_count = df2.count(axis=1)
# groups that stay below n even when doubled get forward-filled up to n
m_ffill = notNaN_count.mul(2).lt(n)
# the remaining deficient groups are simply repeated once more
repeats = notNaN_count.lt(n).mul(~m_ffill).add(1)
new_df = (df2.ffill(axis = 1)
             .where(m_ffill, df2)
             .reindex(index = df2.index.repeat(repeats))
             .stack()
             .rename('A')
             .reset_index()
             .loc[:, df.columns]
          )
print(new_df)
print(new_df)
Output
A B
0 c 0
1 c 0
2 c 0
3 q 1
4 z 1
5 q 1
6 z 1
7 b 2
8 m 2
9 b 2
10 m 2
11 g 3
12 p 3
13 g 3
14 p 3
15 a 5
16 d 5
17 u 5
If, instead of duplicating, we want to multiply by a factor d,
we must make the following modifications:
n = 3
d = 2
m_ffill = notNaN_count.mul(d).lt(n)
repeats = notNaN_count.lt(n).mul(~m_ffill).mul(d).clip(lower = 1)
EDIT: a generalization to all value columns other than B:
n = 3  # threshold
d = 2
values = df.columns.difference(['B'])
df2 = (df.assign(columns = df.groupby('B').cumcount())
         .pivot_table(columns = 'columns',
                      index = 'B',
                      values = values,
                      aggfunc = 'first'))
r = max(n, len(df2.columns.get_level_values('columns').unique()))
df2 = df2.reindex(columns = range(r), level = 'columns')
notNaN_count = df2.count(axis=1).div(len(values))
m_ffill = notNaN_count.mul(d).lt(n)
repeats = notNaN_count.lt(n).mul(~m_ffill).mul(d).clip(lower = 1)
new_df = (df2.T
             .groupby(level=0)
             .ffill()
             .T
             .where(m_ffill, df2)
             .reindex(index = df2.index.repeat(repeats))
             .stack()
             .reset_index()
             .loc[:, df.columns]
          )
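For comparison, the same multiplicities as the first output (though not the same row order) can also be sketched with Index.repeat, repeating each deficient row ceil(n/count) times so that every B value reaches at least n occurrences (an alternative sketch, not the approach above):
import numpy as np

n = 3
counts = df['B'].map(df['B'].value_counts())
reps = np.where(counts < n, -(-n // counts), 1)   # ceil(n/count) for deficient values, else 1
out = df.loc[df.index.repeat(reps)].reset_index(drop=True)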

How to impute column values on Dask Dataframe?

I would like to impute negative values in a Dask DataFrame. With pandas I use this code:
df.loc[(df.column_name < 0),'column_name'] = 0
I think you need dask.dataframe.Series.clip_lower:
ddf['B'] = ddf['B'].clip_lower(0)
Sample:
import pandas as pd
df = pd.DataFrame({'F':list('abcdef'),
                   'B':[-4,5,4,-5,5,4],
                   'A':list('aaabbb')})
print (df)
A B F
0 a -4 a
1 a 5 b
2 a 4 c
3 b -5 d
4 b 5 e
5 b 4 f
from dask import dataframe as dd
ddf = dd.from_pandas(df, npartitions=3)
#print (ddf)
ddf['B'] = ddf['B'].clip_lower(0)
print (ddf.compute())
A B F
0 a 0 a
1 a 5 b
2 a 4 c
3 b 0 d
4 b 5 e
5 b 4 f
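Note that clip_lower was later deprecated and removed in pandas, and dask follows the pandas API; on current versions the equivalent call is clip with the lower keyword:
ddf['B'] = ddf['B'].clip(lower=0)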
For a more general solution use dask.dataframe.Series.mask:
ddf['B'] = ddf['B'].mask(ddf['B'] > 0, 3)
print (ddf.compute())
A B F
0 a -4 a
1 a 3 b
2 a 3 c
3 b -5 d
4 b 3 e
5 b 3 f
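The original pandas pattern, replacing negatives with 0, can be written the same way with mask:
ddf['B'] = ddf['B'].mask(ddf['B'] < 0, 0)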

How to do intersection match between 2 DataFrames in Pandas?

Assume there exist 2 DataFrames A and B like the following:
A:
a A
b B
c C
B:
1 2
3 4
How to produce a DataFrame C like:
a A 1 2
a A 3 4
b B 1 2
b B 3 4
c C 1 2
c C 3 4
Is there some function in Pandas that can do this operation?
First, all values have to be unique in each DataFrame.
I think you need product:
from itertools import product
A = pd.DataFrame({'a':list('abc')})
B = pd.DataFrame({'a':[1,2]})
C = pd.DataFrame(list(product(A['a'], B['a'])))
print (C)
0 1
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c 2
Pure pandas solutions with MultiIndex.from_product:
mux = pd.MultiIndex.from_product([A['a'], B['a']])
C = pd.DataFrame(mux.values.tolist())
print (C)
0 1
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c 2
C = mux.to_frame().reset_index(drop=True)
print (C)
0 1
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c 2
Solution with a cross join using merge, with a helper column filled by the same scalar via assign:
df = pd.merge(A.assign(tmp=1), B.assign(tmp=1), on='tmp').drop('tmp', axis=1)
df.columns = ['a','b']
print (df)
a b
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c 2
EDIT:
A = pd.DataFrame({'a':list('abc'), 'b':list('ABC')})
B = pd.DataFrame({'a':[1,3], 'c':[2,4]})
print (A)
a b
0 a A
1 b B
2 c C
print (B)
a c
0 1 2
1 3 4
C = pd.merge(A.assign(tmp=1), B.assign(tmp=1), on='tmp').drop('tmp', axis=1)
C.columns = list('abcd')
print (C)
a b c d
0 a A 1 2
1 a A 3 4
2 b B 1 2
3 b B 3 4
4 c C 1 2
5 c C 3 4
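On pandas 1.2+ the helper column is no longer needed, because merge supports a native cross join:
C = A.merge(B, how='cross')
C.columns = list('abcd')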