How to impute column values in a Dask DataFrame?

I would like to impute negative values in a Dask DataFrame. With pandas I use this code:
df.loc[(df.column_name < 0),'column_name'] = 0

I think you need dask.dataframe.Series.clip_lower:
ddf['B'] = ddf['B'].clip_lower(0)
Sample:
import pandas as pd
df = pd.DataFrame({'F':list('abcdef'),
                   'B':[-4,5,4,-5,5,4],
                   'A':list('aaabbb')})
print (df)
A B F
0 a -4 a
1 a 5 b
2 a 4 c
3 b -5 d
4 b 5 e
5 b 4 f
from dask import dataframe as dd
ddf = dd.from_pandas(df, npartitions=3)
#print (ddf)
ddf['B'] = ddf['B'].clip_lower(0)
print (ddf.compute())
A B F
0 a 0 a
1 a 5 b
2 a 4 c
3 b 0 d
4 b 5 e
5 b 4 f
For a more general solution, use dask.dataframe.Series.mask (applied here to the original, unclipped ddf):
ddf['B'] = ddf['B'].mask(ddf['B'] > 0, 3)
print (ddf.compute())
A B F
0 a -4 a
1 a 3 b
2 a 3 c
3 b -5 d
4 b 3 e
5 b 3 f
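Note that clip_lower was deprecated in pandas 0.24 and later removed, so on recent pandas/Dask versions the equivalent is clip(lower=...) or mask. A minimal sketch, assuming a current Dask release:
import pandas as pd
from dask import dataframe as dd

df = pd.DataFrame({'F': list('abcdef'),
                   'B': [-4, 5, 4, -5, 5, 4],
                   'A': list('aaabbb')})
ddf = dd.from_pandas(df, npartitions=3)

#clip(lower=0) is the modern spelling of clip_lower(0)
ddf['B'] = ddf['B'].clip(lower=0)
#or, mirroring the original pandas .loc assignment:
#ddf['B'] = ddf['B'].mask(ddf['B'] < 0, 0)
print (ddf.compute())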

Related

Setting values with value_counts lower than a threshold as 'other'

I want to set items with count <= 1 as 'other'. Code for the input table:
import pandas as pd
df=pd.DataFrame({"item":['a','a','a','b','b','c','d']})
input table:
item
0 a
1 a
2 a
3 b
4 b
5 c
6 d
expected output:
item result
0 a a
1 a a
2 a a
3 b b
4 b b
5 c other
6 d other
How could I achieve that?
Use Series.where and check whether values are duplicated by Series.duplicated with keep=False:
df['result'] = df.item.where(df.item.duplicated(keep=False), 'other')
Or use GroupBy.transform with 'size' and check greater than 1 with Series.gt:
df['result'] = df.item.where(df.groupby('item')['item'].transform('size').gt(1), 'other')
Or use Series.map with Series.value_counts:
df['result'] = df.item.where(df['item'].map(df['item'].value_counts()).gt(1), 'other')
print (df)
item result
0 a a
1 a a
2 a a
3 b b
4 b b
5 c other
6 d other
Or use numpy.where with GroupBy.transform and Series.le:
In [926]: import numpy as np
In [927]: df['result'] = np.where(df.groupby('item')['item'].transform('count').le(1), 'other', df.item)
In [928]: df
Out[928]:
item result
0 a a
1 a a
2 a a
3 b b
4 b b
5 c other
6 d other
Or use GroupBy.size with merge:
In [917]: x = df.groupby('item').size().reset_index()
In [919]: ans = df.merge(x)
In [921]: ans['result'] = np.where(ans[0].le(1), 'other', ans.item)
In [923]: ans = ans.drop(0, axis=1)
In [924]: ans
Out[924]:
item result
0 a a
1 a a
2 a a
3 b b
4 b b
5 c other
6 d other
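The same idea generalizes to any threshold; a minimal sketch wrapping the value_counts approach in a helper (group_rare is a hypothetical name, not from the answers above):
import pandas as pd

def group_rare(s, threshold=1, fill='other'):
    #keep values whose total count exceeds the threshold, replace the rest
    counts = s.map(s.value_counts())
    return s.where(counts.gt(threshold), fill)

df = pd.DataFrame({"item": ['a','a','a','b','b','c','d']})
df['result'] = group_rare(df['item'], threshold=1)
print (df)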

Merge two matrices (DataFrames) into one with interleaved columns

I have two DataFrames like these:
df1:
   a  b  c
0  1  2  3
1  2  3  4
2  3  4  5
df2:
   x  y  z
0  T  T  F
1  F  T  T
2  F  T  F
I want to merge these matrices by interleaving their columns one by one, like this:
df  a  x  b  y  c  z
0   1  T  2  T  3  F
1   2  F  3  T  4  T
2   3  F  4  T  5  F
What is your idea? How can we merge, append, or concatenate these?
I used this code; it works dynamically:
df = pd.DataFrame()
for i in range(0, 6):
    if i % 2 == 0:
        j = i / 2
        df.loc[:, i] = df1.iloc[:, int(j)]
    else:
        j = (i - 1) / 2
        df.loc[:, i] = df2.iloc[:, int(j)]
And it works correctly!
Try:
df = pd.concat([df1, df2], axis=1)
df = df[['a','x','b','y','c','z']]
Prints:
a x b y c z
0 1 T 2 T 3 F
1 2 F 3 T 4 T
2 3 F 4 T 5 F
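If the column names should not be hard-coded, the interleaved order can also be built dynamically; a small sketch assuming both frames have the same number of columns and share the row index:
import itertools
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [3, 4, 5]})
df2 = pd.DataFrame({'x': ['T', 'F', 'F'], 'y': ['T', 'T', 'T'], 'z': ['F', 'T', 'F']})

#interleave column names: a, x, b, y, c, z
order = list(itertools.chain.from_iterable(zip(df1.columns, df2.columns)))
df = pd.concat([df1, df2], axis=1)[order]
print (df)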

Concatenate an alternating scalar column to pandas based on a condition

I have a master DataFrame and a tag list, as follows:
import pandas as pd
i = ['A'] * 2 + ['B'] * 3 + ['A'] * 4 + ['B'] * 5
master = pd.DataFrame(i, columns=['cat'])
tag = [0, 1]
How do I insert a column of tags that is in normal order for cat A, but reversed for cat B? The expected output is:
cat tags
0 A 0
1 A 1
2 B 1
3 B 0
4 B 1
5 A 0
6 A 1
7 A 0
8 A 1
9 B 1
10 B 0
...
EDIT: Because it is necessary to process each consecutive group separately, I tried to create a general solution:
tag = ['a','b','c']
r = range(len(tag))
r1 = range(len(tag)-1, -1, -1)
print (dict(zip(r1, tag)))
{2: 'a', 1: 'b', 0: 'c'}
m1 = master['cat'].eq('A')
m2 = master['cat'].eq('B')
s = master['cat'].ne(master['cat'].shift()).cumsum()
master['tags'] = master.groupby(s).cumcount() % len(tag)
master.loc[m1, 'tags'] = master.loc[m1, 'tags'].map(dict(zip(r, tag)))
master.loc[m2, 'tags'] = master.loc[m2, 'tags'].map(dict(zip(r1, tag)))
print (master)
cat tags
0 A a
1 A b
2 B c
3 B b
4 B a
5 A a
6 A b
7 A c
8 A a
9 B c
10 B b
11 B a
12 B c
13 B b
Another approach is to create a DataFrame from the tags and merge with a left join:
tag = ['a','b','c']
s = master['cat'].ne(master['cat'].shift()).cumsum()
master['g'] = master.groupby(s).cumcount() % len(tag)
d = {'A': tag, 'B':tag[::-1]}
df = pd.DataFrame([(k, i, x)
                   for k, v in d.items()
                   for i, x in enumerate(v)], columns=['cat','g','tags'])
print (df)
cat g tags
0 A 0 a
1 A 1 b
2 A 2 c
3 B 0 c
4 B 1 b
5 B 2 a
master = master.merge(df, on=['cat','g'], how='left').drop('g', axis=1)
print (master)
cat tags
0 A a
1 A b
2 B c
3 B b
4 B a
5 A a
6 A b
7 A c
8 A a
9 B c
10 B b
11 B a
12 B c
13 B b
The idea is to use numpy.tile to repeat the tag values enough times (number of matched values with integer division), then trim by indexing and assign via both masks (this uses the original tag = [0, 1]):
import numpy as np
le = len(tag)
m1 = master['cat'].eq('A')
m2 = master['cat'].eq('B')
s1 = m1.sum()
s2 = m2.sum()
master.loc[m1, 'tags'] = np.tile(tag, s1 // le + le)[:s1]
#swapped order for m2 mask
master.loc[m2, 'tags'] = np.tile(tag[::-1], s2 // le + le)[:s2]
print (master)
cat tags
0 A 0.0
1 A 1.0
2 B 1.0
3 B 0.0
4 B 1.0
5 A 0.0
6 A 1.0
7 A 0.0
8 A 1.0
IIUC, use GroupBy.cumcount + Series.mod,
then invert the sequence where cat is B with Series.mask (df here is the question's master DataFrame):
s = df.groupby('cat').cumcount().mod(2)
df['tags'] = s.mask(df['cat'].eq('B'), ~s.astype(bool)).astype(int)
print(df)
cat tags
0 A 0
1 A 1
2 B 1
3 B 0
4 B 1
5 A 0
6 A 1
7 A 0
8 A 1
numpy.place might help here:
import numpy as np
#create temp column
mapp = {'A': 0, 'B': 1}
res = (master.assign(temp=master.cat.map(mapp),
                     tag=master.cat)
       )
#locate point where B changes to A
split_point = res.loc[res.temp.diff().eq(-1)].index
split_point
Int64Index([5], dtype='int64')
#split into sections
spl = np.split(res.cat, split_point)
def replace(entry):
    np.place(entry.to_numpy(), entry == "A", [0, 1])
    np.place(entry.to_numpy(), entry == "B", [1, 0])
    return entry
res.tag = pd.concat(map(replace, spl))
res.drop('temp', axis=1)
cat tag
0 A 0
1 A 1
2 B 1
3 B 0
4 B 1
5 A 0
6 A 1
7 A 0
8 A 1
9 B 1
10 B 0
11 B 1
12 B 0
13 B 1
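All of the answers above amount to cycling the tag sequence within each consecutive run of equal cat values, reversed when cat is B. A minimal sketch of that idea in one place (cycle_tags is a hypothetical helper name):
import pandas as pd

def cycle_tags(master, tag):
    #position within each consecutive run of identical 'cat' values
    runs = master['cat'].ne(master['cat'].shift()).cumsum()
    pos = master.groupby(runs).cumcount() % len(tag)
    normal = pd.Series(tag)
    reverse = pd.Series(tag[::-1])
    #normal order for A, reversed order for everything else (here: B)
    return pos.map(normal).where(master['cat'].eq('A'), pos.map(reverse))

i = ['A'] * 2 + ['B'] * 3 + ['A'] * 4 + ['B'] * 5
master = pd.DataFrame(i, columns=['cat'])
master['tags'] = cycle_tags(master, [0, 1])
print (master)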

How to append a DataFrame to a multiindex DataFrame?

Suppose that I have the DataFrames
In [1]: a=pd.DataFrame([[1,2],[3,4],[5,6],[7,8]],
...: index=pd.MultiIndex.from_product([('A','B'),('d','e')]))
In [2]: a
Out[2]:
0 1
A d 1 2
e 3 4
B d 5 6
e 7 8
In [3]: b=pd.DataFrame([[9,10],[11,12]],index=('d','e'))
In [4]: b
Out[4]:
0 1
d 9 10
e 11 12
and I want to append b to a, with the subindex C, producing the
DataFrame
0 1
A d 1 2
e 3 4
B d 5 6
e 7 8
C d 9 10
e 11 12
I tried
In [5]: a.loc['C'] = b
but got
TypeError: 'int' object is not iterable
How do I do it?
Assign a new key column to b, then set_index and swaplevel before appending to a:
a.append(b.assign(k='C').set_index('k',append=True).swaplevel(0,1))
Out[33]:
0 1
A d 1 2
e 3 4
B d 5 6
e 7 8
C d 9 10
e 11 12
First update b's index to match the same levels as a, then concat:
b.index = pd.MultiIndex.from_arrays([('C','C'), ('d','e')])
pd.concat([a, b])
If you want to do it step by step:
df2 = pd.concat([a, b], ignore_index=True)
df2['i0'] = a.index.get_level_values(0).tolist() + ['C'] * len(b)
df2['i1'] = a.index.get_level_values(1).tolist() + b.index.tolist()
df2.set_index(['i0', 'i1'])
Outputs
        0   1
i0 i1
A  d    1   2
   e    3   4
B  d    5   6
   e    7   8
C  d    9  10
   e   11  12
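Note that DataFrame.append was removed in pandas 2.0; on current versions the same result can be built with pd.concat and its keys argument, which adds the outer level to b's index. A minimal sketch:
import pandas as pd

a = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]],
                 index=pd.MultiIndex.from_product([('A', 'B'), ('d', 'e')]))
b = pd.DataFrame([[9, 10], [11, 12]], index=('d', 'e'))

#keys=['C'] prepends 'C' as the outer index level of b
out = pd.concat([a, pd.concat([b], keys=['C'])])
print (out)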

Group by with a pandas dataframe using different aggregation for different columns

I have a pandas dataframe df with columns [a, b, c, d, e, f]. I want to perform a group by on df. I can best describe what it's supposed to do in SQL:
SELECT a, b, min(c), min(d), max(e), sum(f)
FROM df
GROUP BY a, b
How do I do this group by using pandas on my dataframe df?
consider df:
a b c d e f
1 1 2 5 9 3
1 1 3 3 4 5
2 2 4 7 4 4
2 2 5 3 8 8
I expect the result to be:
a b c d e f
1 1 2 3 9 8
2 2 4 3 8 12
Use agg:
import numpy as np
df = pd.DataFrame(
    dict(
        a=list('aaaabbbb'),
        b=list('ccddccdd'),
        c=np.arange(8),
        d=np.arange(8),
        e=np.arange(8),
        f=np.arange(8),
    )
)
funcs = dict(c='min', d='min', e='max', f='sum')
df.groupby(['a', 'b']).agg(funcs).reset_index()
a b c e f d
0 a c 0 1 1 0
1 a d 2 3 5 2
2 b c 4 5 9 4
3 b d 6 7 13 6
With your data:
a b c e f d
0 1 1 2 9 8 3
1 2 2 4 8 12 3
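On pandas 0.25+ the same aggregation can also be written with named aggregation, which keeps the output column order explicit; a minimal sketch using the question's sample data:
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 1, 2, 2],
                   'c': [2, 3, 4, 5], 'd': [5, 3, 7, 3],
                   'e': [9, 4, 4, 8], 'f': [3, 5, 4, 8]})

#named aggregation: output column = (input column, aggregation function)
out = (df.groupby(['a', 'b'], as_index=False)
         .agg(c=('c', 'min'), d=('d', 'min'),
              e=('e', 'max'), f=('f', 'sum')))
print (out)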