How to optimize merging only on lines matching a condition? - pandas

I want to left merge df_1 and df_2 on column a.
I can achieve this easily with:
df_3 = df_1.merge(df_2, on="a", how="left")
However, I know I will never find a match for a in df_2 when df_1.b == 0.
So, to optimize my code, I would like to merge df_1 with df_2 only where df_1.b != 0.
How can I get df_3 more efficiently knowing this?
Input:
d = {'a': list('ABCDEF'),
     'b': list('111000')}
df_1 = pd.DataFrame(data=d)
# a b
# 0 A 1
# 1 B 1
# 2 C 1
# 3 D 0
# 4 E 0
# 5 F 0
d = {'a': list('ABC'),
     'c': list('xyz')}
df_2 = pd.DataFrame(data=d)
# a c
# 0 A x
# 1 B y
# 2 C z
Expected output:
df_3
# a b c
# 0 A 1 x
# 1 B 1 y
# 2 C 1 z
# 3 D 0 NaN
# 4 E 0 NaN
# 5 F 0 NaN

Just merge it:
df_1.merge(df_2, how = 'outer', on = 'a')

IIUC use:
m = df_1.b != '0'
df_3 = (df_1[m].reset_index()
               .merge(df_2, on="a", how="left", suffixes=('', '_'))
               .drop(df_1.columns, axis=1)
               .set_index('index'))
print (df_3)
c
index
0 x
1 y
2 z
out = pd.concat([df_1, df_3], axis=1)
print (out)
a b c
0 A 1 x
1 B 1 y
2 C 1 z
3 D 0 NaN
4 E 0 NaN
5 F 0 NaN
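Alternatively, since df_2 only contributes the single column c here, a minimal sketch that skips the merge entirely by mapping only the rows where the condition holds (this assumes b holds the string values from the sample data; adjust the comparison if b is numeric):
lookup = df_2.set_index('a')['c']
# map 'a' to 'c' only for the rows where b != '0'; all other rows stay NaN
df_1['c'] = df_1.loc[df_1['b'] != '0', 'a'].map(lookup)
print(df_1)
#    a  b    c
# 0  A  1    x
# 1  B  1    y
# 2  C  1    z
# 3  D  0  NaN
# 4  E  0  NaN
# 5  F  0  NaN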


Split list column in a dataframe into separate 1/0 entry columns [duplicate]

I have a dataframe where one column is a list of groups each of my users belongs to. Something like:
index groups
0 ['a','b','c']
1 ['c']
2 ['b','c','e']
3 ['a','c']
4 ['b','e']
And what I would like to do is create a series of dummy columns to identify which groups each user belongs to, in order to run some analyses:
index a b c d e
0 1 1 1 0 0
1 0 0 1 0 0
2 0 1 1 0 1
3 1 0 1 0 0
4 0 1 0 0 1
pd.get_dummies(df['groups'])
won't work because that just returns a column for each different list in my column.
The solution needs to be efficient as the dataframe will contain 500,000+ rows.
Using s for your df['groups']:
In [21]: s = pd.Series({0: ['a', 'b', 'c'], 1:['c'], 2: ['b', 'c', 'e'], 3: ['a', 'c'], 4: ['b', 'e'] })
In [22]: s
Out[22]:
0 [a, b, c]
1 [c]
2 [b, c, e]
3 [a, c]
4 [b, e]
dtype: object
This is a possible solution:
In [23]: pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)
Out[23]:
a b c e
0 1 1 1 0
1 0 0 1 0
2 0 1 1 1
3 1 0 1 0
4 0 1 0 1
The logic of this is:
.apply(pd.Series) converts the series of lists to a dataframe
.stack() puts everything in one column again (creating a multi-level index)
pd.get_dummies() creates the dummy columns
.sum(level=0) re-merges the rows that should be one row (by summing over the second level, keeping only the original level, level=0)
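For reference, a minimal sketch of the intermediate result of the first two steps, using the s defined above:
stacked = s.apply(pd.Series).stack()
print(stacked)
# 0  0    a
#    1    b
#    2    c
# 1  0    c
# 2  0    b
#    1    c
#    2    e
# 3  0    a
#    1    c
# 4  0    b
#    1    e
# dtype: object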
A roughly equivalent variant is pd.get_dummies(s.apply(pd.Series), prefix='', prefix_sep='').sum(level=0, axis=1).
Whether this will be efficient enough, I don't know, but in any case, if performance is important, storing lists in a dataframe is not a very good idea.
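Note: Series.sum(level=...) was deprecated in pandas 1.3 and has since been removed, so on recent versions the re-merging step is written as an explicit groupby on the index level. A sketch of the same computation:
pd.get_dummies(s.apply(pd.Series).stack()).groupby(level=0).sum()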
Very fast solution in case you have a large dataframe
Using sklearn.preprocessing.MultiLabelBinarizer
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
df = pd.DataFrame(
    {'groups': [['a','b','c'],
                ['c'],
                ['b','c','e'],
                ['a','c'],
                ['b','e']]},
    columns=['groups'])
s = df['groups']
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_, index=df.index)
Result:
a b c e
0 1 1 1 0
1 0 0 1 0
2 0 1 1 1
3 1 0 1 0
4 0 1 0 1
This worked for me; the same approach has been suggested in other answers as well.
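If memory is a concern at the 500,000+ rows mentioned in the question, a sketch of a sparse variant (assumes scikit-learn and a pandas version with the sparse accessor):
mlb = MultiLabelBinarizer(sparse_output=True)
sparse_matrix = mlb.fit_transform(df['groups'])
out = pd.DataFrame.sparse.from_spmatrix(sparse_matrix, index=df.index, columns=mlb.classes_)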
This is even faster:
pd.get_dummies(df['groups'].explode()).sum(level=0)
Using .explode() instead of .apply(pd.Series).stack()
Comparing with the other solutions:
import timeit
import pandas as pd
setup = '''
import time
import pandas as pd
s = pd.Series({0:['a','b','c'],1:['c'],2:['b','c','e'],3:['a','c'],4:['b','e']})
df = s.rename('groups').to_frame()
'''
m1 = "pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)"
m2 = "df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')"
m3 = "pd.get_dummies(df['groups'].explode()).sum(level=0)"
times = {f"m{i+1}":min(timeit.Timer(m, setup=setup).repeat(7, 1000)) for i, m in enumerate([m1, m2, m3])}
pd.DataFrame([times],index=['ms'])
# m1 m2 m3
# ms 5.586517 3.821662 2.547167
Even though this question was answered, I have a faster solution:
df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')
And, in case you have empty groups or NaN, you could just:
df.groups.loc[df.groups.str.len() > 0].apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')
How it works
Inside the lambda, x is your list, for example ['a', 'b', 'c']. So pd.Series will be as follows:
In [2]: pd.Series([1, 1, 1], index=['a', 'b', 'c'])
Out[2]:
a 1
b 1
c 1
dtype: int64
When all the pd.Series come together, they become a pd.DataFrame whose columns are the union of their indexes; a missing index entry becomes NaN, as you can see next:
In [4]: a = pd.Series([1, 1, 1], index=['a', 'b', 'c'])
In [5]: b = pd.Series([1, 1, 1], index=['a', 'b', 'd'])
In [6]: pd.DataFrame([a, b])
Out[6]:
a b c d
0 1.0 1.0 1.0 NaN
1 1.0 1.0 NaN 1.0
Now fillna fills those NaN with 0:
In [7]: pd.DataFrame([a, b]).fillna(0)
Out[7]:
a b c d
0 1.0 1.0 1.0 0.0
1 1.0 1.0 0.0 1.0
And downcast='infer' is to downcast from float to int:
In [11]: pd.DataFrame([a, b]).fillna(0, downcast='infer')
Out[11]:
a b c d
0 1 1 1 0
1 1 1 0 1
PS: Using .fillna(0, downcast='infer') is not strictly required.
You can use explode and crosstab:
s = pd.Series([['a', 'b', 'c'], ['c'], ['b', 'c', 'e'], ['a', 'c'], ['b', 'e']])
s = s.explode()
pd.crosstab(s.index, s)
Output:
col_0 a b c e
row_0
0 1 1 1 0
1 0 0 1 0
2 0 1 1 1
3 1 0 1 0
4 0 1 0 1
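Note that crosstab counts occurrences, so if an element can appear more than once in the same list, the result may contain values greater than 1. A minimal sketch to cap the counts at 1, should that matter:
pd.crosstab(s.index, s).clip(upper=1)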
You can use str.join to join all elements of each list in the series into a single string, and then use str.get_dummies:
out = df.join(df['groups'].str.join('|').str.get_dummies())
print(out)
groups a b c e
0 [a, b, c] 1 1 1 0
1 [c] 0 0 1 0
2 [b, c, e] 0 1 1 1
3 [a, c] 1 0 1 0
4 [b, e] 0 1 0 1

pandas - replace rows in dataframe with rows of another dataframe by latest matching column entry

I have 2 dataframes, df1
A B C
0 a 1 x
1 b 2 y
2 c 3 z
3 d 4 g
4 e 5 h
and df2:
0 A B C
0 1 a 6 i
1 2 a 7 j
2 3 b 8 k
3 3 d 10 k
What I want to do is the following:
Whenever an entry in column A of df1 matches an entry in column A of df2, replace the matching row in df1 with parts of the row in df2.
In my approach, in the code below, I tried to replace the first row
(a, 1, x) by (a, 6, i) and then by (a, 7, j).
All other matching rows should be replaced as well:
So: (b, 2, y) with (b, 8, k) and (d, 4, g) with (d, 10, k).
Meaning that every row in df1 should be replaced by the latest match on column A in df2.
import numpy as np
import pandas as pd
columns = ["0","A", "B", "C"]
s1 = pd.Series(['a', 1, 'x'])
s2 = pd.Series(['b', 2, 'y'])
s3 = pd.Series(['c', 3, 'z'])
s4 = pd.Series(['d', 4, 'g'])
s5 = pd.Series(['e', 5, 'h'])
df1 = pd.DataFrame([list(s1), list(s2),list(s3),list(s4),list(s5)], columns = columns[1::])
s1 = pd.Series([1, 'a', 6, 'i'])
s2 = pd.Series([2, 'a', 7, 'j'])
s3 = pd.Series([3, 'b', 8, 'k'])
s4 = pd.Series([3, 'd', 10, 'k'])
df2 = pd.DataFrame([list(s1), list(s2),list(s3),list(s4)], columns = columns)
cols = ["A", "B", "C"]
print(df1[columns[1::]])
print("---")
print(df2[columns])
print("---")
df1.loc[df1["A"].isin(df2["A"]), columns[1::]] = df2[columns[1::]]
print(df1)
The expected result would therefore be:
A B C
0 a 7 j
1 b 2 y
2 c 3 z
3 d 10 k
4 e 5 h
But the above approach results in:
A B C
0 a 6 i
1 a 7 j
2 c 3 z
3 d 10 k
4 e 5 h
I know I could do what I want with iterrows(), but I don't think that is the intended way of doing this, right? (Also, I have quite a lot of data to process, so I think it would not be the most efficient approach, but please correct me if I'm wrong here; in that case it would be fine to use it.)
Or is there any other easy approach to achieve this?
Use:
df = pd.concat([df1, df2]).drop_duplicates(['A'], keep='last').sort_values('A').drop('0', axis=1)
print (df)
A B C
1 a 7 j
2 b 8 k
2 c 3 z
3 d 10 k
4 e 5 h
You can try merge, then update:
df1.update(df1[['A']].merge(df2.drop_duplicates('A', keep='last'), on='A', how='left')[['B', 'C']])
print(df1)
A B C
0 a 7.0 j
1 b 8.0 k
2 c 3.0 z
3 d 10.0 k
4 e 5.0 h
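Note that update works against the intermediate left merge, which contains NaN for the rows of df1 that have no match in df2, so pandas upcasts B to float. A minimal cleanup sketch, assuming B should remain integer (and contains no NaN after the update):
df1['B'] = df1['B'].astype(int)
print(df1)
#    A   B  C
# 0  a   7  j
# 1  b   8  k
# 2  c   3  z
# 3  d  10  k
# 4  e   5  h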

Merge two matrices (dataframes) into one with columns interleaved

I have two dataframes like these:
df1
  a b c
0 1 2 3
1 2 3 4
2 3 4 5
df2
  x y z
0 T T F
1 F T T
2 F T F
I want to merge these dataframes so that their columns are interleaved one by one, like this:
df
  a x b y c z
0 1 T 2 T 3 F
1 2 F 3 T 4 T
2 3 F 4 T 5 F
What's the best way to do this? Can it be done with merge, append, or concat?
I used this code; it works dynamically:
df = pd.DataFrame()
for i in range(0, 6):
    if i % 2 == 0:
        # even positions take the next column from df1
        j = i / 2
        df.loc[:, i] = df1.iloc[:, int(j)]
    else:
        # odd positions take the next column from df2
        j = (i - 1) / 2
        df.loc[:, i] = df2.iloc[:, int(j)]
And it works correctly!
Try:
df = pd.concat([df1, df2], axis=1)
df = df[['a','x','b','y','c','z']]
Prints:
a x b y c z
0 1 T 2 T 3 F
1 2 F 3 T 4 T
2 3 F 4 T 5 F
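If the column names are not known in advance, a minimal sketch that builds the interleaved column order dynamically (assuming df1 and df2 have the same number of columns and share the same index, as in the sample data):
# interleave the column names: ['a', 'x', 'b', 'y', 'c', 'z']
order = [c for pair in zip(df1.columns, df2.columns) for c in pair]
df = pd.concat([df1, df2], axis=1)[order]
print(df)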

How to impute column values on Dask Dataframe?

I would like to impute negative values in a Dask DataFrame. With pandas I use this code:
df.loc[(df.column_name < 0),'column_name'] = 0
I think you need dask.dataframe.Series.clip_lower:
ddf['B'] = ddf['B'].clip_lower(0)
Sample:
import pandas as pd
df = pd.DataFrame({'F': list('abcdef'),
                   'B': [-4, 5, 4, -5, 5, 4],
                   'A': list('aaabbb')})
print (df)
A B F
0 a -4 a
1 a 5 b
2 a 4 c
3 b -5 d
4 b 5 e
5 b 4 f
from dask import dataframe as dd
ddf = dd.from_pandas(df, npartitions=3)
#print (ddf)
ddf['B'] = ddf['B'].clip_lower(0)
print (ddf.compute())
A B F
0 a 0 a
1 a 5 b
2 a 4 c
3 b 0 d
4 b 5 e
5 b 4 f
For a more general solution use dask.dataframe.Series.mask:
ddf['B'] = ddf['B'].mask(ddf['B'] > 0, 3)
print (ddf.compute())
A B F
0 a -4 a
1 a 3 b
2 a 3 c
3 b -5 d
4 b 3 e
5 b 3 f
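Note: clip_lower has been deprecated and later removed from pandas, and the matching dask method followed; on recent versions the equivalent is clip with the lower keyword:
ddf['B'] = ddf['B'].clip(lower=0)
print(ddf.compute())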

populate a dense dataframe given a key-value dataframe

I have a key-value dataframe:
pd.DataFrame(columns=['X','Y','val'], data=[['a','z',5],['b','g',3],['b','y',6],['e','r',9]])
> X Y val
0 a z 5
1 b g 3
2 b y 6
3 e r 9
Which I'd like to convert into a denser dataframe:
X z g y r
0 a 5 0 0 0
1 b 0 3 6 0
2 e 0 0 0 9
Before I resort to a pure-Python solution, I was wondering if there is a simple way to do this with pandas.
You can use get_dummies:
In [11]: dummies = pd.get_dummies(df['Y'])
In [12]: dummies
Out[12]:
g r y z
0 0 0 0 1
1 1 0 0 0
2 0 0 1 0
3 0 1 0 0
and then multiply by the val column:
In [13]: res = dummies.mul(df['val'], axis=0)
In [14]: res
Out[14]:
g r y z
0 0 0 0 5
1 3 0 0 0
2 0 0 6 0
3 0 9 0 0
To fix the index, you could add X to the index by first applying set_index:
In [21]: df1 = df.set_index('X', append=True)
In [22]: df1
Out[22]:
Y val
X
0 a z 5
1 b g 3
2 b y 6
3 e r 9
In [23]: dummies = pd.get_dummies(df1['Y'])
In [24]: dummies.mul(df1['val'], axis=0)
Out[24]:
g r y z
X
0 a 0 0 0 5
1 b 3 0 0 0
2 b 0 0 6 0
3 e 0 9 0 0
If you wanted to do this as a pivot (you can also use pivot_table):
In [31]: df.pivot('X', 'Y').fillna(0)
Out[31]:
val
Y g r y z
X
a 0 0 0 5
b 3 0 6 0
e 0 9 0 0
Perhaps you want to reset_index, to make X a column (I'm not sure whether that makes sense):
In [32]: df.pivot('X', 'Y').fillna(0).reset_index()
Out[32]:
X val
Y g r y z
0 a 0 0 0 5
1 b 3 0 6 0
2 e 0 9 0 0
For completeness, the pivot_table:
In [33]: df.pivot_table('val', 'X', 'Y', fill_value=0)
Out[33]:
Y g r y z
X
a 0 0 0 5
b 3 0 6 0
e 0 9 0 0
In [34]: df.pivot_table('val', 'X', 'Y', fill_value=0).reset_index()
Out[34]:
Y X g r y z
0 a 0 0 0 5
1 b 3 0 6 0
2 e 0 9 0 0
Note: after resetting the index, the columns index is named Y; not sure if this makes sense (it's easy to rectify via res.columns.name = None).
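Note: on pandas 2.0+ the DataFrame.pivot arguments are keyword-only, so the calls above would be spelled with explicit keywords, for example:
df.pivot(index='X', columns='Y').fillna(0)
df.pivot_table(values='val', index='X', columns='Y', fill_value=0)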
If you want something that feels more direct, something akin to DataFrame.lookup but for np.put might make sense:
def lookup_index(self, row_labels, col_labels):
    values = self.values
    ridx = self.index.get_indexer(row_labels)
    cidx = self.columns.get_indexer(col_labels)
    if (ridx == -1).any():
        raise ValueError('One or more row labels was not found')
    if (cidx == -1).any():
        raise ValueError('One or more column labels was not found')
    flat_index = ridx * len(self.columns) + cidx
    return flat_index
flat_index = lookup_index(df, vals.X, vals.Y)
np.put(df.values, flat_index, vals.val.values)
This assumes that df has the appropriate columns and index to hold the X/Y values. Here's an ipython notebook http://nbviewer.ipython.org/6454120
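For illustration, a hypothetical setup under which the snippet above applies: vals is the key-value frame from the question and df is a zero-filled frame whose index and columns hold the X and Y values. Writing through df.values relies on it being a writable view of a single underlying block, which is what the original answer assumes (it may not hold under copy-on-write in newer pandas).
import numpy as np
import pandas as pd

vals = pd.DataFrame(columns=['X', 'Y', 'val'],
                    data=[['a', 'z', 5], ['b', 'g', 3], ['b', 'y', 6], ['e', 'r', 9]])
# zero-filled target frame with the X values as index and the Y values as columns
df = pd.DataFrame(0, index=vals['X'].unique(), columns=vals['Y'].unique())

flat_index = lookup_index(df, vals.X, vals.Y)
np.put(df.values, flat_index, vals.val.values)  # in-place write, as in the answer above
print(df)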