pandas - replace rows in dataframe with rows of another dataframe by latest matching column entry - pandas

I have 2 dataframes, df1
A B C
0 a 1 x
1 b 2 y
2 c 3 z
3 d 4 g
4 e 5 h
and df2:
0 A B C
0 1 a 6 i
1 2 a 7 j
2 3 b 8 k
3 3 d 10 k
What I want to do is the following:
Whenever an entry in column A of df1 matches an entry in column A of df2, replace the matching row in df1 with parts of the row in df2.
In the code below, I tried to replace the first row (a,1,x) with (a,6,i) and then with (a,7,j).
All other matching rows should be replaced as well: (b,2,y) with (b,8,k) and (d,4,g) with (d,10,k).
In other words, every row in df1 should be replaced by the latest match on column A in df2.
import numpy as np
import pandas as pd
columns = ["0","A", "B", "C"]
s1 = pd.Series(['a', 1, 'x'])
s2 = pd.Series(['b', 2, 'y'])
s3 = pd.Series(['c', 3, 'z'])
s4 = pd.Series(['d', 4, 'g'])
s5 = pd.Series(['e', 5, 'h'])
df1 = pd.DataFrame([list(s1), list(s2),list(s3),list(s4),list(s5)], columns = columns[1::])
s1 = pd.Series([1, 'a', 6, 'i'])
s2 = pd.Series([2, 'a', 7, 'j'])
s3 = pd.Series([3, 'b', 8, 'k'])
s4 = pd.Series([3, 'd', 10, 'k'])
df2 = pd.DataFrame([list(s1), list(s2),list(s3),list(s4)], columns = columns)
cols = ["A", "B", "C"]
print(df1[columns[1::]])
print("---")
print(df2[columns])
print("---")
df1.loc[df1["A"].isin(df2["A"]), columns[1::]] = df2[columns[1::]]
print(df1)
The expected result would therefore be:
A B C
0 a 7 j
1 b 8 k
2 c 3 z
3 d 10 k
4 e 5 h
But the above approach results in:
A B C
0 a 6 i
1 a 7 j
2 c 3 z
3 d 10 k
4 e 5 h
I know I could do what I want with iterrows(), but I don't think that is the intended way of doing this, right? (I also have quite a lot of data to process, so I suspect it would not be the most efficient approach. Please correct me if I'm wrong and it would be fine to use it here.)
Or is there any other easy approach to achieve this?

Use:
df = pd.concat([df1, df2]).drop_duplicates(['A'], keep='last').sort_values('A').drop('0', axis=1)
print (df)
A B C
1 a 7 j
2 b 8 k
2 c 3 z
3 d 10 k
4 e 5 h

You can try merge then update
df1.update(df1[['A']].merge(df2.drop_duplicates('A', keep='last'), on='A', how='left')[['B', 'C']])
print(df1)
A B C
0 a 7.0 j
1 b 8.0 k
2 c 3.0 z
3 d 10.0 k
4 e 5.0 h
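If the float upcast of column B in the output above is unwanted, one possible variant is to look up the latest df2 row per key with Series.map and fall back to the original value where there is no match. This is only a minimal sketch using the data from the question; the names latest and result are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"A": list("abcde"), "B": [1, 2, 3, 4, 5], "C": list("xyzgh")})
df2 = pd.DataFrame({"0": [1, 2, 3, 3], "A": list("aabd"),
                    "B": [6, 7, 8, 10], "C": list("ijkk")})

# Keep only the last df2 row for each key in A and index it by that key.
latest = df2.drop_duplicates("A", keep="last").set_index("A")

result = df1.copy()
for col in ["B", "C"]:
    # map() returns NaN where A has no match in df2; fill those from df1,
    # then restore the original dtype (B would otherwise end up as float).
    looked_up = result["A"].map(latest[col])
    result[col] = looked_up.fillna(df1[col]).astype(df1[col].dtype)

print(result)
```

Rows c and e have no match in df2 and keep their original values; df1 itself is left unmodified.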

Related

Setting value_counts that lower than a threshold as others

I want to set item with count<=1 as others, code for input table:
import pandas as pd
df=pd.DataFrame({"item":['a','a','a','b','b','c','d']})
input table:
item
0 a
1 a
2 a
3 b
4 b
5 c
6 d
expected output:
item result
0 a a
1 a a
2 a a
3 b b
4 b b
5 c other
6 d other
How could I achieve that?
Use Series.where and check which values are duplicated with Series.duplicated and keep=False:
df['result'] = df.item.where(df.item.duplicated(keep=False), 'other')
Or use GroupBy.transform and test for group sizes greater than 1 with Series.gt:
df['result'] = df.item.where(df.groupby('item')['item'].transform('size').gt(1), 'other')
Or use Series.map with Series.value_counts:
df['result'] = df.item.where(df['item'].map(df['item'].value_counts()).gt(1), 'other')
print (df)
item result
0 a a
1 a a
2 a a
3 b b
4 b b
5 c other
6 d other
Use numpy.where with Groupby.transform and Series.le:
In [926]: import numpy as np
In [927]: df['result'] = np.where(df.groupby('item')['item'].transform('count').le(1), 'other', df.item)
In [928]: df
Out[928]:
item result
0 a a
1 a a
2 a a
3 b b
4 b b
5 c other
6 d other
OR use Groupby.size with merge:
In [917]: x = df.groupby('item').size().reset_index()
In [919]: ans = df.merge(x)
In [921]: ans['result'] = np.where(ans[0].le(1), 'other', ans.item)
In [923]: ans = ans.drop(columns=0)
In [924]: ans
Out[924]:
item result
0 a a
1 a a
2 a a
3 b b
4 b b
5 c other
6 d other
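The answers above hard-code the threshold of 1. A minimal sketch of how the value_counts variant could be wrapped for an arbitrary threshold; the function name lump_rare and its parameters are hypothetical, not part of pandas.

```python
import pandas as pd

def lump_rare(s, min_count=2, other="other"):
    """Replace values that occur fewer than min_count times with `other`."""
    # Map every element to the number of times its value occurs in s.
    counts = s.map(s.value_counts())
    # Keep values that are frequent enough, replace the rest.
    return s.where(counts >= min_count, other)

df = pd.DataFrame({"item": ["a", "a", "a", "b", "b", "c", "d"]})
df["result"] = lump_rare(df["item"])
strict = lump_rare(df["item"], min_count=3)
print(df)
```

With min_count=3, the two b rows would be replaced as well.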

merge two matrix (dataframe) into one in between columns

I have two dataframe like these:
df1 a b c
0 1 2 3
1 2 3 4
2 3 4 5
df2 x y z
0 T T F
1 F T T
2 F T F
I want to merge these two matrices by interleaving their columns, like this:
df a x b y c z
0 1 T 2 T 3 F
1 2 F 3 T 4 T
2 3 F 4 T 5 F
What is your idea? How can we merge, append or concatenate them?
I used this code, which works dynamically:
df=pd.DataFrame()
for i in range(0,6):
    if i%2 == 0:
        # even positions take the next column of df1
        j=(i)/2
        df.loc[:,i] = df1.iloc[:,int(j)]
    else:
        # odd positions take the next column of df2
        j=(i-1)/2
        df.loc[:,i] = df2.iloc[:,int(j)]
And it works correctly!
Try:
df = pd.concat([df1, df2], axis=1)
df = df[['a','x','b','y','c','z']]
Prints:
a x b y c z
0 1 T 2 T 3 F
1 2 F 3 T 4 T
2 3 F 4 T 5 F
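The column order above is typed out by hand. A small sketch of how it could be derived instead, assuming both frames have the same number of columns; the variable name order is just for illustration.

```python
from itertools import chain

import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 4, 5]})
df2 = pd.DataFrame({"x": list("TFF"), "y": list("TTT"), "z": list("FTF")})

# Pair up the i-th column of df1 with the i-th column of df2,
# then flatten the pairs: ['a', 'x', 'b', 'y', 'c', 'z'].
order = list(chain.from_iterable(zip(df1.columns, df2.columns)))

df = pd.concat([df1, df2], axis=1)[order]
print(df)
```

Note that zip stops at the shorter input, so extra columns in the wider frame would be dropped silently.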

If a column value does not have a certain number of occurrences in a dataframe, how to duplicate all rows with that column value?

Say this is my dataframe:
A B
0 a 5
1 b 2
2 d 5
3 g 3
4 m 2
5 c 0
6 u 5
7 p 3
8 q 1
9 z 1
If the number of a particular value in column B does not have a particular occurrence count, I want to duplicate all rows which have that particular value for B.
For the df above, say this particular value is 3. If a value for Column B is less than three, then all rows with that column value are duplicated. So rows with column value 0, 1, and 2 are duplicated, but rows with column B value of 5 are not.
Desired result:
A B
0 a 5
1 b 2
2 d 5
3 g 3
4 m 2
5 c 0
6 u 5
7 p 3
8 q 1
9 z 1
10 b 2
11 m 2
12 g 3
13 p 3
14 c 0
15 c 0
Here is my approach
n=3 #threshold
df2 = (df.assign(columns = df.groupby('B').cumcount())
         .pivot_table(columns = 'columns',
                      index = 'B',
                      values = 'A',
                      aggfunc = 'first')
      )
r = max(n,len(df2.columns))
df2 = df2.reindex(columns = range(r))
notNaN_count = df2.count(axis=1)
m_ffill = notNaN_count.mul(2).lt(n)
repeats = notNaN_count.lt(n).mul(~m_ffill).add(1)
new_df = (df2.ffill(axis = 1)
             .where(m_ffill,df2)
             .reindex(index = df2.index.repeat(repeats))
             .stack()
             .rename('A')
             .reset_index()
             .loc[:,df.columns]
         )
print(new_df)
Output
A B
0 c 0
1 c 0
2 c 0
3 q 1
4 z 1
5 q 1
6 z 1
7 b 2
8 m 2
9 b 2
10 m 2
11 g 3
12 p 3
13 g 3
14 p 3
15 a 5
16 d 5
17 u 5
If instead of duplicating we want to multiply by a factor d, we must make the following modifications:
n = 3
d = 2
m_ffill = notNaN_count.mul(d).lt(n)
repeats = notNaN_count.lt(n).mul(~m_ffill).mul(d).clip(lower = 1)
EDIT
n=3 #threshold
d = 2
values = df.columns.difference(['B'])
df2 = (df.assign(columns = df.groupby('B').cumcount())
         .pivot_table(columns = 'columns',
                      index = 'B',
                      values = values,
                      aggfunc = 'first'))
r = max(n,len(df2.columns.get_level_values('columns').unique()))
df2 = df2.reindex(columns = range(r),level = 'columns')
notNaN_count = df2.count(axis=1).div(len(values))
m_ffill = notNaN_count.mul(d).lt(n)
repeats = notNaN_count.lt(n).mul(~m_ffill).mul(d).clip(lower = 1)
new_df = (df2.T
             .groupby(level=0)
             .ffill()
             .T
             .where(m_ffill,df2)
             .reindex(index = df2.index.repeat(repeats))
             .stack()
             .reset_index()
             .loc[:,df.columns]
         )
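For comparison, a much simpler sketch under a different reading of the question: append exactly one extra copy of every row whose B value occurs fewer than n times. This is an assumption about the intent, and it gives neither the desired result posted in the question nor the output of the answer above. The names counts, rare and out are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": list("abdgmcupqz"),
                   "B": [5, 2, 5, 3, 2, 0, 5, 3, 1, 1]})
n = 3  # threshold on the number of occurrences of a B value

# For every row, how many rows share its B value.
counts = df.groupby("B")["B"].transform("size")

# Rows whose B value is too rare, appended once more at the end.
rare = df[counts < n]
out = pd.concat([df, rare], ignore_index=True)
print(out)
```

Here only B == 5 occurs three times, so the three rows with B == 5 are left alone and the other seven rows are each appended once.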

Unstack a single column dataframe

I have a dataframe that looks like this:
statistics
0 2013-08
1 4
2 8
3 2013-09
4 7
5 13
6 2013-10
7 2
8 10
And I need it to look like this:
statistics X Y
0 2013-08 4 8
1 2013-09 7 13
2 2013-10 2 10
It would be useful to find a way that doesn't depend on the number of rows, as I want to use it in a loop and the number of original rows might change. However, the output should always have these 3 columns.
What you are doing is not an unstack operation; you are trying to do a reshape.
You can do this with numpy's reshape method. The variable n_cols is the number of columns you are looking for.
Here is an example:
df = pd.DataFrame(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L'], columns=['col'])
df
col
0 A
1 B
2 C
3 D
4 E
5 F
6 G
7 H
8 I
9 J
10 K
11 L
n_cols = 3
pd.DataFrame(df.values.reshape(int(len(df)/n_cols), n_cols))
0 1 2
0 A B C
1 D E F
2 G H I
3 J K L
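Applied to the data from the question, the same idea might look like the sketch below. It assumes the rows always come in complete groups of three; the column names X and Y are taken from the desired output, and wide is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({"statistics": ["2013-08", 4, 8,
                                  "2013-09", 7, 13,
                                  "2013-10", 2, 10]})

n_cols = 3
# -1 lets numpy infer the number of rows, so this does not depend on len(df);
# reshape raises a ValueError if len(df) is not a multiple of n_cols.
wide = pd.DataFrame(df["statistics"].to_numpy().reshape(-1, n_cols),
                    columns=["statistics", "X", "Y"])
print(wide)
```

All three columns come out with object dtype because the source column mixes strings and numbers, so X and Y may need converting (for example with pd.to_numeric) before doing arithmetic on them.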
import pandas as pd
data = pd.read_csv('data6.csv')
x=[]
y=[]
statistics= []
for i in range(0,len(data)):
    if i%3==0:
        statistics.append(data['statistics'][i])
    elif i%3==1:
        x.append(data['statistics'][i])
    elif i%3 == 2:
        y.append(data['statistics'][i])
data1 = pd.DataFrame({'statistics':statistics,'x':x,'y':y})
data1

how to selectively filter elements in pandas group

I want to selectively remove elements of a pandas group based on their properties within the group.
Here's an example: remove all elements except the row with the highest value in the 'A' column
>>> dff = pd.DataFrame({'A': [0, 2, 4, 1, 9, 2, 3, 10], 'B': list('aabbbbcc'), 'C': list('lmnopqrt')})
>>> dff
A B C
0 0 a l
1 2 a m
2 4 b n
3 1 b o
4 9 b p
5 2 b q
6 3 c r
7 10 c t
>>> grped = dff.groupby('B')
>>> grped.groups
{'a': [0, 1], 'c': [6, 7], 'b': [2, 3, 4, 5]}
Apply a custom function/method to the groups (sort within each group on column 'A', then filter elements):
>>> yourGenius(grped,'A').reset_index()
returns dataframe:
A B C
0 2 a m
1 9 b p
2 10 c t
Maybe there is a compact way to do this with a lambda function or .filter()? Thanks.
If you want to select one row per group, you could use groupby/agg
to return index values and select the rows using loc.
For example, to group by B and then select the row with the highest A value:
In [171]: dff
Out[171]:
A B C
0 0 a l
1 2 a m
2 4 b n
3 1 b o
4 9 b p
5 2 b q
6 3 c r
7 10 c t
[8 rows x 3 columns]
In [172]: dff.loc[dff.groupby('B')['A'].idxmax()]
Out[172]:
A B C
1 2 a m
4 9 b p
7 10 c t
Another option (suggested by jezrael), which in practice is faster for a wide range of DataFrames, is
dff.sort_values(by=['A'], ascending=False).drop_duplicates('B')
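A quick sketch checking that the two approaches select the same rows for the example data; the names by_idxmax and by_sort are made up here. The sort-based result comes back ordered by descending A, so it is re-sorted by index before comparing.

```python
import pandas as pd

dff = pd.DataFrame({"A": [0, 2, 4, 1, 9, 2, 3, 10],
                    "B": list("aabbbbcc"),
                    "C": list("lmnopqrt")})

# One index label per group: the row holding the group's largest A.
by_idxmax = dff.loc[dff.groupby("B")["A"].idxmax()]

# Sort so the largest A comes first, then keep the first row seen per B.
by_sort = (dff.sort_values(by=["A"], ascending=False)
              .drop_duplicates("B")
              .sort_index())

print(by_idxmax.equals(by_sort))
```

If a group has several rows tied for the maximum, the two methods are not guaranteed to pick the same one, since the default sort is not stable.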
If you wish to select many rows per group, you could use groupby/apply with a function that returns sub-DataFrames for
each group. apply will then try to merge these sub-DataFrames for you.
For example, to select every row except the last from each group:
In [216]: df = pd.DataFrame(np.arange(15).reshape(5,3), columns=list('ABC'), index=list('vwxyz')); df['A'] %= 2; df
Out[216]:
A B C
v 0 1 2
w 1 4 5
x 0 7 8
y 1 10 11
z 0 13 14
In [217]: df.groupby(['A']).apply(lambda grp: grp.iloc[:-1]).reset_index(drop=True, level=0)
Out[217]:
A B C
v 0 1 2
x 0 7 8
w 1 4 5
Another way is to use groupby/apply to return a Series of index values. Again apply will try to join the Series into one Series. You could then use df.loc to select rows by index value:
In [218]: df.loc[df.groupby(['A']).apply(lambda grp: pd.Series(grp.index[:-1]))]
Out[218]:
A B C
v 0 1 2
x 0 7 8
w 1 4 5
I don't think groupby/filter will do what you wish, since
groupby/filter filters whole groups. It doesn't allow you to select particular rows from each group.
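Row-level selection within groups can also be expressed as a boolean mask, without apply. A minimal sketch for the "every row except the last from each group" example, using GroupBy.cumcount; the names mask and result are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3),
                  columns=list("ABC"), index=list("vwxyz"))
df["A"] %= 2

# cumcount(ascending=False) numbers the rows of each group from the end,
# so the last row of every group gets 0.
mask = df.groupby("A").cumcount(ascending=False) > 0
result = df[mask]
print(result)
```

Unlike the apply-based versions, this keeps the rows in their original order (v, w, x) rather than grouped by A.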