For my graduation project, I would like to remove duplicate rows and, for each value in column a, keep only the row where columns b and c are equal. I have tried a lot of things (groupby, merge combinations, and duplicates), but nothing has worked so far. Can you please help me? Many thanks!
input:
a b c
0 1 A B
1 1 A A
2 1 A C
3 2 B A
4 2 B B
result:
a b c
1 1 A A
4 2 B B
I believe you need:
print (df)
a b c
0 1 A B
1 1 A A
2 1 A C
3 2 B A
4 2 B B
5 3 C C
6 4 C NaN
7 4 C E
7 5 NaN E
Replace NaNs by forward and back filling:
df1 = df[['b','c']].bfill(axis=1).ffill(axis=1)
print (df1)
b c
0 A B
1 A A
2 A C
3 B A
4 B B
5 C C
6 C C
7 C E
7 E E
Compare the two columns in df1 and, because df1 has the same index as df, use the resulting boolean mask to filter df:
df = df[df1['b'] == df1['c']]
print (df)
a b c
1 1 A A
4 2 B B
5 3 C C
6 4 C NaN
7 5 NaN E
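The steps above can be put together as a self-contained sketch. The sample frame is rebuilt by hand with a default 0..8 index (an assumption, so the last row is labelled 8 here), and the names `filled`, `mask`, `result` and `strict` are illustrative only. A plain comparison without filling is included for contrast, since it drops the rows that contain NaN.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand; default RangeIndex 0..8 is an assumption.
df = pd.DataFrame({
    "a": [1, 1, 1, 2, 2, 3, 4, 4, 5],
    "b": ["A", "A", "A", "B", "B", "C", "C", "C", np.nan],
    "c": ["B", "A", "C", "A", "B", "C", np.nan, "E", "E"],
})

# Fill a missing b from c and a missing c from b (row-wise).
filled = df[["b", "c"]].bfill(axis=1).ffill(axis=1)

# The mask shares df's index, so it can filter df directly.
mask = filled["b"] == filled["c"]
result = df[mask]

# Without filling, NaN never compares equal, so NaN rows are dropped.
strict = df[df["b"] == df["c"]]

print(result)
print(strict)
```

If rows with a missing value should not count as matches, the `strict` variant is probably the one to use.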
Input
df
id label
a 1
b 2
a 3
a 4
b 2
b 3
c 1
c 2
d 2
d 3
Expected
df
id label
a 1
b 2
a 1
a 1
b 2
b 2
c 1
c 1
d 2
d 2
For id a the label value is 1, and for id b it is 2, because 1 and 2 are the labels of the first records for a and b.
Try
I referred to this post, but still could not solve it.
Use transform with 'first':
df['lb2']=df.groupby('id').label.transform('first')
df
Out[87]:
id label lb2
0 a 1 1
1 b 2 2
2 a 3 1
3 a 4 1
4 b 2 2
5 b 3 2
6 c 1 1
7 c 2 1
8 d 2 2
9 d 3 2
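For reference, a minimal runnable version that rebuilds the input by hand and overwrites label in place, which matches the expected output in the question (the answer above writes to a separate lb2 column instead):

```python
import pandas as pd

# Input from the question, rebuilt by hand.
df = pd.DataFrame({
    "id": list("abaabbccdd"),
    "label": [1, 2, 3, 4, 2, 3, 1, 2, 2, 3],
})

# transform('first') broadcasts each group's first label back to every row
# of that group, so the result keeps the original row order and length.
df["label"] = df.groupby("id")["label"].transform("first")

print(df)
```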
For the following data:
index  bond  stock  investor_bond  investor_stock
0      1     2      A              B
1      1     2      A              E
2      1     2      A              F
3      1     2      B              B
4      1     2      B              E
5      1     2      B              F
6      1     3      A              A
7      1     3      A              E
8      1     3      A              G
9      1     3      B              A
10     1     3      B              E
11     1     3      B              G
12     2     4      C              F
13     2     4      C              A
14     2     4      C              C
15     2     5      B              E
16     2     5      B              B
17     2     5      B              H
Bond 1 has two investors, A and B. Stock 2 has three investors, B, E and F. For each investor pair (investor_bond, investor_stock), we want to filter it out if the two investors have ever invested in the same bond or stock.
For example, the pair (B, F) at index 5 should be filtered out because both of them invested in stock 2.
Sample output should be like:
index bond stock investor_bond investor_stock
11 1 3 B G
So far I have tried using two loops.
A1 = A1.groupby('bond').apply(lambda x: x[~x.investor_stock.isin(x.bond)]).reset_index(drop=True)
stock_list = A1.groupby(['bond', 'stock']).apply(lambda x: x.investor_stock.unique()).reset_index()
stock_list = stock_list.rename(columns={0: 's'})
stock_list = stock_list.groupby('bond').apply(lambda x: list(x.s)).reset_index()
stock_list = stock_list.rename(columns={0: 's'})
A1 = pd.merge(A1, stock_list, on='bond', how='left')
A1['in_out'] = False
for j in range(0, len(A1)):
    for i in range(0, len(A1.s[j])):
        A1['in_out'] = A1.in_out | (
            A1.investor_bond.isin(A1.s[j][i]) & A1.investor_stock.isin(A1.s[j][i]))
    print(j)
The loop is running forever due to the data size, and I am seeking a faster way.
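One possible loop-free direction, offered as a sketch rather than a confirmed answer: build a table of (asset, investor) memberships, self-merge it on the asset to enumerate every pair of investors that share a bond or stock, and then drop the rows whose pair appears in that set. It assumes that "invested in the same bond/stock" means both investors appear against the same bond or the same stock anywhere in the table. The names `members`, `pairs`, `flagged` and the `bond_`/`stock_` prefixes are made up for illustration.

```python
import pandas as pd

# Sample data from the question, rebuilt by hand.
df = pd.DataFrame({
    "bond": [1] * 12 + [2] * 6,
    "stock": [2] * 6 + [3] * 6 + [4] * 3 + [5] * 3,
    "investor_bond": list("AAABBB") * 2 + list("CCCBBB"),
    "investor_stock": list("BEF") * 2 + list("AEG") * 2 + list("FAC") + list("EBH"),
})

# One row per (asset, investor); prefixes keep bond ids and stock ids apart.
bond_members = pd.DataFrame({
    "asset": "bond_" + df["bond"].astype(str),
    "investor": df["investor_bond"],
})
stock_members = pd.DataFrame({
    "asset": "stock_" + df["stock"].astype(str),
    "investor": df["investor_stock"],
})
members = pd.concat([bond_members, stock_members], ignore_index=True).drop_duplicates()

# Self-merge on asset: every ordered pair of investors sharing an asset.
pairs = (
    members.merge(members, on="asset", suffixes=("_x", "_y"))
    [["investor_x", "investor_y"]]
    .drop_duplicates()
)

# Left merge with an indicator; 'left_only' rows have no shared asset.
flagged = df.merge(
    pairs,
    left_on=["investor_bond", "investor_stock"],
    right_on=["investor_x", "investor_y"],
    how="left",
    indicator=True,
)
result = df[(flagged["_merge"] == "left_only").to_numpy()]

print(result)
```

On the sample data this keeps only the row at index 11, which agrees with the expected output. The self-merge can grow large when a single bond or stock has many investors, so memory use would need checking on the real data.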
In the line below, I am renaming the columns of pnlsummary dataframe from the column names of three series (totalheldmw, totalcost and totalsellprofit) and one dataframe (totalheldprofit).
The difficulty I have is iterating over the column names of the dataframe. I have assigned the names manually, as you can see below. I suppose there is a more efficient way of iterating over the column names. Please advise.
pnlsummary.columns = [
    totalheldmw.name[0], totalcost.name[0], totalsellprofit.name[0],
    totalheldprofit.columns[0], totalheldprofit.columns[1],
    totalheldprofit.columns[2], totalheldprofit.columns[3],
]
I think you need to build a list from the fixed names and then append the DataFrame's column names converted to a list (the slice 0:4 covers the four columns 0 to 3 used in the question):
pnlsummary.columns = ([totalheldmw.name[0], totalcost.name[0], totalsellprofit.name[0]] +
                      totalheldprofit.columns[0:4].astype(str).tolist())
Sample:
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbb')})
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
df.columns = ['a','s','d'] + df.columns[0:3].tolist()
print (df)
a s d A B C
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
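The same idea applied to objects shaped like those in the question can be sketched as follows. All data and names here (`heldmw`, `p1`, and so on) are hypothetical, and the series are assumed to have plain string names, so `.name` is used rather than `.name[0]` as in the question.

```python
import pandas as pd

# Hypothetical stand-ins for the objects in the question.
totalheldmw = pd.Series([10, 20], name="heldmw")
totalcost = pd.Series([1.5, 2.5], name="cost")
totalsellprofit = pd.Series([0.3, 0.4], name="sellprofit")
totalheldprofit = pd.DataFrame({"p1": [1, 2], "p2": [3, 4], "p3": [5, 6], "p4": [7, 8]})

pnlsummary = pd.concat(
    [totalheldmw, totalcost, totalsellprofit, totalheldprofit], axis=1
)
pnlsummary.columns = range(pnlsummary.shape[1])  # pretend the names were lost

# Collect the series names in a comprehension, then append all frame columns.
series_names = [s.name for s in (totalheldmw, totalcost, totalsellprofit)]
pnlsummary.columns = series_names + totalheldprofit.columns.tolist()

print(pnlsummary.columns.tolist())
```

Using `columns.tolist()` without a slice avoids having to count the columns by hand.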
Assume there are two DataFrames, A and B, like the following:
A:
a A
b B
c C
B:
1 2
3 4
How can I produce a DataFrame C like this:
a A 1 2
a A 3 4
b B 1 2
b B 3 4
c C 1 2
c C 3 4
Is there a function in pandas that can do this operation?
First, all values have to be unique in each DataFrame.
I think you need product:
import pandas as pd
from itertools import product
A = pd.DataFrame({'a':list('abc')})
B = pd.DataFrame({'a':[1,2]})
C = pd.DataFrame(list(product(A['a'], B['a'])))
print (C)
0 1
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c 2
Pure pandas solutions with MultiIndex.from_product:
mux = pd.MultiIndex.from_product([A['a'], B['a']])
C = pd.DataFrame(mux.values.tolist())
print (C)
0 1
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c 2
C = mux.to_frame().reset_index(drop=True)
print (C)
0 1
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c 2
Solution using a cross join: merge on a helper column that assign fills with the same scalar in both DataFrames, then drop it:
df = pd.merge(A.assign(tmp=1), B.assign(tmp=1), on='tmp').drop('tmp', axis=1)
df.columns = ['a','b']
print (df)
a b
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c 2
EDIT:
A = pd.DataFrame({'a':list('abc'), 'b':list('ABC')})
B = pd.DataFrame({'a':[1,3], 'c':[2,4]})
print (A)
a b
0 a A
1 b B
2 c C
print (B)
a c
0 1 2
1 3 4
C = pd.merge(A.assign(tmp=1), B.assign(tmp=1), on='tmp').drop('tmp', axis=1)
C.columns = list('abcd')
print (C)
a b c d
0 a A 1 2
1 a A 3 4
2 b B 1 2
3 b B 3 4
4 c C 1 2
5 c C 3 4
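In newer pandas versions (1.2 and later) merge also accepts how='cross', which makes the helper column unnecessary. A minimal sketch with the same A and B as in the EDIT:

```python
import pandas as pd

A = pd.DataFrame({"a": list("abc"), "b": list("ABC")})
B = pd.DataFrame({"a": [1, 3], "c": [2, 4]})

# how='cross' pairs every row of A with every row of B.
# The shared column name 'a' gets the default suffixes (a_x, a_y),
# so the columns are renamed afterwards, as in the answer above.
C = pd.merge(A, B, how="cross")
C.columns = list("abcd")

print(C)
```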
I have a pandas dataframe df with columns [a, b, c, d, e, f]. I want to perform a group by on df. I can best describe what it's supposed to do in SQL:
SELECT a, b, min(c), min(d), max(e), sum(f)
FROM df
GROUP BY a, b
How do I do this group by using pandas on my dataframe df?
consider df:
a b c d e f
1 1 2 5 9 3
1 1 3 3 4 5
2 2 4 7 4 4
2 2 5 3 8 8
I expect the result to be:
a b c d e f
1 1 2 3 9 8
2 2 4 3 8 12
Use agg:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    dict(
        a=list('aaaabbbb'),
        b=list('ccddccdd'),
        c=np.arange(8),
        d=np.arange(8),
        e=np.arange(8),
        f=np.arange(8),
    )
)
funcs = dict(c='min', d='min', e='max', f='sum')
df.groupby(['a', 'b']).agg(funcs).reset_index()
a b c e f d
0 a c 0 1 1 0
1 a d 2 3 5 2
2 b c 4 5 9 4
3 b d 6 7 13 6
With your data:
a b c e f d
0 1 1 2 9 8 3
1 2 2 4 8 12 3
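As a variation, named aggregation (available since pandas 0.25) states each output column explicitly, so the column order follows the call rather than dictionary ordering. A minimal sketch on the data from the question:

```python
import pandas as pd

# Data from the question, rebuilt by hand.
df = pd.DataFrame({
    "a": [1, 1, 2, 2],
    "b": [1, 1, 2, 2],
    "c": [2, 3, 4, 5],
    "d": [5, 3, 7, 3],
    "e": [9, 4, 4, 8],
    "f": [3, 5, 4, 8],
})

# Equivalent of: SELECT a, b, min(c), min(d), max(e), sum(f) ... GROUP BY a, b
result = df.groupby(["a", "b"], as_index=False).agg(
    c=("c", "min"),
    d=("d", "min"),
    e=("e", "max"),
    f=("f", "sum"),
)

print(result)
```

Passing as_index=False keeps a and b as ordinary columns, which plays the same role as the reset_index() call in the answer above.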