Appending column in Pandas dataframe

For the 3 columns below, I would like to create a 4th column based on unique values from the 3 columns.
Col 1 Col 2 Col 3
A     X     Y
X     B     Y
C     X     X
The 4th column should contain only the values A, B, or C, as shown below. Please let me know how this can be done.
Col 1 Col 2 Col 3 Col 4
A     X     Y     A
X     B     Y     B
C     X     X     C

If "unique" means values that occur only once across all three columns (joining multiple unique values per row with a comma), use DataFrame.stack with Series.drop_duplicates and an aggregate join:
c = ['Col 1','Col 2','Col 3']
df['Col 4'] = df[c].stack().drop_duplicates(keep=False).groupby(level=0).agg(','.join)
print (df)
Col 1 Col 2 Col 3 Col 4
0 A X Y A
1 X B Y B
2 C X X C
If a row contains more than one unique value, the values are joined with a comma. For example, with different sample data:
print (df)
Col 1 Col 2 Col 3 Col 4
0 A X Y A
1 X B Y B
2 C X D C
c = ['Col 1','Col 2','Col 3']
df['Col 4'] = df[c].stack().drop_duplicates(keep=False).groupby(level=0).agg(','.join)
print (df)
Col 1 Col 2 Col 3 Col 4
0 A X Y A
1 X B Y B
2 C X D C,D
EDIT: If you need to extract only the A, B, C values defined in a list, use:
L = ['A','B','C']
c = ['Col 1','Col 2','Col 3']
s = df[c].stack()
df['Col 4'] = s[s.isin(L)].groupby(level=0).agg(','.join)
print (df)
Col 1 Col 2 Col 3 Col 4
0 A X Y A
1 X B Y B
2 C X X C
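For reference, here is a self-contained sketch of the first approach, reconstructing the sample frame from the question:

```python
import pandas as pd

# Sample frame reconstructed from the question.
df = pd.DataFrame({'Col 1': ['A', 'X', 'C'],
                   'Col 2': ['X', 'B', 'X'],
                   'Col 3': ['Y', 'Y', 'X']})

c = ['Col 1', 'Col 2', 'Col 3']

# stack() reshapes the three columns into one long Series indexed by
# (row, column); drop_duplicates(keep=False) keeps only values that
# occur exactly once in the whole Series; groupby(level=0) regroups
# them by the original row index before joining.
df['Col 4'] = (df[c].stack()
                    .drop_duplicates(keep=False)
                    .groupby(level=0)
                    .agg(','.join))
print(df)
```

Note that `keep=False` drops *all* occurrences of any repeated value, which is what makes the row-wise leftovers "unique across all columns".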

Related

pandas - replace rows in dataframe with rows of another dataframe by latest matching column entry

I have 2 dataframes, df1
A B C
0 a 1 x
1 b 2 y
2 c 3 z
3 d 4 g
4 e 5 h
and df2:
0 A B C
0 1 a 6 i
1 2 a 7 j
2 3 b 8 k
3 3 d 10 k
What I want to do is the following:
Whenever an entry in column A of df1 matches an entry in column A of df2, replace the matching row in df1 with parts of the row in df2
In my approach (code below), I tried to replace the first row (a,1,x) with (a,6,i) and then with (a,7,j). All other matching rows should be replaced as well:
So: (b,2,y) with (b,8,k) and (d,4,g) with (d,10,k).
In other words, every row in df1 should be replaced by the latest match on column A in df2.
import numpy as np
import pandas as pd
columns = ["0","A", "B", "C"]
s1 = pd.Series(['a', 1, 'x'])
s2 = pd.Series(['b', 2, 'y'])
s3 = pd.Series(['c', 3, 'z'])
s4 = pd.Series(['d', 4, 'g'])
s5 = pd.Series(['e', 5, 'h'])
df1 = pd.DataFrame([list(s1), list(s2),list(s3),list(s4),list(s5)], columns = columns[1::])
s1 = pd.Series([1, 'a', 6, 'i'])
s2 = pd.Series([2, 'a', 7, 'j'])
s3 = pd.Series([3, 'b', 8, 'k'])
s4 = pd.Series([3, 'd', 10, 'k'])
df2 = pd.DataFrame([list(s1), list(s2),list(s3),list(s4)], columns = columns)
cols = ["A", "B", "C"]
print(df1[columns[1::]])
print("---")
print(df2[columns])
print("---")
df1.loc[df1["A"].isin(df2["A"]), columns[1::]] = df2[columns[1::]]
print(df1)
The expected result would therefore be:
A B C
0 a 7 j
1 b 2 y
2 c 3 z
3 d 10 k
4 e 5 h
But the above approach results in:
A B C
0 a 6 i
1 a 7 j
2 c 3 z
3 d 10 k
4 e 5 h
I know I could do what I want with iterrows(), but I don't think that is the intended way of doing this, right? (I also have quite a lot of data to process, so I think it would not be the most efficient approach, but please correct me if I'm wrong here; in that case it would be OK to use it.)
Or is there any other easy approach to achieve this?
Use:
df = pd.concat([df1, df2]).drop_duplicates(['A'], keep='last').sort_values('A').drop('0', axis=1)
print (df)
A B C
1 a 7 j
2 b 8 k
2 c 3 z
3 d 10 k
4 e 5 h
You can try merge, then update:
df1.update(df1[['A']].merge(df2.drop_duplicates('A', keep='last'), on='A', how='left')[['B', 'C']])
print(df1)
A B C
0 a 7.0 j
1 b 8.0 k
2 c 3.0 z
3 d 10.0 k
4 e 5.0 h
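A related variant (a sketch, not from the answers above): align both frames on 'A' as the index, then let DataFrame.update do the overwriting. The sample frames are reconstructed from the question:

```python
import pandas as pd

# Frames reconstructed from the question.
df1 = pd.DataFrame({'A': list('abcde'),
                    'B': [1, 2, 3, 4, 5],
                    'C': list('xyzgh')})
df2 = pd.DataFrame({'0': [1, 2, 3, 3],
                    'A': ['a', 'a', 'b', 'd'],
                    'B': [6, 7, 8, 10],
                    'C': ['i', 'j', 'k', 'k']})

# Keep only the latest match per key in df2, align on 'A', and
# overwrite df1's B/C where a match exists; non-matching rows
# (c, e) are left untouched by update().
latest = df2.drop_duplicates('A', keep='last').set_index('A')
out = df1.set_index('A')
out.update(latest[['B', 'C']])
out = out.reset_index()
print(out)
```

As with the merge-based answer, B is upcast to float because the aligned frame contains NaN for the unmatched keys.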

Merge two matrices (dataframes) into one with interleaved columns

I have two dataframe like these:
df1 a b c
0 1 2 3
1 2 3 4
2 3 4 5
df2 x y z
0 T T F
1 F T T
2 F T F
I want to merge these matrices by interleaving their columns one by one, like this:
df a x b y c z
0 1 T 2 T 3 F
1 2 F 3 T 4 T
2 3 F 4 T 5 F
What's your idea? How can we merge, append, or concatenate to get this?
I used this code. it work dynamically:
df=pd.DataFrame()
for i in range(0,6):
if i%2 == 0:
j=(i)/2
df.loc[:,i] = df1.iloc[:,int(j)]
else:
j=(i-1)/2
df.loc[:,i] = df2.iloc[:,int(j)]
And it works correctly !!
Try:
df = pd.concat([df1, df2], axis=1)
df = df[['a','x','b','y','c','z']]
Prints:
a x b y c z
0 1 T 2 T 3 F
1 2 F 3 T 4 T
2 3 F 4 T 5 F
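The hard-coded column list can be generated dynamically by alternating the two frames' column names. A self-contained sketch, assuming both frames have the same number of columns:

```python
import pandas as pd
from itertools import chain

# Sample frames reconstructed from the question.
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [3, 4, 5]})
df2 = pd.DataFrame({'x': ['T', 'F', 'F'], 'y': ['T', 'T', 'T'], 'z': ['F', 'T', 'F']})

# Build the interleaved column order ['a','x','b','y','c','z'],
# then concatenate side by side and reorder; works for any
# (equal) number of columns.
order = list(chain.from_iterable(zip(df1.columns, df2.columns)))
df = pd.concat([df1, df2], axis=1)[order]
print(df)
```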

Count number of Columns with specific value in pandas

I am searching for a way to do a countif over rows in pandas. An example would be:
df = pd.DataFrame(data={'A': ['x', 'y', 'z'], 'B': ['z', 'y', 'x'], 'C': ['y', 'x', 'z']})
I want to count the number of repetitions on each row and add it to new columns based on specific criteria:
Criteria
C1 = x
C2 = y
C3 = z
In the example above, C3 would be [1, 0, 2], as there is one 'z' in row 0, no 'z' in row 1, and two 'z' in row 2.
The end table would look like:
A B C | C1 C2 C3
x z y | 1 1 1
y y x | 1 2 0
z x z | 1 0 2
How can I do this in Pandas?
Thanks a lot!
Do you mean:
df.join(df.apply(pd.Series.value_counts, axis=1).fillna(0))
Output:
A B C x y z
0 x z y 1.0 1.0 1.0
1 y y x 1.0 2.0 0.0
2 z x z 1.0 0.0 2.0
You can iterate through the values and sum across axis 1:
df = pd.concat([df.eq(val).sum(1) for val in ['x', 'y', 'z']], axis=1)
0 1 2
0 1 1 1
1 1 2 0
2 1 0 2
Then rename your column names accordingly.
For a more general solution, consider np.unique and using the pd.Series.name attr.
pd.concat([df.eq(val).sum(1).rename(val) for val in np.unique(df)], axis=1)
x y z
0 1 1 1
1 1 2 0
2 1 0 2
And with some trivial tweaks, you can have your end table
map_ = {'x':'C1', 'y':'C2', 'z':'C3'}
df.join(pd.concat([df.eq(i).sum(1).rename(map_[i]) for i in np.unique(df)], axis=1))
A B C C1 C2 C3
0 x z y 1 1 1
1 y y x 1 2 0
2 z x z 1 0 2

Python Select N number of rows dataframe

I have a dataframe with 2 columns, and I want to select N rows from column B per value in column A.
A B
0 A
0 B
0 I
0 D
1 A
1 F
1 K
1 L
2 R
For each unique number in column A, give me N random rows from column B; if N == 2, the resulting dataframe would look like the one below. If a value in column A doesn't have N rows, return all of its rows:
A B
0 A
0 D
1 F
1 K
2 R
Use DataFrame.sample per group in GroupBy.apply, testing the length of each group with if-else:
N = 2
df1 = df.groupby('A').apply(lambda x: x.sample(N) if len(x) >=N else x).reset_index(drop=True)
print (df1)
A B
0 0 I
1 0 D
2 1 A
3 1 K
4 2 R
Or:
N = 2
df1 = df.groupby('A', group_keys=False).apply(lambda x: x.sample(N) if len(x) >=N else x)
print (df1)
A B
0 0 A
3 0 D
5 1 F
6 1 K
8 2 R
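The if-else can also be folded into the sample size itself with min(), a small variant sketch (random_state is added here only to make the output reproducible):

```python
import pandas as pd

# Sample frame reconstructed from the question.
df = pd.DataFrame({'A': [0, 0, 0, 0, 1, 1, 1, 1, 2],
                   'B': list('ABIDAFKLR')})

N = 2
# Sampling min(len(group), N) rows avoids the explicit if/else:
# groups smaller than N are simply returned whole.
df1 = (df.groupby('A', group_keys=False)
         .apply(lambda x: x.sample(min(len(x), N), random_state=0)))
print(df1)
```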

Pandas duplicates when grouped

x = df.groupby(["Customer ID", "Category"]).sum().sort_values(by="Value", ascending=False)
I want to group by Customer ID but when I use above code, it duplicates customers...
Here is the result:
Source DF:
Customer ID Category Value
0 A x 5
1 B y 5
2 B z 6
3 C x 7
4 A z 2
5 B x 5
6 A x 1
Sample data: https://ufile.io/dpruz
I think you are looking for something like this:
df_out = df.groupby(['Customer ID','Category']).sum()
df_out.reindex(df_out.groupby(level=0).sum().sort_values('Value', ascending=False).index, level=0)
Output:
Value
Customer ID Category
B x 5
y 5
z 6
A x 6
z 2
C x 7
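Putting it together as a runnable sketch (the source frame is reconstructed from the question; `groupby(level=0).sum()` is the non-deprecated spelling of `sum(level=0)`):

```python
import pandas as pd

# Source frame reconstructed from the question.
df = pd.DataFrame({'Customer ID': ['A', 'B', 'B', 'C', 'A', 'B', 'A'],
                   'Category': ['x', 'y', 'z', 'x', 'z', 'x', 'x'],
                   'Value': [5, 5, 6, 7, 2, 5, 1]})

df_out = df.groupby(['Customer ID', 'Category']).sum()

# Total Value per customer, sorted descending, then reorder the
# outer index level of df_out to match that ranking.
totals = df_out.groupby(level=0).sum().sort_values('Value', ascending=False)
result = df_out.reindex(totals.index, level=0)
print(result)
```

Customers are then ordered by their total spend (B: 16, A: 8, C: 7) while each customer's categories stay grouped together.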