Conditional operation on pandas column

df1 =
name col1
a 1
b 2
c 3
d 4
df2 =
name col2
b 3
c 9
a 2
d 3
I want to compare the names in both data frames and multiply the other two columns accordingly, so the output would look like:
df3 =
name col_new
a 2
b 6
c 27
d 12

Use Series.map to align df2's values with df1's name order, multiply with Series.mul, and extract the original column with DataFrame.pop:
df1['col_new'] = df1.pop('col1').mul(df1['name'].map(df2.set_index('name')['col2']))
For a new DataFrame, use DataFrame.assign:
df3 = df1.assign(col_new=df1.pop('col1').mul(df1['name'].map(df2.set_index('name')['col2'])))
Or another solution with DataFrame.merge and left join:
df3 = df1.merge(df2, on='name', how='left')
df3['col_new'] = df3.pop('col1').mul(df3.pop('col2'))
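Putting the map-based approach together with the sample frames above, a minimal runnable sketch:

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["a", "b", "c", "d"], "col1": [1, 2, 3, 4]})
df2 = pd.DataFrame({"name": ["b", "c", "a", "d"], "col2": [3, 9, 2, 3]})

# Align df2's col2 to df1's row order via the shared 'name' key, then multiply
aligned = df1["name"].map(df2.set_index("name")["col2"])
df3 = df1.assign(col_new=df1.pop("col1").mul(aligned))
# df3["col_new"] -> [2, 6, 27, 12]
```

Note that `pop` mutates df1 while the keyword argument is being evaluated, so df3 ends up with only `name` and `col_new`.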

Related

Multimatch join in pandas

I am looking to join two data frames on one column, and if there are multiple matches, append the results to another column.
NB. using a different example as yours is not reproducible.
You can convert with str.upper, then explode, map the values, and groupby.agg them back into a string:
mapper = df2.set_index('name')['ID'].astype(str)
df1['ID'] = (df1['name']
             .str.upper().str.split(',')
             .explode()
             .map(mapper)
             .groupby(level=0).agg(','.join)
            )
Or, with a list comprehension:
mapper = df2.set_index('name')['ID'].astype(str)
df1['ID'] = [','.join([mapper[x.upper()] for x in s.split(',') if x.upper() in mapper])
             for s in df1['name']]
output:
name ID
0 A 1
1 b 2
2 A,B 1,2
3 C,a 3,1
4 D 4
Used input:
# df1
name
0 A
1 b
2 A,B
3 C,a
4 D
# df2
name ID
0 A 1
1 B 2
2 C 3
3 D 4
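Assembled into a self-contained sketch with the inputs above (the round-trip relies on explode preserving the original index, so groupby(level=0) can re-join per row):

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["A", "b", "A,B", "C,a", "D"]})
df2 = pd.DataFrame({"name": ["A", "B", "C", "D"], "ID": [1, 2, 3, 4]})

mapper = df2.set_index("name")["ID"].astype(str)
df1["ID"] = (df1["name"]
             .str.upper().str.split(",")       # normalise case, one list per row
             .explode()                        # one row per candidate name
             .map(mapper)                      # look up each ID as a string
             .groupby(level=0).agg(",".join))  # re-join per original row
# df1["ID"] -> ['1', '2', '1,2', '3,1', '4']
```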

How do I subset the columns of a dataframe based on the index of another dataframe?

The rows of clin.index (81 rows) are a subset of the columns of common_mrna (151 columns). I want to keep a column of common_mrna only if its name matches a row value of the clin dataframe.
My code fails to reduce the number of columns in common_mrna to 81.
import pandas as pd
import numpy as np

common_mrna = common_mrna.set_index("Hugo_Symbol")
mrna_val = {}
for colnames, val in common_mrna.iteritems():
    for i, rows in clin.iterrows():
        if [[common_mrna.columns == i] == "TRUE"]:
            mrna_val = np.append(mrna_val, val)
mrna = np.concatenate(mrna_val, axis=0)
common_mrna
  Hugo_Symbol  A  B  C  D
  First        1  2  3  4
  Second       5  6  7
clin
     Another header
  A  20
  D  30
desired output
  Hugo_Symbol  A  D
  First        1  4
  Second       5  7
Try this using reindex:
common_mrna.reindex(clin.index, axis=1)
Output:
A D
First 1 4
Second 5 7
Update, IIUC:
common_mrna.set_index('Hugo_Symbol').reindex(clin.index, axis=1).reset_index()
IIUC, you can select the rows of clin whose index is found in common_mrna's columns, then prepend the first column of common_mrna:
cols = clin.loc[clin.index.isin(common_mrna.columns)].index.tolist()
# or with set
cols = list(sorted(set(clin.index.tolist()) & set(common_mrna.columns), key=common_mrna.columns.tolist().index))
out = common_mrna[['Hugo_Symbol'] + cols]
print(out)
Hugo_Symbol A D
0 First 1 4
1 Second 5 7
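A runnable sketch of the reindex answer, using hypothetical data shaped like the question's frames (the exact cell values here are assumed):

```python
import pandas as pd

# Hypothetical data in the shape of the question (values assumed)
common_mrna = pd.DataFrame({"A": [1, 5], "B": [2, 6], "C": [3, 7], "D": [4, 8]},
                           index=["First", "Second"])
common_mrna.index.name = "Hugo_Symbol"
clin = pd.DataFrame({"Another header": [20, 30]}, index=["A", "D"])

# Keep only the columns whose names appear in clin's index
out = common_mrna.reindex(clin.index, axis=1)
# out keeps columns ['A', 'D'] only
```

reindex also inserts NaN columns for clin labels missing from common_mrna, which is worth checking when the two sources can drift apart.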

How to make pandas work for cross multiplication

I have 3 data frames:
df1
id,k,a,b,c
1,2,1,5,1
2,3,0,1,0
3,6,1,1,0
4,1,0,5,0
5,1,1,5,0
df2
name,a,b,c
p,4,6,8
q,1,2,3
df3
type,w_ave,vac,yak
n,3,5,6
v,2,1,4
From the multiplication, using pandas and numpy, I want this output in df1:
id,k,a,b,c,w_ave,vac,yak
1,2,1,5,1,16,15,18
2,3,0,1,0,0,3,6
3,6,1,1,0,5,4,7
4,1,0,5,0,0,11,14
5,1,1,5,0,13,12,15
the conditions are:
The value of the new column will be:
# this is not code
df1["w_ave"][1] = df3["w_ave"]["v"] + df1["a"][1]*df2["a"]["q"] + df1["b"][1]*df2["b"]["q"] + df1["c"][1]*df2["c"]["q"]
so output["w_ave"][1] = 2 + (1*1) + (5*2) + (1*3), where:
df3["w_ave"]["v"] = 2
df1["a"][1] = 1, df2["a"]["q"] = 1;
df1["b"][1] = 5, df2["b"]["q"] = 2;
df1["c"][1] = 1, df2["c"]["q"] = 3;
Which means:
- a new column will be added to df1 for each column of df3, named after it.
- for each row of df1, the values of a, b, c will be multiplied by the same-named values from df2, and summed together with the corresponding value of df3.
- only the df1 columns whose names match a df2 column are multiplied; unmatched columns like df1["k"] are not.
- however, if there is a 0 in df1["a"], the corresponding output will be zero.
I am struggling with this, and it was tough to explain as well. My attempt is very naive and I know it will not work, but I have added it:
import pandas as pd, numpy as np
data1 = "Sample_data1.csv"
data2 = "Sample_data2.csv"
data3 = "Sample_data3.csv"
folder = '~Sample_data/'
df1 = pd.read_csv(folder + data1)
df2 = pd.read_csv(folder + data2)
df3 = pd.read_csv(folder + data3)
df1= df2 * df1
Ok, so this will in no way resemble your desired output, but vectorizing the formula you provided:
df2=df2.set_index("name")
df3=df3.set_index("type")
df1["w_ave"] = df3.loc["v", "w_ave"] + df1["a"].mul(df2.loc["q", "a"]) + df1["b"].mul(df2.loc["q", "b"]) + df1["c"].mul(df2.loc["q", "c"])
Outputs:
id k a b c w_ave
0 1 2 1 5 1 16
1 2 3 0 1 0 4
2 3 6 1 1 0 5
3 4 1 0 5 0 12
4 5 1 1 5 0 13
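Extending the answer's vectorized formula to all three df3 columns reproduces the asker's desired table. This sketch assumes, as the worked example implies, that df2's "q" row supplies the multipliers and df3's "v" row the offsets, and that the zero rule (judging from the desired output) applies to w_ave only:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3, 4, 5], "k": [2, 3, 6, 1, 1],
                    "a": [1, 0, 1, 0, 1], "b": [5, 1, 1, 5, 5], "c": [1, 0, 0, 0, 0]})
df2 = pd.DataFrame({"name": ["p", "q"], "a": [4, 1], "b": [6, 2], "c": [8, 3]}).set_index("name")
df3 = pd.DataFrame({"type": ["n", "v"], "w_ave": [3, 2], "vac": [5, 1], "yak": [6, 4]}).set_index("type")

# Row-wise weighted sum of a, b, c against df2's 'q' row (aligned by column name)
dot = df1[["a", "b", "c"]].mul(df2.loc["q", ["a", "b", "c"]]).sum(axis=1)
for col in df3.columns:               # one new column per df3 column, offset by its 'v' value
    df1[col] = df3.loc["v", col] + dot
df1.loc[df1["a"].eq(0), "w_ave"] = 0  # the question's zero rule, applied to w_ave only
# w_ave -> [16, 0, 5, 0, 13], vac -> [15, 3, 4, 11, 12], yak -> [18, 6, 7, 14, 15]
```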

Pandas find columns with wildcard names

I have a pandas dataframe with column names like this:
id ColNameOrig_x ColNameOrig_y
There are many such columns, the 'x' and 'y' came about because 2 datasets with similar column names were merged.
What I need to do:
df.ColName = df.ColNameOrig_x + df.ColNameOrig_y
I am now manually repeating this line for many columns (close to 50); is there a wildcard way of doing this?
You can use DataFrame.filter with DataFrame.groupby, passing a lambda function and axis=1 to group by column name prefix and aggregate with sum; alternatively, use string methods like Series.str.split with indexing:
df1 = df.filter(like='_').groupby(lambda x: x.split('_')[0], axis=1).sum()
print (df1)
ColName1Orig ColName2Orig
0 3 7
1 11 15
df1 = df.filter(like='_').groupby(df.columns.str.split('_').str[0], axis=1).sum()
print (df1)
ColName1Orig ColName2Orig
0 3 7
1 11 15
df1 = df.filter(like='_').groupby(df.columns.str[:12], axis=1).sum()
print (df1)
ColName1Orig ColName2Orig
0 3 7
1 11 15
You can use the subscripting syntax to access column names dynamically:
col_groups = ['ColName1', 'ColName2']
for grp in col_groups:
    df[grp] = df[f'{grp}Orig_x'] + df[f'{grp}Orig_y']
Or you can aggregate by column group. For example
df = pd.DataFrame([
[1,2,3,4],
[5,6,7,8]
], columns=['ColName1Orig_x', 'ColName1Orig_y', 'ColName2Orig_x', 'ColName2Orig_y'])
# Here's your opportunity to define the wildcard
col_groups = df.columns.str.extract('(.+)Orig_[xy]')[0]
df.columns = [col_groups, df.columns]
df.groupby(level=0, axis=1).sum()
Input:
ColName1Orig_x ColName1Orig_y ColName2Orig_x ColName2Orig_y
1 2 3 4
5 6 7 8
Output:
ColName1 ColName2
3 7
11 15
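Since axis=1 grouping is deprecated in recent pandas versions, the loop variant generalises well; a sketch that discovers the prefixes itself (splitting on 'Orig_' is an assumption about the naming scheme):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                  columns=["ColName1Orig_x", "ColName1Orig_y",
                           "ColName2Orig_x", "ColName2Orig_y"])

# Collect each distinct prefix before 'Orig_' and sum its _x/_y pair
prefixes = sorted({c.split("Orig_")[0] for c in df.columns if "Orig_" in c})
for p in prefixes:
    df[p] = df[f"{p}Orig_x"] + df[f"{p}Orig_y"]
# df["ColName1"] -> [3, 11], df["ColName2"] -> [7, 15]
```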

Map column names if data is same in two dataframes

I have two pandas dataframes
df1 = A B C
1 2 3
2 3 4
3 4 5
df2 = X Y Z
1 2 3
2 3 4
3 4 5
I need to map based on data: if the data is the same, then map the column names.
Output = col1 col2
A X
B Y
C Z
I cannot find any built-in function to support this, hence simply loop over all columns:
import pandas as pd

pairs = []
for col1 in df1.columns:
    for col2 in df2.columns:
        if df1[col1].equals(df2[col2]):
            pairs.append((col1, col2))
output = pd.DataFrame(pairs, columns=['col1', 'col2'])
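With the sample frames from the question, the loop produces the expected mapping; note that Series.equals compares values and dtype, not the column names:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 5]})
df2 = pd.DataFrame({"X": [1, 2, 3], "Y": [2, 3, 4], "Z": [3, 4, 5]})

# Pair up every df1 column with every value-identical df2 column
pairs = [(c1, c2) for c1 in df1.columns for c2 in df2.columns
         if df1[c1].equals(df2[c2])]
output = pd.DataFrame(pairs, columns=["col1", "col2"])
# pairs -> [('A', 'X'), ('B', 'Y'), ('C', 'Z')]
```

This is O(n*m) in the number of columns, which is fine for small frames; for wide ones, hashing each column first would cut the comparisons.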