How to make pandas work for cross multiplication - pandas

I have 3 data frame:
df1
id,k,a,b,c
1,2,1,5,1
2,3,0,1,0
3,6,1,1,0
4,1,0,5,0
5,1,1,5,0
df2
name,a,b,c
p,4,6,8
q,1,2,3
df3
type,w_ave,vac,yak
n,3,5,6
v,2,1,4
from the multiplication, using pandas and numpy, I want to the output in df1:
id,k,a,b,c,w_ave,vac,yak
1,2,1,5,1,16,15,18
2,3,0,1,0,0,3,6
3,6,1,1,0,5,4,7
4,1,0,5,0,0,11,14
5,1,1,5,0,13,12,15
the conditions are:
The value of the new column will be =
#its not a code
df1["w_ave"][1] = df3["w_ave"]["v"]+ df1["a"][1]*df2["a"]["q"]+df1["b"][1]*df2["b"]["q"]+df1["c"][1]*df2["c"]["q"]
for output["w_ave"][1]= 2 +(1*1)+(5*2)+(1*3)
df3["w_ave"]["v"]=2
df1["a"][1]=1, df2["a"]["q"]=1 ;
df1["b"][1]=5, df2["b"]["q"]=2 ;
df1["c"][1]=1, df2["c"]["q"]=3 ;
Which means:
- a new column will be added in df1, from the name of the column from df3.
- for each row of the df1, the value of a, b, c will be multiplied with the same-named q value from df2. and summed together with the corresponding value of df3.
-the column name of df1 , matched will column name of df2 will be multiplied. The other not matched column will not be multiplied, like df1[k].
- However, if there is any 0 in df1["a"], the corresponding output will be zero.
I am struggling with this. It was tough to explain also. My attempts are very silly. I know this attempt will not work. However, I have added this:
import pandas as pd, numpy as np
data1 = "Sample_data1.csv"
data2 = "Sample_data2.csv"
data3 = "Sample_data3.csv"
folder = '~Sample_data/'
df1 =pd.read_csv(folder + data1)
df2 =pd.read_csv(folder + data2)
df3 =pd.read_csv(folder + data3)
df1= df2 * df1

Ok, so this will in no way resemble your desired output, but vectorizing the formula you provided:
df2=df2.set_index("name")
df3=df3.set_index("type")
df1["w_ave"] = df3.loc["v", "w_ave"]+ df1["a"].mul(df2.loc["q", "a"])+df1["b"].mul(df2.loc["q", "b"])+df1["c"].mul(df2.loc["q", "c"])
Outputs:
id k a b c w_ave
0 1 2 1 5 1 16
1 2 3 0 1 0 4
2 3 6 1 1 0 5
3 4 1 0 5 0 12
4 5 1 1 5 0 13

Related

Creating a dataframe using roll-forward window on multivariate time series

Based on the simplifed sample dataframe
import pandas as pd
import numpy as np
timestamps = pd.date_range(start='2017-01-01', end='2017-01-5', inclusive='left')
values = np.arange(0,len(timestamps))
df = pd.DataFrame({'A': values ,'B' : values*2},
index = timestamps )
print(df)
A B
2017-01-01 0 0
2017-01-02 1 2
2017-01-03 2 4
2017-01-04 3 6
I want to use a roll-forward window of size 2 with a stride of 1 to create a resulting dataframe like
timestep_1 timestep_2 target
0 A 0 1 2
B 0 2 4
1 A 1 2 3
B 2 4 6
I.e., each window step should create a data item with the two values of A and B in this window and the A and B values immediately to the right of the window as target values.
My first idea was to use pandas
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html
But that seems to only work in combination with aggregate functions such as sum, which is a different use case.
Any ideas on how to implement this rolling-window-based sampling approach?
Here is one way to do it:
window_size = 3
new_df = pd.concat(
[
df.iloc[i : i + window_size, :]
.T.reset_index()
.assign(other_index=i)
.set_index(["other_index", "index"])
.set_axis([f"timestep_{j}" for j in range(1, window_size)] + ["target"], axis=1)
for i in range(df.shape[0] - window_size + 1)
]
)
new_df.index.names = ["", ""]
print(df)
# Output
timestep_1 timestep_2 target
0 A 0 1 2
B 0 2 4
1 A 1 2 3
B 2 4 6

How do I subset the columns of a dataframe based on the index of another dataframe?

The rows of clin.index (row length = 81) is a subset of the columns of common_mrna (col length = 151). I want to keep the columns of common_mrna only if the column names match to the row values of clin dataframe.
My code failed to reduce the number of columns in common_mrna to 81.
import pandas as pd
common_mrna = common_mrna.set_index("Hugo_Symbol")
mrna_val = {}
for colnames, val in common_mrna.iteritems():
for i, rows in clin.iterrows():
if [[common_mrna.columns == i] == "TRUE"]:
mrna_val = np.append(mrna_val, val)
mrna = np.concatenate(mrna_val, axis=0)
common_mrna
Hugo_Symbol
A
B
C
D
First
1
2
3
4
Second
5
row
6
7
clin
Another header
A
20
D
30
desired output
Hugo_Symbol
A
D
First
1
4
Second
5
7
Try this using reindex:
common_mrna.reindex(clin.index, axis=1)
Output:
A D
First 1 4
Second 5 7
Update, IIUC:
common_mrna.set_index('Hugo_Symbol').reindex(clin.index, axis=1).reset_index()
IUUC, you can select the rows of A header in clin found in common_mrna columns and add the first column of common_mrna
cols = clin.loc[clin.index.isin(common_mrna.columns)].index.tolist()
# or with set
cols = list(sorted(set(clin.index.tolist()) & set(common_mrna.columns), key=common_mrna.columns.tolist().index))
out = common_mrna[['Hugo_Symbol'] + cols]
print(out)
Hugo_Symbol A D
0 First 1 4
1 Second 5 7

pandas finding duplicate rows with different label

I have the case where I want to sanity check labeled data. I have hundreds of features and want to find points which have the same features but different label. These found cluster of disagreeing labels should then be numbered and put into a new dataframe.
This isn't hard but I am wondering what the most elegant solution for this is.
Here an example:
import pandas as pd
df = pd.DataFrame({
"feature_1" : [0,0,0,4,4,2],
"feature_2" : [0,5,5,1,1,3],
"label" : ["A","A","B","B","D","A"]
})
result_df = pd.DataFrame({
"cluster_index" : [0,0,1,1],
"feature_1" : [0,0,4,4],
"feature_2" : [5,5,1,1],
"label" : ["A","B","B","D"]
})
In order to get the output you want (both de-duplication and cluster_index), you can use a groupby approach:
g = df.groupby(['feature_1', 'feature_2'])['label']
(df.assign(cluster_index=g.ngroup()) # get group name
.loc[g.transform('size').gt(1)] # filter the non-duplicates
# line below only to have a nice cluster_index range (0,1…)
.assign(cluster_index= lambda d: d['cluster_index'].factorize()[0])
)
output:
feature_1 feature_2 label cluster_index
1 0 5 A 0
2 0 5 B 0
3 4 1 B 1
4 4 1 D 1
First get all duplicated values per feature columns and then if necessary remove duplciated by all columns (here in sample data not necessary), last add GroupBy.ngroup for groups indices:
df = df[df.duplicated(['feature_1','feature_2'],keep=False)].drop_duplicates()
df['cluster_index'] = df.groupby(['feature_1', 'feature_2'])['label'].ngroup()
print (df)
feature_1 feature_2 label cluster_index
1 0 5 A 0
2 0 5 B 0
3 4 1 B 1
4 4 1 D 1

How to change in a dataframe columns to rows based on same name?

As you can see I have a dataframe with several columns with the same name but split into 0., 1. until 27.
How can I take all the values of 1.name and have it under 0.name?
Thank you very much!
Assuming that for all 0<=n<=27 the column names' suffixes are the same, one solution can be:
import pandas as pd
import re
# pattern to extract colum name suffix
pattern = re.compile('^\d\.([\w\.]+)')
# getting all the distinct column names/fields
fields = set([pattern.match(colname).group(1) for colname in df.columns])
# max prefix number, for you 27
n = 27
partitions = []
for i in range(0,n+1):
# creating column selector for partitions
columns_for_partition = list(map(lambda field: str(i) + f'.{field}', fields))
# get partition from dataframe and renaming column to field name (removing n. prefix)
partition = df[columns_for_partition].rename(lambda x: x.split('.',1)[1], axis=1)
partitions.append(partition)
new_df = pd.concat(partitions)
print(new_df)
With an initial dataframe df
0.name 0.something 1.name 1.something
0 a 1 d 4
1 b 2 e 5
2 c 3 f 6
The resulting dataframe new_df will look like:
name something
0 a 1
1 b 2
2 c 3
0 d 4
1 e 5
2 f 6

pandas dataframe filter by sequence of values in a specific column

I have a dataframe
A B C
1 2 3
2 3 4
3 8 7
I want to take only rows where there is a sequence of 3,4 in columns C (in this scenario - first two rows)
What will be the best way to do so?
You can use rolling for general solution working with any pattern:
pat = np.asarray([3,4])
N = len(pat)
mask= (df['C'].rolling(window=N , min_periods=N)
.apply(lambda x: (x==pat).all(), raw=True)
.mask(lambda x: x == 0)
.bfill(limit=N-1)
.fillna(0)
.astype(bool))
df = df[mask]
print (df)
A B C
0 1 2 3
1 2 3 4
Explanation:
use rolling.apply and test pattern
replace 0s to NaNs by mask
use bfill with limit for filling first NANs values by last previous one
fillna NaNs to 0
last cast to bool by astype
Use shift
In [1085]: s = df.eq(3).any(1) & df.shift(-1).eq(4).any(1)
In [1086]: df[s | s.shift()]
Out[1086]:
A B C
0 1 2 3
1 2 3 4