How to make pandas work for cross multiplication - pandas
I have 3 data frames:
df1
id,k,a,b,c
1,2,1,5,1
2,3,0,1,0
3,6,1,1,0
4,1,0,5,0
5,1,1,5,0
df2
name,a,b,c
p,4,6,8
q,1,2,3
df3
type,w_ave,vac,yak
n,3,5,6
v,2,1,4
From the multiplication, using pandas and numpy, I want the following output in df1:
id,k,a,b,c,w_ave,vac,yak
1,2,1,5,1,16,15,18
2,3,0,1,0,0,3,6
3,6,1,1,0,5,4,7
4,1,0,5,0,0,11,14
5,1,1,5,0,13,12,15
The conditions are:
The value of the new column will be:
# this is notation, not runnable code
df1["w_ave"][1] = df3["w_ave"]["v"]+ df1["a"][1]*df2["a"]["q"]+df1["b"][1]*df2["b"]["q"]+df1["c"][1]*df2["c"]["q"]
So output["w_ave"][1] = 2 + (1*1) + (5*2) + (1*3) = 16, because:
df3["w_ave"]["v"]=2
df1["a"][1]=1, df2["a"]["q"]=1 ;
df1["b"][1]=5, df2["b"]["q"]=2 ;
df1["c"][1]=1, df2["c"]["q"]=3 ;
Which means:
- A new column will be added to df1 for each column name in df3.
- For each row of df1, the values of a, b, and c will be multiplied by the same-named q values from df2, summed together, and added to the corresponding v value from df3.
- Only the columns of df1 whose names match column names of df2 will be multiplied. The unmatched columns, like df1["k"], will not be multiplied.
- However, if there is a 0 in df1["a"], the corresponding output will be zero.
I am struggling with this, and it was tough to explain too. My attempt is very naive and I know it will not work, but I have added it anyway:
import pandas as pd, numpy as np
data1 = "Sample_data1.csv"
data2 = "Sample_data2.csv"
data3 = "Sample_data3.csv"
folder = '~Sample_data/'
df1 =pd.read_csv(folder + data1)
df2 =pd.read_csv(folder + data2)
df3 =pd.read_csv(folder + data3)
df1= df2 * df1
OK, so this will not reproduce your desired output exactly (it only computes w_ave and does not apply the zero condition), but here is the formula you provided, vectorized:
df2 = df2.set_index("name")
df3 = df3.set_index("type")
df1["w_ave"] = (
    df3.loc["v", "w_ave"]
    + df1["a"].mul(df2.loc["q", "a"])
    + df1["b"].mul(df2.loc["q", "b"])
    + df1["c"].mul(df2.loc["q", "c"])
)
Outputs:
   id  k  a  b  c  w_ave
0   1  2  1  5  1     16
1   2  3  0  1  0      4
2   3  6  1  1  0      5
3   4  1  0  5  0     12
4   5  1  1  5  0     13
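To get closer to the desired table, the same idea can be extended to every df3 column at once. The following is only a sketch under some assumptions: the frames are rebuilt inline from the sample data instead of read from CSV, the variable names `shared` and `weighted` are made up for illustration, and the zero rule is applied only to w_ave, because that is the only column that is zeroed in the expected output.

```python
import pandas as pd

# Sample data from the question, built inline instead of read from CSV
df1 = pd.DataFrame({"id": [1, 2, 3, 4, 5], "k": [2, 3, 6, 1, 1],
                    "a": [1, 0, 1, 0, 1], "b": [5, 1, 1, 5, 5],
                    "c": [1, 0, 0, 0, 0]})
df2 = pd.DataFrame({"name": ["p", "q"], "a": [4, 1],
                    "b": [6, 2], "c": [8, 3]}).set_index("name")
df3 = pd.DataFrame({"type": ["n", "v"], "w_ave": [3, 2],
                    "vac": [5, 1], "yak": [6, 4]}).set_index("type")

# Columns present in both df1 and df2 (a, b, c); k and id are left out
shared = df1.columns.intersection(df2.columns)

# Row-wise weighted sum a*q_a + b*q_b + c*q_c
weighted = df1[shared].dot(df2.loc["q", shared])

# One new column per df3 column, offset by the "v" row of df3
for col in df3.columns:
    df1[col] = weighted + df3.loc["v", col]

# Assumption: only w_ave is forced to 0 where a == 0
df1["w_ave"] = df1["w_ave"].where(df1["a"].ne(0), 0)

print(df1)
```

With the sample data this should give the w_ave, vac and yak columns shown in the question. If the zero rule is meant to apply to other columns as well, the last step would need to change.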
Related
Creating a dataframe using roll-forward window on multivariate time series
Based on the simplified sample dataframe
import pandas as pd
import numpy as np
timestamps = pd.date_range(start='2017-01-01', end='2017-01-5', inclusive='left')
values = np.arange(0,len(timestamps))
df = pd.DataFrame({'A': values ,'B' : values*2}, index = timestamps )
print(df)
            A  B
2017-01-01  0  0
2017-01-02  1  2
2017-01-03  2  4
2017-01-04  3  6
I want to use a roll-forward window of size 2 with a stride of 1 to create a resulting dataframe like
     timestep_1  timestep_2  target
0 A           0           1       2
  B           0           2       4
1 A           1           2       3
  B           2           4       6
I.e., each window step should create a data item with the two values of A and B in this window and the A and B values immediately to the right of the window as target values.
My first idea was to use pandas https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html
But that seems to only work in combination with aggregate functions such as sum, which is a different use case.
Any ideas on how to implement this rolling-window-based sampling approach?
Here is one way to do it:
window_size = 3
new_df = pd.concat(
    [
        df.iloc[i : i + window_size, :]
        .T.reset_index()
        .assign(other_index=i)
        .set_index(["other_index", "index"])
        .set_axis([f"timestep_{j}" for j in range(1, window_size)] + ["target"], axis=1)
        for i in range(df.shape[0] - window_size + 1)
    ]
)
new_df.index.names = ["", ""]
print(new_df)
# Output
     timestep_1  timestep_2  target
0 A           0           1       2
  B           0           2       4
1 A           1           2       3
  B           2           4       6
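As an alternative sketch (a different technique from the concat answer, not a replacement for it), the windows can also be taken with NumPy's `sliding_window_view`, which assumes NumPy 1.20 or newer. The sample frame is rebuilt here with `periods=4` so the block runs on its own. The names `windows` and `n_windows` are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Same sample frame as in the question: 4 daily rows, columns A and B
timestamps = pd.date_range(start="2017-01-01", periods=4, freq="D")
values = np.arange(len(timestamps))
df = pd.DataFrame({"A": values, "B": values * 2}, index=timestamps)

window_size = 3  # two timesteps plus one target value

# Result has shape (n_windows, n_columns, window_size)
windows = sliding_window_view(df.to_numpy(), window_size, axis=0)
n_windows = windows.shape[0]

# One output row per (window, column) pair
index = pd.MultiIndex.from_product([range(n_windows), df.columns])
columns = [f"timestep_{j}" for j in range(1, window_size)] + ["target"]
new_df = pd.DataFrame(windows.reshape(-1, window_size),
                      index=index, columns=columns)
print(new_df)
```

This avoids building one small frame per window, which may matter for long series, at the cost of being less readable than the concat version.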
How do I subset the columns of a dataframe based on the index of another dataframe?
The rows of clin.index (row length = 81) are a subset of the columns of common_mrna (col length = 151). I want to keep the columns of common_mrna only if the column names match the row values of the clin dataframe. My code failed to reduce the number of columns in common_mrna to 81.
import pandas as pd
common_mrna = common_mrna.set_index("Hugo_Symbol")
mrna_val = {}
for colnames, val in common_mrna.iteritems():
    for i, rows in clin.iterrows():
        if [[common_mrna.columns == i] == "TRUE"]:
            mrna_val = np.append(mrna_val, val)
mrna = np.concatenate(mrna_val, axis=0)
common_mrna
Hugo_Symbol  A  B    C  D
First        1  2    3  4
Second       5  row  6  7
clin
   Another header
A  20
D  30
desired output
Hugo_Symbol  A  D
First        1  4
Second       5  7
Try this using reindex:
common_mrna.reindex(clin.index, axis=1)
Output:
        A  D
First   1  4
Second  5  7
Update, IIUC:
common_mrna.set_index('Hugo_Symbol').reindex(clin.index, axis=1).reset_index()
IIUC, you can select the index values of clin that are found in the common_mrna columns and add the first column of common_mrna:
cols = clin.loc[clin.index.isin(common_mrna.columns)].index.tolist()
# or with set
cols = list(sorted(set(clin.index.tolist()) & set(common_mrna.columns), key=common_mrna.columns.tolist().index))
out = common_mrna[['Hugo_Symbol'] + cols]
print(out)
  Hugo_Symbol  A  D
0       First  1  4
1      Second  5  7
pandas finding duplicate rows with different label
I have a case where I want to sanity-check labeled data. I have hundreds of features and want to find points which have the same features but different labels. These clusters of disagreeing labels should then be numbered and put into a new dataframe. This isn't hard, but I am wondering what the most elegant solution is. Here is an example:
import pandas as pd
df = pd.DataFrame({
    "feature_1" : [0,0,0,4,4,2],
    "feature_2" : [0,5,5,1,1,3],
    "label" : ["A","A","B","B","D","A"]
})
result_df = pd.DataFrame({
    "cluster_index" : [0,0,1,1],
    "feature_1" : [0,0,4,4],
    "feature_2" : [5,5,1,1],
    "label" : ["A","B","B","D"]
})
In order to get the output you want (both de-duplication and cluster_index), you can use a groupby approach:
g = df.groupby(['feature_1', 'feature_2'])['label']
(df.assign(cluster_index=g.ngroup()) # get group number
   .loc[g.transform('size').gt(1)] # filter out the non-duplicates
   # line below only to have a nice cluster_index range (0,1…)
   .assign(cluster_index= lambda d: d['cluster_index'].factorize()[0])
)
output:
   feature_1  feature_2 label  cluster_index
1          0          5     A              0
2          0          5     B              0
3          4          1     B              1
4          4          1     D              1
First get all rows duplicated on the feature columns, then if necessary remove rows duplicated across all columns (not necessary for the sample data), and finally add GroupBy.ngroup for the group indices:
df = df[df.duplicated(['feature_1','feature_2'],keep=False)].drop_duplicates()
df['cluster_index'] = df.groupby(['feature_1', 'feature_2'])['label'].ngroup()
print (df)
   feature_1  feature_2 label  cluster_index
1          0          5     A              0
2          0          5     B              0
3          4          1     B              1
4          4          1     D              1
How to change in a dataframe columns to rows based on same name?
As you can see I have a dataframe with several columns with the same name but split into 0., 1. until 27. How can I take all the values of 1.name and have it under 0.name? Thank you very much!
Assuming that for all 0<=n<=27 the column names' suffixes are the same, one solution can be:
import pandas as pd
import re
# pattern to extract the column name suffix (\d+ so that two-digit prefixes such as 27 also match)
pattern = re.compile(r'^\d+\.([\w.]+)')
# getting all the distinct column names/fields
fields = set([pattern.match(colname).group(1) for colname in df.columns])
# max prefix number, for you 27
n = 27
partitions = []
for i in range(0,n+1):
    # creating column selector for partitions
    columns_for_partition = list(map(lambda field: str(i) + f'.{field}', fields))
    # get partition from dataframe and renaming column to field name (removing n. prefix)
    partition = df[columns_for_partition].rename(lambda x: x.split('.',1)[1], axis=1)
    partitions.append(partition)
new_df = pd.concat(partitions)
print(new_df)
With an initial dataframe df (and n = 1, since it only has the prefixes 0 and 1)
  0.name  0.something 1.name  1.something
0      a            1      d            4
1      b            2      e            5
2      c            3      f            6
The resulting dataframe new_df will look like:
  name  something
0    a          1
1    b          2
2    c          3
0    d          4
1    e          5
2    f          6
pandas dataframe filter by sequence of values in a specific column
I have a dataframe
A  B  C
1  2  3
2  3  4
3  8  7
I want to take only the rows where there is a sequence of 3,4 in column C (in this scenario, the first two rows). What is the best way to do so?
You can use rolling for a general solution that works with any pattern:
pat = np.asarray([3,4])
N = len(pat)
mask= (df['C'].rolling(window=N , min_periods=N)
              .apply(lambda x: (x==pat).all(), raw=True)
              .mask(lambda x: x == 0)
              .bfill(limit=N-1)
              .fillna(0)
              .astype(bool))
df = df[mask]
print (df)
   A  B  C
0  1  2  3
1  2  3  4
Explanation:
- use rolling.apply to test each window against the pattern (1 marks the last row of a match)
- replace the 0s with NaNs by mask
- use bfill with limit=N-1 to back-fill the NaNs just before each match, so the earlier rows of the matched window are marked too
- replace the remaining NaNs with 0 by fillna
- finally cast to bool by astype
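The steps above can also be wrapped in a small helper. This is only a sketch: the function name `rows_matching_sequence` is hypothetical, and instead of the mask/bfill/fillna chain it marks the earlier rows of each match with shifted copies of the boolean series, which is a slightly different technique that should select the same rows.

```python
import numpy as np
import pandas as pd

def rows_matching_sequence(df, column, pattern):
    """Return the rows of df where `column` contains `pattern` consecutively."""
    pat = np.asarray(pattern)
    n = len(pat)
    # True on the LAST row of every window that equals the pattern
    ends = (df[column].rolling(window=n, min_periods=n)
            .apply(lambda x: (x == pat).all(), raw=True)
            .eq(1))
    # Also mark the n-1 rows before each match end
    mask = ends.copy()
    for k in range(1, n):
        mask |= ends.shift(-k, fill_value=False)
    return df[mask]

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 8], "C": [3, 4, 7]})
result = rows_matching_sequence(df, "C", [3, 4])
print(result)
```

For the sample frame this keeps the first two rows, as in the answer above.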
Use shift:
In [1085]: s = df.eq(3).any(axis=1) & df.shift(-1).eq(4).any(axis=1)
In [1086]: df[s | s.shift(fill_value=False)]
Out[1086]:
   A  B  C
0  1  2  3
1  2  3  4