How to make pandas work for cross multiplication - pandas
I have three data frames: df1, df2, and df3.
From their multiplication, using pandas and numpy, I want to add an output column to df1.
The conditions are:
The value of the new column will be (this is not code):
df1["w_ave"][1] = df3["w_ave"]["v"]+ df1["a"][1]*df2["a"]["q"]+df1["b"][1]*df2["b"]["q"]+df1["c"][1]*df2["c"]["q"]
For example, output["w_ave"][1] = 2 + (1*1) + (5*2) + (1*3) = 16, because:
df1["a"][1]=1, df2["a"]["q"]=1 ;
df1["b"][1]=5, df2["b"]["q"]=2 ;
df1["c"][1]=1, df2["c"]["q"]=3 ;
Which means:
- A new column will be added to df1, named after the column in df3.
- For each row of df1, the values of a, b, and c will be multiplied by the same-named q values from df2 and summed together with the corresponding value from df3.
- Only the columns of df1 whose names match column names in df2 will be multiplied. Unmatched columns, like df1["k"], will not be multiplied.
- However, if there is a 0 in df1["a"], the corresponding output will be zero.
I am struggling with this, and it was also tough to explain. I know my attempt below will not work, but I have added it anyway:
import pandas as pd, numpy as np
data1 = "Sample_data1.csv"
data2 = "Sample_data2.csv"
data3 = "Sample_data3.csv"
folder = '~Sample_data/'
df1 =pd.read_csv(folder + data1)
df2 =pd.read_csv(folder + data2)
df3 =pd.read_csv(folder + data3)
df1= df2 * df1
Ok, so this will in no way resemble your desired output, but vectorizing the formula you provided:
df1["w_ave"] = (
    df3.loc["v", "w_ave"]
    + df1["a"].mul(df2.loc["q", "a"])
    + df1["b"].mul(df2.loc["q", "b"])
    + df1["c"].mul(df2.loc["q", "c"])
)
id k a b c w_ave
0 1 2 1 5 1 16
1 2 3 0 1 0 4
2 3 6 1 1 0 5
3 4 1 0 5 0 12
4 5 1 1 5 0 13
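The formula above can also be written without naming each column, and with the "zero in a" condition from the question applied. The following is only a sketch: the layouts of df2 (one row labelled "q") and df3 (one row labelled "v") are assumptions reconstructed from the worked example, since the original CSV files are not shown.

```python
import pandas as pd

# Assumed sample data, rebuilt from the output table above.
df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "k": [2, 3, 6, 1, 1],
    "a": [1, 0, 1, 0, 1],
    "b": [5, 1, 1, 5, 5],
    "c": [1, 0, 0, 0, 0],
})
df2 = pd.DataFrame({"a": [1], "b": [2], "c": [3]}, index=["q"])  # assumed layout
df3 = pd.DataFrame({"w_ave": [2]}, index=["v"])                  # assumed layout

# Only columns present in both df1 and df2 take part in the product,
# so unmatched columns such as "k" and "id" are left out automatically.
shared = df1.columns.intersection(df2.columns)

# Row-wise weighted sum (a dot product per row) plus the offset from df3.
weighted = df1[shared].dot(df2.loc["q", shared]) + df3.loc["v", "w_ave"]

# Condition from the question: a 0 in df1["a"] forces the result to 0.
df1["w_ave"] = weighted.where(df1["a"].ne(0), 0)

print(df1)
```

With this sample data, the rows where a is 0 get w_ave = 0 instead of 4 and 12; the other rows keep 16, 5, and 13.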
Creating a dataframe using roll-forward window on multivariate time series
Based on the simplified sample dataframe
import pandas as pd
import numpy as np
timestamps = pd.date_range(start='2017-01-01', end='2017-01-5', inclusive='left')
values = np.arange(0,len(timestamps))
df = pd.DataFrame({'A': values ,'B' : values*2}, index = timestamps )
print(df)
            A  B
2017-01-01  0  0
2017-01-02  1  2
2017-01-03  2  4
2017-01-04  3  6
I want to use a roll-forward window of size 2 with a stride of 1 to create a resulting dataframe like
     timestep_1  timestep_2  target
0 A           0           1       2
  B           0           2       4
1 A           1           2       3
  B           2           4       6
I.e., each window step should create a data item with the two values of A and B in this window and the A and B values immediately to the right of the window as target values.
My first idea was to use pandas. But that seems to only work in combination with aggregate functions such as sum, which is a different use case.
Any ideas on how to implement this rolling-window-based sampling approach?
Here is one way to do it (window_size counts the two timesteps plus the target):
window_size = 3
new_df = pd.concat(
    [
        df.iloc[i : i + window_size, :]
        .T.reset_index()
        .assign(other_index=i)
        .set_index(["other_index", "index"])
        .set_axis([f"timestep_{j}" for j in range(1, window_size)] + ["target"], axis=1)
        for i in range(df.shape[0] - window_size + 1)
    ]
)
new_df.index.names = ["", ""]
print(new_df)
# Output
     timestep_1  timestep_2  target
0 A           0           1       2
  B           0           2       4
1 A           1           2       3
  B           2           4       6
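For longer series, the same windows can be built without a Python-level loop. The sketch below swaps in NumPy's sliding_window_view (available from NumPy 1.20) in place of the concat approach. It assumes all columns share one numeric dtype, and it uses periods=4 to create the same four dates as in the question.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Same sample frame as in the question (four daily timestamps).
timestamps = pd.date_range(start="2017-01-01", periods=4, freq="D")
values = np.arange(len(timestamps))
df = pd.DataFrame({"A": values, "B": values * 2}, index=timestamps)

window_size = 3  # two timesteps plus one target value

# Shape: (number of windows, number of columns, window_size)
windows = sliding_window_view(df.to_numpy(), window_size, axis=0)
n_windows = windows.shape[0]

# One output row per (window, column) pair, in that order.
index = pd.MultiIndex.from_product([range(n_windows), df.columns])
columns = [f"timestep_{j}" for j in range(1, window_size)] + ["target"]
new_df = pd.DataFrame(windows.reshape(-1, window_size), index=index, columns=columns)

print(new_df)
```

The reshape copies the data once, so memory use is roughly window_size times the size of the input.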
How do I subset the columns of a dataframe based on the index of another dataframe?
The row labels in clin.index (row length = 81) are a subset of the columns of common_mrna (col length = 151). I want to keep the columns of common_mrna only if the column names match the row labels of the clin dataframe. My code failed to reduce the number of columns in common_mrna to 81.
import pandas as pd
common_mrna = common_mrna.set_index("Hugo_Symbol")
mrna_val = {}
for colnames, val in common_mrna.iteritems():
    for i, rows in clin.iterrows():
        if [[common_mrna.columns == i] == "TRUE"]:
            mrna_val = np.append(mrna_val, val)
mrna = np.concatenate(mrna_val, axis=0)
common_mrna
Hugo_Symbol  A  B    C  D
First        1  2    3  4
Second       5  row  6  7
clin
Another header
A  20
D  30
desired output
Hugo_Symbol  A  D
First        1  4
Second       5  7
Try this using reindex:
common_mrna.reindex(clin.index, axis=1)
Output:
        A  D
First   1  4
Second  5  7
Update, IIUC:
common_mrna.set_index('Hugo_Symbol').reindex(clin.index, axis=1).reset_index()
IIUC, you can select the index labels of clin that are found in the common_mrna columns and prepend the first column of common_mrna:
cols = clin.loc[clin.index.isin(common_mrna.columns)].index.tolist()
# or with set
cols = list(sorted(set(clin.index.tolist()) & set(common_mrna.columns), key=common_mrna.columns.tolist().index))
out = common_mrna[['Hugo_Symbol'] + cols]
print(out)
  Hugo_Symbol  A  D
0       First  1  4
1      Second  5  7
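Another option is Index.intersection, shown here as a minimal sketch. It assumes Hugo_Symbol has already been set as the index of common_mrna, and the B and C values in the Second row are illustrative placeholders because the sample in the question is unclear at that point.

```python
import pandas as pd

# Assumed sample data modelled on the question.
common_mrna = pd.DataFrame(
    {"A": [1, 5], "B": [2, 6], "C": [3, 6], "D": [4, 7]},
    index=pd.Index(["First", "Second"], name="Hugo_Symbol"),
)
clin = pd.DataFrame({"Another header": [20, 30]}, index=["A", "D"])

# Labels present both as columns of common_mrna and as index labels of clin.
keep = common_mrna.columns.intersection(clin.index)

# Select only those columns; the Hugo_Symbol index is untouched.
subset = common_mrna.loc[:, keep]

print(subset)
```

Unlike reindex, which creates an all-NaN column for every label in clin.index that is missing from common_mrna, the intersection silently skips such labels.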
pandas finding duplicate rows with different label
I have a case where I want to sanity check labeled data. I have hundreds of features and want to find points which have the same features but a different label. These clusters of disagreeing labels should then be numbered and put into a new dataframe. This isn't hard, but I am wondering what the most elegant solution for this is. Here is an example:
import pandas as pd
df = pd.DataFrame({
    "feature_1" : [0,0,0,4,4,2],
    "feature_2" : [0,5,5,1,1,3],
    "label" : ["A","A","B","B","D","A"]
})
result_df = pd.DataFrame({
    "cluster_index" : [0,0,1,1],
    "feature_1" : [0,0,4,4],
    "feature_2" : [5,5,1,1],
    "label" : ["A","B","B","D"]
})
In order to get the output you want (both de-duplication and cluster_index), you can use a groupby approach:
g = df.groupby(['feature_1', 'feature_2'])['label']
(df.assign(cluster_index=g.ngroup())  # get group name
   .loc[g.transform('size').gt(1)]  # filter the non-duplicates
   # line below only to have a nice cluster_index range (0,1…)
   .assign(cluster_index=lambda d: d['cluster_index'].factorize()[0])
)
output:
   feature_1  feature_2 label  cluster_index
1          0          5     A              0
2          0          5     B              0
3          4          1     B              1
4          4          1     D              1
First get all rows duplicated on the feature columns, then if necessary remove rows duplicated across all columns (not necessary for this sample data), and last add GroupBy.ngroup for the group indices:
df = df[df.duplicated(['feature_1','feature_2'],keep=False)].drop_duplicates()
df['cluster_index'] = df.groupby(['feature_1', 'feature_2'])['label'].ngroup()
print (df)
   feature_1  feature_2 label  cluster_index
1          0          5     A              0
2          0          5     B              0
3          4          1     B              1
4          4          1     D              1
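Both answers above flag every repeated feature combination, including ones whose labels all agree. If only genuinely conflicting groups are wanted, one possible variation is to count distinct labels per group. This is a sketch, not the answerers' code; the two extra rows with features (7, 7) and label "C" are added here as an assumption to show an agreeing duplicate being excluded.

```python
import pandas as pd

# Sample data from the question plus two assumed rows: (7, 7, "C") twice.
df = pd.DataFrame({
    "feature_1": [0, 0, 0, 4, 4, 2, 7, 7],
    "feature_2": [0, 5, 5, 1, 1, 3, 7, 7],
    "label": ["A", "A", "B", "B", "D", "A", "C", "C"],
})
features = ["feature_1", "feature_2"]

# True where a feature combination carries more than one distinct label.
conflicting = df.groupby(features)["label"].transform("nunique").gt(1)

# Keep the conflicting rows, drop exact repeats, then number the groups.
result = df[conflicting].drop_duplicates().copy()
result.insert(0, "cluster_index", result.groupby(features).ngroup())

print(result)
```

Because ngroup is computed after filtering, the cluster numbers already run 0, 1, ... without a separate factorize step.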
How to change in a dataframe columns to rows based on same name?
As you can see I have a dataframe with several columns with the same name but split into 0., 1. until 27. How can I take all the values of and have it under Thank you very much!
Assuming that for all 0<=n<=27 the column names' suffixes are the same, one solution can be:
import pandas as pd
import re
# pattern to extract the column name suffix (\d+ so that two-digit prefixes such as 27 also match)
pattern = re.compile(r'^\d+\.([\w\.]+)')
# getting all the distinct column names/fields, in order of first appearance
fields = list(dict.fromkeys(pattern.match(colname).group(1) for colname in df.columns))
# max prefix number, for you 27
n = 27
partitions = []
for i in range(0, n + 1):
    # creating column selector for partitions
    columns_for_partition = [f'{i}.{field}' for field in fields]
    # get partition from dataframe and rename columns to the field name (removing the n. prefix)
    partition = df[columns_for_partition].rename(lambda x: x.split('.', 1)[1], axis=1)
    partitions.append(partition)
new_df = pd.concat(partitions)
print(new_df)
With an initial dataframe df (shown here with n = 1)
  0.name  0.something 1.name  1.something
0      a            1      d            4
1      b            2      e            5
2      c            3      f            6
The resulting dataframe new_df will look like:
  name  something
0    a          1
1    b          2
2    c          3
0    d          4
1    e          5
2    f          6
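The numeric prefixes can also be discovered from the column names instead of being hard-coded. The following sketch assumes every column is named "<number>.<field>" and that each prefix carries the same set of fields; the sample frame is the small two-prefix example from the answer.

```python
import pandas as pd

# Small sample in the assumed "<number>.<field>" naming scheme.
df = pd.DataFrame({
    "0.name": ["a", "b", "c"],
    "0.something": [1, 2, 3],
    "1.name": ["d", "e", "f"],
    "1.something": [4, 5, 6],
})

# Split each column name once at the first dot: ("0", "name"), ("0", "something"), ...
wide = df.copy()
wide.columns = pd.MultiIndex.from_tuples([tuple(c.split(".", 1)) for c in wide.columns])

# Distinct prefixes, ordered numerically so that "10" comes after "2".
prefixes = sorted(wide.columns.get_level_values(0).unique(), key=int)

# wide[p] selects one block and drops the prefix level; stack the blocks vertically.
long_df = pd.concat([wide[p] for p in prefixes], ignore_index=True)

print(long_df)
```

ignore_index=True gives a fresh 0..N-1 index; drop it to keep the original row labels repeated per block, as in the answer above.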
pandas dataframe filter by sequence of values in a specific column
I have a dataframe
A  B  C
1  2  3
2  3  4
3  8  7
I want to take only the rows where there is a sequence of 3, 4 in column C (in this scenario, the first two rows). What would be the best way to do so?
You can use rolling for a general solution that works with any pattern:
import numpy as np
pat = np.asarray([3,4])
N = len(pat)
mask = (df['C'].rolling(window=N , min_periods=N)
               .apply(lambda x: (x==pat).all(), raw=True)
               .mask(lambda x: x == 0)
               .bfill(limit=N-1)
               .fillna(0)
               .astype(bool))
df = df[mask]
print (df)
   A  B  C
0  1  2  3
1  2  3  4
Explanation:
- use rolling.apply to test each window against the pattern (1 marks the last row of a match)
- replace 0s with NaNs using mask
- use bfill with limit=N-1 to propagate each match marker back over the earlier rows of its window
- replace the remaining NaNs with 0 using fillna
- last, cast to bool with astype
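Use shift (this version checks every column; use df['C'] in place of df to restrict it to column C):
In [1085]: s = df.eq(3).any(axis=1) & df.shift(-1).eq(4).any(axis=1)
In [1086]: df[s | s.shift(fill_value=False)]
Out[1086]:
   A  B  C
0  1  2  3
1  2  3  4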
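For patterns longer than two values, the window comparison can be done in NumPy directly. The helper below, rows_in_pattern, is a hypothetical name introduced for this sketch; it relies on sliding_window_view (NumPy 1.20 or later) and marks every row that belongs to at least one occurrence of the pattern.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def rows_in_pattern(series, pattern):
    """Boolean mask of rows that are part of any consecutive match of pattern."""
    pat = np.asarray(pattern)
    n = len(pat)
    values = series.to_numpy()
    mask = np.zeros(len(values), dtype=bool)
    if len(values) < n:
        return mask  # series too short to contain the pattern
    # Each row of windows holds n consecutive values of the series.
    windows = sliding_window_view(values, n)
    # Positions where a full match starts.
    starts = np.flatnonzero((windows == pat).all(axis=1))
    # Mark all n rows covered by each match.
    for offset in range(n):
        mask[starts + offset] = True
    return mask

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 8], "C": [3, 4, 7]})
mask = rows_in_pattern(df["C"], [3, 4])
print(df[mask])
```

Only column C is inspected here, so a 3 or 4 appearing in A or B does not affect the result.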