Take product of columns in dataframe with lags - pandas

I have the following dataframe.
A = pd.Series([2, 3, 4, 5], index=[1, 2, 3, 4])
B = pd.Series([6, 7, 8, 9], index=[1, 2, 3, 4])
Aw = pd.Series([0.25, 0.3, 0.33, 0.36], index=[1, 2, 3, 4])
Bw = pd.Series([0.75, 0.7, 0.67, 0.64], index=[1, 2, 3, 4])
df = pd.DataFrame({'A': A, 'B': B, 'Aw': Aw, 'Bw': Bw})
df
Index A B Aw Bw
1 2 6 0.25 0.75
2 3 7 0.30 0.70
3 4 8 0.33 0.67
4 5 9 0.36 0.64
What I would like to do is multiply 'A' by the lag of 'Aw', and likewise 'B' by the lag of 'Bw'. The resulting dataframe should look like the following:
Index A B Aw Bw A_ctr B_ctr
1 2 6 NaN NaN NaN NaN
2 3 7 0.25 0.75 0.75 5.25
3 4 8 0.3 0.7 1.2 5.6
4 5 9 0.33 0.67 1.65 6.03
Thank you in advance

To get your desired output, first shift Aw and Bw, then multiply them by A and B:
df[['Aw','Bw']] = df[['Aw','Bw']].shift()
df[['A_ctr','B_ctr']] = df[['A','B']].values*df[['Aw','Bw']]
A B Aw Bw A_ctr B_ctr
1 2 6 NaN NaN NaN NaN
2 3 7 0.25 0.75 0.75 5.25
3 4 8 0.30 0.70 1.20 5.60
4 5 9 0.33 0.67 1.65 6.03
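If the original Aw and Bw columns should stay untouched, the shifted weights can be held in a temporary frame instead of being written back. A minimal sketch under that assumption (the name lagged is introduced here for illustration, and the last Bw value is taken as 0.64 to match the printed table):

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [2, 3, 4, 5], 'B': [6, 7, 8, 9],
     'Aw': [0.25, 0.30, 0.33, 0.36], 'Bw': [0.75, 0.70, 0.67, 0.64]},
    index=[1, 2, 3, 4],
)

# shift() moves each weight down one row, so row i sees the weight from row i-1
lagged = df[['Aw', 'Bw']].shift()

# Multiply column by column; the first row has no previous weight and stays NaN
df['A_ctr'] = df['A'] * lagged['Aw']
df['B_ctr'] = df['B'] * lagged['Bw']
print(df)
```

Here Aw and Bw keep their unshifted values, which may or may not be what the desired output calls for.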

Related

Setting multiple columns at once gives error "Not in index error!"

import pandas as pd
df = pd.DataFrame(
[
[5, 2],
[3, 5],
[5, 5],
[8, 9],
[90, 55]
],
columns = ['max_speed', 'shield']
)
df.loc[(df.max_speed > df.shield), ['stat', 'delta']] \
= 'overspeed', df['max_speed'] - df['shield']
I am setting multiple columns using .loc as above, and in some cases I get Not in index error!. Am I doing something wrong?
Create a list of tuples with one entry per True value in the mask, pairing the repeated scalar 'overspeed' with the filtered Series of differences:
m = (df.max_speed > df.shield)
s = df['max_speed'] - df['shield']
df.loc[m, ['stat', 'delta']] = list(zip(['overspeed'] * m.sum(), s[m]))
print(df)
max_speed shield stat delta
0 5 2 overspeed 3.0
1 3 5 NaN NaN
2 5 5 NaN NaN
3 8 9 NaN NaN
4 90 55 overspeed 35.0
Another idea, using a helper DataFrame:
df.loc[m, ['stat', 'delta']] = pd.DataFrame({'stat':'overspeed', 'delta':s})[m]
Details:
print(list(zip(['overspeed'] * m.sum(), s[m])))
[('overspeed', 3), ('overspeed', 35)]
print (pd.DataFrame({'stat':'overspeed', 'delta':s})[m])
stat delta
0 overspeed 3
4 overspeed 35
The simplest approach is to assign each column separately:
df.loc[m, 'stat'] = 'overspeed'
df.loc[m, 'delta'] = df['max_speed'] - df['shield']
print(df)
max_speed shield stat delta
0 5 2 overspeed 3.0
1 3 5 NaN NaN
2 5 5 NaN NaN
3 8 9 NaN NaN
4 90 55 overspeed 35.0
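The same result can also be reached without .loc assignment at all, by masking each new column with Series.where. This is only a sketch of an alternative, not the answer's method; the names m and diff are illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    [[5, 2], [3, 5], [5, 5], [8, 9], [90, 55]],
    columns=['max_speed', 'shield'],
)

m = df['max_speed'] > df['shield']      # rows where the speed exceeds the shield
diff = df['max_speed'] - df['shield']

# where(m) keeps values on rows where m is True and puts NaN elsewhere
df['stat'] = pd.Series('overspeed', index=df.index).where(m)
df['delta'] = diff.where(m)
print(df)
```

Because each column is created by plain assignment, no lookup of not-yet-existing column labels is involved.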

Python - count and Difference data frames

I have two data frames on occupation by industry, one for 2005 and one for 2006. I would like to create a df whose column holds the change in Result between these years, showing whether it grew or decreased. Here is a sample:
import pandas as pd
d = {'OCC2005': [1234, 1234, 1234 ,1234, 2357,2357,2357,2357, 4321,4321,4321,4321, 3333], 'IND2005': [4, 5, 6, 7, 5,6,7,4, 6,7,5,4,5], 'Result': [7, 8, 12, 1, 11,15,20,1,5,12,8,4,3]}
df = pd.DataFrame(data=d)
print(df)
d2 = {'OCC2006': [1234, 1234, 1234 ,1234, 2357,2357,2357,2357, 4321,4321,4361,4321, 3333,4444], 'IND2006': [4, 5, 6, 7, 5,6,7,4, 6,7,5,4,5,8], 'Result': [17, 18, 12, 1, 1,5,20,1,5,2,18,4,0,15]}
df2 = pd.DataFrame(data=d2)
print(df2)
Final_Result = df2['Result'] - df['Result']
print(Final_Result)
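I would like to create a df with the columns occ, ind and final_result.
Rename the columns of df to match the column names of df2, then set the occupation and industry columns as the index on both frames so that the subtraction aligns rows by key rather than by position: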
MAP = dict(zip(df.columns, df2.columns))
out = (df2.set_index(['OCC2006', 'IND2006'])
.sub(df.rename(columns=MAP).set_index(['OCC2006', 'IND2006']))
.reset_index())
print(out)
# Output
OCC2006 IND2006 Result
0 1234 4 10.0
1 1234 5 10.0
2 1234 6 0.0
3 1234 7 0.0
4 2357 4 0.0
5 2357 5 -10.0
6 2357 6 -10.0
7 2357 7 0.0
8 3333 5 -3.0
9 4321 4 0.0
10 4321 5 NaN
11 4321 6 0.0
12 4321 7 -10.0
13 4361 5 NaN
14 4444 8 NaN
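The NaN rows are the (occupation, industry) pairs that appear in only one of the two years. If such pairs should count as zero in the missing year (an assumption that may not suit every data set), sub accepts a fill_value. A small sketch on made-up data, with the column names OCC, IND and final_result chosen for illustration:

```python
import pandas as pd

old = pd.DataFrame({'OCC': [1234, 1234, 4321], 'IND': [4, 5, 5], 'Result': [7, 8, 8]})
new = pd.DataFrame({'OCC': [1234, 1234, 4361], 'IND': [4, 5, 5], 'Result': [17, 18, 18]})

keys = ['OCC', 'IND']

# Align on (OCC, IND); a pair missing from one year is treated as 0 there
change = (new.set_index(keys)['Result']
             .sub(old.set_index(keys)['Result'], fill_value=0)
             .rename('final_result')
             .reset_index())
print(change)
```

A pair present only in the old frame then shows up as a negative change, and a pair present only in the new frame as a positive one.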

Pandas concatenate dataframe with multiindex retaining index names

I have a list of DataFrames, where each DataFrame in the list looks as follows:
dfList[0]
monthNum 1 2
G1
2.0 0.05 -0.16
3.0 1.17 0.07
4.0 9.06 0.83
dfList[1]
monthNum 1 2
G2
21.0 0.25 0.26
31.0 1.27 0.27
41.0 9.26 0.23
dfList[0].index
Float64Index([2.0, 3.0, 4.0], dtype='float64', name='G1')
dfList[0].columns
Int64Index([1, 2], dtype='int64', name='monthNum')
I am trying to achieve the following in a dataframe Final_Combined_DF:
monthNum 1 2
G1
2.0 0.05 -0.16
3.0 1.17 0.07
4.0 9.06 0.83
G2
21.0 0.25 0.26
31.0 1.27 0.27
41.0 9.26 0.23
I tried doing different combinations of:
pd.concat(dfList, axis=0)
but it has not given me the desired output. I am not sure how to go about this.
We can try pd.concat with keys using the Index.name from each DataFrame to add a new level index in the final frame:
final_combined_df = pd.concat(
    df_list, keys=[d.index.name for d in df_list]
)
final_combined_df:
monthNum 0 1
G1 2.0 4 7
3.0 7 1
4.0 9 5
G2 21.0 8 1
31.0 1 8
41.0 2 6
Setup Used:
import numpy as np
import pandas as pd
np.random.seed(5)
df_list = [
pd.DataFrame(np.random.randint(1, 10, (3, 2)),
columns=pd.Index([0, 1], name='monthNum'),
index=pd.Index([2.0, 3.0, 4.0], name='G1')),
pd.DataFrame(np.random.randint(1, 10, (3, 2)),
columns=pd.Index([0, 1], name='monthNum'),
index=pd.Index([21.0, 31.0, 41.0], name='G2'))
]
df_list:
[monthNum 0 1
G1
2.0 4 7
3.0 7 1
4.0 9 5,
monthNum 0 1
G2
21.0 8 1
31.0 1 8
41.0 2 6]
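Since G1 and G2 end up as values of an unnamed outer level, it may help to label both index levels explicitly through the names argument of pd.concat. A sketch using the question's numbers; the level names 'group' and 'value' are made up for illustration:

```python
import pandas as pd

df_list = [
    pd.DataFrame([[0.05, -0.16], [1.17, 0.07], [9.06, 0.83]],
                 columns=pd.Index([1, 2], name='monthNum'),
                 index=pd.Index([2.0, 3.0, 4.0], name='G1')),
    pd.DataFrame([[0.25, 0.26], [1.27, 0.27], [9.26, 0.23]],
                 columns=pd.Index([1, 2], name='monthNum'),
                 index=pd.Index([21.0, 31.0, 41.0], name='G2')),
]

# keys adds an outer index level; names labels the outer and inner levels
final_combined_df = pd.concat(
    df_list,
    keys=[d.index.name for d in df_list],
    names=['group', 'value'],
)
print(final_combined_df)

# Rows of one original frame can be recovered by selecting on the outer level
print(final_combined_df.loc['G2'])
```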

Grouping by and applying lambda with condition for the first row - Pandas

I have a data frame with IDs and the choices that have been made by those IDs.
The set of alternatives (choices) is a list of integers: [10, 20, 30, 40].
Note: It is important to use this list. Let's call it 'choice_list'.
This is the data frame:
ID Choice
1 10
1 30
1 10
2 40
2 40
2 40
3 20
3 40
3 10
I want to create a variable for each alternative: '10_Var', '20_Var', '30_Var', '40_Var'.
On the first row of each ID, if the first choice was '10', for example, the variable '10_Var' gets the value 0.6 (some parameter), and each of the other variables ('20_Var', '30_Var', '40_Var') gets the value (1 - 0.6) / 4.
The number 4 stands for the number of alternatives.
Expected result:
ID Choice 10_Var 20_Var 30_Var 40_Var
1 10 0.6 0.1 0.1 0.1
1 30
1 10
2 40 0.1 0.1 0.1 0.6
2 40
2 40
3 20 0.1 0.6 0.1 0.1
3 40
3 10
You can use np.where to do this. It is more efficient than df.where.
import numpy as np
import pandas as pd
df = pd.DataFrame([['1', 10], ['1', 30], ['1', 10], ['2', 40], ['2', 40], ['2', 40], ['3', 20], ['3', 40], ['3', 10]], columns=('ID', 'Choice'))
choices = np.unique(df.Choice)
for choice in choices:
    df[f"var_{choice}"] = np.where(df.Choice==choice, 0.6, (1 - 0.6) / 4)
df
Result
ID Choice var_10 var_20 var_30 var_40
0 1 10 0.6 0.1 0.1 0.1
1 1 30 0.1 0.1 0.6 0.1
2 1 10 0.6 0.1 0.1 0.1
3 2 40 0.1 0.1 0.1 0.6
4 2 40 0.1 0.1 0.1 0.6
5 2 40 0.1 0.1 0.1 0.6
6 3 20 0.1 0.6 0.1 0.1
7 3 40 0.1 0.1 0.1 0.6
8 3 10 0.6 0.1 0.1 0.1
Edit
To set values on the first row of each group only:
df = pd.DataFrame([['1', 10], ['1', 30], ['1', 10], ['2', 40], ['2', 40], ['2', 40], ['3', 20], ['3', 40], ['3', 10]], columns=('ID', 'Choice'))
df=df.set_index("ID")
## create unique index for each row if not already
df = df.reset_index()
choices = np.unique(df.Choice)
## get unique id of 1st row of each group
grouped = df.loc[df.reset_index().groupby("ID")["index"].first()].copy()
## set value for each new variable
for choice in choices:
    grouped[f"var_{choice}"] = np.where(grouped.Choice==choice, 0.6, (1 - 0.6) / 4)
pd.concat([df, grouped.iloc[:, -len(choices):]], axis=1)
We can use insert to create the new columns from the unique Choice values obtained through Series.unique. A mask marking the first row of each ID restricts the assignment to that row, and np.where picks the value to fill.
At the beginning, sort_values is used to sort the rows by ID. You can skip this step if your data frame is already sorted (like the one shown in the example):
df=df.sort_values('ID')
n=df['Choice'].nunique()
mask=df['ID'].ne(df['ID'].shift())
for choice in df['Choice'].sort_values(ascending=False).unique():
    df.insert(2,column=f'{choice}_Var',value=np.nan)
    df.loc[mask,f'{choice}_Var']=np.where(df.loc[mask,'Choice'].eq(choice),0.6,0.4/n)
print(df)
ID Choice 10_Var 20_Var 30_Var 40_Var
0 1 10 0.6 0.1 0.1 0.1
1 1 30 NaN NaN NaN NaN
2 1 10 NaN NaN NaN NaN
3 2 40 0.1 0.1 0.1 0.6
4 2 40 NaN NaN NaN NaN
5 2 40 NaN NaN NaN NaN
6 3 20 0.1 0.6 0.1 0.1
7 3 40 NaN NaN NaN NaN
8 3 10 NaN NaN NaN NaN
A solution mixing numpy and pandas:
rows = np.unique(df.ID.values, return_index=True)[1]
df1 = df.loc[rows].assign(val=0.6)
df2 = (pd.crosstab([df1.index, df1.ID, df1.Choice], df1.Choice, df1.val, aggfunc='first')
.reindex(choice_list, axis=1)
.fillna((1-0.6)/len(choice_list)).reset_index(level=[1,2], drop=True))
pd.concat([df, df2], axis=1)
Out[217]:
ID Choice 10 20 30 40
0 1 10 0.6 0.1 0.1 0.1
1 1 30 NaN NaN NaN NaN
2 1 10 NaN NaN NaN NaN
3 2 40 0.1 0.1 0.1 0.6
4 2 40 NaN NaN NaN NaN
5 2 40 NaN NaN NaN NaN
6 3 20 0.1 0.6 0.1 0.1
7 3 40 NaN NaN NaN NaN
8 3 10 NaN NaN NaN NaN
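Because the question stresses using choice_list, a variant that loops over that list directly, rather than over the choices observed in the data, might look like the sketch below. The parameter name p and the mask name first are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

choice_list = [10, 20, 30, 40]
p = 0.6  # the "some parameter" from the question

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'Choice': [10, 30, 10, 40, 40, 40, 20, 40, 10]})

# True on the first row of each ID, False on every later row of that ID
first = ~df['ID'].duplicated()
other = (1 - p) / len(choice_list)

for choice in choice_list:
    values = pd.Series(np.where(df['Choice'].eq(choice), p, other), index=df.index)
    # Keep the value only on the first row of each ID; other rows become NaN
    df[f'{choice}_Var'] = values.where(first)
print(df)
```

Looping over choice_list means a column is created even for an alternative that nobody chose, and duplicated does not require the frame to be sorted by ID.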

Concat and append in pandas dataframe

I have three data frames with the same dimensions, and I need to concatenate them into a single data frame.
df1 = pd.DataFrame({'AD': ['CTA15', 'CTA15', 'AC007', 'AC007', 'AC007'],
'FC': [0.5, 0.7, 0.7, 2.6, 2.9],
'EX':['12', '13', '14', '15', '16'],
't' : [2, 2, 3, 3, 3],
'P' :[3,7,8,9,1]})
df2 = df1.copy()
df3 = df1.copy()
df = df1.append([df2, df3])
I tried append and concat; both return a data frame without the first column.
This is what I tried,
pd.concat([df1,df2,df3]) and df1.append([df2,df3])
Concat works if I set the first column of every data frame as the index using df1.set_index('col1'), and likewise for df2 and df3. With that, pd.concat works, but not otherwise. It would be great if there were a direct solution.
Thank you
Is this what you are looking for?
pd.concat([df1,df2,df3], ignore_index=True)
AD EX FC P t
0 CTA15 12 0.5 3 2
1 CTA15 13 0.7 7 2
2 AC007 14 0.7 8 3
3 AC007 15 2.6 9 3
4 AC007 16 2.9 1 3
5 CTA15 12 0.5 3 2
6 CTA15 13 0.7 7 2
7 AC007 14 0.7 8 3
8 AC007 15 2.6 9 3
9 AC007 16 2.9 1 3
10 CTA15 12 0.5 3 2
11 CTA15 13 0.7 7 2
12 AC007 14 0.7 8 3
13 AC007 15 2.6 9 3
14 AC007 16 2.9 1 3
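Note that DataFrame.append was removed in pandas 2.0, so on current versions pd.concat is the route to take. A self-contained sketch of the answer's call, with a quick check that the first column survives:

```python
import pandas as pd

df1 = pd.DataFrame({'AD': ['CTA15', 'CTA15', 'AC007', 'AC007', 'AC007'],
                    'FC': [0.5, 0.7, 0.7, 2.6, 2.9],
                    'EX': ['12', '13', '14', '15', '16'],
                    't': [2, 2, 3, 3, 3],
                    'P': [3, 7, 8, 9, 1]})
df2 = df1.copy()
df3 = df1.copy()

# Stack the frames row-wise; ignore_index=True renumbers the rows 0..n-1
out = pd.concat([df1, df2, df3], ignore_index=True)

print(out.shape)
print(out.columns.tolist())
```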