Python pandas: Faster way than numpy.select? - pandas

I have a dataframe with two columns looking like this:
GebTyp BAK
0 RH C
1 MFH A
2 RH J
3 RH F
4 RH K
... ... ..
25046 MFH C
25047 MFH G
25048 MFH I
25049 MFH A
25050 MFH B
And another one with values for each pair of these two columns.
BAK EFH/DHH RH MFH GMH HH
0 A 231.0 222.0 265.0 186.0 156.0
1 B 271.0 222.0 204.0 186.0 156.0
2 C 214.0 186.0 222.0 197.0 167.0
3 D 242.0 183.0 236.0 201.0 171.0
4 E 184.0 155.0 188.0 196.0 143.0
5 F 198.0 179.0 162.0 158.0 121.0
6 G 134.0 145.0 138.0 134.0 104.0
7 H 159.0 118.0 143.0 103.0 73.0
8 I 120.0 110.0 119.0 97.0 87.0
9 J 91.0 89.0 86.0 75.0 69.0
10 K NaN NaN NaN NaN NaN
11 L NaN NaN NaN NaN NaN
I can assign each individual value correctly with numpy.select like this:
import numpy as np

def GWB():
    conditions = [
        (df["BAK"] == "A") & (df["GebTyp"] == "EFH/DHH"),
        (df["BAK"] == "A") & (df["GebTyp"] == "RH"),
        (df["BAK"] == "A") & (df["GebTyp"] == "MFH"),
        (df["BAK"] == "A") & (df["GebTyp"] == "GMH"),
        (df["BAK"] == "A") & (df["GebTyp"] == "HH"),
    ]
    values = [
        231,
        222,
        265,
        186,
        156,
    ]
    df["result"] = np.select(conditions, values)

GWB()
But this would result in roughly 80 lines of code, and even then I'm working only with the first dataframe, assigning the values manually. Is there a faster/shorter way to do this task?

Use DataFrame.merge with DataFrame.melt:
df = df1.merge(df2.melt('BAK', value_name='result', var_name='GebTyp'),
               how='left',
               on=['BAK', 'GebTyp'])
print (df)
GebTyp BAK result
0 RH C 186.0
1 MFH A 265.0
2 RH J 89.0
3 RH F 179.0
4 RH K NaN
5 ... .. NaN
6 MFH C 222.0
7 MFH G 138.0
8 MFH I 119.0
9 MFH A 265.0
10 MFH B 204.0
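The melt call is what makes the merge work: it reshapes the wide lookup table into one long row per (BAK, GebTyp) pair, which can then be joined onto the first dataframe. A quick sketch of the intermediate result, using the second frame from the question:
print (df2.melt('BAK', value_name='result', var_name='GebTyp').head())
  BAK   GebTyp  result
0   A  EFH/DHH   231.0
1   B  EFH/DHH   271.0
2   C  EFH/DHH   214.0
3   D  EFH/DHH   242.0
4   E  EFH/DHH   184.0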

Related

Group/merge rows when defined columns match and sum up values

How do I group/merge rows where multiple defined columns have the same value, and display the sums in the other columns not relevant for grouping/merging?
In the example below: if rows have the same values in columns "orgA" to "orgF" (text – this refers to an org structure with departments and sub-departments), group/merge the rows and add up the numbers in columns "numA" and "numB".
import pandas as pd
import numpy as np

data = {'orgA': ['A','C','A','C','A','C','A','A','A','L'],
        'orgB': ['B',np.nan,'E',np.nan,'B',np.nan,'E','E','E','C'],
        'orgC': ['C',np.nan,'D',np.nan,'C',np.nan,'H','D','H','B'],
        'orgD': ['D',np.nan,np.nan,np.nan,'D',np.nan,'F',np.nan,'F','S'],
        'orgE': ['E',np.nan,np.nan,np.nan,'E',np.nan,np.nan,np.nan,np.nan,'F'],
        'orgF': ['F',np.nan,np.nan,np.nan,'F',np.nan,np.nan,np.nan,np.nan,np.nan],
        'numA': [1,1,1,1,1,1,1,1,1,1],
        'numB': [2,2,2,2,2,2,2,2,2,2]}
df = pd.DataFrame(data)
print(df)
print(df)
orgA orgB orgC orgD orgE orgF numA numB
0 A B C D E F 1 2
1 C NaN NaN NaN NaN NaN 1 2
2 A E D NaN NaN NaN 1 2
3 C NaN NaN NaN NaN NaN 1 2
4 A B C D E F 1 2
5 C NaN NaN NaN NaN NaN 1 2
6 A E H F NaN NaN 1 2
7 A E D NaN NaN NaN 1 2
8 A E H F NaN NaN 1 2
9 L C B S F NaN 1 2
The output is supposed to look as follows:
orgA orgB orgC orgD orgE orgF numA numB
0 A B C D E F 2 4
1 C NaN NaN NaN NaN NaN 3 6
2 A E D NaN NaN NaN 2 4
3 A E H F NaN NaN 3 6
4 L C B S F NaN 1 2
Many thanks for your ideas in advance!
You can pass a list of column names to groupby, and set dropna to False so that rows containing NaNs are not dropped. You can also specify sort=False if it is not important to sort the group keys. Applying this to your example, as in
df.groupby(
    ['orgA', 'orgB', 'orgC', 'orgD', 'orgE', 'orgF'],
    dropna=False,
    sort=False
).sum()
we get
numA numB
orgA orgB orgC orgD orgE orgF
A B C D E F 2 4
C NaN NaN NaN NaN NaN 3 6
A E D NaN NaN NaN 2 4
H F NaN NaN 2 4
L C B S F NaN 1 2
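This keeps the org columns as a MultiIndex. To get them back as regular columns, matching the expected output, chain a reset_index() onto the same call:
df.groupby(
    ['orgA', 'orgB', 'orgC', 'orgD', 'orgE', 'orgF'],
    dropna=False,
    sort=False
).sum().reset_index()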

Create columns in python data frame based on existing column-name and column-values

I have a dataframe in pandas:
import pandas as pd

# assign data of lists.
data = {'Gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
        'Employment': ['R', 'U', 'E', 'R', 'U', 'E', 'R', 'U', 'E', 'R', 'U', 'E'],
        'Age': ['Y', 'M', 'O', 'Y', 'M', 'O', 'Y', 'M', 'O', 'Y', 'M', 'O']}

# Create DataFrame
df = pd.DataFrame(data)
df
What I want is to create for each category of each existing column a new column with the following format:
Gender_M -> for when the gender equals M
Gender_F -> for when the gender equals F
Employment_R -> for when employment equals R
Employment_U -> for when employment equals U
and so on...
So far, I have created the below code:
for i in range(len(df.columns)):
    curent_column = list(df.columns)[i]
    col_df_array = df[curent_column].unique()
    for j in range(col_df_array.size):
        new_col_name = str(list(df.columns)[i]) + "_" + col_df_array[j]
        for index, row in df.iterrows():
            if row[curent_column] == col_df_array[j]:
                df[new_col_name] = row[curent_column]
The problem is that even though I have managed to create successfully the column names, I am not able to get the correct column values.
For example the column Gender should be as below:
data2 = {'Gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
         'Gender_M': ['M', 'na', 'M', 'na', 'M', 'na', 'M', 'na', 'M', 'na', 'M', 'na'],
         'Gender_F': ['na', 'F', 'na', 'F', 'na', 'F', 'na', 'F', 'na', 'F', 'na', 'F']}
df2 = pd.DataFrame(data2)
Just to say, the na can be anything, such as blanks, dots, or NaN.
You're looking for pd.get_dummies.
>>> pd.get_dummies(df)
Gender_F Gender_M Employment_E Employment_R Employment_U Age_M Age_O Age_Y
0 0 1 0 1 0 0 0 1
1 1 0 0 0 1 1 0 0
2 0 1 1 0 0 0 1 0
3 1 0 0 1 0 0 0 1
4 0 1 0 0 1 1 0 0
5 1 0 1 0 0 0 1 0
6 0 1 0 1 0 0 0 1
7 1 0 0 0 1 1 0 0
8 0 1 1 0 0 0 1 0
9 1 0 0 1 0 0 0 1
10 0 1 0 0 1 1 0 0
11 1 0 1 0 0 0 1 0
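As a side note, if you only want to encode some of the columns, pd.get_dummies also accepts a columns= argument, which leaves the remaining columns untouched:
pd.get_dummies(df, columns=['Gender'])   # Employment and Age are left as-is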
If you are trying to get the data in a format like your df2 example, I believe this is what you are looking for.
ndf = pd.get_dummies(df)
df.join(ndf.mul(ndf.columns.str.split('_').str[-1]))
Old Answer
df[['Gender']].join(pd.get_dummies(df[['Gender']]).mul(df['Gender'], axis=0).replace('', np.nan))
Output:
Gender Gender_F Gender_M
0 M NaN M
1 F F NaN
2 M NaN M
3 F F NaN
4 M NaN M
5 F F NaN
6 M NaN M
7 F F NaN
8 M NaN M
9 F F NaN
10 M NaN M
11 F F NaN
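For reference, the mul trick works because pandas multiplies an integer dummy column by a string the way Python does, by repetition: 1 * 'M' gives 'M' and 0 * 'M' gives the empty string, which replace('', np.nan) then turns into NaN. A minimal demonstration:
import pandas as pd
pd.Series([0, 1, 0]) * 'M'   # -> '', 'M', ''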
If you are okay with 0s and 1s in your new columns, then using get_dummies (as suggested by @richardec) should be the most straightforward.
However, if you want a specific letter in each of your new columns, then another method is to loop through the current columns and the specific categories within each column, and create a new column from this information using apply.
for col in data.keys():
    categories = list(df[col].unique())
    for category in categories:
        df[f"{col}_{category}"] = df[col].apply(lambda x: category if x == category else float("nan"))
Result:
>>> df
Gender Employment Age Gender_M Gender_F Employment_R Employment_U Employment_E Age_Y Age_M Age_O
0 M R Y M NaN R NaN NaN Y NaN NaN
1 F U M NaN F NaN U NaN NaN M NaN
2 M E O M NaN NaN NaN E NaN NaN O
3 F R Y NaN F R NaN NaN Y NaN NaN
4 M U M M NaN NaN U NaN NaN M NaN
5 F E O NaN F NaN NaN E NaN NaN O
6 M R Y M NaN R NaN NaN Y NaN NaN
7 F U M NaN F NaN U NaN NaN M NaN
8 M E O M NaN NaN NaN E NaN NaN O
9 F R Y NaN F R NaN NaN Y NaN NaN
10 M U M M NaN NaN U NaN NaN M NaN
11 F E O NaN F NaN NaN E NaN NaN O
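As a side note, the apply with a lambda can be replaced by Series.where, which keeps a value where the condition holds and fills in NaN elsewhere; a sketch of the same loop:
for col in data.keys():
    for category in df[col].unique():
        df[f"{col}_{category}"] = df[col].where(df[col] == category)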

apply a function to each sequence of rows in a column

I have a df like this:
xx
A 3
B 4
C 1
D 5
E 7
F 6
G 3
H 5
I 8
J 5
I would like to apply the pct_change function to column xx in chunks of 5 rows, to generate the following output:
xx
A NaN
B 0.333333
C -0.750000
D 4.000000
E 0.400000
F NaN
G -0.500000
H 0.666667
I 0.600000
J -0.375000
How could I achieve this?
Create an array with np.arange by the length of df, use integer division by 5, and pass it to the groupby function:
df = df.groupby(np.arange(len(df)) // 5).pct_change()
print (df)
xx
A NaN
B 0.333333
C -0.750000
D 4.000000
E 0.400000
F NaN
G -0.500000
H 0.666667
I 0.600000
J -0.375000
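To see why this works, the grouper for a 10-row frame looks like this, so rows 0-4 and 5-9 fall into separate groups and pct_change restarts at each group boundary:
import numpy as np
print(np.arange(10) // 5)
[0 0 0 0 0 1 1 1 1 1]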

Groupby by sort based on date time, groupby sequence based on 'ID' and Date and then mean by sequence

I am new to pandas.
I have a DF as shown below, which contains repair data for mobile phones.
ID Status Date Cost
0 1 F 22-Jun-17 500
1 1 M 22-Jul-17 100
2 2 M 29-Jun-17 200
3 3 M 20-Mar-17 300
4 4 M 10-Aug-17 800
5 2 F 29-Sep-17 600
6 2 F 29-Jan-18 500
7 1 F 22-Jun-18 600
8 3 F 20-Jun-18 700
9 1 M 22-Aug-18 150
10 1 F 22-Mar-19 750
11 3 M 20-Oct-18 250
12 4 F 10-Jun-18 100
I am trying to find the duration between each status and the previous one for each ID, and then
to find the mean duration for each status sequence for that ID.
My expected output is shown below.
ID S1 S1_Dur S2 S2_dur S3 S3_dur S4 S4_dur Avg_MF Avg_FM
0 1 F-M 30 M-F 335.00 F-M 61.00 M-F 750.00 542.50 45.50
1 2 M-F 92 F-F 122.00 NaN nan NaN nan 92.00 nan
2 3 M-F 457 F-M 122.00 NaN nan NaN nan 457.00 122.00
3 4 M-F 304 NaN nan NaN nan NaN nan 304.00 nan
S1 = first sequence
S1_Dur = S1 Duration
Avg_MF = Average M-F Duration
Avg_FM = Average F-M Duration
I tried the following code:
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
df = df.reset_index().sort_values(['ID', 'Date', 'Status']).set_index(['ID', 'Status'])
df['Difference'] = df.groupby('ID')['Date'].transform(pd.Series.diff)
df.reset_index(inplace=True)
Then I got a DF as shown below
ID Status index Date Cost Difference
0 1 F 0 2017-06-22 500 NaT
1 1 M 1 2017-07-22 100 30 days
2 1 F 7 2018-06-22 600 335 days
3 1 M 9 2018-08-22 150 61 days
4 1 F 10 2019-03-22 750 212 days
5 2 M 2 2017-06-29 200 NaT
6 2 F 5 2017-09-29 600 92 days
7 2 F 6 2018-01-29 500 122 days
8 3 M 3 2017-03-20 300 NaT
9 3 F 8 2018-06-20 700 457 days
10 3 M 11 2018-10-20 250 122 days
11 4 M 4 2017-08-10 800 NaT
12 4 F 12 2018-06-10 100 304 days
After that I am stuck.
The idea is to create new columns for the difference with DataFrameGroupBy.diff and to join shifted values of Status with DataFrameGroupBy.shift. Remove rows with missing values in the S column. Then reshape by DataFrame.unstack with GroupBy.cumcount for a counter column, create means per pair of S values by DataFrame.pivot_table, and last use DataFrame.join:
df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%y')
df = df.sort_values(['ID', 'Date', 'Status'])
df['D'] = df.groupby('ID')['Date'].diff().dt.days
df['S'] = df.groupby('ID')['Status'].shift() + '-' + df['Status']
df = df.dropna(subset=['S'])
df['g'] = df.groupby('ID').cumcount().add(1).astype(str)
df1 = df.pivot_table(index='ID', columns='S', values='D', aggfunc='mean').add_prefix('Avg_')
df2 = df.set_index(['ID', 'g'])[['S','D']].unstack().sort_index(axis=1, level=1)
df2.columns = df2.columns.map('_'.join)
df3 = df2.join(df1).reset_index()
print (df3)
ID D_1 S_1 D_2 S_2 D_3 S_3 D_4 S_4 Avg_F-F Avg_F-M \
0 1 30.0 F-M 335.0 M-F 61.0 F-M 212.0 M-F NaN 45.5
1 2 92.0 M-F 122.0 F-F NaN NaN NaN NaN 122.0 NaN
2 3 457.0 M-F 122.0 F-M NaN NaN NaN NaN NaN 122.0
3 4 304.0 M-F NaN NaN NaN NaN NaN NaN NaN NaN
Avg_M-F
0 273.5
1 92.0
2 457.0
3 304.0
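The column names differ slightly from the expected output (S_1/D_1 versus S1/S1_Dur, and Avg_M-F versus Avg_MF). If the exact names matter, a rename along these lines should finish the job (a sketch, assuming at most four sequences as above):
rename = {'Avg_M-F': 'Avg_MF', 'Avg_F-M': 'Avg_FM', 'Avg_F-F': 'Avg_FF'}
for i in range(1, 5):
    rename[f'S_{i}'] = f'S{i}'
    rename[f'D_{i}'] = f'S{i}_Dur'
df3 = df3.rename(columns=rename)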

Mul() Broadcast levels from Multi Index

Attempting to use a multiply operation with a multi index.
import pandas as pd
import numpy as np

d = {'Alpha': [1,2,3,4,5,6,7,8,9]
    ,'Beta': tuple('ABCDEFGHI')
    ,'C': np.random.randint(1,10,9)
    ,'D': np.random.randint(100,200,9)
    }
df = pd.DataFrame(d)
df.set_index(['Alpha','Beta'], inplace=True)
df = df.stack()  # it's now a series
df.index.names = df.index.names[:-1] + ['Gamma']

ser = pd.Series(data=np.random.rand(9))
ser.index = pd.MultiIndex.from_tuples(list(zip(range(1,10), np.repeat('C',9))))
ser.index.names = ['Alpha','Gamma']

print(df)
print(ser)

foo = df.mul(ser, axis=0, level=['Alpha','Gamma'])
So my dataframe, which became a series, looks like this:
Alpha Beta Gamma
1 A C 7
D 188
2 B C 7
D 110
3 C C 2
D 124
4 D C 4
D 153
5 E C 9
D 178
6 F C 6
D 196
7 G C 1
D 156
8 H C 1
D 184
9 I C 3
D 169
And my series looks like
Alpha Gamma
1 C 0.8731
2 C 0.6347
3 C 0.4688
4 C 0.5623
5 C 0.4944
6 C 0.5234
7 C 0.9946
8 C 0.7815
9 C 0.1219
In my multiply operation I want to broadcast on the index levels 'Alpha' and 'Gamma', but I get this error message:
TypeError: Join on level between two MultiIndex objects is ambiguous
How about this? Perhaps it's the extra 'Beta' level in df but not in ser that causes the problem?
(Note: this is using df as updated in @Dickster's answer, not as in the original question)
df2 = df.reset_index().set_index(['Alpha','Gamma'])
df2[0].mul(ser)
Alpha Gamma
1 C 2.503829
D NaN
2 C 5.028208
D NaN
3 C 0.842322
D NaN
4 C 0.198101
D NaN
5 C 0.800745
D NaN
6 C 1.936523
D NaN
7 C 2.507393
D NaN
8 C 4.846258
D NaN
9 C NaN
D 147.233378
So imagine I have this, where I now have a 'D' in Gamma in the series "ser":
import pandas as pd
import numpy as np

np.random.seed(1)
d = {'Alpha': [1,2,3,4,5,6,7,8,9]
    ,'Beta': tuple('ABCDEFGHI')
    ,'C': np.random.randint(1,10,9)
    ,'D': np.random.randint(100,200,9)
    }
df = pd.DataFrame(d)
df.set_index(['Alpha','Beta'], inplace=True)
df = df.stack()  # it's now a series
df.index.names = df.index.names[:-1] + ['Gamma']

ser = pd.Series(data=np.random.rand(9))
idx = list(np.repeat('C',8))
idx.append('D')
ser.index = pd.MultiIndex.from_tuples(list(zip(range(1,10), idx)))
ser.index.names = ['Alpha','Gamma']

print(df)
print(ser)

df_A = df.unstack('Alpha').mul(ser).stack('Alpha').reorder_levels(df.index.names)
print(df_A)

df_dickster77 = df.unstack('Alpha').mul(ser.unstack('Alpha')).stack('Alpha').reorder_levels(df.index.names)
print(df_dickster77)
Output is this:
Alpha Beta Gamma
1 A C 6
D 120
2 B C 9
D 118
3 C C 6
D 184
4 D C 1
D 111
5 E C 1
D 128
6 F C 2
D 129
7 G C 8
D 114
8 H C 7
D 150
9 I C 3
D 168
dtype: int32
Alpha Gamma
1 C 0.417305
2 C 0.558690
3 C 0.140387
4 C 0.198101
5 C 0.800745
6 C 0.968262
7 C 0.313424
8 C 0.692323
9 D 0.876389
dtype: float64
Output df_A: inadvertent multiplication
Gamma C D
Alpha Beta Gamma
1 A C 2.503829 NaN
D 50.076576 NaN
2 B C 5.028208 NaN
D 65.925400 NaN
3 C C 0.842322 NaN
D 25.831197 NaN
4 D C 0.198101 NaN
D 21.989265 NaN
5 E C 0.800745 NaN
D 102.495305 NaN
6 F C 1.936523 NaN
D 124.905743 NaN
7 G C 2.507393 NaN
D 35.730356 NaN
8 H C 4.846258 NaN
D 103.848392 NaN
9 I C NaN 2.629167
D NaN 147.233378
Output df_dickster77: the multiplication is correct, lining up on the C's and the D. However, 8 x D NaNs and 1 x C NaN are lost:
Alpha Beta Gamma
1 A C 2.503829
2 B C 5.028208
3 C C 0.842322
4 D C 0.198101
5 E C 0.800745
6 F C 1.936523
7 G C 2.507393
8 H C 4.846258
9 I D 147.233378
dtype: float64
This is the way to do it at the moment. At some point a more concise method may be implemented.
In [21]: df.unstack('Alpha').mul(ser).stack('Alpha').reorder_levels(df.index.names)
Out[21]:
Gamma C
Alpha Beta Gamma
1 A C 6.761867
D 171.944612
2 B C 0.154139
D 6.371062
3 C C 2.311870
D 42.898041
4 D C 0.390920
D 9.479801
5 E C 3.484439
D 72.011743
6 F C 0.740913
D 50.382061
7 G C 3.459497
D 60.541203
8 H C 0.467012
D 19.030741
9 I C 0.071290
D 11.620286
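On recent pandas versions, another way to express the same broadcast (a sketch, not from the original answers) is to align ser against df's index with the extra Beta level dropped, and multiply on the raw values:
aligned = ser.reindex(df.index.droplevel('Beta'))  # one value per row of df, NaN where no (Alpha, Gamma) match
result = df * aligned.values                       # .values drops the 2-level index so the shapes line up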