How do I group/merge rows where several specified columns have the same values, and display the sums of the other columns that are not used for grouping/merging?
In the example below: if rows have the same values in columns "orgA" to "orgF" (text; this refers to an org. structure with departments and sub-departments), group/merge the rows and add up the numbers in columns "numA" and "numB".
import pandas as pd
import numpy as np
data = {'orgA': ['A','C','A','C','A','C','A','A','A','L'],
'orgB': ['B',np.nan,'E',np.nan,'B',np.nan,'E','E','E','C'],
'orgC': ['C',np.nan,'D',np.nan,'C',np.nan,'H','D','H','B'],
'orgD': ['D',np.nan,np.nan,np.nan,'D',np.nan,'F',np.nan,'F','S'],
'orgE': ['E',np.nan,np.nan,np.nan,'E',np.nan,np.nan,np.nan,np.nan,'F'],
'orgF': ['F',np.nan,np.nan,np.nan,'F',np.nan,np.nan,np.nan,np.nan,np.nan],
'numA': [1,1,1,1,1,1,1,1,1,1],
'numB': [2,2,2,2,2,2,2,2,2,2]}
df = pd.DataFrame(data)
print(df)
orgA orgB orgC orgD orgE orgF numA numB
0 A B C D E F 1 2
1 C NaN NaN NaN NaN NaN 1 2
2 A E D NaN NaN NaN 1 2
3 C NaN NaN NaN NaN NaN 1 2
4 A B C D E F 1 2
5 C NaN NaN NaN NaN NaN 1 2
6 A E H F NaN NaN 1 2
7 A E D NaN NaN NaN 1 2
8 A E H F NaN NaN 1 2
9 L C B S F NaN 1 2
The output is supposed to look as follows:
orgA orgB orgC orgD orgE orgF numA numB
0 A B C D E F 2 4
1 C NaN NaN NaN NaN NaN 3 6
2 A E D NaN NaN NaN 2 4
3 A E H F NaN NaN 2 4
4 L C B S F NaN 1 2
Many thanks for your ideas in advance!
You can pass a list of column names to groupby and set dropna=False (available since pandas 1.1) so that rows whose group keys contain NaN are not dropped. You can also specify sort=False if the group keys do not need to be sorted. Applying this to your example, as in
df.groupby(
['orgA', 'orgB', 'orgC', 'orgD', 'orgE', 'orgF'],
dropna=False,
sort=False
).sum()
we get
numA numB
orgA orgB orgC orgD orgE orgF
A B C D E F 2 4
C NaN NaN NaN NaN NaN 3 6
A E D NaN NaN NaN 2 4
H F NaN NaN 2 4
L C B S F NaN 1 2
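The result above keeps the org columns as a MultiIndex, while the desired output in the question has them as ordinary columns. A minimal sketch of one way to flatten it, assuming pandas 1.1 or later; the names org_cols and flat are only illustrative:

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "orgA": ["A", "C", "A", "C", "A", "C", "A", "A", "A", "L"],
    "orgB": ["B", nan, "E", nan, "B", nan, "E", "E", "E", "C"],
    "orgC": ["C", nan, "D", nan, "C", nan, "H", "D", "H", "B"],
    "orgD": ["D", nan, nan, nan, "D", nan, "F", nan, "F", "S"],
    "orgE": ["E", nan, nan, nan, "E", nan, nan, nan, nan, "F"],
    "orgF": ["F", nan, nan, nan, "F", nan, nan, nan, nan, nan],
    "numA": [1] * 10,
    "numB": [2] * 10,
})

org_cols = ["orgA", "orgB", "orgC", "orgD", "orgE", "orgF"]

# Group on all org columns, keep NaN keys, sum the numeric columns,
# then move the group keys back into regular columns.
flat = (
    df.groupby(org_cols, dropna=False, sort=False)[["numA", "numB"]]
      .sum()
      .reset_index()
)
print(flat)
```

Selecting numA and numB explicitly before summing keeps the aggregation limited to the numeric columns even if more columns are added later.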
I have a dataframe in pandas:
import pandas as pd
# assign data of lists.
data = {'Gender': ['M', 'F', 'M', 'F','M', 'F','M', 'F','M', 'F','M', 'F'],
'Employment': ['R','U', 'E','R','U', 'E','R','U', 'E','R','U', 'E'],
'Age': ['Y','M', 'O','Y','M', 'O','Y','M', 'O','Y','M', 'O']
}
# Create DataFrame
df = pd.DataFrame(data)
df
What I want is to create for each category of each existing column a new column with the following format:
Gender_M -> for when the gender equals M
Gender_F -> for when the gender equals F
Employment_R -> for when employment equals R
Employment_U -> for when employment equals U
and so on...
So far, I have created the below code:
for i in range(len(df.columns)):
    curent_column=list(df.columns)[i]
    col_df_array = df[curent_column].unique()
    for j in range(col_df_array.size):
        new_col_name = str(list(df.columns)[i])+"_"+col_df_array[j]
        for index,row in df.iterrows():
            if(row[curent_column] == col_df_array[j]):
                df[new_col_name] = row[curent_column]
The problem is that even though I have managed to create the column names successfully, I am not able to get the correct column values.
For example the column Gender should be as below:
data2 = {'Gender': ['M', 'F', 'M', 'F','M', 'F','M', 'F','M', 'F','M', 'F'],
'Gender_M': ['M', 'na', 'M', 'na','M', 'na','M', 'na','M', 'na','M', 'na'],
'Gender_F': ['na', 'F', 'na', 'F','na', 'F','na', 'F','na', 'F','na', 'F']
}
df2 = pd.DataFrame(data2)
Just to say, the na can be anything, such as blanks, dots or NaN.
You're looking for pd.get_dummies.
>>> pd.get_dummies(df)
Gender_F Gender_M Employment_E Employment_R Employment_U Age_M Age_O Age_Y
0 0 1 0 1 0 0 0 1
1 1 0 0 0 1 1 0 0
2 0 1 1 0 0 0 1 0
3 1 0 0 1 0 0 0 1
4 0 1 0 0 1 1 0 0
5 1 0 1 0 0 0 1 0
6 0 1 0 1 0 0 0 1
7 1 0 0 0 1 1 0 0
8 0 1 1 0 0 0 1 0
9 1 0 0 1 0 0 0 1
10 0 1 0 0 1 1 0 0
11 1 0 1 0 0 0 1 0
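Note that the 0/1 output above comes from an older pandas; from pandas 2.0 onward get_dummies returns boolean columns by default. A small sketch of how to ask for integers explicitly via the dtype parameter (the name dummies is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["M", "F"] * 6,
    "Employment": ["R", "U", "E"] * 4,
    "Age": ["Y", "M", "O"] * 4,
})

# dtype=int forces 0/1 indicator columns regardless of the pandas version.
dummies = pd.get_dummies(df, dtype=int)
print(dummies)
```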
If you are trying to get the data in a format like your df2 example, I believe this is what you are looking for.
ndf = pd.get_dummies(df)
df.join(ndf.mul(ndf.columns.str.split('_').str[-1]))
Old Answer
df[['Gender']].join(pd.get_dummies(df[['Gender']]).mul(df['Gender'],axis=0).replace('',np.nan))
Output:
Gender Gender_F Gender_M
0 M NaN M
1 F F NaN
2 M NaN M
3 F F NaN
4 M NaN M
5 F F NaN
6 M NaN M
7 F F NaN
8 M NaN M
9 F F NaN
10 M NaN M
11 F F NaN
If you are okay with 0s and 1s in your new columns, then using get_dummies (as suggested by @richardec) should be the most straightforward.
However, if you want a specific letter in each of your new columns, then another method is to loop through the current columns and the specific categories within each column, and create a new column from this information using apply.
for col in data.keys():
    categories = list(df[col].unique())
    for category in categories:
        df[f"{col}_{category}"] = df[col].apply(lambda x: category if x==category else float("nan"))
Result:
>>> df
Gender Employment Age Gender_M Gender_F Employment_R Employment_U Employment_E Age_Y Age_M Age_O
0 M R Y M NaN R NaN NaN Y NaN NaN
1 F U M NaN F NaN U NaN NaN M NaN
2 M E O M NaN NaN NaN E NaN NaN O
3 F R Y NaN F R NaN NaN Y NaN NaN
4 M U M M NaN NaN U NaN NaN M NaN
5 F E O NaN F NaN NaN E NaN NaN O
6 M R Y M NaN R NaN NaN Y NaN NaN
7 F U M NaN F NaN U NaN NaN M NaN
8 M E O M NaN NaN NaN E NaN NaN O
9 F R Y NaN F R NaN NaN Y NaN NaN
10 M U M M NaN NaN U NaN NaN M NaN
11 F E O NaN F NaN NaN E NaN NaN O
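The same result can probably be had without apply. A minimal sketch using Series.where, which keeps a value where the condition holds and fills NaN elsewhere; base_cols is an illustrative name, not something from the answers above:

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["M", "F"] * 6,
    "Employment": ["R", "U", "E"] * 4,
    "Age": ["Y", "M", "O"] * 4,
})

# Freeze the original column list first, because the loop adds columns.
base_cols = list(df.columns)
for col in base_cols:
    for category in df[col].unique():
        # Keep the letter where it matches the category, NaN otherwise.
        df[f"{col}_{category}"] = df[col].where(df[col] == category)

print(df)
```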
I want to take a series and append it to an existing dataframe row. For example:
df
A B C
0 2 3 4
1 5 6 7
2 7 8 9
series
0 x
1 y
2 z
-->
A B C D E F
0 2 3 4 x y z
1 5 6 7 ...
2 7 8 9 ...
I want to do this using a for loop, appending a different series to each row of the dataframe. The series may have different lengths. Is there an easy way to accomplish this?
Use loc, with the series's index as the column names:
lst = [
[2,3,4],
[5,6,7],
[7,8,9]
]
df = pd.DataFrame(lst, columns=list("ABC"))
print(df)
###
A B C
0 2 3 4
1 5 6 7
2 7 8 9
s1 = pd.Series(list("xyz"))
s1.index = list("DEF")
print(s1)
###
D x
E y
F z
dtype: object
s2 = pd.Series(list("abcd"))
s2.index = list("GHIJ")
print(s2)
###
G a
H b
I c
J d
dtype: object
for idx, s in enumerate([s1, s2]):
    df.loc[idx, s.index] = s.values
print(df)
###
A B C D E F G H I J
0 2 3 4 x y z NaN NaN NaN NaN
1 5 6 7 NaN NaN NaN a b c d
2 7 8 9 NaN NaN NaN NaN NaN NaN NaN
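If all the series are available up front, a loop may not be needed. The following is a sketch under that assumption: it builds one frame from the list of series (one row per series, columns taken from each series's index) and joins it onto the original by row label. The names extra and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[2, 3, 4], [5, 6, 7], [7, 8, 9]], columns=list("ABC"))

s1 = pd.Series(list("xyz"), index=list("DEF"))
s2 = pd.Series(list("abcd"), index=list("GHIJ"))

# Each series becomes one row; rows are labelled 0, 1, ... by default,
# which lines up with the first rows of df.
extra = pd.DataFrame([s1, s2])

# Left join on the row index; rows without a series get NaN.
result = df.join(extra)
print(result)
```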
Try this:
df['D'], df['E'], df['F'] = series.tolist()
And now:
print(df)
Gives:
A B C D E F
0 2 3 4 x y z
1 5 6 7 x y z
2 7 8 9 x y z
Edit:
If you are not sure how many extra values there are, try:
from string import ascii_uppercase as letters
df = df.assign(**dict(zip([letters[i + len(df.columns)] for i, v in enumerate(series)], series.tolist())))
print(df)
Output:
A B C D E F
0 2 3 4 x y z
1 5 6 7 x y z
2 7 8 9 x y z
a = pd.DataFrame(df.groupby('actor_1_name')['gross'].sum())
b = pd.DataFrame(df.groupby('actor_2_name')['gross'].sum())
c = pd.DataFrame(df.groupby('actor_3_name')['gross'].sum())
x = [a,b,c]
y = pd.concat(x)
p =['actor_1_name','actor_2_name','actor_3_name','gross']
df.loc[y.nlargest(3).index,p]
I want to find the sum of each column, then combine them to find the top 3 highest values, but I'm getting an error and am not sure how to fix it. I need some assistance.
I believe you need:
df = pd.DataFrame({'actor_1_name':['a','a','a','b','b','c','c','d','d','e'],
'actor_2_name':['d','d','a','c','b','c','c','d','e','e'],
'actor_3_name':['c','c','a','b','b','b','c','e','e','e'],
'gross':[1,2,3,4,5,6,7,8,9,10]})
print (df)
actor_1_name actor_2_name actor_3_name gross
0 a d c 1
1 a d c 2
2 a a a 3
3 b c b 4
4 b b b 5
5 c c b 6
6 c c c 7
7 d d e 8
8 d e e 9
9 e e e 10
a = df.groupby('actor_1_name')['gross'].sum().nlargest(3)
b = df.groupby('actor_2_name')['gross'].sum().nlargest(3)
c = df.groupby('actor_3_name')['gross'].sum().nlargest(3)
x = [a,b,c]
print (x)
[actor_1_name
d 17
c 13
e 10
Name: gross, dtype: int64, actor_2_name
e 19
c 17
d 11
Name: gross, dtype: int64, actor_3_name
e 27
b 15
c 10
Name: gross, dtype: int64]
df1 = pd.concat(x, axis=1, keys=['actor_1_name','actor_2_name','actor_3_name'])
print (df1)
actor_1_name actor_2_name actor_3_name
b NaN NaN 15.0
c 13.0 17.0 10.0
d 17.0 11.0 NaN
e 10.0 19.0 27.0
EDIT1:
a = df.groupby('actor_1_name')['gross'].sum().nlargest(3).reset_index()
b = df.groupby('actor_2_name')['gross'].sum().nlargest(3).reset_index()
c = df.groupby('actor_3_name')['gross'].sum().nlargest(3).reset_index()
x = [a,b,c]
print (x)
[ actor_1_name gross
0 d 17
1 c 13
2 e 10, actor_2_name gross
0 e 19
1 c 17
2 d 11, actor_3_name gross
0 e 27
1 b 15
2 c 10]
df1 = pd.concat(x, axis=1, keys=['a','b','c'])
df1.columns = df1.columns.map('-'.join)
print (df1)
a-actor_1_name a-gross b-actor_2_name b-gross c-actor_3_name c-gross
0 d 17 e 19 e 27
1 c 13 c 17 b 15
2 e 10 d 11 c 10
EDIT2:
a = df.groupby('actor_1_name')['gross'].sum().nlargest(3).reset_index(drop=True)
b = df.groupby('actor_2_name')['gross'].sum().nlargest(3).reset_index(drop=True)
c = df.groupby('actor_3_name')['gross'].sum().nlargest(3).reset_index(drop=True)
x = [a,b,c]
print (x)
[0 17
1 13
2 10
Name: gross, dtype: int64, 0 19
1 17
2 11
Name: gross, dtype: int64, 0 27
1 15
2 10
Name: gross, dtype: int64]
df1 = pd.concat(x, axis=1, keys=['actor_1_name','actor_2_name','actor_3_name'])
print (df1)
actor_1_name actor_2_name actor_3_name
0 17 19 27
1 13 17 15
2 10 11 10
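If the goal is instead a single top 3 across all three actor columns, one possible reading of the question, the columns can be melted into one before grouping. This is only a sketch under that assumption: it counts a film's gross once per column an actor appears in, so a row such as a/a/a contributes three times. The names actor_cols and totals are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "actor_1_name": ["a", "a", "a", "b", "b", "c", "c", "d", "d", "e"],
    "actor_2_name": ["d", "d", "a", "c", "b", "c", "c", "d", "e", "e"],
    "actor_3_name": ["c", "c", "a", "b", "b", "b", "c", "e", "e", "e"],
    "gross": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

actor_cols = ["actor_1_name", "actor_2_name", "actor_3_name"]

# Stack the three actor columns into one "actor" column, keeping gross,
# then sum gross per actor and take the three largest totals.
totals = (
    df.melt(id_vars="gross", value_vars=actor_cols, value_name="actor")
      .groupby("actor")["gross"]
      .sum()
      .nlargest(3)
)
print(totals)
```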
Attempting to use a multiply operation with a multi index.
import pandas as pd
import numpy as np
d = {'Alpha': [1,2,3,4,5,6,7,8,9]
,'Beta':tuple('ABCDEFGHI')
,'C': np.random.randint(1,10,9)
,'D': np.random.randint(100,200,9)
}
df = pd.DataFrame(d)
df.set_index(['Alpha','Beta'],inplace=True)
df = df.stack() #it's now a series
df.index.names = df.index.names[:-1] + ['Gamma']
ser = pd.Series(data = np.random.rand(9))
ser.index = pd.MultiIndex.from_tuples(zip(range(1,10),np.repeat('C',9)))
ser.index.names = ['Alpha','Gamma']
print(df)
print(ser)
foo = df.mul(ser,axis=0,level = ['Alpha','Gamma'])
So my dataframe which became a series looks like
Alpha Beta Gamma
1 A C 7
D 188
2 B C 7
D 110
3 C C 2
D 124
4 D C 4
D 153
5 E C 9
D 178
6 F C 6
D 196
7 G C 1
D 156
8 H C 1
D 184
9 I C 3
D 169
And my series looks like
Alpha Gamma
1 C 0.8731
2 C 0.6347
3 C 0.4688
4 C 0.5623
5 C 0.4944
6 C 0.5234
7 C 0.9946
8 C 0.7815
9 C 0.1219
In my multiply operation I want to broadcast on index levels 'Alpha' and 'Gamma',
but I get this error message:
TypeError: Join on level between two MultiIndex objects is ambiguous
How about this? Perhaps the extra 'Beta' index level, which is in df but not in ser, causes the problem?
(Note: this is using df as updated in @Dickster's answer, not as in the original question)
df2 = df.reset_index().set_index(['Alpha','Gamma'])
df2[0].mul(ser)
Alpha Gamma
1 C 2.503829
D NaN
2 C 5.028208
D NaN
3 C 0.842322
D NaN
4 C 0.198101
D NaN
5 C 0.800745
D NaN
6 C 1.936523
D NaN
7 C 2.507393
D NaN
8 C 4.846258
D NaN
9 C NaN
D 147.233378
So imagine I have this, where I now have a 'D' in Gamma in the series "ser":
import pandas as pd
import numpy as np
np.random.seed(1)
d = {'Alpha': [1,2,3,4,5,6,7,8,9]
,'Beta':tuple('ABCDEFGHI')
,'C': np.random.randint(1,10,9)
,'D': np.random.randint(100,200,9)
}
df = pd.DataFrame(d)
df.set_index(['Alpha','Beta'],inplace=True)
df = df.stack() #it's now a series
df.index.names = df.index.names[:-1] + ['Gamma']
ser = pd.Series(data = np.random.rand(9))
idx = list(np.repeat('C',8))
idx.append('D')
ser.index = pd.MultiIndex.from_tuples(zip(range(1,10),idx))
ser.index.names = ['Alpha','Gamma']
print(df)
print(ser)
df_A = df.unstack('Alpha').mul(ser).stack('Alpha').reorder_levels(df.index.names)
print(df_A)
df_dickster77 = df.unstack('Alpha').mul(ser.unstack('Alpha')).stack('Alpha').reorder_levels(df.index.names)
print(df_dickster77)
Output is this:
Alpha Beta Gamma
1 A C 6
D 120
2 B C 9
D 118
3 C C 6
D 184
4 D C 1
D 111
5 E C 1
D 128
6 F C 2
D 129
7 G C 8
D 114
8 H C 7
D 150
9 I C 3
D 168
dtype: int32
Alpha Gamma
1 C 0.417305
2 C 0.558690
3 C 0.140387
4 C 0.198101
5 C 0.800745
6 C 0.968262
7 C 0.313424
8 C 0.692323
9 D 0.876389
dtype: float64
Output A: inadvertent multiplication (the D rows are multiplied by the C factors)
Gamma C D
Alpha Beta Gamma
1 A C 2.503829 NaN
D 50.076576 NaN
2 B C 5.028208 NaN
D 65.925400 NaN
3 C C 0.842322 NaN
D 25.831197 NaN
4 D C 0.198101 NaN
D 21.989265 NaN
5 E C 0.800745 NaN
D 102.495305 NaN
6 F C 1.936523 NaN
D 124.905743 NaN
7 G C 2.507393 NaN
D 35.730356 NaN
8 H C 4.846258 NaN
D 103.848392 NaN
9 I C NaN 2.629167
D NaN 147.233378
Output df_dickster77: the multiplication is correct, lining up on the C's and the D.
However, the 8 D rows and the 1 C row that should be NaN are lost.
Alpha Beta Gamma
1 A C 2.503829
2 B C 5.028208
3 C C 0.842322
4 D C 0.198101
5 E C 0.800745
6 F C 1.936523
7 G C 2.507393
8 H C 4.846258
9 I D 147.233378
dtype: float64
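One way to keep every row, with NaN where ser has no matching factor, while still aligning only on 'Alpha' and 'Gamma' is to look the factors up explicitly. The following is a sketch, not a built-in pandas feature: it drops 'Beta' from the index to build lookup keys, reindexes ser by those keys, and multiplies positionally. The names wide, stacked, keys and factors are illustrative, and the random data differs from the numbers shown in this thread.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
wide = pd.DataFrame({
    "Alpha": list(range(1, 10)),
    "Beta": list("ABCDEFGHI"),
    "C": rng.integers(1, 10, 9),
    "D": rng.integers(100, 200, 9),
}).set_index(["Alpha", "Beta"])

stacked = wide.stack()  # Series indexed by (Alpha, Beta, <column name>)
stacked.index.names = ["Alpha", "Beta", "Gamma"]

ser = pd.Series(
    rng.random(9),
    index=pd.MultiIndex.from_arrays(
        [list(range(1, 10)), list("CCCCCCCCD")], names=["Alpha", "Gamma"]
    ),
)

# (Alpha, Gamma) key for every row of stacked, in the same row order.
keys = stacked.index.droplevel("Beta")

# Look up one factor per row; pairs missing from ser become NaN.
factors = ser.reindex(keys)

# Multiply by position so the original three-level index is kept.
result = stacked * factors.to_numpy()
print(result)
```

Multiplying by the underlying array sidesteps index alignment entirely, so this relies on keys being in the same row order as stacked, which holds here because keys is derived directly from stacked.index.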
This is the way to do it at the moment. At some point a more concise way may be implemented.
In [21]: df.unstack('Alpha').mul(ser).stack('Alpha').reorder_levels(df.index.names)
Out[21]:
Gamma C
Alpha Beta Gamma
1 A C 6.761867
D 171.944612
2 B C 0.154139
D 6.371062
3 C C 2.311870
D 42.898041
4 D C 0.390920
D 9.479801
5 E C 3.484439
D 72.011743
6 F C 0.740913
D 50.382061
7 G C 3.459497
D 60.541203
8 H C 0.467012
D 19.030741
9 I C 0.071290
D 11.620286