Group/merge rows when defined columns match and sum up values

Group/merge rows when defined columns match and sum up values - pandas

How Do I group/merge rows, where multiple defined columns have the same value and display the sums in other columns not relevant for grouping/merging?
In the below example: If rows have the same values in columns "OrgA" to "OrgF" (text – this refers to an org. structure with departments and sub-departments), group/merge rows and add up the numbers in columns "numA" and "numB".
import pandas as pd
import numpy as np
data = {'orgA': ['A','C','A','C','A','C','A','A','A','L'],
'orgB': ['B',np.nan,'E',np.nan,'B',np.nan,'E','E','E','C'],
'orgC': ['C',np.nan,'D',np.nan,'C',np.nan,'H','D','H','B'],
'orgD': ['D',np.nan,np.nan,np.nan,'D',np.nan,'F',np.nan,'F','S'],
'orgE': ['E',np.nan,np.nan,np.nan,'E',np.nan,np.nan,np.nan,np.nan,'F'],
'orgF': ['F',np.nan,np.nan,np.nan,'F',np.nan,np.nan,np.nan,np.nan,np.nan],
'numA': [1,1,1,1,1,1,1,1,1,1],
'numB': [2,2,2,2,2,2,2,2,2,2]}
df = pd.DataFrame(data)
print(df)
orgA orgB orgC orgD orgE orgF numA numB
0 A B C D E F 1 2
1 C NaN NaN NaN NaN NaN 1 2
2 A E D NaN NaN NaN 1 2
3 C NaN NaN NaN NaN NaN 1 2
4 A B C D E F 1 2
5 C NaN NaN NaN NaN NaN 1 2
6 A E H F NaN NaN 1 2
7 A E D NaN NaN NaN 1 2
8 A E H F NaN NaN 1 2
9 L C B S F NaN 1 2
The output is supposed to look as follows:
orgA orgB orgC orgD orgE orgF numA numB
0 A B C D E F 2 4
1 C NaN NaN NaN NaN NaN 3 6
2 A E D NaN NaN NaN 2 4
3 A E H F NaN NaN 3 6
4 L C B S F NaN 1 2
Many thanks for your ideas in advance!

You can pass a list of column names to groupby, and set dropna to False so that rows containing nans are not dropped. You can also specify sort=False if it is not important to sort the group keys. Applying this to your example, as in
df.groupby(
['orgA', 'orgB', 'orgC', 'orgD', 'orgE', 'orgF'],
dropna=False,
sort=False
).sum()
we get
numA numB
orgA orgB orgC orgD orgE orgF
A B C D E F 2 4
C NaN NaN NaN NaN NaN 3 6
A E D NaN NaN NaN 2 4
H F NaN NaN 2 4
L C B S F NaN 1 2

Related

Create columns in python data frame based on existing column-name and column-values

I have a dataframe in pandas:
import pandas as pd
# assign data of lists.
data = {'Gender': ['M', 'F', 'M', 'F','M', 'F','M', 'F','M', 'F','M', 'F'],
'Employment': ['R','U', 'E','R','U', 'E','R','U', 'E','R','U', 'E'],
'Age': ['Y','M', 'O','Y','M', 'O','Y','M', 'O','Y','M', 'O']
}
# Create DataFrame
df = pd.DataFrame(data)
df
What I want is to create for each category of each existing column a new column with the following format:
Gender_M -> for when the gender equals M
Gender_F -> for when the gender equal F
Employment_R -> for when employment equals R
Employment_U -> for when employment equals U
and so on...
So far, I have created the below code:
for i in range(len(df.columns)):
curent_column=list(df.columns)[i]
col_df_array = df[curent_column].unique()
for j in range(col_df_array.size):
new_col_name = str(list(df.columns)[i])+"_"+col_df_array[j]
for index,row in df.iterrows():
if(row[curent_column] == col_df_array[j]):
df[new_col_name] = row[curent_column]
The problem is that even though I have managed to create successfully the column names, I am not able to get the correct column values.
For example the column Gender should be as below:
data2 = {'Gender': ['M', 'F', 'M', 'F','M', 'F','M', 'F','M', 'F','M', 'F'],
'Gender_M': ['M', 'na', 'M', 'na','M', 'na','M', 'na','M', 'na','M', 'na'],
'Gender_F': ['na', 'F', 'na', 'F','na', 'F','na', 'F','na', 'F','na', 'F']
}
df2 = pd.DataFrame(data2)
Just to say, the na can be anything such as blanks or dots or NAN.

You're looking for pd.get_dummies.
>>> pd.get_dummies(df)
Gender_F Gender_M Employment_E Employment_R Employment_U Age_M Age_O Age_Y
0 0 1 0 1 0 0 0 1
1 1 0 0 0 1 1 0 0
2 0 1 1 0 0 0 1 0
3 1 0 0 1 0 0 0 1
4 0 1 0 0 1 1 0 0
5 1 0 1 0 0 0 1 0
6 0 1 0 1 0 0 0 1
7 1 0 0 0 1 1 0 0
8 0 1 1 0 0 0 1 0
9 1 0 0 1 0 0 0 1
10 0 1 0 0 1 1 0 0
11 1 0 1 0 0 0 1 0

If you are trying to get the data in a format like your df2 example, I believe this is what you are looking for.
ndf = pd.get_dummies(df)
df.join(ndf.mul(ndf.columns.str.split('_').str[-1]))
Output:
Old Answer
df[['Gender']].join(pd.get_dummies(df[['Gender']]).mul(df['Gender'],axis=0).replace('',np.NaN))
Output:
Gender Gender_F Gender_M
0 M NaN M
1 F F NaN
2 M NaN M
3 F F NaN
4 M NaN M
5 F F NaN
6 M NaN M
7 F F NaN
8 M NaN M
9 F F NaN
10 M NaN M
11 F F NaN

If you are okay with 0s and 1s in your new columns, then using get_dummies (as suggested by #richardec) should be the most straightforward.
However, if want a specific letter in each of your new columns, then another method is to loop through the current columns and the specific categories within each column, and create a new column from this information using apply.
for col in data.keys():
categories = list(df[col].unique())
for category in categories:
df[f"{col}_{category}"] = df[col].apply(lambda x: category if x==category else float("nan"))
Result:
>>> df
Gender Employment Age Gender_M Gender_F Employment_R Employment_U Employment_E Age_Y Age_M Age_O
0 M R Y M NaN R NaN NaN Y NaN NaN
1 F U M NaN F NaN U NaN NaN M NaN
2 M E O M NaN NaN NaN E NaN NaN O
3 F R Y NaN F R NaN NaN Y NaN NaN
4 M U M M NaN NaN U NaN NaN M NaN
5 F E O NaN F NaN NaN E NaN NaN O
6 M R Y M NaN R NaN NaN Y NaN NaN
7 F U M NaN F NaN U NaN NaN M NaN
8 M E O M NaN NaN NaN E NaN NaN O
9 F R Y NaN F R NaN NaN Y NaN NaN
10 M U M M NaN NaN U NaN NaN M NaN
11 F E O NaN F NaN NaN E NaN NaN O

How to delete a row based on number of cells in this row

I have a long data frame which contains some data for my project
I want to delete the row that contains more than 2 cells
here is my sample code
A B C D E F
9012_1 :2683_1_0
9044_0 :2680_1_0
9007_1 9007_2 :8487_3_0 :8487_4_0 :2675_1_0
8814_2 :8374_1_2
77114_0 77114_1 :53453_1_0 :53453_1_1
I want my output to be like this
A B C D E F
9012_1 :2683_1_0
9044_0 :2680_1_0
8814_2 :8374_1_2
how could it be done, as I have searched for it many times and could not find any answer.
Thanks

Based on your comment, I'm supposing you have this dataframe:
A B C D E F
0 9012_1 :2683_1_0 NaN NaN NaN NaN
1 9044_0 :2680_1_0 NaN NaN NaN NaN
2 9007_1 9007_2 :8487_3_0 :8487_4_0 :2675_1_0 NaN
3 8814_2 :8374_1_2 NaN NaN NaN NaN
4 77114_0 77114_1 :53453_1_0 :53453_1_1 NaN NaN
Then:
print(df[df.notna().sum(1) <= 2])
Prints:
A B C D E F
0 9012_1 :2683_1_0 NaN NaN NaN NaN
1 9044_0 :2680_1_0 NaN NaN NaN NaN
3 8814_2 :8374_1_2 NaN NaN NaN NaN

I am assuming the same df as #Andrej.
You could use count(axis=1) which will count the non nan across columns, and use ~ operator with gt(3). This will delete the rows that have more than 2 non nan.
print(df[~df.count(1).gt(3)])
A B C D E F
0 9012_1 :2683_1_0 NaN NaN NaN NaN
1 9044_0 :2680_1_0 NaN NaN NaN NaN
3 8814_2 :8374_1_2 NaN NaN NaN NaN

How do I make the pandas index of a pivot table part of the column names?

I'm trying to pivot two columns out by another flag column with out multi-indexing. I would like to have the column names be a part of the indicator itself. Take for example:
import pandas as pd
df_dict = {'fire_indicator':[0,0,1,0,1],
'cost':[200, 300, 354, 456, 444],
'value':[1,1,2,1,1],
'id':['a','b','c','d','e']}
df = pd.DataFrame(df_dict)
If I do the following:
df.pivot_table(index = 'id', columns = 'fire_indicator', values = ['cost','value'])
I get the following:
cost value
fire_indicator 0 1 0 1
id
a 200.0 NaN 1.0 NaN
b 300.0 NaN 1.0 NaN
c NaN 354.0 NaN 2.0
d 456.0 NaN 1.0 NaN
e NaN 444.0 NaN 1.0
What I'm trying to do is the following:
id fire_indicator_0_cost fire_indicator_1_cost fire_indicator_0_value fire_indicator_0_value
a 200 0 1 0
b 300 0 1 0
c 0 354 0 2
d 456 0 1 0
e 0 444 0 1
I know there is a way in SAS. Is there a way in python pandas?

Just rename and re_index:
out = df.pivot_table(index = 'id', columns = 'fire_indicator', values = ['cost','value'])
out.columns = [f'fire_indicator_{y}_{x}' for x,y in out.columns]
# not necessary if you want `id` be the index
out = out.reset_index()
Output:
id fire_indicator_0_cost fire_indicator_1_cost fire_indicator_0_value fire_indicator_1_value
-- ---- ----------------------- ----------------------- ------------------------ ------------------------
0 a 200 nan 1 nan
1 b 300 nan 1 nan
2 c nan 354 nan 2
3 d 456 nan 1 nan
4 e nan 444 nan 1

In pandas replace consecutive 0s with NaN

I want to clean some data by replacing only CONSECUTIVE 0s in a data frame
Given:
import pandas as pd
import numpy as np
d = [[1,np.NaN,3,4],[2,0,0,np.NaN],[3,np.NaN,0,0],[4,np.NaN,0,0]]
df = pd.DataFrame(d, columns=['a', 'b', 'c', 'd'])
df
a b c d
0 1 NaN 3 4.0
1 2 0.0 0 NaN
2 3 NaN 0 0.0
3 4 NaN 0 0.0
The desired result should be:
a b c d
0 1 NaN 3 4.0
1 2 0.0 NaN NaN
2 3 NaN NaN NaN
3 4 NaN NaN NaN
where column c & d are affected but column b is NOT affected as it only has 1 zero (and not consecutive 0s).
I have experimented with this answer:
Replacing more than n consecutive values in Pandas DataFrame column
which is along the right lines but the solution keeps the first 0 in a given column which is not desired in my case.

Let us do shift with mask
df=df.mask((df.shift().eq(df)|df.eq(df.shift(-1)))&(df==0))
Out[469]:
a b c d
0 1 NaN 3.0 4.0
1 2 0.0 NaN NaN
2 3 NaN NaN NaN
3 4 NaN NaN NaN

Pandas assign value in one column based on top 10 values in another column

I have a table:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
I would like to make a new column called 'flag' for the top 2 values in column D.
I've tried:
for i in df.D.nlargest(2):
df.['flag']= 1
But that gets me:
A B C D flag
0 NaN 2.0 NaN 0 1
1 3.0 4.0 NaN 1 1
2 NaN NaN NaN 5 1
3 NaN 3.0 NaN 4 1
What I want is:
A B C D flag
0 NaN 2.0 NaN 0 0
1 3.0 4.0 NaN 1 0
2 NaN NaN NaN 5 1
3 NaN 3.0 NaN 4 1

IIUC:
df['flag'] = 0
df.loc[df.D.nlargest(2).index, 'flag'] = 1
Or:
df['flag'] = df.index.isin(df.D.nlargest(2).index).astype(int)
Output:
A B C D flag
0 NaN 2.0 NaN 0 0
1 3.0 4.0 NaN 1 0
2 NaN NaN NaN 5 1
3 NaN 3.0 NaN 4 1

IIUC
df['flag']=df.D.sort_values().tail(2).eq(df.D).astype(int)
df
A B C D flag
0 NaN 2.0 NaN 0 0
1 3.0 4.0 NaN 1 0
2 NaN NaN NaN 5 1
3 NaN 3.0 NaN 4 1

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Group/merge rows when defined columns match and sum up values - pandas

Related

Create columns in python data frame based on existing column-name and column-values

How to delete a row based on number of cells in this row

How do I make the pandas index of a pivot table part of the column names?

In pandas replace consecutive 0s with NaN

Pandas assign value in one column based on top 10 values in another column

Categories

Resources