pandas groupby + multiple aggregate/apply with multiple columns - pandas

I have this minimal sample data:
import pandas as pd
from pandas import Timestamp
data = pd.DataFrame({'Client': {0: "Client_1", 1: "Client_2", 2: "Client_2", 3: "Client_3", 4: "Client_3", 5: "Client_3", 6: "Client_4", 7: "Client_4"},
'Id_Card': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8},
'Type': {0: 'A', 1: 'B', 2: 'C', 3: np.nan, 4: 'A', 5: 'B', 6: np.nan, 7: 'B'},
'Loc': {0: 'ADW', 1: 'ZCW', 2: 'EWC', 3: "VWQ", 4: "OKS", 5: 'EQW', 6: "PKA", 7: 'CSA'},
'Amount': {0: 10.0, 1: 15.0, 2: 17.0, 3: 32.0, 4: np.nan, 5: 51.0, 6: 38.0, 7: -20.0},
'Net': {0: 30.0, 1: 42.0, 2: -10.0, 3: 15.0, 4: 98, 5: np.nan, 6: 23.0, 7: -10.0},
'Date': {0: Timestamp('2018-09-29 00:00:00'), 1: Timestamp('1996-08-02 00:00:00'), 2: np.nan, 3: Timestamp('2020-11-02 00:00:00'), 4: Timestamp('2008-12-27 00:00:00'), 5: Timestamp('2004-12-21 00:00:00'), 6: np.nan, 7: Timestamp('2010-08-25 00:00:00')}})
data
I'm trying to aggregate this data grouping by Client column. Counting the Id_Card per client, concatenating Type, Loc, separated by ; (e.g. A;B and ZCW;EWC values for Client_2, NOT A;ZCW B;EWC), sum the Amount, Net, per client, and getting the minimum Date per client. However, I'm facing some problems:
These functions works perfectly individually, but I can't find a way to mix the aggregate function and apply function:
Code example:
data.groupby("Client").agg({"Id_Card": "count", "Amount":"sum", "Date": "min"})
data.groupby('Client')['Loc'].apply(';'.join).reset_index()
The apply function doesn't work for columns with missing values:
Code example:
data.groupby('Client')['Type'].apply(';'.join).reset_index()
TypeError: sequence item 0: expected str instance, float found
The aggregate and apply functions don't allow me to put multiple columns for one transformation:
Code example:
cols_to_sum = ["Amount", "Net"]
data.groupby("Client").agg({"Id_Card": "count", cols_to_sum:"sum", "Date": "min"})
cols_to_join = ["Type", "Loc"]
data.groupby('Client')[cols_to_join].apply(';'.join).reset_index()
In (3) I only put Amount and Net and I could put them separately in the aggregate function, but I'm looking to a more efficient way as I'm working with plenty of columns.
The output expected is the same dataframe, but aggregated with the conditions outlined at the beggining.

For doing a join, you would have to filter out the NaN values. As join you have to apply at two places, I have created a separate function
def join_non_nan_values(elements):
return ";".join([elem for elem in elements if elem == elem]) # elem == elem will fail for Nan values
data.groupby("Client").agg({"Id_Card": "count", "Type": join_non_nan_values,
"Loc": join_non_nan_values, "Amount":"sum", "Net": "sum", "Date": "min"})

Go step by step, and prepare three different data frames to merge them later.
First dataframe is for simple functions like count,sum,mean
df1 = data.groupby("Client").agg({"Id_Card": "count", "Amount":"sum", "Net":sum, "Date": "min"}).reset_index()
Next you deal with Type and Loc join, we use fill na to deal with nan values
df2=data[['Client', 'Type']].fillna('').groupby("Client")['Type'].apply(
';'.join).reset_index()
df3=data[['Client', 'Loc']].fillna('').groupby("Client")['Loc'].apply(
';'.join).reset_index()
And finally you merge the results together:
data_new = df1.merge(df2, on='Client').merge(df3, on='Client')
data_new output:

Related

How can I combine 2 levels index into one in pandas?

I am trying to extract only the relevant information from a dataframe. My data looks like
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID': {0: 'id1', 1: 'id1', 2: 'id1'},
'EM': {0: 'met1', 1: 'met2', 2: 'met3'},
'met1_AVG': {0: 0.38, 1: np.nan, 2: np.nan},
'met2_AVG': {0: np.nan, 1: 0.2, 2: np.nan},
'met3_AVG': {0: np.nan, 1: np.nan, 2: 0.58},
'score': {0: 89, 1: 89, 2: 89}})
My desired output is
Please, find my code below. I really would appreciate if someone could help me out. Thank you in advance for your time and helpful assistance
df_melted = df.melt(id_vars=['ID','EM','score']).dropna(subset=['value'])
df_pivoted = pd.pivot_table(data=df_melted,index=['ID','score'],columns=['variable'])
df_ready = df_pivoted.reset_index()
df_ready
Assuming the score is always same, you can use pandas.DataFrame.groupby.first:
df.drop("EM",axis=1).groupby("ID", as_index=False).first()
Output:
ID met1_AVG met2_AVG met3_AVG score
0 id1 0.38 0.2 0.58 89

multiple nested groupby in pandas

Here is my pandas dataframe:
df = pd.DataFrame({'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11',5: '2016-10-12',6: '2016-10-12',7: '2016-10-12',8: '2016-10-12',9: '2016-10-12'}, 'Stock': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H',8: 'I', 9:'J'}, 'Sector': {0: 0,1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6:0, 7:0, 8:1, 9:1}, 'Segment': {0: 0, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6:2,7:2,8:3,9:3}, 'Range': {0: 5, 1: 0, 2: 1, 3: 0, 4: 2, 5: 6, 6:0, 7:23, 8:5, 9:5}})
Here is how it looks:
I want to add the following columns:
'Date_Range_Avg': average of 'Range' grouped by Date
'Date_Sector_Range_Avg': average of 'Range' grouped by Date and Sector
'Date_Segment_Range_Avg': average of 'Range' grouped by Date and Segment
This would be the output:
res = pd.DataFrame({'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11',5: '2016-10-12',6: '2016-10-12',7: '2016-10-12',8: '2016-10-12',9: '2016-10-12'}, 'Stock': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H',8: 'I', 9:'J'}, 'Sector': {0: 0,1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6:0, 7:0, 8:1, 9:1}, 'Segment': {0: 0, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6:2,7:2,8:3,9:3}, 'Range': {0: 5, 1: 0, 2: 1, 3: 0, 4: 2, 5: 6, 6:0, 7:23, 8:5, 9:5}, 'Date_Range_Avg':{0: 1.6, 1: 1.6, 2: 1.6, 3: 1.6, 4: 1.6, 5: 7.8, 6: 7.8, 7: 7.8, 8:7.8, 9: 7.8}, 'Date_Sector_Range_Avg':{0: 2.5, 1: 2.5, 2: 1, 3: 1, 4: 1, 5: 9.67, 6: 9.67, 7: 9.67, 8: 9.67, 9: 9.67}, 'Date_Segment_Range_Avg':{0: 5, 1: 0.75, 2: 0.75, 3: 0.75, 4: 0.75, 5: 6, 6: 11.5, 7: 11.5, 8: 5, 9: 5}})
This is how it looks:
Note I have rounded some of the values - but this rounding is not essential for the question I have (please feel free to not round)
I'm aware that I can do each of these groupings separately but it strikes me as inefficient (my dataset contains millions of rows)
Essentially, I would like to first do a grouping by Date and then re-use it to do the two more fine-grained groupings by Date and Segment and by Date and Sector.
How to do this?
My initial hunch is to go like this:
day_groups = df.groupby("Date")
df['Date_Range_Avg'] = day_groups['Range'].transform('mean')
and then to re-use day_groups to do the 2 more fine-grained groupbys like this:
df['Date_Sector_Range_Avg'] = day_groups.groupby('Segment')[Range].transform('mean')
Which doesn't work as you get:
'AttributeError: 'DataFrameGroupBy' object has no attribute 'groupby''
groupby runs really fast when the aggregate function is vectorized. If you are worried about performance, try it out first to see if it's the real bottleneck in your program.
You can create temporary data frames holding the result of each groupby, then successively merge them with df:
group_bys = {
"Date_Range_Avg": ["Date"],
"Date_Sector_Range_Avg": ["Date", "Sector"],
"Date_Segment_Range_Avg": ["Date", "Segment"]
}
tmp = [
df.groupby(columns)["Range"].mean().to_frame(key)
for key, columns in group_bys.items()
]
result = df
for t in tmp:
result = result.merge(t, left_on=t.index.names, right_index=True)
Result:
Date Stock Sector Segment Range Date_Range_Avg Date_Sector_Range_Avg Date_Segment_Range_Avg
0 2016-10-11 A 0 0 5 1.6 2.500000 5.00
1 2016-10-11 B 0 1 0 1.6 2.500000 0.75
2 2016-10-11 C 1 1 1 1.6 1.000000 0.75
3 2016-10-11 D 1 1 0 1.6 1.000000 0.75
4 2016-10-11 E 1 1 2 1.6 1.000000 0.75
5 2016-10-12 F 0 1 6 7.8 9.666667 6.00
6 2016-10-12 G 0 2 0 7.8 9.666667 11.50
7 2016-10-12 H 0 2 23 7.8 9.666667 11.50
8 2016-10-12 I 1 3 5 7.8 5.000000 5.00
9 2016-10-12 J 1 3 5 7.8 5.000000 5.00
Another option is to use transform, and avoid the multiple merges:
# reusing your code
group_bys = {
"Date_Range_Avg": ["Date"],
"Date_Sector_Range_Avg": ["Date", "Sector"],
"Date_Segment_Range_Avg": ["Date", "Segment"]
}
tmp = {key : df.groupby(columns)["Range"].transform('mean')
for key, columns in group_bys.items()
}
df.assign(**tmp)

How do I remove duplicate rows based on all columns?

I merged 3 dataframes mrna, meth, and cna. I want to remove any duplicate rows that either have the same Hugo_Symbol column value or have the same values across all the remaining columns (i.e., columns starting with "TCGA-").
import re
import pandas as pd
dfs = [mrna, meth, cna]
common = pd.concat(dfs, join='inner')
common["Hugo_Symbol"] = [re.sub(r'\|.+', "", str(i)) for i in common["Hugo_Symbol"]] # In Hugo_Symbol column, remove everything after the pipe except newline
common = common.drop_duplicates(subset="Hugo_Symbol") # Remove row if Hugo_Symbol is the same
common
A snippet of the dataframe:
common_dict = common.iloc[1:10,1:10].to_dict()
common_dict
{'TCGA-02-0001-01': {1: -0.9099,
2: -2.3351,
3: 0.2216,
4: 0.6798,
5: -2.48,
6: 0.7912,
7: -1.4578,
8: -3.8009,
9: 3.4868},
'TCGA-02-0003-01': {1: 0.0896,
2: -1.17,
3: 0.1255,
4: 0.2374,
5: -3.2629,
6: 1.2846,
7: -1.474,
8: -2.9891,
9: -0.1511},
'TCGA-02-0007-01': {1: -5.6511,
2: -2.8365,
3: 2.0026,
4: -0.6326,
5: -1.3741,
6: -3.437,
7: -1.047,
8: -4.185,
9: 2.1816},
'TCGA-02-0009-01': {1: 0.9795,
2: -0.5464,
3: 1.1115,
4: -0.2128,
5: -3.3461,
6: 1.3576,
7: -1.0782,
8: -3.4734,
9: -0.8985},
'TCGA-02-0010-01': {1: -0.7122,
2: 0.7651,
3: 2.4691,
4: 0.7222,
5: -1.7822,
6: -3.3403,
7: -1.6397,
8: 0.3424,
9: 1.7337},
'TCGA-02-0011-01': {1: -6.8649,
2: -0.4178,
3: 0.1858,
4: -0.0863,
5: -2.9486,
6: -3.843,
7: -0.9275,
8: -5.0462,
9: 0.9702},
'TCGA-02-0014-01': {1: -1.9439,
2: 0.3727,
3: -0.5368,
4: -0.1501,
5: 0.8977,
6: 0.5138,
7: -1.688,
8: 0.1778,
9: 1.7975},
'TCGA-02-0021-01': {1: -0.8761,
2: -0.2532,
3: 2.0574,
4: -0.9708,
5: -1.0883,
6: -1.0698,
7: -0.8684,
8: -5.3854,
9: 1.2353},
'TCGA-02-0024-01': {1: 1.6237,
2: -0.717,
3: -0.4517,
4: -0.5276,
5: -2.3993,
6: -4.3485,
7: 0.0811,
8: -2.5217,
9: 0.1883}}
Now, I want to drop any duplicate rows by subsetting all the columns beginning with "TCGA-" (i.e., all except the Hugo_Symbol column). How do I do it?
common = common.drop_duplicates(subset=[1:,], keep="first", inplace=False, ignore_index=False)
Here is the example data to reproduce the problem. It needed some changes to the data from dict of OP to have duplicates.
df = pd.DataFrame({
'Hugo_Symbol': ['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'ABC', 'GHI', 'XYZ', 'DEF', 'BBB', 'CCC'],
'TCGA-02-0001-01': [-0.9099, -2.3351, 0.2216, 0.6798, -2.48, 0.7912, -1.4578, -3.8009, 3.4868, -2.48, 3.4868],
'TCGA-02-0003-01': [0.0896, -1.17, 0.1255, 0.2374, -3.2629, 1.2846, -1.474, -2.9891, -0.1511, -3.2629, -0.1511],
'TCGA-02-0007-01': [-5.6511, -2.8365, 2.0026, -0.6326, -1.3741, -3.437, -1.047, -4.185, 2.1816, -1.3741, 2.1816],
'TCGA-02-0009-01': [0.9795, -0.5464, 1.1115, -0.2128, -3.3461, 1.3576, -1.0782, -3.4734, -0.8985, -3.3461, -0.8985],
'TCGA-02-0010-01': [-0.7122, 0.7651, 2.4691, 0.7222, -1.7822, -3.3403, -1.6397, 0.3424, 1.7337, -1.7822, 1.7337],
'TCGA-02-0011-01': [-6.8649, -0.4178, 0.1858, -0.0863, -2.9486, -3.843, -0.9275, -5.0462, 0.9702, -2.9486, 0.9702],
'TCGA-02-0014-01': [-1.9439, 0.3727, -0.5368, -0.1501, 0.8977, 0.5138, -1.688, 0.1778, 1.7975, 0.8977, 1.7975],
'TCGA-02-0021-01': [-0.8761, -0.2532, 2.0574, -0.9708, -1.0883, -1.0698, -0.8684, -5.3854, 1.2353, -1.0883, 1.2353],
'TCGA-02-0024-01': [1.6237, -0.717, -0.4517, -0.5276, -2.3993, -4.3485, 0.0811, -2.5217, 0.1883, -2.3993, 0.1883]})
We have some duplicates in the "Hugo_Symbol" column and the last two rows (different hugo symbol) have exactly same data as the rows at position 5 and 9.
With the ideas of #Code Different I created a mask and used it on the DataFrame.
tcga_cols = df.columns[df.columns.str.startswith("TCGA-")].to_list()
mask = df.duplicated("Hugo_Symbol") | df.duplicated(tcga_cols)
print(mask)
False False False False False True True False True True True
result = df[~mask]
print(result)
Hugo_Symbol TCGA-02-0001-01 TCGA-02-0003-01 TCGA-02-0007-01 TCGA-02-0009-01 TCGA-02-0010-01 TCGA-02-0011-01 TCGA-02-0014-01 TCGA-02-0021-01 TCGA-02-0024-01
0 ABC -0.9099 0.0896 -5.6511 0.9795 -0.7122 -6.8649 -1.9439 -0.8761 1.6237
1 DEF -2.3351 -1.1700 -2.8365 -0.5464 0.7651 -0.4178 0.3727 -0.2532 -0.7170
2 GHI 0.2216 0.1255 2.0026 1.1115 2.4691 0.1858 -0.5368 2.0574 -0.4517
3 JKL 0.6798 0.2374 -0.6326 -0.2128 0.7222 -0.0863 -0.1501 -0.9708 -0.5276
4 MNO -2.4800 -3.2629 -1.3741 -3.3461 -1.7822 -2.9486 0.8977 -1.0883 -2.3993
7 XYZ -3.8009 -2.9891 -4.1850 -3.4734 0.3424 -5.0462 0.1778 -5.3854 -2.5217
As you can see result only contains rows where the mask was False
EDIT:
I tested the logic on several cases and it seems to work just fine (for this example data) so I guess your real data has some format which causes problems.
For example if your columns have leading whitespaces str.startswith won't work properly.
As a workaround, do ALL your columns start with TCGA except the "hugo" column? Then you could just replace the tcga_cols line with:
tcga_cols = df.columns[1:]

Match coloring of slices for series of pandas pie charts

I have a pandas dataframe that looks like this :
df = pd.DataFrame( {'Judge': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3}, 'Category': {0: 'A', 1: 'B', 2: 'C', 3: 'A', 4: 'B', 5: 'C', 6: 'A', 7: 'B', 8: 'C'}, 'Rating': {0: 'Excellent', 1: 'Very Good', 2: 'Good', 3: 'Very Good', 4: 'Very Good', 5: 'Very Good', 6: 'Excellent', 7: 'Very Good', 8: 'Excellent'}} )
I'm plotting a pie chart to show the ratings of each judge like this:
grouped = df.groupby('Judge')
for group in grouped:
group[1].Rating.value_counts().plot(kind='pie', autopct="%1.1f%%")
plt.legend(group[1].Rating.value_counts().index.values, loc="upper right")
plt.title('Judge ' + str(group[0]))
plt.axis('equal')
plt.ylabel('')
plt.tight_layout()
plt.show()
Unfortunately, the colors of the slices are different for each judge. For example, Judge 1's "Excellent" slice is blue where Judge 2's "Very Good" slice is blue.
How can enforce slice color consistency from plot to plot?
I think you can unstack and plot:
axes = (df.groupby('Judge').Rating.value_counts()
.unstack('Judge')
.plot.pie(subplots=True, figsize=(6,6), layout=(2,2))
)
# do some thing with the axes
for ax in axes.ravel():
pass
Output:

pivoting pandas df - turn column values into column names

I have a df:
pd.DataFrame({'time_period': {0: pd.Timestamp('2017-04-01 00:00:00'),
1: pd.Timestamp('2017-04-01 00:00:00'),
2: pd.Timestamp('2017-03-01 00:00:00'),
3: pd.Timestamp('2017-03-01 00:00:00')},
'cost1': {0: 142.62999999999994,
1: 131.97000000000003,
2: 142.62999999999994,
3: 131.97000000000003},
'revenue1': {0: 56,
1: 113.14999999999998,
2: 177,
3: 99},
'cost2': {0: 309.85000000000002,
1: 258.25,
2: 309.85000000000002,
3: 258.25},
'revenue2': {0: 4.5,
1: 299.63,2: 309.85,
3: 258.25},
'City': {0: 'Boston',
1: 'New York',2: 'Boston',
3: 'New York'}})
I want to re-structure this df such that for revenue and cost separately:
pd.DataFrame({'City': {0: 'Boston', 1: 'New York'},
'Apr-17 revenue1': {0: 56.0, 1: 113.15000000000001},
'Apr-17 revenue2': {0: 4.5, 1: 299.63},
'Mar-17 revenue1': {0: 177, 1: 99},
'Mar-17 revenue2': {0: 309.85000000000002, 1: 258.25}})
And a similar df for costs.
Basically, turn the time_period column values into column names like Apr-17, Mar-17 with revenue/cost string as appropriate and values of revenue1/revenue2 and cost1/cost2 respectively.
I've been playing around with pd.pivot_table with some success but I can't get exactly what I want.
Use set_index and unstack
import datetime as dt
df['time_period'] = df['time_period'].apply(lambda x: dt.datetime.strftime(x,'%b-%Y'))
df = df.set_index(['A', 'B', 'time_period'])[['revenue1', 'revenue2']].unstack().reset_index()
df.columns = df.columns.map(' '.join)
A B revenue1 Apr-2017 revenue1 Mar-2017 revenue2 Apr-2017 revenue2 Mar-2017
0 Boston Orlando 56.00 177.0 4.50 309.85
1 New York Dallas 113.15 99.0 299.63 258.25