How can I combine 2 levels of an index into one in pandas?

I am trying to extract only the relevant information from a dataframe. My data looks like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID': {0: 'id1', 1: 'id1', 2: 'id1'},
                   'EM': {0: 'met1', 1: 'met2', 2: 'met3'},
                   'met1_AVG': {0: 0.38, 1: np.nan, 2: np.nan},
                   'met2_AVG': {0: np.nan, 1: 0.2, 2: np.nan},
                   'met3_AVG': {0: np.nan, 1: np.nan, 2: 0.58},
                   'score': {0: 89, 1: 89, 2: 89}})
My desired output is:
    ID  met1_AVG  met2_AVG  met3_AVG  score
0  id1      0.38       0.2      0.58     89
Please find my code below. I would really appreciate it if someone could help me out. Thank you in advance for your time.
df_melted = df.melt(id_vars=['ID','EM','score']).dropna(subset=['value'])
df_pivoted = pd.pivot_table(data=df_melted,index=['ID','score'],columns=['variable'])
df_ready = df_pivoted.reset_index()
df_ready

Assuming the score is always the same, you can use pandas.DataFrame.groupby.first:
df.drop("EM",axis=1).groupby("ID", as_index=False).first()
Output:
    ID  met1_AVG  met2_AVG  met3_AVG  score
0  id1      0.38       0.2      0.58     89
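
If you would rather keep the melt/pivot approach from the question, the two-level columns produced by pivot_table can be flattened before resetting the index. A minimal sketch, reusing the df_melted variable from the question (droplevel is one way to flatten, not the only one):
df_pivoted = pd.pivot_table(data=df_melted, index=['ID', 'score'], columns=['variable'])
# drop the outer 'value' level so only met1_AVG/met2_AVG/met3_AVG remain
df_pivoted.columns = df_pivoted.columns.droplevel(0)
df_ready = df_pivoted.reset_index()
df_ready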

Related

How to do a Pandas comparison with keep_shape=False, but maintain the relationship with the username column

I'm trying to run a Pandas dataframe comparison, df.compare(df2), that returns only the differences between two dataframes. With keep_shape=False, only rows with differences are displayed along with their indexes, but the relationship with the username column (the first column) is lost.
How do I keep the name column and use keep_shape=False at the same time, so I can identify the username and the changes together?
Example:
import pandas as pd
df_1 = pd.read_excel('../output/spreadsheet_Jan_1.xlsx')
df_2 = pd.read_excel('../output/spreadsheet_Feb_1.xlsx')
df_compare = df_1.compare(df_2, keep_equal=True, keep_shape=False)
I guess the image isn't showing... it's a spreadsheet with the df.compare() result, showing the averages columns with the 'self' and 'other' columns split below them. The index is on the left-hand side in the keep_shape=False format (e.g. 1, 6, 7, 8, 9, 11, etc.).
How do I match the usernames which is the first column along the left side with the associated indexes?
Thanks in advance.
Here is an example of one simple way to do it:
import pandas as pd
df_1 = pd.DataFrame(
    {
        "fruit": {0: "banana", 1: "orange", 2: "apple", 3: "celery"},
        "quantity": {0: 22, 1: 8, 2: 7, 3: 10},
    }
)
df_2 = pd.DataFrame(
    {
        "fruit": {0: "banana", 1: "orange", 2: "apple", 3: "celery"},
        "quantity": {0: 27, 1: 8, 2: 8, 3: 10},
    }
)
In df_compare, we want to show the fruit names for which values are different in df_1 and df_2 (that is to say 'banana' and 'apple'):
df_compare = (
    df_1
    .compare(df_2, keep_equal=True, keep_shape=False)
    .pipe(lambda df_: df_.set_index(df_1.loc[df_.index, "fruit"]))
    .reset_index()
)
print(df_compare)
# Output
    fruit quantity
           self other
0  banana     22    27
1   apple      7     8
Thanks Laurent for the dataset example:
df_1 = pd.DataFrame({"fruit": {0: "banana", 1: "orange", 2: "apple", 3: "celery"},
                     "quantity": {0: 22, 1: 8, 2: 7, 3: 10}})
df_2 = pd.DataFrame({"fruit": {0: "banana", 1: "orange", 2: "apple", 3: "celery"},
                     "quantity": {0: 27, 1: 8, 2: 8, 3: 10}})
df_compare = pd.concat([df_1['fruit'],
                        df_1.compare(df_2, keep_equal=True, keep_shape=False)],
                       axis=1).dropna()
print(df_compare)
print(df_compare)
    fruit  (quantity, self)  (quantity, other)
0  banana              22.0               27.0
2   apple               7.0                8.0
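Another option (a sketch, not part of either answer above): set the name column as the index before comparing, so the labels survive the comparison directly:
df_compare = (
    df_1.set_index("fruit")
        .compare(df_2.set_index("fruit"), keep_equal=True, keep_shape=False)
)
print(df_compare)
#        quantity
#            self other
# fruit
# banana       22    27
# apple         7     8
Note this requires the name column to be unique and aligned in both frames, since compare only accepts identically labeled objects.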

Two Seaborn plots on one twinx figure become distorted

I saw variations of this question asked several times, but I don't think any of the answers I saw fixes it (other than "use matplotlib for combo-plots", but I'd appreciate help understanding why I should do that).
import pandas as pd
import seaborn as sns

df1 = pd.DataFrame({'height': {0: 161, 1: 173, 2: 168, 3: 185, 4: 163},
                    'year': {0: 2015, 1: 2016, 2: 2017, 3: 2018, 4: 2019}})
df2 = pd.DataFrame({'year': {0: 2015, 1: 2015, 2: 2016, 3: 2016, 4: 2017,
                             5: 2017, 6: 2018, 7: 2018, 8: 2019, 9: 2019},
                    'weight': {0: 64, 1: 81, 2: 82, 3: 83, 4: 66,
                               5: 71, 6: 84, 7: 91, 8: 99, 9: 94},
                    'sex': {0: 'M', 1: 'F', 2: 'M', 3: 'F', 4: 'M',
                            5: 'F', 6: 'M', 7: 'F', 8: 'M', 9: 'F'}})
ax = sns.barplot(x='year', y='weight', hue='sex', data=df2)
ax2 = ax.twinx()
sns.lineplot(x='year', y='height', data=df1, ax=ax2)
I expected this to be a textbook example of a combo plot, but the result is distorted:
Why is that? Shouldn't the X axes simply converge and make a nice plot? Of course, each plot renders fine individually.
If you plot them separately and check their xlim, you can see that seaborn shifts the bar plot's x values down to start at 0 (the years are displayed only via the xticklabels):
ax = sns.barplot(x='year', y='weight', hue='sex', data=df2)
print(ax.get_xlim())
print(ax.get_xticklabels())
# (-0.5, 4.5)
# [Text(0, 0, '2015'), Text(1, 0, '2016'), Text(2, 0, '2017'), Text(3, 0, '2018'), Text(4, 0, '2019')]
The line plot does not shift the x values and plots the years in the 2000s range:
ax = sns.lineplot(x='year', y='height', data=df1)
print(ax.get_xlim())
# (2014.8, 2019.2)
One workaround is to use reset_index() on the line plot's data and use x='index' to manually shift its x values to 0 to align with the bar plot:
g = sns.lineplot(x='index', y='height', data=df1.reset_index(), ax=ax2)
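Putting the workaround together, a minimal end-to-end sketch (assuming the df1 and df2 from the question):
import matplotlib.pyplot as plt
import seaborn as sns

ax = sns.barplot(x='year', y='weight', hue='sex', data=df2)
ax2 = ax.twinx()
# reset_index() yields positions 0..4, which line up with the bar plot's categorical slots
sns.lineplot(x='index', y='height', data=df1.reset_index(), ax=ax2)
plt.show()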

pandas groupby + multiple aggregate/apply with multiple columns

I have this minimal sample data:
import numpy as np
import pandas as pd
from pandas import Timestamp

data = pd.DataFrame({'Client': {0: "Client_1", 1: "Client_2", 2: "Client_2", 3: "Client_3",
                                4: "Client_3", 5: "Client_3", 6: "Client_4", 7: "Client_4"},
                     'Id_Card': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8},
                     'Type': {0: 'A', 1: 'B', 2: 'C', 3: np.nan, 4: 'A', 5: 'B', 6: np.nan, 7: 'B'},
                     'Loc': {0: 'ADW', 1: 'ZCW', 2: 'EWC', 3: "VWQ", 4: "OKS", 5: 'EQW', 6: "PKA", 7: 'CSA'},
                     'Amount': {0: 10.0, 1: 15.0, 2: 17.0, 3: 32.0, 4: np.nan, 5: 51.0, 6: 38.0, 7: -20.0},
                     'Net': {0: 30.0, 1: 42.0, 2: -10.0, 3: 15.0, 4: 98, 5: np.nan, 6: 23.0, 7: -10.0},
                     'Date': {0: Timestamp('2018-09-29'), 1: Timestamp('1996-08-02'), 2: np.nan,
                              3: Timestamp('2020-11-02'), 4: Timestamp('2008-12-27'),
                              5: Timestamp('2004-12-21'), 6: np.nan, 7: Timestamp('2010-08-25')}})
data
data
I'm trying to aggregate this data grouping by the Client column: count Id_Card per client; concatenate the Type and Loc values separated by ; (e.g. A;B and ZCW;EWC for Client_2, NOT A;ZCW B;EWC); sum Amount and Net per client; and take the minimum Date per client. However, I'm facing some problems:
(1) These functions work perfectly individually, but I can't find a way to mix the aggregate function and the apply function:
Code example:
data.groupby("Client").agg({"Id_Card": "count", "Amount":"sum", "Date": "min"})
data.groupby('Client')['Loc'].apply(';'.join).reset_index()
(2) The apply function doesn't work for columns with missing values:
Code example:
data.groupby('Client')['Type'].apply(';'.join).reset_index()
TypeError: sequence item 0: expected str instance, float found
(3) The aggregate and apply functions don't let me pass multiple columns for one transformation:
Code example:
cols_to_sum = ["Amount", "Net"]
# fails: a list is unhashable and can't be used as a dict key
data.groupby("Client").agg({"Id_Card": "count", cols_to_sum: "sum", "Date": "min"})
cols_to_join = ["Type", "Loc"]
# runs, but joins the column labels rather than the values
data.groupby('Client')[cols_to_join].apply(';'.join).reset_index()
In (3) I only put Amount and Net, and I could list them separately in the aggregate function, but I'm looking for a more efficient way, as I'm working with plenty of columns.
The expected output is the same dataframe, aggregated according to the conditions outlined at the beginning.
For the join, you have to filter out the NaN values first. Since the join has to be applied in two places, I have created a separate function:
def join_non_nan_values(elements):
    # NaN != NaN, so "elem == elem" filters out missing values
    return ";".join([elem for elem in elements if elem == elem])

data.groupby("Client").agg({"Id_Card": "count", "Type": join_non_nan_values,
                            "Loc": join_non_nan_values, "Amount": "sum",
                            "Net": "sum", "Date": "min"})
Go step by step, and prepare three different data frames to merge later.
The first dataframe is for the simple functions like count, sum, and min:
df1 = data.groupby("Client").agg({"Id_Card": "count", "Amount": "sum",
                                  "Net": "sum", "Date": "min"}).reset_index()
Next you deal with the Type and Loc joins; fillna takes care of the NaN values:
df2 = data[['Client', 'Type']].fillna('').groupby("Client")['Type'].apply(';'.join).reset_index()
df3 = data[['Client', 'Loc']].fillna('').groupby("Client")['Loc'].apply(';'.join).reset_index()
And finally you merge the results together:
data_new = df1.merge(df2, on='Client').merge(df3, on='Client')
data_new output:
     Client  Id_Card  Amount    Net       Date  Type          Loc
0  Client_1        1    10.0   30.0 2018-09-29     A          ADW
1  Client_2        2    32.0   32.0 1996-08-02   B;C      ZCW;EWC
2  Client_3        3    83.0  113.0 2004-12-21  ;A;B  VWQ;OKS;EQW
3  Client_4        2    18.0   13.0 2010-08-25    ;B      PKA;CSA

Match coloring of slices for series of pandas pie charts

I have a pandas dataframe that looks like this:
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'Judge': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3},
                   'Category': {0: 'A', 1: 'B', 2: 'C', 3: 'A', 4: 'B', 5: 'C', 6: 'A', 7: 'B', 8: 'C'},
                   'Rating': {0: 'Excellent', 1: 'Very Good', 2: 'Good', 3: 'Very Good', 4: 'Very Good',
                              5: 'Very Good', 6: 'Excellent', 7: 'Very Good', 8: 'Excellent'}})
I'm plotting a pie chart to show the ratings of each judge like this:
grouped = df.groupby('Judge')
for group in grouped:
    group[1].Rating.value_counts().plot(kind='pie', autopct="%1.1f%%")
    plt.legend(group[1].Rating.value_counts().index.values, loc="upper right")
    plt.title('Judge ' + str(group[0]))
    plt.axis('equal')
    plt.ylabel('')
    plt.tight_layout()
    plt.show()
Unfortunately, the colors of the slices differ from judge to judge. For example, Judge 1's "Excellent" slice is blue, while Judge 2's "Very Good" slice is blue.
How can I enforce slice-color consistency from plot to plot?
I think you can unstack and plot:
axes = (df.groupby('Judge').Rating.value_counts()
          .unstack('Judge')
          .plot.pie(subplots=True, figsize=(6, 6), layout=(2, 2))
        )
# do something with the axes
for ax in axes.ravel():
    pass
Output:
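If you would rather keep the one-figure-per-judge loop from the question, another way is to pin each rating to a fixed color explicitly. A sketch reusing df and the matplotlib import from the question (the specific colors are arbitrary choices, not from the original):
# a fixed rating -> color mapping keeps slice colors consistent across figures
color_map = {'Excellent': 'tab:blue', 'Very Good': 'tab:orange', 'Good': 'tab:green'}
for judge, group in df.groupby('Judge'):
    counts = group.Rating.value_counts()
    counts.plot(kind='pie', autopct="%1.1f%%",
                colors=[color_map[r] for r in counts.index])
    plt.title('Judge ' + str(judge))
    plt.ylabel('')
    plt.show()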

pivoting pandas df - turn column values into column names

I have a df:
import pandas as pd

df = pd.DataFrame({'time_period': {0: pd.Timestamp('2017-04-01 00:00:00'),
                                   1: pd.Timestamp('2017-04-01 00:00:00'),
                                   2: pd.Timestamp('2017-03-01 00:00:00'),
                                   3: pd.Timestamp('2017-03-01 00:00:00')},
                   'cost1': {0: 142.62999999999994, 1: 131.97000000000003,
                             2: 142.62999999999994, 3: 131.97000000000003},
                   'revenue1': {0: 56, 1: 113.14999999999998, 2: 177, 3: 99},
                   'cost2': {0: 309.85000000000002, 1: 258.25,
                             2: 309.85000000000002, 3: 258.25},
                   'revenue2': {0: 4.5, 1: 299.63, 2: 309.85, 3: 258.25},
                   'City': {0: 'Boston', 1: 'New York', 2: 'Boston', 3: 'New York'}})
I want to restructure this df, separately for revenue and cost. For revenue, the desired output is:
pd.DataFrame({'City': {0: 'Boston', 1: 'New York'},
              'Apr-17 revenue1': {0: 56.0, 1: 113.15000000000001},
              'Apr-17 revenue2': {0: 4.5, 1: 299.63},
              'Mar-17 revenue1': {0: 177, 1: 99},
              'Mar-17 revenue2': {0: 309.85000000000002, 1: 258.25}})
And a similar df for costs.
Basically, turn the time_period column values into column names like Apr-17 and Mar-17, combined with the revenue/cost column names as appropriate, and fill them with the values of revenue1/revenue2 and cost1/cost2 respectively.
I've been playing around with pd.pivot_table with some success but I can't get exactly what I want.
Use set_index and unstack:
import datetime as dt

df['time_period'] = df['time_period'].apply(lambda x: dt.datetime.strftime(x, '%b-%Y'))
df = df.set_index(['City', 'time_period'])[['revenue1', 'revenue2']].unstack().reset_index()
df.columns = df.columns.map(' '.join).str.strip()
Output:
       City  revenue1 Apr-2017  revenue1 Mar-2017  revenue2 Apr-2017  revenue2 Mar-2017
0    Boston              56.00              177.0               4.50             309.85
1  New York             113.15               99.0             299.63             258.25
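The cost columns can be reshaped the same way. A sketch mirroring the revenue code above (note: run it on the original df, after the strftime conversion but before df is overwritten by the revenue reshape):
costs = df.set_index(['City', 'time_period'])[['cost1', 'cost2']].unstack().reset_index()
costs.columns = costs.columns.map(' '.join).str.strip()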