Python pandas DataFrame issue: how can I conditionally insert rows and get the subtotal of the numeric column? Already solved it, but any improvement?

Whenever the value of Function changes, I want an empty row inserted below it, with the subtotal of Amount placed in that empty row.
How can I improve the code? Is there a better or shorter way?
import pandas as pd
import numpy as np

insertROW_df = pd.read_clipboard()
insertROW_df
# 'match' is True where the next row has the same Function
insertROW_df['match'] = insertROW_df['Function'].eq(insertROW_df['Function'].shift(-1))
insertROW_df['insert_row_below?'] = insertROW_df['match'].apply(lambda x: 'Yes' if x == False else 'No')
# index positions of the last row of each Function group
index_changed_row = insertROW_df.index[insertROW_df['match'] == False]
type(insertROW_df)
# blank rows to insert, reusing the index of the last row of each group
line = pd.DataFrame({"Function": np.nan, "Budget": np.nan, "Amount": np.nan, "match": np.nan, "insert_row_below?": np.nan}, index=index_changed_row)
# note: DataFrame.append is deprecated in newer pandas; pd.concat is the replacement
df = insertROW_df.append(line, ignore_index=False)
df = df.sort_index().reset_index(drop=True)
# per-Function subtotal, forward-filled into the blank rows
df['Total'] = df.groupby(['Function'])['Amount'].transform('sum')
df['Total'] = df['Total'].fillna(method="ffill")
df['Amount'] = df['Amount'].fillna(df['Total'])
print(df)
print(df[['Function', 'Budget', 'Amount']])
df.to_excel(r'Z:\Claiming FY17, 18, 19, 20, 21 & 22\November 2021\Pythonic approach\InsertBlankRows.xlsx', index=False, header=True)
DATAFRAME = pd.DataFrame({'Function':['AAA', 'AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB', 'CCC', 'CCC'],
'Budget': [550,550,550,680,550,550,550,860,860,],
'Amount': [14850,8640,2150,3210,5540,6660,2210,5555,5595,]
})
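One possibly shorter route (a sketch only, assuming the sample DATAFRAME above; add_subtotals is a hypothetical helper name): append a subtotal row after each Function group and concatenate, which avoids the match/insert_row_below? helper columns.
import pandas as pd

def add_subtotals(df):
    pieces = []
    for func, grp in df.groupby('Function', sort=False):
        pieces.append(grp)
        # blank row carrying only the group's Amount subtotal
        pieces.append(pd.DataFrame({'Function': [pd.NA],
                                    'Budget': [pd.NA],
                                    'Amount': [grp['Amount'].sum()]}))
    return pd.concat(pieces, ignore_index=True)

print(add_subtotals(DATAFRAME))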

Related

Inputting first and last name to output a value in Pandas Dataframe

I am trying to create an input function that returns a value for the corresponding first and last name.
For this example I'd like to be able to enter "Emily" and "Bell" and return "attempts: 2".
Here's my code so far:
import pandas as pd
import numpy as np
data = {
'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
'lastname': ['Thompson','Wu', 'Downs','Hunter','Bell','Cisneros', 'Becker', 'Sims', 'Gallegos', 'Horne'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no',
'yes', 'yes', 'no', 'no', 'yes']
}
data
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
df
fname = input()
lname = input()
print(f"{fname} {lname}'s number of attempts: {???}")
I thought there would be specific documentation for this, but I can't find any in the pandas DataFrame documentation. I'm assuming it's pretty simple, but I can't find it.
fname = input()
lname = input()
# use loc to filter the row and then capture the value from attempts columns
print(f"{fname} {lname}'s number of attempts:{df.loc[df['name'].eq(fname) & df['lastname'].eq(lname)]['attempts'].squeeze()}")
Emily
Bell
Emily Bell's number of attempts:2
Alternatively, to avoid a mismatch due to case:
fname = input().lower()
lname = input().lower()
print(f"{fname} {lname}'s number of attempts:{df.loc[(df['name'].str.lower() == fname) & (df['lastname'].str.lower() == lname)]['attempts'].squeeze()}")
emily
BELL
emily bell's number of attempts:2
Try this:
df[(df['name'] == fname) & (df['lastname'] == lname)]['attempts'].squeeze()
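As a follow-up sketch of my own (not from the answers above; attempts_for is a hypothetical helper name): .squeeze() returns an empty Series when nothing matches, so a small guard can return a default instead.
def attempts_for(df, fname, lname):
    # case-insensitive boolean-mask lookup; returns None when no row matches
    hits = df.loc[df['name'].str.lower().eq(fname.lower()) &
                  df['lastname'].str.lower().eq(lname.lower()), 'attempts']
    return hits.iloc[0] if len(hits) else None

print(attempts_for(df, 'Emily', 'Bell'))  # 2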

Highlight distinct cells based on a different cell in the same row in a multiindex pivot table

I have created a pivot table where the column headers have several levels. This is a simplified version:
import pandas as pd

index = ['Person 1', 'Person 2', 'Person 3']
columns = [
    ["condition 1", "condition 1", "condition 1", "condition 2", "condition 2", "condition 2"],
    ["Mean", "SD", "n", "Mean", "SD", "n"],
]
data = [
    [100, 10, 3, 200, 12, 5],
    [500, 20, 4, 750, 6, 6],
    [1000, 30, 5, None, None, None],
]
df = pd.DataFrame(data, index=index, columns=columns)
df
Now I would like to highlight the adjacent cells next to SD if SD > 10. This is what it should look like:
I found this answer, but couldn't make it work for MultiIndex columns.
Thanks for any help.
Use Styler.apply with a custom function - to select the SD columns use DataFrame.xs, and to repeat the boolean mask across the column levels use DataFrame.reindex:
def highlight(x):
    c1 = 'background-color: red'
    # True where a condition group's SD column exceeds 10
    mask = x.xs('SD', axis=1, level=1).gt(10)
    # DataFrame with the same index and columns as the original, filled with empty strings
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    # repeat the boolean across each top-level group and set the style where True
    return df1.mask(mask.reindex(x.columns, level=0, axis=1), c1)

df.style.apply(highlight, axis=None)
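If the highlighting needs to leave the notebook, Styler.to_excel can export it (a sketch, assuming openpyxl is installed; the file name is arbitrary):
styled = df.style.apply(highlight, axis=None)
styled.to_excel('highlighted.xlsx', engine='openpyxl')  # background colors are translated to cell fills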

pandas sort values in plot with groupby

I'm working on this dataset and I have this code for the plot below:
x = df2.groupby(by=['LearnCode', 'Age']).size()
chart = x.unstack()
axs = chart.plot.barh(subplots=True, figsize=(20, 50), layout=(9, 1), legend=False, title=chart.columns.tolist())
ax_flat = axs.flatten()
for ax in ax_flat:
    ax.yaxis.label.set_visible(False)
How can I sort the values within each subplot individually?
You can do it, but you'll probably have to plot each subplot separately.
import pandas as pd
import matplotlib.pyplot as plt

df2 = pd.DataFrame({'LearnCode': ['A', 'B', 'B', 'B', 'B', 'A', 'C', 'C', 'B', 'A', 'C', 'C', 'B'],
                    'Age': [18, 18, 18, 18, 18, 18, 18, 24, 24, 24, 24, 24, 24]})
x = df2.groupby(by=['LearnCode', 'Age']).size()
chart = x.unstack()
f, axs = plt.subplots(nrows=len(chart.columns), ncols=1, figsize=(20, 10), sharex='col')
# give each subplot a different color
colors = plt.rcParams["axes.prop_cycle"]()
for i, age in enumerate(chart):
    chart[age].sort_values().plot.barh(title=age,
                                       ax=axs[i],
                                       color=next(colors)["color"],
                                       xlabel='')
PS: For me, it's better to have the original graph than a graph like this (it's much easier to track differences between groups).

Sort DataFrame asc/desc based on column value

I have this DataFrame
df = pd.DataFrame({'A': [100, 100, 300, 200, 200, 200], 'B': [60, 55, 12, 32, 15, 44], 'C': ['x', 'x', 'y', 'y', 'y', 'y']})
and I want to sort it by columns "A" and "B". "A" is always ascending. "B" should be ascending when "C" == "x" and descending when "C" == "y". So it would end up like this:
df_sorted = pd.DataFrame({'A': [100, 100, 200, 200, 200, 300], 'B': [55, 60, 44, 32, 15, 12], 'C': ['x', 'x', 'y', 'y', 'y', 'y']})
I would filter the DataFrame into two DataFrames based on the value of C:
df_x = df.loc[df['C'] == 'x'].copy()
df_y = df.loc[df['C'] == 'y'].copy()
and then use "sort_values" like so:
df_x.sort_values(by=['A', 'B'], inplace=True)
Sorting df_y is different, since you want one column ascending and the other descending. Because a stable sort preserves the existing order of ties, sort by "B" descending first and then re-sort by "A" with a stable sort:
df_y.sort_values(by=['B'], inplace=True, ascending=False)
df_y.sort_values(by=['A'], inplace=True, kind='stable')
You can then concatenate the DataFrames back together; if you sort again by "A" afterwards, use a stable sort so the order within each "A" group remains, as in the sketch below.
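Put together as a sketch (reusing df from above, with the stable-sort ordering just described):
import pandas as pd

df_x = df.loc[df['C'] == 'x'].sort_values(by=['A', 'B'])
df_y = (df.loc[df['C'] == 'y']
          .sort_values(by='B', ascending=False)
          .sort_values(by='A', kind='stable'))
# a stable sort keeps the B ordering within each value of A
df_sorted = pd.concat([df_x, df_y]).sort_values(by='A', kind='stable').reset_index(drop=True)
print(df_sorted)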
You can set up a temporary column that negates "B" whenever "C" is not "x" (here, when it equals "y"), sort everything ascending, and drop the helper column:
(df.assign(B2=df['B']*df['C'].eq('x').mul(2).sub(1))  # B2 = B where C == 'x', -B where C == 'y'
   .sort_values(by=['A', 'B2'])
   .drop('B2', axis=1)
)
def function1(dd: pd.DataFrame):
    # ascending on both columns for group 'x'; 'B' descending for the other groups
    return dd.sort_values(['A', 'B']) if dd.name == 'x' else dd.sort_values(['A', 'B'], ascending=[True, False])

df.groupby('C').apply(function1).reset_index(drop=True)
A B C
0 100 55 x
1 100 60 x
2 200 44 y
3 200 32 y
4 200 15 y
5 300 12 y

How to subtract one row from other rows in a grouped-by dataframe?

I've got this data frame with some 'init' values ('value', 'value2') that I want to subtract from the mid-term value 'mid' and the final value 'final' once I've grouped by ID.
import pandas as pd
df = pd.DataFrame({
'value': [100, 120, 130, 200, 190,210],
'value2': [2100, 2120, 2130, 2200, 2190,2210],
'ID': [1, 1, 1, 2, 2, 2],
'state': ['init','mid', 'final', 'init', 'mid', 'final'],
})
My attempt was to extract the index where I found 'init', 'mid' and 'final' and subtract the 'init' value from 'mid' and 'final' once I've grouped the values by 'ID':
group = df.groupby('ID')
group['diff_1_f'] = group['value'].iloc[group.index[group['state'] == 'final'] - group['value'].iloc[group.index[dfs['state'] == 'init']]]]
group['diff_2_f'] = group['value2'].iloc[group.index[group['state'] == 'final'] - group['value'].iloc[group.index[dfs['state'] == 'init']]]
group['diff_1_m'] = group['value'].iloc[group.index[group['state'] == 'mid'] - group['value'].iloc[group.index[dfs['state'] == 'init']]]
group['diff_2_m'] = group['value2'].iloc[group.index[group['state'] == 'mid'] - group['value'].iloc[group.index[dfs['state'] == 'init']]]
But of course it doesn't work. How can I obtain the following result:
df = pd.DataFrame({
'diff_value': [20, 30, -10,10],
'diff_value2': [20, 30, -10,10],
'ID': [ 1, 1, 2, 2],
'state': ['mid', 'final', 'mid', 'final'],
})
Also in its grouped form.
Use:
# column names to subtract
cols = ['value', 'value2']
# new column names created by the join
new = [c + '_diff' for c in cols]
# mask of rows that are not 'init'
m = df['state'].ne('init')
# join each ID's 'init' values onto its rows, then keep only the non-'init' rows
df1 = df.join(df[~m].set_index('ID')[cols], lsuffix='_diff', on='ID')[m]
# subtract with a numpy array (.values) to prevent index alignment
df1[new] = df1[new].sub(df1[cols].values)
# remove helper columns
df1 = df1.drop(cols, axis=1)
print(df1)
value_diff value2_diff ID state
1 20 20 1 mid
2 30 30 1 final
4 -10 -10 2 mid
5 10 10 2 final
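An alternative sketch of my own (not part of the answer above): since 'init' is the first row of each ID in the sample data, groupby(...).transform('first') can pull the baseline without a join; init and out are just illustrative names.
# df as defined in the question
init = df.groupby('ID')[['value', 'value2']].transform('first')  # assumes 'init' comes first in each ID
out = (df.assign(diff_value=df['value'] - init['value'],
                 diff_value2=df['value2'] - init['value2'])
         .loc[df['state'].ne('init'), ['diff_value', 'diff_value2', 'ID', 'state']])
print(out)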