Is there a way to set the order in pandas group boxplots? - pandas

Is there a way to sort the x-axis for a grouped box plot in pandas? It seems to be sorted in ascending order of the group labels, and I would like it to be ordered by some other column's values.

If you're grouping by a category, set it as an ordered categorical in the desired order.
See example below:
Here a dataset is created with three categories A, B and C where the mean value of each category is of the order C, B, A. The goal is to plot the categories in order of their mean value.
The key is converting the category to an ordered categorical data type with the desired order.
import numpy as np
import pandas as pd
# create some data
n = 50
a = pd.concat([pd.Series(['A']*n, name='cat'),
               pd.Series(np.random.normal(1, 1, n), name='val')],
              axis=1)
b = pd.concat([pd.Series(['B']*n, name='cat'),
               pd.Series(np.random.normal(.5, 1, n), name='val')],
              axis=1)
c = pd.concat([pd.Series(['C']*n, name='cat'),
               pd.Series(np.random.normal(0, 1, n), name='val')],
              axis=1)
df = pd.concat([a, b, c]).reset_index(drop=True)
# unordered boxplot
df.boxplot(column='val', by='cat')
# get order by mean
means = df.groupby('cat')['val'].mean().sort_values()
ordered_cats = means.index.values
# create categorical data type and set categorical column as new data type
cat_dtype = pd.CategoricalDtype(ordered_cats, ordered=True)
df['cat'] = df['cat'].astype(cat_dtype)
# ordered boxplot
df.boxplot(column='val', by='cat')

Using the solution posted by krieger, the short answer is to convert the category column to a CategoricalDtype like so:
ordered_list = ['dog', 'cat', 'mouse']
df['category'] = df['category'].astype(pd.CategoricalDtype(ordered_list, ordered=True))
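A minimal sketch of the effect, assuming a toy frame with hypothetical `category` and `weight` columns (the data is made up for illustration): once the column is an ordered categorical, grouping follows the declared category order instead of alphabetical order, which is the behaviour the answers above rely on for the boxplot.

```python
import pandas as pd

# Toy data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "category": ["cat", "dog", "mouse", "dog", "cat", "mouse"],
    "weight": [4.0, 20.0, 0.02, 25.0, 5.0, 0.03],
})

# Plain string keys are sorted alphabetically by groupby.
default_order = list(df.groupby("category")["weight"].mean().index)

# Declare the desired display order via an ordered categorical dtype.
ordered_list = ["dog", "cat", "mouse"]
df["category"] = df["category"].astype(
    pd.CategoricalDtype(ordered_list, ordered=True)
)

# Grouping now follows the declared category order.
custom_order = list(
    df.groupby("category", observed=True)["weight"].mean().index
)
print(default_order)
print(custom_order)
```

After the conversion, `df.boxplot(column='weight', by='category')` should draw the boxes in the same order as `custom_order`.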

Related

Select cells in a pandas DataFrame by a Series of its column labels

Say we have a DataFrame and a Series of its column labels, both (almost) sharing a common index:
df = pd.DataFrame(...)
s = df.idxmax(axis=1).shift(1)
How can I obtain cells given a series of columns, getting value from every row using a corresponding column label from the joined series? I'd imagine it would be:
values = df[s] # either
values = df.loc[s] # or
In my example I'd like to have values that are under biggest-in-their-row values (I'm doing a poor man's ML :) )
However I cannot find any interface selecting cells by series of columns. Any ideas folks?
Meanwhile I use this monstrous snippet:
def get_by_idxs(df: pd.DataFrame, idxs: pd.Series) -> pd.Series:
    ts_v_pairs = [
        (ts, row[row['idx']])
        for ts, row in df.join(idxs.rename('idx'), how='inner').iterrows()
        if isinstance(row['idx'], str)
    ]
    return pd.Series([v for ts, v in ts_v_pairs], index=[ts for ts, v in ts_v_pairs])
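I think you need a DataFrame lookup. DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so index the underlying NumPy array by position instead. Translate the row and column labels to positions so that each value comes from the row that v is indexed by (this assumes every label in v.index exists in df.index):
v = s.dropna()
rows = df.index.get_indexer(v.index)
cols = df.columns.get_indexer(v)
values = pd.Series(df.to_numpy()[rows, cols], index=v.index)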
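A small worked example of selecting one cell per row by column label, assuming a toy frame whose labels and numbers are made up for illustration. Rows and columns are both matched by label via `Index.get_indexer`, so the row dropped by `dropna()` cannot shift the result.

```python
import pandas as pd

# Toy frame; the labels and numbers are illustrative assumptions.
df = pd.DataFrame(
    {"a": [1, 9, 3], "b": [7, 2, 8], "c": [4, 5, 6]},
    index=["r0", "r1", "r2"],
)

# Column label of the previous row's maximum (first entry is missing).
s = df.idxmax(axis=1).shift(1)

v = s.dropna()
# Translate labels to integer positions, then index the NumPy array.
rows = df.index.get_indexer(v.index)
cols = df.columns.get_indexer(v)
values = pd.Series(df.to_numpy()[rows, cols], index=v.index)
print(values)
```

Here row `r1` picks column `b` (the maximum of `r0`) and row `r2` picks column `a` (the maximum of `r1`).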

Frequency of Value Column Given a Count Column

A dataframe has two columns ['Value', 'Count']. Value contains non-unique values. Count contains the number of occurrences of Value. I want to plot Value against the sum of Count, as a percentage of the total. Although this code works, I feel it doesn't utilize the power of pandas. What am I missing?
df = pd.DataFrame({'Value':[1,3,2,1],'Count':[5,2,1,4]})
gdf = df.groupby('Value')
sumdf = pd.DataFrame({'Value':k,'Sum':g['Count'].sum()} for k,g in gdf)
sumdf['Pct'] = sumdf['Sum'] / sumdf['Sum'].sum() * 100
sumdf.plot(x='Value',y='Pct',kind='bar',title='Frequency of Value')
Here's a one-liner:
ax = (df.groupby('Value')['Count'].sum() / df['Count'].sum() * 100).plot.bar(title='Frequency of Value')
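To inspect the numbers behind the bars without plotting, the same expression can be evaluated on the question's data. This is only a sketch; the variable name `pct` is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"Value": [1, 3, 2, 1], "Count": [5, 2, 1, 4]})

# Sum Count per Value, then express each sum as a share of the grand total.
pct = df.groupby("Value")["Count"].sum() / df["Count"].sum() * 100
print(pct.round(2))
```

Calling `pct.plot.bar(title='Frequency of Value')` on the result gives the same chart as the one-liner.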

python: aggregate columns in pivot table with multiindex structure

If I have a multi-index pivot table like this:
what would be the way to aggregate the total 'sum' and 'count' across all dates?
I want to see additional columns with the totals for every row in the table.
Thanks to #Nik03 for the idea. pd.concat returns the required data frame, but with a single column index level. To add it to the original dataframe, first create the total columns there and assign the new dataframe's columns to them:
table_to_show = pd.concat([table_to_record.filter(like='sum').sum(1), table_to_record.filter(like='count').sum(1)], axis=1)
table_to_show.columns = ['sum', 'count']
table_to_record['total_sum'] = table_to_show['sum']
table_to_record['total_count'] = table_to_show['count']
column_1st = table_to_record.pop('total_sum')
column_2nd = table_to_record.pop('total_count')
table_to_record.insert(0, 'total_sum', column_1st)
table_to_record.insert(1, 'total_count', column_2nd)
One way:
df1 = pd.concat([df.filter(like='sum').sum(axis=1),
                 df.filter(like='mean').sum(axis=1)], axis=1)
df1.columns = ['sum', 'mean']
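The original table is not shown, so the following is only a sketch under an assumed layout: a two-level column index whose top level is the aggregate name (`sum`, `count`) and whose second level is the date. The dates, row labels, and numbers are made up. With that layout, selecting a top-level key returns the per-date block, which can then be summed across columns.

```python
import pandas as pd

# Assumed layout: top column level = aggregate, second level = date.
columns = pd.MultiIndex.from_product(
    [["sum", "count"], ["2021-01", "2021-02"]]
)
table = pd.DataFrame(
    [[10, 20, 1, 2], [30, 40, 3, 4]],
    index=["x", "y"],
    columns=columns,
)

# Select each aggregate's block of date columns and total it per row.
totals = pd.DataFrame({
    "total_sum": table["sum"].sum(axis=1),
    "total_count": table["count"].sum(axis=1),
})
print(totals)
```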

Plotting certain bars in a series and grouping the rest in one bar

Imagine I have a series with a column that takes various values, such as:
COL1 FREQUENCY
A 30
B 20
C 50
D 10
E 15
F 5
I want to use matplotlib.pyplot to plot a bar graph that displays the frequencies of A, B, C, and OTHERS (everything else combined). I managed to do so without the 'others' grouping by simply doing this:
ax = srs.plot.bar(rot=0)
or
plt.bar(srs.index, srs)
I know this shows a bar for every value. How do I limit it to bars for just A, B, C, and OTHERS?
You can do a map then groupby.sum():
s = df['COL1'].map(lambda x: x if x in ('A','B','C') else 'OTHERS')
to_plot = df.FREQUENCY.groupby(s).sum()
to_plot.plot.bar()
You need to create a new dataframe and plot it afterwards
# list all values you want to keep
col1_to_keep = ['A','B','C']
# create a new dataframe with only these values in COL1
srs2 = srs.loc[srs['COL1'].isin(col1_to_keep)]
# create a third dataframe with only what you don't want to keep
srs3 = srs.loc[~srs['COL1'].isin(col1_to_keep)]
# create a dataframe with only one row containing the sum of frequency
rest = pd.DataFrame({'COL1': ['OTHERS'], 'FREQUENCY': [srs3['FREQUENCY'].sum()]})
# add this row to srs2 (DataFrame.append was removed in pandas 2.0)
srs2 = pd.concat([srs2, rest], ignore_index=True)
# you can finally plot it
ax = srs2.plot.bar(x='COL1', y='FREQUENCY', rot=0)
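A compact variant of the same idea, sketched with `Series.where` on the data from the question: labels outside the keep-list are replaced by 'OTHERS' before grouping. The names `keep` and `labels` are assumptions introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "COL1": list("ABCDEF"),
    "FREQUENCY": [30, 20, 50, 10, 15, 5],
})

keep = ["A", "B", "C"]
# Keep the chosen labels; replace every other label with "OTHERS".
labels = df["COL1"].where(df["COL1"].isin(keep), "OTHERS")
# Sum the frequencies per (possibly relabelled) category.
to_plot = df.groupby(labels)["FREQUENCY"].sum()
print(to_plot)
```

`to_plot.plot.bar(rot=0)` then draws four bars, with D, E and F combined under OTHERS.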

pandas.DataFrame input DataFrame but get NaN?

df is the original DataFrame, read from a csv file.
a = df.head(3) # take the first three rows of df
This is table a.
b = a.loc[1:3,'22':'41'] # select part of a
c = pd.DataFrame(data=b,index=['a','b'],columns=['v','g']) # assign new index and columns
b is 2x2 and shows four values.
c is 2x2 as well, but shows four NaN.
Why doesn't c contain any numbers?
Try using .values; you are running into pandas' 'intrinsic data alignment':
c = pd.DataFrame(data=b.values,index=['a','b'],columns=['v','g']) # give index and columns
Pandas likes to align on indexes. By converting your 'b' dataframe into a np.array, you can then use the DataFrame constructor to build a new dataframe from those 2x2 values with new index and column labels.
Your DataFrame b already contains row and column indices, so when you create DataFrame c and pass the index and columns keyword arguments, pandas reindexes b to those labels. None of 'a', 'b', 'v' or 'g' exist in b, so every cell comes back as NaN.
If all you want to do is re-index b, why not do it directly?
b = b.copy()
b.index = ['a', 'b']
b.columns = ['v', 'g']
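The alignment behaviour can be reproduced on a small stand-in for b. This is a sketch; the values are made up, and only the labels '22' and '41' come from the question.

```python
import pandas as pd

# Toy stand-in for b; the numbers are illustrative assumptions.
b = pd.DataFrame([[1, 2], [3, 4]], index=[1, 2], columns=["22", "41"])

# Passing a DataFrame: the constructor aligns on labels, and none match.
aligned = pd.DataFrame(data=b, index=["a", "b"], columns=["v", "g"])

# Passing the raw array: there are no labels to align, so values are kept.
relabeled = pd.DataFrame(
    data=b.to_numpy(), index=["a", "b"], columns=["v", "g"]
)
print(aligned)
print(relabeled)
```

The first frame is all NaN, while the second carries the four values under the new labels.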