Plotting certain bars in a series and grouping the rest in one bar - pandas

Imagine I have a series with a column that takes various values, such as:
COL1 FREQUENCY
A 30
B 20
C 50
D 10
E 15
F 5
And I want to use matplotlib.pyplot to plot a bar graph that displays the values for A, B, C, and OTHERS, where OTHERS groups every other value appearing in the series. I managed to do so without the 'others' grouping by simply doing this:
ax = srs.plot.bar(rot=0)
or
plt.bar(srs.index, srs)
This shows a bar for every value. How do I limit it to just the bars for A, B, C, and OTHERS?

You can do a map then groupby.sum():
s = df['COL1'].map(lambda x: x if x in ('A','B','C') else 'OTHERS')
to_plot = df.FREQUENCY.groupby(s).sum()
to_plot.plot.bar()
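As a self-contained check of the map-then-groupby idea, here is a minimal sketch that rebuilds the question's table by hand. The DataFrame construction and the name keep are assumptions for illustration; the question does not show how the data is stored.

```python
import pandas as pd

# Data from the question, assumed to live in a two-column DataFrame
df = pd.DataFrame({
    "COL1": ["A", "B", "C", "D", "E", "F"],
    "FREQUENCY": [30, 20, 50, 10, 15, 5],
})

keep = ("A", "B", "C")  # hypothetical name for the values shown individually

# Relabel everything outside `keep` as OTHERS, then sum the frequencies per label
labels = df["COL1"].map(lambda x: x if x in keep else "OTHERS")
to_plot = df["FREQUENCY"].groupby(labels).sum()

print(to_plot)  # A 30, B 20, C 50, OTHERS 30 (D + E + F)
```

Calling to_plot.plot.bar(rot=0) on the result then draws the four bars.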

You need to create a new dataframe and plot it afterwards:
# list all values you want to keep
col1_to_keep = ['A','B','C']
# create a new dataframe with only these values in COL1
srs2 = srs.loc[srs['COL1'].isin(col1_to_keep)]
# create a third dataframe with only what you don't want to keep
srs3 = srs.loc[~srs['COL1'].isin(col1_to_keep)]
# create a dataframe with only one row containing the sum of frequency
rest = pd.DataFrame({'COL1': ['OTHERS'], 'FREQUENCY': [srs3['FREQUENCY'].sum()]})
# add this row to srs2 (DataFrame.append was removed in pandas 2.0, so use pd.concat)
srs2 = pd.concat([srs2, rest], ignore_index=True)
# you can finally plot it, using COL1 for the bar labels
ax = srs2.plot.bar(x='COL1', y='FREQUENCY', rot=0)
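If srs really is a Series indexed by the COL1 values, as the question's srs.plot.bar(rot=0) call suggests, the same grouping can be done on the index directly. This is only a sketch under that assumption; the Series below is rebuilt by hand from the question's table.

```python
import pandas as pd

# Assumed layout: frequencies as values, COL1 labels as the index
srs = pd.Series([30, 20, 50, 10, 15, 5], index=list("ABCDEF"), name="FREQUENCY")

keep = ["A", "B", "C"]

# Map each index label to itself or to OTHERS, then sum within each label
labels = srs.index.map(lambda k: k if k in keep else "OTHERS")
grouped = srs.groupby(labels).sum()

ax = grouped.plot.bar(rot=0)
```

Because the grouping key is built from the index, no intermediate dataframes are needed.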

Related

How to use a loop to make a plot of 3 columns at the time?

I have a dataframe which contains the 3 columns of data (P, EP and Q) for each of the three catchment areas. I need to make a subplot of each catchment area showing the 3 columns of data that belong to this catchment area using one loop.
I did manage to make the three subplots without using a loop, but don't get how I am supposed to use one loop.
df = pd.read_excel('catchment_water_balance_data_ex2.xlsx', index_col=0, parse_dates=[0], skiprows=4)
df_monthly = df.resample('M').mean()
fig, axs = plt.subplots(3)
catchment_1 = df_monthly[['P1', 'EP1', 'Q1']]
catchment_2 = df_monthly[['P2', 'EP2', 'Q2']]
catchment_3 = df_monthly[['P3', 'EP3', 'Q3']]
axs[0].plot(catchment_1)
axs[1].plot(catchment_2)
axs[2].plot(catchment_3)
fig.suptitle('Water data of 3 catchments')
fig.supylabel('mm/day');
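One possible way to fold the three blocks into a single loop is to build the column names from the catchment number. The sketch below uses random stand-in data, since the Excel file is not available; the column naming (P1, EP1, Q1, ...) is taken from the question, everything else is an assumption.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for df_monthly: 12 monthly rows, columns P1, EP1, Q1, ..., Q3
idx = pd.date_range("2020-01-01", periods=12, freq="MS")
rng = np.random.default_rng(0)
cols = [f"{var}{i}" for i in (1, 2, 3) for var in ("P", "EP", "Q")]
df_monthly = pd.DataFrame(rng.random((12, 9)), index=idx, columns=cols)

fig, axs = plt.subplots(3, sharex=True)
for i, ax in enumerate(axs, start=1):
    # Select the three columns that belong to catchment i
    catchment_cols = [f"P{i}", f"EP{i}", f"Q{i}"]
    df_monthly[catchment_cols].plot(ax=ax, title=f"Catchment {i}")

fig.suptitle("Water data of 3 catchments")
fig.supylabel("mm/day")
```

Plotting through the DataFrame (rather than axs[i].plot) also gives each subplot a legend with the column names.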

Change the stacked bar chart to Stacked Percentage Bar Plot

How can I change this stacked bar chart into a stacked percentage bar plot with percentage labels?
Here is the code:
df_responses= pd.read_csv('https://raw.githubusercontent.com/eng-aomar/Security_in_practice/main/secuirtyInPractice.csv')
df_new =df_responses.iloc[:,9:21]
image_format = 'svg' # e.g .png, .svg, etc.
# initialize empty dataframe
df2 = pd.DataFrame()
# group by each column counting the size of each category values
for col in df_new:
    grped = df_new.groupby(col).size()
    grped = grped.rename(grped.index.name)
    df2 = df2.merge(grped.to_frame(), how='outer', left_index=True, right_index=True)
# plot the merged dataframe
df2.plot.bar(stacked=True)
plt.show()
Since you already have the absolute values, you can calculate the percentages yourself, e.g. in a new column of your dataframe, and plot that column instead.
Using sum() and dataframe division you should get there quickly.
You might want to look at the GeeksForGeeks post that shows how this could be done.
EDIT
I have now gone ahead and adjusted your program so it will give the results that you want (at least the result I think you would like).
Two key functions that I used and you did not, are df.value_counts() and df.transpose(). You might wanna read on those two as they are quite helpful in many situations.
import pandas as pd
import matplotlib.pyplot as plt
df_responses= pd.read_csv('https://raw.githubusercontent.com/eng-aomar/Security_in_practice/main/secuirtyInPractice.csv')
df_new =df_responses.iloc[:,9:21]
image_format = 'svg' # e.g .png, .svg, etc.
# initialize empty dataframe providing the columns
df2 = pd.DataFrame(columns=df_new.columns)
# loop over all columns
for col in df_new.columns:
    # counting occurrences for each value can be done by value_counts()
    val_counts = df_new[col].value_counts()
    # replace nan values with 0 (fillna returns a new Series, so reassign it)
    val_counts = val_counts.fillna(0)
    # calculate the sum of all categories
    total = val_counts.sum()
    # use value count for each category and divide it by the total count of all categories
    # and multiply by 100 to get nice percent values
    df2[col] = val_counts / total * 100
# columns and rows need to be transposed in order to get the result we want
df2.transpose().plot.bar(stacked=True)
plt.show()
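The question also asks for percentage labels, which the code above does not draw. A minimal sketch of one way to add them, assuming matplotlib 3.4 or newer (for Axes.bar_label) and using a small made-up answers table in place of the survey CSV:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up survey answers standing in for df_new
answers = pd.DataFrame({
    "Q1": ["Yes", "No", "Yes", "Yes"],
    "Q2": ["No", "No", "Maybe", "Yes"],
})

# One column of counts per question; categories missing from a question become 0
counts = pd.DataFrame({col: answers[col].value_counts() for col in answers}).fillna(0)

# Convert each question's counts to percentages of that question's total
percent = counts / counts.sum() * 100

# Transpose so that each question is one stacked bar
ax = percent.T.plot.bar(stacked=True, rot=0)

# Write the percentage in the middle of every bar segment
for container in ax.containers:
    ax.bar_label(container, fmt="%.0f%%", label_type="center")
```

Segments with a value of 0 still receive a "0%" label here; filtering those out is left as a refinement. In a script, finish with plt.show() to display the figure.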

Plot Frequency of Values of Multiple Columns

I want to create a pandas plot of the frequency of occurrences of values in two columns. The scatter plot is to contain a regression line. The result is a heat map-like plot with a regression line.
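First, combine columns 'A' and 'B' into a single key per row. In this case both columns are numeric, so I multiply 'A' by the number of distinct values in 'B' and add 'B' (this only guarantees a distinct key per pair when both columns hold integers and 'B' stays between 0 and that count minus one). Next, use value_counts to get the frequency of each key. Use the pandas scatter plot to create the scatter/bubble/heatmap. Finally, use numpy.polyfit to fit a regression line.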
combined = (plotdf['A']*plotdf['B'].nunique()+plotdf['B']) # combine numeric values of columns A and B
vcounts = combined.value_counts() # get value counts of combined values
frequency = combined.map(vcounts) # lookup count for each row
# keep the returned Axes in `ax` so the matplotlib.pyplot alias `plt` is not overwritten
ax = plotdf.plot(x='A',y='B',c=frequency,s=frequency,colormap='viridis',kind='scatter',figsize=(16,8),title='Frequency of A and B')
ax.set(xlabel='A',ylabel='B')
x = plotdf['A'].values
y = plotdf['B'].values
m, b = np.polyfit(x, y, 1) # requires numpy (import numpy as np)
ax.plot(x, m*x + b, 'r') # r is color red
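An alternative way to get the per-row frequency, which avoids building a combined numeric key, is to group on both columns and broadcast the group size back to every row. This is a sketch of a swapped-in technique rather than the answer's method; the small plotdf and the marker scaling factor are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with repeated (A, B) pairs
plotdf = pd.DataFrame({"A": [1, 1, 2, 2, 2, 3], "B": [1, 1, 2, 2, 2, 5]})

# Number of rows sharing each (A, B) pair, aligned with the original rows
frequency = plotdf.groupby(["A", "B"])["A"].transform("size")

# Bubble plot: colour and marker size both follow the frequency
ax = plotdf.plot(x="A", y="B", c=frequency, s=frequency * 40,
                 colormap="viridis", kind="scatter",
                 title="Frequency of A and B")

# Least-squares regression line through the raw points
x = plotdf["A"].to_numpy()
y = plotdf["B"].to_numpy()
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, "r")
```

Grouping on the pair works for any column types, including strings, because it never relies on arithmetic between the two columns.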

Infer Series Labels and Data from pandas dataframe column for plotting

Consider a simple 2x2 dataset with series labels prepended as the first column ("Repo"):
Repo AllTests Restricted
0 Galactian 1860.0 410.0
1 Forecast-MLlib 140.0 47.0
Here are the DataFrame columns:
p(df.columns)
([u'Repo', u'AllTests', u'Restricted']
So the first column is the string label, and the second and third columns are data values. We want one series per row, corresponding to the Galactian and Forecast-MLlib repos.
This seems like a common task, so there should be a straightforward way to simply plot the DataFrame. However, the following related question does not provide one: it essentially throws away the DataFrame's structural knowledge and plots manually:
Set matplotlib plot axis to be the dataframe column name
So is there a more natural way to plot these Series - that does not involve deconstructing the already-useful DataFrame but instead infers the first column as labels and the remaining as series data points?
Update: here is a self-contained snippet
runtimes = npa([1860.,410.,140.,47.])
runtimes.shape = (2,2)
labels = npa(['Galactian','Forecast-MLlib'])
labels.shape=(2,1)
rtlabels = np.concatenate((labels,runtimes),axis=1)
rtlabels.shape = (2,3)
colnames = ['Repo','AllTests','Restricted']
df = pd.DataFrame(rtlabels, columns=colnames)
ps(df)
df.set_index('Repo').astype(float).plot()
plt.show()
And here is output
Repo AllTests Restricted
0 Galactian 1860.0 410.0
1 Forecast-MLlib 140.0 47.0
And with piRSquared's help it looks like this
So the data is showing now, but the series and labels are swapped. I will look further to try to line them up properly.
Another update
By flipping the columns/labels the series are coming out as desired.
The change was to :
labels = npa(['AllTests','Restricted'])
..
colnames = ['Repo','Galactian','Forecast-MLlib']
So the updated code is
runtimes = npa([1860.,410.,140.,47.])
runtimes.shape = (2,2)
labels = npa(['AllTests','Restricted'])
labels.shape=(2,1)
rtlabels = np.concatenate((labels,runtimes),axis=1)
rtlabels.shape = (2,3)
colnames = ['Repo','Galactian','Forecast-MLlib']
df = pd.DataFrame(rtlabels, columns=colnames)
ps(df)
df.set_index('Repo').astype(float).plot()
plt.title("Restricting Long-Running Tests\nin Galactus and Forecast-ML")
plt.show()
p('df columns', df.columns)
ps(df)
Pandas assumes your label information is in the index and columns. Set the index first:
df.set_index('Repo').astype(float).plot()
Or
df.set_index('Repo').T.astype(float).plot()
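To see the difference between the two calls, here is a minimal sketch that builds the question's table directly with numeric columns, so no astype(float) is needed. The variable names per_metric and per_repo are illustrative only.

```python
import pandas as pd

# The 2x2 dataset from the question, with Repo as an ordinary column
df = pd.DataFrame({
    "Repo": ["Galactian", "Forecast-MLlib"],
    "AllTests": [1860.0, 140.0],
    "Restricted": [410.0, 47.0],
})

# Index = Repo: one line per metric (AllTests, Restricted), repos on the x-axis
per_metric = df.set_index("Repo")

# Transposed: one line per repo, metrics on the x-axis
per_repo = per_metric.T

ax = per_repo.plot(marker="o")
```

pandas draws one series per column and uses the index for the x-axis, so transposing is what turns each row (repo) into its own series.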

Is there a way to set the order in pandas group boxplots?

Is there a way to sort the x-axis for a grouped box plot in pandas? It seems like it is sorted by an ascending order and I would like it to be ordered based on some other column value.
If you're grouping by a category, set it as an ordered categorical in the desired order.
See example below:
Here a dataset is created with three categories A, B and C where the mean value of each category is of the order C, B, A. The goal is to plot the categories in order of their mean value.
The key is converting the category to an ordered categorical data type with the desired order.
# create some data
n = 50
a = pd.concat([pd.Series(['A']*n, name='cat'),
               pd.Series(np.random.normal(1, 1, n), name='val')],
              axis=1)
b = pd.concat([pd.Series(['B']*n, name='cat'),
               pd.Series(np.random.normal(.5, 1, n), name='val')],
              axis=1)
c = pd.concat([pd.Series(['C']*n, name='cat'),
               pd.Series(np.random.normal(0, 1, n), name='val')],
              axis=1)
df = pd.concat([a, b, c]).reset_index(drop=True)
# unordered boxplot
df.boxplot(column='val', by='cat')
# get order by mean
means = df.groupby('cat')['val'].mean().sort_values()
ordered_cats = means.index.values
# create categorical data type and set categorical column as new data type
cat_dtype = pd.CategoricalDtype(ordered_cats, ordered=True)
df['cat'] = df['cat'].astype(cat_dtype)
# ordered boxplot
df.boxplot(column='val', by='cat')
Using the solution posted by krieger, the short answer is to convert the category column to a CategoricalDtype like so:
ordered_list = ['dog', 'cat', 'mouse']
df['category'] = df['category'].astype(pd.CategoricalDtype(ordered_list , ordered=True))
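A small sketch to confirm that the ordered categorical drives the group order. The data frame and its 'value' column are made up; only the astype call comes from the answer above.

```python
import pandas as pd

# Made-up data with three groups in arbitrary row order
df = pd.DataFrame({
    "category": ["cat", "mouse", "dog", "cat", "dog", "mouse"],
    "value": [3.0, 1.0, 5.0, 4.0, 6.0, 2.0],
})

ordered_list = ["dog", "cat", "mouse"]
df["category"] = df["category"].astype(
    pd.CategoricalDtype(ordered_list, ordered=True)
)

# Grouping now follows the categorical order instead of alphabetical order
order = list(df.groupby("category", observed=False)["value"].mean().index)
print(order)  # ['dog', 'cat', 'mouse']

# The grouped boxplot uses the same order along the x-axis
ax = df.boxplot(column="value", by="category")
```

Without the conversion, the groups would come out alphabetically as cat, dog, mouse.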