Pandas series stacked bar chart normalized - pandas

I have a pandas series with a multiindex like this:
my_series.head(5)
datetime_publication my_category
2015-03-31 xxx 24
yyy 2
zzz 1
qqq 1
aaa 2
dtype: int64
I am generating a horizontal bar chart using the plot method from pandas with all those stacked categorical values divided by datetime (according to the index hierarchy) like this:
my_series.unstack(level=1).plot.barh(
stacked=True,
figsize=(16,6),
colormap='Paired',
xlim=(0,10300),
rot=45
)
plt.legend(
bbox_to_anchor=(0., 1.02, 1., .102),
loc=3,
ncol=5,
mode="expand",
borderaxespad=0.
)
However, I cannot find a way to normalize the values in the series broken down by datetime_publication and my_category. I would like all the horizontal bars to have the same length, but right now the length depends on the absolute values in the series.
Is there built-in functionality in pandas to normalize the slices of the series, or some quick function to apply to the series that keeps track of the total for each combination of the multiindex levels?
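One possible approach, sketched here under assumptions: unstack the category level so each date becomes a row, then divide every row by its own total so that each stacked bar sums to 1. The sample data below is hypothetical and only mimics the shape of my_series from the question.

```python
import pandas as pd

# Hypothetical sample data shaped like the series in the question.
index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2015-03-31", "2015-04-30"]), ["xxx", "yyy", "zzz"]],
    names=["datetime_publication", "my_category"],
)
my_series = pd.Series([24, 2, 1, 10, 5, 5], index=index)

# One row per date, one column per category; missing combinations become 0.
wide = my_series.unstack(level=1).fillna(0)

# Divide each row by its own total so every stacked bar has length 1.
normalized = wide.div(wide.sum(axis=1), axis=0)

# Same plotting call as in the question, with the x limit adjusted to shares.
ax = normalized.plot.barh(stacked=True, colormap="Paired", xlim=(0, 1))
```

Call plt.show() (after import matplotlib.pyplot as plt) to display the figure outside a notebook. Staying with the series, my_series / my_series.groupby(level=0).transform("sum") should give the same shares before unstacking.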

Related

Graph plotting in pandas and seaborn

I have the table with 5 columns with 8000 rows:
Market DeliveryWindowID #Orders #UniqueShoppersAvailable #UniqueShoppersFulfilled
NY 296 2 2 5
MA 365 3 4 8
How do I plot a graph in pandas or seaborn that will show the #Order, #UniqueShoppersAvailable, #UniqueShoppersFulfilled v/s the market and delivery window?
Using Seaborn, reshape your dataframe with melt first:
df_chart = df.melt(['Market','DeliveryWindowID'])
sns.barplot(x='Market', y='value', hue='variable', data=df_chart)
One way is to set Market as index forcing it onto the x axis and do a bar graph if you wanted a quick visualization. This can be stacked or not.
Not Stacked
import matplotlib.pyplot as plt
df.drop(columns=['DeliveryWindowID']).set_index(df.Market).plot(kind='bar')
Stacked
df.drop(columns=['DeliveryWindowID']).set_index(df.Market).plot(kind='bar', stacked=True)

How to plot the number of unique values in each column in pandas dataframe as bar plot?

I want to plot the count of unique values per column for specific columns of my dataframe.
So if my dataframe has four columns 'col_a', 'col_b' , 'col_c' and 'col_d', and two ('col_a', 'col_b') of them are categorical features, I want to have a bar plot having 'col_a' and 'col_b' in the x-axis, and the count of unique values in 'col_a' and number of unique values in 'col_b' in the y-axis.
PS: I don't want to plot the count of each unique value in a specific column.
Actually, how to bar plot this with python?
properties_no_na.nunique()
Which returns:
neighborhood 51
block 6805
lot 1105
zip_code 41
residential_units 210
commercial_units 48
total_units 215
land_sqft_thousands 6192
gross_sqft_thousands 8469
year_built 170
tax_class_at_sale 4
building_class_at_sale 156
sale_price_millions 14135
sale_date 4440
sale_month 12
sale_year 15
dtype: int64
How would that be possible? If possible with Seaborn?
nunique() returns a pandas Series. Convert it to a DataFrame with reset_index() and pass that to seaborn.
nu = properties_no_na.nunique().reset_index()
nu.columns = ['feature','nunique']
ax = sns.barplot(x='feature', y='nunique', data=nu)
sns.displot(x=df.column_name1,col=df.column_name2,kde=True)
Note: sns is the conventional alias for the seaborn library.
The x axis always shows column_name1, and col= draws one panel per unique value of column_name2, so the number of panels depends on how many unique values that column has.

plot a stacked bar chart matplotlib pandas

I want to plot this data frame but I get an error.
this is my df:
6month final-formula Question Text
166047.0 1 0.007421 bathing
166049.0 1 0.006441 dressing
166214.0 1 0.001960 feeding
166216.0 2 0.011621 bathing
166218.0 2 0.003500 dressing
166220.0 2 0.019672 feeding
166224.0 3 0.012882 bathing
166226.0 3 0.013162 dressing
166229.0 3 0.008821 feeding
160243.0 4 0.023424 bathing
156876.0 4 0.000000 dressing
172110.0 4 0.032024 feeding
How can I plot a stacked bar chart based on the Question Text?
I tried the following code, but it raises an error.
dffinal.groupby(['6month','Question Text']).unstack('Question Text').plot(kind='bar',stacked=True,x='6month', y='final-formula')
import matplotlib.pyplot as plt
plt.show()
I want the 6month column on the x-axis, final-formula on the y-axis, and Question Text as the stacked segments.
Since I have three kinds of Question Text, each bar should have three stacked segments, and since I have 4 months, there should be 4 bars in total.
I am aiming for something like this, but what I tried did not work.
Am I missing something?
This picture shows the result without stacking: all the Question Text values have been summed together. I want each Question Text stacked separately.
You missed the aggregation step after groupby, namely sum():
df = dffinal.groupby(['6month','Question Text']).sum().unstack('Question Text')
df.columns = df.columns.droplevel()
df.plot(kind='bar', stacked=True)
I dropped multiindex level from columns just for legend consistency.
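As an alternative sketch (a different technique from the groupby/unstack shown above), pivot_table can do the aggregation and reshaping in one step. The frame below rebuilds the first two months of the data from the question; the variable name wide is just illustrative.

```python
import pandas as pd

# First two months of the data shown in the question.
dffinal = pd.DataFrame({
    "6month": [1, 1, 1, 2, 2, 2],
    "final-formula": [0.007421, 0.006441, 0.001960,
                      0.011621, 0.003500, 0.019672],
    "Question Text": ["bathing", "dressing", "feeding"] * 2,
})

# Rows become the x-axis groups, columns become the stacked segments.
wide = dffinal.pivot_table(
    index="6month",
    columns="Question Text",
    values="final-formula",
    aggfunc="sum",
)

# One bar per 6month value, one stacked segment per Question Text.
ax = wide.plot(kind="bar", stacked=True)
```

Because values names a single column, the resulting columns are already a flat index, so no droplevel() call is needed for the legend.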

Tick labels overlap in pandas bar chart

TL;DR: In pandas how do I plot a bar chart so that its x axis tick labels look like those of a line chart?
I made a time series with evenly spaced intervals (one item each day) and can plot it like such just fine:
intensity[350:450].plot()
plt.show()
But switching to a bar chart created this mess:
intensity[350:450].plot(kind = 'bar')
plt.show()
I then created a bar chart using matplotlib directly but it lacks the nice date time series tick label formatter of pandas:
def bar_chart(series):
    fig, ax = plt.subplots(1)
    ax.bar(series.index, series)
    fig.autofmt_xdate()
    plt.show()

bar_chart(intensity[350:450])
Here's an excerpt from the intensity Series:
intensity[390:400]
2017-03-07 3
2017-03-08 0
2017-03-09 3
2017-03-10 0
2017-03-11 0
2017-03-12 0
2017-03-13 2
2017-03-14 0
2017-03-15 3
2017-03-16 0
Freq: D, dtype: int64
I could go all out on this and create the tick labels entirely by hand, but I'd rather not have to baby matplotlib. I'd like to let pandas do its job and do what it did in the very first figure, but with a bar plot. So how do I do that?
Pandas bar plots are categorical plots. They create one tick (+label) for each category. If the categories are dates and those dates are continuous one may aim at leaving certain dates out, e.g. to plot only every fifth category,
ax = series.plot(kind="bar")
ax.set_xticklabels([t if not i%5 else "" for i,t in enumerate(ax.get_xticklabels())])
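A small variation on the idea above, as a sketch with made-up data: build the label strings directly from the index so that the visible labels are short dates rather than full timestamps. The date format and the step of 5 are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series standing in for the intensity data.
index = pd.date_range("2017-03-01", periods=30, freq="D")
series = pd.Series(np.arange(30) % 4, index=index)

ax = series.plot(kind="bar")

# Keep a formatted date on every fifth bar and blank out the rest.
labels = [
    d.strftime("%Y-%m-%d") if i % 5 == 0 else ""
    for i, d in enumerate(series.index)
]
ax.set_xticklabels(labels)
```

The bar plot still places one tick per category; only the label text changes.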
In contrast, matplotlib bar charts are numerical plots. Here a suitable ticker can be applied, which ticks the dates weekly, monthly, or at whatever interval is needed.
In addition, matplotlib gives you full control over the tick positions and their labels.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import dates
index = pd.date_range("2018-01-26", "2018-05-05")
series = pd.Series(np.random.rayleigh(size=100), index=index)
plt.bar(series.index, series.values)
plt.gca().xaxis.set_major_locator(dates.MonthLocator())
plt.gca().xaxis.set_major_formatter(dates.DateFormatter("%b\n%Y"))
plt.show()

Overlaying actual data on a boxplot from a pandas dataframe

I am using Seaborn to make boxplots from pandas dataframes. Seaborn boxplots seem to essentially read the dataframes the same way as the pandas boxplot functionality (so I hope the solution is the same for both -- but I can just use the dataframe.boxplot function as well). My dataframe has 12 columns and the following code generates a single plot with one boxplot for each column (just like the dataframe.boxplot() function would).
fig, ax = plt.subplots()
sns.set_style("darkgrid", {"axes.facecolor":"darkgrey"})
pal = sns.color_palette("husl",12)
sns.boxplot(dataframe, color = pal)
Can anyone suggest a simple way of overlaying all the values (by columns) while making a boxplot from dataframes?
I will appreciate any help with this.
This hasn't been added to the seaborn.boxplot function yet, but there's something similar in the seaborn.violinplot function, which has other advantages:
x = np.random.randn(30, 6)
sns.violinplot(data=x, inner="point")  # recent seaborn versions spell this option "point"
sns.despine(trim=True)
Here is a general solution for a boxplot of the entire dataframe. It should work for both seaborn and pandas, as they are both matplotlib-based under the hood. I will use the pandas plot as the example, assuming import matplotlib.pyplot as plt is already in place. As you already have the ax, it would make more sense to use ax.text(...) instead of plt.text(...).
In [35]:
print(df)
V1 V2 V3 V4 V5
0 0.895739 0.850580 0.307908 0.917853 0.047017
1 0.931968 0.284934 0.335696 0.153758 0.898149
2 0.405657 0.472525 0.958116 0.859716 0.067340
3 0.843003 0.224331 0.301219 0.000170 0.229840
4 0.634489 0.905062 0.857495 0.246697 0.983037
5 0.573692 0.951600 0.023633 0.292816 0.243963
[6 rows x 5 columns]
In [34]:
df.boxplot()
# order='F' flattens column by column, matching the x positions 1,1,...,2,2,...
for x, y, s in zip(np.repeat(np.arange(df.shape[1])+1, df.shape[0]),
                   df.values.ravel(order='F'), df.values.astype('<U5').ravel(order='F')):
    plt.text(x,y,s,ha='center',va='center')
For a single series in the dataframe, a few small changes are necessary:
In [35]:
sub_df=df.V1
pd.DataFrame(sub_df).boxplot()
for x, y, s in zip(np.repeat(1, df.shape[0]),
                   sub_df.values, sub_df.values.astype('<U5')):
    plt.text(x,y,s,ha='center',va='center')
Making scatter plots is also similar:
#for the whole thing
df.boxplot()
plt.scatter(np.repeat(np.arange(df.shape[1])+1, df.shape[0]), df.values.ravel(order='F'), marker='+', alpha=0.5)
#for just one column
sub_df=df.V1
pd.DataFrame(sub_df).boxplot()
plt.scatter(np.repeat(1, df.shape[0]), sub_df.values, marker='+', alpha=0.5)
To overlay anything on a boxplot, we first need to work out where each box is drawn along the x axis. They appear to be at 1, 2, 3, 4, and so on. Therefore we want the values in the first column plotted at x=1, the second column at x=2, and so on.
An efficient way of doing this is to use np.repeat to repeat 1, 2, 3, 4, ... n times each, where n is the number of observations. We can then plot using those numbers as the x coordinates. Since x is one-dimensional, the y coordinates need a flattened view of the data taken column by column, for example df.values.ravel(order='F'), so that each value lines up with its column's position.
Overlaying the text strings needs another step (a loop), since we can only place one x value, one y value and one text string at a time.
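The alignment between positions and values can be checked on a tiny frame. This is a minimal sketch with made-up numbers; the point is only that the repeated x positions and the column-by-column flattening pair up correctly.

```python
import numpy as np
import pandas as pd

# Made-up frame: 3 observations in each of 2 columns.
df = pd.DataFrame({"V1": [0.1, 0.2, 0.3], "V2": [1.1, 1.2, 1.3]})

n_rows, n_cols = df.shape

# Box positions 1..n_cols, each repeated once per observation.
x = np.repeat(np.arange(n_cols) + 1, n_rows)

# Flatten column by column so values stay with their own box.
y = df.to_numpy().ravel(order="F")

pairs = list(zip(x.tolist(), y.tolist()))
print(pairs)
```

With the default row-major ravel(), the values would interleave across columns and land on the wrong boxes.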
I have the following trick:
data = np.random.randn(6,5)
df = pd.DataFrame(data,columns = list('ABCDE'))
Now assign a dummy column to df:
df['Group'] = 'A'
print(df)
A B C D E Group
0 0.590600 0.226287 1.552091 -1.722084 0.459262 A
1 0.369391 -0.037151 0.136172 -0.772484 1.143328 A
2 1.147314 -0.883715 -0.444182 -1.294227 1.503786 A
3 -0.721351 0.358747 0.323395 0.165267 -1.412939 A
4 -1.757362 -0.271141 0.881554 1.229962 2.526487 A
5 -0.006882 1.503691 0.587047 0.142334 0.516781 A
Then call boxplot() on the grouped dataframe and you are done:
df.groupby('Group').boxplot()