How to plot Series with selective ticks? - pandas

I have a Series that I would like to plot as a bar chart: pd.Series([-4,2, 3,3, 4,5,9,20]).value_counts()
Since I have many bars I only want to display some (equidistant) ticks.
However, unless I actively work against it, pyplot prints the wrong labels. E.g. if I leave out set_xticklabels in the code below, every element from the index is taken in order and simply displayed at the specified distance, so the labels no longer match the bars.
This code does what I want:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
s = pd.Series([-4,2, 3,3, 4,5,9,20]).value_counts().sort_index()
mi,ma = min(s.index), max(s.index)
s = s.reindex(range(mi,ma+1,1), fill_value=0)
distance = 10
a = s.plot(kind='bar')
condition = lambda t: int(t[1].get_text()) % distance == 0
ticks_,labels_=zip(*filter(condition, zip(a.get_xticks(), a.get_xticklabels())))
a.set_xticks(ticks_)
a.set_xticklabels(labels_)
plt.show()
But I still feel like I'm being unnecessarily clever here. Am I missing a function? Is this the best way of doing that?

Consider not using a pandas bar plot if you intend to plot numeric values, because pandas bar plots are categorical in nature.
If you instead use a matplotlib bar plot, which is numeric in nature, there is no need to tinker with any ticks at all.
s = pd.Series([-4,2, 3,3, 4,5,9,20]).value_counts().sort_index()
plt.bar(s.index, s)
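If equidistant ticks are still wanted on the numeric axis, a locator can place them. The following is only a minimal sketch, assuming matplotlib's MultipleLocator with a spacing of 10 (the distance from the question). The Agg backend line is an assumption so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for interactive use
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import MultipleLocator

s = pd.Series([-4, 2, 3, 3, 4, 5, 9, 20]).value_counts().sort_index()

fig, ax = plt.subplots()
ax.bar(s.index, s.values)  # bars sit at their numeric x-values
ax.xaxis.set_major_locator(MultipleLocator(10))  # one major tick every 10 units
tick_positions = ax.get_xticks()
```

Because the axis is numeric, the tick labels follow from the tick positions and no set_xticklabels call is needed.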

I think you overcomplicated it. You can simply use the following. You just need to find the relationship between the ticks and the ticklabels.
a = s.plot(kind='bar')
xticks = np.arange(0, max(s)*10+1, 10)
plt.xticks(xticks + abs(mi), xticks)
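The relationship between ticks and tick labels can be spelled out without touching the plot: a pandas bar plot puts the i-th bar at x-position i, so after the reindex from the question the value v sits at position v - mi. A minimal sketch under that assumption (the names first, labels and positions are illustrative, not from the original code):

```python
import numpy as np
import pandas as pd

s = pd.Series([-4, 2, 3, 3, 4, 5, 9, 20]).value_counts().sort_index()
mi, ma = s.index.min(), s.index.max()
s = s.reindex(range(mi, ma + 1), fill_value=0)  # one bar per integer value

distance = 10
# first multiple of `distance` that is >= mi
first = int(np.ceil(mi / distance) * distance)
labels = np.arange(first, ma + 1, distance)  # values that should get a tick
positions = labels - mi                      # categorical bar positions of those values
```

These two arrays could then be handed to ax.set_xticks(positions) and ax.set_xticklabels(labels) on the axes returned by s.plot(kind='bar').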

Related

Sns barplot does not sort sliced values

I want to plot from a pandas DataFrame using sns.barplot. Everything works fine with the
following code:
result = df.groupby(['Code departement']).size().sort_values(ascending=False)
x=result.index
y=result.values
plot=sns.barplot(x, y)
plot.set(xlabel='Code departement', ylabel='Nombre de transactions')
sns.barplot(x, y, data=df).set_title('title')
But as you can see in PLOT 1, there are too many bars so I just want the 10 highest, and when I slice x and y :
x=result[:10].index
y=result[:10].values
plot=sns.barplot(x, y)
It prints bars unordered like this :
I checked by printing the sliced x and y and they are correctly ordered. I don't know what I am missing. Thank you for your help.
You didn't state the version you are using, but probably it isn't the latest. Seaborn as well as matplotlib receive quite some improvements with each new version.
With seaborn 0.11.1 you'd get a warning, as x and y are preferably passed as keywords, i.e. sns.barplot(x=x, y=y). The warning tries to avoid confusion with the data= keyword. Apart from that, the numeric x-values would appear sorted numerically.
The order can be controlled via the order= keyword. In this case, sns.barplot(x=x, y=y, order=x). To only have the 10 highest, you can pass sns.barplot(x=x, y=y, order=x[:10]).
Also note that you are creating the bar plot twice (just to change the title?), which can be very confusing. As sns.barplot returns the ax (the subplot onto which the plot has been drawn), the usual approach is ax = sns.barplot(...) and then ax.set_title(...). (The name ax is preferred, to easier understand how matplotlib and seaborn example code can be employed in new code.)
The following example code has been tested with seaborn 0.11.1:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
print(sns.__version__)
df = pd.DataFrame({'Code departement': np.random.randint(1, 51, 1000)})
result = df.groupby(['Code departement']).size().sort_values(ascending=False)
x = result.index
y = result.values
ax = sns.barplot(x=x, y=y, order=x[:10])
ax.set(xlabel='Code departement', ylabel='Nombre de transactions')
ax.set_title('title')
plt.show()
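The list passed to order= can also be built directly with pandas. This is a hedged sketch, assuming the same randomly generated 'Code departement' column as above; the names top10 and order are illustrative. It relies on value_counts() sorting by count in descending order.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'Code departement': rng.integers(1, 51, 1000)})

# value_counts() already returns counts sorted from highest to lowest
top10 = df['Code departement'].value_counts().head(10)
order = top10.index.tolist()  # category order for the x axis

# The plot call would then look like:
# ax = sns.barplot(x=top10.index, y=top10.values, order=order)
```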

matplotlib - seaborn - the numbers on the correlation plots are not readable

The plot below shows the correlation for one column. The problem is that the numbers are not readable, because there are many columns in it.
How is it possible to show only 5 or 6 most important columns and not all of them with very low importance?
plt.figure(figsize=(20,3))
sns.heatmap(df.corr()[['price']].sort_values('price', ascending=False).iloc[1:].T, annot=True,
cmap='Spectral_r', vmax=0.9, vmin=-0.31)
You can limit the cells shown via .iloc[1:7]. If you also want to show the highest negative values, you could create a second plot with .iloc[-6:]. To have both together, you could use numpy's slicing function and write .iloc[np.r_[1:4, -3:0]].
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(np.random.rand(7, 27), columns=['price'] + [*'abcdefghijklmnopqrstuvwxyz'])
plt.figure(figsize=(20, 3))
sns.heatmap(df.corr()[['price']].sort_values('price', ascending=False).iloc[1:7].T,
annot=True, annot_kws={'rotation':90, 'size': 20},
cmap='Spectral_r', vmax=0.9, vmin=-0.31)
plt.show()
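To see what the np.r_ expression mentioned above selects, here is a small sketch on a hypothetical, already sorted correlation Series (the values and column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# np.r_ concatenates the two slices into one integer index array
idx = np.r_[1:4, -3:0]

# hypothetical correlations with 'price', sorted in descending order
corr = pd.Series(
    [1.0, 0.8, 0.6, 0.4, 0.1, -0.2, -0.5, -0.7, -0.9],
    index=['price', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
)

# skips 'price' itself, keeps the 3 highest and the 3 lowest entries
selected = corr.iloc[idx]
```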
annot can also be a list of labels. Using this, you can define a string matrix that you use to display the desired numbers and set the others to an empty string.
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns; sns.set_theme()
import pandas as pd
from string import ascii_letters
# generate random data
rs = np.random.RandomState(33)
df = pd.DataFrame(data=rs.normal(size=(100, 26)),
columns=list(ascii_letters[26:]))
importance_index = 5 # until which idx to hide values
data = df.corr()[['A']].sort_values('A', ascending=False).iloc[1:].T
labels = data.astype(str) # make a str-copy
labels.iloc[0,:importance_index] = ' ' # mask columns that you want to hide
sns.heatmap(data, annot=labels, cmap='Spectral_r', vmax=0.9, vmin=-0.31, fmt='', annot_kws={'rotation':90})
plt.show()
The output on some random data:
This works but it has its limits, particularly with setting fmt='' (you can't use it to conveniently format decimals anymore and need to do that manually). I would also question whether your approach is even the best one to take here. I think consistency in plots is quite important. I would rather evaluate whether we can rotate the heatmap labels (I've included that above) or leave them out completely, since they are technically redundant due to the color-coding. Alternatively, you could plot only the cells with the "important" values.
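Formatting the decimals manually could look like the following sketch. It assumes two decimals and, unlike the code above, blanks everything after the first keep columns so that only the strongest correlations are annotated; keep is a hypothetical name.

```python
import numpy as np
import pandas as pd

rs = np.random.RandomState(33)
df = pd.DataFrame(rs.normal(size=(100, 26)),
                  columns=list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
data = df.corr()[['A']].sort_values('A', ascending=False).iloc[1:].T

keep = 5  # hypothetical: number of leading columns that keep their label
# format every cell as a string with two decimals
labels = data.apply(lambda col: col.map('{:.2f}'.format))
# blank out the remaining cells
labels.iloc[0, keep:] = ''
```

The labels frame can then be passed as annot=labels together with fmt='' as in the heatmap call above.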

Modify an errorbar extent in pandas barplot

I'm plotting data with a pandas barplot that includes errorbars (that are symmetric around the bar top), and I would like to modify the extent of one single errorbar in this plot, so that it shows only on half of it. How can I do that?
Here's a concrete example:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
bars = pd.DataFrame(np.random.randn(2,2), index=['a','b'], columns=['c','d'])
errs = pd.DataFrame(np.random.randn(2,2), index=['a','b'], columns=['c','d'])
ax = bars.plot.barh(color=['r','g'],xerr=errs)
which yields a plot like that:
I'm trying to a posteriori access and modify the extent of the errorbar of index a and column d so that it shows only the first half of it, i.e. a segment [bar_top-err, bar_top] instead of [bar_top-err, bar_top+err]. I attempted to retrieve the following matplotlib object:
plt.getp(ax.get_children()[1],'paths')[0]
which, if I'm not mistaken, is a Bbox, and describes the right object, but I can't get to modify it in my plot. Any idea on how to do that?
You were almost there, just need to modify and update the coordinates in path.vertices. I took the liberty to assume that you want the error bar to face "away from zero", instead of just showing the negative part of it:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
bars = pd.DataFrame(np.random.randn(2,2), index=['a','b'], columns=['c','d'])
errs = pd.DataFrame(np.random.randn(2,2), index=['a','b'], columns=['c','d'])
ax = bars.plot.barh(color=['r','g'], xerr=errs)
child = ax.get_children()[1]
path = plt.getp(child, 'paths')[0]
bar_top = path.vertices.mean(axis=0)[0]
# replace the right tail if bar is negative or left tail if it's positive
method = np.argmin if np.sign(bar_top)==1 else np.argmax
idx = method(path.vertices, axis=0)[0]
path.vertices[idx, 0] = bar_top
plt.savefig('figs/hack-linecollections.png', dpi=150)
plt.show()
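The vertex manipulation can be looked at in isolation. Below is a minimal sketch on a plain array, assuming the error bar is a horizontal segment with two vertices; half_errorbar is a hypothetical helper, not a matplotlib function, and it works on a copy instead of modifying the path in place.

```python
import numpy as np

def half_errorbar(vertices):
    """Return a copy of a 2-vertex horizontal segment with the tail facing zero removed."""
    v = np.array(vertices, dtype=float)  # copy, leaves the input untouched
    bar_top = v[:, 0].mean()             # the segment is symmetric around the bar top
    # positive bar: pull in the left end; negative bar: pull in the right end
    pick = np.argmin if bar_top > 0 else np.argmax
    idx = pick(v[:, 0])
    v[idx, 0] = bar_top
    return v

positive = half_errorbar([[0.5, 0.0], [1.5, 0.0]])
negative = half_errorbar([[-1.5, 0.0], [-0.5, 0.0]])
```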

Cutting up the x-axis to produce multiple graphs with seaborn?

The following code when graphed looks really messy at the moment. The reason is I have too many values for 'fare'. 'Fare' ranges from [0-500] with most of the values within the first 100.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
titanic = sns.load_dataset("titanic")
y =titanic.groupby([titanic.fare//1,'sex']).survived.mean().reset_index()
sns.set(style="whitegrid")
g = sns.factorplot(x='fare', y= 'survived', col = 'sex', kind ='bar' ,data= y,
size=4, aspect =2.5 , palette="muted")
g.despine(left=True)
g.set_ylabels("Survival Probability")
g.set_xlabels('Fare')
plt.show()
I would like to try slicing up the 'fare' of the plots into subsets but would like to see all the graphs at the same time on one screen. I was wondering if this is possible without having to resort to groupby.
I will have to play around with the values of 'fare' to see what I would want each graph to represent, but for a sample let's break up the graph into these 'fare' ranges.
[0-18]
[18-35]
[35-70]
[70-300]
[300-500]
So the total would be 10 graphs on one page, because of the juxtaposition with the opposite sex.
Is it possible with Seaborn? Do I need to do a lot of configuring with matplotlib? Thanks.
Actually I wrote a little blog post about this a while ago. If you are plotting histograms you can use the by keyword:
import matplotlib.pyplot as plt
import seaborn as sns  # seaborn.apionly is no longer available in current seaborn versions
sns.set() #rescue matplotlib's styles from the early '90s
data = sns.load_dataset('titanic')
data.hist(by='class', column = 'fare')
plt.show()
Otherwise if you're just plotting value-counts, you have to roll your own grid:
def categorical_hist(self, column, by, layout=None, legend=None, **params):
    from math import sqrt, ceil
    if layout is None:
        # default: a square grid large enough for one subplot per unique value
        s = ceil(sqrt(self[column].unique().size))
        layout = (s, s)
    return self.groupby(by)[column]\
        .value_counts()\
        .sort_index()\
        .unstack()\
        .plot.bar(subplots=True, layout=layout, legend=legend, **params)
categorical_hist(data, by='class', column='embark_town')
Edit: If you want the survival rate by fare range, you could do something like this (with pandas imported as pd):
data.groupby(pd.cut(data.fare, 10)).apply(lambda x: x.survived.sum() / len(x))
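With the fare ranges listed in the question instead of 10 equal-width bins, the same idea might look like this. It is a sketch on a small hypothetical DataFrame standing in for the titanic data, so the numbers are made up.

```python
import pandas as pd

# hypothetical stand-in for the titanic data
data = pd.DataFrame({
    'fare':     [5, 10, 20, 30, 50, 80, 400, 450],
    'survived': [0,  1,  0,  1,  1,  1,   1,   0],
})

bins = [0, 18, 35, 70, 300, 500]  # fare ranges from the question
fare_range = pd.cut(data['fare'], bins=bins, include_lowest=True)

# mean of the 0/1 column 'survived' per range = survival rate
rate = data.groupby(fare_range, observed=False)['survived'].mean()
```

Each entry of rate corresponds to one of the five ranges and could be plotted in its own subplot.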

Creating a bar plot using Seaborn

I am trying to plot bar chart using seaborn. Sample data:
x=[1,1000,1001]
y=[200,300,400]
cat=['first','second','third']
df = pd.DataFrame(dict(x=x, y=y,cat=cat))
When I use:
sns.factorplot("x","y", data=df,kind="bar",palette="Blues",size=6,aspect=2,legend_out=False);
The figure produced is
When I add the legend
sns.factorplot("x","y", data=df,hue="cat",kind="bar",palette="Blues",size=6,aspect=2,legend_out=False);
The resulting figure looks like this
As you can see, the bar is shifted from the value. I don't know how to get the same layout as I had in the first figure and add the legend.
I am not necessarily tied to seaborn, I like the color palette, but any other approach is fine with me. The only requirement is that the figure looks like the first one and has the legend.
It looks like the issue arises from this, quoted from the seaborn.factorplot docs:
hue : string, optional
Variable name in data for splitting the plot by color. In the case of kind="bar", this also influences the placement on the x axis.
So, since seaborn uses matplotlib, you can do it like this:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
x=[1,1000,1001]
y=[200,300,400]
sns.set_context(rc={"figure.figsize": (8, 4)})
nd = np.arange(3)
width=0.8
plt.xticks(nd+width/2., ('1','1000','1001'))
plt.xlim(-0.15,3)
fig = plt.bar(nd, y, color=sns.color_palette("Blues",3))
plt.legend(fig, ['First','Second','Third'], loc = "upper left", title = "cat")
plt.show()
Added @mwaskom's method to get the three sns colors.
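In current matplotlib versions bars are centered on their x-position by default, so the ticks can go directly on the bar positions. The following is a hedged variant of the code above under that assumption; it takes the colors from matplotlib's "Blues" colormap instead of seaborn, and the Agg backend line is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np

x = [1, 1000, 1001]
y = [200, 300, 400]
cat = ['first', 'second', 'third']

pos = np.arange(len(x))  # evenly spaced categorical positions
colors = plt.get_cmap("Blues")(np.linspace(0.4, 0.9, len(x)))

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.bar(pos, y, color=colors)  # bars are centered on pos
ax.set_xticks(pos)
ax.set_xticklabels([str(v) for v in x])
# one legend entry per bar
ax.legend(bars.patches, cat, title="cat", loc="upper left")
```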