Cutting up the x-axis to produce multiple graphs with seaborn? - pandas

The following code when graphed looks really messy at the moment. The reason is I have too many values for 'fare'. 'Fare' ranges from [0-500] with most of the values within the first 100.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
titanic = sns.load_dataset("titanic")
y =titanic.groupby([titanic.fare//1,'sex']).survived.mean().reset_index()
sns.set(style="whitegrid")
g = sns.factorplot(x='fare', y= 'survived', col = 'sex', kind ='bar' ,data= y,
size=4, aspect =2.5 , palette="muted")
g.despine(left=True)
g.set_ylabels("Survival Probability")
g.set_xlabels('Fare')
plt.show()
I would like to try slicing up the 'fare' of the plots into subsets but would like to see all the graphs at the same time on one screen. I was wondering it this is possible without having to resort to groupby.
I will have to play around with the values of 'fare' to see what I would want each graph to represent, but for a sample let's use break up the graph into these 'fare' values.
[0-18]
[18-35]
[35-70]
[70-300]
[300-500]
So the total would be 10 graphs on one page, because of the juxtaposition with the opposite sex.
Is it possible with Seaborn? Do I need to do a lot of configuring with matplotlib? Thanks.

Actually I wrote a little blog post about this a while ago. If you are plotting histograms you can use the by keyword:
import matplotlib.pyplot as plt
import seaborn.apionly as sns
sns.set() #rescue matplotlib's styles from the early '90s
data = sns.load_dataset('titanic')
data.hist(by='class', column = 'fare')
plt.show()
Otherwise if you're just plotting value-counts, you have to roll your own grid:
def categorical_hist(self,column,by,layout=None,legend=None,**params):
from math import sqrt, ceil
if layout==None:
s = ceil(sqrt(self[column].unique().size))
layout = (s,s)
return self.groupby(by)[column]\
.value_counts()\
.sort_index()\
.unstack()\
.plot.bar(subplots=True,layout=layout,legend=None,**params)
categorical_hist(data, by='class', column='embark_town')
Edit If you want survival rate by fare range, you could do something like this
data.groupby(pd.cut(data.fare,10)).apply(lambda x.survived.sum(): x./len(x))

Related

Plot histogram with matplotlib/seaborn

How does one plot a histogram with precomputed probability distribution? I have the following:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
dist = np.array([0.50416489, 0.1057769 , 0.08717909, 0.03758235, 0.02342604, 0.03694781, 0.04196706, 0.03448674, 0.04018618, 0.01171971])
sns.histplot(data=dist, discrete=True)
plt.show()
output:
How do I change the y-axis to percentile and make the x-axis discrete (with values [0,1,2,...,10]) ?
I am working with a long tail distribution but I can only see the first element, how do I visualize it in a meaningful way?
EDIT
plt.bar gives a more interpretable result:
plt.bar(np.arange(10), dist)
plt.show()
However, when I use it on my real-world data I get the following plot:
The first 10 elements are the same as in dist, is it possible to make the x axis logarithmic?

matplotlib - seaborn - the numbers on the correlation plots are not readable

The plot below shows the correlation for one column. The problem is that the numbers are not readable, because there are many columns in it.
How is it possible to show only 5 or 6 most important columns and not all of them with very low importance?
plt.figure(figsize=(20,3))
sns.heatmap(df.corr()[['price']].sort_values('price', ascending=False).iloc[1:].T, annot=True,
cmap='Spectral_r', vmax=0.9, vmin=-0.31)
You can limit the cells shown via .iloc[1:7]. If you also want to show the highest negative values, you could create a second plot with .iloc[-6:]. To have both together, you could use numpy's slicing function and write .iloc[np.r_[1:4, -3:0]].
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(np.random.rand(7, 27), columns=['price'] + [*'abcdefghijklmnopqrstuvwxyz'])
plt.figure(figsize=(20, 3))
sns.heatmap(df.corr()[['price']].sort_values('price', ascending=False).iloc[1:7].T,
annot=True, annot_kws={'rotation':90, 'size': 20},
cmap='Spectral_r', vmax=0.9, vmin=-0.31)
plt.show()
annot can also be a list of labels. Using this, you can define a string matrix that you use to display the desired numbers and set the others to an empty string.
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns; sns.set_theme()
import pandas as pd
from string import ascii_letters
# generate random data
rs = np.random.RandomState(33)
df = pd.DataFrame(data=rs.normal(size=(100, 26)),
columns=list(ascii_letters[26:]))
importance_index = 5 # until which idx to hide values
data = df.corr()[['A']].sort_values('A', ascending=False).iloc[1:].T
labels = data.astype(str) # make a str-copy
labels.iloc[0,:importance_index] = ' ' # mask columns that you want to hide
sns.heatmap(data, annot=labels, cmap='Spectral_r', vmax=0.9, vmin=-0.31, fmt='', annot_kws={'rotation':90})
plt.show()
The output on some random data:
This works but it has its limits, particulary with setting fmt='' (can't use it to conveniently format decimals anymore, need to do it manually now). I would also question whether your approach is even the best one to take here. I think consistency in plots is quite important. I would rather evaluate if we can't rotate the heatmap labels (I've included it above) or leave them out completely since it is technically redundant due to the color-coding. Alternatively, you could only plot the cells with the "important" values.

How to plot Series with selective ticks?

I have a Series that I would like to plot as a bar chart: pd.Series([-4,2, 3,3, 4,5,9,20]).value_counts()
Since I have many bars I only want to display some (equidistant) ticks.
However, unless I actively work against it, pyplot will print the wrong labels. E.g. if I leave out set_xticklabels in the code below I get
where every element from the index is taken and just displayed with the specified distance.
This code does what I want:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
s = pd.Series([-4,2, 3,3, 4,5,9,20]).value_counts().sort_index()
mi,ma = min(s.index), max(s.index)
s = s.reindex(range(mi,ma+1,1), fill_value=0)
distance = 10
a = s.plot(kind='bar')
condition = lambda t: int(t[1].get_text()) % 10 == 0
ticks_,labels_=zip(*filter(condition, zip(a.get_xticks(), a.get_xticklabels())))
a.set_xticks(ticks_)
a.set_xticklabels(labels_)
plt.show()
But I still feel like I'm being unnecessarily clever here. Am I missing a function? Is this the best way of doing that?
Consider not using a pandas bar plot in case you intend to plot numeric values; that is because pandas bar plots are categorical in nature.
If instead using a matplotlib bar plot, which is numeric in nature, there is no need to tinker with any ticks at all.
s = pd.Series([-4,2, 3,3, 4,5,9,20]).value_counts().sort_index()
plt.bar(s.index, s)
I think you overcomplicated it. You can simply use the following. You just need to find the relationship between the ticks and the ticklabels.
a = s.plot(kind='bar')
xticks = np.arange(0, max(s)*10+1, 10)
plt.xticks(xticks + abs(mi), xticks)

Basic axis malfuction in matplotlib

When plotting using matplotlib, I ran into an interesting issue where the y axis is scaled by a very inconvenient quantity. Here's a MWE that demonstrates the problem:
import numpy as np
import matplotlib.pyplot as plt
l = np.linspace(0.5,2,2**10)
a = (0.696*l**2)/(l**2 - 9896.2e-9**2)
plt.plot(l,a)
plt.show()
When I run this, I get a figure that looks like this picture
The y-axis clearly is scaled by a silly quantity even though the y data are all between 1 and 2.
This is similar to the question:
Axis numerical offset in matplotlib
I'm not satisfied with the answer to this question in that it makes no sense to my why I need to go the the convoluted process of changing axis settings when the data are between 1 and 2 (EDIT: between 0 and 1). Why does this happen? Why does matplotlib use such a bizarre scaling?
The data in the plot are all between 0.696000000017 and 0.696000000273. For such cases it makes sense to use some kind of offset.
If you don't want that, you can use you own formatter:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker
l = np.linspace(0.5,2,2**10)
a = (0.696*l**2)/(l**2 - 9896.2e-9**2)
plt.plot(l,a)
fmt = matplotlib.ticker.StrMethodFormatter("{x:.12f}")
plt.gca().yaxis.set_major_formatter(fmt)
plt.show()

Seaborn FacetGrid plots are empty if any of the sub-plots have no data

I have a dataset that I want to plot with FacetGrids using the seaborn library. The problem is my data is "sparse"; some of the individual subplots don't exist (ie. there are zero data points). I would like those cells to either not show up, or just show up and be blank, but still see the subplots that have data. Here's a simple example:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(columns=['a','b','c','d'],
data=[[1,1,1,4],[1,2,2,8],[2,1,2,12],[2,1,3,14]])
print df
g = sns.FacetGrid(df, col='a', row='b', hue='c')
g.map(plt.scatter, 'c', 'd', marker='o')
Unfortunately, when I plot this, I just get four empty plots instead of 3 filled plots and one empty one. If I change the last row of data to [2,2,3,14] instead, then all four plots appear as expected. Is this a bug in seaborn? Can I work around it somehow?