Inner boxplots in seaborn violinplots not accurate - matplotlib

The inner boxplots that I get (through specification of inner='box') when generating seaborn violinplots are not accurate for my actual data. See example plot below. The actual data extend to the tip of the thin tails. But the boxplots end well within the area of the violin.
Assuming these boxplots are supposed to be representing the quartiles, and not standard deviations or something, then they are inaccurate.
My code invoking seaborn violinplot is below. As you can see, I have set the option cut=0, which should mean that the tails of the violin plot do not extent beyond my extreme data at all, and in fact, from inspection I can see that the extents of the violin are in the correct places. But I can also see from inspection that the inner boxplots are not even close to right.
sns.violinplot(x='Policy', y='LMP', order=cat_order, data=df, inner='box', scale='area', bw=0.2, cut=0, linewidth=0.5, ax = axes)
Does anyone have any insight into what seaborn does here? Are they deciding (only for purposes of the boxplot) that some of my data are outliers, and excluding them? Any ideas for how to control that?

OK, I tracked down the answer to my own question. While I'm used to boxplots based on strict quartiles, Seaborn uses another (apparently common) approach where the tips of the boxes on their boxplots extend to only 1.5 times the "interquartile range" or IQR.
See here for information Seaborn boxplots:
http://seaborn.pydata.org/tutorial/categorical.html#distributions-of-observations-within-categories
See here for definition of IQR:
http://stattrek.com/statistics/dictionary.aspx?definition=Interquartile%20range

Related

Turn off x-axis marginal distribution axes on jointplot using seaborn package

There is a similar question here, however I fail to adapt the provided solutions to my case.
I want to have a jointplot with kind=hex while removing the marginal plot of the x-axis as it contains no information. In the linked question the suggestion is to use JointGrid directly, however Seaborn then seems to to be unable to draw the hexbin plot.
joint_kws = dict(gridsize=70)
g = sns.jointplot(data=all_data, x="Minute of Hour", y="Frequency", kind="hex", joint_kws=joint_kws)
plt.ylim([49.9, 50.1])
plt.xlim([0, 60])
g.ax_joint.axvline(x=30,ymin=49, ymax=51)
plt.show()
plt.close()
How to remove the margin plot over the x-axis?
Why is the vertical line not drawn?
Also is there a way to exchange the right margin to a plot which more clearly resembles the density?
edit: Here is a sample of the dataset (33kB). Read it with pd.read_pickle("./data.pickle")
I've been fiddling with an analog problem (using a scatterplot instead of the hexbin). In the end, the solution to your first point is awkwardly simple. Just add this line :
g.ax_marg_x.remove()
Regarding your second point, I've no clue as to why no line is plotted. But a workaround seems to be to use vlines instead :
g.ax_joint.vlines(x=30, ymin=49, ymax=51)
Concerning your last point, I'm afraid I haven't understood it. If you mean increasing/reducing the margin between the subplots, you can use the space argument stated in the doc.

Matplotlib/Seaborn: Boxplot collapses on x axis

I am creating a series of boxplots in order to compare different cancer types with each other (based on 5 categories). For plotting I use seaborn/matplotlib. It works fine for most of the cancer types (see image right) however in some the x axis collapses slightly (see image left) or strongly (see image middle)
https://i.imgur.com/dxLR4B4.png
Looking into the code how seaborn plots a box/violin plot https://github.com/mwaskom/seaborn/blob/36964d7ffba3683de2117d25f224f8ebef015298/seaborn/categorical.py (line 961)
violin_data = remove_na(group_data[hue_mask])
I realized that this happens when there are too many nans
Is there any possibility to prevent this collapsing by code only
I do not want to modify my dataframe (replace the nans by zero)
Below you find my code:
boxp_df=pd.read_csv(pf_in,sep="\t",skip_blank_lines=False)
fig, ax = plt.subplots(figsize=(10, 10))
sns.violinplot(data=boxp_df, ax=ax)
plt.xticks(rotation=-45)
plt.ylabel("label")
plt.tight_layout()
plt.savefig(pf_out)
The output is a per cancer type differently sized plot
(depending on if there is any category completely nan)
I am expecting each plot to be in the same width.
Update
trying to use the order parameter as suggested leads to the following output:
https://i.imgur.com/uSm13Qw.png
Maybe this toy example helps ?
|Cat1|Cat2|Cat3|Cat4|Cat5
|3.93| |0.52| |6.01
|3.34| |0.89| |2.89
|3.39| |1.96| |4.63
|1.59| |3.66| |3.75
|2.73| |0.39| |2.87
|0.08| |1.25| |-0.27
Update
Apparently, the problem is not the data but the length of the title
https://github.com/matplotlib/matplotlib/issues/4413
Therefore I would close the question
#Diziet should I delete it or does my issue might help other ones?
Sorry for not including the line below in the code example:
ax.set_title("VERY LONG TITLE", fontsize=20)
It's hard to be sure without data to test it with, but I think you can pass the names of your categories/cancers to the order= parameter. This forces seaborn to use/display those, even if they are empty.
for instance:
tips = sns.load_dataset("tips")
ax = sns.violinplot(x="day", y="total_bill", data=tips, order=['Thur','Fri','Sat','Freedom Day','Sun','Durin\'s Day'])

Draw an ordinary plot with the same style as in plt.hist(histtype='step')

The method plt.hist() in pyplot has a way to create a 'step-like' plot style when calling
plt.hist(data, histtype='step')
but the 'ordinary' methods that plot raw data without processing (plt.plot(), plt.scatter(), etc.) apparently do not have style options to obtain the same result. My goal is to plot a given set of points using that style, without making histogram of these points.
Is that achievable with standard library methods for plotting a given 2-D set of points?
I also think that there is at least one hack (generating a fake distribution which would have histogram equal to our data) and a 'low-level' solution to draw each segment manually, but none of these ways seems favorable.
Maybe you are looking for drawstyle="steps".
import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt
data = np.cumsum(np.random.randn(10))
plt.plot(data, drawstyle="steps")
plt.show()
Note that this is slightly different from histograms, because the lines do not go to zero at the ends.

Difference between matplotlib.countourf and matlab.contourf() - odd sharp edges in matplotlib

I am a recent migrant from Matlab to Python and have recently worked with Numpy and Matplotlib. I recoded one of my scripts from Matlab, which employs Matlab's contourf-function, into Python using matplotlib's corresponding contourf-function. I managed to replicate the output in Python, apart that the contourf-plots are not exacly the same, for a reason that is unknown to me. As I run the contourf-function in matplotlib, I get this otherwise nice figure but it has these sharp edges on the contour-levels on top and bottom, which should not be there (see Figure 1 below, matplotlib-output). Now, when I export the arrays I used in Python to Matlab (i.e. the exactly same data set that was used to generate the matplotlib-contourf-plot) and use Matlab's contourf-function, I get a slightly different output, without those sharp contour-level edges (see Figure 2 below, Matlab-output). I used the same number of levels in both figures. In figure 3 I have made a scatterplot of the same data, which shows that there are no such sharp edges in the data as shown in the contourf-plot (I added contour-lines just for reference). Example dataset can be downloaded through Dropbox-link given below. The data set contains three txt-files: X, Y, Z. Each of them are an 500x500 arrays, which can be directly used with contourf(), i.e. plt.contourf(X,Y,Z,...). The code that used was
plt.contourf(X,Y,Z,10, cmap=plt.cm.jet)
plt.contour(X,Y,Z,10,colors='black', linewidths=0.5)
plt.axis('equal')
plt.axis('off')
Does anyone have an idea why this happens? I would appreciate any insight on this!
Cheers,
Jussi
Below are the details of my setup:
Python 3.7.0
IPython 6.5.0
matplotlib 2.2.3
Matplotlib output
Matlab output
Matplotlib-scatter
Link to data set
The confusing thing about the matlab plot is that its colorbar shows much more levels than there are actually in the plot. Hence you don't see the actual intervals that are contoured.
You would achieve the same result in matplotlib by choosing 12 instead of 11 levels.
import numpy as np
import matplotlib.pyplot as plt
X, Y, Z = [np.loadtxt("data/roundcontourdata/{}.txt".format(i)) for i in list("XYZ")]
levels = np.linspace(Z.min(), Z.max(), 12)
cntr = plt.contourf(X,Y,Z,levels, cmap=plt.cm.jet)
plt.contour(X,Y,Z,levels,colors='black', linewidths=0.5)
plt.colorbar(cntr)
plt.axis('equal')
plt.axis('off')
plt.show()
So in conclusion, both plots are correct and show the same data. Just the levels being automatically chosen are different. This can be circumvented by choosing custom levels depending on the desired visual appearance.

Why does DataFrameGroupBy.boxplot method throw error when given argument "subplots=True/False"?

I can use DataFrameGroupBy.boxplot(...) to create a boxplot in the following way:
In [15]: df = pd.DataFrame({"gene_length":[100,100,100,200,200,200,300,300,300],
...: "gene_id":[1,1,1,2,2,2,3,3,3],
...: "density":[0.4,1.1,1.2,1.9,2.0,2.5,2.2,3.0,3.3],
...: "cohort":["USA","EUR","FIJ","USA","EUR","FIJ","USA","EUR","FIJ"]})
In [17]: df.groupby("cohort").boxplot(column="density",by="gene_id")
In [18]: plt.show()
This produces the following image:
This is exactly what I want, except instead of making three subplots, I want all the plots to be in one plot (with different colors for USA, EUR, and FIJ). I've tried
In [17]: df.groupby("cohort").boxplot(column="density",subplots=False,by="gene_id")
but it produces the error
KeyError: 'gene_id'
I think the problem has something to do with the fact that by="gene_id" is a keyword sent to the matplotlib boxplot method. If someone has a better way of producing the plot I am after, perhaps by using DataFrame.boxplot(?) instead, please respond here. Thanks so much!
To use the pure pandas functions, I think you should not GroupBy before calling boxplot, but instead, request to group by certain columns in the call to boxplot on the DataFrame itself:
df.boxplot(column='density',by=['gene_id','cohort'])
To get a better-looking result, you might want to consider using the Seaborn library. It is designed to help precisely with this sort of tasks:
sns.boxplot(data=df,x='gene_id',y='density',hue='cohort')
EDIT to take into account comment below
If you want to have each of your cohort boxplots stacked/superimposed for each gene_id, it's a bit more complicated (plus you might end up with quite an ugly output). You cannot do this using Seaborn, AFAIK, but you could with pandas directly, by using the position= parameter to boxplot (see doc). The catch it to generate the correct sequence of positions to place the boxplots where you want them, but you'll have to fix the tick labels and the legend yourself.
pos = [i for i in range(len(df.gene_id.unique())) for _ in range(len(df.cohort.unique()))]
df.boxplot(column='density',by=['gene_id','cohort'],positions=pos)
An alternative would be to use seaborn.swarmplot instead of using boxplot. A swarmplot plots every point instead of the synthetic representation of boxplots, but you can use the parameter split=False to get the points colored by cohort but stacked on top of each other for each gene_id.
sns.swarmplot(data=df,x='gene_id',y='density',hue='cohort', split=False)
Without knowing the actual content of your dataframe (number of points per gene and per cohort, and how separate they are in each cohort), it's hard to say which solution would be the most appropriate.