What exactly do the whiskers in pandas' boxplots specify? - pandas

In python-pandas boxplots with default settings, the red bar is the median (I originally wrote mean), and the box marks the 25th and 75th percentiles, but what exactly do the whiskers mean in this case? Where is the documentation that gives the exact definition (I couldn't find it)?
Example code:
df.boxplot()
Example result:

Pandas just wraps the boxplot function from matplotlib. The matplotlib docs have the definition of the whiskers in detail:
whis : float, sequence, or string (default = 1.5)
As a float, determines the reach of the whiskers beyond the
first and third quartiles. In other words, where IQR is the
interquartile range (Q3-Q1), the upper whisker will extend to the last
datum less than Q3 + whis*IQR. Similarly, the lower whisker will
extend to the first datum greater than Q1 - whis*IQR. Beyond the
whiskers, data are considered outliers and are plotted as individual
points.
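To make the quoted rule concrete, here is a minimal sketch of the default float `whis` behaviour using only NumPy. The helper name `whisker_ends` and the sample data are made up for illustration, and the sketch ignores the edge cases matplotlib handles (for example when no datum lies between a quartile and its fence).

```python
import numpy as np

def whisker_ends(values, whis=1.5):
    """Return (lower, upper) whisker positions for a float `whis` (illustrative helper)."""
    data = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    # Fences: the farthest a whisker is allowed to reach.
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    # Each whisker stops at the most extreme datum that is still inside its fence.
    lower = data[data >= lower_fence].min()
    upper = data[data <= upper_fence].max()
    return lower, upper

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # made-up data with one outlier
low, high = whisker_ends(sample)
print(low, high)  # 30 lies beyond the upper fence, so it would be drawn as a flier
```

Note that the whiskers end at actual data points, not at the fences themselves.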
Matplotlib (and Pandas) also gives you a lot of options to change this default definition of the whiskers:
Set this to an unreasonably high value to force the whiskers to show
the min and max values. Alternatively, set this to an ascending
sequence of percentile (e.g., [5, 95]) to set the whiskers at specific
percentiles of the data. Finally, whis can be the string 'range' to
force the whiskers to the min and max of the data.
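A quick way to see what these options do without drawing anything is `matplotlib.cbook.boxplot_stats`, which returns the statistics that `boxplot` draws. The sample data below is made up. As far as I know, newer matplotlib releases dropped the 'range' string in favour of `whis=(0, 100)`, so the sketch uses the latter.

```python
import numpy as np
from matplotlib import cbook

sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)  # made-up data

default_stats = cbook.boxplot_stats(sample)[0]                # whis=1.5 (the default)
minmax_stats = cbook.boxplot_stats(sample, whis=(0, 100))[0]  # whiskers reach min and max
pct_stats = cbook.boxplot_stats(sample, whis=(5, 95))[0]      # fences at the 5th/95th percentiles

for name, stats in [("default", default_stats), ("min/max", minmax_stats), ("5/95", pct_stats)]:
    # whislo/whishi are the whisker ends; fliers are the points drawn individually
    print(name, stats["whislo"], stats["whishi"], list(stats["fliers"]))
```

The same `whis` values can be passed straight to `df.boxplot(whis=...)`, since pandas forwards the keyword to matplotlib.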
Below is a graphic from a stats.stackexchange answer that illustrates this. Note that k=1.5 if you don't supply the whis keyword in pandas.

From Amelio Vazquez-Reina's answer in Boxplots in matplotlib: Markers and outliers:
The outliers (the + markers in the boxplot) are simply points outside of the wide [(Q1-1.5 IQR), (Q3+1.5 IQR)] margin below.
FYI: Confused by location of fences in box-whisker plots

You mention in your question that the red line is the mean - it is actually the median.
From the matplotlib link mentioned by Chang She above:
The box extends from the lower to upper quartile values of the data,
with a line at the median. The whiskers extend from the box to show
the range of the data. Flier points are those past the end of the
whiskers.
I didn't experiment, but there is a 'meanline' option which might put the line at the mean.
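For what it's worth, here is a small sketch of how that option can be tried out. It assumes `meanline=True` has to be combined with `showmeans=True` (that is how I understand the matplotlib signature), and the data is made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)  # made-up sample

fig, ax = plt.subplots()
# showmeans=True adds the mean; meanline=True draws it as a line across the box
# instead of a point marker. The median line is drawn regardless.
artists = ax.boxplot(values, showmeans=True, meanline=True)

median_y = artists["medians"][0].get_ydata()[0]
mean_y = artists["means"][0].get_ydata()[0]
print(median_y, mean_y)
```

With pandas the same keywords should pass through, e.g. `df.boxplot(showmeans=True, meanline=True)`.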

These are specified in the matplotlib documentation. By default, each whisker extends to the most extreme data point that lies within 1.5 times the interquartile range of the box edge; the multiple is set by the whis argument.


why is ggplot2 geom_col misreading discrete x axis labels as continuous?

Aim: plot a column chart representing concentration values at discrete sites
Problem: the 14 site labels are numeric, so I think ggplot2 is assuming continuous data and adding spaces for what it sees as 'missing numbers'. I only want 14 columns with 14 marks/labels, matching the 14 values in the dataframe. I've tried converting the sites to factors and to characters, but neither works.
Also, how do you ensure the y-axis ends at '0', so the bottom of the columns meet the x-axis?
Thanks
Data:
Sites: 2,4,6,7,8,9,10,11,12,13,14,15,16,17
Concentration: 10,16,3,15,17,10,11,19,14,12,14,13,18,16
You have two questions in one with two pretty straightforward answers:
1. How to force a discrete axis when your column is a continuous one? To make ggplot2 draw a discrete axis, the data must be discrete. You can force your numeric data to be discrete by converting to a factor. So, instead of x=Sites in your plot code, use x=as.factor(Sites).
2. How to eliminate the white space below the columns in a column plot? You can control the limits of the y axis via the scale_y_continuous() function. By default, the limits extend a bit past the actual data (in this case, from 0 to the max Concentration). You can override that behavior via the expand= argument. Check the documentation for expansion() for more details, but here I'm going to use mult=, which uses a multiplication to find the new limits based on the data. I'm using 0 for the lower limit to make the lower axis limit equal the minimum in your data (0), and 0.05 as the upper limit to expand the chart limits about 5% past the max value (this is default, I believe).
Here's the code and resulting plot.
library(ggplot2)
df <- data.frame(
  Sites = c(2,4,6,7,8,9,10,11,12,13,14,15,16,17),
  Concentration = c(10,16,3,15,17,10,11,19,14,12,14,13,18,16)
)
ggplot(df, aes(x=as.factor(Sites), y=Concentration)) +
  geom_col(color="black", fill="lightblue") +
  scale_y_continuous(expand=expansion(mult=c(0, 0.05))) +
  theme_bw()

"Zoom in" on a violinplot whilst keeping accurate quartile lines (matplotlib/seaborn)

TL;DR: How can I get a subrange of a violinplot whilst keeping accurate quartile lines?
I am using seaborn violinplots to make static charts for a report, but as far as I can tell, there's no way to redraw a particular area between limits whilst retaining the 25/median/75 quartile lines of the original dataset.
Here's my example dataset as a violin. The 25/median/75 values are left side: 1.0/5.0/9.0; right side: 2.0/5.0/9.0
My data has such a long tail that all the useful info is scrunched up into a tiny area. I want to ignore (but not throw away) the tail and show a closer look at the interesting bit.
I tried to reset the ylim using ax.set(ylim=(0, upp)), but the resultant graph is not great: it's jaggy and the inner lines don't meet the violin edge.
Is there a way to reset the y-axis limits but get a better quality result?
Next I tried to cut off the tail by dropping values from the dataset. I dropped anything over the 97th centile. The violin looks way better, but the quartile lines have been recalculated for this new dataset. They're showing a median of about 4, not 5 as per the original dataset.
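The shift is easy to reproduce with NumPy alone. This is only a sketch on a hypothetical long-tailed sample (lognormal, chosen arbitrarily), not the dataset from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.lognormal(mean=1.5, sigma=1.0, size=10_000)  # hypothetical long-tailed sample

cutoff = np.percentile(data, 97)
trimmed = data[data <= cutoff]  # drop everything above the 97th percentile

full_q = np.percentile(data, [25, 50, 75])
trim_q = np.percentile(trimmed, [25, 50, 75])

# Every quartile of the trimmed data sits at or below the corresponding
# quartile of the full data, because only large values were removed.
print("full   :", full_q)
print("trimmed:", trim_q)
```

So any quartile lines computed from the trimmed data will be drawn too low relative to the original dataset.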
I'm using inner="quartile", so the code that gets called in Seaborn is _ViolinPlotter::draw_quartiles
def draw_quartiles(self, ax, data, support, density, center, split=False):
    """Draw the quartiles as lines at width of density."""
    q25, q50, q75 = np.percentile(data, [25, 50, 75])
    self.draw_to_density(ax, center, q25, support, density, split,
                         linewidth=self.linewidth,
                         dashes=[self.linewidth * 1.5] * 2)
As you can see, it assumes (understandably) that one wants to draw the quartile lines at percentiles 25, 50 and 75. It'd be amazeballs if there was a way I could call draw_to_density with my own values (is there?).
At the moment, I am attempting to manually adjust the position of the lines. It's trivial to figure out & set the y-values:
for l in ax.lines:
    l.set_ydata(<get correct quartile value from original dataset>)
but I'm finding it hard to figure out the limits for x, i.e. the density of the distribution at the quartiles. It seems to involve gaussian kde, and tbh it's getting hacky and inelegant at this point. Is there an easy way to calculate how long each line should be?
What do you suggest?
Thanks for your help
Lnr
With thanks to @JohanC.
I added gridsize=1000 to the violinplot parameters and used ax.set(ylim=(0, upp)) to resize the y-axis so that it shows the range from 0 to upp, where upp is the upper limit. The graph looks much prettier.
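Roughly, the approach looks like the sketch below. The data is a hypothetical long-tailed sample and `upp` is simply taken as its 97th percentile; both are assumptions for illustration. Because the full dataset is still passed to seaborn, the quartile lines are computed from all the data and only the view is cropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(0)
values = rng.lognormal(mean=1.5, sigma=1.0, size=5_000)  # hypothetical long-tailed data
upp = float(np.percentile(values, 97))  # assumed upper display limit

fig, ax = plt.subplots()
# A finer KDE grid keeps the outline smooth after zooming in.
sns.violinplot(y=values, inner="quartile", gridsize=1000, ax=ax)
# Crop the view only; no data is dropped, so the quartiles stay correct.
ax.set(ylim=(0, upp))
```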

Interpreting the Y values of a normal distribution

I've written this code to generate a normal distribution of a set of values 1,2,3 :
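import pandas as pd
import matplotlib.pyplot as plt  # needed for plt.subplots below
df = pd.DataFrame({'col1':[1,2,3]})
print(df)
fig, ax = plt.subplots(1,1)
# density=True is the current spelling of the older normed=True discussed below
df.plot(kind='hist', density=True, ax=ax)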
Returns :
The X values are the range of possible values but how are the Y values interpreted ?
Reading http://www.stat.yale.edu/Courses/1997-98/101/normal.htm the Y value is calculated using :
A normal distribution has a bell-shaped density curve described by its
mean $\mu$ and standard deviation $\sigma$. The density curve is symmetrical,
centered about its mean, with its spread determined by its standard
deviation. The height of a normal density curve at a given point x is
given by
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$
What is the meaning of this formula ?
I think you are confusing two concepts here. A histogram just plots how many times values fall into each bin. So for your list of [1,2,3], the value 1 appears once, and the same goes for 2 and 3. If you had set normed=False, you would get the plot you have now, but with bars of height 1.0.
However, when you set normed=True, you turn on normalization. Note that this has nothing to do with a normal distribution. Have a look at the documentation for hist, which you can find here: http://matplotlib.org/api/pyplot_api.html?highlight=hist#matplotlib.pyplot.hist
There you can see what the normed option does:
If True, the first element of the return tuple will be the counts normalized to form a probability density, i.e., n/(len(x)*dbin), i.e., the integral of the histogram will sum to 1. If stacked is also True, the sum of the histograms is normalized to 1.
So it gives you the formula right there. In your case, you have three points, i.e. len(x)=3. If you look at your plot, you can see that your bins have a width of 0.2, so dbin=0.2. Each of the values 1, 2 and 3 appears only once, so n=1 for each occupied bin. Thus the height of your bars should be 1/(3*0.2) = 1.67, which is exactly what you see in your histogram.
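You can check that arithmetic without plotting. The sketch below uses `np.histogram` with 10 bins (the default bin count for `hist`) and `density=True`, which as far as I know performs the same normalization as the plotting call.

```python
import numpy as np

x = [1, 2, 3]

counts, edges = np.histogram(x, bins=10)             # raw counts per bin
density, _ = np.histogram(x, bins=10, density=True)  # normalized heights

dbin = edges[1] - edges[0]            # bin width: (3 - 1) / 10
expected = counts / (len(x) * dbin)   # n / (len(x) * dbin) from the docs

print(dbin, density.max())
print((density * np.diff(edges)).sum())  # area under the histogram
```

The heights are larger than 1 because it is the area of the bars, not their height, that sums to 1.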
Now for the normal distribution, that is just a specific probability function that is defined as the formula you gave. It is useful in many fields as it relates to uncertainties. You'll see it a lot in statistics for example. The Wikipedia article on it has lots of info.
If you want to generate a list of values that conform to a normal distribution, I would suggest reading the documentation of numpy.random.normal, which will do this for you: https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.normal.html
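To tie the two ideas together, here is a sketch that draws samples with `numpy.random.normal` and compares a normalized histogram of them with the normal density, i.e. 1/(sigma*sqrt(2*pi)) * exp(-((x - mu)/sigma)**2 / 2). The helper name `normal_pdf`, the seed, the sample size and the bin count are arbitrary choices for illustration.

```python
import math
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal density curve at x (illustrative helper)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

np.random.seed(0)  # fixed seed so the run is repeatable
samples = np.random.normal(loc=0.0, scale=1.0, size=100_000)

# Normalized histogram heights approximate the density at the bin centres.
heights, edges = np.histogram(samples, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
max_gap = max(abs(h - normal_pdf(c)) for h, c in zip(heights, centers))

print(normal_pdf(0.0), max_gap)
```

With only three data points, as in the question, the histogram has no chance of resembling the bell curve; it takes many samples for the bar heights to approach it.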

Colorbar for imshow, centered on 0 and with symlog scale

I want to generate a grid of plots, of several arrays, with positive and negative values, with log scale, sharing the same colorbar.
I've achieved the sharing part of the colorbar (using ImageGrid and common max and min values), and I know that I could get a logarithmic scale using LogNorm() on the imshow call in the case of only positive values. But given the presence of negative values, I would need a colorbar on symmetric logarithmic scale.
I have found what looks like the solution at https://stackoverflow.com/a/7741317/1101750 , but running the sample code Yann provides gives me very different results, clearly wrong:
Reviewing the code, I'm not able to grasp what's going on.
In addition, I've discovered that in Matplotlib 1.2, scale.SymmetricalLogScale.SymmetricalLogTransform asks for a new argument that is not explained in the documentation (linscale; judging from the code of other transforms, I assume that leaving it as 1 is safe).
Is the easiest solution subclassing LogNorm?
I've used a pretty simple recipe in the past to do exactly this, without the need to do any subclassing. matplotlib.colors.SymLogNorm provides most of the functionality you need, except that I've found it necessary to generate the tick marks by hand. Note that this solution uses matplotlib 1.3.0, and I may be using features that weren't available with 1.2.
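import numpy as np
import matplotlib.colors
from matplotlib.pyplot import imshow, colorbar

def imshow_symlog(my_matrix, vmin, vmax, logthresh=5):
    # Linear within +/-10**-logthresh, logarithmic outside; assumes vmin < 0 < vmax.
    # The limits are set on the norm, since recent matplotlib rejects
    # vmin/vmax passed to imshow together with a norm.
    norm = matplotlib.colors.SymLogNorm(10**-logthresh,
                                        vmin=float(vmin), vmax=float(vmax))
    img = imshow(my_matrix, norm=norm)
    maxlog = int(np.ceil(np.log10(vmax)))
    minlog = int(np.ceil(np.log10(-vmin)))
    # generate logarithmic ticks: negative decades, zero, positive decades
    tick_locations = ([-(10**x) for x in range(minlog, -logthresh-1, -1)]
                      + [0.0]
                      + [(10**x) for x in range(-logthresh, maxlog+1)])
    cb = colorbar(ticks=tick_locations)
    return img, cb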
Since 1.3 matplotlib has a SymLogNorm. http://matplotlib.org/api/colors_api.html#matplotlib.colors.SymLogNorm
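As a quick illustration of what SymLogNorm does to the data, the sketch below maps a few values through the norm directly, without plotting. The linthresh and the limits are arbitrary values picked for the example. With symmetric limits, zero lands in the middle of the colour range, which is what centres the colorbar on 0.

```python
import numpy as np
from matplotlib.colors import SymLogNorm

# linthresh is the half-width of the linear region around zero (value assumed here).
norm = SymLogNorm(linthresh=1e-2, vmin=-100.0, vmax=100.0)

values = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
mapped = norm(values)  # positions in [0, 1] along the colormap
print(mapped)
```

The same `norm` object can then be passed to several `imshow(..., norm=norm)` calls so that all panels share one colour scale.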

How to change text of y-axes on a matplotlib generated picture

The page is
"http://matplotlib.sourceforge.net/examples/pylab_examples/histogram_demo_extended.html"
Let's look at the y-axis, the numbers there do not make any sense, could we change it to something else that is meaningful?
Except for the cumulative distribution plot and the last one, the y-axes show normalized histogram values, because the normed=1 keyword is set (i.e., the area underneath the histogram equals 1, as in the definition of a probability density function (PDF)).
You can use yticks(), see this example.
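Something along these lines, for instance. The data, tick positions and label texts are all made up for illustration, and the sketch uses `density=True`, the newer spelling of `normed=1`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)
data = np.random.normal(loc=200, scale=25, size=1000)  # made-up sample

fig, ax = plt.subplots()
ax.hist(data, bins=50, density=True)  # y-axis shows probability density

# Replace the default tick labels with custom text at chosen positions.
tick_positions = [0.0, 0.005, 0.010, 0.015]
tick_labels = ["0", "0.5% per unit", "1.0% per unit", "1.5% per unit"]  # hypothetical labels
plt.yticks(tick_positions, tick_labels)
```

This only relabels the axis; the bar heights themselves are unchanged.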