How to change text of y-axes on a matplotlib generated picture - matplotlib

The page is
"http://matplotlib.sourceforge.net/examples/pylab_examples/histogram_demo_extended.html"
Look at the y-axis: the numbers there do not seem to make any sense. Could we change them to something more meaningful?

Except for the cumulative distribution plot and the last one, the y-axes show normalized histogram values, because the normed=1 keyword is set (i.e., the area underneath the histogram equals 1, as in the definition of a probability density function (PDF)). Newer matplotlib releases use density=True instead of normed=1.

You can use yticks(), see this example.
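As a rough sketch of the yticks() suggestion, the density values on the y-axis can be relabelled as the share of samples per bin. The data here is synthetic, and the choice of percentage labels is only one assumption about what counts as "meaningful":

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Synthetic data standing in for the histogram demo (an assumption)
rng = np.random.default_rng(0)
data = rng.normal(200, 25, size=1000)

fig, ax = plt.subplots()
counts, bins, patches = ax.hist(data, bins=10, density=True)

# With density=True the bar heights integrate to 1, so
# height * bin_width is the fraction of samples in that bin.
bin_width = bins[1] - bins[0]
ylim = ax.get_ylim()
ticks = ax.get_yticks()
labels = [f"{t * bin_width:.0%}" for t in ticks]

plt.yticks(ticks, labels)   # keep the tick positions, replace their text
ax.set_ylim(ylim)           # restore the limits in case fixing the ticks moved them
ax.set_ylabel("Share of samples per bin")
```

The tick positions stay where matplotlib put them; only the text attached to each position changes.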

Ordering Seaborn heatmap with non-index variable

I am currently moving from R and ggplot2 to seaborn for a lot of my work, because R was struggling with the size of the data I was using. I am working on a fairly simple heatmap and have been able to render the general heatmap without too many issues, but I am not sure how to adjust the ordering of my categoricals for the heatmap.
In this case my data has this header:
Sample Position Depth Order
Sample is the "y-axis" categorical and Position is the "x-axis" categorical. Depth is the value of the cell. Order is a meta-value calculated elsewhere, but I want to use Order as my ordering value for the y-axis, while retaining Sample as the label. Is there a way to do this?
You need to provide a rectangular format, or matrix, for sns.heatmap. Although you have an Order column for ordering Sample, it's not clear whether there is a unique 'Order' value for each 'Sample' category.
Below I use a simple example: basically you change 'Sample' to a category, ordered according to the mean value of 'Order'. It is like changing the factor levels in R. Also, you need to make sure there are no NaN values, otherwise the heatmap might complain:
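import numpy as np
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'Sample': np.repeat(['A','B','C'], 4),
                   'Position': [1,2,3,4]*3,
                   'Depth': np.random.normal(0, 1, 12),
                   'Order': np.repeat([2,1,3], 4)})
# order the samples by their mean 'Order' value
y_order = df.groupby('Sample')['Order'].agg('mean').sort_values().index
# an ordered categorical fixes the row order used by the heatmap
df['Sample'] = pd.Categorical(df['Sample'], ordered=True, categories=y_order)
sns.heatmap(df.pivot(index='Sample', columns='Position', values='Depth'))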
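An alternative sketch, not taken from the answer above: pivot first and then reindex the rows by the desired order, which avoids converting the column to a categorical. The Depth values are made up and deterministic so that the result is easy to check, and the seaborn call is left as a comment:

```python
import numpy as np
import pandas as pd

# Toy data with the same columns as in the question (values are invented)
df = pd.DataFrame({
    "Sample": np.repeat(["A", "B", "C"], 4),
    "Position": [1, 2, 3, 4] * 3,
    "Depth": np.arange(12, dtype=float),
    "Order": np.repeat([2, 1, 3], 4),
})

# Samples sorted by their mean Order value
y_order = df.groupby("Sample")["Order"].mean().sort_values().index

# Build the Sample x Position matrix, then put the rows into that order
matrix = df.pivot(index="Sample", columns="Position", values="Depth").reindex(y_order)

print(matrix.index.tolist())  # → ['B', 'A', 'C']
# sns.heatmap(matrix) would then draw the rows in this order
```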

Matplotlib's Figure and Axes explanation

I am really pretty new to matplotlib, though I know that it can be very powerful.
I've been reading a number of tutorials and examples, and it's a real hassle to understand how matplotlib's Figure and Axes work. I am illustrating what I am trying to understand with the attached figure.
I know how to create a figure instance of a certain size in inches. However, what bothers me is how I can create subplots and then axes within each subplot, with relative coordinates (bottom=0,left=0,top=1,right=1) as illustrated.
So, for example, I want to create a "parent" plot area (say (6in,10in)). Then, I want to create two subplot areas, each with size (3in,3in), with 1in space from the top, 2in space between the two vertical subplot areas and 1in from the bottom. Then, 1in space on the left and 2in space on the right. At the same time, I would like to be able to get the coordinates of the subplot areas with respect to the main plot area.
Then, inside the first subplot area, I'd like to create 2 Axes instances, with Axes 1 having coordinates (0.1,0.7,0.7,0.2) with respect to Subplot Area 1, and Axes 2 having (0.1,0.2,0.7,0.5). And then of course I'd like to be able to plot on these axes, e.g., ax1.plot()....
If you could provide a sample code to achieve that, then I can study it.
Your help will be very much appreciated!
A subplot and an Axes object are really the same thing. There is not really a "subplot" as you describe it in matplotlib. You can just create your three Axes objects using gridspec, without the need to put them in your "subplots".
There are a few different ways to create Axes instances within your figure.
fig.add_axes will create an Axes instance at the position given to it. You give it [left,bottom,width,height] in figure coordinates (i.e. 0,0 is bottom left, 1,1 is top right).
fig.add_subplot will also create an Axes instance. In this case, rather than giving it a rectangle to be created in, you give it the number of rows and columns of subplots you would like, and then the plot_number, where plot_number starts at 1, increments across rows first and has a maximum of nrows * ncols.
For example, to create the top-left Axes in a grid of 2 rows and 2 columns, you could do the following:
fig.add_subplot(2,2,1)
or the shorthand
fig.add_subplot(221)
There are some more customisable ways to create Axes as well, for example gridspec and subplot2grid which allow for easy creation of many subplots of different shapes and sizes.
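A minimal sketch of the layout from the question using only fig.add_axes. The two helper functions (inches_to_fig and nested_rect) are hypothetical names introduced here, not part of matplotlib; they convert inches to figure fractions and place a rectangle given relative to a parent area:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

FIG_W, FIG_H = 6.0, 10.0  # "parent" figure size in inches

def inches_to_fig(left, bottom, width, height):
    """Convert a rectangle in inches to figure-fraction coordinates."""
    return (left / FIG_W, bottom / FIG_H, width / FIG_W, height / FIG_H)

def nested_rect(parent, rel):
    """Map rel = (left, bottom, width, height), given relative to
    parent (0..1 inside the parent), into figure coordinates."""
    pl, pb, pw, ph = parent
    rl, rb, rw, rh = rel
    return (pl + rl * pw, pb + rb * ph, rw * pw, rh * ph)

fig = plt.figure(figsize=(FIG_W, FIG_H))

# Two 3in x 3in areas: 1in left margin, 1in from the bottom, 2in gap, 1in from the top
area1 = inches_to_fig(1, 6, 3, 3)  # upper area, bottom edge at 1 + 3 + 2 = 6in
area2 = inches_to_fig(1, 1, 3, 3)  # lower area

# Two Axes inside the upper area, positioned relative to that area
ax1 = fig.add_axes(nested_rect(area1, (0.1, 0.7, 0.7, 0.2)))
ax2 = fig.add_axes(nested_rect(area1, (0.1, 0.2, 0.7, 0.5)))
# One Axes filling the lower area
ax3 = fig.add_axes(area2)

ax1.plot([0, 1, 2], [0, 1, 4])
```

The tuples area1 and area2 are the coordinates of the "subplot areas" with respect to the figure, so they can be reused wherever those coordinates are needed.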

How do I create a scale for a second axis without unnecessary (or redundant) plotting?

I have a plot in which I have already plotted all my data, and a "twinned" axis on which I'd like to use another scale, in this case dates. I also have a list of all the dates corresponding to each element of my data, and want to add a scale for the dates to the twinned axis.
For example, I have
ax2 = ax1.twinx()
and lists x_temporal_data, y_day_offsets, y_dates, all of the same length, and have already plotted the relationship between the first two with
ax1.plot(x_temporal_data, y_day_offsets)
and I just want to have a scale on ax2 for the dates in y_dates, since y_day_offsets and y_dates are "synonyms" for the same time information.
Is there a way to do this without "plotting" something I don't need to display (since all my data is already plotted)? For example, I can get the dates to appear perfectly on ax2 with
ax2.plot(len(y_dates)*[some_random_out_of_xrange_value], y_dates)
but that seems like a hack: plotting nothing to "calibrate" the second axis.
Is there a better, more idiomatic way of accomplishing this?
Simply set the scale on the second y-axis to your liking with:
ax2.set_ylim([min(y_dates), max(y_dates)])
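One caveat is that the date scale only lines up with the plotted data if the limits of ax2 correspond to the limits of ax1, which include autoscale margins. A sketch of one way to handle this, assuming the offsets are days counted from a hypothetical base_date (the data values are invented):

```python
import datetime as dt
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

base_date = dt.datetime(2020, 1, 1)  # hypothetical reference date (assumption)
x_temporal_data = [0, 1, 2, 3, 4]
y_day_offsets = [0, 10, 20, 30, 40]
y_dates = [base_date + dt.timedelta(days=d) for d in y_day_offsets]

fig, ax1 = plt.subplots()
ax1.plot(x_temporal_data, y_day_offsets)
ax2 = ax1.twinx()

# Translate ax1's limits (in day offsets) into matplotlib date numbers,
# which are also measured in days, so both scales stay aligned.
lo, hi = ax1.get_ylim()
base_num = mdates.date2num(base_date)
ax2.set_ylim(base_num + lo, base_num + hi)

# Tell ax2 to interpret and label its values as dates; nothing is plotted on it
ax2.yaxis.set_major_locator(mdates.AutoDateLocator())
ax2.yaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
```

If ax1's limits change later (for example after zooming), the limits of ax2 would need to be recomputed the same way.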

Put pcolormesh and contour onto same grid?

I'm trying to display 2D data with axis labels using both contour and pcolormesh. As has been noted on the matplotlib user list, these functions obey different conventions: pcolormesh expects the x and y values to specify the corners of the individual pixels, while contour expects the centers of the pixels.
What is the best way to make these behave consistently?
One option I've considered is to make a "centers-to-edges" function, assuming evenly spaced data:
import numpy as np

def centers_to_edges(arr):
    dx = arr[1]-arr[0]
    newarr = np.linspace(arr.min()-dx/2, arr.max()+dx/2, arr.size+1)
    return newarr
Another option is to use imshow with the extent keyword set.
The first approach doesn't play nicely with 2D axes (e.g., as created by meshgrid or indices) and the second discards the axis numbers entirely.
Is your data on a regular mesh? If it isn't, you can use griddata() to resample it onto one. If your data is too big, sub-sampling or regularization is always possible. If the data is that big, your output image will probably always be small compared with it, and you can exploit this.
If you use imshow() with "extent" and "interpolation='nearest'", you will see that the data is cell-centered, and that extent refers to the edges (corners) of the cells. contour also assumes that the data is cell-centered, but there X,Y must be the centers of the cells. So you need to be careful about the input domain for contour. A trivial example is:
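import numpy as np
from matplotlib.pyplot import imshow, contour

x = np.arange(-10,10,1)
X,Y = np.meshgrid(x,x)
P = X**2+Y**2
# extent gives the outer cell edges; the data values sit at the cell centers
imshow(P,extent=[-10,10,-10,10],interpolation='nearest',origin='lower')
# shift the grid by half a cell so the contours use the cell centers
contour(X+0.5,Y+0.5,P,20,colors='k')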
My tests told me that pcolormesh() is a very slow routine, and I always try to avoid it. griddata and imshow() have always been a good choice for me.
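For completeness, the "centers-to-edges" idea from the question can be sketched for evenly spaced 1D axes, passing the edges to pcolormesh and the centers to contour. The grid and the function P are made up for illustration:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

def centers_to_edges(centers):
    """Return the N+1 cell edges for N evenly spaced cell centers."""
    centers = np.asarray(centers, dtype=float)
    dx = centers[1] - centers[0]
    return np.linspace(centers[0] - dx / 2, centers[-1] + dx / 2, centers.size + 1)

# Cell centers of a 20 x 10 grid (illustrative values)
x = np.arange(-10, 10, 1.0) + 0.5
y = np.arange(-5, 5, 1.0) + 0.5
X, Y = np.meshgrid(x, y)
P = X**2 + Y**2

fig, ax = plt.subplots()
# pcolormesh gets the cell edges: one more value per axis than P has cells
mesh = ax.pcolormesh(centers_to_edges(x), centers_to_edges(y), P)
# contour gets the cell centers, so both layers share the same grid
ax.contour(X, Y, P, 20, colors="k")
```

This only covers regularly spaced 1D axes; curvilinear 2D coordinate arrays would need a different edge construction.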

What exactly do the whiskers in pandas' boxplots specify?

In python-pandas boxplots with default settings, the red bar is the mean (edit: median), and the box spans the 25th and 75th percentiles, but what exactly do the whiskers mean in this case? Where is the documentation that gives the exact definition (I couldn't find it)?
Example code:
df.boxplot()
Example result:
Pandas just wraps the boxplot function from matplotlib. The matplotlib docs have the definition of the whiskers in detail:
whis : float, sequence, or string (default = 1.5)
As a float, determines the reach of the whiskers to the beyond the
first and third quartiles. In other words, where IQR is the
interquartile range (Q3-Q1), the upper whisker will extend to last
datum less than Q3 + whis*IQR). Similarly, the lower whisker will
extend to the first datum greater than Q1 - whis*IQR. Beyond the
whiskers, data are considered outliers and are plotted as individual
points.
Matplotlib (and Pandas) also give you a lot of options to change this default definition of the whiskers:
Set this to an unreasonably high value to force the whiskers to show
the min and max values. Alternatively, set this to an ascending
sequence of percentile (e.g., [5, 95]) to set the whiskers at specific
percentiles of the data. Finally, whis can be the string 'range' to
force the whiskers to the min and max of the data.
Below is a graphic that illustrates this, taken from a stats.stackexchange answer. Note that k=1.5 if you don't supply the whis keyword in Pandas.
From Amelio Vazquez-Reina's answer in Boxplots in matplotlib: Markers and outliers:
The outliers (the + markers in the boxplot) are simply points outside of the wide [(Q1-1.5 IQR), (Q3+1.5 IQR)] margin below.
FYI: Confused by location of fences in box-whisker plots
You mention in your question that the red line is the mean - it is actually the median.
From the matplotlib link mentioned by Chang She above:
The box extends from the lower to upper quartile values of the data,
with a line at the median. The whiskers extend from the box to show
the range of the data. Flier points are those past the end of the
whiskers.
I didn't experiment, but there is a 'meanline' option which might put the line at the mean.
These are specified in the matplotlib documentation. By default the whiskers reach to the most extreme data points that lie within 1.5 times the interquartile range of the box edges.
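The whisker rule quoted from the matplotlib docs can be checked by hand with a few lines of NumPy. This is only a sketch of the default float-valued whis case; whisker_ends is a hypothetical helper name, and the sample data is invented:

```python
import numpy as np

def whisker_ends(data, whis=1.5):
    """Whisker positions and outliers for the default (float whis) rule."""
    data = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    # Whiskers stop at the most extreme data points inside the fences
    lower = data[data >= q1 - whis * iqr].min()
    upper = data[data <= q3 + whis * iqr].max()
    # Everything beyond the whiskers is drawn as an individual point
    fliers = data[(data < lower) | (data > upper)]
    return lower, upper, fliers

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
lower, upper, fliers = whisker_ends(sample)
print(lower, upper, fliers)  # → 1.0 9.0 [30.]
```

Here Q1 = 3.25 and Q3 = 7.75, so the fences sit at -3.5 and 14.5. The whiskers end at the data points 1 and 9, not at the fences themselves, and 30 is plotted as an outlier.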