Adding descriptive stats to this plot - pandas

In pandas/seaborn:
sns.distplot(combo['resubmits'], kde=False, bins=8)
plt.savefig("g1.png")
This makes a very pretty histogram. I want to include a textual "legend" showing the mean, standard deviation, n, etc. as numbers in a box. You would think this is common enough that there would be a semi-automatic way to do it, but I can't find one.

There is a feature request for that.
However, note that using matplotlib.pyplot.axvline, you can easily do it yourself for now.
from matplotlib import pyplot as plt
plt.axvline(x)
where x = combo['resubmits'].mean(). The optional ymin and ymax arguments of axvline are fractions of the axes height (0 to 1), not data values, so the defaults already span the whole plot.
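For the text box itself, one possible approach is to compute the statistics by hand and place them with Axes.text. The following is only a sketch under assumptions: the real combo['resubmits'] column is not shown, so synthetic Poisson data stands in for it, and a plain matplotlib histogram replaces the seaborn call.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-in for combo['resubmits']; the real data is not available.
rng = np.random.default_rng(0)
values = rng.poisson(3, size=200)

fig, ax = plt.subplots()
ax.hist(values, bins=8)

# Descriptive statistics to display.
mean = values.mean()
std = values.std(ddof=1)  # sample standard deviation
n = len(values)

# Dashed vertical line at the mean; by default it spans the full axes height.
ax.axvline(mean, color="k", linestyle="--")

# Text box in the upper-right corner, positioned in axes coordinates (0 to 1).
summary = f"mean = {mean:.2f}\nstd = {std:.2f}\nn = {n}"
box = ax.text(
    0.97, 0.97, summary,
    transform=ax.transAxes, ha="right", va="top",
    bbox=dict(boxstyle="round", facecolor="white", alpha=0.8),
)
```

Calling fig.savefig("g1.png") afterwards would write the figure to disk as in the question.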


Matplotlib value on top left and remove it

I have this array with 10 values.
I understand that the values in my array have many digits after the decimal point.
But I notice there is a value in the top-left corner of the plot.
Does anyone know what it is and how to remove it?
Thank you in advance.
the array:
0.00409960926442099
0.00409960926442083
0.004099609264420652
0.004099609264420653
0.004099609264420585
0.0040996092644205884
0.004099609264420545
0.004099609264420517
0.004099609264420514
0.004099609264420513
As your values are all very close together, the usual ticks would all be the same. For example, if you use '%.6f' as the tick format, you'd get '0.004100' for each of the ticks. That would not be very helpful. Therefore, matplotlib labels the y-ticks with a scale factor '1e-16' together with a base value '+4.099609264420e-3' (matplotlib itself calls this base value the offset). Each real y-tick value is the base plus the scale factor times the tick label.
To get rid of these strange numbers, you have to re-evaluate what exactly you want to achieve with your plot. If you'd set some y-limits (e.g. plt.ylim(0.004099, 0.004100)), you'd get a quite dull horizontal line. Note that 1e-16 is very close to the maximum precision you can get using standard floating-point math.
Here is some demo code to show how it would look with the '%.6f' format:
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
plt.plot([0.00409960926442099, 0.00409960926442083, 0.004099609264420652, 0.004099609264420653,
0.004099609264420585, 0.0040996092644205884, 0.004099609264420545, 0.004099609264420517,
0.004099609264420514, 0.004099609264420513])
plt.gca().yaxis.set_major_formatter(mtick.FormatStrFormatter('%.6f'))
plt.tight_layout()
plt.show()
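If the goal is only to make the text in the corner disappear, a possible alternative is to switch the offset off on the default formatter. This is a sketch, not a recommendation: with values this close together, the tick labels then have to carry all the digits themselves and are likely to become very long.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

values = [0.00409960926442099, 0.00409960926442083, 0.004099609264420652,
          0.004099609264420653, 0.004099609264420585, 0.0040996092644205884,
          0.004099609264420545, 0.004099609264420517, 0.004099609264420514,
          0.004099609264420513]

fig, ax = plt.subplots()
ax.plot(values)

# Ask the default ScalarFormatter on the y axis not to factor out an offset.
ax.ticklabel_format(axis="y", useOffset=False)

formatter = ax.yaxis.get_major_formatter()
fig.tight_layout()
```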

Draw an ordinary plot with the same style as in plt.hist(histtype='step')

The method plt.hist() in pyplot has a way to create a 'step-like' plot style when calling
plt.hist(data, histtype='step')
but the 'ordinary' methods that plot raw data without processing (plt.plot(), plt.scatter(), etc.) apparently do not have style options to obtain the same result. My goal is to plot a given set of points using that style, without making a histogram of these points.
Is that achievable with standard library methods for plotting a given 2-D set of points?
I also think that there is at least one hack (generating a fake distribution which would have histogram equal to our data) and a 'low-level' solution to draw each segment manually, but none of these ways seems favorable.
Maybe you are looking for drawstyle="steps".
import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt
data = np.cumsum(np.random.randn(10))
plt.plot(data, drawstyle="steps")
plt.show()
Note that this is slightly different from histograms, because the lines do not go to zero at the ends.
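Where the step should change relative to each data point can also be controlled. The sketch below, using made-up random data, shows two ways to do this: Axes.step with its where argument, and the explicit "steps-post" draw style on an ordinary plot call.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import matplotlib.pyplot as plt

# Made-up data, only for illustration.
rng = np.random.default_rng(42)
y = np.cumsum(rng.standard_normal(10))
x = np.arange(len(y))

fig, ax = plt.subplots()

# Steps centered on each x value.
(line_mid,) = ax.step(x, y, where="mid", label="step(where='mid')")

# Steps that change after each x value, set through the draw style.
(line_post,) = ax.plot(x, y, drawstyle="steps-post", label="drawstyle='steps-post'")

ax.legend()
```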

Better ticks and tick labels with log scale

I am trying to get better looking log-log plots and I almost got what I want except for a minor problem.
The reason my example throws off the standard settings is that the x values are confined within less than one decade and I want to use decimal, not scientific notation.
Allow me to illustrate with an example:
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib as mpl
import numpy as np
x = np.array([0.6,0.83,1.1,1.8,2])
y = np.array([1e-5,1e-4,1e-3,1e-2,0.1])
fig1,ax = plt.subplots()
ax.plot(x,y)
ax.set_xscale('log')
ax.set_yscale('log')
which produces:
There are two problems with the x axis:
The use of scientific notation, which in this case is counterproductive
The horrible "offset" at the lower right corner
After much reading, I added three lines of code:
ax.xaxis.set_major_formatter(mpl.ticker.ScalarFormatter())
ax.xaxis.set_minor_formatter(mpl.ticker.ScalarFormatter())
ax.ticklabel_format(style='plain',axis='x',useOffset=False)
This produces:
My understanding of this is that there are 5 minor ticks and 1 major one. It is much better, but still not perfect:
I would like some additional ticks between 1 and 2
Formatting of label at 1 is wrong. It should be "1.0"
So I inserted the following line before the formatter statement:
ax.xaxis.set_major_locator(mpl.ticker.MultipleLocator(0.2))
I finally get the ticks I want:
I now have 8 major and 2 minor ticks. Now, this almost looks right except for the fact that the tick labels at 0.6, 0.8 and 2.0 appear bolder than the others. What is the reason for this and how can I correct it?
The reason some of the labels appear bold is that they are drawn twice, once as a major and once as a minor tick label. If two texts overlap perfectly, they appear bolder due to antialiasing.
You may decide to use only minor tick labels and suppress the major ones with a NullLocator.
Since the locations of the tick labels you wish to have are really specific, there is no automatic locator that would provide them out of the box. For this special case it may be easiest to use a FixedLocator and specify the tick locations you wish to have as a list.
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np
x = np.array([0.6,0.83,1.1,1.8,2])
y = np.array([1e-5,1e-4,1e-3,1e-2,0.1])
fig1,ax = plt.subplots(dpi=72, figsize=(6,4))
ax.plot(x,y)
ax.set_xscale('log')
ax.set_yscale('log')
locs = np.append( np.arange(0.1,1,0.1),np.arange(1,10,0.2))
ax.xaxis.set_minor_locator(ticker.FixedLocator(locs))
ax.xaxis.set_major_locator(ticker.NullLocator())
ax.xaxis.set_minor_formatter(ticker.ScalarFormatter())
plt.show()
For a more generic labeling, one could of course subclass a locator, but we would then need to know the logic to use to determine the ticklabels. (As I do not see a well defined logic for the desired ticks from the question, I feel it would be wasted effort to provide such a solution for now.)
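The opposite choice is also possible: keep the MultipleLocator from the question for the major ticks and suppress the minor tick labels instead, so that no two labels overlap. This is a sketch of that variant, not the approach taken above; the '%.1f' format is an assumption made here to get "1.0" rather than "1".

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np

x = np.array([0.6, 0.83, 1.1, 1.8, 2])
y = np.array([1e-5, 1e-4, 1e-3, 1e-2, 0.1])

fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xscale("log")
ax.set_yscale("log")

# Major ticks every 0.2, labeled with one decimal place.
ax.xaxis.set_major_locator(ticker.MultipleLocator(0.2))
ax.xaxis.set_major_formatter(ticker.FormatStrFormatter("%.1f"))

# Keep the minor tick marks but draw no labels for them.
ax.xaxis.set_minor_formatter(ticker.NullFormatter())

fig.canvas.draw()
minor_labels = [t.get_text() for t in ax.xaxis.get_minorticklabels()]
```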

Why does DataFrameGroupBy.boxplot method throw error when given argument "subplots=True/False"?

I can use DataFrameGroupBy.boxplot(...) to create a boxplot in the following way:
In [15]: df = pd.DataFrame({"gene_length":[100,100,100,200,200,200,300,300,300],
...: "gene_id":[1,1,1,2,2,2,3,3,3],
...: "density":[0.4,1.1,1.2,1.9,2.0,2.5,2.2,3.0,3.3],
...: "cohort":["USA","EUR","FIJ","USA","EUR","FIJ","USA","EUR","FIJ"]})
In [17]: df.groupby("cohort").boxplot(column="density",by="gene_id")
In [18]: plt.show()
This produces the following image:
This is exactly what I want, except instead of making three subplots, I want all the plots to be in one plot (with different colors for USA, EUR, and FIJ). I've tried
In [17]: df.groupby("cohort").boxplot(column="density",subplots=False,by="gene_id")
but it produces the error
KeyError: 'gene_id'
I think the problem has something to do with the fact that by="gene_id" is a keyword sent to the matplotlib boxplot method. If someone has a better way of producing the plot I am after, perhaps by using DataFrame.boxplot(?) instead, please respond here. Thanks so much!
To use the pure pandas functions, I think you should not GroupBy before calling boxplot, but instead, request to group by certain columns in the call to boxplot on the DataFrame itself:
df.boxplot(column='density',by=['gene_id','cohort'])
To get a better-looking result, you might want to consider using the Seaborn library (import seaborn as sns). It is designed to help with precisely this sort of task:
sns.boxplot(data=df,x='gene_id',y='density',hue='cohort')
EDIT to take into account comment below
If you want to have each of your cohort boxplots stacked/superimposed for each gene_id, it's a bit more complicated (plus you might end up with quite an ugly output). You cannot do this using Seaborn, AFAIK, but you could with pandas directly, by using the positions= parameter to boxplot (see doc). The catch is to generate the correct sequence of positions to place the boxplots where you want them, and you'll have to fix the tick labels and the legend yourself.
pos = [i for i in range(len(df.gene_id.unique())) for _ in range(len(df.cohort.unique()))]
df.boxplot(column='density',by=['gene_id','cohort'],positions=pos)
An alternative would be to use seaborn.swarmplot instead of boxplot. A swarmplot plots every point instead of the synthetic representation of boxplots, and you can use the parameter dodge=False (named split=False in older seaborn versions) to get the points colored by cohort but stacked on top of each other for each gene_id.
sns.swarmplot(data=df,x='gene_id',y='density',hue='cohort', dodge=False)
Without knowing the actual content of your dataframe (number of points per gene and per cohort, and how separate they are in each cohort), it's hard to say which solution would be the most appropriate.
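For reference, the grouped DataFrame.boxplot call can be tried on the sample data from the question. This is a minimal sketch; the rot=45 argument and the n_groups variable are additions made here for readability and checking, not part of the original suggestion.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "gene_length": [100, 100, 100, 200, 200, 200, 300, 300, 300],
    "gene_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "density": [0.4, 1.1, 1.2, 1.9, 2.0, 2.5, 2.2, 3.0, 3.3],
    "cohort": ["USA", "EUR", "FIJ", "USA", "EUR", "FIJ", "USA", "EUR", "FIJ"],
})

# One box per (gene_id, cohort) combination, all in a single axes.
ax = df.boxplot(column="density", by=["gene_id", "cohort"], rot=45)

# Number of boxes to expect: one per distinct (gene_id, cohort) pair.
n_groups = df.groupby(["gene_id", "cohort"]).ngroups
```

With this sample data every (gene_id, cohort) pair holds a single observation, so each box collapses to a line; real data with several points per pair would give proper boxes.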

Matplotlib namespace issues?

I have a question regarding the Matplotlib.pyplot and namespaces.
See the following code:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm
x=np.linspace(0,1,28)
color=iter(cm.gist_rainbow_r(np.linspace(0,1,28)))
plt.clf()
for s in range(28):
    c=next(color)
    plt.plot(x,x*s, c=c)
plt.show()
The idea was to have the plots in different colors of the rainbow map.
Now what happens is that it works on the first execution, but then things get weird.
On several consecutive executions the colormap stops being used and the default color cycle is used instead.
I suspect the problem lies in the "c=c" in the plot call, but I have played around with different names ("c", "color", ...) and could not find a pattern.
Can someone reproduce the problem (try running the code at least 5 times or so consecutively) and explain what is going on here?
Thanks
This is a known issue with mpl + Python 3.4+ that has been fixed in mpl v1.5+.
Many of the style parameters have multiple aliases (e.g. 'c' vs 'color') which mpl was not merging properly, so the artists were essentially being told two different colors; internally this means there is a dictionary with both 'c' and 'color' in it.
In Python 3.4+ the iteration order of dictionaries varies from process to process by default, because the seed for the underlying hash table is randomized (this was to prevent a possible DoS attack based on intentional hash table collisions). In older versions of Python it so happened that the user-supplied color always came later in the iteration order, so things coincidentally worked.
The simple workaround (iirc) is to use plot(x, y, color=c) or to update to mpl 1.5.1.
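Applied to the code from the question, the workaround could look like the sketch below. The use of plt.get_cmap and of explicit figure and axes objects is a stylistic choice made here, not something the fix requires.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import matplotlib.pyplot as plt

n = 28
x = np.linspace(0, 1, n)

# One RGBA color per line, sampled evenly from the reversed rainbow colormap.
colors = plt.get_cmap("gist_rainbow_r")(np.linspace(0, 1, n))

fig, ax = plt.subplots()
for s, c in enumerate(colors):
    # Pass the full keyword 'color' rather than the alias 'c'.
    ax.plot(x, x * s, color=c)
```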