matplotlib hexbin: Which bin has the highest count [duplicate]

I have two distributions within a hexbin plot, like the one shown:
One distribution has a max value of about 4000, while the other has a max value of about 2500. The plotting colours are therefore different.
I was thinking I could normalize it if I knew the max value of the hexbin plot. How do I know how many points are within the max hexbin other than looking at the colorbar? I am using matplotlib.pyplot.hexbin

You can get the min and max from the norm, which is what is used to normalize the data for color picking.
hb = plt.hexbin(x, y)
print(hb.norm.vmin, hb.norm.vmax)
You could then pass a norm built from these values to the second plot. The catch is that the first plot must cover at least the range of the second; otherwise bins of the second plot that fall outside that range will not be colored distinctly.
Alternatively, and preferably, you can construct a norm which you pass to the hexbin function for both of your plots:
norm = plt.Normalize(min_v, max_v)
hb1 = plt.hexbin(x1, y1, norm=norm)
hb2 = plt.hexbin(x2, y2, norm=norm)
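As a minimal sketch of the shared-norm approach, the per-bin counts can be read back with get_array() on the object hexbin returns, and the larger of the two maxima used as the upper limit of one Normalize instance. The data below is synthetic and the names (ax_a, ax_b, max1, max2) are only illustrative; vmin=0 is an assumption that suits plain counts.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two distributions (assumed data)
x1, y1 = rng.normal(size=(2, 20000))
x2, y2 = rng.normal(size=(2, 8000))

fig, (ax_a, ax_b) = plt.subplots(1, 2, figsize=(10, 4))

# Draw both plots first; get_array() holds the count in each hexagon
hb1 = ax_a.hexbin(x1, y1, gridsize=30)
hb2 = ax_b.hexbin(x2, y2, gridsize=30)
max1 = hb1.get_array().max()
max2 = hb2.get_array().max()
print(max1, max2)

# One norm covering both ranges, applied to both plots
norm = Normalize(vmin=0, vmax=max(max1, max2))
hb1.set_norm(norm)
hb2.set_norm(norm)

# A single colorbar now describes both panels
fig.colorbar(hb2, ax=[ax_a, ax_b])
```

Because both plots share the same norm, a given colour corresponds to the same count in either panel.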

Related

Matplotlib - x axis does not match the data

The description of the data frame
I am trying to find the relationship between budget and revenue_of_investment:
x = dfm_2.budget
y = dfm_2.revenue_of_investment
plt.figure(figsize = (10,8))
plt.xlim((40000,42500000))
plt.scatter(x,y)
The output is:
I know the range of the budget is large, but I can't make sense of the values on the x-axis.
I even set the range explicitly, but the x-axis still doesn't fit the data.
If I understand your question correctly (i.e. that the plot is not displaying all of the data on the x-axis), it's because your upper xlim is too small.
The maximum value of dfm_2.budget is 4.25 * 1e8 (i.e. 425000000), but the upper limit in your plt.xlim() call is 4.25 * 1e7 (i.e. 42500000): you're missing a zero.
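One way to avoid this kind of slip is to derive the limits from the data rather than typing them by hand. The following is only a sketch with synthetic values standing in for dfm_2.budget and dfm_2.revenue_of_investment; the variable names budget and roi are made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-ins for dfm_2.budget and dfm_2.revenue_of_investment
budget = rng.uniform(4e4, 4.25e8, size=200)
roi = rng.normal(size=200)

fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(budget, roi)

# Take the x-limits from the data instead of hard-coding them
ax.set_xlim(budget.min(), budget.max())
```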

matplotlib subplots do not show the exact x tick labels passed to it as list

I am plotting a plot of Accuracy versus the var_smoothing curve of 4 different instances. My values are:
var_smoothing_values
>>
[1e-09, 1e-06, 0.001, 1]
gauss_accuracies
>>
[0.728, 0.8, 0.826, 0.832]
I have used the 2 subplots and on the second subplot, I am plotting this as:
f,ax = plt.subplots(1,2,figsize=(15,5))
ax[1].plot(var_smoothing_values,gauss_accuracies,marker='*',markersize=12)
ax[1].set_ylabel('Accuracy')
ax[1].set_xlabel('var_smoothing values')
ax[1].set_title('Accuracy vs var_smoothing | GaussianNB',size='large')
plt.show()
ax[1].set_xticks(var_smoothing_values) shows only 3 ticks.
How can I show exactly 4 ticks, one for each of my var_smoothing_values?
You need to use a log scale on the x-axis, since your x-values span several orders of magnitude:
ax[1].set_xscale('log')
ax[1].set_xticks(var_smoothing_values);
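Put together with the values from the question, a self-contained version might look like the sketch below. The first subplot is left empty here because the question does not show what it contains.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

var_smoothing_values = [1e-09, 1e-06, 0.001, 1]
gauss_accuracies = [0.728, 0.8, 0.826, 0.832]

fig, ax = plt.subplots(1, 2, figsize=(15, 5))
ax[1].plot(var_smoothing_values, gauss_accuracies, marker='*', markersize=12)

# Switch to a log axis first, then pin the major ticks to the four data values
ax[1].set_xscale('log')
ax[1].set_xticks(var_smoothing_values)

ax[1].set_ylabel('Accuracy')
ax[1].set_xlabel('var_smoothing values')
ax[1].set_title('Accuracy vs var_smoothing | GaussianNB', size='large')
```

On a linear axis the three smallest values all sit practically at zero, which is why their ticks overlap; on the log axis they are evenly spread.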

Matplotlib how to divide an histogram by a constant number

I would like to apply a custom normalization to histograms in matplotlib. In particular, I have two histograms and I would like to divide each of them by a given number (the number of generated events).
I don't want to "normally" normalize it, because the "normal normalization" makes the area equal to 1. What I wish for is basically to divide the value of each bin by a given number N, so that if my histogram has 2 bins, one with 5 entries and one with 3, the resulting "normalized" (or "divided") histogram would have the first bin with 5/N entries and the second one with 3/N.
I searched far and wide and found nothing really helpful. Do you have any handy solution? This is my code, working with pandas:
num_bins = 128
list_1 = dataframe_1['E']
list_2 = dataframe_2['E']
fig, ax = plt.subplots()
ax.set_xlabel('Proton energy [MeV]')
ax.set_ylabel('Normalized frequency')
ax.set_title('Proton energy distribution')
n, bins, patches = ax.hist(list_1, num_bins, density=1, alpha=0.5, color='red', ec='red', label='label_1')
n, bins, patches = ax.hist(list_2, num_bins, density=1, alpha=0.5, color='blue', ec='blue', label='label_2')
plt.legend(loc='upper center', fontsize='x-large')
fig.savefig('NiceTitle.pdf')
plt.close('all')
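One possible approach, sketched here rather than taken from an answer, is to drop density=1 and pass a weights array in which every entry equals 1/N; each bin then holds count/N. The data and the value of n_generated below are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical stand-in for dataframe_1['E']
energies = rng.normal(100.0, 10.0, size=5000)
n_generated = 20000  # hypothetical N (number of generated events)

num_bins = 128
# Every entry contributes 1/N instead of 1, so each bin ends up as count / N
weights = np.full(energies.shape, 1.0 / n_generated)

fig, ax = plt.subplots()
n, bins, patches = ax.hist(energies, num_bins, weights=weights,
                           alpha=0.5, color='red', ec='red', label='label_1')
ax.set_xlabel('Proton energy [MeV]')
ax.set_ylabel('Entries / N')
ax.legend(loc='upper center', fontsize='x-large')
```

The same weights construction would be repeated for the second dataset with its own N.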

Logarithmic trendline different for scatter plot versus linear plot

I am having an issue where the logarithmic function is behaving differently depending on the type of graph I use with the same data. When I generate the equation by hand, it returns the scatterplot linear trendline, but the slope function and linear graph produce a different trendline.
Linear vs Scatter
The equation for the scatter plot logarithmic line is:
y = -0.079ln(x) + 0.424
The equation for the linear plot trendline is:
y = -0.052ln(x) + 0.3138
I can generate the linear plot trendline slope using this equation:
=SLOPE(B2:B64,LN(A2:A64)) = -0.052
But using the general slope equation, I get the scatter plot trendline (using SQL):
SELECT SUM(multipliedresiduals) / SUM(xresidsquared)
FROM (
SELECT *
,log(x.x) - l.avgx xresiduals
,x.y - l.avgy yresiduals
,power(log(x.x) - l.avgx, 2) xresidsquared
,((log(x.x) - l.avgx) * (x.y - l.avgy)) multipliedresiduals
FROM ##logtest x
CROSS JOIN (
SELECT avg(log(x)) avgx
,avg(y) avgy
FROM ##logtest l
) l
) z = -0.0789746757495071 (Scatter Plot Slope)
What's going on? I'm mainly interested in replicating the linear plot trendline equation in SQL.
Here is the data:
https://docs.google.com/spreadsheets/d/1sOlyXaHnUcCuD9J28cKHnrhhcr2hvYSU1iCNWXcTqEA/edit?usp=sharing
Here is the Excel File:
https://www.dropbox.com/s/6hpd4bzvmbxe5pu/ScatterLinearTest.xlsx?dl=0
Line and Scatter graphs in Excel are quite different with regard to the X-axis. In a scatter graph, the x-axis represents actual values. In a line graph, the x-axis entries are labels. If you compute a slope with a line graph, the x-axis will have the values 1, 2, 3, 4, ... no matter what the labels show (e.g. even if they show 7..69). With a scatter graph, the x-axis will have the value of the label.
In your case, the difference between the two slopes is explained by the line graph's x-axis values starting at 1 (even though the first point is labelled 7), while the scatter graph's x-axis values start at 7, the actual value.
So the real slope for the data you present, with "X" starting at 7, is the slope you get from the scatter graph, which is the same as the one your SQL produces.
In order for the SQL equation to match the linear plot trendline equation, you would need to replace the x-axis values with a series [1..n] instead of the actual x-axis values.
I don't have SQL, but the results of these two SLOPE formulas should clarify what I am writing:
Scatter plot: =SLOPE(B2:B64,LN(ROW(INDIRECT("7:69")))) returns -0.078974676
Line Plot: =SLOPE(B2:B64,LN(ROW(INDIRECT("1:63")))) returns -0.051735504
The first is the Scatter plot, the second is the Line plot
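The effect described above can be reproduced outside Excel. The sketch below uses synthetic y values (the spreadsheet data is not reproduced here, so the numbers will not match the ones quoted) and a hypothetical helper log_slope that applies the same residual formula as the SQL query.

```python
import numpy as np

def log_slope(x, y):
    """Least-squares slope of y against ln(x): sum(dx * dy) / sum(dx ** 2)."""
    lx = np.log(x)
    dx = lx - lx.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sum(dx ** 2)

# Actual x values 7..69 (63 points, matching rows 2..64 in the sheet)
x = np.arange(7, 70)

# Synthetic y values, assumed only for illustration
rng = np.random.default_rng(3)
y = 0.42 - 0.08 * np.log(x) + rng.normal(scale=0.01, size=x.size)

# Scatter chart: slope against the real x values
scatter_slope = log_slope(x, y)

# Line chart: x is silently replaced by the positions 1..n
line_slope = log_slope(np.arange(1, x.size + 1), y)

print(scatter_slope, line_slope)
```

The two slopes differ even though y is identical, because ln(1..63) and ln(7..69) are spread very differently.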

How do I add error bars on a histogram?

I've created a histogram to see the number of similar values in a list.
data = np.genfromtxt("Pendel-Messung.dat")
stdm = (np.std(data))/((700)**(1/2))
breite = 700**(1/2)
fig2 = plt.figure()
ax1 = plt.subplot(111)
ax1.set_ylim(0,150)
ax1.hist(data, bins=breite)
ax2 = ax1.twinx()
ax2.set_ylim(0,150/700)
plt.show()
I want to create error bars (the error being stdm) in the middle of each bar of the histogram. I know I can create errorbars using
plt.errorbar("something", data, yerr = stdm)
But how do I make them start in the middle of each bar? I thought of just adding breite/2, but that gives me an error.
Sorry, I'm a beginner! Thank you!
ax.hist returns the frequencies (n) and the bin edges, so we can use those for y and x in the call to errorbar. Also, the bins argument to hist takes either an integer number of bins or a sequence of bin edges. I think you were trying to give a bin width of breite? If so, this should work (you just need to select an appropriate xmax):
n,bin_edges,patches = ax.hist(data,bins=np.arange(0,xmax,breite))
# bin centres: left edge of each bin plus half the bin width
x = bin_edges[:-1]+breite/2.
ax.errorbar(x,n,yerr=stdm,linestyle='None')
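A self-contained variant is sketched below. It assumes, unlike the answer above, that sqrt(700) was meant as the number of bins rather than a bin width, and it uses synthetic data in place of Pendel-Messung.dat. Taking the midpoint of adjacent edges gives the bar centres in either case.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical stand-in for the 700 values in Pendel-Messung.dat
data = rng.normal(2.0, 0.1, size=700)

# Error value as defined in the question
stdm = np.std(data) / np.sqrt(data.size)

# Assumption: sqrt(N) is used as the bin count, which must be an integer
num_bins = int(round(np.sqrt(data.size)))

fig, ax = plt.subplots()
n, bin_edges, patches = ax.hist(data, bins=num_bins)

# Centre of each bar is the midpoint between its two edges
centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
ax.errorbar(centers, n, yerr=stdm, linestyle='None', color='black', capsize=2)
```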