Logarithmic trendline different for scatter plot versus linear plot - sql

I am having an issue where the logarithmic trendline behaves differently depending on the type of graph I use with the same data. When I compute the equation by hand, I get the scatter plot trendline, but the SLOPE function and the line graph produce a different trendline.
Linear vs Scatter
The equation for the scatter plot logarithmic line is:
y = -0.079ln(x) + 0.424
The equation for the linear plot trendline is:
y = -0.052ln(x) + 0.3138
I can generate the linear plot trendline slope using this equation:
=SLOPE(B2:B64,LN(A2:A64)) = -0.052
But using the general slope equation, I get the scatter plot trendline (using SQL):
SELECT SUM(multipliedresiduals) / SUM(xresidsquared)
FROM (
SELECT *
,log(x.x) - l.avgx xresiduals
,x.y - l.avgy yresiduals
,power(log(x.x) - l.avgx, 2) xresidsquared
,((log(x.x) - l.avgx) * (x.y - l.avgy)) multipliedresiduals
FROM ##logtest x
CROSS JOIN (
SELECT avg(log(x)) avgx
,avg(y) avgy
FROM ##logtest l
) l
) z = -0.0789746757495071 (Scatter Plot Slope)
What's going on? I'm mainly interested in replicating the linear plot trendline equation in SQL.
Here is the data:
https://docs.google.com/spreadsheets/d/1sOlyXaHnUcCuD9J28cKHnrhhcr2hvYSU1iCNWXcTqEA/edit?usp=sharing
Here is the Excel File:
https://www.dropbox.com/s/6hpd4bzvmbxe5pu/ScatterLinearTest.xlsx?dl=0

Line and Scatter graphs in Excel are quite different with regard to the X-axis. In a scatter graph, the x-axis represents actual values. In a line graph, the x-axis values are only labels. If you compute a slope from a line graph, the x-axis will have the values 1, 2, 3, 4, ... no matter what the labels show (e.g., even if they show 7..69). With a scatter graph, the x-axis will have the value of the label.
In your case, the difference between the two slopes is explained by the line graph x-axis values starting at 1 (even though the first one is labelled 7), while the scatter graph x-axis values start at 7, the actual value.
So, in fact, the real slope for the data you present, with "X" starting at a value of "7", is the slope you get from the scatter graph, which is the same one you are getting in your SQL.
In order for the SQL equation to match the linear plot trendline equation, you would need to replace the x-axis values with a series [1..n] instead of the actual x-axis values.
I don't have SQL, but the results of these two SLOPE formulas should clarify what I am writing:
Scatter plot: =SLOPE(B2:B64,LN(ROW(INDIRECT("7:69")))) -0.078974676
Line Plot: =SLOPE(B2:B64,LN(ROW(INDIRECT("1:63")))) -0.051735504
The first is the Scatter plot, the second is the Line plot
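The same effect can be reproduced outside Excel. The sketch below is in Python rather than SQL and uses synthetic stand-in data, not the spreadsheet from the question; `log_slope` is a hypothetical helper that mirrors the residual-based formula in the SQL query. It fits the same y values once against the actual x values (7..69) and once against the positions 1..63 that a line chart uses.

```python
import math

def log_slope(xs, ys):
    """Least-squares slope of y against ln(x), same formula as the SQL query."""
    lx = [math.log(x) for x in xs]
    avg_x = sum(lx) / len(lx)
    avg_y = sum(ys) / len(ys)
    # sum of multiplied residuals divided by sum of squared x residuals
    num = sum((a - avg_x) * (b - avg_y) for a, b in zip(lx, ys))
    den = sum((a - avg_x) ** 2 for a in lx)
    return num / den

# Synthetic stand-in data (an assumption, NOT the asker's data):
# y follows -0.079*ln(x) + 0.424 exactly for x = 7..69
actual_x = list(range(7, 70))
ys = [-0.079 * math.log(x) + 0.424 for x in actual_x]

# A line chart ignores the labels and uses positions 1..n instead
positions = list(range(1, len(ys) + 1))

scatter_slope = log_slope(actual_x, ys)  # fitted against the real x values
line_slope = log_slope(positions, ys)    # fitted against 1..63
print(scatter_slope, line_slope)
```

The two slopes differ even though the y values are identical. To reproduce the line chart's trendline in SQL, the analogous change would presumably be to feed a 1..n row number into the query in place of x.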

Related

How to adjust Python linear regression y axis

I have been having problems with the price column: every time I plot it, the axis shows decimals, and all my graphs have this problem. I want the axis to show the actual values instead.
Example of a linear graph
This is the dataframe containing the information of the dataset
train is the name of the dataframe.
columns contains the selected columns:
columns = ['Id', 'year', 'distance_travelled(kms)', 'brand_rank', 'car_age']
for i in columns:
    plt.scatter(train[i], y, label='Actual')
    plt.xlabel(i)
    plt.ylabel('price')
    plt.show()

matplotlib subplots do not show the exact x tick labels passed to it as list

I am plotting a plot of Accuracy versus the var_smoothing curve of 4 different instances. My values are:
var_smoothing_values
>>
[1e-09, 1e-06, 0.001, 1]
gauss_accuracies
>>
[0.728, 0.8, 0.826, 0.832]
I have used the 2 subplots and on the second subplot, I am plotting this as:
f,ax = plt.subplots(1,2,figsize=(15,5))
ax[1].plot(var_smoothing_values,gauss_accuracies,marker='*',markersize=12)
ax[1].set_ylabel('Accuracy')
ax[1].set_xlabel('var_smoothing values')
ax[1].set_title('Accuracy vs var_smoothing | GaussianNB',size='large')
plt.show()
ax[1].set_xticks(var_smoothing_values) shows only 3 ticks.
How can I show exactly 4 ticks, one for each of my var_smoothing_values?
You need to use a log scale on the x-axis, since your x-values span several orders of magnitude:
ax[1].set_xscale('log')
ax[1].set_xticks(var_smoothing_values);
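Put together with the data from the question, a minimal runnable sketch could look like this. The `Agg` backend line is an assumption added so the script runs without a display; drop it in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

var_smoothing_values = [1e-09, 1e-06, 0.001, 1]
gauss_accuracies = [0.728, 0.8, 0.826, 0.832]

f, ax = plt.subplots(1, 2, figsize=(15, 5))
ax[1].plot(var_smoothing_values, gauss_accuracies, marker='*', markersize=12)
ax[1].set_ylabel('Accuracy')
ax[1].set_xlabel('var_smoothing values')
ax[1].set_title('Accuracy vs var_smoothing | GaussianNB', size='large')

# Set the scale first, then the ticks: changing the scale resets the tick locator
ax[1].set_xscale('log')
ax[1].set_xticks(var_smoothing_values)

ticks = list(ax[1].get_xticks())
print(ticks)
```

On a linear axis the three smallest values all sit practically at zero, which is why they appeared to collapse; on the log axis each value gets its own position.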

How to avoid initial data changing when plotting additional data in plot

I want to plot two data series in one plot, but when I plot both, one of the series changes: Matplotlib draws lines between the wrong data points.
firsty_values and secondy_values are sorted lists of timestamps spanning one 24-hour interval.
firstx_values and secondx_values are values in the range 18-21.
The first plot shows the two series together, while the last plot shows one of the series alone.
#firsty_values and secondy_values look like this:
#['2019-05-04 00:00:03',
# '2019-05-04 00:02:03',
# ...
# '2019-05-04 23:56:03',
# '2019-05-04 23:58:02']
#firstx_values and secondx_values look like this:
#[18.32,18.34 ..... 19.32,19.31]
plt.plot(firsty_values,firstx_values,'b')
plt.plot(secondy_values, secondx_values, 'g')
plt.ylabel('Temperature [C]')
plt.xlabel('Time')
plt.legend(['SA1_563_04_RT601A', 'SA1_563_04_RT601B'])
plt.xticks([100,604,1053]) #length more than 1053
plt.show()
#plt.plot(firsty_values,firstx_values,'b')
plt.plot(secondy_values, secondx_values, 'g')
plt.ylabel('Temperature [C]')
plt.xlabel('Time')
plt.legend(['SA1_563_04_RT601A', 'SA1_563_04_RT601B'])
plt.xticks([100,604,1053]) #length less than 1053
plt.show()
Output:
Output with both data series :
Output with one data series :
The first plot draws lines between data points that do not lie next to each other. The problem seems to be that some of the data points from the second series are placed out of order, after the points from the first series. This is reflected by the xticks showing three labels when plotting both series and two labels when plotting one series.
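A likely cause, assuming the timestamps are passed as plain strings as the commented sample suggests: Matplotlib treats string x-values as categories and gives each new string the next free position in order of first appearance, so timestamps that occur only in the second series land after all of the first series' points. A minimal sketch of one way around this is to parse the strings into `datetime` objects before plotting. The sample values and the names `first_times`, `second_times`, and `parse` are made up for illustration.

```python
from datetime import datetime

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

# Made-up sample data in the same shape as the lists in the question
first_times = ['2019-05-04 00:00:03', '2019-05-04 00:04:03', '2019-05-04 00:08:03']
first_temps = [18.32, 18.34, 18.40]
second_times = ['2019-05-04 00:02:03', '2019-05-04 00:06:03', '2019-05-04 00:10:03']
second_temps = [19.32, 19.31, 19.28]

def parse(stamps):
    """Convert 'YYYY-MM-DD HH:MM:SS' strings to datetime objects."""
    return [datetime.strptime(s, '%Y-%m-%d %H:%M:%S') for s in stamps]

first_dt = parse(first_times)
second_dt = parse(second_times)

fig, ax = plt.subplots()
# With datetime x-values both series share one real time axis
ax.plot(first_dt, first_temps, 'b', label='SA1_563_04_RT601A')
ax.plot(second_dt, second_temps, 'g', label='SA1_563_04_RT601B')
ax.set_xlabel('Time')
ax.set_ylabel('Temperature [C]')
ax.legend()
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```

With a real time axis, tick positions are dates rather than list indices, so index-based calls such as plt.xticks([100,604,1053]) would need to be replaced with date values.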

matplotlib hexbin: Which bin has the highest count [duplicate]

I have two distributions within a hexbin plot, like the one shown:
One distribution has a max value of about 4000, while the other has a max value of about 2500. The plotting colours are therefore different.
I was thinking I could normalize it if I knew the max value of the hexbin plot. How do I know how many points are within the max hexbin other than looking at the colorbar? I am using matplotlib.pyplot.hexbin
You can get the min and max of the norm which is what is used to normalize the data for color picking.
hb = plt.hexbin(x, y)
print(hb.norm.vmin, hb.norm.vmax)
You could then go on to pass a norm with this information to the second plot. The problem with this is that the first plot must have more range than the second, otherwise the second plot will not all be colored.
Alternatively, and preferably, you can construct a norm which you pass to the hexbin function for both of your plots:
norm = plt.Normalize(min_v, max_v)
hb1 = plt.hexbin(x1, y1, norm=norm)
hb2 = plt.hexbin(x2, y2, norm=norm)
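As a self-contained sketch of the shared-norm approach, the example below uses random data and an assumed colour range of 0 to 100; both are placeholders, not values from the question. It also reads the highest bin count directly from the array of counts stored on the collection that `hexbin` returns, which addresses the original question without looking at the colorbar.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

# Random stand-in data for the two distributions (not the asker's data)
rng = np.random.default_rng(0)
x1, y1 = rng.normal(size=2000), rng.normal(size=2000)
x2, y2 = rng.normal(size=1000), rng.normal(size=1000)

# One norm shared by both plots; the 0..100 range is an assumed placeholder
norm = Normalize(vmin=0, vmax=100)

fig, (ax1, ax2) = plt.subplots(1, 2)
hb1 = ax1.hexbin(x1, y1, gridsize=20, norm=norm)
hb2 = ax2.hexbin(x2, y2, gridsize=20, norm=norm)

# The returned collection stores one count per hexagon
max1 = hb1.get_array().max()
max2 = hb2.get_array().max()
print(max1, max2)
```

Because both collections use the same `norm` object, equal counts map to equal colours in the two panels.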

matplotlib: left yaxis and right yaxis to have different units

I'm plotting curves in Kelvin.
I would like to have the left yaxis to show units in Kelvin and the right yaxis to show them in Celsius, and both rounded to the closest integer (so the ticks are not aligned, as TempK=TempC+273.15)
fig=plt.figure()
figure=fig.add_subplot(111)
figure.plot(xpos, tos, color='blue')
I should not use twinx() as it allows superimposing curves with two different scales, which is not my case (only the right axis has to be changed, not the curves).
I found the following solution:
fig=plt.figure()
figure=fig.add_subplot(111)
figure.plot(xpos, tos, color='blue')
# ... plot other curves if necessary
# ... and once all data are plotted, one can create a new axis
y1, y2=figure.get_ylim()
x1, x2=figure.get_xlim()
ax2=figure.twinx()
ax2.set_ylim(y1-273.15, y2-273.15)
ax2.set_yticks( range(int(y1-273.15), int(y2-273.15), 2) )
ax2.set_ylabel('Celsius')
ax2.set_xlim(x1, x2)
figure.set_ylabel('Surface Temperature (K)')
Do not forget to set the x-axis limits of the twinx axis!
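For reference, a runnable version of this approach might look like the sketch below. The `xpos` and `tos` values are made up, and the variable names `ax`, `lo_c`, and `hi_c` are illustrative. It rounds the first Celsius tick up with `math.ceil` rather than truncating with `int`, an adjustment to the snippet above that keeps every tick inside the Celsius range.

```python
import math

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

# Made-up sample data: positions and temperatures in Kelvin
xpos = [0, 1, 2, 3, 4]
tos = [271.4, 275.0, 280.2, 284.9, 288.3]

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(xpos, tos, color='blue')
ax.set_ylabel('Surface Temperature (K)')

# Read the Kelvin limits only after all curves are plotted
y1, y2 = ax.get_ylim()
lo_c, hi_c = y1 - 273.15, y2 - 273.15

ax2 = ax.twinx()  # right-hand axis sharing the same x-axis
# Integer Celsius ticks every 2 degrees, all inside [lo_c, hi_c]
ax2.set_yticks(range(math.ceil(lo_c), math.floor(hi_c) + 1, 2))
# Set the limits last so they stay exactly 273.15 below the Kelvin limits
ax2.set_ylim(lo_c, hi_c)
ax2.set_ylabel('Celsius')
```

No data is drawn on `ax2`; it only relabels the right-hand side, so the curves keep their Kelvin positions.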