Define x and y limits for regression line in ggplot2

I am using ggplot2 to plot graphs. The basic aim is:
The graph has two layers: the lower layer (a scatter plot) uses data gathered from a public database, and on top of it I add the data from my study, together with a regression line for my data. You can get a brief idea of what I have from this picture:
The problem is that, due to the different ranges of the two data sets, the regression line is too long (full range), which makes the picture look strange. I want to restrict the x and y range for the layer with my data, but I just cannot achieve this.
For the regression, I use geom_abline to define the slope, intercept, etc., instead of geom_lm, which I see can take the argument fullrange = FALSE.

Use stat_smooth() with method = "lm" and se = FALSE (which turns off the confidence-interval ribbon). By default the smoother is fitted and drawn only over the range of the data in its layer (fullrange = FALSE), so the line will not span the whole plot:
ggplot(mpg, aes(displ, cty, color = as.factor(cyl))) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE) +
  labs(color = "Cylinders",
       x = "Displacement in Liters",
       y = "Miles per Gallon")

Related

How to apply matplotlib quiver autoscale to two vector fields?

I am plotting two vector fields on top of each other and I want to use the auto-scale feature to set the arrow size such that the two fields are at the same scale automatically. (Part of this notebook.)
If I plot them one after the other, they are drawn at different scales. In this case the black arrows are artificially inflated compared to the green ones.
plt.quiver(*XY, *np.real(UV))
plt.quiver(*XY, *np.imag(UV), color='g')
If I use this solution the first plot sets the scale for the second plot. But this fails to take the scale of the second field into account. If the first field has a small magnitude compared to the second, then it looks terrible.
Q = plt.quiver(*XY, *np.real(UV))
Q._init()
plt.quiver(*XY, *np.imag(UV), scale=Q.scale, color='g')
I want to set the auto-scale based on both fields, not just one or the other. Ideas?
You need to pass the same scale argument to both plt.quiver calls.
If you don't provide a scale, then a visually pleasing scale is derived automatically. So you could in principle extract the autoscaling code, use it to get the automatic scales for both quiver plots, and then use, for instance, the average of the two values.
Another, easier, way is to first invisibly plot both quiver plots using the do-nothing backend 'template', retrieve the automatically calculated scales and use the average of them in both real plotting calls:
def plot_flow(x, y, q, XY, G=source, args=(), size=(7, 7), ymax=None):
    "Plot the geometry and induced velocity field"
    # Loop through segments, superimposing the velocity
    def uv(i): return q[i]*velocity(*XY, x[i], y[i], x[i+1], y[i+1], G, args)
    UV = sum(uv(i) for i in range(len(x)-1))

    def get_scale(XY, UV):
        """Get autoscale value by plotting to do-nothing backend."""
        backend = plt.matplotlib.get_backend()
        plt.matplotlib.use('template')
        Q = plt.quiver(*XY, *UV, scale=None)
        plt.matplotlib.use(backend)
        Q._init()
        return Q.scale

    # Get autoscales
    scale_real = get_scale(XY, np.real(UV))
    scale_imag = get_scale(XY, np.imag(UV)) if np.iscomplexobj(UV) else scale_real
    scale = (scale_real + scale_imag)/2

    # Create plot
    plt.figure(figsize=size)
    ax = plt.axes(); ax.set_aspect('equal', adjustable='box')

    # Plot vectors and segments
    plt.quiver(*XY, *np.real(UV), scale=scale)
    if np.iscomplexobj(UV):
        plt.quiver(*XY, *np.imag(UV), scale=scale, color='g')
    plt.plot(x, y, c='b')
    plt.ylim(None, ymax)
In the example, we get a scale of 7.7 as the average of 12.2 and 3.3:
Normalizing the data before plotting can help get similar arrow sizes for the two fields:
scale = 1
UV_real = np.real(UV) / np.linalg.norm(UV)
UV_imag = np.imag(UV) / np.linalg.norm(UV)
Q1 = plt.quiver(*XY, *UV_real, scale=scale)
Q2 = plt.quiver(*XY, *UV_imag, scale=scale, color='g')
Tested for multiple magnitude ratios between real and imaginary parts.

How to fix the y-axis scale in a plot

I am using a line to estimate the slope of my graphs. The data sets are the same size, but look at these two pictures: the first one seems to have a larger slope, but that is not true; the second one has the larger slope. Because the y-axes have different scales, the first one merely looks steeper. Is there any way to fix the y-axis scale so that I can tell by eye which one has the bigger slope?
code:
x = np.array(list(range(0,df.shape[0]))) # = array([0, 1, 2, ..., 3598, 3599, 3600])
df1[skill]=pd.to_numeric(df1[skill])
fit = np.polyfit(x, df1[skill], 1)
fit_fn = np.poly1d(fit)
df['fit_fn(x)']=fit_fn(x)
df[['Hodrick-Prescott filter',skill,'fit_fn(x)']].plot(title=skill + date)
Two ways (both sketched below):
One: use matplotlib.pyplot.axis to read the axis limits of the first figure and then set the same limits on the second figure with that same function. (You could also use get_ylim and set_ylim, which are specific to the y-axis but require referencing the Axes object directly.)
Two: plot both series in one figure with subplots and set the argument sharey=True (my preference, depending on the intended use).
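A rough sketch of both approaches, with made-up data standing in for the DataFrame columns in the question:
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the two series being compared.
x = np.arange(100)
y1 = 0.5 * x + np.random.randn(100)
y2 = 2.0 * x + np.random.randn(100)

# Way one: read the axis limits of the first figure and reuse them.
plt.figure()
plt.plot(x, y1)
limits = plt.axis()            # (xmin, xmax, ymin, ymax) of the first figure
plt.figure()
plt.plot(x, y2)
plt.axis(limits)               # same limits, so the slopes are visually comparable

# Way two: one figure, two subplots sharing the y-axis.
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.plot(x, y1)
ax2.plot(x, y2)
plt.show()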

Predicting from the full posterior distribution using stan_glmer

Could I ask for some help please?
I have fit a binomial model using stan_glmer and have picked the model which I think best fits the data. I have used the posterior_predict command to compare my observed data to data simulated by the model, and they seem very similar.
I now want to predict the probability of an event for different levels of the predictors. I would usually use the predict command with glmer, but I know I should use the posterior_predict command for stan_glmer to take the full uncertainty in the model into account. If x1 and x2 are continuous predictors for a binary event and I want a random intercept on group, the model formula would be:
model <- stan_glmer(binary_event ~ x1 + x2 + (1 | group), family = "binomial")
My question is: I want to vary the predictors (x1 and x2) to see how the model predicts the observed data (and the variability in those predictions), maybe as a plot but I’m not sure how. Any help or guidance would be greatly appreciated.
In short, posterior_predict has a newdata argument that expects a data.frame with values of x1, x2, and group. This argument is similar to that of many other prediction functions, and there is an example of its use that can be run via example(posterior_predict, package = "rstanarm").
In your case, it might be something like
nd <- with(original_data,
           expand.grid(x1 = seq(from = min(x1), to = max(x1), length.out = 20),
                       x2 = seq(from = min(x2), to = max(x2), length.out = 20),
                       group = levels(group)))
PPD <- posterior_predict(model, newdata = nd)
but you could choose the values of x1 and x2 in various other ways.

C-Support Vector Classification Comprehension

I have a question regarding a code snippet which I found in a book.
The author creates two categories of sample points, then fits a model and plots the SVC model on top of the "blobs".
This is the code snippet:
# create 50 separable points
X, y = make_blobs(n_samples=50, centers=2,
                  random_state=0, cluster_std=0.60)

# fit the support vector classifier model
clf = SVC(kernel='linear')
clf.fit(X, y)

# plot the data
fig, ax = plt.subplots(figsize=(8, 6))
point_style = dict(cmap='Paired', s=50)
ax.scatter(X[:, 0], X[:, 1], c=y, **point_style)

# format plot
format_plot(ax, 'Input Data')
ax.axis([-1, 4, -2, 7])

# Get contours describing the model
xx = np.linspace(-1, 4, 10)
yy = np.linspace(-2, 7, 10)
xy1, xy2 = np.meshgrid(xx, yy)
Z = np.array([clf.decision_function([t])
              for t in zip(xy1.flat, xy2.flat)]).reshape(xy1.shape)

line_style = dict(levels=[-1.0, 0.0, 1.0],
                  linestyles=['dashed', 'solid', 'dashed'],
                  colors='gray', linewidths=1)
ax.contour(xy1, xy2, Z, **line_style)
The result is the following:
My question is: why do we create xx and yy as well as xy1 and xy2? We actually want to show the SVC decision "function" for the X and y data, yet we pass xy1 and xy2 (and Z, which is also built from xy1 and xy2) to ax.contour to draw the contours, so there is no connection to the data with which the SVC model was trained... is there?
Can anybody explain this to me, please, or recommend a simpler way to do this?
Thanks for your answers.
I'll start with short, broad answers. ax.contour() is just one way to plot the separating hyperplane and its "parallel" planes. You can certainly plot it by calculating the plane explicitly, as in this example.
To answer your last question: in my opinion it is already a relatively simple (in math and logic) and easy (in coding) way to plot your model. It is especially useful when the separating hyperplane is not mathematically easy to describe (such as a polynomial or RBF kernel for non-linear separation), as in this example.
To address your second question and comments, and to answer your first question: yes, you're right, xx, yy, xy1, xy2 and Z all have a very limited connection to your (simulated blobs of) data. They are used only for drawing the hyperplanes that describe your model.
That should answer your questions, but please allow me to give some more detail here in case others are not as familiar with the topic as you are. The only connection between your data and xx, yy, xy1, xy2, Z is this:
xx, yy, xy1 and xy2 sample an area surrounding the simulated data: the simulated blobs fall roughly inside that region, with xx setting x-limits of (-1, 4) and yy setting y-limits of (-2, 7). One can inspect the "meshgrid" with ax.scatter(xy1, xy2).
Z is a calculation over all sample points in the "meshgrid": the normalized distance from each sample point to the separating hyperplane. Z supplies the levels for the contour plot.
ax.contour then uses the "meshgrid" and Z to plot contour lines. Here are some key points:
xy1 and xy2 are both 2-D arrays specifying the (x, y) coordinates of the surface; they list the sample points of the area row by row.
Z is a 2-D array with the same shape as xy1 and xy2. It defines the level at each point so that the program can "understand" the shape of the 3-dimensional surface.
levels = [-1.0, 0.0, 1.0] indicates that there are three curves (straight lines in this case) to draw at the corresponding levels. In relation to the SVC, level 0 is the separating hyperplane; levels -1 and 1 are very close (differing by a slack ζ_i) to the maximum-margin separating hyperplane.
linestyles = ['dashed', 'solid', 'dashed'] indicates that the separating hyperplane is drawn as a solid line and the two planes on either side of it are drawn as dashed lines.
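To see those key points in isolation, here is a tiny standalone sketch (unrelated to the blobs data) where the surface is simply Z = x + y, so the three requested levels come out as parallel straight lines, much like the linear SVC boundaries above:
import numpy as np
import matplotlib.pyplot as plt

xx = np.linspace(-2, 2, 50)
yy = np.linspace(-2, 2, 50)
g1, g2 = np.meshgrid(xx, yy)          # 2-D (x, y) coordinates, row by row
Z = g1 + g2                           # a simple "signed distance"-like surface

fig, ax = plt.subplots()
ax.contour(g1, g2, Z,
           levels=[-1.0, 0.0, 1.0],   # three curves, one per requested level
           linestyles=['dashed', 'solid', 'dashed'],
           colors='gray')
plt.show()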
Edit (in response to the comment):
Mathematically, you might expect the decision function to be a sign function that tells us whether a point is in class 0 or class 1, as you said. However, if you check the values in Z, you will find they are continuous. decision_function(X) works in such a way that the sign of the value indicates the classification, while the absolute value is the "distance of the samples X to the separating hyperplane", which reflects (roughly) the confidence of the predicted classification. This is critical for plotting the model: if Z were categorical, you would get contour lines that fill an area like a mesh rather than single contour lines. It would look like the colormesh in the example, but you won't see that with ax.contour(), since that is not correct behavior for a contour plot.
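A quick way to check this yourself, assuming clf, xy1 and xy2 from the snippet above are still in scope (the vectorized decision_function call is just for illustration and replaces the per-point loop):
import numpy as np

# Evaluate the model on every meshgrid point at once.
grid = np.c_[xy1.ravel(), xy2.ravel()]      # shape (n_points, 2)

dist = clf.decision_function(grid)          # continuous signed distances to the hyperplane
labels = clf.predict(grid)                  # hard class labels, 0 or 1

print(dist[:5])                             # e.g. values like -3.2, -1.7, 0.4, ...
print(labels[:5])                           # e.g. 0, 0, 1, ...
print(np.array_equal(labels, (dist > 0).astype(int)))   # True: the sign decides the class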

ValueError: setting an array element with a sequence at fit(X, y) in k-nearest neighbor

I get an error at this line: neigh.fit(X, y):
ValueError: setting an array element with a sequence.
I checked the fit function, and X should be {array-like, sparse matrix, BallTree, cKDTree}.
My X is a list of lists whose first element is the solidity number and whose second element is the Hu-moments list (7 cells).
If I change it and take only the first Hu-moment value, so that I have a pure list of lists, it
gives this error: query data dimension must match BallTree data dimension.
My code:
listafeaturevector = list()
path = 'imgknn/'
for infile in glob.glob(os.path.join(path, '*.jpg')):
    print("current file is: " + infile)
    gray = cv2.imread(infile, 0)

    element = cv2.getStructuringElement(cv2.MORPH_CROSS, (6, 6))
    graydilate = cv2.erode(gray, element)

    ret, thresh = cv2.threshold(graydilate, 127, 255, cv2.THRESH_BINARY_INV)
    imgbnbin = thresh

    # CONTOURS
    contours, hierarchy = cv2.findContours(imgbnbin, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    print(len(contours))

    for i in range(0, len(contours)):
        fv = list()  # 1 feature vector

        # HU MOMENTS
        #print("humoments")
        mom = cv2.moments(contours[i], 1)
        Humoments = cv2.HuMoments(mom)
        #print(Humoments)
        fv.append(Humoments)  # query data dimension must match BallTree data dimension

        # SOLIDITY
        area = cv2.contourArea(contours[i])
        hull = cv2.convexHull(contours[i])  # it has many values
        hull_area = cv2.contourArea(hull)
        solidity = float(area) / hull_area
        fv.append(solidity)
        #fv.append(elongation)
        listafeaturevector.append(fv)

print("i have done")
print(len(listafeaturevector))
lenmatrice = len(listafeaturevector)

# KNN
X = listafeaturevector
y = [0, 1, 2, 3] * (lenmatrice / 4)

from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)  # ValueError: setting an array element with a sequence.
print(neigh.predict([[1.1]]))
print(neigh.predict_proba([[0.9]]))
If I try to convert it to a numpy array:
listafv = np.dstack(listafeaturevector)
listafv=np.rollaxis(listafv,-1)
print(listafv.shape)
data = listafv.reshape((lenmatrice, -1))
print(data.shape)
#KNN
X = data
I get: setting an array element with a sequence.
A couple of suggestions/questions:
Humoments = cv2.HuMoments(mom)
What is the type of the return value Humoments, a float or a list? If it is a float, that is fine.
for each image file
    for i in range(0, len(contours)):
        fv = list()  # 1 feature vector
        ...
        fv.append(Humoments)
        ...
        fv.append(solidity)
        listafeaturevector.append(fv)
The above code does not seem correct. For your problem, I think you need to construct one feature vector per image, so anything related to image i should go into the same feature vector x_i; you then combine all the feature vectors to get the list X. However, you append to listafeaturevector in the inner-most loop, which is clearly not what you want. One way to restructure this (and to flatten the Hu moments so that each feature vector is a flat list of numbers) is sketched right below.
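This is only a rough sketch, not your exact pipeline: image_feature_vector and find_contours are hypothetical helpers, and here the largest contour simply stands in for whatever per-image aggregation you actually want.
import cv2
import numpy as np

def image_feature_vector(contours):
    """Hypothetical helper: one flat, fixed-length feature vector per image.
    Here we just use the largest contour; any other aggregation would do."""
    c = max(contours, key=cv2.contourArea)
    hu = cv2.HuMoments(cv2.moments(c, 1)).flatten()        # 7 Hu moments, shape (7,)
    hull_area = cv2.contourArea(cv2.convexHull(c))
    solidity = float(cv2.contourArea(c)) / hull_area if hull_area else 0.0
    return np.hstack([hu, solidity])                        # shape (8,): always the same length

# X = np.array([image_feature_vector(find_contours(f)) for f in image_files])
# X.shape == (n_images, 8), which KNeighborsClassifier.fit(X, y) accepts.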
Second, you loop over the number of elements in contours. Are you sure that number stays the same for every image? Otherwise the number of features (|x_i|) differs across images, and that can cause the error
setting an array element with a sequence.
Third, are you clear about how you want to classify the images? What are the target values/labels of the different images? I see you just set the labels with [0,1,2,3] * (lenmatrice/4). Can you elaborate on what you are trying to do with those images? Do they contain different types of objects? Do they show different patterns? Do the images depict different topics/colors? If yes, then for each type you give a different label, either 0, 1, 2 or 'red', 'white', 'black' (assuming you have only 3 types). The values of the labels do not matter; what matters is how many distinct values there are. I am trying to understand what the labels mean in your case.
On the other hand, if you only want to retrieve similar images, you don't need a classifier or a label for each image. Instead, try NearestNeighbors; a minimal sketch follows below.
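A minimal sketch of that unsupervised alternative; the feature matrix here is random placeholder data with the same 8-column shape as in the sketch above:
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(20, 8)            # placeholder: 20 images, 8 features each

nn = NearestNeighbors(n_neighbors=3)
nn.fit(X)                            # no labels needed

query = X[0].reshape(1, -1)          # a query needs the same 8 features
distances, indices = nn.kneighbors(query)
print(indices)                       # indices of the 3 most similar images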
print(neigh.predict([[1.1]]))
print(neigh.predict_proba([[0.9]]))
Fourth, the two test lines above are not correct. You need to pass an X-like object to get a prediction from the classifier; that is, a feature vector x with the same structure as the ones you constructed for your training examples (with all the features, e.g. the Hu moments, elongation and solidity, in the same order).
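For example, a sketch with made-up numbers, just to show the shape requirement:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.random.rand(20, 8)          # placeholder training features (7 Hu moments + solidity)
y_train = [0, 1, 2, 3] * 5               # one label per image

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)

x_new = np.random.rand(1, 8)             # ONE new sample with the same 8 features
print(neigh.predict(x_new))              # works: dimensions match the training data
print(neigh.predict_proba(x_new))
# neigh.predict([[1.1]]) would fail here: 1 feature instead of 8.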