How can I find the slope and intercept of a line in ggplot? - ggplot2

I made a calibration plot (checking agreement between predictions and the outcome: discrepancy yes or no) with ggplot. Perfect calibration would give a slope of 1 and an intercept of 0, so now I was wondering how to determine the slope and intercept.
This is my command:
plot1 <- ggplot(dataset_validatie, aes(x = clinically_relevant_discrepancy, y = predictions, group = 1)) + geom_line() + geom_point()
print(plot1)
Here is my calibration plot:
[calibration plot image]

Related

Binary treatment based on a continuous variable (Stata)

I want to create a scatter plot showing my treatment assignment on the y-axis and the margin of winning on the x-axis.
To create a binary treatment variable, I coded a margin over 0 as indicating that a Republican candidate won the local election:
gen republican_win = (margin>0)
Here is a data example:
* Example generated by -dataex-. For more info, type help dataex
clear
input double margin float republican_win
-.356066316366196 0
-.54347825050354 0
-.204092293977737 0
-.449720650911331 1
-.201149433851242 1
-.505899667739868 0
-.206885248422623 1
end
To generate a scatter plot, I ran the command below. While the code ran well, I was wondering whether it would be possible to display a continuous distribution of the margin of Republican wins and losses.
scatter margin republican_win
You can use the predicted probabilities by storing them in a variable, and then plot them at the same time as your scatter plot.
I would then reverse the axes to show your logistic distribution.
logit republican_win margin
predict win_hat
twoway scatter win_hat republican_win margin, ///
    connect(l i) msymbol(i O) sort ylabel(0 1)
There are not enough data points in your data example to show a nice fitted curve, but I'm sure it will look better on your whole dataset.

Is the IOU in Tensorflow Object Detection API wrong?

I just dug a bit through the Tensorflow Object Detection API code, especially the eval_util part, as I wanted to implement the COCO metrics.
But I noticed that the metrics are solely calculated using the bounding boxes which have normalized coordinates between [0, 1].
There are no aspect ratios or absolute coordinates used.
So, doesn't this mean that the intersection over unions calculated on these results are incorrect?
Let's take a 200x100 pixel image as an example.
If the box were off by 20px to the left, that's 0.1 in normalized coordinates (20/200).
But if it were off by 20px toward the top, that would be 0.2 in normalized coordinates (20/100).
Doesn't that mean that being off toward the top penalizes the score harder than being off to the side?
I believe the predicted coordinates are resized to the absolute image coordinates in the eval binary.
But the other thing I would say is that IOU is scale invariant in the sense that if you scale two boxes by some factor, they will still have the same IOU overlap. As an example if we scale by 2 in the x-direction and scale by 3 in the y direction:
If A is (x1, y1, x2, y2) and B is (u1, v1, u2, v2), then
IOU(A, B) = IOU((2*x1, 3*y1, 2*x2, 3*y2), (2*u1, 3*v1, 2*u2, 3*v2))
What this means is that evaluating in normalized coordinates should give the same result as evaluating in absolute coordinates.
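A minimal sketch of that invariance (the iou helper below is my own illustration, not the API's implementation): scaling both boxes by the same per-axis factors leaves the IOU unchanged.
def iou(a, b):
    # Boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def scale(box, sx, sy):
    return (box[0] * sx, box[1] * sy, box[2] * sx, box[3] * sy)

A = (10, 10, 60, 40)
B = (30, 20, 80, 50)
print(iou(A, B))                            # 0.25 in the original coordinates
print(iou(scale(A, 2, 3), scale(B, 2, 3)))  # still 0.25 after scaling x by 2 and y by 3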

C-Support Vector Classification Comprehension

I have a question regarding a code snippet which I found in a book.
The author creates two categories of sample points. Next, the author fits a model and plots the SVC model onto the "blobs".
This is the code snippet:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# create 50 separable points
X, y = make_blobs(n_samples=50, centers=2,
                  random_state=0, cluster_std=0.60)
# fit the support vector classifier model
clf = SVC(kernel='linear')
clf.fit(X, y)
# plot the data
fig, ax = plt.subplots(figsize=(8, 6))
point_style = dict(cmap='Paired', s=50)
ax.scatter(X[:, 0], X[:, 1], c=y, **point_style)
# format plot (format_plot is a helper defined in the book)
format_plot(ax, 'Input Data')
ax.axis([-1, 4, -2, 7])
# Get contours describing the model
xx = np.linspace(-1, 4, 10)
yy = np.linspace(-2, 7, 10)
xy1, xy2 = np.meshgrid(xx, yy)
Z = np.array([clf.decision_function([t])
              for t in zip(xy1.flat, xy2.flat)]).reshape(xy1.shape)
line_style = dict(levels=[-1.0, 0.0, 1.0],
                  linestyles=['dashed', 'solid', 'dashed'],
                  colors='gray', linewidths=1)
ax.contour(xy1, xy2, Z, **line_style)
The result is the following:
My question is now: why do we create "xx" and "yy" as well as "xy1" and "xy2"? We actually want to show the SVC "function" for the X and y data, yet we pass xy1 and xy2, as well as Z (which is also created from xy1 and xy2), to the contour plotting call, so there is no connection to the data on which the SVC model was trained... is there?
Can anybody explain this to me, please, or recommend how to solve this more easily?
Thanks for your answers
I'll start with short broad answers. ax.contour() is just one way to plot the separating hyperplane and its "parallel" planes. You can certainly plot it by calculating the plane, like this example.
To answer your last question, in my opinion it's already a relatively simple (in math and logic) and easy (in coding) way to plot your model. And it is especially useful when your separating hyperplane is not mathematically easy to describe (such as polynomial and RBF kernel for non-linear separation), like this example.
To address your second question and comments, and to answer your first question: yes, you're right, xx, yy, xy1, xy2 and Z all have a very limited connection to your (simulated blobs of) data. They are used for drawing the hyperplanes that describe your model.
That should answer your questions. But please allow me to give some more details here in case others are not as familiar with the topic as you are. The only connection between your data and xx, yy, xy1, xy2, Z is:
xx, yy, xy1 and xy2 sample an area surrounding the simulated data. Specifically, the simulated data are generated around two cluster centers; xx sets limits of (-1, 4) and yy sets limits of (-2, 7). One can check the "meshgrid" with ax.scatter(xy1, xy2).
Z is computed for all sample points in the "meshgrid". It is the (normalized) distance from each sample point to the separating hyperplane. Z provides the levels for the contour plot.
ax.contour then uses the "meshgrid" and Z to plot contour lines. Here are some key points:
xy1 and xy2 are both 2-D specifying the (x, y) coordinates of the surface. They list sample points in the area row by row.
Z is a 2-D array with the same shape as xy1 and xy2. It defines the level at each point so that the program can "understand" the shape of the 3-dimensional surface.
levels = [-1.0, 0.0, 1.0] indicates that there are 3 curves (lines in this case) to draw at the corresponding levels. In relation to SVC, level 0 is the separating hyperplane; levels -1 and 1 are very close (differing by a slack ζi) to the maximum-margin separating hyperplane.
linestyles = ['dashed', 'solid', 'dashed'] indicates that the separating hyperplane is drawn as a solid line and the two planes on either side of it are drawn as dashed lines.
Edit (in response to the comment):
Mathematically, the decision function should be a sign function that tells us whether a point belongs to class 0 or 1, as you said. However, when you check the values in Z, you will find they are continuous. decision_function(X) works in such a way that the sign of the value indicates the classification, while the absolute value is the "Distance of the samples X to the separating hyperplane", which reflects (to some degree) the confidence/significance of the predicted classification. This is critical to plotting the model. If Z were categorical, you would get contour lines that enclose an area like a mesh rather than a single contour line. It would look like the colormesh in the example; but you won't see that with ax.contour(), since that's not correct behavior for a contour plot.
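Here is a minimal sketch of that point, assuming the clf, xy1 and xy2 from the question's code are in scope: predict returns only the class labels, while decision_function returns continuous signed distances, which is what the contour plot needs.
import numpy as np

pts = np.c_[xy1.ravel(), xy2.ravel()]       # all meshgrid points as (x, y) rows
labels = clf.predict(pts)                   # categorical: 0 or 1 per point
dist = clf.decision_function(pts)           # continuous signed distance to the hyperplane

print(np.unique(labels))                    # [0 1]
print(dist.min(), dist.max())               # a continuous range of values
print(np.all((dist > 0) == (labels == 1)))  # the sign of the distance gives the class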

FFT real/imaginary/abs parts interpretation

I'm currently learning about the discrete Fourier transform and I'm playing with numpy to understand it better.
I tried to plot a "sin x sin x sin" signal and obtained a clean FFT with 4 non-zero points. I naively told myself: "well, if I plot a "sin + sin + sin + sin" signal with these amplitudes and frequencies, I should obtain the same "sin x sin x sin" signal, right?"
Well... not exactly
(First is "x" signal, second is "+" signal)
Both share the same amplitudes/frequencies, but are not the same signals, even if I can see they have some similarities.
Ok, since I only plotted the absolute values of the FFT, I guess I lost some information.
Then I plotted the real part, imaginary part and absolute values for both signals:
Now, I'm confused. What do I do with all this? I read about the DFT from a mathematical point of view. I understand that complex values come from the unit circle. I even had to learn about Hilbert spaces to understand how it works (and it was painful!... and I only scratched the surface). I only wish to understand whether these real/imaginary plots have any concrete meaning outside the mathematical world:
abs(fft) : frequencies + amplitudes
real(fft) : ?
imaginary(fft) : ?
Code:
import numpy as np
import matplotlib.pyplot as plt
N = 512 # Sample count
fs = 128 # Sampling rate
st = 1.0 / fs # Sample time
t = np.arange(N) * st # Time vector
signal1 = \
1 *np.cos(2*np.pi * t) *\
2 *np.cos(2*np.pi * 4*t) *\
0.5 *np.cos(2*np.pi * 0.5*t)
signal2 = \
0.25*np.sin(2*np.pi * 2.5*t) +\
0.25*np.sin(2*np.pi * 3.5*t) +\
0.25*np.sin(2*np.pi * 4.5*t) +\
0.25*np.sin(2*np.pi * 5.5*t)
_, axes = plt.subplots(4, 2)
# Plot signal
axes[0][0].set_title("Signal 1 (multiply)")
axes[0][0].grid()
axes[0][0].plot(t, signal1, 'b-')
axes[0][1].set_title("Signal 2 (add)")
axes[0][1].grid()
axes[0][1].plot(t, signal2, 'r-')
# FFT + bins + normalization
bins = np.fft.fftfreq(N, st)
fft = [i / (N/2) for i in np.fft.fft(signal1)]
fft2 = [i / (N/2) for i in np.fft.fft(signal2)]
# Plot real (N // 2 keeps the slice index an integer)
axes[1][0].set_title("FFT 1 (real)")
axes[1][0].grid()
axes[1][0].plot(bins[:N//2], np.real(fft[:N//2]), 'b-')
axes[1][1].set_title("FFT 2 (real)")
axes[1][1].grid()
axes[1][1].plot(bins[:N//2], np.real(fft2[:N//2]), 'r-')
# Plot imaginary
axes[2][0].set_title("FFT 1 (imaginary)")
axes[2][0].grid()
axes[2][0].plot(bins[:N//2], np.imag(fft[:N//2]), 'b-')
axes[2][1].set_title("FFT 2 (imaginary)")
axes[2][1].grid()
axes[2][1].plot(bins[:N//2], np.imag(fft2[:N//2]), 'r-')
# Plot abs
axes[3][0].set_title("FFT 1 (abs)")
axes[3][0].grid()
axes[3][0].plot(bins[:N//2], np.abs(fft[:N//2]), 'b-')
axes[3][1].set_title("FFT 2 (abs)")
axes[3][1].grid()
axes[3][1].plot(bins[:N//2], np.abs(fft2[:N//2]), 'r-')
plt.show()
For each frequency bin, the magnitude sqrt(re^2 + im^2) tells you the amplitude of the component at the corresponding frequency. The phase atan2(im, re) tells you the relative phase of that component. The real and imaginary parts, on their own, are not particularly useful, unless you are interested in symmetry properties around the data window's center (even vs. odd).
With respect to some reference point, say the center of a fixed time window, a sine wave and a cosine wave of the same frequency will look different (have different starting phases with respect to any fixed time reference point). They will also be mathematically orthogonal over any integer periodic width, so can represent independent basis vector components of a transform.
The real portion of an FFT result is how much each frequency component resembles a cosine wave, the imaginary component, how much each component resembles a sine wave. Various ratios of sine and cosine components together allow one to construct a sinusoid of any arbitrary or desired phase, thus allowing the FFT result to be complete.
Magnitude alone can't tell the difference between a sine and cosine wave. An IFFT(imag(FFT)) would screw up the reconstruction of any signal with a different phase than pure cosines. Same with IFFT(re(FFT)) and pure sine waves (with respect to the FFT aperture window).
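A minimal sketch of that bookkeeping on a single test tone (the tone and its phase are made up for illustration): the magnitude recovers the amplitude and np.angle (atan2 of the imaginary over the real part) recovers the starting phase.
import numpy as np

N, fs = 512, 128
t = np.arange(N) / fs
x = 3.0 * np.cos(2 * np.pi * 4 * t + 0.7)  # 4 Hz tone, amplitude 3, phase 0.7 rad

X = np.fft.fft(x) / (N / 2)                # same normalization as in the question
bins = np.fft.fftfreq(N, 1 / fs)
k = np.argmax(np.abs(X[:N // 2]))          # index of the dominant bin

print(bins[k])                             # 4.0  -> frequency of the component
print(np.abs(X[k]))                        # ~3.0 -> amplitude, sqrt(re^2 + im^2)
print(np.angle(X[k]))                      # ~0.7 -> phase, atan2(im, re)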
You can convert signal 1, which consists of a product of three cos functions, to a sum of four cos functions. This is the difference from signal 2, which is a sum of four sine functions.
A cos function is an even function: cos(-x) == cos(x).
The Fourier transform of an even function is purely real.
That is the reason why the plot of the imaginary part of the FFT of function 1 contains only values close to zero (1e-15).
A sine function is an odd function: sin(-x) == -sin(x).
The Fourier transform of an odd function is purely imaginary.
That is the reason why the plot of the real part of the FFT of function 2 contains only values close to zero (1e-15).
If you want to understand the FFT and DFT in more detail, read a signal-analysis textbook for electrical engineering.
Although... you must be quite an expert by now :)
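A minimal numerical check of those two statements, using the same sampling parameters as the question: the FFT of a cosine (even) has a negligible imaginary part, and the FFT of a sine (odd) has a negligible real part.
import numpy as np

N, fs = 512, 128
t = np.arange(N) / fs

even = np.cos(2 * np.pi * 4 * t)  # even function: cos(-x) == cos(x)
odd = np.sin(2 * np.pi * 4 * t)   # odd function: sin(-x) == -sin(x)

print(np.max(np.abs(np.imag(np.fft.fft(even)))))  # essentially zero (floating-point noise)
print(np.max(np.abs(np.real(np.fft.fft(odd)))))   # essentially zero (floating-point noise)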
For others: please notice that, by the product-to-sum identity cos(a)*cos(b) = 0.5*[cos(a-b) + cos(a+b)], signal 1 expands to 0.25*cos(2π*2.5t) + 0.25*cos(2π*3.5t) + 0.25*cos(2π*4.5t) + 0.25*cos(2π*5.5t), i.e. a sum of cosines, not sines.
So, correcting the sum as:
signal1 = \
1 *np.cos(2*np.pi * t) *\
2 *np.cos(2*np.pi * 4*t) *\
0.5 *np.cos(2*np.pi * 0.5*t)
signal2 = \
0.25*np.cos(-2*np.pi * 2.5*t) +\
0.25*np.cos(2*np.pi * 3.5*t) +\
0.25*np.cos(-2*np.pi * 4.5*t) +\
0.25*np.cos(2*np.pi * 5.5*t)
Running this now gives the following result: with the corrected sum of cosines, not only the absolute value but also the real part of the FFT is the same for both signals.

pyplot scatter plot marker size

In the pyplot documentation for the scatter plot:
matplotlib.pyplot.scatter(x, y, s=20, c='b', marker='o', cmap=None, norm=None,
vmin=None, vmax=None, alpha=None, linewidths=None,
faceted=True, verts=None, hold=None, **kwargs)
The marker size
s:
size in points^2. It is a scalar or an array of the same length as x and y.
What kind of unit is points^2? What does it mean? Does s=100 mean 10 pixel x 10 pixel?
Basically I'm trying to make scatter plots with different marker sizes, and I want to figure out what the s number means.
This can be a somewhat confusing way of defining the size, but you are basically specifying the area of the marker. This means that to double the width (or height) of the marker you need to increase s by a factor of 4 [because A = W*H => (2W)(2H) = 4A].
There is a reason, however, that the size of markers is defined in this way. Because of the scaling of area as the square of width, doubling the width actually appears to increase the size by more than a factor 2 (in fact it increases it by a factor of 4). To see this consider the following two examples and the output they produce.
import matplotlib.pyplot as plt

# doubling the width of markers
x = [0, 2, 4, 6, 8, 10]
y = [0] * len(x)
s = [20 * 4**n for n in range(len(x))]
plt.scatter(x, y, s=s)
plt.show()
gives
Notice how the size increases very quickly. If instead we have
# doubling the area of markers
x = [0,2,4,6,8,10]
y = [0]*len(x)
s = [20*2**n for n in range(len(x))]
plt.scatter(x,y,s=s)
plt.show()
gives
Now the apparent size of the markers increases roughly linearly in an intuitive fashion.
As for the exact meaning of what a 'point' is, it is fairly arbitrary for plotting purposes; you can just scale all of your sizes by a constant until they look reasonable.
Edit: (In response to the comment from @Emma)
It's probably confusing wording on my part. The question asked about doubling the width of a circle, so in the first picture the width of each circle (as we move from left to right) is double that of the previous one, so for the area this is an exponential with base 4. Similarly, in the second example each circle has double the area of the last one, which gives an exponential with base 2.
However, it is the second example (where we are scaling area) in which doubling the area appears to make the circle twice as big to the eye. Thus, if we want a circle to appear a factor of n bigger, we would increase the area by a factor of n, not the radius, so the apparent size scales linearly with the area.
Edit to visualize the comment by @TomaszGandor:
This is what it looks like for different functions of the marker size:
x = [0,2,4,6,8,10,12,14,16,18]
s_exp = [20*2**n for n in range(len(x))]
s_square = [20*n**2 for n in range(len(x))]
s_linear = [20*n for n in range(len(x))]
plt.scatter(x,[1]*len(x),s=s_exp, label='$s=2^n$', lw=1)
plt.scatter(x,[0]*len(x),s=s_square, label='$s=n^2$')
plt.scatter(x,[-1]*len(x),s=s_linear, label='$s=n$')
plt.ylim(-1.5,1.5)
plt.legend(loc='center left', bbox_to_anchor=(1.1, 0.5), labelspacing=3)
plt.show()
Because other answers here claim that s denotes the area of the marker, I'm adding this answer to clarify that this is not necessarily the case.
Size in points^2
The argument s in plt.scatter denotes the markersize**2. As the documentation says
s : scalar or array_like, shape (n, ), optional
size in points^2. Default is rcParams['lines.markersize'] ** 2.
This can be taken literally. In order to obtain a marker which is x points large, you need to square that number and give it to the s argument.
So the relationship between the markersize of a line plot and the scatter size argument is the square. In order to produce a scatter marker of the same size as a plot marker of size 10 points you would hence call scatter( .., s=100).
import matplotlib.pyplot as plt
fig,ax = plt.subplots()
ax.plot([0],[0], marker="o", markersize=10)
ax.plot([0.07,0.93],[0,0], linewidth=10)
ax.scatter([1],[0], s=100)
ax.plot([0],[1], marker="o", markersize=22)
ax.plot([0.14,0.86],[1,1], linewidth=22)
ax.scatter([1],[1], s=22**2)
plt.show()
Connection to "area"
So why do other answers and even the documentation speak about "area" when it comes to the s parameter?
Of course the units of points**2 are area units.
For the special case of a square marker, marker="s", the area of the marker is indeed directly the value of the s parameter.
For a circle, the area of the circle is area = pi/4*s.
For other markers there may not even be any obvious relation to the area of the marker.
In all cases however the area of the marker is proportional to the s parameter. This is the motivation to call it "area" even though in most cases it isn't really.
Specifying the size of the scatter markers in terms of some quantity that is proportional to the area of the marker makes sense insofar as it is the area of the marker that is perceived when comparing different patches, rather than its side length or diameter. I.e. doubling the underlying quantity should double the area of the marker.
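A tiny worked example of those relations, assuming s=100 is passed to scatter: a square marker is then sqrt(100) = 10 points wide with an area of exactly 100 points^2, while a circle marker with the same s covers pi/4*100 ≈ 78.5 points^2.
import math

s = 100                          # value passed to plt.scatter(..., s=s)
side = math.sqrt(s)              # width/height of a square marker: 10 points
square_area = side ** 2          # 100 points^2, exactly s
circle_area = math.pi / 4 * s    # ~78.5 points^2 for marker="o"
print(side, square_area, circle_area)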
What are points?
So far the answer to what the size of a scatter marker means is given in units of points. Points are often used in typography, where fonts are specified in points. Also linewidths is often specified in points. The standard size of points in matplotlib is 72 points per inch (ppi) - 1 point is hence 1/72 inches.
It might be useful to be able to specify sizes in pixels instead of points. If the figure dpi is 72 as well, one point is one pixel. If the figure dpi is different (matplotlib default is fig.dpi=100),
1 point == fig.dpi/72. pixels
While the scatter marker's size in points would hence look different for different figure dpi, one could produce a 10 by 10 pixels^2 marker, which would always have the same number of pixels covered:
import matplotlib.pyplot as plt

for dpi in [72, 100, 144]:
    fig, ax = plt.subplots(figsize=(1.5, 2), dpi=dpi)
    ax.set_title("fig.dpi={}".format(dpi))
    ax.set_ylim(-3, 3)
    ax.set_xlim(-2, 2)
    ax.scatter([0], [1], s=10**2,
               marker="s", linewidth=0, label="100 points^2")
    ax.scatter([1], [1], s=(10 * 72. / fig.dpi)**2,
               marker="s", linewidth=0, label="100 pixels^2")
    ax.legend(loc=8, framealpha=1, fontsize=8)
    fig.savefig("fig{}.png".format(dpi), bbox_inches="tight")
plt.show()
If you are interested in a scatter in data units, check this answer.
You can use markersize to specify the size of the circle in the plot method:
import numpy as np
import matplotlib.pyplot as plt
x1 = np.random.randn(20)
x2 = np.random.randn(20)
plt.figure(1)
# you can specify the marker size two ways directly:
plt.plot(x1, 'bo', markersize=20) # blue circle with size 20
plt.plot(x2, 'ro', ms=10,) # ms is just an alias for markersize
plt.show()
From here
It is the area of the marker. I mean, if you have s1 = 1000 and then s2 = 4000, the relation between the radii of the circles is r_s2 = 2 * r_s1. See the following plot:
import matplotlib.pyplot as plt

plt.scatter(2, 1, s=4000, c='r')
plt.scatter(2, 1, s=1000, c='b')
plt.scatter(2, 1, s=10, c='g')
I had the same doubt when I saw the post, so I tried this example and then used a ruler on the screen to measure the radii.
I also attempted to use 'scatter' initially for this purpose. After quite a bit of wasted time, I settled on the following solution.
import matplotlib.pyplot as plt

input_list = [{'x': 100, 'y': 200, 'radius': 50, 'color': (0.1, 0.2, 0.3)}]
output_list = []
for point in input_list:
    output_list.append(plt.Circle((point['x'], point['y']), point['radius'],
                                  color=point['color'], fill=False))
ax = plt.gca()
ax.cla()
ax.set_aspect('equal')  # keep circles circular
ax.set_xlim((0, 1000))
ax.set_ylim((0, 1000))
for circle in output_list:
    ax.add_artist(circle)
This is based on an answer to this question
If the size of the circles corresponds to the square of the parameter passed as s, then assign the square root of each element you append to your size array, like this: s = [1, 1.414, 1.73, 2.0, 2.24], such that when the plot squares these values on output, their relative sizes follow the linear progression [1, 2, 3, 4, 5].
Try a list comprehension: s = [numpy.sqrt(i) for i in s]
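A minimal sketch of the same idea from the other direction: since s is in points^2, squaring a linearly increasing set of desired diameters gives markers whose widths grow linearly on screen.
import numpy as np
import matplotlib.pyplot as plt

diameters = np.array([5, 10, 15, 20, 25])   # desired marker widths in points
plt.scatter(range(len(diameters)), [0] * len(diameters), s=diameters ** 2)
plt.show()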