C-Support Vector Classification Comprehension - matplotlib

I have a question regarding a code snippet which I found in a book.
The author creates two categories of sample points. Next the author fits a model and plots the SVC model onto the "blobs".
This is the code snippet:
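import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
# create 50 separable points
X, y = make_blobs(n_samples=50, centers=2,
                  random_state=0, cluster_std=0.60)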
# fit the support vector classifier model
clf = SVC(kernel='linear')
clf.fit(X, y)
# plot the data
fig, ax = plt.subplots(figsize=(8, 6))
point_style = dict(cmap='Paired', s=50)
ax.scatter(X[:, 0], X[:, 1], c=y, **point_style)
# format plot
format_plot(ax, 'Input Data')
ax.axis([-1, 4, -2, 7])
# Get contours describing the model
xx = np.linspace(-1, 4, 10)
yy = np.linspace(-2, 7, 10)
xy1, xy2 = np.meshgrid(xx, yy)
Z = np.array([clf.decision_function([t])
              for t in zip(xy1.flat, xy2.flat)]).reshape(xy1.shape)
line_style = dict(levels = [-1.0, 0.0, 1.0],
                  linestyles = ['dashed', 'solid', 'dashed'],
                  colors = 'gray', linewidths=1)
ax.contour(xy1, xy2, Z, **line_style)
The result is the following:
My question is: why do we create "xx" and "yy" as well as "xy1" and "xy2"? We actually want to show the SVC "function" for the X and y data, but if we pass xy1 and xy2 as well as Z (which is also computed from xy1 and xy2) to the contour function, there is no connection to the data on which the SVC model was trained, is there?
Can anybody explain this to me, or recommend an easier way to do this?
Thanks for your answers.

I'll start with short broad answers. ax.contour() is just one way to plot the separating hyperplane and its "parallel" planes. You can certainly plot it by calculating the plane, like this example.
To answer your last question, in my opinion it's already a relatively simple (in math and logic) and easy (in coding) way to plot your model. And it is especially useful when your separating hyperplane is not mathematically easy to describe (such as polynomial and RBF kernel for non-linear separation), like this example.
To address your second question and comments, and to answer your first question: yes, you're right. xx, yy, xy1, xy2 and Z all have only a very limited connection to your (simulated blobs of) data. They are used for drawing the hyperplanes that describe your model.
That should answer your questions. But please allow me to give some more details here in case others are not as familiar with the topic as you are. The only connection between your data and xx, yy, xy1, xy2, Z is:
xx, yy, xy1 and xy2 sample an area surrounding the simulated data. Specifically, the simulated data is centered around 2. xx spans the range (-1, 4) and yy spans the range (-2, 7). One can check the "meshgrid" with ax.scatter(xy1, xy2).
Z is computed for every sample point in the "meshgrid". It is the value of the decision function at that point, which for a linear kernel is proportional to the signed distance from the point to the separating hyperplane. Z provides the levels for the contour plot.
ax.contour then uses the "meshgrid" and Z to plot contour lines. Here are some key points:
xy1 and xy2 are both 2-D arrays specifying the (x, y) coordinates of the surface. They list the sample points in the area row by row.
Z is a 2-D array with the same shape as xy1 and xy2. It defines the level at each point so that the program can "understand" the shape of the 3-dimensional surface.
levels = [-1.0, 0.0, 1.0] indicates that there are 3 curves (lines in this case) to draw at the corresponding levels. In relation to SVC, level 0 is the separating hyperplane; levels -1 and 1 are the margin boundaries on either side of it (in the soft-margin formulation, individual points may fall short of these by a slack ζi).
linestyles = ['dashed', 'solid', 'dashed'] indicates that the separating hyperplane is drawn as a solid line and the two planes on either side are drawn as dashed lines.
Edit (in response to the comment):
Mathematically, the decision function should be a sign function which tells us whether a point is in class 0 or 1, as you said. However, when you check the values in Z, you will find they are continuous. decision_function(X) works in such a way that the sign of the value indicates the classification, while the absolute value is the "Distance of the samples X to the separating hyperplane", which reflects (kind of) the confidence/significance of the predicted classification. This is critical for plotting the model. If Z were categorical, you would get contour lines that enclose areas like a mesh rather than a single contour line. It would look like the colormesh in the example; but you won't see that with ax.contour() since it's not the correct behavior for a contour plot.
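To make the role of the grid concrete, here is a minimal sketch that evaluates a linear decision function on the same 10x10 grid. The weight vector w and offset b are made-up stand-ins for what a fitted linear SVC would hold (an assumption for illustration, not the book's model). With a real classifier one would typically pass grid_points to clf.decision_function in a single call instead of looping over the points.

```python
import numpy as np

# Hypothetical linear model standing in for a fitted SVC(kernel='linear'):
# decision value f(p) = w . p + b  (w and b are invented for this sketch)
w = np.array([0.5, -1.0])
b = 2.0

# The grid only covers the plotting area; it does not use the training data
xx = np.linspace(-1, 4, 10)
yy = np.linspace(-2, 7, 10)
xy1, xy2 = np.meshgrid(xx, yy)

# One row per grid point, shape (100, 2)
grid_points = np.column_stack([xy1.ravel(), xy2.ravel()])

# Decision value at every grid point, reshaped back to the grid
Z = (grid_points @ w + b).reshape(xy1.shape)

# Dividing by the norm of w turns the decision value into a signed distance
distance = Z / np.linalg.norm(w)
```

The contour levels -1, 0 and 1 are then just the places where Z takes those values, which is why the grid itself needs no connection to X and y.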

Related

How to do 2D Convolution only at a specific location?

This question has been asked multiple times but still I could not get what I was looking for. Imagine
data=np.random.rand(N,N) #shape N x N
kernel=np.random.rand(3,3) #shape M x M
I know convolution typically means placing the kernel all over the data. But in my case N and M are of the orders of 10000. So I wish to get the value of the convolution at a specific location in the data, say at (10,37) without doing unnecessary calculations at all locations. So the output will be just a number. The main goal is to reduce the computation and memory expenses. Is there any inbuilt function that does this with minimal adjustments?
Indeed, applying the convolution at a particular position coincides with the mere sum over the entries of a (pointwise) multiplication of the submatrix in data and the flipped kernel itself. Here is a reproducible example.
Code
import numpy as np
N = 1000
M = 3
np.random.seed(777)
data = np.random.rand(N,N) #shape N x N
kernel= np.random.rand(M,M) #shape M x M
# Pointwise convolution = pointwise product
data[10:10+M,37:37+M]*kernel[::-1, ::-1]
>array([[0.70980514, 0.37426475, 0.02392947],
[0.24387766, 0.1985901 , 0.01103323],
[0.06321042, 0.57352696, 0.25606805]])
with output
conv = np.sum(data[10:10+M,37:37+M]*kernel[::-1, ::-1])
conv
>2.45430578
The kernel is flipped by definition of the convolution, as explained here and as was kindly pointed out by Warren Weckesser. Thanks!
The key is to make sense of the index you provided. I assumed it refers to the upper left corner of the sub-matrix in data. However, it can refer to the midpoint as well when M is odd.
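As a minimal sketch, the idea can be wrapped in a small helper. The function name conv_at and the upper-left-corner index convention are assumptions made here for illustration; the helper simply refuses windows that run past the edge.

```python
import numpy as np

def conv_at(data, kernel, row, col):
    """Convolution value for the window whose upper-left corner is (row, col)."""
    m, n = kernel.shape
    window = data[row:row + m, col:col + n]
    if window.shape != kernel.shape:
        raise IndexError("window extends past the edge of data")
    # Flip the kernel along both axes, multiply pointwise, then sum
    return float(np.sum(window * kernel[::-1, ::-1]))

rng = np.random.default_rng(0)
data = rng.random((50, 50))
kernel = rng.random((3, 3))
value = conv_at(data, kernel, 10, 37)
```

Only M x M multiplications are performed, so the cost does not depend on N.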
Concept
A different example with N=7 and M=3 exemplifies the idea
and is presented in here for the kernel
kernel = np.array([[3,0,-1], [2,0,1], [4,4,3]])
which, when flipped, yields
kernel[::-1,::-1]
> array([[ 3, 4, 4],
[ 1, 0, 2],
[-1, 0, 3]])
EDIT 1:
Please note that the lecturer in this video does not explicitly mention that flipping the kernel is required before the pointwise multiplication to adhere to the mathematically proper definition of convolution.
EDIT 2:
For large M and a target index close to the boundary of data, a ValueError: operands could not be broadcast together with shapes ... might be thrown. Padding the matrix data with zeros prevents this (although it blows up the memory requirement, and every index into the padded array is shifted by M), i.e.
data = np.pad(data, pad_width=M, mode='constant')
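If the memory cost of padding is a concern, an alternative is to clip the window and the flipped kernel to the part that overlaps data, which gives the same value as zero padding. This is a sketch under two assumptions that are not in the answer above: the index refers to the kernel's midpoint, and the kernel has odd side lengths. The name conv_at_center is made up for illustration.

```python
import numpy as np

def conv_at_center(data, kernel, row, col):
    """Zero-padded convolution value with the odd-sized kernel centred on (row, col)."""
    m, n = kernel.shape
    flipped = kernel[::-1, ::-1]
    # Upper-left corner of the window; may be negative near the border
    top, left = row - m // 2, col - n // 2
    # Clip the window to the valid index range of data
    r0, r1 = max(top, 0), min(top + m, data.shape[0])
    c0, c1 = max(left, 0), min(left + n, data.shape[1])
    window = data[r0:r1, c0:c1]
    # Use only the matching part of the flipped kernel
    return float(np.sum(window * flipped[r0 - top:r1 - top, c0 - left:c1 - left]))

rng = np.random.default_rng(1)
small = rng.random((5, 5))
k3 = rng.random((3, 3))
corner_value = conv_at_center(small, k3, 0, 0)
```

Entries that would fall outside data contribute zero either way, so skipping them leaves the sum unchanged.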

How to apply matplotlib quiver autoscale to two vector fields?

I am plotting two vector fields on top of each other and I want to use the auto-scale feature to set the arrow size such that the two fields are at the same scale automatically. (Part of this notebook.)
If I plot them one after the other, they are drawn at different scales. In this case the black arrows are artificially inflated compared to green.
plt.quiver(*XY, *np.real(UV))
plt.quiver(*XY, *np.imag(UV), color='g')
If I use this solution the first plot sets the scale for the second plot. But this fails to take the scale of the second field into account. If the first field has a small magnitude compared to the second, then it looks terrible.
Q = plt.quiver(*XY, *np.real(UV))
Q._init()
plt.quiver(*XY, *np.imag(UV), scale=Q.scale, color='g')
I want to set the auto-scale based on both fields, not just one or the other. Ideas?
You need to pass the same scale argument to both plt.quiver calls.
If you don't provide a scale, then a visually pleasing scale is derived automatically. So you could in principle extract the autoscaling code and use it to get the automatic scales for both quiver plots and then use, for instance, the average of the two values.
Another, easier, way is to first invisibly plot both quiver plots using the do-nothing backend 'template', retrieve the automatically calculated scales and use the average of them in both real plotting calls:
def plot_flow(x,y,q,XY,G=source,args=(),size=(7,7),ymax=None):
    "Plot the geometry and induced velocity field"
    # Loop through segments, superimposing the velocity
    def uv(i): return q[i]*velocity(*XY, x[i], y[i], x[i+1], y[i+1], G, args)
    UV = sum(uv(i) for i in range(len(x)-1))
    def get_scale(XY, UV):
        """Get autoscale value by plotting to do-nothing backend."""
        backend = plt.matplotlib.get_backend()
        plt.matplotlib.use('template')
        Q = plt.quiver(*XY, *UV, scale=None)
        plt.matplotlib.use(backend)
        Q._init()
        return Q.scale
    # Get autoscales
    scale_real = get_scale(XY, np.real(UV))
    scale_imag = get_scale(XY, np.imag(UV)) if np.iscomplexobj(UV) else scale_real
    scale = (scale_real + scale_imag)/2
    # Create plot
    plt.figure(figsize=size)
    ax=plt.axes(); ax.set_aspect('equal', adjustable='box')
    # Plot vectors and segments
    plt.quiver(*XY, *np.real(UV), scale=scale)
    if np.iscomplexobj(UV):
        plt.quiver(*XY, *np.imag(UV), scale=scale, color='g')
    plt.plot(x,y,c='b')
    plt.ylim(None,ymax)
In the example, we get a scale of 7.7 as the average of 12.2 and 3.3:
Normalizing the data before plotting it can help to get similar scales for the arrow sizes:
scale = 1
UV_real = np.real(UV) / np.linalg.norm(UV)
UV_imag = np.imag(UV) / np.linalg.norm(UV)
Q1 = plt.quiver(*XY, *UV_real, scale=scale)
Q2 = plt.quiver(*XY, *UV_imag, scale=scale, color='g')
Tested for multiple magnitude ratios between real and imaginary parts.
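A variation on the normalization idea, offered only as a sketch: divide both parts by the length of the longest arrow found in either field, so that one shared scale value treats them consistently. The helper name normalise_jointly is made up here, and it assumes UV is a complex array whose first axis holds the u and v components and that at least one arrow is non-zero.

```python
import numpy as np

def normalise_jointly(UV):
    """Divide both parts of a complex field by the longest arrow in either part."""
    real, imag = np.real(UV), np.imag(UV)
    # Arrow lengths: index 0 holds the u components, index 1 the v components
    longest = max(np.hypot(real[0], real[1]).max(),
                  np.hypot(imag[0], imag[1]).max())
    return real / longest, imag / longest

rng = np.random.default_rng(1)
# Imaginary part deliberately ten times larger than the real part
UV = rng.normal(size=(2, 4, 4)) + 10j * rng.normal(size=(2, 4, 4))
UV_real, UV_imag = normalise_jointly(UV)
```

Both results can then be passed to plt.quiver with the same explicit scale argument, and the relative magnitudes of the two fields are preserved.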

Getting the inverse of a 2d polynomial transform with numpy (for image or raster image warping/sampling)

If I have a 2-dimensional (x and y coordinates) polynomial transform function of 1st/affine, 2nd, or 3rd order (i.e. I have the coefficients/transformation matrix A), what is the mathematical or programmatic approach to getting the exact inverse of this function? Ideally, how would I implement this in Numpy? This is in the context of image warping or map georeferencing, i.e. transforming or warping the coordinates from an input image to an output image in a new warped coordinate system.
Attempted Solution
To solve this I have tried a matrix algebra approach for solving sets of equations. Mathematically, the transformation procedure is represented as Au = v. Forward transforming is easy: you calculate u as a column matrix containing the terms of the polynomial equation based on your input coordinates, and then matrix-multiply the transformation matrix A with u to get the transformed output column matrix v containing the output coordinates. Backwards transforming, on the other hand, means we know the output coordinates v and want to find the input coordinates u, so we need to rearrange our equation as u = A^-1 v. By the rules of matrix algebra, the A matrix has to be inverted when moving it over. Implementing this in Numpy for a 2nd order polynomial transform, it does seem to work:
import numpy as np
# input coords
x = np.array([13])
y = np.array([13])
# terms of the 2nd order polynomial equation
x = x
y = y
xx = x*x
xy = x*y
yy = y*y
ones = np.ones(x.shape)
# u consists of each term in 2nd order polynomial equation
# with each term being array if want to transform multiple
u = np.array([xx,xy,yy,x,y,ones])
print('original input u', u)
## output:
## ('original input u', array([[169.],
## [169.],
## [169.],
## [ 13.],
## [ 13.],
## [ 1.]]))
# forward transform matrix
A = np.array([[1,2,3,1,6,8],
[5,2,9,2,0,1],
[8,1,5,8,4,3],
[1,4,8,2,3,9],
[9,3,2,1,9,5],
[4,2,5,6,2,1]])
# get forward coords
v = A.dot(u)
print('output v', v)
## output:
## ('output v', array([[1113.],
## [2731.],
## [2525.],
## [2271.],
## [2501.],
## [1964.]]))
# get backward coords (should exactly reproduce the input coords)
Ainv = np.linalg.inv(A)
u_pred = Ainv.dot(v)
print('backwards predicted input u', u_pred)
## output:
## ('backwards predicted input u', array([[169.],
## [169.],
## [169.],
## [ 13.],
## [ 13.],
## [ 1.]]))
In the above example the output v is actually a 6x1 matrix, where only the top two rows/values represent the transformed x and y coordinates. The problem is that we need all the additional values in v in order to exactly invert the coordinates. But in real-world scenarios we only know the transformed x and y values (i.e. the top two rows/values of v); we don't know the full 6x1 v matrix.
Maybe I'm thinking about this wrong, or maybe this matrix algebra approach is not the right approach, since 2nd order polynomials and higher are no longer linear? Are there any alternative programmatic/numpy approaches for inverting the polynomial transformation?
Some context
I've looked up many similar questions and websites as well as numpy functions such as numpy.polynomial.Polynomial.fit, but most of them relate only to inverting 1-dimensional polynomial transforms. The few links I've found that talk about 2-dimensional transforms say there is no exact way to invert it, which doesn't make sense since this is a very common operation in image warping/resampling and map georeferencing. For example, the steps for warping an image are often broken down as follows:
1. Forward project all original pixel (column-row) coordinates u using the transformation function/matrix A, in order to find the bounds of the transformed coordinate space v.
2. Then for every coordinate sampled at regular intervals in the transformed coordinate space bounds (found in step 1), backwards sample these v coordinates in the transformed coordinate system to find their original coordinates u. This determines which original pixels to sample for each location in the transformed image.
My problem then is that I have the forward transformation necessary for step 1, but I need to find the exact inverse of that transformation necessary for backwards sampling in step 2. Either a math answer or a numpy solution would be fine.
Inversion of a 2D affine function is pretty easy. It only requires solving a 2x2 linear system of equations.
The case of quadratic and cubic polynomials is much more problematic. If I am right, a system in two unknowns is equivalent to a single quartic or nonic (degree 9) polynomial equation. Explicit (though complicated) formulas exist for the quartic case, but none for the nonic case, and you will have to resort to numerical methods (Newton's iterations).
In addition, the solutions of these nonlinear equations are not unique (you can have 4 or 9 solutions) and you need to keep the right one.
If your transformation remains close to affine (such as when correcting image distortion), I would suggest choosing an affine transformation that approximates the complete equation, using its inverse to find initial approximations, then refining with Newton.
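A minimal sketch of that suggestion for the 2nd order case, under some assumptions of mine: the coefficients are stored as a 2x6 matrix (one row per output coordinate, term order xx, xy, yy, x, y, 1 as in the question), the affine approximation is taken to be just the linear and constant columns, and the example coefficients are invented. It returns whichever solution Newton converges to from that starting point.

```python
import numpy as np

def terms(x, y):
    # Term order as in the question: xx, xy, yy, x, y, 1
    return np.array([x * x, x * y, y * y, x, y, 1.0])

def forward(A, x, y):
    # A has shape (2, 6): one row of coefficients per output coordinate
    return A @ terms(x, y)

def jacobian(A, x, y):
    # Partial derivatives of the six terms with respect to x and to y
    d_dx = np.array([2 * x, y, 0.0, 1.0, 0.0, 0.0])
    d_dy = np.array([0.0, x, 2 * y, 0.0, 1.0, 0.0])
    return np.column_stack([A @ d_dx, A @ d_dy])

def inverse(A, xp, yp, tol=1e-10, max_iter=50):
    target = np.array([xp, yp])
    # Initial guess: invert only the affine part (columns for x, y and 1)
    p = np.linalg.solve(A[:, 3:5], target - A[:, 5])
    for _ in range(max_iter):
        residual = forward(A, p[0], p[1]) - target
        if np.max(np.abs(residual)) < tol:
            break
        # Newton step: solve J * delta = residual, then move against it
        p = p - np.linalg.solve(jacobian(A, p[0], p[1]), residual)
    return p

# Invented, nearly affine coefficients for illustration
A = np.array([[1e-4, 2e-4, -1e-4, 1.0, 0.1, 5.0],
              [-2e-4, 1e-4, 3e-4, -0.05, 1.2, -3.0]])
xp, yp = forward(A, 13.0, 13.0)
recovered = inverse(A, xp, yp)
```

For strongly non-affine transforms the starting guess may land near a different root, so the result should be checked by forward-transforming it again.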

Overlaying mixed effects model results with ggplot2

I have been having some difficulty in displaying the results from my lmer model within ggplot2. I am specifically interested in displaying predicted regression lines on top of observed data. The lmer model I am running on this (speech) data is here below:
lmer.declination <- lmer(zlogF0_m60~Center.syll*Tone + (1|Trial) + (1+Tone|Speaker) + (1|Utterance.num), data=data)
The dependent variable here is fundamental frequency (F0), normalized and averaged across the middle 60% of a syllable. The fixed effects are syllable number (Center.syll), counted backwards from the end of a sentence (e.g. -2 is the 3rd last syllable in the sentence). The data here is from a lexical tone language, so the Tone (all low tone /1/, all mid tone /3/, and all high tone /4/) is a discrete fixed effect. The experimental questions are whether F0 falls across the sentences for this language, if so, by how much, and whether tone matters. It was a bit difficult for me to think of a way to produce a toy data set here, but the data can be downloaded here (a 437K file).
In order to extract the model fits, I used the effects package and converted the output to a data frame.
ex <- Effect(c("Center.syll","Tone"),lmer.declination)
ex.df <- as.data.frame(ex)
I plot the data using ggplot2, with the following code:
t.plot <- ggplot(data, aes(factor(Center.syll), zlogF0_m60, group=Tone, color=Tone)) + stat_summary(fun.data = mean_cl_boot, geom = "smooth") + ylab("Normalized log(F0)") + xlab("Syllable number") + ggtitle("F0 change across utterances with identical level tones, medial 60% of vowel") + geom_pointrange(data=ex.df, mapping=aes(x=Center.syll, y=fit, ymin=lower, ymax=upper)) + theme_bw()
t.plot
This produces the following plot:
Predicted trajectories and observed trajectories
The predicted values appear to the left of the observed data, not overlaid on the data itself. Whatever I seem to try, I can not get them to overlap on the observed data. I would ideally like to have a single line drawn rather than a pointrange, but when I attempted to use geom_line, the default was for the line to connect from the upper bound of one point to the lower bound of the next (not at the median/midpoint). Thank you for your help.
(Edit: As the OP pointed out, he did in fact include a link to his data set. My apologies for implying that he didn't.)
First of all, you will have much better luck getting a helpful response if you provide a minimal, complete, and verifiable example (MCVE). Look here for information on how best to do that for R specifically.
Lacking your actual data to work with, I believe your problem is that you're factoring the x-axis for the stat_summary, but not for the geom_pointrange. I mocked up a toy example from the plot you linked to in order to demonstrate:
dat1 <- data.frame(x=c(-6:0, -5:0, -4:0),
y=c(-0.25, -0.5, -0.6, -0.75, -0.8, -0.8, -1.5,
0.5, 0.45, 0.4, 0.2, 0.1, 0,
0.5, 0.9, 0.7, 0.6, 1.1),
z=c(rep('a', 7), rep('b', 6), rep('c', 5)))
dat2 <- data.frame(x=dat1$x,
y=dat1$y + runif(18, -0.2, 0.2),
z=dat1$z,
upper=dat1$y + 0.3 + runif(18, -0.1, 0.1),
lower=dat1$y - 0.3 + runif(18, -0.1, 0.1))
Now, the following call gives me a result similar to the graph you linked to:
ggplot(dat1, aes(factor(x), # note x being factored here
y, group=z, color=z)) +
geom_line() + # (this is a place-holder for your stat_summary)
geom_pointrange(data=dat2,
mapping=aes(x=x, # but x not being factored here
y=y, ymin=lower, ymax=upper))
However, if I remove the factoring of the initial x value, I get the line and the point ranges overlaid:
ggplot(dat1, aes(x, # no more factoring here
y, group=z, color=z)) +
geom_line() +
geom_pointrange(data=dat2,
mapping=aes(x=x, y=y, ymin=lower, ymax=upper))
Note that I still get the overlaid result if I factor both of the x-axes. The two simply have to be consistent.
Again, I can't stress enough how much it helps this entire process if you provide code we can copy/paste into an R session and see what you're seeing. Hopefully this helps you out, but it all goes more smoothly (and quickly) if you help us help you.

How does the gradient of the sum trick work to get maxpooling positions in keras?

The keras examples directory contains a lightweight version of a stacked what-where autoencoder (SWWAE) which they train on MNIST data. (https://github.com/fchollet/keras/blob/master/examples/mnist_swwae.py)
In the original SWWAE paper, the authors compute the what and where using soft functions. However, in the keras implementation, they use a trick to get these locations. I would like to understand this trick.
Here is the code of the trick.
def getwhere(x):
    ''' Calculate the 'where' mask that contains switches indicating which
    index contained the max value when MaxPool2D was applied. Using the
    gradient of the sum is a nice trick to keep everything high level.'''
    y_prepool, y_postpool = x
    return K.gradients(K.sum(y_postpool), y_prepool) # How exactly does this line work?
Where y_prepool is an MxN matrix and y_postpool is an M/2 x N/2 matrix (let's assume canonical pooling with a size of 2 pixels).
I have verified that the output of getwhere() is a bed of nails matrix where the nails indicate the position of the max (the local argmax if you will).
Can someone construct a small example demonstrating how getwhere works using this "Trick?"
Let's focus on the simplest example, without really talking about convolutions. Say we have a vector
x = [1 4 2]
which we max-pool over (with a single, big window), we get
mx = 4
mathematically speaking, it is:
mx = x[argmax(x)]
now, the "trick" to recover one hot mask used by pooling is to do
magic = d mx / dx
there is no gradient for argmax; however, it "passes" the corresponding gradient to the element of the vector at the location of the maximum, so:
d mx / dx = [d mx/dx[1], d mx/dx[2], d mx/dx[3]] = [0, dx[2]/dx[2], 0] = [0 1 0]
as you can see, all the gradients for non-maximum elements are zero (due to argmax), and "1" appears at the maximum value because dx[2]/dx[2] = 1.
Now for "proper" maxpool you have many pooling regions, connected to many input locations, thus taking analogous gradient of sum of pooled values, will recover all the indices.
Note however, that this trick will not work if you have heavily overlapping kernels - you might end up with bigger values than "1". Basically if a pixel is max-pooled by K kernels, then it will have value K, not 1, for example:
    [1,  2, 3]
x = [13, 3, 1]
    [4,  2, 9]
if we max pool with a 2x2 window (moving one pixel at a time, so the windows overlap) we get
mx = [13, 3]
     [13, 9]
and the gradient trick gives you
        [0, 0, 1]
magic = [2, 0, 0]
        [0, 0, 1]
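The counting behaviour can be checked without a deep learning framework. The sketch below does not call K.gradients; it emulates by hand what the gradient of the summed max-pool output amounts to, by letting every window add 1 at the position of its maximum. The function name where_mask is made up, and ties are broken by taking the first maximum in the window, which is an assumption (a real backend may pick a different tied element).

```python
import numpy as np

def where_mask(x, size=2, stride=2):
    """Emulate d(sum(maxpool(x)))/dx: each window adds 1 at its argmax."""
    mask = np.zeros_like(x, dtype=float)
    rows, cols = x.shape
    for r in range(0, rows - size + 1, stride):
        for c in range(0, cols - size + 1, stride):
            window = x[r:r + size, c:c + size]
            # Position of the (first) maximum inside this window
            dr, dc = np.unravel_index(np.argmax(window), window.shape)
            mask[r + dr, c + dc] += 1.0
    return mask

x = np.array([[1, 2, 3],
              [13, 3, 1],
              [4, 2, 9]])

# Overlapping 2x2 windows: the 13 is the maximum of two windows
overlapping = where_mask(x, size=2, stride=1)

# Non-overlapping 2x2 windows on a 4x4 input give a "bed of nails" of ones
nails = where_mask(np.arange(16.0).reshape(4, 4), size=2, stride=2)
```

With the overlapping windows this reproduces the matrix above, including the 2 at the position of 13, while the non-overlapping case contains only zeros and ones.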