Closed form solution to single layer perceptron - tensorflow

A single layer perceptron is easy to convert to the form:
A # x = b
Where:
A is a matrix of shape (m,n),
x is of shape (m),
and b is of shape (n).
(Apologies if the shapes are transposed... in ML, the first dim is usually the y axis not the x due to row-major stuff, but I think in normal matrix math, the first axis is the x axis).
Can I use the moore-penrose approximation of inverses to calculate the OLS best fit approximation of A?
I suspect this is trivial high school linear algebra.

Yea, it was high school linear algebra, with the Moore-Penrose pseudoinverse x^+ standing in for the inverse (x is not square, so it has no true inverse):
A # x = b
A # x # x^+ = b # x^+
A = b # x^+
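To confirm the idea numerically, here is a minimal numpy sketch (shapes and names are my own) that recovers A from x and b using np.linalg.pinv, the Moore-Penrose pseudoinverse, which yields the OLS best fit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: input samples stacked as columns of x,
# with b holding the corresponding outputs, so A # x = b.
A_true = rng.random((4, 6))
x = rng.random((6, 100))              # 100 input samples as columns
b = A_true @ x                        # corresponding outputs

# A = b # x^+ : the Moore-Penrose pseudoinverse gives the least-squares fit
A = b @ np.linalg.pinv(x)
assert np.allclose(A, A_true)
```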

Related

Getting single value from the N dim histogram in NumPy or SciPy

Assume I have a data like this:
x = np.random.randn(4, 100000)
and I fit a histogram
hist = np.histogramdd(x, density=True)
What I want is to get the probability of number g, e.g. g=0.1. Assume some hypothetical function foo then.
g = 0.1
prob = foo(hist, g)
print(prob)
>> 0.2223124214
How could I do something like this, where I get probability back for a single or a vector of numbers for a fitted histogram? Especially histogram that is N-dimensional.
histogramdd takes O(r^D) memory, and unless you have a very large dataset or very small dimension you will have a poor estimate. Consider your example data, 100k points in 4-D space, the default histogram will be 10 x 10 x 10 x 10, so it will have 10k bins.
x = np.random.randn(4, 100000)
hist = np.histogramdd(x.transpose(), density=True)
np.mean(hist[0] == 0)
gives something around 0.77, meaning that 77% of the bins in the histogram have no points.
You probably want to smooth the distribution. Unless you have a good reason not to, I would suggest using a Gaussian kernel density estimate:
import numpy as np
import scipy.stats

x = np.random.randn(4, 100000)       # d x n array
f = scipy.stats.gaussian_kde(x)      # d-dimensional PDF
f([1, 2, 3, 4])                      # evaluate the PDF at a given point
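If you do want to stick with the histogram, the lookup function foo from the question can be sketched as follows (hist_prob_density is a hypothetical helper of my own; note a density histogram returns probability *density*, not probability mass):

```python
import numpy as np

def hist_prob_density(hist, edges, point):
    """Return the density of the bin containing `point` (hypothetical `foo`)."""
    idx = []
    for coord, e in zip(point, edges):
        if coord < e[0] or coord > e[-1]:
            return 0.0                      # outside the histogram support
        i = np.searchsorted(e, coord, side="right") - 1
        idx.append(min(i, len(e) - 2))      # last bin's right edge is inclusive
    return hist[tuple(idx)]

x = np.random.default_rng(0).standard_normal((4, 100000))
hist, edges = np.histogramdd(x.T, density=True)
print(hist_prob_density(hist, edges, [0.1, 0.1, 0.1, 0.1]))
```

To turn the density into the probability mass of that bin, multiply by the bin volume (the product of the bin widths along each dimension).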

Is this Neural Net example I'm looking at a mistake or am I not understanding backprop?

Is this model using one relu in two places, or are gradients computed by doing a matrix multiplication of layers on both sides of one layer?
In the last layer of this simple neural net (below), during backprop it calculates the gradient for the last layer w2 by a matrix multiplication of y_pred - y and h_relu, which I thought only belonged between layers w1 and w2, not between w2 and y_pred.
The line in question is near the bottom. It is grad_w2 = h_relu.t().mm(grad_y_pred).
I am confused because I thought everything was supposed to go in order forward and go in order backwards. Is this relu being used in two places?
Here is an attempt at a visual illustration of the model.
This example is from the Pytorch website. It is the second block of code on the page.
import torch
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)
learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
I appreciate your patience looking at this and trying to clear this up for me.
If you could try adding another layer of weights in the middle with another relu, that might help me understand. This is what I was trying to do.
Consider the following diagram which represents the network in question. The concept of back-propagation is simply a way to quickly and intuitively apply the chain rule on a complex sequence of operations to compute the gradient of an output w.r.t. a tensor. Usually we are interested in computing the gradients of leaf tensors (tensors which are not derived from other tensors) with respect to a loss or objective. All the leaf tensors are represented as circles in the following diagram and the loss is represented by the rectangle with the L label.
Using the backward diagram we can follow the path from L to w1 and w2 in order to determine which partial derivatives we need in order to compute the gradient of L w.r.t. w1 and w2. For simplicity we will assume that all the leaf tensors are scalars so as to avoid getting into the complexities of multiplying vectors and matrices.
Using this approach, the gradients of L w.r.t. w1 and w2 are
dL/dw2 = (dL/dy) * (dy/dw2)
and
dL/dw1 = (dL/dy) * (dy/dh_relu) * (dh_relu/dh) * (dh/dw1)
Something to notice is that since w2 is a leaf tensor, we only use dy/dw2 (aka grad_w2) during computation of dL/dw2 since it isn't part of the path from L to w1.
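To the follow-up about adding another layer: below is a numpy port of the same example extended with a second hidden layer and a second relu (the sizes, the 0.1 init scale, and the learning rate are my own choices, not from the PyTorch tutorial). Note how each grad_w pairs a layer's *input* with the gradient flowing into that layer's *output*, which is exactly the pattern grad_w2 = h_relu.t().mm(grad_y_pred) follows in the two-layer code:

```python
import numpy as np

def forward_backward(x, y, w1, w2, w3):
    """One forward/backward pass of x -> relu -> relu -> y_pred, by hand."""
    h1 = x @ w1
    h1_relu = np.maximum(h1, 0)
    h2 = h1_relu @ w2
    h2_relu = np.maximum(h2, 0)
    y_pred = h2_relu @ w3
    loss = ((y_pred - y) ** 2).sum()

    grad_y_pred = 2.0 * (y_pred - y)
    grad_w3 = h2_relu.T @ grad_y_pred          # input of layer 3 x grad at its output
    grad_h2 = (grad_y_pred @ w3.T) * (h2 > 0)  # push gradient back through relu #2
    grad_w2 = h1_relu.T @ grad_h2              # input of layer 2 x grad at its output
    grad_h1 = (grad_h2 @ w2.T) * (h1 > 0)      # push gradient back through relu #1
    grad_w1 = x.T @ grad_h1                    # input of layer 1 x grad at its output
    return loss, grad_w1, grad_w2, grad_w3

rng = np.random.default_rng(0)
N, D_in, H1, H2, D_out = 64, 50, 40, 30, 10
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = 0.1 * rng.standard_normal((D_in, H1))     # small init keeps training stable
w2 = 0.1 * rng.standard_normal((H1, H2))
w3 = 0.1 * rng.standard_normal((H2, D_out))

learning_rate = 1e-4
losses = []
for t in range(200):
    loss, g1, g2, g3 = forward_backward(x, y, w1, w2, w3)
    losses.append(loss)
    w1 -= learning_rate * g1
    w2 -= learning_rate * g2
    w3 -= learning_rate * g3
```

The order is strictly forward then strictly backward; no relu is reused, each relu's mask (h1 > 0, h2 > 0) is simply consulted again on the way back.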

In Tensorflow (in general deep learning), can I apply "local FC layer" with "CONV layer"?

Could anyone check my reasoning?
Let's say I have a (pre-trained) fully connected layer fc that takes b x 20 x 20 x 10 as input and produces b x 64 as output, where b is the batch size.
Now I have an input of c x 100 x 60 x 10. The height and width 100 x 60 can be subdivided into 5 x 3 tiles of 20 x 20. I would like to get a 5 x 3 grid of local responses (outputs) from the fc layer, i.e. c x 5 x 3 x 64.
Now I am thinking: is doing this the same as having a convolution layer with the fc weights and strides of width 20 and height 20? Is that correct, or can there be a difference?
Yes, it will be the same if appropriate reshaping of the dense layer weight matrix is performed.
Let us first look at the dense layer. You input a 20 x 20 x 10 matrix to the dense layer. It will first be flattened out to produce a 4000 x 1 vector. You want the output to be of size 64 x 1 vector. So, the weight matrix required is 4000 x 64 and 64 bias parameters. Then y = w^T * x + b = [4000 x 64]^T * [4000 x 1] + [64 x 1] will yield a [64 x 1] vector. Therefore, y[i] = w[i][0]*x[0] + ... + w[i][3999]*x[3999] + b[i] for i = [0, 63]. Note that b indicates a bias parameter.
Let us turn to convolution. To produce a 5 x 3 x 64 output from an input of size 100 x 60 x 10, you need 64 filters, each of size (20,20) and strides (20,20) with no zero-padding. Each 20 x 20 filter however has local connectivity extending along the entire depth i.e. a neuron is connected to all the 10 dimensions along the depth of input. Please read this for more information on local connectivity of convolutional layer.
Your convolutional layer has a receptive field of 20 x 20. Each neuron in the convolutional layer is connected to a 20 x 20 x 10 block of the input, thus 4000 weights in total (plus one bias parameter). You have 64 such filters, so the total learnable weights for this layer = 4000 x 64 + 64. Convolution between one 20 x 20 x 10 block of x and w (size = 64 x 20 x 20 x 10) can be performed as:
convResult = np.sum(np.sum(np.sum(x*w[:,:,::-1,::-1], axis=-1), axis=-1),axis=-1)
There are some fine points here. I did w[:,:,::-1,::-1] because theano convolution flips the convolution kernel (well, not that simple!). If you are interested in who flips and who does not, read this.
Finally, dense layer and convolution layer (in this context) essentially do the same operation. They first element-wise multiply and then sum up two sets of vectors/matrices of 4000 elements. This procedure is repeated 64 times to produce a 64 x 1 vector. So, it is possible to achieve exactly the same result with dense and convolution layer by proper reshaping of the dense layer weight matrix. However, you need to take care of kernel flipping to match the results.
Below I give a code snippet to compute convolution manually (using numpy) and using Theano.
import theano
from theano import tensor as T
import numpy as np
X = T.ftensor4('X')
W = T.ftensor4('W')
out = T.nnet.conv2d(X,W)
f = theano.function([X, W], out, allow_input_downcast=True)
x = np.random.random((1,10,20,20))
w = np.random.random((64,10,20,20))
# convolution using Theano
c1 = np.squeeze(f(x,w)[0])
# convolution using Numpy
c2 = np.sum(np.sum(np.sum(x*w[:,:,::-1,::-1],axis=-1),axis=-1),axis=-1)
# check that both are almost identical (max absolute difference should be ~0)
print(np.amax(np.abs(c2 - c1)))
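The equivalence can also be checked without Theano at all. Here is a plain numpy sketch (the array layout and all names are my own) where a 100 x 60 x 10 input is cut into 5 x 3 tiles of 20 x 20, and the dense layer's 4000 x 64 weight matrix is reshaped into 64 filters. No kernel flipping appears because both sides are written as plain multiply-and-sum (cross-correlation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((100, 60, 10))            # H x W x depth input
w_fc = rng.random((20 * 20 * 10, 64))    # dense layer: 4000 x 64 weights
b = rng.random(64)                       # 64 bias parameters

out_fc = np.empty((5, 3, 64))            # dense layer applied per 20x20 tile
out_conv = np.empty((5, 3, 64))          # "conv" with stride (20, 20), no padding
w_conv = w_fc.T.reshape(64, 20, 20, 10)  # reshape dense weights into 64 filters

for i in range(5):
    for j in range(3):
        block = x[i * 20:(i + 1) * 20, j * 20:(j + 1) * 20, :]
        # flatten-then-dot (what the dense layer does) ...
        out_fc[i, j] = block.reshape(-1) @ w_fc + b
        # ... equals elementwise multiply-and-sum per filter (what the conv does)
        out_conv[i, j] = (block[None] * w_conv).sum(axis=(1, 2, 3)) + b

assert np.allclose(out_fc, out_conv)
```

The reshape works because numpy flattens in the same C order on both sides, so column k of the dense weight matrix becomes filter k.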

Efficient way in TensorFlow to subtract mean and divide by standard deviation for each row

I have a tensor of shape [x, y] and I want to subtract the mean and divide by the standard deviation row-wise (i.e. I want to do it for each row). What is the most efficient way to do this in TensorFlow?
Of course I can loop through rows as follows:
new_tensor = [i - tf.reduce_mean(i) for i in old_tensor]
...to subtract the mean and then do something similar to find the standard deviation and divide by it, but is this the best way to do it in TensorFlow?
The TensorFlow tf.sub() and tf.div() operators support broadcasting, so you don't need to iterate through every row. Let's consider the mean, and leave standard deviation as an exercise:
old_tensor = ... # shape = (x, y)
mean = tf.reduce_mean(old_tensor, 1, keep_dims=True) # shape = (x, 1)
stdev = ... # shape = (x,)
stdev = tf.expand_dims(stdev, 1) # shape = (x, 1)
new_tensor = old_tensor - mean # shape = (x, y)
new_tensor = new_tensor / stdev # shape = (x, y)
The subtraction and division operators implicitly broadcast a tensor of shape (x, 1) along the column dimension to match the shape of the other argument, (x, y). For more details about how broadcasting works, see the NumPy documentation on the topic (TensorFlow implements NumPy broadcasting semantics).
calculate moments along axis 1 (y in your case) and keep dimensions, i.e. shape of mean and var is (len(x), 1)
subtract mean and divide by standard deviation (i.e. square root of variance)
mean, var = tf.nn.moments(old_tensor, [1], keep_dims=True)
new_tensor = tf.div(tf.subtract(old_tensor, mean), tf.sqrt(var))
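Since TensorFlow implements NumPy broadcasting semantics, the same row-wise pattern can be sanity-checked in plain numpy (a small sketch with shapes of my own choosing):

```python
import numpy as np

x = np.random.default_rng(1).random((4, 5))   # shape = (x, y)
mean = x.mean(axis=1, keepdims=True)          # shape = (x, 1)
std = x.std(axis=1, keepdims=True)            # shape = (x, 1)
z = (x - mean) / std                          # (x, 1) broadcasts over columns

# each row now has zero mean and unit standard deviation
assert np.allclose(z.mean(axis=1), 0)
assert np.allclose(z.std(axis=1), 1)
```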

finding matrix through optimisation

I am looking for an algorithm to solve the following problem:
I have two sets of vectors, and I want to find the matrix that best approximates the transformation from the input vectors to the output vectors.
The vectors are 3x1, so the matrix is 3x3.
This is the general problem. My particular problem is I have a set of RGB colors, and another set that contains the desired color. I am trying to find an RGB to RGB transformation that would give me colors closer to the desired ones.
There is a correspondence between the input and output vectors, so computing an error function to be minimized is the easy part. But how can I minimize this function?
This is a classic linear algebra problem, the key phrase to search on is "multiple linear regression".
I've had to code some variation of this many times over the years. For example, code to calibrate a digitizer tablet or stylus touch-screen uses the same math.
Here's the math:
Let p be an input vector and q the corresponding output vector.
The transformation you want is a 3x3 matrix; call it A.
For a single input and output vector p and q, there is an error vector e
e = q - A x p
The square of the magnitude of the error is a scalar value:
eT x e = (q - A x p)T x (q - A x p)
(where the T operator is transpose).
What you really want to minimize is the sum of the squared error magnitudes over the sets:
E = sum (eT x e)
This minimum satisfies the matrix equation D = 0 where
D(i,j) = the partial derivative of E with respect to A(i,j)
Say you have N input and output vectors.
Your set of input 3-vectors is a 3xN matrix; call this matrix P.
The ith column of P is the ith input vector.
So is the set of output 3-vectors; call this matrix Q.
When you grind thru all of the algebra, the solution is
A = Q x PT x (P x PT) ^-1
(where ^-1 is the inverse operator -- sorry about no superscripts or subscripts)
Here's the algorithm:
Create the 3xN matrix P from the set of input vectors.
Create the 3xN matrix Q from the set of output vectors.
Matrix Multiply R = P x transpose (P)
Compute the inverse of R
Matrix Multiply A = Q x transpose(P) x inverse (R)
using the matrix multiplication and matrix inversion routines of your linear algebra library of choice.
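As a sanity check, the algorithm above is only a few lines of numpy (names are my own; noise-free data is used here so the recovery is exact, and for ill-conditioned P, np.linalg.lstsq would be the numerically safer route):

```python
import numpy as np

rng = np.random.default_rng(0)
A_true = rng.random((3, 3))        # the transform we hope to recover
P = rng.random((3, 10))            # 3xN matrix of input vectors as columns
Q = A_true @ P                     # 3xN matrix of corresponding outputs

R = P @ P.T                        # step: R = P x transpose(P), a 3x3 matrix
A = Q @ P.T @ np.linalg.inv(R)     # step: A = Q x transpose(P) x inverse(R)
assert np.allclose(A, A_true)
```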
However, a 3x3 transform matrix is capable of scaling and rotating the input vectors, but not of doing any translation! This might not be general enough for your problem. It's usually a good idea to append a "1" to the end of each of the 3-vectors to make them 4-vectors, and look for the best 3x4 transform matrix that minimizes the error. This can't hurt; it can only lead to a better fit of the data.
You don't specify a language, but here's how I would approach the problem in Matlab.
v1 is a 3xn matrix, containing your input colors in vertical vectors
v2 is also a 3xn matrix containing your output colors
You want to solve the system
M*v1 = v2
M = v2*inv(v1)
However, v1 is not directly invertible, since it's not a square matrix. Matlab will solve this automatically with the mrdivide operation (M = v2/v1), where M is the best fit solution.
eg:
>> v1 = rand(3,10);
>> M = rand(3,3);
>> v2 = M * v1;
>> v2/v1 - M
ans =
1.0e-15 *
0.4510 0.4441 -0.5551
0.2220 0.1388 -0.3331
0.4441 0.2220 -0.4441
>> (v2 + randn(size(v2))*0.1)/v1 - M
ans =
0.0598 -0.1961 0.0931
-0.1684 0.0509 0.1465
-0.0931 -0.0009 0.0213
This gives a more language-agnostic solution on how to solve the problem.
Some linear algebra should be enough :
Write the average squared difference between transformed inputs and outputs (the sum of the squares of each difference between each transformed input value and the corresponding output value). I take this as the definition of "best approximates".
This is a quadratic function of your 9 unknown matrix coefficients.
To minimize it, differentiate it with respect to each of them.
You will get a linear system of 9 equations to solve for the solution (unique, or a whole solution space, depending on the input set).
When the difference function is not quadratic, you can do the same but you have to use an iterative method to solve the equation system.
This answer is better for beginners in my opinion:
Have the following scenario:
We don't know the matrix M, but we know the input vectors In and their corresponding output vectors On, where n can range from 3 and up.
If we had exactly 3 input vectors and 3 output vectors (for a 3x3 matrix), we could compute the coefficients αr,c precisely; we would have a fully specified system.
But we have more than 3 vectors, and thus an overdetermined system of equations.
Let's write down these equations. Say that we have these vectors:
We know that to get the vector On, we must perform a matrix multiplication with vector In. In other words: M · I̅n = O̅n
If we expand this operation, each output component gives one linear equation in the unknown coefficients: On[r] = αr,1·In[1] + αr,2·In[2] + αr,3·In[3] for each row r.
We do not know the alphas, but we know all the rest. In fact, there are 9 unknowns, but 12 equations. This is why the system is overdetermined. There are more equations than unknowns. We will approximate the unknowns using all the equations, and we will use the sum of squares to aggregate more equations into less unknowns.
So we will combine the above equations into a matrix form:
And with some least squares algebra magic (regression), we can solve for b̅:
b̅ = (Xᵀ · X)⁻¹ · Xᵀ · y̅
This is what is happening behind that formula:
Multiplying the transposed matrix with its non-transposed part creates a square matrix of reduced dimension ([9x12] · [12x9] = [9x9]).
The inverse of this result allows us to solve for b̅.
Multiplying vector y̅ by the transposed X reduces it to a [9x1] vector. Then, by multiplying the [9x9] inverse with this [9x1] vector, we have solved the system for b̅.
Now we take the [9x1] result vector and reshape it into a square matrix. This is our approximated transformation matrix.
A python code:
import numpy as np
import numpy.linalg
INPUTS = [[5,6,2],[1,7,3],[2,6,5],[1,7,5]]
OUTPUTS = [[3,7,1],[3,7,1],[3,7,2],[3,7,2]]
def get_mat(inputs, outputs, entry_len):
    n_of_vectors = len(inputs)
    # We construct the input matrix X by linearizing (flattening) the unknown
    # matrix row-major as [a11, a12, ..., a33], so each input vector
    # contributes entry_len rows to X (one equation per output component).
    X_mat = []
    for in_n in range(n_of_vectors):        # for each input vector
        base = 0
        for col_n in range(entry_len):      # each row of the unknown matrix meets the whole input vector
            row = [0 for _ in range(entry_len ** 2)]
            for entry in inputs[in_n]:
                row[base] = entry
                base += 1
            X_mat.append(row)
    Y_mat = [item for sublist in outputs for item in sublist]
    X_np = np.array(X_mat)
    Y_np = np.array([Y_mat]).T
    # b = (X^T X)^-1 X^T y
    solution = np.dot(np.dot(numpy.linalg.inv(np.dot(X_np.T, X_np)), X_np.T), Y_np)
    var_mat = solution.reshape(entry_len, entry_len)  # create square matrix
    return var_mat

transf_mat = get_mat(INPUTS, OUTPUTS, 3)  # 3 means a 3x3 matrix, and in/out vector size 3
print(transf_mat)

for i in range(len(INPUTS)):
    o = np.dot(transf_mat, np.array([INPUTS[i]]).T)
    print(f"{INPUTS[i]} x [M] = {o.T} ({OUTPUTS[i]})")
The output is as such:
[[ 0.13654096 0.35890767 0.09530002]
[ 0.31859558 0.83745124 0.22236671]
[ 0.08322497 -0.0526658 0.4417611 ]]
[5, 6, 2] x [M] = [[3.02675088 7.06241873 0.98365224]] ([3, 7, 1])
[1, 7, 3] x [M] = [[2.93479472 6.84785436 1.03984767]] ([3, 7, 1])
[2, 6, 5] x [M] = [[2.90302805 6.77373212 2.05926064]] ([3, 7, 2])
[1, 7, 5] x [M] = [[3.12539476 7.29258778 1.92336987]] ([3, 7, 2])
You can see that it took all the specified inputs, computed the transformed outputs, and matched them against the reference vectors. The results are not precise, since we have an approximation from the overdetermined system. If we used INPUTS and OUTPUTS with only 3 linearly independent vectors, the result would be exact.
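For comparison, the same least-squares fit can be obtained in one call with np.linalg.lstsq, which solves the overdetermined system without forming the normal equations explicitly (a sketch using the same data; variable names are mine):

```python
import numpy as np

inputs = np.array([[5, 6, 2], [1, 7, 3], [2, 6, 5], [1, 7, 5]], dtype=float)
outputs = np.array([[3, 7, 1], [3, 7, 1], [3, 7, 2], [3, 7, 2]], dtype=float)

# M @ v_in = v_out for each pair  <=>  inputs @ M.T = outputs
M = np.linalg.lstsq(inputs, outputs, rcond=None)[0].T
print(M @ inputs[0])   # should come out close to the reference [3, 7, 1]
```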