Tabular data: Implementing a custom tensor layer without resorting to iteration - tensorflow

I have an idea for a tensor operation that would not be difficult to implement via iteration, with batch size one. However I would like to parallelize it as much as possible.
I have two tensors with shape (n, 5) called X and Y. X is actually supposed to represent 5 one-dimensional tensors with shape (n, 1): (x_1, ..., x_5). Ditto for Y.
I would like to compute a tensor with shape (n, 25) where each column represents the output of the tensor operation f(x_i, y_j), where f is fixed for all 1 <= i, j <= 5. The operation f has output shape (n, 1), just like x_i and y_j.
I feel it is important to clarify that f is essentially a fully-connected layer from the concatenated [...x_i, ...y_j] tensor with shape (1, 10) to an output layer with shape (1, 5).
Again, it is easy to see how to do this manually with iteration and slicing. However this is probably very slow. Performing this operation in batches, where the tensors X, Y now have shape (n, 5, batch_size) is also desirable, particularly for mini-batch gradient descent.
It is difficult to really articulate here why I desire to create this network; I feel it is suited for my domain of 'itemized tabular data' and cuts down significantly on the number of weights per operation, compared to a fully connected network.
Is this possible using tensorflow? Certainly not using just keras.
Below is an example in numpy, per AloneTogether's request:
import numpy as np

features = 16
batch_size = 256
X_batch = np.random.random((features, 5, batch_size))
Y_batch = np.random.random((features, 5, batch_size))
# One shared weight matrix, to reduce weights in this custom 'layer'.
f = np.random.random((features, 2 * features))

out = np.empty((features, 5, 5, batch_size))  # collect f(x_i, y_j) for all i, j
for b in range(batch_size):
    X = X_batch[:, :, b]
    Y = Y_batch[:, :, b]
    for i in range(5):
        x_i = X[:, i:i+1]
        for j in range(5):
            y_j = Y[:, j:j+1]
            x_i_y_j = np.concatenate([x_i, y_j], axis=0)
            # f(x_i, y_j), implemented by a fully-connected layer.
            f_i_j = np.matmul(f, x_i_y_j)
            out[:, i, j, b] = f_i_j[:, 0]
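For reference, the same computation can be vectorized in plain numpy (a sketch under the shapes above; x_rep, y_rep and vec are illustrative names): all 25 (x_i, y_j) concatenations are built at once and f is applied with a single einsum.
x_rep = np.repeat(X_batch[:, :, None, :], 5, axis=2)   # (features, 5, 5, batch)
y_rep = np.repeat(Y_batch[:, None, :, :], 5, axis=1)   # (features, 5, 5, batch)
xy = np.concatenate([x_rep, y_rep], axis=0)            # (2*features, 5, 5, batch)
vec = np.einsum('fg,gijb->fijb', f, xy)                # (features, 5, 5, batch)
assert np.allclose(vec, out)                           # matches the loop above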

All the operations you need (concatenation and matrix multiplication) can be batched.
The difficult part is that you want to concatenate the features of every item in X with the features of every item in Y (all combinations).
My recommended solution is to expand the dimensions of X to [batch, features, 5, 1] and the dimensions of Y to [batch, features, 1, 5].
Then tf.repeat() both tensors so their shapes become [batch, features, 5, 5].
Now you can concatenate X and Y. You will have a tensor of shape [batch, 2*features, 5, 5]. Observe that this builds all combinations.
The next step is matrix multiplication. tf.matmul() can also do batched matrix multiplication, but I use tf.einsum() here because I want more control over which dimensions are treated as batch dimensions.
Full code:
import tensorflow as tf
import numpy as np

batch_size = 3
features = 6
items = 5

x = np.random.uniform(size=[batch_size, features, items])
y = np.random.uniform(size=[batch_size, features, items])
f = np.random.uniform(size=[2 * features, features])

x_reps = tf.repeat(x[:, :, :, tf.newaxis], items, axis=3)   # [batch, features, items, items]
y_reps = tf.repeat(y[:, :, tf.newaxis, :], items, axis=2)   # [batch, features, items, items]
xy_conc = tf.concat([x_reps, y_reps], axis=1)               # [batch, 2*features, items, items]
f_i_j = tf.einsum("bfij,fg->bgij", xy_conc, f)              # [batch, features, items, items]
f_i_j = tf.reshape(f_i_j, [batch_size, features, items * items])
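As a quick sanity check (a sketch, assuming TF 2.x eager execution), one (i, j) pair computed by hand matches the corresponding column of the einsum result:
i, j = 1, 3
xy = np.concatenate([x[0, :, i], y[0, :, j]])  # concatenated [x_i, y_j], shape [2*features]
manual = xy @ f                                # apply f to one pair, shape [features]
np.testing.assert_allclose(manual, f_i_j[0, :, i * items + j].numpy(), rtol=1e-5)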

Related

Optimizing Tensorflow for many small matrix-vector multiplications

To build up a capsule network training script, I need to compute many small matrix-vector multiplications.
The size of each weight matrix is at most 20 by 20.
The number of weight matrices is more than 900.
I'm curious whether tf.matmul or tf.linalg.matvec is the best option for this.
Could anybody give me a hint to optimize the training script?
EDIT:
Looking at the notebook that you are referring to, it seems you have the following parameters:
batch_size = 50
caps1_n_caps = 1152
caps1_n_dims = 8
caps2_n_caps = 10
caps2_n_dims = 16
And then you have a tensor w with shape (caps1_n_caps, caps2_n_caps, caps2_n_dims, caps1_n_dims) (in the notebook it has an initial dimension with size 1 that I am skipping) and another tensor caps1_output with shape (batch_size, caps1_n_caps, caps1_n_dims). And you need to combine them to produce caps2_predicted with shape (batch_size, caps1_n_caps, caps2_n_caps, caps2_n_dims).
In the notebook they tile the tensors in order to combine them with tf.linalg.matmul, but actually you can compute the same result without any tiling, just using tf.einsum:
import tensorflow as tf
batch_size = 50
caps1_n_caps = 1152
caps1_n_dims = 8
caps2_n_caps = 10
caps2_n_dims = 16
w = tf.zeros((caps1_n_caps, caps2_n_caps, caps2_n_dims, caps1_n_dims), dtype=tf.float32)
caps1_output = tf.zeros((batch_size, caps1_n_caps, caps1_n_dims), dtype=tf.float32)
caps2_predicted = tf.einsum('ijkl,bil->bijk', w, caps1_output)
print(caps2_predicted.shape)
# (50, 1152, 10, 16)
I'm not sure if I have understood exactly what you want, but you say you want to compute something like:
û_ij = W_ij × u_i
for a collection of several matrices W and vectors u. Assuming you have 900 matrices and vectors, where the matrices have size 20×20 and the vectors size 20, you can represent them as two tensors: ws, with shape (900, 20, 20), and us, with shape (900, 20). If you do that, your result us_hat, with shape (900, 20), would be computed simply as:
us_hat = tf.linalg.matvec(ws, us)
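The same batched matrix-vector product can also be written as an einsum contraction (a minimal sketch with the shapes assumed above):
import tensorflow as tf

ws = tf.zeros((900, 20, 20))  # 900 matrices of size 20x20
us = tf.zeros((900, 20))      # 900 vectors of size 20
us_hat = tf.einsum('nij,nj->ni', ws, us)  # one matvec per index n; shape (900, 20)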

Cleaner way to whiten each image in a batch using keras

I would like to whiten each image in a batch. The code I have to do so is this:
def whiten(self, x):
    shape = x.shape
    x = K.batch_flatten(x)
    mn = K.mean(x, 0)
    std = K.std(x, 0) + K.epsilon()
    r = (x - mn) / std
    r = K.reshape(r, (-1, shape[1], shape[2], shape[3]))
    return r
where x is (?, 320,320,1). I am not keen on the reshape function with a -1 arg. Is there a cleaner way to do this?
Let's see what the -1 does. From the Tensorflow documentation (because the Keras documentation is scarce compared to Tensorflow's):
If one component of shape is the special value -1, the size of that dimension is computed so that the total size remains constant.
So what this means:
import tensorflow as tf
from keras import backend as K

X = tf.constant([1, 2, 3, 4, 5])
K.reshape(X, [-1, 5])
# Add one more dimension; the number of columns should be 5,
# and the total number of elements stays constant.
# [[1 2 3 4 5]]

X = tf.constant([1, 2, 3, 4, 5, 6])
K.reshape(X, [-1, 3])
# Add one more dimension; the number of columns should be 3.
# For the number of elements to stay constant, the number of rows must be 2.
# [[1 2 3]
#  [4 5 6]]
I think it is simple enough. So what happens in your code:
# Let's assume we have 5 images, 320x320 with 3 channels
X = tf.ones((5, 320, 320, 3))
shape = X.shape
# Let's flatten the tensor so we can perform the rest of the computation
flatten = K.batch_flatten(X)
# What this does: turn an nD tensor into a 2D tensor with the same 0th dimension
# (taken directly from the documentation; let's see that below)
flatten.shape
# (5, 307200)
# So all the other elements were squeezed into one dimension while keeping the batch_size the same
# ...The rest of the stuff in your code is executed here...
# So we did all we wanted, and now we want to revert the tensor to the shape it had previously
r = K.reshape(flatten, (-1, shape[1],shape[2],shape[3]))
r.shape
# (5, 320, 320, 3)
Besides, I can't think of a cleaner way to do what you want to do. If you ask me, your code is already clear enough.
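That said, if the goal is just to avoid the flatten/reshape round trip, the same statistics can be computed with keepdims reductions so the image shape is never lost (a sketch in plain TensorFlow, assuming the axis-0 mean across the batch in your code is intended):
import tensorflow as tf

def whiten(x, epsilon=1e-7):
    # Per-pixel standardization across the batch; keepdims leaves
    # broadcastable (1, H, W, C) statistics, so no reshape is needed.
    mn = tf.reduce_mean(x, axis=0, keepdims=True)
    std = tf.math.reduce_std(x, axis=0, keepdims=True) + epsilon
    return (x - mn) / std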

Solving XOR with 3 data points using Multi-Layered Perceptron

The XOR problem is known to be solvable by a multi-layer perceptron: given all 4 boolean inputs and outputs, it trains and memorizes the weights needed to reproduce the I/O, e.g.:
import numpy as np
np.random.seed(0)

def sigmoid(x):  # Squashes x into the range (0, 1).
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # See https://math.stackexchange.com/a/1225116
    return sx * (1 - sx)

# Cost function.
def cost(predicted, truth):
    return truth - predicted

xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
xor_output = np.array([[0,1,1,0]]).T

X = xor_input
Y = xor_output

# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Let's set the dimensions for the intermediate layer.
hidden_dim = 5
# Initialize weights between the input layer and the hidden layer.
W1 = np.random.random((input_dim, hidden_dim))
# Define the shape of the output vector.
output_dim = len(Y.T)
# Initialize weights between the hidden layer and the output layer.
W2 = np.random.random((hidden_dim, output_dim))

num_epochs = 10000
learning_rate = 1.0

for epoch_n in range(num_epochs):
    layer0 = X
    # Forward propagation.
    # Inside the perceptron, Step 2.
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))
    # Back propagation (Y -> layer2).
    # How much did we miss in the predictions?
    layer2_error = cost(layer2, Y)
    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    layer2_delta = layer2_error * sigmoid_derivative(layer2)
    # Back propagation (layer2 -> layer1).
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)
    # Update weights.
    W2 += learning_rate * np.dot(layer1.T, layer2_delta)
    W1 += learning_rate * np.dot(layer0.T, layer1_delta)
We see that we've fully trained the network to memorize the outputs for XOR:
# On the training data
[int(prediction > 0.5) for prediction in layer2]
[out]:
[0, 1, 1, 0]
If we re-feed the same inputs, we get the same output:
for x, y in zip(X, Y):
    layer1_prediction = sigmoid(np.dot(W1.T, x))  # Feed the input into the trained W1.
    prediction = layer2_prediction = sigmoid(np.dot(W2.T, layer1_prediction))  # Then into the trained W2.
    print(int(prediction > 0.5), y)
[out]:
0 [0]
1 [1]
1 [1]
0 [0]
But if we retrain the parameters (W1 and W2) without one of the data points, i.e.
xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
xor_output = np.array([[0,1,1,0]]).T
Let's drop the last row of data and use that as unseen test.
X = xor_input[:-1]
Y = xor_output[:-1]
And with the rest of the same code, regardless of how I change the hyperparameters, it's unable to learn the XOR function and reproduce the I/O.
for x, y in zip(xor_input, xor_output):
    layer1_prediction = sigmoid(np.dot(W1.T, x))  # Feed the unseen input into trained W.
    prediction = layer2_prediction = sigmoid(np.dot(W2.T, layer1_prediction))  # Feed the unseen input into trained W.
    print(int(prediction > 0.5), y)
[out]:
0 [0]
1 [1]
1 [1]
1 [0]
Even if we shuffle the in-/output:
import random

# Shuffle the order of the inputs
_temp = list(zip(X, Y))
random.shuffle(_temp)
xor_input_shuff, xor_output_shuff = map(np.array, zip(*_temp))
We can't train the XOR function fully:
for x, y in zip(xor_input, xor_output):
    layer1_prediction = sigmoid(np.dot(W1.T, x))  # Feed the unseen input into trained W.
    prediction = layer2_prediction = sigmoid(np.dot(W2.T, layer1_prediction))  # Feed the unseen input into trained W.
    print(x, int(prediction > 0.5), y)
[out]:
[0 0] 1 [0]
[0 1] 1 [1]
[1 0] 1 [1]
[1 1] 0 [0]
So when the literature states that the multi-layered perceptron (a.k.a. basic deep learning) solves XOR, does it mean that it can fully learn and memorize the weights given the full set of inputs/outputs, but cannot generalize the XOR problem when one of the data points is missing?
Here's the link of the Kaggle dataset that answerers can test the network for themselves: https://www.kaggle.com/alvations/xor-with-mlp/
I think learning (generalizing) XOR and memorizing XOR are different things.
A two-layer perceptron can memorize XOR, as you have seen; that is, there exists a combination of weights where the loss is at its minimum and equal to 0 (the absolute minimum).
If the weights are randomly initialized, you might end up in the situation where you have actually learned XOR and not merely memorized it.
Note that multi-layer perceptrons are non-convex functions, so there can be multiple minima (even multiple global minima). When the data is missing one input, there are multiple minima (all equal in value), and among them are minima where the missing point would be classified correctly. Hence, an MLP can learn XOR (though finding that weight combination might be hard with a missing point).
It is quite often argued that neural networks are universal function approximators and can fit even nonsense labels. In that light, you might want to look at this work: https://arxiv.org/abs/1611.03530

No broadcasting for tf.matmul in TensorFlow

I have a problem with which I've been struggling. It is related to tf.matmul() and its absence of broadcasting.
I am aware of a similar issue on https://github.com/tensorflow/tensorflow/issues/216, but tf.batch_matmul() doesn't look like a solution for my case.
I need to encode my input data as a 4D tensor:
X = tf.placeholder(tf.float32, shape=(None, None, None, 100))
The first dimension is the size of a batch, the second the number of entries in the batch.
You can imagine each entry as a composition of a number of objects (third dimension). Finally, each object is described by a vector of 100 float values.
Note that I used None for the second and third dimensions because the actual sizes may change in each batch. However, for simplicity, let's shape the tensor with actual numbers:
X = tf.placeholder(tf.float32, shape=(5, 10, 4, 100))
These are the steps of my computation:
compute a function of each vector of 100 float values (e.g., linear function)
W = tf.Variable(tf.truncated_normal([100, 50], stddev=0.1))
Y = tf.matmul(X, W)
problem: no broadcasting for tf.matmul() and no success using tf.batch_matmul()
expected shape of Y: (5, 10, 4, 50)
applying average pooling for each entry of the batch (over the objects of each entry):
Y_avg = tf.reduce_mean(Y, 2)
expected shape of Y_avg: (5, 10, 50)
I expected that tf.matmul() would have supported broadcasting. Then I found tf.batch_matmul(), but it looks like it doesn't apply to my case (e.g., W needs to have at least 3 dimensions, and it's not clear why).
BTW, above I used a simple linear function (the weights of which are stored in W). But in my model I have a deep network instead. So the more general problem I have is automatically computing a function for each slice of a tensor. This is why I expected tf.matmul() to have broadcasting behavior (if so, maybe tf.batch_matmul() wouldn't even be necessary).
Look forward to learning from you!
Alessio
You could achieve that by reshaping X to shape [n, d], where d is the dimensionality of one single "instance" of computation (100 in your example) and n is the number of those instances in your multi-dimensional object (5*10*4=200 in your example). After reshaping, you can use tf.matmul and then reshape back to the desired shape. The fact that the first three dimensions can vary makes that a little tricky, but you can use tf.shape to determine the actual shapes at run time. Finally, you can perform the second step of your computation, which should be a simple tf.reduce_mean over the respective dimension. All in all, it would look like this:
X = tf.placeholder(tf.float32, shape=(None, None, None, 100))
W = tf.Variable(tf.truncated_normal([100, 50], stddev=0.1))
X_ = tf.reshape(X, [-1, 100])
Y_ = tf.matmul(X_, W)
X_shape = tf.gather(tf.shape(X), [0,1,2]) # Extract the first three dimensions
target_shape = tf.concat([X_shape, [50]], axis=0)
Y = tf.reshape(Y_, target_shape)
Y_avg = tf.reduce_mean(Y, 2)
As the renamed title of the GitHub issue you linked suggests, you should use tf.tensordot(). It enables contraction of axis pairs between two tensors, in line with Numpy's tensordot(). For your case:
X = tf.placeholder(tf.float32, shape=(5, 10, 4, 100))
W = tf.Variable(tf.truncated_normal([100, 50], stddev=0.1))
Y = tf.tensordot(X, W, [[3], [0]]) # gives shape=[5, 10, 4, 50]
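As a quick cross-check (a sketch with the concrete shapes from your example; tf.debugging.assert_near raises if the results diverge), the tensordot route agrees with the reshape-and-matmul route from the other answer:
import numpy as np
import tensorflow as tf

X = tf.constant(np.random.rand(5, 10, 4, 100), dtype=tf.float32)
W = tf.constant(np.random.rand(100, 50), dtype=tf.float32)
via_tensordot = tf.tensordot(X, W, [[3], [0]])                 # (5, 10, 4, 50)
via_matmul = tf.reshape(tf.matmul(tf.reshape(X, [-1, 100]), W),
                        [5, 10, 4, 50])
tf.debugging.assert_near(via_tensordot, via_matmul)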

How does tensorflow batch_matmul work?

Tensorflow has a function called batch_matmul which multiplies higher dimensional tensors. But I'm having a hard time understanding how it works, perhaps partially because I'm having a hard time visualizing it.
What I want to do is multiply a matrix by each slice of a 3D tensor, but I don't quite understand what the shape of tensor a is. Is z the innermost dimension? Which of the following is correct?
I would most prefer the first to be correct -- it's most intuitive to me and easy to see in the .eval() output. But I suspect the second is correct.
Tensorflow says that batch_matmul performs:
out[..., :, :] = matrix(x[..., :, :]) * matrix(y[..., :, :])
What does that mean? What does that mean in the context of my example? What is being multiplied with what? And why aren't I getting a 3D tensor the way I expected?
You can imagine it as doing a matmul over each training example in the batch.
For example, if you have two tensors with the following dimensions:
a.shape = [100, 2, 5]
b.shape = [100, 5, 2]
and you do a batch tf.matmul(a, b), your output will have the shape [100, 2, 2].
100 is your batch size, the other two dimensions are the dimensions of your data.
First of all, tf.batch_matmul() was removed and is no longer available. Now you are supposed to use tf.matmul():
The inputs must be matrices (or tensors of rank > 2, representing
batches of matrices), with matching inner dimensions, possibly after
transposition.
So let's assume you have the following code:
import tensorflow as tf
batch_size, n, m, k = 10, 3, 5, 2
A = tf.Variable(tf.random_normal(shape=(batch_size, n, m)))
B = tf.Variable(tf.random_normal(shape=(batch_size, m, k)))
tf.matmul(A, B)
Now you will receive a tensor of shape (batch_size, n, k). Here is what is going on: assume you have batch_size matrices of shape n×m and batch_size matrices of shape m×k. For each pair of them you calculate (n×m) × (m×k), which gives you an n×k matrix. You will have batch_size of them.
Notice that something like this is also valid:
A = tf.Variable(tf.random_normal(shape=(a, b, n, m)))
B = tf.Variable(tf.random_normal(shape=(a, b, m, k)))
tf.matmul(A, B)
and will give you a shape (a, b, n, k)
You can now do it using tf.einsum, starting from Tensorflow 0.11.0rc0.
For example,
M1 = tf.Variable(tf.random_normal([2,3,4]))
M2 = tf.Variable(tf.random_normal([5,4]))
N = tf.einsum('ijk,lk->ijl',M1,M2)
It multiplies the matrix M2 with every frame (3 frames) in every batch (2 batches) in M1.
The output is:
[array([[[ 0.80474716, -1.38590837, -0.3379252 , -1.24965811],
[ 2.57852983, 0.05492432, 0.23039417, -0.74263287],
[-2.42627382, 1.70774114, 1.19503212, 0.43006262]],
[[-1.04652011, -0.32753903, -1.26430523, 0.8810069 ],
[-0.48935518, 0.12831448, -1.30816901, -0.01271309],
[ 2.33260512, -1.22395933, -0.92082584, 0.48991606]]], dtype=float32),
array([[ 1.71076882, 0.79229093, -0.58058828, -0.23246667],
[ 0.20446332, 1.30742455, -0.07969904, 0.9247328 ],
[-0.32047141, 0.66072595, -1.12330854, 0.80426538],
[-0.02781649, -0.29672042, 2.17819595, -0.73862702],
[-0.99663496, 1.3840003 , -1.39621222, 0.77119476]], dtype=float32),
array([[[ 0.76539308, 2.77609682, -1.79906654, 0.57580602, -3.21205115],
[ 4.49365759, -0.10607499, -1.64613271, 0.96234947, -3.38823152],
[-3.59156275, 2.03910899, 0.90939498, 1.84612727, 3.44476724]],
[[-1.52062428, 0.27325237, 2.24773455, -3.27834225, 3.03435063],
[ 0.02695178, 0.16020992, 1.70085776, -2.8645196 , 2.48197317],
[ 3.44154787, -0.59687197, -0.12784094, -2.06931567, -2.35522676]]], dtype=float32)]
I have verified that the arithmetic is correct.
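For anyone who wants to reproduce that verification, here is a minimal numpy cross-check of the same contraction (a sketch):
import numpy as np

M1 = np.random.randn(2, 3, 4).astype(np.float32)
M2 = np.random.randn(5, 4).astype(np.float32)
N = np.einsum('ijk,lk->ijl', M1, M2)
# N[i, j] should equal M2 applied to the frame M1[i, j].
assert np.allclose(N[1, 2], M2 @ M1[1, 2], atol=1e-5)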
tf.tensordot should solve this problem. It supports batch operations, e.g., if you want to contract a 2D tensor with a 3D tensor, with the latter having a batch dimension.
If a has shape [n, m] and b has shape [?, m, l], then
y = tf.tensordot(b, a, axes=[1, 1]) will produce a tensor of shape [?, l, n] (tensordot places the uncontracted axes of its first argument before those of its second; transpose the result if you need [?, n, l]).
https://www.tensorflow.org/api_docs/python/tf/tensordot
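One caveat worth demonstrating (a minimal sketch): tensordot orders the output axes as the uncontracted axes of its first argument followed by those of its second:
import tensorflow as tf

a = tf.zeros([7, 4])     # [n, m]
b = tf.zeros([2, 4, 3])  # [?, m, l]
y = tf.tensordot(b, a, axes=[1, 1])
print(y.shape)           # (2, 3, 7), i.e. [?, l, n]; transpose if [?, n, l] is needed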
It is simply like splitting each tensor on the first dimension, multiplying the corresponding slices, and concatenating them back. If you want to multiply 3D by 2D, you can reshape, multiply, and reshape back, i.e. [100, 2, 5] -> [200, 5] -> [200, 2] -> [100, 2, 2].
Another option for this particular question is the tf.scan function.
If a has shape [5, 3, 2] (a batch of 5 matrices of size 3×2) and b has shape [2, 3] (a constant matrix to be multiplied with each sample), then:
a = tf.random.normal([5, 3, 2])  # a batch of 5 matrices, each 3x2
b = tf.random.normal([2, 3])     # a constant matrix multiplied with each sample

def fn(previous, x):
    # The accumulator is ignored; each step multiplies one batch slice by b.
    return tf.matmul(x, b)

initializer = tf.zeros((3, 3))   # must match the shape of fn's output
h = tf.scan(fn, a, initializer)  # scans over a's first dimension

This h, of shape [5, 3, 3], will store all the outputs.
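Note that tf.scan applies fn sequentially over the first dimension, so for plain batched multiplication a single fused contraction is typically faster (a sketch reusing the a and b defined above):
h_batched = tf.einsum('bij,jk->bik', a, b)  # shape [5, 3, 3], same result without the sequential scan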