Optimizing Tensorflow for many small matrix-vector multiplications

To build up a capsule network training script, I need to compute many small matrix-vector multiplications.
The size of each weight matrix is at most 20 by 20.
The number of weight matrices is more than 900.
I'm curious whether tf.matmul or tf.linalg.matvec is the best option for this.
Could anybody give me a hint to optimize the training script?

EDIT:
Looking at the notebook that you are referring to, it seems you have the following parameters:
batch_size = 50
caps1_n_caps = 1152
caps1_n_dims = 8
caps2_n_caps = 10
caps2_n_dims = 16
And then you have a tensor w with shape (caps1_n_caps, caps2_n_caps, caps2_n_dims, caps1_n_dims) (in the notebook it has an initial dimension with size 1 that I am skipping) and another tensor caps1_output with shape (batch_size, caps1_n_caps, caps1_n_dims). And you need to combine them to produce caps2_predicted with shape (batch_size, caps1_n_caps, caps2_n_caps, caps2_n_dims).
In the notebook they tile the tensors in order to combine them with tf.linalg.matmul, but you can actually compute the same result without any tiling just by using tf.einsum:
import tensorflow as tf
batch_size = 50
caps1_n_caps = 1152
caps1_n_dims = 8
caps2_n_caps = 10
caps2_n_dims = 16
w = tf.zeros((caps1_n_caps, caps2_n_caps, caps2_n_dims, caps1_n_dims), dtype=tf.float32)
caps1_output = tf.zeros((batch_size, caps1_n_caps, caps1_n_dims), dtype=tf.float32)
caps2_predicted = tf.einsum('ijkl,bil->bijk', w, caps1_output)
print(caps2_predicted.shape)
# (50, 1152, 10, 16)
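As a side note (my addition, assuming a TF version whose tf.matmul broadcasts batch dimensions, i.e. BatchMatMulV2), the same result can also be obtained with a broadcasted matmul:
# broadcast w against the batch and caps1_output against caps2_n_caps,
# then contract the last dimension with a batched matmul
u = caps1_output[:, :, tf.newaxis, :, tf.newaxis]             # (50, 1152, 1, 8, 1)
caps2_alt = tf.squeeze(tf.matmul(w[tf.newaxis], u), axis=-1)  # (50, 1152, 10, 16)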
I'm not sure if I have understood exactly what you want, but you say you want to compute something like:
û_ij = W_ij × u_i
For a collection of several matrices W and vectors u. Assuming you have 900 matrices and vectors, where matrices have size 20×20 and vectors have size 20, you can represent them as two tensors: ws, with shape (900, 20, 20), and us, with shape (900, 20). If you do that, your result us_hat, with shape (900, 20), can be computed simply as:
us_hat = tf.linalg.matvec(ws, us)
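Equivalently (my addition, just another spelling of the same batched product), tf.einsum makes the contraction explicit:
us_hat = tf.einsum('nij,nj->ni', ws, us)  # shape (900, 20)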

Related

ValueError: Dimensions must be equal in Tensorflow/Keras

My code is as follows:
v = tf.Variable(initial_value=v, trainable=True)
v.shape is (1, 768)
In the model:
inputs_sents = keras.Input(shape=(50,3))
inputs_events = keras.Input(shape=(50,768))
x_1 = tf.matmul(v,tf.transpose(inputs_events))
x_2 = tf.matmul(x_1,inputs_sents)
But I got an error,
ValueError: Dimensions must be equal, but are 768 and 50 for
'{{node BatchMatMulV2_3}} =
BatchMatMulV2[T=DT_FLOAT,
adj_x=false,
adj_y=false](BatchMatMulV2_3/ReadVariableOp,
Transpose_3)' with input shapes: [1,768], [768,50,?]
I think this is because of the batch dimension? But how should I deal with this?
v is a trainable vector (or a 2d array with its first dimension being 1); I want it to be trained during the training process.
PS: This is the result I got using the code provided by the first answer. I think it is incorrect because Keras already takes the first batch dimension into account.
Plus, from the keras documentation,
shape: A shape tuple (integers), not including the batch size. For instance, shape=(32,) indicates that the expected input will be batches of 32-dimensional vectors. Elements of this tuple can be None; 'None' elements represent dimensions where the shape is not known.
https://keras.io/api/layers/core_layers/input/
Should I rewrite my codes without keras?
The shape of a batch is denoted by None:
import numpy as np
import tensorflow as tf
from tensorflow import keras

inputs_sents = keras.Input(shape=(None,1,3))
inputs_events = keras.Input(shape=(None,1,768))
v = np.ones(shape=(1,768), dtype=np.float32)
v = tf.Variable(initial_value=v, trainable=True)
# transpose only the last two axes; the default tf.transpose would
# reverse all four dimensions, including the batch dimension
x_1 = tf.matmul(v, tf.transpose(inputs_events, perm=[0,1,3,2]))
x_2 = tf.matmul(x_1, inputs_sents)
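Alternatively (a sketch of my own, not from the original answer): keeping the original input shapes, tf.matmul's transpose_b argument transposes only the last two axes, which sidesteps the issue entirely:

import numpy as np
import tensorflow as tf
from tensorflow import keras

inputs_sents = keras.Input(shape=(50, 3))     # (batch, 50, 3)
inputs_events = keras.Input(shape=(50, 768))  # (batch, 50, 768)
v = tf.Variable(initial_value=np.ones((1, 768), dtype=np.float32), trainable=True)
x_1 = tf.matmul(v, inputs_events, transpose_b=True)  # (batch, 1, 50)
x_2 = tf.matmul(x_1, inputs_sents)                   # (batch, 1, 3)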

Tabular data: Implementing a custom tensor layer without resorting to iteration

I have an idea for a tensor operation that would not be difficult to implement via iteration, with batch size one. However, I would like to parallelize it as much as possible.
I have two tensors with shape (n, 5) called X and Y. X is actually supposed to represent 5 one-dimensional tensors with shape (n, 1): (x_1, ..., x_5). Ditto for Y.
I would like to compute a tensor with shape (n, 25) where each column represents the output of the tensor operation f(x_i, y_j), where f is fixed for all 1 <= i, j <= 5. The operation f has output shape (n, 1), just like x_i and y_j.
I feel it is important to clarify that f is essentially a fully-connected layer from the concatenated [...x_i, ...y_i] tensor with shape (1, 10), to an output layer with shape (1,5).
Again, it is easy to see how to do this manually with iteration and slicing. However this is probably very slow. Performing this operation in batches, where the tensors X, Y now have shape (n, 5, batch_size) is also desirable, particularly for mini-batch gradient descent.
It is difficult to really articulate here why I desire to create this network; I feel it is suited for my domain of 'itemized tabular data' and cuts down significantly on the number of weights per operation, compared to a fully connected network.
Is this possible using tensorflow? Certainly not using just keras.
Below is an example in numpy per AloneTogether's request
import numpy as np
features = 16
batch_size = 256
X_batch = np.random.random((features, 5, batch_size))
Y_batch = np.random.random((features, 5, batch_size))
# one tensor operation to reduce weights in this custom 'layer'
f = np.random.random((features, 2 * features))
for b in range(batch_size):
    X = X_batch[:, :, b]
    Y = Y_batch[:, :, b]
    for i in range(5):
        x_i = X[:, i:i+1]
        for j in range(5):
            y_j = Y[:, j:j+1]
            x_i_y_j = np.concatenate([x_i, y_j], axis=0)
            # f(x_i, y_j)
            # implemented by a fully-connected layer
            f_i_j = np.matmul(f, x_i_y_j)
All operations you need (concatenation and matrix multiplication) can be batched.
The difficult part here is that you want to concatenate the features of all items in X with the features of all items in Y (all combinations).
My recommended solution is to expand the dimensions of X to [batch, features, 5, 1] and the dimensions of Y to [batch, features, 1, 5].
Then tf.repeat() both tensors so their shapes become [batch, features, 5, 5].
Now you can concatenate X and Y. You will have a tensor of shape [batch, 2*features, 5, 5]. Observe that this way all combinations are built.
Next step is matrix multiplication. tf.matmul() can also do batch matrix multiplication, but I use here tf.einsum() because I want more control over which dimensions are considered as batch.
Full code:
import tensorflow as tf
import numpy as np
batch_size=3
features=6
items=5
x = np.random.uniform(size=[batch_size,features,items])
y = np.random.uniform(size=[batch_size,features,items])
f = np.random.uniform(size=[2*features,features])
x_reps= tf.repeat(x[:,:,:,tf.newaxis], items, axis=3)
y_reps= tf.repeat(y[:,:,tf.newaxis,:], items, axis=2)
xy_conc = tf.concat([x_reps,y_reps], axis=1)
f_i_j = tf.einsum("bfij, fg->bgij", xy_conc,f)
f_i_j = tf.reshape(f_i_j , [batch_size,features,items*items])
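As a quick sanity check (my addition; the indices are arbitrary), one entry of the vectorized result can be compared against the loop formulation from the question:

b_idx, i_idx, j_idx = 0, 2, 4
# concatenate item i of x with item j of y for one sample, then apply f
xy = np.concatenate([x[b_idx, :, i_idx], y[b_idx, :, j_idx]], axis=0)  # [2*features]
manual = xy @ f                                                        # [features]
print(np.allclose(manual, f_i_j[b_idx, :, i_idx * items + j_idx]))     # True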

Setting filter weights of a convolutional layer

I'm working on a semantic segmentation project which involves dynamic filters in order to learn multiscale representations.
To create these filters I use a Unet backbone and extract the feature maps from the bottleneck layer.
The feature maps are of size H x W x 512, where H is the height of the feature map, W the width, and 512 is the number of channels (maps).
These features are passed through a 1x1 convolution to reduce the number of channels to 128, giving H x W x 128, and the features are also passed through an adaptive pooling layer to reduce H x W x 512 to k x k x 512, where k is the size of the filter (e.g. 5).
The filter is then also fed through a 1x1 convolution to reduce it to 128 channels.
This gives me a feature map f = H x W x 128 and a filter kernel g of size k x k x 128.
Now I want to convolve f with g and tried the following in keras:
conv = Conv2D(128, kernel_size = 5, kernel_initializer = g, trainable = False)(f)
Unfortunately this does not work and I just get an error saying:
"Could not interpret initializer identifier: Tensor("strided_slice:0", shape = (5,5,128), dtype = float32)"
Now I am wondering what I am doing wrong.
In addition, I have to mention that the shape of the output tensor after average pooling / 1x1 conv is (?, 5, 5, 128), where ? is the batch size.
To get the kernel I tried something like:
g = g[0,:,:,:]
Thanks for any advice,
cheers,
Michael
The kernel_initializer argument of the Conv2D constructor does not expect a kernel, but a function that initializes a kernel. You can read more in the documentation.
If you just want to perform a convolution without trainable weights, you are better off using the native TensorFlow function tf.nn.conv2d:
conv = tf.nn.conv2d(f,g,strides=[1,1,1,1],padding='VALID')
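Note that tf.nn.conv2d expects a 4-D kernel of shape [filter_height, filter_width, in_channels, out_channels], so the pooled k x k x 128 tensor needs a trailing axis. A minimal sketch with made-up shapes (H = W = 32 is just for illustration):

import tensorflow as tf

f = tf.random.normal([1, 32, 32, 128])   # feature map: [batch, H, W, channels]
g = tf.random.normal([5, 5, 128])        # one 5x5x128 kernel from the pooling branch
g = g[:, :, :, tf.newaxis]               # -> [5, 5, in_channels=128, out_channels=1]
conv = tf.nn.conv2d(f, g, strides=[1, 1, 1, 1], padding='VALID')
print(conv.shape)                        # (1, 28, 28, 1)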

In Tensorflow (in general deep learning), can I apply "local FC layer" with "CONV layer"?

Could anyone verify my reasoning?
Let's say I have a (pre-trained) fully connected layer fc that takes b x 20 x 20 x 10 as input and produces b x 64 as output, where b is the batch size.
Now, I have an input of c x 100 x 60 x 10. The height and width, 100 x 60, can be subdivided into a 5 x 3 grid of 20 x 20 tiles. I would like to have a 5 x 3 grid of local responses (outputs) from the fc layer, i.e., c x 5 x 3 x 64.
Now I am thinking: is this the same as having a convolution layer with the fc weights and a stride of 20 in both width and height? Is that correct, or can there be a difference?
Yes, it will be the same if appropriate reshaping of the dense layer weight matrix is performed.
Let us first look at the dense layer. You input a 20 x 20 x 10 tensor to the dense layer. It will first be flattened out to produce a 4000 x 1 vector. You want the output to be a 64 x 1 vector. So, the weight matrix required is 4000 x 64, plus 64 bias parameters. Then y = w^T * x + b = [4000 x 64]^T * [4000 x 1] + [64 x 1] yields a [64 x 1] vector. Therefore, y[i] = w[0][i]*x[0] + ... + w[3999][i]*x[3999] + b[i] for i = [0, 63]. Note that b indicates a bias parameter.
Let us turn to convolution. To produce a 5 x 3 x 64 output from an input of size 100 x 60 x 10, you need 64 filters, each of size (20,20) and strides (20,20) with no zero-padding. Each 20 x 20 filter, however, has local connectivity extending along the entire depth, i.e., a neuron is connected to all 10 channels along the depth of the input. Please read this for more information on local connectivity of convolutional layers.
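For illustration (my sketch, not part of the original answer), the corresponding layer configuration in Keras would be:

from tensorflow import keras

# 64 filters of size 20x20 with stride 20 turn a 100x60x10 input
# into the desired 5x3 grid of 64-dimensional local responses
inp = keras.Input(shape=(100, 60, 10))
out = keras.layers.Conv2D(64, kernel_size=(20, 20), strides=(20, 20), padding='valid')(inp)
print(out.shape)  # (None, 5, 3, 64)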
Your convolutional layer has a receptive field of 20 x 20. Each neuron in the convolutional layer is connected to a 20 x 20 x 10 block of the input, i.e. 4000 weights (and one bias parameter) per filter. You have 64 such filters. Therefore, your total learnable weights for this layer = 4000 x 64 + 64. Convolution between one 20 x 20 x 10 block of x and w (size = 64 x 10 x 20 x 20) can be performed as:
convResult = np.sum(np.sum(np.sum(x*w[:,:,::-1,::-1], axis=-1), axis=-1),axis=-1)
There are some fine points here. I did w[:,:,::-1,::-1] because theano convolution flips the convolution kernel (well, not that simple!). If you are interested in who flips and who does not, read this.
Finally, dense layer and convolution layer (in this context) essentially do the same operation. They first element-wise multiply and then sum up two sets of vectors/matrices of 4000 elements. This procedure is repeated 64 times to produce a 64 x 1 vector. So, it is possible to achieve exactly the same result with dense and convolution layer by proper reshaping of the dense layer weight matrix. However, you need to take care of kernel flipping to match the results.
Below I give a code snippet to compute convolution manually (using numpy) and using Theano.
import theano
from theano import tensor as T
import numpy as np
X = T.ftensor4('X')
W = T.ftensor4('W')
out = T.nnet.conv2d(X,W)
f = theano.function([X, W], out, allow_input_downcast=True)
x = np.random.random((1,10,20,20))
w = np.random.random((64,10,20,20))
# convolution using Theano
c1 = np.squeeze(f(x,w)[0])
# convolution using Numpy
c2 = np.sum(np.sum(np.sum(x*w[:,:,::-1,::-1],axis=-1),axis=-1),axis=-1)
# check that both are almost identical
print(np.abs(c2 - c1).max())
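For comparison (my addition), TensorFlow's tf.nn.conv2d computes cross-correlation rather than true convolution, so the equivalent check there needs no kernel flipping:

import numpy as np
import tensorflow as tf

x = np.random.random((1, 20, 20, 10)).astype(np.float32)   # NHWC layout
w = np.random.random((20, 20, 10, 64)).astype(np.float32)  # HWIO layout
c1 = np.squeeze(tf.nn.conv2d(x, w, strides=1, padding='VALID').numpy())  # (64,)
c2 = np.sum(x[0, :, :, :, np.newaxis] * w, axis=(0, 1, 2))               # (64,)
print(np.abs(c1 - c2).max())  # close to zero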

No broadcasting for tf.matmul in TensorFlow

I have a problem with which I've been struggling. It is related to tf.matmul() and its absence of broadcasting.
I am aware of a similar issue on https://github.com/tensorflow/tensorflow/issues/216, but tf.batch_matmul() doesn't look like a solution for my case.
I need to encode my input data as a 4D tensor:
X = tf.placeholder(tf.float32, shape=(None, None, None, 100))
The first dimension is the size of a batch, the second the number of entries in the batch.
You can imagine each entry as a composition of a number of objects (third dimension). Finally, each object is described by a vector of 100 float values.
Note that I used None for the second and third dimensions because the actual sizes may change in each batch. However, for simplicity, let's shape the tensor with actual numbers:
X = tf.placeholder(tf.float32, shape=(5, 10, 4, 100))
These are the steps of my computation:
compute a function of each vector of 100 float values (e.g., linear function)
W = tf.Variable(tf.truncated_normal([100, 50], stddev=0.1))
Y = tf.matmul(X, W)
problem: no broadcasting for tf.matmul() and no success using tf.batch_matmul()
expected shape of Y: (5, 10, 4, 50)
applying average pooling for each entry of the batch (over the objects of each entry):
Y_avg = tf.reduce_mean(Y, 2)
expected shape of Y_avg: (5, 10, 50)
I expected that tf.matmul() would support broadcasting. Then I found tf.batch_matmul(), but it still doesn't seem to apply to my case (e.g., W needs to have at least 3 dimensions; it's not clear why).
BTW, above I used a simple linear function (the weights of which are stored in W). But in my model I have a deep network instead. So, the more general problem I have is automatically computing a function for each slice of a tensor. This is why I expected that tf.matmul() would have had a broadcasting behavior (if so, maybe tf.batch_matmul() wouldn't even be necessary).
Look forward to learning from you!
Alessio
You could achieve that by reshaping X to shape [n, d], where d is the dimensionality of one single "instance" of computation (100 in your example) and n is the number of those instances in your multi-dimensional object (5*10*4=200 in your example). After reshaping, you can use tf.matmul and then reshape back to the desired shape. The fact that the first three dimensions can vary makes that a little tricky, but you can use tf.shape to determine the actual shapes during run time. Finally, you can perform the second step of your computation, which should be a simple tf.reduce_mean over the respective dimension. All in all, it would look like this:
X = tf.placeholder(tf.float32, shape=(None, None, None, 100))
W = tf.Variable(tf.truncated_normal([100, 50], stddev=0.1))
X_ = tf.reshape(X, [-1, 100])
Y_ = tf.matmul(X_, W)
X_shape = tf.gather(tf.shape(X), [0,1,2]) # Extract the first three dimensions
target_shape = tf.concat([X_shape, [50]], axis=0)
Y = tf.reshape(Y_, target_shape)
Y_avg = tf.reduce_mean(Y, 2)
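A quick shape check (my addition, using TF1-style session code to match the placeholders above):

import numpy as np

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    y_avg = sess.run(Y_avg, feed_dict={X: np.random.rand(5, 10, 4, 100)})
    print(y_avg.shape)  # (5, 10, 50)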
As the renamed title of the GitHub issue you linked suggests, you should use tf.tensordot(). It enables contraction of axes pairs between two tensors, in line with Numpy's tensordot(). For your case:
X = tf.placeholder(tf.float32, shape=(5, 10, 4, 100))
W = tf.Variable(tf.truncated_normal([100, 50], stddev=0.1))
Y = tf.tensordot(X, W, [[3], [0]]) # gives shape=[5, 10, 4, 50]
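Equivalently (my addition), the same contraction can be spelled with tf.einsum:

Y = tf.einsum('abcd,de->abce', X, W)  # also gives shape [5, 10, 4, 50]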