TensorFlow: implicit broadcasting in element-wise addition/multiplication

How does the implicit broadcasting in TensorFlow using + and * work?
If I have two tensors such that
a.get_shape() = [64, 10, 1, 100]
b.get_shape() = [64, 100]
(a+b).get_shape() = [64, 10, 64, 100]
(a*b).get_shape() = [64, 10, 64, 100]
How does that become [64, 10, 64, 100]?

According to the documentation, operations like add are broadcasting operations.
Quoting the glossary:
Broadcasting operation
An operation that uses numpy-style broadcasting to make the shapes of its tensor arguments compatible.
Numpy-style broadcasting is well covered in the numpy documentation:
In brief:
[...] the smaller array is “broadcast” across the larger array so that they have compatible shapes.
Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python.

I think that the broadcasting isn't doing what you intended. It's actually broadcasting in both directions. Let me show you what I mean by modifying your example:
a = tf.ones([64, 10, 1, 100])
b = tf.ones([128, 100])
print((a+b).shape) # prints "(64, 10, 128, 100)"
From this we see that it broadcasts by matching the last dimensions first. It's implicitly tiling a across its third dimension to match the size of b's first dimension, then implicitly adding singleton dimensions and tiling b across a's first two dimensions.
What I think you expected to do was to implicitly tile b across a's second dimension. To do that, you need b to be a different shape:
a = tf.ones([64, 10, 1, 100])
b = tf.ones([64, 1, 1, 100])
print((a+b).shape) # prints "(64, 10, 1, 100)"
You can use tf.expand_dims() twice on your b to add the two singleton dimensions to match this shape.
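To make that concrete, a minimal sketch of reshaping b with tf.expand_dims() (the equivalent None indexing is shown in a comment):
import tensorflow as tf
b = tf.ones([64, 100])
b4 = tf.expand_dims(tf.expand_dims(b, 1), 1)  # shape (64, 1, 1, 100)
# equivalently: b4 = b[:, None, None, :]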

Numpy-style broadcasting is well documented, but to give a short explanation: the two tensors' shapes are compared starting from the last dimension and working backward; any dimension that one tensor lacks (or holds with size 1) is replicated to match the other.
For example, with
a.get_shape() = [64, 10, 1, 100]
b.get_shape() = [64, 100]
(a*b).get_shape = [64, 10, 64, 100]
a and b have the same last dimension (100); then the next-to-last dimension of a (size 1) is replicated to match b's first dimension (64); and b lacks the first two dimensions of a, so they are created.
Note that any mismatched dimension must be 1 or absent in one of the tensors, because the whole lower-dimensional block is replicated.
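Written out, the right-aligned comparison of the two shapes looks like this:
a:       64  10   1  100
b:                64  100
result:  64  10  64  100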

TF broadcast along first axis

Say I have two tensors, one with shape (10, 1) and another with shape (10, 12, 1). What I want is to multiply them broadcasting along the first axis, not the last one as usual:
tf.zeros([10,1]) * tf.ones([10,12,1])
However, this is not working. Is there a way to do it without transposing with perm?
You cannot change the broadcasting rules, but you can avoid relying on them by doing the replication yourself. When the ranks differ, broadcasting pads the shorter shape with leading 1s, which is what aligns the wrong axes here.
So instead of permuting the axes, you can also repeat along a new axis:
import tensorflow as tf
import einops as ops
a = tf.zeros([10, 1])
b = tf.ones([10, 12, 1])
c = ops.repeat(a, 'x z -> x y z', y=b.shape[1]) * b
print(c.shape)  # TensorShape([10, 12, 1])
For the above example, you need to do tf.zeros([10,1])[...,None] * tf.ones([10,12,1]) to satisfy broadcasting rules: https://numpy.org/doc/stable/user/basics.broadcasting.html#general-broadcasting-rules
If you want to do this for arbitrary shapes, you can do the multiplication on the transposed tensors, so that the last dimensions of both match (obeying the broadcasting rules), and then transpose again to get back to the required output:
tf.transpose(a*tf.transpose(b))
Example:
a = tf.ones([10,])
b = tf.ones([10,11,12,13,1])
tf.transpose(b)                    # shape [1, 13, 12, 11, 10]
a * tf.transpose(b)                # shape [1, 13, 12, 11, 10]
tf.transpose(a * tf.transpose(b))  # shape [10, 11, 12, 13, 1]
# Note a is [10,], not [10,1]; otherwise you would need to transpose a as well.
Another approach is to expand the axes by indexing:
a = tf.ones([10])[(...,) + (len(b.shape) - 1) * (tf.newaxis,)]  # append rank(b)-1 singleton axes; len(b.shape) is a Python int
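A quick check of that indexing trick against the shapes from this question (assuming eager TensorFlow 2):
import tensorflow as tf
b = tf.ones([10, 12, 1])
a = tf.ones([10])[(...,) + (len(b.shape) - 1) * (tf.newaxis,)]
print(a.shape)        # (10, 1, 1)
print((a * b).shape)  # (10, 12, 1)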

Indexing a tensor with None in PyTorch

I've seen this syntax to index a tensor in PyTorch, not sure what it means:
v = torch.div(t, n[:, None])
where v, t, and n are tensors.
What is the role of "None" here? I can't seem to find it in the documentation.
Similar to NumPy you can insert a singleton dimension ("unsqueeze" a dimension) by indexing this dimension with None. In turn n[:, None] will have the effect of inserting a new dimension on dim=1. This is equivalent to n.unsqueeze(dim=1):
>>> n = torch.rand(3, 100, 100)
>>> n[:, None].shape
torch.Size([3, 1, 100, 100])
>>> n.unsqueeze(1).shape
torch.Size([3, 1, 100, 100])
Here are some other types of None indexing.
In the example above, : was used as a placeholder to designate the first dimension dim=0. If you want to insert a dimension on dim=2, you can add a second :, as in n[:, :, None].
You can also place None with respect to the last dimension instead. To do so you can use the ellipsis syntax ...:
n[..., None] will insert a dimension last, i.e. n.unsqueeze(dim=-1).
n[..., None, :] will insert it on the second-to-last dimension, i.e. n.unsqueeze(dim=-2).
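Continuing the session above, the ellipsis variants behave like this:
>>> n[..., None].shape
torch.Size([3, 100, 100, 1])
>>> n[..., None, :].shape
torch.Size([3, 100, 1, 100])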

How to conveniently use operations on numpy Fortran-contiguous arrays?

Some numpy functions like np.matmul(a, b) have convenient behavior for stacks of matrices.
The manual states:
If either argument is N-D, N > 2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.
Thus, for a.shape = (10, 2, 4) and b.shape = (10, 4, 2), the statement a @ b is meaningful and will have shape (10, 2, 2).
However, I'm coming from the linear algebra world, where I'm used to a Fortran contiguous array layout.
The same a represented as a Fortran contiguous array would have shape (4, 2, 10) and similarly b.shape = (2, 4, 10).
To do a @ b as before I would have to invoke
(a.T @ b.T).T
Even worse, assume you naively created the same Fortran-contiguous array a with the behavior of matmul in mind, such that it has shape (10, 4, 2).
Then a.strides = (8, 80, 320), with the smallest stride on the 'stack' index, which actually should have the highest stride.
Is this really the way to go or am I missing something?
While numpy can handle all sorts of layouts, many details are designed with the "C" layout in mind. Good examples are how nested lists translate into arrays, and the way numpy operations batch excess dimensions as in the matmul case.
It is correct that, as a rule of thumb, results in numpy do not depend on array layout (Fortran, C, non-contiguous); speed, however, certainly does, and heavily so:
import numpy as np
from timeit import timeit

rng = np.random.default_rng()
a = rng.random((100,111,200))
b = rng.random((111,77,200))
af = np.array(a,order="F")
bf = np.array(b,order="F")
np.allclose((b.T@a.T).T,(bf.T@af.T).T)
# True
timeit(lambda:(b.T@a.T).T,number=10)
# 5.972857117187232
timeit(lambda:(bf.T@af.T).T,number=10)
# 0.1994628761895001
In fact, sometimes it is totally worth it to non-lazily transpose, i.e. copy your data into the best layout:
timeit(lambda:(np.array(b.T,order="C")@np.array(a.T,order="C")).T,number=10)
# 0.3931349152699113
My advice: If you want speed and convenience it is probably best to go with the "C" layout, it doesn't take all that long to get used to and saves you a lot of potential headaches.
numpy's matrix multiplication works regardless of the internal layout of the array. For example, here are two C-ordered arrays:
>>> import numpy as np
>>> a = np.random.rand(10, 2, 4)
>>> b = np.random.rand(10, 4, 2)
>>> print('a', a.shape, a.strides)
>>> print('b', b.shape, b.strides)
a (10, 2, 4) (64, 32, 8)
b (10, 4, 2) (64, 16, 8)
Here are the equivalent arrays in Fortran order:
>>> af = np.asfortranarray(a)
>>> bf = np.asfortranarray(b)
>>> print('af', af.shape, af.strides)
>>> print('bf', bf.shape, bf.strides)
af (10, 2, 4) (8, 80, 160)
bf (10, 4, 2) (8, 80, 320)
Numpy treats equivalent arrays as equivalent, regardless of their internal layout:
>>> np.allclose(a, af) and np.allclose(b, bf)
True
The results of a matrix multiplication do not depend on the internal layout:
>>> np.allclose(a @ b, af @ bf)
True
and you can even mix layouts if you wish:
>>> np.allclose(a @ bf, af @ b)
True
In short, the most convenient way to use Fortran-ordered arrays in numpy is to not worry about internal array layout: the shape is all that matters.
If your array shapes differ from what the numpy matmul API expects, your best bet is to reshape the arrays, for example using a.transpose(2, 0, 1) @ b.transpose(2, 0, 1) or similar, depending on what is appropriate for your use case. But don't worry: for C- or Fortran-contiguous arrays, this operation only adjusts the metadata of the array view; it does not copy or re-order the underlying data buffer.
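To make the last point concrete, here is a small check (with illustrative shapes) that transposing only creates a view:
import numpy as np
a = np.asfortranarray(np.random.rand(4, 2, 10))  # Fortran-style layout, stack axis last
at = a.transpose(2, 0, 1)                        # shape (10, 4, 2); ready for matmul
print(np.shares_memory(a, at))                   # True: only view metadata changed, no copy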

No broadcasting for tf.matmul in TensorFlow

I have a problem with which I've been struggling. It is related to tf.matmul() and its absence of broadcasting.
I am aware of a similar issue on https://github.com/tensorflow/tensorflow/issues/216, but tf.batch_matmul() doesn't look like a solution for my case.
I need to encode my input data as a 4D tensor:
X = tf.placeholder(tf.float32, shape=(None, None, None, 100))
The first dimension is the size of a batch, the second the number of entries in the batch.
You can imagine each entry as a composition of a number of objects (third dimension). Finally, each object is described by a vector of 100 float values.
Note that I used None for the second and third dimensions because the actual sizes may change in each batch. However, for simplicity, let's shape the tensor with actual numbers:
X = tf.placeholder(tf.float32, shape=(5, 10, 4, 100))
These are the steps of my computation:
compute a function of each vector of 100 float values (e.g., linear function)
W = tf.Variable(tf.truncated_normal([100, 50], stddev=0.1))
Y = tf.matmul(X, W)
problem: no broadcasting for tf.matmul() and no success using tf.batch_matmul()
expected shape of Y: (5, 10, 4, 50)
applying average pooling for each entry of the batch (over the objects of each entry):
Y_avg = tf.reduce_mean(Y, 2)
expected shape of Y_avg: (5, 10, 50)
I expected that tf.matmul() would have supported broadcasting. Then I found tf.batch_matmul(), but it still doesn't look like it applies to my case (e.g., W needs to have at least 3 dimensions; it's not clear why).
BTW, above I used a simple linear function (the weights of which are stored in W). But in my model I have a deep network instead. So, the more general problem I have is automatically computing a function for each slice of a tensor. This is why I expected that tf.matmul() would have had a broadcasting behavior (if so, maybe tf.batch_matmul() wouldn't even be necessary).
Look forward to learning from you!
Alessio
You could achieve that by reshaping X to shape [n, d], where d is the dimensionality of one single "instance" of computation (100 in your example) and n is the number of those instances in your multi-dimensional object (5*10*4=200 in your example). After reshaping, you can use tf.matmul and then reshape back to the desired shape. The fact that the first three dimensions can vary makes that a little tricky, but you can use tf.shape to determine the actual shapes during run time. Finally, you can perform the second step of your computation, which should be a simple tf.reduce_mean over the respective dimension. All in all, it would look like this:
X = tf.placeholder(tf.float32, shape=(None, None, None, 100))
W = tf.Variable(tf.truncated_normal([100, 50], stddev=0.1))
X_ = tf.reshape(X, [-1, 100])
Y_ = tf.matmul(X_, W)
X_shape = tf.gather(tf.shape(X), [0,1,2]) # Extract the first three dimensions
target_shape = tf.concat(0, [X_shape, [50]])
Y = tf.reshape(Y_, target_shape)
Y_avg = tf.reduce_mean(Y, 2)
As the renamed title of the GitHub issue you linked suggests, you should use tf.tensordot(). It enables contraction of axes pairs between two tensors, in line with Numpy's tensordot(). For your case:
X = tf.placeholder(tf.float32, shape=(5, 10, 4, 100))
W = tf.Variable(tf.truncated_normal([100, 50], stddev=0.1))
Y = tf.tensordot(X, W, [[3], [0]]) # gives shape=[5, 10, 4, 50]
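For comparison, the same contraction can also be written with tf.einsum (a sketch assuming the same X and W as above):
Y = tf.einsum('abcd,de->abce', X, W)  # also gives shape [5, 10, 4, 50]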

Outer product in tensorflow

In tensorflow, there are nice functions for entrywise and matrix multiplication, but after looking through the docs, I cannot find any internal function for taking an outer product of two tensors, i.e., making a bigger tensor by all possible products of elements of smaller tensors (like numpy.outer):
v_{i,j} = x_i*h_j
or
M_{ij,kl} = A_{ij}*B_{kl}
Does tensorflow have such a function?
Yes, you can do this by taking advantage of the broadcast semantics of tensorflow. Reshape the first tensor to size 1xN and the second to size Mx1, and when you multiply them they broadcast to an MxN result containing all of the products.
(You can play around with the same thing in numpy to see how it behaves in a simpler context, btw:
import numpy as np
a = np.array([1, 2, 3, 4, 5]).reshape([5,1])
b = np.array([6, 7, 8, 9, 10]).reshape([1,5])
a*b
How exactly you do it in tensorflow depends a bit on which axes you want to use and what semantics you want for the resulting multiply, but the general idea applies.)
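A minimal TensorFlow sketch of the same idea (values are illustrative), computing v[i,j] = x[i]*h[j] as in the question:
import tensorflow as tf
x = tf.constant([1., 2., 3., 4., 5.])
h = tf.constant([6., 7., 8., 9., 10.])
v = tf.reshape(x, [5, 1]) * tf.reshape(h, [1, 5])  # broadcasts (5,1)*(1,5) -> (5,5)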
It is somewhat surprising that until recently there was no easy and "natural" way of doing an outer product between arbitrary tensors (also known as "tensor product") in tensorflow, especially given the name of the library...
With tensorflow>=1.6 you can now finally get what you want with a simple:
M = tf.tensordot(A, B, axes=0)
In earlier versions of tensorflow, axes=0 raises a ValueError: 'axes' must be at least 1. Somehow tf.tensordot() used to need at least one dimension to actually sum over. The easy way out is to simply add a "fake" dimension with tf.expand_dims().
On tensorflow<=1.5 you can thus get the same result as above by doing:
M = tf.tensordot(tf.expand_dims(A, 0), tf.expand_dims(B, 0), axes=[[0],[0]])
This adds a new index of dimension 1 in location 0 for both tensors and then lets tf.tensordot() sum over those indices.
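A quick sanity check of that workaround with small vectors (values are illustrative):
import tensorflow as tf
A = tf.constant([1., 2.])
B = tf.constant([3., 4., 5.])
M = tf.tensordot(tf.expand_dims(A, 0), tf.expand_dims(B, 0), axes=[[0], [0]])
print(M.shape)  # (2, 3): M[i, j] = A[i] * B[j]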
In case someone else stumbles upon this, according to the tensorflow docs you can use the tf.einsum() function to compute the outer product of two tensors a and b:
# Outer product
>>> einsum('i,j->ij', u, v) # output[i,j] = u[i]*v[j]
tf.multiply() (and its '*' shortcut) results in an outer product, whether or not a batch is used. In particular, if the two input tensors have 3D shapes of [batch, n, 1] and [batch, 1, n], then this op will calculate the outer product for [n,1],[1,n] per each sample in the batch. If there is no batch, so that the two input tensors are 2D, this op will calculate the outer product just the same.
On the other hand, while tf.tensordot yields the outer product for 2D matrices, it does not broadcast similarly when a batch is added.
Without a batch:
a_np = np.array([[1, 2, 3]]) # shape: (1,3) [a row vector], 2D Tensor
b_np = np.array([[4], [5], [6]]) # shape: (3,1) [a column vector], 2D Tensor
a = tf.placeholder(dtype='float32', shape=[1, 3])
b = tf.placeholder(dtype='float32', shape=[3, 1])
c = a*b # Result: an outer-product of a,b
d = tf.multiply(a,b) # Result: an outer-product of a,b
e = tf.tensordot(a,b, axes=[0,1]) # Result: an outer-product of a,b
With a batch:
a_np = np.array([[[1, 2, 3]], [[4, 5, 6]]]) # shape: (2,1,3) [a batch of two row vectors], 3D Tensor
b_np = np.array([[[7], [8], [9]], [[10], [11], [12]]]) # shape: (2,3,1) [a batch of two column vectors], 3D Tensor
a = tf.placeholder(dtype='float32', shape=[None, 1, 3])
b = tf.placeholder(dtype='float32', shape=[None, 3, 1])
c = a*b # Result: an outer-product per batch
d = tf.multiply(a,b) # Result: an outer-product per batch
e = tf.tensordot(a,b, axes=[1,2]) # Does NOT result with an outer-product per batch
Running either of these two graphs:
sess = tf.Session()
result_astrix = sess.run(c, feed_dict={a:a_np, b: b_np})
result_multiply = sess.run(d, feed_dict={a:a_np, b: b_np})
result_tensordot = sess.run(e, feed_dict={a:a_np, b: b_np})
print('a*b:')
print(result_astrix)
print('tf.multiply(a,b):')
print(result_multiply)
print('tf.tensordot(a,b, axes=[1,2]:')
print(result_tensordot)
As pointed out in the other answers, the outer product can be done using broadcasting:
a = tf.range(10)
b = tf.range(5)
outer = a[..., None] * b[None, ...]
tf.InteractiveSession().run(outer)
# array([[ 0, 0, 0, 0, 0],
# [ 0, 1, 2, 3, 4],
# [ 0, 2, 4, 6, 8],
# [ 0, 3, 6, 9, 12],
# [ 0, 4, 8, 12, 16],
# [ 0, 5, 10, 15, 20],
# [ 0, 6, 12, 18, 24],
# [ 0, 7, 14, 21, 28],
# [ 0, 8, 16, 24, 32],
# [ 0, 9, 18, 27, 36]], dtype=int32)
Explanation:
The a[..., None] inserts a new dimension of length 1 after the last axis.
Similarly, b[None, ...] inserts a new dimension of length 1 before the first axis.
The element-wise multiplication then broadcasts the tensors from shapes (10, 1) * (1, 5) to (10, 5) * (10, 5), computing the outer product.
Where you insert the additional dimensions determines for which dimensions the outer product is computed. For example, if both tensors have a batch size, you can skip that dimension using :, which gives a[:, ..., None] * b[:, None, ...]. This can be further abbreviated as a[..., None] * b[:, None]. To perform the outer product over the last dimension and thus support any number of batch dimensions, use a[..., None] * b[..., None, :]. A batched example follows below.
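For instance, a sketch with one batch dimension (shapes are illustrative):
a = tf.ones([2, 10])  # a batch of two vectors
b = tf.ones([2, 5])
outer = a[..., None] * b[:, None]  # shape (2, 10, 5): one outer product per batch entry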
I would have commented to MasDra, but SO wouldn't let me as a new registered user.
The general outer product of multiple vectors arranged in a list U of length order can be obtained via
tf.einsum(','.join(string.ascii_lowercase[0:order])+'->'+string.ascii_lowercase[0:order], *U)
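For instance, with three vectors (illustrative shapes; note this relies on import string):
import string
import tensorflow as tf
U = [tf.ones([2]), tf.ones([3]), tf.ones([4])]
order = len(U)
subscripts = ','.join(string.ascii_lowercase[0:order]) + '->' + string.ascii_lowercase[0:order]
M = tf.einsum(subscripts, *U)  # 'a,b,c->abc', shape (2, 3, 4)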