How do I mask for 2-D MultiHeadAttention in Tensorflow? - tensorflow

Can anyone help me understand masking a 3D input (technically 4D) in MultiHeadAttention?
My original dataset consists of timeseries in the form of:
Inputs: (samples, horizon, features) ~> (8, 4, 2) ~> K, V, Q during inference
Targets: (samples, horizon, features) ~> (8, 4, 2) ~> Q during training
Labels: (sample, horizon, features) ~> (1, 4, 2)
Essentially I'm taking 8 samples of timeseries data and ultimately outputting 1 sample in the same format. Targets are horizon-shifted values of Inputs and fed into an encoder-only Transformer model (Q, K, V as shown above).
In order to best approximate the single output sample (which is identical to the last sample in Targets), I need to run full attention on the horizons of each sample and causal attention between samples. Once the data has been run through the encoder, it is sent to an EinsumDense layer which reduces the (8, 4, 2) encoder output into (1, 4, 2). In order for all this to work, I need to inject a 4th dimension into my data, so Inputs and Targets are formatted as (1, 8, 4, 2).
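To make the shapes concrete, the samples-axis reduction can be sketched with a plain tf.einsum; the equation and weight shape here are only an illustration, not my exact EinsumDense configuration:
import tensorflow as tf

# Encoder output: (batch, samples, horizon, features) = (1, 8, 4, 2)
enc_out = tf.random.normal((1, 8, 4, 2))

# Illustrative weights over the samples axis, collapsing 8 samples into 1.
w = tf.random.normal((8, 1))

reduced = tf.einsum("bshf,so->bohf", enc_out, w)  # (1, 1, 4, 2)
reduced = tf.squeeze(reduced, axis=1)             # (1, 4, 2)
print(reduced.shape)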
So getting to my actual question, how do I generate the masking for the encoder? After some digging around through errors I noticed that the shape of the tensor that MHA uses for masking the softmax is formatted (1, 1, 8, 4, 8, 4) which makes me believe it's (B, H, TS, TH, SS, SH) where:
B=batch
H=heads
TS=target samples
TH=target horizon
SS=source samples
SH=source horizon
I gather this notion from the docs only because of the attention_output description:
...where T is for target sequence shapes
Assuming this to be the case, is the following a reasonable mask, or is there a more appropriate method:
sample_mask = tf.linalg.band_part(tf.ones((samples, samples)), -1, 0)
horizon_mask = tf.ones((horizon, horizon))
encoder_mask = (
    sample_mask[:, tf.newaxis, :, tf.newaxis]
    * horizon_mask[tf.newaxis, :, tf.newaxis, :]
)
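For context, this is roughly how I intend to pass the mask to the layer; the attention_axes value and the assumption that the mask is broadcast over the heads axis are mine, not something I have confirmed from the docs:
import tensorflow as tf

n_samples, n_horizon, n_features = 8, 4, 2

# Self-attention over both the samples and horizon axes of a (1, 8, 4, 2) input.
mha = tf.keras.layers.MultiHeadAttention(
    num_heads=2,
    key_dim=n_features,
    attention_axes=(1, 2),  # attend jointly over (samples, horizon)
)

x = tf.random.normal((1, n_samples, n_horizon, n_features))

sample_mask = tf.linalg.band_part(tf.ones((n_samples, n_samples)), -1, 0)
horizon_mask = tf.ones((n_horizon, n_horizon))
encoder_mask = (
    sample_mask[:, tf.newaxis, :, tf.newaxis]
    * horizon_mask[tf.newaxis, :, tf.newaxis, :]
)  # (8, 4, 8, 4): causal over samples, full over horizon

# Add a leading batch axis; I assume the heads axis is added internally,
# giving the (1, 1, 8, 4, 8, 4) softmax mask shape I observed.
out = mha(query=x, value=x, key=x,
          attention_mask=encoder_mask[tf.newaxis, ...])
print(out.shape)  # (1, 8, 4, 2)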

It is masking, and you can shape it however you like since data can be arranged in many ways, so there is nothing wrong with your approach. Here, however, I am using the built-in TensorFlow method, the tf.keras.layers.Masking layer; please see the results below, where both masks work on the same dimensions.
Sample: the code below builds an input with your target shapes, constructs your encoder mask, and applies a Masking layer so you can compare the two.
import tensorflow as tf
import matplotlib.pyplot as plt

# Build a dummy (8, 4, 2) integer input tensor.
start = 3
limit = 25
delta = 3
sample = tf.range(start, limit, delta)            # 8 values: 3, 6, ..., 24
sample = tf.cast(sample, dtype=tf.int64)
sample = tf.reshape(sample, (8, 1))
horizon = tf.random.uniform(shape=[1, 4], minval=5, maxval=10, dtype=tf.int64)
features = tf.random.uniform(shape=[1, 1, 2], minval=-5, maxval=5, dtype=tf.int64)
temp = tf.math.multiply(sample, horizon)          # (8, 4)
temp = tf.expand_dims(temp, axis=2)               # (8, 4, 1)
input = tf.math.multiply(temp, features)          # (8, 4, 2)
print("input:")
print(input)

n_samples = 8
n_horizon = 4
n_features = 2

# Your mask: causal over samples, full over horizon -> shape (8, 4, 8, 4).
sample_mask = tf.linalg.band_part(tf.ones((n_samples, n_samples)), -1, 0)
horizon_mask = tf.ones((n_horizon, n_horizon))
encoder_mask = (
    sample_mask[:, tf.newaxis, :, tf.newaxis]
    * horizon_mask[tf.newaxis, :, tf.newaxis, :]
)
print(encoder_mask)

# Built-in approach: the Masking layer marks timesteps whose features all equal mask_value.
masking_layer = tf.keras.layers.Masking(mask_value=50, input_shape=(n_horizon, n_features))
print(masking_layer(input))

# Visualize the raw input and the Masking layer's output side by side.
img_1 = tf.keras.preprocessing.image.array_to_img(
    tf.reshape(input[:, :, 1], (8, 4, 1)),
    data_format=None,
    scale=True,
)
img_2 = tf.keras.preprocessing.image.array_to_img(
    tf.reshape(masking_layer(input)[:, :, 0], (8, 4, 1)),
    data_format=None,
    scale=True,
)
plt.figure(figsize=(1, 2))
plt.title("🧸")
plt.subplot(1, 2, 1)
plt.xticks([])
plt.yticks([])
plt.grid(False)
plt.imshow(img_1)
plt.xlabel("Input (8, 4, 2), left")
plt.subplot(1, 2, 2)
plt.xticks([])
plt.yticks([])
plt.grid(False)
plt.imshow(img_2)
plt.xlabel("Masks (8, 4, 2), right")
plt.show()
Output: the input tensor built above (truncated).
[[ -960 0]
[-1080 0]
[ -960 0]
[ -960 0]]], shape=(8, 4, 2), dtype=int64)
Output: the encoder mask built with your method (truncated).
[[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
...
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]]]], shape=(8, 4, 8, 4), dtype=float32)
Output: the result of masking_layer = tf.keras.layers.Masking(mask_value=50, input_shape=(n_horizon, n_features)) (truncated).
[[ -840 0]
[ -945 0]
[ -840 0]
[ -840 0]]
[[ -960 0]
[-1080 0]
[ -960 0]
[ -960 0]]], shape=(8, 4, 2), dtype=int64)

Related

What is the difference between batch, batch_size, timesteps & features in Tensorflow?

I am new to deep learning and I am utterly confused about the terminology.
In the Tensorflow documentation,
for the RNN layer (https://www.tensorflow.org/api_docs/python/tf/keras/layers/RNN#input_shape):
N-D tensor with shape [batch_size, timesteps, ...]
for the LSTM layer (https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM):
inputs: A 3D tensor with shape [batch, timesteps, feature].
I understand that for input_shape we don't have to specify the batch/batch size.
But I would still like to know the difference between batch and batch_size.
What are timesteps vs. features?
Is the 1st dimension always the batch, the 2nd timesteps, and the 3rd features?
Example 1
from numpy import array
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM

data = array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
data = data.reshape((1, 5, 2))
print(data.shape)  # (1, 5, 2)
print(data)
[[[ 1  2]
  [ 3  4]
  [ 5  6]
  [ 7  8]
  [ 9 10]]]
model = Sequential()
model.add(LSTM(32, input_shape=(5, 2)))
Example 2
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

data1 = array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
n_features = 1
data1 = data1.reshape((len(data1), n_features))
print(data1)
# define generator
n_input = 2
generator = TimeseriesGenerator(data1, data1, length=n_input, stride=2, batch_size=10)
# number of batch
print('Batches: %d' % len(generator))
# OUT --> Batches: 1
# print each batch
for i in range(len(generator)):
    x, y = generator[i]
    print('%s => %s' % (x, y))
x, y = generator[0]
print(x.shape)
[[[ 1]
[ 2]]
[[ 3]
[ 4]]
[[ 5]
[ 6]]
[[ 7]
[ 8]]
[[ 9]
[10]]] => [[ 3]
[ 5]
[ 7]
[ 9]
[11]]
(5, 2, 1)
# define model
model = Sequential()
model.add(LSTM(100, activation='relu', input_shape=(n_input, n_features)))
Difference between batch_size v. batch
In the documentation you quoted, batch means batch_size.
Meaning of timesteps and feature
Taking a glance at https://www.tensorflow.org/tutorials/structured_data/time_series (weather forecast example with real-world data!) will help you understand more about time-series data.
feature is what you want the model to make predictions from; in the above forecast example, it is a vector (array) of pressure, temperature, etc.
RNN/LSTM layers are designed to handle time series. This is why you need to feed timesteps, along with feature, to your model. timesteps represents when the data was recorded; again, in the example above, data is sampled every hour, so timesteps == 0 is the data taken at the first hour, timesteps == 1 at the second hour, and so on.
Order of dimensions of the input/ output data
In TensorFlow, the first dimension of data often represents a batch.
What comes after the batch axis depends on the problem domain. In general, global axes (like the batch) precede element-specific axes (like image height and width).
Examples:
Time-series data are in (batch_size, timesteps, feature) format.
Image data are often represented in NHWC format: (batch_size, image_height, image_width, channels).
From https://www.tensorflow.org/guide/tensor#about_shapes :
While axes are often referred to by their indices, you should always
keep track of the meaning of each. Often axes are ordered from global
to local: The batch axis first, followed by spatial dimensions, and
features for each location last. This way feature vectors are
contiguous regions of memory.
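As a quick illustration of the (batch_size, timesteps, feature) convention (the layer sizes here are arbitrary, chosen only for the example):
import numpy as np
import tensorflow as tf

# 32 samples, each a sequence of 24 hourly timesteps with 3 measurements
# (e.g. pressure, temperature, humidity) per timestep.
x = np.random.rand(32, 24, 3).astype("float32")  # (batch_size, timesteps, features)

# input_shape omits the batch axis: only (timesteps, features) is specified.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(16, input_shape=(24, 3)),
    tf.keras.layers.Dense(1),
])
print(model(x).shape)  # (32, 1) -> one prediction per sample in the batch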

Discrepancy between tensorflow's conv1d and pytorch's conv1d

I am trying to port some PyTorch code to TensorFlow. I learned that torch.nn.functional.conv1d() corresponds to tf.nn.conv1d(), but there still seem to be discrepancies between the two versions. Specifically, I cannot find the groups parameter in tf.nn.conv1d. For example, the following two snippets produce different results:
Pytorch:
import torch
import torch.nn.functional as F

inputs = torch.Tensor([[[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]]])  # batch_size x seq_length x embed_dim
inputs = inputs.transpose(2, 1)  # batch_size x embed_dim x seq_length
batch_size, embed_dim, seq_length = inputs.size()
kernel_size = 3
in_channels = 2
out_channels = in_channels
weight = torch.ones(out_channels, 1, kernel_size)
inputs = inputs.contiguous().view(-1, in_channels, seq_length)  # batch_size*embed_dim/in_channels x in_channels x seq_length
inputs = F.pad(inputs, (kernel_size - 1, 0), 'constant', 0)
output = F.conv1d(inputs, weight, padding=0, groups=in_channels)
output = output.contiguous().view(batch_size, embed_dim, seq_length).transpose(2, 1)
Output:
tensor([[[1., 1., 1., 1.],
         [3., 3., 3., 3.],
         [6., 6., 6., 6.]]])
Tensorflow:
import tensorflow as tf

inputs = tf.constant([[[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]]], dtype=tf.float32)  # batch_size x seq_length x embed_dim
inputs = tf.transpose(inputs, perm=[0, 2, 1])
batch_size, embed_dim, seq_length = inputs.get_shape()
print(batch_size, seq_length, embed_dim)
kernel_size = 3
in_channels = 2
out_channels = in_channels
weight = tf.ones([kernel_size, in_channels, out_channels])
inputs = tf.reshape(inputs, [(batch_size * embed_dim) // in_channels, in_channels, seq_length], name='inputs')
inputs = tf.transpose(inputs, perm=[0, 2, 1])
padding = [[0, 0], [(kernel_size - 1), 0], [0, 0]]
padded = tf.pad(inputs, padding)
res = tf.nn.conv1d(padded, weight, 1, 'VALID')
res = tf.transpose(res, perm=[0, 2, 1])
res = tf.reshape(res, [batch_size, embed_dim, seq_length])
res = tf.transpose(res, perm=[0, 2, 1])
print(res)
Output:
tf.Tensor(
[[[ 2.  2.  2.  2.]
  [ 6.  6.  6.  6.]
  [12. 12. 12. 12.]]], shape=(1, 3, 4), dtype=float32)
Different results
There is no discrepancy between the two versions; you are just setting up different things. To get exactly the same results as in TensorFlow, change the line specifying the weights to:
weight = torch.ones(out_channels, 2, kernel_size)
This is because your input has two input channels, as you correctly declared in TF:
weight = tf.ones([kernel_size, in_channels, out_channels])
Groups parameter
You have misunderstood what the groups parameter is responsible for in PyTorch. It restricts the number of input channels each filter sees (in this case only one, because 2 input channels divided into 2 groups leaves one channel per group).
See here for a more intuitive explanation in the 2D convolution setting.
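If you do need the grouped behaviour on the TensorFlow side, one way to emulate it is to split the input along the channel axis, convolve each group separately, and concatenate the results. This is only a sketch; the helper name and shapes are mine:
import tensorflow as tf

def grouped_conv1d(x, kernels, groups):
    """Emulate PyTorch's groups parameter for a 1-D convolution.

    x:       (batch, width, in_channels), channels-last as tf.nn.conv1d expects
    kernels: list of `groups` filters, each (kernel_size, in_channels // groups, out_per_group)
    """
    splits = tf.split(x, groups, axis=-1)                   # one slice per group
    outs = [tf.nn.conv1d(s, k, stride=1, padding='VALID')   # convolve each group on its own
            for s, k in zip(splits, kernels)]
    return tf.concat(outs, axis=-1)                         # stitch the groups back together

# Example: 2 channels, 2 groups -> each filter only sees one input channel,
# matching F.conv1d(..., groups=2) with a weight of shape (2, 1, kernel_size).
x = tf.random.normal((1, 10, 2))
kernels = [tf.ones((3, 1, 1)) for _ in range(2)]
print(grouped_conv1d(x, kernels, groups=2).shape)  # (1, 8, 2)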

tensorflow: shift zeros to the end

Given a tensor (with numbers >= 0) in TensorFlow, I need to shift all zeros to the end of each row and remove columns that only contain zeros.
E.g.
0 2 3 4
0 1 0 5
2 3 1 0
should be transformed to
2 3 4
1 5 0
2 3 1
Is there any nice way to do this in tensorflow? Btw, the order of the non-zero elements should be the same (no sorting).
Ragged tensor method
The best way
import tensorflow as tf

def rm_zeros(pred):
    pred = tf.cast(pred, tf.float32)
    # number of non-zero elements in every row
    num_non_zero = tf.math.count_nonzero(pred, -1)  # [3 2 3]
    # flatten the input and remove all zeros
    flat_pred = tf.reshape(pred, [-1])
    mask = tf.math.logical_not(tf.equal(flat_pred, tf.zeros_like(flat_pred)))
    flat_pred_without_zero = tf.boolean_mask(flat_pred, mask)  # [2. 3. 4. 1. 5. 2. 3. 1.]
    # create a ragged tensor and convert it back to a tensor; rows are padded to the max length
    ragged_pred = tf.RaggedTensor.from_row_lengths(values=flat_pred_without_zero, row_lengths=num_non_zero)
    padded_pred = ragged_pred.to_tensor(default_value=0.)
    return padded_pred

a = tf.constant([[0, 2, 3, 4], [0, 1, 0, 5], [2, 3, 1, 0]])
print(rm_zeros(a))
output
tf.Tensor(
[[2. 3. 4.]
 [1. 5. 0.]
 [2. 3. 1.]], shape=(3, 3), dtype=float32)
Sorted method
If you don't mind the original data getting sorted, the code below might be helpful, although it's not the best solution.
The idea here is
1. change all zeros to infinity
2. sort the tensor
3. change all infinity back to zeros
4. slice the tensor to get minimal padding
import numpy as np
import tensorflow as tf

def rm_zeros_sorted(input):
    input = tf.cast(input, tf.float32)
    # 1. change all zeros to infinity
    zero_to_inf = tf.where(tf.equal(input, tf.zeros_like(input)), np.inf * tf.ones_like(input), input)
    # 2. sort the tensor
    input_sorted = tf.sort(zero_to_inf, axis=-1, direction='ASCENDING')
    # 3. change all infinity back to zeros
    inf_to_zero = tf.where(tf.math.is_inf(input_sorted), tf.zeros_like(input_sorted), input_sorted)
    # 4. slice the tensor to get minimal padding
    num_non_zero = tf.math.count_nonzero(inf_to_zero, -1)
    max_non_zero = tf.reduce_max(num_non_zero)
    remove_useless_zero = inf_to_zero[..., 0:max_non_zero]
    return remove_useless_zero

a = tf.constant([[0, 2, 3, 4], [0, 1, 0, 5], [2, 3, 1, 0]])
print(rm_zeros_sorted(a))
a = tf.constant([[0, 2, 3, 4],[0, 1, 0, 5],[2, 3, 1, 0]])
print(rm_zeros_sorted(a))
output
tf.Tensor(
[[2. 3. 4.]
[1. 5. 0.]
[1. 2. 3.]], shape=(3, 3), dtype=float32)
The code below gets the trick done, although I'm sure that there are more elegant solutions possible and I'm curious to see those. The annoying part is that you have different amounts of zeros for each row.
import tensorflow as tf

a = tf.constant([[0, 2, 3, 4], [0, 1, 0, 5], [2, 3, 1, 0]])
boolean_mask = tf.logical_not(tf.equal(a, tf.zeros_like(a)))
# all the non-zero values in a flat tensor
non_zero_values = tf.gather_nd(a, tf.where(boolean_mask))
# number of non-zero values in each row
n_non_zero = tf.reduce_sum(tf.cast(boolean_mask, tf.int64), axis=-1)
# max number of non-zeros -> this will be the padding length
max_non_zero = tf.reduce_max(n_non_zero).numpy()
(Here it gets ugly)
# Split the tensor into flat tensors with the non-zero values of each row
rows = tf.split(non_zero_values, n_non_zero)
# Pad with zeros wherever necessary and recombine into a single tensor
tf.stack([tf.pad(r, paddings=[[0, max_non_zero - r.get_shape().as_list()[0]]]) for r in rows])
Produces the desired result:
<tf.Tensor: id=49, shape=(3, 3), dtype=int32, numpy=
array([[2, 3, 4],
[1, 5, 0],
[2, 3, 1]], dtype=int32)>
def shift_zeros(data, mask):
    data_flat = tf.boolean_mask(data, mask)
    nonzero_lens = tf.reduce_sum(tf.cast(mask, dtype=tf.int32), axis=-1)
    nonzero_mask = tf.sequence_mask(nonzero_lens, maxlen=tf.shape(mask)[-1])
    nonzero_data = tf.scatter_nd(tf.cast(tf.where(nonzero_mask), dtype=tf.int32), data_flat, shape=tf.shape(data))
    return nonzero_data
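For completeness, here is one way the shift_zeros helper above could be used end to end; the mask construction and the final column-trimming step are my additions, so treat this as a sketch:
import tensorflow as tf

a = tf.constant([[0, 2, 3, 4], [0, 1, 0, 5], [2, 3, 1, 0]])
mask = tf.not_equal(a, 0)           # True where a value should be kept
shifted = shift_zeros(a, mask)      # zeros pushed to the end of each row

# Drop the columns that contain only zeros.
max_non_zero = tf.reduce_max(tf.reduce_sum(tf.cast(mask, tf.int32), axis=-1))
print(shifted[:, :max_non_zero])
# tf.Tensor(
# [[2 3 4]
#  [1 5 0]
#  [2 3 1]], shape=(3, 3), dtype=int32)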

K-means example(tf.expand_dims)

In the example K-means code for TensorFlow below, the function tf.expand_dims (which inserts a dimension of 1 into a tensor's shape) is used to build points_expanded and centroids_expanded before tf.reduce_sum is computed.
Why do the two calls use different indexes (0 and 1) as the second parameter?
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

points_n = 200
clusters_n = 3
iteration_n = 100

points = tf.constant(np.random.uniform(0, 10, (points_n, 2)))
centroids = tf.Variable(tf.slice(tf.random_shuffle(points), [0, 0], [clusters_n, -1]))

points_expanded = tf.expand_dims(points, 0)
centroids_expanded = tf.expand_dims(centroids, 1)

distances = tf.reduce_sum(tf.square(tf.subtract(points_expanded, centroids_expanded)), 2)
assignments = tf.argmin(distances, 0)

means = []
for c in range(clusters_n):
    means.append(tf.reduce_mean(
        tf.gather(points, tf.reshape(tf.where(tf.equal(assignments, c)), [1, -1])),
        reduction_indices=[1]))

new_centroids = tf.concat(means, 0)
update_centroids = tf.assign(centroids, new_centroids)
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for step in range(iteration_n):
        [_, centroid_values, points_values, assignment_values] = sess.run(
            [update_centroids, centroids, points, assignments])
    print("centroids" + "\n", centroid_values)

plt.scatter(points_values[:, 0], points_values[:, 1], c=assignment_values, s=50, alpha=0.5)
plt.plot(centroid_values[:, 0], centroid_values[:, 1], 'kx', markersize=15)
plt.show()
This is done to subtract each centroid from each point. First, make sure you understand the notion of broadcasting (https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html), which is linked from tf.subtract (https://www.tensorflow.org/api_docs/python/tf/subtract). Then you just need to draw the shapes of points, points_expanded, centroids, and centroids_expanded and understand which values get "broadcast" where. Once you do that, you will see that broadcasting lets you compute exactly what you want: subtract each point from each centroid.
As a sanity check, since there are 200 points, 3 centroids, and each is 2D, we should have 200*3*2 differences. This is exactly what we get:
In [53]: points
Out[53]: <tf.Tensor 'Const:0' shape=(200, 2) dtype=float64>
In [54]: points_expanded
Out[54]: <tf.Tensor 'ExpandDims_4:0' shape=(1, 200, 2) dtype=float64>
In [55]: centroids
Out[55]: <tf.Variable 'Variable:0' shape=(3, 2) dtype=float64_ref>
In [56]: centroids_expanded
Out[56]: <tf.Tensor 'ExpandDims_5:0' shape=(3, 1, 2) dtype=float64>
In [57]: tf.subtract(points_expanded, centroids_expanded)
Out[57]: <tf.Tensor 'Sub_5:0' shape=(3, 200, 2) dtype=float64>
If you are having trouble drawing the shapes, you can think of broadcasting points_expanded from shape (1, 200, 2) to shape (3, 200, 2) as copying the 200x2 matrix 3 times along the first dimension. The 3x2 matrix in centroids_expanded (of shape (3, 1, 2)) gets copied 200 times along the second dimension.
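To make the broadcasting concrete, here is the same subtraction in eager mode with tiny sizes (3 points, 2 centroids); the numbers are made up purely for illustration:
import tensorflow as tf

points = tf.constant([[0., 0.], [1., 1.], [2., 2.]])   # (3, 2): 3 points in 2-D
centroids = tf.constant([[0., 0.], [2., 2.]])          # (2, 2): 2 centroids in 2-D

points_expanded = tf.expand_dims(points, 0)            # (1, 3, 2)
centroids_expanded = tf.expand_dims(centroids, 1)      # (2, 1, 2)

# Broadcasting stretches both operands to (2, 3, 2): every point minus every centroid.
diff = points_expanded - centroids_expanded
distances = tf.reduce_sum(tf.square(diff), 2)          # (2, 3): squared distance per (centroid, point)
assignments = tf.argmin(distances, 0)                  # (3,): nearest centroid index per point
print(diff.shape, distances.shape, assignments.numpy())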

How to get tensorflow to do a convolution on a 2 x 2 matrix with a 1 x 2 kernel?

I have the following 2 x 2 matrix:
[[0, 1],
 [2, 3]]
and the following 1 x 2 kernel:
[[1, 2]]
If I do a convolution with no padding and slide by 1 row, I should get the following answer:
[[2],
 [8]]
Because:
0*1 + 1*2 = 2
2*1 + 3*2 = 8
Based on the documentation of tf.nn.conv2d, I thought this code expresses what I just described above:
import tensorflow as tf
input_batch = tf.constant([
    [
        [[0.], [1.0]],
        [[2.], [3.]]
    ]
])
kernel = tf.constant([
    [
        [[1.0, 2.0]]
    ]
])
conv2d = tf.nn.conv2d(input_batch, kernel, strides=[1, 1, 1, 1], padding='VALID')
sess = tf.Session()
print(sess.run(conv2d))
But it produces this output:
[[[[ 0.  0.]
   [ 1.  2.]]
  [[ 2.  4.]
   [ 3.  6.]]]]
And I have no clue how that is computed. I've tried experimenting with different values for the strides and padding parameters but still am not able to produce the result I expected.
You have not read my explanation in the tutorial you linked correctly. After a straightforward modification for no padding and strides=1, you are supposed to get the following code.
import tensorflow as tf

k = tf.constant([
    [1, 2],
], dtype=tf.float32, name='k')
i = tf.constant([
    [0, 1],
    [2, 3],
], dtype=tf.float32, name='i')
kernel = tf.reshape(k, [1, 2, 1, 1], name='kernel')
image = tf.reshape(i, [1, 2, 2, 1], name='image')
res = tf.squeeze(tf.nn.conv2d(image, kernel, [1, 1, 1, 1], "VALID"))
# VALID means no padding
with tf.Session() as sess:
    print(sess.run(res))
This gives you the result you expected: [2., 8.]. You get a vector instead of a column because of the squeeze operator.
One problem I see with your code (there might be others) is that your kernel has shape (1, 1, 1, 2), but it is supposed to be (1, 2, 1, 1).
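For anyone on TensorFlow 2.x, the same computation should work eagerly without a session; this is a sketch of the equivalent call, with the kernel laid out as [height, width, in_channels, out_channels]:
import tensorflow as tf

image = tf.reshape(tf.constant([[0., 1.], [2., 3.]]), [1, 2, 2, 1])   # NHWC input
kernel = tf.reshape(tf.constant([[1., 2.]]), [1, 2, 1, 1])            # (h, w, in, out)

res = tf.squeeze(tf.nn.conv2d(image, kernel, strides=1, padding="VALID"))
print(res.numpy())  # [2. 8.]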