Keras video frame prediction with lower output dimension than input dimension - tensorflow

I want to train a Keras DNN for video frame prediction:
Input: 4 consecutive Frames of a Video
Output: Next frame, predicted from the network
So, basically, the dimensions are: input: (number_samples, 4, 60, 60), output: (number_samples, 1, 60, 60). I need some help in getting from the 4 frames in the input down to 1 frame in the output.
I've found an example here and would like to work with that.
The problem is that in that network the output is not one frame but the same number of frames as the input (so my task is actually simpler, because I want to generate only one next frame, not 4). I don't know which layers I could append at the end of the network, or how I could change the network, so that the output has the desired dimensions (one frame instead of 4).
Appending a Conv2D layer at the end did not work, because it does not match the dimensions of the Conv3D layer.
Any ideas on how to approach this problem and what my network architecture could look like? Any general advice on the task and on how to build a good network for it is also appreciated.

This loop in the code example (for which you gave the URL) can be tailored to do what you desire.
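for j in range(16):
    # predict on the frames so far, adding a batch axis
    new_pos = seq.predict(track[np.newaxis, ::, ::, ::, ::])
    # keep only the last predicted frame (the "next" frame)
    new = new_pos[::, -1, ::, ::, ::]
    # append it to the sequence
    track = np.concatenate((track, new), axis=0)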
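The slicing idea in that loop can be tried without a trained model. The sketch below is only an illustration: `fake_predict` is a hypothetical stand-in for `seq.predict`, and the trailing channel axis is an assumption (the question's shapes do not list one). It shows how an output with 4 frames per sample can be reduced to a single next frame per sample.

```python
import numpy as np

def fake_predict(batch):
    """Hypothetical stand-in for seq.predict: one output frame per input frame."""
    # batch has shape (samples, frames, rows, cols, channels)
    return batch + 1.0

# 10 samples of 4 consecutive 60x60 single-channel frames (assumed layout)
x = np.zeros((10, 4, 60, 60, 1), dtype=np.float32)
pred = fake_predict(x)            # shape (10, 4, 60, 60, 1)

# keep only the last time step, preserving a length-1 frame axis
next_frame = pred[:, -1:, ...]    # shape (10, 1, 60, 60, 1)

# drop the channel axis to match the (number_samples, 1, 60, 60) target
target_like = next_frame[..., 0]  # shape (10, 1, 60, 60)
```

The same slice can be applied to the targets during training, so that the loss is computed on the last frame only.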


How to convert this numpy one-liner into Tensorflow backend code?

I have multiple depth maps which show a car from different angles. I need to calculate how well they match in my loss function, so I have to reproject them into a different view. The depth maps live in a cube whose size is relative to the length of the vehicle. The images have the shape (256,256). I already wrote the code to convert them to a point cloud of shape (256*256,3) with backend functions. I can reproject this point cloud to the side view with numpy like this:
reProj = np.zeros((256, 256), np.float32)
reProj[pointCloud[:, 1], pointCloud[:, 2]] = pointCloud[:, 0]
How can I convert this into keras backend code? I suspect there should be a gather somewhere in there, but I just cannot get it working.
Source depth image:
Thanks for your help!
Edit: Minimal working example with data:
You can do this with tf.matmul(). The first input is your point cloud; from the dimensions I assume you store a 3D vector (x, y, z) for every pixel. The second input is the 3D rotation matrix corresponding to the projection you need. Keep in mind that this works for any angle; you just need to define the 3x3 matrix.
If I understand your data correctly, you need to rotate by 90 degrees about the x axis, so the matrix would be
1 0 0
0 0 -1
0 1 0
Read more on rotation matrices here;
just go to the three-dimensional case and see what you need.
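As a rough NumPy illustration of this suggestion (the toy points are made up, and `np.matmul` via `@` stands in for `tf.matmul`), a point cloud stored as one (x, y, z) row per point can be rotated by multiplying with the transpose of the matrix quoted above:

```python
import numpy as np

# Rotation by 90 degrees about the x axis (the matrix from the answer)
R_x90 = np.array([[1, 0, 0],
                  [0, 0, -1],
                  [0, 1, 0]], dtype=np.float32)

# Toy point cloud: one (x, y, z) row per point
points = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]], dtype=np.float32)

# Row vectors are rotated by multiplying with the transposed matrix
rotated = points @ R_x90.T
```

The x axis stays fixed, the y axis is mapped onto z, and the z axis onto negative y.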
So I finally figured it out; I was just thinking about it wrong. It is not a gather operation, it is a scatter. This works perfectly now!
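import tensorflow as tf
from tensorflow.keras import backend as K

# p: point cloud of shape (256*256, 3), assumed to be int32; columns 1 and 2 are the target (row, col)
indices = K.stack([p[:, 1], p[:, 2]], -1)
indices = K.reshape(indices, (256, 256, 2))
indices = K.clip(indices, 0, 256 - 1)
updates = K.reshape(p[:, 0], (256, 256))
# scatter with max: when several points land on the same pixel, the largest value is kept
reProj = tf.tensor_scatter_nd_max(tf.zeros((256, 256), tf.int32), indices, updates)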
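For comparison, here is a small NumPy sketch of the same scatter-with-max behaviour. It assumes the (value, row, col) column layout used above and a made-up 4x4 grid instead of 256x256. Unlike plain fancy-index assignment, `np.maximum.at` resolves duplicate target pixels by keeping the largest value.

```python
import numpy as np

# Toy integer point cloud: columns are (value, row, col); two points hit pixel (1, 2)
point_cloud = np.array([[5, 1, 2],
                        [9, 1, 2],
                        [3, 0, 0]], dtype=np.int32)

size = 4  # small grid for illustration
rows = np.clip(point_cloud[:, 1], 0, size - 1)
cols = np.clip(point_cloud[:, 2], 0, size - 1)

re_proj = np.zeros((size, size), dtype=np.int32)
# unbuffered in-place maximum: duplicates are resolved by keeping the largest value
np.maximum.at(re_proj, (rows, cols), point_cloud[:, 0])
```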

How to understand masked multi-head attention in transformer

I'm currently studying Transformer code, but I cannot understand the masked multi-head attention of the decoder. The paper says that it is there to prevent the model from seeing the word being generated, but I cannot understand: if the words after the one being generated have not been generated yet, how can they be seen?
I tried to read the Transformer code. The code that implements the mask is shown below. It uses a lower triangular matrix as the mask, and I cannot understand why.
padding_num = -2 ** 32 + 1
diag_vals = tf.ones_like(inputs[0, :, :]) # (T_q, T_k)
tril = tf.linalg.LinearOperatorLowerTriangular(diag_vals).to_dense() # (T_q, T_k)
masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(inputs)[0], 1, 1]) # (N, T_q, T_k)
paddings = tf.ones_like(masks) * padding_num
outputs = tf.where(tf.equal(masks, 0), paddings, inputs)
I had the very same question after reading the Transformer paper. I found no complete and detailed answer on the Internet, so I'll try to explain my understanding of masked multi-head attention.
The short answer is: we need masking to make training parallel. Parallelization is good because it allows the model to train faster.
Here's an example explaining the idea. Let's say we train a model to translate "I love you" to German. The encoder works in parallel mode: it can produce the vector representation of the input sequence ("I love you") within a constant number of steps (i.e. the number of steps doesn't depend on the length of the input sequence).
Let's say the encoder produces the numbers 11, 12, 13 as the vector representations of the input sequence. In reality these vectors will be much longer, but for simplicity we use short ones. Also for simplicity we ignore the service tokens, such as the beginning-of-sequence and end-of-sequence markers.
During training we know that the translation should be "Ich liebe dich" (we always know the expected output during training). Let's say the expected vector representations of the words "Ich liebe dich" are 21, 22, 23.
If we make the decoder training in sequential mode, it'll look like the training of the Recurrent Neural Network. The following sequential steps will be performed:
Sequential operation #1. Input: 11, 12, 13.
Trying to predict 21.
The predicted output won't be exactly 21, let's say it'll be 21.1.
Sequential operation #2. Input: 11, 12, 13, and also 21.1 as the previous output.
Trying to predict 22.
The predicted output won't be exactly 22, let's say it'll be 22.3.
Sequential operation #3. Input: 11, 12, 13, and also 22.3 as the previous output.
Trying to predict 23.
The predicted output won't be exactly 23, let's say it'll be 23.5.
This means we need 3 sequential operations (in the general case, one sequential operation per input). Also, the error accumulates with each iteration, and we don't use attention, as we only look at the single previous output.
As we actually know the expected outputs we can adjust the process and make it parallel. There's no need to wait for the previous step output.
Parallel operation #A. Inputs: 11, 12, 13.
Trying to predict 21.
Parallel operation #B. Inputs: 11, 12, 13, and also 21.
Trying to predict 22.
Parallel operation #C. Inputs: 11, 12, 13, and also 21, 22.
Trying to predict 23.
This algorithm can be executed in parallel and also it doesn't accumulate the error. And this algorithm uses attention (i.e. looks to all previous inputs) thus has more information about the context to consider while making the prediction.
And here is where we need the masking. The training algorithm knows the entire expected output (21, 22, 23). It hides (masks) a part of this known output sequence for each of the parallel operations.
When it executes #A - it hides (masks) the entire output.
When it executes #B - it hides 2nd and 3rd outputs.
When it executes #C - it hides 3rd output.
Masking itself is implemented as follows (from the original paper):
We implement this inside of scaled dot-product attention by masking
out (setting to −∞) all values in the input of the softmax which
correspond to illegal connections
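To make the quoted sentence concrete, here is a minimal pure-Python sketch (not the paper's or the repository's implementation; the helper names and the all-zero toy scores are assumptions). It builds the lower-triangular mask from the question's code, sets the masked scores to minus infinity, and applies a row-wise softmax, so each position attends only to itself and earlier positions.

```python
import math

def causal_mask(n):
    """Lower-triangular mask: query position q may attend to key positions 0..q."""
    return [[1 if k <= q else 0 for k in range(n)] for q in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax with masked-out entries set to -inf beforehand."""
    out = []
    for row, mask_row in zip(scores, mask):
        masked = [s if m else -math.inf for s, m in zip(row, mask_row)]
        top = max(masked)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - top) for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Toy attention scores for 3 target positions; all equal to keep the result readable
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = masked_softmax(scores, causal_mask(3))
```

With equal scores, the first row puts all its weight on the first position, the second row splits it evenly over the first two, and the third row over all three. Masked positions receive exactly zero weight.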
Note: during inference (not training) the decoder works in sequential (not parallel) mode, as it doesn't know the output sequence initially. But this is different from the RNN approach, as Transformer inference still uses self-attention and looks at all previous outputs (not only the immediately preceding one).
Note 2: I've seen in some materials that masking can be used differently for non-translation applications. For example, for language modeling the masking can be used to hide some words from the input sentence and the model will try to predict them during the training using other, non-masked words (i.e. learn to understand the context).
The decoder is autoregressive and can't see future words.
The decoder in the Transformer is autoregressive,
which means it predicts the next token from the previous ones,
so the input x must not see future words.
We use masked multi-head attention to enforce this.

How to handle a varying 'batch size' in data_shapes when running a forward pass with MXNet?

Hi, I have a question: how can I run prediction on input data whose shape is not fixed? I will try to describe it in detail:
I use MTCNN for face detection (it's OK if you are unfamiliar with it). It employs 3 networks: PNet, RNet and ONet. PNet detects a large number of proposal face bounding boxes, which are then refined coarse-to-fine by the remaining networks, one after another, to finally get precise face bounding box(es). When an image is fed to PNet, the image size is not fixed, and the number of proposal boxes output by PNet is not fixed either; the same holds for RNet and ONet. Following another MTCNN implementation, I set large data_shapes (e.g., image size, batch size) when I bind the module, initialize everything to zero, and then run prediction. That works, but isn't it redundant computation? (Question 1)
sym, arg_params, aux_params = mx.model.load_checkpoint('det1', 0)
self.PNets = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
self.PNets.bind(data_shapes=[('data', (1, 3, max_img_w, max_img_h))], for_training=False)
sym, arg_params, aux_params = mx.model.load_checkpoint('det2', 0)
self.RNet = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
self.RNet.bind(data_shapes=[('data', (2048, 3, 24, 24))], for_training=False)
sym, arg_params, aux_params = mx.model.load_checkpoint('det3', 0)
self.ONet = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
self.ONet.bind(data_shapes=[('data', (256, 3, 48, 48))], for_training=False)
I also tried mx.mod.Module.reshape before predicting, which adjusts the data shape according to the previous network's output, but I get this error: (Question 2)
AssertionError: Shape of unspecified array arg:prob1_label changed. This can cause the new executor to not share parameters with the old one. Please check for error in the network. If this is intended, set partial_shaping=True to suppress this warning.
One more thing: the MTCNN code primarily uses a deprecated function to load models:
self.PNet = mx.model.FeedForward.load('det1', 0)
A single line that works with arbitrary data_shapes; why was this function deprecated? (Question 3)
I also noticed a small difference: after loading the model, FeedForward takes 0 MB of memory until the first prediction, whereas mx.mod.Module takes up memory as soon as it is loaded, and its usage increases noticeably after one prediction.
You can use MXNet's imperative API, Gluon, which lets you use different batch sizes.
If, as in this case, your model was trained using the symbolic API or has been exported in the serialized MXNet format (e.g. '-0001.params', '-symbol.json'), you can load it in Gluon this way:
import mxnet as mx
from mxnet import gluon

ctx = mx.cpu()
sym = mx.sym.load_json(open('det1-symbol.json', 'r').read())
PNet = gluon.nn.SymbolBlock(outputs=sym, inputs=mx.sym.var('data'))
PNet.load_params('det1-0001.params', ctx=ctx)
Then you can use it the following way:
# a given batch size (1)
data1 = mx.nd.ones((1, C, W, H))
output1 = PNet(data1)
# a different batch size (5)
data2 = mx.nd.ones((5, C, W, H))
output2 = PNet(data2)
And it would work.
You can get started with MXNet Gluon with the official 60 minutes crash course

Getting wrong parameter count for Google NASNet-A neural net

I’m trying to understand the NASNet-A architecture in detail, but can’t match the parameter counts in the paper.
For example, the paper says CIFAR-10 NASNet-A “6 @ 768” model has 3.3M params, but by my calculations a single “sep 5x5” primitive in the final cell should alone have 2.9M params… which can’t be right!
Here’s how I derive this count…
The “6 @ 768” notation means the “number of filters in the penultimate layer of the network” is 768, which I assume means the number of filters in each of the primitive operations in the cell is 768, and therefore the output depth of the concat operation (with 5 block inputs) is 5 * 768. Since shape is only changed by reduction cells, the input to the final cell (concat output from prior normal cell) will also be of depth 5 * 768.
So for a 5x5 separable convolution with 5 * 768 input channels and 768 output channels, the number of parameters is:
5x5x1 * (5 * 768) = 96,000 params for the 5x5 depthwise filters, plus
1x1x(5 * 768) x 768 = 2,949,120 params for the 1x1 pointwise filters
Where am I going wrong?!
The number of output channels from each operation in a cell's block is set by num_conv_filters. For example, for CIFAR NASNet-A it is 32, and it doubles after each Reduction Cell.
Although they mention they have B=5 blocks and no residual connection, it seems they have 6 concatenated chunks of filters; the last appears to come from the previous layer.
This is why for example you have 192 feature depth in the first cell:
You can take a look on the expected depths here:
So, for example, for the last 5x5 separable convolution you get:
5x5*768 + 768*128 = 117504 parameters
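The two counts can be checked with a small helper. This is only a sketch of the arithmetic used in this thread: `separable_conv_params` is a hypothetical name, and it counts a single depthwise plus pointwise pair, ignoring biases and batch normalisation.

```python
def separable_conv_params(kernel, in_channels, out_channels):
    """Weights of one depthwise-separable convolution (no biases, no batch norm)."""
    depthwise = kernel * kernel * in_channels  # one k x k filter per input channel
    pointwise = in_channels * out_channels     # 1 x 1 convolution mixing channels
    return depthwise + pointwise

# Question's assumption: 5 * 768 input channels and 768 output channels
question_count = separable_conv_params(5, 5 * 768, 768)

# Answer's reading: 768 input channels (6 chunks of 128) and 128 output channels
answer_count = separable_conv_params(5, 768, 128)

print(question_count, answer_count)
```

The gap between the two results comes almost entirely from the pointwise term, which scales with the product of input and output channels.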
For more info about the separable convolution:

Dynamic Tensor Alignment/Cropping

I implemented a Fully Convolutional Network in TensorFlow. It uses an encoder-decoder structure.
When training, I always use the same image size (224x224, using random crops) and everything works nicely.
In the inference phase, I want to predict one image at a time, because I want to use the full image (not cropped). For example, such an image has size [406,256]. And here is the problem.
In the encoder-decoder architecture I add two tensors (z = x + y). When training, the sizes of both tensors match. When predicting on my single image, the sizes do not match (tensor sizes: [1,47,47,64] vs [1,46,46,64]). I think it is caused by rounding in the Conv and Pool layers.
What should I change in my architecture so that it works for any image size? Should I change the rounding parameters, or add 'cropping' of the tensor?
Link to implementation of architecture:
(the problem occurs in line 166)
I found the solution for variable input size :)
What we really needed was a 'crop layer' that crops one tensor to match the other. I found a very similar layer here:
I adapted it into crop_and_add, and it works:
def crop_and_add(x1, x2):
    x1_shape = tf.shape(x1)
    x2_shape = tf.shape(x2)
    # offsets for the top left corner of the crop
    offsets = [0, (x1_shape[1] - x2_shape[1]) // 2, (x1_shape[2] - x2_shape[2]) // 2, 0]
    size = [-1, x2_shape[1], x2_shape[2], -1]
    x1_crop = tf.slice(x1, offsets, size)
    return x1_crop + x2
I replaced all additions in the model with the layer above (i.e. wherever encoder and decoder data are merged).
Also, the input to the model needs to be defined as:
image = tf.placeholder(tf.float32, shape=[1, None, None, 3], name="input_image")
So we know that we will pass a single image with 3 channels, but we know neither its width nor its height. And it works very nicely! (40 FPS on a K80 on an AWS P2 instance; image size is 224x{}, i.e. the shorter side of the image is 224.)
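The cropping arithmetic can be checked outside TensorFlow. The NumPy sketch below mirrors the centre crop in crop_and_add under the same NHWC layout; `crop_and_add_np` is a hypothetical name, and the arrays of ones are made-up data using the mismatching shapes from the question.

```python
import numpy as np

def crop_and_add_np(x1, x2):
    """Centre-crop x1 (N, H, W, C) to the spatial size of x2, then add the two."""
    off_h = (x1.shape[1] - x2.shape[1]) // 2
    off_w = (x1.shape[2] - x2.shape[2]) // 2
    x1_crop = x1[:, off_h:off_h + x2.shape[1], off_w:off_w + x2.shape[2], :]
    return x1_crop + x2

# The mismatching tensor shapes reported in the question
larger = np.ones((1, 47, 47, 64), dtype=np.float32)
smaller = np.ones((1, 46, 46, 64), dtype=np.float32)
merged = crop_and_add_np(larger, smaller)
```

The first argument must be the larger (or equal-sized) tensor, as in the TensorFlow version; with a one-pixel difference the integer division gives an offset of 0, so the crop simply drops the last row and column.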
FYI, I also tried to run ENet (2x faster than LinkNet), but in TensorFlow it is slower. I think this is because of PReLU (which is slow in TF). It also does not support arbitrary image sizes because of the UnPool layer, which needs a predefined output size given as a list of integers (not placeholders). So LinkNet looks better in terms of speed and performance in TF.