The input of K.function is the first layer of the triplet model and the output is the fifth layer of the triplet model. Does that mean the model is cut at that layer?
Does this work on the same principle as when TensorFlow itself is executed?
Also, the model passed to this function had three inputs during training, but as I understand it only one of the three is fed in here. Is it okay to use it like this?
I am trying to build a model that can extract human speech from a recording. To do this I have loaded 1500 noisy files (some of these files are exactly the same but with different speech-to-noise ratios: -1, 1, 3, 5, 7). I want my model to take in a wav file as a one-dimensional array/tensor along the horizontal axis, and output a one-dimensional array/tensor that I can then play.
Currently, this is how my data is set up.
This is how my model is set up.
The error I am having is that I am not able to make a prediction, and when I can, I get an array/tensor with only one element instead of one with 220500. The reason for 220500 is that it is the length of the background noise that was overlaid onto the clean speech, so every file is this length.
I have been messing around with layers.Input because I want my model to take in every row as one "object"/audio clip. I don't know if that is what's happening, because the only "successful" prediction is an error.
The model you built expects data in the format (batch_size, 1, 220500), because in the input layer you declared an input_shape of (1, 220500).
For the data you are using you should just use an input_shape of (220500,).
Another problem you might encounter is that you are using a single unit in the last layer. That way the output of the model will be (batch_size, 1), but you need (batch_size, 220500) as an output.
For this last problem I suggest using a generative recurrent neural network.
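The two shape points in this answer can be checked with a small NumPy stand-in for a dense layer. This is only a sketch: the `dense` helper is hypothetical (it is not the Keras layer), and a clip length of 8 is used in place of 220500 to keep the arrays small.

```python
import numpy as np

def dense(x, units, rng):
    # Hypothetical stand-in for a fully connected layer: acts on the last axis.
    w = rng.normal(size=(x.shape[-1], units))
    b = np.zeros(units)
    return x @ w + b

rng = np.random.default_rng(0)
clip_length = 8   # stands in for 220500
batch_size = 4

clips = rng.normal(size=(batch_size, clip_length))       # input_shape=(clip_length,)
clips_3d = clips[:, np.newaxis, :]                        # input_shape=(1, clip_length)

one_unit = dense(clips, 1, rng)               # single unit in the last layer
full_output = dense(clips, clip_length, rng)  # one unit per output sample
extra_axis = dense(clips_3d, 1, rng)          # the extra axis is carried through

print(one_unit.shape, full_output.shape, extra_axis.shape)
```

With a single unit the output has one value per clip, which is consistent with the one-element prediction described in the question.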
I've been looking at tfjs examples and trying to learn about seq2seq models. During the process, I've stumbled upon the date-conversion-attention example.
It's a great example, but what kind of attention mechanism is being used in it? There is no info in the README file. Can somebody point me to the paper that describes the attention that's being used here?
Link to attention part:
https://github.com/tensorflow/tfjs-examples/blob/908ee32750ba750a14d15caeb53115e2d3dda2b3/date-conversion-attention/model.js#L102-L119
I believe I found the answer.
The attention model used in the date-conversion-attention uses the dot product alignment score and it's described in Effective Approaches to Attention-based Neural Machine Translation. Link: https://arxiv.org/pdf/1508.04025.pdf
I have twisted my head around this sample for some hours now, and this is what I have concluded so far:
The encoder looks at the full input, one character embedding for each LSTM step. The decoder expects a time-shifted copy of the output as its input, starting with a special character. The outputs (target strings) are provided as-is to the decoder during training. During evaluation, one character is predicted at a time, and the prediction is passed back into the decoder for the next character.
The decoder does not see the input, but it receives the encoder's final step output as its initial state. This state initialisation tells the decoder how to produce its outputs, something like an encoded description of the date format to work on (I assume).
The LSTM outputs, one for each step (= character of input or output), from the encoder and decoder are then combined with a dot product and normalised with softmax. This normalised dot product is the attention matrix, basically a highlight of the activations from the encoder and the decoder. For the attention heatmap to light up for the given next character, the decoder must have output something that "matches" the encoder's outputs. The attention matrix is not learned weights or biases, it's just a product of the encoder's and decoder's outputs.
Finally, this attention matrix is combined by dot product with the full encoder output and concatenated with the decoder output, to allow the final dense layers to decode the attention mappings and "read" the right values from the encoder output.
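The two steps just described (dot-product scores normalised with softmax, then a context vector concatenated with the decoder output) can be sketched in NumPy. This is an illustration under assumptions, not the tfjs code: the function names, the random inputs, and the sizes (12 input characters, 10 output characters, 8 LSTM units) are made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_attention(encoder_out, decoder_out):
    # encoder_out: (batch, src_len, units); decoder_out: (batch, tgt_len, units)
    # Score every decoder step against every encoder step with a dot product.
    scores = np.einsum("btu,bsu->bts", decoder_out, encoder_out)
    attention = softmax(scores, axis=-1)          # (batch, tgt_len, src_len)
    # Weighted sum of encoder outputs for each decoder step.
    context = np.einsum("bts,bsu->btu", attention, encoder_out)
    # Concatenate context and decoder output for the final dense layers.
    combined = np.concatenate([context, decoder_out], axis=-1)
    return attention, combined

rng = np.random.default_rng(0)
encoder_out = rng.normal(size=(2, 12, 8))
decoder_out = rng.normal(size=(2, 10, 8))
attention, combined = dot_attention(encoder_out, decoder_out)
print(attention.shape, combined.shape)
```

Nothing in `dot_attention` has trainable parameters, which matches the observation that the attention matrix is computed from the two LSTM outputs rather than learned directly.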
In the prediction process, only the last character is read from the prediction. Possibly because the previous predictions might be unstable?
I read the excellent book Deep Learning with JavaScript: Neural Networks in TensorFlow.js. The book explains the examples one by one and adds lots of extra documentation. But I don't think it explains the general architecture of this sample very well, only the details.
Here is my understanding of a basic Sequence to Sequence LSTMs. Suppose we are tackling a question-answer setting.
You have two sets of LSTMs (green and blue below). Within each set the weights are shared (i.e. each of the 4 green cells has the same weights, and similarly for the blue cells). The first is a many-to-one LSTM, which summarises the question in the last hidden layer/cell memory.
The second set (blue) is a many-to-many LSTM, which has different weights from the first set of LSTMs. The input is simply the answer sentence while the output is the same sentence shifted by one.
The question is two fold:
1. Are we passing only the last hidden state to the blue LSTMs as the initial hidden state, or both the last hidden state and the cell memory?
2. Is there a way to set the initial hidden state and cell memory in Keras or TensorFlow? If so, is there a reference?
(image taken from suriyadeepan.github.io)
Are we passing only the last hidden state to the blue LSTMs as the initial hidden state, or both the last hidden state and the cell memory?
Both hidden state h and cell memory c are passed to the decoder.
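To make the state hand-off concrete, here is a minimal NumPy LSTM, written from the standard LSTM equations, in which the encoder's final (h, c) pair becomes the decoder's initial state. It is a sketch only: the helper names, the single fused weight matrix, and all sizes are assumptions for illustration, not the TensorFlow or Keras implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(input_dim, units, rng):
    # One fused weight matrix for the four gates (input, forget, candidate, output).
    return {
        "W": rng.normal(scale=0.1, size=(input_dim + units, 4 * units)),
        "b": np.zeros(4 * units),
    }

def lstm_step(params, x, h, c):
    z = np.concatenate([x, h], axis=-1) @ params["W"] + params["b"]
    i, f, g, o = np.split(z, 4, axis=-1)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # cell memory
    h_new = sigmoid(o) * np.tanh(c_new)                # hidden state
    return h_new, c_new

def run_lstm(params, inputs, h, c):
    # inputs: (batch, steps, input_dim); returns all outputs plus the final (h, c).
    outputs = []
    for t in range(inputs.shape[1]):
        h, c = lstm_step(params, inputs[:, t], h, c)
        outputs.append(h)
    return np.stack(outputs, axis=1), h, c

rng = np.random.default_rng(0)
batch, units = 2, 4
encoder = init_lstm(3, units, rng)   # "green" cells: one shared set of weights
decoder = init_lstm(5, units, rng)   # "blue" cells: a different set of weights
question = rng.normal(size=(batch, 6, 3))
answer_in = rng.normal(size=(batch, 7, 5))

zeros = np.zeros((batch, units))
_, enc_h, enc_c = run_lstm(encoder, question, zeros, zeros)
# Both h and c from the encoder initialise the decoder.
dec_out, _, _ = run_lstm(decoder, answer_in, enc_h, enc_c)
# For comparison: the same decoder started from an all-zero state.
dec_out_zero, _, _ = run_lstm(decoder, answer_in, zeros, zeros)
print(dec_out.shape, np.allclose(dec_out, dec_out_zero))
```

The decoder outputs differ between the two runs, so the encoder's state does reach the decoder even though the decoder never sees the question itself.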
TensorFlow
In seq2seq source code, you can find the following code in basic_rnn_seq2seq():
_, enc_state = rnn.static_rnn(enc_cell, encoder_inputs, dtype=dtype)
return rnn_decoder(decoder_inputs, enc_state, cell)
If you use an LSTMCell, the returned enc_state from the encoder will be a tuple (c, h). As you can see, the tuple is passed directly to the decoder.
Keras
In Keras, the "state" defined for an LSTMCell is also a tuple (h, c) (note that the order is different from TF). In LSTMCell.call(), you can find:
h_tm1 = states[0]
c_tm1 = states[1]
To get the states returned from an LSTM layer, you can specify return_state=True. The returned value is a tuple (o, h, c). The tensor o is the output of this layer, which will be equal to h unless you specify return_sequences=True.
Is there a way to set the initial hidden state and cell memory in Keras or TensorFlow? If so, is there a reference?
TensorFlow
Just provide the initial state to an LSTMCell when calling it. For example, in the official RNN tutorial:
lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
...
output, state = lstm(current_batch_of_words, state)
There's also an initial_state argument for functions such as tf.nn.static_rnn. If you use the seq2seq module, provide the states to rnn_decoder as shown in the code for question 1.
Keras
Use the keyword argument initial_state in the LSTM function call.
out = LSTM(32)(input_tensor, initial_state=(h, c))
You can actually find this usage in the official documentation:
Note on specifying the initial state of RNNs
You can specify the initial state of RNN layers symbolically by
calling them with the keyword argument initial_state. The value of
initial_state should be a tensor or list of tensors representing the
initial state of the RNN layer.
EDIT:
There's now an example script in Keras (lstm_seq2seq.py) showing how to implement basic seq2seq in Keras. How to make prediction after training a seq2seq model is also covered in this script.
(Edit: this answer is incomplete and hasn't considered the actual possibilities of state transferring. See the accepted answer.)
From a Keras point of view, that picture has only two layers.
The green group is one LSTM layer.
The blue group is another LSTM layer.
There isn't any communication between green and blue other than passing the outputs. So, the answer for 1 is:
Only the thought vector (which is the actual output of the layer) is passed to the other layer.
Memory and state (not sure if these are two different entities) are totally contained inside a single layer and are not initially intended to be seen or shared with any other layer.
Each individual block in that image is totally invisible in keras. They are considered "time steps", something that only appears in the shape of the input data. It's rarely important to worry about them (unless for very advanced usages).
In keras, it's like this:
Easily, you have access only to the external arrows (including "thought vector").
But having access to each step (each individual green block in your picture) is not an exposed thing. So...
Passing the states from one layer to the other is also not expected in Keras. You will probably have to hack things. (See this: https://github.com/fchollet/keras/issues/2995)
But considering a thought vector big enough, you could say it will learn a way to carry what is important in itself.
The only notion you have from the steps is:
You have to input things shaped like (sentences, length, wordIdFeatures)
The steps will be performed considering that each slice in the length dimension is an input to each green block.
You may choose to have a single output (sentences, cells), for which you completely lose track of steps. Or...
Outputs like (sentences, length, cells), from which you know the output of each block through the length dimension.
One to many or many to many?
Now, the first layer is many to one (but nothing prevents it from being many to many too if you want).
But the second... that's complicated.
If the thought vector was made by a many to one, you will have to find a way of creating a one to many. (That's not trivial in Keras, but you could think of repeating the thought vector for the expected length, making it the input to all steps. Or maybe fill an entire sequence with zeros or ones, keeping only the first element as the thought vector.)
If the thought vector was made by a many to many, you can take advantage of this and keep an easy many to many, if you're willing to accept that the output has exactly the same number of steps as the input.
Keras doesn't have a ready solution for 1 to many cases. (From a single input predict a whole sequence).
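The "repeat the thought vector for the expected length" idea mentioned above can be sketched in NumPy. The helper name and sizes are made up for illustration; Keras' RepeatVector layer performs the same repetition on a 2D input.

```python
import numpy as np

def repeat_thought_vector(thought, length):
    # thought: (sentences, cells) -> (sentences, length, cells)
    # Every time step of the second layer receives the same thought vector.
    return np.repeat(thought[:, np.newaxis, :], length, axis=1)

thought = np.arange(6, dtype=float).reshape(2, 3)   # 2 sentences, 3 cells
decoder_input = repeat_thought_vector(thought, 4)   # expected output length 4
print(decoder_input.shape)
```

The result has the (sentences, length, cells) layout that a second recurrent layer expects as input.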
I went through this tutorial. In the last block it says that the dynamic_rnn function cannot be used to calculate attention. What I don't understand is this: all we need to compute attention is the hidden state of the decoder, which is then worked out against the encoder symbols.
An attention mechanism in the context of an encoder-decoder means that the decoder at each time step "attends" to the "useful" parts of the encoder. This is implemented, for example, by taking a weighted average of the encoder's outputs and feeding that value (called the context) into the decoder at a given time step.
dynamic_rnn computes outputs of LSTM cells across all time steps and gives you the final value. So, there is no way to tell the model that the cell state at time step t should depend not only on the output of the previous cell and input, but also on additional information such as context. You can control computation at each time step of encoder or decoder LSTM using raw_rnn.
If I understand correctly, in this tutorial the author feeds ground truth input as input to the decoder at each time step. However, this is not the usual way it is done. Usually, you want to feed the output of decoder at time t as input to decoder at time t+1. In short, the input to the decoder at each time step is variable, whereas in dynamic_rnn it is predefined.
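The difference between predefined decoder inputs and inputs that depend on the previous output can be sketched in plain Python. Everything here is hypothetical: `toy_step` stands in for one decoder step and simply adds its state to the incoming token.

```python
def decode_with_ground_truth(step, state, targets, start):
    # All inputs are known up front: the start token plus the shifted targets.
    inputs = [start] + targets[:-1]
    outputs = []
    for token in inputs:
        out, state = step(token, state)
        outputs.append(out)
    return outputs

def decode_feeding_back(step, state, start, length):
    # The input at time t+1 is the output at time t, so the inputs are
    # only known while the loop is running.
    token, outputs = start, []
    for _ in range(length):
        token, state = step(token, state)
        outputs.append(token)
    return outputs

def toy_step(token, state):
    # Hypothetical stand-in for one decoder step.
    return token + state, state

with_truth = decode_with_ground_truth(toy_step, 1, [5, 6, 7], start=0)
fed_back = decode_feeding_back(toy_step, 1, start=0, length=3)
print(with_truth, fed_back)
```

The first loop could be replaced by a single call over a fixed input sequence; the second needs control over each individual step, which is the situation raw_rnn is meant for.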
For more technical details, refer to: https://arxiv.org/abs/1409.0473
I want to train a convolutional neural network with TensorFlow to do multi-output multi-class classification.
For example: take the MNIST sample set, always combine two random images into a single one, and then classify the resulting image. The result of the classification should be the two digits shown in the image.
So the output of the network could have the shape [-1, 2, 10] where the first dimension is the batch, the second represents the output (is it the first or the second digit) and the third is the "usual" classification of the shown digit.
I have tried googling for this for a while now, but wasn't able to find anything useful. Also, I don't know if multi-output multi-class classification is the correct name for this task. If not, what is the correct name? Do you have any links/tutorials/documentation/papers explaining what I'd need to do to build the loss function/training operations?
What I tried was to split the output of the network into the single outputs with tf.split and then use softmax_cross_entropy_with_logits on every single output. I averaged the result over all outputs, but it doesn't seem to work. Is this even a reasonable way?
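For reference, the per-digit softmax approach described in the question can be sketched in NumPy on logits of shape [-1, 2, 10]. This is an illustration only: the function names are made up, the logits are all zeros, and the labels are integer digits rather than one-hot vectors.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Subtract the max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def two_digit_loss(logits, labels):
    # logits: (batch, 2, 10); labels: (batch, 2) integer digits.
    # A separate softmax is applied over the 10 classes of each digit slot.
    log_probs = log_softmax(logits, axis=-1)
    picked = np.take_along_axis(log_probs, labels[..., np.newaxis], axis=-1)
    # Average the cross-entropy over the batch and over both digit slots.
    return -picked.mean()

logits = np.zeros((3, 2, 10))                    # uniform predictions
labels = np.array([[2, 4], [0, 9], [7, 7]])
loss = two_digit_loss(logits, labels)
print(loss)
```

With uniform predictions the loss equals log(10), which is a convenient sanity check before training.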
For nomenclature of classification problems, you can have a look at this link:
http://scikit-learn.org/stable/modules/multiclass.html
So your problem is called "Multilabel Classification". In normal TensorFlow multiclass classification (classic MNIST) you will have 10 output units and you will use softmax at the end for computing losses i.e. "tf.nn.softmax_cross_entropy_with_logits".
Ex: If your image has "2", then groundtruth will be [0,0,1,0,0,0,0,0,0,0]
But here, your network output will have 20 units and you will use sigmoid i.e. "tf.nn.sigmoid_cross_entropy_with_logits"
Ex: If your image has "2" & "4", then groundtruth will be [0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0], i.e. the first ten bits represent the first digit's class and the next ten represent the second digit's class.
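The 20-unit target and the sigmoid loss from this answer can be sketched in NumPy. The helper names are made up; the loss uses the numerically stable form max(x, 0) - x * z + log(1 + exp(-|x|)) for logits x and labels z, and the all-zero logits are only a placeholder for real network output.

```python
import numpy as np

def two_digit_target(first, second, num_classes=10):
    # First ten positions encode the first digit, the next ten the second digit.
    target = np.zeros(2 * num_classes)
    target[first] = 1.0
    target[num_classes + second] = 1.0
    return target

def sigmoid_cross_entropy(logits, labels):
    # Element-wise, numerically stable sigmoid cross-entropy on raw logits.
    return (np.maximum(logits, 0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))

target = two_digit_target(2, 4)
logits = np.zeros(20)                       # placeholder network output
loss = sigmoid_cross_entropy(logits, target).mean()
print(target.astype(int).tolist(), loss)
```

Each of the 20 units is scored independently here, so nothing forces exactly one active unit per group of ten.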
First you have to provide two labels for an image composed of two different images. Then change your objective loss function so that it maximizes the outputs of the two given labels, and train your model. I don't think you need to split the outputs.