I'm quite new to TensorFlow and machine learning, so sorry if the question isn't asked accurately or doesn't make sense somewhere. I have recently been reading about and trying to understand the transformer model, given its reputation in NLP, and thankfully the TensorFlow website has detailed code and an explanation.
https://www.tensorflow.org/text/tutorials/transformer#training_and_checkpointing
I have no problem understanding the code: the attention layer, positional encoding, encoder, decoder, masking etc.
When training the model, the inputs are the sentence to be translated and the corresponding sentence in the target language, where the target sentence is shifted and masked.
My problem is with evaluation: when the trained model is used to translate an unseen sentence into the target language, the target input would be an empty token. How would this empty tensor interact with the trained model inside the attention layer, given that it's empty? And in the first place, what would be the effect of neglecting it?
To be more precise, please look at the screenshot below:
tar_inp is fed into the transformer, and the loss is computed between the prediction and tar_real. But when evaluating the model, what does an empty tar_inp do in the layer? Thank you very much, and sorry if it's a dumb question; could you please provide some intuition for understanding this?
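For anyone with the same confusion, here is a minimal greedy-decoding sketch of what happens at evaluation time: the decoder input is never truly empty; it starts with just the [START] token and grows by one predicted token per step. The names `transformer`, `start_token` and `end_token` are my assumptions, and the exact call signature depends on the tutorial version.

```python
import tensorflow as tf

def greedy_translate(transformer, encoder_input, start_token, end_token, max_length=40):
    # The decoder input begins with only the [START] token -- never an empty tensor.
    output = tf.constant([[start_token]], dtype=tf.int64)

    for _ in range(max_length):
        # The model returns logits for every position of the current output sequence.
        predictions = transformer((encoder_input, output), training=False)

        # Keep only the last position: the next-token prediction.
        next_token = tf.argmax(predictions[:, -1:, :], axis=-1)

        # Append the prediction and feed the grown sequence back in on the next iteration.
        output = tf.concat([output, next_token], axis=-1)

        if next_token[0, 0] == end_token:
            break

    return output
```

So during training the whole shifted target sentence is fed in at once (with the look-ahead mask hiding future tokens), while during evaluation the same model is simply called repeatedly on its own partial output.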
Related
I would like to create a custom hierarchical, many-to-one RNN architecture in TensorFlow. The idea is to process the posts of each user on a discussion forum with an inner RNN and hand over the last hidden state for each post to an outer RNN that encodes the order of the posts, ending up with a latent representation that encodes both content and order and is used to predict a specified binary target. I would like to implement this architecture, but I cannot wrap my head around how to express the nesting in TensorFlow. Can you please give me some advice? Many thanks. Please see the attached diagram (https://i.stack.imgur.com/lUqqX.png).
I tried reading the TensorFlow documentation on nested inputs to RNNs and on composing layers from layers, but I think it does not describe what I want to do... or it simply does not click for me.
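In case it helps anyone facing the same problem, below is a minimal Keras sketch of one way to express the nesting with TimeDistributed: an inner GRU summarizes each post, and an outer GRU consumes the sequence of post summaries in order. All sizes (MAX_POSTS, MAX_TOKENS, vocabulary, units) are made-up assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

MAX_POSTS, MAX_TOKENS = 50, 200            # posts per user, tokens per post (padded)
VOCAB_SIZE, EMBED_DIM = 20000, 128
INNER_UNITS, OUTER_UNITS = 64, 64

# Token ids with shape (batch, posts, tokens).
inputs = layers.Input(shape=(MAX_POSTS, MAX_TOKENS), dtype="int32")

# Embedding handles the extra "posts" axis: output is (batch, posts, tokens, embed_dim).
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)

# Inner RNN, applied to each post independently; returns the last hidden
# state per post -> (batch, posts, inner_units).
x = layers.TimeDistributed(layers.GRU(INNER_UNITS))(x)

# Outer RNN encodes the order of the posts -> (batch, outer_units).
x = layers.GRU(OUTER_UNITS)(x)

# Binary target.
outputs = layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```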
In the guide for Quantization Aware Training, I noticed that RNN and LSTM were listed on the roadmap for "future support". Does anyone know whether they are supported now?
Is Post-Training Quantization also an option for quantizing RNNs and LSTMs? I don't see much information or discussion about it, so I wonder whether it is possible now or still in development.
Thank you.
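For the post-training side, dynamic-range quantization of a Keras LSTM model can at least be attempted with the standard TFLiteConverter flow; whether full-integer quantization of the fused LSTM ops works depends on your TF/TFLite version, so the sketch below is something to try rather than a guarantee. The model and shapes are placeholders of my own.

```python
import tensorflow as tf

# Placeholder model: 100 frames of 40 features -> 10 classes.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100, 40)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(10, activation="softmax"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # dynamic-range quantization

# For full-integer quantization you would also need a representative dataset,
# e.g. (the generator below is a placeholder):
# def representative_data_gen():
#     for _ in range(100):
#         yield [tf.random.normal([1, 100, 40])]
# converter.representative_dataset = representative_data_gen
# converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

tflite_model = converter.convert()
with open("lstm_quant.tflite", "wb") as f:
    f.write(tflite_model)
```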
I am currently trying to implement a speech enhancement model in 8-bit integer arithmetic, based on DTLN (https://github.com/breizhn/DTLN). However, when I run inference with the quantized model on no audio (an empty/all-zero array), it adds a weird waveform on top of the result: a constant signal at every 125 Hz. I have checked other places in the code and there is no problem there; it boils down to the quantization process with the RNN/LSTM.
I'm using the MXNet implementation of the TFT (Temporal Fusion Transformer) model, and I want to get the feature importance for every timestep from the trained model. Unfortunately, there is no implemented function that would satisfy my need. According to the original TFT paper, there is a way to get the feature importance by reading the weights off the variable selection network. However, its softmax gives back an embedded, 3-dimensional matrix. I'm stuck with this problem due to the lack of documentation about TFT/MXNet.
Any help is highly appreciated.
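One possible workaround, assuming the softmax output of the variable selection network has shape (batch, time, num_features) as described in the TFT paper: average the weights over the batch to get per-timestep importances, and additionally over time for a single score per feature. A small NumPy sketch (function and argument names are mine, and the axis order of the MXNet implementation should be verified first):

```python
import numpy as np

def feature_importance(selection_weights, feature_names):
    """selection_weights: array of shape (batch, time, num_features)."""
    weights = np.asarray(selection_weights)
    per_timestep = weights.mean(axis=0)      # (time, num_features)
    overall = per_timestep.mean(axis=0)      # (num_features,)
    return dict(zip(feature_names, overall)), per_timestep
```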
I want to ask how we can effectively re-train a trained seq2seq model to remove or mitigate a specific observed error in its output. I'm going to give an example from Speech Synthesis, but any idea from other domains that use seq2seq models, such as Machine Translation and Speech Recognition, will be appreciated.
I have learned the basics of the seq2seq-with-attention model, especially for Speech Synthesis systems such as Tacotron-2.
Using a distributed, well-trained model showed me how naturally our computer can speak with a seq2seq (end-to-end) model (you can listen to some audio samples here). But the model still fails to read some words properly; e.g., it misreads "obey [əˈbā]" in multiple ways, like [əˈbī] and [əˈbē].
The reason is obvious: the word "obey" appears too rarely, only three times out of 225,715 words, in our dataset (LJ Speech), and the model had no luck with it.
So, how can we re-train the model to overcome the error? Adding extra audio clips containing the "obey" pronunciation sounds impractical, while reusing the three existing clips carries the danger of overfitting. Also, since we start from a well-trained model, I suppose "simply training more" is not an effective solution.
Now, this is one of the drawbacks of the seq2seq model that is not talked about much. The model successfully simplified the pipelines of traditional systems; for Speech Synthesis, for example, it replaced the acoustic model, text analysis frontend, etc. with a single neural network. But we lost controllability over the model entirely: it's impossible to make the system read in a specific way.
Again, if you use a seq2seq model in any field and get an undesirable output, how do you fix it? Is there a data-scientific workaround to this problem, or maybe a cutting-edge neural network mechanism for gaining more controllability in seq2seq models?
Thanks.
I found an answer to my own question in Section 3.2 of the Deep Voice 3 paper.
In short, they trained both a phoneme-based model and a character-based model, using phoneme inputs primarily and falling back to the character-based model for words that cannot be converted to their phoneme representations.
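A toy sketch of that mixed representation, where words with a known pronunciation are replaced by phonemes and everything else falls back to characters; PRONUNCIATIONS stands in for a real lexicon such as CMUdict and is only an illustrative placeholder here:

```python
# Placeholder lexicon; a real system would load CMUdict or a similar dictionary.
PRONUNCIATIONS = {
    "obey": ["OW0", "B", "EY1"],
}

def to_mixed_representation(text):
    symbols = []
    for word in text.lower().split():
        if word in PRONUNCIATIONS:
            symbols.extend(PRONUNCIATIONS[word])   # phoneme tokens
        else:
            symbols.extend(list(word))             # character fallback
        symbols.append(" ")                        # keep word boundaries
    return symbols

print(to_mixed_representation("please obey the rules"))
```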
In TensorFlow 1.0, the seq2seq API was largely changed and is no longer compatible with previous seq2seq examples. In particular, I find attention decoders quite a bit more challenging to build: the old attention_decoder function has been removed; instead, the new API expects the user to provide dynamic_rnn_decoder with a couple of different attention functions for training and prediction, which in turn rely on a prepare_attention function.
Has anybody got an example of how to build an attention decoder, providing only the inputs and the final encoder state?
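Not a complete answer, but here is a rough sketch of the training path as I remember the TF 1.0/1.1 contrib API (prepare_attention, attention_decoder_fn_train, dynamic_rnn_decoder). The exact argument names and return values may differ in your minor version, so check the docstrings before relying on it; the placeholder tensors stand in for your own encoder/decoder pipeline.

```python
import tensorflow as tf
from tensorflow.contrib import rnn, seq2seq

num_units, embed_dim = 256, 128

# Placeholders standing in for the encoder outputs/state and decoder inputs.
encoder_outputs = tf.placeholder(tf.float32, [None, None, num_units])          # (batch, src_len, units)
encoder_state = tf.placeholder(tf.float32, [None, num_units])                  # final encoder state
decoder_inputs_embedded = tf.placeholder(tf.float32, [None, None, embed_dim])  # (batch, tgt_len, embed)
decoder_lengths = tf.placeholder(tf.int32, [None])

cell = rnn.GRUCell(num_units)

# prepare_attention builds the keys/values and the attention functions.
(attention_keys, attention_values,
 attention_score_fn, attention_construct_fn) = seq2seq.prepare_attention(
    attention_states=encoder_outputs,
    attention_option="bahdanau",
    num_units=num_units)

# Training-time decoder function: conditions on the ground-truth inputs.
decoder_fn_train = seq2seq.attention_decoder_fn_train(
    encoder_state, attention_keys, attention_values,
    attention_score_fn, attention_construct_fn)

decoder_outputs, decoder_state, _ = seq2seq.dynamic_rnn_decoder(
    cell=cell,
    decoder_fn=decoder_fn_train,
    inputs=decoder_inputs_embedded,
    sequence_length=decoder_lengths)
```

For prediction you would build a second decoder_fn with attention_decoder_fn_inference (passing the output projection, embeddings and start/end ids) and call dynamic_rnn_decoder again with inputs=None.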
This seq2seq with attention model can translate number pronunciations into digits, e.g. "one hundred and twenty seven" -> "127". See if it helps.
https://acv.aixon.co/attention_word_to_number.html