what are the inputs to a wavenet? - text-to-speech

I am trying to implement TTS. I have just read about wavenet, but, I am confused on local conditioning. The original paper here, explains to add a time series for local conditioning, this article explains that adding mel spectrogram features for local conditioning is fine. As we know that Wavenet is a generative model and takes raw audio inputs to generate high audio output when conditioned,
my question is that the said mel spectrogram features are of that raw audio passed as in the input or of some other audio.
Secondly, for implementing a TTS the audio input will be generated by some other TTS system whose output quality will be improved by wavenet, am I correct to think this way??
Please help, it is direly needed.
Thanks

Mel features are created by actual TTS module from the text (tacotron2 for example), than you run vocoder module (Wavenet) to create speech.
It is better to try existing implementation like Nvidia/tacotron2 + nvidia/waveglow. Waveglow is better than wavenet between, much faster. Wavenet is very slow.

Related

Can .tflite capture tf.hub.text_embedding_column() processes?

Just a general question here, no reproducible example but thought this might be the right place anyway since its very software specific.
I am building a model which I want to convert to .tflite. It relies on tf.hub.text_embedding_collumn() for feature generation. When I convert to .tflite will this be captured such that the resulting model will take raw text as input rather than a sparse vector representation?
Would be good to know just generally before I invest too much time in this approach. Thanks in advance!
Currently I don't imagine this would work, as we do not support enough string ops to implement that. One approach would be to do this handling through a custom op, but implementing this custom op would require domain knowledge and mitigate the ease-of-use advance of using tf hub in the first place.
There is some interest in defining a set of hub operators that are verified to work well with tflite, but this is not yet ready.

What Tensorflow API to use for Seq2Seq

This year Google produced 5 different packages for seq2seq:
seq2seq (claimed to be general purpose but
inactive)
nmt (active but supposed to be just
about NMT probably)
legacy_seq2seq
(clearly legacy)
contrib/seq2seq
(not complete probably)
tensor2tensor (similar purpose, also
active development)
Which package is actually worth to use for the implementation? It seems they are all different approaches but none of them stable enough.
I've had too a headache about some issue, which framework to choose? I want to implement OCR using Encoder-Decoder with attention. I've been trying to implement it using legacy_seq2seq (it was main library that time), but it was hard to understand all that process, for sure it should not be used any more.
https://github.com/google/seq2seq: for me it looks like trying to making a command line training script with not writing own code. If you want to learn Translation model, this should work but in other case it may not (like for my OCR), because there is not enough of documentation and too little number of users
https://github.com/tensorflow/tensor2tensor: this is very similar to above implementation but it is maintained and you can add more of own code for ex. reading own dataset. The basic usage is again Translation. But it also enable such task like Image Caption, which is nice. So if you want to try ready to use library and your problem is txt->txt or image->txt then you could try this. It should also work for OCR. I'm just not sure it there is enough documentation for each case (like using CNN at feature extractor)
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/seq2seq: apart from above, this is just pure library, which can be useful when you want to create a seq2seq by yourself using TF. It have a function to add Attention, Sequence Loss etc. In my case I chose that option as then I have much more freedom of choosing the each step of framework. I can choose CNN architecture, RNN cell type, Bi or Uni RNN, type of decoder etc. But then you will need to spend some time to get familiar with all the idea behind it.
https://github.com/tensorflow/nmt : another translation framework, based on tf.contrib.seq2seq library
From my perspective you have two option:
If you want to check the idea very fast and be sure that you are using very efficient code, use tensor2tensor library. It should help you to get early results or even very good final model.
If you want to make a research, not being sure how exactly the pipeline should look like or want to learn about idea of seq2seq, use library from tf.contrib.seq2seq.

IPA (International Phonetic Alphabet) Transcription with Tensorflow

I'm looking into designing a software platform that will aid linguists and anthropologists in their study of previously unstudied languages. Statistics show that around 1,000 languages exist that have never been studied by a person outside of their respective speaker groups.
My goal is to utilize TensorFlow to make a platform that will allow linguists to study and document these languages more efficiently, and to help them create written systems for the ones that don't have a written system already. One of their current methods of accomplishing such a task is three-fold: 1) Record a native speaker conversing in the language, 2) Listening to that recording and trying to transcribe it into the IPA, 3) From the phonetics, analyzing the phonemics and phonotactics of the language to eventually create a written system for the speaker.
My proposed platform would cut that research time down from a minimum of a year to a maximum of six months. Before I start, I have some questions...
What would be required to train TensorFlow to transcribe live audio into the IPA? Has this already been done? and if so, how would I utilize a previous solution for this project? Is a project like this even possible with TensorFlow? if not, what would you recommend using instead?
My apologies for the magnitude of this question. I don't have much experience in the realm of machine learning, as I am just beginning the research process for this project. Any help is appreciated!
I guess I will take a first shot at answering this. Since the question is pretty general, my answer will have to be pretty general as well.
What would be required. At the very least you would have to have a large dataset of pre-transcribed data. Ideally a large amount of spoken language audio mapped to characters in the phonetic alphabet, so the system could learn the sound of individual characters rather than whole transcribed words. If such a dataset doesn't exist, a less granular dataset could be used, mapping single words to their transcriptions. Then you would need a model, that is the actual neural network architecture implemented in code. And lastly you would need some computing resources. This is not something you can train casually, you would either have to buy some time in a cloud based machine learning framework (like Google Cloud ML) or build a fairly expensive machine to train at home.
Has this been done? I don't know. I don't think so. There have been published papers reporting various degrees of success at training systems to transcribe speech. Here is one, for example, http://deeplearning.stanford.edu/lexfree/lexfree.pdf It seems that since the alphabet you want to transcribe to is specifically designed to capture the way words sound rather than just write down the words you might have more success at training such a model.
Is it possible with TensorFlow. Yes, most likely. TensorFlow is well suited for implementing most modern deep learning architectures. Unless you end up designing some really weird and very original model for this purpose, TensorFlow should work just fine.
Edit: after some thought in part 1, you would have to use a dataset mapping spoken words to their transcriptions, since I expect that the same sound pronounced separately would be different from when the same sound is used in a word.
This has actually been done, albeit in PyTorch, by a group at CMU: https://github.com/xinjli/allosaurus

Parsing HEVC for Motion Information

I parsed the HEVC stream by simply identifying sart code (000001 or 00000001), and now I am looking for the motion information in the NAL payload. My goal is to calculate the percentage of the motion information in the stream. Any ideas?
Your best bet is to start with the HM reference software (get it here: https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/trunk/) and add some debug info as the different kinds of data is read from the bitstream. This is likely much easier than writing bitstream decoder from scratch.
Check out the debug that is built into the software already, for example RExt__DECODER_DEBUG_BIT_STATISTICS or DEBUG_CABAC_BINS. This may do what you want already, if not it will be pretty close. I think information about bit usage can be best collected in source/Lib/TLibDecoder/TDecBinCoderCABAC.cpp during decode.
If you need to speed this up, you can of course skip the actual decode steps :)
At the decoder side, You can find the motion vector information as MVD, so you should using pixel decoding process to get the motion information. it need you to understand the process of the inter prediction at HEVC.
than you!

NAudio - Create software beat machine / sampler - General Strategy

Im new to NAudio but the goal of my project is to provide the user with the ability for the user to listen to an MP3 and then select a sample or a "chunk" of that song as a sample which could be saved to disk. These samples would be able to replayed at the same time (i.e. not merged but played at the same time).
Could someone please let me know what the overall strategy required to achieve this (....not necessarily the specifics...almost like pseduo code....).
For example would the samples / chunks of a song need to be saved as a WAV file. And these samples could be played together in the WAV format, etc.
I have seen a few small examples of a few implementations of some of the ideas Ive mentioned above but dont have a good sense of the big picture just yet.
Thanks in advance,
Andrew
The chunks wouldn't need to be saved as WAV files unless you were keeping them for future use. You can store the PCM audio (Mp3FileReader automatically converts to PCM) in a byte array and use RawSourceWaveStream to play them.
As for mixing them, I'd recommend using the MixingSampleProvider. This does mean you need to convert your RawSourceWaveStream to IEEE float, but you can use Pcm16BitToSampleProvider to do this. This will give the advantage that you can adjust volumes (and do other DSP) easily on the samples you are mixing. MixingSampleProvider also auto-removes completed inputs, so you can just add new inputs whenever you want to trigger a sound.