Poor quality when using a multilingual vocoder for Persian with Mozilla TTS - text-to-speech

I trained Tacotron 2 on a Persian dataset (around 33 hours). With Griffin-Lim the sound quality is good, but with the pretrained Coqui vocoders, such as fullband-melgan and wavegrad, the sound quality is poor.
Attached are my config file and two sound files generated with two different vocoders. The command below generates the sound file with the fullband-melgan vocoder.
tts --text "سلام حال شما خوب است؟" --model_path "best_model.pth" --config_path "config.json" --vocoder_name "vocoder_models/universal/libri-tts/fullband-melgan" --out_path "example.wav"
config.zip
sound_file_by_full_band_melgan.zip
sound_file_by_grffin_lim.zip
I would appreciate any help.

Related

Convert a .npy file to wav following tacotron2 training

I am training the Tacotron2 model using TensorflowTTS for a new language.
I managed to train the model (performed pre-processing, normalization, and decoded the few generated output files).
The files in the output directory are .npy files, which makes sense as they are mel-spectrograms.
I am trying to find a way to convert these files to .wav files in order to check whether my work has been fruitful.
I used this:
melspectrogram = librosa.feature.melspectrogram(
"/content/prediction/tacotron2-0/paol_wavpaol_8-norm-feats.npy", sr=22050,
window=scipy.signal.hanning, n_fft=1024, hop_length=256)
print('melspectrogram.shape', melspectrogram.shape)
print(melspectrogram)
audio_signal = librosa.feature.inverse.mel_to_audio(
melspectrogram, sr22050, n_fft=1024, hop_length=256, window=scipy.signal.hanning)
print(audio_signal, audio_signal.shape)
sf.write('test.wav', audio_signal, sample_rate)
But it gives me this error: Audio data must be of type numpy.ndarray.
Although I am already giving it a numpy.ndarray file.
Does anyone know where the issue might be, or know a better way to do it?
I'm not sure what your error is, but the output of a Tacotron 2 system is a sequence of log Mel spectral features. You can't just apply the inverse Fourier transform to get a waveform, because the phase information is missing and the features are not invertible. You can learn why at places like Speech.Zone (https://speech.zone/courses/).
Instead of using librosa as you are doing, you need a vocoder such as HiFi-GAN (https://github.com/jik876/hifi-gan) that is trained to reconstruct a waveform from log Mel spectral features. You can use a pre-trained model, and most off-the-shelf vocoders will work, but make sure that the sample rate, Mel range, FFT size, hop size and window size all match between your Tacotron2 feature prediction network and whichever vocoder you choose; otherwise you'll just get noise!
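As a rough illustration of the compatibility check described above, the sketch below compares the audio settings of an acoustic model with those of a vocoder and reports every field that differs. The key names (`sample_rate`, `num_mels`, `fft_size`, `hop_length`, `win_length`, `mel_fmin`, `mel_fmax`) and all example values are assumptions made for illustration; real config files may name or nest these fields differently.

```python
# Hypothetical key names; adapt them to the fields used in your own config files.
AUDIO_KEYS = (
    "sample_rate", "num_mels", "fft_size", "hop_length",
    "win_length", "mel_fmin", "mel_fmax",
)


def find_audio_mismatches(acoustic_cfg, vocoder_cfg, keys=AUDIO_KEYS):
    """Return {key: (acoustic_value, vocoder_value)} for every setting that differs."""
    mismatches = {}
    for key in keys:
        acoustic_value = acoustic_cfg.get(key)
        vocoder_value = vocoder_cfg.get(key)
        if acoustic_value != vocoder_value:
            mismatches[key] = (acoustic_value, vocoder_value)
    return mismatches


# Made-up example values, not taken from any real model.
tacotron_audio = {
    "sample_rate": 22050, "num_mels": 80, "fft_size": 1024, "hop_length": 256,
    "win_length": 1024, "mel_fmin": 0, "mel_fmax": 8000,
}
vocoder_audio = {
    "sample_rate": 24000, "num_mels": 80, "fft_size": 1024, "hop_length": 256,
    "win_length": 1024, "mel_fmin": 0, "mel_fmax": 12000,
}

for key, (a, v) in find_audio_mismatches(tacotron_audio, vocoder_audio).items():
    print(f"{key}: acoustic model uses {a}, vocoder expects {v}")
```

If the function returns an empty dict for these fields, the two models at least agree on the feature layout. It cannot confirm that the normalization of the Mel features also matches, which would need to be checked separately.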

Text recognition with tensorflow

I'm new to tensorflow and have played around with the MNIST set of handwritten numbers.
I'd like to do my own project that recognises text instead of numbers but can't find a good tutorial.
Is it the same principle as numbers but instead of 10 layers at the end I have to use 26? Or include upper and lowercase and special characters?
If so I'd have to first crop the words into each character, right? Or is there a way to recognise entire sentences?
I'd like to train three different fonts, so no handwriting, and don't care about upper or lower case.
Later I'd like to use the trained model on photographs, for example of a printed article. Will the model work if I align the image, or do I have to retrain it a little, or train it from scratch with the new data?
Where do I start? The Keras example is overwhelming.
You're looking for an OCR model. A simple CNN can't detect text in scanned images; you need to segment them first, and how to do that depends on the script of the language.
You can start with Tesseract. There is a Python wrapper named pytesseract.
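import pytesseract
from PIL import Image
# --psm 10: treat the image as a single character; --oem 3: default OCR engine;
# the whitelist restricts recognition to digits.
text = pytesseract.image_to_string(Image.open("temp.jpg"), lang='eng',
    config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
print(text)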
For your own model, try CRNN models. https://github.com/qjadud1994/CRNN-Keras

Binomial And Multinomial Classification in ML

I have a project in which my task is to build a network intrusion detection system to detect anomalies and attacks in the network.
There are two problems.
1. Binomial Classification: Activity is normal or attack
2. Multinomial classification: Activity is normal or DOS or PROBE or R2L or U2R
But first, I am confused by the terms Binomial/Multinomial Classification.
Please help me understand them and, if possible, share some short code, which would help me a lot.
I searched for these terms on Google and YouTube but couldn't find a proper definition with code.
So far my code only does the following:
clean/transform/outlier detect/missing value treatment
model_selection/accuracy test
So my next step is to do the Binomial/Multinomial Classification.
Thanks for your help...
First, do not hesitate to post on https://datascience.stackexchange.com/ for this kind of question, which is more about data science than a coding issue.
Second, the answer is as simple as this:
Binary (not Binomial) Classification means there are only 2 target classes.
=> In your case: Normal vs Attack
Multiclass (or Multinomial) Classification means there are more than 2 target classes, and each sample belongs to exactly one of them. (Multilabel classification is a different setting, in which one sample can carry several labels at once.)
=> In your case: Normal, DOS, PROBE, R2L and U2R.
You can find examples at https://scikit-learn.org/stable/supervised_learning.html#supervised-learning
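To make the distinction concrete, here is a minimal sketch of how the same activity labels could be turned into targets for each of the two problems. The helper names and the sample labels are made up for illustration, and the resulting targets could then be passed to any classifier.

```python
# Class order is an arbitrary choice; index 0 is reserved for normal traffic.
CLASSES = ["normal", "dos", "probe", "r2l", "u2r"]


def to_binary_target(label):
    """Binary problem: 0 for normal activity, 1 for any kind of attack."""
    return 0 if label.lower() == "normal" else 1


def to_multiclass_target(label, classes=CLASSES):
    """Multiclass problem: one integer per activity type."""
    return classes.index(label.lower())


# Made-up labels standing in for the label column of the dataset.
labels = ["normal", "DOS", "PROBE", "normal", "R2L", "U2R"]

binary_targets = [to_binary_target(label) for label in labels]
multiclass_targets = [to_multiclass_target(label) for label in labels]

print(binary_targets)      # [0, 1, 1, 0, 1, 1]
print(multiclass_targets)  # [0, 1, 2, 0, 3, 4]
```

The features stay the same in both cases; only the target column changes, so the same preprocessing can feed both models.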

Is it possible to train an xgboost model in Python and deploy/run it in C/C++?

How much cross compatibility is there between the different language APIs?
For example, is it possible to train and save a model in Python and run it in C/C++ or any other language?
I would try this myself however my skills in non-Python languages are very limited.
You can dump the model into a text file like this:
model.get_booster().dump_model('xgb_model.txt')
Then you should parse the text dump and reproduce the prediction function in C++.
I have implemented this in a little library that I call FastForest, if you want to save some time and want to make sure you use a fast implementation:
https://github.com/guitargeek/XGBoost-FastForest
The mission of the library is to be:
Easy: deploying your xgboost model should be as painless as it can be
Fast: thanks to efficient structure-of-array data structures for storing the trees, this library goes very easy on your CPU and memory (it is about 3 to 5 times faster than xgboost in prediction)
Safe: the FastForest objects are immutable, and therefore they are an excellent choice in multithreading environments
Portable: FastForest has no dependency other than the C++ standard library
Here is a little usage example, loading the model you have dumped before and assuming the model requires 5 features:
std::vector<std::string> features{"f0", "f1", "f2", "f3", "f4"};
FastForest fastForest("xgb_model.txt", features);
std::vector<float> input{0.0, 0.2, 0.4, 0.6, 0.8};
float output = fastForest(input.data());
When you create the FastForest you have to tell it in which order you intend to pass the features, because the text file does not store the order of the features.
Also note that the FastForest does not do the logistic transformation for you, so in order to reproduce predict_proba() you need to apply the logistic transformation:
float proba = 1./(1. + std::exp(-output));
The treelite package (research paper, documentation) enables compilation of tree-based models, including XGBoost, to optimized C code, making inference much faster than with native model libraries.
You could consider dumping your model in a text file using
model.get_booster().dump_model('xgb_model.txt', with_stats=True)
Then, after some parsing, you can reproduce the .predict() function in C/C++. Beyond that, I am not aware of a native port of xgboost to C.
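The parsing step mentioned above can be sketched in a few lines. The example below is written in Python only to show the logic that would be ported to C/C++; it is not the FastForest or treelite implementation. It assumes the usual text dump layout (a `booster[i]:` header per tree, split lines such as `0:[f0<0.5] yes=1,no=2,missing=1`, and `leaf=` lines) and handles numeric splits only. The sample dump is made up. Note that the dump does not contain the model's base score, so a constant offset may have to be added to the summed margin to match the original predictions.

```python
import math
import re

# Assumed line formats of the text dump (numeric splits only).
SPLIT_RE = re.compile(r"(\d+):\[([^<]+)<([^\]]+)\] yes=(\d+),no=(\d+),missing=(\d+)")
LEAF_RE = re.compile(r"(\d+):leaf=([^,\s]+)")


def parse_dump(text):
    """Parse a text dump into a list of trees; each tree maps node id -> node tuple."""
    trees = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("booster["):
            trees.append({})  # a new tree starts here
            continue
        split = SPLIT_RE.match(line)
        if split:
            node_id, feature, threshold, yes, no, missing = split.groups()
            trees[-1][int(node_id)] = (
                "split", feature, float(threshold), int(yes), int(no), int(missing)
            )
            continue
        leaf = LEAF_RE.match(line)
        if leaf:
            trees[-1][int(leaf.group(1))] = ("leaf", float(leaf.group(2)))
    return trees


def predict_margin(trees, row):
    """Sum the leaf values reached in every tree; row maps feature name -> value or None."""
    total = 0.0
    for tree in trees:
        node = tree[0]
        while node[0] == "split":
            _, feature, threshold, yes, no, missing = node
            value = row.get(feature)
            if value is None:
                node = tree[missing]
            elif value < threshold:
                node = tree[yes]
            else:
                node = tree[no]
        total += node[1]
    return total


def sigmoid(x):
    """Logistic transformation, needed to turn a margin into a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Made-up two-tree dump used only to exercise the parser.
SAMPLE_DUMP = """booster[0]:
0:[f0<0.5] yes=1,no=2,missing=1
  1:leaf=0.4
  2:leaf=-0.4
booster[1]:
0:[f1<1.5] yes=1,no=2,missing=2
  1:leaf=0.1
  2:leaf=-0.3
"""

trees = parse_dump(SAMPLE_DUMP)
margin = predict_margin(trees, {"f0": 0.2, "f1": 2.0})
print(margin, sigmoid(margin))
```

Because lines dumped with `with_stats=True` only append extra fields such as gain and cover after the parts matched here, the same patterns should still apply, but that is worth verifying against a real dump.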

how to add speech training data to tensorflow

I have labelled .wav files to train a Convolutional Neural Network. These are for Bengali phones, for which no standard dataset is available. I want to feed these .wav files to Tensorflow for training my CNN model. I want to create grayscale spectrograms from these .wav files, which will be the input for my model. I need help with how to do so. If there is more than one alternative, what are their strengths and weaknesses?
Also, the files have variable lengths: some are 70 ms, some are 160 ms. Is there a way to divide them into 20 ms segments?
I have done something similar in my research. I used the Linux utility SOX to do the audio wave file manipulation and creating spectrograms.
As for the audio file length, you can use the "trim" effect of SOX to split the file into 20 ms segments. Something along the lines of the following, where segment.wav is the base name for the numbered output files:
sox myaudio.wav segment.wav trim 0 0.02 : newfile : restart
Using the "spectrogram" option of SOX, you can then create the spectrogram.
sox myaudio.wav -n spectrogram -m -x 256 -y 256 -o myspectrogram.png
The command will create a monochrome spectrogram of size 256x256 and store it in the file "myspectrogram.png".
In my research, I did not split the file into smaller chunks. I found that using the whole wave file of the word was sufficient to get good recognition. But, it depends on what your long term goal is.
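For readers who prefer to do the 20 ms splitting in code rather than with SOX, the arithmetic can be sketched as follows. The function name, the 16 kHz sample rate and the choice to zero-pad the last segment are assumptions made for this example; the padding keeps every segment the same length for a CNN and is a deliberate design choice, not something the SOX command does.

```python
def split_into_segments(samples, sample_rate, segment_ms=20, pad_value=0):
    """Split a 1-D sample sequence into fixed-length segments, padding the last one."""
    segment_len = sample_rate * segment_ms // 1000  # samples per segment
    if segment_len <= 0:
        raise ValueError("segment length must be at least one sample")
    segments = []
    for start in range(0, len(samples), segment_len):
        chunk = list(samples[start:start + segment_len])
        # Pad the final, shorter chunk so that all segments have equal length.
        chunk += [pad_value] * (segment_len - len(chunk))
        segments.append(chunk)
    return segments


sample_rate = 16000                        # assumed sample rate
samples = [0] * (70 * sample_rate // 1000)  # a silent 70 ms clip: 1120 samples

segments = split_into_segments(samples, sample_rate)
print(len(segments), len(segments[0]))  # 4 320
```

A 70 ms clip therefore yields three full segments plus one padded segment. Dropping the incomplete tail instead of padding it is an equally reasonable choice if the padded frames hurt training.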
You can also look at the ffmpeg ops in TensorFlow for loading audio files, though we don't yet have a built-in spectrogram:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/ffmpeg