I have labelled .wav files for training a convolutional neural network (CNN). They are recordings of Bengali phones, for which no standard dataset is available. I want to convert these .wav files into grayscale spectrograms and feed those to my CNN model in TensorFlow. How can I do that? If there is more than one approach, what are their strengths and weaknesses?
Also, the files vary in length: some are 70 ms, some are 160 ms. Is there a way to divide them into 20 ms segments?
I have done something similar in my research. I used the command-line utility SoX to manipulate the audio files and to create the spectrograms.
As for the audio file length, you can use the "trim" effect in SoX to split a file into 20 ms segments. Something along the lines of the following:
sox myaudio.wav segment.wav trim 0 0.02 : newfile : restart
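If you would rather do the split in Python than with SoX, here is a minimal sketch that uses only the standard-library wave module. It assumes uncompressed PCM .wav input; the function name split_wav and the output naming scheme are made up for illustration. Note that the last segment is shorter than 20 ms whenever the file length is not a multiple of 20 ms (a 70 ms file gives 20 + 20 + 20 + 10), so you would have to pad or drop it before feeding fixed-size inputs to a CNN.

```python
import wave


def split_wav(path, out_prefix, segment_ms=20):
    """Split a PCM .wav file into fixed-length segments; returns the paths written."""
    written = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        # Number of audio frames (samples per channel) in one segment.
        frames_per_segment = int(params.framerate * segment_ms / 1000)
        index = 0
        while True:
            frames = src.readframes(frames_per_segment)
            if not frames:
                break  # end of file reached
            out_path = f"{out_prefix}{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Copy channels, sample width and rate; the frame count in the
                # header is corrected by the wave module when the file is closed.
                dst.setparams(params)
                dst.writeframes(frames)
            written.append(out_path)
            index += 1
    return written
```

Calling split_wav("myaudio.wav", "segment") would write segment000.wav, segment001.wav and so on next to your script.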
Using the "spectrogram" option of SOX, you can then create the spectrogram.
sox myaudio.wav -n spectrogram -m -x 256 -y 256 -o myspectrogram.png
The command will create a monochrome spectrogram of size 256x256 and store it in the file "myspectrogram.png".
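To convert a whole folder of labelled files, the same command can be driven from a script. This is only a sketch: it assumes sox is installed and on the PATH, and the function names and the folder layout are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def spectrogram_command(wav_path, png_path, size=256):
    """Build the sox spectrogram call shown above for a single file."""
    return ["sox", str(wav_path), "-n", "spectrogram",
            "-m", "-x", str(size), "-y", str(size), "-o", str(png_path)]


def convert_folder(wav_dir, png_dir):
    """Run sox once per .wav file in wav_dir; returns the PNG paths requested."""
    if shutil.which("sox") is None:
        raise RuntimeError("sox was not found on PATH")
    png_dir = Path(png_dir)
    png_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        png_path = png_dir / (wav_path.stem + ".png")
        # check=True raises CalledProcessError if sox fails on a file.
        subprocess.run(spectrogram_command(wav_path, png_path), check=True)
        outputs.append(png_path)
    return outputs
```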
In my research, I did not split the files into smaller chunks. I found that using the whole wave file of a word was sufficient to get good recognition. But it depends on what your long-term goal is.
You can also look at the ffmpeg ops in TensorFlow for loading audio files, though we don't yet have a built-in spectrogram:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/ffmpeg
I am training the Tacotron2 model using TensorFlowTTS for a new language.
I managed to train the model (performed pre-processing and normalization, and decoded a few generated output files).
The files in the output directory are .npy files, which makes sense as they are mel-spectrograms.
I am trying to find a way to convert these files to .wav files in order to check whether my work has been fruitful.
I used this:
melspectrogram = librosa.feature.melspectrogram(
"/content/prediction/tacotron2-0/paol_wavpaol_8-norm-feats.npy", sr=22050,
window=scipy.signal.hanning, n_fft=1024, hop_length=256)
print('melspectrogram.shape', melspectrogram.shape)
print(melspectrogram)
audio_signal = librosa.feature.inverse.mel_to_audio(
melspectrogram, sr=22050, n_fft=1024, hop_length=256, window=scipy.signal.hanning)
print(audio_signal, audio_signal.shape)
sf.write('test.wav', audio_signal, sample_rate)
But it gives me this error: Audio data must be of type numpy.ndarray.
Yet I am already giving it a numpy.ndarray file.
Does anyone know where the issue might be, and if anyone knows a better way to do it?
I'm not sure what your error is, but the output of a Tacotron 2 system is a sequence of log Mel spectral features. You can't just apply the inverse Fourier transform to get a waveform, because the phase information is missing and because the features are not invertible. You can learn why at places like Speech.Zone (https://speech.zone/courses/).
Instead of using librosa as you are doing, you need a vocoder such as HiFi-GAN (https://github.com/jik876/hifi-gan) that is trained to reconstruct a waveform from log Mel spectral features. You can use a pre-trained model, and most off-the-shelf vocoders will do, but make sure that the sample rate, Mel range, FFT size, hop size and window size are all the same between your Tacotron2 feature prediction network and whatever vocoder you choose. Otherwise you'll just get noise!
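A quick way to catch such a mismatch before synthesising anything is to compare the two configurations programmatically. The sketch below is purely illustrative: the key names are hypothetical (use whatever your Tacotron2 and vocoder configs actually call these settings), and apart from the sample rate, FFT size and hop size taken from the question, the values are made up.

```python
# Hypothetical parameter names; adapt them to your own config files.
REQUIRED_KEYS = ("sampling_rate", "n_fft", "hop_size", "win_size",
                 "num_mels", "fmin", "fmax")


def config_mismatches(acoustic_cfg, vocoder_cfg, keys=REQUIRED_KEYS):
    """Return {key: (acoustic_value, vocoder_value)} for every setting that differs."""
    mismatches = {}
    for key in keys:
        acoustic_value = acoustic_cfg.get(key)
        vocoder_value = vocoder_cfg.get(key)
        if acoustic_value != vocoder_value:
            mismatches[key] = (acoustic_value, vocoder_value)
    return mismatches


# Illustrative values only; 22050 / 1024 / 256 come from the question.
tacotron_cfg = {"sampling_rate": 22050, "n_fft": 1024, "hop_size": 256,
                "win_size": 1024, "num_mels": 80, "fmin": 80, "fmax": 7600}
vocoder_cfg = dict(tacotron_cfg, fmax=8000)  # deliberately different Mel range

print(config_mismatches(tacotron_cfg, vocoder_cfg))  # {'fmax': (7600, 8000)}
```

An empty dictionary means the listed settings agree; anything else is worth fixing before you blame the model.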
I'm using this command:
tfjs.converters.save_keras_model(model,'jsmodels')
but I get a model.json and 3 weight files:
group1-shard1of3.bin
group1-shard2of3.bin
group1-shard3of3.bin
and I want to get only one .bin file, how can I do that?
I am not sure whether this is possible with save_keras_model, but from the command line with tensorflowjs_converter I would do the following, setting --weight_shard_size_bytes to at least the size of your model. If your model is <= 30 MB, setting it to 30000000 bytes will result in a single file, group1-shard1of1.bin.
tensorflowjs_converter --input_format keras --weight_shard_size_bytes 30000000 'model.h5' 'output_dir'
The reason is that your model is larger than the default shard size of 4 MB, so it is split into multiple chunks of at most 4 MB each. Set the weight_shard_size_bytes argument of tfjs.converters.save_keras_model(model, name_of_folder) to a value larger than the size of your model, and you will get just one group1-shard1of1.bin file.
tfjs.converters.save_keras_model(model, name_of_folder, weight_shard_size_bytes=1024 * 1024 * 30)  # 30 MB here; use any value larger than your model
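If you do not want to hard-code the number, you can derive it. The sketch below rounds a byte count up to the next whole MiB; the helper names are hypothetical, and using the size of a saved model file as a stand-in for the weight size is an assumption (the file also holds the model topology, so it should be a rough upper estimate).

```python
import os

MIB = 1024 * 1024


def shard_size_for(total_bytes, headroom_mib=1):
    """Smallest whole-MiB shard size that is strictly larger than total_bytes."""
    return (total_bytes // MIB + headroom_mib) * MIB


def shard_size_for_file(path):
    """Assumption: the saved model file size is an upper estimate of the weight size."""
    return shard_size_for(os.path.getsize(path))
```

The result can then be passed as weight_shard_size_bytes to save_keras_model, or as --weight_shard_size_bytes on the command line.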
I'm new to TensorFlow and have played around with the handwritten digits of the MNIST set.
I'd like to do my own project that recognises text instead of numbers but can't find a good tutorial.
Is it the same principle as numbers but instead of 10 layers at the end I have to use 26? Or include upper and lowercase and special characters?
If so I'd have to first crop the words into each character, right? Or is there a way to recognise entire sentences?
I'd like to train three different fonts, so no handwriting, and don't care about upper or lower case.
Later I'd like to use the trained model on photographs, for example of a printed article. Will the model work if I align the image, or do I have to fine-tune it a little or train it from scratch with the new data?
Where do I start? The Keras example is overwhelming.
You're looking for an OCR model. A simple CNN classifier can't read text from scanned images on its own; you first need to segment the text, and how to do that depends on the script of the language.
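You can start with Tesseract. There is a Python wrapper named pytesseract.
import pytesseract
from PIL import Image
# --psm 10 treats the image as a single character; --oem 3 selects the default engine.
# The whitelist restricts the output to digits; drop or extend it to recognise letters.
text = pytesseract.image_to_string(Image.open("temp.jpg"), lang='eng',
    config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
print(text)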
For your own model, try CRNN models. https://github.com/qjadud1994/CRNN-Keras
I am trying to train a neural network to detect steganographic images using Tensorflow and Nvidia Digits. I loaded a data set which has two sub directories - Cover Images and Steg Images. I think the network has to process the cover/stegano image pairs together to learn which are the covers and which are steganographic images. Am I correct?
How does batch size work? If I set it to 1, does it take one image from each sub directory and process them together, or do I have to set the batch size to 2 for that?
How does shuffling data on each epoch work? Does it shuffle both sub directories in the same way? For example, will 1.jpg be the third photo in both folders, or will its position differ between them?
"I think the network has to process the cover/stegano image pairs together to learn which are the covers and which are steganographic images. Am I correct?"
I am not familiar with object detection (right?) in Nvidia Digits, so please check out their tutorials for more information.
First, think about how your training data is labelled. The examples I have seen usually use only one training folder and one validation folder (each with images and labels); Digits divides your dataset, e.g. into 90% training and 10% validation images.
"How does batch size work? If I give 1 does it take one image from both sub directories and process them? or do I have to input batch number as 2 for that?"
The batch size tells Digits how many images to use per iteration. The dataset is divided into batches because memory for the calculations is limited and the whole dataset does not fit into one iteration. One epoch is one pass over the whole dataset.
So with a batch size of 1 it processes one image at a time, as far as I know.
"How does shuffling data on each epoch work? does it shuffle both sub directories equally? as an example will 1.jpg be the third photo on both folders or will it be different on them both?"
The data should be shuffled automatically.
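To make the relationship between folders, labels, shuffling and batches concrete, here is a small sketch of how a typical image-classification pipeline treats such a dataset. It is not a description of Digits internals, and the function names are hypothetical: the folder an image comes from only determines its label, all images are merged into one list, that list is shuffled, and batches are cut from the shuffled list. Nothing ties 1.jpg in one folder to 1.jpg in the other.

```python
import math
import random


def labeled_examples(cover_paths, steg_paths):
    """Merge both folders into one list of (path, label) pairs: 0 = cover, 1 = steg."""
    return [(p, 0) for p in cover_paths] + [(p, 1) for p in steg_paths]


def epoch_batches(examples, batch_size, seed=None):
    """Shuffle the merged list once, then yield consecutive batches covering all of it."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


def iterations_per_epoch(num_examples, batch_size):
    """Number of batches needed for one full pass over the dataset."""
    return math.ceil(num_examples / batch_size)
```

With 5 images and a batch size of 2, one epoch takes 3 iterations, and each batch may contain any mix of cover and steg images.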
I am trying to do a Deep Learning project by using Tensorflow.
Each of my samples consists of 2 files (a PNG image file and a TXT vector file), which are stored in different folders as follows:
./data/image/ #Folders contains different size of images
./data/vector/ #Folders contains vectors of corresponding image
#For example: apple.png + apple.txt
An example of the content of a vector file:
10.0,2.5,5,13
Since the images differ in size, I need to resize them and apply corresponding transformations to the vectors. It is important that I can do this processing while TensorFlow is running. Is there a good way to manage this kind of dataset?
I have looked at a lot of basic tutorials, but most of them give little detail on how to arrange customized data input and output. Please give me some advice!
I recommend taking a look at TFRecords and queues. The basic idea is the following: you resize all your images to the same size and store them together with your txt vectors in one TFRecord file. This is done separately, before you run your model.
When you create your model, you create a queue which reads data from the TFRecord file and feeds it to your model.
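Before anything can be written to a TFRecord file, each image has to be matched with its vector. Here is a minimal, TensorFlow-free sketch of that pairing step, assuming the layout from the question (same file stem in ./data/image/ and ./data/vector/, one comma-separated line per .txt file). The function name load_pairs is hypothetical, and the resizing and TFRecord writing are left out.

```python
from pathlib import Path


def load_pairs(image_dir, vector_dir):
    """Pair each PNG with the .txt of the same stem and parse its comma-separated vector."""
    pairs = []
    for image_path in sorted(Path(image_dir).glob("*.png")):
        vector_path = Path(vector_dir) / (image_path.stem + ".txt")
        if not vector_path.exists():
            continue  # skip images that have no vector file
        text = vector_path.read_text().strip()
        vector = [float(value) for value in text.split(",")]
        pairs.append((image_path, vector))
    return pairs
```

Each (image_path, vector) pair can then be resized, transformed and serialized in a one-off preprocessing script.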