MS Cognitive custom voice: submitting sample data returns "Only the RIFF(WAV) format is accepted. Check the format of your audio files." - text-to-speech

Just checking to make sure that this should be supported. The page here says that you should be able to use any PCM file that's at least 16 kHz. I'm trying to segment a longer WAV file into utterances using NAudio, and I can generate the files, but all of the training data that I submit comes back with the processing error "Only the RIFF(WAV) format is accepted. Check the format of your audio files." The audio files are 16-bit PCM, mono, 44 kHz WAV files, all under 60 s. Is there another constraint on the file format that I might be missing? The WAV files do have a valid RIFF header (I verified that the bytes exist).
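For reference, here's a quick way to confirm the header and format of a generated segment (a minimal sketch using NAudio's WaveFileReader; the path is a placeholder):
using System;
using NAudio.Wave;

// placeholder path for one of the generated segment files
using (var reader = new WaveFileReader(@"C:\segments\0.wav"))
{
    // WaveFileReader throws if the RIFF/WAVE header is missing or malformed,
    // so getting this far means the container itself is a valid RIFF WAV
    Console.WriteLine(reader.WaveFormat); // e.g. "16 bit PCM: 44kHz 1 channels"
    Console.WriteLine(reader.TotalTime);  // should be under 60 seconds
}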

I managed to figure this out by explicitly re-encoding audio that I received back from the SpeechRecognizer. Definitely not an efficient solution, but this was just a hack to test things out. Here's the code for reference (this goes in the Recognizer.Recognized event handler):
string rawResult = ea.Result.ToString(); // can get access to the raw JSON result this way
Regex r = new Regex(@".*Offset"":(\d*),.*");
UInt64 offset = Convert.ToUInt64(r.Match(rawResult).Groups[1].Value);
r = new Regex(@".*Duration"":(\d*),.*");
UInt64 duration = Convert.ToUInt64(r.Match(rawResult).Groups[1].Value);
// append this turn's transcript to the dataset transcript file
File.AppendAllText($@"{path}\{fileName}\{fileName}.txt", $"{segmentNumber}\t{ea.Result.Text}\r\n");
// offset and duration are in 100 ns units
WaveFileReader w = new WaveFileReader(v); // v holds the path of the source wav file
long totalDurationInMs = w.SampleCount * 1000 / w.WaveFormat.SampleRate; // total length of the file
ulong offsetInMs = offset / 10000; // convert from 100 ns intervals to ms
ulong durationInMs = duration / 10000;
long bytesPerMilliseconds = w.WaveFormat.AverageBytesPerSecond / 1000;
w.Position = bytesPerMilliseconds * (long)offsetInMs;
long bytesToRead = bytesPerMilliseconds * (long)durationInMs;
byte[] buffer = new byte[bytesToRead];
int bytesRead = w.Read(buffer, 0, (int)bytesToRead);
string wavFileName = $@"{path}\{fileName}\{segmentNumber}.wav";
string tempFileName = wavFileName + ".tmp";
// write the raw segment out in the source format first...
WaveFileWriter wr = new WaveFileWriter(tempFileName, w.WaveFormat);
wr.Write(buffer, 0, bytesRead);
wr.Close();
// ...then re-encode it to 16 kHz, 16-bit, mono. This is probably really inefficient,
// but it's also the simplest way to get things in the right format. It's a prototype - deal with it...
WaveFileReader r2 = new WaveFileReader(tempFileName);
var desiredOutputFormat = new WaveFormat(16000, 16, 1);
using (var converter = new WaveFormatConversionStream(desiredOutputFormat, r2))
{
    WaveFileWriter.CreateWaveFile(wavFileName, converter);
}
segmentNumber++;
This splits the input file into separate per-turn files and appends each turn's transcript to the text file, keyed by segment number.
The good news is that this produced a "valid" dataset, and I was able to create a voice from it. The bad news is that the voice font produced audio that was almost completely unintelligible, which I'm going to attribute to a combination of using machine-transcribed samples along with irregular turn breaks and possibly noisy audio. I may see if there's anything that can be done to improve the accuracy by hand editing a few files, but I at least wanted to post an answer here in case anyone else has the same problem.
Also, it appears that either 16 kHz or 44 kHz PCM will work with custom voice, so that's a plus if you have higher-quality audio available.

Related

Convert CameraImage stream to bytes or a file in Flutter?

I am trying to use Google ML Kit to process an image so I can extract the left eye open probability, but it requires the input image to be either a file, bytes, or a file path; see below:
final inputImage = InputImage.fromFile(file);
final inputImage = InputImage.fromBytes(bytes: bytes, inputImageData: inputImageData);
final inputImage = InputImage.fromFilePath(filePath);
It requires one of the above; I am trying to achieve this from a camera image stream:
_cameraService.cameraController.startImageStream((image) async {
  // trying to convert the image received here into either a File, bytes, or a file path
});
The ML Kit team doesn't own the flutter_mlkit repo, so please file an issue at https://github.com/azihsoyn/flutter_mlkit/issues

pop at the beginning of playback

When playing a memory stream containing wav encoded audio, the playback starts with a sharp pop/crackle:
ms = new MemoryStream(File.ReadAllBytes(audio_filename));
[...]
dispose_audio();
sound_output = new DirectSoundOut();
IWaveProvider provider = new RawSourceWaveStream(ms, new WaveFormat());
sound_output.Init(provider);
sound_output.Play();
That pop/crackle does not occur when playing the wav file directly:
dispose_audio();
NAudio.Wave.WaveStream pcm = new WaveChannel32(new NAudio.Wave.WaveFileReader(audio_filename));
audio_stream = new BlockAlignReductionStream(pcm);
sound_output = new DirectSoundOut();
sound_output.Init(audio_stream);
sound_output.Play();
The same file is playing, but when the WAV data is stored in a memory stream first, there is a somewhat loud pop at the beginning of playback.
I am very much a newbie with NAudio and audio in general, so it's probably something silly, but I can't seem to figure it out.
You are playing the WAV file header as though it were audio. Instead of RawSourceWaveStream, you still need to use WaveFileReader; just pass in your memory stream.
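Something like this (a minimal sketch; the variable names follow the question's code):
// WaveFileReader parses and skips the RIFF header instead of sending it to the
// sound card as PCM samples, which is what produced the pop
ms = new MemoryStream(File.ReadAllBytes(audio_filename));
dispose_audio();
sound_output = new DirectSoundOut();
IWaveProvider provider = new WaveFileReader(ms); // WaveFileReader accepts any Stream
sound_output.Init(provider);
sound_output.Play();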

MFT NAudio Resampling on the fly

I want to resample an audio file using NAudio and MFT on-the-fly.
For example, I have the following audio file:
File name: MyAudioFile.mp3
Duration: 10 sec
When this file is being played, I only want to resample that particular position to WAV in the desired format.
So, if the length of "MyAudioFile.mp3" is 10 sec and the "current play position" is 2.5 sec, I want to resample only that portion of data into WAV format at a sampling rate of 48 kHz.
When the audio progresses further, again, only the "current play position" must be resampled.
I tried the following code:
RawSourceWaveStream rsws;
WaveStream reader = new MediaFoundationReaderRT([path of "MyAudioFile.mp3"]);
MemoryStream outMemStream = new MemoryStream(); // decode to memory stream
using (reader)
using (var resampler = new MediaFoundationResampler(reader, new WaveFormat(48000, 16, reader.WaveFormat.Channels))) // 48 kHz target
{
    WaveFileWriter.CreateWaveFile(outMemStream, resampler);
    rsws = new RawSourceWaveStream(outMemStream, resampler.WaveFormat);
}
WaveChannel32 waveformInputStream = new WaveChannel32(rsws);
The resampling happens properly; however it resamples the whole audio file, which takes time.
What I am looking for is to resample just the "current play position" of the audio and discard the rest.
Thanks! I'd appreciate it if you could provide a sample.
To resample on the fly, just pass the reader directly into MediaFoundationResampler. You will then have an IWaveProvider rather than a WaveStream, so you won't be able to use WaveChannel32, but really that is an obsolete class now, and you should be able to do anything you need with the other ISampleProvider classes from NAudio.
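A minimal sketch of that approach (assuming NAudio's MediaFoundationReader and WaveOutEvent; the 48 kHz format matches the question's goal). Because the output device pulls data through the resampler buffer by buffer, only the portion around the current play position is ever resampled:
using NAudio.Wave;

using (var reader = new MediaFoundationReader("MyAudioFile.mp3"))
using (var resampler = new MediaFoundationResampler(reader, new WaveFormat(48000, 16, reader.WaveFormat.Channels)))
using (var output = new WaveOutEvent())
{
    output.Init(resampler); // no intermediate WAV file or MemoryStream needed
    output.Play();          // each buffer is resampled as the device requests it
    while (output.PlaybackState == PlaybackState.Playing)
        System.Threading.Thread.Sleep(100);
}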

How to compress speech stream in silverlight

I have recorded audio using Silverlight 4 and am trying to save it through a service on the server.
The problem is that the recorded .WAV file has hundreds of thousands of bytes of data as a stream, but when this stream is passed to the service it gets transmitted as 1526 bytes max. I have set the max properties in web.config. I think we need to encode the stream on the client, pass the encoded stream, and decode it on the server. How do I encode the audio stream in the Silverlight application and decode it on the server? Please advise. Thanks for your time. NSpeex and CSpeex do not work for me. If anyone has implemented this, please suggest how to do it.
The only way to compress WAV to any reasonable size (without trading away quality) is to convert it to another format.
I don't know if this is an option for you, but it would be very easy to use lame.exe to convert to MP3 before sending to the server. Of course you'd need to make sure that licensing allows you to distribute it with your application.
Here is an open source program that converts MP3 back to WAV:
http://www.codeproject.com/KB/audio-video/madlldlib.aspx
Something like this will convert WAV to MP3; you may also be able to decode MP3 back to WAV with lame itself using the --decode option.
using System.Diagnostics;

public string WAV2MP3(string fileName, bool waitFlag) {
    // fullpathDir is assumed to be a field holding the audio directory
    string newFileName = fullpathDir + fileName.Replace(".wav", ".mp3");
    string lameArgs = "-b 32 --resample 22.05 -m m \"" +
        fullpathDir + fileName + "\" \"" +
        newFileName + "\"";
    ProcessStartInfo processInfo = new ProcessStartInfo();
    processInfo.FileName = "lame.exe"; // assumes lame.exe ships alongside the app
    processInfo.Arguments = lameArgs;
    processInfo.WindowStyle = ProcessWindowStyle.Hidden;
    processInfo.WorkingDirectory = Application.StartupPath;
    Process startedProcess = Process.Start(processInfo);
    if (waitFlag) {
        startedProcess.WaitForExit();
    }
    return newFileName;
}
I'd probably just take the raw audio stream, sample it at a low rate, and send it out via a compressed stream. If you wanted to get fancy, you could farm the compression out to an MP3 encoder like LAME (in a separate thread/process!).
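If you go the separate-process route, something like this can pipe raw PCM straight into lame.exe without a temp file (a rough sketch: rawPcm and outputPath are hypothetical, and the raw-input flags assume 22.05 kHz, 16-bit, mono samples):
using System.Diagnostics;

// -r = raw PCM input, -s 22.05 = input sample rate in kHz, -m m = mono,
// -b 32 = 32 kbps output, "-" = read the samples from stdin
var psi = new ProcessStartInfo("lame.exe", "-r -s 22.05 -m m -b 32 - \"" + outputPath + "\"")
{
    RedirectStandardInput = true,
    UseShellExecute = false
};
using (var lame = Process.Start(psi))
{
    lame.StandardInput.BaseStream.Write(rawPcm, 0, rawPcm.Length);
    lame.StandardInput.Close(); // lame finishes the MP3 when stdin closes
    lame.WaitForExit();
}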

Mimic file IO in j2me midlet using RMS

I want to be able to record audio and save it to persistent storage in my J2ME application. As I understand it, J2ME does not expose the handset's file system; instead it wants the developer to use the RMS system. I understand the idea behind RMS but cannot seem to think of the best way to implement audio recording using it. I have a continuous stream of bits from the audio input which must be saved: 1) should I fill a buffer and periodically create a new record from the bytes in the buffer? 2) Should I put each sample in a new record? 3) Should I keep the entire recording in a byte array and only write it to the RMS when recording stops?
Is there a better way to achieve this than RMS?
Consider the code below and edit it as necessary; it should solve your problem by writing to the phone's file system directly (using the JSR-75 FileConnection API):
// enumerate the available file system roots first, e.g. FileSystemRegistry.getRoots()
getRoots();
FileConnection fc = null;
DataOutputStream dos = null;
fc = (FileConnection) Connector.open("file:///E:/");
if (!fc.exists())
{
    fc.mkdir();
}
fc = (FileConnection) Connector.open("file:///E:/test.wav");
if (!fc.exists())
{
    fc.create();
}
dos = fc.openDataOutputStream();
dos.write(recordedSoundArray); // recordedSoundArray holds the captured audio bytes
dos.close();
fc.close();