What Wav Format should I use for recording audio and Recognizing Speech in it? - naudio

I am creating a Windows service that connects to an audio input device and records audio using NAudio WaveIn. This is the flow:
1st level of speech detection using VOSK speech recognition, adding the recorded audio to a MemoryStream.
If speech is recognized, save the memory stream to a WAV file.
2nd level of speech recognition using Microsoft Cognitive Speech Service, reading from the WAV file.
My question: what wave format should I use when saving the WAV file to improve the speech recognition accuracy of the Cognitive Speech service?
_waveIn.WaveFormat = new WaveFormat(8000, 16, 1);
or
_waveIn.WaveFormat = new WaveFormat(16000, 16, 1);
Any help would be much appreciated.

The higher sample rate will take twice as much disk space but will be slightly higher quality, so might give slightly better results. I recommend you try submitting the same audio at both sample rates though and see whether there is any difference.
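To make the disk-space point concrete, here is a minimal sketch in Python (used only for illustration; it does not involve NAudio). It writes ten seconds of silence as 16-bit mono PCM into an in-memory WAV at each sample rate and compares the sizes. The helper name `wav_size_bytes` is hypothetical.

```python
import io
import wave

def wav_size_bytes(sample_rate, seconds, bits_per_sample=16, channels=1):
    """Write `seconds` of silent PCM audio to an in-memory WAV; return its size."""
    bytes_per_frame = (bits_per_sample // 8) * channels
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(bits_per_sample // 8)
        wav.setframerate(sample_rate)
        # bytes(n) yields n zero bytes, i.e. digital silence for signed 16-bit PCM
        wav.writeframes(bytes(bytes_per_frame * sample_rate * seconds))
    return len(buffer.getvalue())

size_8k = wav_size_bytes(8000, seconds=10)
size_16k = wav_size_bytes(16000, seconds=10)
print(size_8k, size_16k)
```

Apart from the small fixed header, the 16 kHz file carries exactly twice as many audio bytes as the 8 kHz one.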

Related

Is it possible to set Windows.Media.SpeechSynthesis stream format as in SAPI 5.3?

I'm using Windows.Media.SpeechSynthesis (C++/WinRT) to convert text to an audio file. Previously I was using SAPI, where it was possible to set the audio format when binding to a file via SPBindToFile(...) before speaking.
Is there any similar method in Windows.Media.SpeechSynthesis? It seems that it is only possible to get a 16 kHz, 16-bit, mono wave stream; is that right?
Does SpeechSynthesisStream already contain a real audio stream after speech synthesis, or does it hold some precalculated raw data, with the actual encoding happening when its data is accessed (playback on a device or copying to another, non-speech-specific stream)?
Thank you!
I think it should be possible to control the speech synthesis stream format somehow.
The WinRT synthesis engines output 16 kHz, 16-bit mono data. There isn't any resampling layer to change the format.
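If the synthesized stream has been saved as a WAV file, one way to confirm its format is to read the header back. The following is a minimal Python sketch (not WinRT code); the helper `describe_wav` is hypothetical, and the sample here is generated in memory rather than produced by a synthesis engine.

```python
import io
import wave

def describe_wav(stream):
    """Return (sample rate in Hz, bits per sample, channel count) from a WAV header."""
    with wave.open(stream, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()

# Build a stand-in WAV with the format the answer describes: 16 kHz, 16-bit, mono.
raw = io.BytesIO()
with wave.open(raw, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(bytes(3200))  # 0.1 s of silence

sample = io.BytesIO(raw.getvalue())
print(describe_wav(sample))
```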

What characteristics should a .wav file produced by a TTS engine have to be listened to in high quality?

I'm trying to generate a high-quality voice-over using the Microsoft Speech API. What values should I pass to this constructor to guarantee high-quality audio?
The .wav file will later be fed to FFmpeg, so the audio will be re-encoded to a more compact form. My main goal is to keep the voice as clear as I can, but I really don't know which values guarantee the best quality as perceived by humans.
First of all, just to let you know, I haven't used this Speech API; I'll give you an answer based on my audio processing work.
You can choose EncodingFormat.Pcm for pulse code modulation.
samplesPerSecond is the sampling frequency. Because it is voice, 16000 Hz will certainly cover it. If you are a real perfectionist you can go with 22050, for example. The higher the value, the larger the audio file. If file size isn't a problem you can even go with 32000 or 44100, but there won't be much noticeable difference.
bitsPerSample: go with 16 if possible.
Channels: 1 or 2 (mono or stereo); it won't affect the quality of the sound.
averageBytesPerSecond: this is samplesPerSecond * bytesPerSample * numberOfChannels (for example 22050*2 for 16-bit mono).
blockAlign: this is bytesPerSample * numberOfChannels (for example, for 16-bit PCM mono audio, 16 bits are 2 bytes and mono is 1 channel, so blockAlign is 2*1).
As for the last one, the byte array doesn't say much for itself and I'm not sure what it is for; I believe the first 6 arguments are enough for the audio to be generated.
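The two derived values above follow from the sample rate, bit depth, and channel count. As a rough sketch (plain Python arithmetic, not the Speech API itself; the function name `pcm_format_fields` is hypothetical), they can be computed like this:

```python
def pcm_format_fields(samples_per_second, bits_per_sample, channels):
    """Return (block_align, average_bytes_per_second) for uncompressed PCM."""
    bytes_per_sample = bits_per_sample // 8
    # One block holds one sample for every channel.
    block_align = bytes_per_sample * channels
    # Each second contains samples_per_second such blocks.
    average_bytes_per_second = samples_per_second * block_align
    return block_align, average_bytes_per_second

mono = pcm_format_fields(22050, 16, 1)
stereo = pcm_format_fields(44100, 16, 2)
print(mono, stereo)
```

For 22050 Hz, 16-bit mono this gives a block alignment of 2 bytes and 22050*2 bytes per second, matching the example above; adding a second channel doubles both values.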
I hope this was helpful
Cheers

Is there a way to stream audio from MIC and play that stream in Silverlight

I want to stream audio from a mic using NAudio and then pass that stream to a WCF service that a Silverlight app can consume to broadcast the live audio. I want the latency to be as low as possible.
Any suggestions? If someone has already done it, please point me to the source. Thanks in advance.
What you are asking is certainly possible, but it will be a fair amount of work.
NAudio can handle capturing the microphone audio.
At the Silverlight end you can play custom audio formats (in this case PCM) using a custom media element streaming source. See this one: http://code.msdn.microsoft.com/wavmss
I suspect latency would not be very good. You can reduce it by keeping the buffer sizes small. Also bear in mind that WAV is not a very efficient format to be sending over the network.
To get latency as low as possible, you should use netTcpBinding and stream your audio in binary format. I would use a MemoryStream for this and experiment with the buffer size to find what performs best. Also, try different audio formats to see which performs best. This also depends on the audio quality you expect.
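To get a feel for the trade-off between buffer size and latency, here is a small Python sketch of the arithmetic (not NAudio or WCF code). The function names are hypothetical, and 16 kHz 16-bit mono PCM is only an assumed format for illustration.

```python
def buffer_latency_ms(buffer_bytes, sample_rate, bits_per_sample=16, channels=1):
    """Milliseconds of uncompressed PCM audio held by a buffer of the given size."""
    bytes_per_second = sample_rate * (bits_per_sample // 8) * channels
    return 1000.0 * buffer_bytes / bytes_per_second

def pcm_bitrate_kbps(sample_rate, bits_per_sample=16, channels=1):
    """Network bitrate of uncompressed PCM in kilobits per second."""
    return sample_rate * bits_per_sample * channels / 1000.0

small = buffer_latency_ms(3200, 16000)    # a 3200-byte buffer
large = buffer_latency_ms(32000, 16000)   # a 32000-byte buffer
rate = pcm_bitrate_kbps(16000)
print(small, large, rate)
```

Each buffer adds at least its own duration to the end-to-end delay, so a smaller buffer lowers latency at the cost of more frequent sends. The bitrate figure shows why uncompressed PCM is costly on the wire.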

Convert audio (wav file) to text using SAPI?

My task is to convert an audio file (not direct speech from a human) into text.
E.g. if I have "Hello there" stored in a wav file, it should transcribe it into text and show the string "Hello there" on screen.
Code in any language is welcome, but C# is preferred.
SAPI can certainly do what you want. Start with an in-proc recognizer, connect up your audio as a file stream, set dictation mode, and off you go.
Now the disappointing bit. You probably won't get terribly good results; in fact, I suspect that unless you're very lucky, you'll probably get total garbage.
There are several problems:
Dictation really only works well once the SR engine has been trained. If you're lucky (like me), you can get OK results, but if the speaker has an accent, training is a must.
Training only works well for a single voice. If you've got multiple speakers in a single audio file, it's not going to work well.
The audio model for dictation (and Speech Recognition in general) assumes that you're using a close-talk microphone (i.e., a microphone right next to your face, to minimize noise pickup). If your WAV files have extra noise, accuracy will go down dramatically.
Dragon Naturally Speaking Professional has support for transcription, but it still requires training and a single voice. (I do believe that DNS has a custom audio model that works well for voice recorders.) I haven't used it myself, so I don't know how well it would work in your situation.
Now, if you are looking for specific keywords, other people have had success using "Audio Mining": running the recognizer on an audio stream looking for a specific keyword.

Waveform Audio buffer

I looked at the MSDN article "Recording and Playing Sound with the Waveform Audio Interface" and downloaded the P/Invoke Library Sample, which records using wave in and plays using wave out.
How do I get the data from the buffer (Waveform Audio Interface, wave in) while recording and play it (wave out) using C# or VB?
Thanks.
You could try the NAudio library; it takes care of this for you.