recognito framework's recognition result is not good - voice-recognition

I'm testing the Recognito framework for speaker recognition.
I'm feeding it short (maybe 5-7 second) WAV files,
but the returned result is always true.
Does Recognito need longer WAV files for identification?

Related

What characteristics should a .wav file produced by a TTS engine have so that it can be listened to in high quality?

I'm trying to generate a high-quality voice-over using the Microsoft Speech API. What kind of values should I pass to this constructor to guarantee high-quality audio?
The .wav file will later be fed to FFmpeg, so the audio will be re-encoded into a more compact form later. My main goal is to keep the voice as clear as I can, but I don't really know which values guarantee the best quality as perceived by humans.
First of all, just to let you know, I haven't used this Speech API; I'll give you an answer based on my audio processing work.
You can choose EncodingFormat.Pcm for Pulse Code Modulation
samplesPerSecond is the sampling frequency. Because this is voice, 16000 Hz will cover it for sure; if you are a real perfectionist you can go with 22050, for example. The higher the value, the larger the audio file. If file size isn't a problem you can even go with 32000 or 44100, but there won't be much noticeable difference.
bitsPerSample - go with 16 if possible
channelCount: 1 or 2, mono or stereo; it won't affect the quality of the sound.
averageBytesPerSecond: this is samplesPerSecond * blockAlign, i.e. samplesPerSecond * bytesPerSample * channelCount (for example 22050 * 2 for 16-bit mono).
blockAlign: this is bytesPerSample * numberOfChannels (for example, for 16-bit PCM mono audio, 16 bits are 2 bytes and mono is 1 channel, so blockAlign is 2 * 1 = 2).
As for the last argument, the byte array doesn't say much by itself and I'm not sure what it is for; I believe the first six arguments are enough for the audio to be generated. A sketch putting these values together follows.
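For what it's worth, here is a minimal sketch of how those values could be plugged into the seven-argument SpeechAudioFormatInfo constructor and handed to a SpeechSynthesizer. The output file name and spoken text are just placeholders, and passing an empty byte array for the last argument is my assumption for plain PCM:

```csharp
using System.Speech.AudioFormat;
using System.Speech.Synthesis;

class TtsToWav
{
    static void Main()
    {
        int samplesPerSecond = 22050;       // 16000-22050 Hz covers voice
        int bitsPerSample = 16;
        int channelCount = 1;               // mono is enough for a single voice
        int blockAlign = (bitsPerSample / 8) * channelCount;       // 2 * 1 = 2
        int averageBytesPerSecond = samplesPerSecond * blockAlign; // 22050 * 2

        var format = new SpeechAudioFormatInfo(
            EncodingFormat.Pcm,
            samplesPerSecond,
            bitsPerSample,
            channelCount,
            averageBytesPerSecond,
            blockAlign,
            new byte[0]);                   // format-specific data; assumed empty for plain PCM

        using (var synth = new SpeechSynthesizer())
        {
            synth.SetOutputToWaveFile("voiceover.wav", format);  // placeholder path
            synth.Speak("Hello, this is a test voice-over.");    // placeholder text
        }
    }
}
```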
I hope this was helpful
Cheers

core audio: is zero equivalent to silence only for PCM audio?

I'm trying to create a basic algorithm that does packet loss concealment for Core Audio. I simply want to replace the missing data with silence. In the book Learning Core Audio, the author says that in lossless PCM, zeros mean silence. I was wondering: if I'm playing VBR (i.e. compressed) data, would putting in zeros suffice for silence as well?
In my existing code, when I plug zeros into the audio queue, it suddenly jams (i.e. it no longer frees up consumed data in the audio queue callback), and I'm wondering why.
PCM is the raw encoded sample data. All zeros (when using signed data for samples) is indeed silence. (In fact, a buffer of any constant value is silence, but such a DC offset has the potential to damage your amplifier and/or speakers if it isn't filtered out.)
When you compress with a lossy codec, you enter a digital format where it is not trivial to just add silence. Think of trying to append null bytes to a file that is stored inside a ZIP archive: it isn't as simple as inserting them arbitrarily into the ZIP data.
If you want to add silence to a compressed file, you must do so using the appropriate codec. Then, you have to fit it into the bitstream, which is also not trivial. Usually the stream is broken up by frames, but you can't even split on those frames in some formats. MP3 and AAC use a bit reservoir where unused data in prior frames can be used to encode more complicated frames later on, making splitting the file very difficult.
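To make the PCM case concrete (the question is about Core Audio, but the idea is the same in any language; C# is used here only to match the other examples on this page): concealing a lost packet in raw 16-bit signed PCM really is just writing zero samples over the gap. A minimal sketch, assuming interleaved 16-bit samples:

```csharp
// Conceal a lost packet in raw 16-bit signed PCM by overwriting it with zeros.
// This only works for uncompressed PCM; for a lossy codec you would have to
// encode a genuinely silent frame with that codec and splice it into the stream.
static void ConcealWithSilence(short[] pcm, int lostStart, int lostLength)
{
    for (int i = lostStart; i < lostStart + lostLength && i < pcm.Length; i++)
    {
        pcm[i] = 0; // 0 is silence for signed PCM samples
    }
}
```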

Objective-C play sound

I know how to play mp3 files and whatnot in Xcode iOS. But how do I play a specific frequency? For example, if I just wanted to emit a C# note for 25 seconds, how might I do that? (The synth sound isn't as important to me as just the pitch of the note.)
You need to generate the PCM audio waveform that corresponds to the note you want to play and store that into a sample buffer in memory. Then you send that buffer to the audio hardware.
Here is a tutorial on generating waveforms of several types. The article goes into some detail on the many aspects of a note you need to consider, including the frequency, volume, waveform shape, sampling rate, etc. The article comes with Flash source code, but you should have no problem taking the concepts and adapting them to iOS.
If you also need a library that you can use to play the generated buffers on iOS, then I recommend the open source Finch.
I hope this helps!
You can synthesize waveforms of your desired frequency and feed them to the callbacks of either the Audio Queue or the RemoteIO Audio Unit API.
Here is a short tutorial on some of the code needed to create sine wave tones for iOS in C.
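Both answers come down to the same sample math, which is independent of language. Here is a sketch (C# for consistency with the rest of this page; 277.18 Hz for C#4 is just one choice of octave) of filling a 16-bit PCM buffer with a sine tone that you would then hand to your audio output API:

```csharp
// Fill a buffer with a pure sine tone as 16-bit signed PCM samples.
// frequencyHz: e.g. 277.18 for the note C#4; durationSeconds: e.g. 25.
static short[] GenerateSineTone(double frequencyHz, double durationSeconds,
                                int sampleRate = 44100, double amplitude = 0.5)
{
    int sampleCount = (int)(durationSeconds * sampleRate);
    var samples = new short[sampleCount];
    for (int n = 0; n < sampleCount; n++)
    {
        // One sample of a sine wave at the requested frequency, scaled to 16-bit range.
        double value = amplitude * Math.Sin(2.0 * Math.PI * frequencyHz * n / sampleRate);
        samples[n] = (short)(value * short.MaxValue);
    }
    return samples; // on iOS you would copy these into the Audio Queue / RemoteIO buffers
}
```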

Convert audio (wav file) to text using SAPI?

My task is to convert an audio file, not direct speech from a human, into text.
e.g. if I have "Hello there" stored in a WAV file, it should transcribe it and show the string "Hello there" on screen.
Code in any language is fine, but the priority is C#.
SAPI can certainly do what you want. Start with an in-proc recognizer, connect up your audio as a file stream, set dictation mode, and off you go.
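In C# (System.Speech), that recipe looks roughly like the sketch below; the WAV path is a placeholder, and looping Recognize() until it returns null is just one simple way to walk through the whole file:

```csharp
using System;
using System.Speech.Recognition;

class WavToText
{
    static void Main()
    {
        // In-proc recognizer (as opposed to the shared desktop recognizer).
        using (var recognizer = new SpeechRecognitionEngine())
        {
            recognizer.LoadGrammar(new DictationGrammar());        // dictation mode
            recognizer.SetInputToWaveFile(@"C:\audio\hello.wav");  // placeholder path

            RecognitionResult result;
            while ((result = recognizer.Recognize()) != null)
            {
                Console.WriteLine(result.Text);
            }
        }
    }
}
```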
Now the disappointing bit. You probably won't get terribly good results; in fact, I suspect that unless you're very lucky, you'll probably get total garbage.
There are several problems:
Dictation really only works well once the SR engine has been trained. If you're lucky (like me), you can get OK results, but if the speaker has an accent, training is a must.
Training only works well for a single voice. If you've got multiple speakers in a single audio file, it's not going to work well.
The audio model for dictation (and Speech Recognition in general) assumes that you're using a close-talk microphone (i.e., a microphone right next to your face, to minimize noise pickup). If your WAV files have extra noise, accuracy will go down dramatically.
Dragon Naturally Speaking Professional has support for transcription, but it still requires training and a single voice. (I do believe that DNS has a custom audio model that works well for voice recorders.) I haven't used it myself, so I don't know how well it would work in your situation.
Now, if you are looking for specific keywords, other people have had success with "audio mining": running the recognizer over an audio stream looking for specific keywords.

How to programmatically test for audio sync

I have a multimedia application that among other things converts video using FFmpeg. Video conversion being the pain that it is, I have in my test suites some tests that check our ability to convert various video formats, with emphasis on sample videos known not to work.
A common problem we've noticed from users is that some videos end up with their audio desynched after being processed, and I am looking for a way to check this in my tests.
Extracting the audio portion of the resulting videos is not a problem.
My best idea so far would be to check the offset of the first non-silent audio at both the beginning and the end, and compare those offsets between the source and converted videos, but I'm hoping someone smart has a better idea.
The application language/environment is Java, but since this is for testing, I'm free to use any toolset.
The basic problem is likely that the video and audio are different lengths. Extract the audio and test its length against the video length. If they are significantly different (more than maybe 0.05 sec; I'm not really sure what counts as detectably "off"), then there's a problem.
To fix it, re-encode the audio to match the video length, and then put the audio and video back into a container format.
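One way to script that length check is to shell out to ffprobe (part of the FFmpeg toolchain the application already uses) and compare the reported audio and video stream durations. The 0.05-second tolerance and the assumption that ffprobe is on the test machine's PATH are mine, and this is C# only to match the other sketches here; porting it to a Java ProcessBuilder call is mechanical. Note that some containers only report a duration at the format level, in which case you would query format=duration instead:

```csharp
using System;
using System.Diagnostics;
using System.Globalization;

static class SyncCheck
{
    // Ask ffprobe for the duration (in seconds) of the first stream of the given kind.
    // streamSpecifier is "a:0" for the first audio stream or "v:0" for the first video stream.
    static double StreamDuration(string file, string streamSpecifier)
    {
        var psi = new ProcessStartInfo
        {
            FileName = "ffprobe",
            Arguments = $"-v error -select_streams {streamSpecifier} " +
                        "-show_entries stream=duration " +
                        $"-of default=noprint_wrappers=1:nokey=1 \"{file}\"",
            RedirectStandardOutput = true,
            UseShellExecute = false
        };
        using (var p = Process.Start(psi))
        {
            string output = p.StandardOutput.ReadToEnd().Trim();
            p.WaitForExit();
            return double.Parse(output, CultureInfo.InvariantCulture);
        }
    }

    // Flag files whose audio and video track lengths drift apart by more than the tolerance.
    static bool AudioVideoInSync(string file, double toleranceSeconds = 0.05)
    {
        double audio = StreamDuration(file, "a:0");
        double video = StreamDuration(file, "v:0");
        return Math.Abs(audio - video) <= toleranceSeconds;
    }
}
```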