Convert audio (wav file) to text using SAPI? - sapi

My task is to convert a recorded audio file (rather than live human speech) into text.
For example, if "Hello there" is stored in a WAV file, the program should transcribe it and show the string "Hello there" on screen.
Code in any language is welcome, but C# is preferred.

SAPI can certainly do what you want. Start with an in-proc recognizer, connect up your audio as a file stream, set dictation mode, and off you go.
Now the disappointing bit. You probably won't get terribly good results; in fact, I suspect that unless you're very lucky, you'll probably get total garbage.
There are several problems:
Dictation really only works well once the SR engine has been trained. If you're lucky (like me), you can get OK results, but if the speaker has an accent, training is a must.
Training only works well for a single voice. If you've got multiple speakers in a single audio file, it's not going to work well.
The audio model for dictation (and Speech Recognition in general) assumes that you're using a close-talk microphone (i.e., a microphone right next to your face, to minimize noise pickup). If your WAV files have extra noise, accuracy will go down dramatically.
Dragon Naturally Speaking Professional has support for transcription, but it still requires training and a single voice. (I do believe that DNS has a custom audio model that works well for voice recorders.) I haven't used it myself, so I don't know how well it would work in your situation.
Now, if you are looking for specific keywords, other people have had success using "Audio Mining": running the recognizer on an audio stream and looking for a specific keyword.

Related

Voice Recording Has Lower Decibels Than "Silent" Recording

I have looked at a number of articles on detecting silence in a recording using NAudio. But, I ran into a snag. I am now looking into using the more complex silence detection methods such as Fourier Transform. In the meantime, perhaps someone can shed some light on the problem I have run into.
I wrote a program in C# using NAudio to detect silence in a WAV file. I have a sample dictation file that I used to test it and it works fine.
As a further test, I used Audacity to create a WAV file that has 1 minute of silence in it using the Noise Reduction feature. When I listen to it, I don't hear any sound. When I run my program on it, the lowest decibel reading is higher than the lowest decibel reading in the dictation file. I'm wondering why this happens, because it makes me worry that some dictation files may contain silence that I cannot detect.
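For comparing the two files, it can help to print the level of every short window rather than a single minimum. The following is a minimal sketch in Python (not the asker's C#/NAudio program), assuming a 16-bit PCM WAV file; the function name `window_levels_dbfs` and the 100 ms default window are illustrative choices. Windows containing only zero samples have no finite decibel value, so they are reported as negative infinity.

```python
import math
import struct
import wave


def window_levels_dbfs(wav_file, window_ms=100):
    """Return one RMS level (dB relative to full scale) per window.

    wav_file may be a path or a file-like object. Only 16-bit PCM is
    handled; samples from all channels are pooled together.
    """
    levels = []
    with wave.open(wav_file, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frames_per_window = max(1, wav.getframerate() * window_ms // 1000)
        while True:
            raw = wav.readframes(frames_per_window)
            if not raw:
                break
            count = len(raw) // 2
            samples = struct.unpack("<%dh" % count, raw)
            rms = math.sqrt(sum(s * s for s in samples) / count)
            if rms > 0:
                levels.append(20 * math.log10(rms / 32768.0))
            else:
                # Digital silence (all zeros) has no finite dB value.
                levels.append(float("-inf"))
    return levels
```

Calling `min(window_levels_dbfs("dictation.wav"))` on each file (the file name is a placeholder) gives the quietest window, which can then be compared against whatever threshold the silence detector uses.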

Speech Recognition API match grammar when there is no sound (Microsoft)

I have built a speech recognition tool using Microsoft SAPI and a Kinect.
Following the code sample, I load an XML grammar and start a SpeechRecognitionEngine.
Sometimes, when there is little or no sound, the SpeechRecognitionEngine reports a match with very high confidence (0.85) on a simple sentence: "Sarah what time is it"
Why does the engine trigger such a strong match in silence?
Is there any workaround?
Here is my main class on GitHub
I also wrote a blog post (in French) with a dump (WAV + XML)
I am not sure exactly which wave file you are talking about (I haven't spoken French since middle school). But I think this wave from your group qualifies: dump_2012_12.16_12.47.33.wav. It has a high confidence value .857 and does not appear to have any speech in the audio file. Looking at a spectrogram (see below) you can see the audio file does contain energy in the speech range.
Most speech recognition engines these days use a Hidden Markov Model (HMM) to match audio vector patterns to speech. The state of the art today is not always accurate at doing this. HMMs tend to be really sensitive to background noise.
This is why most speech type features in production today (like Siri) are push to talk. You need to push a button and you have 5 seconds to speak into the microphone. They do this so they can be sure there is some type of speech signal. For those systems that are open mic (Kinect is the only one I know of) they try and use a form of echo cancellation to suppress background audio. But even with the state of the art there is still bleed through.
The only relatively easy workarounds (again, not 100%) that I know of involve editing your grammar to include a garbage rule and shortening the possible phrase list. The garbage rule will give the speech engine a "run home to momma" option when it does not know what to do.
http://www.w3.org/TR/speech-grammar/#S2.2.3
Although I don't think this is recommended usage I have seen some systems behave better when using the garbage rule to help filter out background noise. Of course they then have to ignore the garbage reco events.
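As a rough illustration of the garbage rule, the sketch below holds a small SRGS grammar in a Python string and checks that it is well-formed and contains the special rule reference. The `special="GARBAGE"` reference comes from the W3C specification linked above; the rule id `command`, the placement of the garbage reference as an alternative to the phrase, and the helper `has_garbage_rule` are assumptions for illustration, not the asker's actual grammar.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"

# Hypothetical grammar: the real phrase plus a garbage alternative.
GRAMMAR = """<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" root="command">
  <rule id="command" scope="public">
    <one-of>
      <item>Sarah what time is it</item>
      <item><ruleref special="GARBAGE"/></item>
    </one-of>
  </rule>
</grammar>
"""


def has_garbage_rule(grammar_xml):
    """Return True if the grammar references the special GARBAGE rule."""
    root = ET.fromstring(grammar_xml.encode("utf-8"))
    refs = root.iter("{%s}ruleref" % SRGS_NS)
    return any(ref.get("special") == "GARBAGE" for ref in refs)
```

How a garbage match is reported depends on the engine, so the recognition handler still has to identify and discard those results itself.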

Mac OS X equivalent for DirectShow, GraphEdit

New to Mac OS X, familiar with Windows. Windows has DirectShow, a good number of built-in filters, COM programming, and GraphEdit for very fast prototyping and snooping on the graphs you've constructed in code.
I'm now about to go to the Mac to work with cameras, webcams, microphones, color spaces, files, splitting, synchronization, rendering, file reading, file saving, and many of the things I've come to take for granted with DirectShow when putting together applications for live performance. On the Mac side, so far I've found ... nothing! Either I don't know where to look or I'm having the toughest time tying the Mac's reputation for its ease of handling media with a coherent programmatic ability to get in there and start messin' with media manipulatin' building blocks.
I've seen some weak suggestions to use gstreamer or some library for QT but I can't bring myself to believe that this is the Apple way to go. And I've come across some QuickTime documentation but I'm not looking to do transitions, sprites, broadcasting, ...
Having a brain trained on DirectShow means I don't even know how Apple thinks about providing DirectShow-like functionality. That means I don't know the right keywords and don't even know where to look. Books? Bought a few. Now I might be able to write some code that can edit your sister's wedding video (if I can't make decent headway on this topic I may next be asking what that'd be worth to you), but for identifying what filters are available and how to string them together ... nothing. Suggestions?
Video handling is going through a huge transition on the Mac at the moment. QuickTime is very old, but also big and powerful, so it's been undergoing an incremental replacement process for the past 5 years or so.
That said, QTKit is the QuickTime subset (capture, playback, format conversion and basic video editing) which is supported going forward. The legacy QuickTime APIs are still there for the moment, and probably will remain at least until its major features are available elsewhere, but are 32-bit only. For some involved video stuff you may end up needing to use it in places.
At the moment, iOS is ahead of the Mac because it could start from scratch with AV Foundation. The future of the Mac media frameworks will probably either be AV Foundation directly (with QTKit being a lightweight shim over the top) or an extension of QTKit that looks very similar.
For audio there's Core Audio which is on Mac and iOS and isn't going away any time soon. It's quite powerful but somewhat obtuse in places. Luckily online support is very good; the mailing list is an essential resource.
For filters and frame-level processing you've got Core Video as someone else mentioned, as well as Core Image. For motion graphics there's Quartz Composer which includes a graphical editor and a plugin architecture to add your own patches. For programmatic procedural animation and easily mixing rendering models (OpenGL, Quartz, video, etc.) there's Core Animation.
In addition to all of these, of course there's no reason you can't use open source libraries where the built-in stuff doesn't do what you want.
To address your comment below:
In QuickTime (and QTKit), individual data types like audio and video are represented as tracks. It may not be immediately clear that QuickTime can open audio as well as video file formats. A common way to combine audio and video would be:
Create a QTMovie with your video file.
Create a QTMovie with your audio file.
Take the QTTrack object representing the audio and add it to the QTMovie with the video in it.
Flatten the movie, so it doesn't simply contain a reference to the other movie but actually contains the audio data.
Write the movie to disk.
Here's an example from Blender. You'll see how the A/V muxing is done in the end_qt function. There's also some use of Core Audio in there (AudioConverter*). (There's some classic QuickTime export code in quicktime_export.c but it doesn't seem to do audio.)

How to programmatically test for audio sync

I have a multimedia application that, among other things, converts video using FFmpeg. Video conversion being the pain that it is, I have in my test suites some tests that check our ability to convert various video formats, with emphasis on sample videos known not to work.
A common problem we've noticed from users is that some videos end up with their audio desynched after being processed, and I am looking for a way to check this in my tests.
Extracting the audio portion of the resulting videos is not a problem.
My best idea so far would be to check the offset of the first non-silence at both the beginning and end and compare each between the two videos, but I'm hoping someone smart has a better idea.
The application language/environment is Java, but since this is for testing, I'm free to use any toolset.
The basic problem is likely that the video and audio are different lengths. Extract the audio and test its length vs. the video length. If they are significantly different (more than maybe .05 sec, I'm not really sure what is detectable as "off"), then there's a problem.
To fix it, re-encode the audio to match the video length, and then put the audio and video back into a container format.
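The length comparison described above could be sketched as follows, in Python since any toolset is acceptable for the tests. It assumes `ffprobe` (shipped with FFmpeg) is on the PATH; the function names and the 0.05 second default tolerance are illustrative. Some containers do not report a per-stream duration, in which case ffprobe prints "N/A" and the `float` conversion fails.

```python
import subprocess


def stream_duration(path, stream):
    """Ask ffprobe for the duration in seconds of one stream.

    stream is an ffprobe stream specifier such as "v:0" or "a:0".
    Raises ValueError if the container reports no duration ("N/A").
    """
    result = subprocess.run(
        ["ffprobe", "-v", "error",
         "-select_streams", stream,
         "-show_entries", "stream=duration",
         "-of", "default=noprint_wrappers=1:nokey=1",
         path],
        check=True, capture_output=True, text=True,
    )
    return float(result.stdout.strip())


def durations_match(video_seconds, audio_seconds, tolerance=0.05):
    """True when the two stream lengths differ by at most tolerance."""
    return abs(video_seconds - audio_seconds) <= tolerance
```

A test would then assert `durations_match(stream_duration(f, "v:0"), stream_duration(f, "a:0"))` for each converted file `f`.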

Using Cocoa to detect when a running application plays audio

I'm looking into writing an app that runs as a background process and detects when an app (say, Safari) is playing audio. I can use NSWorkspace to get the process ID's of the currently running applications but I'm at a loss when it comes to detecting what those processes are doing. I assume that there is a way to listen in on a process and detect what public messages the objects are sending. I apologize for my ignorance on the subject.
Has anyone attempted anything like this or are aware of any resources that can help?
I don't think that your "answer" is an answer at all...
and there IS an answer (which is not "42")
Your best bet for doing this would be to write a pass-through audio output device, much like Soundflower, actually. Your audio output device would load the actual (physical) audio output device and pass the audio data along to it directly (after first having a look at the audio stream, of course!). Then you only need to convince your users to configure your audio device as the default audio output device so that the majority of applications which play sound will use it automatically. And voila...
Your audio processing function will probably just do a quick RMS on the buffer before passing it along to the actual output device. When the audio power crosses a certain threshold (probably something like -54 dB with Apple audio hardware), you know that some app is making sound.
|K<
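The RMS check mentioned in the answer above might look like the sketch below. It is written in Python for brevity and is not a Core Audio driver; it assumes the buffer holds float samples in the range [-1.0, 1.0], and the names `rms_dbfs` and `is_audible` are hypothetical. The -54 dB default is the figure suggested in the answer.

```python
import math


def rms_dbfs(buffer):
    """RMS level of float samples in [-1.0, 1.0], in dB relative to full scale."""
    if not buffer:
        return float("-inf")
    mean_square = sum(x * x for x in buffer) / len(buffer)
    if mean_square <= 0:
        return float("-inf")
    # 10 * log10(mean square) equals 20 * log10(RMS).
    return 10 * math.log10(mean_square)


def is_audible(buffer, threshold_db=-54.0):
    """True when the buffer's RMS level is above the threshold."""
    return rms_dbfs(buffer) > threshold_db
```

In a real pass-through device this check would run once per callback, just before the buffer is handed to the physical output.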
SoundFlower is an open-source project that allows Mac OS X applications to pass audio to each other. It almost certainly does something similar to what you describe.
I've been informed on another thread that while this is possible, it is an extremely advanced technique and not recommended. It would involve using Application Enhancer (APE) and is considered a not 'nice' thing to do. Looks like that app idea is destined for the big recycling bin in the sky :)