I have looked at a number of articles on detecting silence in a recording using NAudio, but I ran into a snag. I am now looking into more complex silence detection methods, such as the Fourier transform. In the meantime, perhaps someone can shed some light on the problem I have run into.
I wrote a program in C# using NAudio to detect silence in a WAV file. I have a sample dictation file that I used to test it and it works fine.
As a further test, I used Audacity's Noise Reduction feature to create a WAV file that has one minute of silence in it. When I listen to it, I don't hear any sound. But when I run my program on it, the lowest decibel reading is higher than the lowest decibel reading in the dictation file. I'm wondering why, because it concerns me that there might be dictation files containing silence that I cannot detect.
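For reference, here is a rough sketch of the kind of per-sample decibel check my program performs (simplified, with an illustrative threshold and file name rather than my exact code):

    using System;
    using NAudio.Wave;

    class SilenceScan
    {
        static void Main()
        {
            const float silenceThresholdDb = -40f;   // illustrative threshold

            using (var reader = new AudioFileReader("dictation.wav"))
            {
                var buffer = new float[reader.WaveFormat.SampleRate];
                float minDb = 0f;
                int read;

                while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
                {
                    for (int i = 0; i < read; i++)
                    {
                        // Convert the sample amplitude (full scale = 1.0) to decibels.
                        float db = 20f * (float)Math.Log10(Math.Abs(buffer[i]) + 1e-10f);
                        if (db < minDb) minDb = db;
                    }
                }

                Console.WriteLine("Lowest reading: {0:F1} dB (threshold {1} dB)",
                                  minDb, silenceThresholdDb);
            }
        }
    }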
Currently I am trying to create a game that has a radio which you must tune. I was experimenting with Csound and frequency modulation; while it is possible using the oscil opcode, it does not appear to allow you to modify a sound file coming from the diskin opcode.
Is there another opcode that allows modification of audio from diskin?
Could you explain what you mean by "modifying a sound file from the diskin opcode"? I know you can modulate the playback rate at k-rate, but I'm not sure that's what you're looking to do.
I am currently using NAudio to do the audio output. Everything is fine, except there is an annoying echo in the background. How can I eliminate such noise?
Thanks,
Adam
If you are not using a headset for a network chat program, then the received audio can get recorded by the microphone again, resulting in an annoying echo. There are some fairly complex echo suppression algorithms that programs like Skype use to detect and eliminate these echoes. Unfortunately NAudio does not include such an algorithm, so you'd need to find a third party one, or write your own.
I have built a speech recognition tool using Microsoft SAPI and a Kinect.
Following the code sample, I load an XML grammar and start a SpeechRecognitionEngine.
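Roughly, the setup looks like this (a simplified sketch rather than my exact code; the grammar path, culture, and audio format values are placeholders):

    using System;
    using System.Globalization;
    using Microsoft.Kinect;
    using Microsoft.Speech.AudioFormat;
    using Microsoft.Speech.Recognition;

    class KinectSpeech
    {
        static void Main()
        {
            // Start the Kinect sensor and its audio source.
            KinectSensor sensor = KinectSensor.KinectSensors[0];
            sensor.Start();

            // Load the SRGS XML grammar and wire up the recognition event.
            var engine = new SpeechRecognitionEngine(new CultureInfo("fr-FR"));
            engine.LoadGrammar(new Grammar("SpeechGrammar.xml"));   // placeholder path
            engine.SpeechRecognized += (s, e) =>
                Console.WriteLine("{0} (confidence {1:F2})", e.Result.Text, e.Result.Confidence);

            // Feed the Kinect audio stream (16 kHz, 16-bit mono PCM) to the engine.
            engine.SetInputToAudioStream(
                sensor.AudioSource.Start(),
                new SpeechAudioFormatInfo(EncodingFormat.Pcm, 16000, 16, 1, 32000, 2, null));
            engine.RecognizeAsync(RecognizeMode.Multiple);

            Console.ReadLine();   // keep the process alive while recognizing
        }
    }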
Sometimes, when there is little or no sound, the SpeechRecognitionEngine gets a match with very high confidence (0.85) on a simple sentence: "Sarah what time is it".
Why does the engine trigger this strong match in silence?
Is there any workaround?
Here is my main class on GitHub.
I also wrote a blog post (in French) with a dump (WAV + XML).
I am not sure exactly which wave file you are talking about (I haven't spoken French since middle school), but I think this wave from your group qualifies: dump_2012_12.16_12.47.33.wav. It has a high confidence value (0.857) and does not appear to have any speech in it. Looking at a spectrogram (see below), you can see the audio file does contain energy in the speech range.
Most speech recognition engines these days use a Hidden Markov Model (HMM) to match audio vector patterns to speech. The state of the art today is not always accurate at doing this, and HMMs tend to be really sensitive to background noise.
This is why most speech features in production today (like Siri) are push-to-talk: you push a button and then have five seconds to speak into the microphone. They do this so they can be sure there is some type of speech signal. Systems that are open-mic (the Kinect is the only one I know of) try to use a form of echo cancellation to suppress background audio, but even with the state of the art there is still bleed-through.
The only relatively easy workarounds (again, not 100%) that I know of involve editing your grammar to include a garbage rule and shortening the possible phrase list. The garbage rule gives the speech engine a "run home to momma" option when it does not know what to do.
http://www.w3.org/TR/speech-grammar/#S2.2.3
Although I don't think this is recommended usage, I have seen some systems behave better when using the garbage rule to help filter out background noise. Of course, they then have to ignore the garbage recognition events.
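As a rough illustration only (the rule name and phrase are placeholders, and this may not match how your grammar is structured), the garbage rule can be wrapped around a phrase with the SRGS classes like this:

    using Microsoft.Speech.Recognition;
    using Microsoft.Speech.Recognition.SrgsGrammar;

    class GarbageRuleDemo
    {
        static void Main()
        {
            // Allow stray audio before and after the phrase to fall into the
            // GARBAGE slot instead of being forced into a phrase match.
            var rule = new SrgsRule("askTime");
            rule.Add(SrgsRuleRef.Garbage);
            rule.Add(new SrgsItem("Sarah what time is it"));
            rule.Add(SrgsRuleRef.Garbage);

            var doc = new SrgsDocument();
            doc.Rules.Add(rule);
            doc.Root = rule;

            var engine = new SpeechRecognitionEngine();
            engine.LoadGrammar(new Grammar(doc));
            // ... set the audio input and start recognition as usual, and
            // remember to ignore results that are entirely garbage.
        }
    }

The same classes also exist under System.Speech.Recognition.SrgsGrammar if you are using the desktop engine rather than the Microsoft Speech Platform.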
I know how to play MP3 files and so on in Xcode for iOS, but how do I play a certain frequency? Say I just wanted to emit a C# note for 25 seconds; how might I do that? (The synth isn't as important to me as just the pitch of the note.)
You need to generate the PCM audio waveform that corresponds to the note you want to play and store it in a sample buffer in memory. Then you send that buffer to the audio hardware.
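The arithmetic is the same on any platform; as a rough sketch (shown in C# purely for illustration, with example note, duration, and amplitude values), filling a 16-bit PCM buffer with a C#5 sine tone looks like this:

    using System;

    class ToneBuffer
    {
        static void Main()
        {
            const int sampleRate = 44100;         // samples per second
            const double frequency = 554.37;      // roughly C#5
            const double durationSeconds = 25.0;  // length of the note
            const double amplitude = 0.25;        // keep well below clipping

            int sampleCount = (int)(sampleRate * durationSeconds);
            var samples = new short[sampleCount];

            for (int i = 0; i < sampleCount; i++)
            {
                // Basic sine-wave synthesis: one sample per time step.
                double t = (double)i / sampleRate;
                samples[i] = (short)(amplitude * short.MaxValue *
                                     Math.Sin(2.0 * Math.PI * frequency * t));
            }

            // "samples" now holds 16-bit PCM data ready to hand to whatever
            // audio output API the platform provides.
            Console.WriteLine("Generated {0} samples.", samples.Length);
        }
    }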
Here is a tutorial on generating waveforms of several types. The article goes into some detail on the many aspects of a note you need to consider, including the frequency, volume, waveform shape, sampling rate, and so on. The article comes with Flash source code, but I think you should have no problem taking the concepts and adapting them to iOS.
If you also need a library that you can use to play the generated buffers on iOS, then I recommend the open source Finch.
I hope this helps!
You can synthesize waveforms of your desired frequency and feed them to the callbacks of either the Audio Queue or the RemoteIO Audio Unit API.
Here is a short tutorial on some of the code needed to create sine wave tones for iOS in C.
My task is to convert an audio file (not direct speech from a human) into text.
For example, if I have "Hello there" stored in a WAV file, it should transcribe it and show the string "Hello there" on screen.
Code in any language is fine, but C# is preferred.
SAPI can certainly do what you want. Start with an in-proc recognizer, connect up your audio as a file stream, set dictation mode, and off you go.
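With System.Speech, the basic wiring might look something like this (a minimal sketch; the file path is a placeholder):

    using System;
    using System.Speech.Recognition;

    class WavDictation
    {
        static void Main()
        {
            // In-proc recognizer running in dictation mode against a WAV file.
            using (var engine = new SpeechRecognitionEngine())
            {
                engine.LoadGrammar(new DictationGrammar());
                engine.SetInputToWaveFile("input.wav");   // placeholder path

                RecognitionResult result;
                while ((result = engine.Recognize()) != null)
                {
                    Console.WriteLine("{0} (confidence {1:F2})",
                                      result.Text, result.Confidence);
                }
            }
        }
    }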
Now the disappointing bit. You probably won't get terribly good results; in fact, I suspect that unless you're very lucky, you'll probably get total garbage.
There are several problems:
Dictation really only works well once the SR engine has been trained. If you're lucky (like me), you can get OK results, but if the speaker has an accent, training is a must.
Training only works well for a single voice. If you've got multiple speakers in a single audio file, it's not going to work well.
The audio model for dictation (and Speech Recognition in general) assumes that you're using a close-talk microphone (i.e., a microphone right next to your face, to minimize noise pickup). If your WAV files have extra noise, accuracy will go down dramatically.
Dragon NaturallySpeaking Professional has support for transcription, but it still requires training and a single voice. (I do believe that DNS has a custom audio model that works well for voice recorders.) I haven't used it myself, so I don't know how well it would work in your situation.
Now, if you are looking for specific keywords, other people have had success with "audio mining": running the recognizer over an audio stream, looking only for specific keywords.
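A rough sketch of that approach with System.Speech, using a small keyword grammar over a WAV file (the keywords and path here are placeholders):

    using System;
    using System.Speech.Recognition;

    class KeywordSpotting
    {
        static void Main()
        {
            using (var engine = new SpeechRecognitionEngine())
            {
                // A small, fixed vocabulary is far more robust than open
                // dictation: the engine only decides whether one of these occurred.
                var keywords = new Choices("invoice", "refund", "cancel");   // placeholders
                engine.LoadGrammar(new Grammar(new GrammarBuilder(keywords)));
                engine.SetInputToWaveFile("recording.wav");                  // placeholder path

                RecognitionResult result;
                while ((result = engine.Recognize()) != null)
                {
                    Console.WriteLine("Heard '{0}' (confidence {1:F2})",
                                      result.Text, result.Confidence);
                }
            }
        }
    }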