Is this file qualified for Sphinx training - voice-recognition

My WAV has some small buzzing sound at its beginning and end. Is this file qualified for Sphinx training? If yes, do I have to include some special character in the transcription file?
Thank you and best regards.

First of all, you can simply cut your wav and remove the sound.
Secondly, if you have only one WAV, there is no point in training anything; you need lots of data.
You can add a special sound "buzz" in your transcription, model, vocabulary, etc. and account for it. I would rather get rid of it if possible. Though, if your mic buzzes all the time, you may choose to leave it.
In general, your training data should be recorded under the same conditions in which the system will be used.
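Cutting the buzz off both ends can be scripted. Below is a minimal sketch using Python's standard wave module, assuming the buzz lasts a known number of seconds at each end. The function name trim_wav and the durations passed to it are illustrative and have nothing to do with Sphinx itself.

```python
import wave

def trim_wav(src_path, dst_path, head_seconds, tail_seconds):
    """Copy src_path to dst_path, dropping audio from both ends.

    Returns the number of frames written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        # Clamp so that over-long trims yield an empty file, not an error.
        start = min(int(head_seconds * rate), total)
        end = max(total - int(tail_seconds * rate), start)
        src.setpos(start)
        frames = src.readframes(end - start)
    with wave.open(dst_path, "wb") as dst:
        # The frame count in the header is corrected when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)
    return end - start
```

For example, trim_wav("in.wav", "out.wav", 0.25, 0.25) would drop a quarter of a second from each end; the file names are placeholders.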

Related

How to find keyword in audio file and deliver timestamps

I wonder how to find a certain keyword (potentially occurring multiple times) in a long audio file (let's say 1-2 hours), along with the start and end timestamps of each occurrence.
Let's say we do it with TensorFlow as described here. The problems I see are the following: in reality the keyword we train on can be a bit longer or shorter (for example, you can say "no" or "nooooo", so it might range from 0.1 s to 3 s). In the link they use padding for that, so the input for training and inference always has the same shape.
So what is in reality the best way to deal with:
Different input lengths of the audio snippets to train? Padding and cutting might destroy important information or add nonsensical "emptiness".
How to find/run inference in the long audio file? A moving window that advances one sample at a time (at 16 kHz) would be the obvious way, but that can't be efficient.
How to then get the timestamps?
Thanks!

ImageDeserializer mean file with variable sized images?

When you have variable-sized input images and use the ImageDeserializer to resize the images, how are you supposed to deal with a mean file? Computing the mean file is easy when the input images are all the same size. Wouldn't it be better if the ImageDeserializer were capable of computing the mean itself?
The order in which image pre-processing steps are executed by default is:
Take a crop that has the desired image size
Subtract the mean
Hence, your mean file only has to contain a mean for the desired input size.
For computing the mean yourself, you will have to repeat these steps, at least to some degree. If you're on .NET, then you may want to have a look at this post where .NET image pre-processing is discussed.
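As a rough illustration of repeating those steps by hand, here is a pure-Python sketch. It assumes single-channel images stored as nested lists and a center crop; the function names are made up for this example and are not part of CNTK or the ImageDeserializer, and the format in which the mean file must be saved is not covered.

```python
def center_crop(image, height, width):
    """Take a height x width crop from the middle of a 2-D image."""
    top = (len(image) - height) // 2
    left = (len(image[0]) - width) // 2
    return [row[left:left + width] for row in image[top:top + height]]

def mean_image(images, height, width):
    """Per-pixel mean over the cropped versions of all images."""
    total = [[0.0] * width for _ in range(height)]
    for image in images:
        crop = center_crop(image, height, width)
        for y in range(height):
            for x in range(width):
                total[y][x] += crop[y][x]
    count = len(images)
    return [[value / count for value in row] for row in total]

def subtract_mean(image, mean):
    """Crop the image to the size of the mean, then subtract the mean."""
    height, width = len(mean), len(mean[0])
    crop = center_crop(image, height, width)
    return [[crop[y][x] - mean[y][x] for x in range(width)]
            for y in range(height)]
```

The same mean would then be subtracted from both training and test images, which is the consistency requirement described in the answer.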
I agree that it would be helpful if there was some tool to compute the mean file. I can understand though why the image deserializer does not do it automatically: You need to transform your training and test data via the same mean file. If you subtract mean from training data automatically, you'll have no way of repeating the same operation on the test data. Plus there is randomization that could make it messy in some cases, etc.

How to deal with thousands of small audio files?

I need to implement an app that has a feature to play sounds. Each sound will be a spoken word, and the expected number of sounds is about one thousand. So, the simplest solution would be to store those sounds as sound files, each word in a separate file, and to play them on demand. Would there be any potential problems with such a large number of files?
No problem with that many files, but they will take up more space than just the total of their sizes. Each file will fill up a whole number of storage blocks on the device. On average you will then waste half a block per file (as a rule of thumb), unless all your files are significantly smaller than one block, in which case you will always use 1000 blocks (one per file) and waste 1000 * (blocksize - average file size).
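As a rough illustration of that arithmetic, here is a small Python sketch. The block size and clip sizes used with it are made-up numbers, not measurements from any device.

```python
def allocated_bytes(file_sizes, block_size):
    """Bytes occupied on disk when every file uses whole blocks."""
    # -(-a // b) is integer ceiling division.
    return sum(-(-size // block_size) * block_size for size in file_sizes)

def wasted_bytes(file_sizes, block_size):
    """Slack space: allocated bytes minus the actual file contents."""
    return allocated_bytes(file_sizes, block_size) - sum(file_sizes)
```

With 1000 clips of 1500 bytes each and a hypothetical 4096-byte block, wasted_bytes([1500] * 1000, 4096) gives 2596000, which is 1000 * (4096 - 1500).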
Things you could do:
Concatenate the files into one big file, store the start and length of each subfile, either read the chunk into memory or copy to a temporary file.
Drop the files in a database as BLOB fields for easier retrieval. This won't save space, but may make your code simpler or more reliable.
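The first option, one big file plus a table of start offsets and lengths, could be sketched like this. It is a language-neutral illustration written in Python; an iOS app would do the same with its own file APIs, and the names pack_files and read_clip are invented for the example.

```python
import os

def pack_files(paths, pack_path):
    """Concatenate files into pack_path.

    Returns an index mapping each file name to (offset, length).
    """
    index = {}
    offset = 0
    with open(pack_path, "wb") as pack:
        for path in paths:
            with open(path, "rb") as source:
                data = source.read()
            pack.write(data)
            index[os.path.basename(path)] = (offset, len(data))
            offset += len(data)
    return index

def read_clip(pack_path, index, name):
    """Read one sub-file back into memory using the stored offset."""
    offset, length = index[name]
    with open(pack_path, "rb") as pack:
        pack.seek(offset)
        return pack.read(length)
```

The index itself would also need to be stored somewhere, for instance in a small side file; that part is left out here.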
I don't think you need to make your own caching mechanism. Most likely iOS has a system-wide cache that does a far better job. A cache of your own should only be relevant if you experience performance issues and need much shorter load times. In that case, perhaps consider using blocks for loading and dispatching the playing, as that's an easier way to hide the load latency and avoid UI freezes.
If your audio is uncompressed, the App Store will report the compressed size. If that differs a lot from the unpacked size, some (nitpicking) customers will definitely notice and complain, as they think the advertised size is the install size. I know from personal experience. They will generally not take a technical answer for an answer, and may even bypass talking to you and just downvote you based on this. I s#it you not.
You should be fine storing 1000 audio clip files within the IPA, but it is important to take note of the space requirements and organisation.
Another thing to take into consideration is that accessing the disk is slower than accessing memory and also uses battery power, so it may be ideal to load the most frequently used audio clips into memory.
If you can afford it, use FMOD, which I believe can extract audio from various compressed schemes. If you just want to handle all those files yourself, create a .zip file and extract them on the fly using libz (iOS library libz.dylib).

NAudio - Create software beat machine / sampler - General Strategy

I'm new to NAudio, but the goal of my project is to let the user listen to an MP3 and then select a "chunk" of that song as a sample, which could be saved to disk. These samples should be able to be replayed at the same time (i.e. not merged, but played simultaneously).
Could someone please let me know what overall strategy is required to achieve this (not necessarily the specifics, almost like pseudocode)?
For example, would the samples / chunks of a song need to be saved as WAV files? And could these samples be played together in the WAV format, etc.?
I have seen a few small examples implementing some of the ideas I've mentioned above, but I don't have a good sense of the big picture just yet.
Thanks in advance,
Andrew
The chunks wouldn't need to be saved as WAV files unless you were keeping them for future use. You can store the PCM audio (Mp3FileReader automatically converts to PCM) in a byte array and use RawSourceWaveStream to play them.
As for mixing them, I'd recommend using the MixingSampleProvider. This does mean you need to convert your RawSourceWaveStream to IEEE float, but you can use Pcm16BitToSampleProvider to do this. This will give the advantage that you can adjust volumes (and do other DSP) easily on the samples you are mixing. MixingSampleProvider also auto-removes completed inputs, so you can just add new inputs whenever you want to trigger a sound.
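To show what the conversion to float and the mixing amount to, here is a conceptual sketch in Python. It is not NAudio code and does not reproduce the behaviour of MixingSampleProvider or Pcm16BitToSampleProvider; the function names are invented, mono little-endian 16-bit PCM is assumed, and the final clamp to [-1, 1] is a safeguard added for this sketch.

```python
import struct

def pcm16_to_float(pcm_bytes):
    """Convert little-endian 16-bit PCM bytes to floats in [-1.0, 1.0)."""
    count = len(pcm_bytes) // 2
    samples = struct.unpack("<%dh" % count, pcm_bytes[:count * 2])
    return [sample / 32768.0 for sample in samples]

def mix(inputs, volumes=None):
    """Sum several float sample lists, each scaled by its own volume."""
    if volumes is None:
        volumes = [1.0] * len(inputs)
    length = max((len(samples) for samples in inputs), default=0)
    mixed = [0.0] * length
    for samples, volume in zip(inputs, volumes):
        # Shorter inputs simply stop contributing once they run out.
        for n, value in enumerate(samples):
            mixed[n] += volume * value
    # Clamp so the sum stays in the valid float sample range.
    return [max(-1.0, min(1.0, value)) for value in mixed]
```

Working in float is what makes the per-input volume a single multiplication, which is the advantage mentioned above.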

How to give best chance of success to an OCR software?

I am using Tesseract OCR (via pytesser) and PIL (Python Imaging Library) for automated testing of an application.
I am checking that the displayed text is ok by making a screenshot and getting the text thanks to tesseract.
I had some issues in the beginning and it seems to work better since I have increased the size of the screenshot thanks to the bicubic interpolation of PIL.
Unfortunately, I still get some mistakes, like confusion between '0' and 'O'. I can imagine that I will have other similar issues in the future.
I would like to know if there are some techniques to prepare an image in order to help the OCR. Any idea is welcome.
Thanks in advance
Shameless plug and disclaimer: my company packages Tesseract for use in .NET
Tesseract is an OK OCR engine. It can miss a lot and gets readily confused by non-text. The best thing you can do for it is to make sure it gets text only. The next best thing is to give it something sanely binarized (adaptive or dynamic threshold to get there) or grayscale and let it try to do binarization.
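As one way of getting a "sanely binarized" image, the sketch below picks a threshold from the image's own histogram using Otsu's method. Note that this is a single global threshold chosen automatically, standing in for the adaptive threshold mentioned above, which would vary across the image. Pixels are assumed to be a flat list of grayscale values from 0 to 255, and the function names are illustrative.

```python
def otsu_threshold(pixels):
    """Return the gray level that best separates dark from light pixels."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))
    sum_dark = 0
    weight_dark = 0
    best_level, best_variance = 0, -1.0
    for level in range(256):
        weight_dark += histogram[level]
        if weight_dark == 0:
            continue
        weight_light = total - weight_dark
        if weight_light == 0:
            break
        sum_dark += level * histogram[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance; larger means better separation.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level

def binarize(pixels):
    """Map pixels above the Otsu threshold to 255 and the rest to 0."""
    threshold = otsu_threshold(pixels)
    return [255 if value > threshold else 0 for value in pixels]
```

With PIL, the flat list could come from a grayscale image's pixel data, but that wiring is left out to keep the sketch self-contained.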
Train tesseract to recognize your font
Make image extra clean and with enough free space around characters
Profit :)
Here are a few real-world examples.
First image is the original image (cropped power meter numbers)
Second image is slightly cleaned up image in GIMP, around 50% OCR accuracy in tesseract
Third image is completely cleaned image - 100% OCR recognized without any training!
Even under the best conditions OCR variants will sneak up on you. Your best option will be to design your tests to be aware of them.
For distinguishing between 0 and O, one simple solution is to choose a font that distinguishes between the two (e.g. 0 has a dash or dot in its middle). Would that be acceptable in your application?
Another solution is to apply a dictionary-based step after the character-by-character analysis of the text - feeding the recognized text into some form of spell-checker or validator to differentiate between difficult characters.
For instance, a round symbol followed by other numbers is most likely to be a zero, while the same symbol followed by letters is most likely to be a capital O. It's a trivial example, but it shows how context is necessary to make a more reliable OCR system.
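That context rule can be sketched as a small post-processing step on the recognized text. This is a toy version under simple assumptions: it looks at each alphanumeric token, only handles the digit 0 and the capital letter O, and leaves genuinely mixed tokens (such as "A10") untouched. The function name is made up for the example.

```python
import re

def fix_zero_and_o(text):
    """Resolve 0/O confusion using the other characters in each token."""
    def fix_token(match):
        token = match.group(0)
        others = [char for char in token if char not in "0O"]
        if others and all(char.isdigit() for char in others):
            # Surrounded by digits: treat every round symbol as zero.
            return token.replace("O", "0")
        if others and all(char.isalpha() for char in others):
            # Surrounded by letters: treat every round symbol as capital O.
            return token.replace("0", "O")
        # No context, or mixed letters and digits: leave as recognized.
        return token
    return re.sub(r"[A-Za-z0-9]+", fix_token, text)
```

A real test suite would more likely compare against the known expected strings, or run the text through a validator for the specific field being checked.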