I have been trying to use CMU's Pocket Sphinx to perform speech recognition on an Android tablet.
The tutorial on doing this can be found here. My problem is that recognition runs really slowly if I use a grammar of any significant size. Using a language model, I can achieve good accuracy and speed, so my temporary solution has been to generate a language model from my grammar and use that.
In my configuration, I set -bestpath = false. After that, I am at a loss as to how to speed things up.
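For reference, here is roughly what my setup looks like, sketched with the desktop pocketsphinx-python bindings rather than the Android port (the model paths and grammar file below are placeholders; on Android the equivalent options are set when building the recognizer):

```python
# Sketch of the decoder configuration described above, using the desktop
# pocketsphinx-python bindings. Paths and the grammar file are placeholders.
from pocketsphinx.pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', 'model/en-us')                # acoustic model directory
config.set_string('-dict', 'model/cmudict-en-us.dict')  # pronunciation dictionary
config.set_string('-jsgf', 'commands.gram')             # the large JSGF grammar
config.set_boolean('-bestpath', False)                  # the flag I already disabled
decoder = Decoder(config)
```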
Clarification: I understand that a large grammar will take a long time to initialize, but I don't think it should take a long time for recognition to run using it.
Is there anyone with experience using Pocket Sphinx and a grammar who can share their experience, configuration, etc.?
We used PocketSphinx on a 1 GHz Android phone following tutorials available on the net (just do a Google search). It was quite quick to start up, but it hung for about 10 seconds after you stopped recording, even if you had only recorded two words. This was using the default "hub4" grammar.
What I'm trying to do is count the revving (the "vroom" sound) of a physical car through my app. I'm coding in React Native. I don't plan to build anything complex, like communicating with the car's built-in computer, to do this.
Instead, I was planning for the app to listen to nearby sounds: if a nearby sound is a rev, the app will simply count it.
I have built the other features of my app, but listening to the audio and detecting whether it's a "vroom" sound is what I'm stuck on.
Based on my research, I can see that I have to make use of the Fast Fourier Transform (FFT) algorithm, but I'm confused about how to implement it in my React Native app. I'm still searching for a package that has an implementation.
I have seen apps that can be used to tune a violin, guitar, etc. What I'm trying to do is similar to that, but simpler. Once I get the basic idea, I will be able to get going. In my case, the app will be listening for a high-decibel sound.
Any inputs would be highly appreciated.
This is known as acoustic event detection. You can treat it as an audio classification problem. The best way to solve it is with supervised machine learning, for example a CNN on mel-spectrograms. Here is an introduction. You can do the same in JavaScript using TensorFlow.js; the official documentation contains a tutorial.
One of the first steps is to collect a small dataset of examples of "vroom" sounds versus other loud non-vroom sounds.
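To make the mel-spectrogram + CNN idea concrete, here is a minimal sketch in Python with Keras and librosa (rather than TensorFlow.js); the clip length, input shape, and two-class setup are assumptions for illustration:

```python
# Minimal sketch: classify short audio clips as "vroom" vs. "other"
# using a small CNN on log-mel spectrograms. Assumes ~1 s clips at 22.05 kHz.
import numpy as np
import librosa
import tensorflow as tf

def clip_to_melspec(path, sr=22050, duration=1.0, n_mels=64):
    """Load a fixed-length clip and return an (n_mels, time, 1) log-mel spectrogram."""
    y, _ = librosa.load(path, sr=sr, duration=duration)
    y = np.pad(y, (0, max(0, int(sr * duration) - len(y))))  # pad short clips
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)[..., np.newaxis]

def build_model(input_shape):
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(16, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1, activation='sigmoid'),  # P(vroom)
    ])

# Training would look roughly like this, given labelled clip paths:
# X = np.stack([clip_to_melspec(p) for p in paths])
# y = np.array(labels)                                  # 1 = vroom, 0 = other
# model = build_model(X.shape[1:])
# model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(X, y, epochs=20, validation_split=0.2)
```

A model like this can later be converted for TensorFlow.js if you want it running inside the app itself.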
I am trying to build an outdoor smoke detector for the neighbors' chimneys.
I live in a neighborhood where a couple of houses still use wood-burning fireplaces and produce lots of smoke, and they do so during the daytime. When it is smoky outside, the kids' room sometimes has its windows open, the smoke gets in, and it is very hard to get it out. The worst part is that it is not illegal (yet), so I have found little help beyond talking to the neighbors and reacting quickly, in vain.
I am thinking of pointing an outdoor camera at the chimneys and detecting smoke, with a program that then sends a text message as an alert. Most of the time the image is pretty still, without much variation, so I imagine it shouldn't be too hard a classification problem? I have little experience with TensorFlow or machine learning, but I am a good programmer. So given some direction and an existing model, I hope I can get this working...
I know this sounds desperate; nevertheless, it is for a good cause. Please help.
For fire and smoke classification, you can check the following tutorial: https://www.pyimagesearch.com/2019/11/18/fire-and-smoke-detection-with-keras-and-deep-learning/.
PyImageSearch is a very good website for image processing; you can find many articles there that can help you (even on deploying neural networks on a Raspberry Pi and so on).
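If you want to see the overall shape of such a classifier before working through the tutorial, a transfer-learning sketch in Keras might look like the following (the class layout, image size, and directory names are assumptions; the tutorial itself uses its own architecture and dataset):

```python
# Sketch: binary smoke / no-smoke image classifier by fine-tuning MobileNetV2.
# Assumes camera frames sorted into data/smoke and data/clear directories.
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    'data', image_size=(224, 224), batch_size=32)

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights='imagenet')
base.trainable = False  # keep the pretrained features frozen at first

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation='sigmoid'),      # P(smoke)
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(train_ds, epochs=5)
```

A periodic job could then grab a frame from the camera, run `model.predict` on it, and send the text alert when the score crosses a threshold.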
A co-worker and I had an idea to create a little web game where a user enters a chunk of data about themselves, and then the application would write text for them that sounds like them in certain structures. (Trying to leave the idea a little vague.) We are both new to ML and thought this could be a fun first dive.
We have a decent background in PHP, JavaScript (front end and Node), Ruby, and a bit of other languages, and we have been interested in learning Python for ML. I'm curious whether you can run a cost-efficient ML library for text well behind a web app, given that most servers lack GPUs.
Perhaps you have to pay for one of the cloud-based systems, but we wanted to find the best entry point for this idea without racking up too much cost. (So far I have been reading about running PyTorch or TensorFlow, but it sounds like you lose a lot of efficiency running on CPUs.)
Thank you!
(My other thought is doing it via an iOS app and trying Apple's ML setup.)
It sounds like you are looking for something like TensorFlow.js.
Yes. Before jumping into training something with deep learning (which might even be unnecessary for your purpose), try to build a nice, simple baseline for this.
Before deep learning (just a few years ago), people did similar tasks using n-gram language models: https://web.stanford.edu/~jurafsky/slp3/3.pdf
Essentially, you try to predict the next few words probabilistically given a small context of n words (typically n is small, like 5 or 6).
This should be a lot of fun to work out and will do quite well even with a small amount of data. Also, such a model runs blazingly fast, so you don't have to worry about GPUs and compute.
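A toy version of such an n-gram generator in plain Python (no ML libraries at all; the corpus string and context size below are only placeholders) could look like this:

```python
# Toy n-gram text generator: learn next-word counts from a user's text,
# then sample continuations. Pure Python, no GPU needed.
import random
from collections import defaultdict, Counter

def train_ngram(text, n=3):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    words = text.split()
    model = defaultdict(Counter)
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])
        model[context][words[i + n - 1]] += 1
    return model

def generate(model, seed, length=20):
    out = list(seed)
    for _ in range(length):
        candidates = model.get(tuple(out[-len(seed):]))
        if not candidates:
            break
        words, counts = zip(*candidates.items())
        out.append(random.choices(words, weights=counts)[0])
    return ' '.join(out)

corpus = "the chunk of text the user entered about themselves goes here"  # placeholder
lm = train_ngram(corpus, n=3)
print(generate(lm, seed=("the", "chunk")))
```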
To improve on these results with deep learning, you'll need to collect a ton of data first, and it will take work to make it fast on a web-based platform.
Does needing only single-word voice recognition reduce the complexity of the task enough to perform the recognition fully offline on an iOS or Android smartphone? (E.g., could a reasonably accurate counter for the number of times a single, pre-programmed word was spoken while the microphone is active be built to work offline on a standard iOS or Android smartphone?)
I've found plenty of tools and examples capturing voice and sending it to an online service (e.g., the Google cloud voice-to-text), but does the single-word focus reduce the complexity enough for the recognition to be doable offline today? If so, do you have any libraries to suggest or where would you start?
Cloud services are good for various reasons relating to your question:
It makes deployment of new versions of the algorithm (which happen much more frequently than most people realize) a lot easier
It allows the developer to collect your data and use it in future algorithm development (or whatever they please)
From a practical standpoint, most deployed models (at least the effective ones) can be quite large and take up quite a bit of space on a mobile device.
In addition to the above, I don't think that the single-word focus changes much, if anything. The model has to account not just for words, but also for the different ways those words can be said (volume, tone, accent, inflection, etc.).
So what you are asking can be done, but there are also good reasons why it's done in the cloud.
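That said, for the narrow single-word case in the question, lightweight offline keyword spotting does exist, for example PocketSphinx's keyphrase mode. A rough desktop sketch follows (the model paths, keyword, audio source, and threshold are placeholders that would need tuning; on a phone you would feed microphone buffers instead of a file):

```python
# Rough sketch of offline single-keyword spotting with PocketSphinx.
from pocketsphinx.pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', 'model/en-us')               # acoustic model directory
config.set_string('-dict', 'model/cmudict-en-us.dict') # pronunciation dictionary
config.set_string('-keyphrase', 'activate')            # the single target word
config.set_float('-kws_threshold', 1e-20)              # looser = more detections
decoder = Decoder(config)

count = 0
decoder.start_utt()
with open('mic_capture.raw', 'rb') as f:                # 16 kHz, 16-bit mono PCM
    while True:
        buf = f.read(2048)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
        if decoder.hyp() is not None:                   # keyword detected
            count += 1
            decoder.end_utt()
            decoder.start_utt()
decoder.end_utt()
print('keyword heard', count, 'times')
```

The detection threshold still has to be tuned per word and environment, which is part of why the cloud services tend to do better out of the box.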
I'm looking into designing a software platform that will aid linguists and anthropologists in their study of previously unstudied languages. Statistics show that around 1,000 languages exist that have never been studied by a person outside of their respective speaker groups.
My goal is to utilize TensorFlow to make a platform that will allow linguists to study and document these languages more efficiently, and to help them create writing systems for the ones that don't already have one. One of their current methods of accomplishing such a task is three-fold: 1) record a native speaker conversing in the language, 2) listen to that recording and try to transcribe it into the IPA, 3) from the phonetics, analyze the phonemics and phonotactics of the language to eventually create a writing system for the speakers.
My proposed platform would cut that research time down from a minimum of a year to a maximum of six months. Before I start, I have some questions...
What would be required to train TensorFlow to transcribe live audio into the IPA? Has this already been done? And if so, how would I utilize a previous solution for this project? Is a project like this even possible with TensorFlow? If not, what would you recommend using instead?
My apologies for the magnitude of this question. I don't have much experience in the realm of machine learning, as I am just beginning the research process for this project. Any help is appreciated!
I guess I will take a first shot at answering this. Since the question is pretty general, my answer will have to be pretty general as well.
What would be required? At the very least, you would need a large dataset of pre-transcribed data: ideally a large amount of spoken-language audio mapped to characters of the phonetic alphabet, so the system could learn the sound of individual characters rather than whole transcribed words. If such a dataset doesn't exist, a less granular dataset could be used, mapping single words to their transcriptions. Then you would need a model, that is, the actual neural network architecture implemented in code. And lastly you would need some computing resources. This is not something you can train casually; you would either have to buy some time on a cloud-based machine learning platform (like Google Cloud ML) or build a fairly expensive machine to train at home.
Has this been done? I don't know; I don't think so. There have been published papers reporting various degrees of success at training systems to transcribe speech. Here is one, for example: http://deeplearning.stanford.edu/lexfree/lexfree.pdf. Since the alphabet you want to transcribe to is specifically designed to capture the way words sound, rather than just write down the words, you might have more success training such a model.
Is it possible with TensorFlow? Yes, most likely. TensorFlow is well suited to implementing most modern deep learning architectures. Unless you end up designing some really weird and very original model for this purpose, TensorFlow should work just fine.
Edit: after some thought on part 1, you would have to use a dataset mapping whole spoken words to their transcriptions, since I expect that a sound pronounced in isolation differs from the same sound used within a word.
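To make the "model" step above concrete, here is a minimal sketch of the kind of architecture commonly used for this sort of task: a stack of recurrent layers over audio features with a CTC loss over IPA symbols. The feature dimension, layer sizes, and symbol inventory are placeholder assumptions, not a tested recipe:

```python
# Sketch: audio features in, per-frame distribution over IPA symbols out,
# trained with CTC so the frame sequence aligns to the shorter transcription.
import tensorflow as tf

NUM_IPA_SYMBOLS = 100   # placeholder size of the phone inventory
FEATURE_DIM = 80        # e.g. 80 log-mel filterbank features per frame

inputs = tf.keras.Input(shape=(None, FEATURE_DIM))         # variable-length audio
x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(256, return_sequences=True))(inputs)
x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(256, return_sequences=True))(x)
logits = tf.keras.layers.Dense(NUM_IPA_SYMBOLS + 1)(x)     # +1 for the CTC blank
model = tf.keras.Model(inputs, logits)

def ctc_loss(labels, logits, label_len, logit_len):
    """labels: [batch, max_label_len] integer IPA ids; lengths are per example."""
    return tf.reduce_mean(tf.nn.ctc_loss(
        labels=labels, logits=logits,
        label_length=label_len, logit_length=logit_len,
        logits_time_major=False, blank_index=NUM_IPA_SYMBOLS))
```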
This has actually been done, albeit in PyTorch, by a group at CMU: https://github.com/xinjli/allosaurus
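If I remember its README correctly, basic usage looks roughly like this (double-check against the repository, since the exact API may have changed):

```python
# Rough usage sketch for allosaurus, the universal phone recognizer linked above.
from allosaurus.app import read_recognizer

model = read_recognizer()             # loads the default pretrained model
print(model.recognize('sample.wav'))  # prints a space-separated string of IPA phones
```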