On-device single-word voice recognition

Does needing to recognize just a single word reduce the complexity of the task enough that voice recognition can be performed fully offline, on an iOS or Android smartphone? (E.g., could a reasonably accurate counter for the number of times a single, pre-programmed word is spoken while the microphone is active be built to work offline on a standard iOS or Android smartphone?)
I've found plenty of tools and examples that capture voice and send it to an online service (e.g., Google Cloud Speech-to-Text), but does the single-word focus reduce the complexity enough for the recognition to be doable offline today? If so, which libraries would you suggest, or where would you start?

Cloud services are popular for several reasons related to your question:
It makes deploying new versions of the algorithm (which happens much more frequently than most people realize) a lot easier.
It allows the developer to collect your data and use it in future algorithm development (or whatever they please).
From a practical standpoint, most deployed models (at least the effective ones) can be quite large and take up quite a bit of space on a mobile device.
In addition to the above, I don't think the single-word focus changes much, if anything. The model has to account not just for the word itself, but also for the different ways it can be said (volume, tone, accent, inflection, etc.).
So what you're asking can be done, but there are also good reasons why it's usually done in the cloud.
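To make the offline option concrete, here is a minimal sketch of counting one keyword with the Vosk offline recognizer. Vosk is my suggestion, not something from the question; its small models run entirely on-device and it has Android and iOS bindings, though the sketch below uses the Python API for brevity, and the keyword, model path, and audio file are placeholders.

```python
import json
import wave

from vosk import Model, KaldiRecognizer   # offline recognizer (pip install vosk)

KEYWORD = "banana"                         # placeholder pre-programmed word
model = Model("model")                     # path to a small downloaded Vosk model
wav = wave.open("mic_capture.wav", "rb")   # 16 kHz mono PCM capture (placeholder)
rec = KaldiRecognizer(model, wav.getframerate())

count = 0
while True:
    data = wav.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        text = json.loads(rec.Result()).get("text", "")
        count += text.split().count(KEYWORD)

# Count anything left in the recognizer's buffer at the end of the audio.
count += json.loads(rec.FinalResult()).get("text", "").split().count(KEYWORD)
print(f"'{KEYWORD}' was spoken {count} times")
```

On a phone you would feed microphone buffers instead of a WAV file, and a dedicated keyword-spotting model (e.g., a small TensorFlow Lite model trained only on the target word) can be lighter still than a general-purpose recognizer.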

Related

Do game developers build custom game engines just for a single game or a game franchise?

I am aware that game engines like Unity, Unreal, and CryEngine provide almost all the tools necessary to build an AAA title. They're also the best choice if the game has a tight release date or if you're new to game development. But since they are generalist engines (meaning they are made to fit multiple genres of games - correct me if I'm wrong), for some games (next-gen games, or games that require a lot of performance) they might leave some performance on the table, something which could be recovered by developing a custom engine.
This brings me to my question,
Do game developers (indie developers, large teams, or even companies) still build game engines from scratch to tailor-fit a game or a game franchise?
Thank you!
When we talk about big companies like Ubisoft or Rockstar, they build their own engines rather than using Unity or Unreal.
Rockstar uses the "Rockstar Advanced Game Engine"
and Ubisoft uses "AnvilNext".
But why?
There are many reasons they do such a thing; I'm gonna mention just two from #scremyCat:
support
and licensing.
Support: Highest degree of support and understanding. Since they built it all, they understand all of its internals and can offer complete support. E.g., if a game needs feature X, they'll easily know whether they can implement it or not. Another benefit is not having to wait on external entities: if there's a game-breaking bug in the engine they can get right on it, whereas with a third-party engine, depending on the licensing agreement, this might not be possible (though they would typically license the source code anyway).
Licensing: As an indie developer, accepting that you might have to pay a small percentage of your revenue to license the engine might not be much of an issue: the amount you need to break even is unlikely to be very high, chances are you're already making money by the time your revenue reaches the level where a percentage is owed, and your total revenue from a game isn't likely to be huge anyway, so the amount you pay in licensing fees may seem very reasonable. Meanwhile, a AAA game will have a much higher break-even target, and its expected revenue is most definitely in the tens to hundreds of millions, which means paying a large amount in licensing fees. It should be said that they usually get much better licensing deals to begin with than the indie dev gets, but they are still paying huge amounts.
As for the timeframe, it can take years to fully develop an engine of that scale, which is often why you'll see them using the same version of the engine for a good cycle of games while working on the next version. And as for what's involved: a LOT. They need to handle every platform they'll be targeting, the rendering, the physics, the AI, the audio, the input, the file-system access, the asset-management pipeline, the tools, etc.
How are they better than currently popular engines? They aren't necessarily (to other developers), but to the studios themselves, with their own reasons for doing it, they are. The simplest answer to how they can be better is that when you're creating your own engine from scratch you can do whatever you want.
It should also be said that developing your own engine isn't limited to large game companies; a number of smaller developers do it too. The most common reasons are that they enjoy it, or that they want some functionality that isn't available in existing options. E.g., while you can create many games with Unity or Unreal, there are plenty of things which just aren't feasible, or would take considerable work to even make possible. That can be reason enough for a smaller dev to make their own engine.
Yes, they absolutely do. Nintendo is a good example.

IPA (International Phonetic Alphabet) Transcription with Tensorflow

I'm looking into designing a software platform that will aid linguists and anthropologists in their study of previously unstudied languages. Statistics show that around 1,000 languages exist that have never been studied by a person outside of their respective speaker groups.
My goal is to utilize TensorFlow to make a platform that will allow linguists to study and document these languages more efficiently, and to help them create writing systems for the ones that don't already have one. One of their current methods of accomplishing such a task is three-fold: 1) record a native speaker conversing in the language, 2) listen to that recording and try to transcribe it into the IPA, 3) from the phonetics, analyze the phonemics and phonotactics of the language to eventually create a writing system for the speakers.
My proposed platform would cut that research time down from a minimum of a year to a maximum of six months. Before I start, I have some questions...
What would be required to train TensorFlow to transcribe live audio into the IPA? Has this already been done? And if so, how would I utilize a previous solution for this project? Is a project like this even possible with TensorFlow? If not, what would you recommend using instead?
My apologies for the magnitude of this question. I don't have much experience in the realm of machine learning, as I am just beginning the research process for this project. Any help is appreciated!
I guess I will take a first shot at answering this. Since the question is pretty general, my answer will have to be pretty general as well.
What would be required? At the very least you would need a large dataset of pre-transcribed data: ideally a large amount of spoken-language audio mapped to characters in the phonetic alphabet, so the system could learn the sound of individual characters rather than whole transcribed words. If such a dataset doesn't exist, a less granular dataset could be used, mapping single words to their transcriptions. Then you would need a model, that is, the actual neural network architecture implemented in code. And lastly you would need some computing resources. This is not something you can train casually; you would either have to buy some time on a cloud-based machine learning platform (like Google Cloud ML) or build a fairly expensive machine to train at home.
Has this been done? I don't know; I don't think so. There have been published papers reporting various degrees of success at training systems to transcribe speech. Here is one, for example: http://deeplearning.stanford.edu/lexfree/lexfree.pdf. Since the alphabet you want to transcribe into is specifically designed to capture the way words sound, rather than just how they are written, you might have more success training such a model.
Is it possible with TensorFlow? Yes, most likely. TensorFlow is well suited for implementing most modern deep learning architectures. Unless you end up designing some really weird and very original model for this purpose, TensorFlow should work just fine.
Edit: after some thought on part 1, you would probably have to use a dataset mapping spoken words to their transcriptions, since I expect that a sound pronounced in isolation differs from the same sound used within a word.
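To make the "model" piece more concrete, here is a minimal sketch (not a recommendation) of the kind of architecture commonly used for this: a recurrent acoustic model trained with the CTC loss, so it only needs (audio, IPA transcription) pairs rather than per-frame alignments. The phoneme-inventory size, feature dimension, and layer widths below are placeholder assumptions.

```python
import tensorflow as tf

NUM_PHONEMES = 60   # placeholder size of the IPA symbol inventory
NUM_MEL_BINS = 80   # log-mel features per audio frame (placeholder)

# Frames of audio features in, per-frame symbol scores out.
inputs = tf.keras.Input(shape=(None, NUM_MEL_BINS))
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True))(inputs)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True))(x)
logits = tf.keras.layers.Dense(NUM_PHONEMES + 1)(x)   # +1 for the CTC "blank" label
model = tf.keras.Model(inputs, logits)

def ctc_loss(labels, logits, label_length, logit_length):
    # CTC aligns the unsegmented IPA label sequence to the audio frames,
    # so the dataset only needs (audio, transcription) pairs, not timings.
    return tf.reduce_mean(tf.nn.ctc_loss(
        labels=labels, logits=logits,
        label_length=label_length, logit_length=logit_length,
        logits_time_major=False, blank_index=NUM_PHONEMES))
```

Decoding at inference time would use greedy or beam-search CTC decoding; the hard part, as noted above, is assembling enough transcribed audio to train on.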
This has actually been done, albeit in PyTorch, by a group at CMU: https://github.com/xinjli/allosaurus

Building GIS apps from scratch?

I am a complete beginner in software and I am asking for a direction to proceed in researching technologies to build my app. I just have an idea for the app so far. I am trying to build something like Zomato but with different services; the idea of a location-based system is similar. I searched online and came to know about GIS systems. But while researching further, it seems I'd have to create a map altogether, which feels redundant to build since we have the Google Maps API.
But can I use this API to build a system "on" it?
Any tutorials or some direction here would be helpful.
Also, what is the difference between GIS-based and GPS-based apps?
As you can see, I am not very clear on the fundamentals of GIS- and GPS-based apps.
Thanks for the help.
Regarding Android, you have almost all you need by combining the platform API with the comprehensive Google Maps Android API. Regarding the latter, it's really a matter of opting for convenience (and possibly paying a licence fee to Google) versus developing your own solution by aggregating free or cheaper services from elsewhere.
Most problems solved by apps are not the same problems solved by classical GIS software, since the former are more consumer-oriented (using public transportation, navigating a route, planning a trip, finding a nearby restaurant) and the latter are more specialist-oriented, typically solving larger-scale and more technical issues (detecting regions with flood risk, monitoring deforestation, calculating volumes of terrain to be bulldozed, etc.).
You should not, IMO, be discouraged by the seemingly hard technical concepts of geography and map making. Your best bet is to have a clear vision of what actual problems your app should be solving, and to study the geography topics gradually, as the need arises.
A bit of consideration on your question about GIS:
If it were created today, the acronym GIS would mean any software dealing with geographic data, be it a mobile app or a workstation software suite intended for specialized professional use.
But when the term was created, it meant almost exclusively the latter, and so it carries a lot of tradition and cultural legacy - which is of course not always a good thing. Specifically (at least in my experience), the jargon and concepts used by the classic GIS community are a bit impenetrable to the newcomer, especially if she comes from the software-development field instead of the geosciences.
But geographic information availability has gone from scarcity to overwhelming abundance, and so have its enabling technologies: GPS satellites, mobile computing and mobile connectivity.

API to break voice into phonemes / synthesize new speech given speech samples?

You know those movies where the tech geeks record someone's voice, and their software breaks it into phonemes? Which they can then use to type in any phrase, and make it seem as if the target is saying it?
Does that software exist as an API? I don't even know what to Google.
There is no such software. Breaking arbitrary speech into its constituent phonemes is only a partially solved problem: speech-to-text software is still imperfect, as is text-to-speech.
The idea is to reproduce the timbre of the target's voice. Even if you were able to segment the audio perfectly, reordering the phonemes would produce audio with unnatural cadence and intonation, not to mention splicing artifacts. At that point you're getting into smoothing, time-scaling, and pitch correction, all of which are possible and well-understood in theory, but operate poorly on real-world data, especially when the audio sample in question is as short as a single phoneme, and further when the timbre needs to be preserved.
These problems are compounded on the phonetic side by allophonic variation in sounds based on accent and surrounding phonemes; in order to faithfully produce even a low-quality approximation of the audio, you'd need a detailed understanding of the target's language, accent, and speech patterns.
Furthermore, your ultimate problem is one of social engineering, and people are not easy to fool when it comes to the voices of people they know. Even with a large corpus of input data, at best you could get a short low-quality sample, hardly enough for a conversation.
So while it's certainly possible, it's difficult; even if it existed, it wouldn't always be good enough.
SRI International (the company that created Siri for iOS) has an SDK called EduSpeak, which will take audio input and break it down into individual phonemes. I know this because I sat through a demo of the product about a week ago. During the demo, the presenter showed us an application that was created using the SDK. The application gave a few lines of text for the presenter to read. After reading the text, the application displayed a bar chart where each bar represented a phoneme from his speech. The height of each bar represented a score of how well each phoneme was pronounced (the presenter was not a native English speaker, so he received lower scores on certain phonemes compared to others). The presenter could also click on each individual bar to have only that individual phoneme played back using the original audio.
So yes, software exists that divides audio up by phoneme, and it does a very good job of it. Now, whether or not those phonemes can be re-assembled into speech is an open question. If we end up getting a trial version of the SDK, I'll try it out and let you know.
If your aim is to mimic someone else's voice, then another approach is to convert your own voice (instead of assembling phonemes). It is (surprisingly) called voice conversion, e.g. http://www.busim.ee.boun.edu.tr/~speech/projects/Voice_Conversion.htm
The technology is called "voice synthesis" and "voice recognition".
The Java API for this can be found here: Java voice JSAPI.
Apple has an API for this: Apple speech.
Microsoft has several; one is discussed here: Vista speech.
Lyrebird is a start-up that is working on this very problem. Given samples of a person's voice and some written text, it can synthesize a spoken version of that written text in the voice of the person in the samples.
You can get interesting voice warping effects with a formant-aware pitch shift. Adobe Audition has a pretty good implementation. Antares produces some interesting vocal effects VST plugins.
These techniques use some form of linear predictive coding (LPC) to treat the voice as a source-filter model. LPC works on speech signals by estimating the resonance of the vocal tract (formant), reversing its effect with an inverse filter, and then coding the resulting residual signal. The residual signal is ideally an impulse train that represents the glottal impulse. This allows the scaling of pitch and formants independently, which leads to a much better gender conversion result than simple pitch shifting.
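As a rough sketch of that source-filter idea (not Audition's or Antares's actual implementation, just an assumption of how one might prototype it), here is one frame of speech split into an LPC filter and a residual and then resynthesized; the file name, frame position, and LPC order are placeholders.

```python
import librosa                 # audio loading and LPC estimation
import scipy.signal as sig

# Load speech and take one short analysis frame (placeholder file and offset).
y, sr = librosa.load("speech.wav", sr=16000)
frame = y[4000:4000 + 1024]

order = 16                             # LPC order, roughly sr/1000 plus a few
a = librosa.lpc(frame, order=order)    # coefficients of A(z); vocal tract is 1/A(z)

# Inverse-filter with A(z) to recover the residual (approximate glottal source).
residual = sig.lfilter(a, [1.0], frame)

# Resynthesize by driving the all-pole filter 1/A(z) with the residual.
# Modifying the residual (pitch) or the coefficients (formants) independently
# before this step is what enables formant-aware pitch shifting.
resynth = sig.lfilter([1.0], a, residual)
```

In practice this would be done frame by frame with windowing and overlap-add rather than on a single isolated frame.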
I dunno about a commercially available solution, but the concept isn't entirely out of the range of possibility. For example, the University of Delaware has fairly decent software for doing just that.
http://www.modeltalker.com

Ideas for a distributed processing project?

I am looking for a project idea in distributed processing on Unix-based systems. I wish to use only the C programming language. I have to finish the project in 4 months, and it's part of my coursework. Can someone help me with an idea?
Cryptography problems
Distributed Ray Tracer
Chess AI (really, AI for any game)
Large Prime Number Search
Web crawler or other search mechanism
Generic Problem Solver (push out problem definition on the fly, followed by problem data).
Note on the last one:
An example would be if you have a gaming website with lots of board games that you release all the time. You don't want to have to install new clients on all your servers every time you write a new AI for a board game, so you have a program to which you can push new AIs, and after that you can just send the game data and the pushed AI will be used to solve the problem. This is best used for problems that can be broken into smaller chunks.
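Purely to illustrate the shape of that "push the solver, then stream the data" idea (a toy sketch in Python for brevity; the real coursework would be in C, and the line-delimited JSON message format here is my own invention), a worker could accept new solver code over a socket and then apply whichever solver each subsequent data message names:

```python
import json
import socket

def run_worker(host="127.0.0.1", port=5000):
    """Toy worker: first receives solver definitions, then data chunks to solve."""
    solvers = {}
    srv = socket.create_server((host, port))
    conn, _ = srv.accept()
    with conn, conn.makefile("rw") as stream:
        for line in stream:                          # one JSON message per line
            msg = json.loads(line)
            if msg["type"] == "solver":
                ns = {}
                exec(msg["source"], ns)              # register the pushed solver code
                solvers[msg["name"]] = ns["solve"]   # (we trust the controller here)
            elif msg["type"] == "data":
                result = solvers[msg["name"]](msg["payload"])
                stream.write(json.dumps({"name": msg["name"], "result": result}) + "\n")
                stream.flush()

if __name__ == "__main__":
    run_worker()
```

A controller would connect, send one "solver" message containing a solve(payload) definition, and then stream "data" chunks; in C the analogous trick is usually to push a shared library and dlopen it on the worker.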
It is hard to answer without knowing anything about performance, the scale of the project, what you are trying to accomplish, etc. For example, is it one task or multiple tasks? Is the project just totally open?
4 months is pretty short, but maybe some kind of physics problem or math problem. Sorting or some kind of database work might be dull but beneficial.
Check out MapReduce for ideas! I was really inspired by that work, personally.
We used distributed processing here at work, but it's such a broad field... yeah.
Why not write a distributed compiler? You could then present an interface for people to compile things on the fly, and the jobs would be passed to your distributed compile network. Java is probably well suited, and you'll get to do fun things, like being very mindful of security and so on.
The BOINC project is always looking for help and is very interesting:
http://boinc.berkeley.edu/
If you want to leave your mark and change the way we search the web, look into B-Trees.
B-Trees and their offspring/variants are the workhorses of the internet.
Google uses them extensively to index the web.
Database indexes are B-Tree offspring/variants.
Every LAMP system uses a database and indexes.
They are also used extensively in distributed VLDBs (Very Large DataBases).
Perhaps you can improve existing distributed databases (Cassandra and HBase).
These are lofty goals, but for me, this would leave a lasting mark on the way Web data is processed, indexed, and stored.
Write a distributed, fault-tolerant, redundant network B+Tree or B*Tree.
Read Drozdek's book Data Structures and Algorithms in C++; it's a good survey of B-Trees.
Read about skip trees: http://www.cs.huji.ac.il/~ittaia/papers/AAY-OPODIS05.pdf
Read about efficient B-tree-based indexing for cloud data processing: http://www.comp.nus.edu.sg/~ooibc/vldb10-cgindex.pdf
Google-search "Network B+Tree":
https://www.google.com/search?rlz=1C1CHKZ_enUS431US431&sourceid=chrome&ie=UTF-8&q=Network+B%2BTree