API to break voice into phonemes / synthesize new speech given speech samples?

You know those movies where the tech geeks record someone's voice and their software breaks it into phonemes, which they can then use to type in any phrase and make it seem as if the target is saying it?
Does that software exist as an API? I don't even know what to Google.

There is no such software. Breaking arbitrary speech into its constituent phonemes is only a partially solved problem: speech-to-text software is still imperfect, as is text-to-speech.
The idea is to reproduce the timbre of the target's voice. Even if you were able to segment the audio perfectly, reordering the phonemes would produce audio with unnatural cadence and intonation, not to mention splicing artifacts. At that point you're getting into smoothing, time-scaling, and pitch correction, all of which are possible and well-understood in theory, but operate poorly on real-world data, especially when the audio sample in question is as short as a single phoneme, and further when the timbre needs to be preserved.
These problems are compounded on the phonetic side by allophonic variation in sounds based on accent and surrounding phonemes; in order to faithfully produce even a low-quality approximation of the audio, you'd need a detailed understanding of the target's language, accent, and speech patterns.
Furthermore, your ultimate problem is one of social engineering, and people are not easy to fool when it comes to the voices of people they know. Even with a large corpus of input data, at best you could get a short low-quality sample, hardly enough for a conversation.
So while it's certainly possible, it's difficult; even if it existed, it wouldn't always be good enough.

SRI International (the company that created Siri for iOS) has an SDK called EduSpeak, which will take audio input and break it down into individual phonemes. I know this because I sat through a demo of the product about a week ago. During the demo, the presenter showed us an application that was created using the SDK. The application gave a few lines of text for the presenter to read. After reading the text, the application displayed a bar chart where each bar represented a phoneme from his speech. The height of each bar represented a score of how well each phoneme was pronounced (the presenter was not a native English speaker, so he received lower scores on certain phonemes compared to others). The presenter could also click on each individual bar to have only that individual phoneme played back using the original audio.
So yes, software exists that divides audio up by phoneme, and it does a very good job of it. Now, whether or not those phonemes can be re-assembled into speech is an open question. If we end up getting a trial version of the SDK, I'll try it out and let you know.

If your aim is to mimic someone else's voice, then another approach is to convert your own voice (instead of assembling phonemes). It is (unsurprisingly) called voice conversion, e.g., http://www.busim.ee.boun.edu.tr/~speech/projects/Voice_Conversion.htm

The technologies are called "voice synthesis" and "voice recognition".
The Java API for this is JSAPI (the Java Speech API).
Apple has a speech API for this.
Microsoft has several; one is the speech API discussed for Windows Vista.

Lyrebird is a start-up that is working on this very problem. Given samples of a person's voice and some written text, it can synthesize a spoken version of that written text in the voice of the person in the samples.

You can get interesting voice warping effects with a formant-aware pitch shift. Adobe Audition has a pretty good implementation. Antares produces some interesting vocal effects VST plugins.
These techniques use some form of linear predictive coding (LPC) to treat the voice as a source-filter model. LPC works on speech signals by estimating the resonance of the vocal tract (formant), reversing its effect with an inverse filter, and then coding the resulting residual signal. The residual signal is ideally an impulse train that represents the glottal impulse. This allows the scaling of pitch and formants independently, which leads to a much better gender conversion result than simple pitch shifting.
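The source-filter idea above can be sketched in a few lines of plain Python. This is only an illustration, not how Audition or the Antares plugins are implemented: it assumes the autocorrelation method with the Levinson-Durbin recursion (one common way to estimate LPC coefficients), and the function names are made up for this example.

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation r[0..max_lag] of one (ideally windowed) speech frame."""
    n = len(signal)
    return [sum(signal[i] * signal[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc_coefficients(signal, order):
    """Levinson-Durbin recursion (assumed estimation method).

    Returns a[0..order] with a[0] == 1, so that the prediction error is
    e[n] = sum_k a[k] * x[n - k]. The all-pole filter 1/A(z) models the
    vocal-tract resonances (formants).
    """
    r = autocorrelation(signal, order)
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:  # silent or degenerate frame: stop early
            break
        acc = sum(a[j] * r[i - j] for j in range(i))
        k = -acc / error  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        error *= (1.0 - k * k)
    return a


def inverse_filter(signal, a):
    """Apply A(z) to the signal, removing the formant structure.

    What is left is the residual, ideally close to the glottal impulse train.
    """
    order = len(a) - 1
    return [sum(a[k] * signal[n - k] for k in range(order + 1) if n - k >= 0)
            for n in range(len(signal))]
```

With the filter and the residual separated like this, the residual can be resampled to change pitch and the filter can be warped to move the formants, which is what makes the two independently adjustable.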

I dunno about a commercially available solution, but the concept isn't entirely out of the range of possibility. For example, the University of Delaware has fairly decent software for doing just that.
http://www.modeltalker.com

Related

Best way to create text to speech voice variant

I need at least 3 or 4 different TTS voices, but unfortunately I have only one.
That is because I have only one Italian neural voice (Diego); the others are all standard voices, and their quality is much worse.
The final objective is to create a voice-over for at least 3 or 4 people, and I can't use the exact same voice for all of them.
For this reason, I would like to create some variants of the one neural voice I have that give the impression of different speakers without sounding unnatural.
Currently I have Adobe Audition, Audacity, Ircam Trax, and ffmpeg, and apart from these I can use SSML through an API (in this case Microsoft Azure).
I don't know which effects to use, or how strongly to apply them, without damaging the voice.
In short, what is the best way to do this with the software I have, or with other software if it would give better results?
Thanks!

What language are you using? If you are using English, I am sure you can find more than 3-4 neural voices. There are en-US, en-GB, en-CA, and en-AU neural voices, and all of them sound natural.
You can also tune the pitch using SSML to make the voice sound different.
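As a rough sketch of the SSML route, the helper below wraps text in a `prosody` element with a pitch and rate offset. The function name, the voice name `it-IT-DiegoNeural`, and the offset values are assumptions for illustration; check which units and ranges your TTS service accepts before relying on them.

```python
from xml.sax.saxutils import escape, quoteattr


def make_variant_ssml(text, voice="it-IT-DiegoNeural", lang="it-IT",
                      pitch="+0st", rate="+0%"):
    """Build an SSML document that shifts pitch and rate of a single voice.

    The voice name and default values are assumptions, not taken from the
    question; only the <speak>/<voice>/<prosody> structure is standard SSML.
    """
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(voice)}>'
        f'<prosody pitch={quoteattr(pitch)} rate={quoteattr(rate)}>'
        f'{escape(text)}'
        '</prosody></voice></speak>'
    )


# Three hypothetical "speakers" derived from one voice.
variants = {
    "speaker_a": make_variant_ssml("Buongiorno a tutti."),
    "speaker_b": make_variant_ssml("Buongiorno a tutti.", pitch="-2st", rate="-5%"),
    "speaker_c": make_variant_ssml("Buongiorno a tutti.", pitch="+1st", rate="+4%"),
}
```

Small offsets tend to be safer than large ones; big pitch shifts are usually where a synthetic voice starts to sound processed.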
If you would like to create different voices, try customvoice.ai with your own speech data (or your voice talents).
Or, what particular variations are you looking for?

On-device single-word voice recognition

Does needing just a single word voice recognition reduce the complexity of the task enough to be able to fully perform voice recognition processing offline, on an iOS or Android smartphone? (E.g., could a reasonably accurate counter for the number of times that a single, pre-programmed word was spoken while the microphone is active be developed to work offline on a standard iOS or Android smartphone?).
I've found plenty of tools and examples capturing voice and sending it to an online service (e.g., the Google cloud voice-to-text), but does the single-word focus reduce the complexity enough for the recognition to be doable offline today? If so, do you have any libraries to suggest or where would you start?

Cloud services are good for various reasons relating to your question:
It makes deployment of new versions of the algorithm (which happen much more frequently than most people realize) a lot easier
It allows the developer to collect your data and use it in future algorithm development (or whatever they please)
From a practical standpoint, most deployed models (at least the effective ones) can be quite large and take up quite a bit of space on a mobile device.
In addition to the above, I don't think that the singular word focus changes much, if anything. The model has to not just account for words, but also for the different ways those words can be said (volume, tone, accents, inflection, etc, etc).
So what you are asking can be done, but there are also good reasons why it is done in the cloud.

Robot odometry in labview

I am currently working on a (school) project involving a robot that has to navigate a corn field.
We need to write the complete software in NI LabVIEW.
Because of the tasks the robot has to perform, it has to know its position.
As sensors we have a 6-DOF IMU, some unreliable wheel encoders, and a 2D laser scanner (SICK TIM351).
So far I have been unable to find any algorithms or tutorials, and I am really stuck on this problem.
I am wondering whether anyone has ever attempted to make SLAM work in LabVIEW, and if so, whether there are any examples or explanations of how to do this.
Or is there perhaps a toolkit for LabVIEW that contains this function/algorithm?
Kind regards,
Jesse Bax
3rd year mechatronic student

As Slavo mentioned, there's the LabVIEW Robotics module, which contains algorithms like A* for pathfinding. But as far as I am aware, there's not very much in it that can help you solve the SLAM problem. The SLAM problem consists of the following parts: landmark extraction, data association, state estimation, and state update.
For landmark extraction, you have to pick one or more features that you want the robot to recognize, for example a corner or a line (a wall in 3D). You can use clustering, split-and-merge, or the RANSAC algorithm. I believe your laser scanner extracts the points and stores them in a list sorted by angle, which makes the split-and-merge algorithm very feasible. RANSAC is the most accurate of them, but it also has higher complexity. I recommend starting with some ideal data points for testing the line extraction. For example, put your laser scanner in a small room with straight walls, perform one scan, and save it to an array or a file. Make sure the contour is a bit more complex than just four walls, and remove noise either before or after measurement.
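To make the split half of split-and-merge concrete, here is a minimal Python sketch (LabVIEW is graphical, so treat it as pseudocode to port to a SubVI or MathScript). It assumes the scan arrives as a list of ranges at evenly spaced angles; the function names and the threshold are made up for this example, and the merge step (joining adjacent, nearly collinear segments) is left out.

```python
import math


def scan_to_points(ranges, start_angle, angle_step):
    """Convert polar scan readings (sorted by angle) to (x, y) points."""
    return [(r * math.cos(start_angle + i * angle_step),
             r * math.sin(start_angle + i * angle_step))
            for i, r in enumerate(ranges)]


def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / length


def split(points, threshold):
    """Recursively split an ordered point list into line segments.

    Assumes at least one point. Each segment is returned as a pair of
    end points (start, end).
    """
    if len(points) < 3:
        return [(points[0], points[-1])]
    distances = [point_line_distance(p, points[0], points[-1])
                 for p in points[1:-1]]
    worst = max(range(len(distances)), key=distances.__getitem__)
    if distances[worst] <= threshold:
        return [(points[0], points[-1])]  # all points fit one line
    pivot = worst + 1  # index of the farthest point in `points`
    return split(points[:pivot + 1], threshold) + split(points[pivot:], threshold)
```

The end points where two segments meet are corner candidates, which can serve as landmarks alongside the lines themselves.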
I haven't read up on good methods for data association, but you could for example just consider a landmark new if it is a certain distance away from any existing landmarks or update an old landmark if not.
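The distance rule described above could look like the following Python sketch. The gate distance and the blending factor used to update an existing landmark are assumptions chosen for illustration; both would need tuning against real scans.

```python
import math


def associate(observation, landmarks, gate):
    """Return the index of the nearest landmark within `gate`, else None."""
    best_index, best_distance = None, gate
    for i, (lx, ly) in enumerate(landmarks):
        d = math.hypot(observation[0] - lx, observation[1] - ly)
        if d < best_distance:
            best_index, best_distance = i, d
    return best_index


def update_map(observation, landmarks, gate=0.5, blend=0.5):
    """Add a new landmark or nudge the matched one toward the observation.

    `gate` (same length unit as the points) and `blend` (0..1) are
    hypothetical tuning values. Returns the index of the landmark used.
    """
    index = associate(observation, landmarks, gate)
    if index is None:
        landmarks.append(observation)
        return len(landmarks) - 1
    lx, ly = landmarks[index]
    landmarks[index] = (lx + blend * (observation[0] - lx),
                        ly + blend * (observation[1] - ly))
    return index
```

If an EKF is used later, the blending step would be replaced by the filter's own update, but the gating logic stays the same.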
State estimation and state updating can be achieved with a complementary filter or the Extended Kalman Filter (EKF). The EKF is the de facto standard for nonlinear state estimation [1] and tends to work very well in practice. The theory behind the EKF is quite tough, but it should be a tad easier to implement. I would recommend using the MathScript module if you are going to program an EKF. The point of these two filters is to estimate the position of the robot from the wheel encoders and the landmarks extracted from the laser scanner.
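Of the two filters, the complementary filter is the simpler one to sketch. The Python below assumes a differential-drive style dead-reckoning step from the encoders and a pose fix derived from matched landmarks; the function names and the weight `alpha` are illustrative assumptions, and an EKF would replace the fixed weight with a gain computed from the covariances.

```python
import math


def wrap_angle(theta):
    """Wrap an angle to the interval [-pi, pi]."""
    return math.atan2(math.sin(theta), math.cos(theta))


def predict_pose(pose, distance, dtheta):
    """Dead-reckon the pose (x, y, theta) from encoder increments.

    `distance` is the distance travelled and `dtheta` the heading change
    since the last step, both derived from the wheel encoders.
    """
    x, y, theta = pose
    mid = theta + 0.5 * dtheta  # heading halfway through the step
    return (x + distance * math.cos(mid),
            y + distance * math.sin(mid),
            wrap_angle(theta + dtheta))


def complementary_update(predicted, measured, alpha=0.9):
    """Blend the odometry prediction with a landmark-based pose fix.

    `alpha` weights the prediction and (1 - alpha) the measurement; the
    value 0.9 is an assumption, not a recommendation.
    """
    px, py, pt = predicted
    mx, my, mt = measured
    return (alpha * px + (1.0 - alpha) * mx,
            alpha * py + (1.0 - alpha) * my,
            wrap_angle(pt + (1.0 - alpha) * wrap_angle(mt - pt)))
```

Blending the heading through `wrap_angle` avoids the jump that a plain weighted average would produce when the two headings sit on either side of +/- pi.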
As the SLAM problem is a big task, I would recommend programming it as multiple smaller SubVIs, so that you can properly test each part without too much added complexity.
There are also a lot of good papers on SLAM:
http://www.cs.berkeley.edu/~pabbeel/cs287-fa09/readings/Durrant-Whyte_Bailey_SLAM-tutorial-I.pdf
http://ocw.mit.edu/courses/aeronautics-and-astronautics/16-412j-cognitive-robotics-spring-2005/projects/1aslam_blas_repo.pdf
The book "Probabilistic Robotics".
https://wiki.csem.flinders.edu.au/pub/CSEMThesisProjects/ProjectSmit0949/Thesis.pdf

LabVIEW provides the LabVIEW Robotics module, which comes with plenty of templates. First, check the Starter Kit 2.0 template, which gives you a simple working self-driving robot project. You can base your own application on that working model instead of starting from scratch.

What methods to recognize sentence handwriting?

I mean recognition per sentence, not per letter, for example a doctor's prescription handwriting that is hard to read, not just normal handwriting.
For example:
I use data mining or machine learning to train on handwritten paper.
The user scans a paper with hard-to-read writing.
The application does the image processing.
The output is the sentences from the paper.
And what device should I use (scanner or webcam)?
I am a newbie. If possible, I need some examples in VB.NET with Emgu CV/OpenCV, and research journals.
Any help would be appreciated.

Welcome to Stack Overflow! The answer to your question is twofold:
a. If you want to recognize handwriting that has already happened, i.e. it is presented to you as an image, you are in trouble. Computer vision is still not good enough to provide you with reasonable accuracy.
b. If you have a chance to recognize handwriting “as it's happening” - you are in luck. Download, for example, a Gesture Search app from Android play store and you are in business.
The difference between the two scenarios is subtle but significant. In the second case you have an extra piece of information that makes handwriting recognition possible. This piece is timing of each stroke. In other words, instead of an image with handwriting you have a bunch of strokes that are all labeled with their time stamps. You can think about it as a sequence of lines and curves or as image segmentation - in any way this provides a big hint for the system. Additional help comes from the dictionary on your phone but this is typically used by any handwriting system.
Android of course has an open source library for stroke recognition (find more on your own). If you still want to go for recognizing images though, you have to first detect text (e.g. as a bounding box) and second use any of the existing engines to process detected regions. For text detection I can recommend MSER. But be careful trying to implement even text detection on your own - you are entering a world of pain here ;). Here is an article that can help.
As for learning how to recognize text from images on the Internet - this can be your plan B or C or Z when you master above mentioned stages. Don’t try to abuse learning methods and make them do hard work for you - you will hit a wall if you don’t understand what’s going on under the hood.

Small embedded synthesized speech libraries/suggestions

Are there any easy-to-use free or cheap speech synthesis libraries for PIC and/or ARM embedded systems where code size is more important than speech quality? Nowadays it seems that a 1 meg package is considered "compact", but a lot of microcontrollers are smaller than that. Back in the 1980's Apple hired a contractor to produce Macintalk, which offered reasonable-quality speech in a 26K package which ran on a 7.16MHz 68000, and a program called SAM could produce speech that wasn't quite as good, but still serviceable, with a 16K package that ran on a 1MHz 6502. The SpeakJet runs a speech-synthesis algorithm on some type of PIC.
I probably wouldn't particularly need to produce speech, but would want to be able to speak messages formed from a number of pre-set words. Obviously it would be possible to simply prerecord all the messages, but with a vocabulary of e.g. 100 words, I would think that storing 16K worth of code plus maybe 1K worth of phonetic strings would be more compact than storing audio for 100 words.
Alternatively, if I wanted to store audio for 100 words, what would be the best way of generating a set of words that would flow naturally together? On older-style speech synthesizers, any given word could be spoken three ways: neutral inflection, falling inflection (as if followed by a period), or rising inflection (followed by a question mark). Words with neutral inflection could be spliced together in any order and sound fine. The text-to-wave tools I've found, though, seem to like to add finer details of inflection which sound "off" if words are cut apart and resequenced. Are there any tools which are designed for producing waves that can be concatenated and spliced nicely? If I do use such a tool, what audio format would be best for storing the waves so as to allow efficient decoding on a small microcontroller?
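As a back-of-the-envelope check of the size comparison in the question, the arithmetic can be written out as below. The 16K code and 1K phonetic-string figures come from the question; the half-second average word length, 8 kHz sample rate, and the two encodings are assumptions picked only to show the order of magnitude.

```python
def audio_bytes(words, seconds_per_word, sample_rate_hz, bits_per_sample):
    """Storage needed for prerecorded words at a fixed bit depth."""
    return int(words * seconds_per_word * sample_rate_hz * bits_per_sample / 8)


# Figures from the question: 16K of synthesizer code plus 1K of phonetic strings.
synth_bytes = 16 * 1024 + 1 * 1024

# Assumed recording parameters: 100 words, 0.5 s each, 8 kHz mono.
pcm_8bit_bytes = audio_bytes(100, 0.5, 8000, 8)    # uncompressed 8-bit PCM
adpcm_4bit_bytes = audio_bytes(100, 0.5, 8000, 4)  # 4 bits per sample, ADPCM-style

print(synth_bytes, pcm_8bit_bytes, adpcm_4bit_bytes)
```

Under these assumptions the prerecorded audio is more than ten times larger than the synthesizer even at 4 bits per sample, which supports the intuition in the question; shorter words or a lower sample rate would narrow the gap but not close it.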

Last time I did this, I was able to add hardware like http://www.sparkfun.com/products/9578. There may be patent liabilities in your environment, as I ran into, that force a commercial software stack or OTS chip.
Otherwise, I've used http://www.speech.cs.cmu.edu/flite/ for more lenient projects, and it worked well.
