Best way to create text-to-speech voice variants

I need at least 3 or 4 different TTS voices, but unfortunately I have only one.
That is because I have only one Italian neural voice (Diego); the others are all standard voices, and their quality is much worse.
The final objective is to create a voice-over for at least 3 or 4 people, and I can't use the exact same voice for all of them.
For this reason, I would like to create some variants starting from the one neural voice I have, so that they give the impression of other people's voices without sounding unnatural.
Currently I have Adobe Audition, Audacity, Ircam Trax, and ffmpeg; apart from these, I can use SSML with an API (in this case Microsoft Azure).
I don't know which effects to use, or how strongly to apply them without damaging the voice.
In short, what is the best way to do this using the software I have, or other software if it would give better results?
Thanks!

What language are you using? If you are using English, I am sure you can find more than 3 or 4 neural voices. There are en-US, en-GB, en-CA, and en-AU neural voices, and all of them sound natural.
You can also tune the pitch using SSML to make the voice sound different.
If you would like to create different voices, try customvoice.ai with your own speech data (or your voice talents).
Or, which particular 'variances' are you looking for?
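As a rough illustration of the SSML pitch suggestion, here is a minimal Python sketch that wraps the same text in different `<prosody>` settings, one per character. The voice name `it-IT-DiegoNeural`, the character names, and the pitch/rate offsets are all assumptions chosen for illustration; the accepted value formats and ranges should be checked against the Azure SSML documentation, and the offsets tuned by ear.

```python
from xml.sax.saxutils import escape

# Assumed Azure name for the "Diego" neural voice mentioned in the question.
VOICE = "it-IT-DiegoNeural"

# Hypothetical per-character prosody offsets; small values tend to sound
# less artificial, but these numbers are only starting points.
CHARACTERS = {
    "narrator": {"pitch": "+0%", "rate": "+0%"},
    "speaker_a": {"pitch": "-6%", "rate": "-4%"},
    "speaker_b": {"pitch": "+6%", "rate": "+5%"},
}


def build_ssml(text, pitch, rate, voice=VOICE, lang="it-IT"):
    """Return an SSML document that speaks `text` with the given prosody."""
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">'
        f'<voice name="{voice}">'
        f'<prosody pitch="{pitch}" rate="{rate}">{escape(text)}</prosody>'
        "</voice></speak>"
    )


def variants(text):
    """Build one SSML document per character for the same line of text."""
    return {name: build_ssml(text, **p) for name, p in CHARACTERS.items()}


if __name__ == "__main__":
    for name, ssml in variants("Buongiorno a tutti").items():
        print(name, ssml)
```

Each resulting string would then be sent to the synthesis API in place of plain text. Prosody changes alone only go so far; combining a small pitch offset with a formant-aware shift in an audio editor may give more distinct characters.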

Related

ReactNative - Listen to specific sound input - Vroom of Car

What I am trying to do is count the revs (the "vroom" sound) of a physical car through my app. I am coding in React Native, and I don't plan to create anything complex, like communicating with the car's built-in computer, to do this.
Instead, I was planning to have the app listen to nearby sounds. If a nearby sound is that of an engine revving, the app will simply count it.
I have built the other features of my app, but listening to the sound and detecting whether it is a "vroom" sound is what I am stuck on.
Based on my research, I can see that I have to make use of the Fast Fourier Transform algorithm. But I am confused about how to implement it in my React Native app, and I am still searching for a package that has an implementation.
I have seen some apps that can be used to tune violins, guitars, etc. What I am trying to do is similar, but much simpler. Once I get a basic idea, I will be able to get going. In my case, my app will be listening for a high-decibel sound.
Any inputs would be highly appreciated.
This is known as Acoustic Event Detection, and you can approach it as an audio classification problem. The best way to solve it is with supervised machine learning, for example a CNN on mel-spectrograms. Here is an introduction. You can do the same in JavaScript using TensorFlow.js. The official documentation contains a tutorial.
One of the first steps is to collect a small dataset of examples of "vroom" sounds versus other loud non-vroom sounds.
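Before training a classifier, a much cruder baseline can help with collecting and labelling that dataset: counting bursts of loud sound by thresholding short-time RMS energy. The sketch below is in Python rather than React Native, and it is not the CNN approach described above; it cannot tell a rev from any other loud noise. The function names, frame size, and thresholds are illustrative assumptions, and samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math


def frame_rms(samples, frame_size):
    """Return the RMS level of each non-overlapping frame of `samples`."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_size]) / frame_size)
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]


def count_loud_events(samples, frame_size=1024,
                      on_threshold=0.3, off_threshold=0.1):
    """Count bursts whose RMS rises to on_threshold and later drops to
    off_threshold. The two thresholds give hysteresis, so a level that
    hovers near one threshold is not counted repeatedly."""
    count = 0
    active = False
    for level in frame_rms(samples, frame_size):
        if not active and level >= on_threshold:
            active = True
            count += 1
        elif active and level <= off_threshold:
            active = False
    return count
```

Segments flagged this way could be saved and hand-labelled as "vroom" or "not vroom" to build the training set for the classifier.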

On-device single-word voice recognition

Does needing just a single word voice recognition reduce the complexity of the task enough to be able to fully perform voice recognition processing offline, on an iOS or Android smartphone? (E.g., could a reasonably accurate counter for the number of times that a single, pre-programmed word was spoken while the microphone is active be developed to work offline on a standard iOS or Android smartphone?).
I've found plenty of tools and examples capturing voice and sending it to an online service (e.g., the Google cloud voice-to-text), but does the single-word focus reduce the complexity enough for the recognition to be doable offline today? If so, do you have any libraries to suggest or where would you start?
Cloud services are good for various reasons relating to your question:
It makes deployment of new versions of the algorithm (which happens much more frequently than most people realize) a lot easier.
It allows the developer to collect your data and use it in future algorithm development (or whatever they please).
From a practical standpoint, most deployed models (at least the effective ones) can be quite large and take up quite a bit of space on a mobile device.
In addition to the above, I don't think that the single-word focus changes much, if anything. The model has to account not just for the word, but also for the different ways that word can be said (volume, tone, accent, inflection, etc.).
So what you are asking can be done, but there are also good reasons why it's done in the cloud.

IPA (International Phonetic Alphabet) Transcription with Tensorflow

I'm looking into designing a software platform that will aid linguists and anthropologists in their study of previously unstudied languages. Statistics show that around 1,000 languages exist that have never been studied by a person outside of their respective speaker groups.
My goal is to utilize TensorFlow to make a platform that will allow linguists to study and document these languages more efficiently, and to help them create writing systems for the ones that don't have one already. One of their current methods of accomplishing such a task is three-fold: 1) recording a native speaker conversing in the language, 2) listening to that recording and trying to transcribe it into the IPA, and 3) from the phonetics, analyzing the phonemics and phonotactics of the language to eventually create a writing system for the speakers.
My proposed platform would cut that research time down from a minimum of a year to a maximum of six months. Before I start, I have some questions...
What would be required to train TensorFlow to transcribe live audio into the IPA? Has this already been done? If so, how would I utilize a previous solution for this project? Is a project like this even possible with TensorFlow? If not, what would you recommend using instead?
My apologies for the magnitude of this question. I don't have much experience in the realm of machine learning, as I am just beginning the research process for this project. Any help is appreciated!
I guess I will take a first shot at answering this. Since the question is pretty general, my answer will have to be pretty general as well.
What would be required? At the very least, you would need a large dataset of pre-transcribed data. Ideally this would be a large amount of spoken-language audio mapped to characters in the phonetic alphabet, so the system could learn the sound of individual characters rather than whole transcribed words. If such a dataset doesn't exist, a less granular dataset could be used, mapping single words to their transcriptions. Then you would need a model, that is, the actual neural network architecture implemented in code. And lastly you would need some computing resources. This is not something you can train casually; you would either have to buy some time on a cloud-based machine learning platform (like Google Cloud ML) or build a fairly expensive machine to train at home.
Has this been done? I don't know. I don't think so. There have been published papers reporting various degrees of success at training systems to transcribe speech. Here is one, for example: http://deeplearning.stanford.edu/lexfree/lexfree.pdf. Since the alphabet you want to transcribe to is specifically designed to capture the way words sound, rather than just to write down the words, you might have more success at training such a model.
Is it possible with TensorFlow? Yes, most likely. TensorFlow is well suited for implementing most modern deep learning architectures. Unless you end up designing some really weird and very original model for this purpose, TensorFlow should work just fine.
Edit: after some more thought on part 1, you would have to use a dataset mapping spoken words to their transcriptions, since I expect that a sound pronounced in isolation would differ from the same sound used within a word.
This has actually been done, albeit in PyTorch, by a group at CMU: https://github.com/xinjli/allosaurus

Shape (preferably human) recognition API for use with standard webcam

I am interested in getting into user interaction/shape detection with a simple USB webcam. I can use multiple webcams, but I don't want to be restricted to using something like the Kinect sensor. My detection cameras need to be set up on either side of a helmet (or, if there is only one, on top). I have found some APIs, but they don't really have the functionality I need, and most are geared towards facial recognition. I need to be able to detect a basic human skeletal structure and determine whether something is obstructing it. I would much rather do it without using any sort of marker system on the target person, and I would like it to be able to track multiple structures. Obviously I am willing to do some tweaking if necessary, but I want to see how close I can get to what I need before I reinvent the wheel. I am trying to design an AI system that can determine how many people are in an area and where they are.
I doubt there is anything like this, since Microsoft spent a ton of money on the R&D for Kinect and it's probably all locked behind an NDA. I'm also guessing there's a lot of hardware within the Kinect that is not available in a standard webcam.
The closest thing I could find to what you're looking for is the OpenKinect project, which might be a good place to start your research.

API to break voice into phonemes / synthesize new speech given speech samples?

You know those movies where the tech geeks record someone's voice, and their software breaks it into phonemes? Which they can then use to type in any phrase, and make it seem as if the target is saying it?
Does that software exist in an API version? I don't even know what to Google.
There is no such software. Breaking arbitrary speech into its constituent phonemes is only a partially solved problem: speech-to-text software is still imperfect, as is text-to-speech.
The idea is to reproduce the timbre of the target's voice. Even if you were able to segment the audio perfectly, reordering the phonemes would produce audio with unnatural cadence and intonation, not to mention splicing artifacts. At that point you're getting into smoothing, time-scaling, and pitch correction, all of which are possible and well-understood in theory, but operate poorly on real-world data, especially when the audio sample in question is as short as a single phoneme, and further when the timbre needs to be preserved.
These problems are compounded on the phonetic side by allophonic variation in sounds based on accent and surrounding phonemes; in order to faithfully produce even a low-quality approximation of the audio, you'd need a detailed understanding of the target's language, accent, and speech patterns.
Furthermore, your ultimate problem is one of social engineering, and people are not easy to fool when it comes to the voices of people they know. Even with a large corpus of input data, at best you could get a short low-quality sample, hardly enough for a conversation.
So while it's certainly possible, it's difficult; even if it existed, it wouldn't always be good enough.
SRI International (the company that created Siri for iOS) has an SDK called EduSpeak, which will take audio input and break it down into individual phonemes. I know this because I sat through a demo of the product about a week ago. During the demo, the presenter showed us an application that was created using the SDK. The application gave a few lines of text for the presenter to read. After reading the text, the application displayed a bar chart where each bar represented a phoneme from his speech. The height of each bar represented a score of how well each phoneme was pronounced (the presenter was not a native English speaker, so he received lower scores on certain phonemes compared to others). The presenter could also click on each individual bar to have only that individual phoneme played back using the original audio.
So yes, software exists that divides audio up by phoneme, and it does a very good job of it. Now, whether or not those phonemes can be re-assembled into speech is an open question. If we end up getting a trial version of the SDK, I'll try it out and let you know.
If your aim is to mimic someone else's voice, then another approach is to convert your own voice (instead of assembling phonemes). It is (unsurprisingly) called voice conversion, e.g., http://www.busim.ee.boun.edu.tr/~speech/projects/Voice_Conversion.htm
The technology is called "voice synthesis" and "voice recognition".
The Java API for this can be found here: Java voice JSAPI.
Apple has an API for this: Apple speech.
Microsoft has several; one is discussed here: Vista speech.
Lyrebird is a start-up that is working on this very problem. Given samples of a person's voice and some written text, it can synthesize a spoken version of that written text in the voice of the person in the samples.
You can get interesting voice warping effects with a formant-aware pitch shift. Adobe Audition has a pretty good implementation. Antares produces some interesting vocal effects VST plugins.
These techniques use some form of linear predictive coding (LPC) to treat the voice as a source-filter model. LPC works on speech signals by estimating the resonances of the vocal tract (the formants), reversing their effect with an inverse filter, and then coding the resulting residual signal. The residual signal is ideally an impulse train that represents the glottal pulses. This allows pitch and formants to be scaled independently, which leads to a much better gender conversion result than simple pitch shifting.
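To make the source-filter idea concrete, here is a minimal pure-Python sketch of LPC analysis using the autocorrelation method with the Levinson-Durbin recursion, followed by inverse filtering to obtain the residual. This is one common textbook formulation and is not claimed to be what Audition or the Antares plugins actually implement; the function names are illustrative, and a real implementation would work on short windowed frames and use a numerical library.

```python
def autocorrelation(signal, max_lag):
    """Return autocorrelation values r[0..max_lag] of `signal`."""
    n = len(signal)
    return [
        sum(signal[i] * signal[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def lpc(signal, order):
    """Estimate LPC coefficients with the Levinson-Durbin recursion.

    Returns (a, error), where a[0] == 1.0 and the inverse filter is
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order.
    """
    r = autocorrelation(signal, order)
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:  # silent or degenerate input: stop early
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error  # reflection coefficient for this order
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        error *= 1.0 - k * k
    return a, error


def inverse_filter(signal, a):
    """Apply A(z) to `signal`, removing the estimated vocal-tract
    resonances and leaving the residual (the excitation estimate)."""
    return [
        sum(a[k] * signal[n - k] for k in range(len(a)) if n - k >= 0)
        for n in range(len(signal))
    ]
```

In a voice-transformation pipeline the residual would carry the pitch information and the coefficients the formants, so each could in principle be modified separately before resynthesis.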
I dunno about a commercially available solution, but the concept isn't entirely out of the range of possibility. For example, the University of Delaware has fairly decent software for doing just that.
http://www.modeltalker.com