Audio analysis for voice, gender diarization/recognition - voice-recognition

Does anyone know a library, program, project, etc. that tries to determine how many speakers were active in an audio file, label each speaker, label its gender, etc.?
So far I found the following:
Identifying segments when a person is speaking?
Audio analysis to detect human voice, gender, age and emotion — any prior open-source work done?

The task of identifying how many people are there and assigning segments to speakers in an audio file is known as speaker diarization. Using this keyword for search you can find lots of research papers and some libraries in python. Most of the current research use deep learning models, typically RNN, to generate embeddings and then cluster them into different chunks, ideally which belong to different speakers. It is a difficult task, especially if your files are noisy. I didn't find any library/tool which was very accurate. Even IBM's API is not that accurate.
We have developed some Deep learning models on our own for this task which are exposed through API's. You can take a look at https://developers.deepaffects.com/ for more info. We also have gender and emotion recognition API's.
Disclosure - I work at deepaffects

Related

What is needed for a recommendation engine based on word/text input

I'm new to the Machine-Learning (AI) technology. I'm developing a messenger app for Android/IOs where I would like to recommend the users based on the texts/word/conversation a product from a relative small product portfolio.
Example 1:
In case the user of the messenger writes a sentence including the words "vine", "dinner", "date" the AI should recommend a bottle of vine to the user.
Example 2:
In case the user of the app writes that he has drunk a good coffee this morning, the AI should recommend a mug to the user.
Example 3:
In case the user writes something about a cute boy she met last day, the AI should recommend a "teddy bear" to the user.
I'm a Software Developer since almost 20 year with experience in the development of C/C++/Java based application (Android and IOs apps) as well as some experience in Google Cloud Platform. The ML/AI technology is completely new to me. Okay, I know the basics (input data is needed to train the ML/AI system etc.), but I wonder If there is already a framework which could help me to develop such a system which solves the above described uses-case.
I would appreciate it, if you could give me some hints where and how to start.
Thank you and regards
It is definitely possible to implement such an application, in case you want to do it in Google Cloud you will need some understanding of Tensorflow.
First of all, I recommend to you to do the Machine Learning Crash Course, for a good introduction to Machine Learning and to start to familiarize yourself with TensorFlow. Afterwards I recommend to take a look into Tensorflow tutorials which will give you a more practical introduction to Tensorflow, and include various examples on building/training/testing models.
Once you are famirialized with Tensorflow, you can jump into learning how to run jobs in the Machine Learning engine, you can start by following the quickstart. The documentation includes detailed guides on how to use the ml-engine, plus multiple samples and tutorials.
Since I believe that your application would fall into the Recommender System type, here you can see an example model, in Google Cloud ML Engine, on how to recommend items to users based on his previous searches. In your case, you would have to build a model in order to recommend items to users based on his previous words in the sentence.
The second option, in case you don't want to go through the hassle of building a new model from scratch, would be to use the Google Cloud Natural Language API, which you can understand as pre-trained models using Google (incredibly big) data. In your case, I believe that the Content Classifying API would help you achieve what your application intends to do, however, the outputs (which you can see here) are limited to what the model was trained to do, and might not be specific enough for your application, however it is an easy solution and you can still profit of this API in order to extract labels/information and send it as input to another model.
I hope that these links provide you with some foundations on what is possible to do with Tensorflow in the ML Engine, and are useful to you.

IPA (International Phonetic Alphabet) Transcription with Tensorflow

I'm looking into designing a software platform that will aid linguists and anthropologists in their study of previously unstudied languages. Statistics show that around 1,000 languages exist that have never been studied by a person outside of their respective speaker groups.
My goal is to utilize TensorFlow to make a platform that will allow linguists to study and document these languages more efficiently, and to help them create written systems for the ones that don't have a written system already. One of their current methods of accomplishing such a task is three-fold: 1) Record a native speaker conversing in the language, 2) Listening to that recording and trying to transcribe it into the IPA, 3) From the phonetics, analyzing the phonemics and phonotactics of the language to eventually create a written system for the speaker.
My proposed platform would cut that research time down from a minimum of a year to a maximum of six months. Before I start, I have some questions...
What would be required to train TensorFlow to transcribe live audio into the IPA? Has this already been done? and if so, how would I utilize a previous solution for this project? Is a project like this even possible with TensorFlow? if not, what would you recommend using instead?
My apologies for the magnitude of this question. I don't have much experience in the realm of machine learning, as I am just beginning the research process for this project. Any help is appreciated!
I guess I will take a first shot at answering this. Since the question is pretty general, my answer will have to be pretty general as well.
What would be required. At the very least you would have to have a large dataset of pre-transcribed data. Ideally a large amount of spoken language audio mapped to characters in the phonetic alphabet, so the system could learn the sound of individual characters rather than whole transcribed words. If such a dataset doesn't exist, a less granular dataset could be used, mapping single words to their transcriptions. Then you would need a model, that is the actual neural network architecture implemented in code. And lastly you would need some computing resources. This is not something you can train casually, you would either have to buy some time in a cloud based machine learning framework (like Google Cloud ML) or build a fairly expensive machine to train at home.
Has this been done? I don't know. I don't think so. There have been published papers reporting various degrees of success at training systems to transcribe speech. Here is one, for example, http://deeplearning.stanford.edu/lexfree/lexfree.pdf It seems that since the alphabet you want to transcribe to is specifically designed to capture the way words sound rather than just write down the words you might have more success at training such a model.
Is it possible with TensorFlow. Yes, most likely. TensorFlow is well suited for implementing most modern deep learning architectures. Unless you end up designing some really weird and very original model for this purpose, TensorFlow should work just fine.
Edit: after some thought in part 1, you would have to use a dataset mapping spoken words to their transcriptions, since I expect that the same sound pronounced separately would be different from when the same sound is used in a word.
This has actually been done, albeit in PyTorch, by a group at CMU: https://github.com/xinjli/allosaurus

Robot odometry in labview

I am currently working on a (school-)project involving a robot having to navigate a corn field.
We need to make the complete software in NI Labview.
Because of the tasks the robot has to be able to perform the robot has to know it's position.
As sensors we have a 6-DOF IMU, some unrealiable wheel encoders and a 2D laser scanner (SICK TIM351).
Until now I am unable to figure out any algorithms or tutorials, and thus really stuck on this problem.
I am wondering if anyone ever attempted in making SLAM work in labview, and if so are there any examples or explanations to do this?
Or is there perhaps a toolkit for LabVIEW that contains this function/algorithm?
Kind regards,
Jesse Bax
3rd year mechatronic student
As Slavo mentioned, there's the LabVIEW Robotics module that contains algorithms like A* for pathfinding. But there's not very much there that can help you solve the SLAM problem, that I am aware of. The SLAM problem consist of the following parts: Landmark extraction, data association, state estimation and updating of state.
For landmark extraction, you have to pick one or multiple features that you want the robot to recognize. This can for example be a corner or a line(wall in 3D). You can for example use clustering, split and merge or the RANSAC algorithm. I believe your laser scanner extract and store the points in a list sorted by angle, this makes the Split and Merge algorithm very feasible. Although RANSAC is the most accurate of them, but also has a higher complexity. I recommend starting with some optimal data points for testing the line extraction. You can for example put your laser scanner in a small room with straight walls and perform one scan and save it to an array or a file. Make sure the contour is a bit more complex than just four walls. And remove noise either before or after measurement.
I haven't read up on good methods for data association, but you could for example just consider a landmark new if it is a certain distance away from any existing landmarks or update an old landmark if not.
State estimation and updating of state can be achieved with the complementary filter or the Extended Kalman Filter (EKF). EKF is the de facto for nonlinear state estimation [1] and tend to work very well in practice. The theory behind EKF is quite though, but it should be a tad easier to implement. I would recommend using the MathScript module if you are going to program EKF. The point of these two filters are to estimate the position of the robot from the wheel encoders and landmarks extracted from the laser scanner.
As the SLAM problem is a big task, I would recommend program it in multiple smaller SubVI's. So that you can properly test your parts without too much added complexity.
There's also a lot of good papers on SLAM.
http://www.cs.berkeley.edu/~pabbeel/cs287-fa09/readings/Durrant-Whyte_Bailey_SLAM-tutorial-I.pdf
http://ocw.mit.edu/courses/aeronautics-and-astronautics/16-412j-cognitive-robotics-spring-2005/projects/1aslam_blas_repo.pdf
The book "Probabalistic Robotics".
https://wiki.csem.flinders.edu.au/pub/CSEMThesisProjects/ProjectSmit0949/Thesis.pdf
LabVIEW provides LabVIEW Robotics module. There are also plenty of templates for robotics module. Firstly you can check the Starter Kit 2.0 template Which will provide you simple working self driving robot project. You can base on such template and develop your own application from working model, not from scratch.

Comparative Analysis of Object Recognition or Tracking Methods with Multiple Kinects

First of all, I have looked into the list of all branches of StackExchange, and this seems to be the one best suited for this question.
I am looking for a comparative analysis (can be both theoretical and implementation-oriented) between various popular methods of object recognition and tracking using a Microsoft Kinect 360. These methods do not have to include specialised Kinect API features like hand gesture recognition or skeleton tracking. Could you point me to some technical literature that tackles this subject? I have found quite a few papers that talk about doing object detection and training based on the PCL library. I have also found some papers on using just the RGB image of the Kinect to do classic object tracking. But to put things into perspective, I want to know which method gives good performance and is less challenging to implement when a set of Kinects (possibly with overlapping projection cones) is used to do object recognition and/or tracking.
In my opinion, building a unified point cloud to analyse and label the objects to recognise/classify them using multiple Kinects (assuming I have multiple BUSes) would have too high processing overhead. Would it be a viable alternative to do foreground extraction on the depth images separately and then somehow identify duplicates?

API to break voice into phonemes / synthesize new speech given speech samples?

You know those movies where the tech geeks record someone's voice, and their software breaks it into phonemes? Which they can then use to type in any phrase, and make it seem as if the target is saying it?
Does that software exist in an API Version? I don't even know what to Google.
There is no such software. Breaking arbitrary speech into its constituent phonemes is only a partially solved problem: speech-to-text software is still imperfect, as is text-to-speech.
The idea is to reproduce the timbre of the target's voice. Even if you were able to segment the audio perfectly, reordering the phonemes would produce audio with unnatural cadence and intonation, not to mention splicing artifacts. At that point you're getting into smoothing, time-scaling, and pitch correction, all of which are possible and well-understood in theory, but operate poorly on real-world data, especially when the audio sample in question is as short as a single phoneme, and further when the timbre needs to be preserved.
These problems are compounded on the phonetic side by allophonic variation in sounds based on accent and surrounding phonemes; in order to faithfully produce even a low-quality approximation of the audio, you'd need a detailed understanding of the target's language, accent, and speech patterns.
Furthermore, your ultimate problem is one of social engineering, and people are not easy to fool when it comes to the voices of people they know. Even with a large corpus of input data, at best you could get a short low-quality sample, hardly enough for a conversation.
So while it's certainly possible, it's difficult; even if it existed, it wouldn't always be good enough.
SRI International (the company that created Siri for iOS) has an SDK called EduSpeak, which will take audio input and break it down into individual phonemes. I know this because I sat through a demo of the product about a week ago. During the demo, the presenter showed us an application that was created using the SDK. The application gave a few lines of text for the presenter to read. After reading the text, the application displayed a bar chart where each bar represented a phoneme from his speech. The height of each bar represented a score of how well each phoneme was pronounced (the presenter was not a native English speaker, so he received lower scores on certain phonemes compared to others). The presenter could also click on each individual bar to have only that individual phoneme played back using the original audio.
So yes, software exists that divides audio up by phoneme, and it does a very good job of it. Now, whether or not those phonemes can be re-assembled into speech is an open question. If we end up getting a trial version of the SDK, I'll try it out and let you know.
If your aim is to mimic someone else's voice, then another attitude is to convert your own voice (instead of assembling phonemes). It is (surprisingly) called voice conversion, e.g http://www.busim.ee.boun.edu.tr/~speech/projects/Voice_Conversion.htm
The technology is called "voice synthesis" and "voice recognition"
The java API for this can be found here Java voice JSAPI
Apple has an API for this Apple speech
Microsoft has several ...one is discussed here Vista speech
Lyrebird is a start-up that is working on this very problem. Given samples of a person's voice and some written text, it can synthesize a spoken version of that written text in the voice of the person in the samples.
You can get interesting voice warping effects with a formant-aware pitch shift. Adobe Audition has a pretty good implementation. Antares produces some interesting vocal effects VST plugins.
These techniques use some form of linear predictive coding (LPC) to treat the voice as a source-filter model. LPC works on speech signals by estimating the resonance of the vocal tract (formant), reversing its effect with an inverse filter, and then coding the resulting residual signal. The residual signal is ideally an impulse train that represents the glottal impulse. This allows the scaling of pitch and formants independently, which leads to a much better gender conversion result than simple pitch shifting.
I dunno about a commercially available solution, but the concept isn't entirely out of the range of possibility. For example, the University of Delaware has fairly decent software for doing just that.
http://www.modeltalker.com