Shape (preferably human) recognition API for use with standard webcam - api

I am interested in getting into user interaction/shape detection with a simple USB webcam. I can use multiple webcams, but don't want to be restricted to using something like the Kinect sensor. My detection cameras need to be set up on either side of a helmet (or, if there is only one, on top). I have found some APIs, but they don't really have the functionality I need, and most are geared towards facial recognition. I need to be able to detect a basic human skeletal structure and determine if something is obstructing it. I would much rather do it without any sort of marker system on the target person, and I would like it to be able to track multiple structures at once. Obviously I am willing to do some tweaking if necessary, but I want to see how close I can get to what I need before I reinvent the wheel. I am trying to design an AI system that can determine how many people are in an area and where they are.

Doubt there will be anything like this since Microsoft spent a ton of money on the R&D for Kinect and it's probably all locked behind an NDA. I'm also guessing there's a lot of hardware within the Kinect that is not available in a standard webcam.
The closest thing I could find to what you're looking for is the OpenKinect project, which might be a good place to start your research.

Related

Looking for a way to capture elevation and location data from a device to create a topographical map or model

I'm in the process of buying a 7.5 acre plot of land in a wooded, hilly area. I would estimate that the elevation varies about 50 feet from the bottom of the creek to the top of the hill. I would like to find a good method for measuring the topography of the land so I can create a 3D model. It would be tremendously useful to be able to try out different land development ideas and to simulate locations for future buildings.
My low-tech version of doing this would be to set up a laser level and go around taking elevation measurements in a 3' or so grid pattern. As I thought about that, I realized that smartphones and similar devices have quite a few sensors built in that might make this a lot easier.
I learned about software that will use a drone to capture data and images to automatically generate a topo map and 3D model. Drone Deploy is one such tool. I do have a DJI Phantom 4, but I don't know if it's feasible to fly such an intricate path among trees to scan the entire property. I wonder if there's another way to use this amazing modern hardware (phone or drone) to make my task easy.
I would appreciate hearing any thoughts and ideas about this!
The thing with DroneDeploy is that you fly above the trees, in a cross pattern; 30 meters up is usually OK.
Why do you want to fly between the trees? You have to explain that first.

Separate noise from skeleton with Kinect

Looking to do a proof of concept, and new to Kinect. I believe this is possible, but trying to gauge difficulty with links to tutorials etc explaining how this may work.
Looking to have the Kinect look at a walkway, and essentially detect people movement. This does not mean skeletal movement, but essentially "foot traffic". I want to determine the noise of traffic, i.e. are there a lot of people walking past, or a few? (Note this does not mean counting, just a rough indication. Can this be just pixel movement etc.?)
Secondly, if a person then stops and faces the Kinect, pick them up as a user, and track rudimentary movements.
The second part I'm relatively comfortable with, the first I'm not.
Any help in pointing me in the right direction is appreciated. We are a Microsoft house, so any indication of whether the Microsoft SDK or OpenKinect is the best path would be great too.
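As a rough illustration of the "just pixel movement" idea from the question, here is a minimal frame-differencing sketch. It is not Kinect-specific: it assumes you can already get each frame as a 2D list of grayscale values (0-255), and the names motion_level and classify_traffic, the threshold of 25 and the cutoff of 0.2 are all made up for the example and would need tuning on real footage.

```python
def motion_level(prev_frame, curr_frame, threshold=25):
    """Fraction of pixels whose brightness changed by more than `threshold`.

    Frames are assumed to be equally sized 2D lists of grayscale values.
    """
    changed = 0
    total = 0
    for prev_row, curr_row in zip(prev_frame, curr_frame):
        for p, c in zip(prev_row, curr_row):
            total += 1
            if abs(c - p) > threshold:
                changed += 1
    return changed / total if total else 0.0


def classify_traffic(level, busy_cutoff=0.2):
    """Turn the motion fraction into a rough label (cutoff is a guess)."""
    return "busy" if level >= busy_cutoff else "quiet"


# Two tiny 4x4 frames: a bright blob appears in the middle of the second one.
prev = [[10, 10, 10, 10],
        [10, 10, 10, 10],
        [10, 10, 10, 10],
        [10, 10, 10, 10]]
curr = [[10, 10, 10, 10],
        [10, 200, 200, 10],
        [10, 200, 200, 10],
        [10, 10, 10, 10]]

level = motion_level(prev, curr)
print(level, classify_traffic(level))  # 0.25 busy
```

In practice you would probably average the motion level over a few seconds of frames so that a single person passing does not flip the label back and forth.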

Description-File for physical setup of Multi-Monitors

I need machine-readable descriptions for Multi-Monitor and VR Setups, like simple dual-screen computers, Powerwalls, and Caves. This description must include the sizes and placements of all outputs (displays or projections) in the physical space.
The far goal is to combine User-(Head)-tracking, device tracking for mobile devices, etc. with multi-display environments.
The simplest issue is to be aware of the gap between the screens of a multi-monitor setup because of the borders of the display cases.
The most complex setup would probably be caves with polygonal or curved projection surfaces.
My impression is that every VR software package out there defines its own setup-config-crackpot-text-file-format. Is there a common standard or common practice I am missing?
There are no common standards in VR (yet) especially the type you take interest in, but you might want to check out vrui.
The author of that project understands the need for middle software that would do what you want to do: http://doc-ok.org/?p=123. He also has a great article where he considers that for VR the standard camera model could be changed with great benefits, in a way similar to what you seem to ask for in your question: http://doc-ok.org/?p=27
Maybe, once VR gets some popularity and traction thanks to Oculus and the like, a need for standardisation will arise. There already is one for HMDs; check out the OSVR project. But I don't see it as very probable: CAVE and Powerwall setups won't become widespread because of the cost and space they require. Using HMDs will probably be a lot cheaper and more portable/handy.
EDIT: I also found this - http://www.middlevr.com/
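To make the question's requirement concrete (sizes and placements of all outputs in physical space), here is a small sketch of what such a description could look like in code. This is a made-up structure, not an existing standard or the format used by Vrui, OSVR or MiddleVR: each flat display is assumed to be a rectangle given by the lower-left corner of its visible area plus two edge vectors, all in metres, and the names Display, pixel_to_world and horizontal_gap are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Display:
    """One flat output surface placed in physical space (units: metres)."""
    name: str
    origin: tuple      # lower-left corner of the visible area (x, y, z)
    right: tuple       # vector along the bottom edge of the visible area
    up: tuple          # vector along the left edge of the visible area
    resolution: tuple  # (width_px, height_px)

    def pixel_to_world(self, px, py):
        """Map a pixel (py counted from the bottom) to a 3D point."""
        u = px / self.resolution[0]
        v = py / self.resolution[1]
        return tuple(o + u * r + v * w
                     for o, r, w in zip(self.origin, self.right, self.up))


def horizontal_gap(left, right):
    """Gap along x between two displays placed side by side in one plane."""
    return right.origin[0] - (left.origin[0] + left.right[0])


# Two 0.52 m wide panels with 1 cm bezels each, so the images sit 2 cm apart.
left = Display("left", (0.0, 0.0, 0.0), (0.52, 0.0, 0.0), (0.0, 0.3, 0.0),
               (1920, 1080))
right = Display("right", (0.54, 0.0, 0.0), (0.52, 0.0, 0.0), (0.0, 0.3, 0.0),
                (1920, 1080))

print(round(horizontal_gap(left, right), 3))
print(left.pixel_to_world(960, 540))
```

Polygonal CAVE walls fit this model as several rectangles with different edge vectors; curved projection surfaces would need something richer, such as a mesh.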

API to break voice into phonemes / synthesize new speech given speech samples?

You know those movies where the tech geeks record someone's voice, and their software breaks it into phonemes? Which they can then use to type in any phrase, and make it seem as if the target is saying it?
Does that software exist in an API Version? I don't even know what to Google.
There is no such software. Breaking arbitrary speech into its constituent phonemes is only a partially solved problem: speech-to-text software is still imperfect, as is text-to-speech.
The idea is to reproduce the timbre of the target's voice. Even if you were able to segment the audio perfectly, reordering the phonemes would produce audio with unnatural cadence and intonation, not to mention splicing artifacts. At that point you're getting into smoothing, time-scaling, and pitch correction, all of which are possible and well-understood in theory, but operate poorly on real-world data, especially when the audio sample in question is as short as a single phoneme, and further when the timbre needs to be preserved.
These problems are compounded on the phonetic side by allophonic variation in sounds based on accent and surrounding phonemes; in order to faithfully produce even a low-quality approximation of the audio, you'd need a detailed understanding of the target's language, accent, and speech patterns.
Furthermore, your ultimate problem is one of social engineering, and people are not easy to fool when it comes to the voices of people they know. Even with a large corpus of input data, at best you could get a short low-quality sample, hardly enough for a conversation.
So while it's certainly possible, it's difficult; even if it existed, it wouldn't always be good enough.
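To give a feel for the "smoothing" step mentioned above, here is a minimal sketch of joining two audio segments with a linear crossfade instead of a hard cut. It only softens the click at the splice point; it does nothing about cadence, intonation or timbre. Segments are assumed to be plain lists of float samples, and crossfade_concat is a name invented for this example.

```python
def crossfade_concat(a, b, overlap):
    """Join sample lists a and b, blending `overlap` samples at the seam."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return list(a) + list(b)
    head = list(a[:-overlap])   # part of a that is left untouched
    tail = list(b[overlap:])    # part of b that is left untouched
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of b rises across the seam
        mixed.append((1 - w) * a[len(a) - overlap + i] + w * b[i])
    return head + mixed + tail


# A constant 1.0 segment fading into a constant 0.0 segment.
print(crossfade_concat([1.0] * 4, [0.0] * 4, 3))
# [1.0, 0.75, 0.5, 0.25, 0.0]
```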
SRI International (the company that created Siri for iOS) has an SDK called EduSpeak, which will take audio input and break it down into individual phonemes. I know this because I sat through a demo of the product about a week ago. During the demo, the presenter showed us an application that was created using the SDK. The application gave a few lines of text for the presenter to read. After reading the text, the application displayed a bar chart where each bar represented a phoneme from his speech. The height of each bar represented a score of how well each phoneme was pronounced (the presenter was not a native English speaker, so he received lower scores on certain phonemes compared to others). The presenter could also click on each individual bar to have only that individual phoneme played back using the original audio.
So yes, software exists that divides audio up by phoneme, and it does a very good job of it. Now, whether or not those phonemes can be re-assembled into speech is an open question. If we end up getting a trial version of the SDK, I'll try it out and let you know.
If your aim is to mimic someone else's voice, then another approach is to convert your own voice (instead of assembling phonemes). It is (unsurprisingly) called voice conversion, e.g. http://www.busim.ee.boun.edu.tr/~speech/projects/Voice_Conversion.htm
The technologies are called "voice synthesis" and "voice recognition".
The Java API for this can be found here: Java voice JSAPI
Apple has an API for this: Apple speech
Microsoft has several; one is discussed here: Vista speech
Lyrebird is a start-up that is working on this very problem. Given samples of a person's voice and some written text, it can synthesize a spoken version of that written text in the voice of the person in the samples.
You can get interesting voice warping effects with a formant-aware pitch shift. Adobe Audition has a pretty good implementation. Antares produces some interesting vocal effects VST plugins.
These techniques use some form of linear predictive coding (LPC) to treat the voice as a source-filter model. LPC works on speech signals by estimating the resonance of the vocal tract (formant), reversing its effect with an inverse filter, and then coding the resulting residual signal. The residual signal is ideally an impulse train that represents the glottal impulse. This allows the scaling of pitch and formants independently, which leads to a much better gender conversion result than simple pitch shifting.
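Here is a small pure-Python sketch of the estimate / inverse-filter / residual steps described above. It uses the autocorrelation method with the Levinson-Durbin recursion, which is one common way of computing LPC coefficients; I am not claiming that is what Audition or the Antares plugins do. The function names are made up, and the test signal is a toy one-pole "vocal tract" excited by a single impulse.

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation r[0..max_lag] of a list of samples."""
    n = len(signal)
    return [sum(signal[i] * signal[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc(signal, order):
    """Levinson-Durbin recursion.

    Returns (a, err) with a[0] == 1, so that the inverse filter is
    e[n] = x[n] + a[1]*x[n-1] + ... + a[order]*x[n-order].
    """
    r = autocorrelation(signal, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err == 0:
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                  # remaining prediction error
    return a, err


def residual(signal, a):
    """Apply the inverse filter to get the excitation (residual) signal."""
    out = []
    for n in range(len(signal)):
        e = signal[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                e += a[k] * signal[n - k]
        out.append(e)
    return out


# Toy source-filter signal: one impulse through the filter x[n] = 0.9*x[n-1].
signal = [0.9 ** n for n in range(200)]
a, err = lpc(signal, 2)
res = residual(signal, a)
print([round(c, 3) for c in a])         # close to [1.0, -0.9, 0.0]
print([round(e, 3) for e in res[:4]])   # close to a single impulse
```

The recovered residual is (nearly) the original impulse, which is the point of the source-filter model: pitch lives in the residual and the formants live in the filter coefficients, so the two can be changed independently.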
I dunno about a commercially available solution, but the concept isn't entirely out of the range of possibility. For example, the University of Delaware has fairly decent software for doing just that.
http://www.modeltalker.com

Does anyone have any idea how to create a 2D skeleton with the Kinect depthmap?

I'm currently using a Processing Kinect library which supplies a depth map. I was wondering how I could take that and use it to create a 2D skeleton, if possible. Not looking for any code here, just a general process I could use to achieve those results.
Also, given that we've seen this in several of the Kinect games so far, would it be difficult to have multiple skeletons running at once?
Disclaimer: the reason you haven't gotten an answer to this question yet is probably that it is a current research problem. So I can't give you a direct answer, but I will try to help with some information and useful resources on the topic.
There are mainly 2 different approaches to create a skeleton from a depth map. The first one is to use machine learning, the second is purely algorithmic.
For the machine learning one, you'd need many samples of people doing a predetermined move, and use those samples to train your favorite learning algorithm. That's the approach that was taken and implemented by Microsoft in the Xbox (source). It works really well, BUT you need millions of samples to make it reliable... quite a drawback.
The "algorithmic" approach (that is, one that does not use a training set) can be done in many different ways and is a research problem. It's often based on modeling the possible body postures and trying to match them against the depth image received. That's the approach that was chosen by PrimeSense (the guys behind the Kinect depth camera technology) for their skeleton tracking tool NITE.
The OpenKinect community maintains a wiki where they list some interesting research material about this topic. You might also be interested in this thread on the OpenNI mailing list.
If you're looking for an implementation of a skeleton tracking tool, PrimeSense released NITE (closed source), the one they made: it's part of the OpenNI framework. That's what's used in most of the videos you might have seen that involve skeleton tracking. I think it's able to handle up to 2 skeletons at the same time, but that requires confirmation.
The best solution is to use FAAST (http://projects.ict.usc.edu/mxr/faast/) which requires OpenNI. I have struggled to get OpenNI to work on my computer. I have not seen an approach yet using Code Laboratories' CL NUI.
An algorithmic approach is http://code.google.com/p/skeletonization/ but you may have a problem because your depthmap only represents surfaces and no closed objects.
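For a feel of what a purely algorithmic 2D skeleton looks like, here is a minimal sketch: threshold the depth map to a binary silhouette, then thin it with the classic Zhang-Suen algorithm. This is an assumption on my part, not necessarily what the project linked above implements, and the result is a one-pixel-wide "stick" outline rather than labelled joints. The function names and the depth range are invented for the example, and the image is assumed to have a border of empty pixels.

```python
def silhouette_from_depth(depth, near, far):
    """Binary mask: 1 where the depth value lies inside [near, far]."""
    return [[1 if near <= d <= far else 0 for d in row] for row in depth]


def zhang_suen_thin(image):
    """Thin a 0/1 image (with an empty 1-pixel border) to its skeleton."""
    img = [row[:] for row in image]
    rows, cols = len(img), len(img[0])

    def neighbours(r, c):
        # P2..P9, clockwise, starting from the pixel directly above.
        return [img[r - 1][c], img[r - 1][c + 1], img[r][c + 1],
                img[r + 1][c + 1], img[r + 1][c], img[r + 1][c - 1],
                img[r][c - 1], img[r - 1][c - 1]]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            for r in range(1, rows - 1):
                for c in range(1, cols - 1):
                    if img[r][c] != 1:
                        continue
                    n = neighbours(r, c)
                    b = sum(n)  # number of set neighbours
                    # number of 0 -> 1 transitions going round the pixel
                    t = sum(1 for i in range(8)
                            if n[i] == 0 and n[(i + 1) % 8] == 1)
                    p2, p4, p6, p8 = n[0], n[2], n[4], n[6]
                    if step == 0:
                        ok = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        ok = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if 2 <= b <= 6 and t == 1 and ok:
                        to_clear.append((r, c))
            # Pixels are removed only after the whole pass, as the
            # algorithm requires.
            for r, c in to_clear:
                img[r][c] = 0
            if to_clear:
                changed = True
    return img


# Toy depth map (millimetres): a 3x7 "body" at 1500 in front of a far wall.
depth = [[4000] * 9] + [[4000] + [1500] * 7 + [4000] for _ in range(3)] \
        + [[4000] * 9]
mask = silhouette_from_depth(depth, 1000, 2000)
for row in zhang_suen_thin(mask):
    print("".join("#" if v else "." for v in row))
```

For multiple people you would first split the silhouette into connected blobs and thin each one separately; as noted above, occlusion and open surfaces in the depth map are where this simple approach starts to struggle.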