I already use HTK (the Hidden Markov Model Toolkit) to recognize specific commands that control my Android application, but in this case I need to pass some voice data to a server, and that adds latency.
To avoid this latency, I am thinking about using Pocketsphinx to recognize the voice data locally in the Android application, so that I won't need to pass the audio to the server.
If this is a good idea, is it easy to learn Pocketsphinx from scratch? Also, what are the advantages and disadvantages of the two approaches (server-based and local voice recognition), and which one is better?
CMUSphinx is definitely a great idea; it has a number of advantages over HTK:
Better license
Works offline on Android
Fast
Supports multiple languages out-of-box
Easier to use and learn
You should definitely try Pocketsphinx; for more information, see
http://cmusphinx.sourceforge.net/2011/05/building-pocketsphinx-on-android/
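To give a feel for the effort involved, here is a rough sketch of keyword spotting with pocketsphinx-android, modeled on the official demo. The asset folder names ("en-us-ptm", "cmudict-en-us.dict") and the keyphrase are assumptions taken from that demo, so adjust them to whatever model you actually bundle, and call this from a background thread inside your Activity or Service:

    import java.io.File;
    import java.io.IOException;

    import edu.cmu.pocketsphinx.Assets;
    import edu.cmu.pocketsphinx.SpeechRecognizer;
    import edu.cmu.pocketsphinx.SpeechRecognizerSetup;

    // Excerpt meant to live inside an Activity/Service; not a full app.
    private SpeechRecognizer setupRecognizer(android.content.Context appContext) throws IOException {
        // Copies the bundled model files from the APK assets to the filesystem.
        File assetsDir = new Assets(appContext).syncAssets();

        SpeechRecognizer recognizer = SpeechRecognizerSetup.defaultSetup()
                .setAcousticModel(new File(assetsDir, "en-us-ptm"))       // assumed model folder name
                .setDictionary(new File(assetsDir, "cmudict-en-us.dict")) // assumed dictionary name
                .getRecognizer();

        // Register a keyphrase search and start listening; results are delivered to a
        // RecognitionListener that you register with recognizer.addListener(...).
        recognizer.addKeyphraseSearch("commands", "open browser");        // hypothetical keyphrase
        recognizer.startListening("commands");
        return recognizer;
    }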
I am using the AR Parrot drone 2.0, and I want to somehow "trick" the PC into thinking that the video stream coming from it is a second webcam.
The reason is that I want to use some existing computer vision libraries, and it would be much more straightforward to just specify which webcam is the source of the video stream on which the algorithms will run.
I came across an iOS app that streams video from the AR drone, but it seems to stream it on iPhones/iPads and not on computer devices.
Are you looking for a full app or do you want to create one by yourself?
I'm working with an AR.Drone 2.0 for a research project, and although it is an old device, which makes it harder to find resources, they are out there, and lots of them.
If you want to create your own app, you don't say which language/platform you want to use. If you want to use Java, you can use YADrone. It has a good API and it also comes with examples, including a full application that can be used; a rough sketch follows after the link below.
https://vsis-www.informatik.uni-hamburg.de/oldServer/teaching//projects/yadrone/
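If Java suits you, getting at the video frames might look roughly like the sketch below. The class and method names (ARDrone, getVideoManager(), ImageListener) are from memory of the YADrone examples, so double-check them against the version you download:

    import java.awt.image.BufferedImage;

    import de.yadrone.base.ARDrone;
    import de.yadrone.base.IARDrone;
    import de.yadrone.base.video.ImageListener;

    public class DroneVideoTap {
        public static void main(String[] args) {
            IARDrone drone = new ARDrone();   // connects to the drone's default address
            drone.start();

            drone.getVideoManager().addImageListener(new ImageListener() {
                public void imageUpdated(BufferedImage image) {
                    // Instead of emulating a second webcam, hand each decoded frame
                    // straight to your computer vision code here.
                    System.out.println("Got frame " + image.getWidth() + "x" + image.getHeight());
                }
            });
        }
    }

The point of the sketch is that, with decoded BufferedImages in hand, you may not need the "fake webcam" trick at all; many vision libraries will accept frames directly.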
I want to embed a video stream into my web page, which is part of our own cloud based software. The video should be low-latency (like video conferencing), and it would be preferable, but not required, for it to include audio. I am comfortable serving streaming binary data from the server-side, and embedding it into the page using HTML5 video.
What I am not comfortable with is capturing the video data in the first place. The client does not already have a solution in place and is looking to us for assistance. The video would be routed through our server equipment, and would not be an embedded piece that connects directly to the video source.
It is a known quantity for us to use a USB or built-in camera from the computer. What I would like more information about is stand-alone cameras.
Some camera models have their own API documentation (example). From what I've read so far, a manufacturer typically has its own API that it reuses across many or all of its models, and each manufacturer's API is different. However, I have only done surface reading and hope to gain more knowledge from someone who has already researched this, or perhaps even has first-hand experience.
Do stand-alone cameras generally include an API? (Wouldn't this be a common requirement, so that security software can work with multiple lines of cameras?) Or, if not an API, how is the data retrieved from the on-board web server? Is it usually Flash-based? Perhaps there is a reusable video stream I could capture from there? Or does the stream format usually vary widely?
What would I run into when trying to get the server-side to capture that data?
How does latency on a stand-alone device compare with a USB camera solution?
Do you have tips on picking out a stand-alone camera that would be a good fit for streaming through a server?
I am experienced with JavaScript (both HTML5 and Node.js), Perl, and Java.
Each camera manufacturer has their own take on this when it comes to access points; generally you should be able to ask for a snapshot or an MJPEG stream, but it can vary. Take a look at this entry on CodeProject; it tackles two common methodologies. Here's another one targeted specifically at Foscam.
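As a very small illustration of the snapshot approach, the sketch below just pulls a single JPEG over HTTP. The endpoint path, IP address, and credentials are hypothetical (Foscam, for instance, uses a /snapshot.cgi-style URL), so substitute whatever your camera's HTTP/CGI documentation specifies:

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class SnapshotGrabber {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint; real cameras differ in path and auth scheme.
            URL url = new URL("http://192.168.1.50/snapshot.cgi?user=admin&pwd=secret");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            try (InputStream in = conn.getInputStream()) {
                // Save one frame to disk; a server process would instead re-encode
                // or relay these frames to the browser.
                Files.copy(in, Paths.get("frame.jpg"), StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }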
Get a good NAS; I suggest Synology. Check out their long list of supported IP web cams. You can connect the cameras through a hub or a router or whatever you wish. It's not a "computer" as in a "tower", but it does many computer jobs, and it can stay on while your computer is off or away and do things like video feeds, torrents, backups, etc.
I'm not an expert on all the features, so I don't know how to get it to broadcast without recording, but even if it does record, at least it's a separate device. Synology is a popular brand and there are a lot of authorized and unauthorized plugins for it. Check them out and see if one suits you.
Sorry if this is a repeat question, but I didn't see it anywhere.
I'm working on a Mac program that will take voice commands, and NSSpeechRecognizer isn't quite doing it for me.
I want something a little more dynamic so I can set alarms, make dates, give more natural commands, etc.
Every open source speech engine I've found is tailored toward iOS. Do OpenEars/VocalKit etc. still work just as well for Mac programs?
Speech recognition is exceptionally non-trivial. The engines that are free are free for a reason. If you expect dictation in any amount (like an alarm label), you're out of luck. There are reasons Siri requires an entire data center. The open source packages available won't get you much further than simple telephone auto-attendants.
Unless you have an extensive statistics background and free time, I'd recommend that you pursue licensing a commercial library or server implementation.
Pocketsphinx from Carnegie Mellon is about the only option:
http://cmusphinx.sourceforge.net/
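If you do want to experiment before licensing anything commercial, here is a rough sketch using CMU's pure-Java Sphinx4 recognizer (a sibling of pocketsphinx that runs anywhere a JVM does, including a Mac). The edu.cmu.sphinx.api classes and the bundled model paths follow the recent Sphinx4 releases, so verify them against the version you actually pull in:

    import edu.cmu.sphinx.api.Configuration;
    import edu.cmu.sphinx.api.LiveSpeechRecognizer;
    import edu.cmu.sphinx.api.SpeechResult;

    public class LiveDictation {
        public static void main(String[] args) throws Exception {
            Configuration config = new Configuration();
            // Model paths as shipped inside the Sphinx4 model jar (assumed layout).
            config.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
            config.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
            config.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

            LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(config);
            recognizer.startRecognition(true);   // true = discard previously cached audio
            SpeechResult result;
            while ((result = recognizer.getResult()) != null) {
                System.out.println("Heard: " + result.getHypothesis());
            }
            recognizer.stopRecognition();
        }
    }

Even so, expect the accuracy ceiling described above for open dictation; a constrained grammar of commands will fare much better than free-form speech.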
In a nutshell, Fast Dormancy allows the RRC state machine to go from CELL_DCH to IDLE (CELL_PCH) without waiting for the inactivity timer to expire. Is there any OS (Android, Windows Phone, iOS, etc.) that exposes APIs with which we can invoke fast dormancy on 3G devices? Any pointers appreciated.
EDIT: Does any OS expose APIs to switch off the 3G radio or to switch radio states (DCH, FACH, IDLE, etc.)?
I'm not sure if I understood your question correctly (I'm not familiar with the actual 3G technology), but the BlackBerry API (since 4.2.1) at least has the following method:
Requests that the radios belonging to the provided Wireless Access Families be powered off.
http://www.blackberry.com/developers/docs/6.0.0api/net/rim/device/api/system/Radio.html#deactivateWAFs(int)
Constants used with the above:
http://www.blackberry.com/developers/docs/6.0.0api/net/rim/device/api/system/RadioInfo.html#WAF_3GPP
Not sure if this is what you actually meant.
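For completeness, calling the linked method is essentially a one-liner. Radio.activateWAFs is, as far as I recall, the counterpart for powering the radio back up, so treat that name as something to confirm in the RIM javadocs:

    import net.rim.device.api.system.Radio;
    import net.rim.device.api.system.RadioInfo;

    // Power down the 3GPP (GSM/UMTS) radio...
    Radio.deactivateWAFs(RadioInfo.WAF_3GPP);

    // ...and later bring it back up (method name from memory; please verify).
    Radio.activateWAFs(RadioInfo.WAF_3GPP);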
It seems that BlackBerry has also exposed fast dormancy since API 4.0.0:
http://www.blackberry.com/developers/docs/5.0.0api/net/rim/device/api/io/IOProperties.html#CDMA_SET_FAST_DORMANCY_FLAG
and
http://www.blackberry.com/developers/docs/4.0.2api/net/rim/device/api/io/IOProperties.html
The oFono stack used by MeeGo seems to have fast dormancy settings (and radio toggling) in its radio settings API, but I can't really see at which level those would be exposed to applications. The API doc is in their git repo:
http://meego.gitorious.org/meego-cellular/ofono/blobs/5639c653979e324e0b3a195ec3fab07fc2bd3a05/doc/radio-settings-api.txt
I've read that NCFD has been blamed for spotty 3G performance on iOS devices in some cases, so I'm not sure that programmatically playing with it at an application level is such a good idea, especially since you'd be making assumptions about the entire platform's network stack requirements.
I'm interested in writing some homebrew code for the Microsoft Kinect console. I have a few applications which I think would translate well to the platform. I've been toying with the idea of giving it a shot using the OpenKinect drivers and libraries. Obviously this would be a lot of work, but I am wondering just how much. Does anyone have experience with OpenKinect? Do you get only the raw video/audio data from the device, or has anyone written higher level abstractions to make common tasks easier?
The OpenKinect library is basically a driver, at least for now, so don't expect many high-level functions from it. You will more or less get the raw data from both the depth and the video cameras.
This is basically an array received in a callback function each time a frame arrives.
You can give it a try by following the instructions provided on the OpenKinect website; it's really quick to install and try, and you can play a bit with the provided glview application to get a feeling for what's possible.
I've set up a few demos using OpenCV and got pretty cool results even though I didn't have much background in computer vision, so I can only encourage you to try it yourself!
Alternatively, if you're looking for more advanced functions, the OpenNI framework was just released this week and provides some impressive high-level algorithms such as skeleton tracking and some gesture recognition. Part of the framework consists of proprietary algorithms from PrimeSense (like the powerful skeleton tracking module). I haven't tried it yet and don't know how well it integrates with the Kinect and the different OSes, but since a bunch of people from different groups (OpenKinect, Willow Garage...) are working hard on it, that shouldn't be an issue within a week.
Elaborating further on what Jules Olleon wrote, I've worked with OpenNI (http://www.openni.org) and the algorithms built on top of it (NITE), and I highly recommend using these frameworks. Both frameworks are well documented and come with numerous samples from which you can start out.
Basically, OpenNI abstracts the lower-level details of working with the sensor and its driver for you, and gives you a convenient way to get what you want from a "generator" (e.g. xn::DepthGenerator for the raw depth data); there is a small sketch of this pattern after this answer. OpenNI is open source and free to use in any application, and it also handles platform abstraction for you. As of today, OpenNI is supported and works fine on Windows 32/64-bit and Linux, and is in the process of being ported to OS X. Bindings are available for multiple programming languages (C, C++, .NET, Python, and a few others, I believe).
NITE adds interfaces on top of OpenNI that give you higher-level results (e.g. hand-point tracking, skeletons, scene analysis, etc.). You'll want to check the subtleties of NITE's license regarding when and where you can use it, but it's still probably the easiest and fastest way to get analysis (e.g. a skeleton) for now. NITE is closed source, so PrimeSense needs to supply a binary version for you to use. Currently Windows and Linux versions are available.
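To make the "generator" idea concrete, here is a rough sketch using the org.OpenNI Java wrapper, modeled on the SimpleRead sample that ships with OpenNI 1.x. The exact class and method names are from memory and may differ slightly in your installed version, so treat this as an outline rather than copy-paste code:

    import org.OpenNI.Context;
    import org.OpenNI.DepthGenerator;
    import org.OpenNI.DepthMetaData;

    public class SimpleDepthRead {
        public static void main(String[] args) throws Exception {
            Context context = new Context();                 // OpenNI context owning all nodes
            DepthGenerator depth = DepthGenerator.create(context);
            context.startGeneratingAll();

            for (int i = 0; i < 100; i++) {
                context.waitOneUpdateAll(depth);             // block until a new depth frame arrives
                DepthMetaData md = depth.getMetaData();
                int center = md.getData().readPixel(md.getXRes() / 2, md.getYRes() / 2);
                System.out.println("Frame " + md.getFrameID() + ": center depth = " + center + " mm");
            }
            context.release();
        }
    }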
I haven't worked with OpenKinect, but I've been working with OpenNI and SensorKinect for a few months now for my research. If you are planning to work with raw data from the Kinect, they work great at giving you depth and video (they don't support motor control). I've used them with C++ and OpenGL on both Windows 64-bit and Ubuntu 32-bit with almost no modifications to the code. It's very easy to learn if you know basic C++. Installing it might be a bit of a headache, though.
For more advanced features such as skeleton detection, gesture recognition, etc., I highly recommend using middleware such as NITE with OpenNI, or the ones listed here: Middlewares developed around OpenNI, rather than reinventing the wheel. NITE is also very easy to use once you have OpenNI working; e.g. joint recognition is something around 10-20 extra lines of code.
Something that I would recommend to my younger self is to learn and work with a basic game engine (e.g. Unity) rather than directly with OpenGL. It gives you much better and more enjoyable graphics with less hassle, and it also lets you easily integrate your program with other tools such as PhysX. I haven't tried any, but I know there are some plugins for using Kinect drivers in Unity.