I have a program that receives a mono audio stream over TCP/IP. I am wondering whether the speech-recognition API in Mac OS X would be able to do a speech-to-text transform for me.
(I don't mind saving the audio to a .wav file first and reading it back, as opposed to doing the transform on the fly.)
I have read the official docs online, but they are a bit confusing, and I couldn't find any good examples on this topic.
Also, should I do it in Cocoa/Carbon/Java or Objective-C?
Can someone please shed some light?
Thanks.
There are a number of examples that get copied to /Developer/Examples/Speech/Recognition when you install Xcode.
The Cocoa class for speech recognition is NSSpeechRecognizer.
I've not used it, but as far as I know speech recognition requires you to build a grammar to help the engine choose from a number of choices, rather than allowing you to pass free-form input. This is all explained in the examples referred to above.
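For reference, a minimal sketch of command recognition with NSSpeechRecognizer might look like this; the listener class and command list are purely illustrative:
#import <Cocoa/Cocoa.h>

// Receives a callback whenever one of the registered commands is recognized.
@interface CommandListener : NSObject
@end

@implementation CommandListener
- (void)speechRecognizer:(NSSpeechRecognizer *)sender
     didRecognizeCommand:(id)command
{
    NSLog(@"Recognized command: %@", command);
}
@end

// Somewhere in your setup code:
CommandListener *listener = [[CommandListener alloc] init];
NSSpeechRecognizer *recognizer = [[NSSpeechRecognizer alloc] init];
[recognizer setCommands:[NSArray arrayWithObjects:@"Play", @"Stop", @"Next track", nil]];
[recognizer setDelegate:listener];
[recognizer startListening];
The engine then only tries to match spoken input against the strings passed to setCommands:, which is the grammar-building step mentioned above.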
This comes a bit late perhaps, but I'll chime in anyway.
The speech recognition facilities in OS X (on both the Carbon and Cocoa side of things) are for speech command recognition, which means that they will recognize words (or phrases, commands) that have been loaded into the speech system language model. I've done some stuff with small dictionaries and it works pretty well, but if you want to recognize arbitrary speech, things may get hairier.
Something else to keep in mind is that the functionality provided by the speech APIs in OS X is not one-to-one between Carbon and Cocoa. The Carbon side provides functionality that has not made it to NSSpeechRecognizer (the docs make some mention of this).
I don't know about Cocoa, but the Carbon Speech Recognition Manager does allow you to specify inputs other than a microphone, so a sound stream would work just fine.
Here's a good O'Reilly article to get you started.
You can use either the SpeechSynthesis API in ApplicationServices (10.0+):
// Build a Pascal string and hand it to the Speech Synthesis Manager.
CFStringRef cfstr = CFStringCreateWithCString(NULL, "Hello World!", kCFStringEncodingMacRoman);
Str255 pstr;
CFStringGetPascalString(cfstr, pstr, 255, kCFStringEncodingMacRoman);
SpeakString(pstr);
or AppKit's NSSpeechSynthesizer (10.3+):
NSSpeechSynthesizer *synth = [[NSSpeechSynthesizer alloc] initWithVoice:@"com.apple.speech.synthesis.voice.Alex"];
[synth startSpeakingString:@"Hello world!"];
Related
I'm due to work on a small application that captures audio from the Mac's Audio Queue and needs to save it to disk in some reasonable audio format.
Does anyone have some decent sample code (Cocoa / Objective-C) that they can share?
I specifically need to capture the audio that is being passed to the Built-in Output device in order to record it. Any insights? The answers so far have been helpful, but have not helped me understand how the data going to the output can be captured, agnostic of the input source.
Working with audio in Mac OS X involves interfacing with Core Audio. For a quick overview, take a look at the Core Audio Overview.
You will need to interface with the AUHAL to perform input and output; a technical note exists detailing the steps required to do so. This code usually seems to be written in C++, as that is the approach taken in the SimplePlayThru demo.
This doesn't cover the actual steps required to capture that audio input. However, these links should provide you with enough sample code to begin interfacing with your input device. I'll post more links in this answer if I happen across them.
Take a look at /Developer/Examples/CoreAudio/Services/AudioFileTools. Specifically, look at afrecord.cpp. Admittedly, this is not Cocoa per se; Cocoa itself doesn't seem to have any specific capabilities for recording. If you want to interface with the C++ file there, you'll likely need to write some Objective-C++, as in SimplePlayThru.
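Since the question mentions the Audio Queue specifically, here is a rough sketch (error handling omitted; the output path, duration, and buffer sizes are illustrative) of recording from the default input device to an AIFF file with Audio Queue Services:
#import <AudioToolbox/AudioToolbox.h>
#include <unistd.h>

typedef struct {
    AudioFileID file;        // destination file
    SInt64      packetIndex; // next packet to write
} RecorderState;

// Called by the queue each time an input buffer has been filled.
static void InputCallback(void *inUserData, AudioQueueRef inQueue,
                          AudioQueueBufferRef inBuffer,
                          const AudioTimeStamp *inStartTime,
                          UInt32 inNumPackets,
                          const AudioStreamPacketDescription *inPacketDescs)
{
    RecorderState *state = (RecorderState *)inUserData;
    UInt32 numPackets = inNumPackets;
    if (numPackets == 0)     // constant bit rate: derive the packet count from the byte count
        numPackets = inBuffer->mAudioDataByteSize / 2;
    AudioFileWritePackets(state->file, false, inBuffer->mAudioDataByteSize,
                          inPacketDescs, state->packetIndex, &numPackets,
                          inBuffer->mAudioData);
    state->packetIndex += numPackets;
    AudioQueueEnqueueBuffer(inQueue, inBuffer, 0, NULL);  // re-use the buffer
}

void RecordFiveSeconds(void)
{
    // 16-bit mono big-endian PCM, which is what AIFF expects.
    AudioStreamBasicDescription fmt = {0};
    fmt.mSampleRate       = 44100.0;
    fmt.mFormatID         = kAudioFormatLinearPCM;
    fmt.mFormatFlags      = kLinearPCMFormatFlagIsBigEndian |
                            kLinearPCMFormatFlagIsSignedInteger |
                            kLinearPCMFormatFlagIsPacked;
    fmt.mChannelsPerFrame = 1;
    fmt.mBitsPerChannel   = 16;
    fmt.mBytesPerPacket   = 2;
    fmt.mBytesPerFrame    = 2;
    fmt.mFramesPerPacket  = 1;

    RecorderState state = {0};
    CFURLRef url = CFURLCreateWithFileSystemPath(NULL, CFSTR("/tmp/capture.aiff"),
                                                 kCFURLPOSIXPathStyle, false);
    AudioFileCreateWithURL(url, kAudioFileAIFFType, &fmt,
                           kAudioFileFlags_EraseFile, &state.file);

    AudioQueueRef queue;
    AudioQueueNewInput(&fmt, InputCallback, &state, NULL, NULL, 0, &queue);

    for (int i = 0; i < 3; i++) {          // keep three buffers in flight
        AudioQueueBufferRef buf;
        AudioQueueAllocateBuffer(queue, 16384, &buf);
        AudioQueueEnqueueBuffer(queue, buf, 0, NULL);
    }

    AudioQueueStart(queue, NULL);
    sleep(5);                              // record ~5 seconds
    AudioQueueStop(queue, true);
    AudioQueueDispose(queue, true);
    AudioFileClose(state.file);
    CFRelease(url);
}
The afrecord.cpp sample mentioned above does essentially the same job with more options and proper error checking.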
There is good example code in Ulli Kusterer's GitHub repository.
CocoaDev also has an article on that topic. The source code at the bottom of the page uses QuickTime's Sequence Grabber API; I would go with Core Audio.
Is there a "talking head" library for Mac OS X / Cocoa / Objective-C? Specifically the ones that simplify translating spoken text into visemes / facial expressions? Microsoft has "Microsoft Agent" as part of their Text to Speech API, does the Mac has a worthy competitor for this feature?
There is nothing to generate the face, but you can use the NSSpeechSynthesizerDelegate protocol to receive -speechSynthesizer:willSpeakPhoneme: messages so you can sync your own artwork with the speech.
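A rough sketch of that delegate hookup, with the mouth-image handling left as a stub (the class and the image-lookup method are illustrative):
#import <Cocoa/Cocoa.h>

@interface TalkingHead : NSObject
{
    NSSpeechSynthesizer *synth;
}
- (void)speak:(NSString *)text;
@end

@implementation TalkingHead
- (id)init
{
    if ((self = [super init])) {
        synth = [[NSSpeechSynthesizer alloc] initWithVoice:nil]; // default voice
        [synth setDelegate:self];
    }
    return self;
}

// Called just before each phoneme is spoken: swap in the matching mouth image here.
- (void)speechSynthesizer:(NSSpeechSynthesizer *)sender
         willSpeakPhoneme:(short)phonemeOpcode
{
    NSLog(@"About to speak phoneme opcode %d", phonemeOpcode);
    // e.g. [mouthView setImage:[self imageForPhonemeOpcode:phonemeOpcode]];
}

- (void)speak:(NSString *)text
{
    [synth startSpeakingString:text];
}
@end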
I just posted a quick and dirty demo complete with project files and test mouth images. It's intended to show how easy this is to do. The hard part is the art work. :-) The project and blog post should be enough to get anybody started.
The SAPI engine can only render TTS from one application at a time (I have run a test with two instances of the Windows SDK TTSApplication sample to verify this). I am writing an application in which I need to detect whether the TTS engine is currently speaking (i.e. under control of a separate application, not mine).
Does anyone know please how can I programmatically (in C++) detect the SAPI TTS engine busy/ready state? I have tried using ISpVoice::GetStatus() but that only seems to work for any TTS activity in my own application.
Thanks.
Here is how to check whether the speech synthesis system is speaking:
// Assumes pVoice is an ISpVoice* already initialized via CoCreateInstance.
SPVOICESTATUS status;
HRESULT hr = pVoice->GetStatus(&status, NULL);
if (SUCCEEDED(hr) && status.dwRunningState == SPRS_IS_SPEAKING)
    std::cout << "The Speech Synthesis System is speaking." << std::endl;
else
    std::cout << "The Speech Synthesis System is not speaking." << std::endl;
For example, in SAPI 4, IVTxtAttributes::IsSpeaking retrieves such status (i.e., whether the engine is currently playing samples to some audio device).
Anyway, IMO a general SAPI engine is not limited to one application; I believe this behaviour is specific to your engine.
From http://msdn.microsoft.com/en-us/library/ee431864%28v=vs.85%29.aspx
SPRUNSTATE lists the voice running states.
typedef enum SPRUNSTATE
{
SPRS_DONE,
SPRS_IS_SPEAKING
} SPRUNSTATE;
Elements:
SPRS_DONE
The voice has completed processing all queued streams.
SPRS_IS_SPEAKING
The voice instance currently has the audio claimed.
Is it possible to access the iSight camera on a macbook programmatically? By this I mean I would like to be able to just grab still frames from the iSight camera on command and then do something with them. If so, is it only accessible using objective c, or could other languages be used as well?
You should check out the QTKit Capture documentation.
On Leopard, you can get at all of it over the RubyCocoa bridge:
require 'osx/cocoa'
OSX.require_framework("/System/Library/Frameworks/QTKit.framework")
OSX::QTCaptureDevice.inputDevices.each do |device|
puts device.localizedDisplayName
end
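On the Objective-C side, a minimal QTKit Capture sketch might look something like this (error handling trimmed; the FrameGrabber class and what you do with each frame are illustrative):
#import <Cocoa/Cocoa.h>
#import <QTKit/QTKit.h>
#import <CoreVideo/CoreVideo.h>

@interface FrameGrabber : NSObject
{
    QTCaptureSession *session;
}
- (BOOL)start;
@end

@implementation FrameGrabber
- (BOOL)start
{
    NSError *error = nil;
    // The built-in iSight is the default video input device.
    QTCaptureDevice *camera =
        [QTCaptureDevice defaultInputDeviceWithMediaType:QTMediaTypeVideo];
    if (![camera open:&error]) return NO;

    session = [[QTCaptureSession alloc] init];
    QTCaptureDeviceInput *input =
        [[QTCaptureDeviceInput alloc] initWithDevice:camera];
    [session addInput:input error:&error];

    QTCaptureDecompressedVideoOutput *output =
        [[QTCaptureDecompressedVideoOutput alloc] init];
    [output setDelegate:self];
    [session addOutput:output error:&error];

    [session startRunning];
    return YES;
}

// Called once per captured frame.
- (void)captureOutput:(QTCaptureOutput *)captureOutput
  didOutputVideoFrame:(CVImageBufferRef)videoFrame
     withSampleBuffer:(QTSampleBuffer *)sampleBuffer
       fromConnection:(QTCaptureConnection *)connection
{
    // Hold on to the most recent frame and convert it to an NSImage/CIImage
    // whenever a still is requested.
    size_t width  = CVPixelBufferGetWidth(videoFrame);
    size_t height = CVPixelBufferGetHeight(videoFrame);
    NSLog(@"Grabbed a %zux%zu frame", width, height);
}
@end
Each call to the delegate method hands you one decompressed frame, so grabbing a still on command is just a matter of keeping the latest one around.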
I don't have a Mac here, but there is some documentation up here:
http://developer.apple.com/documentation/Hardware/Conceptual/iSightProgGuide/01introduction/chapter_1_section_1.html
It looks like you have to go through the QuickTime API. There is supposed to be a sample project called "MungGrab" which could be worth a look, according to this thread.
If you poke around Apple's mailing lists you can find some code to do it in Java as well. Here's a simple example suitable for capturing individual frames, and here's a more complicated one that's fast enough to display live video.
There's a command line utility called isightcapture that does more or less what you want to do. You could probably get the code from the developer (his e-mail address is in the readme you get when you download the utility).
One thing that hasn't been mentioned so far is the IKPictureTaker, which is part of Image Kit. This will bring up the standard OS-provided panel for taking pictures, though, with all the possible filter functionality etc. included. I'm not sure if that's what you want.
I suppose you can use it from other languages as well, considering there are things like Cocoa bridges, but I have no experience with them.
Googling also came up with another question on stackoverflow that seems to address this issue.
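If the panel-based IKPictureTaker approach mentioned above is acceptable, it is only a few lines. This sketch assumes the methods live in an NSWindowController subclass; the selector name and result handling are illustrative:
#import <Quartz/Quartz.h>   // IKPictureTaker is part of Image Kit (Quartz umbrella)

// Assumes this code lives in an NSWindowController subclass ([self window]).
- (void)takePicture
{
    IKPictureTaker *taker = [IKPictureTaker pictureTaker];
    [taker beginPictureTakerSheetForWindow:[self window]
                              withDelegate:self
                            didEndSelector:@selector(pictureTakerDidEnd:returnCode:contextInfo:)
                               contextInfo:NULL];
}

- (void)pictureTakerDidEnd:(IKPictureTaker *)taker
                returnCode:(NSInteger)code
               contextInfo:(void *)context
{
    if (code == NSOKButton)
        NSLog(@"Got image: %@", [taker outputImage]);   // the captured NSImage
}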
Aside from ObjC, you can also use the PyObjC or RubyCocoa bindings to access it. If you're not picky about which language, I'd say use Ruby, as PyObjC is very poorly documented (even the official Apple page on it refers to the old version, not the one that came with OS X Leopard).
Quartz Composer is probably the easiest way to access it, and .quartz files can be embedded in applications pretty easily (and the data piped out to ObjC or such).
Also, I suppose there should be an example or two of this in /Developer/Examples/.
From a related question which specifically asked for a Pythonic solution, you should give motmot's camiface library from Andrew Straw a try. It works with FireWire cameras, but it also works with the iSight, which is what you are looking for.
From the tutorial:
import motmot.cam_iface.cam_iface_ctypes as cam_iface
import numpy as np
mode_num = 0
device_num = 0
num_buffers = 32
cam = cam_iface.Camera(device_num,num_buffers,mode_num)
cam.start_camera()
frame = np.asarray(cam.grab_next_frame_blocking())
print 'grabbed frame with shape %s'%(frame.shape,)