Low latency isochronous out on OSX and Windows 10 - usb

I'm trying to output isochronous data (generated programmatically) over High Speed USB 2 with very low latency. Ideally around 1-2 ms. On Windows I'm using WinUsb, and on OSX I'm using IOKit.
There are two approaches I have thought of. I'm wondering which is best.
1-frame transfers
WinUsb is quite restrictive in what it allows, and requires each isochronous transfer to be a whole number of frames (1 frame = 1 ms). Therefore to minimise latency use transfers of one frame each in a loop something like this:
for (;;)
{
// Submit a 1-frame transfer ASAP.
WinUsb_WriteIsochPipeAsap(..., &overlapped[i]);
// Wait for the transfer from 2 frames ago to complete, for timing purposes. This
// keeps the loop in sync with the USB frames.
WinUsb_GetOverlappedResult(..., &overlapped[i-2], block=true);
}
This works fairly well and gives a latency of 2 ms. On OSX I can do a similar thing, though it is quite a bit more complicated. This is the gist of the code - the full code is too long to post here:
uint64_t frame = ...->GetBusFrameNumber(...) + 1;
for (;;)
{
// Submit at the next available frame.
for (a few attempts)
{
kr = ...->LowLatencyWriteIsochPipeAsync(...
frame, // Start on this frame.
&transfer[i]); // Callback
if (kr == kIOReturnIsoTooOld)
frame++; // Try the next frame.
else if (kr == kIOReturnSuccess)
break;
else
abort();
}
// Above, I pass a callback with a reference to a condition_variable. When
// the transfer completes the condition_variable is triggered and wakes this up:
transfer[i-5].waitForResult();
// I have to wait for 5 frames ago on OSX, otherwise it skips frames.
}
Again this kind of works and gives a latency of around 3.5 ms. But it's not super-reliable.
Race the kernel
OSX's low latency isochronous functions allow you to submit long transfers (e.g. 64 frames), and then regularly (max once per millisecond) update the frame list which says where the kernel has got to in reading the write buffer.
I think the idea is that you somehow wake up every N milliseconds (or microseconds), read the frame list, work out where you need to write to and do that. I haven't written code for this yet but I'm not entirely sure how to proceed, and there are no examples I can find.
It doesn't seem to provide a callback when the frame list is updated so I suppose you have to use your own timer - CFRunLoopTimerCreate() and read the frame list from that callback?
Also I'm wondering if WinUsb allows a similar thing, because it also forces you to register a buffer so it can be simultaneously accessed by the kernel and user-space. I can't find any examples that explicitly say you can write to the buffer while the kernel is reading it though. Are you meant to use WinUsb_GetCurrentFrameNumber in a regular callback to work out where the kernel has got to in a transfer?
That would require getting a regular callback on Windows, which seems a bit tricky. The only way I've seen is to use multimedia timers which have a minimum period of 1 millisecond (unless you use the undocumented (NtSetTimerResolution?).
So my question is: Can I improve the "1-frame transfers" approach, or should I switch to a 1 kHz callback that tries to race the kernel. Example code very appreciated!

(Too long for a comment, so…)
I can only address the OS X side of things. This part of the question:
I think the idea is that you somehow wake up every N milliseconds (or
microseconds), read the frame list, work out where you need to write
to and do that. I haven't written code for this yet but I'm not
entirely sure how to proceed, and there are no examples I can find.
It doesn't seem to provide a callback when the frame list is updated
so I suppose you have to use your own timer - CFRunLoopTimerCreate()
and read the frame list from that callback?
Has me scratching my head over what you're trying to do. Where is your data coming from, where latency is critical but the data source does not already notify you when data is ready?
The idea is that your data is being streamed from some source, and as soon as any data becomes available, presumably when some completion for that data source gets called, you write all available data into the user/kernel shared data buffer at the appropriate location.
So maybe you could explain in a little more detail what you're trying to do and I might be able to help.

Related

Should the amount of resource allocations be "per swap chain image"?

I just learned about uniform buffers (https://vulkan-tutorial.com/Uniform_buffers/Descriptor_layout_and_buffer) and a bit confused about the size of uniformBuffers and uniformBuffersMemory. In the tutorial it is said that:
We should have multiple buffers, because multiple frames may be in flight at the same time and we don't want to update the buffer in preparation of the next frame while a previous one is still reading from it! We could either have a uniform buffer per frame or per swap chain image.
As far as I understand "per swap chain image" approach is more optimal. Please, prove me wrong, if I am. But why do we need it to be the size of swapChainImages.size()? Isn't MAX_FRAMES_IN_FLIGHT just enough, because we have fences? As a simple example, if we have just a single frame in flight and do vkDeviceWaitIdle after each presentation then our single uniform buffer will always be available and not used by cpu/gpu so we don't need an array of them.
do vkDeviceWaitIdle
OK, stop right there. There is basically only one valid reason to call that function: you need to delete every resource created by that device, because you're about the destroy the device, so you wait until all such resources are no longer being used.
Yes, if you halt the CPU's execution until the GPU stops doing stuff, then you're guaranteed that CPU writes to GPU memory will not interact with GPU reads from that memory. But you purchased this guarantee by ensuring that there will be no overlap at all between CPU execution and GPU execution. The CPU sets up some stuff, sends it to the GPU, then waits till the GPU is done, and the CPU starts up again. Everything executes perfectly synchronously. While the CPU is doing work, the GPU is doing nothing. And vice-versa.
This is not a recipe for performance. If you're going to use a graphics API designed to achieve lots of CPU/GPU overlap, you shouldn't throw that away because it's easier to work with.
Get used to multi-buffering any resources that you modify from the CPU on a regular basis. How many buffers you want to use is your choice, one that should be informed by the present mode and the like.
My question is "Do I need n buffers or m is enough?".
The situation you're describing ultimately only happens if your code wanted to have X frames in flight, but the presentation engine requires you to use a minimum of Y swap-chain images, and X < Y. So the question you're asking can be boiled down to, "if I wanted to do double-buffering, but the implementation forces 3 buffers on me, is it OK if I treat it as double-buffering?"
Yes, as long as you're not relying on the vkAcquireNextImage call to block the CPU for your synchronization. But you shouldn't be relying on that anyway, since the call itself doesn't constitute a proper barrier as far as the Vulkan execution model is concerned. You should instead block the CPU on fences tied to the actual work, not on the acquire process.

WasapiLoopbackCapture to WaveOut

I'm using WasapiLoopbackCapture to capture sound coming from my speakers and then using onDataAvailable to send it to another device and I'm attempting to play the data sent using the WaveOut class and a BufferedWaveProvider and just adding a sample everytime data is sent from my client using the onDataAvailable. I'm having problems sending sound. The most functioning I've managed to get it is:
Not syncing the Wave format of the client and the server, just sending data and adding it to the sample. Problem is this is stutters very much even though I checked the buffer stored size and it has 51 seconds. I even have to increase the buffer size which eventually overflows anyway.
I tried syncing the Wave format and I just get clicks but have no problem with buffer size. I also tried making sure that at least a second was stored in the buffer but that had zero effect.
If anyone could point me in the right direction that would be great.
Uncompressed audio takes up a lot of space on a network. On my machine the WasapiLoopbackCapture object produces 32-bit (IeeeFloat) stereo samples at 44100 samples per second, for around 2.7Mbit/sec total raw bandwidth. Once you factor in TCP packet overheads and so on, that's quite a lot of data you're transferring.
The first thing I would suggest though is that you plug in some profiling code at each step in the process to get an idea of where your bottlenecks are happening. How fast is data arriving from the capture device? How big are your packets? How long does it take to service each call to your OnDataAvailable event handler? How much data are you sending per second across the network? How fast is the data arriving at the client? Figure out where the bottlenecks are and you get a much better idea of what the bottlenecks are.
Try building a simulated server that reads data from a wave file in various WaveFormats (channels, bits per sample and sample rate) and simulates sending that data across the network to the client. You might find that the problem goes away at lower bandwidth. And if bandwidth is the issue, compression might be the solution.
If you're using a single-threaded model, and servicing each OnDataAvailable event takes longer than the recording frequency (ie: number of expected calls to OnDataAvailable per second) then there's going to be a data loss issue. Multiple threads can help with this - one to get the data from the audio system, another to process and send the data. But you can end up in the same position: losing data because you're not dealing with it quickly enough. When that happens it's handy to know about it, because it indicates a problem in the program. Find out when and where it happens - overflow in input, processing or output buffers all have different potential reasons and need different attention.

execute functions with millisecond accuracy based on timer

I have an array of floats which represent time values based on when events were triggered over a period of time.
0: time stored: 1.68
1: time stored: 2.33
2: time stored: 2.47
3: time stored: 2.57
4: time stored: 2.68
5: time stored: 2.73
6: time stored: 2.83
7: time stored: 2.92
8: time stored: 2.98
9: time stored: 3.05
I would now like to start a timer and when the timer hits 1 second 687 milliseconds - the first position in the array - for an event to be triggered/method execution.
and when the timer hits 2 seconds and 337 milliseconds for a second method execution to be triggered right till the last element in the array at 3 seconds and 56 milliseconds for the last event to be triggered.
How can i mimick something like this? i need something with high accuracy
I guess what im essentially asking is how to create a metronome with high precision method calls to play the sound back on time?
…how to create a metronome with high precision method calls to play the sound back on time?
You would use the audio clock, which has all the accuracy you would typically want (the sample rate for audio playback -- e.g. 44.1kHz) - not an NSTimer.
Specifically, you can use a sampler (e.g. AudioUnit) and schedule MIDI events, or you can fill buffers with your (preloaded) click sounds' sample data in your audio streaming callback at the sample positions determined by the tempo.
To maintain 1ms or better, you will need to always base your timing off the audio clock. This is really very easy because your tempo shall dictate an interval of frames.
The tough part (for most people) is getting used to working in realtime contexts and using the audio frameworks, if you have not worked in that domain previously.
Look into dispatch_after(). You'd create a target time for it using something like dispatch_time(DISPATCH_TIME_NOW, 1.687000 * NSEC_PER_SEC).
Update: if you only want to play sounds at specific times, rather than do arbitrary work, then you should use an audio API that allows you to schedule playback at specific times. I'm most familiar with the Audio Queue API. You would create a queue and create 2 or 3 buffers. (2 if the audio is always the same. 3 if you dynamically load or compute it.) Then, you'd use AudioQueueEnqueueBufferWithParameters() to queue each buffer with a specific start time. The audio queue will then take care of playing as close as possible to that requested start time. I doubt you're going to beat the precision of that by manually coding an alternative. As the queue returns processed buffers to you, you refill it if necessary and queue it at the next time.
I'm sure that AVFoundation must have a similar facility for scheduling playback at specific time, but I'm not familiar with it.
To get high precision timing you'd have to jump down a programming level or two and utilise something like the Core Audio Unit framework, which offers sample-accurate timing (at 44100kHz, samples should occur around every 0.02ms).
The drawback to this approach is that to get such timing performance, Core Audio Unit programming eschews Objective-C for a C/C++ approach, which is (in my opinion) tougher to code than Objective-C. The way Core Audio Units work is also quite confusing on top of that, especially if you don't have a background in audio DSP.
Staying in Objective-C, you probably know that NSTimers are not an option here. Why not check out the AVFoundation framework? It can be used for precise media sequencing, and with a bit of creative sideways thinking, and the AVURLAssetPreferPreciseDurationAndTimingKey option of AVURLAsset, you might be able to achieve what you want without using Core Audio Units.
Just to fill out more about AVFoundation, you can place instances of AVAsset into an AVMutableComposition (via AVMutableCompositionTrack objects), and then use AVPlayerItem objects with an AVPlayer instance to control the result. The AVPlayerItem notification AVPlayerItemDidPlayToEndTimeNotification (docs) can be used to determine when individual assets finish, and the AVPlayer methods addBoundaryTimeObserverForTimes:queue:usingBlock: and addPeriodicTimeObserverForInterval:queue:usingBlock: can provide notifications at arbitrary times.
With iOS, if your app will be playing audio, you can also get this all to run on the background thread, meaning you can keep time whilst your app is in the background (though a warning, if it does not play audio, Apple might not accept your app using this background mode). Check out UIBackgroundModes docs for more info.

Why is playing audio through AV Foundation blocking the UI on a slow connection?

I'm using AV Foundation to play an MP3 file loaded over the network, with code that is almost identical to the playback example here: Putting it all Together: Playing a Video File Using AVPlayerLayer, except without attaching a layer for video playback. I was trying to make my app respond to the playback buffer becoming empty on a slow network connection. To do this, I planned to use key-value observing on the AVPlayerItem's playbackBufferEmpty property, but the documentation did not say whether that was possible. I thought it might be possible because the status property can be observed (and is the example above) even though the documentation doesn't say that.
So, in an attempt to create conditions where the buffer would empty, I added code on the server to sleep for two seconds after serving up each 8k chunk of the MP3 file. Much to my surprise, this caused my app's UI (updated using NSTimer) to freeze completely for long periods, despite the fact that it shows almost no CPU usage in the profiler. I tried loading the tracks on another queue with dispatch_async, but that didn't help at all.
Even without the sleep on the server, I've noticed that loading streams using AVPlayerItem keeps the UI from updating for the short time that the stream is being downloaded. I can't see why a slow file download should ever block the responsiveness of the UI. Any idea why this is happening or what I can do about it?
Okay, problem solved. It looks like passing AVURLAssetPreferPreciseDurationAndTimingKey in the options to URLAssetWithURL:options: causes the slowdown. This also only happens when the AVURLAsset's duration property or some other property relating to the stream's timing is accessed from the selector fired by the NSTimer. So if you can avoid polling for timing information, this problem may not affect you, but that's not an option for me. If precise timing is not requested, there's still a delay of around 0.75 seconds to 1 second, but that's all.
Looking back through it, the documentation does warn that precise timing might cause slower performance, but I never imagined 10+ second delays. Why the delay should scale with the loading time of the media is beyond me; it seems like it should only scale with the size of the file. Maybe iOS is doing some kind of heavy polling for new data and/or processing the same bytes over and over.
So now, without "precise timing and duration," the duration of the asset is permanently at 0.0, even when it's fully loaded. I can also answer my original goal of doing KVO on AVPlayerItem.isPlaybackBufferEmpty. It seems KVO would be useless anyway, since the property starts out being NO, changes to YES as soon as I start playback, and continues to be YES even as the media is playing for minutes at a time. The documentation says this about the property:
Indicates whether playback has consumed all buffered media and that playback will stall or end.
So I guess that's not accurate, and, at least in this particular case, the property is not very useful.

Unexpected behavior with AudioQueueServices callback while recording audio

I'm recording a continuous stream of data using AudioQueueServices. It is my understanding that the callback will only be called when the buffer fills with data. In practice, the first callback has a full buffer, the 2nd callback is 3/4 full, the 3rd callback is full, the 4th is 3/4 full, and so on. These buffers are 8000 packets (recording 8khz audio) - so I should be getting back 1s of audio to the callback each time. I've confirmed that my audio queue buffer size is correct (and is somewhat confirmed by the behavior). What am I doing wrong? Should I be doing something in the AudioQueueNewInput with a different RunLoop? I tried but this didn't seem to make a difference...
By the way, if I run in the debugger, each callback is full with 8000 samples - making me think this is a threading / timing thing.
Apparently, from discussing with others — and the lack of responses, this behavior is as designed (or broken but not likely to be fixed), even if improperly documented. The work around is to buffer your samples appropriately in the callback if necessary and not expect the buffer to be full. This isn't an issue at all if you are just writing the data to a file, but if you expect to operate on consistently sized blocks of audio data in the callback, you will have to assure this consistency yourself.