We're experimenting with a FreeSWITCH-based multiparty video conferencing solution (Zoom-like). The users connect via WebRTC (Verto clients) and the streams are all muxed and displayed on the canvas (mod_conference in mux mode). It works OK, but we notice high media latency on the mixed output, which makes it very difficult to have a real-time dialogue.

This is not load related: even with only one caller watching himself on the canvas (the muxed conference output), it takes almost 1 second for a local movement to show up on the screen (e.g. if I raise my hand, I see it happen on screen almost 1 second later). This is obviously the round-trip delay, but after discounting the intrinsic network latency (measured at about 100 ms round trip) there seem to be around 800-900 ms of added latency. There is no TURN relaying involved, so it appears this is being introduced along the buffering/transcoding/muxing pipeline.

Any suggestions on what to try to reduce the latency? What sort of latency should we expect, what's your experience, and has anyone deployed FreeSWITCH video conferencing with acceptable latency for bidirectional, real-time conversations? Ultimately I'm trying to understand whether FreeSWITCH can be used for multiparty real-time video conversation, or whether I should give up and look for something else. Thanks!
I am new to WebRTC technology.
I want to create a video chat / video conference with one transmitter and many followers (more than 1000).
I have read a lot of documentation:
https://medium.com/linagora-engineering/scalability-in-video-conferencing-part-1-276f52b4acac
https://webrtcglossary.com/sfu/
But I still don't know which is the better solution in my case: a Selective Forwarding Unit (SFU) or a Multipoint Control Unit (MCU).
Can you help me to understand?
I think the best way is MCU but I am not sure.
Second question:
Can you suggest some sources and links that can help me set up such an architecture? Currently my project works perfectly peer-to-peer (mesh), but that is not the right solution at this scale. I have absolutely no idea how to set this up.
Thank you so much
It is possible to implement this using an SFU. The more peers that connect, the more processing power you need to handle them. This can be done by using more threads and/or forwarding requests to another machine.
With mediasoup you have control over this. In this tool you have routers that peers connect to in order to receive the stream. A router runs on a worker, which can handle a limited number of receiving peers (depending on CPU capacity). To allow more peers, you can pipe the stream to other routers, which expands the total capacity (see the sketch after the links below).
Useful links:
https://mediasoup.org/documentation/v3/scalability/#one-to-many-broadcasting
https://mediasoup.org/documentation/v3/mediasoup/design/#architecture
https://mediasoup.discourse.group/t/scalability-in-mediasoup-example/793/2?u=dirvann
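A rough sketch of that piping idea with mediasoup v3 in TypeScript (the helper names are just illustrative; createWorker(), createRouter() and pipeToRouter() are the actual v3 calls):

```typescript
import * as mediasoup from "mediasoup";

// Each worker is a separate process bound to roughly one CPU core;
// each router lives on exactly one worker.
async function createRouterOnNewWorker(): Promise<mediasoup.types.Router> {
  const worker = await mediasoup.createWorker();
  return worker.createRouter({
    mediaCodecs: [{ kind: "video", mimeType: "video/VP8", clockRate: 90000 }],
  });
}

// Pipe an existing producer from its origin router to a router on another
// worker, so additional consumers can be served from there.
async function scaleOut(
  originRouter: mediasoup.types.Router,
  producerId: string
): Promise<mediasoup.types.Router> {
  const targetRouter = await createRouterOnNewWorker();
  await originRouter.pipeToRouter({ producerId, router: targetRouter });
  return targetRouter; // new peers consume `producerId` from this router
}
```

For spreading across machines, the mediasoup scalability page linked above also describes connecting routers on different hosts via pipe transports.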
Does anyone know how to stream HTML5 camera output to other users?
If that's possible, should I use sockets and stream images to the users, or some other technology?
Is there any video tutorial where I can take a look at how this is done?
Many thanks.
The two most common approaches now are most likely:
Stream from the source to a server, and allow users to connect to the server to stream to their devices, typically using some form of Adaptive Bit Rate (ABR) streaming protocol. ABR basically creates multiple bit rate versions of your content and chunks them, so the client can choose the next chunk from the bit rate best suited to the device and the current network conditions.
Stream peer to peer, or via a conferencing hub, using WebRTC.
In general, the latter is more focused on real time: any delay should stay below the threshold that would interfere with audio and video conversation, usually less than 200 ms for audio. To achieve this it may sometimes have to sacrifice quality, especially video quality.
There are some good WebRTC samples available online (here at the time of writing): https://webrtc.github.io/samples/
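For the WebRTC route, a minimal browser-side sketch (in TypeScript) might look like the following; the `signaling` object is a placeholder for whatever channel (WebSocket, etc.) you use to exchange the SDP and ICE candidates:

```typescript
// Capture the local camera and send it over an RTCPeerConnection.
async function startCall(signaling: {
  send: (msg: unknown) => void;
  onMessage: (cb: (msg: any) => void) => void;
}): Promise<void> {
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
  });

  // Grab camera and microphone, and add the tracks to the connection.
  const stream = await navigator.mediaDevices.getUserMedia({
    video: true,
    audio: true,
  });
  stream.getTracks().forEach((track) => pc.addTrack(track, stream));

  // Trickle ICE candidates to the remote side as they are gathered.
  pc.onicecandidate = (e) => {
    if (e.candidate) signaling.send({ candidate: e.candidate });
  };

  // Create and send the offer, then handle the remote answer/candidates.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  signaling.send({ sdp: pc.localDescription });

  signaling.onMessage(async (msg) => {
    if (msg.sdp) await pc.setRemoteDescription(msg.sdp);
    if (msg.candidate) await pc.addIceCandidate(msg.candidate);
  });
}
```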
We are currently using ExoPlayer for one of our applications, which is very similar to the HQ Trivia app, and we use HLS as the streaming protocol.
Due to the nature of the game, we are trying to keep all viewers of this stream at the same latency, basically to keep them in sync.
We noticed that with the current backend configuration the latency is somewhere between 6 and 10 seconds. Based on this, we assumed it would be safe to “force” the player to play at a larger delay (15 seconds, further off the live edge), thereby achieving the same (constant) delay across all devices.
We’re using the EXT-X-PROGRAM-DATE-TIME tag to get the server time of the currently playing content, and we also have a master clock with the current time (NTP). We constantly compare the two clocks to check the current latency. We pause the player until it reaches the desired delay, then resume playback.
The problem with this solution is that the latency may get worse (accumulate delay) over time, and we have no choice but to restart the playback and redo the steps described above if the delay gets too big (exceeds a specified threshold). Before restarting the player we also try to slightly increase the playback speed until it reaches the specified delay.
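Roughly, the drift-correction logic described above can be summarized like this (a simplified TypeScript sketch of the idea, not the actual ExoPlayer calls; the `Player` interface and its method names are made up for illustration):

```typescript
// Compare the wall-clock time of the content being shown
// (EXT-X-PROGRAM-DATE-TIME) against an NTP-synced clock and nudge the
// playback speed to hold a fixed target delay.
const TARGET_DELAY_MS = 15_000;
const TOLERANCE_MS = 500;

interface Player {
  // wall-clock timestamp of the currently playing position (hypothetical)
  currentProgramDateTimeMs(): number;
  setPlaybackSpeed(speed: number): void;
}

function correctDrift(player: Player, ntpNowMs: number): void {
  const latencyMs = ntpNowMs - player.currentProgramDateTimeMs();
  const driftMs = latencyMs - TARGET_DELAY_MS;

  if (Math.abs(driftMs) <= TOLERANCE_MS) {
    player.setPlaybackSpeed(1.0);  // within tolerance: play normally
  } else if (driftMs > 0) {
    player.setPlaybackSpeed(1.05); // behind the target: catch up slightly
  } else {
    player.setPlaybackSpeed(0.95); // ahead of the target: slow down slightly
  }
}
```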
The ExoPlayer instance is set up with a DefaultLoadControl, DefaultRenderersFactory and DefaultTrackSelector, and the media source uses a DefaultDataSourceFactory.
The server-side configuration is as follows:
cupertinoChunkDurationTarget: 2000 (default: 10000)
cupertinoMaxChunkCount: 31 (default: 10)
cupertinoPlaylistChunkCount: 15 (default: 3)
My first question is whether this is even achievable with a protocol like HLS. Why is the player drifting away, accumulating more and more delay?
Is there a better setup for the exoPlayer instance considering our specific use case?
Is there a better way to achieve a constant playback delay across all the playing devices? How important are the parameters on the server side in trying to achieve such a behaviour?
I would really appreciate any kind of help because I have reached a dead-end. :)
Thanks!
The only solution for this is provided by:
https://netinsight.net/product/sye/
Their solution includes frame-accurate sync with no drift and stateful ABR. This probably can’t be done with HTTP-based protocols, hence their solution is built on UDP transport.
I have a simple UDP streaming protocol that takes raw H.264 video frames and sends them instantly from the server side to the client side.
Using this protocol I can get close to network RTT latency (there is no packet resending and I don't care about packet loss), so if I have 20 ms latency from the server to the client, I can get a video frame from the encoder output to the client side (ready to be decoded) in, let's say, 30 ms.
My question is:
Is WebRTC (over UDP) capable of going down to this kind of latency?
Not taking encoding and decoding times into account, what is the lowest latency I can get with WebRTC at the protocol layer?
I don't know whether this kind of latency requires my own protocol to be developed further, or whether I can go with something more generic like WebRTC for my video server and immediately be supported by every web browser.
WebRTC can have the same low latency as regular SIP/RTP stacks.
WebRTC stack vendors do their best to reduce delay.
On the recording and sending side there is no added delay: the stack sends the packets immediately once they are received from the recording device and compressed with the selected codec. Some codecs (and some codec settings) may introduce delay here to enable features such as FEC (forward error correction).
Regarding the receiver side:
In optimal circumstances the stack should not delay the playback of the packets, so they can be displayed as soon as they arrive.
However, in sub-optimal circumstances (network delays or packet loss) the stack will introduce a jitter buffer. The lower the network quality, the longer the jitter buffer.
So, to achieve the lowest delay, you might have to do the following (see the sketch after this list):
choose a codec with the smallest processing time
remove FEC and disable any other settings which might cause additional delays
remove the jitter buffer (most WebRTC stacks don't have a setting for this, so you might have to modify the code yourself; it is an easy modification because you just need to deactivate that part of the code)
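For the codec-related points, a small browser-side sketch in TypeScript might look like the following. setCodecPreferences() and Opus's useinbandfec fmtp parameter are standard; everything else is illustrative, and the jitter buffer itself is not exposed to JavaScript, so that change has to happen inside the native stack:

```typescript
// Prefer a particular (low-complexity) video codec on every transceiver.
function preferVp8(pc: RTCPeerConnection): void {
  for (const transceiver of pc.getTransceivers()) {
    if (transceiver.sender.track?.kind !== "video") continue;

    const codecs = RTCRtpSender.getCapabilities("video")?.codecs ?? [];
    const preferred = codecs.filter((c) => c.mimeType === "video/VP8");
    const rest = codecs.filter((c) => c.mimeType !== "video/VP8");
    if (preferred.length) {
      transceiver.setCodecPreferences([...preferred, ...rest]);
    }
  }
}

// Disable Opus in-band FEC by munging the SDP before it is sent.
function disableOpusFec(sdp: string): string {
  return sdp.replace(/useinbandfec=1/g, "useinbandfec=0");
}
```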
WebRTC uses RTP as the underlying media transport, which adds only a small header at the beginning of the payload compared to plain UDP. This means it should be on par with what you can achieve with plain UDP. RTP is heavily used in latency-critical environments like real-time audio and video (it is the media transport in SIP, H.323 and XMPP), so you can expect the latency to be sufficient for this purpose.
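To make the "small additional header" point concrete, this is the 12-byte RTP fixed header from RFC 3550, sketched in TypeScript for illustration (a real WebRTC stack builds this for you and additionally protects it with SRTP):

```typescript
// Minimal RTP fixed header: 12 bytes on top of the UDP payload.
function buildRtpHeader(
  payloadType: number,    // 7-bit payload type identifying the codec
  sequenceNumber: number, // 16-bit, incremented per packet
  timestamp: number,      // 32-bit media timestamp
  ssrc: number            // 32-bit synchronization source identifier
): Uint8Array {
  const header = new Uint8Array(12);
  const view = new DataView(header.buffer);
  header[0] = 0x80;                           // V=2, P=0, X=0, CC=0
  header[1] = payloadType & 0x7f;             // M=0, PT
  view.setUint16(2, sequenceNumber & 0xffff); // network byte order
  view.setUint32(4, timestamp >>> 0);
  view.setUint32(8, ssrc >>> 0);
  return header;
}
```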
I successfully use WasapiLoopbackCapture() to record audio played on the system, but I'm looking for a way to record what the user would actually hear through the speakers.
I'll explain: if a certain application plays music, WASAPI loopback will capture the music samples even if the main Windows volume control is set to 0, meaning even if no sound is actually heard through the audio card's output jack (speakers/headphones/etc.).
I'd like to capture the audio actually "reaching" the output jack (after ALL the mixers on the audio path have done their job).
Is this possible using NAudio (or other infrastructure)?
A code sample or a link to one would come in handy.
Thanks much.
No, this is not directly possible. The loopback capture provided by WASAPI is the stream of data being sent to the audio hardware. It is the hardware that controls the actual output sound, and this is where the volume level is applied to change the output signal strength. Apart from some hardware- and driver-specific options - or some interesting hardware solutions like loopback cables or external ADC - there is no direct method to get the true output data.
One option is to get the volume level from the mixer and apply it as a scaling factor on any data you receive from the loopback stream. This is not a perfect solution, but possibly the best you can do without specific hardware support.
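If it helps, here is a rough sketch of that scaling idea (shown in TypeScript for illustration rather than C#/NAudio; how you obtain the current mixer/endpoint volume depends on your audio library and is assumed here):

```typescript
// Scale loopback samples by the current mixer volume so the captured data
// approximates what actually reaches the output jack.
function applyMixerVolume(
  samples: Float32Array, // float samples from the loopback capture
  mixerVolume: number    // current output volume, assumed in [0, 1]
): Float32Array {
  const scaled = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    scaled[i] = samples[i] * mixerVolume;
  }
  return scaled;
}
```

As noted above, this remains an approximation: anything the hardware or driver does after the loopback tap is still not reflected in the result.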