mapping between upcall from a kernel vport and the userspace upcall threads - netlink

When OVS uses the kernel datapath with, say, 2 userspace upcall threads and 4 kernel vports, it seems (from the 2.11.90 source code) that every vport creates only one netlink socket and binds it to every upcall thread's epoll_handler.
The question is: if an upcall request is sent from a vport's netlink socket, which upcall thread in userspace will handle this request?
Thanks much.

I found the answer already. OVS creates one epoll handler for all the netlink sockets, and all the threads wait on that epoll handler. When an event occurs on the epoll handler, the first thread waiting on it is woken up and handles the event.
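To illustrate the behaviour described above, here is a minimal sketch (Linux only, and not OVS code). A datagram socketpair stands in for a vport's netlink socket, and the names run_demo and handler-N are made up for the example. Several threads wait on one shared epoll set, and whichever thread wakes first and reads the datagram handles it.

```python
import select
import socket
import threading


def run_demo(num_threads=2, num_messages=4):
    """Several threads wait on one shared epoll set; each datagram is handled once."""
    # Assumption: a datagram socketpair stands in for a vport's netlink socket.
    rx, tx = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    rx.setblocking(False)

    ep = select.epoll()  # Linux only
    ep.register(rx.fileno(), select.EPOLLIN)

    handled = []
    lock = threading.Lock()

    def worker(name):
        while True:
            events = ep.poll(0.5)  # wait up to 0.5 s for an event
            if not events:
                return  # idle: nothing more to handle
            for _fd, _mask in events:
                try:
                    data = rx.recv(64)
                except BlockingIOError:
                    continue  # another thread read the datagram first
                with lock:
                    handled.append((name, data))

    threads = [
        threading.Thread(target=worker, args=("handler-%d" % i,))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for i in range(num_messages):
        tx.send(b"upcall-%d" % i)
    for t in threads:
        t.join()

    ep.close()
    rx.close()
    tx.close()
    return handled
```

Which thread ends up with which message is not deterministic; the only guarantee the sketch relies on is that each datagram is read exactly once.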

Related

UDP server and connected sockets

[edit]
Seems my question was asked nearly 10 years ago here...
Emulating accept() for UDP (timing-issue in setting up demultiplexed UDP sockets)
...with no clean and scalable solution. I think this could be solved handily by supporting listen() and accept() for UDP, just as connect() is now.
[/edit]
In a followup to this question...
Can you bind() and connect() both ends of a UDP connection
...is there any mechanism to simultaneously bind() and connect()?
The reason I ask is that a multi-threaded UDP server may wish to move a new "session" to its own descriptor for scalability purposes. The intent is to prevent the listener descriptor from becoming a bottleneck, similar to the rationale behind SO_REUSEPORT.
However, a bind() call with a new descriptor will take over the port from the listener descriptor until the connect() call is made. That provides a window of opportunity, albeit a brief one, for ingress datagrams from other peers to get delivered to the new descriptor's queue.
This window is also a problem for UDP servers wanting to employ DTLS. It's recoverable if the clients retry, but not having to would be preferable.
connect() on UDP does not provide connection demultiplexing.
connect() does two things:
Sets a default address for transmit functions that don't accept a destination address (send(), write(), etc)
Sets a filter on incoming datagrams.
It's important to note that the incoming filter simply discards datagrams that do not match. It does not forward them elsewhere. If there are multiple UDP sockets bound to the same address, some OSes will pick one (maybe random, maybe last created) for each datagram (demultiplexing is totally broken) and some will deliver all datagrams to all of them (demultiplexing succeeds but is incredibly inefficient). Both of these are "the wrong thing". Even an OS that lets you pick between the two behaviors via a socket option is still doing things differently from the way you wanted. The time between bind() and connect() is just the smallest piece of this puzzle of unwanted behavior.
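A small sketch of the filtering behaviour, run over loopback. The function name and the peers A and B are made up for the example, and the result described is what I would expect on Linux. The datagram from the non-matching peer is simply dropped; it is not queued anywhere else.

```python
import socket


def demo_connect_filter():
    """Show that a connected UDP socket only receives from its connected peer."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    peer_a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    peer_b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        server.bind(("127.0.0.1", 0))
        server.settimeout(0.5)
        peer_a.bind(("127.0.0.1", 0))
        peer_b.bind(("127.0.0.1", 0))

        # connect() sets the default destination and the incoming filter.
        server.connect(peer_a.getsockname())

        peer_b.sendto(b"from B", server.getsockname())  # does not match the filter
        peer_a.sendto(b"from A", server.getsockname())  # matches the filter

        first = server.recv(64)
        try:
            second = server.recv(64)
        except socket.timeout:
            second = None  # nothing else was queued: B's datagram was discarded
        return first, second
    finally:
        server.close()
        peer_a.close()
        peer_b.close()
```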
To handle UDP with multiple peers, use a single socket in connectionless mode. To have multiple threads processing received packets in parallel, you can either
call recvfrom on multiple threads which process the data (this works because datagram sockets preserve message boundaries, you'd never do this with a stream socket such as TCP), or
call recvfrom on a single thread, which doesn't do any processing, just queues the message to the thread responsible for processing it.
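The second option can be sketched roughly as follows. This is only an illustration: run_dispatcher and the choice of hashing the whole peer address are assumptions, not a recommendation for any particular workload. One thread does nothing but recvfrom() and hands each datagram to the worker that owns that peer.

```python
import queue
import socket
import threading


def run_dispatcher(messages_per_client, num_workers=2):
    """One receiver thread queues datagrams to workers chosen by peer address."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))
    server.settimeout(1.0)
    expected = sum(len(msgs) for msgs in messages_per_client)

    queues = [queue.Queue() for _ in range(num_workers)]
    handled = [{} for _ in range(num_workers)]  # per worker: peer -> payloads

    def worker(index):
        while True:
            item = queues[index].get()
            if item is None:  # shutdown marker
                return
            peer, payload = item
            handled[index].setdefault(peer, []).append(payload)

    def receiver():
        try:
            for _ in range(expected):
                payload, peer = server.recvfrom(2048)
                # Same peer always maps to the same worker.
                queues[hash(peer) % num_workers].put((peer, payload))
        except socket.timeout:
            pass  # give up if a datagram never arrives
        finally:
            for q in queues:
                q.put(None)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    threads.append(threading.Thread(target=receiver))
    for t in threads:
        t.start()

    clients = []
    for msgs in messages_per_client:
        client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        clients.append(client)
        for payload in msgs:
            client.sendto(payload, server.getsockname())

    for t in threads:
        t.join()
    for client in clients:
        client.close()
    server.close()
    return handled
```

Because every datagram from a given peer goes through the same queue, per-peer ordering is preserved without any locking in the workers.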
Even if you had an OS that gave you an option for dispatching incoming UDP based on designated peer addresses (connection emulation), doing that dispatching inside the OS is still not going to be any more efficient than doing it in the server application, and a user-space dispatcher tuned for your traffic patterns is probably going to perform substantially better than a one-size-fits-all dispatcher provided by the OS.
For example, a DNS (DHCP) server is going to transact with a lot of different hosts, nearly all running on port 53 (67-68) at the remote end. So hashing based on the remote port would be useless; you need to hash on the host. Conversely, a cache server supporting a web application server cluster is going to transact with a handful of hosts and a large number of different ports. Here hashing on the remote port will be better.
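To make the difference concrete, here is a toy comparison. The peer lists and the modulo "hash" are made up for illustration; a real dispatcher would use a proper hash function. It counts how many worker buckets each key actually uses for the two traffic patterns.

```python
import ipaddress


def bucket_by_host(peer, num_buckets):
    """Pick a bucket from the remote host address only."""
    host, _port = peer
    return int(ipaddress.ip_address(host)) % num_buckets


def bucket_by_port(peer, num_buckets):
    """Pick a bucket from the remote port only."""
    _host, port = peer
    return port % num_buckets


def buckets_used(peers, key, num_buckets):
    """Number of distinct buckets the given key spreads the peers over."""
    return len({key(peer, num_buckets) for peer in peers})


# Many hosts, all on the same remote port (the DNS-like case in the text).
many_hosts_one_port = [("10.0.0.%d" % i, 53) for i in range(1, 9)]

# A handful of hosts, many different remote ports (the cache-server case).
few_hosts_many_ports = [("10.0.0.%d" % (1 + i % 2), 40000 + i) for i in range(8)]
```

With 4 buckets, hashing the first list by port puts everything in a single bucket, while hashing the second list by host can use at most 2 of them.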
Do the connection association yourself, don't use socket connection emulation.
The issue you described is one I encountered some time ago while implementing a TCP-like listen/accept mechanism for UDP.
In my case the solution (which turned out to be bad, as I will describe later) was to create one UDP socket to receive any incoming datagram and, when one arrived, to connect that socket to the sender (via recvfrom() with MSG_PEEK, then connect()) and hand it to a new thread. A new unconnected UDP socket was then created for the next incoming datagrams. This way the new thread (with its dedicated socket) called recv() on the socket and handled only that particular channel from then on, while the main thread waited for new datagrams from other peers.
Everything worked well until the incoming datagram rate increased. The problem was that while the main socket was transitioning to the connected state, it buffered not one but several datagrams (coming from many peers), so the thread created to handle a particular sender ended up reading datagrams not intended for it.
I could not find a solution (for example, creating a new connected socket instead of connecting the main one, and passing the datagram received on the main socket into its receive buffer for a later recv()). Eventually, I ended up with N threads, each having its own "listening" socket (using SO_REUSEPORT), with datagram scattering done at the OS level.
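For reference, a minimal sketch of the SO_REUSEPORT setup (Linux; the function name and counts are made up). It uses one thread with select() instead of N threads only to keep the example short. How the kernel spreads datagrams across the sockets depends on the senders' addresses, so the sketch only checks that every datagram arrives at exactly one of them.

```python
import select
import socket


def run_reuseport_demo(num_listeners=3, num_clients=8):
    """Bind several UDP sockets to one port and count datagrams per socket."""
    listeners = []
    addr = ("127.0.0.1", 0)
    for _ in range(num_listeners):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # Must be set on every socket before bind().
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind(addr)
        addr = s.getsockname()  # later sockets reuse the port chosen first
        listeners.append(s)

    # One socket per client so the senders have different source ports.
    clients = [socket.socket(socket.AF_INET, socket.SOCK_DGRAM) for _ in range(num_clients)]
    for client in clients:
        client.sendto(b"hello", addr)

    counts = [0] * num_listeners
    received = 0
    while received < num_clients:
        readable, _, _ = select.select(listeners, [], [], 1.0)
        if not readable:
            break  # give up if something was lost
        for s in readable:
            s.recvfrom(2048)
            counts[listeners.index(s)] += 1
            received += 1

    for s in listeners + clients:
        s.close()
    return counts
```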

NETLINK input function in kernel

When we invoke the sendmsg API call from a user process, the input function is invoked and the message is sent to the kernel. OK, but when we call the recvmsg API call, is the input function invoked again? I saw this in an example that I cannot comment on because I don't have enough reputation. The title of that post is: "How to use netlink socket to communicate with a kernel module?" So, could anyone look at that example and tell me how to distinguish between writing to the kernel socket and reading from it?
Why would the input function be invoked again? sendmsg() sends and recvmsg() receives. The hello_nl_recv_msg() is only executed when the kernel module receives a message.
In that example, the userspace program sends message A to the kernel using the sendmsg() function.
Message A arrives at the kernel. The kernel calls hello_nl_recv_msg(). Message A is encapsulated in the argument, skb.
The kernel module chooses to send a response to the process whose process ID is the one that sent skb. It creates message B. The kernel module sends message B to userspace using the nlmsg_unicast() function.
Message B appears in userspace during the recvmsg() call. (Because the process ID of the userspace program is the same one the kernel module wrote to.)
recvmsg() sleeps until a message from the kernel is received, so you don't have to worry whether the kernel has already answered or not before you call that function.
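The sequence above can be mimicked in userspace. To be clear, this is only a stand-in: a socketpair and a thread play the role of the netlink socket and the kernel module, and input_handler is a made-up substitute for hello_nl_recv_msg(). The point is that the input handler runs once, when message A arrives, and the later receive call does not invoke it again.

```python
import socket
import threading


def run_exchange():
    """Send message A, get message B back, and record input-handler calls."""
    # Assumption: a socketpair stands in for the netlink socket.
    user, kernel = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    input_calls = []

    def input_handler(skb):
        # Stand-in for hello_nl_recv_msg(): runs once per incoming message.
        input_calls.append(skb)
        kernel.send(b"reply to " + skb)  # stand-in for nlmsg_unicast()

    def kernel_side():
        input_handler(kernel.recv(1024))

    t = threading.Thread(target=kernel_side)
    t.start()

    user.send(b"message A")   # like sendmsg(): this is what triggers the handler
    reply = user.recv(1024)   # like recvmsg(): just sleeps until message B arrives

    t.join()
    user.close()
    kernel.close()
    return reply, input_calls
```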

How to prevent an I/O Completion Port from blocking when completion packets are available?

I have a server application that uses Microsoft's I/O Completion Port (IOCP) mechanism to manage asynchronous network socket communication. In general, this IOCP approach has performed very well in my environment. However, I have encountered an edge case scenario for which I am seeking guidance:
For the purposes of testing, my server application is streaming data (let's say ~400 KB/sec) over a gigabit LAN to a single client. All is well...until I disconnect the client's Ethernet cable from the LAN. Disconnecting the cable in this manner prevents the server from immediately detecting that the client has disappeared (i.e. the client's TCP network stack does not send notification of the connection's termination to the server).
Meanwhile, the server continues to make WSASend calls to the client...and being that these calls are asynchronous, they appear to "succeed" (i.e. the data is buffered by the OS in the outbound queue for the socket).
While this is all happening, I have 16 threads blocked on GetQueuedCompletionStatus, waiting to retrieve completion packets from the port as they become available. Prior to disconnecting the client's cable, there was a constant stream of completion packets. Now, everything (as expected) seems to have come to a halt...for about 32 seconds. After 32 seconds, IOCP springs back into action returning FALSE with a non-null lpOverlapped value. GetLastError returns 121 (The semaphore timeout period has expired.) I can only assume that error 121 is an artifact of WSASend finally timing out after the TCP stack determined the client was gone?
I'm fine with the network stack taking 32 seconds to figure out my client is gone. The problem is that while the system is making this determination, my IOCP is paralyzed. For example, WSAAccept events that post to the same IOCP are not handled by any of the 16 threads blocked on GetQueuedCompletionStatus until the failed completion packet (indicating error 121) is received.
My initial plan to work around this involved using WSAWaitForMultipleEvents immediately after calling WSASend. If the socket event wasn't signaled within, say, 3 seconds, I would terminate the socket connection and move on (in hopes of preventing the extended blocking effect on my IOCP). Unfortunately, WSAWaitForMultipleEvents never seems to time out (so maybe asynchronous sockets are signaled by virtue of being asynchronous? Or does copying data to the TCP queue qualify for a signal?)
I'm still trying to sort this all out, but was hoping someone had some insights as to how to prevent the IOCP hang.
Other details: My server application is running on Win7 with 8 cores; IOCP is configured to use at most 8 concurrent threads; my thread pool has 16 threads. Plenty of RAM, processor and bandwidth.
Thanks in advance for your suggestions and advice.
It's usual for the WSASend() completions to stall in this situation. You won't get them until the TCP stack times out its resend attempts and completes all of the outstanding sends in error. This doesn't block any other operations. I expect you are either testing incorrectly or have a bug in your code.
Note that your 'fix' is flawed. You could see this 'delayed send completion' situation at any point during a normal connection if the sender is sending faster than the consumer can consume. See this article on TCP flow control and async writes. A better plan is to keep a counter of the number of outstanding writes (per connection) that you want to allow, stop sending when that counter reaches its limit, and resume when it drops below a 'low water mark' threshold value.
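The counter idea can be sketched like this. It is written in Python rather than against the Winsock API to keep it short, and WriteThrottle and its method names are made up. The sender checks can_send() before issuing a write, calls on_send_issued() when it posts one, and the completion handler calls on_send_completed().

```python
class WriteThrottle:
    """Per-connection limit on outstanding asynchronous writes."""

    def __init__(self, high_water, low_water):
        if not 0 <= low_water < high_water:
            raise ValueError("need 0 <= low_water < high_water")
        self.high_water = high_water
        self.low_water = low_water
        self.outstanding = 0
        self.paused = False

    def can_send(self):
        """True while the connection may issue another write."""
        return not self.paused

    def on_send_issued(self):
        """Call when a write has been posted."""
        self.outstanding += 1
        if self.outstanding >= self.high_water:
            self.paused = True  # stop sending until completions drain

    def on_send_completed(self):
        """Call from the completion handler for a write."""
        self.outstanding -= 1
        if self.paused and self.outstanding < self.low_water:
            self.paused = False  # dropped below the low water mark: resume
```

Using two thresholds instead of one avoids flapping between paused and resumed on every single completion.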
Note that if you've pulled out the network cable going into the machine, how do you expect any other operations to complete? Reads will just sit there and only fail once a write has failed, and AcceptEx will simply sit there and wait for the condition to rectify itself.

boost::asio timeouts example - writing data is expensive

boost::asio provides an example of how to use the library to implement asynchronous timeouts: the client sends the server periodic heartbeat messages, and the server echoes each heartbeat back to the client. Failure to respond within N seconds causes a disconnect. See boost_asio/example/timeouts/server.cpp
The pattern outlined in these examples would be a good starting point for part of a project I will be working on shortly, but for one wrinkle:
In addition to heartbeats, both client and server need to send messages to each other.
The timeouts example pushes heartbeat echo messages onto a queue, and a subsequent timeout causes an asynchronous handler for the timeout to actually write the data to the socket.
Introducing data for the socket to write cannot be done on the thread running io_service, because it is blocked on run(). run_one() doesn't help: you still block until there is a handler to run, and you introduce the complexity of managing work for the io_service.
In asio, asynchronous handlers - writes to the socket being one of them - are called on the thread running io_service.
Therefore, to introduce messages randomly, data to be sent is pushed onto a queue from a thread other than the io_service thread, which implies protecting the queue and notification timer with a mutex. There are then two mutexes per message, one for pushing the data to the queue, and one for the handler which dequeues the data for write to socket.
This is actually a more general question than asio timeouts alone: is there a pattern, when the io_service thread is blocked on run(), by which data can be asynchronously written to the socket without taking two mutexes per message?
The following things could be of interest: boost::asio strands are a mechanism for synchronising handlers. You only need them, though, if you are calling io_service::run from multiple threads, AFAIK.
Also useful is the io_service::post method, which allows you to execute code on the thread that has invoked io_service::run.
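As an analogy only (this is Python's asyncio, not boost::asio), the same idea looks like the sketch below. loop.call_soon_threadsafe plays the role of io_service::post, and run_demo and enqueue are made-up names. The outgoing queue is only ever touched on the event-loop thread, so the application code needs no mutex of its own around it.

```python
import asyncio
import threading


def run_demo(messages):
    """Post messages from another thread; queue them only on the loop thread."""
    loop = asyncio.new_event_loop()
    outbox = []            # touched only on the loop thread
    written = []           # stand-in for data written to the socket
    handler_threads = set()

    def enqueue(msg):
        # Runs on the loop thread, like a handler posted with io_service::post.
        handler_threads.add(threading.get_ident())
        outbox.append(msg)
        written.append(outbox.pop(0))  # stand-in for starting an async write

    loop_thread = threading.Thread(target=loop.run_forever)
    loop_thread.start()

    for msg in messages:
        loop.call_soon_threadsafe(enqueue, msg)  # safe to call from any thread
    loop.call_soon_threadsafe(loop.stop)

    loop_thread.join()
    loop.close()
    return written, handler_threads == {loop_thread.ident}
```

The event loop still synchronises internally when a callback is posted, but that is one hand-off per message rather than two user-level locks.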

Preferred method of notifying upper layers about received message

I'm writing an RS485 driver for an embedded C project.
The driver is listening for incoming messages and should notify the upper layer application when a complete message is received and ready to be read.
What is the preferred way to do this?
By using interrupts? Trigger a SW interrupt and read the message from within the ISR?
Let the application poll the driver periodically?
I generally do as little work as possible in the ISR to secure the received data or clean up the transmitted data. This will usually mean reading data out of the hardware buffers and into a circular buffer.
On receive, for a multi-threaded OS, a receive interrupt empties the hardware, clears the interrupt and signals a thread to service the received data.
For a polling environment, a receive interrupt empties the hardware, clears the interrupt, and sets a flag to notify the polling loop that it has something to process.
Since interrupts can occur at any time, the data structures shared between the ISR and the polling loop or processing thread must be protected using a mutual exclusion mechanism.
Often this will mean disabling interrupts briefly while you adjust a pointer or count.
If the received data is packetized, you can hunt for packet boundaries in the ISR and notify the handler only when a full packet has arrived.
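Here is a small model of that scheme, written in Python purely to show the logic (real firmware would be C). RxDriver, the newline terminator and the method names are assumptions for the example. The "ISR" only copies a byte into a circular buffer and counts complete packets; the polling loop reads a packet when the count is non-zero.

```python
class RxDriver:
    """Model of an ISR-fed circular buffer with packet-boundary detection."""

    def __init__(self, size=64, terminator=0x0A):
        self.buf = bytearray(size)
        self.size = size
        self.head = 0            # written by the ISR
        self.tail = 0            # written by the polling loop
        self.packets_ready = 0   # flag/count shared between ISR and loop
        self.terminator = terminator

    def isr_byte_received(self, byte):
        """What the receive ISR would do for one byte; returns False if dropped."""
        nxt = (self.head + 1) % self.size
        if nxt == self.tail:
            return False  # buffer full: drop the byte
        self.buf[self.head] = byte
        self.head = nxt
        if byte == self.terminator:
            self.packets_ready += 1  # notify only on a complete packet
        return True

    def read_packet(self):
        """Called from the polling loop; returns one packet or None.

        On real hardware the updates to tail and packets_ready would be
        done with interrupts briefly disabled.
        """
        if self.packets_ready == 0:
            return None
        out = bytearray()
        while True:
            byte = self.buf[self.tail]
            self.tail = (self.tail + 1) % self.size
            if byte == self.terminator:
                break
            out.append(byte)
        self.packets_ready -= 1
        return bytes(out)
```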