How to prevent an I/O Completion Port from blocking when completion packets are available? - winsock2

I have a server application that uses Microsoft's I/O Completion Port (IOCP) mechanism to manage asynchronous network socket communication. In general, this IOCP approach has performed very well in my environment. However, I have encountered an edge case scenario for which I am seeking guidance:
For the purposes of testing, my server application is streaming data (let's say ~400 KB/sec) over a gigabit LAN to a single client. All is well...until I disconnect the client's Ethernet cable from the LAN. Disconnecting the cable in this manner prevents the server from immediately detecting that the client has disappeared (i.e. the client's TCP network stack does not send notification of the connection's termination to the server).
Meanwhile, the server continues to make WSASend calls to the client...and because these calls are asynchronous, they appear to "succeed" (i.e. the data is buffered by the OS in the outbound queue for the socket).
While this is all happening, I have 16 threads blocked on GetQueuedCompletionStatus, waiting to retrieve completion packets from the port as they become available. Prior to disconnecting the client's cable, there was a constant stream of completion packets. Now, everything (as expected) seems to have come to a halt...for about 32 seconds. After 32 seconds, IOCP springs back into action returning FALSE with a non-null lpOverlapped value. GetLastError returns 121 (The semaphore timeout period has expired.) I can only assume that error 121 is an artifact of WSASend finally timing out after the TCP stack determined the client was gone?
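For reference, my worker threads use a standard GetQueuedCompletionStatus loop that distinguishes the three possible outcomes roughly like this (a simplified sketch, not my actual code; hCompletionPort is just a placeholder for my port handle):

    #include <winsock2.h>
    #include <windows.h>

    DWORD WINAPI WorkerThread(LPVOID param)
    {
        HANDLE hCompletionPort = static_cast<HANDLE>(param);

        for (;;)
        {
            DWORD        bytesTransferred = 0;
            ULONG_PTR    completionKey    = 0;
            LPOVERLAPPED pOverlapped      = NULL;

            BOOL ok = GetQueuedCompletionStatus(hCompletionPort, &bytesTransferred,
                                                &completionKey, &pOverlapped, INFINITE);
            if (!ok && pOverlapped == NULL)
            {
                // The dequeue itself failed (e.g. the port was closed) -- exit the thread.
                break;
            }
            if (!ok && pOverlapped != NULL)
            {
                // A queued I/O operation completed with an error, e.g. 121
                // (ERROR_SEM_TIMEOUT) once the TCP stack gives up on the peer.
                const DWORD err = GetLastError();
                // ...clean up the connection identified by completionKey/pOverlapped...
                continue;
            }
            // ok && pOverlapped != NULL: a successful completion packet.
            // ...dispatch based on completionKey and the per-operation OVERLAPPED context...
        }
        return 0;
    }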
I'm fine with the network stack taking 32 seconds to figure out my client is gone. The problem is that while the system is making this determination, my IOCP is paralyzed. For example, WSAAccept events that post to the same IOCP are not handled by any of the 16 threads blocked on GetQueuedCompletionStatus until the failed completion packet (indicating error 121) is received.
My initial plan to work around this involved using WSAWaitForMultipleEvents immediately after calling WSASend. If the socket event wasn't signaled within a short window (e.g. 3 seconds), I would terminate the socket connection and move on (in hopes of preventing the extensive blocking effect on my IOCP). Unfortunately, WSAWaitForMultipleEvents never seems to time out (so maybe asynchronous sockets are signaled by virtue of being asynchronous? Or does copying the data to the TCP queue qualify as a signal?).
I'm still trying to sort this all out, but was hoping someone had some insights as to how to prevent the IOCP hang.
Other details: My server application is running on Win7 with 8 cores; IOCP is configured to use at most 8 concurrent threads; my thread pool has 16 threads. Plenty of RAM, processor and bandwidth.
Thanks in advance for your suggestions and advice.

It's usual for the WSASend() completions to stall in this situation. You won't get them until the TCP stack times out its resend attempts and completes all of the outstanding sends in error. This doesn't block any other operations. I expect you are either testing incorrectly or have a bug in your code.
Note that your 'fix' is flawed. You could see this 'delayed send completion' situation at any point during a normal connection if the sender is sending faster than the consumer can consume. See this article on TCP flow control and async writes. A better plan is to keep a counter of the outstanding writes (per connection) that you want to allow, stop sending when that counter is reached, and resume when it drops back below a 'low water mark' threshold value, for example as sketched below.
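A minimal sketch of that counter approach, assuming a hypothetical per-connection structure (the water-mark values are arbitrary):

    #include <winsock2.h>

    struct Connection
    {
        SOCKET socket;
        LONG   outstandingWrites;              // WSASend calls issued but not yet completed
        bool   sendSuspended;                  // application stops generating sends while true
        static const LONG HighWaterMark = 32;  // arbitrary example values
        static const LONG LowWaterMark  = 8;
    };

    // Call this instead of issuing WSASend() directly.
    bool TrySend(Connection& conn, WSABUF* buf, LPWSAOVERLAPPED ov)
    {
        if (conn.outstandingWrites >= Connection::HighWaterMark)
        {
            conn.sendSuspended = true;         // caller keeps the data in its own queue
            return false;
        }
        DWORD bytesSent = 0;
        int rc = WSASend(conn.socket, buf, 1, &bytesSent, 0, ov, NULL);
        if (rc == 0 || WSAGetLastError() == WSA_IO_PENDING)
        {
            InterlockedIncrement(&conn.outstandingWrites);
            return true;
        }
        return false;                          // hard failure: tear the connection down
    }

    // Call this from the IOCP worker when a WSASend completion is dequeued.
    void OnSendComplete(Connection& conn)
    {
        if (InterlockedDecrement(&conn.outstandingWrites) <= Connection::LowWaterMark
            && conn.sendSuspended)
        {
            conn.sendSuspended = false;        // resume draining the application-side queue
        }
    }

With this in place the 32-second stall still happens, but it is bounded: at most HighWaterMark sends (and their buffers) are ever outstanding on a dead connection.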
Note that if you've pulled out the network cable going into the machine, how do you expect any other operations to complete? Reads will just sit there and only fail once a write has failed, and AcceptEx will simply sit there and wait for the condition to rectify itself.

Related

Limit total size of in-flight IoT messages

I am using the IoT Hub device client SDK on an embedded device. The application sends telemetry messages to IoT Hub periodically. The IoT device connects to a wireless router, and the router connects to the internet via its WAN port.
When the wireless router loses its internet connection, the IoT device is not notified of the disconnection immediately. It takes about 60 seconds to be notified; before that, the device continues to send telemetry messages with IoTHubDeviceClient_LL_SendEventAsync(), and all of those messages get queued in the SDK layer and eat memory. Since this is an embedded device with limited resources, memory gets used up and the program is killed by a low-memory killer.
Is there a way to specify the total size of IoT messages that can be queued in the SDK layer, so that IoTHubDeviceClient_LL_SendEventAsync() fails immediately once that quota is exceeded?
Actually, this is also needed for the normal scenario. When the IoT device sends a message, the message seems to be queued in a lower layer and flushed out at some later time, and I don't see any API that can control the flush. That creates another problem: even when there is an internet connection, the application has no control over how many messages are queued or for how long, and therefore no control over how much memory the process uses. On my device, there is a system monitor that kills processes that use too much memory.
The question is: what do you do if sending a message fails because the queue is full? Do you then lose the information for lack of storage capacity? From the IoT perspective, I would recommend considering whether your device is a reliable IoT device that can handle these edge cases as well. Knowing the limits of the device, and knowing how long it can be without an internet connection, helps you mitigate these risks in your application rather than in the SDK.
According to GitHub, the default sendMessageAsync method throws a timeout exception when message sending fails, unless you have some kind of retry policy implemented (according to the documentation, the C SDK does not allow custom retry policies:
https://learn.microsoft.com/en-us/azure/iot-hub/iot-hub-reliability-features-in-sdks).
According to the documentation, in case of a connection failure the SDK will, based on the retry policy (if you have set one), try to re-establish the connection one way or another and queue the messages created in the meantime:
https://github.com/Azure/azure-iot-sdk-c/blob/master/doc/connection_and_messaging_reliability.md
So the expectation here is that the SDK does not take responsibility for memory limits; that is up to the application to deal with. Since your device has these limitations, I would recommend implementing your own queuing mechanism (maybe set no-retry as the policy and avoid SDK-level queuing that way), for example along the lines of the sketch below. That way you control what happens when there is no internet connection and you control the memory limits. Maybe your business case accepts that you calculate an average value and store 1 message instead of 50 over time, etc.
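A minimal sketch of such an application-level cap, assuming a simple byte budget (MAX_QUEUED_BYTES and the helper names are hypothetical; the SDK calls are the ones from the question and the public C SDK headers):

    #include <stddef.h>
    #include <stdint.h>
    #include "iothub_device_client_ll.h"
    #include "iothub_message.h"

    #define MAX_QUEUED_BYTES (64 * 1024)       // assumption: budget chosen for this device

    static size_t g_queuedBytes = 0;           // payload bytes currently handed to the SDK

    static void sendConfirmCallback(IOTHUB_CLIENT_CONFIRMATION_RESULT result, void* ctx)
    {
        // The SDK is done with this message (sent or failed): release its budget.
        (void)result;
        g_queuedBytes -= (size_t)(uintptr_t)ctx;
    }

    bool trySendTelemetry(IOTHUB_DEVICE_CLIENT_LL_HANDLE client,
                          const unsigned char* payload, size_t len)
    {
        if (g_queuedBytes + len > MAX_QUEUED_BYTES)
        {
            return false;                      // caller drops or aggregates instead of queuing
        }
        IOTHUB_MESSAGE_HANDLE msg = IoTHubMessage_CreateFromByteArray(payload, len);
        if (msg == NULL)
        {
            return false;
        }
        if (IoTHubDeviceClient_LL_SendEventAsync(client, msg, sendConfirmCallback,
                                                 (void*)(uintptr_t)len) != IOTHUB_CLIENT_OK)
        {
            IoTHubMessage_Destroy(msg);
            return false;
        }
        IoTHubMessage_Destroy(msg);            // the SDK keeps its own copy of the message
        g_queuedBytes += len;
        return true;
    }

The confirmation callback only fires from IoTHubDeviceClient_LL_DoWork(), so the budget is released as the device's normal work loop runs.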
If that is not something you like, the documentation also says that you can set a timeout for the queue - maybe not a memory limit, but a timeout - so you may want to investigate this a bit deeper:
"There are two timeout controls in this system. An original one in the iothub_client_ll layer - which controls the "waiting to send" queue - and a modern one in the protocol transport layer - that applies to the "in progress" list. However, since IoTHubClient_LL_DoWork causes the Telemetry messages to be immediately* processed, sent and moved to the "in progress" list, the first timeout control is virtually non-applicable.
Both can be fine-tuned by users through IoTHubClient_LL_SetOption, and because of that removing the original control could cause a break for existing customers. For that reason it has been kept as is, but it will be re-designed when we move to the next major version of the product."
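For completeness, a hedged sketch of tuning such a timeout through IoTHubDeviceClient_LL_SetOption. OPTION_MESSAGE_TIMEOUT ("messageTimeout") is the option named in the C SDK's options header and the reliability document linked above, but verify it against your SDK version; the 30-second value is only an assumption:

    #include <stdint.h>
    #include "iothub_device_client_ll.h"
    #include "iothub_client_options.h"

    void configureMessageTimeout(IOTHUB_DEVICE_CLIENT_LL_HANDLE client)
    {
        uint64_t timeoutMs = 30 * 1000;        // milliseconds; assumption, not a recommendation
        (void)IoTHubDeviceClient_LL_SetOption(client, OPTION_MESSAGE_TIMEOUT, &timeoutMs);
    }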

Suspend operation of lwIP Raw API

I am working on a project using a Zynq (Picozed devboard). The application is run bare-metal, uses lwIP TCP in RAW mode and basically behaves like this:
Receive a batch of data via Ethernet, which is stored in RAM.
Process the batch of data.
Send back the processed data via Ethernet.
The problem is, I need to measure the execution time of the processing part. However, running lwIP in RAW mode forces me to call tcp_fasttmr() and tcp_slowtmr() every 250/500 ms, which makes accurate measurement pretty hard. Whenever I'm not calling the tcp_tmr() functions for some time, I start repeatedly receiving error messages via UART ("unable to alloc pbuf in recv_handler"). It seems this is called from some ISR related to error handling, but I cannot really find the exact location.
My question is, how do I suspend the network functionality so I don't need to call tcp_tmr() periodically? I tried closing the connection and disabling the interface (netif_set_down()) and disabling the timer interrupt, but it still seems to have no effect on my problem.
I don't know anything about that devboard or the microcontroller on it but you should have an ethernetif.c (lwIP port) file which should contain the processing of an Ethernet receive interrupt or similar. This should be calling the lwIP function netif->input with a packet to process.
Disabling the interface won't stop this behaviour; it will just stop the higher-level processing of the packet. If you are only timing the execution for debugging purposes, you could try disabling the Ethernet receive interrupt and not calling tcp_tmr() until you have processed the packets, for example as sketched below.
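A minimal sketch of that idea, assuming the Xilinx standalone BSP on the Zynq (Xil_ExceptionDisable/Enable mask interrupts on the core, and XTime_GetTime() reads the free-running global timer, which does not depend on interrupts; process_batch() is a stand-in for your processing code):

    #include "xil_exception.h"
    #include "xtime_l.h"

    extern void process_batch(void);            // hypothetical: the code being measured

    double time_processing_seconds(void)
    {
        XTime start, end;

        Xil_ExceptionDisable();                 // masks Ethernet RX and the timer driving tcp_tmr()
        XTime_GetTime(&start);

        process_batch();

        XTime_GetTime(&end);
        Xil_ExceptionEnable();                  // frames queued in the MAC are handled afterwards

        return (double)(end - start) / COUNTS_PER_SECOND;
    }

Packets that arrive while interrupts are masked may still be dropped once the MAC's buffers fill, which is acceptable if this is only a debugging measurement.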

understanding the concept of running a program in interrupt handler

Early Cisco routers running the IOS operating system enhanced their packet processing speed by doing packet switching within the interrupt handler instead of in a "regular" operating system process. Doing the packet processing in the interrupt handler ensured that context switching within the operating system did not affect the packet processing. As I understand it, an interrupt handler is a piece of software in the operating system meant for handling interrupts. How should I understand the concept of packet switching being done within the interrupt handler?
The use of interrupts is preferred when an event requires immediate attention by the operating system, or by a program which installed an interrupt service routine. This is as opposed to polling, where software periodically checks whether a condition exists which indicates that the event has occurred.
Interrupt service routines aren't commonly meant to do a lot of work themselves. They are rather written to reach their end as quickly as possible, so that normal execution can resume ("normal execution" meaning the location and state at which processing was interrupted when the interrupt occurred). The reason is that the same interrupt must not occur again while its handler is still executing, or it may be ignored, lead to incorrect results, or, even worse, to software failure (crashes). So what an interrupt service routine usually does is read any data associated with the event and store it in a queue, signal that the queue has changed, arrange for another interrupt to be allowed, and then resume by restoring the pre-interrupt context. The queued data associated with that interrupt can then be processed asynchronously, without risking that interrupts pile up.
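A generic sketch of that pattern (all names are illustrative, not taken from any particular operating system): the handler only copies the event's data into a ring buffer and signals; the heavier processing happens later, outside interrupt context.

    #include <stdint.h>

    #define RING_SIZE 256                       // small power of two for cheap wrap-around

    extern uint8_t read_device_register(void);  // hypothetical: fetch the data for this event
    extern void    acknowledge_interrupt(void); // hypothetical: allow the next interrupt
    extern void    handle_byte(uint8_t b);      // hypothetical: the actual, slower work

    static volatile uint8_t  ring[RING_SIZE];
    static volatile uint32_t head = 0;          // written only by the ISR
    static volatile uint32_t tail = 0;          // written only by the normal-context consumer

    // Interrupt service routine: kept as short as possible.
    void rx_interrupt_handler(void)
    {
        uint8_t  byte = read_device_register();
        uint32_t next = (head + 1) % RING_SIZE;
        if (next != tail)                       // drop the data if the queue is full
        {
            ring[head] = byte;
            head = next;
        }
        acknowledge_interrupt();
    }

    // Called from normal execution context (main loop or a worker task).
    void process_pending_data(void)
    {
        while (tail != head)
        {
            uint8_t byte = ring[tail];
            tail = (tail + 1) % RING_SIZE;
            handle_byte(byte);
        }
    }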
The following is the procedure for executing interrupt-level switching:
1. Look up the memory structure to determine the next-hop address and outgoing interface.
2. Do an Open Systems Interconnection (OSI) Layer 2 rewrite, also called MAC rewrite, which means changing the encapsulation of the packet to comply with the outgoing interface.
3. Put the packet into the tx ring or output queue of the outgoing interface.
4. Update the appropriate memory structures (reset timers in caches, update counters, and so forth).
The interrupt which is raised when a packet is received from the network interface is called the "RX interrupt". This interrupt is dismissed only when all the above steps are executed. If any of the first three steps above cannot be performed, the packet is sent to the next switching layer. If the next switching layer is process switching, the packet is put into the input queue of the incoming interface for process switching and the interrupt is dismissed. Since interrupts cannot be interrupted by interrupts of the same level and all interfaces raise interrupts of the same level, no other packet can be handled until the current RX interrupt is dismissed.
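Purely as an illustration of the procedure above, the fast path can be pictured like this (the types and functions are conceptual stand-ins, not Cisco IOS internals):

    struct Packet;                                                // opaque packet descriptor
    struct Route { int out_interface; /* next hop, MAC rewrite info, ... */ };

    extern bool fast_lookup(const Packet& pkt, Route& r);         // step 1: cache/table lookup
    extern void rewrite_l2_header(Packet& pkt, const Route& r);   // step 2: MAC rewrite
    extern bool enqueue_on_tx_ring(Packet& pkt, int out_if);      // step 3: tx ring / output queue
    extern void update_counters_and_caches(const Route& r);       // step 4: timers, counters
    extern void queue_for_process_switching(Packet& pkt);         // slower fallback path

    void rx_interrupt(Packet& pkt)
    {
        Route route;
        if (!fast_lookup(pkt, route))                             // step 1 failed
        {
            queue_for_process_switching(pkt);                     // punt to the next switching layer
            return;
        }
        rewrite_l2_header(pkt, route);                            // step 2
        if (!enqueue_on_tx_ring(pkt, route.out_interface))        // step 3 failed (ring full)
        {
            queue_for_process_switching(pkt);
            return;
        }
        update_counters_and_caches(route);                        // step 4
        // Returning from the handler dismisses the RX interrupt.
    }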
Different interrupt switching paths can be organized in a hierarchy, from the one providing the fastest lookup to the one providing the slowest lookup. The last resort used for handling packets is always process switching. Not all interfaces and packet types are supported in every interrupt switching path. Generally, only those that require examination and changes limited to the packet header can be interrupt-switched. If the packet payload needs to be examined before forwarding, interrupt switching is not possible. More specific constraints may exist for some interrupt switching paths. Also, if the Layer 2 connection over the outgoing interface must be reliable (that is, it includes support for retransmission), the packet cannot be handled at interrupt level.
The following are examples of packets that cannot be interrupt-switched:
Traffic directed to the router itself (routing protocol traffic, Simple Network Management Protocol (SNMP), Telnet, Trivial File Transfer Protocol (TFTP), ping, and so on). Management traffic can be sourced by and directed to the router; it is handled by specific task-related processes.
OSI Layer 2 connection-oriented encapsulations (for example, X.25). Some tasks are too complex to be coded in the interrupt-switching path because there are too many instructions to run, or timers and windows are required. Some examples are features such as encryption, Local Area Transport (LAT) translation, and Data-Link Switching Plus (DLSW+).
More here: http://www.cisco.com/c/en/us/support/docs/ios-nx-os-software/ios-software-releases-121-mainline/12809-tuning.html

OpenSSL SSL_ERROR_WANT_WRITE never recovers during SSL_write()

I have two applications talking to each other over SSL. The client is running on a Windows machine; the server is a Linux-based application. The client sends a large amount of data to the server on startup. The data is sent to the server in ~4000-byte chunks, each containing 30 entries, and I have to send about 50000 entries over.
During that transmission the server sends a message to the client; the message size is ~4000 bytes. After that happens, SSL_write() on the client side begins to return the error SSL_ERROR_WANT_WRITE. The client sleeps for 10 ms and retries the SSL_write with the exact same parameters; however, the SSL_write keeps failing indefinitely and the client subsequently aborts. If it then tries to send a new message, I get an error indicating that I am not retrying the same message that was aborted earlier:
error:1409F07F:SSL routines:SSL3_WRITE_PENDING: bad write retry
The server eventually kills the connection since it has not heard from the client for 60 seconds, and re-establishes a new one. This is just an FYI; the real issue is how I can get SSL_write to resume.
If the server does not send a request while it is receiving, the problem goes away. If I shrink the size of the request from 16K to 100 bytes, the problem does not happen.
The SSL CTX MODE is set to SSL_MODE_AUTO_RETRY and SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER.
Does anyone have an idea why a simultaneous transmission of large amounts of data from both sides can cause this failure? If this is a limitation, what can I do to prevent it, other than capping the size of what goes out from the server to the client? My concern is that if the client is not sending anything, the throttling I applied to avoid this issue is a waste.
On the client side I tried to perform an SSL_read to see if I need to read during a write, even though I never receive SSL_ERROR_WANT_READ, but the buffer is not that big anyway (~1000 bytes in size).
Any insight on this would be appreciated.
SSL_ERROR_WANT_WRITE - this error is returned by OpenSSL (I am assuming you are using OpenSSL) only when a socket send gives it an EWOULDBLOCK or EAGAIN error. The socket send will give an EWOULDBLOCK error when the send-side buffer is full, which in turn means that your server is not reading the messages sent from the client.
So, essentially, the problem lies with your server, which is not reading the messages sent to it. You need to check your server and fix it, which will automatically fix your client problem.
Also, why have you set the option "SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER"? SSL always expects that the record which it is trying to send should be sent completely before the next record can be sent.
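For reference, a hedged sketch of how an SSL_write() retry on a non-blocking socket usually looks: the same data must be passed again (SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER only relaxes the requirement that the buffer address be identical), and the retry should wait until the socket can actually make progress. The fd/poll plumbing is an assumption about your setup; on Windows, WSAPoll() is the equivalent.

    #include <openssl/ssl.h>
    #include <poll.h>

    // Returns the number of bytes written, or -1 on a fatal error.
    int ssl_write_all(SSL* ssl, int fd, const char* buf, int len)
    {
        int written = 0;
        while (written < len)
        {
            int rc = SSL_write(ssl, buf + written, len - written);
            if (rc > 0)
            {
                written += rc;
                continue;
            }
            int err = SSL_get_error(ssl, rc);
            if (err == SSL_ERROR_WANT_WRITE || err == SSL_ERROR_WANT_READ)
            {
                // Wait for the socket, then call SSL_write() again with the same data.
                struct pollfd pfd;
                pfd.fd      = fd;
                pfd.events  = (err == SSL_ERROR_WANT_WRITE) ? POLLOUT : POLLIN;
                pfd.revents = 0;
                poll(&pfd, 1, 1000);            // 1 s cap per wait; the caller decides when to give up
                continue;
            }
            return -1;                          // fatal SSL error
        }
        return written;
    }

Note that this still deadlocks if the peer never drains its receive buffer, which is exactly the situation described in the question; the poll() only prevents busy-looping.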
As it turns out, in both the client and the server app the reads and writes are processed in one thread. In the perfect storm I described above, the client is busy writing (non-blocking). The server then decides to write a large set of messages of its own in between processing its rx buffers. The server tx is a blocking call. The server gets stuck writing, starves the read, the buffers fill up, and we have a deadlock scenario.
The default Windows socket buffer is 8 KB, so it doesn't take much to fill it up.
The architecture should be such that there is a separate thread for the rx and tx processing on both sides. As a short-term fix, one can increase the rx buffers and rate-limit the tx side to prevent the deadlock, for example as sketched below.
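As an illustration of the short-term part of that fix, the receive buffer can be enlarged with setsockopt (the 256 KB value is an arbitrary example, not a recommendation):

    #include <winsock2.h>

    // Enlarge the socket receive buffer (the default mentioned above is 8 KB on Windows).
    int enlarge_receive_buffer(SOCKET s)
    {
        int size = 256 * 1024;
        return setsockopt(s, SOL_SOCKET, SO_RCVBUF,
                          reinterpret_cast<const char*>(&size), sizeof(size));
    }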

boost::asio timeouts example - writing data is expensive

boost::asio provides an example of how to use the library to implement asynchronous timeouts: the client sends periodic heartbeat messages to the server, which echoes each heartbeat back to the client; failure to respond within N seconds causes a disconnect. See boost_asio/example/timeouts/server.cpp.
The pattern outlined in these examples would be a good starting point for part of a project I will be working on shortly, but for one wrinkle:
in addition to heartbeats, both client and server need to send messages to each other.
The timeouts example pushes heartbeat echo messages onto a queue, and a subsequent timeout causes an asynchronous handler for the timeout to actually write the data to the socket.
Introducing data for the socket to write cannot be done on the thread running io_service, because it is blocked on run(). run_one() doesn't help: you still block until there is a handler to run, and you introduce the complexity of managing work for the io_service.
In asio, asynchronous handlers - writes to the socket being one of them - are called on the thread running io_service.
Therefore, to introduce messages at arbitrary times, data to be sent is pushed onto a queue from a thread other than the io_service thread, which implies protecting the queue and the notification timer with a mutex. There are then two mutex acquisitions per message: one when pushing the data onto the queue, and one in the handler which dequeues the data for the write to the socket.
This is actually a more general question than asio timeouts alone: is there a pattern, when the io_service thread is blocked on run(), by which data can be asynchronously written to the socket without taking a mutex twice per message?
The following things could be of interest: boost::asio strands are a mechanism for synchronising handlers, but you only need them, AFAIK, if you are calling io_service::run from multiple threads.
Also useful is the io_service::post method, which allows you to have code executed on the thread that has invoked io_service::run.
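A minimal sketch of that idea, following the pattern of the classic asio chat example (the connection class and deliver() are hypothetical names): any thread hands the message to io_service::post, and all queue manipulation and socket writes then happen on the io_service thread, so no per-message mutex is needed in user code beyond the one inside post itself.

    #include <boost/asio.hpp>
    #include <boost/bind.hpp>
    #include <deque>
    #include <string>

    class connection
    {
    public:
        connection(boost::asio::io_service& io, boost::asio::ip::tcp::socket& sock)
            : io_(io), socket_(sock) {}

        // May be called from any thread: hand the message over to the io_service thread.
        void deliver(const std::string& msg)
        {
            io_.post(boost::bind(&connection::do_write, this, msg));
        }

    private:
        // Runs only on the io_service thread, so the queue needs no explicit lock.
        void do_write(std::string msg)
        {
            bool write_in_progress = !write_queue_.empty();
            write_queue_.push_back(msg);
            if (!write_in_progress)
            {
                boost::asio::async_write(socket_,
                    boost::asio::buffer(write_queue_.front()),
                    boost::bind(&connection::handle_write, this,
                                boost::asio::placeholders::error));
            }
        }

        void handle_write(const boost::system::error_code& ec)
        {
            if (ec) return;                    // error handling elided in this sketch
            write_queue_.pop_front();
            if (!write_queue_.empty())
            {
                boost::asio::async_write(socket_,
                    boost::asio::buffer(write_queue_.front()),
                    boost::bind(&connection::handle_write, this,
                                boost::asio::placeholders::error));
            }
        }

        boost::asio::io_service&      io_;
        boost::asio::ip::tcp::socket& socket_;
        std::deque<std::string>       write_queue_;
    };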