Limit total size of in-flight IoT messages - azure-iot-hub

I am using the IoT Hub device client SDK on an embedded device. The application sends telemetry messages to IoT Hub periodically. The IoT device connects to a wireless router, and the router connects to the internet via its WAN port.
When the wireless router loses its internet connection, the IoT device is not notified of the disconnection immediately. It takes about 60 s to get notified; until then the device keeps sending telemetry messages with IoTHubDeviceClient_LL_SendEventAsync(), and all those messages get queued in the SDK layer and consume memory. Since this is an embedded device with limited resources, memory gets used up and the program is killed by a low-memory killer app.
Is there a way to specify the total size of IoT messages that can be queued in the SDK layer? If this quota is exceeded, IoTHubDeviceClient_LL_SendEventAsync() should fail immediately.
Actually this is needed for the normal scenario too. When the IoT device sends a message, it seems the message is queued in a lower layer and flushed out at some later time. I don't see any API that controls the flush. That creates another problem: even when there is an internet connection, the application has no control over how many messages are queued or how long they stay queued, and in turn no control over how much memory the process uses. On my device, a system monitor kills any process that uses too much memory.

The question is what you do when sending fails because the queue is full. Do you then lose the information for lack of storage capacity? From an IoT perspective, I would recommend checking whether your device is reliable enough to handle these edge cases as well. Knowing the limits of the device, and how long it can go without an internet connection, helps you mitigate these risks in your application rather than in the SDK.
From what I see on GitHub, the default sendMessageAsync method throws a timeout exception when sending fails, unless you have some kind of retry policy in place (according to the documentation, the C SDK does not allow custom retry policies:
https://learn.microsoft.com/en-us/azure/iot-hub/iot-hub-reliability-features-in-sdks).
According to the documentation, on connection failure the SDK will, depending on the retry policy (if you have set one), try to re-establish the connection and queue the messages created in the meantime:
https://github.com/Azure/azure-iot-sdk-c/blob/master/doc/connection_and_messaging_reliability.md
So the expectation here is that the SDK does not take responsibility for memory limits; that is up to the application. Since your device has limitations, I would recommend implementing your own queuing mechanism (maybe set no-retry as the policy and avoid queuing in the SDK that way). Then you control what happens when there is no internet connection, and you control the memory limits. Maybe your business case allows you to calculate an average value and store 1 message instead of 50 over that period, etc.
If you do not like this approach, the documentation also says that you can set a timeout for the queue (not a memory limit, but a timeout), so maybe you can investigate this a bit deeper:
"There are two timeout controls in this system. An original one in the iothub_client_ll layer - which controls the "waiting to send" queue - and a modern one in the protocol transport layer - that applies to the "in progress" list. However, since IoTHubClient_LL_DoWork causes the Telemetry messages to be immediately* processed, sent and moved to the "in progress" list, the first timeout control is virtually non-applicable.
Both can be fine-tuned by users through IoTHubClient_LL_SetOption, and because of that removing the original control could cause a break for existing customers. For that reason it has been kept as is, but it will be re-designed when we move to the next major version of the product."
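The application-level limit recommended above can be sketched as a byte budget that is reserved before a message is handed to the SDK and released when its send confirmation arrives. This is only an illustration in Python, not SDK code: the class InflightBudget and its methods are hypothetical names, and the real memory used per message will be larger than the payload size because of SDK overhead.

```python
class InflightBudget:
    """Tracks the bytes handed to the SDK that are not yet confirmed (hypothetical helper)."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0

    def try_reserve(self, size):
        """Reserve room for one message; refuse immediately if the quota would be exceeded."""
        if self.used_bytes + size > self.max_bytes:
            return False
        self.used_bytes += size
        return True

    def release(self, size):
        """Give the bytes back once the message is confirmed (or has failed)."""
        self.used_bytes -= size


budget = InflightBudget(max_bytes=10)

# Two 5-byte messages fit, the third is rejected right away.
results = [budget.try_reserve(5), budget.try_reserve(5), budget.try_reserve(1)]

# One confirmation arrives, so a small message fits again.
budget.release(5)
accepted_after_release = budget.try_reserve(1)
```

In the C application the reservation would happen just before the call to IoTHubDeviceClient_LL_SendEventAsync(), and the release inside the confirmation callback passed to that call. What to do with a rejected message (drop it, average it, overwrite the oldest) stays a decision for the application.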

Does the operation of the CAN peripheral in STM32 wait for the execution of the ISR routine code?

I'm developing a stack layer on microcontroller STM32L433 that uses the CAN protocol; a fundamental part of the stack is the authentication of the devices.
During authentication it can occur that two (or more) devices start to send a CAN message (authentication message) with the same identifier and different payload (true random value). In this case every device should be able to detect if this message was sent first from another device.
I have studied this case and three situations can occur:
1) The devices start to send the message at the same time; in this case only one device is able to send the message, because all the other devices detect an error and abort the transmission.
2) Only one device is able to send the message and occupy the bus before the other devices load the transmission mailbox of their CAN peripheral, or before their CAN peripheral puts the message to be sent into the SCHEDULED state. In this case, the devices that were not able to send the message get the reception interrupt; within the reception ISR I'm able to abort the transmission.
3) Only one device is able to send the message and occupy the bus, and the CAN peripherals of all the other devices have a message in the SCHEDULED state, waiting for the bus to become idle. In this case the devices that were not able to send the message get the reception interrupt. Here too I thought of aborting the transmission within the reception ISR (as in situation 2), but I'm not sure this is guaranteed for all messages: if the CAN peripheral moves the message into the TRANSMIT state before the code inside the ISR is executed, the abort will have no effect.
My question is (related to situation 3): is a message that sits in the transmission mailbox in the SCHEDULED state moved to the TRANSMIT state only after the code in the reception ISR has been executed, or is this not guaranteed?
To answer your third case first: no, it is not guaranteed that your message is not already on the bus while you are receiving. Interrupts have some latency too, and within that time the mailbox may go ahead with the transmission.
Your "authentication" also sounds a bit troublesome, since nobody from outside could actually decide which ECU was the one that won the arbitration and sent that specific message.
We have ECUs in vehicles which decide at runtime, according to certain methods, where they are mounted, by pin and some CAN reception, but only in listen mode. TX is actually disabled in the stack. After detection has completed, we switch configurations, restart the communication stack and continue initializing the software upwards.
But these "setups" are usually defined beforehand, e.g. by master/slave roles (vehicle/private bus communication), by connector pins connected to GND / OPEN / UBAT, or by some bus message which tells the ECU which bus it is on.
That seems to be more reliable than your method.

Large RabbitMQ message in Slow network

I am using RabbitMQ with Spring AMQP
large message (>100MB, 102400KB)
small bandwidth (<512Kbps)
low heartbeat interval (10 seconds)
single broker
It will take >= 200*8 seconds to consume the message, which is more than my heartbeat interval. From https://stackoverflow.com/a/42363685/418439
If the message transfer time between nodes (60seconds?) > heartbeat time between nodes, it will cause the cluster to disconnect and the loose the message
Will I also face the disconnection issue even if I am using a single broker?
Do the heartbeat and the consumer use the same thread, such that
while the consumer is consuming it is not possible to perform a heartbeat?
If so, what can I do to consume the message without increasing the heartbeat interval or reducing my message size?
Update:
I received another answer and comments after I posted my own answer. Thanks for the feedback. Just to clarify, I do not use AMQP for file transfer. The data is actually a JSON message; some messages are simple and small, but some contain complex information, including some freehand drawings. Besides saving the data at the Data Center, we also save a copy of the message at branch level via AMQP, for the case where connectivity to the Data Center is not available.
So, the real questions here are a bit more fundamental, and those are: (1) is it appropriate to perform a large file transfer via AMQP, and (2) what purpose does the heartbeat serve?
Heartbeats
First off, let's address the heartbeat question. As the RabbitMQ documentation clearly states, the purpose of the heartbeat is "to ensure that the application layer promptly finds out about disrupted connections."
The reason for this is simple. In an ordinary AMQP usage, there may be several seconds, even minutes between the arrival of successive messages. Without data being exchanged across a TCP session, many firewalls and other networking equipment automatically close ports to lower exposure to the enterprise network. Heartbeats further help mitigate a fundamental weakness in TCP, which is the difficulty of detecting a dropped connection. Networks experience failure, and TCP is not always able to detect that on its own.
So, the bottom line here is that, while you're transferring a large message, the connection is active and the heartbeat function serves no useful purpose, and can cause you trouble. It's best to turn it off in such cases.
AMQP For Moving Large Files?
The second issue, and I believe more important question, is how should large files be dealt with. To answer this, let's first consider what a message queue does: sending messages -- small bits of data which communicate something to another computer system. The operative word here is small. Messages typically contain one of four things: 1. commands (go do something), 2. events (something happened), 3. requests (give me some data), and 4. responses (here is your data). A full discussion on these is beyond the scope, but suffice it to say that each of these can generally be composed of a small message less than 100kB.
Indeed, the AMQP protocol, which underlies RabbitMQ, is a fairly chatty protocol. It requires large messages be divided into multiple segments of no more than 131kB. This can add a significant amount of overhead to a large file transfer, especially when compared to other file transfer mechanisms (FTP, for instance). Secondly, the message has to be fully processed by the broker before it is made available in a queue, and it ties up valuable resources on the broker while this is being done. For one, the whole message must fit into RAM on the broker due to its architecture. This solution may work for one client and one broker, but it will break quickly when scaling out is attempted.
Finally, compression is often desirable when transferring files - HTTP supports gzip compression automatically. AMQP does not. It is quite common in message-oriented applications to send a message containing a resource locator (e.g. URL) pointing to the larger data file, which is then accessed via appropriate means.
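The resource-locator approach described above might look roughly like the following sketch. It is only an illustration: blob_store is a plain dictionary standing in for whatever file server or object store would hold the large data, and the message fields ("type", "key", "size") and function names are made up for the example.

```python
import hashlib
import json

# Hypothetical stand-in for a file server or object store.
blob_store = {}


def publish_large(payload):
    """Store the large payload out of band and return a small message that points to it."""
    key = hashlib.sha256(payload).hexdigest()
    blob_store[key] = payload
    return json.dumps({"type": "drawing.saved", "key": key, "size": len(payload)})


def consume(message):
    """Read the small message and fetch the real payload via its locator."""
    reference = json.loads(message)
    return blob_store[reference["key"]]


payload = b"x" * 1_000_000          # about 1 MB of data
message = publish_large(payload)    # this small string is what would travel through the broker
restored = consume(message)
```

Only the short JSON string would pass through the queue; the large payload travels over a channel better suited to bulk transfer.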
The moral of the story
As the adage goes: "to the man with a hammer, everything looks like a nail." AMQP is not a hammer; it's a precision scalpel. It has a very specific purpose, and narrow applicability within that purpose. Using it for something other than its intended purpose will lead to stability and reliability problems in whatever it is you are designing, and overall dissatisfaction with your end product.
Will I also face the disconnection issue even if I am using a single
broker?
Yes.
Do the heartbeat and the consumer use the same thread, such that
while the consumer is consuming it is not possible to perform a heartbeat?
I can't confirm the threading, but from what I observe, while the Java RabbitMQ consumer is consuming a message it does not perform heartbeat acknowledgement. If consuming takes longer than 3 x the heartbeat timeout (due to a large message and/or low bandwidth), the MQ server closes the AMQP connection.
If so, what can I do to consume the message without increasing the
heartbeat interval or reducing my message size?
I resolved my issue by increasing the heartbeat interval. No further code change was required.
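The numbers from the question can be checked with a few lines of arithmetic. This sketch assumes exactly 100 MB (102400 KB) and exactly 512 Kbps, ignores protocol overhead, and uses the "3 x heartbeat" threshold only because that is the figure observed in this answer, not because it is a documented constant.

```python
def transfer_seconds(message_kilobytes, bandwidth_kbps):
    """Time to move a message of the given size over a link of the given speed."""
    kilobits = message_kilobytes * 8
    return kilobits / bandwidth_kbps


heartbeat_interval = 10                      # seconds, from the question
seconds = transfer_seconds(102400, 512)      # 100 MB over 512 Kbps

# Threshold observed in this answer: about 3 missed heartbeat intervals.
connection_would_close = seconds > 3 * heartbeat_interval
```

With these inputs the transfer takes 1600 seconds (the 200*8 from the question), far beyond 30 seconds, which is consistent with the connection being closed unless the heartbeat interval is raised.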

MQTT long polling over GPRS/HSDPA

I'd like to use MQTT to send control information to a device, but I'm concerned that leaving the MQTT client and cell data connection up (basically in long-polling mode) is somehow bad, whether in terms of data charges, network usage, battery life, or some other aspect.
Another approach might be to send an SMS to the device when it has a message to pick up - but that seems to defeat the purpose of MQTT and also introduces a long delay while dialing and setting up the GPRS connection.
Is there any reason I should be concerned about this approach?
I think this approach is quite valid - think of it this way: Your App's long polling transfers a very small volume of data, as long as it just polls, so
the data usage should be minuscule
the battery is impacted only for the data sent in addition to the keepalive, which is at least an order of magnitude higher than the long polling
as a reference: ActiveSync, which runs all the time, is nothing else than a fancy form of long polling
You may want to look at MQTT-SN, which is designed to run over UDP, and therefore does not need an active connection. Real Small Message Broker is an implementation of a MQTT-SN broker, and will bridge to Mosquitto.
The other approach is to use the retain flag on messages; that way a control app can send the message and the device will get it as soon as it reconnects, regardless of whether the app is still online. In all cases, the user experience on the app side should differentiate between the request being sent and it being honored or refused, so you will need tri-state controls (on, off, pending).
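A tri-state control of the kind mentioned above could be modelled as follows. This is a sketch with hypothetical names (ControlState, RemoteSwitch); it contains no MQTT code and only shows how the app side might separate "requested" from "confirmed by the device".

```python
from enum import Enum


class ControlState(Enum):
    OFF = "off"
    ON = "on"
    PENDING = "pending"


class RemoteSwitch:
    """App-side view of a device control whose changes need confirmation."""

    def __init__(self):
        self.confirmed = ControlState.OFF   # last state the device acknowledged
        self.requested = None               # target of a request still in flight

    @property
    def displayed(self):
        """What the UI should show: PENDING while a request is unanswered."""
        return ControlState.PENDING if self.requested is not None else self.confirmed

    def request(self, target):
        """Called when the user flips the control and the message is published."""
        self.requested = target

    def on_device_reply(self, honored):
        """Called when the device reports that it honored or refused the request."""
        if self.requested is not None and honored:
            self.confirmed = self.requested
        self.requested = None


switch = RemoteSwitch()
switch.request(ControlState.ON)
shown_while_waiting = switch.displayed
switch.on_device_reply(honored=True)
shown_after_reply = switch.displayed

switch.request(ControlState.OFF)
switch.on_device_reply(honored=False)
shown_after_refusal = switch.displayed
```

A refused request simply falls back to the last confirmed state, so the control never shows a value the device did not accept.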

How to prevent an I/O Completion Port from blocking when completion packets are available?

I have a server application that uses Microsoft's I/O Completion Port (IOCP) mechanism to manage asynchronous network socket communication. In general, this IOCP approach has performed very well in my environment. However, I have encountered an edge case scenario for which I am seeking guidance:
For the purposes of testing, my server application is streaming data (let's say ~400 KB/sec) over a gigabit LAN to a single client. All is well...until I disconnect the client's Ethernet cable from the LAN. Disconnecting the cable in this manner prevents the server from immediately detecting that the client has disappeared (i.e. the client's TCP network stack does not send notification of the connection's termination to the server).
Meanwhile, the server continues to make WSASend calls to the client...and being that these calls are asynchronous, they appear to "succeed" (i.e. the data is buffered by the OS in the outbound queue for the socket).
While this is all happening, I have 16 threads blocked on GetQueuedCompletionStatus, waiting to retrieve completion packets from the port as they become available. Prior to disconnecting the client's cable, there was a constant stream of completion packets. Now, everything (as expected) seems to have come to a halt...for about 32 seconds. After 32 seconds, IOCP springs back into action returning FALSE with a non-null lpOverlapped value. GetLastError returns 121 (The semaphore timeout period has expired.) I can only assume that error 121 is an artifact of WSASend finally timing out after the TCP stack determined the client was gone?
I'm fine with the network stack taking 32 seconds to figure out my client is gone. The problem is that while the system is making this determination, my IOCP is paralyzed. For example, WSAAccept events that post to the same IOCP are not handled by any of the 16 threads blocked on GetQueuedCompletionStatus until the failed completion packet (indicating error 121) is received.
My initial plan to work around this involved using WSAWaitForMultipleEvents immediately after calling WSASend. If the socket event wasn't signaled within (e.g. 3 seconds), then I terminate the socket connection and move on (in hopes of preventing the extensive blocking effect on my IOCP). Unfortunately, WSAWaitForMultipleEvents never seems to encounter a timeout (so maybe asynchronous sockets are signaled by virtue of being asynchronous? Or copying data to the TCP queue qualifies for a signal?)
I'm still trying to sort this all out, but was hoping someone had some insights as to how to prevent the IOCP hang.
Other details: My server application is running on Win7 with 8 cores; IOCP is configured to use at most 8 concurrent threads; my thread pool has 16 threads. Plenty of RAM, processor and bandwidth.
Thanks in advance for your suggestions and advice.
It's usual for the WSASend() completions to stall in this situation. You won't get them until the TCP stack times out its resend attempts and completes all of the outstanding sends in error. This doesn't block any other operations. I expect you are either testing incorrectly or have a bug in your code.
Note that your 'fix' is flawed. You could see this 'delayed send completion' situation at any point during a normal connection if the sender is sending faster than the consumer can consume. See this article on TCP flow control and async writes. A better plan is to keep a counter of the outstanding writes (per connection) that you want to allow, stop sending when that limit is reached, and resume when the counter drops below a 'low water mark' threshold value.
Note that if you've pulled out the network cable into the machine how do you expect any other operations to complete? Reads will just sit there and only fail once a write has failed and AcceptEx will simply sit there and wait for the condition to rectify itself.
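The outstanding-write counter with a low water mark suggested above can be sketched like this. It is a language-neutral illustration written in Python, not Winsock code; SendThrottle and its method names are hypothetical, and in a real server there would be one instance per connection, updated when a send is issued and when its completion packet is dequeued.

```python
class SendThrottle:
    """Per-connection flow control based on the number of uncompleted sends."""

    def __init__(self, high_water, low_water):
        self.high_water = high_water   # stop issuing sends at this many outstanding
        self.low_water = low_water     # resume once outstanding drops below this
        self.outstanding = 0
        self.paused = False

    def can_send(self):
        return not self.paused

    def on_send_issued(self):
        """Call right after an asynchronous send has been started."""
        self.outstanding += 1
        if self.outstanding >= self.high_water:
            self.paused = True

    def on_send_completed(self):
        """Call when a completion (success or error) is retrieved for a send."""
        self.outstanding -= 1
        if self.paused and self.outstanding < self.low_water:
            self.paused = False


throttle = SendThrottle(high_water=4, low_water=2)
for _ in range(4):
    throttle.on_send_issued()
paused_at_limit = not throttle.can_send()

throttle.on_send_completed()
throttle.on_send_completed()
still_paused = throttle.paused        # 2 outstanding is not yet below the low water mark

throttle.on_send_completed()
resumed = throttle.can_send()         # 1 outstanding, sending may continue
```

The gap between the two thresholds keeps the connection from flapping between paused and sending on every single completion, and it bounds how much data can pile up for a peer that has silently disappeared.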

USB - MTP/PTP without Interrupt Endpoint

Since we plan to use MTP (Media Transfer Protocol) for our next device, we are evaluating MTP as a replacement for the current (unstable) USB drivers in the currently released device.
The limitation on this device is, that its processor (Strong Arm) supports only up to 3 EndPoints:
"Serial port 0 is a universal serial bus device controller (UDC) that supports three endpoints and can operate half-duplex at a baud rate of 12 Mbps (slave only, not a host or hub controller)."
But according to the specification, MTP needs at least 4 endpoints (from the PTP spec):
"The device shall contain at least four endpoints: default, Data-In, Data-Out, and an Interrupt endpoint."
Now the question: Can we just skip the interrupt endpoint on the device? I know that it violates the specification - but what happens if we do?
From our current evaluation software I can see the following scenarios:
The 'space available' is not updated - the user will see that there is 100 MB of free memory, but placing a 1 MB file gives the error "Not Enough Memory"
Non-host driven actions are not visible on the host (so when on the device files are deleted, created or moved, the connected host does not know about it)
If we can live with it, is it advisable to implement it this way?
UPDATE: Damn... when I tested it last time, I had only removed the code for interrupt-EP data transmission. Now I have also removed the endpoint definition (I do not create the endpoint anymore), and from this point on the MTP connection could not be established any more :(
It seems that the windows driver (wpd) requires the interrupt endpoint - even if it's not used. Bad luck...
Does anyone have an idea whether and how to get MTP working with 3 endpoints?
Finally I got an answer from Microsoft:
The 3-endpoints setup is not supported.
The interrupt endpoint is required so that the driver can receive MTP events from the device. These events are a notification mechanism that the driver relies on to relay events to applications (e.g. when an object is created, updated, or removed).
If your device does nothing with the endpoint (i.e. send no events), applications such as Explorer will not behave correctly whenever objects on your device are changed.
So we buried our plans... :(