What are the CAN Active, CAN Passive and CAN Sleep states in the CAN network manager?

I am trying to understand CAN network management in a vehicle. During my research, I learned that CAN network management (CANNM) maintains certain mode states to decide whether CAN transmission is allowed. Those modes are the CAN Active, CAN Passive and CAN Sleep states. I want to know what the exact use of CANNM is and why these modes are required.

I highly recommend reading the AUTOSAR Network Management specification; it explains the idea behind this. The terms may differ slightly in your environment, but the concepts are essentially the same.
CAN Active: if at least one NM node in an NM cluster needs communication, the NM protocol ensures that all required NM nodes remain awake.
CAN Sleep: if there is no communication need in an NM cluster, the NM protocol ensures that all NM nodes synchronously enter sleep mode.
CAN Passive: an NM node configured as a passive node is not able to initiate a start-up of an NM cluster, but it can be woken up if any other node initiates a start-up. This eliminates unnecessary communication and reduces bus and buffer overhead. Allowing shutdown to be controlled by a subset of the cluster's nodes makes it possible that only fault-tolerant nodes control shutdown.
Long story short, all of this exists to coordinate start-up/wake-up and sleep/shutdown of the ECUs in the network.
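To make that concrete, below is a minimal sketch in C of the decision a node makes. This is not the AUTOSAR CanNm API; every name in it (nm_mode_t, nm_evaluate, the input flags) is a made-up illustration of the Active/Passive/Sleep idea.

/* Minimal illustration of the idea behind the CanNm modes.
 * This is NOT the AUTOSAR API - all names here are hypothetical. */
#include <stdbool.h>
#include <stdio.h>

typedef enum {
    NM_MODE_BUS_SLEEP,   /* no node needs communication, bus may sleep  */
    NM_MODE_NETWORK      /* at least one node requested communication   */
} nm_mode_t;

typedef struct {
    bool local_request;  /* this ECU needs the bus (active role)          */
    bool passive_node;   /* passive nodes never request, they only follow */
    bool nm_msg_on_bus;  /* another node is still sending NM messages     */
} nm_inputs_t;

/* Decide the cluster mode from this node's point of view. */
static nm_mode_t nm_evaluate(const nm_inputs_t *in)
{
    bool keep_awake = in->nm_msg_on_bus ||
                      (in->local_request && !in->passive_node);
    return keep_awake ? NM_MODE_NETWORK : NM_MODE_BUS_SLEEP;
}

int main(void)
{
    nm_inputs_t in = { .local_request = false,
                       .passive_node  = true,
                       .nm_msg_on_bus = true };
    /* A passive node stays awake as long as others keep the bus alive... */
    printf("mode = %d\n", nm_evaluate(&in));
    /* ...and enters sleep once NM messages stop on the bus. */
    in.nm_msg_on_bus = false;
    printf("mode = %d\n", nm_evaluate(&in));
    return 0;
}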

How to Wakeup from Sleep during CAN network state transition?

I am trying to understand the CAN network management of AUTOSAR. I want to put the ECU to sleep if no CAN messages are received during the IGN cycle, and I block CAN transmission and reception during this stage. Now suppose an AUTOSAR NM message is received: I want the ECU to wake up and CAN to become fully active. I have gone through the basics of AUTOSAR network management.
As per my understanding:
If communication on the bus is needed, i.e. requested, NM messages are sent out. If no communication is needed, i.e. released, the sending of NM messages is stopped.
When the AUTOSAR NM state is "Ready Sleep State" or "Repeat Message State", I wake up the CAN. I would like to know whether this is a good approach.
Reference: AUTOSAR_SWS_CANNetworkManagement.pdf
You need to read the state diagram in section 7.20 in detail for the states and the possible transitions.
The wake-up and sleep in network management refer to the communication state, and the specification defines these states so that the ECUs in the vehicle stay synchronized. For example:
During the bus-sleep state you disable the CAN controller and the CAN transceiver.
During wake-up, you initialize the communication state again for full communication.
Note: be careful about the state in which you send/receive NM messages, because they are the synchronization signal for all ECUs.
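As a rough illustration of those two transitions on an STM32 with the HAL CAN driver, one possible mapping is sketched below. The handle name hcan, the transceiver standby pin (GPIOA pin 8) and the notification flags are assumptions that have to be adapted to your hardware and HAL version.

/* Sketch only: maps the NM bus-sleep / wake-up states to driver actions.
 * 'hcan' and the transceiver standby pin are hypothetical for this example. */
#include "stm32f3xx_hal.h"

extern CAN_HandleTypeDef hcan;

static void enter_bus_sleep(void)
{
    /* Stop communication and disable the CAN controller. */
    HAL_CAN_DeactivateNotification(&hcan, CAN_IT_RX_FIFO0_MSG_PENDING);
    HAL_CAN_Stop(&hcan);
    /* Put the transceiver into standby (pin and polarity depend on the part). */
    HAL_GPIO_WritePin(GPIOA, GPIO_PIN_8, GPIO_PIN_SET);
}

static void wake_up(void)
{
    /* Bring the transceiver back to normal mode... */
    HAL_GPIO_WritePin(GPIOA, GPIO_PIN_8, GPIO_PIN_RESET);
    /* ...and restart the controller for full communication. */
    HAL_CAN_Start(&hcan);
    HAL_CAN_ActivateNotification(&hcan, CAN_IT_RX_FIFO0_MSG_PENDING);
}

These driver actions are only the local side; when they happen is dictated by the NM state machine, so that all ECUs transition together.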

Limit total size of in-flight IoT messages

I am using the IoT Hub device client SDK on an embedded device. The application sends a telemetry message to IoT Hub periodically. The IoT device connects to a wireless router, and the router connects to the internet via its WAN port.
When the wireless router loses its internet connection, the IoT device is not notified about the disconnection immediately; it takes about 60 s to get notified. Before that, the IoT device keeps sending telemetry messages with IoTHubDeviceClient_LL_SendEventAsync(), and all those messages get queued in the SDK layer and eat memory. Since this is an embedded device with limited resources, memory gets eaten up and the program is killed by a low-memory killer.
Is there a way to specify the total size of IoT messages that can be queued in the SDK layer, so that IoTHubDeviceClient_LL_SendEventAsync() fails immediately once the quota is exceeded?
Actually this is also needed for the normal scenario. When the IoT device sends a message, the message seems to be queued in a lower layer and flushed out at some later time, and I don't see any API that can control the flush. That creates another problem: even when there is an internet connection, the application has no control over how many messages are queued and for how long, and in turn no control over how much memory the process uses. On my device, there is a system monitor that kills processes that use too much memory.
The question is: what would you do if sending a message fails because the queue is full? Do you then lose the information because of the lack of storage capacity? From an IoT perspective, I would recommend considering whether your device is a reliable IoT device that can handle these edge cases as well. Also, knowing the limits of the device, and how long it can be without an internet connection, helps you mitigate these risks from your application rather than from the SDK.
From GitHub, the default sendMessageAsync method throws a timeout exception if your message sending fails, unless you have some kind of retry policy implemented (according to the documentation, the C SDK does not allow custom retry policies:
https://learn.microsoft.com/en-us/azure/iot-hub/iot-hub-reliability-features-in-sdks).
According to the documentation, in case of a connection failure the SDK will, based on the retry policy (if you have set one), try to re-establish the connection one way or another and queue the messages created in the meantime:
https://github.com/Azure/azure-iot-sdk-c/blob/master/doc/connection_and_messaging_reliability.md
So the expectation here is that the SDK does not take responsibility for memory limits; this is up to the application to deal with. Since your device has limited resources, I would recommend implementing your own queuing mechanism (maybe set no-retry as the policy and avoid SDK-side queuing that way). That way you control what happens when there is no internet connection, and you control the memory limits. Maybe your business case accepts that you calculate an average value and store 1 message instead of 50 over time, etc.
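For example, a minimal sketch of such an application-level cap with the C SDK could look like the following. The limit of 20 in-flight messages and the helper app_send_telemetry() are arbitrary assumptions, and the SDK calls should be verified against the SDK version you are using; combining this with IoTHubDeviceClient_LL_SetRetryPolicy() and IOTHUB_CLIENT_RETRY_NONE should further reduce what the SDK keeps around across reconnect attempts.

/* Sketch: cap how many telemetry messages are handed to the SDK at once.
 * MAX_INFLIGHT and app_send_telemetry() are made up for this example. */
#include <stdio.h>
#include "iothub_device_client_ll.h"
#include "iothub_message.h"

#define MAX_INFLIGHT 20

static size_t g_inflight = 0;

/* Called by the SDK once a message is acknowledged or given up on. */
static void send_confirm_cb(IOTHUB_CLIENT_CONFIRMATION_RESULT result, void *ctx)
{
    (void)ctx;
    if (g_inflight > 0) g_inflight--;
    if (result != IOTHUB_CLIENT_CONFIRMATION_OK)
        printf("send failed: %d\n", (int)result);
}

/* Refuse new telemetry once too many messages are still pending. */
static int app_send_telemetry(IOTHUB_DEVICE_CLIENT_LL_HANDLE h, const char *json)
{
    if (g_inflight >= MAX_INFLIGHT)
        return -1;                      /* drop, aggregate, or store elsewhere */

    IOTHUB_MESSAGE_HANDLE msg = IoTHubMessage_CreateFromString(json);
    if (msg == NULL)
        return -1;

    IOTHUB_CLIENT_RESULT r =
        IoTHubDeviceClient_LL_SendEventAsync(h, msg, send_confirm_cb, NULL);
    IoTHubMessage_Destroy(msg);         /* the SDK keeps its own copy */
    if (r != IOTHUB_CLIENT_OK)
        return -1;

    g_inflight++;
    return 0;
}

Because this is the LL (single-threaded) client, the confirmation callback only fires from IoTHubDeviceClient_LL_DoWork(), so the counter needs no locking.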
If this is not something you like, the documentation also says that you can set a timeout for the queue - maybe not a memory limit, but a timeout, so you may want to investigate this a bit deeper:
"There are two timeout controls in this system. An original one in the iothub_client_ll layer - which controls the "waiting to send" queue - and a modern one in the protocol transport layer - that applies to the "in progress" list. However, since IoTHubClient_LL_DoWork causes the Telemetry messages to be immediately* processed, sent and moved to the "in progress" list, the first timeout control is virtually non-applicable.
Both can be fine-tuned by users through IoTHubClient_LL_SetOption, and because of that removing the original control could cause a break for existing customers. For that reason it has been kept as is, but it will be re-designed when we move to the next major version of the product."
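If you want to try that queue timeout, the pattern with the LL client would presumably be something like the sketch below; the option name OPTION_MESSAGE_TIMEOUT ("messageTimeout"), its unit and its value type should be double-checked against iothub_client_options.h in your SDK version.

/* Sketch: limit how long a "waiting to send" message may stay queued.
 * Verify the option name, unit and value type against your SDK headers. */
#include "iothub_device_client_ll.h"
#include "iothub_client_options.h"
#include "azure_c_shared_utility/tickcounter.h"

static void configure_message_timeout(IOTHUB_DEVICE_CLIENT_LL_HANDLE h)
{
    /* Assumed to be milliseconds; messages older than this are failed back
     * to the send confirmation callback instead of piling up in memory. */
    tickcounter_ms_t timeout_ms = 60 * 1000;
    (void)IoTHubDeviceClient_LL_SetOption(h, OPTION_MESSAGE_TIMEOUT, &timeout_ms);
}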

RabbitMQ: how can a worker "ignore" a message and let another worker handle it

Here's my current architecture
I have a bunch of IoT devices that connect through a raw, duplex, persistent TCP connection to one instance of my "worker", which is connected to a RabbitMQ queue.
My publisher publishes messages that look like this:
{
"iot_device_name" : "A",
"command" : "reboot"
}
The worker is then able to map the iot_device_name to the TCP socket.
All of this works nicely, but if we want to add HA and scale out a bit, it would be better to have 4 instances of the worker. Load balancing the TCP side is not a problem (with HAProxy or Nginx).
Now the problem is how to split the load on the queue side, as the list of IoT devices handled by a worker is dynamic (i.e. a device could disconnect and reconnect to another worker).
So is there a way for a worker to say "hmm, no, I can't handle this message because I don't know this device, give it to someone else", so that another worker can take it and handle it?
Other information that may be of help:
the workers are all in the same network, which is also the same network as the publisher
the number of workers is not dynamic, and even if we extrapolate the number of devices for the next few years, 8 workers would take us VERY far, as they simply route/transcode messages, so their CPU load is negligible.
So if I understand your architecture correctly, you have commands sent to your publisher on one side, which are pushed into rabbitmq.
On the consumer side, you have multiple workers, to which the messages are dispatched, and each worker has a bunch of devices connected to it.
If indeed this is your architecture, I'd propose the following for your rabbitmq configuration:
use a direct exchange
each worker has its own queue (exclusive) and manages the bindings between the exchange and its queue dynamically:
each time a device connects to a worker, that worker adds a binding between its queue and the exchange, with the device's identifier as the routing key
each time a worker detects that a device is no longer connected to it, it removes the related binding from the RabbitMQ configuration
regarding the detection of disconnected devices, I'd expect it to be common that a worker only realizes a device isn't connected to it anymore upon receiving a command to push to that device; in such cases, in addition to adapting the bindings, the worker should republish the message to the same exchange with the same routing key, so that it gets another shot at being consumed by the proper worker (see the sketch after this list)
I'd also consider configuring a TTL on the queues - there is no point in consuming a message that's too old
The publisher will of course also need to adapt its publishing, setting the intended device's identifier as the routing key
I hope this proposal makes sense. There are a few other cases to consider: an alternate exchange to make sure we don't lose requests if there is a (short) period during which a device hasn't reconnected to any worker yet and we get a command for it anyway, adding a property to republished messages to ensure we don't create an infinite loop in the system, ... but what is indicated above should be a reasonable starting point to achieve your goal.
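To make the binding management concrete, here is a rough sketch against the rabbitmq-c client; the exchange name "iot_commands", channel number 1 and the omission of error handling are assumptions for illustration, not a production implementation.

/* Sketch of per-worker binding management with rabbitmq-c.
 * 'conn' is an already-opened connection with channel 1 opened;
 * the exchange name is an assumption for this example. */
#include <amqp.h>

#define COMMAND_EXCHANGE "iot_commands"

/* Device connected to this worker: route its commands to our queue. */
static void on_device_connected(amqp_connection_state_t conn,
                                amqp_bytes_t own_queue,
                                const char *device_id)
{
    amqp_queue_bind(conn, 1, own_queue,
                    amqp_cstring_bytes(COMMAND_EXCHANGE),
                    amqp_cstring_bytes(device_id),
                    amqp_empty_table);
}

/* Device gone: stop receiving its commands. */
static void on_device_disconnected(amqp_connection_state_t conn,
                                   amqp_bytes_t own_queue,
                                   const char *device_id)
{
    amqp_queue_unbind(conn, 1, own_queue,
                      amqp_cstring_bytes(COMMAND_EXCHANGE),
                      amqp_cstring_bytes(device_id),
                      amqp_empty_table);
}

/* Got a command for a device that is no longer connected to us:
 * drop our binding and republish so the right worker can pick it up. */
static void republish_for_other_worker(amqp_connection_state_t conn,
                                       amqp_bytes_t own_queue,
                                       const char *device_id,
                                       amqp_bytes_t body)
{
    on_device_disconnected(conn, own_queue, device_id);
    amqp_basic_publish(conn, 1,
                       amqp_cstring_bytes(COMMAND_EXCHANGE),
                       amqp_cstring_bytes(device_id),
                       0 /* mandatory */, 0 /* immediate */,
                       NULL, body);
}

queue.bind, queue.unbind and basic.publish are plain AMQP 0-9-1 methods, so the same approach is available from any client library.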

CAN error counters and interrupts

I'm using the bxCAN peripheral of an STM32F3 microcontroller in an environment where
1.) it is essential that the node is detached from the network once the REC/TEC has reached the warning level (waiting for the bus-off condition is not an option)
2.) the baud rate of the host network is unknown
3.) the connection might be sporadic as the node is connected by the user
Due to 1.), the STM32 HAL CAN driver is used in interrupt (IT) mode, and whenever the error callback is called with the EWG flag set, it shuts down the transceiver and deinitializes the bxCAN. In case the REC is over the limit, the situation is easily recovered by configuring the bxCAN in silent mode, assuming there is traffic on the CAN bus. However, if the TEC is over the limit, the bxCAN won't be able to transmit another frame, as the error interrupt will be triggered instantly once it is re-enabled -> there we are in a deadlock.
I tried decrementing the TEC by transmitting frames in silent loopback mode, but it seems successful transmissions do not affect the TEC in this mode.
I suppose the question is not specific to this peripheral but valid for other CAN implementations.
Any suggestions are welcome.
I have implemented a work-around that seems to work fine, with the following requirements:
1.) whenever the CAN error ISR is triggered, it disconnects the node from the bus (the transceiver is powered off)
2.) not all interrupt sources are enabled, only the ones that are of higher severity than the last error state (e.g. in PASSIVE state the WARNING and PASSIVE interrupts are disabled and the BUSOFF interrupt is enabled)
3.) the last error state and thus the interrupt sources are updated whenever a.) an error ISR is triggered or b.) polling the CAN peripheral with a high frequency shows change in the error state
4.) whenever attempting a connection to the bus the REC must heal in listen-only mode first. For this, traffic is required on the bus.
With these requirements implemented the node is able to fail silently but recover to normal operation.
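A hedged sketch of requirements 1.) and 2.) with the (newer) STM32 HAL CAN driver could look like the following; the handle name hcan, the transceiver enable pin (GPIOB pin 5) and the exact set of flags are assumptions to adapt to your project.

/* Sketch of selective error-interrupt enabling on bxCAN via the HAL.
 * 'hcan' and the transceiver enable pin are assumptions for this example. */
#include "stm32f3xx_hal.h"

extern CAN_HandleTypeDef hcan;

typedef enum { ERR_ACTIVE, ERR_WARNING, ERR_PASSIVE, ERR_BUSOFF } err_state_t;

/* Requirement 2: enable only the error interrupts that are more severe than
 * the state we are already in, so a sticky flag cannot re-trigger forever.
 * CAN_IT_ERROR (ERRIE) gates the whole error interrupt line in bxCAN. */
static void update_error_irqs(err_state_t state)
{
    HAL_CAN_DeactivateNotification(&hcan, CAN_IT_ERROR | CAN_IT_ERROR_WARNING |
                                          CAN_IT_ERROR_PASSIVE | CAN_IT_BUSOFF);
    switch (state) {
    case ERR_ACTIVE:
        HAL_CAN_ActivateNotification(&hcan, CAN_IT_ERROR | CAN_IT_ERROR_WARNING |
                                            CAN_IT_ERROR_PASSIVE | CAN_IT_BUSOFF);
        break;
    case ERR_WARNING:
        HAL_CAN_ActivateNotification(&hcan, CAN_IT_ERROR | CAN_IT_ERROR_PASSIVE |
                                            CAN_IT_BUSOFF);
        break;
    case ERR_PASSIVE:
        HAL_CAN_ActivateNotification(&hcan, CAN_IT_ERROR | CAN_IT_BUSOFF);
        break;
    case ERR_BUSOFF:
        break;  /* nothing left to escalate to */
    }
}

/* Requirement 1: any error interrupt detaches the node from the bus. */
void HAL_CAN_ErrorCallback(CAN_HandleTypeDef *h)
{
    HAL_GPIO_WritePin(GPIOB, GPIO_PIN_5, GPIO_PIN_RESET);  /* transceiver off */
    HAL_CAN_Stop(h);

    if (h->ErrorCode & HAL_CAN_ERROR_BOF)      update_error_irqs(ERR_BUSOFF);
    else if (h->ErrorCode & HAL_CAN_ERROR_EPV) update_error_irqs(ERR_PASSIVE);
    else if (h->ErrorCode & HAL_CAN_ERROR_EWG) update_error_irqs(ERR_WARNING);
}

The listen-only healing phase from requirement 4.) would then presumably reconfigure hcan.Init.Mode = CAN_MODE_SILENT before calling HAL_CAN_Init() and HAL_CAN_Start() again.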

understanding the concept of running a program in interrupt handler

Early Cisco routers running the IOS operating system enhanced their packet processing speed by doing packet switching within the interrupt handler instead of in a "regular" operating system process. Doing packet processing in the interrupt handler ensured that context switching within the operating system did not affect packet processing. As I understand it, an interrupt handler is a piece of software in the operating system meant for handling interrupts. How should I understand the concept of packet switching being done within the interrupt handler?
The use of interrupts is preferred when an event requires immediate attention by the operating system, or by a program which installed an interrupt service routine. This is as opposed to polling, where software periodically checks whether a condition exists which indicates that the event has occurred.
Interrupt service routines aren't commonly meant to do a lot of work themselves. They are rather written to reach their end as quickly as possible, so that normal execution can resume - "normal execution" meaning the location and state at which the previous processing was interrupted. The reason is that it must be avoided that the same interrupt occurs again while its handler is still executing, since it may then be ignored, lead to incorrect results, or, even worse, to software failure (crashes). So what an interrupt service routine usually does is read any data associated with the event and store it in a queue, signal that the queue has been modified, set things up so that another interrupt may occur, and then resume by restoring the pre-interrupt context. The queued data associated with that interrupt can now be processed asynchronously, without risking that interrupts pile up.
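As an illustration of that pattern (not Cisco's actual code), a minimal "the ISR only enqueues, the main loop processes" sketch in C could look like this; the ring buffer size, the packet layout and all names are arbitrary.

/* Minimal "defer work out of the ISR" pattern: the ISR only copies the
 * event into a ring buffer; the main loop does the real processing.
 * All names and sizes here are illustrative, not from any real driver. */
#include <stdint.h>
#include <stdbool.h>

#define QUEUE_SIZE 32                        /* must be a power of two */

typedef struct { uint8_t data[64]; uint16_t len; } packet_t;

static volatile packet_t queue[QUEUE_SIZE];
static volatile uint32_t head, tail;         /* head: ISR writes, tail: main reads */

/* Called in interrupt context: keep it short, just enqueue and return. */
void rx_interrupt_handler(const uint8_t *frame, uint16_t len)
{
    uint32_t next = (head + 1) & (QUEUE_SIZE - 1);
    if (next == tail)
        return;                              /* queue full: drop (count it in real code) */
    for (uint16_t i = 0; i < len && i < sizeof(queue[0].data); i++)
        queue[head].data[i] = frame[i];
    queue[head].len = len;
    head = next;                             /* publish the slot last */
}

/* Called from normal (non-interrupt) context, as often as convenient. */
bool process_one_packet(void (*handle)(const packet_t *))
{
    if (tail == head)
        return false;                        /* nothing queued */
    handle((const packet_t *)&queue[tail]);  /* the slow work happens here */
    tail = (tail + 1) & (QUEUE_SIZE - 1);
    return true;
}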
The following is the procedure for executing interrupt-level switching:
Look up the memory structure to determine the next-hop address and outgoing interface.
Do an Open Systems Interconnection (OSI) Layer 2 rewrite, also called MAC rewrite, which means changing the encapsulation of the packet to comply with the outgoing interface.
Put the packet into the tx ring or output queue of the outgoing interface.
Update the appropriate memory structures (reset timers in caches, update counters, and so forth).
The interrupt which is raised when a packet is received from the network interface is called the "RX interrupt". This interrupt is dismissed only when all the above steps are executed. If any of the first three steps above cannot be performed, the packet is sent to the next switching layer. If the next switching layer is process switching, the packet is put into the input queue of the incoming interface for process switching and the interrupt is dismissed. Since interrupts cannot be interrupted by interrupts of the same level and all interfaces raise interrupts of the same level, no other packet can be handled until the current RX interrupt is dismissed.
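A purely illustrative C skeleton of that fast path is shown below; nothing in it is actual IOS code, and every function in it is a placeholder standing in for one of the numbered steps above.

/* Illustrative fast-path skeleton: if the packet can be fully handled
 * inside the RX interrupt it is rewritten and queued for TX, otherwise
 * it is punted to process switching. All functions are placeholders. */
#include <stdbool.h>

typedef struct packet packet_t;
typedef struct interface interface_t;

/* Placeholder helpers assumed to exist elsewhere. */
extern bool fib_lookup(packet_t *p, interface_t **out_if);        /* step 1 */
extern bool mac_rewrite(packet_t *p, interface_t *out_if);        /* step 2 */
extern bool tx_ring_enqueue(interface_t *out_if, packet_t *p);    /* step 3 */
extern void update_counters(interface_t *out_if);                 /* step 4 */
extern void punt_to_process_switching(interface_t *in_if, packet_t *p);

/* Runs inside the RX interrupt, before the interrupt is dismissed. */
void rx_interrupt_switch(interface_t *in_if, packet_t *p)
{
    interface_t *out_if;

    if (fib_lookup(p, &out_if) &&      /* next hop + outgoing interface */
        mac_rewrite(p, out_if) &&      /* L2 encapsulation rewrite      */
        tx_ring_enqueue(out_if, p))    /* hand over to the TX ring      */
    {
        update_counters(out_if);       /* caches, counters, timers      */
    }
    else
    {
        /* Any step failed: fall back to the slower switching layer. */
        punt_to_process_switching(in_if, p);
    }
    /* The RX interrupt is dismissed after returning from here. */
}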
Different interrupt switching paths can be organized in a hierarchy, from the one providing the fastest lookup to the one providing the slowest lookup. The last resort used for handling packets is always process switching. Not all interfaces and packet types are supported in every interrupt switching path. Generally, only those that require examination and changes limited to the packet header can be interrupt-switched. If the packet payload needs to be examined before forwarding, interrupt switching is not possible. More specific constraints may exist for some interrupt switching paths. Also, if the Layer 2 connection over the outgoing interface must be reliable (that is, it includes support for retransmission), the packet cannot be handled at interrupt level.
The following are examples of packets that cannot be interrupt-switched:
Traffic directed to the router (routing protocol traffic, Simple Network Management Protocol (SNMP), Telnet, Trivial File Transfer Protocol (TFTP), ping, and so on). Management traffic can be sourced and directed to the router. They have specific task-related processes.
OSI Layer 2 connection-oriented encapsulations (for example, X.25). Some tasks are too complex to be coded in the interrupt-switching path because there are too many instructions to run, or timers and windows are required. Some examples are features such as encryption, Local Area Transport (LAT) translation, and Data-Link Switching Plus (DLSW+).
More here: http://www.cisco.com/c/en/us/support/docs/ios-nx-os-software/ios-software-releases-121-mainline/12809-tuning.html