Do "error frames" on the CAN bus delay or impair communication?

The quote below is from a document by Texas Instruments.
The error frame is a special message that violates the formatting
rules of a CAN message. It is transmitted when a node detects an error
in a message, and causes all other nodes in the network to send an
error frame as well. The original transmitter then automatically
retransmits the message. An elaborate system of error counters in the
CAN controller ensures that a node cannot tie up a bus by repeatedly
transmitting error frames.
Also, this Wikipedia page provides more information on error frames.
As mentioned in several answers (link1, link2), CAN bus is half-duplex, that is, the nodes cannot transmit and receive data at the same time.
In general, a modern car contains more than 50 ECUs (nodes) on a CAN network. In case of an error, if the nodes sent their error frames one after another, the CAN bus would be occupied for quite a long time.
So, what am I missing here? Do the nodes send their error frames simultaneously, with the hardware resolving the overlap? What happens if a node transmits a different or corrupted error frame?

In most cases, the other nodes will not send error-frames one after the other. If there is an error on the bus, it is very likely that all nodes will perceive it. They will then all send their error-frames at (close to) the same time. As they all expect to see "their" error-frame, it does not matter who gets there first.
In the (unusual) case of an error being noted by only one node (perhaps because of some transient within the ECU), that node will transmit an error-frame. The other ECUs will react to this error-frame (which is "simply" a violation of the stuffing rules) with their own error-frames. But again, they will all see it at the same time, so the case described above applies: they will all transmit their "own" error-frame at about the same time.
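To put a number on how long the bus stays occupied, here is a rough sketch. It assumes the bit counts from the Bosch CAN 2.0 specification (a 6-bit error flag, at most 12 bits once the flags of all nodes overlap, and an 8-bit recessive delimiter) and a 500 kbit/s bus, which is only an example bitrate. The function name is made up for illustration.

```javascript
// Bit counts as given in the Bosch CAN 2.0 specification.
const ERROR_FLAG_BITS = 6;           // error flag of a single node
const MAX_SUPERPOSED_FLAG_BITS = 12; // overlapping flags of all nodes, worst case
const ERROR_DELIMITER_BITS = 8;      // recessive error delimiter

// Shortest and longest bus occupation of one error frame, in microseconds.
// The result does not depend on how many nodes are on the bus.
function errorFrameDurationUs(bitrate) {
  const bitTimeUs = 1e6 / bitrate;
  return {
    minUs: (ERROR_FLAG_BITS + ERROR_DELIMITER_BITS) * bitTimeUs,
    maxUs: (MAX_SUPERPOSED_FLAG_BITS + ERROR_DELIMITER_BITS) * bitTimeUs,
  };
}

const duration = errorFrameDurationUs(500000); // 500 kbit/s is an assumed bitrate
console.log(duration.minUs, duration.maxUs);   // 28 40
```

Because the flags overlap, 50 nodes cost the same 14 to 20 bit times as two nodes would.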
As noted by @Lundin in the question comments, error-frames are very unusual, so their impact on the bus load is not a major concern.
I do not understand this part of your question:
What happens if a node transmitted a different or corrupted error frame?
A node "cannot" transmit a different error-frame - it would not be an error-frame. An error-frame being corrupted is very unlikely as it is a string of dominant bits, which are driven hard, and usually by several to many ECUs at a time. If it were to happen, I think (but would have to check the spec) the ECUs would notice this as another error and transmit another error-frame.

A node that repeatedly sends active error frames first goes into the "Error Passive" state (many controllers also flag an earlier "Warning" level) and then later into the "Bus Off" state. This prevents a broken node from becoming a "babbling idiot".
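As a rough illustration of how quickly this kicks in, the sketch below counts consecutive failed transmissions for a node starting at TEC = 0. It assumes the basic rule from the Bosch specification (TEC + 8 per transmit error) and the usual thresholds (error passive above 127, bus-off above 255), and it ignores the exceptions listed in the specification. The function name is hypothetical.

```javascript
// Counts how many consecutive transmit errors push the TEC above `threshold`,
// assuming the TEC starts at 0 and grows by 8 per failed transmission.
function failuresUntilTecExceeds(threshold) {
  let tec = 0;
  let failures = 0;
  while (tec <= threshold) {
    tec += 8;
    failures += 1;
  }
  return failures;
}

console.log(failuresUntilTecExceeds(127)); // 16 failures until error passive
console.log(failuresUntilTecExceeds(255)); // 32 failures until bus-off
```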
See Bosch CAN specification page 63

Related

Clear WebRTC Data Channel queue

I have been trying to use WebRTC Data Channel for a game, however, I am unable to consistently send live player data without hitting the queue size limit (8KB) after 50-70 secs of playing.
Since the data is required to be real-time, I have no use for data that arrives out of order. I have initialized the data channel with the following attributes:
    negotiated: true,
    id: id,
    ordered: true,
    maxRetransmits: 0,
    maxPacketLifetime: 66
The MDN Docs said that the buffer cannot be altered in any way.
Is there any way I can consistently send data without exceeding the buffer space? I don't mind purging the buffer, as it only contains data that has been clogged up over time.
NOTE: The data is transmitting until the buffer size exceeds the 8KB space.
EDIT: I forgot to add that this issue only occurs when the two sides are on different networks. When both are within the same LAN, there is no buffering (because of the higher bandwidth, I presume). I tried adding multiple Data Channels (8 in parallel). However, this only increased the time before the failure occurred again; all 8 buffers filled up. I also tried creating a new channel each time the buffer was close to full, switching to the new DC and closing the previous one, but I found out the hard way (by reading the Note in the MDN Docs) that the buffer space is not released immediately; instead, the channel tries to transmit all the data left in its buffer, taking away precious bandwidth.
Thanks in advance.
The maxRetransmits value is ignored if the maxPacketLifetime value is set; thus, you've configured your channel to resend packets for up to 66ms. For your application, it is probably better to use a pure unreliable channel by setting maxPacketLifetime to 0.
As Sean said, there is no way to flush the queue. What you can do is to drop packets before sending them if the channel is congested:
    if (dc.bufferedAmount > 0)
        return;
    dc.send(data);
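A slightly fuller sketch of the same idea: instead of silently dropping an update while the channel is congested, keep only the newest one and send it once the buffer drains. It relies on the standard `bufferedAmount`, `bufferedAmountLowThreshold` and `onbufferedamountlow` members of `RTCDataChannel`; the helper name `createLatestOnlySender` and the fake channel used for the demonstration are made up for this example.

```javascript
// `channel` is anything shaped like an RTCDataChannel.
function createLatestOnlySender(channel) {
  let pending = null; // newest update that has not been sent yet

  function flush() {
    const drained = channel.bufferedAmount <= channel.bufferedAmountLowThreshold;
    if (pending !== null && drained) {
      channel.send(pending);
      pending = null;
    }
  }

  channel.bufferedAmountLowThreshold = 0;
  channel.onbufferedamountlow = flush; // fires when the send buffer has drained

  return function sendLatest(data) {
    pending = data; // overwrite: a stale update is never queued
    flush();
  };
}

// Demonstration with a fake channel (hypothetical stand-in for a real one).
const fakeChannel = {
  sent: [],
  bufferedAmount: 0,
  bufferedAmountLowThreshold: 0,
  onbufferedamountlow: null,
  send(data) { this.sent.push(data); this.bufferedAmount += data.length; },
};
const sendLatest = createLatestOnlySender(fakeChannel);
sendLatest("pos-1"); // sent immediately
sendLatest("pos-2"); // channel busy: held back
sendLatest("pos-3"); // replaces pos-2
fakeChannel.bufferedAmount = 0;    // simulate the buffer draining
fakeChannel.onbufferedamountlow(); // sends pos-3
console.log(fakeChannel.sent);     // [ 'pos-1', 'pos-3' ]
```

With a real channel you would pass the object returned by `createDataChannel` and let the browser fire the event.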
Finally, you should realise that buffering may happen in the network as well as at the sender: any router can buffer packets when it is congested, and many routers have very large buffers (this is called BufferBloat). The WebRTC stack should prevent you from buffering too much data in the network, but if WebRTC's behaviour is not aggressive enough for your needs, you will need to add explicit feedback from the sender to the receiver in order to avoid having too many packets in flight.
I don't believe you can flush the outbound buffer. You will probably need to watch bufferedAmount and adjust what you are sending if it grows.
Maybe handle the retransmissions yourself and discard old data if needed? WebRTC doesn't surface the SACKs from SCTP, so I think you will need to implement something yourself.
It's an interesting problem. I would love to hear the WebRTC W3C Working Group's take on whether exposing more information would make things easier for you.

What is the difference between an error active node and an error passive node in CAN?

I understand the concept of TEC and REC counters in CAN. Will the error active node send active error frames upon detection of error?
Once the TEC count is above 127 then the error active node will become error passive. Does this mean it will start transmitting passive error frames?
Also, when other nodes detect that a node is transmitting active error frames, do they automatically transmit passive error frames? Can these nodes be referred to as error passive nodes?
This is my confusion which needs clarity.
Yes, it will stop sending out so-called active error frames made of dominant bits, and switch to passive error frames made of recessive bits. Other nodes will not respond, but they will increase their REC counters. Once the active error frame is sent, bus arbitration restarts as usual, with the highest-priority frame winning.
Quoting an article from CAN-CiA:
Fault confinement
The CAN data link layers detect all communication errors with a very high probability. A node detecting an error condition sends an Error Flag and discards the currently transmitted frame. All nodes receiving an Error Flag discard the message, too. In case of local failures, all other nodes recognize the Error Frame sent by the node(s) that detected it and sent by themselves a second time, which results in an eventually overlapping Error Frame. The active Error Frame is made of six dominant bits and an 8-bit recessive delimiter followed by the IMF. This local error globalization method guarantees network-wide data consistency, an important feature in distributed control systems.
If all errors are detected with a very high probability, permanent errors may lead to an unacceptable delay in transmitting messages. In the worst-case, all communication is aborted by means of Error Frames. In order to avoid this, the CAN protocol introduces two error counters: one for received messages (REC) and one for transmitted messages (TEC). They are increased and decreased according to the rules as specified in ISO 11898-1, the standard of the CAN data link layer protocols.
If one of the counters reaches 127, the node transits to error passive state. In this state, the node transmits passive Error Flags made of six recessive bits. This flag is overwritten by dominant bits of a transmitting node. This means that an error passive node can’t inform the other nodes about an incorrectly received frame. This is a critical situation from the viewpoint of the system. If a transmitting node permanently produces Error Flags, this would also delay and in the worst-case (high-prior message) block the other communication. Therefore, the node is forced into bus-off state, if the TEC reaches 256. In bus-off state, the node transmits only recessive bit-level. To transit to the error active state requires two conditions: a reset and the occurrence of 128 by 11 bit-times. This means that the remaining nodes are able to transmit 128 data frames before the node in bus-off recovers and integrates itself again as an error active node into the network.
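The state rules can be condensed into a small sketch. It assumes the thresholds as worded in the Bosch CAN 2.0 specification (a counter above 127 means error passive, a TEC above 255 means bus-off) and the "128 by 11 bit-times" recovery condition quoted above. The function names and the 500 kbit/s bitrate are only illustrative.

```javascript
// Classifies a node from its transmit (TEC) and receive (REC) error counters.
function errorState(tec, rec) {
  if (tec > 255) return "bus-off";
  if (tec > 127 || rec > 127) return "error-passive";
  return "error-active";
}

// Shortest possible bus-off recovery: 128 occurrences of 11 recessive bits.
function minBusOffRecoveryMs(bitrate) {
  return (128 * 11 * 1000) / bitrate;
}

console.log(errorState(0, 0));            // error-active
console.log(errorState(128, 0));          // error-passive
console.log(errorState(256, 0));          // bus-off
console.log(minBusOffRecoveryMs(500000)); // 2.816 (milliseconds at 500 kbit/s)
```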

CAN bus arbitration backoff time

I am aware of the way CAN bus does its arbitration. In a nutshell, the CAN node having more '0's in its identifier wins the right to transmit on the bus, and the rest of the contending nodes back off.
But I can't find any details on how long a node that backed off waits before retrying to win the bus. I consulted a few sources but still can't find the answer. Is there any experimental evidence for this?
Bosch CAN
Introduction to the Controller Area Network
It is free to try again after the winning frame has been transmitted and no dominant bit has been found in the "intermission field" at the end of the CAN frame. You'll probably find a formal definition of this if you search the spec for "intermission field", see for example 3.1.5 of the old (obsolete) Bosch spec you linked.
The important part here is to realize that every CAN controller listens to every single frame, even if it isn't interested in it. This is how you achieve collision avoidance, rather than collision detection.
As mentioned in the Bosch CAN specification document all the CAN nodes can start to send pending frames when Bus Idle condition occurs (no dominant bit found on the bus). During the intermission period in the Interframe spacing no node can transmit (Overload frames can be transmitted but not Data or Remote frames). CAN nodes must wait for 3 recessive bits during this period. All nodes can start transmitting right after this intermission period.
If multiple nodes start at once after the intermission period, the frame with the lowest identifier wins the arbitration. If a remote frame and a data frame with the same identifier start at the same time from different nodes, the data frame wins the arbitration.
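A minimal simulation of the bitwise arbitration, assuming 11-bit identifiers that are unique on the bus. The bus is modelled as a wired-AND: a dominant bit (0) from any node overrides recessive bits (1). The function name is made up for this sketch.

```javascript
// Returns the identifier that wins arbitration among nodes starting together.
function arbitrate(ids, idBits = 11) {
  let contenders = [...ids];
  for (let bit = idBits - 1; bit >= 0; bit--) { // most significant bit first
    // The bus is recessive (1) only if every contender sends recessive.
    const busLevel = contenders.every((id) => ((id >> bit) & 1) === 1) ? 1 : 0;
    // A node that sent recessive but reads dominant has lost and backs off.
    contenders = contenders.filter((id) => ((id >> bit) & 1) === busLevel);
  }
  return contenders[0];
}

console.log(arbitrate([0x65a, 0x123, 0x124]).toString(16)); // 123
console.log(arbitrate([0x400, 0x3ff]).toString(16));        // 3ff
```

The second call shows that the numerically lowest identifier wins, not the one with the most zero bits: 0x400 has ten zeros but loses to 0x3FF on the very first bit.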
I agree with the answers above, but I was looking for a more mathematical analysis of CAN bus timings. I found these excellent lecture notes: Time analysis of CAN messages, Chapter 3.

Can a CAN message have two reliable recipients?

In my situation multiple modules report their state over a CAN bus to a central processor, which replies and drives them. There's also a supervising processor, which listens in on the CAN bus and analyzes incoming messages from the modules for critically dangerous situations (two different modules reporting activating outputs which are absolutely forbidden from being activated simultaneously).
This all works okay as long as the CAN bus is noise-free.
CAN bus guarantees the recipient to receive a message; the message will be resent if no recipient confirms receiving it. The problem begins if there's more than one recipient and all of them absolutely must receive the message.
If the line is clean, both receive it, confirm it, and everything is okay.
If the message is badly damaged, neither will receive it, and it will be resent. That's okay.
But if the noise on the line is "just on the brink", one of them will receive it, and confirm, and the other will fail to receive it (noise on its end of the bus just minimally worse), and since the sender got the confirmation, the message won't be resent.
Is there a reliable way to assure two different recipients of a message both receive it? ...other than sending two messages with two addresses, specifically? (it's essential that the supervising CPU hears the same messages as the main CPU, not just similar)
There is no way at the CAN layer to detect receipt by more than one module. You would need to add messages to your communication protocol to confirm receipt if this is absolutely critical. As mentioned, you could have each module receive the same message and send a unique reply.
Some general thoughts:
1) Are the important messages broadcast periodically? If so, the recipient could test that the periodicity of the message is correct and fail safely if the period is violated.
2) CAN is a very robust network. In my many years, I have not seen noise affecting a single node like you described, other than when the node was at the end of an exceedingly (and irrationally) long wire. You are correct to worry about this scenario and to design your message format and system to be robust to all CAN failures. Generally, when safety or reliability was paramount, we would have more than one CAN bus communicating the information, along with a number of crosscheck messages to verify not only that the path was intact but also that the device on the other end was operating intelligently. Our general assumption was that if crosscheck messages were making the trip, then our operational messages were making the trip successfully as well.
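The periodicity check from point 1 could look roughly like the sketch below. The period, the tolerance and all names are assumptions made for illustration; timestamps are passed in explicitly so the logic does not depend on any particular timer or CAN API.

```javascript
// Watches one periodic message and reports whether it is still arriving on time.
function createPeriodMonitor(expectedPeriodMs, toleranceMs) {
  let lastSeenMs = null; // timestamp of the most recent reception

  return {
    onMessage(nowMs) {
      lastSeenMs = nowMs;
    },
    // False before the first message and once the message is overdue.
    isHealthy(nowMs) {
      return lastSeenMs !== null &&
        nowMs - lastSeenMs <= expectedPeriodMs + toleranceMs;
    },
  };
}

const monitor = createPeriodMonitor(100, 20); // assumed 100 ms period, 20 ms slack
monitor.onMessage(0);
console.log(monitor.isHealthy(110)); // true
console.log(monitor.isHealthy(250)); // false: fail safely here
```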
Obviously not.
It fails even in the simple case where one receiver is shut down.
There is no way for the master to detect this (for this single packet).
You would need an advanced CAN with more acknowledge slots, one slot for each recipient.
But you could require each recipient to confirm the message with a unique response message.
Your master can then detect by a timeout that not all recipients received the message.
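A sketch of that application-level confirmation, under the assumption that every recipient answers with its own reply message. The recipient names, the timeout and the helper name are hypothetical; time is passed in explicitly to keep the example independent of any CAN driver.

```javascript
// Tracks which recipients have confirmed one particular message.
function createAckTracker(recipients, timeoutMs, sentAtMs) {
  const waiting = new Set(recipients);

  return {
    onAck(recipient) {
      waiting.delete(recipient);
    },
    allAcked() {
      return waiting.size === 0;
    },
    // Recipients still silent once the timeout has expired, else an empty list.
    missing(nowMs) {
      return nowMs - sentAtMs >= timeoutMs ? [...waiting] : [];
    },
  };
}

const tracker = createAckTracker(["main", "supervisor"], 50, 0);
tracker.onAck("main");
console.log(tracker.missing(10)); // [] (timeout not reached yet)
console.log(tracker.missing(60)); // [ 'supervisor' ] -> resend or fail safely
```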

USB CDC device stalling

I'm writing a simple virtual serial port device to replace an older serial port. By this point I'm able to enumerate the device and send/receive characters.
After a varying number of bulk-out transmissions from the host to the device, the endpoint appears to give up and stop transferring data. On the PC side I receive a write error, and judging from a USBlyzer trace the music stops on a stall (USBD_STATUS_STALL_PID). However, my code never intentionally issues a STALL condition on that endpoint, and the status flag for having generated one never gets set.
Given the short amount of time elapsed (<300 µs) between issuing the request and the STALL it would appear to be an invalid response of some sort, and not a time-out. On the device side the output endpoint is ready to go, with data in the buffer and proper DATA0/1 synchronization, but nothing further ever happens.
Note that the device appears to work fine even for long periods of time until I start sending "large" quantities of data. As near as I can tell the device enumeration/configuration also appears to complete successfully. Oh, and the bulk-in endpoint continues to work just fine after this.
For the record I'm using the standard Windows usbser.sys driver and an XMega128A4U µP. I'm also seeing the same behaviour across multiple Windows Vista and 7 machines.
Any ideas what I'm doing wrong or what further tests I might run to narrow things down?
USBlyzer log,
USB CDC stack,
test project
For the record this eventually turned out to be an oscillator problem. (Apparently the FLL's reference is always 1,024 Hz even when the 1,000 Hz USB frames are chosen. The slight clock error meant that a packet occasionally got rejected if it happened to contain one too many 1-bits in a row.)
I guess the moral of the story is to check the basics before assuming you've got a problem with the higher-level protocol. Also, in retrospect, a hardware USB analyzer would have been a worthwhile investment; the software alternatives mostly seem to spit out a generic error code or nothing at all when something goes awry.
Stalling the out-endpoint may happen on an overflow of the output buffer on the host side. Are you sure that the device does fetch the data it receives via out-endpoint - and if so does it fetch the data at least as fast as data is sent to the device?
Note that the device appears to work fine even for long periods of
time until I start sending "large" quantities of data.
This seems to be a hint for an overflow of the output-buffer.