CANopen network load higher than expected - embedded

I am working on a project with a master computer connected via a CANopen network to 4 slaves.
At each time step, the computer receives a measurement message from each slave, and sends them a control message. In total, 4 messages are received and 4 messages are sent at each time sample.
The message sent is a PDO with 6 data bytes (8 bytes including COB-ID)
The message received is a PDO with 8 data bytes (10 bytes including COB-ID)
My CAN network is configured at 1 Mbit/s, and I run my program at 1000 Hz (1 ms sampling time). As the total payload resulting from the messages described is 576 bits/cycle, the total load expected on the network is 576 kbit/s, or 57.6%.
What I see, however, is that:

- The controlling computer measures a load of ~86% (with minima of 68% and peaks of 100%).
- A USB CAN bus analyser I connect to the network registers around half the message traffic (count-wise) that I nominally expect (i.e., 4 sent and 4 received per cycle for 50 seconds should result in 50k messages, while I only see 18-25k).
- Moreover, I receive 1-2 error messages per cycle from the slave devices saying that the network is overloaded. Before it is pointed out: even counting the size of these messages as part of the traffic wouldn't come close to explaining the anomaly in the load.
What I'd like to know is whether my way of calculating the CANopen network load is correct. For instance, are there any protocol-specific handshakes, CRCs, or other extra bytes sent simply to make the network work? I couldn't find anything about this on the CANopen Wikipedia page, but I do know the original CAN bus standard appends such fields to every message.

In a CAN message, there is more than just the data to be transmitted.
There is also the arbitration ID (11 or 29 bits, depending on whether you use CAN 2.0A or 2.0B), a 15-bit CRC, a 7-bit EOF marker, the control field, and some other reserved bits.
Depending on the data, there may also be stuff bits.
Using CAN 2.0B and assuming 48 bits (6 bytes) of data, you get a message size of roughly 132 bits, and roughly 151 bits for your 64-bit messages.
Summing this up, you get roughly 1132 bits per cycle, which is too much for a 1 Mbit/s bus at 1000 Hz.
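For reference, here is a minimal C sketch of that arithmetic, bounding each frame between its stuff-free minimum and worst-case stuffing (field sizes per the CAN 2.0B frame format; the ~132/~151 figures above fall between these bounds). Note that CANopen PDOs normally use 11-bit identifiers, which saves roughly 20-25 bits per frame, but even then this cycle sits at or near 100% bus load:

#include <stdio.h>

/* Bit counts for a CAN 2.0B (29-bit ID) data frame.
 * Stuffable region (SOF..CRC): SOF(1) + base ID(11) + SRR(1) + IDE(1)
 *   + extended ID(18) + RTR(1) + r1(1) + r0(1) + DLC(4) + data(8n)
 *   + CRC(15) = 54 + 8n bits.
 * Never-stuffed fields: CRC delimiter(1) + ACK(2) + EOF(7)
 *   + interframe space(3) = 13 bits.
 * Worst case, one stuff bit is inserted per 4 bits of the stuffable region. */
static unsigned frame_bits_min(unsigned data_bytes)
{
    return 54 + 8 * data_bytes + 13;              /* no stuff bits at all */
}

static unsigned frame_bits_max(unsigned data_bytes)
{
    unsigned stuffable = 54 + 8 * data_bytes;
    return stuffable + (stuffable - 1) / 4 + 13;  /* worst-case stuffing */
}

int main(void)
{
    /* 4 PDOs with 6 data bytes + 4 PDOs with 8 data bytes per 1 ms cycle */
    unsigned lo = 4 * (frame_bits_min(6) + frame_bits_min(8));
    unsigned hi = 4 * (frame_bits_max(6) + frame_bits_max(8));
    printf("per-cycle traffic: %u..%u bits (budget at 1 Mbit/s: 1000)\n", lo, hi);
    return 0;
}

This prints 984..1200 bits per cycle, so even with hardly any stuff bits the 1 ms budget is essentially gone before any error or service traffic is added.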
Hope that helps.

Related

Reading data over the SPI bus using DMA with STM32F479

I am using the STM32F479 microcontroller along with an AFE440 analog front end. When data is ready to be read on the AFE, I get a trigger via the ADC_RDY pin on the microcontroller. At that point I need to read 4 different registers on the AFE, each with 3 bytes of data, and store them in a buffer (3 * 4 = 12 bytes total). Then I want my processor to sleep until the next event on the ADC_RDY pin, at which point I read another 12 bytes. I want to store the 12 bytes read each time in a FIFO buffer of size 120 bytes.
I would like to read the data and store the bytes into the buffer entirely using DMA. My processor will be asleep during this transaction. It will wake up once the FIFO buffer is full with 120 bytes and process the data.
How would I go about setting this up with ST?
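As a rough illustration of one way to do this with the STM32 HAL: start a 4-byte full-duplex DMA transfer for one register from the ADC_RDY EXTI interrupt, then chain the remaining reads from the DMA-complete callback. This is only a sketch; the handle name (hspi1), the EXTI pin, and the AFE register addresses are placeholder assumptions to check against the AFE datasheet.

#include <string.h>
#include "stm32f4xx_hal.h"   /* assumes SPI + DMA + EXTI configured in CubeMX */

#define REGS_PER_EVENT 4
#define BYTES_PER_REG  3
#define FIFO_SIZE      120                  /* 10 events x 12 bytes */

extern SPI_HandleTypeDef hspi1;             /* placeholder CubeMX handle */

static uint8_t fifo[FIFO_SIZE];
static volatile uint16_t fifo_idx;
static volatile uint8_t  reg_idx;
static uint8_t txbuf[1 + BYTES_PER_REG], rxbuf[1 + BYTES_PER_REG];

/* placeholder addresses of the 4 AFE result registers */
static const uint8_t afe_regs[REGS_PER_EVENT] = { 0x2A, 0x2B, 0x2C, 0x2D };

static void start_reg_read(void)
{
    txbuf[0] = afe_regs[reg_idx];           /* register address byte */
    memset(&txbuf[1], 0, BYTES_PER_REG);    /* clock out 3 dummy bytes */
    HAL_SPI_TransmitReceive_DMA(&hspi1, txbuf, rxbuf, sizeof(txbuf));
}

void HAL_GPIO_EXTI_Callback(uint16_t pin)   /* ADC_RDY edge */
{
    if (pin == GPIO_PIN_0) {                /* placeholder ADC_RDY pin */
        reg_idx = 0;
        start_reg_read();
    }
}

void HAL_SPI_TxRxCpltCallback(SPI_HandleTypeDef *hspi)
{
    (void)hspi;
    memcpy((void *)&fifo[fifo_idx], &rxbuf[1], BYTES_PER_REG); /* drop address byte */
    fifo_idx += BYTES_PER_REG;
    if (++reg_idx < REGS_PER_EVENT) {
        start_reg_read();                   /* chain the next register read */
    } else if (fifo_idx >= FIFO_SIZE) {
        fifo_idx = 0;
        /* signal the main loop: 120 bytes are ready to process */
    }
}

One caveat: the CPU still wakes briefly in each DMA-complete interrupt to chain the next register read, since the F4's DMA has no linked-list mode; between interrupts you can sit in __WFI(), which is close to, but not quite, sleeping through the whole transaction.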

STM32 USB Custom HID only 1 byte per transaction

I know that the maximum speed of a (full-speed) USB HID device is 64 kB/s, but on the oscilloscope I see transactions every 1 ms which contain only ONE byte. My HID report descriptor is listed below. What must I change to achieve 64 kB/s? Currently my bInterval = 0x01 (1 ms polling for the interrupt endpoint), but the actual throughput is 65 bytes/s, because a report-ID byte is added to my 64-byte data. I don't think USB should divide a single 64+1-byte report into 65 single-byte packets. For the experiment I use reportID=1 (from STM32 to PC). On the PC side I use hidapi.dll to interact.
__ALIGN_BEGIN static uint8_t CUSTOM_HID_ReportDesc_FS[USBD_CUSTOM_HID_REPORT_DESC_SIZE] __ALIGN_END =
{
  /* USER CODE BEGIN 0 */
  USAGE_PAGE(USAGE_PAGE_UNDEFINED)
  USAGE(USAGE_UNDEFINED)
  COLLECTION(APPLICATION)
    REPORT_ID(1)
    USAGE(1)
    LOGICAL_MIN(0)
    LOGICAL_MAX(255)
    REPORT_SIZE(8)
    REPORT_COUNT(64)
    INPUT(DATA | VARIABLE | ABSOLUTE)

    REPORT_ID(2)
    USAGE(2)
    LOGICAL_MIN(0)
    LOGICAL_MAX(255)
    REPORT_SIZE(8)
    REPORT_COUNT(64)
    OUTPUT(DATA | VARIABLE | ABSOLUTE)

    REPORT_ID(3)
    USAGE(3)
    LOGICAL_MIN(0)
    LOGICAL_MAX(255)
    REPORT_SIZE(8)
    REPORT_COUNT(64)
    OUTPUT(DATA | VARIABLE | ABSOLUTE)

    REPORT_ID(4)
    USAGE(4)
    LOGICAL_MIN(0)
    LOGICAL_MAX(255)
    REPORT_SIZE(8)
    REPORT_COUNT(64)
    OUTPUT(DATA | VARIABLE | ABSOLUTE)
  /* USER CODE END 0 */
  0xC0 /* END_COLLECTION */
};
HID uses interrupt IN/OUT endpoints to convey reports. In USB, interrupt transfers are polled by the host every 1 ms, and every time the endpoint is polled it may yield one 64-byte report (for low/full speed). That's probably where your 64 kB/s figure comes from; strictly, the limit is 1000 reports per second. Also note these limits are different for high-speed and super-speed devices.
The report descriptor is one thing; what you actually send as the interrupt-IN payload is something else. The two should match, but nothing enforces this. You should probably look into the code that builds the interrupt-IN transfer payload.
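As a hedged sketch of that last point, using ST's usbd_customhid middleware (hUsbDeviceFS is the usual CubeMX handle name; the buffer layout is an assumption to check against your code), the whole report, ID byte included, should go to the stack in a single call:

#include <string.h>
#include "usbd_customhid.h"

extern USBD_HandleTypeDef hUsbDeviceFS;     /* usual CubeMX device handle */

static uint8_t report[1 + 64];              /* report ID + 64 data bytes */

void send_report_1(const uint8_t *data64)
{
    report[0] = 1;                          /* report ID 1 (the IN report above) */
    memcpy(&report[1], data64, 64);
    /* Hand the WHOLE 65-byte report to the stack in one call; calling
     * SendReport byte-by-byte is one way to end up with 1-byte transactions. */
    USBD_CUSTOM_HID_SendReport(&hUsbDeviceFS, report, sizeof(report));
}

Also worth checking: CUSTOM_HID_EPIN_SIZE in ST's template defaults to a tiny value (0x02 in the versions I have seen), and the endpoint size caps every transaction regardless of what the descriptor promises, so it likely needs to be raised to 0x40 here.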
Side note: if all you are interested in is sending arbitrary chunks of data, HID is probably not the relevant profile. Bulk endpoints look more appropriate (and you will not be limited by the interrupt endpoint polling rate).

How to send/receive variable-length protocol messages independently of the transmission layer

I'm writing a very specific application protocol to enable communication between 2 nodes. Node 1 is an embedded platform (a microcontroller), while node 2 is a common computer.
The protocol defines messages of variable length: sometimes node 1 sends a message of 100 bytes to node 2, while another time it sends a message of 452 bytes.
The protocol shall be independent of how the messages are transmitted. For instance, the same message can be sent over USB, Bluetooth, etc.
Let's assume that a protocol message is defined as:
| Length (4 bytes) | ...Payload (variable length)... |
I'm struggling with how the receiver can recognise how long the incoming message is. So far, I have thought of 2 approaches.
1st approach
The sender sends the length first (4 bytes, always fixed size), and the message afterwards.
For instance, the sender does something like this:
// assuming that the parameters of send() are: data, length of data
send(msg_length, 4)
send(msg, msg_length - 4)
While the receiver side does:
msg_length = receive(4)
msg = receive(msg_length - 4) // the length field counts itself, matching the sender
This may be fine with some "physical protocols" (e.g. UART), but with more complex ones (e.g. USB) transmitting the length in a separate packet may introduce some overhead, since an additional USB packet (with control data and ACK packets as well) has to be transmitted for only 4 bytes.
However, with this approach the receiver side is pretty simple.
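One detail worth making explicit for this approach: on most stream transports, receive() may return fewer bytes than requested, so both receive(4) and receive(msg_length - 4) need an "exact" wrapper along these lines (a sketch using POSIX read(); any stream-like transport has the same issue):

#include <stddef.h>
#include <unistd.h>

/* keep reading until exactly n bytes have arrived (or the stream fails) */
int receive_exact(int fd, unsigned char *buf, size_t n)
{
    size_t got = 0;
    while (got < n) {
        ssize_t r = read(fd, buf + got, n - got);
        if (r <= 0)
            return -1;           /* error or connection closed */
        got += (size_t)r;
    }
    return 0;
}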
2nd approach
The alternative would be that the receiver keeps receiving data into a buffer, and at some point tries to find a valid message. Valid means: finding the length of the message first, and then its payload.
Most likely this approach requires adding some "start message" byte(s) at the beginning of the message, so that the receiver can use them to identify where a message starts.
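A minimal C sketch of this second approach, with a hypothetical start marker 0xA5 and a length field that here counts the payload only; framer_feed() accepts whatever chunk the transport delivers and fires a callback per complete message:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Frame layout assumed: | 0xA5 | len (4 bytes, little-endian) | payload | */
#define MAX_PAYLOAD 512

typedef struct {
    enum { WAIT_START, WAIT_LEN, WAIT_PAYLOAD } state;
    uint8_t  len_bytes[4];
    uint32_t len, got;
    uint8_t  payload[MAX_PAYLOAD];
    void   (*on_message)(const uint8_t *p, uint32_t n);  /* must be set */
} framer_t;

void framer_feed(framer_t *f, const uint8_t *data, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t b = data[i];
        switch (f->state) {
        case WAIT_START:
            if (b == 0xA5) { f->got = 0; f->state = WAIT_LEN; }
            break;                            /* resync: skip noise bytes */
        case WAIT_LEN:
            f->len_bytes[f->got++] = b;
            if (f->got == 4) {
                memcpy(&f->len, f->len_bytes, 4);  /* little-endian host assumed */
                if (f->len == 0 || f->len > MAX_PAYLOAD)
                    f->state = WAIT_START;    /* implausible length: resync */
                else {
                    f->got = 0;
                    f->state = WAIT_PAYLOAD;
                }
            }
            break;
        case WAIT_PAYLOAD:
            f->payload[f->got++] = b;
            if (f->got == f->len) {
                f->on_message(f->payload, f->len);
                f->state = WAIT_START;
            }
            break;
        }
    }
}

A checksum per frame would make the resynchronisation far more robust, since a corrupted length field is otherwise only caught by the plausibility check.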

How does TensorFlow sync tensors which share a buffer between different step-ids

I have a problem in my contrib implementation for distributed TensorFlow. Without burdening you with irrelevant details: the solution applies a message protocol in order to perform RDMA writes directly from/to the source/destination tensors, saving memory copies on the CPU.
Let's say I have 2 sides, A and B, and A wants to receive a tensor from B.
The protocol is as follows:
1. A sends a REQUEST message to B.
2. B looks up the tensor locally (BaseRendezvousMgr::RecvLocalAsync) and sends a META-DATA response to A.
3. A uses the meta-data to allocate the destination tensor, and sends an ACK to B containing the destination address.
4. B receives the ACK, and performs a remote DMA write to the destination address.
Between the REQUEST and the ACK, B keeps the local tensor alive (Ref() > 0) by saving it in a local map (the REQUEST copies the tensor into the map; the ACK pops it).
To validate my solution, I added a checksum calculation at each step. Occasionally I see the checksum change between the REQUEST and the ACK. This happens when I run a parameter server (PS) with two workers:
Line 1 is REQUEST to worker 0.
Line 2 is ACK to worker 0.
Line 3 is REQUEST to worker 1.
Line 4 is ACK to worker 1.
The last value on each line is the checksum. The error happens about 50% of the time, and I always see it on line 4.
I also saw that the problematic tensor shares a buffer across all step-ids (this is a given; I can't control it). So it is very likely that some other thread changed the tensor's content between lines 3 and 4, which is exactly what I want to prevent.
So the question is: how? What prevented the content from changing between lines 1 and 2, and between lines 2 and 3? To emphasize: the time elapsed between lines 3 and 4 is less than 0.04 seconds, while the time elapsed between lines 2 and 3 is almost 2.5 seconds.
Thanks for your help. Code will be posted if required.
Are you using tf.Variable for the shared buffer? If so, constructing it with tfe.Variable (to get reasonable read-write semantics) or with tf.get_variable(..., use_resource=True) should make the synchronization issues go away.
Otherwise this is hard to understand without knowing more about the graph that generates the tensor.

Software memory testing for bus failures

I have a board with quite a few flash chips, and some of them are showing intermittent failures. Standard memory tests do not show any specific problem addresses; certain chips simply fail intermittently under mechanical and thermal stress.
Suspecting the actual connections rather than the flash cells themselves, I'm looking for a way to test the parallel bus for address or data pin errors.
There are memory tests around, but they apply better to RAM than to flash (http://www.ganssle.com/testingram.htm). In particular, programming each value into parallel flash takes a sequence of bus writes, so a write/verify failure could just as easily be in the write operation, which can involve any pin on the bus.
Ideas welcome...
The typical memory tests are there to do exactly that. I prefer a pseudo-random pattern (deterministic, using an LFSR) to the 0xAA, 0x55, 0xFF, 0x00 tests. This allows for an address bus test as well as a data bus test in two passes (repeat inverted). I say typical in the sense of wiggling the data bits and address bits through both states each, and varying the states of signals and their neighbors. As for pounding on a RAM to create thermal or other stresses: you can't write very fast to a flash, so you can't really do fast write/read cycles.
Flash creates another problem: writing and then immediately reading back isn't that interesting. You want to write, then read back later, hours, days, weeks, to determine whether the part is actually holding data.
When you say thermal stress, do you mean the part fails only while it is above X degrees, or that the thermal stress has broken it for good after the event? Likewise with mechanical: does the part fail while vibrating or under mechanical stress but work again when the stress is relieved, or has the stress done permanent damage that can be detected whether under stress or not?
Now, although you can't do fast write/read cycles, you can punish a flash by reading heavily. I have seen read-disturb problems from constant reading of one block or location. It is not necessarily something you have time to do for every location, but you might fill the part with a pseudo-random pattern and concentrate on one location for a while (minutes, tens of minutes); if you have a part that you know is bad, see whether this accelerates detection of the problem, and whether any location will work or only certain ones. Another thing is to read all the locations repetitively for hours/days, or leave the part sitting for hours/days/weeks and then do a read pass without an erase or write and see if it has lost anything.
Unfortunately, as you probably know, each new failure case takes its own research project and the development of a new test.
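A sketch of the LFSR-based pattern suggested above (the polynomial and helper names are my choice, not the answerer's): pass 1 writes the sequence, pass 2 repeats it inverted, and for flash the plain store would of course be replaced by the part's program sequence.

#include <stdint.h>

/* 32-bit maximal-length Galois LFSR, taps x^32+x^22+x^2+x+1 (mask 0x80200003) */
static uint32_t lfsr_next(uint32_t s)
{
    uint32_t lsb = s & 1u;
    s >>= 1;
    if (lsb)
        s ^= 0x80200003u;
    return s;
}

/* deterministic fill: the same seed always yields the same byte sequence */
void fill_pattern(volatile uint8_t *mem, uint32_t len, uint32_t seed, int invert)
{
    uint32_t s = seed ? seed : 1u;          /* LFSR state must never be 0 */
    for (uint32_t i = 0; i < len; i++) {
        uint8_t v = (uint8_t)s;
        mem[i] = invert ? (uint8_t)~v : v;  /* for flash: program sequence here */
        s = lfsr_next(s);
    }
}

/* returns the index of the first mismatch, or -1 if the pattern held */
long verify_pattern(const volatile uint8_t *mem, uint32_t len, uint32_t seed, int invert)
{
    uint32_t s = seed ? seed : 1u;
    for (uint32_t i = 0; i < len; i++) {
        uint8_t v = (uint8_t)s;
        if (mem[i] != (invert ? (uint8_t)~v : v))
            return (long)i;                 /* log address/expected/actual here */
        s = lfsr_next(s);
    }
    return -1;
}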
The first step in testing a memory is a data bus test. It checks that the value placed on the data bus by the processor is correctly received by the memory device at the other end. An obvious approach would be to write all possible data values, but each bit can be tested independently with a walking-1s test: write the first value from the table below, verify it by reading it back, write the second value, verify it, and so on. When you reach the end of the table, the test is complete.

00000001
00000010
00000100
00001000
00010000
00100000
01000000
10000000
In the linked article Jack Ganssle says: "Critical to this [test], and every other RAM test algorithm, is that you write the pattern to all of RAM before doing the read test."
Since reading should be isolated from writing, testing the flash is easier. Perform the writing portion of the tests while the system is not under stress. Then perform the reading portion with the system under stress. By recording the address, expected value, and actual value in enough error cases, you should be able to determine the source of the errors.
If the system never fails when doing the above, you can then perform the whole tests while under stress. Any errors that appear are most likely write errors.
I've decided to design a memory pattern from which I think I can deduce both data and address errors. The idea is to use values that differ significantly from each other as key indicators of possible read errors, and to detect a failure on one pin at a time.
The test reads alternately from only the bottom and top addresses (0x000000 and 0x3FFFFF; my chip has 22 address lines). In those locations I put 0xFF and 0x00 respectively (byte-wide). The idea is to flip all address and data lines and see what happens. (All other values in the flash differ from 0x00 and 0xFF in at least 3 bits.)
There are 44 addresses that a single-pin failure could send me to in error. In each of those addresses I put one of 22 values identifying which of the 22 address pins was flipped. The values differ from each other in 2 bits, and from 0x00 and 0xFF in 3 bits. (I tried for 3 bits of difference from each other, but 8 bits could only yield 14 such values.)
07,0B,0D,0E,16,1A,1C,1F,25,29,2C,
2F,34,38,3D,3E,43,49,4A,4F,52,58
The remaining addresses get a nice pattern of six values, 33, 55, 66, 99, AA, CC (each 3 bits different from all the other values):

value(address) = nicePattern[popcount(address) % 6]
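For concreteness, here is a sketch of that pattern as a function of address (function and table names are mine; the marker values are the 22 listed above):

#include <stdint.h>

#define TOP_ADDR 0x3FFFFFu                  /* 22 address lines */

/* one marker per address pin: 2 bits apart from each other,
 * 3 bits apart from 0x00 and 0xFF */
static const uint8_t pin_marker[22] = {
    0x07,0x0B,0x0D,0x0E,0x16,0x1A,0x1C,0x1F,0x25,0x29,0x2C,
    0x2F,0x34,0x38,0x3D,0x3E,0x43,0x49,0x4A,0x4F,0x52,0x58
};

static const uint8_t nicePattern[6] = { 0x33,0x55,0x66,0x99,0xAA,0xCC };

static unsigned popcount(uint32_t a)
{
    unsigned n = 0;
    while (a) { n += a & 1u; a >>= 1; }
    return n;
}

/* expected flash content at a given address */
uint8_t pattern(uint32_t addr)
{
    if (addr == 0x000000u) return 0xFF;     /* bottom address */
    if (addr == TOP_ADDR)  return 0x00;     /* top address */

    /* the 44 addresses one flipped pin away from bottom or top get the
     * marker identifying which address pin flipped */
    for (unsigned pin = 0; pin < 22; pin++) {
        uint32_t bit = 1u << pin;
        if (addr == bit || addr == (TOP_ADDR ^ bit))
            return pin_marker[pin];
    }
    return nicePattern[popcount(addr) % 6]; /* everything else: filler */
}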
I tested this and have statistically collected hundreds of intermittent failure incidents synchronized to the mechanical stress:

- single-bit errors: detectable
- double-bit errors: deducible (explainable by a combination of frequent single-bit errors)
- 3 or more bit errors: generally inconclusive

Even though some of the chips had 3 failing pins, 70% of the incidents were single-bit (the pins usually didn't fail at the same time).
The testing group is now using this to identify which specific connections are failing.