FEC and fixing dropped/inserted bits

FEC and fixing dropped/inserted bits - error-handling

I understand that most FEC algorithms, such as Reed-Solomon encoding, were designed to specifically fix bit flips in data streams. Also, if you know the position of where an erasure or insertion has occurred, RS can fix these streams, too. My question is - how do people practically fix very noisy data streams where bit/byte drops may occur? Are there specific algorithms that can be used, like a modified RS code?
We have a packetized data stream over a very long (thousands of feet) multi-drop rs-485 network. We send a request to a node, it responds with a multi-kilobyte response. Data can be randomly dropped or inserted due to a physical capacitive effects we're seeing on the line due to impedance mismatches, the long cable, and tristate effects of the node transceivers. We were supposed to be placing strong pull-up/pull-down resistors along the entire cable length, this was an oversight. Rs-485 networks can be extremely complicated at long lengths. We're wondering if we can somehow fix this effect in software using some error correction algorithm rather than having to respin hardware (which will be extremely expensive and effect scheduling).

Related

How do you design a configuration file system that copes with abrupt shutdown in embedded environment?

I'm designing a software that manages configuration file at application layer in embedded Linux.
Generally, it maintains two copies of the configuration file, one in RAM and one in flash memory. As soon as end-users update setting(s) by UI, the software saves it to the file in RAM, and then copy-paste it to the file in flash memory.
This scheme makes sure best stability in that the software reflects reality at the next second. However, the scheme compromises longevity to flash memory by accessing it every time.
As to longevity issue, I've thought about it by having a dedicated program doing this housekeeping, and adds this program to crontab then let it run like every 30 mins.
(Note: flash memory wears off only during erase cycles; the program only does housekeeping if the both files are not the same.)
But if the file in RAM is waiting for the program to do housekeeping and system shuts down unexpectedly, the file will lose.
So I'm thinking is there a way to have both longevity and not losing file at the same time? Or am I missing something?

There are many different reasons why flash can get corrupted: data retention over time, erase/write failures which are primarily caused by erase/write cycle wear, clock inaccuracies, read disturb in case of NAND flash, and even less likely errors sources such as cosmic rays or EMI. But also as in your case, algorithmic layer problems such as a flash erase/write getting interrupted by power loss or reset caused by EMI.
Similarly, there are many ways to deal with these various problems.
CRC16 or CRC32 depending on flash size is the classic way to deal with pretty much all possible flash errors, particularly with data retention since it most often manifests itself as single-bit errors, which CRC is great at discovering. It should ideally be designed so that the checksum is placed at the end of each erase-size segment. Or in case erase-size is very small (emulated eeprom/data flash etc), maybe a single CRC32 at the end of all data. Modern MCUs often have a CRC hardware peripheral which might be helpful.
Optionally you can let the CRC algorithm repair single bit errors, though this practice is often banned in high integrity systems.
ECC is used on NAND flash or in high integrity systems. Traditionally done through software (which is quite cumbersome), but lately also available through built-in hardware support, particularly on the "safety/chassis" kind of automotive microcontrollers. If you wish to use ECC then I highly recommend picking a part with such built-in support, then it can be used to replace manual CRC (which is somewhat painful to deal with real-time wise).
These parts with hardware ECC may also support a feature with an area where you can write down variables to have the hardware handle writing them to flash in the background, kind of similar to DMA.
Using the flash segment as FIFO. When storing reasonably small amounts of data in memory with large erase sizes, you can save flash erase/write cycles by only erasing the whole segment once, after which it will likely be set to "all ones" 0xFFFF... When writing, you look for the last available chunk of memory which is "all ones" and write there, even though the same data was previously written just before it. And when reading, you fetch the last written chunk before "all ones". Only when the whole erase size is used up do you perform an erase and start over from the beginning - data needs to be stored in RAM during this.
I strongly recommend picking a part with decent data flash though, meaning small erase sizes - so that you don't need to resort to hacks like this.
Mirror segments where all memory is stored as duplicates in two separate segments is mandatory practice for high integrity systems, though this can also be used to prevent corruption during power loss/unexpected resets and of course flash corruption in general. The idea is to always have at least one segment of intact data at all times, and optionally repair a corrupt one by overwriting it with the correct one at start-up. Also meaning that one segment must be verified to be correct and complete before writing to the next.
Keep the product cool. This is a hardware solution obviously, but data retention in particular is heavily affected by ambient temperature. The manufacturer normally guarantees some 15-20 years or so up to 85°C, but that might mean 100 years if you keep it at <25°C. As in, whenever possible, avoid mounting your MCU PCB near exhausts, oil coolers, hydraulics, heating elements etc etc.
Mirror segments in combination with CRC and/or ECC is likely the solution you are looking for here. Again, I strongly recommend to pick a MCU with dedicated data flash, meaning small erase segments and often far more erase/write cycles, ideally >100k.

Could a tight loop destroy cells of a microcontroller's flash?

It is well-known that Flash memory has limited write endurance, less so that reads could also have an upper limit such as mentioned in this Flash endurance test's Conclusion (3rd point).
On a microcontroller the code is typically stored in Flash, and is executed by fetching code words directly from the Flash cells.(at least this is most commonly so on 8 bit micros, some 32 bit micros might have some small buffer).
Depending on the particular code, it might happen that a location is accessed extremely frequently, such as if on the main execution path there is some busy loop, such as a wait for an interrupt (for example from a timer, synchronizing execution to a fixed interval).This could generate 100K or even more (read) accesses per second on average to a single Flash cell (depending on clock and the particular code).
Could such code actually destroy the cells of the Flash underneath it?(Is there any necessity to be concerned about this particular problem when designing code for microcontrollers? Such as part of a system which is meant to operate for years? Of course the Flash could be periodically verified by CRC, but that doesn't prevent the system failing if it happens, only that the failure will more likely happen in a controlled manner)

Only erasing/writing will affect the memory cells, not reading. You don't need to consider the number of reads when designing the program.
Programmed flash memory does age though, meaning that the value of the cells might not be reliable after a certain amount of years. This is known as data retention and depends mainly on temperature. MCU manufacturers typically specify a worse case in years, assuming that the part is kept in maximum specified ambient temperature.
This is something to consider for products that are expected to live long (> 10 years), particularly in environments where high temperatures can be expected. CRC and/or ECC is a good counter-measure against data retention, although if you do find that a cell has been corrupted, it typically just means that the application should shut down to a non-recoverable safe state.

I know of two techniques to approach this issue:
1) One technique is to set aside a const 32-bit integer variable in the system code. Then calculate a CRC32 checksum of the compiled binary image, and inserting the checksum into the reserved variable using an ELF-editor.
A module in the system software will then calculate a CRC32 over the flash area occupied by the application and compare to the "stored" value.
If you are using GCC, the linker can define a symbol to tell you where the segment stops. This method can detect errors but cannot correct them.
2) Another technique is to use a microcontroller that supports Flash ECC. TI sells Cortex-R4 MCUs which support Flash ECC (Hercules series).

I doubt that this is a practical concern. The article you cited vaguely asserts that this can happen but with no supporting evidence or quantification of the effect. There is a vague, unsupported and unquantified reference in the introduction:
[...] flash degrades over time from erasing/writing (or even just reading, although that decay is slower) [...]
Then again in the conclusion:
We did not check flash decay for reads, but reading also causes long term decay. It would be interesting to see if we can read a spot enough times to cause failure.
The author may be referring to read-disturbance in NAND flash, but microcontrollers do not use NAND flash for code storage/execution since it is not random-access. Read disturb is not a permanent effect, erasing and re-writing the affected block restores endurance. NAND controllers maintain read counts for blocks and automatically copy and erase blocks as necessary. They also employ ECC to detect and correct errors, and identify "write-worn" areas.
There is the potential for long-term "bit-rot" but I doubt that it is caused specifically by reading rather just ageing.
Most RTOS systems spend the majority of their processing time in a do-nothing idle loop, and run happily 24/7 365 days a year. Some processors support a wait-for-interrupt instruction that halts the CPU in the idle loop, but by no means all, and it is not uncommon not to use such an instruction. Processors with flash accelerators or caches may also prevent continuous rapid fetch from a single location, but again that is by no means all microcontrollers.

What happens at the receive part when we write with SPI?

When SPI master writes to the slave, something is shifting into the receive buffer right?
If yes, then it is normal "RXDATAAVAILABLE" flag to be set? It is nonsense! We send data, and when data is sent, we get notified that there is data received.
If all of my statements are correct, then how do we know what the correct data is into the RXFIFO?
Suppose we send two bytes frame. The first one is the address and the second one is dummy in order to read the value in that address (of the slave). Then suppose we have two levels Rx FIFO. In that FIFO instead the value read from the slave, we have two bytes, the first is who knows what, and the second the value read from the slave.
So the question is: how do we manage to receive only what is necessary, without getting junk data during the write part of the frame?

SPI works like simple 8 bit shift registers. You shift out bytes on MOSI at each flank of a clock and at the same time you shift in new data from MISO. Thus you send and receive at the same time. Hence the names MOSI = Master Out Slave In, and MISO = Master In Slave Out.
SPI peripherals on microcontrollers are more intricate than that though, and have separate data registers that are different from the actual hardware shift register, so that we can write data without worrying about the pending transmission. Some may even have multiple data buffers. But on the fundamental level, SPI always work with 8 bits.
When the microcontroller acting as SPI master writes something, there is usually two flags, one that says that the data buffer is made available, and another that tells that transmission is done.
When you are done sending, you are also done receiving. You'll get some sort of flag set. This is assuming that all devices are implementing SPI as intended, which is often not the case.
Note that some devices implement a system where you first send x bytes of data, and after that receive x bytes of data. This seems to be the scenario you describe. Sending and receiving is not done at the same time for that device, but instead in sequence. Meaning that during the first transmission, you'll clock in garbage, and then in order to receive data, you must clock out garbage. This is no fault of SPI, but how the manufacturer of the specific device has specified things.
Note that SPI is very poorly standardized and therefore all manner of weird crap exists on the market. The manner of sending/receiving data may vary, the clock polarity (flanks) may vary, where the device clocks the data may vary. Some devices might need delays between data bytes. Some devices might need some obscure handling of the Slave Select pin in order to work. It's all one big mess and the lack of international standardization is to blame.

An SPI master engine's received data available flag will be set as a simple result of the occurrence of a word's worth of clock cycles generated by the master itself. It tells you nothing about the operation or even existence of a peripheral on the bus.
When this flag is set, it is entirely up to you and your software to know if the contents of the received data register will have meaning or not.
If you have properly selected and interacted with an existent, operational peripheral in a read or transfer operation where it is documented to give a result, they will have meaning
If you have performed a purely write operation to a peripheral for which no reply data is documented at the word position in question, it will be meaningless, effectively no different than reading some random legal memory location. Note that in most cases, a write operation is simply a transfer where the received data is to be ignored - at implementation level there is usually no other difference.
If you have failed to address any existent peripheral it will be similarly be meaningless.
As with any other memory or read operation, it is up to you to know if the contents of the register in a particular situation are meaningful or not.
Since you know that the first byte contains "who knows what" while the second has meaning, write your software to ignore the first and use the second.
(As an aside, many, though by no means all, SPI peripherals are documented to shift out whatever constitutes their primary status register during the address phase, since that makes for a quick way to poll it)

Simple robust error correction for transmission of ascii over serial (RS485)

I have a very low speed data connection over serial (RS485):
9600 baud
actual data transmission rate is about 25% of that.
The serial line is going through an area of extremely high EMR. Peak fluctuations can reach 3000 KV.
I am not in the position (yet) to force a change in the physical medium, but could easily offer to put in a simple robust forward error correction scheme. The scheme needs to be easy to implement on a PIC18 series micro.
Ideas?

This site claims to implement Reed-Solomon on the PIC18. I've never used it myself, but perhaps it could be a helpful reference?

Search for CRC algorithm used in MODBUS ASCII protocol.

I develop with PIC18 devices and currently use the MCC18 and PICC18 compilers. I noticed a few weeks ago that the peripheral headers for PICC18 incorrectly map the Busy2USART() macro to the TRMT bit instead of the TRMT2 bit. This caused me major headaches for short time before I discovered the problem. Example, a simple transmission:
putc2USART(*p_value++);
while Busy2USART();
putc2USART(*p_value);
When the Busy2USART() macro was incorrectly mapped to the TRMT bit, I was never waiting for bytes to leave the shift register because I was monitoring the wrong bit. Before I realized the inaccurate header file, the only way I was able to successfully transmit a byte over 485 was to wait 1 ms between bytes. My baud rate was 91912 and the delays between bytes killed my throughput.
I also suggest implementing a means of collision detection and checksums. Checksums are cheap, even on a PIC18. If you are able to listen to your own transmissions, do so, it will allow you to be aware of collisions that may result from duplicate addresses on the same loop and incorrect timings.

Protocols used to talk between an embedded CPU and a PC

I am building a small device with its own CPU (AVR Mega8) that is supposed to connect to a PC. Assuming that the physical connection and passing of bytes has been accomplished, what would be the best protocol to use on top of those bytes? The computer needs to be able to set certain voltages on the device, and read back certain other voltages.
At the moment, I am thinking a completely host-driven synchronous protocol: computer send requests, the embedded CPU answers. Any other ideas?

Modbus might be what you are looking for. It was designed for exactly the type of problem you have. There is lots of code/tools out there and adherence to a standard could mean easy reuse later. It also support human readable ASCII so it is still easy to understand/test.
See FreeModBus for windows and embedded source.

There's a lot to be said for client-server architecture and synchronous protocols. Simplicity and robustness, to start. If speed isn't an issue, you might consider a compact, human-readable protocol to help with debugging. I'm thinking along the lines of modem AT commands: a "wakeup" sequence followed by a set/get command, followed by a terminator.
Host --> [V02?] // Request voltage #2
AVR --> [V02=2.34] // Reply with voltage #2
Host --> [V06=3.12] // Set voltage #6
AVR --> [V06=3.15] // Reply with voltage #6
Each side might time out if it doesn't see the closing bracket, and they'd re-synchronize on the next open bracket, which cannot appear within the message itself.
Depending on speed and reliability requirements, you might encode the commands into one or two bytes and add a checksum.
It's always a good idea to reply with the actual voltage, rather than simply echoing the command, as it saves a subsequent read operation.
Also helpful to define error messages, in case you need to debug.

My vote is for the human readable.
But if you go binary, try to put a header byte at the beginning to mark the beginning of a packet. I've always had bad luck with serial protocols getting out of sync. The header byte allows the embedded system to re-sync with the PC. Also, add a checksum at the end.

I've done stuff like this with a simple binary format
struct PacketHdr
{
char syncByte1;
char syncByte2;
char packetType;
char bytesToFollow; //-or- totalPacketSize
};
struct VoltageSet
{
struct PacketHdr;
int16 channelId;
int16 voltageLevel;
uint16 crc;
};
struct VoltageResponse
{
struct PacketHdr;
int16 data[N]; //Num channels are fixed
uint16 crc;
}
The sync bytes are less critical in a synchronous protocol than in an asynchronous one, but they still help, especially when the embedded system is first powering up, and you don't know if the first byte it gets is the middle of a message or not.
The type should be an enum that tells how to intepret the packet. Size could be inferred from type, but if you send it explicitly, then the reciever can handle unknown types without choking. You can use 'total packet size', or 'bytes to follow'; the latter can make the reciever code a little cleaner.
The CRC at the end adds more assurance that you have valid data. Sometimes I've seen the CRC in the header, which makes declaring structures easier, but putting it at the end lets you avoid an extra pass over the data when sending the message.
The sender and reciever should both have timeouts starting after the first byte of a packet is recieved, in case a byte is dropped. The PC side also needs a timeout to handle the case when the embedded system is not connected and there is no response at all.
If you are sure that both platforms use IEEE-754 floats (PC's do) and have the same endianness, then you can use floats as the data type. Otherwise it's safer to use integers, either raw A/D bits, or a preset scale (i.e. 1 bit = .001V gives a +/-32.267 V range)

Adam Liss makes a lot of great points. Simplicity and robustness should be the focus. Human readable ASCII transfers help a LOT while debugging. Great suggestions.
They may be overkill for your needs, but HDLC and/or PPP add in the concept of a data link layer, and all the benefits (and costs) that come with a data link layer. Link management, framing, checksums, sequence numbers, re-transmissions, etc... all help ensure robust communications, but add complexity, processing and code size, and may not be necessary for your particular application.

USB bus will answer all your requirements. It might be very simple usb device with only control pipe to send request to your device or you can add an interrupt pipe that will allow you to notify host about changes in your device.
There is a number of simple usb controllers that can be used, for example Cypress or Microchip.
Protocol on top of the transfer is really about your requirements. From your description it seems that simple synchronous protocol is definitely enough. What make you wander and look for additional approach? Share your doubts and we will try to help :).

If I wasn't expecting to need to do efficient binary transfers, I'd go for the terminal-style interface already suggested.
If I do want to do a binary packet format, I tend to use something loosely based on the PPP byte-asnc HDLC format, which is extremely simple and easy to send receive, basically:
Packets start and end with 0x7e
You escape a char by prefixing it with 0x7d and toggling bit 5 (i.e. xor with 0x20)
So 0x7e becomes 0x7d 0x5e
and 0x7d becomes 0x7d 0x5d
Every time you see an 0x7e then if you've got any data stored, you can process it.
I usually do host-driven synchronous stuff unless I have a very good reason to do otherwise. It's a technique which extends from simple point-point RS232 to multidrop RS422/485 without hassle - often a bonus.

As you may have already determined from all the responses not directly directing you to a protocol, that a roll your own approach to be your best choice.
So, this got me thinking and well, here are a few of my thoughts --
Given that this chip has 6 ADC channels, most likely you are using Rs-232 serial comm (a guess from your question), and of course the limited code space, defining a simple command structure will help, as Adam points out -- You may wish to keep the input processing to a minimum at the chip, so binary sounds attractive but the trade off is in ease of development AND servicing (you may have to trouble shoot a dead input 6 months from now) -- hyperterminal is a powerful debug tool -- so, that got me thinking of how to implement a simple command structure with good reliability.
A few general considerations --
keep commands the same size -- makes decoding easier.
Framing the commands and optional check sum, as Adam points out can be easily wrapped around your commands. (with small commands, a simple XOR/ADD checksum is quick and painless)
I would recommend a start up announcement to the host with the firmware version at reset - e.g., "HELLO; Firmware Version 1.00z" -- would tell the host that the target just started and what's running.
If you are primarily monitoring, you may wish to consider a "free run" mode where the target would simply cycle through the analog and digital readings -- of course, this doesn't have to be continuous, it can be spaced at 1, 5, 10 seconds, or just on command. Your micro is always listening so sending an updated value is an independent task.
Terminating each output line with a CR (or other character) makes synchronization at the host straight forward.
for example your micro could simply output the strings;
V0=3.20
V1=3.21
V2= ...
D1=0
D2=1
D3=...
and then start over --
Also, commands could be really simple --
? - Read all values -- there's not that many of them, so get them all.
X=12.34 - To set a value, the first byte is the port, then the voltage and I would recommend keeping the "=" and the "." as framing to ensure a valid packet if you forgo the checksum.
Another possibility, if your outputs are within a set range, you could prescale them. For example, if the output doesn't have to be exact, you could send something like
5=0
6=9
2=5
which would set port 5 off, port 6 to full on, and port 2 to half value -- With this approach, ascii and binary data are just about on the same footing in regards to computing/decoding resources at the micro. Or for more precision, make the output 2 bytes, e.g., 2=54 -- OR, add an xref table and the values don't even have to be linear where the data byte is an index into a look-up table ...
As I like to say; simple is usually better, unless it's not.
Hope this helps a bit.
Had another thought while re-reading; adding a "*" command could request the data wrapped with html tags and now your host app could simply redirect the output from your micro to a browser and wala, browser ready --
:)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas