When reading this it said
Remember that all hardware, all firmware, and all software have faults
and introduce errors. Don’t trust anyone or anything. Have test
systems that bit flips and corrupts and ensure the production system
can operate through these faults – at scale, rare events are amazingly
common.
I don't understand. Is it possible to run any type of software if you flip bits? Pointers will be incorrect, and if you read an address for a switch case (or even a function call) you can start executing in no man's land. Heck, the stack could be overwritten. If your stack is gone, how could you possibly recover? You can't; you have to terminate and restart. Terminate and restart is hardly recovery.
How do you 'test' a production system by forcing a part of it to terminate? Is this actually saying that if the system consists of multiple CPUs/systems, you unplug one and see whether the systems that stay plugged in corrupt data or crash? Otherwise I don't understand what this is supposed to mean.
I think the main point in this is of increasing system robustness - see Jeff Atwood's Coding Horror post about the Netflix Chaos Monkey that randomly takes servers down. If you know the server is going down at some point (and most will) then you start planning for that circumstance. You add redundancy where needed, you harden the code to deal with situations like this, and you do it now instead of pushing it off which is easy to do when you're not directly feeling the pain.
A tangent off of this is that not every bit change will take down the system. It may just do things like corrupt data or inter-process communications. In that case, each component in your system needs to do its own error checking and not trust that it's always going to get good data from the other components.
By having a test system that causes these kinds of faults, you have a chance to fix them as appropriate before they become a huge problem with your customers.
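As a rough illustration of such a fault-injecting test, the sketch below flips one random bit in a framed message and checks that the receiver's own error checking notices. The framing (payload followed by a CRC-32) and the function names are assumptions made for this example, not something from the quoted article.

```python
import random
import zlib


def frame(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload so the receiver can verify it."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(framed: bytes) -> bool:
    """Return True if the trailing CRC-32 matches the payload."""
    if len(framed) < 4:
        return False
    payload, crc = framed[:-4], framed[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == crc


def flip_random_bit(data: bytes, rng: random.Random) -> bytes:
    """Fault injector: return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    bit = rng.randrange(len(corrupted) * 8)
    corrupted[bit // 8] ^= 1 << (bit % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    rng = random.Random(1234)
    message = frame(b"debit account 42 by 10.00")
    print(check(message))                        # intact frame is accepted
    print(check(flip_random_bit(message, rng)))  # corrupted frame is rejected
```

A component that rejects the corrupted frame can ask for a retransmit instead of acting on bad data, which is the kind of behaviour a fault-injecting test system is meant to exercise.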
I'm having an issue with a web application I am responsible for maintaining.
The system experiences regular bugs, and our support vendors are always asking us to see if we can "replicate the error in UAT". This is obviously a reasonable request. A lot of the time, for various reasons (some of which are clear, some of which are not), these errors are not present in UAT. This lack of bug reproducibility in a testing environment is adding huge amounts of friction to the bug resolution process.
There are 3 key pieces of our system architecture where these bugs are flaring (the CMS, the API layer, and the database). I am proposing we set up a system job that perpetually clones these 3 parts of the system into a sandboxed test environment. This cloning would happen periodically (e.g. once every 24 hours) and automatically.
Is there a technical term for this sort of environment? Is this an established method of helping diagnose system issues? Is there somewhere I can read up on the industry best practices for establishing something like this? Thanks.
The technical term for this kind of process is replication. It is often done for some systems, like databases, but normally not for testing purposes; it is done to increase availability, so that the replica can be used as a failover spare.
An exact copy of a production system, with all the data, is not something you'll find often, due to the high demand on resources. Also, at some points the two systems have to differ. Most systems (that I know of) have tons of interfaces, so you just can't copy a complete system.
Also: you only need the copy of the production system when you are actually debugging an issue. And if you are in the middle of that, you probably don't want everything to go away and get replaced by a new copy.
So instead I would recommend setting up scripts that let you obtain a copy of the relevant parts on demand.
You might also want to consider how you could modify your system to make it easier to set up a copy.
For example, when you have all the setup automated (with chef/docker or similar), you should be able to set up the same system again anywhere you want, so now you just have to get the production data over.
Which is an interesting point. Production data often contains secret information (because it is vital to the business, or because it is personal data). You don't want this kind of stuff hanging around in a test system that everybody can access.
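One way to handle that when copying data over is to scrub the sensitive fields on the way. The sketch below is only an illustration: the field names, the salt handling and the choice of a truncated SHA-256 digest are all assumptions, and a real system would need to decide these together with whoever owns the data.

```python
import hashlib

# Hypothetical list of columns that must not reach the test system as-is.
SENSITIVE_FIELDS = {"email", "name"}


def pseudonymise(value: str, salt: str) -> str:
    """Replace a value with a stable token, so equal inputs stay equal."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def scrub(record: dict, salt: str) -> dict:
    """Return a copy of the record with the sensitive fields replaced."""
    return {
        key: pseudonymise(str(value), salt) if key in SENSITIVE_FIELDS else value
        for key, value in record.items()
    }


if __name__ == "__main__":
    row = {"id": 7, "email": "alice@example.com", "plan": "pro"}
    print(scrub(row, salt="keep-this-out-of-the-test-system"))
```

Because the same input always maps to the same token, joins between scrubbed tables still line up, which matters when the copy is meant for reproducing bugs.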
Not that I can find any by googling, but ... does anyone know of any open source code/development frameworks/test software/etc for the Multidrop Bus commonly used in vending machines?
In my opinion there isn't a free framework for the MDB, as this bus is only used by profit-oriented companies and nobody would make their own code open source (myself included).
But the MDB protocol itself isn't very complex. It's the error handling for the several devices that is a bit complicated, as it should be 100% safe.
And today it can be tricky to implement the 9-bit serial layer, as this isn't standard and many MCUs no longer support it.
Edit: How I would implement it today
Follow the whole specification, especially the timings and timeouts (e.g. the NAK timeout of 5 ms).
I would use state machines for collecting the configuration data, entering the normal mode of operation, applying settings and everything else.
From the first step (not later), plan error handling for every state: what should happen if the communication is lost, or if you get an unexpected answer?
I would also implement as much logging as possible, as sometimes money will get lost and you will have to explain why.
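To make the state machine idea a little more concrete, here is a minimal sketch. The states and replies are invented for illustration and do not follow the real MDB command set; the point is only that every state has a defined reaction to a lost or unexpected answer, and that every transition is logged.

```python
import logging
from enum import Enum, auto

log = logging.getLogger("mdb_sketch")


class State(Enum):          # hypothetical states, not the MDB specification
    RESET = auto()
    READ_CONFIG = auto()
    ENABLED = auto()
    ERROR = auto()


class Reply(Enum):          # hypothetical outcomes of one poll
    ACK = auto()
    NAK = auto()
    TIMEOUT = auto()


# Expected transitions only; anything not listed here is treated as an error.
TRANSITIONS = {
    (State.RESET, Reply.ACK): State.READ_CONFIG,
    (State.READ_CONFIG, Reply.ACK): State.ENABLED,
    (State.ENABLED, Reply.ACK): State.ENABLED,
}


def step(state: State, reply: Reply) -> State:
    """Advance the state machine by one reply and log the transition."""
    if state is State.ERROR:
        next_state = State.RESET  # recovery: start the handshake again
    else:
        next_state = TRANSITIONS.get((state, reply), State.ERROR)
    log.info("%s + %s -> %s", state.name, reply.name, next_state.name)
    return next_state


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    state = State.RESET
    for reply in (Reply.ACK, Reply.ACK, Reply.TIMEOUT, Reply.ACK, Reply.ACK):
        state = step(state, reply)
```

Keeping the error path in the transition lookup itself means a new state cannot be added without also getting a default reaction to NAKs and timeouts.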
This may be off topic, but I am preparing for an exam on real-time systems, and I have been browsing the book and the Internet for an answer to a problem.
Basically, I wonder whether adding additional test code may change the real-time behavior of an embedded system, and whether it may also introduce new errors.
Does anyone know the answer to this, or can anyone refer me to some reading material on it?
Your question is too general, so I guess the default answer would be: it depends. But considering the possibilities as an exercise of logic and thought, yes, it surely can!
There are many schemes available to guarantee the 'real-timeness' of an embedded system. For example, one can have a pre-emptive, timer-based ISR to service the real-time task. In such a case, your test code might not affect the 'real-timeness'. But if the testing takes too long and the context switches are not pre-emptive, you could get into trouble.
But again, it depends on what you're testing and how you're testing it. Your test code can possibly mess with the timers, the interrupts or the memory of the system. The possibilities to mess stuff up if you're not careful are endless.
Having an OS underneath will prevent some errors, but again, depending on how it works, you may or may not be saved from bad 'test code'.
Yes, when you add code (test, diagnostic, statistic) it may change the real-time behavior. Whether it actually changes the behavior depends on the design, the implementation and the CPU power. You also have more lines of code, so the probability of errors may increase. But I wouldn't say "it will introduce errors", only that it can introduce errors.
Yes it can. See How can adding data to a segment in flash memory screw up a program's timing? for an example of how even adding non-executable code can adjust timing enough to screw up a system.
Yes, changing your code base could totally change its timing. Consider dumping some debug output to a serial port: it takes time to call the function and format the data, and if the function is synchronous, it also has to wait for the data to go out. This kind of thing definitely changes system timing behavior.
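The cost of a synchronous debug print can be estimated with simple arithmetic. The sketch below assumes the common 8N1 framing (1 start bit, 8 data bits, 1 stop bit, so 10 bits on the wire per byte) and ignores the time spent formatting the string; the function name and the example numbers are mine.

```python
def uart_tx_seconds(n_bytes: int, baud: int, bits_per_byte: int = 10) -> float:
    """Time the wire is busy sending n_bytes, assuming 8N1 framing by default."""
    return n_bytes * bits_per_byte / baud


if __name__ == "__main__":
    # A 40-character debug line at two common baud rates.
    for baud in (115200, 9600):
        ms = uart_tx_seconds(40, baud) * 1000
        print(f"{baud} baud: {ms:.2f} ms")
```

At 9600 baud a 40-character line keeps a blocking call busy for roughly 42 ms, and about 3.5 ms at 115200 baud, which is easily enough to miss a deadline in a loop that is supposed to run every millisecond.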
as opposed to writing your own library.
We're working on a project here that will be a self-dividing server pool: if one section grows too heavy, the manager would divide it and put it on another machine as a separate process. It would also alert all connected clients this affects to connect to the new server.
I am curious about using ZeroMQ for inter-server and inter-process communication. My partner would prefer to roll his own. I'm looking to the community to answer this question.
I'm a fairly novice programmer myself and just learned about messaging queues. As I've googled and read, it seems everyone is using messaging queues for all sorts of things, but why? What makes them better than writing your own library? Why are they so common and why are there so many?
what makes them better than writing your own library?
When rolling out the first version of your app, probably nothing: your needs are well defined and you will develop a messaging system that will fit your needs: small feature list, small source code etc.
Those tools are very useful after the first release, when you actually have to extend your application and add more features to it.
Let me give you a few use cases:
your app will have to talk to a big endian machine (sparc/powerpc) from a little endian machine (x86, intel/amd). Your messaging system had some endian ordering assumption: go and fix it
you designed your app so it is not a binary protocol/messaging system and now it is very slow because you spend most of your time parsing it (the number of messages increased and parsing became a bottleneck): adapt it so it can transport binary/fixed encoding
at the beginning you had 3 machines inside a LAN, with no noticeable delays, and everything gets to every machine. your client/boss/pointy-haired-devil-boss shows up and tells you that you will install the app on a WAN you do not manage - and then you start having connection failures, bad latency etc. you need to store messages and retry sending them later on: go back to the code and plug this stuff in (and enjoy)
messages sent need to have replies, but not all of them: you send some parameters in and expect a spreadsheet as a result instead of just an acknowledgement: go back to the code and plug this stuff in (and enjoy)
some messages are critical and their reception/sending needs proper backup/persistence. Why, you ask? auditing purposes
And many other use cases that I forgot ...
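The first use case (byte order) is easy to show in a few lines. This is only a sketch of a home-grown message header, with a made-up layout of a 16-bit type followed by a 32-bit payload length; the point is that the byte order is stated explicitly instead of being inherited from whatever machine the code runs on.

```python
import struct

# "!" selects network (big-endian) byte order with no padding:
# a uint16 message type followed by a uint32 payload length.
HEADER = struct.Struct("!HI")


def encode(msg_type: int, payload: bytes) -> bytes:
    """Build a message: fixed-size header followed by the payload."""
    return HEADER.pack(msg_type, len(payload)) + payload


def decode(data: bytes) -> tuple[int, bytes]:
    """Parse a message produced by encode(); raise if it is truncated."""
    msg_type, length = HEADER.unpack_from(data)
    payload = data[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated message")
    return msg_type, payload


if __name__ == "__main__":
    wire = encode(1, b"hi")
    print(wire.hex())    # same bytes on x86, sparc or powerpc
    print(decode(wire))
```

A mature messaging library has already made this kind of decision, and the others in the list above, which is a large part of what you get by using one.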
You can implement it yourself, but do not spend much time doing so: you will probably replace it later on anyway.
That's very much like asking: why use a database when you can write your own?
The answer is that using a tool that has been around for a while and is well understood in lots of different use cases, pays off more and more over time and as your requirements evolve. This is especially true if more than one developer is involved in a project. Do you want to become support staff for a queueing system if you change to a new project? Using a tool prevents that from happening. It becomes someone else's problem.
Case in point: persistence. Writing a tool to store one message on disk is easy. Writing a persistor that scales and performs well and stably, in many different use cases, and is manageable, and cheap to support, is hard. If you want to see someone complaining about how hard it is then look at this: http://www.lshift.net/blog/2009/12/07/rabbitmq-at-the-skills-matter-functional-programming-exchange
Anyway, I hope this helps. By all means write your own tool. Many many people have done so. Whatever solves your problem, is good.
I'm considering using ZeroMQ myself - hence I stumbled across this question.
Let's assume for the moment that you have the ability to implement a message queuing system that meets all of your requirements. Why would you adopt ZeroMQ (or other third party library) over the roll-your-own approach? Simple - cost.
Let's assume for a moment that ZeroMQ already meets all of your requirements. All that needs to be done is to integrate it into your build, read some doco and then start using it. That's got to be far less effort than rolling your own. Plus, the maintenance burden has been shifted to another company. Since ZeroMQ is free, it's like you've just grown your development team to include (part of) the ZeroMQ team.
If you ran a Software Development business, then I think that you would balance the cost/risk of using third party libraries against rolling your own, and in this case, using ZeroMQ would win hands down.
Perhaps you (or rather, your partner) suffer, as so many developers do, from the "Not Invented Here" syndrome? If so, adjust your attitude and reassess the use of ZeroMQ. Personally, I much prefer the benefits of a Proudly Found Elsewhere attitude. I'm hoping I can be proud of finding ZeroMQ... time will tell.
EDIT: I came across this video from the ZeroMQ developers that talks about why you should use ZeroMQ.
what makes them better than writing your own library?
Message queuing systems are transactional, which is conceptually easy to use as a client, but hard to get right as an implementor, especially considering persistent queues. You might think you can get away with writing a quick messaging library, but without transactions and persistence, you'd not have the full benefits of a messaging system.
Persistence in this context means that the messaging middleware keeps unhandled messages in permanent storage (on disk) in case the server goes down; after a restart, the messages can be handled and no retransmit is necessary (the sender does not even know there was a problem). Transactional means that you can read messages from different queues and write messages to different queues in a transactional manner, meaning that either all reads and writes succeed or (if one or more fail) none succeeds. This is not really much different from the transactionality known from interfacing with databases and has the same benefits (it simplifies error handling; without transactions, you would have to assure that each individual read/write succeeds, and if one or more fail, you have to roll back those changes that did succeed).
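To give a feel for what "persistent and transactional" means in practice, here is a toy sketch that borrows both properties from SQLite instead of implementing them. It is not how any particular message broker works; the table layout and function names are invented for the example.

```python
import sqlite3


def open_queue_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store; pass a file path to survive restarts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "queue TEXT NOT NULL, body TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def send(conn: sqlite3.Connection, queue: str, body: str) -> None:
    with conn:  # commit on success, roll back on exception
        conn.execute(
            "INSERT INTO messages (queue, body) VALUES (?, ?)", (queue, body)
        )


def transfer(conn: sqlite3.Connection, src: str, dst: str, handler) -> bool:
    """Take the oldest message from src and put handler(body) on dst.

    The read and the write happen in one transaction: if handler raises,
    the message stays on src as if nothing had happened.
    """
    with conn:
        row = conn.execute(
            "SELECT id, body FROM messages WHERE queue = ? ORDER BY id LIMIT 1",
            (src,),
        ).fetchone()
        if row is None:
            return False
        msg_id, body = row
        conn.execute("DELETE FROM messages WHERE id = ?", (msg_id,))
        conn.execute(
            "INSERT INTO messages (queue, body) VALUES (?, ?)",
            (dst, handler(body)),
        )
    return True


def pending(conn: sqlite3.Connection, queue: str) -> list[str]:
    rows = conn.execute(
        "SELECT body FROM messages WHERE queue = ? ORDER BY id", (queue,)
    )
    return [body for (body,) in rows]
```

Even this toy leans entirely on a database for the hard parts, and it says nothing about performance, multiple consumers or network failures, which is where a real messaging system earns its keep.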
Before writing your own library, read the 0MQ Guide here: http://zguide.zeromq.org/page:all
Chances are that you will either decide to install RabbitMQ, or else you will make your library on top of ZeroMQ since they have already done all the hard parts.
If you have a little time, give it a try and roll out your own implementation! What you learn from this exercise will convince you of the wisdom of using an already tested library.
When things go badly awry in embedded systems I tend to write an error to a special log file in flash and then reboot (there's not much option if, say, you run out of memory).
I realize even that can go wrong, so I try to minimize the risk (by not allocating any memory during the final write, and by boosting the write process's priority).
But that relies on someone retrieving the log file. Now I was considering sending a message over the intertubes to report the error before rebooting.
On second thoughts, of course, it would be better to send that message after reboot, but it did get me to thinking...
What sort of things ought I be doing if I discover an irrecoverable error, and how can I do them as safely as possible in a system which is in an unstable state?
One strategy is to use a section of RAM that is not initialised during power-on/reboot. That can be used to store data that survives a reboot, and then when your app restarts, early on in the code it can check that memory and see if it contains any useful data. If it does, then write it to a log, or send it over a comms channel.
How to reserve a section of RAM that is non-initialised is platform-dependent, and depends if you're running a full-blown OS (Linux) that manages RAM initialisation or not. If you're on a small system where RAM initialisation is done by the C start-up code, then your compiler probably has a way to put data (a file-scope variable) in a different section (besides the usual e.g. .bss) which is not initialised by the C start-up code.
If the data is not initialised, then it will probably contain random data at power-up. To determine whether it contains random data or valid data, store a checksum, e.g. CRC-32, alongside it and verify that checksum at start-up. If your processor has a way to tell you whether you're in a reboot or a power-up reset, then you should also use that to treat the data as invalid after a power-up.
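The validity check can be sketched as follows. On a real target this would be C operating on a variable placed in a non-initialised section; here a `bytearray` stands in for that RAM region so the logic can be run anywhere. The record layout (a 32-bit reason code plus 32 bytes of text, followed by the CRC) is an assumption for the example.

```python
import struct
import zlib

# Hypothetical crash record: uint32 reason code + 32 bytes of text,
# followed by a CRC-32 over those 36 bytes.
RECORD = struct.Struct("<I32s")
REGION_SIZE = RECORD.size + 4


def store(region: bytearray, code: int, text: bytes) -> None:
    """Write a crash record and its CRC into the region before rebooting."""
    body = RECORD.pack(code, text)
    region[:REGION_SIZE] = body + struct.pack("<I", zlib.crc32(body))


def load(region: bytearray):
    """Return (code, text) if the region holds a valid record, else None."""
    body = bytes(region[:RECORD.size])
    (crc,) = struct.unpack_from("<I", region, RECORD.size)
    if zlib.crc32(body) != crc:
        return None  # power-up garbage or a corrupted record
    code, text = RECORD.unpack(body)
    return code, text.rstrip(b"\0")


if __name__ == "__main__":
    ram = bytearray(REGION_SIZE)          # stands in for the no-init section
    print(load(ram))                      # nothing valid stored yet
    store(ram, 3, b"out of memory")
    print(load(ram))                      # record survives the "reboot"
```

After the start-up code has reported a valid record, it should overwrite the region so that the same crash is not reported again on the next boot.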
There is no single answer to this. I would start with a Watchdog timer. This reboots the system if things go terribly awry.
Something else to consider - what is not in a log file is also important. If you have routine updates from various tasks/actions logged then you can learn from what is missing.
Finally, in the case that things go bad and you are still running: enter a critical section, turn off as much of the OS as possible, shut down peripherals, log as much state info as possible, then reboot!
The one thing you want to make sure of is that you do not corrupt data that might legitimately be in flash. If you try to write information in a crash situation, you need to do so carefully and with the knowledge that the system might be in a very bad state, so anything you do needs to be done in a way that doesn't make things worse.
Generally, when I detect a crash state I try to spit information out a serial port. A UART driver that's accessible from a crashed state is usually pretty simple - it just needs to be a simple polling driver that writes characters to the transmit data register when the busy bit is clear - a crash handler generally doesn't need to play nice with multitasking, so polling is fine. And it generally doesn't need to worry about incoming data; or at least not needing to worry about incoming data in a fashion that can't be handled by polling. In fact, a crash handler generally cannot expect that multitasking and interrupt handling will be working since the system is screwed up.
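The shape of such a polling driver is sketched below. A real one would be a few lines of C poking memory-mapped registers; here a small fake UART class (entirely hypothetical, including the number of polls the transmitter stays busy) stands in for the hardware so the loop can be run and checked.

```python
class FakeUart:
    """Stand-in for a UART's status and transmit registers (hypothetical)."""

    def __init__(self, busy_polls: int = 2):
        self._busy_polls = busy_polls  # polls the transmitter stays busy
        self._countdown = 0
        self.sent = bytearray()

    def tx_busy(self) -> bool:
        """Read the 'busy' bit; it clears after a few polls."""
        if self._countdown > 0:
            self._countdown -= 1
            return True
        return False

    def write_tx(self, byte: int) -> None:
        """Write one byte to the transmit data register."""
        self.sent.append(byte)
        self._countdown = self._busy_polls


def crash_puts(uart: FakeUart, data: bytes) -> None:
    """Send data by polling only: no interrupts, no scheduler, no buffers."""
    for byte in data:
        while uart.tx_busy():
            pass  # busy-wait until the transmitter is free
        uart.write_tx(byte)


if __name__ == "__main__":
    uart = FakeUart()
    crash_puts(uart, b"PC=0x1234\n")
    print(bytes(uart.sent))
```

The busy-wait has no timeout of its own in this sketch; on a real device it would be bounded by the watchdog reset rather than by anything the crash handler does.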
I try to have it write the register file, a portion of the stack and any important OS data structures (the current task control block or something) that might be available and interesting. A watchdog timer usually is responsible for resetting the system in this state, so the crash handler might not have the opportunity to write everything, so dump the most important stuff first (do not have the crash handler kick the watchdog - you don't want to have some bug mistakenly prevent the watchdog from resetting the system).
Of course this is most useful in a development setup, since when the device is released it might not have anything attached to the serial port. If you want to be able to capture these kinds of crash dumps after release, then they need to get written somewhere appropriate (like maybe a reserved section of flash - just make sure it's not part of the normal data/file system area unless you're sure it can't corrupt that data). Of course you'd need to have something examine that area at boot so it can be detected and sent somewhere useful or there's no point, unless you might get units back post-mortem and can hook them up to a debugging setup that can look at the data.
I think the most well-known example of proper exception handling is a missile's self-destruction. The exception was caused by an arithmetic overflow in software. There obviously was a lot of tracing/recording media involved, because the root cause is known: it was discovered and debugged.
So, every embedded design must include 2 features: recording media like your log file, and a graceful halt, like disabling all timers/interrupts, shutting all ports and sitting in an infinite loop, or, in the case of a missile, self-destruction.
Writing messages to flash before reboot in embedded systems is often a bad idea. As you point out, no one is going to read the message, and if the problem is not transient you wear out the flash.
When the system is in an inconsistent state, there is almost nothing you can do reliably, and the best thing to do is to restart the system as quickly as possible so that you can recover from transient failures (timing, special external events, etc.). In some systems I have written a trap handler that uses some reserved memory so that it can set up the serial port and then emit a stack dump and register contents without requiring extra stack space or clobbering registers.
A simple restart with a dump like that is reasonable because if the problem is transient the restart will resolve the problem and you want to keep it simple and let the device continue. If the problem is not transient you are not going to make forward progress anyway and someone can come along and connect a diagnostic device.
Very interesting paper on failures and recovery: WHY DO COMPUTERS STOP AND WHAT CAN BE DONE ABOUT IT?
For a very simple system, do you have a pin you can wiggle? For example, when you start up, configure it as a high output; if things go way south (i.e. a watchdog reset is pending), set it low.
Have you ever considered using a garbage collector ?
And I'm not joking.
If you do dynamic allocation at runtime in embedded systems,
why not reserve a mark buffer and mark and sweep when the excrement hits the rotating air blower.
You've probably got the malloc (or whatever) implementation's source, right ?
If you don't have library sources for your embedded system forget I ever suggested it, but tell the rest of us what equipment it is in so we can avoid ever using it. Yikes (how do you debug without library sources?).
If your system is already dead... who cares how long it takes. It obviously isn't critical that it be running this instant;
if it was, you couldn't risk "dying" like this anyway?