Hardware acceleration and context-switching overhead - cryptography

If the host software that wants to take advantage of the hardware runs in user space, how can the overhead of context switching, data copying, etc. be mitigated? For example, OpenSSL lives in user space; to use the hardware it would have to talk to a driver via I/O control calls or similar and send/receive data buffers, all of which results in extensive use of syscalls. What would the benefit of the hardware be then?


Is it recommended to use SPI flash to run code instead of internal flash, due to memory limitations of the internal flash?

We are using an LPC546xx family microcontroller in our project. Currently, at the initial stage, we are finalizing the software and hardware requirements. The basic firmware (which contains the RTOS, 3rd-party stacks, libraries, etc.) is currently 480 KB. Once the full application is developed, the size will exceed the internal flash size (512 KB), and in addition we need storage that can hold a firmware update image separately.
So we plan to use a 4 MB/8 MB SPI flash (S25LP064A-JBLE, http://www.issi.com/WW/pdf/IS25LP032-064-128.pdf, serial flash memory) to boot and run the firmware.
Is it recommended to run code from SPI flash? How can I map external flash memory directly into the CPU memory space? Can anyone give an example that contains this memory mapping (linker script etc.) or a demo application in which the LPC546xx uses SPI flash?
Generally speaking it's not recommended, or put differently: the closer to the CPU the better. However, both the S25LP064A and the LPC546xx support XIP (execute in place), so it is viable.
This is not a trivial issue, as many aspects come into play. In other words, the issue is best avoided and should really have been ironed out in the planning stage. Embedded systems are more about compromising than anything else, and making the right/better choices takes skill and experience.
Same question with replies on the NXP forum: link
512 KB of NVRAM is huge. There is almost certainly room for optimisations even if 3rd-party libraries are used.
On a related note, this discussion concerning XIP should give valuable insight: link.
I would strongly encourage the use of a file system if you aren't using one already, and external storage is much better suited for that; the further from the computational unit, the more relevant it becomes. That's not XIP, and the penalty is a copy-to-RAM either way you do it, i.e. performance will be slower. But in my experience the need for speed has often not been thoroughly considered and is at least partially greatly overestimated.
Regarding your mention of the RTOS and FW upgrade:
Unless it's a poor RTOS, there's file-system awareness built in. Especially for FW upgrading (note: you'll need room for 3 images, factory reset included), unless that is already supported by the SoC vendor by some other means (OTA), a file system will make life much easier and less risky. If there's no FS awareness, it can be added.
FW upgrade requires a lot of extra storage; the simpler the scheme, the more it needs. Simpler is, however, also safer, which matters hugely for FW upgrades in particular. In the simplest case (a flat binary image), you'll need at least twice the amount of memory you're already consuming.
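As a rough worked example using the figures from the question (assuming a simple layout with a running image, a downloaded update image and a factory-reset copy, which is only one possible scheme): budgeting 3 × 512 KB slots comes to 1.5 MB, which fits comfortably in the proposed 4 MB external flash with room left over for a file system.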
All-in-all: I think the direction you're going is viable and depending on the actual situation perhaps your only choice.
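As a very rough illustration of what the XIP direction can look like from the C side (a sketch only: the section name is hypothetical and would need a matching output section in your linker script, placed at the SPIFI/QSPI address window of the LPC546xx):

```c
/* Sketch: place a large, non-time-critical function into an external,
 * memory-mapped (XIP) SPI flash region. ".ext_flash_text" is a hypothetical
 * section name; your linker script must map it to the address range at
 * which the LPC546xx SPIFI controller exposes the external flash. */
__attribute__((section(".ext_flash_text")))
void render_help_screens(void)
{
    /* Code placed here is fetched and executed in place from the
     * external SPI flash instead of the 512 KB internal flash. */
}
```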

How does Redis achieve the high throughput and performance?

I know this is a very generic question. But I wanted to understand what the major architectural decisions are that allow Redis (or caches like Memcached, Cassandra) to work at amazing performance levels.
How are connections maintained?
Are connections TCP or HTTP?
I know that it is completely written in C. How is the memory managed?
What are the synchronization techniques used to achieve high throughput in spite of competing reads/writes?
Basically, what is the difference between a plain vanilla implementation of a machine with an in-memory cache and a server that can respond to commands, and a Redis box? I also understand that the answer would need to be huge and include very complex details to be complete. But what I'm looking for are some general techniques used, rather than all the nuances.
There is a wealth of information in the Redis documentation to understand how it works. Now, to answer your questions specifically:
1) How are connections maintained?
Connections are maintained and managed using the ae event loop (designed by the Redis author). All network I/O operations are non-blocking. You can see ae as a minimalistic implementation using the best network I/O demultiplexing mechanism of the platform (epoll for Linux, kqueue for BSD, etc.), just like libevent, libev, libuv, etc.
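As a rough sketch of the idea behind such a demultiplexing loop (plain epoll on Linux, written from scratch for illustration; this is not the actual ae code and the function names are made up):

```c
/* Minimal single-threaded, non-blocking event loop using epoll, in the
 * spirit of what ae does on Linux. Error handling omitted for brevity. */
#include <sys/epoll.h>

#define MAX_EVENTS 64

void run_loop(int listen_fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        /* Block until one or more registered fds are ready; no threads. */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {
                /* accept() the new client and register its fd with epoll */
            } else {
                /* read() the request, execute the command, queue the reply */
            }
        }
    }
}
```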
2) Are connections TCP or HTTP?
Connections are TCP, using the Redis protocol, which is a simple, telnet-compatible, text-oriented protocol supporting binary data. This protocol is typically more efficient than HTTP.
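For a feel of the wire format, this small sketch prints the framing a client sends for SET mykey hello (the framing follows the documented Redis protocol; the program itself is just for illustration):

```c
#include <stdio.h>
#include <string.h>

/* Print the Redis protocol framing for "SET mykey hello": an array of
 * three bulk strings, each prefixed with its byte length. */
int main(void)
{
    const char *parts[] = { "SET", "mykey", "hello" };
    printf("*3\r\n");
    for (int i = 0; i < 3; i++)
        printf("$%zu\r\n%s\r\n", strlen(parts[i]), parts[i]);
    return 0;
}
```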
3) How is the memory managed?
Memory is managed by relying on a general purpose memory allocator. On some platforms, this is actually the system memory allocator. On some other platforms (including Linux), jemalloc has been selected since it offers a good balance between CPU consumption, concurrency support, fragmentation and memory footprint. jemalloc source code is part of the Redis distribution.
Contrary to other products (such as memcached), there is no implementation of a slab allocator in Redis.
A number of optimized data structures have been implemented on top of the general purpose allocator to reduce the memory footprint.
4) What are the synchronization techniques used to achieve high throughput in spite of competing reads/writes?
Redis is a single-threaded event loop, so there is no synchronization to be done, since all commands are serialized. Now, some threads also run in the background for internal purposes. In the rare cases where they access the data managed by the main thread, classical pthread synchronization primitives are used (mutexes, for instance). But 100% of the data accesses made on behalf of client connections require no synchronization at all.
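Purely to illustrate the kind of primitive meant by "classical pthread synchronization primitives" (a generic example, not code taken from Redis):

```c
#include <pthread.h>

/* A mutex guarding data occasionally touched by a background thread. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter = 0;

void *background_job(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    shared_counter++;            /* the rare shared access is serialized */
    pthread_mutex_unlock(&lock);
    return NULL;
}
```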
You can find more information here:
Redis is single-threaded, then how does it do concurrent I/O?
What is the difference between a plain vanilla implementation of a machine with an in-memory cache and a server that can respond to commands, and a Redis box?
There is no difference. Redis is a plain vanilla implementation of a machine with an in-memory cache and a server that can respond to commands. But it is an implementation done right:
using the single threaded event loop model
using simple and minimalistic data structures optimized for their corresponding use cases
offering a set of commands carefully chosen to balance minimalism and usefulness
constantly targeting the best raw performance
well adapted to modern OS mechanisms
providing multiple persistence mechanisms, because the "one size fits all" approach is only a dream.
providing the building blocks for HA mechanisms (replication system for instance)
avoiding stacking up useless abstraction layers like pancakes
resulting in a clean and understandable code base that any good C developer can be comfortable with

Challenges in using flat memory model

The flat memory model (linear memory model) provides maximum execution speed, occupies minimum CPU real estate and has direct access to memory without any segmentation/paging. It seems that the flat memory model is ideal for small or single-threaded real-time applications.
However, is it possible to run a real-time application that is multi-threaded/multi-tasking, with requirements for extensive resource allocation/protection, in a flat memory model?
I don't think the memory model has much to do with it, except as far as the (RT)OS itself is concerned, which is what you use to get multi-threading/multi-tasking done.
Paging or segmentation, where provided, is useful to the OS primarily for implementing memory protection. Only this way can the OS protect itself and the running user-mode tasks against improperly written code in other tasks that would accidentally write to memory outside its intended domain. (You can't get memory protection without some kind of paging or segmentation, since you can't guard every single memory access.)
In 32-bit AVR processors there is even a distinction between a memory management unit (MMU) and a memory protection unit (MPU). The former is the more complex unit, supporting the kinds of paging features found in modern PC processors (even making it possible to implement virtual memory, for example), while the latter is a simpler subset that only gives you the tools for memory protection (for example, for the OS to protect itself and tasks against each other); it has no remapping capability (a given address always accesses the same memory cell), unlike the MMU. (Why the distinction? Because some cheaper AVR32s, where that's sufficient, only have an MPU.)
So with a simple flat memory model, the important thing you won't get is the protection features. If you can get by without those, it should work just fine.
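To make the protection point concrete, here is the kind of bug that nothing stops in a flat, unprotected address space (an illustrative sketch; the variable names are made up):

```c
#include <stdint.h>

/* With no MPU/MMU, any task can write anywhere in the flat address space. */
static uint32_t task_a_state[16];   /* logically belongs to task A */
static uint32_t task_b_buffer[16];  /* logically belongs to task B */

void task_b_bug(void)
{
    /* Overruns task_b_buffer; nothing traps the stray writes, so whatever
     * the linker placed next (possibly task A's state) may be corrupted. */
    for (int i = 0; i < 32; i++)
        task_b_buffer[i] = 0;
}
```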

Embedded System and Serial Flash wear issues

I am using a serial NOR flash (SPI-based) for my embedded application, and I also have to implement a file system on top of it. That makes my NOR flash prone to frequent erase and write cycles, which is where a wear-levelling algorithm comes into the picture. I want to ask a few questions about this:
First, is it possible to implement a wear-levelling algorithm for NOR flash? If yes, why do I mostly find solutions for NAND flash rather than NOR flash?
Second, are low-cost serial SPI-based NAND flashes available? If yes, kindly share a part number.
Third, how difficult is it to implement our own wear-levelling algorithm?
Fourth, I have also read/heard that industrial-grade NOR flashes have higher erase/write cycle ratings (in the millions!). Is this understanding correct? If yes, kindly let me know the details of such an SPI NOR flash; it might let me avoid implementing a wear-levelling algorithm entirely, or at least give me a little room and ease in implementing my own.
The constraint on all of these points is cost; I want a low-cost solution to these issues.
Implementing a wear-levelling algorithm is not trivial, but not impossible either:
Your wear-levelling driver needs to know when disk blocks are no longer used by the filing system (this is known as TRIM support on modern SSDs). In practice, this means you need to modify your block driver API and the filing systems above it, or have the wear-levelling driver be aware of the filing system's free-space map. This second option is easy for FAT, but probably patented.
You need to reserve at least an erase unit plus a few allocation units to allow erase-unit recycling. Reserving more blocks will increase performance.
You'll want a background thread to perform asynchronous erase-unit recycling.
You'll need to test, test and test again. When I last built one of these, we built a simulation of the flash, ran the real filing system on top of it, and tortured the system for weeks.
There are lots and lots of patents covering aspects of wear-levelling. By the same token, there are at least two wear-levelling layers in the Linux kernel.
Given all of this, licensing a third-party library is probably cost-effective.
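To give a feel for the core idea, here is a toy sketch of dynamic wear-levelling via a logical-to-physical block map with per-block erase counters (one common approach; the names and layout are made up, and this is nowhere near production quality):

```c
#include <stdint.h>

#define NUM_BLOCKS 256

/* Toy wear-levelling state: logical blocks are remapped onto the least-worn
 * free physical block whenever they are rewritten. */
typedef struct {
    uint16_t log_to_phys[NUM_BLOCKS]; /* logical -> physical block map    */
    uint32_t erase_count[NUM_BLOCKS]; /* per-physical-block erase counter */
    uint8_t  in_use[NUM_BLOCKS];      /* is this physical block mapped?   */
} wl_state_t;

/* Pick the least-worn free physical block for the next rewrite. */
static int wl_pick_block(const wl_state_t *wl)
{
    int best = -1;
    for (int p = 0; p < NUM_BLOCKS; p++) {
        if (wl->in_use[p])
            continue;
        if (best < 0 || wl->erase_count[p] < wl->erase_count[best])
            best = p;
    }
    return best; /* -1 means no free block: recycle (erase) one first */
}
```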
Atmel/Adesto etc. make those little serial flash chips by the billion. They also have loads of online docs. I suspect that these small serial flash parts don't implement wear-levelling because of cost - the devices they are typically used in are very cheap and tend to have a limited lifetime anyway. Bulk 4-line NAND flash that is expected to see heavier and lengthier use (e.g. SD cards) has (relatively) complex built-in controllers that can implement wear-levelling transparently.
I no longer use the small serial-interface flash chips, partly due to the wear issue. An SD card is cheap enough for me to use and, even if one does fail, an on-site technician (or even the customer) can easily swap it out.
Implementing a wear-levelling algorithm is too expensive for me to bother with, both in terms of development time (especially testing, if the device has to support a file system that must not corrupt on power failure, etc.) and CPU/RAM.
If your product is so cost-sensitive that you have to use serial NOR flash, I suggest that you ignore the issue.

Virtual Processors and Logical Partitions

I basically want to know what exactly a virtual processor is. IBM's site defines it as:
"A virtual processor is a representation of a physical processor core to the operating system of a logical partition that uses shared processors."
I understand that if there are x processors, each of which can simultaneously perform two operations, then the system can perform 2x operations simultaneously. But where does a virtual processor fit into this? I also tried looking up the difference between a logical partition and other partitions, such as primary, but wasn't really sure.
I'd like to draw an analogy between virtual memory and virtual processors.
Start with expectations:
A user program is written against a set of expectations about what the memory looks like (a nice flat, large, contiguous memory model is best...).
An OS is written against a set of expectations about how the hardware behaves (what CPU protection modes are available, how interrupts arrive and are blocked and handled, how to talk to IO devices, etc.).
Realize that these expectations can be met directly by the hardware, or by an abstraction layer.
Virtual memory is a set of (specialized, not found in simple chips) hardware tools and OS services that fool a user program into thinking it has that nice, flat, large, contiguous memory space, even while the OS is busily dividing the real memory into little pieces, storing some of them on disk, bringing others back, and otherwise making a real hash of it. But your code doesn't care. Everything just works.
A virtual processor system is a set of (specialized, not found in consumer CPUs) hardware tools and hypervisor services that allow your OS to believe it has direct access to one or more processors with the expected protection modes, interrupts, etc. even though the hypervisor is busily swapping whole OS contexts onto and off of one or more real processors, starting and stopping access to IO busses, and so on and so forth. But the OS doesn't care. Everything just works.
The hardware support to do this has only recently started to become available in "desktop" CPUs, but Big Iron has had it for ages. It is useful for a couple of reasons:
Protection. In a properly protected OS, it is tough for one process or user to spy on another; but since they can be resident in the same context, it may still be possible. Virtualized OSs are separated by another, even thinner channel, which makes it that much harder for data to leak and for malicious things to be done.
Robustness. If you can swap OS contexts in and out, you can migrate them from one machine to another, and checkpoint and restart them. This allows for computers that detect failures on their own processors and recover gracefully.
These are the things (aside from millions of LOC of heavily debugged, mission critical code) that have kept people paying for Big Iron.