Win CE 6.0 client using WCF Services - Reduce Bandwidth

We have a Win CE 6.0 device that is required to consume services that will be provided using WCF. We are attempting to reduce bandwidth usage as much as possible and with a simple test we have found that using UDP instead of HTTP saved significant data usage.
I understand there are limitations regarding WCF on .NET Compact Framework 3.5 devices and was curious what people thought would be the appropriate way forward. Would it make sense to develop a custom UDP binding, and would that work for both sides?
Any feedback would be appreciated. Thanks.

While HTTP does have some overhead, if it is becoming a significant part of your data usage, I would suspect that your API is too "chatty", and that fewer messages (each carrying more payload) should be considered.
The next question is how to reduce the bandwidth for a given amount of payload. Compression is an option, but can be a problem on some platforms. Another is to use a serialization format that is inherently dense and cheap to process (in terms of CPU cycles, since you are using low-power devices). For that purpose, something like "protocol buffers" would be ideal.
protobuf-net is a CF-compatible implementation of protocol buffers for .NET; the CF build doesn't have all the nice WCF features (because CF doesn't support them), but it can work very effectively.
Additionally, if you do go with HTTP, then MTOM should be considered, as this reduces the encoding overhead of binary data (i.e. what protobuf-net would produce).
Moving to UDP can be an option, but I would try something like HTTP + protobuf-net + MTOM first (combined with a less "chatty" API), and see how it stacks up.
I should also note that the current (downloadable) version of protobuf-net has some "kinks" with CF; it works, but it isn't as fast etc as it could be (due to limitations in meta-programming on CF). The "v2" product (not yet released) addresses all these points, allowing fully static (and fast) execution on CF. And best of all, it is free.
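To make the "dense serialization" point concrete, here is a rough sketch comparing a self-describing text encoding with a fixed binary layout. It is not protocol buffers and not protobuf-net; the reading fields (`sensor_id`, `timestamp`, `value`) and the record layout are made up purely for illustration.

```python
import json
import struct

# Hypothetical sensor readings: (sensor_id, timestamp, value) tuples.
readings = [(7, 1_700_000_000 + i, 20.0 + i * 0.5) for i in range(10)]

# Verbose text encoding: every reading repeats its field names.
text_payload = json.dumps(
    [{"sensor_id": s, "timestamp": t, "value": v} for s, t, v in readings]
).encode("utf-8")

# Dense binary encoding (assumed layout, little-endian, no padding):
# unsigned 16-bit id, unsigned 32-bit timestamp, 32-bit float value.
RECORD = struct.Struct("<HIf")
binary_payload = b"".join(RECORD.pack(s, t, v) for s, t, v in readings)


def decode(payload):
    """Unpack a batch of fixed-size records back into tuples."""
    return [
        RECORD.unpack_from(payload, offset)
        for offset in range(0, len(payload), RECORD.size)
    ]


print(len(text_payload), "bytes as JSON text")
print(len(binary_payload), "bytes as packed records")
```

Sending all ten readings in one message also means the per-message transport overhead is paid once instead of ten times, which is the "less chatty" half of the advice.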

Related

How does Redis achieve the high throughput and performance?

I know this is a very generic question, but I wanted to understand which major architectural decisions allow Redis (or caches like Memcached, Cassandra) to reach such high performance.
How are connections maintained?
Are connections TCP or HTTP?
I know that it is completely written in C. How is the memory managed?
What are the synchronization techniques used to achieve high throughput in spite of competing reads/writes?
Basically, what is the difference between a plain vanilla implementation of a machine with in memory cache and server that can respond to commands and a Redis box? I also understand that the answer needs to be very huge and should include very complex details for completion. But, what I'm looking for are some general techniques used rather than all nuances.

There is a wealth of information in the Redis documentation to understand how it works. Now, to answer your questions specifically:
1) How are connections maintained?
Connections are maintained and managed using the ae event loop (designed by the Redis author). All network I/O operations are non blocking. You can see ae as a minimalistic implementation using the best network I/O demultiplexing mechanism of the platform (epoll for Linux, kqueue for BSD, etc ...) just like libevent, libev, libuv, etc ...
2) Are connections TCP or HTTP?
Connections are TCP using the Redis protocol, which is a simple telnet compatible, text oriented protocol supporting binary data. This protocol is typically more efficient than HTTP.
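As an illustration of how little framing this protocol needs, here is a minimal sketch of encoding a command the way the Redis protocol (RESP) frames a request, as an array of length-prefixed bulk strings. The helper name `encode_command` is mine, not part of any Redis client library.

```python
def encode_command(*args):
    """Encode a command as a RESP array of bulk strings."""
    parts = [b"*%d\r\n" % len(args)]          # array header: number of arguments
    for arg in args:
        data = arg if isinstance(arg, bytes) else str(arg).encode("utf-8")
        parts.append(b"$%d\r\n" % len(data))  # bulk string header: byte length
        parts.append(data + b"\r\n")          # payload, binary safe thanks to the length prefix
    return b"".join(parts)


wire = encode_command("SET", "greeting", b"hello\x00world")
print(wire)
```

Because each argument carries its own byte length, the payload can contain any bytes (including `\r\n` or NUL), while the framing itself stays readable text.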
3) How is the memory managed?
Memory is managed by relying on a general purpose memory allocator. On some platforms, this is actually the system memory allocator. On some other platforms (including Linux), jemalloc has been selected since it offers a good balance between CPU consumption, concurrency support, fragmentation and memory footprint. jemalloc source code is part of the Redis distribution.
Contrary to other products (such as memcached), there is no implementation of a slab allocator in Redis.
A number of optimized data structures have been implemented on top of the general purpose allocator to reduce the memory footprint.
4) What are the synchronization techniques used to achieve high throughput in spite of competing read/writes?
Redis is a single-threaded event loop, so there is no synchronization to be done since all commands are serialized. Now, some threads also run in the background for internal purposes. In the rare cases they access the data managed by the main thread, classical pthread synchronization primitives are used (mutexes for instance). But 100% of the data accesses made on behalf of multiple client connections do not require any synchronization.
You can find more information there:
Redis is single-threaded, then how does it do concurrent I/O?
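The single-threaded model can be sketched with Python's standard `selectors` module, which picks epoll, kqueue, etc. depending on the platform. This is only a toy under stated assumptions: the `SET`/`GET` line syntax, the `store` dict, and the socket pairs standing in for client connections are all invented for the demo, and it is not how ae itself is written.

```python
import selectors
import socket

store = {}  # in-memory data, only ever touched by the single loop thread


def handle(line):
    """Run one toy command; hypothetical syntax: 'SET k v' or 'GET k'."""
    parts = line.split()
    if len(parts) == 3 and parts[0] == "SET":
        store[parts[1]] = parts[2]
        return "OK"
    if len(parts) == 2 and parts[0] == "GET":
        return store.get(parts[1], "(nil)")
    return "ERR"


def run_loop(sel, expected_commands):
    """Serve ready sockets one at a time until enough commands were handled."""
    handled = 0
    while handled < expected_commands:
        events = sel.select(timeout=1.0)
        if not events:
            break  # nothing arrived; stop the demo rather than spin
        for key, _mask in events:
            conn = key.fileobj
            data = conn.recv(1024)  # socket is readable, so this does not block
            for line in data.decode().splitlines():
                conn.sendall((handle(line) + "\n").encode())
                handled += 1
    return handled


# Two simulated clients, each talking to the loop through a socket pair.
sel = selectors.DefaultSelector()
client_a, server_a = socket.socketpair()
client_b, server_b = socket.socketpair()
for server_side in (server_a, server_b):
    server_side.setblocking(False)
    sel.register(server_side, selectors.EVENT_READ)

client_a.sendall(b"SET color blue\n")
run_loop(sel, expected_commands=1)
client_b.sendall(b"GET color\n")
run_loop(sel, expected_commands=1)

reply_a = client_a.recv(1024).decode().strip()
reply_b = client_b.recv(1024).decode().strip()
print(reply_a, reply_b)

sel.close()
for sock in (client_a, server_a, client_b, server_b):
    sock.close()
```

Every command runs to completion before the next one starts, so `store` needs no lock even though several connections are open at once.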
What is the difference between a plain vanilla implementation of a machine with in memory cache and server that can respond to commands and a Redis box?
There is no difference. Redis is a plain vanilla implementation of a machine with in memory cache and server that can respond to commands. But it is an implementation which is done right:
using the single threaded event loop model
using simple and minimalistic data structures optimized for their corresponding use cases
offering a set of commands carefully chosen to balance minimalism and usefulness
constantly targeting the best raw performance
well adapted to modern OS mechanisms
providing multiple persistence mechanisms because the "one size fits all" approach is only a dream
providing the building blocks for HA mechanisms (replication system for instance)
avoiding stacking up useless abstraction layers like pancakes
resulting in a clean and understandable code base that any good C developer can be comfortable with

IServerXMLHTTPRequest vs WinHTTP Performance

I am trying to compare IServerXMLHTTPRequest and WinHTTP in regards to performance.
I would like to know:
What is the maximum limit of the data/file that can be sent?
What is the transfer rate if the file to be sent is the maximum limit?

For anyone who needs information about this:
IServerXMLHTTPRequest is a thin layer above WinHTTP. Being a layer over WinHTTP means SXH will carry additional overhead. SXH doesn't provide any additional functionality over WinHTTP, other than the ability to directly support XML Document objects. source
And thus by using the WinHTTP object directly you achieve higher performance, scalability, and reduced memory consumption. source
If you are dealing with very large payloads (either posting/receiving multi-megabyte requests/responses), then use the WinHTTP Win32 API. The SXH component does not handle large data payloads efficiently--it will store all the data in a single memory buffer. The WinHTTP Win32 API allows the application to send/receive data using separate, smaller memory buffers. source
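The buffering difference can be sketched independently of either API. The snippet below uses plain Python file-like objects, not WinHTTP or SXH calls, and `send_in_chunks` is a hypothetical helper name; it only shows the pattern of moving a large payload through one small reusable buffer instead of holding it all in memory at once.

```python
import io


def send_in_chunks(source, sink, chunk_size=4096):
    """Copy source to sink through one small reusable buffer; return bytes copied."""
    buffer = bytearray(chunk_size)
    view = memoryview(buffer)
    total = 0
    while True:
        n = source.readinto(view)  # fill the buffer with the next piece
        if not n:
            break                  # end of stream
        sink.write(view[:n])       # hand off only the bytes actually read
        total += n
    return total


payload = bytes(range(256)) * 100  # stand-in for a large request body
source = io.BytesIO(payload)
sink = io.BytesIO()
copied = send_in_chunks(source, sink, chunk_size=1000)
print(copied, "bytes copied using a 1000-byte buffer")
```

Peak memory for the transfer is bounded by `chunk_size` rather than by the size of the payload.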

Compact decompression library for embedded use

We're currently creating a device for a customer that will get a block of data (like, say, 5-10KB) from a PC application. This is a bit simplified, so assume that the data must be passed and uncompressed a lot, not just once a year. The communication channel is really, really slow, so we'd like to compress the data beforehand, pass to the device and let it uncompress the data to its internal flash. The device itself, however, runs on a micro controller that is not really fast and does not have a lot of memory. It has enough flash memory to store the result, and can uncompress the data block as it is received, but it may not have enough RAM to store the entire compressed or uncompressed (or even both!) data blocks. And of course, it doesn't have an operating system or other luxury.
This means we need a sufficiently fast uncompression algorithm that does not use a lot of memory. The compression can be slow and ugly, since we're doing it on the PC side. C or .NET code preferred though for compression, to make things easier. The uncompression code should be in C, since it's unlikely that someone has an ASM optimized version for our controller.
We found LZO, which would be almost perfect for us, but it has a so-called "free" license (GPL) by default, which makes it totally unusable for our customer. The author says that commercial licenses are available on request, but unfortunately he's currently unreachable (for non-technical reasons, as the news on his site says).
I found a few other libraries, including the puff.c from zlib, and we're still investigating, but I thought I'd ask for your experience:
Which compression algorithm and/or library do you recommend for embedded purposes, given that the decompression device has really limited resources and source code and a commercial license are required?

You might want to check out one of these which are not GPL and are fairly compact implementations:
fastlz - MIT license, fairly simple code
lzjb - sun CDDL, used in zfs for compression, simple and very short
liblzf - BSD-style license, small, fast
lzfx - BSD-style, based on liblzf, small, fast
These algorithms are all members of the Lempel–Ziv family (hence the "LZ" in their names); they are LZ77-style dictionary coders rather than LZW proper. For background on the family, see:
https://en.wikipedia.org/wiki/Lempel–Ziv–Welch

I have used LZSS, with code from Haruhiko Okumura as a base. It uses the last portion of the uncompressed data (2K) as the dictionary. This code can be modified to not require a temporary ring buffer if you have no memory to spare. The licensing is not clear from his site, but some versions were released with a "Use, distribute, and modify this program freely" line included, and the code is used by commercial vendors.
Here is an implementation based on the same code that forms part of the Allegro game library. Allegro licensing is giftware or zlib.
Another option could be the lzfx library, which implements LZF. I have not used it yet, but it seems nice. It also uses previous results as its dictionary, so it has low memory requirements, and it is released under a BSD licence.

One alternative could be the LZ77 coder/decoder in the Basic Compression Library.
Since it uses the unpacked data history for its dictionary, it uses no extra RAM except for the compressed and uncompressed data buffers. It should be ideal for your use case (zlib license, portable C). The entire decoder is just 70 lines of code (including comments), and really fast.
EDIT: Yet another alternative is the liblzg library, which is a refined version of the aforementioned LZ77 coder/decoder. It compresses better, is generally faster, and requires no memory for decompression. It is very, very free (zlib license).
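To show why this family of decoders needs no extra RAM, here is a toy LZ77 coder and decoder. The token format (`("lit", byte)` and `("copy", distance, length)`) is invented for the example and is not the format used by the Basic Compression Library, liblzg, or any other library mentioned here. The encoder is a slow brute-force search, which matches the "compression can be slow" constraint in the question.

```python
def lz77_decode(tokens):
    """Rebuild the data; the only dictionary is the output written so far."""
    out = bytearray()
    for token in tokens:
        if token[0] == "lit":
            out.append(token[1])
        else:
            _, distance, length = token
            start = len(out) - distance
            for i in range(length):  # byte by byte, so overlapping copies work
                out.append(out[start + i])
    return bytes(out)


def lz77_encode(data, window=2048, min_match=3, max_match=255):
    """Greedy brute-force encoder: slow, but it only runs on the PC side."""
    tokens = []
    pos = 0
    while pos < len(data):
        best_len, best_dist = 0, 0
        for cand in range(max(0, pos - window), pos):
            length = 0
            while (length < max_match and pos + length < len(data)
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, pos - cand
        if best_len >= min_match:
            tokens.append(("copy", best_dist, best_len))
            pos += best_len
        else:
            tokens.append(("lit", data[pos]))
            pos += 1
    return tokens


sample = b"abcabcabcabcabc hello hello hello"
tokens = lz77_encode(sample)
print(len(sample), "bytes ->", len(tokens), "tokens")
```

On a device, the "copy" step would read back from the already written output (for example flash) instead of a Python `bytearray`, under the assumption that the output can be read back while it is being written.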

I would recommend ZLIB.
From the wiki:
The library provides facilities for control of processor and memory use
There are also facilities for conserving memory. These are probably only useful in restricted memory environments such as some embedded systems.
zlib is also used in many embedded devices because the code is portable, liberally-licensed and has a relatively small memory footprint.

A lot depends on the nature of the data. If it is simple enough, you may not need anything very fancy. For example if the downloaded data was a simple image (for example something like a line graph), a simple run length encoding could cut the data down by a factor of ten and you would need trivial amounts of code and RAM to decode it.
Of course if the data is more complex, then this won't be of much use. But I'd start by exploring the data being sent and see if there are specific aspects which would allow you to compress it more effectively than using a general purpose algorithm.
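For reference, a minimal run-length coder along these lines fits in a few lines. The (count, value) byte-pair format below is just one assumed layout chosen for the sketch.

```python
def rle_encode(data):
    """Encode as (count, value) byte pairs, with runs capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and run < 255 and data[i + run] == data[i]:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(encoded):
    """Expand (count, value) pairs; needs only a two-byte look at the input."""
    out = bytearray()
    for k in range(0, len(encoded), 2):
        count, value = encoded[k], encoded[k + 1]
        out += bytes([value]) * count
    return bytes(out)


# Hypothetical image row: mostly background with a short line segment.
row = b"\x00" * 300 + b"\xff" * 3 + b"\x00" * 20
encoded = rle_encode(row)
print(len(row), "bytes ->", len(encoded), "bytes")
```

The decoder can process the input two bytes at a time as it arrives, so it never needs the whole compressed block in RAM. Data without long runs would grow under this scheme, which is why looking at the actual data first matters.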

You might want to check out Jørgen Ibsen's aPlib - a couple of excerpts from the product page:
The compression ratios achieved by aPLib combined with the speed and tiny footprint of the depackers (as low as 169 bytes!) makes it the ideal choice for many products.
aPLib is free to use even for commercial use, please check the included license for details.
The compression library is closed-source (yes, I know this could be a problem), but has precompiled libraries for a variety of compilers and operating systems, including both 32- and 64-bit editions. There's C and x86 assembly source code for the decompressor.
EDIT:
Jørgen also has a free (zlib license) BriefLZ library you could check out if not having compressor source is a big problem.

I've seen people use 7zip on an embedded system with memory in the tens of megabytes.

There is a custom version of zlib for microcontrollers based on ARM Cortex-M (M0, M0+, M1, M3, M4):
https://github.com/kuym/Zlib-ARM-Cortex-M

How to estimate size of Windows CE run-time image

I am developing an application, and need to estimate how much resources (RAM and ROM) it will need to run on a device. I have been looking online, but couldn't find any good tip on how to do this.
The system in question is an industrial system. The application itself will need to have a .NET Compact framework, and following components besides Windows CE Core: SYSGEN_HTTPD (Web Server), SYSGEN_ATL (Active Template Libraries), SYSGEN_SOAPTK_SERVER (SOAP Server), SYSGEN_MSXML_HTTP (XML/HTTP), SYSGEN_CPP_EH_AND_RTTI (Exception Handling and Runtime Type Information).
Tx

There really is no way to estimate this, because application behavior and code can have wildly different requirements. Something that does image manipulation is likely going to require more RAM than a simple HMI, but even two graphical apps that do the same thing could be vastly different based on how image algorithms and buffer sizes are set up.
The only way to really get an idea is to actually run the application and see what the footprint looks like. I would guess that you're going to want a BOM that includes at least 64MB of RAM and 32MB of flash at a bare minimum. Depending on the app, I'd probably ask for 128MB of RAM. Flash would greatly depend on what the app needs to do.

Since you are specifying core OS components and since I assume you can estimate your own application's resources, I assume you ask for an estimation of the OS as a whole.
The simplest way to get an approximation is to build an emulator image (CE6 has an ARM one), which should give you a sense of the size. The difference from the final image will be the size of the drivers for the actual platform you will use.

Spread vs MPI vs zeromq?

In one of the answers to Broadcast like UDP with the Reliability of TCP, a user mentions the Spread messaging API. I've also run across one called ØMQ. I also have some familiarity with MPI.
So, my main question is: why would I choose one over the other? More specifically, why would I choose to use Spread or ØMQ when there are mature implementations of MPI to be had?

MPI was designed for tightly coupled compute clusters with fast, reliable networks. Spread and ØMQ are designed for large distributed systems. If you're designing a parallel scientific application, go with MPI, but if you are designing a persistent distributed system that needs to be resilient to faults and network instability, use one of the others.
MPI has very limited facilities for fault tolerance; the default error handling behavior in most implementations is a system-wide fail. Also, the semantics of MPI require that all messages sent eventually be consumed. This makes a lot of sense for simulations on a cluster, but not for a distributed application.

I have not used any of these libraries, but I may be able to give some hints.
MPI is a communication protocol while Spread and ØMQ are actual implementations.
MPI comes from "parallel" programming while Spread comes from "distributed" programming.
So, it really depends on whether you are trying to build a parallel system or distributed system. They are related to each other, but the implied connotations/goals are different. Parallel programming deals with increasing computational power by using multiple computers simultaneously. Distributed programming deals with reliable (consistent, fault-tolerant and highly available) group of computers.
The concept of "reliability" is slightly different from that of TCP. TCP's reliability is "give this packet to the end program no matter what." Distributed programming's reliability is "even if some machines die, the system as a whole continues to work in a consistent manner." To really guarantee that all participants got the message, one would need something like two-phase commit or one of the faster alternatives.

You're addressing very different APIs here, with different notions about the kind of services provided and infrastructure for each of them. I don't know enough about MPI and Spread to answer for them, but I can help a little more with ZeroMQ.
ZeroMQ is a simple messaging communication library. It does nothing else than send a message to different peers (including local ones) based on a restricted set of common messaging patterns (PUSH/PULL, REQUEST/REPLY, PUB/SUB, etc.). It handles client connection, retrieval, and basic congestion strictly based on those patterns and you have to do the rest yourself.
Although it appears very restricted, this simple behavior is mostly what you would need for the communication layer of your application. It lets you scale very quickly from a simple prototype, all in memory, to more complex distributed applications in various environments, using simple proxies and gateways between nodes. However, don't expect it to do node deployment, network discovery, or server monitoring; you will have to do that yourself.
Briefly, use ZeroMQ if you have an application that you want to scale from a simple multithreaded process to a distributed and variable environment, or if you want to experiment and prototype quickly and no existing solution seems to fit your model. Expect, however, to put some effort into the deployment and monitoring of your network if you want to scale to a very large cluster.
