Why doesn't the Orange library support a binary serialization backend in addition to its XML one? Is it because D currently cannot access/reflect on its binary representation, or has it just not been prioritized yet? If possible, what D language features and/or Phobos modules should I use to implement a binary serialization backend for Orange?
For D2 I guess it should be straightforward, considering we have std.bitmanip, right?
You can check out the msgpack-d library, which provides binary serialization in the MessagePack format. From http://msgpack.org/:
MessagePack is an efficient binary serialization format. It lets you exchange data among multiple languages like JSON but it's faster and smaller. For example, small integers (like flags or error code) are encoded into a single byte, and typical short strings only require an extra byte in addition to the strings themselves.
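The size claims in that description are easy to check. Here is a quick sketch using the Python msgpack package rather than msgpack-d itself (a substitution made purely for illustration; the wire format is the same):

```python
# Compare the encoded size of a small integer and a short string in
# MessagePack versus plain JSON text.
import json
import msgpack

print(len(msgpack.packb(42)))        # 1 byte: small positive ints fit in a single byte
print(len(json.dumps(42)))           # 2 characters of JSON text
print(len(msgpack.packb("hello")))   # 6 bytes: 1 header byte + the 5 string bytes
print(len(json.dumps("hello")))      # 7 characters of JSON text (including the quotes)
```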
Say I have a memory buffer with a vector of std::decimal::decimal128 (IEEE 754R) elements: can I wrap and expose that as a NumPy array and do fast operations on those decimal vectors, for example compute the variance or autocorrelation over the vector? How would I best do that?
NumPy does not support such a data type yet (at least on mainstream architectures). Only float16, float32, float64 and the non-standard native extended double (generally 80 bits) are supported; in short, only the floating-point types natively supported by the target architecture. If the target machine supported 128-bit floating-point numbers, you could try the numpy.longdouble type, but I do not expect this to be the case: in practice, x86 processors do not support it yet, and neither does ARM. IBM processors like POWER9 support it natively, but I am not sure they (fully) support the IEEE 754R standard. For more information please read this. Note that you could theoretically wrap binary data in NumPy types, but you would not be able to do anything (really) useful with it. NumPy can theoretically be extended with new types, but note that NumPy is written in C and not C++, so adding std::decimal::decimal128 to the source code will not be easy.
Note that if you really want to wrap such a type in a NumPy array without having to change/rebuild the NumPy code, you could wrap your type in a pure-Python class. However, be aware that the performance will be very bad, since using pure-Python objects prevents all the optimizations done in NumPy (e.g. SIMD vectorization, use of fast native code, algorithms specialized for a given type, etc.).
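A minimal sketch of that pure-Python route, using decimal.Decimal objects in an object-dtype array (decimal.Decimal is only a stand-in here, not a real IEEE decimal128), just to show that it works but falls back to per-element Python calls:

```python
# Object-dtype arrays hold ordinary Python objects, so arithmetic dispatches to
# decimal.Decimal methods element by element instead of vectorized native code.
import timeit
from decimal import Decimal

import numpy as np

dec = np.array([Decimal(i) / 7 for i in range(10_000)], dtype=object)
flt = dec.astype(np.float64)   # float() is applied to each Decimal

print(dec.sum())   # works, via Decimal.__add__ on every element
print(flt.sum())   # roughly the same value, computed in native code

print(timeit.timeit(dec.sum, number=100))  # object path: much slower
print(timeit.timeit(flt.sum, number=100))  # native float64 path
```

On a typical machine the object-dtype sum is orders of magnitude slower, which is exactly the overhead the answer above warns about.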
The following warning is mentioned in the Feed Exports section of the Scrapy docs.
From the docs for JsonItemExporter:
JSON is very simple and flexible serialization format, but it doesn’t scale well for large amounts of data since incremental (aka. stream-mode) parsing is not well supported (if at all) among JSON parsers (on any language), and most of them just parse the entire object in memory. If you want the power and simplicity of JSON with a more stream-friendly format, consider using JsonLinesItemExporter instead, or splitting the output in multiple chunks.
Does this mean that JsonItemExporter is not suitable for incremental (aka streaming) output, or does it also imply a size limit for JSON?
If it also means that this exporter is not suitable for large files, does anyone have a clue about an upper limit on JSON items / file size (e.g. 10 MB or 50 MB)?
JsonItemExporter does not have a size limit. The only limitation remains the lack of support for streaming output.
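For what it's worth, the stream-friendliness of the JSON Lines format the docs recommend is easy to see: each line is a complete JSON document, so a consumer can process the feed one item at a time without parsing the whole file. A rough sketch (the file name and handle_item are made up):

```python
import json

def handle_item(item):
    # Hypothetical per-item handler.
    print(item)

# "items.jl" stands for whatever path your JSON Lines feed was exported to.
with open("items.jl", encoding="utf-8") as fh:
    for line in fh:
        handle_item(json.loads(line))   # only one item in memory at a time
```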
What limitations (downsides) does binary serialization have compared to XML, CSV, etc.?
And can you explain them?
Unix and the Web (historically) favor textual formats (XML, JSON, YAML, ...) and protocols (HTTP, SMTP, ...) because they are easier to debug and to understand: you can use textual tools and editors on them. Many library functions (e.g. fscanf and fprintf) also favor textual formats.
Several tools are also biased toward textual files (whatever that means exactly). For instance, a textual file can usually be managed more efficiently under a version control system like git or svn (and the diff and patch utilities expect textual data, with newlines separating lines).
A possible disadvantage of textual formats is that they may take more CPU time to encode/decode and more disk space. (However, disk space is cheap, textual data is compressible, and the real bottleneck is the I/O.)
If you want compatibility of your binary data across various processors, compilers, or systems, you should explicitly take care of it, using "neutral" data formats like XDR or ASN.1 and serialization libraries (e.g. s11n).
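To make the trade-off concrete, here is a small sketch comparing the same record as JSON text and as a fixed-layout binary struct (the record and its layout are made up): the binary form is smaller and cheaper to decode, but it is opaque to text tools and you have to pin down the byte order and field layout yourself.

```python
import json
import struct

record = {"id": 12345, "price": 19.99}

text = json.dumps(record).encode("utf-8")
# "<id" = little-endian, one 32-bit int followed by one 64-bit double.
binary = struct.pack("<id", record["id"], record["price"])

print(len(text), text)               # 29 bytes, readable with any text tool
print(len(binary))                   # 12 bytes, meaningless without the "<id" layout
print(struct.unpack("<id", binary))  # (12345, 19.99)
```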
What are good compression-oriented application programming interfaces (APIs)?
Do people still use the 1991 "data compression interface" draft standard and the 1991 "stream transformation algorithm interface" draft standard (both draft standards by Ross Williams)?
Are there any alternatives to those draft standards?
(I'm particularly looking for C APIs, but links to compression-oriented APIs in C++ and other languages would also be appreciated).
I'm experimenting with some data compression algorithms.
Typically the compressed file I'm producing is composed of a series of blocks,
with a block header indicating which compression algorithm needs to be used to decompress the remaining data in that block -- Huffman, LZW, LZP, "stored uncompressed", etc.
The block header also indicates which filter(s) need to be used to convert the intermediate stream or buffer of data from the decompressor into a lossless copy of the original plaintext -- Burrows–Wheeler transform, delta encoding, XML end-tag restoration, "copy unchanged", etc.
Rather than using a huge switch statement that selects on the "compression type" and then calls the selected decompression or filter algorithm, each procedure with its own special number and order of parameters,
my code is simpler if every algorithm has exactly the same API -- the same number and order of parameters, etc.
Rather than waiting for the decompressor to run through the entire input stream before handing its output to the first filter,
it would be nice if the API supported decompressed output data coming out of the final filter "relatively quickly" (low latency) after relatively little compressed data has been fed into the initial decompressor.
It would be nice if the API could be used in systems that have only one thread or process.
Currently I'm kludging together my own internal API,
re-using existing compression algorithm implementations by
writing short wrapper functions to convert between my internal API and the special number and order of parameters used by each implementation.
Is there an already-existing API that I could use rather than designing my own from scratch?
Where can I find such an API?
I fear such an "API" does not exist.
In particular, a requirement such as "starting stage 2 while stage 1 is ongoing and unfinished" is completely implementation-dependent and cannot be added later by an API layer.
By the way, Maciej Adamczyk just tried the same thing you are doing.
He made an open-source benchmark comparing multiple compression algorithms over a block-compression scenario. The code can be consulted here:
http://encode.ru/threads/1371-Filesystem-benchmark?p=26630&viewfull=1#post26630
He was obliged to "encapsulate" all of these different compressor interfaces in order to cope with the differences.
Now for the good news: most compressors tend to have relatively similar C interfaces when it comes to compressing a block of data.
As an example, they can be as simple as this one:
http://code.google.com/p/lz4/source/browse/trunk/lz4.h
So, in the end, the adaptation layer is not so heavy.
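The question asks specifically about C APIs, but the "encapsulation" idea above is language-neutral. Here is a rough sketch in Python (the codec ids are invented, and the standard-library codecs merely stand in for your own backends): every backend is hidden behind one common bytes-in/bytes-out signature, so the block dispatcher becomes a table lookup rather than a big switch over different parameter lists.

```python
import bz2
import lzma
import zlib

# One uniform signature for every backend: bytes in, bytes out.
CODECS = {
    0x00: (lambda b: b, lambda b: b),          # "stored uncompressed"
    0x01: (zlib.compress, zlib.decompress),
    0x02: (bz2.compress, bz2.decompress),
    0x03: (lzma.compress, lzma.decompress),
}

def decode_block(method_id: int, payload: bytes) -> bytes:
    """Look up the decompressor from the block header's method id."""
    _compress, decompress = CODECS[method_id]
    return decompress(payload)

# Example: a block whose (invented) header id says "zlib".
print(decode_block(0x01, zlib.compress(b"hello hello hello")))
```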
I'm sending object IDs back and forth from client to server through the GWT RPC mechanism. The ids are coming out of the datastore as Longs (8 bytes). I think all of my ids will only need 4 bytes, but something random could happen that gives me a 5-byte (or whatever) value.
Is GWT going to be smart about packing these values in some variable-length encoding that will save space on average? Can I specify that it do so somewhere? Or should I write my own code to copy the Longs to ints and watch out for those exceptional situations?
Thanks~
As stated in the GWT documentation:
long: JavaScript has no 64-bit integral type, so long needs special consideration. Prior to GWT 1.5, the long type was simply mapped to the integral range of a 64-bit JavaScript floating-point value, giving long variables an actual range less than the full 64 bits. As of GWT 1.5, long primitives are emulated as a pair of 32-bit integers, and work reliably over the entire 64-bit range. Overflow is emulated to match the expected behavior. There are a couple of caveats. Heavy use of long operations will have a performance impact due to the underlying emulation. Additionally, long primitives cannot be used in JSNI code because they are not a native JavaScript numeric type.
If your ids can fit in an Integer, you could be better off with that. Otherwise, if you're using a DTO, make the ids a double, which actually exists in JavaScript.
GWT uses gzip compression for responses with a payload of 256 bytes or greater. That should work well if you have a lot of zero bytes in your response.
From RemoteServiceServlet.shouldCompressResponse:
Determines whether the response to a given servlet request should or should not be GZIP compressed. This method is only called in cases where the requester accepts GZIP encoding. This implementation currently returns true if the response string's estimated byte length is longer than 256 bytes. Subclasses can override this logic.
So, the server first checks if the requester (the browser, usually) accepts GZIP encoding. Internally, java.util.zip.GZIPOutputStream is used - see RPCServerUtils. On the client side, it's the browser's job to decompress the gzipped payload - since this is done in native code, it should be fairly quick.
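As a generic illustration of the "lot of zero bytes" point above (Python's gzip module writes the same DEFLATE-based format as java.util.zip.GZIPOutputStream; the payload here is made up):

```python
import gzip

payload = b"some RPC response" + b"\x00" * 1000   # long run of zero bytes
print(len(payload))                 # 1017 bytes
print(len(gzip.compress(payload)))  # a few dozen bytes: the zero run deflates away
```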