What are the limitations (drawbacks) of binary serialization compared to XML, CSV, etc.? - serialization

What are the limitations (drawbacks) of binary serialization compared to XML, CSV, etc.?
Can you explain them?

Unix and the Web have historically favored textual formats (XML, JSON, YAML, ...) and protocols (HTTP, SMTP, ...) because they are easier to debug and to understand: you can use ordinary text tools and editors on them. Many library functions (e.g. fscanf and fprintf) are also geared toward textual formats.
Several tools are biased toward textual files as well. For instance, a textual file can usually be managed more efficiently by a version control system such as git or svn, and the diff and patch utilities expect textual data, with newlines separating lines.
A possible disadvantage of a textual format is that it may take more CPU time to encode and decode, and more disk space. (However, disk space is cheap, textual data compresses well, and the bottleneck is usually the actual I/O.)
If you want your binary data to be compatible across processors, compilers, or systems, you have to take care of that explicitly, using "neutral" data formats like XDR or ASN.1 and serialization libraries (e.g. s11n).
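To make the trade-off concrete, here is a minimal sketch in Python (chosen only for brevity) that encodes the same record once as JSON text and once as binary with an explicitly fixed layout. The record and its field names are hypothetical; the big-endian byte order is the one XDR uses, but this is not an XDR implementation.

```python
import json
import struct

# Hypothetical record, used only for illustration.
record = {"id": 1234, "temperature": 21.5, "flags": 3}

# Textual encoding: readable in any editor and friendly to diff.
text_bytes = json.dumps(record).encode("utf-8")

# Binary encoding with an explicit layout. ">" forces big-endian byte order
# and disables padding, so the bytes do not depend on the CPU or compiler.
# I = unsigned 32-bit int, d = 64-bit float, B = unsigned 8-bit int.
LAYOUT = ">IdB"
binary_bytes = struct.pack(
    LAYOUT, record["id"], record["temperature"], record["flags"]
)


def decode_binary(data):
    """Decode bytes produced with LAYOUT back into a dict."""
    rec_id, temperature, flags = struct.unpack(LAYOUT, data)
    return {"id": rec_id, "temperature": temperature, "flags": flags}


print(len(text_bytes), "bytes as JSON:  ", text_bytes)
print(len(binary_bytes), "bytes as binary:", binary_bytes.hex())
```

The binary form is smaller, but it cannot be read without knowing LAYOUT, and any change to the layout has to be coordinated between writer and reader. That is the price paid for the compactness.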

Related

What makes RecordIO attractive

I have been reading about RecordIO in various places and looking at different implementations on GitHub.
I'm simply trying to wrap my head around the pros of such a file format.
The pros I see are the following:
Block compression. Reading only a few records will be faster because there is less data to decompress.
Because the structure is more or less indexed, you can look up a specific record in acceptable time (assuming keys are sorted). This can be useful for quickly locating a record in an ad hoc fashion.
I can also imagine that with such a file format you can have finer sharding strategies. Instead of sharding per file you can shard per block.
But I fail to see how such a file format is faster to read than plain protobuf with compression.
Essentially I fail to see a big pro in this format.
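The block-compression point can be sketched as follows. This is not the real RecordIO on-disk layout (which differs between implementations); it is a hypothetical container, with made-up function names, in which every block is compressed independently and an in-memory index records where each block starts.

```python
import struct
import zlib


def write_blocks(records, records_per_block=2):
    """Pack records into independently compressed blocks.

    Returns (data, index), where index[i] is (offset, length) of block i.
    """
    data = bytearray()
    index = []
    for start in range(0, len(records), records_per_block):
        chunk = records[start:start + records_per_block]
        # Each record is stored as a 4-byte big-endian length plus payload.
        raw = b"".join(struct.pack(">I", len(r)) + r for r in chunk)
        compressed = zlib.compress(raw)
        index.append((len(data), len(compressed)))
        data += compressed
    return bytes(data), index


def read_record(data, index, n, records_per_block=2):
    """Fetch record n, decompressing only the block that holds it."""
    offset, length = index[n // records_per_block]
    raw = zlib.decompress(data[offset:offset + length])
    pos = 0
    # Skip the records that precede the wanted one inside the block.
    for _ in range(n % records_per_block):
        (size,) = struct.unpack_from(">I", raw, pos)
        pos += 4 + size
    (size,) = struct.unpack_from(">I", raw, pos)
    return raw[pos + 4:pos + 4 + size]
```

With a single compressed stream over the whole file, reaching record n means decompressing everything before it. Here the cost is bounded by one block, and the blocks are also natural units for sharding.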

Binary Serialization Backend for Orange

Why doesn't the library Orange support a binary serialization backend in addition to its XML? Is it because D currently cannot access/reflect on its binary representation or is it just not prioritized yet? If possible what D language features and/or Phobos modules should I use to realize a binary serialization backend for Orange?
For D2 I guess it should be straightforward considering we have std.bitmanip, right?
You can check out the msgpack-d library, which provides binary serialization in MessagePack format. From http://msgpack.org/:
MessagePack is an efficient binary serialization format. It lets you exchange data among multiple languages like JSON but it's faster and smaller. For example, small integers (like flags or error code) are encoded into a single byte, and typical short strings only require an extra byte in addition to the strings themselves.
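The two size claims in that quote can be checked with a hand-rolled encoder. The sketch below is in Python rather than D, does not use msgpack-d or any other MessagePack library, and covers only two formats from the MessagePack specification (positive fixint and fixstr); the function name is hypothetical.

```python
def pack_small(value):
    """Encode a tiny subset of MessagePack: ints 0..127 and short strings."""
    if isinstance(value, int) and not isinstance(value, bool) and 0 <= value <= 0x7F:
        # Positive fixint: the value itself is the single encoded byte.
        return bytes([value])
    if isinstance(value, str):
        payload = value.encode("utf-8")
        if len(payload) <= 31:
            # fixstr: one header byte (0xA0 | length) followed by the bytes.
            return bytes([0xA0 | len(payload)]) + payload
    raise ValueError("outside the subset handled by this sketch")


print(pack_small(5).hex())        # one byte for a small integer
print(pack_small("error").hex())  # one header byte plus the 5 string bytes
```

A real backend would of course need the remaining types (larger integers, floats, arrays, maps) and a decoder.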

Are all PDF files compressed?

So there are some threads here on PDF compression saying that there is some, but not a lot of, gain in compressing PDFs as PDFs are already compressed.
My question is: is this true for all PDFs, including older versions of the format?
Also, I'm sure it's possible for someone (an idiot maybe) to place bitmaps into the PDF rather than JPEG etc. Our company has a lot of PDFs in its DBs (some in older formats, maybe). We are considering using gzip to compress them during transmission but don't know if it's worth the hassle.
PDFs in general use internal compression for the objects they contain. But this compression is by no means compulsory according to the file format specifications. All (or some) objects may appear completely uncompressed, and they would still make a valid PDF.
There are commandline tools out there which are able to decompress most (if not all) of the internal object streams (even of the most modern versions of PDFs) -- and the new, uncompressed version of the file will render exactly the same on screen or on paper (if printed).
So to answer your question: No, you cannot assume that a gzip compression is adding only hassle and no benefit. You have to test it with a representative sample set of your files. Just gzip them and take note of the time used and of the space saved.
It also depends on the type of PDF-producing software that was used...
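The test suggested above (gzip a representative sample and note the time used and the space saved) could look roughly like this in Python. The function names are hypothetical, and the sketch works on bytes already loaded into memory; how the PDFs are fetched from the database is left out.

```python
import gzip
import time


def measure_gzip(data, level=6):
    """Return (compressed_size, ratio, seconds) for one in-memory file."""
    start = time.perf_counter()
    compressed = gzip.compress(data, compresslevel=level)
    elapsed = time.perf_counter() - start
    ratio = len(compressed) / len(data) if data else 1.0
    return len(compressed), ratio, elapsed


def summarize(samples):
    """Print one line per sample; samples maps a name to file contents."""
    for name, data in samples.items():
        size, ratio, seconds = measure_gzip(data)
        print(f"{name}: {len(data)} -> {size} bytes "
              f"({ratio:.0%} of original) in {seconds * 1000:.2f} ms")
```

For real files, build the dictionary with something like `{path: open(path, "rb").read()}` for each sampled PDF. Files whose internal streams are already compressed should come out close to 100% of the original size, while files with uncompressed streams should shrink noticeably.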
Instead of applying gzip compression, you would get much better gain by using PDF utilities to apply compression to the contents within the format as well as remove things like unneeded embedded fonts. Such utilities can downsample images and apply the proper image compression, which would be far more effective than gzip. JBIG2 can be applied to bilevel images and is remarkably effective, and JPEG can be applied to natural images with the quality level selected to suit your needs. In Acrobat Pro, you can use Advanced -> PDF Optimizer to see where space is used and selectively attack those consumers. There is also a generic Document -> Reduce File Size to automatically apply these reductions.
Update:
Ika's answer has a link to a PDF optimization utility that can be used from Java. You can look at their sample Java code there. That code lists exactly the things I mentioned:
Remove duplicated fonts, images, ICC profiles, and any other data stream.
Optionally convert high-quality or print-ready PDF files to small, efficient and web-ready PDF.
Optionally down-sample large images to a given resolution.
Optionally compress or recompress PDF images using JBIG2 and JPEG2000 compression formats.
Compress uncompressed streams and remove unused PDF objects.

What are good compression-oriented application programming interfaces (APIs)?

What are good compression-oriented application programming interfaces (APIs)?
Do people still use the 1991 "data compression interface" draft standard and the 1991 "Stream transformation algorithm interface" draft standard (both by Ross Williams)?
Are there any alternatives to those draft standards?
(I'm particularly looking for C APIs, but links to compression-oriented APIs in C++ and other languages would also be appreciated).
I'm experimenting with some data compression algorithms.
Typically the compressed file I'm producing is composed of a series of blocks, with a block header indicating which compression algorithm needs to be used to decompress the remaining data in that block -- Huffman, LZW, LZP, "stored uncompressed", etc.
The block header also indicates which filter(s) need to be used to convert the intermediate stream or buffer of data from the decompressor into a lossless copy of the original plaintext -- Burrows–Wheeler transform, delta encoding, XML end-tag restoration, "copy unchanged", etc.
Rather than use a huge switch statement that selects on the "compression type" and calls the selected decompression or filter algorithm, each procedure with its own special number and order of parameters, it simplifies my code if every algorithm has exactly the same API -- the same number and order of parameters, etc.
Rather than waiting for the decompressor to run through the entire input stream before handing its output to the first filter, it would be nice if the API supported decompressed output data coming out of the final filter "relatively quickly" (low latency) after relatively little compressed data has been fed into the initial decompressor.
It would be nice if the API could be used in systems that have only one thread or process.
Currently I'm kludging together my own internal API, re-using existing compression algorithm implementations by writing short wrapper functions to convert between my internal API and the special number and order of parameters used by each implementation.
Is there an already-existing API that I could use rather than designing my own from scratch?
Where can I find such an API?
I fear such an "API" does not exist.
In particular, a requirement such as "starting stage 2 while stage 1 is still running and unfinished" is completely implementation-dependent and cannot be added later by an API layer.
By the way, Maciej Adamczyk recently tried the same thing as you.
He made an open source benchmark comparing multiple compression algorithms in a block-compression scenario. The code can be found here:
http://encode.ru/threads/1371-Filesystem-benchmark?p=26630&viewfull=1#post26630
He had to "encapsulate" all these different compressor interfaces in order to cope with the differences.
Now for the good news: most compressors tend to have relatively similar C interfaces when it comes to compressing a block of data.
As an example, they can be as simple as this one:
http://code.google.com/p/lz4/source/browse/trunk/lz4.h
So, in the end, the adaptation layer is not so heavy.
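For what it is worth, here is a rough sketch of what such an adaptation layer could look like. It is written in Python for brevity even though the question asks about C, and it is not Ross Williams's draft interface: the class and method names (feed, finish, run_pipeline) are hypothetical. Every stage exposes the same two calls, and the pipeline runs in a single thread, handing each piece of output to the next stage as soon as it exists. The zlib stage can do this only because zlib itself offers an incremental decompressor, which is the implementation-dependent part mentioned above.

```python
import zlib


class ZlibStage:
    """Wraps zlib's incremental decompressor behind feed()/finish()."""

    def __init__(self):
        self._d = zlib.decompressobj()

    def feed(self, chunk):
        return self._d.decompress(chunk)

    def finish(self):
        return self._d.flush()


class DeltaDecodeStage:
    """Undoes byte-wise delta encoding; keeps one byte of state."""

    def __init__(self):
        self._prev = 0

    def feed(self, chunk):
        out = bytearray()
        for b in chunk:
            self._prev = (self._prev + b) & 0xFF
            out.append(self._prev)
        return bytes(out)

    def finish(self):
        return b""


class StoredStage:
    """'Stored uncompressed' / 'copy unchanged'."""

    def feed(self, chunk):
        return chunk

    def finish(self):
        return b""


def run_pipeline(stages, chunks):
    """Push input chunks through every stage, yielding output early."""
    for chunk in chunks:
        for stage in stages:
            chunk = stage.feed(chunk)
        if chunk:
            yield chunk
    # Flush: the tail of each stage still has to pass through later stages.
    tail = b""
    for stage in stages:
        tail = stage.feed(tail) + stage.finish()
    if tail:
        yield tail
```

The block header would then select a list of stage objects instead of driving a switch statement, for example `[ZlibStage(), DeltaDecodeStage()]` for a deflated, delta-encoded block.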

Are there different JPEG2000 file formats?

I've seen JPEG2000 files with both .J2K and .JP2 extensions, and codecs which read one won't always read the other. Can someone explain why there are multiple extensions for what I thought was a single format?
Because JPEG 2000 is both a codec and a file format. The standard is in many parts, with Part 1 giving (mostly) codec information (i.e. how to compress and decompress image data), with a container file format annex (JP2). Part 2 gives many extensions, and a more comprehensive container format (JPX).
JP2 is the "container" format for JPEG 2000 codestreams, and is modelled on the QuickTime format. I've not seen J2K (we used J2C during standardisation), but I presume it is raw compressed data, without a wrapper. The point of the containers is that a "good" image comes with additional metadata - colour space information, tagging, etc. The JP2 base format allows many "boxes" of information in one file (including many images, if you like). It also allows the same container format to be used for many other parts of the standard (such as JP3D and JPIP). Really, you shouldn't see many unwrapped, raw data streams - it is, in my opinion, bad practice.
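The two kinds of file can be told apart by their first bytes, regardless of extension. A minimal sketch in Python, assuming only the leading signature is inspected (the function name is hypothetical): boxed files start with the 12-byte JPEG 2000 signature box, while a raw codestream starts with the SOC marker (FF 4F) immediately followed by the SIZ marker (FF 51).

```python
# JPEG 2000 signature box: length 12, type "jP  ", then 0D 0A 87 0A.
JP2_SIGNATURE = bytes.fromhex("0000000c6a5020200d0a870a")
# Raw codestream: SOC marker followed directly by the SIZ marker.
J2K_SOC_SIZ = bytes.fromhex("ff4fff51")


def sniff_jpeg2000(header):
    """Classify a file from its first bytes (at least 12 are needed)."""
    if header.startswith(JP2_SIGNATURE):
        # JP2 and JPX share this box; the following "ftyp" box names the brand.
        return "container"
    if header.startswith(J2K_SOC_SIZ):
        return "codestream"
    return "unknown"
```

A codec that only accepts one of the two results is a plausible cause of the "reads one but not the other" behaviour described in the question.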