What is a suitable compression algorithm for a large set of UTF-16 strings (multilingual)?

I need to compress a large number of short multilingual strings (< 1000 bytes each). I have tried implementing LZW with a separate dictionary for each language. Is there a better solution for this? The strings are stored in a set, so the ordering doesn't matter.

A Base64 library to create an encoded binary representation, plus zlib compression, might help.
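If zlib is an option, here is a minimal Python sketch (assuming the standard zlib module and a made-up per-language priming dictionary); the zdict parameter plays roughly the same role as the per-language LZW dictionaries mentioned in the question:

```python
import zlib

# Hypothetical per-language "priming" text; in practice you would build this
# from frequent substrings of your own corpus for each language.
GERMAN_DICT = "der die das und ein eine nicht ist ich Sie".encode("utf-8")

def compress(text: str, zdict: bytes) -> bytes:
    # Encode to UTF-8 first: for most text it is no larger than UTF-16,
    # and DEFLATE works on bytes anyway.
    raw = text.encode("utf-8")
    c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(raw) + c.flush()

def decompress(blob: bytes, zdict: bytes) -> str:
    d = zlib.decompressobj(zdict=zdict)
    return (d.decompress(blob) + d.flush()).decode("utf-8")

s = "Ein kurzer mehrsprachiger Beispielsatz."
blob = compress(s, GERMAN_DICT)
assert decompress(blob, GERMAN_DICT) == s
print(len(s.encode("utf-16-le")), "->", len(blob))
```

Note that the same dictionary must be supplied when decompressing, so you would need to record which language's dictionary was used for each string.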

Related

What is the size limit for JsonItemExporter in Scrapy?

The following warning is mentioned in the Feed Exports section of Scrapy docs.
From the docs for JsonItemExporter:
JSON is a very simple and flexible serialization format, but it doesn't scale well for large amounts of data, since incremental (aka. stream-mode) parsing is not well supported (if at all) among JSON parsers (in any language), and most of them just parse the entire object in memory. If you want the power and simplicity of JSON with a more stream-friendly format, consider using JsonLinesItemExporter instead, or splitting the output into multiple chunks.
Does this mean that JsonItemExporter is not suitable for incremental (aka streamed) data, or does it also imply a size limit for JSON?
If this means that the exporter is also unsuitable for large files, does anyone have a clue about the upper limit for JSON items / file size (e.g. 10 MB or 50 MB)?
JsonItemExporter does not have a size limit. The only limitation is that it does not support streaming.
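If streaming matters, switching to the line-oriented exporter is just a feed-settings change. A minimal sketch, assuming Scrapy 2.1+ (which uses the FEEDS setting) and a hypothetical items.jl output path:

```python
# settings.py -- "jsonlines" selects JsonLinesItemExporter, which writes one
# JSON object per line, so the output can be produced and parsed as a stream.
FEEDS = {
    "items.jl": {
        "format": "jsonlines",
        "encoding": "utf8",
    },
}
```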

How to convert scanned document images to a PDF document with high compression?

I need to convert scanned document images to a PDF document with high compression. The compression ratio is very important. Can someone recommend a solution in C# for this task?
Best regards, Alexander
There is a free program called PDFBeads that can do it. It requires Ruby, ImageMagick and optionally jbig2enc.
The PDF format itself will probably add next to no overhead in your case; your images will account for most of the output file size.
So you should compress your images as much as possible. For black-and-white images you might get the smallest output using the FAX4 or JBIG2 compression schemes (both supported in PDF files).
For other images (grayscale, color), either use the smallest possible size and the lowest resolution and quality, or convert the images to black-and-white and use the FAX4/JBIG2 compression scheme.
Please note that you will most likely lose some image detail when converting to black-and-white.
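As a rough illustration of that black-and-white + FAX4 step, here is a sketch assuming the Pillow package and a hypothetical scan_page1.png input; it writes a Group 4 TIFF rather than a PDF, purely to show the conversion and compression step:

```python
from PIL import Image  # assumes the Pillow package is installed

# Convert a scanned page to 1-bit black-and-white and store it with CCITT
# Group 4 (FAX4) compression.  A PDF tool can then embed the FAX4-compressed
# image data without re-encoding it.
page = Image.open("scan_page1.png")
bilevel = page.convert("1")            # threshold/dither to black-and-white
bilevel.save("scan_page1.tif", compression="group4")
```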
If you are looking for a library that can help you with recompression, then have a look at the Docotic.Pdf library (disclaimer: I am one of the developers of the library).
The "Optimize images" sample code shows how to recompress images before adding them to a PDF. The sample shows how to recompress with JPEG, but for FAX4 the code will be almost the same.

Binary Serialization Backend for Orange

Why doesn't the library Orange support a binary serialization backend in addition to its XML? Is it because D currently cannot access/reflect on its binary representation or is it just not prioritized yet? If possible what D language features and/or Phobos modules should I use to realize a binary serialization backend for Orange?
For D2 I guess it should be straightforward considering we have std.bitmanip, right?
You can check out the msgpack-d library, which provides binary serialization in the MessagePack format. From http://msgpack.org/:
MessagePack is an efficient binary serialization format. It lets you exchange data among multiple languages like JSON but it's faster and smaller. For example, small integers (like flags or error code) are encoded into a single byte, and typical short strings only require an extra byte in addition to the strings themselves.
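To get a feel for the format itself, here is a tiny sketch using the Python msgpack package (the same wire format as msgpack-d, though not the D library discussed here):

```python
import msgpack  # assumes the "msgpack" Python package

record = {"id": 7, "ok": True, "tags": ["a", "b"]}
packed = msgpack.packb(record)          # bytes, typically smaller than the JSON form
assert msgpack.unpackb(packed) == record
print(len(packed), "bytes")
```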

Image Compression libraries for Objective-C

OK, this is what I need:
Lossy and/or lossless compression (all options are going to be considered, although lossless compression will be favoured)
PNG and JPG files support
Cocoa-friendly code and easy integration
I've used OptiPNG in the past but I'm currently looking for an alternative.
Any suggestions?
For lossy PNG compression use libimagequant (it will convert RGB/RGBA data to palette+alpha, which is 3 times smaller) combined with lodePNG (which supports palette+alpha format, unlike Cocoa's built-in methods).
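If you only need the palette+alpha idea rather than those exact libraries, here is a rough sketch with Pillow 9+ (an assumption; quality will usually be lower than libimagequant's):

```python
from PIL import Image  # assumes Pillow 9+; an alternative to libimagequant + lodePNG

img = Image.open("in.png").convert("RGBA")
# Reduce to a 256-colour palette with alpha; FASTOCTREE is the built-in
# quantizer that accepts RGBA input.
small = img.quantize(colors=256, method=Image.Quantize.FASTOCTREE)
small.save("out.png", optimize=True)
```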

Are all PDF files compressed?

So there are some threads here on PDF compression saying that there is some, but not a lot of, gain in compressing PDFs as PDFs are already compressed.
My question is: Is this true for all PDFs including older version of the format?
Also I'm sure it's possible for someone (an idiot maybe) to place bitmaps into the PDF rather than JPEG etc. Our company has a lot of PDFs in its DBs (some older formats maybe). We are considering using gzip to compress during transmission but don't know if it's worth the hassle.
PDFs in general use internal compression for the objects they contain. But this compression is by no means compulsory according to the file format specifications. All (or some) objects may appear completely uncompressed, and they would still make a valid PDF.
There are commandline tools out there which are able to decompress most (if not all) of the internal object streams (even of the most modern versions of PDFs) -- and the new, uncompressed version of the file will render exactly the same on screen or on paper (if printed).
So to answer your question: No, you cannot assume that a gzip compression is adding only hassle and no benefit. You have to test it with a representative sample set of your files. Just gzip them and take note of the time used and of the space saved.
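A quick sketch of that test, assuming Python's gzip module and a couple of hypothetical sample file names:

```python
import gzip
import time

# Compress a file in memory and report the space saved and the time taken.
def gzip_gain(path: str) -> None:
    start = time.perf_counter()
    with open(path, "rb") as f:
        data = f.read()
    packed = gzip.compress(data, compresslevel=6)
    elapsed = time.perf_counter() - start
    print(f"{path}: {len(data)} -> {len(packed)} bytes "
          f"({100 * (1 - len(packed) / len(data)):.1f}% saved, {elapsed:.2f}s)")

for name in ["sample1.pdf", "sample2.pdf"]:   # your representative sample
    gzip_gain(name)
```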
It also depends on the type of PDF-producing software that was used...
Instead of applying gzip compression, you would get much better gain by using PDF utilities to apply compression to the contents within the format as well as remove things like unneeded embedded fonts. Such utilities can downsample images and apply the proper image compression, which would be far more effective than gzip. JBIG2 can be applied to bilevel images and is remarkably effective, and JPEG can be applied to natural images with the quality level selected to suit your needs. In Acrobat Pro, you can use Advanced -> PDF Optimizer to see where space is used and selectively attack those consumers. There is also a generic Document -> Reduce File Size to automatically apply these reductions.
Update:
Ika's answer has a link to a PDF optimization utility that can be used from Java. You can look at their sample Java code there. That code lists exactly the things I mentioned:
Remove duplicated fonts, images, ICC profiles, and any other data stream.
Optionally convert high-quality or print-ready PDF files to small, efficient and web-ready PDF.
Optionally down-sample large images to a given resolution.
Optionally compress or recompress PDF images using JBIG2 and JPEG2000 compression formats.
Compress uncompressed streams and remove unused PDF objects (see the sketch below).
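As a rough sketch of those last two items, using the pikepdf package (an assumption; this is not the utility from Ika's answer):

```python
import pikepdf  # assumes the pikepdf package is installed

with pikepdf.open("input.pdf") as pdf:
    # Drop resources (fonts, images, ...) that pages no longer reference.
    pdf.remove_unreferenced_resources()
    # Rewrite the file with streams compressed and objects packed into
    # object streams.
    pdf.save(
        "optimized.pdf",
        compress_streams=True,
        object_stream_mode=pikepdf.ObjectStreamMode.generate,
    )
```

Image downsampling and JBIG2/JPEG2000 recompression, as described above, still need a dedicated optimizer; this sketch only covers the structural clean-up.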