Compression for TFR records - tensorflow

How does one enable compression for TensorFlow Records (TFRs) for use with the tf.data API? The documentation, oddly, ignores the topic entirely (here), whereas the tf.data API clearly describes how to open a TFRecord file with compression (here).
I have found some code from 2018, but it seems to rely on the older "compat.v1" libraries:
options = tf.compat.v1.io.TFRecordOptions(tf.compat.v1.io.TFRecordCompressionType.GZIP)
with tf.io.TFRecordWriter(filepath, TFRecordCompressionType.GZIP) as tf_file:
    tf_file.write(example.SerializeToString(), options=options)
When I use this, I get badly formed files. For example, without compression my files are 500 kB, and with compression they are 20 B! I would be very impressed if that were actually true, but it suggests a problem.
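From what I can piece together from the tf.data docs, the sketch below (hypothetical file path, toy Example) ought to be the non-compat.v1 way, though I have not confirmed it: the compression options (or simply the string "GZIP") go to tf.io.TFRecordWriter itself, and tf.data.TFRecordDataset must be told the compression type when reading.

import tensorflow as tf

filepath = "data.tfrecord.gz"  # hypothetical output path

# Write: the writer takes the options (or just the string "GZIP") as its
# second argument; write() only takes the serialized record.
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter(filepath, options) as writer:
    example = tf.train.Example(features=tf.train.Features(feature={
        "value": tf.train.Feature(float_list=tf.train.FloatList(value=[1.0, 2.0, 3.0])),
    }))
    writer.write(example.SerializeToString())

# Read: tf.data must be told the compression type explicitly.
dataset = tf.data.TFRecordDataset([filepath], compression_type="GZIP")
for raw_record in dataset:
    parsed = tf.train.Example.FromString(raw_record.numpy())
    print(parsed)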

Related

What is the concept of CNTKTextFormatDeserializer and why use it?

I am using the CNTKTextReader to read in my training and test sets. The train file is getting large (2.7 GB now, and soon to get bigger).
I don't understand what "CNTKTextFormatDeserializer" is -- the doc I found didn't explain the big picture: what is it and why use it? It just went into the syntax.
So, is it a way to use a binary version of these files to make them more compact?
Readers in general are just a way to make certain aspects of training easier. These include:
randomization: SGD generalizes better when the data presented to it arrive in random order. The reader can randomize the data for you, with the shuffling happening on the fly.
distributed training: For distributed training, the reader is aware of the multiple workers and can make sure they receive distinct chunks of data.
memory budget issues: The reader does not load the whole training file into memory.
language-agnostic I/O: The reader provides a cross-platform way to read data. (If you want to always stay in Python you might not care about this, but others do.)
The CTF format is a little verbose and indeed there is a binary format deserializer that was recently added.
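For illustration, here is a rough sketch of how a CTF-backed reader is typically set up with the CNTK 2.x Python API; the file path, the field names "x"/"y" and the dimensions are made up and must match whatever labels your CTF file actually uses.

from cntk.io import MinibatchSource, CTFDeserializer, StreamDef, StreamDefs

input_dim, num_classes = 784, 10          # hypothetical dimensions
deserializer = CTFDeserializer("train.ctf", StreamDefs(
    features=StreamDef(field="x", shape=input_dim, is_sparse=False),
    labels=StreamDef(field="y", shape=num_classes, is_sparse=True),
))

# randomize=True shuffles on the fly; the source also reads the file in
# chunks (no full in-memory load) and is aware of distributed workers.
reader = MinibatchSource(deserializer, randomize=True, max_sweeps=1)
minibatch = reader.next_minibatch(64)     # dict keyed by stream information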

supporting lz4 compression with json format

It's more of a feature request than a question.
LZ4 is one of the fastest known LZ77 compressors today and is especially efficient at decompressing data. Its compression ratio is not on par with gzip's, but it is very good on text data. It is so good that it can actually save the user time when writing a file to a persistent disk, compared to a raw write, in I/O-scarce environments like Google Cloud.
It could also save lots (lots!) of CPU cycles for Google and its clients if you supported lz4 for uploading data, in addition to gzip.
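To make the trade-off concrete, a rough local comparison can be done with the third-party lz4 package; the payload here is synthetic and the numbers depend entirely on your data, so this is only an illustration.

import gzip
import time
import lz4.frame

data = b"some,text,data\n" * 1_000_000  # synthetic text payload

for name, compress, decompress in [
    ("gzip", gzip.compress, gzip.decompress),
    ("lz4", lz4.frame.compress, lz4.frame.decompress),
]:
    t0 = time.perf_counter()
    blob = compress(data)
    t1 = time.perf_counter()
    decompress(blob)
    t2 = time.perf_counter()
    print("{}: ratio={:.1f}x compress={:.2f}s decompress={:.2f}s".format(
        name, len(data) / len(blob), t1 - t0, t2 - t1))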
This question is a feature request which would belong in the Cloud Platform Public Issue Tracker [1]. As a stackoverflow question, though, it's somewhat out of place. I encourage you to make a feature request which explains your use-case in the public issue tracker, while asking specific technical questions on stackoverflow. If you have any questions, you should refer to the Help Center [2] [3] here on stackoverflow or check out the kind of issues and feature requests which get reported in the App Engine public issue tracker [4], for example.

What is the recommended compression for HDF5 for fast read/write performance (in Python/pandas)?

I have read several times that turning on compression in HDF5 can lead to better read/write performance.
I wonder what the ideal settings are to achieve good read/write performance with:
data_df.to_hdf(..., format='fixed', complib=..., complevel=..., chunksize=...)
I'm already using fixed format (i.e. h5py) as it's faster than table. I have strong processors and do not care much about disk space.
I often store DataFrames of float64 and str types in files of approx. 2500 rows x 9000 columns.
There are a couple of possible compression filters that you could use.
Since HDF5 version 1.8.11 you can easily register third-party compression filters.
Regarding performance:
It probably depends on your access pattern. You want to define your chunk dimensions so that they align well with that access pattern, otherwise performance will suffer a lot. For example, if you know that you usually access one row at a time across all columns, you should define your chunk shape accordingly, e.g. (1, 9000). See here, here and here for more information.
However, AFAIK pandas will usually end up loading the entire HDF5 file into memory unless you use read_table with an iterator (see here) or do the partial I/O yourself (see here), and thus doesn't really benefit much from a well-chosen chunk size.
Nevertheless, you might still benefit from compression, because loading the compressed data into memory and decompressing it on the CPU is probably faster than loading the uncompressed data.
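To illustrate the chunk-alignment point above with plain h5py (the file path and dataset name are made up): with a chunk shape of (1, 9000), reading one full row touches a single chunk, whereas a column-wise read has to decompress every chunk.

import h5py
import numpy as np

data = np.random.randn(2500, 9000)
with h5py.File("chunked.h5", "w") as f:
    f.create_dataset("data", data=data, chunks=(1, 9000),
                     compression="gzip", compression_opts=4)

with h5py.File("chunked.h5", "r") as f:
    row = f["data"][0, :]   # reads and decompresses a single chunk
    col = f["data"][:, 0]   # has to touch all 2500 chunks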
Regarding your original question:
I would recommend taking a look at Blosc. It is a multi-threaded meta-compressor library that supports various compression filters:
BloscLZ: internal default compressor, heavily based on FastLZ.
LZ4: a compact, very popular and fast compressor.
LZ4HC: a tweaked version of LZ4, produces better compression ratios at the expense of speed.
Snappy: a popular compressor used in many places.
Zlib: a classic; somewhat slower than the previous ones, but achieving better compression ratios.
These have different strengths and the best thing is to try and benchmark them with your data and see which works best.
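A small benchmarking sketch along those lines, assuming a DataFrame shaped like the one in the question and PyTables installed; the complib strings are the ones pandas accepts, though which blosc sub-compressors are available depends on how your PyTables/blosc build was compiled.

import time
import numpy as np
import pandas as pd

data_df = pd.DataFrame(np.random.randn(2500, 9000))

for complib in [None, "zlib", "blosc:blosclz", "blosc:lz4", "blosc:lz4hc", "blosc:snappy"]:
    path = "test_{}.h5".format((complib or "raw").replace(":", "_"))
    t0 = time.perf_counter()
    data_df.to_hdf(path, key="df", format="fixed",
                   complib=complib, complevel=0 if complib is None else 9)
    t1 = time.perf_counter()
    pd.read_hdf(path, "df")
    t2 = time.perf_counter()
    print("{}: write={:.2f}s read={:.2f}s".format(complib, t1 - t0, t2 - t1))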

Are all PDF files compressed?

So there are some threads here on PDF compression saying that there is some, but not a lot of, gain in compressing PDFs as PDFs are already compressed.
My question is: Is this true for all PDFs including older version of the format?
Also, I'm sure it's possible for someone (an idiot maybe) to place bitmaps into the PDF rather than JPEGs etc. Our company has a lot of PDFs in its DBs (some in older formats, maybe). We are considering using gzip to compress them during transmission, but don't know if it's worth the hassle.
PDFs in general use internal compression for the objects they contain. But this compression is by no means compulsory according to the file format specifications. All (or some) objects may appear completely uncompressed, and they would still make a valid PDF.
There are commandline tools out there which are able to decompress most (if not all) of the internal object streams (even of the most modern versions of PDFs) -- and the new, uncompressed version of the file will render exactly the same on screen or on paper (if printed).
So to answer your question: No, you cannot assume that a gzip compression is adding only hassle and no benefit. You have to test it with a representative sample set of your files. Just gzip them and take note of the time used and of the space saved.
It also depends on the type of PDF producing software which was used...
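A quick way to run that test on a representative sample set (the directory path is hypothetical):

import glob
import gzip
import time

total_raw = total_gz = 0
start = time.perf_counter()
for path in glob.glob("samples/*.pdf"):
    with open(path, "rb") as f:
        raw = f.read()
    total_raw += len(raw)
    total_gz += len(gzip.compress(raw, compresslevel=6))
elapsed = time.perf_counter() - start

if total_raw:
    print("raw={:.1f} MB gzip={:.1f} MB saving={:.1%} in {:.1f}s".format(
        total_raw / 1e6, total_gz / 1e6, 1 - total_gz / total_raw, elapsed))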
Instead of applying gzip compression, you would get much better gain by using PDF utilities to apply compression to the contents within the format as well as remove things like unneeded embedded fonts. Such utilities can downsample images and apply the proper image compression, which would be far more effective than gzip. JBIG2 can be applied to bilevel images and is remarkably effective, and JPEG can be applied to natural images with the quality level selected to suit your needs. In Acrobat Pro, you can use Advanced -> PDF Optimizer to see where space is used and selectively attack those consumers. There is also a generic Document -> Reduce File Size to automatically apply these reductions.
Update:
Ika's answer has a link to a PDF optimization utility that can be used from Java. You can look at their sample Java code there. That code lists exactly the things I mentioned:
Remove duplicated fonts, images, ICC profiles, and any other data stream.
Optionally convert high-quality or print-ready PDF files to small, efficient and web-ready PDF.
Optionally down-sample large images to a given resolution.
Optionally compress or recompress PDF images using JBIG2 and JPEG2000 compression formats.
Compress uncompressed streams and remove unused PDF objects.
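If Acrobat Pro is not an option, the same kind of content-level optimization can be scripted; the sketch below shells out to Ghostscript (a tool not mentioned above, but commonly used for this), with a hypothetical input file.

import subprocess

subprocess.run([
    "gs", "-sDEVICE=pdfwrite",
    "-dCompatibilityLevel=1.5",
    "-dPDFSETTINGS=/ebook",     # preset that downsamples images (~150 dpi)
    "-dNOPAUSE", "-dBATCH", "-dQUIET",
    "-sOutputFile=optimized.pdf",
    "input.pdf",                # hypothetical input file
], check=True)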

Looking for a lossless compression api similar to smushit

Anyone know of a lossless image compression API/service similar to Smush.it from Yahoo?
From their own FAQ:
WHAT TOOLS DOES SMUSH.IT USE TO SMUSH IMAGES?
We have found many good tools for reducing image size. Often times these tools are specific to particular image formats and work much better in certain circumstances than others. To "smush" really means to try many different image reduction algorithms and figure out which one gives the best result.
These are the algorithms currently in use:
ImageMagick: to identify the image type and to convert GIF files to PNG files.
pngcrush: to strip unneeded chunks from PNGs. We are also experimenting with other PNG reduction tools such as pngout, optipng, pngrewrite. Hopefully these tools will provide improved optimization of PNG files.
jpegtran: to strip all metadata from JPEGs (currently disabled) and try progressive JPEGs.
gifsicle: to optimize GIF animations by stripping repeating pixels in different frames.
More information about the smushing process is available at the Optimize Images section of Best Practices for High Performance Web pages.
It mentions several good tools. By the way, the very same FAQ mentions that Yahoo will make Smush.it a public API sooner or later so that you can run it on your own. Until then you can just upload images separately to Smush.it here.
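If you want to try the same tools locally rather than through a service, a rough sketch (file names are hypothetical; the tools must be installed and on your PATH):

import subprocess

# Strip metadata and produce an optimized progressive JPEG.
subprocess.run(["jpegtran", "-copy", "none", "-optimize", "-progressive",
                "-outfile", "photo_opt.jpg", "photo.jpg"], check=True)

# Losslessly recompress a PNG and drop ancillary chunks.
subprocess.run(["pngcrush", "-rem", "alla", "logo.png", "logo_opt.png"], check=True)

# Optimize an animated GIF.
subprocess.run(["gifsicle", "-O3", "animation.gif", "-o", "animation_opt.gif"], check=True)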
Try Kraken Image Optimizer: https://kraken.io/signup
The developer's plan is free, but it only returns dummy results. You must subscribe to one of the paid plans to use the API; however, the web interface is free and unlimited for images of up to 1 MB.
Find out more in the Kraken documentation.
See this:
http://github.com/thebeansgroup/smush.py
It's a Python implementation of smushit that can be run off-line to optimise your images without uploading them to Yahoo's service.
As far as I know, the best image compression for me is TinyPNG.
They also have an API: https://tinypng.com/developers
Once you retrieve your key, you can immediately start shrinking images. Official client libraries are available for Ruby, PHP, Node.js, Python and Java. You can also use the WordPress plugin, the Magento 1 extension or the improved Magento 2 extension to compress your JPEG and PNG images.
The first 500 images per month are free.
Tip: when using their API there is no file-size limit (unlike the 5 MB maximum per image in their online tool).
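For example, with their Python client (the tinify package); the API key and file names here are placeholders:

import tinify

tinify.key = "YOUR_API_KEY"   # from the TinyPNG developer page

# Compress a local file and write the optimized version next to it.
tinify.from_file("unoptimized.png").to_file("optimized.png")

# Or compress raw bytes, e.g. data already held in memory.
with open("unoptimized.jpg", "rb") as f:
    optimized_bytes = tinify.from_buffer(f.read()).to_buffer()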