TFRecord larger than the original data - tensorflow

Actually, I am dealing with many pictures taken from different videos, so I use tf.SequenceExample() to save each video as a separate sequence, with its labels attached, into a TFRecord file.
But after running my code, the generated TFRecord is 29GB, while my original pictures only take 3GB.
Is it normal for a TFRecord to be larger than the original data?

You may be storing the decoded images instead of the JPEG-encoded ones. TFRecord has no concept of image formats, so you can use any encoding you want. To keep the size the same, convert the original image file contents to a BytesList and store that, without calling decode_image or using any image library or anything else that understands image formats.
Another possibility is that you are storing the image as an Int64List full of bytes, which would be 8x the size. Instead, store it as a BytesList containing a single bytes entry.
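A minimal sketch of the first point, using tf.train.Example rather than tf.SequenceExample to keep it short (the file name and feature keys are placeholders):

import tensorflow as tf

# Store the still-encoded JPEG file contents, not decoded pixels.
with open('frame.jpg', 'rb') as f:          # placeholder file name
    jpeg_bytes = f.read()                   # raw bytes, no decode_image()

feature = {
    'image/encoded': tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[jpeg_bytes])),   # one bytes entry
    'label': tf.train.Feature(
        int64_list=tf.train.Int64List(value=[1])),
}
example = tf.train.Example(features=tf.train.Features(feature=feature))

with tf.io.TFRecordWriter('frames.tfrecord') as writer:
    writer.write(example.SerializeToString())

The same idea carries over to tf.SequenceExample: put each encoded frame into the bytes_list of a FeatureList entry instead of decoding it first.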

Check the type of the data you load. I guess you load the images as pixel data. Every pixel is uint8 (8 bits) and is likely converted to float (32 bits), so you should expect about 4 times the original size (3 GB -> 12 GB).
Also, the original format probably has (much better) compression than raw pixel data in a TFRecord. (TFRecord files themselves can optionally be written with GZIP or ZLIB compression via tf.io.TFRecordOptions, but that generally won't match per-image JPEG compression.)
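For completeness, a sketch of writing and reading a GZIP-compressed TFRecord (the file name is a placeholder, and 'example' is assumed to be a tf.train.Example built as in the snippet above):

import tensorflow as tf

options = tf.io.TFRecordOptions(compression_type='GZIP')
with tf.io.TFRecordWriter('frames.tfrecord.gz', options) as writer:
    writer.write(example.SerializeToString())

# Reading it back requires the matching compression type.
dataset = tf.data.TFRecordDataset('frames.tfrecord.gz', compression_type='GZIP')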

Related

Converting numpy array (image) to pdf base64

I have an image, represented as a numpy array.
I want to avoid writing it out as a PDF and then reading the file back just to get the base64 representation; is there an easier way to do this without writing a file?
My goal is to have the base64 representation of the output PDF file (without actually writing one).
If I understand correctly, the base64 encoding is different for JPGs and PDFs; is this correct?
Using PIL's Image.fromarray function, one can convert all the arrays to PIL images.
Then, again using PIL, save() can be used to write the images together as a PDF into a buffer:
buff = io.BytesIO()
pil_images[0].save(buff, "PDF", resolution=100.0, save_all=True, append_images=pil_images[1:])
buff.getvalue() then returns the PDF bytes (which is good enough for me, but it is also still possible to get the base64 representation).
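Put together, a minimal end-to-end sketch might look like this (the random arrays are placeholders for your own image data):

import base64
import io

import numpy as np
from PIL import Image

# Placeholder 8-bit RGB arrays standing in for the real images.
arrays = [np.random.randint(0, 256, (100, 100, 3), dtype=np.uint8) for _ in range(3)]
pil_images = [Image.fromarray(a) for a in arrays]

buff = io.BytesIO()
pil_images[0].save(buff, "PDF", resolution=100.0, save_all=True,
                   append_images=pil_images[1:])

pdf_bytes = buff.getvalue()               # the PDF, never written to disk
pdf_base64 = base64.b64encode(pdf_bytes)  # base64 of the PDF bytes themselves

On the JPG vs PDF question: base64 itself is the same encoding in both cases, but it is applied to different bytes, so the base64 of a JPG and the base64 of a PDF containing that image will differ.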

Is there an optimal number of elements for a tfrecords file?

This is a follow-up to these SO questions:
What is the need to do sharding of TFRecords files?
optimal size of a tfrecord file
and this passage from this tutorial:
For this small dataset we will just create one TFRecords file for the
training-set and another for the test-set. But if your dataset is very
large then you can split it into several TFRecords files called
shards. This will also improve the random shuffling, because the
Dataset API only shuffles from a smaller buffer of e.g. 1024 elements
loaded into RAM. So if you have e.g. 100 TFRecords files, then the
randomization will be much better than for a single TFRecords file.
https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/18_TFRecords_Dataset_API.ipynb
So there is an optimal file size, but I am wondering whether there is also an optimal number of elements, since it is the elements themselves that get distributed to the GPU's cores?
Are you trying to optimize:
1. initial data randomization?
2. data randomization across training batches and/or epochs?
3. training/validation throughput (i.e., GPU utilization)?
Initial data randomization should be handled when data are initially saved into sharded files. This can be challenging, assuming you can't read the data into memory. One approach is to read all the unique data ids into memory, shuffle those, do your train/validate/test split, and then write your actual data to file shards in that randomized order. Now your data are initially shuffled/split/sharded.
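A rough sketch of that shuffle/split/shard step (all_ids, load_item and serialize_example are hypothetical helpers standing in for your own id listing, data loading and Example building):

import random
import tensorflow as tf

random.shuffle(all_ids)                    # shuffle ids, not the data itself

n = len(all_ids)
train_ids = all_ids[:int(0.8 * n)]
val_ids   = all_ids[int(0.8 * n):int(0.9 * n)]
test_ids  = all_ids[int(0.9 * n):]

NUM_SHARDS = 100                           # arbitrary choice for the sketch
for shard in range(NUM_SHARDS):
    path = 'train/d{:04d}.tfr'.format(shard)   # placeholder directory/pattern
    with tf.io.TFRecordWriter(path) as writer:
        for item_id in train_ids[shard::NUM_SHARDS]:   # stride over shards
            writer.write(serialize_example(load_item(item_id)))
# write the val/test shards the same way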
Initial data randomization will make it easier to maintain randomization during training. However, I'd still say it is 'best practice' to re-shuffle file names and re-shuffle a data memory buffer as part of the train/validate data streams. Typically, you'll set up an input stream using multiple threads/processes. The first step is to randomize the file input streams by re-shuffling the filenames. This can be done like:
train_files = tf.data.Dataset.list_files('{}/d*.tfr'.format(train_dir),
                                         shuffle=True)
Now, if your initial data write was already randomized, you 'could' read the entire data from one file, before going to the next, but that would still impact re-randomization throughout the training process, so typically you interleave file reads, reading a certain number of records from each file. This also improves throughput, assuming you are using multiple file read processes (which you should do, to maximize gpu throughput).
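The interleaveFiles function in the snippet below isn't defined in the answer; a minimal version, assuming each shard is a plain TFRecord file, might just map a filename to its records:

def interleaveFiles(filename):
    # one filename tensor in, a dataset of its serialized records out
    return tf.data.TFRecordDataset(filename)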
blocksize = 1000  # samples read from one file before switching files
train_data = train_files.interleave(interleaveFiles,
                                    block_length=blocksize,
                                    num_parallel_calls=tf.data.experimental.AUTOTUNE)
Here, we're reading 1000 samples from each file, before going on to the next. Again, to re-shuffle the training data each epoch (which may or may not be critical), we re-shuffle the data in memory, setting a memory buffer based on what's available on the machine and how large our data items are (note - before formatting the data for gpu).
buffersize = 1000000 # samples read before shuffling in memory
train_data = train_data.shuffle(buffersize,
                                reshuffle_each_iteration=True)
train_data = train_data.repeat()
The repeat() call is just to allow the data set to 'wrap around' during training. This may or may not be important, depending on how you set up your training process.
To optimize throughput, you can do two things:
1. alter the order of operations in the data input stream. Typically, if you put your randomization operations early, they can operate on 'low weight' entities, like file names, rather than on tensors.
2. use pre-fetching to let your CPU processes stream data during GPU calculations:
train_data = train_data.map(mapData,
                            num_parallel_calls=tf.data.experimental.AUTOTUNE)
train_data = train_data.padded_batch(batchsize)
train_data = train_data.prefetch(10)
So, mapping and batching happen last (this is usually preferred for maximizing GPU throughput, but it can depend on other factors, like data size (pre- and post-tensorizing) and how computationally expensive your map function is).
Finally, you can tune the prefetch size to maximize gpu throughput, constrained by system memory and memory speed.
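For reference, mapData isn't shown in the answer either; if the shards hold serialized tf.train.Examples like the earlier sketch, it might look roughly like this (the feature names are placeholders):

feature_spec = {
    'image/encoded': tf.io.FixedLenFeature([], tf.string),
    'label':         tf.io.FixedLenFeature([], tf.int64),
}

def mapData(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_jpeg(parsed['image/encoded'])   # decode only here, per item
    return image, parsed['label']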
So, how does this all impact the 'optimal' number of data items in each sharded file?
Obviously, if your blocksize is larger than the number of data items per file, blocksize becomes irrelevant and you might as well read each file completely. Typically, if you are going to use this paradigm, you want blocksize << data items per file. I use 10x; so if my blocksize is 1000, I have ~10,000 data items in the file. This may not be optimal, but so far I can maintain >90% gpu usage using this approach on my specific hardware. If you want to tune for your hardware, you could start somewhere around 10x and adjust, based on whatever you are specifically trying to optimize.
If you have very large numbers of files, you may run into problems maintaining good file read streams, but on a modern system you should be able to get to 100,000 files or more and still be fine. Moving large numbers of files around can be difficult, but usually easier than having very small numbers of very big files, so there are some (broad) constraints on file sizes that can impact how many data items/file you end up with. Generally speaking, I'd say having on the order of 100s of files would be ideal for a large dataset. That way you can easily stream files across a network efficiently (again, that will depend on your network). If the data set is small, you'll have 10s to 50s of files, which is fine for streaming, depending on file size (I typically try to hit 100-300MB/file, which works well for moving things around a LAN or WAN).
So, I think file-size and number-of-files places much stronger constraints on your process than number of data items/file, so long as you have an appropriate number of data items/file, given your file read blocksize. Again, you could hyper-shard your files (1 data item/file?), and read entire files into memory, without using file blocking. That might work, and it would certainly be lightweight to shuffle file names, rather than data items. But you might also end up with millions of files!
To really optimize, you'll need to set up an end-to-end training system on a particular machine, and then tweak it to see what works best for your particular data, network, and hardware. So long as your data are effectively randomized and your data files are easy to store/use/share, you just want to optimize gpu throughput. I would be surprised if reordering the data input stream and pre-fetching doesn't get you there.

Animated GIF larger than source images

I'm using imagemagick to create an animated GIF out of ~60 JPG 640x427px photos. The combined size of the JPGs is about 4MB.
However, the output GIF is ~12MB. Is there a reason why the GIF is considerably bigger? Can I conceivably achieve a GIF size of ~4MB?
The command I'm using is:
convert -channel RGB \
        -delay 2x10 \
        -size 640 \
        -loop 0 \
        -dispose Background \
        -layers Optimize \
        portrait/*.jpg portrait.gif
(-channel RGB and -dispose Background gave no improvement in size; -layers Optimize saved about 2MB.)
Using gifsicle didn't seem to help either.
JPG is lossy compression.
GIF is lossless compression.
A better comparison would be to convert all the source images to GIF first, then combine them.
The first Google hit for GIF compression is http://ezgif.com/optimize, which claims lossy GIF compression; it might work for you, but I offer no warranty as I haven't tried it.
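To see how much of the growth comes from the palette conversion alone, one could quantize each frame first and then combine them. A rough Pillow sketch of that comparison, rather than the ImageMagick command line from the question (file names follow the question, everything else is assumed):

import glob

from PIL import Image

# Quantize each JPG to a 256-colour palette, then write one animated GIF.
frames = [Image.open(p).convert('RGB').quantize(colors=256)
          for p in sorted(glob.glob('portrait/*.jpg'))]
frames[0].save('portrait.gif', save_all=True, append_images=frames[1:],
               duration=200, loop=0)   # 200 ms/frame, matching -delay 2x10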
JPEG achieves its compression through a (lossy) transform, where a 16x16 or 8x8 block of pixels is transformed to a frequency representation and then quantized. Instead of using e.g. 256 levels (i.e. 8 bits) of red/green/blue per component, JPEG can ignore some frequency components, or use just 1 or 2 bits to represent them.
GIF, on the other hand, works by identifying repeated patterns in a paletted image (up to 256 entries), which occur exactly in the previously encoded/decoded stream. Both because of the JPEG compression, and because of the kind of source images typically encoded with JPEG (natural, full-colour), the probability of (long) exact matches is quite low.
60 RGB images of size 640x427 contain about 16 million pixels. To represent that much in 4 MB requires compression of about 2 bits per pixel. To achieve this with GIF would require a very lossy algorithm that quantizes the true-colour pixels not simply to the closest entry in the target GIF palette, but also based on how good a dictionary of code words a particular selection will produce. The dictionary builds slowly, and to achieve 2 bits/pixel the average decoded code word would have to map to about 5.5 matching pixels in the close neighbourhood.
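A quick back-of-the-envelope check of those numbers (plain Python arithmetic):

pixels = 60 * 640 * 427          # 16,396,800 pixels, ~16 million
budget_bits = 4 * 1024**2 * 8    # a 4 MB target, in bits
print(budget_bits / pixels)      # ~2.05 bits per pixel would be needed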
By contrast, imagemagick has already managed to compress those 16 million pixels (each an index into a 256-entry palette, i.e. roughly 16 MB uncompressed) down to about 75% of that!

How to get exact size of image in bytes?

I have calculated the image size in bytes by converting the image into NSData and reading its length, but I get the wrong value.
NSData *data = UIImageJPEGRepresentation(image, 0.5);
NSLog(@"image size in bytes %lu", (unsigned long)data.length);
Actually, data.length here is not returning the wrong value; it's just the result of the lossy conversion/re-encoding from a UIImage to NSData.
Setting compressionQuality to 1.0 (the least compression possible) in UIImageJPEGRepresentation will not return the original image. Although the image metadata is stripped in this process, the function can and usually will yield an object larger than the original. Note that this increase in file size does NOT increase the quality of the image over the compressed original either. JPEGs are highly compressed to begin with, which is why they are used so often, and the function is decompressing the image and then recompressing it. It's kind of like getting botox after age has stretched your body out: it might look similar to the original, but the insides are just not as good as they used to be.
You could use a lower compressionQuality, still close to 1.0, conditionally on larger files, as the quality drops off quickly below that. Other than that, depending on the final purpose of your images, the only other option would be to resize the image or adjust its resolution, perhaps in addition to adjusting the compression ratio. This change will drastically curtail data usage; web and mobile usage typically doesn't need the resolution of images meant for digital print.
You can write some code that adjusts each image's compression, and hence its NSData representation, only as much as needed to fit its individual data constraint.
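As a rough, language-agnostic illustration of that last idea (shown here with Pillow rather than UIKit, and with made-up file names and byte budget), the loop simply steps the JPEG quality down until the encoded data fits:

import io

from PIL import Image

def jpeg_under_budget(image, max_bytes, qualities=(95, 85, 75, 65, 50)):
    # Re-encode at decreasing quality until the result fits the byte budget.
    for q in qualities:
        buff = io.BytesIO()
        image.save(buff, "JPEG", quality=q)
        data = buff.getvalue()
        if len(data) <= max_bytes:
            return data, q
    return data, qualities[-1]     # smallest attempt if nothing fits

img = Image.open('photo.jpg')      # placeholder file name
data, used_quality = jpeg_under_budget(img, max_bytes=200_000)
print(len(data), used_quality)

The same loop could be written around UIImageJPEGRepresentation on iOS.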

How to compress images (png, jpg and so on) using objective C

I want to shrink PNG or JPG files on OS X. I only want to shrink the file without affecting the image quality,
like tinypng.org does.
Is there a recommended library? I only know ImageMagick. Is there a way to do that natively, or another library to shrink/compress images without affecting the image quality?
My aim is to shrink the file size, for example:
logo.png >> 476 KB before shrinking
logo.png >> 50 KB after shrinking
Edit: to be clear, I want to compress the size of the file, not reduce the image resolution.
TinyPNG.org works by using image quantisation: similar colours in the image are mapped in an HSV or RGB colour model and then merged depending on their distance.
How does it work?
...
When you upload a PNG (Portable Network Graphics) file, similar colours in your image are combined. This technique is called “quantisation”
...
src: http://tinypng.org
An answer here outlines a method of doing so: https://stackoverflow.com/a/492230/556479.
There are also some answers on this question that cover how to do so on Mac OS using Objective-C: How do I reduce a bitmap to a known set of RGB colours
See Wikipedia for a more in-depth guide: http://en.wikipedia.org/wiki/Color_quantization
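As a concrete illustration of palette quantisation (using Pillow here, not whatever tinypng runs internally; the file name follows the question's example):

from PIL import Image

# Reduce the image to a 256-colour palette; Floyd-Steinberg dithering is applied by default.
img = Image.open('logo.png').convert('RGB')
quantized = img.quantize(colors=256)
quantized.save('logo_quantized.png', optimize=True)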
Did you have a problem using ImageMagick? It has a rich set of quantize functions such as
bool MagickQuantizeImage( MagickWand mgck_wnd,
                          float number_colors,
                          int colorspace_type,
                          float treedepth,
                          bool dither,
                          bool measure_error )
Here is a very thorough guide to quantization using ImageMagick.
My suggestion is to use http://pngnq.sourceforge.net; it gives better results than ImageMagick, and for the single example given on http://tinypng.org it also produces a very similar output. It is a tiny C implementation of the method presented in the paper "Kohonen Neural Networks for Optimal Colour Quantization". That alone is much better, since you are no longer relying on closed, unknown implementations.
Original (57 KB), tinypng.org (16 KB), pngnq (17 KB).
Using ImageMagick, the best quantization to 256 colors I can get uses the LAB colorspace and dithering by Floyd-Steinberg:
convert input.png -quantize LAB -dither FloydSteinberg -colors 256 output.png
This produces a 16 KB PNG, but it contains many more visual artifacts.