Estimate size of .tar.gz file before compressing - gzip

We are working on a system (on Linux) that has very limited transmission resources. The maximum file size that can be sent as one file is defined, and we would like to send the minimum number of files. Because of this, all files sent are packed and compressed in GZip format (.tar.gz).
There are a lot of small files of different types (binary, text, images...) that should be packed in the most efficient way to send the maximum amount of data every time.
The problem is: is there a way to estimate the size of the tar.gz file without running the tar utility? (So the best combination of files can be calculated)

Yes, there is a way to estimate the archive size without writing the archive to disk.
tar -czf - /directory/to/archive/ | wc -c
Meaning:
This will write the archive to standard output and pipe it to the wc command, which counts the bytes. The output is the size of the archive in bytes. Technically, it still runs tar, but the archive is never saved to disk.
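If a human-readable figure is more convenient than a raw byte count, the same pipeline can be fed through numfmt from GNU coreutils (a small variation on the command above, assuming numfmt is available):
tar -czf - /directory/to/archive/ | wc -c | numfmt --to=iec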
Source: The Ultimate Tar Command Tutorial with 10 Practical Examples

It depends on what you mean by "small files", but generally, no. If you have a large file that is relatively homogeneous in its contents, then you could compress 100K or 200K from the middle and use that compression ratio as an estimate for the remainder of the file.
For files around 32K or less, you need to compress them to see how big they will be. Also, when you concatenate many small files in a tar file, you will get better compression overall than you would by compressing the small files individually.
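As a sketch of that sampling idea for a single large file (assuming GNU stat and dd; bigfile.bin is a placeholder name):
# compressed size of a 200K sample taken from the middle of the file
size=$(stat -c %s bigfile.bin)
sample=$(dd if=bigfile.bin bs=1K skip=$((size / 2048)) count=200 2>/dev/null | gzip -c | wc -c)
# scale the sample's ratio up to the whole file for a rough estimate, in bytes
echo $(( size * sample / (200 * 1024) ))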
I would recommend a simple greedy approach where you take the largest file whose size plus some overhead is less than the remaining space in the "maximum file size". The overhead is chosen to cover the tar header and the maximum expansion from compression (a fraction of a percent). Then add that to the archive. Repeat.
You can flush the compression at each step to see how big the result is.
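A rough shell sketch of that greedy loop, which checks the real compressed size once at the end instead of flushing the compressor after each file (LIMIT and OVERHEAD are illustrative values, and filenames with spaces are not handled):
LIMIT=$((10 * 1024 * 1024))   # maximum transmissible size in bytes
OVERHEAD=1024                 # per-file allowance: tar header + worst-case expansion
remaining=$LIMIT
selected=""
# walk the candidate files largest-first
for f in $(ls -S /directory/to/archive/*); do
    size=$(stat -c %s "$f")
    if [ $((size + OVERHEAD)) -le "$remaining" ]; then
        selected="$selected $f"
        remaining=$((remaining - size - OVERHEAD))
    fi
done
# verify the actual compressed size of the chosen set
tar -czf - $selected | wc -c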

Related

How can I optimize my pdf repository after splitting it by page?

I have about 20 large PDFs which I have split by pages for easier access. When I split them by pages using qpdf, I observe a 10x inflation in total size, meaning that there is some redundant data in all the per-page PDFs. It is very likely the embedded fonts that are the cause of the bloat. Is there a way to externalize these fonts (for example, so the user can install those fonts beforehand on their devices)? My goal is that once I split the PDFs by page, the total size stays within 1x-2x of the original so that I can host it on my website.
Here is the sample pdf from repository
https://www.mea.gov.in/Images/CPV/Volume17_Part_III.pdf
Any help regarding PDF splitting is welcome
Thanks!
I split the file into files of one page each and then tried to squeeze them. There is no unneeded data:
$ cpdf -squeeze 641.pdf -o out.pdf
Initial file size is 947307 bytes
Beginning squeeze: 2178 objects
Squeezing... Down to 1519 objects
Squeezing page data and xobjects
Recompressing document
Final file size is 945176 bytes, 99.78% of original.
So no luck there. About 4/5 of the size of each file is the (uncompressed) XML metadata from the main file. You may well not need this. If so, you can run:
cpdf -remove-metadata in.pdf -o small.pdf
on each output file. This reduces each file to about a fifth of its size. Obviously, if you're splitting into groups of more than one page, the effect will not be as large.
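A sketch of running that over every split file (assuming the per-page files are the *.pdf files in the current directory and cpdf is on the PATH):
mkdir -p small
for f in *.pdf; do cpdf -remove-metadata "$f" -o "small/$f"; done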

Is there a way to read tar.gz top lines without uncompression?

I have 1000+ *.tar.gz files, 4 GB+ each. The only thing I need is the top 5 lines of each file. I am wondering whether there is a fast way to read these lines without the decompression step (it takes 3-5 minutes to decompress a single file).
My platform is Linux.
No, there isn't any faster way.
The issue is that a .tar file is a stream of the concatenated original files (with some meta information), and gzip then compresses the whole archive. Therefore, even just to get the list of files, the archive has to be decompressed first.
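What you can do is stream the decompression and stop reading early instead of extracting whole archives to disk; a sketch assuming GNU tar and that the file you want is the first member of each archive:
for a in *.tar.gz; do tar -xzOf "$a" | head -n 5; done
head closes the pipe after five lines, so tar stops decompressing shortly afterwards rather than working through the full archive.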

what is the maximum extent of compressing a pdf file?

Whenever I try to compress a PDF file to the lowest possible size, by using either Ghostscript or pdftk or pdfopt, I end up with a file near half the size of the original. But lately, I have been getting files in the 1000 MB range, which compress to, say, a few hundred MB. Can we reduce them further?
The PDF is made from JPG images of high resolution; can't we reduce the size of those images to bring in some further reduction in size?
As far as I know, without degrading the JPEG streams and losing quality, you can try the special feature offered by
Multivalent
https://rg.to/file/c6bd7f31bf8885bcaa69b50ffab7e355/Multivalent20060102.jar.html
java -cp path/to.../multivalent.jar tool.pdf.Compress -compact file.pdf
The resulting output will be compressed in a special way; the resulting file needs the Multivalent browser to be read again.
It is unpredictable how much space you can save (often you cannot save any further space).
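If some quality loss is acceptable, the question's own idea of downsampling the embedded JPG images can be tried with Ghostscript (a sketch; the /ebook preset downsamples images to 150 DPI, and input.pdf/smaller.pdf are placeholder names):
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dBATCH -dQUIET -sOutputFile=smaller.pdf input.pdf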

Write multiple streams to a single file without knowing the length of the streams?

For performance when reading and writing a large dataset, we have multiple threads compressing and writing out separate files to a SAN. I'm making a new file spec that will instead have all these files appended together into a single file. I will refer to each of these smaller blocks of data as a subset.
Since each subset will be an unknown size after compression there is no way to know what byte offset to write to. Without compression each writer can write to a predictable address.
Is there a way to append files together on the file-system level without requiring a file copy?
I'll write an example here of how I would expect the result to be on disk. Although I'm not sure how helpful it is to write it this way.
single-dataset.raw
[header 512B][data1-45MB][data2-123MB][data3-4MB][data5-44MB]
I expect the SAN to be NTFS for now, in case certain filesystems have special features that help.
If I make the subsets small enough to fit into RAM, I will know the size after compression, but keeping them smaller has other performance drawbacks.
Use sparse files. Just position each subset at some offset "guaranteed" to be beyond the last subset. Your header can then contain the offset of each subset and the filesystem handles the big "empty" chunks for you.
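A minimal sketch of that layout with standard Unix tools, assuming each subset is guaranteed to fit in a 1 GiB slot (the subset file names are placeholders; NTFS supports the same pattern through its sparse-file API):
truncate -s 5G single-dataset.raw                                      # sparse file, nothing allocated yet
dd if=subset1.bin of=single-dataset.raw bs=1M seek=1    conv=notrunc   # first slot, leaving room for the header
dd if=subset2.bin of=single-dataset.raw bs=1M seek=1024 conv=notrunc   # slot at 1 GiB
dd if=subset3.bin of=single-dataset.raw bs=1M seek=2048 conv=notrunc   # slot at 2 GiB
Only the written ranges occupy disk blocks; the header then records each subset's real offset and length.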
The cooler solution is to write out each subset as a separate file and then use low-level filesystem functions to join the files by chaining the first block of the next file to the last block of the previous file (along with deleting the directory entries for all but the first file).

Create discrepancy between size on disk and actual size in NTFS

I keep finding files which show a size of 10 KB but a size on disk of 10 GB. I'm trying to figure out how this is done; does anyone have any ideas?
You can make sparse files on NTFS, as well as on any real filesystem. :-)
Seek to (10 GB - 10 kB), write 10 kB of data. There, you have a so-called 10 GB file, which in reality is only 10 kB big. :-)
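For illustration, the same trick with GNU dd on any sparse-capable filesystem (file name and sizes are arbitrary):
# seek to 10 GiB minus 10 KiB, then write 10 KiB of data
dd if=/dev/urandom of=big.bin bs=1K seek=10485750 count=10
ls -lh big.bin    # reported size: 10G
du -h big.bin     # actual allocation: a few KB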
You can create streams in NTFS files. It's like a separate file, but with the same filename. See here: Alternate Data Streams
I'm not sure about your case (or it might be a mistake in your question), but when you create an NTFS sparse file it will show different sizes for these two fields.
When I create a 10 MB sparse file and fill it with 1 MB of data, Windows Explorer will show:
Size: 10 MB
Size on disk: 1 MB
But in your case it's the opposite (or a mistake).