Is there a way to read the top lines of a tar.gz without uncompressing it?

I have 1000+ *.tar.gz files, each 4 GB or larger, but the only thing I need is the top 5 lines of each one. I am wondering whether there is a fast way to read these lines without the full uncompression process (it takes 3-5 minutes to uncompress a single file).
My platform is Linux.

No, there isn't any faster way.
The issue is that a .tar file is a stream of the concatenated original files (plus some metadata), and gzip then compresses that whole stream as a single archive. Therefore, even to get just the list of the files, the archive has to be decompressed first.
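That decompression can, however, be done as a stream and stopped as soon as the needed lines have been read, so only the first few kilobytes of each 4 GB archive are ever inflated. Below is a minimal zlib sketch under the assumption that the first entry in each archive is a plain (ustar) regular file whose top lines are wanted; the 512 bytes skipped at the start are that entry's tar header.

#include <stdio.h>
#include <zlib.h>

/* Hedged sketch: print the first 5 lines stored in a .tar.gz without
 * inflating the rest of the archive.  Assumes the first tar entry is the
 * text file of interest; its 512-byte tar header is skipped.  zlib only
 * decompresses what gzgets() asks for, so a 4 GB archive is never read
 * past the first few kilobytes. */
int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file.tar.gz\n", argv[0]);
        return 1;
    }

    gzFile gz = gzopen(argv[1], "rb");
    if (gz == NULL) {
        perror("gzopen");
        return 1;
    }

    char header[512];                     /* tar header of the first entry */
    if (gzread(gz, header, sizeof(header)) != (int)sizeof(header)) {
        fprintf(stderr, "archive too short\n");
        gzclose(gz);
        return 1;
    }

    char line[4096];
    for (int i = 0; i < 5 && gzgets(gz, line, sizeof(line)) != NULL; i++)
        fputs(line, stdout);

    gzclose(gz);
    return 0;
}

Roughly the same effect can be had from the shell with something like tar -xzOf archive.tar.gz | head -n 5, since head closing the pipe stops the extraction early.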

Related

How can I optimize my pdf repository after splitting it by page?

I have about 20 large PDFs which I have split by page for easier access. When I split them by page using qpdf, I see roughly a 10x inflation in total size, which means there is redundant data repeated in every per-page PDF. The embedded fonts are very likely the cause of the bloat. Is there a way to externalize these fonts (e.g. so users can install them on their devices beforehand)? My goal is that once I split the PDFs by page, the total size stays within 1x-2x of the original so that I can host them on my website.
Here is the sample pdf from repository
https://www.mea.gov.in/Images/CPV/Volume17_Part_III.pdf
Any help regarding PDF splitting is welcome.
Thanks!
I split the file into files of one page each and then tried to squeeze them. There is no unneeded data:
$ cpdf -squeeze 641.pdf -o out.pdf
Initial file size is 947307 bytes
Beginning squeeze: 2178 objects
Squeezing... Down to 1519 objects
Squeezing page data and xobjects
Recompressing document
Final file size is 945176 bytes, 99.78% of original.
So no luck there. About 4/5 of the size of each file is the (uncompressed) XML metadata from the main file. You may well not need this. If so, you can run:
cpdf -remove-metadata in.pdf -o small.pdf
on each output file. This reduces each file to roughly a fifth of its size. Obviously, if you're splitting into groups of more than one page, the effect will not be as large.

.gz (gzip) file analysis

According to RFC 1952 ("GZIP File Format Specification"), a gzip file consists of a series of "members" (compressed data sets).
Is it possible to analyze a gzip file without decompressing it? For example, to count the number of members and index their locations within the file, or to go into the middle of the file and find and decompress just one of the members?
No. To find where a member ends, you have to decompress it. You don't have to write out the decompressed result, though; you only need to process the input far enough to find where the members start.
Once you know where the members start, then yes, you can start decompression from any one of those locations.
Note that the vast majority of gzip files have just one member.
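For illustration, here is a hedged zlib sketch of that scan: it inflates the file while throwing the output away, counts the members, and prints the compressed offset at which each one starts. The buffer size and error handling are kept minimal; this is an illustration rather than production code.

#include <stdio.h>
#include <zlib.h>

/* Sketch: count the members of a gzip file and report where each starts.
 * The stream is inflated only to find the member boundaries; the inflated
 * bytes themselves are thrown away. */
#define CHUNK 65536

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file.gz\n", argv[0]);
        return 1;
    }
    FILE *in = fopen(argv[1], "rb");
    if (in == NULL) {
        perror("fopen");
        return 1;
    }

    static unsigned char inbuf[CHUNK], outbuf[CHUNK];
    z_stream strm = {0};
    if (inflateInit2(&strm, 15 + 16) != Z_OK)   /* 15 + 16: expect a gzip wrapper */
        return 1;

    unsigned long long file_pos = 0;            /* compressed bytes read so far */
    int members = 1;
    printf("member 1 starts at offset 0\n");

    for (;;) {
        if (strm.avail_in == 0) {
            strm.avail_in = fread(inbuf, 1, CHUNK, in);
            strm.next_in = inbuf;
            file_pos += strm.avail_in;
            if (strm.avail_in == 0)
                break;                          /* end of file */
        }
        strm.next_out = outbuf;                 /* inflate into scratch space */
        strm.avail_out = CHUNK;                 /* and simply discard it      */
        int ret = inflate(&strm, Z_NO_FLUSH);
        if (ret == Z_STREAM_END) {
            if (strm.avail_in == 0) {           /* refill to see if more follows */
                strm.avail_in = fread(inbuf, 1, CHUNK, in);
                strm.next_in = inbuf;
                file_pos += strm.avail_in;
            }
            if (strm.avail_in == 0)
                break;                          /* last member ended at EOF */
            printf("member %d starts at offset %llu\n",
                   ++members, file_pos - strm.avail_in);
            inflateReset(&strm);                /* next member, same settings */
        } else if (ret != Z_OK && ret != Z_BUF_ERROR) {
            fprintf(stderr, "inflate error %d\n", ret);
            break;
        }
    }

    printf("%d member(s)\n", members);
    inflateEnd(&strm);
    fclose(in);
    return 0;
}

Note that this still reads and inflates the whole file once; the point is that nothing decompressed has to be written anywhere, and the recorded offsets can be saved so that later runs can jump straight to a chosen member.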

gzip partial modification and re-compression

I am unfamiliar with compression algorithms. Is it possible with zlib or some other library to decompress, modify and recompress only the beginning of a gzip stream and then concatenate it with the compressed remainder of the stream? This would be done in a case where, for example, I need to modify the first bytes of user data (not headers) of a 10GB gzip file so as to avoid decompressing and recompressing the entire file.
No. Compression will generally make use of the preceding data in compressing the subsequent data. So you can't change the preceding data without recompressing the remaining data.
An exception would be if there were breakpoints put in the compressed data originally that reset the history at each breakpoint. In zlib this is accomplished with Z_FULL_FLUSH during compression.
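A hedged sketch of how such a breakpoint could be created with zlib's deflate API; the two segments, the buffer size, and the single-call structure are illustrative assumptions:

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Sketch: compress two segments with a Z_FULL_FLUSH between them.  The
 * full flush aligns the output to a byte boundary and resets the
 * compressor's history, so nothing after the recorded offset refers back
 * to the first segment.  The first segment could then be modified and
 * recompressed on its own, and the bytes after the breakpoint reused.
 * (Error handling and realistic buffering are omitted.) */
int main(void)
{
    const char *seg1 = "first segment, the part that may be edited later\n";
    const char *seg2 = "second segment, independent of the first\n";
    unsigned char out[4096];             /* big enough for this toy example */

    z_stream strm = {0};
    /* windowBits 15 + 16 asks deflate to write a gzip wrapper */
    if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     15 + 16, 8, Z_DEFAULT_STRATEGY) != Z_OK)
        return 1;

    strm.next_out = out;
    strm.avail_out = sizeof(out);

    strm.next_in = (unsigned char *)seg1;
    strm.avail_in = strlen(seg1);
    deflate(&strm, Z_FULL_FLUSH);        /* breakpoint: history is reset here */
    printf("breakpoint at compressed offset %lu\n", strm.total_out);

    strm.next_in = (unsigned char *)seg2;
    strm.avail_in = strlen(seg2);
    deflate(&strm, Z_FINISH);            /* finish the stream normally */
    printf("total compressed size %lu bytes\n", strm.total_out);

    deflateEnd(&strm);
    return 0;
}

Even with breakpoints, splicing a recompressed first segment in front of the untouched remainder still requires updating the gzip trailer, since its CRC-32 and length fields cover all of the uncompressed data.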

Estimate size of .tar.gz file before compressing

We are working on a system (on Linux) that has very limited transmission resources. The maximum file size that can be sent as one file is defined, and we would like to send the minimum number of files. Because of this, all files sent are packed and compressed in GZip format (.tar.gz).
There are a lot of small files of different types (binary, text, images...) that should be packed in the most efficient way, so that the maximum amount of data is sent every time.
The problem is: is there a way to estimate the size of the tar.gz file without running the tar utility? (So the best combination of files can be calculated)
Yes, there is a way to get the size without actually writing the archive to disk:
tar -czf - /directory/to/archive/ | wc -c
Meaning:
This creates the archive on standard output and pipes it to the wc command, which counts the bytes. The output is the size of the archive in bytes. Technically it still runs tar and compresses everything, it just doesn't save the result.
Source: The Ultimate Tar Command Tutorial with 10 Practical Examples
It depends on what you mean by "small files", but generally, no. If you have a large file that is relatively homogenous in its contents, then you could compress 100K or 200K from the middle and use that compression ratio as an estimate for the remainder of the file.
For files around 32K or less, you need to compress them to see how big they will be. Also, when you concatenate many small files in a tar file, you will get better compression overall than you would on the small files individually.
I would recommend a simple greedy approach where you take the largest file whose size plus some overhead is less than the remaining space in the "maximum file size". The overhead is chosen to cover the tar header and the maximum expansion from compression (a fraction of a percent). Then add that to the archive. Repeat.
You can flush the compression at each step to see how big the result is.
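A hedged sketch of that last step, assuming the packer drives zlib's deflate directly rather than shelling out to tar and gzip; write_out() and the buffer size are placeholders:

#include <stddef.h>
#include <zlib.h>

/* Placeholder for however the caller stores the compressed output. */
void write_out(const unsigned char *buf, size_t len);

/* Feed one file's bytes (tar header + data) into an ongoing gzip deflate
 * stream, then sync-flush so that strm->total_out counts every compressed
 * byte emitted so far.  The greedy packer can compare that running total
 * against the size limit before deciding whether the next file still fits.
 * Returns the running compressed size, or 0 on error. */
unsigned long add_and_measure(z_stream *strm, const unsigned char *data,
                              unsigned int len)
{
    unsigned char out[65536];

    strm->next_in = (unsigned char *)data;
    strm->avail_in = len;
    do {
        strm->next_out = out;
        strm->avail_out = sizeof(out);
        /* Z_SYNC_FLUSH pushes all pending compressed data without
           ending the stream or (unlike Z_FULL_FLUSH) resetting history */
        if (deflate(strm, Z_SYNC_FLUSH) == Z_STREAM_ERROR)
            return 0;
        write_out(out, sizeof(out) - strm->avail_out);
    } while (strm->avail_out == 0);

    return strm->total_out;   /* exact compressed size so far */
}

The stream would be opened once per output archive (e.g. deflateInit2 with windowBits 15 + 16 for a gzip wrapper) and finished with Z_FINISH once no remaining file fits under the limit.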

Write multiple streams to a single file without knowing the length of the streams?

For performance of reading and writing a large dataset, we have multiple threads compressing and writing out separate files to a SAN. I'm making a new file spec that will instead have all these files appended together into a single file. I will refer to each of these smaller blocks of a data as a subset.
Since each subset will be an unknown size after compression, there is no way to know what byte offset to write to. Without compression, each writer can write to a predictable address.
Is there a way to append files together on the file-system level without requiring a file copy?
Here is an example of how I would expect the result to look on disk, although I'm not sure how helpful it is to write it this way.
single-dataset.raw
[header 512B][data1-45MB][data2-123MB][data3-4MB][data5-44MB]
I expect the SAN to be NTFS for now in case there are any special features of certain file-systems.
If I make the subsets small enough to fit into ram, I will know the size after compression, but keeping them smaller has other performance drawbacks.
Use sparse files. Just position each subset at some offset "guaranteed" to be beyond the last subset. Your header can then contain the offset of each subset and the filesystem handles the big "empty" chunks for you.
The cooler solution is to write out each subset as a separate file and then use low-level filesystem functions to join the files by chaining the first block of the next file to the last block of the previous file (along with deleting the directory entries for all but the first file).
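A hedged POSIX sketch of the sparse-file approach; the slot size, file name, and subset contents are made-up placeholders, and on NTFS the file would additionally have to be marked sparse (FSCTL_SET_SPARSE) for the unwritten ranges to be stored as holes:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sketch of the sparse-file idea: each writer seeks to its own pre-assigned
 * offset, far enough apart that subsets can never collide, and writes there.
 * Ranges that are never written are holes and take no space on disk. */
int main(void)
{
    const off_t slot_size = 1LL << 30;   /* assumed 1 GiB slot per subset */
    int fd = open("single-dataset.raw", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char *subsets[] = { "compressed subset 1", "compressed subset 2" };
    for (int i = 0; i < 2; i++) {
        off_t offset = 512 + (off_t)i * slot_size;   /* 512 B header slot first */
        if (pwrite(fd, subsets[i], strlen(subsets[i]), offset) < 0)
            perror("pwrite");
        /* a real header at offset 0 would record subset i -> (offset, length) */
    }

    close(fd);
    return 0;
}

Since the slots never overlap, each writer thread can pwrite() into its own slot concurrently, and the 512-byte header at offset 0 can be filled in last, once every subset's offset and compressed length are known.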