Is there any way I can check how many flush points are in a gzip file? - gzip

I have a huge .gz file created with zlib and I want to examine how many flush points are in it (see http://www.zlib.net/manual.html for the deflate flush operations). Is there any way I can do it?

You can use infgen to disassemble the stream, and search for empty stored blocks (a line containing only "stored" immediately followed by a line containing only "end").
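If you want to automate the count, here is a minimal Python sketch that pipes the file through infgen and counts those empty stored blocks. It assumes infgen is on your PATH, and the file name huge.gz is just a placeholder.

import subprocess

def count_flush_points(path):
    """Count flush points (empty stored blocks) in a gzip stream.

    Assumes the infgen disassembler is on PATH; a flush shows up as a
    zero-length stored block, i.e. a line reading "stored" immediately
    followed by a line reading "end" in infgen's output.
    """
    with open(path, "rb") as f:
        out = subprocess.run(["infgen"], stdin=f, capture_output=True,
                             text=True, check=True).stdout
    lines = [line.strip() for line in out.splitlines()]
    return sum(1 for a, b in zip(lines, lines[1:])
               if a == "stored" and b == "end")

print(count_flush_points("huge.gz"))  # "huge.gz" is a placeholder name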

Related

Why should applications read a PDF file backwards?

I am trying to wrap my head around the PDF file structure. There is a header, a body with objects, a cross-reference table and a trailer. In the official PDF reference from Adobe, section 3.4.4 about file trailer, we can read that:
The trailer of a PDF file enables an application reading the file to quickly find the cross-reference table and certain special objects. Applications should read a PDF file from its end.
This looks very inefficient to me. I can't show anything to users this way (not even the first page) before I load the whole file. Well, to be precise, I can - if my file is linearized. But that is optional and means some extra overhead both when writing and reading such a file.
Instead of that whole linearization thing, it would be easier to just put the references in front of the body (followed by the objects for page 1, page 2, page 3...). But the people at Adobe probably had their reasons for putting it after the body. I just don't see them. So...
Why is the cross-reference table placed after the body?
I would agree with the two reasons already mentioned, but not because of hardware limitations "back in the day"; rather, it's a matter of scale. It's easy to think that an invoice with a couple of pages of text could be structured differently, but what about a book, or a PDF with 1,000 photos?
With the trailer at the end you can write images/text/fonts to the file as they are processed and then discard them from memory while simply storing the file offset of each object to be used to write the trailer.
If the trailer had to come first, then you would have to read (or even generate, in the case of an embedded font) all of these objects just to get their sizes so you could write out the trailer, and only then write all the objects to the file. So you would either be reading, sizing, discarding, then reading again, or trying to hold everything in RAM until you could write it to the file.
Write speed and RAM are still issues we contend with today, when we're running in a Docker container on a VM on shared hardware.
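To make that single-pass argument concrete, here is a toy Python sketch (not a real PDF writer, and the object bodies are made up) showing how a writer can stream objects out, remember only their byte offsets, and emit the cross-reference table last:

def write_toy_pdf(path, objects):
    # Toy illustration only: stream each object out, keep just its byte
    # offset, and write the cross-reference table at the very end.
    offsets = []
    with open(path, "wb") as f:
        f.write(b"%PDF-1.4\n")
        for num, body in enumerate(objects, start=1):
            offsets.append(f.tell())          # remember where this object starts
            f.write(b"%d 0 obj\n" % num)
            f.write(body)                     # the object can be discarded afterwards
            f.write(b"\nendobj\n")
        xref_pos = f.tell()
        f.write(b"xref\n0 %d\n" % (len(objects) + 1))
        f.write(b"0000000000 65535 f \n")
        for off in offsets:
            f.write(b"%010d 00000 n \n" % off)
        f.write(b"trailer\n<< /Size %d >>\n" % (len(objects) + 1))
        f.write(b"startxref\n%d\n%%%%EOF\n" % xref_pos)

write_toy_pdf("toy.pdf", [b"<< /Type /Catalog >>", b"<< /Type /Pages >>"])

At no point does the writer need more than one object in memory, which is exactly why the cross-reference table naturally ends up at the end of the file.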
PDF was invented back when hard drives were slow to write files... really s-l-o-w. By putting the xref at the end, you could quickly change a file by simply appending new objects and an updated xref to the end of the file rather than rewriting the whole thing.
Not only were the drives slow (giving rise to the argument in #joelgeraci's answer), there was also much less RAM available in a typical computer. Thus, when creating a PDF, one had to write data to the file early, much earlier than one had any idea how big the file or, as a consequence, the cross references would become. Writing the cross references at the end, therefore, was a natural consequence.

How to compare and find the differences between two XML files in cocoa?

This is a bit of a two-part question about working with 40 MB XML files.
• What’s a reasonable size to store in memory for a program running continually in the background?
• How to find what has changed in an XML file.
So on the first read the XML is loaded into NSData, then uploaded to the server.
Now, instead of uploading a 40 MB XML file every time it changes, I would prefer to upload a "delta" file containing only what has changed. The program would monitor the file for changes and activate when it's been modified. From what I can see, I would need to parse an old version of the XML file and the modified XML file, then compare them. Is it unreasonable to store 80 MB in memory like this every time the file is modified? I'm assuming this has to be done with a DOM parser, because I can't see how you could compare two files like that with a SAX parser, since it only has part of the file stored.
I'm a newbie at this so any help would be appreciated!
To compare two files:
There are several ways to do this (given the size of the files involved, I may not be correct):
sdiff file1.xml file2.xml (a Unix command)
You can invoke this command from AppleScript.
-[NSFileManager contentsEqualAtPath:andPath:]
This method first checks whether the two paths refer to the same file, then compares their sizes, and finally compares their contents.
For the other part:
As for what size is reasonable for a background process: I don't think it matters much there, though for an application it does. You can always save the data into temporary files. Even Safari uses 130+ MB, as you can easily check in Activity Monitor.
NSXMLParser ended up being the most useful for this.
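For what it's worth, here is a minimal sketch of the DOM-style comparison discussed above, written in Python for illustration (the same idea applies to NSXMLDocument in Cocoa). It assumes the entries you care about are <record> elements keyed by a unique id attribute; both the tag name and the attribute are placeholders you would adapt to the real schema.

import xml.etree.ElementTree as ET

def index_records(path, record_tag="record"):
    # Build a map from record id to its serialized form.
    # "record" and "id" are assumed names; adjust to the actual schema.
    tree = ET.parse(path)
    return {el.get("id"): ET.tostring(el) for el in tree.iter(record_tag)}

def xml_delta(old_path, new_path, record_tag="record"):
    old = index_records(old_path, record_tag)
    new = index_records(new_path, record_tag)
    added    = [k for k in new if k not in old]
    removed  = [k for k in old if k not in new]
    modified = [k for k in new if k in old and new[k] != old[k]]
    return added, removed, modified

added, removed, modified = xml_delta("old.xml", "new.xml")  # placeholder paths
print(len(added), "added,", len(removed), "removed,", len(modified), "modified")

The delta (added, removed and modified records) is what you would upload instead of the full 40 MB file.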

How to rename a file inside a tar.bz2 archive?

I want to know if there is a better way to rename a file inside a .tar.bz2 archive, without unpacking and repacking the entire archive.
bzip2 performs stream compression on the entire stream produced by tar. It has no notion of files, and as such the only way to find a file in a tar.bz2 archive is to decompress the bzip2 stream up to the point where the file appears. Renaming the file and creating a new tar.bz2 archive therefore requires producing a new tar file and recompressing it.
You may be able to reuse the beginning of the original tar.bz2 archive if you write a special-purpose cache to avoid recompression while decompressing the archive, but you will surely have to recompress the rest of the archive.
If your problem is disk space, you may try to perform the entire decompression and compression online via pipes, i.e.
bzcat original.tar.bz2 | command_to_rename_inside_tar | bzip2 > result.tar.bz2
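If you are comfortable scripting it, here is a rough Python sketch of the same streaming idea using the standard tarfile module: nothing is unpacked to disk, but the whole archive is still decompressed and recompressed. The archive and member names below are placeholders.

import tarfile

def rename_member(src, dst, old_name, new_name):
    # Copy src -> dst, renaming one member on the way through.
    # Members are streamed one at a time, never extracted to disk.
    with tarfile.open(src, "r:bz2") as tin, tarfile.open(dst, "w:bz2") as tout:
        for member in tin:
            data = tin.extractfile(member) if member.isfile() else None
            if member.name == old_name:
                member.name = new_name
            tout.addfile(member, data)

rename_member("original.tar.bz2", "result.tar.bz2",
              "somedir/oldname.txt", "somedir/newname.txt")  # placeholder names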

How do operating systems compute file size?

If I understand correctly, most programming languages that provide a library function to retrieve the size of a file use a system call. But then, what does the system do under the hood? Does it depend on the file system? Is the size information stored in some kind of file header?
Yes, this is filesystem dependent, but many filesystems do it in roughly the same way: for each file, there is a block on the hard drive that stores metadata about the file, including its size.
For many of the filesystems used in Linux/UNIX, for example, this block is called an inode. Note that the inode is not actually part of the file, so it's not really a header; it exists in a region of the disc that is reserved for storing metadata, not file data.
On NTFS, the filesystem used by Windows, file size data is stored in the master file table. This is roughly equivalent to the inode table on a Linux filesystem.
It's stored in a file's metadata, which you can retrieve with stat on POSIX systems. The metadata also includes, for example, when the file was last modified or accessed.
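For example, Python's os.stat wrapper exposes exactly this metadata; the file name below is just a placeholder.

import os, time

st = os.stat("example.txt")           # wraps the POSIX stat() call
print("size:", st.st_size, "bytes")   # the size recorded in the file's metadata
print("modified:", time.ctime(st.st_mtime))
print("accessed:", time.ctime(st.st_atime))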
They're stored in a structure called an inode: http://en.wikipedia.org/wiki/Inode
This contains all of your file's metadata; when you modify the contents of a file (or really do anything with it), the inode gets updated.

How can I recover files from a corrupted .tar.gz archive?

I have a large number of files in a .tar.gz archive. Checking the file type with the command
file SMS.tar.gz
gives the response
gzip compressed data - deflate method , max compression
When I try to extract the archive with gunzip, after a delay I receive the message
gunzip: SMS.tar.gz: unexpected end of file
Is there any way to recover even part of the archive?
Recovery is possible but it depends on what caused the corruption.
If the file is just truncated, getting some partial result out is not too hard; just run
gunzip < SMS.tar.gz > SMS.tar.partial
which will give some output despite the error at the end.
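If you would rather do the same thing programmatically, here is a rough Python sketch that keeps decompressing a damaged .gz file until it hits bad data and writes out whatever it could recover (it only handles a single gzip member):

import zlib

def salvage_gzip(src, dst, chunk_size=1 << 16):
    # wbits = 32 + 15 tells zlib to auto-detect the gzip header.
    d = zlib.decompressobj(32 + 15)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break              # truncated input: keep what we have
            try:
                fout.write(d.decompress(chunk))
            except zlib.error:
                break              # corrupted past this point; stop here

salvage_gzip("SMS.tar.gz", "SMS.tar.partial")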
If the compressed file has large missing blocks, it's basically hopeless after the bad block.
If the compressed file is systematically corrupted in small ways (e.g. transferring the binary file in ASCII mode, which smashes carriage returns and newlines throughout the file), it is possible to recover, but it requires quite a bit of custom programming; it's really only worth it if you have absolutely no other recourse (no backups) and the data is worth a lot of effort. (I have done it successfully.) I mentioned this scenario in a previous question.
The answer for .zip files differs somewhat, since zip archives have multiple separately compressed members, so there's more hope (though most commercial tools are rather bogus: they eliminate warnings by patching CRCs, not by recovering good data). But your question was about a .tar.gz file, which is an archive with one big member.
Are you sure that it is a gzip file? I would first run 'file SMS.tar.gz' to validate that.
Then I would read the gzip Recovery Toolkit page.
Here is one possible scenario that we encountered. We had a tar.gz file that would not decompress; trying to unzip it gave the error:
gzip -d A.tar.gz
gzip: A.tar.gz: invalid compressed data--format violated
I figured out that the file may have been originally uploaded over a non-binary FTP connection (we don't know for sure).
The solution was relatively simple, using the Unix dos2unix utility:
dos2unix A.tar.gz
dos2unix: converting file A.tar.gz to UNIX format ...
tar -xvf A.tar
file1.txt
file2.txt
....etc.
It worked!
This is one slim possibility, and maybe worth a try - it may help somebody out there.