How can I recover files from a corrupted .tar.gz archive? - gzip

I have a large number of files in a .tar.gz archive. Checking the file type with the command
file SMS.tar.gz
gives the response
gzip compressed data - deflate method , max compression
When I try to extract the archive with gunzip, after a delay I receive the message
gunzip: SMS.tar.gz: unexpected end of file
Is there any way to recover even part of the archive?

Recovery is possible but it depends on what caused the corruption.
If the file is just truncated, getting some partial result out is not too hard; just run
gunzip < SMS.tar.gz > SMS.tar.partial
which will give some output despite the error at the end.
If the compressed file has large missing blocks, it's basically hopeless after the bad block.
If the compressed file is systematically corrupted in small ways (e.g. transferring the binary file in ASCII mode, which smashes carriage returns and newlines throughout the file), it is possible to recover but requires quite a bit of custom programming, it's really only worth it if you have absolutely no other recourse (no backups) and the data is worth a lot of effort. (I have done it successfully.) I mentioned this scenario in a previous question.
The answers for .zip files differ somewhat, since zip archives have multiple separately-compressed members, so there's more hope (though most commercial tools are rather bogus, they eliminate warnings by patching CRCs, not by recovering good data). But your question was about a .tar.gz file, which is an archive with one big member.

Are you sure that it is a gzip file? I would first run 'file SMS.tar.gz' to validate that.
Then I would read the The gzip Recovery Toolkit page.

Here is one possible scenario that we encountered. We had a tar.gz file that would not decompress, trying to unzip gave the error:
gzip -d A.tar.gz
gzip: A.tar.gz: invalid compressed data--format violated
I figured out that the file may been originally uploaded over a non binary ftp connection (we don't know for sure).
The solution was relatively simple using the unix dos2unix utility
dos2unix A.tar.gz
dos2unix: converting file A.tar.gz to UNIX format ...
tar -xvf A.tar
file1.txt
file2.txt
....etc.
It worked!
This is one slim possibility, and maybe worth a try - it may help somebody out there.

Related

How to put files inside files

MS Word's .docx files contain a bunch of .xml files.
Setup.exe files spit out hundreds of files that a program uses.
Zips, rars etc also hold lots of compressed stuff.
So how are they made? What does MS Word or another program that produces these files have to do to put files inside files?
When I looked this up I just got a bunch of results about compression, but let's say I wanted to make a program that 'wraps' files inside a file without making the final result any smaller. What would I even have to write?
I'm not asking/expecting any source code that does this, I just need a pointer. Is there something you think I'm misunderstanding based on what I've asked here?
Even a simple link to an article or some documentation would be greatly appreciated.
Ok, I'll just come up with some headers for ordinary files and write them along with the bytes of the actual files into one custom-defined file. You guys were very helpful, thank you!
Historically, Windows had a number of technologies to support solutions like this. These were often called Compound Files or Structured storage. However, I don't think the newer Office documents use these technologies. I think the Office file formats are similar to ZIP files with a different extensions. If you change a file with .docx extension to .zip and open it with your favorite compression tool, you'll see a bunch of folders and XML files.
Here are some links to descriptions of different file formats that create "files within files"
Zip file format
Compound File Binary Format (CFBF)
Structured Storage
Compound Document File Format
Office Open XML I: Exploring the Office Open XML Formats
At least on POSIX systems (e.g. Linux), a file is only a stream (i.e. a sequence) of bytes. And you can only grow (or shrink, i.e. truncate) it at the end - there is no way to insert bytes in the middle (without copying the rest).
You need some conventions, and some additional software, to handle it otherwise.
You might be interested in Sqlite, which gives you a library to handle some (e.g.) *.sqlite file as an SQL database
You could also use GDBM - a library giving you some indexed file abstraction.
libtar is a library to manipulate tar archives. See also tardy, a tar file postprocessor.

How to remove .efs file extension from 1000's of recovered files in one folder

I recently recovered a 1.5TB external HDD that crashed. The program I used to recover the files was Active Undelete Enterprise, it's excellent. When the files were successfully recovered they were all saved with a .efs extension so files looked like mydocument.docx.efs. At first I thought they were encrypted and needed to be decrypted, I spent 10 mins on it and realized I just need to remove the .efs from the entire filename and the mydocument.docx works perfectly. Problem is now I have over 55,000 files within hundreds of folders where I need to simply remove the .efs after each file. Does anyone know how to do this?
From a command prompt window, navigate to the top level directory where these files reside.
Type the command
DIR /S/B >>filelist.txt
This command will give you a bare format file listing of the current directory plus all nested subdirectories without any extraneous information. The list will be contained in the text file named "filelist.txt" or whatever else you choose to call it. I would then use this text file in a text editor to convert every line of text from, for example,
C:\Users\dlucas\.gimp-2.8\mathmap\file1.png.efs
to
rename c:\Users\dlucas\.gimp-2.8\mathmap\file1.png.efs file1.png
to give a simple example of a file that I just found on my system using this method.
You will need to use a text editor with a columnar editing capability since you have to modify som many files. Old programmer's editors such as CodeWright made this really simple while modern editors such as Eclipse or Notepad++ make this a little more difficult and may require a columnar editing plugin, depending on version. You basically have to make a columnar copy of all of the text in the file, and then paste the copy off to the far right - far enough that a second column of filenames and paths won't overwrite any of the existing file names and paths. You can then use columnar editing features to select and delete the path names of the text in the 2nd column since the rename command requires that the 2nd argument be simply the base filename and extension without the path information. You can use the columnar editing features to prepend every line with "RENAME ". If you attempt to do this without columnar editing features, you will find it slow going!
An alternate way to do this is to use a command formed from a "regular expression" to create the rename command. If you are not familiar with "regular expressions", ask a programmer friend as this is not an easy topic to learn from scratch. If you are familiar with regular expressions, this is probably the simplest way to perform this task. I haven't used them in many years and no longer recall the exact syntax to use or I would tell you myself.
Regardless of what kind of editor you use, the goal is to turn this ASCII file list of paths and filenames into a batch file (simply rename file1.txt to file1.bat when you are finished editing). You can then run the batch file by typing file1.bat at a command prompt.
I have just run into this same problem myself using the same really wonderful tool that you used. I am writing this while waiting for the undelete program to finish. That it restores files with this extra extension seems very anti-intuitive so I will look for an option to make it not do this when it finishes. If I find one, I will post a new answer here that is more specific to this tool. Otherwise, I am going to have rename all kazillion files just as you had to.
You experienced this problem because the disk that you recovered your files to "does not support encryption", according to the Active# UNDELETE documentation. The documentation offers no further explanation of what kind of disks support encryption, etc.
They offer a Decrypt command that restores the file's proper names as a post processing step. Unfortunately, this requires that you "include" each and every file to be decrypted, with no support for wildcards and parsing subdirectories so that is a non-starter, in my opinion given that both of us have hundreds of thousands of files to be renamed.
I did find that by selecting a normal fixed (non-removable) hard drive as the destination of the recovery effort, that the resulting files do not end up encrypted (i.e., they are recovered with the proper file name and extension). I originally chose a large USB based flash drive and the files were stored in their "encrypted" state (not really encrypted, but possibly potentially so and thus they give the .efs extension). Of course, this meant that I had to run the command all over again after switching to a regular hard drive (takes about 16 hours to recover 80GB worth of files due to presence of many sector CRC errors).

How to compare and find the differences between two XML files in cocoa?

This is a bit of a two part question, for working with 40mb xml files.
• What’s a reasonable size to store in memory for a program running continually in the background?
• How to find what has changed in an XML file.
So on the first read the XML is loaded into NSData, then uploaded to the server.
Now instead of uploading a 40mb XML every time it changes, I would prefer to upload a “delta” file containing only what has changed. The program would monitor the file for change, and activate when it’s been modified. From what I can see, I would need to parse an old version of the xml file and parse the modified xml file, then compare them? Is it unreasonable to store 80mb in memory like this every time the file is modified?. Now I’m assuming that this has to be done with a DOM parser because I can’t see how you could compare two files like that with a SAX parser since it only has part of the file stored?
I'm a newbie at this so any help would be appreciated!
To compare two files:
There are many ways to do, (As file is to be considered, I may not be correct):
sdiff file1.xml file2.xml A unix command
You can use this command with apple script.
-[NSFileManager contentsEqualAtPath:andPath:]
This method checks to see if two files at given path are the same file, then compares their size, and finally compares their contents.
For other part:
What size is considered for background process, I dont think so, for an application it matters. You can save these into temporary files. Even safari uses 130+ MB as you can easily check through Activity monitor.
NSXMLParser ended up being the most useful for this

How to rename a file inside a tar.bz2 archive?

I want to know if there is better way to rename a file from inside a .tar.bz2 file, without unpacking it and repacking the entire archive.
bzip2 performs stream compression on the entire stream produced by tar. It has no notion of files, and as such the only way to find a file in a tar.bzip2 archive is to decompress the bzip2 until the point where the file appears. Removing the file and creating a new tar.bz2 archive would require creating a new tar file.
You may be able to reuse the beginning of the original tar.bz2 archive if you write a special-purpose cache to avoid recompression while decompressing the archive, but you will surely have to recompress the rest of the archive.
If your problem is disk space, you may try to perform the entire decompression and compression online via pipes, i.e.
bzcat original.tar.bz2 | command_to_rename_inside_tar | bzip2 > result.tar.bz2

Out of memory error when merging large numbers of PDFs using Zend_PDF

We're using the Zend_PDF module in SugarCRM to merge pdf invoices that our system generates. I have been able to successfully merge a number of PDFs (around 10 to 30 in my tests), but we're getting memory errors when we try to merge larger numbers of pdf files. The error looks something like this:
[30-Jan-2012 14:10:20] PHP Fatal error: Allowed memory size of 268435456 bytes exhausted at /usr/local/src/php-5.3.8/Zend/zend_operators.c:1265 (tried to allocate 68134 bytes) in /srv/www/htdocs/sugar6_mf/Zend/Pdf/Element/Object/Stream.php on line 442
The above error was generated when we tried to merge 457 pdf files - that's files, not pages. We're going to need to merge 5,000 and more at a time eventually.
Can anyone offer any help/advice on how to address this?
If needed, ask, and I'll post the code on how the merged pdf is being generated.
Thanks.
I should preface this answer by saying that I know nothing about SugarCRM - my response is based solely on my knowledge of Zend_Pdf.
If my understanding is correct, you have a PHP script (hopefully not running inside Apache considering the length of time it will take to process 5,000 files) that is taking multiple PDF files as input using the Zend_Pdf::load() method and then iterating through the pages of each PDF object and adding them to one target instance of Zend_Pdf, which you are then writing to a file using the save() method.
Using this approach, even if you unset() each of the source PDF objects after you've added the pages to the target PDF object, you'll still need enough memory to store the entire output file. If you blew through 250MB with only 457 files, then I'm guessing your input PDF files are probably about 500KB, so your output file is going to be absolutely huge, so you are still going to end up running out of memory.
My advice would be to ditch this method entirely and use pdftk instead, which you could invoke using the exec() function. I'm sure there's a limit to the size of the arguments you can provide to exec(), so it will probably be a multi-step process with several intermediate files, but ultimately I think this will be a faster, more robust solution.
And just to re-iterate an earlier point, I would not run this process within Apache. I would set up a cron job that runs at the appropriate intervals and drops the output file into a secure area on your web/file server.