How to rename a file inside a tar.bz2 archive?

I want to know if there is a better way to rename a file inside a .tar.bz2 archive, without unpacking and repacking the entire archive.

bzip2 compresses the entire stream produced by tar. It has no notion of files, so the only way to find a file in a tar.bz2 archive is to decompress the bzip2 stream up to the point where the file appears. Renaming (or removing) the file and producing a new tar.bz2 archive therefore requires writing a new tar stream.
You may be able to reuse the beginning of the original tar.bz2 archive if you write a special-purpose cache to avoid recompression while decompressing the archive, but you will surely have to recompress the rest of the archive.
If your problem is disk space, you can perform the entire decompression and recompression on the fly via pipes (command_to_rename_inside_tar is a placeholder for a filter that rewrites the tar stream), e.g.:
bzcat original.tar.bz2 | command_to_rename_inside_tar | bzip2 > result.tar.bz2
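One way to fill in the placeholder step is Python's tarfile module, which can read and write compressed tar streams without seeking. The following is only a sketch under assumptions: the function name rename_member and its parameters are made up for illustration, and it renames a single regular member without adjusting hard links that point at the old name.

```python
import tarfile


def rename_member(src, dst, old_name, new_name):
    """Copy a .tar.bz2 stream from src to dst, renaming one member on the way.

    src and dst are binary file objects; the uncompressed tar never
    has to exist on disk.
    """
    # "r|bz2" and "w|bz2" are tarfile's non-seeking stream modes.
    with tarfile.open(fileobj=src, mode="r|bz2") as tin, \
            tarfile.open(fileobj=dst, mode="w|bz2") as tout:
        for member in tin:
            # In stream mode the data must be consumed before moving on
            # to the next member, so fetch it right away.
            data = tin.extractfile(member) if member.isreg() else None
            if member.name == old_name:
                member.name = new_name
                # A pax "path" record would otherwise take priority
                # over the new name when the header is written.
                member.pax_headers.pop("path", None)
            tout.addfile(member, data)
```

Called as rename_member(sys.stdin.buffer, sys.stdout.buffer, "old/name.txt", "new/name.txt"), it can read original.tar.bz2 on standard input and write result.tar.bz2 on standard output. Everything after the renamed member is still recompressed, as described above.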

Related

Preserve directory structure when unpacking attachments from PDF with pdftk?

I am trying to pack and unpack attachments including a subdirectory hierarchy to a PDF with pdftk ... attach_files and pdftk ... unpack_files. However, while attach_files is capable of representing the subdirectory information by including the / separator in file names, unpack_files puts all files into one flat directory, silently overwriting files if the same name occurs multiple times. Is it possible to get preservation of the hierarchy when unpacking?
As workarounds I have used:
Packing the attachments into a zip file and attaching the zip file. However, this way the attachment hierarchy is no longer easily accessible.
Applying a bijective transformation on the path names, that maps the hierarchy to a flat structure and back. However, this way unpacking is possible only with a script doing the transformation.
Being directly able to preserve the hierarchy information already stored in the PDF would be preferable.
Unfortunately not with the current version of pdftk: it is hardcoded to drop path information both when attaching and when unpacking files. In fact, I would be surprised if any hierarchy information got stored in the PDF using pdftk.
That being said, it would not be too hard to write a patch to change this behaviour. I suggest opening an issue with a feature request.
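The second workaround from the question (a reversible mapping between hierarchical paths and flat names) can be sketched with percent-encoding. This is an illustration only: the helper names flatten and unflatten are hypothetical, and the choice of percent-encoding is an assumption, not something pdftk itself knows about.

```python
from urllib.parse import quote, unquote


def flatten(path):
    """Map a relative path to a flat name that contains no "/"."""
    # safe="" forces "/" to be encoded as %2F; "%" itself becomes %25,
    # which keeps the mapping reversible.
    return quote(path, safe="")


def unflatten(name):
    """Recover the original relative path from a flat name."""
    return unquote(name)


print(flatten("docs/img/a b.png"))  # docs%2Fimg%2Fa%20b.png
```

Files would be attached under flatten(path) and, after unpack_files, moved back to unflatten(name) by a small script.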

Is there any way I can check how many flush points are in a gzip file?

I have a huge .gz file created with zlib and I want to examine how many flush points are in it (see http://www.zlib.net/manual.html for the gz flush operation). Is there any way I can do it?
You can use infgen to disassemble the stream, and search for empty stored blocks (a line containing only "stored" immediately followed by a line containing only "end").
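To see why an empty stored block marks a flush point, the following sketch produces one with Python's zlib module. The function name sync_flushed_chunk is made up for illustration. It only demonstrates what a sync flush emits; scanning a file for these four bytes is not a reliable count, because compressed data can contain the same bytes by chance, so a disassembler such as infgen is still the right tool.

```python
import zlib


def sync_flushed_chunk(data):
    """Compress data as raw deflate and end the chunk with a sync flush."""
    # wbits=-15 selects raw deflate, so no zlib or gzip header is involved.
    comp = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=-15)
    return comp.compress(data) + comp.flush(zlib.Z_SYNC_FLUSH)


chunk = sync_flushed_chunk(b"data written before the flush point")
# The flush appends an empty stored block, whose last four bytes are
# LEN = 0x0000 followed by its complement NLEN = 0xFFFF.
print(chunk.endswith(b"\x00\x00\xff\xff"))  # True
```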

How to put files inside files

MS Word's .docx files contain a bunch of .xml files.
Setup.exe files spit out hundreds of files that a program uses.
Zips, rars etc also hold lots of compressed stuff.
So how are they made? What does MS Word or another program that produces these files have to do to put files inside files?
When I looked this up I just got a bunch of results about compression, but let's say I wanted to make a program that 'wraps' files inside a file without making the final result any smaller. What would I even have to write?
I'm not asking/expecting any source code that does this, I just need a pointer. Is there something you think I'm misunderstanding based on what I've asked here?
Even a simple link to an article or some documentation would be greatly appreciated.
Ok, I'll just come up with some headers for ordinary files and write them along with the bytes of the actual files into one custom-defined file. You guys were very helpful, thank you!
Historically, Windows had a number of technologies to support solutions like this. These were often called Compound Files or Structured Storage. However, I don't think the newer Office documents use these technologies. I think the Office file formats are similar to ZIP files with a different extension. If you change a file with a .docx extension to .zip and open it with your favorite compression tool, you'll see a bunch of folders and XML files.
Here are some links to descriptions of different file formats that create "files within files":
Zip file format
Compound File Binary Format (CFBF)
Structured Storage
Compound Document File Format
Office Open XML I: Exploring the Office Open XML Formats
At least on POSIX systems (e.g. Linux), a file is only a stream (i.e. a sequence) of bytes. And you can only grow (or shrink, i.e. truncate) it at the end - there is no way to insert bytes in the middle (without copying the rest).
You need some conventions, and some additional software, to handle it otherwise.
You might be interested in SQLite, which gives you a library to handle a single file (e.g. *.sqlite) as an SQL database.
You could also use GDBM - a library giving you some indexed file abstraction.
libtar is a library to manipulate tar archives. See also tardy, a tar file postprocessor.
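The "headers plus raw bytes" idea the asker settled on can be sketched in a few lines. The record layout below is an assumption invented for illustration (it is not ZIP, tar, or any existing format), and pack and unpack are hypothetical names; no compression is involved, so the container is slightly larger than its contents.

```python
import struct

# Hypothetical record layout, repeated once per file:
#   4-byte big-endian name length | name (UTF-8)
#   8-byte big-endian data length | data
_NAME_LEN = struct.Struct(">I")
_DATA_LEN = struct.Struct(">Q")


def pack(files):
    """Concatenate a {name: bytes} mapping into one container blob."""
    out = bytearray()
    for name, data in files.items():
        encoded = name.encode("utf-8")
        out += _NAME_LEN.pack(len(encoded)) + encoded
        out += _DATA_LEN.pack(len(data)) + data
    return bytes(out)


def unpack(blob):
    """Recover the {name: bytes} mapping from a container blob."""
    files, pos = {}, 0
    while pos < len(blob):
        (name_len,) = _NAME_LEN.unpack_from(blob, pos)
        pos += _NAME_LEN.size
        name = blob[pos:pos + name_len].decode("utf-8")
        pos += name_len
        (data_len,) = _DATA_LEN.unpack_from(blob, pos)
        pos += _DATA_LEN.size
        files[name] = blob[pos:pos + data_len]
        pos += data_len
    return files
```

Real formats add more on top of this, such as a magic number, checksums, and an index so that a single member can be found without scanning the whole file.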

How to remove .efs file extension from 1000's of recovered files in one folder

I recently recovered a 1.5TB external HDD that crashed. The program I used to recover the files was Active Undelete Enterprise, which is excellent. When the files were successfully recovered, they were all saved with a .efs extension, so files looked like mydocument.docx.efs. At first I thought they were encrypted and needed to be decrypted, but after spending 10 minutes on it I realized I just need to remove the .efs from the end of the filename and mydocument.docx works perfectly. The problem is that I now have over 55,000 files within hundreds of folders where I need to simply remove the .efs from the end of each file name. Does anyone know how to do this?
From a command prompt window, navigate to the top level directory where these files reside.
Type the command
DIR /S/B >>filelist.txt
This command will give you a bare format file listing of the current directory plus all nested subdirectories without any extraneous information. The list will be contained in the text file named "filelist.txt" or whatever else you choose to call it. I would then use this text file in a text editor to convert every line of text from, for example,
C:\Users\dlucas\.gimp-2.8\mathmap\file1.png.efs
to
rename c:\Users\dlucas\.gimp-2.8\mathmap\file1.png.efs file1.png
to give a simple example of a file that I just found on my system using this method.
You will need to use a text editor with a columnar editing capability since you have to modify so many files. Old programmer's editors such as CodeWright made this really simple, while modern editors such as Eclipse or Notepad++ make this a little more difficult and may require a columnar editing plugin, depending on version. You basically have to make a columnar copy of all of the text in the file, and then paste the copy off to the far right, far enough that a second column of filenames and paths won't overwrite any of the existing file names and paths. You can then use columnar editing features to select and delete the path names of the text in the 2nd column, since the rename command requires that the 2nd argument be simply the base filename and extension without the path information. You can use the columnar editing features to prepend every line with "RENAME ". If you attempt to do this without columnar editing features, you will find it slow going!
An alternate way to do this is to use a command formed from a "regular expression" to create the rename command. If you are not familiar with "regular expressions", ask a programmer friend as this is not an easy topic to learn from scratch. If you are familiar with regular expressions, this is probably the simplest way to perform this task. I haven't used them in many years and no longer recall the exact syntax to use or I would tell you myself.
Regardless of what kind of editor you use, the goal is to turn this ASCII file list of paths and filenames into a batch file (simply rename filelist.txt to filelist.bat when you are finished editing). You can then run the batch file by typing filelist.bat at a command prompt.
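If a scripting language is available, the same renaming can be done without building a batch file by hand. The following Python sketch is an alternative to the editor-based approach, not part of it. The function name strip_suffix is made up for illustration, and skipping a file when the target name already exists is a design choice here, not a requirement.

```python
import os


def strip_suffix(root, suffix=".efs"):
    """Remove suffix from every matching file name under root.

    Returns the number of files renamed. suffix is expected in lower case.
    """
    renamed = 0
    for folder, _subdirs, filenames in os.walk(root):
        for filename in filenames:
            if not filename.lower().endswith(suffix):
                continue
            source = os.path.join(folder, filename)
            target = os.path.join(folder, filename[:-len(suffix)])
            if os.path.exists(target):
                # Never overwrite an existing file; leave it for manual review.
                print("skipped (target exists):", source)
                continue
            os.rename(source, target)
            renamed += 1
    return renamed
```

Calling strip_suffix with the top-level recovery folder walks all nested subdirectories. Trying it on a copy of a small folder first is a sensible precaution.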
I have just run into this same problem myself using the same really wonderful tool that you used. I am writing this while waiting for the undelete program to finish. That it restores files with this extra extension seems very counterintuitive, so I will look for an option to make it not do this when it finishes. If I find one, I will post a new answer here that is more specific to this tool. Otherwise, I am going to have to rename all kazillion files just as you had to.
You experienced this problem because the disk that you recovered your files to "does not support encryption", according to the Active@ UNDELETE documentation. The documentation offers no further explanation of what kind of disks support encryption, etc.
They offer a Decrypt command that restores the files' proper names as a post-processing step. Unfortunately, this requires that you "include" each and every file to be decrypted, with no support for wildcards or for parsing subdirectories, so that is a non-starter, in my opinion, given that both of us have hundreds of thousands of files to be renamed.
I did find that when I selected a normal fixed (non-removable) hard drive as the destination of the recovery, the resulting files did not end up encrypted (i.e., they were recovered with the proper file name and extension). I originally chose a large USB flash drive, and the files were stored in their "encrypted" state (not really encrypted, but potentially so, hence the .efs extension). Of course, this meant that I had to run the recovery all over again after switching to a regular hard drive (it takes about 16 hours to recover 80GB worth of files due to the presence of many sector CRC errors).

How can I recover files from a corrupted .tar.gz archive?

I have a large number of files in a .tar.gz archive. Checking the file type with the command
file SMS.tar.gz
gives the response
gzip compressed data - deflate method , max compression
When I try to extract the archive with gunzip, after a delay I receive the message
gunzip: SMS.tar.gz: unexpected end of file
Is there any way to recover even part of the archive?
Recovery is possible but it depends on what caused the corruption.
If the file is just truncated, getting some partial result out is not too hard; just run
gunzip < SMS.tar.gz > SMS.tar.partial
which will give some output despite the error at the end.
If the compressed file has large missing blocks, it's basically hopeless after the bad block.
If the compressed file is systematically corrupted in small ways (e.g. transferring the binary file in ASCII mode, which smashes carriage returns and newlines throughout the file), it is possible to recover but requires quite a bit of custom programming. It's really only worth it if you have absolutely no other recourse (no backups) and the data is worth a lot of effort. (I have done it successfully.) I mentioned this scenario in a previous question.
The answers for .zip files differ somewhat, since zip archives have multiple separately compressed members, so there's more hope (though most commercial tools are rather bogus: they eliminate warnings by patching CRCs, not by recovering good data). But your question was about a .tar.gz file, which is an archive with one big member.
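For the truncated-file case, the same partial recovery can be sketched in Python with a streaming decompressor, which hands back whatever it could decode before the input ran out. The function name salvage_gzip and its parameters are hypothetical, and the sketch handles only the first gzip member of the file.

```python
import zlib


def salvage_gzip(src_path, dst_path, chunk_size=1 << 20):
    """Decompress as much of a damaged .gz file as possible.

    Returns the number of uncompressed bytes written to dst_path.
    """
    # wbits=31 tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(wbits=31)
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break  # input exhausted (possibly truncated)
            try:
                data = decomp.decompress(chunk)
            except zlib.error:
                break  # corrupt data: keep what was recovered so far
            dst.write(data)
            written += len(data)
            if decomp.eof:
                break  # clean end of the gzip member
    return written
```

The result is a partial tar file; tar can usually still extract the members that are complete before reporting an unexpected end of archive.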
Are you sure that it is a gzip file? I would first run 'file SMS.tar.gz' to validate that.
Then I would read the gzip Recovery Toolkit page.
Here is one possible scenario that we encountered. We had a tar.gz file that would not decompress; trying to unzip it gave the error:
gzip -d A.tar.gz
gzip: A.tar.gz: invalid compressed data--format violated
I figured out that the file may have been originally uploaded over a non-binary ftp connection (we don't know for sure).
The solution was relatively simple using the unix dos2unix utility:
dos2unix A.tar.gz
dos2unix: converting file A.tar.gz to UNIX format ...
gzip -d A.tar.gz
tar -xvf A.tar
file1.txt
file2.txt
....etc.
It worked!
This is one slim possibility, and maybe worth a try - it may help somebody out there.
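The same repair attempt can be sketched in Python, which also makes it easy to check whether the result actually decompresses. This is only an illustration of the idea: the function name undo_ascii_mode_damage is made up, and the repair works only when the original compressed data happened to contain no genuine CR LF pairs, which is why this remains a slim possibility.

```python
import gzip
import zlib


def undo_ascii_mode_damage(damaged):
    """Try to undo an LF -> CR LF translation of gzip data.

    Returns the decompressed bytes, or None if the repaired
    candidate is still not valid gzip data.
    """
    candidate = damaged.replace(b"\r\n", b"\n")
    try:
        return gzip.decompress(candidate)
    except (OSError, EOFError, zlib.error):
        # OSError covers gzip.BadGzipFile (bad header or CRC mismatch).
        return None
```

Because gzip.decompress verifies the CRC stored in the trailer, a non-None result is strong evidence that the repair really restored the original data.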