I have a huge (500GB) gzipped tar file, and I want to extract all the files in it. The tar file is gzipped, but the files in it are not. The problem is that if I extract them all like this
tar xzf huge.tgz
then I run out of space.
Is there a way to simultaneously extract and gzip the files? I could write a script to do
tar tzf huge.tgz
and then extract each file and gzip it, one after the other. But I was hoping there might be a more efficient solution.
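The script I have in mind would be something like this (shown here against a small stand-in archive so it can run as-is; only one uncompressed file exists on disk at any moment, but each extraction rescans the archive from the start):

```shell
set -e
cd "$(mktemp -d)"

# Build a small stand-in for huge.tgz.
mkdir -p d
printf 'one\n' > d/a.txt
printf 'two\n' > d/b.txt
tar -czf huge.tgz d
rm -rf d

# List the archive, then extract and recompress one file at a time.
tar -tzf huge.tgz | while IFS= read -r f; do
  case $f in */) continue ;; esac   # skip directory entries
  tar -xzf huge.tgz "$f"            # rescans the archive each time
  gzip "$f"
done
```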
You would have to write a program that uses, for example, libarchive and zlib to extract entries and run them through gzip compression.
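Alternatively, GNU tar can run a filter per entry itself via `--to-command`, which avoids writing a custom program (GNU-specific; a sketch against a small stand-in archive). tar pipes each regular file's bytes to the command and exports the entry's path as `$TAR_FILENAME`, so no uncompressed copy of any file ever touches the disk:

```shell
set -e
cd "$(mktemp -d)"

# Small stand-in for huge.tgz (file members only).
mkdir -p data/sub
echo "hello" > data/a.txt
echo "world" > data/sub/b.txt
tar -czf demo.tgz data/a.txt data/sub/b.txt
rm -rf data

# Helper invoked once per archive entry: recreate the entry's
# directory, then compress its data stream straight to a .gz file.
cat > gzip-entry.sh <<'EOF'
#!/bin/sh
mkdir -p "$(dirname "$TAR_FILENAME")"
exec gzip > "$TAR_FILENAME.gz"
EOF
chmod +x gzip-entry.sh

tar -xzf demo.tgz --to-command=./gzip-entry.sh

gunzip -c data/a.txt.gz   # -> hello
```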
New to zipping files here. I used the following command to gzip a bunch of large files within a single directory:
tar -cvzf archive-RAW-MAFs.tar.gz RAW_MAFS/
When this was done, I noticed that it left the old directory tree where it was, and that the tar.gz was much larger. I'm not sure what the original size of the directory was as I didn't check it beforehand, but I think it was much larger than stated here...
-rw-r----- 1 xxx xxxx 21218045403 May 8 21:39 archive-RAW-MAFs.tar.gz
drwxr-s--- 34 xxx xxxx 4096 May 8 20:21 RAW_MAFS
I can also still traverse the original RAW_MAFS directory and open files. Ideally, I would like only the zipped file, because I don't need to touch this data again for a while and want to save as much space as I can.
I'll take the second question first.
The original files are still there because you haven't told tar to delete them. Add the --remove-files option to the command line to get tar to do what you want:
tar -cvzf archive-RAW-MAFs.tar.gz RAW_MAFS/ --remove-files
Regarding the size of the RAW_MAFS directory tree: if it hasn't been deleted yet, can you not just check it?
If the original files in RAW_MAFS are already compressed, then compressing again when you put them in your tar file will increase the size. Can you provide more details on what you are storing in the tar file?
If you are storing compressed files in the tar, try running without the z option.
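The point about double compression is easy to check. Already-compressed data is close to incompressible, and random bytes are a handy stand-in for it; gzipping them again only adds overhead:

```shell
set -e
cd "$(mktemp -d)"

# 1 MB of incompressible data as a stand-in for compressed files.
head -c 1000000 /dev/urandom > already.bin
gzip -k already.bin           # -k keeps the input (gzip >= 1.6)

# The .gz output is slightly LARGER than the input.
wc -c already.bin already.bin.gz
```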
I do not want to download the whole tar file from S3, just a select few files inside it.
Does S3 provide any API to do this, or is Apache Commons Compress my best bet?
While you'll be able to stream the file from S3, you'll still basically download it. The Apache Commons Compress library will help hide some of this and is a good solution. The other option would be to store the files individually in S3, rather than in a tar file, so that they can be randomly accessed.
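The streaming approach can be sketched locally; here the `cat` stands in for streaming the object (for example with `aws s3 cp s3://bucket/key -`, where the bucket and key are hypothetical). tar reads the stream and extracts only the named member, but it still has to consume the stream up to that member:

```shell
set -e
cd "$(mktemp -d)"

# Build a stand-in for the tar object stored in S3.
echo data > wanted.txt
echo more > other.txt
tar -cf objects.tar wanted.txt other.txt
rm wanted.txt other.txt

# Stream the archive and extract just one member from stdin.
cat objects.tar | tar -xf - wanted.txt
cat wanted.txt   # -> data
```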
Is there a way to get the uncompressed file size using gsutil? I've looked into du and ls -l. Both return the compressed size. I would like to avoid having to download the files to see their size.
gsutil provides only basic commands such as copying and listing files. I would suggest writing a Python script that reports the original size of the zipped files.
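One trick such a script could use (an assumption on my part, not something gsutil does for you): a gzip member stores its uncompressed size, modulo 2^32, in its last 4 bytes, little-endian. So reading just those 4 bytes is enough for single-member files under 4 GB; with a ranged read such as `gsutil cat -r -4 gs://bucket/file.gz` you would avoid downloading the object. A local demo of the byte trick:

```shell
set -e
cd "$(mktemp -d)"

# Make a file with a known uncompressed size, then gzip it.
head -c 123456 /dev/zero > sample.bin
gzip sample.bin

# Last 4 bytes = uncompressed size mod 2^32, little-endian.
# od -tu4 decodes them in host byte order (little-endian host assumed).
tail -c 4 sample.bin.gz | od -An -tu4   # -> 123456
```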
With gunzip it's simply zip -r archive.zip my_dir/.
I am failing to find an equivalent command for bunzip. The ones I found zip individual files inside a directory, but I want one .bzip2 archive.
gunzip is not zip. zip is an archiver which handles files and directories. gzip/gunzip only compresses a single file or stream of data.
bzip2 is just like gzip, and only compresses a single file or stream of data. For both gzip and bzip2, it is traditional to use tar as the archiving program and to compress its output. In fact, that is such a common idiom that tar has options to invoke gzip or bzip2 for you. See man tar.
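A minimal sketch of that idiom, assuming GNU tar and bzip2 are installed (`my_dir` stands in for your directory):

```shell
set -e
cd "$(mktemp -d)"
mkdir my_dir
echo hi > my_dir/file.txt

# -c create, -j filter the archive through bzip2, -f archive name.
tar -cjf archive.tar.bz2 my_dir/

# List the archive to verify the round trip.
tar -tjf archive.tar.bz2
```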
I have a program that only consumes uncompressed files. I have a couple of .gz files, and my goal is to feed the concatenation of them to that program. If I had a tar.gz file, I could mount it with the archivemount command.
I know I can concatenate the gz files:
cat a.gz b.gz > c.gz
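This works because the gzip format allows multiple members back to back in one file; decompressing yields the concatenated data:

```shell
set -e
cd "$(mktemp -d)"
printf 'a\n' | gzip > a.gz
printf 'b\n' | gzip > b.gz

# gzip treats back-to-back members as one stream.
cat a.gz b.gz > c.gz
gunzip -c c.gz   # prints a, then b
```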
But there is no way, that I am aware of, to mount a gz file. I don't have enough disk space to uncompress all of the files, and the tar command does not accept stdin as the input, so I cannot do this:
zcat *.gz | tar - | gzip > file.tar.gz
It is not clear what operations you need to perform on the tar.gz archive. But from what I can discern, tar.gz is not the format for this application. The entire archive stream is compressed by gzip, so you can't pull out or change a file without having to re-compress everything after it. The tar.gz stream can be specially prepared to keep the compression of each file independent, but then you might as well use the .zip format, which is better suited for random access and manipulation of individual files in the archive.
To address one of your comments, tar can in fact accept stdin as input. See pipe tar extract into tar create for some examples, where both GNU tar and BSD tar (with different syntax) can take in a tar file from stdin, delete entries, and write a new tar file to stdout.
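For example, with GNU tar the special name `-` means stdout when creating and stdin when extracting:

```shell
set -e
cd "$(mktemp -d)"
echo x > f.txt

tar -cf - f.txt > stream.tar   # -f - : write the archive to stdout
rm f.txt
cat stream.tar | tar -xf -     # -f - : read the archive from stdin
cat f.txt   # -> x
```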