Check the total content size of a tar.gz file - gzip

How can I extract the size of the total uncompressed file data in a .tar.gz file from the command line?

This works for any file size:
zcat archive.tar.gz | wc -c
For files smaller than 4 GB you could also use the -l option with gzip:
$ gzip -l compressed.tar.gz
         compressed        uncompressed  ratio uncompressed_name
                132               10240  99.1% compressed.tar
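For background (not from the answers here): the figure gzip -l reports comes from the ISIZE field in the gzip trailer, which stores the uncompressed size modulo 2^32, which is exactly why it becomes unreliable past 4 GB. A minimal sketch that reads that field directly (same 4 GB caveat; assumes GNU tail/od):
$ printf '%d\n' 0x$(tail -c4 compressed.tar.gz | od -An -tx1 | tr -d ' \n' | sed 's/\(..\)\(..\)\(..\)\(..\)/\4\3\2\1/')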

This will sum the total content size of the extracted files:
$ tar tzvf archive.tar.gz | sed 's/ \+/ /g' | cut -f3 -d' ' | sed '2,$s/^/+ /' | paste -sd' ' | bc
The output is given in bytes.
Explanation: tar tzvf lists the files in the archive in verbose format like ls -l. sed and cut isolate the file size field. The second sed puts a + in front of every size except the first and paste concatenates them, giving a sum expression that is then evaluated by bc.
Note that this doesn't include metadata, so the disk space taken up by the files when you extract them is going to be larger - potentially many times larger if you have a lot of very small files.
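If you prefer to let awk do the arithmetic, an equivalent one-liner (a sketch, not from the original answer; the size is field 3 of GNU tar's verbose listing):
$ tar tzvf archive.tar.gz | awk '{ sum += $3 } END { print sum }'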

The command gzip -l archive.tar.gz doesn't report sizes correctly for files larger than 2 GB. I would recommend zcat archive.tar.gz | wc --bytes instead for really large files.

I know this is an old question, but I wrote a tool just for this two years ago. It's called gzsize and it gives you the uncompressed size of a gzipped file without actually decompressing the whole file to disk:
$ gzsize <your file>

Use the following command:
tar -xzf archive.tar.gz --to-stdout | wc -c

I searched sites all over the web, and none of them solved this problem of getting the size when the file is bigger than 4 GB.
First, which is the fastest?
[oracle@base tmp]$ time zcat oracle.20180303.030001.dmp.tar.gz | wc -c
6667028480
real 0m45.761s
user 0m43.203s
sys 0m5.185s
[oracle@base tmp]$ time gzip -dc oracle.20180303.030001.dmp.tar.gz | wc -c
6667028480
real 0m45.335s
user 0m42.781s
sys 0m5.153s
[oracle@base tmp]$ time tar -tvf oracle.20180303.030001.dmp.tar.gz
-rw-r--r-- oracle/oinstall 111828 2018-03-03 03:05 oracle.20180303.030001.log
-rw-r----- oracle/oinstall 6666911744 2018-03-03 03:05 oracle.20180303.030001.dmp
real 0m46.669s
user 0m44.347s
sys 0m4.981s
Clearly tar -tvf prints the sizes as soon as it reads each header, without waiting for the command to finish, but how do I cancel the execution once the header has been read?
My solution is this:
[oracle@base tmp]$ time echo $(timeout --signal=SIGINT 1s tar -tvf oracle.20180303.030001.dmp.tar.gz | awk '{print $3}') | grep -o '[[:digit:]]*' | awk '{ sum += $1 } END { print sum }'
6667023572
real 0m1.005s
user 0m0.013s
sys 0m0.066s
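A slightly simpler variant of the same idea (a sketch; it just sums field 3 of whatever listing tar manages to print before the timeout interrupts it):
[oracle@base tmp]$ timeout --signal=SIGINT 1s tar -tvf oracle.20180303.030001.dmp.tar.gz | awk '{ sum += $3 } END { print sum }'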

A tar file is uncompressed until/unless it is filtered through another program, such as gzip, bzip2, lzip, compress, lzma, etc. The size of the tar file is roughly the sum of the sizes of the files it contains, plus a 512-byte header per file, padding of each file's data to a 512-byte boundary, and some zero-filled end-of-archive padding to make it a valid tarball.
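For a concrete sense of that overhead (a sketch; the exact number depends on your tar's blocking factor, 10 KiB by default for GNU tar):
$ echo abc > one
$ tar cf one.tar one
$ wc -c < one.tar
Typically this prints 10240: a 512-byte header plus one 512-byte data block for the tiny file, two end-of-archive blocks, padded up to the default 10 KiB record size.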

Related

Extract huge tar.gz archives from S3 without copying archives to a local system

I'm looking for a way to extract a huge dataset (18 TB+, found here: https://github.com/cvdfoundation/open-images-dataset#download-images-with-bounding-boxes-annotations). With this in mind, I need the process to be fast (i.e. I don't want to spend twice the time by first copying and then extracting the files), and I don't want the archives to take up extra space, not even one 20 GB+ archive.
Any thoughts on how one can achieve that?
If you can arrange to pipe the data straight into tar, it can uncompress and extract it without needing a temporary file.
Here is an example. First, create a tar file to play with:
$ echo abc >one
$ echo def >two
$ tar cvf test.tar one two
one
two
$ gzip test.tar
Remove the test files
$ rm one two
$ ls one two
ls: cannot access one: No such file or directory
ls: cannot access two: No such file or directory
Now extract the contents by piping the compressed tar file into the tar command.
$ cat test.tar.gz | tar xzvf -
one
two
$ ls one two
one two
The only part missing now is how to download the data and pipe it into tar. Assuming you can access the URL with wget, you can get it to send the data to stdout. So you end up with this:
wget -qO- https://youtdata | tar xzvf -
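Since the archives in the question live on S3, the same streaming idea works with the AWS CLI, which can write an object to stdout when - is given as the destination (a sketch; the bucket and key names are placeholders and aws is assumed to be already configured):
$ aws s3 cp s3://my-bucket/images.tar.gz - | tar xzvf -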

Piping output to local machine in a loop

I am trying to read some files located on a server and write only a certain number of columns from those files onto my local machine. I tried to do this in a for loop to avoid entering my password for each file. Below is what I was able to cobble together till now.
The following code works but writes all the output to a single file which is not manageable due to its large size.
ssh user@xx.xxx.xxx.xx 'for loc in /hel/insur/*/201701*; do zcat $loc | grep -v NUMBER | awk -F',' -v OFS="," '\''{print $1,$2,$3,$4,$5}'\'' | gzip; done' > /cygdrive/c/Users/user1/Desktop/test/singlefile.csv.gz
So, I tried to write each file individually as shown below, but it gives me an error saying that it cannot find the location (possibly because I am ssh'ed into the remote server).
ssh user@xx.xxx.xxx.xx 'for loc in /hel/insur/*/201701*; do zcat $loc | grep -v NUMBER | awk -F',' -v OFS="," '\''{print $1,$2,$3,$4,$5}'\'' | gzip > /cygdrive/c/Users/user1/Desktop/test/`echo $loc | cut -c84-112` ; done'
Any ideas on how to solve this?
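One possible workaround (a sketch, not from the thread): build the file list first, then run the loop on the local machine so the redirection happens locally, and reuse a single SSH connection via ControlMaster so the password is only entered once. The ControlPath location is arbitrary, basename stands in for the cut -c84-112 trick, and the remote paths are assumed to contain no spaces.
# Open a master connection once (password is asked only here); keep it alive for 10 minutes.
ssh -o ControlMaster=auto -o ControlPath=/tmp/ssh-cm-%r@%h-%p -o ControlPersist=10m user@xx.xxx.xxx.xx true
# Loop locally over the remote file list; each ssh reuses the master connection.
for loc in $(ssh -o ControlPath=/tmp/ssh-cm-%r@%h-%p user@xx.xxx.xxx.xx 'ls /hel/insur/*/201701*'); do
    ssh -o ControlPath=/tmp/ssh-cm-%r@%h-%p user@xx.xxx.xxx.xx \
        "zcat $loc | grep -v NUMBER | awk -F, -v OFS=, '{print \$1,\$2,\$3,\$4,\$5}' | gzip" \
        > /cygdrive/c/Users/user1/Desktop/test/$(basename "$loc").csv.gz
done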

How to unzip many .gz files to one same file?

I have a folder which contains a lot of .gz files, each of them containing los as text.
It's too troublesome to unzip and look through them one by one. So I'm wondering: is there any command to unzip the content of multiple .gz files into the same file?
Thanks
This is probably the command you want:
cat *.gz | gzip -dc - | grep los
First, cat *.gz sends all the zipped files to stdout.
With gzip, the -d switch decompresses and -c sends the output to stdout. The "-" makes gzip read from stdin rather than from files.
Then this output can be piped to whatever program you want.
If you want to know specifically which files have matches, you can do this too:
for f in *.gz; do
    echo "$f:"
    gzip -dc "$f" | grep los
done
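For reference, zgrep can usually do both steps in one go; when given several files it typically prefixes each match with the file it came from (a sketch):
$ zgrep los *.gz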

AWK to process compressed files and printing original (compressed) file names

I would like to process multiple .gz files with gawk.
I was thinking of decompressing and passing it to gawk on the fly
but I have an additional requirement to also store/print the original file name in the output.
The thing is, there are hundreds of rather large .gz files to process.
I'm looking for anomalies (~0.001% of rows) and want to print out the list of found inconsistencies ALONG with the file name and row number that contained them.
If I could have all the files decompressed I would simply use FILENAME variable to get this.
Because of the quantity and size of those files, I can't decompress them all upfront.
Any ideas how to pass the filename (in addition to the gzip stdout) to gawk to produce the required output?
Assuming you are looping over all the files and piping their decompression directly into awk, something like the following will work.
for file in *.gz; do
    gunzip -c "$file" | awk -v origname="$file" '.... {print origname " whatever"}'
done
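Since each gunzip stream starts a fresh awk process, NR is the row number within that file, so the row-number requirement from the question is covered too. A sketch (the /anomaly/ pattern is a placeholder for the real consistency check):
for file in *.gz; do
    gunzip -c "$file" |
        awk -v origname="$file" '/anomaly/ {print origname ": row " NR ": " $0}'
done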
Edit: To use a list of filenames from some source other than a direct glob, something like the following can be used.
$ ls *.awk
a.awk e.awk
$ while IFS= read -d '' filename; do
echo "$filename";
done < <(find . -name \*.awk -printf '%P\0')
e.awk
a.awk
Using xargs instead of the above loop will, I believe, require the body of the command to be in a pre-written script file, which can then be called with xargs and the filename.
This uses a combination of xargs and sh (to be able to pipe two commands, gzip and awk):
find *.gz -print0 | xargs -0 -I fname sh -c 'gzip -dc fname | gawk -v origfile="fname" -f printbadrowsonly.awk >> baddata.txt'
I'm wondering if there's any bad practice with the above approach…

Utilizing multi core for tar+gzip/bzip compression/decompression

I normally compress using tar zcvf and decompress using tar zxvf (using gzip due to habit).
I've recently gotten a quad core CPU with hyperthreading, so I have 8 logical cores, and I notice that many of the cores are unused during compression/decompression.
Is there any way I can utilize the unused cores to make it faster?
You can also use the tar flag "--use-compress-program=" to tell tar what compression program to use.
For example use:
tar -c --use-compress-program=pigz -f tar.file dir_to_zip
You can use pigz instead of gzip, which does gzip compression on multiple cores. Instead of using the -z option, you would pipe it through pigz:
tar cf - paths-to-archive | pigz > archive.tar.gz
By default, pigz uses the number of available cores, or eight if it could not query that. You can ask for more with -p n, e.g. -p 32. pigz has the same options as gzip, so you can request better compression with -9. E.g.
tar cf - paths-to-archive | pigz -9 -p 32 > archive.tar.gz
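For decompression the same piping works in reverse (a sketch; note that the gzip format limits pigz to a single core for the actual decompression, with extra threads used only for reading, writing and checksumming):
$ pigz -dc archive.tar.gz | tar xf -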
Common approach
There is an option for the tar program:
-I, --use-compress-program PROG
filter through PROG (must accept -d)
You can use a multithreaded version of the archiver or compressor utility.
The most popular multithreaded compressors are pigz (instead of gzip) and pbzip2 (instead of bzip2). For instance:
$ tar -I pbzip2 -cf OUTPUT_FILE.tar.bz2 paths_to_archive
$ tar --use-compress-program=pigz -cf OUTPUT_FILE.tar.gz paths_to_archive
The compressor must accept -d. If your replacement utility doesn't have this parameter and/or you need to specify additional parameters, then use pipes (add parameters if necessary):
$ tar cf - paths_to_archive | pbzip2 > OUTPUT_FILE.tar.bz2
$ tar cf - paths_to_archive | pigz > OUTPUT_FILE.tar.gz
The input and output of the single-threaded and multithreaded versions are compatible: you can compress with the multithreaded version and decompress with the single-threaded one, and vice versa.
p7zip
For compression with p7zip you need a small shell script like the following:
#!/bin/sh
case $1 in
-d) 7za -txz -si -so e;;
*) 7za -txz -si -so a .;;
esac 2>/dev/null
Save it as 7zhelper.sh. Here is an example of usage:
$ tar -I 7zhelper.sh -cf OUTPUT_FILE.tar.7z paths_to_archive
$ tar -I 7zhelper.sh -xf OUTPUT_FILE.tar.7z
xz
Regarding multithreaded XZ support: if you are running version 5.2.0 or above of XZ Utils, you can utilize multiple cores for compression by setting -T or --threads to an appropriate value via the environment variable XZ_DEFAULTS (e.g. XZ_DEFAULTS="-T 0").
This is a fragment of the man page for version 5.1.0alpha:
Multithreaded compression and decompression are not implemented yet, so this
option has no effect for now.
However, this will not work for decompression of files that haven't also
been compressed with threading enabled. From the man page for version 5.2.2:
Threaded decompression hasn't been implemented yet. It will only work
on files that contain multiple blocks with size information in
block headers. All files compressed in multi-threaded mode meet this
condition, but files compressed in single-threaded mode don't even if
--block-size=size is used.
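For completeness, a concrete multithreaded example with xz 5.2.0 or later (a sketch; -T0 lets xz pick the thread count, and -J is GNU tar's built-in xz filter):
$ tar cf - paths_to_archive | xz -T0 > OUTPUT_FILE.tar.xz
$ XZ_DEFAULTS="-T 0" tar -cJf OUTPUT_FILE.tar.xz paths_to_archive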
Recompiling with replacement
If you build tar from source, then you can recompile with the parameters
--with-gzip=pigz
--with-bzip2=lbzip2
--with-lzip=plzip
After recompiling tar with these options you can check the output of tar's help:
$ tar --help | grep "lbzip2\|plzip\|pigz"
-j, --bzip2 filter the archive through lbzip2
--lzip filter the archive through plzip
-z, --gzip, --gunzip, --ungzip filter the archive through pigz
You can use the shortcut -I for tar's --use-compress-program switch, and invoke pbzip2 for bzip2 compression on multiple cores:
tar -I pbzip2 -cf OUTPUT_FILE.tar.bz2 DIRECTORY_TO_COMPRESS/
If you want to have more flexibility with filenames and compression options, you can use:
find /my/path/ -type f \( -name "*.sql" -o -name "*.log" \) -exec \
tar -P --transform='s#/my/path/##g' -cf - {} + | \
pigz -9 -p 4 > myarchive.tar.gz
Step 1: find
find /my/path/ -type f \( -name "*.sql" -o -name "*.log" \) -exec
This command will look for the files you want to archive, in this case /my/path/*.sql and /my/path/*.log. The \( ... \) grouping is needed so that -type f and -exec apply to every -name pattern; add as many -o -name "pattern" as you want.
-exec will execute the next command using the results of find: tar
Step 2: tar
tar -P --transform='s#/my/path/##g' -cf - {} +
--transform is a simple string replacement parameter. It strips the path of the files from the archive so the tarball's root becomes the current directory when extracting. Note that you can't use the -C option to change directory, as you'd lose the benefit of find: all files of the directory would be included.
-P tells tar to use absolute paths, so it doesn't trigger the warning "Removing leading `/' from member names". The leading '/' will be removed by --transform anyway.
-cf - tells tar to write the archive to stdout (the actual file name comes from the redirection in step 4).
{} + passes all the files that find found to a single tar invocation.
Step 3: pigz
pigz -9 -p 4
Use as many parameters as you want.
In this case -9 is the compression level and -p 4 is the number of cores dedicated to compression.
If you run this on a heavily loaded webserver, you probably don't want to use all available cores.
Step 4: archive name
> myarchive.tar.gz
Finally.
A relatively newer (de)compression tool you might want to consider is zstandard. It does an excellent job of utilizing spare cores, and it has made some great trade-offs when it comes to compression ratio vs. (de)compression time. It is also highly tweak-able depending on your compression ratio needs.
Here is an example of tar with the modern zstd compressor, as finding good examples for this one was difficult:
apt command to install the zstd and pv utilities on Ubuntu
Compress multiple files and folders (zstd command alone can only do single files)
Display progress using pv - shows the total bytes compressed and compression speed GB/sec real-time
Use all physical cores with -T0
Set compression level higher than the default with -8
Display the resulting wall clock and CPU time used after the operation is finished using time
apt install zstd pv
DATA_DIR=/path/to/my/folder/to/compress
TARGET=/path/to/my/archive.tar.zst
time (cd $DATA_DIR && tar -cf - * | pv | zstd -T0 -8 -o $TARGET)
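To extract the resulting archive later (a sketch; tar passes -d to the program given with -I, and recent GNU tar versions also understand --zstd directly):
tar -I zstd -xf $TARGET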