How to unzip many .gz files into one file? - gzip

I have a folder which contains a lot of .gz files, each of them containing logs as text.
It's too troublesome to unzip and look through them one by one, so I'm wondering: is there any command to unzip the content of multiple .gz files into one single file?
Thanks

This is probably the command you want:
cat *.gz | gzip -dc - | grep los
First, cat *.gz sends all the compressed files to stdout.
With gzip, the -d switch decompresses and -c sends the output to stdout. The "-" tells gzip to read from stdin rather than from named files.
This output can then be piped to whatever program you want.
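If the goal is simply to collect the decompressed content of every archive into a single file, you can redirect instead of piping into grep; gzip also accepts a file list directly, so cat is not strictly needed. A minimal sketch (all_logs.txt is just an example output name):
gzip -dc *.gz > all_logs.txt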
If you want to know specifically which files have matches, you can do this too:
for f in *.gz
do echo "$f:" ;
gzip -dc "$f" | grep los ;
done
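As an aside, most Linux systems also ship a zgrep wrapper that searches compressed files directly and, when given several files, prefixes each match with the file it came from. A sketch, assuming GNU gzip's zgrep is installed:
zgrep los *.gz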

Related

How to get around 'Argument list too long' error when concatenating multiple gzip files?

I am trying to concatenate around 21,000 gzip files that are all located in a local directory, so I can unzip one large gzip file and then convert the unzipped file into a csv. Unfortunately this is over the maximum number of arguments that the cat command can accept. I have tried cat *gz > final.gz as well as ls *.gz | xargs cat, but both give me the error 'Argument list too long'. How might I work around this error to concatenate all the gzipped files?
You can try something like:
find . -name \*.gz -print0 | xargs -0 cat > final.gz
If there are .gz files in subdirectories of the current directory, and you only want the ones in the current directory, then add -maxdepth 1 to the find options.
If you want to impose a particular order on the files, you can pipe through an appropriate sort between the find and the xargs.
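Putting those two suggestions together, a sketch that stays in the current directory and concatenates the files in sorted order (assuming GNU find, sort and xargs for the NUL-delimited handling):
# exclude the output file itself, since the shell creates final.gz before find runs
find . -maxdepth 1 -name '*.gz' ! -name 'final.gz' -print0 | sort -z | xargs -0 cat > final.gz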

Extract huge tar.gz archives from S3 without copying archives to a local system

I'm looking for a way to extract a huge dataset (18 TB+, found here: https://github.com/cvdfoundation/open-images-dataset#download-images-with-bounding-boxes-annotations). With this in mind I need the process to be fast (i.e. I don't want to spend twice the time by first copying and then extracting the files), and I don't want the archives to take up extra space, not even one 20 GB+ archive.
Any thoughts on how one can achieve that?
If you can arrange to pipe the data straight into tar, it can uncompress and extract it without needing a temporary file.
Here is an example. First create a tar file to play with:
$ echo abc >one
$ echo def >two
$ tar cvf test.tar one two
one
two
$ gzip test.tar
Remove the test files
$ rm one two
$ ls one two
ls: cannot access one: No such file or directory
ls: cannot access two: No such file or directory
Now extract the contents by piping the compressed tar file into the tar command.
$ cat test.tar.gz | tar xzvf -
one
two
$ ls one two
one two
The only part missing now is how to download the data and pipe it into tar. Assuming you can access the URL with wget, you can get it to send the data to stdout, so you end up with this:
wget -qO- https://yourdata | tar xzvf -
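If the archives live in S3 and you have the AWS CLI configured, aws s3 cp can stream an object to stdout in the same way by using - as the destination. A sketch (the bucket and key are placeholders):
aws s3 cp s3://your-bucket/path/archive.tar.gz - | tar xzvf -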

s3cmd copy files preserving path

Is there a way to copy files to an S3 bucket while preserving the file path?
Here is an example:
I produce a list of files that are different in bucket1 than in bucket2 using s3cmd sync --dry-run.
The list looks like this:
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/definition/.content.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/nodetypes.cnd
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/properties.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/.content.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/.content.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/app-store/.content.xml
I need to process this list to upload to a new location in the bucket (e.g. s3://bucket/diff/) only the files in the list BUT with the full path as shown in the list.
A simple loop like this:
diff_file_list=$(s3cmd -c s3cfg sync --dry-run s3://BUCKET/20150831/PROD s3://BUCKET/20150831/DEV | awk '{print $2}')
for f in $diff_file_list; do
s3cmd -c s3cfg cp $f s3://BUCKET/20150831/DIFF/
done
does not work; it produces this:
File s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/definition/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/nodetypes.cnd copied to s3://BUCKET/20150831/DIFF/nodetypes.cnd
File s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/properties.xml copied to s3://BUCKET/20150831/DIFF/properties.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/origin-store/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
Thanks,
Short answer: no, it is not! That is because the paths in S3 buckets are not actually directories/folders; an S3 bucket has no such concept of structure, even though various tools present it that way (including s3cmd, which is really confusing...).
So the "path" is actually a prefix (although s3cmd sync to a local target knows how to translate this prefix into a directory structure on your filesystem).
For a bash script the solution is:
1. Create a file listing all the paths from an s3cmd sync --dry-run command (basically a list of diffs) => file1
2. Copy that file and use sed to modify the paths as needed:
sed 's/^\(s3.*\)PROD/\1DIFF/' file1 > file2
3. Merge the files so that line 1 of file1 is followed on the same line by line 1 of file2, and so on:
paste file1 file2 > final.txt
4. Read final.txt line by line in a loop and use each line as a set of two parameters to a copy or sync command:
while IFS='' read -r line || [[ -n "$line" ]]; do
s3cmd -c s3cfg sync $line
done < "final.txt"
Notes:
1. $line in the s3cmd command must not be in quotes; if it is, the sync command will complain that it received only one parameter... of course!
2. The [[ -n "$line" ]] is there so that read will not fail if the last line has no trailing newline character.
Unfortunately Boto could not help more here, so if you need something similar in Python you would do it in pretty much the same way...
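Putting the steps above together, a minimal sketch of the whole script (bucket names and prefixes are the ones from the question and may need adjusting):
# 1. list the differing objects
s3cmd -c s3cfg sync --dry-run s3://BUCKET/20150831/PROD s3://BUCKET/20150831/DEV | awk '{print $2}' > file1
# 2. rewrite PROD to DIFF in each path
sed 's/^\(s3.*\)PROD/\1DIFF/' file1 > file2
# 3. pair source and destination on each line
paste file1 file2 > final.txt
# 4. copy each pair ($line is intentionally unquoted so it expands to two arguments, see note 1)
while IFS='' read -r line || [[ -n "$line" ]]; do
    s3cmd -c s3cfg sync $line
done < final.txt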

AWK to process compressed files and print the original (compressed) file names

I would like to process multiple .gz files with gawk.
I was thinking of decompressing and passing it to gawk on the fly
but I have an additional requirement to also store/print the original file name in the output.
The thing is, there are hundreds of rather large .gz files to process.
I'm looking for anomalies (~0.001% of rows) and want to print out the list of found inconsistencies ALONG with the file name and row number that contained them.
If I could have all the files decompressed I would simply use the FILENAME variable to get this.
Because of the large quantity and size of those files I can't decompress them upfront.
Any ideas how to pass the filename (in addition to the gzip stdout) to gawk to produce the required output?
Assuming you are looping over all the files and piping their decompression directly into awk, something like the following will work.
for file in *.gz; do
gunzip -c "$file" | awk -v origname="$file" '.... {print origname " whatever"}'
done
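For the concrete requirement in the question (file name plus row number for every anomalous row), a sketch of what the awk body could look like; the NF != 5 test is only a hypothetical stand-in for whatever anomaly check you actually need:
for file in *.gz; do
    # NR is the row number within this file, since each awk run sees a single stream
    gunzip -c "$file" | awk -v origname="$file" 'NF != 5 { print origname ": row " NR ": " $0 }'
done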
Edit: To use a list of filenames from some source other than a direct glob something like the following can be used.
$ ls *.awk
a.awk e.awk
$ while IFS= read -d '' filename; do
echo "$filename";
done < <(find . -name \*.awk -printf '%P\0')
e.awk
a.awk
To use xargs instead of the above loop, I believe the body of the command needs to be in a pre-written script file, which can then be called with xargs and the filename.
This is using a combination of xargs and sh (to be able to use a pipe between the two commands, gzip and awk):
find *.gz -print0 | xargs -0 -I fname sh -c 'gzip -dc fname | gawk -v origfile="fname" -f printbadrowsonly.awk >> baddata.txt'
I'm wondering if there's any bad practice with the above approach…
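One common refinement, if you are worried about xargs substituting the raw file name into the shell command string via -I: pass the name as a positional parameter instead, so the inner shell quotes it for you. A sketch of the same pipeline:
find . -name '*.gz' -print0 |
    xargs -0 -n1 sh -c 'gzip -dc "$1" | gawk -v origfile="$1" -f printbadrowsonly.awk' _ >> baddata.txt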

Check the total content size of a tar gz file

How can I extract the size of the total uncompressed file data in a .tar.gz file from command line?
This works for any file size:
zcat archive.tar.gz | wc -c
For files smaller than 4 GB you could also use the -l option with gzip:
$ gzip -l compressed.tar.gz
compressed uncompressed ratio uncompressed_name
132 10240 99.1% compressed.tar
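If you only want the number itself, the second line of that output can be extracted with awk (a sketch, matching the sample above):
$ gzip -l compressed.tar.gz | awk 'NR == 2 {print $2}'
10240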
This will sum the total content size of the extracted files:
$ tar tzvf archive.tar.gz | sed 's/ \+/ /g' | cut -f3 -d' ' | sed '2,$s/^/+ /' | paste -sd' ' | bc
The output is given in bytes.
Explanation: tar tzvf lists the files in the archive in verbose format like ls -l. sed and cut isolate the file size field. The second sed puts a + in front of every size except the first and paste concatenates them, giving a sum expression that is then evaluated by bc.
Note that this doesn't include metadata, so the disk space taken up by the files when you extract them is going to be larger - potentially many times larger if you have a lot of very small files.
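As an aside, the same sum can be computed with a single awk invocation instead of the sed/cut/paste/bc chain (a sketch; field 3 of the verbose listing is the size):
$ tar tzvf archive.tar.gz | awk '{sum += $3} END {print sum}'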
The command gzip -l archive.tar.gz doesn't work correctly with file sizes greater than 2 GB. I would recommend zcat archive.tar.gz | wc --bytes instead for really large files.
I know this is an old answer, but I wrote a tool just for this two years ago. It's called gzsize and it gives you the uncompressed size of a gzipped file without actually decompressing the whole file on disk:
$ gzsize <your file>
Use the following command:
tar -xzf archive.tar.gz --to-stdout | wc -c
I searched all over the web and couldn't find a way to get the size when the file is bigger than 4 GB.
First, which is fastest?
[oracle@base tmp]$ time zcat oracle.20180303.030001.dmp.tar.gz | wc -c
6667028480
real 0m45.761s
user 0m43.203s
sys 0m5.185s
[oracle@base tmp]$ time gzip -dc oracle.20180303.030001.dmp.tar.gz | wc -c
6667028480
real 0m45.335s
user 0m42.781s
sys 0m5.153s
[oracle@base tmp]$ time tar -tvf oracle.20180303.030001.dmp.tar.gz
-rw-r--r-- oracle/oinstall 111828 2018-03-03 03:05 oracle.20180303.030001.log
-rw-r----- oracle/oinstall 6666911744 2018-03-03 03:05 oracle.20180303.030001.dmp
real 0m46.669s
user 0m44.347s
sys 0m4.981s
Clearly tar -tvf is the most useful here, since it prints the per-file sizes as soon as it reaches each header, but how do I cancel the execution once the header lines have been printed?
My solution is this:
[oracle@base tmp]$ time echo $(timeout --signal=SIGINT 1s tar -tvf oracle.20180303.030001.dmp.tar.gz | awk '{print $3}') | grep -o '[[:digit:]]*' | awk '{ sum += $1 } END { print sum }'
6667023572
real 0m1.005s
user 0m0.013s
sys 0m0.066s
A tar file is uncompressed until/unless it is filtered through another program, such as gzip, bzip2, lzip, compress, lzma, etc. The size of the tar file is roughly the same as the total size of the extracted files, plus a 512-byte header per file and some padding added to make it a valid tarball.