AWK to process compressed files and print the original (compressed) file names - awk

I would like to process multiple .gz files with gawk.
I was thinking of decompressing and passing it to gawk on the fly
but I have an additional requirement to also store/print the original file name in the output.
The thing is, there are hundreds of rather large .gz files to process.
I am looking for anomalies (~0.001% of rows) and want to print out the list of found inconsistencies ALONG with the file name and the row number that contained them.
If I could have all the files decompressed I would simply use the FILENAME variable to get this.
Because of the quantity and size of those files I can't decompress them upfront.
Any ideas how to pass the filename (in addition to the gzip stdout) to gawk to produce the required output?

Assuming you are looping over all the files and piping their decompression directly into awk, something like the following will work.
for file in *.gz; do
    gunzip -c "$file" | awk -v origname="$file" '.... {print origname " whatever"}'
done
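Since the question also asks for the row number, here is a slightly fuller sketch (the /anomaly/ pattern is just a placeholder for whatever check you actually run):
for file in *.gz; do
    # FNR is the record (row) number within the current input, i.e. the decompressed stream
    gunzip -c "$file" | awk -v origname="$file" '/anomaly/ { print origname ": row " FNR ": " $0 }'
done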
Edit: To use a list of filenames from some source other than a direct glob, something like the following can be used.
$ ls *.awk
a.awk e.awk
$ while IFS= read -r -d '' filename; do
    echo "$filename";
done < <(find . -name \*.awk -printf '%P\0')
e.awk
a.awk
To use xargs instead of the above loop will, I believe, require the body of the command to be in a pre-written script file, which can then be called with xargs and the filename (see the sketch below).
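A minimal sketch of that approach, assuming a hypothetical helper script named process_one.sh (the /anomaly/ pattern is again just a placeholder):
#!/bin/sh
# process_one.sh - hypothetical helper; $1 is a single .gz file
gunzip -c "$1" | awk -v origname="$1" '/anomaly/ { print origname ": row " FNR ": " $0 }'
Then call it once per file:
find . -name '*.gz' -print0 | xargs -0 -n1 ./process_one.sh > baddata.txt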

This is using a combination of xargs and sh (to be able to pipe two commands, gzip and awk):
find *.gz -print0 | xargs -0 -I fname sh -c 'gzip -dc fname | gawk -v origfile="fname" -f printbadrowsonly.awk >> baddata.txt'
I'm wondering if there's any bad practice with the above approach…
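One possible refinement (a sketch, not a verdict on the approach): with -I fname the filename is substituted directly into the sh -c command string, so a name containing quotes, spaces or $ could be interpreted by the inner shell. Passing it as a positional parameter avoids that:
find . -name '*.gz' -print0 |
    xargs -0 -n1 sh -c 'gzip -dc "$1" | gawk -v origfile="$1" -f printbadrowsonly.awk' sh >> baddata.txt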

Related

How to ls the first file (path included) of every subfolder recursively?

Suppose these are the files:
folder1/11.txt
folder1/12.txt
folder1/levela/11a1.txt
folder1/levela/11a2.txt
folder1/levela/levelb/11b1.txt
folder1/levela/levelb/11b2.txt
folder2/21.txt
folder2/22.txt
folder2/levela/21a1.txt
folder2/levela/21a2.txt
folder2/levela/levelb/21b1.txt
folder2/levela/levelb/21b2.txt
folder3/a/b/c/d/e/deepfile1.txt
folder3/a/b/c/d/e/deepfile2.txt
Is there a way (for example using ls, find or grep or any gnuwin32 commands) to show the 1st file from every subfolder please?
Desired output:
folder1/11.txt
folder1/levela/11a1.txt
folder1/levela/levelb/11b1.txt
folder2/21.txt
folder2/levela/21a1.txt
folder2/levela/levelb/21b1.txt
folder3/a/b/c/d/e/deepfile1.txt
Thank you.
Suggesting this solution:
find -type f -printf "%p %h\n"|sort --key 2.1,1.1|uniq --skip-fields=1|awk '{print $1}'
Explanation:
find -type f -printf "%p %h\n"
This find command searches for all regular files under the current directory and prints, for each file, its relative path, a space, and its relative folder.
Suggesting to run this command on your directory first to see the intermediate result.
sort --key 2.1,1.1
Sorts the files list lexicographically, by the 2nd field (the directory) and then the 1st field (the path).
The result is that all files are grouped by their directory.
Suggesting to try this:
find -type f -printf "%p %h\n"|sort --key 2.1,1.1
uniq --skip-fields=1
From the sorted files list, removes the lines having a duplicate directory (field #2), keeping only the first file per directory.
awk '{print $1}'
Prints only the first field, the file's relative path.
A bash script:
script.sh
#!/bin/bash
declare -A filesArr                           # declare associative array mapping directory -> file
for currFile in $(find "$1" -type f); do      # main loop: scan all files under $1
    currDir=$(dirname "$currFile")            # get the current file's directory
    if [[ -z ${filesArr["$currDir"]} ]]; then # if current directory is not stored in filesArr
        filesArr[$currDir]="$currFile"        # store the directory with the current file
    fi
    if [[ ${filesArr["$currDir"]} > "$currFile" ]]; then # if current file < stored file in array
        filesArr[$currDir]="$currFile"        # set the stored file to be the current file
    fi
done
for currFile in "${filesArr[@]}"; do          # loop over the array to output one file per directory
    echo "$currFile"
done
Running script.sh on /tmp folder
chmod a+x script.sh
./script.sh /tmp
BTW: the answer using sort and uniq is much faster.

s3cmd copy files preserving path

Is there a way to copy files to an S3 bucket while preserving the file path?
This is the example:
1. I produce a list of files that are different in bucket1 than in bucket2 using s3cmd sync --dry-run
The list looks like this:
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/definition/.content.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/nodetypes.cnd
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/properties.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/.content.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/.content.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/app-store/.content.xml
I need to process this list to upload only the files in the list to a new location in the bucket (e.g. s3://bucket/diff/), BUT with the full path as shown in the list.
A simple loop like this:
diff_file_list=$(s3cmd -c s3cfg sync --dry-run s3://BUCKET/20150831/PROD s3://BUCKET/20150831/DEV | awk '{print $2}')
for f in $diff_file_list; do
s3cmd -c s3cfg cp $f s3://BUCKET/20150831/DIFF/
done
does not work; it produces this:
File s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/definition/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/nodetypes.cnd copied to s3://BUCKET/20150831/DIFF/nodetypes.cnd
File s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/properties.xml copied to s3://BUCKET/20150831/DIFF/properties.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/origin-store/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
Thanks,
Short answer: no, it is not! That is because the paths in S3 buckets are not actually directories/folders, and an S3 bucket has no such concept of structure, even if various tools present it this way (including s3cmd, which is really confusing...).
So, the "path" is actually a prefix (although s3cmd sync to local knows how to translate this prefix into a directory structure on your filesystem).
For a bash script the solution is:
1. Create a file listing all the paths from a s3cmd sync --dry-run command (basically a list of diffs) => file1
2. Copy that file and use sed to modify the paths as needed:
sed -E 's/(^s3.*)PROD/\1DIFF/' file1 > file2
3. Merge the files so that line 1 in file1 is continued by line 1 in file2 and so on:
paste file1 file2 > final.txt
4. Read final.txt, line by line, in a loop and use each line as a set of 2 parameters to a copy or sync command:
while IFS='' read -r line || [[ -n "$line" ]]; do
    s3cmd -c s3cfg sync $line
done < "final.txt"
Notes:
1. $line in the s3cmd command must not be in quotes; if it is, the sync command will complain that it received only one parameter... of course!
2. The [[ -n "$line" ]] is used here so that the loop still processes the last line if the file has no trailing newline character.
Unfortunately Boto could not help more, so if you need something similar in Python you would do it pretty much the same way...
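Putting the steps above together, a minimal end-to-end sketch (assuming the same PROD/DIFF prefixes as in the example) might look like this:
# 1. list the differing objects (field 2 of the dry-run output is the source path)
s3cmd -c s3cfg sync --dry-run s3://BUCKET/20150831/PROD s3://BUCKET/20150831/DEV | awk '{print $2}' > file1
# 2. rewrite the prefix for the destination paths
sed -E 's/(^s3.*)PROD/\1DIFF/' file1 > file2
# 3. pair source and destination, then sync each pair
paste file1 file2 > final.txt
while IFS='' read -r line || [[ -n "$line" ]]; do
    s3cmd -c s3cfg sync $line   # intentionally unquoted: each line carries two parameters
done < final.txt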

SSH - Loop through lines from txt file and delete files

I have a .txt file and on each line is a different file location e.g.
file1.zip
file2.zip
file3.zip
How can I open that file, loop through each line and rm -f filename on each one?
Also, will deleting it throw an error if the file doesn't exist (has already been deleted) and if so how can I avoid this?
EDIT: The file names may have spaces in them, so this needs to be catered for as well.
You can use a for loop with cat to iterate through the lines:
IFS=$'\n'; \
for file in `cat list.txt`; do \
    if [ -f "$file" ]; then \
        rm -f "$file"; \
    fi; \
done
The if [ -f "$file" ] test checks that the file exists and is a regular file (not a directory). If the check fails, that entry is skipped.
The IFS=$'\n' at the top sets the delimiter to newlines only; this allows you to process file names containing whitespace.
xargs -n1 echo < test.txt
Replace echo with rm -f or any other command. You can also pipe the file in with cat test.txt | xargs -n1 rm -f. See man xargs for more info.
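Since the question mentions names with spaces, note that plain xargs splits on any whitespace (GNU xargs would need -d '\n' for that). A minimal line-by-line sketch, assuming the list is in list.txt, could be:
while IFS= read -r file; do
    # -f skips anything that is not an existing regular file, so already-deleted entries are ignored
    [ -f "$file" ] && rm -f "$file"
done < list.txt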

s3cmd: searching for files based on extension and delete from bucket

I have an S3 bucket with thousands of folders and many txt files inside those folders.
I would like to list all txt files inside the bucket so I can check if they're removable. Then remove them if they are.
Any idea how to do this with s3cmd?
This is fairly simple, but depends on how sophisticated you want the check to be. Suppose you wanted to remove every text file whose filename includes 'foo':
s3cmd --recursive ls s3://mybucket |
awk '{ print $4 }' | grep '\.txt$' | grep "foo" | xargs s3cmd del
If you want a more sophisticated check than grep can handle, just redirect the first three commands to a file, then either manually edit the file or use awk or perl or whatever your favorite tool is, then cat the output into s3cmd (depending on the check, you could do it all with piping, too):
s3cmd --recursive ls s3://mybucket | awk '{ print $4 }' | grep '\.txt$' > /tmp/textfiles
magic-command-to-check-filenames /tmp/textfiles
cat /tmp/textfiles | xargs s3cmd del

Check the total content size of a tar gz file

How can I extract the size of the total uncompressed file data in a .tar.gz file from command line?
This works for any file size:
zcat archive.tar.gz | wc -c
For files smaller than 4Gb you could also use the -l option with gzip:
$ gzip -l compressed.tar.gz
  compressed  uncompressed  ratio  uncompressed_name
         132         10240  99.1%  compressed.tar
This will sum the total content size of the extracted files:
$ tar tzvf archive.tar.gz | sed 's/ \+/ /g' | cut -f3 -d' ' | sed '2,$s/^/+ /' | paste -sd' ' | bc
The output is given in bytes.
Explanation: tar tzvf lists the files in the archive in verbose format like ls -l. sed and cut isolate the file size field. The second sed puts a + in front of every size except the first and paste concatenates them, giving a sum expression that is then evaluated by bc.
Note that this doesn't include metadata, so the disk space taken up by the files when you extract them is going to be larger - potentially many times larger if you have a lot of very small files.
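An equivalent, arguably simpler sketch that sums the same size column with awk alone (assuming GNU tar's verbose listing, where the size is field 3):
tar tzvf archive.tar.gz | awk '{ sum += $3 } END { print sum }'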
The command gzip -l archive.tar.gz doesn't work correctly with file sizes greater than 2Gb. I would recommend zcat archive.tar.gz | wc --bytes instead for really large files.
I know this is an old question, but I wrote a tool just for this two years ago. It’s called gzsize and it gives you the uncompressed size of a gzip'ed file without actually decompressing the whole file on disk:
$ gzsize <your file>
Use the following command:
tar -xzf archive.tar.gz --to-stdout | wc -c
I searched all over the web and couldn't find a solution to get the size when the file is bigger than 4GB.
First, which is fastest?
[oracle@base tmp]$ time zcat oracle.20180303.030001.dmp.tar.gz | wc -c
6667028480
real 0m45.761s
user 0m43.203s
sys 0m5.185s
[oracle@base tmp]$ time gzip -dc oracle.20180303.030001.dmp.tar.gz | wc -c
6667028480
real 0m45.335s
user 0m42.781s
sys 0m5.153s
[oracle@base tmp]$ time tar -tvf oracle.20180303.030001.dmp.tar.gz
-rw-r--r-- oracle/oinstall 111828 2018-03-03 03:05 oracle.20180303.030001.log
-rw-r----- oracle/oinstall 6666911744 2018-03-03 03:05 oracle.20180303.030001.dmp
real 0m46.669s
user 0m44.347s
sys 0m4.981s
The timings are all similar, but tar -tvf prints each member's size as soon as its header is read, so it can be cut short. How to cancel execution after getting the headers?
My solution is this:
[oracle@base tmp]$ time echo $(timeout --signal=SIGINT 1s tar -tvf oracle.20180303.030001.dmp.tar.gz | awk '{print $3}') | grep -o '[[:digit:]]*' | awk '{ sum += $1 } END { print sum }'
6667023572
real 0m1.005s
user 0m0.013s
sys 0m0.066s
A tar file is uncompressed until/unless it is filtered through another program, such as gzip, bzip2, lzip, compress, lzma, etc. The size of the tar file is roughly the sum of the extracted files, plus a 512-byte header per member and some end-of-archive padding (up to tar's record size) to make it a valid tarball.
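A small illustration of that overhead (a sketch; the exact archive size depends on tar's blocking settings):
printf 'hello\n' > f.txt   # 6 bytes of data
tar -cf f.tar f.txt
ls -l f.txt f.tar          # with GNU tar's default 10 KiB record size, f.tar is typically 10240 bytes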