Trying to modify awk code - awk

awk 'BEGIN{OFS=","} FNR == 1
{if (NR > 1) {print fn,fnr,nl}
fn=FILENAME; fnr = 1; nl = 0}
{fnr = FNR}
/ERROR/ && FILENAME ~ /\.gz$/ {nl++}
{
cmd="gunzip -cd " FILENAME
cmd; close(cmd)
}
END {print fn,fnr,nl}
' /tmp/appscraps/* > /tmp/test.txt
The above scans all files in a given directory and prints the file name, the number of lines in each file, and the number of lines containing 'ERROR'.
I'm now trying to make the script execute a command if any of the files it reads in isn't a regular file. I.e., if the file is a gzip file, then run a particular command.
Above is my attempt to include the gunzip command and do it on my own. Unfortunately, it isn't working. Also, I cannot "gunzip" all the files in the directory beforehand, because not all files in the directory will be gzip type; some will be regular files.
So I need the script to treat any .gz file it finds differently, so it can read it, count and print the number of lines in it, and the number of lines matching the supplied pattern (just as it would if the file had been a regular file).
any help?

This part of your script makes no sense:
{if (NR > 1) {print fn,fnr,nl}
fn=FILENAME; fnr = 1; nl = 0}
{fnr = FNR}
/ERROR/ && FILENAME ~ /\.gz$/ {nl++}
Let me restructure it a bit and comment it so it's clearer what it does:
{ # for every line of every input file, do the following:
# If this is the 2nd or subsequent line, print the values of these variables:
if (NR > 1) {
print fn,fnr,nl
}
fn = FILENAME # set fn to FILENAME. Since this will occur for the first line of
# every file, this is the value fn will have when printed above,
# so why not just get rid of fn and print FILENAME?
fnr = 1 # set fnr to 1. This is immediately overwritten below by
# setting it to FNR so this is pointless.
nl = 0
}
{ # for every line of every input file, also do the following
# (note the unnecessary "}" then "{" above):
fnr = FNR # set fnr to FNR. Since this will occur for the first line of
# every file, this is the value fnr will have when printed above,
# so why not just get rid of fnr and print FNR-1?
}
/ERROR/ && FILENAME ~ /\.gz$/ {
nl++ # increment the value of nl. Since nl is always set to zero above,
# this will only ever set it to 1, so why not just set it to 1?
# I suspect the real intent is to NOT set it to zero above.
}
You also have the code above testing for a file name that ends in ".gz" but then you're running gunzip on every file in the very next block.
Beyond that, just call gunzip from shell as everyone else also suggested. awk is a tool for parsing text; it's not an environment from which to call other tools - that's what a shell is for.
For example, assuming your comment ("prints the file name, number of lines in each file and number of lines found containing 'ERROR'") accurately describes what you want your awk script to do, and assuming it makes sense to test for the word "ERROR" directly in a ".gz" file using awk:
for file in /tmp/appscraps/*.gz
do
awk -v OFS=',' '/ERROR/{nl++} END{print FILENAME, NR+0, nl+0}' "$file"
gunzip -cd "$file"
done > /tmp/test.txt
Much clearer and simpler, isn't it?
If it doesn't make sense to test for the word ERROR directly in a ".gz" file, then you can do this instead:
for file in /tmp/appscraps/*.gz
do
zcat "$file" | awk -v file="$file" -v OFS=',' '/ERROR/{nl++} END{print file, NR+0, nl+0}'
gunzip -cd "$file"
done > /tmp/test.txt
To handle gz and non-gz files as you've now described in your comment below:
for file in /tmp/appscraps/*
do
case $file in
*.gz ) cmd="zcat" ;;
* ) cmd="cat" ;;
esac
"$cmd" "$file" |
awk -v file="$file" -v OFS=',' '/ERROR/{nl++} END{print file, NR+0, nl+0}'
done > /tmp/test.txt
I left out the gunzip since you don't need it as far as I can tell from your stated requirements. If I'm wrong, explain what you need it for.

I think it could be simpler than that.
With shell expansion you already have the file name (hence you can print it).
So you can do a loop over all the files, and for each do the following:
print the file name
zgrep -c ERROR $file (this outputs the number of lines containing 'ERROR')
zcat $file|wc -l (this outputs the number of lines)
zgrep and zcat work on both plain text files and gzipped ones.
Assuming you don't have any spaces in the paths/filenames:
for f in /tmp/appscraps/*
do
n_lines=$(zcat "$f"|wc -l)
n_errors=$(zgrep -c ERROR "$f")
echo "$f $n_lines $n_errors"
done
This is untested but it should work.

You can execute the following command for each file:
gunzip -t FILENAME; echo $?
It will print the exit code: 0 (for gzip files) or 1 (corrupt/other file). Now you can compare the output using an if to execute the required processing.
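A minimal sketch of that idea, branching on gunzip -t's exit status directly in an if; pairing zcat with gzip files and cat with regular files is my assumption about how the two cases should be read:

```shell
#!/bin/sh
# For each file, gunzip -t exits 0 for valid gzip data and non-zero
# otherwise, so it can select the reader. Its error message for
# non-gzip files goes to stderr, hence the 2>/dev/null.
for f in /tmp/appscraps/*; do
    if gunzip -t "$f" 2>/dev/null; then
        reader=zcat        # gzip file: decompress on the fly
    else
        reader=cat         # regular file: read as-is
    fi
    "$reader" "$f" | awk -v file="$f" -v OFS=',' '
        /ERROR/ { nl++ }
        END     { print file, NR+0, nl+0 }'
done > /tmp/test.txt
```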

Related

How can I send the output of an AWK script to a file?

Within an AWK script, I'm needing to send the output of the script to a file while also printing it to the terminal. Is there a nice and tidy way I can do this without having a copy of every print redirect to the file?
I'm not particularly good at making SSCCE examples but here's my attempt at demonstrating my problem;
BEGIN{
print "This is an awk script"
# I don't want to have to do this for every print
print "This is an awk script" > "thisiswhack.out"
}
{
# data manip. stuff here
# ...
printf "%s %s %s\n", blah, blah, blah
# I don't want to have to do this for every print again
printf "%s %s %s\n", blah, blah, blah >> "thisiswhack.out"
}
END{
print "Yay we're done!"
# Seriously, there has to be a better way to do this within the script
print "Yay we're done!" >> "thisiswhack.out"
}
Surely there must be a way to send the entire output of the script to an output file within the script itself, right?
The command to duplicate streams is tee, and we can use it inside awk:
awk '
BEGIN {tee = "tee out.txt"}
{print | tee}' in.txt
This invokes tee with the file argument out.txt, and opens a stream to this command.
The stream (and therefore tee) remains open until awk exits, or close(tee) is called.
Every time print | tee is used, the data is printed to that stream. tee then appends this data both to the file out.txt, and stdout.
The | command feature is POSIX awk. Also, the tee variable isn't compulsory; you can use the command string directly.
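For instance, the variable can be dropped and the command string written inline. awk keys the open stream on the exact command string, so repeated prints reuse the same tee process:

```shell
# Same effect as the version above, without the intermediate variable
awk '{ print $0 | "tee out.txt" }' in.txt
```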
Of course, we can use tee outside awk too: awk ... | tee out.txt.
GNU AWK's Redirection allows sending output to a command rather than a file, so I suggest the following use of that feature:
awk 'BEGIN{command="tee output.txt"}{print tolower($0) | command}' input.txt
Note: I use tolower($0) for demonstration purposes. I redirect print into the tee command, which writes both to the mentioned file and to standard output, so you should get a lowercase version of input.txt written to output.txt and to standard output.
If you are not confined to a single awk invocation then you might alternatively use tee outside, like so:
awk '{print tolower($0)}' input.txt | tee output.txt
awk '
function prtf(str) {
printf "%s", str > "thisiswhack.out"
printf "%s", str
fflush()
}
function prt(str) {
prtf( str ORS )
}
{
# to print adding a newline at the end:
prt( "foo" )
# to print as-is without adding a newline:
prtf( sprintf("%s, %s, %d", $2, "bar", 17) )
}
' file
In the above we are not spawning a subshell to call any other command so it's efficient, and we're using fflush() after every print to ensure both output streams (stdout and the extra file) don't get out of sync with respect to each other (e.g. stdout displays less text than the file or vice-versa if the command is killed).
The above always overwrites the contents of "thisiswhack.out" with whatever the script outputs. If you want to append instead then change > to >>. If you want the option of doing both, introduce a variable (which I've named prtappend below) to control it which you can set on the command line, e.g. change:
printf "%s", str > "thisiswhack.out"
to:
printf "%s", str >> "thisiswhack.out"
and add:
BEGIN {
if ( !prtappend ) {
printf "" > "thisiswhack.out"
}
}
then if you do awk -v prtappend=1 '...' it'll append to thisiswhack.out instead of overwriting it.
Of course, the better approach if you're on a Unix system is to have your awk script called from a shell script with its output piped to tee, e.g.:
#!/usr/bin/env bash
awk '
{
print "foo"
printf "%s, %s, %d", $2, "bar", 17
}
' "${@:--}" |
tee 'thisiswhack.out'
Note that this is one more example of why you should not call awk from a shebang.

An `awk` method to print after the last `/` on every line of a text file

I'm in need of an awk method to print after the last / on every line of a text file. I've got a txt file which includes file directories.
~/2500andMore.txt file
/Volumes/MCU/_ 0 _ Mt Cabins Ut/Raw/Log Cabin on The Stream/DJI_0003.JPG
/Volumes/MCU/_ 0 _ Mt Cabins Ut/Raw/DJI_0022.jpg
/Volumes/MCU/_ 0 _ Mt Cabins Ut/PTMAD/Insta/RAW/IMG_1049.jpg
The idea is to copy files that are all in one directory back to the paths given in the example. I'm planning to do something like this to copy them back:
awk '{print <after the last '/' of the line>}' ~/2500andMore.txt > ~/filename.txt
cd file-directory
while IFS= read -r file && IFS= read -r all <&3; do cp $file $all; done <~/filename.txt 3<~/2500andMore.txt
Judging by your example, you don't need to put the filenames into a separate file, just change your while loop:
while IFS= read -r all; do
file=${all##*/} # Or file=$(basename "$all")
cp "$file" "$all"
done <~/2500andMore.txt
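As a quick check of that parameter expansion (the path here is just an example):

```shell
# "${all##*/}" deletes the longest prefix matching "*/",
# leaving only the part after the last slash
all='/Volumes/MCU/Raw/DJI_0003.JPG'
echo "${all##*/}"    # DJI_0003.JPG
```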
But if you really want the filenames in filename.txt, then use:
awk -F / '{ print $NF }' ~/2500andMore.txt > ~/filename.txt
Explanation: "/" is used as a separator and $NF represents the last field
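A quick demonstration on a single path:

```shell
# -F / makes "/" the field separator; NF is the number of fields,
# so $NF is the last one, i.e. everything after the last slash
echo '/Volumes/MCU/Raw/DJI_0003.JPG' | awk -F / '{ print $NF }'
# prints: DJI_0003.JPG
```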

Count field separators on each line of input file and if missing/exceeding, output filename to error file

I have to validate the input file, Input.txt, for the proper number of field separators on each row; if even one row, including the header, is missing or exceeding the correct number of field separators, then print the name of the file to errorfiles.txt and exit.
I have another file to use as a reference for the correct number of field separators, valid.txt; I then compare the number of field separators on each row of the input file with the number of field separators in the valid.txt file.
awk -F '|' '{ print NF-1; exit }' valid.txt > fscount
awk -F '|' '(NF-1) != "cat fscount" { print FILENAME>"errorfiles.txt"; exit}' Input.txt
This is not working.
awk -F '|' '{ print NF-1; exit }' valid.txt > fscount
awk -F '|' '(NF-1) != "cat fscount" { print FILENAME>"errorfiles.txt"; exit}' Input.txt
It is not fully clear what your requirement is: to print the FILENAME of just the single input file provided, or perhaps you wanted to loop over a list of files in a directory running this command?
Anyway, to use the content of the file in the context of awk, just use its -v switch and use input re-direction on the file
awk -F '|' -v count="$(<fscount)" -v fname="errorfiles.txt" '(NF-1) != (count+0) { print FILENAME > fname; close(fname); exit}' Input.txt
Notice the use of close(filename) here, which is generally required when you are manipulating files inside awk constructs. The close() call explicitly closes the file descriptor associated with the file pointed to by filename, instead of letting the OS do it.
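A small illustration of the pattern (file names made up): when a script opens one output file per distinct key, close() releases each descriptor so a long run can't exhaust the per-process limit:

```shell
# Writes $2 to "<$1>.txt" for each line, closing each file after use
printf 'x 1\ny 2\n' | awk '{
    out = $1 ".txt"
    print $2 > out
    close(out)      # release the descriptor before the next file
}'
```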
GNU awk solution:
awk -F '|' 'ARGIND==1{aimNF=NF; nextfile} ARGIND==2{if (NF!=aimNF) {print FILENAME > "errorfiles.txt"; exit}}' valid.txt Input.txt
You can do it with just one command:
-- use awk to read both files, store the NF of the first file, and compare against it in the second file.
For other awks you can replace ARGIND==1 with FILENAME==ARGV[1], and so on.
Or if you are sure first file won't be empty, use NR==FNR and NR>FNR instead.
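Spelled out, that NR==FNR / NR>FNR variant would look like this (same caveat: it misbehaves if valid.txt is empty):

```shell
# First file: NR==FNR, record the target field count and skip ahead.
# Second file: NR>FNR, flag the file on the first mismatching row.
awk -F '|' '
    NR==FNR                { aimNF = NF; nextfile }
    NR>FNR && NF != aimNF  { print FILENAME > "errorfiles.txt"; exit }
' valid.txt Input.txt
```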

AWK print combination of string+bash variable+string

I am trying to rename contigs in a fasta file with the isolate ID and numbering contigs from 1 to n using awk.
Fastafile:
>NODE_1_length_172477_cov_46.1343
GCAGGGCGCAGTTTTTGGAGGCTTGGCAAACCCGTGAGGGAAATTTGGCAGGCAAAATTT
TGGCGGTCGTGCCGAAAAAAGCGGAGGCGATTTCAAATAAATTGTTTTTCACACATCATC
CCAAGCGGCAGACGGAGTTTGCAGTCGGACAAATCAGGCAAGGGCGCGCAGAGTAAGTCA
The isolate ID is a variable, as I am trying to do this for multiple files. I have come as far as getting it to print isolateIDnumber, but I need >isolateID_number
for file in /dir/*.fasta
do
name=$(basename "$file" .fasta)
awk '/^>/{print "'"$name"'" ++i; next}{print}' $file > rename.fasta
done;
This gives me:
15AR07771
GCAGGGCGCAGTTTTTGGAGGCTTGGCAAACCCGTGAGGGAAATTTGGCAGGCAAAATTT
TGGCGGTCGTGCCGAAAAAAGCGGAGGCGATTTCAAATAAATTGTTTTTCACACATCATC
CCAAGCGGCAGACGGAGTTTGCAGTCGGACAAATCAGGCAAGGGCGCGCAGAGTAAGTCA
Desired output:
>15AR0777_1
GCAGGGCGCAGTTTTTGGAGGCTTGGCAAACCCGTGAGGGAAATTTGGCAGGCAAAATTT
TGGCGGTCGTGCCGAAAAAAGCGGAGGCGATTTCAAATAAATTGTTTTTCACACATCATC
CCAAGCGGCAGACGGAGTTTGCAGTCGGACAAATCAGGCAAGGGCGCGCAGAGTAAGTCA
Question is, where do I put the string so that it will print >15AR0777_1 instead of 15AR07771
I tried a few variations of the following, but none worked
awk '/^>/{print ">'"$name"'" "_" ++i; next}{print}' $file > rename.fasta
awk '/^>/{print ">'"$name"'" _++i; next}{print}' $file > rename.fasta
Thanks!
Use awk -v awk_var="$bash_var" to transport shell variables into awk scripts. man awk:
-v var=val
--assign var=val
Assign the value val to the variable var, before execution of the program begins. Such variable values are available to the
BEGIN rule of an AWK program.
ie:
for file in dir/*.fasta
do
name=$(basename "$file" .fasta)
awk -v name="$name" '/^>/{print ">" name "_" ++i; next}{print}' "$file" > rename.fasta
done
Here is an all awk version for it:
awk '
FNR==1 { # new file, close old and make name for new
close(f) # close the old output file
n=FILENAME # get filename of the new file
gsub(/^.*\/|\.fasta$/,"",n) # remove path and .fasta
f="rename_" n ".fasta" # new output file
}
/^>/ {
$0=">" n "_" ++i # >name_number
}
{
print > f # print to output file
}' dir/*.fasta # process .fasta files in dir
If there is a file dir/15AR07771.fasta the script will produce a file ./rename_15AR07771.fasta of it. (Your version writes all output files to rename.fasta and does not even append, you may want to fix that.)

Create a new text file with the name and the content of others

I have several .seq files containing text.
I want to get a single text file containing :
name_of_the_seq_file1
contents of file 1
name_of_the_seq_file2
contents of file 2
name_of_the_seq_file3
contents of file 3
...
All the files are on the same directory.
Is it possible with awk or similar? Thanks!!
If there can be empty files then you need:
with GNU awk:
awk 'BEGINFILE{print FILENAME}1' *.seq
with other awks (untested):
awk '
FNR==1 {
for (++argind; ARGV[argind]!=FILENAME; argind++) {
print ARGV[argind]
}
print FILENAME
}
{ print }
END {
for (++argind; argind in ARGV; argind++) {
print ARGV[argind]
}
}' *.seq
You can use the following command:
awk 'FNR==1{print FILENAME}1' *.seq
FNR is the record number (which is the line number by default) of the current input file. Each time awk starts to handle another file, FNR==1; in this case the current filename gets printed through {print FILENAME}.
The trailing 1 is an awk idiom. It always evaluates to true, which makes awk print all lines of input.
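The idiom in isolation:

```shell
# A bare 1 is a pattern that is always true with no action,
# so awk applies the default action { print } to every record
printf 'one\ntwo\n' | awk '1'
# prints: one
#         two
```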
Note:
The above solution works only as long as you have no empty files in that folder. Check Ed Morton's great answer which points this out.
perl -lpe 'print $ARGV if $. == 1; close(ARGV) if eof' *.seq
$. is the line number
$ARGV is the name of the current file
close(ARGV) if eof resets the line number at the end of each file