awk: catch `exit' in the END block

I'm using awk to reformat an input file into an output file. I have several patterns that fill variables (like "some pattern" in the example). These variables are printed in the required format in the END block. The output has to be done there because the order of appearance in the input file is not guaranteed, but the order in the output file must always be the same.
BEGIN {
FS = "=|,"
}
/some pattern/ {
if ($1 == 8) {
var = $1
} else {
# Incorrect field value
exit 1
}
}
END {
# Output the variables
print var
}
So my problem is the exit statement in the pattern action. If an error occurs and exit is invoked, there should be no output at all, or at most an error message. But as the gawk manual says, when exit is invoked in a pattern action, the END block is still executed. Is there any way to catch the exit, like:
if (!exit_invoked) {
print var
}
or some other way to avoid printing the output in the END block?
Stefan
edit: Used the solution from shellter.

You'll have to handle it explicitly by setting exit_invoked before the exit line, i.e.
BEGIN {
FS = "=|,"
}
/some pattern/ {
if ($1 == 8) {
var = $1
} else {
# Incorrect field value
exit_invoked=1
exit 1
}
}
END {
if (! exit_invoked ) {
# Output the variables
print var
}
}
I hope this helps.
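A minimal way to try the flag approach from the shell, as a sketch only: the `check` function, the `id=...` input lines, the `failed` variable, and the `[=,]` separator are all made up for this illustration and are not from the original script. It shows the END block staying silent after a failed check while the exit status set in the main rule is kept.

```shell
# Hypothetical helper: feeds one line to awk and applies the flag-before-exit pattern.
check() {
    printf '%s\n' "$1" | awk -F'[=,]' '
        /^id/ {
            if ($2 == 8) {
                var = $2
            } else {
                failed = 1      # remember why we are leaving
                exit 1          # END still runs after this
            }
        }
        END {
            if (!failed)
                print "var=" var
        }
    '
}

for line in "id=8" "id=9"; do
    if out=$(check "$line"); then status=0; else status=$?; fi
    echo "input=$line output='$out' status=$status"
done
```

With the awk implementations I am aware of, the first input prints var=8 with status 0, and the second prints nothing with status 1, because END finishes without calling exit again.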

END {
# If here from a main block exit error, it is unlikely to be at EOF
if (getline) exit
# If the input can still be read, exit with the previously set status rather than run the rest of the END block.
......

Being a fan of short syntax and trying to avoid futile {}s or adding them later to pre-existing programs, instead of:
...
else {
exit_invoked=1
exit 1
}
...
END {
if (! exit_invoked ) {
print var
}
}
I use:
else
exit (e=1) # the point
...
END {
if(!e)
print v
}

Related

GAWK does not terminate after ENDFILE block with single file

I have a gawk script below that reads a protein FASTA file and only prints out the records that don't have an X in their sequence and are within a certain length range. I wanted to modify the file in place, so I had the script write to a temporary file and then rename it to the original file. The BEGINFILE and ENDFILE constructs in gawk seemed convenient for this. However, for some reason, gawk does not exit after executing the code in the ENDFILE block, even if it is given a single file argument. It seems to jump back to another line of code and then just hang. Does anyone know what could cause this to happen? The weird part is that this doesn't happen for every FASTA file, only a few, and I can't tell what is different between the ones that trigger the bug and the ones that don't.
#! /bin/gawk -f
function trim(s) {
gsub(/^[ \t]+|[ \t]+$/, "", s)
return s
}
function printFasta(header, seq, outfile, seq_line_max_chars) {
print ">" header > outfile
seq_line_max_chars = 80
start = 1
end = length(seq)
while (start <= end) {
print substr(seq, start, seq_line_max_chars) > outfile
start += seq_line_max_chars
}
}
BEGIN {
min_prot_len = 400
}
BEGINFILE {
tmp_out = FILENAME ".tmp"
}
/^>.+/ {
headerStartIdx = index($0, ">") + 1
header = trim(substr($0, headerStartIdx))
getline
sequence = ""
while ($0 !~ /(^>.+)|(^[[:space:]]*$)/) {
x_matched = match($0, "X")
if (x_matched != 0) {
next
}
gsub("*", "")
sequence = sequence $0
getline
}
if (length(sequence) >= min_prot_len) {
printFasta(header, sequence, tmp_out)
}
}
ENDFILE {
print "move called"
# system(("mv " tmp_out " " FILENAME))
}
I called the script with
$ ./filter_proteins.awk test.faa
When I run this, move called is printed and then it hangs. I tried stepping through with the debugger and I see that it reaches the ENDFILE block having processed all the lines in the input file, but when I type the next command, it jumps to the getline statement on line 44. After several iterations of next and print $0 it seems that the program is stuck reading the last line of the input file till the end of time. Perhaps this is a bug?
I am using GAWK 5.1.0
Edit
A minimal input file.
https://github.com/CuriousTim/pastebin/blob/main/mb.34.faa
When I run the script with only a few sequences, it works, but when I use the whole file, it hangs. I wasn't sure how to make a minimal example without providing the whole file.
We'll know for sure after you provide sample input to test with but my money's on the call to getline within the loop reaching the end of the file and so triggering the ENDFILE condition to be true but you're still in the loop.
Look:
$ cat file
foo
bar
$ cat tst.awk
{
while (1) {
print "about to execute getline"
getline
print "just executed getline:", $0
if (++c == 5) {
exit
}
}
}
ENDFILE {
print "*** in ENDFILE ***"
}
$ awk -f tst.awk file
about to execute getline
just executed getline: bar
about to execute getline
*** in ENDFILE ***
just executed getline: bar
about to execute getline
just executed getline: bar
about to execute getline
just executed getline: bar
about to execute getline
just executed getline: bar
Calling getline in a loop is not how you want to write an awk script, and your code calls getline without ever checking its return value, which is the wrong way to call it at any time - see http://awk.freeshell.org/AllAboutGetline.
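For cases where getline really is needed, a sketch of the checked form looks like this. The `read_rest` function name and the two-line input are assumptions for the demo; the point is only that the loop stops when getline returns 0 (end of input) or -1 (error) instead of spinning on the last record.

```shell
# Hypothetical helper: prints the first record, then drains the rest with a checked getline.
read_rest() {
    awk '
    NR == 1 {
        print "first:", $0
        # (getline line) > 0 is true only while a record was actually read
        while ((getline line) > 0)
            print "next:", line
        print "done after", NR, "records"
    }'
}

printf 'foo\nbar\n' | read_rest
```

Reading into a variable (`line`) rather than `$0` also keeps the current record intact, which makes the loop easier to reason about.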
I modified my script to remove the getline in case anyone finds it useful.
#! /bin/gawk -f
function printFasta(header, seq, seq_line_max_chars) {
print ">" header
seq_line_max_chars = 80
start = 1
end = length(seq)
while (start <= end) {
print substr(seq, start, seq_line_max_chars)
start += seq_line_max_chars
}
}
BEGIN {
FS = ">"
min_prot_len = 400
max_prot_len = 700
}
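NF > 1 {
# New header line: print the previous record if it passed the filters
if (header &&
length(sequence) >= min_prot_len &&
length(sequence) <= max_prot_len) {
printFasta(header, sequence)
}
# Start a new record; everything after the leading ">" is the header
header = substr($0, 2)
sequence = ""
next
}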
{
if (!header) {
next
} else {
x_in_seq = match($0, "X")
if (!x_in_seq) {
gsub("*", "")
sequence = sequence $0
} else {
header = ""
}
}
}
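END {
# Apply the same length filter to the last record
if (header &&
length(sequence) >= min_prot_len &&
length(sequence) <= max_prot_len) {
printFasta(header, sequence)
}
}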

How to update a variable from inside an AWK function

I run this script from a loop inside another script, and I want to:
a) print errors into a file, keeping track of the line number, the file name, and the error;
b) print into another file the unique names of the files in which an error has been found. A single file can have more than one error, and I don't want repetitions.
I know I can run sort | uniq on the file at the end from the calling script, but is there another technique?
Something like:
if(tempVar != FILENAME)
{
print FILENAME >> uniqueFiles;
}
tempVar= FILENAME;
here's my script
awk '
function errorHandler(error1)
{
print FILENAME >> uniqueFiles;
print FILENAME";"NR";"error >> errorListing;
uniqueFiles = FILENAME;
}
BEGIN {
uniqueFiles="files.txt";
errorListing="errorList.txt";
error1="Error code 1"
}
{
if(NR>1)
{
if(length($1) != 10)
{
errorHandler(error1);
}
}
}
END{}' $1
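One possible technique, sketched under assumptions: awk arrays are global, so a function can record which file names it has already written. The `seen` array, the temporary directory, and the two sample files are invented for this demo, and FNR is used instead of NR so that the line number restarts for each file when several files are passed in one run.

```shell
tmp=$(mktemp -d)
printf 'header\nshort\nbad\n' > "$tmp/a.txt"      # two bad lines
printf 'header\n0123456789\n' > "$tmp/b.txt"      # no bad lines

awk -v uniqueFiles="$tmp/files.txt" -v errorListing="$tmp/errorList.txt" '
function errorHandler(msg) {
    print FILENAME ";" FNR ";" msg >> errorListing
    # seen is a global array, so it persists between calls
    if (!(FILENAME in seen)) {
        seen[FILENAME] = 1
        print FILENAME >> uniqueFiles
    }
}
FNR > 1 && length($1) != 10 { errorHandler("Error code 1") }
' "$tmp/a.txt" "$tmp/b.txt"

cat "$tmp/errorList.txt"
cat "$tmp/files.txt"
```

Here the error listing gets one line per bad record, while files.txt gets each offending file name only once. This only deduplicates within a single awk run; if awk is started once per file from an outer loop, the calling script still has to handle repeats across runs.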

awk 1 unexpected character '.' suddenly appeared

The script was working. I added some comments, renamed it, and then submitted it. Today my instructor told me it doesn't work and gives the error awk 1 unexpected character '.'
The script is supposed to read a name from the command line and return the student information for that name.
I have just checked it and, surprisingly, it gives me the same error.
I should run it with a command like this:
scriptName -v name="aname" -f filename
What is this problem, and which part of my code causes it?
#!/usr/bin/awk
BEGIN{
tmp=name;
nameIsValid;
if (name && tolower(name) eq ~/^[a-z]+$/ )
{
inputName=tolower(name)
nameIsValid++;
}
else
{
print "you have not entered the student name"
printf "Enter the student's name: "
getline inputName < "-"
tmp=inputName;
if (tolower(inputName) eq ~/^[a-z]+$/)
{
tmpName=inputName
nameIsValid++
}
else
{
print "Enter a valid name!"
exit
}
}
inputName=tolower(inputName)
FS=":"
}
{
if($1=="Student Number")
{
split ($0,header,FS)
}
if ($1 ~/^[0-9]+$/ && length($1)==8)
{
split($2,names," ")
if (tolower(names[1]) == inputName || tolower(names[2])==inputName )
{
counter++
for (i=1;i<=NF;i++)
{
printf"%s:%s ",header[i], $i
}
printf "\n"
}
}
}
END{
if (counter == 0 && nameIsValid)
{
printf "There is no record for the %-10s\n" , tmp
}
}
Here are the steps to fix the script:
Get rid of all those spurious NULL statements (trailing semi-colons at the end of lines).
Get rid of the unset variable eq (it is NOT an equality operator!) from all of your comparisons.
Cleanup the indenting.
Get rid of that first non-functional nameIsValid; statement.
Change printf "\n" to the simpler print "".
Get rid of the useless ,FS arg to split().
Change name && tolower(name) ~ /^[a-z]+$/ to just the second part of that condition since if that matches then of course name is populated.
Get rid of all of those tolower()s and use character classes instead of explicit a-z ranges.
Get rid of the tmp variable.
Simplify your BEGIN logic.
Get rid of the unnecessary nameIsValid variable completely.
Make the awk body a bit more awk-like
And here's the result (untested since no sample input/output posted):
BEGIN {
if (name !~ /^[[:alpha:]]+$/ ) {
print "you have not entered the student name"
printf "Enter the student's name: "
getline name < "-"
}
if (name ~ /^[[:alpha:]]+$/) {
inputName=tolower(name)
FS=":"
}
else {
print "Enter a valid name!"
exit
}
}
$1=="Student Number" { split ($0,header) }
$1 ~ /^[[:digit:]]+$/ && length($1)==8 {
split(tolower($2),names," ")
if (names[1]==inputName || names[2]==inputName ) {
counter++
for (i=1;i<=NF;i++) {
printf "%s:%s ",header[i], $i
}
print ""
}
}
END {
if (counter == 0 && inputName) {
printf "There is no record for the %-10s\n" , name
}
}
I changed the shebang line to:
#!/usr/bin/awk -f
and then didn't use -f on the command line. It is working now.
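A sketch of why the missing -f matters, as far as I understand it: with a shebang of `#!/usr/bin/awk`, running `./script` ends up invoking awk with the script's path as its first argument, and awk treats that argument as the program text. The file name `hello.awk` and the temporary directory below are assumptions made up for the demo.

```shell
tmp=$(mktemp -d)
printf 'BEGIN { print "hello" }\n' > "$tmp/hello.awk"

# Without -f: the string "./hello.awk" itself is parsed as awk code and rejected.
if (cd "$tmp" && awk ./hello.awk </dev/null 2>/dev/null); then
    result="accepted"
else
    result="syntax error"
fi
echo "path as program text: $result"

# With -f: the file is read as the program.
(cd "$tmp" && awk -f ./hello.awk)
```

The first call fails on the leading `.` of the path, which is consistent with the "unexpected character '.'" message in the question; the second call prints hello.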
Run the script in the following way:
awk -f script_name.awk input_file.txt
This seems to suppress the warnings and errors.
In my case, the problem was that I had set the IFS variable to IFS="," as suggested in an answer on splitting a string into an array. Once I reset IFS afterwards, my code worked.
IFS=', '
read -r -a array <<< "$string"
IFS=' ' # reset IFS

how to create an empty array

UPDATE
The original description below has many errors; gawk lint does not complain about uninitialized arrays used as the RHS of in. For example, the following code gives no errors or warnings. I am not deleting the question because the answer I am about to accept gives a good suggestion: use split with an empty string to create an empty array.
BEGIN{
LINT = "fatal";
# print x; # LINT gives an error if this is uncommented
thread = 0;
if (thread in threads_start) {
print "if";
} else {
print "not if";
}
}
Original Question
A lot of my awk scripts have a construct as follows:
if (thread in threads_start) { # LINT warning here
printf("%s started at %d\n", threads[thread_start]));
} else {
printf("%s started at unknown\n");
}
Running this with gawk --lint results in
warning: reference to uninitialized variable `thread_start'
So I initialize in the BEGIN block as follows. But this looks kludge-y. Is there a more elegant way to create a zero-element array?
BEGIN { LINT = 1; thread_start[0] = 0; delete thread_start[0]; }
I think you might have made a few typos in your code.
if (thread in threads_start) { # LINT warning here (you think)
Here you look for the index thread in array threads_start.
printf("%s started at %d\n", threads[thread_start])); # Actual LINT warning
But here you print the index thread_start in array threads! Also notice the different s's thread/threads and threads_start/thread_start. Gawk is actually warning you correctly about the usage of thread_start (without s) on the second line.
There also is an error in your printf format.
When you change these the lint warning disappears:
if (thread in threads_start) {
printf("%s started at %d\n", thread, threads_start[thread]);
} else {
printf("%s started at unknown\n", thread);
}
But perhaps I've misunderstood what your code is supposed to do. In that case, could you post a minimal self-contained code sample that produces the spurious lint warning?
Summary
The idiomatic method of creating an empty array in Awk is to use split().
Details
To simplify your example above to focus on your question rather than your typos, the fatal error can be triggered with:
BEGIN{
LINT = "fatal";
if (thread in threads_start) {
print "if";
} else {
print "not if";
}
}
which produces the following error:
gawk: cmd. line:3: fatal: reference to uninitialized variable `thread'
Giving thread a value before using it to search in threads_start passes linting:
BEGIN{
LINT = "fatal";
thread = 0;
if (thread in threads_start) {
print "if";
} else {
print "not if";
}
}
produces:
not if
To create a linting error with an uninitialised array, we need to attempt to access a non-existent entry:
BEGIN{
LINT = "fatal";
thread = 0;
if (threads_start[thread]) {
print "if";
} else {
print "not if";
}
}
produces:
gawk: cmd. line:4: fatal: reference to uninitialized element `threads_start["0"]'
So, you don't really need to create an empty array in Awk, but if you want to do so, and to answer your question, use split():
BEGIN{
LINT = "fatal";
thread = 0;
split("", threads_start);
if (thread in threads_start) {
print "if";
} else {
print "not if";
}
}
produces:
not if

Are there any AWK syntax checkers?

Are there any AWK syntax checkers? I'm interested in both minimal checkers that only flag syntax errors and more extensive checkers along the lines of lint.
It should be a static checker only, not dependent on running the script.
If you prefix your Awk script with BEGIN { exit(0) } END { exit(0) }, you're guaranteed that none of your code will run. Exiting during BEGIN and END prevents other BEGIN and END blocks from running. If Awk returns 0, your script was fine; otherwise there was a syntax error.
If you put the code snippet in a separate argument, you'll get good line numbers in the error messages. This invocation...
gawk --source 'BEGIN { exit(0) } END { exit(0) }' --file syntax-test.awk
Gives error messages like this:
gawk: syntax-test.awk:3: x = f(
gawk: syntax-test.awk:3: ^ unexpected newline or end of string
GNU Awk's --lint can spot things like global variables and undefined functions:
gawk: syntax-test.awk:5: warning: function `g': parameter `x' shadows global variable
gawk: warning: function `f' called but never defined
And GNU Awk's --posix option can spot some compatibility problems:
gawk: syntax-test.awk:2: error: `delete array' is a gawk extension
Update: BEGIN and END
Although the END { exit(0) } block seems redundant, compare the subtle differences between these three invocations:
$ echo | awk '
BEGIN { print("at begin") }
/.*/ { print("found match") }
END { print("at end") }'
at begin
found match
at end
$ echo | awk '
BEGIN { exit(0) }
BEGIN { print("at begin") }
/.*/ { print("found match") }
END { print("at end") }'
at end
$ echo | awk '
BEGIN { exit(0) } END { exit(0) }
BEGIN { print("at begin") }
/.*/ { print("found match") }
END { print("at end") }'
In Awk, exiting during BEGIN will cancel all other begin blocks, and will prevent matching against any input. Exiting during END is the only way to prevent all other event blocks from running; that's why the third invocation above shows that no print statements were executed. The GNU Awk User's Guide has a section on the exit statement.
GNU awk appears to have a --lint option.
For a minimal syntax checker, which stops at the first error, try awk -f prog < /dev/null.
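The /dev/null approach can be wrapped in a small shell function, sketched here. The function name `awk_syntax_ok` and the two sample programs are assumptions for the demo. Note that, unlike the BEGIN { exit(0) } END { exit(0) } prefix described above, this form still executes any BEGIN and END blocks in the program, so it is not a purely static check.

```shell
# Hypothetical helper: succeeds if awk can parse (and run on empty input) the given program file.
awk_syntax_ok() {
    awk -f "$1" </dev/null >/dev/null 2>&1
}

tmp=$(mktemp -d)
printf '{ print $1 }\n' > "$tmp/good.awk"
printf '{ x = f( }\n'  > "$tmp/bad.awk"     # unbalanced parenthesis

for prog in "$tmp/good.awk" "$tmp/bad.awk"; do
    if awk_syntax_ok "$prog"; then
        echo "$(basename "$prog"): ok"
    else
        echo "$(basename "$prog"): syntax error"
    fi
done
```

Dropping the 2>&1 redirection shows the parser's own message and line number, which is usually what you want when checking by hand.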