awk: how to "nextfile" in ENDFILE? - awk

Referring to "The GNU Awk User's Guide":
The next statement (see section The next Statement) is not allowed inside either a BEGINFILE or an ENDFILE rule. The nextfile statement is allowed only inside a BEGINFILE rule, not inside an ENDFILE rule.
I have tested "nextfile" in BEGINFILEļ¼š
Even if nextfile is written in BEGINFILE, the corresponding file in ENDFILE will be executed
I wonder how can I skip the "file" that "nextfile" in BEGINFILE

If your question is "can we use nextfile inside an ENDFILE section?", then the short answer is no, we can't.
Here is an example:
Use of nextfile in the BEGINFILE section: Let's say we have two input files named file1 and file2, and we run nextfile inside the BEGINFILE rule.
awk 'BEGINFILE{print FILENAME;nextfile} 1' file1 file2
file1
file2
We can see it prints the file names but not the contents of the files, which means nextfile is working as expected.
Use of nextfile inside the ENDFILE block: Now let's run nextfile inside the ENDFILE section and see what happens.
awk '1;ENDFILE{nextfile;print "bla bla"}' file1 file2
awk: cmd. line:1: error: `nextfile' used in ENDFILE action
It gives an error, as you can see above. This happens because ENDFILE and END blocks run after all the records of an input file have been processed, so there is nothing left to skip. If you want to jump to the next input file, use nextfile either in the BEGINFILE section or in the main block of your code, where you can match a specific condition and then jump to the next input file.
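For example, here is a minimal sketch of nextfile in the main block (the __END__ marker is a hypothetical condition): it prints each file's lines until the marker is seen, then jumps to the next input file.
awk '/^__END__$/{nextfile} 1' file1 file2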
How to skip the ENDFILE section for a specific input file: If you have statements in ENDFILE that you want to skip for one particular file, you can guard them with a condition like this:
awk '1;ENDFILE{if(FILENAME!="file2"){print FILENAME}}' file1 file2
Output will be as follows.
bla bla bla file1 contents here.....
file1
bla bla bla file2 contents here....
It prints the contents of file1 and file2, but because of the condition in the ENDFILE block it does not print file2's name, so we have prevented the statements in the ENDFILE block from running for that file.
Ideal use of nextfile in the BEGINFILE block: If we use nextfile inside the BEGINFILE block, none of the records of that input file will be printed. That's because BEGINFILE statements run before any records are read from the input file; when we call nextfile there, awk never starts reading records and simply jumps to the next input file.
So when should you use nextfile in the BEGINFILE block? IMHO, the best use case is when you want to process multiple input files and deal with headers per file (e.g. print only a header or a message for a specific file and not process its records). A second case is any processing you need to do before the records of an input file are read. A sketch of the header case follows.
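Here, skip_me.txt is a hypothetical file we want to leave unprocessed; the other file names are placeholders too. This requires GNU awk for BEGINFILE:
awk 'BEGINFILE{if(FILENAME=="skip_me.txt"){print "skipping " FILENAME; nextfile}} {print FILENAME": "$0}' data1.txt skip_me.txt data2.txt
This prints a message for skip_me.txt without reading any of its records, while the records of data1.txt and data2.txt are processed normally.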

If you read the GNU awk manual, then it states this about nextfile:
In gawk, execution of nextfile causes additional things to happen: any ENDFILE rules are executed if gawk is not currently in an END or BEGINFILE rule, ARGIND is incremented, and any BEGINFILE rules are executed.
With gawk, nextfile is useful inside a BEGINFILE rule to skip over a file that would otherwise cause gawk to exit with a fatal error. In this case, ENDFILE rules are not executed.
source: GNU awk manual: nextfile statement
So, what does this mean?
ENDFILE is executed if nextfile is in a normal pattern-action pair (pattern { action })
$ awk '{nextfile}1;ENDFILE{print "processing ENDFILE"}' /dev/random
processing ENDFILE
ENDFILE is executed if nextfile is in a BEGINFILE-block (BEGINFILE { action })
$ awk 'BEGINFILE{print "processing BEGINFILE with error", ERRNO; nextfile}1
ENDFILE{print "processing ENDFILE"}' /dev/random
processing BEGINFILE with error
processing ENDFILE
ENDFILE is not executed if nextfile is in a BEGINFILE-block and an error occurred (BEGINFILE { action })
$ touch foo; chmod -r foo
$ awk 'BEGINFILE{print "processing BEGINFILE with error", ERRNO; nextfile}1
ENDFILE{print "processing ENDFILE"}' foo
processing BEGINFILE with error Permission denied
How can we skip ENDFILE using nextfile?
The answer is to fake an error: by assigning anything to the GNU awk variable ERRNO before calling nextfile, you skip the ENDFILE block.
$ awk 'BEGINFILE{ERRNO=1; print "processing BEGINFILE with error", ERRNO; nextfile}1
ENDFILE{print "processing ENDFILE"}' /dev/random
processing BEGINFILE with error 1

Related

How can I send the output of an AWK script to a file?

Within an AWK script, I need to send the output of the script to a file while also printing it to the terminal. Is there a tidy way I can do this without having to duplicate every print with a redirect to the file?
I'm not particularly good at making SSCCE examples, but here's my attempt at demonstrating my problem:
BEGIN{
    print "This is an awk script"
    # I don't want to have to do this for every print
    print "This is an awk script" > "thisiswhack.out"
}
{
    # data manip. stuff here
    # ...
    printf "%s %s %s\n", blah1, blah2, blah3
    # I don't want to have to do this for every print again
    printf "%s %s %s\n", blah1, blah2, blah3 >> "thisiswhack.out"
}
END{
    print "Yay we're done!"
    # Seriously, there has to be a better way to do this within the script
    print "Yay we're done!" >> "thisiswhack.out"
}
Surely there must be a way to send the entire output of the script to an output file within the script itself, right?
The command to duplicate streams is tee, and we can use it inside awk:
awk '
BEGIN {tee = "tee out.txt"}
{print | tee}' in.txt
This invokes tee with the file argument out.txt, and opens a stream to this command.
The stream (and therefore tee) remains open until awk exits, or close(tee) is called.
Every time print | tee is used, the data is printed to that stream. tee then writes this data both to the file out.txt and to stdout.
The print | "command" feature is POSIX awk. Also, the tee variable isn't compulsory; you can use the command string directly.
Of course, we can use tee outside awk too: awk ... | tee out.txt.
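If your program keeps running after the last print, the stream can also be closed explicitly; a minimal sketch using the same placeholder names:
awk '
BEGIN {tee = "tee out.txt"}
{print | tee}
END {close(tee)}' in.txt
Calling close(tee) flushes the stream and lets tee finish writing out.txt before awk exits.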
GNU AWK's redirection allows sending output to a command rather than a file, so I suggest the following use of that feature:
awk 'BEGIN{command="tee output.txt"}{print tolower($0) | command}' input.txt
Note: I use tolower($0) for demonstration purposes. I redirect print into the tee command, which writes both to the named file and to standard output, so you should get a lowercase version of input.txt written to output.txt and to standard output.
If you are not confined to a single awk invocation, you might alternatively use tee outside, like so:
awk '{print tolower($0)}' input.txt | tee output.txt
awk '
function prtf(str) {
    printf "%s", str > "thisiswhack.out"
    printf "%s", str
    fflush()
}
function prt(str) {
    prtf( str ORS )
}
{
    # to print adding a newline at the end:
    prt( "foo" )
    # to print as-is without adding a newline:
    prtf( sprintf("%s, %s, %d", $2, "bar", 17) )
}
' file
In the above we are not spawning a subshell to call any other command, so it's efficient, and we're using fflush() after every print to ensure the two output streams (stdout and the extra file) don't get out of sync with each other (e.g. stdout displaying less text than the file, or vice versa, if the command is killed).
The above always overwrites the contents of "thisiswhack.out" with whatever the script outputs. If you want to append instead, change > to >>. If you want the option of doing both, introduce a variable (which I've named prtappend below) to control it, which you can set on the command line, e.g. change:
printf "%s", str > "thisiswhack.out"
to:
printf "%s", str >> "thisiswhack.out"
and add:
BEGIN {
if ( !prtappend ) {
printf "" > "thisiswhack.out"
}
}
then if you do awk -v prtappend=1 '...' it'll append to thisiswhack.out instead of overwriting it.
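Putting those pieces together, the append-capable variant might look like this (a sketch assembled from the snippets above; prtappend defaults to overwrite mode, and the main block just echoes every input line as an example):
awk -v prtappend=0 '
BEGIN {
    if ( !prtappend ) {
        printf "" > "thisiswhack.out"   # truncate the file unless append mode was requested
    }
}
function prtf(str) {
    printf "%s", str >> "thisiswhack.out"
    printf "%s", str
    fflush()
}
function prt(str) {
    prtf( str ORS )
}
{
    prt( $0 )   # example use: copy every input line to stdout and the file
}
' file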
Of course, the better approach if you're on a Unix system is to have your awk script called from a shell script with its output piped to tee, e.g.:
#!/usr/bin/env bash
awk '
{
    print "foo"
    printf "%s, %s, %d", $2, "bar", 17
}
' "${@:--}" |
tee 'thisiswhack.out'
Note that this is one more example of why you should not call awk from a shebang.

AWK script ignore first line

I am iterating through a csv file with awk using the command gawk -f script.awk example.csv.
script.awk is a file containing my commands:
BEGIN{FS=","}
pattern {command}
pattern {command}
END{print output}
If I wanted to skip the first line of the csv file, where would I put the NR>1 condition in script.awk?
I suggest:
BEGIN{FS=","}
NR==1 {next}
pattern {command}
pattern {command}
END{print output}
From man awk:
next: Stop processing the current input record. Read the next input record and start processing over with the first pattern in the AWK program. Upon reaching the end of the input data, execute any END rule(s).
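One caveat: NR counts records across all input files, so NR==1 only skips the very first line of the whole run. If you pass several CSV files to the same script and each has its own header, test FNR (the per-file record number) instead. A minimal sketch, assuming every file starts with a header line and you want the first field of each data line:
gawk -F',' 'FNR==1{next} {print $1}' file1.csv file2.csv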

How to replace \n with a comma for certain lines on CLI

I have the following text file.
# This is a test 1
"watch"
"autoconf"
# This is another line 2
"coreutils"
"binutils"
"screen"
# This is another line 3
"bash"
"emacs"
"nano"
"bison"
# This is another line 4
"libressl"
"python"
"rsync"
"unzip"
"vim"
I want to change this to the following:
# This is a test 1
watch, autoconf
# This is another line 2
coreutils, binutils, screen
# This is another line 3
bash, emacs, nano, bison
# This is another line 4
libressl, python, rsync, unzip, vim
Remove the leading whitespace, remove the quotes, and replace the newlines with commas.
So far I got this:
$ cat in.txt | sed 's/"//g' | sed 's/^[[:space:]]*//'> out.txt
# This is a test 1
watch
autoconf
# This is another line 2
coreutils
binutils
screen
# This is another line 3
bash
emacs
nano
bison
...
I'm not sure how to replace a new line with a comma. I tried the following.
# no change
$ cat in.txt | sed 's/"//g' | sed 's/^[[:space:]]*//'| sed 's/\n/,/g'> out.txt
# changed all new lines
$ cat in.txt | sed 's/"//g' | sed 's/^[[:space:]]*//'| sed -z 's/\n/,/g'> out.txt
$ cat out.txt
# This is a test 1,watch,autoconf,,# This is another line 2,coreutils,binutils,screen,,# This is another line 3,bash,emacs,nano,bison,,# This is another line 4,libressl,python,rsync,unzip,vim
How can I achieve this?
This might work for you (GNU sed):
sed -E 's/^\s*//;/^"(\S*)"/{s//\1/;H;$!d};x;s/.//;s/\n/, /gp;$d;z;x' file
Strip off white space at the front of all lines.
Strip out double quotes and append those words to the hold space.
Otherwise (on a header line or at end of input), switch to the hold space, delete the first newline that H introduced, replace the remaining newlines with ", ", print the result, then switch back to the pattern space and let it print.
Here's an awk version. Notice that we set the record separator RS to the empty string. This tells awk to treat each block separated by an empty line as a single record. Then by setting the field separator with -F to a newline, each line in the block becomes a single field in that record.
Then it's just a matter of brute-forcing our way through the fields of each record, using sub or gsub to remove leading spaces and quotation marks, and using printf to avoid a newline when we don't want one and printing a comma instead.
$ awk -v RS="" -F'\n' '{
sub(/^[[:space:]]*/, "", $1);
print $1;
sep="";
for (i=2; i<=NF; ++i) {
gsub(/[[:space:]]*"/, "", $i);
printf "%s%s", sep, $i;
sep=", "
}
print "\n"
}' file
Output:
# This is a test 1
watch, autoconf
# This is another line 2
coreutils, binutils, screen
# This is another line 3
bash, emacs, nano, bison
# This is another line 4
libressl, python, rsync, unzip, vim
A one-liner using GNU sed:
sed -Ez 's/\n[[:blank:]]*"?/\n/g; s/"\n([^\n])/, \1/g; s/"//g' file
or, using multiline techniques with standard sed:
sed '
s/^[[:blank:]]*//
/^".*"$/{
s/.//
s/.$//
:a
$b
N
s/\n[[:blank:]]*"\(.*\)"$/, \1/
ta
}' file
With your shown samples, you could try the following, written and tested in GNU awk.
awk '
BEGIN{
    OFS=", "
}
NF{
    gsub(/"|^ +| +$/,"")
}
/^#/ || !NF{
    if(value){
        print first ORS value
    }
    first=$0
    value=""
    if(!NF){ print }
    next
}
{
    value=(value?value OFS:"")$0
}
END{
    if(value){
        print first ORS value
    }
}
' Input_file
Explanation: here is the same program with detailed comments.
awk '                              ##Starting awk program from here.
BEGIN{                             ##Starting BEGIN section of this program from here.
    OFS=", "                       ##Setting OFS to comma-space here.
}
NF{                                ##If the current line is NOT empty, do the following.
    gsub(/"|^ +| +$/,"")           ##Globally substituting " OR leading/trailing spaces with NULL here.
}
/^#/ || !NF{                       ##If the line starts with # OR is empty, do the following.
    if(value){                     ##If value is NOT NULL, do the following.
        print first ORS value      ##Printing first and value here, separated by ORS.
    }
    first=$0                       ##Setting first to the current line here.
    value=""                       ##Nullifying value here.
    if(!NF){ print }               ##If the line is empty, simply print it.
    next                           ##next skips all further statements from here.
}
{
    value=(value?value OFS:"")$0   ##Building value by appending the current line to it.
}
END{                               ##Starting END block of this program from here.
    if(value){                     ##If value is NOT NULL, do the following.
        print first ORS value      ##Printing first and value here, separated by ORS.
    }
}
' Input_file                       ##Mentioning Input_file name here.
Using any POSIX awk in any shell on every Unix box:
$ awk -v RS= -v ORS='\n\n' -F'[[:blank:]]*\n[[:blank:]]*' -v OFS=', ' '{
    gsub(/^[[:blank:]]*|"/,"")
    printf "%s\n", $1
    for (i=2; i<=NF; i++) {
        printf "%s%s", $i, (i<NF ? OFS : ORS)
    }
}' file
# This is a test 1
watch, autoconf
# This is another line 2
coreutils, binutils, screen
# This is another line 3
bash, emacs, nano, bison
# This is another line 4
libressl, python, rsync, unzip, vim

use awk to split one file into several small files by pattern

I have read this post about using awk to split one file into several files:
and I am interested in one of the solutions provided by Pramod and jaypal singh:
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
Because I can't add comments yet, I'm asking here.
If the input is
>chr22
asdgasge
asegaseg
>chr1
aweharhaerh
agse
>chr14
gasegaseg
How come it results in three files:
chr22.fa
chr1.fa
chr14.fa
As an example, in chr22.fa:
>chr22
asdgasge
asegaseg
I understand the first part
/^>chr/ {OUT=substr($0,2) ".fa"};
and these commands:
/^>chr/ substr() close() >>
But I don't understand how awk splits the input with the second part:
{print >> OUT; close(OUT)}
Could anyone explain more details about this command? Thanks a lot!
Could you please go through the following and let me know if it helps you.
awk '                        ##Starting awk program here.
/^>chr/{                     ##If a line starts with the string >chr, do the following.
    OUT=substr($0,2) ".fa"   ##Create variable OUT from the substring of the current line, from the 2nd character to the end, with .fa concatenated to it.
}
{
    print >> OUT             ##Appending the current line to the file whose name is the value of OUT.
    close(OUT)               ##Closing the output file named by OUT; this avoids a "too many open files" error.
}' Input_File                ##Mentioning Input_file name here.
You can refer to the man awk page for the awk functions used, for example:
substr(s, i [, n]) Returns the at most n-character substring of s starting at i. If n is omitted, the rest of s is used.
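For instance, a throwaway one-liner showing how the output file name is built from a header record:
$ awk 'BEGIN{ s=">chr22"; print substr(s,2) ".fa" }'
chr22.fa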
The part you are asking questions about is a bit uncomfortable:
{ print $0 >> OUT; close(OUT) }
With this part, the awk program does the following for every line it processes:
Open the file OUT
Move the file pointer to the end of the file OUT
Append the line $0 followed by ORS to the file OUT
Close the file OUT
Why is this uncomfortable? Mainly because of the structure of your files. You should only close the file when you have finished writing to it, not every time you write to it. Currently, if you have a fasta record of 100 lines, it will open and close the file 100 times.
A better approach would be:
awk '/^>chr/{close(OUT); OUT=substr($0,2)".fasta" }
{print > OUT }
END {close(OUT)}'
Here we only open the file the first time we write to it and we close it when we don't need it anymore.
note: the END statement is not really needed.
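For illustration, running the improved command on the sample input from the question (assuming it is saved as input.fa) creates one file per header:
$ awk '/^>chr/{close(OUT); OUT=substr($0,2)".fasta"} {print > OUT} END{close(OUT)}' input.fa
$ ls
chr1.fasta  chr14.fasta  chr22.fasta  input.fa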

awk difference between commands from file and from commandline

The following script
#! /bin/bash
B=5
#FILE INPUT
cat <<EOF > awk.in
BEGIN{b=$B;printf("B is %s\n", b)}
EOF
awk -f awk.in sometextfile.txt
#COMMANDLINE INPUT
awk 'BEGIN{b=$B;printf("B is %s\n", b)}' sometextfile.txt
produces the output
B is 5
B is
The commands I am issuing to awk are exactly the same, so why is the variable B interpreted correctly in the first case but not in the second?
Thanks!
In the line
awk 'BEGIN{b=$B;printf("B is %s\n", b)}' sometextfile.txt
The string literal 'BEGIN{b=$B;printf("B is %s\n", b)}' is single-quoted, therefore $B is not expanded by the shell and is passed to awk as awk code. In awk code, B is uninitialized, so $B becomes $0, which is empty in the BEGIN block.
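You can see this uninitialized-variable effect directly (throwaway examples):
$ awk 'BEGIN{ print "[" $B "]" }'
[]
$ echo "hello world" | awk '{ print $B }'
hello world
In the BEGIN block $B is $0, which is still empty; in the main block it is the whole current line.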
In contrast, shell variables in here documents (as in your first example) are expanded, so awk.in ends up containing the value that $B had in the shell script. This, by the way, would make writing awk code very painful as soon as you tried to use a field variable (named $1, $2, and so forth) or the full line (named $0), because you'd have to manually resolve the ambiguity between awk's fields and shell variables of the same name.
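For instance, if the awk program in the here document needs awk's own $1, it has to be escaped so the shell doesn't consume it; a sketch extending the question's script:
B=5
# In an unquoted here document the shell expands $B to 5,
# while the escaped \$1 survives as awk's field variable.
cat <<EOF > awk.in
{ print \$1, "and B is", $B }
EOF
awk -f awk.in sometextfile.txt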
Use
awk -v b="$B" 'BEGIN{ printf("B is %s\n", b) }' sometextfile.txt
to make a shell variable known to awk code. Do not try to substitute it directly into awk code; it isn't necessary, you will hate writing awk code that way, and it leads to code injection problems, especially when B comes from an untrusted source.