AWK script, linefeed under Windows causing different function - awk

I have a simple AWK script which I try to execute under Windows. Gnu AWK 3.1.6.
The awk script is run with awk -f script.awk f1 f2 under Windows 10.
After spending almost half a day debugging, I came to find that the following two scenarios produce different results:
FNR==NR{
a[$0]++;cnt[1]+=1;next
}
!a[$0]
versus
FNR==NR
{
a[$0]++;cnt[1]+=1;next
}
!a[$0]
The difference, of course, is the linefeed at line 1.
It puzzles me because I don't recall reading anywhere that awk is sensitive to linefeeds. Other linefeeds in the script don't seem to matter.
In example one, the desired result is achieved. Example 2 prints f1, which is not desired.
So I made it work, but I would like to know why.

From the docs (https://www.gnu.org/software/gawk/manual/html_node/Statements_002fLines.html)
awk is a line-oriented language. Each rule’s action has to begin on
the same line as the pattern. To have the pattern and action on
separate lines, you must use backslash continuation; there is no other
option.
Note that the action only has to begin on the same line as the pattern. After that, as we're all aware, it can be spread over multiple lines, though not willy-nilly. From the same page in the docs:
However, gawk ignores newlines after any of the following symbols and
keywords:
, { ? : || && do else
In Example 2, since there is no action beginning on the same line as the FNR == NR pattern, the default action of printing the line is performed when that pattern is true (which it is for all and only the records of f1). Similarly, in that example the action block is not paired with any pattern on its own line, so it is executed for every record (though there's no visible result for that).
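Per the quoted docs, the only way to keep the opening brace on its own line and still pair the action with the FNR==NR pattern is backslash continuation. A minimal sketch of that variant of Example 2 (untested here, just following the manual's rule):
FNR==NR \
{
a[$0]++;cnt[1]+=1;next
}
!a[$0]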

Related

combine two awk commands NF and BEGIN (.csv processing)

How to combine two awk commands:
awk NF file_name.csv > file_name_after_NF.csv
The output is used in the next step:
awk 'BEGIN{f=""}{if($0!=f){print $0}if(NR==1){f=$0}}' file_name_after_NF.csv > file_name_postprocess.csv
Assuming the intermediate file file_name_after_NF.csv was written solely to feed the 'no blank lines' version of the .csv file into the 'remove repeat lines' command, the two procedures can be combined by making NF the condition pattern for the main awk code block:
awk 'BEGIN{f=""} NF{if($0!=f){print $0}if(NR==1){f=$0}}' file_name.csv > file_name_postprocess.csv
In the above procedure, the main awk block is only applied where there are one or more fields in the record.
If you need a file copy of file_name_after_NF.csv, this can be created by adding a file-write block within the main block of the previous awk procedure:
awk 'BEGIN{f=""} NF{if($0!=f){print $0}if(NR==1){f=$0}{print $0 > file_name_after_NF.csv}}' file_name.csv > file_name_postprocess.csv
{print $0 > "file_name_after_NF.csv"} does the file writing from within awk (note the quotes: an unquoted name there would be treated as an awk variable). As this block is within the main block processed according to the NF condition pattern, only lines with fields are written to file_name_after_NF.csv.
More generally, if a little more cumbersomely, awk procedures can be combined by piping their output into successive awk procedures. In your case this could be achieved using:
awk NF file_name.csv | awk 'BEGIN{f=""}{if($0!=f){print $0}if(NR==1){f=$0}}' > file_name_postprocess.csv
or, if you need the intermediate file, again include an awk file print block:
awk NF file_name.csv | awk 'BEGIN{f=""}{if($0!=f){print $0}if(NR==1){f=$0}{print $0 > "file_name_after_NF.csv"}}' > file_name_postprocess.csv
Edit dealing with cases where the header line is after one or more blank lines
@glennjackman raised a point relating to an example where blank lines existed before the first/header row, since the NR==1 condition in the first two examples above would no longer set f to contain the header. The last two examples, where awk procedures are joined by pipes, would still work, but the first two would not.
To fix this, a further variable can be added to the BEGIN awk block that is updated as soon as the first non-empty line is seen in the main block. This allows for that line to be identified so that prior empty lines do not matter:
awk 'BEGIN{f="";headerFlag="none"} NF{if($0!=f){print $0}if(headerFlag=="none"){f=$0;headerFlag="yes"}}' file_name.csv > file_name_postprocess.csv
The NR==1 conditional in the original script is changed here to check whether the headerFlag set in BEGIN has been changed. If not, f is set to track repeats of that line and headerFlag is changed so the block will only run on the first encounter of a non-empty record.
The same change can be used in the second solution above.
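For completeness, a minimal sketch of that second solution (the one that also writes file_name_after_NF.csv) with the same change applied:
awk 'BEGIN{f=""; headerFlag="none"} NF{if($0!=f){print $0} if(headerFlag=="none"){f=$0; headerFlag="yes"} {print $0 > "file_name_after_NF.csv"}}' file_name.csv > file_name_postprocess.csv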
I wasn't planning to try to answer this until after you posted sample input and expected output, but since you already have answers, here's my guess at what you might be asking how to write:
awk 'BEGIN{f=""} !NF{next} $0!=f; f==""{f=$0}' file_name.csv > file_name_postprocess.csv
but without sample input/output it's untested.
I'd recommend you start using white space in your code to improve readability btw.
Also - testing NF for a CSV without setting FS to , is probably the wrong thing to do, but I don't know if you're trying to skip lines of all blanks, lines of all commas, or something else, so I don't know what the right thing to do is, but maybe it's this:
awk 'BEGIN{FS=","; f=""} $0~("^"FS"*$"){next} $0!=f; f==""{f=$0}' file_name.csv > file_name_postprocess.csv

Multiple awk pattern matching in one line

Let's say I want to match foo and bar in a file. The following works:
/foo/{commands}/bar/{commands}
Note: here there is no separator between /foo/{commands} and /bar/{commands}.
The following is also okay:
/foo/{commands1}{commands2} where {commands2} is executed for every line and I've left out the pattern.
But what if I want to leave out the commands? What's awk's syntax rule here? The following doesn't work:
/foo//bar/
Of course, I could write it as /foo/{print}/bar/{print}, but I was wondering what the delimiter is for separating segments and why you sometimes need it and sometimes you don't.
awk works on a pattern-then-action model: if we mention a /..../ pattern and don't mention any action, then whenever the condition is TRUE the line is printed by default. In case you want to put this into 2 different statements, try:
awk '/foo/;/bar/' Input_file
The above means:
Since they are separated by ;, they are treated as 2 different conditions.
When /foo/ is true for a line, NO action is mentioned, so that line is printed.
When /bar/ is true for a line, the same applies: the condition is true, no action is mentioned, so the line is printed.
Note, however, that if any line contains both strings it will be printed 2 times, which I believe you may NOT want. To avoid that, within a single condition itself, try something like:
awk '/foo|bar/' Input_file
Or, in case you need to check whether both strings are present in the same line, try:
awk '/foo/ && /bar/' Input_file
To match foo and bar in a file - just combine patterns:
awk '/foo/ && /bar/ ....'

Using awk to print index of a pattern in a file

I've been sitting on this one for quite a while:
I would like to search for a pattern in a sample.file using awk and print the index:
>sample
ATGCGAAAAGATGAACGA
GTGACAGACAGACAGACA
GATAAACTGACGATAAAA
...
Let's say I want to find the index of the following pattern: "AAAA" (occurs twice), so the result should be 6 and 51.
EDIT:
I was able to use the following script:
cat ./sample.fasta |\
awk '{
s=$0
o=0
m="AAAA"
l=length(m)
i=index(s,m)
while (i>0) {
o+=i
print o
s=substr(s,i+l)
o+=l-1
i=index(s,m)
}
}'
However, it restarts the index on every new line, so the result is 6 and 15. I can always concatenate all lines into one single line, but maybe there's a more elegant way.
Thanks in advance
awk reads files line-by-line so it would never be a problem to find "all" indices in a multi-line file. Your problem is that you're trying to use a BEGIN block which, as its name suggests, only runs at the beginning of the program. As well, the index() function takes two arguments.
For your sample data, this should work:
awk '/AAAA/{print index($0,"AAAA")+l} NR>1{l+=length}' sample.file
The first block of code only runs when AAAA is matched; the second runs for every line after the first, incrementing the offset counter l by the length of the line.
For the case where you have multiple matches per line, this should work:
awk -v pat=AAAA 'BEGIN{for(n=0;n<length(pat);n++) rep=rep"x"} NR>1{while(i=index($0,pat)){print i+l; sub(pat,rep);} l+=length}' sample.file
The pattern is passed as a variable; when the program starts, a replacement text is generated based on the length of the pattern. Then each line after the first is looped over, getting the index of the pattern and replacing the match so that the next iteration returns the next instance.
It's worth mentioning that both these methods will match AAAAAA.
AWK indexes of course:
awk '{ l=index($0, "AAAA"); if (l) print l+i; i+=length(); }' dna.txt
6
51
If you're fine with zero-based indices, this may be simpler:
$ sed 1d file | tr -d '\n' | grep -ob AAAA
5:AAAA
50:AAAA
This assumes you have the header row as posted; if not, remove the sed command. Note that this assumes single-byte chars as shown. For extended charsets it won't be the char position but the byte offset.

awk statement within sed

I have multiple occurrences of the pattern:
)0.[0-9][0-9][0-9]:
where [0-9] is any digit, in various text contexts, but the pattern is unique as this regex. I need to turn the decimal fraction into an integer (percent values from 0 to 99).
A small example substring would be
=1:0.00055)0.944:0.02762)0.760:0
to turn into
=1:0.00055)94:0.02762)76:0
What I’m doing is:
cat file | sed -e "s/)\([0-9].[0-9][0-9][0-9]\):/)`echo "\1"|awk '{ r=int(100*$0); if((r>=0)&&(r<=100)){ print r; } else { print "error"; exit(-1); } }'`:/g"
but the output is )0:
where is the fault?...
Since you asked 'where is the fault' and not 'how to solve the problem':
Your backquoted pipeline echo ...|awk ... is executed FIRST, producing a single 0 which is then made part of the s/// command passed to sed and thus substituted everywhere the pattern matches. PS: using the newer (post-Reagan) and more flexible notation for command substitution $( ... ) instead of backquotes is preferred in all shells except csh family, and especially on Stack where backquotes are special to markdown and troublesome to show in text.
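You can see that first step in isolation. This is only a hypothetical reconstruction of what the shell runs before sed ever starts; at that point the backreference \1 is still the literal two-character string \1, which awk treats as the number 0:
echo "\1" | awk '{ r=int(100*$0); if((r>=0)&&(r<=100)){ print r } else { print "error"; exit(1) } }'
That prints 0, so the s/// command sed finally receives has a fixed replacement of )0: for every match, which is exactly the )0: output you observed.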
If you want to actually solve the problem, which you didn't describe clearly or completely, some pointers toward a better direction:
Standard sed can't execute a command to generate replacement text; GNU sed can with flag e, but you need to make the whole pattern space the command and fiddle anything else into hold space, which is tedious. perl can evaluate an expression in the replacement for s, including arithmetic; awk (even gawk) can't do so directly, but you can get the same effect by doing the match and the replace/rebuild as separate steps, depending on the unspecified and unclear details of exactly what you want to do. If you want to keep the rest of the line unchanged, something like:
awk 'match($0,/)0[.][0-9][0-9][0-9]:/){ print substr($0,1,RSTART) (substr($0,RSTART+1,RLENGTH-2)*100) substr($0,RSTART+RLENGTH-1) }'
But you don't actually need arithmetic here if you're satisfied with truncating. Just discard the leading 0. and the last digit and keep the two digits in between:
sed 's/)0[.]\([0-9][0-9]\)[0-9]:/)\1:/g'
Note that . in a regexp, unless escaped or in a charclass (as I did), matches any character, not just a period, which may or may not be a problem since you didn't give the rest of your input.
And PS: negative numbers for process exit status don't work (except IIRC Plan 9). Use small (usually < 128) positive status values for errors; most common is to just use 1.
Check this perl one-liner command:
perl -pe 's/\)(\d+\.\d+):/sprintf ")%d:", $1 * 100/ge' file
Before:
=1:0.00055)0.944:0.02762)0.760:0
After:
=1:0.00055)94:0.02762)76:0
If you need to replace in the file itself (in-place editing), add the -i switch:
perl -i -pe '...'

How do awk match and ~ operators work together?

I'm having trouble understanding this awk code:
$0 ~ ENVIRON["search"] {
match($0, /id=[0-9]+/);
if (RSTART) {
print substr($0, RSTART+3, RLENGTH-3)
}
}
How do the ~ and match() operators interact with each other?
How does the match() have any effect, if its output isn't printed or echo'd? What does it actually return or do? How can I use it in my own code?
This is related to Why are $0, ~, &c. used in a way that violates usual bash syntax docs inside an argument to awk?, but that question was centered around understanding the distinction between bash and awk syntaxes, whereas this one is centered around understanding the awk portions of the script.
Taking your questions one at a time:
How do the ~ and match() operators interact with each other?
They don't. At least not directly in your code. ~ is the regexp comparison operator. In the context of $0 ~ ENVIRON["search"] it is being used to test if the regexp contained in the environment variable search exists as part of the current record ($0). If it does, then the code in the subsequent {...} block is executed; if it doesn't, then it isn't.
How does the match() have any effect, if its output isn't printed or
echoed?
It identifies the starting point (and stores it in the awk variable RSTART) and the length (RLENGTH) of the first substring within the first parameter ($0) that matches the regexp provided as the second parameter (id=[0-9]+). With GNU awk it can also populate a 3rd array argument with segments of the matching string identified by round brackets (aka "capture groups").
What does it actually return or do?
It returns the value of RSTART, which is zero if no match was found, 1 or greater otherwise. For what it does, see the previous answer.
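As a small illustration (this assumes gawk for the optional third, array argument; the sample line is made up):
echo 'foo id=42 bar' | gawk '{
  if (match($0, /id=([0-9]+)/, m))    # match() returns RSTART: 5 here, 0 if there is no match
    print RSTART, RLENGTH, m[1]       # prints: 5 5 42
}'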
How can I use it in my own code?
The example you posted shows one way, but that code would more typically be written as:
($0 ~ ENVIRON["search"]) && match($0,/id=[0-9]+/) {
print substr($0, RSTART+3, RLENGTH-3)
}
and using a string rather than regexp comparison for the first part would probably be even more appropriate:
index($0,ENVIRON["search"]) && match($0,/id=[0-9]+/) {
print substr($0, RSTART+3, RLENGTH-3)
}
Get the book Effective Awk Programming, 4th Edition, by Arnold Robbins to learn how to use awk.
The script in your question does the following:
use the regex id=[0-9]+ to find a match in each line
if the start position of the match (RSTART) is not 0 then:
print the match without the id=
This is shorter but does the same:
xinput --list | grep -Po 'id=[0-9]+' | cut -c4-