awk and regular expressions confusion - awk

Having never used awk before on Linux I am attempting to understand how it matches regular expressions. For example in the past based on my experience the regular expression /2/ would match 2 in all of the following lines.
This will match 2
This will not match 2
Now if I run the command awk '{if(NR~2)print}' sample.txt which has the contents
2 will be matched
This will not match 2
2 may be matched
The line that is matched is This will not match 2 which indicates it is matching the line 2 because if I replace the command with awk '{if(NR~3)print}' sample.txt it matches 2 may be matched. Now if I also run the command awk '{if(NR~/^2$/)print}' sample.txt, the matches the same exact line i.e. line 2.
However the source I am referring to at http://www.youtube.com/watch?feature=player_detailpage&v=Htnno4CHVus#t=502s seems to indicate otherwise.
What am I missing and how is the command awk '{if(NR~2)print}' sample.txt different to that of awk '{if(NR~/^2$/)print}' sample.txt?

The condition NR~2 is checking whether the record number, NR, matches 2. For a 2 or 3 line input file, the expression is equivalent to:
if (NR == 2)
Similarly with NR~3, of course. Try:
awk '/2/'
That will print all lines where the text of the line ($0) contains a 2. By default, a regular expression matches against the whole line; you could limit it to a particular field with $3 ~ /3/, for example.
An awk program consists of patterns and actions, where either the pattern or the action may be omitted.
awk '{ if ($0 ~ /2/) print }
/2/
/2/ { if ($0 ~ /a.*z/) print "Matches a.*z"; }'
The first line has no pattern; the action in the { ... } is executed for each input line (but only some input lines will generate output because of the conditional. All lines that contain a 2 will be printed. (If there is no argument to print, it prints $0 followed by a newline.)
The second line has a pattern but no action; all lines that contain a 2 will be printed again. (The missing action is equivalent to { print }.)
The third line has both a pattern and an action; all lines that both contain a 2 and also contain an 'a' followed by a 'z' will be remarked upon.
How are these two commands different?
`awk '{if(NR~2)print}' sample.txt`
`awk '{if(NR~/^2$/)print}' sample.txt`
The first command will print line numbers 2, 12, 20..29, 32, 42, ... 102, 112, 120..129, ... 200..299, ...; all lines where the line number contains a 2.
The second command will print only line number 2 because the /^2$/ constrains the value to contain start of string, digit 2 and end of string.
I take it that means that the source is wrong?
Now I've looked at the YouTube resource, I think you must have misunderstood what it is trying to teach. When it talks about {if (NR~2) print}, it should be saying it will print any line number which contains a 2; the video cites line numbers 2, 12, 20, 21, 22, etc. It should not be saying any line which contains a 2; I think the video does say that, but the video misspoke (but the text was accurate). The comparison against NR is not actually wrong, but it is aconventional; I'm not sure that I'd include regexes against NR in an introductory video describing awk. So, the video appears to have a glitch in the audio but the text on screen is accurate, I think. I may still have missed something.
The command awk '{ if ($0 ~ /2/) print } against the file say sample.txt with the contents I mentioned would only result in the output 2 will be matched. Is that correct?
That command, given the input:
2 will be matched
This will not match 2
2 may be matched
will print all three lines; they all contain the digit 2.
I also thought that the action was print and the pattern was $0 ~ /2/.
No; the pattern was empty (because there was nothing before the open brace) — so all lines match it — and the action was the part in braces { if ($0 ~ /2/) print }. Now, the action contains a conditional, but that's a separate issue.
Now the command awk '/2/' sample.txt would print all three lines. Is that correct?
Yes.

NR means the Number of the Record being processed...
You are matching against line number 2.

Related

Why does NR==FNR; {} behave differently when used as NR==FNR{ }?

Hoping someone can help explain the following awk output.
awk --version: GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)
OS: Linux sub system on Windows; Linux Windows11x64 5.10.102.1-microsoft-standard-WSL2
user experience: n00b
Important: In the two code snippets below, the only difference is the semi colon ( ; ) after NR==FNR in sample # 2.
sample # 1
'awk 'NR==FNR { print $0 }' lines_to_show.txt all_lines.txt
output # 1
2
3
4
5
7
sample # 2
'awk 'NR==FNR; { print $0 }' lines_to_show.txt all_lines.txt
output # 2
2 # why is value in file 'lines_to_show.txt appearing twice?
2
3
3
4
4
5
5
7
7
line -01
line -02
line -03
line -04
line -05
line -06
line -07
line -08
line -09
line -10
Generate the text input files
lines_to_show.txt: echo -e "2\n3\n4\n5\n7" > lines_to_show.txt
all_lines.txt: echo -e "line\t-01\nline\t-02\nline\t-03\nline\t-04\nline\t-05\nline\t-06\nline\t-07\nline\t-08\nline\t-09\nline\t-10" > all_lines.txt
Request/Questions:
If you can please explain why you know the answers to the questions below (experience, tutorial, video, etc..)
How does one read an `awk' program? I was under the impression that a semi colon ( ; ) is only a statement terminator, just like in C. It should not have an impact on the execution of the program.
In output # 2, why are the values in the file 'lines_to_show.txt appearing twice? Seems like awk is printing values from the 1st file "lines_to_show.txt" but printing them 10 times, which is the number of records in the file "all_lines.txt". Is this true? why?
Why in output # 1, only output from "lines_to_show.txt" is displayed? I thought awk will process each record in each file, so I expcted to see 15 lines (10 + 5).
What have I tried so far?
going though https://www.linkedin.com/learning/awk-essential-training/using-awk-command-line-flags?autoSkip=true&autoplay=true&resume=false&u=61697657
modifying the code to see the difference and use that to 'understand' what is going on.
trying to work through the flow using pen and paper
going through https://www.baeldung.com/linux/awk-multiple-input-files --> https://www.baeldung.com/linux/awk-multiple-input-files
awk 'NR==FNR { print $0 }' lines_to_show.txt all_lines.txt
Here you have one pattern-action pair, that is if (total) number of row equals file number of row then print whole line.
awk 'NR==FNR; { print $0 }' lines_to_show.txt all_lines.txt
Here you have two pattern-action pairs, as ; follows condition it is assumed that you want default action which is {print $0}, in other words that is equivalent to
awk 'NR==FNR{print $0}{ print $0}' lines_to_show.txt all_lines.txt
first print $0 is used solely when processing 1st file, 2nd print $0 is used indiscriminately (no condition given), so for lines_to_show.txt both prints are used, for all_lines.txt solely 2nd print.
man awk is the best reference:
An awk program is composed of pairs of the form:
pattern { action }
Either the pattern or the action (including the
enclosing brace characters) can be omitted.
A missing pattern shall match any record of input,
and a missing action shall be equivalent to:
{ print }
; terminates a pattern-action block. So you have two pattern/action blocks, both whose action is to print the line.

Comparing column of two files

I want to compare the first column of two csv files. I found this answer and tried to adapt it minimally (I want the first column, not the second and I want a print out on any mismatch, regardless of whether the value was present in a control column).
I thought this would be the way to go:
BEGIN { FS = "," }
{
if(FNR==NR) {a[$1]=$1}
else {if (a[$1] != $1) {print}}
}
[Here I have already removed one Syntax Error thanks to comment by RavinderSingh13]
The first line was supposed to set the separator to comma.
The second line was supposed to fill the array exactly for as long as I am still reading the first file.
The third line was to compare the elements of the first column of the second file elementwise to said array. Then print the entire line with a mismatch.
However, if I apply this to the following tiny files, which differ in the first non-header entry:
output2.csv:
#ID,COU,YEA,VOT#
4238,"CHN",2000,1
4239,"CHN",2000,1
4239,"CHN",2000,1
4240,"CHN",2000,1
and output.csv:
#ID,COU,YEA,VOT#
4237,"CHN",2000,1
4238,"CHN",2000,1
4239,"CHN",2000,1
4240,"CHN",2000,1
I dont get any print out. I call it like this:
ludi#ludi-M17xR4:~/Jason$ gawk -f compare_col_print_diff.awk output.csv output2.csv
ludi#ludi-M17xR4:~/Jason$
for line by line comparison, it's easier to match the records first
$ paste -d, file1 file2 | awk -F, '$1!=(f=$(NF/2+1)){print NR":",$1, f}'
will print values for which the first fields don't agree.
With your input files, this will give
2: 4238 4237
3: 4239 4238
The comment by Luuk made me realise a huge fundamental error in my original script, which I think should be recorded. The instruction
a[$1]=$1
Does not produce an array entry per line, but an array entry per distinct ID. Hence, such array is no basis for general strict comparison of the files. To remedy this, I wrote the following, which works on the example, but may still contain traps, as I am still learning:
BEGIN { FS = "," }
{
if(FNR==NR) {a[NR]=$1}
else {if (a[FNR] != $1) {print FNR, $0}}
}
Producing:
$ gawk -f compare_col_print_diff.awk output.csv output2.csv
2 4238,"CHN",2000,1
3 4239,"CHN",2000,1

Working with the awk line matching pattern

The tool awk has line pattern matching like
/pattern/ { statements; }
Is there any way to get the string of pattern as a variable, for use in match expressions etc?
Or even better, directly get:
pattern matched text
pattern matched length
match groups if there are any (groups) in the pattern
within the {statements} block?
If you use GNU awk and, instead of using /pattern/ in the condition part, use match and its third argument match(string, regexp [, array]) you get access to matched text, start index, length and the groups:
$ echo foobar |
gawk 'match($0, /(fo*)(b.*)/, a) {
print a[0],a[0,"start"],a[0,"length"] # 0 index refers to whole matched text
print a[2],a[2,"start"],a[2,"length"] # 1, 2, etc. to matched groups
}'
foobar 1 6
bar 4 3
See GNU awk documentation for match for more info.
Could you please try following ones.
1st: To get matching text match is BEST option.
awk 'match($0,/regex/){print substr($0,RSTART,RLENGTH)}' Input_file
2nd: To get length of matched string:
awk 'match($0,/regex/){print RLENGTH}' Input_file
3rd: To get all matched patterns use while loop with match until match found in line and we should get all matched patterns.

Using awk to print index of a pattern in a file

I've been sitting on this one for quite a while:
I would like to search for a pattern in a sample.file using awk and print the index:
>sample
ATGCGAAAAGATGAACGA
GTGACAGACAGACAGACA
GATAAACTGACGATAAAA
...
Let's say I want to find the index of the following pattern: "AAAA" (occurs twice), so the result should be 6 and 51.
EDIT:
I was able to use the following script:
cat ./sample.fasta |\
awk '{
s=$0
o=0
m="AAAA"
l=length(m)
i=index(s,m)
while (i>0) {
o+=i
print o
s=substr(s,i+l)
o+=l-1
i=index(s,m)
}
}'
However, it restarts the index on every new line, so the result is 6 and 15. I can always concatenate all lines into one single line, but maybe there's a more elegant way.
Thanks in advance
awk reads files line-by-line so it would never be a problem to find "all" indices in a multi-line file. Your problem is that you're trying to use a BEGIN block which, as its name suggests, only runs at the beginning of the program. As well, the index() function takes two arguments.
For your sample data, this should work:
awk '/AAAA/{print index($0,"AAAA")+l} NR>1{l+=length}' sample.file
The first block of code only runs when AAAA is matched, the second runs for every line after the first, incrementing the counter with the length of the line.
For the case where you have multiple matches per line, this should work:
awk -v pat=AAAA 'BEGIN{for(n=0;n<length(pat);n++) rep=rep"x"} NR>1{while(i=index($0,pat)){print i+l; sub(pat,rep);} l+=length}' sample.file
The pattern is passed as a variable; when the program starts a replacement text is generated based on the length of the pattern. Then each line after the first is looped over, getting the index of the pattern and replacing it so the next iteration returns the next instance.
It's worth mentioning that both these methods will match AAAAAA.
AWK indexes of course:
awk '{ l=index($0, "AAAA"); if (l) print l+i; i+=length(); }' dna.txt
6
51
if you're fine with zero based indices, this may be simpler.
$ sed 1d file | tr -d '\n' | grep -ob AAAA
5:AAAA
50:AAAA
assumes you have the header row as posted, if not remove sed command. Note that this assumes single byte chars as shown. For extended charsets it won't be the char position but byte-offset.

AWK : Ensure only one blank line after the output block

The following awk code outputs what is required except that it outputs two blank lines after each block of data. Only one blank line needs to be inserted. (Without the last {"print "\n"} statement, no blank lines are output. With the statement, there are two blank lines. I need only one blank line.)
/Reco/ {for(i=0; i<=2; i++) {getline; print} {print "\n"}}
Based on your comment below that you actually want the line that matches /Reco/ and 2 subsequent lines and a blank line (to be inserted after that) here's how to do that based on idiom "g" below:
awk '/Reco/{c=3} c&&c--{print; if(!c)print ""}' file
wrt an explanation - just remember that awk provides this functionality for you:
WHILE read line from file
DO
execute the users script (/Reco/{c=3} c&&c--{print; if(!c)print ""})
DONE
and that the body of an awk script is made up of:
<condition> { <action> }
statements with the default condition being TRUE and the default action being to print the current record/line.
The posted awk script above does the following:
/Reco/ { # IF the pattern "Reco" is present on the current line THEN
c=3 # Set the count of the number of lines to print to 3
} # ENDIF
c&&c-- { # IF c is non-zero THEN decrement c and THEN
print; # print the current line
if(!c) # IF c is now zero (i.e. this is the 3rd line) THEN
print "" # print a blank line
# ENDIF
} # ENDIF
so the whole execution of parsing the input file is:
WHILE read line from file
DO
/Reco/ { # IF the pattern "Reco" is present on the current line THEN
c=3 # Set the count of the number of lines to print to 3
} # ENDIF
c&&c-- { # IF c is non-zero THEN decrement c and THEN
print; # print the current line
if(!c) # IF c is now zero (i.e. this is the 3rd line) THEN
print "" # print a blank line
# ENDIF
} # ENDIF
DONE
Maybe it'd be a little clearer if the script was written as something like:
awk '/Reco/{c=3} c{c--; print; if(c == 0)print ""}' file
You got the answer you were looking for but here's how to really print the N lines after some pattern in awk:
c&&c--;/pattern/{c=N}
which in your case would be:
c&&c--;/Reco/{c=3}
and if you want to add that extra newline then it becomes:
c&&c--{print; if(!c)print ""} /Reco/{c=3}
If you're considering using getline make sure you read http://awk.info/?tip/getline first and understand all of the caveats so you know what you're getting yourself into.
P.S. The following idioms describe how to select a range of records given
a specific pattern to match:
a) Print all records from some pattern:
awk '/pattern/{f=1}f' file
b) Print all records after some pattern:
awk 'f;/pattern/{f=1}' file
c) Print the Nth record after some pattern:
awk 'c&&!--c;/pattern/{c=N}' file
d) Print every record except the Nth record after some pattern:
awk 'c&&!--c{next}/pattern/{c=N}1' file
e) Print the N records after some pattern:
awk 'c&&c--;/pattern/{c=N}' file
f) Print every record except the N records after some pattern:
awk 'c&&c--{next}/pattern/{c=N}1' file
g) Print the N records from some pattern:
awk '/pattern/{c=N}c&&c--' file
I changed the variable name from "f" for "found" to "c" for "count" where
appropriate as that's more expressive of what the variable actually IS.
#Kevin's post provides the specific answer (use print "" or, as suggested by #BMW, printf ORS), but here's some background:
In awk,
print
is the same as:
print $0
i.e., it prints the current input line followed by the output record separator - which defaults to \n and is stored in the special ORS variable.
You can pass arguments to print to print something other than (or in addition to) $0, but the record separator is invariably appended.
Note that if you pass multiple arguments separated with , to print, they will be output separated by the output field separator - which defaults to a space and is stored in the special variable OFS.
By contrast, the - more flexible - printf function takes a format string (as in its C counterpart) and as many arguments as are needed to instantiate the placeholders (fields) in the format string.
An output record separator is NOT appended to the result.
For instance, the printf equivalent of what print without arguments does is:
printf "%s\n", $0 # assumes that \n is the output record separator
Or, more generally:
printf "%s%s", $0, ORS
Note that, as the names suggest, the output field/record separators (OFS/ORS) have input counterparts (FS/RS) - their respective default values are identical (single space / \n - though on parsing input multiple adjacent spaces are treated as a single field separator).
print already includes the newline. Just use print "".