Why does NR==FNR; {} behave differently when used as NR==FNR{ }? - awk

Hoping someone can help explain the following awk output.
awk --version: GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)
OS: Linux sub system on Windows; Linux Windows11x64 5.10.102.1-microsoft-standard-WSL2
user experience: n00b
Important: In the two code snippets below, the only difference is the semicolon ( ; ) after NR==FNR in sample # 2.
sample # 1
awk 'NR==FNR { print $0 }' lines_to_show.txt all_lines.txt
output # 1
2
3
4
5
7
sample # 2
awk 'NR==FNR; { print $0 }' lines_to_show.txt all_lines.txt
output # 2
2 # why is each value from file 'lines_to_show.txt' appearing twice?
2
3
3
4
4
5
5
7
7
line -01
line -02
line -03
line -04
line -05
line -06
line -07
line -08
line -09
line -10
Generate the text input files
lines_to_show.txt: echo -e "2\n3\n4\n5\n7" > lines_to_show.txt
all_lines.txt: echo -e "line\t-01\nline\t-02\nline\t-03\nline\t-04\nline\t-05\nline\t-06\nline\t-07\nline\t-08\nline\t-09\nline\t-10" > all_lines.txt
Request/Questions:
If you can please explain why you know the answers to the questions below (experience, tutorial, video, etc..)
How does one read an awk program? I was under the impression that a semicolon ( ; ) is only a statement terminator, just like in C. It should not have an impact on the execution of the program.
In output # 2, why are the values in the file 'lines_to_show.txt' appearing twice? Seems like awk is printing values from the 1st file "lines_to_show.txt" but printing them 10 times, which is the number of records in the file "all_lines.txt". Is this true? Why?
Why, in output # 1, is only output from "lines_to_show.txt" displayed? I thought awk would process each record in each file, so I expected to see 15 lines (10 + 5).
What have I tried so far?
going through https://www.linkedin.com/learning/awk-essential-training/using-awk-command-line-flags?autoSkip=true&autoplay=true&resume=false&u=61697657
modifying the code to see the difference and use that to 'understand' what is going on.
trying to work through the flow using pen and paper
going through https://www.baeldung.com/linux/awk-multiple-input-files

awk 'NR==FNR { print $0 }' lines_to_show.txt all_lines.txt
Here you have one pattern-action pair: if the total record number (NR) equals the record number within the current file (FNR), print the whole line. That condition holds only while the first file is being read.
awk 'NR==FNR; { print $0 }' lines_to_show.txt all_lines.txt
Here you have two pattern-action pairs. Because ; follows the condition, the first pair has no action, so awk applies the default action, which is {print $0}. In other words, it is equivalent to
awk 'NR==FNR{print $0}{ print $0}' lines_to_show.txt all_lines.txt
The first print $0 runs only while processing the 1st file; the 2nd print $0 runs unconditionally (no pattern is given). So for lines_to_show.txt both prints run, and for all_lines.txt only the 2nd one does.
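One way to check this equivalence is to run both forms and compare their output. The following is a minimal sketch: it recreates shortened versions of the question's two input files in a scratch directory, and the variable names with_semicolon and two_rules are made up for illustration.

```shell
# Recreate small versions of the question's input files in a scratch directory.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '2\n3\n' > lines_to_show.txt
printf 'line\t-01\nline\t-02\n' > all_lines.txt

# Form from the question: the ";" ends the first rule, so that rule has no action.
with_semicolon=$(awk 'NR==FNR; { print $0 }' lines_to_show.txt all_lines.txt)

# Same program with the default action written out explicitly.
two_rules=$(awk 'NR==FNR { print $0 } { print $0 }' lines_to_show.txt all_lines.txt)

printf '%s\n' "$with_semicolon"
[ "$with_semicolon" = "$two_rules" ] && echo "both forms print the same lines"
```

Each line of the first file shows up twice and each line of the second file once, which matches output # 2 in the question.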

man awk is the best reference:
An awk program is composed of pairs of the form:
pattern { action }
Either the pattern or the action (including the
enclosing brace characters) can be omitted.
A missing pattern shall match any record of input,
and a missing action shall be equivalent to:
{ print }
; terminates a pattern-action block. So you have two pattern/action blocks, both whose action is to print the line.
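Given those rules, a common way to stop a later rule from also firing on the first file is to end the first rule's action with next, which skips the remaining rules for the current record. A small sketch, assuming shortened versions of the two files from the question (the "first file:" and "second file:" labels are only there to make the output readable):

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '2\n3\n' > lines_to_show.txt
printf 'line\t-01\nline\t-02\n' > all_lines.txt

# Rule 1 runs only for the first file; "next" keeps rule 2 from seeing those records.
# Rule 2 has no pattern, so it runs for every record that reaches it.
out=$(awk 'NR==FNR { print "first file:", $0; next }
           { print "second file:", $0 }' lines_to_show.txt all_lines.txt)
printf '%s\n' "$out"
```

With next in place, every input line is printed exactly once.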

Related

What does this Awk expression mean

I am working with bash script that has this command in it.
awk -F ‘‘ ‘/abc/{print $3}’|xargs
What is the meaning of this command?? Assume input is provided to awk.
The quick answer is it'll do different things depending on the version of awk you're running and how many fields of output the awk script produces.
I assume you meant to write:
awk -F '' '/abc/{print $3}'|xargs
not the syntactically invalid (due to "smart quotes"):
awk -F ‘’’/abc/{print $3}’|xargs
-F '' is undefined behavior per POSIX, so what it will do depends on the version of awk you're running. In some awks it'll split the current line into 1 character per field. In others it'll be ignored and the line will be split into fields at every sequence of white space. In other awks still it could do anything else.
/abc/ looks for a string matching the regexp abc on the current line and if found invokes the subsequent action, in this case {print $3}.
However it's split into fields, print $3 will print the 3rd such field.
xargs as used will just print chunks of the multi-line input it's getting all on 1 line so you could get 1 line of all-fields output if you don't have many fields being output or several lines of multi-field output if you do.
I suspect the intent of that code was to do what this code actually will do in any awk alone:
awk '/abc/{printf "%s%s", sep, substr($0,3,1); sep=OFS} END{print ""}'
e.g.:
$ printf 'foo\nxabc\nyzabc\nbar\n' |
awk '/abc/{printf "%s%s", sep, substr($0,3,1); sep=OFS} END{print ""}'
b a
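To see the xargs step in isolation: when no command is given, xargs hands its input words to echo, so several short input lines come out as one space-separated line. A tiny sketch with made-up input (the variable name joined is illustrative):

```shell
# Three one-word lines in, one space-separated line out.
joined=$(printf 'a\nb\nc\n' | xargs)
printf '%s\n' "$joined"
```

This prints a b c on a single line.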

Delete every line if occurence found

I have a file with this format content:
1 6 8
1 6 9
1 12 20
1 6
2 8
2 9
2 12
2 20
2 35
I want to delete all the lines if the number (from the 2nd or 3rd column, but not from the 1st) is found in the next lines, whether it is in the 2nd or 3rd column, including the line where the initial number is found.
I should have this as an output:
2 35
I've tried using:
awk '{for(i=2;i<=NF;i++){if($i in a){next};a[$i]}} 1'
but it doesn't seem to work.
What is wrong ?
One-pass awk that hashes all the records to r[NR] and keeps another array a[$i] for the values seen in fields $2,...NF.
awk ' {
    for(i=2;i<=NF;i++)        # iterate fields starting from the second
        if($i in a) {         # if field value was seen before
            delete r[a[$i]]   # delete related record
            a[$i]=""          # clear a
            f=1               # flag up
        } else {              # if it was not seen before
            a[$i]=NR          # add record number to a
            r[NR]=$0
        }
    if(f!=1)                  # if flag was not raised
        r[NR]=$0              # store record on record number
    else                      # if it was raised
        f=""                  # flag down
}
END {
    for(i=1;i<=NR;++i)
        if(i in r)
            print r[i]        # output remaining
}' file
Output:
2 35
The simplest way is a double-pass algorithm where you read your file twice.
The idea is to store all values in an array a and count how many times they appear. If a value appears 2 or more times, it means you have found more than a single entry and you should not print the line.
awk '(NR==FNR){a[$2]++; if(NF>2) a[$3]++; next}
(NF==2) && (a[$2]==1);
(NF==3) && (a[$2]==1 && a[$3]==1)' <file> <file>
In practice, you should avoid things such as a[var]==1 if you are not sure whether var is in the array as it will create that array element. However, since we never increase it any more, it is fine to proceed.
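The side effect mentioned here can be observed directly. The sketch below uses a hypothetical array named seen and a helper function count (neither is from the answers above) to show that a membership test with in leaves the array untouched, while a plain reference such as seen["x"]==1 creates the element:

```shell
# Count elements with a for-in loop, which works in any POSIX awk.
out=$(awk 'function count(arr,   k, n) { n = 0; for (k in arr) n++; return n }
BEGIN {
    if ("x" in seen) print "unexpected"     # membership test: creates nothing
    print "after in-test:", count(seen)
    if (seen["x"] == 1) print "unexpected"  # plain reference: creates seen["x"]
    print "after reference:", count(seen)
}')
printf '%s\n' "$out"
```

The element count goes from 0 to 1 only after the plain reference.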
If you want to achieve the same thing with more than three fields you can do:
awk '(NR==FNR){for(i=2;i<=NF;++i) a[$i]++; next }
{for(i=2;i<=NF;++i) if(a[$i]>1) next }
{print}' <file> <file>
While both these solutions read the file twice, you can also store the full file in memory and read the file only a single time. This, however, is exactly the same algorithm:
awk '{for(i=2;i<=NF;++i) a[$i]++; b[NR]=$0}
END{ for(j=1;j<=NR;++j) {
$0=b[j]; dup=0
# a bare "continue" here would only affect the inner loop, so use a flag
for(i=2;i<=NF;++i) if(a[$i]>1) { dup=1; break }
if(!dup) print $0
}
}' <file>
comment: this single-pass solution is very simple and stores the full file in memory. The solution of James Brown is very clever. It removes stuff from memory when they are not needed anymore. A bit shorter version is:
awk '{ for(i=2;i<=NF;++i) if ($i in a) delete b[a[$i]]; else { a[$i]=NR; b[NR]=$0 }}
END { for(n=1;n<=NR;++n) if(n in b) print b[n] }' <file>
note: you should never strive for the shortest solution, but for the most readable one!
Could you please try following.
awk '
FNR==NR{
for(i=2;i<=NF;i++){
a[$i]++
}
next
}
(NF==2 && a[$2]==1) || (NF==3 && a[$2]==1 && a[$3]==1)
' Input_file Input_file
Output will be as follows.
2 35
$ cat tst.awk
NR==FNR {
cnt[$2]++
cnt[$3]++
next
}
cnt[$2]<2 && cnt[$NF]<2
$ awk -f tst.awk file file
2 35
This might work for you (GNU sed):
sed -r 'H;s/^[0-9]+ +//;G;s/\n(.*\n)/\1/;h;$!d;s/^([^\n]*)\n(.*)/\2\n \1/;:a;/^[0-9]+ +([0-9]+)\n(.*\n)*[^\n]*\1[^\n]*\1[^\n]*$/bb;/^[0-9]+ +[0-9]+ +([0-9]+)\n(.*\n)*[^\n]*\1[^\n]*\1[^\n]*$/bb;/\n/P;:b;s/^[^\n]*\n//;ta;d' file
This is not a serious solution however it demonstrates what can be achieved using only matching and substitution.
The solution makes a copy of the original file and whilst doing so accumulates all numbers in the second and possible third fields of each record in a separate line which it maintains at the head of the copy.
At the end of the file, the first line of the copy contains all the pertinent keys and if there are duplicate keys then any line in the file that contains such a key is deleted. This is achieved by moving the keys (the first line) to the end of the file and matching the second (and possibly third) fields of each record on those keys.

awk print without a file

How to print using awk without a file.
script.sh
#!/bin/sh
for i in {2..10};do
awk '{printf("%.2f %.2f\n", '$i', '$i'*(log('$i'/('$i'-1))))}'
done
sh script.sh
Desired output
2 value
3 value
4 value
and so on
value indicates the quantity after computation
BEGIN Block is needed if you are not providing any input to awk either by file or standard input. This block executes at the very start of awk execution even before the first file is opened.
awk 'BEGIN{printf.....
so it is like:
From man page:
Gawk executes AWK programs in the following order. First, all variable assignments specified via the -v option are performed. Next, gawk compiles the program into an internal form. Then, gawk executes the code in the BEGIN block(s) (if any), and then proceeds to read each file named in the ARGV array. If there are no files named on the command line, gawk reads the standard input.
awk structure:
awk 'BEGIN{get initialization data from this block}{execute the logic}' optional_input_file
As PS. correctly pointed out, do use the BEGIN block to print stuff when you don't have a file to read from.
Furthermore, in your case you are looping in Bash and then calling awk on every loop. Instead, loop directly in awk:
$ awk 'BEGIN {for (i=2;i<=10;i++) print i, i*log(i/(i-1))}'
2 1.38629
3 1.2164
4 1.15073
5 1.11572
6 1.09393
7 1.07905
8 1.06825
9 1.06005
10 1.05361
Note that I started the loop at 2, because i=1 would mean log(1/(1-1))=log(1/0), which is a division by zero.
I would suggest a different approach:
seq 2 10 | awk '{printf("%.2f %.2f\n", $1, $1*(log($1/($1-1))))}'
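If the loop does have to stay in the shell, the value can be handed to awk with -v instead of being spliced into the program text with quotes. A minimal sketch, assuming a plain POSIX sh and only the first three values; the awk variable name n is made up for illustration:

```shell
# POSIX sh has no {2..10} brace expansion, so list the values explicitly.
out=$(for i in 2 3 4; do
    # -v assigns the shell value to the awk variable n before BEGIN runs.
    awk -v n="$i" 'BEGIN { printf "%.2f %.2f\n", n, n * log(n / (n - 1)) }'
done)
printf '%s\n' "$out"
```

This prints 2.00 1.39, 3.00 1.22 and 4.00 1.15, one pair per line. It still starts one awk process per iteration, so looping inside a single BEGIN block remains the cheaper option.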

How to use multiple passes with gawk?

I'm trying to use GAWK from CYGWIN to process a csv file. Pass 1 finds the max value, and pass 2 prints the records that match the max value. I'm using a .awk file as input. When I use the text in the manual, it matches on both passes. I can use the IF form as a workaround, but that forces me to use IF inside every pattern match, which is kind of a pain. Any idea what I'm doing wrong?
Here's my .awk file:
pass == 1
{
print "pass1 is", pass;
}
pass == 2
{
if(pass == 2)
print "pass2 is", pass;
}
Here's my output (input file is just "hello"):
hello
pass1 is 1
pass1 is 2
hello
pass2 is 2
Here's my command line:
gawk -F , -f test.awk pass=1 x.txt pass=2 x.txt
I'd appreciate any help.
An (g)awk solution might look like this:
awk 'FNR == NR{print "1st pass"; next}
{print "second pass"}' x.txt x.txt
(Please replace awk by gawk if necessary.)
Let's say you wanted to find the maximum value in the first column of file x.txt and then print all lines which have this value in the first column. Your program might look like this (thanks to Ed Morton for some tips, see comment):
awk -F"," 'FNR==NR {max = ( (FNR==1) || ($1 > max) ? $1 : max ); next}
$1==max' x.txt x.txt
The output for x.txt:
6,5
2,6
5,7
6,9
is
6,5
6,9
How does this work? The variable NR keeps increasing with every record, whereas FNR is reset to 1 when reading a new file. Therefore, FNR==NR is only true for the first file processed.
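The two counters can be watched side by side. This is a small sketch with two made-up two-line files (first.txt and second.txt are illustrative names, not from the question):

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'a\nb\n' > first.txt
printf 'c\nd\n' > second.txt

# Print the file name, the running record number and the per-file record number.
out=$(awk '{ print FILENAME, NR, FNR, (FNR == NR ? "first file" : "later file") }' first.txt second.txt)
printf '%s\n' "$out"
```

NR runs 1, 2, 3, 4 while FNR runs 1, 2, 1, 2, so the two only agree on the lines of first.txt. One caveat: if the first file is empty, FNR==NR is true while reading the second file instead.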
So... F.Knorr answered your question accurately and concisely, and he deserves a big green checkmark. NR==FNR is exactly the secret sauce you're looking for.
But here is a different approach, just in case the multi-pass thing proves to be problematic. (Perhaps you're reading the file from a slow drive, a USB stick, across a network, DAT tape, etc.)
awk -F, '$1>m{delete l;n=0;m=$1}m==$1{l[++n]=$0}END{for(i=1;i<=n;i++)print l[i]}' inputfile
Or, spaced out for easier reading:
BEGIN {
FS=","
}
$1 > max {
delete list # empty the array
n=0 # reset the array counter
max=$1 # set a new max
}
max==$1 {
list[++n]=$0 # record the line in our array
}
END {
for(i=1;i<=n;i++) { # print the array in order of found lines.
print list[i]
}
}
With the same input data that F.Knorr tested with, I get the same results.
The idea here is to go through the file in ONE pass. We record every line that matches our max in an array, and if we come across a value that exceeds the max, we clear the array and start collecting lines afresh.
This approach is heavier on CPU and memory (depending on the size of your dataset), but being single pass, it is likely to be lighter on IO.
The issue here is that newlines matter to awk.
# This does what I should have done:
pass==1 {print "pass1 is", pass;}
pass==2 {if (pass==2) print "pass2 is", pass;}
# This is the code in my question:
# When pass == 1, do nothing
pass==1
# On every condition, do this
{print "pass1 is", pass;}
# When pass == 2, do nothing
pass==2
# On every condition, do this
{if (pass==2) print "pass2 is", pass;}
Using pass==1, pass==2 isn't as elegant, but it works.
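As a runnable check of the corrected layout, the sketch below keeps each opening brace on the same line as its pattern and uses the same pass=1 / pass=2 command-line assignments as the question. The one-line x.txt is recreated in a scratch directory:

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'hello\n' > x.txt

# Opening brace on the same line as the pattern: each line is one complete rule.
out=$(awk 'pass == 1 { print "pass1 is", pass }
           pass == 2 { print "pass2 is", pass }' pass=1 x.txt pass=2 x.txt)
printf '%s\n' "$out"
```

Each rule now fires only during its own pass, and the stray hello lines from the question's output are gone because no rule is left without an action.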

awk and regular expressions confusion

Having never used awk before on Linux I am attempting to understand how it matches regular expressions. For example in the past based on my experience the regular expression /2/ would match 2 in all of the following lines.
This will match 2
This will not match 2
Now if I run the command awk '{if(NR~2)print}' sample.txt which has the contents
2 will be matched
This will not match 2
2 may be matched
The line that is matched is This will not match 2, which indicates it is matching line 2, because if I replace the command with awk '{if(NR~3)print}' sample.txt it matches 2 may be matched. Now if I also run the command awk '{if(NR~/^2$/)print}' sample.txt, it matches the same exact line, i.e. line 2.
However the source I am referring to at http://www.youtube.com/watch?feature=player_detailpage&v=Htnno4CHVus#t=502s seems to indicate otherwise.
What am I missing and how is the command awk '{if(NR~2)print}' sample.txt different to that of awk '{if(NR~/^2$/)print}' sample.txt?
The condition NR~2 is checking whether the record number, NR, matches 2. For a 2 or 3 line input file, the expression is equivalent to:
if (NR == 2)
Similarly with NR~3, of course. Try:
awk '/2/'
That will print all lines where the text of the line ($0) contains a 2. By default, a regular expression matches against the whole line; you could limit it to a particular field with $3 ~ /3/, for example.
An awk program consists of patterns and actions, where either the pattern or the action may be omitted.
awk '{ if ($0 ~ /2/) print }
/2/
/2/ { if ($0 ~ /a.*z/) print "Matches a.*z"; }'
The first line has no pattern; the action in the { ... } is executed for each input line (but only some input lines will generate output because of the conditional). All lines that contain a 2 will be printed. (If there is no argument to print, it prints $0 followed by a newline.)
The second line has a pattern but no action; all lines that contain a 2 will be printed again. (The missing action is equivalent to { print }.)
The third line has both a pattern and an action; all lines that both contain a 2 and also contain an 'a' followed by a 'z' will be remarked upon.
How are these two commands different?
`awk '{if(NR~2)print}' sample.txt`
`awk '{if(NR~/^2$/)print}' sample.txt`
The first command will print line numbers 2, 12, 20..29, 32, 42, ... 102, 112, 120..129, ... 200..299, ...; all lines where the line number contains a 2.
The second command will print only line number 2 because the /^2$/ constrains the value to contain start of string, digit 2 and end of string.
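The difference shows up as soon as the input has more than a handful of lines. A short sketch, assuming seq is available to generate 25 numbered lines, with xargs used only to put the results on one line:

```shell
# 25 input lines; each line's content equals its line number.
contains_two=$(seq 25 | awk 'NR ~ 2' | xargs)
exactly_two=$(seq 25 | awk 'NR ~ /^2$/' | xargs)
printf 'NR ~ 2      -> %s\n' "$contains_two"
printf 'NR ~ /^2$/  -> %s\n' "$exactly_two"
```

The unanchored match selects lines 2, 12 and 20 through 25, while the anchored one selects only line 2.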
I take it that means that the source is wrong?
Now I've looked at the YouTube resource, I think you must have misunderstood what it is trying to teach. When it talks about {if (NR~2) print}, it should be saying it will print any line whose line number contains a 2; the video cites line numbers 2, 12, 20, 21, 22, etc. It should not be saying any line which contains a 2; I think the video does say that, but the video misspoke (the text on screen was accurate). The comparison against NR is not actually wrong, but it is unconventional; I'm not sure that I'd include regexes against NR in an introductory video describing awk. So, the video appears to have a glitch in the audio but the text on screen is accurate, I think. I may still have missed something.
The command awk '{ if ($0 ~ /2/) print }' against the file, say sample.txt, with the contents I mentioned would only result in the output 2 will be matched. Is that correct?
That command, given the input:
2 will be matched
This will not match 2
2 may be matched
will print all three lines; they all contain the digit 2.
I also thought that the action was print and the pattern was $0 ~ /2/.
No; the pattern was empty (because there was nothing before the open brace) — so all lines match it — and the action was the part in braces { if ($0 ~ /2/) print }. Now, the action contains a conditional, but that's a separate issue.
Now the command awk '/2/' sample.txt would print all three lines. Is that correct?
Yes.
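Both points can be confirmed with the three-line sample from the question. In this sketch the file is recreated in a scratch directory, and via_action and via_pattern are illustrative variable names:

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '2 will be matched\nThis will not match 2\n2 may be matched\n' > sample.txt

# Empty pattern; the test lives inside the action.
via_action=$(awk '{ if ($0 ~ /2/) print }' sample.txt)
# Pattern only; the missing action defaults to { print }.
via_pattern=$(awk '/2/' sample.txt)

printf '%s\n' "$via_pattern"
[ "$via_action" = "$via_pattern" ] && echo "same output"
```

Both commands print all three lines, since each of them contains the digit 2.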
NR means the Number of the Record being processed...
You are matching against line number 2.