Why is my range pattern only working on the first file? - awk

I have a set of files (FILE1.txt, FILE2.txt ...) of the form:
foo 123
bar 456
start
foo 321
bar 654
And I want to ignore everything before start and only read lines containing foo in each file.
My attempt is this command:
awk '/start/,/EOF/ {if($1=="foo"){print $2}} ' FILE*.txt
And it actually works on the first file, that is, it will print 321, but then it ignores the range pattern for the next files. That is, if we assume that all the files have the same content shown above, it will print:
$ awk '/start/,/EOF/ {if($1=="foo"){print $2}} ' FILE*.txt
321 // Expected from FILE1.txt, successfully ignoring the first "foo" before "start".
123 // Unexpected from FILE2.txt
321 // Expected from FILE2.txt
123 // Unexpected from FILE3.txt
321 // Expected from FILE3.txt
...
What am I doing wrong? How do I make the range pattern work on each file rather than only once over all the files?
I've actually found a workaround based on find, but for the sake of a good understanding I'm looking for a solution relying on awk only.

awk processes all files as a single input stream. You need to tell awk when it's processing a new file and to reset its pattern matching.
One approach:
awk '
FNR==1 { found=0 } # FNR==1st record of new file, reset flag
/start/ { found=1 } # found start of range, set flag
found && $1=="foo" { print $2 } # if flag set and 1st field == "foo" then print 2nd field
' FILE?.txt
NOTES:
/start/ will match the string start anywhere in the row, eg, it will match on restart, last time I started the car, etc; to match the exact string you could use $1=="start" instead
this was run against 3 files (FILE{1..3}.txt) that all have the same content as OP's sample input
This generates:
321
321
321
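As a sketch of that note, here is the reset-per-file program with the exact first-field match, run against two hypothetical files built from the sample input (the file names and contents here are just recreations of the question's example):

```shell
# Two hypothetical files with the sample content from the question.
printf 'foo 123\nbar 456\nstart\nfoo 321\nbar 654\n' > FILE1.txt
printf 'foo 123\nbar 456\nstart\nfoo 321\nbar 654\n' > FILE2.txt

# FNR==1 resets the flag for each file; $1=="start" matches the exact
# field instead of the substring that /start/ would also find in "restart".
awk '
FNR==1             { found=0 }
$1=="start"        { found=1 }
found && $1=="foo" { print $2 }
' FILE1.txt FILE2.txt
```

This prints 321 once for each input file.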

Some thoughts about row ranges and EOF.
One solution can be to set a helper variable.
$ awk -v row="start" -v regx="foo" '
FNR == 1{x = 0}
x == 1 && $1 ~ regx{print $2}
$1 ~ row{x = 1}' file file file
321
321
321

What am I doing wrong ?
/EOF/ means the line has the three-letter substring EOF; it will not match at the last line of a file unless that line actually contains the substring EOF.
How to make the range pattern working on each file and not only once
over all the files?
I would exploit GNU AWK in the following way; let file1.txt content be
foo 123
bar 456
start
foo 321
bar 654
and file2.txt content be
foo 1230
bar 4560
start
foo 3210
bar 6540
and file3.txt content be
foo 12300
bar 45600
start
foo 32100
bar 65400
then
awk '/start/{f=1}f&&$1=="bar"{print}ENDFILE{f=0}' file1.txt file2.txt file3.txt
gives output
bar 654
bar 6540
bar 65400
Explanation: for a line containing start I set f to 1; when f is non-zero and the 1st column ($1) is bar, I print that line; when I reach the end of a file, I set f back to zero using the ENDFILE special pattern
(tested in GNU Awk 5.0.1)
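ENDFILE is a gawk extension; if portability matters, the same effect comes from clearing the flag at the first record of each file instead. A self-contained sketch, assuming the same three files as above (recreated here so it runs on its own):

```shell
# Same three files as above, created here so the sketch is self-contained.
printf 'foo 123\nbar 456\nstart\nfoo 321\nbar 654\n'         > file1.txt
printf 'foo 1230\nbar 4560\nstart\nfoo 3210\nbar 6540\n'     > file2.txt
printf 'foo 12300\nbar 45600\nstart\nfoo 32100\nbar 65400\n' > file3.txt

# FNR==1 fires at the first record of every file, so the flag is
# cleared before any matching happens in that file (POSIX-portable).
awk 'FNR==1{f=0} /start/{f=1} f && $1=="bar"' file1.txt file2.txt file3.txt
```

The output is the same as with ENDFILE: bar 654, bar 6540, bar 65400.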

Why does NR==FNR; {} behave differently when used as NR==FNR{ }?

Hoping someone can help explain the following awk output.
awk --version: GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)
OS: Linux sub system on Windows; Linux Windows11x64 5.10.102.1-microsoft-standard-WSL2
user experience: n00b
Important: in the two code snippets below, the only difference is the semicolon (;) after NR==FNR in sample # 2.
sample # 1
awk 'NR==FNR { print $0 }' lines_to_show.txt all_lines.txt
output # 1
2
3
4
5
7
sample # 2
awk 'NR==FNR; { print $0 }' lines_to_show.txt all_lines.txt
output # 2
2 # why is the value in file 'lines_to_show.txt' appearing twice?
2
3
3
4
4
5
5
7
7
line -01
line -02
line -03
line -04
line -05
line -06
line -07
line -08
line -09
line -10
Generate the text input files
lines_to_show.txt: echo -e "2\n3\n4\n5\n7" > lines_to_show.txt
all_lines.txt: echo -e "line\t-01\nline\t-02\nline\t-03\nline\t-04\nline\t-05\nline\t-06\nline\t-07\nline\t-08\nline\t-09\nline\t-10" > all_lines.txt
Request/Questions:
If you can please explain why you know the answers to the questions below (experience, tutorial, video, etc..)
How does one read an awk program? I was under the impression that a semicolon (;) is only a statement terminator, just like in C, and should not have an impact on the execution of the program.
In output # 2, why are the values in the file 'lines_to_show.txt' appearing twice? It seems like awk is printing values from the 1st file "lines_to_show.txt" but printing them 10 times, which is the number of records in the file "all_lines.txt". Is this true? Why?
Why, in output # 1, is only output from "lines_to_show.txt" displayed? I thought awk would process each record in each file, so I expected to see 15 lines (10 + 5).
What have I tried so far?
going through https://www.linkedin.com/learning/awk-essential-training/using-awk-command-line-flags?autoSkip=true&autoplay=true&resume=false&u=61697657
modifying the code to see the difference and use that to 'understand' what is going on.
trying to work through the flow using pen and paper
going through https://www.baeldung.com/linux/awk-multiple-input-files
awk 'NR==FNR { print $0 }' lines_to_show.txt all_lines.txt
Here you have one pattern-action pair; that is, if the overall record number (NR) equals the per-file record number (FNR), which is only true while reading the first file, then print the whole line.
awk 'NR==FNR; { print $0 }' lines_to_show.txt all_lines.txt
Here you have two pattern-action pairs; since the ; follows the condition, awk assumes you want the default action, which is {print $0}. In other words, it is equivalent to
awk 'NR==FNR{print $0}{ print $0}' lines_to_show.txt all_lines.txt
The first print $0 runs solely when processing the 1st file; the 2nd print $0 runs unconditionally (no pattern given), so for lines_to_show.txt both prints fire, while for all_lines.txt only the 2nd does.
man awk is the best reference:
An awk program is composed of pairs of the form:
pattern { action }
Either the pattern or the action (including the
enclosing brace characters) can be omitted.
A missing pattern shall match any record of input,
and a missing action shall be equivalent to:
{ print }
; terminates a pattern-action block, so you have two pattern/action blocks, each of whose action is to print the line.
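The equivalence is easy to check mechanically. This sketch rebuilds the question's input files (the second one shortened to three lines, and using printf rather than echo -e for portability) and confirms that the semicolon form and the explicit two-block form produce identical output:

```shell
# Recreate the question's input files (second one shortened).
printf '2\n3\n4\n5\n7\n' > lines_to_show.txt
printf 'line\t-01\nline\t-02\nline\t-03\n' > all_lines.txt

# NR==FNR; is a bare pattern with the default action {print},
# so the two programs below are the same program.
a=$(awk 'NR==FNR; { print $0 }' lines_to_show.txt all_lines.txt)
b=$(awk 'NR==FNR { print $0 } { print $0 }' lines_to_show.txt all_lines.txt)
[ "$a" = "$b" ] && echo identical
```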

Command line - show surrounding context lines in a file around known exact line number

How do I output N (e.g. 2) lines surrounding a specific known line number (e.g. 5) in a file?
cat >/tmp/file <<EOL
foo
bar
baz
qux
quux
EOL
# some command
Expected output:
bar
baz
qux
If you know the line number and the amount of context in advance, and thus can compute the numbers of the first and last lines, you might use a simple GNU sed command, for example
sed -n '3,7p' file.txt
will output 3rd, 4th, 5th, 6th and 7th line of file.txt.
If you wish to be able to change the line number easily, then I would use GNU AWK in the following way
awk 'BEGIN{n=5}NR==n-2,NR==n+2' file.txt
Explanation: I set n to 5, then use a range to select lines from the (n-2)th line (inclusive) to the (n+2)th line (inclusive); no action is provided, which is the equivalent of giving {print}.
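The hard-coded n can also be supplied from the shell with -v, making the target line a parameter. Same range logic, run against a throwaway 7-line file made up for this sketch:

```shell
# A throwaway 7-line file: a, b, c, d, e, f, g.
printf '%s\n' a b c d e f g > file.txt

# Pass the target line number in from the command line.
awk -v n=5 'NR==n-2,NR==n+2' file.txt
# prints lines 3 through 7: c, d, e, f, g
```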
Robustly, portably, and efficiently printing a context (same number of lines either side of a target line):
$ awk -v tgt=5 -v ctx=2 '
BEGIN{beg=tgt-(ctx=="" ? bef : ctx); end=tgt+(ctx=="" ? aft : ctx)}
NR==beg{f=1} f; NR==end{exit}
' file
bar
baz
qux
or different numbers of lines before and after the target line:
$ awk -v tgt=5 -v bef=2 -v aft=4 '
BEGIN{beg=tgt-(ctx=="" ? bef : ctx); end=tgt+(ctx=="" ? aft : ctx)}
NR==beg{f=1} f; NR==end{exit}
' file
bar
baz
qux
quux
In particular for efficiency note:
The math to calculate the begin/end line numbers is done once in the BEGIN section rather than recalculated every time a line is read, and
The NR==end{exit} instead of NR==end{f=0} or similar so awk doesn't waste time unnecessarily reading the rest of the input file after the desired lines have been printed.
Without line number prefixes:
awk -v nr=5 'FNR>=nr-2 && FNR<=nr+2{ print $0 }' /tmp/file
bar
baz
qux
With line number prefixes:
awk -v nr=5 'FNR>=nr-2 && FNR<=nr+2{ print FNR":"$0 }' /tmp/file
3:bar
4:
5:baz
6:
7:qux
This might work for you (GNU sed and bash):
sed -n $((5-2)),$((5+2))p file
Which fetches the range +/- 2 lines from line 5 of file.
Another way is to use grep:
grep -FC2 "$(sed -n 5p file)" file
Find line 5 in file using sed, then have grep print 2 lines of context either side of that line.
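For repeated use, the sed arithmetic can be wrapped in a small shell function. The name context is made up here, and the sketch assumes LINE - N is at least 1:

```shell
# context FILE LINE N -> print LINE and N lines either side of it.
# Hypothetical helper wrapping the sed one-liner shown above.
context() {
    sed -n "$(( $2 - $3 )),$(( $2 + $3 ))p" "$1"
}

printf 'foo\nbar\nbaz\nqux\nquux\n' > /tmp/file
context /tmp/file 3 1
# prints bar, baz, qux
```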

How to extract only first line that matches each pattern from a file?

I have a text file that looks like
Line_A 123
Line_A 456
Line_A 789
Line_B 123
Line_B 456
Line_B 789
Line_C 123
Line_C 456
Line_C 789
And a reference file that looks like this:
Line_A
Line_B
Line_C
I want to extract the first line from the text file that matches each name in the reference file like this:
Line_A 123
Line_B 123
Line_C 123
So far I can only get the first line from the first match with:
grep -A1 -w -f reference.txt -m 1 file.txt
Maybe I need a for loop? TIA
another awk
$ awk 'NR==FNR{a[$1];next} $1 in a{delete a[$1]; print}' reference file
keep the references in a set; when one is seen in the file, print the line and remove the reference, so only the first instance is printed.
Yet another awk:
$ awk 'a[$1]++==1' ref file
Line_A 123
Line_B 123
Line_C 123
Read both files in above order, count each string in first column and print when it's seen the second time. This will fail if there are strings in file that are not in reference. In that case use one of the other solutions.
You can do it in Awk with a single pass over the files as long as you list the reference file first in the argument list:
awk 'FNR == NR { name[$1] = 0; }
FNR != NR { for (i in name) if ($0 ~ i && name[i]++ == 0) { print $0; break; } }' \
reference.txt file.txt
With the sample inputs, this yields the required output.
This is a fairly standard technique in Awk. You read the first file using the FNR == NR condition (per-file line number equal to overall line number; only true for lines in the first file) and save appropriate information for later use. Often, people use a next at the end of the first action so they can omit the FNR != NR condition; that works too, but I like the explicit condition for symmetry.
When processing the second and subsequent files, check whether each of the names read from the first file matches a line, and the name hasn't been printed before, printing the line if it hasn't been handled. The break avoids checking other names if the current name matches.
This is the way many people would write the command; it also works.
awk 'FNR == NR { name[$1] = 0; next }
{ for (i in name) if ($0 ~ i && name[i]++ == 0) { print $0; break; } }' \
reference.txt file.txt
Both versions of the code here look for the name anywhere in the line; if you strictly want to match against $1 of the second (or subsequent) files, you can alter the conditions (indeed, simplify them). And karakfa shows deleting names once they're matched (instead of incrementing a counter), which is better for performance since you don't keep testing names that are no longer relevant. However, the code shown here is simpler to adapt to showing the second, or third, or last entry for a given name (handling second or third just means changing the 0 to 1 or 2; handling 'last' requires more substantial changes).
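To make the "second entry" adaptation concrete, here is the same program with the 0 changed to 1, run against the sample files (rebuilt here so the sketch is self-contained):

```shell
# Recreate the question's sample files.
printf 'Line_A\nLine_B\nLine_C\n' > reference.txt
printf '%s\n' 'Line_A 123' 'Line_A 456' 'Line_A 789' \
              'Line_B 123' 'Line_B 456' 'Line_B 789' \
              'Line_C 123' 'Line_C 456' 'Line_C 789' > file.txt

# name[i]++ == 1 is true on the second match for each name,
# so the second entry per name is printed instead of the first.
awk 'FNR == NR { name[$1] = 0; next }
     { for (i in name) if ($0 ~ i && name[i]++ == 1) { print; break } }' \
    reference.txt file.txt
# prints: Line_A 456, Line_B 456, Line_C 456
```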

Looking for patterns across different lines

I have a file like this (test.txt):
abc
12
34
def
56
abc
ghi
78
def
90
And I would like to extract the 78, which is enclosed by "abc\nghi" and "def". Currently, I know I can do this by:
cat test.txt | awk '/abc/,/def/' | awk '/ghi/,/def/'
Is there any better way?
One way is to use flags
$ awk '/ghi/ && p~/abc/{f=1} f; /def/{f=0} {p=$0}' test.txt
ghi
78
def
{p=$0} this will save input line for future use
/ghi/ && p~/abc/{f=1} set flag if current line contains ghi and previous line contains abc
f; print input record as long as flag is set
/def/{f=0} clear the flag if line contains def
If you only want the lines between these two boundaries
$ awk '/ghi/ && p~/abc/{f=1; next} /def/{f=0} f; {p=$0}' test.txt
78
$ awk '/12/ && p~/abc/{f=1; next} /def/{f=0} f; {p=$0}' test.txt
34
See also How to select lines between two patterns?
This is not really clean, but you can redefine your record separator as the regular expression abc\nghi\n|\ndef. This however creates multiple records, and you need to keep track of which ones lie between the correct separators. With GNU awk you can check which separator was matched using RT.
awk 'BEGIN{RS="abc\nghi\n|\ndef"}
(RT~/abc/){s=1; next}
(s==1)&&(RT~/def/){print $0}
{s=0}' file
This does:
set RS to abc\nghi\n or \ndef.
check which separator was matched: if RT contains abc, you found the opening one, so set the flag and skip to the next record.
if the flag is set and the current record's RT contains def, then print the record.
grep alternative
$ grep -Pazo '(?s)(?<=abc\nghi)(.*)(?=def)' file
but I think awk will be better
You could do this with sed. It's not ideal in that it doesn't actually understand records, but it might work for you...
sed -Ene 'H;${x;s/.*\nabc\nghi\n([0-9]+)\ndef\n.*/\1/;p;}' input.txt
Here's what's basically going on:
H - appends the current line to sed's "hold space"
${ - specifies the start of a series of commands that will be run once we come to the end of the file
x - swaps the hold space with the pattern space, so that future substitutions will work on what was stored using H
s/../../ - analyses the pattern space (which is now multi-line), capturing the data specified in your question, replacing the entire pattern space with the bracketed expression...
p - prints the result.
One important factor here is that the regular expression is ERE, so the -E option is important. If your version of sed uses some other option to enable support for ERE, then use that option instead.
Another consideration is that the regex above assumes Unix-style line endings. If you try to process a text file that was generated on DOS or Windows, the regex may need to be a little different.
awk solution:
awk '/ghi/ && r=="abc"{ f=1; n=NR+1 }f && NR==n{ v=$0 }v && NR==n+1 && /def/{ print v }{ r=$0 }' file
The output:
78
Bonus GNU awk approach:
awk -v RS= 'match($0,/\nabc\nghi\n(.+)\ndef/,a){ print a[1] }' file

Search a file using combined keywords from two input files

I have 2 input files below and need to search a 3rd file using all possible keywords (InputFile1.txt + InputFile2.txt) from the two input files.
InputFile1.txt:
1.1.1.1
2.2.2.2
3.3.3.3
InputFile2.txt:
Orange
Blue
FileTobeSearched.txt:
2.2.2.2,bla,Orange
9.9.9.9,bla,bla
2.2.2.2,bla,Blue
Desired output is:
2.2.2.2,bla,Orange
2.2.2.2,bla,Blue
My attempt to loop through this is not even worth a post. Please help!
*** Requested Added output:
I would like to know which keywords triggered each matching line, and would like to prepend them to that line in the output. For instance, for the line matching 2.2.2.2 AND Orange, I would like the output line to start with "2.2.2.2,Orange:" followed by the matching line.
*** You are right: my sample file is not good.
corrected FileTobeSearched.txt:
2.2.2.2,bla,bla bla "Orange" bla bla
9.9.9.9,bla,bla
2.2.2.2,bla,bla bla bla bla "Blue"
This will hopefully explain why I need the matching keywords added to the front of each matching hit.
With GNU awk for ARGIND:
awk '
BEGIN { FS=OFS="," }
ARGIND==1 { a[$0]; next }
ARGIND==2 { b[$0]; next }
($1 in a) && ($3 in b) { print $1, $3 ":" $0 }
' InputFile1.txt InputFile2.txt FileTobeSearched.txt
With other awks change ARGIND==1 to FILENAME==ARGV[1] etc. or add an initial line that says FNR==1{ARGIND++} if your files can't be empty.
The differences between the above and @karakfa's answer are in performance:
his will loop through every line of file 1 once for every line of file 2, while mine doesn't, and
his requires 1 string concatenation plus 1 hash lookup for every line of file 3, while mine avoids the string concatenation but requires 2 hash lookups (on much smaller arrays) for every line of file 3.
awk to the rescue!
$ awk -F, 'FILENAME==ARGV[1]{a[$0]; next}
FILENAME==ARGV[2]{for(k in a) b[k,$0]; next}
($1,$3) in b' InputFile1.txt InputFile2.txt FileTobeSearched.txt
2.2.2.2,bla,Orange
2.2.2.2,bla,Blue
You can always use grep (note that this matches lines containing any single keyword, whereas the awk answers require a keyword from each file):
grep -f <(cat file1 file2) filetobesearched
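To see where the grep approach and the awk answers diverge, here is a sketch with one hypothetical extra line added to the searched file: it contains a file2 keyword but no file1 keyword, so grep prints it while the awk answers would not.

```shell
printf '1.1.1.1\n2.2.2.2\n3.3.3.3\n' > file1
printf 'Orange\nBlue\n' > file2
# The second line below is hypothetical: it contains a file2 keyword
# ("Orange") but no file1 keyword.
printf '2.2.2.2,bla,Orange\n9.9.9.9,bla,Orange\n' > filetobesearched

# grep -f matches a line if it contains ANY pattern from the list.
grep -f <(cat file1 file2) filetobesearched
# prints both lines, including 9.9.9.9,bla,Orange
```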