Delete every line if occurrence found - awk

I have a file with this format content:
1 6 8
1 6 9
1 12 20
1 6
2 8
2 9
2 12
2 20
2 35
I want to delete every line in which a number from the 2nd or 3rd column (but not the 1st) also appears in a later line, whether in that line's 2nd or 3rd column, including the line where the number first occurs.
I should have this as an output:
2 35
I've tried using:
awk '{for(i=2;i<=NF;i++){if($i in a){next};a[$i]}} 1'
but it doesn't seem to work.
What is wrong?

The problem with your attempt is that it decides whether to print each line using only the values seen so far: a line whose values reappear later has already been printed by the time the duplicate shows up, so you need to buffer the records (or read the file twice). Here is a one-pass awk that hashes all the records to r[NR] and keeps another array a[$i] for the values seen in fields $2..$NF.
awk '{
    for(i=2;i<=NF;i++)          # iterate fields starting from the second
        if($i in a) {           # if the field value was seen before
            delete r[a[$i]]     # delete the related record
            a[$i]=""            # clear a
            f=1                 # flag up
        } else {                # if it was not seen before
            a[$i]=NR            # add the record number to a
            r[NR]=$0
        }
    if(f!=1)                    # if the flag was not raised
        r[NR]=$0                # store the record under its record number
    else                        # if it was raised
        f=""                    # flag down
}
END {
    for(i=1;i<=NR;++i)
        if(i in r)
            print r[i]          # output the remaining records
}' file
Output:
2 35

The simplest way is a double-pass algorithm where you read your file twice.
The idea is to store all values in an array a and count how many times they appear. If a value appears 2 or more times, it means you have found more than a single entry and you should not print the line.
awk '(NR==FNR){a[$2]++; if(NF>2) a[$3]++; next}
     (NF==2) && (a[$2]==1);
     (NF==3) && (a[$2]==1 && a[$3]==1)' <file> <file>
In practice, you should avoid constructs such as a[var]==1 when you are not sure whether var is in the array, as the mere lookup creates that array element. Here, however, we never increment it afterwards, so it is fine to proceed.
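A quick way to see that auto-creation in action (a hypothetical one-liner, not part of the solution):
$ awk 'BEGIN { if (a["x"] == 1) { }   # the comparison alone creates a["x"]
               print ("x" in a) }'
1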
If you want to achieve the same thing with more than three fields you can do:
awk '(NR==FNR){for(i=2;i<=NF;++i) a[$i]++; next }
{for(i=2;i<=NF;++i) if(a[$i]>1) next }
{print}' <file> <file>
While both these solutions read the file twice, you can also store the full file in memory and read the file only a single time. It is, however, exactly the same algorithm:
awk '{for(i=2;i<=NF;++i) a[$i]++; b[NR]=$0}
     END{ for(j=1;j<=NR;++j) {
            $0=b[j]; keep=1
            # a plain "continue" here would only restart the inner loop,
            # so use a flag to decide whether to print
            for(i=2;i<=NF;++i) if(a[$i]>1) keep=0
            if(keep) print $0
          }
     }' <file>
Comment: this single-pass solution is very simple but stores the full file in memory. The solution of James Brown is very clever: it removes entries from memory as soon as they are no longer needed. A slightly shorter version is:
awk '{ for(i=2;i<=NF;++i) if ($i in a) delete b[a[$i]]; else { a[$i]=NR; b[NR]=$0 }}
END { for(n=1;n<=NR;++n) if(n in b) print b[n] }' <file>
Note: you should never strive for the shortest solution, but for the most readable one!

Could you please try the following.
awk '
FNR==NR{
  for(i=2;i<=NF;i++){
    a[$i]++
  }
  next
}
(NF==2 && a[$2]==1) || (NF==3 && a[$2]==1 && a[$3]==1)
' Input_file Input_file
Output will be as follows.
2 35

$ cat tst.awk
NR==FNR {
    cnt[$2]++
    cnt[$3]++
    next
}
cnt[$2]<2 && cnt[$NF]<2
cnt[$2]<2 && cnt[$NF]<2
$ awk -f tst.awk file file
2 35

This might work for you (GNU sed):
sed -r 'H;s/^[0-9]+ +//;G;s/\n(.*\n)/\1/;h;$!d;s/^([^\n]*)\n(.*)/\2\n \1/;:a;/^[0-9]+ +([0-9]+)\n(.*\n)*[^\n]*\1[^\n]*\1[^\n]*$/bb;/^[0-9]+ +[0-9]+ +([0-9]+)\n(.*\n)*[^\n]*\1[^\n]*\1[^\n]*$/bb;/\n/P;:b;s/^[^\n]*\n//;ta;d' file
This is not a serious solution; however, it demonstrates what can be achieved using only matching and substitution.
The solution makes a copy of the original file and, whilst doing so, accumulates all numbers from the second and possibly third fields of each record in a separate line, which it maintains at the head of the copy.
At the end of the file, the first line of the copy contains all the pertinent keys; if a key is duplicated, any line in the file that contains such a key is deleted. This is achieved by moving the keys (the first line) to the end of the file and matching the second (and possibly third) fields of each record against those keys.

Why does NR==FNR; {} behave differently when used as NR==FNR{ }?

Hoping someone can help explain the following awk output.
awk --version: GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)
OS: Linux subsystem on Windows; Linux Windows11x64 5.10.102.1-microsoft-standard-WSL2
user experience: n00b
Important: In the two code snippets below, the only difference is the semicolon ( ; ) after NR==FNR in sample # 2.
sample # 1
awk 'NR==FNR { print $0 }' lines_to_show.txt all_lines.txt
output # 1
2
3
4
5
7
sample # 2
awk 'NR==FNR; { print $0 }' lines_to_show.txt all_lines.txt
output # 2
2    # why is the value from 'lines_to_show.txt' appearing twice?
2
3
3
4
4
5
5
7
7
line -01
line -02
line -03
line -04
line -05
line -06
line -07
line -08
line -09
line -10
Generate the text input files
lines_to_show.txt: echo -e "2\n3\n4\n5\n7" > lines_to_show.txt
all_lines.txt: echo -e "line\t-01\nline\t-02\nline\t-03\nline\t-04\nline\t-05\nline\t-06\nline\t-07\nline\t-08\nline\t-09\nline\t-10" > all_lines.txt
Request/Questions:
If you can, please explain how you know the answers to the questions below (experience, tutorial, video, etc.).
How does one read an awk program? I was under the impression that a semicolon ( ; ) is only a statement terminator, just like in C, and should not have an impact on the execution of the program.
In output # 2, why are the values from the file 'lines_to_show.txt' appearing twice? It seems like awk is printing values from the 1st file 'lines_to_show.txt', but printing them 10 times, which is the number of records in the file 'all_lines.txt'. Is this true? Why?
Why does output # 1 only show output from 'lines_to_show.txt'? I thought awk would process each record in each file, so I expected to see 15 lines (10 + 5).
What have I tried so far?
going through https://www.linkedin.com/learning/awk-essential-training/using-awk-command-line-flags?autoSkip=true&autoplay=true&resume=false&u=61697657
modifying the code to see the difference and use that to 'understand' what is going on.
trying to work through the flow using pen and paper
going through https://www.baeldung.com/linux/awk-multiple-input-files
awk 'NR==FNR { print $0 }' lines_to_show.txt all_lines.txt
Here you have one pattern-action pair: if the total record number (NR) equals the per-file record number (FNR), then print the whole line.
awk 'NR==FNR; { print $0 }' lines_to_show.txt all_lines.txt
Here you have two pattern-action pairs: as ; follows the condition, it is assumed that you want the default action, which is {print $0}. In other words, it is equivalent to
awk 'NR==FNR{print $0}{ print $0}' lines_to_show.txt all_lines.txt
The first print $0 is used solely when processing the 1st file; the 2nd print $0 is used indiscriminately (no condition given). So for lines_to_show.txt both prints fire; for all_lines.txt only the 2nd one does.
man awk is the best reference:
    An awk program is composed of pairs of the form:
        pattern { action }
    Either the pattern or the action (including the enclosing
    brace characters) can be omitted.
    A missing pattern shall match any record of input, and a
    missing action shall be equivalent to:
        { print }
; terminates a pattern-action block, so you have two pattern/action blocks, both of whose action is to print the line.
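To see the default action in isolation, compare these two runs (a minimal demo on made-up two-line input):
$ printf 'a\nb\n' | awk 'NR==1'              # bare pattern: default action { print }
a
$ printf 'a\nb\n' | awk 'NR==1; { print }'   # two blocks: line 1 printed twice
a
a
b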

awk to match string from file against another file and get previous and next 2 lines

I am trying to match strings from one file against another file, fetching each matched line along with the previous line and the next 2 lines.
I could do this with grep for a chunk of the file, but it throws "memory exhausted" on the original (200M lines of keys and a 2TB input source file).
grep --no-group-separator -A 2 -B 1 -f key source
sample key file
^CNACCCAAGGCTCATT
^ANAGCGGCAACTCGCG
I added the "^" to each line since the key is the starting 16 characters of the line next to the one starting with '#'
The pattern is formed of the characters ATGCN having length 16 and they are random. There could be multiple matches in the source file against a pattern
sample file to search against
#A00354:427:HVYWLDSXY:1:1101:1036:1000 1:N:0:ATTACTTC
CNACCCAAGGCTCATTCATTATATAGTGGAGGCGGAGAACTTTCCTCCGGTTTGCCTAACATGCCAGCTGTCGGTGTCAAAACCGGCGGATCTCGGGAAGGGGGTCCTGAACTGTGCGTCTTAGGTCGATGGTAATAGGAGACGGGGGAC
+
:#:FFFFFF:F,FFFFFFF:FFF,FF:FFFFFF,FFFFFFFFFFFFFFFF:FFFF:FFFFFFFF:FFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F,F:FFFFFFFFFFFFFF:F:F,:F:FFFFFFFFFFF:FFF
#A00354:427:HVYWLDSXY:1:1101:1108:1000 1:N:0:ATTACTTC
ANAGCGGCAACTCGCGGTTCCCCTACACATAGAAAACCTACGCCACATTATTGGCTAGGACGAGTGGTTCGTCTGCGTACGCAAGATTGTTGAGATCCACTATTGTCATTCAGTACTACGGTTCTTCTTATCTTGGTCGATCGTGTAAAA
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFF
#A00354:427:HVYWLDSXY:1:1101:1271:1000 1:N:0:ATTACTTC
CNATCCCGTCTCGAGCCCGCCCCAATAGCAACAACAACAACAACAACAACAACAACAGCAACAACACCAGCAACACCAGCAACAACAGCAACAACAACAACAGCAACAACAACAACAACAACAACAACAACAACAACAACAACAACAAGA
+
F#FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
#A00354:427:HVYWLDSXY:1:1101:1325:1000 1:N:0:ATTACTTC
TNCGGTTCATAGGAATGTAGTCTTTGTAATTATGCGCAATTTCCAAACACTTCAAGGTTTTTTTGCAAATAAAACATTCAGGCCTCGTGTGTGCCGCTGCATCTTAGATCCAACGGCTCCTAGTTGCTCATATTCNACCCAAGGCTCATTAGGTGCTCCCCGTAGC
+
:#FFF:F,FFFFFFFFFFFF,:FFF::F,FFF,F:FFFFFFF:FFFF:FF:F:FFF:F:F:FFFFFFFF,FF,F:FF:FF::F,FFF:FFFFFF,:F::FFFFFFF:FF:FFFFF,FFFFFF,FFF:FFFFFFFFF,FFFF:FFFFFFF:
Even if I split the key file, it is painstakingly slow.
Can it be done more efficiently using a perl one-liner or awk?
The expected output would be
#A00354:427:HVYWLDSXY:1:1101:1036:1000 1:N:0:ATTACTTC
CNACCCAAGGCTCATTCATTATATAGTGGAGGCGGAGAACTTTCCTCCGGTTTGCCTAACATGCCAGCTGTCGGTGTCAAAACCGGCGGATCTCGGGAAGGGGGTCCTGAACTGTGCGTCTTAGGTCGATGGTAATAGGAGACGGGGGAC
+
:#:FFFFFF:F,FFFFFFF:FFF,FF:FFFFFF,FFFFFFFFFFFFFFFF:FFFF:FFFFFFFF:FFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F,F:FFFFFFFFFFFFFF:F:F,:F:FFFFFFFFFFF:FFF
#A00354:427:HVYWLDSXY:1:1101:1108:1000 1:N:0:ATTACTTC
ANAGCGGCAACTCGCGGTTCCCCTACACATAGAAAACCTACGCCACATTATTGGCTAGGACGAGTGGTTCGTCTGCGTACGCAAGATTGTTGAGATCCACTATTGTCATTCAGTACTACGGTTCTTCTTATCTTGGTCGATCGTGTAAAA
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFF
I saw code like:
awk 'NR==FNR{a[$1]; next} {for (i in a) if (index($0, i)) print $1}' key source
which checks whether each entry in key is a substring of the source, but I couldn't get my head around making it check for a pattern (^CNACCCAAGGCTCATT) and fetching the previous and next lines.
Another way I tried and couldn't work out was: zcat key | match each line against source file > output.
*Maybe the slowness is because of my code; any alternative is much appreciated.
for (i in a) if (index($0, i)) would be immensely slow because you're looping 100,000,000 times per line of your "search" file (so 100M * 2TB loop iterations!), and it would produce incorrect output, since index($0, i) finds the target key anywhere on the search line rather than only at the start; it would have to be index($0, i) == 1 to match only at the start.
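For reference, index() returns the 1-based position of the substring, or 0 if it is absent, which is why == 1 restricts the match to the start of the line (a small illustration, not from the answer):
$ awk 'BEGIN { s="CNACCC"; print index(s,"NACC"), index(s,"CNAC"), index(s,"GGG") }'
2 1 0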
Here is how to do it in awk, after removing all those ^s from the start of your "key" file lines: we do an efficient hash lookup on strings, not the slow regexp comparison that would be required with grep, and we do 1 hash lookup per line of "source" instead of 100M string comparisons as in the awk script in your question:
$ cat tst.awk
NR==FNR     { tgts[$1]; next }
c && !(--c) { print p3 ORS p2 ORS p1 ORS $0 }
            { key=substr($0,1,16); p3=p2; p2=p1; p1=$0 }
key in tgts { c=2 }
$ awk -f tst.awk key source
#A00354:427:HVYWLDSXY:1:1101:1036:1000 1:N:0:ATTACTTC
CNACCCAAGGCTCATTCATTATATAGTGGAGGCGGAGAACTTTCCTCCGGTTTGCCTAACATGCCAGCTGTCGGTGTCAAAACCGGCGGATCTCGGGAAGGGGGTCCTGAACTGTGCGTCTTAGGTCGATGGTAATAGGAGACGGGGGAC
+
:#:FFFFFF:F,FFFFFFF:FFF,FF:FFFFFF,FFFFFFFFFFFFFFFF:FFFF:FFFFFFFF:FFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F,F:FFFFFFFFFFFFFF:F:F,:F:FFFFFFFFFFF:FFF
#A00354:427:HVYWLDSXY:1:1101:1108:1000 1:N:0:ATTACTTC
ANAGCGGCAACTCGCGGTTCCCCTACACATAGAAAACCTACGCCACATTATTGGCTAGGACGAGTGGTTCGTCTGCGTACGCAAGATTGTTGAGATCCACTATTGTCATTCAGTACTACGGTTCTTCTTATCTTGGTCGATCGTGTAAAA
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFF
See printing-with-sed-or-awk-a-line-following-a-matching-pattern/17914105#17914105 for more information on what c=2 and c && !(--c) are doing; briefly, it sets a count of lines and then becomes true (and so executes the associated action of printing the saved lines) when the count reaches zero again.
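A stripped-down demo of that countdown idiom on made-up input: the match sets c=2, and the record printed is the one seen two records later:
$ printf '1\n2\n3\n4\n5\n' | awk 'c && !(--c); /2/ { c=2 }'
4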
If that exceeds available memory, let us know, as another approach would look something like the following pseudo-code (I am not suggesting you do this in shell!):
sort keys
sort source by middle line, keeping groups of 3 lines together
while !done; do
    read tgt < keys
    while read source_line; do
        key = substr(line,1,16)
        if key == tgt; then
            print line+context
        else if key > tgt; then
            break
        fi
    done < source
done
So the idea is that you don't read the next value from "keys" until the current value from "source" is bigger than the one you were using. That would reduce memory usage to close to zero, but it does require both input files to be sorted.
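For what it's worth, here is a minimal awk sketch of that merge idea, simplified to one-line records and using hypothetical file names keys.sorted and source.sorted (the context-line handling is omitted):
awk '{ key = substr($0, 1, 16)
       while (tgt < key)                # advance keys until tgt >= key
           if ((getline tgt < "keys.sorted") <= 0) exit
       if (tgt == key) print            # key found: print the record
     }' source.sorted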

How do I match a pattern and then copy multiple lines?

I have two files that I am working with. The first file is a master database file that I am having to search through. The second file is a file that I can make that allows me to name the items from the master database that I would like to pull out. I have managed to make an AWK solution that will search the master database and extract the exact line that matches the second file. However, I cannot figure out how to copy the lines after the match to my new file.
The master database looks something like this:
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40006X/50006/60006/3/10/3/
10047A/20047/30047/0/5/23./XXXX/
10048A/20048/30048/0/5/23./XXXX/
10049A/20049/30049/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
40009X/50009/60009/3/10/3/
10054A/20054/30054/0/5/23./XXXX/
10055A/20055/30055/0/5/23./XXXX/
10056A/20056/30056/0/5/23./XXXX/
40010X/50010/60010/3/10/3/
10057A/20057/30057/0/5/23./XXXX/
10058A/20058/30058/0/5/23./XXXX/
10059A/20059/30059/0/5/23./XXXX/
In my example, the lines that start with 4000 are the ones I am matching against. The last number in such a row tells me how many lines there are to copy. So in the first line, 40005X/50005/60005/3/10/9/, I would be matching on 40005X, and the 9 in that line tells me that the 9 lines underneath it need to be copied along with it.
The second file is very simple and looks something like this:
40005X
40007X
40008X
As the script finds each match, I would like to move the information from the first file to a new file for analysis. The end result would look like this:
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
The code that I currently have that will match the first line is this:
#! /bin/ksh
file1=input_file
file2=input_masterdb
file3=output_test
awk -F'/' 'NR==FNR {id[$1]; next} $1 in id' $file1 $file2 > $file3
I have had the most success with awk, but I am open to any suggestion. Note that I am working on a UNIX system. I would like to keep this as a ksh script, since most of the other scripts I use with it are written in that format and it is what I am most familiar with.
Thank you for your help!!
Your existing awk correctly matches the rows from the ids file; you now need to add a condition to print the N lines that follow, taking N from the last numeric field ($6) of the matching row. So we set a variable p to the number of lines to print plus one (for the current line) and decrement it as rows are printed.
awk -F'/' 'NR==FNR{id[$0]; next} $1 in id{p=$6+1} p-->0{print}' file1 file2
or the same with the last condition written more "awkish" (tip by Ed Morton), which also covers the extreme case of a huge file: p-->0 keeps decrementing p on every record, while p&&p-- stops at zero.
awk -F'/' 'NR==FNR{id[$0]; next} $1 in id{p=$6+1} p&&p--' file1 file2
Here the print action is omitted, as it is the default action, and the condition stays true as long as the decreasing p is positive.
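A tiny demo of p&&p-- on made-up input: setting p=2 on a match prints the matching line plus the one after it:
$ printf 'a\nb\nc\nd\n' | awk '/b/ { p=2 } p && p--'
b
c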
Another one:
$ awk -F/ 'NR==FNR {a[$1]; next}
!n && $1 in a {n=$(NF-1)+1}
n&&n--' file2 file1
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
The !n guard takes care of the case where a content line matches one of the given ids: another id is only looked for after the specified number of lines has been printed.
Could you please try the following, written and tested with the shown samples in GNU awk. It assumes that blocks start at lines beginning with digits followed by X. Here Input_file2 is the file having only ids and Input_file1 is the master file, as per the question.
awk '
{
  sub(/ +$/,"")
}
FNR==NR{
  a[$0]
  next
}
/^[0-9]+X/{
  match($0,/[0-9]+\/$/)
  no_of_lines_to_print=substr($0,RSTART,RLENGTH-1)
  found=count=""
}
{
  if(count==no_of_lines_to_print){ count=found="" }
  for(i in a){
    if(match($0,i)){
      found=1
      print
      next
    }
  }
}
found{
  ++count
}
count<=no_of_lines_to_print && count!=""
' Input_file2 Input_file1

AWK, exclude results from one file with regard to a second file

Using awk, I am able to get a list of URLs with a given error number:
awk '($9 ~ /404/)' /var/log/nginx/access.log | awk '{print $7}' | sort | uniq -c | sort -rn
Fine and dandy.
But we would like to refine it further by matching that result against a list of already known 404 URLs.
For example:
awk '($9 ~ /404/)' /var/log/nginx/access.log | awk '{print $7} '| sort | uniq -c | sort -k 2 -r | awk '{print > "/mnt/tmp/404error.txt"}'
yields today:
1 /going-out/restaurants/the-current-restaurent.htm
1 /going-out/restaurants/mare.HTML
1 /going-out/report-content/?cid=5
1 /going-out/report-content/?cid=38550
1 /going-out/report-content/?cid=380
and the day after:
1 /going-out/ru/%d0%bd%d0%be%d1%87%d0%bd%d0%b0%d1%8f-%d0%b6%d0%b8%d0%b7%d0%bd%d1%8c-%d0%bd%d0%b0-%d0%bf%d1%85%d1%83%d0%ba%d0%b5%d1%82%d0%b5/%d1%81%d0%be%d0%b2%d0%b5%d1%82%d1%8b-%d0%bb%d1%8e%d0%b1%d0%b8%d1%82%d0%b5%d0%bb%d1%8f%d0%bc-%d0%bd%d0%be%d1%87%d0%bd%d1%8b%d1%85-%d1%80%d0%b0%d0%b7%d0%b2%d0%bb%d0%b5%d1%87%d0%b5%d0%bd%d0%b8%d0%b9/
1 /going-out/restaurants/the-current-restaurent.htm
1 /going-out/restaurants/mare.HTML
1 /going-out/report-content/?cid=5
1 /going-out/report-content/?cid=38550
1 /going-out/report-content/?cid=380
1 /going-out/report-content/?cid=29968
1 /going-out/report-content/?cid=29823
The goal is to keep only the new URLs.
At that point I am lost. I know I can push the first file into an array, and I presume I can do the same with the second file (in a second array); then maybe (not sure whether awk has the capacity) simply cross them and keep what does not match.
Any help will be fully appreciated.
So you have a file whose $9 field may match /404/. If so, you want to store the 7th field. Then you count how many of them appeared in total, but only those that did not already appear in the file you have.
I think all of this can be done with this (untested, because I have no sample input data):
awk 'FNR==NR {seen[$2]; next}
     $9 ~ /404/ {if (!($7 in seen)) a[$7]++}
     END {for (i in a) print a[i], i}' old_file log_file
This stores the 2nd column from the file with data into an array seen[]. Then, goes through the new file and stores the 7th column if it wasn't seen before. Finally, it prints the counters.
Since it looks like you have an old awk version that does not support the syntax (index in array), you can use this workaround for it:
$9 ~ /404/ {for (i in seen) {if (i==$7) next} a[$7]++}
Note you must be using a very old version, since this was introduced in 1987:
A.1 Major Changes Between V7 and SVR3.1
The awk language evolved considerably between the release of Version 7
Unix (1978) and the new version that was first made generally
available in System V Release 3.1 (1987). This section summarizes the
changes, with cross-references to further details:
The expression ‘indx in array’ outside of for statements (see
Reference to Elements)
You can use grep --invert-match --fixed-strings --file=FILEALL FILENEW or comm -23 FILENEW FILEALL for this. FILEALL is the file containing the URLs already found; FILENEW contains the pages found today. For comm, both files must be sorted.
http://www.gnu.org/software/gawk/manual/gawk.html#Other-Inherited-Files
http://linux.die.net/man/1/comm
I think comm is more efficient because it uses sorted files, but I did not test this.
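For example (a sketch, assuming both files are plain lists of URLs):
$ sort -o FILEALL FILEALL
$ sort -o FILENEW FILENEW
$ comm -23 FILENEW FILEALL    # lines only in FILENEW, i.e. the new URLs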
I came up with the following:
awk 'BEGIN {
    while (getline < "/mnt/tmp/404error.txt") {
        A[$1] = $1
    }
    while (getline < "/var/log/nginx/access.log") {
        if ($9 ~ /404/) {
            exist[$7] = $7
            if ($7 in A)
                blah += 1
            else
                new[$7]
        }
    }
    asort(exist)
    for (i in exist)
        print exist[i] > "/mnt/tmp/404error.txt"
    asorti(new)
    for (i in new)
        print new[i] > "/mnt/tmp/new404error.txt"
}' | mutt -s "subject" -a /mnt/tmp/new404error.txt -- whoever@mail.net, whatever@mail.net
That seems to provide what I want (almost).
But I believe it is far too verbose; maybe one of you geniuses can improve it.
Thanks

How to use multiple passes with gawk?

I'm trying to use gawk from Cygwin to process a CSV file. Pass 1 finds the max value, and pass 2 prints the records that match the max value. I'm using a .awk file as input. When I use the text from the manual, it matches on both passes. I can use the if form as a workaround, but that forces me to use if inside every pattern match, which is kind of a pain. Any idea what I'm doing wrong?
Here's my .awk file:
pass == 1
{
print "pass1 is", pass;
}
pass == 2
{
if(pass == 2)
print "pass2 is", pass;
}
Here's my output (input file is just "hello"):
hello
pass1 is 1
pass1 is 2
hello
pass2 is 2
Here's my command line:
gawk -F , -f test.awk pass=1 x.txt pass=2 x.txt
I'd appreciate any help.
A (g)awk solution might look like this:
awk 'FNR == NR{print "1st pass"; next}
{print "second pass"}' x.txt x.txt
(Please replace awk by gawk if necessary.)
Let's say you wanted to find the maximum value in the first column of file x.txt and then print all lines which have this value in the first column; your program might look like this (thanks to Ed Morton for a tip, see comment):
awk -F"," 'FNR==NR {max = ( (FNR==1) || ($1 > max) ? $1 : max ); next}
$1==max' x.txt x.txt
The output for x.txt:
6,5
2,6
5,7
6,9
is
6,5
6,9
How does this work? The variable NR keeps increasing with every record, whereas FNR is reset to 1 when a new file is opened. Therefore FNR==NR is only true while the first file is being processed.
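You can watch the two counters diverge with a quick test (a hypothetical two-line file f):
$ printf 'a\nb\n' > f
$ awk '{ print FILENAME, NR, FNR }' f f
f 1 1
f 2 2
f 3 1
f 4 2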
So... F.Knorr answered your question accurately and concisely, and he deserves a big green checkmark. NR==FNR is exactly the secret sauce you're looking for.
But here is a different approach, just in case the multi-pass thing proves to be problematic. (Perhaps you're reading the file from a slow drive, a USB stick, across a network, DAT tape, etc.)
awk -F, '$1>m{delete l;n=0;m=$1}m==$1{l[++n]=$0}END{for(i=1;i<=n;i++)print l[i]}' inputfile
Or, spaced out for easier reading:
BEGIN {
    FS=","
}
$1 > max {
    delete list          # empty the array
    n=0                  # reset the array counter
    max=$1               # set a new max
}
max==$1 {
    list[++n]=$0         # record the line in our array
}
END {
    for(i=1;i<=n;i++) {  # print the array in order of found lines
        print list[i]
    }
}
With the same input data that F.Knorr tested with, I get the same results.
The idea here is to go through the file in ONE pass. We record every line that matches our max in an array, and if we come across a value that exceeds the max, we clear the array and start collecting lines afresh.
This approach is heavier on CPU and memory (depending on the size of your dataset), but being single pass, it is likely to be lighter on IO.
The issue here is that newlines matter to awk.
# This does what I should have done:
pass==1 {print "pass1 is", pass;}
pass==2 {if (pass==2) print "pass2 is", pass;}
# This is the code in my question:
# When pass == 1, print the record (a bare pattern gets the default action { print })
pass==1
# On every record, unconditionally, do this
{print "pass1 is", pass;}
# When pass == 2, print the record (default action again)
pass==2
# On every record, unconditionally, do this
{if (pass==2) print "pass2 is", pass;}
Putting pass==1 and pass==2 on the same line as their actions isn't as elegant, but it works.
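For reference, with the corrected one-line form above, the original command line should print each pass's message exactly once (x.txt still containing the single line "hello"):
$ gawk -F , -f test.awk pass=1 x.txt pass=2 x.txt
pass1 is 1
pass2 is 2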