AWK: exclude results from one file with regard to a second file

Using awk, I am able to get a list of URLs with a given error number:
awk '($9 ~ /404/)' /var/log/nginx/access.log | awk '{print $7}' | sort | uniq -c | sort -rn
Fine and dandy.
But we would like to further refine it by matching that result against a list of already known 404 URLs.
For example:
awk '($9 ~ /404/)' /var/log/nginx/access.log | awk '{print $7}' | sort | uniq -c | sort -k 2 -r | awk '{print > "/mnt/tmp/404error.txt"}'
which today yields:
1 /going-out/restaurants/the-current-restaurent.htm
1 /going-out/restaurants/mare.HTML
1 /going-out/report-content/?cid=5
1 /going-out/report-content/?cid=38550
1 /going-out/report-content/?cid=380
and the day after:
1 /going-out/ru/%d0%bd%d0%be%d1%87%d0%bd%d0%b0%d1%8f-%d0%b6%d0%b8%d0%b7%d0%bd%d1%8c-%d0%bd%d0%b0-%d0%bf%d1%85%d1%83%d0%ba%d0%b5%d1%82%d0%b5/%d1%81%d0%be%d0%b2%d0%b5%d1%82%d1%8b-%d0%bb%d1%8e%d0%b1%d0%b8%d1%82%d0%b5%d0%bb%d1%8f%d0%bc-%d0%bd%d0%be%d1%87%d0%bd%d1%8b%d1%85-%d1%80%d0%b0%d0%b7%d0%b2%d0%bb%d0%b5%d1%87%d0%b5%d0%bd%d0%b8%d0%b9/
1 /going-out/restaurants/the-current-restaurent.htm
1 /going-out/restaurants/mare.HTML
1 /going-out/report-content/?cid=5
1 /going-out/report-content/?cid=38550
1 /going-out/report-content/?cid=380
1 /going-out/report-content/?cid=29968
1 /going-out/report-content/?cid=29823
The goal is to have only the new URLs.
At that point I am lost. I know I can push the first file into an array, and I presume I can do the same with the second file (in a second array), then maybe (not sure if awk has the capability) simply cross them and keep what does not match.
Any help will be fully appreciated.

So you have a file whose $9 field may match /404/. If so, you want to store the 7th field. Then, count how many of them appeared in total, but only those that do not already appear in a file you have.
I think all of this can be done with this (untested, because I have no sample input data):
awk 'FNR==NR {seen[$2];next}
     $9 ~ /404/ {if (!($7 in seen)) a[$7]++}
     END {for (i in a) print a[i], i}' old_file log_file
This stores the 2nd column of the file with already-known URLs into an array seen[]. Then it goes through the log file and counts the 7th column if it wasn't seen before. Finally, it prints the counters.
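With the file names from the question, the invocation would presumably be (untested; it assumes /mnt/tmp/404error.txt still has the uniq -c style layout, i.e. the count in the 1st column and the URL in the 2nd):
awk 'FNR==NR {seen[$2]; next}
     $9 ~ /404/ {if (!($7 in seen)) a[$7]++}
     END {for (i in a) print a[i], i}' /mnt/tmp/404error.txt /var/log/nginx/access.log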
Since it looks like you have an old awk version that does not support the syntax index in array, you can use this workaround for it:
$9 ~ /404/ {for (i in seen) {if (i==$7) next} a[$7]++}
Note you must be using a veeery old version, since this was introduced in 1987:
A.1 Major Changes Between V7 and SVR3.1
The awk language evolved considerably between the release of Version 7
Unix (1978) and the new version that was first made generally
available in System V Release 3.1 (1987). This section summarizes the
changes, with cross-references to further details:
The expression ‘indx in array’ outside of for statements (see
Reference to Elements)

You can use grep --invert-match --fixed-strings --file=FILEALL FILENEW or comm -23 FILENEW FILEALL for this. FILEALL is the file containing the URLs already found, FILENEW contains the pages found today. For comm both files must be sorted.
http://www.gnu.org/software/gawk/manual/gawk.html#Other-Inherited-Files
http://linux.die.net/man/1/comm
I think comm is more efficient because it uses sorted files, but I did not test this.
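A minimal sketch of both routes, assuming FILEALL and FILENEW hold one URL per line (new404.txt is just a placeholder name):
# comm route: both inputs must be sorted
sort FILEALL > FILEALL.sorted
sort FILENEW > FILENEW.sorted
comm -23 FILENEW.sorted FILEALL.sorted > new404.txt
# grep route: no sorting needed
grep --invert-match --fixed-strings --file=FILEALL FILENEW > new404.txt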

I came up with the following:
awk 'BEGIN {
    while (getline < "/mnt/tmp/404error.txt") {
        A[$1] = $1;
    }
    while (getline < "/var/log/nginx/access.log") {
        if ($9 ~ /404/) {
            exist[$7] = $7;
            if ($7 in A) blah += 1; else new[$7];
        }
    }
    asort(exist);
    for (i in exist)
        print exist[i] > "/mnt/tmp/404error.txt"
    asorti(new);
    for (i in new)
        print new[i] > "/mnt/tmp/new404error.txt"
}
' | mutt -s "subject" -a /mnt/tmp/new404error.txt -- whoever@mail.net, whatever@mail.net
That seems to provide what I want (almost).
But I believe it is far too verbose; maybe one of you geniuses can improve it.
Thanks
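For what it's worth, here is a more compact sketch of the same idea using the FNR==NR two-file idiom (untested and hedged: it assumes /mnt/tmp/404error.txt now holds one URL per line, that appending new URLs to it is acceptable, and that the mail addresses are placeholders):
awk '
FNR==NR { known[$1]; next }                      # first file: URLs already reported
$9 ~ /404/ && !($7 in known) && !($7 in new) {
    new[$7]
    print $7 >> "/mnt/tmp/404error.txt"          # extend the master list
    print $7 >  "/mnt/tmp/new404error.txt"       # collect only today's new URLs
}
' /mnt/tmp/404error.txt /var/log/nginx/access.log
mutt -s "subject" -a /mnt/tmp/new404error.txt -- whoever@mail.net whatever@mail.net < /dev/null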

Related

AWK: Convert columns to rows with condition (create list) [duplicate]

I have a tab-delimited file with three columns (excerpt):
AC147602.5_FG004 IPR000146 Fructose-1,6-bisphosphatase class 1/Sedoheputulose-1,7-bisphosphatase
AC147602.5_FG004 IPR023079 Sedoheptulose-1,7-bisphosphatase
AC148152.3_FG001 IPR002110 Ankyrin repeat
AC148152.3_FG001 IPR026961 PGG domain
and I'd like to get this using bash:
AC147602.5_FG004 IPR000146 Fructose-1,6-bisphosphatase class 1/Sedoheputulose-1,7-bisphosphatase IPR023079 Sedoheptulose-1,7-bisphosphatase
AC148152.3_FG001 IPR002110 Ankyrin repeat IPR026961 PGG domain
So if the ID in the first column is the same in several lines, it should produce one line for each ID, with all other parts of the lines joined. In the example it will give a two-row file.
give this one-liner a try:
awk -F'\t' -v OFS='\t' '{x=$1;$1="";a[x]=a[x]$0}END{for(x in a)print x,a[x]}' file
For whatever reason, the awk solution does not work for me in Cygwin, so I used Perl instead. It joins fields with a tab character and separates lines with \n.
cat FILENAME | perl -e 'foreach $Line (<STDIN>) { @Cols=($Line=~/^\s*(\d+)\s*(.*?)\s*$/); push(@{$Link{$Cols[0]}}, $Cols[1]); } foreach $List (values %Link) { print join("\t", @{$List})."\n"; }'
This will depend on the file size (and awk limitations).
If the file is too big, the following reduces awk's memory needs by sorting the file first and keeping only one label in memory for printing.
A classical version with post-print, using a modification of the whole line:
sort YourFile \
| awk '
   Last==$1 { sub( /^[^[:blank:]]*[[:blank:]]+/, ""); C = C " " $0; next }
   NR > 1   { print Line C }
            { Line = $0; Last = $1; C = "" }
   END      { print Line C }
   '
Another version using fields and pre-print, but less "human readable":
sort YourFile \
| awk '
   Last!=$1 { printf( "%s%s", (NR > 1 ? "\n" : ""), Last=$1) }
   Last==$1 { for( i=2; i<=NF; i++) printf( " %s", $i) }
   END      { print "" }
   '
A pure bash version. It has no additional dependencies, but requires bash 4.0 or above (2009) for associative array support.
All on one line:
{ declare -A merged; merged=(); while IFS=$'\t' read -r key value; do merged[$key]="${merged[$key]}"$'\t'"$value"; done; for key in "${!merged[@]}"; do echo "$key${merged[$key]}"; done } < INPUT_FILE.tsv
Readable and commented equivalent:
{
# Define `merged` as an empty associative array.
declare -A merged
merged=()
# Read tab-separated lines. Any leftover fields also end up in `value`.
while IFS=$'\t' read -r key value
do
# Append to any value that's already there, separated by a tab.
merged[$key]="${merged[$key]}"$'\t'"$value"
done
# Loop over the input keys. Note that the order is arbitrary;
# pipe through `sort` if you want a predictable order.
for key in "${!merged[@]}"
do
# Each value is prefixed with a tab, so no need for a tab here.
echo "$key${merged[$key]}"
done
} < INPUT_FILE.tsv

Delete every line if occurrence found

I have a file with content in this format:
1 6 8
1 6 9
1 12 20
1 6
2 8
2 9
2 12
2 20
2 35
I want to delete every line where a number from the 2nd or 3rd column (but not from the 1st) also appears in the 2nd or 3rd column of another line, including the line where that number first appears.
I should have this as an output:
2 35
I've tried using:
awk '{for(i=2;i<=NF;i++){if($i in a){next};a[$i]}} 1'
but it doesn't seem to work.
What is wrong?
One-pass awk that hashes all the records to r[NR] and keeps another array a[$i] for the values seen in fields $2,...NF.
awk '{
    for(i=2;i<=NF;i++)        # iterate fields starting from the second
        if($i in a) {         # if field value was seen before
            delete r[a[$i]]   # delete related record
            a[$i]=""          # clear a
            f=1               # flag up
        } else {              # if it was not seen before
            a[$i]=NR          # add record number to a
            r[NR]=$0
        }
    if(f!=1)                  # if flag was not raised
        r[NR]=$0              # store record on record number
    else                      # if it was raised
        f=""                  # flag down
}
END {
    for(i=1;i<=NR;++i)
        if(i in r)
            print r[i]        # output remaining
}' file
Output:
2 35
The simplest way is a double-pass algorithm where you read your file twice.
The idea is to store all values in an array a and count how many times they appear. If a value appears 2 or more times, it means you have found more than a single entry and you should not print the line.
awk '(NR==FNR){a[$2]++; if(NF>2) a[$3]++; next}
(NF==2) && (a[$2]==1);
(NF==3) && (a[$2]==1 && a[$3]==1)' <file> <file>
In practice, you should avoid expressions such as a[var]==1 if you are not sure whether var is in the array, as merely referencing it will create that array element. However, since we never increment it afterwards, it is fine to proceed here.
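A quick illustration of that gotcha (the key name is made up):
awk 'BEGIN { x = (a["junk"] == 1); print ("junk" in a) }'   # prints 1: the comparison created a["junk"]
awk 'BEGIN { x = ("junk" in a);   print ("junk" in a) }'    # prints 0: the in operator does not create the element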
If you want to achieve the same thing with more than three fields you can do:
awk '(NR==FNR){for(i=2;i<=NF;++i) a[$i]++; next }
{for(i=2;i<=NF;++i) if(a[$i]>1) next }
{print}' <file> <file>
While both these solutions read the file twice, you can also store the full file in memory and read the file only a single time. This, however, is exactly the same algorithm:
awk '{for(i=2;i<=NF;++i) a[$i]++; b[NR]=$0}
END { for(j=1;j<=NR;++j) {
        $0=b[j]; ok=1
        for(i=2;i<=NF;++i) if(a[$i]>1) ok=0   # skip lines holding any repeated value
        if (ok) print $0
      }
}' <file>
comment: this single-pass solution is very simple and stores the full file in memory. The solution of James Brown is very clever: it removes entries from memory as soon as they are no longer needed. A bit shorter version is:
awk '{ for(i=2;i<=NF;++i) if ($i in a) delete b[a[$i]]; else { a[$i]=NR; b[NR]=$0 }}
END { for(n=1;n<=NR;++n) if(n in b) print b[n] }' <file>
note: you should never strive for the shortest solution, but the most readable one!
Could you please try the following.
awk '
FNR==NR{
for(i=2;i<=NF;i++){
a[$i]++
}
next
}
(NF==2 && a[$2]==1) || (NF==3 && a[$2]==1 && a[$3]==1)
' Input_file Input_file
Output will be as follows.
2 35
$ cat tst.awk
NR==FNR {
cnt[$2]++
cnt[$3]++
next
}
cnt[$2]<2 && cnt[$NF]<2
$ awk -f tst.awk file file
2 35
This might work for you (GNU sed):
sed -r 'H;s/^[0-9]+ +//;G;s/\n(.*\n)/\1/;h;$!d;s/^([^\n]*)\n(.*)/\2\n \1/;:a;/^[0-9]+ +([0-9]+)\n(.*\n)*[^\n]*\1[^\n]*\1[^\n]*$/bb;/^[0-9]+ +[0-9]+ +([0-9]+)\n(.*\n)*[^\n]*\1[^\n]*\1[^\n]*$/bb;/\n/P;:b;s/^[^\n]*\n//;ta;d' file
This is not a serious solution; however, it demonstrates what can be achieved using only matching and substitution.
The solution makes a copy of the original file and whilst doing so accumulates all numbers in the second and possible third fields of each record in a separate line which it maintains at the head of the copy.
At the end of the file, the first line of the copy contains all the pertinent keys and if there are duplicate keys then any line in the file that contains such a key is deleted. This is achieved by moving the keys (the first line) to the end of the file and matching the second (and possibly third) fields of each record on those keys.

How to use multiple passes with gawk?

I'm trying to use GAWK from CYGWIN to process a csv file. Pass 1 finds the max value, and pass 2 prints the records that match the max value. I'm using a .awk file as input. When I use the text in the manual, it matches on both passes. I can use the IF form as a workaround, but that forces me to use IF inside every pattern match, which is kind of a pain. Any idea what I'm doing wrong?
Here's my .awk file:
pass == 1
{
print "pass1 is", pass;
}
pass == 2
{
if(pass == 2)
print "pass2 is", pass;
}
Here's my output (the input file is just "hello"):
hello
pass1 is 1
pass1 is 2
hello
pass2 is 2
Here's my command line:
gawk -F , -f test.awk pass=1 x.txt pass=2 x.txt
I'd appreciate any help.
An (g)awk solution might look like this:
awk 'FNR == NR{print "1st pass"; next}
{print "second pass"}' x.txt x.txt
(Please replace awk by gawk if necessary.)
Let's say you wanted to find the maximum value in the first column of file x.txt and then print all lines which have this value in the first column; your program might look like this (thanks to Ed Morton for a tip, see comment):
awk -F"," 'FNR==NR {max = ( (FNR==1) || ($1 > max) ? $1 : max ); next}
$1==max' x.txt x.txt
The output for x.txt:
6,5
2,6
5,7
6,9
is
6,5
6,9
How does this work? The variable NR keeps increasing with every record, whereas FNR is reset to 1 when reading a new file. Therefore, FNR==NR is only true for the first file processed.
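A quick way to see this for yourself (two made-up two-line files):
$ printf 'x\ny\n' > a.txt; cp a.txt b.txt
$ awk '{ print FILENAME, NR, FNR }' a.txt b.txt
a.txt 1 1
a.txt 2 2
b.txt 3 1
b.txt 4 2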
So... F.Knorr answered your question accurately and concisely, and he deserves a big green checkmark. NR==FNR is exactly the secret sauce you're looking for.
But here is a different approach, just in case the multi-pass thing proves to be problematic. (Perhaps you're reading the file from a slow drive, a USB stick, across a network, DAT tape, etc.)
awk -F, '$1>m{delete l;n=0;m=$1}m==$1{l[++n]=$0}END{for(i=1;i<=n;i++)print l[i]}' inputfile
Or, spaced out for easier reading:
BEGIN {
    FS=","
}
$1 > max {
    delete list    # empty the array
    n=0            # reset the array counter
    max=$1         # set a new max
}
max==$1 {
    list[++n]=$0   # record the line in our array
}
END {
    for(i=1;i<=n;i++) {   # print the array in order of found lines.
        print list[i]
    }
}
With the same input data that F.Knorr tested with, I get the same results.
The idea here is to go through the file in ONE pass. We record every line that matches our max in an array, and if we come across a value that exceeds the max, we clear the array and start collecting lines afresh.
This approach is heavier on CPU and memory (depending on the size of your dataset), but being single-pass, it is likely to be lighter on IO.
The issue here is that newlines matter to awk.
# This does what I should have done:
pass==1 {print "pass1 is", pass;}
pass==2 {if (pass==2) print "pass2 is", pass;}
# This is the code in my question:
# When pass == 1, do nothing
pass==1
# On every condition, do this
{print "pass1 is", pass;}
# When pass == 2, do nothing
pass==2
# On every condition, do this
{if (pass==2) print "pass2 is", pass;}
Using pass==1, pass==2 isn't as elegant, but it works.

How to use grep via awk?

I am trying to apply grep on just a few strings from a huge file. But, I'd like to pass that line to the grep command via the awk script. I also want the output redirected to the script.
I've an awk script that reads in records from a file. I want grep to be applied on only a few of the records. The current record, $0, will be the text on which grep is to be used.
How do I do the same? Currently, I'm trying this -
system("grep --count -w 'GOOD' \n" $0)
But, it doesn't seem to work. What should I be using?
In Gnu Awk you could use \< and \> to match beginning and end of a word, so
gawk '/\<GOOD\>/{++i} END{print i}'
will do the same as
grep -wc 'GOOD' file
If you want to count the total number of occurrences (not only the number of lines, but also occurrences within a given line/record) of the word GOOD you could use FPAT in Gnu Awk version 4, as
gawk 'BEGIN { FPAT="\\<GOOD\\>"; RS="^$" } { print NF }' file
If you want to count the number of exact matches of the phrase GOOD DI in a given record, for instance record number 3, you could use
gawk 'NR==3 { print patsplit($0,a,/GOOD DI/) }' file
Your question is not very clear, and it would help if you showed some of your input file, your entire script that you have so far and also the output you want to achieve.
In the mean time, as there is nothing in your question to suggest anything to the contrary, you could do the following:
awk 'somescript' somefile | grep --count -w 'GOOD DI'
You cannot apply grep to a text string, which is what you are doing. If you really need to use grep via system(), something like the following would be needed:
system("echo '"$0"' | grep --count -w 'foo'")
But this is no good either, as --count only counts the lines on which the word occurs, not the number of times it occurs on a line, which seems to be what you are after.
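If the real goal is to count occurrences from inside awk without shelling out, a common pure-awk sketch uses gsub's return value (the \< \> word boundaries are GNU awk, as mentioned in another answer):
gawk '{ total += gsub(/\<GOOD\>/, "&") } END { print total+0 }' file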
If you use the regex as a split separator, you get the number of occurrences + 1 pieces.
So following will work:
awk '{printf FNR; a=split($0,myarray,/.OOD/); print " "a-1}' file.txt
This would print each line number with the number of times your regex occurred (in this case ".OOD", representing GOOD, FOOD, MOOD etc.).
You can do it the old-fashioned way:
awk 'BEGIN{count=0} {
    for( i=1; i<=NF; i++) {
        if( $i == "GOOD" ){
            ++count
        }
    }
} END {
    print count
}' file

Best Awk Commands

I find AWK really useful. Here is a one liner I put together to manipulate data.
ls | awk '{ print "awk " "'"'"'" " {print $1,$2,$3} " "'"'"'" " " $1 ".old_ext > " $1 ".new_ext" }' > file.csh
I used this AWK to make a script file that would rename some files and only print out selective columns. Anyone know a better way to do this? What are your best AWK one-liners or clever manipulations?
The AWK book is full of great examples. They used to be collected for download from Kernighan's webpage (404s now).
You can find several nice one liners here.
I use this:
df -m | awk '{p+=$3}; END {print p}'
To total all disk space used on a system across filesystems.
Many years ago I wrote a tail script in awk:
#!/usr/bin/awk -f
BEGIN {
    lines=10
}
{
    high = NR % lines + 1
    a[high] = $0
}
END {
    for (i = 0; i < lines; i++) {
        n = (i + high) % lines + 1
        if (n in a) {
            print a[n]
        }
    }
}
It's silly, I know, but that's what awk does to you. It's just very fun playing with it.
Henry Spencer wrote a fairly good implementation of nroff on awk. He called it "awf". He also claimed that if Larry Wall had known how powerful awk was, he wouldn't have needed to invent perl.
Here's a couple of awks that I used to use regularly ... note that you can use $1, $2, etc to get out the column you want. So, for manipulating a bunch of files, for example here's a stupid command you could use instead of mv ...
ls -1 *.mp3 | awk '{printf("mv %s newDir/%s\n",$1,$1)}' | /bin/sh
Or if you're looking at a set of processes maybe ...
ps -ef | grep -v username | awk '{printf("kill -9 %s\n",$2)}' | /bin/sh
Pretty trivial but you can see how that would get you quite a ways. =) Most of the stuff I used to do you can use xargs for, but hey, who needs them new fangled commands?
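For comparison, a rough xargs equivalent of the mv example above (with the same caveat about whitespace in file names):
ls -1 *.mp3 | xargs -I{} mv {} newDir/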
I use this script a lot for editing PATH and path-like environment variables.
Usage:
export PATH=$(clnpath /new/bin:/other/bin:$PATH /old/bin:/other/old/bin)
This command adds /new/bin and /other/bin in front of PATH, removes both /old/bin and /other/old/bin from PATH (if present - no error if absent), and removes duplicate directory entries on path.
: "#(#)$Id: clnpath.sh,v 1.6 1999/06/08 23:34:07 jleffler Exp $"
#
# Print minimal version of $PATH, possibly removing some items
case $# in
0) chop=""; path=${PATH:?};;
1) chop=""; path=$1;;
2) chop=$2; path=$1;;
*) echo "Usage: `basename $0 .sh` [$PATH [remove:list]]" >&2
exit 1;;
esac
# Beware of the quotes in the assignment to chop!
echo "$path" |
${AWK:-awk} -F: '#
BEGIN { # Sort out which path components to omit
    chop="'"$chop"'";
    if (chop != "") nr = split(chop, remove); else nr = 0;
    for (i = 1; i <= nr; i++)
        omit[remove[i]] = 1;
}
{
    for (i = 1; i <= NF; i++)
    {
        x=$i;
        if (x == "") x = ".";
        if (omit[x] == 0 && path[x]++ == 0)
        {
            output = output pad x;
            pad = ":";
        }
    }
    print output;
}'
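For illustration, with made-up directories the effect would be roughly:
$ oldpath=/old/bin:/usr/bin:/other/old/bin:/usr/bin
$ clnpath /new/bin:/other/bin:$oldpath /old/bin:/other/old/bin
/new/bin:/other/bin:/usr/bin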
Count memory used by httpd
ps -ylC httpd | awk '/[0-9]/ {SUM += $8} END {print SUM/1024}'
Or any other process, by replacing httpd. Dividing by 1024 gives the output in MB.
I managed to build a DOS tree command emulator for UNIX ( find + awk ):
find . -type d -print 2>/dev/null|awk '{for (i=1;i< NF;i++)printf("%"length($i)"s","|");gsub(/[^\/]*\//,"--",$0);print $NF}' FS='/'
Print lines between two patterns:
awk '/END/{flag=0}flag;/START/{flag=1}' inputFile
Detailed explanation: http://nixtip.wordpress.com/2010/10/12/print-lines-between-two-patterns-the-awk-way/
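A tiny illustration with made-up marker lines (note that the START and END lines themselves are not printed):
$ printf 'a\nSTART\nb\nc\nEND\nd\n' | awk '/END/{flag=0}flag;/START/{flag=1}'
b
c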
A couple of favorites, essentially unrelated to each other. Read as 2 different, unconnected suggestions.
Identifying Column Numbers Easily
For those that use awk frequently, as I do for log analysis at work, I often find myself needing to find out what the column numbers are for a file. So, if I am analyzing, say, Apache access files (some samples can be found here) I run the script below against the file:
NR == 1 {
    for (i = 1 ; i <= NF ; i++)
    {
        print i "\t" $i
    }
}
NR > 1 {
    exit
}
I typically call it "cn.awk", for 'c'olumn 'n'umbers. Creative, eh? Anyway, the output looks like:
1 64.242.88.10
2 -
3 -
4 [07/Mar/2004:16:05:49
5 -0800]
6 "GET
7 /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables
8 HTTP/1.1"
9 401
10 12846
Very easy to tell what's what. I usually alias this on my servers and have it everywhere.
Referencing Fields by Name
Now, suppose your file has a header row and you'd rather use those names instead of field numbers. This allows you to do so:
NR == 1 {
    for (i = 1 ; i <= NF ; i++)
    {
        field[$i] = i
    }
}
Now, suppose I have this header row...
metric,time,val,location,http_status,http_request
...and I'd like to sum the val column. Instead of referring to $3, I can refer to it by name:
NR > 1 {
    SUM += $field["val"]
}
The main benefit is making the script much more readable.
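Putting the two snippets together, a complete invocation to sum that column might look like this (data.csv is a placeholder file name):
awk -F, '
NR == 1 { for (i = 1; i <= NF; i++) field[$i] = i; next }
        { SUM += $field["val"] }
END     { print SUM }
' data.csv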
Printing fields is one of the first things mentioned in most AWK tutorials.
awk '{print $1,$3}' file
Lesser known but equally useful is excluding fields, which is also possible:
awk '{$1=$3=""}1' file
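For example, with a made-up four-field line (note that the emptied fields leave a leading blank and doubled separators behind):
$ echo 'a b c d' | awk '{$1=$3=""}1'
 b  d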