Having awk act upon results from a system command - awk

I have found some excellent help here about how to invoke external commands from within awk and store the results in a variable. What I have not been able to find is how to have awk act upon the result as it would on an ordinary input text file.
I use awk to parse a small HTML file (the status page of a running Tahoe LAFS node) in order to find some IP addresses listed there. On each IP address I run an nmap scan of a specific port to see if it is open (yes, this is to become an automated Tahoe LAFS grid monitor). Using an if statement I can pick out the line of nmap's output that contains the status (open/filtered/closed) of the port as its second field (typically "8098/tcp open unknown"). I would like to strip fields 1 and 3 from that line and keep only field 2; however, $2 of course refers to the fields in the HTML file I use as input to my awk script. I tried a user-defined function that just did return $2, but that also refers to the field in the input HTML file.
Is there a way to refer to fields in an internally created variable inside an awk script? Something like a nested awk command within an awk script?

Use the getline "function". It sets $0 to the entire record and $1 through $NF in the usual way:
$ awk '/test/ {
> while (("ping -c 2 google.com") | getline > 0) {
> printf("$1 = %s, $2 = %s\n", $1, $2);
> }
> }'
abc
test
$1 = PING, $2 = google.com
$1 = 64, $2 = bytes
$1 = 64, $2 = bytes
$1 = , $2 =
$1 = ---, $2 = google.com
$1 = 2, $2 = packets
$1 = round-trip, $2 = min/avg/max/stddev
xyz
$
Edit: added parentheses around (cmd | getline) (it works for me without them but I guess some awk variants require it?).
Edit 2: apparently the "parentheses around getline" thing comes from a quite different issue noted in the GNU awk manuals:
According to POSIX, ‘expression | getline’ is ambiguous if expression contains unparenthesized operators other than ‘$’—for example, ‘"echo " "date" | getline’ is ambiguous because the concatenation operator is not parenthesized. You should write it as ‘("echo " "date") | getline’ if you want your program to be portable to all awk implementations.
In this case, the expression before the pipe is a single string, so there is no ambiguity. I moved the parentheses to where they would be needed for a more complex expression.
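For instance, a minimal sketch of the portable form, using a harmless `echo` in place of a real command (the command string and variable names are illustrative):

```shell
# The command string is built by concatenation, so POSIX requires it to be
# parenthesized before | getline -- or, simpler, stored in a variable first.
awk 'BEGIN {
    cmd = "echo " "hello world"        # concatenation on the left of the pipe
    while ((cmd | getline line) > 0)   # parenthesize the getline test too
        print "got: " line
    close(cmd)
}'
# prints: got: hello world
```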
Also, it's a good idea to call close() on the command after the while loop exits. If another line matches test, awk will otherwise assume the existing sub-command should be read further, unless it has been close()d. Since the command is identified by its string, it's better still, rather than parenthesizing the left-hand side of the pipe-to-getline, to store the command in a variable and pass that variable to close(). For example:
awk '/^test / {
cmd = sprintf("ping -c %d %s", $2, $3)
while (cmd | getline > 0) print
close(cmd)
}'
(a variant without the semicolons that some dislike :-) ), which, when fed:
test 1 google.com
produces:
PING google.com (74.125.225.161): 56 data bytes
64 bytes from 74.125.225.161: icmp_seq=0 ttl=56 time=22.898 ms
--- google.com ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 22.898/22.898/22.898/0.000 ms
Addendum (poking around on the web, I discovered that this is less obvious than I thought it was): be aware that this kind of "bare" getline, because it replaces the "current line", causes any remaining pattern-and-action rules in the script to fire on the new line contents. E.g., after the above, $0 begins with round-trip min/av , so a subsequent rule of the form /^round/ would match, even though the input line that triggered the "ping" was test 1 google.com. If this is not the last rule, it's probably appropriate to add a next directive to it. (In a complicated script I'd put that in every getline-ing action, even the last one, in case the last rule is moved, or more are added.)
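Here is a tiny, self-contained illustration of the clobbering (an `echo` stands in for the ping command):

```shell
# The first rule reads from a command with a bare getline, replacing $0.
# Without a "next", the second rule then fires on the command's output,
# not on the original input line.
printf 'test\n' | awk '
/test/ {
    cmd = "echo round-trip stats"
    while ((cmd | getline) > 0) ;   # bare getline: clobbers $0
    close(cmd)
    # a "next" here would stop the later rule seeing the new $0
}
/^round/ { print "later rule fired on: " $0 }'
# prints: later rule fired on: round-trip stats
```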

As the relevant part of my final awk script is too large to fit in a comment, I'll post it as an answer. The stripInputRecord, getIpNumber and getPortNumber functions just pick out the useful parts from the HTML code.
/address/ {
    ip = stripInputRecord( $0 );
    ip = getIpNumber( ip );
    port[na] = stripInputRecord( $0 );
    port[na] = getPortNumber( port[na] );
    if (!(ip ~ "N/A")) {
        if (ip ~ /loopback/) {
            ip = "127.0.0.1";
            port[na] = stdp;
        }
        cmd = "nmap -PN -p " stdp " " ip
        cmd2 = "nmap -PN -p " port[na] " " ip
        while ((cmd | getline) == 1) {
            if ($0 ~ stdp) {
                stdportstatus[na] = $2
            }
        }
        while ((cmd2 | getline) == 1) {
            if ($0 ~ port[na]) {
                otherportstatus[na] = $2
            }
        }
    }
    close(cmd)
    close(cmd2)
    if ($0 ~ /N\/A/) {
        stdportstatus[na] = "-";
        otherportstatus[na] = "-";
    }
    na++;
}
Thank you all (especially torek!)

Related

AWK:Convert columns to rows with condition (create list ) [duplicate]

I have a tab-delimited file with three columns (excerpt):
AC147602.5_FG004 IPR000146 Fructose-1,6-bisphosphatase class 1/Sedoheputulose-1,7-bisphosphatase
AC147602.5_FG004 IPR023079 Sedoheptulose-1,7-bisphosphatase
AC148152.3_FG001 IPR002110 Ankyrin repeat
AC148152.3_FG001 IPR026961 PGG domain
and I'd like to get this using bash:
AC147602.5_FG004 IPR000146 Fructose-1,6-bisphosphatase class 1/Sedoheputulose-1,7-bisphosphatase IPR023079 Sedoheptulose-1,7-bisphosphatase
AC148152.3_FG001 IPR002110 Ankyrin repeat IPR026961 PGG domain
So if the ID in the first column is the same in several lines, it should produce one line per ID with all the other parts of those lines joined. In the example this gives a two-row file.
give this one-liner a try:
awk -F'\t' -v OFS='\t' '{x=$1;$1="";a[x]=a[x]$0}END{for(x in a)print x,a[x]}' file
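For example, with a two-line sample file for a single ID (file name and data made up). Note that because `$1=""` leaves an empty first field, the key ends up followed by two tabs; harmless for many purposes, but worth knowing:

```shell
printf 'A\tB\tC\nA\tD\n' > demo.tsv   # hypothetical sample input
awk -F'\t' -v OFS='\t' '{x=$1;$1="";a[x]=a[x]$0}END{for(x in a)print x,a[x]}' demo.tsv
# -> "A<TAB><TAB>B<TAB>C<TAB>D" (one key, so the unordered for-in
#    loop here prints a single line)
```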
For whatever reason, the awk solution does not work for me in Cygwin, so I used Perl instead. It joins the values around a tab character and separates lines with \n:
cat FILENAME | perl -e 'foreach $Line (<STDIN>) { @Cols = ($Line =~ /^\s*(\S+)\s*(.*?)\s*$/); push(@{ $Link{$Cols[0]} }, $Cols[1]); } foreach $Key (keys %Link) { print $Key . "\t" . join("\t", @{ $Link{$Key} }) . "\n"; }'
Which is best will depend on file size (and awk limitations).
If the file is too big, the following reduces the work awk has to do by sorting the file first, so only one label needs to be kept in memory for printing.
A classic version with post-printing, using a modification of the whole line:
sort YourFile \
| awk '
$1 == Last { sub( /^[^[:blank:]]*[[:blank:]]+/, ""); C = C " " $0; next }
NR > 1     { print Last C }
           { Last = $1; sub( /^[^[:blank:]]*[[:blank:]]+/, ""); C = " " $0 }
END        { print Last C }
'
Another version using fields and pre-printing, but less "human readable":
sort YourFile \
| awk '
Last != $1 { printf( "%s%s", (NR > 1 ? "\n" : ""), Last = $1) }
Last == $1 { for( i = 2; i <= NF; i++) printf( " %s", $i) }
END        { print "" }
'
A pure bash version. It has no additional dependencies, but requires bash 4.0 or above (2009) for associative array support.
All on one line:
{ declare -A merged; merged=(); while IFS=$'\t' read -r key value; do merged[$key]="${merged[$key]}"$'\t'"$value"; done; for key in "${!merged[@]}"; do echo "$key${merged[$key]}"; done; } < INPUT_FILE.tsv
Readable and commented equivalent:
{
# Define `merged` as an empty associative array.
declare -A merged
merged=()
# Read tab-separated lines. Any leftover fields also end up in `value`.
while IFS=$'\t' read -r key value
do
# Append to any value that's already there, separated by a tab.
merged[$key]="${merged[$key]}"$'\t'"$value"
done
# Loop over the input keys. Note that the order is arbitrary;
# pipe through `sort` if you want a predictable order.
for key in "${!merged[@]}"
do
# Each value is prefixed with a tab, so no need for a tab here.
echo "$key${merged[$key]}"
done
} < INPUT_FILE.tsv

AWK, exclude results from one file with regards to a second file

Using Awk, I am able to get a list of URL with a given error number :
awk '($9 ~ /404/)' /var/log/nginx/access.log | awk '{print $7}' | sort | uniq -c | sort -rn
Fine and dandy.
But we would like to refine it further by matching that result against a list of already known 404 URLs. For example:
awk '($9 ~ /404/)' /var/log/nginx/access.log | awk '{print $7}' | sort | uniq -c | sort -k 2 -r | awk '{print > "/mnt/tmp/404error.txt"}'
yields today:
1 /going-out/restaurants/the-current-restaurent.htm
1 /going-out/restaurants/mare.HTML
1 /going-out/report-content/?cid=5
1 /going-out/report-content/?cid=38550
1 /going-out/report-content/?cid=380
the day after:
1 /going-out/ru/%d0%bd%d0%be%d1%87%d0%bd%d0%b0%d1%8f-%d0%b6%d0%b8%d0%b7%d0%bd%d1%8c-%d0%bd%d0%b0-%d0%bf%d1%85%d1%83%d0%ba%d0%b5%d1%82%d0%b5/%d1%81%d0%be%d0%b2%d0%b5%d1%82%d1%8b-%d0%bb%d1%8e%d0%b1%d0%b8%d1%82%d0%b5%d0%bb%d1%8f%d0%bc-%d0%bd%d0%be%d1%87%d0%bd%d1%8b%d1%85-%d1%80%d0%b0%d0%b7%d0%b2%d0%bb%d0%b5%d1%87%d0%b5%d0%bd%d0%b8%d0%b9/
1 /going-out/restaurants/the-current-restaurent.htm
1 /going-out/restaurants/mare.HTML
1 /going-out/report-content/?cid=5
1 /going-out/report-content/?cid=38550
1 /going-out/report-content/?cid=380
1 /going-out/report-content/?cid=29968
1 /going-out/report-content/?cid=29823
The goal is to have only the new URL.
At that point I am lost. I know I can push the first file into an array, and I presume I can do the same with the second file (in a second array), then maybe (not sure if awk has the capability) simply cross them and keep what does not match.
Any help will be fully appreciated.
So you have a file whose $9 field may match /404/. If so, you want to store the 7th field. Then, count how many of them appeared in total, but only those that do not appear in a file you already have.
I think all of this can be done with this (untested, because I have no sample input data):
awk 'FNR==NR {seen[$2];next}
$9 ~ /404/ {if (!($7 in seen)) a[$7]++}
END {for (i in a) print a[i], i}' old_file log_file
This stores the 2nd column from the old file into the array seen[]. Then it goes through the new file and stores the 7th column if it wasn't seen before. Finally, it prints the counters.
Since it looks like you have an old awk version that does not support the syntax index in array, you can use this workaround for it:
$9 ~ /404/ {for (i in seen) {if (i==$7) next} a[$7]++}
Note you must be using a veeery old version, since this was introduced in 1987:
A.1 Major Changes Between V7 and SVR3.1
The awk language evolved considerably between the release of Version 7
Unix (1978) and the new version that was first made generally
available in System V Release 3.1 (1987). This section summarizes the
changes, with cross-references to further details:
The expression ‘indx in array’ outside of for statements (see
Reference to Elements)
You can use grep -v --fixed-strings --file=FILEALL FILENEW or comm -23 FILENEW FILEALL for this. FILEALL is the file containing the URLs already found, FILENEW contains the pages found today. For comm both files must be sorted.
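A self-contained sketch of the comm approach (file names and URLs are illustrative; both inputs must be sorted):

```shell
printf '/old-page\n/other-old\n' | sort > FILEALL   # already-known 404 URLs
printf '/brand-new\n/old-page\n' | sort > FILENEW   # today's 404 URLs
comm -23 FILENEW FILEALL   # lines unique to FILENEW, i.e. the new URLs
# prints: /brand-new
```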
http://www.gnu.org/software/gawk/manual/gawk.html#Other-Inherited-Files
http://linux.die.net/man/1/comm
I think comm is more efficient because it uses sorted files, but I did not test this.
I came up with the following :
awk 'BEGIN {
    while (getline < "/mnt/tmp/404error.txt") {
        A[$1] = $1;
    }
    while (getline < "/var/log/nginx/access.log") {
        if ($9 ~ /404/) {
            exist[$7] = $7;
            if ($7 in A) blah += 1; else new[$7];
        }
    }
    asort(exist);
    for (i in exist)
        print exist[i] > "/mnt/tmp/404error.txt"
    asorti(new);
    for (i in new)
        print new[i] > "/mnt/tmp/new404error.txt"
}
' | mutt -s "subject" -a /mnt/tmp/new404error.txt -- whoever@mail.net, whatever@mail.net
That seems to provide what I want (almost).
But I believe it is far too verbose; maybe one of you geniuses can improve it.
Thanks

How to use multiple passes with gawk?

I'm trying to use GAWK from CYGWIN to process a csv file. Pass 1 finds the max value, and pass 2 prints the records that match the max value. I'm using a .awk file as input. When I use the text in the manual, it matches on both passes. I can use the IF form as a workaround, but that forces me to use IF inside every pattern match, which is kind of a pain. Any idea what I'm doing wrong?
Here's my .awk file:
pass == 1
{
print "pass1 is", pass;
}
pass == 2
{
if(pass == 2)
print "pass2 is", pass;
}
Here's my output (input file is just "hello"):
hello
pass1 is 1
pass1 is 2
hello
pass2 is 2
Here's my command line:
gawk -F , -f test.awk pass=1 x.txt pass=2 x.txt
I'd appreciate any help.
An (g)awk solution might look like this:
awk 'FNR == NR{print "1st pass"; next}
{print "second pass"}' x.txt x.txt
(Please replace awk by gawk if necessary.)
Let's say you wanted to search for the maximum value in the first column of file x.txt and then print all lines which have this value in the first column. Your program might look like this (thanks to Ed Morton for a tip, see comment):
awk -F"," 'FNR==NR {max = ( (FNR==1) || ($1 > max) ? $1 : max ); next}
$1==max' x.txt x.txt
The output for x.txt:
6,5
2,6
5,7
6,9
is
6,5
6,9
How does this work? The variable NR keeps increasing with every record, whereas FNR is reset to 1 when reading a new file. Therefore, FNR==NR is only true for the first file processed.
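A quick demonstration of that behavior (the file name and contents are made up):

```shell
printf 'a\nb\n' > f.txt    # tiny sample file, read twice below
awk 'FNR == NR { print "pass 1: " $0 " (NR=" NR ", FNR=" FNR ")"; next }
              { print "pass 2: " $0 " (NR=" NR ", FNR=" FNR ")" }' f.txt f.txt
# FNR resets to 1 on the second f.txt while NR keeps counting,
# so FNR==NR holds only during the first pass.
```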
So... F.Knorr answered your question accurately and concisely, and he deserves a big green checkmark. NR==FNR is exactly the secret sauce you're looking for.
But here is a different approach, just in case the multi-pass thing proves to be problematic. (Perhaps you're reading the file from a slow drive, a USB stick, across a network, DAT tape, etc.)
awk -F, '$1>m{delete l;n=0;m=$1}m==$1{l[++n]=$0}END{for(i=1;i<=n;i++)print l[i]}' inputfile
Or, spaced out for easier reading:
BEGIN {
FS=","
}
$1 > max {
delete list # empty the array
n=0 # reset the array counter
max=$1 # set a new max
}
max==$1 {
list[++n]=$0 # record the line in our array
}
END {
for(i=1;i<=n;i++) { # print the array in order of found lines.
print list[i]
}
}
With the same input data that F.Knorr tested with, I get the same results.
The idea here is to go through the file in ONE pass. We record every line that matches our max in an array, and if we come across a value that exceeds the max, we clear the array and start collecting lines afresh.
This approach is heavier on CPU and memory (depending on the size of your dataset), but being single pass, it is likely to be lighter on IO.
The issue here is that newlines matter to awk.
# This does what I should have done:
pass==1 {print "pass1 is", pass;}
pass==2 {if (pass==2) print "pass2 is", pass;}
# This is the code in my question:
# When pass == 1, print the line (a bare pattern gets the default action, print)
pass==1
# On every condition, do this
{print "pass1 is", pass;}
# When pass == 2, print the line (default action again)
pass==2
# On every condition, do this
{if (pass==2) print "pass2 is", pass;}
Using pass==1, pass==2 isn't as elegant, but it works.
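A minimal demonstration of why the newline placement matters; the bare pattern prints the record, and the patternless action runs for every record:

```shell
printf 'hello\n' | awk -v pass=1 '
pass==1
{ print "always runs" }'
# prints two lines: "hello" (default action of the bare pattern),
# then "always runs" (patternless action)
```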

awk: create list of destination ports seen for each source IP from a bro log (conn.log)

I'm trying to solve a problem in awk as an exercise but I'm having trouble.
I want awk (or gawk) to be able to print all unique destination ports for a particular source IP address.
The source IP address is field 1 ($1) and the destination port is field 4 ($4).
Cut for brevity:
SourceIP SrcPort DstIP DstPort
192.168.1.195 59508 98.129.121.199 80
192.168.1.87 64802 192.168.1.2 53
10.1.1.1 41170 199.253.249.63 53
10.1.1.1 62281 204.14.233.9 443
I imagine you would store each Source IP as an index into an array. But I'm not quite sure how you would store destination ports as values. Maybe you can keep appending to a string as the value at that index, e.g. "80,"..."80,443,"... for each match. But maybe that's not the best solution.
I'm not too concerned about output, I really just want to see how one can approach this in awk. Though, for output I was thinking something like,
Source IP:dstport, dstport, dstport
192.168.1.195:80,443,8088,5900
I'm tinkering with something like this,
awk '{ if ( NR == 1) next; arr[$1,$4] = $4 } END { for (i in arr) print arr[i] }' infile
but cannot figure out how to print out the elements and their values for a two-dimensional array. It seems something along this line would take care of the unique destination port task because each port is overwriting the value of the element.
Note: awk/gawk solution will get the answer!
Solution EDIT: slightly modified Kent's solution to print unique destination ports as mentioned in my question and to skip the column header line.
awk 'NR == 1 { next } !seen[$1, $4]++ { a[$1] = (a[$1] == "" ? $4 : a[$1] "," $4) } END { for (x in a) print x ":" a[x] }'
here is one way with awk:
awk '{k=$1;a[k]=a[k]?a[k]","$4:$4}END{for(x in a)print x":"a[x]}' file
with your example, the output is:
kent$ awk '{k=$1;a[k]=a[k]?a[k]","$4:$4}END{for(x in a)print x":"a[x]}' file
192.168.1.195:80
192.168.1.87:53
10.1.1.1:53,443
(I omitted the title line)
EDIT
k=$1;a[k]=a[k]?a[k]","$4:$4
is exactly same as:
if (a[$1]) # if a[$1] is not empty
a[$1] = a[$1]","$4 # concatenate $4 to it separated by ","
else # else if a[$1] is empty
a[$1] = $4 # let a[$1]=$4
I used k=$1 just to save some typing; the a[k] ? ... : ... part is the x = boolean ? a : b ternary expression.
I hope the explanation lets you understand the code.
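The same ternary-append idiom can be seen in isolation (values are made up):

```shell
awk 'BEGIN {
    a = ""
    a = (a ? a "," "80"  : "80")    # a is empty, so a becomes "80"
    a = (a ? a "," "443" : "443")   # a is non-empty, so "443" is appended
    print a
}'
# prints: 80,443
```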
I prefer a solution using perl because I like the possibilities of creating data structures like hashes of arrays:
perl -ane '
## Same BEGIN block than AWK. It prints header before processing any input.
BEGIN { printf qq|%s:%s\n|, q|Source IP|, q|dstport| }
## Skip first input line (header).
next if $. == 1;
## This is what you were thinking to achieve. Store source IP as key of a
## hash, and instead of save a string, it will save an array with all
## ports.
push @{ $ip{ $F[0] } }, $F[ 3 ];
## Same END block than AWK. For each IP, get all ports saved in the array
## and join them using a comma.
END { printf qq|%s:%s\n|, $_, join q|,|, @{ $ip{ $_ } } for keys %ip }
' infile
It yields:
Source IP:dstport
192.168.1.195:80
10.1.1.1:53,443
192.168.1.87:53

Best Awk Commands

I find AWK really useful. Here is a one liner I put together to manipulate data.
ls | awk '{ print "awk " "'"'"'" " {print $1,$2,$3} " "'"'"'" " " $1 ".old_ext > " $1 ".new_ext" }' > file.csh
I used this AWK to make a script file that would rename some files and only print out selective columns. Anyone know a better way to do this? What are your best AWK one-liners or clever manipulations?
The AWK book is full of great examples. They used to be collected for download from Kernighan's webpage (404s now).
You can find several nice one liners here.
I use this:
df -m | awk '{p+=$3}; END {print p}'
To total all disk space used on a system across filesystems.
Many years ago I wrote a tail script in awk:
#!/usr/bin/awk -f
BEGIN {
lines=10
}
{
high = NR % lines + 1
a[high] = $0
}
END {
for (i = 0; i < lines; i++) {
n = (i + high) % lines + 1
if (n in a) {
print a[n]
}
}
}
It's silly, I know, but that's what awk does to you. It's just very fun playing with it.
Henry Spencer wrote a fairly good implementation of nroff on awk. He called it "awf". He also claimed that if Larry Wall had known how powerful awk was, he wouldn't have needed to invent perl.
Here's a couple of awks that I used to use regularly ... note that you can use $1, $2, etc to get out the column you want. So, for manipulating a bunch of files, for example here's a stupid command you could use instead of mv ...
ls -1 *.mp3 | awk '{printf("mv %s newDir/%s\n",$1,$1)}' | /bin/sh
Or if you're looking at a set of processes maybe ...
ps -ef | grep -v username | awk '{printf("kill -9 %s\n",$2)}' | /bin/sh
Pretty trivial but you can see how that would get you quite a ways. =) Most of the stuff I used to do you can use xargs for, but hey, who needs them new fangled commands?
I use this script a lot for editing PATH and path-like environment variables.
Usage:
export PATH=$(clnpath /new/bin:/other/bin:$PATH /old/bin:/other/old/bin)
This command adds /new/bin and /other/bin in front of PATH, removes both /old/bin and /other/old/bin from PATH (if present - no error if absent), and removes duplicate directory entries on path.
: "@(#)$Id: clnpath.sh,v 1.6 1999/06/08 23:34:07 jleffler Exp $"
#
# Print minimal version of $PATH, possibly removing some items
case $# in
0) chop=""; path=${PATH:?};;
1) chop=""; path=$1;;
2) chop=$2; path=$1;;
*) echo "Usage: `basename $0 .sh` [$PATH [remove:list]]" >&2
exit 1;;
esac
# Beware of the quotes in the assignment to chop!
echo "$path" |
${AWK:-awk} -F: '#
BEGIN { # Sort out which path components to omit
chop="'"$chop"'";
if (chop != "") nr = split(chop, remove); else nr = 0;
for (i = 1; i <= nr; i++)
omit[remove[i]] = 1;
}
{
for (i = 1; i <= NF; i++)
{
x=$i;
if (x == "") x = ".";
if (omit[x] == 0 && path[x]++ == 0)
{
output = output pad x;
pad = ":";
}
}
print output;
}'
Count memory used by httpd
ps -ylC httpd | awk '/[0-9]/ {SUM += $8} END {print SUM/1024}'
Or any other process by replacing httpd. Dividing by 1024 to get output in MB.
I managed to build a DOS tree command emulator for UNIX ( find + awk ):
find . -type d -print 2>/dev/null|awk '{for (i=1;i< NF;i++)printf("%"length($i)"s","|");gsub(/[^\/]*\//,"--",$0);print $NF}' FS='/'
Print lines between two patterns:
awk '/END/{flag=0}flag;/START/{flag=1}' inputFile
Detailed explanation: http://nixtip.wordpress.com/2010/10/12/print-lines-between-two-patterns-the-awk-way/
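For instance (the order of the three pattern blocks is what excludes the START and END lines themselves):

```shell
printf 'a\nSTART\nb\nc\nEND\nd\n' | awk '/END/{flag=0}flag;/START/{flag=1}'
# prints b and c: on the START line, flag is still 0 when the bare
# "flag" pattern is tested; on the END line, flag has already been reset.
```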
A couple of favorites, essentially unrelated to each other. Read as 2 different, unconnected suggestions.
Identifying Column Numbers Easily
:
For those that use awk frequently, as I do for log analysis at work, I often find myself needing to find out what the column numbers are for a file. So, if I am analyzing, say, Apache access files (some samples can be found here) I run the script below against the file:
NR == 1 {
for (i = 1 ; i <= NF ; i++)
{
print i "\t" $i
}
}
NR > 1 {
exit
}
I typically call it "cn.awk", for 'c'olumn 'n'umbers. Creative, eh? Anyway, the output looks like:
1 64.242.88.10
2 -
3 -
4 [07/Mar/2004:16:05:49
5 -0800]
6 "GET
7 /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables
8 HTTP/1.1"
9 401
10 12846
Very easy to tell what's what. I usually alias this on my servers and have it everywhere.
Referencing Fields by Name
Now, suppose your file has a header row and you'd rather use those names instead of field numbers. This allows you to do so:
NR == 1 {
for (i = 1 ; i <= NF ; i++)
{
field[$i] = i
}
}
Now, suppose I have this header row...
metric,time,val,location,http_status,http_request
...and I'd like to sum the val column. Instead of referring to $3, I can refer to it by name:
NR > 1 {
SUM += $field["val"]
}
The main benefit is making the script much more readable.
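Putting the two pieces together, a runnable sketch (the file name and data are made up for illustration):

```shell
cat > metrics.csv <<'EOF'
metric,time,val,location,http_status,http_request
m1,0,10,us,200,GET
m2,1,32,eu,200,GET
EOF
awk -F, '
NR == 1 { for (i = 1; i <= NF; i++) field[$i] = i; next }
        { sum += $field["val"] }
END     { print sum }' metrics.csv
# prints: 42
```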
Printing fields is one of the first things mentioned in most AWK tutorials.
awk '{print $1,$3}' file
Lesser known but equally useful is excluding fields which is also possible:
awk '{$1=$3=""}1' file
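One caveat worth knowing: assigning empty strings leaves the field separators behind, so you may want to squeeze them out afterwards:

```shell
echo 'a b c d' | awk '{$1=$3=""}1'
# -> " b  d"   (a leading blank and a doubled blank remain)
echo 'a b c d' | awk '{$1=$3=""; gsub(/^ +| +$/,""); gsub(/  +/," ")}1'
# -> "b d"
```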