awk command - define the size of a word

I'm learning AWK and trying to count the number of sessions to a particular destination.
Using this command:
awk '{print $9}' traffic-log-cust.txt | sort | uniq -c
and I am getting the below output.
#awk '{print $9}' traffic-log-cust.txt | sort | uniq -c
1
1 10.10.17.72/38403->157.55.235.140/40046
1 10.10.17.72/38403->157.55.235.146/40006
1 10.10.17.72/38403->157.55.235.148/40039
1 10.10.17.72/38403->157.55.235.159/40019
1 10.10.17.72/38403->157.55.235.160/40019
1 10.10.17.72/38403->157.55.56.156/40046
1 10.10.17.72/38403->157.55.56.174/40018
1 10.10.17.72/38403->64.4.23.156/40017
1 10.10.17.72/38403->64.4.23.164/40011
1 10.10.17.72/38403->64.4.23.166/40053
1 10.10.17.72/38403->65.55.223.16/40003
1 10.10.17.72/38403->65.55.223.44/40002
#
I believe field 9 has no spaces and contains the destination IP as well.
I would like to know how I can count the sessions based on destination IPs.
Thanks in advance.

I am going to guess that you are having trouble deciding how big each field is. (Your question is unclear.) I would argue you don't need to; just split each row into two fields and deal with the second field.
With awk, you specify what the delimiter is with the -F option, and since the greater-than sign (>) is meaningful in most shells, you have to escape it somehow; a backslash does that.
Since you are using awk, you don't need sort and uniq; associative arrays can be used.
Assuming that you are NOT ignoring the ports:
awk -F\> '{ dest_ips[$2]++ }
END {
    for (ip in dest_ips) {
        printf "%s: %d\n", ip, dest_ips[ip]
    }
}' traffic-log-cust.txt
If you are ignoring the ports, you have to parse that second field first (perhaps using split()).
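For example, a minimal sketch of that split() approach, keeping only the part of the second field before the / (my illustration, not part of the original answer):
awk -F\> '{ split($2, dest, "/"); dest_ips[dest[1]]++ }
END {
    for (ip in dest_ips) {
        printf "%s: %d\n", ip, dest_ips[ip]
    }
}' traffic-log-cust.txt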

Related

Generating 10 random numbers in a range in an awk script

So I'm trying to write an awk script that generates passwords given random names input from a .csv file. I'm aiming for the first 3 letters of the last name, the number of characters in the fourth field, then a random number between 1 and 200 after a space. So far I've got the letters and the number of characters fine, but I'm having a hard time getting the syntax in my for loop to work for the random numbers. Here is an example of the input:
Danette,Suche,Female,"Kingfisher, malachite"
Corny,Chitty,Male,"Seal, southern elephant"
And desired output:
Suc21 80
Chi23 101
For 10 rows total. My code looks like this:
BEGIN{
FS=",";OFS=","
}
{print substr($2,0,3)length($4)
for(i=0;i<10;i++){
echo $(( $RANDOM % 200 ))
}}
Then I've been running it like
awk -F"," -f script.awk file.csv
But it only shows the 3 characters and the length of the fourth field, no random numbers. If anyone's able to point out where I'm screwing up, it would be much appreciated, thanks guys.
You can use rand() to generate a random number between 0 and 1:
awk -F, '{print substr($2,1,3)length($4),int(rand()*200)+1}' file.csv
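Note that awk's rand() returns the same sequence on every run unless it is seeded; seeding once in a BEGIN block fixes that (and substr starts at position 1, since awk strings are 1-indexed):
awk -F, 'BEGIN{srand()} {print substr($2,1,3) length($4), int(rand()*200)+1}' file.csv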
BEGIN{
FS=",";OFS=","
}
{print substr($2,0,3)length($4)
for(i=0;i<10;i++){
echo $(( $RANDOM % 200 ))
}}
There is no echo function defined in GNU AWK. If you wish to use a shell command you might use the system() function; however, keep in mind that it returns the command's exit status and prints whatever the command outputs, without any ability to alter it, so you need to design the command so that you get the desired output from it.
Let file.txt content be
A
B
C
then
awk '{printf "%s ",$0;system("echo ${RANDOM}%200 | bc")}' file.txt
might give output
A 95
B 139
C 1
Explanation: firstly I use printf so no newline is appended automatically; I output the whole line followed by a space, then I execute a command which outputs a random value in the range:
echo ${RANDOM}%200 | bc
It simply feeds RANDOM followed by %200 into the calculator (bc), which outputs the result of that expression. Keep in mind that system() runs the command through /bin/sh, which on some systems does not provide RANDOM at all.
If you are not dead set on using the RANDOM variable, the rand() function can be used without this hassle.
(tested with gawk 4.2.1 and bc 1.07.1)
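For comparison, a minimal gawk-only sketch of the same output using rand() instead (seeded so each run differs):
awk 'BEGIN{srand()} {print $0, int(rand()*200)}' file.txt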

How to use awk to count the occurrence of a word beginning with something?

I have a file that looks like this:
FID IID
1 RQ50131-0
2 469314
3 469704
4 469712
5 RQ50135-2
6 469720
7 470145
I want to use awk to count the occurrences of IDs beginning with 'RQ' in column 2.
So for the little snapshot, it should be 2. After the RQ the numbers differ, so I want a count of anything that begins with RQ.
I am using this code
awk -F '\t' '{if(match("^RQ$",$2))print}'|wc -l ID.txt > RQ.txt
But I don't get an output.
Tabs are used as field delimiters by default (same as spaces), so you can omit -F '\t'.
You can use
awk '$2 ~ /^RQ/{cnt++} END{print cnt}' ID.txt > RQ.txt
Whenever field 2 starts with RQ, increment cnt, and once the file is processed, print cnt.
You did
{if(match("^RQ$",$2))print}
but the compulsory arguments to the match function are (string, regexp), in that order. Also, do not use $ if you are interested in finding strings that start with something, as $ denotes the end. After fixing those issues the code would be
{if(match($2,"^RQ"))print}
Disclaimer: this answer solely describes fixing the problems with your current code; it does not contain any ways to otherwise improve it.
Also, apart from the reversed parameters for match, the input file ID.txt should come right after the closing single quote, not after wc -l.
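Putting both fixes together, the corrected pipeline (my restatement, not from the original answers) would be:
awk -F '\t' '{if(match($2,"^RQ"))print}' ID.txt | wc -l > RQ.txt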
As you want to print the whole line, you can omit the if statement and the print statement because match returns the index at which that substring begins, or 0 if there is no match.
awk 'match($2,"^RQ")' ID.txt | wc -l > RQ.txt

Combine grep output when piping

I use the following command sipcalc to display information about an IP:
sipcalc 192.16.12.1/16 | grep -E 'Network address|Network mask \(bits\)'
The output is:
Network address - 192.16.0.0
Network mask (bits) - 16
Is there a way to combine the above output (only the right part), so the output would be:
192.16.0.0/16
I have my own way to do this by making separate grep calls and then concatenating the results, but I don't think it is a good solution. Can grep, or any other command that can be piped to like awk, be used to obtain the output above?
grep is not really an ideal tool for doing operations beyond just searching for your expected text. Use awk alone!
awk '/Network address/{ ip = $NF } /Network mask \(bits\)/{ print ip "/" $NF}'
Awk processes records with /pattern/ { action } syntax. So when the first pattern is matched, extract the last space-delimited field, $NF, a special variable awk uses to store the value of the last column (see 7.5.1 Built-in Variables That Control awk), and save it in the ip variable.
When the second pattern is matched, in a similar way join the value stored in ip with the last field of that line. Writing the strings next to each other (ip "/" $NF) concatenates them to produce the desired result.
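For completeness, the full pipeline from the question would then be:
sipcalc 192.16.12.1/16 | awk '/Network address/{ ip = $NF } /Network mask \(bits\)/{ print ip "/" $NF }'
which should print 192.16.0.0/16.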

Logically impossible to fetch this particular string?

I have 3 strings which are random and look somewhat like this
1) ENTL.COMPENSATION REM REVERSE PAYMENT COUPON ON ISIN //IT0004889033 IN A TRIPARTY //TRANSACTION WITH 95724
2) 01P ISIN DE000A1H36U5 QTY 44527000, //C/P 19696
3) COUPON ISIN XS0820547742 QTY 466750,
Now what is expected is to fetch the values IT0004889033, DE000A1H36U5, or XS0820547742. If you observe the 3 strings, these expected values come right after ISIN, so we could take ISIN as a reference and fetch the values after it. But it seems that is not what is required: we should not fetch the value by taking some other value as a reference.
Since the expected value, e.g. IT0004889033, is a 12-character string, the information I have is: the first 2 characters are letters, the next 9 are alphanumeric, and the last one is a digit. Just with this information, is it possible to do a wildcard search or something and fetch this 12-character value?
I'm totally lost on this one logically.
You mentioned that ISIN should not be used as a reference. Therefore, the only thing known for sure is that the string to be found starts with 2 letters, followed by 9 letters and/or digits, and ends with a digit.
I saved your example text as tmp, and ran the following egrep command... seems to work for me:
jim@debian:~/tmp$ egrep -o "[a-zA-Z]{2}[a-zA-Z0-9]{9}[0-9]{1}" tmp
IT0004889033
DE000A1H36U5
XS0820547742
The above solution takes a fixed number of characters to filter the results, so only 12-character matches will be returned.
I hope this helps!
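One caveat (my addition, not part of the original answer): with -o, a 12-character match can also be pulled from inside a longer alphanumeric token; GNU grep's \b word boundaries guard against that:
egrep -o "\b[a-zA-Z]{2}[a-zA-Z0-9]{9}[0-9]\b" tmp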
Using grep -oP (\W+ matches the non-word characters after ISIN, and \K excludes everything matched so far from the output):
grep -oP 'ISIN\W+\K\w+' file
IT0004889033
DE000A1H36U5
XS0820547742
If grep -P isn't available then you can use awk:
awk -F '.*ISIN[^0-9a-zA-Z]*| ' '{print $2}' file
IT0004889033
DE000A1H36U5
XS0820547742
Here the field separator is a regular expression matching either everything up to and including ISIN plus any following non-alphanumeric characters, or a single space, so the ISIN value always lands in $2. Or else, with a POSIX character class:
awk -F '.*ISIN[^[:alnum:]]*| ' '{print $2}' file

Can I speed up an AWK program using the NR variable?

I am using awk to pull data from a file that is 30M+ records. I know within a few thousand records where the records I want are. I am curious whether I can cut down on the time it takes awk to find the records by giving it a starting point, setting NR. For example, my record is more than 25 million lines in, so I could use the following:
awk 'BEGIN{NR=25000000}{rest of my script}' in
Would this make awk skip straight to the 25-millionth record and save me the time of it scanning each record before that?
For a better example, I am using this AWK in a loop in sh. I need the normal output of the awk script, but I would also like it to pass along its NR when finished, so the next iteration can pick up there when the loop comes back to this script again.
awk -v n=$line -v r=$record 'BEGIN{a=1}$4==n{print $10;a=2}($4!=n&&a==2){(pass NR out to $record);exit}' in
Nope. Let's try it:
$ cat -n file
1 one
2 two
3 three
4 four
$ awk 'BEGIN {NR=2} {print NR, $0}' file
3 one
4 two
5 three
6 four
Are your records fixed length, or do you know the average line length? If yes, then you can use a language that allows you to open a file and seek to a position. Otherwise you have to read all those lines:
awk -v start=25000000 'NR < start {next} {your program here}' file
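Alternatively (my aside, not part of the original answer), letting tail do the line skipping is usually faster than making awk read and discard the lines itself:
tail -n +25000000 file | awk '{your program here}'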
To maintain your position between runs of the script, I'd use a language like perl: at the end of the run use tell() to output the current position, say to a file; then at the start of the next run, use seek() to pick up where you left off. Add a check that the starting position is less than the current file size, in case the file was truncated.
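A rough shell sketch of that save-and-resume idea (my illustration; the offset file pos and the input name in are invented for the example):
pos=$(cat pos 2>/dev/null || echo 0)   # byte offset saved by the previous run
size=$(wc -c < in)
if [ "$pos" -lt "$size" ]; then        # skip if the file shrank (was truncated)
    tail -c +"$((pos + 1))" in | awk -v n="$line" '$4 == n { print $10 }'
fi
echo "$size" > pos                     # we read to EOF, so the new offset is the file size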
One way (using sed), if you know the line numbers:
for n in 3 5 8 9 ....
do
sed -n "${n}p" file |awk command
done
or
sed -n "25000,30000p" file |awk command
Records generally have no fixed size, so there is no way for awk to avoid scanning the first part of the file, even just to skip it.
Should you want to skip the first part of the input file and you (roughly) know the size to ignore, you can use dd to trim the input, e.g. here assuming a record is 80 bytes wide (80 blocks of 25 MB = 25 million records of 80 bytes each):
dd if=inputfile bs=25MB skip=80 | awk ...
Finally, you can save awk from scanning the last records by exiting from the awk script once you have passed the end of the interesting zone.
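For instance, a sketch building on the loop from the question (assuming the $4 values are grouped, so all records for n are contiguous):
awk -v n="$line" '$4 == n { print $10; found = 1 } found && $4 != n { exit }' in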