awk help: find files using a combination, then sum a field inside those files for that combination - awk

I am spooling some data into a file like this
acb.txt 1 aa 3
gfh.txt 1 aa 3
a3g.txt 2 aa 4
tfh.txt 2 aa 4
The problem is:
Fetch the file names for a particular combination.
Open each file (we use an rds tool).
Check for the above combination in the file.
Sum the 32nd field inside that file for lines with the above combination.
Store the sum and the combination.
I am able to achieve this using grep and two loops, one over the combinations and the other over the files in each combination, but I am looking for better awk logic. I doubt grep will be foolproof.
Added more info:
Using awk, how can I store strings into an array like
a[1 aa 3]=acb.txt gfh.txt
a[2 aa 4]=a3g.txt tfh.txt
and then fetch that value from outside to open the files?

awk '{a[$2, $3, $4] = a[$2, $3, $4] " " $1}' inputfile
That will create your array. But it doesn't do anything useful.
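For illustration, one rough sketch of how that array could be printed from an END block and consumed by a shell loop that opens the files and sums the field (field 32 per your description; the colon-separated output format and the naive regex match on the combination are my assumptions):

awk '{ files[$2 FS $3 FS $4] = files[$2 FS $3 FS $4] " " $1 }
END {
    for (combo in files)
        print combo ":" files[combo]     # e.g. "1 aa 3: acb.txt gfh.txt"
}' inputfile |
while IFS=: read -r combo filelist; do
    # sum field 32 of every line in the listed files that contains the combination
    sum=$(awk -v c="$combo" '$0 ~ c { s += $32 } END { print s+0 }' $filelist)
    echo "$combo -> $sum"
done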
Otherwise, your question is still too vague. Show your grep and two loops and describe in more detail what it is you're actually trying to do.

Related

How to use awk to count the occurrence of a word beginning with something?

I have a file that looks like this:
FID IID
1 RQ50131-0
2 469314
3 469704
4 469712
5 RQ50135-2
6 469720
7 470145
I want to use awk to count the occurrences of IDs beginning with 'RQ' in column 2.
So for the little snapshot, it should be 2. After the RQ, the numbers differ so I want a count with anything that begins with RQ.
I am using this code
awk -F '\t' '{if(match("^RQ$",$2))print}'|wc -l ID.txt > RQ.txt
But I don't get an output.
Tabs are used as field delimiters by default (same as spaces), so you can omit -F '\t'.
You can use
awk '$2 ~ /^RQ/{cnt++} END{print cnt}' ID.txt > RQ.txt
When field 2 starts with RQ, increment cnt; once the file is processed, print cnt.
You did
{if(match("^RQ$",$2))print}
but the mandatory arguments to the match function are (string, regexp), in that order. Also, do not use $ if you are interested in finding strings that start with something, as $ denotes the end. After fixing those issues the code would be
{if(match($2,"^RQ"))print}
Disclaimer: this answer solely describes fixing the problems in your current code; it does not contain any ways to improve it further.
Also, apart from the reversed parameters for match, the file name ID.txt should come right after the closing single quote, not after wc -l.
As you want to print the whole line, you can omit the if statement and the print statement because match returns the index at which that substring begins, or 0 if there is no match.
awk 'match($2,"^RQ")' ID.txt | wc -l > RQ.txt

Add a field to the current record before processing in awk

I want to use a large awk script that was designed to take a particular input. For example "city zipcode street housenumber", so $2 is zipcode, etc...
Now the input is provided to me in a new format. In my example "city" is now missing. The new file is "zipcode street housenumber" (not for real, just trying to make the example simple)
but I happen to know that the city is a constant for that input (which is why it's not in the dataset). So if I run it through the original script $2 is now street, and everything is one field off.
I could first process the input file to prepend the city name to each line (using awk, sed, or whatever), then run it through the original script, but I would prefer to run only one script that supports both formats. I could add a command-line option that tells it the city, but I don't know how to insert it in front of the current record at the top of the script so that the rest of the script can be unchanged. It looks like I can change a field but what I want to do is "shift" the fields right so I can modify $1.
Did I mention I am a complete awk novice? (Perl is my poison.)
I think I fixed my own problem; I'm doing the following (I haven't figured out how to do this conditionally based on a command-line option yet, but it should be easy to find tutorials for that; see also the sketch below):
NF+=1;
for(i=NF; i>1; --i) $(i)=$(i-1);
$1="Vancouver";
I had the loop wrong in my comment above, but the basic idea of manipulating NF and copying fields into each other seems to work.
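For what it's worth, a rough sketch of how that could be made conditional on a command-line option (the variable name city and the three-field test are assumptions based on this example):

awk -v city=Vancouver '
city != "" && NF == 3 {             # only shift when a city was supplied and is missing
    for (i = NF + 1; i > 1; --i)    # shift every field one position to the right
        $i = $(i - 1)
    $1 = city                       # the city becomes the new first field
}
{ print }                           # ...rest of the original script goes here
' input.txt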
Something along these lines should do it. First, some test data (missing from the question):
$ cat file
1 2 3 4
2 3 4
The awk:
$ awk -v c=V '{      # define external var
    if(NF==3)        # if record has only three fields
        $0=c FS $0   # prepend the var to the record
    print $1         # print first field
}' file
Output:
1
V

awk command - define the size of a word

I'm learning AWK and trying to count the number of sessions to a particular destination.
Using this command:
awk '{print $9}' traffic-log-cust.txt | sort | uniq -c
and I am getting the below output.
#awk '{print $9}' traffic-log-cust.txt | sort | uniq -c
1
1 10.10.17.72/38403->157.55.235.140/40046
1 10.10.17.72/38403->157.55.235.146/40006
1 10.10.17.72/38403->157.55.235.148/40039
1 10.10.17.72/38403->157.55.235.159/40019
1 10.10.17.72/38403->157.55.235.160/40019
1 10.10.17.72/38403->157.55.56.156/40046
1 10.10.17.72/38403->157.55.56.174/40018
1 10.10.17.72/38403->64.4.23.156/40017
1 10.10.17.72/38403->64.4.23.164/40011
1 10.10.17.72/38403->64.4.23.166/40053
1 10.10.17.72/38403->65.55.223.16/40003
1 10.10.17.72/38403->65.55.223.44/40002
#
I believe field 9 has no spaces and contains the destination IP as well.
I would like to know how I can count the sessions based on destination IP's.
Thanks in Advance.
I am going to guess that you are having issues deciding how big each field is. (Your question is unclear.) I would argue you don't need to; just split each row into 2 fields and deal with the second field.
With awk, you specify what the delimiter is with the -F option, and since the greater-than sign (>) is meaningful in many shells, you have to escape it somehow. In Linux, you can use a backslash to do so.
Since you are using awk, you don't need sort and uniq; associative arrays can be used.
Assuming that you are NOT ignoring the ports:
awk -F\> '{ dest_ips[$2]++ }
END {
    for (ip in dest_ips) {
        printf "%s: %d\n", ip, dest_ips[ip]
    }
}' traffic-log-cust.txt
If you are ignoring the ports, you have to parse that second field first (perhaps using split()).
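A sketch of that variant, stripping the port with split() (this assumes the IP and port in the second >-delimited field are separated by a slash, as in your output above):

awk -F\> '{
    split($2, dst, "/")      # dst[1] is the IP, dst[2] starts with the port
    dest_ips[dst[1]]++       # count sessions per destination IP only
}
END {
    for (ip in dest_ips)
        printf "%s: %d\n", ip, dest_ips[ip]
}' traffic-log-cust.txt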

Using awk on a folder and adding file name to output rows

I should start by thanking you all for all the work you put into the answers on this site. I have spent many hours reading through them but have not found anything fitting my question yet. Hence my own post.
I have a folder with multiple subfolders and txt files within those. In column 7 of those files there are gene names (I do genetics for a living :)). These are the strings I am trying to extract. In short, I would like to search the whole folder for any rows within any of the files that contain a particular gene name/string. I have been using grep for this, writing something like:
grep -r GENE . > GENE.txt
Simple, but I need to be able to tweak the search further, and it seems that awk is the way to go for that.
So I tried using awk. I wrote something like this:
awk '$7 == "GENENAME"' FOLDER/* > GENENAME.txt
This works well (and now I can specify that the string has to be in a particular column, which I cannot do with grep, right?).
However, in contrast to grep, which writes the file name at the start of each row, I now cannot directly see which file each row in my output comes from (which mostly defeats the point of the search). Adding the name of the origin file somewhere to each row seems like something that should absolutely be doable, but I am not able to figure it out.
The files I am searching within change (or rather get more numerous), but otherwise my search will always be for some specific string in column 7 of the same big folder. How can I get this working?
Thank you in advance,
Elisabet E
You can use FNR (the record number within the current file) to print the row number and FILENAME to print the file's name; then you know which file and which row each matching line comes from. For instance:
sample.csv:
aaa 123
bbb 456
aaa 789
command:
awk '$1 =="aaa"{print $0, FNR, FILENAME}' sample.csv
The output is:
aaa 123 1 sample.csv
aaa 789 3 sample.csv
Sounds like you're looking for:
awk '$7 == "GENENAME"{print FILENAME, $0}' FOLDER/*
If not then edit your question to clarify with sample input and expected output.
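One caveat: FOLDER/* does not descend into the subfolders you mentioned. A sketch of a recursive variant using find (assuming plain whitespace-separated columns):

find FOLDER -type f -exec awk '$7 == "GENENAME"{print FILENAME, $0}' {} + > GENENAME.txt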

Can I speed up an AWK program using NR?

I am using awk to pull out data from a file that is 30M+ records. I know within a few thousand records where the records I want are. I am curious if I can cut down on the time it takes awk to find the records by giving it a starting point, setting NR. For example, if my record is more than 25 million lines in, I could use the following:
awk 'BEGIN{NR=25000000}{rest of my script}' in
Would this make awk skip straight to the 25-millionth record and save the time of scanning each record before that?
For a better example: I am using this awk in a loop in sh. I need the normal output of the awk script, but I would also like it to pass along the NR it finished at to the next iteration when the loop comes back to this script again.
awk -v n=$line -v r=$record 'BEGIN{a=1}$4==n{print $10;a=2}($4!=n&&a==2){(pass NR out to $record);exit}' in
Nope. Let's try it:
$ cat -n file
1 one
2 two
3 three
4 four
$ awk 'BEGIN {NR=2} {print NR, $0}' file
3 one
4 two
5 three
6 four
Are your records fixed length, or do you know the average line length? If yes, then you can use a language that allows you to open a file and seek to a position. Otherwise you have to read all those lines:
awk -v start=25000000 'NR < start {next} {your program here}' file
To maintain your position between runs of the script, I'd use a language like perl: at the end of the run use tell() to output the current position, say to a file; then at the start of the next run, use seek() to pick up where you left off. Add a check that the starting position is less than the current file size, in case the file was truncated.
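If you would rather stay in sh and awk than switch to perl, a rough sketch of the same idea: track the byte offset inside awk, save it on exit, and resume with tail -c on the next run. It assumes single-byte data (hence LC_ALL=C) and Unix line endings; pos.txt stands in for the $record hand-off in the question, and n and the $4/$10 fields are taken from it:

start=$(cat pos.txt 2>/dev/null || echo 0)
tail -c +"$((start + 1))" in |
LC_ALL=C awk -v n="$line" -v base="$start" '
    { cur = base + bytes; bytes += length($0) + 1 }   # byte offset where this record starts
    $4 == n { print $10; seen = 1 }                   # the records we want
    seen && $4 != n {                                 # just past them: save position and quit
        print cur > "pos.txt"
        exit
    }'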
One way (using sed), if you know the line numbers:
for n in 3 5 8 9 ....
do
    sed -n "${n}p" file | awk command
done
or
sed -n "25000,30000p" file |awk command
Records generally have no fixed size, so awk has no choice but to scan the first part of the file even just to skip those records.
Should you want to skip the first part of the input file and you (roughly) know the size to ignore, you can use dd to truncate the input, e.g. here assuming a record is 80 bytes wide, so 25 million records come to about 2 GB, which is exactly what 80 blocks of 25 MB skip (note that the first line awk sees may then be a partial record):
dd if=inputfile bs=25MB skip=80 | awk ...
Finally, you can keep awk from scanning the records after the interesting zone by exiting from the awk script as soon as you have passed it.
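A minimal sketch of that early exit, combined with the skip shown above (the boundaries and fields are placeholders from the question):

awk -v n="$line" '
    NR < 25000000 { next }    # skip (but still read) the leading records
    NR > 25005000 { exit }    # quit as soon as we are past the zone of interest
    $4 == n { print $10 }     # process only the records of interest
' in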