awk - store first occurrence based on cell

I have a file (around 10k entries) with following format:
text1;text2;text3;lat;long
A;B;C;55.01;12.01
A;B;C;n/a;n/a
D;E;F;56.011;13.099
D;E;F;56.01;13.01
D;E;F;n/a;n/a
I;B;C;n/a;n/a
It contains duplicates of the key fields: some rows without coordinates (n/a), others with mildly contradicting LAT;LONG coordinates.
I only want to keep the first record [$1;$2;$3;$4;$5] for each unique combination of the first three fields, so the desired output should look like:
text1;text2;text3;lat;long
A;B;C;55.01;12.01
D;E;F;56.011;13.099
I;B;C;n/a;n/a
I'd assume that I want to create an array, but I struggle with the proper formatting of it... so any help is appreciated!

I'm glad you have it working, but personally, I would suggest something a little more along the lines of:
$ awk -F";" '!seen[$1,$2,$3] {print; seen[$1,$2,$3]=1}' file
Example Use/Output
With your data in file, you could then do:
$ awk -F";" '!seen[$1,$2,$3] {print; seen[$1,$2,$3]=1}' file
text1;text2;text3;lat;long
A;B;C;55.01;12.01
D;E;F;56.011;13.099
I;B;C;n/a;n/a
You can shorten it to roughly your example (it simply checks whether the index built from the first three fields combined has been seen yet, and relies on awk's default print action to output the first record for each unique combination):
$ awk -F";" '!seen[$1,$2,$3]++' file
However, using the joined fields $1,$2,$3 as the index is about the only way you can ensure uniqueness.
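For context, the comma in seen[$1,$2,$3] joins the fields with the SUBSEP character (0x1c by default), so distinct field combinations cannot collapse into one key the way a plain concatenation like seen[$1 $2 $3] could. A tiny sketch (output order of a for (in) loop may vary):
$ awk 'BEGIN { a["A","BC"] = 1; a["AB","C"] = 1
               for (k in a) { gsub(SUBSEP, "|", k); print k } }'
A|BC
AB|C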
If you say yours works, then it is certainly shorter. Let me know if you have further questions.

Found it by stopping looking for how to create arrays.
I created a new $1 made of $1,$2,$3, but the other solution is indeed more elegant. Here is the command I came up with after merging the fields in the file (and setting them as the new $1), which, as it turns out, I didn't have to do:
awk -F';' '!seen[($1)]++' file1.csv > file2.csv
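For reference, the merge-first preprocessing described above might look roughly like this (tmp.csv is just an illustrative intermediate file name, and as noted the step isn't actually necessary):
$ awk -F';' -v OFS=';' '{print $1","$2","$3, $4, $5}' file1.csv > tmp.csv   # build a combined key column
$ awk -F';' '!seen[$1]++' tmp.csv > file2.csv                               # then de-duplicate on that key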


Finding sequence in data

I want to use awk to find sequences matching a pattern in DNA data, but I cannot figure out how to do it. I have a text file "test.txt" which contains a lot of data, and I want to be able to match any sequence that starts with ATG and ends with TAA, TGA or TAG, and print them.
For instance, my text file has data that looks like the sample below; I want to find and match all the existing sequences and output them as shown.
AGACGCCGGAAGGTCCGAACATCGGCCTTATTTCGTCGCTCTCTTGCTTTGCTCGAATAAACGAGTTTGGCTTTATCGAATCTCCGTACCGTAAGGTCGAAAACGGCCGGGTCATTGAGTACGTGAAAGTACAAAATGG
GTCCGCGAATTTTTCGGTTCGTCTCAGCTTTCGCAGTTTATGGATCAGACGAACCCGCTCTCTGAAATTACTCATAAACGCAGGCTCTCGGCGCTCGGGCCCGGCGGACTCTCGCGGGAGCGTGCAGGTTTCGAAGTTC
GGATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAGGCGAACTGCTCGAAAATCAATTCCGAATCGGGCTTGAGCGAATGGAGCGGGCCATCAAGGAAAAAATGTCTATCCAGCAGGATATGCAAACGACG
AAAGTATGTTTTTCGATCCGCGCCGATTCGACCTCTCAAGAGTCGGAAGGCTTAAATTCAATATCAAAATGGGACGCCCCGAGCGCGACCGTATAGACGATCCGCTGCTTGCGCCGATGGATTTCATCGACGTTGTGAA
ATGAGACCGGGCGATCCGCCGACTGTGCCAACCGCCTACCGGCTTCTGG
Print out matches:
ATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAG
ATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAG
ATGTTTTTCGATCCGCGCCGATTCGACCTCTCAAGAGTCGGAAGGCTTAA
I tried something like this, but it only displays the rows that start with ATG; it doesn't actually solve my problem:
awk '/^ATG/{print $0}' test.txt
Assuming the records do not span multiple lines:
$ grep -oP 'ATG.*?T(AA|AG|GA)' file
ATGGATCAGACGAACCCGCTCTCTGA
ATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAG
ATGTTTTTCGATCCGCGCCGATTCGACCTCTCAAGAGTCGGAAGGCTTAA
ATGGGACGCCCCGAGCGCGACCGTATAG
ATGGATTTCATCGACGTTGTGA
Non-greedy match, which requires the -P (PCRE) switch, so that each match ends at the first stop codon rather than the longest possible one.
Could you please try the following.
awk 'match($0,/ATG.*TAA|ATG.*TGA|ATG.*TAG/){print substr($0,RSTART,RLENGTH)}' Input_file
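Note that the match() one-liner above prints at most one (greedy) match per line. If you need every non-overlapping shortest match per line in awk itself, a POSIX-awk sketch of the same idea used by the grep solution could look like this:
awk '{
  s = $0
  while ((i = index(s, "ATG")) > 0) {            # position of the next start codon
    tail = substr(s, i + 3)                      # everything after that ATG
    if (match(tail, /T(AA|GA|AG)/)) {            # nearest stop codon after it
      print substr(s, i, RSTART + 5)             # ATG + middle + 3-base stop codon
      s = substr(s, i + RSTART + 5)              # continue scanning after this match
    } else
      break                                      # no stop codon left on this line
  }
}' test.txt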

awk - How to extract quoted string in space delimited log file

I'm hoping there might be some simple way to do this, as I'm a total novice using awk.
I have a bunch of log files from an AWS load balancer, and I want to extract entries from these logs, where a particular response code was received.
Checking the response code is easy enough, I can do the following...
$9=="403" {print $0}
However, what I really want is just the request itself, $13. This column is quoted, though, and will contain spaces. It looks like so...
"GET https://[my domain name]:443/[my path] HTTP/2.0"
If I do the following...
$9=="403" {print $13}
I just get...
"GET
So what I think I need to do is for awk (or some other appropriate utility) to extract the complete column 13, and then be able to break that down into its individual fields, for method, URL etc.
Could you please try the following. Inside the regex of match I have used 443, as per your sample; to look for 403 as per your actual need, change it to match($0,/\".*403.*\"/).
awk 'match($0,/\".*443.*\"/){print substr($0,RSTART,RLENGTH)}' Input_file
IMHO the advantage of this approach is that you do NOT need to hard-code any field number in your awk. One more thing: I have assumed that your Input_file has a "......403....." kind of section only once and that you want to print only that.
One more additional awk, where I am assuming you may have multiple occurrences of "..." and we pick only the one where 403|443 appears:
awk 'match($0,/\".*443[^"]*/){print substr($0,RSTART,RLENGTH+1)}' Input_file
EDIT: Or, if your Input_file has "...443..." only once, or this text comes first after the start of the line (assuming any other occurrences of ".." come later), then you could try the following:
awk -F'"' '/443/{print $2}' Input_file
Newer versions of gawk have a built-in variable FPAT, which you can use to define fields by a regex pattern. For your logs, provided there are no other quoted fields before fields 9 and 13:
awk -v FPAT='[^[:space:]]+|"[^"]*"' '$9 == "403"{print $13}' log_file
REF: https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html
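If you then want to break that quoted request down into its parts, a sketch building on the FPAT answer (it assumes the request field always has the three space-separated parts shown in the question: method, URL and protocol):
awk -v FPAT='[^[:space:]]+|"[^"]*"' '$9 == "403" {
    req = $13
    gsub(/^"|"$/, "", req)          # strip the surrounding double quotes
    split(req, r, " ")              # r[1]=method, r[2]=URL, r[3]=protocol
    print r[1], r[2], r[3]
}' log_file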

Does setting $column_A="" delete $column_A in awk?

I wish to delete one column of my data in awk, but what I found is a command like $column_A="". Is column_A really deleted this way?
For example, I wish to delete the second column, and I found a solution: awk 'BEGIN{FS="\t";OFS="\t"}!($2="")', which prints a result like: $1^0^0$3. It seems that only the content of the second column is deleted, not the column itself.
After reading dev-null's comment, I got an idea of what you are asking...
My answer is: it depends on how you define "a column is deleted".
see this example:
kent$ echo "foo,bar,blah"|awk -F, -v OFS="," '{$2="";print}'
foo,,blah
kent$ echo "foo,bar,blah"|awk -F, -v OFS="," '{print $1,$3}'
foo,blah
You see the difference? If you set $x="", the column is still there, but it becomes an empty string, so the field separators before and after it stay. If this is what you wanted, it is fine. Otherwise, just skip outputting the target column, as the 2nd example shows.
I would use cut for that:
cut -d$'\t' -f1,3- file
-f1,3- selects the first field, skips field 2 and then selects fields 3 to end.
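If you want awk itself to drop the column entirely, separators included, a generic sketch is below (assuming gawk or mawk, where decrementing NF rebuilds the record; del is the column number to remove):
awk -F'\t' -v OFS='\t' -v del=2 '{
    for (i = del; i < NF; i++) $i = $(i + 1)   # shift every later field one position left
    NF--                                       # drop the now-duplicated last field (gawk/mawk behavior)
    print
}' file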

Weird AWK behavior while forcing expression to be a number (adding 0)

I noticed an odd behavior while populating an array in awk. The indices and values were both numbers, so adding 0 shouldn't have had any impact. For the sake of understanding, let's take the following example:
Here is a file that I wish to use for this demo:
$ cat file
2.60E5-2670161065730303122012098 Invnum987678
2.60E5-2670161065846403042011098 Invnum987912
2.60E5-2670161065916903012012075 Invnum987654
2.60E5-2670161066813503042011075 Invnum987322
2.60E5-2670161066835008092012075 Invnum987323
2.60E5-2670161067040701122012075 Invnum987324
2.60E5-2670161067106602122010074 Invnum987325
What I would like to do is create an index from $1 and assign it a value from $2. I will extract the relevant pieces of $1 and $2 using the substr function.
$ awk '{p=substr($1,12)+0; A[p]=substr($2,7)+0;next}END{for(x in A) print x,A[x]}' file
Now, ideally the output should have been as follows (ignore the fact that associative arrays may be output in random order):
161065730303122012098 987678
161065846403042011098 987912
161065916903012012075 987654
161066813503042011075 987322
161066835008092012075 987323
161067040701122012075 987324
161067106602122010074 987325
But, the output I got was as follows:
161066835008092012544 987323
161065846403042017280 987912
161067040701122019328 987324
161067106602122018816 987325
161066813503041994752 987322
161065916903012007936 987654
161065730303122014208 987678
If I remove the +0 from the above awk one-liner, the output is what I expect. What I would like to know is: why would adding 0 corrupt the keys?
The above test was done on:
$ awk -version
awk version 20070501
It appears that AWK has some numerical limitations (I get even weirder results on gawk); perhaps the discussion in this SO question will help you. The underlying issue is that adding 0 forces awk to convert the 21-digit substring into a double-precision floating-point number, which carries only about 15-17 significant decimal digits, so the low-order digits are rounded away. Without the +0, substr() returns a string, which is used verbatim as the array key.
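A quick way to see the rounding in action (a sketch; the result shown is the corrupted key from the asker's own output above):
$ awk 'BEGIN { printf "%.0f\n", 161065730303122012098 + 0 }'
161065730303122014208
The literal is converted to the nearest representable double, and that rounded value is what ends up being used as the array subscript.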

scripting in awk

I have a text file with contents as below:
1,A,100
2,A,200
3,B,150
4,B,100
5,B,250
I need the output as:
A,300
B,500
The logic here is to sum all the 3rd fields whose 2nd field is A, and to do the same for B.
How could we do it using awk?
You can do it using a hash (associative array), as follows:
awk -F"," '{cnt[$2]+=$3}END{for (x in cnt){printf "%s,%d\n",x,cnt[x]}}' file
Well, I'm not up for writing and debugging the code for you. However, the elements you need are:
You can use FS="," to change the field separator to a comma.
The fields you care about are obviously the second ($2) and third ($3) fields.
You can create your own variables to accumulate the values into.
I'd suggest an associative array variable, indexed by field two.
$ awk -F"," '{_[$2]+=$3}END{for(i in _)print i,_[i]}' OFS="," file
A,300
B,500
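For readability, the same program written out with explicit names and comments (a sketch; the output order of for (in) is not guaranteed, though it happens to match here):
awk -F',' -v OFS=',' '
    { total[$2] += $3 }                              # accumulate field 3 per field-2 key
    END { for (key in total) print key, total[key] } # emit one total per key
' file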