compare multiple columns and only replace if matching - awk

I have two files (File 1 and File 2)
I am trying to compare the strings in columns 1 and 2 of FILE1 with columns 4 and 5 of FILE2. On top of that match, column 6 of FILE2 also needs to match a certain string, SO or CO (because columns 3 and 4 of FILE1 hold the SO and CO values respectively); column 7 of FILE2 should then be replaced with the corresponding SO or CO value from FILE1, and all other lines kept unchanged.
I tried to modify and use a solution provided in the forum for a similar problem, but it did not work.
FILE1
type code SO CO other
7757 1 6941.958 138.922 149.17
7757 2 8666.123 198.908 225.67
7757 4 2795.885 334.875 378.68
7759 GT3 222.104 13.5 734.62
7768 CT2 0 0 0
7805 6 3796.677 75.175 79.09
FILE2
"US","01073",,"7757","1","SO","10","299"
"US","01073",,"7758","1","SO","10","299"
"US","01073",,"7757","1","NO","10","299"
"US","01073",,"7757","1","CO","10","299"
"US","01073",,"7757","4","MO","10","299"
"US","01073",,"7757","1","GO","10","299"
"US","01073",,"7805","6","CO","10","299"
Required output:
"US","01073",,"7757","1","SO","6941.958","299"
"US","01073",,"7758","1","SO","10","299"
"US","01073",,"7757","1","NO","10","299"
"US","01073",,"7757","1","CO","138.922","299"
"US","01073",,"7757","4","MO","10","299"
"US","01073",,"7757","1","GO","10","299"
"US","01073",,"7805","6","CO","75.175","299"
Solution I tried (for CO only):
tr -d '"' < FILE2 > temp # remove the double quotes
awk 'NR==FNR{A[$1,$2]=$3;next} A[$4,$5] && $6=="CO" {$7=A[$1,$2]; print}' FS=" " OFS="," FILE1 temp > out

Complex awk solution:
awk 'function unquote(f){
    return substr(f, 2, length(f)-2)
}
NR==FNR{
    if (NR==1){ f3=$3; f4=$4 }
    else if (NF){ a[$1,$2,f3]=$3; a[$1,$2,f4]=$4 }
    next
}
{ k=unquote($4) SUBSEP unquote($5) SUBSEP unquote($6) }
k in a{ $7=a[k] }1' file1 FS=',' OFS=',' file2
function unquote(f) { ... } - unquotes/extracts value between double quotes (in fact - between the 1st and last characters of the string)
a[$1,$2,f3]=$3; a[$1,$2,f4]=$4 - stores each FILE1 value under a key built from columns 1 and 2 plus the matching header name (SO or CO)
The output:
"US","01073",,"7757","1","SO",6941.958,"299"
"US","01073",,"7758","1","SO","10","299"
"US","01073",,"7757","1","NO","10","299"
"US","01073",,"7757","1","CO",138.922,"299"
"US","01073",,"7757","4","MO","10","299"
"US","01073",,"7757","1","GO","10","299"
"US","01073",,"7805","6","CO",75.175,"299"

Compare multiple columns from one file with multiple columns of another file using awk?

I want to compare the first 2 characters of col1 of file1 with col1 of file2, when col3 of file1 is the same as col3 of file2, provided col4 in file2 equals TRUE. I tried something:
awk -F'|' 'BEGIN{OFS=FS};(NR==FNR)
{a[substr($1,1,2),$3]=$1;next}(($1,$3)in a) && $4==TRUE ' file1 file2 > outfield
file 1
AE1267453617238|BIDKFXXXX|United Arab Emirates|
PL76UTYVJDYGHU9|ABSFXXJBW|Poland|
GB76UTRTSCLSKJ|FVDGXXXUY|Russia|
file 2
AE|^AE[0-9]{2}[0-9]{24}|United Arab Emirates|TRUE|
PL|^PL[0-9]{2}[A-Z]{10}[0-9]{4}|Poland|FALSE|
GB|^GB[0-9]{2}[A-Z]{5}[0-9]{3}|Europe|TRUE
expected output :-
AE1267453617238|BIDKFXXXX|United Arab Emirates|
You can simply cascade the multiple conditions with && as below. Remember your expected output comes from the first file, so you need to process the second file first:
awk -F'|' 'FNR == NR {
    if ( $4 == "TRUE" ) m[$1] = $3; next
}
{ k = substr($1,1,2) }
k in m && m[k] == $3' file2 file1
The part m[$1] = $3 builds a hash-map from $1 to $3 while reading the second file; it is then used in the first file to compare against only the first two characters of $1, i.e. substr($1,1,2). To avoid calling substr(..) redundantly, its value is extracted into a variable k and reused.
If the matches must be on the same line number in each file:
awk -F \| '
    FNR==NR && $4 == "TRUE" {a[NR,1]=$1; a[NR,3]=$3}
    FNR!=NR && $3 == a[FNR,3] &&
        $1 ~ "^"a[FNR,1]' file2 file1
If the matches can be on any line (every line of file1 is checked against every line of file2, duplicate matches aren't printed):
awk -F \| '
    FNR==NR {++l}
    FNR==NR && $4 == "TRUE" {a[NR,1]=$1; a[NR,3]=$3}
    FNR!=NR {
        for (i=1; i<=l; ++i) {
            if ($3 == a[i,3] && $1 ~ "^"a[i,1])
                c[$0]=0
        }
    }
    END {
        for (i in c)
            print i
    }' file2 file1
Note the order the files are given in: file2 (which contains TRUE and FALSE) goes first. I also used a regex instead of substr, so the characters should be alphanumeric only; if they're not, go back to substr.
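Going back to substr in, say, the same-line variant would look like this (a sketch; the literal string comparison is safe with regex metacharacters):
awk -F \| '
    FNR==NR && $4 == "TRUE" {a[NR,1]=$1; a[NR,3]=$3}
    FNR!=NR && $3 == a[FNR,3] &&
        substr($1, 1, length(a[FNR,1])) == a[FNR,1]' file2 file1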
Regarding your code:
awk -F'|' 'BEGIN{OFS=FS};(NR==FNR)
{a[substr($1,1,2),$3]=$1;next}(($1,$3)in a) && $4==TRUE ' file1 file2 > outfield
newlines matter to awk. This:
NR==FNR
{ print }
is not the same as this:
NR==FNR { print }
The first one is actually the same as:
NR==FNR { print }
1 { print }
Also, when you want to output the contents of a file (file1 in your case), it's usually better to read the OTHER file into memory and then compare the values from the target file against that, so you can just print matches as you go. So you should be doing awk 'script' file2 file1, not awk 'script' file1 file2, and writing a script based on that.
Try this:
$ cat tst.awk
BEGIN { FS="|" }
NR==FNR {
    if ( $4 == "TRUE" ) {
        map[$1] = $3
    }
    next
}
{ key = substr($1,1,2) }
(key in map) && (map[key] == $3)
$ awk -f tst.awk file2 file1
AE1267453617238|BIDKFXXXX|United Arab Emirates|
awk -F\| '
    NR==FNR{
        a[$3,1]=$1;
        a[$3,4]=$4;
        next
    }
    substr($1,1,2) == a[$3,1] && a[$3,4] == "TRUE" { print }
' file2.txt file1.txt
AE1267453617238|BIDKFXXXX|United Arab Emirates|

AWK: add a sequential 4-digit number

Starting from the following file string.ext:
>Lipoprotein releasing system transmembrane protein LolC
MKWLWFAYQNVIRNRRRSLMTILIIAVGTAAILLSNGFALYTYDNLREGSALASGHVIIAHVDHFDKEEEIPMEYGLSDYEDIERHIAADDRVRMAIPRLQFSGLISNGDKSVIFMGTGVDPEGEFDIGGVLTNVLTGNTLSTHSAPDAVPEVMLAKDLAKQLHADIGGLLTLLATTADGALNALDVQVRGIFSTGVPEMDKRMLAVALPTAQELIMTDKVGTLSVYLHEIEQTDAMWAVLAEWYPNFATQPWWEQASFYFKVRALYDIIFGVMGVIILLIVFFTITNTLSMTIVERTRETGTLLALGTLPRQIMRNFALEALLIGLAGALLGMLIAGFTSITLFIAEIQMPPPPGSTEGYPLYIYFSPWLYGITSLLVVTLSIAAAFLTSRKAARKPIVEALAHV
>Phosphoserine phosphatase (EC 3.1.3.3)
MFQEHALTLAIFDLDNTLLAGDSDFLWGVFLVERGIVDGDEFERENERFYRAYQEGDLDIFEFLRFAFRPLRDNRLEDLKRWRQDFLREKIEPAILPMACELVEHHRAAGDTLLIITSTNEFVTAPIAEQLGIPNLIATVPEQLHGCYTGEAAGTPAFQAGKVKRLLDWLEETSTELAGSTFYSDSHNDIPLLEWVDHPVATDPDDRLRGYARDRGWPIISLREEIAP
how do I add a sequential 4-digit number (starting with 0001) after string, separated from it with |, so that output is returned like:
>string|0001|Lipoprotein_releasing_system_transmembrane_protein_LolC
MKWLWFAYQNVIRNRRRSLMTILIIAVGTAAILLSNGFALYTYDNLREGSALASGHVIIAHVDHFDKEEEIPMEYGLSDYEDIERHIAADDRVRMAIPRLQFSGLISNGDKSVIFMGTGVDPEGEFDIGGVLTNVLTGNTLSTHSAPDAVPEVMLAKDLAKQLHADIGGLLTLLATTADGALNALDVQVRGIFSTGVPEMDKRMLAVALPTAQELIMTDKVGTLSVYLHEIEQTDAMWAVLAEWYPNFATQPWWEQASFYFKVRALYDIIFGVMGVIILLIVFFTITNTLSMTIVERTRETGTLLALGTLPRQIMRNFALEALLIGLAGALLGMLIAGFTSITLFIAEIQMPPPPGSTEGYPLYIYFSPWLYGITSLLVVTLSIAAAFLTSRKAARKPIVEALAHV
>string|0002|Phosphoserine_phosphatase_(EC_3_1_3_3)
MFQEHALTLAIFDLDNTLLAGDSDFLWGVFLVERGIVDGDEFERENERFYRAYQEGDLDIFEFLRFAFRPLRDNRLEDLKRWRQDFLREKIEPAILPMACELVEHHRAAGDTLLIITSTNEFVTAPIAEQLGIPNLIATVPEQLHGCYTGEAAGTPAFQAGKVKRLLDWLEETSTELAGSTFYSDSHNDIPLLEWVDHPVATDPDDRLRGYARDRGWPIISLREEIAP
The commands I came up with so far are as follows ($faa refers to the filename string.ext):
faa=$1
var=$(basename "$faa" .ext)
awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' $faa >$faa.tmp
sed 's/ /_/g' $faa.tmp >$faa.tmp2
awk -v var="$var" '/>/{sub(">","&"var"|");sub(/\.ext/,x)}1' $faa.tmp2 >$faa.tmp3
awk '/>/{sub(/\|/,++i"|")}1' $faa.tmp3 >$faa.tmp4
tr '\.' '_' <$faa.tmp4 | tr '\:' '_' | sed 's/__/_/g' >$faa.tmp5
Edit: I also want to change each of the following characters to 1 underscore: / . :
I'd use perl here:
perl -pe '
next unless /^>/; # only transform the "header" lines
s/[\h.]/_/g; # change dots and horizontal whitespace
substr($_,1,0) = sprintf("string|%04d|", ++$n) # insert the counter
' file
$ awk '
    FNR==1 { base=FILENAME; sub(/\.[^.]+$/,"",base) }
    sub(/^>/,"") { gsub(/[\/ .:]+/,"_"); $0=sprintf(">%s|%04d|%s",base,++c,$0) }
1' string.ext
>string|0001|Lipoprotein_releasing_system_transmembrane_protein_LolC
MKWLWFAYQNVIRNRRRSLMTILIIAVGTAAILLSNGFALYTYDNLREGSALASGHVIIAHVDHFDKEEEIPMEYGLSDYEDIERHIAADDRVRMAIPRLQFSGLISNGDKSVIFMGTGVDPEGEFDIGGVLTNVLTGNTLSTHSAPDAVPEVMLAKDLAKQLHADIGGLLTLLATTADGALNALDVQVRGIFSTGVPEMDKRMLAVALPTAQELIMTDKVGTLSVYLHEIEQTDAMWAVLAEWYPNFATQPWWEQASFYFKVRALYDIIFGVMGVIILLIVFFTITNTLSMTIVERTRETGTLLALGTLPRQIMRNFALEALLIGLAGALLGMLIAGFTSITLFIAEIQMPPPPGSTEGYPLYIYFSPWLYGITSLLVVTLSIAAAFLTSRKAARKPIVEALAHV
>string|0002|Phosphoserine_phosphatase_(EC_3_1_3_3)
MFQEHALTLAIFDLDNTLLAGDSDFLWGVFLVERGIVDGDEFERENERFYRAYQEGDLDIFEFLRFAFRPLRDNRLEDLKRWRQDFLREKIEPAILPMACELVEHHRAAGDTLLIITSTNEFVTAPIAEQLGIPNLIATVPEQLHGCYTGEAAGTPAFQAGKVKRLLDWLEETSTELAGSTFYSDSHNDIPLLEWVDHPVATDPDDRLRGYARDRGWPIISLREEIAP
I'm assuming from your posted sample and code that you actually want every contiguous sequence of any combination of spaces, periods, forward slashes and/or colons converted to a single underscore.
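Because the prefix is derived from FILENAME, the script generalizes beyond the literal string: saved as, say, renumber.awk (a hypothetical name), it would prefix headers in proteins.ext with >proteins|0001| and so on:
$ awk -f renumber.awk proteins.ext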
In awk.
$ awk '/^>/{n=sprintf("%04d",++i);sub(/^>/,">string|" n "|")}1' file
>string|0001|Lipoprotein releasing system transmembrane protein LolC
MKWLWFAYQNVIRNRRRSLMTILIIAVGTAAILLSNGFALYTYDNLREGSALASGHVIIAHVDHFDKEEEIPMEYGLSDYEDIERHIAADDRVRMAIPRLQFSGLISNGDKSVIFMGTGVDPEGEFDIGGVLTNVLTGNTLSTHSAPDAVPEVMLAKDLAKQLHADIGGLLTLLATTADGALNALDVQVRGIFSTGVPEMDKRMLAVALPTAQELIMTDKVGTLSVYLHEIEQTDAMWAVLAEWYPNFATQPWWEQASFYFKVRALYDIIFGVMGVIILLIVFFTITNTLSMTIVERTRETGTLLALGTLPRQIMRNFALEALLIGLAGALLGMLIAGFTSITLFIAEIQMPPPPGSTEGYPLYIYFSPWLYGITSLLVVTLSIAAAFLTSRKAARKPIVEALAHV
>string|0002|Phosphoserine phosphatase (EC 3.1.3.3)
MFQEHALTLAIFDLDNTLLAGDSDFLWGVFLVERGIVDGDEFERENERFYRAYQEGDLDIFEFLRFAFRPLRDNRLEDLKRWRQDFLREKIEPAILPMACELVEHHRAAGDTLLIITSTNEFVTAPIAEQLGIPNLIATVPEQLHGCYTGEAAGTPAFQAGKVKRLLDWLEETSTELAGSTFYSDSHNDIPLLEWVDHPVATDPDDRLRGYARDRGWPIISLREEIAP
Explained:
$ awk '
/^>/ {                           # if the line starts with >
    n=sprintf("%04d",++i)        # increment i from 1 and zero-pad to 4 digits
    sub(/^>/,">string|" n "|")   # replace the > with the prefix
}1' file                         # implicit output
Don't include & in the replacement string, as sub() would expand it to the matched text (see comments).
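To also apply the underscore conversion requested in the edit, the header can be gsub-ed before the prefix is added; a sketch along the same lines, assuming the space/period/slash/colon character set discussed above:
$ awk '/^>/{gsub(/[\/ .:]+/,"_"); n=sprintf("%04d",++i); sub(/^>/,">string|" n "|")}1' file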
awk -F'[ .]' 'BEGIN{a=1;OFS="_"}/^>/{$1=sprintf(">string|%04d|%s",a,substr($1,2));++a;print $0;next}{print $0}' filename

awk to search field in file2 using range of file1

I am trying to use awk to find all the $2 values in file2 (~30MB) that fall between $2 and $3 of file1 (~2GB). If a value in $2 of file2 is between the file1 fields, it is printed along with the $6 value of file1. Both file1 and file2 are tab-delimited, as is the desired output. If there is nothing to print, the next line is processed. The awk below runs but is very slow (it has been processing for about a day and is still not done). Is there a better way to approach this, or a better programming language?
Additionally, $1 must match between the files: a line of file2 is printed in the output only if its $1 matches $1 of a file1 line and its $2 falls within that line's $2 to $3 range.
Thank you :).
file1 (~3MB)
1 948953 948956 chr1:948953-948956 . ISG15
1 949363 949858 chr1:949363-949858 . ISG15
2 800000 900500 chr1:800000-900500 . AGRN
file2 (~80MB)
1 12214 . C G
1 949800 . T G
2 900000 rs123 - A
3 900000 . C -
desired output tab-delimited
1 949800 . T G ISG15
2 900000 rs123 - A AGRN
awk
awk -F'\t' -v OFS='\t' '
NR == FNR {min[NR]=$2; max[NR]=$3; Gene[NR]=$NF; next}
{
for (id in min)
if (min[id] < $2 && $2 < max[id]) {
print $0, id, Gene[id]
break
}
}
' file1 file2
This would be faster than what you have since it only loops through the file1 contents that have the same $1 value as in file2 and stops searching after it finds a range that matches:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR {
c = ++num[$1]
beg[$1][c] = $2
end[$1][c] = $3
val[$1][c] = $NF
next
}
$1 in val {
for (c=1; c<=num[$1]; c++) {
if ( (beg[$1][c] <= $2) && ($2 <= end[$1][c]) ) {
print $0, val[$1][c]
break
}
}
}
$ awk -f tst.awk file1 file2
1 949800 . T G ISG15
2 900000 rs123 - A AGRN
Unfortunately, for unsorted input like yours there aren't many options for making it faster. If the ranges in file1 can overlap each other, then remove the break.
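Note that beg[$1][c] and friends use GNU awk's true multidimensional arrays (gawk 4.0+). With a POSIX awk the same idea works with flat arrays keyed on $1 SUBSEP c; a sketch:
BEGIN { FS=OFS="\t" }
NR==FNR {
    c = ++num[$1]
    beg[$1,c] = $2
    end[$1,c] = $3
    val[$1,c] = $NF
    next
}
$1 in num {
    for (c=1; c<=num[$1]; c++) {
        if ( (beg[$1,c] <= $2) && ($2 <= end[$1,c]) ) {
            print $0, val[$1,c]
            break
        }
    }
}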
Could you please try the following and let me know if it helps.
awk '
    FNR==NR{ A[$1]=$0; B[$1,++D[$1]]=$2; next }
    { ++C[$1] }
    ($2<B[$1,C[$1]] && $3>B[$1,C[$1]]){ print A[$1] }
' Input_file2 Input_file1
This reads the files one by one: first Input_file2, then Input_file1.
If performance is the issue, you have to sort both files by the value (and the range start). With the files sorted, your scans can be incremental (and consequently much faster). Here is an untested script:
$ awk '{ line=$0; k=$2
         getline < "file1"
         while (k >= $2) getline < "file1"
         if (k <= $3) print line, $NF }' file2
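The sorting this relies on might look like the following (a hypothetical preprocessing step; adjust the keys to your actual columns and feed the sorted copies to the script):
sort -k1,1 -k2,2n file1 > file1.sorted
sort -k1,1 -k2,2n file2 > file2.sorted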
You can try to create a dict from file1 using multidimensional arrays in gawk; this is computationally more efficient (file1 is small compared to file2):
awk '
NR==FNR{for(i=$2;i<=$3;++i) d[$1,i] = $6; next}
d[$1,$2]{print $0, d[$1,$2]}' file1 file2
you get,
1 949800 . T G ISG15
2 900000 rs123 - A AGRN
One possible approach is to use AWK to generate another AWK file. Memory consumption should be low so for a really big file1 this might be a lifesaver. As for speed, that might depend on how smart the AWK implementation is. I haven't had a chance to try it on huge data sets; I am curious about your findings.
Create a file step1.awk:
{
    sub(/^chr/, "", $1);
    print "$1==\"" $1 "\" && " $2 "<$2 && $2<" $3 " { print $0 \"\\t" $6 "\"; }";
}
Apply that on file1:
$ awk -f step1.awk file1
$1=="1" && 948953<$2 && $2<948956 { print $0 "\tISG15"; }
$1=="1" && 949363<$2 && $2<949858 { print $0 "\tISG15"; }
Pipe the output to a file step2.awk and apply that on file2:
$ awk -f step1.awk file1 > step2.awk
$ awk -f step2.awk file2
1 949800 . T G ISG15
2 900000 rs123 - A AGRN
Alternative: generating C
I rewrote step1.awk, making it generate C rather than AWK code. Not only will this solve the memory issue you reported earlier; it will also be a lot faster considering the fact that C is compiled to native code.
BEGIN {
print "#include <stdio.h>";
print "#include <string.h>";
print "int main() {";
print " char s[999];";
print " int a, b;";
print " while (fgets(s, sizeof(s), stdin)) {";
print " s[strlen(s)-1] = 0;";
print " sscanf(s, \"%d %d\", &a, &b);";
}
{
print " if (a==" $1 " && " $2 "<b && b<" $3 ") printf(\"%s\\t%s\\n\", s, \"" $6 "\");";
}
END {
print " }";
print "}";
}
Given your sample file1, this will generate the following C source:
#include <stdio.h>
#include <string.h>
int main() {
char s[999];
int a, b;
while (fgets(s, sizeof(s), stdin)) {
s[strlen(s)-1] = 0;
sscanf(s, "%d %d", &a, &b);
if (a==1 && 948953<b && b<948956) printf("%s\t%s\n", s, "ISG15");
if (a==1 && 949363<b && b<949858) printf("%s\t%s\n", s, "ISG15");
if (a==2 && 800000<b && b<900500) printf("%s\t%s\n", s, "AGRN");
}
}
Sample output:
$ awk -f step1.awk file1 > step2.c
$ cc step2.c -o step2
$ ./step2 < file2
1 949800 . T G ISG15
2 900000 rs123 - A AGRN
It may be inefficient, but it should work, however slowly:
$ awk 'NR==FNR{ a[$2]=$0; next }
{ for(i in a)
if(i>=$2 && i<=$3) print a[i] "\t" $6 }
' f2 f1
1 949800 . T G ISG15
3 900000 . C - AGRN
Basically it reads file2 into memory, and for every line in file1 it goes through every entry of file2 (in memory). It doesn't read the 2 GB file into memory, so it still has less looking up to do than your version.
You could speed it up by replacing the print a[i] "\t" $6 with {print a[i] "\t" $6; delete a[i]}.
EDIT: Added tab-delimited output and refreshed the output to reflect the changed data. Printing "\t" is enough as the files are already tab-delimited and records do not get rebuilt at any point.
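Applied to the script above, that delete suggestion looks like this (a sketch; deleting entries means each file2 line can then match at most one range):
$ awk 'NR==FNR{ a[$2]=$0; next }
       { for(i in a)
           if(i>=$2 && i<=$3) { print a[i] "\t" $6; delete a[i] } }
' f2 f1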

Enumerate lines with same ID in awk

I'm using awk to process the following [sample] of data:
id,desc
168048,Prod_A
217215,Prod_C
217215,Prod_B
168050,Prod_A
168050,Prod_F
168050,Prod_B
What I'm trying to do is to create a column 'item' enumerating the lines within the same 'id':
id,desc,item
168048,Prod_A,#1
217215,Prod_C,#1
217215,Prod_B,#2
168050,Prod_A,#1
168050,Prod_F,#2
168050,Prod_B,#3
Here is what I've tried:
BEGIN {
FS = ","
a = 1
}
NR != 1 {
if (id != $1) {
id = $1
printf "%s,%s\n", $0, "#"a
}
else {
printf "%s,%s\n", $0, "#"a++
}
}
But it messes up the numbering:
168048,Prod_A,#1
217215,Prod_C,#1
217215,Prod_B,#1
168050,Prod_A,#2
168050,Prod_F,#2
168050,Prod_B,#3
Could someone give me some hints?
P.S. The line order doesn't matter
$ awk -F, 'NR>1{print $0,"#"++c[$1]}' OFS=, file
168048,Prod_A,#1
217215,Prod_C,#1
217215,Prod_B,#2
168050,Prod_A,#1
168050,Prod_F,#2
168050,Prod_B,#3
How it works
-F,
This sets the field separator on input to a comma.
NR>1{...}
This limits the commands in braces to lines other than the first, that is, the one with the header.
print $0,"#"++c[$1]
This prints the line followed by # and a count of the number of times that we have seen the first column.
The associative array c keeps a count of the number of times that an id has been seen. For every line, we increment the count for id $1 by 1. ++ increments. Because ++ precedes c[$1], the increment is done before the value is printed (see the short illustration below).
OFS=,
This sets the field separator on output to a comma.
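As a quick illustration of the pre-increment behaviour (a throwaway example):
$ printf 'x\nx\ny\n' | awk '{ print $0 "#" ++c[$1] }'
x#1
x#2
y#1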
Printing a new header as well
$ awk -F, 'NR==1{print $0,"item"} NR>1{print $0,"#"++c[$1]}' OFS=, file
id,desc,item
168048,Prod_A,#1
217215,Prod_C,#1
217215,Prod_B,#2
168050,Prod_A,#1
168050,Prod_F,#2
168050,Prod_B,#3

How to print specific duplicate line based on fields number

I need to print only one of several consecutive lines that share the same first field, namely the one with the most elements in its last field. The last field is a set of words, and I need the line whose last field contains the most elements. In case several lines tie for the maximum, any of them is OK.
Example input:
("aborrecimento",[Noun],[Masc],[Reg:Sing],[Bulk])
("aborrecimento",[Noun],[Masc],[Reg:Sing],[Device,Concrete,Count])
("aborrecimento",[Noun],[Masc],[Reg:Sing],[])
("adiamento",[Noun],[Masc],[Reg:Sing],[])
("adiamento",[Noun],[Masc],[Reg:Sing],[Count])
("adiamento",[Noun],[Masc],[Reg:Sing],[VerbNom])
Example output:
("aborrecimento",[Noun],[Masc],[Reg:Sing],[Device,Concrete,Count])
("adiamento",[Noun],[Masc],[Reg:Sing],[VerbNom])
A solution with awk would be nice, but there is no need for a one-liner.
generate index file
$ cat input.txt |
sed 's/,\[/|[/g' |
awk -F'|' '
{if(!gensub(/[[\])]/, "", "g", $NF))n=0;else n=split($NF, a, /,/); print NR,$1,n}
' |
sort -k2,2 -k3,3nr |
awk '$2!=x{x=$2;print $1}' >idx.txt
content of index file
$ cat idx.txt
2
5
select lines
$ awk 'NR==FNR{idx[$0]; next}; (FNR in idx)' idx.txt input.txt
("aborrecimento",[Noun],[Masc],[Reg:Sing],[Device,Concrete,Count])
("adiamento",[Noun],[Masc],[Reg:Sing],[Count])
Note: this assumes no spaces in input.txt
Use [ as the field delimiter, then split the last field on ,:
awk -F '[[]' '
{split($NF, f, /,/)}
length(f) > max[$1] {line[$1] = $0; max[$1] = length(f)}
END {for (l in line) print line[l]}
' filename
Since order is important, an update:
awk -F '[[]' '
{split($NF, f, /,/)}
length(f) > max[$1] {line[$1] = $0; max[$1] = length(f); nr[$1] = NR}
END {for (l in line) printf("%d\t%s\n", nr[l], line[l])}
' filename |
sort -n |
cut -f 2-
Something like this might work:
awk 'BEGIN {FS="["}
Ff != gensub("^([^,]+).*","\\1","g",$0) { Ff = gensub("^([^,]+).*","\\1","g",$0) ; Lf = $NF ; if (length(Ml) > 0) { print Ml } }
Ff == gensub("^([^,]+).*","\\1","g",$0) { if (length($NF) > length(Lf)) { Lf=$NF ; Ml=$0 } }
END {if (length(Ml) > 0) { print Ml } }' INPUTFILE
BUT it's not the solution you want to use, as this is rather a hack. And it fails if what you mean by longer is more comma-separated elements rather than more characters: the above script happily reports [KABLAMMMMMMMMMMM!] as longer than [A,B,C].
This might work for you:
sort -r file | sort -t, -k1,1 -u