how to find duplication by comparing field 1 and field 2 in awk? - awk

I want to compare field 1 with field 2 and find duplication
For eg: file contains below data
51:40-5E:40
51:41-5E:41
51:42-51:40
51:52-5E:52
51:A0-5E:A0
51:A9-5D:B8
51:AA-5E:53
In this file 51:40 is found in $1 and $2 so I need this 51:40 to be printed when running the script

This awk one-liner might work for you:
awk -F- '$1 in a{print $1}{b[$1]}$2 in b{print $2}{a[$2]}' file

You want
awk '
BEGIN {FS = "-"}
{
field1[$1]++
field2[$2]++
}
END {
for (item in field1) {
if (item in field2) {
print item
}
}
}
' filename
as a one-liner:
awk -F- '{a[$1];b[$2]} END {for (i in a) if (i in b) print i}' filename

Related

Compare multiple columns from one file with multiple columns of another file using awk?

I want to compare first 2 characters of col1 of file1 with col1 of file2 if col3 of file1 is same as col3 of file2 , provided col4 in file2 equals to TRUE. I tried something :-
awk -F'|' 'BEGIN{OFS=FS};(NR==FNR)
{a[substr($1,1,2),$3]=$1;next}(($1,$3)in a) && $4==TRUE ' file1 file2 > outfield
file 1
AE1267453617238|BIDKFXXXX|United Arab Emirates|
PL76UTYVJDYGHU9|ABSFXXJBW|Poland|
GB76UTRTSCLSKJ|FVDGXXXUY|Russia|
file 2
AE|^AE[0-9]{2}[0-9]{24}|United Arab Emirates|TRUE|
PL|^PL[0-9]{2}[A-Z]{10}[0-9]{4}|Poland|FALSE|
GB|^GB[0-9]{2}[A-Z]{5}[0-9]{3}|Europe|TRUE
expected output :-
AE1267453617238|BIDKFXXXX|United Arab Emirates|
You could just simply cascade the multiple conditions with a && as below. Remember your expected output is on the first file, so you need to process the second file first
awk -F'|' ' FNR == NR {
if ( $4 == "TRUE" ) m[$1] = $3 ; next }{ k = substr($1,1,2) } k in m && m[k] == $3' file2 file1
The part m[$1] = $3 creates a hash-map of the $1 with the value of $3 in the second file, which is then used in the first file to compare against only the first two characters of $1 i.e. substr($1,1,2). To avoid redundant use of substr(..), the value is extracted into a variable k and reused subsequently.
If the matches must be on the same line number in each file:
awk -F \| '
FNR==NR && $4 == "TRUE" {a[NR,1]=$1; a[NR,3]=$3}
FNR!=NR && $3 == a[FNR,3] &&
$1 ~ "^"a[FNR,1]' file2 file1
If the matches can be on any line (every line of file1 is checked against every line of file2, duplicate matches aren't printed):
awk -F \| '
FNR==NR {++l}
FNR==NR && $4 == "TRUE" {a[NR,1]=$1; a[NR,3]=$3}
FNR!=NR {
for (i=1; i<=l; ++i) {
if ($3 == a[i,3] && $1 ~ "^"a[i,1])
c[$0]==0
}
}
END {
for (i in c)
print i
}' file2 file1
Note the order files are given. file2 (which contains TRUE and FALSE), goes first. I also used regex instead of substr, so the characters should be alphanumeric only, if not, go back to substr.
Regarding your code:
awk -F'|' 'BEGIN{OFS=FS};(NR==FNR)
{a[substr($1,1,2),$3]=$1;next}(($1,$3)in a) && $4==TRUE ' file1 file2 > outfield
newlines matter to awk. This:
NR==FNR
{ print }
is not the same as this:
NR==FNR { print }
The first one is actually the same as:
NR==FNR { print }
1 { print }
Also when you want to output the contents of a file (file1 in your case) it's usually better to read the OTHER file into memory and then compare the values from the target file against that so you can just print it as you go. So you should be doing awk 'script' file2 file1, not awk 'script' file1 file2, and writing a script based on that.
Try this:
$ cat tst.awk
BEGIN { FS="|" }
NR==FNR {
if ( $4 == "TRUE" ) {
map[$1] = $3
}
next
}
{ key = substr($1,1,2) }
(key in map) && (map[key] == $3)
$ awk -f tst.awk file2 file1
AE1267453617238|BIDKFXXXX|United Arab Emirates|
awk -F\| '
NR==FNR{
a[$3,1]=$1;
a[$3,4]=$4;
next
}
substr($1,1,2) == a[$3,1] && a[$3,4] == "TRUE" { print }
' file2.txt file1.txt
AE1267453617238|BIDKFXXXX|United Arab Emirates|

match two files with awk and output selected fields

I want to compare two files delimited with
;
with the same field1,
output field2 of file1 and field2 field1 of file2.
File1:
16003-Z/VG043;204352
16003/C3;100947
16003/C3;172973
16003/PAB4L;62245
16003;100530
16003;101691
16003;144786
File2:
16003-Z/VG043;568E;0540575;2.59
16003/C3;568E;0000340;2.53
16003/PAB4L;568H;0606738;9.74
16003;568E;0000339;0.71
16003TN9/C3;568E;0042261;3.29
Desired output:
204352;568E;16003-Z/VG043
100947;568E;16003/C3
172973;568E;16003/C3
62245;568H;16003/PAB4L
100530;568E;16003
101691;568E;16003
144786;568E;16003
My try:
awk -F\, '{FS=";"} NR==FNR {a[$1]; next} ($1) in a{ print a[$2]";"$2";"$3}' File1 File2 > Output
The above is not working probably because awk is still obscure to me.
The problem is what is driving the output? what $1, $2, etc are referred to what?
The a[$2] in my intention is the field2 of file 1....but it is not...
What I get is:
;204352;16003-Z/VG043
;100947;16003/C3
;172973;16003/C3
;62245;16003/PAB4L
;100530;16003
;101691;16003
;144786;16003
thanks for helping
This might be what you are after:
awk -F";" '(NR==FNR) { a[$1] = ($1 in a ? a[$1] FS : "") $2; next }
($1 in a) { split(a[$1],b); for(i in b) print b[i] FS $2 FS $1 }' file1 file2
This outputs:
204352;568E;16003-Z/VG043
100947;568E;16003/C3
172973;568E;16003/C3
62245;568H;16003/PAB4L
100530;568E;16003
101691;568E;16003
144786;568E;16003
This approach reads a file file_1.txt by first into an associative array table. (This is done to associate ids / values across files.) Then, looping over the 2nd file file_2.txt, I print the values in table that match the id field of this file along with the current value:
BEGIN {
FS=OFS=";"
while (getline < first)
table[$1] = $2 FS table[$1]
}
$1 in table {
len = split(table[$1], parts)
for (i=1; i<len; i++)
print parts[i], $2, $1
}
$ awk -v first=file_1.txt -f script.awk file_2.txt
204352;568E;16003-Z/VG043
172973;568E;16003/C3
100947;568E;16003/C3
62245;568H;16003/PAB4L
144786;568E;16003
101691;568E;16003
100530;568E;16003

awk group by multiple columns and print max value with non-primary key

i'm new to this site and trying to learn awk. i'm trying to find the maximum value of field3, grouping by field1 and print all the fields with maximum value. Field 2 contains time, that means for each item1 there is 96 values of field2,field3 and field4
input file: (comma separated)
item1,00:15,10,30
item2,00:45,20,45
item2,12:15,30,45
item1,00:30,20,56
item3,23:00,40,44
item1,12:45,50,55
item3,11:15,30,45
desired output:
item1,12:45,50,55
item2,12:15,30,45
item3,11:15,30,45
what i tried so far:
BEGIN{
FS=OFS=","}
{
if (a[$1]<$3){
a[$1]=$3}
}
END{
for (i in a ){
print i,a[i]
}
but this only prints
item1,50
item2,30
item3,30
but i need to print the corresponding field2 and field4 with the max value as shown in the desired output. please help.
The problem here is that you are not storing the whole line, so when you go through the final data there is no full data to print.
What you need to do is to use another array, say data[index]=full line:
BEGIN{
FS=OFS=","}
{
if (a[$1]<$3){
a[$1]=$3
data[$1]=$0} # store it here!
}
END {
for (i in a )
print data[i] # print it here
}
Or as a one-liner:
$ awk 'BEGIN{FS=OFS=","} {if (a[$1]<$3) {a[$1]=$3; data[$1]=$0}} END{for (i in a) print data[i]}' file
item1,12:45,50,55
item2,12:15,30,45
item3,23:00,40,44
With a little help of the sort command:
sort -t, -k1,1 -k3,3nr file | awk -F, '!seen[$1]++'
To do this job robustly you need:
$ cat tst.awk
BEGIN { FS="," }
!($1 in max) {
max[$1] = $3
data[$1] = $0
keys[++numKeys] = $1
}
$3 > max[$1] {
max[$1] = $3
data[$1] = $0
}
END {
for (keyNr=1; keyNr<=numKeys; keyNr++) {
print data[keys[keyNr]]
}
}
$ awk -f tst.awk file
item1,12:45,50,55
item2,12:15,30,45
item3,23:00,40,44
When doing min/max calculations you should always seed your min/max value with the first value read rather than assuming it'll always be less than or greater than some arbitrary value (e.g. zero-or-null if you skip the !($1 in max) block above).
You need the keys array to preserve input order when printing the output. If you use in instead then the output order will be random.
Note that idiomatic awk syntax is simply:
<condition> { <action> }
not C-style:
{ if ( <condition> ) { <action> } }

awk append lines if certain field matches

I'm an absolute beginner to awk and would like some help with this.
I have this data:
FOO|BAR|1234|A|B|C|D|
FOO|BAR|1234|E|F|G|H|
FOO|BAR|5678|I|J|K|L|
FOO|BAR|5678|M|N|O|P|
FOO|BAR|5678|Q|R|S|T|
Desired output:
FOO|BAR|1234|A|B|C|D|E|F|G|H|
FOO|BAR|5678|I|J|K|L|M|N|O|P|Q|R|S|T|
Basically I have to append some fields to the lines where column 3 matches.
Appreciate any responses, thanks a lot!
Another way:
awk -F"|" '$3 in a{
a[$3]=a[$3]"|"$4"|"$5"|"$6"|"$7;
next
}
{ a[$3]=$0
}
END {
for ( i in a) {
print a[i]
}
}'
$ awk -f chain.awk < data
FOO|BAR|1234|A|B|C|D|E|F|G|H|
FOO|BAR|5678|I|J|K|L|M|N|O|P|Q|R|S|T|
$ cat chain.awk
BEGIN {FS = "|"}
$3==old {for(i = 4; i <= NF; i++) saved = saved (i>4?"|":"") $i}
$3!=old {if(old) print saved ; saved = $0 ; old = $3}
END {print saved}
$
BEGIN we set the field separator
$3==old we append the fields $4 ... $NF to the saved data, joining the fields with | except for the first one (note that there is a last, null field)
$3!=old we print the saved data (except for the first record, when old is false) and we restart the mechanism
END we still have saved data in our belly, we have to print it

awk totally separate duplicate and non-duplicates

If we have an input:
TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N
Now we would like to separate the duplicates and non-duplicates based on the fourth column (smiles)
duplicate:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate
95,CPD-3333333,-1,c1ccccc1N
Now the following attempt could do separate the duplicate without any problem. However, the first occurrence of the duplicate will still be included into the non-duplicate file.
BEGIN { FS = ","; f1="a"; f2="b"}
{
# Keep count of the fields in fourth column
count[$4]++;
# Save the line the first time we encounter a unique field
if (count[$4] == 1)
first[$4] = $0;
# If we encounter the field for the second time, print the
# previously saved line
if (count[$4] == 2)
print first[$4] > f1 ;
# From the second time onward. always print because the field is
# duplicated
if (count[$4] > 1)
print > f1;
if (count[$4] == 1) #if (count[$4] - count[$4] == 0) <= change to this doesn't work
print first[$4] > f2;
duplicate output results from the attempt:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate output results from the attempt
TargetIDs,CPD,Value,SMILES
95,CPD-3333333,-1,c1ccccc1N
95,CPD-1111111,-2,c1ccccc1
May I know if any guru might have comments/solutions? Thanks.
I would do this:
awk '
NR==FNR {count[$2] = $1; next}
FNR==1 {FS=","; next}
{
output = (count[$NF] == 1 ? "nondup" : "dup")
print > output
}
' <(cut -d, -f4 input | sort | uniq -c) input
The process substitution will pre-process the file and perform a count on the 4th column. Then, you can process the file and decide if that line is "duplicated".
All in awk: Ed Morton shows a way to collect the data in a single pass. Here's a 2 pass solution that's virtually identical to my example above
awk -F, '
NR==FNR {count[$NF]++; next}
FNR==1 {next}
{
output = (count[$NF] == 1 ? "nondup" : "dup")
print > output
}
' input input
Yes, the input file is given twice.
$ cat tst.awk
BEGIN{ FS="," }
NR>1 {
if (cnt[$4]++) {
dups[$4] = nonDups[$4] dups[$4] $0 ORS
delete nonDups[$4]
}
else {
nonDups[$4] = $0 ORS
}
}
END {
print "Duplicates:"
for (key in dups) {
printf "%s", dups[key]
}
print "\nNon Duplicates:"
for (key in nonDups) {
printf "%s", nonDups[key]
}
}
$ awk -f tst.awk file
Duplicates:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
Non Duplicates:
95,CPD-3333333,-1,c1ccccc1N
This solution only works if the duplicates are grouped together.
awk -F, '
function fout( f, i) {
f = (cnt > 1) ? "dups" : "nondups"
for (i = 1; i <= cnt; ++i)
print lines[i] > f
}
NR > 1 && $4 != lastkey { fout(); cnt = 0 }
{ lastkey = $4; lines[++cnt] = $0 }
END { fout() }
' file
Little late
My version in awk
awk -F, 'NR>1{a[$0":"$4];b[$4]++}
END{d="\n\nnondupe";e="dupe"
for(i in a){split(i,c,":");b[c[2]]==1?d=d"\n"i:e=e"\n"i} print e d}' file
Another built similar to glenn jackmans but all in awk
awk -F, 'function r(f) {while((getline <f)>0)a[$4]++;close(f)}
BEGIN{r(ARGV[1])}{output=(a[$4] == 1 ? "nondup" : "dup");print >output} ' file