compare two file for match, print only one if duplicate matching found - awk

I have two files, File1 and File2. File2 has some duplicate entries which I cannot remove because of the complexity of the file structure. When generating File3, which should contain the lines whose 1st and 2nd columns match between File1 and File2, I want only one entry from File2 for each matching pattern from File1. What's the best way to do this? I tried awk 'NR==FNR{a[$1,$2]=$0;next} ($1,$2) in a{print $0}' File1 File2 but it keeps all the matching entries from File2.
File1
ab 12
cd 24
ef 56
File2
ab 12
ab 12
ef 56
What I am getting is
File3
ab 12
ab 12
ef 56
But what I want is
File3
ab 12
ef 56
Thanks

Another way:
Input:
$ cat f1
ab 12
cd 24
ef 56
$ cat f2
ab 12
ab 12
ef 56
Output:
$ awk '{k=$1 SUBSEP $2}FNR==NR{a[k]; next}k in a && !a[k]++' f1 f2
ab 12
ef 56
For better readability, the test can be written as ++a[k]==1, which reads closer to the "print only one" in the thread title:
$ awk '{k=$1 SUBSEP $2}FNR==NR{a[k]; next}k in a && ++a[k]==1' f1 f2
ab 12
ef 56

You need to delete the entry from a after finding a matching line.
awk 'NR==FNR {a[$0]; next} ($0 in a) {delete a[$0]; print}' File1 File2
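Note that the command above keys on the whole line ($0), which works for the sample because both files contain only the two compared columns. If File2 carried extra columns, the same delete idea could be keyed on the first two fields instead. A minimal sketch, assuming whitespace-separated input; the temporary directory and the extra third column are made up for illustration:

```shell
# Hypothetical sample data: File2 has an extra third column and a duplicate key.
dir=$(mktemp -d)
printf 'ab 12\ncd 24\nef 56\n' > "$dir/File1"
printf 'ab 12 x\nab 12 y\nef 56 z\n' > "$dir/File2"

# Pass 1 (File1): remember each ($1,$2) pair.
# Pass 2 (File2): print the first line whose pair is still in the array,
# then delete the pair so later duplicates are skipped.
result=$(awk 'NR==FNR {a[$1,$2]; next}
              ($1,$2) in a {delete a[$1,$2]; print}' "$dir/File1" "$dir/File2")
printf '%s\n' "$result"
rm -rf "$dir"
```

Only the first File2 line for each pair survives, so the "ab 12 y" line is dropped.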

Related

awk,merge two data sets based on column value

I need to combine two data sets stored in variables. This merge needs to be conditional based on the value of 1st column of "$x" and third column of "$y"
-->echo "$x"
12 hey
23 hello
34 hi
-->echo "$y"
aa bb 12
bb cc 55
ff gg 34
ss ww 23
With the following command I managed to store the first column of $x in a[] and check it against the third column of $y, but I am not getting what I expect. Can someone please help?
awk 'NR==FNR{a[$1]=$1;next} $3 in a{print $0,a[$1]}' <(echo "$x") <(echo "$y")
aa bb 12
ff gg 34
ss ww 23
Expected result:
aa bb 12 hey
ff gg 34 hi
ss ww 23 hello
Your answer is almost right:
awk 'NR==FNR{a[$1]=$2;next} ($3 in a){print $0,a[$3]}' <(echo "$x") <(echo "$y")
Note the a[$1]=$2 and the print $0,a[$3].
join -1 1 -2 3 <(sort -k 1b,1 a.txt) <(sort -k 3b,3 b.txt) |awk '{print $3, $4, $1, $2 }'
This might be a solution if your input is in two text files, a.txt and b.txt, using join on the two number columns.
It does not preserve the original order, though. You may have to sort again if that matters.

How to filter empty line with a 'cut' command?

I have a tab delimited file with a few fields:
f1 f2 f3
a b c
a c
d e
f g a
I want to extract the 3rd column with the 'cut' command:
cut -f3 t
This works. However, how can I filter out the empty lines in the output? As can be seen, the 2nd and 3rd lines are empty after extraction.
To remove empty output:
$ cut -f3 file | grep .
f3
c
a
Or:
$ awk -F'\t' '$3 {print $3}' file
f3
c
a
To replace the missing output with a filler:
$ awk -F'\t' '{if ($3) print $3; else print "FILL"}' file
f3
c
FILL
FILL
a
Or, for people who like the more compact ternary statement:
$ awk -F'\t' '{print ($3?$3:"FILL")}' file
f3
c
FILL
FILL
a
Example with multiple words in field 3
$ cat file2
f1 f2 f3
f g a b c d
$ cut -f3 file2 | grep .
f3
a b c d
$ awk -F'\t' '$3 {print $3}' file2
f3
a b c d
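One caveat with the awk tests above: $3 used as a condition is false not only for an empty field but also for a field that looks like the number 0. A sketch that compares against the empty string instead, using made-up sample data where one row has 0 in the third column:

```shell
dir=$(mktemp -d)
# Hypothetical tab-delimited data: third field is "0" on line 2 and empty on line 3.
printf 'f1\tf2\tf3\na\tb\t0\na\tc\t\nf\tg\ta\n' > "$dir/t"

# $3 != "" keeps "0" and drops only truly empty fields.
kept=$(awk -F'\t' '$3 != "" {print $3}' "$dir/t")
printf '%s\n' "$kept"
rm -rf "$dir"
```

With the plain $3 test, the line containing 0 would be filtered out along with the empty one.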

Comparing two lists and printing select columns from each list

I want to compare two lists and print some columns from one, and some from the other if two match. I suspect I'm close but I suppose it's better to check..
1st file: Data.txt
101 0.123
145 0.119
242 0.4
500 0.88
2nd File: Map.txt
red 1 99
blue 3 101
rob 3 240
ted 7 500
So, if I want to compare the 3rd column of file2 against the 1st of file1 and print the 1st column of file2 and all of file1, I tried awk 'NR==FNR {a[$3];next}$1 in a{print$0}' file2 file1
but that only prints matches in file1. I tried adding x=$1 in the awk, i.e. awk 'NR==FNR {x=$1;a[$3];next}$1 in a{print x $0}' file2 file1, but that saves only one value of $1 and outputs that value on every line. I also tried adding $1 into a[$3], which is obviously wrong and gives no output.
Ideally I'd like to get this output:
blue 101 0.123
ted 500 0.88
which is the 1st column of file2 and the 3rd column of file2 matched to 1st column of file1, and the rest of file1.
You had it almost exactly right in your second attempt. Instead of assigning the value of $1 to a scalar, stash it in the array for later use.
awk 'NR==FNR {a[$3]=$1; next} $1 in a {print a[$1], $0}' file2.txt file1.txt
$ cat file1.txt
101 0.123
145 0.119
242 0.4
500 0.88
$ cat file2.txt
red 1 99
blue 3 101
rob 3 240
ted 7 500
$ awk 'NR==FNR {a[$3]=$1; next} $1 in a {print a[$1], $0}' file2.txt file1.txt
blue 101 0.123
ted 500 0.88

Reading from a file and writing to another using Awk

There are two tab-delimited text files. My aim is to change File 1 so that the zeros in its 2nd column are replaced with the corresponding values from the 2nd column of File 2.
To visualize,
File 1:
AA 0
BB 0
CC 0
DD 0
EE 0
File 2:
AA 256
DD 142
EE 26
File 1 - Output:
AA 256
BB 0
CC 0
DD 142
EE 26
I wrote the command below, but as you can see I supply the value from the 1st row of File 2 by hand. I want to do this automatically. What should I do?
awk -F'\t' 'BEGIN {OFS=FS} {if($1 == "AA") $2="256";print}' test > test.tmp && mv test.tmp test
Thank you in advance.
awk 'BEGIN {FS=OFS="\t"} NR==FNR{a[$1]=$2; next} {print $1, a[$1]+0}' file2 file1
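The one-liner above loads File 2 into an array keyed on the first column, then prints every line of File 1 with the looked-up value; a[$1]+0 turns a missing entry into 0. A sketch of a variant that keeps File 1's own second column when there is no match, and writes the result back through a temporary file as in the question. The file names, the temporary directory, and the non-zero BB value are illustrative assumptions:

```shell
dir=$(mktemp -d)
printf 'AA\t0\nBB\t7\nCC\t0\n' > "$dir/file1"
printf 'AA\t256\n' > "$dir/file2"

# Pass 1 (file2): remember the value for each key.
# Pass 2 (file1): use the remembered value if the key exists, else keep $2.
awk 'BEGIN {FS=OFS="\t"}
     NR==FNR {a[$1]=$2; next}
     {print $1, (($1 in a) ? a[$1] : $2)}' "$dir/file2" "$dir/file1" > "$dir/file1.tmp" &&
  mv "$dir/file1.tmp" "$dir/file1"

updated=$(cat "$dir/file1")
printf '%s\n' "$updated"
rm -rf "$dir"
```

If File 1 always holds zeros, as in the question, both forms give the same result.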

Processing 2 files with different field separators using awk

Let's say I have 2 files :
$ cat file1
A:10
B:5
C:12
$ cat file2
100 A
50 B
42 C
I'd like to have something like :
A 10 100
B 5 50
C 12 42
I tried this :
awk 'BEGIN{FS=":"}NR==FNR{a[$1]=$2;next}{FS=" ";print $2,a[$2],$1}' file1 file2
Which outputs:
100 A
B 5 50
C 12 42
I guess the problem comes from the field separator being set too late for the second file. How can I set a different field separator for each file (rather than one for all of them)?
Thanks
Edit: a more general case
With file2 and file3 like this :
$ cat file3
A:10 foo
B:5 bar
C:12 baz
How to get :
A 10 foo 100
B 5 bar 50
C 12 baz 42
Just set FS between files:
awk '...' FS=":" file1 FS=" " file2
i.e.:
$ awk 'NR==FNR{a[$1]=$2;next}{print $2,a[$2],$1}' FS=":" file1 FS=" " file2
A 10 100
B 5 50
C 12 42
You need to get awk to re-split $0 after you change FS.
You can do that with $0=$0 (for example).
So {FS=" ";$0=$0;...} in your final block will do what you want.
Though only doing that the first time you need to change FS will likely perform slightly better for large files.
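A sketch of that "only the first time" variant: FS is switched and the record re-split once, on the first line of the second file. The temporary directory is an assumption for the example; the data is the same as in the question:

```shell
dir=$(mktemp -d)
printf 'A:10\nB:5\nC:12\n' > "$dir/file1"
printf '100 A\n50 B\n42 C\n' > "$dir/file2"

joined=$(awk 'BEGIN {FS=":"}
              NR==FNR {a[$1]=$2; next}
              FNR==1 {FS=" "; $0=$0}   # first line of file2: switch FS, re-split
              {print $2, a[$2], $1}' "$dir/file1" "$dir/file2")
printf '%s\n' "$joined"
rm -rf "$dir"
```

The FNR==1 rule is only reached for file2, because the NR==FNR rule ends with next while file1 is being read.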
You can try something like:
$ cat f1
A:10
B:5
C:12
$ cat f2
100 A
50 B
42 C
$ awk 'NR==FNR{split($0,tmp,/:/);a[tmp[1]]=tmp[2];next}$2 in a{print $2,a[$2],$1}' f1 f2
A 10 100
B 5 50
C 12 42
or set multiple field separators
$ awk -F"[: ]" 'NR==FNR{a[$1]=$2;next}$2 in a{print $2,a[$2],$1}' f1 f2
A 10 100
B 5 50
C 12 42
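The multiple-separator approach may also cover the more general case from the question's edit (file3 with lines such as A:10 foo). A sketch, assuming that neither ":" nor a space ever occurs inside a value; the temporary directory is made up for the example:

```shell
dir=$(mktemp -d)
printf 'A:10 foo\nB:5 bar\nC:12 baz\n' > "$dir/file3"
printf '100 A\n50 B\n42 C\n' > "$dir/file2"

# With FS="[: ]", file3 splits into key, number, word; file2 into number, key.
general=$(awk -F'[: ]' 'NR==FNR {a[$1]=$2 OFS $3; next}
                        $2 in a {print $2, a[$2], $1}' "$dir/file3" "$dir/file2")
printf '%s\n' "$general"
rm -rf "$dir"
```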