Compare multiple columns from one file with multiple columns of another file using awk? - awk

I want to compare first 2 characters of col1 of file1 with col1 of file2 if col3 of file1 is same as col3 of file2 , provided col4 in file2 equals to TRUE. I tried something :-
awk -F'|' 'BEGIN{OFS=FS};(NR==FNR)
{a[substr($1,1,2),$3]=$1;next}(($1,$3)in a) && $4==TRUE ' file1 file2 > outfield
file 1
AE1267453617238|BIDKFXXXX|United Arab Emirates|
PL76UTYVJDYGHU9|ABSFXXJBW|Poland|
GB76UTRTSCLSKJ|FVDGXXXUY|Russia|
file 2
AE|^AE[0-9]{2}[0-9]{24}|United Arab Emirates|TRUE|
PL|^PL[0-9]{2}[A-Z]{10}[0-9]{4}|Poland|FALSE|
GB|^GB[0-9]{2}[A-Z]{5}[0-9]{3}|Europe|TRUE
expected output :-
AE1267453617238|BIDKFXXXX|United Arab Emirates|

You could just simply cascade the multiple conditions with a && as below. Remember your expected output is on the first file, so you need to process the second file first
awk -F'|' ' FNR == NR {
if ( $4 == "TRUE" ) m[$1] = $3 ; next }{ k = substr($1,1,2) } k in m && m[k] == $3' file2 file1
The part m[$1] = $3 creates a hash-map of the $1 with the value of $3 in the second file, which is then used in the first file to compare against only the first two characters of $1 i.e. substr($1,1,2). To avoid redundant use of substr(..), the value is extracted into a variable k and reused subsequently.

If the matches must be on the same line number in each file:
awk -F \| '
FNR==NR && $4 == "TRUE" {a[NR,1]=$1; a[NR,3]=$3}
FNR!=NR && $3 == a[FNR,3] &&
$1 ~ "^"a[FNR,1]' file2 file1
If the matches can be on any line (every line of file1 is checked against every line of file2, duplicate matches aren't printed):
awk -F \| '
FNR==NR {++l}
FNR==NR && $4 == "TRUE" {a[NR,1]=$1; a[NR,3]=$3}
FNR!=NR {
for (i=1; i<=l; ++i) {
if ($3 == a[i,3] && $1 ~ "^"a[i,1])
c[$0]==0
}
}
END {
for (i in c)
print i
}' file2 file1
Note the order files are given. file2 (which contains TRUE and FALSE), goes first. I also used regex instead of substr, so the characters should be alphanumeric only, if not, go back to substr.

Regarding your code:
awk -F'|' 'BEGIN{OFS=FS};(NR==FNR)
{a[substr($1,1,2),$3]=$1;next}(($1,$3)in a) && $4==TRUE ' file1 file2 > outfield
newlines matter to awk. This:
NR==FNR
{ print }
is not the same as this:
NR==FNR { print }
The first one is actually the same as:
NR==FNR { print }
1 { print }
Also when you want to output the contents of a file (file1 in your case) it's usually better to read the OTHER file into memory and then compare the values from the target file against that so you can just print it as you go. So you should be doing awk 'script' file2 file1, not awk 'script' file1 file2, and writing a script based on that.
Try this:
$ cat tst.awk
BEGIN { FS="|" }
NR==FNR {
if ( $4 == "TRUE" ) {
map[$1] = $3
}
next
}
{ key = substr($1,1,2) }
(key in map) && (map[key] == $3)
$ awk -f tst.awk file2 file1
AE1267453617238|BIDKFXXXX|United Arab Emirates|

awk -F\| '
NR==FNR{
a[$3,1]=$1;
a[$3,4]=$4;
next
}
substr($1,1,2) == a[$3,1] && a[$3,4] == "TRUE" { print }
' file2.txt file1.txt
AE1267453617238|BIDKFXXXX|United Arab Emirates|

Related

Print columns from two files

How to print columns from various files?
I tried according to Awk: extract different columns from many different files
paste <(awk '{printf "%.4f %.5f ", $1, $2}' FILE.R ) <(awk '{printf "%.6f %.0f.\n", $3, $4}' FILE_R )
FILE.R == ARGV[1] { one[FNR]=$1 }
FILE.R == ARGV[2] { two[FNR]=$2 }
FILE_R == ARGV[3] { three[FNR]=$3 }
FILE_R == ARGV[4] { four[FNR]=$4 }
END {
for (i=1; i<=length(one); i++) {
print one[i], two[i], three[i], four[i]
}
}
but I don't understand how to use this script.
FILE.R
56604.6017 2.3893 2.2926 2.2033
56605.1562 2.3138 2.2172 2.2033
FILE_R
56604.6017 2.29259 0.006699 42.
56605.1562 2.21716 0.007504 40.
Output desired
56604.6017 2.3893 0.006699 42.
56605.1562 2.3138 0.007504 40.
Thank you
This is one way:
$ awk -v OFS="\t" 'NR==FNR{a[$1]=$2;next}{print $1,a[$1],$3,$4}' file1 file2
Output:
56604.6017 2.3893 0.006699 42.
56605.1562 2.3138 0.007504 40.
Explained:
$ awk -v OFS="\t" ' # setting the field separator to a tab
NR==FNR { # process the first file
a[$1]=$2 # hash the second field, use first as key
next
}
{
print $1,a[$1],$3,$4 # output
}' file1 file2
If the field spacing with tabs is not enough, use printf with modifiers like in your sample.

match two files with awk and output selected fields

I want to compare two files delimited with
;
with the same field1,
output field2 of file1 and field2 field1 of file2.
File1:
16003-Z/VG043;204352
16003/C3;100947
16003/C3;172973
16003/PAB4L;62245
16003;100530
16003;101691
16003;144786
File2:
16003-Z/VG043;568E;0540575;2.59
16003/C3;568E;0000340;2.53
16003/PAB4L;568H;0606738;9.74
16003;568E;0000339;0.71
16003TN9/C3;568E;0042261;3.29
Desired output:
204352;568E;16003-Z/VG043
100947;568E;16003/C3
172973;568E;16003/C3
62245;568H;16003/PAB4L
100530;568E;16003
101691;568E;16003
144786;568E;16003
My try:
awk -F\, '{FS=";"} NR==FNR {a[$1]; next} ($1) in a{ print a[$2]";"$2";"$3}' File1 File2 > Output
The above is not working probably because awk is still obscure to me.
The problem is what is driving the output? what $1, $2, etc are referred to what?
The a[$2] in my intention is the field2 of file 1....but it is not...
What I get is:
;204352;16003-Z/VG043
;100947;16003/C3
;172973;16003/C3
;62245;16003/PAB4L
;100530;16003
;101691;16003
;144786;16003
thanks for helping
This might be what you are after:
awk -F";" '(NR==FNR) { a[$1] = ($1 in a ? a[$1] FS : "") $2; next }
($1 in a) { split(a[$1],b); for(i in b) print b[i] FS $2 FS $1 }' file1 file2
This outputs:
204352;568E;16003-Z/VG043
100947;568E;16003/C3
172973;568E;16003/C3
62245;568H;16003/PAB4L
100530;568E;16003
101691;568E;16003
144786;568E;16003
This approach reads a file file_1.txt by first into an associative array table. (This is done to associate ids / values across files.) Then, looping over the 2nd file file_2.txt, I print the values in table that match the id field of this file along with the current value:
BEGIN {
FS=OFS=";"
while (getline < first)
table[$1] = $2 FS table[$1]
}
$1 in table {
len = split(table[$1], parts)
for (i=1; i<len; i++)
print parts[i], $2, $1
}
$ awk -v first=file_1.txt -f script.awk file_2.txt
204352;568E;16003-Z/VG043
172973;568E;16003/C3
100947;568E;16003/C3
62245;568H;16003/PAB4L
144786;568E;16003
101691;568E;16003
100530;568E;16003

While Read and AWK to Change Field

I have two files - FileA and FileB. FileA has 10 fields with 100 lines. If Field1 and Field2 match, Field3 should be changed. FileB has 3 fields. I am reading in FileB with a while loop to match the two fields and to get the value that should be use for field 3.
while IFS=$'\t' read hostname interface metric; do
awk -v var1=${hostname} -v var2=${interface} -v var3=${metric} '{if ($1 ~ var1 && $2 ~ var2) $3=var3; print $0}' OFS="\t" FileA.txt
done < FileB.txt
At each line iteration, this prints FileB.txt with the single line that changed. I only want it to print the line that was changed.
Please Help!
It's a smell to be calling awk once for each line of file B. You should be able to accomplish this task with a single pass through each file.
Try something like this:
awk -F'\t' -v OFS='\t' '
# first, read in data from file B
NR == FNR { values[$1 FS $2] = $3; next }
# then, output modified lines from matching lines in file A
($1 FS $2) in values { $3 = values[$1 FS $2]; print }
' fileB fileA
I'm assuming that you actually want to match with string equality instead of ~ pattern matching.
I only want it to print the line that was changed.
Simply put your print $0 statement to if clause body:
'{if ($1 ~ var1 && $2 ~ var2) { $3=var3; print $0 }}'
or even shorter:
'$1~var1 && $2~var2{ $3=var3; print $0 }'

awk totally separate duplicate and non-duplicates

If we have an input:
TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N
Now we would like to separate the duplicates and non-duplicates based on the fourth column (smiles)
duplicate:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate
95,CPD-3333333,-1,c1ccccc1N
Now the following attempt could do separate the duplicate without any problem. However, the first occurrence of the duplicate will still be included into the non-duplicate file.
BEGIN { FS = ","; f1="a"; f2="b"}
{
# Keep count of the fields in fourth column
count[$4]++;
# Save the line the first time we encounter a unique field
if (count[$4] == 1)
first[$4] = $0;
# If we encounter the field for the second time, print the
# previously saved line
if (count[$4] == 2)
print first[$4] > f1 ;
# From the second time onward. always print because the field is
# duplicated
if (count[$4] > 1)
print > f1;
if (count[$4] == 1) #if (count[$4] - count[$4] == 0) <= change to this doesn't work
print first[$4] > f2;
duplicate output results from the attempt:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate output results from the attempt
TargetIDs,CPD,Value,SMILES
95,CPD-3333333,-1,c1ccccc1N
95,CPD-1111111,-2,c1ccccc1
May I know if any guru might have comments/solutions? Thanks.
I would do this:
awk '
NR==FNR {count[$2] = $1; next}
FNR==1 {FS=","; next}
{
output = (count[$NF] == 1 ? "nondup" : "dup")
print > output
}
' <(cut -d, -f4 input | sort | uniq -c) input
The process substitution will pre-process the file and perform a count on the 4th column. Then, you can process the file and decide if that line is "duplicated".
All in awk: Ed Morton shows a way to collect the data in a single pass. Here's a 2 pass solution that's virtually identical to my example above
awk -F, '
NR==FNR {count[$NF]++; next}
FNR==1 {next}
{
output = (count[$NF] == 1 ? "nondup" : "dup")
print > output
}
' input input
Yes, the input file is given twice.
$ cat tst.awk
BEGIN{ FS="," }
NR>1 {
if (cnt[$4]++) {
dups[$4] = nonDups[$4] dups[$4] $0 ORS
delete nonDups[$4]
}
else {
nonDups[$4] = $0 ORS
}
}
END {
print "Duplicates:"
for (key in dups) {
printf "%s", dups[key]
}
print "\nNon Duplicates:"
for (key in nonDups) {
printf "%s", nonDups[key]
}
}
$ awk -f tst.awk file
Duplicates:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
Non Duplicates:
95,CPD-3333333,-1,c1ccccc1N
This solution only works if the duplicates are grouped together.
awk -F, '
function fout( f, i) {
f = (cnt > 1) ? "dups" : "nondups"
for (i = 1; i <= cnt; ++i)
print lines[i] > f
}
NR > 1 && $4 != lastkey { fout(); cnt = 0 }
{ lastkey = $4; lines[++cnt] = $0 }
END { fout() }
' file
Little late
My version in awk
awk -F, 'NR>1{a[$0":"$4];b[$4]++}
END{d="\n\nnondupe";e="dupe"
for(i in a){split(i,c,":");b[c[2]]==1?d=d"\n"i:e=e"\n"i} print e d}' file
Another built similar to glenn jackmans but all in awk
awk -F, 'function r(f) {while((getline <f)>0)a[$4]++;close(f)}
BEGIN{r(ARGV[1])}{output=(a[$4] == 1 ? "nondup" : "dup");print >output} ' file

awk | update field number after comparing field from other file

Input File1: file1.txt
MH=919767,918975
DL=919922
HR=919891,919394,919812
KR=919999,918888
Input File2: file2.txt
aec,919922783456,a5,b3,,,asf
abc,918975583456,a1,b1,,,abf
aeci,919998546783,a2,b4,,,wsf
Output File
aec,919922783456,a5,b3,DL,,asf
abc,918975583456,a1,b1,MH,,abf
aeci,919998546783,a2,b4,NOMATCH,,wsf
Notes
Need to compare phone number (Input file2.txt - 2nd field - initial 6 digit only) within Input file1.txt - 2nd field with "=" separted). If there is match in intial 6 digit of phone number, then OUTPUT should contain 2 digit code from file (Input file1) into output in 5th field
File1.txt is having single code (for example MH) for mupltiple phone number intials.
If you have GNU awk, try the following. Run like:
awk -f script.awk file1.txt file2.txt
Contents of script.awk:
BEGIN {
FS="[=,]"
OFS=","
}
FNR==NR {
for(i=2;i<=NF;i++) {
a[$1][$i]
}
next
}
{
$5 = "NOMATCH"
for(j in a) {
for (k in a[j]) {
if (substr($2,0,6) == k) {
$5 = j
}
}
}
}1
Alternatively, here's the one-liner:
awk -F "[=,]" 'FNR==NR { for(i=2;i<=NF;i++) a[$1][$i]; next } { $5 = "NOMATCH"; for(j in a) for (k in a[j]) if (substr($2,0,6) == k) $5 = j }1' OFS=, file1.txt file2.txt
Results:
aec,919922783456,a5,b3,DL,,asf
abc,918975583456,a1,b1,MH,,abf
aeci,919998546783,a2,b4,NOMATCH,,wsf
If you have an 'old' awk, try the following. Run like:
awk -f script.awk file1.txt file2.txt
Contents of script.awk:
BEGIN {
# set the field separator to either an equals sign or a comma
FS="[=,]"
# set the output field separator to a comma
OFS=","
}
# for the first file in the arguments list
FNR==NR {
# loop through all the fields, starting at field two
for(i=2;i<=NF;i++) {
# add field one and each field to a pseudo-multidimensional array
a[$1,$i]
}
# skip processing the rest of the code
next
}
# for the second file in the arguments list
{
# set the default value for field 5
$5 = "NOMATCH"
# loop though the array
for(j in a) {
# split the array keys into another array
split(j,b,SUBSEP)
# if the first six digits of field two equal the value stored in this array
if (substr($2,0,6) == b[2]) {
# assign field five
$5 = b[1]
}
}
# return true, therefore print by default
}1
Alternatively, here's the one-liner:
awk -F "[=,]" 'FNR==NR { for(i=2;i<=NF;i++) a[$1,$i]; next } { $5 = "NOMATCH"; for(j in a) { split(j,b,SUBSEP); if (substr($2,0,6) == b[2]) $5 = b[1] } }1' OFS=, file1.txt file2.txt
Results:
aec,919922783456,a5,b3,DL,,asf
abc,918975583456,a1,b1,MH,,abf
aeci,919998546783,a2,b4,NOMATCH,,wsf
Try something like:
awk '
NR==FNR{
for(i=2; i<=NF; i++) A[$i]=$1
next
}
{
$5="NOMATCH"
for(i in A) if ($2~"^" i) $5=A[i]
}
1
' FS='[=,]' file1 FS=, OFS=, file2