AWK command to simulate full outer join and then compare

Hello, I need help building an awk command that simulates a full outer join and then compares the values.
Say:
cat File1
1|A|B
2|C|D
3|E|F
cat File2
1|A|X
2|C|D
3|Z|F
Assumptions:
The first column in both files is the key field, so there are no duplicate keys.
Both files are expected to have the same structure.
There is no limit on the number of fields.
Now, if I run the awk command
awk -F'|' ........... File1 File2 > output
Output format
<Key>|<File1.column1>|<File2.column1>|<Matched/Mismatched>|<File1.column2>|<File2.column2>|<Matched/Mismatched>|<File1.column3>|<File2.column3>|<Matched/Mismatched>
cat output
1|A|A|MATCHED|B|X|MISMATCHED
2|C|C|MATCHED|D|D|MATCHED
3|E|Z|MISMATCHED|F|F|MATCHED
Thank You

With GNU awk (needed for the a[$1][i] arrays of arrays):
$ awk -v OFS=\| -F\| 'NR==FNR{for(i=2;i<=NF;i++)a[$1][i]=$i;next}{printf "%s",$1;for(i=2;i<=NF;i++){printf"|%s|%s|%s",a[$1][i],$i,a[$1][i]==$i?"matched":"mismatched"}printf"\n"}' file1 file2
1|A|A|matched|B|X|mismatched
2|C|C|matched|D|D|matched
3|E|Z|mismatched|F|F|matched
Explained:
BEGIN {
    OFS="|"; FS="|"
}
NR==FNR {                     # for the first file
    for(i=2;i<=NF;i++)        # fill array with "non-key" fields
        a[$1][i]=$i           # and use the "key" field as an index
    next                      # nothing is printed for the first file
}
{
    printf "%s",$1
    for(i=2;i<=NF;i++) {      # use the key field to match and print
        printf"|%s|%s|%s",a[$1][i],$i,a[$1][i]==$i?"matched":"mismatched"
    }
    printf"\n"                # terminate the output line
}

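Perhaps easier with an assist from join (note that join expects both files to be sorted on the key and, by default, prints only keys found in both):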
$ join -t'|' file1 file2 |
awk -F'|' -v OFS='|' '{n="MIS"; m="MATCHED";
m1=($2!=$4?n:"")m;
m2=($3!=$5?n:"")m;
print $1,$2,$4,m1,$3,$5,m2}'
1|A|A|MATCHED|B|X|MISMATCHED
2|C|C|MATCHED|D|D|MATCHED
3|E|Z|MISMATCHED|F|F|MATCHED
For an unspecified number of fields, more awk is needed:
$ join -t'|' file1 file2 |
awk -F'|' '{c=(NF-1)/2; printf "%s", $1;
for(i=2;i<=c+1;i++) printf "|%s|%s|%s", $i,$(i+c),($i!=$(i+c)?"MIS":"")"MATCHED";
print ""}'

$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==FNR {
for (i=2; i<=NF; i++) {
a[$1,i] = $i
}
next
}
{
printf "%s%s", $1, OFS
for (i=2; i<=NF; i++) {
printf "%s%s%s%s%s%s", a[$1,i], OFS, $i, OFS, (a[$1,i]==$i ? "" : "MIS") "MATCHED", (i<NF ? OFS : ORS)
}
}
$ awk -f tst.awk file1 file2
1|A|A|MATCHED|B|X|MISMATCHED
2|C|C|MATCHED|D|D|MATCHED
3|E|Z|MISMATCHED|F|F|MATCHED
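
The answers above never print a key that exists only in the first file, which is the part a full outer join would add. Below is a minimal sketch of that extension, assuming the same pipe-delimited layout. The sample keys 4 and 5 and the MISSING_IN_FILE1 / MISSING_IN_FILE2 labels are made up for illustration and are not part of the question.

```shell
# Hypothetical sample inputs: key 4 exists only in f1, key 5 only in f2.
tmp=$(mktemp -d)
printf '%s\n' '1|A|B' '2|C|D' '4|G|H' > "$tmp/f1"
printf '%s\n' '1|A|X' '2|C|D' '5|I|J' > "$tmp/f2"

result=$(awk '
BEGIN { FS = OFS = "|" }
NR == FNR {                        # first file: remember every field and the field count
    for (i = 2; i <= NF; i++) a[$1, i] = $i
    nf[$1] = NF
    next
}
{                                  # second file: compare against the first
    seen[$1] = 1
    line = $1
    for (i = 2; i <= NF; i++) {
        if ($1 in nf)
            status = (a[$1, i] == $i ? "MATCHED" : "MISMATCHED")
        else
            status = "MISSING_IN_FILE1"      # label invented for this sketch
        line = line OFS a[$1, i] OFS $i OFS status
    }
    print line
}
END {                              # keys that never appeared in the second file
    for (k in nf) {
        if (k in seen) continue
        line = k
        for (i = 2; i <= nf[k]; i++)
            line = line OFS a[k, i] OFS "" OFS "MISSING_IN_FILE2"
        print line
    }
}' "$tmp/f1" "$tmp/f2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With the sample data this prints the two common keys as before, then 5||I|MISSING_IN_FILE1||J|MISSING_IN_FILE1 and 4|G||MISSING_IN_FILE2|H||MISSING_IN_FILE2. If several keys are missing from the second file, the order of those END lines is unspecified, because for (k in nf) does not guarantee an order.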

Related

printing information of two files according specific field

I have two files. When the first field exists and is equal in both files, I need to print the information as in the example.
file 1
20;"aaaaaa";99292929
24;"fsfdfa";42933294
30;"fsdsff";23832299
38;"fjsdjl";62673777
file 2
13;"fsdffsdfs";2272777
20;"ffuiiii";23728877
30;"wdwfsdh";8882817
40;"sfjslll";82371111
expect result:
file1;20;"aaaaaa";99292929;file2;20;"ffuiiii";23728877
file1;30;"fsdsff";23832299;file2;30;"wdwfsdh";8882817
I tried with:
awk 'FNR==NR{a[$1]=$1;next} $1 in a' file2 file1 > newfile
The logic is OK, but I can't show the fields that I want.
awk will help:
awk -F ';' 'NR==FNR{rec[$1]=FILENAME FS $0}
NR>FNR{
if($1 in rec){
print rec[$1] FS FILENAME FS $0
}
}' file{1..2}
should do.
$ cat tst.awk
BEGIN { FS=OFS=";" }
{ $0 = FILENAME FS $0 }
NR==FNR { a[$2] = $0; next }
$2 in a { print a[$2], $0 }
$ awk -f tst.awk file1 file2
file1;20;"aaaaaa";99292929;file2;20;"ffuiiii";23728877
file1;30;"fsdsff";23832299;file2;30;"wdwfsdh";8882817

Print columns from two files

How to print columns from various files?
I tried according to Awk: extract different columns from many different files
paste <(awk '{printf "%.4f %.5f ", $1, $2}' FILE.R ) <(awk '{printf "%.6f %.0f.\n", $3, $4}' FILE_R )
FILE.R == ARGV[1] { one[FNR]=$1 }
FILE.R == ARGV[2] { two[FNR]=$2 }
FILE_R == ARGV[3] { three[FNR]=$3 }
FILE_R == ARGV[4] { four[FNR]=$4 }
END {
for (i=1; i<=length(one); i++) {
print one[i], two[i], three[i], four[i]
}
}
but I don't understand how to use this script.
FILE.R
56604.6017 2.3893 2.2926 2.2033
56605.1562 2.3138 2.2172 2.2033
FILE_R
56604.6017 2.29259 0.006699 42.
56605.1562 2.21716 0.007504 40.
Output desired
56604.6017 2.3893 0.006699 42.
56605.1562 2.3138 0.007504 40.
Thank you
This is one way:
$ awk -v OFS="\t" 'NR==FNR{a[$1]=$2;next}{print $1,a[$1],$3,$4}' file1 file2
Output:
56604.6017 2.3893 0.006699 42.
56605.1562 2.3138 0.007504 40.
Explained:
$ awk -v OFS="\t" ' # set the output field separator to a tab
NR==FNR { # process the first file
a[$1]=$2 # hash the second field, use first as key
next
}
{
print $1,a[$1],$3,$4 # output
}' file1 file2
If the field spacing with tabs is not enough, use printf with modifiers like in your sample.
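
A sketch of the printf variant mentioned above, assuming the two sample files from the question. The precisions in the format string are chosen only to reproduce the desired output, and the ($1 in a) guard, which skips keys missing from the first file, is an addition that is not in the answer.

```shell
# Recreate the two sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '56604.6017 2.3893 2.2926 2.2033' \
              '56605.1562 2.3138 2.2172 2.2033' > "$tmp/FILE.R"
printf '%s\n' '56604.6017 2.29259 0.006699 42.' \
              '56605.1562 2.21716 0.007504 40.' > "$tmp/FILE_R"

result=$(awk '
NR == FNR { a[$1] = $2; next }     # first file: keep column 2, indexed by column 1
($1 in a) {                        # second file: only rows whose key was seen
    printf "%.4f %.4f %.6f %.0f.\n", $1, a[$1], $3, $4
}' "$tmp/FILE.R" "$tmp/FILE_R")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The literal dot after %.0f restores the trailing "." of the last column, which is lost when awk converts "42." to a number.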

match two files with awk and output selected fields

I want to compare two files delimited with ; on field 1 and, where field 1 matches, output field 2 of file1 followed by field 2 and field 1 of file2.
File1:
16003-Z/VG043;204352
16003/C3;100947
16003/C3;172973
16003/PAB4L;62245
16003;100530
16003;101691
16003;144786
File2:
16003-Z/VG043;568E;0540575;2.59
16003/C3;568E;0000340;2.53
16003/PAB4L;568H;0606738;9.74
16003;568E;0000339;0.71
16003TN9/C3;568E;0042261;3.29
Desired output:
204352;568E;16003-Z/VG043
100947;568E;16003/C3
172973;568E;16003/C3
62245;568H;16003/PAB4L
100530;568E;16003
101691;568E;16003
144786;568E;16003
My try:
awk -F\, '{FS=";"} NR==FNR {a[$1]; next} ($1) in a{ print a[$2]";"$2";"$3}' File1 File2 > Output
The above is not working, probably because awk is still obscure to me.
The problem is: what drives the output, and which file do $1, $2, etc. refer to?
I intended a[$2] to be field 2 of file 1, but it is not.
What I get is:
;204352;16003-Z/VG043
;100947;16003/C3
;172973;16003/C3
;62245;16003/PAB4L
;100530;16003
;101691;16003
;144786;16003
thanks for helping
This might be what you are after:
awk -F";" '(NR==FNR) { a[$1] = ($1 in a ? a[$1] FS : "") $2; next }
($1 in a) { n = split(a[$1],b); for(i=1;i<=n;i++) print b[i] FS $2 FS $1 }' file1 file2
This outputs:
204352;568E;16003-Z/VG043
100947;568E;16003/C3
172973;568E;16003/C3
62245;568H;16003/PAB4L
100530;568E;16003
101691;568E;16003
144786;568E;16003
This approach first reads file_1.txt into an associative array named table. (This is done to associate ids and values across files.) Then, looping over the second file, file_2.txt, I print the values in table that match the id field of the current line, along with that line's own fields:
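BEGIN {
    FS=OFS=";"
    while ((getline < first) > 0)          # stop at EOF and on read errors
        table[$1] = $2 FS table[$1]        # prepend, so values come out newest first
}
$1 in table {
    len = split(table[$1], parts)          # trailing FS leaves an empty last element
    for (i=1; i<len; i++)
        print parts[i], $2, $1
}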
$ awk -v first=file_1.txt -f script.awk file_2.txt
204352;568E;16003-Z/VG043
172973;568E;16003/C3
100947;568E;16003/C3
62245;568H;16003/PAB4L
144786;568E;16003
101691;568E;16003
100530;568E;16003

How to get cardinality of fields with AWK?

I am trying to count the unique occurrences for each field in a txt file.
Sample:
2008,12,13,6,1007,847,1149,1010,DL,1631,N909DA,162,143,122,99,80,ATL,IAH,689,8,32,0,,0,1,0,19,0,79
2008,12,13,6,638,640,808,753,DL,1632,N604DL,90,73,50,15,-2,JAX,ATL,270,14,26,0,,0,0,0,15,0,0
2008,12,13,6,756,800,1032,1026,DL,1633,N642DL,96,86,56,6,-4,MSY,ATL,425,23,17,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,612,615,923,907,DL,1635,N907DA,131,112,103,16,-3,GEG,SLC,546,5,23,0,,0,0,0,16,0,0
2008,12,13,6,749,750,901,859,DL,1636,N646DL,72,69,41,2,-1,SAV,ATL,215,20,11,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1002,959,1204,1150,DL,1636,N646DL,122,111,71,14,3,ATL,IAD,533,6,45,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,834,835,1021,1023,DL,1637,N908DL,167,168,139,-2,-1,ATL,SAT,874,5,23,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,655,700,856,856,DL,1638,N671DN,121,116,85,0,-5,PBI,ATL,545,24,12,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1251,1240,1446,1437,DL,1639,N646DL,115,117,89,9,11,IAD,ATL,533,13,13,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1110,1103,1413,1418,DL,1641,N908DL,123,135,104,-5,7,SAT,ATL,874,8,11,0,,0,NA,NA,NA,NA,NA
Full dataset here: https://github.com/markgrover/cloudcon-hive (Flight delay dataset from 2008.)
For a single column we can do:
for i in $(seq 1 28); do cut -d',' -f$i 2008.csv | head |sort | uniq | wc -l ; done |tr '\n' ':' ; echo
Is there a way to do it in one go for all the columns?
I think the expected output looks like this:
1:1:1:1:10:10:10:10:1:10:9:9:6:9:9:9:2:5:5:5:6:1:1:1:3:2:2:2:
For the entire dataset:
1:12:31:7:1441:1217:1441:1378:20:7539:5374:690:526:664:1154:1135:303:304:1435:191:343:2:5:2:985:600:575:157:
With GNU awk for true multi-dimensional arrays:
$ cat tst.awk
BEGIN { FS=","; OFS=":" }
{
for (i=1; i<=NF; i++) {
vals[i][$i]
}
}
END {
for (i=1; i<=NF; i++) {
printf "%s%s", length(vals[i]), (i<NF?OFS:ORS)
}
}
$ awk -f tst.awk file
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
and with any awk:
$ cat tst.awk
BEGIN { FS=","; OFS=":" }
{
for (i=1; i<=NF; i++) {
if ( !seen[i,$i]++ ) {
cnt[i]++
}
}
}
END {
for (i=1; i<=NF; i++) {
printf "%s%s", cnt[i], (i<NF?OFS:ORS)
}
}
$ awk -f tst.awk file
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
In GNU awk:
$ awk '
BEGIN { FS=OFS="," } # delimiters to ,
{
for(i=1;i<=NF;i++) # iterate over every field
a[i][$i] # store unique values to 2d hash
}
END { # after all the records
for(i=1;i<=NF;i++) # iterate the unique values for each field
for(j in a[i])
c[i]++ # count them and
for(i=1;i<=NF;i++)
printf "%s%s",c[i], (i==NF?ORS:OFS) # output the values
}' file
1,1,1,1,10,10,10,10,1,9,7,10,10,10,10,9,8,5,8,8,8,1,1,1,3,2,4,2,3
The output is not exactly the same; I'm not sure if the mistake is yours or mine. Well, the last column has the values 79, 0 and NA, so mine is more accurate on that one.
Another awk:
This gives rolling counts; pipe to tail -1 to keep only the last line, which holds the overall counts.
$ awk -F, -v OFS=: '{for(i=1;i<=NF;i++)
printf "%s%s", NR-(a[i,$i]++?++c[i]:c[i]),(i==NF)?ORS:OFS}' file
1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1
1:1:1:1:2:2:2:2:1:2:2:2:2:2:2:2:2:2:2:2:2:1:1:1:2:1:2:1:2
1:1:1:1:3:3:3:3:1:3:3:3:3:3:3:3:3:2:3:3:3:1:1:1:3:2:3:2:3
1:1:1:1:4:4:4:4:1:4:4:4:4:4:4:4:4:3:4:4:4:1:1:1:3:2:4:2:3
1:1:1:1:5:5:5:5:1:5:5:5:5:5:5:5:5:3:5:5:5:1:1:1:3:2:4:2:3
1:1:1:1:6:6:6:6:1:5:5:6:6:6:6:6:5:4:6:6:6:1:1:1:3:2:4:2:3
1:1:1:1:7:7:7:7:1:6:6:7:7:7:7:6:5:5:7:6:6:1:1:1:3:2:4:2:3
1:1:1:1:8:8:8:8:1:7:7:8:8:8:8:7:6:5:8:7:7:1:1:1:3:2:4:2:3
1:1:1:1:9:9:9:9:1:8:7:9:9:9:9:8:7:5:8:8:8:1:1:1:3:2:4:2:3
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
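
The END-based scripts above loop up to NF, which in END is the field count of the last record read. If rows can differ in length, one hedged variant is to track the widest record explicitly. In this sketch maxnf is a name introduced here and the three-line input is made up.

```shell
# Made-up input: the second row has only two fields.
result=$(printf '%s\n' 'a,x,1' 'b,x' 'a,y,1' |
awk -F, -v OFS=: '
{
    if (NF > maxnf) maxnf = NF             # widest record seen so far
    for (i = 1; i <= NF; i++)
        if (!seen[i, $i]++) cnt[i]++       # first time this value shows up in column i
}
END {
    for (i = 1; i <= maxnf; i++)
        printf "%s%s", cnt[i] + 0, (i < maxnf ? OFS : ORS)
}')

printf '%s\n' "$result"
```

For this input the script prints 2:2:1, since column 1 holds a and b, column 2 holds x and y, and column 3 holds only 1.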

print unique lines based on field

I would like to print unique lines based on the first field, keeping the first occurrence of each line and removing the other duplicate occurrences.
Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
Desired Output:
10,15-10-2014,abc
20,12-10-2014,bcd
40,06-10-2014,ghi
I have tried the command below, but it is incomplete:
awk 'BEGIN { FS = OFS = "," } { !seen[$1]++ } END { for ( i in seen) print $0}' Input.csv
Looking for your suggestions ...
You put your test for "seen" in the action part of the script instead of the condition part. Change it to:
awk -F, '!seen[$1]++' Input.csv
Yes, that's the whole script:
$ cat Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
$
$ awk -F, '!seen[$1]++' Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
40,06-10-2014,ghi
This should also give you what you want:
awk -F, '{ if (!($1 in a)) a[$1] = $0 } END { for (i in a) print a[i] }' Input.csv
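
One caveat with the END-based version: for (i in a) visits keys in an unspecified order, so the lines may not come out in input order, whereas '!seen[$1]++' prints as it reads. If the END form is still preferred, a minimal sketch that keeps first-seen order could look like this. The array names first and order are introduced here for illustration.

```shell
# Same sample data as Input.csv in the question, fed through a pipe.
result=$(printf '%s\n' '10,15-10-2014,abc' '20,12-10-2014,bcd' \
                       '10,09-10-2014,def' '40,06-10-2014,ghi' \
                       '10,15-10-2014,abc' |
awk -F, '
!($1 in first) {            # first time this key appears
    first[$1] = $0          # remember the whole line
    order[++n] = $1         # and the position at which the key arrived
}
END {
    for (i = 1; i <= n; i++) print first[order[i]]
}')

printf '%s\n' "$result"
```

This prints the 10, 20 and 40 lines in the order they first appear, matching the desired output.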