AWK command to simulate full outer join and then compare
Hello guys, I need help building an awk command that can simulate a full outer join and then compare values.
Say
cat File1
1|A|B
2|C|D
3|E|F
cat File2
1|A|X
2|C|D
3|Z|F
Assumptions:
The first column in both files is the key field, so there are no duplicate keys.
Both files have the same structure.
There is no limit on the number of fields.
Now, if I run an awk command of the form
awk -F'|' ........... File1 File2 > output
Output format
<Key>|<File1.column1>|<File2.column1>|<Matched/Mismatched>|<File1.column2>|<File2.column2>|<Matched/Mismatched>|<File1.column3>|<File2.column3>|<Matched/Mismatched>
cat output
1|A|A|MATCHED|B|X|MISMATCHED
2|C|C|MATCHED|D|D|MATCHED
3|E|Z|MISMATCHED|F|F|MATCHED
Thank You
$ awk -v OFS=\| -F\| 'NR==FNR{for(i=2;i<=NF;i++)a[$1][i]=$i;next}{printf "%s",$1;for(i=2;i<=NF;i++){printf"|%s|%s|%s",a[$1][i],$i,a[$1][i]==$i?"matched":"mismatched"}printf"\n"}' file1 file2
1|A|A|matched|B|X|mismatched
2|C|C|matched|D|D|matched
3|E|Z|mismatched|F|F|matched
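The same program expanded and commented (note that the a[$1][i] arrays-of-arrays form requires GNU awk 4.0 or later; the last answer below shows a form that works in any awk):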
BEGIN {
    OFS="|"; FS="|"
}
NR==FNR {                    # for the first file
    for(i=2;i<=NF;i++)       # fill array with "non-key" fields
        a[$1][i]=$i          # and use the "key" field as an index
    next
}
{
    printf "%s",$1
    for(i=2;i<=NF;i++) {     # use the key field to match and print
        printf "|%s|%s|%s",a[$1][i],$i,a[$1][i]==$i?"matched":"mismatched"
    }
    printf "\n"              # sugar on the top
}
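Strictly speaking, a full outer join also has to cover keys that appear in only one of the files; the script above pairs file2-only keys with empty file1 fields, but silently drops keys that exist only in file1. A minimal sketch of that extension, still assuming GNU awk, leaving the missing side's column empty:

$ cat outer.awk
BEGIN { FS="|" }
NR==FNR {                            # first file: remember fields per key
    nf = NF
    for (i=2; i<=NF; i++) a[$1][i] = $i
    next
}
{                                    # second file: keys only here pair with empty file1 fields
    seen[$1]
    printf "%s", $1
    for (i=2; i<=NF; i++)
        printf "|%s|%s|%s", a[$1][i], $i, (a[$1][i]==$i ? "MATCHED" : "MISMATCHED")
    print ""
}
END {                                # keys that appeared only in the first file
    for (k in a) {
        if (k in seen) continue
        printf "%s", k
        for (i=2; i<=nf; i++) printf "|%s||MISMATCHED", a[k][i]
        print ""
    }
}

$ awk -f outer.awk file1 file2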
Perhaps easier with an assist from join:
$ join -t'|' file1 file2 |
awk -F'|' -v OFS='|' '{n="MIS"; m="MATCHED";
m1=($2!=$4?n:"")m;
m2=($3!=$5?n:"")m;
print $1,$2,$4,m1,$3,$5,m2}'
1|A|A|MATCHED|B|X|MISMATCHED
2|C|C|MATCHED|D|D|MATCHED
3|E|Z|MISMATCHED|F|F|MATCHED
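Note that join expects both inputs sorted on the key field and, as used here, performs an inner join: keys present in only one file are dropped. GNU join can keep the unpaired lines and pad the missing side so the same awk still applies; a sketch, assuming GNU coreutils for -o auto:

$ join -t'|' -a1 -a2 -e '' -o auto file1 file2 |
  awk -F'|' -v OFS='|' '{n="MIS"; m="MATCHED";
    m1=($2!=$4?n:"")m;
    m2=($3!=$5?n:"")m;
    print $1,$2,$4,m1,$3,$5,m2}'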
For an unspecified number of fields, a bit more awk is needed:
$ join -t'|' file1 file2 |
  awk -F'|' '{c=(NF-1)/2; printf "%s", $1;
    for(i=2;i<=c+1;i++) printf "|%s|%s|%s", $i,$(i+c),($i!=$(i+c)?"MIS":"")"MATCHED";
    print ""}'
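With the sample files this prints the same three lines as the two-column version above:

1|A|A|MATCHED|B|X|MISMATCHED
2|C|C|MATCHED|D|D|MATCHED
3|E|Z|MISMATCHED|F|F|MATCHED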
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==FNR {
    for (i=2; i<=NF; i++) {
        a[$1,i] = $i
    }
    next
}
{
    printf "%s%s", $1, OFS
    for (i=2; i<=NF; i++) {
        printf "%s%s%s%s%s%s", a[$1,i], OFS, $i, OFS, (a[$1,i]==$i ? "" : "MIS") "MATCHED", (i<NF ? OFS : ORS)
    }
}
$ awk -f tst.awk file1 file2
1|A|A|MATCHED|B|X|MISMATCHED
2|C|C|MATCHED|D|D|MATCHED
3|E|Z|MISMATCHED|F|F|MATCHED
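Unlike the a[$1][i] arrays-of-arrays used in the first answer, a[$1,i] is an ordinary one-dimensional array: awk concatenates the subscripts with the control character held in SUBSEP to form a single string key, so this version runs on any POSIX awk, not just gawk. A quick demonstration with SUBSEP set to something visible:

$ awk 'BEGIN { SUBSEP=":"; a["x",2]; for (k in a) print k }'
x:2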
Related
printing information from two files according to a specific field
I have two files. I need to print information like the example below when the first field exists, and is equal, in both files.

file 1
20;"aaaaaa";99292929
24;"fsfdfa";42933294
30;"fsdsff";23832299
38;"fjsdjl";62673777

file 2
13;"fsdffsdfs";2272777
20;"ffuiiii";23728877
30;"wdwfsdh";8882817
40;"sfjslll";82371111

Expected result:
file1;20;"aaaaaa";99292929;file2;20;"ffuiiii";23728877
file1;30;"fsdsff";23832299;file2;30;"wdwfsdh";8882817

I tried with:
awk 'FNR==NR{a[$1]=$1;next} $1 in a' file2 file1 > newfile
The logic is OK, but I can't show the fields that I want.
This awk should do:
awk -F ';' 'NR==FNR{rec[$1]=FILENAME FS $0} NR>FNR{ if($1 in rec){ print rec[$1] FS FILENAME FS $0 } }' file{1..2}
$ cat tst.awk
BEGIN { FS=OFS=";" }
{ $0 = FILENAME FS $0 }
NR==FNR { a[$2] = $0; next }
$2 in a { print a[$2], $0 }

$ awk -f tst.awk file1 file2
file1;20;"aaaaaa";99292929;file2;20;"ffuiiii";23728877
file1;30;"fsdsff";23832299;file2;30;"wdwfsdh";8882817
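A caveat with the NR==FNR idiom used in both answers: if the first file is empty, NR==FNR stays true while the second file is read, so every line gets treated as file 1. When that can happen, testing the filename is safer; a sketch of the alternative condition applied to tst.awk above:

$ awk 'BEGIN { FS=OFS=";" } { $0 = FILENAME FS $0 } FILENAME==ARGV[1] { a[$2] = $0; next } $2 in a { print a[$2], $0 }' file1 file2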
Print columns from two files
How can I print columns from several files? Following Awk: extract different columns from many different files, I tried:

paste <(awk '{printf "%.4f %.5f ", $1, $2}' FILE.R ) <(awk '{printf "%.6f %.0f.\n", $3, $4}' FILE_R )

and

FILE.R == ARGV[1] { one[FNR]=$1 }
FILE.R == ARGV[2] { two[FNR]=$2 }
FILE_R == ARGV[3] { three[FNR]=$3 }
FILE_R == ARGV[4] { four[FNR]=$4 }
END {
    for (i=1; i<=length(one); i++) {
        print one[i], two[i], three[i], four[i]
    }
}

but I don't understand how to use this script.

FILE.R:
56604.6017 2.3893 2.2926 2.2033
56605.1562 2.3138 2.2172 2.2033

FILE_R:
56604.6017 2.29259 0.006699 42.
56605.1562 2.21716 0.007504 40.

Output desired:
56604.6017 2.3893 0.006699 42.
56605.1562 2.3138 0.007504 40.

Thank you
This is one way:

$ awk -v OFS="\t" 'NR==FNR{a[$1]=$2;next}{print $1,a[$1],$3,$4}' file1 file2

Output:
56604.6017 2.3893 0.006699 42.
56605.1562 2.3138 0.007504 40.

Explained:

$ awk -v OFS="\t" '        # set the output field separator to a tab
NR==FNR {                  # process the first file
    a[$1]=$2               # hash the second field, using the first as key
    next
}
{
    print $1,a[$1],$3,$4   # output
}' file1 file2

If tab spacing of the fields is not enough, use printf with modifiers like in your sample.
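For that last point, a sketch of the printf variant, with width modifiers picked to match the sample (the first file FILE.R supplies the second column, the second file FILE_R the rest):

$ awk 'NR==FNR{a[$1]=$2;next}{printf "%.4f %.4f %.6f %.0f.\n", $1, a[$1], $3, $4}' FILE.R FILE_R
56604.6017 2.3893 0.006699 42.
56605.1562 2.3138 0.007504 40.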
match two files with awk and output selected fields
I want to compare two files delimited with ; that share the same field1, and output field2 of file1, then field2 of file2, then field1.

File1:
16003-Z/VG043;204352
16003/C3;100947
16003/C3;172973
16003/PAB4L;62245
16003;100530
16003;101691
16003;144786

File2:
16003-Z/VG043;568E;0540575;2.59
16003/C3;568E;0000340;2.53
16003/PAB4L;568H;0606738;9.74
16003;568E;0000339;0.71
16003TN9/C3;568E;0042261;3.29

Desired output:
204352;568E;16003-Z/VG043
100947;568E;16003/C3
172973;568E;16003/C3
62245;568H;16003/PAB4L
100530;568E;16003
101691;568E;16003
144786;568E;16003

My try:
awk -F\, '{FS=";"} NR==FNR {a[$1]; next} ($1) in a{ print a[$2]";"$2";"$3}' File1 File2 > Output

The above is not working, probably because awk is still obscure to me. What is driving the output? What do $1, $2, etc. refer to? The a[$2] in my intention is field2 of file1... but it is not. What I get is:

;204352;16003-Z/VG043
;100947;16003/C3
;172973;16003/C3
;62245;16003/PAB4L
;100530;16003
;101691;16003
;144786;16003

Thanks for helping
This might be what you are after:

awk -F";" '(NR==FNR) { a[$1] = ($1 in a ? a[$1] FS : "") $2; next }
           ($1 in a) { split(a[$1],b); for(i in b) print b[i] FS $2 FS $1 }' file1 file2

This outputs:
204352;568E;16003-Z/VG043
100947;568E;16003/C3
172973;568E;16003/C3
62245;568H;16003/PAB4L
100530;568E;16003
101691;568E;16003
144786;568E;16003
This approach first reads the file named by the variable first (file_1.txt here) into an associative array table, to associate ids and values across the files. Then, looping over the second file file_2.txt, it prints the values in table that match the id field of that file, along with the current value:

BEGIN {
    FS=OFS=";"
    while (getline < first) table[$1] = $2 FS table[$1]
}
$1 in table {
    len = split(table[$1], parts)
    for (i=1; i<len; i++) print parts[i], $2, $1
}

$ awk -v first=file_1.txt -f script.awk file_2.txt
204352;568E;16003-Z/VG043
172973;568E;16003/C3
100947;568E;16003/C3
62245;568H;16003/PAB4L
144786;568E;16003
101691;568E;16003
100530;568E;16003
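Note that the duplicate matches come out in different orders in the two answers: for (i in b) visits elements in an unspecified order, and the getline version prepends each value so duplicates print last-in first-out. If input order matters, indexing by a per-key counter is the usual trick; a sketch that reproduces the desired output in file order:

awk -F';' 'NR==FNR { b[$1, ++n[$1]] = $2; next }
           $1 in n { for (i=1; i<=n[$1]; i++) print b[$1,i] FS $2 FS $1 }' File1 File2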
How to get cardinality of fields with AWK?
I am trying to count the unique occurrences for each field in a txt file.

Sample:
2008,12,13,6,1007,847,1149,1010,DL,1631,N909DA,162,143,122,99,80,ATL,IAH,689,8,32,0,,0,1,0,19,0,79
2008,12,13,6,638,640,808,753,DL,1632,N604DL,90,73,50,15,-2,JAX,ATL,270,14,26,0,,0,0,0,15,0,0
2008,12,13,6,756,800,1032,1026,DL,1633,N642DL,96,86,56,6,-4,MSY,ATL,425,23,17,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,612,615,923,907,DL,1635,N907DA,131,112,103,16,-3,GEG,SLC,546,5,23,0,,0,0,0,16,0,0
2008,12,13,6,749,750,901,859,DL,1636,N646DL,72,69,41,2,-1,SAV,ATL,215,20,11,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1002,959,1204,1150,DL,1636,N646DL,122,111,71,14,3,ATL,IAD,533,6,45,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,834,835,1021,1023,DL,1637,N908DL,167,168,139,-2,-1,ATL,SAT,874,5,23,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,655,700,856,856,DL,1638,N671DN,121,116,85,0,-5,PBI,ATL,545,24,12,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1251,1240,1446,1437,DL,1639,N646DL,115,117,89,9,11,IAD,ATL,533,13,13,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1110,1103,1413,1418,DL,1641,N908DL,123,135,104,-5,7,SAT,ATL,874,8,11,0,,0,NA,NA,NA,NA,NA

Full dataset here: https://github.com/markgrover/cloudcon-hive (flight delay dataset from 2008).

For a single column we can do:
for i in $(seq 1 28); do cut -d',' -f$i 2008.csv | head | sort | uniq | wc -l ; done | tr '\n' ':' ; echo

Is there a way to do it in one go for all the columns? I think the expected output looks like this:
1:1:1:1:10:10:10:10:1:10:9:9:6:9:9:9:2:5:5:5:6:1:1:1:3:2:2:2:

For the entire dataset:
1:12:31:7:1441:1217:1441:1378:20:7539:5374:690:526:664:1154:1135:303:304:1435:191:343:2:5:2:985:600:575:157:
With GNU awk for true multi-dimensional arrays:

$ cat tst.awk
BEGIN { FS=","; OFS=":" }
{
    for (i=1; i<=NF; i++) {
        vals[i][$i]
    }
}
END {
    for (i=1; i<=NF; i++) {
        printf "%s%s", length(vals[i]), (i<NF?OFS:ORS)
    }
}

$ awk -f tst.awk file
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3

and with any awk:

$ cat tst.awk
BEGIN { FS=","; OFS=":" }
{
    for (i=1; i<=NF; i++) {
        if ( !seen[i,$i]++ ) {
            cnt[i]++
        }
    }
}
END {
    for (i=1; i<=NF; i++) {
        printf "%s%s", cnt[i], (i<NF?OFS:ORS)
    }
}

$ awk -f tst.awk file
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
In GNU awk:

$ awk '
BEGIN { FS=OFS="," }          # delimiters to ,
{
    for(i=1;i<=NF;i++)        # iterate over every field
        a[i][$i]              # store unique values to 2d hash
}
END {                         # after all the records
    for(i=1;i<=NF;i++)        # iterate the unique values for each field
        for(j in a[i])
            c[i]++            # count them and
    for(i=1;i<=NF;i++)
        printf "%s%s",c[i], (i==NF?ORS:OFS)   # output the values
}' file
1,1,1,1,10,10,10,10,1,9,7,10,10,10,10,9,8,5,8,8,8,1,1,1,3,2,4,2,3

The output is not exactly the same as yours; I'm not sure whether the mistake is yours or mine. The last column has the values 79, 0 and NA, so mine looks more accurate on that one.
Another awk. This gives you rolling counts; pipe to tail -1 to get the last line with the overall counts:

$ awk -F, -v OFS=: '{for(i=1;i<=NF;i++) printf "%s%s", NR-(a[i,$i]++?++c[i]:c[i]),(i==NF)?ORS:OFS}' file
1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1
1:1:1:1:2:2:2:2:1:2:2:2:2:2:2:2:2:2:2:2:2:1:1:1:2:1:2:1:2
1:1:1:1:3:3:3:3:1:3:3:3:3:3:3:3:3:2:3:3:3:1:1:1:3:2:3:2:3
1:1:1:1:4:4:4:4:1:4:4:4:4:4:4:4:4:3:4:4:4:1:1:1:3:2:4:2:3
1:1:1:1:5:5:5:5:1:5:5:5:5:5:5:5:5:3:5:5:5:1:1:1:3:2:4:2:3
1:1:1:1:6:6:6:6:1:5:5:6:6:6:6:6:5:4:6:6:6:1:1:1:3:2:4:2:3
1:1:1:1:7:7:7:7:1:6:6:7:7:7:7:6:5:5:7:6:6:1:1:1:3:2:4:2:3
1:1:1:1:8:8:8:8:1:7:7:8:8:8:8:7:6:5:8:7:7:1:1:1:3:2:4:2:3
1:1:1:1:9:9:9:9:1:8:7:9:9:9:9:8:7:5:8:8:8:1:1:1:3:2:4:2:3
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
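The cryptic middle expression subtracts a running duplicate count from the row count: a[i,$i]++ is zero (false) the first time a value appears in column i, so c[i] is incremented only on repeats, and NR-c[i] is the number of distinct values seen so far. For only the overall counts:

$ awk -F, -v OFS=: '{for(i=1;i<=NF;i++) printf "%s%s", NR-(a[i,$i]++?++c[i]:c[i]),(i==NF)?ORS:OFS}' file | tail -1
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3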
print unique lines based on field
I would like to print unique lines based on the first field, keeping the first occurrence of each line and removing the other, duplicate occurrences.

Input.csv:
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc

Desired output:
10,15-10-2014,abc
20,12-10-2014,bcd
40,06-10-2014,ghi

I have tried the command below, but it is incomplete:
awk 'BEGIN { FS = OFS = "," } { !seen[$1]++ } END { for ( i in seen) print $0}' Input.csv
Looking for your suggestions...
You put your test for "seen" in the action part of the script instead of the condition part. Change it to:

awk -F, '!seen[$1]++' Input.csv

Yes, that's the whole script:

$ cat Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
$
$ awk -F, '!seen[$1]++' Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
40,06-10-2014,ghi
This should give you what you want:

awk -F, '{ if (!($1 in a)) a[$1] = $0; } END '{ for (i in a) print a[i]}' input.csv
There's a typo in the syntax there (a stray quote after END), and the -F, must stay so $1 is the comma-separated key:

awk -F, '{ if (!($1 in a)) a[$1] = $0 } END { for (i in a) print a[i] }' input.csv
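Bear in mind that for (i in a) visits keys in an unspecified order, so even the corrected version may not keep the input order the question asks for; the !seen[$1]++ one-liner above preserves both the first occurrence and the original order.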