awk compare two files and merge output

Original question
I have 2 files 1.csv and 2.csv
1.csv:-
AK,BA,Alpha,1095
ALL,SA,Alpha,9592
2.csv:-
AK,BA,SPAM,10
I want to merge files so that it will print output file as below
OUTPUT:-
AK,BA,Alpha,1095,SPAM,10
ALL,SA,Alpha,9592,NA,NA
Updated question
I have 2 files alpha1.csv and SPAM1.csv
$ cat alpha1.csv
AKTEL_BANGLADESH,BANGLADESH,Alphanumeric_A_MSISDN_blocking,1095
ALJAWAL_SAUDI_TELECOM_COMPANY,SAUDI_ARABIA,Alphanumeric_A_MSISDN_blocking,9592
B-MOBILE_BRUNEI,BRUNEI,Alphanumeric_A_MSISDN_blocking,3
$ cat SPAM1.csv
AIN_AIS_GLOBAL_COMMUNICATIONS,THAILAND,SPAM_CHAIN_SMS_REJECT(Spam_Detection_and_Blocking),1
AKTEL_BANGLADESH,BANGLADESH,SPAM_CHAIN_SMS_REJECT(Spam_Detection_and_Blocking),16
ALJAWAL_SAUDI_TELECOM_COMPANY,SAUDI_ARABIA,SPAM_CHAIN_SMS_REJECT(Spam_Detection_and_Blocking),10593
AT&T_WIRELESS,UNITED_STATES,SPAM_CHAIN_SMS_REJECT(Spam_Detection_and_Blocking),218
BANGLALINK_SHEBA_BANGLADESH,BANGLADESH,SPAM_CHAIN_SMS_REJECT(Spam_Detection_and_Blocking),111
expected output:
AIN_AIS_GLOBAL_COMMUNICATIONS,THAILAND,SPAM_CHAIN_SMS_REJECT(Spam_Detection_and_Blocking),1,**NA,NA**
AKTEL_BANGLADESH,BANGLADESH,SPAM_CHAIN_SMS_REJECT(Spam_Detection_and_Blocking),16,Alphanumeric_A_MSISDN_blocking,1095
ALJAWAL_SAUDI_TELECOM_COMPANY,SAUDI_ARABIA,SPAM_CHAIN_SMS_REJECT(Spam_Detection_and_Blocking),10593,Alphanumeric_A_MSISDN_blocking,9592
AT&T_WIRELESS,UNITED_STATES,SPAM_CHAIN_SMS_REJECT(Spam_Detection_and_Blocking),218,**NA,NA**
BANGLALINK_SHEBA_BANGLADESH,BANGLADESH,SPAM_CHAIN_SMS_REJECT(Spam_Detection_and_Blocking),111,**NA,NA**
B-MOBILE_BRUNEI,BRUNEI,**NA,NA**,Alphanumeric_A_MSISDN_blocking,3
My command is only printing matched cases of file two with file 1 and not printing non matched cases:
$ awk 'BEGIN{FS=OFS=","} FNR==NR {a[$1,$2]=$3 FS $4; next} {print $0, (i=a[$1,$2]?a[$1,$2]:"NA,NA")}' alpha1.csv SPAM1.csv
AIN_AIS_GLOBAL_COMMUNICATIONS,THAILAND,SPAM_CHAIN_SMS_REJECT(Spam_Detection_and_Blocking),1,NA,NA
AKTEL_BANGLADESH,BANGLADESH,SPAM_CHAIN_SMS_REJECT(Spam_Detection_and_Blocking),16,Alphanumeric_A_MSISDN_blocking,1095
ALJAWAL_SAUDI_TELECOM_COMPANY,SAUDI_ARABIA,SPAM_CHAIN_SMS_REJECT(Spam_Detection_and_Blocking),10593,Alphanumeric_A_MSISDN_blocking,9592
AT&T_WIRELESS,UNITED_STATES,SPAM_CHAIN_SMS_REJECT(Spam_Detection_and_Blocking),218,NA,NA
BANGLALINK_SHEBA_BANGLADESH,BANGLADESH,SPAM_CHAIN_SMS_REJECT(Spam_Detection_and_Blocking),111,NA,NA

You can use this, for example:
$ awk 'BEGIN{FS=OFS=","} FNR==NR {a[$1,$2]=$3 FS $4; next} {print $0, (($1,$2) in a?a[$1,$2]:"NA,NA")}' f2 f1
AK,BA,Alpha,1095,SPAM,10
ALL,SA,Alpha,9592,NA,NA
Explanation
BEGIN{FS=OFS=","} sets the input and output field separators to a comma.
FNR==NR {a[$1,$2]=$3 FS $4; next} runs while the first file is being read: it stores the 3rd and 4th fields in the array a[], whose index is the tuple ($1,$2).
{print $0, (($1,$2) in a?a[$1,$2]:"NA,NA")} runs for the second file: it prints the line together with the matched item from the array. If there is no such element, it prints NA,NA.
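The command above keeps every line of the second file, but rows that exist only in the first file (such as B-MOBILE_BRUNEI in the updated question) are never printed. A minimal sketch of one way to cover both directions, assuming the ($1,$2) key is unique within each file; the names order, seen and full_join are illustrative choices and not part of the original answer:

```shell
#!/bin/sh
# Sketch of a full outer join on ($1,$2); sample data from the updated question.
dir=$(mktemp -d)
cat > "$dir/alpha1.csv" <<'EOF'
AKTEL_BANGLADESH,BANGLADESH,Alphanumeric_A_MSISDN_blocking,1095
ALJAWAL_SAUDI_TELECOM_COMPANY,SAUDI_ARABIA,Alphanumeric_A_MSISDN_blocking,9592
B-MOBILE_BRUNEI,BRUNEI,Alphanumeric_A_MSISDN_blocking,3
EOF
cat > "$dir/SPAM1.csv" <<'EOF'
AIN_AIS_GLOBAL_COMMUNICATIONS,THAILAND,SPAM_CHAIN_SMS_REJECT(Spam_Detection_and_Blocking),1
AKTEL_BANGLADESH,BANGLADESH,SPAM_CHAIN_SMS_REJECT(Spam_Detection_and_Blocking),16
ALJAWAL_SAUDI_TELECOM_COMPANY,SAUDI_ARABIA,SPAM_CHAIN_SMS_REJECT(Spam_Detection_and_Blocking),10593
AT&T_WIRELESS,UNITED_STATES,SPAM_CHAIN_SMS_REJECT(Spam_Detection_and_Blocking),218
BANGLALINK_SHEBA_BANGLADESH,BANGLADESH,SPAM_CHAIN_SMS_REJECT(Spam_Detection_and_Blocking),111
EOF

full_join() {
    awk 'BEGIN { FS = OFS = "," }
    FNR == NR {                       # first file (alpha1.csv)
        k = $1 FS $2
        a[k] = $3 FS $4               # remember its 3rd and 4th fields
        order[++n] = k                # remember input order for the END block
        next
    }
    {                                 # second file (SPAM1.csv)
        k = $1 FS $2
        seen[k] = 1                   # mark the key as matched or printed
        print $0, (k in a ? a[k] : "NA,NA")
    }
    END {                             # rows present only in the first file
        for (i = 1; i <= n; i++)
            if (!(order[i] in seen))
                print order[i], "NA,NA", a[order[i]]
    }' "$1" "$2"
}

out=$(full_join "$dir/alpha1.csv" "$dir/SPAM1.csv")
printf '%s\n' "$out"
rm -rf "$dir"
```

Rows found only in the first file are appended at the end, in their original order, with NA,NA in the position where the second file's columns would be.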

Related

How to join two CSV files by a temporary common column in awk?

I have two CSV files in the form of
file1
A,44
A,21
B,65
C,79
file2
A,7
B,4
C,11
I used awk as
awk -F, 'NR==FNR{a[$1]=$0;next} ($1 in a){print a[$1]","$2 }' file1.csv file2.csv
producing
A,44,7
A,21,7
B,65,4
C,79,11
a[$1] prints the entire line from file1. How can I omit the first columns in both files (the first column is only used to match the second columns) to produce:
44,7
21,7
65,4
79,11
In other words, how can I pass the columns from the first file to the print block, as $2 does for the second file?

Could you please try the following; written and tested on the shown samples only.
awk 'BEGIN{FS=OFS=","} FNR==NR{a[$1]=$2;next} ($1 in a){print $2,a[$1]}' file2 file1
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=OFS="," ##Setting field and output field separator as comma here.
}
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when file2 is being read.
a[$1]=$2 ##Creating array a with index $1 and value is $2 from current line.
next ##next will skip all further statement from here.
}
($1 in a){ ##Statements from here will be executed when file1 is being read and it is checking if $1 is present in array a then do following.
print $2,a[$1] ##Printing 2nd field and value of array a with index $1 here.
}
' file2 file1 ##Mentioning Input_file names here.
Output will be as follows for shown samples.
44,7
21,7
65,4
79,11
2nd solution: A more generic solution, for the case where both Input_files could have duplicate keys. It pairs the 1st value of A in Input_file1 with the 1st value of A in Input_file2, the 2nd with the 2nd, and so on.
awk '
BEGIN{
FS=OFS=","
}
FNR==NR{
a[$1]
b[$1,++c[$1]]=$2
next
}
($1 in a){
print $2,b[$1,++d[$1]]
}
' file2 file1
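To see the pairing behaviour of the 2nd solution, here is a small sketch that runs it on the question's file1 and on a file2 with one extra row. The second A row in file2 (A,8) is a hypothetical addition made only for this illustration:

```shell
#!/bin/sh
# file1 is from the question; the "A,8" row in file2 is a hypothetical addition.
dir=$(mktemp -d)
printf 'A,44\nA,21\nB,65\nC,79\n' > "$dir/file1"
printf 'A,7\nA,8\nB,4\nC,11\n'    > "$dir/file2"

out=$(awk '
BEGIN { FS = OFS = "," }
FNR == NR {                   # first argument: file2
    a[$1]                     # note that the key exists
    b[$1, ++c[$1]] = $2       # n-th value seen for this key
    next
}
($1 in a) {                   # second argument: file1
    print $2, b[$1, ++d[$1]]  # n-th occurrence here gets the n-th stored value
}' "$dir/file2" "$dir/file1")

printf '%s\n' "$out"
# prints:
# 44,7
# 21,8
# 65,4
# 79,11
rm -rf "$dir"
```

If file1 has more occurrences of a key than file2, the extra lines are printed with an empty second column, because the corresponding element of b does not exist.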

You can join them using the join command and choose which fields you want to have in the output:
kent$ join -t',' -o 1.2,2.2 file1 file2
44,7
21,7
65,4
79,11
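Note that join expects both inputs to be sorted on the join field, which happens to be true for the samples in the question. A hedged sketch for unsorted input follows; the shuffled rows are hypothetical, and LC_ALL=C is set so that sort and join agree on the collation order:

```shell
#!/bin/sh
# Hypothetical unsorted variants of the two files from the question.
dir=$(mktemp -d)
printf 'C,79\nA,44\nB,65\n' > "$dir/file1"
printf 'B,4\nC,11\nA,7\n'   > "$dir/file2"

# Sort each file on the first comma-separated field before joining.
LC_ALL=C sort -t, -k1,1 "$dir/file1" > "$dir/file1.sorted"
LC_ALL=C sort -t, -k1,1 "$dir/file2" > "$dir/file2.sorted"

out=$(LC_ALL=C join -t, -o 1.2,2.2 "$dir/file1.sorted" "$dir/file2.sorted")
printf '%s\n' "$out"
# prints:
# 44,7
# 65,4
# 79,11
rm -rf "$dir"
```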

Understanding how OFS works in AWK

This is a follow-up to my question to understand more about the OFS in AWK.
My understanding is, set it once in the beginning and it will be used in "print" to separate the fields. However, it didn't work as expected, as explained in my original question.
My File: someone.txt
LN_A,FN_A<aa#xyz.com>;
LN_B,FN_B<bb#xyz.com>;
Expected output:
FN_A,LN_A,aa
FN_B,LN_B,bb
I have tried the following:
awk -F'[,<#]' -v OFS=',' '{print $2 $1 $3}' someone.txt
awk -F'[,<#]' -v OFS=',' 'NF=3 {print $2 $1 $3}' someone.txt
awk -F'[,<#]' -v OFS=',' 'NF=3; {print $2 $1 $3}' someone.txt
awk -F'[,<#]' -v OFS=',' '{$1=$1} {print $2 $1 $3}' someone.txt
awk -F'[,<#]' -v OFS=',' '{$1=$1} {print $0}' someone.txt
Finally, I managed to get the required output with the following:
awk -F'[,<#]' '{print $2 "," $1 "," $3}' someone.txt

Consider these cases:
a) $ echo '1 2 3' | awk '{print}'
1 2 3
b) $ echo '1 2 3' | awk '{print $1, $2, $3}'
1 2 3
c) $ echo '1 2 3' | awk -v OFS=',' '{print}'
1 2 3
d) $ echo '1 2 3' | awk -v OFS=',' '{print $1, $2, $3}'
1,2,3
e) $ echo '1 2 3' | awk -v OFS=',' '{$1=$1; print}'
1,2,3
The above show OFS being used in "b" and "d" (when individual fields are being printed in a comma-separated list) and in "e" (when the record $0 is being reconstructed as a result of a value being assigned to a field before the record is printed).
Those are the only 2 times when OFS is used implicitly - when printing a comma-separated list of values and when reconstructing the record.
When you print the record (e.g. by print or print $0) as in "a" and "c" above or print any other string you are not using OFS. OFS may have been used earlier to reconstruct the record as in "e" above but the act of printing anything that's not a comma-separated list is not using OFS, it's just printing any old string which just happens to be $0 in this case.
Note:
1. Explicitly changing a field reconstructs $0 from the existing fields using OFS between the fields, it does not resplit $0 into fields again so FS is not used in this process. So $1=$1 or sub(/1/,2,$1) uses OFS but not FS.
2. Explicitly changing $0 (i.e. not implicitly as a result of 1 above) resplits $0 into fields using FS as the separator, it does not use OFS in any way. So $0=$0 or sub(/1/,2) uses FS but not OFS.
Understanding how FS and OFS work together and how they affect assignments to fields and $0 is very important. If you can explain this behavior then you've got it:
f) $ echo 'a b' | awk -v OFS=',' '{print NF, $0, $1, $2}'
2,a b,a,b
g) $ echo 'a b' | awk -v OFS=',' '{$1=$1; print NF, $0, $1, $2}'
2,a,b,a,b
h) $ echo 'a b' | awk -v OFS=',' '{$1=$1; $0=$0; print NF, $0, $1, $2}'
1,a,b,a,b,
i) $ echo 'a b' | awk -v OFS=',' '{$1=$1; $0=$0; FS=OFS; print NF, $0, $1, $2}'
1,a,b,a,b,
j) $ echo 'a b' | awk -v OFS=',' '{$1=$1; $0=$0; FS=OFS; $1=$1; print NF, $0, $1, $2}'
1,a,b,a,b,
k) $ echo 'a b' | awk -v OFS=',' '{$1=$1; $0=$0; FS=OFS; $1=$1; $0=$0; print NF, $0, $1, $2}'
2,a,b,a,b
If not then feel free to ask questions.
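For anyone who wants to trace cases "f" to "k" step by step, the following sketch prints NF, the record and the first field after each kind of assignment. The helper function show and its labels are made up for this illustration:

```shell
#!/bin/sh
# Trace how field assignment and record assignment interact with OFS and FS.
out=$(echo 'a b' | awk -v OFS=',' '
function show(label) {
    printf "%s: NF=%d rec=[%s] f1=[%s]\n", label, NF, $0, $1
}
{
    show("as read")
    $1 = $1                      # field assignment: record rebuilt with OFS, no resplit
    show("field assigned")
    $0 = $0                      # record assignment: resplit with FS (still whitespace)
    show("record reassigned")
    FS = OFS                     # changing FS alone does not resplit anything
    $0 = $0                      # this resplit now uses the comma
    show("FS changed, record reassigned")
}')

printf '%s\n' "$out"
# prints:
# as read: NF=2 rec=[a b] f1=[a]
# field assigned: NF=2 rec=[a,b] f1=[a]
# record reassigned: NF=1 rec=[a,b] f1=[a,b]
# FS changed, record reassigned: NF=2 rec=[a,b] f1=[a]
```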

It is simple: you have set OFS="," at the start of your awk command, but you are printing the fields side by side (NOTE: without editing the line and without a comma between the fields). In that case OFS does not come into the picture, which is why your output has no separator.
awk -F'[,<#]' -v OFS=',' '{print $2,$1,$3}' Input_file
If you use the above command, where I have put a , between the printed fields, you will see OFS in the output; this is how it works.
Or, in case you want to see OFS in use, you could use this (the solution above is the BEST one; I am adding this one only for your understanding).
awk -F'[,<#]' -v OFS=',' '{$0=$2 OFS $1 OFS $3} 1' Input_file
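The difference between concatenating fields and printing a comma-separated list can be checked directly on the data from the question. This is only a small sketch; the variable names concat and listed are arbitrary:

```shell
#!/bin/sh
# someone.txt as shown in the question.
dir=$(mktemp -d)
printf 'LN_A,FN_A<aa#xyz.com>;\nLN_B,FN_B<bb#xyz.com>;\n' > "$dir/someone.txt"

# Fields next to each other are concatenated; OFS is not involved.
concat=$(awk -F'[,<#]' -v OFS=',' '{print $2 $1 $3}' "$dir/someone.txt")

# Fields separated by commas form a print list; OFS is placed between them.
listed=$(awk -F'[,<#]' -v OFS=',' '{print $2, $1, $3}' "$dir/someone.txt")

printf 'concatenated:\n%s\nlisted:\n%s\n' "$concat" "$listed"
# prints:
# concatenated:
# FN_ALN_Aaa
# FN_BLN_Bbb
# listed:
# FN_A,LN_A,aa
# FN_B,LN_B,bb
rm -rf "$dir"
```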
Example to understand OFS by printing whole line(s): Let us understand it more clearly by printing the whole line with and without the OFS effect.
Let us run this code:
awk -F'[,<#]' -v OFS=',' 'FNR==1{$1=$1} 1' Input_file
What it does: on line number 1, I am resetting $1 to its own value, as mentioned above, so that OFS comes into the picture (of course, wherever a field separator was found, the OFS value is placed there). This is done only for the first line; nothing should happen to the rest of the lines. Let us see what output comes now:
LN_A,FN_A,aa,xyz.com>;
LN_B,FN_B<bb#xyz.com>;
You see the difference? The first line has , in the output and the 2nd line is printed as it is, because we edited the first field only on the 1st line, so OFS came into the picture only there.

As I just found an unused copy of Aho, Kernighan, Weinberger: The AWK Programming language from 1988, I(t)'ll take you to the source (pages 35-36):
"Field Variables. The fields of the current input line are called $1, $2,
through $NF; $0 refers to the whole line. Fields share the properties of other
variables — they may be used in arithmetic or string operations, and may be
assigned to. - -
One can assign a new string to a field:
BEGIN { FS = OFS = "\t" }
$4 == "North America" { $4 = "NA" }
$4 == "South America" { $4 = "SA" }
{ print }
In this program, the BEGIN action sets FS, the variable that controls the input
field separator, and OFS, the output field separator, both to a tab. The print
statement in the fourth line prints the value of $0 after it has been modified by
previous assignments. This is important: when $0 is changed by assignment or
substitution, $1, $2, etc., and NF will be recomputed; likewise, when one of $1, $2, etc., is changed, $0 is reconstructed using OFS to separate fields."
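The quoted program can be tried as follows. The three tab-separated input rows are illustrative stand-ins (country, area, population, continent) and should not be read as the book's exact data file:

```shell
#!/bin/sh
# Illustrative tab-separated input; the rows are assumptions made for this demo.
out=$(printf 'Canada\t3852\t25\tNorth America\nBrazil\t3286\t134\tSouth America\nIndia\t1267\t746\tAsia\n' |
awk 'BEGIN { FS = OFS = "\t" }
$4 == "North America" { $4 = "NA" }   # assigning to $4 rebuilds $0 using OFS
$4 == "South America" { $4 = "SA" }
{ print }')

printf '%s\n' "$out"
```

Because OFS is also a tab, the rebuilt lines keep the same layout as the untouched ones. With the default OFS, the two modified lines would come out space-separated while the unmodified line would keep its tabs.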

Awk command to compare specific columns in file1 to file2 and display output

File1
111,222,560,0.7
111,333,560,0.2
111,444,560,0.1
File2
2017,111,560,0.0537
2018,111,560,0.0296
2019,111,560,0.0624
Desired output:
2017,111,560,0.0537,222,0.7
2018,111,560,0.0296,222,0.7
2019,111,560,0.0624,222,0.7
2017,111,560,0.0537,333,0.2
2018,111,560,0.0296,333,0.2
2019,111,560,0.0624,333,0.2
2017,111,560,0.0537,444,0.1
2018,111,560,0.0296,444,0.1
2019,111,560,0.0624,444,0.1
I tried awk NR==FNR command but it’s displaying only the last matched...
reads every line and check if column 1 and 3 of file1 exists in file2:
2017,111,560,0.0537,444,0.1
2018,111,560,0.0296,444,0.1
2019,111,560,0.0624,444,0.1

I tried awk NR==FNR command but it’s displaying only the last
matched...
reads every line and check if column 1 and 3 of file1 exists in file2:
Using awk and sort
awk 'BEGIN{
# set input and output field separator
FS=OFS=","
}
# for every line, build the key from field1 and field3
# (this is the key for file1 (f1); it is overridden further down for f2)
{
k=$1 FS $3
}
# save 2nd and last field of file1 (f1) in array a, key being k
FNR==NR{
a[k]=(k in a ? a[k] RS:"") $2 OFS $NF;
# stop processing go to next line
next
}
# read 2nd file f2 from here
# 2nd and 3rd field of file2 (f2) used as key
{
k=$2 FS $3
}
# if key exists in array a
k in a{
# split array value by RS row separator, and put it in array t
split(a[k],t,RS);
# iterate array t, print and sort
for(i=1; i in t; i++)
print $0,t[i] | "sort -t, -nk5"
}
' f1 f2
Test Results:
$ cat f1
111,222,560,0.7
111,333,560,0.2
111,444,560,0.1
$ cat f2
2017,111,560,0.0537
2018,111,560,0.0296
2019,111,560,0.0624
$ awk 'BEGIN{FS=OFS=","}{k=$1 FS $3}FNR==NR{a[k]=(k in a ? a[k] RS:"") $2 OFS $NF; next}{k=$2 FS $3}k in a{split(a[k],t,RS); for(i=1; i in t; i++)print $0,t[i] | "sort -t, -nk5" }' f1 f2
2017,111,560,0.0537,222,0.7
2018,111,560,0.0296,222,0.7
2019,111,560,0.0624,222,0.7
2017,111,560,0.0537,333,0.2
2018,111,560,0.0296,333,0.2
2019,111,560,0.0624,333,0.2
2017,111,560,0.0537,444,0.1
2018,111,560,0.0296,444,0.1
2019,111,560,0.0624,444,0.1

The following awk may help you with this.
awk -F, '
FNR==NR{
a[FNR]=$0;
next
}
{
for(i=1;i<=length(a);i++){
print a[i] FS $2 FS $NF
}
}' Input_file2 Input_file1
Adding explanation too for code as follows.
awk -F, ' ##Setting field separator as comma here for all the lines.
FNR==NR{ ##Using FNR==NR condition which will be TRUE only when the first Input_file, named File2, is being read.
##FNR and NR both hold the number of lines read; the only difference is that FNR is RESET whenever a new file is read, while NR keeps increasing until all Input_files are read.
a[FNR]=$0; ##Creating an array named a whose index is FNR(current line number) and whose value is the current line.
next ##Using next statement will skip all further statements now.
}
{
for(i=1;i<=length(a);i++){##Starting a for loop from variable i value from 1 to length of array a value. This will be executed on 2nd Input_file reading.
print a[i] FS $2 FS $NF ##Printing the value of array a whose index is variable i and printing 2nd and last field of current line.
}
}' File2 File1 ##Mentioning the Input_file names here.
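Note that the loop above pairs every line of File1 with every line of File2 without comparing any columns, which works for the sample only because all keys match. A hedged variant that adds the comparison asked for in the question (columns 1 and 3 of File1 against columns 2 and 3 of File2) could look like this; the 999 row is a hypothetical non-matching line added to show the filter:

```shell
#!/bin/sh
# File1 and File2 are from the question; the 999 row is a hypothetical addition.
dir=$(mktemp -d)
printf '111,222,560,0.7\n111,333,560,0.2\n111,444,560,0.1\n999,555,560,0.9\n' > "$dir/File1"
printf '2017,111,560,0.0537\n2018,111,560,0.0296\n2019,111,560,0.0624\n'    > "$dir/File2"

out=$(awk -F, '
FNR == NR { line[++n] = $0; next }        # first argument: File2, kept in input order
{                                         # second argument: File1
    for (i = 1; i <= n; i++) {
        split(line[i], f, ",")
        if (f[2] == $1 && f[3] == $3)     # File2 columns 2,3 against File1 columns 1,3
            print line[i] FS $2 FS $NF
    }
}' "$dir/File2" "$dir/File1")

printf '%s\n' "$out"
rm -rf "$dir"
```

Counting lines in n while reading also avoids calling length() on an array, which not every awk implementation accepts.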

another one with join/awk
$ join -t, -j99 file2 file1 |
awk -F, -v OFS=, '$3==$6 && $4==$8 {print $2,$3,$4,$5,$7,$9}'

awk Compare 2 files, print match and print just 2 columns of the second file

I am a novice and I am sure it is a silly question, but I searched and didn't find an answer.
I want to select just 2 columns of my file 2. I know how to select one column (=$1) and all columns (=$0). But if we want to show just columns 2, 3, ... from file2 in my file3, is it possible?
awk -v RS='\r\n' 'BEGIN {FS=OFS=";"} FNR==NR {a[$2] = $1; next} {gsub(/_/,"-",$2);$2=toupper($2);print a[$2]?a[$2]:"NA",$0,a[$2]?a[$2]:"NA"}' $File2 $File1 > file3
or
awk -v RS='\r\n' 'BEGIN {FS=OFS=";"} FNR==NR {a[$2] = $0; next} {gsub(/_/,"-",$2);$2=toupper($2);print a[$2]?a[$2]:"NA",$0,a[$2]?a[$2]:"NA"}' $File2 $File1 > file3
I just want $1 and $2 from file2, but this code doesn't work: I obtain one column with the data from $1 and $2 joined together.
awk -v RS='\r\n' 'BEGIN {FS=OFS=";"} FNR==NR {a[$2] = $1$2; next} {gsub(/_/,"-",$2);$2=toupper($2);print a[$2]?a[$2]:"NA",$0,a[$2]?a[$2]:"NA"}' $File2 $File1 > file3
Any solution??

awk -v RS='\r\n' ' # call awk and set row separator
BEGIN {
FS=OFS=";" # set input and output field separator
}
# Here you are reading first argument that is File2
FNR==NR {
# Save column2 and column3 separated by OFS that is ;
# from File2 which is first argument, in array a
# whose index/key being second field/column from File2
a[$2] = $2 OFS $3;
# Stop processing this line and go to the next line of File2
next
}
# From here onwards you are reading the second argument, that is File1
{
# Global substitution
# replace _ with hyphen - in field2/column2
gsub(/_/,"-",$2);
# Uppercase field2/column2
$2=toupper($2);
# If field2 of current file (File1) exists in array a
# which is created above using File2 then
# print array value that is your field2 and field3 of File2
# else print "NA", and then output field separator,
# entire line/record of current file
print ($2 in a ? a[$2] : "NA"), $0
}' $File2 $File1 > file3
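Since the question asks for $1 and $2 of File2, here is a hedged sketch of the same pattern storing those two fields instead. All sample rows are hypothetical, the files use plain Unix line endings (so the RS='\r\n' setting is left out), and NA is printed twice for unmatched rows to keep the column count constant, which is a design choice of this sketch:

```shell
#!/bin/sh
# Hypothetical semicolon-separated sample data.
dir=$(mktemp -d)
printf 'id1;ABC-1;extra\nid2;DEF-2;extra\n' > "$dir/File2"   # lookup file
printf 'r1;abc_1;foo\nr2;xyz_9;bar\n'       > "$dir/File1"   # file to annotate

out=$(awk '
BEGIN { FS = OFS = ";" }
FNR == NR {                        # first argument: File2
    a[$2] = $1 OFS $2              # keep columns 1 and 2, joined by OFS
    next
}
{                                  # second argument: File1
    gsub(/_/, "-", $2)             # normalise the key: _ becomes -
    $2 = toupper($2)               # and upper-case it
    prefix = ($2 in a) ? a[$2] : "NA" OFS "NA"
    print prefix, $0
}' "$dir/File2" "$dir/File1")

printf '%s\n' "$out"
# prints:
# id1;ABC-1;r1;ABC-1;foo
# NA;NA;r2;XYZ-9;bar
rm -rf "$dir"
```

The original attempt, a[$2] = $1$2, concatenates the two fields with nothing between them, which is why they showed up as a single column.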

Print default value if index is not in awk array

$ cat file1 #It contains ID:Name
5:John
4:Michel
$ cat file2 #It contains ID
5
4
3
I want to Replace the IDs in file2 with Names from file1, output required
John
Michel
NO MATCH FOUND
I need to expand the below code to result in the NO MATCH FOUND text.
awk -F":" 'NR==FNR {a[$1]=$2;next} {print a[$1]}' file1 file2
My current result:
John
Michel
<< empty line
Thanks,

You can use a ternary operator for this: print ($1 in a)?a[$1]:"NO MATCH FOUND". That is, if $1 is in the array, print its value; otherwise, print the text "NO MATCH FOUND".
All together:
$ awk -F":" 'NR==FNR {a[$1]=$2;next} {print ($1 in a)?a[$1]:"NO MATCH FOUND"}' f1 f2
John
Michel
NO MATCH FOUND
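One caveat, offered as a precaution and not as a correction: an unparenthesized ternary directly after print, as in print (cond)?x:y, may be parsed differently by some awk implementations. Wrapping the whole expression in parentheses avoids the ambiguity. A sketch using the sample files from the question:

```shell
#!/bin/sh
# file1 and file2 as shown in the question.
dir=$(mktemp -d)
printf '5:John\n4:Michel\n' > "$dir/file1"
printf '5\n4\n3\n'          > "$dir/file2"

out=$(awk -F: '
NR == FNR { a[$1] = $2; next }                          # first file: ID -> name
{ print (($1 in a) ? a[$1] : "NO MATCH FOUND") }        # whole ternary parenthesized
' "$dir/file1" "$dir/file2")

printf '%s\n' "$out"
# prints:
# John
# Michel
# NO MATCH FOUND
rm -rf "$dir"
```

Testing with ($1 in a) also avoids creating an empty array element for unknown IDs, which a bare a[$1] reference would do.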

You can test whether the index occurs in the array:
$ awk -F":" 'NR==FNR {a[$1]=$2;next} $1 in a {print a[$1]; next} {print "NOT FOUND"}' file1 file2
John
Michel
NOT FOUND

If file2 contains only digits (no space at the end):
awk -F ':' '$1 in A {print A[$1];next}{if($2~/^$/) print "NOT FOUND";else A[$1]=$2}' file1 file2
If not:
awk -F '[:[:blank:]]' '$1 in A {print A[$1];next}{if($2~/^$/) print "NOT FOUND";else A[$1]=$2}' file1 file2
