Matching two fields between two files with AWK

I'm trying to match fields 1 and 3 of one file against fields 1 and 2 of another file, and print the matching lines of the second file. The first file is tab-delimited and the second is comma-delimited. Why do I get an unexpected token error?
file1
1 x 12345 x x x
file2
1,12345,x,x,x
script
awk -F',' FNR==NR{a[$1]=$1,$3; next} ($1,$2 in a) {print}' file1 file2 > output.txt

Same idea, but this version doesn't depend on the first field being unique; it keys on the pair of fields instead.
$ awk 'NR==FNR{a[$1,$3]; next} ($1,$2) in a' file1 FS=, file2
1,12345,x,x,x

You almost nailed it!
awk 'NR==FNR{first[$1]=$3;next} $1 in first{if(first[$1]==$2){print}}' file1 FS="," file2
Output
1,12345,x,x,x
Notes
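Since the field separator differs between the two files, we change it between them: the FS="," assignment placed before file2 takes effect only when awk starts reading file2.
This script assumes that the first field of file1 is unique. If it repeats, a later line overwrites the stored value and the earlier pairs no longer match.
See "switching field separator" for changing FS between files.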
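To make the uniqueness caveat concrete, here is a minimal sketch comparing the two approaches. The sample rows are hypothetical (they are not the asker's data), and the variable names pair_out and single_out are only for illustration. The first field repeats in file1, so the single-key version keeps only the last value it saw.

```shell
# Hypothetical sample data: field 1 repeats in the tab-delimited file1.
dir=$(mktemp -d)
printf '1\tx\t12345\tx\n1\tx\t67890\tx\n' > "$dir/file1"
printf '1,12345,a\n1,67890,b\n1,99999,c\n' > "$dir/file2"

# Pair key: index the array by (field 1, field 3) of file1, then
# switch FS to a comma before awk starts reading file2.
pair_out=$(awk -F'\t' 'NR==FNR { seen[$1,$3]; next } ($1,$2) in seen' \
    "$dir/file1" FS=, "$dir/file2")

# Single key: first[$1] is overwritten by the second line of file1,
# so the 12345 pair is lost.
single_out=$(awk -F'\t' 'NR==FNR { first[$1]=$3; next } ($1 in first) && first[$1]==$2' \
    "$dir/file1" FS=, "$dir/file2")

printf 'pair key:\n%s\n' "$pair_out"
printf 'single key:\n%s\n' "$single_out"
rm -rf "$dir"
```

With this data the pair-key version prints both matching lines of file2, while the single-key version prints only the line for 67890.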

Related

awk/grep print WHOLE record in file2 based on matched string list in file1

This question has some popularity on stackoverflow. I've looked through previous posts but can't quite get the solution I need.
I have two files. One file is a list of string identifiers, the other is a list of entries. I'd like to match each item in file1 with an entry in file2, then print the whole matching record from file2. My current issue is that I'm only able to print the first line (not the whole record) of each match in file2.
Examples:
File1
id100
id000
id004
...
File2
>gnl|gene42342|rna3234| id0023
CCAATGAGA
>gnl|gene402|rna9502| id004
AAAAAAGGGGGGGGGG
>gnl|gene422|rna22229| id100
GATTACAGATTACA
....
Desired output:
>gnl|gene402|rna9502| id004
AAAAAAGGGGGGGGGG
>gnl|gene422|rna22229| id100
GATTACAGATTACA
My current code:
awk 'NR==FNR{a[$0];next}{for(i in a)if(index($0,i)){print $1 ;next}}' file1 file2
only prints:
>gnl|gene402|rna9502| id004
>gnl|gene422|rna22229| id100
and trying to specify the RS makes the entire file print..., ie:
awk 'NR==FNR{a[$0];next}{for(i in a)if(index($0,i)){RS=">"}{print $1 ;next}}' file1 file2
prints
>gnl|gene42342|rna3234| id0023
CCAATGAGA
>gnl|gene402|rna9502| id004
AAAAAAGGGGGGGGGG
>gnl|gene422|rna22229| id100
GATTACAGATTACA
....
I'm having the same issue with grep. First line prints, but not the entire record:
grep -Fwf file1 file2
gives
>gnl|gene402|rna9502| id004
>gnl|gene422|rna22229| id100
I feel like I'm just defining the RS in the wrong place, but I can't figure out where. Any advice is welcome!
edit:
real-life file looks more like this:
awk '{print $0}' file2
>gnl|gene49202|rna95089| id0023
GGTGCTCTAGACAAAACATTGATTCCTCGTGACTGGGATTAGCCAATAGCTGAACGCGACTGAGTGTGAAACACGGAGGA
GGAGTAGGAAGTTGGAACTAGACAGGCGACTCGGTTAGGGGACACCGGAGAGATGACTCATGACTCGTGGAAACCAACGT
GAGCTTGCCCGACAAAAGAATATGAAGAAAAGTCAGGATAAACAAAAGAAACAAGATGATGGCTTGTCTGCTGCTGCACG
GAAGCACTGACCCTTTCACCAAACCACAGTGCTCTCACTGCTATGTACTGTGTTCAGcctttttatttgtcacaggCTTGTAGCAT
AGCTCCTTTATTGCCTCTTGTACATACTATAAATTCTCCATATGATTCTCTTTATTTTCATCTATTCCCCACTGATGGCT
CTCTAACTGCATGCTGGTTTAGCATTGCTTAAGTCTGCTCTGGAAAATACATGTTTTGAGGGAGTACAAACAGATCATGT
CCCTTCCTTCAACTCAAATGACCTTTTTGTATTCACGGTGACCCAGttgaatatttaataaagaatttttttctgtga
>gnl|gene37771|rna78596| id230400
GGCGATACTAGATGTTGGCGGGGTTACACTGTAGATGCGGGGGGGCTACACTAGATGTGGGCGAGGCTACACTGCAGATG
TGGGCAAGGCTATACTAGATGTGGGTGGGGCTACACTGTAGATGTGGGTGGGGCTACACTTCAGATGTGGGCGAGGCTAT
ACTGTAGATGTGGGCTGAATTTCCTATAAAGCCTGTACCTTCTTTGTTTTTGCAGGGCTTGATGGCAGAATGGAGCAGCC
AGAGCTACAGAGTGGATGACCCAGATTTGGCCCTAACCTTTCCCACCCGGCCTGGTTTCCGTAGCTTTCCCAGTCCCCAA
GTCTTTCCTATTTTCTCCCTCTTGCCACAATCTGATCCCTGCAGTAACAATGAGCTGGTTGAGTAAACTTAACCCTCGGG
GAGCTGGCGGCAGGGCCAAGTGTCAGTCTCCAACCGCCGCTCACTGCC
EDIT: Since the OP changed the input file, here is code written for the new input.
awk -F'[| ]' 'FNR==NR{a[$0];next} /^>/{flag=""} ($NF in a){flag=1} flag' FILE1 FILE2
The following awk may help you here.
awk 'FNR==NR{a[$0];next} ($3 in a){print $0;getline;print}' Input_file1 FS="|" Input_file2
This should work if your records are separated by one or more empty lines.
$ awk -v ORS='\n\n' 'NR==FNR{a[$1]; next} $2 in a' file1 RS= file2
Here the output records are also separated by one empty line; if you want to remove the empty lines, just delete -v ORS='\n\n'.
$ grep -A1 -Fwf file1 file2
>gnl|gene402|rna9502| id004
AAAAAAGGGGGGGGGG
>gnl|gene422|rna22229| id100
GATTACAGATTACA
The -A1 means "also show 1 line After the match". Check your grep man page.
If the trailing information is a fixed number of lines, then adjust "1" accordingly. Otherwise you'll need awk or perl or ... for a more flexible solution.
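For records with a variable number of sequence lines, one possible sketch is to set RS as a command-line assignment between the two file names, so that it applies only to file2. The sample data below is adapted from the question with extra sequence lines added for illustration, and the variable name fasta_out is hypothetical. It assumes ">" never occurs inside a record and that the identifier is the second whitespace-separated field of the header.

```shell
dir=$(mktemp -d)
printf 'id100\nid004\n' > "$dir/file1"
cat > "$dir/file2" <<'EOF'
>gnl|gene42342|rna3234| id0023
CCAATGAGA
CCAATGAGA
>gnl|gene402|rna9502| id004
AAAAAAGGGG
GGGGGG
TTTT
>gnl|gene422|rna22229| id100
GATTACAGATTACA
EOF

# file1 is read line by line (default RS). RS='>' is assigned just
# before file2 is opened, so each record there is a header plus all of
# its sequence lines. With the default FS, newlines also separate
# fields, so $2 is the identifier at the end of the header line.
fasta_out=$(awk 'NR==FNR { ids[$1]; next }
    $2 in ids { printf ">%s", $0 }' "$dir/file1" RS='>' "$dir/file2")

printf '%s\n' "$fasta_out"
rm -rf "$dir"
```

Each record still carries its own trailing newline, which is why printf is used with no added newline and the leading ">" consumed by RS is put back by hand.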

Using awk how do I combine data in two files and substitute multiple values from the second file to the first file?

This question is an extension to Using awk how do I combine data in two files and substitute values from the second file to the first file?
data.txt contains some data:
A;1
B;2
A;3
keys.txt contains "key;value1;value2;value3;value4" ("C" is in this example not part of data.txt, but the awk script should still work):
A;30;BC;100;1000
B;20;CD;200;2000
C;10;DE;300;3000
Wanted output:
A;1;30;BC;100;1000
B;2;20;CD;200;2000
A;3;30;BC;100;1000
Hence, each row in data.txt that contains any key from keys.txt should get the corresponding values appended to the row in data.txt.
It's similar to the answer to the previous question referenced above.
$ awk 'BEGIN {FS=OFS=";"}
NR==FNR {k=$1; $1=""; a[k]=$0; next}
$1 in a {print $0 a[$1]}' file2 file1
A;1;30;BC;100;1000
B;2;20;CD;200;2000
A;3;30;BC;100;1000

Using awk how do I combine data in two files and substitute values from the second file to the first file?

Any ideas how to do the following using awk?
Two input files, data.txt and keys.txt:
data.txt contains some data:
A;1
B;2
A;3
keys.txt contains "key;value" pairs ("C" is in this example not part of data.txt, but the awk script should still work):
A;30
B;20
C;10
The output should be as follows:
A;1;30
B;2;20
A;3;30
Hence, each row in data.txt that contains any key from keys.txt should get the corresponding value appended to the row in data.txt.
awk to the rescue!
This assumes the second file has unique keys, unlike the first file (if not, you need to specify what should happen in that case).
$ awk 'BEGIN {FS=OFS=";"}
NR==FNR {a[$1]=$2; next}
$1 in a {print $0,a[$1]}' file2 file1
A;1;30
B;2;20
A;3;30
P.S. Note the order of the files: the keys file comes first.
awk solution:
awk -F';' 'NR==FNR{a[$1]=$2; next}{if($1 in a) $0=$0 FS a[$1]; print}' file2 file1
The output:
A;1;30
B;2;20
A;3;30
NR==FNR - true only while processing the first file argument, i.e. file2
a[$1]=$2 - storing the value for each key
if($1 in a) $0=$0 FS a[$1] - appending the value if the first column matches a key
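The two answers above differ in how they treat rows whose key is missing from the keys file. The following sketch shows the difference on hypothetical data: the D;4 row is added for illustration and is not in the question, and the variable names matched_only and all_rows are likewise made up.

```shell
dir=$(mktemp -d)
printf 'A;1\nD;4\nB;2\n' > "$dir/data.txt"
printf 'A;30\nB;20\nC;10\n' > "$dir/keys.txt"

# Pattern form: only rows whose key exists in keys.txt are printed.
matched_only=$(awk 'BEGIN { FS = OFS = ";" }
    NR==FNR { val[$1] = $2; next }
    $1 in val { print $0, val[$1] }' "$dir/keys.txt" "$dir/data.txt")

# if form: every row is printed; the value is appended only on a match.
all_rows=$(awk -F';' 'NR==FNR { val[$1] = $2; next }
    { if ($1 in val) $0 = $0 FS val[$1]; print }' "$dir/keys.txt" "$dir/data.txt")

printf 'matched only:\n%s\n' "$matched_only"
printf 'all rows:\n%s\n' "$all_rows"
rm -rf "$dir"
```

The first form drops D;4, the second passes it through unchanged, so the choice depends on whether unmatched rows should survive.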

Compare two big files with awk

I used the following link as a reference for comparing two files:
Compare files with awk
awk 'NR==FNR{a[$1];next}$1 in a{print $2}' file1 file2
It prints the 2nd column of file2 if the 1st column of file2 is found in file1.
But my requirement is a little different. How do I print the 2nd column of file1 if the 1st column of file2 is found in the associative array (built from the 1st column of file1)?
With this:
awk 'NR==FNR{a[$1]=$2;next}$1 in a{print a[$1]}' file1 file2
This way you assign a value to each element of array a.
For a line with fields foo bar, you actually create a["foo"]="bar".
If you later run {print a["foo"]}, it will print bar (its assigned value).
The previous {a[$1];next} creates an array named a with index $1, but the value is empty; it is a shortcut for a[$1]="".
The whole thing works because awk has an easy way to look up indexes in an array: $1 in a{print something}. This is awk's pattern-action shorthand for an if statement.
It is the same as {if ($1 in a) {print something}}. The great thing about this is that $1 in a tests the indexes of array a, not its values.
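A small sketch of the point about indexes versus values, using the foo bar line from the explanation. The array names seen and map and the shell variable in_demo are hypothetical.

```shell
in_demo=$(printf 'foo bar\n' | awk '{
    seen[$1]          # creates index "foo" with an empty value
    map[$1] = $2      # creates index "foo" with value "bar"

    in_seen   = ("foo" in seen)   # 1: the index exists even without a value
    val_known = ("bar" in map)    # 0: "bar" is a value, not an index
    in_map    = ("foo" in map)    # 1

    print in_seen, val_known, in_map, map["foo"]
}')
printf '%s\n' "$in_demo"
```

This prints 1 0 1 bar: the in operator reports membership of indexes only, and a bare a[$1] is enough to create one.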

how to retain header using AWK

How can I print the header from data1 while running this comparison of two files, data1 and data2, matching on column 2? My code only prints data lines. The headers are named differently, so I chose to use column positions.
awk -F, 'FNR==NR {a[$2]=$0; next}; $2 in a {print a[$2]}' /data1 /data2 > /data3.txt
$ awk -F, 'NR==1; FNR==NR{a[$2]=$0; next} ...
will print the first line of the first file. If you want to skip further processing of that line (so the header is not also stored in the array), replace NR==1 with NR==1{print; next}.
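As a complete sketch of the header-preserving variant, here is the NR==1{print; next} form run on hypothetical files. The column names and rows are invented for illustration, the files live in a temporary directory rather than at /data1 and /data2, and header_out is a made-up variable name.

```shell
dir=$(mktemp -d)
printf 'name,id,score\nalice,7,90\nbob,8,75\n' > "$dir/data1"
printf 'key,ident\nx,8\ny,9\n' > "$dir/data2"

# NR==1 is true only for the very first line read (the data1 header):
# print it and skip it, so it is never stored in the lookup array.
header_out=$(awk -F, 'NR==1 { print; next }
    FNR==NR { row[$2] = $0; next }
    $2 in row { print row[$2] }' "$dir/data1" "$dir/data2")

printf '%s\n' "$header_out"
rm -rf "$dir"
```

The data2 header falls through to the last rule, but its second field (ident) is not an index of row, so nothing is printed for it.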