awk/grep print WHOLE record in file2 based on matched string list in file1 - awk

This question has some popularity on Stack Overflow. I've looked through previous posts but can't quite get the solution I need.
I have two files. One is a list of string identifiers; the other is a list of entries. I'd like to match each item in the list in file1 with an entry in file2, then print the whole matching record from file2. My current issue is that I'm only able to print the first line (not the whole record) of file2.
Examples:
File1
id100
id000
id004
...
File2
>gnl|gene42342|rna3234| id0023
CCAATGAGA
>gnl|gene402|rna9502| id004
AAAAAAGGGGGGGGGG
>gnl|gene422|rna22229| id100
GATTACAGATTACA
....
Desired output:
>gnl|gene402|rna9502| id004
AAAAAAGGGGGGGGGG
>gnl|gene422|rna22229| id100
GATTACAGATTACA
My current code:
awk 'NR==FNR{a[$0];next}{for(i in a)if(index($0,i)){print $1 ;next}}' file1 file2
only prints:
>gnl|gene402|rna9502| id004
>gnl|gene422|rna22229| id100
and trying to specify the RS makes the entire file print, i.e.:
awk 'NR==FNR{a[$0];next}{for(i in a)if(index($0,i)){RS=">"}{print $1 ;next}}' file1 file2
prints
>gnl|gene42342|rna3234| id0023
CCAATGAGA
>gnl|gene402|rna9502| id004
AAAAAAGGGGGGGGGG
>gnl|gene422|rna22229| id100
GATTACAGATTACA
....
I'm having the same issue with grep. First line prints, but not the entire record:
grep -Fwf file1 file2
gives
>gnl|gene402|rna9502| id004
>gnl|gene422|rna22229| id100
I feel like I'm just defining the RS in the wrong place, but I can't figure out where. Any advice is welcome!
Edit:
The real-life file looks more like this:
awk '{print $0}' file2
>gnl|gene49202|rna95089| id0023
GGTGCTCTAGACAAAACATTGATTCCTCGTGACTGGGATTAGCCAATAGCTGAACGCGACTGAGTGTGAAACACGGAGGA
GGAGTAGGAAGTTGGAACTAGACAGGCGACTCGGTTAGGGGACACCGGAGAGATGACTCATGACTCGTGGAAACCAACGT
GAGCTTGCCCGACAAAAGAATATGAAGAAAAGTCAGGATAAACAAAAGAAACAAGATGATGGCTTGTCTGCTGCTGCACG
GAAGCACTGACCCTTTCACCAAACCACAGTGCTCTCACTGCTATGTACTGTGTTCAGcctttttatttgtcacaggCTTGTAGCAT
AGCTCCTTTATTGCCTCTTGTACATACTATAAATTCTCCATATGATTCTCTTTATTTTCATCTATTCCCCACTGATGGCT
CTCTAACTGCATGCTGGTTTAGCATTGCTTAAGTCTGCTCTGGAAAATACATGTTTTGAGGGAGTACAAACAGATCATGT
CCCTTCCTTCAACTCAAATGACCTTTTTGTATTCACGGTGACCCAGttgaatatttaataaagaatttttttctgtga
>gnl|gene37771|rna78596| id230400
GGCGATACTAGATGTTGGCGGGGTTACACTGTAGATGCGGGGGGGCTACACTAGATGTGGGCGAGGCTACACTGCAGATG
TGGGCAAGGCTATACTAGATGTGGGTGGGGCTACACTGTAGATGTGGGTGGGGCTACACTTCAGATGTGGGCGAGGCTAT
ACTGTAGATGTGGGCTGAATTTCCTATAAAGCCTGTACCTTCTTTGTTTTTGCAGGGCTTGATGGCAGAATGGAGCAGCC
AGAGCTACAGAGTGGATGACCCAGATTTGGCCCTAACCTTTCCCACCCGGCCTGGTTTCCGTAGCTTTCCCAGTCCCCAA
GTCTTTCCTATTTTCTCCCTCTTGCCACAATCTGATCCCTGCAGTAACAATGAGCTGGTTGAGTAAACTTAACCCTCGGG
GAGCTGGCGGCAGGGCCAAGTGTCAGTCTCCAACCGCCGCTCACTGCC

EDIT: Since the OP changed the input file, here is code for the new input.
awk -F"| " 'FNR==NR{a[$0];next} /^>/{flag=""} ($NF in a){flag=1} flag' FILE1 FILE2
Following awk may help you here; it assumes each header is followed by exactly one sequence line (hence the getline):
awk 'FNR==NR{a[$0];next} ($NF in a){print; getline; print}' Input_file1 Input_file2

This should work if your records are separated by one or more empty lines.
$ awk -v ORS='\n\n' 'NR==FNR{a[$1]; next} $2 in a' file1 RS= file2
Here the output records are also separated by an empty line; if you want to remove the empty lines, just delete -v ORS='\n\n'. Note that RS= appears after file1 on the command line, so file1 is still read line by line while file2 is read in paragraph mode.
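For this to apply, file2 would need a blank line between records, e.g.:
>gnl|gene402|rna9502| id004
AAAAAAGGGGGGGGGG

>gnl|gene422|rna22229| id100
GATTACAGATTACA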

$ grep -A1 -Fwf file1 file2
>gnl|gene402|rna9502| id004
AAAAAAGGGGGGGGGG
>gnl|gene422|rna22229| id100
GATTACAGATTACA
The -A1 means "also show 1 line After the match". Check your grep man page.
If the trailing information is a fixed number of lines, then adjust "1" accordingly. Otherwise you'll need awk or perl or ... for a more flexible solution.
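For example, here is a sketch of such a more flexible solution in awk (the same flag idea as the awk answers above; it assumes the id is always the last whitespace-separated field of the header line):
awk 'NR==FNR{ids[$0]; next} /^>/{keep=($NF in ids)} keep' file1 file2
Each ">" header turns printing on or off for the whole record, so any number of following sequence lines is handled.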

Related

How to improve the speed of this awk script

I have a large file, say file1.log, that looks like this:
1322 a#gmail.com
2411 b#yahoo.com
and a smaller file, say file2.log, that looks like this:
a#gmail.com
c#yahoo.com
In fact, file1.log contains about 6500000 lines and file2.log contains about 140000.
I want to find all lines in file2.log that do not appear in file1.log. I wrote this awk command:
awk 'NR==FNR{c[$2]++} NR!=FNR && c[$1]==0 {print $0}' file1.log file2.log > result.log'
after half an hour or so I find the command is still running, and less result.log shows that result.log is empty.
I am wondering whether there is something I can do to do the job quicker?
Hash the smaller file file2 into memory. Remember Tao of Programming, 1.3: How could it be otherwise?:
$ awk '
NR==FNR { # hash file2 since it's smaller
a[$0]
next
}
($2 in a) { # if file1 entry found in hash
delete a[$2] # remove it
}
END { # in the end
for(i in a) # print the ones that remain in the hash
print i
}' file2 file1 # mind the order
Output:
c#yahoo.com
If you sort the files, you can use comm to print only the lines present in the second file: -1 suppresses lines unique to the first input and -3 suppresses lines common to both, leaving the lines found only in file2.log:
comm -13 <(awk '{ print $2 }' file1.log | sort) <(sort file2.log)
I believe the easiest is just a simple grep pipeline:
grep -Fwof file2 file1 | grep -Fwvf - file2
The first grep prints only the matched strings (-o): the lines of file2 that occur as whole words (-w) in file1. The second grep uses those as fixed-string patterns (-F, -f -) and inverts the match (-v), leaving the lines of file2 that were not found in file1.
You can also just extract the second column of file1 and reuse the last part of the command above:
awk '{print $2}' file1 | grep -Fwvf - file2
Or everything in a single awk:
awk '(NR==FNR){a[$2]; next}!($1 in a)' file1 file2

Matching two fields between two files AWK

I'm trying to match fields 1 and 3 of the first file against fields 1 and 2 of another file and print the matching line of the second file. The first file is tab-delimited and the second is comma-separated. Why do I get an unexpected token error?
file1
1 x 12345 x x x
file2
1,12345,x,x,x
script
awk -F',' FNR==NR{a[$1]=$1,$3; next} ($1,$2 in a) {print}' file1 file2 > output.txt
Same idea, but it doesn't depend on the first field being unique; it keys on the pair of fields instead.
$ awk 'NR==FNR{a[$1,$3]; next} ($1,$2) in a' file1 FS=, file2
1,12345,x,x,x
You almost nailed it!
awk 'NR==FNR{first[$1]=$3;next} $1 in first{if(first[$1]==$2){print}}' file1 FS="," file2
Output
1,12345,x,x,x
Notes
Since the field separator is different for the two files, it is changed in between the file arguments.
This script assumes that the first field of each file is unique; otherwise it breaks.
See [ switching field separator ] for changing the separator in between files.
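A minimal illustration of assigning FS between file arguments (with hypothetical files a.txt, whitespace-separated, and b.csv, comma-separated):
awk '{ print FILENAME, NF }' a.txt FS="," b.csv   # a.txt and b.csv are hypothetical example files
The assignment FS="," only takes effect when awk reaches it in the argument list, i.e. after a.txt has been read and before b.csv is opened.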

Using file redirects to input a variable search pattern to awk

I'm attempting to write a small script in bash. The script's purpose is to pull out a search pattern from file1.txt and to print the line number of the matching search from file2.txt. I know the exact place of the pattern that I want in file1.txt, and I can pull that out quite easily with sed and awk e.g.
sed -n 3p file1.txt | awk '{print $4}'
The part that I'm having trouble with is passing that information again to awk to use as a search pattern in file2.txt. Something along the lines of:
awk '/search_pattern/{print NR}' file2.txt
I was able to get this code working in two lines of code by storing the output of the first line as a variable, and passing that variable to awk in the second line,
myVariable=`sed -n 3p file1.txt | awk '{print $4}'`
awk '/'"$myVariable"'/{print NR}' file2.txt
but this seems "inelegant". I was hoping there was a way to do this in one line of code using file redirects (or something similar?). Any help is greatly appreciated!
You can avoid sed | awk with
awk 'NR==3{print $4; exit 0}' file1.txt
You can do your search with:
search=$(awk 'NR==3{print $4; exit 0}' file1.txt)
awk -v search="$search" '$0 ~ search { print NR }' file2.txt
You could even write that all on one line, but I don't recommend that; clarity is more important than brevity.
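For reference, writing it all on one line just means nesting the first command inside -v:
awk -v search="$(awk 'NR==3{print $4; exit 0}' file1.txt)" '$0 ~ search { print NR }' file2.txt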
In principle, you could use:
awk 'NR==3{search = $4; next} FNR!=NR && $0 ~ search {print FNR}' file1.txt file2.txt
This scans file1.txt and picks up the search pattern from line 3; then it scans file2.txt and prints the line number (FNR, the per-file record count) of each matching line. It is one line, and even moderately clear. There'll be lots of matches if there isn't a column 4 on line 3 of file1.txt, because search is then empty and an empty pattern matches every line.

Difference of files from Nth line

I am trying to get the difference of two text files. However, the first line can always change, so I have been running these commands from a Python script:
tail -n +2 file1
tail -n +2 file2
Then I compare the two outputs.
However, I would like to use awk or sed if possible.
What I have found so far is:
awk 'NR == FNR { A[$0]=3; next } !A[$0]' file2 file1
but this compares from the first line.
How can I diff from the second line?
You can use diff together with process substitution:
diff <(tail -n +2 file1) <(tail -n +2 file2)
You can write something like
awk 'NR == FNR { A[$0]=3; next } !A[$0]&&FNR>1' file2 file1
FNR>1: the FNR value is reset to 1 for each file read, so FNR>1 selects all lines from the second line onwards.
None of the current awk answers show positional differences between the files; they only report whether lines of one file are missing from the other, with no regard for order or number of occurrences.
Here is an awk way that compares the files line by line:
awk 'NR==FNR{A[FNR]=$0}FNR>1&&!(A[FNR]==$0)' file1 file2
If you want both lines to be output (similar to diff, roughly):
awk 'NR==FNR{A[FNR]=$0}
FNR>1&&!(A[FNR]==$0){
print "Line:",FNR"\n"ARGV[1]":"A[FNR]"\n->\n"ARGV[2]":"$0"\n"
}' file file2
Explanation
The first block stores each line of the first file in an array keyed by its file record number (FNR).
For the second file, it checks whether the line with the same FNR is identical to the stored one.
If it isn't, the line is printed.
The second script is mostly just formatting for the output.
It prints the FNR, the first awk argument (filename1) with the stored line, an arrow, and the second argument (filename2) with the corresponding line from file2.
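With hypothetical inputs whose second lines differ (say line 2 of file is bar and line 2 of file2 is baz), the output would look like:
Line: 2
file:bar
->
file2:baz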
In addition to nu11p01n73R's solution, you can always use process substitution <(...) for the input files:
awk 'NR == FNR { A[$0]=3; next } !A[$0]' <(tail -n+2 f2) <(tail -n+2 f1)

Using each line of awk output as grep pattern

I want to find every line of a file that contains any of the strings held in a column of a different file.
I have tried
grep "$(awk '{ print $1 }' file1.txt)" file2.txt
but that just outputs file2.txt in its entirety.
I know I've done this before with a pattern I found on this site, but I can't find that question anymore.
I see from the OP's comment that the question may no longer need an answer. However, the following slight modification will handle the blank-line situation: a blank line in file1 becomes an empty pattern, which matches every line of file2 (which is why the whole file was printed). Just add a check to make sure the line has at least one field:
grep "$(awk '{if (NF > 0) print $1}' file1)" file2
And if the file with the patterns is simply one pattern per line, then a much simpler version is:
grep -f file1 file2
That causes grep to use the lines in file1 as the patterns.
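If the strings should be taken literally rather than as regular expressions, grep's -F flag can be combined with -f:
grep -Ff file1 file2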
There is no need to use grep when you have awk:
awk 'FNR==NR && NF {pats[$1]; next} {for (p in pats) if (index($0, p)) {print; next}}' file1 file2
Or pipe the awk output straight into grep as a pattern list (reading the patterns from stdin with -f -):
awk '{ print $1 }' file1.txt | grep -f - file2.txt > file.txt