Difference of files from Nth line - awk

I am trying to get the difference of two text files. However, the first line can always change. For this reason I was executing this from a Python script:
tail -n +2 file1
tail -n +2 file2
Then I compare the two outputs.
However, I would like to use awk or sed if possible.
What I have found so far is:
awk 'NR == FNR { A[$0]=3; next } !A[$0]' file2 file1
but this compares from the first line.
How can I diff from the second line?

You can use diff together with process substitution:
diff <(tail -n +2 file1) <(tail -n +2 file2)
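Process substitution is a bash/ksh/zsh feature, so this won't work in a plain POSIX sh. With two hypothetical files that differ at a single line, the output would look something like:
2c2
< old line from file1
---
> new line from file2
(the line numbers diff reports are relative to the tail output, i.e. offset by one from the original files).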

You can write something like
awk 'NR == FNR { A[$0]=3; next } !A[$0]&&FNR>1' file2 file1
FNR>1: the FNR value is reset to 1 for each file read, so FNR>1 selects all lines from the second line onwards.
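A quick way to see the difference (with two hypothetical two-line files a.txt and b.txt):
$ awk '{print FILENAME, NR, FNR}' a.txt b.txt
a.txt 1 1
a.txt 2 2
b.txt 1 1
b.txt 2 2
NR keeps counting across files while FNR restarts at 1, which is why NR == FNR is only true while the first file is being read.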

None of the current awk answers will show differences between the files; they simply show whether one file is missing lines from the other, with no regard for order or number of occurrences.
An awk way that compares line by line.
awk 'NR==FNR{A[FNR]=$0}FNR>1&&!(A[FNR]==$0)' file1 file2
If you want both lines to be output (diff-ish):
awk 'NR==FNR{A[FNR]=$0}
FNR>1&&!(A[FNR]==$0){
print "Line:",FNR"\n"ARGV[1]":"A[FNR]"\n->\n"ARGV[2]":"$0"\n"
}' file1 file2
Explanation
The first block stores each line of the first file in array A, keyed by the file record number (FNR).
For the second file, the line at the same FNR is compared with the stored line.
If they differ, the line is printed.
The second version just adds formatting to the output: it prints the FNR, the first argument to awk (filename1) with the line from the array, an arrow, and the second argument (filename2) with the line from file2.
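For example, if (hypothetically) only line 3 differed between the two files, the output would look like:
Line: 3
file1:old text
->
file2:new text
(followed by a blank line, from the trailing \n in the print).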

In addition to nu11p01n73R's solution, you can always use <(...) (process substitution) for the input files:
awk 'NR == FNR { A[$0]=3; next } !A[$0]' <(tail -n+2 f2) <(tail -n+2 f1)

Related

How to improve the speed of this awk script

I have a large file, say file1.log, that looks like this:
1322 a#gmail.com
2411 b#yahoo.com
and a smaller file, say file2.log, that looks like this:
a#gmail.com
c#yahoo.com
In fact, file1.log contains about 6500000 lines and file2.log contains about 140000.
I want to find all lines in file2.log that do not appear in file1.log. I wrote this awk command:
awk 'NR==FNR{c[$2]++} NR!=FNR && c[$1]==0 {print $0}' file1.log file2.log > result.log
After half an hour or so I found the command was still running, and less result.log showed that result.log was empty.
I am wondering whether there is something I can do to do the job quicker?
Hash the smaller file, file2, into memory: 140000 entries instead of 6500000. Remember the Tao of Programming, 1.3: How could it be otherwise?
$ awk '
NR==FNR { # hash file2 since its smaller
a[$0]
next
}
($2 in a) { # if file1 entry found in hash
delete a[$2] # remove it
}
END { # in the end
for(i in a) # print the ones that remain in the hash
print i
}' file2 file1 # mind the order
Output:
c#yahoo.com
If you sort the files, you can use comm to print only those lines that appear in the second file but not in the first:
comm -13 <(awk '{ print $2 }' file1.log | sort) <(sort file2.log)
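Here -1 suppresses lines unique to the first input and -3 suppresses lines common to both, leaving only the lines unique to file2.log. With the sample data above this prints:
c#yahoo.com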
I believe the easiest is just a simple grep pipeline:
grep -Fwof file2 file1 | grep -Fwvf - file2
You can also just extract the second column of file1 and reuse the last part of the above command:
awk '{print $2}' file1 | grep -Fwvf - file2
Or everything in a single awk:
awk '(NR==FNR){a[$2]; next}!($1 in a)' file1 file2
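With the sample data from the question:
$ awk '(NR==FNR){a[$2]; next}!($1 in a)' file1.log file2.log
c#yahoo.com
Note that this hashes the big file's second column (about 6500000 entries), so the END-block variant above is gentler on memory.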

awk/grep print WHOLE record in file2 based on matched string list in file1

This question has some popularity on Stack Overflow. I've looked through previous posts but can't quite get the solution I need.
I have two files. One file is a list of string identifiers, the other is a list of entries. I'd like to match each item in the list in file1 with an entry in file2, then print the whole matching record from file2. My current issue is that I'm only able to print the first line (not the whole record) of file2.
Examples:
File1
id100
id000
id004
...
File2
>gnl|gene42342|rna3234| id0023
CCAATGAGA
>gnl|gene402|rna9502| id004
AAAAAAGGGGGGGGGG
>gnl|gene422|rna22229| id100
GATTACAGATTACA
....
Desired output:
>gnl|gene402|rna9502| id004
AAAAAAGGGGGGGGGG
>gnl|gene422|rna22229| id100
GATTACAGATTACA
My current code:
awk 'NR==FNR{a[$0];next}{for(i in a)if(index($0,i)){print $1 ;next}}' file1 file2
only prints:
>gnl|gene402|rna9502| id004
>gnl|gene422|rna22229| id100
and trying to specify the RS makes the entire file print, i.e.:
awk 'NR==FNR{a[$0];next}{for(i in a)if(index($0,i)){RS=">"}{print $1 ;next}}' file1 file2
prints
>gnl|gene42342|rna3234| id0023
CCAATGAGA
>gnl|gene402|rna9502| id004
AAAAAAGGGGGGGGGG
>gnl|gene422|rna22229| id100
GATTACAGATTACA
....
I'm having the same issue with grep. First line prints, but not the entire record:
grep -Fwf file1 file2
gives
>gnl|gene402|rna9502| id004
>gnl|gene422|rna22229| id100
I feel like I'm just defining the RS in the wrong place, but I can't figure out where. Any advice is welcome!
edit:
real-life file looks more like this:
awk '{print $0}' file2
>gnl|gene49202|rna95089| id0023
GGTGCTCTAGACAAAACATTGATTCCTCGTGACTGGGATTAGCCAATAGCTGAACGCGACTGAGTGTGAAACACGGAGGA
GGAGTAGGAAGTTGGAACTAGACAGGCGACTCGGTTAGGGGACACCGGAGAGATGACTCATGACTCGTGGAAACCAACGT
GAGCTTGCCCGACAAAAGAATATGAAGAAAAGTCAGGATAAACAAAAGAAACAAGATGATGGCTTGTCTGCTGCTGCACG
GAAGCACTGACCCTTTCACCAAACCACAGTGCTCTCACTGCTATGTACTGTGTTCAGcctttttatttgtcacaggCTTGTAGCAT
AGCTCCTTTATTGCCTCTTGTACATACTATAAATTCTCCATATGATTCTCTTTATTTTCATCTATTCCCCACTGATGGCT
CTCTAACTGCATGCTGGTTTAGCATTGCTTAAGTCTGCTCTGGAAAATACATGTTTTGAGGGAGTACAAACAGATCATGT
CCCTTCCTTCAACTCAAATGACCTTTTTGTATTCACGGTGACCCAGttgaatatttaataaagaatttttttctgtga
>gnl|gene37771|rna78596| id230400
GGCGATACTAGATGTTGGCGGGGTTACACTGTAGATGCGGGGGGGCTACACTAGATGTGGGCGAGGCTACACTGCAGATG
TGGGCAAGGCTATACTAGATGTGGGTGGGGCTACACTGTAGATGTGGGTGGGGCTACACTTCAGATGTGGGCGAGGCTAT
ACTGTAGATGTGGGCTGAATTTCCTATAAAGCCTGTACCTTCTTTGTTTTTGCAGGGCTTGATGGCAGAATGGAGCAGCC
AGAGCTACAGAGTGGATGACCCAGATTTGGCCCTAACCTTTCCCACCCGGCCTGGTTTCCGTAGCTTTCCCAGTCCCCAA
GTCTTTCCTATTTTCTCCCTCTTGCCACAATCTGATCCCTGCAGTAACAATGAGCTGGTTGAGTAAACTTAACCCTCGGG
GAGCTGGCGGCAGGGCCAAGTGTCAGTCTCCAACCGCCGCTCACTGCC
EDIT: As the OP changed the Input_file, here is code for the new input.
awk -F"| " 'FNR==NR{a[$0];next} /^>/{flag=""} ($NF in a){flag=1} flag' FILE1 FILE2
The following awk may help you here.
awk 'FNR==NR{a[$0];next} ($3 in a){print $0;getline;print}' Input_file1 FS="|" Input_file2
this should work if your records are separated by one or more empty lines.
$ awk -v ORS='\n\n' 'NR==FNR{a[$1]; next} $2 in a' file1 RS= file2
here the output is also separated by one empty line; if you want to remove the empty lines, just delete -v ORS='\n\n'
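For example, if file2 were blank-line separated (a hypothetical reformatting of the sample):
>gnl|gene402|rna9502| id004
AAAAAAGGGGGGGGGG

>gnl|gene422|rna22229| id100
GATTACAGATTACA
In paragraph mode (RS=) the default FS splits fields on any whitespace, newlines included, so the id is always $2 and both records above would print.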
$ grep -A1 -Fwf file1 file2
>gnl|gene402|rna9502| id004
AAAAAAGGGGGGGGGG
>gnl|gene422|rna22229| id100
GATTACAGATTACA
The -A1 means "also show 1 line After the match". Check your grep man page.
If the trailing information is a fixed number of lines, then adjust "1" accordingly. Otherwise you'll need awk or perl or ... for a more flexible solution.
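A minimal awk sketch for variable-length records, assuming every record starts with a > header line (same index-based matching as the question's code):
awk 'NR==FNR{ids[$0]; next}                             # load the id list from file1
     /^>/ {p=0; for (id in ids) if (index($0,id)) p=1}  # decide anew at each header
     p' file1 file2
Each header turns printing on or off, so all lines of a matching record are included up to the next header.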

Removing content of a column based on number of occurrences

I have a file (; separated) with data like this:
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3;000-200.2
211111121;000-000.1;000-000.2
I would like to remove any $3 that has fewer than 3 occurrences, so the outcome would be like:
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3
211111121;000-000.1;000-000.2
That is, only $3 got deleted, as it had only a single occurrence.
Sadly I am not really sure whether (and thus how) this could be done relatively easily, as doing the =COUNT.IF matching and manual deleting in Excel feels quite embarrassing.
$ awk -F';' 'NR==FNR{cnt[$3]++;next} cnt[$3]<3{sub(/;[^;]+$/,"")} 1' file file
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3
211111121;000-000.1;000-000.2
or if you prefer:
$ awk -F';' 'NR==FNR{cnt[$3]++;next} {print (cnt[$3]<3 ? $1 FS $2 : $0)}' file file
this awk one-liner can help; it processes the file twice:
awk -F';' -v OFS=';' 'NR==FNR{a[$3]++;next}a[$3]<3{NF--}7' file file
On the second pass, decrementing NF drops the last field (which rebuilds the record using OFS, hence -v OFS=';'), and the always-true pattern 7 triggers the default print action.
Though the awk solutions are the best in terms of performance, your goal could also be achieved with something like this:
while IFS=" " read a b;do
if [[ "$a" -lt "3" ]];then
sed -i "s/;$b$//" b.txt
fi
done <<<"$(cut -d";" -f3 b.txt |sort |uniq -c)"
The operation is based on the output of cut | sort | uniq -c, which counts the occurrences:
$ cut -d";" -f3 b.txt |sort |uniq -c
7 000-000.2
1 000-200.2
3 020-000.8
The above edits the source file in place, so keep a backup for testing.
You can feed the file twice to awk. On the first run you gather a statistic that you use in the second run:
script.awk
FNR == NR { stats[ $3 ]++
next
}
{ if( stats[$3] < 3) print $1 FS $2
else print
}
Run it like this: awk -F\; -f script.awk yourfile yourfile .
The condition FNR == NR is true during processing of the first filename given to awk. The next statement skips the second block.
Thus the second block is only used for processing the second filename given to awk (which is here the same as the first filename).

AWK - get value between two strings over multiple lines

input.txt:
>block1
111111111111111111111
>block2
222222222222222222222
>block3
333333333333333333333
AWK command:
awk '/>block2.*>/' input.txt
Expected output
222222222222222222222
However, AWK is returning nothing. What am I misunderstanding?
Thanks!
Your pattern matches nothing because awk applies the regex to one record (line) at a time, and no single line contains >block2 followed by another >.
If you want to print the line after the line containing >block2, then you could use:
awk '/^>block2$/ { nr=NR+1 } NR == nr { print }'
Track the record number plus 1 when you find the match; when the current record number matches the remembered one, print the current record.
If you want all the lines between the line >block2 and >block3, then you'd use:
awk '/^>block2$/,/^>block3/ {if ($0 !~ /^>block[23]$/) print }'
For all lines between the two markers, if the line doesn't match either marker, print it. The output is the same with the sample data file.
another awk
$ awk 'c&&c--; /^>block2/{c=1}' file
222222222222222222222
c specifies how many lines you want to print after the match.
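For instance, a hypothetical variant printing the two lines after the match:
$ awk 'c&&c--; /^>block2/{c=2}' file
If you want the text between two markers: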
$ awk '/^>block3/{exit} s; /^>block2/{s=1}' file
222222222222222222222
if there are multiple instances and you want them all, just change exit to s=0
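That is:
$ awk '/^>block3/{s=0} s; /^>block2/{s=1}' file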
You probably meant:
$ awk '/>/{f=/^>block2$/;next} f' file
222222222222222222222
On every header line, f is set to 1 only if that header is exactly >block2 (the header itself is skipped by next); data lines are then printed while f is set.

awk to read specific column from a file

I have a small problem and I would appreciate helping me in it.
In summary, I have a file:
1,5,6,7,8,9
2,3,8,5,35,3
2,46,76,98,9
I need to read specific columns from it and print them into another text document. I know I can use awk '{print $2, $3}' to print the second and third columns beside each other. However, I need to use two statements, awk '{print $2}' >> file.text and then awk '{print $3}' >> file.text, but then the two columns appear under each other rather than beside each other.
How can I make them appear beside each other?
If you must extract the columns in separate processes, use paste to stitch them together. I assume your shell is bash/zsh/ksh, and I assume the blank lines in your sample input should not be there.
paste -d, <(awk -F, '{print $2}' file) <(awk -F, '{print $3}' file)
produces
5,6
3,8
46,76
Without the process substitutions:
awk -F, '{print $2}' file > tmp1
awk -F, '{print $3}' file > tmp2
paste -d, tmp1 tmp2 > output
Update based on your answer:
On first appearance, that's a confusing setup. Does this work?
for (( x=1; x<=$number_of_features; x++ )); do
feature_number=$(sed -n "$x {p;q}" feature.txt)
if [[ ! -f out.txt ]]; then
cut -d, -f$feature_number file.txt > out.txt
else
paste -d, out.txt <(cut -d, -f$feature_number file.txt) > tmp &&
mv tmp out.txt
fi
done
That has to read the file.txt file a number of times. It would clearly be more efficient to only have to read it once:
awk -F, -v numfeat="$number_of_features" '
# read the feature file into an array
NR==FNR {
colno[++i] = $0
next
}
# now, process the file.txt and emit the desired columns
{
sep = ""
for (i=1; i<=numfeat; i++) {
printf "%s%s", sep, $(colno[i])
sep = FS
}
print ""
}
' feature.txt file.txt > out.txt
Thanks all for contributing answers. I believe I should have been clearer in my question, sorry for that.
My code is as follow:
for (( x = 1; x <= $number_of_features ; x++ )) # the number extracted from a text file
do
feature_number=$(awk 'FNR == "'$x'" {print}' feature.txt)
awk -F, '{print $"'$feature_number'"}' file.txt >> out.txt
done
Basically, I extract the feature number (which is the same as the column number) from a text document and then print that column. The text document may contain many feature numbers.
The thing is, each time I have different feature numbers (which reflect the column numbers), so applying the above solutions is not sufficient for this problem.
I hope it is clearer now.
Waiting for your comments please.
Thanks
Ahmad
instead of using awk's file redirection, use shell redirection, e.g.
awk '{print $2,$3}' >> file
the comma is replaced with the value of the output field separator, OFS (a space by default).
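For example, a quick sketch (hypothetical input) that sets OFS explicitly:
$ echo 'a b c' | awk 'BEGIN{OFS=","} {print $2,$3}'
b,c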