awk to skip pattern and compare lines and output match and mismatch between files

In the awk below I am skipping the lines that start with # and trying to find the matching lines between two files based on 4 specific columns; if a line has no match, the output should show the line and which file it is missing from. Thank you :).
file1
##....
#....
chr1 1013466 . T TA 11438.1 PASS
chr1 1013490 . C G 14137 PASS
chr1 1013600 . T TAA 1140 PASS
file2
##...
#....
chr1 1013466 . T TA 10914 PASS
chr1 1013490 . C G 13785.1 PASS
chr1 1014600 . C A 2000 PASS
awk
f1=1234.txt
f2=5678.txt
awk -F'\t' '/^[^#]/ FNR==1 { next }
FNR == NR { $f1[$1,$2,$4,$5] = $1 FS $2 FS $4 FS $5 }
FNR != NR { $f2[$1,$2,$4,$5] = $1 FS $2 FS $4 FS $5 }
END { print "Match:"; for (k in $f1) if (k in $f2) print $f1[k] # Or $f2[k]
print "Not Found:"; for (k in $f2) if (!(k in $f1)) print $f2[k]
}' OFS="\t" $f1 $f2
desired output tab-delimited
Match between 1234.txt and 5678.txt:
chr1 1013466 . T TA 11438.1 PASS
chr1 1013490 . C G 14137 PASS
Not Found in 1234.txt:
chr1 1014600 . C A 2000 PASS
Not Found in 5678.txt:
chr1 1013600 . T TAA 1140 PASS

A few issues:
the incorrect prefacing of awk variables with a $; in awk the $ prefixes field (column) references, not variable references (see the short demo after this list)
need to capture the entire input line (not just the 4 index columns)
the missing logic for the 3rd set of output (lines found only in the first file)
need to capture the input filenames (to be included in the output headers)
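On the first point, a quick standalone demo of the difference (a minimal sketch): n is a variable, while $n is the field whose number is n's value.
$ awk 'BEGIN { n = 2; $0 = "a b c"; print n, $n }'
2 b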
One awk idea:
f1=1234.txt
f2=5678.txt
awk '
BEGIN { FS=OFS="\t" }
/^#/ { next }                                           # skip comment/header lines
{ ndx=$1 FS $2 FS $4 FS $5 }                            # build the 4-column comparison key
FNR == NR { if (!fname1) fname1=FILENAME; f1[ndx]=$0 }  # 1st file: save the whole line, keyed by ndx
FNR != NR { if (!fname2) fname2=FILENAME; f2[ndx]=$0 }  # 2nd file: save the whole line, keyed by ndx
END { printf "Match between %s and %s:\n", fname1, fname2
for (ndx in f1)
if (ndx in f2)
print f1[ndx]
printf "Not Found in %s:\n",fname1
for (ndx in f2)
if ( !(ndx in f1) )
print f2[ndx]
printf "Not Found in %s:\n",fname2
for (ndx in f1)
if ( !(ndx in f2) )
print f1[ndx]
}
' $f1 $f2
This generates:
Match between 1234.txt and 5678.txt:
chr1 1013490 . C G 14137 PASS
chr1 1013466 . T TA 11438.1 PASS
Not Found in 1234.txt:
chr1 1014600 . C A 2000 PASS
Not Found in 5678.txt:
chr1 1013600 . T TAA 1140 PASS
If it's necessary to maintain the input ordering of the rows:
awk '
BEGIN { FS=OFS="\t" }
/^#/ { next }
{ ndx=$1 FS $2 FS $4 FS $5 }
FNR == NR { if (!fname1) fname1=FILENAME; indices1[++cnt1]=ndx; f1[ndx]=$0 }
FNR != NR { if (!fname2) fname2=FILENAME; indices2[++cnt2]=ndx; f2[ndx]=$0 }
END { printf "Match between %s and %s:\n", fname1, fname2
for (i=1;i<=cnt1;i++) {
ndx=indices1[i]
if (ndx in f1 && ndx in f2)
print f1[ndx]
}
printf "Not Found in %s:\n",fname1
for (i=1;i<=cnt2;i++) {
ndx=indices2[i]
if ( !(ndx in f1) )
print f2[ndx]
}
printf "Not Found in %s:\n",fname2
for (i=1;i<=cnt1;i++) {
ndx=indices1[i]
if ( !(ndx in f2) )
print f1[ndx]
}
}
' $f1 $f2
This generates:
Match between 1234.txt and 5678.txt:
chr1 1013466 . T TA 11438.1 PASS
chr1 1013490 . C G 14137 PASS
Not Found in 1234.txt:
chr1 1014600 . C A 2000 PASS
Not Found in 5678.txt:
chr1 1013600 . T TAA 1140 PASS

Assuming each key is unique in each input file and you don't care about the order of lines per file in the output:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
/^#/ { next }
{ key = $1 FS $2 FS $4 FS $5 }
NR==FNR {
file1[key] = $0
next
}
{
if ( key in file1 ) {
both[key] = file1[key]
delete file1[key]
}
else {
file2[key] = $0
}
}
END {
printf "Match between %s and %s\n", ARGV[1], ARGV[2]
for ( key in both ) {
print both[key]
}
printf "Not found in %s\n", ARGV[1]
for ( key in file2 ) {
print file2[key]
}
printf "Not found in %s\n", ARGV[2]
for ( key in file1 ) {
print file1[key]
}
}
$ awk -f tst.awk 1234.txt 5678.txt
Match between 1234.txt and 5678.txt
chr1 1013466 . T TA 11438.1 PASS
chr1 1013490 . C G 14137 PASS
Not found in 1234.txt
chr1 1014600 . C A 2000 PASS
Not found in 5678.txt
chr1 1013600 . T TAA 1140 PASS

How to print filename with processed data in awk?

I am trying to print the filename along with the processed data but am not getting the desired output. In the example below I count the records whose third field is greater than 100, then store the counter in an array keyed by filename, so that I can print each file's count of such records next to its name.
$ awk -f test.awk f*
1 f1
4 f2
$ cat test.awk
BEGIN{FS=","}
FNR==1 {filename=FILENAME; next}
{
if(NF == 3 && $3 > 100) {
counter++
}
a[filename]=counter
}
END{
for(k in a){
print a[k], k
}
}
$ head f?
==> f1 <==
1,2,99
1,3,101
1,1,1
a,11,3,4
a,12,321,110
==> f2 <==
1,2,99
1,3,101
1,4,101
b,1,24,3
1,5,101
c,1,101,1
b,2,24,310
1,1,1
Expected output was -
1 f1
3 f2
Any suggestion?
Using gnu-awk it can be simplified to:
awk -F, 'NF==3 && $3 > 100 {++c} ENDFILE {print c+0, FILENAME; c=0}' f1 f2
1 f1
3 f2
Could you please try the following, written and tested with the shown samples in GNU awk.
awk '
BEGIN{
FS=","
}
FNR==1{
if(count){
print count,prevFilename
}
count=""
prevFilename=FILENAME
}
NF==3 && $NF>100{
++count
}
END{
if(count){
print count,prevFilename
}
}
' file1 file2
With the shown samples, the output will be as follows.
1 file1
3 file2
Add a debugging line to your code:
BEGIN{FS=","}
FNR==1 {filename=FILENAME; next}
{
if(NF == 3 && $3 > 100) {
counter++
print "COUNTER:", counter, FILENAME, $0 # debugging
}
a[filename]=counter
}
END{
for(k in a){
print a[k], k
}
}
and examine the results:
COUNTER: 1 f1 1,3,101
COUNTER: 2 f2 1,3,101
COUNTER: 3 f2 1,4,101
COUNTER: 4 f2 1,5,101
1 f1
4 f2
What, do you think, is the output of 1 + 3? The counter is never reset between files, so f2's total includes f1's.
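A minimal fix along those lines (a sketch; the file name test.fixed.awk is just for illustration) resets the counter on each new file and drops the next, so the first record of every file is tested as well:
$ cat test.fixed.awk
BEGIN { FS="," }
FNR==1 { filename=FILENAME; counter=0 }   # reset per file; no "next", so line 1 is tested too
NF==3 && $3 > 100 { counter++ }
{ a[filename]=counter }
END { for (k in a) print a[k], k }
With the shown samples this should print 1 f1 and 3 f2, though for (k in a) does not guarantee the output order.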
With GNU awk for ENDFILE:
$ awk -F, '(NF==3) && ($3>100){c++} ENDFILE{print c+0, FILENAME; c=0}' f1 f2
1 f1
3 f2

awk to capture and reformat based on pattern

The awk below produces the desired output for lines 1-5 of the input below. I am struggling with the sixth line and cannot seem to extend the awk to capture and print line 6 as in the desired output: the id as well as the portion after the -. I have tried adding a split, but that prints only numbers in field 5. I apologize for the long post; I'm just trying to include the details, as I cannot seem to figure this out. Thank you :)
split tried
{
split($4,F,/_/)
if(split($4,A,/[_]/)) {
if(A[2]~/[[:alpha:]]/)
p=A[2]
}
}
{
print "chr" $1, $2, $3, "chr" $1 ":" $2 "-" $3 OFS p
}
input tab-delimited
6 18122723 18122843 469_380805_378884(NHLRC1)_1.1_1
6 31114121 31114241 344047_16724314_rs746647_1
6 31430946 31431066 344049_16724385_HCP5(10866)_1_1
6 32808479 32808599 445446_18754304_PSMB8-exon6_1
1 33478785 33478905 19186497_AK2-Exon1_1
1 24022788 24022908 466743_18956150_RPL11-NM_000975-exon6_1
desired output tab-delimited
chr6 18122723 18122843 chr6:18122723-18122843 NHLRC1
chr6 31114121 31114241 chr6:31114121-31114241 rs746647
chr6 31430946 31431066 chr6:31430946-31431066 HCP5
chr6 32808479 32808599 chr6:32808479-32808599 PSMB8-exon6
chr1 33478785 33478905 chr1:33478785-33478905 AK2-Exon1
chr1 24022788 24022908 chr1:24022788-24022908 RPL11-exon6
awk
awk '
{
split($4,F,/_/)
if(split(F[3],G,/[)(]/)) {
if(G[2]~/[[:alpha:]]/)
p=G[2]
else
p=G[1]
}
else
p=F[3]
}
{
split($4,F,/_/)
if(split($4,A,/[_]/)) {
if(A[2]~/[[:alpha:]]/)
p=A[2]
}
}
{
print "chr" $1, $2, $3, "chr" $1 ":" $2 "-" $3 OFS p
}
' FS='\t' OFS='\t' input
current output tab-delimited
chr6 18122723 18122843 chr6:18122723-18122843 NHLRC1
chr6 31114121 31114241 chr6:31114121-31114241 rs746647
chr6 31430946 31431066 chr6:31430946-31431066 HCP5
chr6 32808479 32808599 chr6:32808479-32808599 PSMB8-exon6
chr1 33478785 33478905 chr1:33478785-33478905 AK2-Exon1
chr1 24022788 24022908 chr1:24022788-24022908 RPL11-NM
here is another approach
$ awk 'BEGIN {FS=OFS="\t"}
{n=split($NF,a,"[_()-]");
key=sep="";
for(i=1;i<=n;i++) if(a[i]~/[a-zA-Z]+[0-9]+/)
{key=key sep a[i]; sep="-"}
print "chr"$1,$2,$3,"chr"$1":"$2"-"$3,key }' file
chr6 18122723 18122843 chr6:18122723-18122843 NHLRC1
chr6 31114121 31114241 chr6:31114121-31114241 rs746647
chr6 31430946 31431066 chr6:31430946-31431066 HCP5
chr6 32808479 32808599 chr6:32808479-32808599 PSMB8-exon6
chr1 33478785 33478905 chr1:33478785-33478905 AK2-Exon1
chr1 24022788 24022908 chr1:24022788-24022908 RPL11-exon6
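To see which chunks that [_()-] split keeps, here is a quick standalone test on the tricky sixth id (a sketch, assuming a POSIX shell):
$ echo '466743_18956150_RPL11-NM_000975-exon6_1' |
awk '{n=split($0,a,"[_()-]")
      for(i=1;i<=n;i++)
        if(a[i]~/[a-zA-Z]+[0-9]+/) printf "%s%s", (c++ ? "-" : ""), a[i]
      print ""}'
RPL11-exon6
NM and 000975 are dropped because neither contains letters immediately followed by digits.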
The last line of your shown output doesn't seem to match the rule applied to all the other lines; assuming that's a typo, try the following.
awk '
{
if($NF ~ /\(/){
sub(/.*\(/,"",$NF);
sub(/\).*/,"",$NF)}
else {
num=split($NF,array,"_");
$NF=array[num-1]}
}
{
$NF=$1":"$2"-"$3 OFS $NF
}
1
' Input_file

awk to search field in file2 using range of file1

I am trying to use awk to find all the $2 values in file2 (~30MB) that fall between $2 and $3 in file1 (~2GB). If a value in $2 of file2 is within a file1 range, that file2 line is printed along with the $6 value from file1. Both files are tab-delimited, as is the desired output. If there is nothing to print, the next line is processed. The awk below runs but is very slow (it has been processing for about a day and is still not done). Is there a better way to approach this, or a better programming language?
To clarify: $1 of a file2 line must match $1 of a file1 line, and $2 of that file2 line must fall within that file1 line's $2-$3 range. Only when both conditions hold is the file2 line printed in the output.
Thank you :).
file1 (~3MB)
1 948953 948956 chr1:948953-948956 . ISG15
1 949363 949858 chr1:949363-949858 . ISG15
2 800000 900500 chr1:800000-900500 . AGRN
file2 (~80MB)
1 12214 . C G
1 949800 . T G
2 900000 rs123 - A
3 900000 . C -
desired output tab-delimited
1 949800 . T G ISG15
2 900000 rs123 - A AGRN
awk
awk -F'\t' -v OFS='\t' '
NR == FNR {min[NR]=$2; max[NR]=$3; Gene[NR]=$NF; next}
{
for (id in min)
if (min[id] < $2 && $2 < max[id]) {
print $0, id, Gene[id]
break
}
}
' file1 file2
This would be faster than what you have since it only loops through the file1 contents that have the same $1 value as the current file2 line and stops searching after it finds a matching range (note: the arrays of arrays used below require GNU awk):
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR {
c = ++num[$1]
beg[$1][c] = $2
end[$1][c] = $3
val[$1][c] = $NF
next
}
$1 in val {
for (c=1; c<=num[$1]; c++) {
if ( (beg[$1][c] <= $2) && ($2 <= end[$1][c]) ) {
print $0, val[$1][c]
break
}
}
}
$ awk -f tst.awk file1 file2
1 949800 . T G ISG15
2 900000 rs123 - A AGRN
Unfortunately, for unsorted input like yours there aren't many options to make it faster. If the ranges in file1 can overlap each other, then remove the break.
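If you do remove the break and would rather have every matching gene appended to the one output line instead of one line per match, the $1 in val block could be replaced with something like this (a sketch reusing the arrays from tst.awk above):
$1 in val {
    out = ""
    for (c=1; c<=num[$1]; c++)
        if ( (beg[$1][c] <= $2) && ($2 <= end[$1][c]) )
            out = out OFS val[$1][c]    # collect every overlapping gene
    if (out != "") print $0 out
}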
Could you please try the following and let me know if it helps you.
awk 'FNR==NR{A[$1]=$0;B[$1,++D[$1]]=$2;next} {++C[$1]}($2<B[$1,C[$1]] && $3>B[$1,C[$1]]){print A[$1] "\t" $NF}' Input_file2 Input_file1
This reads the files one by one: first Input_file2, then Input_file1.
If performance is the issue, you have to sort both files by the key value (and the range start).
With the files sorted, your scans can be incremental (and consequently much faster).
Here is an untested script:
$ awk '{line=$0; k=$2;
getline < "file1";
while (k >= $2) getline < "file1";
if(k <= $3) print line, $NF}' file2
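For the sorting step itself, something along these lines should work (a sketch: sort on the first column, then numerically on the range start / position):
$ sort -k1,1 -k2,2n file1 > file1.sorted
$ sort -k1,1 -k2,2n file2 > file2.sorted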
You can try to create a dict from file1 using multi-dimensional array keys in gawk; this is more computationally efficient, since file1 is small compared to file2:
awk '
NR==FNR{for(i=$2;i<=$3;++i) d[$1,i] = $6; next}
(($1,$2) in d){print $0, d[$1,$2]}' file1 file2
you get,
1 949800 . T G ISG15
2 900000 rs123 - A AGRN
One possible approach is to use AWK to generate another AWK file. Memory consumption should be low, so for a really big file1 this might be a lifesaver. As for speed, that might depend on how smart the AWK implementation is. I haven't had a chance to try it on huge data sets; I am curious about your findings.
Create a file step1.awk:
{
sub(/^chr/, "", $1);
print "$1==\"" $1 "\" && " $2 "<$2 && $2<" $3 " { print $0 \"\\t" $6 "\"; }";
}
Apply that on file1:
$ awk -f step1.awk file1
$1=="1" && 948953<$2 && $2<948956 { print $0 "\tISG15"; }
$1=="1" && 949363<$2 && $2<949858 { print $0 "\tISG15"; }
Pipe the output to a file step2.awk and apply that on file2:
$ awk -f step1.awk file1 > step2.awk
$ awk -f step2.awk file2
1 949800 . T G ISG15
Alternative: generating C
I rewrote step1.awk, making it generate C rather than AWK code. Not only will this address the memory issue you reported earlier; it should also be a lot faster, since C is compiled to native code.
BEGIN {
print "#include <stdio.h>";
print "#include <string.h>";
print "int main() {";
print " char s[999];";
print " int a, b;";
print " while (fgets(s, sizeof(s), stdin)) {";
print " s[strlen(s)-1] = 0;";
print " sscanf(s, \"%d %d\", &a, &b);";
}
{
print " if (a==" $1 " && " $2 "<b && b<" $3 ") printf(\"%s\\t%s\\n\", s, \"" $6 "\");";
}
END {
print " }";
print "}";
}
Given your sample file1, this will generate the following C source:
#include <stdio.h>
#include <string.h>
int main() {
char s[999];
int a, b;
while (fgets(s, sizeof(s), stdin)) {
s[strlen(s)-1] = 0;
sscanf(s, "%d %d", &a, &b);
if (a==1 && 948953<b && b<948956) printf("%s\t%s\n", s, "ISG15");
if (a==1 && 949363<b && b<949858) printf("%s\t%s\n", s, "ISG15");
if (a==2 && 800000<b && b<900500) printf("%s\t%s\n", s, "AGRN");
}
}
Sample output:
$ awk -f step1.awk file1 > step2.c
$ cc step2.c -o step2
$ ./step2 < file2
1 949800 . T G ISG15
2 900000 rs123 - A AGRN
It may be inefficient, but it should work, however slowly:
$ awk 'NR==FNR{ a[$2]=$0; next }
{ for(i in a)
if(i+0>=$2 && i+0<=$3) print a[i] "\t" $6 }
' f2 f1
1 949800 . T G ISG15
3 900000 . C - AGRN
Basically it reads file2 into memory, and for every line in file1 it goes through every stored entry of file2 (in memory). It doesn't read the 2 GB file into memory, so it still has less looking up to do than your version. (Note: i+0 forces a numeric comparison, since for (i in a) yields string subscripts.)
You could speed it up by replacing the print a[i] "\t" $6 with {print a[i] "\t" $6; delete a[i]}.
EDIT: Added tab delimited to output and refreshed the output to reflect the changed data. Printing "\t" is enough as the files are already tab delimited and records do not get rebuilt at any point.

awk runs but resulting output is empty

The awk below runs; however, the output file is 0 bytes. It matches input files of 21-259 records against a file of 11,137,660 records: each of the 4 input files is used to search the large file, and the average of all the $7 values in the matches is output. I cannot seem to figure out why the file is empty. Thank you :).
input
AGRN
CCDC39
CCDC40
CFTR
search
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 1 0
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 2 2
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 3 2
expected output
chr1:955543 AGRN|gc=75 1.3
awk
awk '
NR == FNR {input[$0]; next}
{
split($5, a, "-")
if (a[1] in input) {
key = $4 OFS $5
n[key]++
sum[key] += $7
}
}
END {
for (key in n)
printf "%s %.1f\n", key, sum[key]/n[key]
}
' search.txt input.txt > output.txt
Because the search file comes first in ARGV, the input[] array is still empty while search.txt is being read, so you can't do the data matchup until END.
Here's what I think will work. Based upon your test files, it produces a single line of output:
chr1:955543 AGRN-6|gc=75 1.3
Here is the script file, invoked with awk -f script.awk search.txt input.txt:
BEGIN {
slen = 0;
}
# get input file(s)
# NOTE: IMO, this is a cleaner better test condition
ARGIND > 1 {
###printf("input_push: DEBUG %s\n",$0);
input[$0];
next;
}
# get single search list
{
###printf("search_push: DEBUG %s\n",$0);
search[slen++] = $0;
next;
}
END {
# sum up data
for (sidx = 0; sidx < slen; ++sidx) {
sval = search[sidx];
###printf("search_end: DEBUG %s\n",sval);
split(sval,sary)
split(sary[5],a,"-");
###printf("search_end: DEBUG sary[5]='%s' a[1]='%s'\n",sary[5],a[1]);
if (a[1] in input) {
key = sary[4] OFS sary[5]
n[key]++
sum[key] += sary[7]
}
}
for (key in n)
printf "%s %.1f\n", key, sum[key]/n[key]
}
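As an aside, the original script's logic is otherwise sound; simply swapping the file order on the command line, so the gene list populates input[] before the search file is scanned, should also produce the single output line above without restructuring anything (untested against the full 11,137,660-record file):
awk '
NR == FNR {input[$0]; next}     # first file: the gene names
{
    split($5, a, "-")           # AGRN-6|gc=75 -> AGRN
    if (a[1] in input) {
        key = $4 OFS $5
        n[key]++
        sum[key] += $7
    }
}
END {
    for (key in n)
        printf "%s %.1f\n", key, sum[key]/n[key]
}
' input.txt search.txt > output.txt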

How to compare values in a column of the same file based on the common entries in another column

In my file, every entry in $1 is duplicated, but the values in $2 are unique. For each duplicate pair I want to compare the corresponding $3 values and print $1, $2 of the row whose $3 is bigger:
File
A ND 1
B NE 6
C NF 2
A ND_upd 10
B NE_upd 3
C NF_upd 7
Desired output:
A ND_upd
B NE
C NF_upd
If each duplicated entry in column 1 occurs exactly twice, then the following should suffice:
awk -v OFS="\t" '
($1 in compare) { print ($3 > compare[$1] ? $1 OFS $2 : line[$1]); next }
{ compare[$1] = $3; line[$1] = $1 OFS $2 }
' file
If column 1 can be duplicated more than twice, then you will need to build an array keeping the max for each duplicated entry and print them all in the END block:
awk -v OFS="\t" '
($1 in compare) {
if ($3 > compare[$1]) {
compare[$1] = $3
line[$1] = $1 OFS $2
}
next
}
{
compare[$1] = $3;
line[$1] = $1 OFS $2
key[++idx] = $1
}
END {
for (i=1; i<=idx; i++) print line[key[i]]
}' file
Output (in both cases):
A ND_upd
B NE
C NF_upd