awk to match two fields in two files - awk

I want to find lines where fields 1 and 2 from file1 match fields 2 and 3 from file2, and then print all fields from file2. There are more lines in file2 than in file1.
File1
rs116801199 720381
rs138295790 16057310
rs131531 16870251
rs131546 16872281
rs140375 16873251
rs131552 16873461
File2
--- rs116801199 720381 0.026 0.939 0.996 0 -1 -1 -1
1 rs12565286 721290 0.028 1.000 1.000 2 0.370 0.934 0.000
1 rs3094315 752566 0.432 1.000 1.000 2 0.678 0.671 0.435
--- rs3131972 752721 0.353 0.906 0.938 0 -1 -1 -1
--- rs61770173 753405 0.481 0.921 0.950 0 -1 -1 -1
I tried something like:
awk -F 'FNR==NR{a[$1];b[$2];next} FNR==1 || ($2 in a && $3 in b)' file1 file2 > test
But got a syntax error

Consider:
awk -F 'FNR==NR{a[$1];b[$2];next} FNR==1 || ($2 in a && $3 in b)' file1 file2
The option -F expects an argument, but none is provided here: awk therefore takes the entire program text as the field separator and then tries to parse the next argument, file1, as the program. That is why you get a syntax error.
From the problem statement, I don't see why FNR==1 should be in the code, so I removed it. Once that is done, the parentheses are unnecessary, and the code simplifies to:
$ awk 'FNR==NR{a[$1];b[$2];next} $2 in a && $3 in b' file1 file2
--- rs116801199 720381 0.026 0.939 0.996 0 -1 -1 -1
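One subtle point worth noting: with two separate arrays, $2 and $3 are tested against file1 independently, so a file2 line could match $2 from one file1 row and $3 from a different one. If the pair must come from the same file1 line, a composite key ties them together (a sketch on the sample data):

```shell
# Key on the ($1,$2) pair from each file1 line, then require file2's
# ($2,$3) pair to match that same key.
awk 'FNR==NR{seen[$1,$2]; next} ($2,$3) in seen' file1 file2
```

On the samples above both versions print the same line, but the composite key is the safer contract.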

Related

Subsetting GWAS results by matching snp column from another file

I have a GWAS summary estimate file with the following columns (file 1):
1 chr1_1726_G_A 0.023 0.160
1 chr1_20184_GAATA_G 0.033 0.180
1 chr1_791101_T_TGG 0.099 0.170
file 2
chr1_20184_GAATA_G
chr1_791101_T_TGG
I would like to match column 1 of file2 with column 2 of file1 to create a file3 such as:
1 chr1_20184_GAATA_G 0.033 0.180
1 chr1_791101_T_TGG 0.099 0.170
By using the below code, I get an empty file3:
awk 'FNR==NR{arr[$2];next} (($2) in arr)' file2 file1 > file3
With your shown samples, please try the following awk code.
awk 'FNR==NR{arr[$0];next} ($2 in arr)' file2 file1
OR
awk 'FNR==NR{arr[$1];next} ($2 in arr)' file2 file1
Explanation: file2 has only one column, so $2 is empty there, and your arr[$2] builds an array keyed on the empty string; nothing can match it. Use $0 (first solution) or $1 (second solution) as the key instead. The rest of your code, looking up $2 of file1 in the array, is fine.
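Run against the samples above, either version produces the requested file3 (a quick sketch):

```shell
# file2's whole line is the lookup key; keep file1 rows whose second
# field appears among those keys.
awk 'FNR==NR{arr[$0]; next} $2 in arr' file2 file1 > file3
```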

Awk compare 2 files and match 2 columns, get % difference between column 3s

I have 2 files with the formatting below. I am trying to compare lines where columns 1 and 2 match, and then get the difference between the two numbers in column 3.
If file2 column 3 is greater than file1 column 3, I would like a + at the end of the row.
If file2 column 3 is less than file1 column 3, I would like a - at the end of the row.
If either file's column 3 is 0, I would like a * at the end of the row.
I only want to print lines where the difference between the two columns is > 15%.
file1
abc,1,472
abc,2,536
abc,3,652
abc,4,512
abc,5,474
abc,6,266
abc,7,520
def,1,954
def,9,538
def,10,136
def,11,341
def,12,183
def,13,1209
def,14,365
def,15,536
def,16,979
def,17,0
xyz,1,547
xyz,19,0
xyz,20,0
xyz,21,0
xyz,22,0
xyz,23,0
xyz,24,0
File 2
abc,1,456
abc,2,533
abc,3,643
abc,4,444
abc,5,124
abc,6,255
abc,7,520
def,1,954
def,9,538
def,10,435
def,11,341
def,12,155
def,13,1209
def,14,365
def,15,536
def,16,979
def,17,0
xyz,1,547
xyz,19,124
xyz,20,0
xyz,21,0
xyz,22,0
xyz,23,0
xyz,24,0
expected output
abc,5,474,124,74%,- // (474-124)/474 = 74%
def,10,136,435,69%,+ // (435-136)/435 = 69%
xyz,19,0,124,100%,* // either file has 0: print 100% and *
I have tried multiple iterations of this but cannot seem to get the formatting to work.
awk -F, 'FNR==NR{a[$1,$2]; next ;b[$1,$2,$3]; next} $1,$2 in a {if ($3>b[$3]) {Q=((b[$3]/$3) *100)) {print Q,$0 }} else if (b[$3]>$3) {Q=(($3/b[$3]) *100)){print Q,$0 }}' file1 file2
I get this error:
^ unexpected newline or end of string
I also tried variations on this line, but I cannot figure out the division-by-zero error:
awk -F, 'FNR==NR{a[$1,$2]; next ;b[$1,$2,$3]; next} $1,$2 in a {if ((Q=(b[$3]/$3) > 15) || (Q=($3/b[$3])) > 15 ){print Q,$0}}' file1 file2
awk: cmd. line:1: (FILENAME=file2 FNR=1) fatal: division by zero attempted
You need to handle a zero denominator as a special case: since you cannot compute a relative change when the base value is zero, you have to report something else (here, a flat 100% with the * marker).
$ awk -F, -v OFS=, '{k = $1 FS $2}
  FNR==NR {a[k] = $3; next}
  k in a {
    if (a[k]) q = $3/a[k] - 1
    else if ($3) zero = 1
    else q = 0
    plus  = q >  0.15
    minus = q < -0.15
    q = q < 0 ? -q : q
    if (zero) plus = minus = 0
    if (plus || minus || zero)
      print k, a[k], $3, (zero ? 100 : int(100*q)) "%", (plus ? "+" : minus ? "-" : "*")
    q = zero = 0
  }' file1 file2
abc,5,474,124,73%,-
def,10,136,435,219%,+
def,12,183,155,15%,-
xyz,19,0,124,100%,*
You can put this in a diff.awk file and run it with awk -f diff.awk file1 file2.
The file contents should be
BEGIN{FS=OFS=","}
{k=$1 FS $2}
... the code in between
q=zero=0}
Note that the file body is written without the single quotes. You could also make it executable with the right shebang, but I think this is simpler.
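For completeness, the assembled diff.awk file (the same program as above, just reformatted with FS and OFS set in a BEGIN block instead of on the command line) would look like this:

```shell
cat > diff.awk <<'EOF'
BEGIN { FS = OFS = "," }
{ k = $1 FS $2 }
FNR==NR { a[k] = $3; next }
k in a {
  if (a[k]) q = $3/a[k] - 1       # relative change vs file1's value
  else if ($3) zero = 1           # base is 0 but file2 is not: special case
  else q = 0                      # both zero: no change
  plus  = q >  0.15
  minus = q < -0.15
  q = q < 0 ? -q : q              # absolute value for printing
  if (zero) plus = minus = 0
  if (plus || minus || zero)
    print k, a[k], $3, (zero ? 100 : int(100*q)) "%", (plus ? "+" : minus ? "-" : "*")
  q = zero = 0                    # reset state for the next record
}
EOF
awk -f diff.awk file1 file2
```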

awk calculation fails in cases where zero is used

I am using awk to calculate the % for each id using the below, which runs and is very close, except when the number used in the calculation is zero. I am not sure how to code this condition into the awk, as it happens frequently. Thank you :).
file1
ABCA2 9 232
ABHD12 211 648
ABL2 83 0
file2
CC2D2A 442
(CCDC114) 0
awk with error
awk 'function ceil(v) {return int(v)==v?v:int(v+1)}
> NR==FNR{f1[$1]=$2; next}
> $1 in f1{print $1, ceil(10000*(1-f1[$1]/$3))/100 "%"}' all_sorted_genes_base_counts.bed all_sorted_total_base_counts.bed > total_panel_coverage.txt
awk: cmd. line:3: (FILENAME=file1 FNR=3) fatal: division by zero attempted
When you have a script that fails while parsing 2 input files, I can't imagine why you'd show only 1 sample input file and no expected output, thereby ensuring that
we can't test our potential solutions against a sample you think is relevant, and
we have no way of knowing whether our script is doing what you want or not,
but in general, to guard against a zero denominator, you'd use code like:
awk '{print ($2 == 0 ? "NaN" : $1 / $2)}'
e.g.
$ echo '6 2' | awk '{print ($2 == 0 ? "NaN" : $1 / $2)}'
3
$ echo '6 0' | awk '{print ($2 == 0 ? "NaN" : $1 / $2)}'
NaN
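Applied to the script in the question, the guard would go around the f1[$1]/$3 division (a sketch; the .bed filenames are from your command, and the column layout is assumed from your code):

```shell
# Print "NaN" instead of dividing when the denominator field $3 is zero.
awk 'function ceil(v) { return int(v)==v ? v : int(v+1) }
NR==FNR { f1[$1]=$2; next }
$1 in f1 { print $1, ($3 == 0 ? "NaN" : ceil(10000*(1 - f1[$1]/$3))/100 "%") }' \
  all_sorted_genes_base_counts.bed all_sorted_total_base_counts.bed
```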

Maintaining the separator in awk output

I would like to subset a file while keeping the separator in the subsetted output, using awk in bash.
That's what I am using:
The input file is created in R language with:
inp <- 'AX-1 1 125 AA 0.2 1 AB -0.89 0 AA 0.005 0.56
AX-2 2 456 AA 0 0 AA -0.56 0.56 AB -0.003 0
AX-3 3 3445 BB 1.2 1 NA 0.002 0 AA 0.005 0.55'
inp <- read.table(text=inp, header=F)
write.table(inp, "inp.txt", col.names=F, row.names=F, quote=F, sep="\t")
(So fields are separated by tabs)
The code in bash:
awk {'print $1 $2 $3'} inp.txt
The result:
AX-11125
AX-22456
AX-333445
Please note that my columns were merged in the awk output (and I would like it to be tab-delimited like the input file). It is probably a simple syntax problem, but I would be grateful for any ideas.
Use
awk -v OFS='\t' '{ print $1, $2, $3 }'
or
awk '{ print $1 "\t" $2 "\t" $3 }'
Written one after another without an operator between them, expressions in awk are concatenated - $1 $2 $3 is no different from $1$2$3 in this respect.
The first solution sets the output field separator OFS to a tab, then uses the comma operator to print separated fields. The second solution simply sprinkles tabs in there directly, and everything is concatenated as it was before.
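One caveat worth spelling out: setting OFS alone does not fix the concatenated print - OFS is inserted only where the commas are:

```shell
printf 'a\tb\tc\n' | awk -v OFS='\t' '{ print $1 $2 }'   # still concatenated: ab
printf 'a\tb\tc\n' | awk -v OFS='\t' '{ print $1, $2 }'  # tab-separated
```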

awk print line of file2 based on condition of file1

I have two files:
cat file1:
0 xxx
1 yyy
1 zzz
0 aaa
cat file2:
A bbb
B ccc
C ddd
D eee
How do I get the following output using awk:
B ccc
C ddd
My question is, how do I print lines from file2 only if a certain field in file1 (i.e. field 1) matches a certain value (i.e. 1)?
Additional information:
Files file1 and file2 have an equal number of lines.
Files file1 and file2 have millions of lines and cannot be read into memory.
file1 has 4 columns.
file2 has approximately 1000 columns.
Try doing this (a bit obfuscated):
awk 'NR==FNR{a[NR]=$1}NR!=FNR&&a[FNR]' file1 file2
On multiple lines it can be clearer (reminder: an awk program is a series of condition { action } pairs):
awk '
NR==FNR{arr[NR]=$1}
NR!=FNR && arr[FNR]
' file1 file2
If I remove the "clever" parts of the snippet (written as pseudocode - this is not valid awk):
awk '
if (NR == FNR) {arr[NR]=$1}
if (NR != FNR && arr[FNR]) {print $0}
' file1 file2
When awk finds a condition alone (without an action), like NR!=FNR && arr[FNR], it prints the current record to stdout by default if the expression is true (non-zero, non-empty).
Explanations
NR is the number of the current record from the start of input
FNR is the ordinal number of the current record in the current file (so NR differs from FNR while reading the second file)
arr[NR]=$1 : stores the first column of file1 in the array arr, indexed by the current line number
when NR!=FNR we are in the second file; if the stored value for this line number is 1 (true), the line is printed
Not as clean as the awk solution (and note that sed '/0/d' deletes any line containing a 0 anywhere, so it relies on the sample data):
$ paste file2 file1 | sed '/0/d' | cut -f1
B
C
You mentioned something about millions of lines; in order to do just a single pass through the files, I'd resort to Python. Something like this perhaps (Python 2.7):
with open("file1") as fd1, open("file2") as fd2:
for l1, l2 in zip(fd1, fd2):
if not l1.startswith('0'):
print l2.strip()
awk '{
getline value <"file2";
if ($1)
print value;
}' file1
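This last version reads file2 line by line with getline, in lockstep with file1, so neither file is held in memory - which suits the millions-of-lines constraint. A quick check on the sample files:

```shell
# Recreate the sample inputs, then print each file2 line whose matching
# file1 line has a truthy first field.
printf '%s\n' '0 xxx' '1 yyy' '1 zzz' '0 aaa' > file1
printf '%s\n' 'A bbb' 'B ccc' 'C ddd' 'D eee' > file2
awk '{ getline value < "file2"; if ($1) print value }' file1
# prints:
# B ccc
# C ddd
```

Here if ($1) is a numeric truth test because the field looks like a number, so a first field of 0 is false and the corresponding file2 line is skipped.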