awk calculation fails in cases where zero is used - awk

I am using awk to calculate the % of each id using the below, which runs and is very close, except when the number being used in the calculation is zero. I am not sure how to code this condition into the awk, as it happens frequently. Thank you :).
file1
ABCA2 9 232
ABHD12 211 648
ABL2 83 0
file2
CC2D2A 442
(CCDC114) 0
awk with error
awk 'function ceil(v) {return int(v)==v?v:int(v+1)}
> NR==FNR{f1[$1]=$2; next}
> $1 in f1{print $1, ceil(10000*(1-f1[$1]/$3))/100 "%"}' all_sorted_genes_base_counts.bed all_sorted_total_base_counts.bed > total_panel_coverage.txt
awk: cmd. line:3: (FILENAME=file1 FNR=3) fatal: division by zero attempted

When you have a script that fails while parsing 2 input files, I can't imagine why you'd show only 1 sample input file and no expected output, thereby ensuring that
we can't test our potential solutions against a sample you think is relevant and
we have no way of knowing if our script is doing what you want or not,
but in general to guard against a zero denominator you'd use code like:
awk '{print ($2 == 0 ? "NaN" : $1 / $2)}'
e.g.
$ echo '6 2' | awk '{print ($2 == 0 ? "NaN" : $1 / $2)}'
3
$ echo '6 0' | awk '{print ($2 == 0 ? "NaN" : $1 / $2)}'
NaN
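Applied to the script in the question, the same guard would look something like this (a sketch: I'm assuming, as in the original script, that the $3 read from the second file is the denominator, and I print NaN for the zero rows rather than skipping them):
awk 'function ceil(v) {return int(v)==v ? v : int(v+1)}
NR==FNR { f1[$1]=$2; next }
$1 in f1 { print $1, ($3 == 0 ? "NaN" : ceil(10000*(1-f1[$1]/$3))/100 "%") }
' all_sorted_genes_base_counts.bed all_sorted_total_base_counts.bed > total_panel_coverage.txt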


awk illegal statement at source line 1

I am executing the following awk command:
awk -F'\t' '{ split($4,array,"[- ]"); print > array[1]""array[2]""array[3]}' myFile.txt
but seeing this error:
awk: syntax error at source line 1
context is
{ split($4,array,"[- ]"); print > >>> array[1]"" <<<
awk: illegal statement at source line 1
awk: illegal statement at source line 1
What can be the reason for that? How do I fix the script?
Those pairs of double quotes are doing nothing; you could just remove them:
awk -F'\t' '{ split($4,array,"[- ]"); print > array[1] array[2] array[3]}' myFile.txt
An unparenthesized expression on the right side of input or output redirection is undefined behavior per POSIX, which is why some awks (e.g. gawk) will interpret your code as you intended:
awk -F'\t' '{ split($4,array,"[- ]"); print > (array[1] array[2] array[3])}' myFile.txt
while others can interpret it as:
awk -F'\t' '{ split($4,array,"[- ]"); (print > array[1]) (array[2] array[3])}' myFile.txt
which is a syntax error in any awk, or anything else.
You can fix your syntax error by adding the parens:
awk -F'\t' '{ split($4,array,"[- ]"); print > (array[1] array[2] array[3])}' myFile.txt
but that could have other problems too, and the right way to do what you're trying to do depends on whatever it is you're trying to do, which we can't tell just from your code. If you post a new question with sample input and expected output then we can help you write your code the right way.
You need
print > (array[1]""array[2]""array[3])
in many implementations of awk. Note the parentheses around the expression that generates the filename.
You might want to close the file afterwards too, in case a lot of different filenames can be created, and use appending instead:
awk -F'\t' '{ split($4,array,"[- ]")
file = array[1] "" array[2] "" array[3]
print >> file
close(file)
}' myFile.txt
here's an awk-based solution, verified on 4 awk variants, that requires no array splitting while also closing file connections along the way:
the pristine $0 has been pre-saved, so performing ++NF against a blank OFS does not result in data truncation
(ps: as a matter of fact, saving $0 is only necessary for gawk and nawk)
SETUP and INPUT
removed 'wyx-8979479BCCF-;#%&*[)(]~'
zsh: no matches found: wyx*
1 --------INPUT------------
2 bca 0106 qsr wyx-8979479BCCF-=;#%&*[)(]~ testtail
CODE
{m,n,g}awk '
BEGIN {
OFS = _
FS = "^[^\t]*\t[^\t]*\t[^\t]*\t|[ -]|\t[^\t]*$"
} {
___ = $(_*(__==_?_:close(__)))
print(___) > (__ = $!++NF) }'
# mawk-1/2 specific streamlining
mawk 'BEGIN { FS="^[^\t]*\t[^\t]*\t[^\t]*\t|[ -]|\t[^\t]*$"(OFS=_)
} { print $(_*(__==_?_:close(__))) > (__ = $!++NF) }'
OUTPUT
-rw-r--r-- 1 501 20 50 Jun 19 12:13 wyx8979479BCCF=;#%&*[)(]~
1 bca 0106 qsr wyx-8979479BCCF-=;#%&*[)(]~ testtail
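If the golfed form is hard to follow, here's a rough de-obfuscated equivalent of the first variant (a sketch: the variable names are mine, and the behavior is inferred from the code above):
awk 'BEGIN {
  OFS = ""   # rebuild $0 with nothing between the fields
  # FS strips the first 3 tab-separated fields, splits the 4th on space/dash,
  # and strips the final tab plus last field
  FS = "^[^\t]*\t[^\t]*\t[^\t]*\t|[ -]|\t[^\t]*$"
}
{
  if (out != "") close(out)   # close the previously opened file
  line = $0                   # save the pristine record before touching NF
  NF++                        # forces $0 to be rebuilt from the split pieces,
                              # joined by the empty OFS
  out = $0                    # the rebuilt $0 is the output filename
  print line > out
}' myFile.txt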

awk: print each column of a file into separate files

I have a file with 100 columns of data. I want to print the first column and the i-th column into 99 separate files. I am trying to use
for i in {2..99}; do awk '{print $1" " $i }' input.txt > data${i}; done
But I am getting these errors:
awk: illegal field $(), name "i"
input record number 1, file input.txt
source line number 1
How do I correctly use $i inside { print }?
The following single awk may help you here too:
awk -v start=2 -v end=99 '{for(i=start;i<=end;i++){print $1,$i > "file"i;close("file"i)}}' Input_file
An all-awk solution. First, the test data:
$ cat foo
11 12 13
21 22 23
Then the awk:
$ awk '{for(i=2;i<=NF;i++) print $1,$i > ("data" i)}' foo
and results:
$ ls data*
data2 data3
$ cat data2
11 12
21 22
The for iterates from 2 to the last field. If that covers more fields than you want to process, change NF to the number you'd like. If, for some reason, a hundred open files would be a problem on your system, you'd need to put the print into a block and add a close call:
$ awk '{for(i=2;i<=NF;i++){f=("data" i); print $1,$i >> f; close(f)}}' foo
If you want to do what you were trying to accomplish:
for i in {2..99}; do
awk -v x=$i '{print $1" " $x }' input.txt > data${i}
done
Note
the -v switch of awk passes variables in
$x is the nth column, where n is the value of your variable x
Note 2: this is not the fastest solution (one awk call is fastest), but it corrects your logic. Ideally, take the time to learn awk; it's never time wasted.
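A minimal illustration of -v passing a shell value into awk:
$ i=2; echo 'a b c' | awk -v x=$i '{print $x}'
b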

awk to output two files based on match or no match

In the below awk I am trying to print out the lines that have the string FP or RFP in $2 of the tab-delimited input. The lines of the file that do not have those keywords in $2 are printed to result, and at the same time another file, removed, is printed that has the lines that did have those keywords in them. The awk has a syntax error in it when I try to print two files; if I only print one, the awk runs. Thank you :).
input
12 aaa
123 FP bbb
11 ccc
10 RFP ddd
result
12 aaa
11 ccc
removed
123 FP bbb
10 RFP ddd
awk
awk -F'\t' 'BEGIN{d["FP"];d["RFP"]}!($2 in d) {print > "removed"}; else {print > "result"}' file
awk: cmd. line:1: BEGIN{d["FP"];d["RFP"]}!($2 in d) {print > "removed"}; else {print > "result"}
awk: cmd. line:1: ^ syntax error
else goes with if. Your script didn't have an if, just an else, hence the syntax error. All you need is:
awk -F'\t' '{print > ($2 ~ /^R?FP$/ ? "removed" : "result")}' file
or if you prefer the array approach you are trying to use:
awk -F'\t' '
BEGIN{ split("FP RFP",t,/ /); for (i in t) d[t[i]] }
{ print > ($2 in d ? "removed" : "result") }
' file
Read the book Effective Awk Programming, 4th Edition, by Arnold Robbins to learn awk syntax and semantics.
Btw when writing if/else code like you show in your question:
if ( !($2 in d) ) removed; else result
THINK about the fact that you're using negative (!) logic, which makes your code harder to understand right away AND opens you up to potential double negatives. Always try to express every condition in a positive way; in this case that'd be:
if ($2 in d) result; else removed
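For example, with the sample input from the question (assuming the fields really are tab-separated), the first one-liner above produces exactly the two expected files:
$ awk -F'\t' '{print > ($2 ~ /^R?FP$/ ? "removed" : "result")}' file
$ cat result
12 aaa
11 ccc
$ cat removed
123 FP bbb
10 RFP ddd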

awk and log2 divisions

I have a tab delimited file that looks something like this:
foo 0 4
boo 3 2
blah 4 0
flah 1 1
I am trying to calculate the log2 ratio between the two columns for each row. My problem is with division by zero.
What I have tried is this:
cat file.txt | awk -v OFS='\t' '{print $1, log($3/$2)/log(2)}'
When there is a zero as the denominator, awk will crash. What I want is some sort of conditional statement that prints "inf" as the result when the denominator is equal to 0.
I am really not sure how to go about this.
Any help would be appreciated
Thanks
You can implement that as follows (with a few additional tweaks):
awk 'BEGIN{OFS="\t"} {if ($2==0) {print $1, "inf"} else {print $1, log($3/$2)/log(2)}}' file.txt
Explanation:
if ($2==0) {print $1, "inf"} else {...} - First check to see if the 2nd field ($2) is zero. If so, print $1 and inf and move on to the next line; otherwise proceed as usual.
BEGIN{OFS="\t"} - Set OFS inside the awk script; mostly a preference thing.
... file.txt - awk can read from files when you specify it as an argument; this saves the use of a cat process. (See UUCA)
awk -F'\t' '{print $1,($2 ? log($3/$2)/log(2) : "inf")}' file.txt
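Note that both versions guard only a zero denominator; when $3 is 0 instead (as in the blah row above), log(0) is still evaluated, which some awks warn about. A sketch that guards both columns (the inf/-inf labels are just illustrative choices):
awk -F'\t' '{print $1, ($2 == 0 ? "inf" : ($3 == 0 ? "-inf" : log($3/$2)/log(2)))}' file.txt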

nested awk commands?

I have got the following two pieces of code:
nut=`awk "/$1/{getline; print}" ids_lengths.txt`
and
grep -v '#' neco.txt |
grep -v 'seq-name' |
grep -E '(\S+\s+){13}\bAC(.)+CA\b' |
awk '$6 >= 49 { print }' |
awk '$6 <= 180 { print }' |
awk '$4 > 1 { print }' |
awk '$5 < $nut { print }' |
wc -l
I would like my script to replace "nut" at this place:
awk '$5 < $nut { print }'
with the number returned from this:
nut=`awk "/$1/{getline; print}" ids_lengths.txt`
However, $1 in the code just above should represent not a column from ids_lengths.txt, but the first column from neco.txt! (similarly to how I use $6 and $4 in the main code).
Any help on how to solve these nested awks will definitely be appreciated :-)
edit:
Line of my input file (neco.txt) looks like this:
FZWTUY402JKYFZ 2 100.000 3 11 9 4.500 7 0 0 0 . TG TGTGTGTGT
The biggest problem is that I want to filter the lines whose fifth column is less than a number that I get from another file (ids_lengths.txt), by searching with the first column (e.g. FZWTUY402JKYFZ). That's why I put the "nut" variable in my draft script :-)
ids_lengths.txt looks like this:
>FZWTUY402JKYFZ
153
>FZWTUY402JXI9S
42
>FZWTUY402JMZO4
158
You can combine the two grep -v operations and the four consecutive awk operations into one of each. This gives you useful economy without completely rewriting everything:
nut=`awk "/$1/{getline; print}" ids_lengths.txt`
grep -E -v '#|seq-name' neco.txt |
grep -E '(\S+\s+){13}\bAC(.)+CA\b' |
awk -vnut="$nut" '$6 >= 49 && $6 <= 180 && $4 > 1 && $5 < nut { print }' |
wc -l
I would not bother to make a single awk script determine the value of nut and do the value-based filtering. It can be done, but it complicates things unnecessarily, unless you can demonstrate that the whole thing is a bottleneck for the performance of the production system, in which case you do the extra work (though I'd probably use Perl in that case; it can do the whole lot in one command).
Approximately:
awk -v select="$1" '$0 ~ select && FNR == NR { getline; nut = $0; } FNR == NR {next} $4 > 1 && $5 < nut && $6 >= 49 && $6 <= 180 && ! /#/ && ! /seq-name/ && $NF ~ /^AC.+CA$/ {count++} END {print count}' ids_lengths.txt neco.txt
The regex will need to be adjusted to something that AWK understands. I can't see how the regex matches the sample data you provided. Part of the solution may be to use a field count as one of the conditions. Perhaps NF == 13 or NF >= 13.
Here's the script above broken out on multiple lines for readability:
awk -v select="$1" '
$0 ~ select && FNR == NR {
getline
nut = $0;
}
FNR == NR {next}
$4 > 1 &&
$5 < nut &&
$6 >= 49 &&
$6 <= 180 &&
! /#/ &&
! /seq-name/ &&
$NF ~ /^AC.+CA$/ {
count++
}
END {
print count
}' ids_lengths.txt neco.txt