how to get rid of awk fatal division by zero error - awk

Whenever I try to calculate the mean and standard deviation using awk, I get the error "awk: fatal: division by zero attempted".
My commands are:
awk '{s+=$3} END{print $2"\t"s/(NR)}' >> mean;
awk '{sum+=$3;sumsq+=$3*$3} END {print $2"\t"sqrt(sumsq/NR - (sum/NR)^2)}' >>sd
Does anyone know how to solve this?

Your trouble is that ... you are dividing by zero.
You have two commands:
awk '{s+=$3} END{print $2"\t"s/(NR)}' >> mean;
awk '{sum+=$3;sumsq+=$3*$3} END {print $2"\t"sqrt(sumsq/NR - (sum/NR)^2)}' >>sd
The first command reads standard input to EOF. The second command then runs and tries to read standard input, but finds it empty. It therefore reads zero records, NR is zero, and the division by NR crashes.
You will need to deal with both the mean and the standard deviation in a single command.
awk '{s1 += $3; s2 += $3*$3}
END { if (NR > 0){
print $2 "\t" s1 / NR;
print $2 "\t" sqrt(s2 / NR - (s1/NR)^2);
}
}'
This avoids divide-by-zero errors.
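As a rough sketch of the guarded single-pass approach, here it is wrapped in a shell function and fed inline data. The function name mean_sd, the variable label, and the sample rows are made up for illustration. The label is saved per record rather than read from $2 inside END, and a tiny negative variance caused by rounding is clamped to zero before calling sqrt.

```shell
# mean_sd: hypothetical helper; reads whitespace-separated records on stdin
# and prints "label<TAB>mean<TAB>sd" for column 3 (population sd).
mean_sd() {
  awk '{ s1 += $3; s2 += $3 * $3; label = $2 }
       END {
         if (NR == 0) { print "no input"; exit }   # nothing read: skip the division
         mean = s1 / NR
         var = s2 / NR - mean * mean
         if (var < 0) var = 0                      # clamp rounding noise
         printf "%s\t%.4f\t%.4f\n", label, mean, sqrt(var)
       }'
}

# Made-up sample: id, label, value
result=$(printf 'a\tx\t2\nb\tx\t4\nc\tx\t6\n' | mean_sd)
empty=$(mean_sd < /dev/null)
echo "$result"
echo "$empty"
```

With empty input the function prints "no input" instead of crashing, which is the case that triggered the original error.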


assigning a var inside AWK for use outside awk

I am using ksh on AIX.
I have a file with multiple comma delimited fields. The value of each field is read into a variable inside the script.
The last field in the file may contain multiple | delimited values. I need to test each value and keep the first one that doesn't begin with R, then stop testing the values.
sample value of $principal_diagnosis0
R65.20|A41.9|G30.9|F02.80
I've tried:
echo $principal_diagnosis0 | awk -F"|" '{for (i = 1; i<=NF; i++) {if ($i !~ "R"){echo $i; primdiag = $i}}}'
but I get this message : awk: Field $i is not correct.
My goal is to have a variable that I can use outside of the awk statement that gets assigned the first non-R code (in this case it would be A41.9).
echo $principal_diagnosis0 | awk -F"|" '{for (i = 1; i<=NF; i++) {if ($i !~ "R"){print $i}}}'
gets me the output of :
A41.9
G30.9
F02.80
So I know it's reading the values and evaluating properly. But I need to stop after the first match and be able to use that value outside of awk.
Thanks!
To answer your specific question:
$ principal_diagnosis0='R65.20|A41.9|G30.9|F02.80'
$ foo=$(echo "$principal_diagnosis0" | awk -v RS='|' '/^[^R]/{sub(/\n/,""); print; exit}')
$ echo "$foo"
A41.9
The above will work with any awk. With GNU awk, if you have it, you can do it more briefly:
foo=$(echo "$principal_diagnosis0" | awk -v RS='[|\n]' '/^[^R]/{print; exit}')
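If you would rather stay close to the loop from the question, a minimal sketch is to keep -F'|', anchor the test with ^R, and exit after the first hit. Command substitution then carries the value out of awk. The variable name primdiag is taken from the question's attempt; everything else is plain POSIX awk as far as I can tell.

```shell
principal_diagnosis0='R65.20|A41.9|G30.9|F02.80'

# Walk the |-separated fields and stop at the first one not starting with R.
primdiag=$(printf '%s\n' "$principal_diagnosis0" |
  awk -F'|' '{
    for (i = 1; i <= NF; i++)
      if ($i !~ /^R/) { print $i; exit }
  }')

echo "$primdiag"
```

The original attempt failed because echo is not an awk statement, and a variable assigned inside awk is never visible to the calling shell; only awk's output is.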
you can make FS and OFS do all the hard work :
echo "${principal_diagnosis0}" |
mawk NF=NF FS='^(R[^|]+[|])+|[|].+$' OFS=
A41.9
——————————————————————————————————————————
another slightly different variation of the same concept — overwriting fields but leaving OFS as is :
gawk -F'^.*R[^|]+[|]|[|].+$' '$--NF=$--NF'
A41.9
this works, because when you break it out :
gawk -F'^.*R[^|]+[|]|[|].+$' '
{ print NF
} $(_=--NF)=$(__=--NF) { print _, __, NF, $0 }'
3
1 2 1 A41.9
you'll notice you start with NF = 3, and the two subsequent decrements make it equivalent to $1 = $2,
but since final NF is now reduced to just 1, it would print it out correctly instead of 2 copies of it
…… which means you can also make it $0 = $2, as such :
gawk -F'^.*R[^|]+[|]|[|].+$' '$-_=$--NF'
A41.9
——————————————————————————————————————————
a 3rd variation, this time using RS instead of FS :
mawk NR==2 RS='^.*R[^|]+[|]|[|].+$'
A41.9
——————————————————————————————————————————
and if you REALLY don't wanna mess with FS/OFS/RS, use gsub() instead :
nawk 'gsub("^.*R[^|]+[|]|[|].+$",_)'
A41.9

Keep current and previous line only if current line fulfills a given condition

I have a file which looks like this:
>4RYF_1
MAENTKNENITNILTQKLIDTRTVLIYGEINQELAEDVSKQLLLLESISNDPITIFINSQGGHVEAGDTIHDMIKFIKPTVKVVGTGWVASAGITIYLAAEKENRFSLPNTRYMIHQPAGGVQGQSTEIEIEAKEIIRMRERINRLIAEATGQSYEQISKDTDRNFWLSVNEAKDYGIVNEIIENRDGLKMASWSHPQFEK
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDIFLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEIMIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEAKDYGLIDDIIINKSGLKGHHHHHH
I want to keep the sequence and previous line only if the sequence has a given length. For selecting only lines with that condition I use:
awk 'length($0) > 50 && length($0) <= 800' sample.txt
But how can I keep lines starting with > as well if this condition is met?
Yet another awk solution:
awk '/^>/ { header = $0; next } length > 50 && length <= 800 { print header ORS $0 }'
Would you please try the following:
awk -v RS='>' -F'\n' '
length($2) > 50 && length($2) <= 800 {printf ">%s", $0}
' sample.txt
Assigning RS to '>' tells awk to split the file on > into records,
treating the header line and the sequence line in the same record.
Assigning FS to '\n' splits the record to the header and
sequence, each assigning $1 to the header and $2 to the sequence.
As the leading > is chopped off as a delimiter, we need to prepend it
when printing the matched records.
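To see how that splitting behaves, here is a small sketch that prints each header together with the length of its sequence. The sample data and the variable name split_view are made up. The $1 != "" test skips the empty record that appears before the first >.

```shell
# Each record is "header\nsequence\n"; with FS="\n", $1 is the header and $2 the sequence.
split_view=$(printf '>id1\nMAEN\n>id2\nMNLIPT\n' |
  awk -v RS='>' -F'\n' '$1 != "" { print $1, length($2) }')

echo "$split_view"
```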
Here is one-liner:
LANG=C grep -B1 '^.\{51,800\}$' < sample.txt
The command was really slow with LANG=en_US.UTF-8 that I set by default, so using LANG=C instead.
man grep tells you that '-B NUM' means ' Print NUM lines of leading context before matching lines.'.
'^' means start of line
'.' means any character
'{51,800}' means we want between 51 and 800 of the previous thing
'$' means end of line.
Or in other words, we want to match lines that are between 51 and 800 characters, and print it and the previous line.
A potential solution with AWK is:
awk '!/^>/ {next}; {getline s}; length(s) > 50 && length(s) <= 800 { print $0 "\n" s }' example.fasta
e.g. if example.fasta contains
>4RYF_1
WLSVNEAKDYGIVNEIIENRDGLKMASWSHPQFEK
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDIFLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEIMIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEAKDYGLIDDIIINKSGLKGHHHHHH
>1000_chars
YiJOgeCApTkcJWxIuvooOxuqVnPdSLtOQmUfnzpBvcpYKyCvelFwKgMchYFnlvuZwVxNcnSvGcACsMywDQVvYBAiaIesQkLkYNsExRbqKPZIPnCRMAFHLmIzxIBqLwoNEPSKMZCTpwbbQCNrHSrbDMtCksTjvQsMeAkoudRGUJnPpQTEzwwnKoZBHtpMSIQBfYSPDYHwKktvCiFpewrsdDTQpqBajOWZkKURaKszEqDmdYMkzSAkMtlkXPfHroiTbyxZwzvrrMSXMRSavrBdgVYZanudjacRHWfpErJMkomXpzagXIzwbaeFgAgFnMxLuQHsdvZysqAsngkCZILvVLaFpkWnOpuYensROwkhwqUdngvlTsXBoCBwJUENUFgVdnSnxVOvfksyiabglFPqmSwhGabjNZiWGyvktzSDOQNGlEvoxhJCAOhxVAtZfyimzsziakpzfIszSWYVgKZTHatWSfttHYTkvgafcsVmitfEfQDuyyDAAAoTKpuhLrnHVFKgmEsSgygqcNLQYkpnhOosKiZJKpDolXcxAKHABtALqVXoVcSHpskrpWPrkkZLTpUXkENhnesmoQjonLWxkpcuJrOosXKNTDNuZaWIEtrDILXsIFTjAnrnwJBoirgNHcDURwDIzAXJSLPLmWkurOhWSLPrIOyqNvADBdIFaCGoZeewKleBHUGmKFWFcGgZIGUdOHwwINZqcOClPAjYaLNdLgDsUNCPwKMrOXJEyPvMRLaTJGgxzeoLCggJYTVjlJpyMsoCRZBDrBDckNMhJSQWBAxYBlqSpXnpmLeEJYirwjfCqZGBZdgkHzWGoAMxgNKHOAvGXsIbbuBjeeORhZaIrruBwDfzgTICuwWCAhCPqMqkHrxkQMZbXUIavknNhuIycoDssXlOtbSWsxVXQhWMyDQZWDlEtewXWKBPUcHDYWWgyOerbnoAxrnpsCulOxqxdywFJFoeWNpVGIPMUJSWwvlVDWNkjIBMlXPi
It will only print
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDIFLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEIMIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEAKDYGLIDDIIINKSGLKGHHHHHH
Edit
The method that I would recommend to better handle edge-cases is to use purpose-built bioinformatics software, e.g. seqkit
seqkit seq -m 50 -M 800 example.fasta
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDI
FLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEI
MIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEA
KDYGLIDDIIINKSGLKGHHHHHH
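One of the edge cases a dedicated tool handles is a sequence wrapped over several lines, as in the seqkit output above. If you want to stay with awk, a minimal sketch is to join the sequence lines of each record before testing the length. The function name emit, the variables min and max, and the tiny sample are assumptions chosen so the example is easy to check. For the question you would pass min=50 and max=800.

```shell
wrapped=$(printf '>a\nMNLI\nPTVI\n>b\nMN\n' |
  awk -v min=5 -v max=20 '
    # Print the buffered record if its joined sequence length is in (min, max].
    function emit() {
      if (header != "" && length(seq) > min && length(seq) <= max)
        print header ORS seq
    }
    /^>/ { emit(); header = $0; seq = ""; next }   # new header: flush the previous record
    { seq = seq $0 }                               # append a wrapped sequence line
    END { emit() }                                 # flush the last record
  ')

echo "$wrapped"
```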
Is perl an option?
perl -nle '$prev && print if length() >50 and length() <= 800 && print $prev; $prev = $_' input_file
$prev holds the previous line. When the current line's length is within range, the condition itself prints $prev (the header), and then $prev && print prints the current line, provided a previous line exists.
$prev = $_ stores the current line as the previous line for the next iteration.
If the upper limit 800 is not essential, could sed be an option?
$ sed -En '/>/ {N;/[a-zA-Z0-9]{50,}/p}' input_file
/>/ - Match a line containing > (the header)
N - Append the next line to the pattern space
{50,} - Match a run of 50 or more alphanumeric characters
p - Print the pattern space (header and sequence) if the match succeeds
Output
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDIFLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEIMIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEAKDYGLIDDIIINKSGLKGHHHHHH
With your shown samples, please try following awk code. Written and tested with GNU awk.
awk -v RS= '
{
val=""
delete arr
while(match($0,/>[^\n]*\n*[^\n]*/)){
val=substr($0,RSTART,RLENGTH)
split(val,arr,"\n")
if(length(arr[2])>50 && length(arr[2])<=800){
print val
}
$0=substr($0,RSTART+RLENGTH)
}
}
' Input_file
If only the next line should meet the length restrictions, you can match the line that starts with > and store it in a variable, for example previous.
Then for the next line, check its length and that previous is not empty.
If both conditions hold, print the previous and the current line.
Finally, reset previous to an empty string.
awk '{
if (/^>/) {
previous = $0
next
}
if (length(previous) != 0 && length($0) > 50 && length($0) <= 800) {
print previous ORS $0
}
previous=""
}' sample.txt

awk match pattern and convert number to different unit

I have a csv file that contains this kind of values:
vm47,8,32794384Ki,16257320Ki
vm47,8,30223304245,15223080Ki
vm48,8,32794384Ki,16257312Ki
vm48,8,30223304245,15223072Ki
vm49,8,32794384Ki,16257320Ki
vm49,8,30223304245,15223080Ki
Columns 3 and 4 are memory values expressed either in bytes or in kibibytes. The problem is that the "Ki" suffix appears inconsistently throughout the CSV file, particularly in column 3.
So to make the file consistent, I need to convert everything in bytes. So basically, any value matching a trailing "Ki" needs to have its numeric value multiplied by 1024, and then replace the corresponding XXXXXKi match.
The reason why I want to do it with awk is because I am already using awk to generate that csv format, but I am happy to do it with sed too.
This is my code so far but obviously it's wrong as it's multiplying any value in columns 3 and 4 by 1024 even though it does not match "Ki". I am not sure at this point how to ask awk "if you see Ki at the end, then multiply by 1024".
kubectl describe node --context=$context| sed -E '/Name:|cpu:|ephemeral-storage:|memory:/!d' | sed 's/\s//g' | awk '
BEGIN {FS = ":"; OFS = ","}
{record[$1] = $2}
$1 == "memory" {print record["Name"], record["cpu"], record["ephemeral-storage"], record["memory"]}
' | awk -F, '{print $1,$2,$3,$3*1024,$4,$4*1024}' >> describe_nodes.csv
Edit: I had assumed the factor should be 128 rather than 1024, but 128 is the factor for kibibits; converting KiB to bytes uses 1024.
"if you see Ki at the end, then multiply by 1024"
You may use:
awk 'BEGIN{FS=OFS=","} $3 ~ /Ki$/ {$3 *= 1024} $4 ~ /Ki$/ {$4 *= 1024} 1' file
vm47,8,33581449216,16647495680
vm47,8,30223304245,15588433920
vm48,8,33581449216,16647487488
vm48,8,30223304245,15588425728
vm49,8,33581449216,16647495680
vm49,8,30223304245,15588433920
Or a bit shorter:
awk 'BEGIN{FS=OFS=","} {
for (i=3; i<=4; ++i) $i ~ /Ki$/ && ($i *= 1024)} 1' file
With your shown samples, please try the following awk code. It loops over the fields from the 3rd onward, and wherever a value ends in Ki (matched case-insensitively) it multiplies the value by 1024, since 1 KiB is 1024 bytes. Every line, edited or not, is then printed.
awk 'BEGIN{FS=OFS=","} {for(i=3;i<=NF;i++){if($i~/[Kk][Ii]$/){$i *= 1024}}} 1' Input_file
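If other binary suffixes can show up too, a lookup table keeps the logic in one place. This is only a sketch: the mult array, the Mi and Gi entries, and the second sample row are my own additions, not something from the question. The result is formatted with sprintf("%.0f") because some awk builds may print large computed numbers in exponent notation.

```shell
converted=$(printf 'vm47,8,32794384Ki,16257320Ki\nvm47,8,30223304245,2Mi\n' |
  awk 'BEGIN {
         FS = OFS = ","
         mult["Ki"] = 1024                 # 1 KiB = 1024 bytes
         mult["Mi"] = 1024 * 1024
         mult["Gi"] = 1024 * 1024 * 1024
       }
       {
         for (i = 3; i <= 4; i++) {
           suffix = substr($i, length($i) - 1)          # last two characters
           if (suffix in mult)                          # plain byte values are left alone
             $i = sprintf("%.0f", substr($i, 1, length($i) - 2) * mult[suffix])
         }
         print
       }')

echo "$converted"
```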
You could try numfmt:
$ numfmt -d, --field 3,4 --from=auto --to=none <<EOF
vm47,8,32794384Ki,16257320Ki
vm47,8,30223304245,15223080Ki
EOF
vm47,8,33581449216,16647495680
vm47,8,30223304245,15588433920

awk floating point comparison not working

I have an input file with x1, x2 and x values, and I want to check whether x is the midpoint between x1 and x2.
But the comparison is failing.
sample input file
x1=20.9280 x2=20.9600 x=20.9440
x1=20.9280 x2=20.9600 x=20.9440
x1=22.7840 x2=22.8160 x=22.8000
Awk command
awk -F'[ =]' '{ if(($2 + $4)/2 != ($6)) print ($2 + $4)/2, " ", $6;}' sample
OUTPUT
20.944 20.9440
20.944 20.9440
22.8 22.8000
Comparison is failing due to extra zeros after decimal points. Please help to fix it.
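This happens because of floating point representation: values such as 20.9440 cannot be stored exactly in binary, so the computed average and the parsed field can differ in the last bits even though they print the same.
You can make the comparison robust by formatting the average to 4 decimal places, matching the input, and comparing it to $6 as a string: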
awk -F'[ =]+' '{avg = sprintf("%.4f", ($2 + $4) / 2)} avg != $6 { print avg, $6 }' file
If you have gnu awk then you can set precision to a lower number:
awk -M -v PREC=30 -F'[ =]+' '{avg = ($2 + $4) / 2; $6 += 0} avg != $6 { print avg, $6 }' file
Not really an answer, but to demonstrate: you are comparing floating point numbers, and they are not equal. I replaced print with printf and a format with enough decimals (20, %.20f):
$ awk -F'[ =]' '{
if(($2 + $4)/2 != ($6))
printf "%.20f %.20f\n", ($2 + $4)/2, $6
}' file
Output:
20.94400000000000261480 20.94399999999999906208
20.94400000000000261480 20.94399999999999906208
22.79999999999999715783 22.80000000000000071054
So use sprintf and appropriate modifiers (see the printf I used) to control the values.
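Another common way to handle this is to compare against a small tolerance instead of testing for exact equality. A minimal sketch, assuming a tolerance of half the last printed decimal place (the name eps and its value are my choice) and a third, deliberately wrong sample row:

```shell
mismatches=$(printf 'x1=20.9280 x2=20.9600 x=20.9440\nx1=22.7840 x2=22.8160 x=22.8000\nx1=1.0000 x2=2.0000 x=1.6000\n' |
  awk -F'[ =]+' -v eps=0.00005 '
    {
      diff = ($2 + $4) / 2 - $6
      if (diff < 0) diff = -diff          # absolute difference
    }
    diff > eps { printf "%.4f %.4f\n", ($2 + $4) / 2, $6 }   # report real mismatches only
  ')

echo "$mismatches"
```

Only the third row is reported; the first two differ from their midpoints by rounding noise far below eps.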
As others have pointed out, you are tripping over the common floating point arithmetic issue. Since all of your input values have the same number of decimal places, you can remove the .s to treat the inputs as integers, and multiply by 2 instead of dividing by 2 so that the comparison stays in integers:
$ awk -F'[ =]' '{o=$0; gsub(/\./,"")} ($6*2) == ($2+$4){$0=o; print ($2+$4)/2, $6}' file
20.944 20.9440
20.944 20.9440
22.8 22.8000
$ awk -F'[ =]' '{o=$0; gsub(/\./,"")} ($6*2) != ($2+$4){$0=o; print ($2+$4)/2, $6}' file
$

Print every nth column of a file

I have a rather big file with 255 comma-separated columns and I need to print out every third column only.
I was trying something like this
awk '{ for (i=0;i<=NF;i+=3) print $i }' file
but that doesn't seem to be the solution, since it prints everything in one long column. Can anybody help? Thanks
Here is one way to do this.
The script prog.awk:
BEGIN {FS = ","} # field separator
{for (i = 1; i <= NF; i += 3) printf ("%s%c", $i, i + 3 <= NF ? "," : "\n");}
Invocation:
awk -f prog.awk <input.csv >output.csv
Example input.csv:
1,2,3,4,5,6,7,8,9,10
11,12,13,14,15,16,17,18,19,20
Example output.csv:
1,4,7,10
11,14,17,20
The field splitting is wrong because awk splits fields on whitespace by default. You have to tell it to split on commas, using the FS variable or the -F switch. Also, the first field is number one; $0 is the whole line, so the loop should start at 1. Finally, print ends each value with a newline, which is why everything lands in one long column; printf with an explicit separator keeps each row on one line:
awk -F',' '{ for (i=1;i<=NF;i+=3) printf "%s%s", $i, (i+3<=NF ? "," : "\n") }' file
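For a different step size, the same idea can be sketched with the step passed in through -v. The variable names n and line are made up; the row is built up as a string and printed once, so each input row stays on one output row.

```shell
every_third=$(printf '1,2,3,4,5,6,7,8,9,10\n11,12,13,14,15,16,17,18,19,20\n' |
  awk -F',' -v n=3 '{
    line = $1                                   # always keep the first column
    for (i = 1 + n; i <= NF; i += n)            # then every n-th column after it
      line = line "," $i
    print line
  }')

echo "$every_third"
```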