Calculating 95th percentile with awk

I'm new to awk scripting and would like some help calculating the 95th percentile value for a file that consists of this data:
0.0001357
0.000112
0.000062
0.000054
0.000127
0.000114
0.000136
I tried:
cat filename.txt | sort -n |
awk 'BEGIN{c=0} {total[c]=$1; c++;} END{print total[int(NR*0.95-0.5)]}'
but I don't seem to get the correct value when I compare it to Excel.

I am not sure whether Excel does some kind of weighted or interpolated percentile, but if you actually want one of the numbers that was in your original set, then your method should work correctly, up to rounding.
You can simplify a little bit like this, but it's the same thing.
sort -n input.txt | awk '{all[NR] = $0} END{print all[int(NR*0.95 - 0.5)]}'
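If you need to reproduce Excel's PERCENTILE, which interpolates linearly between the two nearest ranks instead of picking an existing value, a minimal sketch along these lines should get you closer (it assumes one numeric value per line and a fraction p between 0 and 1):
sort -n input.txt | awk -v p=0.95 '
  {a[NR] = $1}
  END {
    r = (NR - 1) * p        # zero-based fractional rank, the way PERCENTILE.INC defines it
    i = int(r); f = r - i   # integer part and fractional remainder
    print a[i + 1] + f * (a[i + 2] - a[i + 1])
  }'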

Following the calculation suggested here, you can do this:
sort file -n | awk 'BEGIN{c=0} length($0){a[c]=$0;c++}END{p5=(c/100*5); p5=p5%1?int(p5)+1:p5; print a[c-p5-1]}'
Output for given input:
sort file -n | awk 'BEGIN{c=0} length($0){a[c]=$0;c++}END{p5=(c/100*5); p5=p5%1?int(p5)+1:p5; print a[c-p5-1]}'
0.0001357
Explanation:
Sort the file numerically.
Drop the top 5%.
Pick the next value.
PS: The statement p5=p5%1?int(p5)+1:p5 performs a ceiling (ceil) operation, which is a built-in in many languages but not in awk.
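For readability you can also wrap that ceiling logic in a small helper function; a sketch that only handles non-negative values:
awk 'function ceil(x) { return (x == int(x)) ? x : int(x) + 1 }
     BEGIN { print ceil(0.35), ceil(2), ceil(2.1) }'    # prints: 1 2 3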

Just for the record, there is also a solution, inspired by merlin2011's answer, that prints several desired percentiles:
# get amount of values
num="$(wc -l input.txt | cut -f1 -d' ')";
# sort values
sort -n input.txt > temp && mv temp input.txt
# print the desired percentiles
for p in 50 70 80 90 92 95 99 100; do
printf "%3s%%: %-5.5sms\n" "$p" "$(head input.txt -n "$((num / 100 * $p))" | tail -n1)";
done
Update: I messed this up. Bash arithmetic can't handle floating-point numbers, not even within a single expression, so the above only works for files with 100*(N>0) values. Either bc or awk is required to do the math.
In case you have an "odd" amount of values, you should replace "$((num / 100 * $p))" with "$(awk "BEGIN {print int($num/100*$p)}")" in the code above.
So awk is part of this answer after all. ;)
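For completeness, a pure-awk sketch of the same idea (nearest rank, rounded down, assuming one value per line) sidesteps the bash integer-math problem entirely:
sort -n input.txt | awk '
  {a[NR] = $1}
  END {
    n = split("50 70 80 90 92 95 99 100", ps, " ")
    for (i = 1; i <= n; i++) {
      idx = int(NR * ps[i] / 100); if (idx < 1) idx = 1
      printf "%3d%%: %s\n", ps[i], a[idx]
    }
  }'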

Related

Count entries based on exponential notation values with pure awk

I am trying to count the entries that are less than the e-value threshold of 1e-5 in my tab-delimited data file that looks something like the table below.
col1 col2 col3 eval
entry1 - - 1e-10
entry2 - - -
entry3 - - 0.001
I used this code:
$: awk -F"\t" '{print $4}' table.txt | awk '($1 + 0) < 1e-5' | grep [0-9*] | wc -l
This outputs:
$: 1
While this works, I would like to improve the command into something that is pure awk. I would also like to know how to print the lines that satisfy the threshold, if that is possible. Thanks for helping!
This is probably the best way:
awk -F"\t" '($4+0==$4) && ($4 < 1E-5){c++}END{print c}' file
This does the following:
($4+0==$4): first conditional to check if $4 is a number.
($4<1E-5): second conditional to check if the value matches the range
&&: If both conditions are satisfied, increment a counter c
at the END, print the value of c
Be aware that the grep in your original command can miscount: if $4 in the original file read like XXX1XXX (i.e. a string with a digit in it) or XXX*XXX (i.e. a string with an asterisk in it), it would be counted as a match.
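If you also want to see the lines that satisfy the threshold, as you asked, a small variation of the same idea prints each matching line before the final count (assuming the same tab-delimited layout):
awk -F"\t" '($4+0==$4) && ($4 < 1E-5){print; c++} END{print c+0, "entries below threshold"}' file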

Print rows that have numbers in them

This is my data; I have more than 1000 rows. How do I get only the records with numbers in them?
Records | Num
123 | 7 Y1 91
7834 | 7PQ34-102
AB12AC|87 BWE 67
5690278| 80505312
7ER| 998
The output has to be:
7ER| 998
5690278| 80505312
I'm new to Linux programming; any help would be highly useful to me. Thanks, all.
I would use awk:
awk -F'[[:space:]]*[|][[:space:]]*' '$2 ~ /^[[:digit:]]+$/'
If you want to print the number of lines deleted as you've been asking in comments, you may use this:
awk -F'[[:space:]]*[|][[:space:]]*' '
{
if($2~/^[[:digit:]]+$/){print}else{c++}
}
END{printf "%d lines deleted\n", c}' file
A short and simple GNU awk (gawk) script to filter lines with numbers in the second column (field), assuming a one-word field (e.g. 1234, or 12AB):
awk -F'|' '$2 ~ /\y[0-9]+\y/' file
We use the GNU extension for regexp operators, i.e. \y for matching the word boundary. Other than that, pretty straightforward: we split fields on | and look for isolated digits in the second field.
Edit: Since the question has been updated, and now explicitly allows for multiple words in the second field (e.g. 12 AB, 12-34, 12 34), to get lines with numbers and separators only in the second field:
awk -F'|' '$2 ~ /^[- 0-9]+$/' file
Alternatively, if we say only letters are forbidden in the second field, we can use:
awk -F'|' '$2 ~ /^[^a-zA-Z]+$/' file

How to sum first 100 rows of a specific column using Awk?

How do I sum the first 100 rows of a specific column using awk? I wrote:
awk 'BEGIN{FS="|"} NR<=100 {x+=$5}END {print x}' temp.txt
But this is taking a lot of time to process; is there any other way that gives the result quickly?
Just exit after the required first 100 records:
awk -v iwant=100 '{x+=$5} NR==iwant{exit} END{print x+0}' test.in
Take it out for a spin:
$ for i in {1..1000}; do echo 1 >> test.in ; done # a thousand records
$ awk -v iwant=100 '{x+=$1} NR==iwant{exit} END{print x+0}' test.in
100
You can always trim the input and use the same script:
head -100 file | awk ... your script here ...
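If your input really is pipe-delimited as in the original attempt, the same early-exit trick applies; a sketch:
awk -F'|' -v iwant=100 '{x+=$5} NR==iwant{exit} END{print x+0}' temp.txt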

Divide floats in awk

I have written code to calculate the z-score; it calculates the mean and standard deviation from one file and uses some values from rows in another file, as follows:
mean=$(awk '{total += $2; count++} END {print total/count}' ABC_avg.txt)
#calculating mean of the second column of the file
std=$(awk '{x[NR]=$2; s+=$2; n++} END{a=s/n; for (i in x){ss += (x[i]-a)^2} sd = sqrt(ss/n); print sd}' ABC_avg.txt)
#calculating standard deviation from the second column of the same file
awk '{if (std) print $2-$mean/$std}' ABC_splicedavg.txt > ABC.tmp
#calculate the zscore for each row and store it in a temporary file
zscore=$(awk '{total += $0; count++} END {if (count) print total/count}' ABC.tmp)
#calculate an average of all the zscores in the rows and store it in a variable
echo $motif" "$zscore
rm ABC.tmp
However, when I execute this code, at the step where the temp file is created I get the error fatal: division by zero attempted. What is the right way to implement this code? TIA. I also tried the bc -l option, but it prints a very long floating-point number.
Here is a script to compute the mean and std in one pass; you may lose some precision, and if that is not acceptable there are alternatives...
$ awk '{print rand()}' <(seq 100) |
  awk '{sum+=$1; sqsum+=$1^2}
       END{print mean=sum/NR, std=sqrt(sqsum/NR-mean^2), z=mean/std}'
0.486904 0.321789 1.51312
Your script for the z-score of each sample is wrong! You need to compute ($2-mean)/std.
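A sketch of that corrected per-row step, passing the shell variables mean and std into awk with -v instead of referencing them as fields (file name and column taken from your question):
awk -v mean="$mean" -v std="$std" 'std != 0 { print ($2 - mean) / std }' ABC_splicedavg.txt > ABC.tmp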
You can control the precision of your output with bc by using the scale variable:
$ echo "4/7" | bc -l
.57142857142857142857
$ echo "scale=3; 4/7" | bc -l
.571
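Alternatively, if you are already in awk you can skip bc and control the precision with printf; a sketch:
$ awk 'BEGIN { printf "%.3f\n", 4/7 }'
0.571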

setting default numeric format in awk

I wanted to do a simple parsing of two files with ids and some corresponding numerical values. I didn't want awk to print numbers in scientific notation.
File looks like this:
someid-1 860025 50.0401 4.00022
someid-2 384319 22.3614 1.78758
someid-3 52096 3.03118 0.242314
someid-4 43770 2.54674 0.203587
someid-5 33747 1.96355 0.156967
someid-6 20281 1.18004 0.0943328
someid-7 12231 0.711655 0.0568899
someid-8 10936 0.636306 0.0508665
someid-9 10224.8 0.594925 0.0475585
someid-10 10188.8 0.59283 0.047391
When I use print instead of printf:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); print $1,k[2],k[3],k[4],$2,$3,$4}' OSCAo.txt dme_miRNA_PIWI_OSC.txt | sort -n -r -k 7 | head
I get this result:
dme-miR-iab-4-5p 0.333333 0.000016 0.000001 0.25 0.000605606 9.36543e-07
dme-miR-9c-5p 10987.300000 0.525413 0.048798 160.2 0.388072 0.000600137
dme-miR-9c-3p 731.986000 0.035003 0.003251 2.10714 0.00510439 7.89372e-06
dme-miR-9b-5p 30322.500000 1.450020 0.134670 595.067 1.4415 0.00222922
dme-miR-9b-3p 2628.280000 0.125684 0.011673 48 0.116276 0.000179816
dme-miR-9a-3p 10.365000 0.000496 0.000046 0.25 0.000605606 9.36543e-07
dme-miR-999-5p 103.433000 0.004946 0.000459 0.0769231 0.00018634 2.88167e-07
dme-miR-999-3p 1513.790000 0.072389 0.006723 28 0.0678278 0.000104893
dme-miR-998-5p 514.000000 0.024579 0.002283 73 0.176837 0.000273471
dme-miR-998-3p 3529.000000 0.168756 0.015673 42 0.101742 0.000157339
Notice the scientific notation in the last column.
I understand that printf with an appropriate format modifier can do the job, but the code becomes very lengthy. I have to write something like this:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); printf "%s\t%3.6f\t%3.6f\t%3.6f\t%3.6f\t%3.6f\t%3.6f\n", $1,k[2],k[3],k[4],$2,$3,$4}' file1.txt file2.txt > fileout.txt
This becomes clumsy when I have to parse fileout with another similarly structured file.
Is there any way to specify a default numeric output format, such that any string will be printed as a string but all numbers follow a particular format?
I think you misinterpreted the meaning of %3.6f. The number before the decimal point is the field width, not the "number of digits before the decimal point". (See printf(3).)
So you should use %10.6f instead. It can be tested easily in bash:
$ printf "%3.6f\n%3.6f\n%3.6f" 123.456 12.345 1.234
123.456000
12.345000
1.234000
$ printf "%10.6f\n%10.6f\n%10.6f" 123.456 12.345 1.234
123.456000
 12.345000
  1.234000
You can see that the latter aligns on the decimal point properly.
As sidharth c nadhan mentioned, you can use the OFMT awk internal variable (see awk(1)). An example:
$ awk 'BEGIN{print 123.456; print 12.345; print 1.234}'
123.456
12.345
1.234
$ awk -vOFMT=%10.6f 'BEGIN{print 123.456; print 12.345; print 1.234}'
123.456000
 12.345000
  1.234000
As I can see in your example, the widest number could be something like 123456.1234567, so the format %15.7f would cover everything and produce a nice-looking table.
But unfortunately it will not work if the number has no decimal point in it, or even if it does but the fractional part is zero:
$ awk -vOFMT=%15.7f 'BEGIN{print 123.456;print 123;print 123.0;print 0.0+123.0}'
    123.4560000
123
123
123
I even tried gawk's strtonum() function, but integers are still treated as non-OFMT values. See:
awk -vOFMT=%15.7f -vCONVFMT=%15.7f 'BEGIN{print 123.456; print strtonum(123); print strtonum(123.0)}'
It has the same output as before.
So I think you have to use printf anyway. The script can be a little bit shorter and a bit more configurable:
awk -vf='\t'%15.7f 'NR==FNR{x[$1]=sprintf("%s"f f f,$1,$2,$3,$4);next}$1 in x{printf("%s"f f f"\n",x[$1],$2,$3,$4)}' file1.txt file2.txt
The script will not work properly if there are duplicated IDs in the first file. If that is not the case, the two conditions can be swapped and the ;next can be left off.
awk 'NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); printf "%s\t%9s\t%9s\t%9s\t%9s\t%9s\t%9s\n", $1,k[2],k[3],k[4],$2,$3,$4}' file1.txt file2.txt > fileout.txt