Count entries based on exponential notation values with pure awk - awk

I am trying to count the entries whose e-value is below the threshold of 1e-5 in my tab-delimited data file, which looks something like the table below.
col1 col2 col3 eval
entry1 - - 1e-10
entry2 - - -
entry3 - - 0.001
I used this code:
$: awk -F"\t" '{print $4}' table.txt | awk '($1 + 0) < 1e-5' | grep [0-9*] | wc -l
This outputs:
$: 1
While this works, I would like to turn the pipeline into pure awk. Also, I would like to know how to print the lines that satisfy the threshold, if that is possible. Thanks for helping!

This is probably the best way:
awk -F"\t" '($4+0==$4) && ($4 < 1E-5){c++}END{print c}' file
This does the following:
($4+0==$4): first condition, checks that $4 is a number (adding 0 forces a numeric conversion, which only round-trips for numeric strings).
($4 < 1E-5): second condition, checks that the value is below the threshold.
&&: only if both conditions hold is the action {c++} run, incrementing the counter c.
At the END, print the value of c.
Be aware that the grep in your original command is unreliable. The unquoted pattern [0-9*] matches any single digit or a literal asterisk anywhere in the line, so if $4 read like XXX1XXX (a string containing a digit) or XXX*XXX (a string containing an asterisk), it would be counted as a match.
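Since the question also asks how to print the lines that satisfy the threshold, a minimal variation of the same one-liner (a sketch, not part of the original answer) prints each matching line as well as the final count; printing c+0 ensures 0 comes out when nothing matches:
awk -F"\t" '($4+0==$4) && ($4 < 1E-5){print; c++} END{print c+0}' file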

Related

Extract all negative number from column2 in another file

I have a very long data file "file.dat" with two columns.
Here I am showing a very small portion of it. I want to extract all the negative numbers from column2 into another file, say file2.dat, and similarly all the positive numbers from the same column2 into another file, file3.dat.
4.0499 -7.1787
4.0716 -7.1778
4.0932 -7.1778
4.1148 -7.1785
4.1365 -7.1799
4.1581 -7.1819
4.1798 -7.1843
4.2014 -7.1868
4.2231 -7.1890
4.2447 -7.1902
4.2663 -7.1900
4.2880 -7.1886
<-------Note: this kind of break is also there in many places
0.0000 2.1372
0.0707 2.1552
0.1414 2.2074
0.2121 2.2864
0.2828 2.3791
0.3535 2.4646
0.4242 2.5189
0.4949 2.5207
0.5655 2.5098
Expected Results for Negative numbers file2.dat
-7.1787
-7.1778
-7.1778
-7.1785
-7.1799
-7.1819
-7.1843
-7.1868
-7.1890
-7.1902
-7.1900
-7.1886
Expected Results for Positive numbers file3.dat
2.1372
2.1552
2.2074
2.2864
2.3791
2.4646
2.5189
2.5207
2.5098
Nearest Solution I found
This solution did not work for me because of my lack of knowledge.
http://www.unixcl.com/2009/11/awk-extract-negative-numbers-from-file.html
It is quite simple to do with awk. You simply check the value in the 2nd column and write it out based on its value, e.g.
awk '$2<0 {print $2 > "negative.dat"} $2>=0 {print $2 > "positive.dat"}' file
Where the two rules used by awk above are:
$2<0 {print $2 > "negative.dat"}: if the value in the 2nd column is less than 0, write it to negative.dat,
$2>=0 {print $2 > "positive.dat"}: if the value in the 2nd column is greater than or equal to 0, write it to positive.dat.
Example Use/Output
With your example data in file (without the break note), running the above results in:
$ cat negative.dat
-7.1787
-7.1778
-7.1778
-7.1785
-7.1799
-7.1819
-7.1843
-7.1868
-7.1890
-7.1902
-7.1900
-7.1886
The positive values in:
$ cat positive.dat
2.1372
2.1552
2.2074
2.2864
2.3791
2.4646
2.5189
2.5207
2.5098
David's answer is pretty good; here is a shorter awk one-liner using a ternary condition. The NF>1 guard also skips the blank separator lines, so no empty records land in the output files:
awk 'NF>1 {print $2 > ($2 < 0 ? "neg.dat" : "pos.dat")}' file
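If you want the exact file names the question asked for (file2.dat for negatives, file3.dat for the rest), the same sketch can simply redirect to those names:
awk 'NF>1 {print $2 > ($2 < 0 ? "file2.dat" : "file3.dat")}' file.dat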

Print rows that have numbers in them

This is my data (I have more than 1000 rows). How do I get only the records with just numbers in the Num column?
Records | Num
123 | 7 Y1 91
7834 | 7PQ34-102
AB12AC|87 BWE 67
5690278| 80505312
7ER| 998
Output has to be
7ER| 998
5690278| 80505312
I'm new to Linux programming; any help would be highly useful to me. Thanks, all!
I would use awk:
awk -F'[[:space:]]*[|][[:space:]]*' '$2 ~ /^[[:digit:]]+$/'
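Applied to the sample data, the separator regexp strips the spaces around |, and the two all-digit records print in input order:
$ awk -F'[[:space:]]*[|][[:space:]]*' '$2 ~ /^[[:digit:]]+$/' file
5690278| 80505312
7ER| 998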
If you want to print the number of lines deleted as you've been asking in comments, you may use this:
awk -F'[[:space:]]*[|][[:space:]]*' '
{
if($2~/^[[:digit:]]+$/){print}else{c++}
}
END{printf "%d lines deleted\n", c}' file
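Run against the sample data, this prints the same two records and tallies the header plus the three non-numeric records:
$ awk -F'[[:space:]]*[|][[:space:]]*' '{if($2~/^[[:digit:]]+$/){print}else{c++}} END{printf "%d lines deleted\n", c}' file
5690278| 80505312
7ER| 998
4 lines deleted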
A short and simple GNU awk (gawk) script to filter lines with numbers in the second column (field), assuming a one-word field (e.g. 1234, or 12AB):
awk -F'|' '$2 ~ /\y[0-9]+\y/' file
We use the GNU extension for regexp operators, i.e. \y for matching the word boundary. Other than that, pretty straightforward: we split fields on | and look for isolated digits in the second field.
Edit: Since the question has been updated, and now explicitly allows for multiple words in the second field (e.g. 12 AB, 12-34, 12 34), to get lines with numbers and separators only in the second field:
awk -F'|' '$2 ~ /^[- 0-9]+$/' file
Alternatively, if we say only letters are forbidden in the second field, we can use:
awk -F'|' '$2 ~ /^[^a-zA-Z]+$/' file

awk - skip last line for condition

When I wrote an answer for this question I used the following:
something | sed '$d' | awk '$1>3{print $0}'
e.g.
print only lines where the 1st field is bigger than 3 (awk)
but omit the last line with sed '$d'.
This seems to me like duplicated work; surely it is possible to do the above with awk alone, without the sed?
I'm an awkdiot - so, can someone suggest a solution?
Here's one way you could do it:
$ printf "%s\n" {1..10} | awk 'NR>1&&p>3{print p}{p=$1}'
4
5
6
7
8
9
Basically, print the first field of the previous line, rather than the current one.
As Wintermute has rightly pointed out in the comments (thanks), in order to print the whole line, you can modify the code to this:
awk 'p { print p; p="" } $1 > 3 { p = $0 }'
This only assigns the contents of the line to p if the first field is greater than 3; each stored line is printed when the next line arrives, so the last line is never printed.
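Run against the same 1..10 sequence, the buffered version prints whole lines and still drops the final one (10 is stored in p but never printed):
$ printf "%s\n" {1..10} | awk 'p { print p; p="" } $1 > 3 { p = $0 }'
4
5
6
7
8
9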

Calculating 95th percentile with awk

I'm new to awk scripting and would like some help calculating the 95th-percentile value for a file that consists of this data:
0.0001357
0.000112
0.000062
0.000054
0.000127
0.000114
0.000136
I tried:
cat filename.txt | sort -n |
awk 'BEGIN{c=0} {total[c]=$1; c++;} END{print total[int(NR*0.95-0.5)]}'
but I don't seem to get the correct value when I compare it to Excel.
I am not sure if Excel does some kind of weighted percentile, but if you actually want one of the numbers that was in your original set, then your method should work correctly for rounding.
You can simplify a little bit like this, but it's the same thing.
sort -n input.txt | awk '{all[NR] = $0} END{print all[int(NR*0.95 - 0.5)]}'
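With the seven sample values, NR is 7 at the END, int(7*0.95 - 0.5) = 6, and the sixth value in ascending order is printed:
$ sort -n filename.txt | awk '{all[NR] = $0} END{print all[int(NR*0.95 - 0.5)]}'
0.0001357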
Following the calculation suggested here, you can do this:
sort file -n | awk 'BEGIN{c=0} length($0){a[c]=$0;c++}END{p5=(c/100*5); p5=p5%1?int(p5)+1:p5; print a[c-p5-1]}'
Output for given input:
sort file -n | awk 'BEGIN{c=0} length($0){a[c]=$0;c++}END{p5=(c/100*5); p5=p5%1?int(p5)+1:p5; print a[c-p5-1]}'
0.0001357
Explanation:
Sort the file numerically
Drop the top 5%
Pick the next value
PS. The statement p5=p5%1?int(p5)+1:p5 is doing the ceil operation available in many languages. For the seven sample values, c is 7 and p5 = 7/100*5 = 0.35, which ceils to 1, so the script prints a[c-p5-1] = a[5], the sixth value (the array is 0-indexed) in ascending order: 0.0001357.
Just for the record, there is also a solution, inspired by merlin2011's answer, that prints several desired percentiles:
# get amount of values
num="$(wc -l input.txt | cut -f1 -d' ')";
# sort values
sort -n input.txt > temp && mv temp input.txt
# print the desired percentiles
for p in 50 70 80 90 92 95 99 100; do
printf "%3s%%: %-5.5sms\n" "$p" "$(head input.txt -n "$((num / 100 * $p))" | tail -n1)";
done
Update: I messed it up. Bash arithmetic can't handle floating-point numbers, not even within a single expression, so the command above only works for files whose line count is a multiple of 100. Either bc or awk is required to do the math.
In case you have an "odd" number of values, you should replace "$((num / 100 * $p))" with "$(awk "BEGIN {print int($num/100*$p)}")" in the code above.
Finally awk is part of that answer. ;)
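For the record, the same loop can also be done in pure awk (a sketch using the same truncating index as the corrected bash version, so it inherits the same rounding behaviour):
sort -n input.txt | awk '{a[NR]=$1}
END{
  n = split("50 70 80 90 92 95 99 100", p, " ")
  for (i = 1; i <= n; i++) {
    idx = int(NR * p[i] / 100)
    if (idx < 1) idx = 1   # clamp so tiny files still print something
    printf "%3d%%: %s\n", p[i], a[idx]
  }
}'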

awk script for finding smallest value from column

I am a beginner in AWK, so please help me learn it. I have a text file named snd and its values are:
1 0 141
1 2 223
1 3 250
1 4 280
I want to print the entire row when the third column's value is the minimum.
This should do it:
awk 'NR == 1 {line = $0; min = $3}
NR > 1 && $3 < min {line = $0; min = $3}
END{print line}' file.txt
EDIT:
What this does is:
Remember the 1st line and its 3rd field.
For the other lines, if the 3rd field is smaller than the min found so far, remember the line and its 3rd field.
At the end of the script, print the line.
Note that the test NR > 1 can be skipped, as for the 1st line, $3 < min will be false. If you know that the 3rd column is always negative, you can also skip the NR == 1 ... rule, since an uninitialized min compares as zero and the first line's negative $3 will already be below it. (With all-positive data that shortcut would print nothing.)
EDIT2:
This is shorter:
awk 'NR == 1 || $3 < min {line = $0; min = $3}END{print line}' file.txt
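Run against the sample data (saved as file.txt), the shorter version picks the row with the smallest third column:
$ awk 'NR == 1 || $3 < min {line = $0; min = $3}END{print line}' file.txt
1 0 141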
You don't need awk to do what you want. Use sort
sort -nk 3 file.txt | head -n 1
Results:
1 0 141
I think sort is an excellent answer, unless for some reason what you're looking for is the awk logic to do this in a larger script, or you want to avoid the extra pipes, or the purpose of this question is to learn more about awk.
$ awk 'NR==1{x=$3;line=$0} $3<x{x=$3;line=$0} END{print line}' snd
Broken out into pieces, this is:
NR==1 {x=$3;line=$0} -- On the first line, set an initial value for comparison and store the line.
$3<x{x=$3;line=$0} -- On each line, compare the third field against our stored minimum; if the current value is smaller, store the line and update x. Without updating x here, the script would only work on input already sorted by the third column. (We could make this run only on NR>1, but it doesn't matter.)
END{print line} -- At the end of our input, print whatever line we've stored.
You should read man awk to learn about any parts of this that don't make sense.
A short answer for this would be:
sort -k3,3n temp|head -1
Since you have asked for awk:
awk '{if(min>$3||NR==1){min=$3;a[$3]=$0}}END{print a[min]}' your_file
But I always prefer the shorter one.
For calculating the smallest value in any column, let's say the last column:
awk '(FNR==1){a=$NF} {a=$NF < a?$NF:a} END {print a}'
This will print only the smallest value of the column.
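For example, with the snd file from the question (where the last column is the third), this prints the minimum value alone:
$ awk '(FNR==1){a=$NF} {a=$NF < a?$NF:a} END {print a}' snd
141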
In case the complete line is needed, it is easier to use sort (numeric ascending, then take the first line):
sort -n -t [delimiter] -k[column] [file name] | head -n 1
awk -F ";" '(NR==1){a=$NF;b=$0} {a=$NF<a?$NF:a;b=$NF>a?b:$0} END {print b}' filename
This will print the line holding the smallest value; when several lines tie for the minimum, the last one encountered wins.
To track a per-key minimum and maximum instead (field 1 as the key, field 2 as the value):
awk 'BEGIN {OFS=FS=","}{if ( a[$1]>$2 || a[$1]=="") {a[$1]=$2;} if (b[$1]<$2) {b[$1]=$2;} } END {for (i in a) {print i,a[i],b[i]}}' input_file
We use || a[$1]=="" because the first time a key from field 1 is seen, a[$1] is still null, and the a[$1]>$2 test alone would never store that first value.
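As a quick illustration with a made-up input_file (the comma-separated key/value pairs below are invented for the example):
$ cat input_file
x,5
x,2
y,9
y,3
$ awk 'BEGIN {OFS=FS=","}{if ( a[$1]>$2 || a[$1]=="") {a[$1]=$2;} if (b[$1]<$2) {b[$1]=$2;} } END {for (i in a) {print i,a[i],b[i]}}' input_file
x,2,5
y,3,9
Note that for (i in a) visits keys in an unspecified order, so the two output lines may come out swapped.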