Extract all negative numbers from column2 into another file - awk

I have a very long data file "file.dat" with two columns; below is a very small portion of it.
I want to extract all the negative numbers from column2 into another file, say file2.dat, and similarly all the positive numbers from the same column2 into another file, file3.dat.
4.0499 -7.1787
4.0716 -7.1778
4.0932 -7.1778
4.1148 -7.1785
4.1365 -7.1799
4.1581 -7.1819
4.1798 -7.1843
4.2014 -7.1868
4.2231 -7.1890
4.2447 -7.1902
4.2663 -7.1900
4.2880 -7.1886
<-------Note: this kind of break is also there in many places
0.0000 2.1372
0.0707 2.1552
0.1414 2.2074
0.2121 2.2864
0.2828 2.3791
0.3535 2.4646
0.4242 2.5189
0.4949 2.5207
0.5655 2.5098
Expected results for negative numbers (file2.dat):
-7.1787
-7.1778
-7.1778
-7.1785
-7.1799
-7.1819
-7.1843
-7.1868
-7.1890
-7.1902
-7.1900
-7.1886
Expected results for positive numbers (file3.dat):
2.1372
2.1552
2.2074
2.2864
2.3791
2.4646
2.5189
2.5207
2.5098
Nearest Solution I found
This solution did not work for me because of my lack of knowledge.
http://www.unixcl.com/2009/11/awk-extract-negative-numbers-from-file.html

It is quite simple to do with awk. You simply check the value in the 2nd column and write it out based on its value, e.g.
awk '$2<0 {print $2 > "negative.dat"} $2>=0 {print $2 > "positive.dat"}' file
Where the two rules used by awk above are:
$2<0 {print $2 > "negative.dat"}: if the value in the 2nd column is less than 0, write it to negative.dat;
$2>=0 {print $2 > "positive.dat"}: if the value in the 2nd column is greater than or equal to 0, write it to positive.dat.
Example Use/Output
With your example data in file (without the comment line), running the above results in:
$ cat negative.dat
-7.1787
-7.1778
-7.1778
-7.1785
-7.1799
-7.1819
-7.1843
-7.1868
-7.1890
-7.1902
-7.1900
-7.1886
The positive values in:
$ cat positive.dat
2.1372
2.1552
2.2074
2.2864
2.3791
2.4646
2.5189
2.5207
2.5098

David's answer is pretty good; here is a shorter awk one-liner using a ternary condition:
awk 'NF>1 {print $2 > ($2 < 0 ? "neg.dat" : "pos.dat")}' file
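If the breaks in the real data are blank lines or contain non-numeric text, a slightly stricter guard may help. This is a minimal sketch combining the two answers, assuming the output names file2.dat and file3.dat from the question; the $2+0 == $2 test only lets through second fields that are actually numeric:
awk 'NF > 1 && $2+0 == $2 {print $2 > ($2 < 0 ? "file2.dat" : "file3.dat")}' file.dat
Any line whose second field is not a number (a blank line or a text note) is simply skipped.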

Related

Count entries based on exponential notation values with pure awk

I am trying to count the entries that are below the e-value threshold of 1e-5 in my tab-delimited data file, which looks something like the table below.
col1 col2 col3 eval
entry1 - - 1e-10
entry2 - - -
entry3 - - 0.001
I used this code:
$: awk -F"\t" '{print $4}' table.txt | awk '($1 + 0) < 1e-5' | grep [0-9*] | wc -l
This outputs:
$: 1
While this works, I would like to improve the command into pure awk. Also, I would like to know how to print the lines that satisfy the threshold, if that is possible. Thanks for helping!
This is probably the best way:
awk -F"\t" '($4+0==$4) && ($4 < 1E-5){c++}END{print c}' file
This does the following:
($4+0==$4): first conditional, checks that $4 is a number.
($4 < 1E-5): second conditional, checks that the value is below the threshold.
&&: if both conditions are satisfied, increment a counter c.
At the END, print the value of c.
Be aware that the grep in your original command can give wrong counts: if $4 read like XXX1XXX (a string with a digit in it) or XXX*XXX (a string with an asterisk in it), it would be counted as a match.
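Since you also asked about printing the lines that satisfy the threshold, a minimal extension of the same rule prints each matching line and then the count (the c+0 just makes sure 0 is printed when nothing matches):
awk -F"\t" '($4+0==$4) && ($4 < 1E-5){print; c++} END{print c+0}' file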

awk include column value in pattern

I am looking for a way to pattern match one column against another in awk. For example, I wish to find rows for which the value in column 4 is contained in column 5.
Performing awk '$4 ~ /$5/' doesn't work, as the dollar sign is interpreted as part of the regular expression. How do I get the column 5 value into this pattern match?
Many thanks!
If you're looking for a literal match, not a regex, you can use:
awk 'index($5,$4)' file
This will print the lines where $4 is a substring of $5.
> awk '$2 ~ $1' <<< "field another_field"
field another_field
This prints lines where $2 matches the value of $1 treated as a regular expression.
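Applied to the columns from the question, the equivalent dynamic-regex form would be:
awk '$5 ~ $4' file
Note that the content of $4 is then interpreted as a regular expression, so metacharacters such as . or * in column 4 can match more than intended; index() is the safer choice when you want a plain substring check.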

awk extract part of string, compare to number, output original line intact

Using the basic awk tool, say I have a file where any number can appear after "DP=" (in the 8th column) and before the semicolon. I want to keep only the lines where this number is > 10.
Chr1 26313 . G A,X 0 . DP=78;I16=28,38,10,0,2405,88631,356,12836,3960,237600,530,29234,1195,26039,199,4509;VDB=0.0000 PL:DP 12,0,
Chr1 26597 . G T,X 0 . DP=5;I16=29,27,0,10,2054,76598,389,15193,3360,201600,558,32130,1046,22598,238,5730;VDB=0.0000 PL:DP 48,0,
...etc..
How do I use awk to extract the number, and only return lines if the number is greater than 10? My desired output would be the following (since the other line has DP=5, which is < 10):
Chr1 26313 . G A,X 0 . DP=78;I16=28,38,10,0,2405,88631,356,12836,3960,237600,530,29234,1195,26039,199,4509;VDB=0.0000 PL:DP 12,0,
Here is what I have so far, but I can't figure out how to extract the string and compare it to a number:
awk '( $5 ~ /[ACGT]/ && $8 ~ /^DP=/ && $10 !~ /^0/) {print $0}'
Maybe I can split this into two awk commands? Or maybe there is a trick to do this all in one call?
Sorry if it has been answered, but I looked around and couldn't figure it out.
I don't want to use perl, gawk, or anything else.
EDIT: I think I made my example too simple; I've updated it.
Set the field separator and test the condition. Adding 0 to the field converts it to a number, which discards the trailing ; and everything after it.
awk -F'=' '$2+0>10' file
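To see why this works on the first sample line: with FS set to =, the second field is 78;I16, and adding 0 converts it to the number 78. If you want to verify what awk sees on your own file, a quick check is:
awk -F'=' '{print $2, $2+0}' file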
Your sample input line appears to be a truncated version of your actual input. So keeping the rest of the conditions as is, you can just add the following check:
awk '$5~/[ACGT]/ && $8~/^DP=/ && $10!~/^0/{split($0,tmp,/[=;]/);if(tmp[2]>10) print}' file
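If you would rather extract the number explicitly instead of relying on where the = characters fall, a sketch using plain POSIX awk's match() and substr() (leaving your other field conditions out for brevity) could look like this:
awk 'match($8, /DP=[0-9]+/) { dp = substr($8, RSTART + 3, RLENGTH - 3) + 0; if (dp > 10) print }' file
match() sets RSTART and RLENGTH, so the substr() call skips the leading DP= and keeps only the digits.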

Awk to multiply all numbers in a text file by a constant subject to another constraint

I have files with contents like the following:
0.23423
0.10093
0.44231
0.45630
0.89999
I want to increase every number by a given percentage, say, 20%. So, I want to know how to multiply each value by 1.2.
The "constraint" I need to impose is that the products be less than or equal to 1, because these values are probabilities.
So, in pseudocode, I need to replace each number X in a given text file by min(1.0, X*1.2).
How can this be achieved in awk?
Try this one-liner (the trailing 7 is just an always-true pattern, so awk's default action of printing the record applies):
awk '{p=1.2*$0;$0=p>1?1:p}7' file
test with your example:
kent$ cat f
0.23423
0.10093
0.44231
0.45630
0.89999
kent$ awk '{p=1.2*$0;$0=p>1?1:p}7' f
0.281076
0.121116
0.530772
0.54756
1
If you want to control the precision of the floats, you can use printf:
awk '{p=1.2*$0;$0=p>1?1:p;printf "%.5f\n",$0}' file
With the same input, it gives:
kent$ awk '{p=1.2*$0;$0=p>1?1:p;printf "%.5f\n",$0}' f
0.28108
0.12112
0.53077
0.54756
1.00000
Using the C-like ternary operator in a one-liner:
awk '{res = $1 * 1.2; print (res > 1) ? 1 : res}' file
0.281076
0.121116
0.530772
0.54756
1
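If you prefer code that reads like the pseudocode min(1.0, X*1.2), a user-defined helper function works too; a sketch with the same %.5f formatting as above:
awk 'function min(a, b) { return a < b ? a : b } { printf "%.5f\n", min(1.0, $1 * 1.2) }' file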

setting default numeric format in awk

I want to do a simple parsing of two files with IDs and corresponding numerical values, and I don't want awk to print the numbers in scientific notation.
One of the files looks like this:
someid-1 860025 50.0401 4.00022
someid-2 384319 22.3614 1.78758
someid-3 52096 3.03118 0.242314
someid-4 43770 2.54674 0.203587
someid-5 33747 1.96355 0.156967
someid-6 20281 1.18004 0.0943328
someid-7 12231 0.711655 0.0568899
someid-8 10936 0.636306 0.0508665
someid-9 10224.8 0.594925 0.0475585
someid-10 10188.8 0.59283 0.047391
When I use print instead of printf:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); print $1,k[2],k[3],k[4],$2,$3,$4}' OSCAo.txt dme_miRNA_PIWI_OSC.txt | sort -n -r -k 7 | head
I get this result:
dme-miR-iab-4-5p 0.333333 0.000016 0.000001 0.25 0.000605606 9.36543e-07
dme-miR-9c-5p 10987.300000 0.525413 0.048798 160.2 0.388072 0.000600137
dme-miR-9c-3p 731.986000 0.035003 0.003251 2.10714 0.00510439 7.89372e-06
dme-miR-9b-5p 30322.500000 1.450020 0.134670 595.067 1.4415 0.00222922
dme-miR-9b-3p 2628.280000 0.125684 0.011673 48 0.116276 0.000179816
dme-miR-9a-3p 10.365000 0.000496 0.000046 0.25 0.000605606 9.36543e-07
dme-miR-999-5p 103.433000 0.004946 0.000459 0.0769231 0.00018634 2.88167e-07
dme-miR-999-3p 1513.790000 0.072389 0.006723 28 0.0678278 0.000104893
dme-miR-998-5p 514.000000 0.024579 0.002283 73 0.176837 0.000273471
dme-miR-998-3p 3529.000000 0.168756 0.015673 42 0.101742 0.000157339
Notice the scientific notation in the last column.
I understand that printf with an appropriate format modifier can do the job, but the code becomes very lengthy. I have to write something like this:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); printf "%s\t%3.6f\t%3.6f\t%3.6f\t%3.6f\t%3.6f\t%3.6f\n", $1,k[2],k[3],k[4],$2,$3,$4}' file1.txt file2.txt > fileout.txt
This becomes clumsy when I have to parse fileout with another similarly structured file.
Is there any way to specify a default numeric output format, such that any string is printed as a string but all numbers follow a particular format?
I think you misinterpreted the meaning of %3.6f. The number before the decimal point is the total field width, not the "number of digits before the decimal point" (see printf(3)).
So you should use %10.6f instead. It can be tested easily in bash:
$ printf "%3.6f\n%3.6f\n%3.6f" 123.456 12.345 1.234
123.456000
12.345000
1.234000
$ printf "%10.6f\n%10.6f\n%10.6f" 123.456 12.345 1.234
123.456000
 12.345000
  1.234000
You can see that the latter aligns to the decimal point properly.
As sidharth c nadhan mentioned, you can use the OFMT awk internal variable (see awk(1)). An example:
$ awk 'BEGIN{print 123.456; print 12.345; print 1.234}'
123.456
12.345
1.234
$ awk -vOFMT=%10.6f 'BEGIN{print 123.456; print 12.345; print 1.234}'
123.456000
12.345000
1.234000
As I see in your example, the number with the most digits could be of the form 123456.1234567, so the format %15.7f would cover all values and produce a nice-looking table.
But unfortunately it will not work if the number has no decimal point in it, or even if it does but the fractional part is zero:
$ awk -vOFMT=%15.7f 'BEGIN{print 123.456;print 123;print 123.0;print 0.0+123.0}'
123.4560000
123
123
123
I even tried gawk's strtonum() function, but integer-valued numbers are still treated as values to which OFMT does not apply. See:
awk -vOFMT=%15.7f -vCONVFMT=%15.7f 'BEGIN{print 123.456; print strtonum(123); print strtonum(123.0)}'
It has the same output as before.
So I think you have to use printf anyway. The script can be a little shorter and a bit more configurable:
awk -vf='\t'%15.7f 'NR==FNR{x[$1]=sprintf("%s"f f f,$1,$2,$3,$4);next}$1 in x{printf("%s"f f f"\n",x[$1],$2,$3,$4)}' file1.txt file2.txt
The script will not work properly if there are duplicated IDs in the first file. If that cannot happen, the two conditions can be swapped and the ;next can be left off.
awk 'NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); printf "%s\t%9s\t%9s\t%9s\t%9s\t%9s\t%9s\n", $1,k[2],k[3],k[4],$2,$3,$4}' file1.txt file2.txt > fileout.txt
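If you want a single pass that reformats every numeric-looking field, integers included, without spelling out one format specifier per column, a sketch along these lines may help (the %15.7f width and precision are just taken from the discussion above; the $i+0 == $i test leaves non-numeric fields such as the IDs untouched):
awk 'BEGIN{FS=OFS="\t"} {for (i = 1; i <= NF; i++) if ($i+0 == $i) $i = sprintf("%15.7f", $i)} 1' fileout.txt
This can be run on fileout.txt after the join, or the same loop can be folded into the join script itself.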