I'm having trouble with awk and comparing values. Here's a minimal example:
echo "0000e149 0000e152" | awk '{print($1==$2)}'
This outputs 1 instead of 0. What am I doing wrong? And how should I compare such values?
Thanks,
To convert a string representing a hex number to a numerical value, you need 2 things: prefix the string with "0x" and use the strtonum() function.
To demonstrate:
echo "0000e149 0000e152" | gawk '{
print $1, $1+0
print $2, $2+0
n1 = strtonum("0x" $1)
n2 = strtonum("0x" $2)
print $1, n1
print $2, n2
}'
0000e149 0
0000e152 0
0000e149 57673
0000e152 57682
We can see that when the strings are naively treated as numbers, awk thinks their value is 0. This is because each string parses as scientific notation with a zero mantissa: 0000e149 is read as 0 × 10^149, which is 0 (and likewise for 0000e152).
Ref: https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html
Note that strtonum() is a GNU awk extension.
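Putting the two pieces together, the original comparison can then be done numerically (a minimal sketch; strtonum() requires gawk):
echo "0000e149 0000e152" | gawk '{print (strtonum("0x" $1) == strtonum("0x" $2))}'
0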
You need to convert $1 and $2 to strings in order to enforce a string comparison. This can be done by simply appending an empty string to them:
echo "0000e149 0000e152" | awk '{print($1""==$2"")}'
Otherwise awk would perform a numeric comparison and would need to convert both fields to numeric values. Both conversions yield 0: a value like 0000e149 looks like a number in scientific notation, 0 × 10^149, so its numeric value is 0 (the octal-looking leading zeros are a red herring; awk does not parse field data as octal by default). You can verify that using the following command:
echo "0000e149 0000e152" | awk '{print $1+0; print $2+0}'
0
0
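To see that the exponent really is being parsed (and not just truncated at the first non-digit), compare against a value with a non-zero mantissa; this is just a sanity-check sketch:
echo "0000e149 12e2" | awk '{print $1+0; print $2+0}'
0
1200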
When using non-decimal data you just need to tell gawk that's what you're doing and specify what base you're using in each number:
$ echo "0xe152 0x0000e152" | awk --non-decimal-data '{print($1==$2)}'
1
$ echo "0xE152 0x0000e152" | awk --non-decimal-data '{print($1==$2)}'
1
$ echo "0xe149 0x0000e152" | awk --non-decimal-data '{print($1==$2)}'
0
See http://www.gnu.org/software/gawk/manual/gawk.html#Nondecimal-Data
I think many forget that the hex digits 0-9, A-F, a-f already rank in ASCII order. Instead of wasting time performing the conversion, or risking a numeric-precision shortfall, simply:
trim the leading zeros, including the optional 0x / 0X prefix
depending on the input source, also trim delimiters such as ":" (e.g. IPv6, MAC address), "-" (e.g. UUID), "_" (e.g. "0xffff_ffff_ffff_ffff"), "%" (e.g. URL encoding), etc. (be mindful of the need to pad in missing leading zeros for formats that are very flexible with delimiters, such as IPv6)
compare their respective string length()s: if those differ, then one is already distinctly larger;
otherwise, prefix both with something meaningless like "\1" to guarantee a string-compare operation, without the risk of awk being too smart or of extreme edge cases like locale-specific peculiarities in the collating order (a fuller sketch follows below):
(("\1") toupper(hex_str_1)) == (("\1") toupper(hex_str_2))
I have a csv file, and I want to add a column that takes some values from other columns and performs some calculations on them. As a simplified version I'm trying this:
awk -F"," '{print $0, $1+1}' myFile.csv |head -1
The output is:
29.325172701023977,...other columns..., 30
The column added should be 30.325172701023977 but the output is rounded off.
I tried some options using printf, CONVFMT and OFMT but nothing worked.
How can I avoid the round off?
Assumptions:
the number of decimal places is not known beforehand
the number of decimal places can vary from line to line
Setup:
$ cat myfile.csv
29.325172701023977,...other columns...
15.12345,...other columns...
120.666777888,...other columns...
46,...other columns...
One awk idea where we use the number of decimal places to dynamically generate the printf "%.*f" format:
awk '
BEGIN { FS=OFS="," }
{ split($1,arr,".") # split $1 on period
numdigits=length(arr[2]) # count number of decimal places
newNF=sprintf("%.*f",numdigits,$1+1) # calculate $1+1 and format with "numdigits" decimal places
print $0,newNF # print new line
}
' myfile.csv
NOTE: this assumes the OP's locale uses a period to separate integer from fraction. For a locale that uses a comma as the decimal separator it gets more complicated, since a comma as integer/fraction delimiter would be indistinguishable from the field delimiter without some changes to the file's format.
This generates:
29.325172701023977,...other columns...,30.325172701023977
15.12345,...other columns...,16.12345
120.666777888,...other columns...,121.666777888
46,...other columns...,47
As long as you aren't dealing with numbers greater than 9E15, there's no need to fudge any of CONVFMT, OFMT, or sprintf()/printf() at all:
{m,g}awk '$++NF = int((_=$!__) + sub("^[^.]*",__,_))_' FS=',' OFS=','
29.325172701023977,...other columns...,30.325172701023977
15.12345,...other columns...,16.12345
120.666777888,...other columns...,121.666777888
46,...other columns...,47
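If the obfuscated variable names obscure the trick: the one-liner copies $1, uses sub() to strip its integer part (sub() conveniently returns 1, which supplies the +1), adds that to int($1), and re-attaches the untouched fractional text as a string. A readable equivalent might look like this (my naming, same idea, same 9E15 caveat):
awk 'BEGIN { FS=OFS="," }
{
    frac = $1
    # sub() strips the integer part of frac and returns 1 (the +1 we want)
    n = int($1) + sub(/^[^.]*/, "", frac)
    $(NF+1) = n frac              # re-attach the fraction verbatim
    print
}' myfile.csv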
If mawk-1 is sending your numbers to scientific notation, do:
mawk '$++NF=int((_=$!!NF)+sub("^[^.]*",__,_))_' FS=',' OFS=',' CONVFMT='%.f'
When you scroll right, you'll notice all input digits beyond the decimal point are fully preserved:
2929292929.32323232325151515151727272727270707070701010101010232323232397979797977,...other columns...,2929292930.32323232325151515151727272727270707070701010101010232323232397979797977
1515151515.121212121234343434345,...other columns...,1515151516.121212121234343434345
12121212120.66666666666767676767777777777788888888888,...other columns...,12121212121.66666666666767676767777777777788888888888
4646464646,...other columns...,4646464647
2929.32325151727270701010232397977,...other columns...,2930.32325151727270701010232397977
1515.121234345,...other columns...,1516.121234345
12120.66666767777788888,...other columns...,12121.66666767777788888
4646,...other columns...,4647
Change it to CONVFMT='%\47.f' and you can even get mawk-1 to nicely comma-format them for you:
29292929292929.323232323232325151515151515172727272727272707070707070701010101010101023232323232323979797979797977,...other columns...,29,292,929,292,930.323232323232325151515151515172727272727272707070707070701010101010101023232323232323979797979797977
15151515151515.12121212121212343434343434345,...other columns...,15,151,515,151,516.12121212121212343434343434345
121212121212120.666666666666666767676767676777777777777777888888888888888,...other columns...,121,212,121,212,121.666666666666666767676767676777777777777777888888888888888
46464646464646,...other columns...,46,464,646,464,647
Just to make sure: is it true that using awk (GNU awk, at least) I can convert:
from octal to ASCII by:
print "\101" # or a="\101"
A
from hex to ASCII:
print "\x41" # or b="\x41"
B
but from decimal to ASCII I have to:
$ printf "%c\n", 67 # or c=sprintf("%c", 67)
C
Is there no secret print "\?67" in the manual (RTFM) that I missed?
I'm trying to get character frequencies from $0="aabccc" like:
for(i=141; i<=143; i++) a=a gsub("\\"i, ""); print a
213
but using decimals (instead of the octals in the above example). The decimal approach seems awfully long:
$ cat foo
aabccc
$ awk '{for(i=97;i<=99;i++){c=sprintf("%c",i);a=a gsub(c,"")} print a}' foo
213
No. \nnn is octal and \xnn is hex; that's all there is for including characters you cannot include as-is in strings, and you should always use the octal, not the hex, representation for robustness (see, for example, http://awk.freeshell.org/PrintASingleQuote).
I don't understand the last part of your question where you state what you're trying to do with this - provide concise, testable sample input and expected output and I'm sure someone can help you do it the right way, whatever it is.
Is this what you're trying to do?
$ awk 'BEGIN{for (i=0141; i<0143; i++) print i}'
97
98
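Combining gawk's octal source constants with sprintf("%c") gives the frequency loop from the question without hard-coding the decimal codes (a gawk-only sketch; other awks may read 0141 as decimal 141):
$ echo "aabccc" | gawk '{for (i=0141; i<=0143; i++) a = a gsub(sprintf("%c", i), ""); print a}'
213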
A lookup table is the only way to directly convert a character to its ASCII decimal value within awk alone; the reverse direction, decimal to character, is simply sprintf("%c", n).
You can create the lookup table by iterating through each of the known ASCII characters and storing them in an array where the key is the character and the value is its ASCII code: use sprintf() within awk to get the character for each decimal code, then index the array with a character to get the corresponding decimal back.
In this example, using awk, we cycle through all 256 characters, printing out each one. We split the resulting output into a series of lines where each line has a single character. We then build a table of the 256 characters (in BEGIN) and feed it each of the input characters to look each one up. Finally, we print out the code for each character of the input.
awk 'BEGIN{
for(n=0;n<256;n++)
print sprintf("%c",n)
}' | awk '{
for (i=0; ++i <= length($0);)
printf "%s\n", substr($0, i, 1)
}' | awk 'BEGIN{
for(n=0;n<256;n++)
ord[sprintf("%c",n)]=n
}{
print ord[$1]
}'
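The three stages can also be collapsed into a single program, building the ord[] table in BEGIN and walking each input line with substr() (a sketch along the same lines; it reads whatever lines you feed it on stdin):
awk 'BEGIN {
    for (n = 0; n < 256; n++)        # character -> code lookup table
        ord[sprintf("%c", n)] = n
}
{
    for (i = 1; i <= length($0); i++)
        print ord[substr($0, i, 1)]
}'
For example, printf 'A\n' piped into this prints 65.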
The reverse can also be done, where we look up a list of character codes.
awk 'BEGIN{
for(n=0;n<256;n++)
print sprintf("%s",n)
}' | awk 'BEGIN{
for(n=0;n<256;n++)
char[n]=sprintf("%c",n)
}{
print char[$1]
}'
Note: the second example may print out a lot of garbage in the high ASCII range (> 128), depending on the character set you are using.
If, as you say at the end of your question, you're simply looking to count the frequency of characters, I'd just assemble an array.
$ awk '{for(i=1;i<=length($0);i++) a[substr($0,i,1)]++} END{for(i in a) printf "%d %s\n",a[i],i}' <<<$'aabccc\ndaae'
1 d
1 e
4 a
1 b
3 c
Note that this also supports multi-line input.
We're stepping through each character of each input line, incrementing a counter in an array subscripted by the character in question.
I would expect this approach to be more performant than applying a regex to count the replacements for every interesting character, but I haven't done any speed comparison tests (and of course it would depend on how large a set you're interested in).
While this answer doesn't address your initial question, I hope it'll provide a better way to approach the problem.
(Thanks for including the final details in your question. XY problems are all too frequent here.)
If you need to encode bytes to octal escapes in awk, here's a fully self-encapsulated, recursive, cross-awk-compatible octal encoder that I came up with before:
verified on gawk, mawk-1, mawk-2, and nawk,
benchmarked throughput rate of 39.2 MByte/sec:
out9: 1.82GiB 0:00:47 [39.2MiB/s] [39.2MiB/s] [ <=> ]
in0: 466MiB 0:00:00 [1.78GiB/s] [1.78GiB/s] [>] 100%
( pvE 0.1 in0 < "${m3l}" | mawk2x ; )
39.91s user 6.94s system 98% cpu 47.656 total
78b4c27659ae66e4c98796a60043f1fe stdin
echo "${data}" | awk '{
print octencode_v7($0)
}
function octencode_v7(______,_,__,___,____,_____,_______) {
if ( ( (_+=_+=_^=_<_\
)^_*(_+_)*(_^_)^(!(("\1"~"\0")+\
index(-+log(_<_),"+") ) ) )<=(___=\
(_^=_<_)<length("\333\222")\
? length(______) : match(______,"$")-_)) {
return \
octencode_v7(substr(______,_^=_<_,_=int(___/(_+_)))) \
octencode_v7(substr(______,++_))
}
_______=___
___="\36\6\2\24"
gsub(/\\/,___,______)
_______-=gsub("["(!_)"-"(_+(_+=++_+_))"]", "\\"(!_)(_)"&",______)
_--;
if (!+(_______-=gsub(___, "\\"(_--^--_+_*_),______) \
- gsub("[[]","\\" ((_^!_)(_)_),______) \
- gsub(/\^/, "\\" ((_^!_)(_)(_+_)),______))) {
return ______
}
____=___=_+=_^=_<_
_____=(___^=++____)-_^(____=!_)
do { ___=_____
do { __=_____
if (+____ || (_____-___)!=_^(_<_)) {
do { _=(____)(___)__
if (+____!=_^(_<_) || ! index(___^___,_____) ||
+__!~"^["(_____^___+___)"]$") {
_="\\"(_)
_______-=gsub(((!+____ && +_____<(___+___)) ||
(+____==_^(_<_) &&
( +___==+_____ ||
(___+____+___)==+_____))) \
? "["(_)"]" : (_), _,______)
} } while(__--)
} } while(___--)
if (!_______) {
return ______
} } while((++____+____)<_____)
return ______
}'
It's basically a triple-nested do-while loop combo that cycles through all the octal codes, without needing any previously made lookup reference strings/arrays.
Regarding the earlier note that the second example may print a lot of garbage in the high ASCII range (> 128): this can be circumvented by using the octal codes \200 - \377 for 128-255.
IIRC the bytes C0 C1 F5 F6 F7 F8 F9 FA FB FC FD FE FF shouldn't exist within properly encoded UTF-8 documents (or are not yet specced to). FE and FF may overlap with the UTF-16 byte-order mark, but that should hardly be a concern today, since the world has standardized on UTF-8.
I am using this command
awk '$1 > 3 {print $1}' file;
file:
String
2
4
5
6
7
String
outputs this:
String
4
5
6
7
String
Why is the result not only the numbers, as below?
4
5
6
7
This happens because one side of the comparison is a string, so awk is doing string comparison and the numeric value of the character 'S' is greater than 3.
$ printf "3: %d S: %d\n" \'3 \'S
3: 51 S: 83
Note: the ' before each argument passed to printf is important, as it triggers the conversion to the numeric value in the underlying codeset:
If the leading character is a single-quote or double-quote, the value shall be the numeric value in the underlying codeset of the character following the single-quote or double-quote.
We write \' so that the ' is passed to printf, rather than being interpreted as syntax by the shell (a plain ' would open/close a string literal).
Returning to the question, to get the desired behaviour, you need to convert the first field to a number:
awk '+$1 > 3 { print $1 }' file
I am using the unary plus operator to convert the field to a number. Alternatively, some people prefer to simply add 0.
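With the conversion forced, the first field of the String lines evaluates to 0, so those lines no longer pass the test; this is just the expected result spelled out:
$ awk '+$1 > 3 {print $1}' file
4
5
6
7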
Taken from the awk user guide...
ftp://ftp.gnu.org/old-gnu/Manuals/gawk-3.0.3/html_chapter/gawk_8.html
When comparing operands of mixed types, numeric operands are converted
to strings using the value of CONVFMT. ... CONVFMT's default value is
"%.6g", which prints a value with at least six significant digits.
So, basically, they are all treated as strings, and "String" happens to be greater than "3".
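You can reproduce the effect directly; the numeric constant is converted with CONVFMT and then compared as a string (a quick sketch):
$ awk 'BEGIN{print ("String" > 3), ("String" > "3")}'
1 1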
I wanted to do a simple parsing of two files with ids and some corresponding numerical values. I didn't want awk to print numbers in scientific notation.
File looks like this:
someid-1 860025 50.0401 4.00022
someid-2 384319 22.3614 1.78758
someid-3 52096 3.03118 0.242314
someid-4 43770 2.54674 0.203587
someid-5 33747 1.96355 0.156967
someid-6 20281 1.18004 0.0943328
someid-7 12231 0.711655 0.0568899
someid-8 10936 0.636306 0.0508665
someid-9 10224.8 0.594925 0.0475585
someid-10 10188.8 0.59283 0.047391
When I use print instead of printf:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); print $1,k[2],k[3],k[4],$2,$3,$4}' OSCAo.txt dme_miRNA_PIWI_OSC.txt | sort -n -r -k 7 | head
i get this result:
dme-miR-iab-4-5p 0.333333 0.000016 0.000001 0.25 0.000605606 9.36543e-07
dme-miR-9c-5p 10987.300000 0.525413 0.048798 160.2 0.388072 0.000600137
dme-miR-9c-3p 731.986000 0.035003 0.003251 2.10714 0.00510439 7.89372e-06
dme-miR-9b-5p 30322.500000 1.450020 0.134670 595.067 1.4415 0.00222922
dme-miR-9b-3p 2628.280000 0.125684 0.011673 48 0.116276 0.000179816
dme-miR-9a-3p 10.365000 0.000496 0.000046 0.25 0.000605606 9.36543e-07
dme-miR-999-5p 103.433000 0.004946 0.000459 0.0769231 0.00018634 2.88167e-07
dme-miR-999-3p 1513.790000 0.072389 0.006723 28 0.0678278 0.000104893
dme-miR-998-5p 514.000000 0.024579 0.002283 73 0.176837 0.000273471
dme-miR-998-3p 3529.000000 0.168756 0.015673 42 0.101742 0.000157339
Notice the scientific notation in the last column
I understand that printf with appropriate format modifier can do the job but the code becomes very lengthy. I have to write something like this:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); printf "%s\t%3.6f\t%3.6f\t%3.6f\t%3.6f\t%3.6f\t%3.6f\n", $1,k[2],k[3],k[4],$2,$3,$4}' file1.txt file2.txt > fileout.txt
This becomes clumsy when I have to parse fileout with another similarly structured file.
Is there any way to specify default numerical output, such that any string will be printed like a string but all numbers follow a particular format.
I think you misinterpreted the meaning of %3.6f. The first number, before the decimal point, is the total field width, not the "number of digits before the decimal point" (see printf(3)).
So you should use %10.6f instead. It can be tested easily in bash:
$ printf "%3.6f\n%3.6f\n%3.6f" 123.456 12.345 1.234
123.456000
12.345000
1.234000
$ printf "%10.6f\n%10.6f\n%10.6f" 123.456 12.345 1.234
123.456000
12.345000
1.234000
You can see that the latter aligns on the decimal point properly.
As sidharth c nadhan mentioned, you can use the OFMT awk internal variable (see awk(1)). An example:
$ awk 'BEGIN{print 123.456; print 12.345; print 1.234}'
123.456
12.345
1.234
$ awk -vOFMT=%10.6f 'BEGIN{print 123.456; print 12.345; print 1.234}'
123.456000
 12.345000
  1.234000
As I see in your example, the number with the most digits can be 123456.1234567, so the format %15.7f covers them all and shows a nice-looking table.
But unfortunately it will not work if the number has no decimal point in it, or even if it does but the value is integral (ends with .0):
$ awk -vOFMT=%15.7f 'BEGIN{print 123.456;print 123;print 123.0;print 0.0+123.0}'
    123.4560000
123
123
123
I even tried gawk's strtonum() function, but integral values are still printed without applying OFMT. See:
awk -vOFMT=%15.7f -vCONVFMT=%15.7f 'BEGIN{print 123.456; print strtonum(123); print strtonum(123.0)}'
It has the same output as before.
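The underlying rule is that print applies OFMT only to values that are not integral; integers are printed as integers regardless. Forcing a tiny fractional part makes the difference visible (a demonstration, not a recommendation):
$ awk -v OFMT=%15.7f 'BEGIN{print 123.0; print 123.0000001}'
123
    123.0000001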
So I think you have to use printf anyway. The script can be a little shorter and a bit more configurable:
awk -vf='\t'%15.7f 'NR==FNR{x[$1]=sprintf("%s"f f f,$1,$2,$3,$4);next}$1 in x{printf("%s"f f f"\n",x[$1],$2,$3,$4)}' file1.txt file2.txt
The script will not work properly if there are duplicate IDs in the first file. If that cannot happen, the two conditions can be swapped and the ;next can be left off.
awk 'NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); printf "%s\t%9s\t%9s\t%9s\t%9s\t%9s\t%9s\n", $1,k[2],k[3],k[4],$2,$3,$4}' file1.txt file2.txt > fileout.txt
I am trying to read in a formatted file using awk. The content looks like the following:
1PS1 A1 1 11.197 5.497 7.783
1PS1 A1 1 11.189 5.846 7.700
.
.
.
In C format, these lines follow
"%5d%5s%5s%5d%8.3f%8.3f%8.3f"
where the first 5 positions are an integer (1), the next 5 positions are characters (PS1), the next 5 positions are characters (A1), the next 5 positions are an integer (1), and the next 24 positions are divided into 3 columns of 8 positions each, holding floating-point numbers with 3 decimal places.
What I've been using is just calling these lines separated by columns using "$1, $2, $3". For example,
cat test.gro | awk 'BEGIN{i=0} {MolID[i]=$1; id[i]=$2; num[i]=$3; x[i]=$4;
y[i]=$5; z[i]=$6; i++} END { ... }' > test1.gro
But I ran into some problems with this, and now I am trying to read these files in a formatted way as discussed above.
Any idea how I do this?
Looking at your sample input, it seems the format string is actually "%5d%-5s%5s%5d%8.3f%8.3f%8.3f", with the first string field being left-justified. It's too bad awk doesn't have a scanf() function, but you can get at your data with a few substr() calls:
awk -v OFS=: '
{
a=substr($0,1,5)
b=substr($0,6,5)
c=substr($0,11,5)
d=substr($0,16,5)
e=substr($0,21,8)
f=substr($0,29,8)
g=substr($0,37,8)
print a,b,c,d,e,f,g
}
'
outputs
1:PS1 : A1: 1: 11.197: 5.497: 7.783
1:PS1 : A1: 1: 11.189: 5.846: 7.700
If you have GNU awk, you can use the FIELDWIDTHS variable like this:
gawk -v FIELDWIDTHS="5 5 5 5 8 8 8" -v OFS=: '{print $1, $2, $3, $4, $5, $6, $7}'
also outputs
1:PS1 : A1: 1: 11.197: 5.497: 7.783
1:PS1 : A1: 1: 11.189: 5.846: 7.700
You never said exactly which fields you think should have which numbers, so I'd like to be clear about how awk thinks this works. (Your choice to explicitly call the whitespace in your output format string "fields" makes me worry a little; you might have a different idea about this than awk does.)
From the manpage:
An input line is normally made up of fields separated by white space,
or by regular expression FS. The fields are denoted $1, $2, ..., while
$0 refers to the entire line. If FS is null, the input line is split
into one field per character.
Take note that the whitespace in the input line does not get assigned a field number and that sequential whitespace is treated as a single field separator.
You can test this with something like:
echo "1 2 3 4" | awk '{print "1:" $1 "\t2:" $2 "\t3:" $3 "\t4:" $4}'
at the command line.
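The one-field-per-character case with a null FS can be seen the same way (gawk and most current awks support this; older traditional awks may not):
$ echo "abc" | awk -v FS= '{print NF, $2}'
3 b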
All of this assumes that you have not diddled the FS variable, of course.