awk greater-than comparison - why does it show the string value? - awk

I am using this command
awk '$1 > 3 {print $1}' file;
file :
String
2
4
5
6
7
String
The output is:
String
4
5
6
7
String
Why isn't the result only the numbers, as below?
4
5
6
7

This happens because one side of the comparison is a string, so awk does a string comparison, and "String" compares greater than "3" because the numeric value of the character 'S' is greater than that of '3':
$ printf "3: %d S: %d\n" \'3 \'S
3: 51 S: 83
Note: the ' before the arguments passed to printf is important, as it triggers the conversion to the numeric value in the underlying codeset:
If the leading character is a single-quote or double-quote, the value shall be the numeric value in the underlying codeset of the character following the single-quote or double-quote.
We write \' so that the ' is passed to printf, rather than being interpreted as syntax by the shell (a plain ' would open/close a string literal).
Returning to the question, to get the desired behaviour, you need to convert the first field to a number:
awk '+$1 > 3 { print $1 }' file
I am using the unary plus operator to convert the field to a number. Alternatively, some people prefer to simply add 0.
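For example (a quick sketch against the sample file above; $1+0 is the "add zero" spelling of the same idea):
awk '+$1 > 3 { print $1 }' file     # prints 4, 5, 6, 7 (one per line)
awk '$1+0 > 3 { print $1 }' file    # same result
Both force the field to a number, so the lines containing "String" (numeric value 0) no longer pass the comparison.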

Taken from the awk user guide...
ftp://ftp.gnu.org/old-gnu/Manuals/gawk-3.0.3/html_chapter/gawk_8.html
When comparing operands of mixed types, numeric operands are converted
to strings using the value of CONVFMT. ... CONVFMT's default value is
"%.6g", which prints a value with at least six significant digits.
So, basically, they are all treated as strings here, and "String" happens to be greater than "3".
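A quick way to see the difference between the two comparison modes (a sketch; appending "" forces awk to compare as strings):
echo "10 9" | awk '{ print ($1 < $2), ($1"" < $2"") }'
# prints: 0 1  -- numerically 10 is not less than 9, but as strings "10" sorts before "9"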

Related

Automatically round off - awk command

I have a csv file and I want to add a column that takes some values from other columns and makes some calculations. As a simplified version I'm trying this:
awk -F"," '{print $0, $1+1}' myFile.csv |head -1
The output is:
29.325172701023977,...other columns..., 30
The column added should be 30.325172701023977 but the output is rounded off.
I tried some options using printf, CONVFMT and OFMT but nothing worked.
How can I avoid the round off?
Assumptions:
the number of decimal places is not known beforehand
the number of decimal places can vary from line to line
Setup:
$ cat myfile.csv
29.325172701023977,...other columns...
15.12345,...other columns...
120.666777888,...other columns...
46,...other columns...
One awk idea where we use the number of decimal places to dynamically generate the printf "%.?f" format:
awk '
BEGIN { FS=OFS="," }
{ split($1,arr,".") # split $1 on period
numdigits=length(arr[2]) # count number of decimal places
newNF=sprintf("%.*f",numdigits,$1+1) # calculate $1+1 and format with "numdigits" decimal places
print $0,newNF # print new line
}
' myfile.csv
NOTE: this assumes the OP's locale uses a period as the decimal separator; for a locale that uses a comma to separate integer from fraction it gets more complicated, since it would be impossible to distinguish a comma used as the integer/fraction delimiter from the field delimiter without some changes to the file's format.
This generates:
29.325172701023977,...other columns...,30.325172701023977
15.12345,...other columns...,16.12345
120.666777888,...other columns...,121.666777888
46,...other columns...,47
As long as you aren't dealing with numbers greater than 9E15, there's no need to fudge any of CONVFMT, OFMT, or sprintf()/printf() at all:
{m,g}awk '$++NF = int((_=$!__) + sub("^[^.]*",__,_))_' FS=',' OFS=','
29.325172701023977,...other columns...,30.325172701023977
15.12345,...other columns...,16.12345
120.666777888,...other columns...,121.666777888
46,...other columns...,47
If mawk-1 is sending your numbers to scientific notation, do:
mawk '$++NF=int((_=$!!NF)+sub("^[^.]*",__,_))_' FS=',' OFS=',' CONVFMT='%.f'
When you scroll right, you'll notice all input digits beyond the decimal point are fully preserved:
2929292929.32323232325151515151727272727270707070701010101010232323232397979797977,...other columns...,2929292930.32323232325151515151727272727270707070701010101010232323232397979797977
1515151515.121212121234343434345,...other columns...,1515151516.121212121234343434345
12121212120.66666666666767676767777777777788888888888,...other columns...,12121212121.66666666666767676767777777777788888888888
4646464646,...other columns...,4646464647
2929.32325151727270701010232397977,...other columns...,2930.32325151727270701010232397977
1515.121234345,...other columns...,1516.121234345
12120.66666767777788888,...other columns...,12121.66666767777788888
4646,...other columns...,4647
Change it to CONVFMT='%\47.f' and you can even get mawk-1 to comma-format them nicely for you:
29292929292929.323232323232325151515151515172727272727272707070707070701010101010101023232323232323979797979797977,...other columns…,29,292,929,292,930.323232323232325151515151515172727272727272707070707070701010101010101023232323232323979797979797977
15151515151515.12121212121212343434343434345,...other columns…,15,151,515,151,516.12121212121212343434343434345
121212121212120.666666666666666767676767676777777777777777888888888888888,…other columns…,121,212,121,212,121.666666666666666767676767676777777777777777888888888888888
46464646464646,...other columns...,46,464,646,464,647
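For readers who find that one-liner hard to parse, here is a more verbose sketch of the same idea (my own rewording, not the answerer's exact code): do the +1 with integer arithmetic only, and carry the fractional digits along as an untouched string.
awk 'BEGIN { FS = OFS = "," }
{
    frac = $1
    sub(/^[^.]*/, "", frac)           # frac is now ".325172701023977", or "" if $1 has no dot
    $(NF + 1) = (int($1) + 1) frac    # +1 on the integer part only, then re-attach the fraction
    print
}' myfile.csv
Because the fractional digits are only ever handled as a string, they survive verbatim no matter what CONVFMT or OFMT are set to (with the same 9E15-ish caveat for the integer part).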

Comparing hexadecimal values with awk

I'm having trouble with awk and comparing values. Here's a minimal example:
echo "0000e149 0000e152" | awk '{print($1==$2)}'
Which outputs 1 instead of 0. What am I doing wrong? And how should I compare such values?
Thanks,
To convert a string representing a hex number to a numerical value, you need 2 things: prefix the string with "0x" and use the strtonum() function.
To demonstrate:
echo "0000e149 0000e152" | gawk '{
print $1, $1+0
print $2, $2+0
n1 = strtonum("0x" $1)
n2 = strtonum("0x" $2)
print $1, n1
print $2, n2
}'
0000e149 0
0000e152 0
0000e149 57673
0000e152 57682
We can see that naively treating the strings as numbers, awk thinks their value is 0. This is because the digits preceding the first non-digit happen to be only zeros.
Ref: https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html
Note that strtonum is a GNU awk extension
You need to convert $1 and $2 to strings in order to enforce alphanumeric comparison. This can be done by simply appending an empty string to them:
echo "0000e149 0000e152" | awk '{print($1""==$2"")}'
Otherwise awk would perform a numeric comparison, and for that it needs to convert the values to numbers. Converting those values to numbers yields 0 for both: awk reads 0000e149 and 0000e152 as numbers in scientific notation (a zero mantissa times a power of ten), so both come out as 0. You can verify that using the following command:
echo "0000e149 0000e152" | awk '{print $1+0; print $2+0)}'
0
0
When using non-decimal data you just need to tell gawk that's what you're doing and specify what base you're using in each number:
$ echo "0xe152 0x0000e152" | awk --non-decimal-data '{print($1==$2)}'
1
$ echo "0xE152 0x0000e152" | awk --non-decimal-data '{print($1==$2)}'
1
$ echo "0xe149 0x0000e152" | awk --non-decimal-data '{print($1==$2)}'
0
See http://www.gnu.org/software/gawk/manual/gawk.html#Nondecimal-Data
Many forget that the hex digits 0-9, A-F, a-f rank in order in ASCII. Instead of spending time on the conversion, or risking a numeric precision shortfall, simply do the following (a worked sketch follows the steps below):
trim leading zeros, including the optional 0x / 0X prefix
depending on the input source, also trim delimiters such as ":" (e.g. IPv6, MAC address), "-" (e.g. UUID), "_" (e.g. "0xffff_ffff_ffff_ffff"), "%" (e.g. URL-encoding), etc.; be mindful of the need to pad in missing leading zeros for formats that are very flexible with delimiters, such as IPv6
compare their respective string length()s: if those differ, the longer one is already distinctly larger
otherwise, prefix both with something meaningless like "\1" to guarantee a string-compare operation, without the risk of awk being too smart or of extreme edge cases like locale-specific peculiarities in its collating order:
(("\1") toupper(hex_str_1)) == (("\1") toupper(hex_str_2))

Check if a field is an integer in awk

I am using the following script to find the number of running connections on my mongodb-server.
mongostat | awk 'BEGIN{FS=" *"}{print "Number of connections: "$19}'
But every 10 lines, $19 carries a string, denoting a field name.
I want to modify my script to print only if $19 is an integer.
I could try FS = " *[^0-9]*", but it matches columns that start with a number rather than giving selective printing.
Use
mongostat | awk -F ' *' '$19 ~ /^[0-9]+$/ { print "Number of connections: " $19 }'
$19 ~ /^[0-9]+$/ checks if $19 matches the regex ^[0-9]+$ (i.e., if it only consists of digits), and the associated action is only executed if this is the case.
By the way, come to think of it, the special field separator is probably unnecessary. The default field separator of awk is any sequence of whitespace, so unless mongostat uses an odd mix of tabs and spaces,
mongostat | awk '$19 ~ /^[0-9]+$/ { print "Number of connections: " $19 }'
should work fine.
Check if this field is formed by just digits by making it match the regex ^[0-9]+$:
$19~/^[0-9]+$/
^ stands for the beginning of the string and $ for the end, so we are checking whether it consists of digits from beginning to end. With + we require at least one digit; otherwise an empty field would also match (so a line with fewer fields would always match).
All together:
mongostat | awk 'BEGIN{FS=" *"} $19~/^[0-9]+$/ {print "Number of connections: "$19}'
You have to be very careful here. The answer is not as simple as you imagine:
an integer has a sign, so you need to take this into account in your tests. The integers -123 and +123 will not be recognised as integers by the earlier proposed tests.
awk flexibly converts variables between floats (numbers) and strings. Converting to a string is done using sprintf. If the float represents an integer, the format %d is used; otherwise the format CONVFMT (default %.6g) is used. Some more detailed explanations are at the bottom of this post. So checking if a number is an integer and checking if a string is an integer are two different things.
So when you make use of a regular expression to test if a value is an integer, it will work flawlessly as long as your variable is still considered to be a string (such as an unprocessed field). However, if your variable is a number, awk will first convert the number into a string before doing the regular expression test, and this can fail:
function is_integer(x) { return x ~ /^[-+]?[0-9]+$/ }
BEGIN { n=split("+0 -123 +123.0 1.0000001",a)
        for(i=1;i<=n;++i) print a[i], is_integer(a[i]), is_integer(a[i]+0), a[i]+0
}
which outputs:
+0          1  1  0
-123        1  1  -123
+123.0      0  1  123        << QUESTIONABLE
1.0000001   0  1  1          << FAIL
            ^  ^
            |  test as number
            test as string
As you see, the last case fails because "%.6g" converts 1.0000001 into the string "1", and that conversion happens because the regular expression match is a string operation.
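You can watch that conversion happen (a quick sanity check, using the default CONVFMT of "%.6g"):
awk 'BEGIN { x = 1.0000001 + 0                # x is a number
             print (x ""), (x ~ /^[0-9]+$/)   # prints: 1 1
}'
The number is first converted to the string "1", and that string of course matches ^[0-9]+$.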
A more generic solution to validate if a variable represents an integer would be the following:
function is_number(x) { return x+0 == x }
function is_string(x) { return ! is_number(x) }
function is_float(x) { return x+0 == x && int(x) != x }
function is_integer(x) { return x+0 == x && int(x) == x }
BEGIN { n=split( "0 +0 -0 123 +123 -123 0.0 +0.0 -0.0 123.0 +123.0 -123.0 1.23 1.0000001 -1.23E01 123ABD STRING",a)
for(i=1;i<=n;++i) {
print a[i], is_number(a[i]), is_float(a[i]), is_integer(a[i]), \
a[i]+0, is_number(a[i]+0), is_float(a[i]+0), is_integer(a[i]+0)
}
}
This method still has issues with recognising 123.0 as a float, but that is because awk only knows floating point numbers.
A numeric value that is exactly equal to the value of an integer (see Concepts Derived from the ISO C Standard) shall be converted to a string by the equivalent of a call to the sprintf function (see String Functions) with the string "%d" as the fmt argument and the numeric value being converted as the first and only expr argument. Any other numeric value shall be converted to a string by the equivalent of a call to the sprintf function with the value of the variable CONVFMT as the fmt argument and the numeric value being converted as the first and only expr argument. The result of the conversion is unspecified if the value of CONVFMT is not a floating-point format specification. This volume of POSIX.1-2017 specifies no explicit conversions between numbers and strings. An application can force an expression to be treated as a number by adding zero to it, or can force it to be treated as a string by concatenating the null string ( "" ) to it.
source: the POSIX awk standard
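A tiny illustration of those two forcing idioms from the quote (a sketch):
echo "007" | awk '{ print ($1 + 0), ($1 "") }'   # prints: 7 007
$1+0 forces numeric treatment (7), while $1"" keeps the field as the string "007".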

Please explain this awk script

echo "45" | awk 'BEGIN{FS=""}{for (i=1;i<=NF;i++)x+=$i}END{print x}'
I want to know how this works. Specifically, what do FS and NF do in awk here?
FS is the field separator. Setting it to "" (the empty string) means that every single character will be a separate field. So in your case you've got two fields: 4, and 5.
NF is the number of fields in a given record. In your case, that's 2. So i ranges from 1 to 2, which means that $i takes the values 4 and 5.
So this AWK script iterates over the characters and prints their sum — in this case 9.
These are built-in variables: FS is the Field Separator, and setting it blank means each character is split out. NF is the Number of Fields split by FS, so in this case the number of characters, 2. So: split the input into individual characters ("4", "5"), iterate over each character while adding their values up, and print the result.
http://www.thegeekstuff.com/2010/01/8-powerful-awk-built-in-variables-fs-ofs-rs-ors-nr-nf-filename-fnr/
FS is the field separator. Normally fields are separated by whitespace, but when you set FS to the null string, each character of the input line is a separate field.
NF is the number of fields in the current input line. Since each character is a field, in this case it's the number of characters.
The for loop then iterates over each character on the line, adding it to x. So this is adding the value of each digit in input; for 45 it adds 4+5 and prints 9.
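The same script spread out with comments (just a restatement of what the answers above describe):
echo "45" | awk '
BEGIN { FS = "" }                         # empty FS: every character is its own field
{ for (i = 1; i <= NF; i++) x += $i }     # NF is 2 here, so x = 4 + 5
END { print x }                           # prints 9
'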

formatted reading using awk

I am trying to read in a formatted file using awk. The content looks like the following:
1PS1 A1 1 11.197 5.497 7.783
1PS1 A1 1 11.189 5.846 7.700
.
.
.
Following the C format, these lines are in the following format:
"%5d%5s%5s%5d%8.3f%8.3f%8.3f"
where the first 5 positions are an integer (1), the next 5 positions are characters (PS1), the next 5 positions are characters (A1), the next 5 positions are an integer (1), and the next 24 positions are divided into 3 columns of 8 positions, each holding a floating-point number with 3 decimal places.
What I've been doing so far is just referring to the whitespace-separated columns of these lines as $1, $2, $3, and so on. For example,
cat test.gro | awk 'BEGIN{i=0} {MolID[i]=$1; id[i]=$2; num[i]=$3; x[i]=$4;
y[i]=$5; z[i]=$6; i++} END { ... }' > test1.gro
But I ran into some problems with this, and now I am trying to read these files in a formatted way as discussed above.
Any idea how I do this?
Looking at your sample input, it seems the format string is actually "%5d%-5s%5s%5d%8.3f%8.3f%8.3f", with the first string field being left-justified. It's too bad awk doesn't have a scanf() function, but you can get your data with a few substr() calls:
awk -v OFS=: '
{
a=substr($0,1,5)
b=substr($0,6,5)
c=substr($0,11,5)
d=substr($0,16,5)
e=substr($0,21,8)
f=substr($0,29,8)
g=substr($0,37,8)
print a,b,c,d,e,f,g
}
'
outputs
1:PS1 : A1: 1: 11.197: 5.497: 7.783
1:PS1 : A1: 1: 11.189: 5.846: 7.700
If you have GNU awk, you can use the FIELDWIDTHS variable like this:
gawk -v FIELDWIDTHS="5 5 5 5 8 8 8" -v OFS=: '{print $1, $2, $3, $4, $5, $6, $7}'
also outputs
1:PS1 : A1: 1: 11.197: 5.497: 7.783
1:PS1 : A1: 1: 11.189: 5.846: 7.700
You never said exactly which field you think should have which number, so I'd like to be clear about how awk thinks that works (your choice to explicitly call the whitespace in your format string "fields" makes me worry a little that you might have a different idea about this than awk does).
From the manpage:
An input line is normally made up of fields separated by white space,
or by regular expression FS. The fields are denoted $1, $2, ..., while
$0 refers to the entire line. If FS is null, the input line is split
into one field per character.
Take note that the whitespace in the input line does not get assigned a field number and that sequential whitespace is treated as a single field separator.
You can test this with something like:
echo "1 2 3 4" | awk '{print "1:" $1 "\t2:" $2 "\t3:" $3 "\t4:" $4}'
at the command line.
All of this assumes that you have not diddled the FS variable, of course.