Check if a field is an integer in awk - awk

I am using the following script to find the number of running connections on my mongodb-server.
mongostat | awk 'BEGIN{FS=" *"}{print "Number of connections: "$19}'
But every 10 lines, $19 carries a string, denoting a field name.
I want to modify my script to print only if $19 is an integer.
I could try FS = " *[^0-9]*", but it matches columns that start with number rather than giving selective printing.

mongostat | awk -F ' *' '$19 ~ /^[0-9]+$/ { print "Number of connections: " $19 }'
$19 ~ /^[0-9]+$/ checks if $19 matches the regex ^[0-9]+$ (i.e., if it only consists of digits), and the associated action is only executed if this is the case.
By the way, come to think of it, the special field separator is probably unnecessary. The default field separator of awk is any sequence of whitespace, so unless mongostat uses an odd mix of tabs and spaces,
mongostat | awk '$19 ~ /^[0-9]+$/ { print "Number of connections: " $19 }'
should work fine.
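To try the filter without a running mongostat, here is a minimal sketch on made-up input. The helper name print_if_integer and the two-column sample lines are assumptions for illustration only; with real mongostat output the field number would be 19, as above.

```shell
# Hypothetical helper: look at field number $1 of each input line and
# print it only when that field consists solely of digits.
print_if_integer() {
    awk -v n="$1" '$n ~ /^[0-9]+$/ { print "Number of connections: " $n }'
}

# Made-up sample: the header lines carry a column name instead of a number.
printf '%s\n' 'insert conn' '*0 12' 'insert conn' '*0 7' | print_if_integer 2
# prints:
# Number of connections: 12
# Number of connections: 7
```

The header lines are skipped because "conn" does not match the regex; no special field separator is needed.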

Check if this field is formed by just digits by making it match the regex ^[0-9]+$:
^ stands for the beginning of the string and $ for the end, so we are checking that it consists of digits from beginning to end. With + we require at least one digit; otherwise an empty field would also match (so a line with fewer fields would always match).
All together:
mongostat | awk 'BEGIN{FS=" *"} $19~/^[0-9]+$/ {print "Number of connections: "$19}'

You have to be very careful here. The answer is not as simple as you imagine:
an integer has a sign, so you need to take this into account in your tests. So the integers -123 and +123 will not be recognised as integers in earlier proposed tests.
awk flexibly converts variables types from floats (numbers) to strings and vice versa. Converting to strings is done using sprintf. If the float represents an integer, use the format %d otherwise use the format CONVFMT (default %.6g). Some more detailed explanations are at the bottom of this post. So checking if a number is an integer or if a string is an integer are two different things.
So when you use a regular expression to test if a number is an integer, it will work flawlessly if your variable is still considered to be a string (such as an unprocessed field). However, if your variable is a number, awk will first convert the number into a string before doing the regular expression test, and as such, this can fail:
function is_integer(x) { return x ~ /^[-+]?[0-9]+$/ }
BEGIN { n=split("+0 -123 +123.0 1.0000001",a)
  # column 2: test on the string, column 3: test on the number a[i]+0
  for(i=1;i<=n;++i) print a[i],is_integer(a[i]), is_integer(a[i]+0), a[i]+0
}
which outputs:
+0         1  1  0
-123       1  1  -123
+123.0     0  1  123    << QUESTIONABLE
1.0000001  0  1  1      << FAIL
           ^  ^
           |  test as number
           test as string
As you see, the last case failed because "%.6g" converts 1.0000001 into the string 1 and this is done because we use string operations.
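The number-to-string conversion can be observed directly. This is a small sketch, assuming the default CONVFMT of "%.6g"; the helper name num_to_str is made up for the example.

```shell
# Hypothetical helper: turn the argument into an awk number (v + 0), then
# force it back to a string by concatenating "" (uses %d or CONVFMT).
num_to_str() {
    awk -v v="$1" 'BEGIN { x = v + 0; print (x "") }'
}

num_to_str 1.0000001    # prints 1        (CONVFMT "%.6g" drops the tail)
num_to_str 123.0        # prints 123      (integer value, so "%d" is used)
num_to_str 3.14159265   # prints 3.14159
```

Any regex applied to such a number sees these converted strings, not the original text.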
A more generic solution to validate if a variable represents an integer would be the following:
function is_number(x) { return x+0 == x }
function is_string(x) { return ! is_number(x) }
function is_float(x) { return x+0 == x && int(x) != x }
function is_integer(x) { return x+0 == x && int(x) == x }
BEGIN { n=split( "0 +0 -0 123 +123 -123 0.0 +0.0 -0.0 123.0 +123.0 -123.0 1.23 1.0000001 -1.23E01 123ABD STRING",a)
  for(i=1;i<=n;++i) {
    print a[i], is_number(a[i]), is_float(a[i]), is_integer(a[i]), \
          a[i]+0, is_number(a[i]+0), is_float(a[i]+0), is_integer(a[i]+0)
  }
}
This method still has issues with recognising 123.0 as a float, but that is because awk only knows floating point numbers.
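Applied to input fields rather than split() results, the same is_integer test could be sketched as follows. The wrapper name classify_first_field and the sample values are assumptions for illustration.

```shell
# Hypothetical wrapper: classify the first field of every input line with
# the numeric is_integer test (x+0 == x && int(x) == x).
classify_first_field() {
    awk 'function is_integer(x) { return x+0 == x && int(x) == x }
         { print $1, (is_integer($1) ? "integer" : "not integer") }'
}

printf '%s\n' 42 -7 +3 1.5 123.0 abc | classify_first_field
# prints:
# 42 integer
# -7 integer
# +3 integer
# 1.5 not integer
# 123.0 integer
# abc not integer
```

Note that 123.0 counts as an integer here, which is the limitation mentioned above: awk only sees the numeric value.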
A numeric value that is exactly equal to the value of an integer (see Concepts Derived from the ISO C Standard) shall be converted to a string by the equivalent of a call to the sprintf function (see String Functions) with the string "%d" as the fmt argument and the numeric value being converted as the first and only expr argument. Any other numeric value shall be converted to a string by the equivalent of a call to the sprintf function with the value of the variable CONVFMT as the fmt argument and the numeric value being converted as the first and only expr argument. The result of the conversion is unspecified if the value of CONVFMT is not a floating-point format specification. This volume of POSIX.1-2017 specifies no explicit conversions between numbers and strings. An application can force an expression to be treated as a number by adding zero to it, or can force it to be treated as a string by concatenating the null string ( "" ) to it.
source: Awk Posix standard


Automatically round off - awk command

I have a csv file and I want to add a column that takes some values from other columns and make some calculations. As a simplified version I'm trying this:
awk -F"," '{print $0, $1+1}' myFile.csv |head -1
The output is:
29.325172701023977,...other columns..., 30
The column added should be 30.325172701023977 but the output is rounded off.
I tried some options using printf, CONVFMT and OFMT but nothing worked.
How can I avoid the round off?
the number of decimal places is not known beforehand
the number of decimal places can vary from line to line
$ cat myfile.csv
29.325172701023977,...other columns...
15.12345,...other columns...
120.666777888,...other columns...
46,...other columns...
One awk idea where we use the number of decimal places to dynamically generate the printf "%.?f" format:
awk '
BEGIN { FS=OFS="," }
      { split($1,arr,".")                      # split $1 on period
        numdigits=length(arr[2])               # count number of decimal places
        newNF=sprintf("%.*f",numdigits,$1+1)   # calculate $1+1 and format with "numdigits" decimal places
        print $0,newNF                         # print new line
      }
' myfile.csv
NOTE: assumes OP's locale uses a decimal/period to separate integer from fraction; for a locale that uses a comma to separate integer from fraction it gets more complicated since it will be impossible to distinguish between a comma as integer/fraction delimiter vs field delimiter without some changes to the file's format
This generates:
29.325172701023977,...other columns...,30.325172701023977
15.12345,...other columns...,16.12345
120.666777888,...other columns...,121.666777888
46,...other columns...,47
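If an awk at hand does not accept the * precision in sprintf (I have not checked which ones don't), the format string can be built by concatenation instead. A minimal sketch of the same idea; the function name add_one_keep_decimals and the short sample lines are made up for the example.

```shell
# Hypothetical helper: append $1+1 as a new last column, formatted with as
# many decimal places as $1 has in the input.
add_one_keep_decimals() {
    awk 'BEGIN { FS = OFS = "," }
    {
        n = split($1, parts, ".")                # parts[2] holds the decimals, if any
        digits = (n > 1) ? length(parts[2]) : 0
        fmt = "%." digits "f"                    # e.g. "%.5f"
        print $0, sprintf(fmt, $1 + 1)
    }'
}

printf '%s\n' '15.12345,x' '46,y' | add_one_keep_decimals
# prints:
# 15.12345,x,16.12345
# 46,y,47
```

Like the version above, this is limited by double precision, so very long decimal tails may not round-trip exactly.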
As long as you aren't dealing with numbers greater than 9E15, there's no need to touch CONVFMT, OFMT, or sprintf()/printf() at all:
{m,g}awk '$++NF = int((_=$!__) + sub("^[^.]*",__,_))_' FS=',' OFS=','
29.325172701023977,...other columns...,30.325172701023977
15.12345,...other columns...,16.12345
120.666777888,...other columns...,121.666777888
46,...other columns...,47
If mawk-1 switches your numbers to scientific notation, use:
mawk '$++NF=int((_=$!!NF)+sub("^[^.]*",__,_))_' FS=',' OFS=',' CONVFMT='%.f'
When you scroll right, you'll notice that all input digits after the decimal point are fully preserved:
2929292929.32323232325151515151727272727270707070701010101010232323232397979797977,...other columns...,2929292930.32323232325151515151727272727270707070701010101010232323232397979797977
1515151515.121212121234343434345,...other columns...,1515151516.121212121234343434345
12121212120.66666666666767676767777777777788888888888,...other columns...,12121212121.66666666666767676767777777777788888888888
4646464646,...other columns...,4646464647
2929.32325151727270701010232397977,...other columns...,2930.32325151727270701010232397977
1515.121234345,...other columns...,1516.121234345
12120.66666767777788888,...other columns...,12121.66666767777788888
4646,...other columns...,4647
Change it to CONVFMT='%\47.f' and you can even get mawk-1 to format them with thousands separators for you:
29292929292929.323232323232325151515151515172727272727272707070707070701010101010101023232323232323979797979797977,...other columns...,29,292,929,292,930.323232323232325151515151515172727272727272707070707070701010101010101023232323232323979797979797977
15151515151515.12121212121212343434343434345,...other columns...,15,151,515,151,516.12121212121212343434343434345
121212121212120.666666666666666767676767676777777777777777888888888888888,...other columns...,121,212,121,212,121.666666666666666767676767676777777777777777888888888888888
46464646464646,...other columns...,46,464,646,464,647

AWK script- Not showing data

I'm trying to create a variable to sum columns 26 to 30 and 32.
So far I have this code, which prints the header and the output format the way I want, but no data is shown.
#! /usr/bin/awk -f
BEGIN { FS="," }
NR>1 {
TotalPositiveStats= ($26+$27+$28+$29+$30+$32)
{printf "%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%.2f %,%s,%s,%.2f %,%s,%s,%.2f %,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s, %s\n",
NR==1 {
print "EndYear,Rk,G,Date,Years,Days,Age,Tm,HOme,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,TotalPositiveStats" }#header
Input data:
Output expected:
This script will be called like gawk -f script.awk <filename>.
Currently when calling this is the output (It seems to be calculating the variable but the rest of fields are empty)
awk is well suited to summing columns:
awk 'NR>1{$(NF+1)=$26+$27+$28+$29+$30+$32}1' FS=, OFS=, input-file > tmp
mv tmp input-file
That doesn't add a field in the header line, so you might want something like:
awk '{$(NF+1) = NR>1 ? ($26+$27+$28+$29+$30+$32) : "TotalPositiveStats"}1' FS=, OFS=,
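The same pattern can be tried on a small made-up file. This is only a sketch: the helper name add_total, the three-column sample, and the use of $2+$3 in place of the OP's six columns are all assumptions for illustration.

```shell
# Hypothetical helper: append a "Total" header on line 1 and the sum of
# columns 2 and 3 on every other line.
add_total() {
    awk -F, -v OFS=, '{ $(NF+1) = NR>1 ? ($2+$3) : "Total" } 1'
}

printf '%s\n' 'name,a,b' 'x,1,2' 'y,10,5' | add_total
# prints:
# name,a,b,Total
# x,1,2,3
# y,10,5,15
```

Assigning to $(NF+1) rebuilds the record with OFS, which is why OFS has to be set to a comma as well.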
An explanation on the issues with the current printf output is covered in the 2nd half of this answer (below).
It appears OP's objective is to reformat three of the current fields while also adding a new field on the end of each line. (NOTE: certain aspects of OPs code are not reflected in the expected output so I'm not 100% sure what OP is looking to generate; regardless, OP should be able to tweak the provided code to generate the desired result)
Using sprintf() to reformat the three fields we can rewrite OP's current code as:
awk '
BEGIN { FS=OFS="," }
NR==1 { print $0, "TotalPositiveStats"; next }
      { TotalPositiveStats = ($26+$27+$28+$29+$30+$32)
        $17 = sprintf("%.3f",$17)                  # FG_PCT
        if ($20 != "") $20 = sprintf("%.3f",$20)   # 3P_PCT
        $23 = sprintf("%.3f",$23)                  # FT_PCT
        print $0, TotalPositiveStats
      }
' raw.dat
NOTE: while OP's printf shows a format of %.2f % for the 3 fields of interest ($17, $20, $23), the expected output shows that the fields are not actually being reformatted (eg, $17 remains %.3f, $20 is an empty string, $23 remains %.2f); I've opted to leave $20 as blank otherwise reformat all 3 fields as %.3f; OP can modify the sprintf() calls as needed
This generates:
NOTE: in OP's expected output it appears the last/new field (TotalPositiveStats) does not contain the value from $30 hence the mismatch between the expected results and this answer; again, OP can modify the assignment statement for TotalPositiveStats to include/exclude fields as needed
Regarding the issues with the current printf ...
{printf "%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%.2f %,%s,%s,%.2f %,%s,%s,%.2f %,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s, %s\n",
... is referencing (awk) variables that have not been defined (eg, EndYear, Rk, G). [NOTE: one exception is the very last variable in the list - TotalPositiveStats - which has in fact been defined earlier in the script.]
The default value for undefined variables is the empty string ("") or zero (0), depending on how the awk code is referencing the variable, eg:
printf "%s", EndYear => EndYear is treated as a string and the printed result is an empty string; with an output field delimiter of a comma (,) this empty string shows up as 2 commas next to each other (,,)
printf "%.2f %", FG_PCT => FG_PCT is treated as a numeric (because of the %f format) and the printed result is 0.00 %
Where it gets a little interesting is when the (undefined) variable name starts with a digit (eg, 3P): awk parses this as the number 3 concatenated with the undefined (empty) variable P, so the entire reference evaluates to 3, eg:
printf "%s", 3P => 3P is processed as 3 and the printed result is 3
This should explain the 5 static values (0.00 %, 3, 3, 3.00 % and 0.00 %) printed in all output lines as well as the 'missing' values between the rest of the commas (eg, ,,,,).
Obviously the last value in the line is an actual number, ie, the value of the awk variable TotalPositiveStats.
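These defaults are easy to check in isolation. A minimal sketch, with the variable names borrowed from OP's header line and the wrapper name show_undefined made up for the example:

```shell
# Hypothetical demo: none of these awk variables is ever assigned.
show_undefined() {
    awk 'BEGIN { printf "[%s] [%.2f] [%s] [%.2f]\n", EndYear, FG_PCT, 3P, 3P_PCT }'
}

show_undefined   # prints [] [0.00] [3] [3.00]
```

The brackets are only there to make the empty string visible.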

About awk and integer to ASCII character conversion

Just to make sure, is it really that using awk (Gnu awk at least) I can convert:
from octal to ASCII by:
print "\101" # or a="\101"
from hex to ASCII:
print "\x41" # or b="\x41"
but from decimal to ASCII I have to:
$ printf "%c\n", 67 # or c=sprintf("%c", 67)
There is no secret print "\?67" in that RTFM (Memo) I missed?
I'm trying to get character frequencies from $0="aabccc" like:
for(i=141; i<143; i++) a=a gsub("\\"i, ""); print a
but using decimals (instead of octals in above example). The decimalistic approach seem awfully long:
$ cat foo
$ awk '{for(i=97;i<=99;i++){c=sprintf("%c",i);a=a gsub(c,"")} print a}' foo
It got used here.
No, \nnn is octal and \xnn is hex - that's all there is for including characters you cannot include as-is in strings and you should always use the octal, not the hex, representation for robustness (see, for example,
I don't understand the last part of your question where you state what you're trying to do with this - provide concise, testable sample input and expected output and I'm sure someone can help you do it the right way, whatever it is.
Is this what you're trying to do?
$ awk 'BEGIN{for (i=0141; i<0143; i++) print i}'
A lookup table is the only way to address this (directly convert CHAR to ASCII DECIMAL) within "AWK only".
You can simply use sprintf() to convert ASCII DECIMAL to CHAR.
You can create a lookup table by iterating through each of the known ascii chars and storing them in an array where the key is the character and the value is the ascii value of that char.
You can use sprintf() within AWK to get the char for each decimal.
Then you can pass the char to the array to get the corresponding decimal again.
In this example, using awk.
We cycle through all 256 characters, printing out each one.
We split the resulting string into a series of lines where each line has a single character.
We build a table in awk of the 256 characters (in BEGIN), and then feed each of the input characters in to lookup each one.
Finally we then print out the code for each character on the input.
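awk 'BEGIN{
    for (n=0; n<256; n++) print sprintf("%c",n)       # emit every character
}' | awk '{
    for (i=0; ++i <= length($0);)
        printf "%s\n", substr($0, i, 1)               # one character per line
}' | awk 'BEGIN{
    for (n=0; n<256; n++) ord[sprintf("%c",n)] = n    # lookup table: char -> code
}{
    print ord[$1]
}'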
The reverse can also be done, where we lookup a list of character codes.
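awk 'BEGIN{
    for (n=0; n<256; n++) print sprintf("%s",n)       # emit every code
}' | awk 'BEGIN{
    for (n=0; n<256; n++) char[n] = sprintf("%c",n)   # lookup table: code -> char
}{
    print char[$1]
}'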
Note: The second example may print out a lot of garbage in the high ascii range (> 128) depending on the character set you are using.
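For a single value, the two directions can be wrapped up as small helpers. This is a sketch, not a tested library: the names chr and ord are made up, and the table is deliberately limited to codes 1-127 to sidestep the locale issues mentioned in the note.

```shell
# chr: decimal code -> character. n + 0 makes sure %c gets a number.
chr() {
    awk -v n="$1" 'BEGIN { printf "%c\n", n + 0 }'
}

# ord: character -> decimal code, via a lookup table built in BEGIN.
# Assumption: plain ASCII input only (codes 1-127 are tabulated), and no
# backslashes, since -v interprets escape sequences in the value.
ord() {
    awk -v c="$1" 'BEGIN {
        for (n = 1; n < 128; n++) code[sprintf("%c", n)] = n
        print code[c]
    }'
}

chr 67   # prints C
ord a    # prints 97
```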
If as you say at the end of your question you're simply looking to count the frequency of characters, I'd just assemble an array.
$ awk '{for(i=1;i<=length($0);i++) a[substr($0,i,1)]++} END{for(i in a) printf "%d %s\n",a[i],i}' <<<$'aabccc\ndaae'
1 d
1 e
4 a
1 b
3 c
Note that this also supports multi-line input.
We're stepping through each line of input, incrementing a counter that is an array subscript keyed with the character in question.
I would expect this approach to be more performant than applying a regex to count the replacements for every interesting character, but I haven't done any speed comparison tests (and of course it would depend on how large a set you're interested in).
While this answer doesn't address your initial question, I hope it'll provide a better way to approach the problem.
(Thanks for including the final details in your question. XY problems are all too frequent here.)
if you need to encode bytes -> octals in awk, here's a fully self-encapsulated, recursive, and cross-awk compatible octal encoder that I came up with before :
verified on gawk, mawk-1, mawk-2, and nawk,
benchmarked throughput rate of 39.2 MByte/sec
out9: 1.82GiB 0:00:47 [39.2MiB/s] [39.2MiB/s] [ <=> ]
in0: 466MiB 0:00:00 [1.78GiB/s] [1.78GiB/s] [>] 100%
( pvE 0.1 in0 < "${m3l}" | mawk2x ; )
39.91s user 6.94s system 98% cpu 47.656 total
2 78b4c27659ae66e4c98796a60043f1fe stdin
echo "${data}" | awk '{
print octencode_v7($0)
function octencode_v7(______,_,__,___,____,_____,_______) {
if ( ( (_+=_+=_^=_<_\
index(-+log(_<_),"+") ) ) )<=(___=\
? length(______) : match(______,"$")-_)) {
return \
octencode_v7(substr(______,_^=_<_,_=int(___/(_+_)))) \
_______-=gsub("["(!_)"-"(_+(_+=++_+_))"]", "\\"(!_)(_)"&",______)
if (!+(_______-=gsub(___, "\\"(_--^--_+_*_),______) \
- gsub("[[]","\\" ((_^!_)(_)_),______) \
- gsub(/\^/, "\\" ((_^!_)(_)(_+_)),______))) {
return ______
do { ___=_____
do { __=_____
if (+____ || (_____-___)!=_^(_<_)) {
do { _=(____)(___)__
if (+____!=_^(_<_) || ! index(___^___,_____) ||
+__!~"^["(_____^___+___)"]$") {
_______-=gsub(((!+____ && +_____<(___+___)) ||
(+____==_^(_<_) &&
( +___==+_____ ||
(___+____+___)==+_____))) \
? "["(_)"]" : (_), _,______)
} } while(__--)
} } while(___--)
if (!_______) {
return ______
} } while((++____+____)<_____)
return ______
It's basically a triple-nested do-while loop combo to cycle through all the octal codes, without needing any previously made lookup reference strings/arrays
Note: The second example may print out a lot of garbage in the high ascii range (> 128) depending on the character set you are using.
This can be circumvented by using octal codes \200 - \377 for 128-255.
IIRC the bytes C0 C1 F5 F6 F7 F8 F9 FA FB FC FD FE FF shouldn't exist within properly encoded UTF-8 documents (or not yet spec'ed to). FE and FF may overlap with UTF16 byte order mark, but that should hardly be a concern as of today since the world has standardized upon UTF-8.

awk greater than why show string value?

I am using this command
awk '$1 > 3 {print $1}' file;
file :
output this;
Why doesn't the result contain only numbers, as below?
This happens because one side of the comparison is a string, so awk does a string comparison, and the character 'S' has a greater numeric value than the character '3'.
$ printf "3: %d S: %d\n" \'3 \'S
3: 51 S: 83
Note: the ' before the arguments passed to printf are important, as they trigger the conversion to the numeric value in the underlying codeset:
If the leading character is a single-quote or double-quote, the value shall be the numeric value in the underlying codeset of the character following the single-quote or double-quote.
We write \' so that the ' is passed to printf, rather than being interpreted as syntax by the shell (a plain ' would open/close a string literal).
Returning to the question, to get the desired behaviour, you need to convert the first field to a number:
awk '+$1 > 3 { print $1 }' file
I am using the unary plus operator to convert the field to a number. Alternatively, some people prefer to simply add 0.
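The difference shows up on a tiny made-up input. A sketch using the "add 0" variant; the function names and the sample values are assumptions for illustration.

```shell
# The comparison as written in the question: a non-numeric field is
# compared to 3 as a string.
gt3_as_written() { awk '$1 > 3 { print $1 }'; }

# Forcing a numeric comparison by adding 0: "String" becomes 0.
gt3_numeric() { awk '$1 + 0 > 3 { print $1 }'; }

printf '%s\n' 1 5 String 10 | gt3_as_written   # prints 5, String, 10 (one per line)
printf '%s\n' 1 5 String 10 | gt3_numeric      # prints 5, 10 (one per line)
```

Fields that look like numbers (5, 10) are compared numerically in both versions; only the non-numeric field changes behaviour.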
Taken from the awk user guide...
When comparing operands of mixed types, numeric operands are converted
to strings using the value of CONVFMT. ... CONVFMT's default value is
"%.6g", which prints a value with at most six significant digits.
So, basically they are all treated as strings, and "String" happens to be greater than "3".

Comparing hexadecimal values with awk

I'm having trouble with awk and comparing values. Here's a minimal example :
echo "0000e149 0000e152" | awk '{print($1==$2)}'
Which outputs 1 instead of 0. What am I doing wrong ? And how should I do to compare such values ?
To convert a string representing a hex number to a numerical value, you need 2 things: prefix the string with "0x" and use the strtonum() function.
To demonstrate:
echo "0000e149 0000e152" | gawk '{
    print $1, $1+0
    print $2, $2+0
    n1 = strtonum("0x" $1)
    n2 = strtonum("0x" $2)
    print $1, n1
    print $2, n2
}'
0000e149 0
0000e152 0
0000e149 57673
0000e152 57682
We can see that, naively treating the strings as numbers, awk takes their value to be 0. This is because each string reads as a decimal number in exponent notation (0e149 and 0e152), and zero times any power of ten is zero.
Note that strtonum is a GNU awk extension
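For an awk without strtonum(), the conversion can be written by hand. A minimal sketch, assuming the values fit into a double without loss; the name hex2dec is made up and is not a built-in of any awk.

```shell
# Hypothetical helper: convert a hex string (optional 0x prefix) to decimal
# using only portable awk functions.
hex2dec() {
    awk -v h="$1" '
    function hex2dec(s,    i, d, n) {
        s = tolower(s)
        sub(/^0x/, "", s)
        n = 0
        for (i = 1; i <= length(s); i++) {
            d = index("0123456789abcdef", substr(s, i, 1)) - 1
            if (d < 0) return -1          # not a hex digit
            n = n * 16 + d
        }
        return n
    }
    BEGIN { print hex2dec(h) }'
}

hex2dec 0000e149   # prints 57673
hex2dec 0000e152   # prints 57682
```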
You need to convert $1 and $2 to strings in order to enforce alphanumeric comparison. This can be done by simply appending an empty string to them:
echo "0000e149 0000e152" | awk '{print($1""==$2"")}'
Otherwise awk performs a numeric comparison, because both fields look like numbers: 0000e149 and 0000e152 are valid exponent notation (0 times 10^149 and 0 times 10^152), so both convert to 0 and compare as equal. You can verify that using the following command:
echo "0000e149 0000e152" | awk '{print $1+0; print $2+0}'
When using non-decimal data you just need to tell gawk that's what you're doing and specify what base you're using in each number:
$ echo "0xe152 0x0000e152" | awk --non-decimal-data '{print($1==$2)}'
$ echo "0xE152 0x0000e152" | awk --non-decimal-data '{print($1==$2)}'
$ echo "0xe149 0x0000e152" | awk --non-decimal-data '{print($1==$2)}'
I think many forget that the hex digits 0-9, A-F, and a-f are already in rank order in ASCII. Instead of spending time on the conversion, or risking a loss of numeric precision, simply:
trim out leading edge zeros, including the optional 0x / 0X
depending on the input source, also trim out delimiters such as ":" (e.g. IPv6, MAC address), "-" (e.g. UUID), "_" (e.g. "0xffff_ffff_ffff_ffff"), "%" (e.g. URL-encoding) etc
—- be mindful of the need to pad in missing leading zeros for formats that are very flexible with delimiters, such as IPv6
compare their respective string length()s :
if those differ, then one is already distinctly larger,
— otherwise
prefix both with something meaningless like "\1" to guarantee a string-compare operation without risk of either awk being too smart or running into extreme edge cases like locale-specific peculiarities to its collating order :
(("\1") toupper(hex_str_1)) == (("\1") toupper(hex_str_2))
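The equality part of the steps above can be sketched as follows. The helper name hex_equal and the function norm are assumptions for illustration, and delimiter stripping (":", "-", "_") is left out to keep it short.

```shell
# Hypothetical helper: compare two hex strings as normalised strings,
# without converting them to numbers.
hex_equal() {
    awk -v a="$1" -v b="$2" '
    function norm(h) {
        h = toupper(h)
        sub(/^0X/, "", h)      # drop the optional 0x / 0X prefix
        sub(/^0+/, "", h)      # drop leading zeros
        return "\1" h          # the prefix guarantees a string comparison
    }
    BEGIN { print ((norm(a) == norm(b)) ? "equal" : "different") }'
}

hex_equal 0000e149 0000e152     # prints different
hex_equal 0xe152 0x0000E152     # prints equal
```

For ordering rather than equality, the length check described above would come first, followed by a plain string comparison of the normalised values.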