How can I use the symbols [ and ] as field separators for gawk? - gawk

I have some text like
CreateMainPageLink("410",$objUserData,$mnt[139]);
from which I want to extract the number 139 after the occurrence of mnt with gawk. I tried the following expression (within a pipeline, to be applied to the result of a grep)
gawk '{FS="[\[\]]";print NF}'
to print the number of fields. If my field separators were [ and ], I would expect to see the number 3 printed (three fields: one before the opening square bracket, the actual number I want to extract, and one after the closing bracket). What I get instead is one field, corresponding to the full line, and two warnings:
gawk: warning: escape sequence `\[' treated as plain `['
gawk: warning: escape sequence `\]' treated as plain `]'
I was following the example given here, but obviously there is some problem/error with my expression.
The following two expressions do not work either:
gawk '{FS="[]"}{print NF;}'
gawk: (FILENAME=- FNR=1) fatal: Unmatched [ or [^: /[]/
and
gawk '{FS="\[\]"}{print NF;}'
gawk: warning: escape sequence `\[' treated as plain `['
gawk: warning: escape sequence `\]' treated as plain `]'
gawk: (FILENAME=- FNR=1) fatal: Unmatched [ or [^: /[]/

gawk -F[][] '{ print $0" -> "$1"\t"$2; }'
$ gawk -F[][] '{ print $0" -> "$1"\t"$2; }'
titi[toto]tutu
titi[toto]tutu -> titi toto
1) You must set the FS before entering the main parsing loop. You could do:
awk 'BEGIN { FS="[\\[\\]]"; } { print $0" -> "$1"\t"$2; }'
Which executes the BEGIN clause before parsing the file.
I have to escape the [ character twice: once because it is inside a quoted string, and once more because gawk needs the backslash inside the bracket expression.
I personally prefer to use the -F flag, which is less verbose.
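2) FS="[\[\]]" is wrong because you are inside a quoted string, so the backslashes are consumed by the string itself. Awk will see [[]], which is the bracket expression [[] followed by a literal ], so it only matches the two-character sequence [] and not a single bracket.
3) FS="[]" is wrong because a ] placed right after the opening [ is taken literally, so the bracket expression is never closed (hence the "Unmatched [" error).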
4) FS="\[\]" is wrong again because it is error 2) and 3) together :)
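Point 1) can be checked with a small sketch. The sample input and the shell variable names below are made up for illustration, and the comment on the second command assumes an awk that follows POSIX in applying a changed FS only from the next record on.

```shell
# Two hypothetical sample records that use [ and ] as separators.
input='a[b]c
d[e]f'

# FS assigned in BEGIN: it is in effect before the first record is read.
begin_result=$(printf '%s\n' "$input" | awk 'BEGIN { FS = "[][]" } { print $2 }')

# FS assigned in the main rule: a POSIX-conforming awk has already split the
# current record with the old FS, so the new value only applies to later records.
main_result=$(printf '%s\n' "$input" | awk '{ FS = "[][]"; print NF }')

echo "$begin_result"
echo "$main_result"
```

With gawk, the second command should report one field for the first record and three for the second, which is the behaviour described in the question.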
The gawk manual says: The regular expressions in awk are a superset of the POSIX specification. This is why you can use either [\\[\\]] or [][]. The latter is the POSIX way.
To include a literal ']' in the list, make it the first character
See:
Posix Regex specification:
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_04
Posix awk specification:
http://pubs.opengroup.org/onlinepubs/009695399/utilities/awk.html
Gnu Awk manual:
http://www.gnu.org/software/gawk/manual/gawk.html#Bracket-Expressions

With FS="[]", awk looks for characters inside the [] and there are none.
To use square brackets as separators you need to write them like this: [][]
This is also wrong: gawk '{FS="[\[\]]";print NF}'. You need to set FS outside the main action, either with -F, as a variable assignment on the command line, or in a BEGIN block.
Example:
echo 'CreateMainPageLink("410",$objUserData,$mnt[139]);' | awk -F[][] '{print $2}'
139
Or
awk '{print $2}' FS=[][]
Or
awk 'BEGIN {FS="[][]"} {print $2}'
All give 139
Edit: gawk '{FS="[\[\]]";print NF}' prints the number of fields, NF, and not the value of the last field, $NF. Printing $NF would not help anyway, since splitting your data on [ and ] gives ); as the last field. Use awk '{print $(NF-1)}' FS=[][] to get the second-to-last field.
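The difference between NF, $NF and $(NF-1) can be seen on the line from the question. A minimal sketch; the shell variable names are illustrative, and the separator is quoted here only so that the shell cannot treat [][] as a glob pattern.

```shell
line='CreateMainPageLink("410",$objUserData,$mnt[139]);'

# NF is the field count, $NF the last field, $(NF-1) the one before it.
count=$(printf '%s\n' "$line" | awk -F'[][]' '{ print NF }')
last=$(printf '%s\n' "$line" | awk -F'[][]' '{ print $NF }')
wanted=$(printf '%s\n' "$line" | awk -F'[][]' '{ print $(NF-1) }')

echo "$count / $last / $wanted"
```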

Do you need awk? You can get the value via sed like this:
# echo 'CreateMainPageLink("410",$objUserData,$mnt[139]);' | sed -n 's:.*\[\([0-9]\+\)\].*:\1:p'
139
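The \+ in the command above is a GNU sed extension to basic regular expressions. A sketch of the same extraction that sticks to POSIX basic regular expressions, assuming the same sample line (the variable names are illustrative):

```shell
line='CreateMainPageLink("410",$objUserData,$mnt[139]);'

# [0-9][0-9]* means "one or more digits" without relying on \+.
number=$(printf '%s\n' "$line" | sed -n 's/.*\[\([0-9][0-9]*\)\].*/\1/p')

echo "$number"
```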

Related

awk command works, but not in openwrt's awk

Works here: awk.js.org/
but not in openwrt's awk, which returns the error message:
awk: bad regex '^(server=|address=)[': Missing ']'
Hello everyone!
I'm trying to use an awk command I wrote which is:
'!/^(server=|address=)[/][[:alnum:]][[:alnum:]-.]+([/]|[/]#)$|^#|^\s*$/ {count++}; END {print count+0}'
Which counts invalid lines in a dns blocklist (oisd in this case):
Input would be eg:
server=/0--foodwarez.da.ru/anyaddress.1.1.1
serverspellerror=/0-000.store/
server=/0-24bpautomentes.hu/
server=/0-29.com/
server=/0-day.us/
server=/0.0.0remote.cryptopool.eu/
server=/0.0mail6.xmrminingpro.com/
server=/0.0xun.cryptopool.space/
Output for this should be "2" since there are two lines that don't match the criteria (correctly formed address, comments, or blank lines).
I've tried formatting the command every which way with [], but can't find anything that works. Does anyone have an idea what format/syntax/option needs adjusting?
Thanks!
To portably include - in a bracket expression it has to be the first or last character, otherwise it means a range, and \s is shorthand for [[:space:]] in only some awks. This will work in any POSIX awk:
$ awk '!/^(server=|address=)[/][[:alnum:]][[:alnum:].-]+([/]|[/]#)$|^#|^[[:space:]]*$/ {count++}; END {print count+0}' file
2
Per @tripleee's comment below, if your awk is broken such that a / inside a bracket expression isn't treated as literal, then you may need this instead:
$ awk '!/^(server=|address=)\/[[:alnum:]][[:alnum:].-]+(\/|\/#)$|^#|^[[:space:]]*$/ {count++}; END {print count+0}' file
2
but get a new awk, e.g. GNU awk, as who knows what other surprises the one you're using may have in store for you!
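The rule about where to place - can be tried in isolation. A minimal sketch, with made-up host names and explicit ranges instead of [[:alnum:]] so that it does not depend on character-class support in the awk at hand:

```shell
# Hypothetical sample host names, one per line; the second contains an underscore.
hosts='0-day.us
bad_host
0.0xun.cryptopool.space'

# "-" is the last character of the bracket expression, so it is a literal dash.
valid=$(printf '%s\n' "$hosts" | awk '/^[a-z0-9.-]+$/ { n++ } END { print n+0 }')

echo "$valid"
```

Only bad_host is rejected here, because the underscore is not in the bracket expression; the dashes and dots in the other two names are accepted as literals.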
'!/^(server=|address=)[/][[:alnum:]][[:alnum:]-.]+([/]|[/]#)$|^#|^\s*$/ {count++}; END {print count+0}'
- has special meaning inside [ and ]: it denotes a range, e.g. [A-Z] means an uppercase ASCII letter. Use the \ escape sequence to make it a literal dash. Let the content of file.txt be
server=/0--foodwarez.da.ru/anyaddress.1.1.1
serverspellerror=/0-000.store/
server=/0-24bpautomentes.hu/
server=/0-29.com/
server=/0-day.us/
server=/0.0.0remote.cryptopool.eu/
server=/0.0mail6.xmrminingpro.com/
server=/0.0xun.cryptopool.space/
then
awk '!/^(server=|address=)[/][[:alnum:]][[:alnum:]\-.]+([/]|[/]#)$|^#|^\s*$/ {count++}; END {print count+0}' file.txt
gives output
2
You might also consider replacing \s with [[:space:]] in order to maintain consistency.
(tested in GNU Awk 5.0.1)

What is the function of a comma between arguments in awk?

For this HackerRank bash challenge (round to 3 decimal places), the following solution works well:
$ echo '5+50*3/20 + (19*2)/7' | bc -l | awk '{ printf ("%.3f \n",$1) }'
17.929
whereas the same without a comma between printf's format string and the $1 produces the following error on a bash prompt:
$ echo '5+50*3/20 + (19*2)/7' | bc -l | awk '{ printf ("%.3f \n" $1) }'
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: not enough arguments to satisfy format string
`%.3f
17.92857142857142857142'
^ ran out for this one
The error message suggests that the $1 without comma is not supplied as an argument to printf, but its elision has hitherto not caused me issues (awk '{ print $0 " with appendix." }' happily prints the appended text). Understandably, searching the manual for values separated by commas is not helpful. What is the function of the comma in separating arguments in awk (aside from inserting a space between strings)? Additionally: what are the round brackets doing in the example? For what it's worth, HackerRank gives the following error:
bc -l | awk '{ printf ("%.3f \n" $1) }'
Your Output (stdout)
0.000
17.92857142857142857142
First of all, you don't even need awk to restrict a number to 3 decimal places. bc itself can do that (note that bc truncates rather than rounds, hence 17.928 instead of 17.929):
bc -l <<< 'scale=3; 5+50*3/20 + (19*2)/7'
17.928
Now to the question about printf. The syntax of printf is:
printf format, item1, item2, …
But when you use it like this:
printf ("%.3f \n" $1)
You don't supply enough arguments to satisfy the %.3f format specifier (since "%.3f \n" and $1 are concatenated into a single string), hence you get this error:
not enough arguments to satisfy format string
Putting parentheses around the arguments doesn't make the error go away. The (...) is optional in printf, so it can be either of these 2 statements:
printf "%.3f \n", $1
printf ("%.3f \n", $1)
awk does not have an explicit string concatenation operator. Two strings are concatenated by simply placing them side by side:
print "foo" "bar" # => prints "foobar"
When you omit the comma, you have essentially this:
fmt = "%.3f \n" $1 # the string => "%.3f \n17.92857142857142857142"
printf (fmt)
and there's a %f directive but no value given.
The error message suggests that the $1 without comma is not supplied as an argument to printf, but its elision has hitherto not caused me issues (awk '{ print $0 " with appendix." }' happily prints the appended text)
Yes. Both effects arise from the fact that awk concatenates adjacent strings without any explicit operator. And not only literals. See section 6.2.2 of the manual for details and examples. In the case of your print statement, that produces an effect that serves your purpose, but in the case of your printf call, it means that you are passing only one, concatenated argument to printf, which it interprets as a format string.
When you put a comma between the strings, whether in a print statement or in a printf call, you get a list of two items instead of a single, concatenated string.
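The difference between concatenation and a comma-separated list can be seen side by side. A small sketch; the sample input and the shell variable names are illustrative only:

```shell
# Concatenation: one argument, no separator between the pieces.
joined=$(echo 'a b' | awk '{ print $1 $2 }')

# Comma: a list of two items, which print joins with OFS (a space by default).
listed=$(echo 'a b' | awk '{ print $1, $2 }')

# printf: the comma separates the format string from the value it formats.
rounded=$(echo '17.92857142857142857142' | awk '{ printf "%.3f\n", $1 }')

echo "$joined"
echo "$listed"
echo "$rounded"
```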

Awk: gsub("\\\\", "\\\\") yields surprising results

Consider the following input:
$ cat a
d:\
$ cat a.awk
{ sub("\\", "\\\\"); print $0 }
$ cat a_double.awk
{ sub("\\\\", "\\\\"); print $0 }
Now running cat a | awk -f a.awk gives
d:\
and running cat a | awk -f a_double.awk gives
d:\\
and I expect exactly the other way around. How should I interpret this?
$ awk -V
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)
Yes, this is the expected behavior of awk. In sub("\\", "\\\\") the first argument is written inside " (double quotes) rather than between /.../, so it is processed as a string before it is used as a regex. To match one literal \ we need to escape it for the regex, and then escape each of those backslashes again for the string, hence it becomes \\\\:
\\          \\
|           |
|           |
escaping    the actual literal character \
That is NOT what happens in your first case, hence there is no match and no substitution. In your second awk script you do this (the escaping in the regex-matching argument of sub), hence it matches \ perfectly.
Let's see this by example, putting ... into the replacement for checking purposes.
When nothing happens (no match):
awk '{sub("\\", "....\\\\"); print $0}' Input_file
d:\
When pattern matching happens:
awk '{sub("\\\\", "...\\\\"); print $0}' Input_file
d:...\\
From man awk:
gsub(r, s [, t])
For each substring matching the regular expression r in the string t,
substitute the string s, and return the number of substitutions.
How can we do the escaping so that \ is needed only once before the character? Write the regexp between /.../ as the first argument of sub; then we do NOT need to double-escape \:
awk '{sub(/\\/,"&\\")} 1' Input_file
The first arg to *sub() is a regexp, not a string, so you should use regexp (/.../) rather than string ("...") delimiters. The former is a literal regexp which is used as-is while the latter defines a dynamic (or computed) regexp which forces awk to interpret the string twice, the first time to convert the string to a regexp and the second to use it as a regexp, hence double the backslashes needed for escaping. See https://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps.
In the following we just need to escape the backslash once because we're using a literal, rather than dynamic, regexp:
$ cat a
d:\
$ awk '{sub(/\\/,"\\\\")}1' a
d:\\
Your first script would rightfully produce a syntax error in a more recent version of gawk (5.1.0) since "\\" in a dynamic regexp is equivalent to /\/ in a literal one and in that expression the backslash is escaping the final forward slash which means there is no final delimiter:
$ cat a.awk
{ sub("\\", "\\\\"); print $0 }
$ awk -f a.awk a
awk: a.awk:1: (FILENAME=a FNR=1) fatal: invalid regexp: Trailing backslash: /\/
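The two ways of writing the same regexp can be compared directly. A minimal sketch that swaps backslashes for forward slashes so that only the first argument involves any escaping; the sample path and variable names are made up:

```shell
path='d:\dir\file'

# Regexp literal: one level of escaping, /\\/ matches a single backslash.
literal=$(printf '%s\n' "$path" | awk '{ gsub(/\\/, "/"); print }')

# Dynamic regexp: the string "\\\\" is first reduced to \\ and then used as a regexp.
dynamic=$(printf '%s\n' "$path" | awk '{ gsub("\\\\", "/"); print }')

echo "$literal"
echo "$dynamic"
```

Both commands should print d:/dir/file; the only difference is how many backslashes the source needs.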

Awk multi character field separator containing caret not working as expected

I have tried multiple google searches, but none of the proposed answers are working for my example below. NF should be 3, but I keep getting 1.
# cat a
1^%2^%3
# awk -F^% '{print NF}' a
1
# awk -F'^%' '{print NF}' a
1
# awk -F "^%" '{print NF}' a
1
The -F option in awk takes a regular expression as its value. A ^ in that value is therefore interpreted as the regex anchor and has to be deprived of its special meaning. You do that by escaping it with a literal backslash \ character:
awk -F'\\^%' '{ print NF }'
from GNU Awk manual for Escape Sequences
The backslash character itself is another character that cannot be included normally; you must write \\ to put one backslash in the string or regexp. Thus, the string whose contents are the two characters " and \ must be written \"\\.
You should escape ^ to remove its special meaning, since the field separator is used as a regex. Once you escape ^ by writing \\^, it is treated as a normal/literal character, ^% is then matched as a plain string, and you get 3 as the answer.
awk -F'\\^%' '{print NF}' Input_file
Here is a nice SO link that you could also use as an example for better understanding. It doesn't talk specifically about the ^ character, but it shows how to use escape sequences in awk's field separator.
https://stackoverflow.com/a/44072825/5866580
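The effect of the escaping can be checked on the record from the question. A minimal sketch; the shell variable names are illustrative:

```shell
record='1^%2^%3'

# Unescaped: ^ is an anchor, so the separator never matches inside the record.
unescaped=$(printf '%s\n' "$record" | awk -F'^%' '{ print NF }')

# Escaped: awk reduces \\^ to \^, which the regex engine reads as a literal caret.
escaped=$(printf '%s\n' "$record" | awk -F'\\^%' '{ print NF }')

echo "$unescaped"
echo "$escaped"
```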

Understanding awk delimiter - escaping in a regex-based field separator

I have the following shell command:
awk -F'\[|\]' '{print $2}'
What is this command doing? Does it split into fields using [sometext] as the delimiter?
E.g.:
$ echo "this [line] passed to awk" | awk -F'\[|\]' '{print $2}'
line
Editor's note: Only Mawk, as used on Ubuntu by default, produces the output above.
The apparent intent is to treat literal [ and ] as field-separator characters, i.e., to split each input record into fields by each occurrence of [ and/or ], which, with the sample line, yields "this " as field 1 ($1), "line" as field 2 ($2), and " passed to awk" as the last field ($3).
This is achieved by a regex (regular expression) that uses alternation (|), either side of which defines a field separator (delimiter): \[ and \] in a regex are needed to represent literal [ and ], because, by default, [ and ] are so-called metacharacters (characters with special syntactical meaning).
Note that awk always interprets the value of the FS variable (-F option) as a regex.
However, the correct form is '\\[|\\]':
$ echo "this [line] passed to awk" | awk -F'\\[|\\]' '{print $2}'
line
That said, a more concise version that uses a character set ([...]) rather than alternation (|) is:
$ echo "this [line] passed to awk" | awk -F'[][]' '{print $2}'
line
Note the careful placement of ] before [ inside the enclosing [...] to make this work, and how the enclosing [...] now have special meaning: they enclose a set of characters, any of which matches.
As for why 2 \ instances are needed in '\\[|\\]':
Taken as a regex in isolation, \[|\] would work:
\[ matches literal [
\] matches literal ]
| is an alternation that matches one or the other.
However, Awk's string processing comes first:
It should, due to \ handling in a string, reduce \[|\] to [|] before interpretation as a regex.
Unfortunately, however, Mawk, the default Awk on Ubuntu, for instance, resorts to guesswork in this particular scenario.[1]
[|], interpreted as a regex, would then only match a single, literal |
Thus, the robust and portable way is to use \\ in a string literal when you mean to pass a single \ as part of a regex.
This quote from the relevant section of the GNU Awk manual sums it up well:
To get a backslash into a regular expression inside a string, you have to type two backslashes.
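A quick check that the double-backslash form and the bracket-expression form agree, using an input that also contains a | so that a separator misread as [|] would be noticed. This is only a sketch; the variable names are illustrative:

```shell
sample='a[b]|c'

# Alternation with doubled backslashes: the regex is \[|\] after string processing.
alternation=$(printf '%s\n' "$sample" | awk -F'\\[|\\]' '{ print $2 }')

# Bracket expression: ] first, then [, so both are literal set members.
bracket=$(printf '%s\n' "$sample" | awk -F'[][]' '{ print $2 }')

echo "$alternation"
echo "$bracket"
```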
[1] Implementation differences:
Unfortunately, at least 1 major Awk implementation resorts to guesswork in the presence of a single \ before a regex metacharacter inside a string literal.
BSD/macOS Awk and GNU Awk act predictably and GNU Awk also issues a helpful warning when a singly \-prefixed regex metacharacter is found:
# GNU Awk: Predictable string-first processing + a helpful warning.
echo 'a[b]|c' | gawk -F'\[|\]' '{print $2}'
gawk: warning: escape sequence '\[' treated as plain '['
gawk: warning: escape sequence '\]' treated as plain ']'
c
# BSD/macOS Awk: Predictable string-first processing, no warning.
echo 'a[b]|c' | awk -F'\[|\]' '{print $2}'
c
# Mawk: *Guesses* that a *regex* was intended.
# The unambiguous form -F'\\[|\\]' works too, fortunately.
echo 'a[b]|c' | mawk -F'\[|\]' '{print $2}'
b
Optional reading: regex literals inside Awk scripts
Awk supports regex literals enclosed in /.../, the use of which bypasses the double-escaping problem.
However:
These literals (which are invariably constant) are only available inside an Awk script,
and, it seems, you can only use them as patterns or function arguments - you cannot store them in a variable.
Therefore, even though /\[|\]/ is in principle equivalent to "\\[|\\]", you cannot use the following, because the regex literal cannot be assigned to the (special) variable FS:
# !! DOES NOT WORK in any of the 3 major Awk implementations.
# Note that nothing is output, and no error/warning is displayed.
$ echo 'a[b]|c' | awk 'BEGIN { FS=/\[|\]/ } { print $2 }'
# Using a double-escaped *string* to house the regex again works as expected:
$ echo 'a[b]|c' | awk 'BEGIN { FS="\\[|\\]" } { print $2 }'
b