Understanding awk delimiter - escaping in a regex-based field separator - awk

I have the following shell command:
awk -F'\[|\]' '{print $2}'
What is this command doing? Does it split each line into fields, using [sometext] as the delimiter?
E.g.:
$ echo "this [line] passed to awk" | awk -F'\[|\]' '{print $2}'
line
Editor's note: Only Mawk, as used on Ubuntu by default, produces the output above.

The apparent intent is to treat literal [ and ] as field-separator characters, i.e., to split each input record into fields at each occurrence of [ and/or ], which, with the sample line, yields "this " as field 1 ($1), "line" as field 2 ($2), and " passed to awk" as the last field ($3).
This is achieved by a regex (regular expression) that uses alternation (|), either side of which defines a field separator (delimiter): \[ and \] in a regex are needed to represent literal [ and ], because, by default, [ and ] are so-called metacharacters (characters with special syntactical meaning).
Note that awk always interprets the value of the FS variable (-F option) as a regex.
However, the correct form is '\\[|\\]':
$ echo "this [line] passed to awk" | awk -F'\\[|\\]' '{print $2}'
line
That said, a more concise version that uses a character set ([...]) rather than alternation (|) is:
$ echo "this [line] passed to awk" | awk -F'[][]' '{print $2}'
line
Note the careful placement of ] before [ inside the enclosing [...] to make this work, and how the enclosing [...] now have special meaning: they enclose a set of characters, any of which matches.
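As a quick check that the two forms above behave the same, here is a minimal sketch that prints the field count and all three fields for the sample line. The variable names (line, alt, set) are only illustrative.

```shell
line='this [line] passed to awk'

# Alternation form: each backslash is doubled so that one survives string processing.
alt=$(printf '%s\n' "$line" | awk -F'\\[|\\]' '{ print NF ":" $1 "|" $2 "|" $3 }')

# Bracket-expression form: ] comes first inside the set, so no backslashes are needed.
set=$(printf '%s\n' "$line" | awk -F'[][]' '{ print NF ":" $1 "|" $2 "|" $3 }')

printf '%s\n%s\n' "$alt" "$set"
```

Both commands should report three fields, with the surrounding spaces preserved in $1 and $3.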
As for why 2 \ instances are needed in '\\[|\\]':
Taken as a regex in isolation, \[|\] would work:
\[ matches literal [
\] matches literal ]
| is an alternation that matches one or the other.
However, Awk's string processing comes first:
It should, due to \ handling in a string, reduce \[|\] to [|] before interpretation as a regex.
Unfortunately, however, Mawk, the default Awk on Ubuntu, for instance, resorts to guesswork in this particular scenario.[1]
[|], interpreted as a regex, would then only match a single, literal |
Thus, the robust and portable way is to use \\ in a string literal when you mean to pass a single \ as part of a regex.
This quote from the relevant section of the GNU Awk manual sums it up well:
To get a backslash into a regular expression inside a string, you have to type two backslashes.
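To see the string-processing stage in isolation, you can simply print the string literal; what comes out is what the regex engine will later receive. This is only an illustrative sketch, and the variable names fs and seen are made up for it.

```shell
# Print the value that awk stores after processing the string literal.
seen=$(awk 'BEGIN { fs = "\\[|\\]"; print fs }')

# The stored value is the five characters \[|\] : one backslash per bracket survives.
printf '%s\n' "$seen"
```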
[1] Implementation differences:
Unfortunately, at least 1 major Awk implementation resorts to guesswork in the presence of a single \ before a regex metacharacter inside a string literal.
BSD/macOS Awk and GNU Awk act predictably and GNU Awk also issues a helpful warning when a singly \-prefixed regex metacharacter is found:
# GNU Awk: Predictable string-first processing + a helpful warning.
echo 'a[b]|c' | gawk -F'\[|\]' '{print $2}'
gawk: warning: escape sequence '\[' treated as plain '['
gawk: warning: escape sequence '\]' treated as plain ']'
c
# BSD/macOS Awk: Predictable string-first processing, no warning.
echo 'a[b]|c' | awk -F'\[|\]' '{print $2}'
c
# Mawk: *Guesses* that a *regex* was intended.
# The unambiguous form -F'\\[|\\]' works too, fortunately.
echo 'a[b]|c' | mawk -F'\[|\]' '{print $2}'
b
Optional reading: regex literals inside Awk scripts
Awk supports regex literals enclosed in /.../, the use of which bypasses the double-escaping problem.
However:
These literals (which are invariably constant) are only available inside an Awk script,
and, it seems, you can only use them as patterns or function arguments - you cannot store them in a variable.
Therefore, even though /\[|\]/ is in principle equivalent to "\\[|\\]", you cannot use the following, because a regex literal cannot be assigned to the (special) variable FS:
# !! DOES NOT WORK in any of the 3 major Awk implementations.
# Note that nothing is output, and no error/warning is displayed.
$ echo 'a[b]|c' | awk 'BEGIN { FS=/\[|\]/ } { print $2 }'
# Using a double-escaped *string* to house the regex again works as expected:
$ echo 'a[b]|c' | awk 'BEGIN { FS="\\[|\\]" } { print $2 }'
b
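Although a regex literal cannot be stored in FS, it can be passed directly as a function argument. The sketch below, which assumes you are free to split the record yourself, hands /\[|\]/ to split() so that a single backslash per bracket is enough. The array name parts is illustrative.

```shell
# split() accepts a regex literal as its third argument, so no double escaping is needed.
out=$(printf '%s\n' 'a[b]|c' | awk '{ n = split($0, parts, /\[|\]/); print n, parts[2] }')

printf '%s\n' "$out"
```

The record is cut at [ and at ], giving the three pieces a, b and |c.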

Related

Awk: gsub("\\\\", "\\\\") yields surprising results

Consider the following input:
$ cat a
d:\
$ cat a.awk
{ sub("\\", "\\\\"); print $0 }
$ cat a_double.awk
{ sub("\\\\", "\\\\"); print $0 }
Now running cat a | awk -f a.awk gives
d:\
and running cat a | awk -f a_double.awk gives
d:\\
and I expect exactly the other way around. How should I interpret this?
$ awk -V
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)
Yes, this is expected awk behavior. In your first script the first argument of sub() is a string in double quotes rather than a regex in /.../, so it is processed twice: first as a string, then as a regex. To match one literal \ you have to escape it for the regex, and then escape both of those backslashes again for the string, which gives \\\\:
\\ \\
The first 2 characters denote the escaping backslash; the next 2 characters denote the actual literal character \.
That does not happen in your first script, so there is no match and no substitution. In your second awk script the regex argument of sub() is escaped this way, so it matches \ as intended.
Let's see this by example and try putting ... for checking purposes.
When nothing happens, because there is no match:
awk '{sub("\\", "....\\\\"); print $0}' Input_file
d:\
When pattern matching happens:
awk '{sub("\\\\", "...\\\\"); print $0}' Input_file
d:...\\
From man awk:
gsub(r, s [, t])
For each substring matching the regular expression r in the string t,
substitute the string s, and return the number of substitutions.
How can we write the pattern with a single level of escaping (only one \ before the character)? Put the regexp in /.../ as the first argument of sub(); then there is no need to double-escape the \:
awk '{sub(/\\/,"&\\")} 1' Input_file
The first arg to *sub() is a regexp, not a string, so you should use regexp (/.../) rather than string ("...") delimiters. The former is a literal regexp which is used as-is while the latter defines a dynamic (or computed) regexp which forces awk to interpret the string twice, the first time to convert the string to a regexp and the second to use it as a regexp, hence double the backslashes needed for escaping. See https://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps.
In the following we just need to escape the backslash once because we're using a literal, rather than dynamic, regexp:
$ cat a
d:\
$ awk '{sub(/\\/,"\\\\")}1' a
d:\\
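The two ways of writing the regexp can be compared side by side. The sketch below deliberately uses && (the matched text, twice) as the replacement rather than a backslash sequence, on the assumption that you only want to double the backslash; that keeps the replacement side free of escaping questions, so only the first argument differs.

```shell
# Literal regexp: one level of escaping for the backslash.
literal=$(printf '%s\n' 'd:\' | awk '{ sub(/\\/, "&&"); print }')

# Dynamic regexp: the same pattern as a string needs the backslashes doubled again.
dynamic=$(printf '%s\n' 'd:\' | awk '{ sub("\\\\", "&&"); print }')

printf '%s\n%s\n' "$literal" "$dynamic"
```

Both commands should turn the single trailing backslash into two.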
Your first script would rightfully produce a syntax error in a more recent version of gawk (5.1.0) since "\\" in a dynamic regexp is equivalent to /\/ in a literal one and in that expression the backslash is escaping the final forward slash which means there is no final delimiter:
$ cat a.awk
{ sub("\\", "\\\\"); print $0 }
$ awk -f a.awk a
awk: a.awk:1: (FILENAME=a FNR=1) fatal: invalid regexp: Trailing backslash: /\/

Recognising backslash in awk field separator

Input is
AZE D11/879\x0Dabc\x0D\x0A\x1E!DEF F11/999
The awk script sets the field separator to "\x0D" (I have tried with and without escaping the backslash).
awk script is
BEGIN {FS="\\x0D"}
{print NF}
It should output 3 because there are 2 occurrences of the field separator but it outputs 1 which indicates it is not being recognized.
There are 2 ways to provide a regexp in awk: a static regexp (aka regexp literal) written as /regexp/, and a dynamic regexp (aka computed regexp) written as "regexp" and used in a regexp context. A field separator is just a regexp with some additional behavior, so let's just consider regexps in general to explain what's going on in your example.
The split() function takes a field separator (a regexp for our purposes) as its third argument, so it provides a good test bed:
Using a static regexp:
$ awk '{print split($0,a,/\x0D/)}' file
1
The \ above is escaping the x, it's not a literal \. For that you need to escape the \ itself:
$ awk '{print split($0,a,/\\x0D/)}' file
3
What if we used a dynamic regexp instead of the above static regexp?
$ awk '{print split($0,a,"\x0D")}' file
1
$ awk '{print split($0,a,"\\x0D")}' file
1
$ awk '{print split($0,a,"\\\x0D")}' file
' is not a known regexp operator FNR=1) warning: regexp escape sequence `\
1
$ awk '{print split($0,a,"\\\\x0D")}' file
3
The behavior above is because awk first parses the string to convert it into a regexp (using up one layer of escape chars) and then parses it a second time when using it as a regexp (using up a second layer of escape chars).
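The two working forms from the experiments above can be put next to each other. This is a minimal sketch that feeds the sample input through printf instead of a file; the names input, static, dynamic and parts are only illustrative.

```shell
input='AZE D11/879\x0Dabc\x0D\x0A\x1E!DEF F11/999'

# Static regexp: parsed once, so one escaped backslash (\\) matches a literal backslash.
static=$(printf '%s\n' "$input" | awk '{ print split($0, parts, /\\x0D/) }')

# Dynamic regexp: parsed as a string first, then as a regexp, so four backslashes are needed.
dynamic=$(printf '%s\n' "$input" | awk '{ print split($0, parts, "\\\\x0D") }')

printf 'static=%s dynamic=%s\n' "$static" "$dynamic"
```

Each command should count three pieces, since the literal text \x0D occurs twice in the input.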
Unfortunately when you specify a FS there is no option to specify it as a literal regexp, it's always specified using a string and thus is a dynamic regexp and so needs an extra layer of escaping:
$ awk -v FS='\x0D' '{print NF}' file
1
$ awk -v FS='\\x0D' '{print NF}' file
1
$ awk -v FS='\\\x0D' '{print NF}' file
' is not a known regexp operatorence `\
1
$ awk -v FS='\\\\x0D' '{print NF}' file
3
Now - what if you were using the wrong type of quotes in the shell part of the script, i.e. " instead of '? Then you introduce even more pain because now you're inviting the shell to also parse the string even before awk gets to see and parse it twice:
$ awk -v FS="\\\\x0D" '{print NF}' file
1
$ awk -v FS="\\\\\x0D" '{print NF}' file
' is not a known regexp operatorence `\
1
$ awk -v FS="\\\\\\x0D" '{print NF}' file
' is not a known regexp operatorence `\
1
$ awk -v FS="\\\\\\\x0D" '{print NF}' file
3
That's different from the case where the double quotes are used inside awk, because that is all wrapped inside single quotes and so is already protected from the shell:
$ awk 'BEGIN{FS="\\\\x0D"} {print NF}' file
3
So - in the shell always use the most restrictive quotes (' over " over none) unless you have a very specific reason not to, and when using regexps or field separators always use literal /.../ rather than dynamic "...", again unless you have a very specific reason not to.
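If the separator has to come from the shell, one way to drop a layer of escaping is to pass it through the environment: values read from ENVIRON are not processed as awk string literals, so only the regexp layer remains. This is a sketch of that alternative, not something used in the answer above, and SEP is a hypothetical variable name.

```shell
input='AZE D11/879\x0Dabc\x0D\x0A\x1E!DEF F11/999'

# SEP reaches awk unchanged as \\x0D; as a regexp that is a literal backslash followed by x0D.
nf=$(printf '%s\n' "$input" | SEP='\\x0D' awk 'BEGIN { FS = ENVIRON["SEP"] } { print NF }')

printf '%s\n' "$nf"
```

Compared with -v FS='\\\\x0D', only two backslashes are needed here, because the string-literal pass is skipped.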
The odd, truncated-looking error messages above are caused by the \r characters the tool tries to print because of the escape sequence we are providing; they are really all: warning: regexp escape sequence '\^M' is not a known regexp operator
You need two backslashes for a literal backslash since \ is an escape character:
$ echo 'AZE D11/879\x0Dabc\x0D\x0A\x1E!DEF F11/999' |
awk 'BEGIN{ FS="\\\\x0D" } { print NF }'
3

Awk multi character field separator containing caret not working as expected

I have tried multiple google searches, but none of the proposed answers are working for my example below. NF should be 3, but I keep getting 1.
# cat a
1^%2^%3
# awk -F^% '{print NF}' a
1
# awk -F'^%' '{print NF}' a
1
# awk -F "^%" '{print NF}' a
1
The -F option in awk takes a regular expression as its value, so ^ is interpreted as the regex start anchor and has to be stripped of its special meaning. You do that by escaping it with a backslash, which in turn must be doubled because the value is processed as a string first:
awk -F'\\^%' '{ print NF }'
from GNU Awk manual for Escape Sequences
The backslash character itself is another character that cannot be included normally; you must write \\ to put one backslash in the string or regexp. Thus, the string whose contents are the two characters " and \ must be written \"\\.
You need to escape ^ to remove its special meaning, because the field separator is used as a regex. Once you escape it as \\^, it is treated as a literal character, ^% is matched as a plain string, and you get 3 as the answer.
awk -F'\\^%' '{print NF}' Input_file
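As a small check of the escaped separator, the sketch below runs it against the sample line from the question, once via -F and once via an FS assignment in BEGIN, and prints the field count together with the second field. The variable names are only illustrative.

```shell
# Via the -F option: the shell passes \\^% through unchanged, awk reduces it to the regex \^%.
via_flag=$(printf '%s\n' '1^%2^%3' | awk -F'\\^%' '{ print NF, $2 }')

# Via FS in BEGIN: the same doubled backslash inside an awk string literal.
via_begin=$(printf '%s\n' '1^%2^%3' | awk 'BEGIN { FS = "\\^%" } { print NF, $2 }')

printf '%s\n%s\n' "$via_flag" "$via_begin"
```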
Here is an SO answer that serves as a good example for better understanding; it does not discuss the ^ character specifically, but it explains how to use escape sequences in an awk field separator.
https://stackoverflow.com/a/44072825/5866580

Why I can't use as delimiter in awk the string "?B?"

By running the following I am getting as a result the string "utf-8"
I thought that this command would return the string "tralala":
echo "=?utf-8?B?tralala" | awk -F "?B?" '{print $2 }'
Why is that?
What delimiter should I use in order to get the string "tralala" ?
? is a regex metacharacter that means zero or one matches of the preceding atom. (I'm surprised awk didn't complain about the one at the start.)
Try echo "=?utf-8?B?tralala" | awk -F '\\?B\\?' '{print $2 }' instead.
Awk delimiters are NOT strings, they are "Field Separators" (hence the variable named FS) which are a type of Extended Regular Expression with some additional features (e.g. a single blank char as the field separator when not inside square brackets means separate by all chains of contiguous white space and ignore leading and trailing white space on each record).
The differences between a string, a regular expression, and a field separator are very important to be aware of. You sometimes also see the word "pattern" used - do not use that term, as it has no (or too many possible) meanings.
A ? is an RE metacharacter so you need to tell awk not to treat it as such in your case by either of these methods:
$ echo "=?utf-8?B?tralala" | awk -F '[?]B[?]' '{print $2}'
tralala
$ echo "=?utf-8?B?tralala" | awk -F '\\?B\\?' '{print $2}'
tralala
You don't strictly need to do that for the first ?, as its metacharacter functionality is not applicable when it's the first char in an RE:
$ echo "=?utf-8?B?tralala" | awk -F '?B[?]' '{print $2}'
tralala
$ echo "=?utf-8?B?tralala" | awk -F '?B\\?' '{print $2}'
tralala
but IMHO it's best to do it anyway for clarity and future-proofing.
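If what you really want is a plain string delimiter with no regex semantics at all, one option is to skip field splitting and locate the delimiter with index(), which compares strings literally. This is a sketch of that alternative rather than part of the answer above; sep and pos are hypothetical names.

```shell
# index() does a literal string search, so ? needs no escaping here.
out=$(printf '%s\n' '=?utf-8?B?tralala' | awk -v sep='?B?' '{
    pos = index($0, sep)                       # 1-based position of the delimiter, 0 if absent
    if (pos > 0) print substr($0, pos + length(sep))
}')

printf '%s\n' "$out"
```

This prints everything after the first occurrence of the delimiter, which for the sample input is the text the question was after.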

How can I use the symbols [ and ] as field separators for gawk?

I have some text like
CreateMainPageLink("410",$objUserData,$mnt[139]);
from which I want to extract the number 139 after the occurrence of mnt with gawk. I tried the following expression (within a pipe expression to be used on a result of a grep)
gawk '{FS="[\[\]]";print NF}'
to print the number of fields. If my field separators were [ and ] I expect to see the number 3 printed out (three fields; one before the opening rectangular bracket, one after, and the actual number I want to extract). What I get instead is one field, corresponding to the full line, and two warnings:
gawk: warning: escape sequence `\[' treated as plain `['
gawk: warning: escape sequence `\]' treated as plain `]'
I was following the example given here, but obviously there is some problem/error with my expression.
The following two expressions do not work either:
gawk '{FS="[]"}{print NF;}'
gawk: (FILENAME=- FNR=1) fatal: Unmatched [ or [^: /[]/
and
gawk '{FS="\[\]"}{print NF;}'
gawk: warning: escape sequence `\[' treated as plain `['
gawk: warning: escape sequence `\]' treated as plain `]'
gawk: (FILENAME=- FNR=1) fatal: Unmatched [ or [^: /[]/
gawk -F[][] '{ print $0" -> "$1"\t"$2; }'
$ gawk -F[][] '{ print $0" -> "$1"\t"$2; }'
titi[toto]tutu
titi[toto]tutu -> titi toto
1) You must set the FS before entering the main parsing loop. You could do:
awk 'BEGIN { FS="[\\[\\]]"; } { print $0" -> "$1"\t"$2; }'
Which executes the BEGIN clause before parsing the file.
I have to escape the [ character twice: once because it is inside a quoted string, and once more so that gawk takes it literally inside the bracket expression.
I personally prefer to use the -F flag, which is less verbose.
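2) FS="[\[\]]" is wrong because, inside a quoted string, a single backslash only escapes the character for the string and is then dropped. Awk will see [[]], which is not the bracket expression you intended: it is the set [[] followed by a literal ], so it only matches the two-character sequence [].
3) FS="[]" is wrong because a ] placed directly after the opening [ is taken as a literal member of the set, so the bracket expression is never closed (hence the Unmatched [ error).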
4) FS="\[\]" is wrong again because it is error 2) and 3) together :)
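The points above can be tried on the line from the question. The sketch below sets FS in BEGIN and, for contrast, in the main rule; on POSIX-conforming awks an FS assigned in the main rule should only take effect from the next record on, which is why the second command feeds the line twice and looks at the last result. The names sample, early and late are illustrative.

```shell
sample='CreateMainPageLink("410",$objUserData,$mnt[139]);'

# FS set in BEGIN: already in effect when the first record is split.
early=$(printf '%s\n' "$sample" | awk 'BEGIN { FS = "[][]" } { print NF ": " $2 }')

# FS set in the main rule: check the field count of the second record.
late=$(printf '%s\n' "$sample" "$sample" | awk '{ FS = "[][]"; print NF }' | tail -n 1)

printf '%s\n%s\n' "$early" "$late"
```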
The gawk manual says that the regular expressions in awk are a superset of the POSIX specification. This is why you can use either [\\[\\]] or [][], the latter being the POSIX way.
To include a literal ']' in the list, make it the first character
See:
Posix Regex specification:
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_04
Posix awk specification:
http://pubs.opengroup.org/onlinepubs/009695399/utilities/awk.html
Gnu Awk manual:
http://www.gnu.org/software/gawk/manual/gawk.html#Bracket-Expressions
FS="[]": here the ] right after the [ is read as a literal member of the set, so the bracket expression is never closed.
To use square brackets you need to write them like this [][]
This is also wrong: gawk '{FS="[\[\]]";print NF}'. You need to set FS as a variable outside the main action.
Example:
echo 'CreateMainPageLink("410",$objUserData,$mnt[139]);' | awk -F[][] '{print $2}'
139
Or
awk '{print $2}' FS=[][]
Or
awk 'BEGIN {FS="[][]"} {print $2}'
All give 139
Edit: gawk '{FS="[\[\]]";print NF}' prints the number of fields (NF), not the value of the last field ($NF). That would not help anyway: splitting your data on [ and ] leaves ); as the last field, so use awk '{print $(NF-1)}' FS=[][] to get the second-to-last field.
Do you need awk? You can get the value via sed like this:
# echo 'CreateMainPageLink("410",$objUserData,$mnt[139]);' | sed -n 's:.*\[\([0-9]\+\)\].*:\1:p'
139