Field separators - trouble delimiting command characters - awk

I'm trying to parse through HTML source code. In my example I'm just echoing it in, but in practice I'm reading the HTML from a file.
Here is a bit of code that works, syntactically:
echo "<td>Here</td> some dynamic text to ignore <garbage> is a string</table>more junk" |
awk -v FS="(<td>|</td>|<garbage>|</table>)" '{print $2, $4}'
In the FS declaration I create 4 delimiters, which work fine, and I output the 2nd and 4th fields.
However, the 3rd field delimiter I actually need to use contains awk command characters, literally:
')">
such that when I change the above statement to:
echo "<td>Here</td> some dynamic text to ignore ')\"> is a string</table>more junk" |
awk -v FS="(<td>|</td>|')\">|</table>)" '{print $2, $4}'
I've tried escaping one, all, and every combination of the characters in the offending string with a backslash, but nothing is working.

This might be what you're looking for:
$ echo "<td>Here</td> some dynamic text to ignore ')\"> is a string</table>more junk" |
awk -v FS='(<td>|</td>|\047\\)">|</table>)' '{print $2, $4}'
Here is a string
In shell, always include strings (and command line scripts) in single quotes unless you NEED to use double quotes to expose your string's contents to the shell, e.g. to let the shell expand a variable.
Per shell rules you cannot include a single quote within a single-quote-delimited string like 'foo'bar', though (no amount of backslashes will escape that mid-string '), so you need to either jump back out of the single quotes to provide a single quote and then come back in, e.g. 'foo'\''bar', or use the octal escape sequence \047 wherever you want a single quote, e.g. 'foo\047bar' (do not use the hex equivalent as it is error-prone). You then need to escape the ) twice - once for when awk converts the string to a regexp and then again when awk uses it as a regexp.
If you had been using double quotes around the string you'd have needed one additional escape for when shell parsed the string but that's not needed when you surround your string in single quotes since that is blocking the shell from parsing the string.
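For example (both commands are standard shell and awk, not from the original answer), the shell splice and the awk octal escape are interchangeable ways to print a string containing a single quote:
$ printf '%s\n' 'foo'\''bar'
foo'bar
$ awk 'BEGIN{print "foo\047bar"}'
foo'bar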

Related

Attempting to pass an escape char to awk as a variable

I am using this command:
awk -v regex1='new[[:blank:]]+File\(' 'BEGIN{print "Regex1 =", regex1}'
which warns me:
awk: warning: escape sequence `\(' treated as plain `('
which prints:
new[[:blank:]]+File(
I would like the value to be;
new[[:blank:]]+File\(
I've tried amending the command to account for escape chars but always get the same warning.
When you run:
$ awk -v regex1='new[[:blank:]]+File\(' 'BEGIN{print "Regex1 =", regex1}'
awk: warning: escape sequence `\(' treated as plain `('
Regex1 = new[[:blank:]]+File(
you're in shell assigning a string to an awk variable. When you use -v in awk you're asking awk to interpret escape sequences in such an assignment so that \t can become a literal tab char, \n a newline, etc. but the ( in your string has no special meaning when escaped and so \( is exactly the same as (, hence the warning message.
If you want to get a literal \ character you'd need to escape it so that \\ gets interpreted as just \:
$ awk -v regex1='new[[:blank:]]+File\\(' 'BEGIN{print "Regex1 =", regex1}'
Regex1 = new[[:blank:]]+File\(
You seem to be trying to pass a regexp to awk and in my opinion once you get to needing 2 escapes your code is clearer and simpler if you put the target character into a bracket expression instead:
$ awk -v regex1='new[[:blank:]]+File[(]' 'BEGIN{print "Regex1 =", regex1}'
Regex1 = new[[:blank:]]+File[(]
If you want to assign an awk variable the value of a literal string with no interpretation of escape sequences then there are other ways of doing so without using -v, see How do I use shell variables in an awk script?.
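For example, one such way (a minimal sketch, not from the original answer) is to pass the value through the environment and read it with ENVIRON, which performs no escape processing at all:
$ regex1='new[[:blank:]]+File\(' awk 'BEGIN{print "Regex1 =", ENVIRON["regex1"]}'
Regex1 = new[[:blank:]]+File\(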
If you use GNU awk then you can use a regexp literal with @/.../ format and avoid double escaping:
awk -v regex1='@/new[[:blank:]]+File\(/' 'BEGIN{print "Regex1 =", regex1}'
Regex1 = new[[:blank:]]+File\(
I think gawk and mawk 1/2 are also okay with the hideous but fool-proof octal method, like
-v regex1="new[[:blank:]]+File[\\050]" # note the double quotes
once the engine takes out the first \\ layer, the regex being tested against is equivalent to
/new[[:blank:]]+File[\050]/
which is as safe as it gets. The reason it matters is that something like
/new[[:blank:]]+File[\(]/
is something mawk/mawk2 are totally cool with but gawk will give an annoying warning message. octals (or [\x28]) get rid of that cross-awk weirdness and allow the same custom string regex to be deployed across all 3
(haven't tested against less popular variants like BWK original or NAWK etc).
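A quick check of the octal form with gawk (the same dynamic regex should also work in mawk, per the above):
$ echo 'new File(x' | gawk -v regex1="new[[:blank:]]+File[\\050]" '$0 ~ regex1 {print "matched"}'
matched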
PS: since I'm on the subject of octal caveats, mawk/mawk2 and gawk in binary mode are cool with square-bracket octals for all bytes, meaning
"[\\302-\\364][\\200-\\277]+" # this happens to be a *very* rough proxy for UTF-8
is valid for all 3. If you really want to be the hex guy, that same regex becomes
"[\\xC2-\\xF4][\\x80-\\xBF]+"
However, gawk in unicode mode will scream about locale whenever you attempt to put squares around any non-ASCII byte. To circumvent that, you'll have to just list them out with a bunch of ORs like:
(\302|\303|\304.....|\364)(\200|\201......|\277)+
this way you can get gawk unicode mode to handle any arbitrary byte and also handle binary input data (whatever the circumstances caused that to happen), and perform full base64 or URI plus encoding/decoding from within (plus anything else you want, like SHA256 or LZMA etc).... So far I've even managed to get gawk in unicode mode to base64 encode an MP4 file input without gawk spitting out the "illegal multi byte" error message.
.....and also get gawk and mawk in binary modes to become mostly UTF-8 aware and safe.
The "mostly" caveat being I haven't implemented the minute details like directly doing normalization form conversions from within instead of dumping out to python3 and getting results back via getline, or keeping modifier linguistics marks with its intended character if i do a UC-safe-substring string-reversal.

With sed or awk, how to replace all occurrences of string between quotes?

Given a file that looks like:
some text
no replace "text in quotes" no replace
more text
no replace "more text in quotes" no replace
even more text
no replace "even more text in quotes" no replace
etc
what sed or awk script would replace all the e's that are between quotes, and only the e's between quotes, such that something like the following is produced:
some text
no replace "t##$xt in quot##$s" no replace
more text
no replace "mor##$ t##$xt in quot##$s" no replace
even more text
no replace "##$v##$n mor##$ t##$xt in quot##$s" no replace
etc
There can be any number of e's between the quotes.
$ awk 'BEGIN{FS=OFS="\""} {gsub(/e/,"##$",$2)} 1' file
some text
no replace "t##$xt in quot##$s" no replace
more text
no replace "mor##$ t##$xt in quot##$s" no replace
even more text
no replace "##$v##$n mor##$ t##$xt in quot##$s" no replace
etc
Also consider multiple pairs of quotes on a line:
$ echo 'aebec"edeee"fegeh"eieje"kelem' |
awk 'BEGIN{FS=OFS="\""} {gsub(/e/,"##$",$2)} 1'
aebec"##$d##$##$##$"fegeh"eieje"kelem
$ echo 'aebec"edeee"fegeh"eieje"kelem' |
awk 'BEGIN{FS=OFS="\""} {for (i=2;i<=NF;i+=2) gsub(/e/,"##$",$i)} 1'
aebec"##$d##$##$##$"fegeh"##$i##$j##$"kelem
This might work for you (GNU sed):
sed -r ':a;s/^([^"]*("[^"e]*"[^"]*)*"[^"e]*)e/\1##$/;ta' file
This regex looks from the start of line for a series of non-double-quote characters, followed by a possible pair of double quotes with no e's within them, followed by another possible series of non-double-quote characters, followed by a double quote and a possible series of non-double-quotes. If the next character is an e it replaces the match with \1 (which is everything up until the e) and ##$. If the substitution is successful (i.e. ta), the process is repeated until no further substitutions occur.
N.B. This caters for lines with multiple pairs of double quoted strings.
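Running it on the multiple-pair example from the awk answer above should give:
$ echo 'aebec"edeee"fegeh"eieje"kelem' |
sed -r ':a;s/^([^"]*("[^"e]*"[^"]*)*"[^"e]*)e/\1##$/;ta'
aebec"##$d##$##$##$"fegeh"##$i##$j##$"kelem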
sed ':cycle
s/^\(\([^"]*\("[^"]*"\)*\)*"[^"]*\)e/\1##$/
t cycle' YourFile
Posix version.
It works from the front of the line, through the last complete quoted pair, to the first e inside the current quotes.
Note that it will also change any e inside an unclosed quoted string at the end of the line (not present in your sample).

gawk system-command ignoring backslashes

Consider a file with a list of strings,
string1
string2
...
abc\de
When using gawk's system command to execute a shell command, in this case printing the strings,
cat file | gawk '{system("echo " $0)}'
the last string will be formatted to abcde. ($0 denotes the whole record; here that is just the one string.)
Is this a limitation of gawk's system command, not being able to output the gawk variables unformatted?
Expanding on Mussé Redi's answer, observe that, in the following, the backslash does not print:
$ echo 'abc\de' | gawk '{system("echo " $0)}'
abcde
However, here, the backslash will print:
$ echo 'abc\de' | gawk '{system("echo \"" $0 "\"")}'
abc\de
The difference is that the latter command passes $0 to the shell with double-quotes around it. The double-quotes change how the shell processes the backslash.
The exact behavior will change from one shell to another.
To print while avoiding all the shell vagaries, a simple solution is:
$ echo 'abc\de' | gawk '{print $0}'
abc\de
In Bash we use double backslashes to denote an actual backslash. The function of a single backslash is escaping a character. Hence the system command is not formatting at all; Bash is.
The solution for this problem is to write a function in awk that preformats backslashes to double backslashes before passing the string to the system command.
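A minimal sketch of that idea - though rather than doubling backslashes, it's more robust to single-quote $0 for the shell and hand it to printf '%s', so the shell never interprets the contents at all (the gsub handles any embedded single quotes):
awk '{
  q = "\047"                  # a single-quote character
  s = $0
  gsub(q, q "\\\\" q q, s)    # close the quoting, add a backslash-escaped quote, reopen
  system("printf " q "%s\\n" q " " q s q)
}' file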

In awk, how can I use a file containing multiple format strings with printf?

I have a case where I want to use input from a file as the format for printf() in awk. My formatting works when I set it in a string within the code, but it doesn't work when I load it from input.
Here's a tiny example of the problem:
$ # putting the format in a variable works just fine:
$ echo "" | awk -vs="hello:\t%s\n\tfoo" '{printf(s "bar\n", "world");}'
hello: world
foobar
$ # But getting the format from an input file does not.
$ echo "hello:\t%s\n\tfoo" | awk '{s=$0; printf(s "bar\n", "world");}'
hello:\tworld\n\tfoobar
$
So ... format substitutions work ("%s"), but not special characters like tab and newline. Any idea why this is happening? And is there a way to "do something" to input data to make it usable as a format string?
UPDATE #1:
As a further example, consider the following using bash heretext:
[me@here ~]$ awk -vs="hello: %s\nworld: %s\n" '{printf(s, "foo", "bar");}' <<<""
hello: foo
world: bar
[me@here ~]$ awk '{s=$0; printf(s, "foo", "bar");}' <<<"hello: %s\nworld: %s\n"
hello: foo\nworld: bar\n[me@here ~]$
As far as I can see, the same thing happens with multiple different awk interpreters, and I haven't been able to locate any documentation that explains why.
UPDATE #2:
The code I'm trying to replace currently looks something like this, with nested loops in shell. At present, awk is only being used for its printf, and could be replaced with a shell-based printf:
#!/bin/sh
while read -r fmtid fmt; do
while read cid name addy; do
awk -vfmt="$fmt" -vcid="$cid" -vname="$name" -vaddy="$addy" \
'BEGIN{printf(fmt,cid,name,addy)}' > /path/$fmtid/$cid
done < /path/to/sampledata
done < /path/to/fmtstrings
Example input would be:
## fmtstrings:
1 ID:%04d Name:%s\nAddress: %s\n\n
2 CustomerID:\t%-4d\t\tName: %s\n\t\t\t\tAddress: %s\n
3 Customer: %d / %s (%s)\n
## sampledata:
5 Companyname 123 Somewhere Street
12 Othercompany 234 Elsewhere
My hope was that I'd be able to construct something like this to do the entire thing with a single call to awk, instead of having nested loops in shell:
awk '
NR==FNR { fmts[$1]=$2; next; }
{
for(fmtid in fmts) {
outputfile=sprintf("/path/%d/%d", fmtid, custid);
printf(fmts[fmtid], $1, $2) > outputfile;
}
}
' /path/to/fmtstrings /path/to/sampledata
Obviously, this doesn't work, both because of the actual topic of this question and because I haven't yet figured out how to elegantly make awk join $2..$n into a single variable. (But that's the topic of a possible future question.)
FWIW, I'm using FreeBSD 9.2 with its built-in awk, but I'm open to using gawk if a solution can be found with that.
Why so lengthy and complicated an example? This demonstrates the problem:
$ echo "" | awk '{s="a\t%s"; printf s"\n","b"}'
a b
$ echo "a\t%s" | awk '{s=$0; printf s"\n","b"}'
a\tb
In the first case, the string "a\t%s" is a string literal and so is interpreted twice - once when the script is read by awk and then again when it is executed, so the \t is expanded on the first pass and then at execution awk has a literal tab char in the formatting string.
In the second case awk still has the characters backslash and t in the formatting string - hence the different behavior.
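You can see the difference by just counting characters:
$ awk 'BEGIN{s="a\tb"; print length(s)}'
3
$ echo 'a\tb' | awk '{print length($0)}'
4
The string literal holds a real tab (3 characters) while the input line still holds a backslash and a t (4 characters).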
You need something to interpret those escaped chars and one way to do that is to call the shell's printf and read the results (corrected per @EtanReiser's excellent observation that I was using double quotes where I should have had single quotes, implemented here by \047, to avoid shell expansion):
$ echo 'a\t%s' | awk '{"printf \047" $0 "\047 " "b" | getline s; print s}'
a b
If you don't need the result in a variable, you can just call system().
If you just wanted the escape chars expanded so you don't need to provide the %s args in the shell printf call, you'd just need to escape all the %s (watching out for already-escaped %s).
You could call awk instead of the shell printf if you prefer.
Note that this approach, while clumsy, is much safer than calling an eval which might just execute an input line like rm -rf /*.*!
With help from Arnold Robbins (the creator of gawk), and Manuel Collado (another noted awk expert), here is a script which will expand single-character escape sequences:
$ cat tst2.awk
function expandEscapes(old, segs, segNr, escs, idx, new) {
    split(old,segs,/\\./,escs)    # gawk's 4-arg split: escs[] holds each \X separator
    for (segNr=1; segNr in segs; segNr++) {
        if ( idx = index( "abfnrtv", substr(escs[segNr],2,1) ) )    # a known escape letter?
            escs[segNr] = substr("\a\b\f\n\r\t\v", idx, 1)          # swap in the real character
        new = new segs[segNr] escs[segNr]
    }
    return new
}
{
s = expandEscapes($0)
printf s, "foo", "bar"
}
$ awk -f tst2.awk <<<"hello: %s\nworld: %s\n"
hello: foo
world: bar
Alternatively, this should be functionally equivalent but not gawk-specific:
function expandEscapes(tail, head, esc, idx) {
head = ""
while ( match(tail, /\\./) ) {
esc = substr( tail, RSTART + 1, 1 )
head = head substr( tail, 1, RSTART-1 )
tail = substr( tail, RSTART + 2 )
idx = index( "abfnrtv", esc )
if ( idx )
esc = substr( "\a\b\f\n\r\t\v", idx, 1 )
head = head esc
}
return (head tail)
}
If you care to, you can expand the concept to octal and hex escape sequences by changing the split() RE to
/\\(x[0-9a-fA-F]*|[0-7]{1,3}|.)/
and for a hex value after the \\:
c = sprintf("%c", strtonum("0x" rest_of_str))
and for an octal value:
c = sprintf("%c", strtonum("0" rest_of_str))
Since the question explicitly asks for an awk solution, here's one which works on all the awks I know of. It's a proof-of-concept; error handling is abysmal. I've tried to indicate places where that could be improved.
The key, as has been noted by various commentators, is that awk's printf -- like the C standard function it is based on -- does not interpret backslash-escapes in the format string. However, awk does interpret them in command-line assignment arguments.
awk 'BEGIN {if(ARGC!=3)exit(1);
fn=ARGV[2];ARGC=2}
NR==FNR{ARGV[ARGC++]="fmt="substr($0,length($1)+2);
ARGV[ARGC++]="fmtid="$1;
ARGV[ARGC++]=fn;
next}
{match($0,/^ *[^ ]+[ ]+[^ ]+[ ]+/);
printf fmt,$1,$2,substr($0,RLENGTH+1) > ("data/"fmtid"/"$1)
}' fmtfile sampledata
What's going on here is that the NR==FNR clause (which executes only on the first file) adds the values (fmtid, fmt) from each line of the first file as command-line assignments, and then inserts the data file name as a command-line argument. In awk, assignments as command-line arguments are simply executed as though they were assignments from a string constant with implicit quotes, including backslash-escape processing (except that if the last character in the argument is a backslash, it doesn't escape the implicit closing double-quote). This behaviour is mandated by Posix, as is the order in which arguments are processed, which makes it possible to add arguments as you go.
As written, the script must be provided with exactly two arguments: the formats and the data (in that order). There is some room for improvement, obviously.
The snippet also shows two ways of concatenating trailing fields.
In the format file, I assume that the lines are well behaved (no leading spaces; exactly one space after the format id). With those constraints, substr($0, length($1)+2) is precisely the part of the line after the first field and a single space.
Processing the datafile, it may be necessary to do this with fewer constraints. First, the builtin match function is called with the regular expression /^ *[^ ]+[ ]+[^ ]+[ ]+/ which matches leading spaces (if any) and two space-separated fields, along with the following spaces. (It would be better to allow tabs, as well.) Once the regex matches (and matching shouldn't be assumed, so there's another thing to fix), the variables RSTART and RLENGTH are set, so substr($0, RLENGTH+1) picks up everything starting with the third field. (Again, this is all Posix-standard behaviour.)
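That assignment-argument escape processing is easy to verify; /dev/null just gives awk an empty file to read after the assignment has been processed:
$ awk 'END{print length(v)}' v='a\tb' /dev/null
3
The \t arrives as a single tab character, exactly as it would in a string literal.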
Honestly, I'd use the shell printf for this problem, and I don't understand why you feel that solution is somehow sub-optimal. The shell printf interprets backslash escapes in formats, and the shell read -r will do the line splitting the way you want. So there's no reason for awk at all, as far as I can see.
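A sketch of that shell-only version, reusing the question's own paths and field names:
#!/bin/sh
while read -r fmtid fmt; do
  while read -r cid name addy; do
    printf "$fmt" "$cid" "$name" "$addy" > "/path/$fmtid/$cid"
  done < /path/to/sampledata
done < /path/to/fmtstrings
The shell's printf expands the \n and \t escapes in $fmt, and read -r keeps the backslashes in the format file from being eaten on input.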
Ed Morton shows the problem clearly (edit: and it's now complete, so just go accept it): awk's string literal processing handled the escapes, and file I/O code isn't a lexical analyzer.
It's an easy fix: decide what escapes you want to support, and support them. Here's a one-liner form if you're doing special-purpose work that doesn't need to handle escaped backslashes
awk '{ gsub(/\\n/,"\n"); gsub(/\\t/,"\t"); printf($0 "bar\n", "world"); }' <<\EOD
hello:\t%s\n\tfoo
EOD
but for do-it-and-forget-it peace of mind just use the full form in the linked answer.
@Ed Morton's answer explains the problem well.
A simple workaround is to:
pass the format-string file contents via an awk variable, using command substitution,
assuming that file is not too large to be read into memory in full.
Using GNU awk or mawk:
awk -v formats="$(tr '\n' '\3' <fmtStrings)" '
# Initialize: Split the formats into array elements.
BEGIN {n=split(formats, aFormats, "\3")}
# For each data line, loop over all formats and print.
{ for(i=1;i<n;++i) {printf aFormats[i] "\n", $1, $2, $3} }
' sampleData
Note:
The advantage of this solution is that it works generically - you don't need to anticipate specific escape sequences and handle them specially.
On FreeBSD awk, this almost works, but - sadly - split() still splits by newlines, despite being given an explicit separator - this smells like a bug. Observed on versions 20070501 (OS X 10.9.4) and 20121220 (FreeBSD 10.0).
The above solves the core problem (for brevity, it omits stripping the ID from the front of the format strings and omits the output-file creation logic).
Explanation:
tr '\n' '\3' <fmtStrings replaces actual newlines in the format-strings file with \3 (0x3) characters, so as to be able to later distinguish them from the \n escape sequences embedded in the lines, which awk turns into actual newlines when assigning to variable formats (as desired).
\3 (0x3) - the ASCII end-of-text char. - was arbitrarily chosen as an auxiliary separator that is assumed not to be present in the input file.
Note that using \0 (NUL) is NOT an option, because awk interprets that as an empty string, causing split() to split the string into individual characters.
Inside the BEGIN block of the awk script, split(formats, aFormats, "\3") then splits the combined format strings back into individual format strings.
I had to create another answer to start clean. I believe I've come to a good solution, again with perl:
echo '%10s\t:\t%10s\r\n' | perl -lne 's/((?:\\[a-zA-Z\\])+)/qq[qq[$1]]/eeg; printf "$_","hi","hello"'
hi : hello
That bad boy s/((?:\\[a-zA-Z\\])+)/qq[qq[$1]]/eeg will translate any meta character I can think of. Let's take a look with cat -A:
echo '%10s\t:\t%10s\r\n' | perl -lne 's/((?:\\[a-zA-Z\\])+)/qq[qq[$1]]/eeg; printf "$_","hi","hello"' | cat -A
hi^I:^I hello^M$
PS. I didn't create that regex; I googled "unquote meta" and found it here.
What you are trying to do is called templating. I would suggest that shell tools are not the best tools for this job. A safe way to go would be to use a templating library such as Template Toolkit for Perl, or Jinja2 for Python.
The problem lies in the non-interpretation of the special characters \t and \n by echo: it leaves them as as-is strings rather than turning them into tabs and newlines. This behavior can be controlled by the -e flag you give to echo, without changing your awk script at all:
echo -e "hello:\t%s\n\tfoo" | awk '{s=$0; printf(s "bar\n", "world");}'
tada!! :)
EDIT:
Ok, so after the point rightfully raised by Chrono, we can devise this other answer corresponding to the original request to have the pattern read from a file:
echo "hello:\t%s\n\tfoo" > myfile
awk 'BEGIN {s="'$(cat myfile)'" ; printf(s "bar\n", "world")}'
Of course in the above we have to be careful with the quoting, as the $(cat myfile) is not seen by awk but interpreted by the shell.
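An equivalent that avoids splicing the file contents into the middle of the script is to let -v do the work, since -v assignments are exactly where awk expands escape sequences:
awk -v s="$(cat myfile)" 'BEGIN{printf(s "bar\n", "world")}'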
This looks extremely ugly, but it works for this particular problem:
s=$0;
gsub(/'/, "'\\''", s);
gsub(/\\n/, "\\\\\\\\n", s);
"printf '%b' '" s "'" | getline s;
gsub(/\\\\n/, "\n", s);
gsub(/\\n/, "\n", s);
printf(s " bar\n", "world");
Replace all single quotes with shell-escaped single quotes ('\'').
Replace all escaped newline sequences that appear normally as \n with the sequence that appears as \\\\n. It would suffice to use \\\\n as the actual replacement string (meaning \\n would print if you printed it), but the version of gawk I have messes things up in POSIX mode.
Invoke the shell to execute printf '%b' 'escape'\''d format' and use awk's getline statement to retrieve the line.
Unescape \\n to yield a newline. This step wouldn't be necessary if gawk in POSIX mode played nicely.
Unescape \n to yield a newline.
Otherwise you're left to call the gsub function for each possible escape sequence, which is terrible for \001, \002, etc.
Graham,
Ed Morton's solution is the best (and perhaps only) one available.
I'm including this answer for a better explanation of WHY you're seeing what you're seeing.
A string is a string. The confusing part here is WHERE awk does the translation of \t to a tab, \n to a newline, etc. It appears NOT to be the case that the backslash and t get translated when used in a printf format. Instead, the translation happens at assignment, so that awk stores the tab as part of the format rather than translating when it runs the printf.
And this is why Ed's function works. When read from stdin or a file, no assignment is performed that will implement the translation of special characters. Once you run the command s="a\tb"; in awk, you have a three-character string containing no backslash or t.
Evidence:
$ echo "a\tb\n" | awk '{ s=$0; for (i=1;i<=length(s);i++) {printf("%d\t%c\n",i,substr(s,i,1));} }'
1 a
2 \
3 t
4 b
5 \
6 n
vs
$ awk 'BEGIN{s="a\tb\n"; for (i=1;i<=length(s);i++) {printf("%d\t%c\n",i,substr(s,i,1));} }'
1 a
2
3 b
4
And there you go.
As I say, Ed's answer provides an excellent function for what you need. But if you can predict what your input will look like, you can probably get away with a simpler solution. Knowing how this stuff gets parsed, if you have a limited set of characters you need to translate, you may be able to survive with something simple like:
s=$0;
gsub(/\\t/,"\t",s);
gsub(/\\n/,"\n",s);
That's a cool question. I don't know the answer in awk, but in perl you can use eval:
echo '%10s\t:\t%-10s\n' | perl -ne ' chomp; eval "printf (\"$_\", \"hi\", \"hello\")"'
hi : hello
PS. Be aware of the code injection danger when you use eval in any language; it's not just eval, any system call can't be done blindly.
Example in Awk:
echo '$(whoami)' | awk '{"printf \"" $0 "\" " "b" | getline s; print s}'
tiago
What if the input was $(rm -rf /)? You can guess what would happen :)
ikegami adds:
Why would you even think of using eval to convert \n to newlines and \t to tabs?
echo '%10s\t:\t%-10s\n' | perl -e'
my %repl = (
n => "\n",
t => "\t",
);
while (<>) {
chomp;
s{\\(?:(\w)|(\W))}{
if (defined($2)) {
$2
}
elsif (exists($repl{$1})) {
$repl{$1}
}
else {
warn("Unrecognized escape \\$1.\n");
$1
}
}eg;
printf($_, "hi", "hello");
}
'
Short version:
echo '%10s\t:\t%-10s\n' | perl -nle'
s/\\(?:(n)|(t)|(.))/$1?"\n":$2?"\t":$3/seg;
printf($_, "hi", "hello");
'

Escaping separator within double quotes, in awk

I am using awk to parse my data with "," as the separator, since the input is a CSV file. However, there are commas within the data which are escaped by double quotes ("...").
Example
filed1,filed2,field3,"field4,FOO,BAR",field5
How can I ignore the commas within the double quotes so that I can parse the output correctly using awk? I know we can do this in Excel, but how do we do it in awk?
It's easy, with GNU awk 4:
zsh-4.3.12[t]% awk '{
for (i = 0; ++i <= NF;)
printf "field %d => %s\n", i, $i
}' FPAT='([^,]+)|("[^"]+")' infile
field 1 => filed1
field 2 => filed2
field 3 => field3
field 4 => "field4,FOO,BAR"
field 5 => field5
Adding some explanation, as per the OP's request.
From the GNU awk manual on "Defining Fields by Content":
The value of FPAT should be a string that provides a regular
expression. This regular expression describes the contents of each
field. In the case of CSV data as presented above, each field is
either “anything that is not a comma,” or “a double quote, anything
that is not a double quote, and a closing double quote.” If written as
a regular expression constant, we would have /([^,]+)|("[^"]+")/. Writing this as a string
requires us to escape the double quotes, leading to:
FPAT = "([^,]+)|(\"[^\"]+\")"
Because it uses +, this does not work properly for empty fields, but that can be fixed as well:
As written, the regexp used for FPAT requires that each field contain at least one character. A straightforward modification (changing the first ‘+’ to ‘*’) allows fields to be empty:
FPAT = "([^,]*)|(\"[^\"]+\")"
FPAT works when there are newlines and commas inside the quoted fields, but not when there are double quotes, like this:
field1,"field,2","but this field has ""escaped"" quotes"
You can use a simple wrapper program I wrote called csvquote to make data easy for awk to interpret, and then restore the problematic special characters, like this:
csvquote inputfile.csv | awk -F, '{print $4}' | csvquote -u
See https://github.com/dbro/csvquote for code and docs
Fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.
Suppose you only want to print the 4th field:
perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){ @f=$csv->fields(); print "\"$f[3]\"" }' file
The input line is split into array @f
Field 4 is $f[3] since Perl starts indexing at 0
I provided more explanation of Text::CSV_XS within my answer here: parse csv file using gawk