How do awk match and ~ operators work together? - awk

I'm having trouble understanding this awk code:
$0 ~ ENVIRON["search"] {
    match($0, /id=[0-9]+/);
    if (RSTART) {
        print substr($0, RSTART+3, RLENGTH-3)
    }
}
How do the ~ and match() operators interact with each other?
How does the match() have any effect, if its output isn't printed or echoed? What does it actually return or do? How can I use it in my own code?
This is related to Why are $0, ~, &c. used in a way that violates usual bash syntax docs inside an argument to awk?, but that question was centered around understanding the distinction between bash and awk syntaxes, whereas this one is centered around understanding the awk portions of the script.

Taking your questions one at a time:
How do the ~ and match() operators interact with each other?
They don't. At least not directly in your code. ~ is the regexp comparison operator. In the context of $0 ~ ENVIRON["search"] it is being used to test if the regexp contained in the environment variable search exists as part of the current record ($0). If it does then the code in the subsequent {...} block is executed, if it doesn't then it isn't.
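For illustration, here is a minimal hypothetical run (the search value and the input lines are made up) showing the ~ test on its own:
$ printf 'abc id=42\nxyz id=7\n' | search='abc' awk '$0 ~ ENVIRON["search"] { print "matched:", $0 }'
matched: abc id=42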
How does the match() have any effect, if its output isn't printed or
echoed?
It identifies the starting point (and stores it in the awk variable RSTART) and the length (RLENGTH) of the first substring within the first parameter ($0) that matches the regexp provided as the second parameter (id=[0-9]+). With GNU awk it can also populate a 3rd array argument with the segments of the matching string identified by round brackets (aka "capture groups").
What does it actually return or do?
It returns the value of RSTART which is zero if no match was found, 1 or greater otherwise. For what it does see the previous answer.
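For example (the input line here is made up, just to show the variables match() sets):
$ echo 'foo id=42 bar' | awk '{ if (match($0, /id=[0-9]+/)) print RSTART, RLENGTH, substr($0, RSTART, RLENGTH) }'
5 5 id=42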
How can I use it in my own code?
Using it as shown in the example you posted would be one way, but that code would more typically be written as:
($0 ~ ENVIRON["search"]) && match($0,/id=[0-9]+/) {
print substr($0, RSTART+3, RLENGTH-3)
}
and using a string rather than regexp comparison for the first part would probably be even more appropriate:
index($0,ENVIRON["search"]) && match($0,/id=[0-9]+/) {
print substr($0, RSTART+3, RLENGTH-3)
}
Get the book Effective Awk Programming, 4th Edition, by Arnold Robbins to learn how to use awk.

use the regex id=[0-9]+ to find a match in each line
if the start position of the match (RSTART) is not 0 then:
print the match without the id=
this is shorter but does the same:
xinput --list | grep -Po 'id=[0-9]+' | cut -c4-

Related

what is the meaning of a[FNR]=a[FNR]?a[FNR]","$0:$0 in awk?

I know the ? is for something like
(condition) ? statement-1: statement-2
but in this case I am not understanding how a[FNR]=a[FNR] works. When is the condition false?
awk '{a[FNR]=a[FNR] ? a[FNR]","$0 : $0} END{for(i=1;i<=FNR;i++)print a[i]}' *.csv
a[FNR]=a[FNR] ? a[FNR]","$0 : $0
Here ? : is the ternary (conditional) operator, and the condition is simply a[FNR], where a is an associative array.
It means: if a[FNR] is non-empty and non-zero, then set a[FNR] = a[FNR] "," $0; otherwise set a[FNR] = $0.
In other words it is equivalent of:
if (a[FNR]) {
a[FNR] = a[FNR] "," $0
} else {
a[FNR] = $0
}
The correct approach, as Ed rightly suggests in the comments, is to write it this way:
a[FNR] = (FNR in a ? a[FNR] "," : "") $0
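To see why the FNR in a test matters, consider a hypothetical first file whose only line is 0 (the file names f1 and f2 are made up for this sketch):
$ printf '0\n' > f1; printf 'x\n' > f2
$ awk '{a[FNR]=a[FNR] ? a[FNR]","$0 : $0} END{print a[1]}' f1 f2
x
$ awk '{a[FNR]=(FNR in a ? a[FNR] "," : "") $0} END{print a[1]}' f1 f2
0,x
With the truthiness test the 0 from the first file is silently dropped; with FNR in a it is kept.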
The following annotated version may help explain what is going on; it is for explanation purposes only.
awk '                                   ##Start of the awk program.
{
  a[FNR]=a[FNR] ? a[FNR]","$0 : $0      ##Build array a, indexed by the current line number FNR.
                                        ##Values from the same line number of each input file are
                                        ##appended to the same element, separated by commas.
}
END{                                    ##END block, run after all input has been read.
  for(i=1;i<=FNR;i++){                  ##Loop from 1 up to the final value of FNR.
    print a[i]                          ##Print the array element for index i.
  }
}' *.csv                                ##Pass all csv files as input.
Why do we usually write it this way? Because we are passing a lot of .csv files to the awk program and values from the same line number of each file need to be stored in the same array element, hence the concatenation. ? and : form the ternary operator; its format is condition ? value-when-true : value-when-false. Basically, if the condition is TRUE the value after ? is assigned; if the condition is false the value after : is assigned.
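For illustration, a hypothetical run with two small files (the names and contents are made up):
$ cat f1.csv
a1
a2
$ cat f2.csv
b1
b2
$ awk '{a[FNR]=a[FNR] ? a[FNR]","$0 : $0} END{for(i=1;i<=FNR;i++)print a[i]}' f1.csv f2.csv
a1,b1
a2,b2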
The missing parentheses around the conditional operator certainly cause a lot of confusion; compare with a properly parenthesized version:
$ awk '
BEGIN {
a="a" # some values for a and b
b="b"
a=(a ? a "," b : b) # properly parenthesized ternary operator
print a
}'
Some test cases for varying values of a (or a[FNR] in your sample) to consider:
$ awk 'BEGIN{a="a";b="b";a=a?a","b:b;print a}'
a,b
$ awk 'BEGIN{a="";b="b";a=a?a","b:b;print a}'
b
$ awk 'BEGIN{a="0";b="b";a=a?a","b:b;print a}'
0,b
$ awk 'BEGIN{a=0;b="b";a=a?a","b:b;print a}'
b
Basically, a comma and the value of b get appended to a when a is non-empty and not numerically zero.
PS. You should probably also check the value of b before appending the comma (left as an exercise); nobody would like the output:
a,
The relevant piece of documentation for understanding the effect of
a[FNR]=a[FNR] ? a[FNR]","$0 : $0
is Operator Precedence (How Operators Nest), which states which operators have higher precedence and therefore where ( and ) can be added without changing the logic. In your example you have assignment, concatenation and the ternary operator. String concatenation has the highest precedence of the three, the ternary operator is intermediate, and assignment is lowest, so after applying that:
a[FNR]=(a[FNR] ? (a[FNR]","$0) : $0)
The documentation encourages code writers to use ( and ) to avoid possible confusion:
it is wise to always use parentheses whenever there is an unusual
combination of operators, because other people who read the program
may not remember what the precedence is in this case. Even experienced
programmers occasionally forget the exact rules, which leads to
mistakes. Explicit parentheses help prevent any such mistakes.
Note that the need to define operator precedence is not limited to AWK; it applies to all languages using infix notation, i.e. most commonly used programming languages (languages using Reverse Polish Notation, such as FORTH, do not have this requirement).

AWK script, linefeed under Windows causing different function

I have a simple AWK script which I am trying to execute under Windows, using GNU AWK 3.1.6.
The awk script is run with awk -f script.awk f1 f2 under Windows 10.
After spending almost half a day debugging, I came to find that the following two scenarios produce different results:
FNR==NR{
a[$0]++;cnt[1]+=1;next
}
!a[$0]
versus
FNR==NR
{
a[$0]++;cnt[1]+=1;next
}
!a[$0]
The difference, of course, is the linefeed after line 1.
It puzzles me because I don't recall reading anywhere that awk is sensitive to linefeeds; other linefeeds in the script don't matter.
In example one the desired result is achieved. Example 2 prints f1, which is not desired.
So I made it work, but would like to know why
From the docs (https://www.gnu.org/software/gawk/manual/html_node/Statements_002fLines.html)
awk is a line-oriented language. Each rule’s action has to begin on
the same line as the pattern. To have the pattern and action on
separate lines, you must use backslash continuation; there is no other
option.
Note that the action only has to begin on the same line as the pattern. After that as we're all aware it can be spread over multiple lines, though not willy-nilly. From the same page in the docs:
However, gawk ignores newlines after any of the following symbols and
keywords:
, { ? : || && do else
In Example 2, since there is no action beginning on the same line as the FNR == NR pattern, the default action of printing the line is performed when that statement is true (which it is for all and only f1). Similarly in that example, the action block is not paired with any preceding pattern on its same line, so it is executed for every record (though there's no visible result for that).
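If you really want the opening brace on its own line, the backslash continuation the quoted docs mention keeps the pattern and action paired; a minimal sketch:
FNR==NR \
{
a[$0]++;cnt[1]+=1;next
}
!a[$0]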

Understanding syntax for print multiple lines after pattern match

To print multiple (here, two) lines following the pattern using awk:
I have found somewhere the following solution
$ awk '/Linux/{x=NR+2}(NR<=x){print}' file
Linux
Solaris
Aix
I am trying to understand the syntax
Generally awk syntax is
awk 'pattern{action}' file
Here we find
pattern = /Linux/
action = {x=NR+2}
then what is (NR<=x){print}
Solution:
My understanding of the C-like syntax for this is:
While read (file, line)
{
    if (line ~ /pattern/) then
    {
        x = NR+2
    }
    if (NR <= x)
    {
        print
    }
}
For NR=1, if (line ~ /pattern/) then x is set to NR+2, e.g. 1+2 = 3. This value will not be reset until the whole input has been processed. So when the next line is read and !(line ~ /pattern/), x is still 3, and (NR (2) <= 3) is true, so it prints that line too.
Thanks to @EdMorton for the explanation.
FWIW I wouldn't write the code you're asking about, instead I'd write:
awk '/Linux/{c=3} c&&c--' file
See example "g" at https://stackoverflow.com/a/17914105/1745001.
Having said that, your original code in C-like syntax would be:
NR=0
x=0
While read (file,line)
{
NR++
if (line ~ "Linux") {
x = NR+2
}
if (NR <= x) {
print
}
}
Btw, I know it's frequently mis-used but don't use the word "pattern" in your software as it's highly ambiguous - use string or regexp or condition (or in shell but not awk, sed, grep, etc. and only where appropriate "globbing pattern"), whichever it is you really mean.
For example you wrote that awk syntax is:
awk 'pattern{action}' file
No. Or maybe, depending on what you think "pattern" means! Despite what many books, tutorials, etc. say, to remove any ambiguity you should simply think of awk syntax as:
awk 'condition{action}' file
where condition can be any of:
a key word like BEGIN or END
an arithmetic expression like var < 7 or NF or 1
a regexp comparison like $0 ~ "foo" or $0 ~ /foo/ or /foo/ or $0 ~ var or match($0,/foo/)
a string comparison like $0 == "foo" or index($0,"foo")
nothing at all in which case it's assumed to be true when there's an associated action block.
and probably other things I'm forgetting to list.
Your script has two blocks:
$ awk '/Linux/ {x=NR+2}
NR<=x {print}' file
The first block sets the variable x; the second uses it to select the lines to print. Note that you can drop {print}, since it's the default action.

In awk, how can I use a file containing multiple format strings with printf?

I have a case where I want to use input from a file as the format for printf() in awk. My formatting works when I set it in a string within the code, but it doesn't work when I load it from input.
Here's a tiny example of the problem:
$ # putting the format in a variable works just fine:
$ echo "" | awk -vs="hello:\t%s\n\tfoo" '{printf(s "bar\n", "world");}'
hello: world
foobar
$ # But getting the format from an input file does not.
$ echo "hello:\t%s\n\tfoo" | awk '{s=$0; printf(s "bar\n", "world");}'
hello:\tworld\n\tfoobar
$
So ... format substitutions work ("%s"), but not special characters like tab and newline. Any idea why this is happening? And is there a way to "do something" to input data to make it usable as a format string?
UPDATE #1:
As a further example, consider the following using bash heretext:
[me#here ~]$ awk -vs="hello: %s\nworld: %s\n" '{printf(s, "foo", "bar");}' <<<""
hello: foo
world: bar
[me#here ~]$ awk '{s=$0; printf(s, "foo", "bar");}' <<<"hello: %s\nworld: %s\n"
hello: foo\nworld: bar\n[me#here ~]$
As far as I can see, the same thing happens with multiple different awk interpreters, and I haven't been able to locate any documentation that explains why.
UPDATE #2:
The code I'm trying to replace currently looks something like this, with nested loops in shell. At present, awk is only being used for its printf, and could be replaced with a shell-based printf:
#!/bin/sh
while read -r fmtid fmt; do
while read cid name addy; do
awk -vfmt="$fmt" -vcid="$cid" -vname="$name" -vaddy="$addy" \
'BEGIN{printf(fmt,cid,name,addy)}' > /path/$fmtid/$cid
done < /path/to/sampledata
done < /path/to/fmtstrings
Example input would be:
## fmtstrings:
1 ID:%04d Name:%s\nAddress: %s\n\n
2 CustomerID:\t%-4d\t\tName: %s\n\t\t\t\tAddress: %s\n
3 Customer: %d / %s (%s)\n
## sampledata:
5 Companyname 123 Somewhere Street
12 Othercompany 234 Elsewhere
My hope was that I'd be able to construct something like this to do the entire thing with a single call to awk, instead of having nested loops in shell:
awk '
NR==FNR { fmts[$1]=$2; next; }
{
for(fmtid in fmts) {
outputfile=sprintf("/path/%d/%d", fmtid, custid);
printf(fmts[fmtid], $1, $2) > outputfile;
}
}
' /path/to/fmtstrings /path/to/sampledata
Obviously, this doesn't work, both because of the actual topic of this question and because I haven't yet figured out how to elegantly make awk join $2..$n into a single variable. (But that's the topic of a possible future question.)
FWIW, I'm using FreeBSD 9.2 with its built-in awk, but I'm open to using gawk if a solution can be found with that.
Why so lengthy and complicated an example? This demonstrates the problem:
$ echo "" | awk '{s="a\t%s"; printf s"\n","b"}'
a b
$ echo "a\t%s" | awk '{s=$0; printf s"\n","b"}'
a\tb
In the first case, the string "a\t%s" is a string literal and so is interpreted twice - once when the script is read by awk and then again when it is executed, so the \t is expanded on the first pass and then at execution awk has a literal tab char in the formatting string.
In the second case awk still has the characters backslash and t in the formatting string - hence the different behavior.
You need something to interpret those escaped chars and one way to do that is to call the shell's printf and read the results (corrected per #EtanReiser's excellent observation that I was using double quotes where I should have had single quotes, implemented here by \047, to avoid shell expansion):
$ echo 'a\t%s' | awk '{"printf \047" $0 "\047 " "b" | getline s; print s}'
a b
If you don't need the result in a variable, you can just call system().
If you just wanted the escape chars expanded so you don't need to provide the %s args in the shell printf call, you'd just need to escape all the %s (watching out for already-escaped %s).
You could call awk instead of the shell printf if you prefer.
Note that this approach, while clumsy, is much safer than calling an eval which might just execute an input line like rm -rf /*.*!
With help from Arnold Robbins (the creator of gawk), and Manuel Collado (another noted awk expert), here is a script which will expand single-character escape sequences:
$ cat tst2.awk
function expandEscapes(old, segs, segNr, escs, idx, new) {
split(old,segs,/\\./,escs)
for (segNr=1; segNr in segs; segNr++) {
if ( idx = index( "abfnrtv", substr(escs[segNr],2,1) ) )
escs[segNr] = substr("\a\b\f\n\r\t\v", idx, 1)
new = new segs[segNr] escs[segNr]
}
return new
}
{
s = expandEscapes($0)
printf s, "foo", "bar"
}
$ awk -f tst2.awk <<<"hello: %s\nworld: %s\n"
hello: foo
world: bar
Alternatively, this should be functionally equivalent but not gawk-specific:
function expandEscapes(tail, head, esc, idx) {
head = ""
while ( match(tail, /\\./) ) {
esc = substr( tail, RSTART + 1, 1 )
head = head substr( tail, 1, RSTART-1 )
tail = substr( tail, RSTART + 2 )
idx = index( "abfnrtv", esc )
if ( idx )
esc = substr( "\a\b\f\n\r\t\v", idx, 1 )
head = head esc
}
return (head tail)
}
If you care to, you can expand the concept to octal and hex escape sequences by changing the split() RE to
/\\(x[0-9a-fA-F]*|[0-7]{1,3}|.)/
and for a hex value after the \\:
c = sprintf("%c", strtonum("0x" rest_of_str))
and for an octal value:
c = sprintf("%c", strtonum("0" rest_of_str))
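Putting those pieces together, here is an untested sketch (gawk-only, since it relies on split()'s fourth argument and strtonum(); the function name and details are illustrative, not part of the original answer):
function expandEscapesFull(old,    segs, escs, segNr, rest, idx, c, new) {
    split(old, segs, /\\(x[0-9a-fA-F]*|[0-7]{1,3}|.)/, escs)
    for (segNr = 1; segNr in segs; segNr++) {
        c = ""
        if (segNr in escs) {
            rest = substr(escs[segNr], 2)           # text after the backslash
            if (rest ~ /^x/)                        # hex escape, e.g. \x41
                c = sprintf("%c", strtonum("0" rest))
            else if (rest ~ /^[0-7]/)               # octal escape, e.g. \101
                c = sprintf("%c", strtonum("0" rest))
            else if (idx = index("abfnrtv", rest))  # single-character escape
                c = substr("\a\b\f\n\r\t\v", idx, 1)
            else                                    # unknown escape: keep the escaped character
                c = rest
        }
        new = new segs[segNr] c
    }
    return new
}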
Since the question explicitly asks for an awk solution, here's one which works on all the awks I know of. It's a proof-of-concept; error handling is abysmal. I've tried to indicate places where that could be improved.
The key, as has been noted by various commentators, is that awk's printf -- like the C standard function it is based on -- does not interpret backslash-escapes in the format string. However, awk does interpret them in command-line assignment arguments.
awk 'BEGIN {if(ARGC!=3)exit(1);
fn=ARGV[2];ARGC=2}
NR==FNR{ARGV[ARGC++]="fmt="substr($0,length($1)+2);
ARGV[ARGC++]="fmtid="$1;
ARGV[ARGC++]=fn;
next}
{match($0,/^ *[^ ]+[ ]+[^ ]+[ ]+/);
printf fmt,$1,$2,substr($0,RLENGTH+1) > ("data/"fmtid"/"$1)
}' fmtfile sampledata
What's going on here is that the 'FNR==NR' clause (which executes only on the first file) adds the values (fmtid, fmt) from each line of the first file as command-line assignments, and then inserts the data file name as a command-line argument. In awk, assignments as command line arguments are simply executed as though they were assignments from a string constant with implicit quotes, including backslash-escape processing (except that if the last character in the argument is a backslash, it doesn't escape the implicit closing double-quote). This behaviour is mandated by Posix, as is the order in which arguments are processed which makes it possible to add arguments as you go.
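A tiny illustration of that assignment behaviour (the data file name is made up):
$ printf 'one line\n' > data
$ awk '{ print v }' v='a\tb' data
a	b
The v='a\tb' operand is processed when awk would have opened it as a file, i.e. before reading data, and the \t is expanded to a real tab at that point.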
As written, the script must be provided with exactly two arguments: the formats and the data (in that order). There is some room for improvement, obviously.
The snippet also shows two ways of concatenating trailing fields.
In the format file, I assume that the lines are well behaved (no leading spaces; exactly one space after the format id). With those constraints, substr($0, length($1)+2) is precisely the part of the line after the first field and a single space.
Processing the datafile, it may be necessary to do this with fewer constraints. First, the builtin match function is called with the regular expression /^ *[^ ]+[ ]+[^ ]+[ ]+/ which matches leading spaces (if any) and two space-separated fields, along with the following spaces. (It would be better to allow tabs, as well.) Once the regex matches (and matching shouldn't be assumed, so there's another thing to fix), the variables RSTART and RLENGTH are set, so substr($0, RLENGTH+1) picks up everything starting with the third field. (Again, this is all Posix-standard behaviour.)
Honestly, I'd use the shell printf for this problem, and I don't understand why you feel that solution is somehow sub-optimal. The shell printf interprets backslash escapes in formats, and the shell read -r will do the line splitting the way you want. So there's no reason for awk at all, as far as I can see.
Ed Morton shows the problem clearly (edit: and it's now complete, so just go accept it): awk's string literal processing handled the escapes, and file I/O code isn't a lexical analyzer.
It's an easy fix: decide what escapes you want to support, and support them. Here's a one-liner form if you're doing special-purpose work that doesn't need to handle escaped backslashes
awk '{ gsub(/\\n/,"\n"); gsub(/\\t/,"\t"); printf($0 "bar\n", "world"); }' <<\EOD
hello:\t%s\n\tfoo
EOD
but for doit-and-forgetit peace of mind just use the full form in the linked answer.
#Ed Morton's answer explains the problem well.
A simple workaround is to:
pass the format-string file contents via an awk variable, using command substitution,
assuming that file is not too large to be read into memory in full.
Using GNU awk or mawk:
awk -v formats="$(tr '\n' '\3' <fmtStrings)" '
# Initialize: Split the formats into array elements.
BEGIN {n=split(formats, aFormats, "\3")}
# For each data line, loop over all formats and print.
{ for(i=1;i<n;++i) {printf aFormats[i] "\n", $1, $2, $3} }
' sampleData
Note:
The advantage of this solution is that it works generically - you don't need to anticipate specific escape sequences and handle them specially.
On FreeBSD awk, this almost works, but - sadly - split() still splits by newlines, despite being given an explicit separator - this smells like a bug. Observed on versions 20070501 (OS X 10.9.4) and 20121220 (FreeBSD 10.0).
The above solves the core problem (for brevity, it omits stripping the ID from the front of the format strings and omits the output-file creation logic).
Explanation:
tr '\n' '\3' <fmtStrings replaces actual newlines in the format-strings file with \3 (0x3) characters, so as to be able to later distinguish them from the \n escape sequences embedded in the lines, which awk turns into actual newlines when assigning to variable formats (as desired).
\3 (0x3) - the ASCII end-of-text char. - was arbitrarily chosen as an auxiliary separator that is assumed not to be present in the input file.
Note that using \0 (NUL) is NOT an option, because awk interprets that as an empty string, causing split() to split the string into individual characters.
Inside the BEGIN block of the awk script, split(formats, aFormats, "\3") then splits the combined format strings back into individual format strings.
I had to create another answer to start clean. I believe I've come to a good solution, again with Perl:
echo '%10s\t:\t%10s\r\n' | perl -lne 's/((?:\\[a-zA-Z\\])+)/qq[qq[$1]]/eeg; printf "$_","hi","hello"'
hi : hello
That bad boy s/((?:\\[a-zA-Z\\])+)/qq[qq[$1]]/eeg will translate any meta character I can think of, let us take a look with cat -A :
echo '%10s\t:\t%10s\r\n' | perl -lne 's/((?:\\[a-zA-Z\\])+)/qq[qq[$1]]/eeg; printf "$_","hi","hello"' | cat -A
hi^I:^I hello^M$
PS. I didn't create that regex; I googled "unquote meta" and found it here.
What you are trying to do is called templating. I would suggest that shell tools are not the best tools for this job. A safe way to go would be to use a templating library such as Template Toolkit for Perl, or Jinja2 for Python.
The problem lies in echo not interpreting the special characters \t and \n: it passes them through as-is, rather than as a tab and a newline. This behavior can be controlled with the -e flag you give to echo, without changing your awk script at all:
echo -e "hello:\t%s\n\tfoo" | awk '{s=$0; printf(s "bar\n", "world");}'
tada!! :)
EDIT:
Ok, so after the point rightfully raised by Chrono, we can devise this other answer corresponding to the original request to have the pattern read from a file:
echo "hello:\t%s\n\tfoo" > myfile
awk 'BEGIN {s="'$(cat myfile)'" ; printf(s "bar\n", "world")}'
Of course in the above we have to be careful with the quoting, as the $(cat myfile) is not seen by awk but interpreted by the shell.
This looks extremely ugly, but it works for this particular problem:
s=$0;
gsub(/'/, "'\\''", s);
gsub(/\\n/, "\\\\\\\\n", s);
"printf '%b' '" s "'" | getline s;
gsub(/\\\\n/, "\n", s);
gsub(/\\n/, "\n", s);
printf(s " bar\n", "world");
Replace all single quotes with shell-escaped single quotes ('\'').
Replace all escaped newline sequences that appear normally as \n with the sequence that appears as \\\\n. It would suffice to use \\\\n as the actual replacement string (meaning \\n would print if you printed it), but the version of gawk I have messes things up in POSIX mode.
Invoke the shell to execute printf '%b' 'escape'\''d format' and use awk's getline statement to retrieve the line.
Unescape \\n to yield a newline. This step wouldn't be necessary if gawk in POSIX mode played nicely.
Unescape \n to yield a newline.
Otherwise you're left to call the gsub function for each possible escape sequence, which is terrible for \001, \002, etc.
Graham,
Ed Morton's solution is the best (and perhaps only) one available.
I'm including this answer for a better explanation of WHY you're seeing what you're seeing.
A string is a string. The confusing part here is WHERE awk does the translation of \t to a tab, \n to a newline, etc. It appears NOT to be the case that the backslash and t get translated when used in a printf format. Instead, the translation happens at assignment, so that awk stores the tab as part of the format rather than translating when it runs the printf.
And this is why Ed's function works. When read from stdin or a file, no assignment is performed that will implement the translation of special characters. Once you run the command s="a\tb"; in awk, you have a three character string containing no backslash or t.
Evidence:
$ echo "a\tb\n" | awk '{ s=$0; for (i=1;i<=length(s);i++) {printf("%d\t%c\n",i,substr(s,i,1));} }'
1 a
2 \
3 t
4 b
5 \
6 n
vs
$ awk 'BEGIN{s="a\tb\n"; for (i=1;i<=length(s);i++) {printf("%d\t%c\n",i,substr(s,i,1));} }'
1 a
2
3 b
4
And there you go.
As I say, Ed's answer provides an excellent function for what you need. But if you can predict what your input will look like, you can probably get away with a simpler solution. Knowing how this stuff gets parsed, if you have a limited set of characters you need to translate, you may be able to survive with something simple like:
s=$0;
gsub(/\\t/,"\t",s);
gsub(/\\n/,"\n",s);
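For example, applied to the format string from the question (the output contains a literal tab):
$ echo 'hello:\t%s\n\tfoo' | awk '{ s=$0; gsub(/\\t/,"\t",s); gsub(/\\n/,"\n",s); printf(s "bar\n", "world") }'
hello:	world
	foobar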
That's a cool question, I don't know the answer in awk, but in perl you can use eval :
echo '%10s\t:\t%-10s\n' | perl -ne ' chomp; eval "printf (\"$_\", \"hi\", \"hello\")"'
hi : hello
PS. Be aware of the danger of code injection when you use eval in any language; and not just eval, any call out to the system cannot be done blindly.
Example in Awk:
echo '$(whoami)' | awk '{"printf \"" $0 "\" " "b" | getline s; print s}'
tiago
What if the input was $(rm -rf /)? You can guess what would happen :)
ikegami adds:
Why would even think of using eval to convert \n to newlines and \t to tabs?
echo '%10s\t:\t%-10s\n' | perl -e'
my %repl = (
n => "\n",
t => "\t",
);
while (<>) {
chomp;
s{\\(?:(\w)|(\W))}{
if (defined($2)) {
$2
}
elsif (exists($repl{$1})) {
$repl{$1}
}
else {
warn("Unrecognized escape \\$1.\n");
$1
}
}eg;
printf($_, "hi", "hello");
}
'
Short version:
echo '%10s\t:\t%-10s\n' | perl -nle'
s/\\(?:(n)|(t)|(.))/$1?"\n":$2?"\t":$3/seg;
printf($_, "hi", "hello");
'

format regexp constant on several lines for readability

For learning purposes I am implementing a little regexp matcher for telephone numbers. My goal is readability, not the shortest possible gawk program:
# should match
#1234567890
#123-456-7890
#123.456.7890
#(123)456-7890
#(123) 456-7890
BEGIN{
regexp="[0-9]{10},[0-9]{3}[-.][0-9]{3}[.-][0-9]{4},\\([0-9]{3}\\) ?[0-9]{3}-[0-9]{4}"
len=split(regexp,regs,/,/)
}
{for (i=1;i<=len;i++)
if ($0 ~ regs[i]) print $0
}
For better readability I would like to split the line regexp="... on several lines like:
regexp="[0-9]{10}
,[0-9]{3}[-.][0-9]{3}[.-][0-9]{4}
,\\([0-9]{3}\\) ?[0-9]{3}-[0-9]{4}"
Is there an easy way to do this in awk?
BEGIN {
regs[1] = "[0-9]{10}"
regs[2] = "[0-9]{3}[-.][0-9]{3}[.-][0-9]{4}"
regs[3] = "\\([0-9]{3}\\) ?[0-9]{3}-[0-9]{4}"
c = 3
}
{
for (i = 1; i <= c; i++)
if ($0 ~ regs[i])
print $0
}
If your awk implementation supports length(array), use it (see Jaypal Singh's comments below):
BEGIN {
regs[1] = "[0-9]{10}"
regs[2] = "[0-9]{3}[-.][0-9]{3}[.-][0-9]{4}"
regs[3] = "\\([0-9]{3}\\) ?[0-9]{3}-[0-9]{4}"
}
{
for (i = 1; i <= length(regs); i++)
if ($0 ~ regs[i])
print $0
}
Consider also the side effects of the computed (dynamic) regular expressions,
see the GNU awk manual for more information.
The following link may contain the answer you were looking for :
http://www.gnu.org/software/gawk/manual/html_node/Statements_002fLines.html
It says that in awk script files or on the command line of certain shells, awk commands can be split over several lines in the same manner as makefile commands. Simply end the line with a backslash (\) and awk will discard the newline character upon parsing. Combine this with implicit concatenation of strings (similar to C) and the solution could be
BEGIN {
regexp = "[0-9]{10}," \
"[0-9]{3}[-.][0-9]{3}[.-][0-9]{4}," \
"\\([0-9]{3}\\)?[0-9]{3}-[0-9]{4}"
len = split(regexp, regs, /,/)
}
Nevertheless, I would favor the solution that stores the regular expressions in an array directly: it better reflects the intent of the statement and doesn't force the programmer to do any more work than required. Also, there is no need for the length function since one can use the foreach syntax. One should note that arrays in awk are like maps in Java or dictionaries in Python in that they don't associate a range of integer indices with values. Rather they map string keys to values. Even if integers are used as keys, they are implicitly converted to a string. Thus the length function is not always provided since it is misleading.
BEGIN {
regs[1] = "[0-9]{10}"
regs[2] = "[0-9]{3}[-.][0-9]{3}[.-][0-9]{4}"
regs[3] = "\\([0-9]{3}\\)?[0-9]{3}-[0-9]{4}"
}
{
for (i in regs) { # i receives each key added to the regs array
if ($0 ~ regs[i]) {
print # by default `print' prints the whole record
break # we can stop finding a regexp
}
}
}
Note that the break command exits the for loop prematurely. This is necessary if each record must only be printed once, even though several regular expressions could match.
Well you can store the regexp in variables, then join them, e.g.:
awk '{
COUNTRYCODE="WHATEVER_YOUR_COUNTRY_CODE_REGEXP"
CITY="CITY_REGEXP"
PHONENR="PHONENR_REGEX"
THE_WHOLE_THING=COUNTRYCODE CITY PHONENR
if ($0 ~ THE_WHOLE_THING) { print "BINGO" }
}'
HTH
The consensus seems to be that there is no simple way to split multiline strings without disturbing awk? Thanks for the other ideas, but they make me, the programmer, do work the computer should do, which I don't enjoy. So I came up with this solution, which in my opinion is pretty close to a kind of executable specification. I use bash here documents and process substitution to create the inputs for awk on the fly:
#!/bin/bash
# numbers that should be matched
read -r -d '' VALID <<'valid'
1234567890
123-456-7890
123.456.7890
(123)456-7890
(123) 456-7890
valid
# regexp patterns that should match
read -r -d '' PATTERNS <<'patterns'
[0-9]{10}
[0-9]{3}\.[0-9]{3}\.[0-9]{4}
[0-9]{3}-[0-9]{3}-[0-9]{4}
\([0-9]{3}\) ?[0-9]{3}-[0-9]{4}
patterns
gawk --re-interval 'NR==FNR{reg[FNR]=$0;next}
{for (i in reg)
if ($0 ~ reg[i]) print $0}' <(echo "$PATTERNS") <(echo "$VALID")
Any comments are welcome.
I want to introduce my favorite approach to this question, as it hasn't been mentioned yet. I like to use awk's simple string append operation, which is just the default operator between two adjacent terms, much like implicit multiplication in typical math notation:
x = x"more stuff"
appends "more stuff" to x and sets the new value to x again. So you can write
regexp = ""
regexp = regexp"[0-9]{10}"
regexp = regexp"[0-9]{3}[-.][0-9]{3}[.-][0-9]{4}"
regexp = regexp"\\([0-9]{3}\\) ?[0-9]{3}-[0-9]{4}"
To control additional separator characters like newlines between the snippets, most languages I know of, and awk too, can use array join and split operations to build a string from an array and to convert the string back into an array, without losing the original structure of the array (for example the newline markers):
i = 0
regexp[i++] = "[0-9]{10}"
regexp[i++] = "[0-9]{3}[-.][0-9]{3}[.-][0-9]{4}"
regexp[i++] = "\\([0-9]{3}\\) ?[0-9]{3}-[0-9]{4}"
Using regstr = join(regexp, ",") then adds back the "," separators you split on.
Of course there is no join function in awk, but I guess it is very simple
to implement, knowing the string append operation above.
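A minimal sketch of such a join, assuming the 0-based contiguous indices used in the snippet above (the function and variable names are illustrative):
function join(arr, sep,    i, result) {
    result = arr[0]
    for (i = 1; i in arr; i++)
        result = result sep arr[i]
    return result
}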
My method may seem more verbose, but it has the advantage that the original data, the regexp string snippets in this case, is prefixed with the same string constant on each line. That means the code can be generated by a very simple algorithm (or even with a few editor shortcuts).