How can I get the length of an array in awk? - awk

This command
echo "hello world" | awk '{split($0, array, " ")} END{print length(array) }'
does not work for me and gives this error message
awk: line 1: illegal reference to array array
Why?

When you split an array, the number of elements is returned, so you can say:
echo "hello world" | awk '{n=split($0, array, " ")} END{print n }'
# ------------------------^^^--------------------------------^^
Output is:
2

Mr. Ventimiglia's function requires a little adjustment to do the work (see the semicolon in for statement):
function alen(a, i) {
for(i in a);
return i
}
But don't work all the cases or times. That is because the manner that awk store and "see" the indexes of the arrays: they are associative and no necessarily contiguous (like C.) So, i does not return the "last" element.
To resolve it, you need to count:
function alen(a, i, k) {
k = 0
for(i in a) k++
return k
}
And, in this manner, take care other index types of "unidimensional" arrays, where the index maybe an string. Please see: http://docstore.mik.ua/orelly/unix/sedawk/ch08_04.htm. For "multidimensional" and arbitrary arrays, see http://www.gnu.org/software/gawk/manual/html_node/Walking-Arrays.html#Walking-Arrays.

I don't think the person is asking, "How do I split a string and get the length of the resulting array?" I think the command they provide is just an example of the situation where it arose. In particular, I think the person is asking 1) Why does length(array) provoke an error, and 2) How can I get the length of an array in awk?
The answer to the first question is that the length function does not operate on arrays in POSIX standard awk, though it does in GNU awk (gawk) and a few other variations. The answer to the second question is (if we want a solution that works in all variations of awk) to do a linear scan.
For example, a function like this:
function alen (a, i) {
for (i in a);
return i;}
NOTE: The second parameter i warrants some explanation.
The way you introduce local variables in awk is as extra function parameters and the convention is to indicate this by adding extra spaces before these parameters. This is discussed in the GNU Awk manual here.

In gawk you can use the function length():
$ gawk 'BEGIN{a[1]=1; a[2]=2; a[23]=45; print length(a)}'
3
$ gawk 'BEGIN{a[1]=1; a[2]=2; print length(a); a[23]=45; print length(a)}'
2
3
From The GNU Awk user's guide:
With gawk and several other awk implementations, when given an array argument, the length() function returns the number of elements in the
array. (c.e.) This is less useful than it might seem at first, as
the array is not guaranteed to be indexed from one to the number of
elements in it. If --lint is provided on the command line (see
Options), gawk warns that passing an array argument is not portable.
If --posix is supplied, using an array argument is a fatal error (see
Arrays).

Just want to point that:
Don't need to store the result of the split function in order to print it.
If separator is not supplied for the split, the default FS (blank space) will be used.
The END part is useless here.
echo 'hello world' | awk '{print split($0, a)}'

sample on MacOSX Lion to show used ports (output can be 192.168.111.130.49704 or ::1.49704) :
netstat -a -n -p tcp | awk '/\.[0-9]+ / {n=split($4,a,"."); print a[n]}'
In this sample, that print the last array item of 4th column : "49704"

Try this if you are not using gawk.
awk 'BEGIN{test="aaa bbb ccc";a=split(test, ff, " "); print ff[1]; print a; print ff[a]}'
Output:
aaa
3
ccc
8.4.4 Using split() to Create Arrays http://docstore.mik.ua/orelly/unix/sedawk/ch08_04.htm

Here's a quick way for me to get length of array, init to zero length if non-existent, but don't overwrite any existing ones or accidentally add extra elements :
(g/mawk) 'function arrayinit(ar, x) { for(x in ar) {break}; return length(ar) };
The for loop basically has O(1) since it exits upon any existing element, regardless of sort order. My old way used to either test, or split empty string. This way saves the split step since the for loop perhaps that function implicitly.
This also works for pseudo multi-dim array like arr[x,y] or gawk arr[x][y] ones without having to worry whether "x" is a sub-array in the gawk sense.

echo "hello world" | awk '{lng=split($0, array, " ")} END{print lng) }'

Related

Using awk to print index of a pattern in a file

I've been sitting on this one for quite a while:
I would like to search for a pattern in a sample.file using awk and print the index:
>sample
ATGCGAAAAGATGAACGA
GTGACAGACAGACAGACA
GATAAACTGACGATAAAA
...
Let's say I want to find the index of the following pattern: "AAAA" (occurs twice), so the result should be 6 and 51.
EDIT:
I was able to use the following script:
cat ./sample.fasta |\
awk '{
s=$0
o=0
m="AAAA"
l=length(m)
i=index(s,m)
while (i>0) {
o+=i
print o
s=substr(s,i+l)
o+=l-1
i=index(s,m)
}
}'
However, it restarts the index on every new line, so the result is 6 and 15. I can always concatenate all lines into one single line, but maybe there's a more elegant way.
Thanks in advance
awk reads files line-by-line so it would never be a problem to find "all" indices in a multi-line file. Your problem is that you're trying to use a BEGIN block which, as its name suggests, only runs at the beginning of the program. As well, the index() function takes two arguments.
For your sample data, this should work:
awk '/AAAA/{print index($0,"AAAA")+l} NR>1{l+=length}' sample.file
The first block of code only runs when AAAA is matched, the second runs for every line after the first, incrementing the counter with the length of the line.
For the case where you have multiple matches per line, this should work:
awk -v pat=AAAA 'BEGIN{for(n=0;n<length(pat);n++) rep=rep"x"} NR>1{while(i=index($0,pat)){print i+l; sub(pat,rep);} l+=length}' sample.file
The pattern is passed as a variable; when the program starts a replacement text is generated based on the length of the pattern. Then each line after the first is looped over, getting the index of the pattern and replacing it so the next iteration returns the next instance.
It's worth mentioning that both these methods will match AAAAAA.
AWK indexes of course:
awk '{ l=index($0, "AAAA"); if (l) print l+i; i+=length(); }' dna.txt
6
51
if you're fine with zero based indices, this may be simpler.
$ sed 1d file | tr -d '\n' | grep -ob AAAA
5:AAAA
50:AAAA
assumes you have the header row as posted, if not remove sed command. Note that this assumes single byte chars as shown. For extended charsets it won't be the char position but byte-offset.

How do awk match and ~ operators work together?

I'm having trouble understanding this awk code:
$0 ~ ENVIRON["search"] {
match($0, /id=[0-9]+/);
if (RSTART) {
print substr($0, RSTART+3, RLENGTH-3)
}
}
How do the ~ and match() operators interact with each other?
How does the match() have any effect, if its output isn't printed or echo'd? What does it actually return or do? How can I use it in my own code?
This is related to Why are $0, ~, &c. used in a way that violates usual bash syntax docs inside an argument to awk?, but that question was centered around understanding the distinction between bash and awk syntaxes, whereas this one is centered around understanding the awk portions of the script.
Taking your questions one at a time:
How do the ~ and match() operators interact with each other?
They don't. At least not directly in your code. ~ is the regexp comparison operator. In the context of $0 ~ ENVIRON["search"] it is being used to test if the regexp contained in the environment variable search exists as part of the current record ($0). If it does then the code in the subsequent {...} block is executed, if it doesn't then it isn't.
How does the match() have any effect, if its output isn't printed or
echoed?
It identifies the starting point (and stores it in the awk variable RSTART) and the length (RLENGTH) of the first substring within the first parameter ($0) that matches the regexp provides as the second parameter (id=[0-9]+). With GNU awk it can also populate a 3rd array argument with segments of the matching string identified by round brackets (aka "capture groups").
What does it actually return or do?
It returns the value of RSTART which is zero if no match was found, 1 or greater otherwise. For what it does see the previous answer.
How can I use it in my own code?
As shown in the example you posted would be one way but that code would more typically be written as:
($0 ~ ENVIRON["search"]) && match($0,/id=[0-9]+/) {
print substr($0, RSTART+3, RLENGTH-3)
}
and using a string rather than regexp comparison for the first part would probably be even more appropriate:
index($0,ENVIRON["search"]) && match($0,/id=[0-9]+/) {
print substr($0, RSTART+3, RLENGTH-3)
}
Get the book Effective Awk Programming, 4th Edition, by Arnold Robbins to learn how to use awk.
use the regex id=[0-9]+ to find a match in each line
if the start position of the match (RSTART) is not 0 then:
print the match without the id=
this is shorter but does the same:
xinput --list | grep -Po 'id=[0-9]+' | cut -c4-

In awk, how can I use a file containing multiple format strings with printf?

I have a case where I want to use input from a file as the format for printf() in awk. My formatting works when I set it in a string within the code, but it doesn't work when I load it from input.
Here's a tiny example of the problem:
$ # putting the format in a variable works just fine:
$ echo "" | awk -vs="hello:\t%s\n\tfoo" '{printf(s "bar\n", "world");}'
hello: world
foobar
$ # But getting the format from an input file does not.
$ echo "hello:\t%s\n\tfoo" | awk '{s=$0; printf(s "bar\n", "world");}'
hello:\tworld\n\tfoobar
$
So ... format substitutions work ("%s"), but not special characters like tab and newline. Any idea why this is happening? And is there a way to "do something" to input data to make it usable as a format string?
UPDATE #1:
As a further example, consider the following using bash heretext:
[me#here ~]$ awk -vs="hello: %s\nworld: %s\n" '{printf(s, "foo", "bar");}' <<<""
hello: foo
world: bar
[me#here ~]$ awk '{s=$0; printf(s, "foo", "bar");}' <<<"hello: %s\nworld: %s\n"
hello: foo\nworld: bar\n[me#here ~]$
As far as I can see, the same thing happens with multiple different awk interpreters, and I haven't been able to locate any documentation that explains why.
UPDATE #2:
The code I'm trying to replace currently looks something like this, with nested loops in shell. At present, awk is only being used for its printf, and could be replaced with a shell-based printf:
#!/bin/sh
while read -r fmtid fmt; do
while read cid name addy; do
awk -vfmt="$fmt" -vcid="$cid" -vname="$name" -vaddy="$addy" \
'BEGIN{printf(fmt,cid,name,addy)}' > /path/$fmtid/$cid
done < /path/to/sampledata
done < /path/to/fmtstrings
Example input would be:
## fmtstrings:
1 ID:%04d Name:%s\nAddress: %s\n\n
2 CustomerID:\t%-4d\t\tName: %s\n\t\t\t\tAddress: %s\n
3 Customer: %d / %s (%s)\n
## sampledata:
5 Companyname 123 Somewhere Street
12 Othercompany 234 Elsewhere
My hope was that I'd be able to construct something like this to do the entire thing with a single call to awk, instead of having nested loops in shell:
awk '
NR==FNR { fmts[$1]=$2; next; }
{
for(fmtid in fmts) {
outputfile=sprintf("/path/%d/%d", fmtid, custid);
printf(fmts[fmtid], $1, $2) > outputfile;
}
}
' /path/to/fmtstrings /path/to/sampledata
Obviously, this doesn't work, both because of the actual topic of this question and because I haven't yet figured out how to elegantly make awk join $2..$n into a single variable. (But that's the topic of a possible future question.)
FWIW, I'm using FreeBSD 9.2 with its built in, but I'm open to using gawk if a solution can be found with that.
Why so lengthy and complicated an example? This demonstrates the problem:
$ echo "" | awk '{s="a\t%s"; printf s"\n","b"}'
a b
$ echo "a\t%s" | awk '{s=$0; printf s"\n","b"}'
a\tb
In the first case, the string "a\t%s" is a string literal and so is interpreted twice - once when the script is read by awk and then again when it is executed, so the \t is expanded on the first pass and then at execution awk has a literal tab char in the formatting string.
In the second case awk still has the characters backslash and t in the formatting string - hence the different behavior.
You need something to interpret those escaped chars and one way to do that is to call the shell's printf and read the results (corrected per #EtanReiser's excellent observation that I was using double quotes where I should have had single quotes, implemented here by \047, to avoid shell expansion):
$ echo 'a\t%s' | awk '{"printf \047" $0 "\047 " "b" | getline s; print s}'
a b
If you don't need the result in a variable, you can just call system().
If you just wanted the escape chars expanded so you don't need to provide the %s args in the shell printf call, you'd just need to escape all the %s (watching out for already-escaped %s).
You could call awk instead of the shell printf if you prefer.
Note that this approach, while clumsy, is much safer than calling an eval which might just execute an input line like rm -rf /*.*!
With help from Arnold Robbins (the creator of gawk), and Manuel Collado (another noted awk expert), here is a script which will expand single-character escape sequences:
$ cat tst2.awk
function expandEscapes(old, segs, segNr, escs, idx, new) {
split(old,segs,/\\./,escs)
for (segNr=1; segNr in segs; segNr++) {
if ( idx = index( "abfnrtv", substr(escs[segNr],2,1) ) )
escs[segNr] = substr("\a\b\f\n\r\t\v", idx, 1)
new = new segs[segNr] escs[segNr]
}
return new
}
{
s = expandEscapes($0)
printf s, "foo", "bar"
}
.
$ awk -f tst2.awk <<<"hello: %s\nworld: %s\n"
hello: foo
world: bar
Alternatively, this shoudl be functionally equivalent but not gawk-specific:
function expandEscapes(tail, head, esc, idx) {
head = ""
while ( match(tail, /\\./) ) {
esc = substr( tail, RSTART + 1, 1 )
head = head substr( tail, 1, RSTART-1 )
tail = substr( tail, RSTART + 2 )
idx = index( "abfnrtv", esc )
if ( idx )
esc = substr( "\a\b\f\n\r\t\v", idx, 1 )
head = head esc
}
return (head tail)
}
If you care to, you can expand the concept to octal and hex escape sequences by changing the split() RE to
/\\(x[0-9a-fA-F]*|[0-7]{1,3}|.)/
and for a hex value after the \\:
c = sprintf("%c", strtonum("0x" rest_of_str))
and for an octal value:
c = sprintf("%c", strtonum("0" rest_of_str))
Since the question explicitly asks for an awk solution, here's one which works on all the awks I know of. It's a proof-of-concept; error handling is abysmal. I've tried to indicate places where that could be improved.
The key, as has been noted by various commentators, is that awk's printf -- like the C standard function it is based on -- does not interpret backslash-escapes in the format string. However, awk does interpret them in command-line assignment arguments.
awk 'BEGIN {if(ARGC!=3)exit(1);
fn=ARGV[2];ARGC=2}
NR==FNR{ARGV[ARGC++]="fmt="substr($0,length($1)+2);
ARGV[ARGC++]="fmtid="$1;
ARGV[ARGC++]=fn;
next}
{match($0,/^ *[^ ]+[ ]+[^ ]+[ ]+/);
printf fmt,$1,$2,substr($0,RLENGTH+1) > ("data/"fmtid"/"$1)
}' fmtfile sampledata
(
What's going on here is that the 'FNR==NR' clause (which executes only on the first file) adds the values (fmtid, fmt) from each line of the first file as command-line assignments, and then inserts the data file name as a command-line argument. In awk, assignments as command line arguments are simply executed as though they were assignments from a string constant with implicit quotes, including backslash-escape processing (except that if the last character in the argument is a backslash, it doesn't escape the implicit closing double-quote). This behaviour is mandated by Posix, as is the order in which arguments are processed which makes it possible to add arguments as you go.
As written, the script must be provided with exactly two arguments: the formats and the data (in that order). There is some room for improvement, obviously.
The snippet also shows two ways of concatenating trailing fields.
In the format file, I assume that the lines are well behaved (no leading spaces; exactly one space after the format id). With those constraints, substr($0, length($1)+2) is precisely the part of the line after the first field and a single space.
Processing the datafile, it may be necessary to do this with fewer constraints. First, the builtin match function is called with the regular expression /^ *[^ ]+[ ]+[^ ]+[ ]+/ which matches leading spaces (if any) and two space-separated fields, along with the following spaces. (It would be better to allow tabs, as well.) Once the regex matches (and matching shouldn't be assumed, so there's another thing to fix), the variables RSTART and RLENGTH are set, so substr($0, RLENGTH+1) picks up everything starting with the third field. (Again, this is all Posix-standard behaviour.)
Honestly, I'd use the shell printf for this problem, and I don't understand why you feel that solution is somehow sub-optimal. The shell printf interprets backslash escapes in formats, and the shell read -r will do the line splitting the way you want. So there's no reason for awk at all, as far as I can see.
Ed Morton shows the problem clearly (edit: and it's now complete, so just go accept it): awk's string literal processing handled the escapes, and file I/O code isn't a lexical analyzer.
It's an easy fix: decide what escapes you want to support, and support them. Here's a one-liner form if you're doing special-purpose work that doesn't need to handle escaped backslashes
awk '{ gsub(/\\n/,"\n"); gsub(/\\t/,"\t"); printf($0 "bar\n", "world"); }' <<\EOD
hello:\t%s\n\tfoo
EOD
but for doit-and-forgetit peace of mind just use the full form in the linked answer.
#Ed Morton's answer explains the problem well.
A simple workaround is to:
pass the format-string file contents via an awk variable, using command substitution,
assuming that file is not too large to be read into memory in full.
Using GNU awk or mawk:
awk -v formats="$(tr '\n' '\3' <fmtStrings)" '
# Initialize: Split the formats into array elements.
BEGIN {n=split(formats, aFormats, "\3")}
# For each data line, loop over all formats and print.
{ for(i=1;i<n;++i) {printf aFormats[i] "\n", $1, $2, $3} }
' sampleData
Note:
The advantage of this solution is that it works generically - you don't need to anticipate specific escape sequences and handle them specially.
On FreeBSD awk, this almost works, but - sadly - split() still splits by newlines, despite being given an explicit separator - this smells like a bug. Observed on versions 20070501 (OS X 10.9.4) and 20121220 (FreeBSD 10.0).
The above solves the core problem (for brevity, it omits stripping the ID from the front of the format strings and omits the output-file creation logic).
Explanation:
tr '\n' '\3' <fmtStrings replaces actual newlines in the format-strings file with \3 (0x3) characters, so as to be able to later distinguish them from the \n escape sequences embedded in the lines, which awk turns into actual newlines when assigning to variable formats (as desired).
\3 (0x3) - the ASCII end-of-text char. - was arbitrarily chosen as an auxiliary separator that is assumed not to be present in the input file.
Note that using \0 (NUL) is NOT an option, because awk interprets that as an empty string, causing split() to split the string into individual characters.
Inside the BEGIN block of the awk script, split(formats, aFormats, "\3") then splits the combined format strings back into individual format strings.
I had to create another answer to start clean, I believe I've come to a good solution, again with perl:
echo '%10s\t:\t%10s\r\n' | perl -lne 's/((?:\\[a-zA-Z\\])+)/qq[qq[$1]]/eeg; printf "$_","hi","hello"'
hi : hello
That bad boy s/((?:\\[a-zA-Z\\])+)/qq[qq[$1]]/eeg will translate any meta character I can think of, let us take a look with cat -A :
echo '%10s\t:\t%10s\r\n' | perl -lne 's/((?:\\[a-zA-Z\\])+)/qq[qq[$1]]/eeg; printf "$_","hi","hello"' | cat -A
hi^I:^I hello^M$
PS. I didn't create that regex, I googled unquote meta and found here
What you are trying to do is called templating. I would suggest that shell tools are not the best tools for this job. A safe way to go would be to use a templating library such as Template Toolkit for Perl, or Jinja2 for Python.
The problem lies in the non-interpretation of the special characters \t and \n by echo: it makes sure that they are understood as as-is strings, and not as tabulations and newlines. This behavior can be controlled by the -e flag you give to echo, without changing your awk script at all:
echo -e "hello:\t%s\n\tfoo" | awk '{s=$0; printf(s "bar\n", "world");}'
tada!! :)
EDIT:
Ok, so after the point rightfully raised by Chrono, we can devise this other answer corresponding to the original request to have the pattern read from a file:
echo "hello:\t%s\n\tfoo" > myfile
awk 'BEGIN {s="'$(cat myfile)'" ; printf(s "bar\n", "world")}'
Of course in the above we have to be careful with the quoting, as the $(cat myfile) is not seen by awk but interpreted by the shell.
This looks extremely ugly, but it works for this particular problem:
s=$0;
gsub(/'/, "'\\''", s);
gsub(/\\n/, "\\\\\\\\n", s);
"printf '%b' '" s "'" | getline s;
gsub(/\\\\n/, "\n", s);
gsub(/\\n/, "\n", s);
printf(s " bar\n", "world");
Replace all single quotes with shell-escaped single quotes ('\'').
Replace all escaped newline sequences that appear normally as \n with the sequence that appears as \\\\n. It would suffice to use \\\\n as the actual replacement string (meaning \\n would print if you printed it), but the version of gawk I have messes things up in POSIX mode.
Invoke the shell to execute printf '%b' 'escape'\''d format' and use awk's getline statement to retrieve the line.
Unescape \\n to yield a newline. This step wouldn't be necessary if gawk in POSIX mode played nicely.
Unescape \n to yield a newline.
Otherwise you're left to call the gsub function for each possible escape sequence, which is terrible for \001, \002, etc.
Graham,
Ed Morton's solution is the best (and perhaps only) one available.
I'm including this answer for a better explanation of WHY you're seeing what you're seeing.
A string is a string. The confusing part here is WHERE awk does the translation of \t to a tab, \n to a newline, etc. It appears NOT to be the case that the backslash and t get translated when used in a printf format. Instead, the translation happens at assignment, so that awk stores the tab as part of the format rather than translating when it runs the printf.
And this is why Ed's function works. When read from stdin or a file, no assignment is performed that will implement the translation of special characters. Once you run the command s="a\tb"; in awk, you have a three character string containing no backslash or t.
Evidence:
$ echo "a\tb\n" | awk '{ s=$0; for (i=1;i<=length(s);i++) {printf("%d\t%c\n",i,substr(s,i,1));} }'
1 a
2 \
3 t
4 b
5 \
6 n
vs
$ awk 'BEGIN{s="a\tb\n"; for (i=1;i<=length(s);i++) {printf("%d\t%c\n",i,substr(s,i,1));} }'
1 a
2
3 b
4
And there you go.
As I say, Ed's answer provides an excellent function for what you need. But if you can predict what your input will look like, you can probably get away with a simpler solution. Knowing how this stuff gets parsed, if you have a limited set of characters you need to translate, you may be able to survive with something simple like:
s=$0;
gsub(/\\t/,"\t",s);
gsub(/\\n/,"\n",s);
That's a cool question, I don't know the answer in awk, but in perl you can use eval :
echo '%10s\t:\t%-10s\n' | perl -ne ' chomp; eval "printf (\"$_\", \"hi\", \"hello\")"'
hi : hello
PS. Be aware of code injection danger when you use eval in any language, no just eval any system call can't be done blindly.
Example in Awk:
echo '$(whoami)' | awk '{"printf \"" $0 "\" " "b" | getline s; print s}'
tiago
What if the input was $(rm -rf /)? You can guess what would happen :)
ikegami adds:
Why would even think of using eval to convert \n to newlines and \t to tabs?
echo '%10s\t:\t%-10s\n' | perl -e'
my %repl = (
n => "\n",
t => "\t",
);
while (<>) {
chomp;
s{\\(?:(\w)|(\W))}{
if (defined($2)) {
$2
}
elsif (exists($repl{$1})) {
$repl{$1}
}
else {
warn("Unrecognized escape \\$1.\n");
$1
}
}eg;
printf($_, "hi", "hello");
}
'
Short version:
echo '%10s\t:\t%-10s\n' | perl -nle'
s/\\(?:(n)|(t)|(.))/$1?"\n":$2?"\t":$3/seg;
printf($_, "hi", "hello");
'

Weird AWK behavior while forcing expression to be a number (adding 0)

I noticed an odd behavior while populating an array in awk. The indices and value both were numbers, so adding 0 shouldn’t have impacted. For the sake of understanding, lets take the following example:
Here is a file that I wish to use for this demo:
$ cat file
2.60E5-2670161065730303122012098 Invnum987678
2.60E5-2670161065846403042011098 Invnum987912
2.60E5-2670161065916903012012075 Invnum987654
2.60E5-2670161066813503042011075 Invnum987322
2.60E5-2670161066835008092012075 Invnum987323
2.60E5-2670161067040701122012075 Invnum987324
2.60E5-2670161067106602122010074 Invnum987325
What I would like to do is create an index from $1 and assign it value from $2. I will extract pieces of value from $1 and $2 using substr function.
$ awk '{p=substr($1,12)+0; A[p]=substr($2,7)+0;next}END{for(x in A) print x,A[x]}’ file
Now, ideally what the output should have been is as follows (ignore the fact that associative arrays may output in random):
161065730303122012098 987678
161065846403042011098 987912
161065916903012012075 987654
161066813503042011075 987322
161066835008092012075 987323
161067040701122012075 987324
161067106602122010074 987325
But, the output I got was as follows:
161066835008092012544 987323
161065846403042017280 987912
161067040701122019328 987324
161067106602122018816 987325
161066813503041994752 987322
161065916903012007936 987654
161065730303122014208 987678
If I remove the +0 from above awk one-liner, the output seems to be what I expect. What I would like to know is why would it corrupt the keys?
The above test was done on:
$ awk -version
awk version 20070501
It appears that AWK has some numerical limitations - I get even weirder results on gawk - perhaps the discussion in this SO will help you.

format regexp constant on several lines for readability

For learning purposes I am implementing a little regexp matcher for telephone numbers. My goal is readability, not the shortest possible gawk program:
# should match
#1234567890
#123-456-7890
#123.456.7890
#(123)456-7890
#(123) 456-7890
BEGIN{
regexp="[0-9]{10},[0-9]{3}[-.][0-9]{3}[.-][0-9]{4},\\([0-9]{3}\\) ?[0-9]{3}-[0-9]{4}"
len=split(regexp,regs,/,/)
}
{for (i=1;i<=len;i++)
if ($0 ~ regs[i]) print $0
}
For better readability I would like to split the line regexp="... on several lines like:
regexp="[0-9]{10}
,[0-9]{3}[-.][0-9]{3}[.-][0-9]{4}
,\\([0-9]{3}\\) ?[0-9]{3}-[0-9]{4}"
Is there an easy way to do this in awk?
BEGIN {
regs[1] = "[0-9]{10}"
regs[2] = "[0-9]{3}[-.][0-9]{3}[.-][0-9]{4}"
regs[3] = "\\([0-9]{3}\\) ?[0-9]{3}-[0-9]{4}"
c = 3
}
{
for (i = 1; i <= c; i++)
if ($0 ~ regs[i])
print $0
}
If your awk implementation supports length(array) - use it (see Jaypal Singh comments below):
BEGIN {
regs[1] = "[0-9]{10}"
regs[2] = "[0-9]{3}[-.][0-9]{3}[.-][0-9]{4}"
regs[3] = "\\([0-9]{3}\\) ?[0-9]{3}-[0-9]{4}"
}
{
for (i = 1; i <= length(regs); i++)
if ($0 ~ regs[i])
print $0
}
Consider also the side effects of the computed (dynamic) regular expressions,
see the GNU awk manual for more information.
The following link may contain the answer you were looking for :
http://www.gnu.org/software/gawk/manual/html_node/Statements_002fLines.html
It says that in awk script files or on the command line of certain shells, awk commands can be split over several lines in the same manner as makefile commands. Simply end the line with a backslash (\) and awk will discard the newline character upon parsing. Combine this with implicit concatenation of strings (similar to C) and the solution could be
BEGIN {
regexp = "[0-9]{10}," \
"[0-9]{3}[-.][0-9]{3}[.-][0-9]{4}," \
"\\([0-9]{3}\\)?[0-9]{3}-[0-9]{4}"
len = split(regexp, regs, /,/)
}
Nevertheless, I would favor the solution that stores the regular expressions in an array directly: it better reflects the intent of the statement and doesn't force the programmer to do any more work than required. Also, there is no need for the length function since one can use the foreach syntax. One should note that arrays in awk are like maps in Java or dictionaries in Python in that they don't associate a range of integer indices with values. Rather they map string keys to values. Even if integers are used as keys, they are implicitly converted to a string. Thus the length function is not always provided since it is misleading.
BEGIN {
regs[1] = "[0-9]{10}"
regs[2] = "[0-9]{3}[-.][0-9]{3}[.-][0-9]{4}"
regs[3] = "\\([0-9]{3}\\)?[0-9]{3}-[0-9]{4}"
}
{
for (i in regs) { # i recieves each key added to the regs array
if ($0 ~ regs[i]) {
print # by default `print' prints the whole record
break # we can stop finding a regexp
}
}
}
Note that the break command exits the for loop prematurely. This is necessary if each record must only be printed once, even though several regular expressions could match.
Well you can store the regexp in variables, then join them, e.g.:
awk '{
COUNTRYCODE="WHATEVER_YOUR_CONTRY_CODE_REGEXP"
CITY="CITY_REGEXP"
PHONENR="PHONENR_REGEX"
THE_WHOLE_THING=COUNTRYCODE CITY PHONENR
if ($0 ~ THE_WHOLE_THING) { print "BINGO" }
}'
HTH
The concensus seems to be that there is no simple way to split multiline strings without disturbing awk? Thanks for the other ideas, but make me as the programmer do the work of the computer what i don't enjoy. So i came up with this solution, which in my opinion is pretty close to a kind of executable specification. I use the base and here documents and process redicrection to create the files for awk on the fly:
#!/bin/bash
# numbers that should be matched
read -r -d '' VALID <<'valid'
1234567890
123-456-7890
123.456.7890
(123)456-7890
(123) 456-7890
valid
# regexp patterns that should match
read -r -d '' PATTERNS <<'patterns'
[0-9]{10}
[0-9]{3}\.[0-9]{3}\.[0-9]{4}
[0-9]{3}-[0-9]{3}-[0-9]{4}
\([0-9]{3}\) ?[0-9]{3}-[0-9]{4}
patterns
gawk --re-interval 'NR==FNR{reg[FNR]=$0;next}
{for (i in reg)
if ($0 ~ reg[i]) print $0}' <(echo "$PATTERNS") <(echo "$VALID")
Any comments are welcome.
I want to introduce my favorite to this question as it hasn't been mentioned yet. I like to use the simple string append operation of awk, that is just the default operator between two terms, as multiplication in typical math notations:
x = x"more stuff"
appends "more stuff" to x and sets the new value to x again. So you can write
regexp = ""
regexp = regexp"[0-9]{10}"
regexp = regexp"[0-9]{3}[-.][0-9]{3}[.-][0-9]{4}"
regexp = regexp"\\([0-9]{3}\\) ?[0-9]{3}-[0-9]{4}"
To control additional split characters like newlines between the snippets most languages I know of and awk too, can use the array join and split methods to make a string from an array and convert back the string into an array, without loosing the original structure of the array (for example the newline markers):
i = 0
regexp[i++] = "[0-9]{10}"
regexp[i++] = "[0-9]{3}[-.][0-9]{3}[.-][0-9]{4}"
regexp[i++] = "\\([0-9]{3}\\) ?[0-9]{3}-[0-9]{4}"
Using regstr = join(regexp, ",") add the split "," you used.
Of course there is no join function in awk, but I guess it is very simple
to implement, knowing the string append operation above.
My method seem to look more verbose but has the advantage, that the original data, the regexp string snippets in this part, are prepended by a string constant for each snippet. That means the code can be generated by a very simple algorithm (or even some editors shortcuts).