In awk, is there a functional difference between putting conditions outside the brackets or inside?

As far as I can tell, this:
$2 > 50{print}
functions exactly the same way as this:
{if ($2 > 50) print}
When should I use one vs the other, or is it a matter of style?

The 2nd one is much more readable in my opinion (especially for non-awk gurus who may have to maintain your code).

It's mostly style. The first example is canonical AWK. There are, of course, times when an if statement is required.
Consider that:
$2 > 50 {foo; print}
takes 20 characters versus 27 for the following:
{if ($2 > 50) {foo; print}}
If you remove spaces, it would be 16 versus 22 characters.
So the former is more desirable if you are interested in economy (both in terms of actual characters and visually).
As potong points out, the if form requires curly braces if there is more than one statement (as shown in my second example above). If there is only one statement, they are optional. This is a source of subtle bugs: if the braces are omitted and additional statements are later added without adding braces, the new statements fall outside the if. Also, for the reader of a script, the intent is much clearer if the braces are included.
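For example, a contrived sketch of that failure mode (the counter variable is made up):
{if ($2 > 50) print; c++}      # BUG: c++ runs for every input line, not just matches
{if ($2 > 50) {print; c++}}    # correct: both statements are guarded by the condition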

Yes, they are the same. As with most syntax alternatives, there are trade-offs.
I mostly agree with John3136 that the 2nd form has a higher likelihood of being understood by someone with a background in C, C++, C#, Java, and others.
And as Dennis Williamson points out, there are times you have to use the if (...) form.
For one-liners, I think the first form is useful, acceptable, and helps reinforce the primary design of awk, i.e. [pattern]-[action], with each part being optional (and a default behaviour provided).
If you look at comp.lang.awk, you'll see that the long-timers mostly prefer style 1.
Large blocks of code in style 1, to my eye, are difficult to read, but do have the look of lex code.
So... it depends ;-) ... are you writing code that will ultimately be maintained by likely non-awk experts? Then stick with style 2. Writing for yourself? Use both so you don't get rusty!
I hope this helps.

Yes, they are the same.
In the first form, you can omit { print } because it's the default action. So this works too:
$2 > 50
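As a quick check (sample input made up), all three forms print the same thing:
$ printf '1 40\n2 60\n3 70\n' | awk '$2 > 50'
2 60
3 70
Swapping in '$2 > 50 {print}' or '{if ($2 > 50) print}' produces identical output.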


Raku operator for 2's complement arithmetic?

I sometimes use this:
$ perl -e "printf \"%d\", ((~18446744073709551592)+1)"
24
I can't seem to do it with Raku. The best I could get is:
$ raku -e "say +^18446744073709551592"
-18446744073709551593
So: how can I make Raku give me the same answer as Perl?
Gotta go with (my variant¹ of) Liz's custom op (in her comment below).
sub prefix:<²^>(uint $a) { (+^ $a) + 1 }
say ²^ 18446744073709551592; # 24
My original "semi-educated wild guess"² that turned out to be acceptable to #zentrunix and the basis for Liz's op:
say (+^ my uint $ = 18446744073709551592) + 1; # 24
\o/ It works!³
Footnotes
¹ I flipped the two-character op because I wanted to follow the +^ form, have it sub-vocalize as "two's complement", and avoid it looking like ^2.
² One line of thinking was about the particular integer. I saw that 18446744073709551592 is close to 2**64. Another was that integers are limited precision in Perl unless you do something to make them otherwise, whereas in Raku they are arbitrary precision unless you do something to make them otherwise. A third line of thinking came from reading the doc for prefix +^ which says "converts the number to binary using as many bytes as needed" which I interpreted as meaning that the representation is somehow important. Hmm. What if I try an int variable? Overflow. (Of course.) uint? Bingo.
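For the record, here is that experiment compressed into code (a sketch; the int line dies with a "cannot unbox" error, so it's shown commented out):
# my int $i = 18446744073709551592;  # dies: cannot unbox a 65-bit-wide bigint into a native int
my uint $u = 18446744073709551592;
say (+^ $u) + 1;                     # 24: the complement happens in 64 bits
say (+^ 18446744073709551592) + 1;   # -18446744073709551592: arbitrary-precision Int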
³ I've no idea if this solution is right for the wrong reasons. Or even worse. One thing that's concerning is that uint in Raku is defined to correspond to the largest native unsigned integer size supported by the Raku compiler used to compile the Raku code. (Iirc.) In practice today this means Rakudo and whatever underlying platform is being targeted, and I think that almost certainly means C's uint64_t in almost all cases. I imagine perl has some similar platform-dependent definition. So my solution, if it is a reasonable one, is presumably only portable to the degree that the Raku compiler (which in practice today means Rakudo) agrees with the perl binary (which in practice today means P5P's perl) when run on some platform. See also @p6steve's comment below.
'Long-hand' answer:
raku -e 'put ( (18446744073709551592.base(2) - 0b1).comb.map({!$_.Int+0}).join.parse-base(2));'
OR
raku -e 'say 18446744073709551592.base(2).comb.map({!$_.Int+0}).join.parse-base(2) + 1;'
Sample Output: 24
The answers above (should?) implement "Two's-Complement" encoding directly. Neither uses Raku's +^ twos-complement operator. The first one subtracts one from the binary representation, then inverts. The second one inverts first, then adds one. Neither answer feels truly correct, yet the same answer as Perl5 is obtained (24).
Looking at the Raku Docs page, one would conclude that the "twos-complement" of a positive number would be negative, hence it's not clear what the Perl (and now Raku) answers represent. Hopefully the foregoing is somewhat useful.
https://docs.raku.org/routine/+$CIRCUMFLEX_ACCENT

Brace Delimiters with qq Don't Interpolate Code in Raku

Sorry if this is documented somewhere, but I haven't been able to find it. When using brace delimiters with qq, code is not interpolated:
qq.raku
#!/usr/bin/env raku
say qq{"Two plus two": { 2 + 2 }};
say qq["Two plus two": { 2 + 2 }];
$ ./qq.raku
"Two plus two": { 2 + 2 }
"Two plus two": 4
Obviously, this isn't a big deal since I can use a different set of delimiters, but I ran across it and thought I'd ask.
Update
As @raiph pointed out, I forgot to include the actual question: is this the way it's supposed to work?
The quote language "nibbler" (the bit of the grammar that eats its way through a quoted string) looks like this:
[
<!stopper>
[
|| <starter> <nibbler> <stopper>
|| <escape>
|| .
]
]*
That is, until we see a stopper, eat whichever comes first of:
A starter (the opening { in your case), followed by some internal stuff, followed by a stopper (the }); this allows for nesting of the construct inside of the string
An escape (and closure interpolation is considered a kind of escape)
Any other character
This ordering in the grammar means that a nesting of the chosen quote starter/stopper will always win over an escape. This issue was discussed during the language design; we could, after all, have reordered the alternation in the grammar to have escapes win. On balance, however, it was felt that the choice of starter/stopper was the more local decision than the general properties of the quoting language, and so should take precedence. (This is also consistent with how quote languages are constructed: we take the base quoted string grammar and mix starter/stopper methods into it.)
Obviously, this isn't a big deal since I can use a different set of delimiters, but I ran across it and thought I'd ask.
You didn't ask anything. :)
Let's say you've got some text. And you want to use double quote processing to get interpolation, except you don't want braced text to be interpolated as code. You could write, say, qq:!c '...'. But don't you think it's a lot easier to remember, write, and read qq{ ... }?
Nice little touch, right?
Which is why it's the way it is -- it's a very nice touch.
And, perhaps, why it's not documented -- it's little, and, once you encounter it, obvious what you need to do.
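To illustrate that qq:!c route (a minimal sketch; the variable is made up):
my $n = 2;
say qq:!c[$n plus $n: { $n + $n }]; # 2 plus 2: { 2 + 2 }
say qq[$n plus $n: { $n + $n }];    # 2 plus 2: 4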
That said, the Q lang escapes include ones to recursively re-enter the Q lang:
say qq{"Two plus two": \qq[{ 2 + 2 }] }; # "Two plus two": 4
Does that answer your question? :)

How to add a small bit of context in a grammar?

I am tasked with parsing (and transforming) code in a computer language that has a slight quirk in its rules, at least as I see it. To be exact, the compiler treats newlines (as well as semicolons) as statement separators, but other than that (e.g. inside a statement) it treats them as spacers (whitespace).
As an example, this code:
try
local x = 5 / 0
catch (i)
print(i + "\n")
proves to be equivalent to this:
try local x = 5 / 0 catch (i) print(i + "\n")
I don't see how I can express such a rule in EBNF, or specifically in Lark's EBNF dialect, in any sensible way. I probably could define all possible newline positions inside all statements, but it would be cumbersome and error-prone.
I wish to find a way to treat newlines contextually. Is there a proven method for this, preferably within Python/Lark domain? If I have to modify the parser for that purpose, then where should I start?
Or if I misunderstood something in this language in particular or in machine language parsing in general, or my statement of the problem is wrong, I'd also be happy to get educated.
(As you may guess, the language in question has a well-proven implementation, but no officially defined grammar. Also, it is Squirrel, for all that it matters.)
The relevant quote from the "specification" is this:
A squirrel program is a simple sequence of statements:
stats := stat [';'|'\n'] stats
[...] Statements can be separated with a new line or ‘;’ (or with the keywords case or default if inside a switch/case statement), both symbols are not required if the statement is followed by ‘}’.
These are relatively complex rules, and in their totality not context-free if newlines can also be ignored everywhere else. Note however that in my understanding the text implies that ; or \n is required when none of the other cases apply. That would make your example illegal. Since it evidently isn't, the BNF as written is probably correct, i.e. both ; and \n are optional everywhere. In that case you can (for Lark) just put in an %ignore "\n" statement and it should work fine.
Also, Lark should not complain if you both ignore \n and use it in a rule: where useful it will match it in the rule, otherwise it will just ignore it. Note however that this breaks if you use a terminal that includes \n (e.g. WS or /\s/); have \n as an extra case instead.
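To make that concrete, here is a minimal Lark sketch with a toy grammar (hypothetical, far simpler than the real Squirrel grammar) showing that with %ignore of "\n" the one-line and multi-line forms parse identically:
from lark import Lark

# Toy grammar: ";" is an optional separator and "\n" is ignored outright.
grammar = r'''
    start: stat+
    stat: "try" stat "catch" "(" NAME ")" stat   -> try_stat
        | "local" NAME "=" expr ";"?             -> local_stat
        | NAME "(" expr ")" ";"?                 -> call_stat
    expr: expr "/" atom -> div
        | expr "+" atom -> add
        | atom
    atom: NAME | NUMBER | STRING
    NAME: /[a-zA-Z_]\w*/
    NUMBER: /\d+/
    STRING: /"(\\.|[^"\\])*"/
    %ignore /[ \t]+/
    %ignore "\n"
'''

parser = Lark(grammar)
one_line  = 'try local x = 5 / 0 catch (i) print(i + "\\n")'
multiline = 'try\n    local x = 5 / 0\ncatch (i)\n    print(i + "\\n")'

# Both inputs yield the same tree because "\n" is ignored everywhere.
assert parser.parse(one_line) == parser.parse(multiline)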
(For the future: you will probably get faster responses to Lark questions if you ask over on Gitter, or at least post a link to the SO question there.)

Equivalent awk syntax for a[x][0]+=$2

What is the proper awk syntax for the following piece of code, which I think is for gawk: a[x][0]+=$2
On one level, there is no equivalent; that's why the gawk code to support it was written.
Whether or not you actually need all the functionality that comes with that syntax depends on what you're going to do with the array afterwards, and if you're not doing much with it then the implementation you need might be as simple as:
a[x,0]+=$2
but it might also be something like:
a[x,0]+=$2; b[x]=(x in b?b[x] SUBSEP:"") 0
and if you need to do the equivalent of:
split("1 3 5 7",a[x]); ... ; a[x][0]+=$2
then you have some head scratching to do.
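To illustrate where the two diverge (a minimal sketch): iterating "the elements of a[x]" is direct in gawk but needs key-splitting with the flat form:
$ gawk 'BEGIN{x="key"; a[x][0]+=2; for (i in a[x]) print i, a[x][i]}'
0 2
$ awk 'BEGIN{x="key"; a[x,0]+=2; for (k in a) {split(k,p,SUBSEP); print p[1], p[2], a[k]}}'
key 0 2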

Is a /start/,/end/ range expression ever useful in awk?

I've always contended that you should never use a range expression like:
/start/,/end/
in awk because although it makes the trivial case where you only want to print matching text including the start and end lines slightly briefer than the alternative*:
/start/{f=1} f{print; if (/end/) f=0}
when you want to tweak it even slightly to do anything else, it requires a complete rewrite or results in duplicated or otherwise undesirable code. For example, if you want to print the matching text excluding the range delimiters, using the second form above you'd just tweak it to move the components around:
f{if (/end/) f=0; else print} /start/{f=1}
but if you started with /start/,/end/ you'd need to abandon that approach in favor of what I just posted or you'd have to write something like:
/start/,/end/{ if (!/start|end/) print }
i.e. duplicating the conditions, which is undesirable.
Then I saw a question posted that required identifying the LAST end in a file, where a range expression was used in the solution, and I thought that might have some value (see https://stackoverflow.com/a/21145009/1745001).
Now, though, I'm back to thinking that it's just not worth bothering with range expressions at all, and that a solution which doesn't use a range expression would have worked just as well for that case.
So - does anyone have an example where a range expression actually adds noticeable value to a solution?
*I used to use:
/start/{f=1} f; /end/{f=0}
but too many times I found I had to do something additional when f is true and /end/ is found (or, to put it another way, ONLY do something when /end/ is found IF f is true), so now I just try to stick to the slightly less brief but much more robust and extensible:
/start/{f=1} f{print; if (/end/) f=0}
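For instance, doing something extra on the end line only when a range is active is a one-line tweak of that form (the extra action is made up):
/start/{f=1} f{print; if (/end/) {f=0; print "== end of block =="}}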
Interesting. I also often start with a range expression and then later on switch to using a variable...
I think a situation where this could be useful, aside from pure range-only situations, is when you want to print a match, but only if it lies within a certain range. It also helps that it is immediately obvious what it does. For example:
awk '/start/,/end/{if(/ppp/)print}' file
with this input:
start
dfgd gd
ppp 1
gfdg
fd gfd
end
ppp 2
ppp 3
start
ppp 4
ppp 5
end
ppp 6
ppp 7
gfdgdgd
will produce:
ppp 1
ppp 4
ppp 5
--
One could of course also use:
awk '/start/{f=1} /ppp/ && f; /end/{f=0}' file
But it is longer and somewhat less readable...
While you are right that the /start/,/end/ range expression can easily be reimplemented with a conditional, it has many interesting use-cases where it is used on its own. As you observe, it might have little value for processing tabular data, the main but not the only use case of awk.
So - does anyone have an example where a range expression actually adds noticeable value to a solution?
In the mentioned use-cases, the range expression improves legibility. Here are a few examples, where the range expression accurately selects the text to be processed. These are only a handful of examples, but there are countless similar applications, demonstrating the incredible versatility of awk.
Filter logs within a time range
Assuming each log line starts with an ISO timestamp, the filter below selects all events in a given range of 1 hour:
awk '/^2015-06-30T12:00:00Z/,/^2015-06-30T13:00:00Z/'
Extract a document from a file
awk '/---- begin file.data ----/,/---- end file.data ----/'
This can be used to bundle resources with shell scripts (with cat), to extract parts of GPG-signed messages (prepared with --clearsign) or, more generally, parts of MIME messages.
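As a concrete sketch (file and marker names made up), a bundled file can be pulled out and the marker lines stripped:
$ awk '/---- begin file.data ----/,/---- end file.data ----/' bundle.txt | sed '1d;$d' > file.data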
Process LaTeX files
The range pattern can be used to match LaTeX environments, so for instance we can select the abstracts of all articles in our directory:
awk '/begin{abstract}/,/end{abstract}/' *.tex
or all the theorems, to prepare a theorem database!
awk '/begin{theorem}/,/end{theorem}/' *.tex
or write a linter ensuring that theorems do not contain citations (if we regard this as bad style):
awk '
/begin{theorem}/,/end{theorem}/ { if(/\\cite{/) { c+= 1 } }
END { printf("There were %d bad-style citations.\n", c) }
'
or preprocess tables, etc.