Grammar not parsing as expected with negative lookaround assertion

OK, this is either a bug, or I'm going to look like a total idiot because I'm using a lookaround assertion completely wrong. I don't care if it's the latter, so here we go.
Got this grammar I'm testing:
our grammar HC2 {
    token TOP            { <line>+ }
    token line           { [ <header> \n | <not-header> \n ] }
    token header         { <header-start> <header-content> }
    token not-header     { \N* }
    token header-start   { <header-one> }
    token header-one     { <[#]> <![#]> } # note this negative lookahead here
    token header-content { \N* }
}
I want to capture a markdown header with just one # sign, no more.
Here is the output from Grammar::Tracer/Debugger, reproduced as text below. It's skipping right over the <header-start> capture. If I remove the <![#]> negative lookahead assertion, I get the second trace instead.
So is this a bug or am I out to lunch?
As text, with the <![#]> assertion in place:
TOP
| line
| | not-header
| | * MATCH "# Grandmother's for a Brighter Future"
| * MATCH "# Grandmother's for a Brighter Future\n"
| line
| | not-header
| | * MATCH ""
| * MATCH "\n"
| line
| | not-header
| | * MATCH "# Development site"
| * MATCH "# Development site\n"
| line
| | not-header
| | * MATCH "* The new site is up and running at example.com"
| * MATCH "* The new site is up and running at example.com\n"
| line
| | not-header
And with the assertion removed:
TOP
| line
| | header
| | | header-start
| | | | header-one
| | | | * MATCH "#"
| | | * MATCH "#"
| | | header-content
| | | * MATCH " Grandmother's for a Brighter Future"
| | * MATCH "# Grandmother's for a Brighter Future"
| * MATCH "# Grandmother's for a Brighter Future\n"
| line
| | not-header
| | * MATCH ""
| * MATCH "\n"
| line
| | header
| | | header-start
| | | | header-one
| | | | * MATCH "#"
| | | * MATCH "#"
UPDATE: If I modify header-one to:
token header-one { <[#]> <-[#]> }
it matches as expected. However, that does not answer the question as to why the original code does not match.

OK, so the non-technical answer is that I made a bad assumption that the | character behaves the same way as in Perl. It does not. In Perl, the regex engine attempts to match the pattern on the left-hand side of the | character. If that fails, it moves on to the pattern on the right-hand side.
To get the "old school" Perl behavior, use the || operator, called the "Alternation" operator: https://docs.raku.org/language/regexes#Alternation:_||
The | operator is called the "Longest Alternation" operator. See https://docs.raku.org/language/regexes#Longest_alternation:_|
A more detailed, much more technical discussion of how the "Longest Alternation" operator works is here: https://design.raku.org/S05.html#Longest-token_matching
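With that in mind, a minimal sketch of the fix, assuming the intent is to try <header> first and fall back to <not-header>: only the line token needs to change, the rest of the grammar stays as posted.
token line { [ <header> \n || <not-header> \n ] } # || tries the left branch first, like Perl's |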
Though I was already aware that || existed from my reading of the docs, I hadn't read about it carefully. I mistakenly assumed the Raku core developers would make | behave like it did in Perl and that || was some cool new operator I could learn about later.
Big takeaway: try hard to uncover the basic assumptions you are making and don't assume anything until you've read the docs closely.

Related

Check if column matches any line in file with awk

Say I have some output from the command openstack security group list:
+--------------------------------------+---------+------------------------+----------------------------------+------+
| ID | Name | Description | Project | Tags |
+--------------------------------------+---------+------------------------+----------------------------------+------+
| 1dda8a57-fff4-4832-9bac-4e806992f19a | default | Default security group | 0ce266c801ae4611bb5744a642a01eda | [] |
| 2379d595-0fdc-479f-a211-68c83caa9d42 | default | Default security group | 602ad29db6304ec39dc253bcbba408a7 | [] |
| 431df666-a9ba-4643-a3a0-9a70c89e1c05 | tempest | tempest test | b320a32508a74829a0563078da3cba2e | [] |
| 5b54e63c-f2e5-4eda-b2b9-a7061d19695f | default | Default security group | 57e745b9612941709f664c58d93e4188 | [] |
| 6381ebaf-79fb-4a31-bc32-49e2fecb7651 | default | Default security group | f5c30c42f3d74b8989c0c806603611da | [] |
| 6cce5c94-c607-4224-9401-c2f920c986ef | default | Default security group | e3190b309f314ebb84dffe249009d9e9 | [] |
| 7402fdd3-0f1e-4eb1-a9cd-6896f1457567 | default | Default security group | d390b68f95c34cefb0fc942d4e0742f9 | [] |
| 76978603-545b-401d-9959-9574e907ec57 | default | Default security group | 3a7b5361e79f4914b09b022bcae7b44a | [] |
| 7705da1e-d01e-483d-ab82-c99fdb9eba9c | default | Default security group | 1da03b5e7ce24be38102bd9c8f99e914 | [] |
| 7fd52305-850c-4d9a-a5e9-0abfb267f773 | default | Default security group | 5b20d6b7dfab4bfbac0a1dd3eb6bf460 | [] |
| 82a38caa-8e7f-468f-a4bc-e60a8d4589a6 | default | Default security group | d544d2243caa4e1fa027cfdc38a4f43e | [] |
| a4a5eaba-5fc9-463a-8e09-6e28e5b42f80 | default | Default security group | 08efe6ec9b404119a76996907abc606b | [] |
| e7c531e3-cdc3-4b7c-bf32-934a2f2de3f1 | default | Default security group | 539c238bf0e84463b8639d0cb0278699 | [] |
| f96bf2e8-35fe-4612-8988-f489fd4c04e3 | default | Default security group | 2de96a1342ee42a7bcece37163b8dfa0 | [] |
+--------------------------------------+---------+------------------------+----------------------------------+------+
And I have a list of Project IDs:
0ce266c801ae4611bb5744a642a01eda
b320a32508a74829a0563078da3cba2e
57e745b9612941709f664c58d93e4188
f5c30c42f3d74b8989c0c806603611da
e3190b309f314ebb84dffe249009d9e9
d390b68f95c34cefb0fc942d4e0742f9
3a7b5361e79f4914b09b022bcae7b44a
5b20d6b7dfab4bfbac0a1dd3eb6bf460
d544d2243caa4e1fa027cfdc38a4f43e
08efe6ec9b404119a76996907abc606b
539c238bf0e84463b8639d0cb0278699
2de96a1342ee42a7bcece37163b8dfa0
which is the intersection of two files that I get from running fgrep -x -f projects secgrup.
How can I extract the values from the ID column for the rows whose Project column matches this list?
It would be something like:
openstack security group list | awk '$2 && $2!="ID" && $10 in $(fgrep -x -f projects secgrup) {print $2}'
which should yield:
1dda8a57-fff4-4832-9bac-4e806992f19a
431df666-a9ba-4643-a3a0-9a70c89e1c05
5b54e63c-f2e5-4eda-b2b9-a7061d19695f
6381ebaf-79fb-4a31-bc32-49e2fecb7651
6cce5c94-c607-4224-9401-c2f920c986ef
7402fdd3-0f1e-4eb1-a9cd-6896f1457567
76978603-545b-401d-9959-9574e907ec57
7fd52305-850c-4d9a-a5e9-0abfb267f773
82a38caa-8e7f-468f-a4bc-e60a8d4589a6
a4a5eaba-5fc9-463a-8e09-6e28e5b42f80
e7c531e3-cdc3-4b7c-bf32-934a2f2de3f1
f96bf2e8-35fe-4612-8988-f489fd4c04e3
but obviously this doesn't work.
You can use this awk command for this:
awk -F ' *\\| *' 'FNR == NR {arr[$1]; next}
$5 in arr {print $2}' projects secgrup
1dda8a57-fff4-4832-9bac-4e806992f19a
431df666-a9ba-4643-a3a0-9a70c89e1c05
5b54e63c-f2e5-4eda-b2b9-a7061d19695f
6381ebaf-79fb-4a31-bc32-49e2fecb7651
6cce5c94-c607-4224-9401-c2f920c986ef
7402fdd3-0f1e-4eb1-a9cd-6896f1457567
76978603-545b-401d-9959-9574e907ec57
7fd52305-850c-4d9a-a5e9-0abfb267f773
82a38caa-8e7f-468f-a4bc-e60a8d4589a6
a4a5eaba-5fc9-463a-8e09-6e28e5b42f80
e7c531e3-cdc3-4b7c-bf32-934a2f2de3f1
f96bf2e8-35fe-4612-8988-f489fd4c04e3
Here:
-F ' *\\| *' sets the input field separator to | surrounded by zero or more spaces on either side.
FNR == NR is true only while the first file (projects) is read; each project ID is stored as a key in arr.
$5 in arr then prints $2 (the ID column) for every row of secgrup whose Project column ($5) is one of those keys.
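If you want to filter the live command output instead of a saved secgrup file, the same script can read standard input via - (a sketch, assuming the projects file is in the current directory):
openstack security group list | awk -F ' *\\| *' 'FNR == NR {arr[$1]; next} $5 in arr {print $2}' projects -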
With your shown samples only, please try the following awk code. It is written and tested in GNU awk, which is required for the three-argument form of match().
awk '
FNR==NR{
  arr1[$0]
  next
}
match($0,/.*default \| Default security group \| (\S+)/,arr2) && (arr2[1] in arr1){
  print arr2[1]
}
' ids Input_file
Explanation:
The FNR==NR condition is true only while the first input file, named ids (where your project IDs are stored), is being read.
For each of those lines, an entry is created in the array arr1 with the current line as its index.
The next statement then skips the remaining rules for that line.
For the second file, the match function is used with the regex .*default \| Default security group \| (\S+), which captures one group and stores it in the array arr2.
If that captured value (arr2[1]) is present as an index in arr1, it is printed; otherwise nothing happens.

If Examples has too many columns, every row will be too long to read!

Feature:
Background:
Scenario Outline:
* match '<msg>' == <prefix> + ',' + '<end>'
Examples:
| prefix | end | msg |
| hello | mike | hello,mike |
| hello | jerry | hello,jerry |
Could it be like this:
Feature:
Background:
Examples:
| prefix |
| hello |
Scenario Outline:
* match '<msg>' == <prefix> + ',' + '<end>'
Examples:
| end | msg |
| mike | hello,mike |
| jerry | hello,jerry |
I want to divide the Examples into several parts, or set a base Examples table before the Scenario Outline. What should I do?
Karate addresses this in several different ways in its latest 0.9.X versions. Let's see.
1. As you asked, we can define the Examples: table before the Scenario Outline: by using table in Karate:
Feature: my feature
Background: BG
* table myExample
| prefix | end | msg |
| 'hello' | 'mike' | 'hello,mike' |
| 'hello' | 'jerry' | 'hello,jerry' |
Scenario Outline: SOW
* match '<msg>' == '<prefix>' + ',' + '<end>'
Examples:
| myExample |
The same table can be kept in another feature file and read from this feature file, but don't complicate things, as we have some other solutions coming below.
2. Karate sees all these table and Examples: definitions as arrays of JSON.
Typically, your example above will be represented as:
[
  {
    "prefix": "hello",
    "end": "mike",
    "msg": "hello,mike"
  },
  {
    "prefix": "hello",
    "end": "jerry",
    "msg": "hello,jerry"
  }
]
So Karate also allows you to define these Examples from JSON or CSV using Karate's dynamic Scenario Outline feature.
If you feel your examples are too large to fit in your feature file, keep them in a CSV file and read it in your Examples:
Feature: my feature
Background: BG
* def myExample = read("myExample.csv")
Scenario Outline: SOW
* match '<msg>' == '<prefix>' + ',' + '<end>'
Examples:
| myExample |
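For reference, the myExample.csv used above would look something like this (an assumed layout; since the msg values contain commas, those fields need to be quoted):
prefix,end,msg
hello,mike,"hello,mike"
hello,jerry,"hello,jerry"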
The same applies to JSON as well, by providing the data as a JSON array.

ANTLR 4 TL Grammar seems incorrect

I downloaded the TL Grammar from https://raw.githubusercontent.com/bkiers/tiny-language-antlr4/master/src/main/antlr4/tl/antlr4/TL.g4
After trying it out, I realized the grammar seems unable to handle user-defined function calls at the top level.
For example, if your file contents are:
def s(n)
return n+n;
end
s("5", "6");
and you listen for a FunctionCallExpression, you don't get a callback. However, if your file contents are:
def s(n)
return n+n;
end
s(s("5"))
you do get the call back.
Your input:
s("5", "6");
is matched by the statement (not an expression!):
functionCall
: Identifier '(' exprList? ')' #identifierFunctionCall
| ...
;
and "5", "6" are two expressions matched by exprList.
The first s in your input s(s("5")) will again match identifierFunctionCall, and the inner s will be matched as an expression (a functionCallExpression to be precise).
Here are the different parse trees:
s("5", "6");
'- parse
|- block
| '- statement
| |- identifierFunctionCall
| | |- s
| | |- (
| | |- exprList
| | | |- stringExpression
| | | | '- "5"
| | | |- ,
| | | '- stringExpression
| | | '- "6"
| | '- )
| '- ;
'- <EOF>
s(s("5"));
'- parse
|- block
| '- statement
| |- identifierFunctionCall
| | |- s
| | |- (
| | |- exprList
| | | '- functionCallExpression
| | | '- identifierFunctionCall
| | | |- s
| | | |- (
| | | |- exprList
| | | | '- stringExpression
| | | | '- "5"
| | | '- )
| | '- )
| '- ;
'- <EOF>
In short: the grammar works as it is supposed to.
EDIT
A valid TL script is a code block where each code block consists of statements. To simplify the grammar and eliminate some ambiguous rules (which was needed for the older ANTLRv3), it was easiest to not allow a statement to be a simple expression. For example, the following code is not a valid TL script:
1 + 2;
I.e. 1 + 2 is not a statement, but an expression.
A function call, however, can be a statement; but when it is placed on the right-hand side of an assignment statement, it is an expression:
foo(); // this is a statement
i = foo(); // now foo is an expression
That is why you observed one s(...) to trigger a certain enter...() method, while the other did not.
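To see the difference from a listener, here is a minimal sketch. The class and method names (TLLexer, TLParser, TLBaseListener, enterIdentifierFunctionCall, enterFunctionCallExpression) are assumptions based on what ANTLR 4 typically generates from the labeled rules shown above, so treat it as illustrative rather than exact:
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

public class CallDemo {
    public static void main(String[] args) {
        // A statement-position call wrapping an expression-position call.
        String source = "def s(n)\n  return n+n;\nend\ns(s(\"5\"));\n";
        TLLexer lexer = new TLLexer(CharStreams.fromString(source));
        TLParser parser = new TLParser(new CommonTokenStream(lexer));
        ParseTree tree = parser.parse();
        ParseTreeWalker.DEFAULT.walk(new TLBaseListener() {
            @Override
            public void enterIdentifierFunctionCall(TLParser.IdentifierFunctionCallContext ctx) {
                // Fires for calls in statement position, e.g. the outer s(...).
                System.out.println("statement call: " + ctx.getText());
            }
            @Override
            public void enterFunctionCallExpression(TLParser.FunctionCallExpressionContext ctx) {
                // Fires only for calls in expression position, e.g. the inner s("5").
                System.out.println("expression call: " + ctx.getText());
            }
        }, tree);
    }
}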

Using Scenario Outline and Examples with Strings

I'm facing the following "issue" when trying out cucumber-jvm with an Arabic-to-Roman-numeral converter: the Then step gets converted into several step defs instead of one.
Here's my scenario outline:
Scenario Outline:
Given a romanizer
When I romanize <arabic number>
Then I get the <roman numeral>
Examples:
| arabic number | roman numeral |
| 1 | I |
| 2 | II |
| 3 | III |
| 4 | IV |
| 5 | V |
| 6 | VI |
| 7 | VII |
| 8 | VIII |
| 9 | IX |
| 10 | X |
For this, I was expecting the Then step def to simply be:
I_get_the(String value)
but instead it generates one step def for each example:
I_get_the_I()
I_get_the_II()
etc.
What am I doing wrong?
Thanks,
Nilesh
Use the following (note the double quotes):
Scenario Outline:
Given a romanizer
When I romanize "<arabic number>"
Then I get the "<roman numeral>"
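With the quotes in place, the generated snippet becomes a single parameterised step definition along these lines (class and method names here are only illustrative, and the import path depends on your cucumber-jvm version):
import cucumber.api.java.en.Then; // io.cucumber.java.en.Then in newer cucumber-jvm versions

public class RomanizerSteps {
    @Then("^I get the \"([^\"]*)\"$")
    public void I_get_the(String romanNumeral) {
        // assert the romanizer's output against romanNumeral here
    }
}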

What are all the standard CGI environment variables?

CGI scripts should have access to a list of environment variables set by the web server. What are their names?
See RFC 3875 for the CGI spec, which has all the info you need. :-)
From the RFC:
meta-variable-name = "AUTH_TYPE" | "CONTENT_LENGTH" |
                     "CONTENT_TYPE" | "GATEWAY_INTERFACE" |
                     "PATH_INFO" | "PATH_TRANSLATED" |
                     "QUERY_STRING" | "REMOTE_ADDR" |
                     "REMOTE_HOST" | "REMOTE_IDENT" |
                     "REMOTE_USER" | "REQUEST_METHOD" |
                     "SCRIPT_NAME" | "SERVER_NAME" |
                     "SERVER_PORT" | "SERVER_PROTOCOL" |
                     "SERVER_SOFTWARE" | scheme |
                     protocol-var-name | extension-var-name

protocol-var-name  = ( protocol | scheme ) "_" var-name
scheme             = alpha *( alpha | digit | "+" | "-" | "." )
var-name           = token
extension-var-name = token
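To see which of these your server actually sets, a throwaway CGI script can simply dump them; here is a small sketch in Python (the variable names come from the RFC above, and anything your server adds beyond the spec will of course vary):
#!/usr/bin/env python3
# Minimal CGI script: print a few RFC 3875 meta-variables as plain text.
import os

print("Content-Type: text/plain")
print()
for name in ("REQUEST_METHOD", "QUERY_STRING", "REMOTE_ADDR",
             "SERVER_NAME", "SERVER_PORT", "SERVER_PROTOCOL"):
    print(f"{name}={os.environ.get(name, '')}")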
http://www.cgi101.com/book/ch3/text.html
The "hoohoo" machine at NCSA that has the CGI documentation is down, but here's what seems to be a mirror.
A quick Google search finds what you need.