Grammar rule for sequence of blocks of digits that increase in length each repetition - antlr

I want to generate a parsing rule (using ANTLR 4) that defines a repeating chain of binary blocks separated by ':'.
Each block has one digit more than the previous block, starting with two digits. For example:
01:010:0001:01010 ...
The chain can have an arbitrary number of these blocks.
Right now I have defined the rule as:
BIN : [0-1]+ ;
connections : BIN (':' BIN)* ;
I know how to make it check that each block has at least two binary digits, but not the correct number.
Is there any way to make it more specific, using ANTLR?

With a semantic predicate it would look similar to this:
connections locals[int i] :
{$i = 2;} BIN {check}? ({$i++;} ':' BIN {check}?)* ;
BIN :
[0-1]+ ;
Here check is a placeholder for $BIN.getText().length() == $i; replace it with that expression to make the grammar work.
Another option would be to generate a parse tree visitor and to validate the BIN-Nodes while traversing the parse tree.
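The length check that such a visitor would perform can be sketched in plain Java, independent of the ANTLR runtime. This is a minimal sketch under assumptions: BlockChainValidator and validate are hypothetical names, and the input is the raw text of the whole chain rather than real parse-tree nodes.

```java
import java.util.ArrayList;
import java.util.List;

final class BlockChainValidator {
    /** Returns one message per invalid block; an empty list means the chain is valid. */
    static List<String> validate(String chain) {
        List<String> errors = new ArrayList<>();
        // Limit -1 keeps trailing empty blocks so "01:" is reported as an error.
        String[] blocks = chain.split(":", -1);
        int expected = 2; // the first block must have two digits
        for (int i = 0; i < blocks.length; i++) {
            String block = blocks[i];
            if (!block.matches("[01]+")) {
                errors.add("block " + i + " is not binary: '" + block + "'");
            } else if (block.length() != expected) {
                errors.add("block " + i + " has " + block.length()
                        + " digits, expected " + expected);
            }
            expected++; // each block is one digit longer than the previous one
        }
        return errors;
    }
}
```

In a real visitor the same loop would run over the BIN terminal nodes of a connections context instead of over a split string.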

Related

Anti-matching against an infinite family of <!before> patterns in Raku

I am trying to avoid matching whitespace at the end of a string while still matching whitespace in the middle of words.
Here is an example of a regex that matches underscores within x but does not match up to three trailing underscores.
say 'x_x___x________' ~~ /
[
| 'x'
| '_' <!before [
| $
| '_' <?before $>
| '_' <?before ['_' <?before $>]>
| '_' <?before ['_' <?before ['_' <?before $>]>]>
# ...
]>
]+
/;
Is there a way to construct the rest of the pattern implied by the ...?
It is a little difficult to discern what you are asking for.
You could be looking for something as simple as this:
say 'x_x___x________' ~~ / 'x'+ % '_' ** 1..3 /
# 「x_x___x」
or
say 'x_x___x________' ~~ / 'x'+ % '_' ** 1..2 /
# 「x_x」
or
say 'x_x___x________' ~~ / 'x'+ % '_'+ /
# 「x_x___x」
I would suggest using a Capture..., thusly:
'x_x___x________' ~~ /(.*?) _* $/;
say $0; #「x_x___x」
(The ? modifier makes the * 'non-greedy'.)
Please let me know if I have missed the point!
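For comparison, the same non-greedy capture can be sketched with java.util.regex. This is an illustration only, not Raku: TrailingUnderscores and withoutTrailingUnderscores are hypothetical names, and \z is used here as the end-of-input anchor.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TrailingUnderscores {
    // Lazy group, then a greedy run of underscores anchored at the end of input.
    private static final Pattern BODY = Pattern.compile("(.*?)_*\\z");

    /** Returns the input without its trailing underscores. */
    static String withoutTrailingUnderscores(String input) {
        Matcher m = BODY.matcher(input);
        // The lazy group grows only until "_*" can reach the end of input.
        return m.find() ? m.group(1) : input;
    }
}
```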
avoid matching whitespace at the end of a string while still matching whitespace in the middle of words
Per Brad's answer, and your comment on it, something like this:
/ \w+ % \s+ /
what I'm looking for is a way to match arbitrarily long streams that end with a known pattern
Per @user0721090601's comment on your Q, and as a variant of @p6steve's answer, something like this:
/ \w+ % \s+ )> \s* $ /
The )> capture marker marks where capture is to end.
You can use arbitrary patterns on the left and right of that marker.
an infinite family of <!before> patterns
Generalizing to an infinite family of patterns of any type, whether they are zero-width or not, the most natural solution in a regex is iteration using any of the standard quantifiers that are open ended. For example, \s+ for one or more whitespace characters.[1] [2]
Is there a way to construct the rest of the pattern implied by the ...?
I'll generalize that to "Is there a way in a Raku regex to match some arbitrary pattern that could in theory be recognized by a computer program?"
The answer is always "Yes":
While Raku rules/regexes might look like traditional regexes they are in fact arbitrary functions embedded in an arbitrary program over which you ultimately have full control.
Rules have arbitrary read access to capture state.[3]
Rules can do arbitrary Turing-complete computation.[4]
A collection of rules/regexes can arbitrarily consume input and drive the parse/match state, i.e. can implement any parser.
In short, if it can be matched/parsed by any program written in any programming language, it can be matched/parsed using Raku rules/regexes.
Footnotes
[1] If you use an open ended quantifier you do need to make sure that each match iteration/recursion either consumes at least one character, or fails, so that you avoid an infinite loop. For example, the * quantifier will succeed even if the pattern it qualifies does not match, so be careful that that won't lead to an infinite loop.
[2] Given the way you wrote your example, perhaps you are curious about recursion rather than iteration. Suffice to say, it's easy to do that too.[1]
[3] In Raku rules, captures form a hierarchy. There are two special variables that track the capture state of two key levels of this hierarchy:
$¢ is the capture state of the innermost enclosing overall capture. Think of it as something analogous to a return value being constructed by the current function call in a stack of function calls.
$/ is the capture state of the innermost enclosing capture. Think of it as something analogous to a value being constructed by a particular block of code inside a function.
For example:
'123' ~~ / 1* ( 2* { print "$¢ $/" } ) 3* { print "$¢ $/" } / ; # 1 2123 123
The overall / ... / is analogous to an ordinary function call. The first 1 and first 123 of the output show what has been captured by that overall regex.
The ( ... ) sets up an inner capture for a part of the regex. The 2* { print "$¢ $/" } within it is analogous to a block of code. The 2 shows what it has captured.
The final 123 shows that, at the top level of the regex, $/ and $¢ have the same value.
[4] For example, the code in footnote 3 above includes arbitrary code inside the { ... } blocks. More generally:
Rules can be invoked recursively;
Rules can have full signatures and pass arguments;
Rules can contain arbitrary code;
Rules can use multiple dispatch semantics for resolution. Notably, this can include resolution based on longest match length.
I’m wondering if Raku’s trim() routines might suit your purpose, for example: .trim, .trim-trailing or even .trim-leading. In the Raku REPL:
> say 'x x x ' ~~ m:g/ 'x'+ \s* /;
(「x 」 「x 」 「x 」)
> say 'x x x '.trim-trailing ~~ m:g/ 'x'+ \s* /;
(「x 」 「x 」 「x」)
HTH.
https://docs.raku.org/routine/trim https://docs.raku.org/routine/trim-trailing https://docs.raku.org/routine/trim-leading

Howto parse runlength encoded binary subformat with antlr

Given the following input:
AA:4:2:#5#xxAAx:2:a:
The part #5# defines the start of a binary subformat with the length of 5. The sub format can contain any kind of character and is likely to contain tokens from the main format. (ex. AA is a keyword/token inside the main format).
I want to build a lexer that is able to extract one token for the whole binary part.
I already tried several approaches (e.g. partials, semantic predicates), but I did not get them working together the right way.
Finally I found the solution by myself.
Below are the relevant parts of the lexer definition
@members {
public int _binLength;
}
BINARYHEAD: '#' [0-9]+ '#' { _binLength = Integer.parseInt(getText().substring(1,getText().length()-1)); } -> pushMode(RAW) ;
mode RAW;
BINARY: .+ {getText().length() <= _binLength}? -> popMode;
The solution is based on an extra field that is set while parsing the length definition of the binary field. Afterward, a semantic predicate is used to restrict the validity of the binary content to the size of that field.
Any suggestion to simplify the parseInt call is welcome.
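The length-prefix idea behind this lexer can also be sketched outside ANTLR in plain Java. This is only an illustration under assumptions: BinaryField and readAt are hypothetical names, not part of the grammar above or of the ANTLR runtime, and the sketch requires exactly the declared number of payload characters.

```java
final class BinaryField {
    final int length;     // payload length declared between the '#' marks
    final String payload; // raw payload characters, taken verbatim
    final int end;        // index just past the payload

    private BinaryField(int length, String payload, int end) {
        this.length = length;
        this.payload = payload;
        this.end = end;
    }

    /** Reads "#digits#payload" starting at index start of input. */
    static BinaryField readAt(String input, int start) {
        if (start < 0 || start >= input.length() || input.charAt(start) != '#') {
            throw new IllegalArgumentException("expected '#' at index " + start);
        }
        int close = input.indexOf('#', start + 1);
        if (close < 0) {
            throw new IllegalArgumentException("unterminated length header");
        }
        String digits = input.substring(start + 1, close);
        if (!digits.matches("[0-9]+")) {
            throw new IllegalArgumentException("bad length: '" + digits + "'");
        }
        int length = Integer.parseInt(digits);
        int payloadStart = close + 1;
        int end = payloadStart + length;
        if (end > input.length()) {
            throw new IllegalArgumentException("payload shorter than declared");
        }
        // The payload is copied without interpreting main-format tokens such as AA.
        return new BinaryField(length, input.substring(payloadStart, end), end);
    }
}
```

Scanning of the main format would then resume at the returned end index.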

antlr4: need to convert sequences of symbols to characters in lexer

I am writing a parser for Wolfram Language. The language has a concept of "named characters", which are specified by a name delimited by \[, and ]. For example: \[Pi].
Suppose I want to specify a regular expression for an identifier. Identifiers can include named characters. I see two ways to do it: one is to have a preprocessor that would convert all named characters to their unicode representation, and two is to enumerate all possible named characters in their source form as part of the regular expression.
The second approach does not seem feasible because there are a lot of named characters. I would prefer to have ranges of unicode characters in my regex.
So I want to preprocess my token stream. In other words, it seems to me that the lexer needs to check if the named characters syntax is correct and then look up the name and convert it to unicode.
But if the syntax is incorrect or the name does not exist I need to tell the user about it. How do I propagate this error to the user and yet let antlr4 recover from the error and resume? Maybe I can sort of "pipe" lexers/parsers? (I am new to antlr).
EDIT:
In Wolfram Language I can have this string as an identifier: \[Pi]Squared. The part between brackets is called "named character". There is a limited set of named characters, each of which corresponds to a unicode code point. I am trying to figure out how to tokenize identifiers like this.
I could have a rule for my token like this (simplified to just a combination of named characters and ASCII characters):
NAME : ('\\[' [a-z]+ ']'|[a-zA-Z])+ ;
but I would like to check if the named character actually exists (and other attributes such as if it is a letter, but the latter part is outside of the scope of the question), so this regex won't work.
I considered making a list of allowed named characters and just making a long regex that enumerates all of them, but this seems ugly.
What would be a good approach to this?
END OF EDIT
A common approach is to write the lexer/parser to allow syntactically correct input and defer semantic issues to the analysis of the generated parse tree. In this case, the lexer can naively accept named characters:
NChar : NCBeg .*? RBrack ;
fragment NCBeg : '\\[' ;
fragment LBrack: '[' ;
fragment RBrack: ']' ;
Update
In the parser, allow the NChar's to exist in the parse-tree as discrete terminal nodes:
idents : ident+ ;
ident : NChar // named character string
| ID // simple character string?
| Literal // something quoted?
| ....
;
This makes analysis of the parse tree considerably easier: each ident context will contain only one non-null value for a discretely identifiable alt; and isolates analysis of all ordering issues to the idents context.
Update2
For an input \[Pi]Squared, the parse tree form that would be easiest to analyze would be an idents node with two well-ordered children, \[Pi] and Squared.
Packing both children into the same token would not be best practice: you would later have to break the token text manually into its two parts to check whether it contains a valid named character and whether the particular sequence of parts is allowable.
No regex is going to allow conclusive verification of the named characters. That will require a list. Tightening the lexer definition of an NChar can, however, achieve a result equivalent to a regex:
NChar : NCBeg [A-Z][A-Za-z]+ RBrack ;
If the concern is that there might be a space after the named character, consider that this circumstance is likely better treated with a semantic warning as opposed to a syntactic error. Rather than skipping whitespace in the lexer, put the whitespace on the hidden channel. Then, in the verification analysis of each idents context, check the hidden channel for intervening whitespace and issue a warning as appropriate.
----
A parse-tree visitor can then examine, validate, and warn as appropriate regarding unknown or misspelled named characters.
To do the validation in the parser, if more desirable, use a predicated rule to distinguish known from unknown named characters:
@members {
ArrayList<String> keyList = .... // list of named chars
public boolean inList(String id) {
return keyList.contains(id) ;
}
}
nChar : known
| unknown
;
known : NChar { inList($NChar.getText()) }? ;
unknown : NChar { error("Unknown " + $NChar.getText()); } ;
The inList function could implement a distance metric to detect misspellings, but correcting the text directly in the parse-tree is a bit complex. Easier to do when implemented as a parse-tree decoration during a visitor operation.
Finally, a scrape and munge of the named characters into a usable map (both unicode and ascii) is likely worthwhile to handle both representations as well as conversions and misspelling.
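A minimal sketch of such a map-based lookup in plain Java follows. The three table entries are a hypothetical subset chosen for illustration, not the real Wolfram Language list, and NamedCharacters and expand are assumed names. Unknown names are left in place and collected so that the caller can report them.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedCharacters {
    // Hypothetical subset; a real table would be generated from the full list.
    private static final Map<String, String> TABLE = Map.of(
            "Pi", "\u03C0",
            "Alpha", "\u03B1",
            "Infinity", "\u221E");

    // A backslash, an opening bracket, a name made of letters, a closing bracket.
    private static final Pattern NAMED = Pattern.compile("\\\\\\[([A-Za-z]+)\\]");

    /** Replaces known named characters; adds each unknown name to the given list. */
    static String expand(String source, List<String> unknownNames) {
        Matcher m = NAMED.matcher(source);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = TABLE.get(m.group(1));
            if (replacement == null) {
                unknownNames.add(m.group(1)); // report later, keep the text as written
                replacement = m.group();
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The same table could back the inList check shown earlier, or a visitor that decorates the parse tree with the resolved code points.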

How do I prioritize two overlapping expressions? (Ragel)

I have two expressions:
ident = alpha . (alnum|[._\-])*;
string = (printable1)+;
# Printable includes almost all Windows-1252 characters with glyphs.
main := ( ident % do_ident | string % do_string )
# The do_* actions have been defined, and generate tokens.
Obviously, any ident is a string. Ragel has priority operators to overcome this. But no matter how I've tried to set the priorities, either some idents execute both actions, or some valid strings are ignored (valid strings with a valid ident as a prefix, for example: ab$).
I have found one way around it, without using priorities:
main := ( ident % do_ident | (string - ident) % do_string )
But if I have more than a few overlapping expressions, this will get cumbersome. Is this the only practical way?
Any help with the correct way to do this would be appreciated.
Take a look at section '6.3 Scanners' in the Ragel Guide.
main := |*
ident => do_ident;
string => do_string;
*|;
Note: When using scanners, have ts, te, and act defined in the host language.
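The behavior a scanner gives you here can be imitated in plain Java to see why it resolves the overlap. This is a sketch under assumptions, not Ragel itself: it assumes that the scanner takes the longest match and, on a tie, the pattern listed first. LongestMatch and classify are hypothetical names, and the STRING pattern is only a stand-in for the question's printable1 class.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LongestMatch {
    private static final Pattern IDENT = Pattern.compile("[A-Za-z][A-Za-z0-9._-]*");
    // Stand-in for printable1: printable ASCII characters except space.
    private static final Pattern STRING = Pattern.compile("[\\x21-\\x7E]+");

    /** Classifies the token starting at index start as "ident", "string" or "none". */
    static String classify(String input, int start) {
        int identLen = matchLength(IDENT, input, start);
        int stringLen = matchLength(STRING, input, start);
        if (identLen <= 0 && stringLen <= 0) {
            return "none";
        }
        // Longest match wins; on a tie the pattern listed first (ident) wins.
        return identLen >= stringLen ? "ident" : "string";
    }

    private static int matchLength(Pattern p, String input, int start) {
        Matcher m = p.matcher(input);
        m.region(start, input.length());
        return m.lookingAt() ? m.end() - start : -1;
    }
}
```

With these rules ab$ is classified as a string because the string match is longer, while abc is classified as an identifier because both matches have the same length.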
It looks like your issue is that all valid identifiers are also valid strings, and you just want the input to be interpreted as an identifier first if possible. You can force the machine to accept an identifier first by embedding a priority in the leaving action for ident, which overrides all transitions for string:
main := ( ident %(ident_vs_string, 1) % do_ident | string $(ident_vs_string, 0) % do_string )
This will ensure that the leaving transition following a valid expression stops the machine exploring either continuing or leaving a string.
Be careful with how this combined expression is terminated. Whatever expression follows the identifier/string must start with a character not permissible in either, so that the exit transitions are well defined.

complex AST rewrite rule in ANTLR

This follows up on my earlier question about AST rewrite rules and the divide-into-groups technique, "AST rewrite rule with " * +" in antlr".
I am having trouble with AST generation in ANTLR again :). Here is my ANTLR code:
start : noun1+=n (prep noun2+=n (COMMA noun3+=n)*)*
-> ^(NOUN $noun1) (^(PREP prep) ^(NOUN $noun2) ^(NOUN $noun3)*)*
;
n : 'noun1'|'noun2'|'noun3'|'noun4'|'noun5';
prep : 'and'|'in';
COMMA : ',';
Now, with the input "noun1 and noun2, noun3 in noun4, noun5", I got the following unexpected AST:
Compare with the "Parse Tree" in ANTLRWorks:
I think the $noun3 variable holds the list of all "n" matched by "COMMA noun3+=n". Consequently, the rewrite ^(NOUN $noun3)* emits all of those "n" without separating which "n" actually belongs to which "prep".
Is there any way to get that separation in "(^(PREP prep) ^(NOUN $noun2) ^(NOUN $noun3))"? All I want is for the AST to match the "Parse Tree" in ANTLRWorks exactly, but without the COMMA tokens.
Thanks for your help!
Getting the separation that you want is easiest if you break up the start rule. Here's an example (without writing COMMAs to the AST):
start : prepphrase //one prepphrase is required.
(COMMA! prepphrase)* //"COMMA!" means "match a COMMA but don't write it to the AST"
;
prepphrase: noun1=n //You can use "noun1=n" instead of "noun1+=n" when you're only using it to store one value
(prep noun2=n)?
-> ^(NOUN $noun1) ^(PREP prep)? ^(NOUN $noun2)?
;
A prepphrase is a noun that may be followed by a preposition with another noun. The start rule looks for comma-separated prepphrases.
The output appears like the parse tree image, but without the commas.
If you prefer explicitly writing out ASTs with -> or if you don't like syntax like COMMA!, you can write the start rule like this instead. The two different forms are functionally equivalent.
start : prepphrase //one prepphrase is required.
(COMMA prepphrase)*
-> prepphrase+ //write each prepphrase, which doesn't include commas
;