How do I prioritize two overlapping expressions? (Ragel)

I have two expressions:
ident = alpha . (alnum|[._\-])*;
string = (printable1)+;
# Printable includes almost all Windows-1252 characters with glyphs.
main := ( ident % do_ident | string % do_string )
# The do_* actions have been defined, and generate tokens.
Obviously, any ident is a string. Ragel has priority operators to overcome this. But no matter how I've tried to set the priorities, either some idents execute both actions, or some valid strings are ignored (valid strings with a valid ident as a prefix, for example: ab$).
I have found one way around it, without using priorities:
main := ( ident % do_ident | (string - ident) % do_string )
But if I have more than a few overlapping expressions, this will get cumbersome. Is this the only practical way?
Any help with the correct way to do this would be appreciated.

Take a look at section '6.3 Scanners' in the Ragel Guide.
main := |*
ident => do_ident;
string => do_string;
*|;
Note: when using scanners, the variables ts, te, and act must be defined in the host language.
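For example, with Ragel's Java host (ragel -J), a minimal sketch of the declarations the generated scanner expects might look like this (the variable names are fixed by Ragel; the buffer setup and the input variable are illustrative assumptions):
char[] data = input.toCharArray(); // buffer being scanned (input is assumed)
int p = 0;                         // current position in the buffer
int pe = data.length;              // buffer end
int eof = pe;                      // end-of-input marker, required by scanners
int ts, te;                        // token start / token end, written by the scanner
int act;                           // identity of the last matched pattern
int cs;                            // current machine state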

Looks like your issue is that all valid identifiers are also valid strings; you just want the input to be interpreted as an identifier when possible. You can force the machine to accept an identifier first by embedding a priority in the leaving action for ident, which overrides all transitions for string:
main := ( ident %(ident_vs_string, 1) % do_ident | string $(ident_vs_string, 0) % do_string )
This ensures that the leaving transition following a complete ident takes precedence, stopping the machine from exploring either continuing or leaving the string.
Be careful with how this combined expression is terminated. Whatever expression follows the identifier/string must start with a character not permissible in either, so that the exit transitions are well defined.

Related

unable to write lexer to parse this

I am designing my own data format:
-key=value-123
It is "DASH KEY EQUAL IDENTIFIER". The problem is that the identifier also contains the dash, so it eats all the characters. Please help.
DASH : '-';
EQUAL : '=';
IDENTIFIER : [a-zA-Z0-9 -_<>#:\\.#()/]+;
thanks
Peter
This is how ANTLR lexers work. If multiple lexer rules match an input stream of characters, the rule with the longest match wins (when the lengths are equal, the first rule wins). Since your IDENTIFIER rule includes '-' but excludes '=', ANTLR will create the longer IDENTIFIER token. You won't be able to get a match for DASH unless your input begins with "-=" (of course, then there'd be no IDENTIFIER).
If you are designing your own format, you could choose to disallow "-" in IDENTIFIERs and you should be good to go.
Is this the full picture of what you are attempting to parse, or just a small subset? If this is the full picture, then you'd be able to easily "parse" this with a regex and capture groups; ANTLR would be overkill.
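For instance, a minimal sketch of that regex approach in Java (the pattern and the class name are illustrative, not part of the question):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeyValueDemo {
    public static void main(String[] args) {
        // "-key=value-123" -> capture the key and the value with groups
        Pattern p = Pattern.compile("-([^=]+)=(.+)");
        Matcher m = p.matcher("-key=value-123");
        if (m.matches()) {
            System.out.println("key = " + m.group(1)); // prints "key"
            System.out.println("val = " + m.group(2)); // prints "value-123"
        }
    }
}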
You could take the following approach if you really have to have a DASH in your identifiers:
1 - remove the "-" from the IDENTIFIER lexer rule (we'll call it ID), and handle the full identifier in an identifier parser rule:
keyValue : DASH key=identifier EQUAL val=identifier;
identifier: ID (DASH ID)*;
DASH : '-';
EQUAL : '=';
ID : [a-zA-Z0-9 _<>#:\\.#()/]+;
In a listener (or visitor) for the IdentifierContext (e.g., enterIdentifier/exitIdentifier in a listener), you can call ctx.getText() to get the full text of the identifier rule.
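A minimal sketch of such a listener, assuming hypothetical generated names KeyValueBaseListener and KeyValueParser:
// Prints the full text of each identifier rule as it is exited.
public class IdentifierListener extends KeyValueBaseListener {
    @Override
    public void exitIdentifier(KeyValueParser.IdentifierContext ctx) {
        // getText() concatenates all children of the rule: ID (DASH ID)* -> "value-123"
        System.out.println("identifier: " + ctx.getText());
    }
}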

Is it possible to interpolate Array values in token?

I'm working on a homoglyphs module and I have to build a regular expression that can find homoglyphed text corresponding to an ASCII equivalent.
So, for example, I have a character with no homoglyph alternatives:
my $f = 'f';
and character that can be obfuscated:
my @o = 'o', 'о', 'ο'; # ASCII o, Cyrillic o, Greek omicron
I can easily build regular expression that will detect homoglyphed phrase 'foo':
say 'Suspicious!' if $text ~~ / $f @o @o /;
But how should I compose such a regular expression if I don't know the values to detect at compile time? Let's say I want to detect phishing that contains the homoglyphed word 'cash' in messages. I can build a sequence with all the alternatives:
my @lookup = ['c', 'с', 'ϲ', 'ς'], ['a', 'а', 'α'], 's', 'h'; # arbitrary runtime length
Now, obviously, the following solution cannot "unpack" the array elements into the regular expression:
/ @lookup / # doing LTM, not searching elements in sequence
I can work around this by manually quoting each element and composing a text representation of the alternatives, to get a string that can be evaluated as a regular expression, and then build a token from that using string interpolation:
my $regexp-ish = textualize( @lookup ); # string "[ 'c' | 'с' | 'ϲ' | 'ς' ] [ 'a' | 'а' | 'α' ] 's' 'h'"
my $token = token { <$regexp-ish> }
But that is quite error-prone.
Is there any cleaner solution for composing a regular expression on the fly from an arbitrary number of elements not known at compile time?
The Unicode::Security module implements confusables by using the Unicode consortium tables. It's actually not using regular expressions, just looking up different characters in those tables.
I'm not sure this is the best approach to use.
I haven't implemented a confusables[1] module yet in Intl::, though I do plan on getting around to it eventually. Here are two different ways I could imagine a token looking.[2]
my token confusable($source) {
    :my $i = 0;                                     # create a counter var
    [
        <?{                                         # succeed only if
            my $a = self.orig.substr: self.pos + $i, 1; # the test character A
            my $b = $source.substr: $i++, 1;            # and the source character B
            so $a eq $b                                 # are the same, or
            || $a eq %*confusables{$b}.any;             # A is one of B's confusables
        }>
        .                                           # because we succeeded, consume a char
    ] ** {$source.chars}                            # repeat for each grapheme in the source
}
Here I used the dynamic hash %*confusables, which would be populated in some way; that will depend on your module, and it may not even necessarily be dynamic (for example, it could have the signature :($source, %confusables) or reference a module variable, etc.).
You can then have your code work as follows:
say $foo ~~ /<confusable: 'foo'>/
This is probably the best way to go about things, as it will give you a lot more control. I took a peek at your module and it's clear you want to enable 2-to-1 glyph relationships, and eventually you'll probably want to be running code directly over the characters.
If you are okay with just 1-to-1 relationships, you can go with a much simpler token:
my token confusable($source) {
    :my @chars = $source.comb;         # split the source
    @(                                 # match the array based on
        |(                             # a slip of
            %confusables{@chars.head}  # the confusables
            // Empty                   # (or nothing, if none)
        ),                             #
        @chars.shift                   # and the char itself
    )                                  #
    ** {$source.chars}                 # repeating for each source char
}
The @(…) structure lets you effectively create an ad hoc array to be interpolated. In this case, we just slip in the confusables with the original, and that's that. You have to be careful, though, because a non-existent hash item will return the type object (Any), and that messes things up here (hence // Empty).
In either case, you'll want to use arguments with your token, as constructing regexes on the fly is fraught with potential gotchas and interpolation errors.
[1] Unicode calls homographs both "visually similar characters" and "confusables".
[2] The dynamic hash %*confusables here could be populated any number of ways, and may not necessarily need to be dynamic, as it could be populated via the arguments (using a signature like :($source, %confusables)) or reference a module variable.

antlr4: need to convert sequences of symbols to characters in lexer

I am writing a parser for the Wolfram Language. The language has a concept of "named characters", which are specified by a name delimited by \[ and ]. For example: \[Pi].
Suppose I want to specify a regular expression for an identifier. Identifiers can include named characters. I see two ways to do it: one is to have a preprocessor that converts all named characters to their Unicode representation, and the other is to enumerate all possible named characters in their source form as part of the regular expression.
The second approach does not seem feasible because there are a lot of named characters. I would prefer to have ranges of unicode characters in my regex.
So I want to preprocess my token stream. In other words, it seems to me that the lexer needs to check that the named-character syntax is correct, then look up the name and convert it to Unicode.
But if the syntax is incorrect or the name does not exist, I need to tell the user about it. How do I propagate this error to the user and still let antlr4 recover from the error and resume? Maybe I can somehow "pipe" lexers/parsers? (I am new to antlr.)
EDIT:
In Wolfram Language I can have this string as an identifier: \[Pi]Squared. The part between the brackets is called a "named character". There is a limited set of named characters, each of which corresponds to a Unicode code point. I am trying to figure out how to tokenize identifiers like this.
I could have a rule for my token like this (simplified to just a combination of named characters and ASCII characters):
NAME : ('\\[' [a-z]+ ']'|[a-zA-Z])+ ;
but I would like to check whether the named character actually exists (and other attributes, such as whether it is a letter, though that part is outside the scope of the question), so this regex won't work.
I considered making a list of allowed named characters and just making a long regex that enumerates all of them, but this seems ugly.
What would be a good approach to this?
END OF EDIT
A common approach is to write the lexer/parser to allow syntactically correct input and defer semantic issues to the analysis of the generated parse tree. In this case, the lexer can naively accept named characters:
NChar : NCBeg .*? RBrack ;
fragment NCBeg : '\\[' ;
fragment LBrack: '[' ;
fragment RBrack: ']' ;
Update
In the parser, allow NChar tokens to exist in the parse tree as discrete terminal nodes:
idents : ident+ ;
ident : NChar // named character string
| ID // simple character string?
| Literal // something quoted?
| ....
;
This makes analysis of the parse tree considerably easier: each ident context will contain only one non-null value for a discretely identifiable alt, and all ordering issues are isolated to the idents context.
Update2
For an input \[Pi]Squared, the parse tree form that would be easiest to analyze would be an idents node with two well-ordered children, \[Pi] and Squared.
Best practice would be not to pack both children into the same token; you would just have to manually break the token text into the two parts later to check whether it contains a valid named character and whether the particular sequence of parts is allowable.
No regex is going to allow conclusive verification of the named characters. That will require a list. Tightening the lexer definition of an NChar can, however, achieve a result equivalent to a regex:
NChar : NCBeg [A-Z][A-Za-z]+ RBrack ;
If the concern is that there might be a space after the named character, consider that this circumstance is likely better treated with a semantic warning as opposed to a syntactic error. Rather than skipping whitespace in the lexer, put the whitespace on the hidden channel. Then, in the verification analysis of each idents context, check the hidden channel for intervening whitespace and issue a warning as appropriate.
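A sketch of such a whitespace rule (standard ANTLR4 channel syntax; the character set shown is illustrative):
WS : [ \t\r\n]+ -> channel(HIDDEN) ; // kept in the token stream, but off the default channel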
----
A parse-tree visitor can then examine, validate, and warn as appropriate regarding unknown or misspelled named characters.
To do the validation in the parser, if more desirable, use a predicated rule to distinguish known from unknown named characters:
@members {
ArrayList<String> keyList = .... // list of named chars
public boolean inList(String id) {
return keyList.contains(id) ;
}
}
nChar : known
| unknown
;
known : NChar { inList($NChar.getText()) }? ;
unknown : NChar { error("Unknown " + $NChar.getText()); } ;
The inList function could implement a distance metric to detect misspellings, but correcting the text directly in the parse tree is a bit complex; it is easier to do as a parse-tree decoration during a visitor operation.
Finally, scraping and munging the named characters into a usable map (both Unicode and ASCII) is likely worthwhile for handling both representations, as well as conversions and misspellings.
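For instance, a minimal sketch of such a map in Java (the entries are illustrative; the real table would be generated from the official named-character list):
import java.util.Map;

public class NamedChars {
    // Hypothetical lookup from named-character name to its Unicode string.
    static final Map<String, String> NAMED = Map.of(
        "Pi",     "\u03C0", // \[Pi]
        "Alpha",  "\u03B1", // \[Alpha]
        "Degree", "\u00B0"  // \[Degree]
    );

    static String resolve(String name) {
        return NAMED.get(name); // null signals an unknown named character
    }
}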

Grammar rule for sequence of blocks of digits that increase in length each repetition

I want to generate a parsing rule (using ANTLR 4) that defines a repeating chain of binary blocks separated by ':'.
Each block has one digit more than the previous block, starting with two digits. For example:
01:010:0001:01010 ...
The chain can have an arbitrary number of these blocks.
Right now I have defined the rule as:
BIN : [0-1]+ ;
connections : BIN (':' BIN)* ;
I know how to make it check that each block has at least two binary digits, but not that each block has the correct number.
Is there any way to make it more specific, using ANTLR?
With a semantic predicate it would look similar to this:
connections locals [int i] :
{$i = 2;} BIN {check}? ({$i++;} ':' BIN {check}?)* ;
BIN :
[0-1]+ ;
where check is $BIN.getText().length() == $i (replace both occurrences of check to make the grammar work).
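Spelled out with the predicate substituted (assuming a Java host), the rule might look like:
connections locals [int i] :
    {$i = 2;} BIN {$BIN.getText().length() == $i}?
    ({$i++;} ':' BIN {$BIN.getText().length() == $i}?)* ;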
Another option would be to generate a parse-tree visitor and validate the BIN nodes while traversing the parse tree.
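A sketch of that visitor approach, assuming hypothetical generated names ConnectionsParser and ConnectionsBaseVisitor:
import org.antlr.v4.runtime.tree.TerminalNode;

// Validates block lengths after parsing instead of during it.
public class BlockLengthVisitor extends ConnectionsBaseVisitor<Void> {
    @Override
    public Void visitConnections(ConnectionsParser.ConnectionsContext ctx) {
        int expected = 2; // the first block must have two digits
        for (TerminalNode bin : ctx.BIN()) {
            if (bin.getText().length() != expected) {
                System.err.println("block '" + bin.getText()
                        + "' should have " + expected + " digits");
            }
            expected++;
        }
        return null;
    }
}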

Why do I have a shift/reduce conflict on the ')' and not the '('?

I have syntax like
%(var)
and
%var
and
(var)
My rules are something like
optExpr:
| '%' '(' CommaLoop ')'
| '%' CommaLoop
CommaLoop:
val | CommaLoop ',' val
Expr:
MoreRules
| '(' val ')'
The problem is that it doesn't seem to be able to tell whether the ) belongs to %(CommaLoop) or to % (val), but it complains on the ) instead of the (. What the heck? Shouldn't it complain on the (? And how should I fix the error? I think making %( a token is a good solution, but I want to be sure why the ( isn't where the error is reported before doing this.
This is due to the way LR parsing works. LR parsing is effectively bottom-up, grouping together tokens according to the RHS of your grammar rules, and replacing them with the LHS. When the parser 'shifts', it puts a token on the stack, but doesn't actually match a rule yet. Instead, it tracks partially matched rules via the current state. When it gets to a state that corresponds to the end of the rule, it can reduce, popping the symbols for the RHS off the stack and pushing back a single symbol denoting the LHS. So if there are conflicts, they don't show up until the parser gets to the end of some rule and can't decide whether to reduce (or what to reduce).
In your example, after seeing % ( val, that is what will be on the stack (the top is at the right side here). When the lookahead is ), it can't decide whether it should pop the val and reduce via the rule CommaLoop: val, or shift the ) so it can then pop three things and reduce with the rule Expr: '(' val ')'.
I'm assuming here that you have some additional rules such as CommaLoop: Expr, otherwise your grammar doesn't actually match anything and bison/yacc will complain about unused non-terminals.
Right now, your explanation and your grammar don't seem to match. In your explanation, you show all three phrases as having 'var', but your grammar shows the ones starting with '%' as allowing a comma-separated list, while the one without allows only a single 'val'.
For the moment, I'll assume all three should allow a comma-separated list. In this case, I'd factor the grammar more like this:
optExpr: '%' aList
aList: CommaLoop
| parenList
parenList: '(' CommaLoop ')'
CommaLoop: val
| CommaLoop ',' val
Expr: MoreRules
| parenList
I've changed optExpr and Expr so that neither can match an empty sequence; my guess is you probably didn't intend that to start with. I've fleshed this out enough to run it through byacc, and it produces no warnings or errors.