Antlr Lexer exclude a certain pattern - antlr

In Antlr Lexer, How can I achieve parsing a token like this:
A word that contains any non-space letter but not '.{' inside it. Best I can come up with is using a semantics predicate.
WORD: WL+ {!getText().contains(".{")};
WL: ~[ \n\r\t];
I'm a bit worried to use semantics predicate though cause WORD here will be lexed millions of times I would think to put a semantics predicate will hit the performance.
This is coming from the requirement that I need to parse something like:
TOKEN_ONE.{TOKEN_TWO}
while TOKEN_ONE can include . and { in its letter.
I'm using Antlr 4.

You need to limit your predicate evaluation to the case immediately following a . in the input.
WORD
: ( ~[. \t\r\n]
| '.' {_input.LA(1)!='{'}?
)+
;

How about rephrasing your question to the equivalent "A word contains any character except whitespace or dot or left brace-bracket."
Then the lexer rule is just:
WORD: ~[ \n\r\t.{]*

Related

Optional Prefix in ANTLR parser/lexer

I'm trying to use ANTLR4 to parse input strings that are described by a grammar like:
grammar MyGrammar;
parse : PREFIX? SEARCH;
PREFIX
: [0-9]+ ':'
;
SEARCH
: .+
;
e.g. valid input strings include:
0: maracujá
apple
3:€53.60
1: 10kg
2:chilli pepper
But the SEARCH rule always matches the whole string - whether it has a prefix or not.
I understand this is because the ANTLR4 lexer gives preference to the rules that match the longest string. Therefore the SEARCH rule matches all input, not giving the PREFIX rule a chance.
And the non-greedy version (i.e. SEARCH : .+? ;) has the same problem because (as I understand) it's only non-greedy within the rule - and the SEARCH rule doesn't have any other parts to constrain it.
If it helps, I could constrain the SEARCH text to exclude ':' but I really would prefer it recognise anything else - unicode characters, symbols, numbers, space etc.
I've read Lexer to handle lines with line number prefix but in that case, the body of the string (after the prefix) is significantly more constrained.
Note: SEARCH text might have a structure to it - like €53.00 and 10kg above (which I'd also like ANTLR4 to parse) or it might just be free text - like apple, maracujá and chilli pepper above. But I've tried to simplify so I can solve the problem of extracting the PREFIX first.
ANTLR does lexing before parsing. The lexer prefers long matches and SEARCH tokens match every PREFIX token and even any character appended to it, so your complete line is matched by SEARCH.
To prevent this: Keep the lexer rules disjunct, or at least the tokens should not subsume each other.
parse : prefix? search;
search: (WORD | NUMBER)+;
prefix: NUMBER ':';
NUMBER : [0-9]+;
WORD : (~[0-9:])+;

ANTLR with non-greedy rules

I would like to have the following grammar (part of it):
expression
:
expression 'AND' expression
| expression 'OR' expression
| StringSequence
;
StringSequence
:
StringCharacters
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~["\]
| EscapeSequence
;
It should match things like "a b c d f" (without the quotes), as well as things like "a AND b AND c".
The problem is that my rule StringSequence is greedy, and consumes the OR/AND as well. I've tried different approaches but couldn't get my grammar to work in the correct way. Is this possible with ANTLR4? Note that I don't want to put quotes around every string. Putting quotes works fine because the rule becomes non greedy, i.e.:
StringSequence
: '"' StringCharacters? '"'
;
You have no whitespace rule so StringCharacter matches everything except quote and backslash chars (+ the escape sequenc). Include a whitespace rule to make it match individual AND/OR tokens. Additionally, I recommend to define lexer rules for string literals ('AND', 'OR') instead of embedding them in the (parser) rule(s). This way you not only get speaking names for the tokens (instead of auto generated ones) but you also can better control the match order.
Yet a naive solution:
StringSequence :
(StringCharacter | NotAnd | NotOr)+
;
fragment NotAnd :
'AN' ~'D'
| 'A' ~'N'
;
fragment NotOr:
'O' ~('R')
;
fragment StringCharacter :
~('O'|'A')
;
Gets a bit more complex with Whitespace rules. Another solution would be with semantic predicates looking ahead and preventing the read of keywords.

Writing parser rules sensitive to whitespace while skipping WS from the lexer

I am having some troubles in handling whitespace. In the following excerpt of a grammar, I set up the lexer so that the parser skips whitespace:
ENTITY_VAR
: 'user'
| 'resource'
;
INT : DIGIT+ | '-' DIGIT+ ;
ID : LETTER (LETTER | DIGIT | SPECIAL)* ;
ENTITY_ID : '__' ENTITY_VAR ('_w_' ID)?;
NEWLINE : '\r'? '\n';
WS : [ \t\r\n]+ -> skip; // skip spaces, tabs, newlines
fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];
fragment SPECIAL : ('_' | '#' );
The problem is, I would like to match against variables names of the form ENTITY_ID such that the matched string does not have any whitespace. It would be sufficient to write it as a lexer rule as I did here, but the thing is that I'd like to do it with a parser rule instead, because I want to have direct access to those two tokens ENTITY_VAR and ID individually from my code, and not squeeze them back together in a whole token ENTITY_ID.
Any ideas, please?
Basically any solution which let me access directly ENTITY_VAR and ID would suit me, both by leaving ENTITY_ID as a lexer rule or moving it to the parser.
There are several approaches I can think of (not in a special order):
Emit several tokens from the rule ENTITY_ID. See ANTLR4: How to inject tokens for an inspiration
Allow whitespace in the parser and check afterwards
Use the single token and split in code
Use the single token and modify the token stream before passing it to the parser. I.e. lex, modify the ENTITY_ID tokens and split them into several other tokens, then pass this stream to the parser
Don't skip whitespace and when dealing with these "extra tokens" check if they are within a ENTITY_ID part (=> is error) or not (=> ignore error).
Don't skip whitespace and add "WS*" everywhere in your grammar where whitespace is allowed (ok if the grammar is not too large).
Insert predicates in the parser rule that checks if there is whitespace between.
Create a "trap" rule like this:
INVALID_ENTITY_ID : '__' WS+ ENTITY_VAR WS? ('_w_' WS? ID)?
| '__' WS? ENTITY_VAR WS+ ('_w_' WS? ID)?
| '__' WS? ENTITY_VAR WS? ('_w_' WS+ ID)
;
This will catch invalid ENTITY_IDs since it's longer than the parts that will then be also individual tokens.
I'd go with 2, if it doesn't alter the parse in the "non error" case, i.e. no code is interpreted differently by allowing whitespace.
As far as I managed to understand by browsing the documentation, it doesn't look like something like that is feasible.
Parser rules seem to work just on the default channel, so I can't send WS to channel(HIDDEN) and then recover it just for a single parser rule.
On the other hand, an author of antlr explains here that it's not possible to break down any token since version 4.
Even though I don't like it at all, it seems that the fastest way is to parse it from the lexer (as in the code from the question), only to get to re-parse it again from Java the whole string.
Still, any other better option or correction to my conclusions is welcome.
Hooking two parsers in a sort of pipeline, as your own answer suggets, is a sound and simple design/solution, and I'm pretty sure ANTLR is capable of helping with that.
I don't know far the ANTLR folks have gone in their work on stream/feed parsing. But, adopting a two-pass strategy should be efficient enough as the first pass would be just lexing a regular language, which is O(c * N) over the size of the input with a very small c.
If you want a single pass that costs O(k * N) (with a large k), you could consider PEG, for which there are implementations in Java (which I haven't tried).

Does logical AND and NOT exists in ANTLR?

Is there NOT logic in ANTLR? Im basically trying to negate a rule that i have and was wondering if its possible, also is there AND logic?
#larsmans already supplied the answer, I just like to give an example of the legal negations in ANTLR rules (since it happens quite a lot that mistakes are made with them).
The negation operator in ANTLR is ~ (tilde). Inside lexer rules, the ~ negates a single character:
NOT_A : ~'A';
matches any character except 'A' and:
NOT_LOWER_CASE : ~('a'..'z');
matches any character except a lowercase ASCII letter. The lats example could also be written as:
NOT_LOWER_CASE : ~LOWER_CASE;
LOWER_CASE : 'a'..'z';
As long as you negate just a single character, it's valid to use ~. It is invalid to do something like this:
INVALID : ~('a' | 'aa');
because you can't negate the string 'aa'.
Inside parser rules, negation does not work with characters, but on tokens. So the parse rule:
parse
: ~B
;
A : 'a';
B : 'b';
C : 'c';
does not match any character other than 'b', but matches any token other than the B token. So it'd match either token A (character 'a') or token C (character 'c').
The same logic applies to the . (DOT) operator:
inside lexer rules it matches any character from the set \u0000..\uFFFF;
inside parser rules it matches any token (any lexer rule).
ANTLR produces parsers for context-free languages (CFLs). In that context, not would translate to complement and and to intersection. However, CFLs aren't closed under complement and intersection, i.e. not(rule) is not necessarily a CFG rule.
In other words, it's impossible to implement not and and in a sane way, so they're not supported.

How can I construct a clean, Python like grammar in ANTLR?

G'day!
How can I construct a simple ANTLR grammar handling multi-line expressions without the need for either semicolons or backslashes?
I'm trying to write a simple DSLs for expressions:
# sh style comments
ThisValue = 1
ThatValue = ThisValue * 2
ThisOtherValue = (1 + 2 + ThisValue * ThatValue)
YetAnotherValue = MAX(ThisOtherValue, ThatValue)
Overall, I want my application to provide the script with some initial named values and pull out the final result. I'm getting hung up on the syntax, however. I'd like to support multiple line expressions like the following:
# Note: no backslashes required to continue expression, as we're in brackets
# Note: no semicolon required at end of expression, either
ThisValueWithAReallyLongName = (ThisOtherValueWithASimilarlyLongName
+AnotherValueWithAGratuitouslyLongName)
I started off with an ANTLR grammar like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL!?
;
empty_line
: NL;
assignment
: ID '=' expr
;
// ... and so on
It seems simple, but I'm already in trouble with the newlines:
warning(200): StackOverflowQuestion.g:11:20: Decision can match input such as "NL" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Graphically, in org.antlr.works.IDE:
Decision Can Match NL Using Multiple Alternatives http://img.skitch.com/20090723-ghpss46833si9f9ebk48x28b82.png
I've kicked the grammar around, but always end up with violations of expected behavior:
A newline is not required at the end of the file
Empty lines are acceptable
Everything in a line from a pound sign onward is discarded as a comment
Assignments end with end-of-line, not semicolons
Expressions can span multiple lines if wrapped in brackets
I can find example ANTLR grammars with many of these characteristics. I find that when I cut them down to limit their expressiveness to just what I need, I end up breaking something. Others are too simple, and I break them as I add expressiveness.
Which angle should I take with this grammar? Can you point to any examples that aren't either trivial or full Turing-complete languages?
I would let your tokenizer do the heavy lifting rather than mixing your newline rules into your grammar:
Count parentheses, brackets, and braces, and don't generate NL tokens while there are unclosed groups. That'll give you line continuations for free without your grammar being any the wiser.
Always generate an NL token at the end of file whether or not the last line ends with a '\n' character, then you don't have to worry about a special case of a statement without a NL. Statements always end with an NL.
The second point would let you simplify your grammar to something like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL
;
empty_line
: NL
;
assignment
: ID '=' expr
;
How about this?
exprlist
: (expr)? (NL+ expr)* NL!? EOF!
;
expr
: assignment | ...
;
assignment
: ID '=' expr
;
I assume you chose to make NL optional, because the last statement in your input code doesn't have to end with a newline.
While it makes a lot of sense, you are making life a lot harder for your parser. Separator tokens (like NL) should be cherished, as they disambiguate and reduce the chance of conflicts.
In your case, the parser doesn't know if it should parse "assignment NL" or "assignment empty_line". There are many ways to solve it, but most of them are just band-aides for an unwise design choice.
My recommendation is an innocent hack: Make NL mandatory, and always append NL to the end of your input stream!
It may seem a little unsavory, but in reality it will save you a lot of future headaches.