ANTLR reports error and I think it should be able to resolve input with backtracking - antlr

I have a simple grammar that works for the most part, but at one place it reports error and I think it shouldn't, because it can be resolved using backtracking.
Here is the portion that is problematic.
command: object message_chain;
object: ID;
message_chain: unary_message_chain keyword_message?
| binary_message_chain keyword_message?
| keyword_message;
unary_message_chain: unary_message+;
binary_message_chain: binary_message+;
unary_message: ID;
binary_message: BINARY_OPERATOR object;
keyword_message: (ID ':' object)+;
This is simplified version, object is more complex (it can be result of other command, raw value and so on, but that part works fine). Problem is in message_chain, in first alternative. For input like obj unary1 unary2 it works fine, but for intput like obj unary1 unary2 keyword1:obj2 is trys to match keyword1 as unary message and fails when it reaches :. I would think that it this situation parser would backtrack and figure that there is : and recognize that that is keyword message.
If I make keyword message non-optional it works fine, but I need keyword message to be optional.
Parser finds keyword message if it is in second alternative (binary_message) and third alternative (just keyword_message). So something like this gives good results: 1 + 2 + 3 Keyword1:Value
What am I missing? Backtracking is set to true in options and it works fine in other cases in the same grammar.
Thanks.

This is not really a case for PEG-style backtracking, because upon failure that returns to decision points in uncompleted derivations only. For input obj unary1 unary2 keyword1:obj2, with a single token lookahead, keyword1 could be consumed by unary_message_chain. The failure may not occur before keyword_message, and next to be tried would be the second alternative of message_chain, i.e. binary_message_chain, thus missing the correct parse.
However as this grammar is LL(2), it should be possible to extend lookahead to avoid consuming keyword1 from within unary_message_chain. Have you tried explicitly setting k=2, without backtracking?

Related

Difficulty writing PEG recursive expression grammar with Arpeggio

My input text might have a simple statement like this:
aircraft
In my language I call this a name which represents a set of instances with various properties.
It yields an instance_set of all aircraft instances in this example.
I can apply a filter in parenthesis to any instance_set:
aircraft(Altitude < ceiling)
It yields another, possibly reduced instance_set.
And since it is an instance set, I can filter it yet again:
aircraft(Altitude < ceiling)(Speed > min_speed)
I foolishly thought I could do something like this in my grammar:
instance_set = expr
expr = source / instance_set
source = name filter?
It parses my first two cases correctly, but chokes on the last one:
aircraft(Altitude < ceiling)(Speed > min_speed)
The error reported being just before the second open paren.
Why doesn't Arpeggio see that there is just a filtered instance_set which is itself a filtered instance set?
I humbly submit my appeal to the peg parsing gods and await insight...
Your first two cases both match source. Once source is matched, it's matched; that's the PEG contract. So the parser isn't going to explore the alternative.
But suppose it did. How could that help? The rule says that if an expr is not a source, then it's an instance_set. But an instance_set is just an expr. In other words, an expr is either a source or it's an expr. Clearly the alternative doesn't get us anywhere.
I'm pretty sure Arpeggio has repetitions, which is what you really want here:
source = name filter*

Antlr4 extremely simple grammar failing

Antlr4 has always been a kind of love-hate relationship for me, but I am currently a bit perplexed. I have started creating a grammar to my best knowledge and then wanted to test it and it didnt work at all. I then reduced it a lot to just a bare minimum example and I managed to make it not work. This is my grammar:
grammar SwiftMtComponentFormat;
separator : ~ZERO EOF;
ZERO : '0';
In my understanding it should anything except a '0' and then expect the end of the file. I have been testing it with the single character input '1' which I had expected to work. However this is what happens:
If i change the ~ZEROto ZERO and change my input from 1 to 0 it actually perfectly matches... For some reason the simple negation does not seem to work. I am failing to understand what the reason here is...
In a parser rule ~ZERO matches any token that is not a ZERO token. The problem in your case is that ZERO is the only type of token that you defined at all, so any other input will lead to a token recognition error and not get to the parser at all. So if you enter the input 1, the lexer will discard the 1 with a token recognition error and the parser will only see an empty token stream.
To fix this, you can simply define a lexer rule OTHER that matches any character not matched by previous lexer rules:
OTHER: .;
Note that this definition has to go after the definition of ZERO - otherwise it would match 0 as well.
Now the input 1 will produce an OTHER token and ~ZERO will match that token. Of course, you could now replace ~ZERO with OTHER and it wouldn't change anything, but once you add additional tokens, ~ZERO will match those as well whereas OTHER would not.

ANTLR4 - replace op boundaries error|How to use TokenStreamRewriter to transform text from two listener events on overlapping tokens in original AST?

Hello ANTLR creators/users,
Some context - I am using PlSql ANTLR4 parser to do some lightweight transpiling of some queries from oracle sql to, let's say, spark sql. I have my listener class setup which extends the base listener.
Example of an issue -
Let's say the input is something like -
SELECT to_char(to_number(substr(ATTRIBUTE_VALUE,1,4))-3)||'0101') from xyz;
Now, I'd like to replace || with CONCAT and to_char with CAST as STRING, so that the final query looks like -
SELECT CONCAT(CAST(to_number(substr(ATTRIBUTE_VALUE,1,4))-3) as STRING),'0101') from xyz;
In my listener class, I am overriding two functions from base listener to do this - concatenation and string_function. In those, I am using a tokenStreamRewriter's replace to make the necessary transformation. Since tokenStreamRewriter is evaluated lazily, I am running to issue ->
java.lang.IllegalArgumentException: replace op boundaries of
<ReplaceOp#[#38,228:234='to_char',<2193>,3:15]..[#53,276:276=')',
<2214>,3:63]:"CAST (to_number(substr(ATTRIBUTE_VALUE,1,4))-3 as STRING)">
overlap with previous <ReplaceOp#[#38,228:234='to_char',<2193>,3:15]..
[#56,279:284=''0101'',<2209>,3:66]:"CONCAT
(to_char(to_number(substr(ATTRIBUTE_VALUE,1,4))-3),'0101')">
Clearly, the issue is my two listener functions attempting to replace/transform text on overlapping boundaries.
Is there any work around for territory overlap kind of issues for ANTLR4? I'm sure folks run into such stuff all the time probably.
I'd appreciate any workarounds, even dirty ones at this point of time :)
I did realize that ANTLR4 does not allow us to modify original AST, otherwise this would have been a little bit easier to solve.
Thanks!
A look at how tokenstreamrewriter works leads to the following understanding:
first, a list of all modification operations are built
then, you invoke getText()
here, there is a reduction of modification operations. The idea for example is to merge multiple insert together in one reduction. Its role is also to avoid multiple replace on same data (but i will expand on this point later).
every token is then read, in the case there is a modification listed for the said token index, TokenStreamRewriter do the operation, otherwise it just pop the read token.
Let's have a look on how modification operations are implemented:
for insert, tokenstream rewriter basically just adds the string to be added at the current token index, and then do an index+1, effectively going to next token
for replace, tokenstream rewriter replace a range of tokens with the new string, and set the new index to the end of this range.
So, for tokenstreamrewriter, overlapping replaces are not possible, as when you replace you jump to the end of the range of tokens to be replaced. Especially, in the case you remove the checks of overlapping, then only the first replace will be operated, as afterwards, the token index is past the other replaces.
Basically, this has been done because there is no way to tell easily what tokens should be replaced while using overlapping replaces. You would need for that symbol recognition and matching.
So, what you are trying to do is the following (for each step, the part between '*' is what is modified):
*SELECT to_char(to_number(substr(ATTRIBUTE_VALUE,1,4))-3)||'0101')* from xyz;
|
V
CONCAT (*to_char(to_number(substr(ATTRIBUTE_VALUE,1,4))-3)*,'0101') from xyz;
|
V
SELECT CONCAT(CAST(to_number(substr(ATTRIBUTE_VALUE,1,4))-3) as STRING),'0101') from xyz;
to achieve your transformation, you could do so a replace of :
'to_char' -> 'CONCAT(CAST'
'||' -> ' as STRING),'
And, by using a bit of intelligence while parsing your tokens, like is there a '||' in my tokens to know if it's string, you would know what to replace.
regards
The way I solve it in multiple projects based on ANTLR is this: I translated ANTLR parse-tree to an AST written using Kolasu, an open-source library we developed at Strumenta.
Kolasu has all sort of utilities to process and mutate ASTs. For all non-trivial projects I end up doing transformations on the AST.
Kolasu

ANTLR recognize single character

I'm pretty sure this isn't possible, but I want to ask just in case.
I have the common ID token definition:
ID: LETTER (LETTER | DIG)*;
The problem is that in the grammar I need to parse, there are some instructions in which you have a single character as operand, like:
a + 4
but
ab + 4
is not possible.
So I can't write a rule like:
sum: (INT | LETTER) ('+' (INT | LETTER))*
Because the lexer will consider 'a' as an ID, due to the higher priority of ID. (And I can't change that priority because it wouldn't recognize single character IDs then)
So I can only use ID instead of LETTER in that rule. It's ugly because there shouldn't be an ID, just a single letter, and I will have to do a second syntactic analysis to check that.
I know that there's nothing to do about it, since the lexer doesn't understand about context. What I'm thinking that maybe there's already built-in ANTLR4 is some kind of way to check the token's length inside the rule. Something like:
sum: (INT | ID{length=1})...
I would also like to know if there are some kind of "token alias" so I can do:
SINGLE_CHAR is alias of => ID
In order to avoid writing "ID" in the rule, since that can be confusing.
PD: I'm not parsing a simple language like this one, this is just a little example. In reality, an ID could also be a string, there are other tokens which can only be a subset of letters, etc... So I think I will have to do that second analysis anyways after parsing the entry to check that syntactically is legal. I'm just curious if something like this exists.
Checking the size of an identifier is a semantic problem and should hence be handled in the semantic phase, which usually follows the parsing step. Parse your input with the usual ID rule and check in the constructed parse tree the size of the recognized ids (and act accordingly). Don't try to force this kind of decision into your grammar.

SonarLint - questions about some of the rules for VB.NET

The large majority of SonarLint rules that I've come across in Java seemed plausible and justified. However, ever since I've started using SonarLint for VB.NET, I've come across several rules that left me questioning their usefulness or even whether or not they are working correctly.
I'd like to know if this is simply a problem of me using some VB.NET constructs in a suboptimal way or whether the rule really is flawed.
(Apologies if this question is a little longer. I didn't know if I should create a separate question for each individual rule.)
The following rules I found to leave some cases unconsidered that would actually turn up as false-positives:
S1871: Two branches in the same conditional structure should not have exactly the same implementation
I found this one to bring up a lot of false-positives for me, because sometimes the order in which the conditions are checked actually does matter. Take the following pseudo code as example:
If conditionA() Then
doSomething()
ElseIf conditionB() AndAlso conditionC() Then
doSomethingElse()
ElseIf conditionD() OrElse conditionE() Then
doYetAnotherThing()
'... feel free to have even more cases in between here
Else Then
doSomething() 'Non-compliant
End If
If I wanted to follow this Sonar rule and still make the code behave the same way, I'd have to add the negated version of each ElseIf-condition to the first If-condition.
Another example would be the following switch:
Select Case i
Case 0 To 40
value = 0
Case 41 To 60
value = 1
Case 61 To 80
value = 3
Case 81 To 100
value = 5
Case Else
value = 0 'Non-compliant
There shouldn't be anything wrong with having that last case in a switch. True, I could have initialized value beforehand to 0 and ignored that last case, but then I'd have one more assignment operation than necessary. And the Java ruleset has conditioned me to always put a default case in every switch.
S1764: Identical expressions should not be used on both sides of a binary operator
This rule does not seem to take into account that some functions may return different values every time you call them, for instance collections where accessing an element removes it from the collection:
stack.Push(stack.Pop() / stack.Pop()) 'Non-compliant
I understand if this is too much of an edge case to make special exceptions for it, though.
The following rules I am not actually sure about:
S3385: "Exit" statements should not be used
While I agree that Return is more readable than Exit Sub, is it really bad to use a single Exit For to break out of a For or a For Each loop? The SonarLint rule for Java permits the use of a single break; in a loop before flagging it as an issue. Is there a reason why the default in VB.NET is more strict in that regard? Or is the rule built on the assumption that you can solve nearly all your loop problems with LINQ extension methods and lambdas?
S2374: Signed types should be preferred to unsigned ones
This rule basically states that unsigned types should not be used at all because they "have different arithmetic operators than signed ones - operators that few developers understand". In my code I am only using UInteger for ID values (because I don't need negative values and a Long would be a waste of memory in my case). They are stored in List(Of UInteger) and only ever compared to other UIntegers. Is this rule even relevant to my case (are comparisons part of these "arithmetic operators" mentioned by the rule) and what exactly would be the pitfall? And if not, wouldn't it be better to make that rule apply to arithmetic operations involving unsigned types, rather than their declaration?
S2355: Array literals should be used instead of array creation expressions
Maybe I don't know VB.NET well enough, but how exactly would I satisfy this rule in the following case where I want to create a fixed-size array where the initialization length is only known at runtime? Is this a false-positive?
Dim myObjects As Object() = New Object(someOtherList.Count - 3) {} 'Non-compliant
Sure, I could probably just use a List(Of Object). But I am curious anyway.
Thanks for raising these points. Note that not all rules apply every time. There are cases when we need to balance between false positives/false negatives/real cases. For example with identical expressions on both sides of an operator rule. Is it a bug to have the same operands? No it's not. If it was, then the compiler would report it. Is it a bad smell, is it usually a mistake? Yes in many cases. See this for example in Roslyn. Should we tune this rule to exclude some cases? Yes we should, there's nothing wrong with 2 << 2. So there's a lot of balancing that needs to happen, and we try to settle for an implementation that brings the most value for the users.
For the points you raised:
Two branches in the same conditional structure should not have exactly the same implementation
This rule generally states that having two blocks of code match exactly is a bad sign. Copy-pasted code should be avoided for many reasons, for example if you need to fix the code in one place, you'll need to fix it in the other too. You're right that adding negated conditions would be a mess, but if you extract each condition into its own method (and call the negated methods inside them) with proper names, then it would probably improves the readability of your code.
For the Select Case, again, copy pasted code is always a bad sign. In this case you could do this:
Select Case i
...
Case 0 To 40
Case Else
value = 0 ' Compliant
End Select
Or simply remove the 0-40 case.
Identical expressions should not be used on both sides of a binary operator
I think this is a corner case. See the first paragraph of the answer.
"Exit" statements should not be used
It's almost always true that by choosing another type of loop, or changing the stop condition, you can get away without using any "Exit" statements. It's good practice to have a single exit point from loops.
Signed types should be preferred to unsigned ones
This is a legacy rule from SonarQube VB.NET, and I agree with you that it shouldn't be enabled by default in SonarLint. I created the following ticket in our JIRA: https://jira.sonarsource.com/browse/SLVS-1074
Array literals should be used instead of array creation expressions
Yes, it seems to be a false positive, we shouldn't report on array creations when the size is explicitly specified. https://jira.sonarsource.com/browse/SLVS-1075