I would like to match "{NUM}" and have the lexer rule return "NUM". So I tried:
NUM : ('{' { skip(); }) 'NUM' ('}' { skip(); });
But that seems to skip everything and return an empty token on a match. Is it possible to skip parts of a lexer match?
ANTLR 3.4
Invoking skip() anywhere in your rule will remove the entire token from the lexer, not just certain characters.
What you could do is this:
NUM
: '{NUM}' {setText("NUM");}
;
Or, if NUM is variable, do:
NUM
: '{' 'A'..'Z'+ '}' {setText($text.substring(1, $text.length() - 1));}
;
which removes the first and last char from the token.
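For reference, here is a minimal driver sketch showing what the resulting token looks like; it assumes ANTLR 3 and that the grammar above is compiled as T, so that the generated lexer is called TLexer (the names are hypothetical):
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.Token;

public class SetTextDemo {
    public static void main(String[] args) {
        // TLexer is the hypothetical lexer ANTLR 3 generates from a grammar named T
        TLexer lexer = new TLexer(new ANTLRStringStream("{NUM}"));
        Token token = lexer.nextToken();
        // setText(...) inside the NUM rule replaced the matched "{NUM}" with "NUM"
        System.out.println(token.getText()); // prints: NUM
    }
}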
EDIT
smartnut007 wrote:
Is there an equivalent way to do this for tokens?
If you mean how to change the text of tokens inside parser rules, try this:
parser_rule
: LEXER_RULE {$LEXER_RULE.setText("new-text");}
;
LEXER_RULE
: 'old-text'
;
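A quick way to see the effect from Java (again a sketch, assuming ANTLR 3 and a grammar named T, so that TLexer and TParser are the generated classes; the names are hypothetical):
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CommonTokenStream;

public class ParserSetTextDemo {
    public static void main(String[] args) throws Exception {
        // TLexer/TParser are the hypothetical classes generated from a grammar named T
        CommonTokenStream tokens = new CommonTokenStream(new TLexer(new ANTLRStringStream("old-text")));
        TParser parser = new TParser(tokens);
        parser.parser_rule();
        // the action in parser_rule replaced the token's text in the token stream
        System.out.println(tokens.get(0).getText()); // prints: new-text
    }
}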
Related
I'm trying to implement a lexer rule for the Oracle q-quoted string mechanism, where we have something like q'$some string$'.
Here you can have any character in place of $ except whitespace, (, {, [, and <, but the string must start and end with the same character. Some examples of accepted tokens would be:
q'!some string!'
q'ssome strings'
Notice how s is the custom delimiter in the second example, yet it is fine for it to also appear inside the string, because the literal only ends at s'.
Here's how I was trying to implement the rule:
Q_QUOTED_LITERAL: Q_QUOTED_LITERAL_NON_TERMINATED . QUOTE -> type(QUOTED_LITERAL);
Q_QUOTED_LITERAL_NON_TERMINATED:
Q QUOTE ~[ ({[<'"\t\n\r] { setDelimChar( (char)_input.LA(-1) ); }
( . { !isValidEndDelimChar() }? )*
;
I have already checked the value returned by !isValidEndDelimChar(), and the predicate is false at the right place, so everything should work, but ANTLR simply ignores the predicate. I've also tried moving the predicate around, putting that part in a separate rule, and a bunch of other things; after a day and a half of research I'm finally raising this issue.
I have also tried to implement it in other ways, but there doesn't seem to be a way to implement a custom-character-delimited string in ANTLR 4 (the ANTLR 3 version used to work).
Not sure why the { ... } action isn't invoked, but it's not needed. The following grammar worked for me (note that the predicate is placed in front of the .):
grammar Test;
@lexer::members {
  boolean isValidEndDelimChar() {
    return (_input.LA(1) == getText().charAt(2)) && (_input.LA(2) == '\'');
  }
}
parse
: .*? EOF
;
Q_QUOTED_LITERAL
: 'q\'' ~[ ({[<'"\t\n\r] ( {!isValidEndDelimChar()}? . )* . '\''
;
SPACE
: [ \t\f\r\n] -> skip
;
If you run the class:
import org.antlr.v4.runtime.*;

public class Main {
    public static void main(String[] args) {
        Lexer lexer = new TestLexer(CharStreams.fromString("q'ssome strings' q'!foo!'"));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        for (Token t : tokens.getTokens()) {
            System.out.printf("%-20s %s\n", TestLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
        }
    }
}
the following output will be printed:
Q_QUOTED_LITERAL     q'ssome strings'
Q_QUOTED_LITERAL     q'!foo!'
EOF                  <EOF>
I am trying to write a parser for a subset of the ABAP language, but I run into problems when the input contains misspelled/unknown statements.
In the following example, one of the PERFORM statements is spelled PERFOR, so I expected the parser to gobble tokens until it re-syncs and then proceed with the following PERFORM statements.
FUNCTION-POOL test.
FUNCTION z_angebot_01.
PERFORM x.
LOOP AT mytable.
PERFORM test.
PERFOR test.
PERFORM test.
PERFORM test.
ENDLOOP.
ENDFUNCTION.
Instead, the parser seems to attempt token insertion and leaves the LOOP. Later it complains about an extraneous ENDLOOP.
Output messages:
line 11:4 extraneous input 'PERFOR' expecting {ENDLOOP, LOOP_AT, PERFORM}
line 15:2 extraneous input 'ENDLOOP' expecting {ENDFUNCTION, LOOP_AT, PERFORM}
While debugging the generated code, I noticed there is no error at all for the PERFOR statement. The parser stays inside the loop as long as LOOP_AT or PERFORM is found; anything else exits the loop.
But how can I treat misspelled/unknown statements as syntax errors that are ignored until the next EOC token?
I use a separate lexer and parser, so this is my current approach:
AbapLexer.g4:
lexer grammar AbapLexer;
@lexer::header {
package generated;
}
WS : [ \t\r\n] -> skip;
EOC : '.' ;
ENDFUNCTION : [Ee][Nn][Dd][Ff][Uu][Nn][Cc][Tt][Ii][Oo][Nn];
ENDLOOP : [Ee][Nn][Dd][Ll][Oo][Oo][Pp];
FUNCTION : [Ff][Uu][Nn][Cc][Tt][Ii][Oo][Nn];
FUNCTION_POOL : [Ff][Uu][Nn][Cc][Tt][Ii][Oo][Nn] '-' [Pp][Oo][Oo][Ll];
LOOP_AT: [Ll][Oo][Oo][Pp] WHITESPACE [Aa][Tt];
PERFORM : [Pp][Ee][Rr][Ff][Oo][Rr][Mm];
IDENTIFIER: [_a-zA-Z] [_0-9a-zA-Z]* ;
fragment WHITESPACE: [ \t\r\n]+;
AbapParser.g4:
parser grammar AbapParser;
options { tokenVocab=AbapLexer; }
@parser::header {
package generated;
}
report: (FUNCTION_POOL) IDENTIFIER EOC
(functionStatement)*
;
block:
(
loopStatement EOC
| performStatement EOC
)+
;
loopStatement:
loopStatementStart EOC
block?
ENDLOOP
;
loopStatementStart:
LOOP_AT IDENTIFIER
;
performStatement:
PERFORM IDENTIFIER
;
functionStatement
:
FUNCTION functionname = IDENTIFIER EOC
block?
ENDFUNCTION EOC
;
Any hints appreciated!
Thank you
Peter
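This is not from the original thread, but one direction to experiment with is a custom ANTLR 4 error strategy that, whenever the parser does report an error, discards tokens up to and including the next EOC ('.') so parsing re-syncs at the statement boundary. It is only a sketch: on its own it will not make PERFOR raise an error (the grammar would probably also need a catch-all "unknown statement" alternative for that), and it assumes the generated AbapLexer shown above:
import generated.AbapLexer;
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.misc.IntervalSet;

// Sketch: on error, drop tokens until the next EOC ('.') and consume it,
// so the parser can continue with the following statement.
public class ResyncToEocStrategy extends DefaultErrorStrategy {
    @Override
    public void recover(Parser recognizer, RecognitionException e) {
        consumeUntil(recognizer, IntervalSet.of(AbapLexer.EOC));
        if (recognizer.getInputStream().LA(1) == AbapLexer.EOC) {
            recognizer.consume();
        }
    }
}
You would install it with parser.setErrorHandler(new ResyncToEocStrategy()) before invoking the start rule.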
The following is a simplified version of my actual grammar:
grammar org.hello.World
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
generate world "http://www.hello.org/World"
Model:
content=AnyContent greetings+=Greeting*;
AnyContent:
(ID | ANY_OTHER)*
;
Greeting:
'<hello>' name=ID '</hello>';
terminal ID:
('a'..'z'|'A'..'Z')+
;
terminal ANY_OTHER:
.
;
So, using the above grammar, if my input is:
<hi><hello>world</hello>
then I get a syntax error saying mismatched character 'i' expecting 'e' at column 2.
My requirement is that AnyContent should match "<hi>". Can anyone guide me on how to achieve that?
If you want to do this with Xtext, I advise you to split your problem. Your first problem is syntactic: you need to parse your file. The second problem is semantic: you want to give meaning to your objects and determine which one is the container. Defining the container and the containment for XML cannot be done inside your grammar.
Create a custom Ecore model and a simple grammar with start and end tags. You don't really care about the name of the tag.
Example:
Model returns XmlFile: (StartTag|EndTag|Text)+;
Text returns Text: text=STRING;
StartTag returns StartTag: '<' name=ID '>';
EndTag returns EndTag: '</' name=ID '>';
Change the TokenSource. The token source delivers the tokens to your parser; you can override the nature of the tokens, merging or splitting them.
The idea here is to merge all the tokens outside the tags, i.e. between ">" and "</".
Those tokens represent a Text, so you can create a single token for everything contained between these elements. Example:
class CustomTokenSource extends XtextTokenStream {

    new(TokenSource tokenSource, ITokenDefProvider tokenDefProvider) {
        super(tokenSource, tokenDefProvider)
    }

    override LT(int k) {
        var Token token = super.LT(k)
        if (token != null && token.text != null) token.tokenOverride(k);
        token
    }
}
In this example, you need to add your custom code in the method "tokenOverride".
Add your custom token source to your parser:
class XDSLParser extends DSLParser {

    override protected XtextTokenStream createTokenStream(TokenSource tokenSource) {
        return new CustomTokenSource(tokenSource, getTokenDefProvider());
    }
}
Compute the containment: the containment of your elements can be computed after parsing. At that point you can get your model and change it as you like. To do this, override the method "doParse" of your parser "XDSLParser" as follows:
override protected IParseResult doParse(String ruleName, CharStream in, NodeModelBuilder nodeModelBuilder, int initialLookAhead) {
    var IParseResult result = super.doParse(ruleName, in, nodeModelBuilder, initialLookAhead)
    // gives you the model
    result.rootASTElement;
    return result
}
Note: the model you obtain after parsing will be flat. The XmlFile object will contain all the elements in the right order. You need to write an algorithm that builds the containment on your AST model.
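As a rough illustration of such an algorithm (plain Java with hypothetical StartTag/EndTag/Text/Node types, not the actual Xtext/EMF classes), a stack-based pass over the flat list could look like this:
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class ContainmentBuilder {

    // Flat elements as the parser produced them (hypothetical stand-ins for the grammar's types)
    interface Item {}
    record StartTag(String name) implements Item {}
    record EndTag(String name) implements Item {}
    record Text(String text) implements Item {}

    // Tree node of the nested model we are rebuilding
    static class Node {
        final String name;
        final List<Object> children = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    static Node buildTree(List<Item> flat) {
        Node root = new Node("<root>");
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        for (Item item : flat) {
            if (item instanceof StartTag s) {          // open a new container under the current one
                Node child = new Node(s.name());
                stack.peek().children.add(child);
                stack.push(child);
            } else if (item instanceof EndTag) {       // close the current container
                if (stack.size() > 1) stack.pop();
            } else if (item instanceof Text t) {       // attach text to the current container
                stack.peek().children.add(t.text());
            }
        }
        return root;
    }
}
The same idea carries over to the EMF model: walk the flat list in order and move each element into the containment reference of the element on top of the stack.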
This will require a lot of tweaking in the grammar due to the nature of the ANTLR lexer that Xtext uses. The lexer will not roll back for the keyword <hello>: as soon as it sees a < followed by an h, it will try to consume the hello token. Something along these lines could work, though:
Model:
content=AnyContent greetings+=Greeting*;
AnyContent:
(ID | ANY_OTHER | '<' (ID | ANY_OTHER | '/' | '>') | '/' | '>' | 'hello')*
;
Greeting:
'<' 'hello' '>' name=ID '<' '/' 'hello' '>';
terminal ID:
('a'..'z'|'A'..'Z')+
;
terminal ANY_OTHER:
.
;
The approach won't scale to real-world grammars, but maybe it helps to get onto a working track.
I am evaluating a relatively simple IF/THEN language but have run into a problem: I need to match both integers AND dates in the format YYYYMMDD. If I could write a real regular expression, I could solve this pretty easily, but I haven't figured out an ANTLR solution.
The grammar looks like this:
//overall rule to evaluate a single expression
singleEvaluation returns [boolean evalResult]
: integerEvaluation {$evalResult = $integerEvaluation.evalResult;}
| dateEvaluation {$evalResult = $dateEvaluation.evalResult;}
// etc
;
dateEvaluation returns [boolean evalResult]
: expr1=(INTEGER|'TODAY'|DATE_FIELD_IDENTIFIER) (leftOp=('+'|'-') leftModifier=INTEGER leftQualifier=DATE_QUALIFIER)?
operator=(EQ|NE|LT|LE|GT|GE)
expr2=(INTEGER|'TODAY'|DATE_FIELD_IDENTIFIER) (rightOp=('+'|'-') rightModifier=INTEGER rightQualifier=DATE_QUALIFIER)?
{ // code
}
;
integerEvaluation returns [boolean evalResult]
: expr1=(NUM_FIELD_IDENTIFIER|INTEGER)
operator=(EQ|NE|LT|LE|GT|GE)
expr2=(NUM_FIELD_IDENTIFIER|INTEGER)
{ // code
}
;
fragment DIGIT: '0'..'9';
INTEGER: DIGIT+;
DATE_FIELD_IDENTIFIER: ('DOB'|'DATE_OF_HIRE');
NUM_FIELD_IDENTIFIER: ('AGE'|'DEPARTMENT_ID');
DATE_QUALIFIER:('YEAR'|'YEARS'|'MONTH'|'MONTHS'|'DAY'|'DAYS'|'TODAY');
EQ:'=';
NE: '<>';
LT: '<';
LE: '<=';
GT: '>';
GE: '>=';
An example of a statement needing to be parsed would be something like "65 > AGE" or "AGE < 65", or "DOB > 19500101".
Can someone suggest a way to make the parser differentiate between an INTEGER and the 8 digit date format?
After the lexer matches an INTEGER, you can inspect the matched text (referenced through $text) and, based on that custom check, decide to change its type from INTEGER to DATE. The DATE rule can be declared as an empty fragment rule and can then be used inside a parser rule just as if it were a normal lexer rule.
A quick demo:
INTEGER
  : DIGIT+
    {
      // If this token starts with either '19' or '20', followed
      // by 6 digits, change it to a DATE-token.
      if ($text.matches("(19|20)\\d{6}")) {
        $type = DATE;
      }
    }
  ;
fragment DATE : /* empty! */ ;
And then in a parser rule, you can just use DATE:
dateEvaluation
: DATE ...
;
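To sanity-check the type switch, a small driver along these lines should do (a sketch assuming ANTLR 3 and that the grammar is compiled as, say, Eval, so the generated lexer is EvalLexer; the names are hypothetical):
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.Token;

public class TokenTypeDemo {
    public static void main(String[] args) {
        // EvalLexer is the hypothetical lexer generated from a grammar named Eval
        for (String input : new String[] { "65", "19500101" }) {
            Token t = new EvalLexer(new ANTLRStringStream(input)).nextToken();
            String name = t.getType() == EvalLexer.DATE ? "DATE"
                        : t.getType() == EvalLexer.INTEGER ? "INTEGER"
                        : "type #" + t.getType();
            System.out.println(input + " -> " + name);
        }
    }
}
With the INTEGER rule above, 65 should come out as INTEGER and 19500101 as DATE.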
I have the following language I wish to parse using ANTLR 1.2.2.
TEST <name>
{
<param_name> = <param value>;
}
where
<...> means a user-supplied value, not one of the language keywords.
For example:
TEST myTest
{
my_param = 1.0;
}
The value can be an integer, a real, or a quoted string:
my_param = 1.0;, my_param = 1;, and my_param = "myStringValue"; are all valid inputs.
Here is the grammar for this:
parse_test : TESTKEYWORD TEST_NAME '{' param_value_def '}';
param_value_def : ID EQUALS param_value ';';
param_value : REAL|INTEGER|QUOTED_STRING;
TESTKEYWORD : 'TEST';
QUOTED_STRING : '"' ~('"')* '"';
INTEGER : MINUS? DIGIT DIGIT*;
REAL : INTEGER '.' DIGIT DIGIT*;
EQUALS : '=';
fragment
MINUS : '-';
fragment
DIGIT : '0'..'9';
When I feed the sample input to the ANTLR interpreter, I get a MismatchedTokenException related to the param_value rule.
Can you help me decipher the error message and figure out what I am doing wrong?
Thanks
Although ANTLRWorks is not the best-written tool, you can use its debugger to see which token in the input leads to this exception, and from there which rules need to be revised (since you did not post the full grammar).
http://www.antlr.org/works/index.html
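If you prefer not to use ANTLRWorks, you can get similar insight by dumping the token stream from plain Java (a sketch assuming an ANTLR 3 toolchain and that the grammar is compiled as, say, TestLang, so the generated lexer is TestLangLexer; the names are hypothetical):
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.Token;

public class DumpTokens {
    public static void main(String[] args) {
        String input = "TEST myTest\n{\n    my_param = 1.0;\n}";
        // TestLangLexer is the hypothetical lexer generated from a grammar named TestLang
        TestLangLexer lexer = new TestLangLexer(new ANTLRStringStream(input));
        for (Token t = lexer.nextToken(); t.getType() != Token.EOF; t = lexer.nextToken()) {
            System.out.println(t.getType() + " : '" + t.getText() + "'");
        }
    }
}
Seeing which token the lexer actually produces for 1.0, and where the stream diverges from what param_value expects, usually points straight at the rule that needs revising.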