Capturing content which can start with Parser keywords in Xtext - grammar

The following is a simplified version of my actual grammar:
grammar org.hello.World
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
generate world "http://www.hello.org/World"
Model:
content=AnyContent greetings+=Greeting*;
AnyContent:
(ID | ANY_OTHER)*
;
Greeting:
'<hello>' name=ID '</hello>';
terminal ID:
('a'..'z'|'A'..'Z')+
;
terminal ANY_OTHER:
.
;
So using the above grammar, if my input is:
<hi><hello>world</hello>
Then I get a syntax error saying mismatched character 'i' expecting 'e' at column 2.
My requirement is that AnyContent should match "<hi>". Can anyone guide me on how to achieve that?

If you want to do this with Xtext, I advise you to split your problem. Your first problem is syntactic: you need to parse your file. The second problem is semantic: you want to give meaning to your objects and decide which object is the container. Defining the container and the containment for XML can't be done inside your grammar.
Make a custom Ecore model and an easy grammar with start and end tags. You don't really care about the names of your tags.
Example:
Model returns XmlFile: (StartTag|EndTag|Text)+;
Text returns Text: text=STRING;
StartTag returns StartTag: '<' name=ID '>';
EndTag returns EndTag: '</' name=ID '>';
Change the TokenSource. The token source hands the tokens to your parser. You can override the nature of your tokens, and merge or split them.
The idea here is to merge all tokens that appear between ">" and "</".
Such a merged token represents a Text, so you can create a single token for everything contained between these delimiters. Example:
class CustomTokenSource extends XtextTokenStream {
    new(TokenSource tokenSource, ITokenDefProvider tokenDefProvider) {
        super(tokenSource, tokenDefProvider)
    }
    override LT(int k) {
        var Token token = super.LT(k)
        // Give each token a chance to be retyped before the parser sees it.
        if (token !== null && token.text !== null) token.tokenOverride(k)
        token
    }
}
In this example you need to add your custom logic in the method "tokenOverride"; one possible shape is sketched below.
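A minimal, hypothetical sketch of what tokenOverride could do (written as a plain Java method rather than an Xtend extension method; the inText flag and the InternalMyDslParser.RULE_STRING constant are assumptions for illustration, not something from the original answer): track whether we are between '>' and '</', and retype every token in that region.
private boolean inText = false;

private void tokenOverride(Token token, int k) {
    String text = token.getText();
    if (">".equals(text)) {
        inText = true;                // everything after '>' is text content
    } else if ("</".equals(text)) {
        inText = false;               // the text region ends at the closing tag
    } else if (inText) {
        // Retype the token so the parser sees uniform "text" tokens
        // that can later be merged into a single Text element.
        token.setType(InternalMyDslParser.RULE_STRING);
    }
}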
Add your custom token source to your parser:
class XDSLParser extends DSLParser{
override protected XtextTokenStream createTokenStream(TokenSource tokenSource) {
return new CustomTokenSource(tokenSource, getTokenDefProvider());
}
}
Compute the containment: the containment of your elements can be computed after parsing. At that point you can get your model and change it as you like. To do this, override the method "doParse" of your parser "XDSLParser" as follows:
override protected IParseResult doParse(String ruleName, CharStream in, NodeModelBuilder nodeModelBuilder, int initialLookAhead) {
    var IParseResult result = super.doParse(ruleName, in, nodeModelBuilder, initialLookAhead)
    // gives you the model; rework it here to build the containment
    result.rootASTElement
    return result
}
Note: the model that you obtain after parsing will be flat. The XmlFile object will contain all the elements in the right order. You need to write an algorithm to build the containment in your AST model; one possible approach is sketched below.
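Here is a minimal sketch of such an algorithm, in plain Java with hypothetical stand-ins for the generated EMF classes (none of these names come from the answer above): a stack of open tags folds the flat element sequence into a tree.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-ins for the generated StartTag/EndTag/Text classes.
interface XmlPart {}
class Start implements XmlPart { String name; Start(String n) { name = n; } }
class End implements XmlPart { String name; End(String n) { name = n; } }
class Txt implements XmlPart { String text; Txt(String t) { text = t; } }

class Node {
    XmlPart part;                              // null for the synthetic root
    List<Node> children = new ArrayList<>();
    Node(XmlPart p) { part = p; }
}

class ContainmentBuilder {
    // Folds the flat (StartTag | EndTag | Text)* sequence into a tree.
    // Error handling for unbalanced tags is omitted.
    static Node build(List<XmlPart> flat) {
        Node root = new Node(null);
        Deque<Node> open = new ArrayDeque<>();
        open.push(root);
        for (XmlPart p : flat) {
            if (p instanceof Start) {
                Node n = new Node(p);
                open.peek().children.add(n);
                open.push(n);                  // later parts nest inside this tag
            } else if (p instanceof End) {
                open.pop();                    // closes the innermost open tag
            } else {
                open.peek().children.add(new Node(p));
            }
        }
        return root;
    }
}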

This will require a lot of tweaking in the grammar due to the nature of the ANTLR lexer that is used by Xtext. The lexer will not roll back for the keyword <hello>: as soon as it sees a < followed by an h, it'll try to consume the hello token. Something along these lines could work, though:
Model:
content=AnyContent greetings+=Greeting*;
AnyContent:
(ID | ANY_OTHER | '<' (ID | ANY_OTHER | '/' | '>') | '/' | '>' | 'hello')*
;
Greeting:
'<' 'hello' '>' name=ID '<' '/' 'hello' '>';
terminal ID:
('a'..'z'|'A'..'Z')+
;
terminal ANY_OTHER:
.
;
The approach won't scale for real-world grammars, but maybe it helps to get onto some working track.

Related

Antlr: how to switch on token type in Visitor implementation

I'm playing around with ANTLR, designing a toy language, which I think is where most people start! I have a question on how best to think about switching on token type.
Consider a 'function call' in the language, where a function can consume a string, a number, or a variable, for example like the below (project() is the function call):
project("ABC") vs project(123) vs project($SOME_VARIABLE)
I have the alternation operator in my grammar, so the grammar parses the right thing, but in the visitor code it would be nice to tell the difference between the three versions of the above.
@Override
public ASTRoot visitCreateproj(projectmgmtParser.CreateprojContext ctx) {
    try {
        s1 = ctx.STRING_LITERAL().getText();
    } catch (Exception e) {}
    try {
        s2 = ctx.NUM().getText();
    } catch (Exception e) {}
    System.out.println("Created Project via => " + ctx.getChild(1).toString());
    return null; // return value elided in the original snippet
}
The code above worked: depending on whether s1 or s2 is null, I can infer how I was called (with a string literal or a number; I haven't shown the variable case above). But I'm interested in whether there is a better or more elegant way, for example switching on token type inside the visitor code to actually process the language.
The grammar I had for the above was
createproj: 'project('WS?(STRING_LITERAL|NUM)')';
and when I use the intellij antlr plugin, it seems to know the token type of the argument to the project() function - but I don't seem to be able to get to it from my code.
You could do something like this:
createproj
: 'project' '(' WS? param ')'
;
param
: STRING_LITERAL
| NUM
;
and in your visitor code:
@Override
public ASTRoot visitCreateproj(projectmgmtParser.CreateprojContext ctx) {
    switch (ctx.param().start.getType()) {
        case YourLexerName.STRING_LITERAL:
            ...
        case YourLexerName.NUM:
            ...
        ...
    }
}
So by inlining the token in the grammar I had originally, I've lost the opportunity to inspect it in the visitor code?
Not really, you could also do it like this:
createproj
: 'project' '(' WS? param_token=(STRING_LITERAL | NUM) ')'
;
and could then do this:
@Override
public ASTRoot visitCreateproj(projectmgmtParser.CreateprojContext ctx) {
    switch (ctx.param_token.getType()) {
        case YourLexerName.STRING_LITERAL:
            ...
        case YourLexerName.NUM:
            ...
        ...
    }
}
Just make sure you don't mix lexer rules (tokens) and parser rules in your set param_token=( ... ). When it's a parser rule, ctx.param_token.getType() will fail (it must then be ctx.param_token.start.getType()). That is why I recommended adding an extra parser rule, because this would then still work:
param
: STRING_LITERAL
| NUM
| some_parser_rule
;

Don't read specific token at a given point

At some point in my grammar file, I want ANTLR to read my input as 2 tokens instead of one.
In my source file I have the value
12345.name
and the lexer consumes
12345.
as a FLOAT token. At this specific point in the source file I want ANTLR to read this as
12345 (INT)
. (DOT)
name (NAME)
Is there a way to tell ANTLR that it should ignore FLOAT tokens at some given point?
This is my current .g4 file:
grammar Quest;
import Lua;
@header {
package dev.codeflush.m2qc.antlr;
}
/*
prefixed everything with "m2" to avoid name clashes
*/
m2QuestFile
: m2Define* m2Quest* EOF
;
m2Define
: 'define' NAME m2DefineValue
;
m2DefineValue
: ~('\r\n' | '\r' | '\n')
;
m2Quest
: 'quest' NAME 'begin' m2State* 'end'
;
m2State
: 'state' NAME 'begin' (m2TriggerBlock | m2Function)* 'end'
;
m2TriggerBlock
: 'when' m2Trigger ('or' m2Trigger)* ('with' exp)? 'begin' block 'end'
;
m2Function
: 'function' NAME funcbody
;
m2Trigger
: m2TriggerTarget DOT m2TriggerEvent DOT m2TriggerSubEvent DOT m2TriggerArgument
| m2TriggerTarget DOT m2TriggerEvent DOT m2TriggerArgument
| m2TriggerTarget DOT m2TriggerEvent
| m2TriggerEvent
;
m2TriggerTarget
: NAME
| INT
| NORMALSTRING
;
/*
not complete
*/
m2TriggerEvent
: 'button'
| 'enter'
| 'info'
| 'item_informer'
| 'kill'
| 'leave'
| 'letter'
| 'levelup'
| 'login'
| 'logout'
| 'unmount'
| 'target'
| 'chat'
| 'timer'
| 'server_timer'
;
m2TriggerSubEvent
: 'click'
| 'chat'
| 'arrive'
;
m2TriggerArgument
: exp
;
DOT
: '.'
;
I'm using the Lua grammar from https://github.com/antlr/grammars-v4/blob/master/lua/Lua.g4
My current sample input file looks like this:
quest test begin
state start begin
when kill begin
end
when "12345".kill begin
end
when 12345.kill begin
end
end
end
The first two triggers work as intended, but the third one doesn't (because the lexer reads '12345.' as a single FLOAT token).
I had a similar need in my grammar where I wanted to issue multiple tokens (2 actually) for a single match under a specific condition (here: when a dot is directly followed by an identifier, including a keyword).
// Special rule that should also match all keywords if they are directly preceded by a dot.
// Hence it's defined before all keywords.
// Here we make use of the ability in our base lexer to emit multiple tokens with a single rule.
DOT_IDENTIFIER:
DOT_SYMBOL LETTER_WHEN_UNQUOTED_NO_DIGIT LETTER_WHEN_UNQUOTED* { emitDot(); } -> type(IDENTIFIER)
;
A helper function is needed to emit the extra token(s):
/**
* Puts a DOT token onto the pending token list.
*/
void MySQLBaseLexer::emitDot() {
    _pendingTokens.emplace_back(_factory->create({this, _input}, MySQLLexer::DOT_SYMBOL, _text, channel,
                                                 tokenStartCharIndex, tokenStartCharIndex, tokenStartLine,
                                                 tokenStartCharPositionInLine));
    ++tokenStartCharIndex;
}
This in turn requires custom handling of the token production: you have to override the nextToken method in your lexer so it consults the pending token list before returning the next real token.
/**
* Allow a grammar rule to emit as many tokens as it needs.
*/
std::unique_ptr<antlr4::Token> MySQLBaseLexer::nextToken() {
    // First respond with pending tokens to the next token request, if there are any.
    if (!_pendingTokens.empty()) {
        auto pending = std::move(_pendingTokens.front());
        _pendingTokens.pop_front();
        return pending;
    }

    // Let the main lexer class run the next token recognition.
    // This might create additional tokens again.
    auto next = Lexer::nextToken();
    if (!_pendingTokens.empty()) {
        auto pending = std::move(_pendingTokens.front());
        _pendingTokens.pop_front();
        _pendingTokens.push_back(std::move(next));
        return pending;
    }
    return next;
}
Keep in mind: the lexer rule still issues its own token (which I set to be an IDENTIFIER here), which means you only have to issue the additional tokens.
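Since the grammar in the question targets Java, here is a rough sketch of the same pending-token pattern for the ANTLR4 Java runtime. The class name QuestBaseLexer, the helper emitExtra, and the superClass wiring are assumptions for illustration, not part of the original answer.
import java.util.ArrayDeque;
import java.util.Deque;
import org.antlr.v4.runtime.*;

// Hook it up in the grammar with: options { superClass = QuestBaseLexer; }
public abstract class QuestBaseLexer extends Lexer {
    private final Deque<Token> pendingTokens = new ArrayDeque<>();

    public QuestBaseLexer(CharStream input) {
        super(input);
    }

    // Queues an extra token so a single lexer rule can emit several tokens,
    // like emitDot() does in the C++ version above.
    protected void emitExtra(int type, String text) {
        pendingTokens.addLast(new CommonToken(type, text));
    }

    @Override
    public Token nextToken() {
        // Serve queued tokens before running the normal recognition loop.
        if (!pendingTokens.isEmpty()) {
            return pendingTokens.pollFirst();
        }
        Token next = super.nextToken();  // rule actions may call emitExtra()
        if (!pendingTokens.isEmpty()) {
            pendingTokens.addLast(next); // the freshly matched token goes last
            return pendingTokens.pollFirst();
        }
        return next;
    }
}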

Why does the DSL show "Couldn't resolve reference to "?

I am implementing a grammar with three sections. In the first section I declare components with their interfaces, for instance component A with interfaces interface_1 and interface_2. In the third section I declare some restrictions, for instance component A can access component B through interface XXXX. When I try to cross-reference the interfaces of a component I get the error "Couldn't resolve reference to ProbeInterface 'interface_1'".
I tried several examples from the internet, but none of them works for my case.
This is part of my grammar:
ArchitectureDefinition:
'Abstractions' '{' abstractions += DSLAbstraction+ '}'
'Compositions' '{' compositions += DSLComposition* '}'
'Restrictions' '{' restrictions += DSLRestriction* '}'
;
DSLComposition:
DSLProbe|DSLSensor
;
DSLRestriction:
'sensor' t=[DSLSensor] 'must-access-probe' type = [DSLProbe] 'through-interface' probeinterface=[ProbeInterface] ';'
;
DSLSensor:
'Sensor' name=ID ';'
;
DSLProbe:
'Probe' name=ID ('with-interface' probeinterface=ProbeInterface)? ';'
;
ProbeInterface :
name+=ID (',' name+=ID)*
;
And the implementation:
Abstractions
{
Sensor sensor_1 ;
Probe probe_1 with-interface interface_1, interface_2;
}
Compositions{}
Restrictions
{
sensor sensor_1 must-access-probe probe_1 through-interface
interface_1;
}
I expect that interface_1 or interface_2 can be referenced in the Restrictions section.
Thanks.
The grammar you posted is incomplete.
The way you define the interfaces is really bad: default naming works only with single-valued name attributes.
ProbeInterface :
name+=ID (',' name+=ID)*
;
Better:
DSLProbe:
'Probe' name=ID ('with-interface' probeinterfaces+=ProbeInterface ("," probeinterfaces+=ProbeInterface)*)? ';'
;
ProbeInterface :
name=ID
;
It looks like the qualified name of an interface is
<probename>.<interfacename>
So you either have to adapt the name provider,
or change the grammar and model to use qualified names, ref=[Thing|FQN] with FQN: ID ("." ID)*;
or you implement scoping properly, which is likely what you want in your case, since you want to restrict the interfaces for specific probes.
Here is a sample:
override getScope(EObject context, EReference reference) {
    if (reference === MyDslPackage.Literals.DSL_RESTRICTION__PROBEINTERFACE) {
        if (context instanceof DSLRestriction) {
            val probe = context.type
            return Scopes.scopeFor(probe.probeinterfaces)
        }
    }
    super.getScope(context, reference)
}
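If you instead go with the first option and adapt the name provider, a hypothetical sketch could look like this (the class name is made up; ProbeInterface stands for the EMF class your grammar generates):
import org.eclipse.xtext.naming.DefaultDeclarativeQualifiedNameProvider;
import org.eclipse.xtext.naming.QualifiedName;

public class MyDslQualifiedNameProvider extends DefaultDeclarativeQualifiedNameProvider {
    // Called reflectively by the base class for every ProbeInterface.
    // Returning the plain name makes "interface_1" resolvable directly,
    // instead of only as "probe_1.interface_1".
    protected QualifiedName qualifiedName(ProbeInterface pi) {
        return QualifiedName.create(pi.getName());
    }
}
You would then bind it in your runtime module by overriding bindIQualifiedNameProvider(). Note that with a flat namespace, two probes can no longer declare interfaces with the same name, which is another reason the scoping approach above is usually the better fit.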

No way to implement a q quoted string with custom delimiters in Antlr4

I'm trying to implement a lexer rule for the Oracle q-quoted string mechanism, where we have something like q'$some string$'.
Here you can have any character in place of $ other than whitespace, (, {, [, <, but the string must start and end with the same character. Some examples of accepted tokens would be:
q'!some string!'
q'ssome strings'
Notice how s is the custom delimiter, but it is fine to have it in the string as well, because we only end at s'.
Here's how I was trying to implement the rule:
Q_QUOTED_LITERAL: Q_QUOTED_LITERAL_NON_TERMINATED . QUOTE-> type(QUOTED_LITERAL);
Q_QUOTED_LITERAL_NON_TERMINATED:
Q QUOTE ~[ ({[<'"\t\n\r] { setDelimChar( (char)_input.LA(-1) ); }
( . { !isValidEndDelimChar() }? )*
;
I have already checked the value I get from !isValidEndDelimChar(), and the predicate is false at the right place, so everything should work, but ANTLR simply ignores this predicate. I've also tried moving the predicate around, putting that part in a separate rule, and a bunch of other stuff; after a day and a half of research on this I'm finally raising the issue.
I have also tried to implement it in other ways, but there doesn't seem to be a way to implement a custom-char-delimited string in ANTLR4 (the ANTLR3 version used to work).
Not sure why the { ... } action isn't invoked, but it's not needed. The following grammar worked for me (put the predicate in front of the .!):
grammar Test;
@lexer::members {
    boolean isValidEndDelimChar() {
        // getText().charAt(2) is the start delimiter: character index 2 of "q'X...".
        // We stop once the next char repeats it and is followed by the closing quote.
        return (_input.LA(1) == getText().charAt(2)) && (_input.LA(2) == '\'');
    }
}
parse
: .*? EOF
;
Q_QUOTED_LITERAL
: 'q\'' ~[ ({[<'"\t\n\r] ( {!isValidEndDelimChar()}? . )* . '\''
;
SPACE
: [ \t\f\r\n] -> skip
;
If you run the class:
import org.antlr.v4.runtime.*;

public class Main {
    public static void main(String[] args) {
        Lexer lexer = new TestLexer(CharStreams.fromString("q'ssome strings' q'!foo!'"));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        for (Token t : tokens.getTokens()) {
            System.out.printf("%-20s %s\n", TestLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
        }
    }
}
the following output will be printed:
Q_QUOTED_LITERAL     q'ssome strings'
Q_QUOTED_LITERAL     q'!foo!'
EOF                  <EOF>

skipping parts of a matched lexical element or token

I would like to match "{NUM}" and then have the lexer rule return "NUM". So I tried:
NUM : ('{' { skip(); }) 'NUM' ('}' { skip(); });
But that seems to skip everything and return empty on a match. Would it be possible to skip parts of a lexer match?
ANTLR 3.4
Invoking skip() anywhere in your rule will remove the entire token from the lexer, not just certain characters.
What you could do is this:
NUM
: '{NUM}' {setText("NUM");}
;
Or, if NUM is variable, do:
NUM
: '{' 'A'..'Z'+ '}' {setText($text.substring(1, $text.length() - 1));}
;
which removes the first and last char from the token.
EDIT
smartnut007 wrote:
Is there an equivalent way to do this for tokens?
If you mean how to change the text of tokens inside parser rules, try this:
parser_rule
: LEXER_RULE {$LEXER_RULE.setText("new-text");}
;
LEXER_RULE
: 'old-text'
;