I have a rule to match a string in the grammar. I currently need the content of the string and not the quotes itself so I am looking to strip the quotation marks.
StringLiteral
: UnterminatedStringLiteral '"'
;
UnterminatedStringLiteral
: '"' (~["\\\r\n] | '\\' (. | EOF))*
;
I saw a solution on https://theantlrguy.atlassian.net/wiki/spaces/ANTLR3/pages/2687006/How+do+I+strip+quotes but it is an older version of ANTLR and I need to translate it into Python3. Does anyone have a solution to this?
The ANTLR3 syntax:
STRING: '\"' CHARS '\"' {setText(getText().substring(1, getText().length()-1));} ;
still works in ANTLR4. If you want to port that Java code to Python, do this:
StringLiteral
: UnterminatedStringLiteral '"' {self.text = self.text[1:-1]}
;
And if you also want to strip the \ that are used to escape othe characters, do this:
grammar YourGrammarName;
#lexer::header {
import re
}
...
StringLiteral
: UnterminatedStringLiteral '"' {self.text = re.sub(r'\\(.)', '\\1', self.text[1:-1])}
;
Related
Im beginner of antlr,
Im try to write an antlr grammar (.g4 file) for follow rules:
Accept AND (&) between two variable: A&B, ABC&X, ...
Accept Unicode string begin with U&'hex string': U&'000b', U&'0020', ...
Accept concat string between variable (string type) and string: A&'123', ABC&'XyZ', ...
My question is how to reject concat string between U&'XyZ'? because U& is prefix of unicode string
Thank you for reading
U&'XyZ' should never be able to be recognised as a AND expression by the parser because U&'XyZ' is already tokenised as a single Unicode string token in the lexer:
expr
: expr '&' expr
| STRING
| ID
;
STRING
: 'U&'? '\'' ( ~[\\'\r\n] | '\\' ~[\r\n] )* '\''
;
ID
: [a-zA-Z_] [a-zA-Z_0-9]*
;
which will parse U&'XyZ'&X as this:
there are two ESCAPE type in SQL: \' AND ''
a input may like:
SELECT '\'', '''';
I parse the string with this grammar:
STRING_LITERAL
: '\'' ( '\\\'' | '\'\'' | ~'\'' )* '\''
;
but ANTLR parse the input error, the tree like this:
error parsed tree
I also tried another type of STRING_LITERAL grammar with GREEDY: "?":
STRING_LITERAL
: '\'' ( '\\\'' | '\'\'' | ~'\'' )*? '\''
;
but it also give me a error parse resule like this:
error parsed tree in another grammar
the '''' should parsed as a string contain but not two empty string.
How should I modify the grammar to fix the problem?
You didn't exclude the \ in the ( ... )*. Try this:
STRING_LITERAL
: '\'' ( '\\\'' | '\'\'' | ~['\\] )* '\''
;
where ~['\\] matches any char except ' and \. You may want to include line break chars in it: ~[\r\n'\\].
I have a Hello.g4 grammar file with a grammar definition:
definition : wordsWithPunctuation ;
words : (WORD)+ ;
wordsWithPunctuation : word ( word | punctuation word | word punctuation | '(' wordsWithPunctuation ')' | '"' wordsWithPunctuation '"' )* ;
NUMBER : [0-9]+ ;
word : WORD ;
WORD : [A-Za-z-]+ ;
punctuation : PUNCTUATION ;
PUNCTUATION : (','|'!'|'?'|'\''|':'|'.') ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
Now, if I am trying to build a parse tree from the following input:
a b c d of at of abc bcd of
a b c d at abc, bcd
a b c d of at of abc, bcd of
it returns errors:
Hello::definition:1:31: extraneous input 'of' expecting {<EOF>, '(', '"', WORD, PUNCTUATION}
though the:
a b c d at: abc bcd!
works correct.
What is wrong with the grammar or input or interpreter?
If I modify the wordsWithPunctuation rule, by adding (... | 'of' | ',' word | ...) then it matches the input completely, but it looks suspicious for me - how the word of is different from the word a or abc? Or why the , is different from other punctuation characters (i.e., why does it match the : or !, but not ,?)?
Update1:
I am working with ANTLR4 plugin for Eclipse, so the project build happens with the following output:
ANTLR Tool v4.2.2 (/var/folders/.../antlr-4.2.2-complete.jar)
Hello.g4 -o /Users/.../eclipse_workspace/antlr_test_project/target/generated-sources/antlr4 -listener -no-visitor -encoding UTF-8
Update2:
the presented above grammar is just a partial from:
grammar Hello;
text : (entry)+ ;
entry : blub 'abrr' '-' ('1')? '.' ('(' NUMBER ')')? sims '-' '(' definitionAndExamples ')' 'Hello' 'all' 'the' 'people' 'of' 'the' 'world';
blub : WORD ;
sims : sim (',' sim)* ;
sim : words ;
definitionAndExamples : definitions (';' examples)? ;
definitions : definition (';' definition )* ;
definition : wordsWithPunctuation ;
examples : example (';' example )* ;
example : '"' wordsWithPunctuation '"' ;
words : (WORD)+ ;
wordsWithPunctuation : word ( word | punctuation word | word punctuation | '(' wordsWithPunctuation ')' | '"' wordsWithPunctuation '"' )* ;
NUMBER : [0-9]+ ;
word : WORD ;
WORD : [A-Za-z-]+ ;
punctuation : PUNCTUATION ;
PUNCTUATION : (','|'!'|'?'|'\''|':'|'.') ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
It looks now for me, that the words from the entry rule somehow breaking the other rules within the entry rule. But why? Is it a kind an anti-pattern in the grammar?
By including 'of' in a parser rule, ANTLR is creating an implicit anonymous token to represent that input. The word of will always have that special token type, so it will never have the type WORD. The only place it may appear in your parse tree is at a location where 'of' appears in a parser rule.
You can prevent ANTLR from creating these anonymous token types by separating your grammar into a separate lexer grammar HelloLexer in HelloLexer.g4 and parser grammar HelloParser in HelloParser.g4. I highly recommend you always use this form for the following reasons:
Lexer modes only work if you do this.
Implicitly-defined tokens are one of the most common sources of bugs in a grammar, and separating the grammar prevents it from ever happening.
Once you have the grammar separated, you can update your word parser rule to allow the special token of to be treated as a word.
word
: WORD
| 'of'
| ... other keywords which are also "words"
;
I need to parse user input that defines queries to a system. The heart of such queries are triplets which can also be combined to form complex queries (the idea is to restrict a result set to only show entries which satisfy these queries). Here are 3 sample inputs:
field1 = simpleValueNoQuotes
field2 ~ "valueWithQuotes"
(field1 = simpleValueNoQuotes OR field2 ~ "valueWithQuotes") AND field3 = foobar
The user must use quoted values if their values contain any reserved characters like doublequotes or parentheses as well as whitespace.
So far, my grammar has handled this well enough, but now a new requirement has come up. Users should be allowed to omit the spaces, entering queries like field1=simpleValueNoQuotes. My grammar can't handle this and I can't seem to figure out why (this is my first project with antlr).
Here is my grammar in a slightly simplified form:
grammar simple;
querytree : query EOF;
query : subquery (operator subquery)* ;
subquery : leaf | composite;
operator : 'and' | 'or';
leaf : fieldname comparison value;
value : DOUBLEQUOTE_DELIMITED_VALUE | SIMPLE_VALUE;
composite : leftParenthesis query rightParenthesis;
fieldname : 'field1' | 'field2'; //this has many keywords in reality
comparison : '=' | '~';
leftParenthesis : '(';
rightParenthesis : ')';
fragment
ESCAPE : '\\' ( '"' | '\\') ;
DOUBLEQUOTE_DELIMITED_VALUE
: '"' ( ~( '"' | '\\' ) | ESCAPE )* '"'
;
SIMPLE_VALUE
: ('\u0021'|'\u0023'..'\u0027'|'\u002A'..'\u007E'|'\u00A1'..'\uFFFF')*; /*all unicode characters except control characters, doublequotes, parentheses and whitespace defined below*/
WHITESPACE
: ('\u0009'|'\u000A'|'\u000C'|'\u000D'|'\u0020'|'\u00A0')+ {$channel = HIDDEN;} /*\t, \n, \f, \r, space, nonbreaking space*/
;
Any ideas as to why this is able to parse field1 = simpleValueNoQuotes but unable to parse field1=simpleValueNoQuotes?
You forgot to exclude = from SIMPLE_VALUE, which means field1=simpleValueNoQuotes is a single SIMPLE_VALUE token.
my real grammar is way more complex but I could strip down my problem. So this is the grammar:
grammar test2;
options {language=CSharp3;}
#parser::namespace { Test.Parser }
#lexer::namespace { Test.Parser }
start : 'VERSION' INT INT project;
project : START 'project' NAME TEXT END 'project';
START: '/begin';
END: '/end';
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
INT : '0'..'9'+;
NAME: ('a'..'z' | 'A'..'Z')+;
TEXT : '"' ( '\\' (.) |'"''"' |~( '\\' | '"' | '\n' | '\r' ) )* '"';
STARTA
: '/begin hello';
And I want to parse this (for example):
VERSION 1 1
/begin project
testproject "description goes here"
/end
project
Now it will not work like this (Mismatched token exception). If I remove the last Token STARTA, it works. But why? I don't get it.
Help is really appreciated.
Thanks.
When the lexer sees the input "/begin " (including the space!), it is committed to the rule STARTA. When it can't match said rule, because the next char in the input is a "p" (from "project") and not a "h" (from "hello"), it will try to match another rule that can match "/begin " (including the space!). But there is no such rule, producing the error:
mismatched character 'p' expecting 'h'
and the lexer will not give up the space and match the START rule.
Remember that last part: once the lexer has matched something, it will not give up on it. It might try other rules that match the same input, but it will not backtrack to match a rule that matches less characters!
This is simply how the lexer works in ANTLR 3.x, no way around it.