What is the correct way to scan "Quoted String" in ragel?

I'm trying to learn Ragel with Go, but I am not able to find a proper way to scan a quoted string.
This is what I have defined:
dquote = '"';
quoted_string = dquote (any*?) dquote ;
main := |*
quoted_string =>
{
current_token = QUOTED_STRING;
yylval.stringValue = string(lex.m_unScannedData[lex.m_ts:lex.m_te]);
fmt.Println("quoted string : ", yylval.stringValue)
fbreak;
};
The following expression with a single quoted string works fine:
if abc == "xyz.123" {
pp
}
If I scan the above condition, I get this output:
quoted string : "xyz.123"
But if I have two quoted strings, as shown below, it fails:
if abc == "0003" {
if xyz == "5003" {
pp
}
}
It scans across both quoted strings as one token:
quoted string : "0003" {
if xyz == "5003"
Can someone please help me with this, or suggest a better alternative?
I am using the version below:
# ragel -v
Ragel State Machine Compiler version 6.10 March 2017
Copyright (c) 2001-2009 by Adrian Thurston

This did the trick:
quoted_string = dquote (any - newline)* dquote ;
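The fix works because excluding the newline stops a match from crossing line boundaries. A regular-expression analogue of the greedy versus line-restricted behavior (a sketch in Python, not Ragel itself):

```python
import re

src = 'if abc == "0003" {\n    if xyz == "5003" {'

# Greedy analogue of dquote any* dquote: runs to the LAST quote in the input.
greedy = re.findall(r'"[\s\S]*"', src)

# Analogue of dquote (any - newline)* dquote: cannot cross a line boundary.
restricted = re.findall(r'"[^\n]*"', src)

print(greedy)      # ['"0003" {\n    if xyz == "5003"']
print(restricted)  # ['"0003"', '"5003"']
```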

Related

SableCC expecting EOF

I seem to be having issues with SableCC when generating a lexer from lexing.grammar.
This is what I run through SableCC:
Package lexing ; // A Java package is produced for the
// generated scanner
Helpers
num = ['0'..'9']+; // A num is 1 or more decimal digits
letter = ['a'..'z'] | ['A'..'Z'] ;
// A letter is a single upper or
// lowercase character.
Tokens
number = num; // A number token is a whole number
ident = letter (letter | num)* ;
// An ident token is a letter followed by
// 0 or more letters and numbers.
arith_op = [ ['+' + '-' ] + ['*' + '/' ] ] ;
// Arithmetic operators
rel_op = ['<' + '>'] | '==' | '<=' | '>=' | '!=' ;
// Relational operators
paren = ['(' + ')']; // Parentheses
blank = (' ' | '\t' | 10 | '\n')+ ; // White space
unknown = [0..0xffff] ;
// Any single character which is not part
// of one of the above tokens.
This is the result:
org.sablecc.sablecc.parser.ParserException: [21,1] expecting: EOF
at org.sablecc.sablecc.parser.Parser.parse(Parser.java:1792)
at org.sablecc.sablecc.SableCC.processGrammar(SableCC.java:203)
at org.sablecc.sablecc.SableCC.processGrammar(SableCC.java:171)
at org.sablecc.sablecc.SableCC.main(SableCC.java:137)
You can only have a short_comment if you put a newline after it. If you use long comments instead (/* ... */), there's no need for that.
The reason is that, according to the grammar that defines the SableCC 2.x input language, a short comment is defined as a consuming pattern of eol:
cr = 13;
lf = 10;
eol = cr lf | cr | lf; // This takes care of different platforms
short_comment = '//' not_cr_lf* eol;
Since the last line has:
// of one of the above tokens.
It consumes the last (invisible) EOF token expected at the end of any
.sable file, explaining the error.
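The effect is easy to reproduce with a regular expression that mirrors the short_comment definition above (a sketch: the pattern requires a terminating eol, so a comment on the file's last line, with no trailing newline, never completes):

```python
import re

# short_comment = '//' not_cr_lf* eol;
short_comment = re.compile(r"//[^\r\n]*(\r\n|\r|\n)")

print(bool(short_comment.fullmatch("// a comment\n")))  # True: eol present
print(bool(short_comment.fullmatch("// a comment")))    # False: no eol at EOF
```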

How to split a string based on comma, but not based on comma in double quote

I want to split this string on commas, but not on commas inside double quotes ("):
my $str = '1,2,3,"4,5,6"';
.say for $str.split(/','/) # Or use comb?
The output should be:
1
2
3
"4,5,6"
A fast solution with comb: take any run of characters that is neither " nor a comma,
or take a whole quoted string:
my $str = '1,2,3,"4,5,6",7,8';
.say for $str.comb: / <-[",]>+ | <["]> ~ <["]> <-["]>+ / ;
As @melpomene suggested, using the Text::CSV module works too.
use Text::CSV;
my $str = '123,456,"78,91",abc,"de,f","ikm"';
for csv(in => csv(in => [$str], sep_char => ",")) -> $arr {
.say for @$arr;
}
which outputs:
123
456
78,91
abc
de,f
ikm
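The comb pattern above (anything that is neither " nor a comma, or a whole quoted run) translates almost directly into a regular expression; a Python sketch for comparison:

```python
import re

s = '1,2,3,"4,5,6",7,8'
# Either a quoted run (kept with its quotes) or a run of non-quote, non-comma chars.
parts = re.findall(r'"[^"]*"|[^",]+', s)
print(parts)  # ['1', '2', '3', '"4,5,6"', '7', '8']
```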
This may help:
my $str = ‘1,2,3,"4,5,6",7,8’;
for $str.split(/ \" \d+ % ',' \"/, :v) -> $l {
if $l.contains('"') {
say $l.Str;
} else {
.say for $l.comb(/\d+/);
}
}
Output:
1
2
3
"4,5,6"
7
8
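The same quote-aware split is available in most languages' CSV libraries; for comparison, a minimal Python sketch of the Text::CSV approach:

```python
import csv

s = '1,2,3,"4,5,6"'
row = next(csv.reader([s]))  # quote-aware split; surrounding quotes are stripped
for field in row:
    print(field)
```

Note that, unlike the comb solution, the CSV parser strips the surrounding quotes, yielding 4,5,6 rather than "4,5,6".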

antlr4: How to keep comments in parse tree? [duplicate]

I'm writing a grammar in ANTLR that parses Java source files into ASTs for later analysis. Unlike other parsers (like JavaDoc) I'm trying to keep all of the comments. This is difficult because comments can be used literally anywhere in the code. If a comment is somewhere in the source code that doesn't match the grammar, ANTLR can't finish parsing the file.
Is there a way to make ANTLR automatically add any comments it finds to the AST? I know the lexer can simply ignore all of the comments using either {skip();} or by sending the text to the hidden channel. With either of those options set, ANTLR parses the file without any problems at all.
Any ideas are welcome.
Section 12.1 in "The Definitive Antlr 4 Reference" shows how to get access to comments without having to sprinkle the comments rules throughout the grammar. In short you add this to the grammar file:
grammar Java;
@lexer::members {
public static final int WHITESPACE = 1;
public static final int COMMENTS = 2;
}
Then for your comments rules do this:
COMMENT
: '/*' .*? '*/' -> channel(COMMENTS)
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(COMMENTS)
;
Then in your code, ask for the tokens through getHiddenTokensToLeft/getHiddenTokensToRight; see section 12.1 in the book for the details.
First: direct all comments to a dedicated channel (comments only):
COMMENT
: '/*' .*? '*/' -> channel(2)
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(2)
;
Second: print out all comments:
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (int index = 0; index < tokens.size(); index++)
{
Token token = tokens.get(index);
// substitute whatever parser you have
if (token.getType() != Parser.WS)
{
String out = "";
// Comments will be printed as channel 2 (configured in .g4 grammar file)
out += "Channel: " + token.getChannel();
out += " Type: " + token.getType();
out += " Hidden: ";
List<Token> hiddenTokensToLeft = tokens.getHiddenTokensToLeft(index);
for (int i = 0; hiddenTokensToLeft != null && i < hiddenTokensToLeft.size(); i++)
{
if (hiddenTokensToLeft.get(i).getType() != IDLParser.WS)
{
out += "\n\t" + i + ":";
out += "\n\tChannel: " + hiddenTokensToLeft.get(i).getChannel() + " Type: " + hiddenTokensToLeft.get(i).getType();
out += hiddenTokensToLeft.get(i).getText().replaceAll("\\s", "");
}
}
out += token.getText().replaceAll("\\s", "");
System.out.println(out);
}
}
Is there a way to make ANTLR automatically add any comments it finds to the AST?
No, you'll have to sprinkle your entire grammar with extra comments rules to account for all the valid places comments can occur:
...
if_stat
: 'if' comments '(' comments expr comments ')' comments ...
;
...
comments
: (SingleLineComment | MultiLineComment)*
;
SingleLineComment
: '//' ~('\r' | '\n')*
;
MultiLineComment
: '/*' .* '*/'
;
The feature "island grammars" can also be used. See the the following section in the ANTLR4 book:
Island Grammars: Dealing with Different Formats in the Same File
I did this in my lexer:
WS : ( [ \t\r\n] | COMMENT) -> skip
;
fragment
COMMENT
: '/*' .*? '*/' /* block comment */
| '//' ~('\r' | '\n')* /* line comment */
;
Like that, it will remove them automatically!
For ANTLR v3:
The whitespace tokens are usually not processed by the parser, but they are still captured on the HIDDEN channel.
If you use BufferedTokenStream, you can get to list of all tokens through it and do a postprocessing, adding them as needed.

How to tokenize blocks (comments, strings, ...) as well as inter-blocks (any char outside blocks)?

I need to tokenize everything that is "outside" any comment, until end of line. For instance:
take me */ and me /* but not me! */ I'm in! // I'm not...
Tokenized as (STR is the "outside" string, BC is block-comment and LC is single-line-comment):
{
STR: "take me */ and me ", // note the "*/" in the string!
BC : " but not me! ",
STR: " I'm in! ",
LC : " I'm not..."
}
And:
/* starting with don't take me */ ...take me...
Tokenized as:
{
BC : " starting with don't take me ",
STR: " ...take me..."
}
The problem is that STR can be anything except the comments, and since the comments openers are not single char tokens I can't use a negation rule for STR.
I thought maybe to do something like:
STR : { IsNextSequenceTerminatesThe_STR_rule(); }?;
But I don't know how to look-ahead for characters in lexer actions.
Is it even possible to accomplish with the ANTLR4 lexer, if yes then how?
Yes, it is possible to perform the tokenization you are attempting.
Based on what has been described above, you want nested comments. These can be achieved in the lexer alone, without actions, predicates, or any other code. For nested comments, it's easier if you do not use the greedy/non-greedy ANTLR options; you will need to encode this into the lexer grammar. Below are the three lexer rules you will need, along with the STR definition.
I added a parser rule for testing. I've not tested this, but it should do everything you mentioned. Also, it's not limited to 'end of line'; you can make that modification if you need to.
/*
All 3 COMMENTS are Mutually Exclusive
*/
DOC_COMMENT
: '/**'
( [*]* ~[*/] // Cannot START/END Comment
( DOC_COMMENT
| BLK_COMMENT
| INL_COMMENT
| .
)*?
)?
'*'+ '/' -> channel( DOC_COMMENT )
;
BLK_COMMENT
: '/*'
(
( /* Must never match an '*' in position 3 here, otherwise
there is a conflict with the definition of DOC_COMMENT
*/
[/]? ~[*/] // No START/END Comment
| DOC_COMMENT
| BLK_COMMENT
| INL_COMMENT
)
( DOC_COMMENT
| BLK_COMMENT
| INL_COMMENT
| .
)*?
)?
'*/' -> channel( BLK_COMMENT )
;
INL_COMMENT
: '//'
( ~[\n\r*/] // No NEW_LINE
| INL_COMMENT // Nested Inline Comment
)* -> channel( INL_COMMENT )
;
STR // Consume everything up to the start of a COMMENT
: ( ~'/' // Any Char not used to START a Comment
| '/' ~[*/] // Cannot START a Comment
)+
;
start
: DOC_COMMENT
| BLK_COMMENT
| INL_COMMENT
| STR
;
Try something like this:
grammar T;
@lexer::members {
// Returns true iff either "//" or "/*" is ahead in the char stream.
boolean startCommentAhead() {
return _input.LA(1) == '/' && (_input.LA(2) == '/' || _input.LA(2) == '*');
}
}
// other rules
STR
: ( {!startCommentAhead()}? . )+
;
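The predicate above amounts to scanning STR text until a two-character lookahead sees a comment opener. As a plain-code sketch of that idea (a simplified, non-nesting version, not the ANTLR lexer itself):

```python
def tokenize(src):
    """Split src into STR / BC / LC tokens (simplified sketch: no nesting)."""
    tokens, i, n = [], 0, len(src)
    while i < n:
        if src.startswith("/*", i):                  # block comment
            end = src.find("*/", i + 2)
            end = n if end == -1 else end
            tokens.append(("BC", src[i + 2:end]))
            i = end + 2
        elif src.startswith("//", i):                # line comment, runs to newline/EOF
            end = src.find("\n", i)
            end = n if end == -1 else end
            tokens.append(("LC", src[i + 2:end]))
            i = end
        else:                                        # STR: consume until a comment opener
            j = i
            while j < n and not (src.startswith("/*", j) or src.startswith("//", j)):
                j += 1
            tokens.append(("STR", src[i:j]))
            i = j
    return tokens

for kind, text in tokenize("take me */ and me /* but not me! */ I'm in! // I'm not..."):
    print(kind, repr(text))  # STR, BC, STR, LC -- as in the question's example
```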

How to find the length of a token in antlr?

I am trying to create a grammar which accepts any character or number or just about anything, provided its length is equal to 1.
Is there a function to check the length?
EDIT
Let me make my question more clear with an example.
I wrote the following code:
grammar first;
tokens {
SET = 'set';
VAL = 'val';
UND = 'und';
CON = 'con';
ON = 'on';
OFF = 'off';
}
@parser::members {
private boolean inbounds(Token t, int min, int max) {
int n = Integer.parseInt(t.getText());
return n >= min && n <= max;
}
}
parse : SET expr;
expr : VAL('u'('e')?)? String |
UND('e'('r'('l'('i'('n'('e')?)?)?)?)?)? (ON | OFF) |
CON('n'('e'('c'('t')?)?)?)? oneChar
;
CHAR : 'a'..'z';
DIGIT : '0'..'9';
String : (CHAR | DIGIT)+;
dot : .;
oneChar : dot { $dot.text.length() == 1;} ;
Space : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
I want my grammar to do the following things:
Accept commands like: 'set value abc', 'set underli on', 'set conn #'. The grammar should be intelligent enough to accept incomplete words like 'underl' instead of 'underline', etc.
The third syntax, 'set connect oneChar', should accept any character, but just one character. It can be a digit, a letter, or any special character. I am getting a compiler error in the generated parser file because of this.
The first syntax, 'set value', should accept all possible strings, even on and off. But when I give something like 'set value offer', the grammar fails. I think this is happening because I already have a token 'OFF'.
None of the three requirements listed above work in my grammar, and I don't know why.
There are some mistakes and/or bad practices in your grammar:
#1
The following is not a validating predicate:
{$dot.text.length() == 1;}
A proper validating predicate in ANTLR has a question mark at the end, and the inner code has no semicolon at the end. So it should be:
{$dot.text.length() == 1}?
instead.
#2
You should not be handling these alternative commands:
expr
: VAL('u'('e')?)? String
| UND('e'('r'('l'('i'('n'('e')?)?)?)?)?)? (ON | OFF)
| CON('n'('e'('c'('t')?)?)?)? oneChar
;
in a parser rule. You should let the lexer handle this instead. Something like this will do it:
expr
: VAL String
| UND (ON | OFF)
| CON oneChar
;
// ...
VAL : 'val' ('u' ('e')?)?;
UND : 'und' ( 'e' ( 'r' ( 'l' ( 'i' ( 'n' ( 'e' )?)?)?)?)?)?;
CON : 'con' ( 'n' ( 'e' ( 'c' ( 't' )?)?)?)?;
(also see #5!)
#3
Your lexer rules:
CHAR : 'a'..'z';
DIGIT : '0'..'9';
String : (CHAR | DIGIT)+;
are making things complicated for you. The lexer can produce three different kinds of tokens because of this: CHAR, DIGIT or String. Ideally, you should only create String tokens, since a String can already be a single CHAR or DIGIT. You can do that by adding the fragment keyword before these rules:
fragment CHAR : 'a'..'z' | 'A'..'Z';
fragment DIGIT : '0'..'9';
String : (CHAR | DIGIT)+;
There will now be no CHAR and DIGIT tokens in your token stream, only String tokens. In short: fragment rules are only used inside lexer rules, by other lexer rules. They will never be tokens of their own (and can therefore never appear in any parser rule!).
#4
The rule:
dot : .;
does not do what you think it does. It matches "any token", not "any character". Inside a lexer rule, the . matches any character but in parser rules, it matches any token. Realize that parser rules can only make use of the tokens created by the lexer.
The input source is first tokenized based on the lexer rules. After that has been done, the parser (through its parser rules) can then operate on these tokens (not characters!!!). Make sure you understand this! (If not, ask for clarification or grab a book about ANTLR.)
- an example -
Take the following grammar:
p : . ;
A : 'a' | 'A';
B : 'b' | 'B';
The parser rule p will now match any token that the lexer produces: which is only a A- or B-token. So, p can only match one of the characters 'a', 'A', 'b' or 'B', nothing else.
And in the following grammar:
prs : . ;
FOO : 'a';
BAR : . ;
the lexer rule BAR matches any single character in the range \u0000 .. \uFFFF, but it can never match the character 'a' since the lexer rule FOO is defined before the BAR rule and captures this 'a' already. And the parser rule prs again matches any token, which is either FOO or BAR.
#5
Putting single characters like 'u' inside your parser rules will cause the lexer to tokenize a u as a separate token: you don't want that. Also, by putting them in parser rules, it is unclear which token has precedence over other tokens. You should keep all such literals out of your parser rules and make them explicit lexer rules instead. Only use lexer rules in your parser rules.
So, don't do:
pRule : 'u' ':' String
String : ...
but do:
pRule : U ':' String
U : 'u';
String : ...
You could make ':' a lexer rule, but that is of less importance. The 'u' however can also be a String so it must appear as a lexer rule before the String rule.
Okay, those were the most obvious things that come to mind. Based on them, here's a proposed grammar:
grammar first;
parse
: (SET expr {System.out.println("expr = " + $expr.text);} )+ EOF
;
expr
: VAL String {System.out.print("A :: ");}
| UL (ON | OFF) {System.out.print("B :: ");}
| CON oneChar {System.out.print("C :: ");}
;
oneChar
: String {$String.text.length() == 1}?
;
SET : 'set';
VAL : 'val' ('u' ('e')?)?;
UL : 'und' ( 'e' ( 'r' ( 'l' ( 'i' ( 'n' ( 'e' )?)?)?)?)?)?;
CON : 'con' ( 'n' ( 'e' ( 'c' ( 't' )?)?)?)?;
ON : 'on';
OFF : 'off';
String : (CHAR | DIGIT)+;
fragment CHAR : 'a'..'z' | 'A'..'Z';
fragment DIGIT : '0'..'9';
Space : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
that can be tested with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"set value abc \n" +
"set underli on \n" +
"set conn x \n" +
"set conn xy ";
ANTLRStringStream in = new ANTLRStringStream(source);
firstLexer lexer = new firstLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
firstParser parser = new firstParser(tokens);
System.out.println("parsing:\n======\n" + source + "\n======");
parser.parse();
}
}
which, after generating the lexer and parser:
java -cp antlr-3.2.jar org.antlr.Tool first.g
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar Main
prints the following output:
parsing:
======
set value abc
set underli on
set conn x
set conn xy
======
A :: expr = value abc
B :: expr = underli on
C :: expr = conn x
line 0:-1 rule oneChar failed predicate: {$String.text.length() == 1}?
C :: expr = conn xy
As you can see, the last command, C :: expr = conn xy, produces an error, as expected.