Simple Ragel grammar with optional whitespace - ragel

Ragel is a powerful tool, but I have trouble with 'optional' elements in a grammar. Each line of my input is a simple list of numbers or strings. The trouble is with whitespace: I don't know how to correctly allow optional whitespace between ',' and a variable. A newline may appear anywhere between tokens. A line ends with ';' or a newline. I need to use the $err() function for errors.
This is my test set:
good
this , is , a , test ; and, this,
is,ok
next, trouble
How,produce,good
grammar;
ok
output:
line(this,is,a,test)
line(and,this,is,ok)
line(next,trouble)
line(How,produce,good)
line(grammar)
line(ok)
and these should fail ('this not' has no ',' between the words; ',,' has no number or variable between the commas):
this not , working
and,
this,, too
When I use this grammar, I get separate chars or an error at the end of the line:
whitespace = [ \t\v\f] ;
enter = [\r\n] ;
string = (alnum | '_')+ ;
number = ('+'|'-')?[0-9]+'.'[0-9]+( [eE] ('+'|'-')? [0-9]+ )? ;
var = string | number ;
koniec = (';' | enter) ;
line = var whitespace* ( ',' whitespace* var )* whitespace* koniec ;
main := whitespace* ( line )* ;
This is my whole code: https://github.com/and09/simple_grammar

It's a bit hard to give definitive answers without a full specification of your grammar, but let's at least try to make your example work the way you want it to, and then you should be able to correct it if needed.
Your full example from GitHub has some printing actions in it, and those actually tell a lot about what's going on in the state machine (the other thing you should check periodically while working with Ragel is the state machine graph that it can produce for you). In its initial specification (same as in the question) it outputs the following when run:
[this]< >,< >[is]
So it has a problem going into the third variable. Why is that? Well, that's because your line only specifies one ( ',' whitespace* var) element. If you try to fix that by specifying ( ',' whitespace* var)*, it won't work either, because now you're demanding that on each repetition var be immediately followed by a comma, without any whitespace. Let's try this (actions intentionally removed), moving whitespace into the repeating group:
line = var whitespace* ( ',' whitespace* var whitespace*)* koniec;
Now you get this in the output:
[this]< >,< >[is]< >,< >[a]< >< >< >,< >[test]< >
Which is an obvious improvement. So why does it fail now? Well, that's because after your koniec the machine wants to wrap into the next line, but in order to do that it needs to see a var. We have whitespace after ; in the input instead. So we need to change our definition of line to allow some whitespace at the beginning, which also makes the whitespace in main redundant, so let's try these definitions:
line = whitespace* var whitespace* ( ',' whitespace* var whitespace*)* koniec;
main:= line*;
Now we have this output:
[this]< >,< >[is]< >,< >[a]< >< >< >,< >[test]< >
< >[and],< >[this]
Which again is better, but still not good enough. Now you can see that it chokes on the newline, which is actually a bit unclear to me too. You say that
The end line is ';' or enter
Yet you want to get
line(and,this,is,ok)
So let's assume that enter starts a new line unless there is a comma at the end of the line. To specify that in the grammar, let's do this:
line = whitespace* var whitespace* ( ',' (whitespace | enter)* var whitespace*)* koniec;
Now you get this in the output:
[this]< >,< >[is]< >,< >[a]< >< >< >,< >[test]< >
< >[and],< >[this],[is],[ok]
Why is it not going further? That's because our line has to have the var but we have an empty line in the input instead. That also raises a question of whitespace-only lines, so let's make our line work with whitespace-only content like this:
line = whitespace* (var whitespace* ( ',' (whitespace | enter)* var whitespace*)*)? koniec;
And bang! Suddenly you have all the word groups you want in the output. But you also have some extra lines. Those are actually very easy to fix: you just need to move your pisz_enter action from koniec into the line, like this:
vargroup = var whitespace* ( ',' %pisz_przecinek (whitespace | enter)* var whitespace*)* %pisz_enter;
line = whitespace* vargroup? koniec;
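To check the final definitions against the test set without rebuilding the Ragel machine, here is a minimal Python sketch that transcribes them into regular expressions. It is not the Ragel code itself: the names WS, VAR, VARGROUP, LINE, parse and accepts are hypothetical, and a plain ValueError stands in for the $err() action.

```python
import re

WS = r"[ \t\v\f]*"                                                  # whitespace*
VAR = r"(?:[+-]?[0-9]+\.[0-9]+(?:[eE][+-]?[0-9]+)?|[A-Za-z0-9_]+)"  # number | string
# vargroup = var whitespace* ( ',' (whitespace | enter)* var whitespace* )*
VARGROUP = rf"{VAR}{WS}(?:,[ \t\v\f\r\n]*{VAR}{WS})*"
# line = whitespace* vargroup? koniec, with koniec = ';' | enter
LINE = re.compile(rf"{WS}({VARGROUP})?(?:;|\r|\n)")

def parse(text):
    """Return one 'line(...)' string per non-empty line, or raise ValueError."""
    out, pos = [], 0
    while pos < len(text):
        m = LINE.match(text, pos)
        if m is None:
            raise ValueError(f"syntax error near offset {pos}")
        if m.group(1):  # skip empty and whitespace-only lines
            out.append("line(" + ",".join(re.findall(VAR, m.group(1))) + ")")
        pos = m.end()
    return out

def accepts(text):
    """True if the whole text matches main := line*."""
    try:
        parse(text)
        return True
    except ValueError:
        return False
```

Calling parse() on the good test set (terminated by a newline) yields the six line(...) entries from the question, while accepts() returns False for each of the three failing inputs.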
That's it. Two other things I noticed:
you want your number to be something like
number = (('+'|'-')?[0-9]+'.'[0-9]+( [eE] ('+'|'-')? [0-9]+ )?) >Poczatek_Napisu %pisz_stala ;
to be printed properly
you actually need to redo token extraction for it to work properly. The reason is that you're reading from the file in fixed-size chunks, and your actions currently store a token start pointer (poczatek_napisu). If a token is split between chunks (which is very likely on any file longer than sizeof bufor), you're going to have a problem. It's not an FSM problem (the machine will work just fine); it's just what you do in the actions. But that's beyond the scope of the current question.
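The chunk-boundary problem from the last point can be illustrated outside Ragel. The sketch below is an assumption-laden Python model, not the code from the repository: tokens_from_chunks and carry are hypothetical names, and only the string token is handled. The idea is to hold back a token that touches the end of the buffer and prepend it to the next chunk.

```python
import re

TOKEN = re.compile(r"[A-Za-z0-9_]+")
TAIL = re.compile(r"[A-Za-z0-9_]+\Z")   # a token that touches the end of the buffer

def tokens_from_chunks(chunks):
    """Yield tokens from an iterable of text chunks, joining split tokens."""
    carry = ""                           # unfinished token from the previous chunk
    for chunk in chunks:
        data = carry + chunk
        tail = TAIL.search(data)
        # Everything before `cut` is safe to tokenize now.
        cut = tail.start() if tail else len(data)
        for m in TOKEN.finditer(data, 0, cut):
            yield m.group()
        carry = data[cut:]               # may continue in the next chunk
    if carry:                            # input ended, so the tail is complete
        yield carry
```

For example, list(tokens_from_chunks(["this, is", ",ok\nne", "xt"])) gives ['this', 'is', 'ok', 'next'], even though "next" is split across two chunks.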

Related

Parse string antlr

I have strings as a parser rule rather than lexer because strings may contain escapes with expressions in them, such as "The variable is \(variable)".
string
: '"' character* '"'
;
character
: escapeSequence
| .
;
escapeSequence
: '\(' expression ')'
;
IDENTIFIER
: [a-zA-Z][a-zA-Z0-9]*
;
WHITESPACE
: [ \r\t,] -> skip
;
This doesn't work because . matches any token rather than any character, so many identifiers will be matched and whitespace will be completely ignored.
How can I parse strings that can have expressions inside of them?
Looking into the parser for Swift and Javascript, both languages that support things like this, I can't figure out how they work. From what I can tell, they just output a string such as "my string with (variables) in it" without actually being able to parse the variable as its own thing.
This problem can be approached using lexical modes by having one mode for the inside of strings and one (or more) for the outside. Seeing a " on the outside would switch to the inside mode and seeing a \( or " would switch back outside. The only complicated part would be seeing a ) on the outside: Sometimes it should switch back to the inside (because it corresponds to a \() and some times it shouldn't (when it corresponds to a plain ().
The most basic way to achieve this would be like this:
Lexer:
lexer grammar StringLexer;
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* ;
DQUOTE: '"' -> pushMode(IN_STRING);
LPAR: '(' -> pushMode(DEFAULT_MODE);
RPAR: ')' -> popMode;
mode IN_STRING;
TEXT: ~[\\"]+ ;
BACKSLASH_PAREN: '\\(' -> pushMode(DEFAULT_MODE);
ESCAPE_SEQUENCE: '\\' . ;
DQUOTE_IN_STRING: '"' -> type(DQUOTE), popMode;
Parser:
parser grammar StringParser;
options {
tokenVocab = 'StringLexer';
}
start: exp EOF ;
exp : '(' exp ')'
| IDENTIFIER
| DQUOTE stringContents* DQUOTE
;
stringContents : TEXT
| ESCAPE_SEQUENCE
| '\\(' exp ')'
;
Here we push the default mode every time we see a ( or \( and pop the mode every time we see a ). This way it will go back inside the string only if the mode on top of the stack is the string mode, which would only be the case if there aren't any unclosed ( left since the last \(.
This approach works, but has the downside that an unmatched ) will cause an empty stack exception rather than a normal syntax error because we're calling popMode on an empty stack.
To avoid this, we can add a member that tracks how deeply nested we are inside parentheses and doesn't pop the stack when the nesting level is 0 (i.e. if the stack is empty):
@members {
int nesting = 0;
}
LPAR: '(' {
nesting++;
pushMode(DEFAULT_MODE);
};
RPAR: ')' {
if (nesting > 0) {
nesting--;
popMode();
}
};
mode IN_STRING;
BACKSLASH_PAREN: '\\(' {
nesting++;
pushMode(DEFAULT_MODE);
};
(The parts I left out are the same as in the previous version).
This works and produces normal syntax errors for unmatched )s. However, it contains actions and is thus no longer language-agnostic, which is only a problem if you plan to use the grammar from multiple languages (and depending on the language, you might even be lucky and the code might be valid in all of your targeted languages).
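To see how the mode stack and the nesting counter interact, here is a small Python simulation of the lexer with actions. It is only a sketch, not generated ANTLR code: tokenize, emit and the string mode names are hypothetical, and characters that no rule covers in the default mode are silently skipped (an assumption, since the grammar above has no whitespace rule).

```python
import re

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
TEXT = re.compile(r'[^\\"]+')

def tokenize(text):
    tokens, stack, mode, nesting, i = [], [], "DEFAULT", 0, 0

    def emit(kind, lexeme):
        nonlocal i
        tokens.append((kind, lexeme))
        i += len(lexeme)

    while i < len(text):
        ch = text[i]
        if mode == "IN_STRING":
            if ch == '"':
                emit("DQUOTE", ch)
                mode = stack.pop()               # popMode
            elif text.startswith("\\(", i):
                emit("BACKSLASH_PAREN", "\\(")
                nesting += 1
                stack.append(mode)               # pushMode(DEFAULT_MODE)
                mode = "DEFAULT"
            elif ch == "\\":
                emit("ESCAPE_SEQUENCE", text[i:i + 2])
            else:
                emit("TEXT", TEXT.match(text, i).group())
        elif ch == '"':
            emit("DQUOTE", ch)
            stack.append(mode)                   # pushMode(IN_STRING)
            mode = "IN_STRING"
        elif ch == "(":
            emit("LPAR", ch)
            nesting += 1
            stack.append(mode)                   # pushMode(DEFAULT_MODE)
        elif ch == ")":
            emit("RPAR", ch)
            if nesting > 0:                      # guard: never pop an empty stack
                nesting -= 1
                mode = stack.pop()
        elif (m := IDENT.match(text, i)):
            emit("IDENTIFIER", m.group())
        else:
            i += 1                               # assumption: skip anything else
    return tokens
```

With the input "a \(b) c" (including the quotes), the ) after b pops back into the string, so the remaining " c" is lexed as TEXT. A lone ) still produces an RPAR token without touching the stack, leaving the error to the parser.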
If you want to avoid actions, the last alternative would be to have three modes: One for code that's outside of any strings, one for the inside of the string and one for the inside of \(). The third one will be almost identical to the outer one, except that it will push and pop the mode when seeing parentheses, whereas the outer one will not. To make both modes produce the same types of tokens, the rules in the third mode will all call type(). This will look like this:
lexer grammar StringLexer;
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* ;
DQUOTE: '"' -> pushMode(IN_STRING);
LPAR: '(';
RPAR: ')';
mode IN_STRING;
TEXT: ~[\\"]+ ;
BACKSLASH_PAREN: '\\(' -> pushMode(EMBEDDED);
ESCAPE_SEQUENCE: '\\' . ;
DQUOTE_IN_STRING: '"' -> type(DQUOTE), popMode;
mode EMBEDDED;
E_IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* -> type(IDENTIFIER);
E_DQUOTE: '"' -> pushMode(IN_STRING), type(DQUOTE);
E_LPAR: '(' -> type(LPAR), pushMode(EMBEDDED);
E_RPAR: ')' -> type(RPAR), popMode;
Note that we now can no longer use string literals in the parser grammar because string literals can't be used when multiple lexer rules are defined using the same string literal. So now we have to use LPAR instead of '(' in the parser and so on (we already had to do this for DQUOTE for the same reason).
Since this version involves a lot of duplication (especially as the amount of tokens rises) and prevents the use of string literals in the parser grammar, I generally prefer the version with the actions.
The full code for all three alternatives can also be found on GitHub.

antlr4 multiline string parsing

If I have a ONELINE_STRING fragment rule in an antlr4 lexer that identifies a simple quoted string on one line, how can I create a more general STRING rule in the lexer that will concatenate adjacent ONELINE_STRING's (ie, separated only by whitespace and/or comments) as long as they each start on a different line?
ie,
"foo" "bar"
would be parsed as two STRING tokens, "foo" followed by "bar"
while:
"foo"
"bar"
would be seen as one STRING token: "foobar"
For clarification: The idea is that while I generally want the parser to be able to recognize adjacent strings as separate, and whitespace and comments to be ignored by the parser, I want to use the idea that if the last non-whitespace sub-token on a line was a string, and the first sub-token on the next line that is not all whitespace is also a string, then the separate strings should be concatenated into one long string as a means of specifying potentially very long strings without having to put the whole thing on one line. This is very straightforward if I were wanting all adjacent string sub-tokens to be concatenated, as they are in C... but for my purposes, I only want concatenation to occur when the string sub-tokens start on different lines. This concatenation should be invisible to any rule in the parser that might use a string. This is why I was thinking it might be better to situate the rule inside the lexer instead of the parser, but I'm not wholly opposed to doing this in the parser, and all the parsing rules which might have referred to a STRING token would instead refer to the parser string rule whenever they want a string.
Sample1:
"desc" "this sample will parse as two strings."
Sample3 (note, 'output' is a keyword in the language):
output "this is a very long line that I've explicitly made so that it does not "
"easily fit on just one line, so it gets split up into separate ones for "
"ease of reading, but the parser should see it all as one long string. "
"This example will parse as if the output command had been followed by "
"only a single string, even though it is composed of multiple string "
"fragments, all of which should be invisible to the parser.%n";
Both of these examples should be accepted as valid by the parser. The former is an example of a declaration, while the latter is an example of an imperative statement in the language.
Addendum:
I had originally been thinking that this would need to be done in the lexer because, although newlines are supposed to be ignored by the parser like all other whitespace, a multiline string is actually sensitive to the presence of newlines, and I did not think that the parser could perceive that.
However, I have been thinking that it may be possible to have the ONELINE_STRING as a lexer rule, and have a general 'string' parser rule which detects adjacent ONELINE_STRINGS, using a predicate between strings to detect if the next ONELINE_STRING token is starting on a different line than the previous one, and if so, it should invisibly concatenate them so that its text is indistinguishable from a string that had been specified all on one line. I am unsure of the logistics of how this would be implemented, however.
Okay, I have it.
I need to have the string recognizer in the parser, as some of you have suggested. The trick is to use lexer modes in the lexer.
So in the Lexer file I have this:
BEGIN_STRING : '"' -> pushMode(StringMode);
mode StringMode;
END_STRING: '"'-> popMode;
STRING_LITERAL_TEXT : ~[\r\n%"];
STRING_LITERAL_ESCAPE_QUOTE : '%"' { setText("\""); };
STRING_LITERAL_ESCAPE_PERCENT: '%%' { setText("%"); };
STRING_LITERAL_ESCAPE_NEWLINE : '%n'{ setText("\n"); };
UNTERMINATED_STRING: { _input.LA(1) == '\n' || _input.LA(1) == '\r' || _input.LA(1) == EOF}? -> popMode;
And in the parser file I have this:
string returns [String text] locals [int line] : a=stringLiteral { $line = $a.line; $text=$a.text;}
({_input.LT(1)!=null && _input.LT(1).getLine()>$line}?
a=stringLiteral { $line = $a.line; $text+=$a.text; })*
;
stringLiteral returns [int line, String text]: BEGIN_STRING {$text = "";}
(a=(STRING_LITERAL_TEXT
| STRING_LITERAL_ESCAPE_NEWLINE
| STRING_LITERAL_ESCAPE_QUOTE
| STRING_LITERAL_ESCAPE_PERCENT
) {$text+=$a.text;} )*
stringEnd { $line = $BEGIN_STRING.line; }
;
stringEnd: END_STRING #string_finish
| UNTERMINATED_STRING #string_hang
;
The string rule thus concatenates adjacent string literals as long as they are on different lines. The stringEnd rule needs an event handler for when a string literal is not terminated correctly so that the parser can report a syntax error, but the string is otherwise treated as if it had been closed correctly.
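The concatenation logic of the string rule, separated from ANTLR, can be sketched in Python as follows. This is an illustration under assumptions: merge_strings is a hypothetical name, and the input is taken to be a run of adjacent string literals, each given as its already unescaped text plus the line on which it starts.

```python
def merge_strings(literals):
    """literals: list of (text, start_line) pairs in source order."""
    merged = []
    last_line = None
    for text, line in literals:
        if merged and line > last_line:
            merged[-1] += text      # starts on a later line: continue the string
        else:
            merged.append(text)     # same line (or first literal): new string
        last_line = line
    return merged
```

For instance, merge_strings([("foo", 1), ("bar", 1)]) keeps two strings, while merge_strings([("foo", 1), ("bar", 2)]) returns the single string "foobar".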
EDIT: Sorry, have not read your requirements fully. The following approach would match both examples not only the desired one. Have to think about it...
The simplest way would be to do this in the parser. And I see no point that would require this to be done in the lexer.
multiString : singleString +;
singleString : ONELINE_STRING;
ONELINE_STRING: ...; // no fragment!
WS : ... -> skip;
Comment : ... -> skip;
As already mentioned, the (IMO) better way would be to handle this inside the parser. But here's a way to handle it in the lexer:
STRING
: SINGLE_STRING ( LINE_CONTINUATION SINGLE_STRING )*
;
HIDDEN
: ( SPACE | LINE_BREAK | COMMENT ) -> channel(HIDDEN)
;
fragment SINGLE_STRING
: '"' ~'"'* '"'
;
fragment LINE_CONTINUATION
: ( SPACE | COMMENT )* LINE_BREAK ( SPACE | COMMENT )*
;
fragment SPACE
: [ \t]
;
fragment LINE_BREAK
: [\r\n]
| '\r\n'
;
fragment COMMENT
: '//' ~[\r\n]+
;
Tokenizing the input:
"a" "b"
"c"
"d"
"e"
"f"
would create the following 5 tokens:
"a"
"b"
"c"\n"d"
"e"
"f"
However, if the token would include a comment:
"c" // comment
"d"
then you'd need to strip this "// comment" from the token yourself at a later stage. The lexer will not be able to put this substring on a different channel, or skip it.
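One way to do that later clean-up is sketched below in Python. The names PIECE and string_value are hypothetical, and the sketch assumes the same simple string syntax as SINGLE_STRING above (no escaped quotes). Comments are matched as their own alternative, so a quote inside a comment is not mistaken for the start of a string.

```python
import re

# Either a quoted string or a line comment; anything else is skipped over.
PIECE = re.compile(r'"[^"]*"|//[^\r\n]*')

def string_value(token_text):
    """Join the contents of all string pieces in a multi-line STRING token."""
    parts = []
    for m in PIECE.finditer(token_text):
        piece = m.group()
        if piece.startswith('"'):
            parts.append(piece[1:-1])   # drop the surrounding quotes
    return "".join(parts)
```

Applied to the token text "c" // comment followed by a newline and "d", string_value returns cd.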

"skip" changes parser behavior

Adding skip to a rule doesn't do what I expect. Here's a grammar for a pair of tokens separated by a comma and a space. I made one version where the comma is marked skip, and one where it isn't:
grammar Commas;
COMMA: ', ';
COMMASKIP: ', ' -> skip;
DATA: ~[, \n]+;
withoutSkip: data COMMA data '\n';
withSkip: data COMMASKIP data '\n';
data: DATA;
Testing the rule without skip works as expected:
$ echo 'a, b' | grun Commas withoutSkip -tree
(withoutSkip (data a) , (data b) \n)
With skip gives me an error:
$ echo 'a, b' | grun Commas withSkip -tree
line 1:1 mismatched input ', ' expecting COMMASKIP
(withSkip (data a) , b \n)
If I comment out the COMMA and withoutSkip rules I get this:
$ echo 'a, b' | grun Commas withSkip -tree
line 1:3 missing ', ' at 'b'
(withSkip (data a) <missing ', '> (data b) \n)
I am trying to get output that just has the data tokens without the comma, like this:
(withSkip (data a) (data b) \n)
What am I doing wrong?
skip causes the lexer to discard the token. Therefore, a skipped lexer rule cannot be used in parser rules.
Another thing: when two or more lexer rules match the same input, the rule defined first in the grammar "wins", no matter which token the parser is trying to match at that point. In your case, a COMMASKIP token will never be created, since COMMA matches the same input and is defined first.
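The tie-breaking behavior can be modeled with a few lines of Python. This is a rough sketch of the selection policy (longest match, earliest rule on ties), not ANTLR's actual implementation; next_token, tokenize and the rule table are hypothetical, and the lexer patterns are rewritten as Python regular expressions.

```python
import re

def next_token(rules, text, pos):
    """Pick the longest match at pos; on a tie the earliest rule wins."""
    best = None
    for name, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        # Strictly longer is required to replace an earlier rule's match.
        if m and m.end() > pos and (best is None or m.end() > best[1].end()):
            best = (name, m)
    return best

def tokenize(rules, text):
    pos, out = 0, []
    while pos < len(text):
        hit = next_token(rules, text, pos)
        if hit is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, m = hit
        out.append((name, m.group()))
        pos = m.end()
    return out

# Same order as in the question's grammar.
RULES = [("COMMA", r", "), ("COMMASKIP", r", "), ("DATA", r"[^, \n]+")]
```

tokenize(RULES, "a, b") produces DATA, COMMA, DATA. COMMASKIP never appears because COMMA matches the same two characters and comes first.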
Try something like this:
grammar Commas;
COMMA : ',' -> skip;
SPACE : (' '|'\n') -> skip;
DATA : ~[, \n]+;
data : DATA+;
EDIT
So how do I specify where the comma goes without including it in the parse tree? Your code would match a, , b.
You don't. If the comma is significant (i.e. a,,b is invalid), it cannot be skipped by the lexer.
I think in antlr3 you're supposed to use an exclamation point.
In ANTLR 4, you cannot create an AST from your parse. In the new version, all terminals/rules are in one parse tree. You can iterate over this tree with custom visitors and/or listeners. A demo of how to do this can be found in this Q&A: Once grammar is complete, what's the best way to walk an ANTLR v4 tree?
In your case, the grammar would look like this:
grammar X;
COMMA : ',';
SPACE : (' '|'\n') -> skip;
DATA : ~[, \n]+;
data : DATA (COMMA DATA)*;
and then create a listener like this:
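import java.util.List;
import org.antlr.v4.runtime.tree.TerminalNode;
public class MyListener extends XBaseListener {
    @Override
    public void enterData(XParser.DataContext ctx) {
        // DATA can occur several times in the rule, so the generated accessor returns a list
        List<TerminalNode> dataList = ctx.DATA();
        // do something with `dataList`
    }
}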
As you can see, the COMMA is not removed, but inside enterData(...) you just only use the DATA tokens.

Why do I have a shift/reduce conflict on the ')' and not '('?

I have syntax like
%(var)
and
%var
and
(var)
My rules are something like
optExpr:
| '%''('CommaLoop')'
| '%' CommaLoop
CommaLoop:
val | CommaLoop',' val
Expr:
MoreRules
| '(' val ')'
The problem is that it doesn't seem to be able to tell whether ) belongs to %(CommaLoop) or to % (val), but it complains on the ) instead of the (. What the heck? Shouldn't it complain on the (? And how should I fix the error? I think making %( a token is a good solution, but I want to be sure why %( isn't an error before doing this.
This is due to the way LR parsing works. LR parsing is effectively bottom-up, grouping together tokens according to the RHS of your grammar rules, and replacing them with the LHS. When the parser 'shifts', it puts a token on the stack, but doesn't actually match a rule yet. Instead, it tracks partially matched rules via the current state. When it gets to a state that corresponds to the end of the rule, it can reduce, popping the symbols for the RHS off the stack and pushing back a single symbol denoting the LHS. So if there are conflicts, they don't show up until the parser gets to the end of some rule and can't decide whether to reduce (or what to reduce).
In your example, after seeing % ( val, that is what will be on the stack (top is at the right side here). When the lookahead is ), it can't decide whether it should pop the val and reduce via the rule CommaLoop: val, or if it should shift the ) so it can then pop 3 things and reduce with the rule Expr: '(' val ')'
I'm assuming here that you have some additional rules such as CommaLoop: Expr, otherwise your grammar doesn't actually match anything and bison/yacc will complain about unused non-terminals.
Right now, your explanation and your grammar don't seem to match. In your explanation, you show all three phrases as having 'var', but your grammar shows the ones starting with '%' as allowing a comma-separated list, while the one without allows only a single 'val'.
For the moment, I'll assume all three should allow a comma-separated list. In this case, I'd factor the grammar more like this:
optExpr: '%' aList
aList: CommaLoop
| parenList
parenList: '(' CommaLoop ')'
CommaLoop:
| val
| CommaLoop ',' val
Expr: MoreRules
| parenList
I've changed optExpr and Expr so neither can match an empty sequence -- my guess is you probably didn't intend that to start with. I've fleshed this out enough to run it through byacc; it produces no warnings or errors.

How can I construct a clean, Python like grammar in ANTLR?

G'day!
How can I construct a simple ANTLR grammar handling multi-line expressions without the need for either semicolons or backslashes?
I'm trying to write a simple DSL for expressions:
# sh style comments
ThisValue = 1
ThatValue = ThisValue * 2
ThisOtherValue = (1 + 2 + ThisValue * ThatValue)
YetAnotherValue = MAX(ThisOtherValue, ThatValue)
Overall, I want my application to provide the script with some initial named values and pull out the final result. I'm getting hung up on the syntax, however. I'd like to support multiple line expressions like the following:
# Note: no backslashes required to continue expression, as we're in brackets
# Note: no semicolon required at end of expression, either
ThisValueWithAReallyLongName = (ThisOtherValueWithASimilarlyLongName
+AnotherValueWithAGratuitouslyLongName)
I started off with an ANTLR grammar like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL!?
;
empty_line
: NL;
assignment
: ID '=' expr
;
// ... and so on
It seems simple, but I'm already in trouble with the newlines:
warning(200): StackOverflowQuestion.g:11:20: Decision can match input such as "NL" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Graphically, in org.antlr.works.IDE:
Decision Can Match NL Using Multiple Alternatives http://img.skitch.com/20090723-ghpss46833si9f9ebk48x28b82.png
I've kicked the grammar around, but always end up with violations of expected behavior:
A newline is not required at the end of the file
Empty lines are acceptable
Everything in a line from a pound sign onward is discarded as a comment
Assignments end with end-of-line, not semicolons
Expressions can span multiple lines if wrapped in brackets
I can find example ANTLR grammars with many of these characteristics. I find that when I cut them down to limit their expressiveness to just what I need, I end up breaking something. Others are too simple, and I break them as I add expressiveness.
Which angle should I take with this grammar? Can you point to any examples that aren't either trivial or full Turing-complete languages?
I would let your tokenizer do the heavy lifting rather than mixing your newline rules into your grammar:
Count parentheses, brackets, and braces, and don't generate NL tokens while there are unclosed groups. That'll give you line continuations for free without your grammar being any the wiser.
Always generate an NL token at the end of file whether or not the last line ends with a '\n' character, then you don't have to worry about a special case of a statement without a NL. Statements always end with an NL.
The second point would let you simplify your grammar to something like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL
;
empty_line
: NL
;
assignment
: ID '=' expr
;
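The two tokenizer suggestions from this answer can be sketched as a small hand-written Python tokenizer. It is a minimal illustration under assumptions, not an ANTLR lexer: the token names and the tokenize function are hypothetical, and only identifiers, integers, a few operators and sh-style comments are recognized.

```python
import re

TOKEN = re.compile(r"""
    (?P<COMMENT>\#[^\n]*)              # sh-style comment, discarded
  | (?P<NL>\n)
  | (?P<WS>[ \t\r]+)
  | (?P<ID>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<NUM>[0-9]+)
  | (?P<OP>[-+*/=(),\[\]{}])
""", re.VERBOSE)

def tokenize(src):
    tokens, depth, pos = [], 0, 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        kind, text = m.lastgroup, m.group()
        pos = m.end()
        if kind in ("COMMENT", "WS"):
            continue
        if kind == "NL":
            if depth == 0:             # newlines inside open brackets are dropped
                tokens.append(("NL", "\n"))
            continue
        if kind == "OP" and text in "([{":
            depth += 1
        elif kind == "OP" and text in ")]}":
            depth = max(depth - 1, 0)
        tokens.append((kind, text))
    # Make sure the stream ends with NL even if the file has no final newline.
    if not tokens or tokens[-1][0] != "NL":
        tokens.append(("NL", "\n"))
    return tokens
```

With this in place, "A = (B\n + C)" yields a single statement terminated by one NL token, so the grammar never sees the line break inside the brackets.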
How about this?
exprlist
: (expr)? (NL+ expr)* NL!? EOF!
;
expr
: assignment | ...
;
assignment
: ID '=' expr
;
I assume you chose to make NL optional, because the last statement in your input code doesn't have to end with a newline.
While it makes a lot of sense, you are making life a lot harder for your parser. Separator tokens (like NL) should be cherished, as they disambiguate and reduce the chance of conflicts.
In your case, the parser doesn't know if it should parse "assignment NL" or "assignment empty_line". There are many ways to solve it, but most of them are just band-aids for an unwise design choice.
My recommendation is an innocent hack: Make NL mandatory, and always append NL to the end of your input stream!
It may seem a little unsavory, but in reality it will save you a lot of future headaches.