To ignore text inside double quote in antlr 2 - antlr

How can I ignore text inside double quote in antlr 2 .
for example:
I want to ignore "something random" , "some another random" type text which is inside double quote . How can I do this without my parser failing .

You could do what antlr would do for an XML parser:
form island grammars and lexer modes.
Note that this only works if there are no nested quotes:
As soon as you encounter a " you do a pushmode action and switch to a mode that skips everything except ", and switch back to regular when you see another ". This isn't elegant but should work rather well.

Related

Include certain escapement symbols into ANTLR Lexer rules

I'm creating a parser in Antlr4 and Python. Below is the Lexer rules I created in Antlr.
VARIABLE_ID : [$][a-zA-Z][a-zA-Z0-9_]*;
ARRAY_ID : [*][a-zA-Z][a-zA-Z0-9_]*;
STRINGCONST : ["][/|:.a-zA-Z0-9 ]+["];
WS : [ \r\t\f\n]+ -> skip;
I am looking at the STRINGCONST rule and I'm trying to add symbols such as - and ~, however, since they are escapement characters, Antlr is just throwing errors for me. I've tried escaping them with themselves and I haven't been able to get that to work.
Is there a way to include them in the STRINGCONST rule? The basic idea is that I want a string to be identified as any character between two " " marks however I'm happy to limit it to what's currently in the rule as long as I can get - and ~ in there as well.
You can escape chars by adding a \ in front of them:
STRINGCONST : ["] [/|:.a-zA-Z0-9 \-~]+ ["];
And note that ~ has no special meaning inside a char class (only outside of them), so ~ doesn't need to be escaped.

Why the token is displayed as 'end' type instead of STRING?

my aim is save a comment that start with any word and end with the "end" word like this
ANYWORD bla bla bla end
I have this grammar:
lexer grammar JunkLexer;
WS : [ \r\t\n]+ -> skip ;
LQUOTE : 'start' -> more, mode(START) ;
mode START;
STRING : 'end' -> mode(DEFAULT_MODE) ; // token we want parser to see
TEXT : . -> more ; // collect more text for string
but I don't know why, the lexer generates tokens that does not exists in the grammar:
when I checkout the lexer tokens, is the same:
WS=1
STRING=2
LQUOTE=3
'start'=3
'end'=2
Thank you in advance
When you define a lexer rule using a single string literal, that string literal becomes an alternative name for the rule. So when you define FOO: 'foo'; in the lexer grammar, you can then use FOO and 'foo' interchangeably in the parser grammar. This allows you to use string literals in your grammar even if you split it up into a parser and lexer grammar. So even though you have to write PLUS: '+'; in the lexer, you can still write exp '+' exp instead of exp PLUS exp in the grammar. The string literal name is also the one used when displaying the token because that tends to be more readable.
Of course that makes sense in the PLUS example, but doesn't really make sense in your example because, due to the more, your STRING rule doesn't actually just match end, but a whole string. So writing 'end' in the parser grammar to match a complete begin-end section would be utterly confusing (though it would work) and so is the fact that it's used as the token name. However ANTLR doesn't realize that because it doesn't realize that STRING can only be reached through rules invoking more.
Note that you can still use STRING to refer to the token, so this won't actually break your grammar in any way. It will lead to confusing error messages though ("missing 'end'" when it should be "missing STRING").
To work around that you can change the STRING rule to not only consist of a single string literal:
STRING: 'e' 'n' 'd';
This will be equivalent in every way, except that 'end' will no longer be an alias for STRING and will no longer be used as the display name of the token.

ANTLR not matching empty comments

I am using ANTLR to parse a language which uses the colon for both a comment indicator and as part of a 'becomes equal to' assignment. So for example in the line
Index := 2 :Set Index
I need to recognize the first part as an assignment statement and the text after the second colon as a comment. Currently I do this using the rule:
COMMENT : ':'+ ~[:='\r\n']*;
This seems to work OK apart from when the colon is immediately followed by a new line. e.g. in the line
Index := 2 :
the newline occurs immediately after the second colon. In this case the comment is not recognized and the rest of the code is not parsed in the correct context. If there is a single space after the second colon the line is parsed correctly.
I expected the '\r'\n' to cope with this but it only seems to work if there is at least one character after the comment symbol - have I missed something from the command?
The braces denote a collection of characters without any quotes. Hence your '\r\n' literal doesn't work there (you should have got a warning that the apostrophe is included more than once in the char range.
Define the comment like this instead:
COMMENT: ':'+ ~[:=\n\r]*;

Does .parse anchor or :sigspace first in a Perl 6 rule?

I have two questions. Is the behavior I show correct, and if so, is it documented somewhere?
I was playing with the grammar TOP method. Declared as a rule, it implies beginning- and end-of-string anchors along with :sigspace:
grammar Number {
rule TOP { \d+ }
}
my #strings = '137', '137 ', ' 137 ';
for #strings -> $string {
my $result = Number.parse( $string );
given $result {
when Match { put "<$string> worked!" }
when Any { put "<$string> failed!" }
}
}
With no whitespace or trailing whitespace only, the string parses. With leading whitespace, it fails:
<137> worked!
<137 > worked!
< 137 > failed!
I figure this means that rule is applying :sigspace first and the anchors afterward:
grammar Foo {
regex TOP { ^ :sigspace \d+ $ }
}
I expected a rule to allow leading whitespace, which would happen if you switched the order:
grammar Foo {
regex TOP { :sigspace ^ \d+ $ }
}
I could add an explicit token in rule for the beginning of the string:
grammar Number {
rule TOP { ^ \d+ }
}
Now everything works:
<137> worked!
<137 > worked!
< 137 > worked!
I don't have any reason to think it should be one way or the other. The Grammars docs say two things happen, but the docs do not say which order these effects apply:
Note that if you're parsing with .parse method, token TOP is automatically anchored
and
When rule instead of token is used, any whitespace after an atom is turned into a non-capturing call to ws.
I think the answer is that the rule isn't actually anchored in the pattern sense. It's the way .parse works. The cursor has to start at position 0 and end at the last position in the string. That's something outside of the pattern.
The behavior is intended, and is a culmination of these language features:
Sigspace ignores whitespace before the first atom.
From the design docs1 (S05: Regexes and Rules, line 348, emphasis added):
The new :s (:sigspace) modifier causes certain whitespace sequences to be considered "significant"; they are replaced by a whitespace matching rule, . Only whitespace sequences immediately following a matching construct (atom, quantified atom, or assertion) are eligible. Initial whitespace is ignored at the front of any regex, to make it easy to write rules that can participate in longest-token-matching alternations. Trailing space inside the regex delimiters is significant.
This means:
rule TOP { \d+ }
^-------- <.ws> automatically inserted
rule TOP { ^ \d+ $ }
^---^-^---- <.ws> automatically inserted
Regexes are first-class compiled code with lexical scoping.
A regex/rule is not a string that may have characters concatenated to it later to change its behavior. It is a self-contained routine, which is parsed and has its behavior nailed down at compile time.
Regex modifiers like :sigspace, including the one implicitly added by the rule keyword, apply only to their lexical scope - i.e. to the fragment of source code they appear in at compile time. S05, line 6291:
The :i, :m, :r, :s, :dba, :Perl5, and Unicode-level modifiers can be placed inside the regex (and are lexically scoped)
The anchoring of rule TOP is done at run time by .parse.
S05, line 44231:
The .parse and .parsefile methods anchor to the beginning and ending of the text, and fail if the end of text is not reached. (The TOP rule can check against $ itself if it wishes to produce its own error message.)
I.e. the anchoring to the beginning of the string is not intrinsic to the rule TOP, and doesn't affect how the lexical scope of TOP is parsed and compiled. It is done when method .parse is called.
It has to be this way, because because the same grammar can be used with different starting rules instead of TOP, using .parse(..., rule => ...).
So when you write
rule TOP { \d+ }
it is compiled as
regex TOP { :r \d+ <.ws> }
And when you .parse that grammar, it effectively invokes the regex code ^ <TOP> $, with the anchors not being part of TOP's lexical scope but rather of a scope that merely calls the routine TOP. The combined behavior is as if the rule TOP had been written as:
regex TOP { ^ [:r :s \d+] $ }
1) The design docs are in general not to be taken as gospel for what is or isn't part of the Perl 6 language, but S05 is pretty accurate in that regard, except that it mentions some features that haven't been implemented yet but are planned. Anyone who wants to truly grok the intricacies of Perl 6 regexes/grammars, is IMO well served by reading the full S05 from top to bottom at least once.
There aren't two regex effects going on. The rule applies :sigspace. After that, the grammar is defined. When you call .parse, it starts at the beginning of the string and goes to the end (or fails). That anchoring isn't part of the grammar. It's part of how .parse applies the grammar.
My main issue was the odd way some of the things are worded in the docs. They aren't technically wrong, but they also tend to assume knowledge about things the reader might not know. In this case, the casual comment about anchoring TOP isn't as special as it seems. Any rule passed to .parse is anchored in the same way. There's no special behavior for that rule name other than it's the default value for :rule in a call to .parse.

Searching for backslash character in vim

How to search word start \word in vim. I can do it using the find menu. Is there any other short cut for this?
Try:
/\\word
in command mode.
You can search for most anything in your document using regular expressions. From normal mode, type '/' and then start typing your regular expression, and then press enter. '\<' would match the beginning of a word, so
/\<foo
would match the string 'foo' but only where it is at the beginning of a word (preceded by whitespace in most cases).
You can search for the backslash character by escaping it with a backslash, so:
/\<\\foo
Would find the pattern '\foo' at the beginning of a word.
Not directly relevant (/\\word is the the correct solution, and nothing here changes that), but for your information:
:h magic
If you are for a pattern with many characters with special meaning to regexes, you may find "nomagic" and "very nomagic" mode useful.
/\V^.$
will search for the literal string ^.$, instead of "lines of exactly one character" (\v "very magic" and the default \m "magic" modes) or "lines of exactly one period" (\M "nomagic" mode).
The reason searching for something including "\" is different is because "\" is a special character and needs to be escaped (prepended with a backslash)
Similarly, to search for "$100", which includes the special character "$":
Press /
Type \$100
Press return
To search for "abc", which doesn't include a special character:
Press /
Type abc
Press return