I am trying to write a lexer for an IntelliJ language plugin. The JFlex manual has an example that lexes string literals, but it uses a StringBuffer to collect each lexed part and build up a single string. My problem with this method is that it copies the characters being read, and I don't know how to integrate that example with IntelliJ. In IntelliJ one always returns an IElementType, and the associated text is then taken from yytext() using the functions getTokenStart() and getTokenEnd(), so that the start and end of the whole token map directly onto the input string.
So I want to be able to return a token and the associated yytext() should span over the whole text since the last time another token was returned. For example in the string literal example, I would read \" which marks the literal start, then I change into state STRING and when I read \" again I change back into another state and return the string literal token. At that point I want yytext() to contain the whole string literal.
Is this possible with JFlex? If not, what is the recommended way to pass the content from a StringBuffer to the IntelliJ API after a token has been matched that spans multiple actions?
You could write a regular expression that matches the entire String literal so that you get it in one yytext() call, but this match would contain escape sequences unprocessed.
From the JFlex Java example:
<STRING> {
\" { yybegin(YYINITIAL); return symbol(STRING_LITERAL, string.toString()); }
{StringCharacter}+ { string.append( yytext() ); }
/* escape sequences */
"\\b" { string.append( '\b' ); }
"\\t" { string.append( '\t' ); }
"\\n" { string.append( '\n' ); }
"\\f" { string.append( '\f' ); }
"\\r" { string.append( '\r' ); }
"\\\"" { string.append( '\"' ); }
"\\'" { string.append( '\'' ); }
"\\\\" { string.append( '\\' ); }
\\[0-3]?{OctDigit}?{OctDigit} { char val = (char) Integer.parseInt(yytext().substring(1),8);
string.append( val ); }
/* error cases */
\\. { throw new RuntimeException("Illegal escape sequence \""+yytext()+"\""); }
{LineTerminator} { throw new RuntimeException("Unterminated string at end of line"); }
}
This code doesn't just match escape sequences like "\\t", but turns them into the single character '\t'. You could match the whole string with a single expression like this
\" ({StringCharacter} | \\[0-3]?{OctDigit}?{OctDigit} | "\\b" | "\\t" | .. | "\\\\") * \"
but yytext will then contain the unprocessed sequence \\t instead of the character '\t'.
If that is acceptable, then that's the easy solution. If the token is supposed to be an actual substring of the input, then it sounds like this is what you want.
If it's not, you'll need something more complicated, for instance an intermediate interface function that is not yytext(), but that returns the StringBuffer content when the last match was a string match (a flag you could set in the string action), and otherwise returns yytext().
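A minimal sketch of how the two options can coexist, written in plain Java rather than JFlex so that it runs on its own. Every name here (StringLiteralSketch, findLiteral, decode) is hypothetical and not part of the JFlex or IntelliJ APIs. The regex stands in for the single-expression rule, so the raw token is an exact substring of the input, which is presumably what getTokenStart() and getTokenEnd() need to report. decode is a variation on the accessor suggested above: instead of a flag plus a StringBuffer filled during lexing, it processes the escapes only when the value is asked for. Octal escapes are left out for brevity.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StringLiteralSketch {
    // Whole literal in one match: opening quote, then ordinary characters
    // or backslash escapes, then the closing quote.
    static final Pattern LITERAL =
        Pattern.compile("\"(?:[^\"\\\\\\r\\n]|\\\\.)*\"");

    /** Returns {start, end} of the first literal at or after 'from', or null. */
    static int[] findLiteral(String input, int from) {
        Matcher m = LITERAL.matcher(input);
        return m.find(from) ? new int[] { m.start(), m.end() } : null;
    }

    /** Processes the escapes of a raw literal (quotes included) on demand. */
    static String decode(String raw) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < raw.length() - 1; i++) {
            char c = raw.charAt(i);
            if (c != '\\') { sb.append(c); continue; }
            char e = raw.charAt(++i); // character after the backslash
            switch (e) {
                case 'b': sb.append('\b'); break;
                case 't': sb.append('\t'); break;
                case 'n': sb.append('\n'); break;
                case 'f': sb.append('\f'); break;
                case 'r': sb.append('\r'); break;
                case '"': sb.append('"'); break;
                case '\'': sb.append('\''); break;
                case '\\': sb.append('\\'); break;
                default:
                    throw new IllegalArgumentException("Illegal escape sequence \\" + e);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "x = \"a\\tb\";";
        int[] span = findLiteral(input, 0);
        String raw = input.substring(span[0], span[1]);
        System.out.println(raw);         // raw token text, escapes unprocessed
        System.out.println(decode(raw)); // decoded value, computed only here
    }
}
```

The token itself never leaves the input buffer; only code that needs the string's value pays for the copy.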
I'm trying to implement a lexer rule for an oracle Q quoted string mechanism where we have something like q'$some string$'
Here you can have any character in place of $ other than whitespace, (, {, [, <, but the string must start and end with the same character. Some examples of accepted tokens would be:
q'!some string!'
q'ssome strings'
Notice how s is the custom delimiter but it is fine to have that in the string as well because we would only end at s'
Here's how I was trying to implement the rule:
Q_QUOTED_LITERAL: Q_QUOTED_LITERAL_NON_TERMINATED . QUOTE-> type(QUOTED_LITERAL);
Q_QUOTED_LITERAL_NON_TERMINATED:
Q QUOTE ~[ ({[<'"\t\n\r] { setDelimChar( (char)_input.LA(-1) ); }
( . { !isValidEndDelimChar() }? )*
;
I have already checked the value I get from !isValidEndDelimChar(), and the predicate does turn false at the right place, so everything should work, but antlr simply ignores this predicate. I've also tried moving the predicate around, putting that part in a separate rule, and a bunch of other things. After a day and a half of research on this, I'm finally raising this issue.
I have also tried to implement it in other ways but there doesn't seem to be a way to implement a custom char delimited string in antlr4 (The antlr3 version used to work).
Not sure why the { ... } action isn't invoked, but it's not needed. The following grammar worked for me (note that the predicate goes in front of the . wildcard, not after it):
grammar Test;
@lexer::members {
boolean isValidEndDelimChar() {
return (_input.LA(1) == getText().charAt(2)) && (_input.LA(2) == '\'');
}
}
parse
: .*? EOF
;
Q_QUOTED_LITERAL
: 'q\'' ~[ ({[<'"\t\n\r] ( {!isValidEndDelimChar()}? . )* . '\''
;
SPACE
: [ \t\f\r\n] -> skip
;
If you run the class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
Lexer lexer = new TestLexer(CharStreams.fromString("q'ssome strings' q'!foo!'"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-20s %s\n", TestLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
}
the following output will be printed:
Q_QUOTED_LITERAL q'ssome strings'
Q_QUOTED_LITERAL q'!foo!'
EOF <EOF>
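To make the predicate's logic easier to follow, here is a hand-written scanner in plain Java that applies the same two-character lookahead as isValidEndDelimChar(). It is only a sketch of the idea and not generated by or dependent on ANTLR; the class and method names (QQuoteSketch, scan) are made up for illustration.

```java
class QQuoteSketch {
    // Characters that may not serve as the delimiter (same set as in the grammar).
    static final String FORBIDDEN = " ({[<'\"\t\n\r";

    /**
     * Returns the end offset (exclusive) of a q'...' literal starting at
     * 'start', or -1 if no well-formed literal starts there.
     */
    static int scan(String s, int start) {
        if (start + 2 >= s.length()
                || s.charAt(start) != 'q'
                || s.charAt(start + 1) != '\'') {
            return -1;
        }
        char delim = s.charAt(start + 2);
        if (FORBIDDEN.indexOf(delim) >= 0) {
            return -1;
        }
        // Keep consuming until the next two characters are delimiter + quote,
        // which is the condition the lexer predicate tests with LA(1) and LA(2).
        for (int i = start + 3; i + 1 < s.length(); i++) {
            if (s.charAt(i) == delim && s.charAt(i + 1) == '\'') {
                return i + 2;
            }
        }
        return -1; // unterminated literal
    }

    public static void main(String[] args) {
        String input = "q'ssome strings' q'!foo!'";
        int end = scan(input, 0);
        System.out.println(input.substring(0, end));
    }
}
```

A delimiter character inside the body is harmless because the scan only stops when the delimiter is immediately followed by the closing quote.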
I would like to match any Num from part of a text string. So far, this (stolen from https://docs.perl6.org/language/regexes.html#Best_practices_and_gotchas) does the job...
my token sign { <[+-]> }
my token decimal { \d+ }
my token exponent { 'e' <sign>? <decimal> }
my regex float {
<sign>?
<decimal>?
'.'
<decimal>
<exponent>?
}
my regex int {
<sign>?
<decimal>
}
my regex num {
<float>?
<int>?
}
$str ~~ s/( <num>? \s*) ( .* )/$1/;
This seems like a lot of (error prone) reinvention of the wheel. Is there a perl6 trick to match built in types (Num, Real, etc.) in a grammar?
If you can make reasonable assumptions about the number, like that it's delimited by word boundaries, you can do something like this:
regex number {
« # left word boundary
\S+ # actual "number"
» # right word boundary
<?{ defined +"$/" }>
}
The final line in this regex stringifies the Match ("$/"), and then tries to convert it to a number (+). If it works, it returns a defined value, otherwise a Failure. This string-to-number conversion recognizes the same syntax as the Perl 6 grammar. The <?{ ... }> construct is an assertion, so it makes the match fail if the expression on the inside returns a false value.
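The same "grab a candidate, then let the built-in conversion decide" idea can be transposed to other languages. Below is a rough Java analogue, offered only as an illustration: it splits on whitespace instead of using word boundaries, and Double.parseDouble has its own syntax (it also accepts things like "NaN", "Infinity" and a trailing "d"), so it does not recognize the same set of numbers as Perl 6. The names NumberSketch and numbersIn are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class NumberSketch {
    /** Collects every whitespace-delimited chunk that the built-in parser accepts. */
    static List<Double> numbersIn(String text) {
        List<Double> found = new ArrayList<>();
        for (String candidate : text.trim().split("\\s+")) {
            try {
                // The conversion is the "assertion": failure means no match.
                found.add(Double.parseDouble(candidate));
            } catch (NumberFormatException e) {
                // Not a number by Java's rules; skip this candidate.
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(numbersIn("abc 1.5e3 foo -2 x"));
    }
}
```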
In Edit distance: Ignore start/end, I offered a Perl 6 solution to a fuzzy matching problem. I had a grammar like this (although maybe I've improved it after Edit #3):
grammar NString {
regex n-chars { [<.ignore>* \w]**4 }
regex ignore { \s }
}
The literal 4 itself was the length of the target string in the example. But the next problem might be some other length. So how can I tell the grammar how long I want that match to be?
Although the docs don't show an example of using the $args parameter, I found one in S05-grammar/example.t in roast.
Specify the arguments in :args and give the regex an appropriate signature. Inside the regex, access the arguments in a code block:
grammar NString {
regex n-chars ($length) { [<.ignore>* \w]**{ $length } }
regex ignore { \s }
}
class NString::Actions {
method n-chars ($/) {
put "Found $/";
}
}
my $string = 'The quick, brown butterfly';
loop {
state $from = 0;
my $match = NString.subparse(
$string,
:rule('n-chars'),
:actions(NString::Actions),
:c($from++),
:args( \(5) )
);
last unless ?$match;
}
I'm still not sure about the rules for passing the arguments though. This doesn't work:
:args( 5 )
I get:
Too few positionals passed; expected 2 arguments but got 1
This works:
:args( 5, )
But that's enough thinking about this for one night.
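For comparison, here is how the same "length as a parameter" idea might look in Java, where the quantifier is simply built into the pattern string. This is a loose analogue rather than a translation: Java's \w is ASCII-only by default, and the names NCharsSketch, nChars and matchAt are invented for this sketch. matchAt anchors the match at a given offset, roughly the way :c($from) does for subparse.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NCharsSketch {
    /** Builds a pattern for 'length' word characters, each optionally preceded by whitespace. */
    static Pattern nChars(int length) {
        return Pattern.compile("(?:\\s*\\w){" + length + "}");
    }

    /** Returns the match anchored at offset 'from', or null if there is none. */
    static String matchAt(String text, int from, int length) {
        Matcher m = nChars(length).matcher(text);
        m.region(from, text.length());
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "The quick, brown butterfly";
        for (int from = 0; from < text.length(); from++) {
            String found = matchAt(text, from, 5);
            if (found == null) break; // same stop condition as the loop above
            System.out.println("Found " + found);
        }
    }
}
```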
I'm trying to gather all text that is not defined by a previous rule into a string and prefix it with a formatting string using lex. I'm wondering if there's a standard way of doing this.
For example, say I have the rules:
word1|word2|word3|word4 {printf("%s%s", "<>", yytext);}
[0-9]+ {printf("%s%s", "{}", yytext);}
everything else {printf("%s%s", "[]", yytext);}
And I attempt to lex the string:
word1 this is some other text ; word2 98 foo bar .
I would want this to produce the following when run through the lexer:
<>word1[] this is some other text ; <>word2[] {}98[] foo bar .
I attempted to do this using states, but realize I can't determine when to stop the check, like:
%x OTHER
%%
. {yymore(); BEGIN OTHER;}
<OTHER>.|\n yymore();
<OTHER>how to determine when to end? {printf("%s%s", "[]", yytext); BEGIN INITIAL;}
What is a good way to do this? Is there some way to continue as long as another rule isn't met?
AFAIK, there is no "standard" solution, but a simple one is to keep a bit of context (the prefix last printed) and use that to decide whether or not to print a new prefix. For example, you could use a custom printer like this:
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
enum OutputType { NO_TOKEN = 0, WORD, NUMBER, OTHER };
void print_with_prefix(enum OutputType type, const char* token) {
static enum OutputType prev = NO_TOKEN;
const char* prefix = "";
switch (type) {
case WORD: prefix = "<>"; break;
case NUMBER: prefix = "{}"; break;
case OTHER: if (prev != OTHER) prefix = "[]"; break;
default: assert(false);
}
prev = type;
printf("%s%s", prefix, token);
}
Then you just need to change the calls to printf to invoke print_with_prefix instead (and, as written, to supply an enum value instead of a string).
For the OTHER case, you then don't need to do anything special to accumulate the token. Just
. { print_with_prefix(OTHER, yytext); }
(I'm skating over the handling of whitespace and newlines, but it's just conceptual.)
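To check that remembering the previous output type really produces the requested result, here is the same logic as a self-contained Java sketch. It is not lex: an ordered regex alternation stands in for the three rules, and the names PrefixSketch and annotate are made up for the demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixSketch {
    enum OutputType { NO_TOKEN, WORD, NUMBER, OTHER }

    // Alternatives are tried in order; the last one is the one-character fallback.
    static final Pattern TOKEN = Pattern.compile(
        "(word1|word2|word3|word4)|([0-9]+)|(.)", Pattern.DOTALL);

    static String annotate(String input) {
        StringBuilder out = new StringBuilder();
        OutputType prev = OutputType.NO_TOKEN;
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            OutputType type = m.group(1) != null ? OutputType.WORD
                            : m.group(2) != null ? OutputType.NUMBER
                            : OutputType.OTHER;
            String prefix;
            switch (type) {
                case WORD:   prefix = "<>"; break;
                case NUMBER: prefix = "{}"; break;
                // Only the first character of a run of "other" text gets a prefix.
                default:     prefix = (prev != OutputType.OTHER) ? "[]" : ""; break;
            }
            prev = type;
            out.append(prefix).append(m.group());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(annotate("word1 this is some other text ; word2 98 foo bar ."));
    }
}
```

The run of unmatched characters is never accumulated into one token; the output only looks that way because the prefix is suppressed after the first character of each run.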
I have an ANTLR4 listener which handles a standard and well-formed grammar, but I am struggling with how to deal with the non-standard implementations. Although all of the variants go through the lexer without problems, the parse stage is a lot trickier.
A traditional way of doing this would be something like
// Header of document
variant = STANDARD;
if (header.indexOf("microsoft") != -1) {
variant = MICROSOFT;
} else if (header.indexOf("google") != -1) {
variant = GOOGLE;
}
...
// Parsing a particular element
if (variant.equals(MICROSOFT)) {
// Microsoft-specific stuff
} else if (variant.equals(GOOGLE)) {
// Google-specific stuff
} else {
// Standard stuff
}
but this quickly becomes unmaintainable. The obvious solution is to have a ParseTreeListener for the standard implementation and then subclass it for each variant, but I don't know which variant it is until I've started the parse.
So how can I either switch from one listener to another part-way through the parse, or restart the parse with a new listener once I know which variant I'm dealing with?
If these variants occur frequently, you might want to consider embedding custom code to handle context sensitive parsing by using predicates (the {...}? construct in the following pseudo grammar):
rule
: { boolean-expression-a }? a-alternative
| { boolean-expression-b }? b-alternative
| /* fall through */ not-a-or-b-alternative
;
Let's say you want to parse a file containing chunks. A chunk consists of a header and a data row. In the header you can set your variant. The data of a normal variant contains 3 NUMBERs, Google's variant contains 2 NUMBERs and Microsoft's variant contains a single NUMBER. An example of such a file would look like this:
header: none
data: 1 2 3
header: google
data: 4 5
header: microsoft
data: 6
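Before turning to ANTLR, the gating idea can be sketched in plain Java: the header sets a piece of state, and the data row is then checked against whatever that state demands. This is only an illustration of the control flow under the file format described above; it involves no ANTLR, does not check that the fields are actually digits, and the names VariantSketch and parse are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class VariantSketch {
    enum Variant {
        GOOGLE(2), MICROSOFT(1), OTHER(3);

        final int numbers; // how many NUMBERs a data row must contain

        Variant(int numbers) { this.numbers = numbers; }

        static Variant tryValueOf(String name) {
            try {
                return valueOf(name.toUpperCase());
            } catch (IllegalArgumentException e) {
                return OTHER; // unknown header names fall back to the standard variant
            }
        }
    }

    /** Returns one "VARIANT:count" entry per data row. */
    static List<String> parse(String text) {
        List<String> rows = new ArrayList<>();
        Variant variant = Variant.OTHER;
        for (String line : text.split("\\R")) {
            line = line.trim();
            if (line.startsWith("header:")) {
                // State set here plays the role of the action in the header rule.
                variant = Variant.tryValueOf(line.substring("header:".length()).trim());
            } else if (line.startsWith("data:")) {
                String[] nums = line.substring("data:".length()).trim().split("\\s+");
                // This check plays the role of the predicate on each alternative.
                if (nums.length != variant.numbers) {
                    throw new IllegalStateException("Row does not fit " + variant + ": " + line);
                }
                rows.add(variant + ":" + nums.length);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String file = "header: none\ndata: 1 2 3\nheader: google\ndata: 4 5\nheader: microsoft\ndata: 6";
        System.out.println(parse(file));
    }
}
```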
And here's a demo of a context sensitive ANTLR v4 grammar able to parse this:
grammar T;
@parser::members {
enum Variant {
GOOGLE,
MICROSOFT,
OTHER;
public static Variant tryValueOf(String name) {
try {
return Variant.valueOf(name.toUpperCase());
}
catch(Exception e) {
return OTHER;
}
}
}
private Variant variant = Variant.OTHER;
}
parse
: chunk+ EOF
;
chunk
: header data
;
header
: K_HEADER COLON NAME {variant = Variant.tryValueOf($NAME.text);}
;
data
: {variant == Variant.MICROSOFT}? K_DATA COLON NUMBER #MicrosoftData
| {variant == Variant.GOOGLE}? K_DATA COLON NUMBER NUMBER #GoogleData
| K_DATA COLON NUMBER NUMBER NUMBER #OtherData
;
K_DATA : 'data';
K_HEADER : 'header';
NAME : [a-zA-Z]+;
NUMBER : [0-9]+;
COLON : ':';
SPACE : [ \t\r\n] -> skip;
Resulting in the following parse: