How to preserve whitespace when we use text attribute in Antlr4 - antlr

I want to keep white space when I call text attribute of token, is there any way to do it?
Here is the situation:
We have the following code
IF L > 40 THEN;
ELSE
IF A = 20 THEN
PUT "HELLO";
In this case, I want to transform it into:
if (!(L>40){
if (A=20)
put "hello";
}
The rule in Antlr is that:
stmt_if_block: IF expr
THEN x=stmt
(ELSE y=stmt)?
{
if ($x.text.equalsIgnoreCase(";"))
{
WriteLn("if(!(" + $expr.text +")){");
WriteLn($stmt.text);
Writeln("}");
}
}
But the result looks like:
if(!(L>40))
{
ifA=20put"hello";
}
The reason is that the white space in $stmt was removed. I was wondering if there is anyway to keep these white space
Thank you so much
Update: If I add
SPACE: [ ] -> channel(HIDDEN);
The space will be preserved, and the result would look like below, many spaces between tokens:
IF SUBSTR(WNAME3,M-1,1) = ')' THEN M = L; ELSE M = L - 1;

This is the C# extension method I use for exactly this purpose:
public static string GetFullText(this ParserRuleContext context)
{
if (context.Start == null || context.Stop == null || context.Start.StartIndex < 0 || context.Stop.StopIndex < 0)
return context.GetText(); // Fallback
return context.Start.InputStream.GetText(Interval.Of(context.Start.StartIndex, context.Stop.StopIndex));
}
Since you're using java, you'll have to translate it, but it should be straightforward - the API is the same.
Explanation: Get the first token, get the last token, and get the text from the input stream between the first char of the first token and the last char of the last token.

#Lucas solution, but in java in case you have troubles in translating:
private String getFullText(ParserRuleContext context) {
if (context.start == null || context.stop == null || context.start.getStartIndex() < 0 || context.stop.getStopIndex() < 0)
return context.getText();
return context.start.getInputStream().getText(Interval.of(context.start.getStartIndex(), context.stop.getStopIndex()));
}

Looks like InputStream is not always updated after removeLastChild/addChild operations. This solution helped me for one grammar, but it doesn't work for another.
Works for this grammar.
Doesn't work for modern groovy grammar (for some reason inputStream.getText contains old text).
I am trying to implement function name replacement like this:
enterPostfixExpression(ctx: PostfixExpressionContext) {
// Get identifierContext from ctx
...
const token = CommonTokenFactory.DEFAULT.createSimple(GroovyParser.Identifier, 'someNewFnName');
const node = new TerminalNode(token);
identifierContext.removeLastChild();
identifierContext.addChild(node);
UPD: I used visitor pattern for the first implementation

Related

Flutter how to turn Lists encoded as Strings for a SQFL database back to Lists concisely?

I fear I'm trying to reinvent the wheel here. I'm putting Objects into my SQFL database:
https://pub.dev/packages/sqflite
some of the object fields are Lists of ints others are Lists of Strings. I'm encoding these as plain Strings to place in a TEXT field in my SQFL database.
At some point I'm going to have to turn them back, I couldn't find anything on Google, which is surprising because this must be a very common occurrence with SQFL
I've started coding the 'decoding', but it's rookie dart. Is there anything performant around I ought to use?
Code included to prove I'm not totally lazy, no need to look, edge cases make it fail.
List<int> listOfInts = new List<int>();
String testStringOfInts = "[1,2,4]";
List<String> intermediateStep2 = testStringOfInts.split(',');
int numListElements = intermediateStep2.length;
print("intermediateStep2: $intermediateStep2, numListElements: $numListElements");
for (int i = 0; i < numListElements; i++) {
if (i == 0) {
listOfInts.add(int.parse(intermediateStep2[i].substring(1)));
continue;
}
else if ((i) == (numListElements - 1)) {
print('final element: ${intermediateStep2[i]}');
listOfInts.add(int.parse(intermediateStep2[i].substring(0, intermediateStep2[i].length - 1)));
continue;
}
else listOfInts.add(int.parse(intermediateStep2[i]));
}
print('Output: $listOfInts');
/* DECODING LISTS OF STRINGS */
String testString = "['element1','element2','element23']";
List<String> intermediateStep = testString.split("'");
List<String> output = new List<String>();
for (int i = 0; i < intermediateStep.length; i++) {
if (i % 2 == 0) {
continue;
} else {
print('adding a value to output: ${intermediateStep[i]}');
//print('value is a: ${(intermediateStep[i]).runtimeType}');
output.add(intermediateStep[i]);
}
}
print('Output: $output');
}
For the integers your could make the parsing like:
void main() {
print(parseStringAsIntList("[1,2,4]")); // [1, 2, 4]
}
List<int> parseStringAsIntList(String stringOfInts) => stringOfInts
.substring(1, stringOfInts.length - 1)
.split(',')
.map(int.parse)
.toList();
I need more information about how the Strings are saved in some corner cases like if they contain , and/or ' since this will change how the parsing should be done. But if both characters are valid in the string (especially ,) I will recommend you to change the storage format into JSON instead which makes it a lot easier to encode/decode and without the risk of using characters which can give you issues).
But a rather naive solution can be made like this if we know each String does not contain ,:
void main() {
print(parseStringAsStringList("['element1','element2','element23']"));
// [element1, element2, element23]
}
List<String> parseStringAsStringList(String stringOfStrings) => stringOfStrings
.substring(1, stringOfStrings.length - 1)
.split(',')
.map((string) => string.substring(1, string.length - 1))
.toList();

How to differentiate each of the character in string

I got a question in vb.net
can I identify each of the character in string
for an example
i got a string of "Hello!. Good Afternoon!"
from this string can i trim away the period symbol?
Thank you
You should look at the methods of the String class, as they support different forms of string manipulation.
At its simplest, the Replace() method can be used to replace all occurrences of a period character with an empty string.
Alternatively, you can use the IndexOf() method to locate a specific string (e.g. the period) and the Remove() method to remove that character.
According to my 8-ball Magic, you actually want to :
Remove concecutive punctuation from a string:
With a Regex we are going to find all punctuation in the string.
The index of Match will be into an int[].
We will go iterate throught the array to find if the index is concecutive to the last punctuation index.
We will delete all the punctuation starting by the last one. Because starting with the 1rst will modify the index.
Code:
string Input = "....Thalassius! vero ea--*/-*/-- tempestate+- fectus";
string Output = Input;
var regex = new Regex(#"[^\w\s]|_"); // *1.
var matches = regex.Matches(Input) ;
var MatchesIndex = matches .Cast<Match>()
.Select(match => match.Index)
.ToArray(); // *2.
int last = 0;
List<int> toDelete = new List<int>();
for (int i = 0; i < MatchesIndex.Length; i++) // *3.
{
if ( MatchesIndex[i] == last + 1)
toDelete.Add(MatchesIndex[i]);
last = MatchesIndex[i];
}
foreach (int i in toDelete.OrderByDescending(x => x)) // *4.
Output = Output.Remove(i, 1);
Console.WriteLine("Input : " + Input);
Console.WriteLine("Output : " + Output);
C# Snippet
You can learn more about the regex used, thanks to #John Kugelman.

Lex: Gather all text not defined in rules

I'm trying to gather all text that is not defined by a previous rule into a string and prefix it with a formatting string using lex. I'm wondering if there's a standard way of doing this.
For example, say I have the rules:
word1|word2|word3|word4 {printf("%s%s", "<>", yytext);}
[0-9]+ {printf("%s%s", "{}", yytext);}
everything else {printf("%s%s", "[]", yytext);}
And I attempt to lex the string:
word1 this is some other text ; word2 98 foo bar .
I would want this to produce the following when run through the lexer:
<>word1[] this is some other text ; <>word2[] {}98[] foo bar .
I attempted to do this using states, but realize I can't determine when to stop the check, like:
%x OTHER
%%
. {yymore(); BEGIN OTHER;}
<OTHER>.|\n yymore();
<OTHER>how to determine when to end? {printf("%s%s", "[]", yytex); BEGIN INITIAL;}
What is a good way to do this? Is there someway to continue as long as another rule isn't met?
AFAIK, there is no "standard" solution, but a simple one is to keep a bit of context (the prefix last printed) and use that to decide whether or not to print a new prefix. For example, you could use a custom printer like this:
enum OutputType { NO_TOKEN = 0, WORD, NUMBER, OTHER };
void print_with_prefix(enum OutputType type, const char* token) {
static enum OutputType prev = NO_TOKEN;
const char* prefix = "";
switch (type) {
case WORD: prefix = "<>"; break;
case NUMBER: prefix = "{}"; break;
case OTHER: if (prev != OTHER) prefix = "[]"; break;
default: assert(false);
}
prev = type;
printf("%s%s", prefix, token);
}
Then you just need to change the calls to printf to invoke print_with_prefix instead (and, as written, to supply an enum value instead of a string).
For the OTHER case, you then don't need to do anything special to accumulate the token. Just
. { print_with_prefix(OTHER, yytext); }
(I'm skating over the handling of whitespace and newlines, but it's just conceptual.)

ANTLR4 change listener during parse

I have an ANTLR4 listener which handles a standard and well-formed grammar, however am struggling with how to deal the non-standard implementations. Although all of the variants go through the lexer without problems the parse stage is a lot trickier.
A traditional way of doing this would be something like
// Header of document
variant = STANDARD;
if (header.indexOf("microsoft") != -1) {
variant = MICROSOFT;
} else if (header.indexOf("google") != -1) {
variant = GOOGLE;
}
...
// Parsing a particular element
if (variant.equals(MICROSOFT)) {
// Microsoft-specific stuff
} else if (variant.equals(GOOGLE)) {
// Google-specific stuff
} else {
// Standard stuff
}
but this quickly becomes unmaintainable. The obvious solution is to have a ParseTreeListener for the standard implementation and then subclass it for each variant, but I don't know which variant it is until I've started the parse.
So how can I either switch from one listener to another part-way through the parse, or restart the parse with a new listener once I know which variant I'm dealing with?
If these variants occur frequently, you might want to consider embedding custom code to handle context sensitive parsing by using predicates (the {...}? construct in the following pseudo grammar):
rule
: { boolean-expression-a }? a-alternative
| { boolean-expression-b }? b-alternative
| /* fall through */ not-a-or-b-alternative
;
Let's say you want to parse a file containing chunks. A chunk consists of a header and a data row. In the header you can set your variant. The data of a normal variant contains 3 NUMBERs, Google's variant contains 2 NUMBERs and Microsoft's variant contains a single NUMBER. An example of such a file would look like this:
header: none
data: 1 2 3
header: google
data: 4 5
header: microsoft
data: 6
And here's a demo of a context sensitive ANTLR v4 grammar able to parse this:
grammar T;
#parser::members {
enum Variant {
GOOGLE,
MICROSOFT,
OTHER;
public static Variant tryValueOf(String name) {
try {
return Variant.valueOf(name.toUpperCase());
}
catch(Exception e) {
return OTHER;
}
}
}
private Variant variant = Variant.OTHER;
}
parse
: chunk+ EOF
;
chunk
: header data
;
header
: K_HEADER COLON NAME {variant = Variant.tryValueOf($NAME.text);}
;
data
: {variant == Variant.MICROSOFT}? K_DATA COLON NUMBER #MicrosoftData
| {variant == Variant.GOOGLE}? K_DATA COLON NUMBER NUMBER #GoogleData
| K_DATA COLON NUMBER NUMBER NUMBER #OtherData
;
K_DATA : 'data';
K_HEADER : 'header';
NAME : [a-zA-Z]+;
NUMBER : [0-9]+;
COLON : ':';
SPACE : [ \t\r\n] -> skip;
Resulting in the following parse:

simple math expression parser

I have a simple math expression parser and I want to build the AST by myself (means no ast parser). But every node can just hold two operands. So a 2+3+4 will result in a tree like this:
+
/ \
2 +
/ \
3 4
The problem is, that I am not able to get my grammer doing the recursion, here ist just the "add" part:
add returns [Expression e]
: op1=multiply { $e = $op1.e; Print.ln($op1.text); }
( '+' op2=multiply { $e = new AddOperator($op1.e, $op2.e); Print.ln($op1.e.getClass(), $op1.text, "+", $op2.e.getClass(), $op2.text); }
| '-' op2=multiply { $e = null; } // new MinusOperator
)*
;
But at the end of the day this will produce a single tree like:
+
/ \
2 4
I know where the problem is, it is because a "add" can occour never or infinitly (*) but I do not know how to solve this. I thought of something like:
"add" part:
add returns [Expression e]
: op1=multiply { $e = $op1.e; Print.ln($op1.text); }
( '+' op2=(multiply|add) { $e = new AddOperator($op1.e, $op2.e); Print.ln($op1.e.getClass(), $op1.text, "+", $op2.e.getClass(), $op2.text); }
| '-' op2=multiply { $e = null; } // new MinusOperator
)?
;
But this will give me a recoursion error. Any ideas?
I don't have the full grammar to test this solution, but consider replacing this (from the first add rule in the question):
$e = new AddOperator($op1.e, $op2.e);
With this:
$e = new AddOperator($e, $op2.e); //$e instead of $op1.e
This way each iteration over ('+' multiply)* extends e rather than replaces it.
It may require a little playing around to get it right, or you may need a temporary Expression in the rule to keep things managed. Just make sure that the last expression created by the loop is somewhere on the right-hand side of the = operator, as in $e = new XYZ($e, $rhs.e);.