ANTLR4 - replace op boundaries error|How to use TokenStreamRewriter to transform text from two listener events on overlapping tokens in original AST? - antlr

Hello ANTLR creators/users,
Some context - I am using PlSql ANTLR4 parser to do some lightweight transpiling of some queries from oracle sql to, let's say, spark sql. I have my listener class setup which extends the base listener.
Example of an issue -
Let's say the input is something like -
SELECT to_char(to_number(substr(ATTRIBUTE_VALUE,1,4))-3)||'0101') from xyz;
Now, I'd like to replace || with CONCAT and to_char with CAST as STRING, so that the final query looks like -
SELECT CONCAT(CAST(to_number(substr(ATTRIBUTE_VALUE,1,4))-3) as STRING),'0101') from xyz;
In my listener class, I am overriding two functions from base listener to do this - concatenation and string_function. In those, I am using a tokenStreamRewriter's replace to make the necessary transformation. Since tokenStreamRewriter is evaluated lazily, I am running to issue ->
java.lang.IllegalArgumentException: replace op boundaries of
<ReplaceOp#[#38,228:234='to_char',<2193>,3:15]..[#53,276:276=')',
<2214>,3:63]:"CAST (to_number(substr(ATTRIBUTE_VALUE,1,4))-3 as STRING)">
overlap with previous <ReplaceOp#[#38,228:234='to_char',<2193>,3:15]..
[#56,279:284=''0101'',<2209>,3:66]:"CONCAT
(to_char(to_number(substr(ATTRIBUTE_VALUE,1,4))-3),'0101')">
Clearly, the issue is my two listener functions attempting to replace/transform text on overlapping boundaries.
Is there any work around for territory overlap kind of issues for ANTLR4? I'm sure folks run into such stuff all the time probably.
I'd appreciate any workarounds, even dirty ones at this point of time :)
I did realize that ANTLR4 does not allow us to modify original AST, otherwise this would have been a little bit easier to solve.
Thanks!

A look at how tokenstreamrewriter works leads to the following understanding:
first, a list of all modification operations are built
then, you invoke getText()
here, there is a reduction of modification operations. The idea for example is to merge multiple insert together in one reduction. Its role is also to avoid multiple replace on same data (but i will expand on this point later).
every token is then read, in the case there is a modification listed for the said token index, TokenStreamRewriter do the operation, otherwise it just pop the read token.
Let's have a look on how modification operations are implemented:
for insert, tokenstream rewriter basically just adds the string to be added at the current token index, and then do an index+1, effectively going to next token
for replace, tokenstream rewriter replace a range of tokens with the new string, and set the new index to the end of this range.
So, for tokenstreamrewriter, overlapping replaces are not possible, as when you replace you jump to the end of the range of tokens to be replaced. Especially, in the case you remove the checks of overlapping, then only the first replace will be operated, as afterwards, the token index is past the other replaces.
Basically, this has been done because there is no way to tell easily what tokens should be replaced while using overlapping replaces. You would need for that symbol recognition and matching.
So, what you are trying to do is the following (for each step, the part between '*' is what is modified):
*SELECT to_char(to_number(substr(ATTRIBUTE_VALUE,1,4))-3)||'0101')* from xyz;
|
V
CONCAT (*to_char(to_number(substr(ATTRIBUTE_VALUE,1,4))-3)*,'0101') from xyz;
|
V
SELECT CONCAT(CAST(to_number(substr(ATTRIBUTE_VALUE,1,4))-3) as STRING),'0101') from xyz;
to achieve your transformation, you could do so a replace of :
'to_char' -> 'CONCAT(CAST'
'||' -> ' as STRING),'
And, by using a bit of intelligence while parsing your tokens, like is there a '||' in my tokens to know if it's string, you would know what to replace.
regards

The way I solve it in multiple projects based on ANTLR is this: I translated ANTLR parse-tree to an AST written using Kolasu, an open-source library we developed at Strumenta.
Kolasu has all sort of utilities to process and mutate ASTs. For all non-trivial projects I end up doing transformations on the AST.
Kolasu

Related

Use of math in ALFA

How to get a rule like that working:
rule adminCanViewAllExams {
condition (integerOneAndOnly(my.company.attributes.subject.rights) & 0x00000040) == 0
permit
}
Syntax highlighter complains it doesn't know those items:
& (This is a binary math operation)
0x00000040 (this is the hexadecimal representation of an integer)
EDIT
(adding OP's comment inside the question)
I want to keep as much as possible in my current application. Meaning, I don't want to change a lot in my database model. I just want to implement the PEP and PDP part new. So, currently the rights of the user are stored in a Long. Each bit in the number represents a right. To get the right we do a binary &-operation which masks the other bits in the Long. We might redesign this part, but it's still good to know how far the support for mathematic operations goes
XACML does not support bitwise logic. It can do boolean logic (AND and OR) but that's about it.
To achieve what you are looking for, you could use a Policy Information Point which would take in my.company.attributes.subject.rights and 0x00000040. It would return an attribute called allowed.
Alternatively, you can extend XACML (and ALFA) to add missing datatypes and functions. But I would recommend going for human-readable policies.

ANTLR recognize single character

I'm pretty sure this isn't possible, but I want to ask just in case.
I have the common ID token definition:
ID: LETTER (LETTER | DIG)*;
The problem is that in the grammar I need to parse, there are some instructions in which you have a single character as operand, like:
a + 4
but
ab + 4
is not possible.
So I can't write a rule like:
sum: (INT | LETTER) ('+' (INT | LETTER))*
Because the lexer will consider 'a' as an ID, due to the higher priority of ID. (And I can't change that priority because it wouldn't recognize single character IDs then)
So I can only use ID instead of LETTER in that rule. It's ugly because there shouldn't be an ID, just a single letter, and I will have to do a second syntactic analysis to check that.
I know that there's nothing to do about it, since the lexer doesn't understand about context. What I'm thinking that maybe there's already built-in ANTLR4 is some kind of way to check the token's length inside the rule. Something like:
sum: (INT | ID{length=1})...
I would also like to know if there are some kind of "token alias" so I can do:
SINGLE_CHAR is alias of => ID
In order to avoid writing "ID" in the rule, since that can be confusing.
PD: I'm not parsing a simple language like this one, this is just a little example. In reality, an ID could also be a string, there are other tokens which can only be a subset of letters, etc... So I think I will have to do that second analysis anyways after parsing the entry to check that syntactically is legal. I'm just curious if something like this exists.
Checking the size of an identifier is a semantic problem and should hence be handled in the semantic phase, which usually follows the parsing step. Parse your input with the usual ID rule and check in the constructed parse tree the size of the recognized ids (and act accordingly). Don't try to force this kind of decision into your grammar.

Preferentially match shorter token in ANTLR4

I'm currently attempting to write a UCUM parser using ANTLR4. My current approach has involved defining every valid unit and prefix as a token.
Here's a very small subset of the defined tokens. I could make a cut-down version of the grammar as an example, but it seems like it shouldn't be necessary to resolve this problem (or to point out that I'm going about this entirely the wrong way).
MILLI_OR_METRE: 'm' ;
OSMOLE: 'osm' ;
MONTH: 'mo' ;
SECOND: 's' ;
One of the standard testcases is mosm, from which the lexer should generate the token stream MILLI_OR_METRE OSMOLE. Unfortunately, because ANTLR preferentially matches longer tokens, it generates the token stream MONTH SECOND MILLI_OR_METRE, which then causes the parser to raise an error.
Is it possible to make an ANTLR4 lexer try to match using shorter tokens first? Adding lookahead-type rules to MONTH isn't a great solution, as there are all sorts of potential lexing conflicts that I'd need to take account of (for example mol being lexed as MONTH LITRE instead of MOLE and so on).
EDIT:
StefanA below is of course correct; this is a job for a parser capable of backtracking (eg. recursive descent, packrat, PEG and probably various others... Coco/R is one reasonable package to do this). In an attempt to avoid adding a dependency on another parser generator (or moving other bits of the project from ANTLR to this new generator) I've hacked my way around the problem like this:
MONTH: 'mo' { _input.La(1) != 's' && _input.La(1) != 'l' && _input.La(1) != '_' }? ;
// (note: this is a C# project; java would use _input.LA instead)
but this isn't really a very extensible or maintainable solution, and like as not will have introduced other subtle issues I've not come across yet.
Your problem does not require smaller tokens to be preferred (In this case MONTH would never be matched). You need a backtracking behaviour dependent on the text being matched or not. Right?
ANTLR separates tokenization and parsing strictly. Consequently every solution to your problem will seem like a hack.
However other parser generators are specialized on problems like yours. Packrat Parsers (PEG) are backtracking and allow tokenization on the fly. Try out parboiled for this purpose.
Appears that the question is not being framed correctly.
I'm currently attempting to write a UCUM parser using ANTLR4. My current approach has involved defining every valid unit and prefix as a token.
But, according to the UCUM:
The expression syntax of The Unified Code for Units of Measure generates an infinite number of codes with the consequence that it is impossible to compile a table of all valid units.
The most to expect from the lexer is an unambiguous identification of the measurement string without regard to its semantic value. Similarly, a parser alone will be unable to validly select between unit sequences like MONTH LITRE and MOLE - both could reasonably apply to a leak rate - unless the problem space is statically constrained in the parser definition.
A heuristic, structural (explicitly identifying the problem space) or contextual (considering the relative nature of other units in the problem space), is most likely required to select the correct unit interpretation.
The best tool to use is the one that puts you in the best position to implement the heuristics necessary to disambiguate the unit strings. Antlr could do it using parse-tree walkers. Whether that is the appropriate approach requires further analysis.

TSearch2 - dots explosion

Following conversion
SELECT to_tsvector('english', 'Google.com');
returns this:
'google.com':1
Why does TSearch2 engine didn't return something like this?
'google':2, 'com':1
Or how can i make the engine to return the exploded string as i wrote above?
I just need "Google.com" to be foundable by "google".
Unfortunately, there is no quick and easy solution.
Denis is correct in that the parser is recognizing it as a hostname, which is why it doesn't break it up.
There are 3 other things you can do, off the top of my head.
You can disable the host parsing in the database. See postgres documentation for details. E.g. something like ALTER TEXT SEARCH CONFIGURATION your_parser_config
DROP MAPPING FOR url, url_path
You can write your own custom dictionary.
You can pre-parse your data before it's inserted into the database in some manner (maybe splitting all domains before going into the database).
I had a similar issue to you last year and opted for solution (2), above.
My solution was to write a custom dictionary that splits words up on non-word characters. A custom dictionary is a lot easier & quicker to write than a new parser. You still have to write C tho :)
The dictionary I wrote would return something like 'www.facebook.com':4, 'com':3, 'facebook':2, 'www':1' for the 'www.facebook.com' domain (we had a unique-ish scenario, hence the 4 results instead of 3).
The trouble with a custom dictionary is that you will no longer get stemming (ie: www.books.com will come out as www, books and com). I believe there is some work (which may have been completed) to allow chaining of dictionaries which would solve this problem.
First off in case you're not aware, tsearch2 is deprecated in favor of the built-in functionality:
http://www.postgresql.org/docs/9/static/textsearch.html
As for your actual question, google.com gets recognized as a host by the parser:
http://www.postgresql.org/docs/9.0/static/textsearch-parsers.html
If you don't want this to occur, you'll need to pre-process your text accordingly (or use a custom parser).

Bison input analyzer - basic question on optional grammar and input interpretation

I am very new to Flex/Bison, So it is very navie question.
Pardon me if so. May look like homework question - but I need to implement project based on below concept.
My question is related to two parts,
Question 1
In Bison parser, How do I provide rules for optional input.
Like, I need to parse the statment
Example :
-country='USA' -state='INDIANA' -population='100' -ratio='0.5' -comment='Census study for Indiana'
Here the ratio token can be optional. Similarly, If I have many tokens optional, then How do I provide the grammar in the parser for the same?
My code looks like,
%start program
program : TK_COUNTRY TK_IDENTIFIER TK_STATE TK_IDENTIFIER TK_POPULATION TK_IDENTIFIER ...
where all the tokens are defined in the lexer. Since there are many tokens which are optional, If I use "|" then there will be many different ways of input combination possible.
Question 2
There are good chance that the comment might have quotes as part of the input, so I have added a token -tag which user can provide to interpret the same,
Example :
-country='USA' -state='INDIANA' -population='100' -ratio='0.5' -comment='Census study for Indiana$'s population' -tag=$
Now, I need to reinterpret Indiana$'s as Indiana's since -tag=$.
Please provide any input or related material for to understand these topic.
Q1: I am assuming we have 4 possible tokens: NAME , '-', '=' and VALUE
Then the grammar could look like this:
attrs:
attr attrs
| attr
;
attr:
'-' NAME '=' VALUE
;
Note that, unlike you make specific attribute names distinguished tokens, there is no way to say "We must have country, state and population, but ratio is optional."
This would be the task of that part of the program that analyses the data produced by the parser.
Q2: I understand this so, that you think of changing the way lexical analysis works while the parser is running. This is not a good idea, at least not for a beginner. Have you even started to think about lexical analysis, as opposed to parsing?