I'm trying to write some tooling (validation, possibly autocomplete) for a SQL-esque query language. However, the parser is tokenizing invalid or incomplete inputs in a way that makes them more difficult to work with.
I've reduced my scenario to its simplest reproducible form. Here is my minimized grammar:
grammar SOQL;
WHITE_SPACE : ( ' '|'\r'|'\t'|'\n' ) -> channel(HIDDEN) ;
FROM : 'FROM' ;
SELECT : 'SELECT' ;
/********** SYMBOLS **********/
COMMA : ',' ;
ID: ( 'A'..'Z' | 'a'..'z' | '_' | '$') ( 'A'..'Z' | 'a'..'z' | '_' | '$' | '0'..'9' )* ;
soql_query: select_clause from_clause;
select_clause: SELECT field ( COMMA field )*;
from_clause: FROM table;
field : ID;
table : ID;
When I run the following code (using antlr4ts, but it should be similar to any other port):
const input = 'SELECT ID, Name, Website, Contact, FROM Account'; //invalid trailing ,
let inputStream = new ANTLRInputStream(input);
let lexer = new SOQLLexer(inputStream);
let tokenStream = new CommonTokenStream(lexer);
let parser = new SOQLParser(tokenStream);
let qry = parser.soql_query();
let select = qry.select_clause();
console.log('FIELDS: ', select.field().map(field => field.text));
console.log('FROM: ', qry.from_clause().text);
Console Log
line 1:35 extraneous input 'FROM' expecting ID
line 1:47 mismatched input '<EOF>' expecting 'FROM'
FIELDS: Array(5) ["ID", "Name", "Website", "Contact", "FROMAccount"]
FROM:
I get errors (which is expected), but I was hoping it would still be able to correctly pick out the FROM clause.
It was my understanding that since FROM is an identifier, it's not a valid field in the select_clause (maybe I'm just misunderstanding)?
Is there some way to set up the grammar or parser so that it will continue on to properly identify the FROM clause in this scenario (and in other common work-in-progress query states)?
It was my understanding since FROM is a identifier, it's not a valid
field in the select_clause (maybe I'm just misunderstanding)?
All the parser sees is a discrete stream of typed tokens coming from the lexer. The parser has no intrinsic way to tell whether a token is intended to be an identifier or, for that matter, whether it has any particular semantic nature.
In designing a fault-tolerant grammar, plan for the parser to be fairly permissive of syntax errors, and expect to use several tree-walkers to progressively identify and, where possible, resolve the syntactic and semantic ambiguities.
Two ANTLR features particularly useful to this end include:
1) implement a lexer TokenFactory and custom token, typically extending CommonToken. The custom token provides a convenient space for flags and logic for identifying the correct syntactic/semantic use and expected context for a particular token instance.
2) implement a parser error strategy, extending or expanding on the DefaultErrorStrategy. The error strategy will allow modest modifications to the parser operation on the token stream when an attempted match results in a recognition error. If the error cannot be fully resolved and appropriately fixed upon examining the surrounding (custom) tokens, at least those same custom tokens can be suitably annotated to ease problem resolution during the subsequent tree-walks.
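To make the permissive idea concrete for the trailing-comma input from the question, here is a minimal hand-written sketch that does not use the ANTLR runtime at all. The log in the question suggests that the default recovery dropped FROM as extraneous input and then matched Account as a fifth field. The sketch instead records a missing field and leaves the keyword in place. The names `tokenize`, `parseSoql`, `Tok` and `ParseResult` are hypothetical and exist only for this illustration.

```typescript
// Hypothetical, self-contained sketch; this is not the antlr4ts API.
type TokType = "SELECT" | "FROM" | "COMMA" | "ID";
interface Tok { type: TokType; text: string; }
interface ParseResult { fields: string[]; table: string | null; errors: string[]; }

// Mirror the lexer rules of the minimized grammar. Whitespace and any
// unknown characters are simply skipped in this sketch.
function tokenize(input: string): Tok[] {
  const parts: string[] = input.match(/[A-Za-z_$][A-Za-z_$0-9]*|,/g) || [];
  return parts.map((text): Tok => {
    const type: TokType =
      text === "," ? "COMMA" :
      text === "SELECT" ? "SELECT" :
      text === "FROM" ? "FROM" : "ID";
    return { type, text };
  });
}

function parseSoql(input: string): ParseResult {
  const toks = tokenize(input);
  const result: ParseResult = { fields: [], table: null, errors: [] };
  let i = 0;

  // Consume the next token only if it has the wanted type.
  const accept = (type: TokType): Tok | undefined => {
    const t: Tok | undefined = toks[i];
    if (t !== undefined && t.type === type) { i++; return t; }
    return undefined;
  };

  if (!accept("SELECT")) result.errors.push("expected SELECT");

  // select_clause: field ( COMMA field )*
  do {
    const field = accept("ID");
    if (field) result.fields.push(field.text);
    // Permissive step: note the missing field, but do NOT consume the
    // offending token, so a following FROM is still seen as a keyword.
    else result.errors.push("missing field at token index " + i);
  } while (accept("COMMA"));

  // from_clause: FROM table
  if (accept("FROM")) {
    const table = accept("ID");
    if (table) result.table = table.text;
    else result.errors.push("missing table name");
  } else {
    result.errors.push("expected FROM");
  }
  return result;
}
```

Calling `parseSoql('SELECT ID, Name, Website, Contact, FROM Account')` yields four fields, the table `Account`, and one recorded error. In an ANTLR parser the comparable behaviour would live in a custom error strategy, whose details depend on the target runtime.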
I have a grammar with semantic predicates somewhat simplified like this:
startrule:
{<condition1 in C++>}? rule1 |
{<condition2 in C++>}? rule2
;
rule1:
{<condition1.1 in C++>}? rule1_statement1 |
{<condition1.2 in C++>}? rule1_statement2
;
rule2:
{<condition2.1 in C++>}? rule2_statement1 |
{<condition2.2 in C++>}? rule2_statement2
;
If condition1 or condition2 evaluates to true, the parser correctly goes to rule1 or rule2. So the semantic predicates are working so far, but the problem I'm seeing is that, for example:
rule2 is executed
condition2.1 is false
condition2.2 is true (it should go to rule2_statement2)
When I look at the generated C++ code, I see this line:
switch (getInterpreter<atn::ParserATNSimulator>()->adaptivePredict(_input, 531, _ctx)) {
And then a case for each corresponding statement. When the code is executed, even if condition2.1 is false, it enters the case for rule2_statement1 (instead of the case for rule2_statement2). So it seems as if the semantic predicates are not working there?
And since that code has a check for the condition like this:
if (!(condition2.1)) throw FailedPredicateException(this, "condition2.1");
It throws a FailedPredicateException. My ErrorStrategy's recover just calls the DefaultErrorStrategy recover, which eventually crashes because LL1Analyzer::_LOOK throws an out of range exception.
Any hint as to why some semantic predicates appear not to be working? rule2_statement1 and rule2_statement2 have the same tokens but different embedded actions.
Regards,
JZ
Never mind... I had an issue in my C++ code with the conditions...
Here is the warning message in question:
BB_LLVM2AST.g:120:15: Decision can match input such as "'a'..'z'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
And here is the rule:
fragment
IDENTIFIER
:
((LOWERCHARS)+ (('0'..'9')+)? PERIOD?)+
| ('0'..'9')+
;
Here are the other rules called:
fragment
LOWERCHARS
:
('a'..'z')
;
fragment
PERIOD
:
'.'
;
So, I tried using syntactic predicates, but still have the same warning message.
fragment
IDENTIFIER
:
(LOWERCHARS) => ((LOWERCHARS)+ (INT)? PERIOD?)+
|(INT) => INT
;
I took out an INT fragment that was there in a blind attempt to get rid of the warning. I don't understand how input such as "'a'..'z'" could possibly be matched by the alternative ('0'..'9')+. Also, what can I do to get rid of this warning? I hate warning messages.
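One likely reading of the warning is that "alternatives 1, 2" refers to a loop decision (continue or exit) rather than to the two branches of IDENTIFIER. The inner (LOWERCHARS)+ loop is nested inside the outer ( ... )+ loop, so after one lowercase letter the next letter can either continue the inner loop or start a new outer iteration. The sketch below is a brute-force counter, not ANTLR code; `countParses` is a hypothetical name. It counts the ways a string can be split into outer iterations of the shape "letters, optional digits, optional period".

```typescript
// One outer iteration of ((LOWERCHARS)+ (('0'..'9')+)? PERIOD?)+
const chunk = /^[a-z]+[0-9]*\.?$/;

// Count the distinct ways `s` can be split into one or more chunks.
function countParses(s: string): number {
  let total = 0;
  for (let end = 1; end <= s.length; end++) {
    if (chunk.test(s.slice(0, end))) {
      // Either the chunk covers the whole input, or the rest must also
      // be derivable as further outer iterations.
      total += end === s.length ? 1 : countParses(s.slice(end));
    }
  }
  return total;
}

console.log(countParses("a"));   // 1
console.log(countParses("ab"));  // 2: [ab] or [a][b]
console.log(countParses("abc")); // 4
console.log(countParses("a1b")); // 1: digits force the split
```

If this is indeed the cause, letting each outer iteration match a single letter, for example (LOWERCHARS (INT)? PERIOD?)+, should accept the same inputs without the choice. That rewrite is a suggestion and has not been run through ANTLR here.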
I have the following rule:
statement : TOKEN1 opt1=TOKEN2? opt2=TOKEN3 TOKEN4 -> ^(TOKEN1 $opt1? $opt2);
The AST generated by this rule will have one or two children (depending on if
opt1 was defined or not).
I need to have always a fixed number of children (in this case 2). I know that
this can be achieved by doing the following (UNDEFINED is an imaginary token):
statement : TOKEN1 opt2=TOKEN3 TOKEN4 -> ^(TOKEN1 UNDEFINED $opt2)
| TOKEN1 opt1=TOKEN2 opt2=TOKEN3 TOKEN4 -> ^(TOKEN1 $opt1 $opt2);
This is fine for just one optional token. The problem is when I have a higher
number of optional tokens. A lot of rules must be written in order to catch all
possible combinations. How can this issue be solved in an elegant way?
I'm using ANTLR 3.4/C target by the way.
Thanks,
T.
You could do this:
grammar G;
tokens {
CHILD1;
CHILD2;
CHILD3;
}
...
statement
: ROOT t2=TOKEN2? t3=TOKEN3? t4=TOKEN4?
-> ^(ROOT ^(CHILD1 $t2?) ^(CHILD2 $t3?) ^(CHILD3 $t4?))
;
which will cause the AST to always have 3 child nodes (each of which may or may not have a token as a child itself).
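The same idea can be expressed outside ANTLR: give every optional slot its own placeholder node so that the child count of the root never changes. The following is a minimal sketch. The `AstNode` type and the `leaf`, `slot` and `makeStatement` helpers are hypothetical and only mirror the shape of the rewrite rule above.

```typescript
interface AstNode { label: string; children: AstNode[]; }

function leaf(label: string): AstNode {
  return { label, children: [] };
}

// Wrap an optional token in a placeholder node. The placeholder is always
// present; it has one child if the token was matched and none otherwise.
function slot(name: string, token?: string): AstNode {
  return { label: name, children: token === undefined ? [] : [leaf(token)] };
}

// Mirrors ^(ROOT ^(CHILD1 $t2?) ^(CHILD2 $t3?) ^(CHILD3 $t4?)).
function makeStatement(t2?: string, t3?: string, t4?: string): AstNode {
  return {
    label: "ROOT",
    children: [slot("CHILD1", t2), slot("CHILD2", t3), slot("CHILD3", t4)],
  };
}

const tree = makeStatement("a", undefined, "c");
console.log(tree.children.length);             // 3
console.log(tree.children[1].children.length); // 0, TOKEN3 was absent
```

A tree walker can then address each optional token by position and only needs to check whether the placeholder is empty.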
I use ANTLR4 with Java and I would like to store the values that a rule returns while it parses the input. I use a grammar like this:
db : 'DB' '(' 'ID' '=' ID ',' query* ')'
{
System.out.println("creating db");
System.out.println("Number of queries -> "+$query.qs.size());
}
;
query returns [ArrayList<Query> qs]
@init
{
$qs = new ArrayList<Query>();
}
: 'QUERY' '(' 'ID' '=' ID ',' smth ')'
{
System.out.println("creating query with id "+$ID.text);
Query query = new Query();
query.setId($ID.text);
$qs.add(query);
}
;
but what happens is that the Number of queries printed ($query.qs size) is always one. This happens because each time a QUERY element is recognized at input it is added to the $qs ArrayList, but for each other QUERY a new ArrayList is instantiated and this query is added to this new ArrayList. When all the queries are recognized then the action for the db : rule is invoked, but the $query.qs ArrayList has only the last query. I solved this problem by maintaining global ArrayLists that store the queries. But, is there another way to do it with ANTLR while the rules are returning, and not having my own global ArrayLists?
Many thanks in advance,
Dimos.
Well, the problem is resolved. I just added the ArrayList to the db rule like this:
db [ArrayList queries] : 'DB' ....
and then at query rule:
$db::queries.add(query)
So, everything is fine!
Thanks for looking, anyway!
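The underlying issue is ownership: a fresh list is created on every invocation of query, so the list has to live in the enclosing db invocation instead. Here is a small sketch of that ownership, with hypothetical `parseDbRule` and `parseQueryRule` functions standing in for the generated rule methods. The real generated code looks different.

```typescript
interface Query { id: string; }

// Stand-in for the `query` rule: it returns one Query and owns no list.
function parseQueryRule(id: string): Query {
  return { id };
}

// Stand-in for the `db` rule: the list is created once here and is filled
// as each `query` invocation returns.
function parseDbRule(queryIds: string[]): Query[] {
  const queries: Query[] = [];
  for (const id of queryIds) {
    queries.push(parseQueryRule(id));
  }
  return queries;
}

console.log(parseDbRule(["q1", "q2", "q3"]).length); // 3
```

In an ANTLR4 grammar a comparable effect might be obtained with a list label such as qs+=query in the db rule, which collects one context per matched query. Whether that fits depends on how the Query objects are built.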
I have this code for query:
$repository = $em->getRepository('AcmeCrawlerBundle:Trainings');
$query = $repository->createQueryBuilder('p')
->where('p.title LIKE :word')
->orWhere('p.discription LIKE :word')
->setParameter('word', $word)
->getQuery();
$trainings = $query->getResult();
The problem is that even if matches exist, they are not found by this query. I used this code to see the full SQL:
print_r(array(
'sql' => $query->getSQL(),
'parameters' => $query->getParameters(),
));
And what I've got:
FROM Trainings t0_ WHERE t0_.title LIKE ? OR t0_.discription LIKE ? [parameters] => Array ( [word] => Spoken )
(last part of query)
Tell me please what to change?
You forgot the % signs around the word:
->setParameter('word', '%'.$word.'%')
Below are some additional steps you can take to further sanitise input data.
You should escape the term that you insert between the percentage signs:
->setParameter('word', '%'.addcslashes($word, '%_').'%')
The percentage sign '%' and the symbol underscore '_' are interpreted as wildcards by LIKE. If they're not escaped properly, an attacker might construct arbitrarily complex queries that can cause a denial of service attack. Also, it might be possible for the attacker to get search results he is not supposed to get. A more detailed description of attack scenarios can be found here: https://stackoverflow.com/a/7893670/623685
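For readers outside PHP, the escaping step can be sketched as a small helper. This is an illustration only: `likePattern` is a hypothetical name, and it assumes that the database treats the backslash as the LIKE escape character, which is not true of every engine or configuration. Like the addcslashes call above, it escapes only % and _ and leaves backslashes in the input untouched.

```typescript
// Escape the LIKE wildcards % and _ with a backslash (assumed escape
// character), then wrap the term in % for a substring match.
function likePattern(word: string): string {
  const escaped = word.replace(/[%_]/g, "\\$&");
  return "%" + escaped + "%";
}

console.log(likePattern("Spoken"));  // %Spoken%
console.log(likePattern("50%_off")); // %50\%\_off%
```

The result is still passed as a bound parameter, as in the setParameter call above; the escaping only stops user-supplied wildcards from widening the match.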