ANTLR: how to convert BNF/EBNF into an ANTLR grammar? - antlr

I have to generate a parser for CSV data. I managed to write BNF/EBNF for the CSV data, but I don't know how to convert it into an ANTLR grammar (ANTLR being a parser generator). For example, in EBNF we write:
[{header entry}newline]newline
but when I write this in ANTLR to generate a parser, it gives an error and doesn't accept the brackets. I'm not an expert in ANTLR; can anyone help?

hi, i have to generate parser of CSV data ...
In most languages I know, there already exists a decent 3rd party CSV parser. So, chances are that you're reinventing the wheel.
For example, in EBNF we write [{header entry}newline]newline
The equivalent in ANTLR would look like this:
((header entry)* newline)? newline
In other words:
                 | (E)BNF | ANTLR
-----------------+--------+------
'a' zero or once | [a]    | a?
'a' zero or more | {a}    | a*
'a' once or more | a {a}  | a+
Note that you can group rules by using parentheses (these groups are called sub-rules):
'a' 'b'+
matches: ab, abb, abbb, ..., while:
('a' 'b')+
matches: ab, abab, ababab, ...
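For instance, a minimal ANTLR 4 sketch of your rule could look like this; the header, entry, WORD and NEWLINE definitions below are placeholders, since they're not shown in your EBNF:
grammar CSV;
// EBNF: [{header entry}newline]newline
file    : ((header entry)* NEWLINE)? NEWLINE EOF ;
// placeholder rules; their real definitions are not in the question
header  : WORD ;
entry   : WORD ;
WORD    : ~[ \t\r\n]+ ;
NEWLINE : '\r'? '\n' ;
WS      : [ \t]+ -> skip ;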

Related

Q: ANTLR 4 Grammar recognition of whole odd value not only the last digit

I'm trying to make a grammar for a calculator, but it has to work only with odd numbers.
For example, it currently behaves like this:
If I put in 123, the result is 123.
If I put in 1234, the result is 123 and there is a token recognition error at: 4, but the error should be at: 1234.
Here is my grammar:
grammar G;
DIGIT: ('0'..'9') * ('1' | '3' | '5' | '7'| '9');
operator : ('+' | '-' | '*' | ':');
result: DIGIT operator (DIGIT | result);
Specifically, I want the whole 1234 to be rejected as an error, not just the last digit.
The way that tokenization works is that it tries to find the longest prefix of the input that matches any of your regular expressions and then produces the appropriate token, consuming that prefix. So when the input is 1234, it sees 123 as the longest prefix that matches the DIGIT pattern (which should really be called ODD_INT or something) and produces the corresponding token. Then it sees the remaining 4 and produces an error because no rule matches it.
Note that it's not necessarily only the last digit that produces the error. For the input 1324, it would produce a DIGIT token for 13 and then a token recognition error for 24.
So how can you get the behaviour that you want? One approach would be to rewrite your pattern to match all sequences of digits and then use a semantic predicate to verify that the number is odd. The way that semantic predicates work on lexer rules is that it first takes the longest prefix that matches the pattern (without taking into account the predicate) and then checks the predicate. If the predicate is false, it moves on to the other patterns - it does not try to match the same pattern to a smaller input to make the predicate return true. So for the input 1234, the pattern would match the entire number and then the predicate would return false. Then it would try the other patterns, none of which match, so you'd get a token recognition error for the full number.
ODD_INT: ('0'..'9') + { Integer.parseInt(getText()) % 2 == 1 }?;
The down side of this approach is that you'll need to write some language-specific code (and if you're not using Java, you'll need to adjust the above code accordingly).
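Combined with the parser rules from your grammar, a sketch could look like this (the WS rule is an addition so that whitespace between tokens is skipped; it wasn't in your original grammar):
grammar G;
result   : ODD_INT operator (ODD_INT | result) ;
operator : '+' | '-' | '*' | ':' ;
// the longest run of digits is matched first, then the predicate rejects even values
ODD_INT  : ('0'..'9')+ { Integer.parseInt(getText()) % 2 == 1 }? ;
WS       : [ \t\r\n]+ -> skip ;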
Alternatively, you could just recognize all integers in the lexer - not just odd ones - and then check whether they're odd later during semantic analysis.
If you do want to check the oddness using patterns only, you can also work around the problem by defining rules for both odd and even integers:
ODD_INT: ('0'..'9') * ('1' | '3' | '5' | '7'| '9');
EVEN_INT: ('0'..'9') * ('0' | '2' | '4' | '6'| '8');
This way for an input like 1234, the longest match would always be 1234, not 123. It's just that this would match the EVEN_INT pattern, not ODD_INT. So you wouldn't get a token recognition error, but, if you consistently only use ODD_INT in the grammar, you would get an error saying that an ODD_INT was expected, but an EVEN_INT found.

Parsekit equivalent for ABNF construct <any A except B>

I want to translate a given ABNF grammar into a valid ParseKit grammar. Actually I'm trying to find a solution for this kind of statement:
tag = 1*<any Symbol except "C">
with
Symbol = "A" / "B" / "C" / "D" // a lot more symbols here...
The symbol definition is simplified for this question and normally contains a lot of special characters.
My current solution is to hard code all allowed symbols for tag, like
tag = ('A' | 'B' | 'D')+;
But what I really want is something like a "without operator"
tag = Symbol \ 'C';
Is there any construct that allows me to keep my symbol list and define some excludes?
Developer of ParseKit here.
Yes, there is a feature for exactly this. Here's an example:
allItems = 'A' | 'B' | 'C' | 'D';
someItems = allItems - 'C';
Use the - operator.
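Applied to your rules, that might look something like the following sketch; I'm assuming here that the - operator composes with grouping and + the same way as in your hard-coded version:
Symbol = 'A' | 'B' | 'C' | 'D'; // plus the rest of your symbols
tag    = (Symbol - 'C')+;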

Is this grammar left recursive?

I know of two types of left recursion, immediate and indirect, and I don't think the following grammar falls into either of them, but is that the case?
And is this grammar an LL grammar? Why or why not?
E ::= T+E | T
T ::= F*T | F
F ::= id | (E)
I assume you start with E. Both of E’s alternatives start with a T, both of T’s alternatives start with an F, and both of F’s alternatives start with a terminal symbol, so no derivation can ever bring E, T or F back into the leftmost position of its own expansion. Thus, the grammar is not left recursive, neither immediately nor indirectly.
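For contrast, a rule like the one below would be immediately left recursive, because E is the leftmost symbol of one of its own alternatives; in the grammar above, the recursive reference always sits at the right-hand end instead:
E ::= E + T | T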

antlr 3 ambiguity

I'm trying to write some simple rules and I get this ambiguity:
rule: field1 field2; //ambiguity between nsf1 and nsf2 even if I use lookahead k=4
field1: nsf1 | whatever1...;
field2: nsf2 | whatever2...;
nsf1: 'N' 'S' 'F' '1'; //meaning: no such field 1
nsf2: 'N' 'S' 'F' '2'; //meaning: no such field 2
I understand the ambiguity, but I don't understand why lookahead doesn't solve this.
I have a simple solution but I don't like it:
rule: (nsf1 (nsf2 | whatever2))
| (whatever1 (nsf2 | whatever2));
Does anybody have a more elegant solution?
Thanks a lot,
Chris
I couldn't reproduce your problem, but then I could only guess at what the rules for 'whatever1' and 'whatever2' were. Can you post a more complete grammar?
However, there's nothing in the grammar that couldn't be done entirely with lexer tokens rather than parser rules. Try capitalizing all of the rule names to turn them into lexer tokens and see if that helps.
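As a rough sketch of that suggestion (WHATEVER1 and WHATEVER2 are placeholders, since I can only guess at your real alternatives):
rule      : field1 field2;
field1    : NSF1 | WHATEVER1;
field2    : NSF2 | WHATEVER2;
NSF1      : 'NSF1';
NSF2      : 'NSF2';
WHATEVER1 : 'W1'; // placeholder
WHATEVER2 : 'W2'; // placeholder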

PostgreSQL string search for partial patterns, removing extraneous characters

Looking for a simple SQL (PostgreSQL) regular expression or similar solution (maybe soundex) that allows a flexible search, so that dashes, spaces and the like are ignored during the search and only the raw characters are matched against the table:
Currently using:
SELECT * FROM Productions WHERE part_no ~* '%search_term%'
If the user types UTR-1, it fails to bring up UTR1 or UTR 1 stored in the database. The matches also fail when a part_no has a dash and the user omits that character (or vice versa).
Example: a search for part UTR-1 should find all of the matches below.
UTR1
UTR --1
UTR 1
Any suggestions?
You may well find the official, built-in (from 8.3 at least) full text search capabilities in PostgreSQL worth looking at:
http://www.postgresql.org/docs/8.3/static/textsearch.html
For example:
It is possible for the parser to produce overlapping tokens from the same piece of text.
As an example, a hyphenated word will be reported both as the entire word and as each component:
SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
      alias      |               description                |     token
-----------------+------------------------------------------+---------------
 numhword        | Hyphenated word, letters and digits      | foo-bar-beta1
 hword_asciipart | Hyphenated word part, all ASCII          | foo
 blank           | Space symbols                            | -
 hword_asciipart | Hyphenated word part, all ASCII          | bar
 blank           | Space symbols                            | -
 hword_numpart   | Hyphenated word part, letters and digits | beta1
SELECT *
FROM Productions
WHERE REGEXP_REPLACE(part_no, '[^[:alnum:]]', '', 'g')
    = REGEXP_REPLACE('UTR-1', '[^[:alnum:]]', '', 'g')
(The 'g' flag is needed so that every non-alphanumeric character is stripped, not just the first one.)
Create an index on REGEXP_REPLACE(part_no, '[^[:alnum:]]', '', 'g') for this to work fast.
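For example, a sketch of such an expression index (the index name here is made up):
CREATE INDEX productions_part_no_norm_idx
    ON Productions (REGEXP_REPLACE(part_no, '[^[:alnum:]]', '', 'g'));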