How to prove that a grammar is LL(k) for k>1

I have an exercise which gives me a grammar and asks me to prove that it is not LL(1). All good with that part, but afterwards it asks whether that grammar can be LL(k) (for k > 1) or not. What procedure do I follow to determine that?

For a given k and a non-left-recursive grammar, all you have to do is build the LL(k) parsing table (by algorithms readily available everywhere). If no table entry contains more than one production, i.e. there are no conflicts, the grammar is LL(k), and so is its language.
Knowing whether there exists a k for which a given language is LL(k) is undecidable, though. You'd have to try one value of k after another until you succeed, or until the universe runs out.
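For k = 1 the whole check fits in a few lines. Here is a rough Python sketch (the grammar encoding and all names are mine, not from any particular textbook): it computes FIRST and FOLLOW by fixpoint iteration, then reports any cell of the LL(1) table that receives two productions. The same idea extends to k > 1 with sets of length-k terminal strings, at the cost of much more bookkeeping.
# Rough LL(1) check. grammar maps each nonterminal to its alternatives;
# each alternative is a list of symbols. Encoding and names are mine.
grammar = {
    'S': [['a', 'S', 'b'], []],          # example: S -> a S b | epsilon
}
START = 'S'
NONTERMINALS = set(grammar)

def is_terminal(sym):
    return sym not in NONTERMINALS

FIRST = {nt: set() for nt in NONTERMINALS}   # None stands for epsilon

def first_of_seq(seq):
    out = set()
    for sym in seq:
        if is_terminal(sym):
            out.add(sym)
            return out
        out |= FIRST[sym] - {None}
        if None not in FIRST[sym]:
            return out
    out.add(None)                        # the whole sequence can vanish
    return out

changed = True
while changed:                           # fixpoint for FIRST
    changed = False
    for nt, alts in grammar.items():
        for alt in alts:
            new = first_of_seq(alt) - FIRST[nt]
            if new:
                FIRST[nt] |= new
                changed = True

FOLLOW = {nt: set() for nt in NONTERMINALS}
FOLLOW[START].add('$')                   # '$' marks end of input
changed = True
while changed:                           # fixpoint for FOLLOW
    changed = False
    for nt, alts in grammar.items():
        for alt in alts:
            for i, sym in enumerate(alt):
                if is_terminal(sym):
                    continue
                tail = first_of_seq(alt[i + 1:])
                new = (tail - {None}) - FOLLOW[sym]
                if None in tail:
                    new |= FOLLOW[nt] - FOLLOW[sym]
                if new:
                    FOLLOW[sym] |= new
                    changed = True

table = {}                               # (nonterminal, lookahead) -> alt
for nt, alts in grammar.items():
    for alt in alts:
        f = first_of_seq(alt)
        lookaheads = (f - {None}) | (FOLLOW[nt] if None in f else set())
        for tok in lookaheads:
            if (nt, tok) in table:
                print('LL(1) conflict at', (nt, tok))
            table[(nt, tok)] = alt
# If the loop printed no conflicts, the grammar is LL(1).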

Is this conversion from BNF to EBNF correct?

As context, my textbook uses this style for EBNF:
Sebesta, Robert W. Concepts of Programming Languages, 11th ed., Pearson, 2016, p. 150.
The problem:
Convert the following BNF rule with three RHSs to an EBNF rule with a single RHS.
Note: Conversion to EBNF should remove all explicit recursion and yield a single RHS EBNF rule.
A ⟶ B + A | B – A | B
My solution:
A ⟶ B [ (+ | –) A ]
My professor tells me:
"First, you should use { } instead of [ ],
Second, according to the BNF rule, <"term"> is B." (He is referring to the style guide posted above.)
Is he correct? I assume so but have read other EBNF styles and wonder if I am entitled to credit.
You were clearly asked to remove explicit recursion and your proposed solution doesn't do that; A is still defined in terms of itself. So independent of naming issues, you failed to do the requested conversion and your prof is correct to mark you down for it. The correct solution for the problem as presented, ignoring the names of non-terminals, is A ⟶ B { (+ | –) B }, using indefinite repetition ({…}) instead of optionality ([…]). With this solution, the right-hand side of the production for A only references B, so there is no recursion (at least, in this particular production).
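If it helps to see why the repetition form is the useful one: in a hand-written recursive-descent parser, { … } turns into a plain loop, while the [ … ] A form would have to call itself. A small Python sketch (the function names are mine, purely illustrative):
# Hypothetical recursive-descent fragment for  A -> B { (+ | -) B }.
# The EBNF repetition becomes a loop; no self-call is needed.
def parse_B(tokens):
    return tokens.pop(0)                 # placeholder operand parser

def parse_A(tokens):
    value = parse_B(tokens)              # first operand
    while tokens and tokens[0] in ('+', '-'):
        op = tokens.pop(0)               # consume the operator
        value = (op, value, parse_B(tokens))   # grows left-associatively
    return value

print(parse_A(['x', '+', 'y', '-', 'z']))
# ('-', ('+', 'x', 'y'), 'z')
# The rejected form  A -> B [ (+ | -) A ]  would call parse_A here
# instead of looping, i.e. the recursion would survive the conversion.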
Now, for naming: clearly, your textbook's EBNF style is to use angle brackets around the non-terminal names. That's a common style, and many would say that it is more readable than using single capital letters which mean nothing to a human reader. I suppose your prof thinks you should have changed the name of B to <term> on the basis that that is the "textbook" name for the non-terminal representing the operand of an additive operator. The original BNF you were asked to convert does show the two additive operators. However, it makes them right-associative, which is definitely non-standard. So you might be able to construct an argument that there's no reason to assume that these operators are additive and that their operands should be called "terms" [Note 1]. But even on that basis, you should have used some name written in lower-case letters and surrounded by angle brackets. To me, that's minor compared with the first issue, but your prof may have their own criteria.
In summary, I'm afraid I have to say that I don't believe you are entitled to credit for that solution.
Notes
1. If you had actually come up with that explanation, your prof might have been justified in suggesting a change of major to Law.

How to skip "and" with a skip rule?

I'm working on a new ANTLR grammar which, similar to Natty, should recognize date expressions, but I have a problem with skip rules. In more detail, I want to ignore useless "and"s in expressions, for example:
Call Sam, John and Adam and fix a meeting with Sarah about the finance on Monday and Friday.
The first two "and"s are useless. I wrote the rule below to fix this problem, but it didn't work. Why? What should I do?
NW : [~WeekDay];
UselessAnd : AND NW -> skip;
"Useless AND" is a semantic concept.
Grammars are about syntax, and handle semantic issues poorly. Don't couple these together.
Suggestion: when you write a grammar for a language, make your parser accept the language as it is, warts and all. In your case, I suggest you "collect" the useless ANDs. That way you can get the grammar "right" more easily, and more transparently to the next coder who has to maintain your grammar.
Once you have the AST, it is pretty easy to ignore (semantically) useless things; if nothing else, you can post-process the AST and remove the useless AND nodes.
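As a sketch of that post-pass, assuming a simple homogeneous AST of (kind, children) tuples rather than ANTLR's actual tree classes (a real project would do the same walk with a parse-tree listener or visitor, but the shape of the pass is identical):
# Hypothetical homogeneous AST: (kind, children) tuples; names are mine.
def strip_nodes(node, should_drop):
    kind, children = node
    kept = [strip_nodes(c, should_drop)
            for c in children if not should_drop(c)]
    return (kind, kept)

# Example: the parser accepted the ANDs, warts and all; a later pass
# drops the ones that merely join items in a name list.
ast = ('call', [('name', []), ('AND', []), ('name', []),
                ('AND', []), ('name', [])])
print(strip_nodes(ast, lambda n: n[0] == 'AND'))
# ('call', [('name', []), ('name', []), ('name', [])])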

Can a context-sensitive grammar have an empty string?

In one of my CS classes it was mentioned that the difference between a context-free grammar and a context-sensitive grammar is that in a CSG, the left-hand side of a production rule has to be no longer than the right-hand side.
So, one example they gave was that a context-sensitive grammar can't derive the empty string, because a rule producing it would violate that condition.
However, I have understood that regular grammars are contained in context-free grammars, context-free in context-sensitive, and context-sensitive in recursively enumerable grammars.
So, for example, if a grammar is regular, then it is also context-free, context-sensitive and recursively enumerable.
The problem is that if I have a context-free grammar that derives the empty string, it does not satisfy the condition to be counted as context-sensitive; but then a contradiction occurs, because every context-free grammar is supposed to also be context-sensitive.
Empty productions ("lambda productions", so-called because λ is often used to refer to the empty string) can be mechanically eliminated from any context-free grammar, except for the possible top-level production S → λ. The algorithm to do so is presented in pretty well every text on formal language theory.
So for any CFG with lambda productions, there is an equivalent CFG without lambda productions which generates the same language, and which is also a context-sensitive grammar. Thus, the prohibition on contracting rules in CSGs does not affect the hierarchy of languages: any context-free language is a context-sensitive language.
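For concreteness, here is a rough Python sketch of that elimination (the grammar encoding and names are mine, not from any particular text): find the nullable nonterminals by fixpoint, then add every variant of each right-hand side with some subset of its nullable symbols deleted.
# Sketch: eliminate lambda (empty) productions from a CFG.
# grammar: nonterminal -> list of alternatives (lists of symbols).
from itertools import combinations

def eliminate_lambda(grammar, start):
    # 1. find all nullable nonterminals by fixpoint iteration
    nullable, changed = set(), True
    while changed:
        changed = False
        for nt, alts in grammar.items():
            if nt not in nullable and any(
                    all(s in nullable for s in alt) for alt in alts):
                nullable.add(nt)
                changed = True
    # 2. replace each alternative by every variant obtained by deleting
    #    some subset of its nullable symbols; drop the empty variants
    new = {nt: set() for nt in grammar}
    for nt, alts in grammar.items():
        for alt in alts:
            spots = [i for i, s in enumerate(alt) if s in nullable]
            for r in range(len(spots) + 1):
                for drop in combinations(spots, r):
                    variant = tuple(s for i, s in enumerate(alt)
                                    if i not in drop)
                    if variant:
                        new[nt].add(variant)
    result = {nt: [list(v) for v in sorted(alts)]
              for nt, alts in new.items()}
    if start in nullable:                # the one permitted exception:
        result[start].append([])         # keep S -> lambda at top level
    return result

print(eliminate_lambda({'S': [['a', 'S', 'b'], []]}, 'S'))
# {'S': [['a', 'S', 'b'], ['a', 'b'], []]}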
Chomsky's original definition of context-sensitive grammars did not specify the non-contracting property, but rather an even more restrictive one: every production had to be of the form αAβ → αγβ, where A is a single symbol and γ is not empty. This set of grammars generates the same set of languages as the non-contracting grammars (that was also proven by Chomsky), but it is not the same set of grammars. Also, his context-free grammars were indeed a subset of his context-sensitive grammars, because by his original definition of CFGs, lambda productions were prohibited. (The 1959 paper is available online; see the Wikipedia article on the Chomsky hierarchy for a reference link.)
It is precisely the existence of a non-empty context -- α and β -- which leads to the names "context-sensitive" and "context-free"; it is much less clear what "context-sensitive" might mean with respect to an arbitrary non-contracting rule such as AB → BA. (Note 1)
In short, the claim that "every CFG is a CSG" is not technically correct given the common modern usage of CFG and CSG, as cited in your question. But it is only a technicality: the CFG with lambda productions can be mechanically transformed, just as a non-contracting grammar can be mechanically transformed into a grammar fitting Chomsky's definition of context-sensitive (see the Wikipedia article on non-contracting grammars).
(It is also quite common to allow both context-sensitive and context-free languages to include the empty string, by adding an exception for the rule S→λ to both CFG and CSG definitions.)
Notes
1. In Chomsky's formulation of context-free and -sensitive grammars, it was unambiguous what was meant by a parse tree, even for a CSG; since Chomsky is a linguist and was seeking a framework to explain the structure of natural language, the notion of a parse tree mattered. It is not at all obvious how you might describe the result of AB → BA as a parse tree, although it is quite clear in the case of, for example, a A b → a B b.

LL(1) Grammar and parsing

I'd like some help on how to transform a grammar to LL(1) using left factoring. Possibly other techniques too, but I have already applied left-recursion elimination.
For example, I have this grammar:
S → 1X1F | 2X2F | 1X
X → 1X | 0
F → 0F | ε
ε denotes the empty string, i.e. the rule terminates without producing another symbol.
I appreciate any help.
This grammar is not LL(1) as written: the first and third alternatives for S both begin with 1X, and since X can be arbitrarily long, no fixed amount of lookahead can tell them apart. Left factoring the common prefix fixes it: S → 1X S' | 2X2F, with S' → 1F | ε. Now every decision needs only one token: for S, the first symbol (1 or 2) picks the alternative; for S', seeing 1 selects 1F, and end of input selects ε. So the factored grammar is LL(1).
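To convince yourself, the factored grammar can be written directly as a predictive parser that never needs more than the next input symbol. A small Python sketch (names are mine; S' is spelled S2 because ' isn't a valid identifier character):
# Predictive parser for the factored grammar:
#   S  -> 1 X S2 | 2 X 2 F       S2 -> 1 F | epsilon
#   X  -> 1 X | 0                F  -> 0 F | epsilon
def parse(s):
    pos = 0
    def peek():
        return s[pos] if pos < len(s) else '$'
    def eat(c):
        nonlocal pos
        if peek() != c:
            raise SyntaxError(f'expected {c} at position {pos}')
        pos += 1
    def S():
        if peek() == '1':
            eat('1'); X(); S2()
        else:
            eat('2'); X(); eat('2'); F()
    def S2():
        if peek() == '1':               # 1 selects 1F; '$' selects epsilon
            eat('1'); F()
    def X():
        if peek() == '1':
            eat('1'); X()
        else:
            eat('0')
    def F():
        if peek() == '0':
            eat('0'); F()
    S()
    return peek() == '$'                # whole input must be consumed

print(parse('110'))      # True:  S -> 1X    with X = 10
print(parse('11010'))    # True:  S -> 1X1F  with X = 10, F = 0
print(parse('21020'))    # True:  S -> 2X2F  with X = 10, F = 0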

Preferentially match shorter token in ANTLR4

I'm currently attempting to write a UCUM parser using ANTLR4. My current approach has involved defining every valid unit and prefix as a token.
Here's a very small subset of the defined tokens. I could make a cut-down version of the grammar as an example, but it seems like it shouldn't be necessary to resolve this problem (or to point out that I'm going about this entirely the wrong way).
MILLI_OR_METRE: 'm' ;
OSMOLE: 'osm' ;
MONTH: 'mo' ;
SECOND: 's' ;
One of the standard test cases is mosm, from which the lexer should generate the token stream MILLI_OR_METRE OSMOLE. Unfortunately, because ANTLR preferentially matches longer tokens, it generates the token stream MONTH SECOND MILLI_OR_METRE, which then causes the parser to raise an error.
Is it possible to make an ANTLR4 lexer try to match using shorter tokens first? Adding lookahead-type rules to MONTH isn't a great solution, as there are all sorts of potential lexing conflicts that I'd need to take account of (for example mol being lexed as MONTH LITRE instead of MOLE and so on).
EDIT:
StefanA below is of course correct; this is a job for a parser capable of backtracking (e.g. recursive descent, packrat, PEG and probably various others; Coco/R is one reasonable package to do this). In an attempt to avoid adding a dependency on another parser generator (or moving other bits of the project from ANTLR to this new generator) I've hacked my way around the problem like this:
MONTH: 'mo' { _input.La(1) != 's' && _input.La(1) != 'l' && _input.La(1) != '_' }? ;
// (note: this is a C# project; java would use _input.LA instead)
but this isn't really a very extensible or maintainable solution, and like as not will have introduced other subtle issues I've not come across yet.
Your problem does not actually require shorter tokens to be preferred (in that case MONTH would never be matched at all). What you need is backtracking behaviour that depends on whether the rest of the input can be matched or not. Right?
ANTLR separates tokenization and parsing strictly. Consequently every solution to your problem will seem like a hack.
However, other parser generators specialize in problems like yours. Packrat parsers (PEGs) backtrack and tokenize on the fly. Try out parboiled for this purpose.
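To illustrate the backtracking idea in a few lines of Python (a sketch of PEG-style ordered choice, not parboiled's actual API): the order of the alternatives expresses a preference, and a dead end further along the input undoes a bad choice.
# PEG-style ordered choice with backtracking. Listing 'm' before 'mo'
# makes the shorter unit the first choice; failure of the remainder
# forces the caller to try the next alternative instead.
UNITS = ['m', 'osm', 'mo', 's']          # the subset from the question

def tokenize(text, pos=0):
    if pos == len(text):
        return []                        # consumed everything: success
    for unit in UNITS:
        if text.startswith(unit, pos):
            rest = tokenize(text, pos + len(unit))
            if rest is not None:
                return [unit] + rest     # first complete parse wins
    return None                          # dead end: caller backtracks

print(tokenize('mosm'))   # ['m', 'osm']  rather than ['mo', 's', 'm']
print(tokenize('mo'))     # ['mo']        'm' fails since 'o' is no unit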
It appears that the question is not being framed correctly.
I'm currently attempting to write a UCUM parser using ANTLR4. My current approach has involved defining every valid unit and prefix as a token.
But, according to the UCUM:
The expression syntax of The Unified Code for Units of Measure generates an infinite number of codes with the consequence that it is impossible to compile a table of all valid units.
The most one can expect from the lexer is an unambiguous identification of the measurement string without regard to its semantic value. Similarly, a parser alone will be unable to validly select between unit sequences like MONTH LITRE and MOLE - both could reasonably apply to a leak rate - unless the problem space is statically constrained in the parser definition.
A heuristic, structural (explicitly identifying the problem space) or contextual (considering the relative nature of other units in the problem space), is most likely required to select the correct unit interpretation.
The best tool to use is the one that puts you in the best position to implement the heuristics necessary to disambiguate the unit strings. Antlr could do it using parse-tree walkers. Whether that is the appropriate approach requires further analysis.
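As a sketch of what such a heuristic pass might look like (the scoring rule here is invented purely for illustration): first enumerate every tokenization, then rank the candidates, e.g. preferring the reading with the fewest tokens so that mol comes out as MOLE rather than MONTH LITRE.
# Enumerate every tokenization, then let a heuristic choose among them.
UNITS = ['m', 'mo', 'mol', 'osm', 's', 'l']

def all_tokenizations(text, pos=0):
    if pos == len(text):
        yield []
        return
    for unit in UNITS:
        if text.startswith(unit, pos):
            for rest in all_tokenizations(text, pos + len(unit)):
                yield [unit] + rest

candidates = list(all_tokenizations('mol'))
print(candidates)                        # [['mo', 'l'], ['mol']]
print(min(candidates, key=len))          # ['mol']: MOLE, not MONTH LITRE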