Formal Language Theory - unique "leftmost" and "rightmost" derivations in a context-free grammar

I'm studying Formal Language Theory and working through the assigned exercises, and I found a question I can't answer with certainty:
When, in general, in a context-free grammar are the leftmost and rightmost derivations of each generated string equal?
Only if the grammar generates a deterministic context-free language
Only if there is at most one non-terminal symbol in the right-hand side of each production
Never
(P.S.: the question is translated from Italian; I hope it is understandable.)

Related

How is the term "Abstract Data Types" interpreted?

What's the right way to parse the phrase "Abstract Data Types"? Is it:
Abstract-Data Types
or
Abstract Data-Types?
Neither grouping is right, because no hyphen is needed between those words.
In some contexts one might want to describe conceptual data as "abstract data", but as a computer-science term the phrase is simply written without hyphens.
(If a hyphen had to go anywhere, "abstract data-type" would be more appropriate than "abstract-data type", since the phrase means a data type that is abstract, not a type of abstract data.)
In conclusion, "abstract data type", with no hyphens, is the standard term.

Determine if a string can be derived ambiguously in a CFG

I know that checking whether a given context-free grammar is ambiguous means checking whether there exists any string that can be derived in more than one way, and that this is undecidable.
However, I have a simpler problem. Given a specific context-free grammar and a specific string, is it possible to determine whether that string can be derived from the grammar ambiguously? Is there a general algorithm for this check?
Yes, you can use any generalized parsing algorithm, such as a GLR (Tomita) parser, an Earley parser, or even a CYK parser; all of those can produce a parse "forest" (i.e. a digraph of all possible parses) in O(n³) time and space. Creating the parse forest is a bit trickier than the "parsing" (that is, recognition), but there are known algorithms, which are referenced in the Wikipedia article.
Since the generalized parsing algorithms find all possible parses, you can rest assured that if exactly one parse is found for the string, then the string is not ambiguous.
I'd stay away from CYK parsing for this task, because it requires converting the grammar to Chomsky normal form, which makes recovering the original parse tree(s) more complicated.
Bison will generate a GLR parser if requested, so you could just use that tool. Be aware, though, that it does not optimize storage of the parse forest, since it expects to produce only a single parse, so you can end up with exponentially sized data structures (which then take exponential time to construct). That's usually only a problem with pathological grammars. Also, you will have to declare a custom %merge function on all possibly ambiguous productions; otherwise the Bison-generated parser will fail with an "ambiguous parse" error if more than one parse is possible.
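As a concrete illustration - this is my own sketch, not part of the original answer - Python's NLTK library ships chart parsers that enumerate every parse of a given string, so counting the resulting trees answers the question directly. The toy expression grammar below is a hypothetical example:

# Check whether one specific string is ambiguous under a given CFG by
# enumerating all of its parse trees with NLTK's chart parser.
import nltk

grammar = nltk.CFG.fromstring("""
    E -> E '+' E | E '*' E | 'id'
""")

def is_ambiguous(tokens):
    # parse() yields every parse tree; note that listing all trees can
    # blow up on highly ambiguous inputs even though the chart is cubic.
    trees = list(nltk.ChartParser(grammar).parse(tokens))
    return len(trees) > 1, trees

ambiguous, trees = is_ambiguous(['id', '+', 'id', '*', 'id'])
print(ambiguous)   # True: (id+id)*id and id+(id*id) both parse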

Finding a regular grammar for a given regular expression?

I am trying to find a regular grammar that generates the language given by the regular expression ((a+b*c)d)*. Is there a general technique I can use to convert regular expressions into regular grammars?
It's usually a lot easier to convert a finite automaton for a regular language into a regular grammar than it is to convert a regular expression directly. I'd recommend starting off by building an automaton for the regular expression - either manually or by applying Thompson's construction to mechanically convert the regex into an NFA - and then doing the conversion from there, as sketched below.
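To make the conversion concrete, here is a minimal sketch (my own addition, with a hand-built and therefore unverified NFA for ((a+b*c)d)*, where + denotes union): every transition q --x--> r of the automaton becomes a production Q -> x R of a right-linear grammar, and every accepting state q contributes Q -> ε.

# state -> list of (input symbol, next state); S is both the start state
# and accepting, since the starred expression matches the empty string.
transitions = {
    'S': [('a', 'D'), ('b', 'B'), ('c', 'D')],
    'B': [('b', 'B'), ('c', 'D')],
    'D': [('d', 'S')],
}
accepting = {'S'}

productions = [f"{q} -> {x} {r}" for q, edges in transitions.items()
               for x, r in edges]
productions += [f"{q} -> ε" for q in accepting]
print("\n".join(productions))
# S -> a D | b B | c D | ε ;  B -> b B | c D ;  D -> d S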

Why don't many languages accept names starting with a digit?

I keep bumping into a curious fact while reading programming language references:
Variable or constant names cannot start with a digit.
Of course, even if names starting with a digit were allowed, using them would be bad practice.
But what are the main reasons, really?
Is it that hard to parse?
Or is it forbidden in order to avoid obfuscated code?
This restriction exists in order to simplify the language parsers. The work needed to accept identifiers with leading digits is probably not considered worth the complexity.
Not all languages have that restriction though; consider Racket (a Lisp/Scheme dialect):
pu#pumbair: ~ racket
Welcome to Racket v5.3.6.
-> (define 9times! 9)
-> (* 9times! 2)
18
but then of course Lisp languages are particularly easy to parse.
As for obfuscation, I'm sure the fact that identifiers can contain arbitrary Unicode characters (as in Racket and Go) can be far more confusing:
-> (define ǝʃqɐıɹɐʌ-ɐ-sı-sıɥ⊥ 144)
-> (sqrt ǝʃqɐıɹɐʌ-ɐ-sı-sıɥ⊥)
12
To make parsing efficient, a parser relies on looking ahead at the next character to determine the possibilities for the next token. If identifiers such as variable names, constant names and keywords could start with a digit, the number of possibilities to branch on for the next token would go up dramatically. Depending on the parsing method, the lexer might also have to look ahead more characters to determine the token type, which adds complexity to the parser.
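To make that branching concrete, here is a minimal sketch (my own illustration, not from either answer) of the first-character dispatch used in typical hand-written lexers; once the first character is a digit, the scanner is committed to a number token, so 9times can never lex as a single identifier:

def next_token(src, i):
    # Dispatch purely on the first character, as most lexers do.
    if src[i].isdigit():                      # leading digit => number
        j = i
        while j < len(src) and src[j].isdigit():
            j += 1
        return ('NUMBER', src[i:j]), j
    if src[i] == '_' or src[i].isalpha():     # leading letter => identifier
        j = i
        while j < len(src) and (src[j] == '_' or src[j].isalnum()):
            j += 1
        return ('IDENT', src[i:j]), j
    return ('PUNCT', src[i]), i + 1

print(next_token("9times", 0))   # (('NUMBER', '9'), 1) - 'times' lexes separately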

What does a single sentence consist of? What should its parts be called?

I'm designing the architecture of a text parser. Example sentence: The quick, brown fox jumps over the lazy dog.
The whole sentence is a... sentence, that's obvious. The, quick etc. are words; , and . are punctuation marks. But what are words and punctuation marks called all together, in general? Are they just symbols? I simply don't know what to call the parts a single sentence consists of, in the most reasonable abstract way (one could also say it consists of letters, vowels, etc.).
Thanks for any help :)
What you're doing is technically lexical analysis ("lexing"), which takes a sequence of input symbols and generates a series of tokens or lexemes. So words, punctuation and white-space are all tokens.
In (E)BNF terms, lexemes or tokens are synonymous with "terminal symbols". If you think of the set of parsing rules as a tree the terminal symbols are the leaves of the tree.
So what's the atom of your input? Is it a word or a sentence? If it's words (and white-space) then a sentence is more akin to a parsing rule. In fact the term "sentence" can itself be misleading. It's not uncommon to refer to the entire input sequence as a sentence.
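As an illustration (my own sketch, not part of the answer), such a lexer can be written with a single regular expression whose named groups are the token types; the names WORD, PUNCT and SPACE are my own choices:

import re

TOKEN_RE = re.compile(r"(?P<WORD>[A-Za-z]+)|(?P<PUNCT>[.,;:!?])|(?P<SPACE>\s+)")

def tokenize(text):
    # lastgroup names the alternative that matched, i.e. the token type.
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

print(tokenize("The quick, brown fox jumps over the lazy dog."))
# [('WORD', 'The'), ('SPACE', ' '), ('WORD', 'quick'), ('PUNCT', ','), ...]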
A semi-common term for a sequence of non-white-space characters is a "textrun".
A common term comprising the two sub-categories "words" and "punctuation", often used when talking about parsing, is "tokens".
Depending on what stage of your lexical analysis of input text you are looking at, these would be either "lexemes" or "tokens."