Determine if a string can be derived ambiguously in a CFG - grammar

I know that given a specific context free grammar, to check if it is ambiguous requires checking if there exists any string that can be derived in more than 1 way. And this is undecidable.
However, I have a simpler problem. Given a specific context free grammar and a specific string, is it possible to determine if the string can be derived from the grammar ambiguously? Is there a general algorithm to do this check?

Yes, you can use any generalized parsing algorithm, such as a GLR (Tomita) parser, an Earley parser, or even a CYK parser; all of those can produce a parse "forest" (i.e. a digraph of all possible parses) in O(N^3) time and space. Creating the parse forest is a bit trickier than the "parsing" (that is, recognition), but there are known algorithms which are referenced in the Wikipedia article.
Since the generalized parsing algorithms find all possible parses, you can rest assured that if exactly one parse is found for the string, then the string is not ambiguous; if more than one is found, it is.
I'd stay away from CYK parsing for this algorithm because it requires converting the grammar to Chomsky Normal Form, which makes recovering the original parse tree(s) more complicated.
Bison will generate a GLR parser, if requested, so you could just use that tool. However, be aware that it does not optimize storage of the parse forest, since it expects to produce only a single parse, and therefore you can end up with exponentially-sized data structures (which then take exponential time to construct). That's usually only a problem with pathological grammars, though. Also, you will have to declare a custom %merge function on all possibly ambiguous productions; otherwise, the Bison-generated parser will fail with an "ambiguous parse" error if more than one parse is possible.
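For small experiments, the check can also be sketched without a parser generator by counting parse trees directly. The following is only an illustrative sketch, not GLR or Earley: the function name count_parses and the dictionary grammar format are made up for this example, and it assumes the grammar has no empty productions and no cycles of unit rules (otherwise the recursion would not terminate).

```python
from functools import lru_cache

def count_parses(grammar, start, text):
    """Count the distinct parse trees deriving `text` from `start`.

    `grammar` maps each nonterminal to a list of right-hand sides, each a
    tuple of symbols; any symbol that is not a key is treated as a terminal.
    Assumption: no empty productions and no cyclic unit rules.
    """

    @lru_cache(maxsize=None)
    def symbol(sym, i, j):
        # Number of ways `sym` derives text[i:j].
        if sym not in grammar:  # terminal: must match the slice exactly
            return 1 if text[i:j] == sym else 0
        return sum(sequence(rhs, i, j) for rhs in grammar[sym])

    @lru_cache(maxsize=None)
    def sequence(rhs, i, j):
        # Number of ways the symbol sequence `rhs` derives text[i:j].
        if len(rhs) == 1:
            return symbol(rhs[0], i, j)
        total = 0
        # The first symbol takes text[i:k]; every remaining symbol needs
        # at least one character, which bounds k from above.
        for k in range(i + 1, j - len(rhs) + 2):
            left = symbol(rhs[0], i, k)
            if left:
                total += left * sequence(rhs[1:], k, j)
        return total

    return symbol(start, 0, len(text))

# Classic ambiguous grammar: E -> E + E | a
grammar = {"E": [("E", "+", "E"), ("a",)]}
print(count_parses(grammar, "E", "a+a"))    # 1: only one tree
print(count_parses(grammar, "E", "a+a+a"))  # 2: (a+a)+a and a+(a+a)
```

A count of 0 means the string is not in the language, 1 means it has a unique derivation, and anything larger means it is derived ambiguously.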

Related

How is the term "Abstract Data Types" interpreted?

What's the right way to process the phrase "Abstract Data Types"? Is it:
Abstract-Data Types
Or,
Abstract Data-Types
Neither is a well-explained term.
Strictly speaking, there is no need for a hyphen anywhere in the phrase.
In some circumstances it might be necessary to describe conceptual data as ‘abstract-data’. In computer science terminology, though, the usual form is ‘abstract data’, without a hyphen.
(If there should be a hyphen in the phrase, ‘abstract data-type’ would be more appropriate than ‘abstract-data type’.)
In conclusion, ‘abstract data type’ is the most generally used term.

Why don't many languages accept names starting from a digit?

I am always bumping into a curious fact while reading any programming language reference:
Variable or constant name cannot start with a digit
Of course, even if names starting with a digit were allowed, using them would be bad practice.
But what are the main reasons really?
Is it going to be so hard to parse?
Is it disallowed in order to avoid obfuscating code?
This restriction exists in order to simplify the language parsers. The work needed to accept identifiers with leading digits is probably not considered worth the complexity.
Not all languages have that restriction though; consider Racket (a Lisp/Scheme dialect):
pu@pumbair: ~ racket
Welcome to Racket v5.3.6.
-> (define 9times! 9)
-> (* 9times! 2)
18
but then of course Lisp languages are particularly easy to parse.
As for obfuscation, I'm sure that the fact that identifiers can be unicode characters (such as in Racket and Go) can be way more confusing:
-> (define ǝʃqɐıɹɐʌ-ɐ-sı-sıɥ⊥ 144)
-> (sqrt ǝʃqɐıɹɐʌ-ɐ-sı-sıɥ⊥)
12
To parse efficiently, a parser relies on looking ahead at the next character to narrow down what the next token can be. When identifiers such as variable names, constant names and words can start with a digit, the number of possibilities to branch on for the next token goes up dramatically. Also, depending on the parsing method, the parser may have to look ahead more characters to determine the token type, which adds complexity.
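The one-character lookahead argument can be illustrated with a toy tokenizer. This is a minimal sketch with made-up token names (NUMBER, IDENT, SYMBOL), not the lexer of any real language: it decides the token kind from the first character alone, which is exactly what stops working once identifiers may begin with a digit.

```python
def tokenize(source):
    """Split `source` into (kind, text) pairs, choosing the kind
    from the first character of each token only."""
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():
            i += 1
            continue
        start = i
        if ch.isdigit():
            # A leading digit can only mean a number under this rule.
            while i < len(source) and source[i].isdigit():
                i += 1
            kind = "NUMBER"
        elif ch.isalpha() or ch == "_":
            # A leading letter or underscore can only mean an identifier.
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            kind = "IDENT"
        else:
            i += 1
            kind = "SYMBOL"
        tokens.append((kind, source[start:i]))
    return tokens

print(tokenize("x1 = 9times"))
# [('IDENT', 'x1'), ('SYMBOL', '='), ('NUMBER', '9'), ('IDENT', 'times')]
```

With this rule "9times" is split into a number followed by an identifier. Accepting it as a single identifier would require scanning past the digits before committing to a token kind.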

How do we make integer datatype behave like String data type in XML/XSD?

I have the following XSD/XML type definition. It has been used by number of business units/applications.
<xsd:simpleType name="NAICSCodeType">
<xsd:annotation>
<xsd:documentation>NAICSCode</xsd:documentation>
</xsd:annotation>
<xsd:restriction base="xsd:integer">
<xsd:minInclusive value="000001"/>
<xsd:maxInclusive value="999000"/>
</xsd:restriction>
</xsd:simpleType>
Because this type is defined with the "integer" data type, leading zeros in the input are stripped. E.g. 0078 becomes 78 after parsing.
We need to pass the input through as is, without stripping leading zeros, e.g. 0078 stays 0078 after parsing.
The ideal fix is to change the restriction base from integer to string. That is a non-starter because it would require buy-in from other groups.
Is there a way to redefine the above data type for desired outcome?
How do I do it?
Books and the net don't seem to have helped much either, so I am starting to question whether this is theoretically possible at all.
It sounds as if the values in question are not in fact integers, but strings consisting only of numeric digits. Why does the schema say that they are integers if 78 and 078 and 0078 are three distinct values instead of three ways of naming the same value?
You can of course restrict xs:integer by requiring leading zeroes in the lexical space, or a fixed number of digits. But that is unlikely to have any effect on the way software reading the document re-serializes it or passes values to other software.
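The loss of the lexical form can be shown in a few lines. This is only an illustrative sketch with hypothetical helper names; the width of 6 is an assumption read off the six-digit bounds in the schema, not something the schema itself enforces.

```python
def roundtrip_as_integer(lexical):
    """Parse a lexical value as an integer and serialize it again,
    which is roughly what integer-typed data binding does."""
    return str(int(lexical))

def pad_code(value, width=6):
    """Re-pad an integer to a fixed width (assumed width, see above)."""
    return str(value).zfill(width)

print(roundtrip_as_integer("0078"))  # 78
print(pad_code(78))                  # 000078
print(pad_code(78, width=4))         # 0078
```

Once the value has been read as an integer, 78, 078 and 0078 are indistinguishable, so the original padding can only be restored if the consumer knows the intended width from somewhere else.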
In theory, there shouldn't be; and as far as I know, no out-of-the-box XML serializer can be configured to do what you describe. Leading zeroes and padding whitespace are remnants of the fixed-length-record era (your example would be a PIC 9(6) in a COBOL copybook).
Depending on your platform, you might be able to create custom serializers. In my shop I would argue that this is just plain wrong.
If I were forced to do it, I would simply use a "private" variation of the XSD (based on string), implement whatever formatting is needed on my side, and be done with it. Private means that you don't need to share the XSD artifact you use internally to generate your code with the other groups; this could create the "input" you refer to with minimum overhead. The "refactoring" of the schema could likewise be done with little effort.
I am suggesting it simply because having to put up with this is an indication that in your environment there are obviously bigger problems to deal with, starting with not necessarily understanding how to properly bridge XML with legacy systems (a wild guess, of course).

Equivalent of Python pickling in SWI Prolog?

I've got a Prolog program where I'm doing a brute-force search over all strings up to a certain length. I check which strings match a certain pattern and keep adding patterns until, hopefully, I find a set of patterns that covers all strings. I would like to store the strings that don't match any of my patterns in a file, so that when I add a new pattern, I only need to check the leftovers instead of doing the entire brute-force search again.
If I were writing this in Python, I would just pickle the list of strings and load it from the file. Does anybody know how to do something similar in Prolog?
I have a good amount of Prolog programming experience, but very little with Prolog IO. I could probably write a predicate to read a file and parse it into a term, but I figured there might be a way to do it more easily.
If you want to write out a term and be able to read it back later at any time, barring variable names, use the ISO built-in write_canonical/1 or write_canonical/2. It is quite well supported by current systems. writeq/1 and write/1 often work too, but not always. writeq/1 uses operator syntax (so you need to read it back with the very same operators present) and write/1 does not use quotes. So they work "most of the time", until they break.
Alternatively, you may use the ISO write-options [quoted(true), ignore_ops(true), numbervars(false)] in write_term/2 or write_term/3. This might be interesting to you if you want to use further options like variable_names/1 to retain also the names of the variables.
Also note that the term written does not include a period at the end. So you have to write a space and a period manually at the end. The space is needed to ensure that an atom consisting of graphic characters does not run into the period at the end. Think of writing the atom '---' which must be written as --- . and not as ---. You might write the space only in case of an atom. Or an atom that does not "glue" with .
writeq and read do a similar job, but read the note on writeq about operators, if you declare any.
Consider using read/1 to read a Prolog term. For more complex or different kinds of parsing, consider using DCGs and then phrase_from_file/2 with SWI's library(pio).

T-SQL language specification and lexing rules

I'm thinking about writing a templating tool for generating T-SQL code, which will include delimited sections like below;
SELECT
~~idcolumn~~
FROM
~~table~~
WHERE
~~table~~.flag = 1
Notice the double-tildes delimiting bits? This is an idea for an escape sequence in my templating language. But I want to be certain that the escape sequence is valid -- that it will never occur in a valid T-SQL statement. The problem is, I can't find any official Microsoft description of the T-SQL language.
Does anyone know of an official specification for the T-SQL language, or at least the lexing rules? So I can make an informed decision about the escape sequence.
UPDATES:
Thanks for the suggestions so far, but I'm not looking for confirmation of the '~~' escape sequence per se. What I need is a document I can point to and say 'Microsoft says this character sequence is totally impossible in T-SQL.' For instance, Microsoft publishes the language specification for C# here, which includes a description of what characters can go into valid C# programs. (See page 67 of the PDF.) I'm looking for a similar reference.
The double-tilde: "~~" is actually perfectly good T-SQL. For instance; "(SELECT ~~1)" returns '1'.
There are several well known and often used formats for template parameters, one example being $(paramname) (also used in other scripts as well as T-SQL scripts)
Why not use an existing format?
It doesn't matter if ~~ is legal TSQL or not, if you provide an escape for producing ~~ in actual TSQL when you need it.
Since template parameters have to have a nonzero-length identifier, you have a peculiar case where the identifier length is ridiculously "zero", i.e., ~~~~. This kind of thing makes an ideal escape sequence, since it is useless for anything else. Simply process your template text; whenever you find ~~name~~ replace it by the named parameter string, and whenever you find ~~~~ replace it by ~~. Now, if ~~ is needed in the final TSQL, just write ~~~~ in your template.
I suspect that even if you do this, the number of times you'll actually write ~~~~ in practice will be close to zero, so the reason for doing it is theoretical completeness and giving you a warm fuzzy feeling that you can write anything in a template.
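The escape scheme described above (a named placeholder written as ~~name~~, and the empty-name form ~~~~ standing for a literal ~~) fits in a few lines. This is a minimal sketch; the function name render and the choice of \w characters for parameter names are assumptions made for the example.

```python
import re

# ~~name~~ is a placeholder; the empty-name form ~~~~ is the escape for "~~".
PLACEHOLDER = re.compile(r"~~(\w*)~~")

def render(template, params):
    """Substitute ~~name~~ from `params` and turn ~~~~ into a literal ~~."""
    def replace(match):
        name = match.group(1)
        if name == "":
            return "~~"        # escaped double tilde
        return params[name]    # raises KeyError for an unknown parameter
    return PLACEHOLDER.sub(replace, template)

template = "SELECT ~~idcolumn~~ FROM ~~table~~ WHERE ~~table~~.flag = 1"
print(render(template, {"idcolumn": "id", "table": "orders"}))
# SELECT id FROM orders WHERE orders.flag = 1
print(render("SELECT ~~~~1", {}))
# SELECT ~~1
```

Raising on an unknown parameter name is a deliberate choice here: a misspelled placeholder fails loudly instead of leaking into the generated SQL.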
Well, I'm not sure about a complete description of the language, but it appears that ~~ could occur in an identifier provided that it is quoted (in brackets, typically).
You may have more luck with a convention saying you don't support identifiers with ~~ in them. Or, just reserve your own lexical symbols and don't worry about ~~ occurring elsewhere.
You could treat quoted literals and strings as content, regardless of whether they contain your escape sequence. It would make it more robust.
Run the text through a lexer to separate each token. If the token is a string or a quoted literal, treat it as such. But if it is a literal that begins and ends with ~~, you can safely assume it is a template placeholder.
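A rough version of that lexer idea can be built from one regular expression. This is only a sketch under simplifying assumptions: it recognizes single-quoted strings and bracket-quoted identifiers but ignores comments and double-quoted identifiers, and the name fill is made up for the example.

```python
import re

# Alternatives are tried left to right at each position, so quoted
# content is consumed whole before a placeholder inside it can match.
TOKEN = re.compile(r"""
    '(?:[^']|'')*'         # string literal; a doubled quote is an escape
  | \[(?:[^\]]|\]\])*\]    # bracket-quoted identifier; a doubled ] is an escape
  | ~~(\w+)~~              # template placeholder (group 1 is the name)
""", re.VERBOSE)

def fill(sql, params):
    """Replace ~~name~~ placeholders outside strings and quoted identifiers."""
    def replace(match):
        name = match.group(1)
        if name is None:
            return match.group(0)  # quoted content: copy through unchanged
        return params[name]
    return TOKEN.sub(replace, sql)

print(fill("SELECT ~~col~~ FROM [~~T~~] WHERE x = '~~keep~~'", {"col": "id"}))
# SELECT id FROM [~~T~~] WHERE x = '~~keep~~'
```

Only the unquoted ~~col~~ is substituted; the tildes inside the bracketed identifier and the string literal pass through untouched.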
I'm not sure you'll find something that will never occur in a valid statement. Consider:
DECLARE @TemplateBreakingString varchar(100) = '~~I hope this works~~'
or
CREATE TABLE [~~TemplateBreakingTable~~] (IDField INT Identity)
Your escape sequence can occur in string literals, but that is all. That said, Microsoft owns T-SQL, and they are free to do anything they want with it in future versions of SQL Server. Still, I think ~~ is safe enough.