I keep bumping into a curious rule while reading programming language references:
A variable or constant name cannot start with a digit.
Of course, even if names starting with a digit were allowed, using them would be bad practice.
But what are the real reasons?
Would it really be that hard to parse?
Is it forbidden in order to avoid obfuscated code?
This restriction exists in order to simplify the language parsers. The work needed to accept identifiers with leading digits is probably not considered worth the complexity.
Not all languages have that restriction though; consider Racket (a Lisp/Scheme dialect):
pu#pumbair: ~ racket
Welcome to Racket v5.3.6.
-> (define 9times! 9)
-> (* 9times! 2)
18
but then of course Lisp languages are particularly easy to parse.
As for obfuscation, I'm sure that the fact that identifiers can contain Unicode characters (as in Racket and Go) can be far more confusing:
-> (define ǝʃqɐıɹɐʌ-ɐ-sı-sıɥ⊥ 144)
-> (sqrt ǝʃqɐıɹɐʌ-ɐ-sı-sıɥ⊥)
12
To make lexing efficient, the lexer relies on looking at the next character to determine the possibilities for the next token. When identifiers such as variable names, constant names, and keywords can start with a digit, the number of possibilities to branch on for the next token goes up dramatically. Depending on the parsing method, the lexer may also have to look ahead more characters to determine the token type, which makes it more complex.
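Here is a minimal sketch of that branching in Python (a toy lexer, purely illustrative): with the usual rule, the first character alone decides the token type, and no further lookahead is needed to pick a branch.
def next_token(src, i):
    # Dispatch on the first character only: a digit can only start a
    # number, a letter or underscore can only start an identifier.
    c = src[i]
    if c.isdigit():
        j = i
        while j < len(src) and src[j].isdigit():
            j += 1
        return ('NUMBER', src[i:j]), j
    if c.isalpha() or c == '_':
        j = i
        while j < len(src) and (src[j].isalnum() or src[j] == '_'):
            j += 1
        return ('IDENT', src[i:j]), j
    return ('SYMBOL', c), i + 1

print(next_token('times9', 0))  # (('IDENT', 'times9'), 6)
print(next_token('9times', 0))  # (('NUMBER', '9'), 1)
Note what happens to '9times': under the one-character dispatch it lexes as the number 9 followed by the identifier 'times', which is exactly why languages that keep this simple dispatch disallow leading digits in names.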
What's the right way to parse the phrase "Abstract Data Types"? Is it:
Abstract-Data Types
Or,
Abstract Data-Types
Neither is a well-established term; there is no need for a hyphen between those words.
In some contexts there might be a reason to hyphenate 'abstract-data' to denote conceptual data, but in computer-science terminology the unhyphenated 'abstract data' is the norm.
(If the phrase had to carry a hyphen, 'abstract data-type' would be more appropriate than 'abstract-data type', since the phrase means a data type that is abstract.)
In conclusion, 'abstract data type', with no hyphen, is the standard term.
I know that, to check whether a given context-free grammar is ambiguous, one has to check whether there exists any string that can be derived in more than one way, and that this is undecidable.
However, I have a simpler problem: given a specific context-free grammar and a specific string, is it possible to determine whether the string can be derived from the grammar ambiguously? Is there a general algorithm for this check?
Yes, you can use any generalized parsing algorithm, such as a GLR (Tomita) parser, an Earley parser, or even a CYK parser; all of those can produce a parse "forest" (i.e. a digraph of all possible parses) in O(n³) time and space. Creating the parse forest is a bit trickier than the "parsing" (that is, recognition), but there are known algorithms, which are referenced in the Wikipedia article.
Since the generalized parsing algorithms find all possible parses, you can rest assured that if exactly one parse is found for the string, then the string is not ambiguous.
I'd stay away from CYK parsing for this algorithm because it requires converting the grammar to Chomsky Normal Form, which makes recovering the original parse tree(s) more complicated.
Bison will generate a GLR parser, if requested, so you could just use that tool. However, be aware that it does not optimize storage of the parse forest, since it expects to produce only a single parse; you can therefore end up with exponentially sized data structures (which then take exponential time to construct). That's usually only a problem with pathological grammars, though. Also, you will have to declare a custom %merge function on all potentially ambiguous productions; otherwise, the Bison-generated parser will fail with an "ambiguous parse" error if more than one parse is possible.
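If your grammar is already in Chomsky Normal Form, or you only need a yes/no ambiguity answer for one string rather than the original parse trees (see the CYK caveat above), counting parses with a CYK-style dynamic program is easy to sketch in Python. The grammar encoding below is mine, purely for illustration:
from collections import defaultdict

def count_parses(rules, lexical, start, tokens):
    # rules: {A: [(B, C), ...]} binary CNF rules
    # lexical: {A: set of terminals}
    # Returns the number of distinct parse trees of `tokens` rooted
    # at `start`; a result greater than 1 means the string is ambiguous.
    n = len(tokens)
    counts = defaultdict(int)  # (i, j, A) -> number of trees for tokens[i:j]
    for i, tok in enumerate(tokens):
        for a, terminals in lexical.items():
            if tok in terminals:
                counts[(i, i + 1, a)] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for a, rhss in rules.items():
                for b, c in rhss:
                    for k in range(i + 1, j):
                        counts[(i, j, a)] += counts[(i, k, b)] * counts[(k, j, c)]
    return counts[(0, n, start)]

# Classic ambiguous CNF grammar: S -> S S | 'a'
print(count_parses({'S': [('S', 'S')]}, {'S': {'a'}}, 'S', list('aaa')))  # 2
Counting is O(n³), like recognition. For an arbitrary CFG you would convert to CNF first, with the caveat that the standard conversion can perturb how derivations are counted when ε- and unit-productions are involved.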
I've got a Prolog program where I'm doing a brute-force search on all strings up to a certain length. I'm checking which strings match a certain pattern, adding patterns until, hopefully, I find a set of patterns that covers all strings. I would like to store the strings that don't match any of my patterns to a file, so that when I add a new pattern I only need to check the leftovers, instead of doing the entire brute-force search again.
If I were writing this in python, I would just pickle the list of strings, and load it from the file. Does anybody know how to do something similar in Prolog?
I have a good amount of Prolog programming experience, but very little with Prolog IO. I could probably write a predicate to read a file and parse it into a term, but I figured there might be a way to do it more easily.
If you want to write out a term and be able to read it back later at any time (barring variable names), use the ISO built-in write_canonical/1 or write_canonical/2. It is quite well supported by current systems. writeq/1 and write/1 often work too, but not always: writeq/1 uses operator syntax (so you need to read the term back with the very same operators present), and write/1 does not use quotes. So they work "most of the time", until they break.
Alternatively, you may use the ISO write options [quoted(true), ignore_ops(true), numbervars(false)] with write_term/2 or write_term/3. This might be interesting to you if you want further options like variable_names/1 to also retain the names of the variables.
Also note that the written term does not include the terminating period, so you have to write a space and a period manually at the end. The space is needed so that an atom consisting of graphic characters does not fuse with the period: the atom '---' must be written as --- . and not as ---. You might write the space only for atoms, or only for atoms that would otherwise "glue" to the period.
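Putting the pieces together, here is a minimal sketch in Prolog (the predicate names save_terms/2 and load_terms/2 are mine, not library predicates; it assumes the ISO stream predicates plus setup_call_cleanup/3, as found in SWI-Prolog):
% Save a list of terms, one per line, each followed by " ." so that
% read/2 can get them back (variable names are not preserved).
save_terms(File, Terms) :-
    setup_call_cleanup(open(File, write, Out),
                       forall(member(T, Terms),
                              ( write_canonical(Out, T),
                                write(Out, ' .'),
                                nl(Out) )),
                       close(Out)).

load_terms(File, Terms) :-
    setup_call_cleanup(open(File, read, In),
                       read_terms(In, Terms),
                       close(In)).

read_terms(In, Terms) :-
    read(In, T),
    (   T == end_of_file
    ->  Terms = []
    ;   Terms = [T|Rest],
        read_terms(In, Rest)
    ).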
writeq and read do a similar job, but see the note above about writeq and operators, if you declare any.
Consider using read/1 to read a Prolog term. For more complex or different kinds of parsing, consider using DCGs and then phrase_from_file/2 with SWI's library(pio).
I'm thinking about writing a templating tool for generating T-SQL code, which will include delimited sections like the ones below:
SELECT
~~idcolumn~~
FROM
~~table~~
WHERE
~~table~~.flag = 1
Notice the double tildes delimiting bits? That is an idea for an escape sequence in my templating language. But I want to be certain that the escape sequence is safe -- that it will never occur in a valid T-SQL statement. Problem is, I can't find any official Microsoft description of the T-SQL language.
Does anyone know of an official specification for the T-SQL language, or at least its lexing rules, so I can make an informed decision about the escape sequence?
UPDATES:
Thanks for the suggestions so far, but I'm not looking for confirmation of the '~~' escape sequence per se. What I need is a document I can point to and say 'Microsoft says this character sequence is totally impossible in T-SQL.' For instance, Microsoft publishes the language specification for C# here, which includes a description of what characters can go into valid C# programs (see page 67 of the PDF). I'm looking for a similar reference.
The double tilde "~~" is actually perfectly good T-SQL. For instance, (SELECT ~~1) returns 1: ~ is the bitwise NOT operator, and applying it twice gives back the original value.
There are several well known and often used formats for template parameters, one example being $(paramname) (also used in other scripts as well as T-SQL scripts)
Why not use an existing format?
It doesn't matter if ~~ is legal TSQL or not, if you provide an escape for producing ~~ in actual TSQL when you need it.
Since template parameters have to have a nonzero-length name, the four-character sequence ~~~~ (a parameter whose name has length zero) can never be a real parameter reference. That makes it an ideal escape sequence, since it is useless for anything else. Simply process your template text: whenever you find ~~name~~, replace it by the named parameter's value, and whenever you find ~~~~, replace it by ~~. Now, if ~~ is needed in the final T-SQL, just write ~~~~ in your template.
I suspect that even if you do this, the number of times you'll actually write ~~~~ in practice will be close to zero, so the reason for doing it is theoretical completeness and the warm fuzzy feeling that you can write anything in a template.
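A minimal sketch of that substitution in Python (the ~~name~~ syntax is the one proposed above; the function and parameter names are illustrative):
import re

# ~~name~~ is a parameter reference; ~~~~ (empty name) escapes a literal ~~.
TOKEN = re.compile(r'~~(\w*)~~')

def expand(template, params):
    def replace(match):
        name = match.group(1)
        return '~~' if name == '' else params[name]
    return TOKEN.sub(replace, template)

print(expand("SELECT ~~idcolumn~~ FROM ~~table~~ WHERE note LIKE '%~~~~%'",
             {'idcolumn': 'id', 'table': 'users'}))
# SELECT id FROM users WHERE note LIKE '%~~%'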
Well, I'm not sure about a complete description of the language, but it appears that ~~ could occur in an identifier provided that it is quoted (in brackets, typically).
You may have more luck with a convention saying you don't support identifiers with ~~ in them. Or, just reserve your own lexical symbols and don't worry about ~~ occurring elsewhere.
You could treat quoted literals and strings as content, regardless of whether they contain your escape sequence. That would make the tool more robust.
Run the text through a lexer to separate the tokens. If a token is a string or a quoted literal, treat it as content. But if it is a token that begins and ends with ~~, you can safely assume it is a template placeholder.
I'm not sure you'll find something that will never occur in a valid statement. Consider:
DECLARE @TemplateBreakingString varchar(100) = '~~I hope this works~~'
or
CREATE TABLE [~~TemplateBreakingTable~~] (IDField INT Identity)
Your escape sequence can occur in string literals, but that is all. That said, Microsoft owns T-SQL, and they are free to do anything they want with it in future versions of SQL Server. Still, I think ~~ is safe enough.
I have to create a SQL function that converts special characters and international characters (French, Chinese, ...) to English.
Is there any built-in SQL function I can use?
Thanks for your help.
If you are after English names for the characters, that is an achievable goal, as they all have published names as part of the Unicode standard.
See for example:
http://www.unicode.org/ucd/
http://www.unicode.org/Public/UNIDATA/
Your task is then simply to turn the list of Unicode characters into a table with 100,000 or so rows. Unfortunately, the names you get will be things like ARABIC LIGATURE LAM WITH MEEM MEDIAL FORM.
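For example, Python's unicodedata module wraps those same UNIDATA files, so a short script can generate the rows to bulk-load into such a table (a sketch; the output format is arbitrary):
import unicodedata

# Emit (codepoint, character, official Unicode name) rows; code points
# without a name (unassigned, controls, ...) are skipped.
def name_rows():
    for cp in range(0x110000):
        try:
            yield cp, chr(cp), unicodedata.name(chr(cp))
        except ValueError:
            continue

for cp, ch, name in name_rows():
    if cp > 0xE2:
        break
    if cp >= 0xE0:
        print(f"U+{cp:04X}\t{ch}\t{name}")
# U+00E0  à  LATIN SMALL LETTER A WITH GRAVE
# U+00E1  á  LATIN SMALL LETTER A WITH ACUTE
# U+00E2  â  LATIN SMALL LETTER A WITH CIRCUMFLEX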
On the other hand, if you want to actually translate the meaning, you need to look at machine translation software. Both Microsoft and Google have well-known cloud translation offerings, and there are several other well-regarded products too.
I think the short answer is you can't unless you narrow your requirements a lot. It seems you want to take a text sample, A, and convert it into romanized text B.
There are a few problems to tackle:
Languages are typically not romanized on a single character basis. The correct pronunciation of a character is often dependent on the characters and words around it, and can even have special rules for just one word (learning English can be tough because it is filled with these, having borrowed words from many languages without normalizing the spelling).
Even if you code rules for every language you want to support, you still have homographs: words that are spelled with exactly the same characters but that have different pronunciations (and thus romanizations) depending on what is meant - for example "sow" meaning a female pig (rhyming with "cow"), or "sow" meaning to plant seeds (rhyming with "go").
And then there is the problem of which language you are romanizing: characters and even whole words are not unique to one language, but their meaning and romanization can vary. The fact that many languages include loan words from the languages they share characters with complicates any attempt to determine automatically which language you are trying to romanize.
Given all these difficulties, what it is you actually want to achieve (what problem are you solving)?
You mention French among the languages you want to "convert" into English - yet French (with its accented characters) is already written in the Latin alphabet. Even everyday words used in English occasionally make use of accented characters, though these are rare enough that the meaning and pronunciation are understood even if the accents are omitted (e.g., résumé).
Is your problem really that you can't store unicode/extended ASCII? There are numerous ways to correct or work around that.
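If the underlying goal is merely to strip accents from Latin-script text, rather than true romanization or translation, a common trick is Unicode decomposition. A minimal Python sketch, assuming that narrower requirement:
import unicodedata

def strip_accents(text):
    # Decompose each character (e.g. é -> e + combining acute accent),
    # then drop the combining marks. This only helps for Latin-script
    # accents; it does nothing useful for Chinese, Arabic, etc.
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('résumé'))  # resume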