How is the term "Abstract Data Types" interpreted? - oop

What's the right way to parse the phrase "Abstract Data Types"? Is it:
Abstract-Data Types
or
Abstract Data-Types?

Neither is an established term; precisely speaking, there is no need for a hyphen between any of those words.
In some circumstances there might be a reason to write conceptual data as 'abstract-data', but in computer-science terminology the unhyphenated 'abstract data' is the norm. Semantically, the grouping is abstract (data type): a data type that is specified abstractly, by its operations and behavior rather than by its representation.
(If the phrase had to contain a hyphen, 'abstract data-type' would be more appropriate than 'abstract-data type', since it is the type that is abstract, not the data.)
In conclusion, 'abstract data type', with no hyphen, is the most widely used term.

Determine if a string can be derived ambiguously in a CFG

I know that checking whether a given context-free grammar is ambiguous means checking whether there exists any string that can be derived in more than one way, and that this is undecidable.
However, I have a simpler problem. Given a specific context-free grammar and a specific string, is it possible to determine whether the string can be derived from the grammar ambiguously? Is there a general algorithm for this check?
Yes, you can use any generalized parsing algorithm, such as a GLR (Tomita) parser, an Earley parser, or even a CYK parser; all of those can produce a parse "forest" (i.e. a directed graph of all possible parses) in O(n³) time and space. Creating the parse forest is a bit trickier than the parsing itself (that is, recognition), but there are known algorithms, referenced in the Wikipedia article.
Since the generalized parsing algorithms find all possible parses, you can rest assured that if exactly one parse is found for the string, then the string is not ambiguous.
I'd stay away from CYK parsing for this task, because it requires converting the grammar to Chomsky Normal Form, which makes recovering the original parse tree(s) more complicated.
Bison will generate a GLR parser, if requested, so you could just use that tool. However, be aware that it does not optimize storage of the parse forest, since it expects to produce only a single parse; consequently, you can end up with exponentially sized data structures (which then take exponential time to construct). That's usually only a problem with pathological grammars, though. Also, you will have to declare a custom %merge function on all possibly ambiguous productions; otherwise, the Bison-generated parser will fail with an "ambiguous parse" error if more than one parse is possible.
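If you just want a quick programmatic check for one grammar and one string, a chart parser that enumerates all parse trees will do. Here is a minimal sketch in Python using NLTK (the toy grammar is my own example, not from the question); since fully enumerating the trees can blow up on highly ambiguous inputs, it stops as soon as a second parse appears:

import nltk
from itertools import islice

# A deliberately ambiguous toy grammar: 'a a a' has two derivations.
grammar = nltk.CFG.fromstring("""
    S -> S S | 'a'
""")
parser = nltk.ChartParser(grammar)

def is_ambiguous(tokens):
    # Ask for at most two parses; finding a second one proves ambiguity.
    trees = list(islice(parser.parse(tokens), 2))
    return len(trees) > 1

print(is_ambiguous(['a', 'a', 'a']))  # True: ((a a) a) and (a (a a))
print(is_ambiguous(['a']))            # False: only one derivation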

Why don't many languages accept names starting with a digit?

I keep bumping into the same curious fact while reading programming language references:
Variable or constant names cannot start with a digit.
Of course, even if names starting with a digit were allowed, using them would be bad practice.
But what are the main reasons, really?
Is it that hard to parse?
Is it forbidden so as not to obfuscate code?
This restriction exists in order to simplify the language parsers. The work needed to accept identifiers with leading digits is probably not considered worth the complexity.
Not all languages have that restriction though; consider Racket (a Lisp/Scheme dialect):
pu@pumbair: ~ racket
Welcome to Racket v5.3.6.
-> (define 9times! 9)
-> (* 9times! 2)
18
but then of course Lisp languages are particularly easy to parse.
As for obfuscation, I'm sure that the fact that identifiers can contain Unicode characters (as in Racket and Go) can be way more confusing:
-> (define ǝʃqɐıɹɐʌ-ɐ-sı-sıɥ⊥ 144)
-> (sqrt ǝʃqɐıɹɐʌ-ɐ-sı-sıɥ⊥)
12
To parse efficiently, a parser relies on looking ahead at the next character to narrow down the possibilities for the next token. If identifiers such as variable names, constant names, and keywords could start with a digit, the number of possibilities to branch on for the next token would go up dramatically. Also, depending on the parsing method, the lexer might have to look ahead more characters to determine the token type, which adds complexity to the parser.
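To see the branching concretely, here is a minimal lexer sketch in Python for a made-up toy language, illustrating why the usual rule is convenient: the first character alone decides whether a token is a number or an identifier.

import re

# One character of lookahead classifies every token: a digit can only
# begin a NUMBER, and a letter or '_' can only begin an IDENT.
TOKEN = re.compile(r"""
      (?P<NUMBER>\d+(?:\.\d+)?)
    | (?P<IDENT>[A-Za-z_]\w*)
    | (?P<OP>[-+*/=()])
    | (?P<WS>\s+)
""", re.VERBOSE)

def tokenize(source):
    for match in TOKEN.finditer(source):
        if match.lastgroup != "WS":
            yield match.lastgroup, match.group()

print(list(tokenize("x2 = 9 * 2")))
# [('IDENT', 'x2'), ('OP', '='), ('NUMBER', '9'), ('OP', '*'), ('NUMBER', '2')]

If identifiers could also start with a digit, the lexer could no longer commit after one character: it would have to scan an arbitrarily long run of digits before it could tell 9000 (a number) from 9000th (an identifier), and a form like 9e3 would be genuinely ambiguous between a float literal and a name.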

Equivalent of Python pickling in SWI Prolog?

I've got a Prolog program where I'm doing a brute-force search over all strings up to a certain length. I'm checking which strings match a certain pattern, adding patterns until, hopefully, I find a set of patterns that covers all the strings. I would like to store the strings that don't match any of my patterns in a file, so that when I add a new pattern, I only need to check the leftovers instead of redoing the entire brute-force search.
If I were writing this in Python, I would just pickle the list of strings and load it back from the file. Does anybody know how to do something similar in Prolog?
I have a good amount of Prolog programming experience, but very little with Prolog I/O. I could probably write a predicate to read a file and parse it into a term, but I figured there might be an easier way.
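(For reference, the Python version I have in mind is just the standard pickle round trip; leftovers.pkl is a made-up filename:)

import pickle

leftovers = ["aab", "abb", "bba"]  # sample data standing in for my real list

# Save the unmatched strings...
with open("leftovers.pkl", "wb") as f:
    pickle.dump(leftovers, f)

# ...and load them back on the next run:
with open("leftovers.pkl", "rb") as f:
    leftovers = pickle.load(f)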
If you want to write out a term and be able to read it back at any later time (identically, apart from variable names), use the ISO built-ins write_canonical/1 or write_canonical/2. They are quite well supported by current systems. writeq/1 and write/1 often work too, but not always: writeq/1 uses operator syntax (so you need to read the term back with the very same operators present), and write/1 does not quote atoms. They work "most of the time", until they break.
Alternatively, you may use the ISO write-options [quoted(true), ignore_ops(true), numbervars(false)] with write_term/2 or write_term/3. This might interest you if you want to use further options like variable_names/1 to retain the names of the variables as well.
Also note that the term as written does not include the terminating period, so you have to write a space and a period manually at the end. The space ensures that an atom consisting of graphic characters does not merge with the period. Think of writing the atom '---', which must be written as --- . and not as ---. (where the period would be read as part of the atom). You can limit the space to the cases where the term ends in an atom that would otherwise glue to the period.
writeq and read do a similar job, but read the note above about writeq and operators if you declare any.
Consider using read/1 to read a Prolog term. For more complex or different kinds of parsing, consider using DCGs and then phrase_from_file/2 with SWI's library(pio).

Convert upper case into sentence case

How do we convert upper-case text like this:
WITHIN THE FIELD OF LITERARY CRITICISM, "TEXT" ALSO REFERS TO THE
ORIGINAL INFORMATION CONTENT OF A PARTICULAR PIECE OF WRITING; THAT
IS, THE "TEXT" OF A WORK IS THAT PRIMAL SYMBOLIC ARRANGEMENT OF
LETTERS AS ORIGINALLY COMPOSED, APART FROM LATER ALTERATIONS,
DETERIORATION, COMMENTARY, TRANSLATIONS, PARATEXT, ETC. THEREFORE,
WHEN LITERARY CRITICISM IS CONCERNED WITH THE DETERMINATION OF A
"TEXT," IT IS CONCERNED WITH THE DISTINGUISHING OF THE ORIGINAL
INFORMATION CONTENT FROM WHATEVER HAS BEEN ADDED TO OR SUBTRACTED FROM
THAT CONTENT AS IT APPEARS IN A GIVEN TEXTUAL DOCUMENT (THAT IS, A
PHYSICAL REPRESENTATION OF TEXT).
Into usual sentence case like this:
Within the field of literary criticism, "text" also refers to the
original information content of a particular piece of writing; that
is, the "text" of a work is that primal symbolic arrangement of
letters as originally composed, apart from later alterations,
deterioration, commentary, translations, paratext, etc. Therefore,
when literary criticism is concerned with the determination of a
"text," it is concerned with the distinguishing of the original
information content from whatever has been added to or subtracted from
that content as it appears in a given textual document (that is, a
physical representation of text).
The base answer is just to use the LOWER() function.
It's easy enough to separate the sentences by CHARINDEX()ing for the period (and then using UPPER() on the first letter of each sentence...).
But even then, you'll end up leaving proper names, acronyms, etc. in lower case.
Distinguishing proper names, etc. from the rest is beyond anything that can be done in T-SQL. I've seen people attempt it in code using the dictionary from MS Word, etc., but even then, Word doesn't always get it right either.
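Outside the database, though, the same naive lowercase-then-capitalize approach is only a few lines in a general-purpose language. Here is a sketch in Python, with the same caveat that proper nouns, acronyms, and abbreviations like "etc." are not handled:

import re

def sentence_case(text):
    # Lowercase everything, then capitalize the first letter at the start
    # of the string and after each sentence-ending '.', '!' or '?'.
    text = text.lower()
    return re.sub(r'(^|[.!?]\s+)([a-z])',
                  lambda m: m.group(1) + m.group(2).upper(),
                  text)

print(sentence_case('THE "TEXT" OF A WORK IS PRIMAL. THEREFORE, IT MATTERS.'))
# The "text" of a work is primal. Therefore, it matters.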
I found a simple solution was to use INITCAP() (not available in T-SQL; note also that it capitalizes the first letter of every word, which gives title case rather than sentence case).

Indexing multilingual words in Lucene

I am trying to index in Lucene a field that could contain RDF literals in different languages.
Most of the approaches I have seen so far are:
Use a single index, where each document has a field for each language it uses, or
Use M indexes, M being the number of languages in the corpus.
Lucene 2.9+ has a feature called payloads that allows attaching attributes to terms. Has anyone used this mechanism to store language information (or other attributes such as datatypes)? How does the performance compare to the other two approaches? Any pointer to source code showing how it is done would help. Thanks.
It depends.
Do you want to allow something like "search all English text for 'foo'"? If so, then you will need one field per language.
Or do you want "Search all text for 'foo' and present the user with which language the match was found in?" If this is what you want, then either payloads or separate fields will work.
An alternative way to do it is to index all your text in one field, then have another field saying the language of the document. (Assuming each document is in a single language.) Then your search would be something like +text:foo +language:english.
In terms of efficiency: you probably want to avoid payloads, since you would have to repeat the name of the language for every term, and you can't search based on payloads (at least not easily).
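To make the single-field-plus-language option concrete, here is a toy sketch in plain Python (these are made-up documents, not Lucene API calls) of what that schema buys you:

docs = [
    {"text": "the quick brown fox", "language": "english"},
    {"text": "le renard brun rapide", "language": "french"},
]

def search(term, language=None):
    # The moral equivalent of the Lucene query  +text:term +language:lang
    return [d for d in docs
            if term in d["text"].split()
            and (language is None or d["language"] == language)]

print(search("brun", language="french"))  # restrict matches to one language
print(search("fox"))                      # search across all languages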
Basically, Lucene is a ranking engine: it just looks at strings and compares them to other strings. They can be encoded in different character encodings, but their similarity is the same nonetheless. Just make sure you load the SnowballAnalyzer with the stemmer for the language in question, say Spanish or Chinese, and you should get results.