Convert upper case into sentence case - SQL

How do we convert upper-case text like this:
WITHIN THE FIELD OF LITERARY CRITICISM, "TEXT" ALSO REFERS TO THE
ORIGINAL INFORMATION CONTENT OF A PARTICULAR PIECE OF WRITING; THAT
IS, THE "TEXT" OF A WORK IS THAT PRIMAL SYMBOLIC ARRANGEMENT OF
LETTERS AS ORIGINALLY COMPOSED, APART FROM LATER ALTERATIONS,
DETERIORATION, COMMENTARY, TRANSLATIONS, PARATEXT, ETC. THEREFORE,
WHEN LITERARY CRITICISM IS CONCERNED WITH THE DETERMINATION OF A
"TEXT," IT IS CONCERNED WITH THE DISTINGUISHING OF THE ORIGINAL
INFORMATION CONTENT FROM WHATEVER HAS BEEN ADDED TO OR SUBTRACTED FROM
THAT CONTENT AS IT APPEARS IN A GIVEN TEXTUAL DOCUMENT (THAT IS, A
PHYSICAL REPRESENTATION OF TEXT).
Into the usual sentence case, like this:
Within the field of literary criticism, "text" also refers to the
original information content of a particular piece of writing; that
is, the "text" of a work is that primal symbolic arrangement of
letters as originally composed, apart from later alterations,
deterioration, commentary, translations, paratext, etc. Therefore,
when literary criticism is concerned with the determination of a
"text," it is concerned with the distinguishing of the original
information content from whatever has been added to or subtracted from
that content as it appears in a given textual document (that is, a
physical representation of text).

The base answer is just to use the LOWER() function.
It's easy enough to separate the sentences by CHARINDEX()ing for the period (and then using UPPER() on the first letter of each sentence...).
But even then, you'll end up leaving proper names, acronyms, etc. in lower case.
Distinguishing proper names, etc. from the rest is beyond anything that can be done in T-SQL. I've seen people attempt it in code using the dictionary from MS Word, etc., but even then, Word doesn't always get it right either.
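For illustration, here is a minimal sketch of that LOWER()/UPPER()-on-sentence-boundaries logic, written in Python rather than T-SQL for brevity. The boundary rule (a sentence ends at ".", "!" or "?" followed by whitespace) is an assumption, and as noted above it will happily lower-case proper names and acronyms:

    import re

    def sentence_case(text):
        # Lower-case everything, then capitalize the first letter of the
        # string and of every run following ".", "!" or "?" plus whitespace.
        # Abbreviations like "etc." will trigger false capitalizations.
        lowered = text.lower()
        return re.sub(r'(^|[.!?]\s+)([a-z])',
                      lambda m: m.group(1) + m.group(2).upper(),
                      lowered)

    print(sentence_case('WITHIN THE FIELD OF LITERARY CRITICISM. THE "TEXT" OF A WORK.'))
    # -> 'Within the field of literary criticism. The "text" of a work.'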

I found a simple solution was to use INITCAP(). Note that it is available in Oracle and PostgreSQL rather than SQL Server, and that it produces title case (every word capitalized) rather than sentence case.

Related

Distinguishing words in a sentence

I'm looking for a way to distinguish compound words in a sentence.
Although this is pretty easy in English because there are hyphens between the parts of a compound word (e.g. daughter-in-law), it's not the same in other languages like Persian. In order to detect the words in a sentence, we look for spaces between words. Imagine there isn't a hyphen to connect these words; instead there is a space between them. Fortunately, we already have separate records for "daughter" and "daughter in law" in the database. Now I'm looking for an algorithm or SQL query which would first look at bigger chunks of words like "daughter in law" and check whether they exist. If nothing is found, it should then look for each word on its own.
Another example would be with digits. Imagine we have a string like "1 2 3 4 5 6". Each digit has a record in the database which corresponds to a value. However, there are extra records for combinations such as "2 3". I want to first get the records for bigger chunks and if there is no record, then check each single digit. Once again, please note that the algorithm must automatically distinguish compounds from singulars.
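The "bigger chunks first" lookup described above amounts to a greedy longest-match scan. A minimal sketch in Python, assuming the dictionary fits in memory as a set pulled from the database; the record set here is made up for illustration:

    # Hypothetical dictionary of known records.
    RECORDS = {"daughter in law", "daughter", "in", "law", "2 3", "1", "2", "3"}

    def segment(sentence, records=RECORDS):
        # Greedy longest-match: at each position, take the longest
        # space-delimited chunk found in `records`, else fall back
        # to the single word.
        words = sentence.split()
        result, i = [], 0
        while i < len(words):
            for j in range(len(words), i, -1):   # longest chunk first
                chunk = " ".join(words[i:j])
                if chunk in records:
                    result.append(chunk)
                    i = j
                    break
            else:
                # Not even the single word is known; keep it as-is.
                result.append(words[i])
                i += 1
        return result

    print(segment("daughter in law"))   # ['daughter in law']
    print(segment("1 2 3"))             # ['1', '2 3']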
You can build a Directed Acyclic Word Graph (DAWG) from your dictionary. Basically, it's a trie that you can search very quickly. Once built, you can search for words or compound words pretty easily.
To search, you take the first letter of the word and, starting at the root node of the tree, see if there's a transition to that letter. As you match each letter, you get the next letter and see if there's a transition from the current node of the tree for that letter. If you reach the end of the string, then you know that you've found a word.
If you get to a point where there is not a transition from the current node, then:
if the current node is not marked as the end of a word, then the word you're working with is not a word in the dictionary or a compound word.
if the current node is marked as the end of a word, then you have a potential compound word. You take the next letter and start at the root of the tree.
Note that you probably don't want to implement a DAWG as records in a database.
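Here is a minimal in-memory sketch of that traversal in Python. It uses a plain trie rather than a true DAWG (a DAWG additionally merges shared suffixes to save space, but the search logic is the same), and it is greedy: it restarts at the root as soon as a word boundary is forced, without backtracking:

    class TrieNode:
        def __init__(self):
            self.children = {}       # letter -> TrieNode
            self.is_word = False     # node marked as the end of a word

    def build_trie(words):
        root = TrieNode()
        for word in words:
            node = root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True
        return root

    def is_word_or_compound(s, root):
        # Greedy: no backtracking, so a split that needed a shorter
        # first word can be missed.
        node = root
        for ch in s:
            if ch in node.children:                 # normal transition
                node = node.children[ch]
            elif node.is_word and ch in root.children:
                node = root.children[ch]            # word ended; restart at root
            else:
                return False                        # dead end, not at a word boundary
        return node.is_word

    root = build_trie(["book", "case", "bookcase", "worm"])
    print(is_word_or_compound("bookworm", root))    # True  (book + worm)
    print(is_word_or_compound("bookcase", root))    # True  (a word in its own right)
    print(is_word_or_compound("bookcases", root))   # False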
For English, this problem is often solved using full-text search binary trees (Huffman encoding trees), which take advantage of frequency analysis to put the most frequently used words/letters near the top of the tree.
But for Persian, implementing such an algorithm is much more difficult, because the Persian script is cursive: letters join together rather than standing discretely as they do in English. So, to answer your question about the algorithm: you have to build a Huffman encoding tree based on frequency to be able to search against words.

What are the characters that count as the same character under collation of UTF-8 Unicode? And what VB.NET function can be used to merge them?

Also, what's the VB.NET function that will map all those different characters into their most standard form?
For example, ToLower would map A and a to the same character, right?
I need the same kind of function for these characters:
German:
ß === s
Ü === u
Greek:
Χιοσ == Χίος
Otherwise, sometimes I insert Χιοσ, and later, when I insert Χίος, MySQL complains that the ID already exists.
So I want to create a unique ID that maps all those variant characters onto a more stable form.
For the encoding aspect of the thing, look at String.Normalize. Notice also its overload that specifies a particular normal form to which you want to convert the string, but the default normal form (C) will work just fine for nearly everyone who wants to "map all those different characters into their most standard form".
However, things get more complicated once you move into the database and deal with collations.
Unicode normalization never changes character case. It covers only cases where the characters are basically equivalent: they look the same[1] and mean the same thing. For example, Χιοσ != Χίος: the two sigma characters are considered non-equivalent, while the accented iota (\u1F30) is equivalent to the sequence of two characters, the plain iota (\u03B9) followed by the combining accent (\u0313).
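To make that concrete, the same checks can be run with Python's standard unicodedata module (String.Normalize in .NET behaves equivalently for these cases):

    import unicodedata

    s = "\u1f30"                                    # accented iota
    nfd = unicodedata.normalize("NFD", s)
    print([hex(ord(c)) for c in nfd])               # ['0x3b9', '0x313']
    print(unicodedata.normalize("NFC", nfd) == s)   # True: recomposes

    # Normalization never merges the two sigmas; that is the collation's job:
    a = unicodedata.normalize("NFC", "Χιοσ")
    b = unicodedata.normalize("NFC", "Χίος")
    print(a == b)                                   # False

    # Case folding is what handles ß, if that is what you are after:
    print("ß".casefold())                           # 'ss'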
Your real problem seems to be that you are using Unicode strings as primary keys, which is not the most popular database design practice. Such primary keys take up more space than needed and are bound to change over time (even if the initial version of the application does not plan to support that). Oh, and I forgot their sensitivity to collations. Instead of identifying records by Unicode strings, have the database schema generate meaningless sequential integers for you as you insert the records, and demote the Unicode strings to mere attributes of the records. This way they can be the same or different as you please.
It may still be useful to normalize them before storing for the purpose of searching and safer subsequent processing; but the particular case insensitive collation that you use will no longer restrict you in any way.
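A sketch of that design with an in-memory SQLite table (table and column names invented for the example): the key is a meaningless integer, so the two Greek spellings coexist without any collation-driven key clash:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # Meaningless sequential integer as the primary key; the Unicode
    # string is demoted to an ordinary attribute.
    conn.execute("CREATE TABLE places (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute("INSERT INTO places (name) VALUES (?)", ("Χιοσ",))
    conn.execute("INSERT INTO places (name) VALUES (?)", ("Χίος",))  # no clash now
    print(conn.execute("SELECT id, name FROM places").fetchall())
    # [(1, 'Χιοσ'), (2, 'Χίος')]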
[1] Almost the same, in the case of compatibility normalization as opposed to canonical normalization.

Unicode question under iOS

I have a SQLite database containing a word list that includes the word "você". This word has the Unicode representation "voc\U00ea".
I've found out that the same word can have the following representation with the same visual output:
"voc\U00ea",
"voce\U0302"
When I query my DB using the second representation, it returns nothing. Does anyone know a way for the query to work with both representations without duplicating the records in the table?
Thanks,
Miguel
These two forms are known as NFC (normal form composed) and NFD (normal form decomposed). The character \U0302 is a "combining circumflex accent", which modifies the preceding letter.
To cope with this situation, do the following:
Pick a normalization. Usually choosing NFC is a good idea. (Although the iOS/OS X file system uses NFD.)
Before putting a string into the database, always normalize. In iOS, you can use precomposedStringWithCanonicalMapping or precomposedStringWithCompatibilityMapping. To understand the difference between canonical and compatibility mappings, see this description.
Before performing a query, always normalize the query to the same normal form.
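A minimal sketch of points 2 and 3 using Python's sqlite3 and unicodedata (the table is invented for the example; on iOS the same idea applies via precomposedStringWithCanonicalMapping):

    import sqlite3
    import unicodedata

    def nfc(s):
        return unicodedata.normalize("NFC", s)   # pick NFC and stick to it

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT)")

    # Always normalize before inserting...
    conn.execute("INSERT INTO words VALUES (?)", (nfc("voce\u0302"),))  # decomposed input

    # ...and always normalize the query term the same way.
    row = conn.execute("SELECT word FROM words WHERE word = ?",
                       (nfc("voc\u00ea"),)).fetchone()                  # composed input
    print(row)   # ('você',) -- both forms now match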

How to convert foreign characters to English characters in SQL Query?

I have to create a SQL function that converts special characters and international characters (French, Chinese, ...) to English.
Is there any built-in SQL function I can use?
Thanks for your help.
If you are after English names for the characters, that is an achievable goal, as they all have published names as part of the Unicode standard.
See for example:
http://www.unicode.org/ucd/
http://www.unicode.org/Public/UNIDATA/
Your task then is simply to turn the list of Unicode characters into a table with 100,000 or so rows. Unfortunately, the names you get will be things like ARABIC LIGATURE LAM WITH MEEM MEDIAL FORM.
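Those names are exposed directly by, for example, Python's standard unicodedata module, so building that table is mechanical (shown here as a sketch rather than the full dump):

    import unicodedata

    for ch in "é中ل":
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    # U+00E9  LATIN SMALL LETTER E WITH ACUTE
    # U+4E2D  CJK UNIFIED IDEOGRAPH-4E2D
    # U+0644  ARABIC LETTER LAM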
On the other hand, if you want to actually translate the meaning, you need to be looking at machine translation software. Both Microsoft and Google have well-known cloud translation offerings and there are several other well-thought of products too.
I think the short answer is you can't unless you narrow your requirements a lot. It seems you want to take a text sample, A, and convert it into romanized text B.
There are a few problems to tackle:
Languages are typically not romanized on a single character basis. The correct pronunciation of a character is often dependent on the characters and words around it, and can even have special rules for just one word (learning English can be tough because it is filled with these, having borrowed words from many languages without normalizing the spelling).
Even if you code rules for every language you want to support you still have homographs, words that are spelled using exactly the same characters, but that have different pronunciations (and thus romanization) depending on what was meant - for example "sow" meaning a pig, or "sow" (where the w is silent) meaning to plant seeds.
And then you get into the problem of what language you are romanizing: characters and even words are not unique to one language, but the actual meaning and romanization can vary. The fact that many languages include loan words from the languages they share characters with complicates any attempt to automatically determine which language you are trying to romanize.
Given all these difficulties, what it is you actually want to achieve (what problem are you solving)?
You mention French among the languages you want to "convert" into English - yet French (with its accented characters) is already written in the roman alphabet. Even everyday words used in English occasionally make use of accented characters, though these are rare enough that the meaning and pronunciation are understood even if they are omitted (e.g. résumé).
Is your problem really that you can't store Unicode/extended ASCII? There are numerous ways to correct or work around that.
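If so, one common workaround for Latin-script text is to strip the combining marks after compatibility decomposition. A minimal Python sketch (this is diacritic removal, not real transliteration, and it leaves e.g. Chinese untouched):

    import unicodedata

    def strip_accents(s):
        # Decompose (NFKD), drop the combining marks, keep the base letters.
        # Note: ß survives unchanged, since it has no decomposition.
        decomposed = unicodedata.normalize("NFKD", s)
        return "".join(c for c in decomposed if not unicodedata.combining(c))

    print(strip_accents("résumé"))   # 'resume'
    print(strip_accents("Ü"))        # 'U'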

Indexing multilingual words in lucene

I am trying to index in Lucene a field that could hold RDF literals in different languages.
Most of the approaches I have seen so far are:
Use a single index, where each document has a field for each language it uses, or
Use M indexes, M being the number of languages in the corpus.
Lucene 2.9+ has a feature called Payload that allows you to attach attributes to terms. Has anyone used this mechanism to store language information (or other attributes, such as datatypes)? How does performance compare to the other two approaches? Any pointers to source code showing how it is done would help. Thanks.
It depends.
Do you want to allow something like: "Search all english text for 'foo'"? If so, then you will need one field per language.
Or do you want "Search all text for 'foo' and present the user with which language the match was found in?" If this is what you want, then either payloads or separate fields will work.
An alternative way to do it is to index all your text in one field, then have another field saying the language of the document. (Assuming each document is in a single language.) Then your search would be something like +text:foo +language:english.
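As a toy illustration of that last design outside Lucene (documents and field names invented for the example), the query is just two ANDed constraints:

    # Each document: one text field plus a language attribute.
    docs = [
        {"text": "foo bar baz", "language": "english"},
        {"text": "foo le chat", "language": "french"},
    ]

    def search(term, language=None):
        # Equivalent in spirit to the Lucene query  +text:term +language:lang
        return [d for d in docs
                if term in d["text"].split()
                and (language is None or d["language"] == language)]

    print(search("foo", "english"))   # only the English document
    print(search("foo"))              # both documents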
In terms of efficiency: you probably want to avoid payloads, since you would have to repeat the name of the language for every term, and you can't search based on payloads (at least not easily).
Basically, Lucene is a ranking engine: it just looks at strings and compares them to other strings. They can be encoded in different character encodings, but their similarity is the same nonetheless. Just make sure you load the SnowballAnalyzer with the stemmer for the language in question, say Spanish or Chinese, and you should get results.