How to convert foreign characters to English characters in SQL Query? - sql

I have to create sql function that converts special Characters, International Characters(French, Chinese...) to english.
Is there any special function in sql, can i get??
Thanks for your help.

If you are after English names for the characters, that is an achievable goal, as they all have published names as part of the Unicode standard.
See for example:
http://www.unicode.org/ucd/
http://www.unicode.org/Public/UNIDATA/
Your task then is to simply turn the list of unicode characters into a table with 100,000 or so rows. Unfortunately the names you get will be things like ARABIC LIGATURE LAM WITH MEEM MEDIAL FORM.
On the other hand, if you want to actually translate the meaning, you need to be looking at machine translation software. Both Microsoft and Google have well-known cloud translation offerings and there are several other well-thought of products too.

I think the short answer is you can't unless you narrow your requirements a lot. It seems you want to take a text sample, A, and convert it into romanized text B.
There are a few problems to tackle:
Languages are typically not romanized on a single character basis. The correct pronunciation of a character is often dependent on the characters and words around it, and can even have special rules for just one word (learning English can be tough because it is filled with these, having borrowed words from many languages without normalizing the spelling).
Even if you code rules for every language you want to support you still have homographs, words that are spelled using exactly the same characters, but that have different pronunciations (and thus romanization) depending on what was meant - for example "sow" meaning a pig, or "sow" (where the w is silent) meaning to plant seeds.
And then you get into the problem of what language you are romanizing: Characters and even words are not unique to one language, but the actual meaning and romanization can vary. The fact that many languages include loan words from those language they share characters with complicates any attempt to automatically determine which language you are trying to romanize.
Given all these difficulties, what it is you actually want to achieve (what problem are you solving)?
You mention French among the languages you want to "convert" into English - yet French (with its accented characters) is already written in the roman alphabet. Even everyday words used in English occasionally make use of accented characters, though these are rare enough that the meaning and pronunciation is understood even if they are omitted (ex. résumé).
Is your problem really that you can't store unicode/extended ASCII? There are numerous ways to correct or work around that.

Related

How to find Bad characters in the column

I am trying to pull 'COURSE_TITLE' column value from 'PS_TRAINING' table in PeopleSoft and writing into UTF-8 text file to get loaded into Workday system. The file is erroring out while loading because of bad characters(Ã â and many more) present in the column. I have used a procedure which will convert non-ascii value into space. But because of this procedure, the 'Course_Title' which are written in non-english language like Chinese, Korean, Spanish also replacing with spaces.
I even tried using regular expressions (``regexp_like(course_title, 'Ã) only to find bad characters but since the table has hundreds of thousands of rows, it would be difficult to find all bad characters. Please suggest a way to solve this.
If you change your approach, this may work.
Define what you want, and retrieve it.
select *
from PS_TRAINING
where not regexp_like(course_title, '[0-9A-Za-z]')```
If you take out too much data, just add it to the regex

Is there a database that accepts special characters by default (without converting them)?

I am currently starting from scratch choosing a database to store data collected from a suite of web forms. Humans will be filling out these forms, and as they're susceptible to using international characters, especially those humans named José and François and أسامة and 布鲁斯, I wanted to start with a modern database platform that accepts all types (so to speak), without conversion.
Q: Does a databases exist, from the start, that accepts a wide diversity of the characters found in modern typefaces? If so, what are the drawbacks to a database that doesn't need to convert as much data in order to store that data?
// Anticipating two answers that I'm not looking for:
I found many answers to how someone could CONVERT (or encode) a special character, like é or a copyright symbol © into database-legal character set like © (for ©) so that a database can then accept it. This requires a conversion/translation layer to shuttle data into and out of the database. I know that has to happen on a level like the letter z is reducible to 1's and 0's, but I'm really talking about finding a human-readable database, one that doesn't need to translate.
I also see suggestions that people change the character encoding of their current database to one that accepts a wider range of characters. This is a good solution for someone who is carrying over a legacy system and wants to make it relevant to the wider range of characters that early computers, and the early web, didn't anticipate. I'm not starting with a legacy system. I'm looking for some modern database options.
Yes, there are databases that support large character sets. How to accomplish this is different from one database to another. For example:
In MS SQL Server you can use the nchar, nvarchar and ntext data types to store Unicode (UCS-2) text.
In MySQL you can choose UTF-8 as encoding for a table, so that it will be able to store Unicode text.
For any database that you consider using, you should look for Unicode support to see if can handle large character sets.

What are the characters that count as the same character under collation of UTF8 Unicode? And what VB.net function can be used to merge them?

Also what's the vb.net function that will map all those different characters into their most standard form.
For example, tolower would map A and a to the same character right?
I need the same function for these characters
german
ß === s
Ü === u
Χιοσ == Χίος
Otherwise, sometimes I insert Χιοσ and latter when I insert Χίος mysql complaints that the ID already exist.
So I want to create a unique ID that maps all those strange characters into a more stable one.
For the encoding aspect of the thing, look at String.Normalize. Notice also its overload that specifies a particular normal form to which you want to convert the string, but the default normal form (C) will work just fine for nearly everyone who wants to "map all those different characters into their most standard form".
However, things get more complicated once you move into the database and deal with collations.
Unicode normalization does not ever change the character case. It covers only cases where the characters are basically equivalent - look the same1, mean the same thing. For example,
Χιοσ != Χίος,
The two sigma characters are considered non-equivalent, and the accented iota (\u1F30) is equivalent to a sequence of two characters, the plain iota (\u03B9) and the accent (\u0313).
Your real problem seems to be that you are using Unicode strings as primary keys, which is not the most popular database design practice. Such primary keys take up more space than needed and are bound to change over time (even if the initial version of the application does not plan to support that). Oh, and I forgot their sensitivity to collations. Instead of identifying records by Unicode strings, have the database schema generate meaningless sequential integers for you as you insert the records, and demote the Unicode strings to mere attributes of the records. This way they can be the same or different as you please.
It may still be useful to normalize them before storing for the purpose of searching and safer subsequent processing; but the particular case insensitive collation that you use will no longer restrict you in any way.
1Almost the same in case of compatibility normalization as opposed to canonical normalization.

Indexing multilingual words in lucene

I am trying to index in Lucene a field that could have RDF literal in different languages.
Most of the approaches I have seen so far are:
Use a single index, where each document has a field per each language it uses, or
Use M indexes, M being the number of languages in the corpus.
Lucene 2.9+ has a feature called Payload that allows to attach attributes to term. Is anyone use this mechanism to store language (or other attributes such as datatypes) information ? How is performance compared to the two other approaches ? Any pointer on source code showing how it is done would help. Thanks.
It depends.
Do you want to allow something like: "Search all english text for 'foo'"? If so, then you will need one field per language.
Or do you want "Search all text for 'foo' and present the user with which language the match was found in?" If this is what you want, then either payloads or separate fields will work.
An alternative way to do it is to index all your text in one field, then have another field saying the language of the document. (Assuming each document is in a single language.) Then your search would be something like +text:foo +language:english.
In terms of efficiency: you probably want to avoid payloads, since you would have to repeat the name of the language for every term, and you can't search based on payloads (at least not easily).
so basically lucene is a ranking algorithm, it just looks at strings and compares them to other string. they can be encoded in different character encodings but their similarity is the same non the less. Just make sure you load the SnowBallAnalyzer with the supported langugage stemmer and you should get results. Like say Spanish or Chinese

which is the best collation for European + English language

HI There,
i am developing for European languages and also for English, the string are stored as NVARCHAR in sql server 2005.
so, which is the best collation to be used is "Latin1_General_CI_AS" covers all?
there are variations as well like
Latin1_General_CP1_CI_AS,Latin1_General_BIN,Latin1_General_BIN2 etc
comments\suggestions appreciated.
Regards
DEE
For general purpose sorting "General Latin1" is probably the best choice for western European and English languages.
I believe that if the code page (e.g., CP1) is not specified, then it defaults to code page 1252 (which is also what CP1 signifies). So my understanding is that Latin1_General_CI_AS and Latin1_General_CP1_CI_AS are equivalent. Given that, my opinion is that Latin1_General_CP1_CI_AS would be the better choice for clarity reasons. Whether you use CI_AS, CS_AS, or CI_AI is purely a usability issue based on whether you want case sensitivity and/or accent sensitivity. With CI, "a" == "A" and with AI, "á" == "â".
The _BIN and _BIN2 options signify that the collation will be binary based on the code point values. For sorting purposes, you probably do not want that because the order would not necessarily match any kind of dictionary order. However, if you are only using the index for searching for data, then one of those might be appropriate because it could be faster. Relatively little computation is necessary to convert a character value to the associated key value.
Edit As Martin points out in the comment, the code page will not matter unless you are using char, memo, or varchar. If you stick completely with Unicode (nchar, nvarchar, nmemo), then the code page will not come into play. If you translate a Unicode character to a single-byte character, though, it will be used.