Unicode question under iOS - objective-c

I have a SQLite database with a word list. One of the tables contains the word "você", which has the Unicode representation "voc\U00ea".
I've found out that the same word can have the following representations with the same visual output:
"voc\U00ea",
"voce\U0302"
When I query my db using the second representation it returns blank. Does anyone know a way to make the query work with both representations without duplicating the records in the table?
Thanks,
Miguel

These two forms are known as NFC ("normal form composed") and NFD ("normal form decomposed"). The character \U0302 is a combining circumflex accent, which modifies the preceding letter.
To cope with this situation, do the following:
Pick a normalization. Usually choosing NFC is a good idea. (Although the iOS/OS X file system uses NFD.)
Before putting a string into the database, always normalize. In iOS, you can use precomposedStringWithCanonicalMapping or precomposedStringWithCompatibilityMapping. To understand the difference between canonical and compatibility mappings, see this description.
Before performing a query, always normalize the query to the same normal form.
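A minimal sketch of that workflow, shown here in Python with the standard unicodedata and sqlite3 modules rather than the iOS string APIs (the table and column names are made up for illustration):

    import sqlite3
    import unicodedata

    def nfc(s):
        # Normalize to NFC so "voc\u00ea" and "voce\u0302" become the same string.
        return unicodedata.normalize("NFC", s)

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY)")

    # Always normalize on the way in...
    conn.execute("INSERT INTO words VALUES (?)", (nfc("voc\u00ea"),))

    # ...and normalize the query the same way, so either representation matches.
    row = conn.execute("SELECT word FROM words WHERE word = ?",
                       (nfc("voce\u0302"),)).fetchone()
    print(row)  # ('você',)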

Related

Creating a table in NexusDB with german umlauts?

I'm trying to import a CREATE TABLE statement in NexusDB.
The table name contains some German umlauts, and so do some field names, but I receive an error saying there are invalid characters in my statement (obviously the umlauts...).
My question is now: can somebody give a solution or any ideas to solve my problem?
It's not so easy to just change the umlauts into equivalent spellings like ä -> ae or ö -> oe, since our application has fixed table names that every customer currently uses.
It is not a good idea to use characters outside what is normally permitted in the SQL standard. This will bite you not only in NexusDB, but in many other databases as well. Take special note that there is a good chance you will also run into problems when you want to access data via ODBC etc, as other environments may also have similar standard restrictions. My strong recommendation would be to avoid use of characters outside the SQL naming standard for tables, no matter which database is used.
However... having said all that, given that NexusDB is one of the most flexible database systems for the programmer (it comes with full source), there is already a solution. If you add an "extendedliterals" define to your database server project, a larger set of characters is considered valid. For the exact change this enables, see the nxcValidIdentChars constant in the nxllConst.pas unit. The constant may also be changed if required.

Is there a database that accepts special characters by default (without converting them)?

I am currently starting from scratch choosing a database to store data collected from a suite of web forms. Humans will be filling out these forms, and as they're susceptible to using international characters, especially those humans named José and François and أسامة and 布鲁斯, I wanted to start with a modern database platform that accepts all types (so to speak), without conversion.
Q: Is there a database that, from the start, accepts the wide diversity of characters found in modern typefaces? If so, what are the drawbacks to a database that doesn't need to convert as much data in order to store it?
// Anticipating two answers that I'm not looking for:
I found many answers to how someone could CONVERT (or encode) a special character, like é or a copyright symbol ©, into a database-legal representation like &copy; (for ©) so that a database can then accept it. This requires a conversion/translation layer to shuttle data into and out of the database. I know that has to happen at some level, like the letter z being reducible to 1's and 0's, but I'm really talking about finding a human-readable database, one that doesn't need to translate.
I also see suggestions that people change the character encoding of their current database to one that accepts a wider range of characters. This is a good solution for someone who is carrying over a legacy system and wants to make it relevant to the wider range of characters that early computers, and the early web, didn't anticipate. I'm not starting with a legacy system. I'm looking for some modern database options.
Yes, there are databases that support large character sets. How to accomplish this is different from one database to another. For example:
In MS SQL Server you can use the nchar, nvarchar and ntext data types to store Unicode (UCS-2) text.
In MySQL you can choose UTF-8 as encoding for a table, so that it will be able to store Unicode text.
For any database that you consider using, you should look at its Unicode support to see if it can handle large character sets.
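For the MySQL case, for example, the declaration might look like this (sketched from Python with the pymysql driver; the connection parameters and table name are placeholders, and utf8mb4 is MySQL's full four-byte UTF-8 encoding):

    import pymysql

    # Placeholder credentials; charset="utf8mb4" makes the connection Unicode-safe.
    conn = pymysql.connect(host="localhost", user="app", password="secret",
                           database="forms", charset="utf8mb4")

    with conn.cursor() as cur:
        # Declare the table's character set explicitly so no conversion is needed.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS people (
                id   INT AUTO_INCREMENT PRIMARY KEY,
                name VARCHAR(100)
            ) CHARACTER SET utf8mb4
        """)
        cur.executemany("INSERT INTO people (name) VALUES (%s)",
                        [("José",), ("François",), ("أسامة",), ("布鲁斯",)])
    conn.commit()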

What are the characters that count as the same character under collation of UTF8 Unicode? And what VB.net function can be used to merge them?

Also, what's the VB.NET function that will map all those different characters to their most standard form?
For example, ToLower would map A and a to the same character, right?
I need the same function for these characters
German:
ß === s
Ü === u
Greek:
Χιοσ == Χίος
Otherwise, sometimes I insert Χιοσ and later, when I insert Χίος, MySQL complains that the ID already exists.
So I want to create a unique ID that maps all those strange characters into a more stable one.
For the encoding aspect of the thing, look at String.Normalize. Notice also its overload that specifies a particular normal form to which you want to convert the string, but the default normal form (C) will work just fine for nearly everyone who wants to "map all those different characters into their most standard form".
However, things get more complicated once you move into the database and deal with collations.
Unicode normalization does not ever change the character case. It covers only cases where the characters are basically equivalent: they look the same¹ and mean the same thing. For example,
Χιοσ != Χίος: the two sigma characters are considered non-equivalent, while the accented iota (\u1F30) is equivalent to a sequence of two characters, the plain iota (\u03B9) followed by the combining mark (\u0313).
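You can check those claims quickly; here is a sketch using Python's unicodedata module, which produces the same normal forms as String.Normalize in .NET:

    import unicodedata

    # NFC composes, NFD decomposes; neither changes case.
    assert unicodedata.normalize("NFC", "\u03b9\u0313") == "\u1f30"  # ι + combining mark -> ἰ
    assert unicodedata.normalize("NFD", "\u1f30") == "\u03b9\u0313"  # ἰ -> ι + combining mark

    # The two sigma forms (σ and ς) stay distinct under every normal form,
    # so these two strings are never considered equal by normalization alone.
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        assert unicodedata.normalize(form, "Χιοσ") != unicodedata.normalize(form, "Χίος")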
Your real problem seems to be that you are using Unicode strings as primary keys, which is not the most popular database design practice. Such primary keys take up more space than needed and are bound to change over time (even if the initial version of the application does not plan to support that). Oh, and I forgot their sensitivity to collations. Instead of identifying records by Unicode strings, have the database schema generate meaningless sequential integers for you as you insert the records, and demote the Unicode strings to mere attributes of the records. This way they can be the same or different as you please.
It may still be useful to normalize them before storing, for the purpose of searching and safer subsequent processing; but the particular case-insensitive collation that you use will no longer restrict you in any way.
¹ Almost the same in the case of compatibility normalization, as opposed to canonical normalization.
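A sketch of that schema idea, using SQLite from Python only because it is self-contained (the table and column names are invented): an auto-generated integer identifies the record, and the normalized Unicode string is demoted to an ordinary attribute.

    import sqlite3
    import unicodedata

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE places (
            id   INTEGER PRIMARY KEY,   -- meaningless surrogate key
            name TEXT NOT NULL          -- the Unicode string, a mere attribute
        )
    """)

    def add_place(name):
        # Normalize once on the way in so later searches behave predictably.
        cur = conn.execute("INSERT INTO places (name) VALUES (?)",
                           (unicodedata.normalize("NFC", name),))
        return cur.lastrowid   # from now on, this integer identifies the record

    print(add_place("Χιοσ"), add_place("Χίος"))  # two rows, no key collision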

Word search within texts to find the text that contains the closest matching variant

I want a way to find the most suitable row from a table, the one containing the word most similar to the word I'm entering. Any idea? (I'm using OCR, which doesn't always read words exactly; it sometimes reads the word 'specific' as 'spccific'.)
If you are using Oracle, then you can try UTL_MATCH, which uses something known as the Levenshtein distance to calculate the minimum number of edits needed to transform one string into another. Other systems may have something similar, or you can use the algorithm as a starting point for your own function.
Maybe you can use the SOUNDEX functionality (SQL Server) or SOUNDS LIKE (MySQL) if it is available with the SQL engine you are using.
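If the database engine offers nothing suitable, a rough client-side fallback is easy to sketch. Here, Python's standard difflib ranks candidate words by similarity; the word list and cutoff are made up for illustration:

    import difflib

    # Candidate words fetched from the table, e.g. SELECT word FROM words.
    words = ["specific", "species", "special", "pacific"]

    ocr_output = "spccific"

    # Return the single closest match whose similarity ratio exceeds the cutoff.
    print(difflib.get_close_matches(ocr_output, words, n=1, cutoff=0.6))
    # ['specific']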

How do you know when to use varchar and when to use text in sql?

It seems like a very arbitrary decision.
Both can accomplish the same thing in most cases.
Limiting the varchar length seems to me like shooting yourself in the foot, because you never know how long a field you will need.
Is there any specific guideline for choosing VARCHAR or TEXT for your string fields?
I will be using postgresql with the sqlalchemy orm framework for python.
In PostgreSQL there is no technical difference between varchar and text.
You can see a varchar(nnn) as a text column with a check constraint that prohibits storing larger values.
So each time you want to have a length constraint, use varchar(nnn).
If you don't want to restrict the length of the data, use text.
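Since the question mentions SQLAlchemy on PostgreSQL, the choice looks roughly like this (a sketch with invented model and column names; the declarative_base import path may differ in older SQLAlchemy versions): String(n) becomes varchar(n) with its length check, while Text becomes an unconstrained text column.

    from sqlalchemy import Column, Integer, String, Text
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class Article(Base):
        __tablename__ = "articles"

        id    = Column(Integer, primary_key=True)
        slug  = Column(String(80))    # varchar(80): reject anything longer
        title = Column(String(200))   # varchar(200): bounded by design
        body  = Column(Text)          # text: no length restriction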
This sentence is wrong:
Limiting the varchar length seems to me like shooting yourself in the foot, because you never know how long a field you will need.
If you are saving, for example, MD5 hashes, you do know how large the field you're storing is, and your storage becomes more efficient. Other examples are:
Usernames (64 max)
Passwords (128 max)
Zip codes
Addresses
Tags
Many more!
In brief:
Variable-length fields save space, but because each field can have a different length, table operations are slower
Fixed-length fields make table operations fast, although they must be large enough for the maximum expected input, so they can use more space
Think of an analogy to arrays and linked lists, where arrays are like fixed-length fields and linked lists are like varchars. Which is better, arrays or linked lists? Luckily we have both, because each is useful in different situations; so too here.
In most cases you do know what the maximum length of a string in a field is. In the case of a first or last name, for example, you don't need more than 255 characters. So you choose which type to use by design; if you always use text, you're wasting resources.
Check this article on PostgresOnline; it also links to two other useful articles.
Most problems with TEXT in PostgreSQL occur when you're using tools, applications and drivers that treat TEXT very differently from VARCHAR, because other databases behave very differently with these two data types.
Database designers almost always know how many characters a column needs to hold. US delivery addresses need to hold up to 64 characters. (The US Postal Service publishes addressing guidelines that say so.) US ZIP codes are 5 characters long.
A database designer will look at representative sample data from her clients when she's specifying columns. She'll ask herself questions like "What's the longest product name?" And when the answer is "70 characters", she won't make the column 3000 characters wide.
VARCHAR has a limit of 8k in SQL Server (I think). Most applications don't require nearly that much storage for a single column.