Accent-Insensitive Alphabetization and Searching [duplicate] - objective-c

I am new to Android and I'm working on a query in SQLite.
My problem occurs when the strings contain accents, e.g.:
ÁÁÁ
ááá
ÀÀÀ
ààà
aaa
AAA
If I do:
SELECT * FROM TB_MOVIE WHERE MOVIE_NAME LIKE '%a%' ORDER BY MOVIE_NAME;
It returns:
AAA
aaa (the others are ignored)
But if I do:
SELECT * FROM TB_MOVIE WHERE MOVIE_NAME LIKE '%à%' ORDER BY MOVIE_NAME;
It returns:
ààà (ignoring the title "ÀÀÀ")
I want to select strings in a SQLite DB regardless of accents and case. Please help.

Generally, string comparisons in SQL are controlled by column or expression COLLATE rules. In Android, only three collation sequences are pre-defined: BINARY (default), LOCALIZED and UNICODE. None of them is ideal for your use case, and the C API for installing new collation functions is unfortunately not exposed in the Java API.
To work around this:
Add another column to your table, for example MOVIE_NAME_ASCII
Store values into this column with the accent marks removed. You can remove accents by normalizing your strings to Unicode Normal Form D (NFD) and removing non-ASCII code points since NFD represents accented characters roughly as plain ASCII + combining accent markers:
String asciiName = Normalizer.normalize(unicodeName, Normalizer.Form.NFD)
        .replaceAll("[^\\p{ASCII}]", "");
Do your text searches on this ASCII-normalized column but display data from the original unicode column.
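As a minimal, self-contained sketch of the normalization step above, the class below wraps the same Normalizer call in a helper and applies it to the titles from the question. The class and method names (AccentFolding, toAscii) are made up for illustration, and the accented titles are written as Unicode escapes only so that the file compiles regardless of the source encoding.

```java
import java.text.Normalizer;

public class AccentFolding {
    // Decompose to NFD, then drop everything outside ASCII (the combining marks).
    static String toAscii(String unicodeName) {
        return Normalizer.normalize(unicodeName, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // Titles from the question, written as Unicode escapes.
        String[] titles = {
            "\u00C1\u00C1\u00C1", // A with acute, upper case
            "\u00E1\u00E1\u00E1", // a with acute, lower case
            "\u00C0\u00C0\u00C0", // A with grave, upper case
            "\u00E0\u00E0\u00E0", // a with grave, lower case
            "aaa",
            "AAA"
        };
        for (String title : titles) {
            // Value to store in the MOVIE_NAME_ASCII column.
            System.out.println(toAscii(title));
        }
    }
}
```

The search term has to go through the same helper before it is bound into the LIKE pattern. Note that letters without a canonical decomposition (for example ø or ß) are dropped entirely by this approach rather than mapped to a base letter.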

In Android SQLite, LIKE and GLOB ignore both COLLATE LOCALIZED and COLLATE UNICODE (they only work for ORDER BY). However, there is a solution that does not require adding extra columns to your table. As @asat explains in this answer, you can use GLOB with a pattern that replaces each letter with all the available alternatives of that letter. In Java:
public static String addTildeOptions(String searchText) {
    return searchText.toLowerCase()
            .replaceAll("[aáàäâã]", "\\[aáàäâã\\]")
            .replaceAll("[eéèëê]", "\\[eéèëê\\]")
            .replaceAll("[iíìî]", "\\[iíìî\\]")
            .replaceAll("[oóòöôõ]", "\\[oóòöôõ\\]")
            .replaceAll("[uúùüû]", "\\[uúùüû\\]")
            .replace("*", "[*]")
            .replace("?", "[?]");
}
And then (not literally like this, of course):
SELECT * from table WHERE lower(column) GLOB "*addTildeOptions(searchText)*"
This way, for example in Spanish, a user searching for either mas or más will get the search converted into m[aáàäâã]s, returning both results.
Note that GLOB ignores COLLATE NOCASE, which is why everything is converted to lower case both in the function and in the query. Note also that the lower() function in SQLite only handles ASCII characters, so upper-case accented letters stored in the column (such as Á) are left unchanged. If your data contains them, add the upper-case variants to each bracket set as well.
The function also replaces both GLOB wildcards, * and ?, with "escaped" versions.
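To make the "not literally like this" step concrete, here is a hedged sketch that builds the complete GLOB argument as a string that can be bound as a query parameter. It is a variation on addTildeOptions rather than the original author's code. The class and method names are hypothetical, the bracket sets additionally contain the upper-case accented letters (because lower() in SQLite leaves them untouched), and a literal [ in the search text is escaped too. The accented letters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.Locale;

public class GlobSearch {
    // Each family lists the base letter, then the lower-case accented forms used
    // in addTildeOptions, then their upper-case forms (an addition of this sketch).
    private static final String[] FAMILIES = {
        "a\u00E1\u00E0\u00E4\u00E2\u00E3\u00C1\u00C0\u00C4\u00C2\u00C3",
        "e\u00E9\u00E8\u00EB\u00EA\u00C9\u00C8\u00CB\u00CA",
        "i\u00ED\u00EC\u00EE\u00CD\u00CC\u00CE",
        "o\u00F3\u00F2\u00F6\u00F4\u00F5\u00D3\u00D2\u00D6\u00D4\u00D5",
        "u\u00FA\u00F9\u00FC\u00FB\u00DA\u00D9\u00DC\u00DB"
    };

    // Returns the family containing c, or null if c has no accented variants here.
    private static String familyOf(char c) {
        for (String family : FAMILIES) {
            if (family.indexOf(c) >= 0) {
                return family;
            }
        }
        return null;
    }

    // Builds the full GLOB argument, including the surrounding '*' wildcards.
    static String toGlobArgument(String searchText) {
        StringBuilder glob = new StringBuilder("*");
        for (char c : searchText.toLowerCase(Locale.ROOT).toCharArray()) {
            String family = familyOf(c);
            if (family != null) {
                glob.append('[').append(family).append(']');
            } else if (c == '*' || c == '?' || c == '[') {
                glob.append('[').append(c).append(']'); // match the character literally
            } else {
                glob.append(c);
            }
        }
        return glob.append('*').toString();
    }

    public static void main(String[] args) {
        System.out.println(toGlobArgument("mas"));
    }
}
```

On Android the result would then be passed as a bound argument instead of being concatenated into the SQL, for example `db.rawQuery("SELECT * FROM TB_MOVIE WHERE lower(MOVIE_NAME) GLOB ?", new String[] { GlobSearch.toGlobArgument(text) })`, where db is assumed to be an open SQLiteDatabase.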

You can use Android NDK to recompile the SQLite source including the desired ICU (International Components for Unicode).
Explained in Russian here:
http://habrahabr.ru/post/122408/
The process of compiling the SQLite source with ICU is explained here:
How to compile sqlite with ICU?
Unfortunately you will end up with different APKs for different CPUs.

You need to look at these, not as accented characters, but as entirely different characters. You might as well be looking for a, b, or c. That being said, I would try using a regex for it. It would look something like:
SELECT * from TB_MOVIE WHERE MOVIE_NAME REGEXP '.*[aAàÀ].*' ORDER BY MOVIE_NAME;

Related

Cannot search nor join on other language other than english

I'm scratching my head on this SQL.
I have already changed the database collation to Chinese_PRC_CI_AS but still cannot join or search on a specific value containing Chinese. This column value comes from an Excel file, so I suspect there might be something wrong with the Excel encoding.
I have tried to find the hex string using this:
SELECT master.dbo.fn_varbintohexstr(CAST(Media AS varbinary))
,Media
,master.dbo.fn_varbintohexstr(CAST('汽车之家 Autohome' AS varbinary))
FROM XXX
The two hex values in the result differ:
0x7d6c668f4b4eb65b0a004100750074006f0068006f006d006500 汽车之家 Autohome 0xc6fbb3b5d6aebcd2204175746f686f6d65
The first hex string is the value that I cannot join or search on in a WHERE condition.
How can I determine which encoding the first string uses?
UPDATE:
Inspired by the folks below, I used N'' and the hex strings are now the same. But I still cannot find the string using where Media = N'汽车之家 Autohome'. Any ideas why?
UPDATE:
I found the reason: what looks like a space is not actually a space but \n or another special character. After removing it, everything works fine.

Is this a bug in sqlite?

I have following query for Czech language:
select Id,Name
from Account
where Name like '%Še%'
It returns the correct result. But if I change Š to š in my query:
select Id,Name
from Account
where Name like '%še%'
It returns nothing. Is this a sqlite bug?
https://www.sqlite.org/lang_expr.html#like:
Important Note: SQLite only understands upper/lower case for ASCII characters by default. The LIKE operator is case sensitive by default for unicode characters that are beyond the ASCII range. For example, the expression 'a' LIKE 'A' is TRUE but 'æ' LIKE 'Æ' is FALSE. The ICU extension to SQLite includes an enhanced version of the LIKE operator that does case folding across all unicode characters.
So the solution to your problem would be to get an SQLite version with the ICU extension.

Removing replacement character � from column

Based on my research so far this character indicates bad encoding between the database and front end. Unfortunately, I don't have any control over either of those. I'm using Teradata Studio.
How can I filter this character out? I'm trying to perform a REGEX_SUBSTR function on a column that occasionally contains �, which throws the error "The string contains an untranslatable character".
Here is my SQL. AIRCFT_POSITN_ID is the column that contains the replacement character.
SELECT DISTINCT AIRCFT_POSITN_ID,
REGEXP_SUBSTR(AIRCFT_POSITN_ID, '[0-9]+') AS AUTOROW
FROM PROD_MAE_MNTNC_VW.FMR_DISCRPNCY_DFRL
WHERE DFRL_CREATE_TMS > CURRENT_DATE -25
Your diagnosis is correct, so first of all, check the Session Character Set (it is part of the connection definition).
If it is ASCII, change it to UTF8 and you will be able to see the original characters instead of the substitute character.
In case the character is indeed part of the data and not just a sign of an encoding translation issue:
The substitute character, AKA SUB (DEC: 26, HEX: 1A), is quite unique in Teradata.
You cannot use it directly:
select '�';
-- [6706] The string contains an untranslatable character.
select '1A'XC;
-- [6706] The string contains an untranslatable character.
If you are using version 14.0 or above you can generate it with the CHR function:
select chr(26);
If you're below version 14.0 you can generate it like this:
select translate (_unicode '05D0'XC using unicode_to_latin with error);
Once you have generated the character, you can use it with OREPLACE or OTRANSLATE:
create multiset table t (i int,txt varchar(100) character set latin) unique primary index (i);
insert into t (i,txt) values (1,translate ('Hello שלום world עולם' using unicode_to_latin with error));
select * from t;
-- Hello ���� world ����
select otranslate (txt,chr(26),'') from t;
-- Hello world
select otranslate (txt,translate (_unicode '05D0'XC using unicode_to_latin with error),'') from t;
-- Hello world
BTW, there are 2 versions of OTRANSLATE and OREPLACE:
The functions under syslib work with LATIN.
The functions under TD_SYSFNLIB work with UNICODE.
In addition to Dudu's excellent answer above, I wanted to add the following now that I've encountered the issue again and had more time to experiment. The following SELECT command produced an untranslatable character:
SELECT IDENTIFY FROM PROD_MAE_MNTNC_VW.SCHD_MNTNC;
IDENTIFY
24FEB1747659193DC330A163DCL�ORD
Trying to perform a REGEXP_REPLACE or OREPLACE directly on this character produces an error:
Failed [6706 : HY000] The string contains an untranslatable character.
I changed the CHARSET property in my Teradata connection from UTF8 to ASCII and could now see the offending character, which looks like a tab:
IDENTIFY
Calling TRANSLATE_CHK with this specific conversion succeeds and identifies the position of the offending character (note that this does not work with the UTF8 charset):
TRANSLATE_CHK(IDENTIFY USING KANJI1_SBC_TO_UNICODE) AS BADCHAR
BADCHAR
28
Now this character can be dealt with using some CASE statements to remove the bad character and retain the remainder of the string:
CASE WHEN TRANSLATE_CHK(IDENTIFY USING KANJI1_SBC_TO_UNICODE) = 0 THEN IDENTIFY
ELSE SUBSTR(IDENTIFY, 1, TRANSLATE_CHK(IDENTIFY USING KANJI1_SBC_TO_UNICODE)-1)
END AS IDENTIFY
Hope this helps someone out.

howto cut text from specific character in sqlite query

SQLITE Query question:
I have a query which returns string with the character '#' in it.
I would like to remove all characters after this specific character '#':
select field from mytable;
result :
text#othertext
text2#othertext
text3#othertext
So in my sample I would like to create a query which only returns :
text
text2
text3
I tried something with instr() to get the index, but instr() was not recognized as a function -> SQL Error: no such function: instr (probably old version of db . sqlite_version()-> 3.7.5).
Any hints on how to achieve this?
There are two approaches:
You can rtrim the string of all characters other than the # character.
This assumes, of course, that (a) there is only one # in the string; and (b) that you're dealing with simple strings (e.g. 7-bit ASCII) in which it is easy to list all the characters to be stripped.
You can use sqlite3_create_function to create your own rendition of INSTR. The specifics here will vary a bit upon how you're using

Approximate search with openldap

I am trying to write a search that queries our directory server running openldap.
The users are going to be searching using the first or last name of the person they're interested in.
I found a problem with accented characters (like áéíóú), because first and last names are written in Spanish, so while the proper way is Pérez it can be written for the sake of the search as Perez, without the accent.
If I use '(cn=*Perez*)' I get only the non-accented results.
If I use '(cn=*Pérez*)' I get only accented results.
If I use '(cn=~Perez)' I get weird results (or at least nothing I can use): the results contain both Perez and Pérez occurrences, but also some entries that apparently have nothing to do with the query.
In Spanish this happens quite a lot. Whether out of laziness or something else, people tend NOT to write the accents in this kind of search because it is assumed that searches work with both spellings (I guess since Google allows it, everybody assumes it's supposed to work that way).
Other than updating the database to remove all accents and stripping them from the query, can you think of another solution?
You have your ~ and = swapped above. It should be (cn~=Perez). I still don't know how well that will work. Soundex has always been strange. Since many attributes are multi-valued including cn you could store a second value on the attribute that has the extended characters converted to their base versions. You would at least have the original value to still go off of when you needed it. You could also get real fancy and prefix the converted value with something and use the valuesReturnFilter to filter it out from your results.
#Sample object
dn:cn=Pérez,ou=x,dc=y
cn:Pérez
cn:{stripped}Perez
sn:Pérez
#etc.
Then modify your query to use an or expression.
(|(cn=Pérez)(cn={stripped}Perez))
And you would include a valuesReturnFilter that looked like
(!(cn={stripped}*))
See RFC3876 http://www.networksorcery.com/enp/rfc/rfc3876.txt for details. The method for adding a request control varies by what platform/library you are using to access the directory.
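A rough sketch of how the second cn value and the OR filter above could be generated in Java, assuming the {stripped} prefix convention from the sample entry. The class and method names are hypothetical, and accents are removed by NFD normalization followed by dropping combining marks, which is one possible way to do the conversion and not something the answer prescribes.

```java
import java.text.Normalizer;

public class StrippedCnFilter {
    // Marker from the sample entry above; any prefix unlikely to occur in real names works.
    static final String PREFIX = "{stripped}";

    // Decompose accented letters, then remove the combining marks.
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    // The additional value to store on the entry, e.g. cn: {stripped}Perez
    static String strippedValue(String name) {
        return PREFIX + stripAccents(name);
    }

    // Matches either the spelling as typed or the stored stripped spelling.
    // Assumes userInput has already been escaped for use in a search filter.
    static String orFilter(String userInput) {
        return "(|(cn=" + userInput + ")(cn=" + strippedValue(userInput) + "))";
    }

    public static void main(String[] args) {
        // P, e with acute (as a Unicode escape), r, e, z
        System.out.println(orFilter("P\u00E9rez"));
    }
}
```

Because the user input is stripped before it is compared with the stored {stripped} value, a search for either spelling produces the same second clause.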
Search filters ("queries") are specified by RFC 2254 (since obsoleted by RFC 4515).
Encoding:
RFC 2254 requires filters (indirectly) to be an OCTET STRING, i.e. a string of 8-bit bytes: AttributeValue is an OCTET STRING, and MatchingRuleId and AttributeDescription are LDAPString, which is itself an OCTET STRING.
The standard on escaping: use "\<ASCII HEX NUMBER>" to replace special characters
(https://www.rfc-editor.org/rfc/rfc4515#page-4, examples https://www.rfc-editor.org/rfc/rfc4515#page-5).
Quote:
The <valueencoding> rule ensures that the entire filter string is a
valid UTF-8 string and provides that the octets that represent the
ASCII characters "*" (ASCII 0x2a), "(" (ASCII 0x28), ")" (ASCII
0x29), "\" (ASCII 0x5c), and NUL (ASCII 0x00) are
represented as a backslash "\" (ASCII 0x5c) followed by the two hexadecimal digits
representing the value of the encoded octet.
Additionally, you should probably replace all characters that semantically modify the filter (RFC 4515's grammar gives a list), and do a Regex replace of non-ASCII characters with wildcards (*) to be sure. This will also help you with characters like "é".
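The escaping rule quoted above can be sketched as a small Java helper. It covers only the five characters named in the RFC 4515 quote and passes everything else through unchanged, including non-ASCII letters, since the filter string is allowed to be UTF-8. The class and method names are hypothetical, and this is an illustration rather than a replacement for whatever escaping your LDAP library already provides.

```java
public class LdapFilterEscaper {
    // Replaces the characters listed in the RFC 4515 quote with a backslash
    // followed by the two hex digits of their ASCII code.
    static String escapeValue(String value) {
        StringBuilder out = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '*':  out.append("\\2a"); break;
                case '(':  out.append("\\28"); break;
                case ')':  out.append("\\29"); break;
                case '\\': out.append("\\5c"); break;
                case '\0': out.append("\\00"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // Builds a substring filter such as (cn=*Perez*) from untrusted user input.
    static String substringFilter(String attribute, String userInput) {
        return "(" + attribute + "=*" + escapeValue(userInput) + "*)";
    }

    public static void main(String[] args) {
        System.out.println(substringFilter("cn", "Perez (jr)"));
    }
}
```

Only the user-supplied value is escaped; the wildcards that belong to the filter itself are added afterwards so they keep their special meaning.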