Search by name in non-Latin characters - ruby-on-rails-3

I'm trying to do a search using:
Product.order(:name).where("name like ?", params[:term])
where :term is in non-Latin characters (Hebrew).
Both my application and my database are set to UTF-8.
application.rb
config.encoding = "utf-8"
database
utf8_unicode_ci
The specific name I'm searching for is in the database, but the search comes back empty.
Any suggestions?

I had to add a % wildcard of my own to params[:term]. Without a wildcard, LIKE only matches when the whole value equals the term.
So now it is:
Product.order(:name).where("name like ?", params[:term]+"%")
I am not sure if this is the best way to achieve what I wanted, but it works nonetheless.

Related

Accent-Insensitive Alphabetization and Searching [duplicate]

I am new to Android and I'm working on a query in SQLite.
My problem is with strings that contain accents, e.g.
ÁÁÁ
ááá
ÀÀÀ
ààà
aaa
AAA
If I do:
SELECT * FROM TB_MOVIE WHERE MOVIE_NAME LIKE '%a%' ORDER BY MOVIE_NAME;
It returns:
AAA
aaa (it ignores the others)
But if I do:
SELECT * FROM TB_MOVIE WHERE MOVIE_NAME LIKE '%à%' ORDER BY MOVIE_NAME;
It returns:
ààà (ignoring the title "ÀÀÀ")
I want to select strings in a SQLite DB regardless of accents and case. Please help.
Generally, string comparisons in SQL are controlled by column or expression COLLATE rules. In Android, only three collation sequences are pre-defined: BINARY (default), LOCALIZED and UNICODE. None of them is ideal for your use case, and the C API for installing new collation functions is unfortunately not exposed in the Java API.
To work around this:
Add another column to your table, for example MOVIE_NAME_ASCII
Store values into this column with the accent marks removed. You can remove accents by normalizing your strings to Unicode Normal Form D (NFD) and removing non-ASCII code points since NFD represents accented characters roughly as plain ASCII + combining accent markers:
String asciiName = Normalizer.normalize(unicodeName, Normalizer.Form.NFD)
        .replaceAll("[^\\p{ASCII}]", "");
Do your text searches on this ASCII-normalized column but display data from the original unicode column.
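A minimal sketch of how the value for the extra column could be computed. Note the assumptions: the class and method names are made up for illustration, and unlike the snippet above it strips only the combining marks (\p{M}) left by NFD rather than every non-ASCII code point, so letters that have no ASCII base form are kept instead of silently disappearing. The result is also lower-cased so that one key serves both case- and accent-insensitive matching.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper; the names are illustrative and not part of any Android API.
final class AccentFold {

    // Decompose to NFD, then remove the combining marks so only base letters remain.
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    // Value to store in the extra search column (e.g. MOVIE_NAME_ASCII).
    static String searchKey(String s) {
        return stripAccents(s).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the file independent of the source encoding:
        // U+00C0 is A with grave, U+00E1 is a with acute.
        System.out.println(searchKey("\u00C0\u00C0\u00C0"));
        System.out.println(searchKey("m\u00E1s"));
    }
}
```

The same searchKey would be applied to the user's search text before it is compared against the extra column.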
In Android's SQLite, LIKE and GLOB ignore both COLLATE LOCALIZED and COLLATE UNICODE (they only work for ORDER BY). However, there is a solution that does not require adding extra columns to your table. As @asat explains in this answer, you can use GLOB with a pattern that replaces each letter with all the available alternatives of that letter. In Java:
public static String addTildeOptions(String searchText) {
    return searchText.toLowerCase()
            .replaceAll("[aáàäâã]", "\\[aáàäâã\\]")
            .replaceAll("[eéèëê]", "\\[eéèëê\\]")
            .replaceAll("[iíìî]", "\\[iíìî\\]")
            .replaceAll("[oóòöôõ]", "\\[oóòöôõ\\]")
            .replaceAll("[uúùüû]", "\\[uúùüû\\]")
            .replace("*", "[*]")
            .replace("?", "[?]");
}
And then (not literally like this, of course):
SELECT * from table WHERE lower(column) GLOB "*addTildeOptions(searchText)*"
This way, for example in Spanish, a user searching for either mas or más will get the search converted into m[aáàäâã]s, returning both results.
It is important to note that GLOB ignores COLLATE NOCASE, which is why I convert everything to lower case both in the function and in the query. Note also that the lower() function in SQLite doesn't work on non-ASCII characters, but those are probably the ones you are already replacing!
The function also replaces both GLOB wildcards, * and ?, with "escaped" versions.
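As a rough sketch of the same idea, the pattern can also be built character by character and then passed to the query as a bound parameter rather than concatenated into the SQL. Everything named here (GlobPattern, toGlob, GROUPS) is hypothetical. This variant also wraps a literal [ in brackets, which the function above does not do. The accented letters are written as Unicode escapes only so the file compiles under any source encoding.

```java
import java.util.Locale;

// Hypothetical helper; an alternative formulation of the GLOB approach described above.
final class GlobPattern {

    // Letters treated as equivalent. Each group lists the plain vowel followed by
    // its acute, grave, diaeresis, circumflex (and, for a and o, tilde) forms.
    private static final String[] GROUPS = {
        "a\u00E1\u00E0\u00E4\u00E2\u00E3",
        "e\u00E9\u00E8\u00EB\u00EA",
        "i\u00ED\u00EC\u00EE",
        "o\u00F3\u00F2\u00F6\u00F4\u00F5",
        "u\u00FA\u00F9\u00FC\u00FB"
    };

    // Returns a "contains" GLOB pattern in which every known vowel matches all its variants.
    static String toGlob(String searchText) {
        StringBuilder out = new StringBuilder("*");
        for (char c : searchText.toLowerCase(Locale.ROOT).toCharArray()) {
            String group = groupOf(c);
            if (group != null) {
                out.append('[').append(group).append(']');
            } else if (c == '*' || c == '?' || c == '[') {
                out.append('[').append(c).append(']'); // match the metacharacter literally
            } else {
                out.append(c);
            }
        }
        return out.append('*').toString();
    }

    private static String groupOf(char c) {
        for (String g : GROUPS) {
            if (g.indexOf(c) >= 0) {
                return g;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(toGlob("m\u00E1s"));
        // On Android the pattern would typically be bound as an argument, e.g.:
        //   db.rawQuery("SELECT * FROM TB_MOVIE WHERE lower(MOVIE_NAME) GLOB ?",
        //               new String[] { GlobPattern.toGlob(text) });
    }
}
```

Binding the pattern as an argument avoids having to quote the search text inside the SQL string.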
You can use the Android NDK to recompile the SQLite source with ICU (International Components for Unicode) support.
Explained in Russian here:
http://habrahabr.ru/post/122408/
The process of compiling SQLite from source with ICU is explained here:
How to compile sqlite with ICU?
Unfortunately, you will end up with different APKs for different CPUs.
You need to look at these, not as accented characters, but as entirely different characters. You might as well be looking for a, b, or c. That being said, I would try using a regex for it. It would look something like:
SELECT * from TB_MOVIE WHERE MOVIE_NAME REGEXP '.*[aAàÀ].*' ORDER BY MOVIE_NAME;

SQL special characters JSP

I am having some trouble while searching the Database for some strings with special characters through JSP.
eg:
col name obj_id
value blah/blah[blah]
and I have got to search it dynamically through JSP, eg:
select *
from table
where objid like '" +obj_id+ "'
this query works for normal strings. Please can anyone tell me how to make it work for special characters as well?
I read up on that escape { escape '/' } bit. Can that be included here? If so, how?
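One common way to handle this, sketched here under several assumptions, is to bind the value through a PreparedStatement and declare an explicit ESCAPE character in the LIKE clause. The table name my_table, the class name, and the choice of ! as the escape character are all hypothetical. Treating [ as a wildcard is SQL Server behaviour; on databases that do not, that character should be removed from the escaped set, since some reject an escape character in front of an ordinary character.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper; the table and column names are placeholders.
final class LikeSearch {

    // Prefix every LIKE metacharacter (and the escape character itself) with '!'.
    // '[' is included on the assumption that the database treats it as a wildcard.
    static String escapeLike(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (c == '!' || c == '%' || c == '_' || c == '[') {
                sb.append('!');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Binds the escaped value as a parameter instead of concatenating it into the SQL.
    static boolean exists(Connection conn, String objId) throws SQLException {
        String sql = "SELECT 1 FROM my_table WHERE obj_id LIKE ? ESCAPE '!'";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, escapeLike(objId));
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }
}
```

If no wildcard matching is needed at all, comparing with = and a bound parameter avoids the escaping question entirely.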

Lucene 5.0.0 - search string with special characters

I am using Lucene version 5.0.0.
In my search string, there is a minus character like “test-”.
I read that the minus sign is a special character in Lucene. So I have to escape that sign, as in the queryparser documentation:
Escaping Special Characters:
Lucene supports escaping special characters that are part of the query syntax. The current list special characters are:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
To escape these character use the \ before the character. For example to search for (1+1):2 use the query:
\(1\+1\)\:2
To do that I use the QueryParser.escape method:
query = parser.parse(QueryParser.escape(searchString));
I use the classic Analyzer because I noticed that the standard Analyzer has some problems with escaping special characters.
The problem is that the Parser deletes the special characters and so the Query has the term
content:test
How can I set up the parser and searcher to search for the real value “test-”?
I also created my own query with the content test-, but that didn’t work either. I received 0 results, but my index has entries like:
Test-VRF
Test-IPLS
I am really confused about this problem.
While escaping special characters for the queryparser deals with part of the problem, it doesn't help with analysis.
Neither classic nor standard analyzer will keep punctuation in the indexed form of the field. For each of these examples, the indexed form will be in two terms:
test and vrf
test and ipls
This is why a manually constructed query for "test-" finds nothing. That term does not exist in the index.
These analyzers aim to index words, so punctuation is mostly eliminated and is not searchable. Phrase queries for "test vrf", "test-vrf" and "test_vrf" are all effectively identical. If that is not what you need, you'll need to look at other analyzers.
The way to fix this issue is to store the field value as NOT_ANALYZED:
Field fieldType = new Field(key.toLowerCase(),value, Field.Store.YES, Field.Index.NOT_ANALYZED);
Anyone with the same problem has to take care with how the contents are stored in the index.
To run the search, build the query string like this:
searchString = QueryParser.escape(searchString);
and use, for example, a WhitespaceAnalyzer.

Cleaning SQL Data

What is causing the two fields to be different? Is it a tab or something else? What is an easy way to clean it? I know I can somehow use replace, but I am unsure of what I am replacing, and there are many more records with the same problem.
Name Binary
MCMPAD 0x4D0043004D00500041004400200020
MCMPAD  0x4D0043004D00500041004400A00020
SELECT Name , convert(binary(15), (Name)) Binary from VirtualTerminal
where Name like '%MCMPAD%'
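One string ends with two ordinary spaces (U+0020 U+0020); the other ends with a non-breaking space (U+00A0) followed by an ordinary space. In the binary output these show up as 0x2000 20 and 0xA000 20, because binary(15) cuts the last character off after its first byte. The difference is not visible when you display Name as a string. To clean the data, replace the non-breaking space with an ordinary one, e.g. REPLACE(Name, NCHAR(160), N' ').
The extra zero bytes (a space is stored as 0x2000 rather than 0x20) come from the two-byte, little-endian UCS-2 encoding that is standard on Windows.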
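To check what the trailing bytes actually are, the two hex values from the question can be decoded as little-endian UTF-16 and printed as code points. This is only an illustrative sketch; the class and method names are made up, and the odd trailing byte left by binary(15) is dropped before decoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for inspecting the hex output of convert(binary(15), Name).
final class HexDump {

    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Decode complete 2-byte units only; binary(15) cuts the last character in half.
    static String decodeUtf16Le(String hex) {
        byte[] bytes = fromHex(hex);
        int usable = bytes.length - (bytes.length % 2);
        return new String(bytes, 0, usable, StandardCharsets.UTF_16LE);
    }

    // Render each UTF-16 unit as U+XXXX so invisible characters become visible.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(codePoints(decodeUtf16Le("4D0043004D00500041004400200020")));
        System.out.println(codePoints(decodeUtf16Le("4D0043004D00500041004400A00020")));
    }
}
```

The first six characters of both rows decode to MCMPAD; the seventh is U+0020 in the first row and U+00A0 in the second.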

Rails utf-8 problem

Hi there, I'm new to Ruby (and Rails) and am having some problems when using Swedish letters in strings. In my action I create an instance variable like this:
@title = "Välkommen"
And I get the following error:
invalid multibyte char (US-ASCII)
syntax error, unexpected $end, expecting keyword_end
@title = "Välkommen"
^
What's happening?
EDIT: If I add:
# coding: utf-8
at the top of my controller, it works. Why is that, and how can I solve this "issue"?
See Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
To quote the part that answers this question concisely:
The Single Most Important Fact About Encodings
If you completely forget everything I just explained, please remember
one extremely important fact. It does not make sense to have a string
without knowing what encoding it uses. You can no longer stick your
head in the sand and pretend that "plain" text is ASCII.
This is why you must tell Ruby what encoding your file uses. Since the encoding is not recorded in any metadata associated with the file, software has to assume something (often ASCII) until it is told otherwise. Ruby 1.9 probably does so until it reaches your comment, at which point it stops and restarts reading the file, now decoding it as UTF-8.
Obviously, if you used some other Unicode encoding or some more local encoding for your ruby file, you would need to change the comment to indicate the correct encoding.
The "magic comment" in Ruby 1.9 (which Rails 3 supports) tells the interpreter what encoding to expect. It is important because in Ruby 1.9, every string has an encoding. Prior to 1.9, every string was just a sequence of bytes.
A very good description of the issue is in James Gray's series of blog posts on Ruby and Unicode. The one that is exactly relevant to your question is http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings (but see the others because they are very good).
The important line from the article:
The first is the main rule of source Encodings: source files receive a US-ASCII Encoding, unless you say otherwise.
There are several places that can cause problems with UTF-8 encoding.
Here are some tricks to solve this problem:
make sure that every file in your project is utf-8 based (if you
are using rad rails, this is simple to accomplish: mark your project,
select properties, in the "text-file-encoding" box, select "other:
utf-8")
Be sure to put in your strange "å,ä,ö" characters in your files again
or you'll get a mysql error, because it will change your "å,ä,ö" to a
"square" (unknown character)
in your database.yml, set the encoding for each server environment (in this
example "development" with mysql)
development:
  adapter: mysql
  encoding: utf8
set a before filter in your application controller
(application.rb):
class ApplicationController < ActionController::Base
  before_filter :set_charset

  # Send every response as UTF-8 HTML.
  def set_charset
    headers["Content-Type"] = "text/html; charset=utf-8"
  end
end
be sure to set the encoding to utf-8 in your mysql (I've only used
mysql.. so I don't know about other databases) for every table. If you
use mySQL Administrator you can do like this: edit table, press the
"table option" tab, change charset to "utf8" and collation to
"utf8_general_ci"
(Courtesy: kombatsanta)