SQL fuzzy search and Google-like improvements - sql

Interesting challenge; my client enters some product information in a SQL database. The product is a painting of a famous old Russian composer called Rachmaninoff. So that name is in the description field. Now, only a few of their customers searching for products know exactly how to spell this name, but most of the time it's misspelled. Besides misspelling there are also a lot of international customers who just write this name completely different like, Rachmaninow, Rahmaninov, Рахманінаў.
If i put any of these misspellings or translations in Google it (almost) always knows how to correct it and to redirect me straight to the right page.
Does anyone know what my possibilities are to get some of this magic in my product search? Are there some API's i can use? Some super free text option that i don't know of? Or ...

We solved a similar problem with quite some success: Searching for people (german names) by name given over phone.
E.g.: The very common german last names "Schmidt", "Schmitt", "Schmied", "Schmid", "Schmit" and "Schmiedt" will be all but impossible to hold apart when given by a voice. Combine this with a first name of "Sylvia" or "Silvia" or "Sylvya" and a caller saying "Hi, I'm Sylvia Schmidt, I have forgotten my customer number" has no chance of being quickly found.
Our solution was to put up a list of synophones, e.g. (in pseudo code, for german):
{consonant}+ := {consonant}
ie := i
ii := i
dt* := t
y|j := i
{vocal}v := {vocal}f
etc., you get the drift. Now we stored the synophone-translated strings with the original strings to make search possible. This works really well.
I understand that MySQL has the Soundex() function for English strings. I would expect MSSQL to have something similar.

Related

Lucene part:123 vs part 123 vs part123

I'm not too familiar with Lucene so my apologies if this isn't clear or I'm getting my terms/nomenclature mixed up.
So I have a requirement where a field containing text (example part:123) should be able to be found via the following:
part:123
part 123
part123
Now my understanding of StandardAnalyzer is that it will break the word "part:123" into terms "part" and "123".
So with that, I'm able to search with part:123 or part 123, but because they're two different terms "part123" won't work.
It seems to me like I'd also need to get the indexer to add another term where both are combined, so it'd be "part", "123", "part123".
Is this the right way to accomplish this -- and does anyone know how I'd go about implementing this?
Thanks!

How to request a page title in a foreign language using wikipedia API

I am trying to use a simple GET request
"https://en.wikipedia.org/w/api.php?action=query&titles=音乐&prop=langlinks&lllimit=500"
but with chinese characters in the title wikipedia can't find the page even though it exists https://zh.wikipedia.org/wiki/%E9%9F%B3%E4%B9%90
I have tried URL encoding the chinese word, I tried both simplified and traditional. I tried giving it unicode in ascii like "\u97f3\u4e50"
Does anyone know how to do this?
I solved it.
There are a few things to remember when doing this:
You need to use the wikipedia of your target language. so in this case
zh.wikipedia.org
Chinese wikipedia externally displays the charset of the users region (simplified for mainland, standard for Taiwan). But internally it depends on who wrote the article. The title in your api query must be in the original character set of the person who created it. So for Music, 音樂 will not work and you must use the simplified 音乐. But for Notebook computer the simplified 笔记本 will not work and you must use 筆記本. You have no choice but to try both. .NET includes a set of methods for converting between the two character sets.

How exact phrase search is performed by a Search Engine?

I am using Lucene to search in a Data-set, I need to now how "" search (I mean exact phrase search) mechanism has been implemented?
I want to make it able to result all "little cat" hits when the user enters "littlecat". I now that I should manipulate the indexing code, but at least I should now how the "" search works.
I want to make it able to result all "little cat" hits when the user enters "littlecat"
This might sound easy but it is very tough to implement. For a human being little and cat are two different words but for a computer it does not know little and cat seperately from littlecat, unless you have a dictionary and your code check those two words in dictionary. On the other hand searching for "little cat" can easily search for "littlecat" aswell. And i believe that this goes beyong the concept of an exact phrase search. Exact phrase search will only return littlecat if you search for "littlecat" and vice versa. Even google seemingly (expectedly too), doesnt return "little cat" on littlecat search
A way to implement this is Dynamic programming - using a dictionary/corpus to compare your individual words against(and also the left over words after you have parsed the text into strings).
Think of it like you were writing a custom spell-checker or likewise. In this, there's also a scenario when more than one combination of words may be left over eg -"walkingmydoginrain" - here you could break the 1st word as "walk", or as "walking" , and this is the beauty of DP - since you know (from your corpus) that you can't form legitimate words from "ingmydoginrain" (ie rest of the string - you have just discovered that in this context - you should pick the segmented word as "Walking" and NOT walk.
Also think of it like not being able to find a match is adding to a COST function that you define, so you should get optimal results - meaning you can be sure that your text(un-separated with white spaces) will for sure be broken into legitimate words- though there may be MORE than one possible word sequences in that line(and hence, possibly also intent of the person seeking this)
You should be able to find pretty good base implementations over the web for your use case (read also : How does Google implement - "Did you mean" )
For now, see also -
How to split text without spaces into list of words?

TSearch2 - dots explosion

Following conversion
SELECT to_tsvector('english', 'Google.com');
returns this:
'google.com':1
Why does TSearch2 engine didn't return something like this?
'google':2, 'com':1
Or how can i make the engine to return the exploded string as i wrote above?
I just need "Google.com" to be foundable by "google".
Unfortunately, there is no quick and easy solution.
Denis is correct in that the parser is recognizing it as a hostname, which is why it doesn't break it up.
There are 3 other things you can do, off the top of my head.
You can disable the host parsing in the database. See postgres documentation for details. E.g. something like ALTER TEXT SEARCH CONFIGURATION your_parser_config
DROP MAPPING FOR url, url_path
You can write your own custom dictionary.
You can pre-parse your data before it's inserted into the database in some manner (maybe splitting all domains before going into the database).
I had a similar issue to you last year and opted for solution (2), above.
My solution was to write a custom dictionary that splits words up on non-word characters. A custom dictionary is a lot easier & quicker to write than a new parser. You still have to write C tho :)
The dictionary I wrote would return something like 'www.facebook.com':4, 'com':3, 'facebook':2, 'www':1' for the 'www.facebook.com' domain (we had a unique-ish scenario, hence the 4 results instead of 3).
The trouble with a custom dictionary is that you will no longer get stemming (ie: www.books.com will come out as www, books and com). I believe there is some work (which may have been completed) to allow chaining of dictionaries which would solve this problem.
First off in case you're not aware, tsearch2 is deprecated in favor of the built-in functionality:
http://www.postgresql.org/docs/9/static/textsearch.html
As for your actual question, google.com gets recognized as a host by the parser:
http://www.postgresql.org/docs/9.0/static/textsearch-parsers.html
If you don't want this to occur, you'll need to pre-process your text accordingly (or use a custom parser).

Handling Grammar / Spelling Issues in Translation Strings

We are currently implementing a Zend Framework Project, that needs to be translated in 6 different languages. We already have a pretty sophisticated translation system, based on Zend_Translate, which also handles variables in translation keys.
Our project has a new Turkish translator, and we are facing a new issue: Grammar, especially Turkish one. I noticed that this problem might be evident in every translation system and in most languages, so I posted a question here.
Question: Any ideas how to handle translations like:
Key: I have a[n] {fruit}
Variables: apple, banana
Result: I have an apple. I have a banana.
Key: Stimme für {user}[s] Einsendung
Variables: Paul, Markus
Result: Stimme für Pauls Einsendung,
Result: Stimme für Markus Einsendung
Anybody has a solution or idea for this? My only guess would be to avoid this by not using translations where these issues occur.
How do other platforms handle this?
Of course the translation system has no idea which type of word it is placing where in which type of Sentence. It only does some string replacements...
PS: Turkish is even more complicated:
For example, on a profile page, we have "Annie's Network". This should translate as "Annie'nin Aği".
If the first name ends in a vowel, the suffix will start with an n and look like "Annie'nin"
If the first name ends in a consonant, it will not have the first n, and look like "Kris'in"
If the last vowel is an a or ı, it will look like "Dan'ın"; or Seyma'nın"
If the last vowel is an o or u, it will look like "Davud'un"; or "Burcu'nun"
If the last vowel is an e or i, it will look like "Erin'in"; or "Efe'nin"
If the last vowel is an ö or ü, it will look like "Göz'ün'; or "Iminönü'nün"
If the last letter is a k (like the name "Basak"), it will look like "Basağın"; or "Eriğin"
It is actually very hard problem, as grammar rules are different even among languages from the same family. I don't think you could easily do anything for let's say Slavic languages...
However, if you want to solve this problem (because this is extra challenging) and you are looking for creative (cross inspiring) ways to do that, you might want to look into something called ChoiceFormat (example would be one from ICU Project) or you can look up GNU Gettext's solution for plural forms problem.
ICU (mentioned above) has a SelectFormat http://site.icu-project.org/design/formatting/select that may be of help- it's like a choice format but with arbitrary keywords. Also, it does have a PluralFormat which already has rules for many language's plural rules.