Indexing multilingual words in Lucene

I am trying to index in Lucene a field that can contain RDF literals in different languages.
Most of the approaches I have seen so far are:
Use a single index, where each document has a field for each language it uses, or
Use M indexes, M being the number of languages in the corpus.
Lucene 2.9+ has a feature called Payloads that allows attaching attributes to terms. Has anyone used this mechanism to store language information (or other attributes such as datatypes)? How does its performance compare to the two other approaches? Any pointers to source code showing how it is done would help. Thanks.

It depends.
Do you want to allow something like: "Search all English text for 'foo'"? If so, then you will need one field per language.
Or do you want "Search all text for 'foo' and present the user with which language the match was found in?" If this is what you want, then either payloads or separate fields will work.
An alternative way to do it is to index all your text in one field, then have another field saying the language of the document. (Assuming each document is in a single language.) Then your search would be something like +text:foo +language:english.
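As a rough illustration of that single-text-field-plus-language-field layout, here is a minimal sketch against a recent Lucene release; the field names "text" and "language" are just the ones used in the example query above, and the 2.9/3.x API differs slightly (Field/Version-based constructors).

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class MultilingualFieldSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("text", "foo bar baz", Field.Store.YES));     // all text, analyzed
            doc.add(new StringField("language", "english", Field.Store.YES));   // one language per document
            writer.addDocument(doc);
        }
        try (IndexReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Equivalent of +text:foo +language:english
            Query query = new BooleanQuery.Builder()
                    .add(new TermQuery(new Term("text", "foo")), BooleanClause.Occur.MUST)
                    .add(new TermQuery(new Term("language", "english")), BooleanClause.Occur.MUST)
                    .build();
            System.out.println(searcher.search(query, 10).totalHits);
        }
    }
}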
In terms of efficiency: you probably want to avoid payloads, since you would have to repeat the name of the language for every term, and you can't search based on payloads (at least not easily).

Basically, Lucene is a ranking engine: it just looks at strings and compares them to other strings. They can be encoded in different character encodings, but their similarity is computed the same way nonetheless. Just make sure you load the SnowballAnalyzer with the stemmer for the supported language, say Spanish or Chinese, and you should get results.
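A hedged sketch of wiring that up with the Lucene 3.x contrib classes this answer has in mind (SnowballAnalyzer was later deprecated and removed in Lucene 5 in favour of per-language analyzers such as SpanishAnalyzer):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class SnowballIndexSketch {
    public static void main(String[] args) throws Exception {
        // Analyzer backed by the Spanish Snowball stemmer (Lucene 3.6-era API).
        Analyzer spanish = new SnowballAnalyzer(Version.LUCENE_36, "Spanish");
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_36, spanish));
        // ... add documents here ...
        writer.close();
    }
}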

Related

Lucene: combining ASCII folding and stemming for French

I am implementing a Lucene search for French text. The search must work regardless of whether the user has typed accents or not, and it must also support stemming. I am currently using the Snowball-based French stemmer in Lucene 3.
On the indexing side, I have added an ASCIIFoldingFilter into my analyzer, which runs after the stemmer.
However, on the search side, the operation is not reversible: the stemmer only works if the input contains accents. For example, it stems the ité from the end of université, but given a user search input of universite, the stemmer returns universit during query analysis. Of course, since the index contains the term univers, the search for universit returns no results.
A solution seems to be to change the order of stemming and folding in the analyzer: instead of stemming and then folding, do the folding before stemming. This effectively makes the operation reversible, but also significantly hobbles the stemmer since many words no longer match the stemming rules.
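For reference, that fold-before-stem ordering is just a matter of where ASCIIFoldingFilter sits in the token filter chain. A minimal sketch (constructors follow a recent Lucene analysis-common release; the 3.x equivalents additionally take a Version argument):

import java.io.StringReader;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class FoldThenStemSketch {
    public static void main(String[] args) throws Exception {
        Tokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader("université universite"));
        TokenStream ts = new LowerCaseFilter(tokenizer);
        ts = new ASCIIFoldingFilter(ts);        // fold first, so accented and unaccented input look the same...
        ts = new SnowballFilter(ts, "French");  // ...before the French stemmer runs
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString()); // both inputs now yield the same (possibly weaker) stem
        }
        ts.end();
        ts.close();
    }
}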
Alternatively, the stemmer could be modified to operate on folded input i.e. ignore accents, but could this result in over-stemming?
Is there a way to effectively do folded searches without changing the behavior of the stemming algorithm?
Step 1.) Use an exhaustive lemma synonym mapping file
Step 2.) ASCII (ICU) Fold after lemmatizing.
You can get exhaustive French lemmas here:
http://www.lexiconista.com/datasets/lemmatization/
Also, because lemmatizers are not destructive the way stemmers are, you can apply the lemmatizer multiple times; if your lemma list also contains accent-free normalizations, just apply the lemmatizer again.
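A hedged sketch of steps 1 and 2 using Lucene's FST-based synonym filter to apply the lemma mappings before folding. The two mappings below are made-up examples; in practice you would load the whole lemmatization file into the SynonymMap.Builder. Class names follow Lucene 8.x; newer releases replace SynonymFilter with SynonymGraphFilter, which takes the same SynonymMap.

import java.io.StringReader;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.CharsRef;

public class LemmatizeThenFoldSketch {
    public static void main(String[] args) throws Exception {
        // Step 1: lemma mapping (inflected form -> lemma); load these from the lemmatization file.
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        builder.add(new CharsRef("universités"), new CharsRef("université"), false);
        builder.add(new CharsRef("universite"), new CharsRef("université"), false);
        SynonymMap lemmas = builder.build();

        Tokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader("universités universite"));
        TokenStream ts = new LowerCaseFilter(tokenizer);
        ts = new SynonymFilter(ts, lemmas, true); // apply lemmas...
        ts = new ASCIIFoldingFilter(ts);          // Step 2: ...then ASCII-fold the result
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());  // prints "universite" for both inputs
        }
        ts.end();
        ts.close();
    }
}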

Rails 3 Sunspot Fulltext Search Usage

So I've implemented the sunspot_rails gem in my application to make use of the powerful Solr search engine. I recently watched Ryan's railscast on full-text searching and noticed he was using additional operators in his search queries, such as "-" to denote something that should NOT be included in the full-text search.
I had never heard about this until now, and I was wondering whether there is a user-friendly usage guide somewhere that both I and my users can reference to take the search functionality to its maximum capability.
I think ideally I would like to make an abridged version, similar to GitHub's Markdown cheat sheet, for my search forms that users can quickly reference.
Sunspot uses Solr's DisMax Query Parser, which has a very simple query syntax. For the most part, it is intended to flexibly parse user-created queries.
DisMax recognizes three special characters: +, -, and ". From the documentation:
[DisMax] is designed to support raw input strings provided by users, with no special escaping. '+' and '-' characters are treated as "mandatory" and "prohibited" modifiers for the subsequent terms. Text wrapped in balanced quote characters '"' is treated as a phrase; any query containing an odd number of quote characters is evaluated as if there were no quote characters at all.
There are a few other "behind the scenes" options to tune the relevancy of matched documents. For example, "minimum match" specifies the number or proportion of "optional" clauses (i.e., terms not prefixed with - or +) which must match. There are also options to boost term matches in specific fields, to boost term matches that occur in close proximity to each other, and so on.
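These are ordinary Solr request parameters under the hood. A hedged SolrJ sketch (not Sunspot; the field names title/body and the parameter values are illustrative) of what a DisMax request looks like at that level:

import org.apache.solr.client.solrj.SolrQuery;

public class DismaxParamsSketch {
    public static void main(String[] args) {
        SolrQuery q = new SolrQuery();
        q.set("defType", "dismax");
        q.setQuery("+rails -java \"full text search\""); // +, - and "..." as described above
        q.set("qf", "title^2 body"); // query fields, with a boost on title
        q.set("pf", "body");         // phrase field: boosts docs where the terms appear close together
        q.set("mm", "2");            // minimum match: at least two optional clauses must match
        System.out.println(q);       // prints the encoded parameter string
    }
}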
In Sunspot, these are all exposed in the options parameter to the fulltext method, or as methods within a block supplied to that method.

Lucene character sequence search in a term

I want to find a character sequence (more than three characters) within a term. I have tried *character_sequence* (I know this is not recommended), but it does not return a result if the character sequence itself is equal to the term.
For example, if the terms are "testsomething", "somethingtest" and "sometestthing", I want all these Terms in my search result if the text "test" is searched.
Is there any way to do it?
Thanks!
Prefix queries are supported in Lucene by default; to support suffix (and infix) queries you might have to do a little more work. You can check How to query lucene with "like" operator?
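If index size is not a concern, the quickest (though slow-to-execute) route is a wildcard query with a leading *, which also matches the case where the term is exactly "test". A hedged sketch against a recent Lucene release; the field name "content" is illustrative:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class InfixSearchSketch {
    public static void main(String[] args) throws Exception {
        // Programmatic form: matches testsomething, somethingtest, sometestthing and test itself,
        // because * matches zero or more characters.
        Query direct = new WildcardQuery(new Term("content", "*test*"));

        // Query-parser form: leading wildcards are rejected unless explicitly enabled.
        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        parser.setAllowLeadingWildcard(true);
        Query parsed = parser.parse("*test*");

        System.out.println(direct + " | " + parsed);
        // Caveat: a leading wildcard enumerates the whole term dictionary; for large indexes,
        // indexing n-grams (e.g. with NGramTokenFilter) is the usual faster alternative.
    }
}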

How to convert foreign characters to English characters in SQL Query?

I have to create a SQL function that converts special characters and international characters (French, Chinese, ...) to English.
Is there any built-in function in SQL that I can use?
Thanks for your help.
If you are after English names for the characters, that is an achievable goal, as they all have published names as part of the Unicode standard.
See for example:
http://www.unicode.org/ucd/
http://www.unicode.org/Public/UNIDATA/
Your task then is simply to turn the list of Unicode characters into a table with 100,000 or so rows. Unfortunately, the names you get will be things like ARABIC LIGATURE LAM WITH MEEM MEDIAL FORM.
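As an aside, if the lookup does not have to live inside the database, the JDK already bundles the published Unicode names, so a quick check (a Java sketch, not a SQL function) looks like this:

public class UnicodeNameSketch {
    public static void main(String[] args) {
        // Prints the published Unicode name for each code point.
        "résumé".codePoints().forEach(cp ->
                System.out.println(new String(Character.toChars(cp)) + " -> " + Character.getName(cp)));
        // e.g. é -> LATIN SMALL LETTER E WITH ACUTE
    }
}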
On the other hand, if you want to actually translate the meaning, you need to be looking at machine translation software. Both Microsoft and Google have well-known cloud translation offerings and there are several other well-thought of products too.
I think the short answer is you can't unless you narrow your requirements a lot. It seems you want to take a text sample, A, and convert it into romanized text B.
There are a few problems to tackle:
Languages are typically not romanized on a single character basis. The correct pronunciation of a character is often dependent on the characters and words around it, and can even have special rules for just one word (learning English can be tough because it is filled with these, having borrowed words from many languages without normalizing the spelling).
Even if you code rules for every language you want to support you still have homographs, words that are spelled using exactly the same characters, but that have different pronunciations (and thus romanization) depending on what was meant - for example "sow" meaning a pig, or "sow" (where the w is silent) meaning to plant seeds.
And then you get into the problem of which language you are romanizing: characters and even words are not unique to one language, but the actual meaning and romanization can vary. The fact that many languages include loan words from the languages they share characters with complicates any attempt to automatically determine which language you are trying to romanize.
Given all these difficulties, what it is you actually want to achieve (what problem are you solving)?
You mention French among the languages you want to "convert" into English, yet French (with its accented characters) is already written in the Roman alphabet. Even everyday words used in English occasionally make use of accented characters, though these are rare enough that the meaning and pronunciation are understood even if they are omitted (e.g. résumé).
Is your problem really that you can't store Unicode/extended ASCII? There are numerous ways to correct or work around that.
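If that is indeed the underlying problem and the texts are limited to accented Latin scripts (so not Chinese), a common workaround is to strip the combining marks rather than translate anything. A minimal Java sketch:

import java.text.Normalizer;

public class AccentStripSketch {
    public static void main(String[] args) {
        String input = "résumé café naïve";
        // Decompose (NFD) so accents become separate combining marks, then drop the marks.
        String folded = Normalizer.normalize(input, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
        System.out.println(folded); // resume cafe naive
    }
}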

MySQL query: it's not fetching the first result

I have the following values in my database:
been Lorem Ipsum and scrambled ever
scrambledtexttextofandtooktooktypetexthastheunknownspecimenstandardsincetypesett
Here is my query:
SELECT
nBusinessAdID,
MATCH (`sHeadline`) AGAINST ("text" IN BOOLEAN MODE) AS score
FROM wiki_businessads
WHERE MATCH (`sHeadline`) AGAINST ("text" IN BOOLEAN MODE)
AND bDeleted ="0" AND nAdStatus ="1"
ORDER BY score DESC, bPrimeListing DESC, dDateCreated DESC
It's not fetching the first result. Why? It should fetch the first result because it contains the word "text". I have disabled stopword filtering.
This one is also not working:
SELECT
nBusinessAdID,
MATCH (`sHeadline`) AGAINST ('"text"' IN BOOLEAN MODE) AS score
FROM wiki_businessads
WHERE MATCH (`sHeadline`) AGAINST ('"text"' IN BOOLEAN MODE)
AND bDeleted ="0" AND nAdStatus ="1"
ORDER BY score DESC, bPrimeListing DESC, dDateCreated DESC
MySQL's full-text search only matches words and word prefixes. Because your data in the database does not contain word boundaries (spaces), the words are not indexed, so they are not found.
Some possible choices you could make are:
Fix your data so that it contains spaces between words.
Use LIKE '%text%' instead of a full text search.
Use an external full-text search engine.
I will expand on each of these in turn.
Fix your data so that it contains spaces between words.
Your data seems to have been corrupted somehow. It looks like words or sentences but with all the spaces removed. Do you know how that happened? Was it intentional? Perhaps there is a bug elsewhere in the system. Try to fix that. Find out where the data came from and see if it can be reimported correctly.
If the original source doesn't contain spaces, perhaps you could use some natural language toolkit to guess where the spaces should be and insert them. There most likely already exist libraries that can do this, although I don't happen to know any. A Google search might find something.
Use LIKE '%text%' instead of a full text search.
A workaround is to use LIKE '%text%' instead, but note that this will be much slower because it cannot use the index. However, it will give the correct result.
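For illustration, the LIKE variant of the query in the question might look like this from application code. This is a hedged JDBC sketch: the connection details are placeholders, and the relevance score column disappears because LIKE has no scoring.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class LikeSearchSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT nBusinessAdID FROM wiki_businessads " +
                     "WHERE sHeadline LIKE ? AND bDeleted = '0' AND nAdStatus = '1' " +
                     "ORDER BY bPrimeListing DESC, dDateCreated DESC")) {
            ps.setString(1, "%text%"); // leading % forces a full scan; no index can be used
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("nBusinessAdID"));
                }
            }
        }
    }
}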
Use an external full-text search engine.
You could also look at Lucene or Sphinx. For example I know that Sphinx supports finding text using *text*. Here is an extract from the documentation which explains how to enable infix searching, which is what you need.
9.2.16. min_infix_len
Minimum infix prefix length to index. Optional, default is 0 (do not index infixes).
Infix indexing allows implementing wildcard searching by 'start*', '*end', and 'middle' wildcards (refer to the enable_star option for details on wildcard syntax). When the minimum infix length is set to a positive number, the indexer will index all the possible keyword infixes (i.e. substrings) in addition to the keywords themselves. Too-short infixes (below the minimum allowed length) will not be indexed.
For instance, indexing a keyword "test" with min_infix_len=2 will result in indexing "te", "es", "st", "tes", "est" infixes along with the word itself. Searches against such index for "es" will match documents that contain "test" word, even if they do not contain "es" on itself. However, indexing infixes will make the index grow significantly (because of many more indexed keywords), and will degrade both indexing and searching times.