Rails 3 Sunspot Fulltext Search Usage - ruby-on-rails-3

So I've implemented the sunspot_rails gem in my application to make use of the Solr search engine. I recently watched Ryan's railscast on full-text searching and noticed he was using extra operators in his search queries, such as "-" to denote something that should NOT be included in the full-text search.
I had never heard of this until now, and I was wondering whether there is a user-friendly usage guide somewhere that both my users and I can reference, so I can take my search functionality to its maximum capability.
Ideally I would like to put together an abridged version, similar to GitHub's Markdown cheat sheet, that users of my search forms can quickly reference.

Sunspot uses Solr's DisMax query parser, which has a very simple query syntax. For the most part, it is intended to flexibly parse queries written by end users.
DisMax recognizes three special characters: +, -, and ". For example, the query +rails -php "full text search" requires the term rails, prohibits the term php, and treats full text search as a phrase. From the documentation:
[DisMax] is designed to support raw input strings provided by users, with no special escaping. '+' and '-' characters are treated as "mandatory" and "prohibited" modifiers for the subsequent terms. Text wrapped in balanced quote characters '"' is treated as a phrase; any query containing an odd number of quote characters is evaluated as if there were no quote characters at all.
There are also a few "behind the scenes" options for tuning the relevancy of matched documents. For example, "minimum match" specifies the number or proportion of "optional" clauses (i.e., terms not prefixed with - or +) that must match, and there are options to boost matches in specific fields, to boost terms that appear close to each other, and so on.
In Sunspot, these are all exposed in the options parameter to the fulltext method, or as methods within a block supplied to that method.
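Sunspot builds the underlying Solr request for you, so in Rails you would set these through the fulltext options rather than by hand. Purely to illustrate what a DisMax request with per-field boosts and a minimum-match setting looks like at the Solr level, here is a rough sketch using Solr's Java client, SolrJ (8.x assumed); the core URL and the field names title and body are made-up examples rather than anything Sunspot defines:
import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DisMaxIllustration {
    public static void main(String[] args) throws SolrServerException, IOException {
        // Hypothetical core URL; Sunspot manages this connection for you in Rails.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/posts").build();

        SolrQuery query = new SolrQuery();
        // Raw user input: "maple syrup" is treated as a phrase, -bacon is
        // prohibited, and the remaining terms are optional.
        query.setQuery("pancakes waffles \"maple syrup\" -bacon");
        query.set("defType", "dismax");    // use the DisMax query parser
        query.set("qf", "title^2.0 body"); // search these fields, boosting title
        query.set("mm", "2");              // at least two optional clauses must match

        QueryResponse response = solr.query(query);
        System.out.println(response.getResults().getNumFound() + " documents matched");
        solr.close();
    }
}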

Related

GraphDB Lucene Connector - uppercase problem when indexing URIs

There seems to be a strange behaviour with GraphDB Lucene connectors when a URI is indexed (either because of using the $self quasiproperty or because the property chain leads to a URI). I would summarize the issue as follows:
Uppercase letters in the URI must be escaped in the query text (i.e. query text must be "*\Merlo" instead of "*Merlo" in the wine example provided here)
No snippet can be extracted from URIs
Any idea how this could be overcome?
The Lucene connector treats URIs as non-analyzed fields, i.e. as a single chunk of string that is not tokenized or analyzed into words. The logic is that URIs are identifiers and have meaning only in their entirety; they don't contain text, even if they sometimes make sense to people. This also means that normal full-text queries will not work on such fields, and no snippets can be extracted from them. They can be searched, but since the matching is not full-text, they might behave unexpectedly.
In your particular example with a query like "*Merlo", Lucene runs the query through the analyzer in order to be able to match analyzed fields (i.e. the fields normally used for full-text search). By escaping the capital letter you prevent the analyzer from normalizing it to a lowercase m, and so you get a match.
If you need an exact match you can do this too (note there's no need to escape the capital letters):
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
SELECT * {
  ?search a inst:my_index ;
      # Surround with double quotes to force an exact match
      :query "\"http://www.ontotext.com/example/wine#Merlo\"" ;
      :entities ?entity .
}

Lucene: combining ASCII folding and stemming for French

I am implementing a Lucene search for French text. The search must work regardless of whether the user has typed accents or not, and it must also support stemming. I am currently using the Snowball-based French stemmer in Lucene 3.
On the indexing side, I have added an ASCIIFoldingFilter into my analyzer, which runs after the stemmer.
However, on the search side the operation is not reversible: the stemmer only works if the input contains its accents. For example, it stems the ité from the end of université, but with a user search input of universite the stemmer produces universit during query analysis. Since the index contains the term univers, the search for universit returns no results.
A solution seems to be to change the order of stemming and folding in the analyzer: instead of stemming and then folding, fold before stemming. This effectively makes the operation reversible, but it also significantly hobbles the stemmer, since many words no longer match the stemming rules.
Alternatively, the stemmer could be modified to operate on folded input, i.e. to ignore accents, but could this result in over-stemming?
Is there a way to effectively do folded searches without changing the behavior of the stemming algorithm?
Step 1.) Use an exhaustive lemma synonym mapping file
Step 2.) ASCII (ICU) Fold after lemmatizing.
You can get exhaustive French lemmas here:
http://www.lexiconista.com/datasets/lemmatization/
Also, because lemmatizers are not destructive the way stemmers are, you can apply the lemmatizer multiple times; if your lemma list also contains accent-free normalizations, just apply the lemmatizer again.
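A rough Lucene sketch of that chain (Lucene 8.x assumed), with the lemma list converted into Solr synonym-file rules of the form inflected_form => lemma; the class name, the file-format conversion and the specific filter choices here are my assumptions rather than part of the original answer:
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SolrSynonymParser;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;

// Hypothetical analyzer: lemmatize via a synonym map first, fold accents last,
// and use the same analyzer at index and query time so both sides agree.
public class FrenchLemmaFoldingAnalyzer extends Analyzer {
    private final SynonymMap lemmas;

    public FrenchLemmaFoldingAnalyzer(SynonymMap lemmas) {
        this.lemmas = lemmas;
    }

    // Build the synonym map from a file of lines like "universites, universités => université".
    public static SynonymMap loadLemmas(String path) throws Exception {
        SolrSynonymParser parser = new SolrSynonymParser(true, true, new StandardAnalyzer());
        try (Reader reader = Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
            parser.parse(reader);
        }
        return parser.build();
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        // Map inflected forms to their lemma. (If the file contained multi-word
        // synonyms you would also add a FlattenGraphFilter at index time.)
        stream = new SynonymGraphFilter(stream, lemmas, true);
        // Strip accents only after lemmatizing, so the lemma rules still see them.
        stream = new ASCIIFoldingFilter(stream);
        return new TokenStreamComponents(source, stream);
    }
}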

SQL2008 fulltext index search without word breakers

I am trying to search an FTI using CONTAINS for Twitter-style usernames, e.g. #username, but word breakers ignore the # symbol. Is there any way to disable word breakers? From my research, there is a way to create a custom word breaker DLL, install it and assign it, but that all seems a bit involved and, frankly, over my head. I disabled stop words so that dashes are not ignored, but I need that # symbol. Any ideas?
You're not going to like this answer, but full-text indexes only keep the characters _ and ` while indexing; all other characters are ignored and words are split where they occur. This is mainly because full-text indexes are designed to index large documents, where only proper words are considered in order to make the search more refined.
We faced a similar problem. To solve it we had a translation table where characters like #, -, and / were replaced with special sequences like '`at`', '`dash`', '`slash`', etc. When searching the full text, you have to replace those characters in the search string with the same special sequences before running the search. This takes care of the special characters.
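Since the translation has to happen in application code on both the write path and the search path, here is a minimal sketch of that step; the placeholder sequences, class name and column name below are made up for illustration:
import java.util.LinkedHashMap;
import java.util.Map;

public final class FullTextTranslator {

    // Map each special character to a placeholder word. The exact spellings
    // here are made up; they only need to survive the word breaker and be
    // unlikely to appear in real text.
    private static final Map<String, String> TRANSLATIONS = new LinkedHashMap<>();
    static {
        TRANSLATIONS.put("#", " xhashx ");
        TRANSLATIONS.put("-", " xdashx ");
        TRANSLATIONS.put("/", " xslashx ");
    }

    // Apply this both to the text before it is stored in the indexed column
    // and to the user's search string before building the CONTAINS predicate,
    // e.g. WHERE CONTAINS(TweetText, '"xhashx username"') -- column name is hypothetical.
    public static String translate(String text) {
        String result = text;
        for (Map.Entry<String, String> entry : TRANSLATIONS.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(translate("#username")); // prints " xhashx username"
    }
}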

Lucene character sequence search in a term

I want to find a character sequence (more than three characters) within a term. I have tried *character_sequence* (I know this is not recommended), but it does not return results when the character sequence is itself equal to the term.
For example, if the terms are "testsomething", "somethingtest" and "sometestthing", I want all these Terms in my search result if the text "test" is searched.
Is there any way to do it?
Thanks!
Prefix queries are supported in Lucene by default; to support suffix (or infix) queries you have to do a little extra work. You can check How to query lucene with "like" operator?
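For an infix match like the one in the question, a wildcard on both sides is the quick route, at the cost of a scan over the term dictionary; a minimal sketch (Lucene 8.x assumed, and the field name "name" is made up):
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class InfixQueryExample {
    public static void main(String[] args) throws Exception {
        // Programmatic form: "*test*" matches "testsomething", "somethingtest",
        // "sometestthing" and also the bare term "test", since * matches zero
        // or more characters. It is slow on large indexes; indexing character
        // n-grams of each term is the scalable alternative.
        Query infix = new WildcardQuery(new Term("name", "*test*"));

        // The same query through the classic QueryParser; leading wildcards
        // are disabled by default and must be switched on explicitly.
        QueryParser parser = new QueryParser("name", new StandardAnalyzer());
        parser.setAllowLeadingWildcard(true);
        Query parsed = parser.parse("*test*");

        System.out.println(infix + " / " + parsed);
    }
}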

Indexing multilingual words in lucene

I am trying to index in Lucene a field that can hold RDF literals in different languages.
Most of the approaches I have seen so far are:
Use a single index, where each document has a field per each language it uses, or
Use M indexes, M being the number of languages in the corpus.
Lucene 2.9+ has a feature called payloads that allows attributes to be attached to terms. Has anyone used this mechanism to store language information (or other attributes such as datatypes)? How does its performance compare to the other two approaches? Any pointers to source code showing how it is done would help. Thanks.
It depends.
Do you want to allow something like: "Search all english text for 'foo'"? If so, then you will need one field per language.
Or do you want "Search all text for 'foo' and present the user with which language the match was found in?" If this is what you want, then either payloads or separate fields will work.
An alternative way to do it is to index all your text in one field, then have another field saying the language of the document. (Assuming each document is in a single language.) Then your search would be something like +text:foo +language:english.
In terms of efficiency: you probably want to avoid payloads, since you would have to repeat the name of the language for every term, and you can't search based on payloads (at least not easily).
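A minimal Lucene sketch of that last approach (the field names and example values are mine): one analyzed text field plus a keyword-style language field, queried with two MUST clauses, which is the programmatic form of +text:foo +language:english.
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class LanguageFieldExample {
    public static void main(String[] args) {
        // Indexing side: all text goes into one analyzed field, and the
        // document's language goes into a non-analyzed keyword field.
        Document doc = new Document();
        doc.add(new TextField("text", "the quick brown fox", Field.Store.YES));
        doc.add(new StringField("language", "english", Field.Store.YES));

        // Query side: the programmatic equivalent of +text:foo +language:english.
        Query query = new BooleanQuery.Builder()
                .add(new TermQuery(new Term("text", "foo")), BooleanClause.Occur.MUST)
                .add(new TermQuery(new Term("language", "english")), BooleanClause.Occur.MUST)
                .build();

        System.out.println(query);
    }
}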
Basically, Lucene is a ranking engine: it just looks at strings and compares them to other strings. They can be in different character encodings, but their similarity is computed the same way nonetheless. Just make sure you load the SnowballAnalyzer with the stemmer for a supported language, say Spanish or Chinese, and you should get results.