Sitecore 7 + Lucene: Query-Time Boosting: how? - lucene

for items of a certain template, our users can indicate that the item should be shown on top of the list.
For this, we have added a field in the index "ShowOnTop".
Now when searching for items of this template (to build the list page), we would like to have these "ShowOnTop" items to effectively be returned on top of the other items.
This field however should not affect other site search (general search).
We think this could be possible by applying Query-Time Boosting to these documents. But, how can we achieve this?

To do boosting at query-time simply use Boost(value) method (using a search predicate as it sounds like you might be doing some advanced searching where added flexibility of predicates might come in handy) -
var queryPredicate = PredicateBuilder.True<SearchResult>();
queryPredicate = queryPredicate.And(i =>
i.Headline.Contains(model.Query).Boost(50));

Probably the best way would be to apply a Sort based on that field, something along the lines of:
Sort sort = new Sort(new SortField("ShowOnTop", SortField.STRING, true), true);
var hits = new SearchHits(context.Searcher.Search(query, sort));
You could also add it as a heavily boosted optional query term, something along the lines of, and make the rest of the query is required (as a whole), like:
ShowOnTop:true^10000 +(the rest of the query)
With a large enough boost factor, those terms should always come up first unless there is a really drastic difference in relevance.

Easiest is creating a rule under /sitecore/system/Settings/Rules/Indexing and Search/... that filters on your ShowOnTop field (I used a checkbox and compared the value with 1) and adjust the boost by 99999999
You can either add this rule as Global Rule or you can add it as Item rule and assign the rule from within the item.
Good luck!

Related

How to search in lucene where there is only a single token for a field

I am creating an index where the documents are only a single term.
I am indexing domain names, so the field "domain" would look like:
example.com
thisiscool.com
justtesting.org
cnn.com
I am creating my search terms etc. programatically, and because all my document field is just a single term, it appears as though my searches won't work as they are since there is only a single term and if I add multiple terms in a boolean query it will never find anything.
How should I be searching given I have only a single term? I want to make this as efficient as possible.
Query term = new TermQuery("domain", "this")
Query term2 = new TermQuery("domain", "cool")
// add to boolean query
bq.add(term, Occur.MUST)
bq.add(term2, Occur.MUST)
indexSearcher.search(bq, 100)
I was expecting to get "thisiscool.com" back, but I get 0 hits. My guess is because lucene can't break things down into tokens, so it will never find any document that has both tokens "this" and "cool".
How should I be searching given this scenerio?
Apply a wildcard to your search clause.
Query term = new TermQuery("domain", "this*");
Query term2 = new TermQuery("domain", "cool*"); // *cool* won't work sadly
However, that might not work because the logic is going to result in a query like this, where the domain has to begin with "this" as well as "cool"
bq.add(term, Occur.MUST)
bq.add(term2, Occur.MUST)
=> +domain:this* +domain:cool*
Query term = new TermQuery("domain", "this*cool*");
=> +domain:this*cool* // probably gets hits
If you're using newer versions then you can use regular expressions in queries:
http://lucene.apache.org/core/6_6_0/core/org/apache/lucene/util/automaton/RegExp.html
The above example isn't actually how you should do this. I tested it out, and it doesn't even really work. What you'll want to do is build specialized queries, such as PrefixQuery, WildcardQuery, or RegexpQuery.
Additionally, if you're not using QueryParser or something that takes an Analyzer, queries have to match exactly to what's in your index. If domain is a TextField it might have been lowercased or had something else happen to it, so you'll need to know that too.
I'd just use regex.
RegExp r = new RegExp("this.*cool");
Query q = new RegexpQuery(new Term("domain", r.toString()));
It can be slow, but if you don't prefix with any char it should be perfectly fine. I'm also not entirely sure how to ignore case with this, but that might be default.

How to index only words with a minimum length using Apache Lucene 5.3.1?

may someone give me a hint on how to index only words with a minimum length using Apache Lucene 5.3.1?
I've searched through the API but didn't find anything which suits my needs except this, but I couldn't figure out how to use that.
Thanks!
Edit:
I guess that's important info, so here's a copy of my explanation of what I want to achieve from my reply below:
"I don't intend to use queries. I want to create a source code summarization tool for which I created a doc-term matrix using Lucene. Now it also shows single- or double-character words. I want to exclude them so they don't show up in the results as they have little value for a summary. I know I could filter them when outputting the results, but that's not a clean solution imo. An even worse would be to add all combinations of single- or double-character words to the stoplist. I am hoping there is a more elegant way then one of those."
You should use a custom Analyzer with LengthTokeFilter. E.g.
Analyzer ana = CustomAnalyzer.builder()
.withTokenizer("standard")
.addTokenFilter("standard")
.addTokenFilter("lowercase")
.addTokenFilter("length", "min", "4", "max", "50")
.addTokenFilter("stop", "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset")
.build();
But it is better to use a stopword (words what occur in almost all documents, like articles for English language) list. This gives a more accurate result.

Apache lucene and text meaning

I have a question about searching process in lucene/.
I use this code for search
Directory directory = FSDirectory.GetDirectory(#"c:\index");
Analyzer analyzer = new StandardAnalyzer();
QueryParser qp = new QueryParser("content", analyzer);
qp.SetDefaultOperator(QueryParser.Operator.AND);
Query query = qp.Parse(search string);
In one document I've set "I want to go shopping" for a field and in other document I've set "I wanna go shopping".
the meaning of both sentences is same!
is there any good solution for lucene to understand meaning of sentences or kind of normalize the scentences ? for example save the fields like "I wanna /want to/ go shopping" and remove the comment with regexp in result.
Lucene provides filter to normalize words and even map similar words.
PorterStemFilter -
Stemming allows words to be reduced to their roots.
e.g. wanted, wants would be reduced to root want and search for any of those words would match the document.
However, wanna does not reduce to root want. So it may not work in this case.
SynonymFilter -
would help you to map words similar in a configuration file.
so wanna can be mapped to want and if you search for either of those, the document must match.
you would need to add the filters in your analysis chain.

How can I perform a fuzzy search for all words provided in a Lucene.net search

I am trying to teach myself Lucene.Net to implement on my site. I understand how to do almost everything I need except for one issue. I am trying to figure out how to allow a fuzzy search for all search terms in a search string.
So for example if I have a document with the string The big red fox, I am trying to get bag fix to match it.
The problem is, it seems like in order to perform fuzzy searches, I have to add ~ to every search term the user enters. I am unsure of the best way to go about this. Right now I am attempting this by
string queryString = "bag rad";
queryString = queryString.Replace("~", string.Empty).Replace(" ", "~ ") + "~";
The first replace is due to Lucene.Net throwing an exception if the search string has a ~ already, apparently it can't handle ~~ in a phrase. This method works, but it seems like it will get messy if I start adding fuzzy weight values.
Is there a better way to default all words to allow for fuzzyness?
You might want to index your documents as bi-grams or tri-grams. Take a look at the CJKAnalyzer to see how they do it. You will want to download the source and look at the source.

TSearch2 - dots explosion

Following conversion
SELECT to_tsvector('english', 'Google.com');
returns this:
'google.com':1
Why does TSearch2 engine didn't return something like this?
'google':2, 'com':1
Or how can i make the engine to return the exploded string as i wrote above?
I just need "Google.com" to be foundable by "google".
Unfortunately, there is no quick and easy solution.
Denis is correct in that the parser is recognizing it as a hostname, which is why it doesn't break it up.
There are 3 other things you can do, off the top of my head.
You can disable the host parsing in the database. See postgres documentation for details. E.g. something like ALTER TEXT SEARCH CONFIGURATION your_parser_config
DROP MAPPING FOR url, url_path
You can write your own custom dictionary.
You can pre-parse your data before it's inserted into the database in some manner (maybe splitting all domains before going into the database).
I had a similar issue to you last year and opted for solution (2), above.
My solution was to write a custom dictionary that splits words up on non-word characters. A custom dictionary is a lot easier & quicker to write than a new parser. You still have to write C tho :)
The dictionary I wrote would return something like 'www.facebook.com':4, 'com':3, 'facebook':2, 'www':1' for the 'www.facebook.com' domain (we had a unique-ish scenario, hence the 4 results instead of 3).
The trouble with a custom dictionary is that you will no longer get stemming (ie: www.books.com will come out as www, books and com). I believe there is some work (which may have been completed) to allow chaining of dictionaries which would solve this problem.
First off in case you're not aware, tsearch2 is deprecated in favor of the built-in functionality:
http://www.postgresql.org/docs/9/static/textsearch.html
As for your actual question, google.com gets recognized as a host by the parser:
http://www.postgresql.org/docs/9.0/static/textsearch-parsers.html
If you don't want this to occur, you'll need to pre-process your text accordingly (or use a custom parser).