Lucene ignore keywords in search term

This seems like it should be simple, but I can't figure out how to get Lucene to ignore the AND, OR, and NOT keywords: the query parser throws a parse error when it encounters one. I have a query builder class that splits the search term so that it searches both on the words themselves and on n-grams within each word. I'm using Lucene in Java.
So in a search for, say, "ANDERSON COOPER" the query string looks like:
name: (ANDERSON COOPER "ANDERSON COOPER")^5 gram4: ( ANDE NDER DERS ERSO RSON
SONC ONCO NCOO COOP OOPE OPER)
The query parser throws an error when it hits those ANDs. Ideally, I'd like the parser to just ignore AND, OR, and NOT altogether, and I'll use the &&, ||, and ! equivalents if I need them. Do I have to modify the code in the QueryParser class itself to get this, or is there an easier way? I could also just insert an escape character in these cases if that is the best approach, but adding \ before the word AND doesn't seem to do anything.

You can wrap the AND in quotes like this: "AND". Is that easy? A regex could probably do that easily if you know exactly what your queries look like.
The parser shouldn't have a problem with it, and the resulting PhraseQuery will be rewritten as a TermQuery, so the performance cost is a small constant.
The regex could probably look like this:
\b(AND|OR|NOT)\b
Which would be replaced with
"$1"

Related

Lucene.net GetFieldQuery vs TermQuery

I'm using Lucene's standard analyzer, and the title field in question is non-stored and analyzed. The query is as follows:
title:"Some-Url-Friendly-Title"
In Luke, this query gets correctly re-written as:
title:"some url friendly title" (- replaced by whitespace, everything lowercased).
I thought the Lucene.net version would be:
new TermQuery(new Term("title","Some-Url-Friendly-Title"))
However, no results are returned.
Then I tried:
_parser.GetFieldQuery("title","Some-Url-Friendly-Title")
And it worked as expected!
Both queries were executed via:
_searcher.Search([query object], [sort object])
Can somebody point me in the right direction to see what the differences between TermQuery and _parser.GetFieldQuery() are?
A TermQuery is much simpler than running a query through a query parser. Not only is the text not lowercased and hyphenated terms not broken up, it isn't tokenized at all. It simply searches for exactly the term you give it, which means it is looking for "Some-Url-Friendly-Title" as a single untokenized keyword in your index. Since you are using an analyzer, chances are no such token exists.
To take it a step further: even if you had searched for "Some Url Friendly Title" as the term text, you still wouldn't find anything, since the query would look for "Some Url Friendly Title" as a single token, not as the four tokens (or rather, terms) in your index.
If you look at what the standard query parser generates when you parse your query, you'll see that TermQuery is only one of the building blocks it uses to assemble the complete query, along with BooleanQuery and possibly PhraseQuery, PrefixQuery, etc.
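To see the contrast in code, here is a minimal sketch (in Java; the question uses Lucene.NET, but the API and behavior mirror each other, and exact package names and constructors vary by version):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

Analyzer analyzer = new StandardAnalyzer();
QueryParser parser = new QueryParser("title", analyzer);

// Analyzed: rewritten to title:"some url friendly title", a PhraseQuery
// over four lowercased terms, which matches the tokens in the index.
Query parsed = parser.parse("\"Some-Url-Friendly-Title\"");

// Not analyzed: looks up the literal single term "Some-Url-Friendly-Title",
// which never made it into the index in that form.
Query notAnalyzed = new TermQuery(new Term("title", "Some-Url-Friendly-Title"));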
In Lucene.Net version 3.0.3, GetFieldQuery is inaccessible due to its protection modifier. Use
MultiFieldQueryParser.Parse(searchText, field)
instead.

Space issue in Lucene.NET C#

I want to search for a phrase that contains a space, using full-text search.
Ex: Tom is a very good boy in class.
I want to search for the keyword "very good".
I'm using a whitespace tokenizer to create and search the index, but it does not find the keyword when the keyword contains a space.
Code:
Query searchItemQuery = new WildcardQuery(new Term(string-field-name, searchkeyword.ToLower()));
I've tried with split but it is not working properly.
Do anyone suggest me a solution for this problem?
Thanks,
Vijay
Since you are working with a tokenized string, every word is a separate term.
In order to find a phrase consisting of multiple terms, you need to use a PhraseQuery instead of a WildcardQuery.
Like this:
PhraseQuery phraseQuery = new PhraseQuery();
phraseQuery.Add(new Term(fieldName, "very"));   // fieldName is your string field's name
phraseQuery.Add(new Term(fieldName, "good"));
Note also that you are using a wildcard query. Wildcards in phrase queries are a bit complex; check this post for details: Lucene - Wildcards in phrases
And finally, I would suggest considering QueryParser instead of constructing the query manually.
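For example, a minimal sketch (in Java; the Lucene.NET calls have the same names, and constructor arguments vary by version) that lets the parser build the phrase query from the quoted keyword:

Analyzer analyzer = new WhitespaceAnalyzer(); // must match the analyzer used at index time
QueryParser parser = new QueryParser(fieldName, analyzer);
Query query = parser.parse("\"very good\""); // the quoted phrase becomes a PhraseQuery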

Comparison of Lucene Analyzers

Can someone please explain the difference between the different analyzers within Lucene? I am getting a maxClauseCount exception and I understand that I can avoid this by using a KeywordAnalyzer but I don't want to change from the StandardAnalyzer without understanding the issues surrounding analyzers. Thanks very much.
In general, any analyzer in Lucene is a tokenizer + a stemmer + a stop-words filter.
The tokenizer splits your text into chunks, and since different analyzers may use different tokenizers, you can get different output token streams, i.e. different sequences of chunks of text. For example, the KeywordAnalyzer you mentioned doesn't split the text at all and treats the whole field as a single token. At the same time, StandardAnalyzer (and most other analyzers) uses spaces and punctuation as split points. For example, for the phrase "I am very happy" it will produce the list ["i", "am", "very", "happy"] (or something like that). For more information on specific analyzers/tokenizers, see their Javadocs.
Stemmers are used to get the base form of a word, which heavily depends on the language used. For example, for the previous English phrase something like ["i", "be", "veri", "happi"] will be produced, while for the French "Je suis très heureux" some kind of French analyzer (like SnowballAnalyzer, initialized with "French") will produce ["je", "être", "tre", "heur"]. Of course, if you use an analyzer for one language to stem text in another, the rules of the first language will be applied and the stemmer may produce incorrect results. This doesn't break the whole system, but search results may then be less accurate.
KeywordAnalyzer doesn't use any stemmer; it passes the whole field through unmodified. So if you are going to search for words in English text, it isn't a good idea to use this analyzer.
Stop words are the most frequent and almost useless words. Again, this heavily depends on the language. For English these words are "a", "the", "I", "be", "have", etc. Stop-word filters remove them from the token stream to lower the noise in search results, so our phrase "I'm very happy", with stop words removed and stemming applied, will finally be transformed to the list ["veri", "happi"] (note that StandardAnalyzer itself filters stop words but does not stem).
And KeywordAnalyzer again does nothing here. So, KeywordAnalyzer is used for things like IDs or phone numbers, not for ordinary text.
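A minimal sketch (assuming Lucene core and the analyzers module on the classpath; package names are those of Lucene 4+) that prints the tokens each analyzer emits for the same input:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {
    static void printTokens(Analyzer analyzer, String text) throws IOException {
        try (TokenStream stream = analyzer.tokenStream("field", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.print("[" + term + "] ");
            }
            stream.end();
            System.out.println();
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "I am very happy";
        // StandardAnalyzer: lowercased word tokens, e.g. [i] [am] [very] [happy]
        // (versions whose default includes a stop set also drop "i" and "am")
        printTokens(new StandardAnalyzer(), text);
        // KeywordAnalyzer: the whole field as one token: [I am very happy]
        printTokens(new KeywordAnalyzer(), text);
    }
}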
As for your maxClauseCount exception, I believe you get it while searching. In that case it is most probably caused by a search query that is too complex, e.g. a wildcard or prefix query that expands to more than the default limit of 1024 boolean clauses. Try to split it into several queries or use lower-level functions.
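If splitting the query is not an option, the limit itself can be raised. A one-line sketch (the static setter lives on BooleanQuery in older Lucene versions; recent versions moved it to IndexSearcher.setMaxClauseCount):

BooleanQuery.setMaxClauseCount(4096); // default is 1024; applies process-wide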
From my own experience, I have used StandardAnalyzer and SmartCNAnalyzer, as I have to search text in Chinese. Obviously, SmartCNAnalyzer is better at handling Chinese. For different purposes, you have to choose the most appropriate analyzer.

A proper way to escape % when building LIKE queries in Rails 3 / ActiveRecord

I want to match a url field against a url prefix (which may contain percent signs), e.g. .where("url LIKE ?", "#{some_url}%").
What's the most Rails way?
From Rails version 4.2.x there is an ActiveRecord method called sanitize_sql_like, so you can define a search scope in your model like:
scope :search, -> search { where('"accounts"."name" LIKE ?', "#{sanitize_sql_like(search)}%") }
and call the scope like:
Account.search('Test_%')
The resulting escaped sql string is:
SELECT "accounts".* FROM "accounts" WHERE ("accounts"."name" LIKE 'Test\_\%%')
Read more here: http://edgeapi.rubyonrails.org/classes/ActiveRecord/Sanitization/ClassMethods.html
If I understand correctly, you're worried about "%" appearing inside some_url, and rightly so; you should also be worried about embedded underscores ("_"), which are the LIKE version of "." in a regex. I don't think there is a Rails-specific way of doing this, so you're left with gsub:
.where('url like ?', some_url.gsub('%', '\\\\\%').gsub('_', '\\\\\_') + '%')
There's no need for string interpolation here either. You need to double the backslashes to escape their meaning in the database's string parser, so that the LIKE parser sees a plain "\%" and knows to treat the percent sign as a literal.
You should check your logs to make sure the two backslashes get through. I'm getting confusing results when checking things in irb: using five (!) backslashes produces the right output, but I don't see the sense in it; if anyone does, an explanatory comment would be appreciated.
UPDATE: Jason King has kindly offered a simplification for the nightmare of escaped escape characters. This lets you specify a temporary escape character so you can do things like this:
.where("url LIKE ? ESCAPE '!'", some_url.gsub(/[!%_]/) { |x| '!' + x })
I've also switched to the block form of gsub to make it a bit less nasty.
This is standard SQL92 syntax, so will work in any DB that supports that, including PostgreSQL, MySQL and SQLite.
Embedding one language inside another is always a bit of a nightmarish kludge and there's not that much you can do about it. There will always be ugly little bits that you just have to grin and bear.
https://gist.github.com/3656283
With this code,
Item.where(Item.arel_table[:name].matches("%sample!%code%"))
correctly escapes the % between "sample" and "code": it matches "AAAsample%codeBBB" but not "AAAsampleBBBcodeCCC", on MySQL, PostgreSQL and SQLite3 at least.
Post.where('url like ?', "%#{some_url}%")
(Note that, unlike the answers above, this does not escape % or _ inside some_url.)

How to make Lucene match all words in query?

I am using Lucene to allow a user to search for words in a large number of documents. Lucene seems to default to returning all documents containing any of the words entered.
Is it possible to change this behaviour? I know that '+' can be used to force a term to be included, but I would like to make that the default action.
Ideally I would like functionality similar to Google's: '-' to exclude words and "abc xyz" to group words.
Just to clarify
I also thought of inserting '+' at every space in the query. I just wanted to avoid having to detect grouped terms (brackets, quotes, etc.) and potentially breaking the query. Is there another approach?
This looks similar to the Lucene Sentence Search question. If you're interested, this is how I answered that question:
String defaultField = ...;
Analyzer analyzer = ...;
QueryParser queryParser = new QueryParser(defaultField, analyzer);
queryParser.setDefaultOperator(QueryParser.Operator.AND);
Query query = queryParser.parse("Searching is fun");
Like Adam said, there's no need to do anything to the query string. QueryParser's setDefaultOperator does exactly what you're asking for.
Why not just pre-parse the user's search input and adjust it to fit your criteria using the Lucene query syntax before passing it on to Lucene? Alternatively, you could just create some help documentation on how to use the standard syntax to create a specific query and let the user decide how the query should be performed.
Lucene has an extensive query language, as described here, that covers everything you want except for '+' being the default, but that is something you can simply handle by prefixing each clause with '+'. So the only thing you need to do is define the format you want people to enter their search queries in (I would strongly advise adhering to the default Lucene syntax), and then you can write the transformation from your own syntax to the Lucene syntax; a sketch of such a rewrite follows.
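Here is a hedged sketch of that rewrite as a hypothetical helper (not part of Lucene): it prefixes '+' to each whitespace-separated clause while leaving quoted phrases intact and respecting an existing '+' or '-', which avoids the grouped-terms problem mentioned above:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RequireAll {
    // A clause is either a quoted phrase or a run of non-space characters.
    private static final Pattern CLAUSE = Pattern.compile("\"[^\"]*\"|\\S+");

    public static String requireAllTerms(String userQuery) {
        List<String> clauses = new ArrayList<>();
        Matcher m = CLAUSE.matcher(userQuery);
        while (m.find()) {
            String clause = m.group();
            // Leave clauses that are already required (+) or prohibited (-) alone.
            clauses.add(clause.startsWith("+") || clause.startsWith("-")
                    ? clause : "+" + clause);
        }
        return String.join(" ", clauses);
    }
}

For example, requireAllTerms("abc xyz \"grouped terms\" -excluded") yields +abc +xyz +"grouped terms" -excluded. That said, the QueryParser.setDefaultOperator call shown above achieves the same result without any string rewriting.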
The behavior is hard-coded in the method addClause(List, int, int, Query) of class org.apache.lucene.queryParser.QueryParser, so the only way to change it (other than the workarounds above) is to modify that method. The end of the method looks like this:
if (required && !prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.MUST));
else if (!required && !prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.SHOULD));
else if (!required && prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.MUST_NOT));
else
throw new RuntimeException("Clause cannot be both required and prohibited");
Changing "SHOULD" to "MUST" should make clauses (e.g. words) required by default.