I would like to implement an advanced search for my project.
The search right now uses all the strings the user enters and makes one big disjunction with criteria API.
This works fine, but now I would like to implement more features: AND, OR and brackets ().
I am having a hard time parsing the string and building criteria from it. I have found this Stack Overflow question, but it didn't really help (the asker didn't make it clear what he wanted).
I found another article, but this supports much more and spits out sql statements.
Another thing I've heard mentioned a lot is Lucene, but I'm not sure whether it would really help me.
I've been searching around a little bit and I've found the Lucene.Net WhitespaceAnalyzer and the QueryParser.
It changes the search A AND B OR C into something like +A +B C, which is a good step in the correct direction (plus it handles brackets).
The next step would be to get the converted string into a set of conjunctions and disjunctions.
The Java example I found was using the query builder which I couldn't find in NHibernate.
Any more ideas?
You may not have come across NHibernate Search yet.
NHibernate Search uses Lucene underneath and gives you all the options of its query grammar, including AND and OR.
All you have to do is attribute your entities for indexing, and NHibernate will index them at a predefined location.
You can then search this index with the power that Lucene exposes and get your domain-level entity objects in return.
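using (IFullTextSession s = Search.CreateFullTextSession(sf.OpenSession(new SearchInterceptor())))
{
    // "id" is the default field; the query below targets the Summary field explicitly
    QueryParser qp = new QueryParser("id", new StopAnalyzer());
    IQuery NHQuery = s.CreateFullTextQuery(qp.Parse("Summary:series"), typeof(Book));
    // returns the Book entities whose Summary matches "series"
    IList result = NHQuery.List();
}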
Powerful, isn’t it?
What I am basically doing right now is parsing the input string with the Lucene.Net parse API.
This gives me a uniform and simplified syntax. (Pseudocode)
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

void Search(string queryString)
{
    Analyzer analyzer = new WhitespaceAnalyzer();
    QueryParser luceneParser = new QueryParser("name", analyzer);
    // Normalizes e.g. "A AND B OR C" to something like "+name:A +name:B name:C"
    Query luceneQuery = luceneParser.Parse(queryString);
    string[] words = luceneQuery.ToString().Split(' ');
    foreach (string word in words)
    {
        // Parse each clause of the normalized Lucene.Net string
    }
}
After that I am parsing this string manually, creating the disjunctions and conjunctions.
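The manual step described above (turning the search string into conjunctions and disjunctions) can also be sketched as a small recursive-descent parser that works directly on the user's input. This is only a minimal sketch in plain Java, not NHibernate or Lucene code: BoolParser and Node are hypothetical names, AND is assumed to bind tighter than OR, and terms are assumed to be separated by whitespace. Each AND node would map to a conjunction and each OR node to a disjunction in the criteria API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: parses terms, AND, OR and parentheses into a tree. */
final class BoolParser {
    /** A node is either a leaf term or an AND/OR group of children. */
    static final class Node {
        final String op;   // "TERM", "AND" or "OR"
        final String term; // set only for leaves
        final List<Node> children = new ArrayList<>();

        Node(String op, String term) { this.op = op; this.term = term; }

        @Override public String toString() {
            if (op.equals("TERM")) return term;
            StringBuilder sb = new StringBuilder(op).append('(');
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) sb.append(", ");
                sb.append(children.get(i));
            }
            return sb.append(')').toString();
        }
    }

    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    private BoolParser(String input) {
        // Pad parentheses with spaces so a whitespace split isolates them.
        String spaced = input.replace("(", " ( ").replace(")", " ) ").trim();
        if (!spaced.isEmpty()) {
            for (String t : spaced.split("\\s+")) tokens.add(t);
        }
    }

    static Node parse(String input) {
        BoolParser p = new BoolParser(input);
        Node root = p.parseOr();
        if (p.pos != p.tokens.size()) {
            throw new IllegalArgumentException("Unexpected token: " + p.tokens.get(p.pos));
        }
        return root;
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : null; }

    // or := and ("OR" and)*
    private Node parseOr() {
        Node left = parseAnd();
        if (!"OR".equals(peek())) return left;
        Node or = new Node("OR", null);
        or.children.add(left);
        while ("OR".equals(peek())) {
            pos++;
            or.children.add(parseAnd());
        }
        return or;
    }

    // and := primary ("AND" primary)*
    private Node parseAnd() {
        Node left = parsePrimary();
        if (!"AND".equals(peek())) return left;
        Node and = new Node("AND", null);
        and.children.add(left);
        while ("AND".equals(peek())) {
            pos++;
            and.children.add(parsePrimary());
        }
        return and;
    }

    // primary := "(" or ")" | TERM
    private Node parsePrimary() {
        String t = peek();
        if (t == null) throw new IllegalArgumentException("Unexpected end of input");
        if (t.equals("(")) {
            pos++;
            Node inner = parseOr();
            if (!")".equals(peek())) throw new IllegalArgumentException("Missing ')'");
            pos++;
            return inner;
        }
        if (t.equals(")") || t.equals("AND") || t.equals("OR")) {
            throw new IllegalArgumentException("Unexpected token: " + t);
        }
        pos++;
        return new Node("TERM", t);
    }
}
```

With these assumptions, BoolParser.parse("A AND (B OR C)") prints as AND(A, OR(B, C)). Two terms with no operator between them are rejected in this sketch; treating them as an implicit OR would match the current search behaviour.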
I am creating an index where the documents are only a single term.
I am indexing domain names, so the field "domain" would look like:
example.com
thisiscool.com
justtesting.org
cnn.com
I am creating my search terms programmatically. Because each document's field is just a single term, my searches don't seem to work as they are: if I add multiple terms to a boolean query, it never finds anything.
How should I be searching given I have only a single term? I want to make this as efficient as possible.
Query term = new TermQuery(new Term("domain", "this"));
Query term2 = new TermQuery(new Term("domain", "cool"));
// add to boolean query; both clauses must match
bq.add(term, Occur.MUST);
bq.add(term2, Occur.MUST);
indexSearcher.search(bq, 100);
I was expecting to get "thisiscool.com" back, but I get 0 hits. My guess is that Lucene can't break the domain down into tokens, so it never finds a document that has both tokens "this" and "cool".
How should I be searching given this scenario?
Apply a wildcard to your search clause.
Query term = new TermQuery("domain", "this*");
Query term2 = new TermQuery("domain", "cool*"); // *cool* won't work sadly
However, that might not work because the logic is going to result in a query like this, where the domain has to begin with "this" as well as "cool"
bq.add(term, Occur.MUST)
bq.add(term2, Occur.MUST)
=> +domain:this* +domain:cool*
Query term = new TermQuery("domain", "this*cool*");
=> +domain:this*cool* // probably gets hits
If you're using newer versions then you can use regular expressions in queries:
http://lucene.apache.org/core/6_6_0/core/org/apache/lucene/util/automaton/RegExp.html
The above example isn't actually how you should do this. I tested it out, and it doesn't even really work. What you'll want to do is build specialized queries, such as PrefixQuery, WildcardQuery, or RegexpQuery.
Additionally, if you're not using QueryParser or something that takes an Analyzer, queries have to match exactly to what's in your index. If domain is a TextField it might have been lowercased or had something else happen to it, so you'll need to know that too.
I'd just use regex.
// The pattern has to match the whole term, hence the trailing .*
Query q = new RegexpQuery(new Term("domain", "this.*cool.*"));
It can be slow, but as long as the pattern does not begin with a wildcard (any char) it should be perfectly fine. I'm also not entirely sure how to ignore case with this, but that might be the default.
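One thing to keep in mind with the regex approach: a RegexpQuery pattern has to match the entire term, not just part of it. The sketch below illustrates that whole-string behaviour with plain java.util.regex, which is only a stand-in here (its syntax is not identical to Lucene's RegExp). WholeTermMatch and its methods are hypothetical names used for illustration.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: whole-string regex matching against a single-term field value. */
final class WholeTermMatch {
    /** True only when the pattern covers the entire term. */
    static boolean matchesWholeTerm(String regex, String term) {
        return Pattern.matches(regex, term);
    }

    /** Same check, ignoring case (java.util.regex flag, not a Lucene option). */
    static boolean matchesWholeTermIgnoringCase(String regex, String term) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(term).matches();
    }
}
```

With these helpers, "this.*cool" does not match "thisiscool.com" because ".com" is left over, while "this.*cool.*" does.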
What is the best way to check whether String a is part of String b in Lucene? For example: a = "capital" and b = "Berlin is a capital of Germany". In this case b contains a and meets the requirement.
I think your problem reduces to checking whether a field contains a certain term.
A basic TermQuery should be enough to solve it. With most analyzers, "Berlin is a capital of Germany" is analyzed into the terms "berlin", "capital" and "germany" (if you use the basic stop words).
// code in Scala
new TermQuery(new Term("contents", "capital"))
You can also use PhraseQuery, although your problem is not the most suitable scenario for it.
val query = new PhraseQuery();
query.add(new Term("contents", "capital"))
Section 3.4, "Lucene's diverse queries", of Lucene in Action (2nd edition) introduces all the kinds of Query used in Lucene. I suggest you have a read; it might help.
I am migrating my code from Lucene 3.5 to Lucene 4.1 but I am having some problems with getting the term vector without indexing.
The problem is, given a text string and an Analyzer, I need to compute the term vector (technically, find the terms and their frequencies tf). Obviously, it can be achieved by writing the index (using IndexWriter) and then reading them back (using IndexReader) but I reckon it would be expensive. Furthermore, I don't need document frequency (df). Thus, I think an indexing-free solution is suitable.
In Lucene 2 and 3, a simple technique for the above purpose is to use QueryTermVector which extends TermFreqVector and has a constructor taking a string and an Analyzer. Unfortunately, QueryTermVector (along with TermFreqVector) has been removed in Lucene 4 and it seems the migration documentation did not mention anything about QueryTermVector.
Do you have a solution for this problem in Lucene 4? Thank you very much.
If you just need to know the terms/frequencies, you can just obtain the single tokens directly from the analyzer (you can get the TF by counting them, e.g. by using a Map or a Multiset).
This is how you do it in Lucene 4.0:
TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
CharTermAttribute charTermAttribute = ts.addAttribute(CharTermAttribute.class);
ts.reset(); // must be called before the first incrementToken()
while (ts.incrementToken()) {
    String term = charTermAttribute.toString();
    // term contains your token
}
ts.end();
ts.close();
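The counting step mentioned above (getting the TF with a Map) could look roughly like this. It is a minimal sketch in plain Java with no Lucene dependency: TermFrequencies is a hypothetical helper, and the terms passed in stand for the values read from charTermAttribute.toString() inside the loop.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: counts how often each analyzed term occurs. */
final class TermFrequencies {
    static Map<String, Integer> count(Iterable<String> terms) {
        // LinkedHashMap keeps terms in first-seen order, which is handy for debugging.
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (String term : terms) {
            tf.merge(term, 1, Integer::sum); // insert 1 or add 1 to the existing count
        }
        return tf;
    }
}
```

In the token loop you would collect each term into a list (or call merge directly on the map) instead of passing a ready-made collection.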
I'm using Lucene.Net, but I'm sure it still applies to the non-.NET flavour.
This is my query:
Collection:drwho AND Format:"Blu-ray"
This is what the query parser does to it:
{+Collection:drwho +Format:"blu ray"}
This is clearly not what I am after. This is the code I'm using:
Dim analyzer = New StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29)
Dim qp = New QueryParser(Lucene.Net.Util.Version.LUCENE_29, Nothing, analyzer)
Dim q As Query = qp.Parse(query)
Any ideas on why the query is being butchered? Going by http://lucene.apache.org/java/3_4_0/queryparsersyntax.html, I cannot for the life of me see what is wrong with my query...
For NOT_ANALYZED fields, you should either create a TermQuery in your code or use KeywordAnalyzer, because such fields require an exact match between the term in the index and the term in your query (your input is stored as Blu-ray in the index). Other analyzers process the input and, for example, convert Blu-ray to blu ray, as you have already noticed.
If you change your field to ANALYZED and use StandardAnalyzer while indexing, your query would also work in current form.
I am using Lucene to index my database and then perform a phrase search on a specific field(field name: keyword).
I am using following code currently:
String userQuery = request.getParameter("query");
//create standard analyzer object
analyzer = new StandardAnalyzer(Version.LUCENE_30);
Analyzer analyze=AnalyzerUtil.getPorterStemmerAnalyzer(analyzer);
//create File object of our index directory
File file = new File(LUCENE_INDEX_DIRECTORY);
//create index reader object
reader = IndexReader.open(FSDirectory.open(file),true);
//create index searcher object
searcher = new IndexSearcher(reader);
//create topscore document collector
collector = TopScoreDocCollector.create(1000, false);
//create query parser object
parser = new QueryParser(Version.LUCENE_30,"keyword", analyze);
parser.setAllowLeadingWildcard(true);
//parse the query and get reference to Query object
query = parser.parse(userQuery);
//********Line 1***********************
//search the query
searcher.search(query, collector);
hits = collector.topDocs().scoreDocs;
//check whether the search returns any result
if (hits.length > 0) {
    // code to retrieve hits
}
This code works fine for stemming, but now I also want to expand my query to do a synonym search: if I enter "Man" and my Lucene index has an entry "male", it should still give me that as a hit.
I tried adding this at Line 1 in the code above:
query = SynExpand.expand(userQuery, searcher, analyze, "keyword", serialVersionUID);
But it doesn't give me any result.
I also want to introduce spell checking, so that if I enter "ubelievable" instead of "unbelievable" it would still give me a result.
I have no idea why synonym expansion isn't working for me or how to do spell checking. If someone could guide me, I would be really grateful.
Thanks!
Fuzzy search can be done with a query modifier, namely by appending a tilde to the term:
keyword:ubelievable~
See Lucene Parser Syntax for more details and other types of queries that may be interesting to you.
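Fuzzy matching of this kind is based on edit distance, i.e. the number of single-character changes needed to turn one term into another. As a rough illustration (plain Levenshtein distance in standalone Java, not Lucene's own implementation; EditDistance is a hypothetical name), the misspelling from the question is one insertion away from the intended word:

```java
/** Hypothetical sketch: classic Levenshtein distance with two rolling rows. */
final class EditDistance {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // substitution or match
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the rows
        }
        return prev[b.length()];
    }
}
```

EditDistance.levenshtein("ubelievable", "unbelievable") returns 1, which is why a fuzzy query on the misspelled term can still reach the correctly spelled one. Note that with a stemming analyzer the indexed terms are stems, so the distance is measured against those.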
There are two ways of dealing with synonyms. The query expansion you are trying to use relies on WordNet. As SynExpand's documentation says, you should first invoke Syns2Index before you can use expansion. This is the easy way, but it works only with English words.
If you need to support multiple languages or add your own synonyms, you can use synonym injection during indexing. The idea is to write your own analyzer that injects synonyms from your own dictionary into the indexed documents. This may sound hard to implement, but fortunately there's an excellent example in the Lucene in Action book (the source code is available for free, see the lia.analysis.synonym package, though I highly recommend getting your own copy of this nice book).
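To make the injection idea concrete, here is a minimal sketch in plain Java, independent of Lucene and of the book's code. SynonymInjector and the dictionary passed to it are hypothetical; in a real analyzer the injected synonyms would be emitted by a token filter at the same position as the original token rather than appended to a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: adds dictionary synonyms right after each matching token. */
final class SynonymInjector {
    static List<String> inject(List<String> tokens, Map<String, List<String>> synonyms) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token); // always keep the original token
            List<String> extra = synonyms.get(token);
            if (extra != null) {
                out.addAll(extra); // index the synonyms alongside it
            }
        }
        return out;
    }
}
```

With a dictionary that maps "male" to "man", a document containing "male" would also be indexed under "man", so a search for "man" finds it without any query-time expansion.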