emit every document in the database with lucene

I've got an index where I need a standard search to return all documents, still ranked by relevance, even if a document isn't a hit.
My first idea is to add a field that always matches, but that might distort the relevance score.

Use a BooleanQuery to combine your original query with a MatchAllDocsQuery. You can mitigate the effect this has on scoring by setting the boost on the MatchAllDocsQuery to zero before you combine it with your main query. This way you don't have to add an otherwise bogus field to the index.
For example:
// Parse a query by the user.
QueryParser qp = new QueryParser(Version.LUCENE_35, "text", new StandardAnalyzer(Version.LUCENE_35));
Query standardQuery = qp.parse("User query may go here");
// Make a query that matches everything, but has no boost.
MatchAllDocsQuery matchAllDocsQuery = new MatchAllDocsQuery();
matchAllDocsQuery.setBoost(0f);
// Combine the queries.
BooleanQuery boolQuery = new BooleanQuery();
boolQuery.add(standardQuery, BooleanClause.Occur.SHOULD);
boolQuery.add(matchAllDocsQuery, BooleanClause.Occur.SHOULD);
// Now just pass it to the searcher.
This should give you hits from standardQuery followed by the rest of the documents in the index.
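To illustrate why a zero boost keeps the original ordering, here is a toy model in plain Java. It is not Lucene's scoring formula: it simply assumes each clause contributes its boost times a raw score, and the class name, document ids, and scores are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model, not Lucene: each clause adds (boost * rawScore) to a document's total.
class ZeroBoostRanking {

    // Returns every document id, ordered by descending combined score.
    // mainScores holds raw scores only for documents the main query matched.
    static List<String> rank(List<String> allDocs, Map<String, Double> mainScores, double matchAllBoost) {
        Map<String, Double> combined = new LinkedHashMap<>();
        for (String doc : allDocs) {
            double score = matchAllBoost * 1.0;          // the match-all clause hits every document
            score += mainScores.getOrDefault(doc, 0.0);  // the main clause hits only some documents
            combined.put(doc, score);
        }
        List<String> ranked = new ArrayList<>(allDocs);
        // List.sort is stable, so documents with equal scores keep their original order.
        ranked.sort(Comparator.comparingDouble((String d) -> combined.get(d)).reversed());
        return ranked;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("d1", "d2", "d3", "d4");
        Map<String, Double> hits = Map.of("d3", 0.9, "d2", 0.4); // hypothetical main-query scores
        System.out.println(rank(docs, hits, 0.0)); // prints [d3, d2, d1, d4]
    }
}
```

With a boost of zero the match-all clause adds nothing, so the real hits keep their relative order and the remaining documents follow them.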

Related

Lucene phrase query with wildcards

I came up with a solution that programmatically creates a query to search for a phrase with wildcards, using this code:
public static Query createPhraseQuery(String[] phraseWords, String field) {
SpanQuery[] queryParts = new SpanQuery[phraseWords.length];
for (int i = 0; i < phraseWords.length; i++) {
WildcardQuery wildQuery = new WildcardQuery(new Term(field, phraseWords[i]));
queryParts[i] = new SpanMultiTermQueryWrapper<WildcardQuery>(wildQuery);
}
return new SpanNearQuery(queryParts, //words
0, //max distance
true //exact order
);
}
Creating an example query and calling its toString() method:
String[] phraseWords = new String[]{"foo*", "b*r"};
Query phraseQuery = createPhraseQuery(phraseWords, "text");
System.out.println(phraseQuery.toString());
outputs:
spanNear([SpanMultiTermQueryWrapper(text:foo*), SpanMultiTermQueryWrapper(text:b*r)], 0, true)
This works great and is fast enough for most cases. For instance, if I create such a query and search with it, it returns the desired results, for example:
Sentence with foo bar.
Foolies beer drinkers.
...
And not something like:
Bar fooes.
Foo has bar.
I mentioned that the query is fast enough in most cases. I currently have an index of approx. 200 GB, and the average search time is between 0.1 and 3 seconds, depending on many factors such as caching and the size of the document subsets matching each single word in the phrase, since Lucene performs set intersections between the terms it finds.
Example:
Let's suppose I want to query the phrase "an* karenjin*" (which I will split into ["an*", "karenjin*"] and then turn into a query using the createPhraseQuery method), and I want it to match sentences containing "ana karenjina", "ani karenjinoj", "ane karenjine", ... (different cases due to Croatian grammar).
This query is so slow that I have never waited long enough to get results (over 1 h), and it sometimes causes a "GC overhead limit exceeded" error.
This behaviour is somewhat expected, since "an*" by itself matches a huge number of documents. I am aware that I could query "an? karenjin*", which gives results in 30-40 s (faster, but still slow).
This is where I am confused.
If I query just "karenjin*", it gives results in 1 s. Therefore I tried querying "an* karenjin*" with a Filter on "karenjin*", built from a WildcardQuery and a QueryWrapperFilter. It is still unacceptably slow (I killed the process before it returned anything).
The documentation says that a Filter reduces the search space of a Query. So I tried this filter:
Filter filter = new QueryWrapperFilter(new WildcardQuery(new Term("text", "karanjin*")));
And query:
Query query = createPhraseQuery(new String[]{"an*", "karenjin*"}, "text");
Then I search (after several warm-up queries):
Sort sort = new Sort(new SortField("insertTime", SortField.Type.STRING, true));
TopDocs docs = searcher.search(query, filter, 100, sort);
OK, what is my question?
How come querying with:
Query query = new WildcardQuery(new Term("text", "karanjin*"));
is fast, but using the Filter described above is still slow?

Yes, wildcards can be performance hogs, especially if they match a lot of terms, but what you describe does seem surprisingly slow. It is hard to say for sure why that is occurring, but here is an attempt.
I'll assume:
Query query = new WildcardQuery(new Term("text", "an*"));
on its own is performing very badly, as described. Since the wildcards you are looking for are both prefix-style queries, it's a better idea to use a PrefixQuery instead.
Query query = new PrefixQuery(new Term("text", "an"));
Though I don't think that will make much of a difference, if any at all. What might make a difference is changing your rewrite method. You could try limiting the number of Terms the query is rewritten into:
PrefixQuery query = new PrefixQuery(new Term("text", "an"));
//or
//WildcardQuery query = new WildcardQuery(new Term("text", "an*"));
// Keep only the 10 top-scoring terms when the query is rewritten.
query.setRewriteMethod(new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(10));

Sitecore term query for filter data

In a Sitecore Lucene search I am using a TermQuery to filter data from Sitecore.
I have a field in Sitecore called "Description" and I want to filter on the term "Lorem", but every time I get 0 results. If I don't use the term query I get all results, which suggests my index configuration is correct. Please help.
TermQuery bothQuery = new TermQuery (new Term("Description", "Lorem"));
BooleanQuery query = new BooleanQuery();
query.Add(bothQuery, BooleanClause.Occur.MUST);
TopDocs topDocs = sc.Searcher.Search(query, int.MaxValue);
SearchHits searchHits = new SearchHits(topDocs, sc.Searcher.GetIndexReader());
return searchHits.FetchResults(0, int.MaxValue).Select(r => r.GetObject<Item>()).ToList();

I note that your Term definition above has a field name containing a capital letter. You don't specify the version of Sitecore / Lucene you're working in, but my experience with the 6.x series of Sitecore is that the indexing process transforms all the Field names to lower case at index time.
Hence your field in Sitecore might be called "Description" but in Lucene's index it is probably called "description". Try changing your code to use a lower case field name.
You can check this using an index display tool like the Lucene Index Viewer from the Sitecore Marketplace. It will show you the names of the fields in your index, and let you test queries against them without the need to recompile code.

Lucene search using StopWords in StandardAnalyzer

I have the following issue using Lucene.NET 3.0.3.
My project analyzes documents using StandardAnalyzer with a stop word list (combined German and English words).
When searching, I create my search term by hand and parse it using MultiFieldQueryParser. The parser is initialized with the same analyzer used for indexing documents.
The parsed search query is added to a BooleanQuery. The BooleanQuery and a TopScoreDocCollector are then used to search the Lucene index with IndexSearcher.
My code looks like:
using (StandardAnalyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30, roxConnectionTools.getServiceInstance<ISearchIndexService>().GetStopWordList()))
{
...
MultiFieldQueryParser parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, searchFields, analyzer);
parser.MultiTermRewriteMethod = MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE;
parser.AllowLeadingWildcard = true;
...
Query searchQuery = parser.Parse(searchStringBuilder.ToString().Trim());
...
BooleanQuery boolQuery = new BooleanQuery();
boolQuery.Add(searchQuery, Occur.MUST);
...
TopScoreDocCollector scoreCollector = TopScoreDocCollector.Create(SearchServiceTools.MAX_SCORE_COLLECTOR_SIZE, true);
...
searcher.Search(boolQuery, scoreCollector);
ScoreDoc[] scoreDocs = scoreCollector.TopDocs().ScoreDocs;
}
If I index a document field with the value "Test- und Produktivumgebung", I can't find this document by searching for this term.
I do get results if I change the search term to "Test- Produktivumgebung".
The word "und" is in my StopWord-List.
My search query looks like the following:
Manually generated search query: (+*Test* +*und* +*Produktivumgebung*)
Parsed search query: +(title:*Test*) +(title:*und*) +(title:*Produktivumgebung*)
Why can't I find the document when searching for "Test- und Produktivumgebung"?

Wildcard queries are not analyzed (see this question for an example). Since you are (if I understand correctly) translating the query "Test- und Produktivumgebung" into (+*Test* +*und* +*Produktivumgebung*), the analyzer is not used for any of those wildcard queries, and stop words will not be eliminated. The term "und" was dropped from the index as a stop word, so the required clause +*und* can only match if some other indexed term happens to contain "und".
If you eliminate the step that performs that translation, the query "Test- und Produktivumgebung" should be parsed to a phrase query and analyzed, and should work just fine. Another reason to eliminate that step is that applying a leading wildcard to every term will make performance very poor. That's why leading wildcards must be enabled manually: it is generally a bad idea to use them.
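The mismatch can be reproduced with a toy model in plain Java. This is only a sketch, not Lucene: the analyzer below is a simplified stand-in (split on non-letters, lowercase, drop stop words), the stop word set is hypothetical, and each wildcard clause of the form *x* is modelled as a substring test against the indexed terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model, not Lucene: shows why an unanalyzed "*und*" clause finds nothing
// when the index-time analyzer already dropped "und" as a stop word.
class StopWordWildcardDemo {

    static final Set<String> STOP_WORDS = Set.of("und", "and"); // hypothetical stop word list

    // Simplified analyzer: split on non-letters, lowercase, drop stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{L}]+")) {
            String term = token.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    // Every required "*infix*" clause must be contained in at least one indexed term.
    static boolean matchesAll(List<String> indexedTerms, List<String> requiredInfixes) {
        for (String infix : requiredInfixes) {
            boolean found = false;
            for (String term : indexedTerms) {
                if (term.contains(infix)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> indexed = analyze("Test- und Produktivumgebung");
        System.out.println(indexed); // prints [test, produktivumgebung]
        System.out.println(matchesAll(indexed, List.of("test", "und", "produktivumgebung"))); // prints false
        System.out.println(matchesAll(indexed, List.of("test", "produktivumgebung")));        // prints true
    }
}
```

The required "und" clause fails because no indexed term contains it, which mirrors the behaviour described in the question: dropping "und" from the search makes the document findable again.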

Lucene - "AND" sets of "OR" terms

Suppose I have a search using criteria such as a list of countries. A user can select a set of countries to search across and combine this set with other criteria.
In SQL I'd do this in my WHERE clause, i.e. WHERE (country = 'brazil' OR country = 'france' OR country = 'china') AND (other search criteria).
It isn't clear how to do this in Lucene. Query.combine seems to have promise but that would increase in complexity very quickly if I have multiple sets of "OR" terms to work through.
Is Lucene capable in this regard? Or should I just hit my regular DB with these types of criteria and filter my Lucene results?
Digging deeper, it looks like you can nest boolean queries to accomplish this. I'll update with an answer if this technique works and if it is performant.

Using the standard query parser (and you can take a look at the relevant documentation), you can use syntax similar to a DB query, such as:
(country:brazil OR country:france OR country:china) AND (other search criteria)
Or, to simplify a bit:
country:(brazil OR france OR china) AND (other search criteria)
Alternatively, Lucene also supports queries written using +/-, rather than AND/OR syntax. I find that syntax more expressive for a Lucene query. The equivalent in this form would be:
+country:(brazil france china) +(other search criteria)
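If there are several sets of "OR" terms, a query string in this +/- form can be assembled mechanically. The following is a minimal sketch in plain Java, assuming the parser's default operator is OR and that the values need no escaping; the class name and the second field ("language") are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Builds a query string in which every field group is required (+)
// and the values inside a group are alternatives.
class AndOfOrsQueryString {

    static String build(Map<String, List<String>> groups) {
        StringJoiner clauses = new StringJoiner(" ");
        for (Map.Entry<String, List<String>> group : groups.entrySet()) {
            if (group.getValue().isEmpty()) {
                continue; // nothing selected for this field, so no restriction
            }
            clauses.add("+" + group.getKey() + ":(" + String.join(" ", group.getValue()) + ")");
        }
        return clauses.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> groups = new LinkedHashMap<>(); // keeps insertion order
        groups.put("country", List.of("brazil", "france", "china"));
        groups.put("language", List.of("pt", "fr")); // hypothetical second criterion
        System.out.println(build(groups)); // prints +country:(brazil france china) +language:(pt fr)
    }
}
```

The resulting string would then be handed to the query parser; values coming from user input would need escaping first.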
If manually constructing queries, you can indeed nest BooleanQueries to create a similar structure, using the correct BooleanClauses to establish the And/Or logic you've specified:
BooleanQuery countryQuery = new BooleanQuery();
countryQuery.add(new TermQuery(new Term("country","brazil")),BooleanClause.Occur.SHOULD);
countryQuery.add(new TermQuery(new Term("country","france")),BooleanClause.Occur.SHOULD);
countryQuery.add(new TermQuery(new Term("country","china")),BooleanClause.Occur.SHOULD);
Query otherStuffQuery = //Set up the other query here,
//or get it from a query parser, or something
BooleanQuery rootQuery = new BooleanQuery();
rootQuery.add(countryQuery, BooleanClause.Occur.MUST);
rootQuery.add(otherStuffQuery, BooleanClause.Occur.MUST);

There are two ways.
Let Lucene formulate the query. To accomplish that, send in the query string in the following format:
Query: "country:(brazil france china)"
An inbuilt QueryParser parses the above string to a BooleanQuery with an OR operator.
QueryParser qp = new QueryParser(Version.LUCENE_41, "country", new StandardAnalyzer(Version.LUCENE_41));
Query q = qp.parse(s);
If you want to formulate the query yourself,
BooleanQuery bq = new BooleanQuery();
//
TermQuery tq = new TermQuery(new Term("country", "brazil"));
bq.add(tq, Occur.SHOULD); // SHOULD ==> OR operator
//
tq = new TermQuery(new Term("country", "france"));
bq.add(tq, Occur.SHOULD);
//
tq = new TermQuery(new Term("country", "china"));
bq.add(tq, Occur.SHOULD);
Unless you add hundreds of subqueries, Lucene will meet your expectations performance-wise.

I am executing query using WildcardQuery of Lucene,but it doesn't work

I am executing a query using Lucene's WildcardQuery, but I don't know why no results are found.
Below are the details.
Here is the code that creates the WildcardQuery. A record with field name 'Full Name' and value 'ABC123DD456CC' exists in the index.
BooleanQuery booleanQuery = new BooleanQuery();
for (IndexQueryField field : quickSearchFields)
{
Query query = new WildcardQuery(new Term(field.getFieldName(),"ABC*DD*CC"));
booleanQuery.add(query, BooleanClause.Occur.SHOULD);
}
The code that executes the query:
Session hibernateSession = (Session) em.getDelegate();
FullTextSession session = SwitchSession.getFullTextSession(hibernateSession, specifyIndexName);
// Set Hibernate flushMode
session.setFlushMode(FlushMode.MANUAL);
// Ignore Hibernate Cache
session.setCacheMode(CacheMode.IGNORE);
FullTextQuery query = session.createFullTextQuery(booleanQuery,XXX.class);
List list = query.setFirstResult(1).setMaxResults(100).list();
The list is empty, but I am sure 'ABC123DD456CC' exists in a Lucene document.
I just want to do it with WildcardQuery. Any help would be appreciated!

I believe that last line should be:
List list = query.setFirstResult(0).setMaxResults(100).list();
Since results are numbered from 0. If there is only 1 document matching that search, which seems likely enough, that probably explains why you're getting nothing (having skipped the first and only result, at index 0).
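The effect of a zero-based offset can be sketched in plain Java. This does not use Hibernate Search; the page method below is a hypothetical stand-in that skips firstResult matches and then returns at most maxResults.

```java
import java.util.List;

// Mimics zero-based paging over a list of matches.
class FirstResultDemo {

    // Skips firstResult items, then returns up to maxResults of the remainder.
    static <T> List<T> page(List<T> matches, int firstResult, int maxResults) {
        int from = Math.min(firstResult, matches.size());
        int to = Math.min(from + maxResults, matches.size());
        return matches.subList(from, to);
    }

    public static void main(String[] args) {
        List<String> matches = List.of("ABC123DD456CC"); // the single matching record
        System.out.println(page(matches, 1, 100)); // prints []
        System.out.println(page(matches, 0, 100)); // prints [ABC123DD456CC]
    }
}
```

With a single match, an offset of 1 skips the only result, while an offset of 0 returns it.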
