I am executing a query using Lucene's WildcardQuery, but it doesn't work

I am executing a query using Lucene's WildcardQuery, but I don't know why no results are found.
Below are the details.
Here is the code that creates the WildcardQuery. A record with field name 'Full Name' and value 'ABC123DD456CC' exists in the indexed document.
BooleanQuery booleanQuery = new BooleanQuery();
for (IndexQueryField field : quickSearchFields)
{
    // One optional wildcard clause per searchable field.
    Query query = new WildcardQuery(new Term(field.getFieldName(), "ABC*DD*CC"));
    booleanQuery.add(query, BooleanClause.Occur.SHOULD);
}
The code that executes the query:
Session hibernateSession = (Session) em.getDelegate();
FullTextSession session = SwitchSession.getFullTextSession(hibernateSession, specifyIndexName);
// Set Hibernate flushMode
session.setFlushMode(FlushMode.MANUAL);
// Ignore Hibernate Cache
session.setCacheMode(CacheMode.IGNORE);
FullTextQuery query = session.createFullTextQuery(booleanQuery,XXX.class);
List list = query.setFirstResult(1).setMaxResults(100).list();
The list is empty, but I am sure that 'ABC123DD456CC' exists in the Lucene document.
I just want to do this with WildcardQuery. Any help would be appreciated!

I believe that last line should be:
List list = query.setFirstResult(0).setMaxResults(100).list();
Results are numbered from 0. If only one document matches that search, which seems likely enough, that probably explains why you're getting nothing: you have skipped the first and only result, at index 0.
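To illustrate the off-by-one effect, here is a minimal sketch in plain Java. It does not use Hibernate Search; the PagingSketch class and its page method are hypothetical stand-ins that only mimic a zero-based "first result" offset over an in-memory list.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class PagingSketch {
    // Mimics zero-based paging: skip `firstResult` hits, then return at most `maxResults`.
    static <T> List<T> page(List<T> hits, int firstResult, int maxResults) {
        if (firstResult >= hits.size()) {
            return Collections.emptyList();
        }
        int end = Math.min(hits.size(), firstResult + maxResults);
        return hits.subList(firstResult, end);
    }

    public static void main(String[] args) {
        // Assume the wildcard query matched exactly one document.
        List<String> hits = Arrays.asList("ABC123DD456CC");
        System.out.println(page(hits, 1, 100)); // offset 1 skips the only hit
        System.out.println(page(hits, 0, 100)); // offset 0 returns it
    }
}
```

With a single hit, an offset of 1 leaves nothing to return, which matches the empty list seen in the question.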

Related

Lucene calculate term vectors for existing index

With Lucene.net I would like to get the term vectors as described in this stackoverflow question.
The problem is, the index is already generated with the field indexed and stored, but without term vectors.
FieldType type = new FieldType();
type.setIndexed(true);
type.setStored(true);
type.setStoreTermVectors(false);
Theoretically, it should be possible to recalculate the term vectors for each document and then store them in the index.
Do you know how this could be done without deleting the complete Lucene index?
As mentioned in my comments in the question, you can generate term vector data on-the-fly, which may help you to avoid a complete rebuild of your indexed data.
In my scenario, I want to find the offset positions of my search term in the matched document.
I don't want to oversell this approach - it's absolutely not a substitute for re-indexing - but if your queries are basic, it may help.
Step 1: Perform whatever query you are currently performing.
For each document in the list of hits, you will then need to re-process the relevant field from that document - so, either you already have the field data stored in your existing index, or you will need to retrieve it from its original source.
Step 2: For each such field, you can re-use the same analyzer to build a token stream on-the-fly. The token stream can be configured with different attributes, such as:
token attributes
offset attributes
and others (see here)
Example:
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;
const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;
String? fieldName = null;
String fieldContent = "Foo Bar Baz Bar Bat";
String searchTerm = "bar";
// Re-use the same analyzer that built the index so the tokens line up.
var analyzer = new StandardAnalyzer(AppLuceneVersion);
var ts = analyzer.GetTokenStream(fieldName, fieldContent);
var charTermAttr = ts.AddAttribute<ICharTermAttribute>();
var offsetAttr = ts.AddAttribute<IOffsetAttribute>();
try
{
    ts.Reset();
    Console.WriteLine("");
    Console.WriteLine("Token: " + searchTerm);
    while (ts.IncrementToken())
    {
        // Report the character offsets of every token equal to the search term.
        if (searchTerm.Equals(charTermAttr.ToString()))
        {
            var start = offsetAttr.StartOffset;
            var end = offsetAttr.EndOffset;
            Console.WriteLine(String.Format(" > offset: {0}-{1}", start, end));
        }
    }
    ts.End();
}
finally
{
    // Release the token stream once processing is finished.
    ts.Dispose();
}
The above example assumes one of the hits from step 1 was a field containing "Foo Bar Baz Bar Bat" - with a search term of bar.
The output generated is:
Token: bar
> offset: 4-7
> offset: 12-15
So, as you can see, you are not re-executing a query - you are just re-processing a token stream. The more complex the original search term is, the harder it will be to make this approach work the way you probably need it to.
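The same idea can be sketched without any Lucene dependency. The Java example below is only an approximation: it swaps the real analyzer for a regular-expression tokenizer that lower-cases each word, and the OffsetSketch class and offsetsOf method are hypothetical names. It assumes the search term is already lower-cased, as it would be after analysis.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OffsetSketch {
    // Stand-in for an analyzer: treat each run of word characters as one token.
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Returns {start, end} character offsets of every token equal to searchTerm.
    static List<int[]> offsetsOf(String fieldContent, String searchTerm) {
        List<int[]> offsets = new ArrayList<>();
        Matcher m = WORD.matcher(fieldContent);
        while (m.find()) {
            // Lower-case the token before comparing, as a standard analyzer would.
            if (m.group().toLowerCase(Locale.ROOT).equals(searchTerm)) {
                offsets.add(new int[] {m.start(), m.end()});
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        for (int[] o : offsetsOf("Foo Bar Baz Bar Bat", "bar")) {
            System.out.println(" > offset: " + o[0] + "-" + o[1]);
        }
    }
}
```

For the sample field content this reports the same two offset pairs as the Lucene.NET example, 4-7 and 12-15. A real analyzer may tokenize differently (stop words, stemming), so the offsets are only reliable when the same analyzer is used as at index time.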

Search by exact words in a phrase using Umbraco Examine

I have a description field for each content item:
For content1:
The quick brown fox jumps over the lazy dog. And the lazy dog is good.
For content2:
The lazy fog is crazy.
Now, when I search with the keyword lazy dog, I want the result to be content1 and not content2.
I tried this:
BaseSearchProvider searcher = ExamineManager.Instance.SearchProviderCollection["MySearch"];
ISearchCriteria criteria =
    searcher.CreateSearchCriteria()
        .GroupedAnd(new List<string> { "description" }, "lazy dog")
        .Compile();
ISearchResults result = searcher.Search(criteria);
But it didn't give me the desired results; it returned both content1 and content2.
What should I do to get only content1 as the result?
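By default Examine compiles this query to:
+(+description:lazy dog)
In Lucene query syntax, only lazy is tied to the description field and required here; dog is a separate, optional term. Any document whose description contains lazy therefore matches, which is why content2 is returned as well.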
What you want to achieve is:
+(+description:"lazy dog")
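The difference between the two queries can be sketched in plain Java. This is not Examine or Lucene code: the PhraseSketch class and its naive tokenizer are hypothetical and only contrast a single-term match with an adjacent-words phrase match on the two sample descriptions.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class PhraseSketch {
    // Naive tokenizer: lower-case, then split on anything that is not a letter or digit.
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+"));
    }

    // Term match: the text contains the single term somewhere.
    static boolean containsTerm(String text, String term) {
        return tokens(text).contains(term);
    }

    // Phrase match: the terms appear next to each other, in the given order.
    static boolean containsPhrase(String text, String... phrase) {
        return Collections.indexOfSubList(tokens(text), Arrays.asList(phrase)) >= 0;
    }

    public static void main(String[] args) {
        String content1 = "The quick brown fox jumps over the lazy dog. And the lazy dog is good.";
        String content2 = "The lazy fog is crazy.";
        System.out.println(containsTerm(content2, "lazy"));          // true
        System.out.println(containsPhrase(content1, "lazy", "dog")); // true
        System.out.println(containsPhrase(content2, "lazy", "dog")); // false
    }
}
```

Only the phrase check separates content1 from content2, which is the behaviour the quoted query is meant to provide.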
The first thing to try is escaping the phrase. In your case it will be:
BaseSearchProvider searcher = ExamineManager.Instance.SearchProviderCollection["MySearch"];
ISearchCriteria criteria =
    searcher.CreateSearchCriteria()
        .GroupedAnd(new List<string> { "description" }, "lazy dog".Escape())
        .Compile();
ISearchResults result = searcher.Search(criteria);
I can't test it right now, but from what I remember there were some problems with it in the past. The second option, and possibly a lifesaver for you, is to build the search query manually and use the raw query.
BaseSearchProvider searcher = ExamineManager.Instance.SearchProviderCollection["MySearch"];
ISearchCriteria criteria = searcher.CreateSearchCriteria();
var query = criteria.RawQuery("+description:\"lazy dog\"");
ISearchResults result = searcher.Search( query );
It should return only the correctly matched result. Personally, I have also boosted specific words to push some results higher in the score list, but if you only want matching items, try the solutions above and let me know if they helped.
If you want to deal with more than one property, you can either use fluent API methods such as GroupedAnd or GroupedOr (depending on the desired search behaviour) or build a more advanced raw query.
For the first option, check the Grouped Operations documentation: https://github.com/Shazwazza/Examine/wiki/Grouped-Operations.
For the second scenario, it is best to look at how it is done in, for example, the ezSearch package (which, by the way, is awesome!): https://github.com/umco/umbraco-ezsearch/blob/master/Src/Our.Umbraco.ezSearch/Web/UI/Views/MacroPartials/ezSearch.cshtml.

Lucene phrase query with wildcards

I came up with a solution to programmatically create a query that searches for a phrase with wildcards, using this code:
public static Query createPhraseQuery(String[] phraseWords, String field) {
    SpanQuery[] queryParts = new SpanQuery[phraseWords.length];
    for (int i = 0; i < phraseWords.length; i++) {
        WildcardQuery wildQuery = new WildcardQuery(new Term(field, phraseWords[i]));
        queryParts[i] = new SpanMultiTermQueryWrapper<WildcardQuery>(wildQuery);
    }
    return new SpanNearQuery(queryParts, //words
            0, //max distance
            true //exact order
    );
}
Creating an example query and printing its toString() result:
String[] phraseWords = new String[]{"foo*", "b*r"};
Query phraseQuery = createPhraseQuery(phraseWords, "text");
System.out.println(phraseQuery.toString());
outputs:
spanNear([SpanMultiTermQueryWrapper(text:foo*), SpanMultiTermQueryWrapper(text:b*r)], 0, true)
This works great and is fast enough for most cases. For instance, if I create such a query and search with it, it returns the desired results, for example:
Sentence with foo bar.
Foolies beer drinkers.
...
And not something like:
Bar fooes.
Foo has bar.
I mentioned that the query is fast enough in most cases. Currently I have an index of approx. 200 GB, and the average search time is between 0.1 and 3 seconds, depending on many factors such as caching and the size of the document subsets matching each single word in the phrase, since Lucene performs set intersections between the terms found.
Example:
Let's suppose I want to query the phrase "an* karenjin*" (which I split into ["an*", "karenjin*"] and then turn into a query using the createPhraseQuery method), and I want it to match sentences containing "ana karenjina", "ani karenjinoj", "ane karenjine", ... (different cases due to Croatian grammar).
This query is so slow that I have never waited long enough to get results (over 1 h), and it sometimes causes a GC overhead limit exceeded exception.
This behaviour is somewhat expected, since "an*" by itself matches a huge number of documents. I am aware that I could query "an? karenjin*", which gives results in 30-40 sec (faster but still slow).
This is where I am confused.
If I query just "karenjin*", it gives results in 1 sec. Therefore I tried querying "an* karenjin*" with a Filter for "karenjin*", built from a WildcardQuery and a QueryWrapperFilter. It is still unacceptably slow (I killed the process before it returned anything).
The documentation says that a Filter reduces the search space of a Query. So I tried to use this filter:
Filter filter = new QueryWrapperFilter(new WildcardQuery(new Term("text", "karenjin*")));
And query:
Query query = createPhraseQuery(new String[]{"an*", "karenjin*"}, "text");
Then I search (after several warm-up queries):
Sort sort = new Sort(new SortField("insertTime", SortField.Type.STRING, true));
TopDocs docs = searcher.search(query, filter, 100, sort);
OK, so what is my question?
How come querying:
Query query = new WildcardQuery(new Term("text", "karenjin*"));
is fast, but using the Filter described above is still slow?
Yes, wildcards can be performance hogs, especially if they match a lot of terms, but what you describe does seem surprisingly slow. It is hard to say for sure why that is occurring, but here is an attempt.
I'll assume:
Query query = new WildcardQuery(new Term("text", "an*"));
on its own is performing very badly, as described. Since the wildcards you are looking for are both prefix-style queries, it's a better idea to use a PrefixQuery instead.
Query query = new PrefixQuery(new Term("text", "an"));
Though I don't think that will make much of a difference, if any at all. What might just make a difference is changing your rewrite method. You could try limiting the number of terms the query is rewritten into:
PrefixQuery query = new PrefixQuery(new Term("text", "an"));
//or
//WildcardQuery query = new WildcardQuery(new Term("text", "an*"));
// Keep only the 10 best-scoring terms when the query is rewritten.
query.setRewriteMethod(new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(10));
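To show why capping the rewrite helps, here is a rough sketch in plain Java of how a prefix expands into concrete terms. It is a simplification and not Lucene's implementation: the PrefixExpansionSketch class is hypothetical, the term dictionary is a small in-memory TreeSet, and it keeps the first matching terms in sorted order, whereas Lucene ranks the candidate terms before keeping the top ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

final class PrefixExpansionSketch {
    // Expands a prefix against a sorted term dictionary, keeping at most maxTerms terms.
    static List<String> expand(NavigableSet<String> dictionary, String prefix, int maxTerms) {
        List<String> expanded = new ArrayList<>();
        // Terms sharing the prefix are contiguous in sorted order, starting at the prefix itself.
        for (String term : dictionary.tailSet(prefix, true)) {
            if (!term.startsWith(prefix) || expanded.size() >= maxTerms) {
                break;
            }
            expanded.add(term);
        }
        return expanded;
    }

    public static void main(String[] args) {
        NavigableSet<String> dictionary = new TreeSet<>(
                Arrays.asList("ana", "and", "ane", "ani", "karenjina", "karenjine"));
        System.out.println(expand(dictionary, "an", 10)); // every term starting with "an"
        System.out.println(expand(dictionary, "an", 2));  // capped at two terms
    }
}
```

Each kept term becomes one clause in the rewritten query, so a prefix such as "an" on a large index can expand into a very large number of clauses unless the expansion is capped.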

Sitecore term query for filter data

In Sitecore Lucene search I am using a term query to filter data from Sitecore.
I have a field in Sitecore called "Description" and I want to filter on the term "Lorem", but every time I get 0 results. If I don't use the term query I get all results, which means my index configuration is correct. Please help.
TermQuery bothQuery = new TermQuery (new Term("Description", "Lorem"));
BooleanQuery query = new BooleanQuery();
query.Add(bothQuery, BooleanClause.Occur.MUST);
TopDocs topDocs = sc.Searcher.Search(query, int.MaxValue);
SearchHits searchHits = new SearchHits(topDocs, sc.Searcher.GetIndexReader());
return searchHits.FetchResults(0, int.MaxValue).Select(r => r.GetObject<Item>()).ToList();
I note that your Term definition above has a field name containing a capital letter. You don't specify the version of Sitecore / Lucene you're working in, but my experience with the 6.x series of Sitecore is that the indexing process transforms all the Field names to lower case at index time.
Hence your field in Sitecore might be called "Description" but in Lucene's index it is probably called "description". Try changing your code to use a lower case field name.
You can check this using an index display tool like the Lucene Index Viewer from the Sitecore Marketplace. It will show you the names of the fields in your index, and let you test queries against them without the need to recompile code.
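The effect of the lower-cased field names can be sketched in plain Java. This is not Sitecore or Lucene code: the FieldNameSketch class is hypothetical, and it assumes, as described above, that field names are lower-cased at index time while lookups compare the name exactly.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class FieldNameSketch {
    // Hypothetical indexing step: field names are stored lower-cased.
    static Map<String, String> indexDocument(Map<String, String> fields) {
        Map<String, String> indexed = new HashMap<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            indexed.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        return indexed;
    }

    // Exact-match lookup: no normalization is applied to the requested field name.
    static boolean hasField(Map<String, String> indexed, String fieldName) {
        return indexed.containsKey(fieldName);
    }

    public static void main(String[] args) {
        Map<String, String> fields = new HashMap<>();
        fields.put("Description", "Lorem ipsum");
        Map<String, String> indexed = indexDocument(fields);
        System.out.println(hasField(indexed, "Description")); // false
        System.out.println(hasField(indexed, "description")); // true
    }
}
```

A query against "Description" finds nothing in this sketch even though the data is present, which mirrors the 0 results in the question.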

emit every document in the database with lucene

I've got an index where I need to get all documents with a standard search, still ranked by relevance, even if a document isn't a hit.
My first idea is to add a field that is always matched, but that might deform the relevance score.
Use a BooleanQuery to combine your original query with a MatchAllDocsQuery. You can mitigate the effect this has on scoring by setting the boost on the MatchAllDocsQuery to zero before you combine it with your main query. This way you don't have to add an otherwise bogus field to the index.
For example:
// Parse a query by the user.
QueryParser qp = new QueryParser(Version.LUCENE_35, "text", new StandardAnalyzer(Version.LUCENE_35));
Query standardQuery = qp.parse("User query may go here");
// Make a query that matches everything, but has no boost.
MatchAllDocsQuery matchAllDocsQuery = new MatchAllDocsQuery();
matchAllDocsQuery.setBoost(0f);
// Combine the queries.
BooleanQuery boolQuery = new BooleanQuery();
boolQuery.add(standardQuery, BooleanClause.Occur.SHOULD);
boolQuery.add(matchAllDocsQuery, BooleanClause.Occur.SHOULD);
// Now just pass it to the searcher.
This should give you hits from standardQuery followed by the rest of the documents in the index.
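The resulting ordering can be sketched in plain Java. This is only an illustration of the ranking effect, not Lucene scoring: the MatchAllSketch class is hypothetical, the scores are made up, and documents the user query did not match are simply given a score of 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MatchAllSketch {
    // Ranks every document: real hits by their score, all other documents with score 0.
    static List<String> rank(List<String> allDocs, Map<String, Float> queryScores) {
        List<String> ranked = new ArrayList<>(allDocs);
        // List.sort is stable, so unmatched documents keep their original order.
        ranked.sort((a, b) -> Float.compare(
                queryScores.getOrDefault(b, 0f), queryScores.getOrDefault(a, 0f)));
        return ranked;
    }

    public static void main(String[] args) {
        List<String> allDocs = Arrays.asList("d1", "d2", "d3", "d4");
        Map<String, Float> queryScores = new HashMap<>();
        queryScores.put("d3", 0.9f); // hypothetical scores from the user query
        queryScores.put("d1", 0.4f);
        System.out.println(rank(allDocs, queryScores)); // [d3, d1, d2, d4]
    }
}
```

The matched documents come first in score order, followed by everything else, because a zero-boost match-all clause adds nothing to any score.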