Creating a MoreLikeThis query yields empty queries and terms - lucene

I'm trying to get similar Document of another one . I'm using the Lucene.Net MoreLikeThis-Class to achieve this. For this i seperate my Documents in multiple Fields - Title and Content. Now creating the actual query results in an empty query without interesting Terms.
My could looks like this:
var queries = new List<Query>();
foreach(var docField in docFields)
var similarSearch = new MoreLikeThis(indexReader);
similarSearch.SetFieldNames(docField.fieldName);
similarSearch.Analyzer = new GermanAnalyzer(Version.LUCENE_30, new HashSet<string>(StopWords));
similarSearch.MinDocFreq = 1;
similarSearch.MinTermFreq = 1;
similarSearch.MinWordLen = 1;
similarSearch.Boost = true;
similarSearch.BoostFactor = boostFactor;
using(var reader = new StringReader(docField.Content)){
var searchQuery = similarSearch.Like(reader);
// debugging purpose
var queryString = searchQuery.ToString(); // empty
var terms = similarSearch.RetrieveInterestingTerms(reader); // also empty
queries.Add(searchQuery);
}
var booleanQuery = new BooleanQuery();
foreach(var moreLikeThisQuery in queries)
{
booleanQuery.Add(moreLikeThisQuery, Occur.SHOULD);
}
var topDocs = indexSearcher.Search(booleanQuery, maxNumberOfResults); // and of course no results obtained
So the question is:
Why there are no Terms / why there is no query generated?
I hope important thing's be seen, if not please help me to make my first question better :)

I got it to work.
The problem was, that i worked on the false directory.
I have different Solutions for creating the index and creating the queries and had a missmatch with the index-location.
So the generall solution would be:
Is your Querygenerating-Class fully initialized? (MinDocFreq, MinTermFreq, MinWordLen, has a Analyzer, set the fieldNames)
Is your used IndexReader correctly initialized?

Related

Query Performance for multiple IQueryable in .NET Core with LINQ

I am currently updating a BackEnd project to .NET Core and having performance issues with my Linq queries.
Main Queries:
var queryTexts = from text in _repositoryContext.Text
where text.KeyName.StartsWith("ApplicationSettings.")
where text.Sprache.Equals("*")
select text;
var queryDescriptions = from text in queryTexts
where text.KeyName.EndsWith("$Descr")
select text;
var queryNames = from text in queryTexts
where !(text.KeyName.EndsWith("$Descr"))
select text;
var queryDefaults = from defaults in _repositoryContext.ApplicationSettingsDefaults
where defaults.Value != "*"
select defaults;
After getting these IQueryables I run a foreach loop in another context to build my DTO model:
foreach (ApplicationSettings appl in _repositoryContext.ApplicationSettings)
{
var applDefaults = queryDefaults.Where(c => c.KeyName.Equals(appl.KeyName)).ToArray();
description = queryDescriptions.Where(d => d.KeyName.Equals("ApplicationSettings." + appl.KeyName + ".$Descr"))
.FirstOrDefault()?
.Text1 ?? "";
var name = queryNames.Where(n => n.KeyName.Equals("ApplicationSettings." + appl.KeyName)).FirstOrDefault()?.Text1 ?? "";
// Do some stuff with data and return DTO Model
}
In my old Project, this part had an execution from about 0,45 sec, by now I have about 5-6 sec..
I thought about using compiled queries but I recognized these don't support returning IEnumerable yet. Also I tried to avoid Contains() method. But it didn't improve performance anyway.
Could you take short look on my queries and maybe refactor or give some hints how to make one of the queries faster?
It is to note that _repositoryContext.Text has compared to other contexts the most entries (about 50 000), because of translations.
queryNames, queryDefaults, and queryDescriptions are all queries not collections. And you are running them in a loop. Try loading them outside of the loop.
eg: load queryNames to a dictionary:
var queryNames = from text in queryTexts
where !(text.KeyName.EndsWith("$Descr"))
select text;
var queryNamesByName = queryName.ToDictionary(n => n.KeyName);
one can write queries like below
var Profile="developer";
var LstUserName = alreadyUsed.Where(x => x.Profile==Profile).ToList();
you can also use "foreach" like below
lstUserNames.ForEach(x=>
{
//do your stuff
});

Apache Lucene fuzzy search for multi-worded phrases

I have the following Apache Lucene 7 application:
StandardAnalyzer standardAnalyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(standardAnalyzer);
IndexWriter writer = new IndexWriter(directory, config);
Document document = new Document();
document.add(new TextField("content", new FileReader("document.txt")));
writer.addDocument(document);
writer.close();
IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
Query fuzzyQuery = new FuzzyQuery(new Term("content", "Company"), 2);
TopDocs results = searcher.search(fuzzyQuery, 5);
System.out.println("Hits: " + results.totalHits);
System.out.println("Max score:" + results.getMaxScore())
when I use it with :
new FuzzyQuery(new Term("content", "Company"), 2);
the application works fine and returns the following result:
Hits: 1
Max score:0.35161147
but when I try to search with multi term query, for example:
new FuzzyQuery(new Term("content", "Company name"), 2);
it returns the following result:
Hits: 0
Max score:NaN
Anyway, the phrase Company name exists in the source document.txt file.
How to properly use FuzzyQuery in this case in order to be able to do the fuzzy search for multi-word phrases.
UPDATED
Based on the provided solution I have tested it on the following text information:
Company name: BlueCross BlueShield Customer Service
1-800-521-2227
of Texas Preauth-Medical 1-800-441-9188
Preauth-MH/CD 1-800-528-7264
Blue Card Access 1-800-810-2583
For the following query:
SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueCross"), 2));
clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueShield"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);
the search works fine:
Hits: 1
Max score:0.5753642
but when I try to corrupt a little bit the search query(for example from BlueCross to BlueCros)
SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueCros"), 2));
clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueShield"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);
it stops working and returns:
Hits: 0
Max score:NaN
The problem here is the following, you're using TextField, which is tokenizing field. E.g. your text "Company name is working on something" would be effectively split by spaces (and others delimeters). So, even if you have the text Company name, during indexation it will become Company, name, is, etc.
In this case this TermQuery won't be able to find what you're looking for. The trick which going to help you would look like this:
SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper(new FuzzyQuery(new Term("content", "some"), 2));
clauses[1] = new SpanMultiTermQueryWrapper(new FuzzyQuery(new Term("content", "text"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);
However, I wouldn't recommend this approach much, especially if your load would be big and you're planning on searching on a 10 term long company names. One should be aware, that those query are potentially heavy to execute.
The following problem with BlueCros is the following. By default Lucene uses StandardAnalyzer for TextField. So it means it effectively lowercase the terms, basically it means that BlueCross in the content field becomes bluecross.
Fuzzy difference between BlueCros and bluecross is 3, that's the reason you do not have a match.
Simple proposal would be to convert term in query to the lowercase, by doing something like .toLowerCase()
In general, one should prefer to use same analyzers during the query time as well (e.g. during construction of the query)
For Lucene.Net it can be like this.
private string _IndexPath = #"Your Index Path";
private Directory _Directory;
private Searcher _IndexSearcher;
private MultiPhraseQuery _MultiPhraseQuery;
_Directory = FSDirectory.Open(_IndexPath);
IndexReader indexReader = IndexReader.Open(_Directory, true);
string field = "Name" // Your field name
string keyword = "big red fox"; // your search term
float fuzzy = 0,7f; // between 0-1
using (_IndexSearcher = new IndexSearcher(indexReader))
{
// "big red fox" to [big,red,fox]
var keywordSplit = keyword.Split();
_MultiPhraseQuery = new MultiPhraseQuery();
FuzzyTermEnum[] _FuzzyTermEnum = new FuzzyTermEnum[keywordSplit.Length];
Term[] _Term = new Term[keywordSplit.Length];
for (int i = 0; i < keywordSplit.Length; i++)
{
_FuzzyTermEnum[i] = new FuzzyTermEnum(indexReader, new Term(field, keywordSplit[i]),fuzzy);
_Term[i] = _FuzzyTermEnum[i].Term;
if (_Term[i] == null)
{
_MultiPhraseQuery.Add(new Term(field, keywordSplit[i]));
}
else
{
_MultiPhraseQuery.Add(_FuzzyTermEnum[i].Term);
}
}
var results = _IndexSearcher.Search(_MultiPhraseQuery, indexReader.MaxDoc);
foreach (var loopDoc in results.ScoreDocs.OrderByDescending(s => s.Score))
{
//YourCode Here
}
}

ZF2 Lucene query returns nothing - same query in Luke returns expected result

In ZF2 I am using the following code to generate a Lucene query:
$query = new Lucene\Search\Query\Boolean();
$term = new Lucene\Index\Term($search->type, 'type');
$subqueryTerm = new Lucene\Search\Query\Term($term);
$query->addSubquery($subqueryTerm, true);
$term = new Lucene\Index\Term('[+<= ' . $search->purchase_value . ']', 'min_price');
$subqueryTerm = new Lucene\Search\Query\Term($term);
$query->addSubquery($subqueryTerm, true); // required
$term = new Lucene\Index\Term('[' . $search->purchase_value . ' >=]', 'max_price');
$subqueryTerm = new Lucene\Search\Query\Term($term);
$query->addSubquery($subqueryTerm, true); // required
which produces the following query:
+(type:Buying) +(min_price:[+<= 160.00]) +(max_price:[160.00 >=])
When I run this query in ZF2 ($hits = $index->find($query);) I get an empty array returned, however when I use Luke to manually run the query against the index it returns the result I am expecting.
What do I need to change in my code to make it return the same results as Luke?
I am using the default analyser for both systems:
Luke: org.apache.lucene.analysis.KeywordAnalyzer
ZF2: (which I think is) \ZendSearch\Lucene\Analysis\Analyzer\Common\Text
Do I need to use a different QueryParser?
You are searching for the term "[+<= 160.00]" in the field min_price. You are not creating a range query. I don't quite get what you mean by using a different QueryParser, you aren't using a QueryParser at all, you are using the API to construct your queries. You need to use a range search:
$term = new Lucene\Index\Term($search->purchase_value, 'min_price');
$subqueryRange = new Lucene\Search\Query\Range(null, $term, true);
$query->addSubquery($subqueryRange, true);
$term = new Lucene\Index\Term($search->purchase_value, 'max_price');
$subqueryRange = new Lucene\Search\Query\Range($term, null, true);
$query->addSubquery($subqueryRange, true);
I'm assuming here, by the way, that "+<=" is meant to be query syntax for less than or equal to. I'm not familiar with anything in lucene that supports that sort of syntax. An open ended range, as shown, would be the correct thing to use.

Lucene DrillSideways with custom sort field or with TopScoreDocCollector

I am trying to implement drillsideways search with Lucene 4.6.1.
Following code works fine:
DrillSideways ds = new DrillSideways(searcher, taxoReader);
FacetSearchParams fsp = new FacetSearchParams(getAllFacetCounts());
DrillDownQuery ddq = new DrillDownQuery(fsp.indexingParams, mainQuery);
List<CategoryPath> paths = new ArrayList<CategoryPath>();
...
add category path
...
if (paths.size() >0)
ddq.add(paths.toArray(new CategoryPath[paths.size()]));
DrillSidewaysResult dsr = ds.search(null, ddq, 500, fsp); // <-- here
TopDocs topDocs = dsr.hits;
ScoreDoc[] hits = topDocs.scoreDocs;
// list search results
listSearchResults(searcher, hits, Math.min(500, topDocs.totalHits));
But what if I want to pass TopScoreDocCollector, like
// for now it is top score collector,
// but I may want to implement custom sort
TopScoreDocCollector topDocsCollector = TopScoreDocCollector.create(500, true);
DrillSidewaysResult dsr = ds.search(ddq, topDocsCollector, fsp);
the result is empty set and no errors. What is wrong?
I'm guessing you are referring to the value of DrillSidewaysResult.hits, and it is intended behavior, as noted in the documentation of DrillSidewaysResult:
Note that if you called DrillSideways.search(DrillDownQuery, Collector, FacetSearchParams), then hits will be null.
You should get your hits from the Collector instead.

Rally: Query Filtered to Specific Tags

I am writing a custom app for Rally and I would like to filter the data to stories with specific tags. So far I haven't found a way to write the correct syntax to achieve this purpose. Here's an example of a standard query that I would include in the cardboardConfig:
var query = new rally.sdk.util.Query('SomeField = "Some Value"');
This works well enough when trying to query a field that contains a single value, but this doesn't appear to work on tags since tags are arrays -- assuming I am even referencing the correct field name. I tried all of the following without success:
var query = new rally.sdk.util.Query('Tags = "Some Value"');
var query = new rally.sdk.util.Query('Tags contains "Some Value"');
var query = new rally.sdk.util.Query('Tag = "Some Value"');
var query = new rally.sdk.util.Query('Tag contains "Some Value"');
var query = new rally.sdk.util.Query('Tags = {"Some Value"}');
var query = new rally.sdk.util.Query('Tags contains {"Some Value"}');
var query = new rally.sdk.util.Query('Tag = {"Some Value"}');
var query = new rally.sdk.util.Query('Tag contains {"Some Value"}');
var SearchTags = { "Some Value" };
var query = new rally.sdk.util.Query('Tags = SearchTags');
var SearchTags = { "Some Value" };
var query = new rally.sdk.util.Query('Tags contains SearchTags');
What is the correct field name and operator for filtering the data to specific tags?
Try this:
var query = new rally.sdk.util.Query('Tags.Name Contains "Some Value");
This works for tags, but won't currently work for all collections.