Delete all documents that don't match a term? - lucene

How would I match all documents that don't match a term in lucene or lucene.net ?
If I want to delete all the documents that match a term it's easy :
writer.DeleteDocuments(new Term("SomeParameter", SomeValue));
But i actually need to do the opposite (I'm marking each updated document with a unique tag, I want to delete everything that wasn't updated, so everything whose tag is diferent from my tag, but it could be very diferent things)

You need a query that matches all documents that don't have the term, but BooleanQuery cannot contain just a single MUST_NOT clause.
But you can use the following trick to bypass this:
var query = new BooleanQuery();
query.Add(new MatchAllDocsQuery(), Occur.SHOULD);
query.Add(new Term("SomeParameter", someValue), Occur.MUST_NOT);
writer.DeleteDocuments(query);

Related

Umbraco Lucene index for multiple values under one field

I have a requirement to index a series of key phrases assigned to articles. The phrases are stored as a string with a \r\n delimiter and one phrase may contain another phrase, for example:
This is a key phrase
This is a key phrase too
This is also a key phrase
Would be stored as
keywords: "This is a key phrase\r\nThis is a key phrase too\r\nThis is also a key phrase"
An article which has only the phrase This is a key phrase too should not be matched when a search for This is a key phrase is performed.
I have a custom indexer implementing ISimpleDataService which works fine and indexes the content, but I can't work out how to get a query such as "This is a key phrase" to return results.
From what I've read, I thought the default QueryParser should split on delimiters and see each entry as a separate value, but it doesn't seem to work that way.
Although I've tried various implementations, my current search code looks like this:
var searcher = ExamineManager.Instance.SearchProviderCollection["KeywordsSearcher"];
var searchCriteria = searcher.CreateSearchCriteria(BooleanOperation.Or);
var query = searchCriteria.Field("keywords", keyword).Compile();
var searchResults = searcher.Search(query).OrderByDescending(x => x.Score).ToList();
The 'simple' way I thought to do this was to add each keyword as a separate 'keyword' field, but the SimpleDataSet provided as part of the .NET implementation uses a Dictionary<string, string>, which precludes me from being able to have more than one key with the same name.
I'm new to Lucene and Umbraco, so any advice would be gratefully received.

what is the difference between TermQuery and QueryParser in Lucene 6.0?

There are two queries,one is created by QueryParser:
QueryParser parser = new QueryParser(field, analyzer);
Query query1 = parser.parse("Lucene");
the other is term query:
Query query2=new TermQuery(new Term("title", "Lucene"));
what is the difference between query1 and query2?
This is the definition of Term from lucene docs.
A Term represents a word from text. This is the unit of search. It is composed of two elements, the text of the word, as a string, and the name of the field that the text occurred in.
So in your case the query will be created to search the word "Lucene" in the field "title".
To explain the difference between the two let me take a difference example,
consider the following
Query query2 = new TermQuery(new Term("title", "Apache Lucene"));
In this case the query will search for the exact word "Apache Lucene" in the field title.
In the other case
As an example, let's assume a Lucene index contains two fields, "title" and "body".
QueryParser parser = new QueryParser("title", "StandardAnalyzer");
Query query1 = parser.parse("title:Apache body:Lucene");
Query query2 = parser.parse("title:Apache Lucene");
Query query3 = parser.parse("title:\"Apache Lucene\"");
couple of things.
"title" is the field that QueryParser will search if you don't prefix it with a field.(as given in the constructor).
parser.parse("title:Apache body:Lucene"); -> in this case the final query will look like this. query2 = title:Apache body:Lucene.
parser.parse("body:Apache Lucene"); -> in this case the final query will also look like this. query2 = body:Apache title:Lucene. but for a different reason.
So the parser will search "Apache" in body field and "Lucene" in title field. Since The field is only valid for the term that it directly precedes,(http://lucene.apache.org/core/2_9_4/queryparsersyntax.html)
So since we do not specify any field for lucene , the default field which is "title" will be used.
query2 = parser.parse("title:\"Apache Lucene\""); in this case we are explicitly telling that we want to search for "Apache Lucene" in field "title". This is phrase query and is similar to Term query if analyzed correctly.
So to summarize the term query will not analyze the term and search as it is. while Query parser parses the input based on some conditions described above.
The QueryParser parses the string and constructs a BooleanQuery (afaik) consisting of BooleanClauses and analyzes the terms along the way.
The TermQuery does NOT do analysis, and takes the term as-is. This is the main difference.
So the query1 and query2 might be equivalent (in a sense, that they provide the same search results) if the field is the same, and the QueryParser's analyzer is not changing the term.

Sitecore term query for filter data

In Sitecore lucene search i am using "term query" to filter data from sitecore.
Here i have one field in Sitecore called "Description" and i want to do fileration based on term "Lorem". But every time I am getting 0 result. If i dont use rterm query i get all result that means my index configuration is correct. Please help.
TermQuery bothQuery = new TermQuery (new Term("Description", "Lorem"));
BooleanQuery query = new BooleanQuery();
query.Add(bothQuery, BooleanClause.Occur.MUST);
TopDocs topDocs = sc.Searcher.Search(query, int.MaxValue);
SearchHits searchHits = new SearchHits(topDocs, sc.Searcher.GetIndexReader());
return searchHits.FetchResults(0, int.MaxValue).Select(r => r.GetObject<Item>()).ToList();
I note that your Term definition above has a field name containing a capital letter. You don't specify the version of Sitecore / Lucene you're working in, but my experience with the 6.x series of Sitecore is that the indexing process transforms all the Field names to lower case at index time.
Hence your field in Sitecore might be called "Description" but in Lucene's index it is probably called "description". Try changing your code to use a lower case field name.
You can check this using an index display tool like the Lucene Index Viewer from the Sitecore Marketplace. It will show you the names of the fields in your index, and let you test queries against them without the need to recompile code.

Find typo with Lucene

I would like to use Lucene to index/search text. The text can contain mistyped words, names, etc. What is the most simple way of getting Lucene to find a document containing
"this is Licene"
when user searches for
"Lucene"?
This is only for a demo app, so we need the most simple solution.
Lucene's fuzzy queries and based on Levenshtein edit distance.
Use a fuzzy query in the QueryParser, with syntax like:
Lucene~0.5
Or create a FuzzyQuery, passing in the maximum number of edits, something like:
Query query = new FuzzyQuery(new Term("field", "lucene"), 1);
Note: FuzzyQuery, in Lucene 4.x, does not support greater edit distances than 2.
Another option you could try is using the Lucene SpellChecker:
http://lucene.apache.org/core/6_4_0/suggest/org/apache/lucene/search/spell/SpellChecker.html
It is a out of box, and very easy to use:
SpellChecker spellchecker = new SpellChecker(spellIndexDirectory);
// To index a field of a user index:
spellchecker.indexDictionary(new LuceneDictionary(my_lucene_reader, a_field));
// To index a file containing words:
spellchecker.indexDictionary(new PlainTextDictionary(new File("myfile.txt")));
String[] suggestions = spellchecker.suggestSimilar("misspelt", 5);
By default, it is using the LevensteinDistance, but you could provide your own customized Edit Distance.

emit every document in the database with lucene

I've got an index where I need to get all documents with a standard search, still ranked by relevance, even if a document isn't a hit.
My first idea is to add a field that is always matched, but that might deform the relevance score.
Use a BooleanQuery to combine your original query with a MatchAllDocsQuery. You can mitigate the effect this has on scoring by setting the boost on the MatchAllDocsQuery to zero before you combine it with your main query. This way you don't have to add an otherwise bogus field to the index.
For example:
// Parse a query by the user.
QueryParser qp = new QueryParser(Version.LUCENE_35, "text", new StandardAnalyzer());
Query standardQuery = qp.parse("User query may go here");
// Make a query that matches everything, but has no boost.
MatchAllDocsQuery matchAllDocsQuery = new MatchAllDocsQuery();
matchAllDocsQuery.setBoost(0f);
// Combine the queries.
BooleanQuery boolQuery = new BooleanQuery();
boolQuery.add(standardQuery, BooleanClause.Occur.SHOULD);
boolQuery.add(matchAllDocsQuery, BooleanClause.Occur.SHOULD);
// Now just pass it to the searcher.
This should give you hits from standardQuery followed by the rest of the documents in the index.