Lucene query with filter "without property" - lucene

I need to write lucene query/filter to get objects without specific property.
I tried with ... ISNULL:"cm:param_name" but id didn't work.
Edit: I have added new property in aspect but objects that haven't been updated yet don't have it amongst their listed properties (checked with node browser).

With a query like "cm:*", you should only receive documents that have the field "cm" plus content. Note that you have to allow leading wildcard queries by the query parser with setAllowLeadingWildcard(true).
Also check out this post, which deals with a reversed version of your problem:
Find all Lucene documents having a certain field

Can you please be more clear as to what "without property" means ? Do you mean that you do not want to specify the field like so "field:value" and instead set the filter to "value" ?
EDIT
Are you generating these field names dynamically or is this the only field name that can have it's value missing ? If there is only one field that may or may not appear in your document then you could just populate it with a default value when it's missing and then search for that . Otherwise, you could try a negated rangequery like so : NOT foo:[* TO *] . This should match all documents without a value in the foo field. For performance purposes , in the second case the field should be indexed as a string field (not analyzed).

I managed to get this done with .. AND NOT (#namespace\:property:"")

In Java and Lucene 3.6.2 the "FieldValueFilter" with activated negation can be used: (which was not the question)
import org.apache.lucene.search.FieldValueFilter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TopDocs;
final IndexSearcher indexSearcher = getIndexSearcher() <- whereever that comes from
final TopDocs topdocs = indexSearcher.search(new MatchAllDocsQuery(), new FieldValueFilter("cm", true), Integer.MAX_VALUE);

You can use ISUNSET and/or ISNULL for this scenario.
ISUNSET:"cm:title"
ISNULL:"cm:title"

Related

Can I clear the stopword list in lucene.net in order for exact matches to work better?

When dealing with exact matches I'm given a real world query like this:
"not in education, employment, or training"
Converted to a Lucene query with stopwords removed gives:
+Content:"? ? education employment ? training"
Here's a more contrived example:
"there is no such thing"
Converted to a Lucene query with stopwords removed gives:
+Content:"? ? ? ? thing"
My goal is to have searches like these match only the exact match as the user entered it.
Could one solution be to clear the stopwords list? would this have adverse affects? if so what? (my google-fu failed)
This all depends on the analyzer you are using. The StandardAnalyzer uses Stop words and strips them out, in fact the StopAnalyzer is where the StandardAnalyzer gets its stop words from.
Use the WhitespaceAnalyzer or create your own by inheriting from one that most closely suits your needs and modify it to be what you want.
Alternatively, if you like the StandardAnalyzer you can new one up with a custom stop word list:
//This is what the default stop word list is in case you want to use or filter this
var defaultStopWords = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
//create a new StandardAnalyzer with custom stop words
var sa = new StandardAnalyzer(
Version.LUCENE_29, //depends on your version
new HashSet<string> //pass in your own stop word list
{
"hello",
"world"
});

Find typo with Lucene

I would like to use Lucene to index/search text. The text can contain mistyped words, names, etc. What is the most simple way of getting Lucene to find a document containing
"this is Licene"
when user searches for
"Lucene"?
This is only for a demo app, so we need the most simple solution.
Lucene's fuzzy queries and based on Levenshtein edit distance.
Use a fuzzy query in the QueryParser, with syntax like:
Lucene~0.5
Or create a FuzzyQuery, passing in the maximum number of edits, something like:
Query query = new FuzzyQuery(new Term("field", "lucene"), 1);
Note: FuzzyQuery, in Lucene 4.x, does not support greater edit distances than 2.
Another option you could try is using the Lucene SpellChecker:
http://lucene.apache.org/core/6_4_0/suggest/org/apache/lucene/search/spell/SpellChecker.html
It is a out of box, and very easy to use:
SpellChecker spellchecker = new SpellChecker(spellIndexDirectory);
// To index a field of a user index:
spellchecker.indexDictionary(new LuceneDictionary(my_lucene_reader, a_field));
// To index a file containing words:
spellchecker.indexDictionary(new PlainTextDictionary(new File("myfile.txt")));
String[] suggestions = spellchecker.suggestSimilar("misspelt", 5);
By default, it is using the LevensteinDistance, but you could provide your own customized Edit Distance.

Lucene.net PerFieldAnalyzerWrapper

I've read on how to use the per field analyzer wrapper, but can't get it to work with a custom analyzer of mine. I can't even get the analyzer to run the constructor, which makes me believe I'm actually calling the per field analyzer incorrectly.
Here's what I'm doing:
Create the per field analyzer:
PerFieldAnalyzerWrapper perFieldAnalyzer = new PerFieldAnalyzerWrapper(srchInfo.GetAnalyzer(true));
perFieldAnalyzer.AddAnalyzer("<special field>", dta);
Add all the fields do document as usual, including a special field that we analyze differently.
And add document using the analyzer like this:
iw.AddDocument(doc, perFieldAnalyzer);
Am I on the right track?
The problem was related to my reliance on CMSs (Kentico) built-in Lucene helper classes. Basically, using those classes you need to specify the custom analyzer at index-level through the CMS and I did not wish to do that. So I ended up using Lucene.net directly almost everywhere gaining the flexibility of using any custom analyzer I want
I also did some changes to how I structure data and ended up using the tried-and-true KeywordAnalyzer to analyze document tags. Previously I was trying to do some custom tokenization magic on comma separated values like [tag1, tag2, tag with many parts] and could not get it reliably working with multi-parted tags. I still kept that field, but started adding multiple "tag" fields to the document, each storing one tag. So now I have N "tag" fields for "N" tags, each analyzed as a keyword, meaning each tag (one word or many) is a single token.
I think I overthinked it with my initial approach.
Here is what I ended up with.
On Indexing:
KeywordAnalyzer ka = new KeywordAnalyzer();
PerFieldAnalyzerWrapper perFieldAnalyzer = new PerFieldAnalyzerWrapper(srchInfo.GetAnalyzer(true));
perFieldAnalyzer.AddAnalyzer("documenttags_t", ka);
-- Some procedure to compile all documents by reading from DB and putting into Lucene docs
foreach(var doc in docs)
{
iw.AddDocument(doc, perFieldAnalyzer);
}
On Searching:
KeywordAnalyzer ka = new KeywordAnalyzer();
PerFieldAnalyzerWrapper perFieldAnalyzer = new PerFieldAnalyzerWrapper(srchInfo.GetAnalyzer(true));
perFieldAnalyzer.AddAnalyzer("documenttags_t", ka);
string baseQuery = "documenttags_t:\"" + tagName + "\"";
Query query = _parser.Parse(baseQuery);
var results = _searcher.Search(query, sortBy)

Detailed information in Lucene/Solr results

After having performed a search in Lucene/Solr without having specified a field, how can I know in which fields of a result document the search string was found (and how often)?
You could use Query Highlighting.
Try setting debugQuery=on. See this example.
As mentioned, use debugQuery=true. The response will then include an "explain" section. By default, this will give you some awful formatted text that looks like this:
0.69102794 = (MATCH) weight(body:arrai^1.5 in 6357), product of:
0.46610788 = queryWeight(body:arrai^1.5), product of:
1.5 = boost
5.591044 = idf(docFreq=55709, maxDocs=5492855)
0.055577915 = queryNorm
1.4825494 = (MATCH) fieldWeight(body:arrai in 6357), product of:
2.828427 = tf(termFreq(body:arrai)=8)
5.591044 = idf(docFreq=55709, maxDocs=5492855)
0.09375 = fieldNorm(field=body, doc=6357)
For each match in each field, you will get a block like this that explains how SOLR computed the relevancy of this document to your query. What you're asking about (how many matches in this document's field) SOLR calls term frequency "tf". You can see this on the 7th line of the output i pasted above. In this line, SOLR is telling you that it found 8 matches for arrai in the field called "body".
The other lines stand for things like inverse document frequency-"idf" (how rare the matched term is) and fieldNorm, which relates to how short the document's field is relative to the match. You can learn about these here: http://wiki.apache.org/solr/SolrRelevancyFAQ
FYI if you need this "explain" information in a structured format instead of clumsy text you can pass this parameter with your query: debug.explain.structured=true However, its still pretty hard to use = )

nutch field problem

I was using something like:
Field notdirectory = new Field("notdirectory","1", Field.Store.NO, Field.Index.UN_TOKENIZED);
and queries like "notdirectory:1" can be processed quite well all the time.
But recently I've changed the "Field.Store.NO, Field.Index.UN_TOKENIZED" to index a non-numeric string:
Field stateField = new Field("state","irn_" + state, Field.Store.NO, Field.Index.UN_TOKENIZED);
and queries like "state:irn_CA" can never fetch any results any more,even though I watch through hadoop logs that "irn_CA" is added to "state" field in fact.
So I doubt for Fields that satisfy "Field.Store.NO, Field.Index.UN_TOKENIZED",only numeric Fields can searchable,but I didn't see any documents about that.
So what's the true reason for this?
I think, you are using StandardAnalyzer for parsing the input query, which will tokenize your input query "irn_CA" into two tokens - "irn" and "CA". Since the index has "irn_CA" as single token, it won't match.
Try using KeywordAnalyzer for while searching. It will generate single token for the query string and match the indexed token correctly.
I think the searcher bean forces everything to lowercase...so make the state is in lower case when adding to the index:
Field stateField = new Field("state","irn_" + state.toLowerCase(), Field.Store.NO, Field.Index.UN_TOKENIZED);
and when you query: 'state:irn_ca' instead of 'state:irn_CA'.
I also note you prefixed with 'irn_' - good call, otherwise the highlighter flags up the the query.