Umbraco Lucene index for multiple values under one field - lucene

I have a requirement to index a series of key phrases assigned to articles. The phrases are stored as a string with a \r\n delimiter and one phrase may contain another phrase, for example:
This is a key phrase
This is a key phrase too
This is also a key phrase
Would be stored as
keywords: "This is a key phrase\r\nThis is a key phrase too\r\nThis is also a key phrase"
An article which has only the phrase This is a key phrase too should not be matched when a search for This is a key phrase is performed.
I have a custom indexer implementing ISimpleDataService which works fine and indexes the content, but I can't work out how to get a query such as "This is a key phrase" to return results.
From what I've read, I thought the default QueryParser should split on delimiters and see each entry as a separate value, but it doesn't seem to work that way.
Although I've tried various implementations, my current search code looks like this:
var searcher = ExamineManager.Instance.SearchProviderCollection["KeywordsSearcher"];
var searchCriteria = searcher.CreateSearchCriteria(BooleanOperation.Or);
var query = searchCriteria.Field("keywords", keyword).Compile();
var searchResults = searcher.Search(query).OrderByDescending(x => x.Score).ToList();
The 'simple' way I thought to do this was to add each keyword as a separate 'keyword' field, but the SimpleDataSet provided as part of the .NET implementation uses a Dictionary<string, string>, which precludes me from being able to have more than one key with the same name.
I'm new to Lucene and Umbraco, so any advice would be gratefully received.
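The matching behaviour asked for above can be sketched independently of Lucene or Examine. This is only a model of the requirement, not an Examine API: the class and method names are hypothetical, and it assumes an exact, case-sensitive comparison of whole phrases after splitting on the \r\n delimiter.

```java
import java.util.Arrays;
import java.util.List;

class KeyPhraseMatch {
    // Split the stored value on the \r\n delimiter so each phrase is its own value.
    static List<String> phrases(String stored) {
        return Arrays.asList(stored.split("\r\n"));
    }

    // A document matches only if one of its phrases equals the query as a whole.
    static boolean matches(String stored, String query) {
        for (String phrase : phrases(stored)) {
            if (phrase.trim().equals(query.trim())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String articleA = "This is a key phrase\r\nThis is also a key phrase";
        String articleB = "This is a key phrase too";
        System.out.println(matches(articleA, "This is a key phrase"));
        System.out.println(matches(articleB, "This is a key phrase"));
    }
}
```

In index terms, this corresponds to treating each phrase as a single untokenized term rather than as a run of individual words.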

Related

Can I clear the stopword list in lucene.net in order for exact matches to work better?

When dealing with exact matches I'm given a real world query like this:
"not in education, employment, or training"
Converted to a Lucene query with stopwords removed gives:
+Content:"? ? education employment ? training"
Here's a more contrived example:
"there is no such thing"
Converted to a Lucene query with stopwords removed gives:
+Content:"? ? ? ? thing"
My goal is to have searches like these match only the exact match as the user entered it.
Could one solution be to clear the stopwords list? Would this have adverse effects, and if so, what? (My google-fu failed.)
This all depends on the analyzer you are using. The StandardAnalyzer uses stop words and strips them out; in fact, the StopAnalyzer is where the StandardAnalyzer gets its stop words from.
Use the WhitespaceAnalyzer or create your own by inheriting from one that most closely suits your needs and modify it to be what you want.
Alternatively, if you like the StandardAnalyzer you can new one up with a custom stop word list:
//This is what the default stop word list is in case you want to use or filter this
var defaultStopWords = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
//create a new StandardAnalyzer with custom stop words
var sa = new StandardAnalyzer(
Version.LUCENE_29, //depends on your version
new HashSet<string> //pass in your own stop word list
{
"hello",
"world"
});
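To see why the phrase queries in the question end up as "? ? education employment ? training", here is a small self-contained sketch in plain Java that does not use Lucene. The tokenizer is a simplified stand-in (lowercase, split on non-alphanumerics), the class and method names are hypothetical, and the stop word list is an assumption based on the usual default English set; check StopAnalyzer.ENGLISH_STOP_WORDS_SET in your own version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordDemo {
    // Assumed copy of the default English stop words; verify against your Lucene version.
    static final Set<String> ENGLISH_STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
        "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
        "the", "their", "then", "there", "these", "they", "this", "to", "was",
        "will", "with"));

    // Lowercase, split on non-alphanumerics, and replace each stop word with
    // a "?" placeholder so the positions in the phrase are preserved.
    static String analyze(String phrase, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String token : phrase.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            out.add(stopWords.contains(token) ? "?" : token);
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        String phrase = "not in education, employment, or training";
        System.out.println(analyze(phrase, ENGLISH_STOP_WORDS));
        System.out.println(analyze(phrase, Collections.<String>emptySet()));
    }
}
```

With an empty stop word set every word survives, which is the effect of passing an empty set to the analyzer. The usual trade-off is a larger index and more matches on very common words, and the same analyzer has to be used for both indexing and searching.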

Delete all documents that don't match a term?

How would I match all documents that don't match a term in Lucene or Lucene.NET?
If I want to delete all the documents that match a term, it's easy:
writer.DeleteDocuments(new Term("SomeParameter", SomeValue));
But I actually need to do the opposite. I'm marking each updated document with a unique tag, and I want to delete everything that wasn't updated, i.e. everything whose tag is different from my tag (and the other tags could be very different things).
You need a query that matches all documents that don't have the term, but BooleanQuery cannot contain just a single MUST_NOT clause.
But you can use the following trick to bypass this:
var query = new BooleanQuery();
query.Add(new MatchAllDocsQuery(), Occur.SHOULD);
query.Add(new TermQuery(new Term("SomeParameter", someValue)), Occur.MUST_NOT); // Add() takes a Query, so wrap the Term in a TermQuery
writer.DeleteDocuments(query);
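The logic of that query is "everything, minus the documents carrying the tag", and whatever matches is deleted. A minimal plain-Java model of the outcome, without Lucene, might look like the following. The map of document id to tag and all names here are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

class DeleteUntagged {
    // Returns the documents that survive the delete: start from all documents
    // (the "match all" clause) and remove every one whose tag is NOT the
    // current tag (the "must not" clause selects what is spared).
    static Map<String, String> deleteNotTagged(Map<String, String> docs, String currentTag) {
        Map<String, String> remaining = new TreeMap<>(docs);
        remaining.values().removeIf(tag -> !currentTag.equals(tag));
        return remaining;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("a", "run2");
        docs.put("b", "run1");
        docs.put("c", "run2");
        System.out.println(deleteNotTagged(docs, "run2").keySet());
    }
}
```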

Find typo with Lucene

I would like to use Lucene to index/search text. The text can contain mistyped words, names, etc. What is the simplest way of getting Lucene to find a document containing
"this is Licene"
when user searches for
"Lucene"?
This is only for a demo app, so we need the simplest solution.
Lucene's fuzzy queries are based on Levenshtein edit distance.
Use a fuzzy query in the QueryParser, with syntax like:
Lucene~0.5
Or create a FuzzyQuery, passing in the maximum number of edits, something like:
Query query = new FuzzyQuery(new Term("field", "lucene"), 1);
Note: FuzzyQuery, in Lucene 4.x, does not support greater edit distances than 2.
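For reference, Levenshtein edit distance counts the single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below is a plain dynamic-programming version written for illustration; it is not Lucene's implementation, and the class name is hypothetical.

```java
class EditDistance {
    // Classic two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("licene", "lucene"));
    }
}
```

"licene" and "lucene" differ by one substitution, so a maximum edit distance of 1 is enough for the example in the question, assuming both terms are lowercased by the analyzer.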
Another option you could try is using the Lucene SpellChecker:
http://lucene.apache.org/core/6_4_0/suggest/org/apache/lucene/search/spell/SpellChecker.html
It works out of the box and is very easy to use:
SpellChecker spellchecker = new SpellChecker(spellIndexDirectory);
// To index a field of a user index:
spellchecker.indexDictionary(new LuceneDictionary(my_lucene_reader, a_field));
// To index a file containing words:
spellchecker.indexDictionary(new PlainTextDictionary(new File("myfile.txt")));
String[] suggestions = spellchecker.suggestSimilar("misspelt", 5);
By default, it is using the LevensteinDistance, but you could provide your own customized Edit Distance.

Lucene.net PerFieldAnalyzerWrapper

I've read on how to use the per field analyzer wrapper, but can't get it to work with a custom analyzer of mine. I can't even get the analyzer to run the constructor, which makes me believe I'm actually calling the per field analyzer incorrectly.
Here's what I'm doing:
Create the per field analyzer:
PerFieldAnalyzerWrapper perFieldAnalyzer = new PerFieldAnalyzerWrapper(srchInfo.GetAnalyzer(true));
perFieldAnalyzer.AddAnalyzer("<special field>", dta);
Add all the fields to the document as usual, including a special field that we analyze differently.
And add document using the analyzer like this:
iw.AddDocument(doc, perFieldAnalyzer);
Am I on the right track?
The problem was related to my reliance on my CMS's (Kentico) built-in Lucene helper classes. Basically, with those classes you need to specify the custom analyzer at index level through the CMS, and I did not wish to do that. So I ended up using Lucene.net directly almost everywhere, gaining the flexibility of using any custom analyzer I want.
I also did some changes to how I structure data and ended up using the tried-and-true KeywordAnalyzer to analyze document tags. Previously I was trying to do some custom tokenization magic on comma separated values like [tag1, tag2, tag with many parts] and could not get it reliably working with multi-parted tags. I still kept that field, but started adding multiple "tag" fields to the document, each storing one tag. So now I have N "tag" fields for "N" tags, each analyzed as a keyword, meaning each tag (one word or many) is a single token.
I think I overthought it with my initial approach.
Here is what I ended up with.
On Indexing:
KeywordAnalyzer ka = new KeywordAnalyzer();
PerFieldAnalyzerWrapper perFieldAnalyzer = new PerFieldAnalyzerWrapper(srchInfo.GetAnalyzer(true));
perFieldAnalyzer.AddAnalyzer("documenttags_t", ka);
// Some procedure to compile all documents by reading from DB and putting into Lucene docs
foreach(var doc in docs)
{
iw.AddDocument(doc, perFieldAnalyzer);
}
On Searching:
KeywordAnalyzer ka = new KeywordAnalyzer();
PerFieldAnalyzerWrapper perFieldAnalyzer = new PerFieldAnalyzerWrapper(srchInfo.GetAnalyzer(true));
perFieldAnalyzer.AddAnalyzer("documenttags_t", ka);
string baseQuery = "documenttags_t:\"" + tagName + "\"";
Query query = _parser.Parse(baseQuery);
var results = _searcher.Search(query, sortBy);
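The reason one keyword-analyzed field per tag works for multi-part tags can be shown with a small model in plain Java, without Lucene. The two methods are simplified stand-ins for a whitespace-style tokenizer and for keyword analysis, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class TagTokens {
    // Stand-in for a tokenizing analyzer: one token per whitespace-separated word.
    static List<String> whitespaceTokens(String value) {
        return Arrays.asList(value.trim().split("\\s+"));
    }

    // Stand-in for keyword analysis: the whole field value is a single token.
    static List<String> keywordTokens(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        String tag = "tag with many parts";
        System.out.println(whitespaceTokens(tag).contains(tag));
        System.out.println(keywordTokens(tag).contains(tag));
    }
}
```

A lookup for the complete tag as one term can only succeed when the tag was indexed as one token, which is what the keyword-analyzed "tag" fields provide.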

nutch field problem

I was using something like:
Field notdirectory = new Field("notdirectory","1", Field.Store.NO, Field.Index.UN_TOKENIZED);
and queries like "notdirectory:1" can be processed quite well all the time.
But recently I've used the same "Field.Store.NO, Field.Index.UN_TOKENIZED" settings to index a non-numeric string:
Field stateField = new Field("state","irn_" + state, Field.Store.NO, Field.Index.UN_TOKENIZED);
and queries like "state:irn_CA" no longer return any results, even though the Hadoop logs show that "irn_CA" is in fact added to the "state" field.
So I suspect that for fields using "Field.Store.NO, Field.Index.UN_TOKENIZED", only numeric values are searchable, but I haven't seen any documentation saying that.
So what's the real reason for this?
I think you are using StandardAnalyzer to parse the input query, which will tokenize your input "irn_CA" into two tokens, "irn" and "CA". Since the index has "irn_CA" as a single token, it won't match.
Try using KeywordAnalyzer when searching. It will generate a single token for the query string and match the indexed token correctly.
I think the searcher bean forces everything to lowercase, so make sure the state is in lower case when adding it to the index:
Field stateField = new Field("state","irn_" + state.toLowerCase(), Field.Store.NO, Field.Index.UN_TOKENIZED);
and query with 'state:irn_ca' instead of 'state:irn_CA'.
I also note you prefixed with 'irn_' - good call, otherwise the highlighter flags up the query.
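Both suggestions come down to the same rule: the term produced on the query side must be identical to the single term stored for the untokenized field. The plain-Java sketch below models that rule without Lucene or Nutch. simpleAnalyze is a simplified, assumed stand-in for a lowercasing, splitting analyzer, not the real StandardAnalyzer, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class StateFieldDemo {
    // Simplified stand-in for an analyzer that lowercases and splits on
    // anything that is not a letter or digit (so "_" separates tokens).
    static List<String> simpleAnalyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // An untokenized field is stored as exactly one term, unchanged, so a
    // query only hits if one of its terms is identical to that stored term.
    static boolean matches(String indexedTerm, List<String> queryTerms) {
        return queryTerms.contains(indexedTerm);
    }

    public static void main(String[] args) {
        // Query analyzed into "irn" and "ca": no term equals "irn_CA".
        System.out.println(matches("irn_CA", simpleAnalyze("irn_CA")));
        // Query kept as one unchanged term: matches the stored term.
        System.out.println(matches("irn_CA", Collections.singletonList("irn_CA")));
        // Lowercased on both sides: matches as well.
        System.out.println(matches("irn_ca", Collections.singletonList("irn_ca")));
    }
}
```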