Lucene analyzer for substrings

I have a database table with about 40,000 records containing code fields, such as
FLEFSU25B-25M
EMG1090-5S
I need to be able to very quickly select all codes that contain a given substring. For example "109" matches EMG1090-5S.
My current approach is to store the codes in Lucene and have Lucene filter by substring - such as 109
But that is not very efficient if I just store the codes, because then Lucene has to scan through all the tokens.
To overcome this, I'm thinking of creating a new analyzer that will split each code into tokens, like this:
EMG1090-5S
MG1090-5S
G1090-5S
1090-5S
...
Then to find all codes with substring 109, I can search on 109*, which is much more efficient (I understand Lucene stores tokens alphabetically, just like SQL Server indexes).
Does this make sense?
Does such an analyzer already exist? I'm using .Net/C#.

A token filter to accomplish this does indeed already exist! Take a look at EdgeNGramTokenFilter. An Analyzer using it might look something like:
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Treat the whole code as a single token, lowercase it, then emit
        // every trailing substring (back edge n-gram) of 2 to 50 characters
        KeywordTokenizer source = new KeywordTokenizer(reader);
        TokenStream filter = new LowerCaseFilter(Version.LUCENE_42, source);
        filter = new EdgeNGramTokenFilter(filter, EdgeNGramTokenFilter.Side.BACK, 2, 50);
        return new TokenStreamComponents(source, filter);
    }
};
For your consideration, WordDelimiterFilter might also prove useful to you. It has a number of configuration options, and can be used to separate tokens at punctuation and at transitions from letter to number, etc. So with it, from your input "EMG1090-5S" you could get the tokens:
EMG
1090
5
S
Which might work well for your case, but would not be particularly helpful in finding something like: "MG1"
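To illustrate the search side of this approach: once the suffix tokens are indexed, the 109 lookup becomes an efficient prefix match instead of a leading-wildcard scan. A minimal sketch, assuming the analyzer above, a field named "code", and an in-memory directory (the field name, directory, and version constant are illustrative, not from the question):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

RAMDirectory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_42, analyzer));
Document doc = new Document();
// Analyzed into the suffix tokens "emg1090-5s", "mg1090-5s", ..., "5s"
doc.add(new TextField("code", "EMG1090-5S", Field.Store.YES));
writer.addDocument(doc);
writer.close();

IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
// "109" is a prefix of the indexed suffix "1090-5s", so this finds the
// document without scanning every term the way a *109* wildcard would
TopDocs hits = searcher.search(new PrefixQuery(new Term("code", "109")), 10);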

Related

Can I clear the stopword list in lucene.net in order for exact matches to work better?

When dealing with exact matches I'm given a real world query like this:
"not in education, employment, or training"
Converted to a Lucene query with stopwords removed gives:
+Content:"? ? education employment ? training"
Here's a more contrived example:
"there is no such thing"
Converted to a Lucene query with stopwords removed gives:
+Content:"? ? ? ? thing"
My goal is to have searches like these match only the exact match as the user entered it.
Could one solution be to clear the stopwords list? Would this have adverse effects? If so, what? (My google-fu failed.)
This all depends on the analyzer you are using. The StandardAnalyzer uses stop words and strips them out; in fact, the StopAnalyzer is where the StandardAnalyzer gets its stop words from.
Use the WhitespaceAnalyzer, or create your own by inheriting from whichever analyzer most closely suits your needs and modifying it to do what you want.
Alternatively, if you like the StandardAnalyzer you can new one up with a custom stop word list:
//This is what the default stop word list is in case you want to use or filter this
var defaultStopWords = StopAnalyzer.ENGLISH_STOP_WORDS_SET;

//create a new StandardAnalyzer with custom stop words
var sa = new StandardAnalyzer(
    Version.LUCENE_29,  //depends on your version
    new HashSet<string> //pass in your own stop word list
    {
        "hello",
        "world"
    });
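To clear the list outright, as the question asks, pass an empty stop word set so that nothing is stripped at analysis time. A minimal sketch (Java Lucene shown; the Lucene.Net constructor takes the same arguments as in the snippet above):

import java.util.Collections;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

// With an empty stop word set, phrases like "there is no such thing"
// are indexed and matched verbatim
StandardAnalyzer noStopwords = new StandardAnalyzer(Version.LUCENE_29, Collections.emptySet());

Keep in mind the same analyzer must be used at index time and query time, so clearing the stop words means re-indexing existing content.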

lucene wildcard query with space

I have a Lucene index which has city names.
Consider that I want to search for 'New Delhi'. I have the string 'New Del', which I want to pass to the Lucene searcher, and I am expecting 'New Delhi' as the output.
If I generate a query like Name:New Del*, it will give me all cities with 'New' and 'Del' in them.
Is there any way I can create a Lucene wildcard query with spaces in it?
I referred to and tried a few solutions given at http://www.gossamer-threads.com/lists/lucene/java-user/5487
It sounds like you have indexed your city names with analysis. That will tend to make this more difficult. With analysis, "new" and "delhi" are separate terms, and must be treated as such. Searching over multiple terms with wildcards like this is a bit more involved.
The easiest solution would be to index your city names without tokenization (lowercasing might not be a bad idea though). Then you would be able to search with the query parser simply by escaping the space:
QueryParser parser = new QueryParser("defaultField", analyzer);
Query query = parser.parse("cityname:new\\ del*");
Or you could use a simple WildcardQuery:
Query query = new WildcardQuery(new Term("cityname", "new del*"));
With the field analyzed by the standard analyzer:
You will need to rely on SpanQueries, something like this:
SpanQuery queryPart1 = new SpanTermQuery(new Term("cityname", "new"));
SpanQuery queryPart2 = new SpanMultiTermQueryWrapper<WildcardQuery>(new WildcardQuery(new Term("cityname", "del*")));
Query query = new SpanNearQuery(new SpanQuery[] {queryPart1, queryPart2}, 0, true);
Or, you can use the surround query parser (which provides query syntax intended to provide more robust support of span queries), using a query like W(new, del*):
org.apache.lucene.queryparser.surround.parser.QueryParser surroundparser = new org.apache.lucene.queryparser.surround.parser.QueryParser();
SrndQuery srndquery = surroundparser.parse("W(new, del*)");
query = srndquery.makeLuceneQueryField("cityname", new BasicQueryFactory());
As I learnt from the thread you mentioned (http://www.gossamer-threads.com/lists/lucene/java-user/5487), you can either do an exact match with the space escaped, or treat each part with a wildcard.
So something like this should work - [New* Del*]
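Programmatically, one way to express [New* Del*] is two prefix queries that must both match. A sketch (the cityname field is taken from the answer above; note this enforces neither adjacency nor word order):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PrefixQuery;

BooleanQuery query = new BooleanQuery();
query.add(new PrefixQuery(new Term("cityname", "new")), BooleanClause.Occur.MUST);
query.add(new PrefixQuery(new Term("cityname", "del")), BooleanClause.Occur.MUST);
// Matches "New Delhi", but could also match something like "Delmar Newton";
// use the SpanNearQuery approach above if order and adjacency matter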

Find typo with Lucene

I would like to use Lucene to index/search text. The text can contain mistyped words, names, etc. What is the simplest way of getting Lucene to find a document containing
"this is Licene"
when user searches for
"Lucene"?
This is only for a demo app, so we need the simplest solution.
Lucene's fuzzy queries are based on Levenshtein edit distance.
Use a fuzzy query in the QueryParser, with syntax like:
Lucene~0.5
Or create a FuzzyQuery, passing in the maximum number of edits, something like:
Query query = new FuzzyQuery(new Term("field", "lucene"), 1);
Note: FuzzyQuery, in Lucene 4.x, does not support edit distances greater than 2.
Another option you could try is using the Lucene SpellChecker:
http://lucene.apache.org/core/6_4_0/suggest/org/apache/lucene/search/spell/SpellChecker.html
It works out of the box and is very easy to use:
SpellChecker spellchecker = new SpellChecker(spellIndexDirectory);
// To index a field of a user index:
spellchecker.indexDictionary(new LuceneDictionary(my_lucene_reader, a_field));
// To index a file containing words:
spellchecker.indexDictionary(new PlainTextDictionary(new File("myfile.txt")));
String[] suggestions = spellchecker.suggestSimilar("misspelt", 5);
By default it uses LevensteinDistance, but you could provide your own customized edit distance.
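For example, here is a sketch of swapping in a different StringDistance implementation (JaroWinklerDistance ships in the same spell package; spellIndexDirectory is the directory from the snippet above):

import org.apache.lucene.search.spell.JaroWinklerDistance;
import org.apache.lucene.search.spell.SpellChecker;

// The SpellChecker constructor accepts any StringDistance implementation
SpellChecker spellchecker = new SpellChecker(spellIndexDirectory, new JaroWinklerDistance());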

Lucene.net PerFieldAnalyzerWrapper

I've read up on how to use the per-field analyzer wrapper, but can't get it to work with a custom analyzer of mine. I can't even get the analyzer to run its constructor, which makes me believe I'm actually calling the per-field analyzer incorrectly.
Here's what I'm doing:
Create the per field analyzer:
PerFieldAnalyzerWrapper perFieldAnalyzer = new PerFieldAnalyzerWrapper(srchInfo.GetAnalyzer(true));
perFieldAnalyzer.AddAnalyzer("<special field>", dta);
Add all the fields to the document as usual, including a special field that we analyze differently.
And add the document using the analyzer like this:
iw.AddDocument(doc, perFieldAnalyzer);
Am I on the right track?
The problem was related to my reliance on the CMS's (Kentico) built-in Lucene helper classes. Basically, using those classes you need to specify the custom analyzer at index level through the CMS, and I did not wish to do that. So I ended up using Lucene.net directly almost everywhere, gaining the flexibility of using any custom analyzer I want.
I also made some changes to how I structure the data, and ended up using the tried-and-true KeywordAnalyzer to analyze document tags. Previously I was trying to do some custom tokenization magic on comma-separated values like [tag1, tag2, tag with many parts] and could not get it working reliably with multi-part tags. I still kept that field, but started adding multiple "tag" fields to the document, each storing one tag. So now I have N "tag" fields for N tags, each analyzed as a keyword, meaning each tag (one word or many) is a single token.
I think I overthought it with my initial approach.
Here is what I ended up with.
On Indexing:
KeywordAnalyzer ka = new KeywordAnalyzer();
PerFieldAnalyzerWrapper perFieldAnalyzer = new PerFieldAnalyzerWrapper(srchInfo.GetAnalyzer(true));
perFieldAnalyzer.AddAnalyzer("documenttags_t", ka);

// Some procedure to compile all documents by reading from the DB and putting them into Lucene docs
foreach (var doc in docs)
{
    iw.AddDocument(doc, perFieldAnalyzer);
}
On Searching:
KeywordAnalyzer ka = new KeywordAnalyzer();
PerFieldAnalyzerWrapper perFieldAnalyzer = new PerFieldAnalyzerWrapper(srchInfo.GetAnalyzer(true));
perFieldAnalyzer.AddAnalyzer("documenttags_t", ka);

string baseQuery = "documenttags_t:\"" + tagName + "\"";
Query query = _parser.Parse(baseQuery);
var results = _searcher.Search(query, sortBy);
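To make the "one field per tag" idea concrete, here is a minimal sketch of building such a document (Java Lucene 3.x syntax for illustration, since the Lucene.Net 3.x calls mirror it; the tag values are placeholders):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
// One "documenttags_t" field per tag; with KeywordAnalyzer each value,
// whether one word or many, becomes exactly one token
for (String tag : new String[] { "tag1", "tag2", "tag with many parts" }) {
    doc.add(new Field("documenttags_t", tag, Field.Store.YES, Field.Index.ANALYZED));
}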

With Lucene, if someone searches for "red barn", how do you also return results that contain "redbarn"?

I have a working search where, if someone searches for two separate words (like "red barn"), Lucene does a great job of returning records that have "red barn", "barn red", and "red tractor next to the big brown barn". That's great, but the results do not return anything that contains "redbarn" (unless you specifically search for "redbarn", but then you don't get the "red barn" records).
I'm just using the standard analyzer at the moment, but am not sure what needs to change in order to get all the records I'd like.
If it matters, I'm using the NEST client on top of ElasticSearch (which is Lucene under the hood). I've researched the various analyzers and properties available but haven't found the right combination to do this.
The best way to do this is to write an analyzer that tokenizes "redbarn" as ["red", "barn"]. Lucene already does this for German, where compound words are common; look at DictionaryCompoundWordTokenFilter for example.
The standard analyzer works for most cases, but if you need this kind of detailed text analysis you need to write your own analyzer.
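A minimal sketch of that approach (Lucene 4.x Java is assumed here; the two-word dictionary is a stand-in for a real word list):

import java.io.Reader;
import java.util.Arrays;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Dictionary of known words; compound tokens are split wherever
        // dictionary entries are found, and the original token is kept too
        CharArraySet dictionary = new CharArraySet(Version.LUCENE_42, Arrays.asList("red", "barn"), true);
        WhitespaceTokenizer source = new WhitespaceTokenizer(Version.LUCENE_42, reader);
        TokenStream filter = new LowerCaseFilter(Version.LUCENE_42, source);
        filter = new DictionaryCompoundWordTokenFilter(Version.LUCENE_42, filter, dictionary);
        return new TokenStreamComponents(source, filter);
    }
};

Indexing "redbarn" through this analyzer yields the tokens [redbarn, red, barn], so a search for "red barn" now matches.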
The WordDelimiterFilter that comes with Solr should solve your problem. Solr is built on top of Lucene, so you should not have a problem if you use the filters that come with Solr. See the example below:
public class CustomAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(reader);
        // the 1 flags turn on the generate/catenate options
        // (word parts, number parts, catenated words/numbers, catenate all)
        ts = new WordDelimiterFilter(ts, 1, 1, 1, 1, 1);
        ts = new LowerCaseFilter(ts);
        return ts;
    }
}