How to implement the Did You Mean and spellchecker features in the Lucene full-text search engine.
After you've created the main index, you can build the dictionary index used by the spell checker from it using:
public void createSpellChekerIndex() throws CorruptIndexException,
IOException {
final IndexReader reader = IndexReader.open(this.indexDirectory, true);
final Dictionary dictionary = new LuceneDictionary(reader,
LuceneExample.FIELD);
final SpellChecker spellChecker = new SpellChecker(this.spellDirectory);
final Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
final IndexWriterConfig writerConfig = new IndexWriterConfig(
Version.LUCENE_36, analyzer);
spellChecker.indexDictionary(dictionary, writerConfig, true);
spellChecker.close();
}
and then ask for an array of suggestions with:
public String[] getSuggestions(final String queryString,
final int numberOfSuggestions, final float accuracy) {
try {
final SpellChecker spellChecker = new SpellChecker(
this.spellDirectory);
final String[] similarWords = spellChecker.suggestSimilar(
queryString, numberOfSuggestions, accuracy);
return similarWords;
} catch (final Exception e) {
return new String[0];
}
}
Example:
After indexing the following documents:
luceneExample.index("spell checker");
luceneExample.index("did you mean");
luceneExample.index("hello, this is a test");
luceneExample.index("Lucene is great");
and creating the spell index with the method above, I searched for the string "lucete" and asked for suggestions with
final String query = "lucete";
final String[] suggestions = luceneExample.getSuggestions(query, 5,
0.2f);
System.out.println("Did you mean:\n" + Arrays.toString(suggestions));
This was the output:
Did you mean:
[lucene]
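To make the accuracy argument (0.2f above) more concrete: as far as I know, SpellChecker scores candidates with a Levenshtein-based string distance by default, normalized to a value between 0 and 1. The following is a plain-Java sketch of that kind of score, not Lucene's own code; the class and method names are made up for illustration.

```java
// Hypothetical helper, NOT part of Lucene: a normalized Levenshtein similarity
// of the kind a spell checker can compare against an accuracy threshold.
final class SimilaritySketch {

    // Classic dynamic-programming edit distance (insert, delete, substitute).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // 1.0 means identical strings, 0.0 means nothing in common.
    static float similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0f;
        }
        return 1.0f - (float) editDistance(a, b) / maxLen;
    }
}
```

Under this model, "lucete" and "lucene" differ by one substitution over six characters, so the score is 1 - 1/6, comfortably above the 0.2 threshold used in the example.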
I'm looking for general advice on how to search identifiers, product codes or phone numbers in Apache Lucene 8.x. Let's say I'm trying to search lists of product codes (like an ISBN, for example 978-3-86680-192-9). If somebody enters 9783 or 978 3 or 978-3, then 978-3-86680-192-9 should appear. The same should happen if an identifier uses any combination of letters, spaces, digits and punctuation (examples: TS 123, 123.abc). How would I do this?
I thought I could solve this with a custom analyzer that removes all the punctuation and whitespace, but the results are mixed:
public class IdentifierAnalyzer extends Analyzer {
@Override
protected TokenStreamComponents createComponents(String fieldName) {
Tokenizer tokenizer = new KeywordTokenizer();
TokenStream tokenStream = new LowerCaseFilter(tokenizer);
tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile("[^0-9a-z]"), "", true);
tokenStream = new TrimFilter(tokenStream);
return new TokenStreamComponents(tokenizer, tokenStream);
}
@Override
protected TokenStream normalize(String fieldName, TokenStream in) {
TokenStream tokenStream = new LowerCaseFilter(in);
tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile("[^0-9a-z]"), "", true);
tokenStream = new TrimFilter(tokenStream);
return tokenStream;
}
}
So while I get the desired results when performing a PrefixQuery with TS1*, TS 1* (with whitespace) does not yield satisfactory results. When I look into the parsed query, I see that Lucene splits TS 1* into two queries: myField:TS myField:1*. WordDelimiterGraphFilter looks interesting, but I couldn't figure out how to apply it here.
This is not a comprehensive answer - but I agree that WordDelimiterGraphFilter may be helpful for this type of data. However, there could still be test cases which need additional handling.
Here is my custom analyzer, using a WordDelimiterGraphFilter:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilterFactory;
import java.util.Map;
import java.util.HashMap;
public class IdentifierAnalyzer extends Analyzer {
private WordDelimiterGraphFilterFactory getWordDelimiter() {
Map<String, String> settings = new HashMap<>();
settings.put("generateWordParts", "1"); // e.g. "PowerShot" => "Power" "Shot"
settings.put("generateNumberParts", "1"); // e.g. "500-42" => "500" "42"
settings.put("catenateAll", "1"); // e.g. "wi-fi" => "wifi" and "500-42" => "50042"
settings.put("preserveOriginal", "1"); // e.g. "500-42" => "500" "42" "500-42"
settings.put("splitOnCaseChange", "1"); // e.g. "fooBar" => "foo" "Bar"
return new WordDelimiterGraphFilterFactory(settings);
}
@Override
protected TokenStreamComponents createComponents(String fieldName) {
Tokenizer tokenizer = new KeywordTokenizer();
TokenStream tokenStream = new LowerCaseFilter(tokenizer);
tokenStream = getWordDelimiter().create(tokenStream);
return new TokenStreamComponents(tokenizer, tokenStream);
}
@Override
protected TokenStream normalize(String fieldName, TokenStream in) {
TokenStream tokenStream = new LowerCaseFilter(in);
return tokenStream;
}
}
It uses the WordDelimiterGraphFilterFactory helper, together with a map of parameters, to control which settings are applied.
You can see the complete list of available settings in the WordDelimiterGraphFilterFactory JavaDoc. You may want to experiment with setting/unsetting different ones.
Here is a test index builder for the following 3 input values:
978-3-86680-192-9
TS 123
123.abc
public static void buildIndex() throws IOException, FileNotFoundException, ParseException {
final Directory dir = FSDirectory.open(Paths.get(INDEX_PATH));
Analyzer analyzer = new IdentifierAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
Document doc;
List<String> identifiers = Arrays.asList("978-3-86680-192-9", "TS 123", "123.abc");
try (IndexWriter writer = new IndexWriter(dir, iwc)) {
for (String identifier : identifiers) {
doc = new Document();
doc.add(new TextField("identifiers", identifier, Field.Store.YES));
writer.addDocument(doc);
}
}
}
This creates the following tokens:
For querying the above indexed data I used this:
public static void doSearch() throws IOException, ParseException {
Analyzer analyzer = new IdentifierAnalyzer();
QueryParser parser = new QueryParser("identifiers", analyzer);
List<String> searches = Arrays.asList("9783", "9783*", "978 3", "978-3", "TS1*", "TS 1*");
for (String search : searches) {
Query query = parser.parse(search);
printHits(query, search);
}
}
private static void printHits(Query query, String search) throws IOException {
System.out.println("search term: " + search);
System.out.println("parsed query: " + query.toString());
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs results = searcher.search(query, 100);
ScoreDoc[] hits = results.scoreDocs;
System.out.println("hits: " + hits.length);
for (ScoreDoc hit : hits) {
System.out.println("");
System.out.println(" doc id: " + hit.doc + "; score: " + hit.score);
Document doc = searcher.doc(hit.doc);
System.out.println(" identifier: " + doc.get("identifiers"));
}
System.out.println("-----------------------------------------");
}
This uses the following search terms - all of which I pass into the classic query parser (though you could, of course, use more sophisticated query types via the API):
9783
9783*
978 3
978-3
TS1*
TS 1*
The only query which failed to find any matching documents was the first one:
search term: 9783
parsed query: identifiers:9783
hits: 0
This should not be a surprise, since this is a partial token, without a wildcard. The second query (with the wildcard added) found one document, as expected.
The final query I tested, TS 1*, found three hits, but the one we want has the highest score:
search term: TS 1*
parsed query: identifiers:ts identifiers:1*
hits: 3
doc id: 1; score: 1.590861
identifier: TS 123
doc id: 0; score: 1.0
identifier: 978-3-86680-192-9
doc id: 2; score: 1.0
identifier: 123.abc
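If the extra hits for TS 1* are a problem, one alternative is to bypass the query parser (which splits on whitespace before analysis) and normalize the raw user input yourself, then build a PrefixQuery from the result against a field indexed with the punctuation-stripping analyzer from the question. I have not tested that combination here; the sketch below only shows the normalization and prefix-matching idea in plain Java, and all of its names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, NOT a Lucene class: mimics "lowercase, then strip
// everything except [0-9a-z]" on both the stored identifier and the user input.
final class IdentifierNormalizerSketch {

    static String normalize(String raw) {
        return raw.toLowerCase(Locale.ROOT).replaceAll("[^0-9a-z]", "");
    }

    // Returns the identifiers whose normalized form starts with the
    // normalized user input (the in-memory analogue of a prefix query).
    static List<String> prefixMatches(List<String> identifiers, String userInput) {
        String prefix = normalize(userInput);
        List<String> hits = new ArrayList<>();
        if (prefix.isEmpty()) {
            return hits; // nothing usable was typed
        }
        for (String identifier : identifiers) {
            if (normalize(identifier).startsWith(prefix)) {
                hits.add(identifier);
            }
        }
        return hits;
    }
}
```

With the three test identifiers above, the inputs 9783, 978 3 and 978-3 all normalize to 9783 and match only the ISBN, while TS 1 normalizes to ts1 and matches only TS 123.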
I am using Lucene 3.6. I want to know why update does not work. Is there anything wrong?
public class TokenTest
{
private static String IndexPath = "D:\\update\\index";
private static Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_33);
public static void main(String[] args) throws Exception
{
try
{
update();
display("content", "content");
}
catch (IOException e)
{
e.printStackTrace();
}
}
@SuppressWarnings("deprecation")
public static void display(String keyField, String words) throws Exception
{
IndexSearcher searcher = new IndexSearcher(FSDirectory.open(new File(IndexPath)));
Term term = new Term(keyField, words);
Query query = new TermQuery(term);
TopDocs results = searcher.search(query, 100);
ScoreDoc[] hits = results.scoreDocs;
for (ScoreDoc hit : hits)
{
Document doc = searcher.doc(hit.doc);
System.out.println("doc_id = " + hit.doc);
System.out.println("content: " + doc.get("content"));
System.out.println("path: " + doc.get("path"));
}
}
public static String update() throws Exception
{
IndexWriterConfig writeConfig = new IndexWriterConfig(Version.LUCENE_33, analyzer);
IndexWriter writer = new IndexWriter(FSDirectory.open(new File(IndexPath)), writeConfig);
Document document = new Document();
Field field_name2 = new Field("path", "update_path", Field.Store.YES, Field.Index.ANALYZED);
Field field_content2 = new Field("content", "content update", Field.Store.YES, Field.Index.ANALYZED);
document.add(field_name2);
document.add(field_content2);
Term term = new Term("path", "qqqqq");
writer.updateDocument(term, document);
writer.optimize();
writer.close();
return "update_path";
}
}
I assume you want to update your document such that field "path" = "qqqq". You have this exactly backwards (please read the documentation).
updateDocument performs two steps:
Find and delete any documents containing term
In this case, none are found, because your indexed documents do not contain path:qqqq
Add the new document to the index.
You appear to be doing the opposite: trying to look up by document and then add the term to it. It doesn't work that way. What you are looking for, I believe, is something like:
Term term = new Term("content", "update");
document.removeField("path");
document.add(new Field("path", "qqqq", Field.Store.YES, Field.Index.ANALYZED));
writer.updateDocument(term, document);
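To illustrate the delete-then-add behaviour described above, here is a toy in-memory model in plain Java. It is not Lucene code: the class is hypothetical, and lowercasing plus whitespace splitting is only a crude stand-in for real analysis.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy index, NOT Lucene: documents are field-name -> text maps.
final class UpdateSemanticsSketch {

    private final List<Map<String, String>> docs = new ArrayList<>();

    void addDocument(Map<String, String> doc) {
        docs.add(new HashMap<>(doc));
    }

    // True if the field's text contains the term as a whole token.
    private static boolean containsTerm(Map<String, String> doc, String field, String term) {
        String value = doc.get(field);
        if (value == null) {
            return false;
        }
        for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.equals(term)) {
                return true;
            }
        }
        return false;
    }

    // Step 1: delete every document containing field:term.
    // Step 2: add the replacement, whether or not anything was deleted.
    // Returns the number of deleted documents.
    int updateDocument(String field, String term, Map<String, String> replacement) {
        int before = docs.size();
        docs.removeIf(doc -> containsTerm(doc, field, term));
        int deleted = before - docs.size();
        docs.add(new HashMap<>(replacement));
        return deleted;
    }

    int size() {
        return docs.size();
    }
}
```

In this model, updating by a term that no document contains (such as path:qqqqq) deletes nothing and simply appends the new document, which matches the symptom in the question.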
My index is being created successfully. My problem is that when trying to read it in Luke, I am getting an error:
Caused by: java.io.IOException: read past EOF
I am aware that Lucene does provide an option to not store a Field. However, what would be the best way to go about this?
Store the field regardless of the size, and if a hit is found for a search, fetch the appropriate Field from Document OR
Don't store the Field and, if a hit is found for a search, query the database to get the relevant information out?
Here is the code used to create the index:
public class CREATEiNDEX {
/**
* @param args
* @throws IOException
*/
public static void main(String[] args) throws IOException {
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
Directory index = FSDirectory.open(new File("C:/toturials/luceneindex/"));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_43, analyzer);
IndexWriter w = new IndexWriter(index, config);
List <String>list=readingPersonFile();
for (int i = 0; i < list.size(); i++)
{
addDoc(w, String.valueOf(i),list.get(i));
}
w.close();
}
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new StringField("Id", title, Field.Store.YES));
doc.add(new TextField("Name", isbn, Field.Store.YES));
w.addDocument(doc);
}
I think you should delete the existing index and then create a new one.
I am trying to index a set of documents using Lucene 4.2. I've created a custom analyzer, that doesn't tokenize and doesn't lowercase the terms, with the following code:
public class NoTokenAnalyzer extends Analyzer{
public Version matchVersion;
public NoTokenAnalyzer(Version matchVersion){
this.matchVersion=matchVersion;
}
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
// TODO Auto-generated method stub
//final Tokenizer source = new NoTokenTokenizer(matchVersion, reader);
final KeywordTokenizer source=new KeywordTokenizer(reader);
TokenStream result = new LowerCaseFilter(matchVersion, source);
return new TokenStreamComponents(source, result);
}
}
I use the analyzer to construct the index (inspired by the code provided in the Lucene documentation):
public static void IndexFile(Analyzer analyzer) throws IOException{
boolean create=true;
String directoryPath="path";
File folderToIndex=new File(directoryPath);
File[]filesToIndex=folderToIndex.listFiles();
Directory directory=FSDirectory.open(new File("index path"));
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_42, analyzer);
if (create) {
// Create a new index in the directory, removing any
// previously indexed documents:
iwc.setOpenMode(OpenMode.CREATE);
} else {
// Add new documents to an existing index:
iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
}
IndexWriter writer = new IndexWriter(directory, iwc);
for (final File singleFile : filesToIndex) {
//process files in the directory and extract strings to index
//..........
String field1;
String field2;
//index fields
Document doc=new Document();
Field f1Field= new Field("f1", field1, TextField.TYPE_STORED);
doc.add(f1Field);
doc.add(new Field("f2", field2, TextField.TYPE_STORED));
// add the document to the index; without this call nothing is written
writer.addDocument(doc);
}
writer.close();
}
The problem with the code is that the indexed fields are not tokenized, but they are also not lowercased, i.e., it seems that the analyzer is not being applied during indexing.
I can't figure out what's wrong. How can I make the analyzer work?
The code works correctly. So it might serve someone in creating a custom analyzer in Lucene 4.2, and using it for indexing and searching.
I cannot find any complete examples of how to use this API. The code below is not giving any results. Any idea why?
static String spatialPrefix = "_point";
static String latField = spatialPrefix + "lat";
static String lngField = spatialPrefix + "lon";
public static void main(String[] args) throws IOException {
SpatialLuceneExample spatial = new SpatialLuceneExample();
spatial.addData();
IndexReader reader = DirectoryReader.open(modules.getDirectory());
IndexSearcher searcher = new IndexSearcher(reader);
searchAndUpdateDocument(38.9510000, -77.4107000, 100.0, searcher,
modules);
}
private void addLocation(IndexWriter writer, String name, double lat,
double lng) throws IOException {
Document doc = new Document();
doc.add(new org.apache.lucene.document.TextField("name", name,
Field.Store.YES));
doc.add(new org.apache.lucene.document.DoubleField(latField, lat,
Field.Store.YES));
doc.add(new org.apache.lucene.document.DoubleField(lngField, lng,
Field.Store.YES));
doc.add(new org.apache.lucene.document.TextField("metafile", "doc",
Field.Store.YES));
writer.addDocument(doc);
System.out.println("===== Added Doc to index ====");
}
private void addData() throws IOException {
IndexWriter writer = modules.getWriter();
addLocation(writer, "McCormick & Schmick's Seafood Restaurant",
38.9579000, -77.3572000);
addLocation(writer, "Jimmy's Old Town Tavern", 38.9690000, -77.3862000);
addLocation(writer, "Ned Devine's", 38.9510000, -77.4107000);
addLocation(writer, "Old Brogue Irish Pub", 38.9955000, -77.2884000);
//...
writer.close();
}
private final static Logger logger = LogManager
.getLogger(SpatialTools.class);
public static void searchAndUpdateDocument(double lo, double la,
double dist, IndexSearcher searcher, LuceneModules modules) {
SpatialContext ctx = SpatialContext.GEO;
SpatialArgs args = new SpatialArgs(SpatialOperation.IsWithin,
ctx.makeCircle(lo, la, DistanceUtils.dist2Degrees(dist,
DistanceUtils.EARTH_MEAN_RADIUS_KM)));
PointVectorStrategy strategy = new PointVectorStrategy(ctx, "_point");
// RecursivePrefixTreeStrategy recursivePrefixTreeStrategy = new
// RecursivePrefixTreeStrategy(grid, fieldName);
// How to use it?
Query makeQueryDistanceScore = strategy.makeQueryDistanceScore(args);
LuceneSearcher instance = LuceneSearcher.getInstance(modules);
instance.getTopResults(makeQueryDistanceScore);
//no results
Filter geoFilter = strategy.makeFilter(args);
try {
Sort chainedSort = new Sort().rewrite(searcher);
TopDocs docs = searcher.search(new MatchAllDocsQuery(), geoFilter,
10000, chainedSort);
logger.debug("search finished, num: " + docs.totalHits);
//no results
for (ScoreDoc scoreDoc : docs.scoreDocs) {
Document doc = searcher.doc(scoreDoc.doc);
double la1 = Double.parseDouble(doc.get(latField));
double lo1 = Double.parseDouble(doc.get(lngField));
double distDEG = ctx.getDistCalc().distance(
args.getShape().getCenter(), lo1, la1);
logger.debug("dist deg: : " + distDEG);
double distKM = DistanceUtils.degrees2Dist(distDEG,
DistanceUtils.EARTH_MEAN_RADIUS_KM);
logger.debug("dist km: : " + distKM);
}
} catch (IOException e) {
logger.error("fail to get the search result!", e);
}
}
Did you see the javadocs? These docs in turn point to SpatialExample.java which is what you're looking for. What could I do to make them more obvious?
If you're bent on using a pair of doubles as the internal index approach, then use PointVectorStrategy. However, you'll get superior filter performance if you use RecursivePrefixTreeStrategy instead. Presently, though, PVS does better at distance sorting, scalability-wise. You could use both for their respective benefits.
Just looking quickly at your example, I see you didn't use SpatialStrategy.createIndexableFields(). The intention is that you use that.
See the following link for an example: http://mad4search.blogspot.in/2013/06/implementing-geospatial-search-using.html
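As a sanity check while debugging, it can help to compute the expected distances outside Lucene and confirm which documents a 100 km circle should contain. The sketch below is plain Java using the haversine formula; it is not the Lucene spatial API, the class name is hypothetical, and the earth radius constant is an assumed mean value in kilometres.

```java
// Hypothetical helper, NOT the Lucene spatial API: great-circle distance
// between two latitude/longitude points, in kilometres.
final class GeoCircleSketch {

    // Assumed mean earth radius in km.
    static final double EARTH_MEAN_RADIUS_KM = 6371.0087714;

    // Haversine formula; arguments are in degrees, latitude first.
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_MEAN_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // True if the point lies inside the circle around the given center.
    static boolean withinKm(double centerLat, double centerLon,
                            double lat, double lon, double radiusKm) {
        return haversineKm(centerLat, centerLon, lat, lon) <= radiusKm;
    }
}
```

For the sample data in the question, all four locations lie within 100 km of the search point (38.951, -77.4107), so a correct spatial filter should return all of them.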