How to use the lucene-gosen analyzer with Lucene.Net?

Please guide me on how to use the Japanese analyzer (lucene-gosen) with Lucene.Net. Also, please suggest a good analyzer for Lucene.Net that supports Japanese.

The lucene-gosen analyzer does not appear to have been ported to Lucene.Net. You can make a request on their GitHub page, or you could help them out by porting it and submitting a pull request.
Once that analyzer exists, take the basic code from the article here and just change the analyzer:
string strIndexDir = @"D:\Index";
Lucene.Net.Store.Directory indexDir = Lucene.Net.Store.FSDirectory.Open(new System.IO.DirectoryInfo(strIndexDir));
// JapaneseAnalyzer is a placeholder name for the ported analyzer; it does not exist in Lucene.Net yet.
// The Version parameter is used for backward compatibility. Stop words can also be passed to avoid indexing certain words.
Analyzer std = new JapaneseAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
// Create an index writer object.
IndexWriter idxw = new IndexWriter(indexDir, std, true, IndexWriter.MaxFieldLength.UNLIMITED);
Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
Lucene.Net.Documents.Field fldText = new Lucene.Net.Documents.Field("text", System.IO.File.ReadAllText(@"d:\test.txt"), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.ANALYZED, Lucene.Net.Documents.Field.TermVector.YES);
doc.Add(fldText);
//write the document to the index
idxw.AddDocument(doc);
//optimize and close the writer
idxw.Optimize();
idxw.Close();
Response.Write("Indexing Done");

Related

Index field using: new TextField(String fieldName, Reader reader)

I've been trying to index a field using the readerValue() that Lucene provides in the Fields. The thing is that the terms are not being indexed. This is the interesting part of the code:
IndexWriterConfig config = new IndexWriterConfig(new SimpleAnalyzer());
IndexWriter indexWriter = new IndexWriter(directory, config);
indexWriter.deleteAll();
String str = "Some random text to be indexed";
Reader reader = new StringReader(str);
Document doc = new Document();
doc.add(new TextField("content", reader));
indexWriter.addDocument(doc);
Now, if I index that text as a String with the other TextField constructor, it works fine. With the Reader, however, the terms do not seem to be indexed, and I get null when I try to read the value of the field after a search:
QueryParser queryParser = new QueryParser("content", new SimpleAnalyzer());
Query query = queryParser.parse(text);
TopDocs topDocs = indexSearcher.search(query, 10);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document document = indexSearcher.doc(scoreDoc.doc);
    Reader r = document.getField("content").readerValue();
}
I really can't see the problem, maybe it is some dumb thing that I missed, or maybe I'm using it wrong? Thanks in advance for any help.
By default, TextField is unstored. The behavior you are seeing is expected for an un-stored field. You should be able to search on it, but not retrieve it from the index. The constructor that takes a string argument for the field contents allows you to set whether to store the field or not, thus the different behavior.
The reason the store option is not available on that constructor is that Lucene explicitly disallows a stored field to be set with a Reader or TokenStream value. If you want to store the field, you will simply need to get the string value from the Reader yourself.
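A minimal sketch of "getting the string value from the Reader yourself", using only the JDK. The class and method names (ReaderText, readAll) are made up for illustration; the TextField call in the comment is the String constructor the question already mentions and is not executed here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of Lucene.
final class ReaderText {
    /** Drains the reader and returns everything it produced as one String. */
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        try {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers do not need a throws clause.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = readAll(new StringReader("Some random text to be indexed"));
        // With Lucene on the classpath, the String can then be stored:
        //   doc.add(new TextField("content", text, Field.Store.YES));
        System.out.println(text);
    }
}
```

The trade-off is that the whole value is held in memory, which is exactly what the Reader-based constructor avoids, so this only makes sense when the field really has to be retrievable.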

After Lucene search, get character offsets of all matched words in document? (not just preview snippet)

I am creating a search engine for a large number of HTML documents using Lucene.
I know I can use PostingsHighlighter and friends to show snippets, with bold words, similar to Google Search results, also similar to this random lucene-based example.
However, unlike these examples, I need a solution that preserves highlighted words, even after the matched document is opened by the user, similar to Google Books.
Some words are hyphenated, in the form <div> ... an inter-</div><div...>national audience ...</div> I am thinking I need to convert these to plain text first, and write some code to merge words that were hyphenated, before I send them to lucene.
Once the resulting document is opened by the user, I'm hoping that I can use lucene to get character offsets of each matched word in the document.
I will have to cross-reference the offsets in the plain text back to the original HTML, and write code to highlight <b> the words based on said offsets.
<div> ... an <b>inter-</b></div><div...><b>national</b> audience ...</div>
How can I get what I need from lucene? Surely I don't have to write my own search for this 'final inch'?
OK, I figured out something I can get started with. :)
To index:
StandardAnalyzer analyzer = new StandardAnalyzer();
Directory index = FSDirectory.open(new File("...").toPath());
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(index, config);
// Documents need to be read from the data source.
// Only add them once, or else your docs will be duplicated as you continue to use the system.
addDoc(writer, "...", "...");
addDoc(writer, "...", "...");
addDoc(writer, "...", "...");
writer.close();
Specify that offsets should be stored for highlighting:
private static final FieldType typeOffsets;
static {
    typeOffsets = new FieldType(TextField.TYPE_STORED);
    typeOffsets.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
}
The addDoc method:
void addDoc(IndexWriter writer, String title, String body) throws IOException {
    Document doc = new Document();
    doc.add(new Field("title", title, typeOffsets));
    doc.add(new Field("body", body, typeOffsets));
    // You can also add and store a TextField that does not have offsets,
    // like a file ID that you wouldn't search on but need in order to reference the original doc.
    writer.addDocument(doc);
}
Perform your first search:
String q = "...";
String[] fields = new String[] {"title", "body"};
QueryParser parser = new MultiFieldQueryParser(fields, analyzer);
Query query = parser.parse(q);
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(index));
PostingsHighlighter highlighter = new PostingsHighlighter();
TopDocs topDocs = searcher.search(query, 10, Sort.RELEVANCE);
Get highlighted snippets with highlighter.highlightFields(fields, query, searcher, topDocs). You can iterate over the results.
When you want to highlight the final document (i.e. after the search is completed and the user has selected a result), use this solution (needs minor edits). It works by using NullFragmenter to turn the whole text into one snippet.
public static String highlight(String pText, String pQuery) throws Exception
{
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
QueryParser parser = new QueryParser(Version.LUCENE_30, "", analyzer);
Highlighter highlighter = new Highlighter(new QueryScorer(parser.parse(pQuery)));
highlighter.setTextFragmenter(new NullFragmenter());
String text = highlighter.getBestFragment(analyzer, "", pText);
if (text != null)
{
return text;
}
return pText;
}
Edit: You can actually use PostingsHighlighter for this last step instead of Highlighter, but you have to override getBreakIterator and supply your own BreakIterator that treats the whole document as one sentence.
Edit: You can override getFormatter to capture the offsets, rather than trying to parse the <b> tags normally output by PostingsHighlighter.
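Once offsets into the plain text are available, they still have to be cross-referenced back to the original HTML, as the question describes. The following is a rough sketch of that step in plain Java, without Lucene. HtmlOffsetMap, fromHtml and toHtmlRange are names made up for this example. It assumes tags are simply stripped, that a hyphen directly before a tag is a line-break hyphen to be dropped, and it ignores entities such as &amp;amp;.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: plain text plus, for each plain-text char, its index in the HTML.
final class HtmlOffsetMap {
    final String plainText;
    final int[] htmlIndex; // htmlIndex[i] = position in the HTML of plainText.charAt(i)

    private HtmlOffsetMap(String plainText, int[] htmlIndex) {
        this.plainText = plainText;
        this.htmlIndex = htmlIndex;
    }

    /** Strips tags and drops a hyphen that sits directly before a tag (naive de-hyphenation). */
    static HtmlOffsetMap fromHtml(String html) {
        StringBuilder text = new StringBuilder();
        List<Integer> map = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<') {
                int close = html.indexOf('>', i);
                if (close < 0) break;      // unterminated tag: stop, this is only a sketch
                i = close + 1;             // skip the whole tag
                continue;
            }
            if (c == '-' && i + 1 < html.length() && html.charAt(i + 1) == '<') {
                i++;                       // drop the line-break hyphen
                continue;
            }
            text.append(c);
            map.add(i);
            i++;
        }
        int[] index = new int[map.size()];
        for (int k = 0; k < index.length; k++) index[k] = map.get(k);
        return new HtmlOffsetMap(text.toString(), index);
    }

    /** Maps a non-empty plain-text range [start, end) to an HTML range [start, end). */
    int[] toHtmlRange(int start, int end) {
        return new int[] { htmlIndex[start], htmlIndex[end - 1] + 1 };
    }

    public static void main(String[] args) {
        String html = "<div>an inter-</div><div>national audience</div>";
        HtmlOffsetMap m = fromHtml(html);
        int start = m.plainText.indexOf("international");
        int[] r = m.toHtmlRange(start, start + "international".length());
        System.out.println(html.substring(r[0], r[1])); // prints: inter-</div><div>national
    }
}
```

The returned HTML range can span tags, as it does here, so the code that inserts the <b> markers still has to split the range at tag boundaries to produce the two separate bold runs shown in the question.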

Why can't I get the document recently added by IndexWriter in the search results in Lucene 4.0?

As the title says, I have run into a puzzling problem.
I have built an index for my test program, and then I use IndexWriter to add a document to the index. The code is:
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc1 = new Document();
doc1.add(new Field("name", "张三", Field.Store.YES, Field.Index.ANALYZED));
doc1.add(new IntField("year", 2013, Field.Store.YES));
doc1.add(new TextField("content", "123456789", Field.Store.YES));
iwriter.addDocument(doc1);
iwriter.commit();
iwriter.close();
When I search this index, I can't get this document. I do get the correct result count, which is one more than before, but when I print doc.get("name"), the output is wrong.
The code in the search part is:
DirectoryReader ireader = DirectoryReader.open(directory);
System.out.println(ireader.numDeletedDocs());
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "name", analyzer);
Query query = parser.parse("张");
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
System.out.println(hits.length);
In the results, there is a "Name: 李四".
I'm sure that I use StandardAnalyzer for both indexing and searching, and StandardAnalyzer treats each Chinese character as a single token. Why do I get "李四" when I search for "张"? Is there anything wrong with how I add the document? Or is the doc ID mismatched?
Did you (re)open the index after adding the doc? Lucene searches only return the documents that existed as of the time the index was opened for searching.
[edit...]
Use IndexReader.open() or DirectoryReader.openIfChanged() to open the index again. openIfChanged() has the advantage that it returns null if you can still use the old reader instance (because the index has not changed).
(If I recall correctly, DirectoryReader.open() just opens the index directory, so the higher-level Lucene code does not realize that the index has changed if you just call DirectoryReader.open().)

Set Lucene IndexWriter max fields

I have started working my way through the second edition of 'Lucene in Action', which uses the 3.0 API. The author creates a basic IndexWriter with the following method:
private IndexWriter getIndexWriter() throws CorruptIndexException, LockObtainFailedException, IOException {
    return new IndexWriter(directory, new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
}
In the code below I've made the changes required by the current API, except that I cannot figure out how to set the writer's max field length to unlimited like the constant in the book example. I've just inserted the int 1000 below. Is this unlimited constant gone completely in the current API?
private IndexWriter getIndexWriter() throws CorruptIndexException, LockObtainFailedException, IOException {
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36,
new LimitTokenCountAnalyzer(new WhitespaceAnalyzer(Version.LUCENE_36), 1000));
return new IndexWriter(directory, iwc);
}
Thanks, this is just for curiosity.
The IndexWriter javadoc says:
@deprecated use LimitTokenCountAnalyzer instead. Note that the
behavior slightly changed - the analyzer limits the number of
tokens per token stream created, while this setting limits the
total number of tokens to index. This only matters if you index
many multi-valued fields though.
So, in other words, a hard-wired setting has been replaced with a nice adapter/delegate pattern. If you want no limit at all, do not wrap your analyzer in LimitTokenCountAnalyzer.
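To make the "per token stream" versus "total" distinction in that javadoc note concrete, here is a small simulation in plain Java. It does not use Lucene: TokenLimitDemo and both method names are invented for this illustration, whitespace splitting stands in for the analyzer, and each String in the list stands for one value of a multi-valued field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation; it only mimics the two limiting behaviors described in the javadoc.
final class TokenLimitDemo {
    /** Per-stream cap: every field value is truncated on its own. */
    static List<String> limitPerStream(List<String> values, int maxTokens) {
        List<String> kept = new ArrayList<>();
        for (String value : values) {
            if (value.trim().isEmpty()) continue;
            String[] tokens = value.trim().split("\\s+");
            kept.addAll(Arrays.asList(tokens).subList(0, Math.min(tokens.length, maxTokens)));
        }
        return kept;
    }

    /** Total cap: one token budget shared by all values of the field. */
    static List<String> limitTotal(List<String> values, int maxTokens) {
        List<String> kept = new ArrayList<>();
        for (String value : values) {
            if (value.trim().isEmpty()) continue;
            for (String token : value.trim().split("\\s+")) {
                if (kept.size() == maxTokens) return kept;
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("a b c", "d e f");
        System.out.println(limitPerStream(values, 2)); // prints: [a, b, d, e]
        System.out.println(limitTotal(values, 2));     // prints: [a, b]
    }
}
```

With a single-valued field both methods keep the same tokens, which matches the javadoc's remark that the difference only matters for multi-valued fields.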

Use a Lucene index in a Java application

Recently I started working with Solr. I have created an index in Solr and I want to query it from my Java application. I don't want to use solr.war in my application. How can I use the index through the SolrJ API or the Lucene Java API? My idea is to add the index to the project context and use it from there. I went through some examples and tutorials but did not find any on how to work with an already created index. Please suggest a proper solution; a link describing one would also be appreciated.
You can use the Lucene APIs to create, update and search an index.
As Solr is based on Lucene, the underlying index is a Lucene index.
Lucene exposes classes such as IndexWriter and IndexSearcher, which help you interact with the index.
Example of searching over a Solr/Lucene index:
Directory index = FSDirectory.open(new File("/path/to/index"));
IndexSearcher searcher = new IndexSearcher(index, true);
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
You should be able to find examples of this.
Yes, you can use a Solr-created index with Lucene. There is nothing special about it, because Solr itself uses Lucene, so all Lucene documentation applies unchanged.
Or, if you don't want to run Solr as a server, you can embed it in your Java application.
I did it this way:
String realPath = request.getRealPath("/");
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
Directory index = FSDirectory.open(new File(realPath+"/index"));
IndexSearcher indexSearcher = new IndexSearcher(index, true);
TopScoreDocCollector collector = TopScoreDocCollector.create(2000, true);
QueryParser query = new QueryParser(Version.LUCENE_CURRENT, "name", analyzer);
Query q = null;
try {
q = query.parse("*:*");
} catch (ParseException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
indexSearcher.search(q, collector);
ScoreDoc[] scoreDoc = collector.topDocs().scoreDocs;