Lucene 4.0 sample code - lucene

I can't get this to work with Lucene 4.0 and its new features... could somebody please help me?
I have crawled a bunch of HTML documents from the web. Now I would like to count the number of distinct words in every document.
This is how I did it with Lucene 3.5 (for a single document; to process them all, I loop over all documents, each time with a new RAMDirectory containing only one doc):
Analyzer analyzer = some Lucene Analyzer;
RAMDirectory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
// get (somehow) the String containing the text of the document:
String _words = doc.getPageDescription();
try {
    IndexWriter w = new IndexWriter(index, config);
    addDoc(w, _words);
    w.close();
} catch (IOException e) {
    e.printStackTrace();
} catch (Exception e) {
    e.printStackTrace();
}
try {
    // System.out.print(", count Terms... ");
    IndexReader reader = IndexReader.open(index);
    TermFreqVector[] freqVector = reader.getTermFreqVectors(0);
    if (freqVector == null) {
        System.out.println("Count words: 0");
    }
    for (TermFreqVector vector : freqVector) {
        String[] terms = vector.getTerms();
        int[] freq = vector.getTermFrequencies();
        int n = terms.length;
        System.out.println("Count words: " + n);
        ....
How can I do this with Lucene 4.0?
I'd prefer to do this using an FSDirectory instead of a RAMDirectory, though; I assume that performs better when the number of documents is quite high?
Thanks and regards
C.

Use the Fields/Terms APIs.
See especially the example 'access term vector fields for a specific document'.
Since you are looping over all documents anyway: if your end goal is really something like the average number of unique terms across all documents, read on to the 'index statistics' section. In that case you can compute it efficiently as #postings / #documents, i.e. getSumDocFreq()/maxDoc().
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/index/package-summary.html#package_description
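For the per-document count, something along these lines should work in 4.0 (a rough sketch; the field name "content" is just a placeholder, and it assumes the field was indexed with term vectors enabled):
IndexReader reader = DirectoryReader.open(index);
Fields vectors = reader.getTermVectors(0);           // term vectors for document 0
Terms terms = (vectors == null) ? null : vectors.terms("content");
int distinctWords = 0;
if (terms != null) {
    TermsEnum termsEnum = terms.iterator(null);       // in 4.0 the iterator takes a reuse enum
    while (termsEnum.next() != null) {
        distinctWords++;                              // each step of the enum is one distinct term
    }
}
System.out.println("Count words: " + distinctWords);
reader.close();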

Related

I have synonym matching working EXCEPT in quoted phrases

Simple synonyms (wordA = wordB) are fine. When there are two or more synonyms (wordA = wordB = wordC ...), then phrase matching is only working for the first, unless the phrases have proximity modifiers.
I have a simple test case (it's delivered as an Ant project) which illustrates the problem.
Materials
You can download the test case here: mydemo.with.libs.zip (5MB)
That archive includes the Lucene 9.2 libraries which my test uses; if you prefer a copy without the JAR files you can download that from here: mydemo.zip (9KB)
You can run the test case by unzipping the archive into an empty directory and running the Ant command ant rnsearch
Input
When indexing the documents, the following synonym list is used (permuted as necessary):
note,notes,notice,notification
subtree,sub tree,sub-tree
I have three documents, each containing a single sentence. The three sentences are:
These release notes describe a document sub tree in a simple way.
This release note describes a document subtree in a simple way.
This release notice describes a document sub-tree in a simple way.
Problem
I believe that any of the following searches should match all three documents:
release note
release notes
release notice
release notification
"release note"
"release notes"
"release notice"
"release notification"
As it happens, the first four searches are fine, but the quoted phrases demonstrate a problem.
The searches for "release note" and "release notes" match all three records, but "release notice" only matches one, and "release notification" does not match any.
However if I change the last two searches like so:
"release notice"~1
"release notification"~2
then all three documents match.
What appears to be happening is that the first synonym is being given the same index position as the term, the second synonym has the position offset by 1, the third offset by 2, etc.
I believe that all the synonyms should be given the same position so that all four phrases match without the need for proximity modifiers at all.
Edit, here's the source of my analyzer:
public class MyAnalyzer extends Analyzer {

    public MyAnalyzer(String synlist) {
        this.synlist = synlist;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        WhitespaceTokenizer src = new WhitespaceTokenizer();
        TokenStream result = new LowerCaseFilter(src);
        if (synlist != null) {
            result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
            result = new FlattenGraphFilter(result);
        }
        return new TokenStreamComponents(src, result);
    }

    private static SynonymMap getSynonyms(String synlist) {
        boolean dedup = Boolean.TRUE;
        SynonymMap synMap = null;
        SynonymMap.Builder builder = new SynonymMap.Builder(dedup);
        int cnt = 0;
        try {
            BufferedReader br = new BufferedReader(new FileReader(synlist));
            String line;
            try {
                while ((line = br.readLine()) != null) {
                    processLine(builder, line);
                    cnt++;
                }
            } catch (IOException e) {
                System.err.println(" caught " + e.getClass() + " while reading synonym list,\n with message " + e.getMessage());
            }
            System.out.println("Synonym load processed " + cnt + " lines");
            br.close();
        } catch (Exception e) {
            System.err.println(" caught " + e.getClass() + " while loading synonym map,\n with message " + e.getMessage());
        }
        if (cnt > 0) {
            try {
                synMap = builder.build();
            } catch (IOException e) {
                System.err.println(e);
            }
        }
        return synMap;
    }

    private static void processLine(SynonymMap.Builder builder, String line) {
        boolean keepOrig = Boolean.TRUE;
        String terms[] = line.split(",");
        if (terms.length < 2) {
            System.err.println("Synonym input must have at least two terms on a line: " + line);
        } else {
            String word = terms[0];
            String[] synonymsOfWord = Arrays.copyOfRange(terms, 1, terms.length);
            addSyns(builder, word, synonymsOfWord, keepOrig);
        }
    }

    private static void addSyns(SynonymMap.Builder builder, String word, String[] syns, boolean keepOrig) {
        CharsRefBuilder synset = new CharsRefBuilder();
        SynonymMap.Builder.join(syns, synset);
        CharsRef wordp = SynonymMap.Builder.join(word.split("\\s+"), new CharsRefBuilder());
        builder.add(wordp, synset.get(), keepOrig);
    }

    private String synlist;
}
The analyzer includes synonyms when it builds the index, and does not add synonyms when it is used to process a query.
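For reference, a minimal sketch of how the two modes are wired up (the field name "content" and the synonyms file name here are placeholders, not the actual ones from the test case):
// index time: analyzer built with the synonym list
Analyzer indexAnalyzer = new MyAnalyzer("synonyms.txt");
IndexWriterConfig iwc = new IndexWriterConfig(indexAnalyzer);
IndexWriter writer = new IndexWriter(dir, iwc);
// query time: same analyzer class, but with a null synonym list, so no synonyms are added
Analyzer queryAnalyzer = new MyAnalyzer(null);
QueryParser parser = new QueryParser("content", queryAnalyzer);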
For the "note", "notes", "notice", "notification" list of synonyms:
It is possible to build an index of the above synonyms so that every query listed in the question will find all three documents - including the phrase searches without the need for any ~n proximity searches.
I see there is a separate question for the other list of synonyms "subtree", "sub tree", "sub-tree" - so I will skip those here (I expect the below approach will not work for those, but I would have to take a closer look).
The solution is straightforward, and it's based on a realization that I was (in an earlier question) completely incorrect in an assumption I made about how to build the synonyms:
You can place multiple synonyms of a given word at the same position as the word, when building your indexed data. I incorrectly thought you needed to provide the synonyms as a list - but you can provide them one at a time as words.
Here is the approach:
My analyzer:
Analyzer analyzer = new Analyzer() {
    @Override
    protected Analyzer.TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream tokenStream = source;
        tokenStream = new LowerCaseFilter(tokenStream);
        tokenStream = new ASCIIFoldingFilter(tokenStream);
        tokenStream = new SynonymGraphFilter(tokenStream, getSynonyms(), ignoreSynonymCase);
        tokenStream = new FlattenGraphFilter(tokenStream);
        return new Analyzer.TokenStreamComponents(source, tokenStream);
    }
};
The getSynonyms() method used by the above analyzer, using the note,notes,notice,notification list:
private SynonymMap getSynonyms() {
    // de-duplicate rules when loading:
    boolean dedup = Boolean.TRUE;
    // include original word in index:
    boolean includeOrig = Boolean.TRUE;
    String[] synonyms = {"note", "notes", "notice", "notification"};
    // build a synonym map where every word in the list is a synonym
    // of every other word in the list:
    SynonymMap.Builder synMapBuilder = new SynonymMap.Builder(dedup);
    for (String word : synonyms) {
        for (String synonym : synonyms) {
            if (!synonym.equals(word)) {
                synMapBuilder.add(new CharsRef(word), new CharsRef(synonym), includeOrig);
            }
        }
    }
    SynonymMap synonymMap = null;
    try {
        synonymMap = synMapBuilder.build();
    } catch (IOException ex) {
        System.err.print(ex);
    }
    return synonymMap;
}
I looked at the indexed data by using org.apache.lucene.codecs.simpletext.SimpleTextCodec, to generate human-readable indexes (just for testing purposes):
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
iwc.setCodec(new SimpleTextCodec());
This allowed me to see where the synonyms were inserted into the indexed data. So, for example, taking the word note, we see the following indexed entries:
term note
  doc 0
    freq 1
    pos 2
  doc 1
    freq 1
    pos 2
  doc 2
    freq 1
    pos 2
So, that tells us that all three documents contain note at token position 2 (the 3rd word).
And for notification we see exactly the same data:
term notification
  doc 0
    freq 1
    pos 2
  doc 1
    freq 1
    pos 2
  doc 2
    freq 1
    pos 2
We see this for all the words in the synonym list, which is why all 8 queries return all 3 documents.
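To double-check from the query side, a small sketch like the following (the field name "content" and the directory handling are assumed here, not taken from the test project) should show all three documents matching the previously failing phrase:
Analyzer queryAnalyzer = new StandardAnalyzer();    // no synonyms at query time
QueryParser parser = new QueryParser("content", queryAnalyzer);
Query q = parser.parse("\"release notification\"");
try (DirectoryReader reader = DirectoryReader.open(dir)) {
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs hits = searcher.search(q, 10);
    System.out.println(hits.totalHits);              // expected: 3 hits
}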

Querying part-of-speech tags with Lucene 7 OpenNLP

For fun and learning I am trying to build a part-of-speech (POS) tagger with OpenNLP and Lucene 7.4. The goal is that once indexed, I can actually search for a sequence of POS tags and find all sentences that match the sequence. I already have the indexing part working, but I am stuck on the query part. I am aware that Solr might have some functionality for this, and I already checked the code (which was not so self-explanatory after all). But my goal is to understand and implement this in Lucene 7, not in Solr, as I want to be independent of any search engine on top.
Idea
Input sentence 1: The quick brown fox jumped over the lazy dogs.
Applied Lucene OpenNLP tokenizer results in: [The][quick][brown][fox][jumped][over][the][lazy][dogs][.]
Next, applying Lucene OpenNLP POS tagging results in: [DT][JJ][JJ][NN][VBD][IN][DT][JJ][NNS][.]
Input sentence 2: Give it to me, baby!
Applied Lucene OpenNLP tokenizer results in: [Give][it][to][me][,][baby][!]
Next, applying Lucene OpenNLP POS tagging results in: [VB][PRP][TO][PRP][,][UH][.]
Query: JJ NN VBD matches part of sentence 1, so sentence 1 should be returned. (At this point I am only interested in exact matches, i.e. let's leave aside partial matches, wildcards etc.)
Indexing
First, I created my own class com.example.OpenNLPAnalyzer:
public class OpenNLPAnalyzer extends Analyzer {
    protected TokenStreamComponents createComponents(String fieldName) {
        try {
            ResourceLoader resourceLoader = new ClasspathResourceLoader(ClassLoader.getSystemClassLoader());
            TokenizerModel tokenizerModel = OpenNLPOpsFactory.getTokenizerModel("en-token.bin", resourceLoader);
            NLPTokenizerOp tokenizerOp = new NLPTokenizerOp(tokenizerModel);
            SentenceModel sentenceModel = OpenNLPOpsFactory.getSentenceModel("en-sent.bin", resourceLoader);
            NLPSentenceDetectorOp sentenceDetectorOp = new NLPSentenceDetectorOp(sentenceModel);
            Tokenizer source = new OpenNLPTokenizer(
                    AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, sentenceDetectorOp, tokenizerOp);
            POSModel posModel = OpenNLPOpsFactory.getPOSTaggerModel("en-pos-maxent.bin", resourceLoader);
            NLPPOSTaggerOp posTaggerOp = new NLPPOSTaggerOp(posModel);
            // Perhaps we should also use a lower-case filter here?
            TokenFilter posFilter = new OpenNLPPOSFilter(source, posTaggerOp);
            // Very important: the token types (POS tags) are not indexed, so we need to store them
            // as payloads, otherwise we cannot search on them
            TypeAsPayloadTokenFilter payloadFilter = new TypeAsPayloadTokenFilter(posFilter);
            return new TokenStreamComponents(source, payloadFilter);
        } catch (IOException e) {
            throw new RuntimeException(e.getMessage());
        }
    }
}
Note that we are using a TypeAsPayloadTokenFilter wrapped around OpenNLPPOSFilter. This means our POS tags will be indexed as payloads, and our query, whatever it ends up looking like, will have to search on payloads as well.
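For what it's worth, the payloads can be verified with a low-level check like the one below; the term "fox" and the field "body" are just examples from my test sentence.
DirectoryReader reader = DirectoryReader.open(index);
for (LeafReaderContext leaf : reader.leaves()) {
    PostingsEnum postings = leaf.reader().postings(new Term("body", "fox"), PostingsEnum.PAYLOADS);
    if (postings == null) continue;
    while (postings.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        for (int i = 0; i < postings.freq(); i++) {
            postings.nextPosition();
            BytesRef payload = postings.getPayload();
            System.out.println(payload == null ? "no payload" : payload.utf8ToString()); // e.g. "NN"
        }
    }
}
reader.close();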
Querying
This is where I am stuck. I have no clue how to query on payloads, and whatever I try does not work. Note that I am using Lucene 7; payload querying seems to have changed several times across older versions, and documentation is extremely scarce. It's not even clear what the proper field name to query is now - is it "word" or "type" or something else? For example, I tried this code, which does not return any search results:
// Step 1: Indexing
final String body = "The quick brown fox jumped over the lazy dogs.";
Directory index = new RAMDirectory();
OpenNLPAnalyzer analyzer = new OpenNLPAnalyzer();
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(index, indexWriterConfig);
Document document = new Document();
document.add(new TextField("body", body, Field.Store.YES));
writer.addDocument(document);
writer.close();
// Step 2: Querying
final int topN = 10;
DirectoryReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
final String fieldName = "body"; // What is the correct field name here? "body", or "type", or "word" or anything else?
final String queryText = "JJ";
Term term = new Term(fieldName, queryText);
SpanQuery match = new SpanTermQuery(term);
BytesRef pay = new BytesRef("type"); // Don't understand what to put here as an argument
SpanPayloadCheckQuery query = new SpanPayloadCheckQuery(match, Collections.singletonList(pay));
System.out.println(query.toString());
TopDocs topDocs = searcher.search(query, topN);
Any help is very much appreciated here.
Why don't you use a TypeAsSynonymFilter instead of TypeAsPayloadTokenFilter and just make a normal query? So in your Analyzer:
:
TokenFilter posFilter = new OpenNLPPOSFilter(source, posTaggerOp);
TypeAsSynonymFilter typeAsSynonymFilter = new TypeAsSynonymFilter(posFilter);
return new TokenStreamComponents(source, typeAsSynonymFilter);
And indexing side:
static Directory index() throws Exception {
    Directory index = new RAMDirectory();
    OpenNLPAnalyzer analyzer = new OpenNLPAnalyzer();
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
    IndexWriter writer = new IndexWriter(index, indexWriterConfig);
    writer.addDocument(doc("The quick brown fox jumped over the lazy dogs."));
    writer.addDocument(doc("Give it to me, baby!"));
    writer.close();
    return index;
}

static Document doc(String body) {
    Document document = new Document();
    document.add(new TextField(FIELD, body, Field.Store.YES));
    return document;
}
And searching side:
static void search(Directory index, String searchPhrase) throws Exception {
    final int topN = 10;
    DirectoryReader reader = DirectoryReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    QueryParser parser = new QueryParser(FIELD, new WhitespaceAnalyzer());
    Query query = parser.parse(searchPhrase);
    System.out.println(query);
    TopDocs topDocs = searcher.search(query, topN);
    System.out.printf("%s => %d hits\n", searchPhrase, topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        Document doc = searcher.doc(scoreDoc.doc);
        System.out.printf("\t%s\n", doc.get(FIELD));
    }
}
And then use them like this:
public static void main(String[] args) throws Exception {
    Directory index = index();
    search(index, "\"JJ NN VBD\"");  // search a sequence of POS tags
    search(index, "\"brown fox\"");  // search a phrase
    search(index, "\"fox brown\"");  // search a phrase (no hits)
    search(index, "baby");           // search a word
    search(index, "\"TO PRP\"");     // search a sequence of POS tags
}
The result looks like this:
body:"JJ NN VBD"
"JJ NN VBD" => 1 hits
The quick brown fox jumped over the lazy dogs.
body:"brown fox"
"brown fox" => 1 hits
The quick brown fox jumped over the lazy dogs.
body:"fox brown"
"fox brown" => 0 hits
body:baby
baby => 1 hits
Give it to me, baby!
body:"TO PRP"
"TO PRP" => 1 hits
Give it to me, baby!

High performance unique document id retrieval

Currently I am working on a high-performance NRT system using Lucene 4.9.0 on the Java platform which detects near-duplicate text documents.
For this purpose I query Lucene to return some set of matching candidates and do the near-duplicate calculation locally (by retrieving and caching term vectors). But my main concern is the performance of binding Lucene's docId (which can change) to my own unique and immutable document id stored within the index.
My flow is as follows:
query for documents in Lucene
for each document:
fetch my unique document id based on Lucene docId
get the term vector from the cache for my document id (if it doesn't exist - fetch it from Lucene and populate the cache)
do maths...
My major bottleneck is the "fetch my unique document id" step, which introduces a huge performance degradation (especially since I sometimes have to do the calculation for, let's say, 40000 term vectors in a single loop).
try {
    Document document = indexReader.document(id);
    return document.getField(ID_FIELD_NAME).numericValue().intValue();
} catch (IOException e) {
    throw new IndexException(e);
}
Possible solutions I was considering were:
trying Zoie, which handles unique and persistent doc identifiers,
use of FieldCache (still very inefficient),
use of Payloads (according to http://invertedindex.blogspot.com/2009/04/lucene-dociduid-mapping-and-payload.html) - but I do not have any idea how to apply them.
Any other suggestions?
I have figured out how to partially solve the issue using Lucene's AtomicReader. For this purpose I use a global cache to keep the already-instantiated per-segment FieldCache entries.
Map<Object, FieldCache.Ints> fieldCacheMap = new HashMap<Object, FieldCache.Ints>();
In my method I use the following piece of code:
Query query = new TermQuery(new Term(FIELD_NAME, fieldValue));
IndexReader indexReader = DirectoryReader.open(indexWriter, true);
List<AtomicReaderContext> leaves = indexReader.getContext().leaves();
// not shown in the original snippet: tracks which segment caches were used during this pass
Set<Object> usedReaderSet = new HashSet<Object>();
// process each segment separately
for (AtomicReaderContext leave : leaves) {
    AtomicReader reader = leave.reader();
    FieldCache.Ints fieldCache;
    Object fieldCacheKey = reader.getCoreCacheKey();
    synchronized (fieldCacheMap) {
        fieldCache = fieldCacheMap.get(fieldCacheKey);
        if (fieldCache == null) {
            fieldCache = FieldCache.DEFAULT.getInts(reader, ID_FIELD_NAME, true);
            fieldCacheMap.put(fieldCacheKey, fieldCache);
        }
        usedReaderSet.add(fieldCacheKey);
    }
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    for (int i = 0; i < scoreDocs.length; i++) {
        int docID = scoreDocs[i].doc;
        int offerId = fieldCache.get(docID);
        // do your processing here
    }
}
// remove unused entries from the cache
synchronized (fieldCacheMap) {
    Set<Object> inCacheSet = fieldCacheMap.keySet();
    Set<Object> toRemove = new HashSet<Object>();
    for (Object inCache : inCacheSet) {
        if (!usedReaderSet.contains(inCache)) {
            toRemove.add(inCache);
        }
    }
    for (Object subject : toRemove) {
        fieldCacheMap.remove(subject);
    }
}
indexReader.close();
It works pretty fast. My main concern is memory usage, which can be really high when using a large index.
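A further idea, not tried in the code above: if the unique id is additionally indexed as a NumericDocValuesField, each segment can resolve docID -> id through doc values instead of FieldCache, which keeps the heap footprint much smaller. Rough sketch:
// at index time (assumed): doc.add(new NumericDocValuesField(ID_FIELD_NAME, myUniqueId));
for (AtomicReaderContext leaf : indexReader.leaves()) {
    NumericDocValues ids = leaf.reader().getNumericDocValues(ID_FIELD_NAME);
    // for a segment-local docID from this leaf's search results:
    // long myId = ids.get(docID);
}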

Longer search time for shorter index in lucene

I was experimenting with Lucene on the Cystic Fibrosis collection. I made 4 separate indexes, where one index had only the title, another had the abstract, another the subject, and the last one had all fields.
Now I find that the search time for the index which uses only the title is significantly larger than for the other 3 indexes. This seems counter-intuitive, as its index size is small compared to the other indices. What can be the probable reason for this?
Here is the code I have used for the benchmark
public class PrecisionRecall {

    public static void main(String[] args) throws Throwable {
        File topicsFile = new File("C:/Users/Raden/Documents/lucene/LuceneHibernate/LIA/lia2e/src/lia/benchmark/topics.txt");
        File qrelsFile = new File("C:/Users/Raden/Documents/lucene/LuceneHibernate/LIA/lia2e/src/lia/benchmark/qrels.txt");
        Directory dir = FSDirectory.open(new File("C:/Users/Raden/Documents/myindex"));
        Searcher searcher = new IndexSearcher(dir, true);
        String docNameField = "filename";
        PrintWriter logger = new PrintWriter(System.out, true);
        TrecTopicsReader qReader = new TrecTopicsReader();                     //#1
        QualityQuery qqs[] = qReader.readQueries(                              //#1
                new BufferedReader(new FileReader(topicsFile)));               //#1
        Judge judge = new TrecJudge(new BufferedReader(                        //#2
                new FileReader(qrelsFile)));                                   //#2
        judge.validateData(qqs, logger);                                       //#3
        QualityQueryParser qqParser = new SimpleQQParser("title", "contents"); //#4
        QualityBenchmark qrun = new QualityBenchmark(qqs, qqParser, searcher, docNameField);
        SubmissionReport submitLog = null;
        QualityStats stats[] = qrun.execute(judge,                             //#5
                submitLog, logger);
        QualityStats avg = QualityStats.average(stats);                        //#6
        avg.log("SUMMARY", 2, logger, " ");
        dir.close();
    }
}
The response time of a query does not depend on index size. It depends on the number of hits and the number of terms in the query.
This is because you don't have to read all the index data. You only need to read the document list for the query terms.

How to use Lucene IndexReader to read index in version 4.4?

For just the sake of learning I've created an index from one file and wanted to search it. I am using Lucene version 4.4. I know that the indexing part is correct.
tempFileName is the name of the file which contains the tokens, and this file has the following words:
"odd plus odd is even ## even plus even is even ## odd plus even is odd ##"
However, when I provide a query it returns nothing. I can't see what the problem would be. Any help is greatly appreciated.
Indexing part :
public void startIndexingDocument(String indexPath) throws IOException {
    Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_44);
    SimpleFSDirectory directory = new SimpleFSDirectory(new File(indexPath));
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_44, analyzer);
    IndexWriter writer = new IndexWriter(directory, config);
    indexDocs(writer);
    writer.close();
}

private void indexDocs(IndexWriter w) throws IOException {
    Document doc = new Document();
    File file = new File(tempFileName);
    BufferedReader br = new BufferedReader(new FileReader(tempFileName));
    Field field = new StringField(fieldName, br.readLine().toString(), Field.Store.YES);
    doc.add(field);
    w.addDocument(doc);
}
Searching part :
public void readFromIndex(String indexPath) throws IOException, ParseException {
    Analyzer anal = new WhitespaceAnalyzer(Version.LUCENE_44);
    QueryParser parser = new QueryParser(Version.LUCENE_44, fieldName, anal);
    Query query = parser.parse("odd");
    IndexReader reader = IndexReader.open(NIOFSDirectory.open(new File(indexPath)));
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
    searcher.search(query, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;
    // display
    System.out.println("fieldName =" + fieldName);
    System.out.println("Found : " + hits.length + " hits.");
    for (int i = 0; i < hits.length; i++) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println((i + 1) + ". " + d.get(fieldName));
    }
    reader.close();
}
The problem is that you are using a StringField. StringField indexes the entire input as a single token. Good for atomic strings, like keywords, identifiers, stuff like that. Not good for full text searching.
Use a TextField.
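In the indexDocs method from the question, that is a one-line change (sketch):
// TextField tokenizes the line, so individual words like "odd" become searchable terms
Field field = new TextField(fieldName, br.readLine(), Field.Store.YES);
doc.add(field);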
StringField produces a single token, so I tried to test this with some simple code.
For example, @yns: suppose you have a crawler file whose contents are a single string.
e.g. file name: data03.scd, contents: parktaeha
If you search with the query string "parktaeha", you get a result:
field name : acet, queryString parktaeha
======== start search!! ========== q=acet:parktaeha Found 1 hits. result array length :1 search result=> parktaeha
======== end search!! ==========
Here is the test code:
while ((target = in.readLine()) != null) {
    System.out.println("target:" + target);
    doc.add(new TextField("acet", target, Field.Store.YES)); // use TextField
    // TEST : doc.add(new StringField("acet", target.toString(), Field.Store.YES));
}