Deleting a document by Term from Lucene

The following code does not delete the document by Term as expected:
RAMDirectory idx = new RAMDirectory();
IndexWriter writer = new IndexWriter(idx,
new SnowballAnalyzer(Version.LUCENE_30, "English"),
IndexWriter.MaxFieldLength.LIMITED);
Document doc = new Document();
doc.add(new Field("title", "mydoc", Field.Store.YES, Field.Index.NO));
doc.add(new Field("content", "some content, deleteme", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);
Document doc2 = new Document();
doc2.add(new Field("title", "mydoc2", Field.Store.YES, Field.Index.NO));
doc2.add(new Field("content", "other content, don't deleteme", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc2);
writer.optimize();
writer.close();
/*
IndexReader reader = IndexReader.open(idx, false);
int docs_up_for_deletion = reader.docFreq(new Term("title"));
int before = reader.numDocs();
int docs_deleted = reader.deleteDocuments(new Term("title", "mydoc"));
reader.close();
*/
IndexWriter writer2 = new IndexWriter(idx,
new SnowballAnalyzer(Version.LUCENE_30, "English"),
IndexWriter.MaxFieldLength.LIMITED);
int before = writer2.numDocs();
writer2.deleteDocuments(new Term("title", "mydoc"));
writer2.commit();
writer2.optimize();
int after = writer2.numDocs();
writer2.close();
int docs_deleted = before - after;
I've tried deleting with the IndexReader and IndexWriter and neither works.
I've also tried adding another IndexReader search after the above code just in case the number only gets updated after closing writer2 (mentioned in this FAQ), but that doesn't help. Doing a writer.deleteAll() works, just not the delete by Term.
I found an old reference saying that only fields of type Field.Keyword can be deleted, but that is no longer a valid field type in Lucene 3.x.

Your title field is not indexed (Field.Index.NO), so the index holds no "title" terms for deleteDocuments to match. Change
new Field("title", "mydoc", Field.Store.YES, Field.Index.NO)
to
new Field("title", "mydoc", Field.Store.YES, Field.Index.ANALYZED)
or
new Field("title", "mydoc", Field.Store.YES, Field.Index.NOT_ANALYZED)
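depending on whether or not you want your field analyzed. A Term used for deletion must match the indexed token exactly, so NOT_ANALYZED is the safer choice when the title serves as an identifier.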
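To see why an unindexed field can't be used for deletion, here is a minimal sketch of a toy inverted index in plain Java. It is not Lucene's API or implementation; ToyIndex and all of its methods are hypothetical names made up for illustration. The only point it makes is that delete-by-term goes through the term dictionary, and a stored-only field never puts anything there.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of an inverted index; NOT Lucene's API. */
class ToyIndex {
    // term key ("field:value") -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // live documents: id -> stored field values
    private final Map<Integer, Map<String, String>> stored = new HashMap<>();
    private int nextId = 0;

    /** Starts a new, empty document and returns its id. */
    int newDoc() {
        stored.put(nextId, new HashMap<>());
        return nextId++;
    }

    /** Always stores the value, but only creates a searchable term when indexed is true. */
    void addField(int docId, String field, String value, boolean indexed) {
        stored.get(docId).put(field, value);
        if (indexed) {
            postings.computeIfAbsent(field + ":" + value, k -> new HashSet<>()).add(docId);
        }
    }

    /** Removes every live document listed under the term; returns how many were removed. */
    int deleteByTerm(String field, String value) {
        Set<Integer> ids = postings.remove(field + ":" + value);
        if (ids == null) {
            return 0; // no such term in the index, so nothing can be deleted
        }
        int deleted = 0;
        for (int id : ids) {
            if (stored.remove(id) != null) {
                deleted++;
            }
        }
        return deleted;
    }

    /** Number of live documents. */
    int numDocs() {
        return stored.size();
    }
}
```

In this sketch, a document added with addField(id, "title", "mydoc", false) still has its title stored, but deleteByTerm("title", "mydoc") returns 0 because no term was ever recorded for it. Adding the field with indexed set to true makes the same call remove the document.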

Related

Lucene doesn't search number fields

I'm trying to index and then search an integer field with Lucene, but the search doesn't find anything (text fields search fine).
Document doc = new Document();
//UserType = 1
doc.add(new IntField("userType", user.getType().getId(), Field.Store.YES));
FSDirectory dir = FSDirectory.open(FileSystems.getDefault().getPath(indexDir));
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
writer = new IndexWriter(dir, config);
writer.addDocument(doc);
For searching I tried the following queries:
1) new QueryParser(defautField, new StandardAnalyzer()).parse("userType:1");
2) new QueryParser(defautField, new StandardAnalyzer()).parse("userType:[1 TO 1]");
3) new QueryParser(defautField, new StandardAnalyzer()).parse("userType:\"1\"");
But none of them returns any results.
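QueryParser doesn't handle numerics: it builds text term queries, and those don't match the numeric encoding that IntField writes to the index. You can search using NumericRangeQuery instead: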
Query query = NumericRangeQuery.newIntRange("userType", 1, 1, true, true);

NOT operator doesn't work in Lucene query

I use Lucene version 3.0.3.0, but some of my search expressions don't work properly. For example, if I search for "!Fiesta OR Astra" on the field "Model", only "vauxhallAstra" is returned and "fordFocus" is not. My code is below:
var fordFiesta = new Document();
fordFiesta.Add(new Field("Id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
fordFiesta.Add(new Field("Make", "Ford", Field.Store.YES, Field.Index.ANALYZED));
fordFiesta.Add(new Field("Model", "Fiesta", Field.Store.YES, Field.Index.ANALYZED));
var fordFocus = new Document();
fordFocus.Add(new Field("Id", "2", Field.Store.YES, Field.Index.NOT_ANALYZED));
fordFocus.Add(new Field("Make", "Ford", Field.Store.YES, Field.Index.ANALYZED));
fordFocus.Add(new Field("Model", "Focus", Field.Store.YES, Field.Index.ANALYZED));
var vauxhallAstra = new Document();
vauxhallAstra.Add(new Field("Id", "3", Field.Store.YES, Field.Index.NOT_ANALYZED));
vauxhallAstra.Add(new Field("Make", "Vauxhall", Field.Store.YES, Field.Index.ANALYZED));
vauxhallAstra.Add(new Field("Model", "Astra", Field.Store.YES, Field.Index.ANALYZED));
Directory directory = FSDirectory.Open(new DirectoryInfo(Environment.CurrentDirectory + "\\LuceneIndex"));
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
writer.AddDocument(fordFiesta);
writer.AddDocument(fordFocus);
writer.AddDocument(vauxhallAstra);
writer.Optimize();
writer.Close();
IndexReader indexReader = IndexReader.Open(directory, true);
Searcher indexSearch = new IndexSearcher(indexReader);
var queryParser = new QueryParser(Version.LUCENE_30, "Model", analyzer);
var query = queryParser.Parse("!Fiesta OR Astra");
Console.WriteLine("Searching for: " + query.ToString());
TopDocs resultDocs = indexSearch.Search(query, 200);
Console.WriteLine("Results Found: " + resultDocs.TotalHits);
var hits = resultDocs.ScoreDocs;
foreach (var hit in hits)
{
var documentFromSearcher = indexSearch.Doc(hit.Doc);
Console.WriteLine(documentFromSearcher.Get("Make") + " " + documentFromSearcher.Get("Model"));
}
indexSearch.Close();
directory.Close();
Console.ReadKey();
!Fiesta OR Astra doesn't mean what you think it means. The !Fiesta portion does NOT mean "get everything except Fiesta", as you might expect, but rather something like "forbid Fiesta". A NOT term in a Lucene query only filters out results; it does not find anything.
The only query you have defined that will actually fetch results is Astra. So everything containing Astra will be found, then anything with Fiesta will be filtered out.
In order to perform the query I believe you are expecting, you would need something like:
Astra OR (*:* !Fiesta)
*:* is a MatchAllDocsQuery. Since you do need to match all the documents to perform this sort of query, it can be expected to perform poorly.
Confusing interpretations of "boolean" logic like this are why I really don't like the AND/OR/NOT syntax for Lucene. +/- is much clearer, more powerful, and doesn't introduce oddball gotchas like this.
This excellent article on the topic clarifies somewhat why you should be thinking in terms of MUST/MUST_NOT/SHOULD, rather than traditional boolean logic.
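The filtering behaviour described above can be sketched with plain Java sets. This is a toy model of SHOULD and MUST_NOT clauses, not Lucene's API; ToyBooleanQuery and search are hypothetical names, and each clause is represented simply as the set of documents it matches.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of SHOULD / MUST_NOT clause semantics; NOT Lucene's API. */
class ToyBooleanQuery {
    /**
     * A document is a hit if at least one SHOULD clause matches it
     * and no MUST_NOT clause matches it. MUST_NOT clauses only remove
     * hits, so a query without any SHOULD clause returns nothing.
     */
    static Set<String> search(List<Set<String>> should, List<Set<String>> mustNot) {
        Set<String> hits = new TreeSet<>();
        for (Set<String> clause : should) {
            hits.addAll(clause);    // selecting clauses collect candidates
        }
        for (Set<String> clause : mustNot) {
            hits.removeAll(clause); // prohibiting clauses only filter
        }
        return hits;
    }
}
```

With the three models from the question, a SHOULD clause matching Astra plus a MUST_NOT clause matching Fiesta gives only Astra, which is what "!Fiesta OR Astra" returned. Feeding the result of search(all documents, MUST_NOT Fiesta) in as a second SHOULD clause next to Astra gives Astra and Focus, which mirrors Astra OR (*:* !Fiesta).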

Lucene white space analyzer ignoring phrases?

I'm updating a document in Lucene, but when I search for the full value in one of the fields no results come back. If I search for just one word, then I get a result back.
This example comes from chapter 2 of the Lucene in Action 2nd Edition book and I'm using the Lucene 3 Java Library.
Here's the main logic:
"Document fields show new value when updated, and not old value" in {
getHitCount("city", "Amsterdam") must equal(1)
val update = new Document
update add new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED)
update add new Field("country", "Netherlands", Field.Store.YES, Field.Index.NO)
update add new Field("contents", "Den Haag has a lot of museums", Field.Store.NO, Field.Index.ANALYZED)
update add new Field("city", "Den Haag", Field.Store.YES, Field.Index.ANALYZED)
wr updateDocument(new Term("id", "1"), update)
wr close
getHitCount("city", "Amsterdam") must equal(0)
getHitCount("city", "Den Haag") must equal(1)
}
It's the last line in the above that fails - the hit count is 0. If I change the query to either "Den" or "Haag" then I get 1 hit.
Here is all the setup and dependencies. Note how the writer uses a whitespace analyzer as the book suggests. Is this the problem?
override def beforeEach{
dir = new RAMDirectory
val wri = writer
for (i <- 0 to ids.length - 1) {
val doc = new Document
doc add new Field("id", ids(i), Field.Store.YES, Field.Index.NOT_ANALYZED)
doc add new Field("country", unindexed(i), Field.Store.YES, Field.Index.NO)
doc add new Field("contents", unstored(i), Field.Store.NO, Field.Index.ANALYZED)
doc add new Field("city", text(i), Field.Store.YES, Field.Index.ANALYZED)
wri addDocument doc
}
wri close
wr = writer
}
var dir: RAMDirectory = _
def writer = new IndexWriter(dir, new WhitespaceAnalyzer, IndexWriter.MaxFieldLength.UNLIMITED)
var wr: IndexWriter = _
def getHitCount(field: String, q: String): Int = {
val searcher = new IndexSearcher(dir)
val query = new TermQuery(new Term(field, q))
val hitCount = searcher.search(query, 1).totalHits
searcher.close()
hitCount
}
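WhitespaceAnalyzer splits "Den Haag" into two tokens, "Den" and "Haag", so the city field contains no single term "Den Haag" for a TermQuery to match. You may want to look at PhraseQuery instead of TermQuery.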
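The difference between matching one term and matching a phrase can be sketched in plain Java. This is only a toy model, not Lucene's API; ToyPhraseMatch and its methods are hypothetical names, and tokenize is a rough stand-in for what a whitespace tokenizer does.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy model of term vs. phrase matching on whitespace tokens; NOT Lucene's API. */
class ToyPhraseMatch {
    /** Splits on runs of whitespace and keeps case, roughly like a whitespace tokenizer. */
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    /** Term match: the query string must equal one whole token of the field. */
    static boolean termMatches(String fieldValue, String term) {
        return tokenize(fieldValue).contains(term);
    }

    /** Phrase match: the query's tokens must appear consecutively and in order. */
    static boolean phraseMatches(String fieldValue, String phrase) {
        List<String> tokens = tokenize(fieldValue);
        List<String> wanted = tokenize(phrase);
        if (wanted.isEmpty()) {
            return false;
        }
        return Collections.indexOfSubList(tokens, wanted) >= 0;
    }
}
```

In this sketch termMatches("Den Haag", "Den Haag") is false while termMatches("Den Haag", "Den") is true, which mirrors the hit counts in the question. phraseMatches("Den Haag", "Den Haag") is true because it compares token sequences rather than looking for one token with a space in it.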

How does Lucene search work?

I'm testing Lucene indexing/searching and I have a question. To test, I created some simple files.
Example:
mark_test_mark.txt
mark test mark
a.txt
mark
test
mark
mark
test
mark
mark
test
mark
mark
test
mark
I extract the files' content and index that too.
I'm creating the document to index this way:
doc.add(new Field(FILE_NAME, index.getFileName().trim(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
doc.add(new Field(FILE_NAME_LOWER, index.getFileName().toLowerCase().trim(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
doc.add(new Field(CONTENT, index.getFileContent(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
My question is about what happens when I search for a keyword like 'mark'.
Lucene returns the following result:
mark_test_mark.txt -> 0.36452034
a.txt -> 0.36452034
Here the first part is the file name and the second is the search score.
In my opinion these two files should not have the same score, and the first result should be a.txt.
Am I wrong?
EDIT:
I forgot to mention that I'm searching by name and content, so I do a multi-field search.
I'm using this code to do it:
IndexReader reader = IndexReader.open(Indexer.getFSDirectory(searchDirectory));
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
MultiFieldQueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_36, new String[] {Indexer.FILE_NAME_LOWER, Indexer.CONTENT}, analyzer);
TopDocs topDocs = null;
try {
topDocs = searcher.search(queryParser.parse(searchQuery.getQuery()), getHitsPerPage());
} catch (ParseException e) {
e.printStackTrace();
}
ScoreDoc[] hits = topDocs.scoreDocs;

Different analyzers for each field

How can I enable different analyzers for each field in a document I'm indexing with Lucene? Example:
RAMDirectory dir = new RAMDirectory();
IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
Field field1 = new Field("field1", someText1, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
Field field2 = new Field("field2", someText2, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
doc.Add(field1);
doc.Add(field2);
iw.AddDocument(doc);
iw.Commit();
The analyzer is an argument to the IndexWriter, but I want to use StandardAnalyzer for field1 and SimpleAnalyzer for field2, how can I do that? The same applies when searching, of course. The correct analyzer must be applied for each field.
PerFieldAnalyzerWrapper is what you are looking for. The equivalent of this in Lucene.net is here.
Map<String, Analyzer> analyzerMap = new HashMap<String, Analyzer>();
analyzerMap.put(fieldone, new IKAnalyzer4PinYin(false, IKAnalyzer4PinYin.PINYIN));
analyzerMap.put(fieldtwo, new IKAnalyzer4PinYin(false, IKAnalyzer4PinYin.PINYIN_SHOUZIMU));
PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new IKAnalyzer4PinYin(false), analyzerMap);
IndexWriterConfig iwConfig = new IndexWriterConfig(Version.LUCENE_40 , wrapper);
Necromancing.
For C#:
Lucene.Net.Util.LuceneVersion version = Lucene.Net.Util.LuceneVersion.LUCENE_48;
Dictionary<string, Lucene.Net.Analysis.Analyzer> fieldAnalyzers =
new Dictionary<string, Lucene.Net.Analysis.Analyzer>(System.StringComparer.OrdinalIgnoreCase);
fieldAnalyzers["YourFieldName"] = new Lucene.Net.Analysis.Core.KeywordAnalyzer();
Lucene.Net.Analysis.Miscellaneous.PerFieldAnalyzerWrapper wrapper =
new Lucene.Net.Analysis.Miscellaneous.PerFieldAnalyzerWrapper(
new Lucene.Net.Analysis.Core.KeywordAnalyzer(), fieldAnalyzers);
Lucene.Net.Index.IndexWriterConfig writerConfig = new Lucene.Net.Index.IndexWriterConfig(version, wrapper);
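For anyone wondering what the wrapper does conceptually, here is a minimal sketch of per-field dispatch in plain Java. It is not the PerFieldAnalyzerWrapper implementation; ToyPerFieldAnalyzer is a hypothetical name, and keyword and simple are made-up stand-ins that only imitate the general behaviour of a keyword-style and a lowercase, letters-only analyzer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Toy model of per-field analyzer dispatch; NOT Lucene's API. */
class ToyPerFieldAnalyzer {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField;

    ToyPerFieldAnalyzer(Function<String, List<String>> defaultAnalyzer,
                        Map<String, Function<String, List<String>>> perField) {
        this.defaultAnalyzer = defaultAnalyzer;
        this.perField = new HashMap<>(perField);
    }

    /** Uses the analyzer registered for the field, falling back to the default. */
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    /** Hypothetical keyword-style analyzer: the whole value is one token. */
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    /** Hypothetical simple analyzer: lowercase, split on anything that is not a letter. */
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}
```

Registering simple for "field2" and using keyword as the default means "field2" text is lowercased and split while every other field is kept as a single token. The same idea applies at search time: the query side should go through the same wrapper so each field's query text is analyzed the way it was indexed.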