How to check if document exists in lucene index? - lucene

I have an index of news articles, where i save title,link,description of news.. sometimes its possible that same news from same link is published with different titles by different news sources. it don't want exactly same description articles to be added twice..how to find if document already exists?

I'm assuming you're working with Java.
Assuming your link is being saved in the index as a StringField (so any analyzers you use won't break up the link into multiple terms), you can use a TermQuery.
TopDocs results = searcher.search(new TermQuery(new Term("link", "http://example.com")), 1);
if (results.totalHits == 0){
Document doc = new Document();
// create your document here with your fields
// link field should be stored as a StringField
doc.add(new StringField("link", "http://example.com", Stored.YES));
writer.addDocument(doc);
}
Note that StringFields are stored exactly so you may want to convert to lowercase when searching/indexing.
If you wish to make sure not more than 1 field already exists, then you can run it as a BooleanQuery using the Occur.SHOULD condition:
BooleanQuery matchingQuery = new BooleanQuery();
matchingQuery.add(new TermQuery(new Term("link", "http://example.com")), Occur.SHOULD);
matchingQuery.add(new TermQuery(new Term("description", "the unique description of the article")), Occur.SHOULD);
TopDocs results = searcher.search(matchingQuery, 1);
if (results.totalHits == 0){
Document doc = new Document();
// create your document here with your fields
// link field should be stored as a StringField
doc.add(new StringField("link", "http://example.com", Stored.YES));
doc.add(new StringField("description", "the unique description of the article", Stored.YES));
// note if you need the description to be tokenized, you need to add another TextField to the document with a different field name
doc.add(new TextField("descriptionText", "the unique description of the article", Stored.NO));
writer.addDocument(doc);
}

Related

Lucene not indexing String field with value "this"

I am adding the document to lucene index as follows:
Document doc = new Document();
String stringObj = (String)field.get(obj);
doc.add(new TextField(fieldName, stringObj.toLowerCase(), org.apache.lucene.document.Field.Store.YES));
indexWriter.addDocument(doc);
I am doing a wild card search as follows:
searchTerm = "*" + searchTerm + "*";
term = new Term(field, sTerm.toLowerCase());
Query query = new WildcardQuery(term);
TotalHitCountCollector collector = new TotalHitCountCollector();
indexSearcher.search(query, collector);
if(collector.getTotalHits() > 0){
TopDocs hits = indexSearcher.search(query, collector.getTotalHits());
}
When I have a string with a "this" value, it is not getting added to the index, hence i do not get the result on searching by "this". I am using a StandardAnalyzer.
Common terms of English language like prepositions, pronouns etc are marked as stop words and omitted before indexing. You can define a custom analyzer or custom stop word list for your analyzer. That way you will be able to omit words that you don't want to be indexed and keep the stop words that you need.

Lucene problems searchinh hyphenated field

I'm having some problems with Lucene that are driving me nuts. I have the following field:
doc.Add(new Field("cataloguenumber", i.CatalogueNumber.ToLower(), Field.Store.YES, Field.Index.ANALYZED));
Which will contain a catalogue number that looks something like this:
DF-GH5
DF-FJ4
DF-DOG
AC-DP
AC-123
AC-DOCO
i.e. two characters followed by a hyphen followed by 2-5 alphanumeric characters.
I'm trying to run a boolean query to allow users to search over the data:
// specify the search fields, lucene search in multiple fields
string[] searchfields = new string[] { "cataloguenumber", "title", "author", "categories", "year", "length", "keyword", "description" };
// Making a boolean query for searching and get the searched hits
BooleanQuery mainQuery = new BooleanQuery();
QueryParser parser;
//Add filter for main keyword
parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, searchfields, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30));
parser.AllowLeadingWildcard = true;
mainQuery.Add(parser.Parse(GetMainSearchQueryString(SearchPhrase)), Occur.MUST);
The system is working fine for all fields EXCEPT cataloguenumber which for whatever reason is not working at all.
Ideally we would like to be able to search by full or partial cataloguenumber so for example "DF-" should return all items prefixed DF
Does anyone know how I can make this work?
Thanks very much in advance
Olly
A common source of problems is to use different analyzers on index-time and query-time. You should be able to get good results by using a StandardAnalyzer - it treats the text DF-GH5 as a single token so you will be able to search using fx df-gh5 or df-* but make sure to use it for the IndexWriter and the QueryParser.
Here is a simple example which builds an in-memory index with a single document, and tries to query the index by cataloguenumber.
public static void Test()
{
// Use an in-memory index.
RAMDirectory indexDirectory = new RAMDirectory();
// Make sure to use the same analyzer for indexing
Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
// Add single document to the index.
using (IndexWriter writer = new IndexWriter(indexDirectory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
Document document = new Document();
document.Add(new Field("content", "This is just some text", Field.Store.YES, Field.Index.ANALYZED));
document.Add(new Field("cataloguenumber", "DF-GH5", Field.Store.YES, Field.Index.ANALYZED));
writer.AddDocument(document);
}
var parser = new MultiFieldQueryParser(
Lucene.Net.Util.Version.LUCENE_30,
new[] { "cataloguenumber", "content" },
analyzer);
var searcher = new IndexSearcher(indexDirectory);
DoSearch("df-gh5", parser, searcher);
DoSearch("df-*", parser, searcher);
}
private static void DoSearch(string queryString, MultiFieldQueryParser parser, IndexSearcher searcher)
{
var query = parser.Parse(queryString);
TopDocs docs = searcher.Search(query, 10);
foreach (ScoreDoc scoreDoc in docs.ScoreDocs)
{
Document searchHit = searcher.Doc(scoreDoc.Doc);
string cataloguenumber = searchHit.GetValues("cataloguenumber").FirstOrDefault();
string content = searchHit.GetValues("content").FirstOrDefault();
Console.WriteLine($"Found object: {cataloguenumber} {content}");
}
}

Lucene phrase query does not work

I can't figure out how to make phrase query to work. It returns exact mathes, but slop option doesn't seem to make a difference.
Here's my code:
static void Main(string[] args)
{
using (Directory directory = new RAMDirectory())
{
Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
using (IndexWriter writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
{
// index a few documents
writer.AddDocument(createDocument("1", "henry morgan"));
writer.AddDocument(createDocument("2", "henry junior morgan"));
writer.AddDocument(createDocument("3", "henry immortal jr morgan"));
writer.AddDocument(createDocument("4", "morgan henry"));
}
// search for documents that have "foo bar" in them
String sentence = "henry morgan";
IndexSearcher searcher = new IndexSearcher(directory, true);
PhraseQuery query = new PhraseQuery()
{
//allow inverse order
Slop = 3
};
query.Add(new Term("contents", sentence));
// display search results
List<string> results = new List<string>();
Console.WriteLine("Looking for \"{0}\"...", sentence);
TopDocs topDocs = searcher.Search(query, 100);
foreach (ScoreDoc scoreDoc in topDocs.ScoreDocs)
{
var matchedContents = searcher.Doc(scoreDoc.Doc).Get("contents");
results.Add(matchedContents);
Console.WriteLine("Found: {0}", matchedContents);
}
}
private static Document createDocument(string id, string content)
{
Document doc = new Document();
doc.Add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("contents", content, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
return doc;
}
I thought that all options except document with id=3 are supposed to match, but only the first one does. Did I miss something?
In Lucene In Action 2nd, 3.4.6 Searching by phrase: PhraseQuery.
PhraseQuery uses this information to locate documents where terms are within a certain distance of one another
Sure, a plain TermQuery would do the trick to locate this document
knowing either of those words, but in this case we only want documents
that have phrases where the words are either exactly side by side
(quick fox) or have one word in between (quick [irrelevant] fox)
So the PhraseQuery is actually used between terms and the sample code in that chapter also proves it. As you use StandardAnalyzer, so "henry morgan"
will be henry and morgan after analyzation. Therefore, you can not add "henry morgan" as one Term
/*
Sets the number of other words permitted between words
in query phrase.If zero, then this is an exact phrase search.
*/
public void setSlop(int s) { slop = s; }
The definition of setSlop may further explain the case.
After a little change on your code, I got it nailed.
// code in Scala
val query = new PhraseQuery();
query.setSlop(3)
List("henry", "morgan").foreach { word =>
query.add(new Term("contents", word))
}
In this case, the four documents will all be matched. If you have any further problem, I suggest you read that chapter in Lucene In Action 2nd.
That might help.

How Lucene search works?

I'm testing Lucene indexing/searchin and I have a doubt. To test I create some simple files.
Example:
mark_test_mark.txt
mark test mark
a.txt
mark
test
mark
mark
test
mark
mark
test
mark
mark
test
mark
I extrac the files' content and I'm indexing this too.
I'm creating the document to indexing this way:
doc.add(new Field(FILE_NAME, index.getFileName().trim(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
doc.add(new Field(FILE_NAME_LOWER, index.getFileName().toLowerCase().trim(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
doc.add(new Field(CONTENT, index.getFileContent(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
My question is when I do a seach for a keyword like 'mark'.
Lucene returns to me the following result:
mark_test_mark.txt -> 0.36452034
a.txt -> 0.36452034
Where, the 1st part represent the file name, and the second, the search score.
In my opinion, these 2 files don't have the same score and the first file should be a.txt.
Am I wrong?
EDIT:
I forget to say that I'm searching by name and content, so I do a multi-field search.
I'm using this code to do this:
IndexReader reader = IndexReader.open(Indexer.getFSDirectory(searchDirectory));
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
MultiFieldQueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_36, new String[] {Indexer.FILE_NAME_LOWER, Indexer.CONTENT}, analyzer);
TopDocs topDocs = null;
try {
topDocs = searcher.search(queryParser.parse(searchQuery.getQuery()), getHitsPerPage());
} catch (ParseException e) {
e.printStackTrace();
}
ScoreDoc[] hits = topDocs.scoreDocs;

Lucene.net return all fields in a document

I am storing data in lucene.Net
I add a document with multiple fields :
var doc = new Document();
doc.Add(new Field("CreationDate", dt, Field.Store.YES, Field.Index.ANALYZED));
doc.Add(new Field("FileName", path, Field.Store.YES, Field.Index.ANALYZED));
doc.Add(new Field("Directory", dtpath, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("Text", text.ToString(), Field.Store.YES, Field.Index.ANALYZED));
...
writer.AddDocument(doc);
I want to move through all items and return fields "CreationDate" and "Directory" for each document.
But a Term can only except 1 field :
var termEnum = reader.Terms(new Term("CreationDate"));
How do i make it return the 2 fields ???
Thanks
Martin
When you iterate over search results, read the document and load the values from it:
int docId = hits[i].doc;
Document doc = searcher.Doc(docId);
String creationDate = doc.Get("CreationDate");
String directory = doc.get("Directory");
// ...and so on
You can ontain a enumeration of all documents containing a given term with:
var termDocEnum = reader.TermDocs(new Term("CreationDate"));
You can use that enumeration to fetch documents using the docId:
Document doc = searcher.doc(termDocEnum.doc);
Which should make it easy enough to use the Document API to get the information you are looking for.
Note, as hinted at earlier, this will only get documents for which the term given has a value specified! If that is a problem, you can either call TermDocs once for each pertinent argument and merge the sets as needed (done easily enough with a hash table indexed on docId, or some such), or call TermDocs with no argument, and use seek to find the appropriate Terms (merging will again need to be done manually if necessary).
A TermEnum, as passed from the 'terms' method does not give you a docId (you would have to use the termDocs method to get them with each term in the enum, I believe.)