Lucene search works only for lowercase letters


I am adding my Lucene document as follows:
final Document document = new Document();
document.add(new Field("login", user.getLogin(), Field.Store.YES, Field.Index.NO));
document.add(new Field("email", user.getEmail(), Field.Store.YES, Field.Index.ANALYZED));
document.add(new Field("firstName", user.getFirstName(), Field.Store.YES, Field.Index.ANALYZED));
document.add(new Field("lastName", user.getLastName(), Field.Store.YES, Field.Index.ANALYZED));
userIndexWriter.addDocument(document);
If I search with lowercase letters, the search succeeds, but if I search with capital letters, it returns nothing.
Does anybody have a clue what I am missing? This is how I set up the analyzer and the index writer:
analyzer = new StandardAnalyzer(Version.LUCENE_36);
final IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_36, analyzer);
final IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
And my searcher manager:
final SearcherManager searcherManager = new SearcherManager(indexWriter, true, null);
And I am searching as follows:
final BooleanQuery booleanQuery = new BooleanQuery();
final Query query1 = new PrefixQuery(new Term("email", prefix));
final Query query2 = new PrefixQuery(new Term("firstName", prefix));
final Query query3 = new PrefixQuery(new Term("lastName", prefix));
booleanQuery.add(query1, BooleanClause.Occur.SHOULD);
booleanQuery.add(query2, BooleanClause.Occur.SHOULD);
booleanQuery.add(query3, BooleanClause.Occur.SHOULD);
final SortField sortField = new SortField("firstName", SortField.STRING, true);
final Sort sort = new Sort(sortField);
final TopDocs topDocs = searcherManager.search(booleanQuery, DEFAULT_TOP_N_SEARCH_USER, sort);

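Make sure you apply the same analysis to both the documents and the query. StandardAnalyzer lowercases terms at index time, while a PrefixQuery built by hand is not analyzed at all, so a prefix containing capital letters never matches the lowercased terms in the index. If the indexing analyzer is StandardAnalyzer, you also need to apply it to your query, like this: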
QueryParser queryParser = new QueryParser(Version.LUCENE_CURRENT, "firstName", new StandardAnalyzer(Version.LUCENE_CURRENT));
try {
Query q = queryParser.parse("Ameer");
} catch (ParseException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
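If you want to keep the hand-built PrefixQuery objects from the question, one option is to normalize the prefix yourself before creating the terms. The following is a minimal sketch of that idea, assuming the index holds the lowercased terms that StandardAnalyzer produces. The class and helper names are hypothetical, and a plain list of terms stands in for the real index so the example runs without Lucene on the classpath.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class PrefixNormalization {

    // Hypothetical helper: lowercases the prefix, mirroring the lowercasing
    // that StandardAnalyzer applies to terms at index time.
    static String normalizePrefix(String prefix) {
        return prefix.trim().toLowerCase(Locale.ROOT);
    }

    // Stand-in for a prefix lookup: true if any indexed term starts with the prefix.
    static boolean matchesAnyTerm(List<String> indexedTerms, String prefix) {
        for (String term : indexedTerms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Terms as they would be stored after lowercasing (illustrative data).
        List<String> indexedTerms = Arrays.asList("ameer", "smith");

        // Raw prefix with a capital letter: no term starts with "Ame".
        System.out.println(matchesAnyTerm(indexedTerms, "Ame"));                  // false
        // Normalized prefix: "ameer" starts with "ame".
        System.out.println(matchesAnyTerm(indexedTerms, normalizePrefix("Ame"))); // true
    }
}
```

With Lucene available, the normalized value would be the one passed to the term, for example `new PrefixQuery(new Term("email", normalizePrefix(prefix)))`.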

Related

Query that checks whether a document contains a word returns a low score

I created a Lucene index and want to find all documents that contain a certain word or phrase.
When I do that, I notice that the score gets lower the longer the text containing that word is.
How can I create a query that only checks for the existence of a word in my documents / fields?
This is how I created the index:
public static Directory CreateIndex(IEnumerable<WorkItemDto> workItems)
{
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
Directory index = new RAMDirectory();
IndexWriter writer = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
foreach (WorkItemDto workItemDto in workItems)
{
Document doc = new Document();
doc.Add(new Field("Title", workItemDto.Title, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
//doc.Add(new NumericField("ID", Field.Store.YES, true).SetIntValue(workItemDto.Id));
writer.AddDocument(doc);
}
writer.Dispose();
return index;
}
And this is how I created the query:
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
Query query = new QueryParser(Version.LUCENE_30, "Title", analyzer).Parse("Some");
IndexSearcher searcher = new IndexSearcher(indexDir);
TopDocs docs = searcher.Search(query, 10);
ScoreDoc[] hits = docs.ScoreDocs;

Lucene.NET fuzzy search on multiple words

I want a combined fuzzy search on two terms, e.g. 'magikal~0.8' and 'mistery~0.8', that returns the results where the words 'magical' and 'mystery' are both present.
My current code is as follows:
Directory createIndex(DataTable table)
{
var directory = new RAMDirectory();
using (Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30))
using (var writer = new IndexWriter(directory, analyzer, new IndexWriter.MaxFieldLength(1000)))
{
foreach (DataRow row in table.Rows)
{
var document = new Document();
document.Add(new Field("DishName", row["DishName"].ToString(), Field.Store.YES, Field.Index.ANALYZED));
document.Add(new Field("CustomisationID", row["CustomisationID"].ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
writer.AddDocument(document);
}
writer.Optimize();
writer.Flush(true, true, true);
}
return directory;
}
private DataTable SearchDishName(string textSearch)
{
string MatchingCutomisationIDs = "0"; // There is no dish with ID zero; this just simplifies the code.
var ds = new DataSet();
ds.ReadXml(System.Web.HttpContext.Current.Server.MapPath("~/App_data/MyDataset.xml"));
DataTable Sample = new DataTable();
Sample = ds.Tables[0];
var table = Sample.Clone();
var Index = createIndex(Sample);
using (var reader = IndexReader.Open(Index, true))
using (var searcher = new IndexSearcher(reader))
{
using (Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30))
{
var queryParser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "DishName", analyzer);
var collector = TopScoreDocCollector.Create(1000, true);
try
{
var query = queryParser.Parse(textSearch);
searcher.Search(query, collector);
}
catch
{ }
var matches = collector.TopDocs().ScoreDocs;
foreach (var item in matches)
{
var id = item.Doc;
var doc = searcher.Doc(id);
var row = table.NewRow();
row["CustomisationID"] = doc.GetField("CustomisationID").StringValue;
table.Rows.Add(row);
}
}
}
return table;
}
Another issue with this code: if I run a normal query with an && operator, I still don't get accurate results.
E.g. the results for the query "magical"&&"mystery" include cases where only "magical" or only "mystery" is present.

Searching sentences in PDF using Lucene phrase query and PDFBox

I have used the following code to search for text in a PDF. It works fine with a single word, but for sentences such as the one in the code, it reports that the text is not present even though it is in the document. Can anyone help me resolve this?
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
// Store the index in memory:
Directory directory = new RAMDirectory();
// To store an index on disk, use this instead:
//Directory directory = FSDirectory.open("/tmp/testindex");
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc = new Document();
PDDocument document = null;
try {
document = PDDocument.load(strFilepath);
}
catch (IOException ex) {
System.out.println("Exception Occured while Loading the document: " + ex);
}
int i =1;
String name = null;
String output=new PDFTextStripper().getText(document);
//String text = "This is the text to be indexed";
doc.add(new Field("contents", output, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();
// Now search the index
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer);
String sentence = "Following are the";
PhraseQuery query = new PhraseQuery();
String[] words = sentence.split(" ");
for (String word : words) {
query.add(new Term("contents", word));
}
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
if(hits.length>0){
System.out.println("Searched text existed in the PDF.");
}
ireader.close();
directory.close();
}
catch(Exception e){
System.out.println("Exception: "+e.getMessage());
}
}

You should use the query parser to create a query from your sentence instead of building the PhraseQuery yourself. Your hand-built query contains the term "Following", which is not in the index: the standard analyzer lowercases text during indexing, so only "following" is indexed.
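To see why the hand-built PhraseQuery cannot match, it may help to compare the raw split of the sentence with what an analyzer would produce. The sketch below only simulates analysis in plain Java: `simulateAnalysis` is a hypothetical helper, and its stop-word list is a small illustrative subset, assuming the default English stop words (which include "are" and "the") are active in the analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class PhraseTermSketch {

    // Illustrative subset of English stop words (assumption, not the full list).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "are", "is", "not", "the"));

    // Hypothetical stand-in for an analyzer: splits on non-alphanumeric
    // characters, lowercases each token, and drops stop words.
    static List<String> simulateAnalysis(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!lower.isEmpty() && !STOP_WORDS.contains(lower)) {
                terms.add(lower);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String sentence = "Following are the";

        // Terms the hand-built PhraseQuery searches for.
        System.out.println(Arrays.asList(sentence.split(" "))); // [Following, are, the]

        // Terms that would actually be in the index under the assumptions above.
        System.out.println(simulateAnalysis(sentence));         // [following]
    }
}
```

None of the raw terms "Following", "are", or "the" exists in the simulated index, which is why letting the query parser analyze the quoted sentence is the safer route.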

Lucene SpanNearQuery

I am trying to understand Lucene SpanNearQuery and wrote up a dummy example. I am looking for "not" followed by "fox" within 5 positions of each other.
I would expect document 3 to be returned as the only hit. However, I end up getting no hits. Any thoughts on what I might be doing wrong would be appreciated.
Here is the code:
//indexing
public void doSpanIndexing() throws IOException {
IndexWriter writer=new IndexWriter(directory, AnalyzerUtil.getPorterStemmerAnalyzer(new StandardAnalyzer(Version.LUCENE_30)),IndexWriter.MaxFieldLength.LIMITED);
Document doc1=new Document();
doc1.add(new Field("content", " brown fox jumped ", Field.Store.YES, Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
writer.addDocument(doc1);
Document doc2=new Document();
doc2.add(new Field("content", "foxes not jumped over the huge fence", Field.Store.YES, Index.ANALYZED,Field.TermVector.WITH_POSITIONS_OFFSETS));
writer.addDocument(doc2);
Document doc3=new Document();
doc3.add(new Field("content", " brown not fox", Field.Store.YES, Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
writer.addDocument(doc3);
writer.close(); // commit the documents so that a new IndexSearcher can see them
}
//searching
public void doSpanSearching(String text) throws CorruptIndexException, IOException, ParseException {
IndexSearcher searcher=new IndexSearcher(directory);
SpanTermQuery term1 = new SpanTermQuery(new Term("content", "not"));
SpanTermQuery term2 = new SpanTermQuery(new Term("content", text));
SpanNearQuery query = new SpanNearQuery(new SpanQuery[] {term1, term2}, 5, true);
TopDocs topDocs=searcher.search(query,5);
for(int i=0; i<topDocs.totalHits; i++) {
System.out.println("Hit Document number: "+topDocs.scoreDocs[i].doc);
System.out.println("Hit Document score: "+topDocs.scoreDocs[i].score);
Document result=searcher.doc(topDocs.scoreDocs[i].doc);
System.out.println("Search result "+(i+1)+ " is "+result.get("content"));
}
}

"Not" is a stop word in the standard analyzer (i.e. it is removed from your text). Can you try it with another word which is not a stop word?

Lucene index update problem

I am using this function to update the index:
private static void insert_index(String url)throws Exception
{
System.out.println(url);
IndexWriter writer = new IndexWriter(
FSDirectory.open(new File(INDEX_DIR)),
new StandardAnalyzer(Version.LUCENE_CURRENT),
true,
IndexWriter.MaxFieldLength.UNLIMITED);
Document doc;
String field;
String text;
doc = new Document();
field = "url";
text = url;
doc.add(new Field(field, text, Field.Store.YES, Field.Index.ANALYZED));
field = "tags";
text = "url";
doc.add(new Field(field, text, Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);
writer.commit();
writer.close();
}
It indexes several URLs, but if I search the url field, it shows only the last indexed URL.

When you create a new index for the first time, the create parameter of the IndexWriter constructor has to be true. From then on it must be false; otherwise the previously saved index content is overwritten. I'd change the code to check for an existing index before creating the IndexWriter instance.
This helper works out whether an index already exists:
private static boolean indexExists(String indexPath) throws IOException
{
    // true if the directory already contains a Lucene index
    return IndexReader.indexExists(FSDirectory.open(new File(indexPath)));
}
Then create the IndexWriter instance like this:
IndexWriter writer = new IndexWriter(
    FSDirectory.open(new File(INDEX_DIR)),
    new StandardAnalyzer(Version.LUCENE_CURRENT),
    !indexExists(INDEX_DIR), // create only when no index exists yet
    IndexWriter.MaxFieldLength.UNLIMITED);
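The decision behind the create flag can also be sketched without Lucene. The example below is a minimal illustration, assuming that a committed index directory contains at least one file whose name starts with "segments"; `IndexCreateFlag` and `shouldCreate` are hypothetical names, and the temporary directory and the `segments_1` file are stand-ins for a real index.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

class IndexCreateFlag {

    // Returns true when no index appears to exist yet, i.e. when the
    // writer should be opened with create = true.
    static boolean shouldCreate(Path indexDir) throws IOException {
        if (!Files.isDirectory(indexDir)) {
            return true; // nothing on disk yet
        }
        // Assumption: an existing index has at least one "segments*" file.
        try (DirectoryStream<Path> segments = Files.newDirectoryStream(indexDir, "segments*")) {
            return !segments.iterator().hasNext();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("index-demo");

        // Empty directory: a new index has to be created.
        System.out.println(shouldCreate(dir)); // true

        // Simulate a committed index by adding a segments file.
        Files.createFile(dir.resolve("segments_1"));
        System.out.println(shouldCreate(dir)); // false
    }
}
```

In Lucene 3.1 and later, opening the writer through IndexWriterConfig with `IndexWriterConfig.OpenMode.CREATE_OR_APPEND` makes this choice for you, so a manual check like the one above is mainly relevant for the older constructor used in the question.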
