I have been trying to use Lucene to index our code database. Unfortunately, some terms get omitted from the index. E.g., in the string below, I can search on anything except "version-number":
version-number "cAELimpts.spl SCOPE-PAY:10.1.10 25nov2013kw101730 Setup EMployee field if missing"
I have tried implementing it with both Lucene.NET 3.1 and pylucene 6.2.0, with the same result.
Here are some details of my implementation in Lucene.NET:
using (var writer = new IndexWriter(FSDirectory.Open(INDEX_DIR), new CustomAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED))
{
Console.Out.WriteLine("Indexing to directory '" + INDEX_DIR + "'...");
IndexDirectory(writer, docDir);
Console.Out.WriteLine("Optimizing...");
writer.Optimize();
writer.Commit();
}
The CustomAnalyzer class:
public sealed class CustomAnalyzer : Analyzer
{
public override TokenStream TokenStream(System.String fieldName, System.IO.TextReader reader)
{
return new LowerCaseFilter(new CustomTokenizer(reader));
}
}
Finally, the CustomTokenizer class:
public class CustomTokenizer : CharTokenizer
{
public CustomTokenizer(TextReader input) : base(input)
{
}
public CustomTokenizer(AttributeFactory factory, TextReader input) : base(factory, input)
{
}
public CustomTokenizer(AttributeSource source, TextReader input) : base(source, input)
{
}
protected override bool IsTokenChar(char c)
{
return System.Char.IsLetterOrDigit(c) || c == '_' || c == '-' ;
}
}
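To check what the analyzer actually emits for the sample line, a token dump can confirm whether "version-number" survives tokenization. A minimal sketch, assuming the Lucene.NET 3.x attribute API:
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;

var analyzer = new CustomAnalyzer();
TokenStream stream = analyzer.TokenStream("contents", new StringReader(
    "version-number \"cAELimpts.spl SCOPE-PAY:10.1.10 25nov2013kw101730 Setup EMployee field if missing\""));
var term = stream.AddAttribute<ITermAttribute>();
while (stream.IncrementToken())
{
    // with the tokenizer above, a lowercased "version-number" should be printed here
    Console.WriteLine(term.Term);
}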
It looks like "version-number" and some other terms are not getting indexed because they are present in 99% of the documents. Could that be the cause of the problem?
EDIT: As requested, the FileDocument class:
public static class FileDocument
{
public static Document Document(FileInfo f)
{
// make a new, empty document
Document doc = new Document();
doc.Add(new Field("path", f.FullName, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("modified", DateTools.TimeToString(f.LastWriteTime.Millisecond, DateTools.Resolution.MINUTE), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("contents", new StreamReader(f.FullName, System.Text.Encoding.Default)));
// return the document
return doc;
}
}
I think I was being an idiot. I was limiting the number of hits to 500 and then applying filters to the found hits, expecting items to be retrieved in the order they had been indexed. So when I looked for something near the end of the index, it would tell me that nothing was found; in fact, it retrieved the expected 500 items, but they had all been filtered out.
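For anyone who hits the same trap: one way around it is to push the filtering into the search call itself, so the 500-hit limit is applied after the filter rather than before. A minimal sketch, where myFilterQuery is a hypothetical stand-in for the criteria I had been applying to the returned hits:
// wrap the post-filter criteria so they run inside the search
var filter = new QueryWrapperFilter(myFilterQuery);
// the limit is now applied after filtering, so matches near the end
// of the index are no longer silently discarded
TopDocs hits = searcher.Search(query, filter, 500);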
I have three fields in my document
Title
Content
Modified Date
So when I search for a term, the results come back sorted by score.
Now I would like to further sort results that share the same score by modifiedDate, i.e. among documents with equal scores, showing the most recent on top.
I tried sorting by score, then by modified date, but it's not working. Can anyone point me in the right direction?
This can be done simply by defining a Sort:
Sort sort = new Sort(
SortField.FIELD_SCORE,
new SortField("myDateField", SortField.Type.STRING));
indexSearcher.search(myQuery, numHits, sort);
Two possible gotchas here:
You should make sure your date is indexed in a searchable, and sortable, form. Generally, the best way to accomplish this is to convert it using DateTools.
The field used for sorting must be indexed, and should not be analyzed (a StringField, for instance). Up to you whether it is stored.
So adding the date field might look something like:
Field dateField = new StringField(
"myDateField",
DateTools.DateToString(myDateInstance, DateTools.Resolution.MINUTE),
Field.Store.YES);
document.add(dateField);
Note: You can also index dates as a numeric field using Date.getTime(). I prefer the DateTools string approach, as it provides some nicer tools for handling them, particularly with regards to precision, but either way can work.
You can use a custom collector to solve this problem. It sorts results by score, then by timestamp. In this collector you retrieve the timestamp value for the secondary sort. See the class below:
public class CustomCollector extends TopDocsCollector<ScoreDocWithTime> {
ScoreDocWithTime pqTop;
public CustomCollector(int numHits) {
super(new HitQueueWithTime(numHits, true));
// HitQueue implements getSentinelObject to return a ScoreDoc, so we know
// that at this point top() is already initialized.
pqTop = pq.top();
}
@Override
public LeafCollector getLeafCollector(LeafReaderContext context)
throws IOException {
final int docBase = context.docBase;
final NumericDocValues modifiedDate =
DocValues.getNumeric(context.reader(), "modifiedDate");
return new LeafCollector() {
Scorer scorer;
@Override
public void setScorer(Scorer scorer) throws IOException {
this.scorer = scorer;
}
@Override
public void collect(int doc) throws IOException {
float score = scorer.score();
// This collector cannot handle these scores:
assert score != Float.NEGATIVE_INFINITY;
assert !Float.isNaN(score);
totalHits++;
if (score <= pqTop.score) {
// Since docs are returned in-order (i.e., increasing doc Id), a document
// with equal score to pqTop.score cannot compete since HitQueue favors
// documents with lower doc Ids. Therefore reject those docs too.
return;
}
pqTop.doc = doc + docBase;
pqTop.score = score;
pqTop.timestamp = modifiedDate.get(doc);
pqTop = pq.updateTop();
}
};
}
@Override
public boolean needsScores() {
return true;
}
}
To support the secondary sort, you also need to add an extra field to ScoreDoc:
public class ScoreDocWithTime extends ScoreDoc {
public long timestamp;
public ScoreDocWithTime(long timestamp, int doc, float score) {
super(doc, score);
this.timestamp = timestamp;
}
public ScoreDocWithTime(long timestamp, int doc, float score, int shardIndex) {
super(doc, score, shardIndex);
this.timestamp = timestamp;
}
}
and create a custom priority queue to support this
public class HitQueueWithTime extends PriorityQueue<ScoreDocWithTime> {
public HitQueueWithTime(int numHits, boolean b) {
super(numHits, b);
}
@Override
protected ScoreDocWithTime getSentinelObject() {
return new ScoreDocWithTime(0, Integer.MAX_VALUE, Float.NEGATIVE_INFINITY);
}
@Override
protected boolean lessThan(ScoreDocWithTime hitA, ScoreDocWithTime hitB) {
if (hitA.score == hitB.score)
return (hitA.timestamp == hitB.timestamp) ?
hitA.doc > hitB.doc :
hitA.timestamp < hitB.timestamp;
else
return hitA.score < hitB.score;
}
}
After this, you can search as you need. See the example below:
public class SearchTest {
public static void main(String[] args) throws IOException {
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new StandardAnalyzer());
Directory directory = new RAMDirectory();
IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
addDoc(indexWriter, "w1", 1000);
addDoc(indexWriter, "w1", 3000);
addDoc(indexWriter, "w1", 500);
addDoc(indexWriter, "w1 w2", 1000);
addDoc(indexWriter, "w1 w2", 3000);
addDoc(indexWriter, "w1 w2", 2000);
addDoc(indexWriter, "w1 w2", 5000);
final IndexReader indexReader = DirectoryReader.open(indexWriter, false);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("desc", "w1")), BooleanClause.Occur.SHOULD);
query.add(new TermQuery(new Term("desc", "w2")), BooleanClause.Occur.SHOULD);
CustomCollector results = new CustomCollector(100);
indexSearcher.search(query, results);
TopDocs search = results.topDocs();
for (ScoreDoc sd : search.scoreDocs) {
Document document = indexReader.document(sd.doc);
System.out.println(document.getField("desc").stringValue() + " " + ((ScoreDocWithTime) sd).timestamp);
}
}
private static void addDoc(IndexWriter indexWriter, String desc, long modifiedDate) throws IOException {
Document doc = new Document();
doc.add(new TextField("desc", desc, Field.Store.YES));
doc.add(new LongField("modifiedDate", modifiedDate, Field.Store.YES));
doc.add(new NumericDocValuesField("modifiedDate", modifiedDate));
indexWriter.addDocument(doc);
}
}
The program will output the following results:
w1 w2 5000
w1 w2 3000
w1 w2 2000
w1 w2 1000
w1 3000
w1 1000
w1 500
P.S. This solution is for Lucene 5.1.
I'm using Lucene.NET 3.0.3, and I have a simple customized analyzer and tokenizer which break terms on TAB. I measured it, and it turns out that indexing is twice as slow as with a StandardAnalyzer (which does many more things). Do you know what the problem might be, or whether there is a better solution?
The code is below:
public class CustomAnalyzer : Analyzer
{
public override TokenStream TokenStream(string fieldName, TextReader reader)
{
return new CustomTokenizer(reader);
//return new LetterTokenizer(reader);
}
public override TokenStream ReusableTokenStream(string fieldName, TextReader reader)
{
Tokenizer tokenizer = this.PreviousTokenStream as Tokenizer;
if (tokenizer == null)
{
tokenizer = new CustomTokenizer(reader);
//tokenizer = new LetterTokenizer(reader);
}
else
{
tokenizer.Reset(reader);
}
return tokenizer;
}
}
public class CustomTokenizer : LetterTokenizer
{
public CustomTokenizer(TextReader reader)
: base(reader)
{ }
protected override char Normalize(char c)
{
return char.ToLower(c, CultureInfo.InvariantCulture);
}
protected override bool IsTokenChar(char c)
{
// every character except TAB ('\x0009') is part of a token
return c != '\x0009';
}
}
I forgot to update this thread: the issue was in the custom analyzer. I did not actually save the tokenizer to PreviousTokenStream, so a new tokenizer was being created on every call.
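For reference, the corrected ReusableTokenStream just needs the missing assignment. A sketch against the analyzer above:
public override TokenStream ReusableTokenStream(string fieldName, TextReader reader)
{
    Tokenizer tokenizer = this.PreviousTokenStream as Tokenizer;
    if (tokenizer == null)
    {
        tokenizer = new CustomTokenizer(reader);
        // the missing line: remember the tokenizer so later calls reuse it
        this.PreviousTokenStream = tokenizer;
    }
    else
    {
        tokenizer.Reset(reader);
    }
    return tokenizer;
}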
I have created a Lucene index using StandardAnalyzer with the following three fields:
StreetName
City
State
I am using the wrapper classes below to make writing boolean queries easier:
public interface IQuery
{
BooleanQuery GetQuery();
}
public class QueryParam : IQuery
{
public string[] Fields { get; set; }
public string Term { get; set; }
private BooleanQuery _indexerQuery;
public QueryParam(string term, params string[] fields)
{
Term = term;
Fields = fields;
}
public BooleanQuery GetQuery()
{
_indexerQuery = new BooleanQuery();
foreach (var field in Fields)
_indexerQuery.Add(new FuzzyQuery(new Term(field, Term)), Occur.SHOULD);
return _indexerQuery;
}
}
public class AndQuery : IQuery
{
private readonly IList<IQuery> _queryParams = new List<IQuery>();
private BooleanQuery _indexerQuery;
public AndQuery(params IQuery[] queryParams)
{
foreach (var queryParam in queryParams)
{
_queryParams.Add(queryParam);
}
}
public BooleanQuery GetQuery()
{
_indexerQuery = new BooleanQuery();
foreach (var query in _queryParams)
_indexerQuery.Add(query.GetQuery(), Occur.MUST);
return _indexerQuery;
}
}
public class OrQuery : IQuery
{
private readonly IList<IQuery> _queryParams = new List<IQuery>();
private readonly BooleanQuery _indexerQuery = new BooleanQuery();
public OrQuery(params IQuery[] queryParams)
{
foreach (var queryParam in queryParams)
{
_queryParams.Add(queryParam);
}
}
public BooleanQuery GetQuery()
{
foreach (var query in _queryParams)
_indexerQuery.Add(query.GetQuery(), Occur.SHOULD);
return _indexerQuery;
}
public OrQuery AddQuery(IQuery query)
{
_queryParams.Add(query);
return this;
}
}
The query below does not give me any results in Lucene.NET, but when I search with the same query in Luke, it works flawlessly.
var query = new AndQuery(new QueryParam(city.ToLower(), "city"), new QueryParam(state.ToLower(), "state"), new QueryParam(streetAddress.ToLower(), "streetname"));
Executing query.GetQuery() gives me the resultant query below.
{+(city:tampa~0.5) +(state:fl~0.5) +(street:tennis court~0.5)}
You can search using a BooleanQuery. Split your term on whitespace into segments, then build the query and search the index. For example:
BooleanQuery booleanQuery = new BooleanQuery();
foreach (var searchTerm in searchTerms)
{
var searchTermSegments = searchTerm.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
if (searchTermSegments.Count() > 1)
{
// build a fresh sub-query per multi-word term; reusing one instance
// and clearing its clauses would corrupt queries already added above
var searchTermQuery = new BooleanQuery();
foreach (var segText in searchTermSegments)
{
searchTermQuery.Add(new FuzzyQuery(new Term("FieldName", segText.ToLower().Trim())), BooleanClause.Occur.MUST);
}
booleanQuery.Add(searchTermQuery, BooleanClause.Occur.MUST);
}
else
{
booleanQuery.Add(new FuzzyQuery(new Term("FieldName", searchTerm.ToLower().Trim())), BooleanClause.Occur.MUST);
}
}
The problem is the treatment of tennis court. You haven't shown how you are indexing these fields, but I will assume they are tokenized in the index, using something like a StandardAnalyzer, for instance. This means, "tennis court" will be split into two separate terms "tennis" and "court". When creating a FuzzyQuery manually, though, there is no analysis or tokenization, and so you will only have a single term "tennis court". There is a large edit distance between "tennis court" and either "tennis" (6 edits) or "court" (7 edits), so neither of them match.
A source of confusion here seems to be that
+(city:tampa~0.5) +(state:fl~0.5) +(street:tennis court~0.5)
seems to work. It is not safe, however, to assume that a query's textual form, output for debugging, can be run back through the query parser to generate the same query, and this is a good example. The QueryParser syntax is simply not capable of expressing everything you can do with manually constructed queries. Running that query through the query parser will generate a query more like:
+(city:tampa~0.5) +(state:fl~0.5) +((street:tennis) (defaultField:court~0.5))
This will find a match, since we can expect it to find city:tampa, state:fl, and street:tennis (see the Lucene query parser documentation for another example explaining this behavior). Whether it finds a match on court in the default field, I have no idea, but it doesn't really need to.
A PhraseQuery is the typical way to string multiple terms (words) together in a Lucene Query (this would look like street:"tennis court" in a parsed query).
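For illustration, manually constructing that phrase query might look like the sketch below, assuming the street field was tokenized into the separate terms tennis and court at index time:
// matches documents where "tennis" and "court" appear as adjacent
// terms in the streetname field
var phrase = new PhraseQuery();
phrase.Add(new Term("streetname", "tennis"));
phrase.Add(new Term("streetname", "court"));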
I want to add new fields to my Lucene-based search engine site; however, I want to be able to intercept queries and modify them before I pass them on to the Searcher.
For example, each document has the field userid, so you can search for documents authored by a particular user by their ID, e.g. foo bar userid:123. However, I want to add the ability to search by username.
I'd like to be able to add a term like user:RonaldMcDonald to queries (not to documents), intercept it, and replace it with an equivalent userid:123 term (my own code would be responsible for converting "RonaldMcDonald" to "123").
Here's the simple code I'm using right now:
Int32 get = (pageIndex + 1) * pageSize;
Query query;
try {
query = _queryParser.Parse( queryText );
} catch(ParseException pex) {
log.Add("Could not parse query.");
throw new SearchException( "Could not parse query text.", pex );
}
log.Add("Parsed query.");
TopDocs result = _searcher.Search( query, get );
I've had a look at the Query class, but I can't see any way to retrieve, remove, or insert terms.
You can subclass the QueryParser and override NewTermQuery.
QP qp = new QP("user", new SimpleAnalyzer());
var s = qp.Parse("user:RonaldMcDonald data:[aaa TO bbb]");
where s will be userid:123 data:[aaa TO bbb].
public class QP : QueryParser
{
Dictionary<string, string> _dict =
new Dictionary<string, string>(new MyComparer()) {{"RonaldMcDonald","123"} };
public QP(string field, Analyzer analyzer) : base(field, analyzer)
{
}
protected override Query NewTermQuery(Term term)
{
if (term.Field() == "user")
{
//Do your username -> userid mapping
return new TermQuery(new Term("userid", _dict[term.Text()]));
}
return base.NewTermQuery(term);
}
//Case insensitive comparer
class MyComparer : IEqualityComparer<string>
{
public bool Equals(string x, string y)
{
return String.Compare(x, y, true, CultureInfo.InvariantCulture)==0;
}
public int GetHashCode(string obj)
{
return obj.ToLower(CultureInfo.InvariantCulture).GetHashCode();
}
}
}
When I query for "elegant" in Solr, I get results for "elegance" too.
I used these filters for index analysis:
WhitespaceTokenizerFactory
StopFilterFactory
WordDelimiterFilterFactory
LowerCaseFilterFactory
SynonymFilterFactory
EnglishPorterFilterFactory
RemoveDuplicatesTokenFilterFactory
ReversedWildcardFilterFactory
and these for query analysis:
WhitespaceTokenizerFactory
SynonymFilterFactory
StopFilterFactory
WordDelimiterFilterFactory
LowerCaseFilterFactory
EnglishPorterFilterFactory
RemoveDuplicatesTokenFilterFactory
I want to know which filter is affecting my search results.
EnglishPorterFilterFactory
That's the short answer ;)
A little more information:
English Porter refers to the English Porter stemming algorithm, and according to the stemmer (which is a heuristic word-root builder) both "elegant" and "elegance" have the same stem.
You can verify this with an online stemmer demo: both "elegant" and "elegance" are reduced to the same stem, "eleg".
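If you need particular terms to match exactly rather than by stem, they can be listed in the protected-words file the factory reads. A sketch, assuming the conventional protwords.txt file name referenced from schema.xml:
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
Then in protwords.txt, one term per line (these bypass the stemmer):
elegant
elegance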
From the Solr source:
public void inform(ResourceLoader loader) {
String wordFiles = args.get(PROTECTED_TOKENS);
if (wordFiles != null) {
try {
This is exactly where the protwords file comes into play:
File protectedWordFiles = new File(wordFiles);
if (protectedWordFiles.exists()) {
List<String> wlist = loader.getLines(wordFiles);
//This cast is safe in Lucene
protectedWords = new CharArraySet(wlist, false);//No need to go through StopFilter as before, since it just uses a List internally
} else {
List<String> files = StrUtils.splitFileNames(wordFiles);
for (String file : files) {
List<String> wlist = loader.getLines(file.trim());
if (protectedWords == null)
protectedWords = new CharArraySet(wlist, false);
else
protectedWords.addAll(wlist);
}
}
} catch (IOException e) {
throw new RuntimeException(e);
}
}
}
That's the part which affects the stemming. There you can see the invocation of the Snowball library:
public EnglishPorterFilter create(TokenStream input) {
return new EnglishPorterFilter(input, protectedWords);
}
}
/**
* English Porter2 filter that doesn't use reflection to
* adapt lucene to the snowball stemmer code.
*/
@Deprecated
class EnglishPorterFilter extends SnowballPorterFilter {
public EnglishPorterFilter(TokenStream source,
CharArraySet protWords) {
super(source, new org.tartarus.snowball.ext.EnglishStemmer(), protWords);
}
}