Lucene.Net is not giving any results but Luke does - lucene

I have created a Lucene Index using StandardAnalyzer with following three fields.
StreetName
City
State
I am using below wrapper class to ease out writing boolean queries
public interface IQuery
{
BooleanQuery GetQuery();
}
public class QueryParam : IQuery
{
public string[] Fields { get; set; }
public string Term { get; set; }
private BooleanQuery _indexerQuery;
public QueryParam(string term, params string[] fields)
{
Term = term;
Fields = fields;
}
public BooleanQuery GetQuery()
{
_indexerQuery = new BooleanQuery();
foreach (var field in Fields)
_indexerQuery.Add(new FuzzyQuery(new Term(field, Term)), Occur.SHOULD);
return _indexerQuery;
}
}
public class AndQuery : IQuery
{
private readonly IList<IQuery> _queryParams = new List<IQuery>();
private BooleanQuery _indexerQuery;
public AndQuery(params IQuery[] queryParams)
{
foreach (var queryParam in queryParams)
{
_queryParams.Add(queryParam);
}
}
public BooleanQuery GetQuery()
{
_indexerQuery = new BooleanQuery();
foreach (var query in _queryParams)
_indexerQuery.Add(query.GetQuery(), Occur.MUST);
return _indexerQuery;
}
}
public class OrQuery : IQuery
{
private readonly IList<IQuery> _queryParams = new List<IQuery>();
private readonly BooleanQuery _indexerQuery = new BooleanQuery();
public OrQuery(params IQuery[] queryParams)
{
foreach (var queryParam in queryParams)
{
_queryParams.Add(queryParam);
}
}
public BooleanQuery GetQuery()
{
foreach (var query in _queryParams)
_indexerQuery.Add(query.GetQuery(), Occur.SHOULD);
return _indexerQuery;
}
public OrQuery AddQuery(IQuery query)
{
_queryParams.Add(query);
return this;
}
}
Below query is not giving me any results in Lucene.Net but when i search the same query in Luke,it works flawlessly.
var query = new AndQuery(new QueryParam(city.ToLower(), "city"), new QueryParam(state.ToLower(), "state"), new QueryParam(streetAddress.ToLower(), "streetname"));
Executing query.GetQuery() gives me below resultant query.
{+(city:tampa~0.5) +(state:fl~0.5) +(street:tennis court~0.5)}

You can search using BooleanQuery. Break your term with white space in segments, then create the query and search in index.
EX:-
BooleanQuery booleanQuery = new BooleanQuery()
BooleanQuery searchTermQuery = new BooleanQuery();
foreach (var searchTerm in searchTerms)
{
var searchTermSegments = searchTerm.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
if (searchTermSegments.Count() > 1)
{
searchTermQuery.Clauses().Clear();
foreach (var SegTex in searchTermSegments)
{
searchTermQuery.Add( new FuzzyQuery(new Term("FieldName", SegTex.ToLower().Trim())),BooleanClause.Occur.MUST);
}
booleanQuery.Add(searchTermQuery, BooleanClause.Occur.MUST);
}
else
{
booleanQuery.Add(new FuzzyQuery(new Term("FieldName", searchTerm.ToLower().Trim())), BooleanClause.Occur.MUST);
}
}

The problem is the treatment of tennis court. You haven't shown how you are indexing these fields, but I will assume they are tokenized in the index, using something like a StandardAnalyzer, for instance. This means, "tennis court" will be split into two separate terms "tennis" and "court". When creating a FuzzyQuery manually, though, there is no analysis or tokenization, and so you will only have a single term "tennis court". There is a large edit distance between "tennis court" and either "tennis" (6 edits) or "court" (7 edits), so neither of them match.
A source of confusion here seems to be that
+(city:tampa~0.5) +(state:fl~0.5) +(street:tennis court~0.5)
Seems to work. It is not safe, however, to assume that the text query output for debugging can be run through the queryparser to generate the same query, and this is a good example. The QueryParser syntax is simply not capable of expressing everything you can do with manually constructed queries. Running that query through the query parser will generate a query more like:
+(city:tampa~0.5) +(state:fl~0.5) +((street:tennis) (defaultField:court~0.5))
Which will find a match, since we can expect it to find city:tampa, state:fl, and street:tennis (See this Lucene Query Parser documentation section for another example explaining this behavior of the query parser). Whether it finds a match on court in the default field I have no idea, but it doesn't really need to.
A PhraseQuery is the typical way to string multiple terms (words) together in a Lucene Query (this would look like street:"tennis court" in a parsed query).

Related

Lucene does not index some terms in documents

I have been trying to use Lucene to index our code database. Unfortunately, some terms get omitted from the index. E.g. in the below string, I can search on anything other than "version-number":
version-number "cAELimpts.spl SCOPE-PAY:10.1.10 25nov2013kw101730 Setup EMployee field if missing"
I have tried implementing it with both Lucene.NET 3.1 and pylucene 6.2.0, with the same result.
Here are some details of my implementation in Lucene.NET:
using (var writer = new IndexWriter(FSDirectory.Open(INDEX_DIR), new CustomAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED))
{
Console.Out.WriteLine("Indexing to directory '" + INDEX_DIR + "'...");
IndexDirectory(writer, docDir);
Console.Out.WriteLine("Optimizing...");
writer.Optimize();
writer.Commit();
}
The CustomAnalyzer class:
public sealed class CustomAnalyzer : Analyzer
{
public override TokenStream TokenStream(System.String fieldName, System.IO.TextReader reader)
{
return new LowerCaseFilter(new CustomTokenizer(reader));
}
}
Finally, the CustomTokenizer class:
public class CustomTokenizer : CharTokenizer
{
public CustomTokenizer(TextReader input) : base(input)
{
}
public CustomTokenizer(AttributeFactory factory, TextReader input) : base(factory, input)
{
}
public CustomTokenizer(AttributeSource source, TextReader input) : base(source, input)
{
}
protected override bool IsTokenChar(char c)
{
return System.Char.IsLetterOrDigit(c) || c == '_' || c == '-' ;
}
}
It looks like "version-number" and some other terms are not getting indexed because they are present in 99% of the documents. Can it be the cause of the problem?
EDIT: As requested, the FileDocument class:
public static class FileDocument
{
public static Document Document(FileInfo f)
{
// make a new, empty document
Document doc = new Document();
doc.Add(new Field("path", f.FullName, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("modified", DateTools.TimeToString(f.LastWriteTime.Millisecond, DateTools.Resolution.MINUTE), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("contents", new StreamReader(f.FullName, System.Text.Encoding.Default)));
// return the document
return doc;
}
}
I think I was being an idiot. I was limiting the number of hits to 500 and then applying filters on the found hits. The items were expected to be retrieved in the order they had been indexed. So when I was looking for something at the end of the index, it would tell me that nothing was found. In fact, it would retrieve the expected 500 items but they would all have been filtered out.

RavenDB querying metadata

I want to prevent documents from being deleted in my project and I decided to use metadata to mark document as Archived. I used below code to do that:
public class DeleteDocumentListener : IDocumentDeleteListener
{
public void BeforeDelete(string key, object entityInstance, RavenJObject metadata)
{
metadata.Add("Archived", true);
throw new NotSupportedException();
}
}
After that I wanted to alter query to return only documents which have Archived metadata value set to false:
using (var session = _store.OpenSession())
{
var query = session.Advanced.DocumentQuery<Cutter>()
.WhereEquals("#metadata.Archived", false);
}
Unfortunately this query return empty result set. It occurs that if Document doesn't have this metadata property then above condition is treated as false. It wasn't what I expected.
How can I compose query to return Documents which don't have metadata property or this property has some value ?
You can solve it by creating an index for you Cutter documents and then query against that:
public class ArchivedIndex : AbstractIndexCreationTask<Cutter>
{
public class QueryModel
{
public bool Archived { get; set; }
}
public ArchivedIndex()
{
Map = documents => from doc in documents
select new QueryModel
{
Archived = MetadataFor(doc)["Archived"] != null && MetadataFor(doc).Value<bool>("Archived")
};
}
}
Then query it like this:
using (var session = documentStore.OpenSession())
{
var cutters = session.Query<ArchivedIndex.QueryModel, ArchivedIndex>()
.Where(x => x.Archived == false)
.OfType<Cutter>()
.ToList();
}
Hope this helps!
Quick side note. To create the index, the following code may need to be run:
new ArchivedIndex().Execute(session.Advanced.DocumentStore);

Lucene - How can you modify a Query object's terms?

I want to add new fields to my Lucene-based search engine site, however I want to be able to intercept queries and modify them before I pass them on to the Searcher.
For example each document has the field userid so you can search for documents authored by a particular user by their ID, e.g. foo bar userid:123 however I want to add the ability to search by username.
I'd like to add a field user:RonaldMcDonald to queries (not to documents), however I want to be able to intercept that term and replace it with an equivalent userid:123 term (my own code would be responsible for converting "RonaldMcDonald" to "123").
Here's the simple code I'm using right now:
Int32 get = (pageIndex + 1) * pageSize;
Query query;
try {
query = _queryParser.Parse( queryText );
} catch(ParseException pex) {
log.Add("Could not parse query.");
throw new SearchException( "Could not parse query text.", pex );
}
log.Add("Parsed query.");
TopDocs result = _searcher.Search( query, get );
I've had a look at the Query class, but I can't see any way to retrieve, remove, or insert terms.
You can subclass the QueryParser and override NewTermQuery.
QP qp = new QP("user", new SimpleAnalyzer());
var s = qp.Parse("user:RonaldMcDonald data:[aaa TO bbb]");
Where s is will be userid:123 data:[aaa TO bbb]
public class QP : QueryParser
{
Dictionary<string, string> _dict =
new Dictionary<string, string>(new MyComparer()) {{"RonaldMcDonald","123"} };
public QP(string field, Analyzer analyzer) : base(field, analyzer)
{
}
protected override Query NewTermQuery(Term term)
{
if (term.Field() == "user")
{
//Do your username -> userid mapping
return new TermQuery(new Term("userid", _dict[term.Text()]));
}
return base.NewTermQuery(term);
}
//Case insensitive comparer
class MyComparer : IEqualityComparer<string>
{
public bool Equals(string x, string y)
{
return String.Compare(x, y, true, CultureInfo.InvariantCulture)==0;
}
public int GetHashCode(string obj)
{
return obj.ToLower(CultureInfo.InvariantCulture).GetHashCode();
}
}
}

NHibernate Search and Lucene initial indexing of IList

I have some boring problem of indexing IList using Lucene, and I can not fix.
My entity contains IList which I apply IndexedEmbedded attribute like this:
[ScriptIgnore] //will not serialize
[IndexedEmbedded(Depth = 1, Prefix = "BookAdditionalInfos_"]
public virtual IList<BookAdditionalInfo> BookAdditionalInfos { get; set; }
Also, some other properties used Field attribute for indexing:
[Field(Index.Tokenized, Store = Store.Yes)]
After marking entity for indexing, I have to make initial indexing of 12 millions of rows (using batch processing). And everything works perfect until I start to index IList called BookAdditionalInfos. Without this IndexedEmbedded attribute (or without indexing this IList) everything is OK, and every property mark with Field attribute will be indexed.
I am using Fluent NHibernate.
What can be a problem?
Thank you
EDIT: Also I looked at http://ayende.com/blog/3992/nhibernate-search, but without any results
The problem is: when I try to index IList, indexing taking forever and nothing will be indexed. Without indexing this IList (or without specify IndexedEmbedded to IList) indexing is OK, and I got indexed results.
EDIT (Initial Indexing function):
public void BuildInitialBookSearchIndex()
{
FSDirectory directory = null;
IndexWriter writer = null;
var type = typeof(Book);
var info = new DirectoryInfo(GetIndexDirectory());
//if (info.Exists)
//{
// info.Delete(true);
//}
try
{
directory = FSDirectory.GetDirectory(Path.Combine(info.FullName, type.Name), true);
writer = new IndexWriter(directory, new StandardAnalyzer(), true);
}
finally
{
if (directory != null)
{
directory.Close();
}
if (writer != null)
{
writer.Close();
}
}
var fullTextSession = Search.CreateFullTextSession(Session);
var currentIndex = 0;
const int batchSize = 5000;
while (true)
{
var entities = Session
.CreateCriteria<Book>()
.SetFirstResult(currentIndex)
.SetMaxResults(batchSize)
.List();
using (var tx = Session.BeginTransaction())
{
foreach (var entity in entities)
{
fullTextSession.Index(entity);
}
currentIndex += batchSize;
Session.Flush();
tx.Commit();
Session.Clear();
}
if (entities.Count < batchSize)
break;
}
}
It looks to me like you've got a classic N+1 select problem there - basically when you're selecting your books, you're not also selecting the BookAdditionalInfos, so NHibernate will have to issue a new select for each and every book to retrieve the BookAdditionalInfo's for that book while indexing.
A quick fix would be to change your select to:
var entities = Session
.CreateCriteria<Book>()
.SetFetchMode("BookAdditionalInfos", FetchMode.Eager)
.SetResultTransformer(Transformers.DistinctRootEntity)
.SetFirstResult(currentIndex)
.SetMaxResults(batchSize)
.List();
You'll probably run into additional problems with your paging now however because it will do a join onto the BookAdditionalInfo table giving you multiple rows for the same entity in your result set, so you might want to look at doing something like:
var pagedEntities = DetachedCriteria.For<Book>()
.SetFirstResult(currentIndex)
.SetMaxResults(batchSize)
.SetProjection(Projections.Id());
var entities = Session
.CreateCriteria<Book>()
.Add(Property.ForName("id").In(pagedEntities))
.SetFetchMode("BookAdditionalInfos", FetchMode.Eager)
.SetResultTransformer(Transformers.DistinctRootEntity)
.List();

ROW_NUMBER() and nhibernate - finding an item's page

given a query in the form of an ICriteria object, I would like to use NHibernate (by means of a projection?) to find an element's order,
in a manner equivalent to using
SELECT ROW_NUMBER() OVER (...)
to find a specific item's index in the query.
(I need this for a "jump to page" functionality in paging)
any suggestions?
NOTE: I don't want to go to a page given it's number yet - I know how to do that - I want to get the item's INDEX so I can divide it by page size and get the page index.
After looking at the sources for NHibernate, I'm fairly sure that there exists no such functionality.
I wouldn't mind, however, for someone to prove me wrong.
In my specific setting, I did solve this problem by writing a method that takes a couple of lambdas (representing the key column, and an optional column to filter by - all properties of a specific domain entity). This method then builds the sql and calls session.CreateSQLQuery(...).UniqueResult(); I'm not claiming that this is a general purpose solution.
To avoid the use of magic strings, I borrowed a copy of PropertyHelper<T> from this answer.
Here's the code:
public abstract class RepositoryBase<T> where T : DomainEntityBase
{
public long GetIndexOf<TUnique, TWhere>(T entity, Expression<Func<T, TUnique>> uniqueSelector, Expression<Func<T, TWhere>> whereSelector, TWhere whereValue) where TWhere : DomainEntityBase
{
if (entity == null || entity.Id == Guid.Empty)
{
return -1;
}
var entityType = typeof(T).Name;
var keyField = PropertyHelper<T>.GetProperty(uniqueSelector).Name;
var keyValue = uniqueSelector.Compile()(entity);
var innerWhere = string.Empty;
if (whereSelector != null)
{
// Builds a column name that adheres to our naming conventions!
var filterField = PropertyHelper<T>.GetProperty(whereSelector).Name + "Id";
if (whereValue == null)
{
innerWhere = string.Format(" where [{0}] is null", filterField);
}
else
{
innerWhere = string.Format(" where [{0}] = :filterValue", filterField);
}
}
var innerQuery = string.Format("(select [{0}], row_number() over (order by {0}) as RowNum from [{1}]{2}) X", keyField, entityType, innerWhere);
var outerQuery = string.Format("select RowNum from {0} where {1} = :keyValue", innerQuery, keyField);
var query = _session
.CreateSQLQuery(outerQuery)
.SetParameter("keyValue", keyValue);
if (whereValue != null)
{
query = query.SetParameter("filterValue", whereValue.Id);
}
var sqlRowNumber = query.UniqueResult<long>();
// The row_number() function is one-based. Our index should be zero-based.
sqlRowNumber -= 1;
return sqlRowNumber;
}
public long GetIndexOf<TUnique>(T entity, Expression<Func<T, TUnique>> uniqueSelector)
{
return GetIndexOf(entity, uniqueSelector, null, (DomainEntityBase)null);
}
public long GetIndexOf<TUnique, TWhere>(T entity, Expression<Func<T, TUnique>> uniqueSelector, Expression<Func<T, TWhere>> whereSelector) where TWhere : DomainEntityBase
{
return GetIndexOf(entity, uniqueSelector, whereSelector, whereSelector.Compile()(entity));
}
}
public abstract class DomainEntityBase
{
public virtual Guid Id { get; protected set; }
}
And you use it like so:
...
public class Book : DomainEntityBase
{
public virtual string Title { get; set; }
public virtual Category Category { get; set; }
...
}
public class Category : DomainEntityBase { ... }
public class BookRepository : RepositoryBase<Book> { ... }
...
var repository = new BookRepository();
var book = ... a persisted book ...
// Get the index of the book, sorted by title.
var index = repository.GetIndexOf(book, b => b.Title);
// Get the index of the book, sorted by title and filtered by that book's category.
var indexInCategory = repository.GetIndexOf(book, b => b.Title, b => b.Category);
As I said, this works for me. I'll definitely tweak it as I move forward. YMMV.
Now, if the OP has solved this himself, then I would love to see his solution! :-)
ICriteria has this 2 functions:
SetFirstResult()
and
SetMaxResults()
which transform your SQL statement into using ROW_NUMBER (in sql server) or limit in MySql.
So if you want 25 records on the third page you could use:
.SetFirstResult(2*25)
.SetMaxResults(25)
After trying to find an NHibernate based solution for this myself, I ultimately just added a column to the view I happened to be using:
CREATE VIEW vw_paged AS
SELECT ROW_NUMBER() OVER (ORDER BY Id) AS [Row], p.column1, p.column2
FROM paged_table p
This doesn't really help if you need complex sorting options, but it does work for simple cases.
A Criteria query, of course, would look something like this:
public static IList<Paged> GetRange(string search, int rows)
{
var match = DbSession.Current.CreateCriteria<Job>()
.Add(Restrictions.Like("Id", search + '%'))
.AddOrder(Order.Asc("Id"))
.SetMaxResults(1)
.UniqueResult<Paged>();
if (match == null)
return new List<Paged>();
if (rows == 1)
return new List<Paged> {match};
return DbSession.Current.CreateCriteria<Paged>()
.Add(Restrictions.Like("Id", search + '%'))
.Add(Restrictions.Ge("Row", match.Row))
.AddOrder(Order.Asc("Id"))
.SetMaxResults(rows)
.List<Paged>();
}