Integer index behavior in neo4j + lucene

I want to index an integer field on nodes in a large database (~20 million nodes). The behavior I would expect to work is as follows:
HashMap<String, Object> properties = new HashMap<String, Object>();
// Add node to graph (inserter is an instance of BatchInserter)
properties.put("id", 1000);
long node = inserter.createNode(properties);
// Add index
index.add(node, properties);
// Retrieve node by index. RETURNS NULL!
IndexHits<Node> hits = index.query("id", 1000);
I add the key-value pair to the index and then query by it. Sadly, this doesn't work.
My current hackish workaround is to index the value as a Lucene numeric field (via ValueContext) and query by range:
// Add index
properties.put("id", ValueContext.numeric(1000).indexNumeric());
index.add(node, properties);
// Retrieve node by index. This still returns null
IndexHits<Node> hits = index.query("id", 1000);
// However, this version works
hits = index.query(QueryContext.numericRange("id", 1000, 1000, true, true));
This is perfectly functional, but the range query is really dumb. Is there a way to run exact integer queries without this QueryContext.numericRange mess?

The index lookup you require is an exact match, not a query.
Try replacing
index.query("id", 1000)
with
index.get("id", 1000)
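For reference, a minimal sketch of that lookup with the batch inserter index, assuming the value was indexed as a plain Integer (as in the first snippet); inserter and index are the objects from the question, and the types come from Neo4j's batch-insertion API (exact package names vary by version):
Map<String, Object> properties = new HashMap<String, Object>();
properties.put("id", 1000);
long node = inserter.createNode(properties);
index.add(node, properties);
index.flush(); // make the batch index visible to reads before looking anything up

// get() is an exact key/value lookup; query() goes through Lucene query parsing
IndexHits<Long> hits = index.get("id", 1000);
Long found = hits.getSingle(); // the node id created above, if the lookup matches
hits.close();
If the value was wrapped with ValueContext.numeric(...).indexNumeric() instead, the numeric-range form from the question may still be needed for retrieval.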

Related

ZF2 Lucene query returns nothing - same query in Luke returns expected result

In ZF2 I am using the following code to generate a Lucene query:
$query = new Lucene\Search\Query\Boolean();
$term = new Lucene\Index\Term($search->type, 'type');
$subqueryTerm = new Lucene\Search\Query\Term($term);
$query->addSubquery($subqueryTerm, true);
$term = new Lucene\Index\Term('[+<= ' . $search->purchase_value . ']', 'min_price');
$subqueryTerm = new Lucene\Search\Query\Term($term);
$query->addSubquery($subqueryTerm, true); // required
$term = new Lucene\Index\Term('[' . $search->purchase_value . ' >=]', 'max_price');
$subqueryTerm = new Lucene\Search\Query\Term($term);
$query->addSubquery($subqueryTerm, true); // required
which produces the following query:
+(type:Buying) +(min_price:[+<= 160.00]) +(max_price:[160.00 >=])
When I run this query in ZF2 ($hits = $index->find($query);) I get an empty array returned; however, when I use Luke to run the query manually against the index, it returns the result I am expecting.
What do I need to change in my code to make it return the same results as Luke?
I am using the default analyser for both systems:
Luke: org.apache.lucene.analysis.KeywordAnalyzer
ZF2: (which I think is) \ZendSearch\Lucene\Analysis\Analyzer\Common\Text
Do I need to use a different QueryParser?
You are searching for the literal term "[+<= 160.00]" in the field min_price; you are not creating a range query. I don't quite get what you mean by using a different QueryParser: you aren't using a QueryParser at all, you are using the API to construct your queries. You need to use a range search:
$term = new Lucene\Index\Term($search->purchase_value, 'min_price');
$subqueryRange = new Lucene\Search\Query\Range(null, $term, true);
$query->addSubquery($subqueryRange, true);
$term = new Lucene\Index\Term($search->purchase_value, 'max_price');
$subqueryRange = new Lucene\Search\Query\Range($term, null, true);
$query->addSubquery($subqueryRange, true);
I'm assuming here, by the way, that "+<=" is meant to be query syntax for "less than or equal to". I'm not familiar with anything in Lucene that supports that sort of syntax. An open-ended range, as shown, would be the correct thing to use.

High performance unique document id retrieval

Currently I am working on a high-performance NRT system using Lucene 4.9.0 on the Java platform, which detects near-duplicate text documents.
For this purpose I query Lucene to return a set of matching candidates and do the near-duplicate calculation locally (by retrieving and caching term vectors). But my main concern is the performance cost of mapping Lucene's docId (which can change) to my own unique and immutable document id stored within the index.
My flow is as follows:
query for documents in Lucene
for each document:
fetch my unique document id based on Lucene docId
get the term vector from the cache for my document id (if it doesn't exist, fetch it from Lucene and populate the cache)
do maths...
My major bottleneck is the "fetch my unique document id" step, which introduces a huge performance degradation (especially since I sometimes have to do the calculation for, say, 40,000 term vectors in a single loop).
try {
Document document = indexReader.document(id);
return document.getField(ID_FIELD_NAME).numericValue().intValue();
} catch (IOException e) {
throw new IndexException(e);
}
Possible solutions I was considering were:
trying Zoie, which handles unique and persistent doc identifiers,
using FieldCache (still very inefficient),
using payloads (following http://invertedindex.blogspot.com/2009/04/lucene-dociduid-mapping-and-payload.html), but I have no idea how to apply them.
Any other suggestions?
I have figured out how to partially solve the issue by taking advantage of Lucene's AtomicReader. For this purpose I use a global cache that keeps the per-segment FieldCache instances that have already been created.
Map<Object, FieldCache.Ints> fieldCacheMap = new HashMap<Object, FieldCache.Ints>();
In my method I use the following piece of code:
Set<Object> usedReaderSet = new HashSet<Object>(); // cache keys of the segments touched by this call
Query query = new TermQuery(new Term(FIELD_NAME, fieldValue));
IndexReader indexReader = DirectoryReader.open(indexWriter, true);
List<AtomicReaderContext> leaves = indexReader.getContext().leaves();
// process each segment separately
for (AtomicReaderContext leaf : leaves) {
AtomicReader reader = leaf.reader();
FieldCache.Ints fieldCache;
Object fieldCacheKey = reader.getCoreCacheKey();
synchronized (fieldCacheMap) {
fieldCache = fieldCacheMap.get(fieldCacheKey);
if (fieldCache == null) {
fieldCache = FieldCache.DEFAULT.getInts(reader, ID_FIELD_NAME, true);
fieldCacheMap.put(fieldCacheKey, fieldCache);
}
usedReaderSet.add(fieldCacheKey);
}
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);
ScoreDoc[] scoreDocs = topDocs.scoreDocs;
for (int i = 0; i < scoreDocs.length; i++) {
int docID = scoreDocs[i].doc;
int offerId = fieldCache.get(docID);
// do your processing here
}
}
// remove unused entries in cache set
synchronized(fieldCacheMap) {
Set<Object> inCacheSet = fieldCacheMap.keySet();
Set<Object> toRemove = new HashSet<Object>();
for(Object inCache : inCacheSet) {
if(!usedReaderSet.contains(inCache)) {
toRemove.add(inCache);
}
}
for(Object subject : toRemove) {
fieldCacheMap.remove(subject);
}
}
indexReader.close();
It works pretty fast. My main concern is memory usage, which can be really high when using a large index.

Proper Way to Retrieve More than 128 Documents with RavenDB

I know variants of this question have been asked before (even by me), but I still don't understand a thing or two about this...
It was my understanding that one could retrieve more documents than the 128 default setting by doing this:
session.Advanced.MaxNumberOfRequestsPerSession = int.MaxValue;
And I've learned that a where clause should be an expression tree (Expression<Func<T, bool>>) instead of a Func, so that the query is treated as IQueryable instead of IEnumerable. So I thought this should work:
public static List<T> GetObjectList<T>(Expression<Func<T, bool>> whereClause)
{
using (IDocumentSession session = GetRavenSession())
{
return session.Query<T>().Where(whereClause).ToList();
}
}
However, that only returns 128 documents. Why?
Note, here is the code that calls the above method:
RavenDataAccessComponent.GetObjectList<Ccm>(x => x.TimeStamp > lastReadTime);
If I add Take(n), then I can get as many documents as I like. For example, this returns 200 documents:
return session.Query<T>().Where(whereClause).Take(200).ToList();
Based on all of this, it would seem that the appropriate way to retrieve thousands of documents is to set MaxNumberOfRequestsPerSession and use Take() in the query. Is that right? If not, how should it be done?
For my app, I need to retrieve thousands of documents (that have very little data in them). We keep these documents in memory and use them as the data source for charts.
** EDIT **
I tried using int.MaxValue in my Take():
return session.Query<T>().Where(whereClause).Take(int.MaxValue).ToList();
And that returns 1024. Argh. How do I get more than 1024?
** EDIT 2 - Sample document showing data **
{
"Header_ID": 3525880,
"Sub_ID": "120403261139",
"TimeStamp": "2012-04-05T15:14:13.9870000",
"Equipment_ID": "PBG11A-CCM",
"AverageAbsorber1": "284.451",
"AverageAbsorber2": "108.442",
"AverageAbsorber3": "886.523",
"AverageAbsorber4": "176.773"
}
It is worth noting that since version 2.5, RavenDB has an "unbounded results API" to allow streaming. The example from the docs shows how to use this:
var query = session.Query<User>("Users/ByActive").Where(x => x.Active);
using (var enumerator = session.Advanced.Stream(query))
{
while (enumerator.MoveNext())
{
User activeUser = enumerator.Current.Document;
}
}
There is support for standard RavenDB queries and Lucene queries, and there is also async support.
The documentation can be found here. Ayende's introductory blog article can be found here.
The Take(n) function will only give you up to 1024 by default. However, you can change this default in Raven.Server.exe.config:
<add key="Raven/MaxPageSize" value="5000"/>
For more info, see: http://ravendb.net/docs/intro/safe-by-default
The Take(n) function will only give you up to 1024 results by default. However, you can use it together with Skip(n) to retrieve everything:
var points = new List<T>();
var nextGroupOfPoints = new List<T>();
const int ElementTakeCount = 1024;
int i = 0;
int skipResults = 0;
RavenQueryStatistics stats; // populated by .Statistics(out stats) in the query below
do
{
nextGroupOfPoints = session.Query<T>().Statistics(out stats).Where(whereClause).Skip(i * ElementTakeCount + skipResults).Take(ElementTakeCount).ToList();
i++;
skipResults += stats.SkippedResults;
points = points.Concat(nextGroupOfPoints).ToList();
}
while (nextGroupOfPoints.Count == ElementTakeCount);
return points;
RavenDB Paging
The number of requests per session is a separate concept from the number of documents retrieved per call. Sessions are short-lived and are expected to have only a few calls issued over them.
If you are getting more than 10 of anything from the store for human consumption (even well below the default 128), then something is wrong, or your problem requires different thinking than hauling a truckload of documents out of the data store.
RavenDB indexing is quite sophisticated; there are good articles about indexing and about facets.
If you need to perform data aggregation, create a map/reduce index that produces the aggregated data, e.g.:
Index:
Map:
from post in docs.Posts
select new { post.Author, Count = 1 }
Reduce:
from result in results
group result by result.Author into g
select new
{
Author = g.Key,
Count = g.Sum(x => x.Count)
}
Query:
session.Query<AuthorPostStats>("Posts/ByUser/Count")(x=>x.Author)();
You can also use a predefined index with the Stream method. You may use a Where clause on indexed fields.
var query = session.Query<User, MyUserIndex>();
var query = session.Query<User, MyUserIndex>().Where(x => !x.IsDeleted);
using (var enumerator = session.Advanced.Stream<User>(query))
{
while (enumerator.MoveNext())
{
var user = enumerator.Current.Document;
// do something
}
}
Example index:
public class MyUserIndex: AbstractIndexCreationTask<User>
{
public MyUserIndex()
{
this.Map = users =>
from u in users
select new
{
u.IsDeleted,
u.Username,
};
}
}
Documentation: What are indexes?
Session : Querying : How to stream query results?
Important note: the Stream method will NOT track objects. If you change objects obtained from this method, SaveChanges() will not be aware of any change.
Other note: you may get the following exception if you do not specify the index to use.
InvalidOperationException: StreamQuery does not support querying dynamic indexes. It is designed to be used with large data-sets and is unlikely to return all data-set after 15 sec of indexing, like Query() does.

How to score a small set of docs in Lucene

I would like to compute the scores for a small number of documents (rather than for the entire collection) for a given query. My attempt, as follows, returns 0 scores for each document, even though the queries I test with were derived from the terms in the documents I am trying to score. I am using Lucene 3.0.3.
List<Float> score(IndexReader reader, Query query, List<Integer> newDocs ) {
List<Float> scores = new ArrayList<Float>();
IndexSearcher searcher = new IndexSearcher(reader);
Collector collector = TopScoreDocCollector.create(newDocs.size(), true);
Weight weight = query.createWeight(searcher);
Scorer scorer = weight.scorer(reader, true, true);
collector.setScorer(scorer);
float score = 0.0f;
for(Integer d: newDocs) {
scorer.advance(d);
collector.collect(d);
score = scorer.score();
System.out.println( "doc: " + d + "; score=" + score);
scores.add( new Float(score) );
}
return scores;
}
I am obviously missing something in the setup of scoring, but I cannot figure out from the Lucene source code what that might be.
Thanks in advance,
Gene
Use a filter, and do a search with that filter. Then just iterate through the results as you would with a normal search - Lucene will handle the filtering.
In general, if you are looking at DocIds, you're probably using a lower-level API than you need to, and it's going to give you trouble.
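A minimal sketch of that filter-based approach against the Lucene 3.0.3 API used in the question; the doc IDs are assumed to be the same Lucene docIds the caller already collected, and the helper class name and structure are illustrative, not from the answer:
import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.*;
import org.apache.lucene.util.OpenBitSet;

class SubsetScorer {
    // Score only the given documents by filtering the search down to them.
    static TopDocs score(IndexReader reader, Query query, final List<Integer> docIds) throws IOException {
        Filter subsetFilter = new Filter() {
            @Override
            public DocIdSet getDocIdSet(IndexReader r) {
                OpenBitSet bits = new OpenBitSet(r.maxDoc());
                for (int doc : docIds) {
                    bits.set(doc);
                }
                return bits; // OpenBitSet is itself a DocIdSet in Lucene 3.x
            }
        };
        IndexSearcher searcher = new IndexSearcher(reader);
        // The returned TopDocs carries real query scores, computed only for the filtered docs.
        return searcher.search(query, subsetFilter, Math.max(1, docIds.size()));
    }
}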

figuring out reason for maxClauseCount is set to 1024 error

I have two search indexes: TestIndex (used in our test environment) and ProdIndex (used in our PRODUCTION environment).
The Lucene search query +date:[20090410184806 TO 20091007184806] works fine against the test index but gives this error message for the prod index:
"maxClauseCount is set to 1024"
If I execute the following just before running the search query, I do not get this error:
BooleanQuery.SetMaxClauseCount(Int16.MaxValue);
searcher.Search(myQuery, collector);
Am I missing something here? Why am I not getting this error for the test index? The schema for the two indexes is the same; they only differ in the number of records. The prod index has more records (around 1300) than the test one (around 950).
The range query essentially gets transformed to a boolean query with one clause for every possible value, ORed together.
For example, the query +price:[10 TO 13] is transformed into the boolean query
+(price:10 price:11 price:12 price:13)
assuming all the values 10-13 exist in the index.
Presumably all of your 1300 values fall in the range you have given, so the boolean query has 1300 clauses, which exceeds the default limit of 1024. In the test index the limit is not reached because there are only about 950 values.
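As a hedged illustration of that expansion (not from the answers, and using the Java Lucene 3.x classes in org.apache.lucene.search rather than Lucene.Net): forcing the scoring boolean rewrite and rewriting the range query shows the clause count directly, and the rewrite throws BooleanQuery.TooManyClauses once the expansion passes the limit. indexReader is assumed to be an open reader on the prod index.
// Illustrative only (Lucene 3.x Java API): see how many clauses the date range expands to.
TermRangeQuery range = new TermRangeQuery("date", "20090410184806", "20091007184806", true, true);
range.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
// rewrite() throws BooleanQuery.TooManyClauses when the expansion exceeds maxClauseCount
Query rewritten = range.rewrite(indexReader);
if (rewritten instanceof BooleanQuery) {
    System.out.println("clauses: " + ((BooleanQuery) rewritten).getClauses().length);
}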
I had the same problem. My solution was to catch BooleanQuery.TooManyClauses and dynamically increase maxClauseCount.
Here is some code that is similar to what I have in production.
private static Hits searchIndex(Searcher searcher, Query query) throws IOException
{
boolean retry = true;
while (retry)
{
try
{
retry = false;
Hits myHits = searcher.search(query);
return myHits;
}
catch (BooleanQuery.TooManyClauses e)
{
// Double the number of boolean queries allowed.
// The default is in org.apache.lucene.search.BooleanQuery and is 1024.
String defaultQueries = Integer.toString(BooleanQuery.getMaxClauseCount());
int oldQueries = Integer.parseInt(System.getProperty("org.apache.lucene.maxClauseCount", defaultQueries));
int newQueries = oldQueries * 2;
log.error("Too many clauses for query: limit was " + oldQueries + ". Increasing to " + newQueries, e);
System.setProperty("org.apache.lucene.maxClauseCount", Integer.toString(newQueries));
BooleanQuery.setMaxClauseCount(newQueries);
retry = true;
}
}
// The loop only exits through the return above, but the compiler still requires a return here.
return null;
}
I had this same issue in C# code running under the Sitecore web content management system. I used Randy's answer above, but was not able to use the Java system property get/set calls from C#. Instead I retrieved the current count, incremented it, and set it back. Worked great!
catch (BooleanQuery.TooManyClauses e)
{
// Increment the number of boolean queries allowed.
// The default is 1024.
var currMaxClause = BooleanQuery.GetMaxClauseCount();
var newMaxClause = currMaxClause + 1024;
BooleanQuery.SetMaxClauseCount(newMaxClause);
retry = true;
}
Add this code:
using Lucene.Net.Search;
BooleanQuery.SetMaxClauseCount(2048);
Just put BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE); and that's it.