I have a problem with RavenDB indexing.
A simple query looks like this:
var values = myCollection.Query.Where(w =>
    w.MyId == MyId &&
    w.IsReady == false &&
    w.IsDeleted &&
    w.Rate > 0);
During execution, Raven creates a dynamic index:
from doc in docs.MyCollection
select new { Rate = doc.Rate, IsReady = doc.IsReady, IsDeleted = doc.IsDeleted, MyId = doc.MyId }
with extra options:
Field -> Rate;
Storage -> No;
Indexing -> Default;
Sort -> Double;
The Rate field is of type decimal.
Problem:
I wanted to add a static index, but when I specified the index like this:
public class MyIndex : AbstractIndexCreationTask<MyCollection>
{
    public MyIndex()
    {
        Map = d => d.Select(s => new { Rate = s.Rate, IsReady = s.IsReady, IsDeleted = s.IsDeleted, MyId = s.MyId });
        Sort(x => x.Rate, SortOptions.Double);
    }
}
Raven creates a slightly different index:
from doc in docs.MyCollection
select new { Rate = (decimal)doc.Rate, IsReady = doc.IsReady, IsDeleted = doc.IsDeleted, MyId = doc.MyId }
with extra options:
Field -> Rate;
Storage -> No;
Indexing -> Default;
Sort -> Double;
The only difference is the cast in the static index, which is there because my field type is decimal while I'm using the Double sort option.
Because of that, Raven does not use my static index and instead creates a dynamic one every time the query is executed.
I tried adding a cast inside Sort(), but then the index was not created at all. One way to work around this is to modify the static index manually in the management console after it has been created, but that's not a good solution.
Any ideas how to deal with that?
Thanks.
Edit:
Another example:
The field is of type DateTime and I query it with DateTime values as predicates (greater than / less than). When creating the dynamic index Raven picks String as the SortOption, and when I try to prepare a static index I run into the same casting issue.
You can use the IDocumentSession.Query(string indexName, [bool isMapReduce]) or the IDocumentSession.Query<TResult, TIndexCreator>() overloads to explicitly specify a static index. So in your specific case, either IDocumentSession.Query<MyCollection, MyIndex>() or IDocumentSession.Query("MyIndex").
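For example, a minimal sketch assuming an IDocumentStore named store and the document/index types from the question (variable names are illustrative):

using (var session = store.OpenSession())
{
    // Query through the static index by its creation-task type...
    var values = session.Query<MyCollection, MyIndex>()
        .Where(w => w.MyId == myId && w.IsReady == false && w.IsDeleted && w.Rate > 0)
        .ToList();

    // ...or refer to the static index by name.
    var valuesByName = session.Query<MyCollection>("MyIndex")
        .Where(w => w.Rate > 0)
        .ToList();
}

Because the query explicitly targets the named index, Raven no longer needs to build a dynamic one for it.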
Related
What is the ordering of the keys in an Ignite cache (without using indexing), and is it possible to do the equivalent of the following RocksDB snippet
try (final RocksIterator rocksIterator =
         rocksDB.newIterator(columnFamilyHandleList.get(1))) {
    for (rocksIterator.seek(prefixKey); rocksIterator.isValid(); rocksIterator.next()) {
        // process rocksIterator.key() / rocksIterator.value()
    }
}
i.e. jump to the next entry starting with a given byte[] or String?
The way you'd do that in Ignite is by using SQL.
var query = new SqlFieldsQuery("select x, y, z from table where z like ? order by x").setArgs("prefix%");
try (var cursor = cache.query(query)) {
    for (var r : cursor) {
        Long id = (Long) r.get(0);
        BigDecimal value = (BigDecimal) r.get(1);
        String name = (String) r.get(2);
    }
}
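For this to work, the cache's value type has to expose SQL-queryable fields. A minimal sketch of how that might be configured (class, field and cache names are illustrative, not from the question):

import org.apache.ignite.cache.query.annotations.QuerySqlField;
import org.apache.ignite.configuration.CacheConfiguration;

public class Row {
    @QuerySqlField(index = true)   // indexed, so "order by x" and range filters on x can use an index
    private long x;

    @QuerySqlField
    private java.math.BigDecimal y;

    @QuerySqlField(index = true)   // indexed, so filters on z can use an index
    private String z;
}

// Register the key/value types so Ignite exposes them as an SQL table:
CacheConfiguration<Long, Row> cfg = new CacheConfiguration<Long, Row>("rows")
        .setIndexedTypes(Long.class, Row.class);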
I use Hibernate Search 5.11 in my Spring Boot 2 application to provide full-text search.
This library requires documents to be indexed.
When my app is launched, I manually re-index the data of an indexed entity (MyEntity.class) every five minutes (for a specific reason, due to my server context).
MyEntity.class has an attachedFiles property, which is a HashSet filled through a @OneToMany() join with lazy loading enabled:
@OneToMany(mappedBy = "myEntity", cascade = CascadeType.ALL, orphanRemoval = true)
private Set<AttachedFile> attachedFiles = new HashSet<>();
I wrote the required indexing process, but an exception is thrown on "fullTextSession.index(result);" whenever the attachedFiles property of a given entity contains one or more items:
org.hibernate.TransientObjectException: The instance was not associated with this session
In debug mode, a message like "Unable to load [...]" appears for the entity's HashSet value in this case.
If the HashSet is empty (not null, just empty), no exception is thrown.
My indexing method:
private void indexDocumentsByEntityIds(List<Long> ids) {
    final int BATCH_SIZE = 128;
    Session session = entityManager.unwrap(Session.class);
    FullTextSession fullTextSession = Search.getFullTextSession(session);
    fullTextSession.setFlushMode(FlushMode.MANUAL);
    fullTextSession.setCacheMode(CacheMode.IGNORE);

    CriteriaBuilder builder = session.getCriteriaBuilder();
    CriteriaQuery<MyEntity> criteria = builder.createQuery(MyEntity.class);
    Root<MyEntity> root = criteria.from(MyEntity.class);
    criteria.select(root).where(root.get("id").in(ids));
    TypedQuery<MyEntity> query = fullTextSession.createQuery(criteria);
    List<MyEntity> results = query.getResultList();

    int index = 0;
    for (MyEntity result : results) {
        index++;
        try {
            fullTextSession.index(result); // index each element
            if (index % BATCH_SIZE == 0 || index == ids.size()) {
                fullTextSession.flushToIndexes(); // apply changes to indexes
                fullTextSession.clear(); // free memory since the queue is processed
            }
        } catch (TransientObjectException toEx) {
            LOGGER.info(toEx.getMessage());
            throw toEx;
        }
    }
}
Does anyone have an idea?
Thanks!
This is probably caused by the "clear" call you have in your loop.
In essence, what you're doing is:
load all entities to reindex into the session
index one batch of entities
remove all entities from the session (fullTextSession.clear())
try to index the next batch of entities, even though they are not in the session anymore... ?
What you need to do is to only load each batch of entities after the session clearing, so that you're sure they are still in the session when you index them.
There's an example of how to do this in the documentation, using a scroll and an appropriate batch size: https://docs.jboss.org/hibernate/search/5.11/reference/en-US/html_single/#search-batchindex-flushtoindexes
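Roughly, the scroll-based pattern from that documentation looks like this (a sketch adapted to the entity from the question, not a drop-in replacement):

fullTextSession.setFlushMode(FlushMode.MANUAL);
fullTextSession.setCacheMode(CacheMode.IGNORE);
Transaction transaction = fullTextSession.beginTransaction();
// Scrollable results avoid loading too many objects into memory at once
ScrollableResults results = fullTextSession.createCriteria(MyEntity.class)
    .setFetchSize(BATCH_SIZE)
    .scroll(ScrollMode.FORWARD_ONLY);
int index = 0;
while (results.next()) {
    index++;
    fullTextSession.index(results.get(0)); // index each element
    if (index % BATCH_SIZE == 0) {
        fullTextSession.flushToIndexes(); // apply changes to indexes
        fullTextSession.clear();          // free memory since the queue is processed
    }
}
transaction.commit();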
Alternatively, you can just split your ID list in smaller lists of 128 elements, and for each of these lists, run a query to get the corresponding entities, reindex all these 128 entities, then flush and clear.
Thanks for the explanations @yrodiere, they helped me a lot!
I chose your alternative solution:
Alternatively, you can just split your ID list in smaller lists of 128 elements, and for each of these lists, run a query to get the corresponding entities, reindex all these 128 entities, then flush and clear.
...and everything works perfectly !
Well spotted!
See the code solution below:
private List<List<Object>> splitList(List<Object> list, int subListSize) {
    List<List<Object>> splittedList = new ArrayList<>();
    if (!CollectionUtils.isEmpty(list)) {
        int i = 0;
        int nbItems = list.size();
        while (i < nbItems) {
            int maxLastSubListIndex = i + subListSize;
            int lastSubListIndex = (maxLastSubListIndex > nbItems) ? nbItems : maxLastSubListIndex;
            List<Object> subList = list.subList(i, lastSubListIndex);
            splittedList.add(subList);
            i = lastSubListIndex;
        }
    }
    return splittedList;
}
private void indexDocumentsByEntityIds(Class<Object> clazz, String entityIdPropertyName, List<Object> ids) {
    Session session = entityManager.unwrap(Session.class);
    List<List<Object>> splittedIdsLists = splitList(ids, 128);

    for (List<Object> splittedIds : splittedIdsLists) {
        FullTextSession fullTextSession = Search.getFullTextSession(session);
        fullTextSession.setFlushMode(FlushMode.MANUAL);
        fullTextSession.setCacheMode(CacheMode.IGNORE);
        Transaction transaction = fullTextSession.beginTransaction();

        CriteriaBuilder builder = session.getCriteriaBuilder();
        CriteriaQuery<Object> criteria = builder.createQuery(clazz);
        Root<Object> root = criteria.from(clazz);
        criteria.select(root).where(root.get(entityIdPropertyName).in(splittedIds));
        TypedQuery<Object> query = fullTextSession.createQuery(criteria);
        List<Object> results = query.getResultList();

        int index = 0;
        for (Object result : results) {
            index++;
            try {
                fullTextSession.index(result); // index each element
                if (index == splittedIds.size()) {
                    fullTextSession.flushToIndexes(); // apply changes to indexes
                    fullTextSession.clear(); // free memory since the queue is processed
                }
            } catch (TransientObjectException toEx) {
                LOGGER.info(toEx.getMessage());
                throw toEx;
            }
        }
        transaction.commit();
    }
}
Kotlin has some pretty cool functions for collections. However, I have come across a problem in which the solution is not apparent to me.
I have a List of Objects. Those Objects have an ID field which coincides with a SQLite database. SQL operations are performed on the database, and a new list is generated. How can the index of an item from the new list be found based on the "ID" field (or any other field for that matter)?
The Collection.find{} function returns the object, but not the index.
indexOfFirst can find the index of the first element of a collection that satisfies a specified predicate.
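For example, assuming the objects expose an id property and you are looking for targetId (both names are illustrative):

val index = newList.indexOfFirst { it.id == targetId } // returns -1 if no element matches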
We have an SQLite DB that a call is made to in order to retrieve parentList. We can access the items in the ArrayList with this type of code:
fun onDoIt(view: View) {
    initDB()
    for (t in 0..X - 1) {
        var N: String = parentList[t].dept
        // NOTE two syntaxes here: [t] and get(t)
        if (t == 1) {
            var B: String = parentList[0].idD.toString()
            println("$$$$$$$$$$$$$$$$$$$$$ ====== " + B)
        }
        var I: String = parentList.get(t).idD.toString()
        println("################### id " + I + " for " + N)
    }
}

private fun initDB() {
    parentList = db.querySPDept()
    if (parentList.isEmpty()) {
        title = "No Records in DB"
    } else {
        X = parentList.size
        println("**************************************** SIZE " + X)
        title = "SP View Activity"
    }
}
I know variants of this question have been asked before (even by me), but I still don't understand a thing or two about this...
It was my understanding that one could retrieve more documents than the 128 default setting by doing this:
session.Advanced.MaxNumberOfRequestsPerSession = int.MaxValue;
And I've learned that a WHERE clause should be an ExpressionTree instead of a Func, so that it's treated as Queryable instead of Enumerable. So I thought this should work:
public static List<T> GetObjectList<T>(Expression<Func<T, bool>> whereClause)
{
    using (IDocumentSession session = GetRavenSession())
    {
        return session.Query<T>().Where(whereClause).ToList();
    }
}
However, that only returns 128 documents. Why?
Note, here is the code that calls the above method:
RavenDataAccessComponent.GetObjectList<Ccm>(x => x.TimeStamp > lastReadTime);
If I add Take(n), then I can get as many documents as I like. For example, this returns 200 documents:
return session.Query<T>().Where(whereClause).Take(200).ToList();
Based on all of this, it would seem that the appropriate way to retrieve thousands of documents is to set MaxNumberOfRequestsPerSession and use Take() in the query. Is that right? If not, how should it be done?
For my app, I need to retrieve thousands of documents (that have very little data in them). We keep these documents in memory and use them as the data source for charts.
** EDIT **
I tried using int.MaxValue in my Take():
return session.Query<T>().Where(whereClause).Take(int.MaxValue).ToList();
And that returns 1024. Argh. How do I get more than 1024?
** EDIT 2 - Sample document showing data **
{
    "Header_ID": 3525880,
    "Sub_ID": "120403261139",
    "TimeStamp": "2012-04-05T15:14:13.9870000",
    "Equipment_ID": "PBG11A-CCM",
    "AverageAbsorber1": "284.451",
    "AverageAbsorber2": "108.442",
    "AverageAbsorber3": "886.523",
    "AverageAbsorber4": "176.773"
}
It is worth noting that since version 2.5, RavenDB has an "unbounded results API" to allow streaming. The example from the docs shows how to use this:
var query = session.Query<User>("Users/ByActive").Where(x => x.Active);

using (var enumerator = session.Advanced.Stream(query))
{
    while (enumerator.MoveNext())
    {
        User activeUser = enumerator.Current.Document;
    }
}
There is support for standard RavenDB queries and Lucene queries, and there is also async support.
The documentation can be found here. Ayende's introductory blog article can be found here.
The Take(n) function will only give you up to 1024 by default. However, you can change this default in Raven.Server.exe.config:
<add key="Raven/MaxPageSize" value="5000"/>
For more info, see: http://ravendb.net/docs/intro/safe-by-default
The Take(n) function will only give you up to 1024 by default. However, you can use it in combination with Skip(n) to get them all:
RavenQueryStatistics stats; // declared so Statistics(out stats) can report skipped results
var points = new List<T>();
var nextGroupOfPoints = new List<T>();
const int ElementTakeCount = 1024;
int i = 0;
int skipResults = 0;

do
{
    nextGroupOfPoints = session.Query<T>()
        .Statistics(out stats)
        .Where(whereClause)
        .Skip(i * ElementTakeCount + skipResults)
        .Take(ElementTakeCount)
        .ToList();
    i++;
    skipResults += stats.SkippedResults;
    points = points.Concat(nextGroupOfPoints).ToList();
}
while (nextGroupOfPoints.Count == ElementTakeCount);

return points;
RavenDB Paging
The number of requests per session is a separate concept from the number of documents retrieved per call. Sessions are short-lived and are expected to have only a few calls issued over them.
If you are getting more than 10 of anything from the store for human consumption (even less than the default 128), then something is wrong, or your problem requires different thinking than a truckload of documents coming from the data store.
RavenDB indexing is quite sophisticated. There are good articles about indexing here and facets here.
If you need to perform data aggregation, create a map/reduce index that produces aggregated data, e.g.:
Index:

// Map
from post in docs.Posts
select new { post.Author, Count = 1 }

// Reduce
from result in results
group result by result.Author into g
select new
{
    Author = g.Key,
    Count = g.Sum(x => x.Count)
}
Query:
session.Query<AuthorPostStats>("Posts/ByUser/Count").Where(x => x.Author == author).ToList();
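Defined through the client API, such an index could look roughly like this (a sketch; the Posts_ByUser_Count class name and the AuthorPostStats result type are assumed):

public class Posts_ByUser_Count : AbstractIndexCreationTask<Post, AuthorPostStats>
{
    public Posts_ByUser_Count()
    {
        // Map: emit one entry per post with a count of 1
        Map = posts => from post in posts
                       select new { post.Author, Count = 1 };

        // Reduce: sum the counts per author
        Reduce = results => from result in results
                            group result by result.Author into g
                            select new { Author = g.Key, Count = g.Sum(x => x.Count) };
    }
}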
You can also use a predefined index with the Stream method. You may use a Where clause on indexed fields.
var query = session.Query<User, MyUserIndex>();
// or, filtering on an indexed field:
var query = session.Query<User, MyUserIndex>().Where(x => !x.IsDeleted);

using (var enumerator = session.Advanced.Stream<User>(query))
{
    while (enumerator.MoveNext())
    {
        var user = enumerator.Current.Document;
        // do something
    }
}
Example index:
public class MyUserIndex : AbstractIndexCreationTask<User>
{
    public MyUserIndex()
    {
        this.Map = users =>
            from u in users
            select new
            {
                u.IsDeleted,
                u.Username,
            };
    }
}
Documentation: What are indexes?
Session: Querying: How to stream query results?
Important note: the Stream method will NOT track objects. If you change objects obtained from this method, SaveChanges() will not be aware of any change.
Other note: you may get the following exception if you do not specify the index to use.
InvalidOperationException: StreamQuery does not support querying dynamic indexes. It is designed to be used with large data-sets and is unlikely to return all data-set after 15 sec of indexing, like Query() does.
I need some Hibernate/SQL help, please. I'm trying to generate a report against an accounting database. A commission order can have multiple account entries against it.
class CommissionOrderDAO {
    int id
    String purchaseOrder
    double bookedAmount
    Date customerInvoicedDate
    String state

    static hasMany = [accountEntries: AccountEntryDAO]
    SortedSet accountEntries

    static mapping = {
        version false
        cache usage: 'read-only'
        table 'commission_order'
        id column: 'id', type: 'integer'
        purchaseOrder column: 'externalId'
        bookedAmount column: 'bookedAmount'
        customerInvoicedDate column: 'customerInvoicedDate'
        state column: 'state'
        accountEntries sort: 'id', order: 'desc'
    }
    ...
}
class AccountEntryDAO implements Comparable<AccountEntryDAO> {
    int id
    Date eventDate
    CommissionOrderDAO commissionOrder
    String entryType
    String description
    double remainingPotentialCommission

    static belongsTo = [commissionOrder: CommissionOrderDAO]

    static mapping = {
        version false
        cache usage: 'read-only'
        table 'account_entry'
        id column: 'id', type: 'integer'
        eventDate column: 'eventDate'
        commissionOrder column: 'commissionOrder'
        entryType column: 'entryType'
        description column: 'description'
        remainingPotentialCommission formula: SQLFormulaUtils.AccountEntrySQL.REMAININGPOTENTIALCOMMISSION_FORMULA
    }
    ....
}
The criteria for the report are that commissionOrder.state == open and commissionOrder.customerInvoicedDate is not null, and the account entries in the report should be between startDate and endDate and have remainingPotentialCommission > 0.
I'm looking to display information mainly on the CommissionOrder (and to display the account entries on that commission order between the dates), but when I use the following projection:
def results = accountEntryCriteria.list {
    projections {
        like("entryType", "comm%")
        ge("eventDate", beginDate)
        le("eventDate", endDate)
        gt("remainingPotentialCommission", 0.0099d)
        and {
            commissionOrder {
                eq("state", "open")
                isNotNull("customerInvoicedDate")
            }
        }
    }
    order("id", "asc")
}
I get the correct accountEntries with the proper commissionOrders, but I'm coming at it backwards: I have loads of accountEntries that can reference the same commissionOrder. But when I look at the commissionOrders I've retrieved, each one has ALL of its accountEntries, not just the accountEntries between the dates.
I then loop through the results, get the commissionOrder from the accountEntries list, and remove accountEntries on that commissionOrder after the end date to get the "snapshot" in time that I need.
def getCommissionOrderListByRemainingPotentialCommissionFromResults(results, endDate) {
    log.debug("begin getCommissionOrderListByRemainingPotentialCommissionFromResults")
    int count = 0;
    List<CommissionOrderDAO> commissionOrderList = new ArrayList<CommissionOrderDAO>()
    if (results) {
        CommissionOrderDAO[] commissionOrderArray = new CommissionOrderDAO[results?.size()];
        Set<CommissionOrderDAO> coDuplicateCheck = new TreeSet<CommissionOrderDAO>()
        for (ae in results) {
            if (!coDuplicateCheck.contains(ae?.commissionOrder?.purchaseOrder) && ae?.remainingPotentialCommission > 0.0099d) {
                CommissionOrderDAO co = ae?.commissionOrder
                CommissionOrderDAO culledCO = removeAccountEntriesPastDate(co, endDate)
                def lastAccountEntry = culledCO?.accountEntries?.last()
                if (lastAccountEntry?.remainingPotentialCommission > 0.0099d) {
                    commissionOrderArray[count++] = culledCO
                }
                coDuplicateCheck.add(ae?.commissionOrder?.purchaseOrder)
            }
        }
        log.debug("Count after clean is ${count}")
        if (count > 0) {
            commissionOrderList = Arrays.asList(ArrayUtils.subarray(commissionOrderArray, 0, count))
            log.debug("commissionOrderList size = ${commissionOrderList?.size()}")
        }
    }
    log.debug("end getCommissionOrderListByRemainingPotentialCommissionFromResults")
    return commissionOrderList
}
Please don't think I'm under the impression that this isn't a Charlie Foxtrot. The query itself doesn't take very long, but the cull process takes over 35 minutes. Right now, it's "manageable" because I only have to run the report once a month.
I need to let the database handle this processing (I think), but I couldn't figure out how to get Hibernate to produce the results I want. How can I change my criteria?
Try to narrow down the bottleneck of that process. If you have a lot of data, then this check could be time-expensive:
coDuplicateCheck.contains(ae?.commissionOrder?.purchaseOrder)
A contains check on a Set can have O(n) complexity depending on the implementation. You could use e.g. a Map to store the keys you need to check, and then look up "ae?.commissionOrder?.purchaseOrder" as a key in that map.
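A rough sketch of that idea, keying the duplicate check on purchaseOrder (a HashSet of keys behaves like the suggested Map here; names are illustrative):

Set<String> seenPurchaseOrders = new HashSet<String>() // cheap add/contains on the String key
for (ae in results) {
    String po = ae?.commissionOrder?.purchaseOrder
    if (po != null && !seenPurchaseOrders.contains(po) && ae?.remainingPotentialCommission > 0.0099d) {
        // ... existing culling logic for ae.commissionOrder ...
        seenPurchaseOrders.add(po)
    }
}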
The second thought is that when you're getting ae?.commissionOrder?.purchaseOrder, it may always be loaded from the DB by the lazy-loading mechanism. Try turning on query logging and check that you don't have dozens of queries inside this processing function.
Finally, again, I would suggest narrowing down where the most expensive, time-wasting part is.
This plugin may be helpful.