Proper Way to Retrieve More than 128 Documents with RavenDB - ravendb

I know variants of this question have been asked before (even by me), but I still don't understand a thing or two about this...
It was my understanding that one could retrieve more documents than the 128 default setting by doing this:
session.Advanced.MaxNumberOfRequestsPerSession = int.MaxValue;
And I've learned that a WHERE clause should be an ExpressionTree instead of a Func, so that it's treated as Queryable instead of Enumerable. So I thought this should work:
public static List<T> GetObjectList<T>(Expression<Func<T, bool>> whereClause)
{
using (IDocumentSession session = GetRavenSession())
{
return session.Query<T>().Where(whereClause).ToList();
}
}
However, that only returns 128 documents. Why?
Note, here is the code that calls the above method:
RavenDataAccessComponent.GetObjectList<Ccm>(x => x.TimeStamp > lastReadTime);
If I add Take(n), then I can get as many documents as I like. For example, this returns 200 documents:
return session.Query<T>().Where(whereClause).Take(200).ToList();
Based on all of this, it would seem that the appropriate way to retrieve thousands of documents is to set MaxNumberOfRequestsPerSession and use Take() in the query. Is that right? If not, how should it be done?
For my app, I need to retrieve thousands of documents (that have very little data in them). We keep these documents in memory and used as the data source for charts.
** EDIT **
I tried using int.MaxValue in my Take():
return session.Query<T>().Where(whereClause).Take(int.MaxValue).ToList();
And that returns 1024. Argh. How do I get more than 1024?
** EDIT 2 - Sample document showing data **
{
"Header_ID": 3525880,
"Sub_ID": "120403261139",
"TimeStamp": "2012-04-05T15:14:13.9870000",
"Equipment_ID": "PBG11A-CCM",
"AverageAbsorber1": "284.451",
"AverageAbsorber2": "108.442",
"AverageAbsorber3": "886.523",
"AverageAbsorber4": "176.773"
}

It is worth noting that since version 2.5, RavenDB has an "unbounded results API" to allow streaming. The example from the docs shows how to use this:
var query = session.Query<User>("Users/ByActive").Where(x => x.Active);
using (var enumerator = session.Advanced.Stream(query))
{
while (enumerator.MoveNext())
{
User activeUser = enumerator.Current.Document;
}
}
There is support for standard RavenDB queries, Lucence queries and there is also async support.
The documentation can be found here. Ayende's introductory blog article can be found here.

The Take(n) function will only give you up to 1024 by default. However, you can change this default in Raven.Server.exe.config:
<add key="Raven/MaxPageSize" value="5000"/>
For more info, see: http://ravendb.net/docs/intro/safe-by-default

The Take(n) function will only give you up to 1024 by default. However, you can use it in pair with Skip(n) to get all
var points = new List<T>();
var nextGroupOfPoints = new List<T>();
const int ElementTakeCount = 1024;
int i = 0;
int skipResults = 0;
do
{
nextGroupOfPoints = session.Query<T>().Statistics(out stats).Where(whereClause).Skip(i * ElementTakeCount + skipResults).Take(ElementTakeCount).ToList();
i++;
skipResults += stats.SkippedResults;
points = points.Concat(nextGroupOfPoints).ToList();
}
while (nextGroupOfPoints.Count == ElementTakeCount);
return points;
RavenDB Paging

Number of request per session is a separate concept then number of documents retrieved per call. Sessions are short lived and are expected to have few calls issued over them.
If you are getting more then 10 of anything from the store (even less then default 128) for human consumption then something is wrong or your problem is requiring different thinking then truck load of documents coming from the data store.
RavenDB indexing is quite sophisticated. Good article about indexing here and facets here.
If you have need to perform data aggregation, create map/reduce index which results in aggregated data e.g.:
Index:
from post in docs.Posts
select new { post.Author, Count = 1 }
from result in results
group result by result.Author into g
select new
{
Author = g.Key,
Count = g.Sum(x=>x.Count)
}
Query:
session.Query<AuthorPostStats>("Posts/ByUser/Count")(x=>x.Author)();

You can also use a predefined index with the Stream method. You may use a Where clause on indexed fields.
var query = session.Query<User, MyUserIndex>();
var query = session.Query<User, MyUserIndex>().Where(x => !x.IsDeleted);
using (var enumerator = session.Advanced.Stream<User>(query))
{
while (enumerator.MoveNext())
{
var user = enumerator.Current.Document;
// do something
}
}
Example index:
public class MyUserIndex: AbstractIndexCreationTask<User>
{
public MyUserIndex()
{
this.Map = users =>
from u in users
select new
{
u.IsDeleted,
u.Username,
};
}
}
Documentation: What are indexes?
Session : Querying : How to stream query results?
Important note: the Stream method will NOT track objects. If you change objects obtained from this method, SaveChanges() will not be aware of any change.
Other note: you may get the following exception if you do not specify the index to use.
InvalidOperationException: StreamQuery does not support querying dynamic indexes. It is designed to be used with large data-sets and is unlikely to return all data-set after 15 sec of indexing, like Query() does.

Related

Transcoding HTTP to gRPC: Same endpoint with different parameters

I already have a working gRPC project working. I'm looking to build an API to be able to do some HTTP requests.
I have the following 2 types:
message FindRequest {
ModelType model_type = 1;
oneof by {
string id = 2;
string name = 3;
}
}
message GetAllRequest {
ModelType model_type = 1;
int32 page_size = 2;
oneof paging {
int32 page = 3;
bool skip_paging = 4;
}
}
And then, I would like to have those 2 endpoints:
// Get a data set by ID or name. Returns an empty data set if there is no such
// data set in the database. If there are multiple data sets with the same
// name in the database it returns the first of these data sets.
rpc Find(FindRequest) returns (DataSet){
option (google.api.http) = { get: "/datasets" };
}
// Get (a page of) all data sets of a given type. If no page size is given
// (page <= 0) it defaults to 100. An unset page (page <= 0) defaults to the
// first page.
rpc GetAll(GetAllRequest) returns (GetAllResponse){
option (google.api.http) = { get: "/datasets" };
}
It makes sense to me to have 2 different endpoints with the same name, but that differ with the parameters.
For instance, requesting /datasets?model-type=XXX should be mapped to the GetAll function, while requesting /datasets?model-type=XXX&name=YYY should be mapped to Find function.
However, it doesn't work, since the mapping fails I guess, so none of these endpoints returns me anything.
I think the solution to make the mapping working would be to force the parameter to be required, however, I am working with proto3, which has disallowed the required field.
So how could I be able to have 2 endpoints with the same name, but different parameters, with proto3 ?
I know that if I am using different endpoint names, it is working, for example for the findRequest, I could have the following endpoint : /findDatasets, but regarding the best practice about API naming convention, it is not advisable, or is it ?
The conventional way to solve this problem is to use different methods. My hunch is that it's an anti-pattern to try to differentiate using the fields in the request string.
service YourService {
rpc FindSomething(FindSomethingRequest) returns (FindSomethingResponse){
option (google.api.http) = { get: "/something/find" };
}
rpc ListSomething(ListSomethingRequest) returns (ListSomethingResponse){
option (google.api.http) = { get: "/something/list" };
}
}
message FindSomethingRequest {
ModelType model_type = 1;
string id = 2;
string name = 3;
}
message ListSomethingRequest {
int32 page_size = 2;
int32 page_token = 3;
}
message ListSomethingResponse {
repeated ModelType model_types = 1;
int32 page_size = 2;
int32 next_page_token = 3;
}
I'm unsure of your underlying thing structure but, I think it's better practice to model things with all possible properties and permit leaving some unset (e.g. either id or name or possibly both in FindSomethingRequest) rather than creating different message types for all possible queries. You model the thing not how you interact with it.
In your implementation (!) of FindSomething, you then deal with the permutations of how users of the message may construct the fields. Perhaps reporting an error "Either id or name is required`.
I think ListSomething's messages could be simpler too. You request a List of (ModelTypes) and give a page_size and an page_token (that could be ""). It returns a list of ModelTypes, the size of the page returned (possibly less than requested) and a next_page_token if there is more data, that you can use to make the next ListSomething request.

How to get last write date of a RavenDB collection/index

I want to detect when a collection or index was last modified or written to.
Note: This works on document level, but i need on collection/index level. How to get last write date of a RavenDB document via C#
RavenQueryStatistics stats;
using(var session = _documentStore.OpenSession()) {
session.Query<MyDocument>()
.Statistics(out stats)
.Take(0) // don't load any documents as we need only the stats
.ToArray(); // this is needed to trigger server side query execution
}
DateTime indexTimestamp = stats.IndexTimestamp;
string indexEtag = stats.IndexEtag;;
getting meta data only via RavenDB Http Api:
GET http://localhost:8080/indexes/dynamic/MyDocuments/?metadata-only=true
would return:
{ "Results":[],
"Includes":[],
"IsStale":false,
"IndexTimestamp":"2013-09-16T15:54:58.2465733Z",
"TotalResults":0,
"SkippedResults":0,
"IndexName":"Raven/DocumentsByEntityName",
"IndexEtag":"01000000-0000-0008-0000-000000000006",
"ResultEtag":"3B5CA9C6-8934-1999-45C2-66A9769444F0",
"Highlightings":{},
"NonAuthoritativeInformation":false,
"LastQueryTime":"2013-09-16T15:55:00.8397216Z",
"DurationMilliseconds":102 }
ResultEtag and IndexTimestamp change on every write

SQL to Magento model understanding

Understanding Magento Models by reference of SQL:
select * from user_devices where user_id = 1
select * from user_devices where device_id = 3
How could I perform the same using my magento models? getModel("module/userdevice")
Also, how can I find the number of rows for each query
Following questions have been answered in this thread.
How to perform a where clause ?
How to retrieve the size of the result set ?
How to retrieve the first item in the result set ?
How to paginate the result set ? (limit)
How to name the model ?
You are referring to Collections
Some references for you:
http://www.magentocommerce.com/knowledge-base/entry/magento-for-dev-part-5-magento-models-and-orm-basics
http://alanstorm.com/magento_collections
http://www.magentocommerce.com/wiki/1_-_installation_and_configuration/using_collections_in_magento
lib/varien/data/collection/db.php and lib/varien/data/collection.php
So, assuming your module is set up correctly, you would use a collection to retrieve multiple objects of your model type.
Syntax for this is:
$yourCollection = Mage::getModel('module/userdevice')->getCollection()
Magento has provided some great features for developers to use with collections. So your example above is very simple to achieve:
$yourCollection = Mage::getModel('module/userdevice')->getCollection()
->addFieldToFilter('user_id', 1)
->addFieldToFilter('device_id', 3);
You can get the number of objects returned:
$yourCollection->count() or simply count($yourCollection)
EDIT
To answer the question posed in the comment: "what If I do not require a collection but rather just a particular object"
This depends if you still require both conditions in the original question to be satisfied or if you know the id of the object you wish to load.
If you know the id of the object then simply:
Mage::getModel('module/userdevice')->load($objectId);
but if you wish to still load based on the two attributes:
user_id = 1
device_id = 3
then you would still use a collection but simply return the first object (assuming that only one object could only ever satisfy both conditions).
For reuse, wrap this logic in a method and place in your model:
public function loadByUserDevice($userId, $deviceId)
{
$collection = $this->getResourceCollection()
->addFieldToFilter('user_id', $userId)
->addFieldToFilter('device_id', $deviceId)
->setCurPage(1)
->setPageSize(1)
;
foreach ($collection as $obj) {
return $obj;
}
return false;
}
You would call this as follows:
$userId = 1;
$deviceId = 3;
Mage::getModel('module/userdevice')->loadByUserDevice($userId, $deviceId);
NOTE:
You could shorten the loadByUserDevice to the following, though you would not get the benefit of the false return value should no object be found:
public function loadByUserDevice($userId, $deviceId)
{
$collection = $this->getResourceCollection()
->addFieldToFilter('user_id', $userId)
->addFieldToFilter('device_id', $deviceId)
;
return $collection->getFirstItem();
}

RavenDB Paging Behaviour

I have the following test for skip take -
[Test]
public void RavenPagingBehaviour()
{
const int count = 2048;
var eventEntities = PopulateEvents(count);
PopulateEventsToRaven(eventEntities);
using (var session = Store.OpenSession(_testDataBase))
{
var queryable =
session.Query<EventEntity>().Customize(x => x.WaitForNonStaleResultsAsOfLastWrite()).Skip(0).Take(1024);
var entities = queryable.ToArray();
foreach (var eventEntity in entities)
{
eventEntity.Key = "Modified";
}
session.SaveChanges();
queryable = session.Query<EventEntity>().Customize(x => x.WaitForNonStaleResultsAsOfLastWrite()).Skip(0).Take(1024);
entities = queryable.ToArray();
foreach (var eventEntity in entities)
{
Assert.AreEqual(eventEntity.Key, "Modified");
}
}
}
PopulateEventsToRaven simply adds 2048 very simple documents to the database.
The first skip take combination gets the first 1024 doucuments modifies the documents and then commits changes.
The next skip take combination again wants to get the first 1024 documents but this time it gets the document number 1024 to 2048 and hence fails the test. Why is this , I would expect the first 1024 again?
Edit: I have varified that if I dont modify the documents the behaviour is fine.
The problem is that you don't specify an order by, and that means that RavenDB is free to choose with items to return, those aren't necessarily going to be the same items that it returned in the previous call.
Use an OrderBy and it will be consistent.

Having difficulty with specific Future<T> query and collections with nHibernate.

For the sake of example, I am removing non-queried and non-essential data just to figure out how to do the initial query here.
I have a model structure like this.
class Path {
Guid Id { get; protected set; }
IList<Step> Steps { get; set; }
void AddStep(Step entity) {
// write up bidirectional association
}
}
class Step {
Guid Id { get; protected set; }
Path Path { get; set; }
// other data irreleveent
}
Now assuming 50000 steps, each with 5000 steps... I do realize I don't want to return all of them at once. But putting a limit on my query fetch isn't my real problem.
Here is the exact query I am attempting to use. I am getting the exception..
NHibernate.QueryException : duplicate alias: lpStep
----> System.ArgumentException : An item with the same key has already been added.
I'm not entirely sure how to handle this scenario. if I use a flat out Fetch on the Path query, I get Select+N errors from the NHibernate Profiler.
I do have batching enabled - but as far as I am aware, that only really applies to inserts, not retrievals. But in any case I am getting back these errors and not sure how to handle it. Any ideas?
using (var Transaction = Session.BeginTransaction()) {
Path lpPath = null;
Step lpStep = null;
var lpPaths = Session.QueryOver<Path>(() => lpPath)
.Take(50)
.Future<Path>();
var lpSteps = Session.QueryOver<Step>(() => lpStep)
.JoinAlias(() => lpPath.Steps, () => lpStep)
.Where(o => o.Path.Id == lpPath.Id)
.Take(12)
.Future<Step>();
Transaction.Commit();
foreach (var path in lpPaths) {
Console.WriteLine("{0} fetched {1} Steps",
path.Id, path.Steps.Count);
}
}
I basically want to say ..
Select (50) Paths, also, as a separate select but part of the same trip, Select the first (12) Steps that belong the previously selected Paths.
But if I use a flat out join, I get 110 rows, whereas I expect to have 2 tables, 1 of 50 rows, 1 of 600 rows.
Can someone explain to me what I am doing wrong?
mind you, I can do some minor alterations and the query runs, but it isn't 'optimized'. I can get the data I want, but it takes multiple trips and lazy loading. I can optimize the actual Path selection easily enough but it is those blasted Steps. If I just take a restrictive where clause out of the lpSteps query, it just returns the first 12 steps, not returning 12 steps for each query done.
I've looked at some of the other stack overflow posts on Future<T> and found them to look a lot like this. So I don't understand why it isn't working. I suspect that what is going on is this..
lpPaths runs.
lpSteps tries to run, first one succeeds.
lpSteps then tries to run again, finds it cannot redefine lpPaths.
Apocolypse
I'm really hoping someone smarter than me can enlighten me on the absolute most optimal way to write this.
i cant really understand what your use case is. why do you only need the first 12 Steps of each Path? What about batches of Steps to process
IList<Guid> pathIds;
while ((pathIds = QueryOver.For<Path>()
.Where(...)
.Projection(path => path.Id)
.SetmaxResults(100)).Count > 0)
{
int batch = 0;
const int batchsize = 600;
IList<Step> steps;
while ((steps = Session.QueryOver<Step>()
.Where(step => step.Path.Id).In(pathIds)
.Where(step => step. ...)
.SetFirstResult(batch * batchsize)
.Take(batchsize)
.List<Step>()).Count > 0)
{
DoSomething(steps);
batch++;
}
}