I am using RavenDB version 888.
My client application inserts hundreds of thousands of documents into RavenDB, and that part works fine. After the insertion, the app queries some data out of a static index I predefined. I don't want stale results, so the app queries periodically and waits until the index is up to date.
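For reference, the wait loop looks roughly like this (a minimal sketch using the client's query statistics; InternalPageCountIndex and ReduceResult are defined further down, and the sleep interval is illustrative):
// requires: using Raven.Client; using Raven.Client.Linq;
RavenQueryStatistics stats;
do
{
    using (var session = documentStore.OpenSession())
    {
        var counts = session.Query<ReduceResult, InternalPageCountIndex>()
            .Statistics(out stats) // reports whether the index was stale for this query
            .ToList();
    }
    if (stats.IsStale)
        System.Threading.Thread.Sleep(5000); // back off, then query again
} while (stats.IsStale);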
Unfortunately, today I found my app hanging (more accurately, querying RavenDB again and again) because the server kept telling it that the index was still stale. This is strange because the last insertion finished long ago; theoretically, the server should have finished indexing by now.
I looked into the Management Studio and checked my simplest index, which does some counting on one of my document collections. Interestingly, the count given by the index is up to date (the same as the number shown in the 'Collections' tab of the Management Studio), but its status is 'stale', and its last update shows '6 hours ago'. Altogether, half of my indexes are stale like this, while the other half are fresh according to the studio.
I have no idea why RavenDB leaves them stale or what RavenDB is doing now; CPU usage is not high. How can I debug this scenario?
UPDATE:
I think I have spotted something that might help find the root cause. Comparing my non-stale indexes with the always-stale ones, it seems the reduce result matters: the stale indexes have a large Value property in their reduce results, while the up-to-date indexes have a small one.
public class ReduceResult
{
    public string ID { get; set; }
    public string Key { get; set; }
    public long Value { get; set; } // This field seems to matter
}
Here is one of my index definitions:
public class InternalPageCountIndex : AbstractIndexCreationTask<InternalPage, ReduceResult>
{
    public InternalPageCountIndex()
    {
        Map = posts => from post in posts
                       select new
                       {
                           Key = post.BatchID,
                           Value = 1
                       };

        Reduce = results => from result in results
                            group result by result.Key into g
                            select new
                            {
                                Key = g.Key,
                                Value = g.Sum(c => c.Value)
                            };
    }
}
Btw, the server log also looks interesting. This afternoon the server thought there was no work to do:
2012-04-07 16:36:44.6725,Raven.Database.Tasks.ReduceTask,Debug,Indexed 65 reduce keys in 00:00:03.5535907 with 493666 results for index SNRTotalByteSizeIndex,
2012-04-07 17:35:21.1888,Raven.Database.Indexing.WorkContext,Debug,"No work was found, workerWorkCounter: 5, for: ReducingExecuter, will wait for additional work",
2012-04-07 17:35:21.1888,Raven.Database.Indexing.WorkContext,Debug,"No work was found, workerWorkCounter: 5, for: IndexingExecuter, will wait for additional work",
2012-04-07 18:35:39.4759,Raven.Database.Indexing.WorkContext,Debug,"No work was found, workerWorkCounter: 5, for: ReducingExecuter, will wait for additional work",
2012-04-07 18:35:39.4759,Raven.Database.Indexing.WorkContext,Debug,"No work was found, workerWorkCounter: 5, for: IndexingExecuter, will wait for additional work",
2012-04-07 19:35:56.5994,Raven.Database.Indexing.WorkContext,Debug,"No work was found, workerWorkCounter: 5, for: ReducingExecuter, will wait for additional work",
2012-04-07 19:35:56.5994,Raven.Database.Indexing.WorkContext,Debug,"No work was found, workerWorkCounter: 5, for: IndexingExecuter, will wait for additional work",
2012-04-07 20:36:12.3345,Raven.Database.Indexing.WorkContext,Debug,"No work was found, workerWorkCounter: 5, for: ReducingExecuter, will wait for additional work",
2012-04-07 20:36:12.3345,Raven.Database.Indexing.WorkContext,Debug,"No work was found, workerWorkCounter: 5, for: IndexingExecuter, will wait for additional work",
But when I used the Management Studio tonight to see how many stale indexes there are, the server started doing map/reduce! There were no insertions between the afternoon and tonight, yet the server found something to do with the index right after the studio query...
2012-04-07 21:23:16.9357,Raven.Database.Tasks.ReduceTask,Debug,Read 1 reduce keys in 00:03:05.6481176 with 505406 results for index InternalPageCountIndex,
2012-04-07 21:23:19.5103,Raven.Database.Indexing.Index.Indexing,Debug,"Indexing on batches/1 result in index PageCountMissingDescriptionIndex gave document: __reduce_key I-: batches/1 Key IS: batches/1 Value IS: 505406 Value_Range IS: 505406",
2012-04-07 21:23:19.6797,Raven.Database.Indexing.Index.Indexing,Debug,Reduce resulted in 1 entries for InternalPageCountIndex for reduce keys: batches/1,
2012-04-07 21:23:19.6797,Raven.Database.Tasks.ReduceTask,Debug,Indexed 1 reduce keys in 00:00:02.7426449 with 505406 results for index InternalPageCountIndex,
And according to the studio, the server still tells me half of my indexes are stale :(
This is a bug.
Thanks to Ayende and the RavenDB team, this has been fixed since build 909. See http://groups.google.com/group/ravendb/browse_thread/thread/13b16ce3f562472d for more.
It looks like you are doing a reduce on BatchID, and there are a LOT of items with the same batch id.
That means that for each unique key, we have to load all of the mapped results for that unique key; in your case, there are 505,406 of them, so that is taking time.
Mapping the results took a relatively small amount of time. It is reducing them that takes time, because we need to reduce over a large amount of data.
Related
In our Raven-based application, we are starting to experience major performance issues as the master document grows, since it holds a lot of collections that keep growing. I am now planning a major data redesign that is likely to take months, and I want to be sure I'm on the right track before I start.
The current design looks like this:
Community
{
    id: ,
    name: ,
    // other properties
    members: [
        {
            id: ,
            name: ,
            date of birth:
            // etc.
        },
        {
            // another member; this list could potentially grow to hundreds of thousands
        }
    ],
    league: [
        {
            id: ,
            name: ,
            seasons: [
                { ... },
                {
                    id: ,
                    divisions: [
                        {
                            id: ,
                            name: ,
                            matches: [
                                {
                                    id: ,
                                    // match details
                                },
                                {
                                    // another match; there could be hundreds here in a big league
                                },
                                {}
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}
As we started hitting performance issues, we began using transformers to load only what is needed, but that didn't fully solve the problem, as some of our leagues are a couple of MBs on their own. The other issue is that we always need to check a member's admin/membership rights, so the members list is always needed.
I understand I could omit the members list completely using a transformer and use an index for membership checks, but the problem remains of what to do when a member is added: that list will need to be loaded, and with an upcoming project it could potentially grow to half a million people or more.
So my plan is to separate each entity into its own document. In the case of leagues, I will have a league document and a match document, with each match containing {leagueId, season number, division number, other match details}.
Each member will have their own document with a list of the community document IDs they're a member of. For illustration, the split might look like the sketch below.
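A sketch of the reshaped documents (the class and property names are just placeholders for what's described above):
// requires: using System.Collections.Generic;

// One document per league; matches move into their own documents.
public class League
{
    public string Id { get; set; }
    public string Name { get; set; }
}

// One small document per match, pointing back to its league.
public class Match
{
    public string Id { get; set; }
    public string LeagueId { get; set; }
    public int SeasonNumber { get; set; }
    public int DivisionNumber { get; set; }
    // other match details
}

// One document per member, listing the communities they belong to.
public class Member
{
    public string Id { get; set; }
    public string Name { get; set; }
    public List<string> CommunityIds { get; set; }
}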
I'm just a bit worried that with this design I'm missing the whole point of a document DB and we may as well have used SQL. Or do you think I'm on the right track with this approach?
I am using the TaxonomyManager GetTree(path) method to get a particular tree hierarchy in my C# code, but it is taking more than 3 minutes to get the result, and because of this the website takes a long time to load. How can I reduce the load time? Is there any other way I can get the hierarchy from Ektron?
We had this exact same issue and actually got on with Ektron support to help resolve it.
Now, whenever we work with taxonomies, we cache them on the server side to avoid the performance hit. Something like:
string cacheKey = "Something unique for your situation";
TaxonomyData taxonomyData;

if (Ektron.Cms.Context.HttpContext.Cache[cacheKey] == null)
{
    // Pull the taxonomy data and store it in the cache.
    taxonomyData = GetTaxonomyData(); // placeholder for your existing retrieval code
    Ektron.Cms.Context.HttpContext.Cache.Insert(cacheKey, taxonomyData);
}
else
{
    taxonomyData = (TaxonomyData)Ektron.Cms.Context.HttpContext.Cache[cacheKey];
}
Since you already know how to pull the TaxonomyData, I left that part out (GetTaxonomyData above is just a placeholder for it). We don't store the raw taxonomy data; instead, we store the object we create from the taxonomy data. Just cache whatever you need, and you can avoid the performance hit 'most' of the time.
I don't remember where the Ektron cache time is set, whether it's in the web.config or within the Workarea. Ektron support said to use the Ektron cache; I'm not sure how much of a difference it would make to use the regular cache instead.
Let's imagine we have this code:
while (true)
{
    foreach ($array as $row)
    {
        $needPersist = false; // reset for each row
        $item = $em->getRepository('reponame')->findOneBy(array('filter'));
        if (!$item)
        {
            $needPersist = true;
            $item = new Item();
        }
        $item->setItemName();
        // and so on ...
        if ($needPersist)
        {
            $em->persist($item); // persist() takes the entity
        }
    }
    $em->flush();
}
So, the point is that this code will be executed many times (for as long as the server doesn't die :) ), and we want to optimize it. Every time we:
Select the already existing entry from the repository.
If the entry does not exist, create it.
Set new (updated) values on it.
Apply the changes (flush).
So the question is: how do we avoid unnecessary queries and optimize the "check if entry exists" step? When there are 100-500 queries it's not so scary... But when it comes to 1,000-10,000 per while loop, it's too much.
PS: Each entry in the DB is unique by several columns (not only by ID).
Instead of fetching results one-by-one, load all results with one query.
E.g., let's say your filter wants to load IDs 1, 2 and 10. The QB would be something like:
$allResults = $em->getRepository('reponame')->createQueryBuilder('o')
    ->where('o.id IN (:ids)')->setParameter('ids', $ids)
    ->getQuery()
    ->getResult();
"foreach" of these results, do your job of updating them and flushing
While doing that loop, save ids of those fetched objects in new array
Compare that array with original one using array_diff. Now you have ids that were not fetched the first time
Rinse and repeat :)
And don't forget $em->clear() to free memory
While this can still be slow when working with 10.000 records (dunno, never tested), it will be much faster to have 2 big queries than 10.000 small ones.
Regardless of whether you need them to persist after the update, retrieving 10k+ entries from the database and hydrating them into PHP objects is going to need too much memory. In such cases, you'd better fall back to the Doctrine DBAL layer and fire pure SQL queries.
My SQL and Entity Framework knowledge is somewhat limited. In one Entity Framework (4) application, I notice it takes forever (about 2 minutes) to complete one of my method calls. The first queries do not take much time, but when I loop through the Entity Framework objects they return, it takes forever to complete the nested loops, even though I am only reading (not modifying) the data and there are only dozens of entries in each list and a few levels of looping.
I expect the example below could be rewritten as a fancier query that includes all of the filtering I am doing in my loops, using SQL constructs I don't really know how to use. If someone could show me the equivalent SQL expression, that would be extremely educational and would probably solve my current performance problem.
Moreover, since other parts of this and other applications I develop often need more complex computations on SQL data, I would also like to know a good way to retrieve data from Entity Framework into local memory objects that do not have huge delays when read. My LINQ-to-SQL project had a similar performance problem, which I solved by refactoring the whole application to load all the SQL data into parallel objects in RAM that I wrote myself. I wonder whether there is a better way: either telling Entity Framework to stop doing whatever high-latency communication it is doing, or loading the data into local RAM objects.
In the example below, the code gets a list of food menu items for a member (i.e., a person) on a certain date via a SQL query, and then uses further queries and loops to filter the menu items on two criteria: 1) the member has a rating of zero for any group ID the recipe belongs to (a many-to-many relationship), and 2) the member has a rating of zero for the recipe itself.
Example:
List<PFW_Member_MenuItem> MemberMenuForCookDate =
    (from item in _myPfwEntities.PFW_Member_MenuItem
     where item.MemberID == forMemberId
     where item.CookDate == onCookDate
     select item).ToList();

// Now filter out recipes in recipe groups rated zero by the member:
List<PFW_Member_Rating_RecipeGroup> ExcludedGroups =
    (from grpRating in _myPfwEntities.PFW_Member_Rating_RecipeGroup
     where grpRating.MemberID == forMemberId
     where grpRating.Rating == 0
     select grpRating).ToList();

foreach (PFW_Member_Rating_RecipeGroup grpToExclude in ExcludedGroups)
{
    List<PFW_Member_MenuItem> rcpsToRemove = new List<PFW_Member_MenuItem>();
    foreach (PFW_Member_MenuItem rcpOnMenu in MemberMenuForCookDate)
    {
        PFW_Recipe rcp = GetRecipeById(rcpOnMenu.RecipeID);
        foreach (PFW_RecipeGroup group in rcp.PFW_RecipeGroup)
        {
            if (group.RecipeGroupID == grpToExclude.RecipeGroupID)
            {
                rcpsToRemove.Add(rcpOnMenu);
                break;
            }
        }
    }
    foreach (PFW_Member_MenuItem rcpToRemove in rcpsToRemove)
        MemberMenuForCookDate.Remove(rcpToRemove);
}

// Now filter out recipes rated zero by the member:
List<PFW_Member_Rating_Recipe> ExcludedRecipes =
    (from rcpRating in _myPfwEntities.PFW_Member_Rating_Recipe
     where rcpRating.MemberID == forMemberId
     where rcpRating.Rating == 0
     select rcpRating).ToList();

foreach (PFW_Member_Rating_Recipe rcpToExclude in ExcludedRecipes)
{
    List<PFW_Member_MenuItem> rcpsToRemove = new List<PFW_Member_MenuItem>();
    foreach (PFW_Member_MenuItem rcpOnMenu in MemberMenuForCookDate)
    {
        if (rcpOnMenu.RecipeID == rcpToExclude.RecipeID)
            rcpsToRemove.Add(rcpOnMenu);
    }
    foreach (PFW_Member_MenuItem rcpToRemove in rcpsToRemove)
        MemberMenuForCookDate.Remove(rcpToRemove);
}
You can use EFProf (http://www.hibernatingrhinos.com/products/EFProf) to see exactly what EF is sending to SQL. It can also show you how many queries you are sending and how many of them are unique, and it provides some analysis of each query (e.g., whether it is unbounded). With Entity Framework's navigation properties, it is quite easy not to realize you are making a database request; when you touch a navigation property inside a loop, you run into the N+1 problem.
You could use the virtual keyword on the List properties of your model, if you are using Code First, to enable proxying; that way you will not have to get all the data back at once, only as you need it.
Also consider NoTracking for read-only data:
context.bigTable.MergeOption = MergeOption.NoTracking;
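As for the "fancier query" the question asks about: a single LINQ-to-Entities query along these lines pushes all the filtering to the database in one round trip (a sketch only; it assumes a PFW_Recipe navigation property on PFW_Member_MenuItem, and EF translates the Any calls into SQL EXISTS subqueries):
List<PFW_Member_MenuItem> memberMenuForCookDate =
    (from item in _myPfwEntities.PFW_Member_MenuItem
     where item.MemberID == forMemberId
           && item.CookDate == onCookDate
           // exclude items whose recipe is in any group the member rated zero
           && !item.PFW_Recipe.PFW_RecipeGroup.Any(g =>
                  _myPfwEntities.PFW_Member_Rating_RecipeGroup.Any(gr =>
                      gr.MemberID == forMemberId
                      && gr.Rating == 0
                      && gr.RecipeGroupID == g.RecipeGroupID))
           // exclude items whose recipe itself was rated zero by the member
           && !_myPfwEntities.PFW_Member_Rating_Recipe.Any(rr =>
                  rr.MemberID == forMemberId
                  && rr.Rating == 0
                  && rr.RecipeID == item.RecipeID)
     select item).ToList();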
I am storing documents, and each document has a collection of 'labels', like this. Labels are user-defined and could be any plain text.
{
"FeedOwner": "4ca44f7d-b3e0-4831-b0c7-59fd9e5bd30d",
"MessageBody": "blablabla",
"Labels": [
{
"IsUser": false,
"Text": "Mine"
},
{
"IsUser": false,
"Text": "Incomplete"
}
],
"CreationDate": "2012-04-30T15:35:20.8588704"
}
I need to allow the user to query for any combination of labels, i.e.
"Mine" OR "Incomplete"
"Incomplete" only
or
"Mine" AND NOT "Incomplete"
This results in Raven queries like this:
Query: (FeedOwner:25eb541c\-b04a\-4f08\-b468\-65714f259ac2) AND (Labels,Text:Mine) AND (Labels,Text:Incomplete)
I realise that Raven will generate a 'dynamic index' for queries it has not seen before. I can see that, with this approach, that could result in a lot of indexes.
What would be the best approach to achieving this functionality with Raven?
[EDIT]
This is my LINQ, but I get an error from Raven: "All is not supported"
var result = from candidateAnnouncement in session.Query<FeedAnnouncement>()
             where listOfRequiredLabels.All(
                 requiredLabel => candidateAnnouncement.Labels.Any(
                     candidateLabel => candidateLabel.Text == requiredLabel))
             select candidateAnnouncement;
[EDIT]
I had a similar question, and the answer to it resolved both questions: Raven query returns 0 results for collection contains
Please note that if FeedOwner is a unique property of your documents, the query doesn't make a lot of sense at all; in that case, you should do this on the client using standard LINQ to Objects.
Now, given that FeedOwner is not unique, your query is basically correct. However, depending on what you actually want to return, you may need to create a static index instead.
If you're using dynamically generated indexes, you will always get whole documents as the return value, and you can't get the particular labels that matched the query. If this is OK for you, just go with that approach and let the query optimizer do its job (build the index upfront only if you really have a lot of documents).
In the other case, where you want the actual labels as the query result, you have to build a simple map index upfront that covers the fields you want to query on; in your sample, that would be FeedOwner and the Text of every label. You will have to use FieldStorage.Yes on the fields you want to return from a query, so enable that on the Text property of your labels. However, there's no need to do so for the FeedOwner property, because it is part of the actual document, which Raven will give you as part of any query results. Please refer to Raven's documentation to see how to build a static index and use field storage.
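For example, a minimal sketch of such a static index (the names here are illustrative; the Stores entry is what enables returning Text from the index):
// requires: using System.Linq; using Raven.Abstractions.Indexing; using Raven.Client.Indexes;
public class Announcements_ByOwnerAndLabel
    : AbstractIndexCreationTask<FeedAnnouncement, Announcements_ByOwnerAndLabel.Result>
{
    public class Result
    {
        public string FeedOwner { get; set; }
        public string Text { get; set; }
    }

    public Announcements_ByOwnerAndLabel()
    {
        // Fan out: one index entry per label on each announcement.
        Map = announcements => from announcement in announcements
                               from label in announcement.Labels
                               select new
                               {
                                   announcement.FeedOwner,
                                   label.Text
                               };

        // Store the label text in the index so a query can project it back out.
        Stores.Add(x => x.Text, FieldStorage.Yes);
    }
}
Querying this index and adding .AsProjection<Announcements_ByOwnerAndLabel.Result>() should then return the stored label fields rather than the full documents.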