Querying Raven with Where() only filters against the first 128 documents? - lucene

We're using Raven to validate logins so people can get into our site.
What we've found is that if you do this:
// Context is an IDocumentSession
Context.Query<UserModels>()
.SingleOrDefault(u => u.Email.ToLower() == email.ToLower());
The query only filters the first 128 documents in Raven. There are
several thousand in our database, so unless your email happens to be
among the first 128 returned, you're out of luck.
None of the Raven sample code, nor any sample code I've come across
on the net, loops with Skip() and Take() to iterate through the set.
Is this the desired behavior of Raven?
Is it the same behavior even if you use an advanced Lucene query? I.e., do advanced queries behave any differently?
Is the solution below appropriate? Looks a little ugly. :P
My solution is to loop through the set of all documents until I
encounter a non-null result, then break and return it.
public T SingleWithIndex(string indexName, Func<T, bool> where)
{
var pageIndex = 1;
const int pageSize = 1024;
RavenQueryStatistics stats;
var queryResults = Context.Query<T>(indexName)
.Statistics(out stats)
.Customize(x => x.WaitForNonStaleResults())
.Take(pageSize)
.Where(where).SingleOrDefault();
if (queryResults == null && stats.TotalResults > pageSize)
{
for (var i = 0; i < (stats.TotalResults / (pageIndex * pageSize)); i++)
{
queryResults = Context.Query<T>(indexName)
.Statistics(out stats)
.Customize(x => x.WaitForNonStaleResults())
.Skip(pageIndex * pageSize)
.Take(pageSize)
.Where(where).SingleOrDefault();
if (queryResults != null) break;
pageIndex++;
}
}
return queryResults;
}
EDIT:
Using the fix below is not passing query params to my RavenDB instance. Not sure why yet.
Context.Query<UserModels>()
.Where(u => u.Email == email)
.SingleOrDefault();
In the end I am using Advanced Lucene Syntax instead of linq queries and things are working as expected.

RavenDB does not understand SingleOrDefault, so it performs the query without the filter. Your condition is then evaluated on the client against the returned result set, and by default Raven only returns the first 128 documents.
Instead, you have to call
Context.Query<UserModels>()
.Where(u => u.Email == email)
.SingleOrDefault();
so the filtering is done by RavenDB/Lucene.
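The same client-side filtering happens when a predicate is passed as a Func<T, bool>, as the SingleWithIndex helper in the question does. The sketch below is not RavenDB code: it uses plain LINQ over an in-memory IQueryable, and the class and method names are made up for illustration. It only shows which Where() overload the compiler picks in each case.

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;

public static class PredicateBinding
{
    // Stand-in for a provider-backed query (hypothetical data).
    private static IQueryable<string> Emails() =>
        new[] { "a@example.com", "b@example.com" }.AsQueryable();

    // Expression<Func<...>> binds to Queryable.Where: the result is still an
    // IQueryable, so a query provider gets the chance to translate the filter.
    public static bool ExpressionStaysQueryable(Expression<Func<string, bool>> where)
    {
        object result = Emails().Where(where);
        return result is IQueryable<string>;
    }

    // Func<...> binds to Enumerable.Where: the result is a plain in-memory
    // iterator, so the filter only runs on data that was already fetched.
    public static bool FuncStaysQueryable(Func<string, bool> where)
    {
        object result = Emails().Where(where);
        return result is IQueryable<string>;
    }
}
```

With a real provider, the second form means the server never sees the condition, which matches the "only the first 128 documents" symptom described in the question.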

Related

Get and manipulate data from elasticsearch

I am new to Elasticsearch, so I need some help. Unfortunately, I didn't find the answer in other topics here on SO.
I inherited a .NET Core application and now need to implement some changes in it.
I already have a method that gets data from Elasticsearch, but once I have the data I am not sure how to change it and use it in the application.
To be precise, I need to parse first and last names and remove special characters, specifically Serbian Latin letters like "šđžčć". I have already written a method for this parsing, but I am not sure how to call it.
So my question is: can I do this, and if so, how?
What I have now is the following:
var result = await _elasticClient.SearchAsync<CachedUserEntity>(
s =>
s.Index(_aliasName)
.Query(q => andQuery));
CachedUserEntity contains, among others, the properties FirstName and LastName.
Inside result.Documents I get the FirstName and LastName data from Elasticsearch, but I am not sure how to access it in order to update it via the aforementioned NameParser.
Sorry if the question is too easy, not to say stupid :)
I wouldn't use updateByQuery here, for a few reasons. Instead I would scroll through the documents (I use MatchAll in my example; you obviously need to replace it with your own query) or, if you can't identify the documents to update in the query, update only the relevant documents inside the UpdateManyWithIndex/UpdateManyPartial function.
For performance, we have to update several documents at once, so we use the bulk/UpdateMany function.
You can use either solution: the classic update, or the second one (partial update) with an object containing only the targeted fields.
On the server side, both solutions have the same cost and performance.
var searchResponse = Client.Search<CachedUserEntity>(s => s
    .Query(q => q
        .MatchAll()
    )
    .Scroll("10s")
);
while (searchResponse.Documents.Any())
{
    // RemoveChar is a helper (not shown) that returns the documents with cleaned names
    List<CachedUserEntity> NewSearchResponse = RemoveChar(searchResponse);
    UpdateManyWithIndex<CachedUserEntity>(NewSearchResponse, _aliasName);
    // fetch the next batch; the type must match the one used in the initial search
    searchResponse = Client.Scroll<CachedUserEntity>("2h", searchResponse.ScrollId);
}
public void UpdateManyWithIndex<C>(List<C> obj, string index) where C : class {
var bulkResponse = Client.Bulk(b => b
.Index(index).Refresh(Elasticsearch.Net.Refresh.WaitFor) // explicitly provide index name
.UpdateMany<C>(obj, (bu, d) => bu.Doc(d)));
}
Or, using partial update object
Note: in this case the index is already set on my client (add .Index(...) if needed)
var searchResponse = Client.Search<CachedUserEntity>(s => s
    .Query(q => q
        .MatchAll()
    )
    .Scroll("2h")
);
while (searchResponse.Documents.Any())
{
    List<object> listPocoPartialObj = GetPocoPartialObjList(searchResponse.Documents.ToList());
    UpdateManyPartial(listPocoPartialObj);
    searchResponse = Client.Scroll<CachedUserEntity>("2h", searchResponse.ScrollId);
}
private List<object> GetPocoPartialObjList(List<CachedUserEntity> cachedList) {
    List<object> listPoco = new List<object>();
    // note: if you don't have eltCached.Id, take a look at result.source; comment if needed
    foreach (var eltCached in cachedList) {
        // anonymous object holding only the fields to update
        listPoco.Add(new { Id = eltCached.Id, FirstName = YOURFIELDWITHOUTSPECIALCHAR, LastName = YOURSECONDFIELDWITHOUTSPECIALCHAR });
    }
    return listPoco;
}
public bool UpdateManyPartial(List<object> partialObj)
{
var bulkResponse = Client.Bulk(b => b
.Refresh(Elasticsearch.Net.Refresh.WaitFor)
.UpdateMany(partialObj, (bu, d) => bu.Doc(d))
);
if (!bulkResponse.IsValid)
{
GetErrorMsgs(bulkResponse);
}
return (bulkResponse?.IsValid == true);
}
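The RemoveChar helper and the YOURFIELDWITHOUTSPECIALCHAR placeholders above are not shown, and the asker already has a NameParser. Purely as an illustration of what such a cleaning step could look like, here is a minimal sketch using Unicode decomposition. NameCleaner and StripDiacritics are hypothetical names, and mapping "đ" to "dj" is an assumption (some systems use plain "d" instead).

```csharp
using System.Globalization;
using System.Text;

public static class NameCleaner
{
    // Hypothetical helper: replaces Serbian Latin letters with ASCII look-alikes.
    public static string StripDiacritics(string input)
    {
        if (string.IsNullOrEmpty(input)) return input;

        var sb = new StringBuilder(input.Length);
        // FormD splits letters such as "š" into "s" + a combining caron.
        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            // "đ"/"Đ" have no Unicode decomposition, so map them explicitly
            // (assumption: "dj" transliteration).
            if (c == 'đ') { sb.Append("dj"); continue; }
            if (c == 'Đ') { sb.Append("Dj"); continue; }

            // Drop the combining marks left over from the decomposition.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

Inside the scroll loop, each document's FirstName and LastName would be passed through a function like this (or through the existing NameParser) before the batch is handed to the bulk update.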

Proper Way to Retrieve More than 128 Documents with RavenDB

I know variants of this question have been asked before (even by me), but I still don't understand a thing or two about this...
It was my understanding that one could retrieve more documents than the 128 default setting by doing this:
session.Advanced.MaxNumberOfRequestsPerSession = int.MaxValue;
And I've learned that a WHERE clause should be an ExpressionTree instead of a Func, so that it's treated as Queryable instead of Enumerable. So I thought this should work:
public static List<T> GetObjectList<T>(Expression<Func<T, bool>> whereClause)
{
using (IDocumentSession session = GetRavenSession())
{
return session.Query<T>().Where(whereClause).ToList();
}
}
However, that only returns 128 documents. Why?
Note, here is the code that calls the above method:
RavenDataAccessComponent.GetObjectList<Ccm>(x => x.TimeStamp > lastReadTime);
If I add Take(n), then I can get as many documents as I like. For example, this returns 200 documents:
return session.Query<T>().Where(whereClause).Take(200).ToList();
Based on all of this, it would seem that the appropriate way to retrieve thousands of documents is to set MaxNumberOfRequestsPerSession and use Take() in the query. Is that right? If not, how should it be done?
For my app, I need to retrieve thousands of documents (that have very little data in them). We keep these documents in memory and use them as the data source for charts.
** EDIT **
I tried using int.MaxValue in my Take():
return session.Query<T>().Where(whereClause).Take(int.MaxValue).ToList();
And that returns 1024. Argh. How do I get more than 1024?
** EDIT 2 - Sample document showing data **
{
"Header_ID": 3525880,
"Sub_ID": "120403261139",
"TimeStamp": "2012-04-05T15:14:13.9870000",
"Equipment_ID": "PBG11A-CCM",
"AverageAbsorber1": "284.451",
"AverageAbsorber2": "108.442",
"AverageAbsorber3": "886.523",
"AverageAbsorber4": "176.773"
}
It is worth noting that since version 2.5, RavenDB has an "unbounded results API" to allow streaming. The example from the docs shows how to use this:
var query = session.Query<User>("Users/ByActive").Where(x => x.Active);
using (var enumerator = session.Advanced.Stream(query))
{
while (enumerator.MoveNext())
{
User activeUser = enumerator.Current.Document;
}
}
There is support for standard RavenDB queries and Lucene queries, and there is also async support.
The documentation can be found here. Ayende's introductory blog article can be found here.
The Take(n) function will only give you up to 1024 by default. However, you can change this default in Raven.Server.exe.config:
<add key="Raven/MaxPageSize" value="5000"/>
For more info, see: http://ravendb.net/docs/intro/safe-by-default
The Take(n) function will only give you up to 1024 documents by default. However, you can pair it with Skip(n) to page through all of them:
var points = new List<T>();
var nextGroupOfPoints = new List<T>();
const int ElementTakeCount = 1024;
int i = 0;
int skipResults = 0;
RavenQueryStatistics stats; // filled in by Statistics(out stats) on each page
do
{
nextGroupOfPoints = session.Query<T>().Statistics(out stats).Where(whereClause).Skip(i * ElementTakeCount + skipResults).Take(ElementTakeCount).ToList();
i++;
skipResults += stats.SkippedResults;
points = points.Concat(nextGroupOfPoints).ToList();
}
while (nextGroupOfPoints.Count == ElementTakeCount);
return points;
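The loop above can be reduced to a provider-independent pattern: keep asking for the next page until a page comes back short. The following is a minimal sketch of that pattern only. PagedReader and the fetchPage delegate are hypothetical names, and unlike the RavenDB loop it does not account for stats.SkippedResults.

```csharp
using System;
using System.Collections.Generic;

public static class PagedReader
{
    // fetchPage(skip, take) is assumed to return at most `take` items,
    // starting at offset `skip` of the full result set.
    public static List<T> ReadAll<T>(Func<int, int, IList<T>> fetchPage, int pageSize)
    {
        if (pageSize <= 0) throw new ArgumentOutOfRangeException(nameof(pageSize));

        var all = new List<T>();
        int skip = 0;
        while (true)
        {
            IList<T> page = fetchPage(skip, pageSize);
            all.AddRange(page);

            // A short (or empty) page means there is nothing left to read.
            if (page.Count < pageSize) break;
            skip += page.Count;
        }
        return all;
    }
}
```

With RavenDB, fetchPage would wrap a query such as session.Query<T>().Where(whereClause).Skip(skip).Take(take).ToList(). Each page is a separate request, so the session's request limit still applies.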
RavenDB Paging
The number of requests per session is a separate concept from the number of documents retrieved per call. Sessions are short-lived and are expected to have few calls issued over them.
If you are getting more than 10 of anything from the store (even fewer than the default 128) for human consumption, then something is wrong, or your problem requires different thinking than a truckload of documents coming from the data store.
RavenDB indexing is quite sophisticated. Good article about indexing here and facets here.
If you need to perform data aggregation, create a map/reduce index that produces the aggregated data, e.g.:
Index:
from post in docs.Posts
select new { post.Author, Count = 1 }
from result in results
group result by result.Author into g
select new
{
Author = g.Key,
Count = g.Sum(x=>x.Count)
}
Query:
session.Query<AuthorPostStats>("Posts/ByUser/Count")(x=>x.Author)();
You can also use a predefined index with the Stream method. You may use a Where clause on indexed fields.
var query = session.Query<User, MyUserIndex>();
// or, with a Where clause on an indexed field:
// var query = session.Query<User, MyUserIndex>().Where(x => !x.IsDeleted);
using (var enumerator = session.Advanced.Stream<User>(query))
{
while (enumerator.MoveNext())
{
var user = enumerator.Current.Document;
// do something
}
}
Example index:
public class MyUserIndex: AbstractIndexCreationTask<User>
{
public MyUserIndex()
{
this.Map = users =>
from u in users
select new
{
u.IsDeleted,
u.Username,
};
}
}
Documentation: What are indexes?
Session : Querying : How to stream query results?
Important note: the Stream method will NOT track objects. If you change objects obtained from this method, SaveChanges() will not be aware of any change.
Other note: you may get the following exception if you do not specify the index to use.
InvalidOperationException: StreamQuery does not support querying dynamic indexes. It is designed to be used with large data-sets and is unlikely to return all data-set after 15 sec of indexing, like Query() does.

Best way to get Count for paging in ravenDB

I need to find the number of documents in the Raven database so that I can page through them properly. I had the following implementation:
public int Getcount<T>()
{
IQueryable<T> queryable = from p in _session.Query<T>().Customize(x =>x.WaitForNonStaleResultsAsOfLastWrite())
select p;
return queryable.Count();
}
But if the count is too large then it times out.
I tried the method suggested in FAQs -
public int GetCount<T>()
{
//IQueryable<T> queryable = from p in _session.Query<T>().Customize(x => x.WaitForNonStaleResultsAsOfLastWrite())
// select p;
//return queryable.Count();
RavenQueryStatistics stats;
var results = _session.Query<T>()
.Statistics(out stats);
return stats.TotalResults;
}
This always returns 0.
What am I doing wrong?
stats.TotalResults is 0 because the query was never executed. Try this instead:
var results = _session
.Query<T>()
.Statistics(out stats)
.Take(0)
.ToArray();
The strange syntax to get the statistics tripped me up as well. I can see why the query needs to be run in order to populate the statistic object but the syntax is a bit verbose imo.
I have written the following extension method for use in my unit tests. It helps keep the code terse.
Extension Method
public static int QuickCount<T>(this IRavenQueryable<T> results)
{
RavenQueryStatistics stats;
results.Statistics(out stats).Take(0).ToArray();
return stats.TotalResults;
}
Unit Test
...
db.Query<T>().QuickCount().ShouldBeGreaterThan(128);
...
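Once the total is known, turning it into paging values is plain integer arithmetic. A small sketch, with PageMath and its members as hypothetical names and page numbers assumed to start at 1:

```csharp
using System;

public static class PageMath
{
    // Number of pages needed to show totalResults items, pageSize per page.
    public static int PageCount(int totalResults, int pageSize)
    {
        if (totalResults < 0) throw new ArgumentOutOfRangeException(nameof(totalResults));
        if (pageSize <= 0) throw new ArgumentOutOfRangeException(nameof(pageSize));
        // Ceiling division; long arithmetic avoids overflow near int.MaxValue.
        return (int)(((long)totalResults + pageSize - 1) / pageSize);
    }

    // Value to pass to Skip() for a 1-based page number.
    public static int SkipFor(int pageNumber, int pageSize)
    {
        if (pageNumber < 1) throw new ArgumentOutOfRangeException(nameof(pageNumber));
        return checked((pageNumber - 1) * pageSize);
    }
}
```

For example, the result of QuickCount() could be passed as totalResults, and SkipFor(page, pageSize) used together with Take(pageSize) when loading a page.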

NHibernate - Linq query using COUNT(DISTINCT)

I'm trying to get a paged query to work properly using LINQ and NHibernate. On a single table, this works perfect, but when more than one table is joined, it's causing me havoc. Here is what I have so far.
public virtual PagedList<Provider> GetPagedProviders(int startIndex, int count, System.Linq.Expressions.Expression<Func<Record, bool>> predicate) {
var firstResult = startIndex == 1 ? 0 : (startIndex - 1) * count;
var query = (from p in session.Query<Record>()
.Where(predicate)
select p.Provider);
var rowCount = query.Select(x => x.Id).Distinct().Count();
var pageOfItems = query.Distinct().Skip(firstResult).Take(count).ToList<Provider>();
return new PagedList<Provider>(pageOfItems, startIndex, count, rowCount);
}
There are a couple of issues I'm having with this. First off, the rowCount variable is pulling back a COUNT(*) on a join that is one-to-many, so my count is way off from what it should be: I'm expecting 129, but getting back over 1K.
The second issue I'm having is with the results. If I'm on the first page (firstResult = 0), then I'm getting correct results. However, if firstResult is something other than 0, it produces completely different SQL. Below is the SQL generated by both scenarios; I've trimmed out some of the fat to make it a little more readable.
--firstResult = 0
select distinct TOP (30) provider1_.Id as Id12_, provider1_.TaxID as TaxID12_,
provider1_.Facility_Name as Facility5_12_, provider1_.Last_name as Last6_12_,
provider1_.First_name as First7_12_, provider1_.MI as MI12_
from PPORecords record0_ left outer join PPOProviders provider1_ on record0_.Provider_id=provider1_.Id,
PPOProviders provider2_ where record0_.Provider_id=provider2_.Id
and provider2_.TaxID='000000000'
--firstResult = 30
SELECT TOP (30) Id12_, TaxID12_, Facility5_12_, Last6_12_, First7_12_, MI12_
FROM (select distinct provider1_.Id as Id12_, provider1_.TaxID as TaxID12_,
provider1_.Facility_Name as Facility5_12_,
provider1_.Last_name as Last6_12_,
provider1_.First_name as First7_12_,
provider1_.MI as MI12_,
ROW_NUMBER() OVER(ORDER BY CURRENT_TIMESTAMP) as __hibernate_sort_row
from PPORecords record0_ left outer join PPOProviders provider1_ on record0_.Provider_id=provider1_.Id,
PPOProviders provider2_
where record0_.Provider_id=provider2_.Id and provider2_.TaxID='000000000') as query
WHERE query.__hibernate_sort_row > 30
ORDER BY query.__hibernate_sort_row
The problem with the second query is the "distinct" keyword is not present on the outer query, only the inner query. Any ideas how to correct this?
Thanks for any suggestions!
[UPDATE]
This is the code that builds the Linq query predicate.
private Expression<Func<Record, bool>> ParseQueryExpression(string Query) {
Expression<Func<Record, bool>> mExpression = x => true;
string[] splitQuery = Query.Split('|');
foreach (string query in splitQuery) {
if (string.IsNullOrEmpty(query))
continue;
int valStartIndex = query.IndexOf('(');
string variable = query.Substring(0, valStartIndex);
string value = query.Substring(valStartIndex + 1, query.IndexOf(')') - valStartIndex - 1);
switch (variable) {
case "tax":
mExpression = x => x.Provider.TaxID == value;
break;
case "net":
mExpression = Combine<Record>(mExpression, x => x.Network.Id == int.Parse(value));
break;
case "con":
mExpression = Combine<Record>(mExpression, x => x.Contract.Id == int.Parse(value));
break;
case "eff":
mExpression = Combine<Record>(mExpression, x => x.Effective_Date >= DateTime.Parse(value));
break;
case "trm":
mExpression = Combine<Record>(mExpression, x => x.Term_Date <= DateTime.Parse(value));
break;
case "pid":
mExpression = Combine<Record>(mExpression, x => x.Provider.Id == long.Parse(value));
break;
case "rid":
mExpression = Combine<Record>(mExpression, x => x.Rate.Id == int.Parse(value));
break;
}
}
return mExpression;
}
It seems as if the Linq provider has some "unexpected" behaviour there. I could reproduce the issue with the rowCount that you are describing and also encountered some issues when trying to fix it with a subquery.
If I understand your intentions correctly you basically want to have a paged list of Providers whose Records match certain criteria.
In that case I would suggest using a subquery. However, I tried implementing the subquery with the Query() method but it did not work. So I tried the QueryOver() method which worked flawlessly. In your case the desired queries would look like this:
Provider pAlias = null;
var query = session.QueryOver<Provider>(() => pAlias).WithSubquery
.WhereExists(QueryOver.Of<Record>()
.Where(predicate)
.And(r => r.Provider.Id == pAlias.Id)
.Select(r => r.Provider));
var rowCount = query.RowCount();
var pageOfItems = query.Skip(firstResult).Take(count).List<Provider>();
That way you do not have to struggle with the Distinct(). And as much as I personally like and use NHibernate.Linq, I suppose, if it does not work in a given situation, one should use something else that works.
Edit: Splitting the query into smaller units.
// the method ParseQueryExpression() does not need to be modified, the Expression should work like that
System.Linq.Expressions.Expression<Func<Record, bool>> predicate = x => x.Provider.TaxID == "000000000";
// Query to evaluate predicate
IQueryable<Record> queryId = session.Query<Record>().Where(predicate).Select(r => r.Provider);
// extracting the IDs to use in a subquery
List<int> idList = queryId.Select(p => p.Id).Distinct().ToList();
// total count of distinct Providers
int rowCount = idList.Count;
// new Query on Provider, using the distinct Ids
var query = session.Query<Provider>().Where(p => idList.Contains(p.Id));
// the List<Provider> to display on the page
var pageOfItems = query.Skip(firstResult).Take(count).ToList();
The problem here is that you have multiple queries and therefore multiple round trips to the db. Another problem appears when you examine the generated SQL: the subquery idList.Contains(p.Id) will generate an IN clause with a lot of parameters.
As you can see, these are NHibernate.Linq queries. That is because the new QueryOver feature is not a true Linq provider and does not work well with dynamic Linq expressions.
You could get around that limitation by using Detached QueryOvers. The drawback here is that you would have to modify your ParseQueryExpression() somehow, e.g. like that:
private QueryOver<Record> ParseQueryExpression(string Query)
{
Record rAlias = null;
var detachedQueryOver = QueryOver.Of<Record>(() => rAlias)
.JoinQueryOver(r => r.Provider)
.Where(() => rAlias.Provider.TaxID == "000000000")
.Select(x => x.Id);
// modify your method to match the return value
return detachedQueryOver;
}
// in your main method
var detachedQueryOver = QueryOver.Of<Record>()
.WithSubquery
.WhereProperty(r => r.Id)
.In(this.ParseQueryExpression(...));
var queryOverList = session.QueryOver<Provider>()
.WithSubquery
.WhereProperty(x => x.Id)
.In(detachedQueryOver.Select(r => r.Provider.Id));
int rowCount = queryOverList.RowCount();
var pageOfItems = queryOverList.Skip(firstResult).Take(count).List();
Well, I hope you are not too confused by that.

LINQ & Lambda Expressions equivalent of SQL In

Is there a lambda equivalent of IN? I would like to select all the funds with ids 5, 6 or 7. One way of writing it is:
List<FundHistoricalPrice> fundHistoricalPrices = lionContext.FundHistoricalPrices.Where(fhp => fhp.Fund.FundId == 5 || fhp.Fund.FundId == 6 || fhp.Fund.FundId == 7).ToList();
However, that quickly becomes unmanageable if I need it to match, say, 100 different fundIds. Can I do something like:
List<FundHistoricalPrice> fundHistoricalPrices = lionContext.FundHistoricalPrices.Where(fhp => fhp.Fund.FundId in (5, 6, 7)).ToList();
It's somewhere along these lines, but I can't quite agree with the approach you have taken. But this will do if you really want to do this:
.Where(fhp => new List<int>{5,6,7}.Contains( fhp.Fund.FundId )).ToList();
You may want to construct the List of ids before your LINQ query...
You can use the Contains() method on a collection to get the equivalent to in.
var fundIds = new [] { 5, 6, 7 };
var fundHistoricalPrices = lionContext.FundHistoricalPrices.Where(fhp => fundIds.Contains(fhp.Fund.FundId)).ToList();
You could write an extension method like this :
public static bool In<T>(this T source, params T[] list)
{
if(null==source) throw new ArgumentNullException("source");
return list.Contains(source);
}
Then :
List<FundHistoricalPrice> fundHistoricalPrices = lionContext.FundHistoricalPrices.Where(fhp => fhp.Fund.FundId.In(5,6,7)).ToList();
No, the only similar operator I'm aware of is the Contains() function.
Another way is to construct your query dynamically by using the predicate builder out of LINQKit: http://www.albahari.com/nutshell/predicatebuilder.aspx
Example
int[] fundIds = new int[] { 5,6,7};
var predicate = PredicateBuilder.False<FundHistoricalPrice>();
foreach (int id in fundIds)
{
int tmp = id;
predicate = predicate.Or (fhp => fhp.Fund.FundId == tmp);
}
var query = lionContext.FundHistoricalPrices.Where (predicate);
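If pulling in LINQKit is not an option, the same OR-chain can be built directly with System.Linq.Expressions. The following is a minimal sketch, not LINQKit's implementation. InPredicate and AnyOf are hypothetical names, and whether a given LINQ provider translates the resulting tree (including the constant false seed) is an assumption to verify against your provider.

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;

public static class InPredicate
{
    // Builds: x => false || selector(x) == values[0] || selector(x) == values[1] ...
    // TValue must support the == operator (int, string, ...), otherwise
    // Expression.Equal throws at build time.
    public static Expression<Func<T, bool>> AnyOf<T, TValue>(
        Expression<Func<T, TValue>> selector, params TValue[] values)
    {
        // Start from "false", like PredicateBuilder.False<T>() above.
        Expression body = Expression.Constant(false);
        foreach (TValue value in values)
        {
            // Reuse the selector's own body and parameter, so no parameter
            // rewriting is needed when the comparisons are combined.
            Expression comparison = Expression.Equal(
                selector.Body, Expression.Constant(value, typeof(TValue)));
            body = Expression.OrElse(body, comparison);
        }
        return Expression.Lambda<Func<T, bool>>(body, selector.Parameters);
    }
}
```

Usage would look like lionContext.FundHistoricalPrices.Where(InPredicate.AnyOf((FundHistoricalPrice fhp) => fhp.Fund.FundId, 5, 6, 7)). For a plain list of ids, the Contains() form shown earlier is usually simpler.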