Efficient Querying Data With Shared Conditions - sql

I have multiple sets of data which are sourced from an Entity Framework code-first context (SQL CE). There's a GUI which displays the number of records in each query set, and upon changing some set condition (e.g. Date), the sets all need to recalculate their "count" value.
While every set's query is slightly different, most of them share common conditions. A simple example:
RelevantCustomers = People.Where(P=>P.Transactions.Where(T=>T.Date>SelectedDate).Count()>0 && P.Type=="Customer")
RelevantSuppliers = People.Where(P=>P.Transactions.Where(T=>T.Date>SelectedDate).Count()>0 && P.Type=="Supplier")
The thing is, there are enough of these demanding queries that each time the user changes some condition (e.g. SelectedDate), it takes a really long time to recalculate the number of records in each set.
I realise that part of the reason for this is the need to query through, for example, the transactions each time to check what is really the same condition for both RelevantCustomers and RelevantSuppliers.
So my question is: given these sets share common "base conditions" which depend on the same sets of data, is there some more efficient way I could be calculating these sets?
I was thinking something with custom generic classes like this:
QueryGroup<People>(P=>P.Transactions.Where(T=>T.Date>SelectedDate).Count()>0)
{
    new Query<People>("Customers", P=>P.Type=="Customer"),
    new Query<People>("Suppliers", P=>P.Type=="Supplier")
}
I can structure this just fine, but what I'm finding is that it makes basically no difference to the efficiency as it still needs to repeat the "shared condition" for each set.
I've also tried pulling the base condition data out as a static "ToList()" first, but this causes issues when running into navigation entities (i.e. People.Addresses don't get loaded).
Is there some method I'm not aware of here in terms of efficiency?
Thanks in advance!

Give something like this a try: combine "similar" values into fewer queries, then separate the results afterwards. Also, use Any() rather than Count() for an existence check. Your updated attempt goes part-way, but will still result in two hits to the database.

When querying, it also helps to ensure that you are querying against indexed fields, and those indexes will be more efficient with numeric IDs rather than strings (e.g. a TypeID of 1 vs. 2 for "Customer" vs. "Supplier"). Normalized values are better for indexing and lead to smaller records, at the cost of slightly more verbose queries.
var types = new string[] {"Customer", "Supplier"};
var people = People.Where(p => types.Contains(p.Type)
    && p.Transactions.Any(t => t.Date > selectedDate)).ToList();
var relevantCustomers = people.Where(p => p.Type == "Customer").ToList();
var relevantSuppliers = people.Where(p => p.Type == "Supplier").ToList();
This results in just one hit to the database, and the Any() should be more performant than fetching an entire count. We split the customers and suppliers after the fact from the in-memory set. The caveat here is that any attempt to access details such as transactions etc. on those customers and suppliers would result in lazy-load hits, since we didn't eager-load them. If you need entire entity graphs then be sure to .Include() the relevant details, or be more selective about the data extracted from the first query, i.e. select anonymous types with the applicable details rather than just the entity.
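If counts are all the GUI needs, you can take that last suggestion further and project away from the entities entirely, which sidesteps the lazy-loading concern. A minimal sketch of that idea, reusing the types/selectedDate variables above (the Id and Name properties are assumptions for illustration):
// Sketch: pull only the columns the GUI needs in one round trip, then
// split/count in memory. No entities are materialized, so nothing lazy-loads.
var summaries = People
    .Where(p => types.Contains(p.Type)
             && p.Transactions.Any(t => t.Date > selectedDate))
    .Select(p => new { p.Id, p.Name, p.Type })  // Id and Name are assumed properties
    .ToList();

var relevantCustomerCount = summaries.Count(s => s.Type == "Customer");
var relevantSupplierCount = summaries.Count(s => s.Type == "Supplier");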

Related

How to handle concurrency in faunadb

I have some backend APIs which connect to FaunaDB; I'm able to do everything I need with the data, but I have some serious doubts about concurrent modifications (which maybe are not strictly related to FaunaDB only, but I'd like to understand how to deal with them using this technology).
One example above all: I want to create a new document (A) in a collection (X) which is linked (via reference or other fields) to other documents (B and C) in another collection (Y); in order to be linked, these documents (B and C) must satisfy a condition (e.g. field F = "V"). Once A has been created, B and C cannot be modified (or the condition will be invalidated!).
Of course the API to create the document A can run concurrently with the API used to modify documents B and C.
Here comes the doubt: what if, while creating the document A linked to document B and C, someone else changes field F of document B to something different from "V"?
I could end up with A linked to the wrong document, because neither API knows what the other one is doing.
Do I need to use the "Do" function in both APIs to create atomic transactions? So I can:
Check if B and C are valid and, if yes, create A in a single transaction
Check if B is linked to A and, if it isn't, modify it in a single transaction
Thanks everyone.
Fauna tries to present a consistent data view no matter where or when your clients need to ask. Isolation of transaction effects is what matters on short time scales (typically less than 10ms).
The Do function merely lets you combine multiple disparate FQL expressions into a single query. There is no conditional processing aspect to Do.
You can certainly check conditions before undertaking operations, and all Fauna queries are atomic transactions: all of the query succeeds or none of it does.
Arranging for intermediate query values in order to perform conditional logic does tend to make FQL queries more complex, but they are definitely possible:
The query for your first API might look something like this:
Let(
  {
    document_b: Get(<reference to B document>),
    document_c: Get(<reference to C document>),
    required_b: Select(["data", "required_field"], Var("document_b")),
    required_c: Select(["data", "other_required"], Var("document_c")),
    condition: And(Var("required_b"), Var("required_c")),
  },
  If(
    Var("condition"),
    Create(Collection("A"), { data: { <document A data> } }),
    Abort("Cannot create A because the conditions have not been met.")
  )
)
The Let function allows you to compose named values for intermediate expressions, which can read or write whatever they need, along with logical operations that determine which conditions need to be tested. The value composition is followed by an expression which, in this example, tests the conditions and only creates the document in the A collection when the conditions are met. When the conditions are not met, the transaction is aborted with an appropriate error message.
Let can nest Lets as much as required, provided the query fits within the maximum query length of 16MB, so you can embed a significant amount of logic in your queries. When the length of a single query is not sufficient, you can define UDFs, which let you store business logic that you can call any number of times.
See the E-commerce tutorial for a UDF that performs all of the processing required to submit an order: checking whether there is sufficient product in stock, deducting the requested quantities from inventory, setting backordered status, and creating the order.

What's more optimal: query chaining parent & child or selecting from parent's child objects

Curious which of these is better performance wise. If you have a User with many PlanDates, and you know your user is the user with an id of 60 in a variable current_user, is it better to do:
plan_dates = current_user.plan_dates.select { |pd| pd.attribute == test }
OR
plan_dates = PlanDate.joins(:user).where("plan_dates.attribute" => test).where("users.id" => 60)
Asking because I keep reading about the dangers of using select since it builds the entire object...
select is discouraged because, unlike the ActiveRelation methods (where, joins, etc.), it's a Ruby method from Enumerable, which means that the entire user.plan_dates relation must be loaded into memory before the selection can begin. That may not make a difference at a small scale, but if an average user has 3,000 plan dates, then you're in trouble!
So, your second option, which uses just one SQL query to get the result, is the better choice. However, you can also write it like so:
user.plan_dates.where(attribute: test)
This is still just one SQL query, but leverages the power of ActiveRelation for a more expressive result.
The second. The select has to compare objects on the code level, and the second is just a query.
In addition, the second expression may not be actually executed unless you use the variable, while the first will be always executed.

Best Way to Handle Multiple Selects Per Object

I was wondering for a good while now about the best method to retrieve data from multiple tables within my database. Sadly I couldn't find anything to actually help me understand what the right way to do so is.
Let's say I have a table of content pages named ContentPages. This table consists of the following fields:
PageID
PageTitle
PageContent
Now, in addition to the ContentPages table I also have the table ContentPagesTags, which is in charge of storing the tags that best describe what the page is about (just like on this very website, Stack Overflow, where you get to apply specific tags to your question). The ContentPagesTags table consists of the following fields:
PageID
TagID
The ContentPagesTags table is in charge of the relationship between the pages and the attached tags. The TagID field is taken from the last table, PageTags, which stores all of the possible tags that can be applied to a content page. That last table's structure looks like this:
TagID
TagTitle
That's pretty much it. Now, whenever I want to retrieve a ContentPage object which extracts the needed information from its data-table, I also want to load an array of all the related tags. By default, what I have been doing so far is running two separate queries in order to achieve my goal:
SELECT * FROM ContentPages
And then running the next query per each page before returning the ContentPage object:
SELECT * FROM ContentPagesTags WHERE PageID = @PageID
With PageID being the ID of the current page I am building an object of.
To sum it all up, I am running (at least) two queries per Content Page object in order to retrieve all of the needed information. In this particular example I only showed how I extract information from one extra table, but in time I find myself running multiple queries per object to get the required information (for instance, besides the page tags I might also want to select the page comments, the page drafts, and any additional information I consider needed). This eventually leads to issuing multiple commands, which makes my web application run much slower than expected.
I am pretty sure there's a better, faster and more efficient way to handle such tasks. I'd be glad to get a heads-up on this subject in order to improve my knowledge of different SQL selects and how to handle a massive amount of data requested by the user without resorting to multiple selects per object.
While waiting for clarification regarding the question I asked in a comment on the Original Question, I can at least say this:
From a pure "query performance" stand-point, this information is disparate: the extra tables (i.e. [Tags] and [Comments]) are not related to each other outside of the PageID relationship, and certainly not on a row-by-row basis. As such, there is nothing more you can do to gain efficiency at the query level beyond the following:
Make sure you have the PageID Foreign Keyed between all subtables back to the [ContentPages] table.
Make sure you have indexes on the PageID field in each of the subtables (non-clustered should be fine and a FILLFACTOR of 90 - 100, depending on usage pattern).
Make sure to perform index maintenance regularly. At least do REORGANIZE somewhat frequently and REBUILD when necessary.
Make sure that the tables are properly modeled: use appropriate datatypes (i.e. don't use INT to store values of 1 - 10 that will never, ever go above 10 or 50 at the worst, just because it is easier to code int at the app layer; don't use UNIQUEIDENTIFIER for any PKs or Clustered Indexes; etc.). Seriously: poor data modeling (datatypes as well as structure) can hurt overall performance of some, or even all, queries such that no amount of indexes, or any other features or tricks, will help.
If you have Enterprise Edition, consider enabling Row or Page Compression (is a feature of an index), especially for tables like [Comments] or even a large association table such as [ContentPagesTags] if it will be really large (in terms of row count) as compression allows for using smaller fixed-length datatypes to store values that are declared as larger types. Meaning: if you have an INT (4 bytes) or BIGINT (8 bytes) for TagID then it will be a short while before the IDENTITY value needs more than the 2 bytes used by the SMALLINT datatype, and a great while before you exceed the 4 bytes of the INT datatype, but SQL Server will store a value of 1005 in a 2 byte space as if it were a SMALLINT. Essentially, reducing row-size will fit more rows on each 8k datapage (which is how SQL Server reads and stores data) and hence reduces physical IO and makes better use of the data pages that are cached in memory.
If concurrency is (or becomes) an issue, check out Snapshot Isolation.
Now, from an application / process stand-point, you want to reduce the number of connections / calls. You could try to merge some of the info into CSV or XML fields to end up 1-to-1 with each PageID / PageContent row, but this is actually less efficient than just letting the RDBMS give you the data in its simplest form. It certainly can't be faster to take extra time to convert INT values into strings and merge them into a larger CSV or XML string, only to have the app layer spend even more time unpacking it.
Instead, you can both reduce the number of calls and not increase operational time / complexity by returning multiple result sets. For example:
CREATE PROCEDURE GetPageData
(
    @PageID INT
)
AS
SET NOCOUNT ON;

SELECT fields
FROM   [Page] pg
WHERE  pg.PageID = @PageID;

SELECT tag.TagID,
       tag.TagTitle
FROM   [PageTags] tag
INNER JOIN [ContentPagesTags] cpt
        ON cpt.TagID = tag.TagID
WHERE  cpt.PageID = @PageID;

SELECT cmt.CommentID,
       cmt.Comment,
       cmt.CommentCreatedOn
FROM   [PageComments] cmt
WHERE  cmt.PageID = @PageID
ORDER BY cmt.CommentCreatedOn ASC;
And cycle through the result sets via SqlDataReader.NextResult().
But, just for the record, I don't really think that calling three separate "get" stored procedures for this info would really increase the total time of the operation to fill out each page all that much. I would suggest doing some performance testing of both methods first to ensure that you aren't solving a problem that is more perception/theory than reality :-).
EDIT:
Notes:
Returning multiple result sets (not to be confused with the SQL Server M.A.R.S. feature, "Multiple Active Result Sets") is not specific to stored procedures. You could just as well issue multiple parameterized SELECT statements via a SqlCommand:
string _Query = @"
SELECT fields
FROM   [Page] pg
WHERE  pg.PageID = @PageID;

SELECT tag.TagID,
       tag.TagTitle
FROM   [PageTags] tag
INNER JOIN [ContentPagesTags] cpt
        ON cpt.TagID = tag.TagID
WHERE  cpt.PageID = @PageID;

--assume SELECT statement as shown above for [PageComments]";

SqlCommand _Command = new SqlCommand(_Query, _SomeSqlConnection);
_Command.CommandType = CommandType.Text;

SqlParameter _ParamPageID = new SqlParameter("@PageID", SqlDbType.Int);
_ParamPageID.Value = _PageID;
_Command.Parameters.Add(_ParamPageID);
If you are using SqlDataReader.Read() it would be something like the following. Please note that I am purposefully showing multiple ways of getting the values out of the _Reader just to show options. Also, the number of Tags and/or Comments is really irrelevant from a CPU perspective. More items does equate to more memory, but no way around that (unless you use AJAX to build the page one item at a time and never pull the full set into memory, but I highly doubt a single page would have enough tags and comments to even be noticeable).
// assume the code block above is right here
SqlDataReader _Reader;
_Reader = _Command.ExecuteReader();
if (_Reader.HasRows)
{
    // only 1 row returned from [ContentPages] table
    _Reader.Read();
    PageObject.Title = _Reader["PageTitle"].ToString();
    PageObject.Content = _Reader["PageContent"].ToString();
    PageObject.ModifiedOn = (DateTime)_Reader["LastModifiedDate"];

    _Reader.NextResult(); // move to next result set
    while (_Reader.Read()) // retrieve 0 - n rows
    {
        TagCollection.Add((int)_Reader["TagID"], _Reader["TagTitle"].ToString());
    }

    _Reader.NextResult(); // move to next result set
    while (_Reader.Read()) // retrieve 0 - n rows
    {
        CommentCollection.Add(new PageComment(
            _Reader.GetInt32(0),
            _Reader.GetString(1),
            _Reader.GetDateTime(2)
        ));
    }
}
else
{
    throw new Exception("PageID " + _PageID.ToString()
        + " does not exist. What were you thinking??!?");
}
You can also load multiple result sets into a DataSet and each result set will be its own DataTable. For details please see the MSDN page for DataSet.Load
// assume the code block 2 blocks above is right here
SqlDataReader _Reader;
_Reader = _Command.ExecuteReader();
DataSet _Results = new DataSet();
if (_Reader.HasRows)
{
    _Results.Load(_Reader, LoadOption.Upsert, "Content", "Tags", "Comments");
}
else
{
    throw new Exception("PageID " + _PageID.ToString()
        + " does not exist. What were you thinking??!?");
}
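Once loaded, each result set is available as a named DataTable, so consuming them might look roughly like the sketch below (column names are assumed to match the earlier SELECT lists):
// Sketch only: read the named DataTables filled by DataSet.Load above.
DataRow _Page = _Results.Tables["Content"].Rows[0]; // single row from [ContentPages]
PageObject.Title = _Page["PageTitle"].ToString();
PageObject.Content = _Page["PageContent"].ToString();

foreach (DataRow _Tag in _Results.Tables["Tags"].Rows) // 0 - n rows
{
    TagCollection.Add((int)_Tag["TagID"], _Tag["TagTitle"].ToString());
}

foreach (DataRow _Comment in _Results.Tables["Comments"].Rows) // 0 - n rows
{
    CommentCollection.Add(new PageComment(
        (int)_Comment["CommentID"],
        _Comment["Comment"].ToString(),
        (DateTime)_Comment["CommentCreatedOn"]
    ));
}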
I would suggest putting the tags in a delimited list. You can do this in SQL Server with the following query:
select cp.*,
       stuff((select ', ' + TagTitle
              from ContentPagesTags cpt join
                   PageTags pt
                   on cpt.TagId = pt.TagId
              where cpt.PageId = cp.PageId
              for xml path ('')
             ), 1, 2, '') as Tags
from ContentPages cp;
The syntax for the string concatenation is, shall I say, less than intuitive. Other databases have nice functions for this (such as listagg() and group_concat()). But, the performance is usually quite reasonable, particularly if you have the appropriate indexes (which include ContentPagesTags(PageId, TagId)).

NHibernate Filtering data best practices

I have the following situation:
A user logs in and opens an overview of all products, but can only see the products where a condition is applied; this condition is variable. Example: WHERE category in ('catA', 'CatB')
An administrator logs in and opens an overview of all products; he can see all products with no filter applied.
I need to make this as dynamic as possible. My data access classes use generics most of the time.
I've seen filters, but my conditions are very variable, so I don't see them as scalable enough.
We use NH filters for something similar, and it works fine. If no filter needs to be applied, you can omit setting any value for the filter. We use these filters for more basic stuff, filters that are applied nearly 100% of the time, e.g. deleted-object filters, client data segregation, etc. I'm not sure what scalability aspect you're looking for.
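For reference, enabling such a filter per session might look roughly like this (a sketch only: the filter, here called "categoryFilter" with a "categories" parameter, is a made-up name and would have to be declared in your mapping):
// Sketch: "categoryFilter" and its "categories" parameter are assumed to be
// declared in the NHibernate mapping. If the filter is never enabled, no
// condition is added to the generated SQL.
if (allowedCategories != null && allowedCategories.Count > 0)
{
    session.EnableFilter("categoryFilter")
           .SetParameterList("categories", allowedCategories);
}
// The administrator path simply never enables the filter.
The nice part is that, once enabled, the condition is applied to every query the session issues for that entity, so the product overview code doesn't need to know about it.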
For more high level and complex filtering, we use a custom class that manipulates a repository root. Something like the following:
public IQueryOver<TIn, TOut> Apply(IQueryOver<TIn, TOut> query)
{
    return query.Where(x => ... );
}
If you have an IoC container integrated with your NH usage, something like this can easily be generalized and plugged into your stack. We have repository manipulators that add simple where clauses, others that generate complex where clauses referencing domain logic, and others that join a second table and filter on that.
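A generalized shape for those manipulators might look something like the sketch below (the interface, class, and Product entity names are invented for illustration; it builds on the same IQueryOver signature as the snippet above):
// Rough sketch: a filter abstraction an IoC container could resolve and chain.
// IQueryFilter, ActiveProductsFilter and Product are invented names; requires
// NHibernate (IQueryOver) and System.Collections.Generic.
public interface IQueryFilter<TIn, TOut>
{
    IQueryOver<TIn, TOut> Apply(IQueryOver<TIn, TOut> query);
}

public class ActiveProductsFilter : IQueryFilter<Product, Product>
{
    public IQueryOver<Product, Product> Apply(IQueryOver<Product, Product> query)
    {
        return query.Where(p => p.IsDeleted == false);
    }
}

// In the repository root, resolve the registered filters and apply them in turn:
public IQueryOver<Product, Product> ApplyAll(
    IQueryOver<Product, Product> query,
    IEnumerable<IQueryFilter<Product, Product>> filters)
{
    foreach (var filter in filters)
    {
        query = filter.Apply(query);
    }
    return query;
}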
You could save all categories in a category list and pass this list to the query. If the list is not null and contains elements, you can work with the following:
List<string> allowedCategoriesList = new List<string>();
allowedCategoriesList.Add(...);
...
.WhereRestrictionOn(x => x.category).IsIn(allowedCategoriesList)
It's only important to skip this restriction if you do not have any filters (i.e. when you want to see all entries without filtering), as you will otherwise not see a single result.
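Putting that together, the conditional application might look roughly like this (a sketch; the Product entity and its lowercase category property are carried over from the snippet above):
// Sketch: only add the category restriction when there is something to filter on.
var query = session.QueryOver<Product>();

if (allowedCategoriesList != null && allowedCategoriesList.Count > 0)
{
    query = query.WhereRestrictionOn(x => x.category).IsIn(allowedCategoriesList);
}

var products = query.List();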

General strategy for complex multi-stage searches

I have an application which allows a certain entity to be searched based on several different criteria (somewhere in the order of 20 different methods in total). I want to be able to combine the results of several searches in order to produce a single result set.
For example:
results = (entities from search 1 AND entities from search 2) OR (entities from search 3)
Let us assume that the searches are complex enough in nature such that combining them into a single logical query is not possible (due to complex relationships that need to be queried, etc).
Let us also assume that the number of entities involved (likely) makes any sort of in-memory strategy infeasible.
My initial thoughts were something along the lines of:
1) Perform the searches separately, obtain a list of matching "entity ids" from each of them, and then perform a "root-level" search based upon these.
For example:
select * from entity e
where
(e.Id in (search 1 id list) AND e.Id in(search 2 id list))
OR e.Id in (search 3 id list)
2) Perform an outer query that selects the entity based upon the results returned by my (complex) subqueries.
For example:
select * from entity e
where (e.Id in (select e1.id from entity e1 where ...) AND e.Id in (select e2.id from entity e2 where...))
OR e.Id in (select e3.id from entity e3 where...)
Obviously these examples are drastically simplified for illustration purposes; the individual queries will be much more involved, and the combination of them will be arbitrary (I've just illustrated a representative example here).
I'd be very interested in hearing suggestions for how others have handled this situation. I'm certainly open to any possibilities that I haven't explored above.
For reference, this is a .NET application making use of an NHibernate ORM backed by a SQL Server 2008 R2 database.
I've already decided on using either HQL or native SQL for this, as neither ICriteria nor LINQ provides the flexibility needed for performing the individual queries or the combining operations required.
I've done this by keeping search performance counters in a table, basically monitoring the average percentage of rows that each search filters out and its run time.
I then create a performance figure based on
TotalNumberOfRowsToSearch * Percent_Not_Matched / RunTimeInSeconds
This figure directly corresponds to the number of rows per second the search can filter out. Averaged over thousands of runs, it is a rather good predictor.
I then run the queries in order, with the highest performance figure first.
If you're doing a logical AND on the total result, run each subsequent query only on the results of the previous query.
If you're doing a logical OR, run each subsequent query only on the results NOT IN the combined previous search results.
By doing it this way, your query will change based on indexes and types of data.
If you want a less dynamic solution, simply calculate performance figures for each part of the search and use the better performing ones first. Remember a query that runs in 55ms but matches 99% of the results is not as useful as one that runs in 1 second and matches 1% of the results, so be wary that results may go against your initial ideas.
Just look out for the divide by 0 error when calculating performance figures.
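As a rough illustration of the bookkeeping (the class and member names below are invented; the raw numbers would come from your counters table):
// Sketch: order the searches by their recorded performance figure, guarding
// against divide by zero. SearchStats and its members are invented names.
public class SearchStats
{
    public string SearchName { get; set; }
    public long TotalRowsSearched { get; set; }
    public double PercentNotMatched { get; set; } // 0.0 - 1.0
    public double RunTimeInSeconds { get; set; }

    // Rows per second the search can filter out.
    public double PerformanceFigure
    {
        get
        {
            return RunTimeInSeconds > 0
                ? TotalRowsSearched * PercentNotMatched / RunTimeInSeconds
                : 0;
        }
    }
}

// Run the best filter first, then feed its results into the next search:
// var ordered = allStats.OrderByDescending(s => s.PerformanceFigure).ToList();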
My approach using LINQ is to build a list of where expressions that make up the complex criteria, and to apply them together at the end.
Something like this:
List<Expression<Func<WorkItem, bool>>> whereExpressions = new List<Expression<Func<WorkItem, bool>>>();

if (!string.IsNullOrEmpty(searchMask))
{
    whereExpressions.Add(
        x =>
            (x.Name.ToLower().IndexOf(searchMask.ToLower()) > -1 ||
             x.Id.ToString().IndexOf(searchMask) > -1 ||
             (x.Description != null &&
              x.Description.ToLower().IndexOf(searchMask.ToLower()) > -1)));
}

whereExpressions.Add(x => (x.Status == status));
After building the expression list, you apply the expressions:
IQueryable<WorkItem> result = Session.Linq<WorkItem>();
foreach (Expression<Func<WorkItem, bool>> whereExpression in whereExpressions)
{
    result = result.Where(whereExpression);
}
You can also provide flexibility in the sorting method and allow paging:
IQueryable<WorkItem> items;
if (ascOrDesc == "asc")
{
    items = result.OrderBy(DecideSelector(indexer)).Skip(startPoint - 1).Take(numOfrows);
}
else
{
    items = result.OrderByDescending(DecideSelector(indexer)).Skip(startPoint - 1).Take(numOfrows);
}
Where DecideSelector is defined like this:
private Expression<Func<WorkItem, object>> DecideSelector(string fieldCode)
{
    switch (fieldCode)
    {
        case "Deadline":
            return item => item.Deadline;
        case "name":
            return item => item.Name;
        case "WiStatus":
            return item => item.Status;
        case "WiAssignTo":
            return item => item.AssignedUser;
        default:
            return item => item.Id;
    }
}
If you can use ICriteria, I'd recommend it. It can drastically cut down on the amount of code with complex searches. For example, the difference between using one search by itself, and using it as a subquery in your aggregate search, would be an added projection.
I haven't yet tried splitting complex searches up and running them separately. Combining the entire search into one call to the database, as per your second example, has worked for me so far. If I'm not getting a decent response time (minutes as opposed to seconds), the Database Engine Tuning Advisor has proved invaluable for suggesting indexes and statistics.