Best Way to Handle Multiple Selects Per Object - sql

I have been wondering for a good while now about the best method to retrieve data from multiple tables in my database. Sadly, I couldn't find anything that actually helped me understand the right way to do it.
Let's say I have a table of content pages named ContentPages. This table consists of the following fields:
PageID
PageTitle
PageContent
Now, in addition to the ContentPages table I have also got the table ContentPagesTags, which is in charge of storing the tags that best describe what the page is about (just like on this very website, Stack Overflow, where you get to apply specific tags to your question). The ContentPagesTags table consists of the following fields:
PageID
TagID
The ContentPagesTags table is in charge of the relationship between the pages and the attached tags. The TagID field comes from a third table, PageTags, which stores all of the possible tags that can be applied to a content page. That table's structure looks like this:
TagID
TagTitle
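Roughly, in DDL form (the exact column types don't matter here, these are just indicative), the three tables look like this:
CREATE TABLE ContentPages (
    PageID      INT NOT NULL PRIMARY KEY,
    PageTitle   NVARCHAR(200),
    PageContent NVARCHAR(MAX)
);

CREATE TABLE PageTags (
    TagID    INT NOT NULL PRIMARY KEY,
    TagTitle NVARCHAR(100)
);

CREATE TABLE ContentPagesTags (
    PageID INT NOT NULL REFERENCES ContentPages (PageID),
    TagID  INT NOT NULL REFERENCES PageTags (TagID),
    PRIMARY KEY (PageID, TagID)
);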
That's pretty much it. Now, whenever I want to retrieve a ContentPage object, which extracts the needed information from its data table, I also want to load an array of all the related tags. What I have been doing so far is running two separate queries to achieve this:
SELECT * FROM ContentPages
And then, for each page, running the following query before returning the ContentPage object:
SELECT * FROM ContentPagesTags WHERE PageID = @PageID
Here, PageID is the ID of the current page I am building an object of.
To sum it all up, I am running (at least) two queries for each ContentPage object in order to retrieve all of the needed information. In this particular example I only showed what I do to extract information from one extra table, but over time I find myself running multiple queries per object to get the required information (for instance, besides the page tags I may also want to select the page comments, the page drafts and other information I consider necessary). This eventually leaves me running many separate commands, which makes my web application run much slower than expected.
I am pretty sure there is a better, faster and more efficient way to handle such tasks. I would be glad to get a heads-up on this subject so I can improve my knowledge of the different kinds of SQL selects and learn how to handle massive amounts of data requested by the user without resorting to multiple selects per object.

While waiting for clarification regarding the question I asked in a comment on the Original Question, I can at least say this:
From a pure "query performance" stand-point, this information is disparate in terms of not being related to each other (i.e. [Tags] and [Comments] tables) outside of the PageID relationship, but certainly not in terms of a row-by-row basis between these extra tables. As such, there is nothing more to do that can gain efficiency at the query level outside of:
Make sure PageID is foreign-keyed from each subtable back to the [ContentPages] table (a T-SQL sketch of these points follows this list).
Make sure you have indexes on the PageID field in each of the subtables (non-clustered should be fine, with a FILLFACTOR of 90 to 100 depending on usage pattern).
Make sure to perform index maintenance regularly: at least REORGANIZE somewhat frequently and REBUILD when necessary.
Make sure that the tables are properly modeled: use appropriate datatypes (i.e. don't use INT to store values of 1 - 10 that will never, ever go above 10 or 50 at the worst, just because it is easier to code int at the app layer; don't use UNIQUEIDENTIFIER for any PKs or Clustered Indexes; etc.). Seriously: poor data modeling (datatypes as well as structure) can hurt overall performance of some, or even all, queries such that no amount of indexes, or any other features or tricks, will help.
If you have Enterprise Edition, consider enabling Row or Page Compression (it is a feature of an index), especially for tables like [Comments], or even a large association table such as [ContentPagesTags] if it will be really large in terms of row count. Compression allows smaller fixed-length storage for values that are declared as larger types. Meaning: if you have an INT (4 bytes) or BIGINT (8 bytes) for TagID, then it will be a short while before the IDENTITY value needs more than the 2 bytes used by the SMALLINT datatype, and a great while before you exceed the 4 bytes of the INT datatype, but SQL Server will store a value of 1005 in a 2-byte space as if it were a SMALLINT. Essentially, reducing row size fits more rows on each 8k data page (which is how SQL Server reads and stores data), and hence reduces physical IO and makes better use of the data pages that are cached in memory.
If concurrency is (or becomes) an issue, check out Snapshot Isolation.
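As a rough T-SQL sketch of the foreign key, indexing, maintenance and compression points above (object and index names are assumptions):
ALTER TABLE dbo.ContentPagesTags
    ADD CONSTRAINT FK_ContentPagesTags_ContentPages
    FOREIGN KEY (PageID) REFERENCES dbo.ContentPages (PageID);

CREATE NONCLUSTERED INDEX IX_ContentPagesTags_PageID
    ON dbo.ContentPagesTags (PageID)
    WITH (FILLFACTOR = 90);

-- periodic maintenance
ALTER INDEX IX_ContentPagesTags_PageID ON dbo.ContentPagesTags REORGANIZE;
-- or, when fragmentation is heavy:
ALTER INDEX IX_ContentPagesTags_PageID ON dbo.ContentPagesTags REBUILD;

-- optional Row / Page compression (see the edition caveat above)
ALTER INDEX IX_ContentPagesTags_PageID ON dbo.ContentPagesTags
    REBUILD WITH (DATA_COMPRESSION = PAGE);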
Now, from an application / process stand-point, you want to reduce the number of connections / calls. You could try to merge some of the info into CSV or XML fields to end up as 1-to-1 with each PageID / PageContent row, but this is actually less efficient than just letting the RDBMS give you the data in its simplest form. It certainly can't be faster to take extra time to convert INT values into strings to then merge into a larger CSV or XML string, only to have the app layer spend even more time unpackaging it.
Instead, you can both reduce the number of calls and not increase operational time / complexity by returning multiple result sets. For example:
CREATE PROCEDURE GetPageData
(
    @PageID INT
)
AS
SET NOCOUNT ON;

SELECT fields
FROM [Page] pg
WHERE pg.PageID = @PageID;

SELECT tag.TagID,
       tag.TagTitle
FROM [PageTags] tag
INNER JOIN [ContentPagesTags] cpt
        ON cpt.TagID = tag.TagID
WHERE cpt.PageID = @PageID;

SELECT cmt.CommentID,
       cmt.Comment,
       cmt.CommentCreatedOn
FROM [PageComments] cmt
WHERE cmt.PageID = @PageID
ORDER BY cmt.CommentCreatedOn ASC;
And cycle through the result sets via SqlDataReader.NextResult().
But, just for the record, I don't really think that calling three separate "get" stored procedures for this info would really increase the total time of the operation to fill out each page all that much. I would suggest doing some performance testing of both methods first to ensure that you aren't solving a problem that is more perception/theory than reality :-).
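One quick, hedged way to compare the SQL-side cost of the two approaches in SSMS (this ignores connection and round-trip overhead, which is exactly the part the multi-result-set version saves):
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

EXEC GetPageData @PageID = 1;   -- the combined procedure above
-- ...then run the three separate "get" procedures here and compare the output

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;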
EDIT:
Notes:
Returning multiple result sets (not to be confused with the SQL Server "Multiple Active Result Sets" / M.A.R.S. feature) is not specific to stored procedures. You could just as well issue multiple parameterized SELECT statements via the SqlCommand:
string _Query = #"
SELECT fields
FROM [Page] pg
WHERE pg.PageID = #PageID;
SELECT tag.TagID,
tag.TagTitle
FROM [PageTags] tag
INNER JOIN [ContentPagesTags] cpt
ON cpt.TagID = tag.TagID
WHERE cpt.PageID = #PageID;
--assume SELECT statement as shown above for [PageComments]";
SqlCommand _Command = new SqlCommand(_Query, _SomeSqlConnection);
_Command.CommandType = CommandType.Text;
SqlParameter _ParamPageID = new SqlParameter("#PageID", SqlDbType.Int);
_ParamPageID.Value = _PageID;
_Command.Parameters.Add(_ParamPageID);
If you are using SqlDataReader.Read(), it would be something like the following. Please note that I am purposefully showing multiple ways of getting the values out of the _Reader just to show the options. Also, the number of Tags and/or Comments is really irrelevant from a CPU perspective. More items do equate to more memory, but there is no way around that (unless you use AJAX to build the page one item at a time and never pull the full set into memory, but I highly doubt a single page would have enough tags and comments for it to even be noticeable).
// assume the code block above is right here
SqlDataReader _Reader;
_Reader = _Command.ExecuteReader();
if (_Reader.HasRows)
{
// only 1 row returned from [ContentPages] table
_Reader.Read();
PageObject.Title = _Reader["PageTitle"].ToString();
PageObject.Content = _Reader["PageContent"].ToString();
PageObject.ModifiedOn = (DateTime)_Reader["LastModifiedDate"];
_Reader.NextResult(); // move to next result set
while (_Reader.Read()) // retrieve 0 - n rows
{
TagCollection.Add((int)_Reader["TagID"], _Reader["TagTitle"].ToString());
}
_Reader.NextResult(); // move to next result set
while (_Reader.Read()) // retrieve 0 - n rows
{
CommentCollection.Add(new PageComment(
_Reader.GetInt32(0),
_Reader.GetString(1),
_Reader.GetDateTime(2)
));
}
}
else
{
throw new Exception("PageID " + _PageID.ToString()
+ " does not exist. What were you thinking??!?");
}
You can also load multiple result sets into a DataSet and each result set will be its own DataTable. For details please see the MSDN page for DataSet.Load
// assume the code block 2 blocks above is right here
SqlDataReader _Reader;
_Reader = _Command.ExecuteReader();
DataSet _Results = new DataSet();
if (_Reader.HasRows)
{
_Results.Load(_Reader, LoadOption.Upsert, "Content", "Tags", "Comments");
}
else
{
throw new Exception("PageID " + _PageID.ToString()
+ " does not exist. What were you thinking??!?");
}

I would suggest putting the tags in a delimited list. You can do this in SQL Server with the following query:
select cp.*,
stuff((select ', ' + TagTitle
from ContentPagesTags cpt join
PageTags pt
on cpt.TagId = pt.TagId
where cpt.PageId = cp.PageId
for xml path ('')
), 1, 2, '') as Tags
from ContentPages cp;
The syntax for the string concatenation is, shall I say, less than intuitive. Other databases have nice functions for this (such as listagg() and group_concat()). But, the performance is usually quite reasonable, particularly if you have the appropriate indexes (which include ContentPagesTags(PageId, TagId)).
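For what it's worth, if you are on SQL Server 2017 or later, the same delimited list can be built with STRING_AGG(), which reads more naturally (a sketch using the same tables):
SELECT cp.*,
       (SELECT STRING_AGG(pt.TagTitle, ', ')
        FROM ContentPagesTags cpt
        JOIN PageTags pt
          ON cpt.TagId = pt.TagId
        WHERE cpt.PageId = cp.PageId) AS Tags
FROM ContentPages cp;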

Related

How can I optimize my varchar(max) column?

I'm running SQL Server and I have a table of user profiles which contains columns for the user's personal info and a profile picture.
When setting up the project, I was given advice to store the profile image in the database. This seemed OK and worked fine, but now that I'm dealing with real data and querying more rows, the data is taking a lifetime to return.
To pull just the personal data, the query takes one second. To pull the images I'm looking at upwards of 6 seconds for 5 records.
The column is of type varchar(max) and the size of the data varies. Here's an example of the data lengths:
28171
4925543
144881
140455
25955
630515
439299
1700483
1089659
1412159
6003
4295935
Is there a way to optimize my fetching of this data? My query looks like this:
SELECT *
FROM userProfile
ORDER BY id
Indexing is out of the question due to the data lengths. Should I be looking at compressing the images before storing?
It takes time to return data. Five seconds seems a little long for a few megabytes, but there is overhead.
I would recommend compressing the data, if retrieval time is so important. You may be able to retrieve and uncompress the data faster than reading the uncompressed data.
That said, you should not be using SELECT * unless you specifically want the image column. If you are using it in places where the image is not necessary, dropping that column from the query can improve performance. If you want to make this safe for other users, you can add a view without the image column and encourage them to use the view.
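A minimal sketch of such a view; the non-image column names here are assumptions, since the question only names id:
CREATE VIEW dbo.userProfileWithoutImage
AS
SELECT id,
       displayName,   -- placeholders for the actual personal-info columns
       email
FROM dbo.userProfile;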
If it is still possible, take one step back: drop the idea of storing images in the table. Instead, save the path in the DB and the image in a folder. This is the most efficient approach.
SELECT *
FROM userProfile
ORDER BY id
Do not use SELECT *, and why are you using ORDER BY? You can order at the UI layer instead.
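A rough sketch of the path-based approach suggested above (the column names are assumptions; the question never names the image column):
-- add a path column and store the files on disk instead
ALTER TABLE userProfile ADD profileImagePath NVARCHAR(260) NULL;  -- e.g. N'images/profiles/42.jpg'

-- once the images have been written out to the folder, drop the blob column
ALTER TABLE userProfile DROP COLUMN profileImage;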

Efficient Querying Data With Shared Conditions

I have multiple sets of data which are sourced from an Entity Framework code-first context (SQL CE). There's a GUI which displays the number of records in each query set, and upon changing some set condition (e.g. Date), the sets all need to recalculate their "count" value.
While every set's query is slightly different in some way, most of them share common conditions in some way. A simple example:
RelevantCustomers = People.Where(P=>P.Transactions.Where(T=>T.Date>SelectedDate).Count>0 && P.Type=="Customer")
RelevantSuppliers = People.Where(P=>P.Transactions.Where(T=>T.Date>SelectedDate).Count>0 && P.Type=="Supplier")
So the thing is, there are enough of these demanding queries that each time the user changes some condition (e.g. SelectedDate), it takes a really long time to recalculate the number of records in each set.
I realise that part of the reason for this is the need to query through, for example, the transactions each time to check what is really the same condition for both RelevantCustomers and RelevantSuppliers.
So my question is: given that these sets share common "base conditions" which depend on the same sets of data, is there some more efficient way I could be calculating these sets?
I was thinking something with custom generic classes like this:
QueryGroup<People>(P=>P.Transactions.Where(T=>T.Date>SelectedDate).Count>0)
{
new Query<People>("Customers", P=>P.Type=="Customer"),
new Query<People>("Suppliers", P=>P.Type=="Supplier")
}
I can structure this just fine, but what I'm finding is that it makes basically no difference to the efficiency as it still needs to repeat the "shared condition" for each set.
I've also tried pulling the base condition data out as a static "ToList()" first, but this causes issues when running into navigation entities (i.e. People.Addresses don't get loaded).
Is there some method I'm not aware of here in terms of efficiency?
Thanks in advance!
Give something like this a try: combine "similar" values into fewer queries, then separate the results afterwards. Also, use Any() rather than Count() for existence checks. Your updated attempt goes part-way, but will still result in 2x hits to the database. Also, when querying it helps to ensure that you are querying against indexed fields, and those indexes will be more efficient with numeric IDs rather than strings (i.e. a TypeID of 1 vs. 2 for "Customer" vs. "Supplier"). Normalized values are better for indexing and lead to smaller records, at the cost of extra verbose queries (a rough sketch of that normalization follows the example below).
var types = new string[] {"Customer", "Supplier"};
var people = People.Where(p => types.Contains(p.Type)
&& p.Transactions.Any(t => t.Date > selectedDate)).ToList();
var relevantCustomers = people.Where(p => p.Type == "Customer").ToList();
var relevantSuppliers = people.Where(p => p.Type == "Supplier").ToList();
This results in just one hit to the database, and the Any() should be more performant than fetching an entire count. We split the customers and suppliers after the fact from the in-memory set. The caveat here is that any attempt to access details such as transactions etc. on the customers and suppliers would result in lazy-load hits, since we didn't eager-load them. If you need entire entity graphs then be sure to .Include() the relevant details, or be more selective about the data extracted from the first query, i.e. select anonymous types with the applicable details rather than just the entity.
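On the normalization point above, a rough sketch of what a TypeID lookup might look like in SQL (table and column names are assumptions; with EF code-first you would normally express this through the model and a migration):
CREATE TABLE PersonType
(
    TypeID   INT PRIMARY KEY,      -- e.g. 1 = Customer, 2 = Supplier
    TypeName NVARCHAR(50) NOT NULL
);

ALTER TABLE People
    ADD TypeID INT NULL
    CONSTRAINT FK_People_PersonType REFERENCES PersonType (TypeID);

CREATE INDEX IX_People_TypeID ON People (TypeID);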

Limiting Amount of Rows in List View

Simple enough question: how would I be able to limit the number of rows in a ListView to the number of items/rows that actually contain information? I know how to count the rows with items by using this code
ListView1.Items.Count
But how can I limit the number of rows the ListView has to the number of items?
Assuming a version of .NET that includes LINQ (3.5+), you get some really nice features which help a lot. These apply to any IEnumerable or IQueryable, including IList.
Dim MyList = [Some code to get hundreds of items]
Dim MyShortList = MyList.Take(30)
You can also implement paging very easily by using Skip...
Dim MyShortListPage2 = MyList.Skip(30).Take(30)
You should look into using Entity Framework or equivalents which implement IQueryable. These reduce memory overhead by using deferred execution (often described as lazy loading).
In short, if I were to do the following using the EF:
Dim Users = DBContext.Set(Of Users)
Users won't actually contain all users in the database, instead it will contain the query to get all users. If I did Users.First, it would run the query against SQL to get the first user. If instead, I did Users.Where(function(x) x.Age=30).First it would only query SQL for the first user whose age is 30.
Thus, IQueryable lets you pare down a dataset quickly using the power of the underlying provider instead of doing it in-memory.
If, instead, I did
Dim Users = DBContext.Set(Of Users).ToList()
It would retrieve all users from the database into memory; the ToList() is what forces this to happen. A List has to be stored in local memory; an IQueryable does not. It can run the appropriate query at the last possible moment and get as little as possible to satisfy your request.
Whether you want this to happen or not depends on the use case.

Optimize linq query for performance(now takes 12-15seconds, need 3seconds) [closed]

Edit: I need help rewriting these LINQ queries as SQL queries for the highest possible performance.
I have a table with about 10 million rows. It consists of 7 columns including Id: first the Id, then three keys to "TradeObjectModel", and finally three integers holding the different TradeObjectModels' rating values.
When a user, e.g. To1Id (the TradeObjectModel1 owner) with key 71, handles her ratings of other trade objects, only one row is sufficient for the current view.
My attempt to solve this looks like this (explanation below the code sample):
IEnumerable<RatingListTriangleModel> allTriangleModels1 =
this._ratingListTriangleRepository.All.Where(
ratingListRow =>
ratingListRow.To1Id == myTradeObject.TradeObjectId);
var filteredallTriangleModels1 = from row in allTriangleModels1
group row by row.To2Id into g
select g.First();
IEnumerable<RatingListTriangleModel> allTriangleModels2 =
this._ratingListTriangleRepository.All.Where(
ratingListRow =>
ratingListRow.To2Id == myTradeObject.TradeObjectId);
var filteredallTriangleModels2 = from row in allTriangleModels2
group row by row.To3Id into g
select g.First();
IEnumerable<RatingListTriangleModel> allTriangleModels3 =
this._ratingListTriangleRepository.All.Where(
ratingListRow =>
ratingListRow.To3Id == myTradeObject.TradeObjectId);
var filteredallTriangleModels3 = from row in allTriangleModels3
group row by row.To1Id into g
select g.First();
var fileredallTriangleModels =
filteredallTriangleModels1.Union(filteredallTriangleModels2).Union(filteredallTriangleModels3).ToList();
ViewBag.TriangleCount = fileredallTriangleModels.Count();
foreach (var ratingListRow in fileredallTriangleModels)
{
//Find which one is my ad and set me as setter and their object as receiver
if (ratingListRow.To1Id == customer.TradeObjectId)
{
var ri = new TriangleViewModel(
customer.TradeObjectId,
this._customerRepository.FindTradeObjectId(ratingListRow.To2Id),
ratingListRow,
this._tradeobjectRepository.Find(ratingListRow.To2Id));
model.Models3.Add(ri);
continue;
}
if (ratingListRow.To2Id == customer.TradeObjectId)
{
var ri = new TriangleViewModel(
customer.TradeObjectId,
this._customerRepository.FindTradeObjectId(ratingListRow.To3Id),
ratingListRow,
this._tradeobjectRepository.Find(ratingListRow.To3Id));
model.Models3.Add(ri);
continue;
}
if (ratingListRow.To3Id == customer.TradeObjectId)
{
var ri = new TriangleViewModel(
customer.TradeObjectId,
this._customerRepository.FindTradeObjectId(ratingListRow.To1Id),
ratingListRow,
this._tradeobjectRepository.Find(ratingListRow.To1Id));
model.Models3.Add(ri);
}
}
First I get all rows where my object is in the first column, group them to select only one, and then do the same with my object in the second and third columns. The ToList() here is just temporary, so I can run a stopwatch on them; each of these takes 0-12 seconds. Then I union them and run through them all to create the model used by the WebGrid in the front-end code.
This causes two problems: 1. It takes much too long. 2. If my trade object id is in more than one column, I will get more than one row, presenting more than one of the trade objects I'm interested in.
Try using Database Engine Tuning Advisor to see if adding/removing/changing the indices on your tables significantly improves the performance of the workload presented by your LINQ query.
Try capturing your queries with the profiler and isolate the top 3 longest running ones. Copy them into the SSMS and execute them. Look for the actual execution plan. Look for table scans or a huge discrepancy between estimated record counts and actual record counts. From here, either statistics are off, or you might consider placing an index to cover the query.
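For example, since each of the three LINQ queries filters on a different key column, indexes along these lines are the sort of thing the plan might point you toward (table and index names are assumptions based on the entity names in the question):
CREATE NONCLUSTERED INDEX IX_RatingListTriangle_To1Id ON RatingListTriangle (To1Id);
CREATE NONCLUSTERED INDEX IX_RatingListTriangle_To2Id ON RatingListTriangle (To2Id);
CREATE NONCLUSTERED INDEX IX_RatingListTriangle_To3Id ON RatingListTriangle (To3Id);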
From a performance perspective it may be better to use some sort of stored procedure. LINQ tends to slow things down a lot with these types of queries/lookups. However, if a stored procedure is not enough, there are a few things you can do.
First off, you may want to have a look at incremental search, which basically works by keeping a search going, or "deferring" execution, for a given amount of time. That should work on any IEnumerable.
The next thing I would recommend is to see whether you can integrate something similar to that incremental search (using its Add functions, etc.) into whatever fits your program.
Seriously though: I have seen HUGE improvements from stored procs in the past that definitely increase the speed (in my case one query was reduced almost 10-fold!).

Need for long and dynamic select query/view sqlite

I have a need to generate a long select query of potentially thousands of where conditions like (table1.a = ? OR table1.a = ? OR ...) AND (table2.b = ? OR table2.b = ? ...) AND....
I initially started building a class to make this more bearable, but have since stopped to wonder if this will work well. This query is going to be hammering a table of potentially 10s of millions of rows joined with 2 more tables with thousands of rows.
A number of concerns are stemming from this:
1.) I wanted to use these statements to generate a temp view so I could easily transfer over my existing code base. The point here is that I want to filter the data I have down for analysis, based on parameters selected in a GUI, so how poorly will a view do in this scenario?
2.) Can sqlite even parse a query with thousands of binds?
3.) Isn't there a framework that can make generating this query easier other than with string concatenation?
4.) Is the better solution to dump all of the WHERE variables into hash sets in memory and then just write a wrapper for my DB query object that calls next() until a row is encountered that satisfies all my conditions? My concern here is that the application generates graphs procedurally on scrolls, so waiting to draw while calling query.next() x 100,000 might cause an annoying delay. Ideally I don't want to wait more than 30 ms at a time for the next row that satisfies everything.
edit:
New issue: it came to my attention that sqlite3 is limited to 999 bind values (host parameters) at compile time.
So it seems as if the only way to accomplish what I had originally intended is to
1.) Generate the entire query via string concatenation (my biggest concern being that I don't know how slow parsing all that data inside sqlite3 will be)
or
2.) Do the blanket query method (select * from * where index > ? limit ?) and call next() until I hit valid data in my compiled code (including updating the index variable and re-querying repeatedly)
I did end up writing a wrapper around the QSqlQuery object that uses index > variable and LIMIT to allow "walking" the table.
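For reference, the query that wrapper pages through looks roughly like this (table and column names are assumptions; the remaining filtering happens in the compiled code, as described above):
SELECT t1.id, t1.a, t2.b, t3.c
FROM table1 t1
JOIN table2 t2 ON t2.table1_id = t1.id
JOIN table3 t3 ON t3.table1_id = t1.id
WHERE t1.id > ?       -- last id returned by the previous batch
ORDER BY t1.id
LIMIT ?;              -- batch size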
Consider dumping the joined results without filters (denormalized) into a flat file and indexing it with FastBit, a bitmap index engine.