I'm using SSRS 2008 and wanted some advice on the best practices for handling multiple result sets.
I have a stored procedure which returns 4 result sets of related data, however each result set has a different number of records returned. Of course in SSRS only the first result set is processed, so I'm left with 2 options, as far as I can tell:
Create 4 different stored procedures which return the 4 different data sets
Insert all 4 result sets into a temporary table and return the results from that table.
The problem with the first option is that all 4 results are derived from the same basic data (gathered into a temp table) and then joined/grouped with other tables/data. So splitting them out into separate stored procedures seems like it would put more stress on the DB than a single sproc.
The problem with the second method is that I would have to include the same dataset into SSRS 4 times, pulling different pieces of the result set each time and filtering out the nulls on the correct columns.
For instance, let's say I have 4 result sets that return 4 columns each and 4 records each. The first 4 columns and the first 4 records are related to the first result set (the rest of the columns are null). The second result set only populates columns 5-8 and records 5-8, etc.
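For illustration, a minimal sketch (hypothetical column and temp table names, showing two of the four blocks) of what the single NULL-padded result set from the second option might look like:
SELECT t1.Col1, t1.Col2, t1.Col3, t1.Col4,
       NULL AS Col5, NULL AS Col6, NULL AS Col7, NULL AS Col8
FROM #ResultSet1 AS t1
UNION ALL
SELECT NULL, NULL, NULL, NULL,
       t2.Col5, t2.Col6, t2.Col7, t2.Col8
FROM #ResultSet2 AS t2;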
The question is, which way is more efficient? Multiple stored procedures or 1 stored procedure used multiple times in SSRS? Thanks!
Sounds like the data retrieval is the more expensive operation, so you probably would want to do that only once. (Although I'm not sure you would "stress" SQL Server, if that's the data source: the basic data and joined/grouped data would likely be in memory at that point, so additional physical reads probably would not happen if different stored procedures re-read the data.)
Consider an Entity Framework LINQ-to-SQL query with several entities INCLUDEd: the shape of a result set is very wide, with a column for each field of each entity. But this avoids a trip to the database for each included entity.
Unless your report has hundreds of pages or is otherwise rendering- or processing-intensive, I think option #2 probably is more efficient.
Are you able to look at the execution statistics in the SSRS execution log? I always use the ExecutionLog3 view and compare TimeDataRetrieval against TimeProcessing and TimeRendering.
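For example, something along these lines against the report server catalog (assuming the default ReportServer database name) breaks the time down per execution:
SELECT ItemPath, TimeStart,
       TimeDataRetrieval, TimeProcessing, TimeRendering
FROM ReportServer.dbo.ExecutionLog3
ORDER BY TimeStart DESC;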
Edit
I know including just links is frowned upon here, but I don't want to include possibly copyrighted stuff that I didn't create.
Please see the section Examining Queries Sent to the Database here http://www.asp.net/web-forms/tutorials/continuing-with-ef/maximizing-performance-with-the-entity-framework-in-an-asp-net-web-application for an example of a LINQ-to-SQL query along the lines of ...Departments.Include("Person").Include("Courses"). Note especially the picture of the Text Visualizer, which shows the query constructed: it is similar to your option #2 in that multiple entities are included in the same result set.
Problem:
We use Entity Framework (6.21) as our ORM.
Our database is Azure SQL Database.
Because some of the parameterized queries (frequently used in our app) are slow on some of the inputs (one input runs 60 seconds, another runs 0.4 seconds), we started investigating those queries using Query Store and the Query Store explorer in SQL Server Management Studio (SSMS -> Object Explorer -> Query Store).
We found out that Query Store stores two identical queries (same SQL text but different params - the params are not even stored) as different queries (with different query_id). By different query I mean a different row in the table sys.query_store_query.
I checked this by looking into QueryStore tables:
SELECT
    qStore.query_id,
    qStore.query_text_id,
    queryTextStore.query_sql_text,
    ROW_NUMBER() OVER (PARTITION BY query_sql_text ORDER BY query_sql_text ASC) AS rn
FROM
    sys.query_store_query qStore
INNER JOIN
    sys.query_store_query_text queryTextStore
        ON qStore.query_text_id = queryTextStore.query_text_id
I am not able to compare the plans of those queries easily in SSMS, because each query has its own associated plan.
Expected behaviour:
I would assume that each subsequent run of the same query with different parameters would result in either:
1/ re-use of the existing plan
or
2/ creation of another plan based on the passed parameter values...
Example:
The query would look like this (in reality the queries are much more complex, as they are generated by Entity Framework):
SELECT * FROM tbl WHERE a = @__plinq__
and its two subsequent runs (with different params) would result in two rows in sys.query_store_query.
Question:
How can I make Azure SQL Database save queries with the same text as the same query? Or am I missing something, or is this expected behaviour?
Or, more generally, how do you tune database queries when they are generated by Entity Framework?
How does SQL Server Query Store decide whether two queries are the same or different?
Edit1: Update
Based on @PeterB's comment (Adding a query hint when calling Table-Valued Function) we were able to solve our problem with slow queries on some param values (we added the "recompile" hint to the problematic queries).
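For reference, at the T-SQL level the hint takes this form (shown on the simplified query from the example above; the plan is recompiled for the actual parameter values on each run):
SELECT * FROM tbl WHERE a = @__plinq__ OPTION (RECOMPILE);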
Based on @GrantFritchey's hint I checked context_settings, but there are still multiple rows in sys.query_store_query which have the same query_sql_text and the same context_settings_id but different query_id.
So we still wonder: how does SQL Server Query Store decide whether two queries are the same or different?
As for the different query entries, the key that Query Store uses for a query consists of:
query_text_id,
context_settings_id,
object_id,
batch_sql_handle,
query_parameterization_type
If any of these is different for a query it will generate a new entry in the query table. Note that batch_sql_handle is only populated for queries referencing temp tables.
So you can check which of these values is different for the queries that you listed.
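For example (building on the asker's query above), pulling out all five key fields makes it easy to spot which one differs:
SELECT
    qStore.query_id,
    queryTextStore.query_sql_text,
    qStore.query_text_id,
    qStore.context_settings_id,
    qStore.object_id,
    qStore.batch_sql_handle,
    qStore.query_parameterization_type
FROM
    sys.query_store_query qStore
INNER JOIN
    sys.query_store_query_text queryTextStore
        ON qStore.query_text_id = queryTextStore.query_text_id
ORDER BY queryTextStore.query_sql_text;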
Currently there are no settings that control the way Query Store aggregates queries. The only way to make it treat them as the same is to change your workload so that the fields listed above match. Alternatively (and probably the better approach), you can write your own reporting queries that aggregate queries and their statistics according to your needs.
Short Intro:
When it is required to have a dozen nested calculating queries, is it more optimal to
A) Perform each operation separately (saving into a table for each result and then reading that table for the next query)
B) Have a large set of nested selects
Full Description:
I am trying to calculate some advanced forecasts from a series of input tables in SQL.
I am building around a dozen 'modules' that are separated into their own schema, and each module typically includes 4-10 input tables and 6-10 calculation steps. All outputs from each module are dumped into the same output table once completed.
Query results range from 7k to 200k rows.
A single schema's/module's tables might look like this:
Input Table 1
Input Table 2
Input Table 3
Input Table 4
Calculation Query 1 Result Table
Calculation Query 2 Result Table
Calculation Query 3 Result Table
Calculation Query 4 Result Table
Calculation Query 5 Result Table
Calculation Query 6 Result Table
Final Output
Each calculation query uses the results of the previous (for the most part). The final output is the result of the final calculation query. Calculations are not very complex: partitioned max, basic formula (+,-,*,/) or SUM etcetera. Normally only 1-3 of these per calculation step and always on the same column.
The main reason this is split into multiple calculation queries (instead of one super-formula) is because each calculation joins the outputs in a different way and uses different input tables; also because some are based on previous row results. (Such as max partitions or Lag)
My requirements are as follows:
A procedure that calculates final output from step 1 and merges into Final Output.
A procedure that calculates up to the selected calculation query and merges into its respective results table (and stops). Consider this the 'overriding final'.
I DON'T need to store the calculation results of intermediate queries - only the final output, or the 'overriding final' if selected.
My Problem:
I am trying to optimise the entire process - at this point it looks like it will take around 10-15 seconds. I want it to be 1 second - however I appreciate this is probably not possible.
What I have tried:
Firstly, I created a single procedure for each calculation query that merged the results into its respective output table. Using this method, each calculation query must read from the database and then merge into its output.
I tried temp tables; however, I don't see why this would be optimal, because I already have existing tables for the calculation steps - which are indexed with the next step in mind.
I then made an assumption that it would be faster to simply nest all the queries into one super-procedure, or maybe even have a sequence of table functions.
My Question:
However, I ran into a thought that I could not find an answer for, which is the following:
Inserting results into a table on every calculation step might slow the process (especially as the tables are indexed on 2-4 columns), but at least the data will be indexed for the next step.
Nesting selects would save the effort of inserting data, but those results wouldn't be indexed? Right? Or wrong?
Are select results intelligently indexed? And given my scenario, what advice would you give on how to approach this? Maybe I am missing something really simple.
Additional Info:
Most of my larger query results (150-200K) have 4 columns that need to be indexed.
All of my tables only have one column that needs calculating - the rest are indexed.
For Example:
ForecastID, Group, Year, Type, Sub-Type, Value
So I have to index Group, Year, Type and Sub-Type to join multiple input tables, and then calculate on the Value column.
I am telling you this in case having index-heavy tables influences your advice - I won't ask for help on optimizing indexes here due to the overwhelming quantity of advice already available, and because it's a different question!
Query optimization is often more art than science; there are few hard and fast rules because there are so many possible influences on the outcome. With that big caveat out of the way, time to hit the high points.
Indexes' effect on loading tables - Indexes have a similar performance impact on inserts as triggers. Unless you have a filtered index, each insert will have to update every index on the table, so with three indexes you are looking at quadrupling the number of writes per insert. At one read per insert and a small table size of 200k (very doable for a table scan), with three indexes you are probably outside the butter zone for the cost vs. benefit of having those indexes on your work tables.
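If you do trim them, one pattern (a sketch with hypothetical object names, reusing the column layout from the question) is to load the work table bare and build the single index the next step needs only after the load:
-- Load with no nonclustered indexes in place
-- (drop them first if they persist between runs)
INSERT INTO calc.Step1Result ([Group], [Year], [Type], [Sub-Type], [Value])
SELECT [Group], [Year], [Type], [Sub-Type], SUM([Value])
FROM calc.InputTable1
GROUP BY [Group], [Year], [Type], [Sub-Type];

-- One index, built once, for the next step's join
CREATE INDEX IX_Step1Result
    ON calc.Step1Result ([Group], [Year], [Type], [Sub-Type]);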
Nesting results - Like CTEs, nested results work best when the entire result set can fit in memory. When part is in memory and part is on disk, it will generally perform worse than a similarly sized temp table without an index. At 5 or so columns for 200k rows with smallish datatypes and a modern server, you should be OK performance-wise with nesting queries, so long as you're only doing one result set at a time. Once again this varies based on your setup; if you are strapped for RAM, drop them into a temp table.
Joins - Another possible good reason to use temp tables/nested queries is to avoid excessively large joins. Logically, the first step in a join is a full Cartesian product between the tables, which is then filtered based on the ON and WHERE clauses. The join process is heavily optimized in all RDBMSs, so most of the time you are not aware of how much heavy lifting is occurring behind the scenes; however, when tables reach large sizes this can be a major performance pain point. So instead you select the subset of data you require from both tables, and join the two much smaller sets. Once again, the butter zone between subsets and full-table joins depends on a number of factors, so you'll have to play around with your queries to find where it is for your situation.
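A sketch of that idea (hypothetical object names): reduce each side to the rows and columns you need first, then join the two small sets:
-- Pull only the needed subset of each large table into a temp table
SELECT ID, [Value] INTO #LeftSubset  FROM dbo.LargeTableA WHERE [Year] = 2020;
SELECT ID, [Value] INTO #RightSubset FROM dbo.LargeTableB WHERE [Year] = 2020;

-- Join the two much smaller sets instead of the full tables
SELECT l.ID, l.[Value] + r.[Value] AS Total
FROM #LeftSubset AS l
INNER JOIN #RightSubset AS r
    ON r.ID = l.ID;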
Unfortunately I can't really give specific advice without some sample inputs and outputs and/or an execution plan, but I hope this is some food for thought. Good luck.
It sounds like your datasets from the subqueries are more than a few thousand rows, so I would start off with approach A, persist some of these intermediate result sets to #temptables, check the execution plan for scans on these tables, and index the #temptables if needed.
If you want to use approach B, or mix A and B, I suggest CTEs instead of nested queries where possible. They are more readable, and it is easier to switch to #temptables when you are testing/designing the query.
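For example, a minimal sketch (hypothetical object names, reusing the column layout from the question) of two calculation steps chained as CTEs; each CTE is easy to swap for a #temptable while testing:
WITH Step1 AS (
    -- partitioned max over the first input
    SELECT [Group], [Year], MAX([Value]) AS MaxValue
    FROM calc.InputTable1
    GROUP BY [Group], [Year]
),
Step2 AS (
    -- basic formula joining the previous step to the next input
    SELECT s.[Group], s.[Year], s.MaxValue * i.[Value] AS CalcValue
    FROM Step1 AS s
    INNER JOIN calc.InputTable2 AS i
        ON i.[Group] = s.[Group] AND i.[Year] = s.[Year]
)
SELECT [Group], [Year], CalcValue
FROM Step2;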
Possible Duplicate:
select * vs select column
I was just having a discussion with one of my colleagues about SQL Server performance when specifying the query in a stored procedure.
So I want to know which one is preferred over the other, and what's the concrete reason behind that.
Suppose we have a table called
Employees(EmpName, EmpAddress)
and we want to select all the records from the table. So we can write the query in two ways:
Select * from Employees
Select EmpName, EmpAddress from Employees
So I would like to know: is there any specific difference or performance issue between the above queries, or are they just equal to the SQL Server engine?
UPDATE:
Let's say the table schema won't change anymore, so there is no concern about future maintenance.
Performance-wise, let's say the usage is very, very high, i.e. millions of hits per second on the database server. I want a clear and precise performance rating of both approaches.
No indexing is done on the table.
The specific difference would show its ugly head if you add a column to the table.
Suddenly, the query you expected to return two columns now returns three. If you coded specifically for the two columns, the rest of your code is now broken.
Performance-wise, there shouldn't be a difference.
I always take the approach that being as specific as possible is the best when dealing with databases. If the table has two columns and you only need those two columns, be specific. Specify those two columns. It'll save you headaches in the future.
I am an avid advocate of the "be as specific as possible" rule, too. Not following it will hurt you in the long run. However, your question seems to be coming from a different angle, so let me attempt to answer it.
When you submit a query to SQL Server it goes through several stages:
1. transmitting the query string over the network
2. parsing the query string, producing a parse tree
3. linking the referenced objects in the parse tree to existing objects
4. optimizing based on statistics and row count/size estimates
5. executing
6. transmitting the result data over the network
Let's look at each one:
1. The * query is a few bytes shorter, so this step will be faster.
2. The * query contains fewer "tokens", so this should(!) be faster.
3. During linking, the list of columns needs to be pulled and compared to the query string. Here the "*" gets resolved to the actual column references. Without access to the code it is impossible to say which version takes fewer cycles; however, the amount of data accessed is about the same, so this should be similar.
4.-6. In these stages there is no difference between the two example queries, as they will both be compiled to the same execution plan.
Taking all this into account, you will probably save a few nanoseconds by using the * notation. However, your example is very simplistic. In a more complex example, it is possible that specifying a subset of columns of a table in a multi-table join will lead to a different plan than using a *. If that happens, we can be pretty certain that the explicit query will be faster.
The above comparison also assumes that the SQL Server process is running alone on a single processor and that no other queries are submitted at the same time. If the process has to yield during compilation, those extra cycles will be far more than the ones we are trying to save.
So, the amount of saving we are talking about is minute compared to the actual execution time, and it should not be used as an excuse for a "bad" coding practice.
I hope this answers your question.
You should always reference columns explicitly. This way, if the table structure changes (and such changes are made in an intelligent, backward-compatible way), your queries will continue to work and can be modified over time.
Also, unless you actually need all of the columns from the table (not typical), using SELECT * is bringing more data to your application than is necessary, and potentially forcing a clustered index scan instead of what might have been satisfied by a narrower covering index.
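To illustrate that last point (a hypothetical nonclustered index; the difference only shows up when such an index exists):
CREATE INDEX IX_Employees_EmpName ON dbo.Employees (EmpName);

SELECT EmpName FROM dbo.Employees;  -- can be satisfied by scanning the narrow index
SELECT * FROM dbo.Employees;        -- must scan the clustered index (or heap)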
Bad habits to kick : using SELECT * / omitting the column list
Performance-wise there is no difference between those 2, I think. But the two are used in different cases, which may be the difference.
Consider a slightly larger table. If your table (Employees) contains 10 columns, then the 1st query returns all of the information in the table, while with the 2nd query you specify which columns you need. So when you need all of the information about employees, no. 1 is the better one, rather than spelling out all of the column names.
Of course, when you need to ALTER the table, those 2 are no longer equal.
I have 2 different databases. Upon changing something in the big one (which I don't have access to), I get some rows imported into my database, into a similarly HUGE table. I have a job checking for records in this table, and if there are any, it executes a stored procedure, processes the data and deletes it from the table.
Performance. (Huge amount of data.) I would like to know the fastest way to tell whether something has changed, using, let's say, 2 imported rows with 100 columns each. I don't have FKs; I don't need them. Chances are that even though I have records in my table, nothing has actually changed.
Also, let's say something actually has changed. Is it possible, for example, to check only for changes inside datetime columns?
Thanks
You can always use update triggers - these will give you access to two logical tables, inserted and deleted. You can compare the values in these and base your action on the results.
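For example, a minimal sketch (all object names hypothetical) of an update trigger that takes action only when a datetime column differs between deleted (the before image) and inserted (the after image):
CREATE TRIGGER trg_ImportRows_Update
ON dbo.ImportRows
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Fire the action only if the datetime column actually changed
    IF EXISTS (SELECT 1
               FROM inserted AS i
               INNER JOIN deleted AS d
                   ON d.ID = i.ID
               WHERE i.LastModified <> d.LastModified)
    BEGIN
        EXEC dbo.ProcessChangedRows;  -- hypothetical processing procedure
    END
END;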
I need to store a large table (several million rows) that contains a large number of user-defined fields (not known at compile time, but probably around 20 to 40 custom fields). It is very important (performance-wise) for me to be able to query the data based on those custom fields, i.e. "Select the rows where this attribute has that value, that attribute is that value, etc.". Each query has some 20 to 30 WHERE clauses.
My ideas so far:
Change the database schema every time a new user field is implemented. Keep each user-defined field as a column in the table, and add and maintain indexes on each custom-created column. How to properly build those indexes is a big problem, as I don't know which attributes (columns) will be used in the WHERE clauses.
Store the custom fields in an XML-type column. As I understand it, from SQL 2005 onwards I can query inside XML-type columns. Not so sure about performance though.
Entity-Attribute-Value (EAV). This is what I am using now, but it's a pain.
Any suggestions?
Edit:
Some clarifications on my requirements: I have a table of 40-50 million rows of (say) ID numbers and various attributes associated with those IDs.
Let's say 20 million of them have "CustomAttribute1" equal to 2, 5 million have "CustomAttribute2" equal to 'Yes' and 3 million have "CustomAttribute20" equal to 'No'.
I need a FAST method of returning all IDs where:
1. CustomAttribute1 = 2
2. CustomAttribute2 = 'Yes'
3. CustomAttribute4 = null
4. CustomAttribute20 != 'No'
etc...
We have this implemented as EAV: the select query is a nightmare to implement and maintain, it takes a long time to return results, and most annoyingly the DB balloons to huge sizes even for small amounts of data, which is weird since EAV essentially normalizes the data - but I assume all the indexes take up a bunch of space.
It seems like you've listed your available options. EAV can be a pain for querying (and slow, depending on how many criteria you want to search on simultaneously), but it tends to be the most "sane" and RDBMS-agnostic implementation.
Modifying the schema is a no-no...obviously it can be done, but such a practice is abhorrent. I do not approve.
The XML option is a solution, and SQL Server can query inside the structure. I'm not certain about other RDBMS's, and you don't list which one you're using in the post or the tags.
If you're going to be querying on many attributes (say, 20+) simultaneously, then I would probably recommend the XML solution just to limit the number of joins you'll have to make. Aside from that, I would stick with EAV.
You could represent all of the user-defined fields with an XML column, e.g.
<UDF>
  <Field Name="ConferenceAddress" DBType="NVarChar" Size="255">Some Address</Field>
  <Field Name="ConferenceCity" DBType="NVarChar" Size="255">Some City</Field>
  ...etc
</UDF>
I am not sure what the performance impact of doing this would be, however it is definitely the prettiest way of handling UDFs in a database, in my opinion.
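For what it's worth, querying a single field back out of such a column might look like this (hypothetical table and column names, using the xml type's value() method):
SELECT t.ID
FROM dbo.Conferences AS t
WHERE t.UDF.value('(/UDF/Field[@Name="ConferenceCity"])[1]',
                  'nvarchar(255)') = N'Some City';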
Then what I would do is put a trigger on the table so that when the column is updated it recreates a view for the table, which pulls the XML values out as columns on the view. Lock the view etc. during its recreation to prevent access errors on the application side.
Then I would create stored procedures for updating the XML, so that they work for any XML column following your user-defined-field XML format, e.g. Insert/Update/Remove/Get:
GetUDFFieldValue
AddUDFField
UpdateUDFField
DeleteUDFField
--Shared Parameters
TableName
ColumnName
(e.g. use dynamic SQL to get the XML from table X by column name X, to make it universal/generic across all of your UDF fields)
Here is an article on XML performance optimization for SQL Server 2005 (I'm not seeing an equivalent for newer versions):
http://technet.microsoft.com/en-us/library/ms345118(v=sql.90).aspx
Lastly:
Are you sure you even need an RDBMS? NoSQL is a better fit for user-generated fields; I might even consider using both NoSQL and SQL Server.